
GPT-SoVITS-WebUI

GPT-SoVITS-WebUI is an open-source few-shot voice conversion and text-to-speech WebUI. It supports zero-shot (5s) TTS, few-shot (1min) fine-tuning for voice cloning, cross-lingual synthesis, integrated tooling (vocal separation, dataset slicing, ASR, text labeling) and multiple pretrained model versions (v1–v4, v2Pro, etc.).

Introduction

Overview

GPT-SoVITS-WebUI is an open-source project that provides a full-featured WebUI for few-shot voice conversion and text-to-speech (TTS). It emphasizes practical usability: minimal reference audio for cloning (zero-shot with ~5s, few-shot with ~1min), cross-lingual inference, and an integrated toolset to prepare datasets, run ASR, perform vocal/accompaniment separation, and finetune models.

Key features
  • Zero-shot TTS: synthesize speech from a very short (≈5s) reference sample.
  • Few-shot TTS: fine-tune models using only ~1 minute of target-speaker audio for improved timbre similarity.
  • Cross-lingual inference: synthesizes speech in a language different from the training data. Supported languages are Chinese, English, Japanese, Korean, and Cantonese.
  • Integrated WebUI tools: vocal/accompaniment separation (UVR5), automatic audio slicing, Chinese ASR and text labeling to streamline dataset preparation and finetuning.
  • Multiple pretrained versions: the v1–v4 and v2Pro series trade off quality against VRAM use. v3 and v4 improve timbre similarity, v4 outputs 48 kHz audio natively, and v2Pro targets higher performance at a moderate VRAM cost.
  • Demos & hosting: includes Colab training notebook and a Hugging Face Spaces demo for quick tests.
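The automatic audio slicing listed above can be pictured as cutting a recording wherever the signal stays quiet for long enough. The sketch below is a generic, simplified illustration of that idea in plain Python, not the project's own slicer; the function names, the frame length, and the threshold values are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Root-mean-square level of each non-overlapping frame."""
    levels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return levels


def slice_on_silence(
    samples: Sequence[float],
    frame_len: int = 4,            # illustrative; real audio would use far longer frames
    threshold: float = 0.1,        # frames below this RMS level count as silent
    min_silence_frames: int = 2,   # silent frames needed before a cut is made
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of the non-silent regions."""
    levels = frame_rms(samples, frame_len)
    segments = []
    seg_start = None   # index of the first frame of the current segment
    silent_run = 0     # consecutive silent frames seen inside the segment
    for i, level in enumerate(levels):
        if level >= threshold:
            if seg_start is None:
                seg_start = i
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment at the frame where the silence began.
                end_frame = i - silent_run + 1
                segments.append((seg_start * frame_len, end_frame * frame_len))
                seg_start = None
                silent_run = 0
    if seg_start is not None:
        # Flush the last segment, trimming any trailing silent frames.
        end_frame = len(levels) - silent_run
        segments.append(
            (seg_start * frame_len, min(end_frame * frame_len, len(samples)))
        )
    return segments
```

Short pauses (fewer than `min_silence_frames` silent frames) stay inside a segment, so a speaker's natural breaks do not fragment the clip. In practice the WebUI's own slicing tool should be used; this only shows the shape of the computation.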
Technical & usage notes
  • Tested environments include Python 3.9–3.11 with several PyTorch versions (the README mentions 2.5.1 and 2.7.0, among others), CUDA 12.x for GPU acceleration, and macOS on Apple Silicon (MPS) or CPU.
  • Provides installation scripts and packages (Windows integrated package, Docker Compose images, shell/PowerShell installers). The project also gives guidance for downloading and organizing the required pretrained models and the ASR and vocal-separation weights.
  • Fine-tuning workflow in the WebUI: audio path auto-fill, slicing, optional denoising, ASR transcription, proofreading, and then finetuning.
  • Inference: bundled inference WebUI and command-line options; supports switching between multiple model versions.
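Before installing, it can help to confirm that the local interpreter falls in the tested Python range and to see which compute backend PyTorch would use. The following is a minimal sketch, assuming only the 3.9–3.11 range mentioned above; the helper names are hypothetical and not part of the project, and the range is a tested range rather than a hard requirement.

```python
import sys
from typing import Tuple


def python_version_supported(version: Tuple[int, int]) -> bool:
    """True if (major, minor) lies in the tested 3.9-3.11 range."""
    return (3, 9) <= version <= (3, 11)


def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports.

    Falls back to 'cpu' when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists only in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    current = (sys.version_info.major, sys.version_info.minor)
    print("Python", current, "in tested range:", python_version_supported(current))
    print("Preferred device:", pick_device())
```

The project's installers perform their own checks; this snippet is only a quick sanity test to run beforehand.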
Model & pipeline details
  • The repo combines a GPT-based text front end with SoVITS-based synthesis backbones. It also draws on community models (e.g., the BigVGAN vocoder and ERes2NetV2 speaker-verification models) and ASR toolchains (Faster Whisper, FunASR), depending on language and user choice.
  • UVR5 is used for vocal/accompaniment separation and optional reverberation removal.
  • Various pretrained model releases (v2/v3/v4/v2Pro) are provided or linked on Hugging Face; users must download and place weights in specified folders for full functionality.
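Because the weights have to be placed by hand, a small check that the expected folders actually contain files can catch a missed download early. The sketch below is illustrative only: the folder paths in `EXPECTED_DIRS` are assumptions that should be verified against the README, and `find_missing` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import Dict, List

# Assumed locations relative to the repository root; confirm them in the README.
EXPECTED_DIRS: Dict[str, str] = {
    "pretrained models": "GPT_SoVITS/pretrained_models",
    "UVR5 weights": "tools/uvr5/uvr5_weights",
    "ASR models": "tools/asr/models",
}


def find_missing(repo_root: str, expected: Dict[str, str]) -> List[str]:
    """Return the labels whose folder is absent or contains no files."""
    missing = []
    for label, relative in expected.items():
        folder = Path(repo_root) / relative
        # A folder counts as populated if any file exists anywhere beneath it.
        has_files = folder.is_dir() and any(p.is_file() for p in folder.rglob("*"))
        if not has_files:
            missing.append(label)
    return missing


if __name__ == "__main__":
    for label in find_missing(".", EXPECTED_DIRS):
        print(f"Missing or empty: {label} ({EXPECTED_DIRS[label]})")
```

Run from the repository root, the script prints one line per folder that still needs weights. It checks only that files are present, not that they are the correct or complete set.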
Who is this for
  • Researchers and practitioners who want a convenient WebUI for quick voice cloning experiments.
  • Developers building TTS/voice-conversion demos, prototypes, or small-scale systems without investing in large datasets.
  • Hobbyists and localized deployments that need integrated tools (ASR, slicing, separation) to prepare and finetune models locally or in cloud environments.
Limitations & considerations
  • Model quality and required VRAM vary significantly between versions; some versions require careful selection of GPU and precision (fp16) settings.
  • Certain additional models (ASR, UVR5 weights, G2PW for Chinese frontend) must be separately downloaded from Hugging Face/ModelScope as described in the README.
  • Models trained on a macOS GPU may be of lower quality than models trained on a Linux GPU; the project documents this caveat.
Resources
  • Repository: the GitHub repo hosts code, docs, changelogs, multi-language README and example commands for installation, Docker usage, and running inference/finetuning.
  • Demos: Colab training notebook and a Hugging Face Spaces demo are available for quick evaluation.

Information

  • Website: github.com
  • Authors: RVC-Boss
  • Published date: 2024/01/14