Overview
GPT-SoVITS-WebUI is an open-source project that provides a full-featured WebUI for few-shot voice conversion and text-to-speech (TTS). It emphasizes practical usability: minimal reference audio for cloning (zero-shot with ~5s, few-shot with ~1min), cross-lingual inference, and an integrated toolset to prepare datasets, run ASR, perform vocal/accompaniment separation, and finetune models.
Key features
- Zero-shot TTS: synthesize speech from a very short (≈5s) reference sample.
- Few-shot TTS: fine-tune models using only ~1 minute of target-speaker audio for improved timbre similarity.
- Cross-lingual inference: synthesize in a language different from the training/reference data; Chinese, English, Japanese, Korean, and Cantonese are currently supported.
- Integrated WebUI tools: vocal/accompaniment separation (UVR5), automatic audio slicing, Chinese ASR and text labeling to streamline dataset preparation and finetuning.
- Multiple pretrained versions: v1–v4 and the v2Pro series, with different trade-offs in quality and VRAM (native 48 kHz output in v4, improved timbre in v3/v4, v2Pro for higher performance at a moderate VRAM cost).
- Demos & hosting: includes Colab training notebook and a Hugging Face Spaces demo for quick tests.
Technical & usage notes
- Tested environments include Python 3.9–3.11 with recent PyTorch releases (e.g., 2.5.1 and 2.7.0), CUDA 12.x for GPU acceleration, and Apple Silicon (MPS) or CPU-only execution on macOS; a quick environment check is sketched after this list.
- Provides installation scripts and packages (a Windows integrated package, Docker Compose images, shell/PowerShell installers), plus guidance for downloading and organizing the required pretrained models and ASR/vocal-separation weights; a model-download sketch also follows this list.
- Fine-tuning workflow in the WebUI: audio path auto-fill, slicing, optional denoising, ASR transcription, proofreading, and then finetuning.
- Inference: bundled inference WebUI and command-line options; supports switching between multiple model versions.
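A quick way to confirm the environment described above is to report the Python/PyTorch versions and which accelerator backend is visible. This is a minimal sketch, assuming PyTorch is already installed; it is not part of the project itself:

```python
# Minimal environment sanity check for the tested setups described above.
# Assumes PyTorch is already installed; exact versions depend on your install.
import platform
import torch

print(f"Python : {platform.python_version()}")   # expected 3.9-3.11
print(f"PyTorch: {torch.__version__}")           # e.g. 2.5.1 or 2.7.0

if torch.cuda.is_available():
    print(f"CUDA GPU: {torch.cuda.get_device_name(0)} (CUDA {torch.version.cuda})")
elif torch.backends.mps.is_available():
    print("Apple Silicon (MPS) backend available")
else:
    print("No GPU backend found; falling back to CPU")
```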
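For fetching the pretrained weights, one option is the `huggingface_hub` client. The repo id and target folder below are assumptions based on the project's documented layout; the README remains the authoritative source for model locations:

```python
# Illustrative sketch: download pretrained weights with huggingface_hub.
# Repo id and target directory are assumptions; see the README for the
# authoritative model sources and folder layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lj1995/GPT-SoVITS",               # assumed pretrained-model repo
    local_dir="GPT_SoVITS/pretrained_models",  # assumed target folder
)
```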
Model & pipeline details
- The repo pairs a GPT-style autoregressive text/semantic front end with SoVITS-based synthesis, and pulls in community models and toolchains as needed: the BigVGAN vocoder, ERes2NetV2 speaker-verification (SV) models, and ASR backends (Faster Whisper, FunASR) depending on language and user choice; a minimal ASR sketch follows this list.
- UVR5 is used for vocal/accompaniment separation and optional reverberation removal.
- Various pretrained model releases (v2/v3/v4/v2Pro) are provided or linked on Hugging Face; users must download and place weights in specified folders for full functionality.
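To illustrate the ASR step the WebUI automates for non-Chinese audio, here is a minimal sketch using the Faster Whisper library directly; the model size and audio path are placeholder assumptions, not values prescribed by the project:

```python
# Minimal Faster Whisper transcription sketch, illustrating the ASR step
# the WebUI automates; model size and audio path are placeholder assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("dataset/slice_0001.wav", language="en")

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```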
Who is this for
- Researchers and practitioners who want a convenient WebUI for quick voice cloning experiments.
- Developers building TTS/voice-conversion demos, prototypes, or small-scale systems without investing in large datasets.
- Hobbyists and localized deployments that need integrated tools (ASR, slicing, separation) to prepare and finetune models locally or in cloud environments.
Limitations & considerations
- Model quality and required VRAM vary significantly between versions; some versions require careful choice of GPU and precision (fp16) settings, as illustrated by the sketch after this list.
- Certain additional models (ASR, UVR5 weights, G2PW for Chinese frontend) must be separately downloaded from Hugging Face/ModelScope as described in the README.
- Models trained on macOS GPUs (MPS) may be of lower quality than those trained on Linux GPUs; the project documents this caveat.
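The precision caveat above can be handled with a simple capability check: use fp16 only on GPUs that support it well, otherwise fall back to fp32. This is an illustrative sketch, not the project's own configuration logic:

```python
# Illustrative precision selection: prefer fp16 on GPUs that support it,
# otherwise fall back to fp32. Not the project's own configuration code.
import torch

def pick_dtype() -> torch.dtype:
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability(0)
        # Half precision is generally safe on compute capability >= 7 (Volta and newer).
        if major >= 7:
            return torch.float16
    return torch.float32

print(f"Selected dtype: {pick_dtype()}")
```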
Resources
- Repository: the GitHub repo hosts code, docs, changelogs, multi-language README and example commands for installation, Docker usage, and running inference/finetuning.
- Demos: Colab training notebook and a Hugging Face Spaces demo are available for quick evaluation.
