Overview
GPT-SoVITS-WebUI is an open-source project that provides a full-featured WebUI for few-shot voice conversion and text-to-speech (TTS). It emphasizes practical usability: minimal reference audio for cloning (zero-shot with ~5s, few-shot with ~1min), cross-lingual inference, and an integrated toolset to prepare datasets, run ASR, perform vocal/accompaniment separation, and finetune models.
Key features
- Zero-shot TTS: synthesize speech from a very short (≈5s) reference sample.
- Few-shot TTS: fine-tune models using only ~1 minute of target-speaker audio for improved timbre similarity.
- Cross-lingual inference: synthesize in a language different from the training/reference data; Chinese, English, Japanese, Korean, and Cantonese are currently supported.
- Integrated WebUI tools: vocal/accompaniment separation (UVR5), automatic audio slicing, Chinese ASR and text labeling to streamline dataset preparation and finetuning.
- Multiple pretrained versions: v1–v4 and the v2Pro series, with different trade-offs in quality and VRAM (native 48 kHz output in v4, improved timbre in v3/v4, v2Pro for higher performance at a moderate VRAM cost).
- Demos & hosting: includes Colab training notebook and a Hugging Face Spaces demo for quick tests.
Technical & usage notes
- Tested environments include Python 3.9–3.11 with recent PyTorch releases (e.g., 2.5.1 and 2.7.0), CUDA 12.x for GPU acceleration, and Apple Silicon (MPS) or CPU-only execution on macOS; a quick environment check is sketched after this list.
- Provides installation scripts and packages (a Windows integrated package, Docker Compose images, shell/PowerShell installers), plus guidance for downloading and organizing the required pretrained models and ASR/vocal-separation weights; a model-download sketch also follows this list.
- Fine-tuning workflow in the WebUI: audio path auto-fill, slicing, optional denoising, ASR transcription, proofreading, and then finetuning.
- Inference: bundled inference WebUI and command-line options; supports switching between multiple model versions.
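A quick way to confirm the environment described above is to report the Python/PyTorch versions and which accelerator backend is visible. This is a minimal sketch, assuming PyTorch is already installed; it is not part of the project itself:

```python
# Minimal environment sanity check for the tested setups described above.
# Assumes PyTorch is already installed; exact versions depend on your install.
import platform
import torch

print(f"Python : {platform.python_version()}")   # expected 3.9-3.11
print(f"PyTorch: {torch.__version__}")           # e.g. 2.5.1 or 2.7.0

if torch.cuda.is_available():
    print(f"CUDA GPU: {torch.cuda.get_device_name(0)} (CUDA {torch.version.cuda})")
elif torch.backends.mps.is_available():
    print("Apple Silicon (MPS) backend available")
else:
    print("No GPU backend found; falling back to CPU")
```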
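For fetching the pretrained weights, one option is the `huggingface_hub` client. The repo id and target folder below are assumptions based on the project's documented layout; the README remains the authoritative source for model locations:

```python
# Illustrative sketch: download pretrained weights with huggingface_hub.
# Repo id and target directory are assumptions; see the README for the
# authoritative model sources and folder layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lj1995/GPT-SoVITS",               # assumed pretrained-model repo
    local_dir="GPT_SoVITS/pretrained_models",  # assumed target folder
)
```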
Model & pipeline details
- The repo pairs a GPT-style autoregressive text/semantic front end with SoVITS-based synthesis, and pulls in community models and toolchains as needed: the BigVGAN vocoder, ERes2NetV2 speaker-verification (SV) models, and ASR backends (Faster Whisper, FunASR) depending on language and user choice; a minimal ASR sketch follows this list.
- UVR5 is used for vocal/accompaniment separation and optional reverberation removal.
- Various pretrained model releases (v2/v3/v4/v2Pro) are provided or linked on Hugging Face; users must download and place weights in specified folders for full functionality.
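To illustrate the ASR step the WebUI automates for non-Chinese audio, here is a minimal sketch using the Faster Whisper library directly; the model size and audio path are placeholder assumptions, not values prescribed by the project:

```python
# Minimal Faster Whisper transcription sketch, illustrating the ASR step
# the WebUI automates; model size and audio path are placeholder assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("dataset/slice_0001.wav", language="en")

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```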
Who is this for
- Researchers and practitioners who want a convenient WebUI for quick voice cloning experiments.
- Developers building TTS/voice-conversion demos, prototypes, or small-scale systems without investing in large datasets.
- Hobbyists and localized deployments that need integrated tools (ASR, slicing, separation) to prepare and finetune models locally or in cloud environments.
Limitations & considerations
- Model quality and required VRAM vary significantly between versions; some versions require careful choice of GPU and precision (fp16) settings, as illustrated by the sketch after this list.
- Certain additional models (ASR, UVR5 weights, G2PW for Chinese frontend) must be separately downloaded from Hugging Face/ModelScope as described in the README.
- Models trained on macOS GPUs (MPS) may be of lower quality than those trained on Linux GPUs; the project documents this caveat.
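The precision caveat above can be handled with a simple capability check: use fp16 only on GPUs that support it well, otherwise fall back to fp32. This is an illustrative sketch, not the project's own configuration logic:

```python
# Illustrative precision selection: prefer fp16 on GPUs that support it,
# otherwise fall back to fp32. Not the project's own configuration code.
import torch

def pick_dtype() -> torch.dtype:
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability(0)
        # Half precision is generally safe on compute capability >= 7 (Volta and newer).
        if major >= 7:
            return torch.float16
    return torch.float32

print(f"Selected dtype: {pick_dtype()}")
```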
Resources
- Repository: the GitHub repo hosts code, docs, changelogs, multi-language README and example commands for installation, Docker usage, and running inference/finetuning.
- Demos: Colab training notebook and a Hugging Face Spaces demo are available for quick evaluation.
