Chatterbox TTS — detailed introduction
Chatterbox is an open-source suite of text-to-speech models published by Resemble AI. The project provides three main model families tailored to different use cases:
- Chatterbox-Turbo: a 350M-parameter, highly efficient model optimized for low compute and VRAM usage. Turbo introduces native paralinguistic tags (e.g., [laugh], [chuckle], [cough]) for adding realistic non-speech events, and a distilled speech-token-to-mel decoder that reduces generation to a single step while maintaining high-fidelity audio output. It is particularly suited for zero-shot voice agents and production scenarios where latency and resource use matter.
- Chatterbox: the original English-focused model offering flexible control (CFG and exaggeration tuning) for expressive output, useful for creative TTS tasks and general zero-shot voice cloning.
- Chatterbox-Multilingual: a larger variant (≈500M parameters) supporting 23+ languages with zero-shot cloning and multilingual synthesis for localization and global applications.
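Paralinguistic tags are written inline in the input text. As an illustration only (the helper below is hypothetical and not part of the Chatterbox API), a small sketch of locating and stripping such tags before further text processing:

```python
import re

# Hypothetical helper, not part of the Chatterbox API. Illustrates how
# inline paralinguistic tags such as [laugh], [chuckle], or [cough] can
# be found in, and removed from, an input string.
TAG_PATTERN = re.compile(r"\[(laugh|chuckle|cough)\]")

def extract_tags(text: str) -> list[str]:
    """Return the paralinguistic tag names found in `text`, in order."""
    return TAG_PATTERN.findall(text)

def strip_tags(text: str) -> str:
    """Return `text` with all paralinguistic tags removed and whitespace normalized."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())

print(extract_tags("Well [laugh] that was [cough] close."))  # ['laugh', 'cough']
print(strip_tags("Well [laugh] that was [cough] close."))    # Well that was close.
```

The tag names here are the ones listed above; the real model may accept additional tokens, so consult the repository README for the full set.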
Key features
- Paralinguistic tags: native support for non-verbal tokens to boost realism.
- Low-latency Turbo inference: the distilled decoder collapses synthesis to a single step, yielding faster generation and lower VRAM requirements.
- Zero-shot voice cloning: models accept a short reference audio clip to mimic a target voice.
- Multi-language support: the multilingual model supports 23+ languages including Chinese, Spanish, French, Arabic, Hindi, Japanese, Korean, and others.
- Built-in PerTh watermarking: every generated audio clip carries Resemble AI's PerTh implicit watermark, which survives common edits and compression and can be programmatically extracted for provenance and responsible-AI workflows.
- Demos & integrations: demo pages and Hugging Face Spaces are provided for quick listening and evaluation.
Installation & usage (summary)
- Install via pip: `pip install chatterbox-tts`, or install from source for development.
- Typical usage involves loading a model (e.g., `ChatterboxTurboTTS.from_pretrained(device="cuda")`) and calling `generate(text, audio_prompt_path=...)` when voice cloning is required. The library uses torch/torchaudio for audio handling and provides example scripts for TTS and voice conversion.
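Putting the steps above together, a hedged usage sketch. The `from_pretrained`, `generate`, and `audio_prompt_path` names come from the summary above; the import path, the `model.sr` sample-rate attribute, and the file names are assumptions to verify against the repository README:

```python
# Hedged usage sketch; requires the chatterbox-tts package and a GPU
# (or device="cpu"). Import path, model.sr, and file names are assumptions.
import importlib.util

available = importlib.util.find_spec("chatterbox") is not None

if not available:
    print("chatterbox-tts is not installed; skipping synthesis.")
else:
    import torchaudio
    from chatterbox.tts import ChatterboxTurboTTS  # assumed import path

    model = ChatterboxTurboTTS.from_pretrained(device="cuda")
    # Zero-shot cloning: pass a short reference clip of the target voice.
    wav = model.generate(
        "Hello from Chatterbox! [chuckle]",
        audio_prompt_path="reference.wav",
    )
    torchaudio.save("output.wav", wav, model.sr)
```

Omitting `audio_prompt_path` falls back to the model's default voice; the inline `[chuckle]` tag is only meaningful with the Turbo model.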
Supported languages
- Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese.
Responsible AI
- The project includes PerTh watermark embedding and extraction tools to help detect generated audio. The README explicitly discourages misuse and provides detection code samples.
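A sketch of watermark extraction along the lines of the README's detection samples. It assumes Resemble AI's `resemble-perth` package (imported as `perth`) and librosa are installed; the `PerthImplicitWatermarker` class name and `get_watermark` signature should be checked against the README:

```python
# Hedged detection sketch; package name, class, and method are assumptions
# based on Resemble AI's resemble-perth project.
import importlib.util

perth_available = importlib.util.find_spec("perth") is not None

if not perth_available:
    print("resemble-perth is not installed; skipping detection.")
else:
    import librosa
    import perth

    # Load previously generated audio at its native sample rate.
    audio, sr = librosa.load("output.wav", sr=None)
    watermarker = perth.PerthImplicitWatermarker()
    # Extract the implicit watermark for provenance checks.
    watermark = watermarker.get_watermark(audio, sample_rate=sr)
    print(f"Extracted watermark: {watermark}")
```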
Repository & publishing
- This GitHub repository serves as the canonical open-source distribution for the Chatterbox models and associated code, demo pages, and examples. The project metadata indicates it was created on 2025-04-23.
Use cases
- Real-time voice agents and assistants (low-latency requirements)
- Audiobook and narration production
- Multilingual localization and zero-shot voice cloning experiments
- Research and fine-tuning for higher-accuracy or bespoke voice models (with commercial Resemble AI services available for production scaling)
Links and demos
- Official demo page: provided on the repository homepage
- Hugging Face Spaces: demo spaces linked from the README
(For implementation details, examples and API parameters, refer to the repository README and example scripts.)
