
CosyVoice (Fun-CosyVoice)

CosyVoice (Fun-CosyVoice) is a multilingual, LLM-based text-to-speech (TTS) system that provides end-to-end capabilities for training, inference, and deployment. It focuses on zero-shot voice cloning with strong content consistency, speaker similarity, and natural prosody, and supports many languages and Chinese dialects, pronunciation inpainting, text normalization, and low-latency bi-streaming for production use.

Introduction

CosyVoice is an open-source, LLM-based text-to-speech (TTS) project developed by the FunAudioLLM team. It has evolved through several versions (CosyVoice 1.0 / 2.0 / 3.0) and aims to provide a scalable, high-quality, zero-shot multilingual speech synthesis stack covering training, inference, and deployment.

Key goals and strengths
  • Multilingual zero-shot synthesis: supports 9 major languages (including Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and extensive Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.).
  • High content consistency and naturalness: designed to improve content fidelity, speaker similarity and prosody naturalness compared with previous versions and many baselines.
  • Pronunciation inpainting & controllability: supports Chinese Pinyin and English CMU phoneme inpainting to control pronunciations for production scenarios.
  • Text normalization: built-in handling of numbers, symbols and diverse text formats without requiring a separate frontend module (optionally supports ttsfrd for enhanced normalization).
  • Bi-streaming and low latency: supports text-in streaming and audio-out streaming with latencies as low as ~150 ms while preserving audio quality.
  • Instruct-style controls: accepts instructions for language, dialect, emotion, speed, volume and more, enabling flexible generation.
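The bi-streaming mode above can be pictured as a loop that emits audio as soon as enough text has arrived, rather than waiting for the full sentence. A minimal, self-contained sketch of the idea (illustrative only, not the actual CosyVoice API; `synthesize` is a hypothetical stand-in for a model call):

```python
from typing import Callable, Iterator


def bistream(text_stream: Iterator[str],
             synthesize: Callable[[str], bytes],
             chunk_chars: int = 20) -> Iterator[bytes]:
    """Conceptual bi-streaming loop: text streams in, audio streams out.

    Audio for each text chunk is emitted as soon as `chunk_chars`
    characters have accumulated, which is what keeps first-audio
    latency low in a text-in / audio-out streaming setup.
    """
    buf = ""
    for piece in text_stream:
        buf += piece
        while len(buf) >= chunk_chars:
            yield synthesize(buf[:chunk_chars])  # emit early, don't wait
            buf = buf[chunk_chars:]
    if buf:
        yield synthesize(buf)  # flush the trailing partial chunk
```

In a real deployment the synthesizer keeps model state across chunks so prosody stays continuous; this sketch only shows the control flow.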

Components and ecosystem
  • Models: includes multiple released models (CosyVoice-300M, CosyVoice2-0.5B, Fun-CosyVoice3-0.5B-2512 and variants including RL-finetuned releases).
  • Demo & Web UI: project provides web demos and a web UI to test models quickly.
  • Training & Inference scripts: examples for training and inference across datasets (e.g., LibriTTS) and scripts to reproduce or fine-tune models.
  • Deployment: Docker images, FastAPI and gRPC server examples, and optional acceleration via NVIDIA TensorRT-LLM (Triton trtllm runtime) for production deployment.
  • Integrations: model download instructions via ModelScope and Hugging Face; provided evaluation scripts and an evaluation set (CV3-Eval).
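The FastAPI/gRPC servers shipped with the repository wrap the model behind a network endpoint. A stdlib-only sketch of that serving pattern, with a fake synthesizer standing in for the real model (endpoint path and payload shape are illustrative assumptions, not the project's actual API):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_synthesize(text: str) -> bytes:
    # Stand-in for a real CosyVoice model call (illustrative only):
    # a production server would return synthesized audio bytes here.
    return text.encode("utf-8")


class TTSHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body like {"text": "..."} and answer with raw bytes.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        audio = fake_synthesize(payload["text"])
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def serve(port: int = 0) -> HTTPServer:
    """Start the toy TTS server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), TTSHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The repository's FastAPI and gRPC examples follow the same request-in / audio-out shape with proper model loading, batching, and streaming on top.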

Installation & usage highlights
  • Typical setup uses Conda + Python 3.10; repository provides requirements and sample environment setup.
  • Pretrained models can be downloaded via modelscope or huggingface hub; examples show snapshot_download usage.
  • Example entry points: python example.py for basic usage, webui.py to launch the web demo, and runtime/docker for serving.
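A model download via the ModelScope hub typically looks like the following; the model ID `iic/CosyVoice2-0.5B` is used here as an example — check the README for the current list of released models. The lazy import keeps the path helper usable even when modelscope is not installed:

```python
from pathlib import Path


def model_dir(root: str, model_id: str) -> str:
    """Local folder for a downloaded snapshot,
    e.g. pretrained_models/CosyVoice2-0.5B for iic/CosyVoice2-0.5B."""
    return str(Path(root) / model_id.split("/")[-1])


def download(model_id: str, root: str = "pretrained_models") -> str:
    # Requires `pip install modelscope`; imported lazily so the helper
    # above stays importable without the package.
    from modelscope import snapshot_download
    target = model_dir(root, model_id)
    snapshot_download(model_id, local_dir=target)
    return target
```

Usage would be e.g. `download("iic/CosyVoice2-0.5B")`; a Hugging Face hub download follows the same snapshot_download pattern.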
Evaluation & performance

The repo publishes evaluation tables comparing CosyVoice variants to many other TTS systems on metrics such as CER/WER and speaker similarity. Fun-CosyVoice3-0.5B-2512 and its RL variant show competitive CER/WER and speaker similarity in the provided benchmarks.

Roadmap & recent activity

The project has an explicit roadmap (features, releases and runtime improvements). Notable recent items include the release of Fun-CosyVoice3-0.5B-2512, evaluation sets, Triton/TensorRT-LLM runtime support, and ongoing model/tooling releases.

Who should use it

Researchers and engineers working on TTS, voice cloning, multilingual speech synthesis and production deployment will find CosyVoice useful. The project is suitable for experimentation, replication of published results, and building production TTS services with LLM-backed synthesis.

References

See the repository README, demos, and referenced arXiv papers for details on model architectures, training procedures, and evaluation protocols.

Information

  • Website: github.com
  • Authors: FunAudioLLM
  • Published date: 2024/07/03
