
CosyVoice (Fun-CosyVoice)

CosyVoice (Fun-CosyVoice) is a multilingual, LLM-based text-to-speech (TTS) system that provides end-to-end capabilities for training, inference, and deployment. It focuses on zero-shot voice cloning with strong content consistency, speaker similarity, and natural prosody, and supports many languages and Chinese dialects, pronunciation inpainting, text normalization, and low-latency bi-streaming for production use.

Introduction

CosyVoice is an open-source, LLM-based text-to-speech (TTS) project developed by the FunAudioLLM team. It has evolved through several versions (CosyVoice 1.0 / 2.0 / 3.0) and aims to provide a scalable, high-quality, zero-shot multilingual speech synthesis stack covering training, inference, and deployment.

Key goals and strengths
  • Multilingual zero-shot synthesis: supports 9 major languages (including Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and extensive Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.).
  • High content consistency and naturalness: designed to improve content fidelity, speaker similarity and prosody naturalness compared with previous versions and many baselines.
  • Pronunciation inpainting & controllability: supports Chinese Pinyin and English CMU phoneme inpainting to control pronunciations for production scenarios.
  • Text normalization: built-in handling of numbers, symbols and diverse text formats without requiring a separate frontend module (optionally supports ttsfrd for enhanced normalization).
  • Bi-streaming and low latency: supports text-in streaming and audio-out streaming with latencies as low as ~150 ms while preserving audio quality.
  • Instruct-style controls: accepts instructions for language, dialect, emotion, speed, volume and more, enabling flexible generation.
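The bi-streaming mode above can be pictured as a loop that emits audio as soon as enough text has arrived, rather than waiting for the full sentence. A minimal, self-contained sketch of the idea (illustrative only, not the actual CosyVoice API; `synthesize` is a hypothetical stand-in for a model call):

```python
from typing import Callable, Iterator


def bistream(text_stream: Iterator[str],
             synthesize: Callable[[str], bytes],
             chunk_chars: int = 20) -> Iterator[bytes]:
    """Conceptual bi-streaming loop: text streams in, audio streams out.

    Audio for each text chunk is emitted as soon as `chunk_chars`
    characters have accumulated, which is what keeps first-audio
    latency low in a text-in / audio-out streaming setup.
    """
    buf = ""
    for piece in text_stream:
        buf += piece
        while len(buf) >= chunk_chars:
            yield synthesize(buf[:chunk_chars])  # emit early, don't wait
            buf = buf[chunk_chars:]
    if buf:
        yield synthesize(buf)  # flush the trailing partial chunk
```

In a real deployment the synthesizer keeps model state across chunks so prosody stays continuous; this sketch only shows the control flow.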

Components and ecosystem
  • Models: includes multiple released models (CosyVoice-300M, CosyVoice2-0.5B, Fun-CosyVoice3-0.5B-2512 and variants including RL-finetuned releases).
  • Demo & Web UI: project provides web demos and a web UI to test models quickly.
  • Training & Inference scripts: examples for training and inference across datasets (e.g., LibriTTS) and scripts to reproduce or fine-tune models.
  • Deployment: Docker images, FastAPI and gRPC server examples, and optional acceleration via NVIDIA TensorRT-LLM (Triton trtllm runtime) for production deployment.
  • Integrations: model download instructions via ModelScope and Hugging Face; provided evaluation scripts and an evaluation set (CV3-Eval).
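The FastAPI/gRPC servers shipped with the repository wrap the model behind a network endpoint. A stdlib-only sketch of that serving pattern, with a fake synthesizer standing in for the real model (endpoint path and payload shape are illustrative assumptions, not the project's actual API):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_synthesize(text: str) -> bytes:
    # Stand-in for a real CosyVoice model call (illustrative only):
    # a production server would return synthesized audio bytes here.
    return text.encode("utf-8")


class TTSHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body like {"text": "..."} and answer with raw bytes.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        audio = fake_synthesize(payload["text"])
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def serve(port: int = 0) -> HTTPServer:
    """Start the toy TTS server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), TTSHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The repository's FastAPI and gRPC examples follow the same request-in / audio-out shape with proper model loading, batching, and streaming on top.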

Installation & usage highlights
  • Typical setup uses Conda + Python 3.10; repository provides requirements and sample environment setup.
  • Pretrained models can be downloaded via modelscope or huggingface hub; examples show snapshot_download usage.
  • Example entry points: python example.py for basic usage, webui.py to launch the web demo, and runtime/docker for serving.
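A model download via the ModelScope hub typically looks like the following; the model ID `iic/CosyVoice2-0.5B` is used here as an example — check the README for the current list of released models. The lazy import keeps the path helper usable even when modelscope is not installed:

```python
from pathlib import Path


def model_dir(root: str, model_id: str) -> str:
    """Local folder for a downloaded snapshot,
    e.g. pretrained_models/CosyVoice2-0.5B for iic/CosyVoice2-0.5B."""
    return str(Path(root) / model_id.split("/")[-1])


def download(model_id: str, root: str = "pretrained_models") -> str:
    # Requires `pip install modelscope`; imported lazily so the helper
    # above stays importable without the package.
    from modelscope import snapshot_download
    target = model_dir(root, model_id)
    snapshot_download(model_id, local_dir=target)
    return target
```

Usage would be e.g. `download("iic/CosyVoice2-0.5B")`; a Hugging Face hub download follows the same snapshot_download pattern.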
Evaluation & performance

The repo publishes evaluation tables comparing CosyVoice variants to many other TTS systems on metrics such as CER/WER and speaker similarity. Fun-CosyVoice3-0.5B-2512 and its RL variant show competitive CER/WER and speaker similarity in the provided benchmarks.

Roadmap & recent activity

The project has an explicit roadmap (features, releases and runtime improvements). Notable recent items include the release of Fun-CosyVoice3-0.5B-2512, evaluation sets, Triton/TensorRT-LLM runtime support, and ongoing model/tooling releases.

Who should use it

Researchers and engineers working on TTS, voice cloning, multilingual speech synthesis and production deployment will find CosyVoice useful. The project is suitable for experimentation, replication of published results, and building production TTS services with LLM-backed synthesis.

References

See the repository README, demos, and referenced arXiv papers for details on model architectures, training procedures, and evaluation protocols.

Information

  • Website: github.com
  • Authors: FunAudioLLM
  • Published date: 2024/07/03
