AIAny - Amphion

Amphion — Open-Source Audio, Music, and Speech Generation Toolkit

Overview

Amphion (/æmˈfaɪən/) is an open-source toolkit for research and development in audio, music, and speech generation. It provides reproducible implementations, educational visualizations, dataset preprocessing pipelines, pretrained models, and evaluation metrics, with the aim of speeding up experiments and helping junior researchers and engineers get started in audio generation.

Supported tasks

Text-to-Speech (TTS) — implementations of FastSpeech2, VITS, VALL-E, NaturalSpeech2, MaskGCT, Vevo-TTS, and others.
Voice Conversion (VC) — zero-shot and controllable methods such as Vevo and FACodec.
Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC).
Accent Conversion (AC) and various speech editing tasks.
Text-to-Audio (TTA) / Text-to-Music (TTM) via latent-diffusion style pipelines.
Neural audio codecs and tokenizers for efficient discrete-token generation.
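
To make "discrete-token generation" concrete, here is a minimal sketch of how a codec-style tokenizer can turn continuous feature frames into token IDs by nearest-codebook lookup. This is generic vector quantization, not Amphion's codec code: the function names, the toy two-dimensional codebook, and the frame values are all assumptions chosen for illustration. Real codecs learn their codebooks and often stack several quantizers.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry.

    Hypothetical helper for illustration; distance is squared Euclidean.
    """
    tokens = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Recover an approximation of the frames from their token IDs."""
    return [codebook[t] for t in tokens]


# Toy codebook with three entries (assumed values, not learned).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

tokens = quantize(frames, codebook)
reconstruction = dequantize(tokens, codebook)
print(tokens)
```

A generative model can then predict the integer tokens instead of raw waveforms, and a decoder maps the looked-up vectors back to audio.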

Key features

Comprehensive model implementations: diffusion-, transformer-, VAE-, flow- and GAN-based architectures for generation and vocoding.
Visualization tools: interactive visualization (e.g., SingVisio) to illustrate internal mechanisms of models for educational purposes.
Evaluation suite: objective metrics for F0, energy, intelligibility (WER/CER with Whisper), perceptual scores (FAD, PESQ, STOI), speaker-similarity measures, etc.
Dataset support & preprocessing: unified preprocessing for common datasets (LJSpeech, LibriTTS, VCTK, AudioCaps, etc.) and for the large in-the-wild Emilia dataset, with Emilia-Pipe for cleaning and annotation.
Pretrained models and demos: Hugging Face hosting, ModelScope integrations, and demo pages for several released systems.
Extensible & reproducible: designed to support reproducible research and to serve as a learning platform for newcomers.
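
As an illustration of the intelligibility metrics mentioned above, the following sketch computes word error rate (WER) and character error rate (CER) from a reference text and a transcript. It is not Amphion's evaluation code: the function names and the simple lowercase, whitespace-based normalization are assumptions, and in practice the hypothesis would come from an ASR model such as Whisper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumed convention)."""
    ref = list(reference.lower().replace(" ", ""))
    hyp = list(hypothesis.lower().replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
score = wer("the cat sat on the mat", "the cat sit on mat")
print(round(score, 3))
```

Lower is better for both metrics; a value above 1.0 is possible when the transcript contains many inserted words.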

Notable releases (selected)

Amphion v0.1 (2023-12-18) and v0.2 (technical report released 2025-01-30).
Emilia dataset (101k+ hours), later extended with additional hours as Emilia-Large (announced 2025-02-26).
Model releases such as Vevo (zero-shot voice imitation), MaskGCT (non-autoregressive TTS), Metis (foundation model for unified speech generation), and DualCodec (neural audio codec for discrete tokens).

Installation & usage

Install from GitHub or use the provided Docker image. Typical workflow: clone the repo, create a conda environment (Python 3.9.15 recommended), and run env.sh to install dependencies. Alternatively, pull the official Docker image and mount your datasets.
Recipes and examples are organized under egs/ (TTS, SVC, TTA, vocoder, evaluation, visualization), with README guides for each task.
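
The conda-based workflow above can be sketched as a small dry-run planner that only prints the commands, so they can be reviewed before running anything. The repository URL, the environment name `amphion`, and the helper name `setup_commands` are assumptions for illustration; the repository README is the authoritative source for the exact commands.

```python
import shlex


def setup_commands(repo_url, env_name="amphion", python_version="3.9.15"):
    """Return the shell commands for a conda-based setup as strings.

    Hypothetical helper: it builds the command list but runs nothing.
    """
    # Derive the checkout directory from the last URL component.
    repo_dir = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_dir.endswith(".git"):
        repo_dir = repo_dir[:-4]

    steps = [
        ["git", "clone", repo_url],
        ["cd", repo_dir],
        ["conda", "create", "--name", env_name, f"python={python_version}"],
        ["conda", "activate", env_name],
        ["sh", "env.sh"],  # dependency install script named in the text
    ]
    return [" ".join(shlex.quote(arg) for arg in step) for step in steps]


# Assumed repository location; adjust if the project has moved.
commands = setup_commands("https://github.com/open-mmlab/Amphion.git")
for command in commands:
    print(command)
```

Printing rather than executing keeps the sketch safe to run anywhere; the commands are meant to be pasted into an interactive shell, since `cd` and `conda activate` only take effect there.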

Ecosystem & interoperability

Amphion integrates with Hugging Face (models & datasets) and ModelScope, and provides example notebooks and demos. It uses common pretrained backbones (Whisper, WavLM, ContentVec, WeNet) and supports widely used vocoders (HiFi-GAN, BigVGAN, WaveNet, DiffWave, etc.).

License & citation

Licensed under the MIT License — free for research and commercial use.
Citation information for Amphion v0.1 and v0.2 is provided in the repository README.

Who is it for

Researchers building or reproducing state-of-the-art audio generation models.
Engineers prototyping TTS, voice conversion, singing synthesis, text-to-audio, and codec-based generation systems.
Educators and students who want interactive visualizations to understand model internals.

(See the project homepage and repository README for detailed examples, API usage, and task-specific instructions.)
