Overview
FunASR is an end-to-end speech recognition toolkit designed to bridge academic research and industrial deployment. Originating from Alibaba DAMO Academy and developed with contributions from an active community, FunASR bundles training and fine-tuning capabilities, a rich zoo of pretrained models published on ModelScope and Hugging Face, runtime components for both batch and real-time inference, and utilities for related speech tasks such as voice activity detection (VAD), punctuation restoration, speaker verification/diarization, and emotion recognition.
Key features
- Multi-task support: non-streaming and streaming ASR, VAD, punctuation restoration, timestamp prediction, speaker verification/diarization, keyword spotting, emotion recognition, and multi-talker pipelines.
- Extensive model zoo: industrial and academic pretrained models, including Paraformer variants, Conformer, Whisper integrations, the SenseVoice family, and Fun-ASR-Nano models trained on large-scale data to cover many languages and accents.
- Deployment-ready runtimes: offline file transcription services, real-time transcription services, GPU/CPU runtimes, and ONNX export for optimized inference.
- Production oriented: supports hotword customization, WFST/ngram decoding, low-latency transducers (BAT), and optimizations for memory and throughput.
- Ecosystem integrations: direct support for the ModelScope and Hugging Face model hubs, examples and demos for common tasks, and a PyPI package (funasr) for easy installation.
Model zoo & notable models
FunASR publishes many pretrained models aimed at production use. Representative entries include Paraformer (Chinese/English variants), SenseVoiceSmall (multilingual speech understanding), Fun-ASR-Nano (trained on large-scale data, supporting dozens of languages and dialects), Whisper integrations, Qwen-Audio/Qwen-Audio-Chat adapters, and specialized models for punctuation restoration, timestamp prediction, and keyword spotting. Models are available on ModelScope and Hugging Face, enabling easy downloading and inference.
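As a concrete illustration, the following is a minimal sketch of pulling one of these hub-hosted models by id through the AutoModel API, based on the repo's SenseVoice quick-start pattern; the audio path and device string are placeholders and details may vary across FunASR versions.

```python
# Minimal sketch: load a model-zoo entry by hub id and transcribe one file.
# Assumes `pip install -U funasr` and access to the model hub; audio path and
# device are placeholders for your own environment.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(
    model="iic/SenseVoiceSmall",                      # hub id of a multilingual model
    vad_model="fsmn-vad",                             # optional VAD front-end for long audio
    vad_kwargs={"max_single_segment_time": 30000},    # cap VAD segments at 30 s
    device="cpu",                                     # or "cuda:0" when a GPU is available
)

res = model.generate(
    input="audio/example.wav",   # placeholder path to a local recording
    language="auto",             # let the model detect the spoken language
    use_itn=True,                # apply inverse text normalization to the output
    batch_size_s=60,             # dynamic batching by total audio duration (seconds)
)
print(rich_transcription_postprocess(res[0]["text"]))
```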
Usage & developer experience
FunASR provides both CLI tools (funasr, funasr-export) and Python APIs (AutoModel with generate/export flows). Common workflows include quick inference with a pretrained model, streaming ASR with chunked input and low-latency settings, VAD segmentation, and model export to ONNX for an optimized runtime. The repo includes many ready-to-run demos and example configurations for different languages and deployment scenarios.
Typical usage:
- Quick non-streaming inference: instantiate AutoModel with a pretrained model id (e.g. "paraformer-zh" or "FunAudioLLM/Fun-ASR-Nano-2512"), optionally enable VAD/punctuation models, then call generate() on local audio files (see the first sketch after this list).
- Streaming ASR: use chunk_size and encoder/decoder look-back settings to trade off latency against accuracy, and call generate() incrementally with is_final flags (see the streaming sketch after this list).
- Export: export models to ONNX and run them with the funasr-onnx runtime for lower-latency CPU/GPU inference (see the export sketch after this list).
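A minimal non-streaming sketch, following the repo's quick-start pattern; the audio path and hotword string are placeholders.

```python
# Sketch: non-streaming ASR with optional VAD and punctuation models chained in.
from funasr import AutoModel

model = AutoModel(
    model="paraformer-zh",   # pretrained Paraformer for Mandarin
    vad_model="fsmn-vad",    # segments long recordings before recognition
    punc_model="ct-punc",    # restores punctuation in the transcript
)

res = model.generate(
    input="audio/meeting.wav",  # local file path (placeholder)
    batch_size_s=300,           # batch segments by total duration in seconds
    hotword="FunASR",           # bias decoding toward a custom hotword (placeholder)
)
print(res[0]["text"])
```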
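For streaming, the sketch below feeds audio in fixed-size chunks and carries model state in a cache dict between calls. The chunk and look-back settings follow the repo's streaming demo defaults and can be tuned for latency; the audio path is a placeholder and 16 kHz mono input is assumed.

```python
# Sketch: streaming ASR that feeds ~600 ms chunks and keeps state in `cache`.
import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]        # 10 x 60 ms = 600 ms of new audio per step
encoder_chunk_look_back = 4    # encoder self-attention look-back (in chunks)
decoder_chunk_look_back = 1    # decoder cross-attention look-back (in chunks)

model = AutoModel(model="paraformer-zh-streaming")

speech, sample_rate = soundfile.read("audio/example.wav")  # 16 kHz mono assumed
chunk_stride = chunk_size[1] * 960  # 10 x 60 ms x 16 samples/ms = 9600 samples

cache = {}
total_chunks = (len(speech) + chunk_stride - 1) // chunk_stride  # ceiling division
for i in range(total_chunks):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    res = model.generate(
        input=speech_chunk,
        cache=cache,                       # carries model state across calls
        is_final=(i == total_chunks - 1),  # flush remaining state on the last chunk
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
```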
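Export is a call on AutoModel, and the exported model can then be served through the funasr-onnx runtime wrapper. The sketch below mirrors the export and funasr-onnx documentation; the ONNX wrapper accepts a hub model id or a local path, and the wav path shown is a placeholder.

```python
# Sketch: export a pretrained model to ONNX, then run it with the funasr_onnx runtime.
# Assumes `pip install -U funasr funasr-onnx`; ids and paths are illustrative.
from funasr import AutoModel

# Export writes ONNX files under the model's local directory.
model = AutoModel(model="paraformer")
model.export(quantize=False)

# Inference with the ONNX runtime wrapper (hub id or local exported path).
from funasr_onnx import Paraformer

onnx_model = Paraformer(
    "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    batch_size=1,
    quantize=False,
)
result = onnx_model(["audio/example.wav"])  # list of local wav paths (placeholder)
print(result)
```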
Deployment & runtime
FunASR ships runtime modules and deployment docs for file transcription services (Mandarin/English, CPU/GPU variants) and real-time transcription services. The runtime has been incrementally improved with memory-leak fixes, ARM64 Docker images, dynamic batching, and GPU acceleration. Tools for hotword support, sentence-level timestamps, and automatic thread configuration are provided to ease production deployment.
Community, license & citation
The project is MIT-licensed and includes contributions from Alibaba DAMO Academy and multiple academic/industrial partners. Pretrained models may carry their own model license terms. The repo includes citation entries for the FunASR Interspeech paper and related works (Paraformer, BAT, SeACo-Paraformer). Community support is via GitHub issues and communication groups linked from the repo.
Who should use it
Researchers and engineers building ASR systems, speech analytics, meeting transcription, voice assistants, and other speech-enabled products can use FunASR to prototype, fine-tune on industrial data, and deploy production services with pretrained models and runtime components.
