WhisperLiveKit: Ultra-Low-Latency Speech-to-Text with Speaker Identification
WhisperLiveKit (WLK) is an advanced, open-source toolkit designed for real-time, self-hosted speech-to-text (STT) transcription. It addresses the limitations of traditional Whisper models, which are optimized for complete utterances rather than streaming audio, by integrating state-of-the-art techniques from simultaneous speech processing research. This allows for ultra-low latency transcription without sacrificing accuracy, making it suitable for live applications like virtual meetings, live captioning, and interactive voice interfaces.
Core Features
- Real-Time Transcription: Uses SimulStreaming (SOTA 2025) with the AlignAtt policy for ultra-low-latency decoding, or WhisperStreaming (SOTA 2023) with the LocalAgreement policy. Words appear almost instantly as you speak, avoiding the delays common in batch-processed audio; a toy sketch of the LocalAgreement idea follows this list.
- Multilingual Support: Built on NLLW (2025), derived from NLLB (2022/2024), it supports simultaneous translation to and from over 200 languages. Users can specify source and target languages for seamless cross-lingual transcription.
- Speaker Diarization: Integrates Streaming Sortformer (SOTA 2025) or Diart (SOTA 2021) for real-time speaker identification, distinguishing multiple speakers in conversations. This is crucial for applications like meeting summaries or call center analytics.
- Voice Activity Detection (VAD): Powered by Silero VAD (2024), it efficiently detects speech segments, reducing computational overhead during silence and enabling multi-user concurrency.
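To make the streaming policies concrete, here is a toy illustration of the LocalAgreement idea: only the prefix on which two consecutive hypotheses agree is committed and displayed, while the still-changing tail stays in the buffer. This is an explanatory sketch, not WhisperLiveKit's actual implementation.

```python
def local_agreement(prev_hypothesis: list[str], curr_hypothesis: list[str]) -> list[str]:
    """Return the longest common prefix of two consecutive hypotheses.

    LocalAgreement-style policies commit only this stable prefix; the
    remaining, still-changing words stay in the buffer and are re-decoded
    together with the next audio chunk.
    """
    committed = []
    for prev_word, curr_word in zip(prev_hypothesis, curr_hypothesis):
        if prev_word != curr_word:
            break
        committed.append(curr_word)
    return committed


# Hypotheses decoded from two overlapping audio windows (toy data):
h1 = "the quick brown fox jumps over the".split()
h2 = "the quick brown fox jumped over the lazy dog".split()

print(local_agreement(h1, h2))  # ['the', 'quick', 'brown', 'fox'] -> safe to show the user
```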
Architecture Overview
The system features a modular backend that supports multiple concurrent users. Audio input passes through an AudioProcessor, which handles buffering and feeds a TranscriptionEngine. The engine selects among backends such as Faster-Whisper, MLX-Whisper (for Apple Silicon), or vanilla Whisper, depending on the available hardware. VAD filters out non-speech, while policies like AlignAtt manage incremental decoding to prevent mid-word cutoffs.
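As a mental model, the flow from audio chunk to partial transcript looks roughly like the sketch below. The class names mirror the description above, but the real AudioProcessor and TranscriptionEngine in WhisperLiveKit have different constructors and methods, so treat every signature here as an illustrative assumption.

```python
import asyncio


class AudioProcessor:
    """Illustrative stand-in: buffers raw PCM chunks arriving from one client."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue[bytes] = asyncio.Queue()

    async def put(self, chunk: bytes) -> None:
        await self.queue.put(chunk)

    async def get(self) -> bytes:
        return await self.queue.get()


class TranscriptionEngine:
    """Illustrative stand-in: VAD gate plus incremental decoder."""

    def __init__(self) -> None:
        self.buffer = bytearray()

    def is_speech(self, chunk: bytes) -> bool:
        # Real deployments use Silero VAD here; this placeholder treats any
        # non-silent (non-zero) chunk as speech.
        return any(chunk)

    def decode_incremental(self) -> str:
        # Real backends: Faster-Whisper, MLX-Whisper, or vanilla Whisper,
        # driven by a policy such as AlignAtt or LocalAgreement.
        return f"<partial transcript over {len(self.buffer)} buffered bytes>"

    def process(self, chunk: bytes) -> str | None:
        if not self.is_speech(chunk):
            return None                     # skip silence cheaply
        self.buffer.extend(chunk)
        return self.decode_incremental()


async def run_session(processor: AudioProcessor, engine: TranscriptionEngine) -> None:
    while True:
        chunk = await processor.get()
        if not chunk:                       # empty chunk marks end of stream
            break
        partial = engine.process(chunk)
        if partial is not None:
            print(partial)                  # in practice: push to the client over WebSocket


async def demo() -> None:
    processor, engine = AudioProcessor(), TranscriptionEngine()
    await processor.put(b"\x01" * 3200)     # ~100 ms of fake 16 kHz, 16-bit PCM
    await processor.put(b"")                # end-of-stream marker
    await run_session(processor, engine)


asyncio.run(demo())
```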
Key components include:
- Frontend: A simple WebSocket-based interface for browser integration, plus a Chrome extension for capturing web audio; a minimal client sketch follows this list.
- Backend Policies: Choose between SimulStreaming (default, ultra-fast) and LocalAgreement for different latency-accuracy trade-offs.
- Optional Enhancements: LoRA adapters for custom models, CIF for word boundaries, and direct English translation modes.
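For programmatic use, a client can stream audio over the same WebSocket interface the browser frontend uses. A minimal sketch with the `websockets` package is below; the endpoint path (`/asr`), the raw-PCM input format, and the JSON reply format are assumptions for illustration, so check the project documentation for the actual protocol.

```python
import asyncio
import json

import websockets  # pip install websockets

ASR_URL = "ws://localhost:8000/asr"  # assumed endpoint; verify against the docs


async def stream_pcm_file(path: str, chunk_ms: int = 100) -> None:
    """Send raw 16 kHz, 16-bit mono PCM to the server and print its replies."""
    bytes_per_chunk = 16_000 * 2 * chunk_ms // 1000  # sample rate * sample width * duration
    async with websockets.connect(ASR_URL) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(bytes_per_chunk):
                await ws.send(chunk)
                await asyncio.sleep(chunk_ms / 1000)  # pace the stream like a live microphone
        # Drain whatever transcript messages the server has queued up.
        try:
            while True:
                message = await asyncio.wait_for(ws.recv(), timeout=2.0)
                print(json.loads(message))
        except asyncio.TimeoutError:
            pass


asyncio.run(stream_pcm_file("meeting.pcm"))  # hypothetical input file
```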
Installation and Usage
Getting started is straightforward:
- Install via pip: `pip install whisperlivekit`
- Launch the server: `wlk --model base --language en`
- Open `http://localhost:8000` in your browser and start speaking.
For advanced setups:
- Enable diarization: `--diarization`
- Translate: `--language fr --target-language da`
- Deploy with Docker for GPU/CPU support, or use Gunicorn for production scaling.
Customization options abound, from model selection (tiny to large-v3) to backend tweaks like beam search or frame thresholds.
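To give a sense of what those backend knobs control, the snippet below drives the Faster-Whisper backend directly through its own public API (outside of WhisperLiveKit) with an explicit model size and beam width; these are the same kinds of parameters the server flags expose.

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

# Model size trades accuracy for speed and memory: "tiny" ... "large-v3".
model = WhisperModel("base", device="auto", compute_type="int8")

# beam_size is a typical decode-time knob: larger beams are more accurate but slower.
segments, info = model.transcribe("meeting.wav", beam_size=5, vad_filter=True)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:6.2f}s -> {segment.end:6.2f}s] {segment.text}")
```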
Why Choose WhisperLiveKit?
Unlike naive Whisper implementations that process audio in fixed batches, which loses context and produces errors at chunk boundaries, WhisperLiveKit employs intelligent policies that buffer just enough audio to maintain coherence. It is self-hosted for privacy, supports HTTPS, and scales via ASGI servers. With 9k+ GitHub stars, it is a robust choice for developers building AI-powered audio applications.
Use Cases
- Accessibility: Real-time captions for the hearing impaired.
- Productivity: Automatic meeting transcripts with speaker labels.
- Content Creation: Live podcast or video subtitling.
- Customer Service: Transcribe and analyze support calls.
The project is licensed under Apache 2.0 and actively maintained, with ongoing improvements for even lower latency and broader language support.
