
faster-whisper

faster-whisper is an open-source reimplementation of OpenAI's Whisper model by SYSTRAN that uses CTranslate2 for fast, memory-efficient inference. It offers faster transcription on CPU and GPU, supports 8-bit quantization, batched transcription, word-level timestamps, and VAD filtering, and provides tools to convert Transformers/Whisper checkpoints into CTranslate2 models.

Introduction

Overview

faster-whisper is an open-source project from SYSTRAN that reimplements OpenAI's Whisper speech-to-text model using CTranslate2, a high-performance inference engine for Transformer models. The main goal is to provide the same transcription accuracy as Whisper while being significantly faster and more memory-efficient, particularly when using quantization and batched inference.

Key features
  • High-performance inference: built on CTranslate2 to deliver much faster transcription times compared to the reference openai/whisper implementation, especially on CPU or when using quantized models.
  • Quantization support: supports int8 and float16 computation types to reduce memory usage and improve speed on both CPU and GPU where supported.
  • Batched transcription: allows processing multiple audio inputs in batches to maximize throughput for server or large-scale workloads.
  • Word-level timestamps: optional generation of word-level start/end timestamps for finer-grained alignment needs.
  • VAD filtering: integrates Silero VAD to remove long silences and reduce unnecessary processing; parameters are customizable. Both options are illustrated in the sketch after this list.
  • Model conversion utilities: scripts and APIs to convert Whisper/Transformers checkpoints into CTranslate2 format for faster loading and inference.
  • Compatibility with distilled Whisper: supports distil-whisper checkpoints (e.g., distil-large-v3) and is optimized to work with distil variants.
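
As a hedged sketch of the word-timestamp and VAD options above (parameter names follow the project README; the model size and audio path are placeholders):

from faster_whisper import WhisperModel

# int8-quantized CPU inference; "small" and "audio.mp3" are placeholder choices
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    "audio.mp3",
    word_timestamps=True,                              # per-word start/end times
    vad_filter=True,                                   # drop long silences with Silero VAD
    vad_parameters=dict(min_silence_duration_ms=500),  # tune VAD behaviour
)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")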

Performance and benchmarks

The project provides benchmark comparisons showing large speedups over openai/whisper and competitive results versus other implementations (whisper.cpp, Transformers-based inference). Benchmarks cover GPU and CPU scenarios, different model sizes, and show the impact of batching and int8 quantization on time and memory/VRAM usage.

Requirements and environment notes
  • Python 3.9+
  • Uses PyAV for audio decoding (so system-level FFmpeg is not required).
  • For GPU execution, appropriate NVIDIA CUDA and cuDNN libraries are required. The repository documents supported CUDA/cuDNN combinations and workarounds (including recommended ctranslate2 versions for older CUDA/cuDNN stacks).
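
A quick way to check whether the CUDA/cuDNN stack is usable is to ask CTranslate2 directly (a minimal sketch; get_cuda_device_count is part of the ctranslate2 Python package):

import ctranslate2

# 0 means no usable CUDA device is visible and inference will fall back to CPU
print("CUDA devices visible to CTranslate2:", ctranslate2.get_cuda_device_count())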

Installation

Install from PyPI:

pip install faster-whisper

Alternative install methods include installing from the master branch or a specific commit by pointing pip at the corresponding GitHub archive URL.
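
For example, a direct-from-GitHub install might look like this (URL pattern assumed from standard pip usage; see the README for the exact command):

pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"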

Usage

Typical usage involves creating a WhisperModel instance and calling transcribe. Examples in the repository show running on GPU with float16 or int8, quantized CPU inference, batched transcription via a BatchedInferencePipeline helper, and how to enable word timestamps or VAD filtering. transcribe returns a generator of segments together with transcription info, so the audio is only processed as the segments are iterated over; a sketch follows below.
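
A minimal sketch based on the README examples (model size, beam size, batch size, and the audio file name are illustrative):

from faster_whisper import WhisperModel, BatchedInferencePipeline

# GPU with float16; swap to device="cpu", compute_type="int8" for quantized CPU inference
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() returns (segment generator, info); nothing is transcribed
# until the generator is consumed
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language:", info.language, "probability:", info.language_probability)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

# Batched transcription for higher throughput
batched = BatchedInferencePipeline(model=model)
segments, info = batched.transcribe("audio.mp3", batch_size=16)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")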

Model conversion and loading

faster-whisper can automatically download pre-converted CTranslate2 models from the Hugging Face Hub (Systran-hosted) when given a model size string (e.g., "large-v3"). It also provides a converter script to transform Transformers/Whisper checkpoints into CTranslate2 format, with options for quantization and copying required tokenizer/preprocessor files.
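
A conversion sketch along the lines of the README's converter usage (model name, output directory, and quantization choice are illustrative):

ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json preprocessor_config.json --quantization float16

The resulting directory can then be passed to WhisperModel in place of a model size string, e.g. WhisperModel("whisper-large-v3-ct2").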

Integrations and ecosystem

The README lists several community projects and integrations that use faster-whisper as a backend, including CLI clients, transcription servers, diarization tools, near-live and streaming solutions, and GUI applications. This makes faster-whisper suitable both as a component in research/prototyping and as part of production transcription stacks.

When to choose faster-whisper
  • You need faster inference than the original Whisper implementation, especially on CPU or resource-limited GPUs.
  • You want low-memory/low-VRAM inference through quantization (int8/float16).
  • You require batched processing for high-throughput transcription services.
  • You want a community-backed open-source solution with conversion tools for existing Whisper/Transformers checkpoints.

Caveats
  • GPU execution requires compatible NVIDIA libraries; users must ensure correct CUDA/cuDNN versions or use recommended Docker images.
  • Benchmarks can vary depending on hardware, driver/library versions, and exact model/checkpoint used; the repository provides guidance to make fair comparisons (same beam size, thread settings, etc.).

For code examples and full API details, see the project README and the transcribe module in the repository.

Information

  • Website: github.com
  • Authors: SYSTRAN
  • Published date: 2023/02/11
