Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

The paper bets that a single neural network, scaled with HPC techniques, could transcribe both English and Mandarin without hand-built pipelines, reaching human-competitive accuracy by training fast enough to iterate on architecture in days rather than weeks.

Introduction

The quiet thesis of this paper is that speech recognition had become an engineering problem, not a modeling one. Baidu's team argued that if you could train a single end-to-end network fast enough, you no longer needed the decades of hand-tuned phonetic features, pronunciation lexicons, and language-specific front-ends — the same architecture could learn English and Mandarin, two languages that share almost nothing acoustically, just by being fed enough labeled audio and enough compute.

Key Findings

One model, two unrelated languages. Swapping the training data was nearly enough to retarget the system from English to Mandarin, evidence that the hand-engineered, language-specific pipeline was largely incidental rather than essential.
Speed is a research multiplier. HPC techniques delivered a 7x training speedup, turning week-long experiments into day-long ones — the paper's real claim is that this iteration velocity, not any single trick, is what produced the accuracy gains.
Batch Normalization and SortaGrad made deep RNNs trainable. Applying BatchNorm to recurrent nets and, during the first epoch, presenting utterances from shortest to longest (SortaGrad, a curriculum-learning strategy) stabilized otherwise fragile deep-RNN training.
Deployment was treated as a first-class problem. Batch Dispatch batched incoming requests on GPUs in production, showing end-to-end models could serve users at low latency and reasonable cost, not just win benchmarks.
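
The SortaGrad ordering mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each utterance is a hypothetical (id, duration) pair, builds minibatches once from a length-sorted list, visits them shortest-first in the first epoch, and shuffles the minibatch order in later epochs. The function name `sortagrad_epoch` and its parameters are invented for this example.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical record type for illustration: (utterance id, duration in seconds).
Utterance = Tuple[str, float]


def sortagrad_epoch(
    utterances: Sequence[Utterance],
    batch_size: int,
    epoch: int,
    seed: int = 0,
) -> List[List[Utterance]]:
    """Return the minibatches for one epoch under a SortaGrad-style schedule.

    Assumption: epoch 0 is the "curriculum" epoch (shortest batches first);
    every later epoch visits the same minibatches in a random order.
    """
    # Sort by duration so each minibatch holds utterances of similar length.
    ordered = sorted(utterances, key=lambda u: u[1])
    batches = [
        list(ordered[i:i + batch_size])
        for i in range(0, len(ordered), batch_size)
    ]
    if epoch == 0:
        # First epoch: keep the increasing-length order.
        return batches
    # Later epochs: shuffle the order of the minibatches, not their contents.
    rng = random.Random(seed + epoch)
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    data = [("a", 3.0), ("b", 1.0), ("c", 2.0), ("d", 5.0), ("e", 4.0)]
    for epoch in range(2):
        print(epoch, sortagrad_epoch(data, batch_size=2, epoch=epoch))
```

Keeping similar-length utterances together in a minibatch also reduces padding, which is one reason this sketch sorts before batching in every epoch rather than only the first.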

Who It's For / When to Skip

Great fit if you want a clear case study in how compute scale and systems engineering, not just architecture, drive ASR progress, or if you care about the practical leap from research model to deployed service. Look elsewhere if you want current speech tech: this predates Transformer-based ASR and the Conformer, self-supervised approaches like wav2vec 2.0, and large-scale weakly supervised models like Whisper. These replaced its recurrent stack, and self-supervised pretraining in particular removed much of the dependence on large labeled corpora.

Information

Website: ar5iv.labs.arxiv.org
Organizations: Baidu Research
Authors: Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos …
Published date: 2015/12/08
