Agents that must track context, react proactively, and manage multi-turn spoken interaction increasingly need streaming audio rather than isolated clips. StreamAudio-2M assembles millions of short audio clips into coherent multi-turn streams and task-specific subsets, so models can be trained on both local turn-level signals and longer scene- and stream-level dynamics.
What Sets It Apart
- Stream-centric schema: each example is a "stream" (sequence of turns) with unified metadata and per-turn audio paths, enabling experiments on context carryover, turn segmentation, and agent memory. This contrasts with datasets that only provide isolated utterances.
- Multi-task composition: six curated subsets target complementary capabilities (real-time ASR, EN→ZH speech translation, audio understanding with captions and QA, multi-round voice chatting, proactive response behaviors, and environment-aware montages), which allows joint or staged training strategies.
- Practical delivery and stats: audio is packaged as uncompressed tar shards to keep downloads and I/O scalable; each row includes detailed audio_stats (duration, sample_rate, rms_db, peak_db, zero-crossing_rate, etc.), which supports curation and filtering before training.
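As a rough illustration of how the per-turn audio_stats could drive filtering, here is a minimal Python sketch. The field names duration, rms_db, and peak_db come from the description above; everything else is an assumption made for the example: the "turns" and "audio_stats" keys, the units (seconds and dBFS), and the threshold values are hypothetical and should be checked against the actual schema.

```python
from typing import Any, Dict


def turn_is_usable(
    stats: Dict[str, Any],
    *,
    min_duration: float = 0.5,   # assumed unit: seconds
    max_duration: float = 30.0,  # assumed unit: seconds
    min_rms_db: float = -45.0,   # assumed unit: dBFS; drops near-silent clips
    max_peak_db: float = -0.1,   # assumed unit: dBFS; drops likely clipped audio
) -> bool:
    """Return True if one turn's audio_stats pass simple quality thresholds."""
    return (
        min_duration <= stats["duration"] <= max_duration
        and stats["rms_db"] >= min_rms_db
        and stats["peak_db"] <= max_peak_db
    )


def stream_is_usable(stream: Dict[str, Any], **thresholds: float) -> bool:
    """Keep a stream only if every turn passes.

    Dropping single turns would break the multi-turn context, so this
    sketch accepts or rejects the stream as a whole.
    The "turns" / "audio_stats" layout is a hypothetical schema.
    """
    return all(
        turn_is_usable(turn["audio_stats"], **thresholds)
        for turn in stream["turns"]
    )
```

Rejecting whole streams is one possible design choice; for turn-level tasks such as ASR, filtering individual turns with turn_is_usable alone may be more appropriate.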
Who It's For and Tradeoffs
It is a good fit if you are training or evaluating audio-capable LLMs, multimodal agents, or voice-chat systems that need (a) realistic multi-turn context, (b) mixed tasks for multi-capability models, or (c) corpora with per-clip audio diagnostics for filtering. Look elsewhere if you need single-source, high-fidelity studio recordings (this is an aggregated, multi-source corpus) or language coverage beyond English and Chinese.
Where It Fits
Use it for pretraining or multi-task fine-tuning when you require streaming/temporal context (e.g., turn-level system prompts, voice agents that must react mid-stream, or continuous ASR pipelines). It complements isolated-utterance corpora (CommonVoice, LibriSpeech) when the research question involves interaction, continuity, or proactive behaviors.
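For multi-task fine-tuning across the subsets, one common approach is to sample the source subset of each batch from a weighted mixture. The sketch below is only an illustration of that idea, not a recipe from the dataset authors; the function name, the subset keys, and the weights are hypothetical placeholders.

```python
import random
from typing import Dict, Iterator


def subset_schedule(
    weights: Dict[str, float], num_steps: int, seed: int = 0
) -> Iterator[str]:
    """Yield one subset name per training step, drawn by relative weight.

    A weight of 0 excludes a subset, so a staged curriculum can be built
    by calling this once per stage with different weights. At least one
    weight must be positive, otherwise random.choices raises ValueError.
    """
    rng = random.Random(seed)  # seeded for a reproducible schedule
    names = list(weights)
    probs = [weights[name] for name in names]
    for _ in range(num_steps):
        yield rng.choices(names, weights=probs, k=1)[0]


# Hypothetical subset keys; the real subset names may differ.
JOINT_WEIGHTS = {
    "asr": 1.0,
    "translation": 1.0,
    "understanding": 1.0,
    "chat": 1.0,
    "proactive": 1.0,
    "montage": 1.0,
}
```

Joint training would pass all six weights as nonzero, while a staged run might start with only the ASR weight set and enable the interaction-heavy subsets in a later call.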
Practical Notes
- Licensing: CC-BY-4.0 — suitable for research and many commercial uses with attribution.
- Size & access: audio is provided as tar shards; to reconstruct the dataset, extract all shards into an audio/ tree.
- Data quality: the corpus mixes multiple sources (AudioSet, CommonVoice, GigaSpeech, CoVoST2, etc.), and noise levels and content balance vary by subset, so inspect audio_stats and subset provenance before large-scale training.
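The shard extraction step mentioned above can be sketched with Python's standard library. This is a minimal example under stated assumptions: the function name, the `*.tar` filename pattern, and the shard directory are hypothetical, and the archives are assumed to be plain uncompressed tar files whose member paths are relative to the audio/ root.

```python
import shutil
import tarfile
from pathlib import Path


def extract_shards(shard_dir, out_dir="audio", pattern="*.tar"):
    """Extract every tar shard in shard_dir into one out_dir tree.

    Returns the number of regular files written. Members whose resolved
    path would land outside out_dir are skipped as a safety measure.
    """
    out_root = Path(out_dir).resolve()
    out_root.mkdir(parents=True, exist_ok=True)
    extracted = 0
    for shard in sorted(Path(shard_dir).glob(pattern)):
        # Mode "r:" opens uncompressed tar archives only.
        with tarfile.open(shard, "r:") as tar:
            for member in tar:
                if not member.isfile():
                    continue  # skip directories, links, and devices
                target = (out_root / member.name).resolve()
                if out_root not in target.parents:
                    continue  # path escapes the output tree
                target.parent.mkdir(parents=True, exist_ok=True)
                source = tar.extractfile(member)
                with source, open(target, "wb") as sink:
                    shutil.copyfileobj(source, sink)
                extracted += 1
    return extracted
```

Calling `extract_shards("shards", "audio")` would then populate the audio/ tree from every shard found in a hypothetical shards/ directory.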
