Spoken conversational systems often fail at timing: knowing when to keep listening, when to interject, and how to handle overlapping speech. SmoothConv supplies multi-channel Chinese conversation data with human-curated turn-taking and paralinguistic labels, so models can learn realistic timing and interaction dynamics rather than relying on single-channel or scripted audio.
What Sets It Apart
- Multi-channel, naturally occurring dialogs: captures genuine overlaps, backchannels and interruptions across tutoring and social chat domains, not read or scripted speech. This preserves real timing cues necessary for turn-taking models and full‑duplex systems.
- Expert manual annotations: per-segment JSON records include start/end times, channel index, speaker IDs, turn labels (complete/incomplete/backchannel/wait) and rich paralinguistic attributes, enabling supervised training and fine-grained evaluation.
- Compact benchmark footprint for supervised work: ~100.5 hours and 2,503 audio files provide a high-quality labeled benchmark complementary to much larger, automatically annotated corpora for Speech LLM pretraining.
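As a rough illustration of how per-segment records like those described above might be used, the sketch below parses a few JSON segments, counts turn labels, and finds spans where speech on two different channels overlaps. The field names (`start`, `end`, `channel`, `speaker`, `turn_label`) and the sample values are assumptions made for this example, not the dataset's documented schema; only the four turn labels come from the description above.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical records: field names and values are illustrative assumptions,
# not SmoothConv's actual annotation schema.
SAMPLE = """
[
  {"start": 0.0, "end": 2.4, "channel": 0, "speaker": "A", "turn_label": "complete"},
  {"start": 2.1, "end": 2.5, "channel": 1, "speaker": "B", "turn_label": "backchannel"},
  {"start": 2.6, "end": 4.0, "channel": 1, "speaker": "B", "turn_label": "incomplete"},
  {"start": 4.5, "end": 6.0, "channel": 0, "speaker": "A", "turn_label": "complete"}
]
"""


def load_segments(text):
    """Parse a JSON array of segment records and sort them by start time."""
    return sorted(json.loads(text), key=lambda seg: seg["start"])


def label_counts(segments):
    """Count how often each turn label occurs."""
    return Counter(seg["turn_label"] for seg in segments)


def cross_channel_overlaps(segments):
    """Return (start, end, speaker_a, speaker_b) for every pair of segments
    on different channels whose time spans intersect."""
    overlaps = []
    for a, b in combinations(segments, 2):
        if a["channel"] == b["channel"]:
            continue
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if end > start:
            overlaps.append((start, end, a["speaker"], b["speaker"]))
    return overlaps


segments = load_segments(SAMPLE)
counts = label_counts(segments)
overlaps = cross_channel_overlaps(segments)
```

The pairwise comparison is quadratic in the number of segments, which is acceptable for a single short recording; a sweep over sorted start times would be the usual replacement for longer files.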
Who It's For and Trade-offs
Great fit if you need gold-standard labeled conversational speech for turn-taking detection, overlap/interruption research, or building and evaluating full‑duplex spoken dialogue components. It is especially useful for supervised experiments and benchmarking where annotation fidelity matters. Look elsewhere if you need massive unlabeled scale for self-supervised pretraining (use the companion DuplexConv for that) or if you need commercial-use rights, which the CC BY‑NC 4.0 license does not grant.
Where It Fits
Use SmoothConv as the curated supervised set to validate models trained on large-scale auto-labeled corpora: it works well for error analysis, ablation studies on timing cues, and as a testbed for audio+language multimodal turn-taking models.
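For the error-analysis use case, one simple starting point is per-label precision and recall of a model's predicted turn labels against the manual annotations. The sketch below is a minimal version of that idea, assuming gold and predicted labels are already aligned one-to-one per segment; the function name `per_label_scores` and the example label sequences are hypothetical, and only the four label names come from the dataset description.

```python
from collections import defaultdict

# Turn labels as listed in the dataset description.
LABELS = ("complete", "incomplete", "backchannel", "wait")


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall)} for aligned label sequences.

    A label with no predictions (or no gold instances) gets 0.0 for the
    corresponding score instead of raising a division error.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    scores = {}
    for label in LABELS:
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up sequences for illustration only.
gold = ["complete", "incomplete", "backchannel", "wait", "complete"]
predicted = ["complete", "complete", "backchannel", "wait", "incomplete"]
scores = per_label_scores(gold, predicted)
```

Breaking scores out by label rather than reporting a single accuracy makes it easier to see whether a model confuses, say, incomplete turns with complete ones, which is the kind of timing error the annotations are meant to expose.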
