SmoothConv

Provides ~100 hours of expert-annotated, multi-channel Chinese conversational speech with per-segment timestamps, speaker IDs and paralinguistic labels for turn-taking, overlap/interruption detection and full‑duplex dialogue research. Licensed for academic/non-commercial use (CC BY‑NC 4.0).

Introduction

Spoken dialogue systems often fail at timing: knowing when to keep listening, when to interject, and how to handle overlapping speech. SmoothConv supplies high-quality, multi-channel Chinese conversation data with human-curated turn-taking and paralinguistic labels, so models can learn realistic timing and interaction dynamics rather than relying on single-channel or scripted audio.

What Sets It Apart
  • Multi-channel, naturally occurring dialogs: captures genuine overlaps, backchannels and interruptions across tutoring and social chat domains, not read or scripted speech. This preserves real timing cues necessary for turn-taking models and full‑duplex systems.
  • Expert manual annotations: per-segment JSON records include start/end times, channel index, speaker IDs, turn labels (complete/incomplete/backchannel/wait) and rich paralinguistic attributes, enabling supervised training and fine-grained evaluation.
  • Compact benchmark footprint for supervised work: ~100.5 hours and 2,503 audio files provide a high-quality labeled benchmark complementary to much larger, automatically annotated corpora for Speech LLM pretraining.
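
The per-segment records described above lend themselves to simple overlap analysis. The following is a minimal sketch, not the dataset's official loader: the JSON key names (`start`, `end`, `channel`, `speaker`, `turn_label`) and the sample values are assumptions made for illustration, and the real schema may differ. It shows how cross-channel segments that overlap in time could be found once each record carries start/end times and a channel index.

```python
import json
from dataclasses import dataclass
from itertools import combinations

# Hypothetical records: key names and values are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE = """[
  {"start": 0.00, "end": 2.40, "channel": 0, "speaker": "S1", "turn_label": "complete"},
  {"start": 2.10, "end": 2.60, "channel": 1, "speaker": "S2", "turn_label": "backchannel"},
  {"start": 3.00, "end": 5.20, "channel": 1, "speaker": "S2", "turn_label": "incomplete"},
  {"start": 4.70, "end": 6.00, "channel": 0, "speaker": "S1", "turn_label": "complete"}
]"""


@dataclass(frozen=True)
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    channel: int      # audio channel index
    speaker: str      # speaker ID
    turn_label: str   # complete / incomplete / backchannel / wait


def load_segments(text: str) -> list[Segment]:
    """Parse a JSON array of segment records into Segment objects."""
    return [Segment(**record) for record in json.loads(text)]


def find_overlaps(segments: list[Segment]) -> list[tuple[Segment, Segment, float]]:
    """Return (earlier, later, overlap_seconds) for cross-channel pairs that overlap."""
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = []
    for a, b in combinations(ordered, 2):
        if a.channel == b.channel:
            continue  # only speech on different channels counts as overlap here
        duration = min(a.end, b.end) - max(a.start, b.start)
        if duration > 0:
            overlaps.append((a, b, round(duration, 3)))
    return overlaps


segments = load_segments(SAMPLE)
overlaps = find_overlaps(segments)
for a, b, duration in overlaps:
    print(f"{b.speaker} ({b.turn_label}) overlaps {a.speaker} for {duration:.2f}s")
```

The pairwise comparison is quadratic in the number of segments, which is acceptable for a single conversation; a sweep over sorted start times would scale better for long recordings.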

Who It's For and Trade-offs

Great fit if you need gold-standard labeled conversational speech for turn-taking detection, overlap/interruption research, or building/evaluating full‑duplex spoken dialogue components. It is especially useful for supervised experiments and benchmarking where annotation fidelity matters. Look elsewhere if you need massive unlabeled scale for self-supervised pretraining (use the companion DuplexConv for that) or require commercial licensing beyond CC BY‑NC 4.0.

Where It Fits

Use SmoothConv as the curated supervised set to validate models trained on large-scale auto-labeled corpora: it works well for error analysis, ablation studies on timing cues, and as a testbed for audio+language multimodal turn-taking models.
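
For the error-analysis use described above, one possible starting point is a per-label breakdown of a model's turn-label predictions against the human annotations. The sketch below assumes predictions and gold labels have already been aligned segment by segment. The function name `per_label_scores` and the toy label sequences are hypothetical. Only the four label names come from the dataset description.

```python
from collections import Counter

# Turn labels named in the dataset description.
LABELS = ("complete", "incomplete", "backchannel", "wait")


def per_label_scores(gold: list[str], pred: list[str]) -> dict[str, tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for aligned gold/predicted labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    scores = {}
    for label in LABELS:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy sequences for illustration only; not real model output.
gold = ["complete", "incomplete", "backchannel", "wait", "complete", "backchannel"]
pred = ["complete", "complete", "backchannel", "wait", "complete", "incomplete"]

scores = per_label_scores(gold, pred)
for label, (precision, recall, f1) in scores.items():
    print(f"{label:12s} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Reporting scores per label rather than a single accuracy figure makes it easier to see whether a model confuses, for example, incomplete turns with complete ones.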

Information

  • Website: huggingface.co
  • Organizations: ASLP@NPU, QualiaLabs
  • Published date: 2026/05/28
