AI Dataset 2026

SmoothConv

Provides ~100 hours of expert-annotated, multi-channel Chinese conversational speech with per-segment timestamps, speaker IDs and paralinguistic labels for turn-taking, overlap/interruption detection and full‑duplex dialogue research. Licensed for academic/non-commercial use (CC BY‑NC 4.0).

Introduction

Spoken conversational systems often fail at timing: knowing when to keep listening, when to interject, and how to handle overlaps. SmoothConv supplies high-quality, multi-channel Chinese conversation data with human-curated turn-taking and paralinguistic labels, so models can learn realistic timing and interaction dynamics instead of relying on single-channel or scripted audio.

What Sets It Apart

Multi-channel, naturally occurring dialogs: captures genuine overlaps, backchannels and interruptions across tutoring and social chat domains, not read or scripted speech. This preserves real timing cues necessary for turn-taking models and full‑duplex systems.
Expert manual annotations: per-segment JSON records include start/end times, channel index, speaker IDs, turn labels (complete/incomplete/backchannel/wait) and rich paralinguistic attributes, enabling supervised training and fine-grained evaluation.
Compact benchmark footprint for supervised work: ~100.5 hours and 2,503 audio files provide a high-quality labeled benchmark complementary to much larger, automatically annotated corpora for Speech LLM pretraining.
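The per-segment records described above lend themselves to simple overlap analysis. The following is a minimal sketch, not the dataset's official loader: the field names (`start`, `end`, `channel`, `speaker`, `turn_label`) and the example values are assumptions made for illustration, and the actual SmoothConv JSON schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical field names; check the real SmoothConv schema before use.
    start: float      # segment start time, in seconds
    end: float        # segment end time, in seconds
    channel: int      # index of the audio channel
    speaker: str      # speaker ID
    turn_label: str   # complete / incomplete / backchannel / wait


def parse_segments(raw: str) -> list:
    """Parse a JSON array of segment records into Segment objects."""
    return [
        Segment(r["start"], r["end"], r["channel"], r["speaker"], r["turn_label"])
        for r in json.loads(raw)
    ]


def find_overlaps(segments):
    """Return (earlier, later, overlap_seconds) for overlapping segments
    spoken by different speakers."""
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so nothing later can overlap a
            if a.speaker != b.speaker:
                overlaps.append((a, b, min(a.end, b.end) - b.start))
    return overlaps


# Made-up example records, for illustration only.
EXAMPLE = """[
  {"start": 0.0, "end": 2.5, "channel": 0, "speaker": "A", "turn_label": "complete"},
  {"start": 2.0, "end": 2.4, "channel": 1, "speaker": "B", "turn_label": "backchannel"},
  {"start": 3.0, "end": 4.0, "channel": 1, "speaker": "B", "turn_label": "incomplete"}
]"""

segments = parse_segments(EXAMPLE)
overlaps = find_overlaps(segments)
for a, b, seconds in overlaps:
    print(f"{b.speaker} ({b.turn_label}) overlaps {a.speaker} for {seconds:.2f} s")
```

In this made-up example, speaker B's backchannel falls entirely inside speaker A's turn, which is the kind of event that single-channel or scripted audio tends to lose.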

Who It's For and Trade-offs

Great fit if you need gold-standard labeled conversational speech for turn-taking detection, overlap/interruption research, or building/evaluating full‑duplex spoken dialogue components. It is especially useful for supervised experiments and benchmarking where annotation fidelity matters. Look elsewhere if you need massive unlabeled scale for self-supervised pretraining (use the companion DuplexConv for that) or require commercial licensing beyond CC BY‑NC 4.0.

Where It Fits

Use SmoothConv as the curated supervised set to validate models trained on large-scale auto-labeled corpora: it works well for error analysis, ablation studies on timing cues, and as a testbed for audio+language multimodal turn-taking models.
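As one possible shape for such validation, the sketch below scores predicted turn labels against gold labels, per label. It assumes the four turn labels listed earlier and uses invented gold and predicted sequences; it is not an official SmoothConv evaluation protocol.

```python
from collections import Counter

# Turn labels as listed in the dataset description.
LABELS = ("complete", "incomplete", "backchannel", "wait")


def per_label_scores(gold, pred):
    """Return {label: (precision, recall)} for aligned gold/predicted labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in LABELS:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented labels, for illustration only.
gold = ["complete", "backchannel", "wait", "incomplete", "complete"]
pred = ["complete", "complete", "wait", "incomplete", "incomplete"]

scores = per_label_scores(gold, pred)
for label, (precision, recall) in scores.items():
    print(f"{label:12s} precision={precision:.2f} recall={recall:.2f}")
```

Per-label scores are a natural starting point for error analysis here, since rare labels such as backchannels can be missed entirely while overall accuracy still looks acceptable.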

Information

Website: huggingface.co
Organizations: ASLP@NPU, QualiaLabs
Published date: 2026/05/28

More Items

Computer Vision Papers 2026

CLBench-V: Evaluating Multimodal Context Learning from Grounding to Knowledge Acquisition

Lai Wei, Chengqi Li +4

Evaluates multimodal context learning across grounding, new information application, and knowledge acquisition using a 3,443-instance benchmark spanning science, finance, long documents, spatial reasoning, and web VQA; finds current multimodal models perform poorly (best score 0.2847) and analyzes failure modes.

multimodal benchmark vision evaluation paper +4

AI Dataset 2026

HiFi-UMI-2K

Yuteng Wei, Jinming Ma +15, Simple AI

Provides 2,000 hours of synchronized, high‑fidelity robot‑free bimanual manipulation demonstrations with multi‑view video, calibrated end‑effector trajectories, gripper states, and language annotations. Curated from a 20,000+ hour corpus; features 6 camera views, ~3 mm pose accuracy, <40 µs cross‑sensor sync, and LeRobot v3‑style Parquet+MP4 export under CC BY 4.0.

robotics video multimodal parquet huggingface +3

AI Dataset 2026

Anthropic/BioMysteryBench-full

Anthropic, Hugging Face

A collection of biology-focused 'mystery' tasks for benchmarking model performance on biomedical reasoning, evidence synthesis, and problem solving; curated by Anthropic and hosted on Hugging Face, designed for granular evaluation of scientific decision-making.

anthropic huggingface evaluation benchmarks reasoning +1
