AI Dataset · 2026

lazarus19/Vibe-Coding-Instruct

A JSON dataset of ~1.1M anonymized coding-assistant instruction→response interactions for training and evaluating code-generation and instruction-following models; packaged for use with pandas/polars and sized at ~459 MB.


Introduction

Public, large-scale logs of real coding-assistant interactions are rare; this dataset fills that gap with ~1.1 million anonymized instruction→response traces stored as JSON. It is aimed at researchers and engineers who need realistic client↔server coding-assistant interactions for model training, evaluation, or analysis without sharing raw identifiable content.

What Sets It Apart

Scale and format: ~1,100,000 rows in a compact JSON shard (~459 MB), so you can iterate on model training and evaluation without heavyweight storage needs; practical for local prototyping and batch experiments.
Interaction focus: records client↔server message logs and instruction-response pairs rather than isolated code snippets, so you can study multi-turn prompting, instruction clarity, and assistant behavior rather than only final outputs.
Tooling-ready: metadata and structure are compatible with pandas/polars workflows, lowering the friction to preprocess, filter, and sample data for fine-tuning or evaluation pipelines.
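
As a rough illustration of the "preprocess, filter, and sample" step mentioned above, here is a minimal sketch using only the Python standard library. It assumes the shard can be read as JSON Lines (one record per line) and that each record has "instruction" and "response" fields; both are assumptions for illustration, not the dataset's documented layout, so check the actual file before relying on them. The helper names iter_records and filter_and_sample are hypothetical.

```python
import io
import json
import random

# Hypothetical records: the field names "instruction" and "response" are
# assumptions for illustration, not the dataset's documented schema.
SAMPLE_JSONL = "\n".join(json.dumps(r) for r in [
    {"instruction": "Write a function that reverses a string.",
     "response": "def rev(s):\n    return s[::-1]"},
    {"instruction": "Fix this off-by-one error.", "response": ""},
    {"instruction": "Explain list comprehensions.",
     "response": "A list comprehension builds a list from an iterable."},
    {"instruction": "", "response": "print('hello')"},
])


def iter_records(stream):
    """Yield one dict per non-blank line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_and_sample(stream, k, seed=0):
    """Drop records with an empty instruction or response, then keep a
    uniform random sample of up to k records via reservoir sampling,
    so the whole file never has to sit in memory at once."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of records that passed the filter so far
    for rec in iter_records(stream):
        if not rec.get("instruction") or not rec.get("response"):
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append(rec)
        else:
            # Replace an existing slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = rec
    return reservoir


# In practice the stream would be an open file handle on the downloaded shard.
sample = filter_and_sample(io.StringIO(SAMPLE_JSONL), k=2)
print(len(sample))
```

The resulting list of dicts can be handed to pandas.DataFrame(sample) for further analysis; if the file turns out to be a single JSON array rather than JSON Lines, json.load on the whole file would replace iter_records.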

Who It's For + Tradeoffs

Great fit if you need realistic conversational coding data to train or benchmark code-generation and instruction-following LLMs, to analyze prompting strategies, or to simulate coding-assistant UX. Look elsewhere if you require labeled functional tests, ground-truth code execution traces, or provenance/attribution metadata for each example; this dataset prioritizes interaction logs and anonymization over executable test harnesses and exhaustive provenance.


Information

Website: huggingface.co
Authors: lazarus19
Published date: 2026/06/12

More Items

AI Dataset · 2026

ArithMark 3.0

AxiomicLabs

A multiple-choice benchmark for evaluating language-model arithmetic: 1,000 continuation-style elementary word problems (4 choices, balanced labels) organized by topic, grade band, and difficulty. Designed for base-model continuation log-likelihood scoring; released under Apache-2.0.

evaluation benchmarks benchmark huggingface nlp+4

AI Dataset · 2026

XYZ-Aquila SFT

XYZAILab

Provides 7,000 bilingual multi-turn, search-oriented tool-use trajectories (5,000 English, 2,000 Chinese) for supervised fine-tuning and analysis of agentic search models. Includes serialized system/user/assistant messages, embedded Qwen3 tool schemas, and conversion scripts; not a standalone benchmark.

web-search agent-skills ai-agent multilingual huggingface+4

AI Dataset · 2026

Reasoning Corpus 5M

QyrouQyrouNnet-AI, SupraLabs

Provides ~5M model-generated reasoning chains (within 5k sequence length) with structured fields for supervised fine-tuning, reasoning distillation, and instruction tuning. Includes separate fields for prompt, reasoning trace, final answer and a ChatML view; streaming access recommended for large-scale use.

reasoning distillation deepseek qwen gemma+7
