Most evaluation datasets for coding agents either scrape human sessions, which are noisy and raise privacy concerns, or simulate single-turn code edits. That leaves a gap for controlled, multi-turn agent–user coding conversations that include tool use, command runs, and repository edits. SynthTraces fills that gap by systematically generating synthetic end-to-end coding-session traces, so researchers can study agent behavior, tool usage, and failure modes reproducibly.
What Sets It Apart
- Reproducible Cartesian-product generation: the dataset is produced as agent models × user models × codebases × starting questions (20 × 3 × 20 × 20 = 24,000 sessions). This makes coverage across models, repos, and prompts explicit and easy to filter for experiments.
- Multi-tool, multi-turn traces: each session records the full exchange plus agent actions (read, write, edit, bash) within a real cloned repository — useful for studying step-by-step coding workflows, tool invocation patterns, and edit granularity.
- Open-model focus with local user simulation: the coding agent is backed by remotely hosted open models (examples include several public router models), while the user role is simulated locally via llama.cpp variants, which keeps the user side reproducible offline and gives controlled variability in user prompts.
- Lightweight, MIT-licensed artifacts: designed as a minimal codebase to generate traces rather than a monolithic platform, with data provided in JSON/agent-traces formats suited for experiment pipelines.
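The Cartesian-product design described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual generator: the identifiers (`agent-00`, `repo-05`, and so on), the `session_grid` helper, and the dictionary keys are hypothetical placeholders, since the real model, repository, and question names are not listed here.

```python
from itertools import product

# Placeholder identifiers (assumptions): the real names are not given in this overview.
agent_models = [f"agent-{i:02d}" for i in range(20)]
user_models = [f"user-{i}" for i in range(3)]
codebases = [f"repo-{i:02d}" for i in range(20)]
questions = [f"question-{i:02d}" for i in range(20)]


def session_grid(agents, users, repos, prompts):
    """Yield one session spec per combination of the four axes."""
    for agent, user, repo, prompt in product(agents, users, repos, prompts):
        yield {"agent": agent, "user": user, "codebase": repo, "question": prompt}


sessions = list(session_grid(agent_models, user_models, codebases, questions))
print(len(sessions))  # 20 * 3 * 20 * 20 = 24000

# Filtering a slice for an experiment: one agent on one codebase.
subset = [
    s for s in sessions
    if s["agent"] == "agent-00" and s["codebase"] == "repo-05"
]
print(len(subset))  # 3 user models * 20 questions = 60
```

Because every session is identified by its position on the four axes, selecting a subset for an experiment reduces to a filter on those keys.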
Who It's For, and Tradeoffs
A great fit if you are a researcher or engineer who wants a controlled, model-centric dataset to evaluate coding-agent tool usage, analyze edit/retry behaviors, benchmark agent strategies across multiple codebases, or train models on synthetic agent–developer interactions. It’s also handy for building metrics that require explicit action traces (e.g., success-after-edit).
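As a rough illustration of a trace-based metric, the sketch below counts tool invocations and computes one possible reading of success-after-edit. Everything about the data shape is an assumption: the `tool`, `path`, and `exit_code` fields are hypothetical, and "success" is defined here as a bash action exiting with code 0 after at least one write or edit. The dataset's real schema and metric definitions may differ, so check the dataset card before adapting this.

```python
from collections import Counter

# Tools that modify the repository (taken from the action list: read, write, edit, bash).
EDIT_TOOLS = {"write", "edit"}


def tool_counts(actions):
    """Count how often each tool appears in a session's action list."""
    return Counter(a["tool"] for a in actions)


def success_after_edit(actions):
    """Return True if a bash run with exit code 0 occurs after an edit or write.

    Assumed definition for illustration only; `exit_code` is a hypothetical field.
    """
    edited = False
    for action in actions:
        if action["tool"] in EDIT_TOOLS:
            edited = True
        elif action["tool"] == "bash" and edited and action.get("exit_code") == 0:
            return True
    return False


# Hypothetical session: a failing run, an edit, then a passing run.
example = [
    {"tool": "read", "path": "app.py"},
    {"tool": "bash", "exit_code": 1},
    {"tool": "edit", "path": "app.py"},
    {"tool": "bash", "exit_code": 0},
]

print(tool_counts(example)["bash"])  # 2
print(success_after_edit(example))   # True
```

A successful run that happens before any edit does not count under this definition, which is why the function tracks whether an edit has been seen.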
Look elsewhere if you need large-scale human-authored coding sessions, production user telemetry, or privacy-preserving real-user data: SynthTraces is synthetic by design and emphasizes coverage and reproducibility over authentic human variance. Also note that the final aggregate statistics (success rates, token counts) are marked TODO in the dataset card; check the code repository for any post-generation metadata updates.
Overall, SynthTraces is a focused, experiment-friendly resource for studying and prototyping LLM-driven coding agents in a controlled, repeatable setting.
