Automating end-to-end data-science workflows with LLM-based agents promises to cut manual effort, but progress is hampered by a lack of benchmarks that capture the diversity and skill composition of real tasks. AgenticDataBench aims to close that gap by assembling representative datasets, decomposing tasks into reusable data-science skills, and providing fine-grained labels so evaluations reflect practical agent capabilities rather than only coarse task success.
Key Findings
- Multi-domain coverage: The benchmark collects tasks and datasets across 15 verticals, including five real-world B2B cases, so evaluations stress domain variance and business constraints rather than toy examples. As a result, agent scores are a better proxy for production readiness than scores on synthetic tasks.
- Skill-level labeling: Tasks are annotated by constituent data-science skills (extraction, cleaning, joining, modeling, interpretation), enabling per-skill performance analysis. This exposes which subroutines agents struggle with and guides targeted improvements.
- Hybrid construction: For domains lacking real tasks, the authors use an LLM-based task-generation pipeline grounded in extracted skills to produce realistic workflows, expanding coverage without excessive manual effort. This trades perfect realism for scalability while retaining skill diversity.
- Empirical evaluation: The paper evaluates state-of-the-art data agents on the benchmark and reports detailed, skill-wise failure modes rather than only aggregate scores, highlighting gaps in data integration, unstructured-text extraction, and multi-step reasoning.
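
To make the skill-level analysis above concrete, here is a minimal sketch of how per-skill pass rates could be derived from skill-labeled task results. The record layout (`task`, `skills`, `passed`), the sample data, and the `per_skill_pass_rate` helper are all hypothetical illustrations, not the benchmark's actual schema or scoring code.

```python
from collections import defaultdict

# Hypothetical results: each task lists the skills it exercises and
# whether the agent solved it. These are NOT real benchmark numbers.
RESULTS = [
    {"task": "t1", "skills": ["extraction", "cleaning"], "passed": True},
    {"task": "t2", "skills": ["cleaning", "joining"], "passed": False},
    {"task": "t3", "skills": ["joining", "modeling"], "passed": False},
    {"task": "t4", "skills": ["extraction", "interpretation"], "passed": True},
]


def per_skill_pass_rate(results):
    """Return {skill: fraction of tasks using that skill that passed}."""
    attempts = defaultdict(int)
    passes = defaultdict(int)
    for record in results:
        # set() guards against a skill being listed twice on one task.
        for skill in set(record["skills"]):
            attempts[skill] += 1
            passes[skill] += int(record["passed"])
    return {skill: passes[skill] / attempts[skill] for skill in attempts}


rates = per_skill_pass_rate(RESULTS)
aggregate = sum(r["passed"] for r in RESULTS) / len(RESULTS)

print(f"aggregate       {aggregate:.2f}")
# Weakest skills first, so problem areas surface at the top.
for skill, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{skill:15s} {rate:.2f}")
```

In this toy data the aggregate pass rate is 0.50, yet the breakdown shows every task involving `joining` failing, which is the kind of signal an aggregate score hides. Note that this simple attribution charges a failed task to all of its skills; finer-grained labels would allow blame to be assigned more precisely.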
Who it's for and tradeoffs
Great fit if you build or evaluate LLM-driven data agents, research agent architectures for data integration and analysis, or need a benchmark that surfaces per-skill weaknesses. Look elsewhere if you only need single-query SQL translation tests or tiny toy tables: AgenticDataBench emphasizes realistic, multi-step workflows and requires more complex testbeds and more annotation effort. The LLM-generated tasks improve coverage but may not fully substitute for high-fidelity proprietary business scenarios.
