AI Dataset, 2026

Agents Last Exam — Task Card Metadata

Provides task-card metadata for 147 long-horizon professional tasks from the Agents Last Exam benchmark — titles, prompts, taxonomy, and input-file descriptors. This v1.0 release is metadata-only; companion repos host input files and gated reference outputs.


Introduction

The dataset exposes the task-card metadata (one row per task) used to evaluate computer‑use agents on long-horizon professional work. By separating task descriptions, prompts, and input-file descriptors from the heavy assets (input files, reference outputs, VMs), it makes the benchmark's task design transparent without redistributing evaluation fixtures.

What Sets It Apart

Metadata-first release: contains 147 task cards (titles, one-paragraph summaries, full agent prompts, required checklists, expected software, and input-file descriptors) so researchers can inspect and filter tasks without downloading large input bundles.
Clear reproducibility surface: each task includes a stable task_id, taxonomy fields, and source_repo_path, which helps map a task card back to the source repository and the companion input/reference datasets.
Practical evaluation workflow: companion datasets split responsibilities — this dataset = task metadata (open), agents-last-exam-data = input files (open), agents-last-exam-reference = gated ground-truth outputs (manual access for scoring). That design supports public review of task design while protecting sensitive reference assets.
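As a rough illustration of how one row per task might be inspected and filtered, here is a minimal sketch. Only task_id and source_repo_path are field names taken from the description above; the other field names (domain, expected_software, input_files), the filter_cards helper, and the sample rows are assumptions made up for this example and may not match the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCard:
    # task_id and source_repo_path are named in the description above;
    # every other field name here is an assumed, illustrative spelling.
    task_id: str
    title: str
    source_repo_path: str
    domain: str = ""               # hypothetical taxonomy field
    expected_software: tuple = ()  # hypothetical column name
    input_files: tuple = ()        # hypothetical column name


def filter_cards(cards, domain=None, software=None):
    """Return cards matching an optional domain and an optional required tool."""
    selected = []
    for card in cards:
        if domain is not None and card.domain != domain:
            continue
        if software is not None and software not in card.expected_software:
            continue
        selected.append(card)
    return selected


# Made-up rows for illustration only; they are not real benchmark tasks.
cards = [
    TaskCard("task-001", "Quarterly report", "tasks/task-001", "finance", ("spreadsheet",)),
    TaskCard("task-002", "Floor plan edit", "tasks/task-002", "architecture", ("cad",)),
    TaskCard("task-003", "Budget audit", "tasks/task-003", "finance", ("spreadsheet", "pdf-viewer")),
]

finance_ids = [c.task_id for c in filter_cards(cards, domain="finance")]
print(finance_ids)
```

Because each card carries a stable task_id, the selected ids could then serve as keys for looking up matching assets in the companion datasets, assuming those datasets use the same identifiers.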

Who It Fits / Trade-offs

Great fit if you are a researcher or developer designing, auditing, or selecting tasks to evaluate agent capabilities (skill coverage, domain selection, long-horizon planning). The dataset is lightweight and easy to filter (Parquet format; supported libraries are listed on the dataset page), but it is metadata-only: to run full evaluations you must fetch the companion input dataset, and to score them you must request access to the gated reference repo. If you need an end-to-end runnable benchmark, expect additional setup (VM images, input files, scoring fixtures) from the companion repositories.
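The split between open and gated assets can be sketched as a small checklist helper. The names agents-last-exam-data and agents-last-exam-reference come from the description above; the key "task-card-metadata" is a placeholder rather than a confirmed repository name, and missing_for_scoring is a hypothetical helper written only for this example.

```python
# Roles and open/gated labels mirror the description above.
# "task-card-metadata" is a placeholder key, not a confirmed repository name.
COMPANIONS = {
    "task-card-metadata": {"provides": "task metadata", "access": "open"},
    "agents-last-exam-data": {"provides": "input files", "access": "open"},
    "agents-last-exam-reference": {"provides": "ground-truth outputs", "access": "gated"},
}


def missing_for_scoring(available):
    """List the datasets still needed for a scored run, flagging gated ones."""
    missing = []
    for name, info in COMPANIONS.items():
        if name not in available:
            note = " (request access)" if info["access"] == "gated" else ""
            missing.append(f"{name}: {info['provides']}{note}")
    return missing


# Starting from the metadata alone, two more pieces are needed.
todo = missing_for_scoring({"task-card-metadata"})
print("\n".join(todo))
```

This only models the dataset dependencies. VM images and other scoring fixtures mentioned above would need to be tracked separately.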


Information

Website: huggingface.co
Authors: agents-last-exam (RDI Berkeley)
Published date: 2026/05/07

More Items

AI Dataset, 2026

GPT-5.6 Sol Coding & Debugging Traces

greghavens

Provides live Codex-CLI agent run traces from GPT-5.6 Sol capturing coding, debugging, security reviews, and harness/seed workflows in cumulative next-action prefixes — suitable for supervised fine-tuning and analysis of tool-using coding agents.

codex ai-coding ai-agent code security+7

AI Dataset, 2026

Qwen3.8-Max Distillation 50K

r0b0tlab, Alibaba Cloud +1

A curated collection of 49,772 teacher-generated chat traces from qwen3.8-max-preview for supervised fine-tuning and off-policy distillation. Preserves visible chain-of-thought blocks, emphasizes math/code/reasoning mixes, and includes provenance and licensing cautions tied to Alibaba Cloud Model Studio.

distillation qwen reasoning math code+6

AI Dataset, 2026

Claude Fable 5 Agent Traces

greghavens

Provides behavior-preserving next-step training traces from Claude Fable 5 for supervised fine-tuning and analysis of instruction-following, tool-calling, and coding agents. Runtime-normalized, independently verified, and supplied as Parquet/JSONL with 13,357 cumulative rows from 2,443 accepted trajectories.

huggingface anthropic claude ai-coding code+6
