The dataset exposes the task-card metadata (one row per task) used to evaluate computer‑use agents on long-horizon professional work. By separating task descriptions, prompts, and input-file descriptors from the heavy assets (input files, reference outputs, VMs), it makes the benchmark's task design transparent without redistributing evaluation fixtures.
What Sets It Apart
- Metadata-first release: contains 147 task cards (titles, one-paragraph summaries, full agent prompts, required checklists, expected software, and input-file descriptors) so researchers can inspect and filter tasks without downloading large input bundles.
- Clear reproducibility surface: each task includes a stable task_id, taxonomy fields, and source_repo_path, which helps map a task card back to the source repository and the companion input/reference datasets.
- Practical evaluation workflow: companion datasets split responsibilities: this dataset = task metadata (open), agents-last-exam-data = input files (open), agents-last-exam-reference = gated ground-truth outputs (manual access for scoring). That design supports public review of task design while protecting sensitive reference assets.
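As a rough illustration of how a stable task_id can tie a task card to the companion datasets, the sketch below indexes cards by task_id and reports which cards have no matching entry in an input listing. Only task_id and source_repo_path come from the description above; the record values, the helper names (index_by_task_id, missing_inputs), and the idea that the input dataset can be listed by task_id are assumptions made for this example, not the datasets' actual layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCard:
    # Only these two field names are taken from the dataset description.
    task_id: str
    source_repo_path: str


def index_by_task_id(cards):
    """Build a task_id -> card lookup, rejecting duplicate ids."""
    index = {}
    for card in cards:
        if card.task_id in index:
            raise ValueError(f"duplicate task_id: {card.task_id}")
        index[card.task_id] = card
    return index


def missing_inputs(cards, input_task_ids):
    """Return the task_ids that have a card but no entry in the input listing."""
    available = set(input_task_ids)
    return sorted(c.task_id for c in cards if c.task_id not in available)


# Made-up records for illustration; real ids and paths will differ.
cards = [
    TaskCard("task-001", "tasks/example_a"),
    TaskCard("task-002", "tasks/example_b"),
]
index = index_by_task_id(cards)

print(index["task-002"].source_repo_path)
print(missing_inputs(cards, ["task-001"]))
```

A check like this is cheap to run before downloading large input bundles, since it needs only the metadata and a list of identifiers.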
Who It Fits / Trade-offs
Great fit if you are a researcher or developer designing, auditing, or selecting tasks to evaluate agent capabilities (skill coverage, domain selection, long-horizon planning). The dataset is lightweight and easy to filter (Parquet format, with library support listed), but it is metadata-only: to run or score full evaluations you must fetch the companion input dataset and request access to the gated reference repository. If you need end-to-end runnable benchmarks, expect additional setup (VM images, input files, scoring fixtures) from the companion repositories.
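Filtering task cards for task selection might look like the minimal sketch below. It works on plain dictionaries so that it runs without downloading anything. The column names domain and expected_software, the sample rows, and the helper filter_cards are hypothetical stand-ins for the real taxonomy and software fields; check the actual schema before adapting it.

```python
def filter_cards(cards, domain=None, software=None):
    """Keep the cards that match every criterion that is not None."""
    selected = []
    for card in cards:
        if domain is not None and card.get("domain") != domain:
            continue
        if software is not None and software not in card.get("expected_software", []):
            continue
        selected.append(card)
    return selected


# Made-up rows; the real dataset has one row per task card.
cards = [
    {"task_id": "task-001", "domain": "finance", "expected_software": ["spreadsheet"]},
    {"task_id": "task-002", "domain": "engineering", "expected_software": ["cad", "spreadsheet"]},
    {"task_id": "task-003", "domain": "finance", "expected_software": ["word processor"]},
]

finance_sheets = filter_cards(cards, domain="finance", software="spreadsheet")
print([c["task_id"] for c in finance_sheets])
```

With the real Parquet file, the rows could be loaded into the same shape with, for example, pandas.read_parquet(path).to_dict("records"), where path points at a local copy of the file.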
