Long-horizon computer-use benchmarks depend on authoritative, non-leaking task implementations. This gated Hugging Face dataset houses the official OSWorld 2.0 Python task classes that define environment logic and evaluators used across benchmark runs, so that agent evaluations remain comparable and resistant to accidentally leaked solutions.
What Sets It Apart
- Canonical task classes: Contains the root-level task_*.py implementations that reproduce the benchmark’s environment, input artifacts, and evaluators, not just example data. Because every run uses the same task logic, results stay comparable across runs.
- Gated distribution to prevent leakage: The files are intentionally distributed behind a Hugging Face gated dataset to reduce the chance that an agent under evaluation can look up task answers or learn how the evaluators behave during execution. This design choice prioritizes benchmark integrity over ease of access.
- Integration-ready format: The dataset provides JSON-formatted task files and is commonly used alongside the OSWorld codebase and evaluation tooling. Consumers typically integrate the task classes into local benchmark runners or evaluation pipelines.
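
As a rough illustration of the integration step, the sketch below shows one way a local runner might discover task classes once the gated files have been downloaded to a local directory. It is a minimal sketch under stated assumptions, not the OSWorld loader: the function name `load_task_classes`, the directory layout, and the idea of collecting every class defined in each module are all hypothetical, and the official tooling may load tasks differently. Only the `task_*.py` naming pattern comes from the description above.

```python
import importlib.util
import inspect
from pathlib import Path


def load_task_classes(task_dir):
    """Import each root-level task_*.py in task_dir (hypothetical helper).

    Returns a dict mapping module name -> list of classes defined in that
    module. Importing executes the file, so only point this at files
    obtained from the official gated dataset.
    """
    found = {}
    # Non-recursive glob: only files directly inside task_dir are considered.
    for path in sorted(Path(task_dir).glob("task_*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Keep classes defined in this module, skipping imported ones.
        found[path.stem] = [
            obj
            for _, obj in inspect.getmembers(module, inspect.isclass)
            if obj.__module__ == module.__name__
        ]
    return found
```

Calling `load_task_classes("path/to/downloaded/tasks")` (the path is a placeholder) would return the discovered classes keyed by file stem, which a runner could then instantiate according to whatever interface the official task classes expose.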
Who It's For
Great fit if you run or reproduce OSWorld 2.0 evaluations, develop agents that interact with long-horizon GUI/web workflows, or need the authoritative task logic for research-grade comparisons. Look elsewhere if you only need example scenarios or lightweight synthetic tasks: this gated package contains the official task implementations and expects you to follow the project’s benchmark release policies and access workflow.
