xlangai/osworld_v2_tasks

Provides the gated, official OSWorld 2.0 Python task class files (task_*.py) required to run the benchmark; distributed via a Hugging Face gated dataset to reduce benchmark leakage. Download requires accepting gated access on Hugging Face.

Introduction

Long-horizon computer-use benchmarks depend on authoritative, non-leaking task implementations. This gated Hugging Face dataset houses the official OSWorld 2.0 Python task classes that define environment logic and evaluators used across benchmark runs, so that agent evaluations remain comparable and resistant to accidentally leaked solutions.

What Sets It Apart

Canonical task classes: Contains the root-level task_*.py implementations that reproduce the benchmark's environment, input artifacts, and evaluators, not just example data. Running the same task classes helps keep evaluations comparable across runs.
Gated distribution to prevent leakage: The files are intentionally distributed behind a Hugging Face gated dataset to reduce the chance that evaluated agents can find task answers or internal evaluator behavior during execution. This design choice prioritizes benchmark integrity over open ease-of-access.
Integration-ready format: The task files are commonly used alongside the OSWorld codebase and evaluation tooling; consumers typically integrate the task classes into local benchmark runners or evaluation pipelines.
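
As a rough sketch of the access and integration workflow described above, the snippet below fetches the task files with huggingface_hub's snapshot_download and then lists the root-level task_*.py files in the local copy. The repository id is taken from this page's title. The helper names (download_task_files, list_task_files), the local_dir argument, and the assumption that filtering on task_*.py is sufficient are illustrative and not part of the official OSWorld tooling. The download only succeeds once the account behind the token has accepted gated access on Hugging Face.

```python
from __future__ import annotations

from pathlib import Path


def download_task_files(local_dir: str, token: str | None = None) -> Path:
    """Fetch task_*.py files from the gated dataset (hypothetical helper).

    Assumes the account behind `token` (or the cached login) has already
    been granted access on the dataset page; otherwise the call fails.
    """
    # Imported lazily so list_task_files works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="xlangai/osworld_v2_tasks",  # from this page's title
        repo_type="dataset",
        allow_patterns=["task_*.py"],  # assumption: only these files are needed
        local_dir=local_dir,
        token=token,
    )
    return Path(path)


def list_task_files(root: str | Path) -> list[Path]:
    """Return root-level task_*.py files, sorted by path (no recursion)."""
    return sorted(p for p in Path(root).glob("task_*.py") if p.is_file())
```

A local runner could call download_task_files("./osworld_v2_tasks") once and then iterate over list_task_files("./osworld_v2_tasks") to register each task module. How the modules are imported and executed is left to the OSWorld codebase.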

Who It's For

A good fit if you run or reproduce OSWorld 2.0 evaluations, develop agents that interact with long-horizon GUI/web workflows, or need the authoritative task logic for research-grade comparisons. Look elsewhere if you only need example scenarios or lightweight synthetic tasks. This gated package contains the official task implementations and expects you to follow the project's benchmark release policies and access workflow.

Information

Website: huggingface.co
Organizations: xlangai
Published date: 2026/06/01

More Items

SceneFun3D

2024

ETH Zurich, Google +2

Alexandros Delitzas, Ayca Takmaz +4

Provides point-accurate annotations of interactive parts in high-resolution indoor laser-scan point clouds, plus affordance labels, motion axes and natural-language task descriptions; includes aligned iPad RGB-D video slices with 2D projections for multimodal research.

robotics vision depth multimodal huggingface+1

CS2-10k

2026

Reka AI

Provides 600,000+ first-person player-round videos (10,000+ hours) with per-frame keyboard, mouse-delta, and 3D trajectory annotations in WebDataset shards—built for training world models, action-conditioned video, and imitation-learning workflows (non-commercial license).

video ai-video huggingface vision ai-development+1

WGO-Bench

2026

Macrodata Labs, InternRobotics +1

Provides a small, manually annotated benchmark for evaluating vision–language models that convert robot and egocentric manipulation videos into timestamped subtask segments and concise action labels. Contains 100 episodes, 743 gold segments, and MP4 bytes embedded per row.

video robotics ai-video evaluation huggingface+2