Why this matters
Most agent benchmarks treat GUI control, command-line use, and code edits as separate capabilities. The core insight of WeaveBench is that real computer-use problems require a single agent to weave those interfaces together over long trajectories, and measuring only final outputs hides shortcut behaviors.
Key findings
- Task scope: 114 tasks spanning 8 real-world work domains, grounded in actual user requests and publicly verifiable artifacts. This breadth forces agents to plan across interface boundaries rather than solve isolated subproblems.
- Real-world execution: Evaluations run on a real Ubuntu desktop inside deployed CLI-agent runtimes augmented with a minimal desktop-control plugin. That setup exposes integration and robustness issues that simulators miss.
- Trajectory-aware judging: A companion judge inspects deliverables, files, screenshots, logs, and action traces to detect fabricated visual evidence or hard-coded metrics. Comparing trajectory-aware grading to outcome-only grading shows the latter substantially overestimates performance.
- Current performance: Across modern model-runtime pairings, the best reported PassRate is only about 41.2%, which leaves substantial headroom for research on cross-interface orchestration and long-horizon reliability.
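To make the gap between outcome-only and trajectory-aware grading concrete, here is a toy sketch. It is not WeaveBench's judge or schema: the `Run` record, the `shortcut_flags` field, and the four example runs are hypothetical and made up for illustration. The only idea taken from the summary above is that a run whose deliverable looks correct should still fail if the trajectory shows fabricated evidence or hard-coded metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Run:
    """One agent attempt at a task (hypothetical record, not the WeaveBench schema)."""
    task_id: str
    deliverable_ok: bool  # the final artifact passes the output check
    # Problems a judge found in traces, screenshots, or logs (assumed field).
    shortcut_flags: List[str] = field(default_factory=list)


def outcome_only_pass(run: Run) -> bool:
    # Looks at the final deliverable and nothing else.
    return run.deliverable_ok


def trajectory_aware_pass(run: Run) -> bool:
    # Requires a correct deliverable AND a trajectory with no flagged shortcuts.
    return run.deliverable_ok and not run.shortcut_flags


def pass_rate(runs: List[Run], grader: Callable[[Run], bool]) -> float:
    # Fraction of runs the grader accepts; 0.0 for an empty list.
    return sum(grader(r) for r in runs) / len(runs) if runs else 0.0


# Made-up runs for illustration only; these are not benchmark results.
runs = [
    Run("t1", True),
    Run("t2", True, ["hard-coded metric"]),
    Run("t3", False),
    Run("t4", True, ["fabricated screenshot"]),
]

outcome_rate = pass_rate(runs, outcome_only_pass)
trajectory_rate = pass_rate(runs, trajectory_aware_pass)
print(f"outcome-only: {outcome_rate:.2f}, trajectory-aware: {trajectory_rate:.2f}")
```

In this toy data, two of the three runs that pass the outcome check are rejected once the trajectory is inspected, which is the kind of overestimate the outcome-only comparison is meant to expose. Because the trajectory-aware grader only adds a condition, its pass rate can never exceed the outcome-only one.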
Who it's for and tradeoffs
Great fit if you research or build computer-use agents, agent tool-chaining, or multimodal orchestration and want a benchmark that stresses real integration (GUI+CLI+code) and long-horizon planning. The benchmark is valuable for evaluating execution robustness, artifact provenance, and avoidance of shortcut behaviors.
Look elsewhere if your focus is purely language-only capabilities, simulated toy tasks, or robotics navigation. The benchmark requires a real Ubuntu desktop and a trajectory-aware evaluation pipeline, which raises experiment overhead and tightens reproducibility constraints compared with lightweight simulators.
Where it fits
WeaveBench sits between narrow GUI-control benchmarks and high-level text-only agent evaluation: it operationalizes the “last mile” problems of agents that must actually manipulate desktops, run commands, and edit code to produce verifiable artifacts rather than only generating text.
