AI Agent Papers, 2026

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

A benchmark for long-horizon computer-use agents that must orchestrate GUI, CLI, and code operations within a single trajectory, covering 114 real-world tasks. Agents are evaluated on a real Ubuntu desktop and graded by a trajectory-aware judge that inspects deliverables, artifacts, and action traces. The best reported PassRate is about 41.2%.

Introduction

Why this matters

Most agent benchmarks treat GUI control, command-line use, and code edits as separate capabilities. The core insight of WeaveBench is that real computer-use problems require a single agent to weave those interfaces together over long trajectories, and that measuring only final outputs hides shortcut behaviors.

Key Findings

Task scope: 114 tasks spanning 8 real-world work domains, grounded in actual user requests and publicly verifiable artifacts. This breadth forces agents to plan across interface boundaries rather than solve isolated subproblems.
Real-world execution: Evaluations run on a real Ubuntu desktop inside deployed CLI-agent runtimes augmented with a minimal desktop-control plugin. That setup exposes integration and robustness issues that simulators miss.
Trajectory-aware judging: A companion judge inspects deliverables, files, screenshots, logs, and action traces to detect fabricated visual evidence or hard-coded metrics. Comparing trajectory-aware grading to outcome-only grading shows the latter substantially overestimates performance.
Current performance: Across modern model-runtime pairings, the best reported PassRate is only ~41.2%, which leaves substantial headroom for research on cross-interface orchestration and long-horizon reliability.
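
The gap between outcome-only and trajectory-aware grading described above can be illustrated with a minimal sketch. This is not WeaveBench's actual judge: the Trajectory record, its field names, the required-file check, and the three example runs are all hypothetical, chosen only to show why inspecting the trace can lower a pass rate that outcome-only grading reports.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Trajectory:
    """One agent run. All field names are hypothetical, for illustration only."""
    task_id: str
    deliverable_ok: bool                 # final output looks correct on its own
    files_written: List[str] = field(default_factory=list)
    fabrication_flagged: bool = False    # e.g. faked screenshot or hard-coded metric


def outcome_only_pass(run: Trajectory) -> bool:
    # Outcome-only grading: trust the final deliverable and nothing else.
    return run.deliverable_ok


def trajectory_aware_pass(run: Trajectory, required_files: Set[str]) -> bool:
    # Trajectory-aware grading (assumed rule): the deliverable must be correct,
    # no fabrication may be flagged, and every required artifact must exist.
    if not run.deliverable_ok or run.fabrication_flagged:
        return False
    return required_files.issubset(run.files_written)


def pass_rate(results: List[bool]) -> float:
    # Fraction of runs that passed; 0.0 for an empty list.
    return sum(results) / len(results) if results else 0.0


# Three made-up runs: one honest success, one shortcut, one plain failure.
runs = [
    Trajectory("t1", True, ["report.csv"]),
    Trajectory("t2", True, ["report.csv"], fabrication_flagged=True),
    Trajectory("t3", False),
]
required = {"report.csv"}

outcome_rate = pass_rate([outcome_only_pass(r) for r in runs])
aware_rate = pass_rate([trajectory_aware_pass(r, required) for r in runs])
print(f"outcome-only: {outcome_rate:.2f}, trajectory-aware: {aware_rate:.2f}")
```

In this toy data the shortcut run t2 passes under outcome-only grading but fails once the trace is inspected, so the trajectory-aware rate is lower. The numbers are illustrative and say nothing about the magnitudes reported in the paper.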

Who it's for and tradeoffs

Great fit if you research or build computer-use agents, agent tool-chaining, or multimodal orchestration and want a benchmark that stresses real integration (GUI+CLI+code) and long-horizon planning. The benchmark is valuable for evaluating execution robustness, artifact provenance, and avoidance of shortcut behaviors.

Look elsewhere if your focus is language-only capabilities, simulated toy tasks, or robotics navigation. The benchmark requires a real-desktop setup (Ubuntu) and a trajectory-aware evaluation pipeline, which raises experiment overhead and makes results harder to reproduce than with lightweight simulators.

Where it fits

WeaveBench sits between narrow GUI-control benchmarks and high-level text-only agent evaluation: it operationalizes the “last mile” problems of agents that must actually manipulate desktops, run commands, and edit code to produce verifiable artifacts rather than only generating text.

Information

Website: arxiv.org
Authors: Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan
Published date: 2026/06/08

More Items

Computer Vision Papers, 2026

HumanCLAW: Can Vision-Language Models Act Through a Body?

Siyao Li, Jiawei Gu +16

Evaluates whether vision-language models can make actionable decisions for a physical body by decoupling decision-making from low-level motor execution. Introduces HumanCLAW-Bench with 1,218 long-horizon egocentric episodes across 41 indoor scenes and diagnoses a lack of embodied self-awareness in current VLMs.

vision robotics evaluation benchmarks multimodal +2

Natural Language Processing Papers, 2026

Keep It InMind: Benchmarking the Implicit-Association Blind Spot in Agent Memory

Ruizhe Li, Mingxuan Du +2

Measures how agent memory systems miss implicitly associated facts by introducing InMind, a 125-task benchmark with paired controls that distinguish storage gaps from retrieval gaps and knowledge gaps. Quantifies a large retrieval-interface blind spot and points to routing as the core open problem.

benchmark evaluation paper LLM NLP +3

Natural Language Processing Papers, 2026

A New Role for Relevance: Guiding Corpus Interaction in Agentic Search

Jiangnan Li, Yuqing Li +3

Turns document relevance into an execution prior for agentic corpus interaction: orders documents for sequential ripgrep traversal, seeds promising entry points with query-relevant paragraphs, and reranks grep matches to surface informative excerpts. Improves the accuracy–efficiency frontier on browse QA and reasoning-intensive retrieval.

retrieval RAG reasoning LLM NLP +3