AgentWorldBench

Provides 2,170 reference-grounded evaluation samples across seven agent domains (MCP, Search, Terminal, SWE, Android, Web, OS) to score language world models on Format, Factuality, Consistency, Realism and Quality. Includes per-domain JSONL files, judge prompts and an evaluation script for reproducible scoring.

Introduction

Agentic systems often fail when their internal simulation of the environment drifts from real execution, so benchmarks that pair each action with the ground-truth next-state observation are critical. AgentWorldBench provides exactly that: a multi-domain, reference-grounded evaluation suite designed to probe the fidelity of language-based world models across long multi-turn trajectories.

What Sets It Apart
  • Paired ground truth per turn: every prediction can be directly compared to an actual environment observation (not only task success), enabling fine-grained, reference-grounded scoring.
  • Seven unified domains and 2,170 samples: covers MCP (API/tool responses), Search results, Terminal, Software Engineering (IDE/git/test traces), Android UI, Web DOM, and Desktop OS state, with an average trajectory length of ≈22.8 turns.
  • Five-dimension rubric + rule verifiers: open-ended rubric judging (Format, Factuality, Consistency, Realism, Quality) supplemented by deterministic checks where applicable to isolate specific failure modes.
  • Reproducible pipeline: packaged per-domain JSONL records that include system prompts and judge templates, plus an evaluation script to run model inference → LLM judge → scoring.
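
The final scoring step of such a pipeline can be sketched as follows. This is a minimal illustration, not the benchmark's own evaluation script: the record keys `domain` and `scores`, the numeric score values, and the helper names are all assumptions made for the example. Only the five dimension names come from the description above.

```python
import json
from collections import defaultdict
from statistics import mean

# The five rubric dimensions named in the benchmark description.
DIMENSIONS = ("Format", "Factuality", "Consistency", "Realism", "Quality")

def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]

def aggregate_scores(judged):
    """Average each rubric dimension per domain.

    `judged` is a list of dicts with the hypothetical keys
    "domain" (str) and "scores" (dimension name -> number).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in judged:
        for dim in DIMENSIONS:
            if dim in rec["scores"]:
                buckets[rec["domain"]][dim].append(rec["scores"][dim])
    return {
        domain: {dim: mean(vals) for dim, vals in dims.items()}
        for domain, dims in buckets.items()
    }

# Invented judge outputs, for illustration only.
sample_lines = [
    '{"domain": "Terminal", "scores": {"Format": 5, "Factuality": 3, "Consistency": 4, "Realism": 4, "Quality": 4}}',
    '{"domain": "Terminal", "scores": {"Format": 3, "Factuality": 3, "Consistency": 2, "Realism": 4, "Quality": 2}}',
    '{"domain": "Web", "scores": {"Format": 4, "Factuality": 2, "Consistency": 3, "Realism": 3, "Quality": 3}}',
]
summary = aggregate_scores(load_jsonl(sample_lines))
```

Keeping the dimensions separate per domain, rather than collapsing them into one number, matches the stated goal of isolating specific failure modes.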

Who It's For & Trade-offs

Great fit if you develop or evaluate language world models, agent simulators, or LLM-based judges and need reference-grounded, multi-turn diagnostics across diverse agent environments. Look elsewhere if you need pixel-frame visual benchmarks (AgentWorldBench uses UI view/accessibility trees for GUI domains) or extremely large-scale web crawl datasets: the collection is curated for fidelity and targeted diagnosis rather than web-scale diversity.

Where It Fits

AgentWorldBench complements existing tool-centric benchmarks (Tool Decathlon, Terminal-Bench, OSWorld-Verified) by turning real interaction traces into next-state prediction tasks with paired ground truth and standardized judge prompts. It's intended as a standard evaluation suite for researchers tuning world models, comparing simulation fidelity, or training LLM judges.
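
As a rough sketch of what turning an interaction trace into next-state prediction tasks can look like, the snippet below splits a recorded trajectory into one sample per turn. The turn keys `action` and `observation`, the output keys, and the function name are hypothetical and do not reflect the benchmark's actual record schema.

```python
def to_next_state_samples(trajectory):
    """Build one sample per turn: the history so far, the current action,
    and the real observation kept as the reference for scoring."""
    samples = []
    history = []
    for turn in trajectory:
        samples.append({
            "history": list(history),          # turns seen before this one
            "action": turn["action"],          # what the agent did
            "reference": turn["observation"],  # ground-truth next state
        })
        history.append(turn)
    return samples

# Invented two-turn terminal trace, for illustration only.
trace = [
    {"action": "ls", "observation": "README.md\nsrc"},
    {"action": "cat README.md", "observation": "# demo"},
]
samples = to_next_state_samples(trace)
```

A world model would be asked to predict the observation from `history` and `action`, and its prediction would then be compared against `reference`.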
