Agentic systems that plan against an internal simulation of their environment tend to fail when that simulation drifts from real execution, so benchmarks that pair each action with a ground-truth next-state observation are critical. AgentWorldBench delivers exactly that: a multi-domain, reference-grounded evaluation suite designed to probe the fidelity of language-based world models across long multi-turn trajectories.
What Sets It Apart
- Paired ground truth per turn: every prediction can be directly compared to an actual environment observation (not only task success), enabling fine-grained, reference-grounded scoring.
- Seven unified domains and 2,170 samples: the suite covers MCP (API/tool responses), Search results, Terminal, Software Engineering (IDE/git/test traces), Android UI, Web DOM, and Desktop OS state, with an average trajectory length of ≈22.8 turns.
- Five-dimension rubric + rule verifiers: open-ended rubric judging (Format, Factuality, Consistency, Realism, Quality) supplemented by deterministic checks where applicable to isolate specific failure modes.
- Reproducible pipeline: packaged per-domain JSONL records that include system prompts and judge templates, plus an evaluation script to run model inference → LLM judge → scoring.
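
The inference → LLM judge → scoring flow described above can be sketched roughly as follows. This is a minimal illustration, not the suite's actual evaluation script: the record field names (`system_prompt`, `history`, `action`, `ground_truth`, `judge_template`) and the `predict`, `judge`, and `verify` callables are hypothetical placeholders, and the real JSONL schema may differ.

```python
import json
from statistics import mean
from typing import Callable, Optional

# The five rubric dimensions named in the list above.
DIMENSIONS = ("format", "factuality", "consistency", "realism", "quality")


def load_records(jsonl_text: str) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def evaluate(
    records: list[dict],
    predict: Callable[[str, list, str], str],
    judge: Callable[[str, str, str], dict],
    verify: Optional[Callable[[str, str], bool]] = None,
) -> dict:
    """Run inference, judging, and score aggregation over all records.

    All field names read from `rec` are assumed for illustration only.
    """
    if not records:
        raise ValueError("no records to evaluate")

    per_dim = {d: [] for d in DIMENSIONS}
    rule_results = []
    for rec in records:
        # 1. Model inference: predict the next observation for this turn.
        prediction = predict(rec["system_prompt"], rec["history"], rec["action"])
        # 2. LLM judge: score the prediction against the paired ground truth.
        scores = judge(rec["judge_template"], prediction, rec["ground_truth"])
        for d in DIMENSIONS:
            per_dim[d].append(scores[d])
        # 3. Optional deterministic rule check, where one applies.
        if verify is not None:
            rule_results.append(verify(prediction, rec["ground_truth"]))

    # 4. Scoring: average each rubric dimension across records.
    report = {d: mean(values) for d, values in per_dim.items()}
    if rule_results:
        report["rule_pass_rate"] = mean(1.0 if ok else 0.0 for ok in rule_results)
    return report
```

Keeping `predict`, `judge`, and `verify` as injected callables means the same loop can run with a stub model for smoke tests or with a real world model and LLM judge, and the per-dimension averages stay separate so a weakness in, say, Factuality is not hidden by a strong Format score.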
Who It's For & Trade-offs
Great fit if you develop or evaluate language world models, agent simulators, or LLM-based judges and need reference-grounded, multi-turn diagnostics across diverse agent environments. Look elsewhere if you need pixel-frame visual benchmarks (AgentWorldBench uses UI view/accessibility trees for GUI domains) or extremely large-scale web crawl datasets. The collection is curated for fidelity and targeted diagnosis rather than sheer web-scale diversity.
Where It Fits
AgentWorldBench complements existing tool-centric benchmarks (Tool Decathlon, Terminal-Bench, OSWorld-Verified) by turning real interaction traces into next-state prediction tasks with paired ground truth and standardized judge prompts. It's intended as a standard evaluation suite for researchers tuning world models, comparing simulation fidelity, or training LLM judges.
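
To make the idea of turning an interaction trace into next-state prediction tasks concrete, here is a minimal sketch. The `Turn` and `Sample` types and the `to_next_state_samples` helper are hypothetical names introduced for illustration and are not part of AgentWorldBench; the sketch assumes a trace is an ordered list of (action, observation) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One step of a recorded trace (hypothetical structure)."""
    action: str        # what the agent did
    observation: str   # what the real environment returned


@dataclass(frozen=True)
class Sample:
    """One next-state prediction task derived from a trace."""
    history: tuple[Turn, ...]  # all turns before the current one
    action: str                # the action whose outcome must be predicted
    target: str                # paired ground-truth observation


def to_next_state_samples(trace: list[Turn]) -> list[Sample]:
    """Emit one sample per turn: predict observation i from turns 0..i-1 plus action i."""
    return [
        Sample(history=tuple(trace[:i]), action=turn.action, target=turn.observation)
        for i, turn in enumerate(trace)
    ]


trace = [
    Turn("mkdir demo", ""),
    Turn("cd demo", ""),
    Turn("pwd", "/home/user/demo"),
]
samples = to_next_state_samples(trace)
print(len(samples), samples[2].action, samples[2].target)
```

Each turn yields its own sample with the real observation as the target, which is what allows per-turn comparison against ground truth rather than scoring only final task success.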
