AIAny - AgentWorldBench

Introduction

Agentic systems that rely on an internal simulation of their environment tend to fail when that simulation drifts from real execution, so benchmarks that pair each action with a ground-truth next-state observation are critical. AgentWorldBench is such a benchmark: a multi-domain, reference-grounded evaluation suite designed to probe the fidelity of language-based world models across long multi-turn trajectories.

What Sets It Apart

Paired ground truth per turn: every prediction can be directly compared to an actual environment observation (not only task success), enabling fine-grained, reference-grounded scoring.
Seven unified domains and 2,170 samples: covers MCP (API/tool responses), Search results, Terminal, Software Engineering (IDE/git/test traces), Android UI, Web DOM, and Desktop OS state, with an average trajectory length of ≈22.8 turns.
Five-dimension rubric + rule verifiers: open-ended rubric judging (Format, Factuality, Consistency, Realism, Quality) supplemented by deterministic checks where applicable to isolate specific failure modes.
Reproducible pipeline: packaged per-domain JSONL records that include system prompts and judge templates, plus an evaluation script to run model inference → LLM judge → scoring.
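
The inference → judge → scoring flow described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual evaluation script: the record field names (`system_prompt`, `history`, `action`, `ground_truth`), the 1 to 5 score scale, and the stub `echo_predict` and `exact_match_judge` functions are all assumptions made for the example. Only the five dimension names come from the description.

```python
import json
from statistics import mean

# Rubric dimensions named in the description above.
DIMENSIONS = ("Format", "Factuality", "Consistency", "Realism", "Quality")


def load_records(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def score_turn(record, predict, judge):
    """Predict the next state for one turn, then judge it against ground truth."""
    # Field names here are hypothetical, not the benchmark's real schema.
    prediction = predict(record["system_prompt"], record["history"], record["action"])
    scores = judge(prediction, record["ground_truth"])
    return {dim: float(scores[dim]) for dim in DIMENSIONS}


def aggregate(per_turn_scores):
    """Average each rubric dimension over all scored turns."""
    return {dim: mean(s[dim] for s in per_turn_scores) for dim in DIMENSIONS}


def echo_predict(system_prompt, history, action):
    # Stand-in for a world model: it only knows how to simulate `echo`.
    return action[len("echo "):] if action.startswith("echo ") else ""


def exact_match_judge(prediction, ground_truth):
    # Stand-in for an LLM judge: 5 on exact match, else 1 (assumed scale).
    value = 5.0 if prediction == ground_truth else 1.0
    return {dim: value for dim in DIMENSIONS}


# Two made-up terminal turns in the assumed record layout.
SAMPLE = "\n".join([
    json.dumps({"system_prompt": "Simulate a terminal.", "history": [],
                "action": "echo hi", "ground_truth": "hi"}),
    json.dumps({"system_prompt": "Simulate a terminal.", "history": [],
                "action": "pwd", "ground_truth": "/home/user"}),
])

records = load_records(SAMPLE)
report = aggregate([score_turn(r, echo_predict, exact_match_judge) for r in records])
print(report)
```

In a real run, `echo_predict` would be replaced by a call to the world model under test and `exact_match_judge` by the LLM judge driven by the packaged judge template, with rule verifiers applied where a domain supports deterministic checks.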

Who It's For & Trade-offs

Great fit if you develop or evaluate language world models, agent simulators, or LLM-based judges and need reference-grounded, multi-turn diagnostics across diverse agent environments. Look elsewhere if you need pixel-frame visual benchmarks (AgentWorldBench uses UI view/accessibility trees for GUI domains) or extremely large-scale web crawl datasets: the collection is curated for fidelity and targeted diagnosis rather than web-scale diversity.

Where It Fits

AgentWorldBench complements existing tool-centric benchmarks (Tool Decathlon, Terminal-Bench, OSWorld-Verified) by turning real interaction traces into next-state prediction tasks with paired ground truth and standardized judge prompts. It's intended as a standard evaluation suite for researchers tuning world models, comparing simulation fidelity, or training LLM judges.
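
As a rough illustration of turning an interaction trace into next-state prediction tasks, the sketch below expands a recorded list of (action, observation) pairs into one task per turn, each carrying the prior turns as history and the real observation as ground truth. The `PredictionTask` type, the `trace_to_tasks` helper, and the tuple-based trace layout are hypothetical and are not taken from the benchmark's data format.

```python
from dataclasses import dataclass


@dataclass
class PredictionTask:
    history: list        # (action, observation) pairs that precede this turn
    action: str          # the action whose outcome the world model must predict
    ground_truth: str    # the observation the real environment returned


def trace_to_tasks(trace):
    """Expand one recorded trace into one prediction task per turn."""
    tasks = []
    for i, (action, observation) in enumerate(trace):
        tasks.append(
            PredictionTask(history=list(trace[:i]), action=action, ground_truth=observation)
        )
    return tasks


# A made-up three-turn terminal trace.
trace = [
    ("ls", "README.md"),
    ("cat README.md", "# Demo"),
    ("git status", "nothing to commit"),
]
tasks = trace_to_tasks(trace)
for task in tasks:
    print(len(task.history), task.action, "->", task.ground_truth)
```

Each task's history grows by one turn, which is why long trajectories stress a world model's consistency as well as its single-step accuracy.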
