Provides 2,170 reference-grounded evaluation samples across seven agent domains (MCP, Search, Terminal, SWE, Android, Web, OS) for scoring language world models on five dimensions: Format, Factuality, Consistency, Realism, and Quality. Includes per-domain JSONL files, judge prompts, and an evaluation script for reproducible scoring.
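
As a rough illustration of working with per-domain JSONL data, the sketch below parses JSONL records and averages judge scores per dimension. It is not the bundled evaluation script. The `<domain>.jsonl` file naming, the `scores` field, and the numeric score values are all assumptions made for the example; check the actual files for the real schema.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

# Domain and dimension names are taken from the description above.
DOMAINS = ["MCP", "Search", "Terminal", "SWE", "Android", "Web", "OS"]
DIMENSIONS = ["Format", "Factuality", "Consistency", "Realism", "Quality"]


def read_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def mean_scores(records: Iterable[dict]) -> Dict[str, float]:
    """Average each dimension over the records that carry a score for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        scores = rec.get("scores", {})  # hypothetical field name
        for dim in DIMENSIONS:
            if dim in scores:
                totals[dim] += float(scores[dim])
                counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in DIMENSIONS if counts[dim]}


def load_domain(data_dir: Path, domain: str) -> List[dict]:
    """Read one per-domain file; the '<domain>.jsonl' naming is an assumption."""
    path = data_dir / f"{domain.lower()}.jsonl"
    with path.open(encoding="utf-8") as fh:
        return read_jsonl(fh)


if __name__ == "__main__":
    # Inline sample records stand in for a real file; values are made up.
    sample = [
        '{"id": "a", "scores": {"Format": 1, "Quality": 3}}',
        '{"id": "b", "scores": {"Format": 0, "Quality": 5}}',
    ]
    print(mean_scores(read_jsonl(sample)))
```

With real data, `load_domain(Path("data"), "Terminal")` would replace the inline sample, where `data` is a placeholder for wherever the JSONL files live.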