Provides 4,659 agentic single-turn SFT training pairs extracted from Claude Fable‑5, formatted as a single-column Parquet file for Qwen-style fine-tuning. Includes explicit chain-of-thought (<think>) blocks and XML-serialized <tool_use> calls, with PII redacted. Licensed under AGPL-3.0.
Assesses whether coding agents can generate complete, playable games end-to-end inside the Godot engine. Implements an interaction-grounded evaluation (replayed demonstrations plus rubric-guided multimodal judging) across 140 tasks and 15 game families; top agents score about 41%.
Evaluates multimodal LLMs' ability to reconstruct past observations and act in controllable non-Markov games. Introduces RNG-Bench with two games (Matching Pairs, 3D Maze), three controllable difficulty axes, a head-to-head duel protocol, and a Memory Gap metric to separate forgetting from action errors.
Evaluates procedural skill evolution in LLM agents. Isolates reusable skill bodies, role-specific work surfaces, and hidden oracle assets to measure whether skill refinements transfer across tasks, roles, and model backbones. Includes 382 workplace tasks, 22 skills, and a controlled evaluation protocol.