Most code-generation benchmarks stop at static compilation or unit tests. Building a playable game exposes a different failure mode: integration across scripts, scenes, assets, rendering, and runtime interaction. GameCraft-Bench reframes end-to-end game generation as an interaction problem and operationalizes three evaluation desiderata (Engine Grounding, Artifact Completeness, and Interactive Verification), so that success means observable, replayable gameplay that matches a specification.
Key Findings
- End-to-end game generation remains challenging: the benchmark contains 140 Godot tasks across 15 game families, and the strongest evaluated agent achieves only 41.46% overall. The gap lies beyond single-file code correctness: agents often produce partial mechanics but not full, playable artifacts.
- Common failure modes are sparse content, missing or incorrect visual feedback, and broken scene wiring. These issues typically surface only when the game is executed and interacted with, not under static tests.
- The interaction-grounded evaluation (replayed demos + rubric-guided multimodal judging) reveals runtime and UX problems that automated static checks miss, making it a stricter test of real-world agent capabilities.
- By focusing on a real engine (Godot) and multimodal artifacts (scripts, scenes, assets, rendering), the benchmark forces agents to coordinate cross-file logic, asset references, and runtime behavior—key properties for deployable game generation systems.
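To make "broken scene wiring" concrete, here is a minimal sketch of one narrow pre-check: scanning a Godot text scene (`.tscn`) for `ext_resource` entries whose `res://` paths do not exist in the project folder. This is not part of GameCraft-Bench's evaluation pipeline as far as the summary states; the function name `missing_resources`, the sample scene, and the file names are hypothetical, and the check assumes the Godot 4 text scene layout. It catches only dangling file references, so it cannot replace the replayed, interaction-level verification described above.

```python
import re
import tempfile
from pathlib import Path

# Matches the path="res://..." attribute inside an [ext_resource ...] header line.
EXT_RESOURCE = re.compile(
    r'^\[ext_resource\b[^\]]*\bpath="(res://[^"]+)"', re.MULTILINE
)


def missing_resources(scene_text: str, project_root: Path) -> list[str]:
    """Return res:// paths referenced by the scene that are absent on disk.

    Hypothetical helper for illustration; not an API of Godot or the benchmark.
    """
    missing = []
    for res_path in EXT_RESOURCE.findall(scene_text):
        # res:// is rooted at the project folder, so strip the scheme prefix.
        local = project_root / res_path[len("res://"):]
        if not local.is_file():
            missing.append(res_path)
    return missing


# Illustrative scene: one script that exists, one texture that does not.
SCENE = '''[gd_scene load_steps=3 format=3]

[ext_resource type="Script" path="res://player.gd" id="1_abc"]
[ext_resource type="Texture2D" path="res://icon.png" id="2_def"]

[node name="Player" type="CharacterBody2D"]
script = ExtResource("1_abc")
'''

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "player.gd").write_text("extends CharacterBody2D\n")
    missing = missing_resources(SCENE, root)

print(missing)  # ['res://icon.png']
```

A scene can pass this check and still be unplayable (for example, a script attached to the wrong node or a signal that is never connected), which is why the benchmark's findings stress execution and interaction over static inspection.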
Who it's for and tradeoffs
Great fit if you research agentic code generation, multimodal evaluation, or agent verification methods and need a benchmark that measures interactive end-to-end outcomes rather than just synthesis accuracy. Look elsewhere if your goal is narrower (e.g., unit-level code synthesis or single-file algorithms), since GameCraft-Bench requires engine setup, execution traces, and multimodal judging, which increases evaluation complexity and compute cost. The framework is valuable for pinpointing integration and UX failures but is heavier to run and score than text-only benchmarks.
Where it fits
GameCraft-Bench complements prior game-development benchmarks and frameworks by emphasizing engine-grounded, interaction-level verification in Godot. Use it alongside toolkits that automate execution and visual feedback capture when your research priority is producing fully playable, demonstrably interactive artifacts rather than isolated code snippets.
