Provides a benchmark and protocol for evaluating agents that iteratively edit executable policies under a fixed interaction budget, recording full execution–feedback–revision trajectories. The benchmark is built from compact RL environments and includes trajectory-level diagnostics and hidden held-out validation.
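The edit loop under a fixed interaction budget can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `Step` record, `run_revision_loop`, and the toy `toy_evaluate` / `toy_revise` functions are all hypothetical names, and the "environment" is a one-line scoring rule standing in for a compact RL environment.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One execute-feedback-revise record (hypothetical schema)."""
    policy_source: str
    feedback: float


def run_revision_loop(
    initial_policy: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str, float], str],
    budget: int,
) -> List[Step]:
    """Run exactly `budget` evaluations, logging every policy version."""
    trajectory: List[Step] = []
    policy = initial_policy
    for _ in range(budget):
        feedback = evaluate(policy)        # spends one unit of the budget
        trajectory.append(Step(policy, feedback))
        policy = revise(policy, feedback)  # the agent edits the policy
    return trajectory


def toy_evaluate(policy_source: str) -> float:
    """Execute the policy source and score it against a toy target of 3."""
    namespace: dict = {}
    exec(policy_source, namespace)         # toy only: never exec untrusted code
    return float(-abs(namespace["act"](0) - 3))


def toy_revise(policy_source: str, feedback: float) -> str:
    """Stand-in for the agent: bump the returned constant until feedback is 0."""
    if feedback == 0:
        return policy_source
    constant = int(re.search(r"return (-?\d+)", policy_source).group(1))
    return f"def act(obs):\n    return {constant + 1}"


trajectory = run_revision_loop(
    "def act(obs):\n    return 0", toy_evaluate, toy_revise, budget=5
)
print([step.feedback for step in trajectory])
```

Because every policy version is stored alongside its feedback, the full trajectory can be inspected after the run, which is the property the trajectory-level diagnostics rely on. Held-out validation is not modeled in this sketch.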
Introduces a bounded-memory, typed-retrieval contract for long-horizon LLM agents and evaluates it in Slay the Spire 2. The contract assembles each per-decision prompt from five typed slots rather than appending raw transcripts. Key outputs include memory layers that can be ablated individually, 298 labeled trajectories, and reproducible analysis scripts.
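One way to picture prompt assembly from typed slots is the sketch below. It rests on several assumptions: the summary does not name the five slots, so the placeholders `slot_1` through `slot_5` are used; the `TypedMemory` class and its methods are hypothetical; and the bound is enforced here by simple oldest-first eviction per slot, which may differ from the retrieval rule the contract actually specifies.

```python
from collections import deque
from typing import Deque, Dict, List, Sequence


class TypedMemory:
    """Bounded store with one fixed-capacity queue per slot type (a sketch)."""

    def __init__(self, slot_names: Sequence[str], capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full,
        # so total memory never exceeds len(slot_names) * capacity entries.
        self._slots: Dict[str, Deque[str]] = {
            name: deque(maxlen=capacity) for name in slot_names
        }

    def write(self, slot: str, entry: str) -> None:
        """Append an entry to a known slot; unknown slot types are rejected."""
        if slot not in self._slots:
            raise KeyError(f"unknown slot type: {slot}")
        self._slots[slot].append(entry)

    def read(self, slot: str) -> List[str]:
        """Return the current contents of one slot, oldest first."""
        return list(self._slots[slot])

    def assemble_prompt(self, decision: str) -> str:
        """Build a prompt from the slots only, never from a raw transcript."""
        sections = []
        for name, entries in self._slots.items():
            body = "\n".join(f"- {e}" for e in entries) or "- (empty)"
            sections.append(f"[{name}]\n{body}")
        sections.append(f"[decision]\n{decision}")
        return "\n\n".join(sections)


# Placeholder slot names: the real five slot types are not given in the summary.
SLOT_NAMES = [f"slot_{i}" for i in range(1, 6)]
memory = TypedMemory(SLOT_NAMES, capacity=2)
for turn in range(1, 4):
    memory.write("slot_1", f"note from turn {turn}")
prompt = memory.assemble_prompt("Choose the next action.")
print(prompt)
```

With this layout the prompt length is bounded by the slot capacities regardless of how long the episode runs, and a single memory layer can be ablated by leaving its slot out of `slot_names`.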