Most long-horizon agent evaluations either append entire transcripts, so prompts grow without bound, or rely on opaque memory systems whose components are hard to isolate. This paper reframes agent memory as a contract: every decision sees a freshly composed user message built from typed evidence slices, which keeps per-decision prompts bounded while making individual memory components ablatable and inspectable. The result is a reproducible testbed that isolates which kinds of stored evidence actually change long-horizon behavior.
Key Findings
- Typed, per-decision composition: prompts are assembled from five slots (L1 fixed protocol, L2 state schemas/legal formats, L3 retrieved rules, L4 episodic summaries, L5 triggered strategic skills). This design bounds prompt size while letting each layer be enabled, frozen, or ablated in isolation, so each layer's contribution can be measured.
- Empirical signal on a hard long-horizon game: on Slay the Spire 2 at the easiest ascension (A0), the no-store baseline won 3/10 runs, while enabling triggered strategic skills (L5) raised wins to 6/10 under the paper's harness. The authors caution that this sample supports a directional claim rather than a statistically decisive one, and they report Fisher test p-values for context.
- Reproducible artifacts: the release includes 298 completed, condition-tagged trajectories, SHA-anchored L4/L5 snapshots, decision-time prompt records, and Wilson/bootstrap analysis scripts so researchers can re-aggregate or re-slice the evaluation surface.
- Cross-backbone and ladder probes: the setup supports swapping the underlying LLM backbone (the paper reports probes with multiple model families) and climbing the difficulty ladder (A6–A8 probes) to test where memory layers matter most.
- Separation and diagnostics: the bounded, typed contract turns “how much history fits” into “which typed evidence is selected,” enabling clearer ablations of rules, episodes, and skills than accumulating-context prompts.
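The per-decision composition described in the findings above can be sketched roughly as follows. This is a minimal illustration, not the paper's harness: the layer names L1 through L5 come from the summary, while `EvidenceSlice`, `compose_prompt`, the bracketed section headers, and the per-layer character budgets are hypothetical choices made for this example. It covers enabling and ablating layers only; freezing a layer would amount to passing the same stored slices on every run.

```python
from dataclasses import dataclass

# Layer names follow the summary; the character budgets are invented for illustration.
LAYER_BUDGETS = {"L1": 400, "L2": 400, "L3": 300, "L4": 300, "L5": 300}


@dataclass(frozen=True)
class EvidenceSlice:
    """One typed piece of evidence (hypothetical type, not from the paper)."""
    layer: str  # one of the keys in LAYER_BUDGETS
    text: str


def compose_prompt(slices, enabled=frozenset(LAYER_BUDGETS)):
    """Build one user message from typed slices, in fixed layer order.

    Each enabled layer is truncated to its own budget, so the total
    prompt length is bounded no matter how much evidence is stored.
    """
    sections = []
    for layer, budget in LAYER_BUDGETS.items():
        if layer not in enabled:
            continue  # ablated layer: contributes nothing to this decision
        body = "\n".join(s.text for s in slices if s.layer == layer)
        if body:
            sections.append(f"[{layer}]\n{body[:budget]}")
    return "\n\n".join(sections)


slices = [
    EvidenceSlice("L1", "Reply with exactly one legal action."),
    EvidenceSlice("L3", "Rule: block is removed at the start of your turn."),
    EvidenceSlice("L5", "Skill: against multi-hit enemies, prioritize block."),
]

full = compose_prompt(slices)
# Ablate the skill layer while leaving every other layer untouched.
no_skills = compose_prompt(slices, enabled=frozenset(LAYER_BUDGETS) - {"L5"})
```

Because each layer has its own budget, ablating one layer changes only that section of the prompt, which is what makes a condition such as "no L5" comparable to the full configuration.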
Who it's for & tradeoffs
Great fit if you want an auditable, ablation-friendly benchmark for agent memory that keeps per-decision prompts bounded and yields reusable artifacts for reproducible research. It helps researchers quantify which explicit memory layers affect long-horizon decision quality and compare designs across model families.
Look elsewhere if you need a direct, head-to-head comparison with unbounded accumulating-context agents: the paper treats accumulating-context baselines as operational comparisons and marks matched accumulating-context experiments as future work. The same applies if you require statistically decisive, large-sample claims about win-rate differences without running additional trials. The Slay the Spire 2 task is stochastic and results depend on run sampling and scaffold choices, so expect sensitivity to evaluation protocols.
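To make the sample-size caveat concrete, the sketch below recomputes a two-sided Fisher exact test and 95% Wilson intervals from the reported counts (3/10 wins for the baseline, 6/10 with L5 enabled). It is an independent recalculation using only the Python standard library, not the paper's analysis scripts, and the function names are hypothetical. The paper may report a different variant, for example a one-sided test.

```python
from math import comb, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one, holding the margins fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def pmf(k):
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Rows: condition (baseline, L5 enabled); columns: (wins, losses).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
baseline_ci = wilson_interval(3, 10)
skills_ci = wilson_interval(6, 10)
print(f"p = {p_value:.3f}")
print(f"baseline 95% CI = ({baseline_ci[0]:.2f}, {baseline_ci[1]:.2f})")
print(f"L5       95% CI = ({skills_ci[0]:.2f}, {skills_ci[1]:.2f})")
```

With ten runs per condition, the two-sided p-value comes out near 0.37 and the two Wilson intervals overlap substantially (roughly 0.11 to 0.60 versus 0.31 to 0.83), which is consistent with reading the 3/10 versus 6/10 result as directional rather than decisive.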
Where it fits
This work sits between structured-memory and skill-library agent efforts: it is less about a single compression or retrieval algorithm and more about defining an interface contract that makes memory components modular and testable. Its reproducible archive is meant as a diagnostic complement to benchmarks that rely on ever-growing transcripts or only on similarity-based retrieval.
