Most interactive benchmarks either expose the full state or test recall only after an episode ends; both choices mask whether a model can remember and act on observations that are no longer visible. This work isolates reconstructive memory as a first-class competency by requiring multi-step interaction in which past visual inputs must be reconstructed and used online.
Key Findings
- Benchmark design: RNG-Bench comprises two complementary tasks: Matching Pairs, in which card identities are briefly revealed at fixed locations, and 3D Maze, in which egocentric views must be integrated into a spatial map. Both are evaluated under a unified harness with three controlled difficulty axes (grid size, visual pattern, observation modality).
- Evaluation protocol: A head-to-head duel controls for instance-level variance, and the Memory Gap metric separates forgetting (loss of a stored observation) from poor action selection, which makes diagnosis more precise than aggregate success rates.
- Empirical results: Hard configurations demand ~128K-token contexts and ~350 image inputs per episode; state-of-the-art multimodal LLMs remain far from saturation, and most residual errors stem from forgetting earlier observations rather than from suboptimal decisions.
- Transfer and training: Fine-tuning Qwen3.5-9B on optimal-policy rollouts plus filtered demonstrations improved RNG-Bench performance and transferred to existing benchmarks without degrading other multimodal capabilities.
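
To make the forgetting versus action-selection distinction concrete, here is a minimal toy sketch of a Matching Pairs game in Python. It is not the RNG-Bench harness, and it does not implement the paper's Memory Gap metric, whose definition this summary does not give. The names `MatchingPairs`, `play`, and `capacity` are hypothetical, and the agent uses symbolic card identities rather than images. The sketch holds the policy fixed and varies only how many past reveals the agent retains, so any extra turns can be attributed to forgetting.

```python
import random


class MatchingPairs:
    """Toy board (hypothetical): cards are hidden except while flipped."""

    def __init__(self, n_pairs, seed=0):
        self.cards = list(range(n_pairs)) * 2
        random.Random(seed).shuffle(self.cards)
        self.matched = set()

    def flip(self, pos):
        """Briefly reveal one card; the caller has to remember it."""
        return self.cards[pos]

    def try_match(self, a, b):
        """Two distinct positions stay face up only if identities match."""
        if a != b and self.cards[a] == self.cards[b]:
            self.matched.update((a, b))
            return True
        return False

    def done(self):
        return len(self.matched) == len(self.cards)


def play(n_pairs, capacity=None, seed=0, max_turns=10_000):
    """Return the number of turns a fixed policy needs to clear the board.

    capacity=None remembers every reveal; capacity=k keeps only the k most
    recent reveals (an assumed, crude stand-in for forgetting).
    """
    rng = random.Random(seed + 1)
    board = MatchingPairs(n_pairs, seed=seed)
    memory = {}  # position -> identity, oldest entry first

    def remember(pos, ident):
        memory.pop(pos, None)
        memory[pos] = ident
        while capacity is not None and len(memory) > capacity:
            memory.pop(next(iter(memory)))  # evict the oldest reveal

    def known_pair():
        seen = {}
        for pos, ident in memory.items():
            if ident in seen:
                return seen[ident], pos
            seen[ident] = pos
        return None

    turns = 0
    while not board.done() and turns < max_turns:
        turns += 1
        pair = known_pair()
        if pair is None:
            # Explore: flip a card the agent does not currently remember.
            hidden = [p for p in range(2 * n_pairs)
                      if p not in board.matched and p not in memory]
            a = rng.choice(hidden)
            ident = board.flip(a)
            partner = [p for p, i in memory.items() if i == ident]
            b = partner[0] if partner else rng.choice(
                [p for p in hidden if p != a])
            remember(a, ident)
            remember(b, board.flip(b))
        else:
            a, b = pair  # exploit a pair the agent still remembers
        if board.try_match(a, b):
            memory.pop(a, None)
            memory.pop(b, None)
    return turns


if __name__ == "__main__":
    seeds = range(20)
    perfect = sum(play(8, None, s) for s in seeds) / len(seeds)
    forgetful = sum(play(8, 2, s) for s in seeds) / len(seeds)
    print(f"mean turns, perfect memory: {perfect:.1f}")
    print(f"mean turns, capacity=2:     {forgetful:.1f}")
```

Because both agents follow the same policy, the difference in mean turns is a rough proxy for the cost of forgetting alone. With unlimited memory this policy needs at most two turns per pair, which gives a reference point against which a bounded-memory run can be compared.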
Who this helps and tradeoffs
Great fit if you develop multimodal LLMs, memory modules for embodied or agentic systems, or diagnostics that separate memory failures from policy failures. The benchmark is especially useful for controlled research into long-horizon visual memory and for evaluating training or fine-tuning strategies aimed at reconstructive recall. Look elsewhere if you need real-world, noisy embodied interaction that falls outside the distribution of grid and maze abstractions, or if you cannot afford the compute and data demands (the hardest settings involve very long contexts and hundreds of images per episode).
