Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Evaluates multimodal LLMs' ability to reconstruct past observations and act in controllable non-Markov games. Introduces RNG-Bench with two games (Matching Pairs, 3D Maze), three controllable difficulty axes, a head-to-head duel protocol, and a Memory Gap metric to separate forgetting from action errors.

Introduction

Most interactive benchmarks either expose the full state or test recall only after an episode ends, which masks whether models can remember and act on observations that are no longer visible. This work isolates reconstructive memory as a first-class competency by requiring multi-step interaction in which past visual inputs must be reconstructed and used online.

Key Findings

Benchmark design: RNG-Bench comprises two complementary tasks, Matching Pairs (card identities are revealed briefly at fixed locations) and 3D Maze (egocentric views require building a spatial map), evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality.
Evaluation protocol: A head-to-head duel controls for instance-level variance, and the Memory Gap metric disentangles forgetting (loss of a stored observation) from poor action selection, making diagnosis more precise than aggregate success rates.
Empirical results: Hard configurations demand ~128K-token contexts and ~350 image inputs per episode. State-of-the-art multimodal LLMs remain far from saturation, and most residual errors stem from forgetting earlier observations rather than from suboptimal decisions.
Transfer and training: Fine-tuning Qwen3.5-9B on optimal-policy rollouts plus filtered demonstrations improved RNG-Bench performance and transferred to existing benchmarks without degrading other multimodal capabilities.
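
To make the duel idea concrete, the following is a minimal sketch and not the RNG-Bench harness. It assumes a simplified, text-only Matching Pairs game, and every name in it (make_board, find_known_pair, play, duel) is hypothetical. Two agents play the same seeded board, so differences in their turn counts come from the agents rather than from the instance. The averaged turn gap between a perfect-memory agent and a memoryless one is only an illustrative stand-in; this summary does not give the paper's exact Memory Gap definition.

```python
import random
from typing import Dict, List, Optional, Tuple


def make_board(n_pairs: int, seed: int) -> List[int]:
    """Return a shuffled list holding each card identity exactly twice."""
    cards = list(range(n_pairs)) * 2
    random.Random(seed).shuffle(cards)  # same seed -> same board
    return cards


def find_known_pair(seen: Dict[int, int]) -> Optional[Tuple[int, int]]:
    """Return two remembered positions with the same identity, if any."""
    first_pos: Dict[int, int] = {}
    for pos, ident in sorted(seen.items()):
        if ident in first_pos:
            return first_pos[ident], pos
        first_pos[ident] = pos
    return None


def play(board: List[int], use_memory: bool, seed: int) -> int:
    """Play one episode and return the number of turns (two flips each)."""
    rng = random.Random(seed)
    remaining = set(range(len(board)))
    seen: Dict[int, int] = {}  # position -> identity; stays empty without memory
    turns = 0
    while remaining:
        turns += 1
        pair = find_known_pair(seen)
        if pair is not None:
            a, b = pair  # both locations are remembered: collect the pair
        else:
            # Flip a card whose identity is not remembered. For the memoryless
            # agent, that is every remaining card.
            pool = sorted(p for p in remaining if p not in seen)
            a = rng.choice(pool)
            # If the partner of card `a` was seen earlier, flip it; else explore.
            match = next((p for p, v in seen.items() if v == board[a]), None)
            b = match if match is not None else rng.choice([p for p in pool if p != a])
        if use_memory:
            seen[a], seen[b] = board[a], board[b]
        if board[a] == board[b]:  # matched cards leave the board
            remaining -= {a, b}
            seen.pop(a, None)
            seen.pop(b, None)
    return turns


def duel(n_pairs: int, seeds: List[int]) -> Tuple[Dict[str, int], float]:
    """Run both agents on identical boards; return win counts and mean turn gap."""
    wins = {"memory": 0, "memoryless": 0, "tie": 0}
    total_gap = 0
    for seed in seeds:
        board = make_board(n_pairs, seed)  # shared instance for both agents
        t_mem = play(board, use_memory=True, seed=seed)
        t_none = play(board, use_memory=False, seed=seed)
        total_gap += t_none - t_mem
        if t_mem < t_none:
            wins["memory"] += 1
        elif t_none < t_mem:
            wins["memoryless"] += 1
        else:
            wins["tie"] += 1
    return wins, total_gap / len(seeds)
```

Calling duel(8, list(range(50))) returns the per-agent win counts and the mean number of extra turns the memoryless agent needs. Pairing the agents on identical boards is the design choice that matters here: an unusually easy or hard shuffle affects both sides of the comparison equally.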

Who this helps and tradeoffs

A good fit if you develop multimodal LLMs, memory modules for embodied or agentic systems, or diagnostics that separate memory failures from policy failures. The benchmark is especially useful for controlled research into long-horizon visual memory and for evaluating training or fine-tuning strategies aimed at reconstructive recall. Look elsewhere if you need real-world, noisy embodied interactions that fall outside the distribution of grid and maze abstractions, or if you cannot afford the compute and data demands (the hardest settings involve very long contexts and hundreds of images per episode).

Information

Website: arxiv.org
Organizations: Fudan University, Shanghai Innovation Institute, Shanghai Artificial Intelligence Laboratory, Zhejiang University, The Chinese University of Hong Kong
Authors: Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang
Published date: 2026/06/17

More Items

AI Video Papers, 2026

VideoCoCo: Code-as-CoT for Physically-Consistent Video Generation via an Agentic Dual-Engine System

Haodong Li, Tianfei Ren +26

Converts text prompts into physically consistent videos by synthesizing executable Blender programs as a process-level chain-of-thought and using a dual-engine pipeline (deterministic simulation draft + draft-conditioned video editor). Ships with a VideoCoCo-3K draft–instruction–target dataset and shows substantial gains in physical-consistency benchmarks.

video ai-video code coding coding-agents+5

Computer Vision Papers, 2026

PhiZero: A World Model Built Around Physical Language

Shuyao Shang, Yuqi Wang +5

Learns a discrete “physical language” from unlabeled videos and uses a reason-then-render pipeline: predict compact state-transition tokens, then decode them into future video. Separates dynamics inference from pixel synthesis to improve physical fidelity, controllable simulation, and zero-shot motion transfer.

paper video vision physics ai-video+4

Computer Vision Papers, 2026

CLBench-V: Evaluating Multimodal Context Learning from Grounding to Knowledge Acquisition

Lai Wei, Chengqi Li +4

Evaluates multimodal context learning across grounding, new information application, and knowledge acquisition using a 3,443-instance benchmark spanning science, finance, long documents, spatial reasoning, and web VQA; finds current multimodal models perform poorly (best score 0.2847) and analyzes failure modes.

multimodal benchmark vision evaluation paper+4