AIAny - SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Interactive spatial reasoning, the ability of multimodal agents to perceive, explore, and act in physical environments, is a capability that most static VQA benchmarks and simulator-specific tests fail to measure. SpatialWorld reframes evaluation around real-world-style tasks in which agents must actively gather egocentric visual evidence and issue text-based actions under partial observability, exposing planning and exploration failures that single-turn benchmarks miss.

Key Findings

Unified, simulator-agnostic protocol: SpatialWorld integrates eight heterogeneous simulation backends under one API so models are evaluated on the same task semantics rather than simulator-specific pipelines. This makes cross-backend comparisons meaningful.
Realistic task design: 760 human-annotated tasks span household routines, travel, and social collaboration. Each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier to reduce ambiguity in scoring.
Hard baseline performance: across the 15 advanced agents evaluated, the top closed-source model (GPT-5) achieves an average task success rate (TSR) of only 17.4%, and the top open-source model (Qwen-3.5) reaches 14.1%, leaving substantial headroom.
Bottlenecks identified: active exploration under vision-only partial observability and long-horizon planning are primary failure modes; task success often does not correlate with execution efficiency, and performance varies significantly by domain.
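
To make the TSR metric concrete, here is a minimal sketch of how success rates could be tallied from verifier verdicts. The `EpisodeResult` record, its field names, and the domain labels are hypothetical, and the summary does not say how the reported average is aggregated, so the plain per-episode mean below is an assumption rather than the benchmark's scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one agent attempt at one task (hypothetical record)."""
    task_id: str
    domain: str    # illustrative labels such as "household" or "travel"
    success: bool  # verdict of the terminal-state verifier
    steps: int     # number of actions the agent issued


def task_success_rate(results):
    """Fraction of episodes whose terminal state passed verification."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.success for r in results) / len(results)


def tsr_by_domain(results):
    """Group episodes by domain and compute TSR for each group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.domain].append(r)
    return {domain: task_success_rate(rs) for domain, rs in groups.items()}


# Made-up toy data, not results from the benchmark.
results = [
    EpisodeResult("t1", "household", True, 12),
    EpisodeResult("t2", "household", False, 40),
    EpisodeResult("t3", "travel", False, 25),
    EpisodeResult("t4", "social", False, 31),
]
overall = task_success_rate(results)
per_domain = tsr_by_domain(results)
```

Keeping `steps` alongside `success` lets the same records be used to check the finding that success and execution efficiency are often uncorrelated, and the per-domain breakdown surfaces the domain variance noted above.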

Who it's for & tradeoffs

Great fit if you are a researcher building or evaluating embodied/multimodal agents and need a cross-simulator, task-level benchmark that emphasizes interactive decision-making and end-to-end task success. It is valuable for diagnosing exploration strategies, action grounding, and long-horizon planning in MLLMs.

Look elsewhere if your goal is hardware-in-the-loop robotics evaluation (SpatialWorld is simulator-based), single-image VQA, or micro-benchmarks focused only on perception accuracy rather than interactive task completion. Also expect integration effort to connect new simulators or custom environments to the unified protocol.

Where it fits

Compared with static VQA suites and simulator-specific leaderboards, SpatialWorld sits between perception benchmarks and full robotics evaluation: it stresses interactive, language-native action interfaces and terminal verification across diverse simulated domains, making it a rigorous testbed for research that aims to close the gap between perception and goal-directed behavior.

Methodology highlights

Agents operate under vision-only partial observability and submit text-based actions native to MLLMs. Each task provides a reference trajectory (for diagnostics) and a terminal-state verifier (for reliable automatic scoring). The dataset and protocol emphasize reproducible, comparable evaluation rather than simulator-dependent heuristics.
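
The interaction pattern described above (observe, issue a text action, verify the terminal state) can be sketched as follows. This is a toy illustration under stated assumptions, not SpatialWorld's actual API: `Backend`, `CorridorBackend`, `run_episode`, the action strings, and the step budget are all hypothetical, and a short string stands in for the egocentric image an agent would really receive.

```python
from dataclasses import dataclass
from typing import Protocol


class Backend(Protocol):
    """Hypothetical minimal interface a simulator adapter might expose."""
    def reset(self) -> str: ...
    def step(self, action: str) -> str: ...
    def state(self) -> dict: ...


class CorridorBackend:
    """Toy stand-in for a simulator: the agent walks a 1-D corridor."""

    def __init__(self, length: int = 4):
        self.length = length
        self.pos = 0

    def reset(self) -> str:
        self.pos = 0
        return self._observe()

    def step(self, action: str) -> str:
        # Text actions are parsed by the backend; unknown actions are no-ops.
        if action == "move forward":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "move back":
            self.pos = max(self.pos - 1, 0)
        return self._observe()

    def state(self) -> dict:
        return {"pos": self.pos}

    def _observe(self) -> str:
        # Partial observability: only the current cell is visible.
        return "door" if self.pos == self.length - 1 else "wall"


@dataclass
class Episode:
    success: bool
    actions: list


def run_episode(backend: Backend, policy, verifier, max_steps: int = 20) -> Episode:
    """Roll out a policy, then score only the terminal state."""
    obs = backend.reset()
    actions = []
    for _ in range(max_steps):
        action = policy(obs)
        if action == "stop":
            break
        actions.append(action)
        obs = backend.step(action)
    return Episode(success=verifier(backend.state()), actions=actions)


def explore_policy(obs: str) -> str:
    """Scripted stand-in for an MLLM: keep exploring until the door is seen."""
    return "stop" if obs == "door" else "move forward"


def at_door(state: dict) -> bool:
    """Terminal-state verifier for the toy task: end at the last cell."""
    return state["pos"] == 3


episode = run_episode(CorridorBackend(length=4), explore_policy, at_door)
truncated = run_episode(CorridorBackend(length=4), explore_policy, at_door, max_steps=2)
```

Because the verifier inspects only the final state, any action sequence that reaches the goal counts as a success, which is why a reference trajectory is useful for diagnostics but is not needed for scoring. The `truncated` run shows how a tight step budget turns insufficient exploration into a failed episode.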
