Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Provides a dual-path approach for spatial vision-language models: a Language-Only Reasoning (LOR) path for stepwise linguistic deduction and a Detect-Then-Reason (DTR) path that detects 3D cues via region tokens before numerical inference. Trains with chain-of-thought cold-start supervision and reinforcement learning to improve 3D grounding and multi-step spatial reasoning.

Introduction

Spatial queries in images and scenes are heterogeneous: some can be solved by stepwise linguistic deduction, while others require explicit 3D grounding before any reliable quantitative inference. SR-ReaL, the method summarized here, tackles this mismatch by teaching a single spatial VLM two complementary strategies and by using a staged training recipe that stabilizes reinforcement learning for process-level reasoning.

Key Findings

Dual-path design: a Language-Only Reasoning (LOR) path handles compositional, linguistic deductions; a Detect-Then-Reason (DTR) path injects explicit 3D cues (object centers or boxes) via region tokens before arithmetic or metric inference.
Two-stage training: a cold-start supervised stage constructs chain-of-thought traces for both LOR and DTR and exposes a region-to-3D grounding interface; a subsequent RL stage (GRPO-style) jointly optimizes both paths with accuracy and format rewards, plus a discrete detection reward for DTR.
Complementary strengths: DTR improves region-aware tasks through more precise 3D localization; LOR improves general spatial problem solving when geometric grounding is unnecessary. Joint training yields positive transfer between paths.
Practical recipe: blending 2D/3D grounding data with general VQA during cold-start is critical for stable RL optimization and cross-domain generalization.
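
The reward terms named in the two-stage training item above (accuracy, format, and a discrete detection reward for DTR) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the `<think>`/`<answer>` tag layout, the distance thresholds, the reward levels, the equal weighting of the terms, and all function names.

```python
import math
import re

# Hypothetical (max distance in metres, reward) steps for the detection term.
# The paper's actual thresholds and levels are not given in this summary.
CENTER_STEPS = [(0.25, 1.0), (0.5, 0.5), (1.0, 0.25)]


def detection_reward(pred_center, gt_center):
    """Discrete reward based on Euclidean distance between 3D centers."""
    dist = math.dist(pred_center, gt_center)
    for max_dist, reward in CENTER_STEPS:
        if dist <= max_dist:
            return reward
    return 0.0


def format_reward(response):
    """1.0 if the response follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def accuracy_reward(response, gt_answer):
    """1.0 if the text inside <answer> matches the reference (case-insensitive)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gt_answer.strip().lower() else 0.0


def total_reward(response, gt_answer, pred_center=None, gt_center=None):
    """Sum the terms; the detection term applies only when centers are supplied (DTR)."""
    reward = accuracy_reward(response, gt_answer) + format_reward(response)
    if pred_center is not None and gt_center is not None:
        reward += detection_reward(pred_center, gt_center)
    return reward
```

In this sketch an LOR rollout is scored by calling `total_reward(response, gt_answer)` alone, while a DTR rollout also passes the predicted and reference centers, so a well-formed, correct answer with a poorly localized center still scores lower than one with a tight center.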

How it works (concise)

Cold-start supervised phase supplies structured CoT examples for both reasoning modes and a region-to-3D interface so the model can emit region tokens tied to predicted 3D coordinates.
Reinforcement phase refines policies using group-relative policy optimization with rewards for final answer correctness, output format, and, for DTR, a center-based detection reward that enforces geometric alignment during reasoning.
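
The "group-relative" part of the reinforcement phase can be sketched as follows: several responses are sampled for the same prompt, each receives a scalar reward, and each response's advantage is its reward standardized against the group. This is a minimal sketch of the commonly used GRPO normalization, not the authors' code; the function name, the `eps` value, and the choice of population standard deviation are assumptions (implementations differ on the last point).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    rewards: list of scalar rewards, one per sampled response.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled responses to one spatial question.
group_rewards = [2.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
```

Responses that beat the group average get a positive advantage and are reinforced; those below it are discouraged. When all responses in a group score the same, every advantage is zero and the group contributes no gradient signal.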

Who it's for and trade-offs

Great fit if you need a single VLM to handle a mix of spatial QA types (both region-centric 3D localization and compositional linguistic spatial queries) and you can provide or synthesize grounding annotations for cold-start training.

Look elsewhere if you need a lightweight setup, since the dual-path design and RL tuning increase training complexity, or if no 3D/region supervision is available, since DTR depends on region-to-3D grounding data.

Information

Website: arxiv.org
Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali …
Published date: 2026/06/16

More Items

AI Video Papers (2026)

VideoCoCo: Code-as-CoT for Physically-Consistent Video Generation via an Agentic Dual-Engine System

Haodong Li, Tianfei Ren +26

Converts text prompts into physically consistent videos by synthesizing executable Blender programs as a process-level chain-of-thought and using a dual-engine pipeline (deterministic simulation draft + draft-conditioned video editor). Ships with a VideoCoCo-3K draft–instruction–target dataset and shows substantial gains in physical-consistency benchmarks.

video ai-video code coding coding-agents +5

Computer Vision Papers (2026)

PhiZero: A World Model Built Around Physical Language

Shuyao Shang, Yuqi Wang +5

Learns a discrete “physical language” from unlabeled videos and uses a reason-then-render pipeline: predict compact state-transition tokens, then decode them into future video. Separates dynamics inference from pixel synthesis to improve physical fidelity, controllable simulation, and zero-shot motion transfer.

paper video vision physics ai-video +4

Reinforcement Learning Papers (2026)

CoRT: Counterfactual Replay for Token-Level Rubric-Guided Policy Optimization

Bo-Wen Zhang, Junwei He +6

Allocates token-level credit in rubric-conditioned GRPO by counterfactually replaying the same response under rubric and criteria-free prompts, using tokenwise log-likelihood contrasts to compute bounded, response-normalized weights that redistribute GRPO advantages without training an auxiliary scorer.

RL LLM NLP paper evaluation