Spatial queries in images and scenes are heterogeneous: some can be solved by stepwise linguistic deduction, while others require explicit 3D grounding before any reliable quantitative inference. SR-ReaL tackles this mismatch by teaching a single spatial VLM two complementary strategies and by using a staged training recipe that stabilizes reinforcement learning for process-level reasoning.
Key Findings
- Dual-path design: a Language-Only Reasoning (LOR) path handles compositional, linguistic deductions; a Detect-Then-Reason (DTR) path injects explicit 3D cues (centers / boxes) via region tokens before arithmetic or metric inference.
- Two-stage training: a cold-start supervised stage trains on constructed chain-of-thought traces for both LOR and DTR and exposes a region→3D grounding interface; a subsequent RL stage (GRPO-style) jointly optimizes both paths with accuracy and format rewards, plus a discrete detection reward for DTR.
- Complementary strengths: DTR improves region-aware tasks through more precise 3D localization; LOR improves general spatial problem solving when geometric grounding is unnecessary. Joint training yields positive transfer between paths.
- Practical recipe: blending 2D/3D grounding data with general VQA during cold-start is critical for stable RL optimization and cross-domain generalization.
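To make the DTR idea concrete, here is a minimal sketch of how region tokens carrying 3D centers could be read out of a trace and used for a metric computation. The `<regionK> center=(x, y, z)` syntax and the helper names `parse_region_centers` and `center_distance` are assumptions made for illustration; the source does not specify SR-ReaL's actual token format.

```python
import math
import re
from typing import Dict, Tuple

Point = Tuple[float, float, float]

# Hypothetical trace format: "<region0> center=(1.0, 0.5, 2.0)".
# This syntax is an assumed stand-in, not SR-ReaL's real region-token format.
_NUM = r"(-?\d+(?:\.\d+)?)"
REGION_PATTERN = re.compile(
    r"<region(\d+)>\s*center=\(\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\)"
)


def parse_region_centers(trace: str) -> Dict[int, Point]:
    """Map each region index found in the trace to its predicted 3D center."""
    centers: Dict[int, Point] = {}
    for match in REGION_PATTERN.finditer(trace):
        index = int(match.group(1))
        centers[index] = (
            float(match.group(2)),
            float(match.group(3)),
            float(match.group(4)),
        )
    return centers


def center_distance(centers: Dict[int, Point], a: int, b: int) -> float:
    """Euclidean distance between two grounded region centers."""
    return math.dist(centers[a], centers[b])


# A DTR-style trace grounds the regions first, then reasons over the numbers.
dtr_trace = (
    "<region0> center=(0.0, 0.0, 2.0) "
    "<region1> center=(3.0, 4.0, 2.0) "
    "The distance follows from the two centers."
)
centers = parse_region_centers(dtr_trace)
distance = center_distance(centers, 0, 1)
print(distance)  # 5.0
```

An LOR-style trace would contain no region tokens, so `parse_region_centers` returns an empty dict and the answer has to come from linguistic deduction alone.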
How it works (concise)
- Cold-start supervised phase supplies structured CoT examples for both reasoning modes and a region-to-3D interface so the model can emit region tokens tied to predicted 3D coordinates.
- Reinforcement phase refines policies using group-relative policy optimization with rewards for final answer correctness, output format, and, for DTR, a center-based detection reward that enforces geometric alignment during reasoning.
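The reward structure and the group-relative step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the equal weighting of the three reward terms, the 0.5 distance threshold, and all function names are hypothetical. The advantage computation follows the standard GRPO pattern of normalizing each sample's reward by the mean and standard deviation of its group.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def detection_reward(pred: Point, gt: Point, threshold: float = 0.5) -> float:
    """Discrete center-based reward: 1.0 if the predicted center lies within
    `threshold` of the ground-truth center, else 0.0 (threshold is assumed)."""
    return 1.0 if math.dist(pred, gt) <= threshold else 0.0


def total_reward(
    answer_correct: bool,
    format_ok: bool,
    pred_center: Optional[Point] = None,
    gt_center: Optional[Point] = None,
) -> float:
    """Sum accuracy and format rewards; add the detection term only when
    centers are available (DTR samples). Equal weights are an assumption."""
    reward = float(answer_correct) + float(format_ok)
    if pred_center is not None and gt_center is not None:
        reward += detection_reward(pred_center, gt_center)
    return reward


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same DTR question (toy values).
gt = (1.0, 0.2, 3.0)
rewards = [
    total_reward(True, True, (1.1, 0.2, 3.0), gt),   # correct, well localized
    total_reward(True, True, (2.5, 0.2, 3.0), gt),   # correct, poorly localized
    total_reward(False, True, (2.5, 0.2, 3.0), gt),  # wrong answer, valid format
    total_reward(False, False),                      # wrong answer, bad format
]
advantages = group_relative_advantages(rewards)
```

Because the advantage is relative to the group, a response that answers correctly but localizes poorly is still pushed down when a sibling sample both answers correctly and localizes well, which is how a detection term can enforce geometric alignment during reasoning.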
Who it's for and trade-offs
Great fit if you need a single VLM to handle a mix of spatial QA types, both region-centric 3D localization and compositional linguistic queries, and you can provide or synthesize grounding annotations for cold-start training. Look elsewhere if you need a lightweight pipeline (the dual-path design and RL tuning add training complexity) or if no 3D or region supervision is available, since DTR depends on region-to-3D grounding data.
