Spatial reasoning often fails not because models lack primitives but because their action interface forces rigid, single-pass decisions or overly constrained tool calls. SpatialClaw flips that assumption: give the agent a persistent Python kernel preloaded with frames and perception/geometry primitives, and let it write one executable cell per step conditioned on prior outputs and observations. The core insight is that a code-as-action, stateful interface enables adaptive, compositional analyses that better match the open-ended nature of spatial tasks.
Key Findings
- Stateful code-as-action interface: agents write and execute incremental Python cells with access to perception and geometry functions. This lets them inspect intermediate results, revise strategies, and flexibly compose low-level primitives instead of committing to a single analysis plan.
- Training-free framework: no additional model fine-tuning is required. SpatialClaw can be paired with off-the-shelf VLM backbones and still improves behavior across models without dataset-specific retraining.
- Empirical gains at scale: evaluated on 20 static and dynamic 3D/4D spatial benchmarks, SpatialClaw achieves 59.9% average accuracy and outperforms a recent spatial agent by 11.2 percentage points. The improvement is broad (consistent across six VLM backbones and two model families), indicating that the interface design, not just model capacity, drives practical spatial reasoning gains.
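To make the "stateful" part concrete, here is a minimal sketch of a persistent execution namespace. It is not SpatialClaw's implementation: the `StatefulKernel` class, its `run_cell` method, and the placeholder `frames` list are all hypothetical names chosen for illustration. The only point is that variables defined in one cell remain available to the next, and that errors are returned as text the agent can read.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel (hypothetical, for illustration only)."""

    def __init__(self, preloaded=None):
        # One namespace shared by every cell, so variables survive between steps.
        self.namespace = dict(preloaded or {})

    def run_cell(self, code):
        """Execute one cell and return what it printed, or the traceback on failure."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Report the failure as text instead of crashing the loop.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


# `frames` is placeholder data standing in for preloaded video frames.
kernel = StatefulKernel(preloaded={"frames": ["frame_0", "frame_1", "frame_2"]})
out1 = kernel.run_cell("n = len(frames)\nprint(n)")
out2 = kernel.run_cell("print(n * 2)")           # `n` persists from the previous cell
out3 = kernel.run_cell("print(undefined_name)")  # the error comes back as readable text
```

A real deployment would run cells in a sandboxed process rather than calling `exec` in the host interpreter; the in-process version here is only meant to show the state-sharing idea.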
Who it's for and tradeoffs
Great fit if you are researching or building VLM-based agents that must perform compositional spatial analyses, diagnostics, or stepwise geometric computation across frames. Look elsewhere if your deployment forbids arbitrary code execution, requires minimal runtime latency, or cannot host a persistent Python kernel. SpatialClaw relies on executing model-generated code against preloaded perception primitives, which introduces runtime, safety, and integration considerations.
Where it fits
Compared with single-pass code execution (which commits to a full program up front) and rigid structured tool-call interfaces (which limit composition), SpatialClaw occupies the middle ground: as flexible as free-form code, but organized as one executable cell per decision, which improves adaptability on multi-step 3D/4D tasks.
How it works (brief)
A VLM produces a short Python cell each step, the kernel executes it against preloaded frames and primitives (e.g., detectors, depth/geometry utilities), and outputs become available to subsequent cells and text reasoning. This loop continues until the agent emits a final answer, enabling iterative inspection, correction, and complex geometric manipulations without retraining the model.
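The loop described above can be sketched in a few lines. Everything here is an assumption made for illustration: `detect_objects` and `depth_at` are fake primitives returning hard-coded values, `scripted_policy` replaces the VLM with a fixed list of cells, and the `FINAL_ANSWER` variable is an invented stopping convention, not something the source specifies.

```python
import contextlib
import io


# Hypothetical primitives; real ones would wrap detectors and depth estimators.
def detect_objects(frame):
    """Return fake detections as label -> (x, y) pixel centre."""
    return {"chair": (40, 60), "table": (120, 60)}


def depth_at(frame, xy):
    """Return a fake depth in metres for a pixel location."""
    return {(40, 60): 2.5, (120, 60): 4.0}[xy]


def scripted_policy(step, observation):
    """Stand-in for the VLM. A real agent would condition on `observation`."""
    cells = [
        "dets = detect_objects(frames[0])\nprint(sorted(dets))",
        "depths = {k: depth_at(frames[0], v) for k, v in dets.items()}\nprint(depths)",
        "FINAL_ANSWER = min(depths, key=depths.get)",
    ]
    return cells[step]


def run_agent(policy, namespace, max_steps=5):
    """Run one cell per step in a shared namespace until an answer is set."""
    observation = ""
    for step in range(max_steps):
        cell = policy(step, observation)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(cell, namespace)  # state from earlier cells is visible here
        observation = buffer.getvalue()  # fed back to the policy on the next step
        if "FINAL_ANSWER" in namespace:
            return namespace["FINAL_ANSWER"], step + 1
    return None, max_steps


namespace = {"frames": ["frame_0"], "detect_objects": detect_objects, "depth_at": depth_at}
answer, steps_used = run_agent(scripted_policy, namespace)
print(answer, steps_used)
```

In this toy run the agent first lists detections, then looks up their depths, and only then commits to the nearest object. That ordering is the pattern the interface is meant to allow: each cell is written after seeing the previous cell's output.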
