Robotic manipulation needs explicit 3D geometric reasoning for contact-rich tasks, yet many recent vision-language-action and world-action models work primarily in 2D image space or 2D-derived latents. GAM’s core insight is to repurpose a pretrained geometric foundation model (GFM) as a single shared substrate for perception, temporal prediction, and action decoding by splitting the backbone at an intermediate layer and inserting a causal future predictor conditioned on language, proprioception, and action history.
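To make the split-backbone data flow concrete, here is a minimal sketch in plain Python. It is not GAM's implementation: the class `SplitBackbonePolicy`, the field names, the toy blocks, and the toy predictor are all hypothetical stand-ins. A real system would presumably use tensors and pretrained transformer blocks rather than lists of floats. The sketch only shows the routing: shallow blocks encode observations, a predictor that sees only past tokens plus conditioning forecasts future latent tokens, and the remaining blocks feed both a geometry decoder and an action decoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]                  # stand-in for a latent token tensor
Block = Callable[[Tokens], Tokens]    # stand-in for one backbone block


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply a sequence of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


@dataclass
class SplitBackbonePolicy:
    """Hypothetical wrapper that splits a backbone at an intermediate layer."""
    blocks: Sequence[Block]                               # backbone blocks, reused unchanged
    split_at: int                                         # index of the split layer
    predictor: Callable[[List[Tokens], Tokens], Tokens]   # future predictor over past tokens
    geometry_head: Callable[[Tokens], Tokens]             # decodes future geometry
    action_head: Callable[[Tokens], Tokens]               # decodes actions

    def step(self, observations: List[Tokens], conditioning: Tokens) -> Tuple[Tokens, Tokens]:
        # 1) Shallow blocks encode each past and current observation.
        shallow = self.blocks[: self.split_at]
        history = [run_blocks(shallow, obs) for obs in observations]
        # 2) The predictor forecasts future latent tokens from the history and
        #    conditioning only (language, proprioception, action history are
        #    flattened into one vector here as an assumption).
        future = self.predictor(history, conditioning)
        # 3) The remaining blocks process the predicted tokens.
        features = run_blocks(self.blocks[self.split_at:], future)
        # 4) Both outputs are decoded from the same deep features.
        return self.geometry_head(features), self.action_head(features)


def toy_predictor(history: List[Tokens], conditioning: Tokens) -> Tokens:
    """Toy predictor: shift the latest encoded tokens by the conditioning sum."""
    bias = sum(conditioning)
    return [x + bias for x in history[-1]]


# Toy backbone of three "blocks"; the split falls after the first one.
toy_blocks: List[Block] = [
    lambda t: [x + 1.0 for x in t],
    lambda t: [x * 2.0 for x in t],
    lambda t: [x - 1.0 for x in t],
]

policy = SplitBackbonePolicy(
    blocks=toy_blocks,
    split_at=1,
    predictor=toy_predictor,
    geometry_head=lambda f: list(f),
    action_head=lambda f: [sum(f)],
)

geometry, action = policy.step(
    observations=[[0.0, 1.0], [1.0, 2.0]],
    conditioning=[0.5, 0.5],
)
```

The point of the design is that `blocks` is a single list shared by perception and decoding; only `predictor` and the two heads are new components, which mirrors the "minimal architectural change" framing in the summary.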
Key Findings
- Single shared backbone for perception, prediction, and action: splitting the GFM lets shallow layers encode observations while a causal predictor forecasts future latent tokens that are routed through remaining blocks to decode both future geometry and actions, preserving geometric priors.
- Minimal architectural change, maximal reuse: temporal world modeling is added without retraining or replacing the full foundation model, reducing engineering and parameter cost compared with pixel-space world models.
- Better empirical tradeoffs: across simulation and real-robot benchmarks, GAM is reported to be more accurate, more robust, lower-latency, and lighter than foundation-model-scale baselines, with the gains concentrated in geometry-aware, contact-rich manipulation.
Who it helps and tradeoffs
Great fit if you build language-conditioned robot policies that require explicit 3D reasoning and can leverage a pretrained geometric foundation model. Research labs and teams working on contact-rich manipulation or cross-embodiment transfer stand to benefit most. Look elsewhere if you lack access to a compatible GFM, need purely model-free RL baselines, or target extremely lightweight embedded stacks. GAM inherits the foundation model’s compute and representation constraints, and it may propagate any biases or gaps in the pretrained GFM.
