Geometric Action Model for Robot Policy Learning

A language-conditioned robot policy that reuses a pretrained geometric foundation model. It inserts a causal future predictor at an intermediate layer, so the same backbone produces both future 3D-aware features and action outputs, adding geometry-aware temporal prediction with minimal architectural change.


Introduction

Robotic manipulation needs explicit 3D geometric reasoning for contact-rich tasks, yet many recent vision-language-action and world-action models work primarily in 2D image space or in latents derived from 2D images. The core insight of the Geometric Action Model (GAM) is to repurpose a pretrained geometric foundation model (GFM) as a single shared substrate for perception, temporal prediction, and action decoding. It does this by splitting the backbone at an intermediate layer and inserting a causal future predictor conditioned on language, proprioception, and action history.
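To make "causal" concrete, here is a minimal sketch of one common way to enforce it: attention with a lower-triangular mask, so the prediction at step t can only draw on steps up to t. This summary does not say how GAM's predictor is actually built, so the masking scheme and the names causal_mask and causal_attention are illustrative assumptions, not the authors' implementation.

```python
import math

def causal_mask(num_steps):
    """Lower-triangular mask: step t may attend to steps 0..t only."""
    return [[s <= t for s in range(num_steps)] for t in range(num_steps)]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Hypothetical helper for illustration. Each argument is a list of
    equal-length float vectors, one vector per time step.
    """
    dim = len(queries[0])
    mask = causal_mask(len(queries))
    outputs = []
    for t, q in enumerate(queries):
        # Score only the visible steps (the prefix 0..t).
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(len(keys)) if mask[t][s]
        ]
        # Numerically stable softmax over the visible scores.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        # Weighted average of the visible value vectors.
        outputs.append([
            sum(w / total * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy history of three time steps with 2-D tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = causal_attention(tokens, tokens, tokens)
```

Under this assumption, editing the last token leaves the outputs for the first two steps unchanged, which is the property that lets a predictor run online on a growing history of observations and actions.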

Key Findings

Single shared backbone for perception, prediction, and action: splitting the GFM lets shallow layers encode observations while a causal predictor forecasts future latent tokens that are routed through remaining blocks to decode both future geometry and actions, preserving geometric priors.
Minimal architectural change, maximal reuse: temporal world modeling is added without retraining or replacing the full foundation model, reducing engineering and parameter cost compared with pixel-space world models.
Better empirical tradeoffs: across simulation and real-robot benchmarks GAM is reported to be more accurate, more robust, lower-latency, and lighter than foundation-model-scale baselines, improving geometry-aware manipulation performance in contact-rich scenarios.
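The data flow described in the findings above can be sketched as follows. This is a toy illustration of the routing only: the function names, the split index, and the stand-in blocks and heads are hypothetical, and the real model would use the GFM's transformer blocks rather than simple list arithmetic.

```python
def run_blocks(blocks, tokens):
    """Apply a sequence of backbone blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens

def gam_style_forward(blocks, split_index, predictor,
                      geometry_head, action_head,
                      observation_tokens, context):
    """Route tokens through a backbone split at `split_index` (assumed layout).

    shallow blocks -> causal predictor -> deep blocks -> two heads
    """
    shallow, deep = blocks[:split_index], blocks[split_index:]
    # 1. Shallow layers encode the current observation.
    latent = run_blocks(shallow, observation_tokens)
    # 2. The predictor forecasts future latent tokens from the encoded
    #    observation plus conditioning (language, proprioception, actions).
    future_latent = predictor(latent, context)
    # 3. The remaining pretrained blocks process the predicted tokens.
    features = run_blocks(deep, future_latent)
    # 4. Both outputs are decoded from the same shared features.
    return geometry_head(features), action_head(features)

# Toy stand-ins, purely for illustration.
toy_blocks = [
    lambda x: [v + 1.0 for v in x],
    lambda x: [v * 2.0 for v in x],
    lambda x: [v - 3.0 for v in x],
]
toy_predictor = lambda latent, ctx: [v + ctx for v in latent]
future_geometry, action = gam_style_forward(
    toy_blocks, 1, toy_predictor,
    geometry_head=lambda f: list(f),
    action_head=lambda f: sum(f),
    observation_tokens=[1.0, 2.0],
    context=0.5,
)
```

The point of the sketch is that the deep blocks are reused unchanged for both outputs, which is how the design keeps the pretrained geometric priors while adding temporal prediction.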

Who it helps and tradeoffs

Great fit if you build language-conditioned robot policies that require explicit 3D reasoning and you can leverage a pretrained geometric foundation model. Research labs and teams working on contact-rich manipulation or cross-embodiment transfer stand to benefit most. Look elsewhere if you lack access to a compatible GFM, need purely model-free RL baselines, or target extremely lightweight embedded stacks: GAM inherits the foundation model's compute and representation constraints, and it may propagate any biases or gaps present in the pretrained GFM.


Information

Website: arxiv.org
Authors: Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong
Published date: 2026/06/15

More Items

Computer Vision Papers, 2026

CLBench-V: Evaluating Multimodal Context Learning from Grounding to Knowledge Acquisition

Lai Wei, Chengqi Li +4

Evaluates multimodal context learning across grounding, new information application, and knowledge acquisition using a 3,443-instance benchmark spanning science, finance, long documents, spatial reasoning, and web VQA; finds current multimodal models perform poorly (best score 0.2847) and analyzes failure modes.

multimodal benchmark vision evaluation paper+4

Computer Vision Papers, 2026

HumanCLAW: Can Vision-Language Models Act Through a Body?

Siyao Li, Jiawei Gu +16

Evaluates whether vision-language models can make actionable decisions for a physical body by decoupling decision-making from low-level motor execution. Introduces HumanCLAW-Bench with 1,218 long-horizon egocentric episodes across 41 indoor scenes and diagnoses a lack of embodied self-awareness in current VLMs.

vision robotics evaluation benchmarks multimodal+2

Computer Vision Papers, 2026

TurboVLA: Real-Time Vision-Language-Action Model at 32 Hz on an RTX 4090 with <1 GB VRAM

Hengyi Xie, Chenfei Yao +8

Directly maps visual observations and language instructions to continuous robot actions, replacing LLM-centric V→L→A pipelines. Uses separate visual and language encoders with lightweight bidirectional interaction and a compact decoder to cut inference cost and VRAM, achieving ~31 ms latency and <1 GB VRAM on an RTX 4090; suited for real-time robotic manipulation under tight compute budgets.

robotics vision multimodal paper code+4
