Embodied AI · 2026

Guava: An Effective and Universal Harness for Embodied Manipulation

Provides a harness that lets language models control embodied manipulation via iterative perception–reasoning–action loops, semantic action abstractions, and multimodal observations. Demonstrates distilling capabilities into a 4B open-source model with under 2K simulated trajectories and shows sim-to-real generalization.

Introduction

Most embodied systems either train end-to-end or bolt on isolated tools. This work argues for a different lever: the harness that mediates interaction between model and robot. The core claim is that a compact, well-designed harness, centered on iterative perception–reasoning–action loops, semantic action abstractions, and multimodal observations, can unlock strong manipulation skills in small, open models with minimal simulated data.

Key Findings

Iterative perception–reasoning–action loops: repeatedly acquiring targeted perceptual evidence and re-planning improves robustness on occluded, novel, and long-horizon tasks, so the system recovers from intermediate failures without exhaustive end-to-end retraining.
Semantic action abstractions: exposing high-level, symbolic-like action primitives (rather than raw low-level commands) reduces search complexity for the reasoning model and improves transfer across objects and embodiments.
Multimodal observations: combining focused visual crops, proprioceptive signals, and symbolic descriptors yields better grounding and generalization than vision-only prompts.
Data-efficient distillation: the authors distill the harnessed behavior into a 4B open model using fewer than 2,000 simulated trajectories, achieving performance comparable to proprietary baselines and demonstrating transfer to real robot setups.
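
The first three findings can be pictured with a toy loop. The following is a minimal sketch under stated assumptions, not the authors' harness or API: `ToyWorld`, `Observation`, `rule_policy`, `run_episode`, and the action names `look_closer`, `pick`, and `place_in_bin` are all hypothetical, and a hand-written rule stands in for the language model. It shows an observe, decide, act cycle over named semantic actions, where a hidden object is revealed by gathering more evidence and a failed grasp is retried on the next iteration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Observation:
    """Hypothetical multimodal snapshot (not the paper's schema)."""
    visible_objects: List[str]   # stand-in for focused visual crops
    holding: Optional[str]       # stand-in for a proprioceptive signal
    summary: str                 # stand-in for a symbolic descriptor


@dataclass
class ToyWorld:
    """Tiny simulated scene used only for illustration."""
    on_table: Set[str]
    occluded: Set[str]           # hidden until 'look_closer' is called
    slip_once: bool = True       # the first grasp attempt fails
    holding: Optional[str] = None
    in_bin: Set[str] = field(default_factory=set)

    def observe(self) -> Observation:
        visible = sorted(self.on_table)
        text = f"{len(visible)} object(s) visible; holding {self.holding}"
        return Observation(visible, self.holding, text)

    def look_closer(self, _arg: Optional[str] = None) -> bool:
        self.on_table |= self.occluded
        self.occluded = set()
        return True

    def pick(self, name: Optional[str]) -> bool:
        if name not in self.on_table or self.holding is not None:
            return False
        if self.slip_once:       # simulated intermediate failure
            self.slip_once = False
            return False
        self.on_table.remove(name)
        self.holding = name
        return True

    def place_in_bin(self, _arg: Optional[str] = None) -> bool:
        if self.holding is None:
            return False
        self.in_bin.add(self.holding)
        self.holding = None
        return True

    def execute(self, action: str, arg: Optional[str]) -> bool:
        # Semantic action table: the policy picks names, never motor commands.
        actions: Dict[str, Callable[[Optional[str]], bool]] = {
            "look_closer": self.look_closer,
            "pick": self.pick,
            "place_in_bin": self.place_in_bin,
        }
        handler = actions.get(action)
        return handler(arg) if handler else False


def rule_policy(goal: str, obs: Observation) -> Tuple[str, Optional[str]]:
    """Hand-written stand-in for the reasoning model (assumption)."""
    if obs.holding == goal:
        return ("place_in_bin", None)
    if goal in obs.visible_objects:
        return ("pick", goal)
    return ("look_closer", None)  # gather more evidence before acting


def run_episode(world: ToyWorld, goal: str, max_steps: int = 10):
    """Observe, decide, act; repeat until the goal object is in the bin."""
    trace = []
    for _ in range(max_steps):
        if goal in world.in_bin:
            return True, trace
        obs = world.observe()
        action, arg = rule_policy(goal, obs)
        ok = world.execute(action, arg)
        trace.append((action, arg, ok))
    return goal in world.in_bin, trace


if __name__ == "__main__":
    world = ToyWorld(on_table={"blue_cup"}, occluded={"red_block"})
    success, trace = run_episode(world, "red_block")
    print(success, trace)
```

Because the policy re-reads the observation on every step, the failed grasp needs no special recovery code: the next iteration simply sees that the object is still on the table and tries again.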

Who It's For and Trade-offs

Great fit if you want a model-agnostic interface for adding embodied manipulation to LLM-style agents, especially when simulator data is limited and you need fast iteration or sim-to-real transfer. The approach favors modular pipelines that let perception, planning, and control be improved independently.

Look elsewhere if your target requires extreme low-level precision or continuous-control optimization that depends on heavy RL fine-tuning, or if you cannot provide reasonably realistic simulation and sensor fidelity: harness gains hinge on the quality of the perceptual modules and abstractions. The harness also introduces engineering complexity (designing action APIs, verifiers, and multimodal views) that may not suit very small experimental projects.

Information

Website: arxiv.org
Authors: Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao
Published date: 2026/06/16

More Items

AI Agent Papers · 2026

Qwen-UI-Agent Technical Report: Toward Next-Generation Real-World Centric Foundation GUI Agents

Hanzhang Zhou, Panrong Tong +14

Designs and evaluates a foundation GUI agent that performs cross-platform GUI and CLI actions on real devices to complete long-horizon workflows. Emphasizes a unified action space, a large-scale real-device mobile runtime, an AutoResearch-style data flywheel, and online RL training across 10,000+ concurrent environments.

qwen ai-agent agent-skills mobile android+6

AI Video Papers · 2026

VideoCoCo: Code-as-CoT for Physically-Consistent Video Generation via an Agentic Dual-Engine System

Haodong Li, Tianfei Ren +26

Converts text prompts into physically consistent videos by synthesizing executable Blender programs as a process-level chain-of-thought and using a dual-engine pipeline (deterministic simulation draft + draft-conditioned video editor). Ships with a VideoCoCo-3K draft–instruction–target dataset and shows substantial gains in physical-consistency benchmarks.

video ai-video code coding coding-agents+5

Machine Learning Foundation Papers · 2026

Metis: Memory Foundation Model

Zeyu Zhang, Ziliang Guo +15

Presents Metis, a prototype memory foundation model that embeds a persistent native memory state into the backbone so historical experience is compressed and accessed via memory attention. Key features: forward-only, gradient-free online memory updates; memory-specific mid-training objectives; and a dual text/code memory design.

foundation llm ai-agent agent-skills multimodal+3