Kairos: A Native World Model Stack for Physical AI

Introduction

Physical AI systems need world models that do more than generate images: they must learn from heterogeneous embodied experience, persist state across long horizons, and run within real deployment constraints. This paper argues that meeting those demands requires a unified stack that is trained, architected, and engineered for embodiment from day one, rather than an adapter on top of visual generators.

Key Findings

Native pretraining via a Cross-Embodiment Data Curriculum: organizes open-world video, structured human behavior, and real robot interaction into a developmental pathway so the model can reuse heterogeneous data and learn sensorimotor couplings rather than just mimic visuals. This improves generalization to physical tasks compared with single-modality pretraining.
Native unified architecture with Hybrid Linear Temporal Attention: combines sliding-window attention for local dynamics, dilated sliding windows for mid-range dependencies, and a gated linear attention memory for persistent global state. The operator reduces the cost of temporal attention from O(n^2) to O(n) in the sequence length n, enabling long-sequence rollouts at lower compute cost.
Formal theoretical guarantees: the temporal factorization is shown to bound error accumulation and thus provably support state propagation over extended horizons, addressing a common failure mode of long-horizon predictive models.
Deployment-aware co-design: system-level choices (model scale, attention operator, rollout generation) are tuned for low-latency inference on server and consumer-grade hardware, enabling real-time on-robot rollouts and closed-loop observation–action–feedback cycles. The reference 4B-parameter model and specialized robot variants target real-world manipulation and control benchmarks.
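
To make the Hybrid Linear Temporal Attention finding more concrete, here is a minimal pure-Python sketch of its three ingredients: a local sliding window, a dilated sliding window, and a gated linear attention memory. It is an illustration under stated assumptions, not the paper's implementation. The function names, the causal windowing, the scalar per-step gate, and the unnormalized recurrence S_t = g_t * S_(t-1) + k_t v_t^T are all hypothetical choices; the actual operator may differ in every one of them.

```python
from typing import List

Vector = List[float]


def window_indices(t: int, window: int, dilation: int = 1) -> List[int]:
    """Causal positions step t attends to: t, t-d, t-2d, ... (at most `window`)."""
    return [t - j * dilation for j in range(window) if t - j * dilation >= 0]


def attended_pairs(n: int, window: int, dilation: int) -> int:
    """Count (query, key) pairs when each step uses a local and a dilated window."""
    total = 0
    for t in range(n):
        local = set(window_indices(t, window))
        mid_range = set(window_indices(t, window, dilation))
        total += len(local | mid_range)
    return total


def gated_linear_memory(
    qs: List[Vector], ks: List[Vector], vs: List[Vector], gates: List[float]
) -> List[Vector]:
    """Recurrent gated linear attention with a fixed-size state (assumed form).

    state <- gate * state + outer(k, v); output = q^T state.
    The state is d_k x d_v regardless of sequence length, so each step costs O(1) in n.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g * state[i][j] + k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


if __name__ == "__main__":
    n, window, dilation = 1000, 8, 4
    full_causal = n * (n + 1) // 2  # pairs under full causal attention
    print(full_causal, attended_pairs(n, window, dilation))
```

In this sketch each step attends to at most 2 * window - 1 positions and updates a state of constant size, so the total work grows linearly with n, whereas full causal attention touches n(n + 1)/2 pairs. That is the sense in which a hybrid of windows and a linear memory can bring the temporal cost from O(n^2) down to O(n).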

Who it's for + Trade-offs

Great fit if you build or evaluate embodied AI systems that require long-horizon prediction, closed-loop control, or cross-domain transfer between videos, human behavior traces, and robot logs. It is relevant for robotics researchers, teams integrating world models into controllers, and groups focused on deployable on-device inference.

Look elsewhere if your primary need is pure language modeling, small-scale image generation, or passive visual representation with no action or physics coupling. The approach depends on access to embodied interaction data and on system-level engineering for real-time rollouts, both of which raise data collection and deployment complexity.
