Kairos: A Native World Model Stack for Physical AI

Learns, maintains, and runs unified world models for Physical AI using a cross-embodiment pretraining curriculum and a hybrid linear temporal-attention architecture. Emphasizes long-horizon state persistence, theoretical bounds on error accumulation, and deployment-aware low-latency inference for real-world embodied agents.

Introduction

Physical AI systems need world models that do more than generate images: they must learn from heterogeneous embodied experience, persist state across long horizons, and run within real deployment constraints. This paper argues that meeting those demands requires a unified stack that is trained, architected, and engineered for embodiment from day one, rather than an adapter bolted onto visual generators.

Key Findings
  • Native pretraining via a Cross-Embodiment Data Curriculum: organizes open-world video, structured human behavior, and real robot interaction into a developmental pathway so the model can reuse heterogeneous data and learn sensorimotor couplings rather than just mimic visuals. This improves generalization to physical tasks compared with single-modality pretraining.
  • Native unified architecture with Hybrid Linear Temporal Attention: combines sliding-window attention for local dynamics, dilated sliding windows for mid-range dependencies, and a gated linear attention memory for persistent global state. The operator reduces temporal complexity from O(n^2) to O(n) in the sequence length n, enabling long-sequence rollouts at lower compute cost.
  • Formal theoretical guarantees: the temporal factorization is shown to bound error accumulation and thus provably support state propagation over extended horizons, addressing a common failure mode of long-horizon predictive models.
  • Deployment-aware co-design: system-level choices (model scale, attention operator, rollout generation) are tuned for low-latency inference on server and consumer-grade hardware, enabling real-time on-robot rollouts and closed-loop observation–action–feedback loops. The reference 4B-parameter model and specialized robot variants target real-world manipulation and control benchmarks.
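
To make the three temporal components named above concrete, here is a minimal, dependency-free Python sketch. It is an illustration under stated assumptions, not the Kairos implementation: the function names, the scalar per-step gate, and the window and dilation parameters are all hypothetical, and the summary does not say how the three branches are combined. The sketch only shows why each piece costs a bounded amount of work per time step.

```python
# Illustrative sketch only; names and parameters are assumptions, not the paper's API.

def sliding_window_mask(n, window):
    """Causal local mask: step i may attend to step j when 0 <= i - j < window."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def dilated_window_mask(n, window, dilation):
    """Causal dilated mask: step i attends to i, i - dilation, i - 2*dilation, ...
    for at most `window` positions, reaching further back at the same cost."""
    return [
        [0 <= i - j < window * dilation and (i - j) % dilation == 0 for j in range(n)]
        for i in range(n)
    ]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows; a proxy for attention cost."""
    return sum(sum(row) for row in mask)


def gated_linear_memory(keys, values, queries, gates):
    """Gated linear attention as a recurrence over a fixed-size state matrix S:
        S_t = g_t * S_{t-1} + k_t v_t^T,   o_t = S_t^T q_t
    The state has shape (d_k, d_v) regardless of sequence length, so each step
    costs O(d_k * d_v). The scalar gate g_t is an assumed simplification."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for k, v, q, g in zip(keys, values, queries, gates):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] = g * state[a][b] + k[a] * v[b]
        outputs.append([sum(q[a] * state[a][b] for a in range(d_k)) for b in range(d_v)])
    return outputs


if __name__ == "__main__":
    for n in (8, 16):
        local = attended_pairs(sliding_window_mask(n, window=2))
        full = attended_pairs(sliding_window_mask(n, window=n))  # full causal attention
        print(f"n={n}: windowed pairs={local}, full causal pairs={full}")
```

With a fixed window, doubling the sequence length roughly doubles the number of attended pairs, whereas full causal attention grows quadratically. The recurrent memory keeps a constant-size state, which is what lets it carry global context across a long rollout without revisiting earlier frames.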
Who it's for + Trade-offs

Great fit if you build or evaluate embodied AI systems that require long-horizon prediction, closed-loop control, or cross-domain transfer between videos, human behavior traces, and robot logs. It is relevant for robotics researchers, teams integrating world models into controllers, and groups focused on deployable on-device inference.

Look elsewhere if your primary need is pure language modeling or small-scale image generation, or if you only need passive visual representations with no action or physics coupling. The approach depends on access to embodied interaction data and on system-level engineering for real-time rollouts, which raises data collection and deployment complexity.

Information

  • Website: arxiv.org
  • Organizations: Kairos Team
  • Authors: Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu
  • Published date: 2026/06/15
