AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Simulates egocentric, embodied human–world interactions and enables customizable, self-evolving local scenes by defining anchor views and text-driven evolution. Uses exogenous viewpoints and full-body motion supervision to improve spatial grounding and interaction consistency.

Introduction

Why this matters

Interactive world modeling for first-person agents requires not just photorealistic rendering but also controllable, temporally consistent scene evolution, a capability most simulators either omit or treat as fixed-script behavior. AnchorWorld addresses that gap by combining egocentric simulation with two pragmatic designs: (1) exogenous viewpoint supervision, in which auxiliary cameras outside the egocentric view recover full-body pose and spatial context, and (2) anchor-view-based, text-conditioned rules that let local regions evolve over time while remaining geometrically coherent.

Key Findings
  • Exogenous viewpoint supervision: training with auxiliary cameras that observe the agent’s full body reduces ambiguity from truncated egocentric views and yields stronger spatial grounding for human–world interactions. As a result, interactions that depend on limb placement and reach are modeled more reliably than in egocentric-only baselines.
  • Anchor-view customization: defining anchor views in a shared world coordinate frame plus short textual directives produces local scene evolution that follows prescribed spatio-temporal dynamics. Practically, this provides a compact interface for scene authors to specify how specific areas should change over time without re-simulating the entire world.
  • Empirical gains and validation: experiments report consistent improvements over state-of-the-art baselines on metrics for interaction integrity and geometric consistency; ablations confirm the utility of both exogenous supervision and anchor-based evolution.
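
To make the anchor-view idea concrete, here is a minimal sketch of how an anchor might be represented as data: a camera placed in the shared world frame, a short text directive, and a time window. This is not AnchorWorld's actual interface. Every name here (AnchorView, sees, active_at, directives_for) and the cone-shaped visibility test are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class AnchorView:
    """Hypothetical anchor: a world-frame camera plus a text directive."""
    name: str
    position: Vec3        # camera centre in shared world coordinates
    forward: Vec3         # unit viewing direction in world coordinates
    half_fov_cos: float   # cosine of the half field-of-view angle
    max_range: float      # how far the anchored region extends
    directive: str        # text describing how the region should evolve
    start_s: float        # start of the evolution window (seconds)
    end_s: float          # end of the evolution window (seconds)

    def sees(self, point: Vec3) -> bool:
        """True if the world point lies inside this view's cone (assumed test)."""
        offset = tuple(p - c for p, c in zip(point, self.position))
        dist = sum(d * d for d in offset) ** 0.5
        if dist == 0.0:
            return True
        if dist > self.max_range:
            return False
        cos_angle = sum(d * f for d, f in zip(offset, self.forward)) / dist
        return cos_angle >= self.half_fov_cos

    def active_at(self, t: float) -> bool:
        """True if time t falls inside the evolution window."""
        return self.start_s <= t <= self.end_s


def directives_for(anchors: List[AnchorView], point: Vec3, t: float) -> List[str]:
    """Collect the directives that apply to a world point at time t."""
    return [a.directive for a in anchors if a.active_at(t) and a.sees(point)]


# Illustrative usage with made-up values.
stove = AnchorView(
    name="stove",
    position=(0.0, 0.0, 0.0),
    forward=(1.0, 0.0, 0.0),
    half_fov_cos=0.5,     # 60-degree half angle
    max_range=5.0,
    directive="the pot gradually boils over",
    start_s=0.0,
    end_s=10.0,
)
print(directives_for([stove], (2.0, 0.5, 0.0), 5.0))
```

Because the anchor lives in world coordinates rather than in any one egocentric frame, the same directive applies no matter where the agent is looking from. That is one plausible reading of how a local region can evolve without re-simulating the entire world.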

Who it's for, and tradeoffs

Great fit if you need fine-grained, controllable egocentric simulation where body pose and local scene dynamics matter (e.g., AR/VR agent testing, embodied perception research, human–robot interaction prototyping). AnchorWorld is most useful when you can provide or simulate auxiliary viewpoints (or accept training with such data) and when localized, text-driven evolution suffices instead of global scene rewriting.

Look elsewhere if you require fully general open-world generation without any anchored constraints, or if you cannot provide multi-view supervision: the method’s robustness hinges on full-body spatial cues and on the anchor-view abstraction for consistent evolution. The compute and dataset requirements for multi-view supervision and motion data may also be nontrivial for very large-scale deployments.

Where it fits

AnchorWorld sits between passive egocentric reconstruction methods and heavyweight scene simulators: it prioritizes interaction fidelity and controllable local dynamics over unconstrained scene synthesis. It’s complementary to embodied agent stacks that need reliable interaction grounding and to research that studies how text or symbolic policies should drive local environmental change.

Information

  • Website: arxiv.org
  • Authors: Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang
  • Published date: 2026/06/05