Why it matters
Language-based world models let agents plan and learn by predicting environment state transitions rather than querying real systems for every trial. This model extends that idea to seven interaction domains (tool calling, search, terminal, software engineering, Android, web, OS) and is trained specifically to produce long, coherent next-state observations using long-form chain-of-thought reasoning and an MoE architecture with an extended context window.
Key Capabilities
- Multi-domain environment simulation: Predicts next environment observations for diverse agent interactions (tool calls, CLI, GUI, web, search), so you can run large-scale, controllable simulations without the real environment.
- Long-context, multi-step trajectories: Supports extremely long context (default 262,144 tokens) and long outputs, enabling multi-turn environment simulation and extended reasoning about state transitions.
- Architecture & training tailored for world modeling: Three-stage pipeline (continual pretraining → supervised fine-tuning → RL with hybrid rewards) and a MoE design (35B params, sparse activation) to balance capacity and inference cost.
- Evaluation-oriented: Released alongside AgentWorldBench for grounded, rubric-based evaluation across five dimensions (format, factuality, consistency, realism, quality).
Who it's for and trade-offs
Great fit if you need to simulate many agent–environment interactions for RL training, benchmark world-model quality, or prototype agent workflows without costly or unsafe real-world runs. It’s particularly useful when long context and multi-step state prediction matter.
Look elsewhere if you need a small, low-latency on-device model, or if your workflow cannot accommodate the memory and compute demands of very long-context MoE models. Also note the checkpoint contains only language weights (visual modules nominally defined), so full multimodal usage may require extra components.
Where it fits
Compared with general-purpose LLMs, this model is specialized to predict environment observations and to be used as a decoupled simulator or an agent foundation model. Compared with simulator-specific codebases, it trades deterministic fidelity for scalable, controllable, language-native simulation that is easier to perturb and compose for research and training.
Architecture & practical notes
The released checkpoint is a causal language world model with sparse MoE layers, 35B total parameters (≈3B activated), and a recommended inference setup that preserves very long contexts. Best used with inference stacks that support large context windows (vLLM, SGLang) and with sampling settings tuned for simulation (suggested temperature/top-p/top-k).
