Most LLM-agent work treats the environment as fixed and the agent as the only learning component. Role-Agent flips that assumption: the same LLM plays both agent and environment, so the training signal and the data distribution co-evolve. The result is more targeted practice and improved generalization.
Key Findings
- Dual-role co-evolution: the method combines World-In-Agent (WIA), where the model predicts next states and receives a process reward based on how well those predictions align with the actual states, with Agent-In-World (AIW), where the model analyzes failures and retrieves similar tasks for retraining. Together they change both the objective and the data distribution during training.
- Empirical gains: across multiple benchmarks the method yields an average improvement of ~4% over strong baselines, indicating better robustness on complex, multi-step tasks.
- Practical mechanism: WIA encourages environment-aware planning by rewarding accurate state prediction; AIW focuses learning on systematic failure modes via targeted task retrieval and resampling.
Who it's for and trade-offs
Great fit if you research LLM-based agents or automated curriculum methods, or want a lightweight way to make agents more environment-aware without building separate simulators. Look elsewhere if you require provably accurate ground-truth environments (the approach risks compounding model errors), need fully reproducible external simulators, or must avoid the extra compute of repeated self-generated episodes. The approach can amplify hallucination-like prediction errors if the base LLM is weak, so gains depend on model quality and should be validated on real environments.
Where it fits
Role-Agent sits between self-play or imagination-based planning and traditional RL: instead of an external simulator or a separate environment model, it uses the LLM itself to produce and critique trajectories. That makes it attractive for rapid prototyping of agent behaviors and for curriculum generation when real environments are costly.
How it works (brief)
- World-In-Agent (WIA): after each action the LLM predicts the consequent state; alignment between predicted and actual states is converted to a process reward that shapes subsequent reasoning.
- Agent-In-World (AIW): failed trajectories are analyzed to extract failure modes; the system retrieves tasks with similar patterns and prioritizes them for retraining, effectively reshaping the training distribution toward problematic cases.
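
The two mechanisms above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how alignment is measured or how similar tasks are retrieved, so the sketch assumes character-level string similarity for the WIA reward and tag overlap (Jaccard) for AIW retrieval. All function names, the `boost` parameter, and the task tags are hypothetical.

```python
import random
from difflib import SequenceMatcher


def process_reward(predicted_state: str, actual_state: str) -> float:
    """WIA-style reward in [0, 1] for how well a predicted state matches the actual one.

    Assumption: character-level similarity stands in for the alignment
    measure, which the source does not specify.
    """
    return SequenceMatcher(None, predicted_state, actual_state).ratio()


def jaccard(a: set, b: set) -> float:
    """Overlap between two tag sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieval_weights(task_tags: dict, failure_tags: set, boost: float = 4.0) -> dict:
    """AIW-style weights: tasks sharing tags with observed failures get upweighted.

    Assumption: failure modes and tasks are described by hypothetical tag sets.
    Every task keeps a base weight of 1.0, so no task is dropped entirely.
    """
    return {
        name: 1.0 + boost * jaccard(tags, failure_tags)
        for name, tags in task_tags.items()
    }


def resample(weights: dict, k: int, seed: int = 0) -> list:
    """Draw k tasks (with replacement) in proportion to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


if __name__ == "__main__":
    # WIA: compare the model's predicted next state with the observed one.
    reward = process_reward("drawer is open", "drawer is closed")

    # AIW: failures tagged "grasp" shift sampling toward tasks that involve grasping.
    tasks = {"open_drawer": {"navigation", "grasp"}, "heat_food": {"appliance"}}
    weights = retrieval_weights(tasks, failure_tags={"grasp"})
    print(reward, weights, resample(weights, k=5))
```

Keeping a nonzero base weight is a deliberate choice in this sketch: it reshapes the training distribution toward problematic cases without removing tasks the agent already handles, which would otherwise risk forgetting them.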
