Why this matters
Benchmarks for LLM agents usually evaluate static planning or single-shot task success. AdaPlanBench drops that assumption: real-world constraints are often incomplete up front and only revealed through interaction, so reliable agents must infer constraints from feedback and re-plan under accumulating, sometimes conflicting, requirements. The core insight is that evaluating adaptive planning requires a dynamic, violation-driven protocol rather than fixed prompts or one-off demonstrations.
Key Findings
- Adaptive planning remains hard: across ten leading LLMs, the best model reached only about 67.8% accuracy, leaving substantial room for improvement in robust re-planning and constraint tracking. So what: current agent designs still fail frequently when constraints accumulate.
- User constraints are especially challenging: performance drops more when user-preference constraints appear than with purely physical world constraints. So what: agent alignment and preference modeling need stronger interactive inference mechanisms.
- Performance degrades as constraints accumulate: accuracy decreases as more hidden constraints are revealed, indicating brittle plan composition and limited memory and constraint tracking. So what: successful agents must maintain and reason over an evolving constraint set.
- Failure modes point to weak physical grounding and ineffective revision strategies. So what: improvements likely require tighter environment grounding, explicit constraint bookkeeping, and better re-planning heuristics.
Who it's for and trade-offs
A great fit if you want to benchmark and stress-test LLM agents' interactive planning, constraint inference, and re-planning strategies in household-like tasks. It is useful for researchers developing agent frameworks, prompt-based planning systems, or modules for constraint management and user preference handling. Look elsewhere if you need evaluation of single-step instruction following, large-scale autonomous execution logging, or tasks outside embodied, household-style scenarios: AdaPlanBench focuses on multi-turn, constraint-driven planning rather than broad-scale execution or resource-heavy simulation.
How the benchmark works
AdaPlanBench provides 307 base household tasks and a scalable pipeline that programmatically augments each task with two types of hidden constraints (world and user). At runtime, agents propose plans; the protocol reveals a hidden constraint only when a proposed plan violates it, forcing iterative corrections. This design stresses (1) inferring unseen constraints from violation feedback, (2) tracking an accumulating constraint set across turns, and (3) efficiently re-planning to satisfy both physical and preference constraints. The authors include standardized metrics and analyses of common failure modes to guide follow-up work.
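To make the violation-driven protocol concrete, here is a minimal Python sketch of one episode. It is an illustration under stated assumptions, not the benchmark's actual code: the names `Constraint`, `first_violation`, `run_episode`, and `toy_agent` are hypothetical, as are the tea-making task, the turn budget, and the choice to reveal at most one constraint per turn. The summary above does not specify those details.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Plan = List[str]


@dataclass
class Constraint:
    """Hypothetical hidden constraint; not the benchmark's real schema."""
    kind: str                             # "world" or "user"
    description: str
    violated_by: Callable[[Plan], bool]   # True if the plan breaks this constraint


def first_violation(plan: Plan, hidden: List[Constraint]) -> Optional[Constraint]:
    """Return the first hidden constraint the plan violates, or None."""
    for constraint in hidden:
        if constraint.violated_by(plan):
            return constraint
    return None


def run_episode(
    agent: Callable[[List[Constraint]], Plan],
    hidden: List[Constraint],
    max_turns: int = 5,                   # assumed turn budget
) -> Tuple[bool, int, List[Constraint]]:
    """Reveal a constraint only when a proposed plan violates it."""
    revealed: List[Constraint] = []
    for turn in range(1, max_turns + 1):
        plan = agent(revealed)            # agent sees only what has been revealed
        violation = first_violation(plan, hidden)
        if violation is None:
            return True, turn, revealed   # plan satisfies every hidden constraint
        if violation not in revealed:
            revealed.append(violation)    # the constraint set accumulates
    return False, max_turns, revealed


def toy_agent(revealed: List[Constraint]) -> Plan:
    """Toy re-planner for a made-up 'make tea' task."""
    known = {c.description for c in revealed}
    plan = [
        "boil water on stove" if "kettle is broken" in known
        else "boil water in kettle",
        "steep tea bag",
    ]
    if "user dislikes sugar" not in known:
        plan.append("add sugar")
    return plan


hidden_constraints = [
    Constraint("world", "kettle is broken",
               lambda p: "boil water in kettle" in p),
    Constraint("user", "user dislikes sugar",
               lambda p: "add sugar" in p),
]

success, turns, revealed = run_episode(toy_agent, hidden_constraints)
print(success, turns, [c.kind for c in revealed])
```

In this toy run the agent needs three turns: the first plan trips the world constraint, the second trips the user constraint, and the third satisfies both. The agent keeps no state of its own here; everything it knows comes from the `revealed` list, which is the kind of explicit constraint bookkeeping the findings suggest real agents need.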
