AIAny - AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

Why this matters

LLM-agent benchmarks usually evaluate static planning or single-shot task success. AdaPlanBench starts from a different premise: real-world constraints are often incomplete and revealed only through interaction, so reliable agents must infer constraints from feedback and re-plan under accumulating, sometimes conflicting, requirements. The core insight is that evaluating adaptive planning requires a dynamic, violation-driven protocol rather than fixed prompts or one-off demonstrations.

Key Findings

Adaptive planning remains hard: across ten leading LLMs, the best model reached only ~67.8% accuracy, leaving substantial room for improvement in robust re-planning and constraint tracking. So what: current agent designs still fail frequently when constraints accumulate.
User constraints are especially challenging: performance drops more when user-preference constraints appear than when only world (physical) constraints apply. So what: agent alignment and preference modeling need stronger interactive inference mechanisms.
Performance degrades as constraints accumulate: accuracy decreases as more hidden constraints are revealed, indicating brittle plan composition and limited memory and constraint tracking. So what: successful agents must maintain and reason over an evolving constraint set.
Failure modes point to weak physical grounding and ineffective revision strategies. So what: improvements likely require tighter environment grounding, explicit constraint bookkeeping, and better re-planning heuristics.
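
One way to examine the degradation finding in your own runs is to bucket episodes by how many hidden constraints were revealed and compute accuracy per bucket. The sketch below assumes a hypothetical record format of (revealed-constraint count, solved flag); the records are made-up placeholders for illustration, not AdaPlanBench results, and the function name is not part of any released tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical placeholder records: (constraints revealed in the episode, solved?).
# These values are invented for illustration and are NOT benchmark results.
records = [
    (0, True),
    (1, True), (1, False),
    (2, True), (2, False), (2, False),
    (3, False),
]


def accuracy_by_constraint_count(
    episodes: Iterable[Tuple[int, bool]],
) -> Dict[int, float]:
    """Group episodes by revealed-constraint count and return accuracy per group."""
    totals: Dict[int, int] = defaultdict(int)
    solved: Dict[int, int] = defaultdict(int)
    for n_revealed, ok in episodes:
        totals[n_revealed] += 1
        solved[n_revealed] += int(ok)
    return {n: solved[n] / totals[n] for n in sorted(totals)}


per_bucket = accuracy_by_constraint_count(records)
for n, acc in per_bucket.items():
    print(f"{n} revealed constraint(s): accuracy {acc:.2f}")
```

Reporting accuracy per bucket rather than a single aggregate makes it easier to see whether an agent fails uniformly or only once several constraints have accumulated.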

Who it's for and trade-offs

Great fit if you want to benchmark and stress-test LLM agents' interactive planning, constraint inference, and re-planning strategies in household-like tasks. It is useful for researchers developing agent frameworks, prompt-based planning systems, or modules for constraint management and user-preference handling.

Look elsewhere if you need evaluation of single-step instruction following, large-scale autonomous execution logging, or tasks outside embodied, household-style scenarios. AdaPlanBench focuses on multi-turn, constraint-driven planning rather than broad-scale execution or resource-heavy simulation.

How the benchmark works

AdaPlanBench provides 307 base household tasks and a scalable pipeline that programmatically augments each task with two types of hidden constraints: world constraints and user constraints. At runtime, agents propose plans; the protocol reveals a hidden constraint only when a proposed plan violates it, forcing iterative corrections. This design stresses (1) inferring unseen constraints from violation feedback, (2) tracking an accumulating constraint set across turns, and (3) efficiently re-planning to satisfy both physical and preference constraints. The authors include standardized metrics and analyses of common failure modes to guide follow-up work.
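
To make the protocol concrete, here is a minimal Python sketch of a violation-driven loop. It is an illustration under stated assumptions, not the authors' implementation: the names `Constraint`, `Episode`, `run_episode`, and `toy_agent`, the tea-making task, the turn limit, and the substring-based violation checks are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = List[str]


@dataclass
class Constraint:
    kind: str                             # "world" or "user"
    description: str                      # feedback text shown on violation
    violated_by: Callable[[Plan], bool]   # hypothetical violation check


@dataclass
class Episode:
    hidden: List[Constraint]
    revealed: List[Constraint] = field(default_factory=list)

    def check(self, plan: Plan) -> Optional[Constraint]:
        """Return the first constraint the plan violates, revealing it if new."""
        for c in self.hidden:
            if c.violated_by(plan):
                if c not in self.revealed:
                    self.revealed.append(c)
                return c
        return None


def run_episode(agent: Callable[[List[str]], Plan], episode: Episode,
                max_turns: int = 5) -> Tuple[bool, int]:
    """Let the agent propose plans until one violates nothing or turns run out."""
    feedback: List[str] = []              # accumulated violation messages
    for turn in range(1, max_turns + 1):
        plan = agent(feedback)
        violated = episode.check(plan)
        if violated is None:
            return True, turn
        if violated.description not in feedback:
            feedback.append(violated.description)
    return False, max_turns


def toy_agent(feedback: List[str]) -> Plan:
    """Rule-based stand-in for an LLM: revises the plan from all feedback so far."""
    use_kettle = any("microwave" in f for f in feedback)
    steps = ["boil water in kettle" if use_kettle else "boil water in microwave",
             "steep tea bag"]
    if not any("sugar" in f for f in feedback):
        steps.append("add sugar")
    steps.append("serve tea")
    return steps


episode = Episode(hidden=[
    Constraint("world", "the microwave is broken",
               lambda p: any("microwave" in s for s in p)),
    Constraint("user", "the user avoids sugar",
               lambda p: any("sugar" in s for s in p)),
])
solved, turns = run_episode(toy_agent, episode)
print(solved, turns, [c.kind for c in episode.revealed])
```

In this toy run the agent needs three turns: the first plan reveals the world constraint, the second reveals the user constraint, and the third satisfies both. The agent receives the full feedback list on every turn, which mirrors the requirement to track an accumulating constraint set rather than react only to the latest violation.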
