AIAny - Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Most evaluations of research agents judge only final answers, which hides where in a long trajectory (searches, tool use, evidence inspection, hypothesis steps) the failure actually occurred. This work argues that process-level diagnosis matters: by converting raw agent logs into semantic spans and annotating the harmful ones, it surfaces the precise trajectory segments that make an agent's conclusion unreliable, enabling targeted fixes rather than blind model swaps.

Key Findings

Empirical corpus: the authors collected 2,790 real trajectories across two agent frameworks, three backbone models, and three benchmarks, converted the logs into semantic spans, and annotated them through LLM-assisted expert review. This yields a realistic signal set for span-level errors.
TELBench: a 1,000-instance benchmark that labels error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise, so evaluators can measure where agents go wrong, not just whether they fail.
DRIFT (claim-centric auditing): tracks agent claims, checks supporting evidence within the trajectory, and marks spans where unsupported or conflicting claims affect downstream answers. In experiments, DRIFT boosts span-level error localization and first-error accuracy by up to ~30 percentage points over baseline auditing methods.
Practical implication: diagnosis at the span/claim level lets developers prioritize fixes (better search heuristics, tool filters, evidence-checking) targeted to the failure mode instead of retraining whole models.
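
To make the two reported quantities concrete, here is a minimal Python sketch of how span-level localization and first-error accuracy could be scored for one trajectory. The function names and the exact definitions (set-based precision/recall/F1 over span indices, and "earliest flagged span matches earliest annotated span") are assumptions for illustration; the summary does not state how TELBench defines its metrics.

```python
from typing import Iterable


def span_prf(gold: Iterable[int], pred: Iterable[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over sets of span indices flagged as errors.

    Hypothetical scoring rule, not taken from the paper.
    """
    g, p = set(gold), set(pred)
    tp = len(g & p)  # spans flagged by both annotator and auditor
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def first_error_hit(gold: Iterable[int], pred: Iterable[int]) -> bool:
    """True when the earliest predicted error span is the earliest annotated one."""
    g, p = set(gold), set(pred)
    return bool(g) and bool(p) and min(g) == min(p)


# Toy example: annotators marked spans 4 and 7; the auditor flagged 4 and 9.
gold_spans = [4, 7]
pred_spans = [4, 9]
print(span_prf(gold_spans, pred_spans))         # (0.5, 0.5, 0.5)
print(first_error_hit(gold_spans, pred_spans))  # True
```

Averaging `first_error_hit` over a benchmark would give a first-error accuracy in this simplified sense.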

Who it's for — tradeoffs

Great fit if you design, evaluate, or debug long-horizon research agents and need interpretability at the trajectory level: the dataset and tooling help pinpoint the exact steps to correct. Look elsewhere if you only care about end-to-end task accuracy or cannot produce detailed, timestamped logs (the method depends on traceable actions, claims, and evidence). The benchmark focuses on research-style agent behavior and is built from specific frameworks and models, so it may need adaptation for other domains or closed-tool ecosystems.

Where it fits

This paper complements final-answer benchmarks and faithfulness/error-detection metrics by adding a process-level layer: use TELBench and DRIFT to turn a scalar success/fail signal into actionable span-level diagnostics that feed into targeted rule-makers, tool filters, or verification modules.
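
As a rough illustration of turning span-level labels into a prioritization signal, the sketch below counts flagged spans per span type across trajectories. The dictionary keys (`kind`, `is_error`) and the span types are hypothetical; they are not the TELBench schema.

```python
from collections import Counter


def error_counts_by_kind(trajectories):
    """Count flagged spans per span kind across all trajectories."""
    counts = Counter()
    for spans in trajectories:
        for span in spans:
            if span["is_error"]:
                counts[span["kind"]] += 1
    return counts


# Two toy trajectories with hypothetical span labels.
trajectories = [
    [{"kind": "search", "is_error": False}, {"kind": "claim", "is_error": True}],
    [{"kind": "search", "is_error": True}, {"kind": "claim", "is_error": True}],
]

# Most frequent failure modes first.
ranked = error_counts_by_kind(trajectories).most_common()
print(ranked)  # [('claim', 2), ('search', 1)]
```

A ranking like this suggests where a fix would pay off first, for example evidence checking for claim errors versus search heuristics for search errors.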

Methodological notes

The authors convert raw agent logs into semantic spans (searches, claims, evidence checks), use LLM-assisted expert review to annotate harmful spans, and evaluate multiple auditing frameworks across model families. Results highlight both dataset-driven limitations (annotation scale, domain coverage) and clear gains from claim-centric auditing for improving early-error detection.
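
The idea of checking claims against evidence already seen in the trajectory can be sketched in a few lines of Python. This is a toy stand-in, not DRIFT itself: the `Span` fields, the span kinds, and the exact-match `facts` sets are assumptions, and a real auditor would need an LLM or entailment model to judge support and to trace whether a flagged claim affects the final answer.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    index: int
    kind: str  # hypothetical kinds: "search", "evidence", "claim"
    text: str
    # Normalized facts observed (evidence) or asserted (claim); toy representation.
    facts: set[str] = field(default_factory=set)


def audit_claims(spans: list[Span]) -> list[int]:
    """Return indices of claim spans asserting facts not backed by earlier evidence."""
    seen: set[str] = set()
    flagged: list[int] = []
    for span in spans:
        if span.kind == "evidence":
            seen |= span.facts  # accumulate what the agent has actually observed
        elif span.kind == "claim" and not span.facts <= seen:
            flagged.append(span.index)  # at least one asserted fact is unsupported
    return flagged


trajectory = [
    Span(0, "search", "query: founding year of X"),
    Span(1, "evidence", "page says X was founded in 1998", {"X founded 1998"}),
    Span(2, "claim", "X was founded in 1998", {"X founded 1998"}),
    Span(3, "claim", "X was founded by Y", {"X founder Y"}),
]
print(audit_claims(trajectory))  # [3]
```

Span 2 is supported by the evidence in span 1, while span 3 asserts something the trajectory never observed, so only span 3 is flagged.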
