Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Localizes harmful span-level errors inside long research-agent trajectories to show which trajectory segments make final answers unreliable. Provides TELBench, a 1,000-instance benchmark of annotated spans, and DRIFT, a claim-centric auditing method that improves span-level localization and first-error accuracy by up to 30 percentage points.

Introduction

Most evaluations of research agents judge only final answers, which hides where in a long trajectory (searches, tool use, evidence inspection, hypothesis steps) the failure actually occurred. This work argues that process-level diagnosis matters: by converting raw agent logs into semantic spans and annotating the harmful ones, it surfaces the precise trajectory segments that make an agent's conclusion unreliable, enabling targeted fixes rather than blind model swaps.

Key Findings
  • Empirical corpus: the authors collected 2,790 real trajectories across two agent frameworks, three backbone models, and three benchmarks, converted the logs into semantic spans, and annotated them through LLM-assisted expert review. The span-level error labels therefore come from real agent runs.
  • TELBench: a 1,000-instance benchmark that labels error spans sitting among normal exploration, failed searches, tentative hypotheses, and harmless noise, so evaluators can measure where agents go wrong, not just whether they fail.
  • DRIFT (claim-centric auditing): tracks agent claims, checks supporting evidence within the trajectory, and marks spans where unsupported or conflicting claims affect downstream answers. In experiments, DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points over baseline auditing methods.
  • Practical implication: diagnosis at the span/claim level lets developers prioritize fixes (better search heuristics, tool filters, evidence-checking) targeted to the failure mode instead of retraining whole models.
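
The claim-centric idea described in the DRIFT bullet can be illustrated with a toy audit. This is a minimal sketch, not the authors' implementation: the `Span` fields, the span kinds, and the exact-string matching of claims are all assumptions made for illustration, and conflicting-claim detection is left out. It flags the span that introduced a claim only if that claim is later relied on without any supporting evidence having appeared in the trajectory, so an unused tentative hypothesis is not flagged.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One semantic span of a trajectory (hypothetical schema, not from the paper)."""
    idx: int
    kind: str                                      # e.g. "search", "evidence", "claim", "answer"
    text: str
    claims: list = field(default_factory=list)     # claims first asserted in this span
    supports: list = field(default_factory=list)   # claims this span provides evidence for
    uses: list = field(default_factory=list)       # earlier claims this span relies on


def audit(spans):
    """Return indices of spans whose claims are relied on later without support."""
    supported = set()     # claims backed by evidence seen so far
    introduced_at = {}    # claim -> index of the span that first asserted it
    flagged = set()
    for span in spans:
        supported.update(span.supports)
        for claim in span.claims:
            introduced_at.setdefault(claim, span.idx)
        for claim in span.uses:
            # A claim is harmful here only if something downstream depends on it
            # and no evidence for it has appeared earlier in the trajectory.
            if claim not in supported and claim in introduced_at:
                flagged.add(introduced_at[claim])
    return sorted(flagged)


trajectory = [
    Span(0, "search", "query: when was X founded"),
    Span(1, "evidence", "page states X was founded in 1998", supports=["X founded 1998"]),
    Span(2, "claim", "X was founded in 1998", claims=["X founded 1998"]),
    Span(3, "claim", "X has 500 staff", claims=["X has 500 staff"]),   # no evidence
    Span(4, "claim", "maybe X is based in Berlin", claims=["X in Berlin"]),  # never used
    Span(5, "answer", "final answer", uses=["X founded 1998", "X has 500 staff"]),
]

flagged = audit(trajectory)
print(flagged)
```

In this toy trajectory only span 3 is flagged: its claim reaches the final answer unsupported, while the unused guess in span 4 counts as harmless exploration.
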
Who it's for — tradeoffs

Great fit if you design, evaluate, or debug long-horizon research agents and need interpretability at the trajectory level: the dataset and tooling help pinpoint the exact steps to correct. Look elsewhere if you only care about end-to-end task accuracy or cannot produce detailed, timestamped logs (the method depends on traceable actions, claims, and evidence). The benchmark focuses on research-style agent behavior and comes from specific frameworks and models, so it may need adaptation for other domains or closed-tool ecosystems.

Where it fits

This paper complements final-answer benchmarks and faithfulness/error-detection metrics by adding a process-level layer: use TELBench and DRIFT to turn a scalar success/fail signal into actionable span-level diagnostics that feed into targeted rule-makers, tool filters, or verification modules.

Methodological notes

The authors convert raw agent logs into semantic spans (searches, claims, evidence checks), use LLM-assisted expert review to annotate harmful spans, and evaluate multiple auditing frameworks across model families. Results highlight both dataset-driven limitations (annotation scale, domain coverage) and clear gains from claim-centric auditing for improving early-error detection.
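
The two quantities reported for the auditors, span-level localization and first-error accuracy, can be made concrete with a small sketch. This summary does not give the paper's exact metric definitions, so the versions below are assumptions: span-level localization is scored as set-based F1 over span indices, and a first-error hit means the earliest predicted error span equals the earliest annotated one. The function names are hypothetical.

```python
def span_f1(pred, gold):
    """Set-based F1 between predicted and annotated error-span indices."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def first_error_accuracy(preds, golds):
    """Fraction of trajectories whose earliest predicted error span
    matches the earliest annotated error span."""
    hits = sum(
        1 for p, g in zip(preds, golds)
        if p and g and min(p) == min(g)
    )
    return hits / len(golds)


# Two toy trajectories: predicted vs. annotated error-span indices.
preds = [[2, 5], [4]]
golds = [[2, 3], [1, 4]]

print(span_f1(preds[0], golds[0]))         # 0.5
print(first_error_accuracy(preds, golds))  # 0.5
```

The second trajectory shows why the two numbers differ: the auditor finds a real error span (4) but misses the earlier one (1), so it earns partial localization credit and no first-error credit.
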

Information

  • Website: arxiv.org
  • Authors: Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu
  • Published date: 2026/06/01