AIAny - SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Repository exploration (deciding which files and specific lines an agent inspects before producing a fix) is a practical bottleneck for coding agents, but holistic, binary benchmarks (resolved/unresolved) often obscure it. SWE-Explore isolates exploration as a measurable capability and shows that how an agent searches and ranks code regions strongly predicts downstream repair success.

Key Findings

Fine-grained, line-level ground truth: the benchmark derives line-level relevance from successful agent trajectories, enabling evaluation at a granularity that file-level tests miss. This exposes differences in what agents actually consult when solving issues.
Broad coverage and reproducibility: 848 issues drawn from 203 open-source repositories across 10 programming languages provide diverse, repository-level scenarios for evaluation and comparison.
Multi-dimensional metrics: evaluation includes coverage (what fraction of relevant lines are seen), ranking quality (how well relevant lines are prioritized under a line budget), and context-efficiency (how much useful context is included per inspected line). These metrics correlate with downstream repair behavior, making exploration scores predictive rather than purely descriptive.
Empirical result: agentic explorers (agents that plan, navigate, and retrieve code) outperform classical retrieval baselines. While modern methods achieve strong file-level localization, line-level coverage and efficient ranking remain the primary axes that separate top performers.
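
To make the three metric families above concrete, here is a minimal Python sketch. The summary does not give the benchmark's exact formulas, so the definitions below are assumptions chosen to match the one-line descriptions: coverage as recall over relevant lines, context-efficiency as the share of inspected lines that are relevant, and ranking quality approximated by recall within a line budget. The function names and the `(path, line_number)` representation are hypothetical.

```python
from typing import List, Set, Tuple

# A line is identified by (file path, 1-based line number). Hypothetical representation.
Line = Tuple[str, int]


def coverage(inspected: List[Line], relevant: Set[Line]) -> float:
    """Fraction of relevant lines that were inspected at least once."""
    if not relevant:
        return 0.0
    return len(set(inspected) & relevant) / len(relevant)


def context_efficiency(inspected: List[Line], relevant: Set[Line]) -> float:
    """Fraction of distinct inspected lines that are relevant."""
    seen = set(inspected)
    if not seen:
        return 0.0
    return len(seen & relevant) / len(seen)


def recall_at_budget(ranked: List[Line], relevant: Set[Line], budget: int) -> float:
    """Coverage computed over only the first `budget` lines of a ranked list.

    Assumed stand-in for "ranking quality under a line budget".
    """
    return coverage(ranked[:budget], relevant)


# Toy instance: four relevant lines, five lines inspected in ranked order.
relevant = {("a.py", 10), ("a.py", 11), ("b.py", 5), ("b.py", 6)}
ranked = [("a.py", 10), ("c.py", 1), ("a.py", 11), ("c.py", 2), ("b.py", 5)]

print(coverage(ranked, relevant))             # 0.75
print(context_efficiency(ranked, relevant))   # 0.6
print(recall_at_budget(ranked, relevant, 3))  # 0.5
```

In this toy case the agent finds three of four relevant lines overall, but only two of them fall inside a budget of three lines, which is the kind of gap between coverage and ranking that the findings describe.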

Who it's for and trade-offs

Great fit if you are building or benchmarking coding agents, retrieval/localization modules, or repair pipelines and need a focused, repository-level probe of exploration behavior. It helps diagnose whether failures stem from poor retrieval/ranking or from downstream synthesis.

Look elsewhere if you only need black-box end-to-end success rates (resolved/unresolved) or if your priority is execution-time benchmarks rather than retrieval/inspection behavior; SWE-Explore emphasizes what agents read and rank, not execution speed or final patch evaluation alone.

Where it fits

SWE-Explore complements existing repository-level benchmarks (e.g., repair-centric suites) by isolating the exploration step. Use it to evaluate components such as code localizers, context retrievers, and agent navigation policies before integrating with synthesis/repair stages.

Methodological notes

Ground truth is distilled from independent agent trajectories that successfully solved the same issue, producing line-level annotations that reflect actual solution paths rather than manually annotated heuristics. Evaluations enforce a fixed line budget per instance to stress ranking efficiency and context selection.
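
As a rough illustration of the two ideas above, the sketch below distills line-level ground truth from several successful trajectories and then truncates a ranked prediction to a fixed line budget. The summary does not say how the benchmark distills trajectories, so the voting rule used here (keep lines seen in at least `min_support` trajectories) is purely an assumption, as are the function names and the `(path, line_number)` representation.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A line is identified by (file path, 1-based line number). Hypothetical representation.
Line = Tuple[str, int]


def distill_ground_truth(trajectories: Iterable[Iterable[Line]],
                         min_support: int = 2) -> Set[Line]:
    """Keep lines consulted by at least `min_support` successful trajectories.

    Assumed aggregation rule; the benchmark's actual procedure may differ.
    """
    counts = Counter()
    for trajectory in trajectories:
        # Count each line at most once per trajectory.
        counts.update(set(trajectory))
    return {line for line, n in counts.items() if n >= min_support}


def apply_line_budget(ranked: List[Line], budget: int) -> List[Line]:
    """Return the first `budget` distinct lines, preserving rank order."""
    kept: List[Line] = []
    seen: Set[Line] = set()
    for line in ranked:
        if line in seen:
            continue
        if len(kept) == budget:
            break
        seen.add(line)
        kept.append(line)
    return kept


# Three independent trajectories that solved the same (toy) issue.
trajectories = [
    {("a.py", 1), ("a.py", 2)},
    {("a.py", 2), ("b.py", 3)},
    {("a.py", 2), ("b.py", 3), ("c.py", 9)},
]
ground_truth = distill_ground_truth(trajectories, min_support=2)

# A ranked prediction with a repeated line, cut to a budget of two lines.
prediction = [("a.py", 2), ("a.py", 2), ("c.py", 9), ("b.py", 3)]
within_budget = apply_line_budget(prediction, budget=2)
```

Here `ground_truth` contains `("a.py", 2)` and `("b.py", 3)`, and `within_budget` keeps `("a.py", 2)` and `("c.py", 9)`, so one relevant line is lost to the budget because it was ranked too low.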

