Repository exploration, meaning the files and specific lines an agent inspects before producing a fix, is a practical bottleneck for coding agents, but holistic, binary benchmarks (resolved/unresolved) often obscure it. SWE-Explore isolates exploration as a measurable capability and shows that how an agent searches and ranks code regions strongly predicts downstream repair success.
Key Findings
- Fine-grained, line-level ground truth: the benchmark derives line-level relevance from successful agent trajectories, enabling evaluation at a granularity that file-level tests miss. This exposes differences in what agents actually consult when solving issues.
- Broad coverage and reproducibility: 848 issues drawn from 203 open-source repositories across 10 programming languages provide diverse, repository-level scenarios for evaluation and comparison.
- Multi-dimensional metrics: evaluation includes coverage (what fraction of relevant lines are seen), ranking quality (how well relevant lines are prioritized under a line budget), and context-efficiency (how much useful context is included per inspected line). These metrics correlate with downstream repair behavior, making exploration scores predictive rather than purely descriptive.
- Empirical result: agentic explorers (agents that plan, navigate, and retrieve code) outperform classical retrieval baselines. While modern methods achieve strong file-level localization, line-level coverage and efficient ranking remain the primary axes that separate top performers.
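The three metric families listed above can be illustrated with a minimal Python sketch. The exact formulas SWE-Explore uses are not stated here, so the definitions below are assumptions: coverage is taken as recall over relevant lines, ranking quality as recall within the first `budget` ranked lines, and context-efficiency as the share of inspected lines that are relevant. The `Line` type and all function names are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a line is (file path, 1-based line number).
Line = Tuple[str, int]


def coverage(inspected: List[Line], relevant: Set[Line]) -> float:
    """Assumed definition: fraction of relevant lines the agent inspected at all."""
    if not relevant:
        return 0.0
    return len(set(inspected) & relevant) / len(relevant)


def recall_at_budget(ranked: List[Line], relevant: Set[Line], budget: int) -> float:
    """Assumed definition: fraction of relevant lines within the first `budget` ranked lines."""
    if not relevant:
        return 0.0
    return len(set(ranked[:budget]) & relevant) / len(relevant)


def context_efficiency(inspected: List[Line], relevant: Set[Line]) -> float:
    """Assumed definition: fraction of distinct inspected lines that are relevant."""
    unique = set(inspected)
    if not unique:
        return 0.0
    return len(unique & relevant) / len(unique)


# Toy data, invented for illustration only.
relevant = {("a.py", 10), ("a.py", 11), ("b.py", 3), ("b.py", 4)}
ranked = [("a.py", 10), ("c.py", 1), ("a.py", 11), ("c.py", 2), ("b.py", 3)]

print(coverage(ranked, relevant))             # 0.75
print(recall_at_budget(ranked, relevant, 3))  # 0.5
print(context_efficiency(ranked, relevant))   # 0.6
```

In this toy case the explorer finds three of the four relevant lines overall, but only two of them fall inside a budget of three lines, which is the kind of gap between coverage and ranking quality that a line budget is meant to expose.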
Who it's for and trade-offs
SWE-Explore is a good fit if you are building or benchmarking coding agents, retrieval/localization modules, or repair pipelines and need a focused, repository-level probe of exploration behavior. It helps diagnose whether failures stem from poor retrieval and ranking or from downstream synthesis.
Look elsewhere if you only need black-box end-to-end success rates (resolved/unresolved), or if your priority is execution-time benchmarking rather than retrieval and inspection behavior. SWE-Explore measures what agents read and rank, not execution speed, and not final patch outcomes alone.
Where it fits
SWE-Explore complements existing repository-level benchmarks (e.g., repair-centric suites) by isolating the exploration step. Use it to evaluate components such as code localizers, context retrievers, and agent navigation policies before integrating with synthesis/repair stages.
Methodological notes
Ground truth is distilled from independent agent trajectories that successfully solved the same issue. The resulting line-level annotations reflect actual solution paths rather than manual annotation or heuristics. Evaluations enforce a fixed line budget per instance to stress ranking efficiency and context selection.
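As a rough illustration of the two ideas above, the sketch below aggregates the lines inspected by several successful trajectories into a relevance set and truncates an explorer's ranked output to a fixed line budget. How SWE-Explore actually combines trajectories is not described here; the `min_support` voting rule, the `Line` type, and both function names are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical representation: a line is (file path, 1-based line number).
Line = Tuple[str, int]


def distill_ground_truth(trajectories: Iterable[Set[Line]], min_support: int = 2) -> Set[Line]:
    """Keep lines inspected by at least `min_support` successful trajectories.

    The voting threshold is an assumption, not the benchmark's documented rule.
    """
    counts = Counter(line for traj in trajectories for line in traj)
    return {line for line, n in counts.items() if n >= min_support}


def apply_line_budget(ranked: List[Line], budget: int) -> List[Line]:
    """Keep the first `budget` distinct lines, preserving rank order."""
    seen: Set[Line] = set()
    kept: List[Line] = []
    for line in ranked:
        if line in seen:
            continue  # duplicates do not consume budget
        if len(kept) == budget:
            break
        seen.add(line)
        kept.append(line)
    return kept


# Toy trajectories, invented for illustration only.
trajectories = [
    {("a.py", 10), ("a.py", 11), ("c.py", 1)},
    {("a.py", 10), ("b.py", 3)},
    {("a.py", 10), ("a.py", 11), ("b.py", 3)},
]
ground_truth = distill_ground_truth(trajectories)
print(sorted(ground_truth))  # [('a.py', 10), ('a.py', 11), ('b.py', 3)]

ranked = [("a.py", 10), ("a.py", 10), ("c.py", 1), ("a.py", 11), ("b.py", 3)]
print(apply_line_budget(ranked, 3))  # [('a.py', 10), ('c.py', 1), ('a.py', 11)]
```

Requiring agreement between trajectories is one way to filter out lines a single agent happened to read without needing them; a union or a weighted scheme would be equally plausible under the description above.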
