Why this matters
Most benchmarks measure one-shot performance, which hides how agents improve when they are allowed to interact, test, and iterate. EdgeBench instead studies learning trajectories by placing agents in realistic, executable tasks with multi-level feedback and letting them run for day-scale budgets. The core empirical finding is that, across 134 tasks and more than 38,000 agent-hours, performance follows a log-sigmoid scaling law in interaction time (R² ≈ 0.998), which makes time-to-performance a first-class metric for agent evaluation.
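To make the scaling-law claim concrete, here is a minimal sketch of what a log-sigmoid curve in interaction time could look like, along with the R² statistic used to judge fit quality. The exact functional form and parameterization used by EdgeBench are not given here, so `log_sigmoid`, its parameters (`ceiling`, `midpoint_hours`, `slope`), and the sample values are illustrative assumptions, not the project's fitted model.

```python
import math

def log_sigmoid(t_hours, ceiling, midpoint_hours, slope):
    """Assumed form: a sigmoid in log-time that rises toward `ceiling`.

    At t_hours == midpoint_hours the curve sits at exactly half the ceiling.
    """
    z = slope * (math.log(t_hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical parameters and time grid, for illustration only.
hours = [0.5, 1, 2, 4, 8, 16, 24]
params = {"ceiling": 0.8, "midpoint_hours": 4.0, "slope": 1.2}
predicted = [log_sigmoid(t, **params) for t in hours]
```

Given measured scores at the same time points, `r_squared(measured, predicted)` indicates how closely the assumed curve tracks them. In practice the three parameters would be estimated with a nonlinear least-squares routine rather than set by hand.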
What sets it apart
- Long-horizon, iterative evaluation: agents submit continually and receive granular feedback (pass rates, failing tests, scores). The best submission across a 12+ hour run is used as the final score, so the benchmark measures improvement dynamics rather than a single snapshot.
- Two-container isolation (SForge): each task provides a work image visible to the agent and a hidden judge image that runs ephemeral tests; this design reduces evaluation hacking and preserves a realistic feedback channel.
- Large, diverse task suite with a public subset: the full benchmark defines 134 day-scale tasks across six capability categories; 51 tasks and the full evaluation harness are publicly released for reproducible research and local benchmarking.
- Empirical scaling law and human baselines: the project reports detailed time-series results (38,000+ agent-hours) and human expert effort statistics (mean ≈ 57.2 hours/task), enabling comparisons of sample efficiency and asymptotic ceilings.
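The best-of-run scoring rule and the time-to-performance metric described above can be sketched in a few lines. This is a hedged illustration, not the SForge harness: the `Submission` record, the function names, and the sample run are hypothetical, and scores are assumed to be comparable numbers where higher is better.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    hours_elapsed: float  # time since the run started (assumed unit: hours)
    score: float          # judge-reported score, higher is better (assumed)

def best_so_far(submissions):
    """Running maximum of scores, ordered by submission time."""
    curve, best = [], float("-inf")
    for sub in sorted(submissions, key=lambda s: s.hours_elapsed):
        best = max(best, sub.score)
        curve.append((sub.hours_elapsed, best))
    return curve

def final_score(submissions):
    """Final score is the best submission seen anywhere in the run."""
    return max(s.score for s in submissions)

def time_to_score(submissions, target):
    """First time the best-so-far score reaches `target`, else None."""
    for hours, best in best_so_far(submissions):
        if best >= target:
            return hours
    return None

# Hypothetical run: the third submission regresses but does not lower the curve.
run = [Submission(1, 0.2), Submission(3, 0.5),
       Submission(6, 0.4), Submission(12, 0.7)]
```

Because the curve is a running maximum, a regression such as the 0.4 submission at hour 6 leaves it unchanged. Comparing `time_to_score` at a fixed target is one simple way to compare sample efficiency between agents, while `final_score` reflects the ceiling reached within the budget.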
Who it's for (and trade-offs)
Great fit if you want to study how agents learn over time, compare sample efficiency between agents, or benchmark closed-loop agent systems in realistic execution environments. It is also useful for developers who need a reproducible harness (SForge) for long-running automated experiments.
Look elsewhere if you need many thousands of small, one-shot NLP examples or a lightweight leaderboard for immediate inference-only metrics: EdgeBench is optimized for long-horizon, executable tasks and requires nontrivial infrastructure (Docker/Kubernetes) and time budgets to run meaningful experiments.
Practical notes
The public release includes task definitions, judge/work image patterns, and example agent scaffolds; running the full suite or the closed 134-task benchmark requires contacting the maintainers. The evaluation focuses on interaction-time curves and iterative improvement, so plan experiments around multi-hour to day-scale runs rather than quick single-query evaluations.
