Most benchmarks measure only final task scores; they hide the information-acquisition and revision process that matters when an agent must evolve executable policies under limited feedback. EvoPolicyGym flips that frame: it treats the edit–submit–feedback loop itself as the evaluated object, forcing agents to decide what to probe, when to explore, and how to convert sparse rollout evidence into robust code changes.
Key Findings
- Formalizes Autonomous Policy Evolution as a controlled evaluation setting. So what: policy-search behavior (budget allocation, checkpoint selection, edit structure) becomes measurable rather than an implementation artifact.
- Instantiates a Core16 suite of compact RL environments. So what: tasks can be compared under a common 128-episode interaction budget and hidden held-out selection.
- Uses trajectory-level diagnostics to reveal mechanism differences. So what: strong agents do more than win isolated tasks; they discover task-appropriate mechanisms, preserve promising candidates, and balance structural synthesis against parameter tuning.
- Reports baseline leaderboard outcomes (e.g., top aggregate rank for a strong LLM-based agent). So what: the results highlight coverage and consistency across environments as distinct from isolated first-place wins.
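To make the budgeted edit-submit-feedback loop concrete, here is a minimal sketch in Python. It is not EvoPolicyGym's API. The toy one-parameter task, the names `toy_episode`, `BudgetedLoop`, `run_agent`, `select_checkpoint`, and `held_out_score`, the 8-episode submission size, and the hill-climbing agent are all assumptions made for illustration. Only the 128-episode budget and the idea of a hidden held-out evaluation come from the text above.

```python
import random
from dataclasses import dataclass, field
from typing import List

TARGET_GAIN = 0.7  # hypothetical optimum of the toy task, unknown to the agent


def toy_episode(gain: float, rng: random.Random) -> float:
    """Return of one noisy episode; it peaks when gain equals TARGET_GAIN."""
    return -(gain - TARGET_GAIN) ** 2 + rng.gauss(0.0, 0.05)


@dataclass
class Submission:
    gain: float         # the candidate policy, reduced to a single parameter
    episodes: int       # episodes spent evaluating this candidate
    mean_return: float  # the only feedback the agent sees


@dataclass
class BudgetedLoop:
    budget: int = 128   # shared interaction budget, in episodes
    used: int = 0
    history: List[Submission] = field(default_factory=list)

    def submit(self, gain: float, episodes: int, rng: random.Random) -> Submission:
        """Spend part of the budget on one candidate and log the result."""
        if episodes <= 0 or self.used + episodes > self.budget:
            raise ValueError("submission would exceed the interaction budget")
        mean = sum(toy_episode(gain, rng) for _ in range(episodes)) / episodes
        self.used += episodes
        sub = Submission(gain, episodes, mean)
        self.history.append(sub)  # the full revision trajectory is kept
        return sub


def run_agent(seed: int = 0, episodes_per_submit: int = 8) -> BudgetedLoop:
    """A deliberately simple agent: random-direction hill climbing."""
    rng = random.Random(seed)
    loop = BudgetedLoop()
    best = loop.submit(0.0, episodes_per_submit, rng)
    step = 0.4
    while loop.used + episodes_per_submit <= loop.budget:
        candidate = best.gain + rng.choice([-1, 1]) * step
        sub = loop.submit(candidate, episodes_per_submit, rng)
        if sub.mean_return > best.mean_return:
            best = sub   # preserve the promising candidate
        else:
            step *= 0.8  # make smaller edits after a failed probe
    return loop


def select_checkpoint(loop: BudgetedLoop) -> Submission:
    """Pick the final checkpoint using only feedback the agent has seen."""
    return max(loop.history, key=lambda s: s.mean_return)


def held_out_score(gain: float, seed: int = 10_000, episodes: int = 32) -> float:
    """Evaluator-side score on fresh seeds; the agent never sees this value."""
    rng = random.Random(seed)
    return sum(toy_episode(gain, rng) for _ in range(episodes)) / episodes


if __name__ == "__main__":
    loop = run_agent()
    chosen = select_checkpoint(loop)
    print(f"episodes used: {loop.used}, submissions: {len(loop.history)}")
    print(f"chosen gain: {chosen.gain:.3f}, held-out: {held_out_score(chosen.gain):.4f}")
```

With these assumed settings, the agent makes 16 submissions of 8 episodes each and spends the budget exactly. The retained `history` is the kind of record that trajectory-level diagnostics could be computed from, and `held_out_score` stands in for the hidden selection step by scoring the chosen checkpoint on seeds the agent never observed.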
Who it's for and tradeoffs
Great fit if you want to measure how coding agents transform rollout feedback into concrete policy edits, compare harness–model workflows, or study budget-conditioned search strategies. Look elsewhere if your goal is large-scale robotics deployment, long-horizon simulator-heavy training, or pure end-to-end RL benchmarks: the suite focuses on compact, sandboxed environments and constrains the interaction budget by design.
Where it fits
EvoPolicyGym sits between conventional RL leaderboards and open-ended engineering benchmarks. Unlike single-score evaluations, it records the full revision trajectory. Unlike open-ended engineering suites, it enforces strict visibility boundaries and a fixed interaction budget, so the evaluation isolates autonomous policy evolution rather than incremental engineering effort.
