Provides a benchmark and protocol to evaluate agents that iteratively edit executable policies under a fixed interaction budget, recording full execution–feedback–revise trajectories. Built from compact RL environments with trajectory-level diagnostics and hidden held-out validation.
Compiles natural-language function specifications into compact, locally-executable neural programs (PAW) that run on a small frozen interpreter; a 4B compiler emits LoRA adapters for a 0.6B runtime to provide offline, low-memory fuzzy text functions.