For decades the winners in AI were methods, not benchmarks: the Transformer paper has roughly 100x the citations of the WMT'14 benchmark it was evaluated on. Shunyu Yao argues that game just ended: "RL finally generalizes," and the scarce skill is no longer inventing models but deciding what is worth measuring at all.
Core Argument
- The recipe is language pre-training (priors) + scale + reasoning-as-action. The counterintuitive lesson: priors — long ignored by RL researchers fixated on algorithms — were the missing piece, and treating reasoning as an action lets those language priors generalize across environments.
- Once the recipe works, incremental methods get crushed. Your hard-won 5% gain on a benchmark is erased by the next o-series model's 30% jump, achieved without even targeting your task.
- So the loop inverts. Instead of "can we train a model to solve X?", the question becomes "what should we train AI to do, and how do we measure real progress?" Evaluation, not training, becomes the lever — a mindset closer to a product manager than a researcher.
The Utility Problem
AI has beaten world champions at chess and Go and reached gold-medal level at the IMO and IOI, yet GDP has barely moved. Yao traces this to evaluation setups that quietly make two assumptions real work violates: that agents run autonomously with no human in the loop, and that tasks are i.i.d. rather than sequential (a human engineer gets better inside a repo over time; today's agents re-solve each issue from scratch). Breaking these unexamined assumptions is where game-changing research now lives.
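The i.i.d.-versus-sequential distinction can be made concrete with a toy sketch. Everything here is hypothetical and not taken from Yao's essay: `ToyAgent`, its success rule (easy tasks always succeed, hard tasks succeed only in a repo the agent has already worked in), and the four-task list are invented purely to show how the same agent can score differently under the two evaluation protocols.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent that remembers which repos it has worked in."""
    memory: set = field(default_factory=set)

    def solve(self, task):
        repo, difficulty = task
        # Assumed success rule: easy tasks always succeed; hard tasks
        # succeed only if the agent has prior experience with the repo.
        solved = difficulty == "easy" or repo in self.memory
        self.memory.add(repo)
        return solved


def eval_iid(tasks):
    # i.i.d. protocol: a fresh agent per task, so nothing carries over.
    return sum(ToyAgent().solve(t) for t in tasks) / len(tasks)


def eval_sequential(tasks):
    # Sequential protocol: one agent keeps its memory across tasks.
    agent = ToyAgent()
    return sum(agent.solve(t) for t in tasks) / len(tasks)


tasks = [
    ("repo_a", "easy"),
    ("repo_a", "hard"),
    ("repo_a", "hard"),
    ("repo_b", "hard"),
]

print(eval_iid(tasks))         # 0.25
print(eval_sequential(tasks))  # 0.75
```

Under the i.i.d. protocol the agent's ability to accumulate familiarity is invisible, since every task starts from an empty memory. The sequential protocol is the one that can reward it.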
Who Should Read It
Great fit if you are a researcher or builder deciding where to spend effort in a post-"recipe" world, or trying to understand why saturating benchmarks has not translated into real-world impact. Look elsewhere if you want concrete RL methodology: this is a strategic, mindset-level essay based on Yao's Stanford CS224N and Columbia talks, not a technical how-to.
