The single most consequential idea in this paper is not the 7B model that nearly matches GPT-4 on competition math — it is GRPO, the reinforcement learning algorithm introduced here that later became the engine behind DeepSeek-R1 and the broader wave of RL-for-reasoning work. What looked at first like a math-specialization paper turned out to seed a training recipe the whole field would adopt.
Key Findings
- A 7B model reaches 51.7% on MATH without external toolkits or majority voting, and 60.9% with self-consistency over 64 samples — territory previously reserved for far larger closed models.
- Data quality dominates scale: a carefully engineered selection pipeline harvests 120B math tokens from public Common Crawl data, and the resulting 7B base model outperforms the far larger Minerva 540B, showing the web still holds untapped high-quality math signal.
- GRPO (Group Relative Policy Optimization) drops PPO's value network, estimating the baseline from a group of sampled outputs instead. This cuts memory cost substantially while improving reasoning — the change that made large-scale RL practical for later DeepSeek models.
Methodology
GRPO reframes policy optimization around group-relative rewards rather than a learned value estimate: for each prompt it samples a group of completions, scores them, and uses the group's mean reward as the baseline, so each completion's advantage is its reward relative to that mean, normalized by the group's standard deviation. Removing the separate critic model is what unlocks the memory savings, and the relative signal turns out to be a strong fit for verifiable tasks like math, where correctness is cheap to check.
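The group-relative baseline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `group_relative_advantages`, the `eps` stabilizer, the choice of population standard deviation, and the 0/1 rewards are all hypothetical. It covers only the advantage computation and leaves out the clipped policy-ratio objective and the KL penalty that a full GRPO update would also need.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own sampling group.

    rewards: one scalar reward per completion sampled for a single prompt.
    eps: assumed small constant that avoids division by zero when every
         completion in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("need at least one reward per group")
    baseline = fmean(rewards)   # group mean stands in for a critic's value estimate
    spread = pstdev(rewards)    # population std; an assumption of this sketch
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical group of four completions for one prompt:
# reward 1.0 if the final answer checks out, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion earns the same reward carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards yields advantages of zero. No second network is consulted at any point, which is where the memory saving over a critic-based method comes from.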
Who It's For
Great fit if you study RL post-training, reasoning LLMs, or want to understand the lineage that leads to DeepSeek-R1 — GRPO is the through-line. Look elsewhere if you need a turnkey math assistant or a survey; this is a focused research contribution about data curation and an RL algorithm, not a product or a broad overview.
