Automated competition-level mathematical proof remains a hard benchmark for LLMs because single-shot generation struggles with correctness and repair. The key insight here is that treating one trained model as a multipurpose engine (generator + verifier + refiner + ranker) and scaling via a population search at test time yields outsized gains: diversity + reliable verification lets you select proofs that single-pass decoding would miss.
Key Findings
- Population-level test-time scaling (MaxProof) turns one M3 model into a search procedure that produces and filters many candidate proofs, then returns the winner via tournament selection — this raises final-answer quality beyond single-sample improvements.
- Training separates three capabilities — proof generation, proof verification, and critique-conditioned repair — and unifies them into a released M3 model; the verifier is engineered for low false-positive rates so selection is conservative.
- With MaxProof scaling, the M3 model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing the human gold-medal threshold reported by the authors.
How it works
The pipeline trains a single model with three proof-oriented skills, using a generative-verifier RL objective that privileges verifiers with low false positives (defense-in-depth). At test time MaxProof repeatedly samples and refines a population of candidate proofs; each candidate is (re)verified and ranked, and tournament-style selection picks the final proof. The RL element helps align generation toward proofs that survive the verifier and repair loop, while population search provides breadth and ensemble-like robustness.
Who it's for and tradeoffs
Great fit if you care about pushing LLMs toward formally structured, multi-step reasoning where verification and repair matter (e.g., automated theorem proving, math benchmarks, rigorous reasoning evaluation). Expect higher compute and latency at test time because MaxProof runs many generations, verifications, and repair iterations; it also depends critically on verifier quality — overly permissive verifiers inflate results, overly strict ones may discard valid creativity. The approach emphasizes end-to-end empirical performance on contest-style problems rather than formal machine-checked proofs.
Where it fits
Compared with single-pass LLM decoding or simple self-checking, MaxProof invests compute in population search plus a surgical verifier and repair loop, trading runtime for higher final correctness. It sits between pure neural generation and hybrid theorem-prover pipelines: less formal than interactive proof assistants, but more verification-focused than naive generation.
