LogoAIAny
Icon for item

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Applies a population-level test-time scaling strategy that uses one model as generator, verifier, refiner, and ranker to search over candidate proofs. Combines generative-verifier RL and a low false-positive verifier with tournament selection to reach competition-level performance on IMO and USAMO.

Introduction

Automated competition-level mathematical proof remains a hard benchmark for LLMs because single-shot generation struggles with correctness and repair. The key insight here is that treating one trained model as a multipurpose engine (generator + verifier + refiner + ranker) and scaling via a population search at test time yields outsized gains: diversity + reliable verification lets you select proofs that single-pass decoding would miss.

Key Findings
  • Population-level test-time scaling (MaxProof) turns one M3 model into a search procedure that produces and filters many candidate proofs, then returns the winner via tournament selection — this raises final-answer quality beyond single-sample improvements.
  • Training separates three capabilities — proof generation, proof verification, and critique-conditioned repair — and unifies them into a released M3 model; the verifier is engineered for low false-positive rates so selection is conservative.
  • With MaxProof scaling, the M3 model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing the human gold-medal threshold reported by the authors.
How it works

The pipeline trains a single model with three proof-oriented skills, using a generative-verifier RL objective that privileges verifiers with low false positives (defense-in-depth). At test time MaxProof repeatedly samples and refines a population of candidate proofs; each candidate is (re)verified and ranked, and tournament-style selection picks the final proof. The RL element helps align generation toward proofs that survive the verifier and repair loop, while population search provides breadth and ensemble-like robustness.

Who it's for and tradeoffs

Great fit if you care about pushing LLMs toward formally structured, multi-step reasoning where verification and repair matter (e.g., automated theorem proving, math benchmarks, rigorous reasoning evaluation). Expect higher compute and latency at test time because MaxProof runs many generations, verifications, and repair iterations; it also depends critically on verifier quality — overly permissive verifiers inflate results, overly strict ones may discard valid creativity. The approach emphasizes end-to-end empirical performance on contest-style problems rather than formal machine-checked proofs.

Where it fits

Compared with single-pass LLM decoding or simple self-checking, MaxProof invests compute in population search plus a surgical verifier and repair loop, trading runtime for higher final correctness. It sits between pure neural generation and hybrid theorem-prover pipelines: less formal than interactive proof assistants, but more verification-focused than naive generation.

Information

  • Websitearxiv.org
  • AuthorsJiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng
  • Published date2026/06/11