AIAny - MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Automated competition-level mathematical proof remains a hard benchmark for LLMs because single-shot generation struggles both to produce a correct proof and to repair a flawed one. The key insight here is that treating one trained model as a multipurpose engine (generator, verifier, refiner, and ranker) and scaling it with a population search at test time yields outsized gains: diversity combined with reliable verification lets the system select proofs that single-pass decoding would miss.

Key Findings

Population-level test-time scaling (MaxProof) turns one M3 model into a search procedure that produces and filters many candidate proofs, then returns the winner via tournament selection. This raises the quality of the final proof beyond what a single sample achieves.
Training separates three capabilities (proof generation, proof verification, and critique-conditioned repair) and unifies them in the released M3 model. The verifier is engineered for a low false-positive rate, so selection is conservative.
With MaxProof scaling, the M3 model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, which the authors report as surpassing the human gold-medal threshold.

How it works

The pipeline trains a single model with three proof-oriented skills, using a generative-verifier RL objective that favors verifiers with low false-positive rates (a defense-in-depth measure). At test time, MaxProof repeatedly samples and refines a population of candidate proofs; each candidate is verified, and re-verified after repair, then ranked, and tournament-style selection picks the final proof. The RL element aligns generation toward proofs that survive the verifier and repair loop, while population search provides breadth and ensemble-like robustness.
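The test-time loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed number of rounds, the rule that only rejected candidates are repaired, and the single-elimination bracket are all hypothetical choices, and the four callables stand in for prompts to the same underlying model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; in practice each would be a prompt to one model.
Generate = Callable[[str], str]                 # problem -> proof
Verify = Callable[[str, str], Tuple[bool, str]]  # problem, proof -> (accepted, critique)
Refine = Callable[[str, str, str], str]          # problem, proof, critique -> new proof
Prefer = Callable[[str, str, str], bool]         # problem, a, b -> True if a is preferred


@dataclass
class Candidate:
    proof: str
    accepted: bool = False
    critique: str = ""


def tournament(problem: str, pool: List[Candidate], prefer: Prefer,
               rng: random.Random) -> Candidate:
    """Single-elimination bracket: pair candidates and keep the preferred one."""
    pool = list(pool)
    rng.shuffle(pool)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if prefer(problem, a.proof, b.proof) else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # odd candidate out advances unopposed
        pool = winners
    return pool[0]


def population_search(problem: str, generate: Generate, verify: Verify,
                      refine: Refine, prefer: Prefer, population_size: int = 8,
                      rounds: int = 3,
                      rng: Optional[random.Random] = None) -> Candidate:
    if population_size < 1:
        raise ValueError("population_size must be at least 1")
    rng = rng or random.Random(0)
    population = [Candidate(generate(problem)) for _ in range(population_size)]
    for _ in range(rounds):
        # Verify every candidate, then repair the rejected ones from their critique.
        for cand in population:
            cand.accepted, cand.critique = verify(problem, cand.proof)
        population = [
            c if c.accepted else Candidate(refine(problem, c.proof, c.critique))
            for c in population
        ]
    # Final verification pass so repaired candidates carry a fresh verdict.
    for cand in population:
        cand.accepted, cand.critique = verify(problem, cand.proof)
    # Prefer verifier-accepted proofs; fall back to everything if none pass.
    pool = [c for c in population if c.accepted] or population
    return tournament(problem, pool, prefer, rng)


# Toy stand-ins so the sketch runs without a model.
def toy_generate(problem: str) -> str:
    return f"draft for {problem}"


def toy_verify(problem: str, proof: str) -> Tuple[bool, str]:
    ok = proof.endswith("[repaired]")
    return ok, ("" if ok else "missing justification")


def toy_refine(problem: str, proof: str, critique: str) -> str:
    return proof + " [repaired]"


def toy_prefer(problem: str, a: str, b: str) -> bool:
    return len(a) <= len(b)
```

Restricting the tournament to verifier-accepted candidates is one way to make selection conservative when the verifier has a low false-positive rate; the fallback to the full population is an added assumption so that the sketch always returns something.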

Who it's for and tradeoffs

Great fit if you care about pushing LLMs toward formally structured, multi-step reasoning where verification and repair matter (e.g., automated theorem proving, math benchmarks, rigorous reasoning evaluation). Expect higher compute and latency at test time because MaxProof runs many generations, verifications, and repair iterations. It also depends critically on verifier quality: an overly permissive verifier inflates results, while an overly strict one may discard valid but unconventional proofs. The approach emphasizes end-to-end empirical performance on contest-style problems rather than formal machine-checked proofs.
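To get a feel for the test-time cost, the back-of-the-envelope count below may help. It is only a sketch under assumptions that are not taken from the paper: a population of fixed size, a fixed number of rounds in which every candidate is verified once and a given fraction is repaired, one final verification pass, and a single-elimination tournament. The function name and parameters are hypothetical.

```python
import math


def estimate_model_calls(population_size: int, rounds: int,
                         repair_rate: float) -> dict:
    """Count model calls for an assumed generate/verify/repair/tournament loop."""
    if population_size < 1 or rounds < 0 or not 0.0 <= repair_rate <= 1.0:
        raise ValueError("invalid search parameters")
    generations = population_size
    # One verification per candidate per round, plus one final pass.
    verifications = population_size * (rounds + 1)
    # Assumed fraction of candidates that need repair in each round.
    repairs = math.ceil(population_size * rounds * repair_rate)
    # A single-elimination bracket over n entrants needs at most n - 1 comparisons.
    comparisons = population_size - 1
    total = generations + verifications + repairs + comparisons
    return {
        "generations": generations,
        "verifications": verifications,
        "repairs": repairs,
        "comparisons": comparisons,
        "total": total,
    }
```

Under these assumptions, 8 candidates, 3 rounds, and a 50% repair rate come to 8 + 32 + 12 + 7 = 59 model calls, compared with one call for single-pass decoding.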

Where it fits

Compared with single-pass LLM decoding or simple self-checking, MaxProof invests compute in population search plus a precise verifier and repair loop, trading runtime for higher final correctness. It sits between pure neural generation and hybrid theorem-prover pipelines: less formal than interactive proof assistants, but more verification-focused than naive generation.
