DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeekMath 7B reaches 51.7% on the competition-level MATH benchmark without external tools or voting, approaching the performance of Gemini-Ultra and GPT-4. It is built on a 120B-token math corpus mined from Common Crawl, and it introduces GRPO, a memory-efficient PPO variant for reasoning.


Introduction

The single most consequential idea in this paper is not the 7B model that nearly matches GPT-4 on competition math — it is GRPO, the reinforcement learning algorithm introduced here that later became the engine behind DeepSeek-R1 and the broader wave of RL-for-reasoning work. What looked at first like a math-specialization paper turned out to seed a training recipe the whole field would adopt.

Key Findings

A 7B model reaches 51.7% on MATH without external toolkits or majority voting, and 60.9% with self-consistency over 64 samples — territory previously reserved for far larger closed models.
Data quality dominates scale: a carefully engineered selection pipeline harvests 120B math tokens from public Common Crawl data, and the 7B base model pre-trained on that corpus outperforms the far larger Minerva 540B on MATH, showing the web still holds untapped high-quality math signal.
GRPO (Group Relative Policy Optimization) drops PPO's value network, estimating the baseline from a group of sampled outputs instead. This cuts memory cost substantially while improving reasoning — the change that made large-scale RL practical for later DeepSeek models.
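
The self-consistency figure in the first finding refers to sampling many solutions for the same problem and keeping the most common final answer. A minimal sketch of that voting step follows, assuming the final answers have already been extracted from each sampled completion as strings; the function name `majority_vote` and the sample answers are hypothetical, not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled completions.

    `answers` is assumed to hold one already-extracted final answer per
    sampled completion. Ties go to the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers from 5 sampled completions of one problem
# (the paper's reported setting uses 64 samples).
sampled_answers = ["42", "41", "42", "42", "7"]
final_answer = majority_vote(sampled_answers)
```

Voting only helps when the correct answer is the single most common one among the samples, which is why the gain depends on how often the model is right on any one draw.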

Methodology

GRPO reframes policy optimization around group-relative rewards: for each prompt it samples a group of completions, scores them, and uses the group's mean reward as the baseline, so each completion's advantage is its reward relative to the group (normalized by the group's standard deviation). Removing the separate critic model is what unlocks the memory savings, and the relative signal turns out to be a strong fit for verifiable tasks like math, where correctness is cheap to check.
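
The group-relative advantage step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it covers only the advantage computation (the full GRPO objective also includes a PPO-style clipped probability ratio and a KL penalty against a reference model), and the function name, the `eps` constant, the use of the population standard deviation, and the binary rewards are all choices made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own group.

    Computes (r - group_mean) / (group_std + eps) for every reward in the
    group. No learned value network is involved; the group itself supplies
    the baseline. `eps` (an assumption) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions for one prompt, scored 1.0 when the
# final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

With binary rewards, correct completions receive positive advantages and incorrect ones negative. When every completion in a group earns the same reward, all advantages are zero, so that prompt contributes nothing through the advantage term.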

Who It's For

Great fit if you study RL post-training, reasoning LLMs, or want to understand the lineage that leads to DeepSeek-R1 — GRPO is the through-line. Look elsewhere if you need a turnkey math assistant or a survey; this is a focused research contribution about data curation and an RL algorithm, not a product or a broad overview.


Information

Website: arxiv.org
Organizations: DeepSeek-AI
Published date: 2024/02/05

More Items

Large Language Model Papers, 2026

DecoEvo: Score-Decoupled Co-Evolution of Solver and Rubric-Generator Skills in Text Space

Jiangwang Chen, Zixin Song +11 (Tsinghua University, Qwen Business Unit of Alibaba +2)

Co-evolves a solver skill and a rubric-generator skill for text-space LLM optimization under decoupled objectives to avoid rubric gaming without using gold rubrics. Solver updates use criterion-level feedback; generator updates use independent audits of requirement coverage and response discrimination.

LLM evaluation agent-skills qwen paper+2

AI Video Papers, 2026

Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

Senqiao Yang, Kaichen Zhang +21

Real-time streaming multimodal foundation model that uses a codec-native tokenizer (Mage-ViT) to encode motion- and residual-rich regions from video I/P frames, reducing visual token usage by over 75% and enabling up to ~3.5× wall-clock inference speedup after training on ~560M images and 100M video frames.

multimodal video vision foundation-model ai+5

Large Language Model Papers, 2026

Kimi K3: Open Frontier Intelligence

Kimi Team, Tongtong Bai +400

Presents a 2.8T-parameter Mixture-of-Experts multimodal model with a 1-million-token context window and 104 billion activated parameters, targeting long-horizon agentic RL, coding, reasoning, and vision. Key innovations include Kimi Delta Attention, Attention Residuals, Stable LatentMoE (16 of 896 experts active per token), ~2.5× scaling efficiency over Kimi K2, and a public weight release.

kimi foundation-model llm multimodal vision+6