DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Re-derives LLM scaling laws, tracing prior disagreements to how compute budget was modeled, then trains 7B and 67B models on 2T tokens. The 67B model beats LLaMA-2 70B on code, math, and reasoning; its chat variant tops GPT-3.5 on open-ended evals.


Introduction

Scaling-law studies disagree on how to split a compute budget between model size and data, and the disagreement is often attributed loosely to dataset differences. This work identifies a concrete contributor: earlier studies represented model scale by parameter count, which approximates compute as 6N FLOPs per token. That approximation omits attention FLOPs and, when N includes embeddings, counts vocabulary computation that adds little model capacity. Representing model scale as non-embedding FLOPs per token gives a more accurate compute budget, and it changes the optimal data/model ratio enough to matter when you are committing millions of GPU-hours.
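To make the gap between the two model-scale measures concrete, here is a minimal Python sketch. It assumes the usual dense-transformer approximations: 6N FLOPs per token for a parameter count N, roughly 12 * n_layer * d_model^2 non-embedding parameters, and an attention term of 12 * n_layer * d_model * l_seq. The function name and both configurations are hypothetical, and the formulas should be checked against the paper before being relied on.

```python
def flops_per_token(n_layer: int, d_model: int, n_vocab: int, l_seq: int) -> dict:
    """Return three per-token compute estimates for a dense transformer.

    All three are approximations (assumed here, not taken from this page).
    """
    dense = 72 * n_layer * d_model ** 2          # 6 * (12 * n_layer * d_model^2)
    vocab = 6 * n_vocab * d_model                # vocabulary projection term
    attention = 12 * n_layer * d_model * l_seq   # attention over the sequence
    return {
        "6N_non_embedding": dense,
        "6N_total": dense + vocab,
        "M_non_embedding_flops": dense + attention,
    }


# Hypothetical configurations, chosen only for illustration.
configs = {
    "small": dict(n_layer=8, d_model=512, n_vocab=100_000, l_seq=4096),
    "large": dict(n_layer=80, d_model=8192, n_vocab=100_000, l_seq=4096),
}

for name, cfg in configs.items():
    est = flops_per_token(**cfg)
    m = est["M_non_embedding_flops"]
    for key in ("6N_non_embedding", "6N_total"):
        # Ratio of each parameter-based estimate to non-embedding FLOPs/token.
        print(f"{name:5s} {key:18s} / M = {est[key] / m:.2f}")
```

Under these assumptions the parameter-based estimates deviate most from M for the small configuration, which is the regime where scaling-law sweeps are actually run.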

Key Findings

The choice of model-scale metric is not mere bookkeeping: using non-embedding FLOPs/token instead of parameter count changes the fitted optimal allocation between model scale and training tokens.
Optimal hyperparameters (batch size, learning rate) follow predictable power laws in compute, so they can be set ahead of an expensive run rather than tuned by trial and error.
Data quality shifts the scaling exponent itself — better data justifies spending relatively more of the budget on model size than on tokens.
The resulting 67B model, trained on 2T tokens with SFT and DPO, outperforms LLaMA-2 70B on code, math, and reasoning, and the chat variant beats GPT-3.5 on open-ended evaluation.
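The power-law claim for hyperparameters can be illustrated with a small log-log regression. This is a sketch only: `fit_power_law` is a hypothetical helper, and the sweep data is synthetic, generated from an invented law rather than taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = coef * x**exponent, done in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    exponent = sxy / sxx
    coef = math.exp(mean_y - exponent * mean_x)
    return coef, exponent


# Synthetic sweep: compute budget (FLOPs) -> best learning rate found.
# The generating law (0.3 * C**-0.125) is invented for this example.
budgets = [1e17, 3e17, 1e18, 3e18, 1e19]
best_lr = [0.3 * c ** -0.125 for c in budgets]

coef, exponent = fit_power_law(budgets, best_lr)

# Extrapolate to a budget far beyond the sweep, before running anything large.
target_budget = 1e23
predicted_lr = coef * target_budget ** exponent
print(f"fit: lr = {coef:.4f} * C^{exponent:.4f}")
print(f"predicted lr at C={target_budget:.0e}: {predicted_lr:.2e}")
```

The same fit applies to batch size; with real sweep data the points scatter around the line, so the quality of the fit should be checked before extrapolating.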

Methodology

The team fits scaling laws on small-scale sweeps, predicts the loss and ideal configuration for the full 7B and 67B runs, then validates that the large runs land where the laws predicted. The emphasis is on extrapolation accuracy — using cheap experiments to de-risk a single large training run — rather than on a new architecture.
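Once an allocation law has been fitted, turning a compute budget into a concrete configuration is simple arithmetic. The sketch below assumes compute is the product of model scale M (non-embedding FLOPs per token) and training tokens D, and that the optimal M follows a power law in compute. The function name, the coefficient, and the exponents are placeholders, not the paper's fitted values.

```python
def allocate(compute: float, m_base: float, a: float):
    """Split a compute budget C (FLOPs) into model scale M and tokens D.

    Assumes C = M * D and M_opt = m_base * C**a, so D_opt = C / M_opt.
    """
    m_opt = m_base * compute ** a
    d_opt = compute / m_opt
    return m_opt, d_opt


# Placeholder values for illustration only.
M_BASE = 0.17
BUDGET = 1e23  # hypothetical training budget in FLOPs

# A larger exponent `a` sends more of the same budget to model scale,
# which is the direction the text associates with higher-quality data.
for a in (0.45, 0.50, 0.55):
    m_opt, d_opt = allocate(BUDGET, M_BASE, a)
    print(f"a={a:.2f}  M={m_opt:.3e} FLOPs/token  D={d_opt:.3e} tokens")
```

Validating the approach then amounts to comparing the loss measured on the large run against the loss the fitted curve predicted for that budget.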

Who It's For

Great fit if you train foundation models from scratch and need a defensible, reproducible basis for compute allocation and hyperparameter choices. This is also the founding paper of the DeepSeek series, useful context for anyone tracking that line of models. Look elsewhere if you want fine-tuning recipes or deployment guidance — the contribution is the scaling methodology and the open 7B/67B base and chat weights, not application tooling.


Information

Website: arxiv.org
Organizations: DeepSeek-AI
Published date: 2024/01/05
