Most scaling-law papers disagree on how to split a compute budget between model size and data, and the disagreement is usually waved away as dataset differences. This work pins it to a concrete cause: earlier studies used parameter count as the model-scale variable, which is only a rough proxy for per-token compute because it can include cheap embedding parameters and leaves out attention cost. Switching to non-embedding FLOPs-per-token reconciles the conflicting results and changes the optimal data/model ratio enough to matter when you are committing millions of GPU-hours.
Key Findings
- The model-scale metric you pick is not mere bookkeeping: using non-embedding FLOPs/token instead of parameter count shifts the optimal allocation between adding parameters and adding tokens.
- Optimal hyperparameters (batch size, learning rate) follow predictable power laws in compute, so they can be set ahead of an expensive run rather than tuned by trial and error.
- Data quality shifts the scaling exponent itself — better data justifies spending relatively more of the budget on model size than on tokens.
- The resulting 67B base model, pretrained on 2T tokens, outperforms LLaMA-2 70B on code, math, and reasoning benchmarks; the chat variant, aligned with SFT and DPO, beats GPT-3.5 on open-ended evaluation.
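
To make the first finding concrete, the sketch below compares three per-token compute estimates for a decoder-only transformer. The formulas are the commonly used approximations (roughly `12 * n_layer * d_model^2` non-embedding parameters, a factor of 6 for the forward plus backward pass, and `12 * n_layer * d_model * seq_len` for the attention matmuls); they are assumed here, not quoted from this summary. The function name and the model configuration are hypothetical.

```python
def flops_per_token(n_layer, d_model, seq_len, vocab_size):
    """Return three rough estimates of training FLOPs per token.

    All three are approximations; the constants are assumptions of this sketch.
    """
    dense = 72 * n_layer * d_model**2        # 6 * N1, with N1 = 12 * n_layer * d_model^2
    embed = 6 * vocab_size * d_model         # 6 * embedding/vocabulary parameters
    attn = 12 * n_layer * d_model * seq_len  # attention cost, which grows with context length
    return {
        "6N1": dense,           # non-embedding parameter count proxy
        "6N2": dense + embed,   # full parameter count proxy
        "M": dense + attn,      # non-embedding FLOPs/token
    }


# Hypothetical small model, chosen only to show how far the proxies can diverge.
est = flops_per_token(n_layer=8, d_model=512, seq_len=4096, vocab_size=100_000)
for name, value in est.items():
    print(f"{name}: {value:.3e} FLOPs/token ({value / est['M']:.2f}x of M)")
```

For small models with a long context or a large vocabulary, the gap between the proxies and `M` is large, and it narrows as `d_model` grows. That size-dependent gap is why a fit made on small models can extrapolate differently depending on which variable is used.
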
Methodology
The team fits scaling laws on small-scale sweeps, predicts the loss and ideal configuration for the full 7B and 67B runs, then validates that the large runs land where the laws predicted. The emphasis is on extrapolation accuracy — using cheap experiments to de-risk a single large training run — rather than on a new architecture.
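
The fit-then-extrapolate workflow described above can be sketched as a straight-line fit in log-log space. The example below is a minimal illustration on synthetic data: the compute budgets, the "true" coefficients, the noise level, and the helper name `fit_power_law` are all made up for demonstration and are not values from the paper.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = k * x**p by least squares on (log x, log y); return (k, p)."""
    p, log_k = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(log_k)), float(p)


# Synthetic small-scale sweep: the best batch size found at each compute budget.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
true_k, true_p = 0.5, 0.33  # made-up ground truth for this illustration
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.05, compute.size))  # about 5% multiplicative noise
best_batch = true_k * compute**true_p * noise

k, p = fit_power_law(compute, best_batch)

# Extrapolate two orders of magnitude beyond the largest sweep point.
target_compute = 1e21
predicted_batch = k * target_compute**p
print(f"fitted exponent: {p:.3f}")
print(f"predicted optimum at C={target_compute:.0e}: {predicted_batch:.3e}")
```

The same pattern applies to any quantity assumed to follow a power law in compute, such as learning rate or optimal token count. Extrapolation error grows with the distance from the fitted range, which is why validating the prediction against the actual large run matters.
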
Who It's For
Great fit if you train foundation models from scratch and need a defensible, reproducible basis for compute allocation and hyperparameter choices. This is also the founding paper of the DeepSeek series, useful context for anyone tracking that line of models. Look elsewhere if you want fine-tuning recipes or deployment guidance — the contribution is the scaling methodology and the open 7B/67B base and chat weights, not application tooling.
