Open-weight Mixture-of-Experts LLM with 671B total parameters, of which 37B are activated per token, trained on 14.8T tokens in 2.788M H800 GPU-hours. Comes within reach of leading closed models at a fraction of typical training cost through FP8 training and model-system co-design.
The headline number is the one that breaks intuition: a 671B-parameter model trained for roughly $5.6M in compute (2.788M H800 GPU-hours at an assumed rental price of $2 per GPU-hour). DeepSeek-V3 makes the case that the next frontier isn't more GPUs but better systems engineering: co-designing the model, the training algorithm, and the FP8 numerical stack so that a Mixture-of-Experts model activates just 37B of its 671B parameters per token without the load-balancing headaches that usually wreck sparse training.
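The headline figures can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the $2 per H800 GPU-hour rate is an assumed rental price (the one the cost estimate rests on), the function names are made up for this example, and the result covers only the GPU time of the training run itself.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS_B = 671      # total parameters, in billions
ACTIVE_PARAMS_B = 37      # parameters activated per token, in billions
GPU_HOURS = 2.788e6       # H800 GPU-hours for the full training run

# Assumption: rental price of $2 per H800 GPU-hour.
USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Compute cost as GPU-hours times an assumed hourly rental price."""
    return gpu_hours * usd_per_gpu_hour


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_params / total_params


cost = training_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)
fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)

print(f"Estimated training cost: ${cost / 1e6:.3f}M")
print(f"Parameters active per token: {fraction:.1%}")
```

Under these assumptions the cost comes to about $5.6M, and only around 5.5% of the parameters do work on any given token, which is where most of the savings in per-token compute come from.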
On knowledge, math, and code benchmarks it lands ahead of other open-weight models and within reach of GPT-4o and Claude-3.5-Sonnet, a level of performance that, before this paper, most assumed required a closed lab's budget. The recipe, not just the weights, is the contribution.
Great fit if you're a researcher or infra team studying efficient large-scale training, or if you want strong open weights you can self-host and fine-tune. Look elsewhere if you need a turnkey hosted API with service guarantees, or if you lack the multi-node H800-class hardware to actually run a 671B MoE. Open weights don't make the serving footprint small: all 671B parameters must be held in memory even though only 37B are active per token.
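To make the serving footprint concrete, here is a rough lower-bound estimate of the memory needed just to hold the weights. It is a sketch under stated assumptions: 80 GB of memory per H800-class GPU, 1 byte per parameter for FP8 and 2 bytes for BF16, and no allowance for KV cache, activations, or framework overhead, so real deployments need more. The function names are hypothetical.

```python
import math

TOTAL_PARAMS = 671e9      # all experts must be resident, not just the 37B active ones
GPU_MEMORY_GB = 80        # assumption: 80 GB per H800-class GPU


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(weight_gb: float, gpu_memory_gb: float) -> int:
    """Fewest GPUs whose combined memory can hold the weights (lower bound)."""
    return math.ceil(weight_gb / gpu_memory_gb)


fp8_gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=1)
bf16_gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=2)

print(f"FP8 weights:  {fp8_gb:.0f} GB, at least {min_gpus_for_weights(fp8_gb, GPU_MEMORY_GB)} GPUs")
print(f"BF16 weights: {bf16_gb:.0f} GB, at least {min_gpus_for_weights(bf16_gb, GPU_MEMORY_GB)} GPUs")
```

Even in FP8 the weights alone exceed a single 8-GPU node under these assumptions, which is why multi-node hardware is the practical baseline.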