Long-context inference has been gated by attention's quadratic cost, not by model quality. DeepSeek-V4 attacks the cost side directly: at a one-million-token context, V4-Pro uses only 27% of V3.2's single-token FLOPs and 10% of its KV cache, while V4-Flash drops to 10% of the FLOPs and 7% of the cache. The aim is to make million-token contexts a routine operating mode rather than a stunt, opening room for longer test-time scaling and long-horizon agentic work.
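To make those percentages concrete, the short sketch below converts each "fraction of V3.2" figure into a reduction factor. The fractions are the ones quoted above; the dictionary layout and the helper names `reduction_factor` and `summarize` are illustrative only.

```python
# Relative cost at a 1M-token context, with V3.2 normalized to 1.0.
# The fractions come from the summary above; the rest is plain arithmetic.
RELATIVE_COST = {
    "V4-Pro": {"flops": 0.27, "kv_cache": 0.10},
    "V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(fraction: float) -> float:
    """Turn 'uses `fraction` of the baseline' into 'baseline is 1/fraction times larger'."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def summarize(costs):
    """Return reduction factors, rounded to one decimal, for every model and metric."""
    return {
        model: {metric: round(reduction_factor(f), 1) for metric, f in metrics.items()}
        for model, metrics in costs.items()
    }


if __name__ == "__main__":
    for model, factors in summarize(RELATIVE_COST).items():
        print(model, factors)
```

Read this way, V4-Pro needs about 3.7x fewer single-token FLOPs and 10x less KV cache than V3.2 at 1M tokens, and V4-Flash about 10x fewer FLOPs and roughly 14x less cache.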
Key Findings
- The efficiency comes from a hybrid attention scheme: Compressed Sparse Attention (CSA) compresses KV along the sequence dimension and then applies sparse attention, while Heavily Compressed Attention (HCA) compresses KV more aggressively but keeps attention dense, so cost stays bounded as context grows.
- Manifold-Constrained Hyper-Connections (mHC) constrain the residual mapping to doubly stochastic matrices (spectral norm ≤ 1), fixing the numerical instability that broke earlier hyper-connection stacks at depth.
- Training uses the Muon optimizer for faster convergence, FP4 quantization-aware training for MoE expert weights, and a two-stage post-training pipeline: independent domain specialists (math, code, agent) consolidated into one model via on-policy distillation.
- On internal evals, V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches Opus 4.5 on agentic coding, and surpasses Gemini-3.1-Pro on long-context academic benchmarks. On knowledge it still trails frontier closed models, with a gap of roughly 3–6 months.
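The stability claim for mHC rests on a standard fact: a doubly stochastic matrix has spectral norm 1, so mixing residual streams with it cannot inflate their norm, however many layers are stacked. The sketch below illustrates that property only. It assumes Sinkhorn-style alternating normalization as the way to reach an (approximately) doubly stochastic matrix, which may differ from what the report does, and it models just the stream-mixing step, not a full hyper-connection block. All function names are hypothetical.

```python
import math
import random


def sinkhorn(logits, n_iters=50):
    """Approximate a doubly stochastic matrix by alternating row/column normalization."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]  # strictly positive entries
    for _ in range(n_iters):
        for row in m:  # make every row sum to 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):  # make every column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


def mix(matrix, streams):
    """Mix n residual streams (each a list of d floats) with an n x n matrix."""
    n, d = len(streams), len(streams[0])
    return [
        [sum(matrix[i][j] * streams[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


def frobenius(x):
    """Frobenius norm of a list-of-lists matrix."""
    return math.sqrt(sum(v * v for row in x for v in row))


def stack_norms(depth, n=4, d=8, seed=0):
    """Track the norm of the streams through `depth` random constrained mixings."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    norms = [frobenius(x)]
    for _ in range(depth):
        logits = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
        x = mix(sinkhorn(logits), x)
        norms.append(frobenius(x))
    return norms


if __name__ == "__main__":
    norms = stack_norms(depth=64)
    print(f"norm at depth 0: {norms[0]:.4f}, at depth 64: {norms[-1]:.4f}")
```

Replacing `sinkhorn(logits)` with an unconstrained matrix such as `[[math.exp(v) for v in row] for row in logits]` lets the norm grow geometrically with depth, which is the kind of instability the constraint is meant to rule out.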
Methodology
The series keeps the DeepSeekMoE and Multi-Token Prediction stack from V3, with three changes: the routing affinity activation becomes Softplus, the cap on routing target nodes is removed, and the dense FFNs in early blocks are replaced with hash-routed MoE layers. Both models pre-train on 32T+ tokens (V4-Pro on 33T) with native 1M-context support.
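The two routing ideas mentioned above can be contrasted in a few lines. This is a minimal sketch, not the report's implementation: `route_topk` assumes that Softplus affinities are followed by top-k selection and renormalization of the selected gates, and `route_hash` uses an arbitrary multiplicative hash purely for illustration. Both function names, the hash constant, and the choice of k consecutive experts are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_topk(logits, k):
    """Learned routing: Softplus affinities, keep the top k, renormalize the gates.

    The renormalization step is an assumption made for this sketch.
    """
    scores = [softplus(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    return {e: scores[e] / total for e in top}


def route_hash(token_id: int, n_experts: int, k: int):
    """Hash routing: experts are a fixed function of the token id, with no learned router.

    The multiplicative constant is illustrative only.
    """
    if k > n_experts:
        raise ValueError("k cannot exceed n_experts")
    start = (token_id * 2654435761) % n_experts
    return [(start + i) % n_experts for i in range(k)]


if __name__ == "__main__":
    print(route_topk([2.0, -1.0, 0.5, 3.0], k=2))  # gates for experts 3 and 0
    print(route_hash(token_id=12345, n_experts=8, k=2))
```

Because Softplus is positive and unbounded above, the affinities need no shifting before normalization. Hash routing, by contrast, depends only on the token id, so the same token always reaches the same experts and no router parameters are trained for those layers.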
Great Fit If / Look Elsewhere
Worth studying if you care about the architecture of efficient ultra-long context, MoE routing, or RL-based post-training at frontier scale. Look elsewhere if you want a turnkey small model: this is a preview report on trillion-parameter systems, and the authors note the design stays deliberately complex with several empirically-validated-but-not-fully-understood tricks.
