DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Preview of an MoE model family (V4-Pro: 1.6T params, 49B active; V4-Flash: 284B params, 13B active) built for 1M-token contexts. A hybrid attention design cuts V4-Pro's single-token inference FLOPs to 27% and its KV cache to 10% of V3.2's at million-token length.


Introduction

Long-context inference has been gated by attention's quadratic cost, not by model quality. DeepSeek-V4 attacks the cost side directly: at a one-million-token context, V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache of V3.2, while V4-Flash drops to 10% of the FLOPs and 7% of the cache. The goal is to make million-token contexts a routine operating mode rather than a stunt, opening room for longer test-time scaling and long-horizon agentic work.
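The percentages above are easier to compare as reduction factors. The short sketch below only restates the figures quoted in this summary; the helper names (`reduction_factor`, `summarize`) are illustrative and not taken from the report.

```python
# Relative cost at a 1M-token context, as a fraction of V3.2 (figures quoted above).
RELATIVE_COST = {
    "V4-Pro": {"flops": 0.27, "kv_cache": 0.10},
    "V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(fraction: float) -> float:
    """Turn 'X of the baseline cost' into an 'N times cheaper' factor."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def summarize(costs: dict) -> dict:
    """Map each model and metric to its reduction factor, rounded to one decimal."""
    return {
        model: {metric: round(reduction_factor(f), 1) for metric, f in metrics.items()}
        for model, metrics in costs.items()
    }


if __name__ == "__main__":
    for model, factors in summarize(RELATIVE_COST).items():
        print(f"{model}: {factors['flops']}x fewer FLOPs, "
              f"{factors['kv_cache']}x smaller KV cache than V3.2")
```

Read this way, V4-Pro needs about 3.7x fewer single-token FLOPs and a 10x smaller KV cache than V3.2, and V4-Flash about 10x fewer FLOPs and a 14.3x smaller cache.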

Key Findings

The efficiency comes from a hybrid attention scheme: Compressed Sparse Attention (CSA) compresses KV along the sequence dimension and then applies sparse attention, while Heavily Compressed Attention (HCA) compresses KV harder but keeps attention dense, so cost stays bounded as context grows.
Manifold-Constrained Hyper-Connections (mHC) constrain the residual mapping to doubly stochastic matrices (spectral norm ≤ 1), fixing the numerical instability that broke earlier hyper-connection stacks at depth.
Training uses the Muon optimizer for faster convergence, FP4 quantization-aware training for MoE expert weights, and a two-stage post-training pipeline: independent domain specialists (math, code, agent) consolidated into one model via on-policy distillation.
On internal evals V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches Opus 4.5 on agentic coding, and surpasses Gemini-3.1-Pro on long-context academic benchmarks; it still trails frontier closed models on knowledge by roughly 3–6 months.
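The mHC finding rests on a simple property: a doubly stochastic matrix has spectral norm 1, and a product of doubly stochastic matrices is again doubly stochastic, so stacking many such mappings cannot amplify the signal. The sketch below illustrates this with small random matrices. It assumes Sinkhorn-Knopp normalization as the way to reach a doubly stochastic matrix, which this summary does not specify; the function names, the 4x4 size, and the depth of 30 are all illustrative choices.

```python
import math
import random


def sinkhorn(matrix, iters=200):
    """Alternately normalize rows and columns of a positive square matrix."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iters):
        row_sums = [sum(row) for row in m]
        m = [[m[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_norm(m, iters=300):
    """Largest singular value by power iteration on M^T M.

    Starts from the all-ones vector, which is adequate for the
    positive matrices used in this sketch.
    """
    n = len(m)
    v = [1.0] * n
    for _ in range(iters):
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        mtmv = [sum(m[i][j] * mv[i] for i in range(n)) for j in range(n)]
        scale = math.sqrt(sum(x * x for x in mtmv))
        v = [x / scale for x in mtmv]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in mv))


def stack_norm(depth, constrained, n=4, seed=0):
    """Spectral norm of a product of `depth` random positive mixing matrices."""
    rng = random.Random(seed)
    product = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(depth):
        layer = [[rng.uniform(0.5, 1.5) for _ in range(n)] for _ in range(n)]
        if constrained:
            layer = sinkhorn(layer)  # assumed projection step, see lead-in
        product = matmul(layer, product)
    return spectral_norm(product)


if __name__ == "__main__":
    print("unconstrained stack:", stack_norm(30, constrained=False))
    print("doubly stochastic stack:", stack_norm(30, constrained=True))
```

In this toy setting the unconstrained product grows by many orders of magnitude over 30 layers, while the constrained product stays at norm 1. That is the kind of depth-wise stability the constraint is meant to provide, though the real residual mappings are learned rather than random.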

Methodology

The series keeps the DeepSeekMoE and Multi-Token Prediction stack from V3, with three changes: the routing affinity activation is switched to Softplus, the cap on routing target nodes is removed, and the dense FFNs in early blocks are replaced with hash-routed MoE layers. Both models are pre-trained on more than 32T tokens (33T for V4-Pro) with native 1M-context support.
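The two routing styles mentioned above can be contrasted in a few lines. This is a minimal sketch under stated assumptions, not the report's implementation: the top-k selection, the normalization of the selected gates, the CRC32 hash, and routing on the token id are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import zlib


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def topk_route(logits, k):
    """Learned routing: score experts with Softplus, keep the top k.

    Returns {expert_index: gate_weight}; gates are normalized to sum to 1
    (the normalization is an assumption of this sketch).
    """
    scores = [softplus(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def hash_route(token_id: int, num_experts: int) -> int:
    """Hash routing: a fixed, non-learned map from token id to expert.

    CRC32 is used only because it is deterministic across runs.
    """
    return zlib.crc32(token_id.to_bytes(8, "little")) % num_experts


if __name__ == "__main__":
    # Hypothetical router logits for one token over four experts.
    print(topk_route([0.1, 2.0, -1.0, 1.5], k=2))
    # The same token id always lands on the same expert.
    print(hash_route(12345, num_experts=8))
```

The design difference is that learned routing depends on the token's hidden state and must be trained, whereas hash routing needs no router parameters at all.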

Great Fit If / Look Elsewhere

Worth studying if you care about the architecture of efficient ultra-long context, MoE routing, or RL-based post-training at frontier scale. Look elsewhere if you want a turnkey small model: this is a preview report on trillion-parameter systems, and the authors note that the design remains deliberately complex, with several tricks that are empirically validated but not fully understood.


Information

Website: huggingface.co
Organizations: DeepSeek-AI
Published date: 2026/05/06

More Items

Large Language Model Papers (2026)

DecoEvo: Score-Decoupled Co-Evolution of Solver and Rubric-Generator Skills in Text Space

Jiangwang Chen, Zixin Song +11; Tsinghua University, Qwen Business Unit of Alibaba +2

Co-evolves a solver skill and a rubric-generator skill for text-space LLM optimization under decoupled objectives to avoid rubric gaming without using gold rubrics. Solver updates use criterion-level feedback; generator updates use independent audits of requirement coverage and response discrimination.

LLM evaluation agent-skills qwen paper +2

AI Video Papers (2026)

Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

Senqiao Yang, Kaichen Zhang +21

Real-time streaming multimodal foundation model that uses a codec-native tokenizer (Mage-ViT) to encode motion- and residual-rich regions from video I/P frames, reducing visual token usage by over 75% and enabling up to ~3.5× wall-clock inference speedup after training on ~560M images and 100M video frames.

multimodal video vision foundation-model ai +5

Large Language Model Papers (2026)

Kimi K3: Open Frontier Intelligence

Kimi Team, Tongtong Bai +400

Presents a 2.8T-parameter Mixture-of-Experts multimodal model with a 1-million-token context window and 104 billion activated parameters, targeting long-horizon agentic RL, coding, reasoning, and vision. Key innovations include Kimi Delta Attention, Attention Residuals, Stable LatentMoE (16 of 896 experts active per token), ~2.5× scaling efficiency over Kimi K2, and a public weight release.

kimi foundation-model llm multimodal vision +6