DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Reworks the MoE layer to push each expert toward a narrow specialty: split experts into many finer ones and activate more per token, plus reserve a few always-on shared experts for common knowledge. A 2B model matches GShard 2.9B; at 16B it rivals LLaMA2 7B on ~40% of the compute.


Introduction

Standard top-K MoE routing has a hidden cost: with only a handful of coarse experts, each one must absorb a grab-bag of unrelated knowledge, and the same common patterns get relearned across many experts. DeepSeekMoE targets both problems at once, and the routing recipe it arrives at later became the MoE backbone of DeepSeek-V2 and V3.

Key Findings

Two cheap structural changes carry most of the gains: splitting each of the N experts into m smaller ones (mN in total, with mK active per token, so expert compute per token stays the same) for far more combinatorial routing flexibility, and setting aside K_s experts as always-active shared ones so common knowledge lives in one place instead of being duplicated across routed experts.
The efficiency story is concrete. At 2B parameters, DeepSeekMoE matches GShard 2.9B, which has 1.5x the expert parameters and compute, and it comes close to a dense model with the same total parameter count.
It scales. At 16B it is comparable to LLaMA2 7B while using roughly 40% of the compute; preliminary 145B runs approach DeepSeek 67B at about 28.5% of the compute.
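
To make the "combinatorial routing flexibility" in the first finding concrete, the sketch below counts how many distinct expert subsets a router can choose before and after segmentation. The values N = 16, K = 2, and m = 4 are illustrative assumptions rather than figures from this summary, and `routing_combinations` is a hypothetical helper name.

```python
from math import comb


def routing_combinations(num_experts: int, num_active: int) -> int:
    """Number of distinct expert subsets a top-k router can select."""
    return comb(num_experts, num_active)


# Illustrative values (assumptions): 16 coarse experts, top-2 routing,
# each expert split into 4 finer ones.
N, K, m = 16, 2, 4

coarse = routing_combinations(N, K)        # 16 choose 2
fine = routing_combinations(m * N, m * K)  # 64 choose 8

print(coarse)  # 120
print(fine)    # 4426165368
```

Because each fine-grained expert is 1/m the size of an original one, activating mK of them costs about the same as activating K coarse experts, so the jump from 120 to roughly 4.4 billion possible mixes comes at no extra expert compute.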

Methodology

The core move is decoupling specialization from capacity. Fine-grained segmentation makes the number of distinct routing paths grow combinatorially, so the gating network can assemble sharper, more targeted expert mixes per token instead of leaning on a few overloaded generalists. Shared-expert isolation then pulls the knowledge every token needs out of the routed pool, freeing the specialized experts to actually specialize rather than rehearsing the basics.
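
The two mechanisms described above can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`route_token`, `moe_layer`), the toy experts, the softmax taken over routed experts only, and the use of unnormalized softmax scores as gate weights are all choices made here for clarity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(affinities, num_shared, num_active_total):
    """Select experts for one token.

    affinities: raw token-to-expert scores for the routed experts only.
    num_shared: K_s, experts that are always on and bypass the router.
    num_active_total: mK, total experts run per token (shared + routed).
    Returns (shared_ids, {routed_expert_id: gate_weight}).
    """
    num_routed_active = num_active_total - num_shared
    probs = softmax(affinities)
    # Top-k over routed experts only; shared experts never compete here.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:num_routed_active]
    shared_ids = list(range(num_shared))
    # Routed experts are numbered after the shared ones.
    gates = {num_shared + i: probs[i] for i in chosen}
    return shared_ids, gates


def moe_layer(x, experts, shared_ids, gates):
    """Residual input + ungated shared experts + gate-weighted routed experts."""
    out = list(x)
    for i in shared_ids:
        out = [o + y for o, y in zip(out, experts[i](x))]
    for i, g in gates.items():
        out = [o + g * y for o, y in zip(out, experts[i](x))]
    return out


# Toy setup (assumption): 8 experts, where expert c just scales its input by c.
experts = [lambda v, c=c: [c * e for e in v] for c in range(8)]
affinities = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0]  # scores for the 6 routed experts

shared_ids, gates = route_token(affinities, num_shared=2, num_active_total=4)
output = moe_layer([1.0, 2.0], experts, shared_ids, gates)

print(shared_ids)     # [0, 1]
print(sorted(gates))  # [3, 6]
```

Here two of the four active slots go to the shared experts, so only two routed experts are chosen per token; the total number of experts run stays at mK either way.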

Who It's For

Great fit if you build or study sparse LLMs and want a principled, reproducible account of why fine-grained plus shared experts beats vanilla top-K routing, with ablations and scaling results from 2B up to 16B and 145B. Look elsewhere if you need a ready-to-serve chat model or production inference tooling. This is an architecture paper, and the shared-expert design assumes you control the training stack rather than just fine-tuning an existing checkpoint.


Information

Website: arxiv.org
Organizations: DeepSeek-AI
Published date: 2024/01/11
