Standard top-K MoE routing has a hidden tax: with only a handful of coarse experts, each one is forced to absorb a grab-bag of unrelated knowledge, and the same common patterns get relearned across many experts. DeepSeekMoE attacks both problems at once, and the routing recipe it lands on later became the MoE backbone of DeepSeek-V2 and V3.
Key Findings
- Two cheap structural changes carry most of the gains: slicing each of the N experts into m finer units (mN experts in total, with mK activated per token) for far more combinatorial routing flexibility at the same parameter count and compute, and isolating K_s experts as always-active shared ones so redundant common knowledge lives in one place.
- The efficiency story is concrete, not hand-wavy. At 2B params DeepSeekMoE matches GShard 2.9B, which uses 1.5x its expert parameters and compute, and nearly reaches a dense model with the same total params.
- It scales. At 16B it performs comparably to LLaMA2 7B while using roughly 40% of the compute; preliminary 145B runs approach DeepSeek 67B at a fraction of the cost.
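To make the "combinatorial routing flexibility" in the first finding concrete, here is a minimal sketch that counts how many distinct expert subsets a router can pick before and after segmentation. The function name `routing_combinations` and the values N = 16, K = 2, m = 4 are illustrative assumptions, not a configuration stated in this summary.

```python
from math import comb


def routing_combinations(n_experts: int, top_k: int, m: int = 1) -> int:
    """Count the expert subsets available per token.

    Splitting each of n_experts into m finer units gives m * n_experts
    experts, of which m * top_k are activated.
    """
    return comb(m * n_experts, m * top_k)


# Coarse routing: choose 2 of 16 experts.
coarse = routing_combinations(16, 2)
# Fine-grained routing: each expert split in 4, so choose 8 of 64.
fine = routing_combinations(16, 2, m=4)

print(coarse)  # 120
print(fine)    # 4426165368
```

Because each fine-grained expert is 1/m the size of an original one, the active fraction mK / mN = K / N is unchanged: the extra routing choices come without extra active parameters.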
Methodology
The core move is decoupling specialization from capacity. Fine-grained segmentation multiplies the number of distinct expert combinations a token can be routed to, so the gating network can assemble sharper, more targeted expert mixes per token instead of leaning on a few overloaded generalists. Shared-expert isolation then pulls the knowledge every token needs out of the routed pool, freeing the specialized experts to actually specialize rather than rehearsing the basics.
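The two mechanisms described above can be sketched as a single forward pass for one token. This is a toy illustration under stated assumptions, not the paper's implementation: experts are stand-in scaling functions rather than FFNs, gating is a softmax over dot products with per-expert centroid vectors, and the names `moe_forward`, `scale`, and `centroids` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scale(factor: float) -> Expert:
    """Toy stand-in for an expert FFN (assumption: real experts are MLPs)."""
    return lambda v: [factor * x for x in v]


def moe_forward(
    u: Vector,
    shared: Sequence[Expert],
    routed: Sequence[Expert],
    centroids: Sequence[Vector],
    top_k: int,
) -> Tuple[Vector, List[int]]:
    # Token-to-expert affinity for the routed pool only.
    logits = [sum(a * b for a, b in zip(u, c)) for c in centroids]
    scores = softmax(logits)
    # Keep the top_k routed experts; every other routed expert is skipped.
    chosen = sorted(range(len(routed)), key=lambda i: scores[i], reverse=True)[:top_k]

    out = list(u)  # residual connection
    # Shared experts are always active and are not gated.
    for expert in shared:
        out = [o + y for o, y in zip(out, expert(u))]
    # Selected routed experts are weighted by their affinity score.
    for i in chosen:
        out = [o + scores[i] * y for o, y in zip(out, routed[i](u))]
    return out, sorted(chosen)


u = [1.0, 0.0]
shared = [scale(0.5)]
routed = [scale(1.0), scale(2.0), scale(3.0), scale(4.0)]
centroids = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
out, chosen = moe_forward(u, shared, routed, centroids, top_k=2)
print(chosen)  # [1, 2]
```

The shared expert contributes to every token regardless of the router's decision, while only the two highest-affinity routed experts run. That separation is what lets the routed pool avoid duplicating common knowledge.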
Who It's For
Great fit if you build or study sparse LLMs and want a principled, reproducible account of why fine-grained plus shared experts beats vanilla top-K routing, with ablations and scaling results spanning the 2B, 16B, and 145B scales. Look elsewhere if you need a ready-to-serve chat model or production inference tooling. This is an architecture paper, and the shared-expert design assumes you control the training stack rather than just fine-tuning an existing checkpoint.
