DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

A family of open code models (1.3B-33B) trained from scratch on 2T tokens of project-level code, using a 16K-window fill-in-the-blank objective. Beats Codex and GPT-3.5 on code benchmarks and ships under a license permitting commercial use.

Introduction

In early 2024, the strongest code LLMs were closed (Codex, GPT-3.5), and open models trailed them by a wide margin. DeepSeek-Coder's bet is that data organization matters as much as scale: instead of treating code as isolated files, it builds a corpus at the repository level, preserving cross-file dependencies that single-file training silently discards.

Key Findings
  • Project-level corpus construction lets the model reason across files in a repository, not just within a single snippet. Most code benchmarks fail to measure this ability, yet real engineering depends on it.
  • A fill-in-the-blank objective, implemented as fill-in-the-middle (FIM) training with a 16K context window, targets infilling and completion directly, which matters more for IDE-style use than left-to-right generation alone.
  • Trained from scratch on 2 trillion tokens, the 33B model surpasses open peers and closed models like Codex and GPT-3.5 across multiple benchmarks — evidence that a focused code-first pretraining recipe closes the open/closed gap.
  • The permissive license allows unrestricted commercial use, removing the usual barrier for teams that cannot adopt research-only weights.
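
To make the fill-in-the-middle idea concrete, here is a minimal sketch of how one training string could be rearranged so that the model sees the prefix and suffix before predicting the missing middle. The sentinel strings, the `fim_rate` default, and the character-level split are illustrative assumptions, not details taken from this summary; a real pipeline would use the tokenizer's own special tokens.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
PREFIX_TOK, SUFFIX_TOK, MIDDLE_TOK = "<PRE>", "<SUF>", "<MID>"


def make_fim_sample(text, rng, fim_rate=0.5):
    """Return either the plain text or a prefix-suffix-middle rearrangement.

    fim_rate is the assumed fraction of samples that get rearranged.
    """
    if len(text) < 2 or rng.random() >= fim_rate:
        return text  # ordinary left-to-right (next-token) sample
    # Pick two cut points to split the document into prefix / middle / suffix.
    a, b = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    # The model reads prefix and suffix first, then learns to emit the middle.
    return f"{PREFIX_TOK}{prefix}{SUFFIX_TOK}{suffix}{MIDDLE_TOK}{middle}"


def split_fim_sample(sample):
    """Undo the rearrangement; used here only to check the round trip."""
    body = sample[len(PREFIX_TOK):]
    prefix, rest = body.split(SUFFIX_TOK, 1)
    suffix, middle = rest.split(MIDDLE_TOK, 1)
    return prefix, middle, suffix
```

At inference time, an editor integration would send the text before and after the cursor in the same layout and let the model generate the middle.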

Methodology

The approach combines repository-level data assembly (so the model sees realistic dependency structure) with a next-token plus FIM training mix, then scales the same recipe across the 1.3B-to-33B range. This makes the family a study in how far careful corpus design and objective choice can carry a code model, rather than relying on parameter count alone.
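
One way to picture repository-level data assembly is to order a project's files so that each file appears after the files it imports, then concatenate them into a single training sample. The sketch below assumes Python sources, a simple regex for import detection, and a name-order fallback for import cycles; the function names and the `# file:` header are hypothetical and not taken from the paper.

```python
import re
from collections import defaultdict

# Captures the top-level module name from "import x" or "from x import y".
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)


def order_files(files):
    """files maps module name -> source text. Returns names, dependencies first."""
    # Keep only imports that point at other files in the same repository.
    deps = {
        name: {m for m in IMPORT_RE.findall(src) if m in files and m != name}
        for name, src in files.items()
    }
    users = defaultdict(set)  # module -> files that import it
    for name, ds in deps.items():
        for d in ds:
            users[d].add(name)
    indegree = {name: len(ds) for name, ds in deps.items()}
    ready = sorted(n for n, k in indegree.items() if k == 0)
    ordered = []
    while ready:  # Kahn's algorithm for topological ordering
        name = ready.pop(0)
        ordered.append(name)
        for user in sorted(users[name]):
            indegree[user] -= 1
            if indegree[user] == 0:
                ready.append(user)
    # Files in import cycles never reach in-degree 0; append them in name order.
    ordered += sorted(n for n in files if n not in ordered)
    return ordered


def build_sample(files):
    """Concatenate files in dependency order, each under a path comment."""
    return "\n".join(f"# file: {n}.py\n{files[n]}" for n in order_files(files))
```

For a toy repository where `app` imports `models` and `utils`, and `models` imports `utils`, `order_files` returns `utils`, `models`, `app`, so the definitions a file relies on sit earlier in the context window.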

Who It's For

Great fit if you are evaluating self-hostable code models for completion or infilling, want commercially usable weights, or care about cross-file reasoning over toy single-function benchmarks. Look elsewhere if you need a general-purpose chat assistant — these are code-specialized base and instruct models, and later DeepSeek releases (V2/V3) supersede them on raw capability.

Information

  • Website: arxiv.org
  • Organizations: DeepSeek-AI
  • Published date: 2024/01/25