The headline isn't that a language model can write code; it's how the paper measures it. By releasing HumanEval, a set of 164 hand-written programming problems graded by actually running unit tests rather than matching text, this work reset how the field judges code models, and that benchmark outlived the model itself.
Key Findings
- Functional correctness, not text overlap. On HumanEval, the 12B-parameter Codex solves 28.8% of problems with a single sample (pass@1) while GPT-3 solves 0%. The gap reflects Codex's fine-tuning on public code from GitHub: pre-training on natural language alone doesn't teach a model to write executable code.
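- Sampling is a lever. Drawing 100 samples per problem lifts the solve rate to 70.2%, where a problem counts as solved if at least one sample passes its unit tests. That is an oracle number: it assumes the tests pick the winner, not a ranking heuristic. Repeated sampling plus a reliable way to select among samples turns a mediocre single-shot model into a strong one, a pattern that recurs across later reasoning work.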
- Honest about failure modes. The paper documents misaligned outputs, sample inefficiency, and the safety and economic implications of code generation — unusually candid for a capabilities release.
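To make the first two findings concrete, here is a minimal sketch of execution-based grading feeding a pass@k estimate. It assumes the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that pass. The toy `add` problem, the candidate strings, and the helper names `passes_tests` and `pass_at_k` are hypothetical illustrations, not the paper's harness. Unlike a real harness, this sketch has no sandbox or timeout, so it should never be pointed at untrusted model output.

```python
from math import comb


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and every test assertion holds.

    Illustration only: exec() here is unsandboxed and has no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical toy problem: four "sampled" completions for add(a, b).
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # wrong operator
    "def add(a, b):\n    return a * b\n",  # wrong operator
    "def add(a, b)\n    return a + b\n",   # syntax error (missing colon)
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(src, tests) for src in candidates]
n, c = len(results), sum(results)

for k in (1, 2, 4):
    print(f"pass@{k} = {pass_at_k(n, c, k):.2f}")
```

With one passing sample out of four, the estimate climbs from 0.25 at k=1 to 1.00 at k=4, which is the same effect as the jump from 28.8% to 70.2% in miniature. Note that text-overlap metrics would score the second and third candidates as near-perfect matches; only running the tests separates them from the correct one.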
Why It Matters
Codex is the bridge between research LLMs and a product millions use: a distinct production version powers GitHub Copilot. It also made "evaluate by execution" the default for code, shaping successors like MBPP, MultiPL-E, and SWE-bench.
Who Should Read It
Great fit if you build or evaluate coding assistants and want the origin of pass@k and execution-based grading. Look elsewhere if you want a current model (the original Codex models are deprecated and modern code models are far stronger), but the evaluation methodology here is still load-bearing.
