Codex: Evaluating Large Language Models Trained on Code

Showed that fine-tuning a GPT model on public GitHub code yields a capable program synthesizer, and introduced HumanEval — the docstring-to-function benchmark that still anchors code-generation evaluation. A production variant powers GitHub Copilot.


Introduction

The headline isn't that a language model can write code; it's how the paper measures it. By releasing HumanEval, a set of 164 hand-written programming problems graded by running unit tests rather than by matching text, this work reset how the field judges code models, and the benchmark outlived the model itself.
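As a rough illustration of what "graded by running unit tests" means, here is a minimal sketch that executes a candidate solution together with its tests in a separate Python process and treats a clean exit as a pass. The function name `passes_tests`, the timeout value, and the file layout are assumptions made for this example; this is not the paper's harness, which runs untrusted model output inside a proper sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if candidate + tests run to completion with exit code 0.

    Hypothetical helper for illustration only. A child process gives a
    timeout and some isolation, but it is NOT a security sandbox.
    """
    program = candidate_src + "\n\n" + test_src + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(program)
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions count as failures.
            return False
    # A failed assert raises AssertionError, which yields a non-zero exit code.
    return result.returncode == 0
```

A candidate that differs textually from a reference solution still passes here as long as the assertions hold, which is the property that text-overlap metrics cannot capture.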

Key Findings

Functional correctness, not text overlap. On HumanEval, Codex solves 28.8% of problems with a single sample (pass@1) while GPT-3 solves 0%. The gap suggests that pre-training on mostly natural-language text does not, by itself, yield a model that writes working code.
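Sampling is a lever. With 100 samples per problem, at least one sample passes the unit tests for 70.2% of problems (pass@100). When only one answer can be returned, picking the sample with the highest mean log-probability solves 44.5%. Repeated sampling turns a mediocre single-shot model into a strong one, a pattern that recurs across later reasoning work.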
Honest about failure modes. The paper documents misaligned outputs, sample inefficiency, and the safety and economic implications of code generation — unusually candid for a capabilities release.
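The pass@k numbers above come from an estimator the paper defines: generate n samples per problem, count the c samples that pass the tests, and compute 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that formula with exact integer binomials from `math.comb`; the paper's own reference code uses an equivalent product form in NumPy. The function names and the `correct_counts` input format are assumptions made for this example.

```python
from math import comb
from statistics import mean
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    This is the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimates (hypothetical helper)."""
    return mean([pass_at_k(n, c, k) for c in correct_counts])
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass. For larger k it avoids the bias of the naive shortcut 1 - (1 - c/n)^k, which is why n is usually chosen larger than k.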

Why It Matters

Codex is the bridge between research LLMs and a product millions use: a distinct production version powers GitHub Copilot. It also made "evaluate by execution" the default for code, shaping successors like MBPP, MultiPL-E, and SWE-bench.

Who Should Read It

Great fit if you build or evaluate coding assistants and want the origin of pass@k and execution-based grading. Look elsewhere if you want a current model: the original Codex models are deprecated and modern code models are far stronger. The evaluation methodology here is still load-bearing.


Information

Website: arxiv.org
Organizations: OpenAI
Authors: Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan
Published date: 2021/07/07
