Introduced the two-stage recipe behind the GPT lineage: unsupervised generative pre-training on unlabeled text, then supervised fine-tuning per task. A single 12-layer Transformer decoder beat bespoke architectures on 9 of 12 NLP benchmarks.
A 1.5B-parameter model trained only to predict the next token on diverse web text does translation, summarization, and QA zero-shot, with no fine-tuning. It recast NLP tasks as conditional language modeling and sparked the staged-release misuse debate.
At 175 billion parameters, this autoregressive model becomes a strong few-shot learner: it handles translation, QA, and reasoning from a few prompt examples with no gradient updates, establishing in-context learning as an alternative to fine-tuning.
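To make the few-shot setup concrete, here is a minimal sketch of how such a prompt might be assembled. The `Input:`/`Output:` layout and the function name `build_few_shot_prompt` are assumptions for illustration, not a format taken from the paper. The point is that the "training examples" live entirely in the prompt text and no weights change.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt: task description, K solved examples, one open query.

    The layout is a hypothetical convention; any consistent format works.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final example is left unanswered; the model continues from here.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


demo_prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(demo_prompt)
```

The resulting string would be sent to the model as-is; changing the task means changing only the instruction and the examples.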
Showed that fine-tuning a GPT model on public GitHub code yields a capable program synthesizer, and introduced HumanEval — the docstring-to-function benchmark that still anchors code-generation evaluation. A production variant powers GitHub Copilot.
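A docstring-to-function check can be sketched roughly as follows. This is a simplified illustration, assuming each problem supplies a prompt (signature plus docstring), a test snippet defining `check(candidate)`, and an entry-point name. The toy `add` problem and the helper `passes_tests` are hypothetical, and a real harness would run candidates in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(prompt, completion, test_code, entry_point):
    """Return True if prompt + completion defines a function that passes the tests.

    WARNING: exec runs arbitrary code. This sketch omits the sandboxing and
    timeouts that any real evaluation of model-generated code needs.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)        # define the candidate function
        exec(test_code, namespace)                  # define check(candidate)
        namespace["check"](namespace[entry_point])  # raises on any failed assert
    except Exception:
        return False
    return True


# Hypothetical toy problem, not drawn from the benchmark itself.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)

good_result = passes_tests(PROMPT, "    return a + b\n", TESTS, "add")
bad_result = passes_tests(PROMPT, "    return a - b\n", TESTS, "add")
print(good_result, bad_result)
```

Scoring by executed tests rather than by text overlap is what lets two differently written but equally correct solutions both count as passes.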
Made reinforcement learning from human feedback (RLHF) the standard alignment recipe: collect demonstrations and preference rankings, train a reward model, then optimize the policy against it with PPO. Human raters preferred outputs from the 1.3B-parameter aligned model over those of the 175B GPT-3.
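The reward-model step can be illustrated with the pairwise preference loss commonly used for it: for a preferred response and a rejected one, minimize -log(sigmoid(r_chosen - r_rejected)). The sketch below works on plain floats and assumes the two reward scores have already been computed; the function name is hypothetical, and a real implementation would operate on batched tensors from a neural reward model.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the reward model scores the preferred answer higher.
tie_loss = pairwise_preference_loss(0.0, 0.0)
agree_loss = pairwise_preference_loss(2.0, 0.0)
disagree_loss = pairwise_preference_loss(0.0, 2.0)
print(tie_loss, agree_loss, disagree_loss)
```

Only the difference between the two scores matters, so the reward model learns a relative ranking rather than an absolute scale.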
A multimodal model that accepts image and text inputs and returns text, reaching human-level scores on many professional and academic exams, including a simulated bar exam where it scored around the top 10% of test takers. Aspects of its performance were forecast from models trained with at most 1/1,000th the compute, showing predictable scaling.
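The idea of forecasting from small runs can be sketched as a power-law fit in log-log space, extrapolated to the full compute budget. Everything below is an assumption for illustration: the data points are synthetic, the functional form loss = a * compute^b is a simplification (it omits any irreducible-loss term), and the names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math


def fit_power_law(computes, losses):
    """Least-squares fit of log(loss) = log(a) + b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b


# Synthetic small runs: compute is expressed as a fraction of the target run,
# and the largest small run uses 1/1000th of the target compute.
small_computes = [1e-6, 1e-5, 1e-4, 1e-3]
small_losses = [3.0 * c ** -0.05 for c in small_computes]

a, b = fit_power_law(small_computes, small_losses)
predicted = predict_loss(a, b, 1.0)  # extrapolate to the full-size run
print(a, b, predicted)
```

Because the synthetic points lie exactly on a power law, the extrapolation recovers the generating curve; real training runs are noisier and the fit is only as good as the assumed functional form.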