GPT-1: Improving Language Understanding by Generative Pre-Training

Introduced the two-stage recipe behind the GPT lineage: unsupervised generative pre-training on unlabeled text, then supervised fine-tuning per task. A single 12-layer Transformer decoder beat bespoke architectures on 9 of 12 NLP benchmarks.


Introduction

Before this 2018 paper, advancing NLP usually meant hand-designing a new model architecture for each task. Its quietly radical claim: one generically pre-trained Transformer, fine-tuned with almost no structural change, could outperform most of them. That bet is the foundation every later GPT stands on.

Key Findings

Generative pre-training transfers broadly. Pre-training a 12-layer Transformer decoder to predict the next token on BooksCorpus, then fine-tuning, raised the state of the art on 9 of 12 datasets spanning entailment, question answering, semantic similarity, and classification.
Task-aware input transformations replace task-specific models. Structured inputs (premise/hypothesis pairs, document/question/answer triples) are linearized into single token sequences marked with start, delimiter, and end tokens, so the same network handles every task with only a linear output head added on top.
Capabilities grow with pre-training alone. Even before fine-tuning, zero-shot task performance rose steadily as pre-training progressed — an early hint of what GPT-2 and GPT-3 would later scale.
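The input transformations described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the exact special-token strings are assumptions made here, and inputs are taken to be already-tokenized lists of strings.

```python
# Hypothetical special-token strings; the paper describes start, delimiter,
# and end tokens, but these exact spellings are an assumption for illustration.
START, DELIM, END = "<s>", "$", "<e>"


def linearize_entailment(premise, hypothesis):
    """Premise and hypothesis joined into one sequence around a delimiter."""
    return [START, *premise, DELIM, *hypothesis, END]


def linearize_similarity(text_a, text_b):
    """Similarity has no natural ordering, so both orderings are produced;
    the model processes each and the two representations are combined."""
    return [
        [START, *text_a, DELIM, *text_b, END],
        [START, *text_b, DELIM, *text_a, END],
    ]


def linearize_multiple_choice(document, question, answers):
    """One sequence per candidate answer; each is scored independently
    and the scores are normalized across candidates."""
    return [
        [START, *document, *question, DELIM, *answer, END]
        for answer in answers
    ]


if __name__ == "__main__":
    print(linearize_entailment(["it", "rains"], ["it", "is", "wet"]))
```

Every task ends up as one or more flat token sequences, which is why the pre-trained network itself needs no structural change.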

How It Works

The decoder-only Transformer is trained with a plain left-to-right language-modeling objective, then fine-tuned with an auxiliary LM loss added to the supervised loss, which the authors show improves generalization and speeds convergence. The deliberate choice of unidirectional context, in contrast to the bidirectional BERT that followed months later, is what keeps the model generative.
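The combined fine-tuning objective can be illustrated with a small pure-Python sketch. It assumes the supervised loss and the auxiliary LM loss are simply added with a weight `lam` (the paper reports a weight of 0.5); the function names are hypothetical, and averaging the LM loss over positions is a normalization choice made here rather than something taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def lm_loss(token_logits, token_ids):
    """Left-to-right LM loss: the logits at position i predict token i + 1.
    Averaged over positions (an assumption of this sketch)."""
    nll = 0.0
    for logits, target in zip(token_logits[:-1], token_ids[1:]):
        nll -= log_softmax(logits)[target]
    return nll / (len(token_ids) - 1)


def task_loss(class_logits, label):
    """Supervised cross-entropy from the linear output head."""
    return -log_softmax(class_logits)[label]


def fine_tune_loss(token_logits, token_ids, class_logits, label, lam=0.5):
    """Supervised loss plus the weighted auxiliary LM loss."""
    return task_loss(class_logits, label) + lam * lm_loss(token_logits, token_ids)


if __name__ == "__main__":
    # Toy example: vocabulary of 4 tokens, 2 classes, uniform predictions.
    logits = [[0.0] * 4 for _ in range(3)]
    print(fine_tune_loss(logits, [0, 1, 2], [0.0, 0.0], label=1))
```

Setting `lam=0` recovers plain supervised fine-tuning, which makes it easy to compare the two regimes.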

Who Should Read It

Great fit if you want the historical root of modern LLMs, or to understand why "pre-train then adapt" displaced bespoke architectures. Look elsewhere for current practice: the specific fine-tuning recipe here is superseded by in-context learning and instruction tuning, and at 117M parameters the model is tiny by today's standards.


Information

Website: cdn.openai.com
Organizations: OpenAI
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
Published date: 2018/06/11
