Before this 2018 paper, advancing NLP usually meant hand-designing a new model architecture for each task. The paper's quietly radical claim: one generically pre-trained Transformer, fine-tuned with almost no structural change, could beat most of them. That bet is the foundation every later GPT stands on.
Key Findings
- Generative pre-training transfers broadly. Pre-training a 12-layer Transformer decoder to predict the next token on BooksCorpus, then fine-tuning, raised the state of the art on 9 of 12 datasets spanning entailment, question answering, semantic similarity, and classification.
- Task-aware input transformations replace task-specific models. Structured inputs (premise/hypothesis pairs, document/question/answer triples) are linearized into token sequences, so the same network handles every task with only a linear output head and embeddings for the delimiter tokens added.
- Capabilities grow with pre-training alone. Even before fine-tuning, zero-shot task performance rose steadily as pre-training progressed — an early hint of what GPT-2 and GPT-3 would later scale.
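The task-aware input transformations described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the special-token strings `<start>`, `<delim>`, and `<extract>` and all function names are assumptions made for the example. The paper specifies only that a start token, a delimiter token, and an extract token wrap the linearized input.

```python
from typing import List

# Hypothetical special-token strings, chosen for this sketch only.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

Tokens = List[str]


def classification(text: Tokens) -> Tokens:
    """Single-text tasks: wrap the text in start and extract tokens."""
    return [START, *text, EXTRACT]


def entailment(premise: Tokens, hypothesis: Tokens) -> Tokens:
    """Premise and hypothesis joined by a delimiter into one sequence."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]


def similarity(a: Tokens, b: Tokens) -> List[Tokens]:
    """Sentence pairs have no inherent order, so both orderings are built.

    Each sequence is processed separately and the two final representations
    are combined before the output head.
    """
    return [entailment(a, b), entailment(b, a)]


def multiple_choice(document: Tokens, question: Tokens,
                    answers: List[Tokens]) -> List[Tokens]:
    """One sequence per candidate answer, scored independently and then
    normalized across candidates."""
    context = [*document, *question]
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]


if __name__ == "__main__":
    print(entailment("a man sleeps".split(), "a person rests".split()))
```

In every case the model itself is unchanged: the representation at the final (extract) position is what the linear output head reads.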
How It Works
The decoder-only Transformer is pre-trained with a plain left-to-right language-modeling objective, then fine-tuned with an auxiliary LM loss added to the supervised loss. The authors report that this improves generalization and speeds convergence, though their ablation finds the benefit mainly on the larger datasets. The deliberate choice of unidirectional context, unlike the bidirectional BERT released months later, is what keeps the model generative.
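To make the fine-tuning objective concrete, here is a small dependency-free sketch of how a supervised loss and a weighted auxiliary language-modeling loss might be combined. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the losses are written as negative log-likelihoods to be minimized (the paper states the equivalent objectives as log-likelihoods to be maximized), and plain Python lists stand in for model outputs. The default weight of 0.5 matches the value the paper reports for the auxiliary term.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def combined_loss(next_token_logits: Sequence[Sequence[float]],
                  next_tokens: Sequence[int],
                  class_logits: Sequence[float],
                  label: int,
                  lam: float = 0.5) -> float:
    """Supervised loss plus lam times the auxiliary LM loss.

    next_token_logits[i] holds the vocabulary logits used to predict
    next_tokens[i] from the tokens to its left (hypothetical layout).
    """
    lm = sum(cross_entropy(l, t)
             for l, t in zip(next_token_logits, next_tokens))
    clf = cross_entropy(class_logits, label)
    return clf + lam * lm


if __name__ == "__main__":
    # Toy numbers: a 4-word vocabulary, 3 next-token predictions, 2 classes.
    token_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(combined_loss(token_logits, [1, 2, 3], [0.0, 0.0], label=1))
```

Setting `lam` to 0 recovers plain supervised fine-tuning, which is the comparison the ablation in the paper is making.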
Who Should Read It
Great fit if you want the historical root of modern LLMs, or to understand why "pre-train then adapt" displaced bespoke architectures. Look elsewhere for current practice: the specific fine-tuning recipe here has largely been superseded by in-context learning and instruction tuning, and at 117M parameters the model is tiny by today's standards.
