The counterintuitive result that reframed alignment: human labelers preferred outputs from a 1.3B-parameter model over those from the 175B GPT-3, a model more than 100x larger. The lesson the field took from this 2022 paper is that following human intent is a separate axis from scale. Its training recipe is also the direct ancestor of ChatGPT's training pipeline.
Key Findings
- Alignment beats raw size. Outputs from the 1.3B InstructGPT were preferred to those from the 175B GPT-3 despite the enormous parameter gap, because the smaller model was tuned to do what users actually asked.
- A three-stage recipe that stuck. Supervised fine-tuning on labeler demonstrations, then a reward model trained on human preference rankings, then PPO optimization against that reward — the template later reused, with variations, across the industry.
- Truthfulness and toxicity improve, at the cost of small regressions. Aligned models hallucinate less and emit less toxic content, while paying a modest "alignment tax" on some public NLP benchmarks, which the authors mitigate by mixing pre-training gradients into the PPO updates.
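To make the second and third stages of the recipe concrete, here is a minimal sketch in plain Python. It assumes the reward model has already produced one scalar score per response, and that the responses to a prompt are listed from most to least preferred by the labeler. The function names (`pairwise_ranking_loss`, `kl_penalized_reward`) and the `beta` value used in the example are illustrative assumptions, not identifiers or settings from the paper. The first function is the pairwise ranking loss, which pushes the score of the preferred response above the score of the less preferred one. The second is the per-sample reward the PPO stage optimizes: the learned reward minus a penalty for drifting away from the supervised (SFT) model.

```python
import math
from itertools import combinations


def _neg_log_sigmoid(d):
    # Numerically stable -log(sigmoid(d)), i.e. softplus(-d).
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def pairwise_ranking_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(r_better - r_worse)) over every pair of responses.

    scores_best_to_worst: reward-model scores for the responses to ONE prompt,
    ordered by the labeler's ranking (most preferred first). Hypothetical input
    format chosen for this sketch.
    """
    # combinations() keeps input order, so the first element of each pair
    # is always the more preferred response.
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    return sum(_neg_log_sigmoid(better - worse) for better, worse in pairs) / len(pairs)


def kl_penalized_reward(reward, logp_policy, logp_sft, beta):
    """Reward signal for the RL stage: learned reward minus a KL-style penalty.

    logp_policy / logp_sft: log-probability of the sampled response under the
    current policy and under the frozen SFT model. beta is the penalty weight.
    """
    return reward - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that contradict it.
    print(pairwise_ranking_loss([2.0, 1.0, -1.0]))
    print(pairwise_ranking_loss([-1.0, 1.0, 2.0]))
    # beta=0.1 is an arbitrary illustrative value.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_sft=-2.0, beta=0.1))
```

Averaging over all pairs from a single ranking, rather than treating each pair as an independent example, keeps comparisons for the same prompt together in one update. The pre-training gradient mix mentioned above would enter as an extra log-likelihood term in the RL objective and is left out of this sketch.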
Why It Matters
This is the methodological bridge from GPT-3 to ChatGPT. RLHF as described here became the default post-training step for instruction-following assistants, and the reward-model framing seeded the entire preference-optimization literature (DPO and successors react to it).
Who Should Read It
Great fit if you work on post-training or alignment, or want to understand why chat assistants behave as they do. Look elsewhere if you need the current frontier: later preference methods like DPO simplify the pipeline by dropping the separate reward model and the PPO stage. The problem framing and human-data methodology here remain the reference point.
