InstructGPT: Training Language Models to Follow Instructions with Human Feedback

Made reinforcement learning from human feedback (RLHF) the standard alignment recipe: collect demonstrations and preference rankings, train a reward model, then optimize with PPO. A 1.3B aligned model was preferred over the 175B GPT-3 by human raters.

Introduction

The counterintuitive result that reframed alignment: human raters preferred outputs from a 1.3B-parameter model over those from GPT-3, a model more than 100x larger. The lesson the field took from this 2022 paper is that following human intent is a separate axis from scale. Its training recipe is also the direct ancestor of ChatGPT's training pipeline.

Key Findings
  • Alignment beats raw size. Outputs from the 1.3B InstructGPT were preferred to those from the 175B GPT-3 despite the enormous parameter gap, because the smaller model was tuned to do what users actually asked.
  • A three-stage recipe that stuck. Supervised fine-tuning on labeler demonstrations, then a reward model trained on human preference rankings, then PPO optimization against that reward — the template later reused, with variations, across the industry.
  • Truthfulness improves and toxicity drops, with small regressions. Aligned models make up facts less often and emit less toxic content, at the cost of a modest "alignment tax" on some public NLP benchmarks, which the authors mitigate by mixing pre-training gradients into the PPO updates.
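
As a rough illustration of stages two and three of the recipe above, the sketch below shows a pairwise ranking loss for a reward model, a reward-model score with a KL-style penalty toward the supervised policy, and a combined objective that mixes in a pre-training log-likelihood term. It is a minimal sketch in plain Python on scalar values, not the paper's implementation: the function names, the `beta` and `gamma` defaults, and the choice to average over all pairs in a ranking are assumptions made for illustration.

```python
import math
from itertools import combinations


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the preferred
    response well above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_loss(scores_best_to_worst: list[float]) -> float:
    """Average the pairwise loss over every pair in one labeler ranking.

    `scores_best_to_worst` holds reward-model scores for responses to a
    single prompt, ordered from most to least preferred (assumed input
    layout). combinations() keeps that order, so the first element of
    each pair is the preferred response.
    """
    pairs = list(combinations(scores_best_to_worst, 2))
    return sum(pairwise_preference_loss(w, l) for w, l in pairs) / len(pairs)


def kl_penalized_reward(rm_score: float, logp_policy: float,
                        logp_sft: float, beta: float = 0.02) -> float:
    """Reward-model score minus a penalty for drifting from the SFT policy.

    `beta` is an illustrative default, not a recommended setting.
    """
    return rm_score - beta * (logp_policy - logp_sft)


def mixed_objective(shaped_rewards: list[float],
                    pretrain_logprobs: list[float],
                    gamma: float = 1.0) -> float:
    """Mean shaped reward plus gamma times mean pre-training log-likelihood.

    The second term stands in for "mixing in pre-training gradients";
    `gamma` is a hypothetical weight.
    """
    rl_term = sum(shaped_rewards) / len(shaped_rewards)
    ptx_term = sum(pretrain_logprobs) / len(pretrain_logprobs)
    return rl_term + gamma * ptx_term
```

A ranking that the reward model already agrees with, such as `ranking_loss([2.0, 1.0, 0.0])`, yields a lower loss than the reversed ranking. In a real system these scalars would be tensors produced by neural networks, and the combined objective would be maximized with PPO rather than evaluated directly.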

Why It Matters

This is the methodological bridge from GPT-3 to ChatGPT. RLHF as described here became the default post-training step for instruction-following assistants, and its reward-model framing helped seed much of the later preference-optimization literature (DPO and its successors are responses to it).

Who Should Read It

Great fit if you work on post-training or alignment, or want to understand why chat assistants behave as they do. Look elsewhere if you need the current frontier (preference methods like DPO simplify the PPO stage), but the problem framing and human-data methodology here remain the reference point.

Information

  • Website: arxiv.org
  • Organizations: OpenAI
  • Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al.
  • Published date: 2022/03/04