InstructGPT: Training Language Models to Follow Instructions with Human Feedback

Made reinforcement learning from human feedback (RLHF) the standard alignment recipe: collect demonstrations and preference rankings, train a reward model, then optimize with PPO. A 1.3B aligned model was preferred over the 175B GPT-3 by human raters.


Introduction

The counterintuitive result that reframed alignment: a 1.3B-parameter model that humans prefer over the 175B-parameter GPT-3, a model more than 100x larger. The lesson the field took from this 2022 paper is that following human intent is a separate axis from scale. Its training recipe is also the direct ancestor of ChatGPT's training pipeline.

Key Findings

Alignment beats raw size. Outputs from the 1.3B InstructGPT were preferred to those from the 175B GPT-3 despite the enormous parameter gap, because the smaller model was tuned to do what users actually asked.
A three-stage recipe that stuck. Supervised fine-tuning on labeler demonstrations, then a reward model trained on human preference rankings, then PPO optimization against that reward — the template later reused, with variations, across the industry.
Truthfulness and toxicity improve with small regressions. Aligned models hallucinate less and emit less toxic content, at a modest "alignment tax" on some public NLP benchmarks that the authors mitigate by mixing in pre-training gradients.
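
For readers who want the third stage and the "alignment tax" mitigation in symbols, the RL objective can be written roughly as follows. The summary above does not state it, so the notation here is a reconstruction to check against the paper: r_θ is the learned reward model, π^SFT is the supervised fine-tuned policy, π_φ^RL is the policy being optimized, β scales a per-sample KL-style penalty toward the SFT policy, and γ weights the pre-training term.

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x,y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

Setting γ = 0 gives plain PPO against the reward model. A positive γ mixes in pre-training log-likelihood gradients, which is the mitigation for benchmark regressions mentioned in the last finding.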

Why It Matters

This is the methodological bridge from GPT-3 to ChatGPT. RLHF as described here became the default post-training step for instruction-following assistants, and the reward-model framing seeded the entire preference-optimization literature (DPO and successors react to it).
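
The reward-model framing mentioned above can be illustrated with a pairwise ranking loss: for every pair of responses to the same prompt, the model is penalized when the response the labeler preferred does not receive the higher scalar reward. The sketch below is a minimal pure-Python illustration, not the authors' implementation. The function names and the "best to worst" list convention are assumptions made for this example, and the rewards are plain floats standing in for a neural reward model's outputs.

```python
import math
from itertools import combinations


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_ranking_loss(rewards_best_to_worst: list[float]) -> float:
    """Average -log sigmoid(r_preferred - r_rejected) over all pairs.

    Assumption for this sketch: the list holds scalar rewards for K responses
    to ONE prompt, ordered from the labeler's most preferred to least preferred.
    """
    pairs = list(combinations(range(len(rewards_best_to_worst)), 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    total = 0.0
    for preferred, rejected in pairs:  # preferred index < rejected index
        margin = rewards_best_to_worst[preferred] - rewards_best_to_worst[rejected]
        total += -log_sigmoid(margin)
    return total / len(pairs)


if __name__ == "__main__":
    # Rewards that agree with the human ranking give a small loss ...
    print(pairwise_ranking_loss([2.0, 1.0, -1.0]))
    # ... and rewards that contradict it give a large one.
    print(pairwise_ranking_loss([-1.0, 1.0, 2.0]))
```

Averaging over all pairs from one ranking treats that ranking as a single training example, which is one reasonable way to keep heavily ranked prompts from dominating. In a real system the rewards would come from a model and the loss would be minimized by gradient descent.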

Who Should Read It

Great fit if you work on post-training or alignment, or want to understand why chat assistants behave as they do. Look elsewhere if you need the current frontier: preference methods like DPO replace the separate reward-model and PPO stages with a single supervised-style loss. The problem framing and human-data methodology here remain the reference point.


Information

Website: arxiv.org
Organizations: OpenAI
Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin
Published date: 2022/03/04

More Items

Reinforcement Learning Papers (2026)

CoRT: Counterfactual Replay for Token-Level Rubric-Guided Policy Optimization

Bo-Wen Zhang, Junwei He +6

Allocates token-level credit in rubric-conditioned GRPO by counterfactually replaying the same response under rubric and criteria-free prompts, using tokenwise log-likelihood contrasts to compute bounded, response-normalized weights that redistribute GRPO advantages without training an auxiliary scorer.

RL LLM NLP paper evaluation

Large Language Model Papers (2026)

DecoEvo: Score-Decoupled Co-Evolution of Solver and Rubric-Generator Skills in Text Space

Jiangwang Chen, Zixin Song +11; Tsinghua University, Qwen Business Unit of Alibaba +2

Co-evolves a solver skill and a rubric-generator skill for text-space LLM optimization under decoupled objectives to avoid rubric gaming without using gold rubrics. Solver updates use criterion-level feedback; generator updates use independent audits of requirement coverage and response discrimination.

LLM evaluation agent-skills qwen paper +2

AI Video Papers (2026)

Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

Senqiao Yang, Kaichen Zhang +21

Real-time streaming multimodal foundation model that uses a codec-native tokenizer (Mage-ViT) to encode motion- and residual-rich regions from video I/P frames, reducing visual token usage by over 75% and enabling up to ~3.5× wall-clock inference speedup after training on ~560M images and 100M video frames.

multimodal video vision foundation-model ai +5