Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Proposes ZPPO, a distillation method that keeps the teacher inside prompts rather than injecting teacher gradients, using binary- and negative-candidate prompts plus a prompt replay buffer to recover learning signal on hard examples; shows gains for small Qwen3.5 students across 31 multimodal benchmarks.


Introduction

Why this matters

ZPPO addresses a common tradeoff in model compression: direct logit imitation from a large teacher can over-constrain small students, while pure on-policy RL discards hard examples where all rollouts fail. The paper’s core insight is to keep teacher information in the prompt (so the student still learns from the teacher’s outputs) without contaminating the policy gradient, thereby preserving on-policy exploration while providing dense, actionable supervision on failure cases.
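
The "all rollouts fail" problem can be made concrete with a small sketch. It assumes group-normalized advantages in the style of GRPO (reward minus the group mean, divided by the group standard deviation); the function name and the `eps` value are illustrative and are not taken from the paper. Under that assumption, a group in which every rollout receives the same reward yields all-zero advantages, so the example contributes no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Illustrative helper only; the exact objective used by ZPPO is not
    reproduced here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with both successes and failures gives signed advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A hard example where every rollout fails: every advantage is zero,
# so a purely on-policy update has nothing to learn from it.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

This is the gap that prompt-side teacher information is meant to fill: if the reformulated prompt lets at least some rollouts succeed, the group no longer collapses to zero advantage.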

Key Findings

Prompt-based teacher supervision: Reformulating hard examples into Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) surfaces correct answers and common failure modes inside the student’s prompt context, enabling token-level discrimination without altering the RL gradient.
Prompt replay buffer: Recirculating hard questions until they "graduate" or are FIFO-evicted focuses training on the student’s current zone of proximal development, amplifying learning on examples the student is close to mastering.
Empirical gains at small scales: Evaluated as post-training on Qwen3.5-family students (0.8B–9B) with a 27B teacher across a 31-benchmark suite (LLM, VLM, Video), ZPPO outperforms off-/on-policy distillation and GRPO, with the largest relative improvements for the smallest students.
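
The first two findings can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt templates, the `HardItem` fields, the `graduate_at` threshold, and the buffer API are all hypothetical, since this summary does not give the exact formats or graduation rule.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class HardItem:
    question: str
    teacher_answer: str       # a response the teacher got right
    student_failures: list    # wrong answers from the student's own rollouts


def build_bcq(item, rng):
    """Hypothetical binary-candidate prompt: one correct teacher answer and
    one failed student answer, shown in random order."""
    candidates = [item.teacher_answer, rng.choice(item.student_failures)]
    rng.shuffle(candidates)
    return (
        f"{item.question}\n\n"
        f"Candidate A: {candidates[0]}\n"
        f"Candidate B: {candidates[1]}\n"
        "One candidate is correct. Work through the problem and give the final answer."
    )


def build_ncq(item):
    """Hypothetical negative-candidate prompt: only known-wrong answers are shown."""
    wrong = "\n".join(f"- {w}" for w in item.student_failures)
    return (
        f"{item.question}\n\n"
        f"These earlier attempts are incorrect:\n{wrong}\n"
        "Avoid their mistakes and give the final answer."
    )


class PromptReplayBuffer:
    """Holds hard items until they graduate or are evicted in FIFO order."""

    def __init__(self, capacity, graduate_at=0.5):
        self.items = deque()
        self.capacity = capacity
        self.graduate_at = graduate_at  # assumed pass-rate threshold

    def add(self, item):
        if len(self.items) >= self.capacity:
            self.items.popleft()  # FIFO eviction of the oldest hard item
        self.items.append(item)

    def report(self, item, pass_rate):
        # An item "graduates" once the student's pass rate reaches the threshold.
        if pass_rate >= self.graduate_at and item in self.items:
            self.items.remove(item)
```

In this sketch the teacher only ever appears as text inside the prompt; the rollouts scored by the RL objective are still sampled from the student, which is the sense in which the gradient stays on-policy.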

Who it’s for and tradeoffs

Great fit if you compress large multimodal teachers into much smaller text/vision-language students and need to recover learning on hard examples without breaking on-policy training. ZPPO is especially attractive when teacher logits are impractical to use directly or when naive distillation harms generalization.

Look elsewhere if you can afford white-box teacher logits with matched-capacity distillation, if strict theoretical on-policy purity is required without any prompt conditioning, or if prompt replay storage and bookkeeping are a deployment constraint.

Where it fits

ZPPO sits between imitation-style logit distillation and pure RLVR: it keeps the exploration benefits of on-policy rollouts while borrowing dense supervisory signal from the teacher via prompts rather than gradients. This makes it a practical middle path for small-model post-training in multimodal reasoning settings.


Information

Website: arxiv.org
Authors: Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang …
Published date: 2026/06/16

More Items

Reinforcement Learning Papers, 2026

CoRT: Counterfactual Replay for Token-Level Rubric-Guided Policy Optimization

Bo-Wen Zhang, Junwei He +6

Allocates token-level credit in rubric-conditioned GRPO by counterfactually replaying the same response under rubric and criteria-free prompts, using tokenwise log-likelihood contrasts to compute bounded, response-normalized weights that redistribute GRPO advantages without training an auxiliary scorer.

RL LLM NLP paper evaluation

Large Language Model Papers, 2026

DecoEvo: Score-Decoupled Co-Evolution of Solver and Rubric-Generator Skills in Text Space

Jiangwang Chen, Zixin Song +11; Tsinghua University, Qwen Business Unit of Alibaba +2

Co-evolves a solver skill and a rubric-generator skill for text-space LLM optimization under decoupled objectives to avoid rubric gaming without using gold rubrics. Solver updates use criterion-level feedback; generator updates use independent audits of requirement coverage and response discrimination.

LLM evaluation agent-skills qwen paper +2

AI Video Papers, 2026

Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

Senqiao Yang, Kaichen Zhang +21

Real-time streaming multimodal foundation model that uses a codec-native tokenizer (Mage-ViT) to encode motion- and residual-rich regions from video I/P frames, reducing visual token usage by over 75% and enabling up to ~3.5× wall-clock inference speedup after training on ~560M images and 100M video frames.

multimodal video vision foundation-model ai +5