Why this matters
ZPPO addresses a common tradeoff in model compression: direct logit imitation from a large teacher can over-constrain small students, while pure on-policy RL discards hard examples where all rollouts fail. The paper’s core insight is to keep teacher information in the prompt (so the student still learns from the teacher’s outputs) without contaminating the policy gradient, thereby preserving on-policy exploration while providing dense, actionable supervision on failure cases.
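
To make the "all rollouts fail" case concrete: in GRPO-style training, each rollout's advantage is its reward normalized against the other rollouts sampled for the same prompt. The sketch below is illustrative rather than taken from the paper. It assumes binary rewards and a population standard deviation (implementations differ on such details), and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt where some rollouts succeed yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A hard prompt where every rollout fails: every advantage is 0.0,
# so the prompt contributes nothing to the policy gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```

With identical rewards the numerator is zero for every rollout, which is why such prompts are effectively dropped unless they are reformulated or revisited.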
Key Findings
- Prompt-based teacher supervision: Reformulating hard examples into Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) surfaces correct answers and common failure modes inside the student’s prompt context, enabling token-level discrimination without altering the RL gradient.
- Prompt replay buffer: Recirculating hard questions until they "graduate" or are FIFO-evicted focuses training on the student’s current zone of proximal development, amplifying learning on examples the student is close to mastering.
- Empirical gains at small scales: Evaluated as post-training on Qwen3.5-family students (0.8B–9B) with a 27B teacher across a 31-benchmark suite (LLM, VLM, Video), ZPPO outperforms off-/on-policy distillation and GRPO, with the largest relative improvements for the smallest students.
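
The first two findings can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt templates, the reading of BCQ as "one teacher answer plus one failed student answer" and NCQ as "failed answers only", the pass-rate graduation threshold, and every name below (`build_bcq`, `build_ncq`, `PromptReplayBuffer`) are hypothetical.

```python
import random
from collections import OrderedDict


def build_bcq(question, teacher_answer, student_answer, rng):
    """Assumed BCQ form: show one teacher answer and one failed student answer."""
    candidates = [teacher_answer, student_answer]
    rng.shuffle(candidates)  # avoid leaking the correct answer through position
    return (
        f"{question}\n\nCandidate A: {candidates[0]}\nCandidate B: {candidates[1]}\n"
        "Exactly one candidate is correct. Reason step by step, then answer."
    )


def build_ncq(question, failed_answers):
    """Assumed NCQ form: show only known-wrong answers as failure modes to avoid."""
    listed = "\n".join(f"- {a}" for a in failed_answers)
    return f"{question}\n\nThese earlier attempts are wrong:\n{listed}\nAvoid their mistakes and answer."


class PromptReplayBuffer:
    """Holds hard prompts until they graduate or are FIFO-evicted."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at  # assumed criterion: rollout pass rate
        self._items = OrderedDict()  # question id -> prompt, oldest first

    def add(self, qid, prompt):
        if qid in self._items:
            return  # already buffered; keep its original age
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict the oldest entry (FIFO)
        self._items[qid] = prompt

    def sample(self, k, rng):
        ids = rng.sample(list(self._items), min(k, len(self._items)))
        return [(qid, self._items[qid]) for qid in ids]

    def report(self, qid, pass_rate):
        """Remove a prompt once the student solves it often enough."""
        if pass_rate >= self.graduate_at:
            self._items.pop(qid, None)

    def __len__(self):
        return len(self._items)

    def __contains__(self, qid):
        return qid in self._items


rng = random.Random(0)
buf = PromptReplayBuffer(capacity=2)
bcq = build_bcq("What is 7 * 8?", "56", "54", rng)
buf.add("q1", bcq)
buf.add("q2", build_ncq("What is 9 * 9?", ["72", "89"]))
buf.add("q3", build_ncq("What is 6 * 7?", ["36"]))  # buffer is full, so q1 is evicted
buf.report("q2", pass_rate=0.75)  # q2 graduates and leaves the buffer
print(sorted(qid for qid, _ in buf.sample(2, rng)))
```

In this toy run only `q3` remains: `q1` was evicted for age and `q2` graduated. Note that the teacher's answer only ever enters through the prompt string, so the rollouts scored by the RL objective are still sampled from the student.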
Who it’s for and tradeoffs
Great fit if you compress large multimodal teachers into much smaller text or vision-language students and need to recover learning on hard examples without breaking on-policy training. ZPPO is especially attractive when teacher logits are impractical to use directly or when naive distillation harms generalization.
Look elsewhere if you can afford white-box teacher logits with matched-capacity distillation, if you require strictly on-policy training with no teacher-derived prompt conditioning, or if the storage and bookkeeping for prompt replay are a deployment constraint.
Where it fits
ZPPO sits between imitation-style logit distillation and pure RLVR: it keeps the exploration benefits of on-policy rollouts while borrowing dense supervisory signal from the teacher via prompts rather than gradients. This makes it a practical middle path for small-model post-training in multimodal reasoning settings.
