Visual preference is subjective and often poorly captured by a single scalar score. This paper's core insight is to represent preferences as full score distributions produced by a reasoning-capable teacher VLM, then distill those reasoning-conditioned distributions into a compact, deployable student. The student preserves the teacher's nuanced judgments without requiring reasoning at inference time. Used as a reward for text-to-image optimization, the approach yields large gains in human preference.
Key Findings
- Teacher-student decomposition: a large VLM teacher performs reasoning to infer rubric-aligned score distributions; a compact student learns those distributions for fast inference. So what? You keep rich, uncertainty-aware judgments in production-friendly models.
- Group-wise Direct Score Optimization (GDSO): trains the teacher by combining expectation-based policy-gradient rewards with pointwise and pairwise supervision on score distributions and gaps. So what? It encourages both distributional fidelity and decision-consistent ranking.
- Reasoning-Internalized Score Distillation (RISD): transfers the teacher's reasoning-conditioned score distribution into the student without exposing reasoning chains at runtime. So what? You get near-teacher accuracy with a much smaller model and lower inference cost.
- Empirical gains: the 27B GDSO teacher achieves 89.6% human preference accuracy and the 9B RISD student reaches 88.6%. Using Z-Reward as a differentiable reward yields a 41.3% net human-preference improvement over an SFT baseline. So what? Distributional rewards materially improve perceived image quality.
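To make the "uncertainty-aware" point concrete, here is a minimal sketch of what a score distribution carries that a scalar does not. The five-level rubric, the function names, and the example distributions are all illustrative assumptions, not details taken from the paper.

```python
import math

# Hypothetical 5-level rubric; the paper's actual score levels are not given here.
LEVELS = [1, 2, 3, 4, 5]

def expected_score(probs, levels):
    """Mean of a discrete score distribution (the scalar a plain reward would keep)."""
    return sum(p * s for p, s in zip(probs, levels))

def score_std(probs, levels):
    """Standard deviation of the distribution, a simple proxy for judge uncertainty."""
    mean = expected_score(probs, levels)
    variance = sum(p * (s - mean) ** 2 for p, s in zip(probs, levels))
    return math.sqrt(variance)

# Two made-up distributions with the same mean but very different spread.
confident = [0.0, 0.0, 1.0, 0.0, 0.0]  # everyone agrees on "3"
divided = [0.5, 0.0, 0.0, 0.0, 0.5]    # opinions split between "1" and "5"

for name, probs in [("confident", confident), ("divided", divided)]:
    print(name, expected_score(probs, LEVELS), score_std(probs, LEVELS))
```

Both distributions collapse to the same expected score, so a scalar reward could not tell them apart; the spread is what separates a consensus "average" image from a polarizing one.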
Who It's For and Trade-offs
Great fit if you research or build text-to-image reward models, want uncertainty-aware preference signals, or need a deployable reward model that encodes reasoning without runtime chains. Look elsewhere if you cannot access or fine-tune large VLMs, need fully transparent reasoning at inference, or require publicly reproducible benchmarks (the evaluation uses an internally annotated set). Computational and annotation costs for training the teacher are nontrivial.
Method Overview
The teacher is a large vision-language model optimized with GDSO to output score distributions aligned to a human rubric; GDSO blends distribution-expectation policy gradients with direct supervision on distribution shapes and inter-score gaps. The student is trained with RISD to internalize those distributions into a compact VLM, enabling efficient deployment and use as a differentiable reward for downstream text-to-image optimization.
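The two kinds of supervision described above can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's formulation: it uses KL divergence as a stand-in for matching student to teacher distributions, and a Bradley-Terry style logistic loss on the expected-score gap as a stand-in for pairwise supervision. The function names and the example distributions are hypothetical.

```python
import math

LEVELS = [1, 2, 3, 4, 5]  # hypothetical rubric levels (assumption)

def expected_score(probs, levels=LEVELS):
    """Expected value of a discrete score distribution."""
    return sum(p * s for p, s in zip(probs, levels))

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student): an assumed stand-in for a distribution-matching loss."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )

def pairwise_gap_loss(probs_preferred, probs_rejected):
    """Logistic loss on the expected-score gap (assumed, Bradley-Terry style).

    Equals -log(sigmoid(gap)); it shrinks as the preferred image's expected
    score rises above the rejected image's.
    """
    gap = expected_score(probs_preferred) - expected_score(probs_rejected)
    return math.log1p(math.exp(-gap))

# Made-up distributions for illustration only.
teacher = [0.05, 0.10, 0.20, 0.40, 0.25]
student_close = [0.05, 0.10, 0.25, 0.40, 0.20]
student_far = [0.40, 0.30, 0.15, 0.10, 0.05]

print(kl_divergence(teacher, student_close))  # small: shapes nearly match
print(kl_divergence(teacher, student_far))    # larger: shapes disagree
print(pairwise_gap_loss(teacher, student_far))
```

Because the expected score is a smooth function of the predicted probabilities, a model that outputs such distributions can in principle be backpropagated through, which is consistent with the summary's description of the student as a differentiable reward.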
