Benchmarks that use coarse semantic similarity often hide perceptual failures: models can score well overall while missing mandatory visual facts or making subtle hallucinations. PerceptionRubrics tackles this by turning verified, high-detail captions into atomic rubrics and applying a gated scoring mechanism so that failure on essential facts blocks any finer-grained credit.
Key Findings
- Data-driven rubricization: 1,038 information-dense images were paired with 10,718 instance-specific rubrics (4,053 Must-Right; 6,665 Easy-Wrong). This converts long, high-fidelity captions into explicit unit tests for vision-language grounding, enabling precise failure analysis.
- Gated scoring changes rankings: the Must-Right gate forces models to satisfy core visual facts before scoring fine details, revealing brittle conjunction failures that conventional holistic metrics miss. Practically, many models pass fragmented checks but fail strict conjunctive constraints.
- Human alignment and stratification: the gated rubric metric correlates better with human judgments and uncovers a roughly 8% perception gap between leading proprietary systems and the top open-source model, highlighting remaining open-source deficits in fine-grained perception.
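The gating idea in the findings above can be illustrated with a small sketch. The text here does not give the benchmark's actual scoring formula, so the form below is an assumption chosen only to show the mechanism: any failed Must-Right rubric zeroes the score, and otherwise the score is the fraction of Easy-Wrong rubrics passed. The names `RubricResult`, `gated_score`, and `ungated_score` are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    kind: str     # "must_right" or "easy_wrong" (hypothetical labels)
    passed: bool  # verdict for this rubric, e.g. from an LLM judge


def ungated_score(results: List[RubricResult]) -> float:
    """Plain average over all rubrics, with no gate (for comparison)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: a single failed Must-Right rubric blocks all credit."""
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if not all(r.passed for r in must):
        return 0.0  # gate closed: fine-grained rubrics earn nothing
    if not easy:
        return 1.0  # assumption: no fine-grained rubrics means full credit
    return sum(r.passed for r in easy) / len(easy)


# One image where the model misses a core fact but gets the details right.
example = [
    RubricResult("must_right", True),
    RubricResult("must_right", False),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
]
print(ungated_score(example))  # 0.75
print(gated_score(example))    # 0.0
```

In this toy case the ungated average rewards three passed checks out of four, while the gated score is zero because one essential fact is wrong. This is the kind of conjunctive failure that a holistic average can hide.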
Who it's for and Trade-offs
Great fit if you need rigorous, human-aligned diagnostics of multimodal perception, for example researchers benchmarking MLLMs, dataset curators auditing visual grounding, or teams hunting hallucination modes. Look elsewhere if you need a lightweight, coarse ranking for simple captioning tasks: the rubric pipeline is more labor- and data-intensive and prioritizes strict factual fidelity over permissive semantic similarity.
Methodology & Practical Notes
The benchmark constructs golden captions via a Circular Peer-Review pipeline (ensemble MLLM critique + human verification) and distills them into Must-Right and Easy-Wrong rubrics, which an LLM-as-judge evaluates under a gated scoring formula. Key numbers: 1,038 images, long captions (mean ≈ 770 words), 10,718 rubrics, and an average of about 10.33 rubrics per image. The dataset and evaluation code are released to support reproducible auditing and reuse as inference-time verifiers.
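As a quick consistency check on the reported counts, the arithmetic can be reproduced directly. The figures are the ones quoted in this summary; the variable names are illustrative only, and the Must-Right share is a derived value that the summary itself does not state.

```python
# Counts as quoted in this summary.
images = 1038
must_right = 4053
easy_wrong = 6665

total_rubrics = must_right + easy_wrong
rubrics_per_image = total_rubrics / images
must_right_share = must_right / total_rubrics

print(total_rubrics)                     # 10718
print(round(rubrics_per_image, 2))       # 10.33
print(round(100 * must_right_share, 1))  # 37.8 (percent of rubrics that are Must-Right)
```

The two rubric categories sum to the stated total, and the per-image average matches the reported value of about 10.33.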
