Current model evaluations emphasize final-answer accuracy, which misses a crucial capability: knowing when an answer might be wrong. That gap explains many cases where capable models produce confident but incorrect outputs. Metacognition-Bench reframes evaluation by measuring functional metacognition: the model's ability to detect seductive-but-wrong reasoning paths and either avoid them or self-correct.
What Sets It Apart
- Two complementary axes: a multiple-choice "vulnerability" test (trap_rate) that quantifies how often a model picks a planted tempting trap, and a free-form "adapter gain" metric that measures how much a lightweight probe (an adapter reading frozen hidden states) improves error detection beyond the model's own confidence. This split separates whether a model is inherently robust from whether a small probe can meaningfully flag its failures.
- Problem design: 300 auto-gradable items each embedding a documented hidden_trap (base-rate neglect, premise-shift, binary framing, etc.), balanced across 121 domains and graded difficulty so the benchmark stresses metacognitive behaviors rather than domain knowledge alone.
- Practical deliverables: a live leaderboard tracking both axes and a collection of per-model metacognition adapters that output P(wrong) signals by reading the model's internal state. The base model stays frozen throughout; these are adapters, not fine-tunes.
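As a rough illustration of the multiple-choice axis, trap_rate can be sketched as the fraction of items on which the model selects the planted trap option. The `MCItem` fields, the prediction format, and the choice to divide by all items (rather than only answered ones) are assumptions made for this sketch, not the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MCItem:
    """Hypothetical multiple-choice item; field names are illustrative only."""
    item_id: str
    correct_choice: str
    trap_choice: str  # the option tied to the item's hidden_trap


def trap_rate(items: List[MCItem], predictions: Dict[str, str]) -> float:
    """Fraction of items where the predicted choice equals the planted trap.

    Assumption: unanswered items count as "not trapped" and stay in the
    denominator. The real benchmark may handle them differently.
    """
    if not items:
        raise ValueError("trap_rate needs at least one item")
    trapped = sum(
        1 for item in items if predictions.get(item.item_id) == item.trap_choice
    )
    return trapped / len(items)


# Toy data: the model falls for the trap on q1 and q3.
items = [
    MCItem("q1", correct_choice="B", trap_choice="C"),
    MCItem("q2", correct_choice="A", trap_choice="D"),
    MCItem("q3", correct_choice="D", trap_choice="A"),
    MCItem("q4", correct_choice="C", trap_choice="B"),
]
predictions = {"q1": "C", "q2": "A", "q3": "A", "q4": "D"}

rate = trap_rate(items, predictions)
print(f"trap_rate = {rate:.2f}")
```

Note that q4 is answered incorrectly without hitting the trap, so it lowers accuracy but not trap_rate; the two numbers measure different things.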
Who It's For and Trade-offs
A good fit if you need a focused evaluation of LLM safety and reliability beyond accuracy, for example researchers benchmarking model self-awareness, teams deploying confidence-calibrated LLM tooling, or developers testing lightweight monitoring probes. Look elsewhere if you only need large-scale factual benchmarks or end-to-end task accuracy: Metacognition-Bench is small (300 items) and explicitly adversarial, so it trades breadth for targeted diagnostic power. The provided adapters are lightweight probes that require access to the model's hidden states, so they suit setups where you can run the frozen base model yourself.
Methodology & Practical Notes
Problems are auto-generated under constraints and filtered by LLM grading for trap validity and gradability. The adapter workflow freezes the base model and trains a compact MLP on its hidden states to predict failure; adapter gain is reported as the AUROC improvement (Δ AUROC) of the adapter's P(wrong) over the model's raw confidence, measured on held-out splits. The benchmark is intentionally model-agnostic and comes with reproducible tooling and a public leaderboard for ongoing submissions.
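To make the adapter gain definition concrete, the sketch below computes AUROC for two error detectors on the same held-out items and takes the difference. It assumes the baseline detector scores each item as 1 minus the model's stated confidence; the function names and toy numbers are hypothetical, and the benchmark's own tooling may compute this differently.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a positive outranks a negative.

    Pairwise (Mann-Whitney) formulation; ties count as half a win.
    labels: 1 = the model's answer was wrong, 0 = it was right.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one wrong and one right answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def adapter_gain(
    adapter_p_wrong: Sequence[float],
    model_confidence: Sequence[float],
    is_wrong: Sequence[int],
) -> float:
    """AUROC of the adapter minus AUROC of the confidence baseline.

    Assumption: the baseline error score is 1 - confidence.
    """
    baseline: List[float] = [1.0 - c for c in model_confidence]
    return auroc(adapter_p_wrong, is_wrong) - auroc(baseline, is_wrong)


# Toy held-out split: three wrong answers, three right ones.
is_wrong = [1, 0, 1, 0, 0, 1]
model_confidence = [0.9, 0.8, 0.6, 0.95, 0.7, 0.85]  # model's own confidence
adapter_p_wrong = [0.8, 0.2, 0.7, 0.1, 0.4, 0.3]     # probe output

gain = adapter_gain(adapter_p_wrong, model_confidence, is_wrong)
print(f"adapter gain (delta AUROC) = {gain:+.3f}")
```

A positive gain means the probe ranks wrong answers above right ones more reliably than the model's own confidence does; a gain near zero suggests the hidden states add little beyond what the confidence already expresses.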
