AIAny - Human Psychometric Questionnaires Mischaracterize LLM Behavior

Most evaluations that assign stable personality or value traits to LLMs rely on human psychometric questionnaires. This paper shows why that practice can be misleading: questionnaire responses and generation‑based profiles of LLMs diverge substantially, so questionnaire results overstate stable dispositions and fail to predict how models behave in realistic user interactions.

Key Findings

Profile divergence: Across eight open‑source LLMs, trait/value profiles derived from Likert-style questionnaire responses (PVQ-40/21, BFI-44/10) differ substantially from profiles based on generation probabilities for everyday, value‑laden queries. So what: a questionnaire-derived label (e.g., “high agreeableness”) is not a reliable predictor of model outputs in the wild.
Item consistency disappears in generative behavior: Within-construct internal consistency that appears in questionnaire answers largely vanishes when measuring generation probabilities. So what: apparent stable dispositions may be an artifact of questionnaire design rather than model internals.
Lexical cue and social desirability effects: Explicit lexical cues in established questionnaire items let models recognize the target construct and produce alignment‑consistent or socially desirable answers; everyday queries lack those cues. So what: questionnaires may measure whether a model recognizes the construct being probed, not a stable preference.
Demographic persona prompting is limited: Persona prompts shift questionnaire responses in ways resembling human patterns, but they do not produce comparable shifts in generation‑based measures for realistic queries. So what: persona-conditioned questionnaires overestimate a model’s ability to simulate demographic behavior in real interactions.
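
To make the item-consistency finding concrete, here is a minimal sketch of Cronbach's alpha, a standard internal-consistency statistic for a set of items that target the same construct. The summary does not say which consistency measure the paper uses, so the choice of alpha, the function name, and the toy scores are all assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for one construct.

    scores: list of rows, one row per run (or respondent), each row holding
    one numeric score per item. Hypothetical helper, not the paper's code.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two rows")
    # Sum of the per-item sample variances (column-wise).
    item_var_sum = sum(variance(col) for col in zip(*scores))
    # Sample variance of each row's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Toy data (assumed): items that move together vs. items that do not.
consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
inconsistent = [[1, 4, 2], [5, 2, 4], [3, 5, 1], [2, 1, 5]]
print(cronbach_alpha(consistent), cronbach_alpha(inconsistent))
```

Computing the same statistic once on Likert answers and once on generation-based scores for the same construct is one way to check whether the consistency seen in questionnaires carries over to generated behavior.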

Who it's for, and trade‑offs

Great fit if you design or evaluate LLM behavior for user-facing systems and need realistic behavioral profiling rather than surface-level trait labels. The paper argues for generation‑based profiling when your goal is to predict likely model outputs during everyday use. Look elsewhere if you need a quick, comparable benchmark for inter-model calibration. Questionnaires are easier to administer and compare across many models, but they risk giving false confidence about real-world behavior.

Methodological note

The study analyzes eight open‑source LLMs, compares Likert self-reports on the PVQ and BFI questionnaires with generation probabilities over curated everyday queries, and tests demographic persona prompts. The recommendation is pragmatic: use generation‑based profiling (probabilistic output measures on realistic queries) when the objective is behavioral prediction, and treat questionnaire-derived traits as complementary. They are informative about recognition and alignment tendencies, but they are not definitive behavioral predictors.
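
One way a generation‑based profile could be compared with a questionnaire profile is sketched below. The summary does not describe the paper's exact scoring, so everything here is an assumption: each everyday query is taken to have one candidate response per value with a known log-probability, those are normalized with a softmax and averaged per value, and the resulting profile is compared with Likert means by Spearman rank correlation. The function names and numbers are hypothetical.

```python
import math


def normalize_logprobs(logprobs):
    """Softmax over candidate log-probabilities; the result sums to 1."""
    m = max(logprobs.values())
    exp = {k: math.exp(v - m) for k, v in logprobs.items()}
    z = sum(exp.values())
    return {k: e / z for k, e in exp.items()}


def generation_profile(per_query_logprobs):
    """Mean normalized probability per value across all queries."""
    totals = {}
    for logprobs in per_query_logprobs:
        for value, p in normalize_logprobs(logprobs).items():
            totals[value] = totals.get(value, 0.0) + p
    return {v: t / len(per_query_logprobs) for v, t in totals.items()}


def average_ranks(xs):
    """Ranks starting at 1; tied entries share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(profile_a, profile_b):
    """Spearman correlation between two profiles over their shared keys."""
    keys = sorted(set(profile_a) & set(profile_b))
    ra = average_ranks([profile_a[k] for k in keys])
    rb = average_ranks([profile_b[k] for k in keys])
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    sa = math.sqrt(sum((a - ma) ** 2 for a in ra))
    sb = math.sqrt(sum((b - mb) ** 2 for b in rb))
    return cov / (sa * sb)  # undefined (ZeroDivisionError) for a flat profile


# Assumed toy inputs: Likert means per value, and candidate log-probs per query.
questionnaire = {"benevolence": 5.0, "power": 1.5, "security": 3.0}
queries = [
    {"benevolence": -2.0, "power": -1.0, "security": -1.5},
    {"benevolence": -3.0, "power": -0.5, "security": -1.0},
]
generated = generation_profile(queries)
print(generated, spearman(questionnaire, generated))
```

In this toy case the two profiles rank the values in opposite order, which is the kind of divergence the paper reports. A low or negative correlation on real data would point the same way, though it says nothing about which profile is the better behavioral predictor.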
