Most evaluations that assign stable personality or value traits to LLMs rely on human psychometric questionnaires. This paper shows why that practice can be misleading: questionnaire responses and generation‑based profiles of LLMs diverge substantially, so questionnaire results overstate how stable a model's dispositions are and fail to predict how models behave in realistic user interactions.
Key Findings
- Profile divergence: Across eight open‑source LLMs, trait/value profiles derived from Likert-style questionnaire responses (PVQ-40/21, BFI-44/10) differ substantially from profiles based on generation probabilities for everyday, value‑laden queries. So what: a questionnaire-derived label (e.g., “high agreeableness”) is not a reliable predictor of model outputs in the wild.
- Item consistency largely disappears in generative behavior: The within-construct internal consistency that appears in questionnaire answers largely vanishes when the same constructs are measured with generation probabilities. So what: apparent stable dispositions may be an artifact of questionnaire design rather than model internals.
- Lexical cue and social desirability effects: Explicit lexical cues in established questionnaire items let models recognize the target construct and produce alignment‑consistent or socially desirable answers; everyday queries lack those cues. So what: questionnaires measure whether a model recognizes the construct an item targets, not stable preferences.
- Demographic persona prompting is limited: Persona prompts shift questionnaire responses in ways resembling human patterns, but they do not produce comparable shifts in generation‑based measures for realistic queries. So what: persona-conditioned questionnaires overestimate a model’s ability to simulate demographic behavior in real interactions.
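The first two findings rest on two standard statistics: an internal-consistency coefficient within each construct, and an agreement measure between two profiles. The sketch below is illustrative only and is not the paper's code. It assumes Cronbach's alpha for consistency and Spearman rank correlation for profile agreement; the summary above does not name the exact metrics, and all function names and example numbers are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one construct.

    item_scores: one list per item, each holding that item's score across the
    same observations (for example repeated runs or prompt paraphrases).
    """
    k = len(item_scores)
    n = len(item_scores[0])
    item_vars = [variance(item) for item in item_scores]
    # Total score per observation, summed over all items of the construct.
    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    return k / (k - 1) * (1 - sum(item_vars) / variance(totals))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up numbers for illustration only; they are NOT results from the paper.
questionnaire_profile = {"benevolence": 5.2, "universalism": 5.0,
                         "security": 3.9, "power": 2.1}
generation_profile = {"benevolence": 0.55, "universalism": 0.48,
                      "security": 0.61, "power": 0.52}

constructs = sorted(questionnaire_profile)
agreement = spearman([questionnaire_profile[c] for c in constructs],
                     [generation_profile[c] for c in constructs])
print(f"rank agreement between profiles: {agreement:.2f}")
```

Under these assumptions, a high alpha on questionnaire items paired with a low alpha on generation-based scores for the same construct, or a low rank agreement between the two profiles, would correspond to the divergence the findings describe.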
Who it's for, and trade‑offs
Great fit if you design or evaluate LLM behavior for user-facing systems and need realistic behavioral profiling rather than surface-level trait labels. The paper argues for generation‑based profiling when your goal is to predict likely model outputs during everyday use. Look elsewhere if you need a quick, comparable benchmark for inter-model calibration. Questionnaires are easier to administer and compare across many models, but they risk giving false confidence about real-world behavior.
Methodological note
The study analyzes eight open‑source LLMs, compares their Likert-scale self-reports on the PVQ and BFI questionnaires with generation probabilities over curated everyday queries, and tests demographic persona prompts. The recommendation is pragmatic: use generation‑based profiling (probabilistic output measures on realistic queries) when the objective is behavioral prediction, and treat questionnaire-derived traits as complementary. They are informative about recognition and alignment tendencies but are not definitive behavioral predictors.
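One way to picture a "probabilistic output measure" is sketched below. It assumes that, for each everyday query, you already have the model's total log-probability for a candidate response that expresses the construct and for one that opposes it. The summary does not specify the paper's actual scoring procedure, so the two-candidate setup, the function names, and the averaging step are all assumptions made for illustration.

```python
import math
from statistics import mean


def preference_probability(logp_support, logp_oppose):
    """Probability mass on the construct-supporting response, normalized over
    the two candidates (a two-way softmax over their log-probabilities)."""
    m = max(logp_support, logp_oppose)  # subtract the max for numerical stability
    e_support = math.exp(logp_support - m)
    e_oppose = math.exp(logp_oppose - m)
    return e_support / (e_support + e_oppose)


def generation_profile(logprobs_by_construct):
    """Average the per-query preference probability within each construct.

    logprobs_by_construct: {construct: [(logp_support, logp_oppose), ...]},
    a hypothetical input format with one pair per everyday query.
    """
    return {
        construct: mean(preference_probability(s, o) for s, o in pairs)
        for construct, pairs in logprobs_by_construct.items()
    }
```

A score near 0.5 for a construct would mean the model shows no consistent preference across the sampled queries, which is the kind of signal a questionnaire-derived label alone cannot reveal.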
