Why this matters
Most current evaluations check whether models recall facts or reproduce a persona at a single point in a story. That misses a central competence for narrative agents: adaptively shifting values and behavior as a character's psychological arc unfolds. ArcANE reframes evaluation to ask not “what did the character do?” but “would the character act this way at this point in the arc?” — including in scenarios the source text never shows.
Key Findings
- In a comparison of six models under six context modes, conditioning on an explicit Character Arc consistently outperforms the other context strategies (e.g., raw retrieval or isolated chapters), which suggests that the arc captures trajectory information that simple retrieval cannot.
- The Arc advantage is largest on out-of-text probes, where retrieval has nothing relevant to fetch. This indicates that arc-aware prompts help models extend a character’s trajectory to novel situations.
- Fine-tuning open-weight models on ArcANE yields ArcANE-8B/32B, which widen the performance gap on out-of-text scenarios — suggesting the dataset can teach models to internalize arc-driven behavior rather than only memorizing passages.
- The benchmark covers 17 novels and 80 principal characters, with each Character Arc segmented into phases and a common probe applied across phases to measure temporal alignment rather than static consistency.
Who this is for and tradeoffs
Great fit if you care about evaluating or fine-tuning conversational agents that must maintain evolving personalities (e.g., narrative NPCs, role-play assistants, long-form interactive storytelling). ArcANE is useful for researchers comparing context strategies, probing generalization, or training models to be arc-aware.
Look elsewhere if your target domain is short-form factual QA, task-oriented dialogue, or non-narrative conversational systems — ArcANE focuses on literary character trajectories, so its probes and segmentation assumptions may not map to transactional or domain-specific behaviors. Also note: automatic probes approximate human judgements and may miss subtle cultural or interpretive aspects of characters; human evaluation remains valuable for final assessment.
How it works (brief)
ArcANE constructs a Character Arc by segmenting a novel into phases along a psychological axis, then generates probes that present the same scenario at each phase. Probes include situations present in the source text and deliberately out-of-text scenarios that test generalization. The paper compares six models under six context modes (including arc conditioning and retrieval) and reports both benchmark results and the effect of fine-tuning open-weight models on ArcANE, which produces ArcANE-8B and ArcANE-32B.
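The phase-and-probe setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Phase` and `Probe` classes, the `build_arc_prompts` function, the prompt wording, and the choice to show only the arc up to the current phase are all assumptions made for this example, and the character and phase summaries are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One segment of a character's arc (hypothetical structure)."""
    label: str
    summary: str  # short description of the character's state in this phase


@dataclass(frozen=True)
class Probe:
    """A scenario posed identically at every phase (hypothetical structure)."""
    scenario: str
    in_text: bool  # False marks an out-of-text scenario


def build_arc_prompts(character: str, phases: list[Phase], probe: Probe) -> list[str]:
    """Return one prompt per phase, each conditioned on the arc so far.

    Assumption: the model sees only the phases up to and including the
    current one, so the same scenario is asked under a growing arc.
    """
    prompts = []
    for i, phase in enumerate(phases):
        arc_so_far = "\n".join(
            f"- {p.label}: {p.summary}" for p in phases[: i + 1]
        )
        prompts.append(
            f"Character: {character}\n"
            f"Arc so far:\n{arc_so_far}\n"
            f"Current phase: {phase.label}\n"
            f"Scenario: {probe.scenario}\n"
            f"How would {character} act at this point in the arc?"
        )
    return prompts


# Placeholder arc and probe, invented purely for illustration.
phases = [
    Phase("Phase 1", "trusting and eager to please"),
    Phase("Phase 2", "disillusioned after a betrayal"),
    Phase("Phase 3", "guarded but quietly principled"),
]
probe = Probe(scenario="A stranger asks to borrow money.", in_text=False)
prompts = build_arc_prompts("Character A", phases, probe)
```

Because the scenario text is identical in every prompt, any difference in the model's answers can be attributed to the phase context. That is the sense in which a common probe measures temporal alignment rather than static consistency.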
Taken together, ArcANE shifts evaluation from static persona recall toward temporal alignment with a character's trajectory, and shows concrete gains for arc-aware conditioning and fine-tuning — especially when models must act beyond what the original text documents.
