AIAny - ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Why this matters

Most current evaluations check whether models recall facts or reproduce a persona at a single point in a story. That misses a central competence for narrative agents: adaptively shifting values and behavior as a character's psychological arc unfolds. ArcANE reframes evaluation to ask not “what did the character do?” but “would the character act this way at this point in the arc?” — including in scenarios the source text never shows.

Key Findings

Across the six evaluated models and six context modes compared, conditioning on an explicit Character Arc consistently outperforms the other context strategies (e.g., raw retrieval or isolated chapters). This suggests the arc carries trajectory information that simple retrieval does not.
The Arc advantage is largest on out-of-text probes, where retrieval has nothing relevant to fetch. Arc-aware prompts appear to let models extend a character's trajectory to novel situations.
Fine-tuning open-weight models on ArcANE yields ArcANE-8B and ArcANE-32B, which widen the performance gap on out-of-text scenarios. This suggests the dataset can teach models to internalize arc-driven behavior rather than only memorize passages.
The benchmark covers 17 novels and 80 principal characters. Each Character Arc is segmented into phases, and a common probe is applied across phases to measure temporal alignment rather than static consistency.
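
To make the distinction between temporal alignment and static consistency concrete, here is a minimal Python sketch. It is not the paper's metric: the stance labels, the two function names, and the exact-match scoring are all assumptions chosen for illustration. It compares a model that answers a probe identically in every phase with one that tracks the arc.

```python
from collections import Counter
from typing import Sequence


def temporal_alignment(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of phases where the predicted stance matches the stance
    expected for that specific phase (hypothetical exact-match scoring)."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need one prediction per phase, and at least one phase")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def static_consistency(predicted: Sequence[str]) -> float:
    """Share of phases that agree with the single most common predicted stance.
    A value of 1.0 means the model never changes its answer across phases."""
    if not predicted:
        raise ValueError("need at least one prediction")
    most_common_count = Counter(predicted).most_common(1)[0][1]
    return most_common_count / len(predicted)


# Hypothetical labels for one probe applied across a three-phase arc.
expected = ["refuse", "hesitate", "accept"]
frozen = ["refuse", "refuse", "refuse"]      # same answer in every phase
tracking = ["refuse", "hesitate", "accept"]  # answer follows the arc

frozen_scores = (temporal_alignment(frozen, expected), static_consistency(frozen))
tracking_scores = (temporal_alignment(tracking, expected), static_consistency(tracking))
```

Under these assumptions, the frozen model is perfectly consistent but matches the arc in only one of three phases, while the tracking model scores the reverse. That is the gap a static persona check would not detect.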

Who this is for and tradeoffs

Great fit if you care about evaluating or fine-tuning conversational agents that must maintain evolving personalities (e.g., narrative NPCs, role-play assistants, long-form interactive storytelling). ArcANE is useful for researchers comparing context strategies, probing generalization, or training models to be arc-aware.

Look elsewhere if your target domain is short-form factual QA, task-oriented dialogue, or non-narrative conversational systems. ArcANE focuses on literary character trajectories, so its probes and segmentation assumptions may not map to transactional or domain-specific behaviors. Note also that automatic probes only approximate human judgements and may miss subtle cultural or interpretive aspects of characters, so human evaluation remains valuable for final assessment.

How it works (brief)

ArcANE constructs a Character Arc by segmenting a novel into phases along a psychological axis, then generates probes that present the same scenario in each of those phases. Probes include situations present in the source text as well as deliberately out-of-text scenarios that test generalization. The paper compares six models under six context modes (including arc conditioning and retrieval) and reports both benchmark results and the effect of fine-tuning on ArcANE, which produces the ArcANE-8B and ArcANE-32B models.
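
The phase-and-probe setup described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the class names (`Phase`, `CharacterArc`, `Probe`), the function `build_phase_prompts`, the prompt wording, and the choice to show only the phases up to the current one are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Phase:
    index: int      # position of the phase within the arc, starting at 1
    summary: str    # short description of the character's state in this phase


@dataclass(frozen=True)
class CharacterArc:
    character: str
    phases: Tuple[Phase, ...]


@dataclass(frozen=True)
class Probe:
    scenario: str   # the situation presented to the character
    in_text: bool   # True if the source text depicts it, False if out-of-text


def build_phase_prompts(arc: CharacterArc, probe: Probe) -> List[str]:
    """Return one prompt per phase, each posing the same scenario.

    Assumption: the arc context includes only phases up to the current one,
    so the model is asked how the character would act at that point.
    """
    prompts = []
    for current in arc.phases:
        seen = [p for p in arc.phases if p.index <= current.index]
        arc_context = "\n".join(f"Phase {p.index}: {p.summary}" for p in seen)
        prompts.append(
            f"You are {arc.character}, currently in phase {current.index} of your arc.\n"
            f"Arc so far:\n{arc_context}\n"
            f"Scenario: {probe.scenario}\n"
            "Respond as the character would at this point."
        )
    return prompts


# Hypothetical arc and out-of-text probe, invented for illustration.
arc = CharacterArc(
    character="Example Character",
    phases=(
        Phase(1, "Distrustful and self-reliant."),
        Phase(2, "Begins to rely on a small circle of allies."),
        Phase(3, "Openly protective of others."),
    ),
)
probe = Probe(scenario="A stranger asks for help late at night.", in_text=False)
prompts = build_phase_prompts(arc, probe)
```

Because the scenario is held fixed and only the phase context changes, any difference between the model's responses can be attributed to where the character is in the arc.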

Taken together, ArcANE shifts evaluation from static persona recall toward temporal alignment with a character's trajectory, and shows concrete gains for arc-aware conditioning and fine-tuning — especially when models must act beyond what the original text documents.
