Human mediation is a dynamic, trajectory-driven process: disputants' intentions, emotions, and context shift across turns, so scoring every turn, including off-topic ones, hides real progress. SoCRATES reframes evaluation by constructing multi-domain scenarios from real conflicts and scoring only the topic-localized turns that advance mediation, which yields higher evaluator alignment and a clearer diagnosis of social-adaptation gaps.
Key Findings
- Topic-localized evaluation improves signal: the paper's evaluator reaches 0.82 alignment with human experts and more than doubles the agreement of a naive per-turn baseline, so its scores better reflect whether a turn actually advances mediation.
- Broad, realistic testbeds reveal limited current capability: across eight benchmarked LLMs, the best mediator closes only about one third of the unmediated consensus gap, indicating substantial headroom for model improvements in social adaptation.
- Performance depends strongly on socio-cognitive axes: mediator performance varies sharply by strategic posture, party composition, history length, emotional reactivity, and cultural identity, implying that robustness to these axes is the key bottleneck for practical mediation.
- Agentic scenario pipeline: SoCRATES synthesizes scenarios from real conflicts and systematically varies domains and socio-cognitive conditions, enabling controlled, diverse stress tests rather than a few expert-authored cases.
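One way to read the "closes about one third of the gap" finding is as a normalized improvement over the unmediated outcome. The sketch below assumes a consensus score on a 0 to 1 scale and a ceiling of 1.0; the function name `gap_closure`, the scale, and the example numbers are illustrative assumptions, not the paper's definition or results.

```python
def gap_closure(unmediated: float, mediated: float, ceiling: float = 1.0) -> float:
    """Fraction of the unmediated consensus gap that a mediator closes.

    Assumed definition (hypothetical, not taken from the paper):
        (mediated - unmediated) / (ceiling - unmediated)
    """
    gap = ceiling - unmediated
    if gap <= 0:
        raise ValueError("unmediated consensus must be below the ceiling")
    return (mediated - unmediated) / gap


# Illustrative numbers only: parties reach 0.40 consensus on their own
# and 0.60 with a mediator, so the mediator closes 0.20 of a 0.60 gap.
closure = gap_closure(unmediated=0.40, mediated=0.60)
print(round(closure, 2))
```

Under this reading, a value near 0.33 means two thirds of the achievable improvement is still left on the table, which is the headroom the finding points to.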
Who It's For and Trade-offs
Great fit if you are developing or evaluating LLM-based social agents, researching human–AI mediation, or building benchmarks that stress social-adaptive behaviors. The suite helps diagnose which socio-cognitive axes cause breakdowns and where model improvements should focus. Look elsewhere if you need real-world field trials with live human subjects: SoCRATES is a simulation-derived benchmark, and it abstracts away some real-world noise and longitudinal effects that only deployment studies can capture.
Where It Fits
Unlike prior per-turn or small expert-case testbeds, SoCRATES emphasizes (1) realistic, agentically generated scenarios, (2) multi-axis socio-cognitive variation, and (3) topic-localized scoring. This makes it complementary to human-subject evaluations and useful as a reproducible stress-testing layer for mediator-capability research.
Methodological Notes
The benchmark includes eight domains, probes five adaptation axes (strategic posture, party composition, history length, emotional reactivity, cultural identity), and pairs the scenario suite with an evaluator trained to score only turns that move a topic forward. The paper reports evaluator vs. expert agreement, cross-model benchmarking, and per-axis breakdowns that expose where current LLMs fail to adapt socially.
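The difference between naive per-turn scoring and topic-localized scoring can be sketched as follows. This is a minimal illustration, assuming each turn already carries a topic label, an "advances the topic" flag, and a quality score; the `Turn` fields, function names, and numbers are hypothetical, and in the paper those judgments come from a trained evaluator rather than hand-set values.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Turn:
    topic: Optional[str]  # None marks an off-topic turn (assumed labeling)
    advances: bool        # True if the turn moves its topic forward
    score: float          # quality score in [0, 1] (assumed scale)


def per_turn_score(turns: List[Turn]) -> float:
    """Naive baseline: average the score of every turn, on-topic or not."""
    return mean(t.score for t in turns)


def topic_localized_scores(turns: List[Turn]) -> Dict[str, float]:
    """Average only the turns that advance a topic, grouped by topic."""
    by_topic: Dict[str, List[float]] = {}
    for t in turns:
        if t.topic is not None and t.advances:
            by_topic.setdefault(t.topic, []).append(t.score)
    return {topic: mean(scores) for topic, scores in by_topic.items()}


# Illustrative dialogue: off-topic and stalled turns drag the naive average down.
turns = [
    Turn("rent", True, 0.8),
    Turn(None, False, 0.1),      # small talk, off-topic
    Turn("rent", False, 0.2),    # on-topic but does not advance
    Turn("repairs", True, 0.6),
    Turn("rent", True, 0.6),
]
naive = per_turn_score(turns)
localized = topic_localized_scores(turns)
print(round(naive, 2), {k: round(v, 2) for k, v in localized.items()})
```

In this toy dialogue the naive average is pulled down by turns that were never meant to advance the dispute, while the localized scores isolate how well each topic was actually handled.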
