Most memory benchmarks check storage, retrieval, or update fidelity, but not how retrieved memories sway downstream decisions. MemSyco-Bench targets that gap: it measures when memory helps and when it pushes a model to over-align with the user (sycophancy), producing systematically wrong or biased outputs.
Key Findings
MemSyco-Bench organizes evaluation into five complementary tasks (Objective Fact Judgment, Contextual Scope Control, Memory–Evidence Conflict, Valid Memory Selection, Personalized Memory Use) and supplies 1,550 final samples with standardized scoring. It compares NoMemory, full-dialogue (RawDialogue), and multiple memory-system settings, and includes open-ended LLM judging and unified baseline adapters so researchers can isolate failure modes like stale, conflicting, or overgeneralized memories.
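To make the comparison across settings concrete, here is a minimal Python sketch of how per-task accuracy for a memory setting might be compared against the NoMemory baseline. It is not the repository's scoring code: the record layout (`setting`, `task`, `correct`), the function names `accuracy_by_setting` and `memory_effect`, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Return {(setting, task): accuracy} from scored records.

    Each record is assumed to be a dict with hypothetical keys
    'setting', 'task', and 'correct' (bool); this is not the
    benchmark's actual output schema.
    """
    totals = defaultdict(lambda: [0, 0])  # (setting, task) -> [n_correct, n_total]
    for r in records:
        key = (r["setting"], r["task"])
        totals[key][0] += int(r["correct"])
        totals[key][1] += 1
    return {key: c / n for key, (c, n) in totals.items()}


def memory_effect(acc, setting, baseline="NoMemory"):
    """Per-task accuracy difference: setting minus baseline.

    A negative value means the memory setting scored below the
    no-memory baseline on that task.
    """
    return {
        task: acc[(s, task)] - acc[(baseline, task)]
        for (s, task) in acc
        if s == setting and (baseline, task) in acc
    }


# Made-up toy records, only to show the shape of the computation.
toy_records = [
    {"setting": "NoMemory", "task": "Objective Fact Judgment", "correct": True},
    {"setting": "NoMemory", "task": "Objective Fact Judgment", "correct": True},
    {"setting": "RawDialogue", "task": "Objective Fact Judgment", "correct": True},
    {"setting": "RawDialogue", "task": "Objective Fact Judgment", "correct": False},
]

toy_acc = accuracy_by_setting(toy_records)
toy_effect = memory_effect(toy_acc, "RawDialogue")
```

Reporting the difference per task rather than a single pooled score keeps the five tasks separate, so a gain on one task cannot mask a loss on another.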
Who it's for and trade-offs
MemSyco-Bench is a good fit if you build or evaluate memory-augmented agents and need targeted tests for preference-driven failures or personalization harms. The benchmark surfaces when retrieved memory helps and when it should be ignored, but it is specialized: it emphasizes the preference-related and decision-making effects of memory rather than exhaustively measuring all memory competencies (e.g., capacity, low-level retrieval latency). Expect to complement it with other benchmarks for throughput, long-range factual retention, or systems-level deployment metrics.
Where it fits
Use MemSyco-Bench to compare memory extraction strategies, retrieval formats, and mitigation techniques (e.g., richer context extraction or summarization) when your agent must avoid blindly echoing user beliefs. The repository includes evaluation scripts, baseline adapters, and leaderboards to facilitate reproducible comparisons.
