MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Decouples perception and reasoning for hours-long videos by streaming inputs into a three-tier Hierarchical Graph Memory and using an agentic Observation–Reason–Action retrieval loop; reduces reasoning context to ~2% of full video while improving benchmark accuracy.

Introduction

Most video understanding approaches try to cram the entire visual stream into a single reasoning context, which explodes token counts and dilutes attention on multi-hour inputs. MemDreamer takes the opposite tack: it incrementally abstracts streamed frames into a three-tier Hierarchical Graph Memory and treats downstream understanding as an agentic retrieval problem rather than blind full-context attention. This reframing is the paper's core insight and the reason it scales to hours-long videos without linear token growth.

Key Findings
  • Hierarchical Graph Memory: MemDreamer builds a top-down, three-tier graph (foundational spatiotemporal/causal core plus higher semantic abstractions) as the video streams, enabling compact, structured representations that preserve relations across long timescales. So what: long-range dependencies are kept in graph edges rather than raw tokens, avoiding token explosion.
  • Agentic retrieval loop: During inference, a reasoning model performs Observation → Reason → Action cycles, using tool-augmented retrieval to search nodes and traverse logical edges. So what: reasoning focuses only on nodes relevant to the current query, constraining the context window to ~2% of what full-video ingestion would require while keeping the necessary evidence.
  • Empirical gains: The paper reports state-of-the-art results on four mainstream long-video benchmarks, with an absolute accuracy gain of 12.5 points and a remaining gap to human experts of ~3.7 points. So what: the method materially narrows the human–model gap on long-video logic tasks while being far more context-efficient.
  • Broader correlation: Analysis in the paper finds a strong positive linear correlation between a VLM's logic-reasoning ability and long-video understanding performance, suggesting that scaling agentic capabilities (retrieval + action loops) is a promising path for multimodal comprehension.
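
To make the graph-memory idea concrete, here is a minimal Python sketch of a tiered memory that is filled as clips stream in. It is an illustration under stated assumptions, not the paper's implementation: the tier names, the `Node` and `GraphMemory` classes, the `ingest_event` helper, and the relation labels (`next`, `part_of`, `causes`) are all hypothetical, since the summary only says the memory has three tiers with a spatiotemporal/causal core.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical tier names: the summary only states that there are three tiers.
TIERS = ("event", "scene", "summary")


@dataclass
class Node:
    node_id: str
    tier: str        # one of TIERS
    text: str        # abstracted description of a video span
    start_s: float   # span start, in seconds
    end_s: float     # span end, in seconds


@dataclass
class GraphMemory:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=lambda: defaultdict(list))
    _last_event: Optional[str] = None

    def add_node(self, node: Node) -> None:
        if node.tier not in TIERS:
            raise ValueError(f"unknown tier: {node.tier}")
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].append((relation, dst))

    def neighbors(self, node_id: str, relation: Optional[str] = None) -> list:
        # Return destination ids, optionally filtered by relation label.
        return [dst for rel, dst in self.edges.get(node_id, [])
                if relation is None or rel == relation]

    def ingest_event(self, node_id, text, start_s, end_s,
                     parent=None, caused_by=None) -> None:
        # Add one lowest-tier node and link it temporally, hierarchically,
        # and (if known) causally.
        self.add_node(Node(node_id, "event", text, start_s, end_s))
        if self._last_event is not None:
            self.add_edge(self._last_event, "next", node_id)
        if parent is not None:
            self.add_edge(node_id, "part_of", parent)
        if caused_by is not None:
            self.add_edge(caused_by, "causes", node_id)
        self._last_event = node_id


# Toy stream: one scene node and two events that arrive in order.
memory = GraphMemory()
memory.add_node(Node("s0", "scene", "making tea in the kitchen", 0.0, 50.0))
memory.ingest_event("e0", "person fills kettle", 0.0, 20.0, parent="s0")
memory.ingest_event("e1", "kettle boils", 20.0, 50.0, parent="s0", caused_by="e0")
```

In this toy version the long-range relation between the two events lives in a single `causes` edge, so a later query can follow that edge instead of re-reading the frames in between.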

Who it's for and trade-offs

Great fit if you need to scale logical or causal reasoning across multi-hour video inputs (e.g., long-form QA, summarization, forensic timeline reconstruction) and want to avoid full-sequence attention. It benefits teams building multimodal systems that can incorporate retrieval tools and graph-structured memories. Look elsewhere if your use case is single-shot short clips where end-to-end attention is simpler and latency is the main constraint. MemDreamer adds architectural complexity (memory construction, graph maintenance, agentic control) and relies on effective node abstraction/annotation to avoid garbage-in, garbage-out problems.

Where it fits

MemDreamer sits between pure token-based VLMs (which struggle with very long contexts) and heavy offline summarization pipelines: it preserves causal/spatiotemporal structure for long horizons while enabling focused, tool-augmented reasoning. For projects prioritizing correctness over minimal system complexity, it’s a pragmatic compromise.

Brief method note

The system streams video to incrementally build node/edge representations at multiple abstraction levels, then lets a reasoning model query that graph using an Observation–Reason–Action loop and retrieval tools (search, traversal) rather than ingesting raw frames or flattened token sequences for entire videos.
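
The query side of this description can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: the toy `NODES` and `EDGES` tables, the `search` and `traverse` tools, and especially `reason`, which is a hard-coded rule in place of the reasoning model. The sketch only shows the shape of an Observation–Reason–Action loop over a graph and makes no claim about the paper's actual tools or prompts.

```python
# Toy graph: node id -> text, and (source id, relation) -> destination ids.
NODES = {
    "e0": "person fills kettle",
    "e1": "kettle boils",
    "e2": "person pours tea",
}
EDGES = {
    ("e0", "causes"): ["e1"],
    ("e1", "causes"): ["e2"],
}


def search(keyword):
    """Tool: ids of nodes whose text contains the keyword."""
    return [nid for nid, text in NODES.items() if keyword.lower() in text.lower()]


def traverse(node_id, relation):
    """Tool: follow one labeled edge from a node."""
    return EDGES.get((node_id, relation), [])


def reason(keyword, observations):
    """Stand-in for the reasoning model: choose the next action."""
    if not observations:
        return ("search", keyword)
    last_tool, last_result = observations[-1]
    if last_tool == "search" and last_result:
        return ("traverse", last_result[0], "causes")
    return ("answer", last_result)


def run_loop(keyword, max_steps=5):
    observations = []   # what the reasoner has seen so far
    visited = set()     # node ids that ever entered the context
    for _ in range(max_steps):
        action = reason(keyword, observations)          # Reason
        if action[0] == "answer":
            return [NODES[n] for n in action[1]], visited
        if action[0] == "search":                       # Action
            result = search(action[1])
        else:
            result = traverse(action[1], action[2])
        visited.update(result)
        observations.append((action[0], result))        # Observation
    return [], visited


# "What followed from the kettle boiling?" reduced to a keyword for the toy search.
answer, visited = run_loop("boils")
print(answer, sorted(visited))
```

Only the nodes returned by a tool call enter `visited`, so the node `e0` is never read for this query. That is the mechanism behind the small reasoning context, shown here at toy scale.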

Information

  • Website: arxiv.org
  • Authors: Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen
  • Published date: 2026/06/05