Long egocentric videos expose two surprising problems for Retrieval-Augmented Generation (RAG): many existing benchmarks let models answer queries without consulting video evidence, and prior methods force a single modality/granularity configuration per query even though relevance varies across chunks. The first problem hides retrieval errors; the second limits generator performance.
Key Findings
- Benchmarking gap: V-RAGBench reframes evaluation as ⟨query, evidence chunk, answer⟩ triplets so retrieval and generation can be measured separately; this reveals retrieval failures that prior benchmarks mask. This matters because improving retrieval does not always change downstream metrics when answers can be guessed from query priors.
- Chunk-adaptive retrieval helps: CARVE runs multiple retrievers across modality and temporal-granularity configurations in parallel, then applies chunk-level reranking to pick a winning configuration per chunk. As a result, the evidence fed to the generator interleaves different configurations instead of forcing a single global choice, which yields better end-to-end QA performance.
- Practical win: CARVE outperforms eight recent VideoRAG baselines on the proposed benchmark, showing that per-chunk configuration selection (not just better embeddings or larger context windows) is a key lever for long-video RAG.
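To make the decoupled evaluation in the first finding concrete, here is a minimal Python sketch of scoring retrieval and generation separately over ⟨query, evidence chunk, answer⟩ triplets. It is an illustration, not V-RAGBench's actual protocol: the `Triplet` class, the `decoupled_scores` function, the recall@k and exact-match metrics, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    query: str
    evidence_chunk: str  # ID of the chunk that contains the answer (hypothetical schema)
    answer: str


def decoupled_scores(triplets, retrieved, predicted, k=5):
    """Score retrieval and generation separately.

    retrieved: query -> ranked list of chunk IDs returned by the retriever
    predicted: query -> answer string produced by the generator
    """
    n = len(triplets)
    hits = correct = correct_without_evidence = 0
    for t in triplets:
        hit = t.evidence_chunk in retrieved[t.query][:k]
        # Exact match after normalization; an assumed, deliberately simple metric.
        ok = predicted[t.query].strip().lower() == t.answer.strip().lower()
        hits += hit
        correct += ok
        # Right answer although the evidence chunk was never retrieved:
        # a sign the generator may be relying on priors.
        correct_without_evidence += ok and not hit
    return {
        "retrieval_recall_at_k": hits / n,
        "answer_accuracy": correct / n,
        "correct_without_evidence": correct_without_evidence / n,
    }


# Toy data, invented for illustration only.
triplets = [
    Triplet("q1", "c3", "mug"),
    Triplet("q2", "c7", "left drawer"),
    Triplet("q3", "c1", "twice"),
    Triplet("q4", "c9", "red"),
]
retrieved = {"q1": ["c3", "c4"], "q2": ["c2", "c5"], "q3": ["c1"], "q4": ["c8"]}
predicted = {"q1": "mug", "q2": "Left drawer", "q3": "once", "q4": "blue"}

scores = decoupled_scores(triplets, retrieved, predicted, k=5)
print(scores)
```

In this toy run, `q2` is answered correctly even though its evidence chunk was never retrieved. An answer-only metric would count that case as a success, while the separate `correct_without_evidence` rate flags it.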
Who It's For and Trade-offs
Great fit if you work on multimodal QA or VideoRAG for long/egocentric footage and need faithful evaluation of retrieval vs. generation. The contributions are primarily methodological and empirical: a diagnostic benchmark (V-RAGBench) and a retrieval architecture (CARVE) that is compatible with existing generators. Look elsewhere if your application is short, fully scripted video QA or if you cannot afford the runtime cost of running multiple retrievers in parallel. CARVE improves fidelity at the cost of more retrieval compute and the engineering needed to manage multi-configuration reranking.
Where It Fits
This paper slots between work that expands RAG beyond text and methods that focus on long-context LMMs: instead of only scaling context windows, it argues for smarter selection and representation of video evidence. It complements graph- or event-based approaches (which model temporal structure) by addressing per-chunk modality/granularity choices and by providing a benchmark that better isolates retrieval performance.
High-level Method Overview
CARVE: (1) enumerate a set of retrieval configurations (different modalities and temporal granularities), (2) run parallel retrievers to fetch candidate chunks under each configuration, (3) apply chunk-adaptive reranking to choose the best configuration per chunk, and (4) supply the generator with interleaved evidence reflecting those chunk-level choices.
V-RAGBench: curated triplets that enable decoupled evaluation of retrieval accuracy and generation correctness, highlighting cases where generator answers rely on priors rather than retrieved evidence.
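The four CARVE steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the `Config` and `Candidate` classes, the `chunk_adaptive_retrieve` function, the modality names, and the rule of keeping the highest-scoring configuration per chunk are hypothetical stand-ins, since this summary does not specify the actual retrievers or reranker.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    modality: str       # e.g. "frames" or "captions" (illustrative names)
    granularity_s: int  # temporal window in seconds (illustrative)


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    config: Config
    content: str  # the chunk as represented under this configuration


def chunk_adaptive_retrieve(query, configs, retrieve, score, top_k=3):
    """retrieve(query, config) -> list of Candidate; score(query, candidate) -> float."""
    # Step 2: run one retriever per configuration in parallel.
    with ThreadPoolExecutor() as pool:
        per_config = list(pool.map(lambda c: retrieve(query, c), configs))

    # Step 3: for each chunk, keep the configuration with the highest score
    # (an assumed reranking rule).
    best = {}
    for candidates in per_config:
        for cand in candidates:
            s = score(query, cand)
            if cand.chunk_id not in best or s > best[cand.chunk_id][0]:
                best[cand.chunk_id] = (s, cand)

    # Step 4: rank chunks by their winning score; the returned evidence
    # can mix configurations from one chunk to the next.
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in ranked[:top_k]]


# Step 1: enumerate configurations (toy values, invented for illustration).
frames, captions = Config("frames", 5), Config("captions", 30)
toy_scores = {
    ("c1", frames): 0.9, ("c2", frames): 0.2,
    ("c1", captions): 0.4, ("c2", captions): 0.7, ("c3", captions): 0.5,
}


def toy_retrieve(query, config):
    return [
        Candidate(cid, cfg, f"{cid} as {cfg.modality}@{cfg.granularity_s}s")
        for (cid, cfg) in toy_scores
        if cfg == config
    ]


def toy_score(query, cand):
    return toy_scores[(cand.chunk_id, cand.config)]


evidence = chunk_adaptive_retrieve(
    "Where did I leave the keys?", [frames, captions], toy_retrieve, toy_score
)
for cand in evidence:
    print(cand.chunk_id, cand.config.modality, cand.config.granularity_s)
```

In the toy data, chunk `c1` is best served by the frame-level configuration while `c2` and `c3` are best served by captions, so the evidence list interleaves configurations rather than committing to one global choice. A thread pool is used only because real retrievers are typically I/O- or service-bound; a different parallelism strategy could be substituted.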
