Why this matters
Video models frequently succeed via superficial textual shortcuts and struggle on questions that demand domain knowledge or multi-step reasoning. VideoKR targets that gap by providing a deliberately curated, large-scale training corpus and an expert-annotated evaluation benchmark designed to push models beyond surface cues and toward genuine knowledge- and reasoning-driven video understanding. (arxiv.org)
Key findings
- A purpose-built corpus: VideoKR contains 315K video reasoning examples drawn from 145K newly collected, CC-licensed expert-domain videos. The scale and domain focus are chosen to expose models to real-world, knowledge-rich scenarios rather than routine web-video narration. (arxiv.org)
- Human-in-the-loop CoT rationales: Examples include chain-of-thought style rationales produced by a skill-oriented generation pipeline, improving the signal for multi-step reasoning during post-training. (arxiv.org)
- Evaluation that penalizes shortcuts: The paper introduces VideoKR-Eval, an expert-annotated benchmark whose questions require genuine video understanding and knowledge-intensive reasoning rather than rewarding textual shortcuts, and which reveals the improvements from targeted post-training. (arxiv.org)
- Measured impact: Under a standard SFT→GRPO pipeline (supervised fine-tuning followed by Group Relative Policy Optimization), models post-trained on VideoKR show gains on knowledge-intensive video reasoning while staying competitive on broader video reasoning tasks, which highlights data design as a lever for progress. (arxiv.org)
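For readers unfamiliar with the GRPO stage mentioned in the last finding, its central idea can be sketched in a few lines: several responses are sampled for the same prompt, each is scored, and every score is normalized against its own group. This is a generic illustration of that normalization, not the paper's training code. The function name, the example rewards, and the use of the population standard deviation are all assumptions made for the sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of responses sampled for ONE prompt.

    Hypothetical helper for illustration: each reward is centered on the
    group mean and scaled by the group standard deviation, so a response
    is judged relative to its siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four sampled answers to one video question,
# scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative, which is the signal the policy update then uses. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.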
Who it's for and tradeoffs
Great fit if you are training or fine-tuning video-language models and want to improve domain knowledge and multi-step reasoning (e.g., scientific, instructional, or expert-domain video tasks). VideoKR is a better starting point than generic web-video corpora when your primary failure mode is knowledge gaps or shortcut exploitation. (arxiv.org)
Look elsewhere if you need a dataset focused on casual or social video captioning, or if compute or budget constraints rule out additional post-training. The corpus pursues stronger reasoning through scale and annotation effort, which implies extra training cost and annotation complexity compared to lightweight benchmarks.
