VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Large-scale training corpus for knowledge- and reasoning-intensive video understanding: 315K video reasoning examples over 145K CC-licensed expert-domain videos, with human-in-the-loop chain-of-thought rationales to strengthen post-training for video reasoning. ([arxiv.org](https://arxiv.org/abs/2606.05259))


Introduction

Why this matters

Video models frequently succeed via superficial textual shortcuts and struggle on questions that demand domain knowledge or multi-step reasoning. VideoKR targets that gap by providing a deliberately curated, large-scale training corpus and an expert-annotated evaluation benchmark designed to push models beyond surface cues and toward genuine knowledge- and reasoning-driven video understanding. (arxiv.org)

Key Findings

A purpose-built corpus: VideoKR contains 315K video reasoning examples drawn from 145K newly collected, CC-licensed expert-domain videos — the scale and domain focus are chosen to expose models to real-world, knowledge-rich scenarios rather than routine web-video narration. (arxiv.org)
Human-in-the-loop CoT rationales: Examples include chain-of-thought style rationales produced by a skill-oriented generation pipeline, improving the signal for multi-step reasoning during post-training. (arxiv.org)
Evaluation that penalizes shortcuts: The paper introduces VideoKR-Eval, an expert-annotated benchmark whose questions require genuine video understanding and knowledge-intensive reasoning and cannot be answered through textual shortcuts. The benchmark is used to show the improvements from targeted post-training. (arxiv.org)
Measured impact: Under a standard SFT→GRPO pipeline, models post-trained on VideoKR show gains on knowledge-intensive video reasoning while staying competitive on broader video reasoning tasks — highlighting data design as a lever for progress. (arxiv.org)
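
To make the GRPO half of the SFT→GRPO pipeline concrete, the sketch below shows the group-relative advantage step that gives GRPO its name: several answers are sampled for one question, each is scored, and every score is normalized against its own group. This illustrates the general technique only. The summary above does not describe the reward function or training code used for VideoKR, so the binary reward values, the function name `group_relative_advantages`, the `eps` constant, and the use of population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled answer's reward against its own group.

    Hypothetical helper: answers that beat the group average get a positive
    advantage, answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every answer gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed rewards for four sampled answers to one video question:
# 1.0 if the answer matches the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a question on which every sampled answer is right (or every one is wrong) contributes almost no learning signal, which is one reason the difficulty of the training questions matters in this kind of pipeline.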

Who it's for and tradeoffs

Great fit if you are training or fine-tuning video-language models and want to improve domain knowledge and multi-step reasoning (e.g., scientific, instructional, or expert-domain video tasks). VideoKR is a better starting point than generic web-video corpora when your primary failure mode is knowledge gaps or shortcut exploitation. (arxiv.org)

Look elsewhere if you need a dataset focused on casual or social video captioning, or if compute or budget constraints rule out additional post-training. The corpus pursues stronger reasoning through scale and annotation effort, which means extra training cost and annotation complexity compared with lightweight benchmarks.


Information

Website: arxiv.org
Authors: Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao
Published date: 2026/06/03
