AIAny - AI Video Papers

AI Video Papers | 2025

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li +7 | Google DeepMind

Argues a single web-scale generative video model handles vision tasks zero-shot the way LLMs handle language. Probes Veo 3 on segmentation, edge detection, image editing, physical and affordance reasoning, and puzzles like maze solving and symmetry.

Tags: video, vision, LLM, paper, ai-video (+3 more)

AI Video Papers | 2026

Video-Oasis: Rethinking Evaluation of Video Understanding

Geuntaek Lim, Sungjune Park +6

Provides a diagnostic suite that audits video-understanding benchmarks to find samples solvable without visual or temporal input, filters those shortcuts, and produces a distilled video-native testbed that reveals major capability gaps in current Video-LLMs.

Tags: video, evaluation, ai-video, vision, multimodal (+3 more)

AI Video Papers | 2026

EarlyTom: Early Token Compression Completes Fast Video Understanding

Hesong Wang, Xin Jin +5

Performs training-free, early-stage visual token compression inside the vision encoder to cut time-to-first-token (TTFT) and FLOPs for Video-LLMs. Introduces a decoupled spatial token selection strategy and reports up to 2.65× TTFT reduction and 61% FLOPs savings on LLaVA-OneVision-7B (NVIDIA A100) while preserving full-token accuracy. Aimed at latency-sensitive video understanding.

Tags: video, vision, ai-video, multimodal, llm (+3 more)
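
The two headline figures in the EarlyTom summary use different conventions (a speedup factor and a percentage saving), so it can help to put them on the same scale. The sketch below assumes "2.65× TTFT reduction" means baseline TTFT divided by compressed TTFT, and "61% FLOPs savings" means the compressed model uses 39% of the baseline FLOPs. Both readings are assumptions about the summary's wording, and the function names are illustrative only.

```python
def remaining_fraction(speedup: float) -> float:
    """Fraction of the baseline cost left after an N-times reduction (assumed reading)."""
    return 1.0 / speedup


def equivalent_speedup(savings: float) -> float:
    """Speedup factor implied by saving a given fraction of the baseline cost."""
    return 1.0 / (1.0 - savings)


# Figures quoted in the summary above; their interpretation is an assumption.
ttft_remaining = remaining_fraction(2.65)   # about 0.38 of baseline TTFT
flops_factor = equivalent_speedup(0.61)     # about 2.56x fewer FLOPs

print(f"TTFT after compression: {ttft_remaining:.1%} of baseline")
print(f"FLOPs reduction factor: {flops_factor:.2f}x")
```

Under these assumptions the two numbers describe reductions of similar size: roughly 38% of baseline latency remains, against 39% of baseline compute.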

AI Video Papers | 2026

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Yuyang Zhao, Yicheng Pan +7

Enables real-time streaming video-to-video editing (1280×704 at 24 FPS) on a single RTX 5090 GPU. Uses a Hybrid Diffusion Transformer for balanced local/global modeling, Cycle-Reverse Regularization for temporal consistency, and system-level mixed-precision and fused kernels to maximize throughput.

Tags: video, ai-video, vision, transformers, nvidia (+2 more)

AI Video Papers | 2026

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan +8

Decouples perception and reasoning for hours-long videos by streaming inputs into a three-tier Hierarchical Graph Memory and using an agentic Observation–Reason–Action retrieval loop. Reduces the reasoning context to about 2% of the full video while improving benchmark accuracy.

Tags: paper, ai-video, multimodal, GNN, agent-skills (+3 more)

AI Video Papers | 2026

Latent Spatial Memory for Video World Models

Weijie Wang, Haoyu Zhao +8

Stores a persistent 3D scene cache directly in a diffusion model's latent space to produce temporally and spatially consistent videos. Constructs the memory via depth-guided back-projection and queries it with direct latent-space warping, achieving large speed and memory gains over pixel-space 3D baselines.

Tags: video, vision, depth, ai-video, paper (+1 more)

AI Video Papers | 2026

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Dingyu Yao, Junhao Zhou +13 | Joy Future Academy, JD

Continuously watches live video and autonomously decides each second whether to speak, stay silent, or delegate. Released together with an 8B vision-first model, time-aligned interaction data, a training recipe, and a deployable real-time system. Designed for vision-triggered, low-latency streaming scenarios and evaluated across six real-world streams.

Tags: video, vision, multimodal, vllm, ai-agent (+3 more)

AI Video Papers | 2026

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu, Shujuan Li +9

Encodes and clones camera motion from reference videos to generate multi-shot videos. Uses a visual "camera grid" to represent camera parameters, trains on million-scale grid–video pairs, and employs a hierarchical prompt-expansion agent to coordinate camera, subject, and action control for multimodal diffusion models.

Tags: video, multimodal, ai-video, vision, prompt-engineering (+2 more)

AI Video Papers | 2026

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Yuho Lee, Jisu Shin +6 | KAIST, Qualcomm AI Research (Qualcomm Korea)

Proposes chunk-level multimodal retrieval and chunk-adaptive reranking for retrieval-augmented generation on long egocentric videos. Introduces V-RAGBench to evaluate retrieval and generation separately, and CARVE to run parallel retrievers and select per-chunk configurations.

Tags: RAG, video, multimodal, evaluation, vision (+2 more)

AI Video Papers | 2026

DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX Team, Yancheng Bai +21

Generates controllable long-horizon video from text or image prompts, supporting camera navigation, revisits, and promptable events across photorealistic and stylized domains. Introduces camera-aware positional encoding (E-PRoPE), memory-conditioned scene persistence, causal-forcing distillation, and RL alignment to retain camera control and reduce drift.

Tags: video, vision, multimodal, RL, paper (+2 more)

AI Video Papers | 2026

TurboServe: Serving Streaming Video Generation Efficiently and Economically

Youhe Jiang, Haoxu Wang +6 | Shanghai Jiao Tong University, Shengshu Technology +1

Serves interactive, long-lived streaming video-generation sessions by jointly scheduling session placement and GPU autoscaling to meet tight per-chunk latency targets. Combines migration-aware placement, load-driven autoscaling, coalesced chunk processing, GPU–CPU offloading, and NCCL GPU–GPU migration; reports reductions of about 37% in worst-case per-chunk latency and GPU operating cost.

Tags: video, ai-video, ai-serving, ai-inference, MLOps (+4 more)

AI Video Papers | 2026

Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning

Wenzheng Zeng, Siyi Jiao +3 | National University of Singapore

Generates temporally grounded captions for dense multi-event videos by restructuring autoregressive token dependencies to enable lossless parallel decoding. Introduces a latent global planning module and event-factorized parallel decoding to improve grounding accuracy and achieve large decoding speedups.

Tags: video, multimodal, ai-video, LLM, paper (+2 more)

