AI Video Papers, 2026

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Continuously watches live video and autonomously decides each second whether to speak, stay silent, or delegate; released together with an 8B vision-first model, time-aligned interaction data, training recipe, and a deployable real-time system. Designed for vision-triggered, low-latency streaming scenarios and evaluated across six real-world streams.

Introduction

Real-world moments are fleeting and often require an assistant to act without being explicitly prompted. This work reframes video-language agents from turn-based responders into always-present observers: the model watches a live stream and decides each second whether to speak, remain silent, or hand the task to a background model.

Key Findings

Vision-first, second-by-second decision policy: the model is trained to make a discrete choice every second (speak / silent / delegate), which improves timing and reduces irrelevant chatter in continuous streams.
Open-stack release: the authors provide an 8B-scale VL-interaction model plus the training recipe, over four million time-aligned clips labeled at one-second granularity, and a complete deployable system (ASR/TTS, memory, UI, background brain) so others can reproduce and extend the setup.
Efficiency for long streams: a predictive video codec and streaming design keep token growth and latency low over hours of video, enabling sub-second responsiveness in practical deployments.
Empirical preference gains: in six real-world streaming scenarios, human raters preferred this approach over in-app video-call assistants (Doubao, Gemini) by a wide margin on both quality and timing.
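The per-second speak / silent / delegate choice described above can be pictured as a simple control loop. The following is a minimal sketch only, not the released system: the names `Action`, `Decision`, `run_stream`, and `toy_policy` are hypothetical, frames are stand-in string labels rather than video, and the toy policy replaces the 8B model with hard-coded rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Action(Enum):
    SPEAK = "speak"
    SILENT = "silent"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # utterance for SPEAK, task description for DELEGATE


# Hypothetical policy signature: (second index, frame) -> Decision.
Policy = Callable[[int, object], Decision]


def run_stream(frames: Iterable[object], policy: Policy) -> Tuple[List[str], List[str]]:
    """Apply the policy once per one-second frame; return (spoken, delegated)."""
    spoken: List[str] = []
    delegated: List[str] = []
    for second, frame in enumerate(frames):
        decision = policy(second, frame)
        if decision.action is Action.SPEAK:
            spoken.append(f"[{second}s] {decision.text}")
        elif decision.action is Action.DELEGATE:
            delegated.append(f"[{second}s] {decision.text}")
        # Action.SILENT: nothing is emitted for this second.
    return spoken, delegated


def toy_policy(second: int, frame: object) -> Decision:
    """Stand-in for the model: keys off string labels instead of pixels."""
    if frame == "kettle_boiling":
        return Decision(Action.SPEAK, "The kettle is boiling.")
    if frame == "unknown_plant":
        return Decision(Action.DELEGATE, "Identify the plant in view.")
    return Decision(Action.SILENT)


frames = ["idle", "kettle_boiling", "idle", "unknown_plant"]
spoken, delegated = run_stream(frames, toy_policy)
```

In this toy run the loop stays silent on the two idle seconds, speaks at second 1, and queues a background task at second 3. Silence is an explicit outcome of every step rather than the absence of a prompt, which is the contrast with turn-based responders that the summary draws.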

Who it's for and trade-offs

Great fit if you need an always-present video assistant that proactively points out timely events (surveillance alerts, livestream commentary, meeting highlights) and you value reproducibility (open weights, data, and recipe). Look elsewhere if you require models trained under strict third-party affiliations or regulated datasets, or if you cannot host an 8B-scale model and the surrounding streaming infrastructure. The released stack prioritizes vision-driven proactivity and deployability over tightly integrated end-to-end speech fusion, keeping ASR/TTS and background agents pluggable.

Information

Website: arxiv.org
Organizations: Joy Future Academy, JD
Authors: Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie …
Published date: 2026/06/10

More Items

AI Video Papers, 2026

VideoCoCo: Code-as-CoT for Physically-Consistent Video Generation via an Agentic Dual-Engine System

Haodong Li, Tianfei Ren +26

Converts text prompts into physically consistent videos by synthesizing executable Blender programs as a process-level chain-of-thought and using a dual-engine pipeline (deterministic simulation draft + draft-conditioned video editor). Ships with a VideoCoCo-3K draft–instruction–target dataset and shows substantial gains in physical-consistency benchmarks.

video ai-video code coding coding-agents +5

Machine Learning Foundation Papers, 2026

Metis: Memory Foundation Model

Zeyu Zhang, Ziliang Guo +15

Presents Metis, a prototype memory foundation model that embeds a persistent native memory state into the backbone so historical experience is compressed and accessed via memory attention. Key features: forward-only, gradient-free online memory updates; memory-specific mid-training objectives; and a dual text/code memory design.

foundation llm ai-agent agent-skills multimodal +3

Computer Vision Papers, 2026

PhiZero: A World Model Built Around Physical Language

Shuyao Shang, Yuqi Wang +5

Learns a discrete “physical language” from unlabeled videos and uses a reason-then-render pipeline: predict compact state-transition tokens, then decode them into future video. Separates dynamics inference from pixel synthesis to improve physical fidelity, controllable simulation, and zero-shot motion transfer.

paper video vision physics ai-video +4