JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

JoyAI-VL-Interaction continuously watches live video and autonomously decides each second whether to speak, stay silent, or delegate. The release includes an 8B vision-first model, time-aligned interaction data, the training recipe, and a deployable real-time system. It is designed for vision-triggered, low-latency streaming scenarios and was evaluated across six real-world streams.

Introduction

Real-world moments are fleeting and often require an assistant to act without being explicitly prompted. This work reframes video-language agents from turn-based responders into always-present observers: the model watches a live stream and decides each second whether to speak, remain silent, or hand the task to a background model.
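The per-second decision loop described above can be illustrated with a minimal sketch. Nothing here comes from the released code: the `Action` enum, the `Decision` dataclass, `toy_policy`, and `run_stream` are hypothetical names, frames are represented as short text descriptions rather than images, and a keyword rule stands in for the trained model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Action(Enum):
    SPEAK = "speak"
    SILENT = "silent"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # utterance for SPEAK, task description for DELEGATE


# Hypothetical policy signature: (second index, frame description) -> Decision.
Policy = Callable[[int, str], Decision]


def toy_policy(second: int, frame: str) -> Decision:
    """Stand-in for the model: keyword rules instead of a learned policy."""
    if "spill" in frame:
        return Decision(Action.SPEAK, "Careful, something just spilled.")
    if "recipe" in frame:
        return Decision(Action.DELEGATE, "Look up the recipe shown on screen.")
    return Decision(Action.SILENT)


def run_stream(frames: Iterable[str], policy: Policy) -> List[Tuple[int, Decision]]:
    """Query the policy once per second and keep only non-silent decisions."""
    events = []
    for second, frame in enumerate(frames):
        decision = policy(second, frame)
        if decision.action is not Action.SILENT:
            events.append((second, decision))
    return events


# One entry per second of (simulated) video.
frames = ["empty kitchen", "pot on stove", "spill on counter", "recipe card visible"]
events = run_stream(frames, toy_policy)
for second, decision in events:
    print(f"t={second}s {decision.action.value}: {decision.text}")
```

The point of the sketch is that silence is an explicit output of the policy at every second, so the caller never has to prompt for a response.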

Key Findings
  • Vision-first, second-by-second decision policy: the model is trained to make a discrete choice every second (speak / silent / delegate), which improves timing and reduces irrelevant chatter in continuous streams.
  • Open-stack release: the authors provide an 8B-scale VL-interaction model plus the training recipe, over four million time-aligned clips labeled at one-second granularity, and a complete deployable system (ASR/TTS, memory, UI, background brain) so others can reproduce and extend the setup.
  • Efficiency for long streams: a predictive video codec and streaming design keep token growth and latency low over hours of video, enabling sub-second responsiveness in practical deployments.
  • Empirical preference gains: in six real-world streaming scenarios, human raters preferred this approach over in-app video-call assistants (Doubao, Gemini) by a wide margin on both quality and timing.

Who it's for and trade-offs

Great fit if you need an always-present video assistant that proactively points out timely events (surveillance alerts, livestream commentary, meeting highlights) and you value reproducibility (open weights, data, and recipe). Look elsewhere if you require models trained under strict third-party affiliations or regulated datasets, or if you cannot host an 8B-scale model and the surrounding streaming infrastructure. The released stack prioritizes vision-driven proactivity and deployability over tightly integrated end-to-end speech fusion, keeping ASR/TTS and background agents pluggable.
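One way to picture what "pluggable" could mean in practice is a runtime that depends only on small interfaces, so speech and background components can be swapped without touching the decision model. The sketch below is an assumption for illustration, not the released system's API: `SpeechSynthesizer`, `BackgroundAgent`, `FakeTTS`, `QueueAgent`, and `InteractionRuntime` are all hypothetical names, and an ASR component would follow the same pattern.

```python
from typing import List, Protocol


class SpeechSynthesizer(Protocol):
    """Hypothetical interface for any TTS backend."""

    def synthesize(self, text: str) -> bytes: ...


class BackgroundAgent(Protocol):
    """Hypothetical interface for the slower background model."""

    def submit(self, task: str) -> None: ...


class FakeTTS:
    """Test double: returns the UTF-8 bytes of the text instead of audio."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class QueueAgent:
    """Test double: records delegated tasks in a list."""

    def __init__(self) -> None:
        self.tasks: List[str] = []

    def submit(self, task: str) -> None:
        self.tasks.append(task)


class InteractionRuntime:
    """Routes model outputs to whichever components were plugged in."""

    def __init__(self, tts: SpeechSynthesizer, agent: BackgroundAgent) -> None:
        self.tts = tts
        self.agent = agent

    def speak(self, text: str) -> bytes:
        return self.tts.synthesize(text)

    def delegate(self, task: str) -> None:
        self.agent.submit(task)


agent = QueueAgent()
runtime = InteractionRuntime(FakeTTS(), agent)
audio = runtime.speak("Door opened")
runtime.delegate("Summarize the last minute")
```

Because the runtime only relies on structural typing, replacing `FakeTTS` with a real synthesizer requires no change to `InteractionRuntime`.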

Information

  • Website: arxiv.org
  • Organizations: Joy Future Academy, JD
  • Authors: Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie
  • Published date: 2026/06/10