Video Models are Zero-Shot Learners and Reasoners
Introduction
The field of artificial intelligence has witnessed a paradigm shift with the advent of Large Language Models (LLMs), which moved natural language processing from narrow, task-specific architectures to versatile, generalist foundation models. This shift was driven by a simple recipe: training large generative models on vast, web-scale datasets. Intriguingly, contemporary generative video models follow the same recipe, relying on large-scale training over diverse video corpora. This raises a compelling question: are video models poised to achieve analogous advances in general-purpose visual understanding, mirroring the trajectory of LLMs in language?
In this work, Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos of Google DeepMind explore this hypothesis through empirical demonstrations with Veo 3, a state-of-the-art generative video model developed by Google. Published on arXiv in September 2025, the paper argues that video models are not merely content generators but emergent zero-shot learners and reasoners, capable of tackling a spectrum of visual tasks without domain-specific fine-tuning.
Emergent Zero-Shot Capabilities
The core contribution of the paper lies in showcasing Veo 3's proficiency across a broad array of zero-shot visual tasks, categorized into perception, modeling, and manipulation abilities. These capabilities emerge naturally from the model's generative pretraining, underscoring the power of scaling in multimodal AI.
Perception Tasks
Veo 3 demonstrates remarkable zero-shot performance on low-level vision tasks. For instance, it segments objects within complex scenes, delineating boundaries with high fidelity despite never being explicitly trained for segmentation. Similarly, it performs edge detection, identifying contours and structural outlines, when prompted to generate videos that highlight these features. The results resemble those of classical algorithms such as the Canny edge detector, but generalize across far more diverse contexts, suggesting an implicit grasp of visual primitives.
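For contrast, the classical pipeline alluded to above is fully hand-specified. A minimal Canny edge-detection sketch with OpenCV looks like the following; the input filename and threshold values are illustrative placeholders, not settings from the paper.

```python
# Classical baseline for comparison: Canny edge detection with OpenCV.
# "scene.png" and the thresholds below are placeholders chosen for illustration.
import cv2

image = cv2.imread("scene.png")                 # load a BGR image from disk
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # Canny expects a single-channel image
blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)   # smooth to suppress noise before gradient estimation
edges = cv2.Canny(blurred, 100, 200)            # low/high hysteresis thresholds, typically tuned per image
cv2.imwrite("scene_edges.png", edges)           # write out the binary edge map
```

Unlike the video model, this pipeline needs no prompt, but its thresholds must be re-tuned for new imaging conditions, which is exactly the kind of per-task specialization a zero-shot generalist avoids.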
Modeling the Visual World
Beyond perception, Veo 3 models higher-level semantic and physical properties of the environment. It infers physical attributes such as material rigidity, fluidity, or elasticity from static or dynamic prompts, simulating realistic interactions (e.g., predicting how a glass shatters or cloth drapes). Object affordance recognition—determining potential uses of items, like grasping a hammer for nailing—further highlights the model's intuitive grasp of functional relationships, bridging perception with commonsense reasoning in vision.
Manipulation and Interaction
Veo 3 extends its utility to interactive scenarios, enabling zero-shot image and video editing through descriptive prompts. Users can request modifications, such as repositioning objects, changing styles, or altering compositions, and the model generates coherent edited outputs. Notably, it simulates tool use, such as wielding utensils or machinery, by generating sequences that depict plausible manipulations, foreshadowing applications in robotics and augmented reality.
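The interface for these edits is purely linguistic: a source frame plus a natural-language instruction. The sketch below is schematic and hypothetical; the EditRequest type and the submit_edit stub are illustrative assumptions, not Veo 3's actual API, and the stub would be swapped for a real video-generation client.

```python
from dataclasses import dataclass

@dataclass
class EditRequest:
    """A prompt-driven edit: a source frame plus a natural-language instruction."""
    source_image: str   # path to the input frame
    instruction: str    # e.g. "move the red mug to the left edge of the table"

def submit_edit(request: EditRequest) -> str:
    """Placeholder for a call to a generative video model.

    The real interface is not specified here, so this stub simply echoes the
    request; in practice it would return a generated video clip.
    """
    return f"[video generated for: '{request.instruction}' on {request.source_image}]"

if __name__ == "__main__":
    edits = [
        EditRequest("kitchen.png", "restyle the scene as a watercolor painting"),
        EditRequest("kitchen.png", "show a hand picking up the whisk and stirring the bowl"),
    ]
    for edit in edits:
        print(submit_edit(edit))
```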
These foundational abilities coalesce to support emergent visual reasoning. Veo 3 solves navigational puzzles such as mazes by generating paths that respect spatial constraints, and it tackles symmetry problems by producing balanced, mirrored configurations, demonstrating rudimentary planning and logical inference in the visual domain.
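To make concrete what the maze task demands, the sketch below shows a classical breadth-first-search solver over a small grid maze; the grid encoding and example maze are assumptions for illustration, not taken from the paper. Veo 3, by contrast, is reported to trace out such paths directly by generating video from a prompt, with no explicit search procedure.

```python
from collections import deque

def solve_maze(grid, start, goal):
    """Breadth-first search over a grid maze.

    grid: list of strings, '#' = wall, any other character = free cell (assumed encoding).
    start, goal: (row, col) tuples. Returns the shortest path as a list of cells, or None.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parents = {start: None}                     # also serves as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:                        # reconstruct the path by walking parents back
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                                 # no path exists

maze = [
    "#########",
    "#S..#...#",
    "#.#.#.#.#",
    "#.#...#G#",
    "#########",
]
print(solve_maze(maze, start=(1, 1), goal=(3, 7)))  # 'S' to 'G'
```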
Parallels to LLMs and Broader Implications
The authors draw explicit analogies to LLMs, emphasizing how both paradigms rely on large-scale generative pretraining on internet-scale data to unlock unforeseen capabilities. Just as the GPT series surprised the community with in-context learning and few-shot reasoning, Veo 3's visual counterparts of these abilities suggest that video diffusion models can transcend content generation to serve as comprehensive vision systems.
This trajectory carries profound implications for AI. Video models could unify disparate vision subfields, from object detection to scene understanding, under a single foundation model, reducing the need for specialized architectures. In robotics, such models might inform visuomotor policies; in computer vision, they challenge traditional supervised paradigms; and in machine learning more broadly, they lend support to scaling hypotheses for multimodal intelligence.
However, challenges remain: Veo 3's capabilities are probabilistic and prompt-dependent, and they lack the reliability of dedicated task-specific models. Future work could explore fine-tuning for robustness, ethical safeguards against misuse such as deepfakes, and tighter integration with language for grounded video-language reasoning.
Conclusion
By empirically validating Veo 3's zero-shot prowess, this paper positions generative video models as harbingers of generalist vision AI. It invites the community to rethink foundation models not just for text, but for the rich, dynamic tapestry of visual data, potentially accelerating progress toward artificial general intelligence.
