Video models are zero-shot learners and reasoners

Google DeepMind
* Joint leads.

TL;DR

Veo 3 shows emergent zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models—just like LLMs became foundation models for language.

Perception

Modeling

Manipulation

Reasoning

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding?
We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo 3's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

Podcast

On a run and want the gist of our paper? Listen to the podcast.

Perception

Edge detection

Segmentation

Keypoint localization

Super-resolution

Blind deblurring

Blind denoising

Low-light enhancing

Conjunctive search

Dalmatian illusion

Shape cue-conflict

Rorschach blot

Modeling

Material properties (flammability)

Rigid body transform

Soft body transform

Gravity (earth)

Gravity (moon)

Buoyancy (bottle cap)

Buoyancy (rock)

Visual Jenga

Object packing

Material optics (glass)

Material optics (mirror)

Color mixing (additive)

Color mixing (subtractive)

Categorizing objects

Omniglot (recognition)

Omniglot (generation)

Omniglot (parsing)

Memory of world states

Manipulation

Background removal

Style transfer

Colorization

Inpainting

Outpainting

Text manipulation

Image editing with doodles

Scene composition

Novel view synthesis

3D-aware reposing

Transfiguration

Professional headshot

Dexterous manipulation (jar)

Dexterous manipulation (throw/catch)

Dexterous manipulation (baoding balls)

Affordance recognition

Drawing

Visual instruction (burrito)

Reasoning

Graph traversal

Tree BFS

Sequence (dots)

Sequence (arrows)

Sequence (circles)

Sequence (squares)

Connecting colors

Shape fitting

Sorting numbers

Tool use

Simple sudoku completion

Water puzzle

Maze solving (mouse)

Robot navigation

Rule extrapolation

Analogy (color)

Analogy (resize)

Analogy (reflect)

Analogy (rotate)

Maze (5x5)

Maze (7x7)

Maze (9x9)

Maze (irregular)

Symmetry (shape)

Symmetry (random)

BibTeX

@article{wiedemer2025video,
  title={Video models are zero-shot learners and reasoners},
  author={Wiedemer, Thaddäus and Li, Yuxuan and Vicol, Paul and Gu, Shixiang Shane and Matarese, Nick and Swersky, Kevin and Kim, Been and Jaini, Priyank and Geirhos, Robert},
  journal={arXiv preprint arXiv:TBD},
  year={2025}
}