Video models are zero-shot learners and reasoners

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding?
We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo 3's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

Perception

Edge detection

Segmentation

Keypoint localization

Super-resolution

Blind deblurring

Blind denoising

Low-light enhancement

Conjunctive search / binding problem

Dalmatian illusion understanding

Shape cue-conflict understanding

Rorschach blot interpretation

Modeling

Material properties (flammability)

Rigid body transform

Soft body transform

Gravity (earth)

Gravity (moon)

Buoyancy (bottle cap)

Buoyancy (rock)

Visual Jenga

Object packing

Material optics (glass)

Material optics (mirror)

Color mixing (additive)

Color mixing (subtractive)

Categorizing objects

Omniglot (recognition)

Omniglot (generation)

Omniglot (parsing)

Memory of world states

Manipulation

Background removal

Style transfer

Colorization

Inpainting

Outpainting

Text manipulation

Image editing with doodles

Scene composition

Novel view synthesis

3D-aware reposing

Transfiguration

Professional headshot

Dexterous manipulation (jar)

Dexterous manipulation (throw/catch)

Dexterous manipulation (baoding balls)

Affordance recognition

Drawing

Visual instruction (burrito rolling)

Reasoning

Graph traversal

Tree BFS

Sequence (dots)

Sequence (arrows)

Sequence (circles)

Sequence (squares)

Connecting colors

Shape fitting

Sorting numbers

Tool use

Simple sudoku completion

Water puzzle solving

Maze solving (mouse)

Robot navigation

Rule extrapolation

Analogy (color)

Analogy (resize)

Analogy (reflect)

Analogy (rotate)

Maze (5x5)

Maze (7x7)

Maze (9x9)

Maze (irregular)

Symmetry (shape)

Symmetry (random)

Failure cases (click to expand) ▶

Monocular depth estimation

Monocular surface normal estimation

Force prompting

Motion trajectory prompting

Tying the knot

Connect the path puzzle

Letter word search

Eulerian path

Solving linear equations

Spot the difference

Visual IQ test

Glass falling

Collisions

Jigsaw puzzle

Sliding puzzle

Scrambled puzzle

Bottleneck

Laundry folding

Motion planning

BibTeX

@article{wiedemer2025video,
  title={Video models are zero-shot learners and reasoners},
  author={Wiedemer, Thadd{\"a}us and Li, Yuxuan and Vicol, Paul and Gu, Shixiang Shane and Matarese, Nick and Swersky, Kevin and Kim, Been and Jaini, Priyank and Geirhos, Robert},
  journal={arXiv preprint arXiv:2509.20328},
  year={2025}
}

Video models are zero-shot
learners and reasoners

TL;DR