AIAny - OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Introduction

Long camera trajectories and multi-shot edits are central to cinematic video but hard to clone reliably: parametric rigs break down on multi-shot scenes, and synthetic cross-paired data is scarce and brittle. The core insight is to treat camera parameters as a visual modality, rendered as a compact "camera grid" video, and to train a multimodal diffusion transformer on a very large corpus of camera-grid↔video pairs. Camera motion can then be transferred across scenes without explicit camera calibration or cross-paired synthesis.

Key Findings

Camera-as-video representation: encoding camera parameters into grid-motion videos lets the model handle arbitrary, compound multi-shot trajectories (shot transitions, push/pull, pans/rotations) as a single visual condition, avoiding brittle parametric templates.
Million-scale supervision: pretraining on a large, synthesized camera-grid–video corpus supplies diverse trajectories and shot compositions, improving robustness on complex, long-range camera cloning tasks.
Hierarchical Prompt Expansion agent: a prompt-planning stage fuses camera motion, subject description, and action cues into coherent directives for the diffusion transformer, improving semantic and temporal coherence across shot boundaries.
Director-level control: the framework coordinates characters, actions and cameras to support multimodal controls (text, reference video, trajectory) for controllable video generation without per-case fine-tuning.
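
To make the camera-as-video idea in the first finding concrete, here is a minimal sketch of one way a camera trajectory could be rendered into a grid-motion video. It is an illustration under assumptions, not the authors' renderer: the summary does not say how the grid is drawn, so the ground-plane grid, the pinhole projection, and every name below (`look_at_extrinsics`, `ground_grid_points`, `render_grid_frame`, `camera_grid_video`, the focal length and frame size) are hypothetical choices.

```python
import numpy as np

def look_at_extrinsics(eye, target, up=(0.0, 1.0, 0.0)):
    """World-to-camera rotation R and translation t (x right, y down, z forward)."""
    eye = np.asarray(eye, dtype=float)
    forward = np.asarray(target, dtype=float) - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])  # rows are the camera axes in world coordinates
    return R, -R @ eye

def ground_grid_points(extent=10.0, spacing=1.0, samples=200):
    """Densely sampled points along the lines of a grid on the plane y = 0 (assumed layout)."""
    ticks = np.arange(-extent, extent + spacing, spacing)
    s = np.linspace(-extent, extent, samples)
    zeros = np.zeros_like(s)
    lines = []
    for c in ticks:
        const = np.full_like(s, c)
        lines.append(np.stack([s, zeros, const], axis=1))  # line parallel to the x axis
        lines.append(np.stack([const, zeros, s], axis=1))  # line parallel to the z axis
    return np.concatenate(lines)

def render_grid_frame(R, t, points, focal=200.0, size=(128, 128)):
    """Project the grid through a pinhole camera and rasterize it as white pixels."""
    h, w = size
    cam = points @ R.T + t
    cam = cam[cam[:, 2] > 1e-3]  # keep only points in front of the camera
    u = np.round(focal * cam[:, 0] / cam[:, 2] + w / 2).astype(int)
    v = np.round(focal * cam[:, 1] / cam[:, 2] + h / 2).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    frame = np.zeros((h, w), dtype=np.uint8)
    frame[v[inside], u[inside]] = 255
    return frame

def camera_grid_video(poses, **kwargs):
    """Stack one grid frame per (eye, target) pose into a (T, H, W) array."""
    points = ground_grid_points()
    return np.stack([
        render_grid_frame(*look_at_extrinsics(eye, target), points, **kwargs)
        for eye, target in poses
    ])

# Two shots in one condition video: a push-in, then (after a hard cut) a pan.
push_in = [((0.0, 2.0, -8.0 + 0.5 * i), (0.0, 0.0, 0.0)) for i in range(8)]
pan = [((6.0, 3.0, 0.0), (0.0, 0.0, -3.0 + i)) for i in range(8)]
video = camera_grid_video(push_in + pan)
```

Because both shots end up in the same frame stack, a shot transition is simply a discontinuity between consecutive grid frames, which is the property that lets a single visual condition carry compound multi-shot trajectories.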

Who it's for and tradeoffs

Great fit if you need reproducible multi-shot camera motion transfer for generated video, for example researchers and studios working on controllable video synthesis, or teams building reference-based camera control pipelines. Look elsewhere if you require real-time on-device deployment on phones, must work in extremely small-data regimes, or need explicit per-frame, metric-accurate camera calibration. The approach relies on large-scale training data and sizable model capacity, and may need adaptation for unconstrained, noisy real-world reference footage.

How it works (brief)

The method renders camera parameters into a grid-format motion video, which is used alongside content signals (text, image, or content video) to condition a multimodal diffusion transformer. During inference, a hierarchical prompt-expansion module constructs conditioning prompts that describe intra-shot motion, inter-shot transitions, and semantic fusion with subjects and actions. The pretrained model then synthesizes multi-shot outputs that follow the target camera trajectories while preserving content consistency.
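
The prompt-expansion step can be pictured with a small template-based stand-in. This is only a sketch of the hierarchy described above (per-shot motion, transitions between shots, fusion with subject and action). The summary does not describe the agent's internals, so the `Shot` dataclass, the `expand_prompt` function, and the wording of the generated directives are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    camera_motion: str            # intra-shot motion, e.g. "slow push-in"
    action: str                   # what the subject does during this shot
    transition: str = "hard cut"  # how this shot hands over to the next one

def expand_prompt(subject: str, shots: list[Shot]) -> str:
    """Fuse subject, actions, and camera cues into one multi-shot directive.

    Hypothetical template: one line per shot (intra-shot level), plus one
    line per shot boundary (inter-shot level).
    """
    lines = [f"Subject: {subject}."]
    for i, shot in enumerate(shots, start=1):
        lines.append(
            f"Shot {i}: camera performs a {shot.camera_motion} "
            f"while {subject} {shot.action}."
        )
        if i < len(shots):
            lines.append(
                f"Transition {i}->{i + 1}: {shot.transition}; "
                f"keep {subject} consistent in appearance."
            )
    return "\n".join(lines)

prompt = expand_prompt(
    "a violinist in a red coat",
    [
        Shot("slow push-in", "tunes her instrument"),
        Shot("left-to-right pan", "walks onto the stage"),
    ],
)
print(prompt)
```

In a real pipeline this string would be passed as the text condition together with the camera-grid video. The fixed template here only shows where each level of the hierarchy contributes.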

More Items

CLBench-V: Evaluating Multimodal Context Learning from Grounding to Knowledge Acquisition

HumanCLAW: Can Vision-Language Models Act Through a Body?

TurboVLA: Real-Time Vision-Language-Action Model at 32 Hz on an RTX 4090 with <1 GB VRAM