Long camera trajectories and multi-shot edits are core to cinematic video but are hard to clone reliably: parametric rigs break on multi-shot scenes, and synthetic cross-paired data are scarce and brittle. The core insight is to treat camera parameters as a visual modality, rendered as a compact "camera grid" video, and to train a multimodal diffusion transformer on a very large corpus of camera-grid↔video pairs, so that camera motion can be transferred across scenes without explicit camera calibration or cross-paired synthesis.
Key Findings
- Camera-as-video representation: encoding camera parameters into grid-motion videos lets the model handle arbitrary, compound multi-shot trajectories (shot transitions, push/pull, pans/rotations) as a single visual condition, avoiding brittle parametric templates.
- Million-scale supervision: pretraining on a large, synthesized camera-grid–video corpus supplies diverse trajectories and shot compositions, improving robustness on complex, long-range camera cloning tasks.
- Hierarchical Prompt Expansion agent: a prompt-planning stage fuses camera motion, subject description, and action cues into coherent directives for the diffusion transformer, improving semantic and temporal coherence across shot boundaries.
- Director-level control: the framework coordinates characters, actions and cameras to support multimodal controls (text, reference video, trajectory) for controllable video generation without per-case fine-tuning.
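To make the prompt-expansion idea concrete, here is a minimal sketch of how per-shot camera motion, subject, and action cues might be fused into one directive with explicit transition lines. The source describes an agent and does not specify its internals; this template-based stand-in is not the authors' method, and the names `Shot`, `expand_shot`, `expand_transition`, and `expand_sequence`, as well as the sentence templates, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    """One shot's cues (hypothetical structure, not from the source)."""
    camera_motion: str  # e.g. "slow push-in"
    subject: str        # e.g. "a woman in a red coat"
    action: str         # e.g. "walks toward the door"


def expand_shot(index: int, shot: Shot) -> str:
    """Intra-shot level: fuse camera motion with subject and action."""
    return (f"Shot {index}: the camera performs a {shot.camera_motion} "
            f"while {shot.subject} {shot.action}.")


def expand_transition(prev: Shot, nxt: Shot) -> str:
    """Inter-shot level: describe what carries over across the cut."""
    if prev.subject == nxt.subject:
        link = f"keeping {nxt.subject} consistent in appearance"
    else:
        link = f"shifting focus from {prev.subject} to {nxt.subject}"
    return f"Cut to the next shot, {link}."


def expand_sequence(shots: List[Shot]) -> str:
    """Sequence level: interleave transition lines and shot lines."""
    lines = []
    for i, shot in enumerate(shots, start=1):
        if i > 1:
            lines.append(expand_transition(shots[i - 2], shot))
        lines.append(expand_shot(i, shot))
    return "\n".join(lines)


shots = [
    Shot("slow push-in", "a woman in a red coat", "walks toward the door"),
    Shot("left-to-right pan", "a woman in a red coat", "steps into the street"),
]
directive = expand_sequence(shots)
print(directive)
```

In a real system the fixed templates would presumably be replaced by a language model, but the two levels shown here (per-shot fusion, then cross-shot transitions) mirror the hierarchy described above.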
Who it's for and tradeoffs
Great fit if you need reproducible, multi-shot camera motion transfer for generated video: researchers and studios working on controllable video synthesis, or teams building reference-based camera control pipelines. Look elsewhere if you require live on-device deployment on phones, must work in extremely small-data regimes, or need explicit, metric-accurate per-frame camera calibration. The approach relies on large-scale training data and sizable model capacity, and may need adaptation for unconstrained, noisy real-world reference footage.
How it works (brief)
The method renders camera parameters into a grid-format motion video, which is used alongside content signals (text, image, or content video) to condition a multimodal diffusion transformer. At inference time, a hierarchical prompt-expansion module constructs conditioning prompts that describe intra-shot motion, inter-shot transitions, and semantic fusion with subjects and actions. The pretrained model then synthesizes multi-shot outputs that follow the target camera trajectories while preserving content consistency.
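The rendering step can be illustrated with a small sketch: project a fixed 3D lattice through a pinhole camera at each frame's pose and rasterize the result, so that camera motion shows up as motion of the grid. The source does not specify how the grid video is drawn, so everything here is an assumption: the pinhole model, the yaw-plus-position pose format, the lattice geometry, the 64×64 resolution, and all function names (`project`, `render_frame`, `push_in`, `pan`).

```python
import math


def project(point, yaw_deg, cam_pos, focal=32.0, size=64):
    """Pinhole projection of a world point to pixel (u, v); None if behind the camera."""
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    # Camera-to-world rotation about the vertical axis; camera looks along +z at yaw 0.
    R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    d = [point[i] - cam_pos[i] for i in range(3)]
    # World -> camera coordinates: p_cam = R^T (p - t).
    x, y, z = (sum(R[i][j] * d[i] for i in range(3)) for j in range(3))
    if z <= 1e-6:
        return None
    return focal * x / z + size / 2, focal * y / z + size / 2


def make_grid(n=5, extent=1.0, depth=4.0):
    """An n-by-n lattice of points on the plane z = depth."""
    coords = [-extent + 2 * extent * i / (n - 1) for i in range(n)]
    return [(x, y, depth) for y in coords for x in coords]


def render_frame(grid, yaw_deg, cam_pos, focal=32.0, size=64):
    """Rasterize the projected lattice into a size-by-size binary image."""
    frame = [[0] * size for _ in range(size)]
    for p in grid:
        uv = project(p, yaw_deg, cam_pos, focal, size)
        if uv is None:
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        if 0 <= row < size and 0 <= col < size:
            frame[row][col] = 1
    return frame


def lerp(a, b, t):
    return a + (b - a) * t


def push_in(n_frames, z_start, z_end):
    """Shot type 1: camera dollies forward along +z."""
    return [(0.0, (0.0, 0.0, lerp(z_start, z_end, i / (n_frames - 1))))
            for i in range(n_frames)]


def pan(n_frames, yaw_start, yaw_end):
    """Shot type 2: camera rotates about the vertical axis in place."""
    return [(lerp(yaw_start, yaw_end, i / (n_frames - 1)), (0.0, 0.0, 0.0))
            for i in range(n_frames)]


# A two-shot trajectory: a push-in followed (after a hard cut) by a pan.
trajectory = push_in(8, 0.0, 2.0) + pan(8, -10.0, 10.0)
grid = make_grid()
video = [render_frame(grid, yaw, pos) for yaw, pos in trajectory]
print(len(video), "frames of", len(video[0]), "x", len(video[0][0]))
```

Because the two shots are simply concatenated, the cut appears as a discontinuity in the grid video itself, which is one way a single visual condition could carry both intra-shot motion and shot transitions.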
