Encodes and clones camera motion from reference videos to generate multi-shot videos: represents camera parameters as a visual "camera grid", trains on million-scale grid–video pairs, and uses a hierarchical prompt-expansion agent to coordinate camera, subject, and action control in multimodal diffusion models.