Controllable long-horizon text-to-video and image-to-video generation that supports camera navigation, revisits, and promptable events across photorealistic and stylized domains. Introduces camera-aware positional encoding (E-PRoPE), memory-conditioned scene persistence, causal-forcing distillation, and reinforcement-learning (RL) alignment to retain camera control and reduce drift.