Generates text from interleaved text, image, and short-video inputs using discrete diffusion with block-autoregressive multi-canvas sampling. Built on a sparse mixture-of-experts (MoE, 8/128) Gemma 4 backbone and optimized for low-latency inference and long contexts of up to 256K tokens.
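
The summary does not say how expert routing works or exactly what "8/128" denotes. A common reading of that notation is that 8 of 128 experts are active per token. Under that assumption, the sketch below illustrates generic top-k routing for a single token. The names `route_token`, `NUM_EXPERTS`, and `TOP_K` are hypothetical and are not taken from the model's code, and the actual router may differ (for example in how gate weights are normalized).

```python
import math
import random

NUM_EXPERTS = 128  # assumed total expert count, read from "8/128"
TOP_K = 8          # assumed number of experts active per token


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_ids, gate_weights) for one token.

    Picks the top_k highest-scoring experts and renormalizes their
    scores with a softmax over just those experts. This is a generic
    illustration, not the model's actual router.
    """
    if not 0 < top_k <= len(router_logits):
        raise ValueError("top_k must be between 1 and the number of experts")
    # Indices of the top_k largest logits, highest first.
    expert_ids = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = router_logits[expert_ids[0]]
    exps = [math.exp(router_logits[i] - peak) for i in expert_ids]
    total = sum(exps)
    return expert_ids, [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    # Random stand-in for the router's per-expert scores for one token.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    ids, weights = route_token(logits)
    print("selected experts:", ids)
    print("gate weights sum:", round(sum(weights), 6))
```

With this kind of routing, only the selected experts run for a given token, so per-token compute scales with the active experts rather than with the full expert count. That is one plausible reason a sparse backbone suits low-latency inference.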