Generates and reasons about multimodal physical-world content (text, images, video, audio, and robot/action trajectories) conditioned on combinations of text, image, video, and action inputs. The 64B “Super” variant targets Physical AI use cases. It supports action prediction and works with vLLM-Omni and Diffusers.