Unifies multimodal image understanding, text-to-image generation, and instruction-based editing in a single diffusion LLM, built on a Mixture-of-Experts backbone, a SigLIP-VQ discrete tokenizer, and a distilled diffusion decoder that enables fast (8-step) decoding; full generation needs ~47 GB of GPU memory.
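
As a rough illustration of the memory figure above, the following minimal sketch checks whether a given amount of free GPU memory meets the approximate requirement. The function names are hypothetical, and the threshold assumes "~47GB" means 47 × 10^9 bytes; the source does not say whether GB or GiB is meant, and actual usage will vary with resolution and batch size.

```python
# Assumption: "~47GB" is read as 47 * 10**9 bytes (decimal GB).
# The source gives only an approximate figure, so treat this as a lower bound.
REQUIRED_BYTES = 47 * 10**9


def fits_full_generation(free_bytes: int, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the free GPU memory meets the approximate requirement."""
    return free_bytes >= required_bytes


def shortfall_gb(free_bytes: int, required_bytes: int = REQUIRED_BYTES) -> float:
    """Return how many decimal GB are missing (0.0 if there is enough memory)."""
    return max(0, required_bytes - free_bytes) / 10**9


if __name__ == "__main__":
    # Hypothetical card sizes, given in decimal GB for illustration only.
    for name, gb in [("24 GB card", 24), ("48 GB card", 48), ("80 GB card", 80)]:
        free = gb * 10**9
        print(name, fits_full_generation(free), shortfall_gb(free))
```

With PyTorch installed, the free-byte count could be taken from `torch.cuda.mem_get_info()`, which returns a `(free, total)` tuple in bytes for the current device.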