Trillion‑parameter inference is dominated by two costs: the bit width of each parameter, which sets memory footprint and bandwidth, and the serial backbone forward passes that autoregressive decoding requires. This release attacks both at once. It quantizes the bulk of the parameters (the MoE experts) to FP4 and runs a lightweight BF16 drafter that predicts a block of tokens, which the backbone then checks in a single verification step. The trade is practical: quality stays near the FP8 baseline while decode cost drops substantially.
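To make the bit‑width cost concrete, here is a rough back‑of‑envelope sketch of weight storage under mixed precision. The parameter count, the share of parameters held by experts, and the 8‑bit width for non‑expert modules are all illustrative assumptions, not figures from the release; the only fixed quantity is the MXFP4 storage cost of 4.25 bits per element (a 4‑bit value plus one 8‑bit scale shared by 32 elements).

```python
# Back-of-envelope weight footprint under mixed precision.
# All concrete model numbers below are illustrative assumptions.

MXFP4_BITS = 4 + 8 / 32  # 4-bit element plus one 8-bit shared scale per 32 elements


def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def mixed_footprint_gb(total_params, expert_fraction, expert_bits, other_bits):
    """Footprint when experts and non-expert modules use different bit widths."""
    experts = weight_gb(total_params * expert_fraction, expert_bits)
    others = weight_gb(total_params * (1 - expert_fraction), other_bits)
    return experts + others


if __name__ == "__main__":
    total = 1e12            # assumed: exactly one trillion parameters
    expert_fraction = 0.95  # assumed: share of parameters living in MoE experts
    fp8 = mixed_footprint_gb(total, expert_fraction, 8, 8)
    mixed = mixed_footprint_gb(total, expert_fraction, MXFP4_BITS, 8)
    print(f"all FP8: {fp8:.0f} GB, experts in MXFP4: {mixed:.0f} GB")
```

Under these assumptions the weights shrink from about 1000 GB to about 555 GB, and the bytes read per decoded token shrink in roughly the same proportion for the expert layers that are actually activated.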
Key Capabilities
- Expert‑only MXFP4 quantization: only MoE experts are cast to FP4 (block size 32) while attention projections and critical modules remain higher precision. So what: it reduces model size and memory‑bandwidth pressure where it matters most while avoiding broad accuracy regressions that blanket FP4 casting would introduce.
- DFlash block‑diffusion speculative decoding: a small BF16 drafter fills a masked block (capped at 8 tokens) in one forward pass, and the backbone verifies the whole block in a single pass. So what: draft cost per block becomes near‑constant instead of growing with the number of drafted tokens, which raises draft throughput, and because the backbone verifies every block, output quality stays at the backbone’s level.
- Trillion‑scale and long‑context engineering: the model and drafter are optimized for MoE topology, SWA (sliding window attention), and large context lengths (the backbone reportedly supports up to 1M tokens). So what: it’s positioned for workloads that need long context, multi‑turn agents, or code generation at large scale.
- Deployment path for SGLang: the release ships with example SGLang flags for launching the backbone and drafter with speculative decoding enabled. So what: teams running large MoE racks can integrate speculative decoding with existing distributed topologies.
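The expert‑only MXFP4 bullet above can be illustrated with a minimal sketch of MX‑style block quantization. It assumes the OCP Microscaling layout: 32 elements share one power‑of‑two scale, and each element is stored as a signed E2M1 value. The function names are hypothetical, ties round toward the smaller magnitude for simplicity, and the shared exponent is not clamped to the 8‑bit scale range; the release's QAT recipe and kernels may differ on all of these points.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # the largest magnitude, 6.0, equals 1.5 * 2**2
BLOCK = 32      # elements that share one scale


def quantize_block(block):
    """Return (shared_exp, codes): one power-of-two exponent plus E2M1 values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick the scale so the largest element falls in the top E2M1 binade.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_GRID[-1])             # saturate at 6.0
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))       # nearest grid point
        codes.append(math.copysign(q, x))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate values from the shared exponent and codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def quantize_tensor(values):
    """Quantize a flat list of weights in consecutive blocks of BLOCK elements."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]
```

Because the scale is chosen per 32‑element block, one outlier weight only coarsens the resolution of its own block, which is part of why a 4‑bit element format can remain usable for large expert matrices.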
Who it's for and tradeoffs
Great fit if you operate large MoE inference fleets, need to lower memory‑bandwidth pressure and per‑token latency for long‑context or agent/code workloads, and can run the specialized distributed stacks (SGLang, tensor/expert parallel setups). Look elsewhere if you need a small, local model for lightweight inference: the approach assumes large hardware parallelism and custom runtime support. Expect extra engineering to integrate MXFP4 QAT, manage speculative‑decoding acceptance rates, and validate task‑level quality on your own benchmarks.
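Acceptance rate is the quantity to watch when tuning speculative decoding, so a small sketch of how it could be measured may help. This assumes generic greedy verification, where the backbone's own argmax token at each drafted position decides acceptance; it is not confirmed to be the exact DFlash acceptance rule, and all function names are hypothetical.

```python
def accepted_prefix_len(draft_tokens, target_tokens):
    """Count how many leading draft tokens match the backbone's greedy choices."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def verify_block(draft_tokens, target_tokens):
    """Tokens emitted by one verification pass.

    target_tokens must hold len(draft_tokens) + 1 entries: the backbone's greedy
    token at every drafted position, plus one more after the final draft token.
    The accepted draft prefix is kept, then the backbone's own token is appended,
    so every pass emits at least one token.
    """
    n = accepted_prefix_len(draft_tokens, target_tokens)
    return list(draft_tokens[:n]) + [target_tokens[n]]


def mean_tokens_per_step(steps):
    """Average tokens emitted per backbone pass over (draft, target) pairs."""
    emitted = [len(verify_block(d, t)) for d, t in steps]
    return sum(emitted) / len(emitted)
```

With a block capped at 8 drafted tokens, this scheme emits between 1 and 9 tokens per backbone pass. Tracking `mean_tokens_per_step` on your own prompts gives a direct estimate of how many serial backbone passes the drafter is saving.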
Where it fits
Compared with blanket low‑precision casting or smaller distilled drafts, this design accepts a more complex deployment in exchange for higher end‑to‑end efficiency at scale. It is a middle path between casting the whole model to low precision (worse quality) and cutting compute cost by shrinking the model (worse capability density).
Brief note on reproducibility & license
The Hugging Face model card includes example launch commands and a BibTeX citation; the repo lists an MIT license for the model assets. Reproducing the reported gains requires following the project's FP4 QAT and DFlash training/serving recipes and having access to large distributed GPU/TPU clusters.
