The practical payoff of this release is making a 753B-parameter MoE model usable on a single Blackwell node. NVIDIA supplies a GLM-5.2 checkpoint quantized to NVFP4, so you can run long-context GLM-5.2 inference (up to 1M tokens) on multi-GPU Blackwell hardware without quantizing the model yourself.
Key Capabilities
- Long-context MoE inference: GLM-5.2 uses IndexShare sparse attention and supports contexts of up to 1M tokens for long-horizon tasks. Its MoE backbone has 753B parameters, of which ~40B are activated during inference. This supports complex reasoning and multi-step coding workflows.
- NVFP4 quantization (ModelOpt): Weights and activations of the linear ops inside the MoE experts are quantized to NVFP4 (E2M1 values with FP8 E4M3 scales). This shrinks the BF16 checkpoint (~1.37 TB) to roughly 459 GB on disk and in VRAM, while the router and the first and last layers stay in BF16/FP32.
- Deployment-ready runtimes: The checkpoint is prepared for SGLang and vLLM (examples provided). The recommended tensor-parallel size is 8, and the tested configurations use 8×96 GB Blackwell GPUs. The KV-cache dtype and the tool/reasoning parsers are specified for best results.
- Benchmarks vs FP8 baseline: On the evaluated suites, NVFP4 scores within one point of the FP8 baseline in either direction (GPQA Diamond 89.39 vs 89.52 baseline; SciCode 49.04 vs 49.85; IFBench 75.81 vs 74.95; AA-LCR 70.13 vs 69.38; τ²‑Bench Telecom 98.25 vs 97.9), indicating near-baseline task performance after quantization.
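The benchmark comparison above is easier to read as signed deltas. The following is a minimal sketch that only re-derives differences from the scores quoted in the list; the variable names (`scores`, `deltas`, `mean_delta`) are illustrative and not part of any released tooling.

```python
# Scores copied from the benchmark list above: (NVFP4, FP8 baseline).
scores = {
    "GPQA Diamond": (89.39, 89.52),
    "SciCode": (49.04, 49.85),
    "IFBench": (75.81, 74.95),
    "AA-LCR": (70.13, 69.38),
    "tau2-Bench Telecom": (98.25, 97.9),
}

# Signed difference in points; negative means NVFP4 scored below the baseline.
deltas = {name: round(nvfp4 - fp8, 2) for name, (nvfp4, fp8) in scores.items()}

# Unweighted mean of the per-suite deltas.
mean_delta = round(sum(deltas.values()) / len(deltas), 3)

for name, delta in deltas.items():
    print(f"{name:20s} {delta:+.2f}")
print(f"{'mean':20s} {mean_delta:+.3f}")
```

Two of the five suites come out slightly below the baseline and three slightly above, with the largest gap in either direction under one point. An unweighted mean across unrelated suites is only a rough summary, not a statistical claim.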
Who it's for and tradeoffs
Great fit if you operate inference on NVIDIA Blackwell multi-GPU nodes and need a pre-quantized, long-context-capable GLM-5.2 for agentic tool use, coding, or long-form reasoning. It spares you from building a custom quantization pipeline and reduces the storage/VRAM footprint enough to make GLM-5.2 practical on 8×96 GB Blackwell setups.
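The footprint claim can be checked with back-of-the-envelope arithmetic. The sketch below uses only the sizes quoted in this summary and makes two simplifying assumptions that are not from the source: 1 TB is treated as 1000 GB, and the checkpoint is assumed to shard evenly across the 8 GPUs. Real deployments also need memory for the KV cache, activations, and runtime overhead, so the headroom figure is an upper bound.

```python
BF16_GB = 1370.0    # ~1.37 TB BF16 checkpoint, assuming 1 TB = 1000 GB
NVFP4_GB = 459.0    # NVFP4 checkpoint size quoted above
NUM_GPUS = 8        # recommended tensor-parallel size
GPU_MEM_GB = 96.0   # memory per GPU in the tested configuration


def per_gpu_gb(checkpoint_gb: float, num_gpus: int = NUM_GPUS) -> float:
    """Weight memory per GPU, assuming a perfectly even split (an idealization)."""
    return checkpoint_gb / num_gpus


compression = BF16_GB / NVFP4_GB             # overall size reduction factor
nvfp4_shard = per_gpu_gb(NVFP4_GB)           # weights per GPU after quantization
bf16_shard = per_gpu_gb(BF16_GB)             # weights per GPU without quantization
headroom = GPU_MEM_GB - nvfp4_shard          # left for KV cache and activations

print(f"compression factor: {compression:.2f}x")
print(f"NVFP4 weights per GPU: {nvfp4_shard:.3f} GB (headroom {headroom:.3f} GB)")
print(f"BF16 weights per GPU: {bf16_shard:.2f} GB (fits in 96 GB: {bf16_shard <= GPU_MEM_GB})")
```

Under these assumptions the NVFP4 weights take about 57 GB per GPU, while the BF16 weights alone would need about 171 GB per GPU and would not fit on the same node.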
Look elsewhere if you lack Blackwell GPUs or multi-GPU resources (the checkpoint does not fit on a single GPU), need unquantized BF16/FP32 weights or an exact reproduction of the FP8 baseline's behavior for research, or need a smaller absolute disk/VRAM footprint. Tradeoffs include some modules kept in higher precision, static per-block calibration for quantization, and small benchmark shifts versus the FP8 baseline. As with the base model, output can reflect dataset biases and should be safety-tested before production deployment.
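To make the per-block quantization tradeoff concrete, here is a simplified simulation of block-scaled 4-bit quantization in pure Python. It is a sketch under stated assumptions, not ModelOpt's implementation: the block size of 16 is assumed rather than taken from this summary, each scale is derived from the block's maximum magnitude and kept as a Python float instead of being rounded to FP8 E4M3, and all function names are hypothetical. The grid lists the magnitudes representable in E2M1.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption for this sketch; not stated in the text above


def quantize_block(block):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable grid value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the block scale and 4-bit codes."""
    return [scale * c for c in codes]


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out


weights = [12.0, 2.4, -1.0] + [0.0] * 13
restored = fake_quantize(weights)
print(restored[:3])
```

In this example the block scale is 2.0, so 12.0 and -1.0 are reproduced exactly while 2.4 is rounded to 2.0. A single large value in a block coarsens the grid for every other value in that block, which is why per-block scaling and calibration choices affect accuracy.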
