NVFP4-quantized variant of Qwen3.6-27B that reduces parameter bits from 16 to 4, cutting disk and GPU memory requirements by ~2.5× while keeping comparable benchmark accuracy; ready for vLLM-based inference on NVIDIA hardware and supports long, multimodal contexts.
Predicts per-request MoE expert footprints from prefill activations and routes decode requests to workers that maximize expert-locality, lowering decode latency by combining offline K-means partitioning with online locality-band routing and a KV-block–coindexed signature cache.