What is SGLang?
SGLang is an open-source serving engine and Structured Generation Language created by the LMSYS team to accelerate inference for large language models (LLMs) and vision-language models. By co-designing a fast backend runtime with a concise, Python-like frontend DSL, SGLang lets developers build multi-step, parallel, and structured generation pipelines while sustaining state-of-the-art throughput.
Key capabilities
- RadixAttention & KV-cache reuse for efficient prefill/decoding
- Continuous batching, speculative decoding, quantization (FP8/INT4/AWQ/GPTQ)
- Prefill–decode disaggregation & expert parallelism to scale across GPUs
- Frontend language primitives for control flow, tool/function calls, JSON/AST output, and multimodal inputs
- Broad model support (Llama-3/4, DeepSeek, Mistral, Qwen, LLaVA, etc.) and OpenAI-style API compatibility
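As a minimal sketch of the OpenAI-style API compatibility, the snippet below builds and sends a chat-completions request to a locally launched SGLang server using only the Python standard library. The endpoint URL, port, and model path are illustrative assumptions; adjust them to your deployment.

```python
import json
import urllib.request

# Once an SGLang server is launched, it exposes OpenAI-compatible routes
# such as /v1/chat/completions. An illustrative launch command:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000

BASE_URL = "http://localhost:30000/v1"  # assumed local deployment


def chat_payload(prompt: str, model: str = "default") -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0.0,
    }


def chat(prompt: str) -> str:
    """POST the request to a running SGLang server (requires the server above)."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is RadixAttention?")` returns the model's reply; because the route follows the OpenAI schema, the official OpenAI client library can be pointed at the same `BASE_URL` instead.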
Who is it for?
- Engineers building low-latency chat, RAG, or agent systems
- Researchers needing reproducible, high-throughput benchmarks
- Platform teams seeking a production-grade, vendor-neutral inference stack
Released under the Apache-2.0 license, SGLang is now part of the PyTorch Ecosystem and powers trillions of tokens per day in production systems.