What is SGLang?
SGLang is an open-source serving engine and Structured Generation Language created by the LMSYS team to accelerate inference for large language models (LLMs) and vision-language models. By co-designing a fast backend runtime with a concise, Python-like frontend DSL, SGLang lets developers build multi-step, parallel, and structured generation pipelines while sustaining state-of-the-art throughput.
Key capabilities
- RadixAttention & KV-cache reuse for efficient prefill/decoding
- Continuous batching, speculative decoding, quantization (FP8/INT4/AWQ/GPTQ)
- Prefill–decode disaggregation & expert parallelism to scale across GPUs
- Frontend language primitives for control flow, tool/function calls, JSON/AST output, and multimodal inputs
- Broad model support (Llama-3/4, DeepSeek, Mistral, Qwen, LLaVA, etc.) and OpenAI-style API compatibility
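As a minimal sketch of the OpenAI-style API compatibility, the snippet below builds and sends a chat-completions request to a locally launched SGLang server using only the Python standard library. The endpoint URL, port, and model path are illustrative assumptions; adjust them to your deployment.

```python
import json
import urllib.request

# Once an SGLang server is launched, it exposes OpenAI-compatible routes
# such as /v1/chat/completions. An illustrative launch command:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000

BASE_URL = "http://localhost:30000/v1"  # assumed local deployment


def chat_payload(prompt: str, model: str = "default") -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0.0,
    }


def chat(prompt: str) -> str:
    """POST the request to a running SGLang server (requires the server above)."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is RadixAttention?")` returns the model's reply; because the route follows the OpenAI schema, the official OpenAI client library can be pointed at the same `BASE_URL` instead.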
Who is it for?
- Engineers building low-latency chat, RAG, or agent systems
- Researchers needing reproducible, high-throughput benchmarks
- Platform teams seeking a production-grade, vendor-neutral inference stack
Released under the Apache-2.0 license, SGLang is now part of the PyTorch Ecosystem and powers trillions of tokens per day in production systems.