An open-source, production-ready system for serving machine-learning models at scale.
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs), built to deliver state-of-the-art performance on GPUs with features such as PagedAttention and continuous batching.
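A minimal offline-batching sketch with vLLM's Python API (the model name is illustrative; any Hugging Face causal LM that vLLM supports works):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # loads weights and allocates the paged KV cache
params = SamplingParams(temperature=0.8, max_tokens=64)

# vLLM schedules these prompts together via continuous batching.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```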
Open-source, high-performance framework and DSL for serving large language and vision-language models with low-latency, controllable, structured generation.
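If the framework here is SGLang, its frontend DSL composes prompts and constrained generation roughly like this (a sketch; the endpoint URL, regex, and variable names are assumptions):

```python
import sglang as sgl

# Point the DSL at a running SGLang server (URL is an assumption).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    # Constrain the answer with a regex for controllable, structured output.
    s += "A: " + sgl.gen("answer", max_tokens=8, regex=r"(yes|no)")

state = qa.run(question="Is the sky blue?")
print(state["answer"])
```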
A lightweight open-source platform for running, managing, and integrating large language models locally via a simple CLI and REST API.
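If the platform is Ollama, a non-streaming generation call against its local REST API looks like this (the default port 11434 and the model name are assumptions about a local setup):

```python
import requests

# One-shot (non-streaming) generation against a locally running server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```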
NVIDIA TensorRT is an SDK and tool suite that compiles and optimizes trained neural-network models for low-latency, high-throughput inference on NVIDIA GPUs.
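A sketch of the typical TensorRT build flow in Python (TensorRT 8.x-style API; file paths are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch networks are the standard mode in modern TensorRT.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:      # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)    # enable FP16 kernels where profitable
engine = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:      # serialized engine for deployment
    f.write(engine)
```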
Ray is an open-source distributed compute engine that lets you scale Python and AI workloads, from data processing to model training and serving, without deep distributed-systems expertise.
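A minimal Ray sketch: decorate a function with @ray.remote and fan tasks out across the cluster:

```python
import ray

ray.init()  # starts a local cluster, or connects to an existing one if configured

@ray.remote
def square(x):
    return x * x

# Tasks run in parallel across available cores/nodes; futures resolve with ray.get.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```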
CNCF-incubating model inference platform (formerly KFServing) that provides Kubernetes CRDs for scalable predictive and generative workloads.
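A sketch using the kserve Python SDK to create an InferenceService CRD (the name, namespace, and storage URI are placeholders; applying the equivalent YAML manifest works the same way):

```python
from kubernetes import client
from kserve import (
    KServeClient,
    constants,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)
KServeClient().create(isvc)  # applies the CRD; KServe scales the predictor
```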
Open-source, high-performance server for deploying and scaling AI/ML models on GPUs or CPUs, supporting multiple frameworks and cloud/edge targets.
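If this refers to NVIDIA Triton Inference Server, a client-side inference call over its HTTP API looks roughly like this (the model name and tensor names are placeholders that must match the model's config.pbtxt):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input/output names and shapes must match the model's configuration.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer("resnet50", inputs=[inp])
print(result.as_numpy("OUTPUT__0").shape)
```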
OpenVINO is an open-source toolkit from Intel that streamlines the optimization and deployment of models for AI inference across a wide range of Intel® hardware.
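A minimal OpenVINO sketch using the post-2023 openvino API (the IR path and input shape are placeholders):

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")          # placeholder IR path
compiled = core.compile_model(model, "CPU")   # or "GPU", "AUTO", etc.

# Calling the compiled model runs a synchronous inference request.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([x])
print(result[compiled.output(0)].shape)
```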
Microsoft’s high-performance, cross-platform inference engine for ONNX and GenAI models.
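If this is ONNX Runtime, a minimal session sketch looks like this (the model path and input shape are placeholders):

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order; CUDA is used if available, else CPU.
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {input_name: x})  # None = return all model outputs
print(outputs[0].shape)
```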
Open-source framework for building, shipping, and running containerized AI services with a single command.
Hugging Face’s Rust + Python server for high-throughput, multi-GPU text generation.
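Assuming this describes Text Generation Inference (TGI), a running server can be queried from Python with huggingface_hub's client (the URL and sampling parameters are assumptions about a local deployment):

```python
from huggingface_hub import InferenceClient

# Point the client at a locally running TGI server.
client = InferenceClient("http://localhost:8080")

out = client.text_generation(
    "Explain continuous batching in one sentence.",
    max_new_tokens=64,
    temperature=0.7,
)
print(out)
```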