Xorbits’ universal inference layer (`xinference`) that deploys and serves LLMs and multimodal models anywhere from a single laptop to a full cluster.
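A minimal client sketch, assuming an Xinference server is already running locally (e.g. via `xinference-local`) on the default port 9997; the exact `launch_model` arguments (engine, size, quantization) vary by version and model:

```python
from xinference.client import Client  # pip install xinference

# Connect to a locally running Xinference server (default port 9997).
client = Client("http://localhost:9997")

# Launch a model by name; accepted arguments depend on the Xinference
# version and the model chosen -- treat these as illustrative.
model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_engine="transformers",
)

# Chat with the deployed model through the same client.
model = client.get_model(model_uid)
print(model.chat(messages=[{"role": "user", "content": "Hello!"}]))
```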
NVIDIA’s open-source library that compiles Transformer blocks into highly optimized TensorRT engines for fast, low-latency LLM inference on NVIDIA GPUs.
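This description matches TensorRT-LLM; assuming that, a minimal sketch using its high-level `LLM` API (available in recent releases), which builds the TensorRT engine under the hood on first load:

```python
from tensorrt_llm import LLM, SamplingParams  # pip install tensorrt-llm

# Engine compilation happens on first load; this can take minutes and
# requires a supported NVIDIA GPU.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(temperature=0.8, max_tokens=64)
for output in llm.generate(["What does TensorRT do?"], params):
    print(output.outputs[0].text)
```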
CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack.
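The core trick such kernels build on is the online softmax: attention is accumulated in a single pass over the keys, with a running max and normalizer, so the full score matrix never needs to be materialized. A NumPy sketch of that idea (illustrative only, not the library's actual API):

```python
import numpy as np

def online_softmax_attention(q, K, V):
    """One query row of attention in a single streaming pass over the keys.

    A running max `m` and normalizer `l` let partial results be rescaled as
    new keys arrive -- the numerical trick behind FlashAttention-style kernels.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    m, l = -np.inf, 0.0
    acc = np.zeros_like(V[0])
    for k, v in zip(K, V):              # real kernels process tiles, not rows
        s = (q @ k) * scale             # attention score for this key
        m_new = max(m, s)
        alpha = np.exp(m - m_new)       # rescale previously accumulated sums
        p = np.exp(s - m_new)
        l = l * alpha + p
        acc = acc * alpha + p * v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))

# Reference: materialize all scores, softmax, then weight the values.
s = (K @ q) / np.sqrt(8)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```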
Lightning-fast engine that serves any AI model (LLMs, vision, audio) at scale, with no YAML configuration and automatic GPU autoscaling.
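This description matches Lightning AI's LitServe, though the blurb doesn't name the project; assuming LitServe, a minimal server sketch:

```python
import litserve as ls  # pip install litserve (assuming this is LitServe)

class EchoAPI(ls.LitAPI):
    def setup(self, device):
        # Load models/weights here once per worker; `device` is assigned
        # automatically (CPU or a specific GPU).
        self.device = device

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        return x[::-1]  # stand-in for real model inference

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(EchoAPI(), accelerator="auto")
    server.run(port=8000)
```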
Pythonic framework for injecting experimental KV-cache optimizations into Hugging Face Transformers stacks.
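The injection point in Transformers is the cache object itself. A deliberately naive sketch, assuming a `transformers` release where `DynamicCache` exposes mutable `key_cache`/`value_cache` lists (the cache internals have changed across versions); real libraries use attention-aware scoring rather than blind truncation, which would desynchronize positions mid-generation:

```python
import torch
from transformers import DynamicCache  # pip install transformers

def evict_to_recent_window(cache: DynamicCache, window: int) -> None:
    """Keep only the most recent `window` tokens in every layer's KV cache.

    Naive sliding-window eviction, shown only to mark the injection point;
    smarter policies score entries (e.g. by attention mass) before evicting.
    """
    for i in range(len(cache.key_cache)):
        # Tensors are shaped [batch, num_heads, seq_len, head_dim].
        cache.key_cache[i] = cache.key_cache[i][..., -window:, :]
        cache.value_cache[i] = cache.value_cache[i][..., -window:, :]

# Populate one layer with 10 fake tokens, then shrink it to 4.
cache = DynamicCache()
k = torch.randn(1, 4, 10, 16)
v = torch.randn(1, 4, 10, 16)
cache.update(k, v, layer_idx=0)
evict_to_recent_window(cache, window=4)
print(cache.key_cache[0].shape)  # torch.Size([1, 4, 4, 16])
```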
Distributed KV-cache store & transfer engine that decouples prefilling from decoding to scale vLLM serving clusters.
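To make "decoupling prefilling from decoding" concrete, a toy sketch of the hand-off: one worker computes KV blocks and publishes them to a shared store, another pulls them and decodes. Real engines transfer tensors over RDMA/NVLink; here a dict and a queue stand in:

```python
import queue
import threading

kv_store: dict[str, list[str]] = {}          # request_id -> "KV cache" blocks
ready: "queue.Queue[str]" = queue.Queue()    # hand-off channel between stages

def prefill_worker(request_id: str, prompt: str) -> None:
    # Prefill: compute per-token KV entries (faked here) and publish them.
    kv_store[request_id] = [f"kv({tok})" for tok in prompt.split()]
    ready.put(request_id)                    # signal the decode side

def decode_worker() -> None:
    # Decode: pull the finished KV cache once, then generate locally.
    request_id = ready.get()
    kv = kv_store.pop(request_id)
    print(f"{request_id}: decoding with {len(kv)} cached blocks")

threading.Thread(target=prefill_worker, args=("req-1", "hello world again")).start()
decode_worker()
```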
The vLLM project’s control plane for orchestrating cost-efficient, plug-and-play LLM inference infrastructure.
NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework that scales generative-AI and reasoning models across large, multi-node GPU clusters.