Local-first LLM ecosystem from Nomic AI that runs quantized chat models on everyday CPUs and GPUs, with a desktop app, Python bindings, and a REST API.
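A minimal sketch of the Python bindings, assuming the `gpt4all` package is installed; the GGUF file name is an assumption and is downloaded from the model catalog on first use:

```python
from gpt4all import GPT4All

# Model file name is an assumption; any GGUF chat model from the catalog works.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

with model.chat_session():
    print(model.generate("Explain KV caching in two sentences.", max_tokens=128))
```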
Universal LLM deployment engine that compiles models with TVM Unity for native execution across GPUs, CPUs, mobile and WebGPU.
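A minimal Python sketch, assuming the `mlc_llm` package and a model already compiled and quantized for the MLC runtime (the `HF://` model id below is an assumption):

```python
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # assumed pre-compiled MLC model
engine = MLCEngine(model)

# The engine exposes an OpenAI-style chat-completions interface.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What does TVM Unity compile here?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```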
Toolkit from InternLM for compressing, quantizing and serving LLMs with INT4/INT8 kernels on GPUs.
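A minimal sketch of the `lmdeploy` pipeline API; the model id and the INT8 KV-cache setting are assumptions:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 asks the TurboMind backend for an INT8 KV cache (assumed setting);
# the model id is likewise an assumption.
pipe = pipeline(
    "internlm/internlm2_5-7b-chat",
    backend_config=TurbomindEngineConfig(quant_policy=8),
)
print(pipe(["Summarize weight-only INT4 quantization in one sentence."]))
```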
Xorbits’ universal inference layer (library name `xinference`) that deploys and serves LLMs and multimodal models from laptop to cluster.
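A minimal client-side sketch, assuming a local Xinference server is running and a model has already been launched (e.g. with `xinference launch`); the port and model name are assumptions:

```python
from openai import OpenAI

# Endpoint and model name are assumptions; Xinference exposes an
# OpenAI-compatible API for the models it has launched.
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

reply = client.chat.completions.create(
    model="qwen2.5-instruct",
    messages=[{"role": "user", "content": "Hello from a laptop deployment."}],
)
print(reply.choices[0].message.content)
```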
NVIDIA’s open-source library that compiles Transformer blocks into highly optimized TensorRT engines for fast, low-latency LLM inference on NVIDIA GPUs.
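A minimal sketch of the high-level `tensorrt_llm` LLM API, which builds the engine behind the scenes; the model id and sampling settings are assumptions:

```python
from tensorrt_llm import LLM, SamplingParams

# Model id is an assumption; the LLM API compiles a TensorRT engine for it on first use.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["TensorRT engines are"], sampling_params):
    print(output.outputs[0].text)
```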
CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack.
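A minimal sketch of a single-request decode attention call; the tensor layout follows FlashInfer's documented convention, while the sizes themselves are arbitrary assumptions:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 2048

# One new query vector per head attends over the whole cached KV sequence.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

out = flashinfer.single_decode_with_kv_cache(q, k, v)
print(out.shape)  # (num_qo_heads, head_dim)
```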
Lightning-fast engine that lets you serve any AI model—LLMs, vision, audio—at scale with zero YAML and automatic GPU autoscaling.
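A minimal serving sketch, assuming the `litserve` package; the toy model and port are placeholders:

```python
import litserve as ls

class EchoAPI(ls.LitAPI):
    def setup(self, device):
        # Load the real model here; a trivial callable stands in for one.
        self.model = lambda text: text.upper()

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        return self.model(x)

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(EchoAPI(), accelerator="auto")
    server.run(port=8000)
```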
Pythonic framework to inject experimental KV-cache optimizations into HuggingFace Transformers stacks.
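Assuming this entry refers to NVIDIA's kvpress (an assumption), a rough sketch of its pipeline-style usage; the pipeline task name, model id, press class, and compression ratio are all assumptions:

```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # press classes compress the KV cache during prefill

context = "A long document whose KV cache we want to shrink. " * 200
question = "What is the document about?"

# Pipeline task name, model id, and compression ratio are assumptions.
pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
)
press = ExpectedAttentionPress(compression_ratio=0.5)
print(pipe(context, question=question, press=press)["answer"])
```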
Distributed KV-cache store & transfer engine that decouples prefilling from decoding to scale vLLM serving clusters.
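The split can be pictured with a toy sketch of the pattern only (all names below are illustrative, not this project's API): prefill workers write per-request KV blocks into a shared store, and decode workers pull them instead of recomputing the prompt:

```python
from dataclasses import dataclass

@dataclass
class KVEntry:
    request_id: str
    kv_blocks: list  # serialized per-layer key/value blocks

class KVStore:
    """Toy stand-in for a distributed KV-cache store shared by prefill and decode nodes."""
    def __init__(self):
        self._entries = {}

    def put(self, entry: KVEntry) -> None:
        self._entries[entry.request_id] = entry

    def get(self, request_id: str) -> KVEntry:
        return self._entries.pop(request_id)

def prefill(store: KVStore, request_id: str, prompt: str) -> None:
    # Run the prompt once, then ship the resulting KV cache to the store.
    store.put(KVEntry(request_id, [f"layer-{i}:{prompt}".encode() for i in range(2)]))

def decode(store: KVStore, request_id: str) -> str:
    # Pull the transferred cache and keep generating without re-running prefill.
    entry = store.get(request_id)
    return f"decoding with {len(entry.kv_blocks)} cached layers"

store = KVStore()
prefill(store, "req-1", "Explain disaggregated serving.")
print(decode(store, "req-1"))
```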
vLLM-project’s control plane that orchestrates cost-efficient, plug-and-play LLM inference infrastructure.
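Once deployed, the stack fronts the vLLM engines with a request router that speaks the OpenAI protocol; the router URL and model name below are assumptions:

```python
from openai import OpenAI

# Router URL and model name are assumptions; the router load-balances requests
# across the vLLM serving engines behind it.
client = OpenAI(base_url="http://localhost:30080/v1", api_key="EMPTY")

print([m.id for m in client.models.list().data])

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "ping"}],
)
print(completion.choices[0].message.content)
```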
NVIDIA’s open-source, high-throughput, low-latency inference framework (Dynamo) that scales generative AI and reasoning models across large, multi-node GPU clusters.
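A minimal client-side sketch against a running Dynamo deployment that exposes an OpenAI-compatible HTTP frontend; the URL and model name are assumptions:

```python
import requests

# Frontend URL and model name are assumptions for illustration; the frontend
# accepts OpenAI-style chat-completion requests.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Why does disaggregated prefill help tail latency?"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```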