Microsoft’s high-performance, cross-platform inference engine for ONNX and GenAI models.
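A minimal sketch of the `onnxruntime` Python API; the model filename and input shape below are placeholders for whatever model you load:

```python
import numpy as np
import onnxruntime as ort

# Load the model; the CPU provider is always available, GPU providers are optional.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Feed a dummy tensor matching the model's first input and run inference.
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```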
Open-source framework for building, shipping and running containerized AI services with a single command.
Hugging Face’s Rust + Python server for high-throughput, multi-GPU text generation.
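A minimal sketch of querying such a server with the `huggingface_hub` client, assuming an instance is already running at `http://localhost:8080` (the URL and generation settings are placeholders):

```python
from huggingface_hub import InferenceClient

# Point the client at the locally running text-generation server.
client = InferenceClient("http://localhost:8080")

# With streaming off, text_generation returns the generated string directly.
completion = client.text_generation(
    "Explain what continuous batching is in one sentence.",
    max_new_tokens=64,
)
print(completion)
```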
GGML-based C/C++ implementation that runs LLaMA-family models locally with no dependencies.
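The project itself ships as a C/C++ binary, but the separate `llama-cpp-python` bindings expose the same engine from Python; a minimal sketch, with the GGUF path as a placeholder:

```python
from llama_cpp import Llama

# Load a locally quantized GGUF model; n_ctx sets the context window.
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

# Plain completion call; the result follows an OpenAI-style response layout.
result = llm("Q: What is a GGUF file? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```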
Local-first LLM ecosystem from Nomic AI that runs quantized chat models on everyday CPUs and GPUs, with a desktop app, Python bindings and a REST API.
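A minimal sketch of the Python bindings; the model filename is a placeholder, and the library downloads it to a local cache on first use:

```python
from gpt4all import GPT4All

# The model file is fetched automatically the first time it is requested.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

# chat_session keeps conversation state for multi-turn use.
with model.chat_session():
    print(model.generate("Name two benefits of 4-bit quantization.", max_tokens=128))
```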
Universal LLM deployment engine that compiles models with TVM Unity for native execution across GPUs, CPUs, mobile and WebGPU.
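A minimal sketch of the `mlc_llm` Python engine, assuming a prebuilt MLC-format model pulled from Hugging Face (the model string below is a placeholder following the project's `HF://` convention):

```python
from mlc_llm import MLCEngine

# A model already compiled/packaged in MLC format, fetched from Hugging Face.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# The engine exposes an OpenAI-style chat completions interface.
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What does TVM compile here?"}],
    model=model,
    stream=False,
)
print(response.choices[0].message.content)
engine.terminate()
```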
Toolkit from InternLM for compressing, quantizing and serving LLMs with INT4/INT8 kernels on GPUs.
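A minimal sketch of the offline `pipeline` API from the `lmdeploy` package; the model ID is a placeholder:

```python
from lmdeploy import pipeline

# Builds an inference engine under the hood for the given model.
pipe = pipeline("internlm/internlm2_5-7b-chat")

# Batch inference over a list of prompts; each response carries its generated text.
responses = pipe(["What does INT4 weight quantization trade off?"])
print(responses[0].text)
```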
Xorbits’ universal inference layer (library name `xinference`) that deploys and serves LLMs and multimodal models from laptop to cluster.
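Because Xinference serves an OpenAI-compatible REST API, a minimal sketch can reuse the stock `openai` client; it assumes a local server on the default port 9997 with a model already launched under the placeholder name `my-llm`:

```python
from openai import OpenAI

# Xinference's endpoint mirrors the OpenAI schema, so the standard client works.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="my-llm",  # placeholder: whatever name/UID the model was launched under
    messages=[{"role": "user", "content": "Summarize what an inference layer does."}],
)
print(resp.choices[0].message.content)
```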
NVIDIA’s open-source library that compiles Transformer models into highly optimized TensorRT engines for low-latency, high-throughput LLM inference on NVIDIA GPUs.
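A minimal sketch of the high-level `LLM` API available in recent `tensorrt_llm` releases (the Hugging Face model ID and sampling settings are placeholders); engine compilation happens on first load:

```python
from tensorrt_llm import LLM, SamplingParams

# Compiles the model into a TensorRT engine, then runs generation on it.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Why do fused attention kernels help latency?"], params)
print(outputs[0].outputs[0].text)
```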
CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack.
Lightning-fast engine that serves any AI model (LLMs, vision, audio) at scale, with zero YAML configuration and built-in GPU autoscaling.
Pythonic framework to inject experimental KV-cache optimizations into Hugging Face Transformers stacks.