Overview
NVIDIA Dynamo is a datacenter-scale inference framework purpose-built to serve AI workloads of every kind, from LLMs and vision models to agentic workflows, with low latency and high hardware efficiency. Announced at GTC 2025, Dynamo extends the single-node strengths of Triton Inference Server into a modular, multinode architecture that dynamically schedules GPU resources, routes requests, and manages KV-cache across vast GPU fleets.
Key Capabilities
- Disaggregated Serving – Separates the compute-heavy prefill and memory-bound decode phases onto different GPUs for up to 30× throughput gains (see the first sketch below).
- Smart Router – LLM-aware request routing and KV-cache reuse to slash recomputation costs (see the routing sketch below).
- GPU Planner – Real-time capacity monitoring and adaptive GPU reallocation to hit strict time-to-first-token (TTFT) and inter-token latency (ITL) SLOs (see the planner sketch below).
- Distributed KV-Cache Manager – Tiered offloading (GPU → CPU RAM → SSD → Object Store) enabling petabyte-scale caches with graceful cost/latency trade-offs (see the tiering sketch below).
- Engine-Agnostic Integrations – First-class support for TensorRT-LLM, vLLM, SGLang, and any gRPC/HTTP back-end.
- Rust Runtime + Python SDK – A performance core in Rust with an ergonomic Python layer for graph authoring, a CLI (`dynamo run`, `dynamo build`), and CI/CD.
- Enterprise-Ready – Apache-2.0 licensed OSS with NVIDIA NIM packaging for production support, security hardening, and long-term maintenance.
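
To make the disaggregation idea concrete: one GPU pool runs prefill, computing the prompt's KV state once, while a separate pool runs decode, streaming tokens from that state. The toy sketch below uses a thread-safe queue where a real deployment would move KV blocks over NVLink or RDMA between pools; all names and shapes here are illustrative assumptions, not Dynamo's implementation.

```python
import queue
import threading

# Stand-in for the KV-block transport between prefill and decode pools.
handoff = queue.Queue()

def prefill_worker(prompts):
    """Compute-heavy phase: build KV state once per prompt."""
    for p in prompts:
        kv_state = [ord(c) for c in p]   # toy stand-in for KV-cache blocks
        handoff.put((p, kv_state))
    handoff.put(None)                    # sentinel: no more work

def decode_worker():
    """Memory-bound phase: stream tokens from the prefilled state."""
    while (item := handoff.get()) is not None:
        prompt, kv_state = item
        # Decode reuses the transferred state instead of recomputing it.
        print(f"{prompt!r}: decoding from {len(kv_state)} cached positions")

t = threading.Thread(target=prefill_worker, args=(["hello", "dynamo"],))
t.start()
decode_worker()
t.join()
```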
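The Smart Router's exact scoring is internal to Dynamo, but the underlying trade-off (weighing KV-cache prefix overlap against each worker's current load) fits in a few lines. The block size, load weight, and `Worker` shape below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)  # resident KV-block hashes

def block_hashes(tokens, block_size=16):
    """Hash fixed-size token blocks so prefix overlap is cheap to compare."""
    return [hash(tuple(tokens[i:i + block_size]))
            for i in range(0, len(tokens) - block_size + 1, block_size)]

def route(tokens, workers, load_weight=2.0):
    """Pick the worker with the best (prefix overlap - load penalty) score."""
    blocks = block_hashes(tokens)

    def score(w):
        overlap = 0
        for b in blocks:              # prefix caching: stop at the first miss
            if b not in w.cached_blocks:
                break
            overlap += 1
        return overlap - load_weight * w.active_requests

    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_blocks.update(blocks)  # these blocks become reusable later
    return best

workers = [Worker("gpu-0"), Worker("gpu-1")]
prompt = list(range(64))               # toy token IDs
print(route(prompt, workers).name)     # gpu-0 (tie broken by list order)
print(route(prompt, workers).name)     # gpu-0 again: cache hit beats its load
```

Raising `load_weight` shifts the balance away from cache reuse and toward spreading load, which is the knob such a router has to tune.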
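The GPU Planner bullet implies a feedback loop: monitor TTFT and ITL, then shift GPUs between the prefill and decode pools when an SLO slips. Here is a deliberately simplified, reactive one-step version; the thresholds, pool sizes, and function name are made up for illustration.

```python
def plan_replicas(ttft_p99_ms, itl_p99_ms, prefill_gpus, decode_gpus,
                  ttft_slo_ms=200.0, itl_slo_ms=30.0):
    """One planning step: move a GPU toward whichever phase misses its SLO.

    High TTFT means prefill is the bottleneck; high ITL means decode is.
    A real planner forecasts demand and respects pool minimums; this
    version only reacts to the latest measurements.
    """
    if ttft_p99_ms > ttft_slo_ms and decode_gpus > 1:
        return prefill_gpus + 1, decode_gpus - 1
    if itl_p99_ms > itl_slo_ms and prefill_gpus > 1:
        return prefill_gpus - 1, decode_gpus + 1
    return prefill_gpus, decode_gpus

# TTFT blown at 350 ms p99 -> shift one GPU from decode to prefill.
print(plan_replicas(350.0, 22.0, prefill_gpus=4, decode_gpus=12))  # (5, 11)
```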
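Finally, the tiering idea behind the KV-Cache Manager (keep hot KV blocks in GPU memory, let colder ones sink toward cheaper storage) can be modeled as cascading LRU evictions. The class below is a minimal sketch with assumed tier names and entry-count capacities; a real manager moves actual KV tensors under byte budgets and latency targets.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy tiered cache: overflow cascades from fast tiers to slower ones."""

    def __init__(self, capacities):
        # Ordered fastest -> slowest; the coldest tier here is never evicted.
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in capacities}

    def put(self, key, payload, tier="gpu"):
        names = list(self.tiers)
        self.tiers[tier][key] = payload
        self.tiers[tier].move_to_end(key)
        # Demote least-recently-used entries down the hierarchy on overflow.
        for i in range(names.index(tier), len(names) - 1):
            if len(self.tiers[names[i]]) <= self.capacities[names[i]]:
                break
            old_key, old_payload = self.tiers[names[i]].popitem(last=False)
            self.tiers[names[i + 1]][old_key] = old_payload

    def get(self, key):
        for name, tier in self.tiers.items():
            if key in tier:
                payload = tier.pop(key)
                self.put(key, payload)   # promote hot entries back to GPU
                return name, payload
        return None, None

cache = TieredKVCache({"gpu": 2, "cpu": 4, "ssd": 8, "object_store": 10_000})
for i in range(8):
    cache.put(f"req-{i}", b"kv-blocks")  # older entries sink to colder tiers
print(cache.get("req-0"))                # ('ssd', b'kv-blocks'), now promoted
```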