
NVIDIA Dynamo

NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework that scales generative-AI and reasoning models across large, multi-node GPU clusters.

Introduction

Overview

NVIDIA Dynamo is a datacenter-scale inference framework purpose-built to serve AI models of every kind, from LLMs and vision models to agentic workflows, with low latency and high hardware efficiency. Announced at GTC 2025, Dynamo extends the single-node strengths of Triton Inference Server into a modular, multi-node architecture that dynamically schedules GPU resources, routes requests, and manages KV-cache across large GPU fleets.

Key Capabilities
  • Disaggregated Serving – Runs the compute-heavy prefill phase and the memory-bound decode phase on separate GPUs, yielding up to 30× throughput gains.
  • Smart Router – LLM-aware request routing and KV-cache reuse to slash recomputation costs.
  • GPU Planner – Real-time capacity monitoring and adaptive GPU re-allocation to hit strict time-to-first-token (TTFT) and inter-token-latency (ITL) SLOs.
  • Distributed KV-Cache Manager – Tiered offloading (GPU → CPU RAM → SSD → Object Store) enabling petabyte-scale cache with graceful cost/latency trade-offs; a conceptual sketch of this tiering follows the list.
  • Engine-Agnostic Integrations – First-class support for TensorRT-LLM, vLLM, SGLang, and any gRPC/HTTP back-end.
  • Rust Runtime + Python SDK – Performance core in Rust with an ergonomic Python layer for graph authoring, a CLI (dynamo run, dynamo build), and CI/CD; see the client example after this list.
  • Enterprise-Ready – Apache-2.0 licensed OSS with NVIDIA NIM packaging for production support, security hardening, and long-term maintenance.
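
To make the KV-Cache Manager bullet concrete, the sketch below models a tiered lookup that walks the GPU → CPU RAM → SSD → Object Store hierarchy. It is a conceptual illustration only: the class and method names, tier capacities, and eviction policy are invented for this example and are not Dynamo's actual API or internals.

```python
# Conceptual sketch of a tiered KV-cache lookup (illustrative only; these
# names are NOT Dynamo's actual API). Each tier trades capacity for
# latency: GPU HBM is fastest and smallest, object storage is slowest
# and largest.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CacheTier:
    name: str
    capacity_blocks: int                        # illustrative capacity limit
    blocks: dict = field(default_factory=dict)  # block hash -> KV bytes


class TieredKVCache:
    """Walk tiers in latency order; move hits toward the GPU."""

    def __init__(self) -> None:
        self.tiers = [
            CacheTier("gpu_hbm", 1_000),
            CacheTier("cpu_ram", 10_000),
            CacheTier("local_ssd", 100_000),
            CacheTier("object_store", 10_000_000),
        ]

    def get(self, block_hash: str) -> Optional[bytes]:
        for i, tier in enumerate(self.tiers):
            if block_hash in tier.blocks:
                data = tier.blocks[block_hash]
                if i > 0:                        # hot block: promote one tier up
                    del tier.blocks[block_hash]
                    self._insert(i - 1, block_hash, data)
                return data
        return None                              # miss: prefill must recompute

    def put(self, block_hash: str, data: bytes) -> None:
        self._insert(0, block_hash, data)        # new blocks start in GPU HBM

    def _insert(self, level: int, block_hash: str, data: bytes) -> None:
        tier = self.tiers[level]
        if len(tier.blocks) >= tier.capacity_blocks:
            # Naive eviction: demote the last-inserted block one tier down,
            # dropping it entirely once the bottom tier overflows.
            evicted_hash, evicted_data = tier.blocks.popitem()
            if level + 1 < len(self.tiers):
                self._insert(level + 1, evicted_hash, evicted_data)
        tier.blocks[block_hash] = data
```

On a prefix-cache hit, decode can resume from the stored KV blocks instead of re-running prefill; the deeper in the hierarchy the hit lands, the higher the retrieval latency but the lower the storage cost, which is the graceful trade-off the bullet refers to.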
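
To show how a deployment is exercised end to end, here is a minimal client sketch. It assumes a model has already been launched (for example with dynamo run) and that the frontend serves an OpenAI-compatible /v1/chat/completions route on localhost:8000; the address and model id are illustrative assumptions, not guaranteed defaults.

```python
# Minimal client sketch against a locally running Dynamo frontend.
# Assumptions (illustrative, not guaranteed defaults): the server was
# started with `dynamo run` and exposes an OpenAI-compatible HTTP API
# at localhost:8000; the model id below is a placeholder.
import json
import urllib.request

BASE_URL = "http://localhost:8000"

payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # placeholder id
    "messages": [
        {"role": "user", "content": "Summarize disaggregated serving."}
    ],
    "max_tokens": 128,
    "stream": False,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# OpenAI-compatible responses carry generated text under choices[0].
print(body["choices"][0]["message"]["content"])
```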
