Overview
NVIDIA Triton Inference Server (now continued as NVIDIA Dynamo-Triton) is an open-source, production-grade model-serving platform. It unifies inference across TensorFlow, PyTorch, ONNX Runtime, TensorRT, TensorRT-LLM, vLLM, classical ML frameworks, and custom backends, delivering optimized performance from data-center GPUs to edge CPUs. Triton standardizes the inference protocol (KServe-compatible HTTP/REST and gRPC, plus an in-process C API) and integrates with Kubernetes, KServe, and popular MLOps stacks, making it straightforward to roll out, monitor, and scale AI workloads.
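As a quick illustration of the standardized protocol, the sketch below sends one inference request over HTTP with the `tritonclient` Python package. The model name (`my_model`), tensor names (`INPUT0`, `OUTPUT0`), and shape are hypothetical placeholders; substitute whatever your model's config.pbtxt declares.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's default HTTP endpoint.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request input; names/shape below are placeholders for illustration.
inp = httpclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

# Ask for a specific output tensor by name.
out = httpclient.InferRequestedOutput("OUTPUT0")

# Same call shape works for any backend the model happens to run on.
result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0").shape)
```

The gRPC client (`tritonclient.grpc`) mirrors this API, so switching transports is largely a one-line change.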
Key Capabilities
- Multi-framework & multi-hardware support (GPU, x86/ARM CPU, AWS Inferentia)
- Concurrent model execution & dynamic batching for peak throughput
- Model ensembles / pipelines with Business Logic Scripting (BLS); a minimal BLS sketch follows this list
- LLM-centric features: speculative decoding, function calling, constrained decoding
- Rich observability: health endpoints, Prometheus-format metrics, tracing, and per-model statistics (see the monitoring sketch after this list)
- Cloud-native: Docker/NGC container images, Helm charts, and Kubernetes autoscaling to meet latency/throughput SLAs
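For the ensemble/BLS bullet above, here is a minimal sketch of a Python-backend model (saved as model.py in the model's repository directory) that uses Business Logic Scripting to call another deployed model from inside `execute`. The downstream model name (`classifier`), the tensor names, and the arg-max post-processing are all hypothetical and only meant to show the control flow.

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical BLS model: forwards each request to a downstream model,
    then post-processes that model's output before responding."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Input tensor declared in this model's config.pbtxt (name is a placeholder).
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")

            # Business Logic Scripting: call another model served by the same Triton
            # instance. Assumes the downstream model also names its input "INPUT0".
            bls_request = pb_utils.InferenceRequest(
                model_name="classifier",
                requested_output_names=["OUTPUT0"],
                inputs=[in_tensor],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            scores = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0").as_numpy()

            # Illustrative post-processing: return the arg-max class index.
            out_tensor = pb_utils.Tensor("OUTPUT0", np.argmax(scores, axis=-1).astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```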
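And as a sketch of the observability surface, the snippet below probes Triton's default HTTP (8000) and metrics (8002) ports: the KServe v2 health endpoints, the per-model statistics extension, and the Prometheus-format metrics. The model name `my_model` is again a placeholder.

```python
import requests

BASE = "http://localhost:8000"      # default HTTP/REST endpoint
METRICS = "http://localhost:8002"   # default Prometheus metrics endpoint

# Liveness / readiness checks (also usable as Kubernetes probes).
print("live: ", requests.get(f"{BASE}/v2/health/live").status_code == 200)
print("ready:", requests.get(f"{BASE}/v2/health/ready").status_code == 200)

# Per-model statistics: request counts, queue time, compute time, etc.
stats = requests.get(f"{BASE}/v2/models/my_model/stats").json()
print(stats)

# Prometheus-format metrics, ready for scraping.
print(requests.get(f"{METRICS}/metrics").text[:500])
```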