Triton

Open-source, high-performance server for deploying and scaling AI/ML models on GPUs or CPUs, supporting multiple frameworks and cloud/edge targets.

Introduction

Overview

NVIDIA Triton Inference Server (now part of Dynamo-Triton) is an open-source, production-grade model-serving platform. It unifies inference across TensorFlow, PyTorch, ONNX, TensorRT-LLM, vLLM, classical ML libraries, and custom backends, delivering optimized performance from data-center GPUs to edge CPUs. Triton standardizes inference protocols (HTTP/REST, gRPC, C API) and integrates with Kubernetes, KServe, and popular MLOps stacks, making it easy to roll out, monitor, and scale AI workloads.
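
Because the HTTP/REST and gRPC protocols are standardized, any client can drive the server with a few calls. The sketch below is a minimal example using the official `tritonclient` Python package; it assumes a Triton instance on the default HTTP port (localhost:8000) serving a hypothetical model named `my_model` with a single FP32 input `INPUT0` of shape [1, 4] and an output `OUTPUT0` (model and tensor names are placeholders, not from the Triton docs).

```python
# Minimal inference sketch against Triton's HTTP/REST endpoint.
# Assumes: server at localhost:8000, a hypothetical model "my_model" with one
# FP32 input "INPUT0" of shape [1, 4] and one output "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Check that the server and the model are up before sending requests.
assert client.is_server_ready()
assert client.is_model_ready("my_model")

# Build the request: declare the input tensor and attach the data.
data = np.random.rand(1, 4).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Request the output tensor by name and run the inference call.
result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
```

The same request shape works over gRPC by swapping `tritonclient.http` for `tritonclient.grpc` and pointing at the gRPC port (8001 by default).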

Key Capabilities
  • Multi-framework & multi-hardware support (GPU, x86/ARM CPU, AWS Inferentia)
  • Concurrent model execution & dynamic batching for peak throughput
  • Model ensembles / pipelines with Business Logic Scripting
  • LLM-centric features – speculative decoding, function calling, constrained decoding
  • Rich observability – health, metrics, tracing, model-level statistics (see the sketch after this list)
  • Cloud-native: Docker/NGC images, Helm charts, and Kubernetes autoscaling to meet latency/throughput SLAs
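
To illustrate the observability bullet above: the same server exposes liveness/readiness checks, per-model statistics, and a Prometheus scrape endpoint. The following sketch assumes a local server on Triton's default ports (8000 for HTTP, 8002 for metrics) and the same hypothetical model name `my_model` as in the earlier example.

```python
# Observability sketch: health checks, per-model statistics, and the
# Prometheus metrics endpoint. Ports 8000/8002 are Triton's defaults;
# the model name "my_model" is a placeholder.
import json
import urllib.request

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Health checks, as used by Kubernetes liveness/readiness probes.
print("live:", client.is_server_live())
print("ready:", client.is_server_ready())

# Per-model statistics: request counts, queue time, compute time, etc.
stats = client.get_inference_statistics(model_name="my_model")
print(json.dumps(stats, indent=2))

# Prometheus metrics are served as plain text on the metrics port.
with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    print(resp.read().decode()[:500])  # first few metric lines
```

These are the endpoints that Helm-deployed probes and a Prometheus scraper would typically hit when autoscaling Triton on Kubernetes.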

Information

Categories