Overview
NVIDIA Triton Inference Server (now continued as NVIDIA Dynamo-Triton) is an open-source, production-grade model-serving platform. It unifies inference across TensorFlow, PyTorch, ONNX Runtime, TensorRT, TensorRT-LLM, vLLM, classical ML frameworks, and custom backends, delivering optimized performance from data-center GPUs to edge CPUs. Triton standardizes the inference protocol (KServe-compatible HTTP/REST and gRPC, plus an in-process C API) and integrates with Kubernetes, KServe, and popular MLOps stacks, making it straightforward to roll out, monitor, and scale AI workloads.
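As a quick illustration of the standardized protocol, the sketch below sends one inference request over HTTP with the `tritonclient` Python package. The model name (`my_model`), tensor names (`INPUT0`, `OUTPUT0`), and shape are hypothetical placeholders; substitute whatever your model's config.pbtxt declares.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's default HTTP endpoint.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request input; names/shape below are placeholders for illustration.
inp = httpclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

# Ask for a specific output tensor by name.
out = httpclient.InferRequestedOutput("OUTPUT0")

# Same call shape works for any backend the model happens to run on.
result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0").shape)
```

The gRPC client (`tritonclient.grpc`) mirrors this API, so switching transports is largely a one-line change.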
Key Capabilities
- Multi-framework & multi-hardware support (GPU, x86/ARM CPU, AWS Inferentia)
- Concurrent model execution & dynamic batching for peak throughput
- Model ensembles / pipelines with Business Logic Scripting (BLS); a minimal BLS sketch follows this list
- LLM-centric features: speculative decoding, function calling, constrained decoding
- Rich observability: health endpoints, Prometheus-format metrics, tracing, and per-model statistics (see the monitoring sketch after this list)
- Cloud-native: Docker/NGC container images, Helm charts, and Kubernetes autoscaling to meet latency/throughput SLAs
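For the ensemble/BLS bullet above, here is a minimal sketch of a Python-backend model (saved as model.py in the model's repository directory) that uses Business Logic Scripting to call another deployed model from inside `execute`. The downstream model name (`classifier`), the tensor names, and the arg-max post-processing are all hypothetical and only meant to show the control flow.

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Hypothetical BLS model: forwards each request to a downstream model,
    then post-processes that model's output before responding."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Input tensor declared in this model's config.pbtxt (name is a placeholder).
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")

            # Business Logic Scripting: call another model served by the same Triton
            # instance. Assumes the downstream model also names its input "INPUT0".
            bls_request = pb_utils.InferenceRequest(
                model_name="classifier",
                requested_output_names=["OUTPUT0"],
                inputs=[in_tensor],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            scores = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0").as_numpy()

            # Illustrative post-processing: return the arg-max class index.
            out_tensor = pb_utils.Tensor("OUTPUT0", np.argmax(scores, axis=-1).astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```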
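And as a sketch of the observability surface, the snippet below probes Triton's default HTTP (8000) and metrics (8002) ports: the KServe v2 health endpoints, the per-model statistics extension, and the Prometheus-format metrics. The model name `my_model` is again a placeholder.

```python
import requests

BASE = "http://localhost:8000"      # default HTTP/REST endpoint
METRICS = "http://localhost:8002"   # default Prometheus metrics endpoint

# Liveness / readiness checks (also usable as Kubernetes probes).
print("live: ", requests.get(f"{BASE}/v2/health/live").status_code == 200)
print("ready:", requests.get(f"{BASE}/v2/health/ready").status_code == 200)

# Per-model statistics: request counts, queue time, compute time, etc.
stats = requests.get(f"{BASE}/v2/models/my_model/stats").json()
print(stats)

# Prometheus-format metrics, ready for scraping.
print(requests.get(f"{METRICS}/metrics").text[:500])
```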