TensorRT-LLM is NVIDIA's open-source library that compiles Transformer-based models into highly optimized TensorRT engines for fast LLM inference on NVIDIA GPUs.
It accelerates large language model inference with custom attention kernels, a paged KV cache, quantization (FP8, FP4, INT8, INT4), and speculative decoding.
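
For a sense of the workflow, here is a minimal sketch in the style of the library's high-level Python `LLM` API; the model name and sampling settings are illustrative, and engine compilation happens under the hood when the model is loaded:

```python
# Minimal sketch using TensorRT-LLM's high-level Python LLM API.
# Assumes `tensorrt_llm` is installed and a supported NVIDIA GPU is available;
# the checkpoint name and sampling settings below are illustrative.
from tensorrt_llm import LLM, SamplingParams

# Load a Hugging Face checkpoint; TensorRT-LLM builds an optimized
# TensorRT engine for it on first use.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generation runs on the compiled engine with the paged KV cache.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```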