
FlashInfer

A CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack.

Introduction

Overview

FlashInfer provides Triton and Torch/CUDA kernels for grouped-query attention and sliding-window attention, with 2-3× speed-ups.

Key Capabilities
  • Drop-in Python bindings (a usage sketch follows this list)
  • Dynamic RoPE scaling & int4 support
  • Benchmarks on A100/H100 & RTX GPUs
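
A minimal sketch of a single decode step through the Python bindings, assuming the single_decode_with_kv_cache entry point and its default (kv_len, num_kv_heads, head_dim) layout; argument names and defaults may differ between releases:

  import torch
  import flashinfer

  # Grouped-query attention: 32 query heads share 8 KV heads (4:1 ratio).
  num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

  # One decode step: a single query token attends to the cached keys/values.
  q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
  k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
  v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

  # Returns the attention output, shape [num_qo_heads, head_dim].
  o = flashinfer.single_decode_with_kv_cache(q, k, v)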

Information

  • Website: flashinfer.ai
  • Authors: FlashInfer Team
  • Published date: 2023/11/12

Categories