Overview
FlashInfer is a CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack. It provides Triton/Torch-CUDA kernels for grouped-query and sliding-window attention, with 2-3× speed-ups.
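Below is a minimal sketch of the Python bindings for a single-request grouped-query decode, following the documented flashinfer.single_decode_with_kv_cache entry point; shapes assume the default NHD layout, and the head counts are illustrative.

```python
import torch
import flashinfer

# Grouped-query attention: more query heads than KV heads (illustrative sizes).
num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
kv_len = 4096

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Single-request decode attention; the GQA head mapping is handled in-kernel.
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]
```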
Key Capabilities
- Drop-in Python bindings (see the decode sketch above)
- Dynamic RoPE scaling & int4 support (see the sketch after this list)
- Benchmarks on A100/H100 & RTX GPUs
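The dynamic RoPE scaling can be exercised through the same bindings. A hedged sketch, assuming flashinfer.single_prefill_with_kv_cache exposes pos_encoding_mode, rope_scale, and rope_theta as described in the flashinfer documentation:

```python
import torch
import flashinfer

qo_len, kv_len = 1024, 1024
num_qo_heads, num_kv_heads, head_dim = 32, 8, 128

q = torch.randn(qo_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Causal prefill with RoPE fused into the kernel. rope_scale > 1 stretches
# positions (position interpolation) for context extension; parameter names
# follow the flashinfer docs, values here are illustrative.
o = flashinfer.single_prefill_with_kv_cache(
    q, k, v,
    causal=True,
    pos_encoding_mode="ROPE_LLAMA",
    rope_scale=4.0,
    rope_theta=1e4,
)
```

Applying RoPE on the fly inside the attention kernel avoids materializing rotated Q/K tensors, which is what makes adjusting rope_scale at serving time cheap.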