Overview
FlashInfer is a CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack. It provides Triton/Torch-CUDA kernels for grouped-query and sliding-window attention, with reported 2-3× speed-ups.
Key Capabilities
- Drop-in Python bindings (see the decode sketch after this list)
- Dynamic RoPE scaling & int4 support
- Benchmarks on A100/H100 & RTX GPUs
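Below is a minimal sketch of what a single grouped-query decode call through the Python bindings could look like. It assumes the `flashinfer` package and its documented `single_decode_with_kv_cache` entry point; tensor shapes, dtypes, and keyword arguments may differ between releases and should be checked against the installed version.

```python
# Hedged sketch: one grouped-query attention (GQA) decode step via
# FlashInfer's Python bindings. Requires a CUDA device and the
# flashinfer package; verify the API against your installed version.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

# Single-request decode: q is one query token, k/v are the KV cache.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# GQA is handled when num_qo_heads is a multiple of num_kv_heads.
out = flashinfer.single_decode_with_kv_cache(q, k, v)
print(out.shape)  # expected: (num_qo_heads, head_dim)
```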
KTransformers is a flexible framework for experimenting with cutting-edge optimizations in LLM inference and fine-tuning, with a focus on CPU-GPU heterogeneous computing. It consists of two core modules: kt-kernel, which provides high-performance inference kernels, and kt-sft for fine-tuning. The project supports a range of hardware and models, such as the DeepSeek series and Kimi-K2, and reports significant resource savings and speed-ups, for example reducing the GPU memory needed for a 671B-parameter model to 70 GB and achieving up to 28× acceleration.
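As a rough illustration of the CPU-GPU heterogeneous placement idea described above (this is a conceptual plain-PyTorch sketch, not KTransformers' actual API; all class and parameter names are hypothetical), the snippet below keeps a small hot-path layer on the GPU while a large expert layer stays in host memory.

```python
# Conceptual sketch of CPU-GPU heterogeneous placement (not KTransformers code):
# keep small, latency-critical layers on the GPU and offload bulky weights
# (e.g. MoE experts) to CPU memory. Assumes a CUDA device is available.
import torch
import torch.nn as nn

class HeterogeneousBlock(nn.Module):
    def __init__(self, hidden: int = 1024, expert_dim: int = 4096):
        super().__init__()
        # Hot path: attention-like projection stays on the GPU.
        self.attn_proj = nn.Linear(hidden, hidden).to("cuda")
        # Cold path: large expert weights live in host (CPU) memory.
        self.expert = nn.Linear(hidden, expert_dim).to("cpu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.attn_proj(x.to("cuda"))
        # Move the small activations to the CPU-resident expert and back,
        # instead of moving the large expert weights onto the GPU.
        y = self.expert(x.to("cpu"))
        return y.to("cuda")

block = HeterogeneousBlock()
out = block(torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 4096])
```

The point of the sketch is that transferring small activations to CPU-resident weights, rather than large weights to the GPU, is what lets this style of offload cut GPU memory use.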