Overview
FlashInfer is a CUDA kernel library that brings FlashAttention-style optimizations to any LLM serving stack. It provides Triton/Torch-CUDA kernels for grouped-query and sliding-window attention, with reported 2-3× speed-ups.
Key Capabilities
- Drop-in Python bindings (see the decode sketch after this list)
- Dynamic RoPE scaling & int4 support
- Benchmarks on A100/H100 & RTX GPUs
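Below is a minimal sketch of what a single grouped-query decode call through the Python bindings could look like. It assumes the `flashinfer` package and its documented `single_decode_with_kv_cache` entry point; tensor shapes, dtypes, and keyword arguments may differ between releases and should be checked against the installed version.

```python
# Hedged sketch: one grouped-query attention (GQA) decode step via
# FlashInfer's Python bindings. Requires a CUDA device and the
# flashinfer package; verify the API against your installed version.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

# Single-request decode: q is one query token, k/v are the KV cache.
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# GQA is handled when num_qo_heads is a multiple of num_kv_heads.
out = flashinfer.single_decode_with_kv_cache(q, k, v)
print(out.shape)  # expected: (num_qo_heads, head_dim)
```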
KTransformers is a flexible framework for experimenting with cutting-edge optimizations in LLM inference and fine-tuning, with a focus on CPU-GPU heterogeneous computing. It consists of two core modules: kt-kernel, which provides high-performance inference kernels, and kt-sft for fine-tuning. The project supports a range of hardware and models, such as the DeepSeek series and Kimi-K2, and reports significant resource savings and speed-ups, for example reducing the GPU memory needed for a 671B-parameter model to 70 GB and achieving up to 28× acceleration.
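As a rough illustration of the CPU-GPU heterogeneous placement idea described above (this is a conceptual plain-PyTorch sketch, not KTransformers' actual API; all class and parameter names are hypothetical), the snippet below keeps a small hot-path layer on the GPU while a large expert layer stays in host memory.

```python
# Conceptual sketch of CPU-GPU heterogeneous placement (not KTransformers code):
# keep small, latency-critical layers on the GPU and offload bulky weights
# (e.g. MoE experts) to CPU memory. Assumes a CUDA device is available.
import torch
import torch.nn as nn

class HeterogeneousBlock(nn.Module):
    def __init__(self, hidden: int = 1024, expert_dim: int = 4096):
        super().__init__()
        # Hot path: attention-like projection stays on the GPU.
        self.attn_proj = nn.Linear(hidden, hidden).to("cuda")
        # Cold path: large expert weights live in host (CPU) memory.
        self.expert = nn.Linear(hidden, expert_dim).to("cpu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.attn_proj(x.to("cuda"))
        # Move the small activations to the CPU-resident expert and back,
        # instead of moving the large expert weights onto the GPU.
        y = self.expert(x.to("cpu"))
        return y.to("cuda")

block = HeterogeneousBlock()
out = block(torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 4096])
```

The point of the sketch is that transferring small activations to CPU-resident weights, rather than large weights to the GPU, is what lets this style of offload cut GPU memory use.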