Overview
Flash Linear Attention (fla) is an open-source collection of high-performance, Triton-based implementations of a wide range of linear attention mechanisms. Built on PyTorch and Triton, fla aims to be platform-agnostic and hardware-efficient, providing optimized kernels and fused modules for faster training and inference of linear-attention-based models.
Key features
- Triton + PyTorch implementations: Custom kernels implemented in Triton for maximum performance while keeping a pure PyTorch-facing API.
- Wide model coverage: Implements many modern linear-attention and linear-time sequence models (RetNet, GLA, DeltaNet, Mamba2, Samba, RWKV variants, FoX, etc.) and provides Transformers-compatible model classes/configs (see the usage sketch after this list).
- Fused modules: Includes fused layers (e.g., fused cross-entropy, fused norm+gate, fused linear+CE) to reduce memory footprint and improve throughput during training.
- Hybrid/plug-in attention: Easy to interleave or replace standard softmax attention with linear attention variants via configuration (supports hybrid models, local attention interleaving, etc.).
- Multi-platform verification: Verified on NVIDIA, AMD and Intel hardware; provides CI for different GPU targets.
- Generation & evaluation utilities: Examples and benchmarking scripts for generation speed, lm-evaluation-harness integration and long-context RULER evaluations.
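As an illustration of the PyTorch-facing API, the sketch below runs a single linear-attention layer as a drop-in nn.Module. The layer name MultiScaleRetention and the convention that the output tensor comes first in the layer's return value follow the repository's README examples, but treat both as assumptions and check the source for exact signatures.

```python
# Minimal sketch: using a fla layer as a drop-in PyTorch module.
# Assumes `fla.layers.MultiScaleRetention` takes these constructor arguments
# and returns the output tensor as the first element of its return value;
# consult the fla README for the exact, up-to-date signature.
import torch
from fla.layers import MultiScaleRetention

batch_size, seq_len, hidden_size, num_heads = 8, 2048, 1024, 4
device, dtype = 'cuda', torch.bfloat16

layer = MultiScaleRetention(hidden_size=hidden_size, num_heads=num_heads)
layer = layer.to(device=device, dtype=dtype)

x = torch.randn(batch_size, seq_len, hidden_size, device=device, dtype=dtype)
y, *_ = layer(x)              # output is expected to match the input shape
assert y.shape == x.shape
```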
Typical use cases
- Research and development of efficient attention mechanisms and linear transformers.
- Replacing standard multi-head attention with linear-attention alternatives to improve memory/compute trade-offs for long-context models.
- Training and evaluating linear-attention models at scale with improved kernel efficiency and fused operations.
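As a sketch of how the fused modules fit into a training step, the snippet below computes a language-modeling loss with a fused cross-entropy. The import path fla.modules.FusedCrossEntropyLoss and its nn.CrossEntropyLoss-style (logits, labels) interface are assumptions to verify against the repository; the fused linear+CE variant, which avoids materializing the full logits tensor, has a different interface.

```python
# Sketch of a training-step loss using a fused cross-entropy module.
# The import path and the (flattened logits, flattened labels) call
# convention are assumptions modeled on nn.CrossEntropyLoss.
import torch
from fla.modules import FusedCrossEntropyLoss

vocab_size, batch_size, seq_len = 32000, 4, 1024
logits = torch.randn(batch_size, seq_len, vocab_size,
                     device='cuda', dtype=torch.bfloat16, requires_grad=True)
labels = torch.randint(0, vocab_size, (batch_size, seq_len), device='cuda')

loss_fn = FusedCrossEntropyLoss()
loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()
```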
Installation & compatibility
- Requires PyTorch >= 2.5 and Triton >= 3.0 (or nightly in some cases).
- Distributed as the pip packages fla-core and flash-linear-attention, and also installable from source via git.
- Integrates with Hugging Face Transformers (provides model configs and classes compatible with the AutoModel APIs).
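A hedged sketch of the Transformers integration: importing fla is expected to register its configs and model classes with the Auto* APIs, after which a model can be built from a fla config or loaded from a checkpoint. The config name GLAConfig and its constructor arguments are assumptions based on the repository's naming; take exact defaults and any hub checkpoint IDs from the README.

```python
# Sketch: building and running a fla model through the Transformers API.
# Assumes importing `fla` registers its configs/models with the Auto* classes
# and that `fla.models.GLAConfig` exists with these arguments (unverified).
import fla  # noqa: F401  -- the import is what registers fla models
import torch
from fla.models import GLAConfig
from transformers import AutoModelForCausalLM

# Build a small GLA model from scratch via its config (hypothetical sizes).
config = GLAConfig(hidden_size=512, num_hidden_layers=4, num_heads=4)
model = AutoModelForCausalLM.from_config(config).to('cuda', torch.bfloat16)

# Loading a pretrained checkpoint instead would use the usual API, e.g.:
#   model = AutoModelForCausalLM.from_pretrained('<org>/<fla-checkpoint>')

# Standard generation works once the model class is registered.
input_ids = torch.randint(0, config.vocab_size, (1, 16), device='cuda')
out = model.generate(input_ids, max_new_tokens=32)
print(out.shape)  # (1, 16 + up to 32 new tokens)
```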
Performance & benchmarks
The repository contains benchmarks comparing Triton-based kernels to other implementations (e.g., FlashAttention2) across sequence lengths and devices, and provides scripts to measure generation throughput and latency on common GPUs (e.g., H100). The project emphasizes reducing memory usage (fused layers) and accelerating both forward and backward passes of linear-attention modules.
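For a quick local sanity check alongside the repository's own benchmark scripts, a generic forward+backward timing harness like the one below can be pointed at any fla layer. It uses only standard torch.cuda.Event timing; the MultiScaleRetention layer is the same assumed example as above, and this is not the project's benchmarking code.

```python
# Generic forward+backward timing sketch for a fla layer (not the repo's
# benchmark scripts). Reuses the assumed `fla.layers.MultiScaleRetention`.
import torch
from fla.layers import MultiScaleRetention

def time_fwd_bwd(layer, x, warmup=5, iters=20):
    """Return average milliseconds per forward+backward pass."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):          # warm up kernels and autotuning
        y, *_ = layer(x)
        y.sum().backward()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        y, *_ = layer(x)
        y.sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

layer = MultiScaleRetention(hidden_size=1024, num_heads=4).to('cuda', torch.bfloat16)
x = torch.randn(2, 4096, 1024, device='cuda', dtype=torch.bfloat16, requires_grad=True)
print(f"{time_fwd_bwd(layer, x):.2f} ms per fwd+bwd at seq_len=4096")
```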
Community & citation
- Maintained by the fla-org organization; primary contributors include researchers such as Songlin Yang and Yu Zhang (cited in the repository). The repo includes citation metadata for academic referencing.
- Actively developed, with frequent updates adding new attention variants (the news/changelog lists additions through 2025).
When to use
Use fla when you want to experiment with or deploy linear-attention architectures with subquadratic memory/compute, when you need high-performance kernels across different hardware backends, or when you want Transformers-compatible models and tooling for long-context generation and evaluation.
