Flash Linear Attention

Flash Linear Attention (fla) is a Triton- and PyTorch-based library providing efficient implementations of state-of-the-art linear attention mechanisms and related modules. It focuses on hardware-efficient kernels, supports multiple platforms (NVIDIA, AMD, Intel), and provides models and utilities compatible with Hugging Face Transformers for training and inference.

Introduction

Overview

Flash Linear Attention (fla) is an open-source collection of high-performance, Triton-based implementations of a wide range of linear attention mechanisms. Written in PyTorch and Triton, fla aims to be platform-agnostic and hardware-efficient, providing optimized kernels and fused modules for faster training and inference of linear-attention models.

Key features
  • Triton + PyTorch implementations: Custom kernels written in Triton for high performance, exposed through a pure PyTorch-facing API.
  • Wide model coverage: Implements many modern linear-attention and linear-time sequence models (RetNet, GLA, DeltaNet, Mamba2, Samba, RWKV variants, FoX, etc.) and provides Transformers-compatible model classes and configs (see the loading sketch after this list).
  • Fused modules: Includes fused layers (e.g., fused cross-entropy, fused norm+gate, fused linear+CE) to reduce memory footprint and improve throughput during training.
  • Hybrid/plug-in attention: Easy to interleave or replace standard softmax attention with linear attention variants via configuration (supports hybrid models, local attention interleaving, etc.).
  • Multi-platform verification: Verified on NVIDIA, AMD and Intel hardware; provides CI for different GPU targets.
  • Generation & evaluation utilities: Examples and benchmarking scripts for generation speed, lm-evaluation-harness integration and long-context RULER evaluations.
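
Because fla registers its model classes and configs with Transformers, models can be loaded through the standard AutoModel entry points. The following is a minimal sketch, assuming a checkpoint published under the project's fla-hub organization on the Hugging Face Hub; the repo id below is illustrative, so check the README for the checkpoints that actually exist.

```python
# Minimal loading sketch (checkpoint id is an assumption, for illustration only).
import fla   # importing fla registers its model classes/configs with Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "fla-hub/gla-1.3B-100B"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

prompt = "Linear attention scales to long contexts because"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```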
Typical use cases
  • Research and development of efficient attention mechanisms and linear transformers.
  • Replacing standard multi-head attention with linear-attention alternatives to improve memory/compute trade-offs for long-context models (see the recurrence sketch after this list).
  • Training and evaluating linear-attention models at scale with improved kernel efficiency and fused operations.
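
For intuition on that memory/compute trade-off, the sketch below implements the plain (unnormalized) single-head linear-attention recurrence in pure PyTorch. It is a reference implementation only, not one of fla's Triton kernels, which compute the same recurrence in a chunked, hardware-efficient form; the point is that the running state S has a fixed d×d size, so memory does not grow with sequence length.

```python
# Reference-only sketch of the linear-attention recurrence (single head,
# no normalization or gating).
import torch

def linear_attention_recurrent(q, k, v):
    """q, k, v: (T, d) tensors for one head; returns the (T, d) outputs."""
    T, d = q.shape
    S = torch.zeros(d, d, dtype=q.dtype, device=q.device)  # running sum of k_t v_t^T
    out = torch.empty_like(v)
    for t in range(T):
        S = S + torch.outer(k[t], v[t])  # constant-size state update
        out[t] = q[t] @ S                # o_t = q_t S_t
    return out

q, k, v = (torch.randn(128, 64) for _ in range(3))
print(linear_attention_recurrent(q, k, v).shape)  # torch.Size([128, 64])
```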
Installation & compatibility
  • Requires PyTorch >= 2.5 and Triton >= 3.0 (nightly builds may be needed in some cases); a quick version-check sketch follows this list.
  • Distributed as pip packages fla-core and flash-linear-attention, and also installable from source via git.
  • Integrates with Hugging Face Transformers (provides model configs and classes compatible with AutoModel APIs).
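
A minimal environment check, not an official fla utility, can confirm the version requirements above before installing the package:

```python
# Verify that PyTorch/Triton meet the stated minimums for flash-linear-attention.
import torch
import triton
from packaging.version import Version

assert Version(torch.__version__) >= Version("2.5"), "fla requires PyTorch >= 2.5"
assert Version(triton.__version__) >= Version("3.0"), "fla requires Triton >= 3.0"
print(f"torch {torch.__version__}, triton {triton.__version__}, "
      f"CUDA device available: {torch.cuda.is_available()}")
```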
Performance & benchmarks

The repository contains benchmarks comparing Triton-based kernels to other implementations (e.g., FlashAttention2) across sequence lengths and devices, and provides scripts to measure generation throughput and latency on common GPUs (e.g., H100). The project emphasizes reducing memory usage (fused layers) and accelerating both forward and backward passes of linear-attention modules.
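
The repository ships its own benchmark scripts; as a rough, hedged illustration of how such throughput numbers are typically gathered, the sketch below times forward+backward passes with CUDA events. The softmax-attention layer used here is only a stand-in baseline; an fla layer would be timed the same way (the exact class names live under fla.layers and are not assumed here).

```python
# Hedged micro-benchmark sketch using CUDA events (baseline: nn.MultiheadAttention).
import torch
import torch.nn as nn

def time_fwd_bwd(step, iters=20, warmup=3):
    """step(): one forward+backward pass. Returns average ms per iteration."""
    for _ in range(warmup):
        step()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        step()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True).cuda()
for T in (1024, 4096, 8192):
    x = torch.randn(1, T, 512, device="cuda", requires_grad=True)
    ms = time_fwd_bwd(lambda: attn(x, x, x, need_weights=False)[0].sum().backward())
    print(f"softmax attention, seq_len={T}: {ms:.2f} ms/iter (fwd+bwd)")
```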

Community & citation
  • Maintained by the fla-org organization; primary contributors include researchers such as Songlin Yang and Yu Zhang (cited in the repository). The repo includes citation metadata for academic referencing.
  • Actively developed with frequent updates adding new attention variants (examples in the news/changelog show additions through 2025).
When to use

Use fla when you want to experiment with or deploy linear-attention architectures that require subquadratic memory/compute, need high-performance kernels across different hardware backends, or when you want Transformers-compatible models and tools for long-context generation/evaluation.

Information

  • Website: github.com
  • Authors: fla-org, Songlin Yang, Yu Zhang
  • Published date: 2023/12/20
