Overview
Flash Linear Attention (fla) is an open-source collection of high-performance, Triton-based implementations of a wide range of linear attention mechanisms. Built on PyTorch and Triton, fla aims to be platform-agnostic and hardware-efficient, providing optimized kernels and fused modules for faster training and inference of linear-attention-based models.
Key features
- Triton + PyTorch implementations: Custom kernels implemented in Triton for maximum performance while keeping a pure PyTorch-facing API.
- Wide model coverage: Implements many modern linear-attention and linear-time sequence models (RetNet, GLA, DeltaNet, Mamba2, Samba, RWKV variants, FoX, etc.) and provides Transformers-compatible model classes/configs (see the usage sketch after this list).
- Fused modules: Includes fused layers (e.g., fused cross-entropy, fused norm+gate, fused linear+CE) to reduce memory footprint and improve throughput during training.
- Hybrid/plug-in attention: Easy to interleave or replace standard softmax attention with linear attention variants via configuration (supports hybrid models, local attention interleaving, etc.).
- Multi-platform verification: Verified on NVIDIA, AMD and Intel hardware; provides CI for different GPU targets.
- Generation & evaluation utilities: Examples and benchmarking scripts for generation speed, lm-evaluation-harness integration and long-context RULER evaluations.
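As an illustration of the PyTorch-facing API, the sketch below runs a single linear-attention layer as a drop-in nn.Module. The layer name MultiScaleRetention and the convention that the output tensor comes first in the layer's return value follow the repository's README examples, but treat both as assumptions and check the source for exact signatures.

```python
# Minimal sketch: using a fla layer as a drop-in PyTorch module.
# Assumes `fla.layers.MultiScaleRetention` takes these constructor arguments
# and returns the output tensor as the first element of its return value;
# consult the fla README for the exact, up-to-date signature.
import torch
from fla.layers import MultiScaleRetention

batch_size, seq_len, hidden_size, num_heads = 8, 2048, 1024, 4
device, dtype = 'cuda', torch.bfloat16

layer = MultiScaleRetention(hidden_size=hidden_size, num_heads=num_heads)
layer = layer.to(device=device, dtype=dtype)

x = torch.randn(batch_size, seq_len, hidden_size, device=device, dtype=dtype)
y, *_ = layer(x)              # output is expected to match the input shape
assert y.shape == x.shape
```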
Typical use cases
- Research and development of efficient attention mechanisms and linear transformers.
- Replacing standard multi-head attention with linear-attention alternatives to improve memory/compute trade-offs for long-context models.
- Training and evaluating linear-attention models at scale with improved kernel efficiency and fused operations.
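As a sketch of how the fused modules fit into a training step, the snippet below computes a language-modeling loss with a fused cross-entropy. The import path fla.modules.FusedCrossEntropyLoss and its nn.CrossEntropyLoss-style (logits, labels) interface are assumptions to verify against the repository; the fused linear+CE variant, which avoids materializing the full logits tensor, has a different interface.

```python
# Sketch of a training-step loss using a fused cross-entropy module.
# The import path and the (flattened logits, flattened labels) call
# convention are assumptions modeled on nn.CrossEntropyLoss.
import torch
from fla.modules import FusedCrossEntropyLoss

vocab_size, batch_size, seq_len = 32000, 4, 1024
logits = torch.randn(batch_size, seq_len, vocab_size,
                     device='cuda', dtype=torch.bfloat16, requires_grad=True)
labels = torch.randint(0, vocab_size, (batch_size, seq_len), device='cuda')

loss_fn = FusedCrossEntropyLoss()
loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()
```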
Installation & compatibility
- Requires PyTorch >= 2.5 and Triton >= 3.0 (or nightly in some cases).
- Distributed as the pip packages fla-core and flash-linear-attention, and also installable from source via git.
- Integrates with Hugging Face Transformers (provides model configs and classes compatible with the AutoModel APIs).
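A hedged sketch of the Transformers integration: importing fla is expected to register its configs and model classes with the Auto* APIs, after which a model can be built from a fla config or loaded from a checkpoint. The config name GLAConfig and its constructor arguments are assumptions based on the repository's naming; take exact defaults and any hub checkpoint IDs from the README.

```python
# Sketch: building and running a fla model through the Transformers API.
# Assumes importing `fla` registers its configs/models with the Auto* classes
# and that `fla.models.GLAConfig` exists with these arguments (unverified).
import fla  # noqa: F401  -- the import is what registers fla models
import torch
from fla.models import GLAConfig
from transformers import AutoModelForCausalLM

# Build a small GLA model from scratch via its config (hypothetical sizes).
config = GLAConfig(hidden_size=512, num_hidden_layers=4, num_heads=4)
model = AutoModelForCausalLM.from_config(config).to('cuda', torch.bfloat16)

# Loading a pretrained checkpoint instead would use the usual API, e.g.:
#   model = AutoModelForCausalLM.from_pretrained('<org>/<fla-checkpoint>')

# Standard generation works once the model class is registered.
input_ids = torch.randint(0, config.vocab_size, (1, 16), device='cuda')
out = model.generate(input_ids, max_new_tokens=32)
print(out.shape)  # (1, 16 + up to 32 new tokens)
```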
Performance & benchmarks
The repository contains benchmarks comparing Triton-based kernels to other implementations (e.g., FlashAttention2) across sequence lengths and devices, and provides scripts to measure generation throughput and latency on common GPUs (e.g., H100). The project emphasizes reducing memory usage (fused layers) and accelerating both forward and backward passes of linear-attention modules.
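For a quick local sanity check alongside the repository's own benchmark scripts, a generic forward+backward timing harness like the one below can be pointed at any fla layer. It uses only standard torch.cuda.Event timing; the MultiScaleRetention layer is the same assumed example as above, and this is not the project's benchmarking code.

```python
# Generic forward+backward timing sketch for a fla layer (not the repo's
# benchmark scripts). Reuses the assumed `fla.layers.MultiScaleRetention`.
import torch
from fla.layers import MultiScaleRetention

def time_fwd_bwd(layer, x, warmup=5, iters=20):
    """Return average milliseconds per forward+backward pass."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):          # warm up kernels and autotuning
        y, *_ = layer(x)
        y.sum().backward()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        y, *_ = layer(x)
        y.sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

layer = MultiScaleRetention(hidden_size=1024, num_heads=4).to('cuda', torch.bfloat16)
x = torch.randn(2, 4096, 1024, device='cuda', dtype=torch.bfloat16, requires_grad=True)
print(f"{time_fwd_bwd(layer, x):.2f} ms per fwd+bwd at seq_len=4096")
```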
Community & citation
- Maintained by the fla-org organization; primary contributors include researchers such as Songlin Yang and Yu Zhang (cited in the repository). The repo includes citation metadata for academic referencing.
- Actively developed, with frequent updates adding new attention variants (the news/changelog lists additions through 2025).
When to use
Use fla when you want to experiment with or deploy linear-attention architectures with subquadratic memory/compute, when you need high-performance kernels across different hardware backends, or when you want Transformers-compatible models and tooling for long-context generation and evaluation.
