Overview
xFormers is a modular toolbox developed by Facebook Research (Meta) that provides customizable and optimized building blocks for Transformer architectures. It's designed for researchers and engineers who need flexible components beyond the primitives provided by mainstream libraries. The project emphasizes research-first design and high performance, including custom CUDA kernels and fused operators where beneficial.
Key Features
- Memory-efficient exact attention: computes standard (exact) attention without materializing the full attention matrix, which can cut memory use and improve speed for many workloads (see the sketch after this list).
- Sparse attention and block-sparse attention primitives for long-context models.
- Fused operators such as fused linear layers, fused layer norm, fused dropout(activation(x+bias)), and fused SwiGLU to reduce kernel launches and improve throughput.
- Custom CUDA kernels with dispatch to other high-performance libraries (when appropriate).
- Modular, composable blocks suitable for vision, NLP, and other domains.
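As a concrete illustration of the attention primitive, here is a minimal sketch that calls the memory-efficient attention operator directly via `xformers.ops`; the shapes, dtype, and sequence length are arbitrary choices for the example, and a CUDA device is assumed.

```python
# Minimal sketch: calling memory-efficient attention directly.
# Inputs use the [batch, seq_len, num_heads, head_dim] layout; scaling by
# 1/sqrt(head_dim) is applied inside the operator.
import torch
import xformers.ops as xops

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)        # bidirectional attention
causal = xops.memory_efficient_attention(              # causal attention, via a
    q, k, v, attn_bias=xops.LowerTriangularMask())     # structured bias the kernel can exploit
print(out.shape, causal.shape)                          # both [2, 1024, 8, 64]
```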
Installation (summary)
- Recommended (prebuilt wheels, requires compatible PyTorch):
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu126 (CUDA 12.6)
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu128 (CUDA 12.8)
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu129 (CUDA 12.9)
- Development / pre-release:
- pip install --pre -U xformers
- From source (for custom PyTorch versions or custom builds):
- pip install ninja (optional, speeds build)
- pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
Note: building from source may require setting TORCH_CUDA_ARCH_LIST, using compatible NVCC and GCC versions, and having enough memory available for compilation.
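After installing, `python -m xformers.info` reports the detected build and which kernels are usable. A quick Python-level sanity check might look like the following sketch (shapes and dtype are arbitrary test values):

```python
# Sanity-check sketch: confirm the wheel imports against the local PyTorch/CUDA
# runtime and that an attention kernel actually dispatches on this GPU.
import torch
import xformers
import xformers.ops as xops

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("xformers:", xformers.__version__)

q = k = v = torch.randn(1, 128, 4, 64, device="cuda", dtype=torch.float16)
# Raises at call time if no suitable kernel exists for this dtype/device/head size.
print(xops.memory_efficient_attention(q, k, v).shape)
```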
Typical Use Cases
- Research experiments that require non-standard or novel attention mechanisms.
- Performance-sensitive Transformer training / fine-tuning where fused kernels and memory-efficient attention can reduce GPU memory and runtime.
- Prototyping new Transformer blocks by composing xFormers components with standard PyTorch modules, without boilerplate (see the sketch after this list).
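To illustrate that last point, the hypothetical block below mixes standard PyTorch layers with the xFormers attention operator; the `Block` class and its dimensions are invented for this example and are not something shipped by the library.

```python
# Hypothetical minimal Transformer block: standard PyTorch layers around
# xFormers memory-efficient attention (pre-norm, GELU MLP).
import torch
import torch.nn as nn
import xformers.ops as xops

class Block(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: [batch, seq_len, dim]
        b, m, _ = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # Reshape to the [batch, seq_len, heads, head_dim] layout the operator expects.
        q, k, v = (t.reshape(b, m, self.heads, self.head_dim) for t in (q, k, v))
        attn = xops.memory_efficient_attention(q, k, v)
        x = x + self.proj(attn.reshape(b, m, -1))
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 256, 512, device="cuda", dtype=torch.float16)
print(Block().to("cuda", torch.float16)(x).shape)   # [2, 256, 512]
```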
Benchmarks & Performance
The project provides benchmark plots (e.g., memory-efficient multi-head attention vs. standard implementations) showing notable speed and memory advantages for certain workloads; the A100 results referenced in the repo are one example. Actual gains depend on model configuration, dtype, hardware, and whether the xFormers custom kernels are available and selected.
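These claims are easy to spot-check informally. The sketch below (not the repo's benchmark harness; sizes and dtype are arbitrary) compares peak GPU memory of the memory-efficient operator against a naive implementation that materializes the full attention matrix:

```python
# Informal peak-memory comparison: naive attention materializes a
# [batch, heads, seq, seq] matrix, while the memory-efficient kernel does not.
import torch
import xformers.ops as xops

B, M, H, K = 4, 2048, 16, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def naive_attention(q, k, v):
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))              # [B, H, M, K]
    attn = torch.softmax(q @ k.transpose(-2, -1) / K ** 0.5, -1)  # [B, H, M, M]
    return (attn @ v).transpose(1, 2)                              # back to [B, M, H, K]

for name, fn in [("naive", naive_attention),
                 ("memory_efficient", xops.memory_efficient_attention)]:
    torch.cuda.reset_peak_memory_stats()
    out = fn(q, k, v)
    torch.cuda.synchronize()
    print(f"{name}: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB peak")
```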
Compatibility & Requirements
- Built to be used with PyTorch (instructions reference specific PyTorch versions). Prebuilt wheels assume a matching CUDA / PyTorch runtime.
- Provides guidance for troubleshooting builds (NVCC vs CUDA runtime, GCC compatibility, TORCH_CUDA_ARCH_LIST, MAX_JOBS for ninja builds, long path issues on Windows).
License & Citation
- License: BSD-style license (see LICENSE in the repo). The codebase reuses code from, or is inspired by, several other projects (e.g., Triton, Flash-Attention, CUTLASS).
- Citation: the repository includes a BibTeX entry that authors can use when referencing xFormers in publications.
When to Choose xFormers
Choose xFormers when you need flexible, research-oriented Transformer blocks with performance optimizations that are not yet available in mainstream libraries, or when you want to experiment with alternative attention mechanisms (sparse, block-sparse, memory-efficient exact attention) with reduced engineering overhead.
