torchtitan — PyTorch-native training platform
torchtitan is a minimal, clean PyTorch-native platform built to accelerate experimentation and production-scale pretraining of generative AI models. The project emphasizes clarity, extensibility, and composable parallelism so that researchers and engineers can apply multi-dimensional scaling with minimal changes to model code.
Core focus
- Provide a simple, well-documented codebase demonstrating modern PyTorch distributed features for LLM pretraining and large-scale generative-model training.
- Enable rapid experimentation via extension points and an "experiments" folder while maintaining production-oriented utilities for checkpointing, profiling, and performance measurement.
Key features
- Multi-dimensional composable parallelisms (the FSDP2 and tensor-parallel sketches after this list show the basic composition pattern):
  - FSDP2 with per-parameter sharding
  - Tensor Parallel (including async TP)
  - Pipeline Parallel (including optimizations that reduce the pipeline bubble)
  - Context Parallel for very long context lengths
- Meta-device model initialization to avoid materializing full model weights on CPU or GPU during setup (see the meta-device/FSDP2 sketch after this list).
- Selective and full activation checkpointing to trade compute for memory (a minimal activation-checkpointing example follows the list).
- Distributed and async checkpointing, with interoperable checkpoint formats that other tools (e.g., torchtune) can load; a torch.distributed.checkpoint sketch follows the list.
- float8 and MXFP8 training support for reduced-precision speedups on supported hardware.
- Integration with torch.compile for optimized kernels when available.
- Checkpointable data loading, with a built-in configuration for the C4 dataset and support for custom datasets.
- Built-in metrics (loss, throughput, TFLOPs, MFU, GPU memory) with logging to TensorBoard or Weights & Biases (an MFU sketch appears after this list).
- Debugging and profiling tools (CPU/GPU profiling, memory profiling, Flight Recorder).
- Helper scripts for tokenizer download, Llama checkpoint conversion, FSDP/HSDP memory estimation, and distributed inference.
- Verified performance and convergence reports (benchmarks up to 512 GPUs are provided by the project).
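These features compose in ordinary PyTorch code. The following sketch is illustrative only, not torchtitan's training code: the module names, dimensions, layer count, and placeholder initialization are made up, and it assumes an 8-GPU node launched with torchrun. It shows the meta-device initialization and FSDP2 per-parameter sharding pattern, with torch.compile applied at the end.

```python
# Minimal sketch: meta-device init + FSDP2 per-parameter sharding + torch.compile.
# Launch with: torchrun --nproc_per_node=8 fsdp2_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 (older builds expose it under torch.distributed._composable.fsdp)


class Block(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return x + self.w2(torch.relu(self.w1(x)))


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 1) Build on the meta device: no parameter memory is allocated on CPU or GPU.
    with torch.device("meta"):
        model = nn.Sequential(*[Block() for _ in range(8)])

    # 2) Per-parameter sharding: shard each block, then the root module.
    for block in model:
        fully_shard(block)
    fully_shard(model)

    # 3) Materialize only this rank's shards on GPU, then initialize them.
    model.to_empty(device="cuda")
    for p in model.parameters():
        nn.init.normal_(p, std=0.02)  # placeholder init; real models run their own init_weights

    # 4) Optionally compile for fused/optimized kernels.
    model = torch.compile(model)

    out = model(torch.randn(4, 1024, device="cuda"))
    out.sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```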
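Tensor Parallel composes with FSDP2 over a named 2-D DeviceMesh. The sketch below is again only an illustration under assumed settings (8 GPUs split as 2-way data parallel x 4-way tensor parallel, a toy FeedForward module built directly on GPU rather than on the meta device); it is not torchtitan's model or parallelization code.

```python
# Sketch: Tensor Parallel ("tp" mesh dim) composed with FSDP2 ("dp" mesh dim).
# Launch with: torchrun --nproc_per_node=8 tp_fsdp_sketch.py
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FeedForward(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A named 2-D mesh: 2-way data parallel x 4-way tensor parallel.
    mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

    # Toy module built directly on GPU; a real model would start on the meta device.
    block = FeedForward().cuda()

    # Tensor Parallel over the "tp" dimension: shard w1 column-wise and w2 row-wise,
    # so the block needs a single collective per forward pass.
    parallelize_module(
        block,
        mesh["tp"],
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )

    # FSDP2 over the "dp" dimension, sharding the already TP-sharded parameters.
    fully_shard(block, mesh=mesh["dp"])

    x = torch.randn(8, 1024, device="cuda")
    block(x).sum().backward()


if __name__ == "__main__":
    main()
```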
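Full activation checkpointing wraps each block so its activations are discarded after the forward pass and recomputed during backward. The toy example below (made-up module names and sizes, runnable on CPU) shows the basic idea with torch.utils.checkpoint; torchtitan additionally offers selective per-layer and per-op policies, which are not shown here.

```python
# Sketch: full activation checkpointing around transformer-style blocks.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ffn(x)


class CheckpointedModel(nn.Module):
    def __init__(self, num_layers: int = 4, dim: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(num_layers))

    def forward(self, x):
        for block in self.blocks:
            # Activations inside each block are freed after forward and
            # recomputed during backward, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x


model = CheckpointedModel()
x = torch.randn(2, 16, 512, requires_grad=True)
model(x).sum().backward()
```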
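torchtitan's distributed and async checkpointing builds on PyTorch Distributed Checkpoint (torch.distributed.checkpoint). The helpers below are a hedged sketch, assuming the sharded model and optimizer from the earlier sketches and made-up checkpoint paths; they are not torchtitan's checkpoint manager.

```python
# Sketch: sharded save/load with torch.distributed.checkpoint (DCP).
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict


def save_checkpoint(model, optimizer, step, async_mode=True):
    # get_state_dict returns sharded (DTensor-aware) state dicts that DCP
    # writes in parallel from every rank.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    if async_mode:
        # async_save returns a future; training continues while ranks flush to storage.
        return dcp.async_save(state, checkpoint_id=f"checkpoints/step_{step}")
    dcp.save(state, checkpoint_id=f"checkpoints/step_{step}")
    return None


def load_checkpoint(model, optimizer, step):
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    # dcp.load fills the provided state dict in place from the sharded files.
    dcp.load(state, checkpoint_id=f"checkpoints/step_{step}")
    # Push the loaded values back into the live model and optimizer.
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```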
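Among the reported metrics, MFU (Model FLOPS Utilization) is achieved model FLOPS divided by theoretical peak hardware FLOPS. The helper below is purely illustrative: the 6-FLOPs-per-parameter-per-token approximation, the H100 BF16 peak constant, and the example numbers are assumptions of the sketch, not torchtitan's exact accounting.

```python
# Sketch: computing MFU from throughput, under assumed constants.
def mfu(tokens_per_second: float, flops_per_token: float,
        peak_flops_per_gpu: float = 989e12, num_gpus: int = 8) -> float:
    """Model FLOPS Utilization = achieved FLOPS / theoretical peak FLOPS.

    flops_per_token is roughly 6 * num_parameters for a dense decoder-only
    model (forward + backward), ignoring attention-score FLOPs.
    peak_flops_per_gpu defaults to an H100 dense BF16 peak (~989 TFLOPS).
    """
    achieved = tokens_per_second * flops_per_token
    return achieved / (peak_flops_per_gpu * num_gpus)


# Example: an 8B-parameter model processing 50k tokens/s across 8 GPUs.
print(f"MFU: {mfu(50_000, 6 * 8e9):.2%}")  # ~30%
```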
Model support
- Out-of-the-box support for training Meta's Llama 3.1 (8B, 70B, 405B) with example train configs and helper scripts.
Installation & usage notes
- torchtitan is developed against recent PyTorch nightly builds; the README recommends the latest PyTorch nightly for the newest features, while each stable torchtitan release pins a compatible PyTorch version.
- It can be installed from source, from pre-release nightlies, or from stable pip/conda releases.
- Provides run scripts and multi-node examples (Slurm/ParallelCluster), and simple commands to start training (e.g., launch an 8-GPU Llama 3 8B run).
Research & provenance
- The project is accompanied by a paper accepted to ICLR 2025 (arXiv:2410.06511). The README includes citation information and links to the ICLR poster and OpenReview entry.
Community & license
- Hosted under the pytorch GitHub organization and maintained as an open-source project (BSD-3-Clause). The repository contains contribution guidelines, an experiments folder for new ideas, and a community forum category for distributed/torchtitan discussions.
When to use
- Use torchtitan if you want a PyTorch-native, minimally opinionated platform for experimenting with large-scale LLM pretraining techniques (multi-dimensional parallelism, advanced checkpointing, low-precision formats) while keeping changes to model code small. It is suitable both for research exploration and as a base for production-ready pretraining pipelines.
