torchtitan — PyTorch-native training platform
torchtitan is a minimal, clean PyTorch-native platform built to accelerate experimentation and production-scale pretraining of generative AI models. The project emphasizes clarity, extensibility, and composable parallelism so that researchers and engineers can apply multi-dimensional scaling with minimal changes to model code.
Core focus
- Provide a simple, well-documented codebase demonstrating modern PyTorch distributed features for LLM pretraining and large-scale generative-model training.
- Enable rapid experimentation via extension points and an "experiments" folder while maintaining production-oriented utilities for checkpointing, profiling, and performance measurement.
Key features
- Multi-dimensional composable parallelisms (the FSDP2 and tensor-parallel sketches after this list show the basic composition pattern):
  - FSDP2 with per-parameter sharding
  - Tensor Parallel (including async TP)
  - Pipeline Parallel (including optimizations that reduce the pipeline bubble)
  - Context Parallel for very long context lengths
- Meta-device model initialization to avoid materializing full model weights on CPU or GPU during setup (see the meta-device/FSDP2 sketch after this list).
- Selective and full activation checkpointing to trade compute for memory (a minimal activation-checkpointing example follows the list).
- Distributed and async checkpointing, with interoperable checkpoint formats that other tools (e.g., torchtune) can load; a torch.distributed.checkpoint sketch follows the list.
- float8 and MXFP8 training support for reduced-precision speedups on supported hardware.
- Integration with torch.compile for optimized kernels when available.
- Checkpointable data loading, with a built-in configuration for the C4 dataset and support for custom datasets.
- Built-in metrics (loss, throughput, TFLOPs, MFU, GPU memory) with logging to TensorBoard or Weights & Biases (an MFU sketch appears after this list).
- Debugging and profiling tools (CPU/GPU profiling, memory profiling, Flight Recorder).
- Helper scripts for tokenizer download, Llama checkpoint conversion, FSDP/HSDP memory estimation, and distributed inference.
- Verified performance and convergence reports (benchmarks up to 512 GPUs are provided by the project).
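These features compose in ordinary PyTorch code. The following sketch is illustrative only, not torchtitan's training code: the module names, dimensions, layer count, and placeholder initialization are made up, and it assumes an 8-GPU node launched with torchrun. It shows the meta-device initialization and FSDP2 per-parameter sharding pattern, with torch.compile applied at the end.

```python
# Minimal sketch: meta-device init + FSDP2 per-parameter sharding + torch.compile.
# Launch with: torchrun --nproc_per_node=8 fsdp2_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 (older builds expose it under torch.distributed._composable.fsdp)


class Block(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return x + self.w2(torch.relu(self.w1(x)))


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 1) Build on the meta device: no parameter memory is allocated on CPU or GPU.
    with torch.device("meta"):
        model = nn.Sequential(*[Block() for _ in range(8)])

    # 2) Per-parameter sharding: shard each block, then the root module.
    for block in model:
        fully_shard(block)
    fully_shard(model)

    # 3) Materialize only this rank's shards on GPU, then initialize them.
    model.to_empty(device="cuda")
    for p in model.parameters():
        nn.init.normal_(p, std=0.02)  # placeholder init; real models run their own init_weights

    # 4) Optionally compile for fused/optimized kernels.
    model = torch.compile(model)

    out = model(torch.randn(4, 1024, device="cuda"))
    out.sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```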
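Tensor Parallel composes with FSDP2 over a named 2-D DeviceMesh. The sketch below is again only an illustration under assumed settings (8 GPUs split as 2-way data parallel x 4-way tensor parallel, a toy FeedForward module built directly on GPU rather than on the meta device); it is not torchtitan's model or parallelization code.

```python
# Sketch: Tensor Parallel ("tp" mesh dim) composed with FSDP2 ("dp" mesh dim).
# Launch with: torchrun --nproc_per_node=8 tp_fsdp_sketch.py
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FeedForward(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A named 2-D mesh: 2-way data parallel x 4-way tensor parallel.
    mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

    # Toy module built directly on GPU; a real model would start on the meta device.
    block = FeedForward().cuda()

    # Tensor Parallel over the "tp" dimension: shard w1 column-wise and w2 row-wise,
    # so the block needs a single collective per forward pass.
    parallelize_module(
        block,
        mesh["tp"],
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )

    # FSDP2 over the "dp" dimension, sharding the already TP-sharded parameters.
    fully_shard(block, mesh=mesh["dp"])

    x = torch.randn(8, 1024, device="cuda")
    block(x).sum().backward()


if __name__ == "__main__":
    main()
```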
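Full activation checkpointing wraps each block so its activations are discarded after the forward pass and recomputed during backward. The toy example below (made-up module names and sizes, runnable on CPU) shows the basic idea with torch.utils.checkpoint; torchtitan additionally offers selective per-layer and per-op policies, which are not shown here.

```python
# Sketch: full activation checkpointing around transformer-style blocks.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ffn(x)


class CheckpointedModel(nn.Module):
    def __init__(self, num_layers: int = 4, dim: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(num_layers))

    def forward(self, x):
        for block in self.blocks:
            # Activations inside each block are freed after forward and
            # recomputed during backward, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x


model = CheckpointedModel()
x = torch.randn(2, 16, 512, requires_grad=True)
model(x).sum().backward()
```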
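torchtitan's distributed and async checkpointing builds on PyTorch Distributed Checkpoint (torch.distributed.checkpoint). The helpers below are a hedged sketch, assuming the sharded model and optimizer from the earlier sketches and made-up checkpoint paths; they are not torchtitan's checkpoint manager.

```python
# Sketch: sharded save/load with torch.distributed.checkpoint (DCP).
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict


def save_checkpoint(model, optimizer, step, async_mode=True):
    # get_state_dict returns sharded (DTensor-aware) state dicts that DCP
    # writes in parallel from every rank.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    if async_mode:
        # async_save returns a future; training continues while ranks flush to storage.
        return dcp.async_save(state, checkpoint_id=f"checkpoints/step_{step}")
    dcp.save(state, checkpoint_id=f"checkpoints/step_{step}")
    return None


def load_checkpoint(model, optimizer, step):
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    # dcp.load fills the provided state dict in place from the sharded files.
    dcp.load(state, checkpoint_id=f"checkpoints/step_{step}")
    # Push the loaded values back into the live model and optimizer.
    set_state_dict(
        model,
        optimizer,
        model_state_dict=state["model"],
        optim_state_dict=state["optim"],
    )
```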
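Among the reported metrics, MFU (Model FLOPS Utilization) is achieved model FLOPS divided by theoretical peak hardware FLOPS. The helper below is purely illustrative: the 6-FLOPs-per-parameter-per-token approximation, the H100 BF16 peak constant, and the example numbers are assumptions of the sketch, not torchtitan's exact accounting.

```python
# Sketch: computing MFU from throughput, under assumed constants.
def mfu(tokens_per_second: float, flops_per_token: float,
        peak_flops_per_gpu: float = 989e12, num_gpus: int = 8) -> float:
    """Model FLOPS Utilization = achieved FLOPS / theoretical peak FLOPS.

    flops_per_token is roughly 6 * num_parameters for a dense decoder-only
    model (forward + backward), ignoring attention-score FLOPs.
    peak_flops_per_gpu defaults to an H100 dense BF16 peak (~989 TFLOPS).
    """
    achieved = tokens_per_second * flops_per_token
    return achieved / (peak_flops_per_gpu * num_gpus)


# Example: an 8B-parameter model processing 50k tokens/s across 8 GPUs.
print(f"MFU: {mfu(50_000, 6 * 8e9):.2%}")  # ~30%
```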
Model support
- Out-of-the-box support for training Meta's Llama 3.1 (8B, 70B, 405B) with example train configs and helper scripts.
Installation & usage notes
- torchtitan is developed against recent PyTorch nightly builds; the README recommends the latest PyTorch nightly for the newest features, while each stable torchtitan release pins a compatible PyTorch version.
- It can be installed from source, from pre-release nightlies, or from stable pip/conda releases.
- Provides run scripts and multi-node examples (Slurm/ParallelCluster), and simple commands to start training (e.g., launch an 8-GPU Llama 3 8B run).
Research & provenance
- The project is accompanied by a paper accepted to ICLR 2025 (arXiv:2410.06511). The README includes citation information and links to the ICLR poster and OpenReview entry.
Community & license
- Hosted under the pytorch GitHub organization and maintained as an open-source project (BSD-3-Clause). The repository contains contribution guidelines, an experiments folder for new ideas, and a community forum category for distributed/torchtitan discussions.
When to use
- Use torchtitan if you want a PyTorch-native, minimally opinionated platform for experimenting with large-scale LLM pretraining techniques (multi-dimensional parallelism, advanced checkpointing, low-precision formats) while keeping changes to model code small. It is suitable both for research exploration and as a base for production-ready pretraining pipelines.
