Overview
Megatron-LM pioneered intra-layer (tensor) model parallelism and combines it with pipeline parallelism, enabling training of GPT-style models with hundreds of billions of parameters at high GPU efficiency.
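The core idea behind tensor parallelism is to shard a layer's weight matrix across devices so each computes only a slice of the output. A minimal pure-Python sketch (no real GPUs; sizes and helper names are illustrative, not the Megatron API):

```python
# Column-parallel linear layer: the weight matrix is split column-wise
# across "devices", each computes a partial output, and the partial
# results are concatenated (an all-gather in a real implementation).

def matmul(x, w):
    """Multiply row vector x (list) by matrix w (list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, parts):
    """Split matrix w column-wise into `parts` shards, one per device."""
    cols = len(w[0])
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0]                         # input activation
w = [[1.0, 2.0, 3.0, 4.0],             # full 2x4 weight matrix
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)     # shard across 2 "GPUs"
partials = [matmul(x, s) for s in shards]
y = [v for p in partials for v in p]   # concatenate partial outputs

assert y == matmul(x, w)               # matches the unsharded computation
```

Because each shard's columns are independent, no communication is needed during the matmul itself, only when the outputs are gathered.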
Key Capabilities
- Tensor & pipeline parallel APIs with minimal code changes
- Fused LayerNorm, bias-GeLU, and FlashAttention kernels
- Activation recomputation & distributed optimizer sharding
- Megatron Core library for plug-and-play integration
- Extensive examples and Docker images for quick start
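Pipeline parallelism, listed above, splits the model into stages and streams micro-batches through them so stages overlap work rather than idling. A toy schedule simulation, assuming nothing about Megatron's internals (stage functions and counts are made up for illustration):

```python
# GPipe-style pipeline schedule sketch: a batch is split into
# micro-batches that flow through pipeline stages. At timestep t,
# stage s works on micro-batch t - s, so stages run concurrently.

def pipeline_forward(stages, microbatches):
    """Run microbatches through stages, recording (time, stage, mb) events."""
    events = []
    for t in range(len(microbatches) + len(stages) - 1):
        for s, stage in enumerate(stages):
            mb = t - s                      # micro-batch active at stage s
            if 0 <= mb < len(microbatches):
                microbatches[mb] = stage(microbatches[mb])
                events.append((t, s, mb))
    return microbatches, events

stages = [lambda x: x + 1, lambda x: x * 2]   # two toy pipeline stages
outputs, events = pipeline_forward(stages, [1, 2, 3, 4])

assert outputs == [4, 6, 8, 10]               # (x + 1) * 2 per micro-batch
# 4 micro-batches through 2 stages finish in 4 + 2 - 1 = 5 steps,
# versus 8 if each item drained the whole pipeline before the next began.
assert max(t for t, _, _ in events) + 1 == 5
```

More micro-batches shrink the relative size of the startup/drain "bubble", which is why Megatron-style training favors many small micro-batches per global batch.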