GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Chops any layer-sequence model across accelerators and splits each mini-batch into micro-batches to keep the pipeline busy, hitting near-linear speedup without architecture-specific tricks or fast interconnects.


Introduction

Most "train a giant model" papers propose a new architecture; this one proposes a scheduler. GPipe's quiet bet was that scaling is mainly a systems problem, not a modeling one — so instead of designing memory-frugal networks, it treats any model that's a sequence of layers as something you can slice across devices and run like an assembly line. The catch with naive layer-splitting is that only one accelerator works at a time; GPipe's actual trick is cutting each mini-batch into micro-batches so the stages overlap, turning an idle pipeline into a busy one.
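
The overlap described above can be made concrete with a toy schedule. This is a minimal sketch, not GPipe's implementation: it assumes every stage takes exactly one time step per micro-batch and models only the forward pass, and the names `forward_schedule` and `idle_fraction` are made up for illustration.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Return one dict per time step mapping stage index -> micro-batch index.

    Assumption: each stage needs one time step per micro-batch, and
    micro-batch m can enter stage s only after it has left stage s - 1.
    """
    total_steps = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_steps):
        active = {}
        for stage in range(num_stages):
            micro_batch = t - stage  # micro-batch m reaches stage s at step m + s
            if 0 <= micro_batch < num_micro_batches:
                active[stage] = micro_batch
        schedule.append(active)
    return schedule


def idle_fraction(num_stages, num_micro_batches):
    """Fraction of (stage, time step) slots in which a stage has nothing to do."""
    schedule = forward_schedule(num_stages, num_micro_batches)
    slots = num_stages * len(schedule)
    busy = sum(len(step) for step in schedule)
    return 1 - busy / slots
```

Under these assumptions, four stages fed a single undivided mini-batch sit idle 75% of the time (`idle_fraction(4, 1)`), while cutting the mini-batch into 16 micro-batches brings that down to 3/19, roughly 16% (`idle_fraction(4, 16)`).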

Key Findings

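Pipeline parallelism with micro-batch splitting. Partitioning a model by layers alone leaves most devices idle while they wait on each other (the pipeline "bubble"); splitting each mini-batch into micro-batches that flow through the stages back-to-back shrinks the bubble and recovers almost-linear speedup as you add accelerators. The paper reports the bubble overhead becomes negligible once the number of micro-batches is at least four times the number of partitions.
Re-materialization buys memory. Instead of storing every activation, GPipe keeps only the activations at partition boundaries and recomputes the rest during the backward pass, so a single device can hold a far larger slice of the model. This is what makes the giant models fit at all.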
Synchronous, exact gradients. Unlike asynchronous model-parallel schemes, GPipe's updates are mathematically identical to single-device training, so scaling doesn't quietly change what you're optimizing.
It transfers across domains. A 557M-parameter AmoebaNet reached 84.4% top-1 on ImageNet-2012, and a single 6B-parameter, 128-layer Transformer beat all bilingual baselines across 100+ languages — same library, very different architectures.
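
The "exact gradients" claim rests on a simple identity: for a loss averaged over examples, averaging the gradients of equal-sized micro-batches gives the same result as one pass over the whole mini-batch. The sketch below checks this on a one-parameter least-squares model. It is an illustration under stated assumptions, not GPipe code: the model `y_hat = w * x`, the helper names, and the requirement that the mini-batch divides evenly are all choices made here, and it ignores layers whose statistics depend on batch composition, such as batch normalization.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the scalar model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, num_micro_batches):
    """Average per-micro-batch gradients, applying nothing until all are summed."""
    if len(xs) % num_micro_batches != 0:
        raise ValueError("mini-batch must split into equal-sized micro-batches")
    size = len(xs) // num_micro_batches
    total = 0.0
    for m in range(num_micro_batches):
        lo, hi = m * size, (m + 1) * size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi])  # accumulate, do not update w
    return total / num_micro_batches


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
full = grad_mse(0.5, xs, ys)
split = accumulated_grad(0.5, xs, ys, num_micro_batches=2)
print(abs(full - split) < 1e-9)
```

Because the weights are only updated after every micro-batch has contributed, the optimizer sees the same gradient it would have seen on a single device, up to floating-point rounding.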

Great Fit / When to Skip

Great fit if you want the conceptual roots of how today's large models are physically trained, or you're choosing between pipeline, tensor, and data parallelism and want to understand the pipeline branch from its source. Look elsewhere if you need a current recipe: modern stacks such as Megatron-LM and DeepSpeed combine pipeline parallelism with tensor parallelism and ZeRO-style sharding, other pipeline systems such as PipeDream use different schedules, and the bubble-vs-memory tradeoffs here have since been refined.


Information

Website: ar5iv.labs.arxiv.org
Organizations: Google Brain
Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu …
Published date: 2018/11/16

More Items

Embodied AI, 2026

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Ling Xu, Chuyu Han +7 (Southeast University, Nanjing University +2)

Provides a portable C++ inference runtime to deploy embodied AI models (vision–language–action and world–action) on heterogeneous robot hardware, enabling latency-first batch-1 closed-loop control. Key features include modular multi-rate layers, fused low-latency inference, and extensible head/IO plugins.

robotics ai-inference ai-serving ai-deploy mLOps +5

Machine Learning Engineering Papers, 2026

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Wei Pang, Xiangru Jian +11

Standardizes representation-level evaluation for tabular encoders by exporting row-, column-, and table-level embeddings and probing them with shared lightweight heads across three suites (TRL-CTbench, TRL-Rbench, TRL-DLTE). Supplies curated benchmark assets and task rewrites (50 OpenML tables, 123 targets, a 47,772-table DLTE lake) to enable fair cross-paradigm comparison.

paper code github embeddings ai-leaderboard +2

AI Agent Papers, 2024

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez +5 (Princeton Language and Intelligence, Princeton University)

Treats the interface between an LM agent and a computer as a design variable. A custom agent-computer interface (ACI) with concise file-edit, repo-navigation, and test commands plus compact feedback reaches 12.5% pass@1 on SWE-bench, 87.7% on HumanEvalFix.

paper ai-agent LLM ai-coding engineering