
Awesome-ML-SYS-Tutorial

A GitHub repository of learning notes and code dedicated to ML + SYS (machine learning systems). It collects tutorials, code walkthroughs, and engineering notes on RLHF, distributed training (FSDP, Megatron), inference and scheduling (SGLang, vLLM), quantization, CUDA/GPU optimization, system design, and practical engineering.

Introduction

Awesome-ML-SYS-Tutorial — Detailed introduction

Awesome-ML-SYS-Tutorial is a personal, curated GitHub repository that documents the author's learning path, notes, and code for building and optimizing machine learning systems (ML + SYS). The project focuses on bridging theory and systems engineering to turn ML research into reliable, production-capable applications.

Scope and focus
  • RLHF and reinforcement-learning-based training systems: detailed notes and implementations covering RLHF pipelines, rollout engines, and specific frameworks such as slime, AReaL, and verl.
  • Distributed training and large-model engineering: discussions and guides on FSDP, Megatron-style parallelism, PPO/GRPO training variants, chunked GAE, and tricks to scale training for long-context and multi-turn RL scenarios.
  • Inference, scheduling and serving: deep dives into SGLang, vLLM worker integration, KV cache management, zero-overhead batch scheduling, speculative decoding, and design choices for low-latency, high-throughput model serving.
  • Low-level performance engineering: CUDA Graphs, memory snapshot tooling to diagnose leaks, latency optimizations for weight updates, and techniques such as FP8 usage in RL to accelerate sampling and training.
  • Quantization and model efficiency: practical notes on AWQ, BF16 trade-offs, and quantization design considerations for serving large models.
  • Engineering & developer tooling: Docker usage, CI for notebooks, development environment setup, and other notes on reproducibility and delivery.
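
To make one of the listed topics concrete, here is a minimal sketch of generalized advantage estimation (GAE), the advantage computation commonly paired with PPO. It shows the standard, unchunked backward recursion only; it is not the repository's chunked GAE implementation, and the function name, argument layout, and default values are assumptions chosen for illustration.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over one trajectory (illustrative helper, not repo code).

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=0.0` the result reduces to the one-step TD errors, and with `gamma=1.0, lam=1.0` it becomes the undiscounted return minus the value baseline, which is a quick way to sanity-check the recursion.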
Content and format
  • The repository mixes Chinese and English content; many major sections provide both language versions of essays and code walk-throughs.
  • It contains long-form write-ups, implementation guides, architecture walkthroughs, and runnable code examples for system-level components.
  • Several entries are marked "Pending Review"; the author notes that the material has been under ongoing maintenance and gradual reorganization since the notes were first written in late 2024.
Who it's for
  • Researchers and engineers who want to learn practical ML systems engineering: from building RLHF pipelines to deploying inference servers for large multimodal models.
  • People interested in performance debugging, distributed training best practices, and real-world system design trade-offs for large models.
Notable signals
  • The repo documents hands-on engineering solutions (e.g., integrating FSDP, speculative decoding in RL rollouts, FP8-only sampling/training experiments) and provides code-level walkthroughs suitable for practitioners.
  • The repo emphasizes the intersection of theory and systems: the author explicitly frames combining the two as a path to practical applications.
Metadata
  • Created on: 2024-11-09
  • Primary maintainer / author: GitHub user "zhaochenyang20"
  • Status: public GitHub repository with many stars and community contributions; it contains both notes and code intended to help others learn and reproduce ML + SYS tooling and techniques.