AIAny - Awesome-ML-SYS-Tutorial

Awesome-ML-SYS-Tutorial — Detailed introduction

Awesome-ML-SYS-Tutorial is a personal, curated GitHub repository that documents the author's learning path, notes, and code for building and optimizing machine learning systems (ML + SYS). The project focuses on bridging theory and systems engineering to turn ML research into reliable, production-capable applications.

Scope and focus

RLHF and reinforcement-learning-based training systems: detailed notes and implementations around RLHF frameworks and rollout engines, with specific coverage of slime, AReaL, and verl.
Distributed training and large-model engineering: discussions and guides on FSDP, Megatron-style parallelism, PPO/GRPO training variants, chunked GAE, and tricks to scale training for long-context and multi-turn RL scenarios.
Inference, scheduling and serving: deep dives into SGLang, vLLM worker integration, KV cache management, zero-overhead batch scheduling, speculative decoding, and design choices for low-latency, high-throughput model serving.
Low-level performance engineering: CUDA Graphs, memory snapshot tooling to diagnose leaks, latency optimizations for weight updates, and techniques such as FP8 usage in RL to accelerate sampling and training.
Quantization and model efficiency: practical notes on AWQ, BF16 trade-offs, and quantization design considerations for serving large models.
Engineering & developer tooling: Docker usage, CI for notebooks, development environment setup, and other notes on reproducibility and delivery.

Content and format

The repository mixes Chinese and English content; many major sections provide both language versions of essays and code walk-throughs.
It contains long-form write-ups, implementation guides, architecture walkthroughs, and runnable code examples for system-level components.
Several entries are marked as "Pending Review"; the author notes that maintenance is ongoing and that the notes, first written in late 2024, are being gradually reorganized.

Who it's for

Researchers and engineers who want to learn practical ML systems engineering, from building RLHF pipelines to deploying inference servers for large multimodal models.
People interested in performance debugging, distributed training best practices, and real-world system design trade-offs for large models.

Notable signals

The repo documents hands-on engineering solutions (e.g., integrating FSDP, speculative decoding in RL rollouts, FP8-only sampling/training experiments) and provides code-level walkthroughs suitable for practitioners.
The repo emphasizes the intersection of theory and systems: the author explicitly frames combining the two as a path to practical applications.

Metadata

Created on: 2024-11-09
Primary maintainer / author: GitHub user "zhaochenyang20"
Public GitHub repository with many stars and community contributions; contains both notes and code designed to help others reproduce and learn the ML-SYS tooling and techniques.
