Overview
SkyPilot is an open-source system designed to let AI teams and infra teams run, manage, and scale AI workloads on virtually any infrastructure. It exposes a simple unified interface (YAML or Python API + CLI) so the same task definition can be launched on Kubernetes, Slurm, or many cloud providers without code changes. SkyPilot emphasizes portability, cost efficiency, and operational ergonomics for training, distributed jobs, and model serving.
Key features
- Unified job-as-code interface: define resources, setup, data sync and run commands in a YAML or Python task spec so jobs are portable across infra.
- Multi-infra provisioning: transparently provision VMs, containers, or pods across Kubernetes clusters and many cloud providers.
- Cost and availability optimization: spot instance support with preemption auto-recovery, intelligent scheduling to pick cheapest/available infra, and autostop/cleanup of idle resources.
- Job lifecycle and orchestration: queueing, streaming logs, auto-retry, checkpoint-aware restarts, and autoscaling patterns for large workloads.
- Developer ergonomics: local dev experience on K8s (SSH into pods, sync code, connect IDE), simple commands for launching and managing experiments.
- Examples & integrations: built-in examples for training (DeepSpeed, PyTorch, TorchTitan, verl), serving (e.g., vLLM and similar inference backends), LLM workflows, RAG, and integrations with frameworks and vector databases.
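The "pick cheapest/available infra" idea above can be illustrated with a minimal conceptual sketch. This is a toy, not SkyPilot's actual optimizer or API; the infra names and prices below are invented for illustration:

```python
# Conceptual sketch of cost/availability-driven placement, in the spirit of
# SkyPilot's optimizer. NOT SkyPilot's real code; names/prices are invented.
from dataclasses import dataclass

@dataclass
class Offering:
    infra: str           # e.g. "aws/us-east-1" (illustrative label)
    hourly_price: float  # USD/hour for the requested accelerators (made up)
    available: bool      # whether capacity was found

def pick_cheapest(offerings):
    """Return the cheapest offering that has capacity, or None."""
    candidates = [o for o in offerings if o.available]
    return min(candidates, key=lambda o: o.hourly_price, default=None)

offerings = [
    Offering("aws/us-east-1", 32.77, False),    # no capacity right now
    Offering("gcp/us-central1", 29.39, True),
    Offering("lambda/us-west-1", 10.32, True),
]
best = pick_cheapest(offerings)
print(best.infra)  # cheapest *available* offering wins
```

The real system additionally handles quotas, regions, spot pricing, and failover, but the core selection principle is the same: filter by availability, then rank by cost.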
Supported infrastructure
SkyPilot supports a long list of infrastructures including (but not limited to) Kubernetes, Slurm, AWS, GCP, Azure, OCI, CoreWeave, Lambda Cloud, RunPod, Fluidstack, Paperspace, Vast.ai, VMware vSphere, and many others. This makes it suitable for hybrid and multi-cloud environments.
Typical workflow
- Write a task spec (YAML or Python) describing resources, setup steps, and run commands.
- Run `sky launch <task.yaml>` to provision resources; SkyPilot finds suitable infra, provisions it, syncs the workdir, runs setup, and starts the job.
- Stream logs, inspect jobs, and let SkyPilot handle retries/auto-recovery or cleanup.
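A typical session might look like the following (the cluster name `dev` and file `task.yaml` are placeholders; the commands follow the SkyPilot CLI, but verify flags against your installed version):

```shell
# Launch: provision infra, sync the workdir, run setup, then start the task.
sky launch -c dev task.yaml

# Inspect clusters and stream logs for job id 1 (the first job on the cluster).
sky status
sky logs dev 1

# Stop the cluster after 30 idle minutes, or tear it down immediately.
sky autostop dev -i 30
sky down dev
```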
Example YAML snippet:

```yaml
resources:
  accelerators: A100:8

num_nodes: 1

workdir: ~/torch_examples

setup: |
  pip install -r requirements.txt

run: |
  python main.py --epochs 1
```

Ecosystem, research and origin
SkyPilot started from the Sky Computing Lab at UC Berkeley and the project links to related academic work (e.g., NSDI 2023 paper and a Sky Computing whitepaper). It maintains documentation, demos, and a blog with case studies and benchmarks. The project is open-source on GitHub and has an active set of examples for training and serving modern LLMs and AI workloads.
When to use SkyPilot
- You need a portable, single interface to run the same AI job across clusters and multiple cloud providers.
- You want cost-optimized provisioning (spot instances, auto-failover) with orchestration safeguards.
- You need tooling that eases large-scale or distributed training and serving while exposing developer-friendly workflows.
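For example, spot usage and fallback across accelerator candidates can be expressed directly in the task spec. This is a sketch: the field names follow SkyPilot's YAML schema, but `train.py` and its flags are hypothetical, and you should verify the exact syntax against your version's docs:

```yaml
resources:
  use_spot: true                   # prefer spot/preemptible capacity
  accelerators: {A100:8, A10G:8}   # any of these candidates is acceptable

run: |
  # hypothetical entrypoint that resumes from checkpoints after preemption
  python train.py --resume latest
```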
Getting started
Visit the documentation (official site) for installation, quickstart, and CLI references. Typical install is via pip and there are nightly/source install options for the latest fixes and features.
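A minimal install sketch (package names as published on PyPI; the extras shown are examples, see the docs for the full list of supported infra extras):

```shell
# Stable release with optional infra extras:
pip install -U "skypilot[kubernetes,aws]"

# Verify which infra credentials are configured:
sky check

# Nightly build for the latest fixes and features:
pip install -U skypilot-nightly
```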
