Design a GPU Cluster Scheduler (MIG + Gang)
Bin packing, MIG slicing, gang scheduling, K8s GPU operator, preemption, QoS. NVIDIA signature SDI.
Theory
Explanation
Distributed training jobs need all of their GPUs at the same time (gang scheduling); inference jobs often need only a fraction of a GPU (MIG slicing). The scheduler honors both, packs jobs tightly, and accounts for topology (NVLink islands).
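The all-or-nothing, topology-aware placement described above can be sketched roughly as follows. This is a minimal illustration, not a production scheduler: the `Island` and `GangRequest` types, the `place_gang` function, and the best-fit scoring rule are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Island:
    """Hypothetical model: GPUs on one host that share an NVLink domain."""
    host: str
    island_id: int
    free_gpus: List[int]  # indices of currently free GPUs


@dataclass
class GangRequest:
    """Hypothetical request: N GPUs, optionally all on one NVLink island."""
    job: str
    num_gpus: int
    same_island: bool


def place_gang(req: GangRequest,
               islands: List[Island]) -> Optional[List[Tuple[str, int]]]:
    """Return [(host, gpu_index), ...] for the whole gang, or None.

    Nothing is allocated unless every GPU in the gang can be placed.
    """
    if req.same_island:
        # Filter: keep only islands that can hold the entire gang.
        fits = [i for i in islands if len(i.free_gpus) >= req.num_gpus]
        if not fits:
            return None
        # Score: best fit (fewest GPUs left over) to keep packing tight.
        best = min(fits, key=lambda i: len(i.free_gpus) - req.num_gpus)
        chosen = [(best, g) for g in best.free_gpus[:req.num_gpus]]
    else:
        # The gang may span islands: drain the emptiest islands first.
        chosen = []
        for isl in sorted(islands, key=lambda i: len(i.free_gpus)):
            for g in isl.free_gpus:
                if len(chosen) < req.num_gpus:
                    chosen.append((isl, g))
        if len(chosen) < req.num_gpus:
            return None

    # Commit only after the full gang has been found.
    for isl, g in chosen:
        isl.free_gpus.remove(g)
    return [(isl.host, g) for isl, g in chosen]


inventory = [Island("h1", 0, list(range(8))), Island("h2", 0, [0, 1, 2, 3])]
print(place_gang(GangRequest("tp4", 4, same_island=True), inventory))
```

Selecting candidates first and committing afterwards is what makes the placement all-or-nothing: a request that cannot be fully satisfied leaves the inventory untouched.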
The scheduler tracks GPU inventory per host, including which MIG profiles each GPU supports. A request specifies N GPUs, topology hints (same node? same NVLink island?), and a priority. Gang scheduling holds a job until all N GPUs can be co-scheduled. Preemption: low-priority inference jobs are evicted to make room for high-priority training. Topology awareness: prefer GPUs on the same NVLink island for tensor-parallel workloads. MIG: slice an H100 into up to 7 isolated GPU instances so inference tenants share a GPU fairly.
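To make the MIG side concrete, here is a small sketch of packing inference requests onto MIG-enabled GPUs with first-fit decreasing. It assumes an H100 80GB with 7 compute slices and 8 memory slices and uses a simplified capacity model: real MIG also restricts where on the GPU each instance may be placed, which this sketch ignores. The `PROFILES` table and `pack_mig` function are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple

# Assumed H100 80GB profiles -> (compute slices, memory slices).
# Simplified: real MIG additionally constrains instance placement positions.
PROFILES: Dict[str, Tuple[int, int]] = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
COMPUTE_SLICES, MEMORY_SLICES = 7, 8


def pack_mig(requests: List[Tuple[str, str]], num_gpus: int) -> Dict[str, int]:
    """First-fit decreasing: place the largest profiles first.

    requests is a list of (job_name, profile). Returns job_name -> GPU index
    for every job that fits; jobs that do not fit are left out of the result.
    """
    free = [[COMPUTE_SLICES, MEMORY_SLICES] for _ in range(num_gpus)]
    placement: Dict[str, int] = {}
    ordered = sorted(requests, key=lambda r: PROFILES[r[1]][0], reverse=True)
    for job, profile in ordered:
        need_c, need_m = PROFILES[profile]
        for gpu, (free_c, free_m) in enumerate(free):
            if free_c >= need_c and free_m >= need_m:
                free[gpu][0] -= need_c
                free[gpu][1] -= need_m
                placement[job] = gpu
                break
    return placement


jobs = [("a", "3g.40gb"), ("b", "4g.40gb"), ("c", "1g.10gb"),
        ("d", "2g.20gb"), ("e", "3g.40gb")]
print(pack_mig(jobs, num_gpus=2))
```

Tracking memory slices alongside compute slices matters: under these assumed slice counts, two `3g.40gb` instances leave one compute slice free but no memory, so a further `1g.10gb` request does not fit on that GPU.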
When to use
Internal ML platforms, cloud GPU services, HPC clusters.
When not to
Single-tenant single-GPU workloads.
flowchart TB
Req[Job · 8 GPUs same NVLink + gang] --> Sched[GPU Scheduler]
Sched --> Inv[(GPU Inventory · per host + MIG)]
Sched --> Filter[Topology + MIG filter]
Filter --> Score[Score · packing + locality]
Score --> Gang{All slots available?}
Gang -->|yes| Place[Place job]
Gang -->|no| Wait[Wait / backfill smaller]
Preempt[High-pri arrives] --> Evict[Evict low-pri]
Evict --> Sched
Key insights
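- Gang scheduling is all-or-nothing, which avoids partial-allocation deadlock (two jobs each holding half the GPUs the other needs).
- MIG partitions one H100 into up to 7 isolated GPU instances, which gives inference workloads predictable SLAs.
- Topology hints matter: a TP=8 job should land on a single NVLink island; otherwise its collectives cross slower links and it can run on the order of 10x slower.
- Backfill: small jobs fill the gaps while a large gang job waits, which keeps utilization high.
- Preemption relies on job checkpointing: only training jobs that checkpoint regularly can be evicted and resumed without losing significant work.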
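The backfill and preemption points above can be illustrated with a short sketch. It rests on stated assumptions rather than any particular scheduler's behavior: the `Job` type, `backfill`, and `pick_victims` are hypothetical, and the policy of evicting only jobs marked `checkpointable` is one possible choice (stateless inference replicas could be marked that way too).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical job record used only for this sketch."""
    name: str
    gpus: int
    priority: int                 # higher number = more important
    checkpointable: bool = False  # assumed flag: safe to evict and resume


def backfill(queue: List[Job], free_gpus: int) -> List[Job]:
    """Walk the queue in order; a job that does not fit keeps waiting
    while later, smaller jobs are started in the remaining gaps."""
    started = []
    for job in queue:
        if job.gpus <= free_gpus:
            started.append(job)
            free_gpus -= job.gpus
    return started


def pick_victims(running: List[Job], incoming: Job,
                 free_gpus: int) -> Optional[List[Job]]:
    """Pick lower-priority, checkpointable jobs to evict so that `incoming`
    fits. Returns [] if it already fits, None if eviction cannot free enough."""
    needed = incoming.gpus - free_gpus
    if needed <= 0:
        return []
    # Lowest priority first; among equals, larger jobs first (fewer evictions).
    candidates = sorted(
        (j for j in running
         if j.priority < incoming.priority and j.checkpointable),
        key=lambda j: (j.priority, -j.gpus),
    )
    victims, reclaimed = [], 0
    for j in candidates:
        if reclaimed >= needed:
            break
        victims.append(j)
        reclaimed += j.gpus
    return victims if reclaimed >= needed else None


queue = [Job("gang", 8, 1), Job("s1", 2, 1), Job("s2", 1, 1), Job("s3", 4, 1)]
print([j.name for j in backfill(queue, free_gpus=3)])
```

This naive backfill can starve the waiting gang job if small jobs keep arriving; schedulers commonly pair backfill with a reservation for the job at the head of the queue so that it eventually starts.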