
Design Azure VM Scheduler (Bin Packing + Fault Domains)

Covers VM allocation, bin packing, fault domains, and SLA-aware placement. A signature Microsoft system design interview (SDI) topic.


Theory

Explanation


A large cloud runs 1M+ physical hosts. VMs come in fixed shapes, from 1 vCPU/2 GB up to 96 vCPU/768 GB. The scheduler packs them onto hosts to maximize utilization while honoring constraints: fault-domain spread, update-domain spread, anti-affinity, GPU and memory requirements, and region/AZ rules.
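To make the packing problem concrete, here is a minimal sketch of first-fit decreasing over two resource dimensions (vCPU and memory). It ignores every placement constraint and assumes identical hosts; the `Host` class, `first_fit_decreasing` function, and the sample shapes are illustrative names, not part of any real scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host record tracking remaining capacity."""
    vcpus: int
    mem_gb: int
    vms: list = field(default_factory=list)

    def fits(self, shape):
        cpu, mem = shape
        return cpu <= self.vcpus and mem <= self.mem_gb

    def place(self, shape):
        cpu, mem = shape
        self.vcpus -= cpu
        self.mem_gb -= mem
        self.vms.append(shape)


def first_fit_decreasing(shapes, host_vcpus, host_mem_gb):
    """Pack (vcpu, mem_gb) shapes onto identical hosts, opening a new host only when needed."""
    hosts = []
    # Sort largest first, sizing each VM by its dominant (normalized) resource demand.
    ordered = sorted(
        shapes,
        key=lambda s: max(s[0] / host_vcpus, s[1] / host_mem_gb),
        reverse=True,
    )
    for shape in ordered:
        # First fit: take the first already-open host with room on both dimensions.
        target = next((h for h in hosts if h.fits(shape)), None)
        if target is None:
            target = Host(host_vcpus, host_mem_gb)
            if not target.fits(shape):
                raise ValueError(f"shape {shape} exceeds host capacity")
            hosts.append(target)
        target.place(shape)
    return hosts


if __name__ == "__main__":
    demo = first_fit_decreasing([(48, 384), (2, 4), (48, 384), (1, 2)], 96, 768)
    for i, h in enumerate(demo):
        print(i, h.vms, "free:", h.vcpus, "vCPU,", h.mem_gb, "GB")
```

Sorting by the dominant resource is one common heuristic choice; a production scheduler would place VMs online as requests arrive rather than in a sorted batch.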

The scheduler runs as a cluster manager. It keeps an inventory of hosts and their free resources, and each incoming VM request specifies a shape plus constraints. Placement is a greedy bin-pack with constraint satisfaction: filter the feasible hosts, score them by utilization, fault-domain spread, and locality, then place the VM on the best-scoring host. For HA sets, VMs are spread across N fault domains and M update domains. Preemption reclaims capacity for higher-priority VMs (e.g., burst capacity). Live migration moves VMs without a restart when hosts need maintenance.
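The filter, score, place loop can be sketched as follows. This is a simplified illustration under stated assumptions, not Azure's actual algorithm: the `Host` and `Request` types, the `SPREAD_WEIGHT` constant, and the scoring formula are all hypothetical, and locality and update domains are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

SPREAD_WEIGHT = 10.0  # assumed weight: fault-domain spread dominates utilization


@dataclass
class Host:
    name: str
    fault_domain: int
    total_vcpus: int
    total_mem_gb: int
    free_vcpus: int
    free_mem_gb: int


@dataclass
class Request:
    vcpus: int
    mem_gb: int
    ha_set: Optional[str] = None


def feasible(host, req):
    """Filter step: the host must have room on every resource dimension."""
    return host.free_vcpus >= req.vcpus and host.free_mem_gb >= req.mem_gb


def score(host, req, fd_counts):
    """Score step: prefer fuller hosts, penalize fault domains the HA set already uses."""
    used_cpu = (host.total_vcpus - host.free_vcpus + req.vcpus) / host.total_vcpus
    used_mem = (host.total_mem_gb - host.free_mem_gb + req.mem_gb) / host.total_mem_gb
    utilization = (used_cpu + used_mem) / 2
    crowding = fd_counts.get(host.fault_domain, 0)
    return utilization - SPREAD_WEIGHT * crowding


def schedule(hosts, req, placements):
    """Place req on the best feasible host; placements maps HA set name -> hosts used."""
    candidates = [h for h in hosts if feasible(h, req)]
    if not candidates:
        return None  # a real scheduler might preempt or queue here
    peers = placements.get(req.ha_set, []) if req.ha_set else []
    fd_counts = Counter(h.fault_domain for h in peers)
    best = max(candidates, key=lambda h: score(h, req, fd_counts))
    # Place step: commit the resources to the chosen host.
    best.free_vcpus -= req.vcpus
    best.free_mem_gb -= req.mem_gb
    if req.ha_set:
        placements.setdefault(req.ha_set, []).append(best)
    return best


if __name__ == "__main__":
    hosts = [
        Host("h0", 0, 16, 64, 8, 32),
        Host("h1", 0, 16, 64, 16, 64),
        Host("h2", 1, 16, 64, 16, 64),
    ]
    placements = {}
    for _ in range(3):
        chosen = schedule(hosts, Request(4, 16, ha_set="web"), placements)
        print(chosen.name, "fault domain", chosen.fault_domain)
```

With the spread penalty weighted above the utilization term, the second VM of an HA set lands in a different fault domain even when a fuller host is available in the first one. Tuning that trade-off is a design choice rather than a fixed rule.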

When to use

Cloud compute schedulers, Kubernetes schedulers, GPU clusters.

When not to

Bare-metal-only fleets with no virtualization, where there are no VMs to pack onto shared hosts.

flowchart TB
  Req[VM Request · shape + constraints] --> Sched[Scheduler]
  Sched --> Inv[(Host Inventory)]
  Sched --> Filter[Filter feasible]
  Filter --> Score[Score · util + spread + locality]
  Score --> Place[Place on host]
  Place --> Host[Host]
  Host --> Hyper[Hypervisor · creates VM]
  Maint[Maintenance Event] --> LM[Live Migration]
  LM --> NewHost[Target Host]

Key insights

  • Bin packing with placement constraints is NP-hard, but greedy heuristics (filter, then score) make it tractable in practice.
  • Fault domains protect against correlated hardware failures such as a rack or PSU outage; update domains limit how many VMs go down together during rolling upgrades.
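  • Live migration uses pre-copy: memory is copied while the VM keeps running, pages dirtied in the meantime are re-sent over successive rounds, and a brief stun-and-switch pause transfers the final state.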
  • Preemption supports spot instances, cheap capacity that can be reclaimed.
  • Capacity planning operates on top of the scheduler, forecasting what capacity to provision for the VMs expected next month.
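The pre-copy insight can be illustrated with a small back-of-the-envelope model. It assumes a constant page-dirty rate and a constant link bandwidth, which real workloads do not have; the function name `precopy_migration`, its parameters, and the default downtime budget are illustrative assumptions, not hypervisor settings.

```python
def precopy_migration(mem_gb, dirty_gb_per_s, link_gb_per_s,
                      max_downtime_s=0.3, max_rounds=30):
    """Model iterative pre-copy; return (live rounds, total GB sent, final pause in seconds)."""
    to_send = float(mem_gb)   # round 1 copies all of memory
    total_sent = 0.0
    rounds = 0
    # Keep copying while the VM runs, until the remainder fits in the downtime budget.
    while to_send / link_gb_per_s > max_downtime_s and rounds < max_rounds:
        round_time = to_send / link_gb_per_s
        total_sent += to_send
        # Pages dirtied during this round must be re-sent (never more than all memory).
        to_send = min(float(mem_gb), dirty_gb_per_s * round_time)
        rounds += 1
    # Stun-and-switch: pause the VM and send whatever is still dirty.
    downtime = to_send / link_gb_per_s
    total_sent += to_send
    return rounds, total_sent, downtime


if __name__ == "__main__":
    rounds, sent, pause = precopy_migration(mem_gb=64, dirty_gb_per_s=0.5, link_gb_per_s=5)
    print(f"{rounds} live rounds, {sent:.2f} GB sent, {pause * 1000:.0f} ms pause")
```

In this model each round shrinks the remaining data by the ratio of dirty rate to bandwidth, so migration converges quickly only when the link is much faster than the workload dirties memory. When it is not, the round cap forces a long final pause.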