Design an Amazon Fulfillment Center System
Inventory tracking, robotics control, picking/packing optimization, order routing. Tests cyber-physical workflow design + real-time coordination at warehouse scale.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
A fulfillment center is a real-time control plane over physical inventory. Every SKU has a location; every robot has a task; every order has a deadline. Software must orchestrate thousands of robots, human pickers, conveyor belts, and sortation machines so that each item moves from shelf to truck within hours and no unit goes unaccounted for.
Five subsystems:
- Inventory Service: authoritative state of every SKU × bin location, eventually consistent across replicas.
- Robotics Dispatcher: accepts pick tasks, plans routes over the floor graph with A* or cheaper heuristics, and assigns each task to the nearest free drive unit.
- Workstation Service: choreographs human pickers and barcode scans, and validates each pick.
- Sortation Engine: assigns each package to a chute and outbound dock based on carrier and truck schedule.
- Exception Handling: covers damaged items, missing items, and mis-picks, and escalates them to the operator UI.
A Kafka event log glues these together: every state change is recorded as an event in an immutable log, and downstream state such as inventory is derived from that log (event sourcing).
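To make the Robotics Dispatcher step concrete, here is a minimal sketch of "A* over the floor graph, then assign to the nearest free drive unit". It assumes the floor is a small 4-connected grid where `#` marks a blocked cell. The names `FLOOR`, `astar_distance`, and `assign_nearest_free_unit` are hypothetical, and the sketch ignores deadlines, battery, and congestion, which a real dispatcher would also weigh.

```python
import heapq
from typing import Optional

Cell = tuple[int, int]  # (row, col) on the floor grid

# Hypothetical floor: '.' is drivable, '#' is blocked (e.g. fixed shelving).
FLOOR = [
    "....",
    ".##.",
    "....",
]


def astar_distance(grid: list[str], start: Cell, goal: Cell) -> Optional[int]:
    """Shortest 4-connected path length from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(cell: Cell) -> int:
        # Manhattan distance never overestimates on a unit-cost 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}                 # cheapest known cost to reach each cell
    frontier = [(h(start), 0, start)]  # entries are (cost + heuristic, cost, cell)
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best[cell]:
            continue  # stale queue entry; a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "#":
                continue
            new_cost = cost + 1
            if new_cost < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_cost
                heapq.heappush(frontier, (new_cost + h((nr, nc)), new_cost, (nr, nc)))
    return None


def assign_nearest_free_unit(
    grid: list[str], free_units: dict[str, Cell], pick_cell: Cell
) -> Optional[tuple[str, int]]:
    """Return (unit_id, path_length) for the free unit with the shortest route."""
    candidates = []
    for unit_id, position in free_units.items():
        distance = astar_distance(grid, position, pick_cell)
        if distance is not None:
            candidates.append((distance, unit_id))
    if not candidates:
        return None  # no free unit can reach the pick cell
    distance, unit_id = min(candidates)  # ties break on unit_id for determinism
    return unit_id, distance
```

For example, `assign_nearest_free_unit(FLOOR, {"du-1": (0, 0), "du-2": (2, 3)}, (2, 1))` picks `du-2` at distance 2, because `du-1` needs three steps to get around the blocked cells. Running a full search per candidate unit is the simplest correct approach; at real fleet sizes one would likely prune candidates by straight-line distance first.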
When to use
Any high-throughput physical-goods flow: warehouses, ports, postal sorting, parts manufacturing.
When not to
Small inventories with manual ops, where the overhead exceeds the benefit. Pure-digital goods.
Time: p99 dispatch <100ms · Space: O(SKUs × locations + active tasks)
flowchart TB
Order[Order Stream] --> Plan[Wave Planner]
Plan --> Pick[Pick Tasks]
Pick --> Dispatch{{Robotics Dispatcher}}
Dispatch --> Floor[Drive Units Fleet]
Floor --> WS[Workstation]
WS --> Pack[Pack Station]
Pack --> Sort[Sortation Engine]
Sort --> Dock[Outbound Dock]
Floor -.events.-> Kafka[[Kafka Event Log]]
WS -.events.-> Kafka
Kafka --> Inv[(Inventory Service)]
Kafka --> Exc[Exception Handler]
Exc --> OpsUI[Operator UI]
Key insights
- Event sourcing is mandatory: a single lost pick event leaves the inventory count wrong with no record from which to repair it. Run Kafka with replication factor 3 or higher.
- Inventory is eventually consistent globally but strongly consistent per bin (compare-and-swap on bin version).
- Robot dispatch is constrained optimization: minimize total drive distance subject to deadline + battery + congestion.
- Exceptions occur on roughly 0.5% of picks; the operator UI must surface each one within 30 s, or wave throughput collapses.
- Wave planning amortizes setup cost: group orders that share aisles into a single floor sweep.
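The per-bin compare-and-swap from the insights above can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular datastore: `BinStore` is a hypothetical in-memory stand-in whose lock plays the role of the store's atomic conditional write, and `BinState`, `pick`, and `max_retries` are names invented for the example.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class BinState:
    quantity: int
    version: int  # bumped on every successful write to this bin


class BinStore:
    """Hypothetical in-memory stand-in for a per-bin record store."""

    def __init__(self) -> None:
        self._bins: dict[str, BinState] = {}
        self._lock = threading.Lock()  # models the store's atomic conditional write

    def put(self, bin_id: str, quantity: int) -> None:
        with self._lock:
            self._bins[bin_id] = BinState(quantity, 0)

    def read(self, bin_id: str) -> BinState:
        with self._lock:
            return self._bins[bin_id]

    def compare_and_swap(self, bin_id: str, expected_version: int, new_quantity: int) -> bool:
        """Write new_quantity only if the bin is still at expected_version."""
        with self._lock:
            current = self._bins[bin_id]
            if current.version != expected_version:
                return False  # another writer got there first
            self._bins[bin_id] = BinState(new_quantity, current.version + 1)
            return True


def pick(store: BinStore, bin_id: str, units: int, max_retries: int = 5) -> bool:
    """Decrement a bin by `units`. Returns False when stock is insufficient."""
    for _ in range(max_retries):
        state = store.read(bin_id)
        if state.quantity < units:
            return False  # short pick: a candidate for the exception flow
        if store.compare_and_swap(bin_id, state.version, state.quantity - units):
            return True
        # Version moved under us: re-read and try again.
    raise RuntimeError(f"too much contention on bin {bin_id}")
```

The read, check, conditional-write loop is what makes a single bin strongly consistent: two pickers racing for the last unit cannot both succeed, because the loser's write carries a stale version and is rejected. Replicas that aggregate across bins can still lag, which matches the "eventually consistent globally" half of the claim.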