Design a Global OTA Update System (Tesla Fleet)
Delta updates, staged rollout, canary, rollback, fail-safe, low-bandwidth, fleet segmentation.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
Update millions of safety-critical devices over flaky cellular. Update must be atomic, signed, reversible. Roll out gradually (canary → 1% → 10% → all). If telemetry shows regression, halt + roll back.
Build pipeline produces signed firmware image. Delta-encoder generates diffs from each prior version. Update server segments fleet (region, model, fw version, cohort). Canary cohort gets new build first; telemetry observed. If healthy, expand rollout in stages. Devices have A/B partitions, install to inactive slot, switch boot on success. On boot failure, watchdog auto-switches back. All firmware cryptographically signed; secure boot verifies chain.
When to use
Embedded firmware, IoT, mobile OS, edge devices.
When not to
Pure server-side software (use blue-green deploy).
```mermaid
flowchart TB
Build[Signed Firmware Build] --> Delta[Delta Encoder]
Delta --> Server[OTA Update Server]
Server --> Cohort[Cohort Selector]
Cohort -->|1% canary| Canary[Canary fleet]
Canary --> Tel[Telemetry · regressions?]
Tel -->|healthy| Expand[10% → 50% → 100%]
Tel -->|regression| Halt[Halt + alert]
Vehicle[Vehicle] --> Slot[A/B Partition]
Slot --> Install[Install to inactive]
Install --> Reboot[Reboot]
Reboot --> Verify{Boot ok?}
Verify -->|yes| Done[Switch active]
Verify -->|no| Rollback[Switch back]
```
Key insights
- Staged rollout prevents fleet-wide bricks. Cohort 0.1% → 1% → 10% → 100% over days.
- A/B partitions enable atomic install + safe rollback.
- Delta encoding cuts download size 80%+ for incremental updates.
- Telemetry signals to halt: boot loops, crash rate, sensor failures, manual flags.
- Signing chain (boot ROM → bootloader → kernel → app) prevents tampered firmware.
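The Explanation and the Key insights above describe two mechanisms, the device-side A/B slot switch with watchdog rollback and the server-side staged rollout gated on telemetry. The following is an added example, not code from the original page or from any vehicle OTA stack, that models both in Python. It assumes an HMAC over the image as a stand-in for the asymmetric signature chain (a real device verifies an asymmetric signature against a public key anchored in boot ROM), a watchdog budget of three failed boots, stage sizes of 0.1%/1%/10%/100% expressed in basis points, and a halt policy of any boot loop or a crash rate above 1%; all of those are assumptions, and the fleet and telemetry are constructed inline.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Constructed stand-in for the vendor signing key (assumption). A real fleet uses
# an asymmetric signature whose public key is anchored in boot ROM; HMAC is used
# here only so the block runs with the standard library alone.
SIGNING_KEY = b"constructed-demo-signing-key"
MAX_BOOT_ATTEMPTS = 3                 # assumed watchdog budget before the bootloader gives up
STAGES_BPS = [10, 100, 1000, 10000]   # 0.1% -> 1% -> 10% -> 100%, in basis points
MAX_CRASH_RATE = 0.01                 # assumed halt threshold: >1% of reporting devices crashed


def sign_image(image: bytes) -> bytes:
    return hmac.new(SIGNING_KEY, image, hashlib.sha256).digest()


def verify_image(image: bytes, signature: bytes) -> bool:
    # Constant-time comparison so a forged tag cannot be confirmed byte by byte.
    return hmac.compare_digest(sign_image(image), signature)


@dataclass
class Device:
    """A/B-slot device: install to the inactive slot, commit only after a good boot."""
    slots: Dict[str, bytes] = field(default_factory=lambda: {"A": b"fw-1.0", "B": b""})
    active: str = "A"
    pending: Optional[str] = None   # slot awaiting its first successful boot
    boot_attempts: int = 0

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def install(self, image: bytes, signature: bytes) -> bool:
        # Reject anything that fails signature verification before touching flash.
        if not verify_image(image, signature):
            return False
        self.slots[self.inactive] = image
        self.pending = self.inactive
        self.boot_attempts = 0
        return True

    def boot(self, boot_ok: bool) -> bytes:
        """One simulated bootloader pass; returns the image that is active afterwards."""
        if self.pending is None:
            return self.slots[self.active]
        self.boot_attempts += 1
        if boot_ok:
            self.active, self.pending = self.pending, None   # atomic switch-over
        elif self.boot_attempts >= MAX_BOOT_ATTEMPTS:
            self.pending = None   # watchdog gives up; the previous slot stays active
        return self.slots[self.active]


def in_cohort(device_id: str, stage_bps: int) -> bool:
    """Deterministic bucketing: a device in the 1% stage stays in every later stage."""
    bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 10000
    return bucket < stage_bps


def rollout_decision(telemetry: Dict[str, int]) -> str:
    """Halt on a regression signal, otherwise allow expansion to the next stage."""
    crash_rate = telemetry["crashes"] / max(telemetry["reports"], 1)
    if telemetry.get("boot_loops", 0) > 0 or crash_rate > MAX_CRASH_RATE:
        return "halt"
    return "expand"


def run_rollout(fleet: List[str],
                telemetry_per_stage: List[Dict[str, int]]) -> List[Tuple[int, int, str]]:
    """Walk the stages; returns (stage_bps, devices targeted, decision) and stops on halt."""
    log = []
    for idx, bps in enumerate(STAGES_BPS):
        targeted = sum(1 for d in fleet if in_cohort(d, bps))
        decision = rollout_decision(telemetry_per_stage[idx])
        if idx == len(STAGES_BPS) - 1 and decision == "expand":
            decision = "done"
        log.append((bps, targeted, decision))
        if decision != "expand":
            break
    return log


if __name__ == "__main__":
    fw = b"fw-2.0"
    good = Device()
    good.install(fw, sign_image(fw))
    print("healthy device now runs", good.boot(True))
    bricked = Device()
    bricked.install(fw, sign_image(fw))
    for _ in range(MAX_BOOT_ATTEMPTS):
        bricked.boot(False)
    print("boot-looping device fell back to slot", bricked.active)
    fleet = [f"VIN{i:06d}" for i in range(20000)]   # constructed fleet ids
    telemetry = [{"reports": 18, "crashes": 0}, {"reports": 190, "crashes": 6},
                 {"reports": 0, "crashes": 0}, {"reports": 0, "crashes": 0}]
    print(run_rollout(fleet, telemetry))
```

Two design points worth noting: the cohort is derived from a hash of the device id rather than a random draw, so each stage is a strict superset of the previous one and a device that already received the build is never re-targeted; and the bootloader never marks a slot active until the new image has booted, so the install itself cannot leave the device without a bootable slot.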