Design Teams Voice & Video (WebRTC + Signaling)
WebRTC, signaling, TURN/STUN, recording. Real-time low-latency media.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
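WebRTC provides browser-native, low-latency audio and video. A signaling channel negotiates the session, and in group calls a Selective Forwarding Unit (SFU) routes media between participants. Recording works by duplicating the streams to a separate recorder. STUN supports NAT traversal by letting each client learn its public address; when a direct path cannot be established, media is relayed through TURN.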
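To make the "direct first, relay last" ordering concrete, here is a minimal sketch of the candidate priority formula from RFC 8445, using the type preferences that RFC recommends. The function names and the default `local_pref` are illustrative choices, and real ICE agents also compute pair priorities and run connectivity checks, which this sketch omits.

```python
# Type preferences recommended by RFC 8445 (higher is tried first).
TYPE_PREFERENCE = {
    "host": 126,   # address of a local interface
    "prflx": 110,  # peer-reflexive, learned during connectivity checks
    "srflx": 100,  # server-reflexive, the public address reported by STUN
    "relay": 0,    # address allocated on a TURN server
}


def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


def order_candidates(types):
    """Sort candidate types from most to least preferred."""
    return sorted(types, key=candidate_priority, reverse=True)


if __name__ == "__main__":
    for t in order_candidates(["relay", "srflx", "host"]):
        print(t, candidate_priority(t))
```

Because the relay type preference is 0, a TURN candidate always sorts below host and STUN-derived candidates, which matches the fallback behavior described above.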
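The signaling server carries the SDP offer/answer exchange and the ICE candidates between clients. Each client queries a STUN server to discover its public address and falls back to a TURN relay when no direct path works, typically because one side sits behind a symmetric NAT. Group calls go through an SFU, which forwards packets without decoding them and is therefore cheap to run. With simulcast, each client sends three quality layers and the SFU picks one per receiver based on that receiver's available bandwidth. For recording, the SFU mirrors all streams to a recorder service, which transcodes and uploads them.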
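The per-receiver simulcast choice can be sketched as a small selection function. This is an illustration under stated assumptions, not how any particular SFU does it: the layer names, the bitrates in `LAYERS`, and the `headroom` factor are all hypothetical, and a production SFU would also smooth the bandwidth estimate and switch layers only at keyframes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    rid: str           # simulcast stream identifier (names here are illustrative)
    bitrate_kbps: int  # approximate bitrate of this layer


# Hypothetical three-layer ladder; real values depend on codec and resolution.
LAYERS = [Layer("low", 150), Layer("mid", 500), Layer("high", 1500)]


def pick_layer(estimated_kbps: float, layers=LAYERS, headroom: float = 0.8) -> Layer:
    """Return the highest layer that fits in headroom * estimate, else the lowest."""
    budget = estimated_kbps * headroom
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    chosen = ordered[0]  # always forward something, even on a poor link
    for layer in ordered:
        if layer.bitrate_kbps <= budget:
            chosen = layer
    return chosen


if __name__ == "__main__":
    for estimate in (100, 700, 2000):
        print(estimate, "kbps ->", pick_layer(estimate).rid)
```

The sender keeps encoding all three layers regardless of who is listening; only the SFU's forwarding decision changes per receiver.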
When to use
Real-time voice/video products: meetings, telehealth, gaming voice chat.
When not to
Pre-recorded streaming (use HLS). Hard sub-50 ms latency targets (use a custom UDP-based protocol).
flowchart LR
  A([Caller]) --> Sig[Signaling Server]
  B([Callee]) --> Sig
  Sig -->|SDP+ICE| A
  Sig -->|SDP+ICE| B
  A -->|STUN| Stun[STUN]
  A <-->|P2P or via TURN| B
  Group[Group Call] --> SFU[SFU Media Server]
  SFU --> Rec[Recorder]
  Rec --> Blob[(Recording Blob)]
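The signaling leg of the diagram (caller and callee exchanging SDP and ICE through the signaling server) can be sketched as an in-memory relay. Everything here is hypothetical scaffolding: the `SignalingServer` class, the room and peer names, and the placeholder payload strings stand in for a real transport such as WebSockets and for real SDP blobs. The point it illustrates is that the server only forwards opaque messages and never touches media.

```python
from collections import defaultdict, deque


class SignalingServer:
    """Relays opaque messages between peers in a room (hypothetical sketch)."""

    def __init__(self):
        self.rooms = defaultdict(dict)  # room id -> {peer id: inbox}

    def join(self, room, peer):
        self.rooms[room][peer] = deque()

    def send(self, room, sender, kind, payload):
        # Forward the message to every other peer in the room.
        for peer, inbox in self.rooms[room].items():
            if peer != sender:
                inbox.append({"from": sender, "kind": kind, "payload": payload})

    def receive(self, room, peer):
        # Drain and return everything queued for this peer.
        inbox = self.rooms[room][peer]
        messages = list(inbox)
        inbox.clear()
        return messages


def negotiate():
    """Run one offer/answer round plus one ICE candidate in each direction."""
    server = SignalingServer()
    server.join("room-1", "caller")
    server.join("room-1", "callee")

    server.send("room-1", "caller", "offer", "<caller SDP>")
    offer = server.receive("room-1", "callee")[0]

    server.send("room-1", "callee", "answer", "<callee SDP>")
    server.send("room-1", "callee", "candidate", "<callee ICE candidate>")
    server.send("room-1", "caller", "candidate", "<caller ICE candidate>")

    return offer, server.receive("room-1", "caller"), server.receive("room-1", "callee")


if __name__ == "__main__":
    offer, to_caller, to_callee = negotiate()
    print(offer["kind"], [m["kind"] for m in to_caller], [m["kind"] for m in to_callee])
```

In a browser client, the payloads would come from the peer connection's offer, answer, and ICE candidate events; the relay logic itself stays this simple.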
Key insights
- The SFU is the cost optimization: it forwards packets and never decodes media on the live path.
- Simulcast lets each receiver get an appropriate quality layer without involving the sender.
- TURN relays consume real bandwidth, so minimize their use with good STUN coverage.
- Recording is a side channel: adding or removing it does not affect live participants.
- ICE fails to establish a direct path in roughly 5% of cases, so TURN fallback is not optional.
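A back-of-envelope calculation shows why the TURN bullets matter for capacity planning. The sketch below assumes 1:1 calls where the relay forwards media in both directions; the 5% relay fraction comes from the figure above, while the call count and per-direction bitrate in the example are made-up inputs, not measurements.

```python
def turn_egress_mbps(concurrent_calls: int, relay_fraction: float, per_direction_kbps: float) -> float:
    """Estimate TURN egress in Mbit/s for relayed 1:1 calls.

    Each relayed call sends one stream per direction through the relay,
    so the relay's egress is two streams per relayed call.
    """
    relayed_calls = concurrent_calls * relay_fraction
    return relayed_calls * 2 * per_direction_kbps / 1000


if __name__ == "__main__":
    # Hypothetical load: 10,000 concurrent calls at 1,000 kbit/s per direction.
    estimate = turn_egress_mbps(10_000, 0.05, 1_000)
    print(f"approx. {estimate:.0f} Mbit/s of relay egress")
```

Under these assumed inputs the relay fleet carries about 1 Gbit/s of egress, and the same amount again of ingress, even though only a small share of calls is relayed.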