
Design Teams Voice & Video (WebRTC + Signaling)

WebRTC, signaling, TURN/STUN, recording. Real-time low-latency media.

hard · 4h · general · system-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

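WebRTC provides browser-native, low-latency audio and video. A signaling server negotiates each session, and an SFU (Selective Forwarding Unit) routes media in group calls. Recording works by duplicating streams to a recorder. STUN handles NAT traversal; TURN relays media when a direct connection fails.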
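To make "direct first, relay last" concrete, here is a minimal sketch of the candidate priority formula from RFC 8445 with its recommended type preferences. The function names and the default `local_pref` value are illustrative assumptions; real ICE agents rank candidate pairs rather than single candidates, so treat this only as an approximation of the ordering.

```python
# Recommended type preferences from RFC 8445, section 5.1.2.2.
# "host" is a local address, "srflx" is the public address learned via STUN,
# "relay" is an address allocated on a TURN server.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


def order_candidates(cand_types):
    """Return candidate types from most to least preferred."""
    return sorted(cand_types, key=candidate_priority, reverse=True)


if __name__ == "__main__":
    for kind in order_candidates(["relay", "host", "srflx"]):
        print(kind, candidate_priority(kind))
```

Because relay candidates get type preference 0, they sort last, which matches the intent that TURN is used only when cheaper paths do not work.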

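The signaling server carries the SDP offer/answer and the ICE candidate exchange between clients. Each client queries a STUN server to discover its public address and falls back to a TURN relay when no direct path works, for example behind a symmetric NAT. Group calls go through an SFU, which forwards packets without decoding them and is therefore cheap to run. With simulcast, the client sends three quality layers and the SFU picks one per receiver based on that receiver's available bandwidth. For recording, the SFU mirrors all streams to a recorder service, which transcodes and uploads them.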
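The per-receiver simulcast choice can be sketched as a small selection function. The layer names, bitrates, and the 0.8 headroom factor below are hypothetical values chosen for illustration, not figures from any particular SFU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    rid: str           # simulcast layer identifier (hypothetical names)
    bitrate_kbps: int  # assumed target bitrate for the layer


# Assumed three-layer ladder; real deployments tune these values.
LAYERS = [Layer("low", 150), Layer("mid", 500), Layer("high", 1500)]


def pick_layer(estimated_kbps: float, layers=LAYERS, headroom: float = 0.8) -> Layer:
    """Pick the highest layer that fits within the receiver's bandwidth budget.

    Falls back to the lowest layer when nothing fits, so the receiver
    still gets some video.
    """
    budget = estimated_kbps * headroom
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    chosen = ordered[0]
    for layer in ordered:
        if layer.bitrate_kbps <= budget:
            chosen = layer
    return chosen


if __name__ == "__main__":
    for bandwidth in (100, 1000, 2000):
        print(bandwidth, "kbps ->", pick_layer(bandwidth).rid)
```

The sender keeps encoding all three layers regardless of the outcome; only the SFU's forwarding decision changes per receiver.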

When to use

Real-time voice/video products: meetings, telehealth, gaming voice chat.

When not to

Pre-recorded streaming (use HLS). Sub-50ms latency requirements (use a custom UDP protocol).

flowchart LR
  A([Caller]) --> Sig[Signaling Server]
  B([Callee]) --> Sig
  Sig -->|SDP+ICE| A
  Sig -->|SDP+ICE| B
  A -->|STUN| Stun[STUN]
  A <-->|P2P or via TURN| B
  Group[Group Call] --> SFU[SFU Media Server]
  SFU --> Rec[Recorder]
  Rec --> Blob[(Recording Blob)]
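
The signaling path in the diagram can be sketched as an in-memory relay that passes SDP and ICE messages between peers without inspecting them. The class name, method names, and message shape are assumptions made for illustration; a real service would sit behind WebSockets or a similar transport and add authentication.

```python
from collections import deque


class SignalingRoom:
    """Relays opaque signaling messages between peers; it never carries media."""

    ALLOWED = {"offer", "answer", "candidate"}

    def __init__(self):
        self.inboxes = {}  # peer id -> queue of pending messages

    def join(self, peer_id: str) -> None:
        self.inboxes.setdefault(peer_id, deque())

    def send(self, sender: str, recipient: str, kind: str, payload: str) -> None:
        if kind not in self.ALLOWED:
            raise ValueError(f"unsupported message type: {kind}")
        if recipient not in self.inboxes:
            raise KeyError(f"unknown peer: {recipient}")
        # The payload (SDP text or an ICE candidate line) is forwarded untouched.
        self.inboxes[recipient].append({"from": sender, "type": kind, "payload": payload})

    def receive(self, peer_id: str) -> list:
        """Drain and return all messages queued for this peer."""
        queue = self.inboxes[peer_id]
        messages = list(queue)
        queue.clear()
        return messages


if __name__ == "__main__":
    room = SignalingRoom()
    room.join("caller")
    room.join("callee")
    room.send("caller", "callee", "offer", "<sdp offer>")
    room.send("callee", "caller", "answer", "<sdp answer>")
    print(room.receive("callee"))
    print(room.receive("caller"))
```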

Key insights

An SFU is the cost optimization: it forwards media and never decodes it in the cloud.
Simulcast lets each receiver get an appropriate quality layer without involving the sender.
TURN relays cost real bandwidth; minimize relay use with good STUN coverage.
Recording is a side channel: adding or removing it does not affect live participants.
ICE fails to establish a direct path roughly 5% of the time, so TURN fallback is not optional.
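
A back-of-envelope estimate shows why the TURN share matters for cost. The stream count and per-stream bitrate below are hypothetical inputs; only the roughly 5% relay fraction comes from the text above.

```python
def relayed_mbps(concurrent_streams: int, relay_fraction: float, stream_kbps: float) -> float:
    """Estimate ingress load on the TURN fleet in Mbps.

    Each relayed stream also leaves the relay once, so egress is
    roughly the same amount again.
    """
    return concurrent_streams * relay_fraction * stream_kbps / 1000


if __name__ == "__main__":
    # Assumed workload: 10,000 concurrent media streams at 1,500 kbps each.
    load = relayed_mbps(concurrent_streams=10_000, relay_fraction=0.05, stream_kbps=1_500)
    print(f"TURN ingress: about {load:.0f} Mbps")
```

Under these assumptions the relay fleet carries on the order of 750 Mbps in each direction, and the figure scales linearly with the relay fraction, which is why better STUN coverage translates directly into lower relay cost.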