Design Gmail (Billions-Scale Email)
Email storage, search/indexing, spam filtering, threading. Tests inverted indexes + storage tiering.
Theory
Explanation
Email is read-heavy once a message has arrived: each message is written once and read many times. Recent emails are hot; archives are cold. Search is the killer feature and needs a full-text inverted index. The spam filter is a separate write-time gate.
The inbound SMTP cluster checks DKIM, SPF, and DMARC, then hands the message to an ML spam classifier that gates the write: ham goes to the delivery queue, spam to the junk folder. Email bodies are stored in an object store; metadata and threading state live in a row store (Bigtable). An indexer service builds a per-user inverted index from message headers and body, supporting search by from, to, subject, and body. Threading uses the Message-ID and In-Reply-To headers, with subject normalization as a fallback. Storage is tiered: mail newer than 30 days stays on hot SSD, and older mail spills over to a colder tier.
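To make the per-user index concrete, here is a minimal in-memory sketch of a fielded inverted index. The names `UserIndex`, `tokenize`, and the dictionary message shape are hypothetical, chosen for illustration; a production indexer would persist postings per user shard and handle stemming, ranking, and deletes, none of which is shown here.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return TOKEN_RE.findall(text.lower())


class UserIndex:
    """Hypothetical inverted index covering one user's mailbox."""

    FIELDS = ("from", "to", "subject", "body")

    def __init__(self):
        # (field, term) -> set of message ids with that term in that field
        self.postings = defaultdict(set)

    def add(self, msg_id, message):
        """Index every searchable field of one message (a plain dict here)."""
        for field in self.FIELDS:
            for term in tokenize(message.get(field, "")):
                self.postings[(field, term)].add(msg_id)

    def search(self, query, field=None):
        """Return ids matching ALL query terms, optionally within one field."""
        fields = (field,) if field else self.FIELDS
        result = None
        for term in tokenize(query):
            hits = set()
            for f in fields:
                hits |= self.postings.get((f, term), set())
            # Intersect posting lists so every term must match.
            result = hits if result is None else result & hits
        return result or set()
```

Because each user gets a separate `UserIndex`, a query only ever reads that user's postings, which is the property that lets the index live on a single per-user shard.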
When to use
Email, support tickets, conversation systems with archive + search.
When not to
Real-time chat (use a dedicated messaging product). Single-user inbox (this architecture is overkill).
flowchart LR
SMTP[Inbound SMTP] --> Auth[DKIM/SPF/DMARC]
Auth --> Spam{ML Spam Filter}
Spam -->|ham| Q[[Delivery Queue]]
Spam -->|spam| Junk[(Junk Folder)]
Q --> Store[(Email Metadata · Bigtable)]
Q --> Blob[(Body Storage · GCS)]
Q --> Idx[Indexer]
Idx --> Inv[(Inverted Index)]
User([User]) --> Read[Read API]
Read --> Store
Read --> Blob
User --> Search[Search API]
Search --> Inv
Key insights
- Threading needs Message-ID + In-Reply-To; subject fallback for broken clients.
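- A per-user inverted index is small enough to fit on a single shard, so a search touches one shard instead of fanning out.
- The spam classifier must run synchronously on the write path; an asynchronous classifier would deliver spam to the inbox before it was classified.
- Storage tiering by recency cuts cost because most reads hit recent mail, so older mail can sit on cheaper, slower storage.
- DKIM/SPF/DMARC checks prevent sender spoofing; reject failing mail at the SMTP layer.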
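The threading rule from the first bullet can be sketched as follows. This is an illustrative sketch, not Gmail's actual algorithm: `Threader` and `normalize_subject` are hypothetical names, and the choice to apply the subject fallback only when the subject carries a reply or forward prefix is an assumption made here to avoid merging unrelated messages that happen to share a subject.

```python
import re

# Matches a leading run of reply/forward prefixes such as "Re: Fwd: RE:".
PREFIX_RE = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, trim whitespace, and lowercase."""
    return PREFIX_RE.sub("", subject).strip().lower()


class Threader:
    """Hypothetical per-user thread assigner."""

    def __init__(self):
        self.thread_of = {}   # Message-ID -> thread id
        self.by_subject = {}  # normalized subject -> first thread seen

    def assign(self, message_id, in_reply_to=None, subject=""):
        key = normalize_subject(subject)
        looks_like_reply = PREFIX_RE.match(subject) is not None
        if in_reply_to in self.thread_of:
            # Preferred path: the parent message is already known.
            thread = self.thread_of[in_reply_to]
        elif looks_like_reply and key in self.by_subject:
            # Fallback for clients that drop In-Reply-To (assumed policy).
            thread = self.by_subject[key]
        else:
            # No usable link: this message starts a new thread.
            thread = message_id
        self.thread_of[message_id] = thread
        if key:
            self.by_subject.setdefault(key, thread)
        return thread
```

A reply whose `In-Reply-To` is known joins its parent's thread; a message titled "RE: Fwd: budget 2025" with no usable header still joins the "Budget 2025" thread through the normalized subject; a fresh message with the bare subject "Budget 2025" starts a new thread under this policy.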