Design Gmail (Billions-Scale Email)

Email storage, search/indexing, spam filtering, threading. Tests inverted indexes + storage tiering.

hard · 4h · general · system-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

Email is read-heavy once a message has arrived: it is written once on delivery and read many times afterward. Recent emails are hot; archives are cold. Search is the killer feature and needs a full-text inverted index. The spam filter is a separate write-time gate.
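
To make the inverted-index idea concrete, here is a minimal in-memory sketch of a per-user index. The `UserIndex` class, the `field:term` key scheme, and the tokenizer are illustrative assumptions, not the design of any real mail system; a production index would add ranking, stemming, and compressed on-disk postings.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")
FIELDS = ("from", "to", "subject", "body")  # assumed searchable fields


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return TOKEN_RE.findall(text.lower())


class UserIndex:
    """Toy inverted index for one user: term -> set of message ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, msg_id, fields):
        # `fields` maps a field name (from/to/subject/body) to its text.
        for name, text in fields.items():
            for term in tokenize(text):
                self.postings[term].add(msg_id)              # any-field lookup
                self.postings[f"{name}:{term}"].add(msg_id)  # field-scoped lookup

    def search(self, query):
        """Return ids matching ALL query terms; 'subject:x' limits x to one field."""
        result = None
        for raw in query.lower().split():
            name, sep, value = raw.partition(":")
            if sep and name in FIELDS:
                keys = [f"{name}:{t}" for t in tokenize(value)]
            else:
                keys = tokenize(raw)
            for key in keys:
                hits = self.postings.get(key, set())
                result = hits if result is None else result & hits
        return sorted(result or [])
```

A query such as `subject:invoice` intersects only the postings for that field, while a bare `invoice` matches the term in any field.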

The inbound SMTP cluster verifies SPF, DKIM, and DMARC before a message is queued for delivery. An ML spam classifier gates the write. Email bodies are stored in an object store; metadata and threading state live in a row store (Bigtable). An indexer service builds a per-user inverted index from each message's headers and body, supporting search by from, to, subject, and body. Threading uses the Message-ID and In-Reply-To headers, with subject normalization as a fallback. Storage is tiered: mail newer than 30 days stays on hot SSD, and older mail spills over to a colder tier.
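
The threading rule can be sketched as follows. This is a minimal illustration that assumes an in-memory `Threader` class (a hypothetical name) and a simple prefix-stripping regex; a real system would likely also consult the References header and restrict the subject fallback by participants and time window.

```python
import re

# Strip any run of leading reply/forward prefixes such as "Re:", "RE:", "Fw:", "Fwd:".
PREFIX_RE = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Return the subject without reply/forward prefixes, lowercased."""
    return PREFIX_RE.sub("", subject).strip().lower()


class Threader:
    """Assigns thread ids using In-Reply-To first, normalized subject second."""

    def __init__(self):
        self.thread_of = {}    # Message-ID -> thread id
        self.by_subject = {}   # normalized subject -> thread id
        self._next_id = 0

    def assign(self, message_id, in_reply_to, subject):
        key = normalize_subject(subject)
        if in_reply_to and in_reply_to in self.thread_of:
            # Preferred path: the parent message is already known.
            tid = self.thread_of[in_reply_to]
        elif key and key in self.by_subject:
            # Fallback for clients that drop In-Reply-To.
            tid = self.by_subject[key]
        else:
            tid = self._next_id
            self._next_id += 1
        self.thread_of[message_id] = tid
        if key:
            self.by_subject.setdefault(key, tid)
        return tid
```

A reply that carries no In-Reply-To header but has the subject "RE: Re: lunch plans" still lands in the thread that started with "Lunch plans".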

When to use

Email, support tickets, conversation systems with archive + search.

When not to

Real-time chat (use a messaging product instead). A single-user inbox (this design is overkill).

flowchart LR
  SMTP[Inbound SMTP] --> Auth[DKIM/SPF/DMARC]
  Auth --> Spam{ML Spam Filter}
  Spam -->|ham| Q[[Delivery Queue]]
  Spam -->|spam| Junk[(Junk Folder)]
  Q --> Store[(Email Metadata · Bigtable)]
  Q --> Blob[(Body Storage · GCS)]
  Q --> Idx[Indexer]
  Idx --> Inv[(Inverted Index)]
  User([User]) --> Read[Read API]
  Read --> Store
  Read --> Blob
  User --> Search[Search API]
  Search --> Inv
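
The write path in the flowchart can be sketched as a single function. Everything here is a stand-in: `Message`, `Mailbox`, and `deliver` are hypothetical names, the dictionaries replace the row store and object store, and `auth_pass` is assumed to hold a DKIM/SPF/DMARC verdict computed upstream. Storing junk mail alongside inbox mail with a folder marker is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    msg_id: str
    headers: Dict[str, str]
    body: str
    auth_pass: bool  # assumed: DKIM/SPF/DMARC result evaluated earlier


@dataclass
class Mailbox:
    metadata: Dict[str, dict] = field(default_factory=dict)  # stands in for the row store
    blobs: Dict[str, str] = field(default_factory=dict)      # stands in for the object store
    index_queue: List[str] = field(default_factory=list)     # work items for the indexer


def deliver(msg: Message, mailbox: Mailbox, is_spam: Callable[[Message], bool]) -> str:
    """Run the write-time gates in order and return the outcome."""
    if not msg.auth_pass:
        return "rejected"  # refused before anything is stored

    # The classifier runs synchronously, so the folder is known before the write.
    folder = "junk" if is_spam(msg) else "inbox"

    mailbox.blobs[msg.msg_id] = msg.body
    mailbox.metadata[msg.msg_id] = {
        "folder": folder,
        "subject": msg.headers.get("subject", ""),
        "blob_key": msg.msg_id,
    }
    if folder == "inbox":
        # Only ham reaches the indexer, matching the flowchart.
        mailbox.index_queue.append(msg.msg_id)
    return folder
```

Passing the classifier in as `is_spam` keeps the gate swappable, so a keyword rule can stand in for the ML model when testing.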

Key insights

  • Threading needs Message-ID + In-Reply-To; subject fallback for broken clients.
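  • A per-user inverted index is small enough to fit on a single shard, so a search only touches that user's shard.
  • The spam classifier must run synchronously at write time; an asynchronous classifier would deliver spam to the inbox before classifying it.
  • Storage tiering by recency cuts cost: most reads hit recent mail, so older mail can sit on cheaper, slower storage.
  • DKIM/SPF/DMARC checks prevent sender spoofing; reject failing messages at the SMTP layer.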
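
The recency-based tiering rule can be sketched as a small migration planner. The 30-day window comes from the explanation above; the tier names `"hot"` and `"cold"` and the functions `tier_for` and `plan_migrations` are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)  # mail newer than this stays on the hot tier


def tier_for(received_at, now):
    """Return the tier a message belongs on, based only on its age."""
    return "hot" if now - received_at < HOT_WINDOW else "cold"


def plan_migrations(messages, now):
    """Given (msg_id, received_at, current_tier) tuples, list ids to move to cold."""
    return [
        msg_id
        for msg_id, received_at, current_tier in messages
        if current_tier == "hot" and tier_for(received_at, now) == "cold"
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    inbox = [
        ("recent", now - timedelta(days=2), "hot"),
        ("old", now - timedelta(days=60), "hot"),
    ]
    print(plan_migrations(inbox, now))
```

A background job could run `plan_migrations` periodically and copy the listed bodies to the colder tier before updating their metadata.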