system design · system-design

Design Google's Web Crawler

URL frontier, politeness/rate limiting, dedup via hashing, distributed crawling, robots.txt. Signature Google SDI.

hard4hgeneralsystem-design

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

Crawl billions of URLs without hammering any one host, without re-crawling the same content, while respecting robots.txt. URL frontier is the queue; dedup is the gate; politeness is per-host throttling.

URL Frontier: priority queue partitioned by host. Per-host bucket has its own rate-limit. Workers pull from buckets round-robin. Fetched HTML parsed → URLs extracted → dedup against seen-set (Bloom filter + Bigtable) → enqueued. Content hash dedupes identical pages across URLs. robots.txt cached per host with TTL. Priority based on PageRank-like signals: high-rank pages crawled more often.

When to use

Search indexing, archival, security scanning.

When not to

Single-site scraping, overkill.

flowchart LR
  Seed[Seed URLs] --> Frontier[(URL Frontier · per-host queues)]
  Frontier --> Worker[Fetcher Workers]
  Worker --> Robots{robots.txt check}
  Robots -->|allowed| Fetch[Fetch HTML]
  Robots -->|blocked| Drop[Drop]
  Fetch --> Parse[Parser]
  Parse --> Hash{Content Hash}
  Hash -->|seen| Drop2[Drop]
  Hash -->|new| Store[(Page Store)]
  Parse --> URLs[Extracted URLs]
  URLs --> Bloom[Seen-URL Bloom]
  Bloom -->|new| Frontier

Key insights

Politeness = per-host concurrency + delay. Hosts can have wildly different capacities.
Bloom filter cheaply rejects seen URLs; backed by persistent set for correctness.
Content hash catches same content under different URLs (canonicalization).
Priority queue lets you re-crawl important pages faster (news vs static).
robots.txt fetched per host with TTL ~24h; respect Crawl-delay.