system design · system-design
Design Google's Web Crawler
URL frontier, politeness/rate limiting, dedup via hashing, distributed crawling, robots.txt. Signature Google SDI.
Theory
Explanation
Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.
Crawl billions of URLs without hammering any one host, without re-crawling the same content, while respecting robots.txt. URL frontier is the queue; dedup is the gate; politeness is per-host throttling.
URL Frontier: priority queue partitioned by host. Per-host bucket has its own rate-limit. Workers pull from buckets round-robin. Fetched HTML parsed → URLs extracted → dedup against seen-set (Bloom filter + Bigtable) → enqueued. Content hash dedupes identical pages across URLs. robots.txt cached per host with TTL. Priority based on PageRank-like signals: high-rank pages crawled more often.
When to use
Search indexing, archival, security scanning.
When not to
Single-site scraping, overkill.
flowchart LR
Seed[Seed URLs] --> Frontier[(URL Frontier · per-host queues)]
Frontier --> Worker[Fetcher Workers]
Worker --> Robots{robots.txt check}
Robots -->|allowed| Fetch[Fetch HTML]
Robots -->|blocked| Drop[Drop]
Fetch --> Parse[Parser]
Parse --> Hash{Content Hash}
Hash -->|seen| Drop2[Drop]
Hash -->|new| Store[(Page Store)]
Parse --> URLs[Extracted URLs]
URLs --> Bloom[Seen-URL Bloom]
Bloom -->|new| FrontierKey insights
- Politeness = per-host concurrency + delay. Hosts can have wildly different capacities.
- Bloom filter cheaply rejects seen URLs; backed by persistent set for correctness.
- Content hash catches same content under different URLs (canonicalization).
- Priority queue lets you re-crawl important pages faster (news vs static).
- robots.txt fetched per host with TTL ~24h; respect Crawl-delay.