system design · system-design

Design a Distributed File System (GFS / Colossus)

Master + chunkservers, replication, append-only writes, fault recovery. Foundation of modern data infra.

expert5hgeneralsystem-design
Ask GPTConfidence

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

Files broken into fixed-size chunks (64MB). Master holds metadata; chunkservers hold data. Replication for durability; append-only writes for simplicity at scale; lease-based primary chunkserver coordinates writes.

Master node: namespace + file→chunks→replicas mapping. Chunkservers: hold 64MB chunks on local disk. Reads go directly to chunkserver (master only serves locations). Writes: client gets primary replica from master, primary orders mutations, secondaries follow. Replication factor 3 default. Heartbeats from chunkservers; master re-replicates on detected loss. Colossus (successor) replaces single master with sharded metadata + Reed-Solomon erasure coding for storage efficiency.

When to use

Foundation for compute frameworks (MapReduce, Spark), data lakes, log archival.

When not to

Small files (master metadata bloat). Low-latency random writes, use NewSQL.

flowchart TB
  Client([Client]) -->|metadata request| Master[Master · namespace + chunk map]
  Client -->|read chunk| CS1[Chunkserver 1]
  Client -->|write chunk| Primary[Primary Replica]
  Primary --> Sec1[Secondary 1]
  Primary --> Sec2[Secondary 2]
  CS1 -.heartbeat.-> Master
  Sec1 -.heartbeat.-> Master
  Sec2 -.heartbeat.-> Master
  Master --> Repair[Re-replication on failure]

Key insights

  • Append-only avoids the hardest writes: in-place updates. Most workloads (logs, MapReduce intermediates) accept this.
  • Master metadata fits in RAM if average chunk is 64MB, 1PB cluster = 16M chunks × ~64B metadata = 1GB.
  • Replication for durability; erasure coding (Reed-Solomon) replaces it in Colossus for storage efficiency.
  • Lease-based primary avoids needing consensus per write, primary holds 60s lease.
  • Single master is the historical bottleneck; Colossus shards metadata via consistent hashing.