system design · system-design

Design a Distributed Key-Value Store (DynamoDB-style)

Amazon's signature L5+ system design. Tests consistent hashing, replication, quorum, vector clocks, and the CAP trade-offs Amazon engineers grade for keenly. Based on the 2007 Dynamo paper.

expert5hawsgeneralsystem-design

Ask GPTConfidence

Theory

Explanation

Intuition first, formal definition second. Skim the bullets if you already know this; read the prose if you don't.

Forget transactions, forget joins, forget rich queries. Give me get(k) and put(k,v) with five nines availability, single-digit-ms latency, and the ability to grow from 1 GB to 1 PB without re-architecting. That is the Dynamo bet. Everything in the design serves availability and scale. Consistency is tunable per request, not a global property.

Five primitives compose the system: (1) Consistent hashing, keys are SHA-1 hashed onto a ring; each node owns the arc up to its predecessor. Virtual nodes (256+ per physical node) smooth load distribution. (2) Replication, each key replicated to next N nodes clockwise on the ring (N=3 typical). (3) Quorum, R + W > N for read-your-writes; W=1 gives "always-write" availability, R=N gives strong consistency. (4) Vector clocks, version conflicts surfaced to client when concurrent writes occur (no last-writer-wins by default). (5) Anti-entropy, merkle trees + gossip + hinted handoff keep replicas converging in the background.

When to use

Massive scale, simple access pattern, availability over strict consistency. Shopping carts, session stores, product catalogs, IoT telemetry, leaderboard counters. Same model powers DynamoDB, Cassandra, Riak.

When not to

Transactional workloads (banking ledger), strong consistency required (paid Spanner / CockroachDB), rich queries (joins, aggregations), those need SQL. Small data, operational overhead not worth it under ~1 TB.

Time: get/put p99 ~10 ms · Space: O(N * keys)

flowchart TB
    subgraph Ring["Consistent Hash Ring"]
        direction LR
        N1["Node A<br/>vnodes 0..255"] --- N2["Node B<br/>vnodes 256..511"]
        N2 --- N3["Node C<br/>vnodes 512..767"]
        N3 --- N4["Node D<br/>vnodes 768..1023"]
        N4 --- N1
    end
    Client([Client]) --> Coord{{Coordinator<br/>any node}}
    Coord -->|W=2| N1
    Coord -->|W=2| N2
    Coord -->|W=2| N3
    Coord <-->|gossip: health + ring| N4
    N1 -. "merkle anti-entropy" .- N2
    N2 -. "merkle anti-entropy" .- N3
    HQ["Hinted Handoff Queue"] -. "replay on recovery" .-> N4

Key insights

Consistent hashing solves elasticity. Adding a node shifts only 1/N of keys; without it, doubling cluster size shuffles 50% of data.
Virtual nodes solve hot-spots. Without them, removing a node dumps its entire range onto the next neighbor; with 256 vnodes per physical node, load redistributes across the cluster.
R + W > N gives strong read-after-write. R=W=N/2+1 (quorum) is the sweet spot, survives ⌊N/2⌋ failures, latency p99 = max of (N/2+1) slowest replicas.
Vector clocks make conflicts the client's problem, and that is intentional. The shopping cart that "merges" two concurrent adds is correct because the application knows the merge semantics; the database does not.
Sloppy quorum + hinted handoff = the "always writable" guarantee. If the primary replica is down, write to the next live node with a hint; replay when primary returns.
Anti-entropy via Merkle trees lets two replicas converge by comparing O(log N) hashes instead of streaming the full keyspace.
Gossip protocol propagates membership + ring state. Every node periodically gossips with 1–3 random peers; full ring view converges in O(log N) gossip rounds.