
Consistent Hashing: Scaling Without Chaos

How do you distribute data across 100 servers so that adding a new server doesn't require moving...

Abstract Algorithms · 13 min read


TLDR: Standard hashing (key % N) breaks when $N$ changes — adding or removing a server reshuffles almost all keys. Consistent Hashing maps both servers and keys onto a ring (pictured here as 0–360°; in practice 0 to $2^{32}-1$). When a server is added, only its immediate neighbors' keys move, minimizing data migration.


📖 The Problem with key % N: Why Normal Hashing Fails at Scale

Suppose you have 3 cache servers and you route requests like this:

server = hash(key) % 3

Works great — until you add a 4th server. Now hash(key) % 4 is different for almost every key. Nearly all cached data must be remapped, causing a cache stampede: misses flood the backend database all at once.

Same problem when a server goes down: hash(key) % 2 reroutes most keys to different servers. Data that was cached is no longer reachable.

The core issue: modular hashing is globally sensitive to $N$.
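The claim is easy to verify with a short simulation (a sketch; the key names and server counts are arbitrary):

```python
import hashlib

def server_for(key: str, n: int) -> int:
    # Stable hash so results are reproducible across runs (unlike built-in hash()).
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(server_for(k, 3) != server_for(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped going from 3 to 4 servers")
# Expect roughly 75%: with % N, almost every key changes owner.
```

A key keeps its owner only when hash % 3 == hash % 4, which happens for just a quarter of hash values, so around three quarters of the cache is invalidated by one added server.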


🔍 Core Vocabulary: Keys, Hashes, and Rings

Before diving into the ring mechanics, let's clarify the key terms:

| Term | Meaning |
| --- | --- |
| Hash function | Maps an arbitrary key to a number in a fixed range (e.g., 0 to 2³²−1) |
| Key | The identifier you want to route: a user ID, session token, or cache key |
| Node / Server | A physical or virtual machine that stores or processes data |
| Token | The position a node occupies on the hash ring |
| Virtual node (vnode) | A logical copy of a physical node, placed at multiple ring positions |
| Ring | The circular representation of the full hash space |

The critical insight: consistent hashing uses the same hash function for both keys and servers, so they share the same address space.


⚙️ The Ring: One Hash Space for Both Servers and Keys

Consistent hashing fixes this by mapping everything — both servers and keys — onto the same circular hash space (commonly 0 to $2^{32} - 1$).

graph TD
    subgraph Hash Ring
        direction LR
        A((0)) --> B((ServerA @90))
        B --> C((ServerB @180))
        C --> D((ServerC @270))
        D --> A
    end

Placement rules:

  1. Hash each server's name/IP to find its position on the ring.
  2. Hash each key to find its position on the ring.
  3. A key belongs to the first server encountered when walking clockwise from the key's position.

Example:

  • ServerA is at position 90.
  • ServerB is at position 180.
  • ServerC is at position 270.
  • Key at position 100 → walks clockwise → first server hit is ServerB (position 180).
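The clockwise walk is just a lookup in a sorted list of tokens. A minimal sketch of the rule, using the toy degree positions from the example above:

```python
from bisect import bisect_left

# Toy ring: server positions in degrees, matching the example above.
ring = {90: "ServerA", 180: "ServerB", 270: "ServerC"}
tokens = sorted(ring)

def owner(position: int) -> str:
    # First token >= the key's position; wrap to the lowest token past 270.
    idx = bisect_left(tokens, position)
    return ring[tokens[idx % len(tokens)]]

print(owner(100))  # ServerB: first server clockwise from 100
print(owner(300))  # ServerA: wraps past 0
```

The modulo on the index implements the wrap-around: a key past the last token circles back to the first server on the ring.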

📊 Adding or Removing a Server: Only Neighbors Move

This is the breakthrough. When you add ServerD at position 135:

  • Keys previously assigned to ServerB (in the range 91–180) that now fall in range 91–135 are reassigned to ServerD.
  • Everything else is unaffected.
flowchart LR
    A[Add ServerD at 135] --> B{Which keys move?}
    B --> C[Keys 91-135 move to ServerD]
    B --> D[Keys 0-90  ServerA: unchanged]
    B --> E[Keys 136-180  ServerB: unchanged]
    B --> F[All other keys: unchanged]

With consistent hashing, adding a server to an $N$-server ring moves only about $K/N$ keys ($K$ = total keys). With naive % N, adding one server remaps nearly all of them.

That's the efficiency win of consistent hashing.
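The arithmetic can be checked on the toy degree ring: add ServerD at 135 and count which of the 360 positions change owner (a sketch; integer positions stand in for key hashes):

```python
def owner(pos: int, tokens: set[int]) -> int:
    # First server token clockwise from pos, wrapping past 0.
    for t in sorted(tokens):
        if t >= pos:
            return t
    return min(tokens)

before = {90, 180, 270}
after = before | {135}  # add ServerD at 135

moved = [p for p in range(360) if owner(p, before) != owner(p, after)]
print(moved[0], moved[-1], len(moved))  # positions 91..135 move; nothing else
```

Only the arc between ServerA (90) and the new ServerD (135) changes hands; every other position keeps its owner, exactly as the flowchart above shows.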


🧠 Deep Dive: Virtual Nodes Fix Uneven Distribution

Raw consistent hashing has a problem: if servers hash to clustered positions on the ring, one server gets most keys and others get few.

Solution: Virtual Nodes (vnodes)

Each physical server is placed at multiple positions on the ring by hashing variants of its name:

ServerA-0 → position 10
ServerA-1 → position 50
ServerA-2 → position 200

ServerB-0 → position 30
ServerB-1 → position 120
ServerB-2 → position 250

With on the order of 150 virtual nodes per server (Cassandra's classic default is 256 tokens), the distribution becomes statistically uniform, regardless of where the physical servers happen to hash.

| Setting | Distribution | Failure impact |
| --- | --- | --- |
| 1 vnode/server | Uneven; a lottery of positions | One server takes on all of the failed server's load |
| 150 vnodes/server | Near-uniform | Failed server's keys spread across all healthy servers |
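The effect of vnode count on balance can be measured with a quick simulation (a sketch; server names and sample counts are illustrative):

```python
import hashlib
from bisect import bisect_left
from collections import Counter

def build_ring(servers: list[str], vnodes: int) -> dict[int, str]:
    # Place each server at `vnodes` ring positions by hashing name variants.
    ring = {}
    for s in servers:
        for i in range(vnodes):
            token = int(hashlib.md5(f"{s}-{i}".encode()).hexdigest(), 16)
            ring[token] = s
    return ring

def ownership(ring: dict[int, str], samples: int = 20_000) -> dict[str, float]:
    # Fraction of sample keys owned by each server (clockwise lookup with wrap).
    tokens = sorted(ring)
    counts = Counter()
    for i in range(samples):
        h = int(hashlib.md5(f"key-{i}".encode()).hexdigest(), 16)
        counts[ring[tokens[bisect_left(tokens, h) % len(tokens)]]] += 1
    return {s: c / samples for s, c in counts.items()}

servers = ["server-a", "server-b", "server-c"]
print("1 vnode/server:   ", ownership(build_ring(servers, 1)))
print("150 vnodes/server:", ownership(build_ring(servers, 150)))
```

With a single vnode the shares are whatever the position lottery produced; with 150 vnodes per server each share lands close to the ideal 33%.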

📊 Virtual Nodes: Key Assignment on the Ring

flowchart LR
    K["Key hash value (e.g. position 55)"] --> R["Hash Ring (walk clockwise)"]
    R --> V1["vnode ServerA-1 @position 50 (skip)"]
    R --> V2["vnode ServerB-0 @position 60 (first hit)"]
    V2 --> N[Assigned to ServerB]
    Note1["ServerA also at 10, 200; ServerB also at 120, 250"]

This diagram shows how a key at hash position 55 is assigned to a physical server using virtual nodes. The ring walk proceeds clockwise from position 55, skips ServerA-1's vnode at position 50 (behind the key), and lands on ServerB-0's vnode at position 60 — so the key is owned by ServerB. Each physical server holds multiple vnodes at different ring positions (shown in the note), so when a server fails its load scatters evenly across all remaining servers rather than collapsing onto a single neighbor.

📊 Node Addition: Minimal Key Rebalance

sequenceDiagram
    participant C as Client
    participant NR as Hash Ring
    participant New as New Node D
    participant Old as Existing Node B

    C->>NR: add Node D at position 135
    NR->>Old: which keys are in range 91-135?
    Old-->>NR: keys {key1, key2, key3}
    NR->>New: transfer keys {key1, key2, key3}
    New-->>NR: transfer complete
    Note over NR: Only ~1/N keys moved (minimal rebalance)
    C->>NR: lookup key at position 100
    NR-->>C: route to Node D (new owner)

🌍 Real-World Applications: Where Consistent Hashing Powers Real Systems

  • Apache Cassandra: Uses vnodes (default 256 per server) for token assignment.
  • Amazon DynamoDB: Consistent hashing of partition keys across storage nodes.
  • Memcached / Redis Cluster: Client-side consistent hashing libraries route cache keys to shards.
  • Nginx / HAProxy: Some load balancing plugins use consistent hashing for session affinity.

When Cassandra adds a node, only the data in that node's token range is streamed over — not a full reshuffle.


⚖️ Trade-offs & Failure Modes: Limits and When It Doesn't Help

| Concern | Detail |
| --- | --- |
| Hot keys | A single key (e.g., a celebrity tweet) can overwhelm one vnode despite balanced distribution |
| Data skew | Non-uniform key distributions still concentrate load |
| Read replicas | Consistent hashing defines primary placement; the replication factor adds complexity |
| Small clusters | With only 3 servers, noticeable imbalance remains even at 150 vnodes per server |

Virtual nodes help with structural balance but do not fix hot-key problems. Those require application-level solutions (key sharding with suffixes, caching at a higher layer).


🔬 Deep Dive: Replication and Token Ring Internals

Replication Factor

Consistent hashing defines which node is the primary owner of a key, but production systems need replication for durability. Cassandra uses the ring position to determine replicas: after finding the primary node (the first clockwise match), it walks clockwise to find the next N-1 unique physical servers. With replication_factor=3, data lives on 3 servers.

flowchart LR
    K["Key K (position 150)"] -->|primary owner| A["ServerB @180"]
    A -->|replica 1| B["ServerC @270"]
    B -->|replica 2| C["ServerA @90 (wraps)"]

This means a key is always available even if 2 of 3 replicas are down — the CAP theorem tradeoff made explicit by where you set your read/write quorum (QUORUM, ONE, ALL).
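The replica walk described above can be sketched in a few lines (a hypothetical helper; the toy ring matches the diagram's positions):

```python
def replicas(key_pos: int, ring: dict[int, str], rf: int = 3) -> list[str]:
    # Walk clockwise from the key, collecting unique physical servers
    # until the replication factor is met.
    tokens = sorted(ring)
    start = next((i for i, t in enumerate(tokens) if t >= key_pos), 0)
    owners: list[str] = []
    for step in range(len(tokens)):
        server = ring[tokens[(start + step) % len(tokens)]]
        if server not in owners:  # skip extra vnodes of already-chosen servers
            owners.append(server)
        if len(owners) == rf:
            break
    return owners

ring = {90: "ServerA", 180: "ServerB", 270: "ServerC"}
print(replicas(150, ring))  # ['ServerB', 'ServerC', 'ServerA'] -- wraps past 0
```

The "unique physical servers" check matters once vnodes are in play: without it, two adjacent vnodes of the same machine would count as two replicas and silently reduce durability.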

Token Ranges in Cassandra

In Cassandra, each vnode owns a token range — a slice of the 64-bit integer space. The full space is divided into as many ranges as there are vnodes. With 3 servers × 256 vnodes = 768 token ranges, each server owns 256 non-contiguous slices spread evenly around the ring.

When you run nodetool status, you see each node's token assignments. Adding a node causes Cassandra to transfer only the affected token ranges — typically 1/N of total data.

Consistent Hashing in Redis Cluster

Redis Cluster uses a variation: 16,384 hash slots (not a continuous ring, but a segmented space). The CRC16(key) % 16384 formula assigns keys to slots; slots are manually assigned to nodes. This gives operators explicit control over rebalancing — you move slots between nodes rather than relying on automatic ring placement.
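The slot formula is easy to reproduce: Python's binascii.crc_hqx computes the same CRC16/XMODEM variant Redis uses. This sketch ignores Redis hash tags (the {…} sub-key syntax), so it only matches CLUSTER KEYSLOT for keys without braces:

```python
import binascii

def redis_slot(key: str) -> int:
    # CRC16/XMODEM (poly 0x1021, init 0) mod the 16,384 hash slots.
    return binascii.crc_hqx(key.encode(), 0) % 16384

print(redis_slot("user-123"))  # a slot in 0..16383; compare with CLUSTER KEYSLOT
```

Because the slot count is fixed, rebalancing in Redis Cluster is a matter of reassigning slot ranges between nodes, not rehashing keys.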


🧭 Decision Guide: When to Use Consistent Hashing

Use consistent hashing when:

  • Building a distributed cache (Memcached, Redis Cluster) where servers join/leave regularly.
  • Partitioning a distributed database (Cassandra, DynamoDB) and minimizing data movement is critical.
  • Implementing a load balancer with session affinity — route a user to the same backend on every request.
  • Building a content delivery system where edge nodes are added or removed based on demand.

Skip consistent hashing when:

  • Your dataset fits on one server — no sharding needed.
  • You need range queries across sorted keys — B-tree-based range sharding is better.
  • You need exact control over which shard holds which data — explicit mapping beats ring placement.

| Signal | Recommendation |
| --- | --- |
| Nodes join/leave frequently | ✅ Consistent hashing |
| Distributed cache with many servers | ✅ Consistent hashing |
| Need sorted range scans | ❌ Use range-based sharding |
| Single server or very small cluster | ❌ Overkill; use key % N |

🌐 Real-World Architecture: Amazon DynamoDB

DynamoDB's storage layer uses consistent hashing with virtual nodes to distribute table partitions across its fleet of storage servers. Here is a simplified read path:

flowchart LR
    Client -->|"PutItem: userId=123"| R[DynamoDB Router]
    R -->|"hash(userId=123)"| Ring[Virtual Node Ring]
    Ring -->|token match| S1["Storage Node A (primary)"]
    S1 -->|replicate| S2[Storage Node B]
    S1 -->|replicate| S3[Storage Node C]
    S1 -->|ack| R
    R -->|200 OK| Client

When AWS adds capacity to a DynamoDB region, new storage nodes join the ring and absorb only the adjacent token ranges. No global reshuffle happens. A customer writing to userId=123 at the same moment the ring is being rebalanced continues to succeed — the router just uses updated ring metadata for new requests.

The same principle applies to:

  • Amazon S3 internal object placement.
  • Apache Cassandra deployed by Netflix for viewing history at petabyte scale.
  • Discord, which uses Cassandra's consistent hashing ring to store billions of messages.

🧪 Practical Setup: Consistent Hashing in Python

Here is a minimal consistent hashing implementation showing the core data structure:

import hashlib
from sortedcontainers import SortedDict

class ConsistentHashRing:
    def __init__(self, vnodes=150):
        self.ring = SortedDict()
        self.vnodes = vnodes

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str):
        for i in range(self.vnodes):
            token = self._hash(f"{server}-{i}")
            self.ring[token] = server

    def remove_server(self, server: str):
        self.ring = SortedDict(
            {t: s for t, s in self.ring.items() if s != server}
        )

    def get_server(self, key: str) -> str:
        if not self.ring:
            raise Exception("No servers in ring")
        token = self._hash(key)
        idx = self.ring.bisect_left(token)
        if idx == len(self.ring):
            idx = 0  # wrap around
        return self.ring.peekitem(idx)[1]

# Usage
ring = ConsistentHashRing(vnodes=150)
ring.add_server("server-a")
ring.add_server("server-b")
ring.add_server("server-c")

print(ring.get_server("user-123"))  # e.g. server-b

The SortedDict gives O(log N) lookup for bisect_left. Adding or removing a server only iterates over its 150 vnode tokens — not all keys.


🛠️ Java TreeMap: A Production-Ready Consistent Hash Ring Without External Dependencies

java.util.TreeMap is the natural data structure for a consistent hash ring in Java: integer tokens are the keys (sorted automatically), server names are the values, and ceilingKey(token) implements the clockwise lookup in O(log N) — zero external dependencies required.

import java.security.MessageDigest;
import java.util.TreeMap;

public class ConsistentHashRing {

    private final int             virtualNodes;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            long token = hash(server + "-vnode-" + i);
            ring.put(token, server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "-vnode-" + i));
        }
    }

    public String getServer(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("No servers in ring");
        long token   = hash(key);
        // ceilingKey: first token >= hash(key) → first server clockwise from key
        Long ceiling = ring.ceilingKey(token);
        // Wrap around the ring if the key lands past the last token
        return ring.get(ceiling != null ? ceiling : ring.firstKey());
    }

    private long hash(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes());
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (digest[i] & 0xFFL);
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(150);  // 150 vnodes per server
        ring.addServer("cache-a");
        ring.addServer("cache-b");
        ring.addServer("cache-c");

        System.out.println(ring.getServer("user-123"));     // e.g. cache-b
        System.out.println(ring.getServer("session-456"));  // e.g. cache-a

        // Add a 4th server — only ~1/4 of keys remap (vs. nearly all with key%N)
        ring.addServer("cache-d");
        System.out.println(ring.getServer("user-123"));     // may now route to cache-d
    }
}

Ketama and Guava are the two standard production alternatives in the Java ecosystem:

  • Ketama (used by spymemcached and XMemcached) is the industry-standard Memcached consistent hashing algorithm in Java, using MD5-based token distribution for maximum client-server compatibility across polyglot stacks.
  • Guava's Hashing.consistentHash provides a simpler bucket-based variant — no ring data structure needed; the bucket is computed directly from the hash code, and remapping is minimized when the bucket count grows.
// Guava consistent hash — maps a key to one of N buckets, minimizes remapping on resize
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

int bucket = Hashing.consistentHash(
    Hashing.murmur3_32().hashString("user-123", StandardCharsets.UTF_8),
    3   // number of buckets (servers)
);
System.out.println("Routes to bucket: " + bucket);  // 0, 1, or 2

// When you grow from 3 to 4 buckets, only ~1/4 of keys remap — same guarantee as the ring

Choose the TreeMap ring for full Cassandra/DynamoDB-style token management with virtual nodes; choose Guava's Hashing.consistentHash for simple stateless routing where you only need a bucket number.

For a full deep-dive on Ketama algorithm internals, Guava's rendezvous hashing math, and Redis Cluster hash slot management in Java, a dedicated follow-up post is planned.


📚 Lessons Learned: Pitfalls in Production

1. Hash function choice matters. Cryptographic hashes like MD5 and SHA-1 distribute uniformly but are relatively slow; non-cryptographic hashes like CRC32 and MurmurHash are faster and well proven in production (Ketama uses MD5; many Memcached clients use CRC32). Avoid simple sum(bytes) % range: it clusters keys.

2. Vnode count is a tuning knob, not a default. Cassandra's classic default is 256 vnodes per node (newer releases lowered it). More vnodes mean better balance but more gossip and repair overhead, so tune the count to your cluster size rather than copying a default.

3. Consistent hashing ≠ balanced load under skewed access. If 80% of reads hit 20% of keys (Pareto distribution), the servers holding those key ranges are hot regardless of ring balance. Layer in a read cache (Redis) in front of hot partitions.
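One common application-level fix is the key-suffix sharding mentioned above. A hypothetical sketch (the suffix scheme and replica count are illustrative):

```python
import random

# Hot-key mitigation: replicate one hot cache key under N suffixed copies,
# so the suffixed keys hash to different ring positions and different servers.
def cache_keys(key: str, replicas: int = 8) -> list[str]:
    # Writer stores the same value under every suffixed copy.
    return [f"{key}#{i}" for i in range(replicas)]

def read_key(key: str, replicas: int = 8) -> str:
    # Reader picks one copy at random, so no single server absorbs all reads.
    return f"{key}#{random.randrange(replicas)}"

print(cache_keys("celebrity-tweet-42")[:3])  # first three suffixed copies
print(read_key("celebrity-tweet-42"))        # e.g. celebrity-tweet-42#5
```

The trade-off is write amplification: every update must touch all replicas (or tolerate brief staleness), which is why this is reserved for the few keys that are genuinely hot.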

4. Token ownership is metadata, not data. When you add a server, you must stream the actual data from the previous owner before the new server can serve reads. Plan for this migration window — Cassandra calls it "streaming" and it can take hours on large datasets.


📌 TLDR: Summary & Key Takeaways

  • key % N remaps almost all keys when $N$ changes — consistent hashing avoids this.
  • Both servers and keys are placed on a shared ring; a key maps to the next server clockwise.
  • Adding/removing a server only moves keys in the adjacent arc — not the full key space.
  • Virtual nodes (150–256 per server) ensure near-uniform distribution despite unlucky server placements on the ring.
  • Used in Cassandra, DynamoDB, Memcached, and Redis Cluster.
