Consistent Hashing: Scaling Without Chaos
How do you distribute data across 100 servers so that adding a new server doesn't require moving...
Abstract Algorithms
TLDR: Standard hashing (key % N) breaks when $N$ changes — adding or removing a server reshuffles almost all keys. Consistent hashing maps both servers and keys onto a ring (0–360°). When a server is added, only its immediate neighbors' keys move, minimizing data migration.
📖 The Problem with key % N: Why Normal Hashing Fails at Scale
Suppose you have 3 cache servers and you route requests like this:
server = hash(key) % 3
Works great — until you add a 4th server. Now hash(key) % 4 is different for almost every key. Nearly all cached data needs to be remapped, causing a cache stampede — every backend database query misses cache simultaneously.
Same problem when a server goes down: hash(key) % 2 reroutes most keys to different servers. Data that was cached is no longer reachable.
The core issue: modular hashing is globally sensitive to $N$.
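A quick sketch makes the breakage concrete (the key names and MD5-based hash here are illustrative choices, not a canonical implementation): count how many keys change servers when $N$ goes from 3 to 4 under modular hashing.

```python
# Sketch: measure how many keys change servers when N goes from 3 to 4
# under naive modular hashing.
import hashlib

def server_for(key: str, n: int) -> int:
    # Hash the key to a large integer, then take it modulo the server count
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if server_for(k, 3) != server_for(k, 4))
print(f"{moved / len(keys):.0%} of keys remapped")  # typically ~75%
```

A key keeps its server only when its hash gives the same residue mod 3 and mod 4, which happens for roughly a quarter of uniformly hashed keys, so about three quarters of the cache is invalidated by adding one server.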
🔍 Core Vocabulary: Keys, Hashes, and Rings
Before diving into the ring mechanics, let's clarify the key terms:
| Term | Meaning |
| --- | --- |
| Hash function | Maps an arbitrary key to a number in a fixed range (e.g., 0 to 2³²−1) |
| Key | The identifier you want to route — a user ID, session token, or cache key |
| Node / Server | A physical or virtual machine that stores or processes data |
| Token | The position a node occupies on the hash ring |
| Virtual node (vnode) | A logical copy of a physical node, placed at multiple ring positions |
| Ring | The circular representation of the full hash space |
The critical insight: consistent hashing uses the same hash function for both keys and servers, so they share the same address space.
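As a minimal illustration of that shared address space (the names and hash choice are assumptions for the sketch), both server names and keys can be run through the same function:

```python
# Sketch: servers and keys hashed into one shared 32-bit ring space
import hashlib

def ring_position(name: str) -> int:
    # Map any string (server name or key) into 0 .. 2**32 - 1
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32

for name in ["ServerA", "ServerB", "user-123", "session-456"]:
    print(name, "->", ring_position(name))
```

Because servers and keys land in the same numeric range, "which server owns this key" becomes a pure position comparison on the ring.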
⚙️ The Ring: One Hash Space for Both Servers and Keys
Consistent hashing fixes this by mapping everything — both servers and keys — onto the same circular hash space (commonly 0 to $2^{32} - 1$).
graph TD
subgraph Hash Ring
direction LR
A((0)) --> B((ServerA @90))
B --> C((ServerB @180))
C --> D((ServerC @270))
D --> A
end
Placement rules:
- Hash each server's name/IP to find its position on the ring.
- Hash each key to find its position on the ring.
- A key belongs to the first server encountered when walking clockwise from the key's position.
Example:
- ServerA is at position 90.
- ServerB is at position 180.
- ServerC is at position 270.
- A key at position 100 walks clockwise; the first server it hits is ServerB (position 180).
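The clockwise walk in this example can be sketched with a sorted list and a binary search (positions hard-coded to match the example above):

```python
# Sketch: clockwise lookup on a 0-360 ring with servers at 90, 180, 270
import bisect

positions = [90, 180, 270]                 # sorted server tokens
servers = ["ServerA", "ServerB", "ServerC"]

def owner(key_position: int) -> str:
    # First server token at or after the key's position
    idx = bisect.bisect_left(positions, key_position)
    if idx == len(positions):
        idx = 0  # past the last token: wrap around to the first server
    return servers[idx]

print(owner(100))  # ServerB
print(owner(300))  # wraps around to ServerA
```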
📊 Adding or Removing a Server: Only Neighbors Move
This is the breakthrough. When you add ServerD at position 135:
- Keys previously assigned to ServerB (range 91–180) that now fall in range 91–135 are reassigned to ServerD.
- Everything else is unaffected.
flowchart LR
A[Add ServerD at 135] --> B{Which keys move?}
B --> C[Keys 91-135 move to ServerD]
B --> D[Keys 0-90 ServerA: unchanged]
B --> E[Keys 136-180 ServerB: unchanged]
B --> F[All other keys: unchanged]
Before: $N$ servers → adding 1 requires moving approximately $K/N$ keys ($K$ = total keys).
Naive % N: adding 1 requires remapping most keys.
That's the efficiency win of consistent hashing.
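The win can be sketched in a few lines: a minimal single-token ring (no vnodes; server names, key names, and the MD5-based hash are illustrative) where adding a fourth server reassigns only the keys that now belong to it.

```python
# Sketch: on a shared ring, adding a server moves only the keys
# that fall into the new server's arc; everything else stays put.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

def owner(key: str, server_names) -> str:
    tokens = sorted((h(s), s) for s in server_names)
    positions = [p for p, _ in tokens]
    idx = bisect.bisect_left(positions, h(key))
    return tokens[idx % len(tokens)][1]  # % wraps past the last token

keys = [f"user-{i}" for i in range(10_000)]
before = ["server-a", "server-b", "server-c"]
after = before + ["server-d"]
moved = [k for k in keys if owner(k, before) != owner(k, after)]
print(f"{len(moved)} of {len(keys)} keys moved, all to the new server")
```

With one token per server the moved fraction varies around $K/N$ depending on where the new token lands; virtual nodes (next section) smooth that variance out.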
🧠 Deep Dive: How Virtual Nodes Fix Uneven Distribution
Raw consistent hashing has a problem: if servers hash to clustered positions on the ring, one server gets most keys and others get few.
Solution: Virtual Nodes (vnodes)
Each physical server is placed at multiple positions on the ring by hashing variants of its name:
ServerA-0 → position 10
ServerA-1 → position 50
ServerA-2 → position 200
ServerB-0 → position 30
ServerB-1 → position 120
ServerB-2 → position 250
With 100–256 virtual nodes per server (Cassandra historically defaulted to 256), the distribution becomes statistically uniform, regardless of how physical servers hash.
| Setting | Distribution | Failure Impact |
| --- | --- | --- |
| 1 vnode/server | Uneven, lottery of positions | One server takes on all of failed server's load |
| 150 vnodes/server | Near-uniform | Failed server's keys spread across all healthy servers |
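A small simulation shows the effect (server names, key counts, and the MD5-based hash are illustrative assumptions):

```python
# Sketch: compare load balance with 1 token per server vs 150 vnodes
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

def build_ring(servers, vnodes):
    # One token per (server, vnode-index) pair, sorted by position
    ring = sorted((h(f"{s}-{i}"), s) for s in servers for i in range(vnodes))
    return [p for p, _ in ring], [s for _, s in ring]

def owner(key, positions, owner_names):
    idx = bisect.bisect_left(positions, h(key))
    return owner_names[idx % len(owner_names)]

servers = [f"server-{c}" for c in "abcde"]
keys = [f"key-{i}" for i in range(20_000)]
ratios = {}
for vnodes in (1, 150):
    positions, owner_names = build_ring(servers, vnodes)
    counts = {s: 0 for s in servers}
    for k in keys:
        counts[owner(k, positions, owner_names)] += 1
    ratios[vnodes] = max(counts.values()) / max(1, min(counts.values()))
    print(f"{vnodes:>3} vnode(s)/server: max/min load ratio = {ratios[vnodes]:.2f}")
```

With a single token per server the busiest server can carry several times the load of the quietest; with 150 vnodes the ratio drops close to 1.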
📊 Virtual Nodes: Key Assignment on the Ring
flowchart LR
K["Key hash value (e.g. position 55)"] --> R["Hash Ring (walk clockwise)"]
R --> V1["vnode ServerA-1 @ position 50 (skip)"]
R --> V2["vnode ServerB-0 @ position 60 (first hit)"]
V2 --> N[Assigned to ServerB]
Note1["ServerA also at 10, 200; ServerB also at 120, 250"]
This diagram shows how a key at hash position 55 is assigned to a physical server using virtual nodes. The ring walk proceeds clockwise from position 55, skips ServerA-1's vnode at position 50 (behind the key), and lands on ServerB-0's vnode at position 60 — so the key is owned by ServerB. Each physical server holds multiple vnodes at different ring positions (shown in the note), so when a server fails its load scatters evenly across all remaining servers rather than collapsing onto a single neighbor.
📊 Node Addition: Minimal Key Rebalance
sequenceDiagram
participant C as Client
participant NR as Hash Ring
participant New as New Node D
participant Old as Existing Node B
C->>NR: add Node D at position 135
NR->>Old: which keys are in range 91-135?
Old-->>NR: keys {key1, key2, key3}
NR->>New: transfer keys {key1, key2, key3}
New-->>NR: transfer complete
Note over NR: Only ~1/N keys moved (minimal rebalance)
C->>NR: lookup key at position 100
NR-->>C: route to Node D (new owner)
🌍 Real-World Applications: Where Consistent Hashing Powers Real Systems
- Apache Cassandra: Uses vnodes (default 256 per server) for token assignment.
- Amazon DynamoDB: Consistent hashing of partition keys across storage nodes.
- Memcached / Redis Cluster: Client-side consistent hashing libraries route cache keys to shards.
- Nginx / HAProxy: Some load balancing plugins use consistent hashing for session affinity.
When Cassandra adds a node, only the data in that node's token range is streamed over — not a full reshuffle.
⚖️ Trade-offs & Failure Modes: Limits and When It Doesn't Help
| Concern | Detail |
| --- | --- |
| Hot keys | A single key (e.g., celebrity tweet) can overwhelm one vnode despite balanced distribution |
| Data skew | Non-uniform key distributions still concentrate load |
| Read replicas | Consistent hashing defines primary placement; replication factor adds complexity |
| Small clusters | With only 3 servers, even 150 vnodes per server leave noticeable variance in load |
Virtual nodes help with structural balance but do not fix hot-key problems. Those require application-level solutions (key sharding with suffixes, caching at a higher layer).
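One common application-level fix is key sharding with suffixes, sketched here (the shard count and naming scheme are illustrative assumptions):

```python
# Sketch: spread one hot key across several ring positions by appending
# a suffix on write; reads fan out across all sub-keys.
import random

HOT_KEY_SHARDS = 8

def write_key(hot_key: str) -> str:
    # Each write targets one of 8 sub-keys, which hash to different servers
    return f"{hot_key}#{random.randrange(HOT_KEY_SHARDS)}"

def read_keys(hot_key: str) -> list[str]:
    # Reads must query every sub-key (e.g. summing sharded counters)
    return [f"{hot_key}#{i}" for i in range(HOT_KEY_SHARDS)]

print(write_key("celebrity-tweet-likes"))
print(read_keys("celebrity-tweet-likes"))
```

The trade-off: write load spreads across 8 servers, but every read now costs 8 lookups plus an aggregation step.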
🔬 Deep Dive: Replication and Token Ring Internals
Replication Factor
Consistent hashing defines which node is the primary owner of a key, but production systems need replication for durability. Cassandra uses the ring position to determine replicas: after finding the primary node (the first clockwise match), it walks clockwise to find the next RF-1 distinct physical servers, skipping vnodes that belong to servers already chosen. With replication_factor=3, data lives on 3 servers.
flowchart LR
K["Key K (position 150)"] -->|primary owner| A["ServerB @180"]
A -->|replica 1| B["ServerC @270"]
B -->|replica 2| C["ServerA @90 (wraps)"]
With three replicas, a key can remain readable even if two of them are down; the CAP trade-off is made explicit by where you set your read/write consistency level (ONE, QUORUM, ALL).
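The clockwise replica walk can be sketched like this (ring positions are illustrative, and this is a simplification of Cassandra's actual replication strategies):

```python
# Sketch: collect the next RF distinct physical servers clockwise,
# skipping vnodes of servers already chosen.
import bisect

# (position, physical server) pairs; two vnodes per server, sorted
ring = sorted([(10, "A"), (90, "A"), (120, "B"),
               (180, "B"), (250, "C"), (270, "C")])

def replicas(key_position: int, rf: int) -> list[str]:
    positions = [p for p, _ in ring]
    idx = bisect.bisect_left(positions, key_position)
    owners = []
    for step in range(len(ring)):
        server = ring[(idx + step) % len(ring)][1]
        if server not in owners:  # skip vnodes of already-chosen servers
            owners.append(server)
        if len(owners) == rf:
            break
    return owners

print(replicas(150, 3))  # primary B, then C, then A (wrapping)
```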
Token Ranges in Cassandra
In Cassandra, each vnode owns a token range — a slice of the 64-bit integer space. The full space is divided into as many ranges as there are vnodes. With 3 servers × 256 vnodes = 768 token ranges, each server owns 256 non-contiguous slices spread evenly around the ring.
When you run nodetool status, you see each node's token assignments. Adding a node causes Cassandra to transfer only the affected token ranges — typically 1/N of total data.
Consistent Hashing in Redis Cluster
Redis Cluster uses a variation: 16,384 hash slots (not a continuous ring, but a segmented space). The CRC16(key) % 16384 formula assigns keys to slots; slots are manually assigned to nodes. This gives operators explicit control over rebalancing — you move slots between nodes rather than relying on automatic ring placement.
🧭 Decision Guide: When to Use Consistent Hashing
Use consistent hashing when:
- Building a distributed cache (Memcached, Redis Cluster) where servers join/leave regularly.
- Partitioning a distributed database (Cassandra, DynamoDB) and minimizing data movement is critical.
- Implementing a load balancer with session affinity — route a user to the same backend on every request.
- Building a content delivery system where edge nodes are added or removed based on demand.
Skip consistent hashing when:
- Your dataset fits on one server — no sharding needed.
- You need range queries across sorted keys — B-tree-based range sharding is better.
- You need exact control over which shard holds which data — explicit mapping beats ring placement.
| Signal | Recommendation |
| --- | --- |
| Nodes join/leave frequently | ✅ Consistent hashing |
| Distributed cache with many servers | ✅ Consistent hashing |
| Need sorted range scans | ❌ Use range-based sharding |
| Single server or very small cluster | ❌ Overkill — use key % N |
🌐 Real-World Architecture: Amazon DynamoDB
DynamoDB's storage layer uses consistent hashing with virtual nodes to distribute table partitions across its fleet of storage servers. Here is a simplified read path:
flowchart LR
Client -->|"PutItem: userId=123"| R[DynamoDB Router]
R -->|"hash(userId=123)"| Ring[Virtual Node Ring]
Ring -->|token match| S1["Storage Node A (primary)"]
S1 -->|replicate| S2[Storage Node B]
S1 -->|replicate| S3[Storage Node C]
S1 -->|ack| R
R -->|"200 OK"| Client
When AWS adds capacity to a DynamoDB region, new storage nodes join the ring and absorb only the adjacent token ranges. No global reshuffle happens. A customer writing to userId=123 at the same moment the ring is being rebalanced continues to succeed — the router just uses updated ring metadata for new requests.
The same principle applies to:
- Amazon S3 internal object placement.
- Apache Cassandra deployed by Netflix for viewing history at petabyte scale.
- Discord using Cassandra's consistent hashing ring to store billions of messages.
🧪 Practical Setup: Consistent Hashing in Python
Here is a minimal consistent hashing implementation showing the core data structure:
import hashlib
from sortedcontainers import SortedDict
class ConsistentHashRing:
def __init__(self, vnodes=150):
self.ring = SortedDict()
self.vnodes = vnodes
def _hash(self, key: str) -> int:
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_server(self, server: str):
for i in range(self.vnodes):
token = self._hash(f"{server}-{i}")
self.ring[token] = server
def remove_server(self, server: str):
self.ring = SortedDict(
{t: s for t, s in self.ring.items() if s != server}
)
def get_server(self, key: str) -> str:
if not self.ring:
raise Exception("No servers in ring")
token = self._hash(key)
idx = self.ring.bisect_left(token)
if idx == len(self.ring):
idx = 0 # wrap around
return self.ring.peekitem(idx)[1]
# Usage
ring = ConsistentHashRing(vnodes=150)
ring.add_server("server-a")
ring.add_server("server-b")
ring.add_server("server-c")
print(ring.get_server("user-123"))  # e.g. server-b
The SortedDict gives O(log N) lookup for bisect_left. Adding or removing a server only iterates over its 150 vnode tokens — not all keys.
🛠️ Java TreeMap: A Production-Ready Consistent Hash Ring Without External Dependencies
java.util.TreeMap is the natural data structure for a consistent hash ring in Java: integer tokens are the keys (sorted automatically), server names are the values, and ceilingKey(token) implements the clockwise-lookup in O(log N) — zero external dependencies required.
import java.security.MessageDigest;
import java.util.TreeMap;
public class ConsistentHashRing {
private final int virtualNodes;
private final TreeMap<Long, String> ring = new TreeMap<>();
public ConsistentHashRing(int virtualNodes) {
this.virtualNodes = virtualNodes;
}
public void addServer(String server) {
for (int i = 0; i < virtualNodes; i++) {
long token = hash(server + "-vnode-" + i);
ring.put(token, server);
}
}
public void removeServer(String server) {
for (int i = 0; i < virtualNodes; i++) {
ring.remove(hash(server + "-vnode-" + i));
}
}
public String getServer(String key) {
if (ring.isEmpty()) throw new IllegalStateException("No servers in ring");
long token = hash(key);
// ceilingKey: first token >= hash(key) → first server clockwise from key
Long ceiling = ring.ceilingKey(token);
// Wrap around the ring if the key lands past the last token
return ring.get(ceiling != null ? ceiling : ring.firstKey());
}
private long hash(String input) {
try {
byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes());
long h = 0;
for (int i = 0; i < 8; i++) h = (h << 8) | (digest[i] & 0xFFL);
return h;
} catch (Exception e) { throw new RuntimeException(e); }
}
public static void main(String[] args) {
ConsistentHashRing ring = new ConsistentHashRing(150); // 150 vnodes per server
ring.addServer("cache-a");
ring.addServer("cache-b");
ring.addServer("cache-c");
System.out.println(ring.getServer("user-123")); // e.g. cache-b
System.out.println(ring.getServer("session-456")); // e.g. cache-a
// Add a 4th server — only ~1/4 of keys remap (vs. nearly all with key%N)
ring.addServer("cache-d");
System.out.println(ring.getServer("user-123")); // may now route to cache-d
}
}
Ketama and Guava are the two standard production alternatives in the Java ecosystem:
- Ketama (used by spymemcached and XMemcached) is the industry-standard Memcached consistent hashing algorithm in Java, using MD5-based token distribution for maximum client-server compatibility across polyglot stacks.
- Guava's Hashing.consistentHash provides a simpler bucket-based scheme: no ring data structure is needed, and the bucket is computed directly from the hash code.
// Guava consistent hash — maps a key to one of N buckets, minimizes remapping on resize
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
int bucket = Hashing.consistentHash(
Hashing.murmur3_32().hashString("user-123", StandardCharsets.UTF_8),
3 // number of buckets (servers)
);
System.out.println("Routes to bucket: " + bucket); // 0, 1, or 2
// When you grow from 3 to 4 buckets, only ~1/4 of keys remap — same guarantee as the ring
Choose the TreeMap ring for full Cassandra/DynamoDB-style token management with virtual nodes; choose Guava's Hashing.consistentHash for simple stateless routing where you only need the bucket number.
For a full deep-dive on Ketama algorithm internals, Guava's rendezvous hashing math, and Redis Cluster hash slot management in Java, a dedicated follow-up post is planned.
📚 Lessons Learned: Pitfalls in Production
1. Hash function choice matters. MD5 and SHA-1 distribute well but are cryptographic overkill; faster non-cryptographic hashes such as CRC32 or MurmurHash are widely used in production consistent hashing (Ketama uses MD5; many Memcached clients use CRC32). Avoid simple sum(bytes) % range, which clusters similar keys.
2. Vnode count is a tuning knob, not a default. Cassandra long defaulted to 256 vnodes per node (newer releases lowered the default). Higher vnode counts give better balance but more gossip and streaming overhead, so tune the count to your cluster size rather than accepting the default.
3. Consistent hashing ≠ balanced load under skewed access. If 80% of reads hit 20% of keys (Pareto distribution), the servers holding those key ranges are hot regardless of ring balance. Layer in a read cache (Redis) in front of hot partitions.
4. Token ownership is metadata, not data. When you add a server, you must stream the actual data from the previous owner before the new server can serve reads. Plan for this migration window — Cassandra calls it "streaming" and it can take hours on large datasets.
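Pitfall 1 is easy to demonstrate: byte-sum hashing places sequential key names on near-consecutive ring positions, while an MD5-based hash scatters them (the key names and the 0–360 range are illustrative):

```python
# Sketch: why sum(bytes) % range clusters keys. Sequential key names
# differ by one byte, so their byte-sums land side by side on the ring.
import hashlib

def bad_hash(key: str) -> int:
    return sum(key.encode()) % 360

def good_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 360

keys = [f"user-{i}" for i in range(100, 110)]
bad_positions = [bad_hash(k) for k in keys]
good_positions = [good_hash(k) for k in keys]
print("sum(bytes):", bad_positions)   # ten consecutive positions
print("md5:       ", good_positions)  # scattered around the ring
```

Ten sequential user IDs all land in a 10-degree arc under byte-sum hashing, so one server absorbs them all; the MD5 positions spread across the full ring.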
📌 TLDR: Summary & Key Takeaways
- key % N remaps almost all keys when $N$ changes; consistent hashing avoids this.
- Both servers and keys are placed on a shared ring; a key maps to the next server clockwise.
- Adding/removing a server only moves keys in the adjacent arc — not the full key space.
- Virtual nodes (150–256 per server) ensure uniform distribution despite unlucky token placement.
- Used in Cassandra, DynamoDB, Memcached, and Redis Cluster.