NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data
Consistent hashing rings, adaptive capacity, and chunk balancers — how the three leading NoSQL databases partition data at scale
TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node — virtual nodes (vnodes) make rebalancing smooth. DynamoDB manages partitioning completely on your behalf, splitting physical partitions at 10 GB automatically and redistributing throughput via adaptive capacity — but a structurally hot partition key still requires write sharding at the application layer. MongoDB divides collections into chunks routed by a mongos process, with ranged shard keys enabling efficient range queries and hashed shard keys distributing writes uniformly; jumbo chunks signal a low-cardinality shard key. In all three systems, the partition key decision is permanent and architectural — getting it wrong is a 2 AM production incident, not a config file change.
📖 Why NoSQL Databases Cannot Rely on a Single Node
Every distributed system eventually confronts the same problem: the data is too large, too busy, or too geographically spread to fit on one machine. A relational database with partitioning still runs on a single node — the partitions are physical table segments, not independent servers. NoSQL databases solve the next-order problem: they distribute data across dozens or hundreds of nodes while making the cluster look, to the application, like a single coherent store.
The mechanism that makes this possible is partitioning — specifically, the rules the database uses to decide which node owns which piece of data. When you write a message to a chat application, the database must answer a question in microseconds: which node in a 300-node cluster is responsible for storing this row? When you read it back, it must answer the same question for every read. The answer must be fast, consistent, and tolerant of nodes being added or removed without reshuffling the entire dataset.
Three databases — Apache Cassandra, Amazon DynamoDB, and MongoDB — each answer this question differently. They share the vocabulary (partition key, shard, token) but implement fundamentally distinct mechanisms. Understanding those mechanisms is not academic. The failure modes, the hot-spot patterns, and the mitigation strategies all follow directly from the partitioning model each database uses. An engineer who treats these databases as interchangeable black boxes will reproduce the same 2 AM incident in all three.
🔍 Core Concepts: Partition Keys, Token Ranges, and Shard Boundaries
Before comparing the three systems, it helps to establish the vocabulary each one uses and why the terminology overlaps without being synonymous.
A partition key in all three databases is the field (or combination of fields) that determines which physical storage location owns a given record. In Cassandra, it is the first component of the PRIMARY KEY definition. In DynamoDB, it is the hash key attribute. In MongoDB, it is the shard key field (or compound fields) chosen at collection sharding time. In all cases, the partition key is immutable after schema creation — records cannot be moved to a different partition by updating the partition key value.
A token is the internal numeric representation of the partition key after a hash function has been applied. Cassandra and DynamoDB both operate on a hash ring or hash-based partition map internally. The token determines position on the ring or within the partition map. MongoDB with a hashed shard key does the same thing; with a ranged shard key, MongoDB operates on the raw value ordering of the shard key field.
A chunk is MongoDB's unit of distribution — a contiguous range (or hash bucket) of documents defined by the shard key. Cassandra and DynamoDB do not use the word chunk; their unit is the partition itself. The distinction matters because MongoDB's balancer migrates chunks, not individual documents, when rebalancing load across shards.
| Concept | Cassandra | DynamoDB | MongoDB |
| --- | --- | --- | --- |
| Routing unit | Partition (all rows with same partition key hash) | Partition (all items with same hash key) | Chunk (contiguous shard key range or hash bucket) |
| Hash function | Murmur3 | Internal AWS function | MD5 (hashed) or raw value ordering (ranged) |
| Key anatomy | PRIMARY KEY ((partition_key), clustering_key) | Hash key + optional sort key | Shard key field(s) |
| Cluster topology | Peer-to-peer ring — all nodes equal | Fully managed — AWS controls topology | Config server + mongos router + shard replicas |
| Replication | Replication factor N — next N clockwise nodes | Multi-AZ managed transparently | Replica sets per shard |
⚙️ How Cassandra, DynamoDB, and MongoDB Actually Split Data
Cassandra: Consistent Hashing and the Virtual Node Ring
Cassandra distributes data using a consistent hash ring that spans the full 64-bit integer space. Each node claims ownership of one or more token ranges on this ring. When a write arrives, Cassandra applies Murmur3 to the partition key value and produces a token. A ring lookup identifies the node whose token range contains that token — that node becomes the primary replica.
Without virtual nodes (vnodes), each physical node owns one contiguous arc of the ring. When a new node joins, it takes ownership of exactly one arc, receiving data from one neighbour. With vnodes — enabled by default since Cassandra 2.x — each physical node owns many non-contiguous token ranges scattered around the ring (256 per node by default in 2.x and 3.x; 16 in 4.0+). When a node joins or leaves, the redistribution load is spread across many neighbours rather than one. A cluster expansion that previously required coordinating a large single transfer now proceeds as many small parallel transfers, completing faster with less impact on query latency.
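To make the lookup concrete, here is a minimal consistent-hash ring with vnodes in Python — an illustrative sketch, not Cassandra's implementation. It substitutes an MD5-derived 64-bit token for Murmur3, and the node names and vnode count are arbitrary:

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for Cassandra's Murmur3: derive a 64-bit token from MD5.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node claims `vnodes` scattered positions on the ring.
        self._entries = sorted(
            (token(f"{node}#vnode{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._entries]

    def primary(self, partition_key: str) -> str:
        # First vnode clockwise from the key's token — an O(log N) lookup.
        i = bisect.bisect_right(self._tokens, token(partition_key))
        return self._entries[i % len(self._entries)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.primary("channel_id=gaming-news"))  # deterministic node choice
```

Adding a node inserts its vnode tokens into the sorted list; only keys whose tokens fall inside the new arcs move — the property that makes vnode rebalancing smooth.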
The replication factor (RF) controls how many nodes hold a copy of each partition. With RF=3, the write goes to the primary node and is forwarded to the next two nodes clockwise on the ring. The application can choose its consistency level: ONE returns after the first acknowledgement; QUORUM returns after the majority (2 of 3) acknowledge; ALL waits for all three. Consistency level is a per-query knob — the same cluster supports different trade-offs for different operations.
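At the driver level, the consistency knob is set per statement. A sketch using the DataStax Python driver — the keyspace, table, and contact point are hypothetical:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])    # any node can coordinate
session = cluster.connect("chat")  # hypothetical keyspace

# QUORUM: return once 2 of 3 replicas (RF=3) acknowledge the write.
write = SimpleStatement(
    "INSERT INTO messages (channel_id, message_id, body) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("gaming-news", 42, "hello"))

# ONE: lowest-latency read; may miss the most recent write.
read = SimpleStatement(
    "SELECT body FROM messages WHERE channel_id = %s LIMIT 50",
    consistency_level=ConsistencyLevel.ONE,
)
rows = session.execute(read, ("gaming-news",))
```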
Partition key design in Cassandra has two hard constraints. First, all rows sharing a partition key value are stored contiguously within each SSTable on the owning node — efficient for sequential reads but disastrous if one partition accumulates without bound. Second, Cassandra has no automatic split mechanism for oversized partitions. A partition that grows beyond ~100 MB in production begins causing compaction stalls, garbage collection pressure, and cascading read latency on the owning node. The fix is always a data model change: add a time-bucket component to the partition key so each time window of data becomes its own partition, capping per-partition growth.
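The time-bucket fix lives in the application's write path: derive the bucket from the event timestamp and include it in the partition key. A sketch with illustrative names:

```python
from datetime import datetime, timezone

def week_bucket(ts: datetime) -> str:
    # ISO year-week, e.g. "2024-W07": one partition per channel per week.
    year, week, _ = ts.isocalendar()
    return f"{year}-W{week:02d}"

ts = datetime(2024, 2, 14, tzinfo=timezone.utc)
partition_key = ("gaming-news", week_bucket(ts))
# PRIMARY KEY ((channel_id, week_bucket), message_id) — the bucket caps
# each partition at one week of messages for that channel.
```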
DynamoDB: Managed Partition Maps and Adaptive Capacity
DynamoDB removes the operational burden of ring management entirely. You choose a partition key (and optionally a sort key) and AWS manages all physical partitioning transparently. Under the hood, DynamoDB maintains a partition map — a hash-based index that routes each item to a physical partition by applying an internal hash function to the partition key value.
Each physical partition carries a ceiling of approximately 10 GB and a proportional share of the table's provisioned throughput in read capacity units (RCUs) and write capacity units (WCUs). When a partition approaches 10 GB, DynamoDB automatically splits it into two halves, dividing both the key range and the throughput allocation equally between them. This split is transparent to the application. The consequence that surfaces in production: if access concentrates on one half after the split, that half may exhaust its throughput allocation and begin throttling — even though the table's total provisioned capacity is not fully consumed.
Burst capacity provides a short-term safety buffer. Each partition retains up to five minutes (300 seconds) of its own unused throughput, which can be consumed during brief spikes. Burst capacity is best-effort, not a guarantee — it depends on how much unused throughput the partition has accrued. For sustained hot partition traffic, burst capacity provides no relief.
Adaptive capacity — which since a 2019 update reacts within seconds — automatically redistributes unused RCUs and WCUs from cold partitions to hot ones in near real-time. A flash sale that drives 10× traffic to five product keys while the rest of the table sits idle will trigger adaptive capacity to shift throughput to those keys within seconds. Adaptive capacity is effective for transient spikes but cannot fix a structurally imbalanced partition key — one where a small set of keys permanently accounts for the majority of traffic.
The fix for structural hot partitions is write sharding: instead of writing all items under product_id = top-seller, distribute them across ten logical sub-partitions by appending a random suffix — product_id#0 through product_id#9. Reads then fire ten parallel Query operations and merge results in the application layer. The cost is explicit complexity in the access layer; the benefit is that ten physical DynamoDB partitions, not one, absorb the write load.
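A sketch of the write-sharding pattern with boto3 — the table name, key attribute names, and shard count are assumptions for illustration:

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 10
table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

def put_order(product_id: str, order_ts: str, item: dict):
    # Spread writes for one hot product across 10 physical partitions.
    shard = random.randrange(NUM_SHARDS)
    table.put_item(Item={"pk": f"{product_id}#{shard}", "sk": order_ts, **item})

def get_orders(product_id: str):
    # Read side pays the cost: one Query per shard, merged in the app
    # (parallelize with threads in production).
    results = []
    for shard in range(NUM_SHARDS):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{product_id}#{shard}")
        )
        results.extend(resp["Items"])
    return sorted(results, key=lambda it: it["sk"])
```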
MongoDB: Chunks, the Balancer, and Zone Sharding
MongoDB's sharding model introduces a routing layer — mongos — that sits between the application and the shard nodes. Routing metadata is stored in a config server replica set. When a query arrives at mongos, it consults the config server to identify which shard owns the chunk matching the query's shard key, then forwards the request directly. For queries that do not include the shard key, mongos broadcasts the query to every shard (scatter-gather) and merges the results.
A ranged shard key divides documents by ordered value ranges of the shard key field. Documents with sequential shard key values live on the same chunk. This is efficient for range queries — find({ order_date: { $gte: start, $lte: end } }) is routed only to the shards holding the matching range if order_date is the shard key. The weakness: monotonically increasing values (ObjectId, timestamps, auto-increment IDs) route all new inserts to the current highest chunk, creating a write hot spot on one shard.
A hashed shard key applies MD5 to the shard key value before mapping to a chunk, producing near-uniform write distribution regardless of key ordering. Range queries on a hashed shard key scatter across all shards — there is no locality. Choosing between ranged and hashed is a decision between write distribution and query efficiency, and it cannot be changed without resharding the collection.
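Both choices are made once, at shardCollection time. In PyMongo the admin commands look like this sketch — database and collection names are hypothetical, and sharding must be enabled on the database first:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # connect via mongos
admin = client.admin

admin.command("enableSharding", "shop")

# Ranged shard key: preserves order locality for range queries.
admin.command("shardCollection", "shop.orders", key={"order_date": 1})

# Hashed shard key: uniform write distribution, no range locality.
admin.command("shardCollection", "shop.events", key={"event_id": "hashed"})
```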
The balancer is a background process that monitors chunk counts per shard and migrates chunks from over-loaded shards to under-loaded ones. Chunk migration is live: MongoDB streams documents to the destination shard, catches up via oplog replication, and atomically redirects routing metadata. The balancer respects a configurable active window — typically off-peak hours — to avoid migrating during peak traffic.
A jumbo chunk arises when a chunk grows past the configured maximum size but cannot be split — typically because every document in it shares the exact same shard key value. The balancer marks it as jumbo and stops attempting migration. A country_code = "US" chunk in a collection sharded by country_code will grow without bound if the US accounts for 40% of documents. The fix is a compound shard key: { country_code: 1, user_id: 1 }. Now each country can have thousands of splittable chunks differentiated by user_id ranges.
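MongoDB 4.4+ offers refineCollectionShardKey, which appends suffix fields to an existing shard key without a full reshard, so former jumbo chunks become splittable. A hedged PyMongo sketch with hypothetical names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

# refineCollectionShardKey requires a supporting index on the new
# compound key to exist first.
client["app"]["users"].create_index([("country_code", 1), ("user_id", 1)])

# Extend { country_code: 1 } to { country_code: 1, user_id: 1 } so the
# balancer can split former jumbo chunks on user_id boundaries.
client.admin.command(
    "refineCollectionShardKey",
    "app.users",
    key={"country_code": 1, "user_id": 1},
)
```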
Zone sharding assigns specific shard key ranges to specific shard nodes. A European shard cluster can be tagged with zone EU, and the shard key range covering European country codes is pinned to that zone. All EU documents land on EU nodes — enforcing data residency for GDPR compliance without any application-layer routing logic.
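Configuring a zone takes two admin commands: tag a shard into the zone, then pin a shard key range to it. A PyMongo sketch — shard names, namespace, and range bounds are illustrative:

```python
from bson import MinKey, MaxKey
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
admin = client.admin

# 1. Assign a physical shard (replica set) to the EU zone.
admin.command("addShardToZone", "shard-eu-1", zone="EU")

# 2. Pin the shard key range covering EU country codes to that zone.
#    Collection is sharded on { country_code: 1, user_id: 1 }.
admin.command(
    "updateZoneKeyRange",
    "app.users",
    min={"country_code": "AT", "user_id": MinKey()},
    max={"country_code": "IT~", "user_id": MaxKey()},  # illustrative bounds
    zone="EU",
)
```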
🧠 Deep Dive: Inside the Partition Routing Engines
The Internals
Cassandra token ring internals: The ring is a logical construct maintained in Cassandra's gossip protocol. Each node broadcasts its token assignments to its neighbours; the full ring state converges across the cluster within seconds. The coordinator node — the node that receives a write from the client — does not need a centralised lookup service. It holds the full ring state in memory and computes the primary replica address locally in microseconds. This is a key architectural advantage: no single-point-of-failure routing layer. Any node can act as coordinator for any write, making Cassandra's routing inherently peer-to-peer.
DynamoDB partition map internals: AWS publishes no official documentation on the internal partition map implementation, but the observable behaviour is consistent with a hash ring or consistent hash table approach. The partition map is opaque to the user. Partition splits are asynchronous background operations; the application sees no downtime but may observe brief latency spikes as split metadata propagates. Each physical partition is backed by three-way replication across availability zones, making the durability model independent of the partition key choice.
MongoDB config server internals: The config server stores a chunks collection that maps each chunk's key range to a shard identifier. mongos caches this mapping in memory and refreshes it when routing errors indicate a stale chunk map. The config server is a replica set (three members minimum in production) to avoid becoming a single point of failure for all routing decisions. Write-heavy workloads can cause chunk migrations to fall behind — the balancer enforces a throttle to prevent migrations from consuming the cluster's full I/O budget.
Pseudo-schema of routing metadata across all three systems:
| System | Routing Metadata Store | Key fields stored | Update mechanism |
| --- | --- | --- | --- |
| Cassandra | Gossip protocol (in-memory on every node) | node_id, token_ranges[], datacenter, rack | Peer broadcast on join/leave/move |
| DynamoDB | AWS-internal partition map (opaque) | partition_id, key_range, capacity_allocation | Managed split/merge events |
| MongoDB | Config server replica set | chunk_id, shard_key_range, shard_id, version | Balancer migration + mongos cache refresh |
Performance Analysis
Cassandra performance characteristics: The token ring lookup is O(log N) where N is the number of token ranges. With 256 vnodes per node on a 100-node cluster, the ring contains 25,600 token ranges — still resolved in microseconds from a sorted in-memory structure. The dominant latency factor is not routing but replication acknowledgement across data centre racks. A wide Cassandra partition — one that accumulates hundreds of MB — degrades not just its own reads but all partitions hosted on the same node during compaction cycles. Compaction throughput per table is bounded — LeveledCompactionStrategy in particular serialises much of its work within a level — so a 500 MB partition triggers compaction runs that compete with normal read I/O on that node for seconds at a time.
DynamoDB performance characteristics: DynamoDB's advertised single-digit-millisecond latency holds for well-designed partition keys. The throughput ceiling is per-partition: a single partition can sustain roughly 1,000 WCUs and 3,000 RCUs per second before throttling begins. A hot partition driven past that limit will see ProvisionedThroughputExceededException even with unlimited table-level capacity. The diagnostic signal is per-key access data from CloudWatch Contributor Insights for DynamoDB — if one or two keys account for 80% of consumed capacity, write sharding is required. Adaptive capacity can temporarily lift a hot partition above its base allocation by borrowing unused throughput, but provides no relief for sustained structural imbalance.
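The per-key breakdown comes from CloudWatch Contributor Insights for DynamoDB, which ranks the most-accessed and most-throttled partition keys. Enabling it is a single call; a boto3 sketch with an assumed table name:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Surfaces the most-accessed / most-throttled partition keys in CloudWatch.
dynamodb.update_contributor_insights(
    TableName="orders",
    ContributorInsightsAction="ENABLE",
)
```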
MongoDB performance characteristics: Chunk migration throughput is bounded by the balancer's configured rate — by default, MongoDB limits migrations to avoid saturating replication bandwidth. A collection with jumbo chunks sees the balancer stop attempting to rebalance those chunks, causing permanent load imbalance on the shard holding them. The mongos scatter-gather cost is proportional to the number of shards: a 50-shard cluster executing a scatter-gather query must wait for the slowest shard. For ranged shard keys, the 99th-percentile query latency is dominated by the single shard holding the matching range; for hashed shard keys, it is dominated by scatter-gather coordination overhead.
| System | Per-partition write ceiling | Hot partition symptom | Auto-rebalancing |
| --- | --- | --- | --- |
| Cassandra | No enforced WCU limit; node I/O is shared | Coordinator timeout spikes on owning node | Manual: vnode token rebalance or data model change |
| DynamoDB | ~1,000 WCU / partition | ProvisionedThroughputExceededException on specific keys | Adaptive capacity (transient); split on 10 GB |
| MongoDB | Balancer manages chunk distribution | Jumbo chunk; shard imbalance after balancer stops | Automatic chunk migration (configurable rate) |
📊 Visualizing Partition Routing Across All Three Systems
The first diagram traces a write request through Cassandra's consistent hash ring, from partition key to vnode assignment to replication across three nodes. Follow each numbered step to see where the routing decision is made and how replication propagates automatically.
```mermaid
flowchart TD
A["Write Request (partition key: channel_id = gaming-news)"] --> B["Murmur3 Hash (applied to partition key value)"]
B --> C["Token Computed (64-bit integer on the ring)"]
C --> D["Ring Lookup (find vnode owning this token range)"]
D --> E["Primary Node Identified (Node B owns this vnode)"]
E --> F["Coordinator Routes to Node B"]
F --> G["Replication Factor = 3 (forward to next 2 clockwise nodes)"]
G --> H["Node B (primary replica — write persisted)"]
G --> I["Node C (replica 2 — write persisted)"]
G --> J["Node D (replica 3 — write persisted)"]
H --> K["Quorum Satisfied (2 of 3 acknowledged — return to client)"]
I --> K
J --> K
```
The ring lookup is resolved from Node B's local in-memory token map — no centralised routing service is involved. The replication fan-out is parallel: Nodes C and D receive the write simultaneously with Node B. Quorum acknowledgement returns to the client as soon as any two nodes confirm, giving the application a durable write guarantee without waiting for the third node.
The second diagram traces the DynamoDB write path through the internal partition map, including the automatic split trigger and the adaptive capacity pathway that fires when a partition key becomes disproportionately hot.
```mermaid
flowchart TD
A["Write Request (PK: product_id = top-seller, SK: order_ts)"] --> B["Internal Hash Function (applied to partition key)"]
B --> C["Partition Map Lookup (identifies physical partition P7)"]
C --> D{"Partition P7 approaching 10 GB?"}
D -->|"No — capacity available"| E{"Is partition key structurally hot?"}
D -->|"Yes — 10 GB limit reached"| F["Automatic Split (P7 → P7a and P7b)"]
F --> G["Throughput Divided Equally (each half gets half of P7 WCU/RCU allocation)"]
G --> E
E -->|"No — traffic is balanced"| H["Write Succeeds (response returned)"]
E -->|"Yes — throttling risk"| I["Adaptive Capacity Triggered (shift unused RCU/WCU from cold partitions)"]
I --> J{"Spike transient or structural?"}
J -->|"Transient — burst resolved"| H
J -->|"Structural — hot key permanent"| K["ProvisionedThroughputExceededException (write sharding required at app layer)"]
The split path shows the hidden cost of automatic splitting: each half inherits half the throughput budget. Adaptive capacity bridges transient gaps, but the "Spike transient or structural?" diamond marks the boundary of what automatic management can fix. Beyond that boundary, the partition key must be redesigned.
🌍 Real-World Applications: Discord, Netflix, and Retail Scale
Discord — Cassandra for message history. Discord's message storage uses Cassandra with a composite partition key of (channel_id, week_bucket). The week_bucket component is derived from the message timestamp — each week of messages for a channel becomes its own partition. This design caps per-partition size at roughly one week's volume of messages per channel, preventing the unbounded growth that would occur with channel_id alone as the partition key. Every "load last 50 messages" request supplies both components, resolving to a single Cassandra node with one sequential SSTable read.
Amazon retail — DynamoDB for order lookup. The canonical DynamoDB single-table design co-locates users, orders, and order items under the same table using a PK: USER#<id> and SK: ORDER#<date> convention. All orders for a user are co-located on one partition, enabling a single Query to fetch a user's full order history. The risk: a user who places an extreme volume of orders (a B2B wholesale account, for example) can create a hot partition. The mitigation is typically a Global Secondary Index that distributes access by order date rather than user, providing an escape hatch for access patterns that don't fit the primary key model.
Retail geo-compliance — MongoDB zone sharding. A global e-commerce platform uses MongoDB zone sharding to ensure European customer data resides on European shard nodes. The shard key includes a region_code field. Shard tags (EU, US, APAC) are assigned to specific shard nodes, and shard key ranges covering each region's codes are pinned to the corresponding tag. Application code writes without any routing awareness — mongos enforces the placement rule transparently.
Time-series IoT — Cassandra bucket design. A fleet telemetry platform writes sensor readings with a partition key of (device_id, hour_bucket) — the bucket derived from the reading timestamp rounded to the nearest hour. Each hour's readings for a device become their own partition, capping partition size and enabling efficient range scans within a single hour. Historical reads combine multiple hourly partition queries with client-side merge. The trade-off is explicit: application complexity increases in exchange for bounded partition sizes and predictable node I/O.
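The multi-partition historical read described above can be sketched with the DataStax Python driver — one single-partition query per hourly bucket, merged client-side (keyspace, table, and bucket format are hypothetical):

```python
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("telemetry")  # hypothetical keyspace

def read_range(device_id: str, start: datetime, end: datetime):
    query = ("SELECT ts, value FROM readings "
             "WHERE device_id = %s AND hour_bucket = %s")
    rows = []
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour <= end:  # one single-partition query per hourly bucket
        bucket = hour.strftime("%Y-%m-%dT%H")
        rows.extend(session.execute(query, (device_id, bucket)))
        hour += timedelta(hours=1)  # parallelize with execute_async at scale
    return sorted(rows, key=lambda r: r.ts)  # client-side merge
```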
⚖️ Trade-offs and Failure Modes in NoSQL Partitioning
| Dimension | Cassandra | DynamoDB | MongoDB |
| --- | --- | --- | --- |
| Hot partition root cause | Low-cardinality or time-ordered partition key | Structurally skewed hash key | Low-cardinality shard key producing jumbo chunks |
| Hot partition symptom | Coordinator timeouts; GC pauses on owning node | ProvisionedThroughputExceededException on specific keys | Shard imbalance; balancer marks chunk as jumbo |
| Hot partition fix | Composite partition key with time-bucket component | Write sharding with random suffix (key#0–key#9); scatter-gather on read | Compound shard key adding high-cardinality field |
| Node addition impact | Smooth with vnodes — many small transfers from many nodes | Transparent — AWS manages partition map expansion | Balancer migrates chunks to new shard automatically |
| Cross-partition query cost | ALLOW FILTERING triggers full cluster scan — avoid | Scan reads all partitions — billed per item read | Scatter-gather via mongos — latency proportional to shard count |
| Schema change complexity | Partition key change requires full table rebuild | Partition key change requires new table + data migration | Resharding (MongoDB 5.0+) supported but operationally heavy |
The most expensive anti-pattern across all three systems is choosing a partition key based on convenience (the field you already query) without verifying its cardinality and access distribution. A partition key that satisfies current access patterns but has low cardinality — a boolean field, a status enum, a small set of country codes — will produce hot partitions under growth. Validate access distribution with production-shaped data before committing to a partition key.
🧭 Decision Guide: Which NoSQL Partitioning Model Fits Your Use Case
| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Time-series writes with time-range reads | Cassandra with composite (entity_id, time_bucket) key | Co-locates recent data; time-bucket prevents unbounded partition growth |
| Serverless access patterns; unpredictable traffic | DynamoDB with adaptive capacity | Managed partitioning absorbs spikes; no cluster to operate |
| Geo-compliance: data residency by region | MongoDB with zone sharding | Zone tags enforce node placement without application routing logic |
| Uniform high-throughput writes; no range queries | Any of the three with hashed key | Uniform write distribution is the primary goal |
| Range queries dominate; writes are low | MongoDB with ranged shard key OR Cassandra with clustering key range scan | Range locality maximises read efficiency |
| Known access pattern, simple key model | DynamoDB single-table design | Co-location of related entities enables single-query retrieval |
| Large operational team, full control required | Cassandra with custom vnodes and rack awareness | Peer-to-peer architecture with explicit topology control |
| Avoid when | Partition key has < 100 distinct values | Insufficient cardinality for uniform distribution in any of the three systems |
🧪 Worked Scenarios: Diagnosing and Fixing Common Partition Problems
Scenario 1 — Cassandra: wide partition on a gaming channel. A Discord-style messaging service stores messages with PRIMARY KEY (channel_id, message_id). After six months, the #general channel in a popular server has 50 million messages — the partition is approaching 400 MB. Read latency on #general has tripled; GC pauses on the owning node are bleeding into adjacent channels. The diagnosis is a wide partition caused by unbounded growth under a single-value partition key. The fix: migrate the schema to PRIMARY KEY ((channel_id, week_bucket), message_id), where week_bucket is derived in the application from the message timestamp (for example, the ISO year-week). This caps each partition at one week of messages. Historical data must be backfilled into the new schema — a one-time migration job running parallel to the live service.
Scenario 2 — DynamoDB: flash sale hot partition. An e-commerce table uses product_id as the partition key. During a flash sale, one product drives 800 WCU/s against a 1,000 WCU/s per-partition ceiling. Adaptive capacity absorbs the first few minutes. When the sale runs for two hours, throttling begins. CloudWatch shows ConsumedWriteCapacityUnits concentrated on product_id = flash-deal-001. The fix: write sharding. The application appends a random suffix 0–9 to the partition key on write, distributing traffic across flash-deal-001#0 through flash-deal-001#9. Reads execute ten parallel queries and merge results. This is an explicit architecture change, not a configuration toggle.
Scenario 3 — MongoDB: jumbo chunk on country_code. A SaaS platform shards its tenants collection by { country_code: 1 }. Over time, the US chunk — containing 60% of all tenants — grows past 128 MB. The balancer marks it as jumbo and stops migration attempts. Every write for a US tenant hits the same shard. The fix: change to a compound shard key { country_code: 1, tenant_id: 1 }. Each country_code now has thousands of chunks differentiated by tenant_id ranges. The balancer can freely distribute US-prefix chunks across all shards. This requires a reshard operation in MongoDB 5.0+ or a full collection reload in earlier versions.
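In MongoDB 5.0+, that reshard is a single online admin command; a PyMongo sketch with an assumed namespace:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

# Online resharding (MongoDB 5.0+): rewrites the collection under the new
# compound key while serving traffic; expect sustained I/O during the run.
client.admin.command(
    "reshardCollection",
    "saas.tenants",
    key={"country_code": 1, "tenant_id": 1},
)
```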
📚 Lessons Learned from Production NoSQL Deployments
The partition key is a permanent architectural contract. Changing it post-launch requires a full schema rebuild — weeks of migration work on a live system. Treat the partition key decision with the same weight as an API contract. Run access pattern analysis on production-shaped data volumes before finalising the key.
Cardinality is the single most important partition key property. A partition key with fewer than 100 distinct values will produce dangerously skewed distribution under any of the three systems. Test the key against your actual data distribution — not theoretical uniform assumptions.
Monitor partition-level metrics, not just table-level metrics. All three systems expose per-partition or per-shard signals. Cassandra exposes partition sizes via JMX and nodetool table histograms; DynamoDB surfaces per-key access patterns through CloudWatch Contributor Insights; MongoDB's db.collection.getShardDistribution() reports chunk counts per shard. Alert on skew before it becomes a production incident.
Adaptive capacity and burst capacity are safety nets, not architectural features. DynamoDB's automatic mechanisms are designed for transient traffic patterns. Designing a system that relies on adaptive capacity to absorb a structurally hot partition key is building a ticking clock. Size partition key cardinality for your worst-case traffic distribution.
Test cross-partition access patterns before going to production. Cassandra's ALLOW FILTERING, DynamoDB's Scan, and MongoDB's scatter-gather all carry costs that grow linearly with cluster size. A query that works acceptably on a 10-node development cluster can become catastrophically slow on a 200-node production cluster. Profile the full cluster traversal cost before shipping.
📌 Summary and Key Takeaways
Cassandra uses a consistent hashing ring. Murmur3 maps partition key values to tokens; vnodes distribute token ranges across nodes for smooth rebalancing. The partition key controls node assignment; the clustering key controls on-disk sort order. Partitions above ~100 MB cause compaction stalls — composite partition keys with time-bucket components prevent unbounded growth.
DynamoDB manages partitioning on your behalf. Physical partitions split automatically at 10 GB; throughput is allocated per partition. Adaptive capacity handles transient spikes; structural hot partitions require write sharding at the application layer — no amount of provisioned capacity fixes a poorly chosen partition key.
MongoDB routes via chunks and a balancer. Ranged shard keys enable range query locality; hashed shard keys distribute writes uniformly. The balancer migrates chunks to maintain shard balance. Jumbo chunks — caused by low-cardinality shard keys — defeat the balancer; compound shard keys with high-cardinality secondary fields are the solution. Zone sharding enforces data residency without application routing logic.
All three systems share the same failure mode: a partition key that concentrates traffic on a small number of physical partitions. The diagnostic and the direction of the fix are universal — higher cardinality, composite keys, or application-layer distribution. The implementation differs by system.
The partition key is chosen once. Redesigning it after launch requires a full rebuild. Invest the design time upfront.