Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared
How PostgreSQL, MySQL, Cassandra, DynamoDB, and MongoDB implement sharding — the trade-offs every system designer must know
TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose — range, hash, consistent hashing, or directory — determines whether range queries stay cheap, writes stay balanced, and adding a new node moves 100% or just 1/N of your data. The shard key is the single most consequential architectural decision you will make, and the wrong one is nearly impossible to fix without a full migration.
📖 When One PostgreSQL Instance Is No Longer Enough
Your database is maxed out. A single PostgreSQL instance handles 50k writes per second. You need 500k. You've already maxed out CPU, RAM, and the fastest NVMe disks money can buy — you cannot scale up further. You must scale out. Sharding is how you do it. But three teams before yours tried to shard this same table and failed. The first chose range sharding and created a permanent hot spot on the "last" shard because all new records got the highest keys. The second chose hash sharding without planning for range queries and ended up scatter-gathering across 16 nodes for every dashboard report. The third picked created_at as the shard key — a timestamp — and every single write landed on the same node for the entire last hour. Here's why they failed, and how to avoid every one of those traps.
Sharding is horizontal partitioning: you split one logical dataset across multiple physical nodes, called shards. Each shard owns only a fraction of the data. A query router — whether a middleware layer, a database driver, or a built-in database mechanism — directs each operation to the right shard based on a shard key.
This is fundamentally different from vertical scaling (getting a bigger machine) and from replication (copying the same data to multiple nodes). With replication, every node holds all the data but reads can be spread. With sharding, no single node holds all the data — you distribute both storage and write throughput.
🔍 The Vocabulary of Sharding: Keys, Maps, and Routing Fundamentals
Before evaluating any sharding strategy, you need a shared vocabulary. Misunderstanding these seven terms — especially the difference between a shard key and a partition key, or between resharding and replication — is how teams make irreversible architectural decisions.
| Term | Definition |
| --- | --- |
| Shard key | The column (or set of columns) used to decide which shard owns a row |
| Shard map | The routing table that maps shard key values to physical shard addresses |
| Resharding | Redistributing data when the number of shards changes — the most expensive operation in any sharded system |
| Hot spot | A shard receiving disproportionately more reads or writes than its peers |
| Cross-shard query | A query whose result set spans data on multiple shards, requiring scatter-gather |
| Scatter-gather | Broadcast a query to all shards, collect partial results, and merge at the coordinator |
| 2PC | Two-phase commit — the protocol needed for atomic transactions that span multiple shards |
The routing path for every database operation flows through three conceptual layers: the client sends the query to a router (vtgate in Vitess, mongos in MongoDB, the Citus coordinator in PostgreSQL), the router consults the shard map to identify the target shard, and the shard processes the query and returns results. In hash and consistent-hash strategies the router computes the shard address algorithmically — no map lookup needed. In directory-based strategies the router must query the shard map service synchronously, which is why caching is critical to that approach's performance.
| One large database | Sharded database |
| --- | --- |
| Simple queries and joins | More routing complexity |
| Easier operational model | Higher scale ceiling |
| One write bottleneck | Writes spread across shards |
| Easy global transactions | Cross-shard work requires 2PC |
⚙️ The Four Sharding Strategies and What Each One Optimizes
Range Sharding: Natural Ordering at the Cost of Write Balance
Range sharding assigns contiguous key intervals to each shard. If your shard key is user_id and you have four shards, you might configure:
- Shard 1 → user_id 0 – 999,999
- Shard 2 → user_id 1,000,000 – 1,999,999
- Shard 3 → user_id 2,000,000 – 2,999,999
- Shard 4 → user_id 3,000,000 – ∞
The diagram below shows how the query router uses the shard key value to route each request directly to the owning shard, with no need to contact the others.
graph LR
R[Query Router] -->|user_id 0-999999| S1[Shard 1]
R -->|user_id 1M-1.99M| S2[Shard 2]
R -->|user_id 2M-2.99M| S3[Shard 3]
R -->|user_id 3M and above| S4[Shard 4]
The router holds a simple lookup table of ranges. Any query with a known user_id lands on exactly one shard. Range queries like WHERE user_id BETWEEN 500000 AND 700000 also resolve to a single shard, making them highly efficient.
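As a sketch of that lookup table, the router can keep a sorted list of range upper bounds and binary-search it (hypothetical Python illustration; shard indices here are zero-based, so the text's "Shard 1" is index 0):

```python
import bisect

# Upper bounds of the first three ranges from the example above;
# any key at or past the last bound falls to the final, open-ended shard.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def route_range(user_id: int) -> int:
    """Return the zero-based shard index owning this user_id."""
    return bisect.bisect_right(RANGE_BOUNDS, user_id)

def shards_for_range(lo: int, hi: int) -> list[int]:
    """A range predicate resolves to a contiguous run of shards -- often one."""
    return list(range(route_range(lo), route_range(hi) + 1))

print(route_range(500_000))                # 0 (the text's "Shard 1")
print(shards_for_range(500_000, 700_000))  # [0] -- a single shard
```

The same structure extends to any number of ranges; the lookup stays O(log N) regardless of shard count.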
The hot spot trap. Range sharding works well for keys that are not correlated with time or write activity. If your shard key is an auto-incrementing ID or a timestamp, every new write lands on the last shard while the earlier shards sit idle. HBase and CockroachDB both use range sharding and both document this problem prominently — CockroachDB calls it "hot ranges" and provides automatic load-based splitting. MongoDB's range sharding produces the same problem unless you pre-split your ranges and seed the shards with data before load arrives.
Best fit: Time-series data with a well-distributed key (not a monotonically increasing one), geographic data split by region, or any dataset where range queries are the primary access pattern.
Hash Sharding: Even Distribution at the Cost of Range Queries
Hash sharding applies a hash function to the shard key and maps the result to a shard using modulo arithmetic:
shard = hash(user_id) mod N
Because good hash functions distribute output uniformly, writes spread evenly across all N shards regardless of the key's original distribution. There are no hot spots from timestamp clustering or sequential IDs.
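A minimal sketch of the routing rule, using MD5 as a stand-in stable hash (Python's built-in hash() is randomized per process, so it cannot be used for routing; real databases use functions like MurmurHash3):

```python
import hashlib
from collections import Counter

N_SHARDS = 4

def route_hash(key: str) -> int:
    # A stable hash: the same key must map to the same shard on every router.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS

# Even strictly sequential IDs spread evenly across the shards:
counts = Counter(route_hash(f"user-{i}") for i in range(100_000))
print(sorted(counts.values()))  # four counts, each near 25,000
```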
The routing diagram is straightforward — every key flows through the hash function, and the result selects a shard:
graph TD
K[Incoming Key] --> H{"hash(key) mod 4"}
H -->|0| S0[Shard 0]
H -->|1| S1[Shard 1]
H -->|2| S2[Shard 2]
H -->|3| S3[Shard 3]
The hash function collapses the key's natural ordering, so a query like WHERE user_id BETWEEN 100 AND 200 cannot go to one shard — the row for user_id = 101 could be on Shard 3 while user_id = 102 is on Shard 0. Every range query becomes a scatter-gather across all shards.
The resharding problem. When you add a fifth shard, hash(key) mod 5 produces different results for nearly every key, meaning almost all existing data must be moved. This is a full migration event — expensive, risky, and typically requires weeks of dual-write migration to execute safely.
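The fraction that moves is easy to measure with synthetic keys (illustrative Python; a key survives the change only when hash(key) mod 4 equals hash(key) mod 5, which holds for roughly one key in five):

```python
import hashlib

def shard_of(key: str, n_shards: int) -> int:
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n_shards

keys = [f"user-{i}" for i in range(100_000)]
moved = sum(shard_of(k, 4) != shard_of(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards going from 4 to 5")
# Roughly 80% move -- the "full migration event" described above.
```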
Best fit: Write-heavy workloads with uniform key distribution where range queries are rare, such as user-session stores, feature flags, or event logs queried by exact ID.
Consistent Hashing: Minimising Data Movement When the Cluster Changes
Consistent hashing solves the resharding problem of simple hash sharding by mapping both nodes and keys onto the same fixed circular hash space — the ring (for example, 0 to 2⁶⁴ − 1). A key belongs to the first node encountered when walking clockwise from the key's hash position.
graph LR
NA((Node A)) --> NB((Node B))
NB --> NC((Node C))
NC --> ND((Node D))
ND -->|wraps to start| NA
K1["Key user-123 hash=100"] -.->|clockwise next node| NB
K2["Key order-456 hash=220"] -.->|clockwise next node| NC
When Node E is inserted between Node B and Node C on the ring, only the keys that were previously assigned to Node C and now fall between B and E need to move. All other keys stay put. In a cluster of N nodes, adding one node moves only ~1/N of the total data — versus nearly 100% for simple modulo hashing.
Virtual nodes (vnodes). A physical node is assigned not one but dozens or hundreds of positions on the ring. This distributes its load across many segments, so when a node is added or removed, the redistribution is spread across all existing nodes rather than burdening only one neighbor. Cassandra historically defaulted to 256 vnodes per physical node (Cassandra 4.0 lowered the default num_tokens to 16). DynamoDB's internal sharding engine uses a similar mechanism, though it is fully transparent to users.
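A toy ring with vnodes (illustrative Python only; MD5 stands in for MurmurHash3, and real systems track replica sets rather than a single owner) shows the ~1/N movement directly:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []            # sorted (token, node) ring positions
        for n in nodes:
            self.add(n)

    def add(self, node):
        # Each physical node claims `vnodes` positions on the ring.
        for v in range(self.vnodes):
            bisect.insort(self._tokens, (h(f"{node}#{v}"), node))

    def owner(self, key):
        # First token clockwise from the key's hash; wrap to the start.
        i = bisect.bisect(self._tokens, (h(key),)) % len(self._tokens)
        return self._tokens[i][1]

keys = [f"order-{i}" for i in range(50_000)]
ring = Ring(["A", "B", "C", "D"])
before = {k: ring.owner(k) for k in keys}
ring.add("E")                        # grow the cluster by one node
moved = sum(before[k] != ring.owner(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # close to 1/5 with 5 nodes
```

Only the keys the new node takes over move; every other key's clockwise-next node is unchanged.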
Best fit: Systems where nodes are added or removed frequently, or where gradual resharding with minimal data movement is a hard operational requirement. Cassandra, Riak, and Amazon DynamoDB's internal layer all use consistent hashing.
Directory-Based Sharding: Maximum Flexibility With a Coordination Cost
Directory-based sharding replaces the deterministic routing function with a lookup table — a shard map service — that records exactly which shard owns each key or key range. There is no formula; the directory can assign any key to any shard for any reason.
| Characteristic | Detail |
| --- | --- |
| Routing | Query the directory service, then forward to the named shard |
| Flexibility | Any key can live on any shard; supports heterogeneous shard sizes |
| Rebalancing | Move a key by updating one directory entry — no data-formula recalculation |
| Cost | The directory is a single point of failure and a query-path bottleneck |
| Mitigation | Aggressive client-side caching of directory entries; ZooKeeper or etcd for HA |
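A sketch of the pattern, with a plain dict standing in for the directory service and a TTL cache on the client to keep lookups off the hot path; every name here is hypothetical:

```python
import time

# Stand-in for the authoritative shard map service (ZooKeeper/etcd-backed
# in production, as noted below). Any key can map to any shard.
DIRECTORY = {"user-1": "shard-a", "user-2": "shard-b", "celebrity-9": "shard-hot"}

class CachingRouter:
    """Client-side router that caches directory entries with a TTL."""
    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self.cache = {}              # key -> (shard, cached_at)

    def route(self, key):
        hit = self.cache.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]            # served from cache, no directory call
        shard = DIRECTORY[key]       # synchronous directory lookup
        self.cache[key] = (shard, time.monotonic())
        return shard

router = CachingRouter()
print(router.route("celebrity-9"))   # "shard-hot" -- custom placement
DIRECTORY["user-1"] = "shard-c"      # rebalancing = one map update, no rehash
```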
Vitess (YouTube's MySQL sharding layer) implements a form of directory-based routing through its vtgate query router and vindex lookup tables. Facebook historically used a MySQL shard map stored in ZooKeeper and cached in every application server. The directory makes rebalancing surgical — you can move a single celebrity user's data to a dedicated shard without touching anyone else's routing — but the directory itself must be treated as a critical infrastructure component with high availability.
Best fit: Workloads with known-hot keys that need custom placement, multi-tenant SaaS where tenants vary wildly in size, or systems migrating from one sharding strategy to another where you need per-key routing control during the transition.
🧠 Deep Dive: How SQL and NoSQL Databases Wire Sharding Into Their Engines
Internals: Query Routing, Shard Maps, and Database-Specific Partitioning Models
SQL databases were designed around the assumption that all data lives in one engine on one server. Sharding is an external capability in most SQL systems, not a native one.
PostgreSQL and Citus. PostgreSQL's native PARTITION BY feature creates partitions on the same server — useful for query pruning but not horizontal scaling. True sharding requires the Citus extension (now part of Azure Cosmos DB for PostgreSQL). Citus converts a coordinator PostgreSQL node into a query router and distributes table data across worker nodes using hash or range distribution. A reference_table is a small lookup table replicated to every worker so it can be joined locally without cross-shard round-trips.
-- Distribute the orders table across worker nodes using hash sharding on user_id
SELECT create_distributed_table('orders', 'user_id');
-- Replicate a small countries table to every worker to enable cheap local joins
SELECT create_reference_table('countries');
MySQL and Vitess. Vitess was built inside YouTube when their MySQL cluster could no longer handle video write volume. The architecture separates routing from storage: vtgate is the stateless query router speaking MySQL wire protocol; vttablet is a proxy in front of each MySQL shard handling connection pooling and query rewriting; vtctld manages topology. Routing in Vitess is controlled by vindexes — shard key definitions in the VSchema JSON:
{
"sharded": true,
"vindexes": {
"hash_user_id": { "type": "hash" }
},
"tables": {
"orders": {
"column_vindexes": [
{ "column": "user_id", "name": "hash_user_id" }
]
}
}
}
Cassandra. Partitioning is built into every Cassandra table. The partition key is the shard key. Cassandra hashes it with MurmurHash3 and maps the token to a position on the consistent hash ring. The clustering key defines sort order within a partition. Every query that includes the partition key goes to exactly one node; queries without it scatter to all nodes.
CREATE KEYSPACE social_network
WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
-- user_id is the partition key; created_at is the clustering key (sort order within the partition)
CREATE TABLE posts (
user_id uuid,
created_at timestamp,
content text,
PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
DynamoDB. DynamoDB's sharding is transparent — AWS manages the ring, tokens, and rebalancing automatically. You define a partition key (the shard key) and optionally a sort key (clustering key within a partition). The service splits and merges partitions automatically based on throughput consumption and storage size.
{
"TableName": "Orders",
"KeySchema": [
{ "AttributeName": "user_id", "KeyType": "HASH" },
{ "AttributeName": "order_id", "KeyType": "RANGE" }
],
"BillingMode": "PAY_PER_REQUEST"
}
MongoDB. MongoDB's sharding layer adds mongos (the query router) and config servers (the shard map, storing chunk-to-shard assignments) to a replica set cluster. A chunk is a contiguous range of shard key values. The balancer moves chunks between shards to keep data distribution even. MongoDB's zone sharding pins key ranges to specific shards, enabling geographic data residency — EU users to EU shards, US users to US shards.
sh.enableSharding("social_db");
sh.shardCollection("social_db.users", { user_id: "hashed" });
Performance Analysis: Query Amplification, Scatter-Gather Cost, and Resharding Overhead
Query amplification is the central performance risk of any sharded system. A query that includes the shard key predicate touches exactly one shard — latency is equivalent to a single-node query. A query without a shard key predicate must scatter to all N shards in parallel, wait for the slowest response, and merge results. If 10% of your queries are scatter-gather and you have 16 shards, those 10% of queries consume 16x the database compute resources of a single-shard query.
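The arithmetic behind that claim, as a quick check:

```python
# 16 shards, 10% of queries scatter-gather (the figures from the text).
n_shards = 16
scatter_fraction = 0.10

# Cost per logical query, measured in shard-queries:
cost = (1 - scatter_fraction) * 1 + scatter_fraction * n_shards
print(cost)  # 2.5 -- that 10% makes the average query 2.5x as expensive
```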
Resharding overhead varies dramatically by strategy:
| Database | Resharding Mechanism | Approximate Data Movement |
| --- | --- | --- |
| Cassandra (consistent hash) | vnodes redistribute automatically on bootstrap / decommission | ~1/N of total data |
| DynamoDB | Fully automatic partition split/merge; no operator action | Zero (AWS-managed) |
| PostgreSQL + Citus | select rebalance_table_shards() — copy-based, runs online | Per-shard copy |
| MySQL + Vitess | Multi-step vtctl resharding workflow with traffic cutover | Full key range copy |
| MongoDB (range chunks) | Automatic chunk balancer; moveChunk for manual placement | Per-chunk copy |
| Simple hash (key mod N) | All keys remap on any change to N | ~100% of total data |
Replication factor impact. Each write in Cassandra with RF=3 must be acknowledged by a quorum of 2 replicas before the coordinator returns success. Every shard in a 16-node cluster with RF=3 writes to 3 nodes per partition. Write latency is bounded by the second-fastest replica (quorum), not the fastest. At 500k writes/sec across 16 nodes, you are executing 1.5M actual node writes per second.
📊 Shard Key Selection: How Your Choice Shapes Write Balance, Query Locality, and Resharding Cost
No section in this post matters more than this one. A wrong shard key cannot be fixed without a full data migration. Here are the five properties every shard key must satisfy:
High cardinality. The shard key must have many distinct values. Using gender (M/F) gives you two possible shards — you've partitioned a problem in half, not horizontally scaled. user_id is excellent; country_code is marginal if most of your users are in three countries.
Even distribution. Even with high cardinality, clustered distributions create hot spots. A shard key of created_at is high cardinality (each timestamp is unique) but all inserts go to the latest shard. Uniform distribution requires that the key's values spread across the full range naturally, or that you apply a hash.
Query locality. The shard key should match your dominant query pattern. If 90% of your queries filter by user_id, then user_id is the correct shard key: most queries hit one shard. If 50% filter by user_id and 50% by order_id, you have a harder problem — you may need a global secondary index on the non-shard dimension.
Immutability. If a row's shard key value changes, the row must be moved to a different shard. Most systems treat this as either an expensive operation or an outright forbidden one. Choose a key that will not change for the lifetime of the record.
Celebrity resistance. If 10% of your users generate 80% of your traffic (celebrity accounts, power users), their user_id becomes a hot key regardless of how even the distribution is across all other users. The mitigation is covered in the Hot Spots section.
The following table summarises how common shard key choices perform across these five dimensions:
| Shard Key | Cardinality | Even Distribution | Query Locality | Immutable | Celebrity Risk |
| --- | --- | --- | --- | --- | --- |
| user_id (random or UUID) | High | High | High for user queries | Yes | Medium |
| created_at (timestamp) | High | Low — hot last shard | Low | Yes | Low |
| Auto-increment integer | High | Low with range; High with hash | Medium | Yes | Low |
| country_code | Low | Uneven | High for geo queries | Yes | High (US, CN) |
| email (hashed) | High | High | Low — opaque | Yes | Low |
| Compound (tenant_id, user_id) | Very High | High | High for tenant queries | Yes | Medium |
The compound key pattern — used heavily in Cassandra and Cosmos DB — is often the right answer for multi-tenant systems because it provides both query locality (all data for a tenant on one shard) and enough cardinality to spread across many shards.
🌍 How Instagram, YouTube, and Discord Hit the Sharding Wall
These three companies represent three different failure modes that drove them to fundamentally redesign their data distribution strategies.
Instagram and the celebrity hot partition. Instagram's original data model sharded the media table by user_id. This worked well for the median user — posts and reads spread evenly. But when celebrities joined the platform, their user_id partition became a permanent hot spot. A single Beyoncé post sent millions of read requests to one shard in seconds. Instagram's solution was a two-tier approach: celebrity accounts (defined as users with follower count above a threshold) had their media replicated to a read replica pool, and the routing layer directed celebrity reads to this pool rather than the primary shard. The primary sharding strategy stayed unchanged; only the routing layer added a special case for hot keys — effectively combining hash sharding with directory-based routing for a subset of users.
YouTube and the MySQL single-node limit. YouTube's challenge was not a hot key problem — it was a total write volume problem. By 2008 the video metadata database had outgrown what a single MySQL primary could handle. The response was Vitess, which added a sharding proxy layer in front of MySQL without requiring any change to the application's SQL dialect. The key design insight was that Vitess needed to be SQL-transparent: applications continued to send standard MySQL queries, and Vitess's vindex layer handled routing. This allowed YouTube to shard incrementally — starting with 2 shards and expanding to hundreds — without changing application code each time.
Discord and the Cassandra compaction cliff. Discord ran a Cassandra cluster for message storage sharded by (channel_id, message_id). For most channels this worked well: all messages in a channel lived on one partition, ordered by message_id. The problem appeared at scale when large Discord servers with millions of messages triggered Cassandra's compaction process. Cassandra's LSM-tree storage periodically merges SSTables on disk (compaction). For massive partitions, compaction caused read latency spikes lasting minutes as the node churned through gigabytes of data on a single partition. Discord's eventual fix was to move away from Cassandra entirely for long-term message storage, migrating to ScyllaDB (a Cassandra-compatible engine with better compaction strategies) and eventually to a custom message store. The lesson: even a well-chosen shard key can create operational problems at extreme partition sizes if the storage engine's internal mechanics are not accounted for.
| Company | Problem | Shard Key | Resolution |
| --- | --- | --- | --- |
| Instagram | Celebrity hot partition | user_id | Directory routing override for high-follower accounts |
| YouTube | MySQL write volume limit | N/A (single node) | Vitess proxy sharding layer preserving SQL compatibility |
| Discord | Cassandra compaction at large partition size | (channel_id, message_id) | ScyllaDB migration + custom message store |
⚖️ Cross-Shard Operations, Hot Spots, and the Trade-offs That Horizontal Scale Cannot Eliminate
Once data lives on multiple shards, operations that used to be simple become multi-step distributed protocols — and some of them are permanently more expensive than their single-node equivalents.
Scatter-gather reads. A query without a shard key predicate must broadcast to all shards, wait for all responses, and merge the result set at the coordinator. A SELECT * FROM orders WHERE status = 'pending' against a 16-shard cluster is 16 parallel database queries. Latency is bounded by the slowest shard, not the average. Dashboard-style analytics queries are frequently scatter-gather — a major argument for maintaining a separate analytical replica or OLAP store rather than running analytics against your sharded OLTP database.
Cross-shard joins. When a users table and an orders table are sharded by different keys, a join between them crosses shard boundaries. The correct approach is either to co-locate the data (shard both tables by the same key — called a co-located join in Citus) or to denormalise the joined fields into the child table so no join is needed. Application-layer joins — loading both datasets into memory and joining in code — work for small datasets but fail at scale.
Cross-shard transactions. Atomic transactions across multiple shards require two-phase commit (2PC): the coordinator sends a PREPARE to all participating shards, waits for all OK votes, then sends COMMIT. If the coordinator fails between PREPARE and COMMIT, the shards remain in a locked "in-doubt" state until the coordinator recovers. Most NoSQL databases do not support cross-partition transactions at all, or restrict them to a single partition. Cassandra provides lightweight transactions (compare-and-set) within a single partition only. DynamoDB supports transactions across up to 100 items within a single region only.
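A toy coordinator over in-memory stand-ins for shards (illustrative only, not a production implementation; real 2PC also requires durable logs on coordinator and participants) makes the protocol's shape and its failure window concrete:

```python
class Shard:
    """In-memory stand-in for one shard participating in 2PC."""
    def __init__(self):
        self.data, self.staged = {}, None

    def prepare(self, writes) -> bool:
        self.staged = writes          # stage writes, hold "locks", vote OK
        return True

    def commit(self):
        self.data.update(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(participants) -> bool:
    # Phase 1: PREPARE -- every shard must vote OK or the whole txn aborts.
    if not all(shard.prepare(writes) for shard, writes in participants):
        for shard, _ in participants:
            shard.abort()
        return False
    # Phase 2: COMMIT. A coordinator crash right here leaves every
    # participant "in-doubt", holding locks -- the risk described above.
    for shard, _ in participants:
        shard.commit()
    return True

orders, inventory = Shard(), Shard()
ok = two_phase_commit([(orders, {"order:1": "paid"}),
                       (inventory, {"sku:42": "reserved"})])
print(ok, orders.data, inventory.data)
```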
Hot spots: three root causes and their mitigations. A hot spot is a shard receiving far more traffic than its peers. It degrades the entire system because your cluster's effective throughput is limited by the hottest shard.
- Bad shard key choice. Timestamps and auto-increment integers direct all new writes to one shard. Switch to a hash or compound key.
- Power-law data distribution. Even with a good hash, if 20% of users generate 80% of data, the shards holding those users are hotter. Fix: key salting — append a random suffix to the shard key, then fan-out on reads.
- Celebrity users or viral content. A single high-follower account triggers millions of reads on one partition. Fix: directory-based routing override for known hot keys, combined with DAX or a read cache for reads.
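Key salting from the second bullet can be sketched as follows (illustrative Python; the bucket count, key format, and shard count are arbitrary choices):

```python
import hashlib
import random

SALT_BUCKETS = 10   # how many sub-keys one hot key is spread across

def salted_key(hot_key: str) -> str:
    """Write path: pick a random salt so writes fan out across buckets."""
    return f"{hot_key}#{random.randrange(SALT_BUCKETS)}"

def all_salted_keys(hot_key: str) -> list[str]:
    """Read path: fan out to every salted sub-key and merge the results."""
    return [f"{hot_key}#{i}" for i in range(SALT_BUCKETS)]

def shard_of(key: str, n_shards: int = 16) -> int:
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n_shards

# One hot key now lands on several shards instead of one:
shards = {shard_of(k) for k in all_salted_keys("merchant-42")}
print(len(shards))  # more than 1 -- the hot key's writes are spread out
```

The trade-off is explicit: writes get cheaper on the hot shard, but every read of the hot key becomes a small fan-out.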
| Trade-off Dimension | Range Sharding | Hash Sharding | Consistent Hashing | Directory-Based |
| --- | --- | --- | --- | --- |
| Range query cost | Single shard | Scatter-gather | Scatter-gather | Depends on map |
| Write distribution | Uneven (hot last shard) | Even | Even | Configurable |
| Resharding cost | Low — add range | High — rehash all | Low — ~1/N moves | Low — update map |
| Operational complexity | Low | Low | Medium (vnodes) | High (map HA) |
| Cross-shard transactions | Not solved | Not solved | Not solved | Not solved |
🧭 Selecting the Right Sharding Strategy for Your Access Pattern
The flowchart below walks through the key decision axes. Start at the top with your primary constraint — range query requirement — and follow the branches.
graph TD
Start([Start: Define primary access pattern]) --> Q1{Range queries needed?}
Q1 -->|Yes, accept hot spots| RangeS[Range Sharding]
Q1 -->|Yes, avoid hot spots| CHR[Consistent Hashing plus secondary index]
Q1 -->|No| Q2{Uniform write distribution?}
Q2 -->|Yes| Q3{Cloud-managed preference?}
Q2 -->|No - power-law or celebrity keys| CHR
Q3 -->|Yes| Managed[DynamoDB or Azure Cosmos DB]
Q3 -->|No| Q4{Cross-shard transactions required?}
Q4 -->|Yes| SQLLayer[PostgreSQL plus Citus or MySQL plus Vitess]
Q4 -->|No| NoSQL[Cassandra or MongoDB hashed]
RangeS --> Note1[Pre-split ranges and monitor for hot tails]
CHR --> Note2[Add vnodes for even load distribution]
Read the flowchart from top to bottom. The first branch is the most critical: if your workload requires range queries, you start with range sharding or consistent hashing with a secondary index. If range queries are rare or absent, even distribution takes priority and hash-based strategies win. The managed cloud path (DynamoDB, Cosmos DB) removes operational overhead at the cost of flexibility in routing logic.
🧪 Worked Example: Designing a Sharded Orders Table for an E-Commerce Platform at Scale
This section walks through a concrete sharding design for an orders table supporting 10M daily active users and 200k orders per hour at peak. The goal is to identify the shard key, select a strategy, and reason through the failure modes.
The table structure. An orders table has the fields order_id, user_id, merchant_id, created_at, status, and total_amount. The access patterns are: (1) fetch all orders for a user — highly frequent; (2) fetch all orders for a merchant — moderately frequent; (3) fetch orders by status across all users — rare but needed for ops dashboards; (4) order details by order_id — very frequent.
Step 1: Evaluate candidate shard keys.
| Candidate Key | Cardinality | Even Distribution | Satisfies Pattern 1 | Satisfies Pattern 2 | Satisfies Pattern 4 |
| --- | --- | --- | --- | --- | --- |
| user_id | High | High | Yes — single shard | No — scatter-gather | No — scatter-gather |
| merchant_id | Medium | Medium | No — scatter-gather | Yes — single shard | No — scatter-gather |
| order_id | Very High | High | No — scatter-gather | No — scatter-gather | Yes — single shard |
| created_at | High | Low — hot shard | No | No | No |
No single key satisfies all four patterns. user_id is the right choice for pattern 1, the dominant access pattern — it collocates all of a user's orders and keeps writes spread across users. Patterns 2 and 3 will require scatter-gather unless addressed by a secondary structure, and pattern 4 needs either a lookup table or an order_id format that embeds the user_id so the router can extract the shard key.
Step 2: Address the secondary access patterns. For merchant queries (pattern 2), maintain a global secondary index (GSI) on merchant_id. In DynamoDB this is a first-class feature; in Cassandra it requires a separate table keyed by merchant_id that is updated in the same write path. For status dashboards (pattern 3), maintain a separate analytics replica or materialised view refreshed asynchronously — do not run ops dashboards against the sharded OLTP cluster.
Step 3: Choose the sharding strategy. Since user_id is a UUID (high cardinality, uniform distribution) and range queries on user_id are not required, hash sharding is the correct strategy. If the platform will grow from 4 to 16 shards over the next two years, use consistent hashing with vnodes to limit resharding data movement.
Step 4: Plan for celebrity merchants. If 5% of merchants generate 40% of orders, merchant_id hot spots will affect the GSI shard. Apply write sharding to the merchant index: merchant_id_0 through merchant_id_9, fan-out on reads and merge.
The key outcome: the primary access pattern (user orders) hits one shard; secondary access patterns are served by purpose-built structures rather than cross-shard scatter on the primary table.
🛠️ Citus: Distributed PostgreSQL Without Leaving the SQL Ecosystem
Citus is an open-source PostgreSQL extension (also available as Azure Cosmos DB for PostgreSQL) that adds a horizontal sharding layer directly into the PostgreSQL query engine. Unlike external middleware such as ShardingSphere, Citus runs inside the coordinator PostgreSQL process — the planner is aware of shard locations and generates distributed query plans natively.
How Citus distributes data. When you call create_distributed_table('orders', 'user_id'), Citus creates N logical shards (default: 32), hashes each row's user_id value to a shard slot, and stores each shard as a regular PostgreSQL table on a worker node. The shard count is fixed at distribution time — choosing a higher shard count up front (e.g., 512) gives you more granularity when rebalancing across an expanding worker fleet.
-- Add a worker node to the Citus cluster
SELECT * FROM citus_add_node('worker3.internal', 5432);
-- Distribute the orders table by user_id using hash sharding
SELECT create_distributed_table('orders', 'user_id', shard_count => 128);
-- Replicate the products lookup table to every worker for local joins
SELECT create_reference_table('products');
-- Rebalance shards after adding the new worker — runs online, no downtime
SELECT rebalance_table_shards();
Co-located tables. If orders and order_items are both distributed by user_id, Citus guarantees that rows for the same user_id live on the same worker node. A query that joins these two tables by user_id executes entirely on one worker without cross-node round-trips — this is the co-location optimization that makes Citus competitive for OLTP workloads.
Reference tables. Tables like countries, product_categories, or tax_rates are small and read frequently in joins. Citus replicates reference tables to all workers so every local query can join them without remote calls. The coordinator keeps reference table writes serialized to maintain consistency across all worker copies.
For a deeper walkthrough of Citus internals including distributed transaction support and time-series hypertable integration with TimescaleDB, the Citus documentation at citusdata.com provides comprehensive operational guides.
📊 How the Five Major Databases Compare Across Every Sharding Dimension
| Database | Sharding Strategy | Shard Key Type | Cross-Shard Transactions | Resharding Mechanism | Hot Spot Handling | Managed Option |
| --- | --- | --- | --- | --- | --- | --- |
| PostgreSQL + Citus | Hash (default) or Range | Any column | Yes, via 2PC (limited) | rebalance_table_shards() | Key salting + shard rebalancing | Azure Cosmos DB for PG |
| MySQL + Vitess | Hash (vindex) | Any column | No (application-level) | vtctl resharding workflow | Custom vindex lookups | PlanetScale (Vitess SaaS) |
| Cassandra | Consistent hashing (MurmurHash3) | Partition key | No (single-partition LWT only) | vnode rebalancing on join/leave | vnodes + key salting | AWS Keyspaces, Astra DB |
| DynamoDB | Managed hash (AWS-internal) | Partition key + sort key | Yes (up to 100 items, single region) | Fully automatic, transparent | Write sharding + DAX caching | AWS (fully managed) |
| MongoDB | Hash or Range per collection | Any field | Yes (4.0+, replica set; 4.2+, sharded) | Automatic chunk balancer | Zone sharding + compound keys | MongoDB Atlas |
The standout observation from this table: operational simplicity and flexibility are inversely correlated. DynamoDB offers zero resharding overhead but gives you no control over partition placement. Vitess gives you fine-grained routing control through vindexes but requires a complex multi-component infrastructure and a dedicated resharding workflow. Most teams benefit from starting with managed sharding (DynamoDB, Atlas, Citus on Azure) and moving to self-managed sharding only when cost or routing requirements demand it.
📚 Lessons Learned From Production Sharding Failures
Pick your shard key for read locality, not write convenience. Auto-increment IDs are easy to generate but create a hot-last-shard problem with range sharding and lose ordering with hash sharding. User-facing IDs, tenant IDs, or compound keys almost always serve query patterns better.
Monotonically increasing shard keys and range sharding are an anti-pattern unless you plan for it. Time-series databases like InfluxDB and TimescaleDB solve this by automatically rolling over to a new shard partition as time advances. If you are building on a general-purpose database, you need to pre-create future range partitions before load arrives.
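Pre-creating partitions is simple to automate. A minimal sketch, assuming monthly range partitions (the function names are hypothetical; in production this logic would drive `CREATE TABLE ... PARTITION OF` statements from a scheduled job):

```python
from datetime import date

def next_month(d):
    """Return the first day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def future_partitions(start, months_ahead):
    """Yield (lower, upper) month bounds to pre-create as range
    partitions, so inserts never hit a range that does not exist yet."""
    lower = date(start.year, start.month, 1)
    for _ in range(months_ahead):
        upper = next_month(lower)
        yield (lower, upper)
        lower = upper

# Pre-create three months of partitions ahead of the write load.
bounds = list(future_partitions(date(2024, 11, 15), 3))
```

Running this nightly with a comfortable lead (e.g. three months) means load never arrives at a missing partition.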
Cross-shard scatter is a last resort, not a design pattern. Teams that accept "we'll just scatter-gather" during design end up with dashboards that take 30 seconds to load in production. Design schemas so that your top-5 query patterns each touch at most one or two shards.
Test resharding before you need it. The worst time to learn that Vitess resharding takes 18 hours is during a capacity crisis. Every team operating a sharded database should rehearse a resharding event — including traffic cutover — in a staging environment at least annually.
Consistent hashing with vnodes is the right default for systems that will grow. The operational cost of simple modulo hash sharding (move all data on every resize) is enormous compared to consistent hashing. Unless you have a compelling reason for simplicity, start with consistent hashing.
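The 1/N claim is easy to verify empirically. A minimal sketch of a consistent-hash ring with 64 vnodes per node (MD5 stands in for MurmurHash3 here; node and key names are illustrative):

```python
import hashlib
from bisect import bisect_right

def token(s):
    # Stand-in for MurmurHash3; any stable hash works for the sketch.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=64):
    """Map each node to `vnodes` token positions on the ring."""
    return sorted((token(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

def owner(ring, key):
    """Walk clockwise from the key's token to the next vnode."""
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, token(key)) % len(ring)
    return ring[i][1]

keys = [f"user:{i}" for i in range(10_000)]
before = build_ring(["n1", "n2", "n3", "n4"])
after = build_ring(["n1", "n2", "n3", "n4", "n5"])
moved = sum(owner(before, k) != owner(after, k) for k in keys)
print(f"moved {moved / len(keys):.0%} of keys")  # roughly 1/5, not 100%
```

With modulo hashing (`token(key) % num_nodes`), the same experiment moves roughly 80% of keys when a fifth node is added.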
The shard key cannot be changed after go-live without a full migration. This is the law of sharding. Treat the shard key decision with the same gravity as your primary key choice. If you are unsure, use a compound key — it gives you more flexibility and a higher cardinality ceiling.
📌 TLDR: What Every System Designer Must Remember About Sharding
- Sharding = horizontal partitioning across nodes. Each node owns a subset of the data. You distribute both storage and write throughput, not just reads.
- Range sharding is efficient for range queries but creates hot spots when keys are monotonically increasing. Pre-split ranges and seed shards before go-live.
- Hash sharding eliminates hot spots from sequential keys but destroys range query efficiency. Every range scan becomes a scatter-gather.
- Consistent hashing minimises data movement during resharding to ~1/N of total data. Virtual nodes (vnodes) provide even load distribution across heterogeneous hardware.
- Directory-based sharding offers maximum flexibility for custom key placement but requires a highly available routing service that is itself a critical infrastructure component.
- The shard key is irreversible. Choose it for query locality, high cardinality, even distribution, and immutability. Avoid timestamps, status flags, and low-cardinality columns.
- Hot spots have three root causes: bad key choice, power-law distribution, and celebrity keys. Each has a distinct mitigation — key salting, vnodes, and directory overrides respectively.
- For managed workloads: DynamoDB and Cosmos DB handle resharding transparently. For OLTP on PostgreSQL, Citus provides SQL-compatible horizontal sharding with co-located joins. For MySQL at extreme scale, Vitess is the proven path.
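The key-salting mitigation mentioned above can be sketched in a few lines. Writes for a celebrity key are spread across a fixed number of sub-keys, each hashing to a (likely) different shard; reads then fan out across all salts and merge (the fan-out factor of 8 is an illustrative assumption):

```python
import random

N_SALTS = 8  # fan-out factor for one hot key (illustrative)

def salted_write_key(hot_key):
    """Spread writes for a hot key across N_SALTS sub-keys,
    each of which hashes to a (likely) different shard."""
    return f"{hot_key}#{random.randrange(N_SALTS)}"

def salted_read_keys(hot_key):
    """Reads must fan out to every salt and merge the results."""
    return [f"{hot_key}#{i}" for i in range(N_SALTS)]
```

The trade-off is explicit: salting turns one hot write path into N cool ones, at the cost of an N-way read fan-out for that key — which is why it is reserved for the small set of keys that actually need it.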
📝 Practice Quiz
A team shards their `orders` table by `created_at` timestamp using range sharding. After six months, they notice that all database writes are hitting a single shard. What is the root cause and how should they fix it?
- A) The shard map is corrupted — rebuild it from the replica
- B) Monotonically increasing shard keys direct all new inserts to the latest range shard — switch to a hash or compound shard key
- C) Range sharding always creates this problem and must be replaced with consistent hashing
- D) The timestamp column should be indexed to spread writes across shards
Correct Answer: B
In a consistent hashing cluster with four physical nodes, each owning 64 vnodes, you add a fifth node. Approximately what fraction of the total data must be moved to the new node?
- A) Almost all of it — consistent hashing still requires full rehashing
- B) Exactly 50% — the ring is split in half
- C) Approximately 1/5 (20%) of total data — only the keys near the new node's token positions move
- D) Zero — consistent hashing never moves data
Correct Answer: C
DynamoDB is returning `ProvisionedThroughputExceededException` for a single `user_id` while the rest of the table has plenty of spare capacity. Which strategy most directly addresses this hot partition without requiring a schema migration?
- A) Increase the total table's provisioned throughput by 10x
- B) Switch the table to on-demand billing mode
- C) Add DAX (DynamoDB Accelerator) caching in front of the table to absorb hot-key reads
- D) Create a Global Secondary Index on the `user_id` column
Correct Answer: C
Open-ended: Your team is building a multi-tenant B2B SaaS platform with 5,000 tenant companies. The largest 10 tenants generate 60% of all database traffic. You need cross-tenant reporting queries to run in under 5 seconds, and you must support point-in-time recovery per-tenant. Design a sharding strategy: what is your shard key, what sharding approach do you use, and how do you handle the top-10 hot tenants? Justify each decision.
Correct Answer: No single correct answer. A strong response should: use `tenant_id` as the shard key for co-location of all tenant data; choose consistent hashing to allow new tenant onboarding without full resharding; apply directory-based overrides or dedicated shards for the top-10 tenants to avoid hot spots; use reference tables or a separate read replica for cross-tenant reporting to avoid scatter-gather; and consider per-tenant logical databases or schemas to enable per-tenant PITR using database-level backup tooling.
🔗 Related Posts
- Consistent Hashing Explained: The Ring That Scales
- System Design Sharding Strategy: Shard Keys, Hot Spots, and Resharding Safely
- System Design Replication and Failover: Leader-Follower and Multi-Leader Patterns
- Distributed Transactions: Two-Phase Commit, Sagas, and When to Use Each
- Capacity Estimation Guide: Putting Real Numbers on System Design Decisions

Written by Abstract Algorithms (@abstractalgorithms)