Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared
How PostgreSQL, MySQL, Cassandra, DynamoDB, and MongoDB implement sharding — the trade-offs every system designer must know
TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose — range, hash, consistent hashing, or directory — determines whether range queries stay cheap, writes stay balanced, and adding a new node moves 100% or just 1/N of your data. The shard key is the single most consequential architectural decision you will make, and the wrong one is nearly impossible to fix without a full migration.
📖 When One PostgreSQL Instance Is No Longer Enough
Your database is maxed out. A single PostgreSQL instance handles 50k writes per second. You need 500k. You've already maxed out CPU, RAM, and the fastest NVMe disks money can buy — you cannot scale up further. You must scale out. Sharding is how you do it. But three teams before yours tried to shard this same table and failed. The first chose range sharding and created a permanent hot spot on the "last" shard because all new records got the highest keys. The second chose hash sharding without planning for range queries and ended up scatter-gathering across 16 nodes for every dashboard report. The third picked created_at as the shard key — a timestamp — and every single write landed on the same node for the entire last hour. Here's why they failed, and how to avoid every one of those traps.
Sharding is horizontal partitioning: you split one logical dataset across multiple physical nodes, called shards. Each shard owns only a fraction of the data. A query router — whether a middleware layer, a database driver, or a built-in database mechanism — directs each operation to the right shard based on a shard key.
This is fundamentally different from vertical scaling (getting a bigger machine) and from replication (copying the same data to multiple nodes). With replication, every node holds all the data but reads can be spread. With sharding, no single node holds all the data — you distribute both storage and write throughput.
🔍 The Vocabulary of Sharding: Keys, Maps, and Routing Fundamentals
Before evaluating any sharding strategy, you need a shared vocabulary. Misunderstanding these seven terms — especially the difference between a shard key and a partition key, or between resharding and replication — is how teams make irreversible architectural decisions.
| Term | Definition |
| --- | --- |
| Shard key | The column (or set of columns) used to decide which shard owns a row |
| Shard map | The routing table that maps shard key values to physical shard addresses |
| Resharding | Redistributing data when the number of shards changes — the most expensive operation in any sharded system |
| Hot spot | A shard receiving disproportionately more reads or writes than its peers |
| Cross-shard query | A query whose result set spans data on multiple shards, requiring scatter-gather |
| Scatter-gather | Broadcast a query to all shards, collect partial results, and merge at the coordinator |
| 2PC | Two-phase commit — the protocol needed for atomic transactions that span multiple shards |
The routing path for every database operation flows through three conceptual layers: the client sends the query to a router (vtgate in Vitess, mongos in MongoDB, the Citus coordinator in PostgreSQL), the router consults the shard map to identify the target shard, and the shard processes the query and returns results. In hash and consistent-hash strategies the router computes the shard address algorithmically — no map lookup needed. In directory-based strategies the router must query the shard map service synchronously, which is why caching is critical to that approach's performance.
| One large database | Sharded database |
| --- | --- |
| Simple queries and joins | More routing complexity |
| Easier operational model | Higher scale ceiling |
| One write bottleneck | Writes spread across shards |
| Easy global transactions | Cross-shard work requires 2PC |
⚙️ The Four Sharding Strategies and What Each One Optimizes
Range Sharding: Natural Ordering at the Cost of Write Balance
Range sharding assigns contiguous key intervals to each shard. If your shard key is user_id and you have four shards, you might configure:
- Shard 1 → user_id 0 – 999,999
- Shard 2 → user_id 1,000,000 – 1,999,999
- Shard 3 → user_id 2,000,000 – 2,999,999
- Shard 4 → user_id 3,000,000 – ∞
The diagram below shows how the query router uses the shard key value to route each request directly to the owning shard, with no need to contact the others.
graph LR
R[Query Router] -->|user_id 0-999999| S1[Shard 1]
R -->|user_id 1M-1.99M| S2[Shard 2]
R -->|user_id 2M-2.99M| S3[Shard 3]
R -->|user_id 3M and above| S4[Shard 4]
The router holds a simple lookup table of ranges. Any query with a known user_id lands on exactly one shard. Range queries like WHERE user_id BETWEEN 500000 AND 700000 also resolve to a single shard, making them highly efficient.
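As a sketch of that lookup table, the router can keep a sorted list of range upper bounds and binary-search it (hypothetical Python illustration; shard indices here are zero-based, so the text's "Shard 1" is index 0):

```python
import bisect

# Upper bounds of the first three ranges from the example above;
# any key at or past the last bound falls to the final, open-ended shard.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def route_range(user_id: int) -> int:
    """Return the zero-based shard index owning this user_id."""
    return bisect.bisect_right(RANGE_BOUNDS, user_id)

def shards_for_range(lo: int, hi: int) -> list[int]:
    """A range predicate resolves to a contiguous run of shards -- often one."""
    return list(range(route_range(lo), route_range(hi) + 1))

print(route_range(500_000))                # 0 (the text's "Shard 1")
print(shards_for_range(500_000, 700_000))  # [0] -- a single shard
```

The same structure extends to any number of ranges; the lookup stays O(log N) regardless of shard count.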
The hot spot trap. Range sharding works well for keys that are not correlated with time or write activity. If your shard key is an auto-incrementing ID or a timestamp, every new write lands on the last shard while the earlier shards sit idle. HBase and CockroachDB both use range sharding and both document this problem prominently — CockroachDB calls it "hot ranges" and provides automatic load-based splitting. MongoDB's range sharding produces the same problem unless you pre-split your ranges and seed the shards with data before load arrives.
Best fit: Time-series data with a well-distributed key (not a monotonically increasing one), geographic data split by region, or any dataset where range queries are the primary access pattern.
Hash Sharding: Even Distribution at the Cost of Range Queries
Hash sharding applies a hash function to the shard key and maps the result to a shard using modulo arithmetic:
shard = hash(user_id) mod N
Because good hash functions distribute output uniformly, writes spread evenly across all N shards regardless of the key's original distribution. There are no hot spots from timestamp clustering or sequential IDs.
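A minimal sketch of the routing rule, using MD5 as a stand-in stable hash (Python's built-in hash() is randomized per process, so it cannot be used for routing; real databases use functions like MurmurHash3):

```python
import hashlib
from collections import Counter

N_SHARDS = 4

def route_hash(key: str) -> int:
    # A stable hash: the same key must map to the same shard on every router.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS

# Even strictly sequential IDs spread evenly across the shards:
counts = Counter(route_hash(f"user-{i}") for i in range(100_000))
print(sorted(counts.values()))  # four counts, each near 25,000
```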
The routing diagram is straightforward — every key flows through the hash function, and the result selects a shard:
graph TD
K[Incoming Key] --> H{"hash(key) mod 4"}
H -->|0| S0[Shard 0]
H -->|1| S1[Shard 1]
H -->|2| S2[Shard 2]
H -->|3| S3[Shard 3]
The hash function collapses the key's natural ordering, so a query like WHERE user_id BETWEEN 100 AND 200 cannot go to one shard — the row for user_id = 101 could be on Shard 3 while user_id = 102 is on Shard 0. Every range query becomes a scatter-gather across all shards.
The resharding problem. When you add a fifth shard, hash(key) mod 5 produces different results for nearly every key, meaning almost all existing data must be moved. This is a full migration event — expensive, risky, and typically requires weeks of dual-write migration to execute safely.
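The fraction that moves is easy to measure with synthetic keys (illustrative Python; a key survives the change only when hash(key) mod 4 equals hash(key) mod 5, which holds for roughly one key in five):

```python
import hashlib

def shard_of(key: str, n_shards: int) -> int:
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n_shards

keys = [f"user-{i}" for i in range(100_000)]
moved = sum(shard_of(k, 4) != shard_of(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards going from 4 to 5")
# Roughly 80% move -- the "full migration event" described above.
```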
Best fit: Write-heavy workloads with uniform key distribution where range queries are rare, such as user-session stores, feature flags, or event logs queried by exact ID.
Consistent Hashing: Minimising Data Movement When the Cluster Changes
Consistent hashing solves the resharding problem of simple hash sharding by mapping both nodes and keys onto the same fixed circular hash space — the ring (for example, 0 to 2⁶⁴ − 1). A key belongs to the first node encountered when walking clockwise from the key's hash position.
graph LR
NA((Node A)) --> NB((Node B))
NB --> NC((Node C))
NC --> ND((Node D))
ND -->|wraps to start| NA
K1["Key user-123 hash=100"] -.->|clockwise next node| NB
K2["Key order-456 hash=220"] -.->|clockwise next node| NC
When Node E is inserted between Node B and Node C on the ring, only the keys that were previously assigned to Node C and now fall between B and E need to move. All other keys stay put. In a cluster of N nodes, adding one node moves only ~1/N of the total data — versus nearly 100% for simple modulo hashing.
Virtual nodes (vnodes). A physical node is assigned not one but dozens or hundreds of positions on the ring. This distributes its load across many segments, so when a node is added or removed, the redistribution is spread across all existing nodes rather than burdening only one neighbor. Cassandra historically defaulted to 256 vnodes per physical node (Cassandra 4.0 lowered the default num_tokens to 16). DynamoDB's internal sharding engine uses a similar mechanism, though it is fully transparent to users.
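A toy ring with vnodes (illustrative Python only; MD5 stands in for MurmurHash3, and real systems track replica sets rather than a single owner) shows the ~1/N movement directly:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []            # sorted (token, node) ring positions
        for n in nodes:
            self.add(n)

    def add(self, node):
        # Each physical node claims `vnodes` positions on the ring.
        for v in range(self.vnodes):
            bisect.insort(self._tokens, (h(f"{node}#{v}"), node))

    def owner(self, key):
        # First token clockwise from the key's hash; wrap to the start.
        i = bisect.bisect(self._tokens, (h(key),)) % len(self._tokens)
        return self._tokens[i][1]

keys = [f"order-{i}" for i in range(50_000)]
ring = Ring(["A", "B", "C", "D"])
before = {k: ring.owner(k) for k in keys}
ring.add("E")                        # grow the cluster by one node
moved = sum(before[k] != ring.owner(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # close to 1/5 with 5 nodes
```

Only the keys the new node takes over move; every other key's clockwise-next node is unchanged.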
Best fit: Systems where nodes are added or removed frequently, or where gradual resharding with minimal data movement is a hard operational requirement. Cassandra, Riak, and Amazon DynamoDB's internal layer all use consistent hashing.
Directory-Based Sharding: Maximum Flexibility With a Coordination Cost
Directory-based sharding replaces the deterministic routing function with a lookup table — a shard map service — that records exactly which shard owns each key or key range. There is no formula; the directory can assign any key to any shard for any reason.
| Characteristic | Detail |
| --- | --- |
| Routing | Query the directory service, then forward to the named shard |
| Flexibility | Any key can live on any shard; supports heterogeneous shard sizes |
| Rebalancing | Move a key by updating one directory entry — no data-formula recalculation |
| Cost | The directory is a single point of failure and a query-path bottleneck |
| Mitigation | Aggressive client-side caching of directory entries; ZooKeeper or etcd for HA |
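A sketch of the pattern, with a plain dict standing in for the directory service and a TTL cache on the client to keep lookups off the hot path; every name here is hypothetical:

```python
import time

# Stand-in for the authoritative shard map service (ZooKeeper/etcd-backed
# in production, as noted below). Any key can map to any shard.
DIRECTORY = {"user-1": "shard-a", "user-2": "shard-b", "celebrity-9": "shard-hot"}

class CachingRouter:
    """Client-side router that caches directory entries with a TTL."""
    def __init__(self, ttl_seconds=30.0):
        self.ttl = ttl_seconds
        self.cache = {}              # key -> (shard, cached_at)

    def route(self, key):
        hit = self.cache.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]            # served from cache, no directory call
        shard = DIRECTORY[key]       # synchronous directory lookup
        self.cache[key] = (shard, time.monotonic())
        return shard

router = CachingRouter()
print(router.route("celebrity-9"))   # "shard-hot" -- custom placement
DIRECTORY["user-1"] = "shard-c"      # rebalancing = one map update, no rehash
```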
Vitess (YouTube's MySQL sharding layer) implements a form of directory-based routing through its vtgate query router and vindex lookup tables. Facebook historically used a MySQL shard map stored in ZooKeeper and cached in every application server. The directory makes rebalancing surgical — you can move a single celebrity user's data to a dedicated shard without touching anyone else's routing — but the directory itself must be treated as a critical infrastructure component with high availability.
Best fit: Workloads with known-hot keys that need custom placement, multi-tenant SaaS where tenants vary wildly in size, or systems migrating from one sharding strategy to another where you need per-key routing control during the transition.
🧠 Deep Dive: How SQL and NoSQL Databases Wire Sharding Into Their Engines
Internals: Query Routing, Shard Maps, and Database-Specific Partitioning Models
SQL databases were designed around the assumption that all data lives in one engine on one server. Sharding is an external capability in most SQL systems, not a native one.
PostgreSQL and Citus. PostgreSQL's native PARTITION BY feature creates partitions on the same server — useful for query pruning but not horizontal scaling. True sharding requires the Citus extension (now part of Azure Cosmos DB for PostgreSQL). Citus converts a coordinator PostgreSQL node into a query router and distributes table data across worker nodes using hash or range distribution. A reference_table is a small lookup table replicated to every worker so it can be joined locally without cross-shard round-trips.
-- Distribute the orders table across worker nodes using hash sharding on user_id
SELECT create_distributed_table('orders', 'user_id');
-- Replicate a small countries table to every worker to enable cheap local joins
SELECT create_reference_table('countries');
MySQL and Vitess. Vitess was built inside YouTube when their MySQL cluster could no longer handle video write volume. The architecture separates routing from storage: vtgate is the stateless query router speaking MySQL wire protocol; vttablet is a proxy in front of each MySQL shard handling connection pooling and query rewriting; vtctld manages topology. Routing in Vitess is controlled by vindexes — shard key definitions in the VSchema JSON:
{
"sharded": true,
"vindexes": {
"hash_user_id": { "type": "hash" }
},
"tables": {
"orders": {
"column_vindexes": [
{ "column": "user_id", "name": "hash_user_id" }
]
}
}
}
Cassandra. Partitioning is built into every Cassandra table. The partition key is the shard key. Cassandra hashes it with MurmurHash3 and maps the token to a position on the consistent hash ring. The clustering key defines sort order within a partition. Every query that includes the partition key goes to exactly one node; queries without it scatter to all nodes.
CREATE KEYSPACE social_network
WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
-- user_id is the partition key; created_at is the clustering key (sort order within the partition)
CREATE TABLE posts (
user_id uuid,
created_at timestamp,
content text,
PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
DynamoDB. DynamoDB's sharding is transparent — AWS manages the ring, tokens, and rebalancing automatically. You define a partition key (the shard key) and optionally a sort key (clustering key within a partition). The service splits and merges partitions automatically based on throughput consumption and storage size.
{
"TableName": "Orders",
"KeySchema": [
{ "AttributeName": "user_id", "KeyType": "HASH" },
{ "AttributeName": "order_id", "KeyType": "RANGE" }
],
"BillingMode": "PAY_PER_REQUEST"
}
MongoDB. MongoDB's sharding layer adds mongos (the query router) and config servers (the shard map, storing chunk-to-shard assignments) to a replica set cluster. A chunk is a contiguous range of shard key values. The balancer moves chunks between shards to keep data distribution even. MongoDB's zone sharding pins key ranges to specific shards, enabling geographic data residency — EU users to EU shards, US users to US shards.
sh.enableSharding("social_db");
sh.shardCollection("social_db.users", { user_id: "hashed" });
Performance Analysis: Query Amplification, Scatter-Gather Cost, and Resharding Overhead
Query amplification is the central performance risk of any sharded system. A query that includes the shard key predicate touches exactly one shard — latency is equivalent to a single-node query. A query without a shard key predicate must scatter to all N shards in parallel, wait for the slowest response, and merge results. If 10% of your queries are scatter-gather and you have 16 shards, those 10% of queries consume 16x the database compute resources of a single-shard query.
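The arithmetic behind that claim, as a quick check:

```python
# 16 shards, 10% of queries scatter-gather (the figures from the text).
n_shards = 16
scatter_fraction = 0.10

# Cost per logical query, measured in shard-queries:
cost = (1 - scatter_fraction) * 1 + scatter_fraction * n_shards
print(cost)  # 2.5 -- that 10% makes the average query 2.5x as expensive
```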
Resharding overhead varies dramatically by strategy:
| Database | Resharding Mechanism | Approximate Data Movement |
| --- | --- | --- |
| Cassandra (consistent hash) | vnodes redistribute automatically on bootstrap / decommission | ~1/N of total data |
| DynamoDB | Fully automatic partition split/merge; no operator action | Zero (AWS-managed) |
| PostgreSQL + Citus | select rebalance_table_shards() — copy-based, runs online | Per-shard copy |
| MySQL + Vitess | Multi-step vtctl resharding workflow with traffic cutover | Full key range copy |
| MongoDB (range chunks) | Automatic chunk balancer; moveChunk for manual placement | Per-chunk copy |
| Simple hash (key mod N) | All keys remap on any change to N | ~100% of total data |
Replication factor impact. Each write in Cassandra with RF=3 must be acknowledged by a quorum of 2 replicas before the coordinator returns success. Every shard in a 16-node cluster with RF=3 writes to 3 nodes per partition. Write latency is bounded by the second-fastest replica (quorum), not the fastest. At 500k writes/sec across 16 nodes, you are executing 1.5M actual node writes per second.
📊 Shard Key Selection: How Your Choice Shapes Write Balance, Query Locality, and Resharding Cost
No section in this post matters more than this one. A wrong shard key cannot be fixed without a full data migration. Here are the five properties every shard key must satisfy:
High cardinality. The shard key must have many distinct values. Using gender (M/F) gives you two possible shards — you've partitioned a problem in half, not horizontally scaled. user_id is excellent; country_code is marginal if most of your users are in three countries.
Even distribution. Even with high cardinality, clustered distributions create hot spots. A shard key of created_at is high cardinality (each timestamp is unique) but all inserts go to the latest shard. Uniform distribution requires that the key's values spread across the full range naturally, or that you apply a hash.
Query locality. The shard key should match your dominant query pattern. If 90% of your queries filter by user_id, then user_id is the correct shard key: most queries hit one shard. If 50% filter by user_id and 50% by order_id, you have a harder problem — you may need a global secondary index on the non-shard dimension.
Immutability. If a row's shard key value changes, the row must be moved to a different shard. Most systems treat this as either an expensive operation or an outright forbidden one. Choose a key that will not change for the lifetime of the record.
Celebrity resistance. If 10% of your users generate 80% of your traffic (celebrity accounts, power users), their user_id becomes a hot key regardless of how even the distribution is across all other users. The mitigation is covered in the Hot Spots section.
The following table summarises how common shard key choices perform across these five dimensions:
| Shard Key | Cardinality | Even Distribution | Query Locality | Immutable | Celebrity Risk |
| --- | --- | --- | --- | --- | --- |
| user_id (random or UUID) | High | High | High for user queries | Yes | Medium |
| created_at (timestamp) | High | Low — hot last shard | Low | Yes | Low |
| Auto-increment integer | High | Low with range; High with hash | Medium | Yes | Low |
| country_code | Low | Uneven | High for geo queries | Yes | High (US, CN) |
| email (hashed) | High | High | Low — opaque | Yes | Low |
| Compound (tenant_id, user_id) | Very High | High | High for tenant queries | Yes | Medium |
The compound key pattern — used heavily in Cassandra and Cosmos DB — is often the right answer for multi-tenant systems because it provides both query locality (all data for a tenant on one shard) and enough cardinality to spread across many shards.
🌍 How Instagram, YouTube, and Discord Hit the Sharding Wall
These three companies represent three different failure modes that drove them to fundamentally redesign their data distribution strategies.
Instagram and the celebrity hot partition. Instagram's original data model sharded the media table by user_id. This worked well for the median user — posts and reads spread evenly. But when celebrities joined the platform, their user_id partition became a permanent hot spot. A single Beyoncé post sent millions of read requests to one shard in seconds. Instagram's solution was a two-tier approach: celebrity accounts (defined as users with follower count above a threshold) had their media replicated to a read replica pool, and the routing layer directed celebrity reads to this pool rather than the primary shard. The primary sharding strategy stayed unchanged; only the routing layer added a special case for hot keys — effectively combining hash sharding with directory-based routing for a subset of users.
YouTube and the MySQL single-node limit. YouTube's challenge was not a hot key problem — it was a total write volume problem. By 2008 the video metadata database had outgrown what a single MySQL primary could handle. The response was Vitess, which added a sharding proxy layer in front of MySQL without requiring any change to the application's SQL dialect. The key design insight was that Vitess needed to be SQL-transparent: applications continued to send standard MySQL queries, and Vitess's vindex layer handled routing. This allowed YouTube to shard incrementally — starting with 2 shards and expanding to hundreds — without changing application code each time.
Discord and the Cassandra compaction cliff. Discord ran a Cassandra cluster for message storage sharded by (channel_id, message_id). For most channels this worked well: all messages in a channel lived on one partition, ordered by message_id. The problem appeared at scale when large Discord servers with millions of messages triggered Cassandra's compaction process. Cassandra's LSM-tree storage periodically merges SSTables on disk (compaction). For massive partitions, compaction caused read latency spikes lasting minutes as the node churned through gigabytes of data on a single partition. Discord's eventual fix was to move away from Cassandra entirely for long-term message storage, migrating to ScyllaDB (a Cassandra-compatible engine with better compaction strategies) and eventually to a custom message store. The lesson: even a well-chosen shard key can create operational problems at extreme partition sizes if the storage engine's internal mechanics are not accounted for.
| Company | Problem | Shard Key | Resolution |
| --- | --- | --- | --- |
| Instagram | Celebrity hot partition | user_id | Directory routing override for high-follower accounts |
| YouTube | MySQL write volume limit | N/A (single node) | Vitess proxy sharding layer preserving SQL compatibility |
| Discord | Cassandra compaction at large partition size | (channel_id, message_id) | ScyllaDB migration + custom message store |
⚖️ Cross-Shard Operations, Hot Spots, and the Trade-offs That Horizontal Scale Cannot Eliminate
Once data lives on multiple shards, operations that used to be simple become multi-step distributed protocols — and some of them are permanently more expensive than their single-node equivalents.
Scatter-gather reads. A query without a shard key predicate must broadcast to all shards, wait for all responses, and merge the result set at the coordinator. A SELECT * FROM orders WHERE status = 'pending' against a 16-shard cluster is 16 parallel database queries. Latency is bounded by the slowest shard, not the average. Dashboard-style analytics queries are frequently scatter-gather — a major argument for maintaining a separate analytical replica or OLAP store rather than running analytics against your sharded OLTP database.
Cross-shard joins. When a users table and an orders table are sharded by different keys, a join between them crosses shard boundaries. The correct approach is either to co-locate the data (shard both tables by the same key — called a co-located join in Citus) or to denormalise the joined fields into the child table so no join is needed. Application-layer joins — loading both datasets into memory and joining in code — work for small datasets but fail at scale.
Cross-shard transactions. Atomic transactions across multiple shards require two-phase commit (2PC): the coordinator sends a PREPARE to all participating shards, waits for all OK votes, then sends COMMIT. If the coordinator fails between PREPARE and COMMIT, the shards remain in a locked "in-doubt" state until the coordinator recovers. Most NoSQL databases do not support cross-partition transactions at all, or restrict them to a single partition. Cassandra provides lightweight transactions (compare-and-set) within a single partition only. DynamoDB supports transactions across up to 100 items within a single region only.
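A toy coordinator over in-memory stand-ins for shards (illustrative only, not a production implementation; real 2PC also requires durable logs on coordinator and participants) makes the protocol's shape and its failure window concrete:

```python
class Shard:
    """In-memory stand-in for one shard participating in 2PC."""
    def __init__(self):
        self.data, self.staged = {}, None

    def prepare(self, writes) -> bool:
        self.staged = writes          # stage writes, hold "locks", vote OK
        return True

    def commit(self):
        self.data.update(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(participants) -> bool:
    # Phase 1: PREPARE -- every shard must vote OK or the whole txn aborts.
    if not all(shard.prepare(writes) for shard, writes in participants):
        for shard, _ in participants:
            shard.abort()
        return False
    # Phase 2: COMMIT. A coordinator crash right here leaves every
    # participant "in-doubt", holding locks -- the risk described above.
    for shard, _ in participants:
        shard.commit()
    return True

orders, inventory = Shard(), Shard()
ok = two_phase_commit([(orders, {"order:1": "paid"}),
                       (inventory, {"sku:42": "reserved"})])
print(ok, orders.data, inventory.data)
```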
Hot spots: three root causes and their mitigations. A hot spot is a shard receiving far more traffic than its peers. It degrades the entire system because your cluster's effective throughput is limited by the hottest shard.
- Bad shard key choice. Timestamps and auto-increment integers direct all new writes to one shard. Switch to a hash or compound key.
- Power-law data distribution. Even with a good hash, if 20% of users generate 80% of data, the shards holding those users are hotter. Fix: key salting — append a random suffix to the shard key, then fan-out on reads.
- Celebrity users or viral content. A single high-follower account triggers millions of reads on one partition. Fix: directory-based routing override for known hot keys, combined with DAX or a read cache for reads.
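Key salting from the second bullet can be sketched as follows (illustrative Python; the bucket count, key format, and shard count are arbitrary choices):

```python
import hashlib
import random

SALT_BUCKETS = 10   # how many sub-keys one hot key is spread across

def salted_key(hot_key: str) -> str:
    """Write path: pick a random salt so writes fan out across buckets."""
    return f"{hot_key}#{random.randrange(SALT_BUCKETS)}"

def all_salted_keys(hot_key: str) -> list[str]:
    """Read path: fan out to every salted sub-key and merge the results."""
    return [f"{hot_key}#{i}" for i in range(SALT_BUCKETS)]

def shard_of(key: str, n_shards: int = 16) -> int:
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n_shards

# One hot key now lands on several shards instead of one:
shards = {shard_of(k) for k in all_salted_keys("merchant-42")}
print(len(shards))  # more than 1 -- the hot key's writes are spread out
```

The trade-off is explicit: writes get cheaper on the hot shard, but every read of the hot key becomes a small fan-out.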
| Trade-off Dimension | Range Sharding | Hash Sharding | Consistent Hashing | Directory-Based |
| --- | --- | --- | --- | --- |
| Range query cost | Single shard | Scatter-gather | Scatter-gather | Depends on map |
| Write distribution | Uneven (hot last shard) | Even | Even | Configurable |
| Resharding cost | Low — add range | High — rehash all | Low — ~1/N moves | Low — update map |
| Operational complexity | Low | Low | Medium (vnodes) | High (map HA) |
| Cross-shard transactions | Not solved | Not solved | Not solved | Not solved |
🧭 Selecting the Right Sharding Strategy for Your Access Pattern
The flowchart below walks through the key decision axes. Start at the top with your primary constraint — range query requirement — and follow the branches.
graph TD
Start([Start: Define primary access pattern]) --> Q1{Range queries needed?}
Q1 -->|Yes, accept hot spots| RangeS[Range Sharding]
Q1 -->|Yes, avoid hot spots| CHR[Consistent Hashing plus secondary index]
Q1 -->|No| Q2{Uniform write distribution?}
Q2 -->|Yes| Q3{Cloud-managed preference?}
Q2 -->|No - power-law or celebrity keys| CHR
Q3 -->|Yes| Managed[DynamoDB or Azure Cosmos DB]
Q3 -->|No| Q4{Cross-shard transactions required?}
Q4 -->|Yes| SQLLayer[PostgreSQL plus Citus or MySQL plus Vitess]
Q4 -->|No| NoSQL[Cassandra or MongoDB hashed]
RangeS --> Note1[Pre-split ranges and monitor for hot tails]
CHR --> Note2[Add vnodes for even load distribution]
Read the flowchart from top to bottom. The first branch is the most critical: if your workload requires range queries, you start with range sharding or consistent hashing with a secondary index. If range queries are rare or absent, even distribution takes priority and hash-based strategies win. The managed cloud path (DynamoDB, Cosmos DB) removes operational overhead at the cost of flexibility in routing logic.
🧪 Worked Example: Designing a Sharded Orders Table for an E-Commerce Platform at Scale
This section walks through a concrete sharding design for an orders table supporting 10M daily active users and 200k orders per hour at peak. The goal is to identify the shard key, select a strategy, and reason through the failure modes.
The table structure. An orders table has the fields order_id, user_id, merchant_id, created_at, status, and total_amount. The access patterns are: (1) fetch all orders for a user — highly frequent; (2) fetch all orders for a merchant — moderately frequent; (3) fetch orders by status across all users — rare but needed for ops dashboards; (4) order details by order_id — very frequent.
Step 1: Evaluate candidate shard keys.
| Candidate Key | Cardinality | Even Distribution | Satisfies Pattern 1 | Satisfies Pattern 2 | Satisfies Pattern 4 |
| --- | --- | --- | --- | --- | --- |
| user_id | High | High | Yes — single shard | No — scatter-gather | No — scatter-gather |
| merchant_id | Medium | Medium | No — scatter-gather | Yes — single shard | No — scatter-gather |
| order_id | Very High | High | No — scatter-gather | No — scatter-gather | Yes — single shard |
| created_at | High | Low — hot shard | No | No | No |
No single key satisfies all four patterns. user_id is the right choice for pattern 1, the dominant access pattern — it collocates all of a user's orders and keeps writes spread across users. Patterns 2 and 3 will require scatter-gather unless addressed by a secondary structure, and pattern 4 needs either a lookup table or an order_id format that embeds the user_id so the router can extract the shard key.
Step 2: Address the secondary access patterns. For merchant queries (pattern 2), maintain a global secondary index (GSI) on merchant_id. In DynamoDB this is a first-class feature; in Cassandra it requires a separate table keyed by merchant_id that is updated in the same write path. For status dashboards (pattern 3), maintain a separate analytics replica or materialised view refreshed asynchronously — do not run ops dashboards against the sharded OLTP cluster.
Step 3: Choose the sharding strategy. Since user_id is a UUID (high cardinality, uniform distribution) and range queries on user_id are not required, hash sharding is the correct strategy. If the platform will grow from 4 to 16 shards over the next two years, use consistent hashing with vnodes to limit resharding data movement.
Step 4: Plan for celebrity merchants. If 5% of merchants generate 40% of orders, merchant_id hot spots will affect the GSI shard. Apply write sharding to the merchant index: merchant_id_0 through merchant_id_9, fan-out on reads and merge.
The key outcome: the primary access pattern (user orders) hits one shard; secondary access patterns are served by purpose-built structures rather than cross-shard scatter on the primary table.
🛠️ Citus: Distributed PostgreSQL Without Leaving the SQL Ecosystem
Citus is an open-source PostgreSQL extension (also available as Azure Cosmos DB for PostgreSQL) that adds a horizontal sharding layer directly into the PostgreSQL query engine. Unlike external middleware such as ShardingSphere, Citus runs inside the coordinator PostgreSQL process — the planner is aware of shard locations and generates distributed query plans natively.
How Citus distributes data. When you call create_distributed_table('orders', 'user_id'), Citus creates N logical shards (default: 32), hashes each row's user_id value to a shard slot, and stores each shard as a regular PostgreSQL table on a worker node. The shard count is fixed at distribution time — choosing a higher shard count up front (e.g., 512) gives you more granularity when rebalancing across an expanding worker fleet.
-- Add a worker node to the Citus cluster
SELECT * FROM citus_add_node('worker3.internal', 5432);
-- Distribute the orders table by user_id using hash sharding
SELECT create_distributed_table('orders', 'user_id', shard_count => 128);
-- Replicate the products lookup table to every worker for local joins
SELECT create_reference_table('products');
-- Rebalance shards after adding the new worker — runs online, no downtime
SELECT rebalance_table_shards();
Co-located tables. If orders and order_items are both distributed by user_id, Citus guarantees that rows for the same user_id live on the same worker node. A query that joins these two tables by user_id executes entirely on one worker without cross-node round-trips — this is the co-location optimization that makes Citus competitive for OLTP workloads.
Reference tables. Tables like countries, product_categories, or tax_rates are small and read frequently in joins. Citus replicates reference tables to all workers so every local query can join them without remote calls. The coordinator keeps reference table writes serialized to maintain consistency across all worker copies.
For a deeper walkthrough of Citus internals including distributed transaction support and time-series hypertable integration with TimescaleDB, the Citus documentation at citusdata.com provides comprehensive operational guides.
📊 How the Five Major Databases Compare Across Every Sharding Dimension
| Database | Sharding Strategy | Shard Key Type | Cross-Shard Transactions | Resharding Mechanism | Hot Spot Handling | Managed Option |
| --- | --- | --- | --- | --- | --- | --- |
| PostgreSQL + Citus | Hash (default) or Range | Any column | Yes, via 2PC (limited) | rebalance_table_shards() | Key salting + shard rebalancing | Azure Cosmos DB for PG |
| MySQL + Vitess | Hash (vindex) | Any column | No (application-level) | vtctl resharding workflow | Custom vindex lookups | PlanetScale (Vitess SaaS) |
| Cassandra | Consistent hashing (MurmurHash3) | Partition key | No (single-partition LWT only) | vnode rebalancing on join/leave | vnodes + key salting | AWS Keyspaces, Astra DB |
| DynamoDB | Managed hash (AWS-internal) | Partition key + sort key | Yes (up to 100 items, single region) | Fully automatic, transparent | Write sharding + DAX caching | AWS (fully managed) |
| MongoDB | Hash or Range per collection | Any field | Yes (4.0+, replica set; 4.2+, sharded) | Automatic chunk balancer | Zone sharding + compound keys | MongoDB Atlas |
The standout observation from this table: operational simplicity and flexibility are inversely correlated. DynamoDB offers zero resharding overhead but gives you no control over partition placement. Vitess gives you fine-grained routing control through vindexes but requires a complex multi-component infrastructure and a dedicated resharding workflow. Most teams benefit from starting with managed sharding (DynamoDB, Atlas, Citus on Azure) and moving to self-managed sharding only when cost or routing requirements demand it.
📚 Lessons Learned From Production Sharding Failures
Pick your shard key for read locality, not write convenience. Auto-increment IDs are easy to generate but create a hot-last-shard problem with range sharding and lose ordering with hash sharding. User-facing IDs, tenant IDs, or compound keys almost always serve query patterns better.
Monotonically increasing shard keys and range sharding are an anti-pattern unless you plan for it. Time-series databases like InfluxDB and TimescaleDB solve this by automatically rolling over to a new shard partition as time advances. If you are building on a general-purpose database, you need to pre-create future range partitions before load arrives.
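Pre-creating partitions is simple to automate. A minimal sketch, assuming monthly range partitions (the function names are hypothetical; in production this logic would drive `CREATE TABLE ... PARTITION OF` statements from a scheduled job):

```python
from datetime import date

def next_month(d):
    """Return the first day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def future_partitions(start, months_ahead):
    """Yield (lower, upper) month bounds to pre-create as range
    partitions, so inserts never hit a range that does not exist yet."""
    lower = date(start.year, start.month, 1)
    for _ in range(months_ahead):
        upper = next_month(lower)
        yield (lower, upper)
        lower = upper

# Pre-create three months of partitions ahead of the write load.
bounds = list(future_partitions(date(2024, 11, 15), 3))
```

Running this nightly with a comfortable lead (e.g. three months) means load never arrives at a missing partition.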
Cross-shard scatter is a last resort, not a design pattern. Teams that accept "we'll just scatter-gather" during design end up with dashboards that take 30 seconds to load in production. Design schemas so that your top-5 query patterns each touch at most one or two shards.
Test resharding before you need it. The worst time to learn that Vitess resharding takes 18 hours is during a capacity crisis. Every team operating a sharded database should rehearse a resharding event — including traffic cutover — in a staging environment at least annually.
Consistent hashing with vnodes is the right default for systems that will grow. The operational cost of simple modulo hash sharding (move all data on every resize) is enormous compared to consistent hashing. Unless you have a compelling reason for simplicity, start with consistent hashing.
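The 1/N claim is easy to verify empirically. A minimal sketch of a consistent-hash ring with 64 vnodes per node (MD5 stands in for MurmurHash3 here; node and key names are illustrative):

```python
import hashlib
from bisect import bisect_right

def token(s):
    # Stand-in for MurmurHash3; any stable hash works for the sketch.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=64):
    """Map each node to `vnodes` token positions on the ring."""
    return sorted((token(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

def owner(ring, key):
    """Walk clockwise from the key's token to the next vnode."""
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, token(key)) % len(ring)
    return ring[i][1]

keys = [f"user:{i}" for i in range(10_000)]
before = build_ring(["n1", "n2", "n3", "n4"])
after = build_ring(["n1", "n2", "n3", "n4", "n5"])
moved = sum(owner(before, k) != owner(after, k) for k in keys)
print(f"moved {moved / len(keys):.0%} of keys")  # roughly 1/5, not 100%
```

With modulo hashing (`token(key) % num_nodes`), the same experiment moves roughly 80% of keys when a fifth node is added.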
The shard key cannot be changed after go-live without a full migration. This is the law of sharding. Treat the shard key decision with the same gravity as your primary key choice. If you are unsure, use a compound key — it gives you more flexibility and a higher cardinality ceiling.
📌 TLDR: What Every System Designer Must Remember About Sharding
- Sharding = horizontal partitioning across nodes. Each node owns a subset of the data. You distribute both storage and write throughput, not just reads.
- Range sharding is efficient for range queries but creates hot spots when keys are monotonically increasing. Pre-split ranges and seed shards before go-live.
- Hash sharding eliminates hot spots from sequential keys but destroys range query efficiency. Every range scan becomes a scatter-gather.
- Consistent hashing minimises data movement during resharding to ~1/N of total data. Virtual nodes (vnodes) provide even load distribution across heterogeneous hardware.
- Directory-based sharding offers maximum flexibility for custom key placement but requires a highly available routing service that is itself a critical infrastructure component.
- The shard key is irreversible. Choose it for query locality, high cardinality, even distribution, and immutability. Avoid timestamps, status flags, and low-cardinality columns.
- Hot spots have three root causes: bad key choice, power-law distribution, and celebrity keys. Each has a distinct mitigation — key salting, vnodes, and directory overrides respectively.
- For managed workloads: DynamoDB and Cosmos DB handle resharding transparently. For OLTP on PostgreSQL, Citus provides SQL-compatible horizontal sharding with co-located joins. For MySQL at extreme scale, Vitess is the proven path.
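The key-salting mitigation mentioned above can be sketched in a few lines. Writes for a celebrity key are spread across a fixed number of sub-keys, each hashing to a (likely) different shard; reads then fan out across all salts and merge (the fan-out factor of 8 is an illustrative assumption):

```python
import random

N_SALTS = 8  # fan-out factor for one hot key (illustrative)

def salted_write_key(hot_key):
    """Spread writes for a hot key across N_SALTS sub-keys,
    each of which hashes to a (likely) different shard."""
    return f"{hot_key}#{random.randrange(N_SALTS)}"

def salted_read_keys(hot_key):
    """Reads must fan out to every salt and merge the results."""
    return [f"{hot_key}#{i}" for i in range(N_SALTS)]
```

The trade-off is explicit: salting turns one hot write path into N cool ones, at the cost of an N-way read fan-out for that key — which is why it is reserved for the small set of keys that actually need it.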
📝 Practice Quiz
A team shards their `orders` table by `created_at` timestamp using range sharding. After six months, they notice that all database writes are hitting a single shard. What is the root cause and how should they fix it?
- A) The shard map is corrupted — rebuild it from the replica
- B) Monotonically increasing shard keys direct all new inserts to the latest range shard — switch to a hash or compound shard key
- C) Range sharding always creates this problem and must be replaced with consistent hashing
- D) The timestamp column should be indexed to spread writes across shards
Correct Answer: B
In a consistent hashing cluster with four physical nodes, each owning 64 vnodes, you add a fifth node. Approximately what fraction of the total data must be moved to the new node?
- A) Almost all of it — consistent hashing still requires full rehashing
- B) Exactly 50% — the ring is split in half
- C) Approximately 1/5 (20%) of total data — only the keys near the new node's token positions move
- D) Zero — consistent hashing never moves data
Correct Answer: C
DynamoDB is returning `ProvisionedThroughputExceededException` for a single `user_id` while the rest of the table has plenty of spare capacity. Which strategy most directly addresses this hot partition without requiring a schema migration?
- A) Increase the total table's provisioned throughput by 10x
- B) Switch the table to on-demand billing mode
- C) Add DAX (DynamoDB Accelerator) caching in front of the table to absorb hot-key reads
- D) Create a Global Secondary Index on the `user_id` column
Correct Answer: C
Open-ended: Your team is building a multi-tenant B2B SaaS platform with 5,000 tenant companies. The largest 10 tenants generate 60% of all database traffic. You need cross-tenant reporting queries to run in under 5 seconds, and you must support point-in-time recovery per-tenant. Design a sharding strategy: what is your shard key, what sharding approach do you use, and how do you handle the top-10 hot tenants? Justify each decision.
Correct Answer: No single correct answer. A strong response should: use `tenant_id` as the shard key for co-location of all tenant data; choose consistent hashing to allow new tenant onboarding without full resharding; apply directory-based overrides or dedicated shards for the top-10 tenants to avoid hot spots; use reference tables or a separate read replica for cross-tenant reporting to avoid scatter-gather; and consider per-tenant logical databases or schemas to enable per-tenant PITR using database-level backup tooling.
🔗 Related Posts
- Consistent Hashing Explained: The Ring That Scales
- System Design Sharding Strategy: Shard Keys, Hot Spots, and Resharding Safely
- System Design Replication and Failover: Leader-Follower and Multi-Leader Patterns
- Distributed Transactions: Two-Phase Commit, Sagas, and When to Use Each
- Capacity Estimation Guide: Putting Real Numbers on System Design Decisions

Written by Abstract Algorithms (@abstractalgorithms)