All Posts

Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning

How PostgreSQL, Cassandra, DynamoDB, and MongoDB split data — and how to choose the right partition strategy for your system

Abstract AlgorithmsAbstract Algorithms
··37 min read
Cover Image for Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning
📚

Intermediate

For developers with some experience. Builds on fundamentals.

Estimated read time: 37 min

AI-assisted content.

TLDR: Partitioning splits one logical table into smaller physical pieces called partitions. The database planner skips irrelevant partitions entirely — turning a 30-second full-table scan into a 200ms single-partition read. Range partitioning is best for time-series data. Hash partitioning prevents write hotspots. List partitioning supports geo-compliance. The wrong partition key eliminates all these gains. Partitioning is not sharding: if your app connects to one database endpoint, you are partitioning.

📖 The 2-Billion-Row Problem — Why Indexes Stop Being Enough

Your orders table has 2 billion rows. Every query scans gigabytes of data even with indexes. Partitioning can cut query time from 30 seconds to 200 milliseconds — by making the database ignore 95% of the data before it even looks at an index.

An index on a 2-billion-row table is still a 2-billion-entry B-Tree. Navigating that tree takes time, and the leaf pages are scattered across the entire 400 GB tablespace. Partitioning changes the problem fundamentally: instead of one 400 GB table, you have 48 separate 8 GB tables — one per month over four years. A query for last month's orders reads only 8 GB. The other 392 GB are never opened.

This is the core promise of partitioning. It is why every major database — PostgreSQL, MySQL, Cassandra, DynamoDB — builds it into the data model.

Partitioning vs. Sharding — The Distinction That Matters in Every System Design Interview

These two terms are used interchangeably by engineers and interviewers. They are not the same concept, and conflating them signals an incomplete understanding.

DimensionPartitioningSharding
Data locationSame database node, or transparent coordinator-managed distributionIndependent database nodes, each a fully separate database
Application connectionSingle endpoint; application sees one tableShard-aware routing layer; app knows about multiple databases
Cross-partition joinsSupported — planner merges resultsExpensive scatter-gather or not supported
Transaction guaranteesFull ACID within the nodeCross-shard: 2-Phase Commit or eventual consistency only
Primary scaling targetQuery performance, manageability, archivalWrite throughput, storage capacity beyond a single node
Classic examplesPostgreSQL PARTITION BY, MySQL partitionsCassandra, DynamoDB, Vitess, Citus

Rule of thumb: If your application connects to one connection string, you are partitioning. If it connects to multiple independent databases, you are sharding.

The blur zone: Cassandra and DynamoDB use the word "partition" but their partitions are physically distributed across nodes. They are partitioning and sharding simultaneously. In an interview, acknowledge the overlap: "Cassandra calls it a partition key, but the mechanism is consistent-hashing-based sharding because each partition lives on a different node."


🔍 A Map of Every Partitioning Strategy — What Each One Splits and When to Use It

Before going deep, here is the orientation table:

Partition TypeSplits ByBest ForSupported In
Horizontal (row-based)Row subsets by a partition key valueTime-series, large operational tablesPostgreSQL, MySQL, Cassandra, DynamoDB
Vertical (column-based)Column subsets into separate tablesWide tables with hot/cold access patternsManual table design; not a native DB feature
RangeOrdered value ranges (dates, IDs)Time-series archival, rolling-window queriesPostgreSQL, MySQL, CockroachDB, Spanner
ListExplicit enumerated valuesGeographic regions, categorical columnsPostgreSQL, MySQL
Hashhash(key) % NEven I/O distribution, preventing write hotspotsPostgreSQL, MySQL
Composite (sub)Two levels: range then hash or listVery large tables needing both pruning and balancePostgreSQL, MySQL

Horizontal vs. vertical — the conceptual split: Horizontal partitioning divides rows. Vertical partitioning divides columns. The diagram below shows both at once: the orders table is split into monthly row-subsets on the left; the users table is split into hot-path and cold-path column subsets on the right. These two strategies solve completely different problems and are often combined in the same data model.

graph TD
    subgraph HP[Horizontal Partitioning - Row Subsets]
        T1[Orders Table - 2B rows] --> P1[Jan 2024 - 200M rows]
        T1 --> P2[Feb 2024 - 195M rows]
        T1 --> P3[Mar 2024 - 210M rows]
        T1 --> P4[Apr 2024 - 205M rows]
    end
    subgraph VP[Vertical Partitioning - Column Subsets]
        T2[Users Table - 50 columns] --> C1[user_core - id, email, name, created_at]
        T2 --> C2[user_profile - bio, avatar, preferences, settings]
    end

In the horizontal case, every partition holds identical columns but a different month of data. The January 2024 partition is the only one opened when you query WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01'. In the vertical case, user_core is read on every authentication request; user_profile is loaded only on profile pages. Separating them means the hot authentication path never reads the cold profile columns, reducing page I/O and improving buffer pool efficiency.


⚙️ How Range, List, Hash, and Composite Partitioning Actually Work

Range Partitioning: Month-by-Month Tables for Time-Series Data

Range partitioning assigns each row to a partition based on whether the partition key value falls within a defined range. The classic use case is a time-series table partitioned by month.

PostgreSQL DDL — range-partitioned orders table:

CREATE TABLE orders (
    id          BIGSERIAL,
    customer_id BIGINT NOT NULL,
    total_cents BIGINT NOT NULL,
    status      TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01
    PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_2024_02
    PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Local index: each partition gets its own index
CREATE INDEX ON orders_2024_01 (customer_id);
CREATE INDEX ON orders_2024_02 (customer_id);

MySQL DDL — same table with MySQL syntax:

CREATE TABLE orders (
    id          BIGINT AUTO_INCREMENT,
    customer_id BIGINT NOT NULL,
    total_cents BIGINT NOT NULL,
    status      VARCHAR(50) NOT NULL,
    created_at  DATE NOT NULL,
    PRIMARY KEY (id, created_at)   -- MySQL requires partition key in all unique keys
) PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p202401 VALUES LESS THAN ('2024-02-01'),
    PARTITION p202402 VALUES LESS THAN ('2024-03-01'),
    PARTITION p202403 VALUES LESS THAN ('2024-04-01'),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
);

Pros: Partition pruning on date predicates. Free archival — DROP TABLE orders_2021 removes a year of data in milliseconds instead of running a multi-hour DELETE. Natural fit for compliance windows.

Cons: Write hotspot — all current inserts go to the latest partition. Must pre-create partitions proactively (no matching partition → error). Skewed partition sizes if insert volume is uneven across time.

List Partitioning: Categorical Splits for Geo-Compliance

List partitioning assigns rows based on membership in an explicit set of values. The partition boundary becomes a compliance boundary when the values are geographic identifiers.

-- PostgreSQL list partitioning by region
CREATE TABLE orders (
    id          BIGSERIAL,
    customer_id BIGINT NOT NULL,
    region      TEXT NOT NULL,
    total_cents BIGINT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL
) PARTITION BY LIST (region);

CREATE TABLE orders_us
    PARTITION OF orders FOR VALUES IN ('US', 'CA', 'MX');

CREATE TABLE orders_eu
    PARTITION OF orders FOR VALUES IN ('DE', 'FR', 'GB', 'NL', 'ES', 'IT');

CREATE TABLE orders_apac
    PARTITION OF orders FOR VALUES IN ('JP', 'SG', 'AU', 'IN', 'KR');

-- Catch unknown regions
CREATE TABLE orders_default PARTITION OF orders DEFAULT;

Pros: Clean pruning on equality and membership predicates. Natural data locality — EU data physically lives in EU partition, enabling storage-tier pinning for GDPR compliance.

Cons: All partition values must be known at design time. Adding a new region requires ALTER TABLE orders ADD PARTITION — a DDL operation. Rows with unrecognized values go to DEFAULT; forgetting a DEFAULT partition causes insert errors.

Hash Partitioning: Even Write Distribution Without Range Pruning

Hash partitioning applies hash(partition_key) % N to distribute rows uniformly across N partitions. It eliminates write hotspots because inserts land on different partitions regardless of insertion time.

-- PostgreSQL hash partitioning by customer_id (4 partitions)
CREATE TABLE events (
    id          BIGSERIAL,
    customer_id BIGINT NOT NULL,
    event_type  TEXT NOT NULL,
    payload     JSONB,
    created_at  TIMESTAMPTZ NOT NULL
) PARTITION BY HASH (customer_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Trade-off: Point lookups on customer_id prune to exactly one partition. Range queries (WHERE created_at > '2024-01-01') cannot prune — they must scan all four partitions. Use hash partitioning when point-lookup access patterns dominate and write throughput matters more than range scan efficiency.

Composite Partitioning: Year × Region for Tables That Outgrow Single-Level Strategies

Composite (sub) partitioning applies two levels. A common pattern: partition by year (range) first, then sub-partition each year by region (list) or by customer hash.

-- PostgreSQL composite: range by month, then hash by customer_id within each month
CREATE TABLE orders (
    id          BIGSERIAL,
    customer_id BIGINT NOT NULL,
    total_cents BIGINT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01
    PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
    PARTITION BY HASH (customer_id);

CREATE TABLE orders_2024_01_p0
    PARTITION OF orders_2024_01 FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE orders_2024_01_p1
    PARTITION OF orders_2024_01 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... repeat remainder 2, 3 and each subsequent month

With this structure, a query for WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01' AND customer_id = 42 prunes to the single correct sub-partition — one of four January 2024 shards. At 100 billion rows across 48 months × 4 hash buckets = 192 partitions of roughly 500 million rows each — manageable for maintenance without the overhead of thousands of partitions.

Vertical Partitioning: Splitting Hot Columns From Cold Ones

Vertical partitioning is not a database DDL feature — it is a table design decision. You manually split a wide table into two tables linked by primary key:

-- Before: one 50-column table (hits on every login)
-- After: hot-path table + cold-path table

CREATE TABLE user_core (
    id         BIGINT PRIMARY KEY,
    email      TEXT NOT NULL,
    name       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE user_profile (
    user_id              BIGINT PRIMARY KEY REFERENCES user_core(id),
    bio                  TEXT,
    avatar_url           TEXT,
    theme_preference     JSONB,
    notification_settings JSONB
);

The authentication path reads only user_core — 4 narrow columns. PostgreSQL stores rows in 8 KB pages. Narrower rows mean more rows per page, higher buffer pool efficiency, fewer I/O operations per query. The user_profile table is loaded only on profile page views, keeping it out of the hot buffer pool entirely.

Real-world application: E-commerce product catalogs often have 150–200 columns. Most search queries need 10 of them (title, price, category, thumbnail URL). Vertically partitioning the remaining 140 columns into a product_details table reduces hot-path I/O by 80%.


🧠 Deep Dive: How Partition Pruning Works Inside the Query Optimizer

Partition pruning is the mechanism that makes all the DDL above valuable. Without it, partitioning is extra metadata overhead with zero performance gain.

Internals of the Partition Pruning Optimizer

The PostgreSQL query optimizer stores partition bounds in the system catalog (pg_class, pg_inherits, and partition constraint tables). When a query arrives, the planner enters a phase called constraint exclusion (pre-v10) or native partition pruning (v10+). Native partition pruning reads partition metadata directly rather than re-evaluating CHECK constraints — it is faster and works with bind parameters.

Static pruning happens at parse time when the partition key predicate contains a literal:

-- Static pruning: planner knows at parse time that only Jan 2024 partition qualifies
SELECT * FROM orders WHERE created_at = '2024-01-15';

Dynamic pruning happens at execution time when the predicate value comes from a runtime parameter:

-- Dynamic pruning: planner cannot determine the partition until the parameter is bound
SELECT * FROM orders WHERE created_at = $1;

PostgreSQL v12+ supports dynamic partition pruning for parameterized queries and nested loop joins. MySQL supports static pruning for RANGE and LIST partitions but has limited dynamic pruning support.

The pruning decision inside the planner: For each partition, the planner evaluates whether the partition's bounds can possibly overlap with the query's predicate. For range partitions, this is a simple interval intersection check. For list partitions, it is a set membership check. For hash partitions, the planner computes hash(literal_value) % N to identify the exact target partition.

To observe pruning in action, run EXPLAIN (ANALYZE, VERBOSE) in PostgreSQL. Look for Partitions excluded: N in the output. In a 48-partition table with a query scoped to one month, you should see Partitions excluded: 47.

Performance Analysis: When Partitioning Helps and When It Hurts

The 10× rule: Partition pruning delivers dramatic speedups when a query touches ≤10% of partitions. A query scoped to one month on a 4-year table touches 1/48 of the data — a 48× reduction in I/O before even consulting an index. Combine this with a local index on the partition and a query that was a 30-second full-scan becomes a 200ms index seek.

The pruning killer — the most expensive partitioning mistake in production:

A query like SELECT * FROM orders WHERE customer_id = 42 on a range-partitioned table (partitioned by created_at) gets zero pruning benefit. The planner opens every partition because the partition key (created_at) is not in the predicate. This is slower than an equivalent unpartitioned table — you now have 48 table opens, 48 index scans, and a merge step instead of one.

The fix: include the partition key in critical queries, or redesign the partition strategy. If 90% of queries filter by customer_id, the table should be hash-partitioned by customer_id, not range-partitioned by created_at.

Planner overhead for large partition sets: PostgreSQL adds per-partition overhead at planning time. With 50–100 partitions, this overhead is negligible. With 5,000 partitions, planning time can increase by hundreds of milliseconds, degrading OLTP query latency even when pruning eliminates almost all partitions. The practical ceiling: keep total partition count below 500 for PostgreSQL. Above that, use composite partitioning to reduce the top-level count.

Write throughput characteristics by strategy:

StrategyWrite distributionWrite hotspot risk
Range by dateAll current inserts → latest partitionHigh for recent-data tables
List by regionInserts → matching region partitionMedium — depends on region traffic distribution
Hash by IDInserts spread evenly across N partitionsVery low
Composite (range + hash)Latest range → spread across hash bucketsLow — hash eliminates hotspot within each range

📊 Visualizing How the Database Skips 95% of Rows Before Looking at an Index

The diagram below traces the full decision path of a query optimizer on a range-partitioned table with data spanning 2021–2024. The query asks for January 2024 data. Follow the path from left to right to see exactly where the pruning decision happens and which branches are eliminated.

flowchart TD
    Q[Query: WHERE created_at between Jan 2024 and Feb 2024] --> OPT[Query Optimizer - partition pruning phase]
    OPT --> CHECK{Partition key in WHERE clause?}
    CHECK -->|yes - pruning active| BOUNDS[Evaluate each partition bound]
    CHECK -->|no - full scan| FULL[Scan all partitions - avoid this]
    BOUNDS --> S1[Skip Partition 2021 - bounds outside predicate range]
    BOUNDS --> S2[Skip Partition 2022 - bounds outside predicate range]
    BOUNDS --> S3[Skip Partition 2023 - bounds outside predicate range]
    BOUNDS --> SCAN[Scan Partition Jan 2024 - bounds match predicate]
    S1 --> RES[Return query results]
    S2 --> RES
    S3 --> RES
    SCAN --> RES
    FULL --> RES

The critical fork is the diamond node: is the partition key present in the WHERE clause? When it is, the planner eliminates three partitions before opening a single data file. When it is not, every partition is opened — making a partitioned table perform worse than a single-table scan because of the extra overhead of 48 table opens and result merges. The diagram captures the single most important rule in partition design: the partition key must appear in your most frequent query predicates, or you gain nothing.


🌍 Real-World Applications: How Stripe, Discord, and Airbnb Use Partitioning

Stripe — time-series ledger: The payments platform partitions its transaction ledger by settlement date. Every report for "last month's transactions" touches exactly one partition. Dropping partitions older than the regulatory retention window is a single DDL statement executed in milliseconds — no multi-hour DELETE with vacuuming required.

Discord — Cassandra message history: Discord's message storage uses Cassandra with channel_id as the partition key and a time-based message_id as the clustering key. Every "load last 50 messages in this channel" request hits exactly one Cassandra partition on one node — one disk read sequence, sub-millisecond response. At billions of messages across millions of channels, this works because the data model and query pattern are perfectly aligned: every query supplies the partition key.

Airbnb — geographic search: The search listings table is partitioned by geographic region. A search for "Paris apartments" queries only the EU partition. With 7 million listings worldwide, this pruning reduces the scan from 7 million rows to roughly 800,000 — a 9× reduction before any index is consulted.

Multi-tenant SaaS — query isolation: Partitioning by tenant_id using hash partitioning prevents one tenant's bulk export from causing a full-table scan that degrades query performance for every other tenant sharing the table. Each tenant's data occupies a separate physical partition. The tenant's bulk query scans only that partition.

PostgreSQL + CockroachDB — data residency: CockroachDB's zone configs pin partitions to specific geographic regions. A LIST partition for EU country codes can be assigned a zone constraint requiring data to reside on EU-region nodes. This enforces GDPR compliance at the database layer rather than the application layer:

-- CockroachDB: pin EU partition to EU nodes
ALTER TABLE orders PARTITION BY LIST (country) (
    PARTITION eu VALUES IN ('DE', 'FR', 'GB', 'NL'),
    PARTITION us VALUES IN ('US', 'CA')
);

ALTER PARTITION eu OF TABLE orders CONFIGURE ZONE USING
    constraints = '[+region=eu-west]';

🗄️ NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data

When the post title says "SQL and NoSQL," the NoSQL side of that equation deserves the same depth as the PostgreSQL sections above. The fundamental difference is this: every major NoSQL database partitions data, but none of them gives you a DDL statement to declare it. Instead, partitioning is an emergent consequence of your data model choices — specifically, your choice of partition key. Get it wrong and you will not receive a DDL error at deployment; you will receive a production incident at 2 AM when a partition breaches its size limit or a write hot spot causes cascading coordinator timeouts across the cluster.

Cassandra: The Partition Key That Controls Everything From Node Assignment to Query Routing

The intuition first. Cassandra is, at its architectural core, a distributed key-value store. Every row belongs to a partition, and every partition is owned by exactly one primary node (with additional replicas for fault tolerance). When a write arrives, Cassandra faces one question: which node in the cluster is responsible for this data? The answer is encoded entirely in the partition key — the first component of the PRIMARY KEY definition. Choose the partition key well, and every query in your system hits exactly one node with one sequential disk read. Choose it poorly, and every query fans out to every node in the cluster.

Partition key vs. clustering key — what each controls. In Cassandra's PRIMARY KEY ((user_id), event_time), the double-parenthesised component user_id is the partition key. It determines which node stores the data. The event_time after the comma is the clustering key. It determines the sort order of rows within the partition on disk — enabling efficient range scans within a single partition. The partition key governs distribution across the cluster. The clustering key governs retrieval efficiency within a single node. Both are chosen at table creation time. Changing them later requires rebuilding the table entirely — there is no ALTER TABLE CHANGE PARTITION KEY in Cassandra.

The consistent hashing ring. Cassandra assigns each node ownership of a range of token values on a ring spanning the full 64-bit integer space. When a write arrives, Cassandra applies the Murmur3 hash function to the partition key value, producing a token. The ring lookup finds the node whose token range contains that token — that node becomes the primary replica. Virtual nodes (vnodes) improve the model by assigning each physical node 256 non-contiguous token ranges instead of one large arc. Adding a new node redistributes data from many small segments scattered around the ring rather than one large contiguous arc — making cluster expansion far smoother. A replication factor of 3 means the write is forwarded to the next two nodes clockwise on the ring. DynamoDB's internal partition map plays an analogous role to this ring.

The diagram below traces the full write path: from the application's write request to Murmur3 hashing, ring token lookup, primary node assignment, and replication acknowledgement. Follow it left to right to see exactly where the routing decision is made and how replication piggybacks on the same mechanism.

flowchart TD
    WRITE[Write Request - partition key: channel_id = gaming-general] --> HASH[Murmur3 hash applied to partition key]
    HASH --> TOKEN[Token value computed - maps to a position on the ring]
    TOKEN --> RING[Consistent hash ring lookup - find node that owns this token range]
    RING --> NODE[Primary node identified - Node B owns this token range]
    NODE --> COORD[Coordinator routes write to Node B]
    COORD --> REP[Replication factor = 3 - write forwarded to next 2 clockwise nodes]
    REP --> R1[Node B - primary replica - write confirmed]
    REP --> R2[Node C - replica 2 - write confirmed]
    REP --> R3[Node D - replica 3 - write confirmed]
    R1 --> ACK[Quorum satisfied - 2 of 3 replicas acknowledged]
    R2 --> ACK
    R3 --> ACK

The coordinator receives the write, applies Murmur3 to the partition key, consults the ring metadata to identify Node B as the primary, and fans the write to Node C and Node D simultaneously. The application receives acknowledgement as soon as any two nodes confirm — this is the QUORUM consistency level. The ring lookup itself takes microseconds; the bottleneck is the replication network round trip across data centre racks.

Partition sizing limits — the ~100 MB production hard wall. Cassandra stores all rows belonging to a partition together on disk in a single SSTable segment. No enforced byte limit exists at the protocol level, but production operations across large-scale Cassandra clusters consistently show that partitions above ~100 MB begin causing compaction stalls, garbage collection pressure on the responsible node, and read latency spikes that bleed over to unrelated partitions hosted on the same node. A 1 GB Cassandra partition does not simply slow its own reads — it triggers GC pauses during compaction that can freeze query servicing on that node for hundreds of milliseconds, degrading every other partition routed through it during that window. This is why Cassandra partition sizing is treated as a production SLA with an alert threshold, not a vague guideline.

The wide partition anti-pattern. This occurs when a partition key has too little cardinality relative to the data volume written under it. A messaging application that uses channel_id as the sole partition key will, over time, accumulate years of messages in a single partition for any active channel. The partition grows without bound. The correct design introduces a time-bucketing component into the partition key: PRIMARY KEY ((channel_id, week_bucket), message_id), where week_bucket is derived from the message timestamp (for example, YEARWEEK(created_at)). Each week of messages for a channel becomes its own partition — capping per-partition size while preserving the ability to read the current week's messages from a single node with a single sequential scan.


DynamoDB: Partition Keys, Internal Splitting, and the Hot Partition Problem AWS Doesn't Advertise

The intuition first. DynamoDB is a fully managed key-value and document store where AWS owns the entire partitioning layer. You do not configure a partition count, select a hash function, or manage a ring. You choose a partition key — called a hash key — and DynamoDB routes every read and write automatically. This abstraction eliminates operational overhead under normal conditions, but it also means that when the partitioning layer misbehaves, the failure surface is invisible. The two failure modes every DynamoDB user eventually encounters are automatic partition splitting under write pressure and hot partition throttling under skewed access. Both trace back to the partition key choice.

Partition key and sort key model. DynamoDB's primary key is either a simple partition key alone (user_id) or a composite key of partition key plus sort key (user_id + order_date). The partition key determines which physical partition holds the item — the same role as Cassandra's partition key. The sort key determines the ordering of items within a partition on disk — the same role as Cassandra's clustering key. Items sharing the same partition key but different sort keys are co-located on the same physical partition and retrieved in sorted key order via a Query operation. A Query supplying only the partition key fetches all items in that partition efficiently. A Scan reads every item across every partition — an operation whose cost grows linearly with table size and is billed accordingly.

How AWS internally splits partitions. DynamoDB allocates each partition an initial capacity ceiling of approximately 10 GB of storage along with a proportional share of the table's provisioned throughput (read capacity units and write capacity units divided evenly across all partitions). When a partition approaches 10 GB, DynamoDB automatically splits it into two halves, dividing both the data range and the throughput allocation equally between the two new partitions. This split is entirely transparent to the application — no connection string changes, no downtime. The consequence that surfaces in production: after a split, each half inherits half the throughput allocation. If subsequent access remains concentrated on one half, that half may exhaust its share of throughput and trigger throttling, even though the overall table has unused capacity on the other partitions.

The hot partition problem — symptoms and the write sharding fix. A hot partition exists when a single partition key receives a disproportionately large share of all reads or writes relative to every other partition key in the table. The symptom is ProvisionedThroughputExceededException errors appearing on one or two partition keys while the rest of the table sits idle. The diagnosis requires CloudWatch partition-level metrics: if total table throughput is below the provisioned limit but errors are concentrated on a small set of keys, the problem is structural — not a capacity shortage. Increasing provisioned capacity distributes the additional throughput evenly across all partitions, not preferentially to the hot one. More capacity does not fix a hot partition.

The fix is write sharding: instead of writing all items for a given entity under user_id = 42, distribute them across ten logical sub-partitions by appending a random suffix — user_id#0 through user_id#9. Every write picks a suffix at random. On read, fire ten parallel Query calls across all ten suffixed keys and merge the results in the application layer. This spreads the write load across ten physical DynamoDB partitions, eliminating the hot spot. The cost is explicit: reads now require ten parallel queries instead of one, and the application must merge and sort the combined result set. This is an architectural trade-off that permanently changes how the access layer is written — not a transparent fix.

Adaptive capacity. Since 2019, DynamoDB has shipped adaptive capacity — an automated mechanism that redistributes unused throughput from cold partitions to hot ones in near real-time, smoothing over transient traffic spikes without triggering throttling. Adaptive capacity is most effective for short-duration spikes such as flash sales or trending content bursts. It does not fix a partition key that is structurally imbalanced (for example, a celebrity user account responsible for 70% of all writes, permanently). Adaptive capacity operates within the constraints of the existing partition layout — it cannot redistribute capacity beyond what the physical partition map permits.

Single-table design tradeoffs. DynamoDB's single-table design pattern co-locates multiple entity types — users, orders, products — within one table, differentiated by a naming convention on the partition key and sort key (for example, PK: USER#42, SK: ORDER#2024-01-15). The benefit is that related entities share a partition and can be retrieved in a single Query without cross-table joins — delivering consistent single-digit millisecond response times regardless of data volume. The trade-off is rigidity: the access pattern is permanently encoded in the key schema. Introducing a new access pattern after the table is in production — such as "find all orders by email address" when the table was designed around user_id — requires either a Global Secondary Index (a separately managed partition projection billed independently) or a full table schema redesign with data migration.

The diagram below shows the DynamoDB write path including the internal partition split decision and the adaptive capacity pathway. The diamond at the centre — "Is this partition key disproportionately hot?" — is the decision that separates well-designed DynamoDB tables from ones that fail quietly under production load.

flowchart TD
    REQ[Write Request - PK: user_id = 42, SK: order_2024-01-15] --> HASH[Internal hash function applied to partition key]
    HASH --> MAP[Partition map lookup - identifies physical partition P3]
    MAP --> SIZECHECK{Partition P3 approaching 10 GB storage limit?}
    SIZECHECK -->|No - capacity available| HOTCHECK{Partition key disproportionately hot?}
    SIZECHECK -->|Yes - limit reached| SPLIT[Automatic split - P3 becomes P3a and P3b]
    SPLIT --> ALLOC[RCU and WCU divided equally between P3a and P3b]
    ALLOC --> HOTCHECK
    HOTCHECK -->|No - traffic is balanced| SUCCESS[Write succeeds - response returned]
    HOTCHECK -->|Yes - throttling risk| ADAPT[Adaptive capacity shifts unused RCU/WCU from cold partitions]
    ADAPT --> SUCCESS

The split path shows that after partition P3 divides, each half inherits half the throughput allocation. If access concentrates on one half, adaptive capacity temporarily bridges the gap — but structural hot spots require the write sharding fix at the application layer.


MongoDB: Shard Keys, Chunk Balancing, and the Jumbo Chunk Trap

The intuition first. MongoDB's distributed partitioning model is called sharding, and it operates at the chunk level rather than the individual document level. A chunk is a contiguous range or hash bucket of documents determined by the shard key field. MongoDB divides collections into chunks and distributes them across shard nodes. A mongos router sits between the application and the shards, forwarding each query to the correct chunk based on metadata held in the config server. When chunks grow beyond the configured maximum size, a background balancer splits them and optionally migrates chunks to less-loaded shards. The goal is identical to Cassandra and DynamoDB: even data distribution and even query load. The failure mode is also identical: a shard key with insufficient cardinality produces chunks that grow without bound and resist splitting.

Shard key selection — hashed vs. ranged. MongoDB supports two shard key strategies. A ranged shard key divides the collection by ordered value ranges of the shard key field — conceptually identical to PostgreSQL range partitioning, but physically distributed across multiple nodes. Ranged shard keys enable efficient range queries because all documents sharing a key range live on one shard; a find({ created_at: { $gte: start, $lte: end } }) touches a single shard. The weakness is the same as SQL range partitioning: monotonically increasing keys — ObjectId, timestamps, auto-increment integers — route all new inserts to the most recent chunk on the highest shard, creating a write hot spot at exactly the place that receives the most traffic. A hashed shard key applies MD5 to the shard key value before mapping to a chunk. This produces near-uniform write distribution regardless of key ordering, at the cost of range query efficiency — a range query on a hashed shard key requires a scatter-gather across all shards.

Chunk-based splitting and the balancer. MongoDB initialises each sharded collection with a single chunk spanning the full key range. As documents are inserted, the chunk grows. When it reaches the configured maximum size (128 MB in MongoDB 4.x; reduced to 64 MB in newer versions for more frequent, smaller migrations), the config server splits it into two equal-size chunks. The balancer — a background process running on the config server — monitors chunk counts per shard and migrates chunks from over-loaded shards to under-loaded ones. Chunk migration is a live operation: MongoDB streams the chunk documents to the destination shard, keeps both copies in sync during the transfer via an oplog-based catch-up phase, and atomically redirects routing metadata at the end. The balancer respects a configurable active window to avoid migrating chunks during peak traffic.

The jumbo chunk anti-pattern. A jumbo chunk occurs when a chunk cannot be split because all documents within it share the same shard key value. If the shard key is country_code and the value US is present in 40% of all documents, the US chunk accumulates without bound — there is no second key value inside the chunk to split on, so the chunk grows past the maximum size and the balancer marks it as jumbo and stops attempting to migrate it. The result is one shard bearing 40% of the collection permanently while the balancer is powerless to rebalance it. The fix is a compound shard key that combines the categorical value with a high-cardinality field: { country_code: 1, user_id: 1 }. Now each country can have thousands of chunks — one per range of user IDs — and the balancer can distribute them freely across shards. This is MongoDB's equivalent of Cassandra's composite partition key solution to the wide partition anti-pattern.


The Cross-Database View: SQL, Cassandra, DynamoDB, and MongoDB on Five Operational Dimensions

The table below normalises the five most operationally critical partitioning dimensions across all four systems covered in this post. Every entry reflects production-scale behaviour rather than theoretical design documents.

DimensionPostgreSQL (SQL)Apache CassandraAmazon DynamoDBMongoDB
Partition unitDDL-defined child tableAll rows sharing a partition key hash valueAll items sharing a partition key valueChunk — a contiguous range or hash bucket of documents
Routing mechanismQuery planner constraint exclusion at plan timeConsistent hashing ring (Murmur3) + vnodes per nodeInternal AWS-managed hash-based partition mapConfig server metadata + mongos router at query time
Cross-partition query supportFull support — planner merges results transparentlyALLOW FILTERING only — triggers a full cluster scan across all nodesScan operation — supported but billed per item readScatter-gather via mongos — supported, but latency proportional to shard count
Partition size limitLimited by disk; practical ceiling hundreds of GB per partition before I/O degrades~100 MB production limit — GC and compaction issues above this threshold~10 GB before automatic internal split and throughput reallocation128 MB default chunk size (64 MB in recent versions) before automatic split
Hot partition mitigationHash sub-partitioning; composite range-then-hash partitioningRedesign partition key; add a time-bucket component to composite keyWrite sharding (key#0..9 suffix) at app layer; DynamoDB adaptive capacityHashed shard key; compound shard key { category, user_id } for high-cardinality splits

The most important row is the last one. Every system has a hot partition failure mode and every system's mitigation is a variation of the same fundamental idea: distribute writes across a higher-cardinality key. The implementation differs by system — Cassandra demands a data model change, DynamoDB requires application-layer key mangling, MongoDB needs a compound shard key — but the diagnostic and the solution direction are universal across all four.


⚖️ Partitioning Trade-offs and the Anti-Patterns That Kill Performance

Partitioning has real costs that are undersold in conference talks. Here is the honest accounting:

Trade-offDetails
Range write hotspotAll current inserts concentrate on the latest partition. Monitor that partition's I/O separately from the rest.
No range pruning for hashHash-partitioned tables require full scans for range predicates — the hash has no sense of order.
Planner overhead for large partition sets>500 partitions in PostgreSQL adds measurable query planning latency even after pruning.
Cross-partition joinsJoins spanning all partitions may be slower than on an unpartitioned table due to multi-partition merge overhead.
Maintenance burdenAutomated partition creation must be reliable. A missed partition creation causes insert errors at midnight.
MySQL foreign key restrictionMySQL does not support foreign keys on partitioned tables. Referential integrity must be enforced at the application layer.

Five Anti-Patterns That Destroy Partitioning Performance

1. Not including the partition key in critical query predicates. If 90% of queries filter by customer_id but the table is partitioned by created_at, those queries never prune. You added complexity without benefit and likely made things slower.

2. Over-partitioning. 10,000 partitions on a PostgreSQL table causes the planner to spend hundreds of milliseconds evaluating partition bounds even for queries that prune to one. Keep the count below 500. Use composite partitioning to manage very large data sets at lower partition counts.

3. Low-cardinality partition key. A boolean column has two values — two partitions. This adds DDL overhead, halves future flexibility, and provides nearly zero pruning benefit. Partition keys must have enough distinct values to justify the number of partitions you intend to create.

4. Skewed range sizes. If 80% of your data falls within one range partition, that partition carries 80% of the I/O load. Range partitioning on time often causes this: the current month is constantly written to while historical months are static. The solution is monthly (not annual) ranges combined with hash sub-partitioning for the current period.

5. Forgetting the DEFAULT partition on list-partitioned tables. Adding a new country code that is not listed in any explicit partition causes the insert to fail with an error in PostgreSQL. A DEFAULT partition catches unlisted values gracefully and gives you time to add explicit partitions at the next deployment.


🧭 Choosing the Right Partition Strategy: Decision Guide

SituationRecommendation
Time-series data with rolling archivalRange partitioning by date (month or quarter). Use DETACH PARTITION for instant archival.
Even write distribution, high-cardinality keyHash partitioning by the primary access key (user ID, order ID).
Geo-compliance or categorical access patternsList partitioning by region or country code. Add a DEFAULT partition.
Table so large that monthly range partitions are still too bigComposite: range by month, hash by user ID within each month.
Queries mostly point lookups by IDHash partitioning. Range queries on other columns still scan all partitions — accept this trade-off.
Avoid whenTable has <50M rows and existing indexes are sufficient. No natural partition key. Primary bottleneck is write throughput beyond a single node (use sharding instead).
Alternative to partitioningIndex-only scans with covering indexes for selective queries. Table inheritance (PostgreSQL) for manual partitioning control. Sharding (Vitess, Citus) when data outgrows a single node.

Partitioning vs. Sharding — The Final Comparison for System Design Decisions

DimensionPartitioningSharding
Data locationSame node or transparent coordinatorIndependent database nodes
App connectionSingle connection stringShard-aware routing required
Cross-partition joinsSupported — planner merges resultsExpensive scatter-gather or forbidden by design
TransactionsFull ACID within the nodeLimited; 2PC or eventual consistency
Scaling targetQuery performance, manageabilityWrite throughput and storage beyond one node
RebalancingDDL: attach/detach partitions, no data movement across nodesMove chunks across nodes — major operational event
When to chooseTable is large but fits on one node; queries are slowDataset or write throughput exceeds one node's ceiling

The progression: Start with one database. Add partitioning when queries get slow on large tables — it is a DDL change, not a topology change. Move to sharding when the dataset or write throughput genuinely exceeds a single node. Partitioning and sharding are sequential steps on the scaling ladder.


🧪 Worked Example: Building a Partition-Pruned Orders Table End-to-End

This worked example demonstrates the full lifecycle of a range-partitioned orders table in PostgreSQL: creation, querying with partition pruning, instant archival, and a pruning failure case that shows exactly why partition key alignment matters. This scenario is representative of what you would build at a payments company, e-commerce platform, or any SaaS product with time-bounded billing data.

Step 1 — Create the partitioned parent and two monthly child tables:

CREATE TABLE orders (
    id          BIGSERIAL,
    customer_id BIGINT NOT NULL,
    total_cents BIGINT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)   -- partition key must be in PK
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_2024_02 PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE INDEX ON orders_2024_01 (customer_id, created_at);
CREATE INDEX ON orders_2024_02 (customer_id, created_at);

Step 2 — Query with partition pruning (good path):

-- This query supplies the partition key — pruning activates
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT id, total_cents, status
FROM   orders
WHERE  created_at BETWEEN '2024-01-10' AND '2024-01-20'
  AND  customer_id = 42;

-- Expected plan includes:
--   Partitions scanned: 1 (orders_2024_01)
--   Partitions excluded: 1 (orders_2024_02 excluded at plan time)

Step 3 — The pruning failure case (what to avoid):

-- This query omits the partition key — no pruning, full scan of ALL partitions
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT id, total_cents FROM orders WHERE customer_id = 42;

-- Both partitions scanned. In a 48-partition table, all 48 are opened.

Step 4 — Instant archival by detaching an old partition:

-- Detach January 2024 from the live table (runs in milliseconds, no data copied)
ALTER TABLE orders DETACH PARTITION orders_2024_01;

-- orders_2024_01 is now a standalone table — archive it, compress it, or move it
-- When needed again: ALTER TABLE orders ATTACH PARTITION orders_2024_01
--   FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

This four-step pattern covers the complete partition lifecycle in production. The key insight from Step 3: always verify with EXPLAIN ANALYZE that Partitions excluded is non-zero before committing to a partitioning strategy. If excluded is 0, your query pattern and partition key are misaligned.


🛠️ pg_partman: Automated Partition Lifecycle Management for PostgreSQL

Manual partition creation works for small setups, but production systems need automation. If the cron job that creates next month's partition fails on the 20th, inserts fail at midnight on the 1st. pg_partman (PostgreSQL Partition Manager) solves this.

pg_partman is an open-source PostgreSQL extension that automatically creates future partitions, enforces retention policies, and handles cleanup. It ships a background worker (pg_partman_bgw) that runs maintenance on a configurable schedule inside the database process itself.

Setting up monthly range partitioning with pg_partman:

-- 1. Install the extension (available in RDS, Cloud SQL, Azure Database for PostgreSQL)
CREATE EXTENSION pg_partman;

-- 2. Create the parent table
CREATE TABLE orders (
    id          BIGSERIAL,
    customer_id BIGINT NOT NULL,
    total_cents BIGINT NOT NULL,
    status      TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

-- 3. Hand the parent table to pg_partman
--    p_premake=4 creates the next 4 months of partitions immediately
SELECT partman.create_parent(
    p_parent_table => 'public.orders',
    p_control      => 'created_at',
    p_type         => 'range',
    p_interval     => 'monthly',
    p_premake      => 4
);

-- 4. Configure retention: keep 24 months, then drop old partitions
UPDATE partman.part_config
SET    retention             = '24 months',
       retention_keep_table  = FALSE,
       infinite_time_partitions = TRUE
WHERE  parent_table = 'public.orders';

-- 5. Schedule maintenance (run via pg_cron or external cron)
CALL partman.run_maintenance_proc();

With this configuration, pg_partman creates the next 4 months immediately so inserts never fail, drops partitions older than 24 months automatically, and handles the idempotent maintenance routine — safe to call frequently.

Key pg_partman configuration parameters:

ParameterPurpose
p_premakeNumber of future partitions to pre-create (keep ≥ 2 to handle deploy delays)
p_intervalPartition interval: 'daily', 'weekly', 'monthly', 'yearly'
retentionHow long to keep partitions before dropping: '24 months', '2 years'
retention_keep_tableTRUE to detach instead of drop (archive pattern); FALSE to drop outright
infinite_time_partitionsTRUE creates partitions beyond the premake window as data arrives

For a full deep-dive on pg_partman including native triggers, sub-partitioning hierarchies, and pg_partman_bgw tuning, see the pg_partman GitHub repository.


📚 Lessons Learned: What Production Teaches You That Documentation Skips

The partition key is a permanent architectural decision. Re-partitioning a 500 GB live table requires building a new partitioned table, migrating data with minimal downtime (logical replication or bulk INSERT/ATTACH), and cutting over. Plan the partition key with the same weight you give to API contracts. Changing it later is a multi-week project.

Monitor partition skew from day one. Set up a weekly query that reports row counts and size per partition. A partition growing 10× larger than its siblings signals skewed distribution. Catch it at 20 GB — fixing it at 200 GB means a live migration under load.

Test partition pruning explicitly before going to production. Add EXPLAIN (ANALYZE, FORMAT JSON) assertions for critical queries in your CI pipeline. Assert that Partitions excluded is non-zero. A routine refactor that wraps the partition key column in a function call — such as WHERE DATE(created_at) = '2024-01-15' instead of WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16' — silently disables pruning.

Cassandra partition sizing is a production SLA, not a guideline. A wide Cassandra partition doesn't just slow one query — it causes coordinator timeouts that degrade all queries routed through that node during GC pauses. Monitor partition sizes in Cassandra JMX metrics. Alert when any partition approaches 100 MB. Fix the data model before it becomes a production incident.

DynamoDB hot partitions require shard key redesign, not more capacity. Adding provisioned throughput to a table with a hot partition key distributes capacity across partitions — but if one partition key is responsible for 80% of traffic, that partition's physical limit is still bounded. The fix is write sharding: append a random suffix (user_id#0 through user_id#9) to spread writes, then use a scatter-gather query on read.

Partition management automation is not optional. Every production system running range-partitioned tables in PostgreSQL or MySQL should use pg_partman, a custom cron job with alerting, or equivalent automation. The failure mode — no partition for tomorrow's date — is silent until midnight when inserts start failing in production.


📌 TLDR: The Partitioning Decision in Five Rules

  1. Range partitioning for time-series data. You get partition pruning on date predicates, free instant archival (drop a partition rather than run DELETE), and a natural fit for rolling-window retention policies.

  2. Hash partitioning for even write distribution. Range partitioning concentrates all writes on the current period. Hash partitioning spreads them evenly across N partitions at the cost of range query efficiency.

  3. List partitioning for geo-compliance and categorical access. When EU data must live on EU infrastructure, a LIST partition with EU country codes is the partition boundary AND the compliance boundary.

  4. The partition key must be in your most frequent WHERE clauses. The entire performance benefit disappears if critical queries don't include the partition key. Validate this with EXPLAIN ANALYZE before shipping.

  5. Partitioning makes one big table fast. Sharding makes one big system survive beyond a single node. They are sequential steps on the scaling ladder, not competing alternatives.


Share

Test Your Knowledge

🧠

Ready to test what you just learned?

AI will generate 4 questions based on this article's content.

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms