CosmosDB Partition Internals: Logical vs Physical Partitions Explained
How CosmosDB maps your partition key to physical storage, and why the wrong choice costs 10x more at scale
When Your Database Bill Triples Overnight
A retail engineering team ships a flash-sale feature. Traffic spikes 10x. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now take 800ms. The on-call engineer bumps provisioned RU/s from 10,000 to 40,000, and the throttling barely improves.
The problem is not throughput headroom. It's that 90% of all reads hit a single partition key: status = "active". No matter how many RU/s they add to the account, those requests all funnel into one logical partition, which maps to one physical partition, which has a hard ceiling of 10,000 RU/s.
This is the hot partition problem, and avoiding it starts with understanding the difference between CosmosDB's two-layer partition model: logical partitions (which you design) and physical partitions (which CosmosDB manages). Get the distinction wrong at schema design time, and you cannot fix it without a full data migration. Get it right, and you get near-linear horizontal scalability with sub-10ms reads at any traffic level.
CosmosDB Partitioning: Why Two Layers Exist
CosmosDB is a globally distributed, multi-model database designed to scale horizontally without the operational burden of manual sharding. To do this, it splits data across nodes, but it exposes two conceptually distinct partition layers rather than one.
The logical partition is your contract with the database: you declare a partition key, and CosmosDB groups every document with the same key value together. The physical partition is CosmosDB's internal implementation detail: it actually stores data in replica sets distributed across nodes, and it manages those replica sets on your behalf.
This two-layer design gives you control over data placement (so queries stay in-partition) while giving CosmosDB freedom to redistribute load and grow capacity (so you don't need to re-shard manually). Understanding what each layer can and cannot do is the foundation of every correct CosmosDB data model.
Two Layers of Partitioning: One You Control, One You Don't
CosmosDB partitions data across nodes to achieve horizontal scalability. It does this with two distinct abstractions that operate at different layers:
| Concept | Who controls it | Application visibility | Hard size limit |
| --- | --- | --- | --- |
| Logical Partition | You (via partition key) | Fully visible | 20 GB |
| Physical Partition | CosmosDB internally | Mostly hidden | ~50 GB |
Think of logical partitions as the filing system you design: you label every folder (partition key value). Physical partitions are the filing cabinets CosmosDB builds: it decides which folders go in which cabinet, and it can add cabinets automatically as you fill them.
What Is a Logical Partition?
A logical partition is the set of all documents that share the same partition key value. You define the partition key when creating a container โ it is immutable after creation.
- All documents with `customerId = "cust-123"` form one logical partition.
- Maximum size: 20 GB per logical partition value, a hard, unextendable ceiling.
- Maximum throughput: approximately 10,000 RU/s per logical partition.
- All documents in a logical partition are co-located on the same physical storage, which enables efficient single-partition queries without cross-node fan-out.
Logical partitions are permanent groupings. Once a document is written, its logical partition assignment is sealed by its partition key value. You cannot reassign a document to a different logical partition without deleting and re-inserting it.
What Is a Physical Partition?
A physical partition is the actual server-side storage and compute unit. CosmosDB creates, manages, and splits physical partitions entirely on your behalf.
- Each physical partition is backed by a replica set of four nodes for durability and high availability.
- A single physical partition holds one or more logical partitions.
- Physical partitions grow to approximately 50 GB and can serve up to 10,000 RU/s.
- When a physical partition becomes too large or receives too much traffic, CosmosDB automatically splits it into two physical partitions, with zero downtime.
You never create, delete, or explicitly manage physical partitions. They are CosmosDB's internal elasticity mechanism.
How CosmosDB Routes Requests: The Hash-Range Mapping
The routing layer between your application and physical storage is a partition key hash space spanning from 0x0000 to 0xFFFF. CosmosDB hashes each document's partition key value into this range. Physical partitions own contiguous slices of the hash space, and logical partitions land in the physical partition that owns their hash.
```mermaid
graph TD
    A[Client Request] --> B[CosmosDB Gateway]
    B --> C[Partition Key Router]
    C --> D{Hash of Partition Key}
    D --> E[Physical Partition 1 - Range 0x0000 to 0x3FFF]
    D --> F[Physical Partition 2 - Range 0x4000 to 0x7FFF]
    D --> G[Physical Partition 3 - Range 0x8000 to 0xBFFF]
    D --> H[Physical Partition 4 - Range 0xC000 to 0xFFFF]
    E --> I[LP: userId-A, userId-B, userId-C]
    F --> J[LP: userId-D, userId-E]
    G --> K[LP: userId-F, userId-G, userId-H]
    H --> L[LP: userId-J, userId-K]
```
This diagram shows how a client request reaches a physical partition. The partition key is hashed into the 0x0000 to 0xFFFF range. Each physical partition owns a contiguous band of that range and holds the logical partitions whose hashes land there. Multiple logical partitions can share one physical partition, but a single logical partition always lives entirely within one physical partition. This atomic boundary is the critical architectural invariant: logical partitions are indivisible.
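The routing behavior the diagram describes can be sketched in plain Java. This is a simplified model, not the service's implementation: CosmosDB uses its own internal hash function, so the CRC32-based `hashToRange` and the band arithmetic here are illustrative stand-ins.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PartitionRouter {
    // Illustrative stand-in for CosmosDB's internal hash: map a partition
    // key value into the 0x0000..0xFFFF hash space.
    static int hashToRange(String partitionKeyValue) {
        CRC32 crc = new CRC32();
        crc.update(partitionKeyValue.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() & 0xFFFF);
    }

    // Each physical partition owns a contiguous band of the hash space;
    // a request is routed to whichever band its hash falls into.
    static int routeToPhysicalPartition(String partitionKeyValue, int physicalPartitions) {
        int bandWidth = 0x10000 / physicalPartitions; // 4 partitions -> bands of 0x4000
        return hashToRange(partitionKeyValue) / bandWidth;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"userId-A", "userId-B", "userId-C"}) {
            // The same key value always hashes to the same band, so a logical
            // partition lives entirely inside one physical partition.
            System.out.println(key + " -> physical partition "
                    + routeToPhysicalPartition(key, 4));
        }
    }
}
```

Because routing is a pure function of the key value, every document sharing a key deterministically lands in the same band, which is exactly why a logical partition can never straddle two physical partitions.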
Partition Splitting Internals: How CosmosDB Scales Without Downtime
The Internals of a Split
CosmosDB triggers a physical partition split under two conditions:
- Data volume: The physical partition approaches ~50 GB.
- Throughput scale-up: You increase provisioned or autoscale RU/s and CosmosDB needs to spread load across more partitions to fulfill the new capacity.
```mermaid
graph TD
    A[Physical Partition P1 - 45GB - 8000 RU/s] -->|Split triggered| B[Physical Partition P1a - 22GB - 4000 RU/s]
    A -->|Split triggered| C[Physical Partition P1b - 23GB - 4000 RU/s]
    B --> D[Logical Partitions LP1 to LP5]
    C --> E[Logical Partitions LP6 to LP10]
```
When a split occurs, CosmosDB divides the hash range of the original physical partition into two equal halves and migrates each half's logical partitions to one of the two resulting physical partitions. Data migration happens in the background while reads and writes continue โ the gateway layer transparently redirects each request to the correct post-split target.
What you actually observe during a split: brief latency spikes (typically under one second), no data loss, and no application code change required. The entire operation is orchestrated by the CosmosDB control plane and is invisible to your application except for transient retry events.
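The reassignment step can be modeled in a few lines. This is a toy sketch of the halving described above; the `Band` type and hash values are illustrative, and the real control plane operates on its own internal hash space:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitModel {
    // The contiguous hash range owned by one physical partition (inclusive).
    record Band(int lo, int hi) {}

    // Halve the parent band; each logical partition (represented here by its
    // hash) follows its own hash into exactly one child, so no logical
    // partition is ever divided by a split.
    static Map<Band, List<Integer>> split(Band parent, List<Integer> logicalPartitionHashes) {
        int mid = (parent.lo() + parent.hi()) / 2;
        Band left = new Band(parent.lo(), mid);
        Band right = new Band(mid + 1, parent.hi());
        Map<Band, List<Integer>> children = new LinkedHashMap<>();
        children.put(left, new ArrayList<>());
        children.put(right, new ArrayList<>());
        for (int hash : logicalPartitionHashes) {
            children.get(hash <= mid ? left : right).add(hash);
        }
        return children;
    }

    public static void main(String[] args) {
        System.out.println(split(new Band(0x0000, 0x3FFF), List.of(0x0100, 0x2000, 0x3F00)));
    }
}
```

Note that the split moves whole logical partitions only: a hot logical partition survives the split intact in one of the two children, which is the behavior the next section builds on.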
Performance Analysis During a Split
Physical partition splits are generally transparent, but they do have measurable performance effects worth understanding for capacity planning:
- RU/s temporarily doubles: During migration, both the original and the new physical partition may serve reads simultaneously, meaning the total RU capacity briefly doubles before the old partition drains.
- Write latency increases transiently: Writes are held in a two-phase commit between the old and new partition during data migration. The spike is typically under 200ms but can surface in P99 latency dashboards.
- Partition count multiplies on scale-up: When you double provisioned RU/s, CosmosDB may double the physical partition count. More partitions enable more parallelism, but only if the partition key provides sufficient cardinality to distribute load across them. A low-cardinality key (three `status` values) means most new physical partitions remain idle after the split.
Key implication: autoscale-driven splits give you more physical partitions, but the extra throughput only materializes if your logical partition key has enough cardinality to spread load across them.
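As a rough capacity-planning sketch, physical partition count is driven by whichever requirement is larger: storage at roughly 50 GB per partition or throughput at roughly 10,000 RU/s per partition. This heuristic follows from the limits quoted above; it is an approximation, not an official formula:

```java
public class PartitionEstimate {
    // Rough heuristic: physical partition count is set by whichever is
    // larger -- storage (~50 GB each) or throughput (~10,000 RU/s each).
    static int estimatePhysicalPartitions(double storageGb, int provisionedRus) {
        int byStorage = (int) Math.ceil(storageGb / 50.0);
        int byThroughput = (int) Math.ceil(provisionedRus / 10_000.0);
        return Math.max(1, Math.max(byStorage, byThroughput));
    }

    public static void main(String[] args) {
        // Doubling RU/s from 20,000 to 40,000 can double the partition count.
        System.out.println(estimatePhysicalPartitions(30, 20_000)); // 2
        System.out.println(estimatePhysicalPartitions(30, 40_000)); // 4
    }
}
```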
Hot Partition Flow: How a Bad Key Saturates Physical Storage
Physical partition splits and increased RU/s cannot save you from a poorly chosen partition key. The fundamental problem is at the logical partition layer โ before CosmosDB's elasticity mechanisms even get involved.
Scenario 1: Low-Cardinality Key
```yaml
container: "orders"
partitionKey: "/status"
values: ["active", "completed", "cancelled"]
```
Three logical partitions exist regardless of how many millions of orders you store. Ninety percent of queries target status = "active". That single logical partition and the physical partition it resides on absorb 90% of all RU/s. CosmosDB can split that physical partition, but the hot logical partition moves intact to one of the two halves: the hotness follows the key, not the partition.
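The skew is easy to demonstrate. The sketch below routes a simulated workload into four hash buckets standing in for physical partitions: 90% of requests carrying status = "active" pile into one bucket, while per-user keys spread evenly. The workload shapes and bucket count are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkewDemo {
    // Route each request's key into one of `partitions` hash buckets and
    // count hits -- a stand-in for per-physical-partition RU/s consumption.
    static int[] bucketLoad(List<String> keys, int partitions) {
        int[] load = new int[partitions];
        for (String key : keys) {
            load[Math.floorMod(key.hashCode(), partitions)]++;
        }
        return load;
    }

    // 90% of requests target status = "active" (the flash-sale pattern).
    static List<String> statusWorkload(int n) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add(i % 10 < 9 ? "active" : "completed");
        return keys;
    }

    // One distinct userId per request: high cardinality, even hashing.
    static List<String> userWorkload(int n) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add("user-" + i);
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bucketLoad(statusWorkload(10_000), 4))); // one bucket dominates
        System.out.println(Arrays.toString(bucketLoad(userWorkload(10_000), 4)));   // roughly even
    }
}
```

Adding more buckets (partitions) changes nothing for the status workload: the "active" bucket keeps its 9,000 hits because every hot request carries the same key value.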
Scenario 2: Skewed Natural Key
```yaml
container: "users"
partitionKey: "/country"
```
If 80% of users are in the United States, country = "US" accumulates towards the 20 GB logical partition ceiling. Adding RU/s does not help once the ceiling is hit: writes to that partition fail permanently with an HTTP 403 error ("Partition key reached maximum size").
Even Distribution vs. Hot Partition: A Visual Contrast
```mermaid
graph TD
    subgraph Even[Good Key - userId - Even Distribution]
        PP1[Physical Partition 1 - 3333 RU/s]
        PP2[Physical Partition 2 - 3333 RU/s]
        PP3[Physical Partition 3 - 3334 RU/s]
    end
    subgraph Uneven[Bad Key - status - Hot Partition]
        HP1[Physical Partition A - 9500 RU/s THROTTLED]
        HP2[Physical Partition B - 300 RU/s]
        HP3[Physical Partition C - 200 RU/s]
    end
```
This diagram illustrates the contrast between a well-chosen and a poorly-chosen partition key at the physical level. With userId as the key and millions of users, each physical partition handles a balanced fraction of total traffic, all operating well below the 10,000 RU/s physical partition ceiling. With status as the key, the active partition absorbs almost all traffic and saturates its host physical partition, producing 429 throttling errors even while the other two physical partitions sit nearly idle. The account-level RU/s meter shows low utilization, yet users experience failures: a counter-intuitive outcome that catches many teams off guard.
Practical: Choosing and Fixing Partition Keys
The goal is high cardinality + uniform access distribution + query alignment.
| Design Rule | Good Example | Bad Example |
| --- | --- | --- |
| High cardinality | userId (millions of distinct values) | status (3 to 5 values) |
| Uniform access | Each key accessed at roughly equal rate | One key dominates all reads |
| Query alignment | Most queries include the partition key | Most queries omit the partition key |
| Size headroom | Each logical partition stays well under 20 GB | Any single key will accumulate > 20 GB |
| Write spread | Writes distribute across many distinct key values | All writes target the "latest" key |
Synthetic Partition Keys for Difficult Workloads
When your data does not have a natural high-cardinality key, build one:
**Suffix bucketing**: append a modulo bucket to an existing key:

```java
orderId + "_" + (epochSeconds % 50)  // yields 50 logical partitions per base ID
```

**Prefix randomization**: prepend a random segment to force even hashing:

```java
String.valueOf(random.nextInt(100)) + "_" + entityId
```

**Composite key**: combine two meaningful fields:

```java
tenantId + "_" + userId  // cardinality multiplies across both fields
```
Suffix bucketing is especially important for time-series and append-heavy workloads where all writes naturally cluster on the most recent timestamp. Distributing writes across sensorId_0 through sensorId_99 prevents 100% of write traffic from hitting a single logical partition.
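A minimal sketch of the write and read paths under suffix bucketing. The bucket count and key format are design choices you tune per workload, and the read side must fan in across every bucket of a base key:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class SyntheticKeys {
    // Write path: pick a random bucket so writes for a hot base key are
    // spread across `buckets` distinct logical partitions.
    static String bucketedKey(String baseKey, int buckets) {
        return baseKey + "_" + ThreadLocalRandom.current().nextInt(buckets);
    }

    // Read path: retrieving everything for a base key now fans in across
    // all of its buckets -- the cost of the extra write distribution.
    static List<String> allBucketKeys(String baseKey, int buckets) {
        List<String> keys = new ArrayList<>();
        for (int b = 0; b < buckets; b++) keys.add(baseKey + "_" + b);
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(bucketedKey("sensor-42", 100));
        System.out.println(allBucketKeys("sensor-42", 3)); // [sensor-42_0, sensor-42_1, sensor-42_2]
    }
}
```

The trade-off is explicit in the two methods: writes scale out across buckets, while per-entity reads become a small multi-partition fan-in that your query layer must issue and merge.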
Trade-offs and Failure Modes
| Scenario | Root Cause | Observable Impact | Resolution |
| --- | --- | --- | --- |
| 429 Too Many Requests | Logical partition exceeds physical partition RU/s ceiling | Read/write throttling on all items in that physical partition | Redesign partition key; add suffix bucketing |
| 20 GB limit exceeded | Low-cardinality key with unbounded data growth | Writes to that partition fail permanently | Container migration to a new key; cannot be done in-place |
| Cross-partition fan-out | Query filter omits partition key | Latency 10x to 100x higher; RU cost scales with partition count | Add partition key to query filter; use secondary indexes |
| Transient latency spikes | Physical partition split in progress | Sub-second latency blips during autoscale | Expected behavior; implement retries with exponential backoff |
| Uneven distribution after split | Too few distinct partition key values | Some physical partitions overloaded post-split | Higher-cardinality key improves distribution after each split |
The Cross-Partition Query Tax
A query without a partition key filter executes as a fan-out: CosmosDB sends the query to every physical partition in parallel and merges results. At 50 physical partitions, you pay 50x the RU cost of the equivalent single-partition query. Monitor the `x-ms-documentdb-query-metrics` response header: if `retrievedDocumentCount` greatly exceeds `outputDocumentCount`, you are paying for cross-partition scans. The fix is either adding the partition key to the filter or introducing a secondary index on a projected lookup field.
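A simple guardrail based on that ratio might look like the sketch below. The counts would come from the query-metrics header or the SDK's diagnostics, and the threshold of 10 is an arbitrary illustration, not a recommended value:

```java
public class ScanDetector {
    // Flag queries where far more documents were read than returned --
    // the signature of a cross-partition (or unindexed) scan.
    static boolean looksLikeScan(long retrievedDocumentCount, long outputDocumentCount) {
        if (outputDocumentCount == 0) return retrievedDocumentCount > 0;
        return (double) retrievedDocumentCount / outputDocumentCount > 10.0; // illustrative threshold
    }

    public static void main(String[] args) {
        System.out.println(looksLikeScan(50_000, 100)); // true: paying for a scan
        System.out.println(looksLikeScan(120, 100));    // false: well-targeted query
    }
}
```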
Real-World Applications
E-commerce at scale: Order management systems partition by customerId rather than status. Every customer's order history lives in one logical partition, enabling O(1) "my orders" queries, while preventing any single key from accumulating unbounded data. Order IDs (GUIDs) are used as the key only when cross-customer analytics, not per-customer retrieval, is the primary access pattern.
IoT device telemetry: Devices emit readings continuously. A naive timestamp partition key creates a single hot logical partition for every second of data. Production IoT systems use a deviceId_date composite (for example, device-abc-2026-04-18), creating thousands of logical partitions per day and spreading writes evenly across physical partitions.
Multi-tenant SaaS: Small tenants fit comfortably within 20 GB per tenantId partition. For enterprise tenants approaching the limit, suffix buckets (tenantId_0, tenantId_1, ... tenantId_9) distribute one logical tenant across 10 logical partitions. CosmosDB 2023 introduced hierarchical partition keys (up to 3 levels: /tenantId, /userId, /sessionId) which handle this pattern natively without manual bucketing.
Social media feed: Partition by userId so each user's posts form one logical partition. Read-your-own-writes and timeline queries stay in-partition and complete in under 5ms. Celebrity accounts with millions of followers require fan-out to read, but writes stay on one partition: an acceptable trade-off given that reads can be cached at the CDN layer.
Decision Guide: Selecting a Partition Key by Use Case
| Use Case | Recommended Key | Rationale |
| --- | --- | --- |
| User profiles and activity | userId | High cardinality; even distribution; aligns with lookup pattern |
| Order management | customerId | Co-locates a customer's orders; efficient per-customer queries |
| IoT sensor telemetry | deviceId_date | Prevents write hot spots; keeps device history queryable |
| Multi-tenant app (small tenants) | tenantId | Isolates tenant data; all queries stay in-partition |
| Multi-tenant app (large tenants) | tenantId_bucket | Prevents 20 GB ceiling; spreads load across buckets |
| Product catalog | productId (GUID-based) | Uniform hashing; no natural cardinality problem |
| Audit / append-heavy logs | sourceId_bucket | Distributes high-frequency writes across logical partitions |
| Time-series analytics | metricName_hour | Balances writes; keeps related metrics queryable per hour |
Azure CosmosDB SDK: Configuring Partition Key at Container Creation
The partition key is set at container provisioning and is permanent. There is no migration path within the same container: if you choose the wrong key, you must create a new container and copy data. The SDK call that sets this irrevocable choice is simple and easy to overlook:
```java
CosmosContainerProperties props = new CosmosContainerProperties(
    "orders",
    "/customerId"  // partition key path; cannot change after creation
);

// Hierarchical partition key (Java SDK 4.37+): build a multi-hash
// PartitionKeyDefinition instead of passing a single path, e.g.:
// PartitionKeyDefinition pkDef = new PartitionKeyDefinition();
// pkDef.setPaths(Arrays.asList("/tenantId", "/customerId"));
// pkDef.setKind(PartitionKind.MULTI_HASH);
// pkDef.setVersion(PartitionKeyDefinitionVersion.V2);
// CosmosContainerProperties props = new CosmosContainerProperties("orders", pkDef);

ThroughputProperties autoscale = ThroughputProperties.createAutoscaledThroughput(4000);
database.createContainerIfNotExists(props, autoscale).block();
```
CosmosDB automatically creates the initial set of physical partitions and handles all splits as data volume and provisioned RU/s grow. The hierarchical key variant (commented out above) allows up to three levels of nesting, co-locating tenant data while maintaining high cardinality, which eliminates the need for manual suffix bucketing in most multi-tenant scenarios.
For a full deep-dive on CosmosDB SDK configuration, indexing policies, and consistency levels, see the Azure CosmosDB Java SDK v4 documentation.
Lessons Learned from CosmosDB Partition Failures
The 20 GB limit is a hard wall with no operational escape hatch. Unlike throughput limits that you can resolve by raising RU/s, hitting the logical partition size ceiling causes all writes to that partition to fail. The only fix is a full container migration with a new partition key: an expensive operation requiring double the storage, a data copy pipeline, and a cutover window. Validate expected data volume per key value before launch.
Hot partitions are invisible in account-level metrics. CosmosDB's default monitoring shows account-level RU/s consumption. A container can report 40% account utilization while one physical partition inside it is at 100% and throttling. Always monitor per-partition RU/s metrics (available in the Azure portal under "Insights → Throughput → Normalized RU consumption by partition key range") to detect hot partitions before they cause production incidents.
Cross-partition queries are a cost multiplier, not just a latency issue. A query without a partition key filter fans out to every physical partition. At 100 physical partitions (a realistic count for a multi-year production container) a cross-partition query costs 100x more RU/s than the equivalent in-partition query. Teams that start with 5 physical partitions rarely notice the cost. Teams that have grown to 100 experience unexplained billing spikes. Audit your query patterns whenever you significantly scale a container.
Partition key design is a first-class architecture decision. It determines query patterns, cost structure, and the scalability ceiling for the lifetime of that container. It deserves the same deliberate analysis as schema design in a relational database โ before the first byte of production data is written.
Hierarchical partition keys (introduced 2023) eliminate most synthetic key hacks. If you are designing a new container for a multi-tenant or hierarchical access pattern, evaluate whether hierarchical keys make suffix bucketing unnecessary. They allow CosmosDB to co-locate data at the top-level key while maintaining the cardinality of the full composite key.
TLDR & Key Takeaways
TLDR: In CosmosDB, logical partitions are your grouping contract (defined by partition key, max 20 GB, permanent), and physical partitions are CosmosDB's elastic storage units that hold many logical partitions and split automatically. Hot partitions, caused by low-cardinality or skewed keys, cannot be fixed by adding RU/s; they require a better partition key or synthetic suffix bucketing.
- Logical partition = all documents sharing the same partition key value. You define it. Max 20 GB and ~10,000 RU/s per value. Permanent and indivisible.
- Physical partition = CosmosDB's actual storage unit backed by a 4-node replica set. Holds multiple logical partitions. Splits automatically when data or throughput exceeds thresholds.
- Routing = CosmosDB hashes your partition key into a fixed range; physical partitions own contiguous hash bands. Many logical partitions map to one physical partition, never the reverse.
- Hot partitions arise from low-cardinality or skewed partition keys. They cannot be resolved by increasing RU/s; the bottleneck is at the logical partition level.
- Fix hot partitions by choosing a high-cardinality key or building synthetic suffix-bucket keys to distribute data and traffic across many logical partitions.
- Partition key is permanent: plan for worst-case data volume and access patterns before the first write. Changing it requires a full container migration.
Related Posts
- Partitioning Approaches: SQL and NoSQL – compare range-based, hash-based, and directory-based sharding strategies across relational and non-relational databases.
- Consistent Hashing: Scaling Without Chaos – how consistent hashing underpins partition routing in distributed databases and avoids full reshuffles on node changes.
- Azure Cosmos DB Consistency Levels Explained – the five CosmosDB consistency models (Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual) and when to use each.
- System Design Interview Prep Roadmap – a full learning path for distributed systems, database design, and scalability topics.