Azure Cosmos DB Consistency Levels Explained: Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual
What each consistency level actually guarantees — and the production pitfalls that happen when you get it wrong
TLDR: Cosmos DB offers five consistency levels — Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — each with precise, non-obvious internal mechanics. Session does not mean HTTP session; it means a client-side token that tracks what you have seen. Strong is unavailable with multi-region writes. Eventual allows genuinely out-of-order reads. Picking the wrong level costs you either correctness or throughput — and the bugs are silent.
🔥 The Banking App That Lost Money Three Times a Week
A fintech team running a banking application on Azure Cosmos DB filed a production incident: users in Southeast Asia were occasionally seeing their pre-deposit balance immediately after depositing money. The account balance showed the old amount as if the deposit had never happened. Transactions were completing successfully. Money was being transferred. But 3% of balance reads returned stale data.
The team had set their Cosmos DB account to Session consistency, which they understood to mean "consistent within a database session." They expected that any read following a write would see that write. So where was the staleness coming from?
The culprit was in their serverless architecture. Each AWS Lambda invocation instantiated a new Cosmos DB SDK client. In Cosmos DB, Session consistency does not mean "database session" in the SQL sense — it means client-session-token continuity. When a Lambda function wrote a deposit and received a session token encoding that write's version, that token lived only in that Lambda instance's memory. The next Lambda invocation handling the balance read started fresh with no session token. Cosmos DB had no obligation to serve a fresh replica. It returned data from any replica, including one that had not replicated the deposit yet.
This confusion is pervasive. Cosmos DB's five consistency levels have precise, non-obvious definitions that diverge sharply from the casual English meanings of words like "session," "eventual," and "consistent." This post explains what each level internally guarantees, how the mechanism enforces that guarantee, and exactly when it silently breaks.
📖 The Five Consistency Levels — What the Words Actually Mean in Cosmos DB
Cosmos DB exposes five consistency levels, ordered from strongest to weakest. Before examining each mechanically, it is worth seeing their contracts side by side — the casual meanings of these words will mislead you if you rely on them.
| Level | One-Line Contract | The Word That Lies |
|---|---|---|
| Strong | Every read returns the globally latest committed write — always | "Strong" understates it: this is full linearizability |
| Bounded Staleness | Reads lag behind writes by at most K versions OR T seconds | "Bounded" sounds reassuring — but the bound can be minutes |
| Session | Within a single SDK client session: read-your-own-writes, monotonic reads, monotonic writes | "Session" is not HTTP session, not DB connection — it is a client token |
| Consistent Prefix | Reads never see writes out of order — no gaps, no reversals | "Consistent" suggests freshness — it says nothing about recency |
| Eventual | All replicas eventually converge; reads may return any committed version in any order | "Eventual" feels like "soon" — it can mean out-of-order reads indefinitely |
The ordering matters: every level to the right trades away consistency guarantees in exchange for lower write latency, higher availability, or both. The following spectrum diagram places each level against its key implications:
```mermaid
graph LR
    S["🔒 Strong<br/>Linearizable<br/>2x RU cost<br/>No multi-write support"]
    BS["🕑 Bounded Staleness<br/>Lag up to K versions or T secs<br/>Ordering in single-write region"]
    SE["🪙 Session<br/>Read-your-own-writes<br/>1x RU — token-scoped"]
    CP["📜 Consistent Prefix<br/>No gaps or reversals<br/>1x RU — ordering only"]
    EV["⚡ Eventual<br/>Max throughput<br/>1x RU — zero ordering"]
    S -->|"Weaker"| BS
    BS --> SE
    SE --> CP
    CP -->|"Weakest"| EV
    style S fill:#c0392b,color:#fff
    style BS fill:#e67e22,color:#fff
    style SE fill:#27ae60,color:#fff
    style CP fill:#2980b9,color:#fff
    style EV fill:#8e44ad,color:#fff
```
This spectrum shows that Strong stands uniquely apart — it is the only level with a 2x RU cost on reads and the only level incompatible with multi-region writes. Session, Consistent Prefix, and Eventual share identical write latency and cost; their only difference is the semantic ordering guarantee each provides.
🔍 The CAP and PACELC Trade-offs That Forced Five Levels Into Existence
A single consistency level cannot serve every application in a globally distributed database. To understand why, start with a physical constraint: replication across regions takes time. Tokyo to London is roughly 9,500 km — at the speed of light, a minimum 31 ms one-way latency even before network overhead. Real cross-region replication typically adds another 20–100 ms.
The CAP Theorem formalizes the resulting dilemma: under a network partition, a distributed system can guarantee either Consistency (every read returns the latest write) or Availability (every request gets a response) — but not both simultaneously.
The PACELC refinement extends this: even without a partition, every operation faces a choice between Latency and Consistency. A linearizable write must synchronize across all replicas before returning — which costs round-trip latency. A low-latency write can return before remote replicas confirm — which risks stale reads.
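The latency floor quoted above is simple arithmetic. A quick sketch, using the vacuum speed of light (real fiber is slower still, so actual floors are higher):

```java
public class LatencyFloor {
    // Speed of light in a vacuum, expressed in km per millisecond.
    static final double C_KM_PER_MS = 299_792.458 / 1000.0;

    // Minimum one-way latency for a signal travelling `km` kilometres.
    static double oneWayMs(double km) {
        return km / C_KM_PER_MS;
    }

    public static void main(String[] args) {
        // Tokyo -> London, roughly 9,500 km great-circle distance.
        System.out.printf("One-way floor:   %.1f ms%n", oneWayMs(9_500));
        // A synchronous cross-region quorum pays at least one round trip.
        System.out.printf("Round-trip floor: %.1f ms%n", 2 * oneWayMs(9_500));
    }
}
```

The round-trip figure is the unavoidable minimum a Strong write pays when its quorum spans those two regions.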
One consistency level cannot serve all applications because their requirements are genuinely incompatible:
- A bank account balance requires reads to always return the latest write. Stale data causes incorrect overdraft checks or duplicate withdrawals. The team can afford the latency penalty of Strong consistency.
- A social media like counter can tolerate reading 998 likes instead of 1,001 for two seconds. The business cannot afford the latency penalty of strong consistency at 100 million operations per second.
- A shopping cart needs read-your-own-writes: if a user adds an item and immediately views the cart, that item must appear. Whether another user's cart is stale is irrelevant.
- An audit log needs sequential ordering: entries must appear in the order they were committed. An entry written 30 seconds ago is fine; an entry appearing out of sequence is not.
These four requirements are genuinely incompatible in a single consistency model. Cosmos DB's five levels are an explicit, configurable trade-off menu designed to let you match the consistency model to each feature's actual needs.
⚙️ How Cosmos DB Enforces Each Level — Quorums, Version Watermarks, and Token Routing
Each consistency level is enforced through a different combination of three mechanisms: write quorum policy, read routing policy, and version tracking.
Strong — Write quorum spans ALL configured regions. A write returns success only after every region has durably committed it. Reads must route to the primary write region or a validated quorum — no lagging replica may serve a Strong read. Write latency equals the round-trip time to the farthest configured region.
Bounded Staleness — Write quorum covers only the local write region (same as Eventual). Cosmos DB tracks a per-region version watermark — a pointer to the oldest version any read in that region may serve. The watermark advances as replication catches up. A read cannot return data older than the watermark, bounding staleness to at most K versions or T seconds behind the write region.
Session — Write quorum covers only the local write region. The key mechanism is a session token — an opaque string returned in every write response, encoding the write's logical sequence number (LSN). The SDK stores this token and includes it in every subsequent read request. Cosmos DB routes the read to a replica whose committed LSN is at least equal to the token, guaranteeing read-your-own-writes. Requests without a token have no routing constraint.
Consistent Prefix — Write quorum covers only the local write region. The replication protocol enforces that write replicas apply operations in commit order — a replica cannot apply W3 before W2, even if W3 arrived over the network first. This prevents gaps and reversals. No promise is made about how far behind a replica may be.
Eventual — Write quorum covers only the local write region. Reads serve from whichever replica responds first with no version constraint. Replication is gossip-based and asynchronous — replicas may receive and apply writes in any order. The only guarantee is eventual convergence.
🧠 Deep Dive: Session Token Internals and Staleness Watermarks Under the Hood
Internals: How Session Tokens and Version Watermarks Track Distributed State
Session token mechanics operate entirely at the SDK layer. When you create a Cosmos DB client, it holds an empty partition-to-LSN map. After every write, the server response includes a session token — a string encoding the logical sequence number of the write on the relevant partition. The SDK stores this LSN per partition in memory.
On every subsequent read, the SDK attaches the stored LSN as a request header (x-ms-session-token). The Cosmos DB gateway receives this header and routes the request to a replica whose committed LSN for that partition is at least equal to the requested value. If the nearest replica has not replicated that far, the gateway either waits briefly for it to catch up or routes to a farther replica that has — this is transparent to the client.
The critical implication: the session token is a client-side object. It does not live in the database. It lives in the SDK instance memory. Creating a new SDK client creates a new, empty token map — making every read from that new client identical to an Eventual read until the client issues its first write.
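The client-side token map can be sketched as a toy model. All class and method names below are illustrative, not the real SDK's API; the point is the routing constraint and what happens when the map is empty:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of Session consistency routing (names are illustrative only).
public class SessionRouting {
    // A replica with the highest LSN it has committed for one partition.
    record Replica(String name, long committedLsn, long balance) {}

    // Client-side state: the per-partition LSN map the SDK keeps in memory.
    final Map<String, Long> sessionTokens = new HashMap<>();

    // After a write, remember the LSN the server returned for that partition.
    void recordWrite(String partition, long lsn) {
        sessionTokens.merge(partition, lsn, Math::max);
    }

    // Route a read: with a token, only replicas at or past that LSN qualify;
    // with no token, any replica qualifies (Eventual-like behaviour).
    Replica route(String partition, List<Replica> replicas) {
        long minLsn = sessionTokens.getOrDefault(partition, 0L);
        return replicas.stream()
                .filter(r -> r.committedLsn() >= minLsn)
                .findFirst()
                .orElseThrow(); // the real gateway would wait or go cross-region
    }
}
```

A client that wrote at LSN 1042 skips a replica stuck at 1039; a freshly constructed client, with an empty map, happily reads from that same stale replica. That asymmetry is exactly the Lambda bug from the opening scenario.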
Staleness watermark mechanics work server-side. Cosmos DB maintains a per-region "safe read version" that advances continuously as the replication pipeline commits writes. For Bounded Staleness, the service guarantees that the safe read version at any time T is at least as fresh as all writes committed before T - maxIntervalInSeconds. Reads in a region are served only from the safe version or newer. This is enforced at the storage engine level — a read arriving at a replica with a too-old safe version is either delayed or promoted to a quorum read.
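The version-bound half of Bounded Staleness can be sketched the same way. This is a minimal model with invented names, covering only the K (version) bound; the real engine tracks the T (time) bound as well:

```java
// Toy model of a per-region safe-read-version check (illustrative names).
public class StalenessWatermark {
    // Latest LSN committed in the write region.
    long writeRegionLsn;
    // Highest LSN this read region has replicated so far.
    long replicatedLsn;
    // Configured bound: reads may lag at most this many versions behind.
    final long maxVersionLag;

    StalenessWatermark(long maxVersionLag) {
        this.maxVersionLag = maxVersionLag;
    }

    // The oldest version any read in this region may be served from.
    long safeReadVersion() {
        return Math.max(0, writeRegionLsn - maxVersionLag);
    }

    // A replica may serve locally only once it has caught up to the
    // watermark; otherwise the read is delayed or promoted to a quorum read.
    boolean canServeLocally() {
        return replicatedLsn >= safeReadVersion();
    }
}
```

With the write region at LSN 1000 and K = 100, a replica at LSN 950 serves the read; a replica at 850 is outside the bound and cannot.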
The following sequence diagram shows how Strong and Session diverge in their routing behavior for the same write operation. Strong blocks the client until all regions confirm; Session returns immediately after the local quorum commits and relies on the client's session token for subsequent read routing.
```mermaid
sequenceDiagram
    participant C as Client
    participant WR as Write Region (SEA)
    participant EU as Read Region (West EU)
    participant US as Read Region (East US)
    Note over C,US: Strong Consistency — synchronous global quorum
    C->>WR: Write W1 (deposit $500)
    WR->>EU: Synchronous replication
    WR->>US: Synchronous replication
    EU-->>WR: ACK committed
    US-->>WR: ACK committed
    WR-->>C: Success + LSN:1042 (all regions confirmed)
    Note over C,US: Session Consistency — async replication + token routing
    C->>WR: Write W2 (deposit $200)
    WR-->>C: Success + session-token LSN:1043 (local quorum only)
    C->>EU: Read balance [x-ms-session-token: LSN:1043]
    EU-->>C: $1700 (routed to replica at LSN >= 1043)
```
This diagram highlights the fundamental difference: Strong waits for all three regions before returning to the client, adding 80–150 ms of cross-region round-trip to every write. Session returns after the local quorum only, then uses the session token to route subsequent reads to an up-to-date replica without blocking the write path.
Performance Analysis: Write Latency, RU Cost, and Throughput Across All Five Levels
| Level | Write latency driver | Read latency driver | RU multiplier | Throughput impact |
|---|---|---|---|---|
| Strong | Round-trip to ALL configured regions | Primary region quorum | 2x reads | ~50% vs Eventual |
| Bounded Staleness | Local write region quorum | Nearest replica within watermark | 1x | Near-Eventual |
| Session | Local write region quorum | Nearest replica with matching LSN | 1x | Near-Eventual |
| Consistent Prefix | Local write region quorum | Nearest replica (any ordered version) | 1x | Equivalent to Eventual |
| Eventual | Local write region quorum | Nearest replica (any version) | 1x | Maximum |
The 2x RU cost on Strong consistency reads comes from internal quorum reads required to guarantee linearizability — every Strong read confirms the current version across a replica quorum before returning, doubling the compute cost. Session, Consistent Prefix, and Eventual have identical write latency and cost profiles. The choice between them is purely about semantic correctness.
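Why reading a quorum guarantees the latest value can be shown in a few lines: when the write quorum W and read quorum R overlap (W + R > N), at least one of the R replies carries the newest committed write, so returning the highest-LSN reply is always current. A hypothetical sketch, not the service's actual implementation:

```java
import java.util.Comparator;
import java.util.List;

// Toy quorum read: gather R replica replies, keep the highest-LSN copy.
public class QuorumRead {
    record Versioned(long lsn, long value) {}

    // If W + R > N, every read quorum intersects every write quorum, so
    // at least one reply below holds the latest committed write. Taking
    // the max LSN therefore returns the current value, at the price of
    // contacting multiple replicas per read (hence the 2x RU charge).
    static Versioned read(List<Versioned> replies) {
        return replies.stream()
                .max(Comparator.comparingLong(Versioned::lsn))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Versioned latest = read(List.of(
                new Versioned(1041, 1000),  // lagging replica
                new Versioned(1042, 1500))); // replica with the newest write
        System.out.println("Quorum read sees LSN " + latest.lsn());
    }
}
```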
📊 Visualizing the Five Levels — How Read Visibility Differs Across the Spectrum
Four diagrams follow — each shows the same write operation (depositing money) and read operation, but with different replication and routing behavior. Taken together, they illustrate exactly where each level's guarantee ends.
Strong: Every Region Confirms Before the Client Receives Success
Strong consistency blocks the client until all regions synchronously commit the write. The diagram below shows a three-region account — the client's success message arrives only after West Europe and East US have both confirmed.
```mermaid
sequenceDiagram
    participant C as Client (SEA)
    participant WR as Write Region (SEA)
    participant EU as West Europe
    participant US as East US
    C->>WR: Write W1 (deposit $500)
    WR->>EU: Synchronous replication of W1
    WR->>US: Synchronous replication of W1
    EU-->>WR: ACK — W1 committed
    US-->>WR: ACK — W1 committed
    WR-->>C: Success — W1 globally visible
    C->>EU: Read balance
    EU-->>C: $1500 (W1 visible in all regions)
```
Strong consistency creates a globally agreed "write point" — after the client receives success, any read from any region is guaranteed to return W1 or a later write. If West Europe is partitioned during replication, the write hangs until the partition resolves. Availability is sacrificed for linearizability.
Session: Token-Scoped Read-Your-Own-Writes
Session consistency shows the asymmetry between a client carrying a session token and one without. This is the exact failure pattern from the opening banking app scenario — Lambda invocation B is Client B.
```mermaid
sequenceDiagram
    participant A as Client A (has session token)
    participant B as Client B (new SDK — no token)
    participant DB as Cosmos DB
    A->>DB: Write W1 (deposit $500)
    DB-->>A: Success + session token LSN:1042
    Note over A: SDK stores LSN:1042
    A->>DB: Read balance [token: LSN:1042]
    DB-->>A: $1500 (replica at LSN >= 1042 — guaranteed fresh)
    B->>DB: Read balance [no token]
    DB-->>B: $1000 (nearest replica — may predate W1)
```
Client A carries the LSN from its write and gets a routing guarantee. Client B carries nothing and is treated as an Eventual read. No error is thrown. The difference is invisible without observing the returned values against expected post-write state.
Consistent Prefix: No Gaps, No Reversals in Write Order
Consistent Prefix enforces that replicas only advance their visible state forward and never skip a write. The sequence below shows three commits to an order-tracking system — the replica may lag, but it will never serve W3 without W2.
```mermaid
sequenceDiagram
    participant W as Write Region
    participant R as Read Replica (lagging)
    participant C as Client
    W->>W: Commit W1 (order created)
    W->>W: Commit W2 (payment charged)
    W->>W: Commit W3 (order shipped)
    Note over R: Replication lag — W1 and W2 received, W3 not yet
    C->>R: Read order status
    R-->>C: "Created, payment charged" (prefix W1+W2 — valid)
    Note over R: W3 now replicated
    C->>R: Read order status
    R-->>C: "Order shipped" (full prefix W1+W2+W3 — valid)
    Note over W,R: Prevented: serving W3 without W2 (gap)
    Note over W,R: Prevented: serving W2 then W1 in successive reads (reversal)
```
The protocol enforces a forward-only invariant at the replica. Eventual consistency would permit reading W3, then W1 — the order appeared to ship before it was created. Consistent Prefix makes this impossible while still allowing the replica to lag behind the write region indefinitely.
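The forward-only invariant can be sketched as a toy replica that buffers out-of-order arrivals and applies writes strictly in commit order. Names are illustrative; the real replication protocol is far more elaborate:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy replica enforcing Consistent Prefix: no gaps, no reversals, lag allowed.
public class PrefixReplica {
    private final Map<Long, String> pending = new HashMap<>(); // early arrivals
    private final List<String> applied = new ArrayList<>();    // visible prefix
    private long nextLsn = 1;                                  // next in-order LSN

    void receive(long lsn, String write) {
        pending.put(lsn, write);
        // Drain the buffer as long as the next-in-order write is present;
        // W3 arriving before W2 sits in `pending` until W2 fills the gap.
        while (pending.containsKey(nextLsn)) {
            applied.add(pending.remove(nextLsn));
            nextLsn++;
        }
    }

    List<String> visibleState() {
        return List.copyOf(applied);
    }

    public static void main(String[] args) {
        PrefixReplica r = new PrefixReplica();
        r.receive(1, "created");
        r.receive(3, "shipped");  // early: buffered, not yet visible
        System.out.println(r.visibleState());
        r.receive(2, "charged");  // gap filled: W2 then W3 apply in order
        System.out.println(r.visibleState());
    }
}
```

Reads against this replica can be stale, but they can never observe "shipped" without "charged", which is the entire contract.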
Eventual: Gossip Replication with No Ordering Constraints
```mermaid
graph TD
    WR["Write Region<br/>Commit W1 then W2 then W3<br/>Ack after local quorum only"]
    WR -->|"Async gossip — unordered"| R1["Read Region A<br/>Current state: W1 and W3<br/>W2 not yet received"]
    WR -->|"Async gossip — unordered"| R2["Read Region B<br/>Current state: W3 only<br/>W1 and W2 not yet received"]
    WR -->|"Async gossip — unordered"| R3["Read Region C<br/>Current state: W1, W2, W3<br/>Fully replicated"]
    style WR fill:#27ae60,color:#fff
    style R1 fill:#e67e22,color:#fff
    style R2 fill:#c0392b,color:#fff
    style R3 fill:#27ae60,color:#fff
```
Each read region has received a different subset of writes in a different order — all three states are valid under Eventual consistency. Region A has W1 and W3 but not W2 (a gap that Consistent Prefix prevents). Region B has only W3 (a severe reversal in the making). Region C is fully caught up. Given time with no new writes, all three will converge, but there is no time bound on when.
🌍 Real-World Deployments — Which Level Serves Banking, Social Media, and Audit Logs
Financial Services: Account Balances and Inventory Counters
Banking applications require Strong consistency for balance reads. A balance that shows the pre-deposit amount for even a single read can trigger incorrect overdraft fees, failed payment authorizations, or reconciliation discrepancies. The 2x RU cost and higher write latency are justified — incorrect balance data causes real monetary loss.
E-commerce inventory counters face the same trade-off at higher write frequency. A flash sale listing 100 units where 200 simultaneous reads all see "available" before the last unit sells results in overselling. Teams use either Strong consistency or conditional writes (optimistic concurrency with ETag) to prevent duplicate reservations. Bounded Staleness (T=30s) is a viable middle ground if a brief availability window during depletion is acceptable.
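The conditional-write pattern mentioned above can be sketched as an in-memory compare-and-swap. In the real Java SDK the same check rides on the item's ETag via an if-match request option; everything below is an illustrative stand-in, not Cosmos DB code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy optimistic-concurrency store: a write succeeds only if the caller's
// snapshot is still current, mirroring an ETag-conditional replace.
public class InventoryCas {
    record Item(String etag, int unitsLeft) {}

    private final AtomicReference<Item> current;

    InventoryCas(int units) {
        current = new AtomicReference<>(new Item("v1", units));
    }

    Item read() {
        return current.get();
    }

    // Returns true if the reservation won; false means another writer got
    // there first (stale snapshot) and the caller must re-read and retry.
    boolean reserve(Item seen) {
        if (seen.unitsLeft() <= 0) return false; // sold out
        Item next = new Item(seen.etag() + "'", seen.unitsLeft() - 1);
        return current.compareAndSet(seen, next);
    }
}
```

Two buyers who both read "1 unit left" cannot both succeed: the second conditional write fails because its snapshot no longer matches, which is exactly how the ETag check prevents overselling without Strong reads.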
Social Feeds and Activity Counters
Twitter processes billions of like and view events daily. The precise like count on a post seconds after a viral moment is irrelevant — what matters is throughput and eventual correctness. Eventual consistency with high-frequency async writes and periodic read-time aggregation is the standard pattern. The 2–5 second lag before a like appears for another user is invisible in practice.
Shopping Carts and User Preferences
A user adding items to a shopping cart and immediately viewing it expects to see those items. Session consistency provides read-your-own-writes at 1x RU cost — the correct default for ~90% of user-facing web and mobile features. The session token is managed automatically by the SDK in singleton-client architectures.
Audit Logs and Event History
A financial audit log must show events in the exact order they were committed. If a compliance officer reads the log and sees a withdrawal before the transfer that funded it, the audit is incorrect. Consistent Prefix provides the necessary ordering guarantee without requiring Strong consistency on the high-frequency write path. Entries may appear 30–60 seconds behind real time — acceptable. Entries appearing out of sequence — never acceptable.
⚖️ Trade-offs and Failure Modes Across All Five Levels
Full Guarantee Comparison
| Level | Read-your-own-writes | Monotonic reads | Global ordering | Max staleness | Multi-write support | RU multiplier |
|---|---|---|---|---|---|---|
| Strong | Always | Always | Global linear order | None — always latest | Not supported | 2x reads |
| Bounded Staleness | After staleness window | Yes | Single-write-region only | K versions or T seconds | Degrades to CP ordering | 1x |
| Session | Within session only | Within session only | No cross-session guarantee | Unbounded | Fully supported | 1x |
| Consistent Prefix | No | No gaps or reversals | Ordered prefix only | Unbounded | Fully supported | 1x |
| Eventual | No | No — reversals possible | None | Unbounded | Fully supported | 1x |
Failure Modes That Catch Teams Off-Guard
Strong + multi-region write = hard platform rejection. Enabling multi-region writes and setting Strong consistency produces a configuration error. This is not a performance advisory — it is a hard platform limit. The synchronous cross-region coordination that Strong requires is physically incompatible with accepting concurrent writes from multiple regions.
Session + stateless functions = invisible staleness. Lambda functions and Azure Functions creating a new SDK client per invocation silently break Session consistency. There is no error, no warning, and no exception. Reads simply return from any replica without the session token constraint. The staleness is detectable only by comparing read timestamps against write timestamps — which most applications do not instrument.
Bounded Staleness + multi-region write = silent ordering degradation. Unlike Strong, this combination produces no error. The K/T staleness bound continues to apply, but the monotonic read and global ordering guarantees degrade to Consistent Prefix semantics. Teams often miss this during multi-region write migrations.
Eventual for sequential data = logic corruption. Any data with sequential semantic dependencies — order state machines, payment workflows, document version history — is unsafe under Eventual. A client reading "shipped" then "created" in successive reads and transitioning state based on those reads may apply invalid transitions. Eventual consistency does not cause this occasionally — it permits it by definition.
🧭 Choosing the Right Level — Decision Flowchart and Use-Case Reference
The right consistency level depends on three questions: Does stale data cause financial or correctness harm? Do users need to see their own writes immediately? Does the order of writes matter?
```mermaid
flowchart TD
    START["What consistency level do I need?"]
    START --> Q1{"Is stale data business-critical?<br/>money / inventory / compliance"}
    Q1 -->|"Yes"| Q2{"Single write region or multi-region writes?"}
    Q2 -->|"Single write region"| STRONG["Strong<br/>Linearizable — no staleness<br/>2x RU on reads"]
    Q2 -->|"Multi-region writes"| BS["Bounded Staleness<br/>Strongest with multi-write<br/>Set T or K to your SLA"]
    Q1 -->|"No"| Q3{"Must users see their own writes immediately?"}
    Q3 -->|"Yes"| SESSION["Session — Default<br/>Read-your-own-writes via token<br/>Maintain token in serverless!"]
    Q3 -->|"No"| Q4{"Does write ORDER matter?<br/>audit log or event history"}
    Q4 -->|"Yes — sequence matters"| CP["Consistent Prefix<br/>No gaps, no reversals<br/>Recency not guaranteed"]
    Q4 -->|"No — any order is fine"| EVENTUAL["Eventual<br/>Max throughput<br/>Only for staleness-tolerant data"]
    style STRONG fill:#c0392b,color:#fff
    style BS fill:#e67e22,color:#fff
    style SESSION fill:#27ae60,color:#fff
    style CP fill:#2980b9,color:#fff
    style EVENTUAL fill:#8e44ad,color:#fff
```
Walk this flowchart per-feature, not per-account. Cosmos DB allows per-request consistency overrides that weaken (never strengthen) the account default. An account configured with a Strong default can therefore serve balance reads at that default while overriding product catalog reads down to Eventual.
| Use Case | Recommended Level | Key Reason |
|---|---|---|
| Bank account balance | Strong | Stale balance causes financial harm |
| Inventory check (exact) | Strong or Bounded Staleness | Oversell risk — T=30s may be acceptable |
| Near-real-time dashboard | Bounded Staleness | T=60s lag acceptable; ordering useful |
| Shopping cart items | Session | User must see items just added |
| User profile / preferences | Session | Read-your-own-writes; low write volume |
| Financial audit log | Consistent Prefix | Sequence correctness; real-time not required |
| Order history display | Consistent Prefix | Ordering matters; seconds of lag acceptable |
| Social media feed | Eventual | 2–5s stale: invisible to users |
| Like and view counters | Eventual | Counter accuracy within 1% irrelevant |
| Product catalog reads | Session or Eventual | Slight staleness acceptable |
🧪 The Lambda Session Token Trap — A Serverless Consistency Failure Dissected
This section walks through the exact failure the banking app experienced step by step and shows the fix.
Why this scenario matters: It represents the most common Cosmos DB production consistency bug. Teams adopt Session consistency as the default (the correct choice) and deploy on serverless platforms (the correct architecture) but never test cross-invocation read-after-write consistency. The bug surfaces only under real traffic patterns.
What to look for: How the session token creates an implicit dependency between write and read invocations, and what changes when the fix is applied.
The Failing Flow — New SDK Client Per Invocation
Lambda Invocation A — Deposit handler:

```text
1. new CosmosClient()  → session token map = empty
2. Write W1: deposit $500
3. Response includes session token "LSN:1042"
4. Token stored in: invocation-local memory only
5. Lambda returns 200 OK
   ← Container may spin down; token is lost
```

Lambda Invocation B — Balance query handler (new or reused container):

```text
1. new CosmosClient()  → session token map = empty   ← THE BUG
2. Read balance [no session token header]
3. Cosmos DB routes to nearest replica (any LSN)
4. Nearest replica has LSN:1039 — W1 not replicated yet
5. Returns $1,000 — pre-deposit balance
   ← User sees incorrect balance; no error thrown
```
No error is thrown because Cosmos DB behaves correctly per the Session consistency contract. No session token means no LSN routing constraint. The replica's response is valid. The bug is a contract violation in how the application uses the SDK, not in the database.
The Fix — Externalizing the Session Token
Three approaches, in order of simplicity:
1. Pass the token as a response header in the API layer. The deposit handler includes the session token in its HTTP response. The API gateway passes it as a header on the immediately following balance-read request. The balance handler injects it into CosmosItemRequestOptions. Zero infrastructure overhead.
2. Store the token in a low-latency key-value store. After writing a deposit, serialize the session token to Redis keyed by userId. When handling a balance read for the same user, retrieve the token and inject it into the read options. Works for async flows where write and read are in separate user interactions.
3. Use a singleton SDK client outside the handler. Initialize the Cosmos DB client once per Lambda container (outside the handler function), not once per invocation. The SDK's in-memory token map persists across warm invocations. Effective for high-concurrency functions where containers stay warm; provides no guarantee on cold starts.
The session token is a 20–40 byte string. The overhead of persisting and retrieving it is negligible compared to the cost of a Cosmos DB operation.
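The difference between fix #3 and the buggy per-invocation pattern can be shown with a toy warm container. Class and method names here are invented for illustration; the `TokenClient` stands in for the Cosmos DB SDK client and its in-memory token map:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of fix #3: the client (and its session-token map) lives in a
// static field, initialised once per container, not once per invocation.
public class SingletonClientDemo {
    static class TokenClient {
        final Map<String, Long> tokens = new HashMap<>();
        void write(String partition, long lsn) { tokens.put(partition, lsn); }
        Long tokenFor(String partition) { return tokens.get(partition); }
    }

    // Initialised once when the container loads; survives warm invocations.
    static final TokenClient SHARED = new TokenClient();

    // Deposit handler: writes and leaves the token in the shared client.
    static void handleDeposit() {
        SHARED.write("acct", 1042L);
    }

    // Balance handler in a later warm invocation still sees the token.
    static Long handleBalanceRead() {
        return SHARED.tokenFor("acct");
    }

    // The buggy pattern: a new client per invocation starts with an empty
    // map, so the read carries no token and gets Eventual-like routing.
    static Long brokenPerInvocationRead() {
        TokenClient fresh = new TokenClient();
        return fresh.tokenFor("acct");
    }
}
```

Note the caveat from the list above still applies: a cold start creates a new container and a new `SHARED`, so the singleton pattern helps only across warm invocations; approaches 1 and 2 cover cold starts.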
🛠️ Azure SDK and CLI: Configuring Consistency Per Operation
Cosmos DB supports consistency configuration at two levels: the account default (ceiling for all operations) and per-request overrides (can only weaken, never strengthen beyond the account default).
Account Default via Azure CLI
```shell
# Set account-level consistency default
az cosmosdb update \
  --name mycosmosaccount \
  --resource-group myrg \
  --default-consistency-level Session

# For Bounded Staleness: configure K and T bounds
az cosmosdb update \
  --name mycosmosaccount \
  --resource-group myrg \
  --default-consistency-level BoundedStaleness \
  --max-staleness-prefix 100000 \
  --max-interval 300

# Valid values: Strong | BoundedStaleness | Session | ConsistentPrefix | Eventual
```
Per-Request Override and Session Token Passthrough in the Java SDK
```java
// SDK client with account default (Session)
CosmosClient client = new CosmosClientBuilder()
    .endpoint(System.getenv("COSMOS_ENDPOINT"))
    .key(System.getenv("COSMOS_KEY"))
    .consistencyLevel(ConsistencyLevel.SESSION)
    .buildClient();

// Per-request WEAKER override: Session -> Eventual (allowed — weakening only)
CosmosItemRequestOptions catalogOptions = new CosmosItemRequestOptions();
catalogOptions.setConsistencyLevel(ConsistencyLevel.EVENTUAL);
container.readItem(productId, partitionKey, catalogOptions, Product.class);

// Session token extraction and passthrough for the serverless fix:
CosmosItemResponse<BankAccount> depositResponse =
    container.createItem(deposit, partitionKey, new CosmosItemRequestOptions());
String sessionToken = depositResponse.getSessionToken();
// Persist: SET session:{userId} {sessionToken} EX 300 (in Redis)

// On the next read invocation — retrieve and inject:
String storedToken = redisClient.get("session:" + userId);
CosmosItemRequestOptions balanceOptions = new CosmosItemRequestOptions();
if (storedToken != null) {
    balanceOptions.setSessionToken(storedToken);
}
container.readItem(accountId, partitionKey, balanceOptions, BankAccount.class);
```
The session token passthrough is the fix for the banking app. depositResponse.getSessionToken() returns the LSN-encoded token from the write. Injecting it into the subsequent read forces routing to a replica at or ahead of that LSN. Per-request consistency can only weaken: an account configured at Session can override to Eventual per-request, but cannot override to Strong.
For the full Java SDK v4 documentation including session token management, see the Azure Cosmos DB Java SDK v4 reference.
📚 Lessons Learned from Production Cosmos DB Deployments
"Session" means client-session-token continuity — not HTTP session, not database session. The token is an SDK-managed string encoding the LSN your client has seen. A new SDK client instance starts with an empty token and zero read-your-own-writes guarantee until it issues a write.
Every new SDK client is a fresh consistency start. Lambda functions, Azure Functions, and any stateless runtime that instantiates a new SDK client per invocation silently break Session guarantees. No error is thrown. Reads return stale data from whichever replica happens to respond.
Strong consistency and multi-region writes are mutually exclusive — hard platform constraint. Not a performance recommendation. Attempting to combine them produces a configuration error. Choose between active-active writes and linearizable reads.
Eventual consistency allows genuinely out-of-order reads with no time bound. "Eventual" does not mean "consistent within a few seconds." A read can return an older value than the previous read, in the same session, indefinitely. Never use Eventual for data with sequential semantic dependencies.
Bounded Staleness is the underused middle ground. Teams often jump from Session to Strong for "fresher" reads, paying 2x RU unnecessarily. For dashboards and analytics where a configurable lag is acceptable, Bounded Staleness provides ordering guarantees at 1x RU cost.
Per-request overrides can only weaken consistency, never strengthen it. Design the account default around the most demanding use case. Weaker overrides per-request are free; stronger overrides are impossible.
📌 TLDR — Five-Bullet Decision Cheat Sheet
- Strong → Linearizable globally, always latest, 2x RU on reads, incompatible with multi-region writes. Use for account balances and inventory where any staleness causes financial harm.
- Bounded Staleness → Reads lag by at most K versions or T seconds. Same write latency as Eventual, with ordering guarantees in single-write-region mode. Use for dashboards and compliance scenarios with explicit staleness SLAs.
- Session → Read-your-own-writes within a single SDK client session via session token. Token must travel with the client — stateless functions silently break it. Use for shopping carts, user preferences, and any user-specific write-then-read flow.
- Consistent Prefix → No gaps, no reversals in write order. Reads may be stale but always see writes in sequence. Use for audit logs and event history where ordering matters but real-time does not.
- Eventual → Zero ordering guarantees, maximum throughput, 1x RU. Successive reads can return older values than previous reads. Use only for staleness-tolerant, non-sequential data like social feeds and view counters.
📝 Practice Quiz
A Lambda function creates a new Cosmos DB SDK client on every invocation and uses Session consistency. A user deposits money in invocation A. Invocation B reads the account balance. Why might the read return the pre-deposit amount?
A) Session consistency guarantees are limited to the write region only
B) The deposit write is not fully committed until both invocations complete
C) Each new SDK client starts with an empty session token, so the read has no LSN routing constraint and may be served from any replica including one that has not replicated the deposit
D) Cosmos DB does not support Session consistency for financial workloads

Correct Answer: C
Your Cosmos DB account has multi-region writes enabled. A stakeholder requests Strong consistency for all account balance reads. What is your response and what alternative do you offer?
A) Strong consistency is available but requires provisioned throughput at 10000 RU/s or higher
B) Strong consistency is not supported with multi-region write configurations — this is a hard platform limit. The strongest available option with multi-region writes is Bounded Staleness, which provides ordering guarantees and a configurable K/T freshness bound
C) Strong consistency can be applied per-request even if the account default is Eventual
D) Strong consistency works with multi-region writes but adds 500 ms of latency

Correct Answer: B
What is the concrete difference between Consistent Prefix and Eventual consistency? Give an example of what Eventual allows that Consistent Prefix prevents.
A) Consistent Prefix is always faster than Eventual because it applies ordering at write time
B) Consistent Prefix prevents read reversals and gaps: if W1, W2, W3 were committed, no read will see W3 without W2. Eventual makes no such promise — a client can read W3, then W1, then W2 in successive reads. For an order tracker, Eventual allows reading "shipped" then "created" — as if the order shipped before it was placed
C) There is no difference — both allow out-of-order reads
D) Consistent Prefix guarantees reads are never more than one write behind Eventual

Correct Answer: B
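The distinction in answer B can be checked mechanically. A hypothetical sketch: model each read as the highest write version it observed (W1=1, W2=2, ...). Under Consistent Prefix, every read returns some prefix of the write history and prefixes never move backward, so the observed versions must be non-decreasing; Eventual imposes no such constraint.

```python
# Sketch: could this sequence of observed versions have come from
# Consistent Prefix reads? Reversals rule it out; staleness does not.
from typing import List


def valid_consistent_prefix(observed: List[int]) -> bool:
    # Non-decreasing observed versions: reads may lag, but never reverse.
    return all(a <= b for a, b in zip(observed, observed[1:]))


assert valid_consistent_prefix([1, 2, 2, 3])  # in order, possibly stale
assert not valid_consistent_prefix([3, 1, 2])  # reversal: Eventual only
```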
Bounded Staleness is configured with maxIntervalInSeconds=60. Write W1 occurs at t=0. Is a read at t=55s guaranteed to see W1? What about at t=65s?
A) Both t=55s and t=65s are guaranteed to see W1
B) t=55s is guaranteed; t=65s is not because the window resets after each read
C) t=55s is not guaranteed (the 60s staleness window has not expired). t=65s is guaranteed (the window has expired and W1 must be visible to all reads)
D) Neither is guaranteed — Bounded Staleness only provides ordering, not freshness bounds

Correct Answer: C
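The time arithmetic behind answer C is a one-liner. A minimal sketch, assuming the simplified model used in the question (single T bound, no K bound): inside the window a read may be stale; once the window has elapsed, the write must be visible.

```python
# Sketch: is a read guaranteed to observe a write under a
# maxIntervalInSeconds bound of T? Illustrative model, not SDK code.
def read_guaranteed_fresh(write_time_s: float, read_time_s: float,
                          max_interval_s: float) -> bool:
    # Freshness is only guaranteed once the staleness window has elapsed.
    return read_time_s - write_time_s >= max_interval_s


T = 60
assert not read_guaranteed_fresh(0, 55, T)  # inside window: may be stale
assert read_guaranteed_fresh(0, 65, T)      # window elapsed: must see W1
```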
Architecture design challenge (Open-ended — no single correct answer): Your team is building a multi-region Cosmos DB account with multi-region writes enabled to minimize write latency globally. You need to serve two features: a financial audit log (entries must be sequentially correct, real-time not required) and a real-time spend dashboard (must refresh within 30 seconds, ordering irrelevant). Which consistency levels would you select for each feature, and how would you configure the account default and per-request overrides? Consider the multi-region write constraint.
A strong answer should address: why Strong is unavailable (the multi-region write constraint), why Consistent Prefix suits the audit log (ordering without recency), why Bounded Staleness with T=30s or Eventual suits the dashboard, how the account default should be set to the stricter of the two requirements (Consistent Prefix), and how per-request weakening to Eventual covers the dashboard reads.
🔗 Related Posts
- CAP Theorem and Consistency in Distributed Systems — The CAP theorem foundation explaining the mathematical trade-offs behind Cosmos DB's five-level spectrum
- Consistency Patterns in Distributed Systems — Practical consistency patterns (saga, two-phase commit, eventual convergence) applied to distributed architectures
- Understanding Consistency Patterns: An In-Depth Analysis — Deep dive into linearizability, sequential consistency, and causal consistency formal models
- System Design: Replication and Failover — How multi-region replication works at the infrastructure level and how failover affects consistency guarantees
- Choosing the Right Database: CAP Theorem and Use Cases — Database selection guide using CAP trade-offs, placing Cosmos DB in context alongside other distributed databases
- System Design: Distributed Transactions — How distributed transaction patterns interact with consistency levels in multi-service architectures

Written by
Abstract Algorithms
@abstractalgorithms