Azure Cosmos DB Consistency Levels Explained: Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual

What each consistency level actually guarantees — and the production pitfalls that happen when you get it wrong

Abstract Algorithms · 26 min read

TLDR: Cosmos DB offers five consistency levels — Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — each with precise, non-obvious internal mechanics. Session does not mean HTTP session; it means a client-side token that tracks what you have seen. Strong is unavailable with multi-region writes. Eventual allows genuinely out-of-order reads. Picking the wrong level costs you either correctness or throughput — and the bugs are silent.

🔥 The Banking App That Lost Money Three Times a Week

A fintech team running a banking application on Azure Cosmos DB filed a production incident: users in Southeast Asia were occasionally seeing their pre-deposit balance immediately after depositing money. The account balance showed the old amount as if the deposit had never happened. Transactions were completing successfully. Money was being transferred. But 3% of balance reads returned stale data.

The team had set their Cosmos DB account to Session consistency, which they understood to mean "consistent within a database session." They expected that any read following a write would see that write. So where was the staleness coming from?

The culprit was in their serverless architecture. Each AWS Lambda invocation instantiated a new Cosmos DB SDK client. In Cosmos DB, Session consistency does not mean "database session" in the SQL sense — it means client-session-token continuity. When a Lambda function wrote a deposit and received a session token encoding that write's version, that token lived only in that Lambda instance's memory. The next Lambda invocation handling the balance read started fresh with no session token. Cosmos DB therefore had no obligation to serve the read from a fresh replica. It returned data from any replica, including one that had not yet replicated the deposit.

This confusion is pervasive. Cosmos DB's five consistency levels have precise, non-obvious definitions that diverge sharply from the casual English meanings of words like "session," "eventual," and "consistent." This post explains what each level internally guarantees, how the mechanism enforces that guarantee, and exactly when it silently breaks.


📖 The Five Consistency Levels — What the Words Actually Mean in Cosmos DB

Cosmos DB exposes five consistency levels, ordered from strongest to weakest. Before examining each mechanically, it is worth seeing their contracts side by side — the casual meanings of these words will mislead you if you rely on them.

| Level | One-Line Contract | The Word That Lies |
| --- | --- | --- |
| Strong | Every read returns the globally latest committed write — always | "Strong" understates it: this is full linearizability |
| Bounded Staleness | Reads lag behind writes by at most K versions OR T seconds | "Bounded" sounds reassuring — but the bound can be minutes |
| Session | Within a single SDK client session: read-your-own-writes, monotonic reads, monotonic writes | "Session" is not HTTP session, not DB connection — it is a client token |
| Consistent Prefix | Reads never see writes out of order — no gaps, no reversals | "Consistent" suggests freshness — it says nothing about recency |
| Eventual | All replicas eventually converge; reads may return any committed version in any order | "Eventual" feels like "soon" — it can mean out-of-order reads indefinitely |

The ordering matters: every level to the right trades away consistency guarantees in exchange for lower write latency, higher availability, or both. The following spectrum diagram places each level against its key implications:

graph LR
    S[" Strong Linearizable 2x RU cost No multi-write support"]
    BS["⏳ Bounded Staleness Lag up to K versions or T secs Ordering in single-write region"]
    SE["🪙 Session Read-your-own-writes 1x RU — token-scoped"]
    CP["🔗 Consistent Prefix No gaps or reversals 1x RU — ordering only"]
    EV["⚡ Eventual Max throughput 1x RU — zero ordering"]

    S -->|"Weaker"| BS
    BS --> SE
    SE --> CP
    CP -->|"Weakest"| EV

    style S fill:#c0392b,color:#fff
    style BS fill:#e67e22,color:#fff
    style SE fill:#27ae60,color:#fff
    style CP fill:#2980b9,color:#fff
    style EV fill:#8e44ad,color:#fff

This spectrum shows that Strong stands uniquely apart — it is the only level with a 2x RU cost on reads and the only level incompatible with multi-region writes. Session, Consistent Prefix, and Eventual share identical write latency and cost; their only difference is the semantic ordering guarantee each provides.


🔍 The CAP and PACELC Trade-offs That Forced Five Levels Into Existence

A single consistency level cannot serve every application in a globally distributed database. To understand why, start with a physical constraint: replication across regions takes time. Tokyo to London is roughly 9,500 km — at the speed of light, a minimum 31 ms one-way latency even before network overhead. Real cross-region replication typically adds another 20–100 ms.
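The latency floor above is plain arithmetic. A quick sketch (assuming the approximate 9,500 km great-circle figure quoted above) reproduces it:

```java
public class LightLatencyFloor {
    public static void main(String[] args) {
        double distanceKm = 9_500.0;                      // Tokyo to London, approximate great-circle distance
        double lightSpeedKmPerMs = 299_792.458 / 1_000.0; // ~299.8 km per millisecond in vacuum
        double oneWayMs = distanceKm / lightSpeedKmPerMs;
        // Fiber propagates at roughly 2/3 c and routes are not great circles,
        // so real one-way latency sits well above this physical floor.
        System.out.printf("one-way floor: %.1f ms%n", oneWayMs); // ~31.7 ms
    }
}
```

No protocol design can get a synchronous cross-region acknowledgment under this floor, which is why every "wait for the remote region" decision is a real latency decision.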

The CAP Theorem formalizes the resulting dilemma: under a network partition, a distributed system can guarantee either Consistency (every read returns the latest write) or Availability (every request gets a response) — but not both simultaneously.

The PACELC refinement extends this: even without a partition, every operation faces a choice between Latency and Consistency. A linearizable write must synchronize across all replicas before returning — which costs round-trip latency. A low-latency write can return before remote replicas confirm — which risks stale reads.

One consistency level cannot serve all applications because their requirements are genuinely incompatible:

  • A bank account balance requires reads to always return the latest write. Stale data causes incorrect overdraft checks or duplicate withdrawals. The team can afford the latency penalty of Strong consistency.
  • A social media like counter can tolerate reading 998 likes instead of 1,001 for two seconds. The business cannot afford the latency penalty of strong consistency at 100 million operations per second.
  • A shopping cart needs read-your-own-writes: if a user adds an item and immediately views the cart, the item must appear. Whether another user's view of that cart is stale is irrelevant.
  • An audit log needs sequential ordering: entries must appear in the order they were committed. An entry written 30 seconds ago is fine; an entry appearing out of sequence is not.

These four requirements are genuinely incompatible in a single consistency model. Cosmos DB's five levels are an explicit, configurable trade-off menu designed to let you match the consistency model to each feature's actual needs.


⚙️ How Cosmos DB Enforces Each Level — Quorums, Version Watermarks, and Token Routing

Each consistency level is enforced through a different combination of three mechanisms: write quorum policy, read routing policy, and version tracking.

Strong — Write quorum spans ALL configured regions. A write returns success only after every region has durably committed it. Reads must route to the primary write region or a validated quorum — no lagging replica may serve a Strong read. Write latency equals the round-trip time to the farthest configured region.

Bounded Staleness — Write quorum covers only the local write region (same as Eventual). Cosmos DB tracks a per-region version watermark — a pointer to the oldest version any read in that region may serve. The watermark advances as replication catches up. A read cannot return data older than the watermark, bounding staleness to at most K versions or T seconds behind the write region.

Session — Write quorum covers only the local write region. The key mechanism is a session token — an opaque string returned in every write response, encoding the write's logical sequence number (LSN). The SDK stores this token and includes it in every subsequent read request. Cosmos DB routes the read to a replica whose committed LSN is at least equal to the token, guaranteeing read-your-own-writes. Requests without a token have no routing constraint.

Consistent Prefix — Write quorum covers only the local write region. The replication protocol enforces that replicas apply writes in commit order — a replica cannot apply W3 before W2, even if W3 arrived over the network first. This prevents gaps and reversals. No promise is made about how far behind a replica may be.

Eventual — Write quorum covers only the local write region. Reads serve from whichever replica responds first with no version constraint. Replication is gossip-based and asynchronous — replicas may receive and apply writes in any order. The only guarantee is eventual convergence.


🧠 Deep Dive: Session Token Internals and Staleness Watermarks Under the Hood

Internals: How Session Tokens and Version Watermarks Track Distributed State

Session token mechanics operate entirely at the SDK layer. When you create a Cosmos DB client, it holds an empty partition-to-LSN map. After every write, the server response includes a session token — a string encoding the logical sequence number of the write on the relevant partition. The SDK stores this LSN per partition in memory.

On every subsequent read, the SDK attaches the stored LSN as a request header (x-ms-session-token). The Cosmos DB gateway receives this header and routes the request to a replica whose committed LSN for that partition is at least equal to the requested value. If the nearest replica has not replicated that far, the gateway either waits briefly for it to catch up or routes to a farther replica that has — this is transparent to the client.

The critical implication: the session token is a client-side object. It does not live in the database. It lives in the SDK instance memory. Creating a new SDK client creates a new, empty token map — making every read from that new client identical to an Eventual read until the client issues its first write.
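To make the token-and-routing behavior concrete, here is a toy model of the rule the gateway applies. The class, the `Replica` record, and the two-replica setup are invented for illustration; the real routing logic is far richer.

```java
import java.util.List;

// Toy model of Session-consistency read routing (illustrative only).
public class SessionRouting {
    record Replica(String name, long committedLsn) {}

    // With a session token, route to the closest replica that has caught up
    // to the token's LSN; with no token, any replica may serve the read.
    static Replica route(List<Replica> byProximity, Long tokenLsn) {
        if (tokenLsn == null) return byProximity.get(0); // nearest replica, any version
        return byProximity.stream()
                .filter(r -> r.committedLsn() >= tokenLsn)
                .findFirst()
                .orElseThrow(); // the real service would wait or route farther away
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("nearest-lagging", 1039),  // deposit not replicated yet
                new Replica("farther-fresh", 1043));   // holds the deposit (LSN 1043)

        // Fresh SDK client (empty token map): behaves like an Eventual read.
        System.out.println(route(replicas, null).name());   // nearest-lagging
        // Client carrying the write's token: routed past the stale replica.
        System.out.println(route(replicas, 1043L).name());  // farther-fresh
    }
}
```

The two calls in `main` are exactly the banking app's two Lambda invocations: same data, same replicas, different outcome purely because one request carries a token.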

Staleness watermark mechanics work server-side. Cosmos DB maintains a per-region "safe read version" that advances continuously as the replication pipeline commits writes. For Bounded Staleness, the service guarantees that the safe read version at any time T is at least as fresh as all writes committed before T - maxIntervalInSeconds. Reads in a region are served only from the safe version or newer. This is enforced at the storage engine level — a read arriving at a replica with a too-old safe version is either delayed or promoted to a quorum read.

The following sequence diagram shows how Strong and Session diverge in their routing behavior for the same write operation. Strong blocks the client until all regions confirm; Session returns immediately after the local quorum commits and relies on the client's session token for subsequent read routing.

sequenceDiagram
    participant C as Client
    participant WR as Write Region (SEA)
    participant EU as Read Region (West EU)
    participant US as Read Region (East US)

    Note over C,US: Strong Consistency — synchronous global quorum
    C->>WR: Write W1 (deposit $500)
    WR->>EU: Synchronous replication
    WR->>US: Synchronous replication
    EU-->>WR: ACK committed
    US-->>WR: ACK committed
    WR-->>C: Success + LSN:1042 (all regions confirmed)

    Note over C,US: Session Consistency — async replication + token routing
    C->>WR: Write W2 (deposit $200)
    WR-->>C: Success + session-token LSN:1043 (local quorum only)
    C->>EU: Read balance [x-ms-session-token: LSN:1043]
    EU-->>C: $1700 (routed to replica at LSN >= 1043)

This diagram highlights the fundamental difference: Strong waits for all three regions before returning to the client, adding 80–150 ms of cross-region round-trip to every write. Session returns after the local quorum only, then uses the session token to route subsequent reads to an up-to-date replica without blocking the write path.

Performance Analysis: Write Latency, RU Cost, and Throughput Across All Five Levels

| Level | Write latency driver | Read latency driver | RU multiplier | Throughput impact |
| --- | --- | --- | --- | --- |
| Strong | Round-trip to ALL configured regions | Primary region quorum | 2x reads | ~50% vs Eventual |
| Bounded Staleness | Local write region quorum | Nearest replica within watermark | 1x | Near-Eventual |
| Session | Local write region quorum | Nearest replica with matching LSN | 1x | Near-Eventual |
| Consistent Prefix | Local write region quorum | Nearest replica (any ordered version) | 1x | Equivalent to Eventual |
| Eventual | Local write region quorum | Nearest replica (any version) | 1x | Maximum |

The 2x RU cost on Strong consistency reads comes from internal quorum reads required to guarantee linearizability — every Strong read confirms the current version across a replica quorum before returning, doubling the compute cost. Session, Consistent Prefix, and Eventual have identical write latency and cost profiles. The choice between them is purely about semantic correctness.


📊 Visualizing the Five Levels — How Read Visibility Differs Across the Spectrum

Four diagrams follow — each shows the same write operation (depositing money) and read operation, but with different replication and routing behavior. Taken together, they illustrate exactly where each level's guarantee ends.

Strong: Every Region Confirms Before the Client Receives Success

Strong consistency blocks the client until all regions synchronously commit the write. The diagram below shows a three-region account — the client's success message arrives only after West Europe and East US have both confirmed.

sequenceDiagram
    participant C as Client (SEA)
    participant WR as Write Region (SEA)
    participant EU as West Europe
    participant US as East US

    C->>WR: Write W1 (deposit $500)
    WR->>EU: Synchronous replication of W1
    WR->>US: Synchronous replication of W1
    EU-->>WR: ACK — W1 committed
    US-->>WR: ACK — W1 committed
    WR-->>C: Success — W1 globally visible

    C->>EU: Read balance
    EU-->>C: $1500 (W1 visible in all regions)

Strong consistency creates a globally agreed "write point" — after the client receives success, any read from any region is guaranteed to return W1 or a later write. If West Europe is partitioned during replication, the write hangs until the partition resolves. Availability is sacrificed for linearizability.

Session: Token-Scoped Read-Your-Own-Writes

Session consistency shows the asymmetry between a client carrying a session token and one without. This is the exact failure pattern from the opening banking app scenario — Lambda invocation B is Client B.

sequenceDiagram
    participant A as Client A (has session token)
    participant B as Client B (new SDK — no token)
    participant DB as Cosmos DB

    A->>DB: Write W1 (deposit $500)
    DB-->>A: Success + session token LSN:1042
    Note over A: SDK stores LSN:1042

    A->>DB: Read balance [token: LSN:1042]
    DB-->>A: $1500 (replica at LSN >= 1042 — guaranteed fresh)

    B->>DB: Read balance [no token]
    DB-->>B: $1000 (nearest replica — may predate W1)

Client A carries the LSN from its write and gets a routing guarantee. Client B carries nothing and is treated as an Eventual read. No error is thrown. The difference is invisible without observing the returned values against expected post-write state.

Consistent Prefix: No Gaps, No Reversals in Write Order

Consistent Prefix enforces that replicas only advance their visible state forward and never skip a write. The sequence below shows three commits to an order-tracking system — the replica may lag, but it will never serve W3 without W2.

sequenceDiagram
    participant W as Write Region
    participant R as Read Replica (lagging)
    participant C as Client

    W->>W: Commit W1 (order created)
    W->>W: Commit W2 (payment charged)
    W->>W: Commit W3 (order shipped)

    Note over R: Replication lag — W1 and W2 received, W3 not yet

    C->>R: Read order status
    R-->>C: "Created, payment charged" (prefix W1+W2 — valid)

    Note over R: W3 now replicated

    C->>R: Read order status
    R-->>C: "Order shipped" (full prefix W1+W2+W3 — valid)

    Note over W,R: Prevented: serving W3 without W2 (gap)
    Note over W,R: Prevented: serving W2 then W1 in successive reads (reversal)

The protocol enforces a forward-only invariant at the replica. Eventual consistency would permit reading W3, then W1 — the order appeared to ship before it was created. Consistent Prefix makes this impossible while still allowing the replica to lag behind the write region indefinitely.
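The forward-only invariant amounts to a small buffering rule at the replica. A minimal sketch (class and method names are invented; real replication is far more involved):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy replica enforcing the Consistent Prefix invariant: writes become
// visible strictly in commit (LSN) order; early arrivals are buffered.
public class PrefixReplica {
    private long nextLsn = 1;                        // next LSN we may apply
    private final Map<Long, String> pending = new HashMap<>();
    private final List<String> visible = new ArrayList<>();

    void receive(long lsn, String write) {
        pending.put(lsn, write);
        // Apply the longest contiguous prefix now available.
        while (pending.containsKey(nextLsn)) {
            visible.add(pending.remove(nextLsn));
            nextLsn++;
        }
    }

    List<String> read() { return List.copyOf(visible); }

    public static void main(String[] args) {
        PrefixReplica r = new PrefixReplica();
        r.receive(3, "W3 order shipped");  // arrives first over the network
        System.out.println(r.read());      // [] — W3 held back: no gap is served
        r.receive(1, "W1 order created");
        System.out.println(r.read());      // [W1 order created] — valid prefix
        r.receive(2, "W2 payment charged");
        System.out.println(r.read());      // full prefix W1, W2, W3
    }
}
```

Note what the sketch does not do: it never compares timestamps or talks to other replicas. Ordering is cheap; it is recency that would be expensive.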

Eventual: Gossip Replication with No Ordering Constraints

graph TD
    WR["Write Region Commit W1 then W2 then W3 Ack after local quorum only"]

    WR -->|"Async gossip — unordered"| R1["Read Region A Current state: W1 and W3 W2 not yet received"]
    WR -->|"Async gossip — unordered"| R2["Read Region B Current state: W3 only W1 and W2 not yet received"]
    WR -->|"Async gossip — unordered"| R3["Read Region C Current state: W1, W2, W3 Fully replicated"]

    style WR fill:#27ae60,color:#fff
    style R1 fill:#e67e22,color:#fff
    style R2 fill:#c0392b,color:#fff
    style R3 fill:#27ae60,color:#fff

Each read region has received a different subset of writes in a different order — all three states are valid under Eventual consistency. Region A has W1 and W3 but not W2 (a gap that Consistent Prefix prevents). Region B has only W3 (a severe reversal in the making). Region C is fully caught up. Given time with no new writes, all three will converge, but there is no time bound on when.
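The regression is easy to state in code. A toy check of the replica states in the diagram above (region names and write subsets taken from it; everything else is invented):

```java
import java.util.Map;
import java.util.Set;

// Toy illustration of non-monotonic reads under Eventual consistency:
// successive reads may hit different replicas holding different write subsets.
public class EventualRegression {
    public static void main(String[] args) {
        Map<String, Set<String>> replica = Map.of(
                "A", Set.of("W1", "W3"),           // gap: W2 missing
                "B", Set.of("W3"),                 // only the newest write
                "C", Set.of("W1", "W2", "W3"));    // fully converged

        // Read 1 lands on C, read 2 lands on B: the second read "loses" W1 and W2.
        Set<String> first = replica.get("C");
        Set<String> second = replica.get("B");
        boolean monotonic = second.containsAll(first);
        System.out.println(monotonic); // false — a reversal, permitted by contract
    }
}
```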


🌍 Real-World Deployments — Which Level Serves Banking, Social Media, and Audit Logs

Financial Services: Account Balances and Inventory Counters

Banking applications require Strong consistency for balance reads. A balance that shows the pre-deposit amount for even a single read can trigger incorrect overdraft fees, failed payment authorizations, or reconciliation discrepancies. The 2x RU cost and higher write latency are justified — incorrect balance data causes real monetary loss.

E-commerce inventory counters face the same trade-off at higher write frequency. A flash sale listing 100 units where 200 simultaneous reads all see "available" before the last unit sells results in overselling. Teams use either Strong consistency or conditional writes (optimistic concurrency with ETag) to prevent duplicate reservations. Bounded Staleness (T=30s) is a viable middle ground if a brief availability window during depletion is acceptable.

Social Feeds and Activity Counters

Twitter processes billions of like and view events daily. The precise like count on a post seconds after a viral moment is irrelevant — what matters is throughput and eventual correctness. Eventual consistency with high-frequency async writes and periodic read-time aggregation is the standard pattern. The 2–5 second lag before a like appears for another user is invisible in practice.

Shopping Carts and User Preferences

A user adding items to a shopping cart and immediately viewing it expects to see those items. Session consistency provides read-your-own-writes at 1x RU cost — the correct default for ~90% of user-facing web and mobile features. The session token is managed automatically by the SDK in singleton-client architectures.

Audit Logs and Event History

A financial audit log must show events in the exact order they were committed. If a compliance officer reads the log and sees a withdrawal before the transfer that funded it, the audit is incorrect. Consistent Prefix provides the necessary ordering guarantee without requiring Strong consistency on the high-frequency write path. Entries may appear 30–60 seconds behind real time — acceptable. Entries appearing out of sequence — never acceptable.


⚖️ Trade-offs and Failure Modes Across All Five Levels

Full Guarantee Comparison

| Level | Read-your-own-writes | Monotonic reads | Global ordering | Max staleness | Multi-write support | RU multiplier |
| --- | --- | --- | --- | --- | --- | --- |
| Strong | Always | Always | Global linear order | None — always latest | Not supported | 2x reads |
| Bounded Staleness | After staleness window | Yes | Single-write-region only | K versions or T seconds | Degrades to CP ordering | 1x |
| Session | Within session only | Within session only | No cross-session guarantee | Unbounded | Fully supported | 1x |
| Consistent Prefix | No | No gaps or reversals | Ordered prefix only | Unbounded | Fully supported | 1x |
| Eventual | No | No — reversals possible | None | Unbounded | Fully supported | 1x |

Failure Modes That Catch Teams Off-Guard

Strong + multi-region write = hard platform rejection. Enabling multi-region writes and setting Strong consistency produces a configuration error. This is not a performance advisory — it is a hard platform limit. The synchronous cross-region coordination that Strong requires is physically incompatible with accepting concurrent writes from multiple regions.

Session + stateless functions = invisible staleness. Lambda functions and Azure Functions creating a new SDK client per invocation silently break Session consistency. There is no error, no warning, and no exception. Reads simply return from any replica without the session token constraint. The staleness is detectable only by comparing read timestamps against write timestamps — which most applications do not instrument.

Bounded Staleness + multi-region write = silent ordering degradation. Unlike Strong, this combination produces no error. The K/T staleness bound continues to apply, but the monotonic read and global ordering guarantees degrade to Consistent Prefix semantics. Teams often miss this during multi-region write migrations.

Eventual for sequential data = logic corruption. Any data with sequential semantic dependencies — order state machines, payment workflows, document version history — is unsafe under Eventual. A client reading "shipped" then "created" in successive reads and transitioning state based on those reads may apply invalid transitions. Eventual consistency does not cause this occasionally — it permits it by definition.


🧭 Choosing the Right Level — Decision Flowchart and Use-Case Reference

The right consistency level depends on three questions: Does stale data cause financial or correctness harm? Do users need to see their own writes immediately? Does the order of writes matter?

flowchart TD
    START["What consistency level do I need?"]

    START --> Q1{"Is stale data business-critical? money / inventory / compliance"}

    Q1 -->|"Yes"| Q2{"Single write region or multi-region writes?"}
    Q2 -->|"Single write region"| STRONG["Strong Linearizable — no staleness 2x RU on reads"]
    Q2 -->|"Multi-region writes"| BS["Bounded Staleness Strongest with multi-write Set T or K to your SLA"]

    Q1 -->|"No"| Q3{"Must users see their own writes immediately?"}
    Q3 -->|"Yes"| SESSION["Session — Default Read-your-own-writes via token Maintain token in serverless!"]

    Q3 -->|"No"| Q4{"Does write ORDER matter? audit log or event history"}
    Q4 -->|"Yes — sequence matters"| CP["Consistent Prefix No gaps, no reversals Recency not guaranteed"]
    Q4 -->|"No — any order is fine"| EVENTUAL["Eventual Max throughput Only for staleness-tolerant data"]

    style STRONG fill:#c0392b,color:#fff
    style BS fill:#e67e22,color:#fff
    style SESSION fill:#27ae60,color:#fff
    style CP fill:#2980b9,color:#fff
    style EVENTUAL fill:#8e44ad,color:#fff

Walk this flowchart per-feature, not per-account. Cosmos DB allows per-request consistency overrides that weaken (never strengthen) the account default. A single account can serve balance reads at Strong and product catalog reads at Eventual.

| Use Case | Recommended Level | Key Reason |
| --- | --- | --- |
| Bank account balance | Strong | Stale balance causes financial harm |
| Inventory check (exact) | Strong or Bounded Staleness | Oversell risk — T=30s may be acceptable |
| Near-real-time dashboard | Bounded Staleness | T=60s lag acceptable; ordering useful |
| Shopping cart items | Session | User must see items just added |
| User profile / preferences | Session | Read-your-own-writes; low write volume |
| Financial audit log | Consistent Prefix | Sequence correctness; real-time not required |
| Order history display | Consistent Prefix | Ordering matters; seconds of lag acceptable |
| Social media feed | Eventual | 2–5s stale: invisible to users |
| Like and view counters | Eventual | Counter accuracy within 1% irrelevant |
| Product catalog reads | Session or Eventual | Slight staleness acceptable |

🧪 The Lambda Session Token Trap — A Serverless Consistency Failure Dissected

This section walks through the exact failure the banking app experienced step by step and shows the fix.

Why this scenario matters: It represents the most common Cosmos DB production consistency bug. Teams adopt Session consistency as the default (the correct choice) and deploy on serverless platforms (the correct architecture) but never test cross-invocation read-after-write consistency. The bug surfaces only under real traffic patterns.

What to look for: How the session token creates an implicit dependency between write and read invocations, and what changes when the fix is applied.

The Failing Flow — New SDK Client Per Invocation

Lambda Invocation A — Deposit handler:
  1. new CosmosClient() → session token map = empty
  2. Write W1: deposit $500
  3. Response includes session token "LSN:1042"
  4. Token stored in: invocation-local memory only
  5. Lambda returns 200 OK
  ← Container may spin down; token is lost

Lambda Invocation B — Balance query handler (new or reused container):
  1. new CosmosClient() → session token map = empty  ← THE BUG
  2. Read balance [no session token header]
  3. Cosmos DB routes to nearest replica (any LSN)
  4. Nearest replica has LSN:1039 — W1 not replicated yet
  5. Returns $1,000 — pre-deposit balance
  ← User sees incorrect balance; no error thrown

No error is thrown because Cosmos DB behaves correctly per the Session consistency contract. No session token means no LSN routing constraint. The replica's response is valid. The bug is a contract violation in how the application uses the SDK, not in the database.

The Fix — Externalizing the Session Token

Three approaches, in order of simplicity:

1. Pass the token as a response header in the API layer. The deposit handler includes the session token in its HTTP response. The API gateway passes it as a header on the immediately following balance-read request. The balance handler injects it into CosmosItemRequestOptions. Zero infrastructure overhead.

2. Store the token in a low-latency key-value store. After writing a deposit, serialize the session token to Redis keyed by userId. When handling a balance read for the same user, retrieve the token and inject it into the read options. Works for async flows where write and read are in separate user interactions.

3. Use a singleton SDK client outside the handler. Initialize the Cosmos DB client once per Lambda container (outside the handler function), not once per invocation. The SDK's in-memory token map persists across warm invocations. Effective for high-concurrency functions where containers stay warm; provides no guarantee on cold starts.

The session token is a 20–40 byte string. The overhead of persisting and retrieving it is negligible compared to the cost of a Cosmos DB operation.


🛠️ Azure SDK and CLI: Configuring Consistency Per Operation

Cosmos DB supports consistency configuration at two levels: the account default (ceiling for all operations) and per-request overrides (can only weaken, never strengthen beyond the account default).

Account Default via Azure CLI

# Set account-level consistency default
az cosmosdb update \
  --name mycosmosaccount \
  --resource-group myrg \
  --default-consistency-level Session

# For Bounded Staleness: configure K and T bounds
az cosmosdb update \
  --name mycosmosaccount \
  --resource-group myrg \
  --default-consistency-level BoundedStaleness \
  --max-staleness-prefix 100000 \
  --max-interval 300

# Valid values: Strong | BoundedStaleness | Session | ConsistentPrefix | Eventual

Per-Request Override and Session Token Passthrough in the Java SDK

// SDK client with account default (Session)
CosmosClient client = new CosmosClientBuilder()
    .endpoint(System.getenv("COSMOS_ENDPOINT"))
    .key(System.getenv("COSMOS_KEY"))
    .consistencyLevel(ConsistencyLevel.SESSION)
    .buildClient();

// Per-request WEAKER override: Session -> Eventual (allowed — weakening only)
CosmosItemRequestOptions catalogOptions = new CosmosItemRequestOptions();
catalogOptions.setConsistencyLevel(ConsistencyLevel.EVENTUAL);
container.readItem(productId, partitionKey, catalogOptions, Product.class);

// Session token extraction and passthrough for serverless fix:
CosmosItemResponse<BankAccount> depositResponse =
    container.createItem(deposit, partitionKey, new CosmosItemRequestOptions());
String sessionToken = depositResponse.getSessionToken();
// Persist: SET session:{userId} {sessionToken} EX 300 (in Redis)

// On the next read invocation — retrieve and inject:
String storedToken = redisClient.get("session:" + userId);
CosmosItemRequestOptions balanceOptions = new CosmosItemRequestOptions();
if (storedToken != null) {
    balanceOptions.setSessionToken(storedToken);
}
container.readItem(accountId, partitionKey, balanceOptions, BankAccount.class);

The session token passthrough is the fix for the banking app. depositResponse.getSessionToken() returns the LSN-encoded token from the write. Injecting it into the subsequent read forces routing to a replica at or ahead of that LSN. Per-request consistency can only weaken: an account configured at Session can override to Eventual per-request, but cannot override to Strong.

For the full Java SDK v4 documentation including session token management, see the Azure Cosmos DB Java SDK v4 reference.


📚 Lessons Learned from Production Cosmos DB Deployments

  • "Session" means client-session-token continuity — not HTTP session, not database session. The token is an SDK-managed string encoding the LSN your client has seen. A new SDK client instance starts with an empty token and zero read-your-own-writes guarantee until it issues a write.

  • Every new SDK client is a fresh consistency start. Lambda functions, Azure Functions, and any stateless runtime that instantiates a new SDK client per invocation silently break Session guarantees. No error is thrown. Reads return stale data from whichever replica happens to respond.

  • Strong consistency and multi-region writes are mutually exclusive — hard platform constraint. Not a performance recommendation. Attempting to combine them produces a configuration error. Choose between active-active writes and linearizable reads.

  • Eventual consistency allows genuinely out-of-order reads with no time bound. "Eventual" does not mean "consistent within a few seconds." A read can return an older value than the previous read, in the same session, indefinitely. Never use Eventual for data with sequential semantic dependencies.

  • Bounded Staleness is the underused middle ground. Teams often jump from Session to Strong for "fresher" reads, paying 2x RU unnecessarily. For dashboards and analytics where a configurable lag is acceptable, Bounded Staleness provides ordering guarantees at 1x RU cost.

  • Per-request overrides can only weaken consistency, never strengthen it. Design the account default around the most demanding use case. Weaker overrides per-request are free; stronger overrides are impossible.


📌 TLDR — Five-Bullet Decision Cheat Sheet

  • Strong → Linearizable globally, always latest, 2x RU on reads, incompatible with multi-region writes. Use for account balances and inventory where any staleness causes financial harm.
  • Bounded Staleness → Reads lag by at most K versions or T seconds. Same write latency as Eventual, with ordering guarantees in single-write-region mode. Use for dashboards and compliance scenarios with explicit staleness SLAs.
  • Session → Read-your-own-writes within a single SDK client session via session token. Token must travel with the client — stateless functions silently break it. Use for shopping carts, user preferences, and any user-specific write-then-read flow.
  • Consistent Prefix → No gaps, no reversals in write order. Reads may be stale but always see writes in sequence. Use for audit logs and event history where ordering matters but real-time does not.
  • Eventual → Zero ordering guarantees, maximum throughput, 1x RU. Successive reads can return older values than previous reads. Use only for staleness-tolerant, non-sequential data like social feeds and view counters.
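The cheat sheet above can be condensed into a decision helper. This is an illustrative mapping of my own, not an official API; the function names and parameters are assumptions made for the sketch.

```python
def pick_consistency(needs_latest, needs_read_your_writes,
                     needs_ordering, has_staleness_sla):
    """Map workload requirements to a Cosmos DB consistency level,
    following the five-bullet cheat sheet, strictest need first."""
    if needs_latest:
        return "Strong"             # 2x RU reads, single write region only
    if has_staleness_sla:
        return "Bounded Staleness"  # lag bounded by K versions / T seconds
    if needs_read_your_writes:
        return "Session"            # token must travel with the client
    if needs_ordering:
        return "Consistent Prefix"  # no gaps, no reversals, possibly stale
    return "Eventual"               # max throughput, zero ordering

# Account balance: any staleness causes financial harm.
pick_consistency(True, True, True, False)    # -> "Strong"
# Audit log: order matters, real-time does not.
pick_consistency(False, False, True, False)  # -> "Consistent Prefix"
```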

📝 Practice Quiz

  1. A Lambda function creates a new Cosmos DB SDK client on every invocation and uses Session consistency. A user deposits money in invocation A. Invocation B reads the account balance. Why might the read return the pre-deposit amount?

    A) Session consistency guarantees are limited to the write region only
    B) The deposit write is not fully committed until both invocations complete
    C) Each new SDK client starts with an empty session token, so the read has no LSN routing constraint and may be served from any replica including one that has not replicated the deposit
    D) Cosmos DB does not support Session consistency for financial workloads

    Correct Answer: C

  2. Your Cosmos DB account has multi-region writes enabled. A stakeholder requests Strong consistency for all account balance reads. What is your response and what alternative do you offer?

    A) Strong consistency is available but requires provisioned throughput at 10000 RU/s or higher
    B) Strong consistency is not supported with multi-region write configurations — this is a hard platform limit. The strongest available option with multi-region writes is Bounded Staleness, which provides ordering guarantees and a configurable K/T freshness bound
    C) Strong consistency can be applied per-request even if the account default is Eventual
    D) Strong consistency works with multi-region writes but adds 500 ms of latency

    Correct Answer: B

  3. What is the concrete difference between Consistent Prefix and Eventual consistency? Give an example of what Eventual allows that Consistent Prefix prevents.

    A) Consistent Prefix is always faster than Eventual because it applies ordering at write time
    B) Consistent Prefix prevents read reversals and gaps: if W1, W2, W3 were committed, no read will see W3 without W2. Eventual makes no such promise — a client can read W3, then W1, then W2 in successive reads. For an order tracker, Eventual allows reading "shipped" then "created" — as if the order shipped before it was placed
    C) There is no difference — both allow out-of-order reads
    D) Consistent Prefix guarantees reads are never more than one write behind Eventual

    Correct Answer: B

  4. Bounded Staleness is configured with maxIntervalInSeconds=60. Write W1 occurs at t=0. Is a read at t=55s guaranteed to see W1? What about at t=65s?

    A) Both t=55s and t=65s are guaranteed to see W1
    B) t=55s is guaranteed; t=65s is not because the window resets after each read
    C) t=55s is not guaranteed (the 60s staleness window has not expired). t=65s is guaranteed (the window has expired and W1 must be visible to all reads)
    D) Neither is guaranteed — Bounded Staleness only provides ordering, not freshness bounds

    Correct Answer: C

  5. Architecture design challenge (Open-ended — no single correct answer): Your team is building a multi-region Cosmos DB account with multi-region writes enabled to minimize write latency globally. You need to serve two features: a financial audit log (entries must be sequentially correct, real-time not required) and a real-time spend dashboard (must refresh within 30 seconds, ordering irrelevant). Which consistency levels would you select for each feature, and how would you configure the account default and per-request overrides? Consider the multi-region write constraint.

    Strong points to address: why Strong is unavailable (multi-region write constraint), why Consistent Prefix suits the audit log (ordering without recency), why Bounded Staleness with T=30s or Eventual suits the dashboard, how the account default should be set to the stricter of the two requirements (Consistent Prefix), and how per-request weakening to Eventual covers the dashboard reads.
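The "weaken only" rule that the design challenge leans on can be sketched as a validation check. This is a hypothetical helper, not SDK code: rank the levels from weakest to strongest, and allow a per-request override only if it is no stronger than the account default.

```python
# Levels ordered weakest -> strongest.
STRENGTH = ["Eventual", "Consistent Prefix", "Session",
            "Bounded Staleness", "Strong"]

def override_allowed(account_default, requested):
    """Per-request overrides may weaken consistency, never strengthen it."""
    return STRENGTH.index(requested) <= STRENGTH.index(account_default)

# Account default Consistent Prefix (audit log); dashboard reads weaken.
override_allowed("Consistent Prefix", "Eventual")  # True: weakening is free
override_allowed("Consistent Prefix", "Session")   # False: cannot strengthen
```

This is why the account default must be set to the stricter of the two requirements: the audit log's Consistent Prefix becomes the default, and dashboard reads weaken to Eventual per request.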


Written by Abstract Algorithms (@abstractalgorithms)