ACID Transactions in Distributed Databases: DynamoDB, Cosmos DB, and Spanner Compared
How modern NoSQL databases achieve atomicity and isolation without a single machine
TLDR: ACID transactions in distributed databases are not equal. DynamoDB provides multi-item atomicity scoped to 25 items using two-phase commit with a coordinator item, but only within a single region. Cosmos DB wraps partition-scoped operations inside a stored procedure using a JavaScript continuation model; cross-partition writes get no atomic guarantee. Google Spanner is the only cloud database that delivers true global ACID with external consistency, using TrueTime (GPS + atomic clocks) to assign globally ordered timestamps and a commit-wait delay to guarantee causality across datacenters. MongoDB supports multi-document transactions on replica sets but carries measurable performance overhead and does not prevent write skew under snapshot isolation. The right choice depends on atomicity scope, latency tolerance, and operational cost, not on which database claims to be "ACID-compliant."
The Write That Silently Disappeared
A payments team at a mid-size fintech company migrated their account ledger service from PostgreSQL to DynamoDB. In PostgreSQL, every funds transfer ran inside a BEGIN / COMMIT block: debit the source account, credit the destination account, write a transaction record. Three rows, one atomic commit. The team ported the logic verbatim to DynamoDB and deployed. The monitoring dashboard showed zero errors. Latency improved by 40%.
Three weeks later, a nightly reconciliation job found thousands of accounts where a debit had fired but the corresponding credit had never landed. Source accounts had been charged. Destination accounts were unchanged. No exception had been logged. No CloudWatch alarm had fired.
```java
// What the team wrote - looks like a safe "two-step" update
dynamoDB.updateItem(debitSourceAccountRequest);   // succeeds
dynamoDB.updateItem(creditDestinationRequest);    // may never run
```
The code issued two independent UpdateItem calls. Each call was individually atomic and durable. But the pair was not. A Lambda timeout, a network blip, or a silent retry on only the first call had created thousands of partial transfers. In DynamoDB, and in every distributed database, two sequential writes are never automatically part of the same transaction unless you use the transactional API explicitly.
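The failure mode above can be reproduced in a few lines. This is a toy in-memory model (not the DynamoDB API) contrasting two independent writes, where a crash between them leaves a partial transfer, with an all-or-nothing apply that stages both mutations before making either visible:

```python
def transfer_two_calls(accounts, src, dst, amount, fail_after_first=False):
    """Two independent updates: a crash between them leaves a partial transfer."""
    accounts[src] -= amount          # first UpdateItem succeeds
    if fail_after_first:
        raise TimeoutError("second call never ran")
    accounts[dst] += amount          # second UpdateItem may never run

def transfer_atomic(accounts, src, dst, amount, fail_midway=False):
    """All-or-nothing: stage both mutations, apply only if every step staged."""
    staged = dict(accounts)          # tentative copy, invisible to readers
    staged[src] -= amount
    if fail_midway:
        raise TimeoutError("transaction aborted")   # nothing was applied
    staged[dst] += amount
    accounts.update(staged)          # single visible switch = commit

accounts = {"A": 100, "B": 0}
try:
    transfer_two_calls(accounts, "A", "B", 30, fail_after_first=True)
except TimeoutError:
    pass
print(accounts)  # {'A': 70, 'B': 0} - money vanished

accounts = {"A": 100, "B": 0}
try:
    transfer_atomic(accounts, "A", "B", 30, fail_midway=True)
except TimeoutError:
    pass
print(accounts)  # {'A': 100, 'B': 0} - intact after abort
```

The real fix, as the post discusses next, is DynamoDB's TransactWriteItems API, which makes the database itself perform the staged-then-committed dance across partitions.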
This is not a DynamoDB bug. It is a failure of assumptions borrowed from PostgreSQL. And the assumption that "ACID-capable" means the same thing across all databases is the most expensive misconception in distributed systems engineering.
The real question is not "does this database support ACID?" All four databases in this post do, in some form. The real questions are: what scope does ACID cover, what isolation level does it actually enforce, and what does it cost in latency and write throughput?
ACID Unpacked for Distributed Systems: What Each Letter Means When There Is No Single Machine
The ACID acronym was coined for single-node relational databases. Transplanted into distributed systems, each letter means something subtly different, and one of them barely means anything at all.
Atomicity in a distributed context means all writes in a logical transaction either all commit or all abort, even when those writes span multiple nodes, partitions, or shards. On a single machine, "all or nothing" is enforced by the storage engine via write-ahead logging: a crash before the WAL is flushed is a rollback, a crash after is a commit. Across machines, atomicity requires a coordination protocol, typically two-phase commit (2PC), where every participant must agree to commit before anyone actually applies the change. Failure between the PREPARE phase and the COMMIT phase is the hardest sub-problem in distributed transactions: some participants may have tentative mutations that are never finalized.
Consistency is the weakest letter in distributed systems, and arguably the most misleading. In ACID, "consistent" means "your application invariants hold after the transaction." The database enforces type constraints and referential integrity. The application must encode business rules: no negative balances, no oversold inventory, no duplicate order IDs. The "C" does not map to the CAP theorem's definition of consistency (linearizability). Engineers who expect a distributed database's ACID guarantee to automatically enforce business-level correctness will be surprised to discover that a serializable transaction can still commit a business rule violation if the application code does not encode that rule as a condition.
Isolation is the most nuanced letter. It describes which anomalies concurrent transactions can produce when their operations interleave. The SQL standard defines four isolation levels, but the anomalies those levels prevent are more useful to reason about directly:
| Anomaly | Description | Prevented By |
| --- | --- | --- |
| Dirty Read | Reading uncommitted data from another in-flight transaction | Read Committed and above |
| Non-Repeatable Read | Re-reading the same row and finding it was modified by a committed concurrent transaction | Repeatable Read and above |
| Phantom Read | A range query re-executed returns new rows inserted by a committed concurrent transaction | Serializable only |
| Write Skew | Two concurrent transactions each read overlapping data, make independent decisions, and commit changes that together violate an invariant | Serializable only (not Snapshot Isolation) |
Write skew is the anomaly most engineers encounter only in production. Two concurrent "reserve last seat" transactions can both read available_seats = 1, both decide to proceed, and both commit, leaving available_seats = -1. Serializable isolation prevents this. Snapshot isolation, which MongoDB uses, does not.
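The seat-reservation anomaly is easy to demonstrate. The sketch below is an illustrative toy model, not any real engine: under snapshot isolation both transactions decide against the same stale snapshot, while a serializable-style commit-time re-validation forces the second transaction to abort.

```python
def run_snapshot_isolation(store):
    """Both transactions read from the same snapshot taken at start."""
    snapshot = dict(store)
    decisions = []
    for tx in ("T1", "T2"):
        if snapshot["available_seats"] >= 1:   # each sees 1 seat available
            decisions.append(tx)
    for tx in decisions:                        # both blind-write the decrement
        store["available_seats"] -= 1
    return store["available_seats"]

def run_serializable(store):
    """Re-validate the read at commit time; abort the transaction whose read is stale."""
    committed = []
    for tx in ("T1", "T2"):
        seats_read_at_start = 1                 # what each transaction observed
        if store["available_seats"] != seats_read_at_start:
            continue                            # T2 aborts: its read no longer holds
        store["available_seats"] -= 1
        committed.append(tx)
    return store["available_seats"], committed

print(run_snapshot_isolation({"available_seats": 1}))   # -1: invariant violated
print(run_serializable({"available_seats": 1}))         # (0, ['T1'])
```

This is exactly the compensating pattern the MongoDB section later recommends: if the engine only gives you snapshot isolation, the application must re-check the invariant inside the transaction.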
Durability in a distributed context means a committed write has been replicated to a quorum of nodes and persisted to disk before the acknowledgment reaches the client. Losing a single node after commit must not lose the data. All four databases in this post provide durable writes, but their quorum configurations, replication lag, and recovery guarantees differ meaningfully in multi-region deployments.
DynamoDB TransactWriteItems: Two-Phase Commit Without Calling It Two-Phase Commit
Amazon DynamoDB added transactional APIs in late 2018 via TransactWriteItems and TransactGetItems. The documentation describes ACID guarantees for up to 25 items across up to 25 tables within a single AWS region. What the documentation does not say is that these APIs implement a variant of two-phase commit backed by a coordinator item stored in DynamoDB's own infrastructure.
How the Coordinator Item Orchestrates 2PC
When a client calls TransactWriteItems, DynamoDB's transaction layer writes a transaction record (a special metadata item tracked internally) with status PENDING and the full set of intended operations, condition expressions, and item keys. This record is the single source of truth for the transaction's state. If the coordinator process dies after writing this record, another coordinator process can read the record, inspect its state, and either complete the transaction or roll it back.
The coordinator then sends PREPARE messages to each partition node that owns an affected item. Each partition node locks the item (blocking concurrent writes), evaluates the ConditionExpression, and writes a tentative mutation. If all partitions respond PREPARED, the coordinator updates the transaction record to COMMITTED and sends COMMIT to each partition. Each partition applies the tentative write and releases the lock.
The sequence below shows the full 2PC flow for a funds transfer involving items on two separate DynamoDB partitions. Pay attention to where the coordinator record transitions from PENDING to COMMITTED: this is the point of no return. Everything before it can be safely rolled back; everything after it must eventually be committed.
```mermaid
sequenceDiagram
    participant Client
    participant Coordinator
    participant PartitionA as Partition A
    participant PartitionB as Partition B
    Client->>Coordinator: TransactWriteItems with ClientRequestToken
    Coordinator->>Coordinator: Write PENDING transaction record
    Coordinator->>PartitionA: PREPARE - lock item, evaluate condition, tentative write
    Coordinator->>PartitionB: PREPARE - lock item, evaluate condition, tentative write
    PartitionA-->>Coordinator: PREPARED
    PartitionB-->>Coordinator: PREPARED
    Coordinator->>Coordinator: Update transaction record to COMMITTED
    Coordinator->>PartitionA: COMMIT - apply write, release lock
    Coordinator->>PartitionB: COMMIT - apply write, release lock
    PartitionA-->>Coordinator: ACK
    PartitionB-->>Coordinator: ACK
    Coordinator-->>Client: TransactionSucceeded
```
This diagram explains both the 2x write cost (every transactional write involves at least two round trips to the partition: PREPARE then COMMIT) and why DynamoDB bills transactional writes at 2x the standard write capacity unit rate. The coordinator overhead is real and unavoidable.
Three Failure Scenarios and Why the Idempotency Token Is Not Optional
DynamoDB's 2PC exposes three distinct failure windows. Engineers who do not anticipate them build retry loops that introduce duplicate operations.
Scenario 1: Coordinator fails after PENDING but before any PREPARE. No partition has been touched. On recovery, a new coordinator reads the PENDING record, finds no PREPARED responses, and rolls back by deleting the PENDING record. No writes have occurred.
Scenario 2: Coordinator fails after some or all partitions respond PREPARED but before the coordinator record transitions to COMMITTED. Affected partitions hold locked items with tentative mutations. Other writers will receive TransactionConflictException attempting to write those items. On recovery, a new coordinator reads the PENDING record, finds no COMMITTED state, and sends ROLLBACK to all participants. Tentative mutations are discarded and locks are released.
Scenario 3: Coordinator fails after updating the record to COMMITTED but before all COMMIT messages are delivered. This is the safe failure: the record is in COMMITTED state. On recovery, a new coordinator reads COMMITTED, identifies which partitions have not yet ACKed, and replays COMMIT to those partitions. This replay is idempotent because each partition checks the coordinator record before applying.
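The recovery rule common to all three scenarios is that the durable transaction record alone decides the outcome: anything still PENDING rolls back, anything COMMITTED replays forward. The toy state machine below illustrates that rule; the field names and structure are invented for the sketch and are not DynamoDB internals.

```python
def recover(tx_record, partitions):
    """A new coordinator resumes a transaction from its durable record."""
    if tx_record["status"] == "PENDING":
        # Scenarios 1 and 2: never reached the point of no return.
        # Discard tentative mutations and release locks.
        for p in partitions:
            p["tentative"] = None
            p["locked"] = False
        tx_record["status"] = "ROLLED_BACK"
    elif tx_record["status"] == "COMMITTED":
        # Scenario 3: point of no return passed. Replay COMMIT idempotently;
        # partitions that already applied have no tentative write left.
        for p in partitions:
            if p["tentative"] is not None:
                p["value"] = p["tentative"]
                p["tentative"] = None
                p["locked"] = False
    return tx_record["status"]

# Coordinator died after both partitions PREPARED, before COMMITTED (Scenario 2):
parts = [{"value": 10, "tentative": 7, "locked": True},
         {"value": 5, "tentative": 8, "locked": True}]
print(recover({"status": "PENDING"}, parts))    # ROLLED_BACK - values unchanged

# Coordinator died after COMMITTED, with one COMMIT undelivered (Scenario 3):
parts = [{"value": 10, "tentative": 7, "locked": True},
         {"value": 8, "tentative": None, "locked": False}]  # already applied
print(recover({"status": "COMMITTED"}, parts))  # COMMITTED - lagging partition catches up
```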
The ClientRequestToken (idempotency token) is the mechanism that makes client-side retries safe across all three scenarios. If a client retries a TransactWriteItems call after a timeout using the same token, DynamoDB returns the result of the original call without re-executing, even if the original call succeeded. Without the token, a retry of a timed-out successful transaction throws TransactionCanceledException because the committed state already exists and the re-submission detects a conflict.
DynamoDB Transaction Limits That Shape Your Data Model
The constraints are not arbitrary API limits; they reflect the design of the in-memory coordinator state:
- 25 items maximum per transaction across up to 25 tables.
- Single AWS region only: DynamoDB Global Tables does not extend transactional guarantees across regions.
- 4 MB total payload per transaction.
- All items must reside in the same AWS account.
These limits mean that any application logic requiring atomic coordination across more than 25 items (a cart checkout with 40 line items, a bulk inventory adjustment) cannot use TransactWriteItems and must fall back to the Saga pattern with compensating transactions.
Cosmos DB Partition-Scoped Transactions: When JavaScript Is the Transaction Boundary
Azure Cosmos DB supports multi-document transactions, but under a constraint that surprises most developers: all documents in a transaction must share the same logical partition key. There is no cross-partition atomic write in Cosmos DB. Understanding why requires understanding what the transaction boundary actually is.
Stored Procedures as the Transaction Unit
Cosmos DB's transaction boundary is the stored procedure execution. A stored procedure is a JavaScript function registered against a container. When Cosmos DB executes a stored procedure, the entire JavaScript function body runs as a single atomic operation on a single partition's primary replica. If the JavaScript throws an uncaught exception at any point, the Cosmos DB engine rolls back every document operation performed in that invocation. If the function completes without throwing, all mutations are committed atomically.
This is not traditional 2PC. It is closer to a JavaScript coroutine that the database engine treats as an indivisible unit of work. The atomicity guarantee comes from the fact that execution is single-threaded on the partition's primary replica: no other operation can interleave with it mid-execution. The "transaction" is implicit in the execution context: start of function = begin transaction, clean return = commit, uncaught exception = rollback.
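That contract (run the whole function against tentative state, commit on clean return, discard everything on an uncaught exception) can be sketched in a few lines. This is a pure-Python model of the behavior, not the Cosmos DB engine, and the procedure bodies are invented examples.

```python
def execute_stored_procedure(partition_docs, procedure):
    """Run procedure against a tentative copy; commit all-or-nothing."""
    working = {k: dict(v) for k, v in partition_docs.items()}  # tentative state
    try:
        procedure(working)          # single-threaded: nothing interleaves
    except Exception:
        return partition_docs, "rolled back"   # discard every mutation
    partition_docs.clear()
    partition_docs.update(working)             # commit atomically
    return partition_docs, "committed"

def good_proc(docs):
    docs["order-1"]["status"] = "paid"
    docs["ledger-1"]["balance"] -= 25

def bad_proc(docs):
    docs["order-1"]["status"] = "paid"         # mutation before the failure...
    raise RuntimeError("validation failed mid-procedure")

docs = {"order-1": {"status": "new"}, "ledger-1": {"balance": 100}}
print(execute_stored_procedure(docs, bad_proc)[1])    # rolled back
print(docs["order-1"]["status"])                      # new - first write discarded
print(execute_stored_procedure(docs, good_proc)[1])   # committed
```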
The diagram below shows the decision tree from client request to committed state. The key branch is at the partition routing layer: if the operation spans multiple logical partition keys, it is rejected before JavaScript ever executes.
```mermaid
flowchart TD
    A[Client sends stored procedure call] --> B{All document operations use same partition key?}
    B -->|No - different partition keys| C[Request rejected at routing layer]
    B -->|Yes| D[Route to owning partition primary replica]
    D --> E[Execute stored procedure JavaScript on primary]
    E --> F{JavaScript throws uncaught exception?}
    F -->|Yes| G[Engine rolls back all ops in this invocation]
    F -->|No| H[Commit - all mutations visible atomically]
    G --> I[Return error to client]
    H --> J[Return success to client]
```
This diagram makes the Cosmos DB transaction model immediately practical: the partition key is the transaction scope, and the stored procedure is the transaction unit. If your data model puts related documents behind the same partition key, Cosmos DB transactions work cleanly. If your documents live on different partition keys, no Cosmos DB API provides atomicity across them.
The Cross-Partition Reality
Consider an e-commerce system where orderId is the partition key for order documents and sku is the partition key for inventory documents. A stored procedure that attempts to atomically decrement inventory and create an order cannot do so: these documents are on different partitions. Cosmos DB has no cross-partition atomic write API.
The Cosmos DB Transactional Batch API (introduced in 2020) provides the same single-partition atomicity as stored procedures with better ergonomics: it supports up to 100 operations per batch without the overhead of JavaScript parsing. But it still enforces the single-partition-key constraint.
Teams working around the cross-partition limitation use three approaches: denormalize data so all related documents share a partition key (accept fan-out on updates), use the Cosmos DB Change Feed as a CDC trigger to drive compensating transactions asynchronously (Saga), or accept eventual consistency on cross-entity operations.
Multi-Region Write Consistency Trade-offs
Under single-region writes (one write region, multiple read replicas), Cosmos DB transactions are serialized through the partition's write-region primary. Isolation is serializable within the partition.
Under multi-region writes, two clients can write to the same partition key from different regions simultaneously. Cosmos DB detects write conflicts using its conflict resolution policies (Last-Writer-Wins by _ts timestamp, or a custom merge procedure). Transactions run locally on the nearest region's primary and conflict resolution happens asynchronously, meaning two transactions on the same partition key from different write regions can both "succeed" locally and then conflict during replication. Applications using multi-region writes with conflict-sensitive data must implement custom conflict resolution logic.
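Last-Writer-Wins is simple to state but worth seeing concretely: both regions accept their write locally, and once replication delivers both versions, every region keeps the one with the larger _ts. The sketch below models only the resolution policy (documents and timestamps are fabricated), not the service itself.

```python
def resolve_lww(local_version, replicated_version):
    """Keep the document version with the larger _ts timestamp."""
    return max(local_version, replicated_version, key=lambda d: d["_ts"])

# The same logical document written concurrently in two write regions:
west = {"id": "cart-9", "items": 3, "_ts": 1700000010}   # committed in region A
east = {"id": "cart-9", "items": 5, "_ts": 1700000012}   # committed in region B

# Both regions converge on the same winner once replication delivers both versions:
print(resolve_lww(west, east)["items"])   # 5
print(resolve_lww(east, west)["items"])   # 5 - region A's write is silently lost
```

The second print is the punchline: region A's transaction "succeeded" from its client's point of view, yet its effect is gone after convergence. This is why conflict-sensitive data needs a custom merge procedure rather than LWW.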
Google Spanner and Global ACID: The Engineering Cost of External Consistency
Google Spanner is the only database in this comparison that delivers true global ACID transactions across any number of rows, tables, and geographic regions simultaneously. This is not a marketing claim. It is an engineering achievement built on physical infrastructure most databases do not have.
TrueTime: Bounding the Unknowable
Every distributed database that uses timestamps for ordering faces the clock synchronization problem: two nodes' clocks can drift by milliseconds or more. A transaction that commits on Node A at 10:00:00.003 might appear to have committed before a transaction on Node B at 10:00:00.001 even though it physically happened later, if Node A's clock is slightly ahead. Traditional distributed databases work around this by using logical clocks (Lamport timestamps, vector clocks) or by routing all writes through a single leader. Neither approach allows arbitrary cross-region writes with globally consistent timestamps.
Spanner solves this with TrueTime, a Google-internal API that exposes the current wall-clock time as an interval [t_earliest, t_latest] rather than a single point. The interval is guaranteed to contain the true absolute time. Google keeps the uncertainty interval narrow (typically 1-7 milliseconds) by synchronizing server clocks against GPS receivers and atomic clocks deployed in every Google datacenter. TrueTime does not eliminate uncertainty; it bounds it explicitly and makes that bound available to the database engine.
External Consistency Through Commit-Wait
Spanner uses TrueTime to achieve external consistency, the strongest possible consistency guarantee for a distributed database. External consistency means: if transaction T1 commits before transaction T2 starts (in real absolute time), then T1's commit timestamp is strictly less than T2's commit timestamp, as seen by every reader anywhere in the world.
To achieve this, Spanner runs a commit-wait delay on every read-write transaction:
- Acquire shared (read) locks on all rows touched by reads and exclusive (write) locks on all rows to be mutated.
- Call TT.now() to obtain the current TrueTime interval [t_earliest, t_latest].
- Assign commit timestamp s = t_latest, the upper bound of the current uncertainty interval.
- Wait until TT.now().t_earliest > s, that is, until Spanner is certain that the absolute current time has passed the commit timestamp.
- Apply writes with timestamp s, release all locks, and return the commit acknowledgment to the client.
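The commit-wait rule can be simulated with a fake clock. The uncertainty interval and the millisecond tick below are fabricated for illustration; only the rule matters: commit at s = t_latest, then refuse to acknowledge until t_earliest > s, so any later transaction necessarily receives a higher timestamp.

```python
class FakeTrueTime:
    """Simulated TrueTime: now() returns an interval around a fake absolute clock."""
    def __init__(self, epsilon_ms=5):
        self.abs_time = 0.0            # simulated absolute time, in ms
        self.epsilon = epsilon_ms
    def now(self):
        return (self.abs_time - self.epsilon, self.abs_time + self.epsilon)
    def advance(self, ms):
        self.abs_time += ms

def commit_with_wait(tt):
    _, t_latest = tt.now()
    s = t_latest                       # commit timestamp = interval upper bound
    waited = 0
    while tt.now()[0] <= s:            # commit-wait: block until t_earliest > s
        tt.advance(1)                  # stand in for real time passing
        waited += 1
    return s, waited

tt = FakeTrueTime(epsilon_ms=5)
s1, waited = commit_with_wait(tt)
# After the wait, real time has provably passed s1, so a transaction starting
# now gets an interval whose upper bound (its own s) exceeds s1:
s2, _ = commit_with_wait(tt)
print(s1 < s2)   # True - external consistency ordering holds
print(waited)    # 11 ms of wait with a 5ms epsilon and 1ms ticks
```

Note how the wait scales with epsilon: this is exactly why Google invests in GPS and atomic clocks, since every millisecond of clock uncertainty is a millisecond added to every read-write commit.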
The sequence diagram below traces this process from transaction start through commit-wait completion. The critical observation is that commit-wait happens after all locks are held, so no other transaction can observe partial state during the wait, and after the wait, any transaction that starts anywhere in the world will observe a TrueTime lower bound already past s.
```mermaid
sequenceDiagram
    participant Client
    participant Leader as Spanner Partition Leader
    participant TrueTime as TrueTime API
    Client->>Leader: Begin read-write transaction
    Client->>Leader: Read rows R1, R2 - shared locks acquired
    Client->>Leader: Write mutations M1, M2
    Leader->>Leader: Acquire exclusive write locks on M1, M2
    Leader->>TrueTime: TT.now() - request current time interval
    TrueTime-->>Leader: Interval [t_earliest, t_latest]
    Note over Leader: Assign commit timestamp s = t_latest
    Leader->>Leader: Commit-wait - sleep until TT.now().t_earliest is greater than s
    Leader->>Leader: Apply writes atomically with timestamp s
    Leader->>Leader: Release all read and write locks
    Leader-->>Client: Commit ACK with timestamp s
```
The commit-wait delay is bounded by the TrueTime epsilon, the clock uncertainty. Google keeps this under 7ms in production, typically under 1ms in well-provisioned datacenters. This means Spanner read-write transactions in a single region complete in 5-15ms. Cross-region transactions are dominated by network latency (30-80ms for intercontinental), not by commit-wait.
Two-Phase Locking and Two-Phase Commit at Scale
Spanner's read-write transactions use strict two-phase locking (S2PL): shared locks are acquired on reads and held until commit; exclusive locks are acquired on writes and held until commit. No lock is ever released before the transaction commits. This is the most conservative locking strategy possible and is necessary for serializability: no concurrent transaction can observe another transaction's intermediate state.
For cross-shard writes, Spanner coordinates with two-phase commit where one of the participating shards' Paxos leaders is elected as the transaction coordinator. Unlike DynamoDB's coordinator, Spanner's coordinator state is replicated via Paxos, so coordinator failure does not leave the transaction in limbo. A new Paxos leader can be elected and the transaction can be resumed or aborted without client intervention.
Spanner's read-only transactions bypass 2PL and 2PC entirely. They use bounded-staleness reads served by any up-to-date replica without acquiring locks or coordinating across shards. The tradeoff is a configurable staleness bound ("read data no more than 15 seconds stale"), but for analytics, reporting, and non-critical reads, this provides dramatically better throughput than read-write transactions.
MongoDB Multi-Document Transactions: Oplog-Based Atomicity With a Sharding Caveat
MongoDB added multi-document transactions in version 4.0 for replica sets (2018) and extended them to sharded clusters in version 4.2 (2019). MongoDB's transaction model builds on its existing replication infrastructure, the oplog (operations log), and shares conceptual ancestry with Spanner's 2PC, though with higher overhead in the sharded case.
Replica Set Transactions: The Oplog as the Commit Record
On a replica set, all read and write operations in a transaction execute on the primary node. Reads use a consistent point-in-time snapshot of the data (MVCC-based), meaning reads inside a transaction never see commits from concurrent transactions, only data as it existed when the transaction began. Writes are buffered in memory and not written to the oplog until the transaction commits.
At commit time, all mutations are written to the oplog as a single logical entry. Secondary nodes replicate this entry and apply all mutations atomically; secondaries never observe a partial transaction state. If a conflicting write is detected at commit time (another transaction modified a document this transaction read or intends to write), MongoDB aborts with a WriteConflict error and the client must retry.
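The commit-time conflict check amounts to per-document versioning: a transaction remembers the versions it snapshotted, buffers its writes, and commit fails if any touched version has moved. The toy store below models that rule (it is not the WiredTiger engine, and the class shape is invented for the sketch):

```python
class WriteConflict(Exception):
    pass

class Store:
    def __init__(self):
        self.docs = {}      # _id -> (version, value)
    def insert(self, _id, value):
        self.docs[_id] = (0, value)
    def commit(self, snapshot_versions, buffered_writes):
        # Validate first: every touched doc must still be at its snapshot version.
        for _id in buffered_writes:
            if self.docs[_id][0] != snapshot_versions[_id]:
                raise WriteConflict(_id)         # someone committed under us
        # Only then apply, bumping versions.
        for _id, value in buffered_writes.items():
            version, _ = self.docs[_id]
            self.docs[_id] = (version + 1, value)

store = Store()
store.insert("acct", {"balance": 100})

# Two transactions snapshot the same version, buffer writes, then race to commit.
snap = {"acct": store.docs["acct"][0]}
store.commit(snap, {"acct": {"balance": 70}})    # T1 wins
try:
    store.commit(snap, {"acct": {"balance": 40}})  # T2's snapshot is stale
except WriteConflict:
    print("WriteConflict - T2 must retry from a fresh snapshot")
print(store.docs["acct"][1]["balance"])          # 70
```

Under contention, the losing side retries from scratch, which is exactly the retry-cascade behavior the next paragraph quantifies.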
The performance cost is documented and real. MongoDB's default transaction timeout is 60 seconds; transactions running longer are aborted. Under high write contention, WriteConflict retries cascade into latency spikes. The MongoDB documentation itself recommends avoiding transactions on "hot" documents with high concurrent write rates.
Sharded Cluster Transactions: Cross-Shard 2PC Through the Config Server
When a transaction spans documents on multiple shards, a mongos router serves as the 2PC coordinator. The coordinator writes the transaction state to the config.transactions collection on the config server. It sends PREPARE to all participating shard primaries, collects PREPARED responses, updates config.transactions to COMMITTED, and sends COMMIT to all shards.
Each shard holds document-level write locks for the entire duration of the transaction, from the first write through the final COMMIT ACK. A sharded transaction that reads three documents and writes two on three different shards holds locks on all three shards for its entire execution lifetime. Under high concurrency, this is a meaningful throughput bottleneck and a common source of lock contention escalation in sharded clusters.
Deep Dive: How Distributed ACID Is Actually Enforced Under the Hood
Understanding the mechanics of each database's transaction implementation requires examining three layers: the concurrency control model, the mathematical guarantees it formalizes, and the performance cost those guarantees impose in production.
The Internals: Concurrency Control Models Compared
Each database uses a different concurrency control strategy to achieve isolation, and that strategy determines both its failure modes under contention and its performance characteristics.
DynamoDB uses optimistic concurrency control (OCC) with condition expressions. Rather than locking items during the transaction's lifetime, DynamoDB validates conditions at PREPARE time, atomically checking version numbers, attribute existence, or value predicates at the point of tentative write. If another writer modified an item between the transaction's start and the PREPARE phase, the condition check fails and the transaction aborts. OCC performs well under low contention and degrades under high contention because aborted transactions require full retries; there is no mechanism to "wait" for a conflicting transaction to commit.
Cosmos DB (within a stored procedure) uses a single-threaded execution model on the partition's primary replica. There are no locks in the traditional sense: the JavaScript execution is the lock, because no other operation runs concurrently on that partition thread during the procedure's execution. This is extremely efficient for short-lived, high-frequency transactions on a single partition. The cost is that long-running stored procedures block all other writes to that partition for their entire execution duration; a poorly written stored procedure is a partition-level denial of service.
Google Spanner uses strict two-phase locking (S2PL) combined with multi-version concurrency control (MVCC). Read-write transactions hold all locks from first acquisition through commit. Read-only transactions use MVCC snapshots and require no locks. This separation means read-heavy workloads do not contend with write-heavy workloads, while write-write contention is handled by deadlock detection: when a deadlock is detected, Spanner aborts the younger transaction and expects the client to retry.
MongoDB uses MVCC for reads and optimistic commit-time conflict detection for writes. The snapshot provides read-your-own-writes, monotonic reads, and non-repeatable-read prevention within the transaction. Write conflicts are detected at commit time using document-level versioning. Like DynamoDB's OCC, this degrades under high write contention into retry cascades.
Mathematical Model: TrueTime Intervals and Isolation Level Formalism
TrueTime interval notation. Let TT.now() return interval [t_e, t_l] where t_e (earliest) ≤ true absolute time ≤ t_l (latest). The uncertainty ε = t_l - t_e is bounded by clock synchronization quality. Spanner's commit-wait condition is: assign commit timestamp s = TT.now().t_l, and do not make the transaction visible until TT.now().t_e > s.
This ensures that the true absolute time has passed s before any reader can observe the committed data. Consequently, if transaction T1 commits at real time r1 and transaction T2 starts at real time r2 > r1, then T2's coordinator observes TT.now().t_e > r1 ≥ s1 (where s1 is T1's commit timestamp), guaranteeing s2 > s1. External consistency follows: every transaction that starts after T1's real-world commit will assign a higher timestamp and will therefore read T1's committed writes.
Isolation level anomaly prevention via predicate locking. Serializable isolation prevents write skew by using predicate locks: locks on the set of rows matching a predicate, not just rows that currently exist. When T1 reads all rows where status = 'available', a predicate lock is acquired on that predicate. If T2 inserts a row matching status = 'available' before T1 commits, T2 must wait for T1's predicate lock to be released. Snapshot isolation does not use predicate locks, which is why MongoDB's snapshot-isolated transactions permit write skew. Spanner's S2PL acquires predicate locks on every range read, preventing phantom reads and write skew entirely.
2PC correctness condition. A 2PC transaction can safely commit if and only if all n participants have responded PREPARED before the coordinator transitions to COMMITTED. The unsafe window is coordinator failure between writing COMMITTED and delivering all COMMIT messages; this is resolved by the durable coordinator record that allows a recovered coordinator to resume COMMIT delivery idempotently.
Performance Analysis: Quantifying the Cost of Distributed Atomicity
The table below captures the key overhead categories for transactional operations across all four databases.
| Database | Concurrency Control | Typical TX Latency | Write Overhead | Cross-Region |
| --- | --- | --- | --- | --- |
| DynamoDB | OCC + 2PC | 5-30ms (single region) | 2x WCU cost | No |
| Cosmos DB | Single-threaded partition | 2-10ms (same region) | 1x RU (no tx surcharge for stored proc) | No |
| Spanner | S2PL + 2PC + commit-wait | 10-100ms (global) | Higher replication factor | Yes |
| MongoDB replica set | MVCC + OCC commit | 5-50ms | Contention-sensitive | No |
| MongoDB sharded | MVCC + 2PC coordinator | 30-150ms | High lock duration | No |
The Spanner latency range reflects two very different scenarios. Intra-region transactions (all partitions in one Google datacenter) complete in under 10ms because TrueTime uncertainty is under 1ms. Cross-region transactions where participants span datacenters on different continents are dominated by cross-datacenter round-trip time (typically 50-80ms for transatlantic), and the commit-wait adds only 1-3ms on top of that.
Bottlenecks in practice. DynamoDB's 4 MB payload limit is the first constraint teams hit before the 25-item limit when dealing with large documents. Cosmos DB's partition-hot-spot problem is amplified under transactional load because the single-threaded execution model means transactions queue behind each other on a hot partition. MongoDB's 60-second transaction timeout is operationally dangerous in sharded deployments: a transaction that hits the timeout while holding cross-shard locks generates a timeout error but does not immediately release those locks, causing cascading WriteConflict failures on related documents.
Isolation Levels Compared: Which Anomalies Each Database Prevents for Transactional Reads
The following table maps each database's transactional API to its effective isolation level and the anomalies it prevents. This is the information engineers should consult when evaluating whether a given database can serve a specific consistency requirement.
| Database and API | Effective Isolation | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
| --- | --- | --- | --- | --- | --- |
| DynamoDB TransactWriteItems | Serializable | Prevented | Prevented | Prevented | Prevented |
| DynamoDB TransactGetItems | Serializable (snapshot at call time) | Prevented | Prevented | Prevented | N/A (reads only) |
| Cosmos DB stored procedure / transactional batch | Serializable (within partition) | Prevented | Prevented | Prevented | Prevented |
| Google Spanner read-write transaction | Serializable (external consistency) | Prevented | Prevented | Prevented | Prevented |
| Google Spanner read-only (stale) | Bounded Staleness | Prevented | N/A | N/A | N/A |
| MongoDB replica set transaction | Snapshot Isolation | Prevented | Prevented | Prevented | Not prevented |
| MongoDB sharded cluster transaction | Snapshot Isolation | Prevented | Prevented | Prevented | Not prevented |
MongoDB's snapshot isolation is a meaningful isolation level for most workloads: it prevents the most common anomalies and performs well. But applications with write-skew-prone invariants (inventory reservations, seat allocations, financial balance constraints under concurrent reads) must add explicit condition checks inside the transaction to compensate for the absence of predicate locking.
Advanced Patterns for When Database Transactions Are Not Enough
The databases in this post provide strong transactional guarantees within their defined scopes. Real-world applications routinely need to coordinate state changes across partition boundaries, across services, or across databases. Three patterns address this without requiring the database to expand its transaction scope.
The Saga Pattern. A saga decomposes a long-running, cross-boundary operation into a sequence of local transactions, each triggering the next via an event or message. Each local transaction has a corresponding compensating transaction that undoes its effect if a later step fails. The saga does not provide atomicity in the ACID sense (intermediate states are visible to concurrent readers), but it provides eventual consistency and rollback capability. Sagas are the correct pattern for Cosmos DB cross-partition operations, for DynamoDB operations that exceed the 25-item limit, and for any operation spanning multiple microservices. The failure risk in a saga is in the compensating transaction logic itself: if a compensation fails, the system may be left in a partially compensated state that requires manual intervention.
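The control flow described above can be sketched in a few lines of plain Java. This is a toy executor under simplified assumptions (synchronous steps, no persisted saga state); the SagaSketch, Step, and run names are illustrative, not from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class SagaSketch {
    // One local transaction plus the compensation that undoes it.
    public record Step(String name, Supplier<Boolean> action, Runnable compensation) {}

    // Runs steps in order; on the first failure, runs compensations for the
    // already-completed steps in reverse order. Returns true if all committed.
    public static boolean run(List<Step> steps, List<String> log) {
        List<Step> completed = new ArrayList<>();
        for (Step step : steps) {
            if (step.action().get()) {
                log.add("commit:" + step.name());
                completed.add(step);
            } else {
                log.add("fail:" + step.name());
                for (int i = completed.size() - 1; i >= 0; i--) {
                    completed.get(i).compensation().run();
                    log.add("compensate:" + completed.get(i).name());
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        run(List.of(
            new Step("reserveInventory", () -> true, () -> {}),
            new Step("createOrder", () -> true, () -> {}),
            new Step("chargeCard", () -> false, () -> {}) // simulated failure
        ), log);
        System.out.println(log);
    }
}
```

Note that the log shows exactly the hazard the paragraph names: between "commit" and "compensate" entries, intermediate state is visible to any concurrent reader.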
Idempotency Keys. Any operation that may be retried, including operations that time out, must be assigned a unique client-generated idempotency key. DynamoDB's ClientRequestToken, Stripe's idempotency keys, and MongoDB's session IDs all implement this concept. The key invariant is: a retry with the same key must return the result of the original call without re-executing the operation. Engineering teams that implement retry logic without idempotency keys on transactional paths discover the failure mode in production: a Lambda that retries a timed-out-but-successful DynamoDB transaction without a ClientRequestToken throws TransactionCanceledException because the committed state conflicts with the re-submission's condition checks.
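The invariant can be sketched with an in-memory store. This illustrates the contract only, not any SDK; a real system keeps the key in the same durable store as the operation (as DynamoDB's ClientRequestToken does), and IdempotencySketch is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class IdempotencySketch {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // A retry with the same key returns the stored result without
    // re-executing the operation.
    public String execute(String idempotencyKey, Supplier<String> operation) {
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }

    public static void main(String[] args) {
        IdempotencySketch store = new IdempotencySketch();
        AtomicInteger charges = new AtomicInteger();
        String key = "payment-123"; // client-generated, stable across retries
        String first = store.execute(key, () -> "charged:" + charges.incrementAndGet());
        String retry = store.execute(key, () -> "charged:" + charges.incrementAndGet());
        System.out.println(first + " " + retry + " executions=" + charges.get());
    }
}
```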
Transactional Outbox. When a database write must also trigger a downstream event (a Kafka message, an SQS notification, a search index update), the two writes cannot be made atomic unless the event is stored in the same database as an outbox record in the same transaction. A CDC relay (Debezium, a DynamoDB Streams processor, a Cosmos DB Change Feed consumer) then reads the outbox record and forwards the event after commit. This eliminates the dual-write gap: the event is never lost, because it was committed to the database before the relay reads it.
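A toy in-memory model of the pattern, with one synchronized block standing in for one database transaction and a polling method standing in for the CDC relay. OutboxSketch and its members are illustrative names, not a library API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutboxSketch {
    final Map<String, Integer> accounts = new HashMap<>();
    final Deque<String> outbox = new ArrayDeque<>();
    final List<String> published = new ArrayList<>();

    // One "transaction": the state change and its outbox record commit together.
    public synchronized void credit(String account, int amount) {
        accounts.merge(account, amount, Integer::sum);
        outbox.add("credited:" + account + ":" + amount);
    }

    // The relay (Debezium / Streams / Change Feed in real systems): reads
    // committed outbox records and forwards them after the fact.
    public synchronized void relayOnce() {
        String event;
        while ((event = outbox.poll()) != null) published.add(event);
    }

    public static void main(String[] args) {
        OutboxSketch db = new OutboxSketch();
        db.credit("acct-1", 100);
        db.relayOnce();
        System.out.println(db.published);
    }
}
```

Because the event record only exists if the business write committed, a crash between "commit" and "relay" delays the event but never loses it.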
Side-by-Side Comparison: Atomicity Scope, Isolation, Latency, and Cost
| Dimension | DynamoDB | Cosmos DB | Google Spanner | MongoDB |
|---|---|---|---|---|
| Atomicity Scope | Up to 25 items, any tables, same region | All docs in same logical partition key | Any rows, any tables, any region | All docs in same RS or sharded cluster |
| Cross-Partition Atomic Write | Yes (items on diff partitions, same region) | No (hard partition boundary) | Yes (cross-shard, cross-region) | Yes (sharded tx), but high lock cost |
| Cross-Region Atomic Write | No | No | Yes (global external consistency) | No |
| Effective Isolation Level | Serializable | Serializable (within partition) | Serializable (external consistency) | Snapshot Isolation |
| Typical Transaction Latency | 5-30 ms | 2-10 ms | 10-100 ms (global) | 5-50 ms (RS) / 30-150 ms (sharded) |
| Write Cost Premium | 2x WCU | 1x RU (no surcharge) | Higher per replication factor | Higher under write contention |
| Max Items per Transaction | 25 items, 4 MB total | Partition-bound (transactional batch: 100 ops, 2 MB) | No practical limit | 16 MB per doc; no tx-level doc limit |
| Conflict Handling | OCC: abort and client retry | Single-threaded: no mid-proc conflict | S2PL + deadlock detection | OCC: WriteConflict abort and retry |
| Multi-Region Write | Global Tables (eventual, no ACID) | Multi-region write with conflict resolution | Full ACID globally | No built-in multi-region write tx |
Real-World Use Cases: Where Each Database's Transaction Model Earns Its Keep
DynamoDB in payments (idempotent charge capture). A payment processing service issues a TransactWriteItems call that atomically writes a payment record and upserts an idempotency key item. Both items use paymentId as the partition key; the transaction is two items, well within the 25-item limit. If the Lambda handling the charge times out after the transaction succeeds, the retry with the same ClientRequestToken returns the original success without re-charging the customer. This is DynamoDB transactions used exactly as designed: small scope, high frequency, idempotency-critical operations.
Cosmos DB in gaming (partition-aligned session rewards). An online multiplayer game stores player session state, inventory items, and achievement flags all partitioned by playerId. A stored procedure atomically applies a level-up reward: add experience points, append an item to the inventory array, set an achievement flag, all within the same playerId partition. No cross-partition coordination is required because the data model was designed with the partition constraint in mind from the start. This is the ideal Cosmos DB transaction use case: high-frequency, partition-aligned operations on data that naturally lives together.
Spanner in global fintech (cross-region ledger integrity). Google Pay's internal systems use Spanner for globally consistent financial ledgers. A transfer between a user in Tokyo and a merchant in London involves two rows on two partitions in two geographic regions. Spanner commits this atomically with a globally consistent timestamp. A reconciliation job running in any region will see a consistent ledger: no credit without a corresponding debit, no gap between the two. This is the use case that justifies Spanner's operational complexity and cost: when cross-region atomic writes with serializable isolation are a business requirement, not a nice-to-have.
MongoDB in content management (publish-time consistency). A media platform uses MongoDB replica set transactions to atomically update an article document and its tag index document when an editor publishes. The transaction ensures the article is never searchable without its tags. The replica set model works because both documents live in the same cluster, write frequency is in the thousands per day (not millions per second), and the absence of write-skew risk (no two editors publish the same article simultaneously) makes snapshot isolation sufficient.
Trade-offs and Failure Modes Across All Four Databases
Performance versus correctness. Every strong isolation guarantee costs latency. DynamoDB transactions cost 2x in WCU; Spanner's commit-wait adds bounded but real delay; MongoDB sharded transactions hold cross-shard locks for the entire transaction duration. Teams should evaluate whether they genuinely need serializable multi-item isolation or whether atomic conditional single-item writes (ConditionExpression, ETag checks, findOneAndUpdate with conditions) are sufficient. A large fraction of "transaction" use cases can be handled by single-item OCC without the overhead of 2PC.
Cascading contention under load. Under high write contention, all four databases degrade similarly. DynamoDB returns TransactionConflictException. Cosmos DB's stored procedure queue grows on hot partition keys. Spanner detects deadlocks and aborts younger transactions. MongoDB returns WriteConflict. All of these require client-side retries. Without exponential backoff and jitter in the retry logic, retries under load amplify contention rather than resolving it, a pattern known as the retry storm.
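The backoff-with-jitter discipline can be sketched as a "full jitter" delay calculation: the sleep is drawn uniformly from zero up to an exponentially growing, capped ceiling. BackoffSketch, the method name, and the constants are illustrative, not from any SDK.

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    static final long BASE_MS = 50;    // first-retry ceiling
    static final long CAP_MS = 5_000;  // never sleep longer than this

    // Full jitter: delay is uniform in [0, min(cap, base * 2^attempt)].
    static long backoffDelayMs(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt
                + " -> chose " + backoffDelayMs(attempt) + " ms");
        }
    }
}
```

The randomness is the point: if every aborted client slept a deterministic delay, they would all retry in lockstep and re-collide, which is exactly the retry storm described above.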
The cross-partition silent failure. The failure mode that opened this post (two sequential writes that should have been one transaction) is the most common and the most insidious. It does not produce an error. It does not appear on any dashboard. The mitigation is architectural: understand the atomicity scope of your chosen database before modeling data, and explicitly design your data model to keep related writes within that scope.
Multi-region transaction limitations. Of the four databases, only Spanner delivers fully serializable cross-region atomic writes. Cosmos DB multi-region writes use conflict resolution, not serialization. DynamoDB Global Tables replicates asynchronously. MongoDB cross-region replication is not transactional. Engineers who require cross-region atomicity outside of Spanner must implement compensating transaction patterns (Saga) with idempotency guarantees, and must accept the weaker consistency model that implies.
Decision Framework: Matching Your Workload to the Right Transactional Database
The flowchart below captures the key decision points for selecting a transactional strategy. The most important branch comes first: does the workload actually require multi-item atomicity, or will atomic conditional single-item writes suffice?
flowchart TD
A[Does the operation require multi-item atomicity?] -->|No| B[Use conditional writes - cheaper and sufficient]
A -->|Yes| C{How many items per transaction?}
C -->|Up to 25, same AWS region| D[DynamoDB TransactWriteItems - strong fit]
C -->|All items share one partition key| E{Which cloud platform?}
C -->|Cross-partition or more than 25 items| F{Is cross-region ACID required?}
E -->|Azure| G[Cosmos DB transactional batch or stored procedure]
E -->|Other| H[Re-evaluate data model or use Saga pattern]
F -->|No - single region, cross-shard| I[MongoDB sharded tx or Saga pattern]
F -->|Yes - global ACID required| J[Google Spanner - only option with external consistency]
D --> K{High write contention on same items?}
K -->|Yes| L[Add idempotency tokens and retry with jitter backoff]
K -->|No| M[Standard TransactWriteItems with ClientRequestToken]
This flowchart surfaces the Spanner decision at the right place: it is the answer when and only when cross-region atomic writes with serializable isolation are a hard requirement. For every other use case, one of the other three databases provides sufficient guarantees at lower operational cost.
| Situation | Recommended Approach |
|---|---|
| Single-item atomic update (increment counter, claim one slot) | Conditional write only; no transaction overhead needed |
| Up to 25 items, same AWS region, high frequency | DynamoDB TransactWriteItems with ClientRequestToken |
| All items share a partition key, Azure, moderate frequency | Cosmos DB transactional batch or stored procedure |
| Cross-partition operations, Azure | Saga pattern with Cosmos DB Change Feed as the event relay |
| Global cross-region atomic writes, serializable isolation | Google Spanner; accept commit-wait latency |
| Existing MongoDB replica set, low contention | MongoDB multi-document transaction; sufficient for most use cases |
| MongoDB, sharded, high concurrent write frequency | Evaluate Saga pattern; sharded transaction lock costs escalate quickly |
Worked Example: Atomic Inventory Reservation for a Flash Sale
A flash sale service must atomically reserve inventory and create a customer order. The business invariant: the inventory count must never go below zero, and no order should be created without a corresponding inventory decrement.
System state before the transaction:
- Item: INVENTORY#sku-789 with attribute stock = 1 (last unit)
- Item to be created: ORDER#order-456 with status RESERVED
Scenario A: DynamoDB with one unit remaining, two concurrent buyers.
Two Lambda functions execute TransactWriteItems simultaneously. Each transaction contains:
- Update on INVENTORY#sku-789: decrement stock by 1, with ConditionExpression: stock >= :min where :min = 1.
- Put on ORDER#order-456: write the new order item.
Lambda A's transaction acquires locks at the PREPARE phase. Lambda B's transaction also attempts to acquire a lock on INVENTORY#sku-789 but finds it locked by Lambda A's in-flight PREPARE. Lambda B receives TransactionConflictException. Lambda A proceeds: its condition check passes (stock = 1 >= 1), both writes commit, stock becomes 0, and the order is created. Lambda B retries; this time the condition check fails (stock = 0, not >= 1) and the transaction is cancelled. The order is never created for Lambda B. No inventory goes negative.
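The essence of Scenario A can be reduced to a compare-and-set loop. This is an illustration of the semantics only: AtomicInteger stands in for the inventory item, compareAndSet for the ConditionExpression, and the retry branch for the TransactionConflictException path; ReservationSketch is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReservationSketch {
    // Check-then-decrement that either commits atomically or fails cleanly.
    static boolean tryReserve(AtomicInteger stock) {
        while (true) {
            int current = stock.get();
            if (current < 1) return false; // condition check fails: cancel
            if (stock.compareAndSet(current, current - 1)) return true; // commit
            // lost the race (the TransactionConflictException case): retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger stock = new AtomicInteger(1); // last unit
        boolean buyerA = tryReserve(stock);
        boolean buyerB = tryReserve(stock);
        System.out.println("A=" + buyerA + " B=" + buyerB + " stock=" + stock.get());
    }
}
```

Whichever buyer wins the race, exactly one reservation succeeds and stock never goes negative, which is the invariant the DynamoDB condition check enforces.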
Scenario B: What Cosmos DB would require for the same invariant.
If both INVENTORY#sku-789 and ORDER#order-456 use sku as the partition key and both carry pk = "sku-789", a Cosmos DB stored procedure can execute this atomically. The stored procedure reads the inventory document, checks the stock count, decrements it, and creates the order document, all in one JavaScript execution block. If the check fails (stock is 0), the JavaScript throws and both writes are rolled back. If the ORDER#order-456 document had a different partition key (for example, partitioned by customerId), no Cosmos DB transaction could span both documents; the Saga pattern with a compensating "release inventory" operation would be required.
Scenario C: What Spanner enables beyond both.
Spanner executes this as a read-write transaction with no item count limit and no partition constraint. The transaction reads the inventory row (acquiring a shared read lock), checks the stock, decrements inventory (acquiring an exclusive write lock), and inserts the order row. TrueTime commit-wait ensures that any read of either row from any region after this transaction's real-world commit time sees the post-sale state. If the platform runs flash sales with global inventory (one pool shared by buyers in Tokyo, London, and New York), Spanner is the only database that prevents oversell without application-level sharding of the inventory counter.
SDK Integration: Transactional Writes in DynamoDB, Cosmos DB, and Spanner
The following snippets show the minimal SDK call pattern for transactional writes in each database. These are integration-level examples intended to show the structural shape of the API, not full application implementations.
DynamoDB TransactWriteItems (AWS SDK v2 for Java)
TransactWriteItemsRequest txRequest = TransactWriteItemsRequest.builder()
.clientRequestToken(idempotencyToken) // Required for safe retries
.transactItems(
TransactWriteItem.builder()
.update(Update.builder()
.tableName("Accounts")
.key(Map.of("accountId", AttributeValue.fromS(sourceId)))
.updateExpression("SET balance = balance - :amount")
.conditionExpression("balance >= :amount") // Prevents negative balance
.expressionAttributeValues(Map.of(
":amount", AttributeValue.fromN("100")))
.build())
.build(),
TransactWriteItem.builder()
.update(Update.builder()
.tableName("Accounts")
.key(Map.of("accountId", AttributeValue.fromS(destId)))
.updateExpression("SET balance = balance + :amount")
.expressionAttributeValues(Map.of(
":amount", AttributeValue.fromN("100")))
.build())
.build()
)
.build();
dynamoDbClient.transactWriteItems(txRequest);
// Catch TransactionCanceledException for condition failures
// Catch TransactionConflictException for contention; retry with backoff
Cosmos DB Transactional Batch (Azure SDK for Java)
// All items in the batch MUST share the same partition key value
TransactionalBatch batch = TransactionalBatch.createTransactionalBatch(
new PartitionKey("sku-789")
);
batch.upsertItem(inventoryUpdatePojo); // Decrement stock
batch.createItem(newOrderPojo); // Create order in RESERVED state
TransactionalBatchResponse response = container
.executeTransactionalBatch(batch)
.block();
if (!response.isSuccessStatusCode()) {
// Check response.getResults() for per-operation failure details
throw new RuntimeException("Transactional batch failed: " + response.getStatusCode());
}
Google Cloud Spanner Read-Write Transaction (Java)
// Spanner SDK handles retry on AbortedException automatically
dbClient.readWriteTransaction().run(txContext -> {
// Read with shared lock held until commit
Struct row = txContext.readRow(
"Inventory", Key.of("sku-789"), List.of("stock"));
long stock = row.getLong("stock");
if (stock < 1) throw new RuntimeException("Out of stock");
// Buffer mutations: applied atomically at commit, not immediately
txContext.buffer(
Mutation.newUpdateBuilder("Inventory")
.set("sku").to("sku-789")
.set("stock").to(stock - 1)
.build()
);
txContext.buffer(
Mutation.newInsertBuilder("Orders")
.set("orderId").to(orderId)
.set("status").to("RESERVED")
.build()
);
return null;
});
// The lambda re-executes automatically if Spanner aborts due to lock conflict
For a full deep-dive on Spanner's session pooling, staleness configuration, and interleaved table design patterns, see the Cloud Spanner transactions documentation. For Cosmos DB Change Feed as the Saga event relay, see Cosmos DB API Modes Explained.
Lessons Learned: What Production Teaches You About Distributed Transactions
1. Two sequential writes are never a transaction. The most common distributed transaction bug is issuing two independent write calls and assuming they are atomic because they run in the same Lambda or service method. They are not. Any failure, timeout, partial retry, or concurrent writer between them creates permanent partial state. Use the transactional API explicitly, every time.
2. Model data for your transaction scope before the first write hits production. DynamoDB's 25-item limit and Cosmos DB's single-partition constraint are not bugs. They are the design. A data model that requires atomic writes across partition keys in Cosmos DB or across more than 25 items in DynamoDB cannot be fixed with API calls; it requires a data model redesign or an architectural pattern change (Saga). Discovering this in production is expensive.
3. Idempotency tokens are not optional in any retried transactional path. Any transactional operation that can be retried (Lambda automatic retries, SDK-level retries, application-level error handling) must use idempotency tokens. Without them, retries of successful-but-timed-out operations produce duplicates, conflicts, or data corruption depending on the database. DynamoDB's ClientRequestToken, Cosmos DB's request IDs, and Spanner's session model each address this differently; understand which mechanism applies to your SDK and use it.
4. The "C" in ACID is your responsibility, not the database's. Serializable isolation ensures transactions do not interfere with each other's reads and writes. It does not ensure your business invariants hold. No negative balance, no oversold inventory, no duplicate order: these must be encoded as condition expressions, stored procedure checks, or read-then-write guards inside the transaction. A serializable transaction that does not check preconditions will still commit business rule violations.
5. Spanner's commit-wait is the mechanism that makes global ACID possible. Engineers encountering commit-wait for the first time look for a way to disable it. The delay is typically 1-7 ms in practice, and it is what makes external consistency possible. If your global application requires that a read starting after a committed write always sees that write, regardless of which region the reader is in, commit-wait is the mechanism guaranteeing this. Disabling it would silently downgrade the consistency guarantee.
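Point 5 can be made concrete with a toy model, assuming a fixed clock uncertainty epsilon: TT.now() is modeled as the interval [t - eps, t + eps], the commit timestamp s is taken as t_latest, and the commit is not acknowledged until t_earliest has passed s. The class and constants are illustrative; real TrueTime uncertainty varies, and Spanner overlaps the wait with Paxos replication.

```java
public class CommitWaitSketch {
    static final long EPSILON_MS = 3; // simulated clock uncertainty

    // Commit at s = TT.now().t_latest, then wait until TT.now().t_earliest > s.
    // Returns how many simulated milliseconds the commit had to wait.
    static long commitWaitMs(long trueTimeNowMs) {
        long s = trueTimeNowMs + EPSILON_MS; // t_latest at commit
        long t = trueTimeNowMs;
        while (t - EPSILON_MS <= s) {        // t_earliest not yet past s
            t++;                             // simulate time advancing 1 ms
        }
        return t - trueTimeNowMs;            // = 2 * EPSILON_MS + 1 here
    }

    public static void main(String[] args) {
        System.out.println("commit-wait = " + commitWaitMs(0) + " ms");
    }
}
```

The wait is on the order of twice the clock uncertainty, which is why the 1-7 ms figure tracks TrueTime's epsilon rather than network latency.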
6. MongoDB sharded cluster transactions trade throughput for atomicity. The lock duration in a sharded MongoDB transaction is bounded by the total transaction execution time, not individual write durations. This is a fundamental overhead characteristic, not a configuration issue. For write-heavy, highly concurrent workloads on a sharded MongoDB cluster, the Saga pattern with compensating transactions generally delivers better throughput and lower operational risk than sharded multi-document transactions.
TLDR: One Paragraph That Replaces 20 Stack Overflow Tabs
ACID transactions in distributed databases are a spectrum, not a binary capability. DynamoDB delivers serializable multi-item atomicity within 25 items in a single region using 2PC backed by a coordinator record; use ClientRequestToken for safe retries or risk duplicate operations. Cosmos DB delivers serializable atomicity only within a single logical partition key, enforced through stored procedure execution; cross-partition operations require the Saga pattern or explicit data model redesign. Google Spanner is the only cloud database that delivers externally consistent global ACID using TrueTime-based commit timestamps, strict two-phase locking, and Paxos-replicated 2PC across geographic regions, at the cost of commit-wait latency that adds 1-7 ms in practice. MongoDB provides snapshot-isolated multi-document transactions on replica sets with reasonable overhead, and sharded cluster transactions with significantly higher lock duration cost; neither prevents write skew, which requires application-level condition checks. The question is never "which database supports ACID?" (all four do, within their defined scope). The question is always "what scope, what isolation, and at what cost?"
Practice Quiz
A DynamoDB TransactWriteItems call involves 8 items across 3 tables. The coordinator writes the transaction record to COMMITTED state and sends COMMIT to 5 of the 8 partitions before the coordinator process crashes. What happens when a new coordinator picks up the transaction?
- A) The 5 committed items are rolled back to be consistent with the 3 that did not receive COMMIT
- B) The transaction is permanently lost; partial state remains
- C) The new coordinator reads the COMMITTED record and replays COMMIT to the 3 partitions that did not ACK, completing the transaction idempotently
- D) The client receives a TransactionCanceledException and must resubmit the full transaction

Correct Answer: C
A Cosmos DB stored procedure operates on documents D1, D2, and D3, all with partition key tenantId = "acme". The procedure updates D1 and D2 successfully and is then interrupted by an uncaught exception when processing D3. What is the observable state of D1 and D2 to a concurrent reader?
- A) D1 and D2 reflect the stored procedure's changes; D3 is unmodified
- B) D1, D2, and D3 all remain at their pre-procedure values; the entire invocation is rolled back
- C) D1 and D2 are visible in the updated state; D3 shows a partial write
- D) The stored procedure is automatically retried from the beginning

Correct Answer: B
Google Spanner assigns commit timestamp s = TT.now().t_latest and waits until TT.now().t_earliest > s before applying writes. A read-only transaction starts in a different datacenter one second after Spanner releases the locks. Which guarantee does commit-wait provide to that reader?
- A) The read will return data no more than 7 ms stale due to Paxos replication lag
- B) The reader's transaction will observe a TrueTime lower bound strictly greater than s, guaranteeing it sees the committed state without any additional synchronization
- C) The reader is served a snapshot from before the write and must retry to see the updated data
- D) Spanner blocks the read until the Paxos leader replicates the commit to the reader's datacenter

Correct Answer: B
A MongoDB replica set transaction reads an inventory document (available = 1) and a reservation document (reserved = 0) in the same snapshot, then writes available = 0 and reserved = 1. A concurrent transaction reads the same snapshot and makes the same decision. Both transactions commit successfully. What anomaly has occurred, and what is the correct mitigation at the database level?
- A) Dirty read: mitigated by upgrading to Read Committed isolation
- B) Phantom read: mitigated by using range locks on the inventory collection
- C) Write skew: both transactions saw a consistent snapshot that allowed each to proceed, resulting in available = -1 and reserved = 2; mitigated by adding an explicit check-then-write inside the transaction using findOneAndUpdate with a filter on available > 0 as a condition
- D) Non-repeatable read: mitigated by re-reading documents before the commit

Correct Answer: C
Open-ended challenge: A global payments platform requires that a debit from a Tokyo account and a credit to a New York account be committed atomically, and that any balance read starting anywhere in the world after the transfer's real-world completion time reflects the post-transfer state, with no possibility of a reader in either region observing stale pre-transfer balances. Explain precisely why DynamoDB, Cosmos DB, and standard MongoDB cannot satisfy this requirement in their standard configurations. Then explain the three specific mechanisms in Google Spanner (TrueTime, strict two-phase locking, and Paxos-replicated 2PC) that together make this guarantee possible, and describe the engineering trade-off the platform must accept to use Spanner for this workload.