ACID Transactions in Distributed Databases: DynamoDB, Cosmos DB, and Spanner Compared
How modern NoSQL databases achieve atomicity and isolation without a single machine
TLDR: ACID transactions in distributed databases are not equal. DynamoDB provides multi-item atomicity scoped to 25 items using two-phase commit with a coordinator item, but only within a single region. Cosmos DB wraps partition-scoped operations inside a stored procedure using a JavaScript continuation model; cross-partition writes get no atomic guarantee. Google Spanner is the only cloud database that delivers true global ACID with external consistency, using TrueTime (GPS + atomic clocks) to assign globally ordered timestamps and a commit-wait delay to guarantee causality across datacenters. MongoDB supports multi-document transactions on replica sets but carries measurable performance overhead and does not prevent write skew under snapshot isolation. The right choice depends on atomicity scope, latency tolerance, and operational cost, not on which database claims to be "ACID-compliant."
The Write That Silently Disappeared
A payments team at a mid-size fintech company migrated their account ledger service from PostgreSQL to DynamoDB. In PostgreSQL, every funds transfer ran inside a BEGIN / COMMIT block: debit the source account, credit the destination account, write a transaction record. Three rows, one atomic commit. The team ported the logic verbatim to DynamoDB and deployed. The monitoring dashboard showed zero errors. Latency improved by 40%.
Three weeks later, a nightly reconciliation job found thousands of accounts where a debit had fired but the corresponding credit had never landed. Source accounts had been charged. Destination accounts were unchanged. No exception had been logged. No CloudWatch alarm had fired.
```java
// What the team wrote - looks like a safe "two-step" update
dynamoDB.updateItem(debitSourceAccountRequest);   // succeeds
dynamoDB.updateItem(creditDestinationRequest);    // may never run
```
The code issued two independent UpdateItem calls. Each call was individually atomic and durable. But the pair was not. A Lambda timeout, a network blip, or a silent retry on only the first call had created thousands of partial transfers. In DynamoDB, and in every distributed database, two sequential writes are never automatically part of the same transaction unless you use the transactional API explicitly.
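The failure mode above can be reproduced in a few lines. This is a toy in-memory model (not the DynamoDB API) contrasting two independent writes, where a crash between them leaves a partial transfer, with an all-or-nothing apply that stages both mutations before making either visible:

```python
def transfer_two_calls(accounts, src, dst, amount, fail_after_first=False):
    """Two independent updates: a crash between them leaves a partial transfer."""
    accounts[src] -= amount          # first UpdateItem succeeds
    if fail_after_first:
        raise TimeoutError("second call never ran")
    accounts[dst] += amount          # second UpdateItem may never run

def transfer_atomic(accounts, src, dst, amount, fail_midway=False):
    """All-or-nothing: stage both mutations, apply only if every step staged."""
    staged = dict(accounts)          # tentative copy, invisible to readers
    staged[src] -= amount
    if fail_midway:
        raise TimeoutError("transaction aborted")   # nothing was applied
    staged[dst] += amount
    accounts.update(staged)          # single visible switch = commit

accounts = {"A": 100, "B": 0}
try:
    transfer_two_calls(accounts, "A", "B", 30, fail_after_first=True)
except TimeoutError:
    pass
print(accounts)  # {'A': 70, 'B': 0} - money vanished

accounts = {"A": 100, "B": 0}
try:
    transfer_atomic(accounts, "A", "B", 30, fail_midway=True)
except TimeoutError:
    pass
print(accounts)  # {'A': 100, 'B': 0} - intact after abort
```

The real fix, as the post discusses next, is DynamoDB's TransactWriteItems API, which makes the database itself perform the staged-then-committed dance across partitions.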
This is not a DynamoDB bug. It is a failure of assumptions borrowed from PostgreSQL. And the assumption that "ACID-capable" means the same thing across all databases is the most expensive misconception in distributed systems engineering.
The real question is not "does this database support ACID?" All four databases in this post do, in some form. The real questions are: what scope does ACID cover, what isolation level does it actually enforce, and what does it cost in latency and write throughput?
ACID Unpacked for Distributed Systems: What Each Letter Means When There Is No Single Machine
The ACID acronym was coined for single-node relational databases. Transplanted into distributed systems, each letter means something subtly different, and one of them barely means anything at all.
Atomicity in a distributed context means all writes in a logical transaction either all commit or all abort, even when those writes span multiple nodes, partitions, or shards. On a single machine, "all or nothing" is enforced by the storage engine via write-ahead logging: a crash before the WAL is flushed is a rollback, a crash after is a commit. Across machines, atomicity requires a coordination protocol, typically two-phase commit (2PC), where every participant must agree to commit before anyone actually applies the change. Failure between the PREPARE phase and the COMMIT phase is the hardest sub-problem in distributed transactions: some participants may have tentative mutations that are never finalized.
Consistency is the weakest letter in distributed systems, and arguably the most misleading. In ACID, "consistent" means "your application invariants hold after the transaction." The database enforces type constraints and referential integrity. The application must encode business rules: no negative balances, no oversold inventory, no duplicate order IDs. The "C" does not map to the CAP theorem's definition of consistency (linearizability). Engineers who expect a distributed database's ACID guarantee to automatically enforce business-level correctness will be surprised to discover that a serializable transaction can still commit a business rule violation if the application code does not encode that rule as a condition.
Isolation is the most nuanced letter. It describes which anomalies concurrent transactions can produce when their operations interleave. The SQL standard defines four isolation levels, but the anomalies those levels prevent are more useful to reason about directly:
| Anomaly | Description | Prevented By |
| --- | --- | --- |
| Dirty Read | Reading uncommitted data from another in-flight transaction | Read Committed and above |
| Non-Repeatable Read | Re-reading the same row and finding it was modified by a committed concurrent transaction | Repeatable Read and above |
| Phantom Read | A range query re-executed returns new rows inserted by a committed concurrent transaction | Serializable only |
| Write Skew | Two concurrent transactions each read overlapping data, make independent decisions, and commit changes that together violate an invariant | Serializable only (not Snapshot Isolation) |
Write skew is the anomaly most engineers encounter only in production. Two concurrent "reserve last seat" transactions can both read available_seats = 1, both decide to proceed, and both commit, leaving available_seats = -1. Serializable isolation prevents this. Snapshot isolation, which MongoDB uses, does not.
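The seat-reservation anomaly is easy to demonstrate. The sketch below is an illustrative toy model, not any real engine: under snapshot isolation both transactions decide against the same stale snapshot, while a serializable-style commit-time re-validation forces the second transaction to abort.

```python
def run_snapshot_isolation(store):
    """Both transactions read from the same snapshot taken at start."""
    snapshot = dict(store)
    decisions = []
    for tx in ("T1", "T2"):
        if snapshot["available_seats"] >= 1:   # each sees 1 seat available
            decisions.append(tx)
    for tx in decisions:                        # both blind-write the decrement
        store["available_seats"] -= 1
    return store["available_seats"]

def run_serializable(store):
    """Re-validate the read at commit time; abort the transaction whose read is stale."""
    committed = []
    for tx in ("T1", "T2"):
        seats_read_at_start = 1                 # what each transaction observed
        if store["available_seats"] != seats_read_at_start:
            continue                            # T2 aborts: its read no longer holds
        store["available_seats"] -= 1
        committed.append(tx)
    return store["available_seats"], committed

print(run_snapshot_isolation({"available_seats": 1}))   # -1: invariant violated
print(run_serializable({"available_seats": 1}))         # (0, ['T1'])
```

This is exactly the compensating pattern the MongoDB section later recommends: if the engine only gives you snapshot isolation, the application must re-check the invariant inside the transaction.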
Durability in a distributed context means a committed write has been replicated to a quorum of nodes and persisted to disk before the acknowledgment reaches the client. Losing a single node after commit must not lose the data. All four databases in this post provide durable writes, but their quorum configurations, replication lag, and recovery guarantees differ meaningfully in multi-region deployments.
DynamoDB TransactWriteItems: Two-Phase Commit Without Calling It Two-Phase Commit
Amazon DynamoDB added transactional APIs in late 2018 via TransactWriteItems and TransactGetItems. The documentation describes ACID guarantees for up to 25 items across up to 25 tables within a single AWS region. What the documentation does not say is that these APIs implement a variant of two-phase commit backed by a coordinator item stored in DynamoDB's own infrastructure.
How the Coordinator Item Orchestrates 2PC
When a client calls TransactWriteItems, DynamoDB's transaction layer writes a transaction record (a special metadata item tracked internally) with status PENDING and the full set of intended operations, condition expressions, and item keys. This record is the single source of truth for the transaction's state. If the coordinator process dies after writing this record, another coordinator process can read the record, inspect its state, and either complete the transaction or roll it back.
The coordinator then sends PREPARE messages to each partition node that owns an affected item. Each partition node locks the item (blocking concurrent writes), evaluates the ConditionExpression, and writes a tentative mutation. If all partitions respond PREPARED, the coordinator updates the transaction record to COMMITTED and sends COMMIT to each partition. Each partition applies the tentative write and releases the lock.
The sequence below shows the full 2PC flow for a funds transfer involving items on two separate DynamoDB partitions. Pay attention to where the coordinator record transitions from PENDING to COMMITTED: this is the point of no return. Everything before it can be safely rolled back; everything after it must eventually be committed.
```mermaid
sequenceDiagram
    participant Client
    participant Coordinator
    participant PartitionA as Partition A
    participant PartitionB as Partition B
    Client->>Coordinator: TransactWriteItems with ClientRequestToken
    Coordinator->>Coordinator: Write PENDING transaction record
    Coordinator->>PartitionA: PREPARE - lock item, evaluate condition, tentative write
    Coordinator->>PartitionB: PREPARE - lock item, evaluate condition, tentative write
    PartitionA-->>Coordinator: PREPARED
    PartitionB-->>Coordinator: PREPARED
    Coordinator->>Coordinator: Update transaction record to COMMITTED
    Coordinator->>PartitionA: COMMIT - apply write, release lock
    Coordinator->>PartitionB: COMMIT - apply write, release lock
    PartitionA-->>Coordinator: ACK
    PartitionB-->>Coordinator: ACK
    Coordinator-->>Client: TransactionSucceeded
```
This diagram explains both the 2x write cost (every transactional write involves at least two round trips to the partition: PREPARE then COMMIT) and why DynamoDB bills transactional writes at 2x the standard write capacity unit rate. The coordinator overhead is real and unavoidable.
Three Failure Scenarios and Why the Idempotency Token Is Not Optional
DynamoDB's 2PC exposes three distinct failure windows. Engineers who do not anticipate them build retry loops that introduce duplicate operations.
Scenario 1: Coordinator fails after PENDING but before any PREPARE. No partition has been touched. On recovery, a new coordinator reads the PENDING record, finds no PREPARED responses, and rolls back by deleting the PENDING record. No writes have occurred.
Scenario 2: Coordinator fails after some or all partitions respond PREPARED but before the coordinator record transitions to COMMITTED. Affected partitions hold locked items with tentative mutations. Other writers will receive TransactionConflictException attempting to write those items. On recovery, a new coordinator reads the PENDING record, finds no COMMITTED state, and sends ROLLBACK to all participants. Tentative mutations are discarded and locks are released.
Scenario 3: Coordinator fails after updating the record to COMMITTED but before all COMMIT messages are delivered. This is the safe failure: the record is in COMMITTED state. On recovery, a new coordinator reads COMMITTED, identifies which partitions have not yet ACKed, and replays COMMIT to those partitions. This replay is idempotent because each partition checks the coordinator record before applying.
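The recovery rule common to all three scenarios is that the durable transaction record alone decides the outcome: anything still PENDING rolls back, anything COMMITTED replays forward. The toy state machine below illustrates that rule; the field names and structure are invented for the sketch and are not DynamoDB internals.

```python
def recover(tx_record, partitions):
    """A new coordinator resumes a transaction from its durable record."""
    if tx_record["status"] == "PENDING":
        # Scenarios 1 and 2: never reached the point of no return.
        # Discard tentative mutations and release locks.
        for p in partitions:
            p["tentative"] = None
            p["locked"] = False
        tx_record["status"] = "ROLLED_BACK"
    elif tx_record["status"] == "COMMITTED":
        # Scenario 3: point of no return passed. Replay COMMIT idempotently;
        # partitions that already applied have no tentative write left.
        for p in partitions:
            if p["tentative"] is not None:
                p["value"] = p["tentative"]
                p["tentative"] = None
                p["locked"] = False
    return tx_record["status"]

# Coordinator died after both partitions PREPARED, before COMMITTED (Scenario 2):
parts = [{"value": 10, "tentative": 7, "locked": True},
         {"value": 5, "tentative": 8, "locked": True}]
print(recover({"status": "PENDING"}, parts))    # ROLLED_BACK - values unchanged

# Coordinator died after COMMITTED, with one COMMIT undelivered (Scenario 3):
parts = [{"value": 10, "tentative": 7, "locked": True},
         {"value": 8, "tentative": None, "locked": False}]  # already applied
print(recover({"status": "COMMITTED"}, parts))  # COMMITTED - lagging partition catches up
```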
The ClientRequestToken (idempotency token) is the mechanism that makes client-side retries safe across all three scenarios. If a client retries a TransactWriteItems call after a timeout using the same token, DynamoDB returns the result of the original call without re-executing, even if the original call succeeded. Without the token, a retry of a timed-out successful transaction throws TransactionCanceledException because the committed state already exists and the re-submission detects a conflict.
DynamoDB Transaction Limits That Shape Your Data Model
The constraints are not arbitrary API limits; they reflect the design of the in-memory coordinator state:
- 25 items maximum per transaction across up to 25 tables.
- Single AWS region only: DynamoDB Global Tables does not extend transactional guarantees across regions.
- 4 MB total payload per transaction.
- All items must reside in the same AWS account.
These limits mean that any application logic requiring atomic coordination across more than 25 items (a cart checkout with 40 line items, a bulk inventory adjustment) cannot use TransactWriteItems and must fall back to the Saga pattern with compensating transactions.
Cosmos DB Partition-Scoped Transactions: When JavaScript Is the Transaction Boundary
Azure Cosmos DB supports multi-document transactions, but under a constraint that surprises most developers: all documents in a transaction must share the same logical partition key. There is no cross-partition atomic write in Cosmos DB. Understanding why requires understanding what the transaction boundary actually is.
Stored Procedures as the Transaction Unit
Cosmos DB's transaction boundary is the stored procedure execution. A stored procedure is a JavaScript function registered against a container. When Cosmos DB executes a stored procedure, the entire JavaScript function body runs as a single atomic operation on a single partition's primary replica. If the JavaScript throws an uncaught exception at any point, the Cosmos DB engine rolls back every document operation performed in that invocation. If the function completes without throwing, all mutations are committed atomically.
This is not traditional 2PC. It is closer to a JavaScript coroutine that the database engine treats as an indivisible unit of work. The atomicity guarantee comes from the fact that execution is single-threaded on the partition's primary replica: no other operation can interleave with it mid-execution. The "transaction" is implicit in the execution context: start of function = begin transaction, clean return = commit, uncaught exception = rollback.
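That contract (run the whole function against tentative state, commit on clean return, discard everything on an uncaught exception) can be sketched in a few lines. This is a pure-Python model of the behavior, not the Cosmos DB engine, and the procedure bodies are invented examples.

```python
def execute_stored_procedure(partition_docs, procedure):
    """Run procedure against a tentative copy; commit all-or-nothing."""
    working = {k: dict(v) for k, v in partition_docs.items()}  # tentative state
    try:
        procedure(working)          # single-threaded: nothing interleaves
    except Exception:
        return partition_docs, "rolled back"   # discard every mutation
    partition_docs.clear()
    partition_docs.update(working)             # commit atomically
    return partition_docs, "committed"

def good_proc(docs):
    docs["order-1"]["status"] = "paid"
    docs["ledger-1"]["balance"] -= 25

def bad_proc(docs):
    docs["order-1"]["status"] = "paid"         # mutation before the failure...
    raise RuntimeError("validation failed mid-procedure")

docs = {"order-1": {"status": "new"}, "ledger-1": {"balance": 100}}
print(execute_stored_procedure(docs, bad_proc)[1])    # rolled back
print(docs["order-1"]["status"])                      # new - first write discarded
print(execute_stored_procedure(docs, good_proc)[1])   # committed
```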
The diagram below shows the decision tree from client request to committed state. The key branch is at the partition routing layer: if the operation spans multiple logical partition keys, it is rejected before JavaScript ever executes.
```mermaid
flowchart TD
    A[Client sends stored procedure call] --> B{All document operations use same partition key?}
    B -->|No - different partition keys| C[Request rejected at routing layer]
    B -->|Yes| D[Route to owning partition primary replica]
    D --> E[Execute stored procedure JavaScript on primary]
    E --> F{JavaScript throws uncaught exception?}
    F -->|Yes| G[Engine rolls back all ops in this invocation]
    F -->|No| H[Commit - all mutations visible atomically]
    G --> I[Return error to client]
    H --> J[Return success to client]
```
This diagram makes the Cosmos DB transaction model immediately practical: the partition key is the transaction scope, and the stored procedure is the transaction unit. If your data model puts related documents behind the same partition key, Cosmos DB transactions work cleanly. If your documents live on different partition keys, no Cosmos DB API provides atomicity across them.
The Cross-Partition Reality
Consider an e-commerce system where orderId is the partition key for order documents and sku is the partition key for inventory documents. A stored procedure that attempts to atomically decrement inventory and create an order cannot do so: these documents are on different partitions. Cosmos DB has no cross-partition atomic write API.
The Cosmos DB Transactional Batch API (introduced in 2020) provides the same single-partition atomicity as stored procedures with better ergonomics: it supports up to 100 operations per batch without the overhead of JavaScript parsing. But it still enforces the single-partition-key constraint.
Teams working around the cross-partition limitation use three approaches: denormalize data so all related documents share a partition key (accept fan-out on updates), use the Cosmos DB Change Feed as a CDC trigger to drive compensating transactions asynchronously (Saga), or accept eventual consistency on cross-entity operations.
Multi-Region Write Consistency Trade-offs
Under single-region writes (one write region, multiple read replicas), Cosmos DB transactions are serialized through the partition's write-region primary. Isolation is serializable within the partition.
Under multi-region writes, two clients can write to the same partition key from different regions simultaneously. Cosmos DB detects write conflicts using its conflict resolution policies (Last-Writer-Wins by _ts timestamp, or a custom merge procedure). Transactions run locally on the nearest region's primary and conflict resolution happens asynchronously, meaning two transactions on the same partition key from different write regions can both "succeed" locally and then conflict during replication. Applications using multi-region writes with conflict-sensitive data must implement custom conflict resolution logic.
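Last-Writer-Wins is simple to state but worth seeing concretely: both regions accept their write locally, and once replication delivers both versions, every region keeps the one with the larger _ts. The sketch below models only the resolution policy (documents and timestamps are fabricated), not the service itself.

```python
def resolve_lww(local_version, replicated_version):
    """Keep the document version with the larger _ts timestamp."""
    return max(local_version, replicated_version, key=lambda d: d["_ts"])

# The same logical document written concurrently in two write regions:
west = {"id": "cart-9", "items": 3, "_ts": 1700000010}   # committed in region A
east = {"id": "cart-9", "items": 5, "_ts": 1700000012}   # committed in region B

# Both regions converge on the same winner once replication delivers both versions:
print(resolve_lww(west, east)["items"])   # 5
print(resolve_lww(east, west)["items"])   # 5 - region A's write is silently lost
```

The second print is the punchline: region A's transaction "succeeded" from its client's point of view, yet its effect is gone after convergence. This is why conflict-sensitive data needs a custom merge procedure rather than LWW.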
Google Spanner and Global ACID: The Engineering Cost of External Consistency
Google Spanner is the only database in this comparison that delivers true global ACID transactions across any number of rows, tables, and geographic regions simultaneously. This is not a marketing claim. It is an engineering achievement built on physical infrastructure most databases do not have.
TrueTime: Bounding the Unknowable
Every distributed database that uses timestamps for ordering faces the clock synchronization problem: two nodes' clocks can drift by milliseconds or more. A transaction that commits on Node A at 10:00:00.003 might appear to have committed before a transaction on Node B at 10:00:00.001 even though it physically happened later, if Node A's clock is slightly ahead. Traditional distributed databases work around this by using logical clocks (Lamport timestamps, vector clocks) or by routing all writes through a single leader. Neither approach allows arbitrary cross-region writes with globally consistent timestamps.
Spanner solves this with TrueTime, a Google-internal API that exposes the current wall-clock time as an interval [t_earliest, t_latest] rather than a single point. The interval is guaranteed to contain the true absolute time. Google keeps the uncertainty interval narrow (typically 1-7 milliseconds) by synchronizing server clocks against GPS receivers and atomic clocks deployed in every Google datacenter. TrueTime does not eliminate uncertainty; it bounds it explicitly and makes that bound available to the database engine.
External Consistency Through Commit-Wait
Spanner uses TrueTime to achieve external consistency, the strongest possible consistency guarantee for a distributed database. External consistency means: if transaction T1 commits before transaction T2 starts (in real absolute time), then T1's commit timestamp is strictly less than T2's commit timestamp, as seen by every reader anywhere in the world.
To achieve this, Spanner runs a commit-wait delay on every read-write transaction:
- Acquire shared (read) locks on all rows touched by reads and exclusive (write) locks on all rows to be mutated.
- Call TT.now() to obtain the current TrueTime interval [t_earliest, t_latest].
- Assign commit timestamp s = t_latest, the upper bound of the current uncertainty interval.
- Wait until TT.now().t_earliest > s, that is, until Spanner is certain that the absolute current time has passed the commit timestamp.
- Apply writes with timestamp s, release all locks, and return the commit acknowledgment to the client.
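The commit-wait rule can be simulated with a fake clock. The uncertainty interval and the millisecond tick below are fabricated for illustration; only the rule matters: commit at s = t_latest, then refuse to acknowledge until t_earliest > s, so any later transaction necessarily receives a higher timestamp.

```python
class FakeTrueTime:
    """Simulated TrueTime: now() returns an interval around a fake absolute clock."""
    def __init__(self, epsilon_ms=5):
        self.abs_time = 0.0            # simulated absolute time, in ms
        self.epsilon = epsilon_ms
    def now(self):
        return (self.abs_time - self.epsilon, self.abs_time + self.epsilon)
    def advance(self, ms):
        self.abs_time += ms

def commit_with_wait(tt):
    _, t_latest = tt.now()
    s = t_latest                       # commit timestamp = interval upper bound
    waited = 0
    while tt.now()[0] <= s:            # commit-wait: block until t_earliest > s
        tt.advance(1)                  # stand in for real time passing
        waited += 1
    return s, waited

tt = FakeTrueTime(epsilon_ms=5)
s1, waited = commit_with_wait(tt)
# After the wait, real time has provably passed s1, so a transaction starting
# now gets an interval whose upper bound (its own s) exceeds s1:
s2, _ = commit_with_wait(tt)
print(s1 < s2)   # True - external consistency ordering holds
print(waited)    # 11 ms of wait with a 5ms epsilon and 1ms ticks
```

Note how the wait scales with epsilon: this is exactly why Google invests in GPS and atomic clocks, since every millisecond of clock uncertainty is a millisecond added to every read-write commit.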
The sequence diagram below traces this process from transaction start through commit-wait completion. The critical observation is that commit-wait happens after all locks are held, so no other transaction can observe partial state during the wait, and after the wait, any transaction that starts anywhere in the world will observe a TrueTime lower bound already past s.
```mermaid
sequenceDiagram
    participant Client
    participant Leader as Spanner Partition Leader
    participant TrueTime as TrueTime API
    Client->>Leader: Begin read-write transaction
    Client->>Leader: Read rows R1, R2 - shared locks acquired
    Client->>Leader: Write mutations M1, M2
    Leader->>Leader: Acquire exclusive write locks on M1, M2
    Leader->>TrueTime: TT.now() - request current time interval
    TrueTime-->>Leader: Interval [t_earliest, t_latest]
    Note over Leader: Assign commit timestamp s = t_latest
    Leader->>Leader: Commit-wait - sleep until TT.now().t_earliest is greater than s
    Leader->>Leader: Apply writes atomically with timestamp s
    Leader->>Leader: Release all read and write locks
    Leader-->>Client: Commit ACK with timestamp s
```
The commit-wait delay is bounded by the TrueTime epsilon, the clock uncertainty. Google keeps this under 7ms in production, typically under 1ms in well-provisioned datacenters. This means Spanner read-write transactions in a single region complete in 5-15ms. Cross-region transactions are dominated by network latency (30-80ms for intercontinental), not by commit-wait.
Two-Phase Locking and Two-Phase Commit at Scale
Spanner's read-write transactions use strict two-phase locking (S2PL): shared locks are acquired on reads and held until commit; exclusive locks are acquired on writes and held until commit. No lock is ever released before the transaction commits. This is the most conservative locking strategy possible and is necessary for serializability: no concurrent transaction can observe another transaction's intermediate state.
For cross-shard writes, Spanner coordinates with two-phase commit where one of the participating shards' Paxos leaders is elected as the transaction coordinator. Unlike DynamoDB's coordinator, Spanner's coordinator state is replicated via Paxos, so coordinator failure does not leave the transaction in limbo. A new Paxos leader can be elected and the transaction can be resumed or aborted without client intervention.
Spanner's read-only transactions bypass 2PL and 2PC entirely. They use bounded-staleness reads served by any up-to-date replica without acquiring locks or coordinating across shards. The tradeoff is a configurable staleness bound ("read data no more than 15 seconds stale"), but for analytics, reporting, and non-critical reads, this provides dramatically better throughput than read-write transactions.
MongoDB Multi-Document Transactions: Oplog-Based Atomicity With a Sharding Caveat
MongoDB added multi-document transactions in version 4.0 for replica sets (2018) and extended them to sharded clusters in version 4.2 (2019). MongoDB's transaction model builds on its existing replication infrastructure, the oplog (operations log), and shares conceptual ancestry with Spanner's 2PC, though with higher overhead in the sharded case.
Replica Set Transactions: The Oplog as the Commit Record
On a replica set, all read and write operations in a transaction execute on the primary node. Reads use a consistent point-in-time snapshot of the data (MVCC-based), meaning reads inside a transaction never see commits from concurrent transactions, only data as it existed when the transaction began. Writes are buffered in memory and not written to the oplog until the transaction commits.
At commit time, all mutations are written to the oplog as a single logical entry. Secondary nodes replicate this entry and apply all mutations atomically; secondaries never observe a partial transaction state. If a conflicting write is detected at commit time (another transaction modified a document this transaction read or intends to write), MongoDB aborts with a WriteConflict error and the client must retry.
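The commit-time conflict check amounts to per-document versioning: a transaction remembers the versions it snapshotted, buffers its writes, and commit fails if any touched version has moved. The toy store below models that rule (it is not the WiredTiger engine, and the class shape is invented for the sketch):

```python
class WriteConflict(Exception):
    pass

class Store:
    def __init__(self):
        self.docs = {}      # _id -> (version, value)
    def insert(self, _id, value):
        self.docs[_id] = (0, value)
    def commit(self, snapshot_versions, buffered_writes):
        # Validate first: every touched doc must still be at its snapshot version.
        for _id in buffered_writes:
            if self.docs[_id][0] != snapshot_versions[_id]:
                raise WriteConflict(_id)         # someone committed under us
        # Only then apply, bumping versions.
        for _id, value in buffered_writes.items():
            version, _ = self.docs[_id]
            self.docs[_id] = (version + 1, value)

store = Store()
store.insert("acct", {"balance": 100})

# Two transactions snapshot the same version, buffer writes, then race to commit.
snap = {"acct": store.docs["acct"][0]}
store.commit(snap, {"acct": {"balance": 70}})    # T1 wins
try:
    store.commit(snap, {"acct": {"balance": 40}})  # T2's snapshot is stale
except WriteConflict:
    print("WriteConflict - T2 must retry from a fresh snapshot")
print(store.docs["acct"][1]["balance"])          # 70
```

Under contention, the losing side retries from scratch, which is exactly the retry-cascade behavior the next paragraph quantifies.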
The performance cost is documented and real. MongoDB's default transaction timeout is 60 seconds; transactions running longer are aborted. Under high write contention, WriteConflict retries cascade into latency spikes. The MongoDB documentation itself recommends avoiding transactions on "hot" documents with high concurrent write rates.
Sharded Cluster Transactions: Cross-Shard 2PC Through the Config Server
When a transaction spans documents on multiple shards, a mongos router serves as the 2PC coordinator. The coordinator writes the transaction state to the config.transactions collection on the config server. It sends PREPARE to all participating shard primaries, collects PREPARED responses, updates config.transactions to COMMITTED, and sends COMMIT to all shards.
Each shard holds document-level write locks for the entire duration of the transaction, from the first write through the final COMMIT ACK. A sharded transaction that reads three documents and writes two on three different shards holds locks on all three shards for its entire execution lifetime. Under high concurrency, this is a meaningful throughput bottleneck and a common source of lock contention escalation in sharded clusters.
Deep Dive: How Distributed ACID Is Actually Enforced Under the Hood
Understanding the mechanics of each database's transaction implementation requires examining three layers: the concurrency control model, the mathematical guarantees it formalizes, and the performance cost those guarantees impose in production.
The Internals: Concurrency Control Models Compared
Each database uses a different concurrency control strategy to achieve isolation, and that strategy determines both its failure modes under contention and its performance characteristics.
DynamoDB uses optimistic concurrency control (OCC) with condition expressions. Rather than locking items during the transaction's lifetime, DynamoDB validates conditions at PREPARE time, atomically checking version numbers, attribute existence, or value predicates at the point of tentative write. If another writer modified an item between the transaction's start and the PREPARE phase, the condition check fails and the transaction aborts. OCC performs well under low contention and degrades under high contention because aborted transactions require full retries; there is no mechanism to "wait" for a conflicting transaction to commit.
Cosmos DB (within a stored procedure) uses a single-threaded execution model on the partition's primary replica. There are no locks in the traditional sense: the JavaScript execution is the lock, because no other operation runs concurrently on that partition thread during the procedure's execution. This is extremely efficient for short-lived, high-frequency transactions on a single partition. The cost is that long-running stored procedures block all other writes to that partition for their entire execution duration; a poorly written stored procedure is a partition-level denial of service.
Google Spanner uses strict two-phase locking (S2PL) combined with multi-version concurrency control (MVCC). Read-write transactions hold all locks from first acquisition through commit. Read-only transactions use MVCC snapshots and require no locks. This separation means read-heavy workloads do not contend with write-heavy workloads, while write-write contention is handled by deadlock detection: when a deadlock is detected, Spanner aborts the younger transaction and expects the client to retry.
MongoDB uses MVCC for reads and optimistic commit-time conflict detection for writes. The snapshot provides read-your-own-writes, monotonic reads, and non-repeatable-read prevention within the transaction. Write conflicts are detected at commit time using document-level versioning. Like DynamoDB's OCC, this degrades under high write contention into retry cascades.
Mathematical Model: TrueTime Intervals and Isolation Level Formalism
TrueTime interval notation. Let TT.now() return interval [t_e, t_l] where t_e (earliest) ≤ true absolute time ≤ t_l (latest). The uncertainty ε = t_l - t_e is bounded by clock synchronization quality. Spanner's commit-wait condition is: assign commit timestamp s = TT.now().t_l, and do not make the transaction visible until TT.now().t_e > s.
This ensures that the true absolute time has passed s before any reader can observe the committed data. Consequently, if transaction T1 commits at real time r1 and transaction T2 starts at real time r2 > r1, then T2's coordinator observes TT.now().t_e > r1 ≥ s1 (where s1 is T1's commit timestamp), guaranteeing s2 > s1. External consistency follows: every transaction that starts after T1's real-world commit will assign a higher timestamp and will therefore read T1's committed writes.
Isolation level anomaly prevention via predicate locking. Serializable isolation prevents write skew by using predicate locks: locks on the set of rows matching a predicate, not just rows that currently exist. When T1 reads all rows where status = 'available', a predicate lock is acquired on that predicate. If T2 inserts a row matching status = 'available' before T1 commits, T2 must wait for T1's predicate lock to be released. Snapshot isolation does not use predicate locks, which is why MongoDB's snapshot-isolated transactions permit write skew. Spanner's S2PL acquires predicate locks on every range read, preventing phantom reads and write skew entirely.
2PC correctness condition. A 2PC transaction can safely commit if and only if all n participants have responded PREPARED before the coordinator transitions to COMMITTED. The unsafe window is coordinator failure between writing COMMITTED and delivering all COMMIT messages; this is resolved by the durable coordinator record that allows a recovered coordinator to resume COMMIT delivery idempotently.
Performance Analysis: Quantifying the Cost of Distributed Atomicity
The table below captures the key overhead categories for transactional operations across all four databases.
| Database | Concurrency Control | Typical TX Latency | Write Overhead | Cross-Region |
| --- | --- | --- | --- | --- |
| DynamoDB | OCC + 2PC | 5-30ms (single region) | 2x WCU cost | No |
| Cosmos DB | Single-threaded partition | 2-10ms (same region) | 1x RU (no tx surcharge for stored proc) | No |
| Spanner | S2PL + 2PC + commit-wait | 10-100ms (global) | Higher replication factor | Yes |
| MongoDB replica set | MVCC + OCC commit | 5-50ms | Contention-sensitive | No |
| MongoDB sharded | MVCC + 2PC coordinator | 30-150ms | High lock duration | No |
The Spanner latency range reflects two very different scenarios. Intra-region transactions (all partitions in one Google datacenter) complete in under 10ms because TrueTime uncertainty is under 1ms. Cross-region transactions where participants span datacenters on different continents are dominated by cross-datacenter round-trip time (typically 50-80ms for transatlantic), and the commit-wait adds only 1-3ms on top of that.
Bottlenecks in practice. DynamoDB's 4 MB payload limit is the first constraint teams hit before the 25-item limit when dealing with large documents. Cosmos DB's partition-hot-spot problem is amplified under transactional load because the single-threaded execution model means transactions queue behind each other on a hot partition. MongoDB's 60-second transaction timeout is operationally dangerous in sharded deployments: a transaction that hits the timeout while holding cross-shard locks generates a timeout error but does not immediately release those locks, causing cascading WriteConflict failures on related documents.
Isolation Levels Compared: Which Anomalies Each Database Prevents for Transactional Reads
The following table maps each database's transactional API to its effective isolation level and the anomalies it prevents. This is the information engineers should consult when evaluating whether a given database can serve a specific consistency requirement.
| Database and API | Effective Isolation | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
| --- | --- | --- | --- | --- | --- |
| DynamoDB TransactWriteItems | Serializable | Prevented | Prevented | Prevented | Prevented |
| DynamoDB TransactGetItems | Serializable (snapshot at call time) | Prevented | Prevented | Prevented | N/A (reads only) |
| Cosmos DB stored procedure / transactional batch | Serializable (within partition) | Prevented | Prevented | Prevented | Prevented |
| Google Spanner read-write transaction | Serializable (external consistency) | Prevented | Prevented | Prevented | Prevented |
| Google Spanner read-only (stale) | Bounded Staleness | Prevented | N/A | N/A | N/A |
| MongoDB replica set transaction | Snapshot Isolation | Prevented | Prevented | Prevented | Not prevented |
| MongoDB sharded cluster transaction | Snapshot Isolation | Prevented | Prevented | Prevented | Not prevented |
MongoDB's snapshot isolation is a meaningful isolation level for most workloads: it prevents the most common anomalies and performs well. But applications with write-skew-prone invariants (inventory reservations, seat allocations, financial balance constraints under concurrent reads) must add explicit condition checks inside the transaction to compensate for the absence of predicate locking.
Advanced Patterns for When Database Transactions Are Not Enough
The databases in this post provide strong transactional guarantees within their defined scopes. Real-world applications routinely need to coordinate state changes across partition boundaries, across services, or across databases. Three patterns address this without requiring the database to expand its transaction scope.
The Saga Pattern. A saga decomposes a long-running, cross-boundary operation into a sequence of local transactions, each triggering the next via an event or message. Each local transaction has a corresponding compensating transaction that undoes its effect if a later step fails. The saga does not provide atomicity in the ACID sense (intermediate states are visible to concurrent readers), but it provides eventual consistency and rollback capability. Sagas are the correct pattern for Cosmos DB cross-partition operations, for DynamoDB operations that exceed the 25-item limit, and for any operation spanning multiple microservices. The failure risk in a saga is in the compensating transaction logic itself: if a compensation fails, the system may be left in a partially compensated state that requires manual intervention.
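The control flow described above can be sketched in a few lines of plain Java. This is a toy executor under simplified assumptions (synchronous steps, no persisted saga state); the SagaSketch, Step, and run names are illustrative, not from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class SagaSketch {
    // One local transaction plus the compensation that undoes it.
    public record Step(String name, Supplier<Boolean> action, Runnable compensation) {}

    // Runs steps in order; on the first failure, runs compensations for the
    // already-completed steps in reverse order. Returns true if all committed.
    public static boolean run(List<Step> steps, List<String> log) {
        List<Step> completed = new ArrayList<>();
        for (Step step : steps) {
            if (step.action().get()) {
                log.add("commit:" + step.name());
                completed.add(step);
            } else {
                log.add("fail:" + step.name());
                for (int i = completed.size() - 1; i >= 0; i--) {
                    completed.get(i).compensation().run();
                    log.add("compensate:" + completed.get(i).name());
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        run(List.of(
            new Step("reserveInventory", () -> true, () -> {}),
            new Step("createOrder", () -> true, () -> {}),
            new Step("chargeCard", () -> false, () -> {}) // simulated failure
        ), log);
        System.out.println(log);
    }
}
```

Note that the log shows exactly the hazard the paragraph names: between "commit" and "compensate" entries, intermediate state is visible to any concurrent reader.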
Idempotency Keys. Any operation that may be retried, including operations that time out, must be assigned a unique client-generated idempotency key. DynamoDB's ClientRequestToken, Stripe's idempotency keys, and MongoDB's session IDs all implement this concept. The key invariant is: a retry with the same key must return the result of the original call without re-executing the operation. Engineering teams that implement retry logic without idempotency keys on transactional paths discover the failure mode in production: a Lambda that retries a timed-out-but-successful DynamoDB transaction without a ClientRequestToken throws TransactionCanceledException because the committed state conflicts with the re-submission's condition checks.
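The invariant can be sketched with an in-memory store. This illustrates the contract only, not any SDK; a real system keeps the key in the same durable store as the operation (as DynamoDB's ClientRequestToken does), and IdempotencySketch is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class IdempotencySketch {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // A retry with the same key returns the stored result without
    // re-executing the operation.
    public String execute(String idempotencyKey, Supplier<String> operation) {
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }

    public static void main(String[] args) {
        IdempotencySketch store = new IdempotencySketch();
        AtomicInteger charges = new AtomicInteger();
        String key = "payment-123"; // client-generated, stable across retries
        String first = store.execute(key, () -> "charged:" + charges.incrementAndGet());
        String retry = store.execute(key, () -> "charged:" + charges.incrementAndGet());
        System.out.println(first + " " + retry + " executions=" + charges.get());
    }
}
```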
Transactional Outbox. When a database write must also trigger a downstream event (a Kafka message, an SQS notification, a search index update), the two writes cannot be made atomic unless the event is stored in the same database as an outbox record in the same transaction. A CDC relay (Debezium, a DynamoDB Streams processor, a Cosmos DB Change Feed consumer) then reads the outbox record and forwards the event after commit. This eliminates the dual-write gap: the event is never lost, because it was committed to the database before the relay reads it.
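A toy in-memory model of the pattern, with one synchronized block standing in for one database transaction and a polling method standing in for the CDC relay. OutboxSketch and its members are illustrative names, not a library API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutboxSketch {
    final Map<String, Integer> accounts = new HashMap<>();
    final Deque<String> outbox = new ArrayDeque<>();
    final List<String> published = new ArrayList<>();

    // One "transaction": the state change and its outbox record commit together.
    public synchronized void credit(String account, int amount) {
        accounts.merge(account, amount, Integer::sum);
        outbox.add("credited:" + account + ":" + amount);
    }

    // The relay (Debezium / Streams / Change Feed in real systems): reads
    // committed outbox records and forwards them after the fact.
    public synchronized void relayOnce() {
        String event;
        while ((event = outbox.poll()) != null) published.add(event);
    }

    public static void main(String[] args) {
        OutboxSketch db = new OutboxSketch();
        db.credit("acct-1", 100);
        db.relayOnce();
        System.out.println(db.published);
    }
}
```

Because the event record only exists if the business write committed, a crash between "commit" and "relay" delays the event but never loses it.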
Side-by-Side Comparison: Atomicity Scope, Isolation, Latency, and Cost
| Dimension | DynamoDB | Cosmos DB | Google Spanner | MongoDB |
|---|---|---|---|---|
| Atomicity Scope | Up to 25 items, any tables, same region | All docs in same logical partition key | Any rows, any tables, any region | All docs in same RS or sharded cluster |
| Cross-Partition Atomic Write | Yes (items on diff partitions, same region) | No (hard partition boundary) | Yes (cross-shard, cross-region) | Yes (sharded tx), but high lock cost |
| Cross-Region Atomic Write | No | No | Yes (global external consistency) | No |
| Effective Isolation Level | Serializable | Serializable (within partition) | Serializable (external consistency) | Snapshot Isolation |
| Typical Transaction Latency | 5-30 ms | 2-10 ms | 10-100 ms (global) | 5-50 ms (RS) / 30-150 ms (sharded) |
| Write Cost Premium | 2x WCU | 1x RU (no surcharge) | Higher per replication factor | Higher under write contention |
| Max Items per Transaction | 25 items, 4 MB total | Partition-bound (transactional batch: 100 ops, 2 MB) | No practical limit | 16 MB per doc; no tx-level doc limit |
| Conflict Handling | OCC: abort and client retry | Single-threaded: no mid-proc conflict | S2PL + deadlock detection | OCC: WriteConflict abort and retry |
| Multi-Region Write | Global Tables (eventual, no ACID) | Multi-region write with conflict resolution | Full ACID globally | No built-in multi-region write tx |
Real-World Use Cases: Where Each Database's Transaction Model Earns Its Keep
DynamoDB in payments (idempotent charge capture). A payment processing service issues a TransactWriteItems call that atomically writes a payment record and upserts an idempotency key item. Both items use paymentId as the partition key; the transaction is two items, well within the 25-item limit. If the Lambda handling the charge times out after the transaction succeeds, the retry with the same ClientRequestToken returns the original success without re-charging the customer. This is DynamoDB transactions used exactly as designed: small scope, high frequency, idempotency-critical operations.
Cosmos DB in gaming (partition-aligned session rewards). An online multiplayer game stores player session state, inventory items, and achievement flags all partitioned by playerId. A stored procedure atomically applies a level-up reward: add experience points, append an item to the inventory array, set an achievement flag, all within the same playerId partition. No cross-partition coordination is required because the data model was designed with the partition constraint in mind from the start. This is the ideal Cosmos DB transaction use case: high-frequency, partition-aligned operations on data that naturally lives together.
Spanner in global fintech (cross-region ledger integrity). Google Pay's internal systems use Spanner for globally consistent financial ledgers. A transfer between a user in Tokyo and a merchant in London involves two rows on two partitions in two geographic regions. Spanner commits this atomically with a globally consistent timestamp. A reconciliation job running in any region will see a consistent ledger: no credit without a corresponding debit, no gap between the two. This is the use case that justifies Spanner's operational complexity and cost: when cross-region atomic writes with serializable isolation are a business requirement, not a nice-to-have.
MongoDB in content management (publish-time consistency). A media platform uses MongoDB replica set transactions to atomically update an article document and its tag index document when an editor publishes. The transaction ensures the article is never searchable without its tags. The replica set model works because both documents live in the same cluster, write frequency is in the thousands per day (not millions per second), and the absence of write-skew risk (no two editors publish the same article simultaneously) makes snapshot isolation sufficient.
Trade-offs and Failure Modes Across All Four Databases
Performance versus correctness. Every strong isolation guarantee costs latency. DynamoDB transactions cost 2x in WCU; Spanner's commit-wait adds bounded but real delay; MongoDB sharded transactions hold cross-shard locks for the entire transaction duration. Teams should evaluate whether they genuinely need serializable multi-item isolation or whether atomic conditional single-item writes (ConditionExpression, ETag checks, findOneAndUpdate with conditions) are sufficient. A large fraction of "transaction" use cases can be handled by single-item OCC without the overhead of 2PC.
Cascading contention under load. Under high write contention, all four databases degrade similarly. DynamoDB returns TransactionConflictException. Cosmos DB's stored procedure queue grows on hot partition keys. Spanner detects deadlocks and aborts younger transactions. MongoDB returns WriteConflict. All of these require client-side retries. Without exponential backoff and jitter in the retry logic, retries under load amplify contention rather than resolving it, a pattern known as the retry storm.
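The backoff-with-jitter discipline can be sketched as a "full jitter" delay calculation: the sleep is drawn uniformly from zero up to an exponentially growing, capped ceiling. BackoffSketch, the method name, and the constants are illustrative, not from any SDK.

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    static final long BASE_MS = 50;    // first-retry ceiling
    static final long CAP_MS = 5_000;  // never sleep longer than this

    // Full jitter: delay is uniform in [0, min(cap, base * 2^attempt)].
    static long backoffDelayMs(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt
                + " -> chose " + backoffDelayMs(attempt) + " ms");
        }
    }
}
```

The randomness is the point: if every aborted client slept a deterministic delay, they would all retry in lockstep and re-collide, which is exactly the retry storm described above.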
The cross-partition silent failure. The failure mode that opened this post (two sequential writes that should have been one transaction) is the most common and the most insidious. It does not produce an error. It does not appear on any dashboard. The mitigation is architectural: understand the atomicity scope of your chosen database before modeling data, and explicitly design your data model to keep related writes within that scope.
Multi-region transaction limitations. Of the four databases, only Spanner delivers fully serializable cross-region atomic writes. Cosmos DB multi-region writes use conflict resolution, not serialization. DynamoDB Global Tables replicates asynchronously. MongoDB cross-region replication is not transactional. Engineers who require cross-region atomicity outside of Spanner must implement compensating transaction patterns (Saga) with idempotency guarantees, and must accept the weaker consistency model that implies.
Decision Framework: Matching Your Workload to the Right Transactional Database
The flowchart below captures the key decision points for selecting a transactional strategy. The most important branch comes first: does the workload actually require multi-item atomicity, or will atomic conditional single-item writes suffice?
flowchart TD
A[Does the operation require multi-item atomicity?] -->|No| B[Use conditional writes - cheaper and sufficient]
A -->|Yes| C{How many items per transaction?}
C -->|Up to 25, same AWS region| D[DynamoDB TransactWriteItems - strong fit]
C -->|All items share one partition key| E{Which cloud platform?}
C -->|Cross-partition or more than 25 items| F{Is cross-region ACID required?}
E -->|Azure| G[Cosmos DB transactional batch or stored procedure]
E -->|Other| H[Re-evaluate data model or use Saga pattern]
F -->|No - single region, cross-shard| I[MongoDB sharded tx or Saga pattern]
F -->|Yes - global ACID required| J[Google Spanner - only option with external consistency]
D --> K{High write contention on same items?}
K -->|Yes| L[Add idempotency tokens and retry with jitter backoff]
K -->|No| M[Standard TransactWriteItems with ClientRequestToken]
This flowchart surfaces the Spanner decision at the right place: it is the answer when and only when cross-region atomic writes with serializable isolation are a hard requirement. For every other use case, one of the other three databases provides sufficient guarantees at lower operational cost.
| Situation | Recommended Approach |
|---|---|
| Single-item atomic update (increment counter, claim one slot) | Conditional write only; no transaction overhead needed |
| Up to 25 items, same AWS region, high frequency | DynamoDB TransactWriteItems with ClientRequestToken |
| All items share a partition key, Azure, moderate frequency | Cosmos DB transactional batch or stored procedure |
| Cross-partition operations, Azure | Saga pattern with Cosmos DB Change Feed as the event relay |
| Global cross-region atomic writes, serializable isolation | Google Spanner; accept commit-wait latency |
| Existing MongoDB replica set, low contention | MongoDB multi-document transaction; sufficient for most use cases |
| MongoDB, sharded, high concurrent write frequency | Evaluate Saga pattern; sharded transaction lock costs escalate quickly |
Worked Example: Atomic Inventory Reservation for a Flash Sale
A flash sale service must atomically reserve inventory and create a customer order. The business invariant: the inventory count must never go below zero, and no order should be created without a corresponding inventory decrement.
System state before the transaction:
- Item: INVENTORY#sku-789 with attribute stock = 1 (last unit)
- Item to be created: ORDER#order-456 with status RESERVED
Scenario A: DynamoDB with one unit remaining, two concurrent buyers.
Two Lambda functions execute TransactWriteItems simultaneously. Each transaction contains:
- Update on INVENTORY#sku-789: decrement stock by 1, with ConditionExpression: stock >= :min where :min = 1.
- Put on ORDER#order-456: write the new order item.
Lambda A's transaction acquires locks at the PREPARE phase. Lambda B's transaction also attempts to acquire a lock on INVENTORY#sku-789 but finds it locked by Lambda A's in-flight PREPARE. Lambda B receives TransactionConflictException. Lambda A proceeds: its condition check passes (stock = 1 >= 1), both writes commit, stock becomes 0, and the order is created. Lambda B retries; this time the condition check fails (stock = 0, not >= 1) and the transaction is cancelled. The order is never created for Lambda B. No inventory goes negative.
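The essence of Scenario A can be reduced to a compare-and-set loop. This is an illustration of the semantics only: AtomicInteger stands in for the inventory item, compareAndSet for the ConditionExpression, and the retry branch for the TransactionConflictException path; ReservationSketch is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReservationSketch {
    // Check-then-decrement that either commits atomically or fails cleanly.
    static boolean tryReserve(AtomicInteger stock) {
        while (true) {
            int current = stock.get();
            if (current < 1) return false; // condition check fails: cancel
            if (stock.compareAndSet(current, current - 1)) return true; // commit
            // lost the race (the TransactionConflictException case): retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger stock = new AtomicInteger(1); // last unit
        boolean buyerA = tryReserve(stock);
        boolean buyerB = tryReserve(stock);
        System.out.println("A=" + buyerA + " B=" + buyerB + " stock=" + stock.get());
    }
}
```

Whichever buyer wins the race, exactly one reservation succeeds and stock never goes negative, which is the invariant the DynamoDB condition check enforces.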
Scenario B: What Cosmos DB would require for the same invariant.
If both INVENTORY#sku-789 and ORDER#order-456 use sku as the partition key and both carry pk = "sku-789", a Cosmos DB stored procedure can execute this atomically. The stored procedure reads the inventory document, checks the stock count, decrements it, and creates the order document, all in one JavaScript execution block. If the check fails (stock is 0), the JavaScript throws and both writes are rolled back. If the ORDER#order-456 document had a different partition key (for example, partitioned by customerId), no Cosmos DB transaction could span both documents; the Saga pattern with a compensating "release inventory" operation would be required.
Scenario C: What Spanner enables beyond both.
Spanner executes this as a read-write transaction with no item count limit and no partition constraint. The transaction reads the inventory row (acquiring a shared read lock), checks the stock, decrements inventory (acquiring an exclusive write lock), and inserts the order row. TrueTime commit-wait ensures that any read of either row from any region after this transaction's real-world commit time sees the post-sale state. If the platform runs flash sales with global inventory (one pool shared by buyers in Tokyo, London, and New York), Spanner is the only database that prevents oversell without application-level sharding of the inventory counter.
SDK Integration: Transactional Writes in DynamoDB, Cosmos DB, and Spanner
The following snippets show the minimal SDK call pattern for transactional writes in each database. These are integration-level examples intended to show the structural shape of the API, not full application implementations.
DynamoDB TransactWriteItems (AWS SDK v2 for Java)
TransactWriteItemsRequest txRequest = TransactWriteItemsRequest.builder()
.clientRequestToken(idempotencyToken) // Required for safe retries
.transactItems(
TransactWriteItem.builder()
.update(Update.builder()
.tableName("Accounts")
.key(Map.of("accountId", AttributeValue.fromS(sourceId)))
.updateExpression("SET balance = balance - :amount")
.conditionExpression("balance >= :amount") // Prevents negative balance
.expressionAttributeValues(Map.of(
":amount", AttributeValue.fromN("100")))
.build())
.build(),
TransactWriteItem.builder()
.update(Update.builder()
.tableName("Accounts")
.key(Map.of("accountId", AttributeValue.fromS(destId)))
.updateExpression("SET balance = balance + :amount")
.expressionAttributeValues(Map.of(
":amount", AttributeValue.fromN("100")))
.build())
.build()
)
.build();
dynamoDbClient.transactWriteItems(txRequest);
// Catch TransactionCanceledException for condition failures
// Catch TransactionConflictException for contention; retry with backoff
Cosmos DB Transactional Batch (Azure SDK for Java)
// All items in the batch MUST share the same partition key value
TransactionalBatch batch = TransactionalBatch.createTransactionalBatch(
new PartitionKey("sku-789")
);
batch.upsertItem(inventoryUpdatePojo); // Decrement stock
batch.createItem(newOrderPojo); // Create order in RESERVED state
TransactionalBatchResponse response = container
.executeTransactionalBatch(batch)
.block();
if (!response.isSuccessStatusCode()) {
// Check response.getResults() for per-operation failure details
throw new RuntimeException("Transactional batch failed: " + response.getStatusCode());
}
Google Cloud Spanner Read-Write Transaction (Java)
// Spanner SDK handles retry on AbortedException automatically
dbClient.readWriteTransaction().run(txContext -> {
// Read with shared lock held until commit
Struct row = txContext.readRow(
"Inventory", Key.of("sku-789"), List.of("stock"));
long stock = row.getLong("stock");
if (stock < 1) throw new RuntimeException("Out of stock");
// Buffer mutations: applied atomically at commit, not immediately
txContext.buffer(
Mutation.newUpdateBuilder("Inventory")
.set("sku").to("sku-789")
.set("stock").to(stock - 1)
.build()
);
txContext.buffer(
Mutation.newInsertBuilder("Orders")
.set("orderId").to(orderId)
.set("status").to("RESERVED")
.build()
);
return null;
});
// The lambda re-executes automatically if Spanner aborts due to lock conflict
For a full deep-dive on Spanner's session pooling, staleness configuration, and interleaved table design patterns, see the Cloud Spanner transactions documentation. For Cosmos DB Change Feed as the Saga event relay, see Cosmos DB API Modes Explained.
Lessons Learned: What Production Teaches You About Distributed Transactions
1. Two sequential writes are never a transaction. The most common distributed transaction bug is issuing two independent write calls and assuming they are atomic because they run in the same Lambda or service method. They are not. Any failure, timeout, partial retry, or concurrent writer between them creates permanent partial state. Use the transactional API explicitly, every time.
2. Model data for your transaction scope before the first write hits production. DynamoDB's 25-item limit and Cosmos DB's single-partition constraint are not bugs. They are the design. A data model that requires atomic writes across partition keys in Cosmos DB or across more than 25 items in DynamoDB cannot be fixed with API calls; it requires a data model redesign or an architectural pattern change (Saga). Discovering this in production is expensive.
3. Idempotency tokens are not optional in any retried transactional path. Any transactional operation that can be retried (Lambda automatic retries, SDK-level retries, application-level error handling) must use idempotency tokens. Without them, retries of successful-but-timed-out operations produce duplicates, conflicts, or data corruption depending on the database. DynamoDB's ClientRequestToken, Cosmos DB's request IDs, and Spanner's session model each address this differently; understand which mechanism applies to your SDK and use it.
4. The "C" in ACID is your responsibility, not the database's. Serializable isolation ensures transactions do not interfere with each other's reads and writes. It does not ensure your business invariants hold. No negative balance, no oversold inventory, no duplicate order: these must be encoded as condition expressions, stored procedure checks, or read-then-write guards inside the transaction. A serializable transaction that does not check preconditions will still commit business rule violations.
5. Spanner's commit-wait is the mechanism that makes global ACID possible. Engineers encountering commit-wait for the first time look for a way to disable it. The delay is typically 1-7 ms in practice, and it is what makes external consistency possible. If your global application requires that a read starting after a committed write always sees that write, regardless of which region the reader is in, commit-wait is the mechanism guaranteeing this. Disabling it would silently downgrade the consistency guarantee.
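Point 5 can be made concrete with a toy model, assuming a fixed clock uncertainty epsilon: TT.now() is modeled as the interval [t - eps, t + eps], the commit timestamp s is taken as t_latest, and the commit is not acknowledged until t_earliest has passed s. The class and constants are illustrative; real TrueTime uncertainty varies, and Spanner overlaps the wait with Paxos replication.

```java
public class CommitWaitSketch {
    static final long EPSILON_MS = 3; // simulated clock uncertainty

    // Commit at s = TT.now().t_latest, then wait until TT.now().t_earliest > s.
    // Returns how many simulated milliseconds the commit had to wait.
    static long commitWaitMs(long trueTimeNowMs) {
        long s = trueTimeNowMs + EPSILON_MS; // t_latest at commit
        long t = trueTimeNowMs;
        while (t - EPSILON_MS <= s) {        // t_earliest not yet past s
            t++;                             // simulate time advancing 1 ms
        }
        return t - trueTimeNowMs;            // = 2 * EPSILON_MS + 1 here
    }

    public static void main(String[] args) {
        System.out.println("commit-wait = " + commitWaitMs(0) + " ms");
    }
}
```

The wait is on the order of twice the clock uncertainty, which is why the 1-7 ms figure tracks TrueTime's epsilon rather than network latency.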
6. MongoDB sharded cluster transactions trade throughput for atomicity. The lock duration in a sharded MongoDB transaction is bounded by the total transaction execution time, not individual write durations. This is a fundamental overhead characteristic, not a configuration issue. For write-heavy, highly concurrent workloads on a sharded MongoDB cluster, the Saga pattern with compensating transactions generally delivers better throughput and lower operational risk than sharded multi-document transactions.
TLDR: One Paragraph That Replaces 20 Stack Overflow Tabs
ACID transactions in distributed databases are a spectrum, not a binary capability. DynamoDB delivers serializable multi-item atomicity within 25 items in a single region using 2PC backed by a coordinator record; use ClientRequestToken for safe retries or risk duplicate operations. Cosmos DB delivers serializable atomicity only within a single logical partition key, enforced through stored procedure execution; cross-partition operations require the Saga pattern or explicit data model redesign. Google Spanner is the only cloud database that delivers externally consistent global ACID using TrueTime-based commit timestamps, strict two-phase locking, and Paxos-replicated 2PC across geographic regions, at the cost of commit-wait latency that adds 1-7 ms in practice. MongoDB provides snapshot-isolated multi-document transactions on replica sets with reasonable overhead, and sharded cluster transactions with significantly higher lock duration cost; neither prevents write skew, which requires application-level condition checks. The question is never "which database supports ACID?" (all four do, within their defined scope). The question is always "what scope, what isolation, and at what cost?"
Practice Quiz
A DynamoDB TransactWriteItems call involves 8 items across 3 tables. The coordinator writes the transaction record to COMMITTED state and sends COMMIT to 5 of the 8 partitions before the coordinator process crashes. What happens when a new coordinator picks up the transaction?
- A) The 5 committed items are rolled back to be consistent with the 3 that did not receive COMMIT
- B) The transaction is permanently lost; partial state remains
- C) The new coordinator reads the COMMITTED record and replays COMMIT to the 3 partitions that did not ACK, completing the transaction idempotently
- D) The client receives a TransactionCanceledException and must resubmit the full transaction

Correct Answer: C
A Cosmos DB stored procedure operates on documents D1, D2, and D3, all with partition key tenantId = "acme". The procedure updates D1 and D2 successfully and is then interrupted by an uncaught exception when processing D3. What is the observable state of D1 and D2 to a concurrent reader?
- A) D1 and D2 reflect the stored procedure's changes; D3 is unmodified
- B) D1, D2, and D3 all remain at their pre-procedure values; the entire invocation is rolled back
- C) D1 and D2 are visible in the updated state; D3 shows a partial write
- D) The stored procedure is automatically retried from the beginning

Correct Answer: B
Google Spanner assigns commit timestamp s = TT.now().t_latest and waits until TT.now().t_earliest > s before applying writes. A read-only transaction starts in a different datacenter one second after Spanner releases the locks. Which guarantee does commit-wait provide to that reader?
- A) The read will return data no more than 7 ms stale due to Paxos replication lag
- B) The reader's transaction will observe a TrueTime lower bound strictly greater than s, guaranteeing it sees the committed state without any additional synchronization
- C) The reader is served a snapshot from before the write and must retry to see the updated data
- D) Spanner blocks the read until the Paxos leader replicates the commit to the reader's datacenter

Correct Answer: B
A MongoDB replica set transaction reads an inventory document (available = 1) and a reservation document (reserved = 0) in the same snapshot, then writes available = 0 and reserved = 1. A concurrent transaction reads the same snapshot and makes the same decision. Both transactions commit successfully. What anomaly has occurred, and what is the correct mitigation at the database level?
- A) Dirty read: mitigated by upgrading to Read Committed isolation
- B) Phantom read: mitigated by using range locks on the inventory collection
- C) Write skew: both transactions saw a consistent snapshot that allowed each to proceed, resulting in available = -1 and reserved = 2; mitigated by adding an explicit check-then-write inside the transaction using findOneAndUpdate with a filter on available > 0 as a condition
- D) Non-repeatable read: mitigated by re-reading documents before the commit

Correct Answer: C
Open-ended challenge: A global payments platform requires that a debit from a Tokyo account and a credit to a New York account be committed atomically, and that any balance read starting anywhere in the world after the transfer's real-world completion time reflects the post-transfer state, with no possibility of a reader in either region observing stale pre-transfer balances. Explain precisely why DynamoDB, Cosmos DB, and standard MongoDB cannot satisfy this requirement in their standard configurations. Then explain the three specific mechanisms in Google Spanner (TrueTime, strict two-phase locking, and Paxos-replicated 2PC) that together make this guarantee possible, and describe the engineering trade-off the platform must accept to use Spanner for this workload.