
The Dual Write Problem in NoSQL: MongoDB, DynamoDB, and Cassandra

Why NoSQL consistency is harder than you think — and the per-database patterns that fix it

Abstract Algorithms · 37 min read

TLDR: NoSQL databases trade cross-entity atomicity for scale — and every database draws that atomicity boundary in a different place. MongoDB's boundary is the document (pre-4.0) or the replica set (4.0+ multi-doc transactions). DynamoDB's boundary is a single TransactWriteItems call, with failure modes that silently burn throughput. Cassandra's boundary is the partition key. Write across any of these boundaries without a compensating pattern — Outbox, CDC, or conditional writes — and you will have silent data divergence in production.


📖 The Orders That Existed in DynamoDB But Not for the Customer

An e-commerce platform migrates from PostgreSQL to DynamoDB. The engineering team spends six months rewriting the order service. Load tests pass. The team ships on a Thursday.

By the following Monday at 50,000 transactions per second, the support queue is filling up: customers who placed orders see them in "My Orders" — but only sometimes. The operations team traces the issue to two DynamoDB tables: orders (the source of truth) and order-by-customer (a denormalized index table maintained separately for query flexibility, because DynamoDB's GSIs had latency characteristics the team did not want). Writes to orders were succeeding. Writes to order-by-customer were occasionally failing with ProvisionedThroughputExceededException at high TPS — and the application was catching the exception, logging a warning, and moving on.

The result: orders existed in the database. Customers could not find them in the orders list. No foreign key enforcement caught the gap. No constraint raised an error. No rollback cleaned up the orphaned row. PostgreSQL would never have allowed this scenario — a single transaction would have committed both writes atomically or rolled both back. DynamoDB, designed to handle millions of requests per second, simply does not work that way.

The team had chosen NoSQL for scale. Scale introduced consistency gaps a relational database would have made structurally impossible.

This post is about exactly those gaps — per-database, with the specific failure modes and the specific patterns that close them. If you are looking for the general dual write problem (DB + Kafka Outbox, DB + Redis cache invalidation), see the companion post The Dual Write Problem and Solutions. This post starts where that one ends: inside NoSQL.


🔍 Why NoSQL Makes Dual Write Harder Than You Remember

In a relational database, dual write within that database is trivially solved. PostgreSQL, MySQL, and Oracle all support multi-table ACID transactions natively:

BEGIN;
INSERT INTO orders (order_id, status) VALUES ('abc-123', 'PENDING');
INSERT INTO order_events (order_id, event_type) VALUES ('abc-123', 'OrderPlaced');
COMMIT;

Both rows commit or both rows roll back. The atomicity boundary in a relational database is the transaction, which can span any number of tables in the same instance. The dual write problem in relational systems only appears when you try to write to two separate systems — the database and Kafka, or the database and Redis.

NoSQL databases are architected differently. They sacrifice cross-entity atomicity in exchange for horizontal scalability. Where they draw the atomicity boundary depends on the database:

| Database | Atomicity Boundary | Multi-Entity Atomic Write | Cross-Boundary Option |
|---|---|---|---|
| PostgreSQL | Transaction (all tables, same instance) | ✅ Native | N/A — native ACID |
| MongoDB (< 4.0) | Single document | ❌ None | Embedding or manual |
| MongoDB (≥ 4.0) | Replica set / sharded cluster | ✅ Multi-document transactions | Available but costly |
| DynamoDB | Single item | ⚠️ TransactWriteItems (up to 100 items) | Available with failure modes |
| Cassandra | Single partition | ⚠️ LOGGED BATCH (same partition only) | LWT within one partition |
| Redis | Single command | ✅ Lua scripts / MULTI/EXEC | No cross-key rollback |

The dual write problem in NoSQL has two distinct flavors:

  • Intra-DB dual write: writing to two documents, collections, tables, or partitions within the same NoSQL database — e.g., two DynamoDB tables, two MongoDB collections, two Cassandra partitions.
  • Inter-system dual write: writing to a NoSQL database and a secondary system — Elasticsearch for full-text search, Redis for caching, another NoSQL database.

Each flavor calls for its own reasoning. The rest of this post walks through each major NoSQL database's specific failure modes and the patterns engineered to address them.


⚙️ MongoDB: The Multi-Document Write Trap and How Embedding Escapes It

Pre-4.0 MongoDB: Two Collections, No Safety Net

Before MongoDB 4.0, there were no multi-document transactions. If your service wrote to two collections in the same code path, those were two entirely independent writes with no atomicity guarantee. The second write could fail, and MongoDB had no mechanism to compensate for the first.

// WRONG — pre-4.0 MongoDB: no atomicity between two collection writes
MongoCollection<Document> orderCollection = db.getCollection("orders");
MongoCollection<Document> orderEventsCollection = db.getCollection("order_events");

// If this succeeds...
orderCollection.insertOne(orderDoc);

// ...and this throws, the order exists but the event is permanently lost.
// There is no rollback. There is no journal entry linking the two writes.
orderEventsCollection.insertOne(eventDoc);  // MongoException? Order document is orphaned.

This is not a programming error — it is a structural gap. The driver has no concept of a multi-collection transaction. Every insert is final the moment it returns.

Post-4.0 MongoDB: Multi-Document Transactions With a Real Cost

MongoDB 4.0 introduced multi-document ACID transactions, scoped to a replica set (4.0) and later extended to sharded clusters (4.2). Spring Data MongoDB maps @Transactional to these transactions automatically:

@Service
public class OrderService {

    private final MongoTemplate mongoTemplate;

    public OrderService(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Requires a MongoTransactionManager bean. The writes must go through
    // MongoTemplate (or Spring Data repositories) so they bind to the
    // transaction's session; raw MongoClient calls would bypass it.
    @Transactional
    public void placeOrder(Order order, OrderEvent event) {
        mongoTemplate.insert(toDocument(order), "orders");

        mongoTemplate.insert(toDocument(event), "order_events");

        // If the second insert throws, Spring aborts the transaction and the
        // first insert rolls back. Both collections reflect the same
        // committed state — or neither does.
    }

    private Document toDocument(Order order) {
        return new Document("orderId", order.getId())
            .append("customerId", order.getCustomerId())
            .append("status", order.getStatus().name())
            .append("total", order.getTotal());
    }

    private Document toDocument(OrderEvent event) {
        return new Document("orderId", event.getOrderId())
            .append("eventType", event.getType().name())
            .append("occurredAt", event.getOccurredAt().toString());
    }
}

This works. But MongoDB multi-document transactions carry costs that do not exist in PostgreSQL:

  • 2× write amplification: every write inside a transaction is recorded both in the collection and in the oplog, increasing I/O.
  • WiredTiger lock contention: long-running transactions hold document-level locks that block concurrent writers on the same documents. MongoDB aborts transactions exceeding 60 seconds by default.
  • Replica set only: transactions require a replica set (standalone instances do not support them) and mongos routing for sharded clusters.
  • Retryability: the core transaction API does not auto-retry on TransientTransactionError. Unless you use the driver's withTransaction helper, the application owns the retry loop.

For write-heavy workloads, multi-document transactions in MongoDB can reduce throughput by 10–30% compared to single-document writes. This is the most important MongoDB-specific trade-off to understand before defaulting to @Transactional.
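The retry loop the application owes can be sketched generically. A minimal sketch, assuming the transaction body is passed in as a supplier and that transient failures are detectable (in the real driver the check is e.hasErrorLabel("TransientTransactionError") on a MongoException; TransientTxnException below is a hypothetical stand-in):

```java
import java.util.function.Supplier;

public class TransientRetry {

    // Hypothetical stand-in for errors the driver labels as transient; in
    // production this would be a MongoException carrying the
    // "TransientTransactionError" error label.
    public static class TransientTxnException extends RuntimeException {}

    // Re-runs the entire transaction body on transient failure, up to
    // maxAttempts. MongoDB requires retrying the whole transaction, not
    // just the statement that failed.
    public static <T> T runWithRetry(Supplier<T> txnBody, int maxAttempts) {
        TransientTxnException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txnBody.get();
            } catch (TransientTxnException e) {
                last = e; // the driver already aborted the txn; just retry
            }
        }
        throw last;
    }
}
```

The driver's session.withTransaction(...) helper implements essentially this loop internally, which is why it is usually preferred over manual startTransaction / commitTransaction calls.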

The MongoDB-Native Alternative: Single-Document Atomicity Through Embedding

MongoDB's design strength is the document model. A single insertOne or updateOne is always atomic. If the event logically belongs to the order — and it does, at creation time — embed it:

@Service
public class OrderEmbeddedOutboxService {

    private final MongoCollection<Document> orderCollection;

    public OrderEmbeddedOutboxService(MongoDatabase db) {
        this.orderCollection = db.getCollection("orders");
    }

    public void placeOrder(Order order, OrderEvent event) {
        Document outboxEntry = new Document("eventType", event.getType().name())
            .append("payload", toDocument(event))
            .append("published", false)
            .append("createdAt", System.currentTimeMillis());

        Document orderDoc = toDocument(order)
            .append("_outbox", outboxEntry);

        // Single document write — atomically stores the order AND the pending event.
        // No transaction needed. No second write. No failure window between two operations.
        orderCollection.insertOne(orderDoc);

        // A separate relay process watches for documents where _outbox.published = false,
        // publishes the event to Kafka/SNS, then clears the _outbox field with $unset.
    }
}

This is the MongoDB-native Outbox pattern. The event relay reads documents with _outbox.published: false (efficiently, via a Change Stream or a tailable cursor), publishes the event to the downstream system, and marks published: true. The atomicity boundary is a single document write — MongoDB's native strength.

The following diagram shows both paths and when to choose each.

flowchart TD
    A[Order + Event Write Request] --> B{Can the event be embedded\nin the order document?}
    B -- Yes --> C[Single Document Write\norder _outbox embedded]
    C --> D[Change Stream Relay\ndetects _outbox.published = false]
    D --> E[Relay publishes event\nto Kafka / SNS]
    E --> F[Relay clears _outbox\nwith atomicUpdate]
    B -- No / event is\nlarge or independent --> G[MongoDB Multi-Doc Transaction\n@Transactional]
    G --> H[orders.insertOne + order_events.insertOne\nin same session]
    H --> I{Both inserted?}
    I -- Yes --> J[Transaction commits\nBoth collections consistent]
    I -- No --> K[Spring rolls back\nNeither collection is modified]

The left branch (embedded outbox) should be the default for MongoDB. It writes once, requires no transaction overhead, and the relay operates asynchronously outside the request path. The right branch (multi-document transaction) is appropriate when the event cannot be embedded — for example, when the event target collection has an independent schema lifecycle or size constraints that make embedding impractical.


🧠 Deep Dive: How NoSQL Atomicity Boundaries Create Consistency Gaps

Understanding why NoSQL dual writes fail requires looking inside the write path of each database — specifically at where the atomicity guarantee ends and the application's responsibility begins.

The Internals: How Each Database Manages Write Atomicity

MongoDB's document-level atomicity is enforced by WiredTiger, MongoDB's storage engine. Every write to a single document — whether an insertOne, updateOne, or replaceOne — is an atomic operation at the storage engine level. WiredTiger uses a copy-on-write mechanism: the new version of the document is written to a new page, and the page pointer is updated atomically. Either the pointer update happens (write is visible) or it does not (write is invisible). There is no partial document state.

When MongoDB 4.0 introduced multi-document transactions, it added a two-phase commit layer on top of WiredTiger. The transaction coordinator (embedded in the mongod process) manages the prepare and commit phases across all participating replica set members. A prepared transaction holds WiredTiger snapshot locks on all affected documents until the commit or abort message arrives. The lock scope — and therefore the contention scope — scales with the number of documents and the transaction duration.

DynamoDB's item-level atomicity is enforced at the partition level. Each item write is processed by a single partition leader in DynamoDB's Paxos-based replication group. The leader writes the item to its local storage, replicates to followers, and acknowledges the write only after a quorum is confirmed. This makes individual item writes strongly consistent and highly available. TransactWriteItems coordinates multiple partition leaders using a two-phase commit managed by DynamoDB's transaction coordinator service — a separate infrastructure component that introduces the failure modes described below. Critically, the transaction coordinator holds no locks during the coordination phase; it uses version-check optimistic concurrency instead, which is why TransactionConflictException occurs when two transactions touch the same item simultaneously.

Cassandra's partition-scoped atomicity is enforced by the storage engine's memtable and SSTable write path. A write to a single partition key goes to the commit log and the memtable in a single operation — this is atomic. A write to two different partition keys is two separate commit log entries and two separate memtable operations. The Cassandra coordinator node sends these writes to the appropriate replica nodes independently. There is no two-phase commit between partitions. A logged batch adds a batch log (written to two additional nodes before the data writes begin), but the batch log replay is best-effort across partitions — it guarantees completion of a started batch, not prevention of partial application if the coordinator dies at exactly the wrong moment.

Performance Analysis: The Real Cost of Cross-Boundary Consistency

Choosing a cross-boundary consistency strategy is a performance decision as much as a correctness decision. The table below quantifies the typical overhead per approach at production scale:

| Strategy | Latency Overhead | Throughput Impact | Failure Complexity |
|---|---|---|---|
| MongoDB single-document write | Baseline (0ms overhead) | None | Zero — one operation |
| MongoDB multi-doc transaction (4.0+) | +5–15ms (oplog + coordination) | 10–30% reduction on contested documents | TransientTransactionError → must retry whole txn |
| DynamoDB single PutItem | Baseline | None | Zero — one operation |
| DynamoDB TransactWriteItems (2 tables) | +2–8ms (coordinator round-trip) | ~2× WCU cost per write; consumed on rollback too | TransactionConflictException → manual retry required |
| Cassandra single-partition write | Baseline | None | Zero — commit log + memtable atomic |
| Cassandra LOGGED BATCH (same partition) | +2–5ms (batch log write) | ~1.5× write overhead (batch log extra writes) | Self-healing on replay |
| Cassandra LWT (IF NOT EXISTS) | +30–100ms (4 Paxos phases) | 60–80% reduction under contention | WriteTimeoutException → must retry |
| Any NoSQL → CDC pipeline (async) | 0ms on write path | None (write path unchanged) | Consumer lag; at-least-once delivery |

The performance analysis makes the fundamental insight concrete: every cross-boundary consistency mechanism adds overhead relative to a single-boundary write, and that overhead grows non-linearly under contention. MongoDB transactions under high concurrent access, DynamoDB transactions at high TPS, and Cassandra LWT under sustained concurrent writes each show disproportionate throughput degradation. The CDC-based async approach is the only strategy that adds zero overhead to the write path — it shifts the consistency work entirely to the background pipeline.


📊 Visualizing the Dual Write Failure and Recovery Paths

The following flowchart maps every dual write scenario in a NoSQL context — from the atomicity boundary check through to the correct resolution strategy. Reading it top-down: the first decision identifies whether the write stays within one atomicity boundary or crosses it; subsequent decisions branch by database type and partition scope.

flowchart TD
    Start([Dual write needed]) --> B1{Stays within one\natomicity boundary?}

    B1 -- Yes: same document\nor same partition --> Safe[Single atomic write\nNo extra pattern needed]

    B1 -- No: crosses boundary --> B2{Which database?}

    B2 --> MDB[MongoDB]
    B2 --> DDB[DynamoDB]
    B2 --> CAS[Cassandra]
    B2 --> XSys[Cross-system\ne.g. NoSQL + ES]

    MDB --> M1{Can the second entity\nbe embedded?}
    M1 -- Yes --> M2[Embed outbox in document\nSingle insertOne — zero txn cost]
    M1 -- No --> M3[Multi-doc @Transactional\nRequires replica set]

    DDB --> D1{Traffic level?}
    D1 -- Low–medium TPS --> D2[TransactWriteItems\nwith idempotency token\n+ condition expressions]
    D1 -- High TPS --> D3[DynamoDB Streams → Lambda\nEvent-driven secondary table]

    CAS --> C1{Same partition key\nacross tables?}
    C1 -- Yes --> C2[LOGGED BATCH\nAtomic within partition]
    C1 -- No --> C3[Outbox table + LOGGED BATCH\nLWT relay for deduplication]

    XSys --> X1[CDC Pipeline\nDebezium → Kafka → Sink]

This diagram makes the key observation visible: when the write stays within one atomicity boundary (same MongoDB document, same DynamoDB item, same Cassandra partition), no compensating pattern is needed — the database guarantees atomicity natively. The patterns exist specifically for cross-boundary writes, and the correct pattern depends on the database and traffic characteristics. The bottom-right path (CDC pipeline) applies to any cross-system write regardless of which NoSQL database is the source.


⚙️ DynamoDB: TransactWriteItems and the Three Production Failure Modes

The Surface: TransactWriteItems Looks Atomic

DynamoDB's TransactWriteItems API accepts up to 100 items across multiple tables and commits them atomically. On the surface, this solves the two-table dual write problem:

DynamoDbClient dynamoDb = DynamoDbClient.create();

Map<String, AttributeValue> orderItem = Map.of(
    "orderId",     AttributeValue.fromS(orderId),
    "customerId",  AttributeValue.fromS(customerId),
    "status",      AttributeValue.fromS("PENDING"),
    "version",     AttributeValue.fromN("1")
);

Map<String, AttributeValue> indexItem = Map.of(
    "customerId",  AttributeValue.fromS(customerId),
    "orderId",     AttributeValue.fromS(orderId),
    "createdAt",   AttributeValue.fromN(String.valueOf(System.currentTimeMillis()))
);

TransactWriteItemsRequest request = TransactWriteItemsRequest.builder()
    .transactItems(
        TransactWriteItem.builder()
            .put(Put.builder().tableName("orders").item(orderItem).build())
            .build(),
        TransactWriteItem.builder()
            .put(Put.builder().tableName("order-by-customer").item(indexItem).build())
            .build()
    )
    .clientRequestToken(orderId)   // idempotency key — covered below
    .build();

dynamoDb.transactWriteItems(request);

Under normal conditions this works. Under production load with concurrent writers, three failure modes surface.

Failure Mode 1 — Idempotency Token Misuse Causes Duplicate Writes

clientRequestToken makes a TransactWriteItems request idempotent: if the network drops and the client retries with the same token, DynamoDB recognizes the duplicate and returns success without re-executing the transaction. This is correct retry behavior.

The failure mode: the retry logic uses a new token — for example, by generating a fresh UUID on each retry attempt, or by losing the original token across a service restart. DynamoDB treats each unique token as a new transaction. The result is duplicate orders in both tables, with no duplicate-key protection unless the application explicitly adds condition expressions (e.g., attribute_not_exists(orderId) on the Put).

Production fix: generate the idempotency token deterministically from the business entity ID (the orderId itself, or a hash of the operation inputs), and persist it before the first attempt. Never generate it inside the retry loop.
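One way to make the token deterministic is to hash the business inputs. A sketch, assuming a hypothetical helper named idempotencyToken (the 36-character truncation matches DynamoDB's ClientRequestToken length limit):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class IdempotencyTokens {

    // Derives a stable token from the operation name and entity ID. The same
    // order retried later (even after a service restart) yields the same
    // token, so DynamoDB deduplicates instead of double-writing.
    public static String idempotencyToken(String operation, String orderId) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(
                (operation + ":" + orderId).getBytes(StandardCharsets.UTF_8));
            // ClientRequestToken accepts at most 36 characters.
            return HexFormat.of().formatHex(digest).substring(0, 36);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```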

Failure Mode 2 — TransactionConflictException Does Not Auto-Retry

When two concurrent TransactWriteItems calls touch the same item, DynamoDB's optimistic concurrency control detects the conflict. One transaction succeeds; the other throws TransactionConflictException. DynamoDB does not automatically retry the failed transaction. The application receives the exception and is responsible for the entire retry cycle — re-reading the current item state, re-building the transaction, and re-submitting.

Applications that catch TransactionConflictException and log-and-drop the exception will silently lose writes under any contention. This is structurally identical to the original two-table dual write failure.

Production fix: implement explicit retry with exponential backoff and jitter for TransactionConflictException. Use condition expressions to implement optimistic locking so the retry detects concurrent modifications:

// Optimistic lock on version attribute — detects concurrent modification before committing
UpdateItemRequest updateOrder = UpdateItemRequest.builder()
    .tableName("orders")
    .key(Map.of("orderId", AttributeValue.fromS(orderId)))
    .updateExpression("SET #status = :newStatus, #ver = :newVer")
    .conditionExpression("#ver = :expectedVer")
    .expressionAttributeNames(Map.of(
        "#status", "status",
        "#ver",    "version"
    ))
    .expressionAttributeValues(Map.of(
        ":newStatus",   AttributeValue.fromS("CONFIRMED"),
        ":newVer",      AttributeValue.fromN(String.valueOf(expectedVersion + 1)),
        ":expectedVer", AttributeValue.fromN(String.valueOf(expectedVersion))
    ))
    .build();

try {
    dynamoDb.updateItem(updateOrder);
} catch (ConditionalCheckFailedException e) {
    // Another writer modified the item since we read it.
    // Re-read the current state, re-build the update, and retry.
    throw new OptimisticLockException("Version mismatch for order " + orderId, e);
}
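The backoff schedule itself can be a pure function. A sketch of exponential backoff with full jitter, assuming the caller sleeps for the returned delay before re-reading the item and re-submitting (baseDelayMs and capMs are illustrative parameters):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {

    // Full-jitter exponential backoff: the exponential bound doubles per
    // attempt up to capMs, and the actual delay is drawn uniformly from
    // [0, bound]. The jitter de-correlates competing writers so their
    // retries do not all collide again on the same item.
    public static long fullJitterDelayMs(int attempt, long baseDelayMs, long capMs) {
        long bound = Math.min(capMs, baseDelayMs << Math.min(attempt, 30));
        return ThreadLocalRandom.current().nextLong(bound + 1);
    }
}
```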

Failure Mode 3 — Write Capacity Is Consumed Even on Rollback

DynamoDB bills write capacity units (WCUs) for all items in a TransactWriteItems call even when the transaction fails and rolls back. At 50,000 TPS with a transaction size of 3 items, a 2% failure rate due to ProvisionedThroughputExceededException on one table will trigger an additional ~3,000 WCU/second on the other tables — wasted capacity that accelerates the throughput exception cascade.

This was the root cause of the e-commerce failure described in the opening scenario: throughput exceptions on the order-by-customer table caused TransactWriteItems to roll back, but the orders table still consumed WCUs on every failed attempt, increasing the provisioned throughput pressure and creating a feedback loop.
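The arithmetic behind that estimate, as a worked example using the numbers from the scenario above:

```java
public class WcuWaste {

    // Failed transactions still bill write capacity for every item in the
    // request. 50,000 TPS x 2% failure rate = 1,000 failed txns/second;
    // at 3 items per transaction that is ~3,000 WCU/second wasted.
    public static double wastedWcuPerSecond(long tps, double failureRate, int itemsPerTxn) {
        double failedTxnsPerSec = tps * failureRate; // transactions that roll back
        return failedTxnsPerSec * itemsPerTxn;       // WCUs billed despite rollback
    }
}
```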

The Right DynamoDB Architecture: Streams Over Transactions at Scale

The correct pattern for high-TPS DynamoDB dual writes is not TransactWriteItems across two tables — it is write to one table and let DynamoDB Streams propagate the change to the secondary table. This shifts the dual write from synchronous in-request to asynchronous and event-driven:

sequenceDiagram
    participant App as Order Service
    participant OT as orders table
    participant DS as DynamoDB Streams
    participant Lam as Lambda (stream consumer)
    participant IX as order-by-customer table

    App->>OT: PutItem / UpdateItem (single write)
    OT-->>App: Success (1 WCU consumed)
    OT--)DS: Stream record emitted (INSERT / MODIFY / REMOVE)
    DS--)Lam: Trigger: batch of stream records
    Lam->>IX: PutItem (write index entry)
    IX-->>Lam: Success
    Note over Lam,IX: Retry is built into Lambda stream processing. Failed batches are retried automatically up to the configured retry count before going to a DLQ.

The sequence diagram illustrates why DynamoDB Streams is the preferred pattern at high TPS. The orders table write is a single, cheap PutItem with no transaction overhead and no cross-table coordination. DynamoDB emits a stream record after the write is committed — not before, so the stream is always consistent with the table. Lambda processes the stream in order, with built-in at-least-once delivery and a configurable DLQ for failed batches. The order-by-customer table is eventually consistent with orders, with a propagation lag typically under 500ms.

The trade-off: the secondary table is not immediately consistent. If your application performs a read of order-by-customer immediately after a write to orders, the index entry may not yet exist. Design read paths to tolerate this window — for example, by reading from orders directly on the write path and only using order-by-customer for list/search queries where eventual consistency is acceptable.


⚙️ Cassandra: Partition Boundaries and the Limits of Lightweight Transactions

Everything in Cassandra Is Partition-Scoped

Cassandra's architecture is built around partitions. Each partition lives on a set of replica nodes determined by the partition key. Writes within a single partition are handled by those nodes without coordination with other nodes. Writes that span two partitions require two independent write paths — and Cassandra provides no coordinator to make them atomic.

This is the fundamental constraint: Cassandra has no cross-partition transactions. It is not a missing feature waiting to be added. It is a deliberate consequence of the architecture that gives Cassandra its linear scalability and multi-datacenter availability.

Lightweight Transactions: IF NOT EXISTS at Paxos Cost

Cassandra Lightweight Transactions (LWT) provide conditional writes within a single partition using a Paxos consensus round. The classic use case is deduplication — guaranteeing that an order ID is inserted exactly once:

CqlSession session = CqlSession.builder()
    .withKeyspace("shop")
    .build();

// Conditional insert — only succeeds if the row does not already exist.
// Paxos ensures no two concurrent writers with the same order_id both succeed.
SimpleStatement insertOrder = SimpleStatement.newInstance(
    "INSERT INTO orders (order_id, customer_id, status, created_at) " +
    "VALUES (?, ?, ?, ?) IF NOT EXISTS",
    orderId, customerId, "PENDING", Instant.now()
);

ResultSet rs = session.execute(insertOrder);
Row applied = rs.one();

if (applied == null || !applied.getBoolean("[applied]")) {
    throw new DuplicateOrderException(
        "Order " + orderId + " already exists — possible duplicate submission"
    );
}

LWT is correct and useful for single-partition conditional writes. Its costs are significant:

  • 4× latency penalty: a Paxos LWT requires four network round-trips — Prepare, Promise, Propose, Commit — compared to one round-trip for a regular write.
  • Contention: multiple LWT writes to the same partition key compete for the Paxos lock, causing WriteTimeoutException under sustained concurrent load.
  • Scope: LWT works only within one partition. It cannot atomically span two tables or two partition keys.

Logged Batch: Atomic Only When the Partition Key Matches

Cassandra's LOGGED BATCH provides atomicity for multiple writes, but with a critical constraint that is widely misunderstood:

// Logged batch — atomic only when all statements target the SAME partition key.
// Here: both inserts use orderId as the partition key, so this is safe.
BatchStatement batch = BatchStatement.builder(BatchType.LOGGED)
    .addStatement(SimpleStatement.newInstance(
        "INSERT INTO orders (order_id, status, customer_id) VALUES (?, ?, ?)",
        orderId, "PENDING", customerId
    ))
    .addStatement(SimpleStatement.newInstance(
        "INSERT INTO order_audit_log (order_id, action, occurred_at) VALUES (?, ?, ?)",
        orderId, "ORDER_CREATED", Instant.now()
    ))
    .build();

session.execute(batch);

When all statements in a logged batch share the same partition key, the batch journal is written atomically to the same replica set. If the coordinator crashes mid-batch, the receiving replicas use the batch log to replay the incomplete writes. This is genuine atomicity.

The misconception: a logged batch with different partition keys is not atomic. The batch log is written to two replicas chosen by the coordinator, and the coordinator applies the writes to each partition independently. If the coordinator crashes after applying writes to one partition but before the other, the batch journal may not be replayed correctly across both partitions. The Cassandra documentation explicitly calls cross-partition logged batches an anti-pattern — they add coordinator memory pressure without providing the atomicity guarantee developers expect.

Cross-Partition Dual Write in Cassandra: The Outbox in the Same Keyspace

For cross-partition or cross-table writes in Cassandra, the correct pattern is an Outbox table in the same keyspace, written in the same logged batch as the business entity (using the same partition key):

// Write the order row AND an outbox event row — same partition key, same logged batch.
// This is atomic: both rows are committed together or neither is.
BatchStatement batch = BatchStatement.builder(BatchType.LOGGED)
    .addStatement(SimpleStatement.newInstance(
        "INSERT INTO orders (order_id, customer_id, status) VALUES (?, ?, ?)",
        orderId, customerId, "PENDING"
    ))
    .addStatement(SimpleStatement.newInstance(
        "INSERT INTO order_outbox (order_id, event_type, payload, published) " +
        "VALUES (?, ?, ?, ?)",
        orderId, "ORDER_PLACED", serializePayload(order), false
    ))
    .build();

session.execute(batch);

// The relay service reads: SELECT * FROM order_outbox WHERE published = false ALLOW FILTERING
// For large tables, maintain a separate pending_events materialized view or use a secondary index.
// The relay marks processed entries with a conditional LWT update:
//   UPDATE order_outbox SET published = true WHERE order_id = ? IF published = false
// The IF clause prevents duplicate processing in case of relay restart.

The relay uses a conditional LWT update (IF published = false) to mark outbox entries as processed. This prevents duplicate event publishing if the relay crashes and restarts after publishing but before marking the entry — the LWT ensures only one relay instance can claim the transition from false to true.
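The claim semantics are those of a compare-and-set. A local model of the relay's claim step, using an in-memory AtomicBoolean as a stand-in for the published column (the real claim is the LWT UPDATE ... IF published = false against Cassandra):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class OutboxClaim {

    // Entries with published=false awaiting a relay; the AtomicBoolean
    // models the partition-scoped published flag.
    private final Map<String, AtomicBoolean> outbox = new ConcurrentHashMap<>();

    public void addPending(String orderId) {
        outbox.put(orderId, new AtomicBoolean(false));
    }

    // Mirrors UPDATE ... SET published = true ... IF published = false:
    // exactly one caller observes the false -> true transition and wins the
    // right to publish; every other caller sees the entry as already claimed.
    public boolean tryClaim(String orderId) {
        AtomicBoolean published = outbox.get(orderId);
        return published != null && published.compareAndSet(false, true);
    }
}
```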


🌍 Cross-System Dual Write: NoSQL + Elasticsearch and Why CDC Is the Right Answer

The most common inter-system dual write in NoSQL-first architectures is NoSQL DB + Elasticsearch: MongoDB or DynamoDB as the source of truth, Elasticsearch for full-text search and complex filtering. The naive implementation is immediate write propagation:

// WRONG — two independent writes with a failure window between them.
orderRepository.save(order);                     // MongoDB write — succeeds
elasticsearchOps.save(toSearchDocument(order));  // ES write — can fail silently
// If ES write fails: MongoDB has the order. ES does not. Search is permanently stale.

Retrying the Elasticsearch write does not help structurally. The application cannot know whether ES received the write and failed to acknowledge, or never received it at all — retrying without idempotency guarantees can cause duplicate documents in ES.

The correct architecture is Change Data Capture: the application writes only to MongoDB, and a CDC pipeline propagates the change to Elasticsearch asynchronously:

flowchart LR
    App[Order Service] -->|Single write| M[(MongoDB\norders collection)]
    M -->|Change Stream\noplog tail| D[Debezium\nMongoDB Connector]
    D -->|CDC event| K[Kafka\nordershop.orders topic]
    K -->|Kafka Connect| ES[Elasticsearch\nSink Connector]
    ES --> ESI[(Elasticsearch\norders index)]

This diagram shows the full CDC propagation path from a single MongoDB write to an Elasticsearch index. The application writes once to MongoDB. Debezium reads MongoDB's oplog via a Change Stream and emits a structured CDC event per document change to Kafka. The Elasticsearch Kafka Connect Sink Connector reads from that topic and upserts documents into the Elasticsearch index. Every component is independently scalable, independently restartable, and has its own retry and offset tracking. The application code has no knowledge of Elasticsearch — it is entirely decoupled.

A minimal Debezium MongoDB connector configuration for this pipeline:

{
  "name": "mongodb-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",
    "collection.include.list": "shop.orders",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.connector.mongodb.transforms.ExtractNewDocumentState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite",
    "topic.prefix": "ordershop"
  }
}

The ExtractNewDocumentState transform flattens the Debezium envelope, extracting just the new document state for insert/update events. For deletes, delete.handling.mode=rewrite emits a record flagged with a __deleted field, and drop.tombstones=false preserves the Kafka tombstone (null value) that follows — which the Elasticsearch sink connector interprets as a document deletion.
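To make the transform concrete, here is a simplified sketch of the same insert event before and after unwrapping (field values are illustrative; note that the MongoDB connector serializes the changed document as an extended-JSON string in the envelope's after field):

```json
// Before unwrap — Debezium envelope (payload portion, simplified):
{
  "op": "c",
  "after": "{\"orderId\": \"abc-123\", \"customerId\": \"cust-456\", \"status\": \"PENDING\"}",
  "source": { "connector": "mongodb", "db": "shop", "collection": "orders" },
  "ts_ms": 1700000000000
}

// After unwrap — just the new document state, ready for the sink connector:
{ "orderId": "abc-123", "customerId": "cust-456", "status": "PENDING" }
```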

For a deep dive on how CDC internals work per database (including MongoDB oplog requirements and replica set constraints), see How CDC Works Across Databases.


🧪 NoSQL-Specific Outbox Pattern Implementations

The Outbox pattern adapts to each NoSQL database's natural atomicity boundary. The relay mechanism also changes per database because each exposes a different change notification API.

MongoDB: Embedded Outbox With Change Stream Relay

The MongoDB-native outbox embeds the pending event inside the order document, making the outbox entry part of the same atomic document write. The relay uses MongoDB Change Streams — which tail the oplog in real time — to detect new outbox entries without polling:

@Component
public class MongoOutboxRelay {

    private final MongoCollection<Document> orderCollection;
    private final EventPublisher eventPublisher;

    public MongoOutboxRelay(MongoDatabase db, EventPublisher eventPublisher) {
        this.orderCollection = db.getCollection("orders");
        this.eventPublisher = eventPublisher;
    }

    public void startRelay() {
        // Watch for any document where _outbox exists and published = false.
        // Change Streams guarantee ordered delivery from the oplog.
        List<Bson> pipeline = List.of(
            Aggregates.match(Filters.and(
                Filters.eq("operationType", "insert"),
                Filters.exists("fullDocument._outbox"),
                Filters.eq("fullDocument._outbox.published", false)
            ))
        );

        orderCollection.watch(pipeline).forEach(changeEvent -> {
            Document doc = changeEvent.getFullDocument();
            Document outbox = doc.get("_outbox", Document.class);

            // Publish the event to the downstream system (Kafka, SNS, etc.)
            eventPublisher.publish(
                outbox.getString("eventType"),
                outbox.get("payload", Document.class)
            );

            // Mark as published — atomic update so duplicate relay restarts don't re-publish.
            orderCollection.updateOne(
                Filters.and(
                    Filters.eq("_id", doc.getObjectId("_id")),
                    Filters.eq("_outbox.published", false)
                ),
                Updates.set("_outbox.published", true)
            );
        });
    }
}

Change Streams require a MongoDB replica set (standalone instances do not support them). They provide ordered, resumable delivery — the relay stores the resume token and can restart from the last processed position after a crash.
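For reference, a sketch of the embedded-outbox document shape this relay watches for — a single insertOne of this document is atomic, so the order and its pending event can never diverge (field names match the relay code above; the payload contents are illustrative):

```json
{
  "orderId": "abc-123",
  "customerId": "cust-456",
  "status": "PENDING",
  "_outbox": {
    "eventType": "ORDER_PLACED",
    "published": false,
    "payload": { "orderId": "abc-123", "total": 99.90 }
  }
}
```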

DynamoDB: Attribute-Based Outbox With Streams + Lambda

In DynamoDB, the outbox is modelled as an attribute on the order item rather than a separate document. Write the order item with an outbox_pending attribute set to true as part of the same PutItem. DynamoDB Streams triggers a Lambda function on every item change:

// Application write: include outbox_pending = true in the initial item
Map<String, AttributeValue> orderItem = new HashMap<>();
orderItem.put("orderId",        AttributeValue.fromS(orderId));
orderItem.put("customerId",     AttributeValue.fromS(customerId));
orderItem.put("status",         AttributeValue.fromS("PENDING"));
orderItem.put("outbox_pending", AttributeValue.fromBool(true));
orderItem.put("outbox_event",   AttributeValue.fromS("ORDER_PLACED"));

dynamoDb.putItem(PutItemRequest.builder()
    .tableName("orders")
    .item(orderItem)
    .build());

// The DynamoDB Stream triggers the Lambda relay (configured separately).
// Lambda publishes the event, then removes outbox_pending with a conditional update:
//   UpdateItem: SET outbox_pending = :false IF outbox_pending = :true
// This prevents double-publishing on Lambda retry.

DynamoDB Streams with a Lambda trigger provides at-least-once delivery. The Lambda function must be idempotent — derive a deterministic deduplication key (for example, orderId combined with the outbox_event type, or the stream record's eventID) and check for duplicates at the consumer. Configure a Dead Letter Queue (DLQ) for the Lambda event source mapping to capture batches that exhaust retries.
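A minimal sketch of the consumer-side deduplication guard. The seen-set is in-memory here purely for illustration — in production it would live in a DynamoDB table or cache with a TTL — and `SeenEvents` and its method names are assumptions, not a library API:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical idempotent-receiver guard keyed on a deterministic event ID.
public class SeenEvents {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given event ID is seen.
    public boolean claim(String eventId) {
        return processed.add(eventId);
    }

    public static void main(String[] args) {
        SeenEvents guard = new SeenEvents();
        String eventId = "abc-123#ORDER_PLACED";  // e.g. orderId + outbox_event type

        if (guard.claim(eventId)) {
            System.out.println("published");          // first delivery: publish downstream
        }
        if (!guard.claim(eventId)) {
            System.out.println("skipped duplicate");  // retried delivery: no-op
        }
    }
}
```

The same guard shape applies to any at-least-once relay in this article — only the backing store for the seen-set changes.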

Cassandra: Logged Batch Outbox With LWT Processing Guard

The Cassandra outbox writes the business entity row and the outbox event row in a single logged batch, then uses LWT to claim exclusive processing:

// Relay: claim the outbox entry atomically with LWT before publishing
SimpleStatement claimEntry = SimpleStatement.newInstance(
    "UPDATE order_outbox SET processing = true, processing_started = ? " +
    "WHERE order_id = ? IF published = false AND processing = false",
    Instant.now(), orderId
);

ResultSet claimResult = session.execute(claimEntry);
if (!claimResult.wasApplied()) {
    // Another relay instance already claimed this entry. Skip it.
    return;
}

// Now publish the event — we hold the exclusive claim.
eventPublisher.publish(orderId, "ORDER_PLACED", payload);

// Mark as published (idempotent final state).
session.execute(SimpleStatement.newInstance(
    "UPDATE order_outbox SET published = true, processing = false " +
    "WHERE order_id = ?",
    orderId
));

The LWT claim (IF published = false AND processing = false) ensures that even if multiple relay instances run concurrently — a common deployment scenario — only one instance publishes each event. The processing flag with a TTL (set separately via Cassandra's native TTL mechanism) prevents permanently stuck entries if the relay crashes after claiming but before publishing.
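A sketch of the two CQL statements involved, under the schema assumed by the snippets above (table and column names are assumptions; the TTL value is illustrative):

```sql
-- Application write: the order row and its outbox row share the same
-- partition key value (order_id), so the logged batch is atomic.
BEGIN BATCH
  INSERT INTO orders (order_id, customer_id, status)
  VALUES ('abc-123', 'cust-456', 'PENDING');
  INSERT INTO order_outbox (order_id, event_type, payload, published, processing)
  VALUES ('abc-123', 'ORDER_PLACED', '{"total": 99.90}', false, false);
APPLY BATCH;

-- Relay claim with a TTL so a crashed relay's claim expires on its own.
-- After expiry the processing column reads as null, so the condition uses
-- != true rather than = false to allow re-claiming.
UPDATE order_outbox USING TTL 300
SET processing = true, processing_started = toTimestamp(now())
WHERE order_id = 'abc-123'
IF published = false AND processing != true;
```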


⚖️ Consistency vs. Scale: Trade-offs in Every NoSQL Dual Write Choice

Every consistency mechanism in NoSQL involves a deliberate trade-off. Unlike PostgreSQL where multi-table ACID transactions are the default and the trade-off is already made at the system level, NoSQL requires each team to consciously choose where on the consistency–scale spectrum to operate.

MongoDB: Embedding versus Transaction

Embedding the outbox in a single document achieves strong consistency at zero additional cost — but it binds the event lifecycle to the document lifecycle. If the document grows large (many embedded events) or if the event needs an independent schema evolution path, embedding becomes a maintenance burden. Multi-document transactions solve the schema independence problem but reduce throughput by 10–30% under write contention and require a replica set deployment (which adds operational complexity for teams running MongoDB without replicas in development or staging).

DynamoDB: TransactWriteItems versus Streams

TransactWriteItems provides synchronous consistency — after the call returns, both tables are updated. DynamoDB Streams provides eventual consistency — the secondary table is updated within milliseconds to seconds, but not immediately. For UI flows where a write is immediately followed by a read of the secondary table (e.g., "place order, immediately show order in list"), eventual consistency requires either reading from the primary table on the write path or accepting that the customer may not see their new order in the list for a brief period. Teams that cannot accept any read-after-write inconsistency must use TransactWriteItems and manage its failure modes explicitly.

Cassandra: Logged Batch versus Outbox Relay

A logged batch provides synchronous atomicity within a single partition — no relay, no background process, no eventual consistency delay. The outbox relay introduces a background process with its own operational surface: relay lag monitoring, DLQ management, and idempotency at the consumer. For teams without mature background job infrastructure, the logged batch is simpler to operate — within its partition-scoped limits.

CDC pipelines: Reliability versus Operational Complexity

CDC pipelines are the most reliable mechanism for cross-system consistency but also the most operationally complex. They require deploying and managing Kafka (or an equivalent), a connector framework (Debezium), schema registry for event schema evolution, and a sink connector per downstream system. The operational cost is justified when the secondary system is Elasticsearch, a second NoSQL database, or a data warehouse — systems where the volume and variety of downstream consumers make point-to-point dual writes unmanageable. For simple cases (one downstream cache, one downstream search index, modest traffic), a well-designed outbox relay is often the right trade-off.

| Strategy | Consistency Type | Operational Complexity | Best For |
|---|---|---|---|
| Embedded MongoDB outbox | Strong (single write) | Low | MongoDB-native event relay |
| MongoDB multi-doc transaction | Strong (synchronous) | Medium (replica set required) | Independent collection schemas |
| DynamoDB TransactWriteItems | Strong (synchronous) | Medium (failure mode handling) | Low–medium TPS, read-after-write required |
| DynamoDB Streams + Lambda | Eventual (sub-second) | Medium (Lambda + DLQ management) | High TPS, no immediate read-after-write |
| Cassandra logged batch | Strong (within partition) | Low | Same partition key, multiple tables |
| Cassandra outbox + LWT relay | Eventual (relay-speed) | Medium (relay process management) | Cross-partition writes |
| CDC pipeline (Debezium) | Eventual (pipeline-speed) | High (Kafka + connector infra) | Multiple downstream systems, high volume |

🧭 Choosing the Right Strategy for Each NoSQL Dual Write Scenario

The decision depends on the atomicity boundary of your specific database and whether the writes are intra-DB or inter-system.

flowchart TD
    Start[Dual write scenario] --> Q1{Same NoSQL DB\nor different systems?}

    Q1 -- Same DB --> Q2{Same document /\npartition / item?}
    Q2 -- Yes --> A1[Single atomic write\nEmbed or update in one operation]

    Q2 -- No, cross-collection /\ncross-table --> Q3{Which database?}
    Q3 -- MongoDB --> A2[Multi-doc transaction\nOR embedded outbox\nPrefer embedded outbox]
    Q3 -- DynamoDB --> A3[DynamoDB Streams → Lambda\nAvoid TransactWriteItems\nat high TPS]
    Q3 -- Cassandra --> Q4{Same partition key?}
    Q4 -- Yes --> A4[Logged BATCH\nAtomic within partition]
    Q4 -- No --> A5[Outbox table in same keyspace\nLogged batch + LWT relay]

    Q1 -- Different systems --> Q5{Latency\nrequirement?}
    Q5 -- Sub-second eventual\nis acceptable --> A6[CDC pipeline\nDebezium / Change Streams\nKafka → Sink Connector]
    Q5 -- Strict synchronous read-your-writes --> A7[Write-through cache or\nRead from DB directly on write path]

The flowchart guides you from the scenario to the matching strategy. The key decision points are: same system vs. cross-system, same atomicity boundary vs. cross-boundary, and for cross-system writes, whether strict read-your-writes is required or eventual consistency is acceptable.

| Scenario | Recommended Strategy | Why |
|---|---|---|
| MongoDB: same document | insertOne / replaceOne | Atomicity is native at document level |
| MongoDB: two collections | Embedded outbox (preferred) or multi-doc @Transactional | Embedded outbox avoids transaction overhead |
| DynamoDB: two tables, low TPS | TransactWriteItems with idempotency token + condition expressions | Transaction feasible under moderate load |
| DynamoDB: two tables, high TPS | DynamoDB Streams → Lambda → secondary table | Decouples writes, eliminates throughput contention |
| Cassandra: same partition, two tables | LOGGED BATCH | Native partition-scoped atomicity |
| Cassandra: cross-partition / cross-table | Outbox table + logged batch + LWT relay | LWT provides exclusive relay claim within partition |
| Any NoSQL → Elasticsearch | CDC pipeline (Debezium + Kafka Connect) | Decoupled, at-least-once, independently scalable |
| Any NoSQL → Redis cache | Write-through or cache-aside delete | Same principle as relational cache invalidation |
| NoSQL → another NoSQL DB | CDC pipeline or Outbox + relay | Avoids synchronous cross-system dependency |

🛠️ Debezium MongoDB Connector: CDC-Based Dual Write Prevention for NoSQL Pipelines

Debezium is an open-source CDC platform that reads database change logs and publishes structured events to Kafka. The MongoDB connector tails MongoDB's oplog via a Change Stream, requiring a replica set deployment. It emits one Kafka message per document operation (insert, update, replace, delete).

The configuration shown in the Cross-System section above is the minimal viable setup. For production, add:

{
  "snapshot.mode": "initial",
  "heartbeat.interval.ms": "10000",
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "mongodb.orders.dlq",
  "errors.deadletterqueue.context.headers.enable": "true",
  "poll.interval.ms": "500",
  "max.batch.size": "2048"
}

Key production settings:

  • snapshot.mode: initial — takes a full collection snapshot before tailing the oplog, ensuring no events are missed if the connector is deployed against an existing collection.
  • heartbeat.interval.ms — emits heartbeat events so the connector's resume token advances even when the source collection has no activity. Without this, the stored resume token can age out of the oplog window on low-traffic collections, forcing a full re-snapshot.
  • errors.deadletterqueue.topic.name — routes events that fail transformation or serialization to a DLQ topic instead of halting the connector.

The Debezium MongoDB connector eliminates application-level dual writes to Elasticsearch, Redis, or any other secondary system by making the MongoDB write the single source of truth. Secondary systems become consumers of the CDC stream rather than recipients of direct application writes.

For a full internals breakdown of how CDC works inside MongoDB (oplog structure, Change Stream vs. oplog tail, replica set requirements, and resume token behavior), see How CDC Works Across Databases.


🛠️ Spring Data MongoDB: Reactive Transactions for Consistent Multi-Collection Writes

For reactive Spring Boot applications using Project Reactor, Spring Data MongoDB provides reactive multi-document transactions via ReactiveMongoTemplate.inTransaction(). This wraps the reactive pipeline in a MongoDB transaction that commits when the pipeline completes successfully and rolls back on any error:

@Service
public class ReactiveOrderService {

    private final ReactiveMongoTemplate mongoTemplate;

    public ReactiveOrderService(ReactiveMongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    public Mono<Order> placeOrder(Order order, OrderEvent event) {
        return mongoTemplate.inTransaction().execute(session ->
            session.insert(order)                         // Insert order into orders collection
                   .then(session.insert(event))           // Insert event into order_events collection
                   .thenReturn(order)
        )
        .retryWhen(Retry.backoff(3, Duration.ofMillis(50))
            .filter(ex -> ex instanceof MongoException
                && ((MongoException) ex).hasErrorLabel(MongoException.TRANSIENT_TRANSACTION_ERROR_LABEL)))
        .next();
        // Both inserts share the same MongoDB session and transaction context.
        // If session.insert(event) fails with a non-transient error,
        // the transaction rolls back and the order insert is reverted.
    }
}

The retryWhen operator retries only errors carrying MongoDB's TransientTransactionError label, which the driver exposes via MongoException.hasErrorLabel(). MongoDB attaches this label for network blips and brief replication lag events. Transient errors are safe to retry; the transaction has already been rolled back and can be re-attempted from scratch.

Reactive transactions require:

  • MongoDB 4.0+ with a replica set or sharded cluster (standalone deployments do not support transactions)
  • Spring Data MongoDB 2.2+ with ReactiveMongoTemplate
  • The MongoDB Reactive Streams driver (mongodb-driver-reactivestreams) on the classpath — the blocking MongoDB Java driver alone is not enough

For I/O-bound services with high concurrency, reactive MongoDB transactions can improve throughput compared to blocking transactions by keeping threads free during I/O waits. For services with little concurrent I/O, the added complexity of the reactive pipeline may not justify the gain — blocking @Transactional on a virtual thread (Project Loom) achieves similar throughput with significantly simpler code.


📚 Lessons Learned From Running NoSQL Dual Writes in Production

NoSQL does not mean no consistency — it means you manage consistency explicitly. Relational databases enforce consistency as a default through foreign keys, transaction isolation, and constraint checking. NoSQL databases place that responsibility on the application. The trade-off for horizontal scale is that you must know your database's atomicity boundary and design your write paths around it.

MongoDB multi-document transactions are powerful but expensive — prefer single-document atomicity. The embedded outbox pattern is not a workaround; it is the idiomatic MongoDB approach to atomic event recording. Use multi-document transactions when embedding is genuinely impractical, not as the default for any two-collection write.

DynamoDB TransactWriteItems is not a free lunch. Three production failure modes — idempotency token misuse, un-retried TransactionConflictException, and WCU consumption on rollback — combine to create silent data loss under contention. At high TPS, DynamoDB Streams + Lambda is almost always the correct architecture for secondary table synchronization.

Cassandra logged batches provide atomicity only within the same partition key. The most common misconception in Cassandra development is that a logged batch spanning two different partition keys is atomic. It is not. Cross-partition logged batches are documented as anti-patterns in the Cassandra documentation. If you need atomicity across partitions in Cassandra, the outbox table pattern with LWT is the correct approach.

For NoSQL → secondary system writes (Elasticsearch, Redis, another DB): CDC is almost always the right architecture. Direct application writes to secondary systems introduce a fragile synchronous dependency and a failure window. CDC pipelines (Debezium, MongoDB Change Streams, DynamoDB Streams) make secondary systems event-driven consumers rather than write targets, eliminating the dual write failure mode entirely.

Idempotency is the relay's responsibility, not just the publisher's. Every outbox relay — whether it is a Cassandra LWT relay, a DynamoDB Streams Lambda, or a MongoDB Change Stream processor — must handle the scenario where it publishes an event and then crashes before marking the event as processed. The relay will restart and re-publish. Downstream consumers must be designed as idempotent receivers (using event ID deduplication) to handle at-least-once delivery.


📌 TLDR — The NoSQL Dual Write Problem in Five Bullets

  • NoSQL atomicity is per-boundary: MongoDB's boundary is the document (single-doc) or the replica set (multi-doc txn). DynamoDB's boundary is TransactWriteItems (up to 100 items, with contention costs). Cassandra's boundary is the partition key. Write across any boundary without a compensating pattern and you will lose writes silently.
  • MongoDB: prefer embedding over transactions. A single insertOne with an embedded outbox entry is always atomic and never requires a transaction. Reserve multi-document @Transactional for writes that cannot be embedded.
  • DynamoDB at high TPS: use Streams over TransactWriteItems. TransactWriteItems consumes WCUs even on rollback, does not auto-retry on TransactionConflictException, and is vulnerable to idempotency token misuse. DynamoDB Streams + Lambda is the scalable default for cross-table synchronization.
  • Cassandra logged batch: same partition key only. Cross-partition logged batches do not provide the atomicity guarantee developers expect. Use the outbox table pattern with LWT for cross-partition dual writes.
  • NoSQL → external system: CDC first. Whether the secondary target is Elasticsearch, Redis, or another NoSQL database, a CDC pipeline (Debezium, Change Streams, DynamoDB Streams) is more reliable, more scalable, and more operationally transparent than application-level dual writes.

📝 Practice Quiz — Test Your NoSQL Dual Write Knowledge

  1. A MongoDB service running on pre-4.0 MongoDB writes an order to the orders collection and then throws an exception while writing to the order_events collection. What is the state of the orders collection after the exception?

Correct Answer: The order document is permanently committed to the orders collection. Pre-4.0 MongoDB has no multi-document transactions. The first insertOne is final the moment it returns — there is no rollback, no compensation, and no mechanism to undo a committed document write in a later operation. The orders collection has the order; order_events does not have the event. The collections are permanently inconsistent unless the application has an explicit repair process.


  2. A DynamoDB TransactWriteItems call fails with a TransactionConflictException and rolls back. Does DynamoDB consume write capacity units for the failed transaction? Why does this matter at high TPS?

Correct Answer: Yes, DynamoDB consumes WCUs for all items in a TransactWriteItems call even when the transaction rolls back. At high TPS, frequent TransactionConflictException events create a cascade: failed transactions consume WCUs, pushing other tables toward ProvisionedThroughputExceededException, which causes more failures, which consumes more WCUs. This positive feedback loop is why DynamoDB Streams-based synchronization is the correct high-TPS architecture — it eliminates the cross-table write entirely and removes the cascading WCU amplification.


  3. A Cassandra logged batch writes two rows — one to the orders table with partition key order_id = 'abc-123', and one to the orders_by_customer table with partition key customer_id = 'cust-456'. Is this batch atomic?

Correct Answer: No. A Cassandra logged batch is only guaranteed to be atomic when all statements share the same partition key. The batch log is written to two coordinator-selected replicas and the writes to each partition are applied independently. If the coordinator crashes mid-execution, the batch log may not replay correctly across both partitions. The Cassandra documentation labels cross-partition logged batches an anti-pattern. The correct approach for cross-partition consistency is an outbox table in the same keyspace, written in a single-partition logged batch sharing the business entity's partition key.


  4. Your service writes order documents to MongoDB and needs to keep an Elasticsearch index in sync. Why is direct application dual write (orderRepo.save(order); elasticsearchOps.save(doc)) problematic, and what is the recommended architecture?

Correct Answer: Direct dual write is problematic because: (1) the Elasticsearch write can fail independently, leaving the index permanently stale; (2) retrying without idempotency can create duplicate ES documents; (3) the application is tightly coupled to Elasticsearch availability — an ES outage impacts order write latency and throughput. The recommended architecture is a CDC pipeline: MongoDB write → Change Stream → Debezium MongoDB Connector → Kafka → Elasticsearch Sink Connector. The application writes only to MongoDB; ES becomes an eventually-consistent consumer of the CDC stream, fully decoupled from the write path.


  5. (Open-ended challenge — no single correct answer) You are designing a high-traffic Cassandra-backed order service where each order write must also update a customer-level order_count counter in a separate partition. Logged batch cannot help (different partition keys), and LWT only covers one partition. Outline two different architectural approaches, explain the consistency guarantee each provides, and describe the failure mode each leaves you responsible for handling. Which would you choose for a 10K TPS system, and why?

Consider: the outbox table pattern, a separate counter service backed by a CRDT counter, an eventually-consistent materialized view maintained by a relay, and the operational burden of each. The right answer depends on acceptable counter staleness and your team's infrastructure maturity.

Written by Abstract Algorithms (@abstractalgorithms)