The Dual Write Problem: Why Two Writes Always Fail Eventually — and How to Fix It

Transactional Outbox, CDC publishing, and cache invalidation — the patterns that eliminate dual write failures

Architecture Patterns for Production Systems

Abstract Algorithms

·Apr 5, 2026·22 min read

📚

Intermediate

For developers with some experience. Builds on fundamentals.

Estimated read time: 22 min

AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.

TLDR: Any service that writes to a database and publishes a message in the same logical operation has a dual write problem. try/catch retries don't fix it — they turn failures into duplicates. The Transactional Outbox pattern co-writes business data and a pending event into the same DB transaction, then a relay delivers to Kafka separately. At-least-once delivery, idempotent consumers — no ghost orders, no phantom fulfillments.

📖 The Order That Silently Disappeared From Fulfillment

It is 11:42 PM on Black Friday. An order service receives a checkout request, writes the new row to PostgreSQL — status PENDING, payment captured — and then calls kafkaTemplate.send("order-events", payload). The Kafka broker is under peak load. The network call times out. A NetworkException is thrown, caught by catch (Exception e) { log.error(...); }, and silently swallowed.

The order exists in the database. The fulfillment service never receives the OrderPlaced event. The warehouse never picks the item. The customer sees "Order Confirmed ✅" on their screen. Three days later they call support — where is my order?

This is the dual write problem: a service commits state to two independent systems in what should be a single logical operation. Because the two writes are not atomic, any failure between them leaves the systems permanently inconsistent. And critically — both the database and the message broker appear healthy. There is no alert. There is no error page. Just a silent gap.

The dual write problem is not an edge case. It exists in 100% of services that write to a database and send a message, update a cache, or push a search index entry in the same code path — without a coordination strategy. The code looks correct. The bug is structural.

🔍 Three Failure Modes That Strike Every Multi-System Service

Before exploring solutions, it is essential to see all three failure modes. Engineers who only know Scenario A are surprised by B and C in production.

Scenario A — The Ghost Order: DB Succeeds, Kafka Fails

The most common failure. The PostgreSQL transaction commits successfully. The Kafka publish throws an exception after the commit. The downstream fulfillment consumer never receives the event — the order exists in the database but is invisible to every consumer that acts on it.

sequenceDiagram
    participant App as Order Service
    participant DB as PostgreSQL
    participant K as Kafka Broker

    App->>DB: BEGIN - INSERT INTO orders - COMMIT
    DB-->>App: Row committed id=7821
    App->>K: send order-events OrderPlaced id=7821
    K-->>App: NetworkException - timeout under load
    Note over App,K: Event never delivered. Order 7821 is a ghost in fulfillment.

The sequence diagram shows the critical timing: the COMMIT returns before the Kafka call is even attempted. PostgreSQL has no knowledge of Kafka — once committed, that row is permanent. There is no automatic compensation. The database and the message broker are now out of sync, forever, unless a human notices.

Scenario B — The Phantom Fulfillment: Kafka Succeeds, DB Fails

The failure can run in the opposite direction, and the consequences are worse. The event is published to Kafka successfully. Consumers start processing immediately — inventory is reserved, a shipment record is created. Then the database write fails with a ConstraintViolationException. The local transaction rolls back. The order row does not exist in the source-of-truth database. But the fulfillment service has already acted on the event and there is no undo. The customer may be charged for a shipment that has no corresponding order record.

Scenario C — The Stale Cache: DB Succeeds, Redis Fails

Not every dual write involves Kafka. When a user updates their shipping address, the service writes to PostgreSQL and then calls redisTemplate.opsForValue().set("user:" + userId, json). The Redis SET call fails silently — connection pool exhausted, Redis OOM, a transient network blip. The database holds the new address. The cache holds the old one. For the duration of the TTL, every read returns the stale address. The user places another order, sees their old address pre-filled, misses the discrepancy, and their package ships to the wrong location.

This variant is especially dangerous because Redis connection pool exhaustion rarely throws a loud exception in an async fire-and-forget call. The failure disappears without a trace.

The common thread across all three scenarios: writing to two independent systems without coordination creates a consistency window. Any failure inside that window produces permanent divergence.

⚙️ Why Every Instinctive Fix Makes the Problem Worse

When engineers first encounter the dual write problem, there are three instinctive fixes. All three are wrong — not because of implementation details, but because of structural incompatibilities with the systems involved.

Fix Attempt 1 — "Just Wrap It in a Try-Catch and Retry"

// This looks safe. It is not.
try {
    orderRepo.save(order);             // DB write
    kafka.send("order-events", event); // Kafka write — outside DB transaction
} catch (Exception e) {
    retryPublish(event); // Which write failed? We don't know.
}

The problem is that the catch block cannot distinguish which write failed. If the DB succeeded and Kafka failed, the retry re-publishes the event — safe only if every consumer is provably idempotent. If Kafka succeeded and the DB failed (and rolled back), the retry publishes a second message for data that no longer exists. Retry logic converts write failures into duplicate delivery problems. It moves the consistency gap; it does not close it.

Fix Attempt 2 — "Use a Distributed Transaction (2PC / XA)"

Two-Phase Commit is the theoretically correct solution. A coordinator sends PREPARE to all participants, waits for votes, then sends COMMIT. Either all commit or all roll back.

There are two fundamental blockers for modern distributed systems:

sequenceDiagram
    participant C as Coordinator
    participant DB as PostgreSQL XA
    participant K as Kafka

    C->>DB: PREPARE
    C->>K: PREPARE
    K-->>C: Protocol Not Supported - no XA implementation
    Note over C,K: 2PC cannot proceed. Kafka has no XA resource manager.

First: Kafka does not implement the XA protocol. The Kafka wire protocol has no concept of XA prepare/commit. Even Kafka's own transactions API (idempotent producers + beginTransaction) only guarantees atomicity across Kafka partitions — not across Kafka and an external RDBMS. You cannot enlist Kafka in a JTA transaction.

Second: 2PC is a blocking protocol prone to coordinator failure. If the coordinator crashes after sending PREPARE but before sending COMMIT, all participants hold locks and wait indefinitely. In a microservices environment where pods restart constantly, this pathological blocked state is not theoretical. The resolution requires manual operator intervention, which means downtime.

Fix Attempt 3 — "Write to Kafka First, Then the DB"

Reversing the order does not eliminate the dual write problem — it only changes which system holds the orphan record on failure. If Kafka publish succeeds and the DB write subsequently fails, you now have an event for an order that was never persisted. Scenario B becomes the default failure mode.

There is no safe ordering of two independent writes to two independent systems. The root cause is the absence of a shared transaction boundary.

🧠 Deep Dive: How the Transactional Outbox Achieves True Atomicity

The Transactional Outbox pattern resolves the dual write problem by changing the question. Instead of "how do we atomically write to DB and Kafka?" — which is impossible — it asks: "what if both writes go to the database, and we let a separate process deliver to Kafka?"

The answer is that you can write the business record and a pending event record to the same PostgreSQL database in the same ACID transaction. The two rows either commit together or roll back together. There is no gap. Kafka delivery becomes a separate, retryable, eventually-consistent operation.

The Internals: Outbox Table, Transaction Boundary, and Relay Pipeline

The pattern has three components:

Component 1: The outbox_events table — a staging area in the same database as your business tables. Each row represents an event that should be published to a message broker. The row is written atomically with the business record in the same transaction.

Column	Type	Purpose
`event_id`	VARCHAR (PK)	Globally unique ID — used as Kafka message key for consumer-side deduplication
`aggregate_type`	VARCHAR	Domain entity type, e.g. `"Order"`
`aggregate_id`	VARCHAR	Business entity ID, e.g. `"7821"`
`event_type`	VARCHAR	Logical event name, e.g. `"OrderPlaced"`
`payload`	TEXT	JSON-serialized event body
`created_at`	TIMESTAMPTZ	Commit time — controls relay ordering
`published`	BOOLEAN	`false` = pending delivery; `true` = relayed to Kafka

Component 2: The OrderService — co-writes business data and the outbox row in a single @Transactional method. If anything throws before the transaction commits, both rows roll back. Neither a ghost order nor a missing event is possible at the database level.

Component 3: The Outbox Relay — a @Scheduled background bean that reads unpublished rows, calls KafkaTemplate.send(), and marks rows published = true on success. If delivery fails, the row remains published = false and is retried on the next poll cycle. This gives at-least-once delivery — consumers must be idempotent.

Performance Analysis: Polling Latency, Batch Size, and the CDC Alternative

The polling relay introduces a latency floor equal to the poll interval — typically 500ms–2s. Three tuning variables govern the throughput ceiling:

Variable	Impact
Poll interval (`fixedDelay`)	Lower interval = lower latency, higher DB query frequency
Batch size (PageRequest)	Larger batch = higher throughput per cycle, more memory pressure
Index on `(published, created_at)`	Without this index, the `SELECT WHERE published = false` degrades to a full table scan as the outbox grows

The (published, created_at) composite index is non-optional on any table that will hold more than a few hundred rows. A missing index is the most common cause of outbox relay CPU spikes in production.

For scenarios where 500ms–2s latency is unacceptable — payment confirmations, real-time notifications, trading systems — the polling relay should be replaced with Change Data Capture (CDC). Debezium reads directly from the PostgreSQL Write-Ahead Log (WAL), detecting new outbox_events inserts within 50–100ms of commit without polling the primary database at all.

📊 Visualizing the Dual Write Gap and the Outbox Solution End to End

The diagram below contrasts the naive dual write (left path, failure gap visible) with the Transactional Outbox (right path, atomicity guaranteed through the shared DB transaction).

flowchart TD
    subgraph naive[Naive Dual Write - failure gap]
        N1[orderRepo.save] --> N2[COMMIT success]
        N2 --> N3[kafkaTemplate.send]
        N3 -->|NetworkException| N4[Event lost - gap opens]
    end

    subgraph outbox[Transactional Outbox - atomic]
        O1[BEGIN TX] --> O2[INSERT INTO orders]
        O2 --> O3[INSERT INTO outbox_events published=false]
        O3 --> O4[COMMIT - both rows durable]
        O4 --> O5[Outbox Relay - Scheduled 500ms]
        O5 --> O6[KafkaTemplate.send]
        O6 -->|success| O7[UPDATE outbox_events published=true]
        O6 -->|failure| O8[Row stays unpublished - retried next cycle]
    end

The two subgraphs make the structural difference explicit. In the naive path, the commit and the Kafka send are sequential and independent — any exception between them creates a permanent gap. In the Outbox path, both writes happen inside one database transaction. The Kafka send is delegated to the relay, which runs independently and retries safely. Failure means retry, not data loss.

🌍 Where Dual Writes Appear Across Real Production Architecture Layers

The dual write problem appears at every layer of a production system. Once you know the pattern, you see it everywhere.

Order service (DB + Kafka): The canonical case. Save order → publish OrderPlaced. Every e-commerce checkout path has this dual write. Netflix, Amazon, Stripe all solved it with some variant of the Transactional Outbox.

User profile service (DB + Redis): Update profile → update cache. Every read-heavy user service has this. The failure mode is stale PII served to downstream callers — wrong address, outdated payment preference, stale permission set.

Search indexing service (DB + Elasticsearch): Save product → index in Elasticsearch. Failed index writes mean stale search results. Users searching for a product that exists in the catalog cannot find it. Debezium's Elasticsearch Sink Connector solves this via CDC on the product table.

Read replica synchronization (primary DB + read replica): Technically handled by the database replication protocol, but application code that reads from a replica immediately after a write can observe stale data in the replication lag window. Services must either read from primary for consistency-sensitive reads or use read-your-writes sessions.

Notification service (DB + SQS/SNS): Save notification record → enqueue SQS message. If the SQS publish fails, the notification never delivers. The Transactional Outbox works identically with SQS as the delivery target — just change the relay to call sqsClient.sendMessage() instead of kafkaTemplate.send().

The pattern is universal. Any service that writes to a primary store and then propagates that write to a secondary system needs a coordination strategy.

⚖️ Trade-offs and Failure Modes Across Outbox, CDC, and Event Sourcing

No solution is universally optimal. Understanding the failure modes of each pattern is what lets you choose intelligently.

Dimension	Transactional Outbox (Polling)	CDC + Debezium	Event Sourcing	Cache-Aside DELETE
Atomicity guarantee	✅ Both DB writes in one TX	✅ Derives from WAL	✅ Event IS the write	✅ DB write is the source of truth
Delivery semantics	At-least-once	At-least-once	At-least-once	N/A (no broker)
Event latency	~500ms–2s	~50–100ms	Eventually consistent reads	~1ms (cache miss path)
Operational complexity	Low — one `@Scheduled` bean	Medium — Debezium cluster	High — CQRS, projections, schema evolution	Very Low
Consumer requirement	Idempotent	Idempotent	Idempotent	N/A
Primary failure mode	Relay pod crash → delayed delivery (not data loss)	Debezium connector lag → delayed delivery	Projection rebuild on schema change	DELETE call failure → cache miss (acceptable)
DB WAL/binlog required	No	Yes (`wal_level = logical`)	N/A	No
Works with Kafka SaaS	✅ Any broker	✅ Via Kafka Connect	✅	N/A
Outbox table growth	❌ Needs archival job	✅ Debezium marks offset, DB rows cleaned	❌ Event log grows forever	N/A

The most important row is primary failure mode. The Transactional Outbox cannot lose events — a relay crash means delayed delivery, not lost delivery. When the relay restarts, it re-reads all published = false rows and retries. This at-least-once guarantee is the whole point of the pattern.

The dangerous failure mode for CDC is Debezium connector lag: if the connector falls behind the WAL (e.g., during a Kafka outage), events queue up in the WAL. Once WAL segments rotate, the connector must catch up or replay. Monitoring connector lag is mandatory in production.

Event Sourcing's primary failure mode is projection rebuild complexity: when the event schema evolves, existing projections may become invalid and must be rebuilt from the full event log, which can take hours for large event stores.

🧭 Choosing the Right Pattern for Your System's Constraints

The decision tree below maps system constraints to the appropriate pattern. Start with the type of secondary system you are writing to, then apply the latency and complexity budget.

flowchart TD
    A[Service writes to two systems] --> B{Secondary system type?}
    B -->|Message broker Kafka SQS RabbitMQ| C{Sub-100ms latency required?}
    C -->|Yes| D[CDC + Debezium Outbox Router - 50ms WAL-based]
    C -->|No| E[Transactional Outbox Polling Relay - 500ms to 2s]
    B -->|Cache Redis Memcached| F[Cache-Aside DELETE on write - read-through repopulates]
    B -->|Search Index Elasticsearch| G[CDC to ES Sink Connector]
    B -->|Another service DB or cross-service state| H{Audit trail or time-travel required?}
    H -->|Yes| I[Event Sourcing + CQRS]
    H -->|No| J[Transactional Outbox + Saga for compensation]

The flowchart above guides the initial pattern selection. Use the decision table below to validate your choice against operational constraints.

Consolidated Decision Table:

Scenario	Recommended Pattern	Latency	Complexity	Consumer Idempotency
DB + Kafka, standard team	Transactional Outbox (polling)	~500ms	Low	Required
DB + Kafka, sub-100ms SLA	CDC + Debezium Outbox Router	~50ms	Medium	Required
DB + Redis cache	Cache-Aside DELETE on write	~1ms	Very Low	N/A
DB + Elasticsearch	CDC to ES Sink Connector	~100ms	Medium	Required
Event-driven domain, audit trail	Event Sourcing + CQRS	Eventually consistent reads	High	Required
Multi-service distributed operation	Saga + Outbox per service	Eventually consistent	High	Required

When Event Sourcing is the right choice: Audit trail is a first-class business requirement (financial services, healthcare, legal). You need time-travel or full replay capability — "what was the state of order 7821 at 3:15 PM on Tuesday?" The domain model is naturally event-driven. For the full implementation of Event Sourcing with Axon Framework, see Event Sourcing Pattern: Auditability, Replay, and Evolution of Domain State.

🧪 Full Transactional Outbox Implementation in Spring Boot

The following is a complete, production-ready implementation using Spring Data JPA for the database writes and Spring Kafka for event delivery. Each class maps directly to one of the three components described in the Deep Dive section.

The `OutboxEvent` Entity — Staging Table for Pending Events

@Entity
@Table(name = "outbox_events", indexes = {
    @Index(name = "idx_outbox_pending", columnList = "published, created_at")
})
public class OutboxEvent {

    @Id
    private String eventId;

    private String aggregateType;   // "Order", "Payment", "User"
    private String aggregateId;     // business entity ID — becomes Kafka message key
    private String eventType;       // "OrderPlaced", "PaymentCaptured"

    @Column(columnDefinition = "TEXT")
    private String payload;         // JSON-serialized domain event

    private Instant createdAt;
    private boolean published;

    protected OutboxEvent() {}

    public OutboxEvent(String eventId, String aggregateType, String aggregateId,
                       String eventType, String payload, Instant createdAt, boolean published) {
        this.eventId = eventId;
        this.aggregateType = aggregateType;
        this.aggregateId = aggregateId;
        this.eventType = eventType;
        this.payload = payload;
        this.createdAt = createdAt;
        this.published = published;
    }

    public String getEventId()     { return eventId; }
    public String getAggregateId() { return aggregateId; }
    public String getPayload()     { return payload; }
    public boolean isPublished()   { return published; }
    public void setPublished(boolean p) { this.published = p; }
}

The `OrderService` — Atomic Dual Write in a Single Transaction

Both the business record and the outbox row are written inside one @Transactional boundary. If any step throws before the transaction commits, both rows roll back atomically. Ghost orders and phantom events are structurally impossible at this layer.

@Service
@Transactional
public class OrderService {

    private final OrderRepository orderRepo;
    private final OutboxEventRepository outboxRepo;
    private final ObjectMapper mapper;

    public OrderService(OrderRepository orderRepo,
                        OutboxEventRepository outboxRepo,
                        ObjectMapper mapper) {
        this.orderRepo  = orderRepo;
        this.outboxRepo = outboxRepo;
        this.mapper     = mapper;
    }

    public Order placeOrder(PlaceOrderRequest req) throws JsonProcessingException {
        // Step 1: save business entity
        Order order = new Order(
            req.getCustomerId(),
            req.getItems(),
            OrderStatus.PENDING
        );
        orderRepo.save(order);

        // Step 2: write outbox event — SAME transaction as step 1
        String payload = mapper.writeValueAsString(
            new OrderPlacedEvent(order.getId(), req.getCustomerId(), order.getTotalAmount())
        );
        OutboxEvent event = new OutboxEvent(
            UUID.randomUUID().toString(),   // unique event_id — Kafka key for dedup
            "Order",
            order.getId().toString(),
            "OrderPlaced",
            payload,
            Instant.now(),
            false
        );
        outboxRepo.save(event);

        return order;
        // @Transactional commits both rows here — or rolls both back on any exception
    }
}

The `OutboxRelay` — Polls and Publishes Every 500ms

The relay is a lightweight Spring component with one responsibility: read published = false rows, deliver to Kafka, mark as delivered. It runs every 500ms with a batch cap of 100 rows to bound the Kafka round-trip window.

@Component
public class OutboxRelay {

    private final OutboxEventRepository outboxRepo;
    private final KafkaTemplate<String, String> kafka;

    public OutboxRelay(OutboxEventRepository outboxRepo,
                       KafkaTemplate<String, String> kafka) {
        this.outboxRepo = outboxRepo;
        this.kafka      = kafka;
    }

    @Scheduled(fixedDelay = 500)
    @Transactional
    public void relayPendingEvents() {
        List<OutboxEvent> pending = outboxRepo
            .findByPublishedFalseOrderByCreatedAtAsc(PageRequest.of(0, 100));

        for (OutboxEvent event : pending) {
            kafka.send("order-events", event.getAggregateId(), event.getPayload())
                 .whenComplete((result, ex) -> {
                     if (ex == null) {
                         event.setPublished(true);
                         outboxRepo.save(event);
                         // Row is now permanently published — will not be retried
                     }
                     // On failure: row stays published=false → retried next cycle
                     // This is correct at-least-once behavior — not a bug
                 });
        }
    }
}

The `OutboxEventRepository`

public interface OutboxEventRepository extends JpaRepository<OutboxEvent, String> {
    List<OutboxEvent> findByPublishedFalseOrderByCreatedAtAsc(Pageable pageable);
}

The CDC + Debezium Alternative: Connector Config Only

When the polling relay's 500ms latency floor is unacceptable, replace the relay entirely with a Debezium PostgreSQL connector. Debezium reads from the WAL — no poll thread, ~50ms latency from commit to Kafka delivery.

{
  "name": "order-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "orders",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.id": "event_id",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.route.by.field": "event_type"
  }
}

The EventRouter Single Message Transform reads event_type from each captured row and routes the event to the corresponding Kafka topic. The application writes to one table; consumers receive type-partitioned topics. For the full mechanics of WAL-based CDC, see How CDC Works Across Databases.

The DB + Redis Cache Variant: Invalidate, Never Update

For the DB + Redis scenario, the outbox is not needed. The correct fix is a single-line change in the write path: replace SET with DELETE.

@Transactional
public void updateUserProfile(Long userId, ProfileUpdateRequest req) {
    userRepo.save(new User(userId, req));        // DB write — source of truth
    redisTemplate.delete("user:" + userId);      // invalidate key — NOT update
    // Next GET /users/{userId}: cache miss → read from DB → repopulate with fresh data
}

Why DELETE beats SET under failures:

DELETE is idempotent. Deleting a non-existent key is a no-op. Retry is always safe.
A stale SET is dangerous. A SET with an outdated value propagates incorrect data to every caller until TTL expiry — potentially minutes.
Read-through repopulation is correct by construction. The next read misses, fetches from the authoritative database (which has the new value), and repopulates the cache with fresh data. The window of inconsistency is bounded to a single cache miss latency — milliseconds.

🛠️ Spring Modulith: Transactional Events Without the Outbox Boilerplate

Spring Modulith offers a higher-level abstraction that delivers transactional event semantics without writing an outbox entity, relay bean, or repository. Under the hood it uses @TransactionalEventListener(phase = AFTER_COMMIT) — events are only dispatched after the owning transaction successfully commits to the database.

// Publish a domain event within the business transaction — dispatch is deferred
@Service
@Transactional
public class OrderService {

    private final ApplicationEventPublisher events;
    private final OrderRepository repo;

    public OrderService(ApplicationEventPublisher events, OrderRepository repo) {
        this.events = events;
        this.repo   = repo;
    }

    public Order placeOrder(PlaceOrderRequest req) {
        Order order = repo.save(new Order(req));
        events.publishEvent(new OrderPlacedEvent(order.getId(), req.getCustomerId()));
        // event is held in the Spring context until AFTER this @Transactional method commits
        return order;
    }
}

// Listener fires only after the DB transaction commits — if DB rolls back, this never executes
@ApplicationModuleListener
public void onOrderPlaced(OrderPlacedEvent event) {
    kafkaTemplate.send("order-events", event.orderId().toString(), toJson(event));
}

The key guarantee: if repo.save() throws and the transaction rolls back, publishEvent() is discarded — @ApplicationModuleListener never fires. This eliminates Scenario A (ghost orders) without a polling relay or a separate outbox table.

The remaining exposure: the Kafka send() inside the listener can still fail. Spring Modulith does not retry failed listener invocations by default — you need a dead-letter queue or a fallback to the polling outbox for full at-least-once guarantees. For teams that want WAL-level durability without managing Debezium, combining Spring Modulith events with a persisted outbox table gives both the developer experience benefit and the operational safety net.

For a full deep-dive on Spring Modulith's event system, see the official Spring Modulith documentation.

📚 Hard Lessons from Dual Write Failures in Production

Every service that writes to DB and publishes an event has a dual write problem. Audit every method body containing both save() and kafkaTemplate.send() — each one is a latent incident with a date on it.
"Just retry" converts write failures into duplicate events. Retry without knowing which write failed turns a consistency problem into a duplication problem. It is not a fix.
2PC is dead for Kafka. Kafka has no XA implementation. Even JTA-aware brokers make 2PC a coordinator-failure liability. The engineering consensus reached this conclusion in 2010; do not rediscover it on-call at 2 AM.
The Transactional Outbox is the default answer for 80% of teams. One extra table, one @Scheduled bean, no external tooling. It works with any message broker. Reach for CDC only when you have hit the polling latency wall.
The outbox table is not self-cleaning. published = true rows accumulate. Without a periodic archival job (DELETE FROM outbox_events WHERE published = true AND created_at < NOW() - INTERVAL '7 days'), the table becomes a performance liability within weeks on high-throughput systems.
Delete beats update for cache invalidation. An idempotent DELETE is always safe under failures. A stale SET propagates incorrect data to every reader. Treat the cache as a read accelerator, not a write-through mirror.
At-least-once delivery + idempotent consumers is the correct default posture for event-driven systems. Design for it from day one. The event_id as Kafka message key, checked against a processed_events table or Redis set, is the standard consumer-side deduplication pattern.

📌 TLDR: The Dual Write Decision Cheat Sheet

The problem: Writing to two independent systems (DB + Kafka, DB + Redis) without coordination creates a consistency window — any failure inside that window leaves the systems permanently out of sync.
Try-catch retries are wrong: They cannot distinguish which write failed — they turn write failures into duplicate delivery without closing the consistency gap.
2PC is a dead end for Kafka: Kafka has no XA support; 2PC coordinator failure permanently blocks all participants. The solution adds more failure modes than it removes.
Transactional Outbox is the default fix: Co-write business data and a pending event in one DB transaction. A relay delivers to Kafka separately. At-least-once — consumers must be idempotent.
CDC + Debezium upgrades the latency: WAL-reading replaces polling, cutting event latency from ~500ms to ~50ms at the cost of a Debezium cluster to operate.
Cache invalidation: always DELETE, never SET: A DELETE on write is idempotent and safe under failures. A stale SET propagates incorrect data to every caller until TTL expiry.

Test Your Knowledge

🧠

Ready to test what you just learned?

AI will generate 4 questions based on this article's content.

Stale Reads and Cascading Failures in Distributed Systems

TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redistributed load. Both are preventable — stale reads...

May 3, 2026•23 min read

Split Brain Explained: When Two Nodes Both Think They Are Leader

TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader — each accepting writes the other never sees. Prevent it with quorum consensus (at least ⌊N/2⌋+1 nodes must agree before leadership is g...

May 3, 2026•20 min read

Clock Skew and Causality Violations: Why Distributed Clocks Lie

TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions — but under load, across datacenters, or after a VM pause, the drift can reach seconds. When s...

May 3, 2026•18 min read

NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data

TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node — virtual nodes (vnodes) make rebalancing smooth. DynamoDB mana...

May 3, 2026•22 min read