Change Feed vs Change Stream: CDC Internals, Reliability, and When to Avoid Each
How DynamoDB Streams, Cosmos DB, CockroachDB, MongoDB, and Debezium capture changes — delivery semantics, ordering, and when not to use CDC.
In the summer of 2023, the platform team at a fast-growing e-commerce company was handling 100,000 orders per day across three microservices: Order Service, Inventory Service, and Billing Service. All three needed to react to the same database mutations — when an order was placed, inventory had to be reserved; when it was confirmed, billing had to be triggered; when it was cancelled, inventory had to be released. The solution they shipped was polling: each downstream service ran a background thread that executed SELECT * FROM orders WHERE updated_at > :last_check every five seconds. For six weeks it worked. Then Black Friday arrived.
Under peak load, the polling interval fell behind. The Inventory Service polled at T=0 and found 500 new orders. It was still processing them at T=5. The next poll ran and found 480 of those same orders again because the updated_at timestamp round-trip hadn't settled. Orders were double-reserved. At the same time, three hard deletes executed by a cleanup job went completely unnoticed — updated_at doesn't exist on a deleted row. Inventory counts went negative for two SKUs. The Billing Service, polling from a read replica, was 45 seconds behind due to replication lag and missed 120 order confirmations entirely.
The root cause was not a bug in application logic. It was an architectural decision: using polling instead of a proper change capture mechanism. Change feeds and change streams — the native CDC mechanisms built into CockroachDB, DynamoDB, Cosmos DB, MongoDB, and open-source databases via Debezium — exist precisely to solve this class of failure.
TLDR: Polling is a broken CDC substitute — it misses deletes, double-processes on retry, and adds query load to your primary database. Change feeds (DynamoDB Streams, Cosmos DB, CockroachDB) and change streams (MongoDB) are CDC mechanisms built directly into the database, providing ordered, at-least-once delivery of every mutation event. They differ in delivery model (pull vs push), resume strategy, and ordering guarantees. CockroachDB is the only platform offering near-total-order via resolved timestamps. Debezium on PostgreSQL and MySQL fills the gap for open-source infrastructure. Every platform except CockroachDB with its Kafka sink requires consumer-side deduplication. Know the failure modes before you ship.
📖 Why Polling Always Fails: The Three Structural Gaps
The polling pattern feels reasonable at design time — run a query, process new rows, advance a watermark. But it has three structural gaps that cannot be closed with clever query tuning.
Gap 1 — Hard deletes are invisible. A SELECT WHERE updated_at > X cannot return a row that no longer exists. Soft-delete workarounds (adding is_deleted = true to every table) require schema changes, add indefinite row retention, and still miss deletions that bypass the application layer — direct SQL, migration scripts, bulk cleanup jobs, and stored procedures all execute deletes without setting any application flag.
Gap 2 — Clock skew causes double-processing. In a distributed system, updated_at is set by the application clock, not the database transaction clock. Two writes 1ms apart on different application servers can produce identical updated_at values or out-of-order values. The safe response is to overlap the polling window — but overlapping windows process the same rows multiple times. The e-commerce team's double-reservation was caused by exactly this: an overlapping window under load.
Gap 3 — Polling amplifies database load. With three downstream consumers, three identical queries run against the primary database every five seconds — eighteen queries per minute, all scanning the same indexed column. Under load, each query competes with production write traffic, drives up I/O, and triggers plan changes that slow every other query. Adding a fourth consumer makes it linearly worse.
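Gap 1 is easy to reproduce. The sketch below uses an in-memory SQLite table as a stand-in for the orders database (table shape and values are illustrative, not from the incident described above) and shows that a hard delete simply never appears in an updated_at poll.

```python
import sqlite3

# Sketch of Gap 1: a hard delete never appears in an updated_at poll.
# Table names and values are illustrative stand-ins.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at REAL)")
db.execute("INSERT INTO orders VALUES (1, 'placed', 100.0), (2, 'placed', 101.0)")
db.commit()

def poll(conn, since):
    """The polling query from the article: rows changed after `since`."""
    return conn.execute(
        "SELECT id, status FROM orders WHERE updated_at > ? ORDER BY id", (since,)
    ).fetchall()

watermark = 99.0
first = poll(db, watermark)   # both rows visible
watermark = 101.0             # advance the watermark past everything seen

# A cleanup job hard-deletes order 2 -- no updated_at is ever written for it.
db.execute("DELETE FROM orders WHERE id = 2")
db.commit()

second = poll(db, watermark)  # the delete is structurally invisible
print(first)   # [(1, 'placed'), (2, 'placed')]
print(second)  # [] -- order 2 silently vanished; no event was ever observed
```

No query tuning fixes this: the row the consumer needs to observe no longer exists to be selected.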
The two common alternatives also fall short. A dual write (write to the database then publish to Kafka) fails silently when the Kafka publish succeeds but the database write fails, or vice versa — you get phantom events or lost events depending on which side fails. The outbox pattern solves the dual-write problem by co-locating the event in the same transaction as the business record, then relaying it asynchronously. It works, but it adds operational complexity: you need a relay process, outbox table growth management, and idempotency handling. CDC from the database log sidesteps all of this.
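To make the outbox trade-off concrete, here is a minimal sketch of the pattern using SQLite as a stand-in for the business database. Table names, the payload shape, and the relay function are all illustrative; a real relay would publish to a broker and handle failures between publish and mark.

```python
import sqlite3
import json

# Sketch of the outbox pattern: the event row commits atomically with the
# business row, so there is no dual-write gap. Names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
           " payload TEXT, relayed INTEGER DEFAULT 0)")

def place_order(conn, order_id):
    # Single transaction: either both rows commit or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "OrderPlaced", "order_id": order_id}),),
        )

def relay(conn):
    # The relay process: read unrelayed events, publish, then mark them.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE relayed = 0 ORDER BY id"
    ).fetchall()
    published = [json.loads(p) for _, p in rows]
    conn.executemany("UPDATE outbox SET relayed = 1 WHERE id = ?",
                     [(i,) for i, _ in rows])
    conn.commit()
    return published

place_order(db, 42)
events = relay(db)
print(events)     # [{'type': 'OrderPlaced', 'order_id': 42}]
print(relay(db))  # [] -- already relayed
```

The relay and the outbox table are exactly the operational overhead the paragraph above refers to — machinery that log-based CDC gets for free from the database's own durability log.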
The database's own write-ahead log, binlog, or oplog is written synchronously with every commit — before the commit is acknowledged to the writer. It is the authoritative, ordered, complete record of everything that changed, including hard deletes. CDC that reads from this log is non-intrusive (zero write-path overhead), complete (captures every commit, including schema-bypassing SQL), and ordered (events arrive in transaction-commit order within a partition or shard).
The following diagram contrasts the polling architecture — with its three failure modes — against a CDC architecture where all three consumers receive events from a single change log path.
```mermaid
graph TD
  subgraph polling[Polling Approach - Three Failure Modes]
    P1[Order Service] --> P2[orders table]
    P2 -.->|SELECT every 5s| P3[Inventory Consumer]
    P2 -.->|SELECT every 5s| P4[Billing Consumer]
    P3 --> P5[Misses hard deletes]
    P4 --> P6[Double-processes on clock skew]
    P2 -.->|SELECT every 5s| P7[3x read load on primary DB]
  end
  subgraph cdcflow[CDC Approach - Single Log Path]
    C1[Order Service] --> C2[orders table]
    C2 --> C3[Database WAL or oplog]
    C3 --> C4[CDC Connector]
    C4 --> C5[Kafka: order-changes]
    C5 --> C6[Inventory Consumer]
    C5 --> C7[Billing Consumer]
    C5 --> C8[Notification Consumer]
  end
```
In the polling subgraph, each consumer runs independent queries against the live database. In the CDC subgraph, every committed mutation flows through the database log once. The CDC connector reads it once. All consumers receive events from the Kafka topic, entirely decoupled from database query load.
⚙️ How the Database Log Becomes a Change Event: WAL, Binlog, and Oplog Internals
Every production database engine maintains a sequential, append-only durability log. This log is what makes crash recovery possible: before any transaction is acknowledged as committed, every change must be written to the log first. If the database crashes immediately after acknowledging a commit, the log can replay the change on restart. This same log is what CDC reads.
The log goes by different names depending on the engine:
- PostgreSQL: Write-Ahead Log (WAL), read via logical decoding using the pgoutput plugin
- MySQL / MariaDB: Binary Log (binlog), requires binlog_format = ROW for full before/after images
- MongoDB: operations log (oplog), a capped collection on the primary that secondaries tail to replicate
- DynamoDB: DynamoDB Streams, a proprietary item-level change log per table partition
- CockroachDB: MVCC (multi-version concurrency control) transaction log, read natively by the changefeed subsystem
- Cosmos DB: Change Feed, a logical append-only log of all writes to a container, partitioned by physical partition
Each log entry for a change event contains the same fundamental payload, regardless of the database: the operation type (INSERT, UPDATE, DELETE), the before-image of the row or document (the state prior to the change), the after-image (the state after the change), a monotonically increasing position marker (LSN in PostgreSQL, binlog file and offset in MySQL, oplog timestamp in MongoDB), and a wall-clock timestamp.
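The universal payload described above can be sketched as a small data structure. The field names here are generic, not any particular platform's wire format; the point is that the before/after pair lets a consumer compute a diff locally without re-querying the database.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the payload every CDC log entry carries, per the list above.
# Field names are generic, not any specific platform's wire format.
@dataclass(frozen=True)
class ChangeEvent:
    op: str                 # 'INSERT' | 'UPDATE' | 'DELETE'
    before: Optional[dict]  # None for INSERT
    after: Optional[dict]   # None for DELETE
    position: int           # LSN / binlog offset / oplog ts -- monotonic
    ts_ms: int              # wall-clock capture time

ev = ChangeEvent(
    op="UPDATE",
    before={"id": 1, "status": "placed"},
    after={"id": 1, "status": "confirmed"},
    position=4182,
    ts_ms=1700000000000,
)

# The before/after pair is what lets a consumer compute the diff locally.
diff = {k: (ev.before[k], v) for k, v in ev.after.items() if ev.before.get(k) != v}
print(diff)  # {'status': ('placed', 'confirmed')}
```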
CDC reads this log from a stored position and emits a structured event for every committed change. The log is retained only for a finite window — DynamoDB Streams retains 24 hours, MongoDB's capped oplog retains based on configured collection size (often hours to days on a busy database), Cosmos DB retains 7 days by default, CockroachDB retains based on gc.ttlseconds (configurable, with 1 hour as a common default, extendable to days for changefeed consumers), and PostgreSQL retains WAL until the consumer's replication slot advances past it. This retention window is the single most important operational constraint in any CDC deployment.
The following diagram shows the universal four-step path from a committed application write to a structured change event delivered to a downstream consumer.
```mermaid
graph TD
  A[Application commits write] --> B[Database Engine]
  B --> C[Durability Log - WAL or binlog or oplog]
  C --> D[CDC Reader]
  D --> E[Change Event: op + before + after + LSN]
  E --> F[Kafka Topic or Stream API]
  F --> G[Consumer: Inventory]
  F --> H[Consumer: Billing]
  F --> I[Consumer: Search Index]
```
Notice that the application write path is unchanged — the Order Service writes to the database exactly as it did before. The CDC reader sits entirely outside the write path, reading the log asynchronously. Adding or removing CDC consumers has zero impact on application write latency.
🔍 Change Feed vs Change Stream — Two API Shapes for the Same Underlying Log
The terms "change feed" and "change stream" are often used interchangeably, but they describe meaningfully different API shapes — and choosing the wrong shape for your consumer architecture leads to operational problems.
A change feed is a pull-based or cursor-based API. The consumer requests a batch of changes starting from a given position (a sequence number, cursor, or timestamp). The consumer is responsible for tracking and storing its own position. When the consumer restarts, it presents its last position and receives events from that point. CockroachDB, Azure Cosmos DB, and DynamoDB Streams all implement this model. The database engine exposes a "give me changes since position X" interface, and the consumer drives the pace of consumption. This is well-suited to batch-oriented consumers that process events in chunks, to consumers that need to slow down under backpressure, and to consumers that need fine-grained control over exactly which events they re-process after a failure.
A change stream is a push-based subscription model. The consumer opens a subscription — or registers a callback or watch cursor — and the platform delivers events to it as they arrive. MongoDB Change Streams work this way: you call watch() on a collection, database, or deployment, and MongoDB pushes events to the open cursor. Each event includes a resume token (the _id field of the event document) that the consumer must persist. If the connection drops or the process restarts, the consumer re-opens watch() with resumeAfter: <last_resume_token> and MongoDB replays events from that position in the oplog.
The practical difference is in who drives the pace and who manages position tracking. Pull models give consumers explicit control over batching and backpressure. Push models are simpler to implement for real-time event handlers but require the consumer to handle connection drops and resume token management explicitly.
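The two shapes can be contrasted in a few lines of code. This is a schematic over an in-memory list standing in for the change log — all names are illustrative, not any platform's actual API — but it shows exactly who owns the position and who drives the pace in each model.

```python
# Minimal sketch of the two API shapes over the same in-memory "log".
# All names here are illustrative, not a real platform API.
log = [{"pos": i, "doc": f"change-{i}"} for i in range(5)]

# Change feed (pull): the consumer owns the position and drives the pace.
def pull_batch(since_pos, limit=2):
    batch = [e for e in log if e["pos"] > since_pos][:limit]
    return batch, (batch[-1]["pos"] if batch else since_pos)

pos, pulled = -1, []
while True:
    batch, pos = pull_batch(pos)
    if not batch:
        break
    pulled.extend(batch)          # consumer checkpoints `pos` durably here

# Change stream (push): the platform drives; consumer persists a resume token.
received, resume_token = [], None

def on_event(event):
    global resume_token
    received.append(event)
    resume_token = event["pos"]   # must be persisted for resumeAfter

for e in log:                     # the platform pushing to the subscription
    on_event(e)

print(len(pulled), len(received), resume_token)  # 5 5 4
```

Both consumers see the same five events; the difference is that the pull consumer decided the batch size and when to advance, while the push consumer's only control point is how diligently it persists the resume token.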
| | Change Feed | Change Stream |
| --- | --- | --- |
| Delivery model | Pull or cursor-based | Push or subscription |
| Resume mechanism | Sequence number or cursor | Resume token or offset |
| Ordering guarantee | Per-shard or per-partition | Per-document (MongoDB) |
| Consumer controls pace | Yes | Depends on platform |
| Schema of event | Database-native format | Driver-level abstraction |
| Primary implementors | CockroachDB, Cosmos DB, DynamoDB Streams | MongoDB, some Kafka connectors |
Debezium sits outside this taxonomy as an external log reader. It reads the PostgreSQL WAL or MySQL binlog and publishes structured events to Kafka. It is neither a native feed nor a native stream — it is a separate connector process that acts as a CDC reader and translator, converting database-specific log formats into the Kafka Connect envelope format.
📊 Source of Truth, Replayability, and Backpressure at a Glance
The API label matters less than the recovery contract underneath it. In production, the useful questions are: what is the real source of truth, how long can I stay down before replay is impossible, who absorbs backpressure, and what happens when duplicates show up after a failure. That is where change feeds and change streams stop being naming variations and start becoming different operational models.
| System | Shape | Real source of truth | Replay window | Ordering scope | Backpressure behavior | Where it breaks down first |
| --- | --- | --- | --- | --- | --- | --- |
| DynamoDB Streams | Native change feed | Managed per-partition stream behind the table write path | 24 hours | Per shard | Consumer can slow reads, but Lambda retries can build shard lag fast | Consumer outage beyond 24h requires table re-scan |
| Cosmos DB Change Feed | Native change feed | Logical append-only feed per physical partition | 7 days by default | Per physical partition | Feed processor can fall behind while leases preserve checkpoints | Latest Version mode hides hard deletes and may collapse intermediate updates |
| CockroachDB changefeeds | Native change feed | MVCC history plus changefeed checkpoints | Up to gc.ttlseconds | Near-global processing barrier via resolved timestamps | Sink lag propagates to checkpoint lag; retention must be sized for downtime | Falling behind GC TTL forces re-snapshot |
| MongoDB Change Streams | Native change stream | Replica set oplog plus resume token | Until oplog rolls over | Per shard / per document | Open cursor keeps pushing events; slow consumers risk resume token aging out | Oplog rollover invalidates resume token and forces snapshot |
| Debezium on PostgreSQL | External CDC reader | WAL retained by replication slot | As long as WAL disk can be retained | Per table in LSN order | Kafka or connector lag grows; PostgreSQL keeps WAL on disk | Slot lag can fill the database disk |
| Debezium on MySQL | External CDC reader | Row-based binlog | Until binlog retention expires | Per table in binlog order | Connector lag is usually absorbed by Kafka and binlog retention | Statement-mode or minimal row images make events incomplete |
Two patterns emerge from this table. Change feeds usually give the consumer more explicit pull control and checkpoint ownership, which makes backpressure easier to manage deliberately. Change streams feel simpler for near-real-time subscribers, but that convenience shifts risk into retention sizing and resume-token hygiene.
There is also no universal "exactly-once" magic here. In almost every real deployment, the durable truth is the database log plus a replay checkpoint, and the delivery contract is at-least-once. The safe default is still idempotent consumers, even on platforms with stronger sink guarantees.
🧠 Platform Deep-Dives: Mechanism, Event Schema, Delivery, Resume, and Key Failure Modes
CDC Internals: How Each Database Exposes Its Native Log
Before examining each platform individually, it helps to understand the three architectural layers that every CDC implementation traverses. The log layer is the database's own durability mechanism — the WAL, binlog, oplog, or proprietary stream. The capture layer reads the log and converts database-native log entries into structured change events (Debezium's envelope format, DynamoDB's stream records, MongoDB's change event documents). The delivery layer routes events to consumers with position tracking, fan-out, and resume semantics.
The key internals difference across platforms is at the capture layer. PostgreSQL and MySQL use logical decoding — the database engine parses its own WAL/binlog and emits a structured row-level change stream that the capture layer (Debezium) reads via a replication protocol. MongoDB uses oplog tailing — the change stream API wraps the replica set oplog, which is itself a MongoDB collection. DynamoDB and Cosmos DB use proprietary managed streams — the cloud provider handles log exposure entirely; you interact only through their stream APIs. CockroachDB's MVCC log is read by the changefeed subsystem natively, without an external connector.
Each of these differences in the capture layer determines the event schema, the resume token format, and the failure modes that operators must manage.
DynamoDB Streams
DynamoDB Streams captures item-level changes in DynamoDB tables. When you enable Streams on a table, DynamoDB begins writing change records to a stream that maps 1:1 with the table's underlying partitions — each DynamoDB partition gets a corresponding stream shard. The shards are ordered per-shard; there is no cross-shard ordering guarantee.
You configure one of four stream view types when enabling streams on a table:
- KEYS_ONLY: only the primary key attributes of the modified item — the smallest payload, sufficient for triggering a downstream lookup
- NEW_IMAGE: the entire item state after the write — for consumers that only care about the current state
- OLD_IMAGE: the entire item state before the write — useful for audit trails and reverse reconciliation
- NEW_AND_OLD_IMAGES: both the before and after image in every event — required for reliable downstream processing that needs to compute the diff
Retention: 24 hours, non-configurable. This is the single biggest operational constraint for DynamoDB Streams. If your consumer goes offline for more than 24 hours — a long weekend deployment freeze, a database migration, an extended incident — the stream records expire and cannot be recovered. You must re-scan the table to rebuild state.
Delivery: at-least-once. Every record has a SequenceNumber that you must use for consumer-side deduplication. The SequenceNumber is monotonically increasing within a shard and is the correct resume key after a consumer restart.
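A minimal sketch of that deduplication: track the highest SequenceNumber applied per shard and drop anything at or below it. The record shapes here are simplified stand-ins for real DynamoDB stream records, and in production the processed_seq map must be stored durably (e.g. alongside the consumer's output), not held in memory.

```python
# Sketch of consumer-side deduplication keyed by SequenceNumber, as
# at-least-once delivery requires. Record shapes are simplified stand-ins.
processed_seq = {}   # shard_id -> highest SequenceNumber already applied

def handle(shard_id, record, apply):
    seq = int(record["SequenceNumber"])
    if seq <= processed_seq.get(shard_id, -1):
        return False                 # duplicate redelivery -- skip
    apply(record)
    processed_seq[shard_id] = seq    # store durably in real code
    return True

applied = []
records = [
    {"SequenceNumber": "100", "op": "INSERT"},
    {"SequenceNumber": "101", "op": "UPDATE"},
    {"SequenceNumber": "101", "op": "UPDATE"},  # redelivered after a retry
]
for r in records:
    handle("shard-0001", r, applied.append)

print(len(applied))  # 2 -- the redelivered record was dropped
```

Because SequenceNumber is monotonic only within a shard, the map is keyed per shard — comparing sequence numbers across shards is meaningless.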
Integration patterns: For serverless architectures, enable a Lambda event source mapping directly from the DynamoDB Stream — Lambda polls the stream on your behalf and invokes your function with batches of up to 10,000 records. For higher-throughput consumers needing explicit position management, use the Kinesis Client Library (KCL) via the DynamoDB Streams Kinesis Adapter.
Failure mode — shard iterator expiry: A shard iterator obtained via GetShardIterator is valid for only 15 minutes of idle time. If your consumer reads from a shard iterator but then pauses (slow downstream processing, a rate limit backoff, a network stall), the iterator expires with an ExpiredIteratorException. You must request a new iterator using AT_SEQUENCE_NUMBER with the last successfully processed SequenceNumber. If that value was not durably stored, you may lose your position.
Failure mode — shard splits during scaling: When DynamoDB auto-scales a table by splitting a partition, the corresponding stream shard closes and two child shards open in its place. A naive consumer that does not enumerate and process child shards will process records from the parent shard but miss the children entirely. Lambda and KCL handle this automatically; a hand-rolled consumer must explicitly handle shard discovery.
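The shard-discovery obligation amounts to a topological walk of the shard lineage: a parent must be drained before its children are read, or per-key ordering breaks. The sketch below shows the ordering logic over simplified shard descriptions; in a real consumer these would come from the DescribeStream response, and closed shards would be detected by a null next iterator.

```python
# Sketch of the shard-discovery a hand-rolled consumer must do after a
# shard split: process a parent shard fully before its children.
# Shard descriptions are simplified stand-ins for DescribeStream output.
shards = [
    {"ShardId": "shard-A", "ParentShardId": None},
    {"ShardId": "shard-B", "ParentShardId": "shard-A"},  # child after split
    {"ShardId": "shard-C", "ParentShardId": "shard-A"},  # child after split
]

def processing_order(shard_list):
    """Topological order: parents strictly before their children."""
    by_parent, order = {}, []
    for s in shard_list:
        by_parent.setdefault(s["ParentShardId"], []).append(s["ShardId"])
    frontier = list(by_parent.get(None, []))
    while frontier:
        sid = frontier.pop(0)
        order.append(sid)
        frontier.extend(by_parent.get(sid, []))
    return order

print(processing_order(shards))  # ['shard-A', 'shard-B', 'shard-C']
```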
Azure Cosmos DB Change Feed
Cosmos DB Change Feed exposes an ordered, append-only log of all writes to a Cosmos DB container, partitioned by physical partition range. It captures every INSERT and every REPLACE (full document overwrites). Hard deletes are not captured by default — the change feed never emits a delete event for a document that was removed. If your downstream system requires delete propagation (removing items from a search index, propagating GDPR erasure requests, maintaining a synchronized read model), you must implement soft deletes: write a flag field ("isDeleted": true) and optionally set a TTL so Cosmos DB eventually removes the document. The change feed emits the update that sets isDeleted = true, and your consumer propagates the delete.
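The consumer side of the soft-delete workaround is a small routing decision: since the feed only ever delivers upserts, an update carrying the deletion flag must be translated into a downstream delete. A minimal sketch, with illustrative document shapes and an in-memory dict standing in for the read model:

```python
# Sketch of the soft-delete workaround: the change feed only ever shows
# upserts, so the consumer turns isDeleted updates into read-model deletes.
# Document shapes and the read_model dict are illustrative.
read_model = {}

def on_change_feed_event(doc):
    if doc.get("isDeleted"):
        read_model.pop(doc["id"], None)   # propagate the delete downstream
    else:
        read_model[doc["id"]] = doc       # normal upsert

on_change_feed_event({"id": "o-1", "status": "placed"})
on_change_feed_event({"id": "o-2", "status": "placed"})
# The application "deletes" o-2 by flagging it (optionally with a TTL):
on_change_feed_event({"id": "o-2", "isDeleted": True})

print(sorted(read_model))  # ['o-1']
```

Note the ordering dependency this creates: if the TTL fires and Cosmos DB removes the flagged document before the consumer has caught up, the consumer still sees the isDeleted update in the feed, but nothing after it — which is why the flag write, not the TTL expiry, must carry the semantic delete.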
Retention: 7 days default (configurable via the container's analytical store or extended retention settings). The change feed log is backed by the container's own storage, so increasing retention increases storage cost.
Delivery: at-least-once. The change feed processor (available in the .NET, Java, Python, and Node.js SDKs) uses a lease container pattern for distributed consumers. A separate leases container (a standard Cosmos DB container you create) tracks which physical partition ranges have been assigned to which consumer instance and stores the continuation token (the position in the feed) for each range. If you run three consumer instances, the processor distributes partition ranges across them and load-balances automatically. If one instance fails, its leases expire and are acquired by remaining instances, which resume from the stored continuation token — no data loss.
Latest version vs all-versions mode: The default mode (Latest Version) delivers the latest state of a document after each write. If a document is updated five times in rapid succession, you may receive only one or two events reflecting later states — intermediate states can be coalesced. The "All Versions and Deletes" mode (a preview feature as of 2024) delivers every individual change including intermediate states and hard deletes. Choose Latest Version for cache invalidation and read-model sync where only the final state matters; use All Versions for audit logging or compliance pipelines where every intermediate change must be captured.
Failure mode — missed deletes: The most common Cosmos DB CDC surprise in production. Any team migrating from a trigger-based or dual-write solution that previously captured deletes will find those events missing from the Change Feed. Audit log pipelines and GDPR erasure propagation pipelines both require careful redesign around the soft-delete workaround.
CockroachDB Changefeeds
CockroachDB changefeeds are defined with a single DDL statement, making them a first-class database object rather than an external connector process. The changefeed reads from CockroachDB's MVCC transaction log and emits structured change events to a configured sink.
The most important feature unique to CockroachDB is resolved timestamps: periodic heartbeat signals embedded in the event stream that state "there are no more changes before this timestamp." When a consumer receives a resolved timestamp T, it knows it has received all committed changes with a commit timestamp less than T. This means the consumer can safely sort, aggregate, or process all buffered events with ts < T without risk of a late-arriving event invalidating that work. No other platform in this comparison offers this guarantee.
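The consumer-side payoff looks like this: buffer row events as they arrive (possibly out of order across ranges), and flush — in timestamp order — only what a resolved message has sealed. The message shapes below are simplified stand-ins for the JSON a Kafka sink would deliver for a changefeed created with a resolved-timestamp option.

```python
import heapq

# Sketch of consuming a changefeed with resolved timestamps: buffer row
# events, then flush in timestamp order only what each resolved message
# seals. Message shapes are simplified stand-ins for the sink's JSON.
buffer, flushed = [], []

def on_message(msg):
    if "resolved" in msg:
        # Everything with ts < resolved is final -- safe to sort and emit.
        while buffer and buffer[0][0] < msg["resolved"]:
            flushed.append(heapq.heappop(buffer)[1])
    else:
        heapq.heappush(buffer, (msg["ts"], msg["key"]))

# Out-of-order arrivals across ranges, then a resolved heartbeat:
for m in [{"ts": 5, "key": "b"}, {"ts": 3, "key": "a"},
          {"ts": 9, "key": "c"}, {"resolved": 8}]:
    on_message(m)

print(flushed)      # ['a', 'b'] -- emitted in timestamp order
print(len(buffer))  # 1 -- ts=9 waits for a later resolved timestamp
```

This is the guarantee no other platform in the comparison provides: the consumer can perform order-sensitive work (sorted merges, windowed aggregates) on the flushed prefix with certainty that no straggler will arrive behind it.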
Exactly-once delivery: CockroachDB is the only platform that supports exactly-once delivery out of the box, available when using a Kafka sink with the WITH exactly_once option. CockroachDB uses a two-phase protocol to guarantee deduplication at the sink level. Exactly-once has a throughput cost (typically 20–30% lower maximum throughput compared to at-least-once), but it eliminates consumer-side deduplication logic entirely for the most latency-sensitive pipelines.
Schema change handling: By default, a CockroachDB changefeed pauses when a schema change is detected on a watched table — a column was added, dropped, or renamed. This prevents malformed or schema-mismatched events from reaching consumers silently. You can configure schema_change_policy = 'backfill' to have the changefeed automatically re-scan the affected table after the schema change and resume streaming, at the cost of a temporary data backlog that consumers must process.
Sinks: Kafka is the most common production sink and the only sink that supports exactly-once delivery. CockroachDB also supports Google Cloud Storage and Amazon S3 sinks (producing timestamped JSON or Avro batch files, suitable for warehouse ingestion) and webhook sinks for low-volume event notifications.
Failure mode — GC TTL window: CockroachDB's garbage collector reclaims MVCC versions older than gc.ttlseconds (default: 4 hours on most clusters, configurable per-table or per-zone). If a changefeed consumer falls offline for longer than the GC TTL window, the MVCC data it needs to resume from its last checkpoint may have been garbage-collected. The changefeed cannot resume and must be recreated, triggering a full re-snapshot. Set gc.ttlseconds well above your maximum expected consumer downtime for changefeed-watched tables.
MongoDB Change Streams
MongoDB Change Streams are built directly on the replica set oplog — the same sequential log that secondaries tail to replicate writes from the primary. Every committed write to a replica set produces an oplog entry. Change Streams expose a filtered, resumable view of this oplog through the MongoDB driver.
You open a Change Stream with a watch() call that accepts an optional aggregation pipeline filter. The scope of watch() can be a single collection, an entire database, or an entire deployment (all databases on all nodes in the replica set or sharded cluster). Each document in the Change Stream cursor is a change event, and critically, the _id field of every event is the resume token — the cursor position that enables resumption after a disconnect.
Resume tokens and oplog rollover: When your process restarts, you re-open watch() with resumeAfter: <last_resume_token>, and MongoDB replays events from that oplog position. The critical constraint is that the oplog is a capped collection — it has a fixed maximum size (typically a few gigabytes on a small replica set, up to hundreds of gigabytes on a high-traffic cluster). When the oplog reaches capacity, it wraps around and overwrites the oldest entries. If your consumer has been offline long enough that the oplog has rolled past the resume token's position, watch(resumeAfter: token) throws ChangeStreamHistoryLost. You must perform a full initial snapshot of the collection and restart the stream from scratch.
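The rollover failure mode can be modeled with a capped buffer. In the sketch below, a fixed-size deque stands in for the capped oplog, integer positions stand in for resume tokens, and ChangeStreamHistoryLost is a hand-rolled stand-in for the real server error — the point is the invariant: a token older than the oldest surviving entry is unrecoverable.

```python
from collections import deque

# Sketch of why oplog rollover invalidates a resume token: the oplog is a
# capped buffer, and resuming from an overwritten position must fail.
# ChangeStreamHistoryLost is a stand-in for the real server error.
class ChangeStreamHistoryLost(Exception):
    pass

oplog = deque(maxlen=5)   # capped collection: oldest entries fall off

def resume_after(token):
    positions = [e["pos"] for e in oplog]
    if token not in positions and token < min(positions, default=token + 1):
        raise ChangeStreamHistoryLost(f"token {token} rolled out of the oplog")
    return [e for e in oplog if e["pos"] > token]

for pos in range(10):     # 10 writes into a 5-entry oplog: 0-4 are gone
    oplog.append({"pos": pos})

events = resume_after(7)  # token still inside the window: ok
print([e["pos"] for e in events])  # [8, 9]

try:
    resume_after(2)       # token overwritten by rollover
except ChangeStreamHistoryLost as err:
    print(type(err).__name__)      # ChangeStreamHistoryLost
```

The operational corollary: oplog size must be provisioned against worst-case consumer downtime, exactly as retention windows must be on the other platforms.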
fullDocument option and the staleness pitfall: By default, update events in a Change Stream include only the modified fields, not the full document. The fullDocument: 'updateLookup' option makes MongoDB fetch the current state of the document and include it in the event. The word "current" is the danger: this is a separate read that happens after the event is emitted. If the document was updated again between when the change event was generated and when the lookup was performed, you receive the later state, not the state at the time of the original change. This is not an edge case — on a busy collection under load, it is the common case.
If you need the exact document state at the time of the change, use MongoDB 6.0+ pre-images and post-images: set changeStreamPreAndPostImages: { enabled: true } on the collection, and use fullDocumentBeforeChange: 'whenAvailable' and fullDocument: 'whenAvailable' in your watch() options. MongoDB stores a snapshot of the document before and after each change in a system collection and returns both images with every event. This is the correct implementation for audit logging, delete propagation, and any use case that requires point-in-time accuracy.
Ordering guarantee: Per-document, not cross-document. Two concurrent updates to different documents may arrive in any order relative to each other. This is sufficient for most event-driven microservice and cache-invalidation use cases. It is not sufficient for use cases that require strict temporal ordering across all documents in a collection.
Failure mode — sharded clusters: On a sharded MongoDB cluster, watch() at the database or deployment scope aggregates change events from all shards. MongoDB guarantees that events from a single shard are delivered in oplog order, but events from different shards are interleaved and can arrive out of wall-clock order. Design multi-shard consumers to tolerate out-of-order delivery across shard boundaries.
Debezium on PostgreSQL and MySQL
Debezium is an open-source CDC platform that reads the PostgreSQL WAL via logical replication or the MySQL binlog and publishes structured events to Kafka, running as a Kafka Connect source connector. It is the most widely deployed CDC mechanism for open-source database infrastructure.
PostgreSQL setup: Debezium creates a logical replication slot on the PostgreSQL primary — a named cursor into the WAL that PostgreSQL must retain until the consumer has read past it. Debezium uses the pgoutput plugin (built into PostgreSQL 10+, no extension installation required) to decode WAL entries into structured change records. You also define a publication — a named logical object that specifies which tables will be replicated. Debezium reads from the publication through the replication slot.
MySQL setup: Debezium reads the MySQL binlog. MySQL must be configured with binlog_format = ROW (row-based binlog captures full before/after images) and binlog_row_image = FULL (all columns captured, not just changed columns). Debezium registers itself as a MySQL replica with a unique database.server.id and requests the binlog stream from the specified position.
Event envelope format: Every Debezium event follows the Kafka Connect envelope schema with five top-level fields: before (row state before the change, null for inserts), after (row state after the change, null for deletes), source (metadata: connector name, database, schema, table, LSN, timestamp, transaction ID), op (operation: c for create, u for update, d for delete, r for read during snapshot), and ts_ms (milliseconds since epoch at the time the change was captured from the log).
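A consumer typically routes on the op field of this envelope. The example envelope below is hand-written in the documented shape (not captured from a live connector), and the routing function is a minimal sketch of the usual upsert/delete dispatch:

```python
# Sketch of routing a Debezium envelope by its `op` field. The envelope
# below is a hand-written example in the documented shape, not captured
# from a live connector.
def route(envelope):
    op = envelope["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, or update
        return ("upsert", envelope["after"])
    if op == "d":                      # hard delete: only `before` is set
        return ("delete", envelope["before"])
    raise ValueError(f"unknown op: {op}")

event = {
    "before": {"id": 7, "status": "placed"},
    "after": None,
    "source": {"connector": "postgresql", "table": "orders", "lsn": 94021},
    "op": "d",
    "ts_ms": 1700000000000,
}
print(route(event))  # ('delete', {'id': 7, 'status': 'placed'})
```

Note that delete events are the case polling could never see — here the full before-image arrives in the event itself, with no flag fields or soft-delete schema changes required.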
Snapshot modes: On first startup, Debezium bootstraps by performing an initial snapshot of all watched tables before beginning incremental log streaming. Configure snapshot.mode to control this behavior: initial (default — full table scan then WAL streaming), schema_only (schema captured, no data rows, only new changes emitted from startup), or never (no snapshot, streaming starts from the current log position, risks missing all historical rows and all changes that occurred before the connector started).
Failure mode — replication slot lag and WAL disk exhaustion: This is the most severe operational risk in any Debezium/PostgreSQL deployment. PostgreSQL cannot reclaim WAL disk space that is ahead of a replication slot's current restart_lsn. If Debezium falls behind — a slow consumer, a Kafka Connect cluster restart, a prolonged network partition — unprocessed WAL accumulates on disk indefinitely. A connector offline for 36 hours on a database generating 2 GB/hour of WAL wakes up to 72 GB of retained WAL that PostgreSQL cannot clean up. If this fills the disk, PostgreSQL crashes. Monitor pg_replication_slots.restart_lsn lag daily. Configure max_slot_wal_keep_size = '15GB' (PostgreSQL 13+) as a safety cap — when the replication slot falls more than 15 GB behind, PostgreSQL automatically drops the slot, which is destructive but prevents a database outage. Combine with alerting so you catch lag before the safety cap triggers.
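The daily lag check reduces to one catalog query plus a threshold. LAG_QUERY below is standard PostgreSQL catalog SQL (pg_wal_lsn_diff and pg_replication_slots exist in PostgreSQL 10+); the 10 GB threshold and the sample rows are illustrative — in production the rows would come from executing the query through a driver, and the threshold should sit comfortably below your max_slot_wal_keep_size cap.

```python
# Sketch of a replication-slot lag check. The threshold and sample rows
# are illustrative; real rows come from running LAG_QUERY via a driver.
LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
"""

ALERT_THRESHOLD_BYTES = 10 * 1024**3   # alert at 10 GB, under a 15 GB cap

def slots_over_threshold(rows, threshold=ALERT_THRESHOLD_BYTES):
    """Return slot names whose retained-WAL lag exceeds the threshold."""
    return [name for name, lag in rows if lag > threshold]

sample = [("debezium_orders", 12 * 1024**3), ("debezium_audit", 2 * 1024**3)]
print(slots_over_threshold(sample))  # ['debezium_orders']
```

Alerting at this threshold gives operators time to fix the connector before the max_slot_wal_keep_size safety cap destructively drops the slot.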
Performance Analysis: Throughput Ceilings, Consumer Lag, and End-to-End Latency
Understanding the performance envelope of each platform prevents capacity planning surprises in production.
DynamoDB Streams shards deliver a maximum of 2 MB/second of read throughput per shard (shared across all consumers of that shard). A single Lambda consumer reading at maximum batch size (10,000 records) can process bursts well, but a slow Lambda function that holds the iterator open reduces effective throughput for all other stream readers on that shard. End-to-end latency from write to Lambda invocation is typically 200ms–1s under normal load.
Cosmos DB Change Feed latency from write to event delivery is typically under 1 second. Throughput scales with the number of physical partitions — each partition delivers changes independently to its assigned consumer instance. The lease container pattern adds ~10ms of lease-check overhead per polling cycle but enables horizontal scaling: double the consumer instances to double the partition-level processing throughput.
CockroachDB Changefeeds can sustain hundreds of thousands of events per second to a Kafka sink. The primary performance cost is the resolved timestamp protocol, which adds 10–50ms of latency to ensure no straggler events arrive after the resolved timestamp is emitted. Exactly-once delivery adds another 20–30% throughput overhead over at-least-once due to the two-phase write protocol.
MongoDB Change Streams deliver events with sub-second latency on replica sets under normal write load. On sharded clusters with large document volumes, the aggregation pipeline $match filter is evaluated on the primary before the event reaches the consumer — pushing selectivity into the pipeline reduces network transfer and deserialization overhead significantly.
Debezium on PostgreSQL can sustain roughly 10,000–50,000 row-level events per second per connector on typical hardware, limited primarily by Kafka producer throughput. Initial snapshot performance depends on the table size and the snapshot.fetch.size setting — large tables benefit from increasing the default fetch size to 10,000 or higher. Replication slot lag is the primary throughput indicator: if lag is consistently growing, the connector is not keeping up with the database write rate.
📊 How Backpressure Shows Up in Each CDC Model
Backpressure is what happens when the producer of changes can outrun the consumer's ability to process them. In CDC systems, the database usually keeps accepting writes even while downstream consumers slow down, so the pressure does not disappear — it moves into retention windows, growing lag, and larger replay gaps. That is why "can this system buffer a bad hour?" is often a more important question than "what is the median event latency?"
- DynamoDB Streams pushes the problem into shard lag. Lambda retries, slow handlers, or downstream rate limits can hold up a shard until records age out of the 24-hour window.
- Cosmos DB Change Feed handles backpressure more gracefully because the lease container preserves continuation tokens, but a badly under-provisioned consumer fleet still accumulates a larger and larger partition backlog.
- MongoDB Change Streams keep an open cursor, so slow consumers mostly pay in oplog age. If the consumer falls far enough behind, the resume token becomes useless when the oplog rolls.
- Debezium shifts backpressure into Kafka queues and source-log retention. Kafka can absorb spikes, but PostgreSQL replication slots and MySQL binlog retention still define the outer safety boundary.
- CockroachDB changefeeds checkpoint only as fast as the sink accepts data. A blocked Kafka sink means the resolved timestamp stalls, which is a clear signal that consumers are no longer keeping up.
The practical lesson is simple: backpressure is never "handled" unless you know where it lands. For DynamoDB and MongoDB it usually lands in a short retention window. For Debezium it often lands on disk. For CockroachDB it lands in checkpoint lag. Those are very different failure modes even though all of them start with the same symptom: a slow consumer.
📊 Delivery Semantics: What Each Platform Actually Guarantees in Production
| Platform | Delivery Guarantee | Ordering Scope | Deduplication Required | Resume Mechanism |
| --- | --- | --- | --- | --- |
| DynamoDB Streams | At-least-once | Per shard | Yes | Shard SequenceNumber |
| Cosmos DB Change Feed | At-least-once | Per physical partition | Yes | Continuation token in lease container |
| CockroachDB Changefeed | At-least-once or exactly-once | Global via resolved timestamps | Optional with exactly-once | Resolved timestamp + checkpoint |
| MongoDB Change Streams | At-least-once | Per document | Yes | Resume token |
| Debezium on PostgreSQL | At-least-once | Per table in LSN order | Yes | Replication slot LSN |
| Debezium on MySQL | At-least-once | Per table in binlog order | Yes | Binlog file and position offset |
The only platform offering exactly-once delivery is CockroachDB, and only when using a Kafka sink with WITH exactly_once. Every other platform delivers events at-least-once. This means designing consumers to be idempotent is not an optimization — it is the contract. Idempotent consumers use upsert semantics for database writes, key their deduplication checks on a composite of the source primary key and the event LSN or ts_ms, and treat re-processing as a normal operating mode, not an error condition.
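This contract can be sketched in a few lines of Python. The event envelope here (pk, lsn, after) is an assumed generic shape, not any platform's exact payload, and the in-memory dict stands in for a downstream table:

```python
class IdempotentConsumer:
    """Applies CDC events with upsert semantics and LSN-keyed deduplication."""

    def __init__(self):
        self.store = {}        # downstream table stand-in: pk -> row
        self.applied = set()   # dedup keys already processed: (pk, lsn)

    def handle(self, event):
        # The dedup key combines the source primary key with the event's
        # log position, so a replayed event is recognized after a restart.
        key = (event["pk"], event["lsn"])
        if key in self.applied:
            return False  # duplicate delivery: normal operation, not an error
        # Upsert, never insert: re-applying an event must be safe.
        self.store[event["pk"]] = event["after"]
        self.applied.add(key)
        return True

consumer = IdempotentConsumer()
event = {"pk": "order-101", "lsn": 42, "after": {"status": "confirmed"}}
consumer.handle(event)   # applied
consumer.handle(event)   # at-least-once redelivery: a no-op
```

In production the `applied` set lives in the same datastore as `store` and is written in the same transaction, so the dedup check and the upsert commit or fail together.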
⚖️ The Exactly-Once Myth: Most Distributed CDC Pipelines Still Replay and Deduplicate
"Exactly once" sounds like a property of the stream, but in distributed systems it is usually a property of a very specific end-to-end contract: source position tracking, sink commits, retry semantics, and duplicate suppression all have to line up at once. The moment one consumer writes to a database, calls an HTTP API, updates Redis, and emits a metric, exactly-once stops being a checkbox and becomes a chain of failure cases.
For that reason, the safer mental model is:
- The source log is durable
- Delivery is usually at-least-once
- Recovery depends on replay
- Correctness comes from idempotent side effects
CockroachDB with a Kafka sink can genuinely improve the sink side of this story, but even there, downstream business actions still need idempotency if they touch external systems. If an inventory consumer writes to PostgreSQL exactly once but calls a warehouse reservation API twice after a retry, the pipeline is not exactly once in any business sense that matters.
So when choosing between a change feed and a change stream, do not ask, "Which one gives me exactly once?" Ask, "Which one gives me the clearest replay boundary, the most observable lag signal, and the easiest path to idempotent recovery?" That is the version of reliability that survives real incidents.
🔢 Ordering Across Partitions: Why Global Order Is a Coordination Problem
A common and costly misconception is that CDC gives you a globally ordered stream of all changes across all tables and all partitions. With one partial exception, it does not. Understanding why helps you design consumers that degrade gracefully when events arrive out of order.
Total order means every consumer sees every event in exactly the same global sequence — event 1 before event 2 before event 3, across all entities. Achieving total order requires a single sequencer — a serialization point — through which all writes must pass. At database scale with millions of writes per second, a single sequencer is a throughput bottleneck. This is why distributed databases shard writes across partitions: to eliminate the single sequencer. Without a sequencer, there is no total order.
Per-entity order means changes to the same key always arrive in the order they were committed. Order-101 will always arrive in the sequence: placed → confirmed → shipped → delivered, regardless of concurrency. The interleaving of order-101 events and order-102 events may not match wall-clock commit order. Per-entity order is what every platform in this post guarantees within a partition or shard, and it is what most event-driven microservice use cases actually need.
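A consumer can turn per-entity order plus at-least-once delivery into safe state updates with a last-applied-version guard per key. A minimal Python sketch, assuming each event carries a monotonically increasing per-key version or commit timestamp (an assumption, not a platform guarantee):

```python
state = {}  # pk -> (version, payload); stands in for a downstream store

def apply_if_newer(pk, version, payload):
    """Apply an event only if it is newer than the last event already
    applied for this key; duplicates and stale replays become no-ops."""
    current_version = state.get(pk, (-1, None))[0]
    if version > current_version:
        state[pk] = (version, payload)
        return True
    return False

apply_if_newer("order-101", 1, "placed")
apply_if_newer("order-102", 1, "placed")     # other keys interleave freely
apply_if_newer("order-101", 2, "confirmed")
apply_if_newer("order-101", 1, "placed")     # replayed duplicate: ignored
```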
CockroachDB resolved timestamps are the closest any production distributed database comes to total order. The resolved timestamp protocol requires every CockroachDB range to report its minimum active transaction timestamp. The changefeed advances its resolved timestamp to the minimum across all ranges — meaning the feed "knows" that all changes with a commit timestamp below the resolved timestamp have been delivered. Consumers can use the resolved timestamp as a processing barrier: buffer all events received since the last resolved timestamp, wait for the next resolved timestamp, then process the buffered batch in order. This adds tens of milliseconds of latency but enables temporal ordering guarantees that no other platform in this comparison provides.
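The buffer-until-barrier pattern described above can be sketched as follows; the integer timestamps and string payloads are simplifications, not CockroachDB's actual event envelope:

```python
class ResolvedBarrierBuffer:
    """Buffers changefeed events and releases them in commit-timestamp order
    once a resolved timestamp guarantees no earlier event is still in flight."""

    def __init__(self):
        self.pending = []

    def on_event(self, commit_ts, payload):
        self.pending.append((commit_ts, payload))

    def on_resolved(self, resolved_ts):
        # Everything at or below resolved_ts is complete: sort and release it.
        ready = sorted(e for e in self.pending if e[0] <= resolved_ts)
        self.pending = [e for e in self.pending if e[0] > resolved_ts]
        return ready

buf = ResolvedBarrierBuffer()
buf.on_event(101, "order-102 updated")   # arrives first
buf.on_event(100, "order-101 updated")   # committed earlier, arrives later
buf.on_event(105, "order-103 updated")
batch = buf.on_resolved(102)             # safe to process in temporal order
```

The trade-off is visible in the code: nothing is processed until the next resolved message arrives, which is exactly the tens-of-milliseconds (or `resolved` interval) latency cost described above.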
The following sequence diagram illustrates what happens in a system without total ordering guarantees: two concurrent writes to different partitions, consumed by a single downstream consumer, can arrive in an order that differs from wall-clock commit order.
sequenceDiagram
participant PA as Writer on Partition A
participant PB as Writer on Partition B
participant Con as Downstream Consumer
PA->>PA: Commits order-101 at T=100ms
PB->>PB: Commits order-102 at T=101ms
PB->>Con: Event: order-102 delivered first
PA->>Con: Event: order-101 delivered second
Note over Con: Wall-clock commit order was 101 then 102
Note over Con: Delivery order was 102 then 101
Note over Con: Per-partition order preserved - cross-partition order not guaranteed
For most microservice use cases — projecting read models, invalidating caches, syncing search indexes — per-entity ordering is sufficient and the cross-partition interleaving shown above is harmless. For financial reconciliation or compliance audit trails that require strict temporal ordering across all entities, use CockroachDB resolved timestamps or redesign around a single-partition event sourcing log.
⚖️ Common Failure Modes and How to Recover Before They Escalate
Six failure patterns account for the majority of production CDC incidents. Each has a specific detection method and a specific recovery path.
1. Replication slot lag accumulating unchecked (Debezium/PostgreSQL): WAL grows silently on disk. No error is thrown until the disk fills and PostgreSQL crashes or refuses new writes. Detection: run SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots WHERE slot_type = 'logical' every 5 minutes. Alert when lag_bytes exceeds 5 GB. Mitigation: configure max_slot_wal_keep_size = '15GB' in postgresql.conf (PostgreSQL 13+). This drops the slot automatically when lag exceeds the limit, which triggers a re-snapshot but prevents a database outage. Never deploy Debezium on PostgreSQL without this setting.
2. Shard iterator expiry (DynamoDB Streams): A shard iterator obtained via GetShardIterator expires 15 minutes after it is issued, and the next read with it throws ExpiredIteratorException. This bites consumers that process events slowly, back off under rate limits, or sit on tables with infrequent writes. Recovery: always store the last successfully processed SequenceNumber durably (in DynamoDB itself, or in an external state store). On ExpiredIteratorException, request a new iterator using ShardIteratorType = AT_SEQUENCE_NUMBER with the stored value. Implement a heartbeat read every 5 minutes during idle periods so a fresh iterator is always in hand when the source table is quiet.
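The recovery path for this failure mode can be sketched with a stubbed client; the class and method names mirror the AWS API shape, but this is an illustration, not boto3:

```python
class ExpiredIteratorException(Exception):
    """Stand-in for the AWS error of the same name."""

class StubStreamClient:
    """Minimal stand-in for the DynamoDB Streams API; the first read fails
    as if the iterator had aged past the 15-minute limit."""
    def __init__(self, records):
        self.records = records
        self.reads = 0
    def get_shard_iterator(self, shard_id, iterator_type, sequence_number):
        return (shard_id, iterator_type, sequence_number)
    def get_records(self, iterator):
        self.reads += 1
        if self.reads == 1:
            raise ExpiredIteratorException()
        return self.records

def read_with_resume(client, shard_id, last_sequence_number):
    """On iterator expiry, re-request an iterator at the durably stored
    SequenceNumber instead of restarting the shard from the beginning."""
    it = client.get_shard_iterator(
        shard_id, "AT_SEQUENCE_NUMBER", last_sequence_number)
    try:
        return client.get_records(it)
    except ExpiredIteratorException:
        # The stored position, not TRIM_HORIZON, is the resume point.
        it = client.get_shard_iterator(
            shard_id, "AT_SEQUENCE_NUMBER", last_sequence_number)
        return client.get_records(it)
```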
3. Resume token invalidation from oplog rollover (MongoDB): If the oplog rolls over while a change stream consumer is offline, re-opening watch(resumeAfter: token) throws ChangeStreamHistoryLost. Recovery path: detect the error → trigger a full collection snapshot into a staging area → apply the snapshot to the downstream system → restart watch() without a resume token. Prevention: increase oplog minimum retention (storage.oplogMinRetentionHours in MongoDB 4.4+) to well above your maximum expected consumer downtime. For a consumer with a 48-hour maximum expected downtime, set at least 72 hours of oplog retention.
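The detect-snapshot-restart recovery path can be sketched as follows, with a stubbed watch function standing in for PyMongo's collection.watch() (names here are illustrative):

```python
class ChangeStreamHistoryLost(Exception):
    """Raised when the resume token points before the oldest oplog entry."""

def open_stream_with_fallback(watch, run_snapshot, token):
    """Resume from the stored token if possible; on oplog rollover, rebuild
    downstream state from a full snapshot and start a fresh stream."""
    try:
        return watch(resume_after=token), False
    except ChangeStreamHistoryLost:
        run_snapshot()                         # rebuild downstream state first
        return watch(resume_after=None), True  # fresh stream, no resume token

# Stand-in for a real change stream; a production consumer calls PyMongo here.
recovery_log = []
def fake_watch(resume_after=None):
    if resume_after is not None:
        raise ChangeStreamHistoryLost()  # token is older than the oplog
    return "fresh-stream"

stream, resnapshotted = open_stream_with_fallback(
    fake_watch, lambda: recovery_log.append("snapshot"), token="stale-token")
```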
4. Schema evolution breaking event deserialization (all platforms): Adding a NOT NULL column without a default, renaming a column, or changing a column type in the source database while consumers are running will break deserialization of events that carry the new schema. CDC consumers fail hard on fields they don't recognize or fail to find fields they expect. Safe migration protocol: (1) add new columns as nullable with a default value, (2) dual-write both old and new column names for one complete release cycle, (3) update all consumers to use the new column, (4) drop the old column. Never rename a live column in a CDC pipeline. Never add a NOT NULL constraint to an existing column without a default.
5. Poison pill events blocking a partition: A malformed, oversized, or semantically invalid event causes a consumer to throw an unhandled exception. The consumer retries the same event repeatedly, blocking all subsequent events on that partition. Detection: monitor per-partition consumer lag — a partition that stops advancing while others continue is almost always a poison pill. Recovery: configure a dead letter queue (DLQ): route events that fail after N consecutive retries to a separate DLQ topic for manual inspection, and commit the offset to resume processing of subsequent events. In Kafka consumers, use errors.tolerance = all and set errors.deadletterqueue.topic.name to a dedicated DLQ topic.
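A minimal sketch of the retry-then-quarantine loop for this failure mode, independent of any particular broker client:

```python
def process_partition(events, handle, dlq, max_attempts=3):
    """Process one partition in order; after max_attempts failures on a
    single event, quarantine it in the DLQ and keep the partition moving."""
    for event in events:
        for attempt in range(max_attempts):
            try:
                handle(event)
                break                    # success: move to the next event
            except Exception:
                if attempt == max_attempts - 1:
                    dlq.append(event)    # poison pill: route away, continue

dlq = []
processed = []
def handle(event):
    if event == "malformed":
        raise ValueError("cannot deserialize")
    processed.append(event)

process_partition(["a", "malformed", "b"], handle, dlq)
```

The key property is that the offset still advances past the poison pill, so one bad event cannot starve every event behind it on the partition.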
6. CockroachDB changefeed GC gap causing forced re-snapshot: If a CockroachDB changefeed consumer falls offline for longer than the table's gc.ttlseconds setting, the MVCC history the changefeed needs to resume from its checkpoint has been garbage-collected. The changefeed errors with a GC threshold past changefeed timestamp error on reconnect. Recovery requires dropping the changefeed, setting gc.ttlseconds to a value greater than your maximum expected consumer downtime on the watched tables, and recreating the changefeed from an initial snapshot. Increase gc.ttlseconds proactively for any table with an active changefeed — the storage overhead is predictable and manageable.
🌍 When Change Feeds and Streams Are the Right Architectural Tool
CDC is the correct choice for these patterns, and applying it correctly here saves engineering effort that polling, dual-writes, or synchronous hooks cannot avoid.
Event-driven microservice choreography: An order is inserted → the Inventory Service reserves stock → the Billing Service creates an invoice → the Notification Service sends a confirmation email. Each service subscribes to the orders change feed and reacts to the relevant op: 'c' (insert) or op: 'u' (update) events. No synchronous service-to-service HTTP calls, no shared transaction boundaries, no tight coupling. A failure in the Notification Service does not block Inventory or Billing. The change feed retains events during the outage, and the Notification Service processes them when it recovers.
Cache invalidation with strict consistency: An item in the product catalog is updated in PostgreSQL → Debezium emits a change event → a cache invalidation consumer receives the event → it deletes the affected keys from Redis. This is strictly more reliable than a dual write (write to PostgreSQL and evict from Redis in the same request handler) because the cache invalidation event can only fire after the PostgreSQL write has committed. There is no window where the cache contains stale data and the database write has already committed, as there is in the dual-write pattern.
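A sketch of such an invalidation consumer; the dict cache and the product:&lt;pk&gt; key convention are stand-ins for a Redis client and a real key scheme:

```python
cache = {"product:42": {"price": 10}, "product:7": {"price": 5}}

def invalidate_on_change(event):
    """Evict the cached entry for a changed row. Because the CDC event is
    emitted only after the database commit, eviction can never race ahead
    of the write the way a dual-write eviction can."""
    if event["op"] in ("u", "d"):                 # update or delete committed
        cache.pop(f"product:{event['pk']}", None)

invalidate_on_change({"op": "u", "pk": 42})   # update: evict the stale entry
invalidate_on_change({"op": "c", "pk": 7})    # insert: nothing stale to evict
```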
Near-real-time search index synchronization: A product record is updated in the source database → Debezium emits the change → an Elasticsearch sink connector applies the update to the index within seconds. This eliminates nightly batch re-indexing jobs and provides near-real-time search freshness at a fraction of the infrastructure cost.
Immutable audit logging: Every insert, update, and delete on sensitive tables — financial ledger entries, user identity records, access control lists — is captured by the change feed and written to an immutable audit sink (S3, GCS, or a WORM-configured object store). Because the log reader captures at the database log level, the audit trail cannot be bypassed by any application path, including schema migration scripts and direct SQL executed by database administrators.
CQRS read model projection: The write model is a normalized transactional PostgreSQL database. The read model is a denormalized DynamoDB table or MongoDB collection optimized for the read access pattern. The CDC stream keeps the read model synchronized with write-model changes without any application-layer synchronization logic. Adding a new read model is as simple as adding a new consumer to the existing change topic.
🚫 When to Avoid Change Feeds and Streams
CDC is not always the right tool. Using it in these scenarios introduces unnecessary operational complexity or actively makes the problem worse.
High-frequency ephemeral state updates: A multiplayer game writing 10,000 position updates per second per player does not need every position update propagated to downstream services — it needs the latest position. Emitting every game state mutation through a CDC pipeline produces massive Kafka topic lag, stale data by the time consumers process events, and very high storage costs for events that have no durable business significance. Use a last-write-wins cache (Redis) for ephemeral state. CDC is for business events that have meaning beyond the instant they occur.
Use cases requiring strict total order across all entities: If your system requires that every event from every partition arrive in strict wall-clock order — a financial ledger that must globally sequence all debit and credit events across all accounts — you need CockroachDB resolved timestamps or a single-partition event sourcing design. Building a total-order consumer on top of DynamoDB Streams, Cosmos DB, or MongoDB Change Streams is architecturally impossible without introducing a serializing middleware, which defeats the scaling benefits of the distributed database.
Small teams without CDC operational experience: Running Debezium on PostgreSQL in production requires monitoring WAL lag proactively, managing replication slot lifecycle, handling connector schema evolution, and building reprocessing pipelines for gap recovery. Running MongoDB Change Streams requires managing oplog sizing and resume token lifecycle. These are specialized operational skills. If your team is deploying its first event-driven architecture, start with the outbox pattern (smaller operational surface, easier debugging) and graduate to full CDC after the team has built familiarity with at-least-once delivery, idempotent consumers, and event schema contracts.
When synchronous in-transaction side effects are sufficient: If the downstream action is simple, fast, and operates within the same database — writing to an audit log table, incrementing a counter, updating a materialized view — a database trigger or application-level in-transaction hook is architecturally simpler and more reliable than a CDC pipeline. Reserve CDC for cross-service, cross-database, or cross-technology propagation where a transactional hook cannot reach the destination.
Schema-volatile systems in active development: If your team is in early product development and the database schema changes weekly — adding columns, renaming fields, refactoring table layouts — a CDC pipeline will break on every schema change. Schema changes require coordinated updates to connector configuration, consumer deserialization logic, and sometimes a full re-snapshot of large tables. CDC is most valuable and most stable on a production-grade, relatively stable schema. Introduce it after the data model has settled.
🧪 Full System Architecture: From Database Commit to Downstream Consumer with Recovery Paths
The following diagram shows a production CDC architecture end-to-end: source database through a CDC connector to a Kafka topic, fanned out to three downstream consumers, with failure recovery paths and dead-letter handling made explicit. This is the architecture that replaces the broken polling approach from the opening scenario.
graph TD
SDB[Source DB - orders table] --> WAL[WAL or oplog]
WAL --> CDC[CDC Connector - Debezium or native]
CDC --> KT[Kafka Topic: order-changes]
KT --> INV[Inventory Consumer]
KT --> BILL[Billing Consumer]
KT --> NOTIF[Notification Consumer]
INV --> INVDB[Inventory DB - upsert semantics]
BILL --> BILLDB[Billing DB - upsert semantics]
NOTIF --> PUSH[Email and Push Service]
subgraph Recovery[Failure and Resume Paths]
R1[Offset committed per consumer group]
R2[Resume from last committed offset on restart]
R3[DLQ for poison pill events]
end
INV -.->|checkpoints| R1
BILL -.->|checkpoints| R1
NOTIF -.->|checkpoints| R1
R1 -.->|on restart| R2
KT -.->|on N failures| R3
Three architectural decisions embedded in this diagram are worth emphasizing. First, each consumer group maintains an independent Kafka offset — a Billing Consumer crash does not delay or block Inventory Consumer event processing. Second, all downstream writes use upsert semantics, not insert: this is what makes the consumers safe to re-run on duplicate events. Third, the Dead Letter Queue (DLQ) is wired at the Kafka level, not the application level — poison pills are routed away from the main topic before they can block partition consumption.
🧭 Choosing the Right CDC Approach for Your System: A Decision Reference
| If you need... | Use... | Why |
| --- | --- | --- |
| Serverless, AWS-native, minimal ops | DynamoDB Streams + Lambda | Fully managed, zero connector infrastructure to operate |
| Global ordering via resolved timestamps | CockroachDB Changefeed | Only platform with near-total-order guarantee |
| Per-document resume with rich event filters | MongoDB Change Streams | Resume tokens, aggregation pipeline filters, pre-image support in 6.0+ |
| PostgreSQL or MySQL source with Kafka sink | Debezium | Open-source, mature, supports schema registry integration |
| Multi-region Azure, Cosmos DB source | Cosmos DB Change Feed | Native SDK, lease container for distributed consumers |
| Exactly-once delivery without consumer deduplication | CockroachDB Changefeed with Kafka sink | Only platform with built-in exactly-once |
| Cross-table fan-out from a single PostgreSQL source | Debezium with multiple Kafka consumers | Each consumer tracks its own LSN independently |
| Immutable audit trail including schema-bypassing SQL | Any log-based CDC to S3 or GCS | Log captures all commits regardless of application path |
🛠️ OSS Configuration Reference: CockroachDB, Debezium, MongoDB, and PostgreSQL
The following snippets are starting points for production configuration. Each uses only DDL, connector config, or query syntax — no application code.
CockroachDB: Creating a Changefeed to a Kafka Sink
-- Create a changefeed that emits all changes on the orders table
-- to a Kafka topic, with resolved timestamps every 10 seconds,
-- before/after diff included in every update event, and
-- automatic backfill on schema changes.
CREATE CHANGEFEED FOR TABLE orders
INTO 'kafka://kafka-broker:9092?topic_prefix=crdb.'
WITH
resolved = '10s',
format = 'json',
diff,
schema_change_policy = 'backfill',
min_checkpoint_frequency = '5s';
The resolved = '10s' option emits a resolved timestamp heartbeat every 10 seconds into the Kafka topic — consumers can use this as a processing barrier to ensure temporal ordering. The diff option includes the before-image in every update event, enabling downstream consumers to compute the precise delta. Setting schema_change_policy = 'backfill' means the changefeed performs an initial scan after detecting a schema change and resumes streaming automatically, instead of pausing indefinitely.
Debezium: PostgreSQL Connector Configuration
{
"name": "orders-pg-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg-primary.internal",
"database.port": "5432",
"database.user": "debezium_user",
"database.password": "${file:/opt/kafka/secrets/pg.properties:password}",
"database.dbname": "orders_db",
"database.server.name": "orders-pg",
"plugin.name": "pgoutput",
"publication.name": "orders_publication",
"slot.name": "debezium_orders_slot",
"table.include.list": "public.orders,public.order_items",
"snapshot.mode": "initial",
"heartbeat.interval.ms": "30000",
"max.queue.size": "8192",
"max.batch.size": "2048",
"topic.prefix": "orders-pg"
}
}
plugin.name: pgoutput uses the built-in PostgreSQL logical decoding output plugin — no extension installation required on the database host. slot.name uniquely identifies this consumer in pg_replication_slots; it is the key you monitor for WAL lag. heartbeat.interval.ms: 30000 ensures the connector emits a heartbeat every 30 seconds even when no rows change, which advances the replication slot and allows PostgreSQL to reclaim WAL that has already been consumed.
PostgreSQL: Replication Slot, Publication, and WAL Lag Monitoring
-- Create the publication for the tables to be replicated.
-- The publication name must match publication.name in the connector config.
CREATE PUBLICATION orders_publication
FOR TABLE public.orders, public.order_items
WITH (publish = 'insert, update, delete');
-- The replication slot is created automatically by Debezium on first startup.
-- Pre-create it manually only to verify connectivity before deploying the connector.
SELECT pg_create_logical_replication_slot(
'debezium_orders_slot',
'pgoutput'
);
-- WAL lag monitoring query — run this every 5 minutes and alert on high lag_bytes.
-- Alert threshold: 5 GB. Safety cap: max_slot_wal_keep_size = 15GB in postgresql.conf.
SELECT
slot_name,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
The monitoring query is as important as the connector configuration. lag_bytes represents how much WAL PostgreSQL must retain on disk for this slot. An idle or offline Debezium connector causes this number to grow indefinitely. Automate this query in your observability stack and page on-call when lag_bytes exceeds your alert threshold.
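Automating the alert decision takes only a few lines; the thresholds below mirror the 5 GB alert and 15 GB max_slot_wal_keep_size figures used in this post, and the function name is illustrative:

```python
GIB = 2**30

def classify_slot_lag(lag_bytes, alert_bytes=5 * GIB, cap_bytes=15 * GIB):
    """Map lag_bytes from the monitoring query to an operator action."""
    if lag_bytes >= cap_bytes:
        # max_slot_wal_keep_size territory: PostgreSQL may drop the slot,
        # forcing a re-snapshot. Should never be reached if alerting works.
        return "slot-at-risk-of-drop"
    if lag_bytes >= alert_bytes:
        return "page-oncall"
    return "ok"
```

Wire this into whatever runs the SQL query on a 5-minute schedule, and page on anything other than "ok".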
MongoDB: Change Stream with Aggregation Pipeline Filter and Pre-Image
// Open a Change Stream on the orders collection,
// filtered to insert, replace, and update operations only.
// Requires MongoDB 6.0+ for pre-image and post-image support.
// The lastResumeToken must be persisted by the consumer across restarts.
db.orders.watch(
[
{
$match: {
operationType: { $in: ["insert", "replace", "update"] },
"ns.coll": "orders"
}
}
],
{
fullDocument: "whenAvailable",
fullDocumentBeforeChange: "whenAvailable",
resumeAfter: lastResumeToken
}
)
The aggregation pipeline $match stage filters the event stream before events are sent to the consumer — only matching events consume network and deserialization overhead. fullDocumentBeforeChange: "whenAvailable" returns the document state before the change (requires pre-images enabled on the collection via collMod). fullDocument: "whenAvailable" returns the post-change state from the stored post-image rather than performing a separate lookup, giving you the exact state at change time instead of the potentially-stale result of updateLookup.
📚 Seven Lessons Learned from CDC Pipelines Running in Production
1. Deduplication is not an edge case — it is the baseline operating model. Every CDC platform except CockroachDB with exactly-once configured will deliver the same event more than once under failure conditions. Design every consumer as if duplicates arrive regularly: use upsert semantics for all downstream writes, key idempotency checks on source primary key plus event LSN or ts_ms, and test the consumer with deliberately injected duplicate events before shipping to production.
2. Retention windows are the most common source of silent data loss. DynamoDB's 24-hour window has surprised more than one team whose consumer went offline over a long weekend. MongoDB's capped oplog has a variable window that depends on write volume — a quiet database can retain months of oplog; a busy one, hours. CockroachDB's GC TTL defaults to 4 hours. Document your platform's effective retention window, measure your maximum expected consumer downtime, and configure monitoring to detect gaps before they exceed the window.
3. Replication slot lag is a silent PostgreSQL time bomb. The WAL accumulates without any error or warning until the disk is full. max_slot_wal_keep_size in PostgreSQL 13+ is not optional for production Debezium deployments — it is the difference between a controlled re-snapshot and an unplanned database outage. Add it before you deploy the connector, not after you receive the disk-full PagerDuty alert.
4. fullDocument: 'updateLookup' in MongoDB is not the state at change time. This is the most common MongoDB Change Streams bug in production CDC pipelines. It is a separate database read that returns the current document state, not the state at the change. If you need point-in-time accuracy — audit logging, delete propagation, financial state tracking — enable pre-images on the collection (MongoDB 6.0+) and use fullDocumentBeforeChange: 'whenAvailable'.
5. Schema changes require a coordinated, multi-release migration protocol. A column rename or type change in the source database invalidates every downstream CDC consumer simultaneously. The safe protocol spans four steps across successive releases: (1) add the new column alongside the old one, (2) dual-populate both columns for one complete release cycle, (3) update consumers, then (4) drop the old column. Never rename a column on a live CDC pipeline without this ceremony.
6. CockroachDB resolved timestamps eliminate an entire class of ordering bugs. Teams processing time-series aggregations, financial reconciliation, or ordered audit logs from CockroachDB should use resolved timestamps as a processing barrier. They are the most underused feature in distributed CDC. Buffer events between resolved timestamps, process each batch after the barrier advances, and you eliminate the out-of-order interleaving that plagues every other platform.
7. The outbox pattern is the right starting point; CDC is the right destination. If your team is new to event-driven architecture, deploy the outbox pattern first. It has a smaller operational surface — a relay query (or Debezium watching the outbox table) instead of a full WAL replication slot — and simpler failure modes. Graduate to full CDC when the team understands idempotent consumers, at-least-once delivery, and event schema lifecycle management. The architecture is compatible: an outbox table watched by Debezium is itself a CDC pipeline.
📌 TLDR Summary & Key Takeaways: Which CDC Model Fits Which Distributed-System Constraint
Polling fails in three structural ways: it misses hard deletes (the row is gone), double-processes on clock skew (overlapping watermark windows), and amplifies primary database query load (one query per consumer per poll interval). CDC from the database log eliminates all three failure modes.
Change feeds (CockroachDB, Cosmos DB, DynamoDB Streams) are pull-based cursor APIs where the consumer manages its own position. Change streams (MongoDB) are push-based subscription APIs where MongoDB pushes events and provides resume tokens. Both are CDC mechanisms. The distinction matters for consumer architecture design.
Every platform except CockroachDB with exactly-once enabled is at-least-once delivery. Design all consumers to be idempotent from day one. Upsert, not insert. Deduplication keyed on source primary key plus LSN, not just primary key.
Ordering guarantees are weaker than most engineers assume. Per-partition order is guaranteed everywhere. Cross-partition, cross-shard, and cross-document ordering is not. CockroachDB resolved timestamps are the only production mechanism for near-total ordering in a distributed database.
Know your platform's retention window before going to production. DynamoDB Streams: 24 hours. Cosmos DB: 7 days. CockroachDB: GC TTL (default 4 hours per table). MongoDB oplog: size-dependent, hours to days. PostgreSQL WAL: retained until the replication slot advances. Design your consumer availability SLA around the retention window, not the other way around.
Cosmos DB Change Feed does not capture hard deletes in Latest Version mode. Implement soft deletes with isDeleted: true and TTL for any pipeline that requires delete propagation.
The top operational risk in Debezium/PostgreSQL is WAL disk exhaustion. Configure max_slot_wal_keep_size before deploying. Monitor pg_replication_slots.restart_lsn lag every 5 minutes. Alert at 5 GB. This single configuration decision separates a production-grade Debezium deployment from a ticking clock.
📝 Practice Quiz
A polling consumer queries `SELECT * FROM orders WHERE updated_at > :last_check`. Which class of mutations will it always miss, regardless of how frequently it runs?
- A) INSERT events where `updated_at` is null
- B) UPDATE events on rows modified by a stored procedure
- C) Hard DELETE events, because a deleted row has no `updated_at` to return

Correct Answer: C
You enable DynamoDB Streams on a high-traffic orders table with `NEW_AND_OLD_IMAGES`. Your Lambda consumer has been idle for 20 minutes on a shard with no new writes. When the next write arrives and Lambda attempts to read the shard, what error will it encounter?
- A) `SequenceNumberRangeEndingException` because the shard was automatically closed
- B) `ExpiredIteratorException` because the shard iterator expired after 15 minutes of idle time
- C) `ResourceNotFoundException` because DynamoDB Streams reclaimed the shard

Correct Answer: B
A MongoDB Change Stream consumer on a busy collection uses `fullDocument: 'updateLookup'`. Document `order-101` is updated at T=100ms (setting `status = 'confirmed'`) and immediately updated again at T=102ms (setting `status = 'shipped'`). The consumer receives the T=100ms change event. What value does `fullDocument.status` contain?
- A) `confirmed` — the exact document state at T=100ms when the change was committed
- B) `shipped` — the current document state at the time the lookup was performed, reflecting the T=102ms update
- C) `null` — `fullDocument` is only populated for INSERT events, not UPDATE events

Correct Answer: B
Your Debezium PostgreSQL connector has been offline for 48 hours. The production database generates 2 GB/hour of WAL. You have not configured `max_slot_wal_keep_size`. Which statement best describes the operational risk when you restart the connector?
- A) The connector will begin streaming from the current WAL position, silently skipping the 48 hours of changes
- B) PostgreSQL will have accumulated approximately 96 GB of WAL behind the replication slot; if disk capacity is near its limit, the database may crash before the connector catches up
- C) The replication slot will have been automatically dropped after 24 hours, so the connector will perform a fresh snapshot with no data gap

Correct Answer: B
Which CockroachDB changefeed feature is unique among all the platforms discussed — DynamoDB Streams, Cosmos DB, MongoDB Change Streams, and Debezium — and enables consumers to process events with near-total-order guarantees?
- A) Schema change policy set to `backfill`
- B) Resolved timestamps emitted as heartbeat signals into the change event stream
- C) The `diff` option that includes before-images in update events

Correct Answer: B
You are building a consumer for an Azure Cosmos DB Change Feed. Your downstream search service needs to remove documents when they are deleted from the source container. You configure the Change Feed processor with Latest Version mode. Which of the following will you observe?
- A) Delete events arrive with `operationType: 'delete'` and the deleted document's ID
- B) No events are emitted for hard deletes; you must implement soft deletes using an `isDeleted` flag to propagate removals through the feed
- C) Delete events arrive only if you configure `captureDeletes: true` on the Change Feed processor

Correct Answer: B
A CockroachDB changefeed emits the following event in the Kafka topic: `{"resolved": "1712841600000000000.0000000000"}`. What can a downstream consumer safely conclude after receiving this message?
- A) All change events with a commit timestamp before `1712841600000000000` nanoseconds have been delivered and can now be processed in any order
- B) The changefeed has paused and will resume after the resolved timestamp is acknowledged
- C) A schema change was detected at this timestamp and the consumer should halt until backfill is complete

Correct Answer: A
You are running Debezium on MySQL with `snapshot.mode: never`. A new column `discount_pct DECIMAL(5,2) DEFAULT 0` was added to the `orders` table two weeks before you started the connector. Which statement is correct?
- A) Debezium will automatically detect the column's history and backfill its values in the emitted events for all rows
- B) The connector starts streaming from the current binlog position; events will include `discount_pct` for all new writes, but the historical values for existing rows are never captured
- C) The connector will reject any event that references a column added before the connector's start position

Correct Answer: B
You need to subscribe to change events on a MongoDB collection using a Change Stream, but you only care about insert and replace operations — updates and deletes should be filtered out before consuming any network or CPU resources. What is the correct approach?
- A) Filter events in the consumer application code after receiving them from the `watch()` cursor
- B) Pass an aggregation pipeline with a `$match` stage to `watch()` that filters `operationType` to `["insert", "replace"]`
- C) Open separate `watch()` cursors for inserts and replaces using the `operationTypes` option in the watch configuration

Correct Answer: B
Open-ended challenge — there is no single correct answer. You are designing a real-time inventory reservation system that processes 50,000 order events per minute from a MongoDB source database. You must choose between `fullDocument: 'updateLookup'` and enabling pre-images (MongoDB 6.0+) for capturing inventory record states at change time. Both approaches have trade-offs in correctness, storage cost, and performance at this event volume. What factors would you evaluate when deciding between them, and under what conditions would pre-images introduce unacceptable overhead that makes `updateLookup` the more pragmatic choice despite its eventual-consistency limitation?
🔗 Related Posts
- How CDC Works Across Databases: PostgreSQL, MySQL, MongoDB, and Beyond
- Change Data Capture Pattern: Log-Based Data Movement
- The Dual Write Problem: Why Two Writes Always Fail Eventually — and How to Fix It
- How Kafka Works: The Log That Never Forgets
- Event Sourcing Pattern: Auditability, Replay, and Versioning
- Key Terms in Distributed Systems

Written by
Abstract Algorithms
@abstractalgorithms