
Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh

Choose ingestion, serving, and ownership patterns deliberately when data platforms start to scale.

Abstract Algorithms · 8 min read

TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how serving layers are materialized, and who owns quality over time. Lambda, Kappa, CDC, Medallion, and Data Mesh are patterns for making those choices explicit; big-data architecture fails when ingestion, curation, and ownership are left as accidental pipeline glue.

📖 Why Big-Data Patterns Are About Flow and Ownership

Many teams think the big-data problem is solved once they choose a lakehouse, warehouse, or streaming stack. That is only the storage decision. The harder question is how data moves from operational truth into trustworthy analytical and machine-learning products.

At scale, data systems must answer several architectural questions:

  • Do we ingest by polling tables, receiving events, or capturing database changes directly?
  • Do we maintain both batch and streaming paths or unify on one processing style?
  • How do raw, curated, and serving layers stay aligned over time?
  • Which team owns data quality for each domain?
  • How expensive is replay when the pipeline needs to be rebuilt?

These are exactly the concerns addressed by patterns like Lambda, Kappa, CDC, Medallion, and Data Mesh.

๐Ÿ” Comparing Lambda, Kappa, CDC, Medallion, and Data Mesh

The patterns operate at different levels, but they intersect constantly.

| Pattern | Main goal | Best fit | Main cost |
| --- | --- | --- | --- |
| Change Data Capture (CDC) | Stream source-of-truth database changes reliably | OLTP systems feeding analytics or downstream sync | Backfill and schema-change complexity |
| Lambda Architecture | Combine batch accuracy with streaming freshness | Systems needing both full recompute and low-latency views | Duplicate logic in two paths |
| Kappa Architecture | Use one streaming model for both incremental and replay processing | Strong streaming culture and append-friendly data | Replay discipline and stream cost |
| Medallion Architecture | Separate raw, refined, and curated data layers | Lakehouse-based analytics platforms | Layer sprawl without discipline |
| Data Mesh | Push domain ownership to teams that know the data best | Large org with many domains and strong platform support | Governance and enablement overhead |

The real design choice is often combination, not substitution. A platform may use CDC for ingestion, Medallion for curation, and a limited Data Mesh operating model for domain ownership.

โš™๏ธ Core Mechanics: From Source Changes to Serving Tables

Most modern platforms follow a staged flow.

  1. Source data enters through CDC, logs, APIs, or batch loads.
  2. Raw records land in a bronze layer or equivalent immutable staging zone.
  3. Validation and normalization produce silver-level datasets.
  4. Gold-level marts or serving tables are shaped for dashboards, search, or ML features.
  5. Ownership and quality checks are attached to each domain output.
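The staged flow above can be sketched with plain functions over in-memory records. The dataset shapes, column names, and validation rules here are illustrative assumptions, not any specific framework's API; real platforms would express the same stages as Spark or SQL jobs.

```python
# Minimal sketch of a bronze -> silver -> gold flow over in-memory records.
# Layer contracts and rules are hypothetical examples for illustration.

def to_bronze(raw_events):
    """Bronze: keep records immutable and close to source shape."""
    return [dict(event, _ingested=True) for event in raw_events]

def to_silver(bronze):
    """Silver: validate and normalize types; drop records failing checks."""
    silver = []
    for event in bronze:
        if event.get("order_id") is None:
            continue  # quality gate: reject rows missing the business key
        silver.append({"order_id": str(event["order_id"]),
                       "amount": float(event.get("amount", 0.0))})
    return silver

def to_gold(silver):
    """Gold: shape for a specific consumer, here revenue per order."""
    revenue = {}
    for row in silver:
        revenue[row["order_id"]] = revenue.get(row["order_id"], 0.0) + row["amount"]
    return revenue

raw = [{"order_id": 1, "amount": "19.90"}, {"order_id": None, "amount": "5"}]
gold = to_gold(to_silver(to_bronze(raw)))
# gold == {"1": 19.9}; the malformed record was stopped at the silver gate
```

The point of the sketch is that each layer has one job: bronze preserves, silver validates, gold serves a named consumer.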

Lambda and Kappa differ mainly in how they treat recomputation. Lambda keeps a batch path plus a speed path. Kappa prefers one streaming abstraction and treats replay as another stream-processing job.
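One way to make that difference concrete is to factor the transformation into a single pure update step that both paths share; this is a sketch under assumed names, not a prescribed implementation, but it shows why Kappa can drop the batch path entirely when replay is just the stream job run over old input.

```python
# Sketch: one transformation shared by batch recompute and streaming update,
# the usual defense against Lambda's dual-path drift. Names are illustrative.

def count_clicks(counts, event):
    """Pure update step: fold one event into the running aggregate."""
    user = event["user"]
    counts[user] = counts.get(user, 0) + 1
    return counts

def batch_recompute(history):
    """Lambda batch path: replay all history through the same step."""
    counts = {}
    for event in history:
        counts = count_clicks(counts, event)
    return counts

def stream_update(state, new_event):
    """Speed path: apply the identical step incrementally."""
    return count_clicks(state, new_event)

history = [{"user": "a"}, {"user": "b"}, {"user": "a"}]

# Kappa's claim: if replay is the stream job over old input, the separate
# batch path becomes unnecessary -- both must produce identical state.
state = {}
for e in history:
    state = stream_update(state, e)
assert state == batch_recompute(history) == {"a": 2, "b": 1}
```

When the two paths cannot share a step like this, Lambda's duplicate-logic cost from the comparison table shows up as silent drift between batch and streaming outputs.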

CDC matters because it moves platforms closer to source-of-truth semantics. Instead of relying on nightly table dumps, it captures inserts, updates, and deletes as they occur, giving downstream systems timelier data and more accurate lineage.

🧠 Deep Dive: Schema Drift, Replay Cost, and Exactly-Once Myths

The Internals: CDC Connectors, Curated Layers, and Domain Ownership

CDC pipelines usually start from database logs or change streams. The connector emits ordered changes, often with metadata such as source position, operation type, and schema version. That metadata is critical for replay and deduplication.
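A minimal sketch of why that positional metadata matters: ordering by source position and remembering the last applied position makes redelivered changes harmless. The event shape loosely mimics log-based connectors such as Debezium but is a simplified assumption, not a real wire format.

```python
# Sketch: applying CDC events idempotently using a source position (LSN).
# Event fields (lsn, op, key, row) are a simplified, hypothetical shape.

def apply_changes(table, applied_lsn, events):
    """Apply ordered change events once, skipping already-seen positions."""
    for ev in sorted(events, key=lambda e: e["lsn"]):
        if ev["lsn"] <= applied_lsn:
            continue  # duplicate delivery: this position was already applied
        if ev["op"] in ("insert", "update"):
            table[ev["key"]] = ev["row"]
        elif ev["op"] == "delete":
            table.pop(ev["key"], None)
        applied_lsn = ev["lsn"]
    return table, applied_lsn

events = [
    {"lsn": 1, "op": "insert", "key": 10, "row": {"status": "new"}},
    {"lsn": 2, "op": "update", "key": 10, "row": {"status": "paid"}},
    {"lsn": 2, "op": "update", "key": 10, "row": {"status": "paid"}},  # redelivery
]
table, lsn = apply_changes({}, 0, events)
# table == {10: {"status": "paid"}} and lsn == 2, despite the duplicate
```

Without the position check, at-least-once delivery from the connector would silently re-apply changes during replay.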

Medallion layering works when each layer has a clear contract:

  • Bronze keeps raw data close to source shape.
  • Silver resolves quality issues, standardizes types, and joins supporting context.
  • Gold optimizes for specific consumers such as BI dashboards or features.

Data Mesh changes ownership expectations more than technical flow. Domains publish data products with documented schema, quality metrics, and support expectations rather than dumping tables into a shared platform and leaving downstream teams to guess meaning.

Performance Analysis: Late Data, Rebuild Economics, and Stream Cost

| Pressure point | Why it matters |
| --- | --- |
| End-to-end freshness | Tells you whether the platform meets dashboard or feature latency needs |
| Replay duration | Determines how painful pipeline rebuilds become |
| Late-arriving data rate | Impacts window correctness and reconciliation logic |
| Always-on stream cost | Streaming may overspend compared with simpler batch for some domains |
| Ownership bottlenecks | Shared teams can become throughput constraints for every domain |

Exactly-once is often misunderstood. Many systems achieve durable, correct outcomes through idempotent writes and careful offsets rather than magical one-time processing. That distinction matters because teams can overspend or overcomplicate architecture chasing guarantees the product does not require.
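The "effectively-once" outcome can be sketched as an idempotent sink: results are written under a deterministic key, so a retried batch overwrites its own output instead of double-counting. Function and key names here are illustrative assumptions.

```python
# Sketch: "effectively-once" via an idempotent sink, not true exactly-once.
# Keying results by (window, metric) makes retries overwrite, not add.

def write_result(sink, window, metric, value):
    """Idempotent write: same (window, metric) always lands in one slot."""
    sink[(window, metric)] = value

def process_batch(sink, window, events):
    """Deterministic aggregation over a fixed window of events."""
    total = sum(e["amount"] for e in events)
    write_result(sink, window, "revenue", total)

sink = {}
batch = [{"amount": 5}, {"amount": 7}]
process_batch(sink, "2024-01-01T00:00", batch)
process_batch(sink, "2024-01-01T00:00", batch)  # retry after a failure
# sink == {("2024-01-01T00:00", "revenue"): 12} -- no double counting
```

An append-only sink with the same retry would have recorded 24; the guarantee comes from the write discipline, not from the processing engine.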

Replay cost is also a first-class architecture decision. If recomputing gold tables from bronze takes three days, the platform has effectively declared slow recovery as normal. That may be acceptable for historical analytics. It is unacceptable for systems supporting operational decision-making.

📊 Big-Data Flow: CDC Into Bronze, Silver, Gold, and Serving

```mermaid
flowchart TD
    A[OLTP databases and event sources] --> B[CDC or ingest layer]
    B --> C[Bronze raw storage]
    C --> D[Silver validated datasets]
    D --> E[Gold marts and feature tables]
    E --> F[Dashboards and analytics]
    E --> G[ML and operational consumers]
    H[Domain owners] --> D
    H --> E
```

This diagram shows the important split: ingestion is not the same thing as curation, and ownership should attach at the product layer rather than disappear into a central platform backlog.

๐ŸŒ Real-World Applications: Clickstream, Fraud, and ML Feature Backfills

Clickstream analytics often need a blend of fast ingestion and curated outputs. Bronze data preserves raw behavior for replay. Silver normalizes device, session, and event semantics. Gold produces dashboard-ready funnels.

Fraud pipelines often mix CDC from transactional systems with streaming features. Here freshness matters more than raw storage elegance because decisions may need to react within seconds or minutes.

ML feature backfills are where replay economics become painfully visible. If historical features cannot be rebuilt from trusted raw inputs, model iteration slows and trust in the feature platform erodes.

These are not storage problems alone. They are architecture problems about contracts, recomputation, and owner accountability.

โš–๏ธ Trade-offs and Failure Modes

| Failure mode | Symptom | Root cause | First mitigation |
| --- | --- | --- | --- |
| Dual-path drift | Batch and stream outputs disagree | Lambda logic duplicated imperfectly | Shared transformation semantics |
| CDC backfill pain | Historical rebuilds are slow or incomplete | Ingestion designed only for live deltas | Plan replay from day one |
| Layer swamp | Bronze, silver, gold proliferate without purpose | No contract per layer | Define consumers explicitly |
| Mesh without enablement | Domain teams publish inconsistent outputs | No platform standards or tooling | Add schema and quality guardrails |
| Streaming overspend | Always-on jobs cost too much | Chose stream processing where batch was enough | Match latency need to cost |

The central trade-off is flexibility versus governance. Raw storage gives freedom, but without explicit contracts it turns into long-lived confusion. Strong ownership gives clarity, but only if the platform makes the right path easier than the ad hoc path.

🧭 Decision Guide: Which Big-Data Pattern Fits Your Platform?

| Situation | Recommendation |
| --- | --- |
| Need reliable database-to-analytics propagation | Start with CDC |
| Need both recomputation and low-latency views | Lambda may still be justified |
| Team is streaming-native and data is append-friendly | Kappa can simplify architecture |
| Building a lakehouse with clear quality stages | Use Medallion layers |
| Large org with domain-aligned teams and platform maturity | Add Data Mesh ownership model |

Treat Data Mesh as an organizational pattern, not a license to let every team invent its own platform. Treat Medallion as a curation model, not as proof that governance exists automatically.

🧪 Practical Example: Order and Clickstream Into a Lakehouse

Imagine an e-commerce company with OLTP order tables plus high-volume clickstream events.

A practical design could be:

  1. CDC for orders and payments into raw storage,
  2. append-only clickstream ingest into the same bronze zone,
  3. silver transformations that unify customer, session, and order keys,
  4. gold marts for finance, growth, and support dashboards,
  5. feature tables for fraud and recommendations,
  6. domain-owned quality checks on the gold products.
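Step 6 above, the domain-owned quality check, can be sketched as a small contract function the owning team runs against its gold product. Column names and rules are hypothetical examples, not a real contract specification.

```python
# Sketch: a domain-owned quality check attached to a gold product.
# Columns and thresholds are illustrative assumptions.

def check_gold_orders(rows):
    """Return contract violations for a finance-owned gold mart."""
    violations = []
    if not rows:
        violations.append("empty gold table")
    for row in rows:
        if row.get("order_id") is None:
            violations.append("null order_id")
        if row.get("amount", 0) < 0:
            violations.append(f"negative amount for {row.get('order_id')}")
    return violations

gold = [{"order_id": "A1", "amount": 19.9},
        {"order_id": "A2", "amount": -3.0}]
assert check_gold_orders(gold) == ["negative amount for A2"]
```

Publishing the check alongside the dataset is what turns a shared table into a data product: consumers can see the contract, and the owning domain is accountable when it breaks.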

This design keeps replay possible, makes freshness measurable, and avoids forcing analysts to reason directly from raw source quirks.

📚 Lessons Learned

  • Storage choice alone does not define a data-platform architecture.
  • CDC is often the cleanest bridge from operational truth to downstream systems.
  • Medallion works when each layer has a real contract and consumer.
  • Data Mesh succeeds only with strong platform standards and domain accountability.
  • Replay cost and late-data handling should be designed before incidents force them.

📌 Summary and Key Takeaways

  • Big-data patterns make ingestion, curation, and ownership explicit.
  • CDC improves lineage and timeliness from operational systems.
  • Lambda and Kappa make different trade-offs between dual paths and replay simplicity.
  • Medallion clarifies raw versus refined versus curated data products.
  • Data Mesh is about domain ownership with platform guardrails, not decentralization alone.

๐Ÿ“ Practice Quiz

  1. What is the main benefit of CDC in a data platform?

A) It replaces every need for batch processing
B) It captures source-of-truth data changes more reliably and promptly than periodic dumps
C) It removes schema evolution problems entirely

Correct Answer: B

  2. Why is Lambda Architecture expensive to maintain?

A) Because it usually requires both batch and streaming logic for similar outcomes
B) Because it stores no historical data
C) Because it forbids curated datasets

Correct Answer: A

  3. What is the biggest organizational requirement for Data Mesh?

A) Every team must build its own query engine
B) Domain teams need platform support and clear quality ownership for data products
C) No central standards should exist

Correct Answer: B

  4. Open-ended challenge: if your gold dashboards are accurate but always 45 minutes behind source changes, how would you decide whether to add CDC, streaming, or simply redefine freshness expectations per consumer?
Written by Abstract Algorithms (@abstractalgorithms)