
Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh

Choose ingestion, serving, and ownership patterns deliberately when data platforms start to scale.

Abstract Algorithms · 15 min read

AI-assisted content.

TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how quality is staged, how serving layers are materialized, how replay works, and who owns each data product over time. Lambda, Kappa, CDC, Medallion, and Data Mesh are patterns for making those choices explicit.

In 2019, Airbnb's analytics team discovered six weeks of revenue reports had been silently wrong — a broken ETL job had been overwriting validated silver-layer data with raw, unvalidated rows for 43 days. Every pipeline step showed green. No alert fired. The issue surfaced only during a finance audit. The fix wasn't better hardware: it was explicit ownership contracts between raw, validated, and gold-layer data. Patterns like Medallion, CDC, and Data Mesh are the answer to incidents like this one — not theoretical constructs, but scars from real postmortems.

Here is what the Medallion fix looks like in practice — one order event flowing through layers:

| Layer | What it holds | Who can mutate it | What a bug here means |
|---|---|---|---|
| Bronze | Raw CDC event, source-faithful, append-only | Nobody — immutable | Nothing; replay from source always works |
| Silver | Validated, deduped order rows | Silver ETL job only | Downstream Gold breaks; Bronze replay recovers |
| Gold | Finance daily revenue aggregate | Gold job + owner | Dashboard shows wrong numbers; replay from Silver recovers |
| Consumer | BI tool / ML feature | Read-only | Downstream impact only; no data loss |

When the Airbnb ETL bug hit Silver, Bronze was untouched. Replay from Bronze to Gold took 90 minutes and recovered all 43 days of data.

📖 Why These Big-Data Patterns Matter in Real Teams

Most platform incidents are not caused by storage technology. They come from unclear ingestion boundaries, expensive reprocessing, and missing ownership for quality.

Practical architecture questions:

  • Do we need minute-level freshness, or is hourly fine?
  • Can we replay yesterday safely if a bug slips in?
  • Who signs off quality for each gold dataset?
  • What is the cost ceiling for always-on streaming?

| Problem you see | Pattern that helps first |
|---|---|
| OLTP and analytics keep drifting | CDC for source-of-truth propagation |
| Raw data is hard to trust | Medallion-style curation contracts |
| Teams fight over central backlog | Data Mesh ownership model |
| Need both fast and recomputable views | Lambda or Kappa decision |

🔍 When to Use Lambda, Kappa, CDC, Medallion, and Data Mesh

| Pattern | Use when | Avoid when | Practical first step |
|---|---|---|---|
| CDC | You need reliable change propagation from transactional systems | Source systems do not expose stable change logs | Start with one critical table family |
| Lambda | You require low-latency serving plus deterministic batch recompute | Team cannot support dual pipeline logic | Reuse shared transformations across batch and speed paths |
| Kappa | Streaming is mature and replay from log is operationally proven | Data is not append-friendly or stream cost is unjustified | Prove one replay drill on real backlog |
| Medallion | You need clear raw-to-curated progression | Layers are created without consumer contracts | Define bronze/silver/gold acceptance rules |
| Data Mesh | Domain teams can own and operate data products | Platform standards and tooling are immature | Pilot with one domain and one product SLA |

When not to over-engineer

  • If freshness target is daily and replay is rare, simple batch may be enough.
  • If one team owns everything today, full Data Mesh can add overhead too early.

📊 Lambda Architecture: Dual-Path Batch and Speed

flowchart LR
  A[Event Stream] --> B[Speed Layer Kafka / Flink]
  A --> C[Batch Layer Spark / Hadoop]
  B --> D[Real-time Views low latency]
  C --> E[Batch Views high accuracy]
  D --> F[Serving Layer merge both views]
  E --> F
  F --> G[Consumer Queries]

This diagram illustrates the Lambda Architecture's dual-path design, where the same event stream feeds both a real-time Speed Layer (Kafka/Flink) and a high-accuracy Batch Layer (Spark/Hadoop). Both paths converge at the Serving Layer, which merges low-latency views from the speed path with fully recomputed batch views to answer consumer queries. The key takeaway is that Lambda buys both freshness and accuracy at the cost of maintaining two parallel processing pipelines with potentially divergent semantics — the exact drift problem LinkedIn discovered across 23 metrics before moving to Kappa.
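
To make the merge concrete, here is a minimal PySpark sketch of the serving-layer rule: batch views win wherever they exist, and the speed layer fills in only the periods the batch job has not yet covered. The table names and the day-level cutoff are illustrative assumptions, not any specific production layout.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-serving-merge").getOrCreate()

# Hypothetical serving views: the batch view is authoritative up to its last
# complete day; the speed view only covers the most recent, still-open period.
batch_view = spark.table("serving.batch_revenue_daily")   # recomputed nightly
speed_view = spark.table("serving.speed_revenue_daily")   # updated continuously

batch_cutoff = batch_view.agg(F.max("day")).collect()[0][0]

# Serving rule: prefer batch results where they exist; use the speed layer only
# for days the batch recompute has not reached yet.
merged = batch_view.unionByName(
    speed_view.filter(F.col("day") > F.lit(batch_cutoff))
)
merged.createOrReplaceTempView("serving_revenue_daily")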

📊 Kappa Architecture: Single Stream Path

flowchart LR
  A[Event Stream] --> B[Stream Layer Kafka + Flink]
  B --> C[Serving Layer Materialized Views]
  C --> D[Consumer Queries]
  B --> E[Log Replay for corrections]
  E --> C

Lambda keeps batch and speed paths separate — accurate but operationally duplicated. Kappa unifies them into one stream with replay; simpler to own, but replay cost can spike on large backlogs.
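
A replay in Kappa is simple to describe: re-read the retained log from the earliest offset into a fresh materialized view, then cut consumers over once it catches up. A minimal sketch, assuming a Kafka topic with sufficient retention; the topic name and paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-replay").getOrCreate()

# Replay: re-read the full retained log from the earliest offset and rebuild the
# materialized view under a new path, then switch consumers when it catches up.
replay = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.events")
    .option("startingOffsets", "earliest")        # full-history replay
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ingest_ts"))

(replay.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/serving/orders_v2/_chk")   # fresh checkpoint
    .start("/serving/orders_v2"))                 # new view, built in parallel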

⚙️ How the Architecture Works End-to-End

  1. Capture source changes via CDC/events/batch.
  2. Land immutable raw records in bronze.
  3. Validate and normalize into silver with data quality checks.
  4. Publish consumer-specific gold products.
  5. Track ownership, lineage, and freshness SLOs per product.

| Stage | Non-negotiable control | Common anti-pattern |
|---|---|---|
| Ingestion | Replayable offsets/checkpoints | Polling snapshots with no lineage |
| Bronze | Immutable append and schema capture | Mutating raw data in place |
| Silver | Rule-based validation and dedupe | Silent cleanup logic buried in notebooks |
| Gold | Explicit contract per consumer | One gold table pretending to fit all use cases |
| Ownership | Named owner and support SLA | Central platform team owning semantics for every domain |

🛠️ How to Implement: Practical 12-Step Plan

  1. Pick one high-value domain (orders, payments, clickstream).
  2. Define target freshness and correctness SLO per consumer.
  3. Add CDC or event ingest with checkpointed offsets.
  4. Land bronze as immutable with schema version metadata.
  5. Build silver validations: null checks, type rules, dedupe keys.
  6. Create first gold product with explicit consumer contract.
  7. Add data tests for freshness, volume anomaly, and key integrity (see the sketch after this list).
  8. Run replay drill from bronze to gold and record duration.
  9. Publish data product owner and escalation path.
  10. Track unit economics: compute spend per data product.
  11. Expand to second domain only after first meets SLO for 2-4 weeks.
  12. Introduce Mesh governance only where domain ownership is stable.
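
For step 7, a minimal sketch of the three data tests, assuming the silver.orders table from the practical example later in this post; the 15-minute freshness window and 50% volume threshold are illustrative, not prescriptive.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver-data-tests").getOrCreate()
orders = spark.table("silver.orders")   # table name reused from the example slice

# Freshness: the newest validated event must fall inside the agreed latency window.
lag_min = orders.agg(
    ((F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp(F.max("event_ts"))) / 60)
    .alias("lag_min")
).collect()[0]["lag_min"]
assert lag_min is not None and lag_min < 15, f"silver freshness breached: {lag_min:.1f} min"

# Volume anomaly: newest day's row count vs. trailing 7-day average.
daily = (orders.groupBy(F.to_date("event_ts").alias("day")).count()
               .orderBy(F.col("day").desc()).limit(8).collect())
latest, baseline = daily[0]["count"], [r["count"] for r in daily[1:]]
if baseline:
    avg = sum(baseline) / len(baseline)
    assert abs(latest - avg) / avg < 0.5, f"volume anomaly: {latest} rows vs avg {avg:.0f}"

# Key integrity: order_id must remain unique after the silver dedupe.
dupes = orders.groupBy("order_id").count().filter(F.col("count") > 1).count()
assert dupes == 0, f"{dupes} duplicate order_id values in silver"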

Implementation done criteria:

| Gate | Pass condition |
|---|---|
| Freshness | Product meets defined latency window |
| Replayability | Full recompute succeeds within business recovery target |
| Quality | Silver and gold checks catch contract drift |
| Ownership | Named team handles incidents and schema changes |

🧠 Deep Dive: Schema Drift, Replay Economics, and Cost Control

The Internals: Contracts, Late Data, and Replay Safety

CDC events should include operation type, source position, and schema version. Without these fields, replay and dedupe become guesswork.
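
For illustration, here is a Debezium-style envelope showing that minimum metadata; the field names are an assumption for this sketch, not a requirement of any particular connector.

# Hypothetical CDC envelope (Debezium-style names) carrying the minimum metadata
# needed for safe replay and dedupe downstream.
cdc_event = {
    "op": "u",                        # operation type: c=create, u=update, d=delete
    "source": {
        "table": "orders",
        "lsn": 948201137,             # source position, used for ordering and replay
    },
    "schema_version": 7,              # lets consumers pick the right parser on replay
    "ts_ms": 1714066932000,           # source commit time, not ingest time
    "before": {"order_id": "o-123", "status": "pending"},
    "after":  {"order_id": "o-123", "status": "completed"},
}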

Medallion is effective only when each layer has a contract:

  • Bronze: no business interpretation, source-faithful storage.
  • Silver: cleaned and conformed entities.
  • Gold: explicitly modeled for a consumer decision.

Exactly-once claims should be practical, not marketing-driven. Most teams succeed with idempotent writes + checkpoint discipline.
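
A minimal sketch of that discipline in Structured Streaming, assuming the bronze stream already exposes parsed order columns and that silver.orders shares that schema; paths are hypothetical. The checkpoint records which micro-batches were consumed, and the business-key MERGE makes a retried batch rewrite the same rows instead of appending duplicates.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("idempotent-silver-upsert").getOrCreate()

def upsert_orders(batch_df, batch_id):
    # MERGE keyed on the business key: replaying the same micro-batch after a
    # failure rewrites the same rows rather than inserting them twice.
    (DeltaTable.forPath(spark, "/delta/silver/orders").alias("t")
        .merge(batch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta").load("/delta/bronze/orders")
    .writeStream
    .foreachBatch(upsert_orders)
    .option("checkpointLocation", "/delta/silver/orders/_chk")   # offsets survive restarts
    .start())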

| Drift type | First symptom | Mitigation |
|---|---|---|
| Source schema change | Broken downstream transforms | Contract tests + compatibility checks |
| Late-arriving events | Incorrect windows and aggregates | Watermark strategy + reconciliation jobs |
| Duplicate ingest | Inflated metrics | Business-key dedupe in silver |
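
A minimal sketch of the late-data and duplicate-ingest mitigations in one Structured Streaming job, assuming bronze rows are already parsed into typed columns; the 30-minute watermark is an illustrative bound to tune against the observed late-data percentage.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("late-data-handling").getOrCreate()

events = (spark.readStream.format("delta").load("/delta/bronze/orders")
    # Watermark: accept events up to 30 minutes late before windows finalize.
    .withWatermark("event_ts", "30 minutes")
    # Business-key dedupe within the watermark horizon absorbs duplicate ingest.
    .dropDuplicates(["order_id", "event_ts"]))

hourly = (events
    .groupBy(F.window("event_ts", "1 hour"), "status")
    .agg(F.sum("amount").alias("revenue")))

# Events later than the watermark are dropped here; a periodic reconciliation job
# against bronze closes that gap, as the mitigation column above suggests.
(hourly.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/silver/orders_hourly/_chk")
    .start("/delta/silver/orders_hourly"))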

Performance Analysis: Metrics That Predict Pain Early

| Metric | Why it matters |
|---|---|
| End-to-end freshness by product | Shows if consumers can trust timeliness |
| Replay duration | Defines recovery realism after incidents |
| Late-data percentage | Predicts reconciliation complexity |
| Stream compute cost per TB | Prevents unnoticed cost drift |
| Product-level incident count | Reveals weak ownership boundaries |

Use these metrics in weekly architecture reviews, not only post-incident retros.
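
As one example of turning these metrics into numbers a weekly review can track, here is a hedged sketch computing late-data percentage over the last day; the parsed bronze view name and the 30-minute lateness bound are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("late-data-percentage").getOrCreate()

# Hypothetical view over bronze with the payload already parsed into typed columns.
orders = spark.table("bronze.orders_parsed")

recent = orders.filter(F.col("ingest_ts") >= F.expr("current_timestamp() - INTERVAL 1 DAY"))
late = recent.filter(
    F.col("ingest_ts").cast("long") - F.col("event_ts").cast("long") > 30 * 60
)

total_n, late_n = recent.count(), late.count()
pct = 100.0 * late_n / total_n if total_n else 0.0
print(f"late-data percentage (last 24h): {pct:.2f}%")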

📊 Big-Data Flow: CDC to Curated Data Products

flowchart TD
  A[OLTP and event sources] --> B[CDC or ingest service]
  B --> C[Bronze immutable storage]
  C --> D[Silver validation and conformance]
  D --> E[Gold data products]
  E --> F[BI and dashboards]
  E --> G[ML features and operational services]
  H[Domain owner and SLA] --> E

This diagram maps the full Medallion data flow from raw operational sources to consumer-ready data products. OLTP databases and event streams feed a CDC or ingest service that lands immutable records in the Bronze layer; each subsequent layer — Silver for validation and conformance, Gold for consumer-specific aggregates — adds quality guarantees before data reaches BI tools, ML feature stores, or operational services. The Domain Owner and SLA node emphasizes that architectural correctness alone is insufficient: each Gold product requires a named owner accountable for freshness, quality, and incident response.

🌍 Real-World Applications: Airbnb, Netflix, and LinkedIn

Airbnb's Medallion Architecture (Databricks Delta Lake)

Airbnb rebuilt their analytics platform after the 2019 ETL incident. Their Bronze→Silver→Gold pipeline on Delta Lake now lands raw CDC events within 4 minutes of source commit. Replay from Bronze to Gold covers a full 30-day window in under 90 minutes — down from 18 hours in the old batch model. Each gold dataset has a named domain owner and a schema contract; a new consumer cannot query gold without accepting the published SLA.

Netflix: Data Mesh at 500B+ Events/Day

Netflix's shift to a Data Mesh model — where each of 200+ domain teams publishes a registered data product with a freshness SLA — reduced data incident MTTR from 4 hours to 35 minutes by eliminating the "who owns this table?" escalation loop. Operational metrics are published with < 10-minute freshness; historical datasets are hourly. The central platform team owns tooling, but never data semantics.

LinkedIn: Why They Left Lambda for Samza

LinkedIn's original Lambda setup ran Hadoop (batch) and Storm (speed) separately. A 2016 production audit found 23 metrics with persistent divergence between paths — some drifting for 6+ months. Finance reported a 4% discrepancy in session counts. LinkedIn migrated to Apache Samza for unified stream processing with replay support, eliminating dual-path semantic drift entirely.

| Real system | Pattern used | Key metric |
|---|---|---|
| Airbnb / Databricks | Medallion (CDC→Bronze→Silver→Gold) | 90-min replay, 4-min freshness |
| Netflix | Data Mesh (domain-owned products) | MTTR 4h → 35 min |
| LinkedIn | Kappa / Samza unified stream | Eliminated 23 drifted metrics |

Failure scenario: Before Medallion contracts at Airbnb, one bad ETL job corrupted 43 days of revenue data silently. Every pipeline step was green; no cross-layer correctness check existed. By the time finance discovered the issue during a quarterly audit, 6 weeks of authoritative numbers were wrong. Bronze immutability means even when silver transforms break, the raw data can always be replayed from a known-good starting point.

⚖️ Trade-offs & Failure Modes: Pros, Cons, and Risks Across Pattern Choices

| Pattern decision | Pros | Cons | Main risk | Mitigation |
|---|---|---|---|---|
| CDC-first ingestion | Better lineage and freshness | Connector and schema complexity | Backfill pain | Standardized replay tooling |
| Lambda dual-path | Fast + accurate views | Duplicate logic | Drift between paths | Shared transformation library |
| Kappa single-stream | Conceptual simplicity | Replay cost can spike | Long recovery windows | Regular replay drills |
| Medallion layering | Clear data quality progression | Layer sprawl risk | Unused intermediate sets | Layer contracts tied to consumers |
| Data Mesh ownership | Faster domain decisions | Governance overhead | Inconsistent quality standards | Platform guardrails + product templates |

🧭 Decision Guide: What to Choose First

| Situation | Recommendation |
|---|---|
| Transactional changes are late or inconsistent in analytics | Implement CDC first |
| Teams cannot explain source-to-dashboard lineage | Add Medallion contracts and tests |
| You need <5 minute serving plus daily guaranteed accuracy | Consider Lambda in that domain only |
| Streaming cost is high and replay is rare | Prefer batch-centric model with targeted streaming |
| Domain teams are mature and accountable | Add Mesh ownership incrementally |

🧪 Practical Example: Order + Clickstream Lakehouse Slice

PySpark Medallion Pipeline (Delta Lake pattern from Airbnb/Databricks)

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, col

spark = SparkSession.builder.appName("medallion-orders").getOrCreate()

# ── Bronze: append-only CDC event landing ────────────────────────────────────
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.cdc")
    .load()
    .selectExpr("CAST(value AS STRING) as payload", "timestamp as ingest_ts")
    .writeStream.format("delta").outputMode("append")
    .option("checkpointLocation", "/bronze/orders/_chk")
    .start("/bronze/orders"))

# ── Silver: dedupe + validate via MERGE (idempotent on rerun) ─────────────────
# Assumes bronze.orders is registered over /bronze/orders with the CDC payload
# already parsed into typed columns. Filtering rn = 1 inside the source (not in
# the ON clause) guarantees at most one source row per order_id.
spark.sql("""
  MERGE INTO silver.orders AS t
  USING (
    SELECT order_id, customer_id, amount, status, event_ts FROM (
      SELECT order_id, customer_id, amount, status, event_ts,
             ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) AS rn
      FROM bronze.orders
      WHERE ingest_ts >= current_timestamp() - INTERVAL 10 MINUTES
    ) ranked WHERE rn = 1
  ) AS s ON t.order_id = s.order_id
  WHEN MATCHED AND s.status != t.status
    THEN UPDATE SET t.status = s.status, t.updated_at = current_timestamp()
  WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, status, event_ts, updated_at)
    VALUES (s.order_id, s.customer_id, s.amount, s.status, s.event_ts, current_timestamp())
""")

# ── Gold: finance daily revenue aggregate ────────────────────────────────────
spark.sql("""
  INSERT OVERWRITE gold.finance_revenue_daily
  SELECT DATE(event_ts) AS day, SUM(amount) AS revenue,
         COUNT(DISTINCT order_id) AS order_count
  FROM silver.orders WHERE status = 'completed'
  GROUP BY DATE(event_ts)
""")

Key design points from production:

  • Bronze is append-only — bad transforms cannot corrupt it; replay always works from the same source.
  • Silver MERGE deduplicates by business key + latest event_ts — idempotent on every rerun.
  • Gold uses INSERT OVERWRITE — downstream consumers always see a complete, consistent day.
  • Target freshness: Bronze within 4 min → Silver within 8 min → Gold within 15 min.

Implementation checklist for this slice:

  1. Define gold consumers and acceptance tests first.
  2. Build silver quality checks before dashboarding.
  3. Run one replay from bronze to gold weekly (target: < 90 min for 30-day window).
  4. Review cost per gold product monthly.

Operator Field Note: What Fails First in Production

The Airbnb pattern — corrupted data platforms rarely announce themselves loudly. At Airbnb, the 2019 revenue corruption ran silently for 43 days because every individual pipeline step appeared healthy: ETL jobs ran on schedule, row counts looked normal, dashboards showed numbers. What was missing was a cross-layer correctness check — does gold actually match what silver validated? Without Medallion contracts, there was no layer that owned the answer to that question.

At LinkedIn, 23 metrics drifted between speed and batch paths for months. The signal was a finance partner comparing quarterly reports across two systems — not an automated alert. By the time it was investigated, drift had been running for 6+ months in some metrics. The pattern is consistent: the signal that breaks first is a business metric that "doesn't feel right," not a system alert.

  • Early warning signal: a gold dataset's row count or aggregate value diverges from expected range by more than 1% across two consecutive runs (a check for this is sketched after this list).
  • First containment move: pin consumers to the last-known-good gold snapshot and trigger a targeted silver→gold replay from bronze.
  • Escalate immediately when: divergence affects any metric with financial, compliance, or fraud-scoring implications.
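
A hedged sketch of that early-warning check, using Delta time travel to compare the two most recent gold versions; the table path and the 1% threshold mirror the rule above and should be tuned per product.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("gold-divergence-check").getOrCreate()
path = "/delta/gold/finance_revenue_daily"   # hypothetical gold table path

# Latest and previous committed versions of the gold table (requires >= 2 versions).
history = DeltaTable.forPath(spark, path).history(2).select("version").collect()
latest_v, prev_v = history[0]["version"], history[1]["version"]

current  = spark.read.format("delta").option("versionAsOf", latest_v).load(path)
previous = spark.read.format("delta").option("versionAsOf", prev_v).load(path)

# Run-over-run divergence on the headline aggregate.
cur_rev  = current.agg(F.sum("revenue")).collect()[0][0] or 0.0
prev_rev = previous.agg(F.sum("revenue")).collect()[0][0] or 0.0

if prev_rev and abs(cur_rev - prev_rev) / prev_rev > 0.01:
    # Containment: pin consumers to prev_v and trigger a bronze-to-gold replay.
    print(f"ALERT: gold revenue moved {100 * (cur_rev - prev_rev) / prev_rev:.2f}% between runs")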

15-Minute SRE Drill

  1. Replay one bounded failure case in staging.
  2. Capture one metric, one trace, and one log that prove the guardrail worked.
  3. Update the runbook with exact rollback command and owner on call.

Apache Spark is the dominant distributed compute engine for big data batch and micro-batch processing; Delta Lake adds ACID transactions, schema enforcement, and time-travel replay to Spark's storage layer — the combination that powers the Medallion architecture at Airbnb and Databricks. Apache Flink handles the sub-second streaming tier when Spark's micro-batch latency is too high.

How it solves the problem: The PySpark Medallion snippet in the Practical Example section above uses Delta Lake's MERGE INTO for idempotent silver writes and INSERT OVERWRITE for atomic gold refreshes. Delta Lake's transaction log makes bronze append-only and enables full-table replay from any committed snapshot — the property that recovered Airbnb's 43 days of revenue data in 90 minutes.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_timestamp, row_number
from pyspark.sql.window import Window

spark = (SparkSession.builder
    .appName("medallion-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# ── Bronze: append-only CDC landing with schema enforcement ──────────────────
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.cdc")
    .load()
    .selectExpr("CAST(value AS STRING) as payload", "timestamp as ingest_ts")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/bronze/orders/_chk")
    .option("mergeSchema", "false")   # reject schema drift at bronze boundary
    .start("/delta/bronze/orders"))

# ── Silver: idempotent MERGE deduplication via Delta ACID ────────────────────
# Assumes the bronze payload has already been parsed into typed order columns
# (order_id, customer_id, amount, status, event_ts) before this step.
bronze_df = (spark.read.format("delta").load("/delta/bronze/orders")
    .select("order_id", "customer_id", "amount", "status", "event_ts"))

window = Window.partitionBy("order_id").orderBy(col("event_ts").desc())
deduped = (bronze_df
    .withColumn("rn", row_number().over(window))
    .filter(col("rn") == 1)
    .drop("rn"))

silver_table = DeltaTable.forPath(spark, "/delta/silver/orders")
silver_table.alias("t").merge(
    deduped.alias("s"),
    "t.order_id = s.order_id"
).whenMatchedUpdate(
    condition="s.status != t.status",
    set={"status": "s.status", "updated_at": current_timestamp()}
).whenNotMatchedInsert(values={
    "order_id": "s.order_id", "customer_id": "s.customer_id", "amount": "s.amount",
    "status": "s.status", "event_ts": "s.event_ts", "updated_at": current_timestamp()
}).execute()

# ── Gold: finance daily aggregate (INSERT OVERWRITE = idempotent) ─────────────
spark.sql("""
  INSERT OVERWRITE gold.finance_revenue_daily
  SELECT DATE(event_ts) AS day,
         SUM(amount)            AS revenue,
         COUNT(DISTINCT order_id) AS order_count
  FROM   silver.orders
  WHERE  status = 'completed'
  GROUP  BY DATE(event_ts)
""")

# ── Time-travel replay: recover gold from bronze after an ETL bug ─────────────
# Delta Lake retains full transaction log — replay from any snapshot version
bronze_at_bug = (spark.read
    .format("delta")
    .option("versionAsOf", 142)        # last known good version before the bug
    .load("/delta/bronze/orders"))
# Re-run silver and gold jobs against bronze_at_bug to recover 43 days of data

For sub-minute streaming (fraud features requiring < 2-minute freshness), Apache Flink replaces Spark's micro-batch tier:

# Apache Flink Python API: fraud feature stream with 1-minute windows
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env   = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Source table over the orders CDC topic; the event-time watermark bounds late data
t_env.execute_sql("""
  CREATE TABLE orders_cdc (
    order_id   STRING,
    customer_id STRING,
    amount     DOUBLE,
    event_ts   TIMESTAMP(3),
    WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
  ) WITH (
    'connector' = 'kafka',
    'topic'     = 'orders.cdc',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format'    = 'json'
  )
""")

# 1-minute tumbling window: fraud feature for high-velocity spend detection
# (assumes a fraud_features sink table has been declared elsewhere)
t_env.execute_sql("""
  INSERT INTO fraud_features
  SELECT customer_id,
         TUMBLE_END(event_ts, INTERVAL '1' MINUTE) AS window_end,
         SUM(amount)  AS spend_1m,
         COUNT(*)     AS order_count_1m
  FROM   orders_cdc
  GROUP  BY customer_id,
            TUMBLE(event_ts, INTERVAL '1' MINUTE)
""")

For a full deep-dive on Apache Spark, Delta Lake, and Apache Flink in production data pipelines, a dedicated follow-up post is planned.

📚 Lessons Learned

  • Pattern selection should be driven by freshness, replay, and ownership requirements.
  • CDC and Medallion are often the most practical first moves.
  • Lambda/Kappa decisions should be scoped per domain, not platform-wide ideology.
  • Data Mesh succeeds only when platform guardrails are strong.
  • Replay economics should be measured before incidents force urgency.

📌 TLDR: Summary & Key Takeaways

  • Big-data architecture is about controlled flow plus accountable ownership.
  • Use explicit contracts at ingestion, curation, and serving layers.
  • Design replay and late-data handling before scale.
  • Match streaming complexity to business freshness requirements.
  • Prefer incremental pattern adoption with measurable SLO outcomes.