
Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh

Choose ingestion, serving, and ownership patterns deliberately when data platforms start to scale.

Abstract Algorithms · 15 min read

AI-assisted content.

TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how quality is staged, how serving layers are materialized, how replay works, and who owns each data product over time. Lambda, Kappa, CDC, Medallion, and Data Mesh are patterns for making those choices explicit.

In 2019, Airbnb's analytics team discovered six weeks of revenue reports had been silently wrong — a broken ETL job had been overwriting validated silver-layer data with raw, unvalidated rows for 43 days. Every pipeline step showed green. No alert fired. The issue surfaced only during a finance audit. The fix wasn't better hardware: it was explicit ownership contracts between raw, validated, and gold-layer data. Patterns like Medallion, CDC, and Data Mesh are the answer to incidents like this one — not theoretical constructs, but scars from real postmortems.

Here is what the Medallion fix looks like in practice — one order event flowing through layers:

| Layer | What it holds | Who can mutate it | What a bug here means |
|---|---|---|---|
| Bronze | Raw CDC event, source-faithful, append-only | Nobody — immutable | Nothing; replay from source always works |
| Silver | Validated, deduped order rows | Silver ETL job only | Downstream Gold breaks; Bronze replay recovers |
| Gold | Finance daily revenue aggregate | Gold job + owner | Dashboard shows wrong numbers; replay from Silver recovers |
| Consumer | BI tool / ML feature | Read-only | Downstream impact only; no data loss |

When the Airbnb ETL bug hit Silver, Bronze was untouched. Replay from Bronze to Gold took 90 minutes and recovered all 43 days of data.

📖 Why These Big-Data Patterns Matter in Real Teams

Most platform incidents are not caused by storage technology. They come from unclear ingestion boundaries, expensive reprocessing, and missing ownership for quality.

Practical architecture questions:

  • Do we need minute-level freshness, or is hourly fine?
  • Can we replay yesterday safely if a bug slips in?
  • Who signs off quality for each gold dataset?
  • What is the cost ceiling for always-on streaming?

| Problem you see | Pattern that helps first |
|---|---|
| OLTP and analytics keep drifting | CDC for source-of-truth propagation |
| Raw data is hard to trust | Medallion-style curation contracts |
| Teams fight over central backlog | Data Mesh ownership model |
| Need both fast and recomputable views | Lambda or Kappa decision |

🔍 When to Use Lambda, Kappa, CDC, Medallion, and Data Mesh

| Pattern | Use when | Avoid when | Practical first step |
|---|---|---|---|
| CDC | You need reliable change propagation from transactional systems | Source systems do not expose stable change logs | Start with one critical table family |
| Lambda | You require low-latency serving plus deterministic batch recompute | Team cannot support dual pipeline logic | Reuse shared transformations across batch and speed paths |
| Kappa | Streaming is mature and replay from log is operationally proven | Data is not append-friendly or stream cost is unjustified | Prove one replay drill on real backlog |
| Medallion | You need clear raw-to-curated progression | Layers are created without consumer contracts | Define bronze/silver/gold acceptance rules |
| Data Mesh | Domain teams can own and operate data products | Platform standards and tooling are immature | Pilot with one domain and one product SLA |

When not to over-engineer

  • If freshness target is daily and replay is rare, simple batch may be enough.
  • If one team owns everything today, full Data Mesh can add overhead too early.

📊 Lambda Architecture: Dual-Path Batch and Speed

flowchart LR
  A[Event Stream] --> B[Speed Layer Kafka / Flink]
  A --> C[Batch Layer Spark / Hadoop]
  B --> D[Real-time Views low latency]
  C --> E[Batch Views high accuracy]
  D --> F[Serving Layer merge both views]
  E --> F
  F --> G[Consumer Queries]

This diagram illustrates the Lambda Architecture's dual-path design, where the same event stream feeds both a real-time Speed Layer (Kafka/Flink) and a high-accuracy Batch Layer (Spark/Hadoop). Both paths converge at the Serving Layer, which merges low-latency views from the speed path with fully recomputed batch views to answer consumer queries. The key takeaway is that Lambda buys both freshness and accuracy at the cost of maintaining two parallel processing pipelines with potentially divergent semantics — the exact drift problem LinkedIn discovered across 23 metrics before moving to Kappa.
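
To make the merge concrete, here is a minimal PySpark sketch of the serving-layer rule: batch views win wherever they exist, and the speed layer fills in only the periods the batch job has not yet covered. The table names and the day-level cutoff are illustrative assumptions, not any specific production layout.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-serving-merge").getOrCreate()

# Hypothetical serving views: the batch view is authoritative up to its last
# complete day; the speed view only covers the most recent, still-open period.
batch_view = spark.table("serving.batch_revenue_daily")   # recomputed nightly
speed_view = spark.table("serving.speed_revenue_daily")   # updated continuously

batch_cutoff = batch_view.agg(F.max("day")).collect()[0][0]

# Serving rule: prefer batch results where they exist; use the speed layer only
# for days the batch recompute has not reached yet.
merged = batch_view.unionByName(
    speed_view.filter(F.col("day") > F.lit(batch_cutoff))
)
merged.createOrReplaceTempView("serving_revenue_daily")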

📊 Kappa Architecture: Single Stream Path

flowchart LR
  A[Event Stream] --> B[Stream Layer Kafka + Flink]
  B --> C[Serving Layer Materialized Views]
  C --> D[Consumer Queries]
  B --> E[Log Replay for corrections]
  E --> C

Lambda keeps batch and speed paths separate — accurate but operationally duplicated. Kappa unifies them into one stream with replay; simpler to own, but replay cost can spike on large backlogs.
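
A replay in Kappa is simple to describe: re-read the retained log from the earliest offset into a fresh materialized view, then cut consumers over once it catches up. A minimal sketch, assuming a Kafka topic with sufficient retention; the topic name and paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-replay").getOrCreate()

# Replay: re-read the full retained log from the earliest offset and rebuild the
# materialized view under a new path, then switch consumers when it catches up.
replay = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.events")
    .option("startingOffsets", "earliest")        # full-history replay
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ingest_ts"))

(replay.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/serving/orders_v2/_chk")   # fresh checkpoint
    .start("/serving/orders_v2"))                 # new view, built in parallel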

⚙️ How the Architecture Works End-to-End

  1. Capture source changes via CDC/events/batch.
  2. Land immutable raw records in bronze.
  3. Validate and normalize into silver with data quality checks.
  4. Publish consumer-specific gold products.
  5. Track ownership, lineage, and freshness SLOs per product.

| Stage | Non-negotiable control | Common anti-pattern |
|---|---|---|
| Ingestion | Replayable offsets/checkpoints | Polling snapshots with no lineage |
| Bronze | Immutable append and schema capture | Mutating raw data in place |
| Silver | Rule-based validation and dedupe | Silent cleanup logic buried in notebooks |
| Gold | Explicit contract per consumer | One gold table pretending to fit all use cases |
| Ownership | Named owner and support SLA | Central platform team owning semantics for every domain |

🛠️ How to Implement: Practical 12-Step Plan

  1. Pick one high-value domain (orders, payments, clickstream).
  2. Define target freshness and correctness SLO per consumer.
  3. Add CDC or event ingest with checkpointed offsets.
  4. Land bronze as immutable with schema version metadata.
  5. Build silver validations: null checks, type rules, dedupe keys.
  6. Create first gold product with explicit consumer contract.
  7. Add data tests for freshness, volume anomaly, and key integrity (see the sketch after this list).
  8. Run replay drill from bronze to gold and record duration.
  9. Publish data product owner and escalation path.
  10. Track unit economics: compute spend per data product.
  11. Expand to second domain only after first meets SLO for 2-4 weeks.
  12. Introduce Mesh governance only where domain ownership is stable.
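
For step 7, a minimal sketch of the three data tests, assuming the silver.orders table from the practical example later in this post; the 15-minute freshness window and 50% volume threshold are illustrative, not prescriptive.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver-data-tests").getOrCreate()
orders = spark.table("silver.orders")   # table name reused from the example slice

# Freshness: the newest validated event must fall inside the agreed latency window.
lag_min = orders.agg(
    ((F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp(F.max("event_ts"))) / 60)
    .alias("lag_min")
).collect()[0]["lag_min"]
assert lag_min is not None and lag_min < 15, f"silver freshness breached: {lag_min:.1f} min"

# Volume anomaly: newest day's row count vs. trailing 7-day average.
daily = (orders.groupBy(F.to_date("event_ts").alias("day")).count()
               .orderBy(F.col("day").desc()).limit(8).collect())
latest, baseline = daily[0]["count"], [r["count"] for r in daily[1:]]
if baseline:
    avg = sum(baseline) / len(baseline)
    assert abs(latest - avg) / avg < 0.5, f"volume anomaly: {latest} rows vs avg {avg:.0f}"

# Key integrity: order_id must remain unique after the silver dedupe.
dupes = orders.groupBy("order_id").count().filter(F.col("count") > 1).count()
assert dupes == 0, f"{dupes} duplicate order_id values in silver"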

Implementation done criteria:

| Gate | Pass condition |
|---|---|
| Freshness | Product meets defined latency window |
| Replayability | Full recompute succeeds within business recovery target |
| Quality | Silver and gold checks catch contract drift |
| Ownership | Named team handles incidents and schema changes |

🧠 Deep Dive: Schema Drift, Replay Economics, and Cost Control

The Internals: Contracts, Late Data, and Replay Safety

CDC events should include operation type, source position, and schema version. Without these fields, replay and dedupe become guesswork.
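
For illustration, here is a Debezium-style envelope showing that minimum metadata; the field names are an assumption for this sketch, not a requirement of any particular connector.

# Hypothetical CDC envelope (Debezium-style names) carrying the minimum metadata
# needed for safe replay and dedupe downstream.
cdc_event = {
    "op": "u",                        # operation type: c=create, u=update, d=delete
    "source": {
        "table": "orders",
        "lsn": 948201137,             # source position, used for ordering and replay
    },
    "schema_version": 7,              # lets consumers pick the right parser on replay
    "ts_ms": 1714066932000,           # source commit time, not ingest time
    "before": {"order_id": "o-123", "status": "pending"},
    "after":  {"order_id": "o-123", "status": "completed"},
}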

Medallion is effective only when each layer has a contract:

  • Bronze: no business interpretation, source-faithful storage.
  • Silver: cleaned and conformed entities.
  • Gold: explicitly modeled for a consumer decision.

Exactly-once claims should be practical, not marketing-driven. Most teams succeed with idempotent writes + checkpoint discipline.
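
A minimal sketch of that discipline in Structured Streaming, assuming the bronze stream already exposes parsed order columns and that silver.orders shares that schema; paths are hypothetical. The checkpoint records which micro-batches were consumed, and the business-key MERGE makes a retried batch rewrite the same rows instead of appending duplicates.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("idempotent-silver-upsert").getOrCreate()

def upsert_orders(batch_df, batch_id):
    # MERGE keyed on the business key: replaying the same micro-batch after a
    # failure rewrites the same rows rather than inserting them twice.
    (DeltaTable.forPath(spark, "/delta/silver/orders").alias("t")
        .merge(batch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta").load("/delta/bronze/orders")
    .writeStream
    .foreachBatch(upsert_orders)
    .option("checkpointLocation", "/delta/silver/orders/_chk")   # offsets survive restarts
    .start())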

| Drift type | First symptom | Mitigation |
|---|---|---|
| Source schema change | Broken downstream transforms | Contract tests + compatibility checks |
| Late-arriving events | Incorrect windows and aggregates | Watermark strategy + reconciliation jobs |
| Duplicate ingest | Inflated metrics | Business-key dedupe in silver |
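
A minimal sketch of the late-data and duplicate-ingest mitigations in one Structured Streaming job, assuming bronze rows are already parsed into typed columns; the 30-minute watermark is an illustrative bound to tune against the observed late-data percentage.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("late-data-handling").getOrCreate()

events = (spark.readStream.format("delta").load("/delta/bronze/orders")
    # Watermark: accept events up to 30 minutes late before windows finalize.
    .withWatermark("event_ts", "30 minutes")
    # Business-key dedupe within the watermark horizon absorbs duplicate ingest.
    .dropDuplicates(["order_id", "event_ts"]))

hourly = (events
    .groupBy(F.window("event_ts", "1 hour"), "status")
    .agg(F.sum("amount").alias("revenue")))

# Events later than the watermark are dropped here; a periodic reconciliation job
# against bronze closes that gap, as the mitigation column above suggests.
(hourly.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/silver/orders_hourly/_chk")
    .start("/delta/silver/orders_hourly"))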

Performance Analysis: Metrics That Predict Pain Early

| Metric | Why it matters |
|---|---|
| End-to-end freshness by product | Shows if consumers can trust timeliness |
| Replay duration | Defines recovery realism after incidents |
| Late-data percentage | Predicts reconciliation complexity |
| Stream compute cost per TB | Prevents unnoticed cost drift |
| Product-level incident count | Reveals weak ownership boundaries |

Use these metrics in weekly architecture reviews, not only post-incident retros.
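
As one example of turning these metrics into numbers a weekly review can track, here is a hedged sketch computing late-data percentage over the last day; the parsed bronze view name and the 30-minute lateness bound are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("late-data-percentage").getOrCreate()

# Hypothetical view over bronze with the payload already parsed into typed columns.
orders = spark.table("bronze.orders_parsed")

recent = orders.filter(F.col("ingest_ts") >= F.expr("current_timestamp() - INTERVAL 1 DAY"))
late = recent.filter(
    F.col("ingest_ts").cast("long") - F.col("event_ts").cast("long") > 30 * 60
)

total_n, late_n = recent.count(), late.count()
pct = 100.0 * late_n / total_n if total_n else 0.0
print(f"late-data percentage (last 24h): {pct:.2f}%")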

📊 Big-Data Flow: CDC to Curated Data Products

flowchart TD
  A[OLTP and event sources] --> B[CDC or ingest service]
  B --> C[Bronze immutable storage]
  C --> D[Silver validation and conformance]
  D --> E[Gold data products]
  E --> F[BI and dashboards]
  E --> G[ML features and operational services]
  H[Domain owner and SLA] --> E

This diagram maps the full Medallion data flow from raw operational sources to consumer-ready data products. OLTP databases and event streams feed a CDC or ingest service that lands immutable records in the Bronze layer; each subsequent layer — Silver for validation and conformance, Gold for consumer-specific aggregates — adds quality guarantees before data reaches BI tools, ML feature stores, or operational services. The Domain Owner and SLA node emphasizes that architectural correctness alone is insufficient: each Gold product requires a named owner accountable for freshness, quality, and incident response.

🌍 Real-World Applications: Airbnb, Netflix, and LinkedIn

Airbnb's Medallion Architecture (Databricks Delta Lake)

Airbnb rebuilt their analytics platform after the 2019 ETL incident. Their Bronze→Silver→Gold pipeline on Delta Lake now lands raw CDC events within 4 minutes of source commit. Replay from Bronze to Gold covers a full 30-day window in under 90 minutes — down from 18 hours in the old batch model. Each gold dataset has a named domain owner and a schema contract; a new consumer cannot query gold without accepting the published SLA.

Netflix: Data Mesh at 500B+ Events/Day

Netflix's shift to a Data Mesh model — where each of 200+ domain teams publishes a registered data product with a freshness SLA — reduced data incident MTTR from 4 hours to 35 minutes by eliminating the "who owns this table?" escalation loop. Operational metrics are published with < 10-minute freshness; historical datasets are hourly. The central platform team owns tooling, but never data semantics.

LinkedIn: Why They Left Lambda for Samza

LinkedIn's original Lambda setup ran Hadoop (batch) and Storm (speed) separately. A 2016 production audit found 23 metrics with persistent divergence between paths — some drifting for 6+ months. Finance reported a 4% discrepancy in session counts. LinkedIn migrated to Apache Samza for unified stream processing with replay support, eliminating dual-path semantic drift entirely.

| Real system | Pattern used | Key metric |
|---|---|---|
| Airbnb / Databricks | Medallion (CDC→Bronze→Silver→Gold) | 90-min replay, 4-min freshness |
| Netflix | Data Mesh (domain-owned products) | MTTR 4h → 35 min |
| LinkedIn | Kappa / Samza unified stream | Eliminated 23 drifted metrics |

Failure scenario: Before Medallion contracts at Airbnb, one bad ETL job corrupted 43 days of revenue data silently. Every pipeline step was green; no cross-layer correctness check existed. By the time finance discovered the issue during a quarterly audit, 6 weeks of authoritative numbers were wrong. Bronze immutability means even when silver transforms break, the raw data can always be replayed from a known-good starting point.

⚖️ Trade-offs & Failure Modes: Pros, Cons, and Risks Across Pattern Choices

| Pattern decision | Pros | Cons | Main risk | Mitigation |
|---|---|---|---|---|
| CDC-first ingestion | Better lineage and freshness | Connector and schema complexity | Backfill pain | Standardized replay tooling |
| Lambda dual-path | Fast + accurate views | Duplicate logic | Drift between paths | Shared transformation library |
| Kappa single-stream | Conceptual simplicity | Replay cost can spike | Long recovery windows | Regular replay drills |
| Medallion layering | Clear data quality progression | Layer sprawl risk | Unused intermediate sets | Layer contracts tied to consumers |
| Data Mesh ownership | Faster domain decisions | Governance overhead | Inconsistent quality standards | Platform guardrails + product templates |

🧭 Decision Guide: What to Choose First

| Situation | Recommendation |
|---|---|
| Transactional changes are late or inconsistent in analytics | Implement CDC first |
| Teams cannot explain source-to-dashboard lineage | Add Medallion contracts and tests |
| You need <5 minute serving plus daily guaranteed accuracy | Consider Lambda in that domain only |
| Streaming cost is high and replay is rare | Prefer batch-centric model with targeted streaming |
| Domain teams are mature and accountable | Add Mesh ownership incrementally |

🧪 Practical Example: Order + Clickstream Lakehouse Slice

PySpark Medallion Pipeline (Delta Lake pattern from Airbnb/Databricks)

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, col

spark = SparkSession.builder.appName("medallion-orders").getOrCreate()

# ── Bronze: append-only CDC event landing ────────────────────────────────────
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.cdc")
    .load()
    .selectExpr("CAST(value AS STRING) as payload", "timestamp as ingest_ts")
    .writeStream.format("delta").outputMode("append")
    .option("checkpointLocation", "/bronze/orders/_chk")
    .start("/bronze/orders"))

# ── Silver: dedupe + validate via MERGE (idempotent on rerun) ─────────────────
# Assumes bronze.orders is registered over /bronze/orders with the CDC payload
# already parsed into typed columns. Filtering rn = 1 inside the source (not in
# the ON clause) guarantees at most one source row per order_id.
spark.sql("""
  MERGE INTO silver.orders AS t
  USING (
    SELECT order_id, customer_id, amount, status, event_ts FROM (
      SELECT order_id, customer_id, amount, status, event_ts,
             ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) AS rn
      FROM bronze.orders
      WHERE ingest_ts >= current_timestamp() - INTERVAL 10 MINUTES
    ) ranked WHERE rn = 1
  ) AS s ON t.order_id = s.order_id
  WHEN MATCHED AND s.status != t.status
    THEN UPDATE SET t.status = s.status, t.updated_at = current_timestamp()
  WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, status, event_ts, updated_at)
    VALUES (s.order_id, s.customer_id, s.amount, s.status, s.event_ts, current_timestamp())
""")

# ── Gold: finance daily revenue aggregate ────────────────────────────────────
spark.sql("""
  INSERT OVERWRITE gold.finance_revenue_daily
  SELECT DATE(event_ts) AS day, SUM(amount) AS revenue,
         COUNT(DISTINCT order_id) AS order_count
  FROM silver.orders WHERE status = 'completed'
  GROUP BY DATE(event_ts)
""")

Key design points from production:

  • Bronze is append-only — bad transforms cannot corrupt it; replay always works from the same source.
  • Silver MERGE deduplicates by business key + latest event_ts — idempotent on every rerun.
  • Gold uses INSERT OVERWRITE — downstream consumers always see a complete, consistent day.
  • Target freshness: Bronze within 4 min → Silver within 8 min → Gold within 15 min.

Implementation checklist for this slice:

  1. Define gold consumers and acceptance tests first.
  2. Build silver quality checks before dashboarding.
  3. Run one replay from bronze to gold weekly (target: < 90 min for 30-day window).
  4. Review cost per gold product monthly.

Operator Field Note: What Fails First in Production

The Airbnb pattern — corrupted data platforms rarely announce themselves loudly. At Airbnb, the 2019 revenue corruption ran silently for 43 days because every individual pipeline step appeared healthy: ETL jobs ran on schedule, row counts looked normal, dashboards showed numbers. What was missing was a cross-layer correctness check — does gold actually match what silver validated? Without Medallion contracts, there was no layer that owned the answer to that question.

At LinkedIn, 23 metrics drifted between speed and batch paths for months. The signal was a finance partner comparing quarterly reports across two systems — not an automated alert. By the time it was investigated, drift had been running for 6+ months in some metrics. The pattern is consistent: the signal that breaks first is a business metric that "doesn't feel right," not a system alert.

  • Early warning signal: a gold dataset's row count or aggregate value diverges from expected range by more than 1% across two consecutive runs (a check for this is sketched after this list).
  • First containment move: pin consumers to the last-known-good gold snapshot and trigger a targeted silver→gold replay from bronze.
  • Escalate immediately when: divergence affects any metric with financial, compliance, or fraud-scoring implications.
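
A hedged sketch of that early-warning check, using Delta time travel to compare the two most recent gold versions; the table path and the 1% threshold mirror the rule above and should be tuned per product.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("gold-divergence-check").getOrCreate()
path = "/delta/gold/finance_revenue_daily"   # hypothetical gold table path

# Latest and previous committed versions of the gold table (requires >= 2 versions).
history = DeltaTable.forPath(spark, path).history(2).select("version").collect()
latest_v, prev_v = history[0]["version"], history[1]["version"]

current  = spark.read.format("delta").option("versionAsOf", latest_v).load(path)
previous = spark.read.format("delta").option("versionAsOf", prev_v).load(path)

# Run-over-run divergence on the headline aggregate.
cur_rev  = current.agg(F.sum("revenue")).collect()[0][0] or 0.0
prev_rev = previous.agg(F.sum("revenue")).collect()[0][0] or 0.0

if prev_rev and abs(cur_rev - prev_rev) / prev_rev > 0.01:
    # Containment: pin consumers to prev_v and trigger a bronze-to-gold replay.
    print(f"ALERT: gold revenue moved {100 * (cur_rev - prev_rev) / prev_rev:.2f}% between runs")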

15-Minute SRE Drill

  1. Replay one bounded failure case in staging.
  2. Capture one metric, one trace, and one log that prove the guardrail worked.
  3. Update the runbook with exact rollback command and owner on call.

Apache Spark is the dominant distributed compute engine for big data batch and micro-batch processing; Delta Lake adds ACID transactions, schema enforcement, and time-travel replay to Spark's storage layer — the combination that powers the Medallion architecture at Airbnb and Databricks. Apache Flink handles the sub-second streaming tier when Spark's micro-batch latency is too high.

How it solves the problem: The PySpark Medallion snippet in the Practical Example section above uses Delta Lake's MERGE INTO for idempotent silver writes and INSERT OVERWRITE for atomic gold refreshes. Delta Lake's transaction log makes bronze append-only and enables full-table replay from any committed snapshot — the property that recovered Airbnb's 43 days of revenue data in 90 minutes.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_timestamp, row_number
from pyspark.sql.window import Window

spark = (SparkSession.builder
    .appName("medallion-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# ── Bronze: append-only CDC landing with schema enforcement ──────────────────
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.cdc")
    .load()
    .selectExpr("CAST(value AS STRING) as payload", "timestamp as ingest_ts")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/bronze/orders/_chk")
    .option("mergeSchema", "false")   # reject schema drift at bronze boundary
    .start("/delta/bronze/orders"))

# ── Silver: idempotent MERGE deduplication via Delta ACID ────────────────────
# Assumes the bronze payload has already been parsed into typed order columns
# (order_id, customer_id, amount, status, event_ts) before this step.
bronze_df = (spark.read.format("delta").load("/delta/bronze/orders")
    .select("order_id", "customer_id", "amount", "status", "event_ts"))

window = Window.partitionBy("order_id").orderBy(col("event_ts").desc())
deduped = (bronze_df
    .withColumn("rn", row_number().over(window))
    .filter(col("rn") == 1)
    .drop("rn"))

silver_table = DeltaTable.forPath(spark, "/delta/silver/orders")
silver_table.alias("t").merge(
    deduped.alias("s"),
    "t.order_id = s.order_id"
).whenMatchedUpdate(
    condition="s.status != t.status",
    set={"status": "s.status", "updated_at": current_timestamp()}
).whenNotMatchedInsert(values={
    "order_id": "s.order_id", "customer_id": "s.customer_id", "amount": "s.amount",
    "status": "s.status", "event_ts": "s.event_ts", "updated_at": current_timestamp()
}).execute()

# ── Gold: finance daily aggregate (INSERT OVERWRITE = idempotent) ─────────────
spark.sql("""
  INSERT OVERWRITE gold.finance_revenue_daily
  SELECT DATE(event_ts) AS day,
         SUM(amount)            AS revenue,
         COUNT(DISTINCT order_id) AS order_count
  FROM   silver.orders
  WHERE  status = 'completed'
  GROUP  BY DATE(event_ts)
""")

# ── Time-travel replay: recover gold from bronze after an ETL bug ─────────────
# Delta Lake retains full transaction log — replay from any snapshot version
bronze_at_bug = (spark.read
    .format("delta")
    .option("versionAsOf", 142)        # last known good version before the bug
    .load("/delta/bronze/orders"))
# Re-run silver and gold jobs against bronze_at_bug to recover 43 days of data

For sub-minute streaming (fraud features requiring < 2-minute freshness), Apache Flink replaces Spark's micro-batch tier:

# Apache Flink Python API: fraud feature stream with 1-minute windows
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env   = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Source table over the orders CDC topic; the event-time watermark bounds late data
t_env.execute_sql("""
  CREATE TABLE orders_cdc (
    order_id   STRING,
    customer_id STRING,
    amount     DOUBLE,
    event_ts   TIMESTAMP(3),
    WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
  ) WITH (
    'connector' = 'kafka',
    'topic'     = 'orders.cdc',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format'    = 'json'
  )
""")

# 1-minute tumbling window: fraud feature for high-velocity spend detection
# (assumes a fraud_features sink table has been declared elsewhere)
t_env.execute_sql("""
  INSERT INTO fraud_features
  SELECT customer_id,
         TUMBLE_END(event_ts, INTERVAL '1' MINUTE) AS window_end,
         SUM(amount)  AS spend_1m,
         COUNT(*)     AS order_count_1m
  FROM   orders_cdc
  GROUP  BY customer_id,
            TUMBLE(event_ts, INTERVAL '1' MINUTE)
""")

For a full deep-dive on Apache Spark, Delta Lake, and Apache Flink in production data pipelines, a dedicated follow-up post is planned.

📚 Lessons Learned

  • Pattern selection should be driven by freshness, replay, and ownership requirements.
  • CDC and Medallion are often the most practical first moves.
  • Lambda/Kappa decisions should be scoped per domain, not platform-wide ideology.
  • Data Mesh succeeds only when platform guardrails are strong.
  • Replay economics should be measured before incidents force urgency.

📌 TLDR: Summary & Key Takeaways

  • Big-data architecture is about controlled flow plus accountable ownership.
  • Use explicit contracts at ingestion, curation, and serving layers.
  • Design replay and late-data handling before scale.
  • Match streaming complexity to business freshness requirements.
  • Prefer incremental pattern adoption with measurable SLO outcomes.