Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh
Choose ingestion, serving, and ownership patterns deliberately when data platforms start to scale.
TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how serving layers are materialized, and who owns quality over time. Lambda, Kappa, CDC, Medallion, and Data Mesh are patterns for making those choices explicit.
In 2019, Airbnb's analytics team discovered six weeks of revenue reports had been silently wrong — a broken ETL job had been overwriting validated silver-layer data with raw, unvalidated rows for 43 days. Every pipeline step showed green. No alert fired. The issue surfaced only during a finance audit. The fix wasn't better hardware: it was explicit ownership contracts between raw, validated, and gold-layer data. That incident is why Medallion, CDC, and Data Mesh exist — not as theoretical constructs, but as scars from real postmortems.
Here is what the Medallion fix looks like in practice — one order event flowing through layers:
| Layer | What it holds | Who can mutate it | What a bug here means |
| --- | --- | --- | --- |
| Bronze | Raw CDC event, source-faithful, append-only | Nobody — immutable | Nothing; replay from source always works |
| Silver | Validated, deduped order rows | Silver ETL job only | Downstream Gold breaks; Bronze replay recovers |
| Gold | Finance daily revenue aggregate | Gold job + owner | Dashboard shows wrong numbers; replay from Silver recovers |
| Consumer | BI tool / ML feature | Read-only | Downstream impact only; no data loss |
When the Airbnb ETL bug hit Silver, Bronze was untouched. Replay from Bronze to Gold took 90 minutes and recovered all 43 days of data.
📖 Why These Big-Data Patterns Matter in Real Teams
Most platform incidents are not caused by storage technology. They come from unclear ingestion boundaries, expensive reprocessing, and missing ownership for quality.
Practical architecture questions:
- Do we need minute-level freshness, or is hourly fine?
- Can we replay yesterday safely if a bug slips in?
- Who signs off quality for each gold dataset?
- What is the cost ceiling for always-on streaming?
| Problem you see | Pattern that helps first |
| --- | --- |
| OLTP and analytics keep drifting | CDC for source-of-truth propagation |
| Raw data is hard to trust | Medallion-style curation contracts |
| Teams fight over central backlog | Data Mesh ownership model |
| Need both fast and recomputable views | Lambda or Kappa decision |
🔍 When to Use Lambda, Kappa, CDC, Medallion, and Data Mesh
| Pattern | Use when | Avoid when | Practical first step |
| --- | --- | --- | --- |
| CDC | You need reliable change propagation from transactional systems | Source systems do not expose stable change logs | Start with one critical table family |
| Lambda | You require low-latency serving plus deterministic batch recompute | Team cannot support dual pipeline logic | Reuse shared transformations across batch and speed paths |
| Kappa | Streaming is mature and replay from log is operationally proven | Data is not append-friendly or stream cost is unjustified | Prove one replay drill on real backlog |
| Medallion | You need clear raw-to-curated progression | Layers are created without consumer contracts | Define bronze/silver/gold acceptance rules |
| Data Mesh | Domain teams can own and operate data products | Platform standards and tooling are immature | Pilot with one domain and one product SLA |
When not to over-engineer
- If freshness target is daily and replay is rare, simple batch may be enough.
- If one team owns everything today, full Data Mesh can add overhead too early.
📊 Lambda Architecture: Dual-Path Batch and Speed
```mermaid
flowchart LR
    A[Event Stream] --> B[Speed Layer: Kafka / Flink]
    A --> C[Batch Layer: Spark / Hadoop]
    B --> D[Real-time Views: low latency]
    C --> E[Batch Views: high accuracy]
    D --> F[Serving Layer: merge both views]
    E --> F
    F --> G[Consumer Queries]
```
This diagram illustrates the Lambda Architecture's dual-path design, where the same event stream feeds both a real-time Speed Layer (Kafka/Flink) and a high-accuracy Batch Layer (Spark/Hadoop). Both paths converge at the Serving Layer, which merges low-latency views from the speed path with fully recomputed batch views to answer consumer queries. The key takeaway is that Lambda buys both freshness and accuracy at the cost of maintaining two parallel processing pipelines with potentially divergent semantics — the exact drift problem LinkedIn discovered across 23 metrics before moving to Kappa.
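The merge step at the serving layer can be sketched in plain, framework-free Python. All names here (`batch_recompute`, `speed_update`, `serve`, the event shape) are illustrative assumptions, not from any specific system: the batch view is authoritative up to its cutoff, the speed view covers everything after it, and a query sums both.

```python
# Minimal Lambda serving-layer sketch: batch view is authoritative up to the
# cutoff; the speed view covers newer events; queries merge both paths.

def batch_recompute(events, cutoff):
    """Deterministic full recompute over all events up to the cutoff."""
    view = {}
    for e in events:
        if e["ts"] <= cutoff:
            view[e["key"]] = view.get(e["key"], 0) + e["value"]
    return view

def speed_update(view, event, cutoff):
    """Incremental update for events newer than the batch cutoff."""
    if event["ts"] > cutoff:
        view[event["key"]] = view.get(event["key"], 0) + event["value"]
    return view

def serve(batch_view, speed_view, key):
    """Merge both paths at query time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

events = [
    {"key": "orders", "ts": 1, "value": 10},
    {"key": "orders", "ts": 2, "value": 5},
    {"key": "orders", "ts": 9, "value": 7},  # arrives after the batch cutoff
]
CUTOFF = 5
batch_view = batch_recompute(events, CUTOFF)
speed_view = {}
for e in events:
    speed_update(speed_view, e, CUTOFF)
print(serve(batch_view, speed_view, "orders"))  # 22
```

The drift risk is visible even here: `batch_recompute` and `speed_update` implement the same aggregation twice, and any divergence between them goes straight to consumers.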
📊 Kappa Architecture: Single Stream Path
```mermaid
flowchart LR
    A[Event Stream] --> B[Stream Layer: Kafka + Flink]
    B --> C[Serving Layer: Materialized Views]
    C --> D[Consumer Queries]
    B --> E[Log Replay: for corrections]
    E --> C
```
Lambda keeps batch and speed paths separate — accurate but operationally duplicated. Kappa unifies them into one stream with replay; simpler to own, but replay cost can spike on large backlogs.
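Kappa's core idea fits in a few lines of illustrative, framework-free Python: the append-only log is the source of truth and the serving view is always a deterministic fold over it, so correcting a bug means fixing the transform and replaying from offset 0 into a fresh view. Names and event shapes below are assumptions for the sketch.

```python
# Kappa-style replay sketch: one append-only log, one transform, and a
# materialized view that can always be rebuilt from any offset.

log = [
    {"offset": 0, "key": "revenue", "amount": 100.0},
    {"offset": 1, "key": "revenue", "amount": 50.0},
    {"offset": 2, "key": "revenue", "amount": 25.0},
]

def materialize(log, transform, from_offset=0):
    """Fold the log into a serving view, starting at from_offset."""
    view = {}
    for event in log:
        if event["offset"] >= from_offset:
            key, value = transform(event)
            view[key] = view.get(key, 0.0) + value
    return view

buggy = lambda e: (e["key"], e["amount"] * 2)  # double-counting bug
fixed = lambda e: (e["key"], e["amount"])      # corrected transform

bad_view = materialize(log, buggy)   # corrupted view from the buggy code
good_view = materialize(log, fixed)  # replay from offset 0 recovers it
```

The operational cost is also visible: recovery always means re-folding the backlog, which is why the article recommends proving a replay drill on real data volumes before committing to Kappa.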
⚙️ How the Architecture Works End-to-End
- Capture source changes via CDC/events/batch.
- Land immutable raw records in bronze.
- Validate and normalize into silver with data quality checks.
- Publish consumer-specific gold products.
- Track ownership, lineage, and freshness SLOs per product.
| Stage | Non-negotiable control | Common anti-pattern |
| --- | --- | --- |
| Ingestion | Replayable offsets/checkpoints | Polling snapshots with no lineage |
| Bronze | Immutable append and schema capture | Mutating raw data in place |
| Silver | Rule-based validation and dedupe | Silent cleanup logic buried in notebooks |
| Gold | Explicit contract per consumer | One gold table pretending to fit all use cases |
| Ownership | Named owner and support SLA | Central platform team owning semantics for every domain |
🛠️ How to Implement: Practical 12-Step Plan
- Pick one high-value domain (orders, payments, clickstream).
- Define target freshness and correctness SLO per consumer.
- Add CDC or event ingest with checkpointed offsets.
- Land bronze as immutable with schema version metadata.
- Build silver validations: null checks, type rules, dedupe keys.
- Create first gold product with explicit consumer contract.
- Add data tests for freshness, volume anomaly, and key integrity.
- Run replay drill from bronze to gold and record duration.
- Publish data product owner and escalation path.
- Track unit economics: compute spend per data product.
- Expand to second domain only after first meets SLO for 2-4 weeks.
- Introduce Mesh governance only where domain ownership is stable.
Implementation done criteria:
| Gate | Pass condition |
| --- | --- |
| Freshness | Product meets defined latency window |
| Replayability | Full recompute succeeds within business recovery target |
| Quality | Silver and gold checks catch contract drift |
| Ownership | Named team handles incidents and schema changes |
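Three of these gates can be automated as plain checks. This is a hedged sketch with illustrative function names and thresholds (the 20% volume tolerance and business-key field are assumptions, not recommendations): freshness against a latency window, volume against an expected range, and key integrity via business-key uniqueness.

```python
# Illustrative quality-gate checks for the "done" criteria above.

def check_freshness(latest_event_age_min, slo_min):
    """Freshness gate: newest event must be within the latency window."""
    return latest_event_age_min <= slo_min

def check_volume(row_count, expected, tolerance=0.2):
    """Volume gate: flag runs deviating more than `tolerance` from expected."""
    return abs(row_count - expected) <= tolerance * expected

def check_key_integrity(rows, key="order_id"):
    """Integrity gate: business keys must be unique after silver dedupe."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

rows = [{"order_id": "a1"}, {"order_id": "a2"}, {"order_id": "a2"}]  # dup!
print(check_freshness(6, 15))     # within a 15-minute window
print(check_volume(980, 1000))    # within 20% of expected volume
print(check_key_integrity(rows))  # fails: duplicate business key
```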
🧠 Deep Dive: Schema Drift, Replay Economics, and Cost Control
The Internals: Contracts, Late Data, and Replay Safety
CDC events should include operation type, source position, and schema version. Without these fields, replay and dedupe become guesswork.
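A concrete envelope makes this tangible. The field names below are illustrative, loosely in the style of Debezium-like connectors rather than any specific product: operation type tells the consumer how to apply the change, source position gives an exact replay coordinate, and schema version routes the payload to the right parser. Dedupe then keys off the source position, which is unique per change.

```python
# Illustrative CDC envelope plus position-based dedupe (all names assumed).

cdc_event = {
    "op": "u",                                # c=create, u=update, d=delete
    "source_position": "binlog:000042/7731",  # exact replay coordinate
    "schema_version": 3,                      # routes to the right parser
    "key": {"order_id": "o-1001"},
    "after": {"order_id": "o-1001", "status": "completed", "amount": 42.5},
    "ts_ms": 1700000000000,
}

seen = set()

def apply_once(event, store):
    """Apply a CDC event at most once, keyed by its source position."""
    position = event["source_position"]
    if position in seen:
        return False                 # duplicate delivery, safely ignored
    seen.add(position)
    store[event["key"]["order_id"]] = event["after"]
    return True

store = {}
print(apply_once(cdc_event, store))  # first delivery applies
print(apply_once(cdc_event, store))  # redelivery is a no-op
```

Strip any of the three header fields and the consumer is back to guessing: without `source_position` there is no dedupe key, and without `schema_version` replaying old events through new parsing logic breaks silently.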
Medallion is effective only when each layer has a contract:
- Bronze: no business interpretation, source-faithful storage.
- Silver: cleaned and conformed entities.
- Gold: explicitly modeled for a consumer decision.
Exactly-once claims should be practical, not marketing-driven. Most teams succeed with idempotent writes + checkpoint discipline.
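What "idempotent writes + checkpoint discipline" means in practice can be sketched without any framework (names and event shapes are illustrative): the write is an upsert keyed by business key, and the offset checkpoint is committed only after the write, so a crash-and-replay between the two cannot inflate results.

```python
# Sketch of idempotent writes with checkpoint discipline: replaying events
# after a crash re-applies upserts, which is harmless by construction.

store, checkpoint = {}, {"offset": -1}

def upsert(event):
    """Idempotent: applying the same event twice leaves the same state."""
    store[event["order_id"]] = event["status"]

def consume(events, from_offset):
    for event in events:
        if event["offset"] <= from_offset:
            continue                             # already applied earlier
        upsert(event)
        checkpoint["offset"] = event["offset"]   # commit AFTER the write

events = [
    {"offset": 0, "order_id": "o-1", "status": "created"},
    {"offset": 1, "order_id": "o-1", "status": "completed"},
]
consume(events, checkpoint["offset"])  # first run applies both events
consume(events, 0)                     # simulated restart replays offset 1
```

The restart on the last line redelivers offset 1, yet the store ends in the same state: this at-least-once delivery plus idempotent apply is the practical "exactly-once" most teams actually run.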
| Drift type | First symptom | Mitigation |
| --- | --- | --- |
| Source schema change | Broken downstream transforms | Contract tests + compatibility checks |
| Late-arriving events | Incorrect windows and aggregates | Watermark strategy + reconciliation jobs |
| Duplicate ingest | Inflated metrics | Business-key dedupe in silver |
Performance Analysis: Metrics That Predict Pain Early
| Metric | Why it matters |
| --- | --- |
| End-to-end freshness by product | Shows if consumers can trust timeliness |
| Replay duration | Defines recovery realism after incidents |
| Late-data percentage | Predicts reconciliation complexity |
| Stream compute cost per TB | Prevents unnoticed cost drift |
| Product-level incident count | Reveals weak ownership boundaries |
Use these metrics in weekly architecture reviews, not only post-incident retros.
📊 Big-Data Flow: CDC to Curated Data Products
```mermaid
flowchart TD
    A[OLTP and event sources] --> B[CDC or ingest service]
    B --> C[Bronze: immutable storage]
    C --> D[Silver: validation and conformance]
    D --> E[Gold: data products]
    E --> F[BI and dashboards]
    E --> G[ML features and operational services]
    H[Domain owner and SLA] --> E
```
This diagram maps the full Medallion data flow from raw operational sources to consumer-ready data products. OLTP databases and event streams feed a CDC or ingest service that lands immutable records in the Bronze layer; each subsequent layer — Silver for validation and conformance, Gold for consumer-specific aggregates — adds quality guarantees before data reaches BI tools, ML feature stores, or operational services. The Domain Owner and SLA node emphasizes that architectural correctness alone is insufficient: each Gold product requires a named owner accountable for freshness, quality, and incident response.
🌍 Real-World Applications: Airbnb, Netflix, and LinkedIn
Airbnb's Medallion Architecture (Databricks Delta Lake)
Airbnb rebuilt their analytics platform after the 2019 ETL incident. Their Bronze→Silver→Gold pipeline on Delta Lake now lands raw CDC events within 4 minutes of source commit. Replay from Bronze to Gold covers a full 30-day window in under 90 minutes — down from 18 hours in the old batch model. Each gold dataset has a named domain owner and a schema contract; a new consumer cannot query gold without accepting the published SLA.
Netflix: Data Mesh at 500B+ Events/Day
Netflix's shift to a Data Mesh model — where each of 200+ domain teams publishes a registered data product with a freshness SLA — reduced data incident MTTR from 4 hours to 35 minutes by eliminating the "who owns this table?" escalation loop. Operational metrics are published with < 10-minute freshness; historical datasets are hourly. The central platform team owns tooling, but never data semantics.
LinkedIn: Why They Left Lambda for Samza
LinkedIn's original Lambda setup ran Hadoop (batch) and Storm (speed) separately. A 2016 production audit found 23 metrics with persistent divergence between paths — some drifting for 6+ months. Finance reported a 4% discrepancy in session counts. LinkedIn migrated to Apache Samza for unified stream processing with replay support, eliminating dual-path semantic drift entirely.
| Real system | Pattern used | Key metric |
| --- | --- | --- |
| Airbnb / Databricks | Medallion (CDC→Bronze→Silver→Gold) | 90-min replay, 4-min freshness |
| Netflix | Data Mesh (domain-owned products) | MTTR 4h → 35 min |
| LinkedIn | Kappa / Samza unified stream | Eliminated 23 drifted metrics |
Failure scenario: Before Medallion contracts at Airbnb, one bad ETL job corrupted 43 days of revenue data silently. Every pipeline step was green; no cross-layer correctness check existed. By the time finance discovered the issue during a quarterly audit, 6 weeks of authoritative numbers were wrong. Bronze immutability means even when silver transforms break, the raw data can always be replayed from a known-good starting point.
⚖️ Trade-offs & Failure Modes: Pros, Cons, and Risks Across Pattern Choices
| Pattern decision | Pros | Cons | Main risk | Mitigation |
| --- | --- | --- | --- | --- |
| CDC-first ingestion | Better lineage and freshness | Connector and schema complexity | Backfill pain | Standardized replay tooling |
| Lambda dual-path | Fast + accurate views | Duplicate logic | Drift between paths | Shared transformation library |
| Kappa single-stream | Conceptual simplicity | Replay cost can spike | Long recovery windows | Regular replay drills |
| Medallion layering | Clear data quality progression | Layer sprawl risk | Unused intermediate sets | Layer contracts tied to consumers |
| Data Mesh ownership | Faster domain decisions | Governance overhead | Inconsistent quality standards | Platform guardrails + product templates |
🧭 Decision Guide: What to Choose First
| Situation | Recommendation |
| --- | --- |
| Transactional changes are late or inconsistent in analytics | Implement CDC first |
| Teams cannot explain source-to-dashboard lineage | Add Medallion contracts and tests |
| You need <5 minute serving plus daily guaranteed accuracy | Consider Lambda in that domain only |
| Streaming cost is high and replay is rare | Prefer batch-centric model with targeted streaming |
| Domain teams are mature and accountable | Add Mesh ownership incrementally |
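The decision table above can be read as an ordered triage: each row is checked in priority order, and the first match wins. A first-cut sketch (the function name, flags, and ordering are illustrative, not prescriptive):

```python
# The decision guide as an ordered triage function (illustrative only).

def first_pattern(cdc_gaps, lineage_unclear, needs_fast_and_accurate,
                  streaming_cost_high, domains_mature):
    """Return the first pattern to adopt, checking symptoms in priority order."""
    if cdc_gaps:
        return "CDC"
    if lineage_unclear:
        return "Medallion contracts"
    if needs_fast_and_accurate:
        return "Lambda (scoped to one domain)"
    if streaming_cost_high:
        return "Batch-centric with targeted streaming"
    if domains_mature:
        return "Incremental Data Mesh"
    return "Keep it simple: plain batch"

# Late transactional changes AND unclear lineage: CDC still comes first,
# because contracts downstream cannot fix untrustworthy ingestion.
print(first_pattern(True, True, False, False, False))
```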
🧪 Practical Example: Order + Clickstream Lakehouse Slice
PySpark Medallion Pipeline (Delta Lake pattern from Airbnb/Databricks)
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("medallion-orders").getOrCreate()

# ── Bronze: append-only CDC event landing ────────────────────────────────────
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.cdc")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ingest_ts")
    .writeStream.format("delta").outputMode("append")
    .option("checkpointLocation", "/bronze/orders/_chk")
    .start("/bronze/orders"))

# ── Silver: dedupe + validate via MERGE (idempotent on rerun) ────────────────
# The rn = 1 filter must live inside the source subquery; putting it in the
# ON clause would let older duplicate rows fall through to WHEN NOT MATCHED.
spark.sql("""
MERGE INTO silver.orders AS t
USING (
  SELECT order_id, customer_id, amount, status, event_ts FROM (
    SELECT order_id, customer_id, amount, status, event_ts,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) AS rn
    FROM bronze.orders
    WHERE ingest_ts >= current_timestamp() - INTERVAL 10 MINUTES
  ) WHERE rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.status != t.status
  THEN UPDATE SET t.status = s.status, t.updated_at = current_timestamp()
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, status, event_ts)
  VALUES (s.order_id, s.customer_id, s.amount, s.status, s.event_ts)
""")

# ── Gold: finance daily revenue aggregate ────────────────────────────────────
spark.sql("""
INSERT OVERWRITE gold.finance_revenue_daily
SELECT DATE(event_ts) AS day,
       SUM(amount) AS revenue,
       COUNT(DISTINCT order_id) AS order_count
FROM silver.orders
WHERE status = 'completed'
GROUP BY DATE(event_ts)
""")
```
Key design points from production:
- Bronze is append-only: bad transforms cannot corrupt it, and replay always works from the same source.
- Silver `MERGE` deduplicates by business key plus latest `event_ts`, making it idempotent on every rerun.
- Gold uses `INSERT OVERWRITE`, so downstream consumers always see a complete, consistent day.
- Target freshness: Bronze within 4 min → Silver within 8 min → Gold within 15 min.
Implementation checklist for this slice:
- Define gold consumers and acceptance tests first.
- Build silver quality checks before dashboarding.
- Run one replay from bronze to gold weekly (target: < 90 min for 30-day window).
- Review cost per gold product monthly.
Operator Field Note: What Fails First in Production
Corrupted data platforms rarely announce themselves loudly. At Airbnb, the 2019 revenue corruption ran silently for 43 days because every individual pipeline step appeared healthy: ETL jobs ran on schedule, row counts looked normal, and dashboards showed numbers. What was missing was a cross-layer correctness check: does gold actually match what silver validated? Without Medallion contracts, no layer owned the answer to that question.
At LinkedIn, 23 metrics drifted between speed and batch paths for months. The signal was a finance partner comparing quarterly reports across two systems — not an automated alert. By the time it was investigated, drift had been running for 6+ months in some metrics. The pattern is consistent: the signal that breaks first is a business metric that "doesn't feel right," not a system alert.
- Early warning signal: a gold dataset's row count or aggregate value diverges from expected range by more than 1% across two consecutive runs.
- First containment move: pin consumers to the last-known-good gold snapshot and trigger a targeted silver→gold replay from bronze.
- Escalate immediately when: divergence affects any metric with financial, compliance, or fraud-scoring implications.
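The early-warning signal above is simple enough to automate. A hedged sketch (function names and the example values are illustrative): a run breaches when it deviates more than 1% from the expected value, and the alert fires only when two consecutive runs breach, which filters one-off volume blips.

```python
# Divergence guard for a gold aggregate: alert on two consecutive runs
# drifting more than 1% from the expected value (names are illustrative).

def breaches(value, expected, threshold=0.01):
    """True when a single run deviates beyond the threshold band."""
    return abs(value - expected) / expected > threshold

def alert(run_values, expected):
    """Fire only when two consecutive runs both breach the band."""
    flags = [breaches(v, expected) for v in run_values]
    return any(a and b for a, b in zip(flags, flags[1:]))

print(alert([100_200, 100_400], 100_000))           # both within 1%: quiet
print(alert([103_000, 100_200, 104_000], 100_000))  # breaches not consecutive
print(alert([103_000, 104_000], 100_000))           # two in a row: alert
```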
15-Minute SRE Drill
- Replay one bounded failure case in staging.
- Capture one metric, one trace, and one log that prove the guardrail worked.
- Update the runbook with exact rollback command and owner on call.
🛠️ Apache Spark and Delta Lake: PySpark Medallion Pipeline with Flink for Streaming
Apache Spark is the dominant distributed compute engine for big data batch and micro-batch processing; Delta Lake adds ACID transactions, schema enforcement, and time-travel replay to Spark's storage layer — the combination that powers the Medallion architecture at Airbnb and Databricks. Apache Flink handles the sub-second streaming tier when Spark's micro-batch latency is too high.
How it solves the problem: The PySpark Medallion snippet in the Practical Example section above uses Delta Lake's MERGE INTO for idempotent silver writes and INSERT OVERWRITE for atomic gold refreshes. Delta Lake's transaction log makes bronze append-only and enables full-table replay from any committed snapshot — the property that recovered Airbnb's 43 days of revenue data in 90 minutes.
```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_timestamp, row_number, from_json
from pyspark.sql.window import Window
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = (SparkSession.builder
    .appName("medallion-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# ── Bronze: append-only CDC landing with schema enforcement ──────────────────
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.cdc")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ingest_ts")
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/bronze/orders/_chk")
    .option("mergeSchema", "false")  # reject schema drift at the bronze boundary
    .start("/delta/bronze/orders"))

# ── Silver: idempotent MERGE deduplication via Delta ACID ────────────────────
# Bronze stores the raw payload string, so parse it into typed columns first.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("status", StringType()),
    StructField("event_ts", TimestampType()),
])
bronze_df = (spark.read.format("delta").load("/delta/bronze/orders")
    .select(from_json(col("payload"), order_schema).alias("o"), "ingest_ts")
    .select("o.*", "ingest_ts"))

window = Window.partitionBy("order_id").orderBy(col("event_ts").desc())
deduped = (bronze_df
    .withColumn("rn", row_number().over(window))
    .filter(col("rn") == 1)
    .drop("rn"))

silver_table = DeltaTable.forPath(spark, "/delta/silver/orders")
(silver_table.alias("t").merge(
        deduped.alias("s"),
        "t.order_id = s.order_id")
    .whenMatchedUpdate(
        condition="s.status != t.status",
        set={"status": "s.status", "updated_at": current_timestamp()})
    .whenNotMatchedInsertAll()
    .execute())

# ── Gold: finance daily aggregate (INSERT OVERWRITE = idempotent) ────────────
spark.sql("""
INSERT OVERWRITE gold.finance_revenue_daily
SELECT DATE(event_ts) AS day,
       SUM(amount) AS revenue,
       COUNT(DISTINCT order_id) AS order_count
FROM silver.orders
WHERE status = 'completed'
GROUP BY DATE(event_ts)
""")

# ── Time-travel replay: recover gold from bronze after an ETL bug ────────────
# Delta retains the full transaction log, so any committed snapshot version
# can be read back and re-run through the silver and gold jobs.
bronze_at_bug = (spark.read
    .format("delta")
    .option("versionAsOf", 142)  # last known good version before the bug
    .load("/delta/bronze/orders"))
# Re-run the silver and gold jobs against bronze_at_bug to recover the data
```
For lower-latency streaming (for example, fraud features requiring under-2-minute freshness), Apache Flink replaces Spark's micro-batch tier:
```python
# Apache Flink Python API: fraud feature stream with 1-minute windows
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Read CDC events from Kafka; the 5-second watermark bounds event-time lateness
t_env.execute_sql("""
CREATE TABLE orders_cdc (
    order_id STRING,
    customer_id STRING,
    amount DOUBLE,
    event_ts TIMESTAMP(3),
    WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders.cdc',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'fraud-features',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
)
""")

# 1-minute tumbling window: fraud feature for high-velocity spend detection
# (assumes a fraud_features sink table has already been registered)
t_env.execute_sql("""
INSERT INTO fraud_features
SELECT customer_id,
       TUMBLE_END(event_ts, INTERVAL '1' MINUTE) AS window_end,
       SUM(amount) AS spend_1m,
       COUNT(*) AS order_count_1m
FROM orders_cdc
GROUP BY customer_id,
         TUMBLE(event_ts, INTERVAL '1' MINUTE)
""")
```
For a full deep-dive on Apache Spark, Delta Lake, and Apache Flink in production data pipelines, a dedicated follow-up post is planned.
📚 Lessons Learned
- Pattern selection should be driven by freshness, replay, and ownership requirements.
- CDC and Medallion are often the most practical first moves.
- Lambda/Kappa decisions should be scoped per domain, not platform-wide ideology.
- Data Mesh succeeds only when platform guardrails are strong.
- Replay economics should be measured before incidents force urgency.
📌 TLDR: Summary & Key Takeaways
- Big-data architecture is about controlled flow plus accountable ownership.
- Use explicit contracts at ingestion, curation, and serving layers.
- Design replay and late-data handling before scale.
- Match streaming complexity to business freshness requirements.
- Prefer incremental pattern adoption with measurable SLO outcomes.
Written by
Abstract Algorithms
@abstractalgorithms