
Medallion Architecture: Bronze, Silver, and Gold Layers in Practice

Structure your data lake with progressive refinement: raw Bronze ingestion, cleaned and conformed Silver, and business-ready Gold aggregates.

Abstract Algorithms · 26 min read

TLDR: Medallion Architecture solves the "data swamp" problem by organizing a data lake into three progressively refined zones — Bronze (raw, immutable), Silver (cleaned, conformed), Gold (aggregated, business-ready) — so teams always build on a trusted foundation, bad data never propagates silently, and any corruption can be replayed from scratch.

In short: three zones, one source of truth at every quality level. Bronze preserves what happened, Silver enforces what it means, Gold exposes what the business needs.

A fintech analytics team receives 40 million order events per day from Kafka. They store every event in S3. Six months later, they have 180 TB of raw JSON files — and every analyst spends three hours each morning writing custom Python scripts to clean nulls, correct currency mismatches, and deduplicate records before running a single query. The marketing team's "clean customer table" is different from the finance team's version. Revenue numbers don't reconcile between dashboards. Their data lake has become a data swamp: data goes in, but trustworthy answers barely come out.

Medallion Architecture was designed precisely for this. It does not require a new database engine or a new streaming framework. It requires a deliberate decision: raw data lands in one place and is never mutated, cleaned data lives in a second place, and aggregated business metrics live in a third. Three zones, explicit contracts between them, one source of truth at every level.

📖 From Data Swamp to Data Refinery: What Medallion Architecture Actually Solves

Medallion Architecture (originally named by Databricks, though the pattern predates the term) is a data organization design pattern that divides a data lake into three logically distinct zones. It is not a technology you install. You cannot pip install medallion. It is a set of agreements your team makes about where data lives, who can mutate it, and what quality guarantees each zone provides.

The core insight is that data quality is not binary — it is a gradient. The same order_placed event can exist simultaneously as a raw byte stream, a validated row in a cleaned table, and a line item in a revenue aggregate. Each representation serves a different consumer with a different trust requirement. Trying to serve all three from one undifferentiated store forces every consumer to repeat the same data-cleaning work. Medallion Architecture eliminates that duplication by making each zone responsible for exactly one level of quality.

| Audience | What they need | Which zone serves them |
| --- | --- | --- |
| Data engineers debugging pipelines | Exact byte copy of what the source sent | Bronze |
| Data scientists building ML features | Clean, joined, type-safe tables | Silver |
| Product managers and executives | Pre-computed KPIs and aggregates | Gold |
| Compliance and audit | Immutable history with no gaps | Bronze |

By separating concerns this way, each team stops re-implementing transformations that already exist in the pipeline. A finance analyst querying Gold does not need to know that the source Kafka topic once sent amount as a string instead of a float — that fix lives in the Bronze-to-Silver job, and it runs exactly once.

πŸ” The Three Zones in Plain Terms: Bronze, Silver, and Gold Defined

Each layer has a single, non-negotiable responsibility.

Bronze — Raw Landing Zone. Bronze is an exact, append-only copy of the source data. The schema is whatever the producer sent. If the Kafka order_placed event has a typo in a field name, that typo is preserved in Bronze. If the source sent the same event twice, both copies land in Bronze. If a field that was never documented suddenly appears, it lands in Bronze. The contract is: Bronze is never mutated, never cleaned, never deleted. It is the raw ledger of what happened. Because of this, Bronze can always be used to replay all downstream layers from scratch — it is the recovery anchor for every incident.

Silver — Cleaned and Conformed Zone. Silver applies a defined set of transformations to Bronze data. Duplicates are removed. Nulls are handled according to documented rules. customer_id is validated as a UUID; rows that fail are quarantined to a dead-letter table, not silently dropped. Currency amounts are normalized to USD. The user_id from the orders Kafka topic is joined to the users dimension table so every Silver order row carries a fully resolved customer profile. Silver is the layer where cross-team trust is established — when three different teams query silver.orders, they all get the same answer.

Gold — Business-Ready Aggregation Zone. Gold materializes the answers to questions the business actually asks. It contains tables like daily_gmv_by_category, customer_lifetime_value, and product_funnel_conversion_rates. Gold tables are pre-aggregated and indexed for query performance. They power BI dashboards directly and serve as feature stores for ML models. Adding a new Gold table never requires touching Bronze or Silver — it is an additive operation that reads from Silver and publishes a new business view.

βš™οΈ How Data Moves from Raw to Business-Ready: The Medallion Flow

Follow a single order_placed event from an e-commerce platform through all three layers.

Step 1 — The event arrives. A customer clicks "Buy Now." The application emits a Kafka event with whatever fields the producer included at that moment:

{
  "event_id": "evt_9f2a",
  "user_id": "usr_4412",
  "items": [{"sku": "HDPHN-001", "qty": 2, "price": "49.99"}],
  "currency": "EUR",
  "timestamp": "2024-11-15T09:23:11Z",
  "session_id": null
}

Step 2 — Bronze ingestion. A streaming consumer writes this JSON verbatim to a Delta Lake table partitioned by ingestion date. The price field is a string, session_id is null, and the currency is EUR — none of this is corrected. The record is appended and the partition is closed.

Step 3 — Silver transformation. The Bronze-to-Silver Spark job runs on a schedule (or continuously in micro-batch mode). It validates user_id format, converts price to a float, translates EUR amounts to USD using that day's exchange rate, joins the order to the silver.users table to attach customer segment data, and removes duplicate event_id entries from re-delivered Kafka messages. The session_id null is filled with a sentinel value per the agreed schema contract. The resulting row is written to silver.orders.

Step 4 — Gold aggregation. At the end of the day, the Silver-to-Gold job groups silver.orders by category and sums amount_usd. The result is written as a partition in gold.daily_gmv_by_category. The CFO dashboard reads from Gold. The query runs in milliseconds because the heavy lifting happened in the pipeline, not at query time.
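The grouping in Step 4 is plain aggregation; a Spark-free Python sketch (with illustrative sample rows) shows the shape of the computation:

```python
# Spark-free sketch of the Step 4 aggregation: group Silver rows by
# category and sum USD amounts. Sample rows are illustrative.
from collections import defaultdict

silver_orders = [
    {"category": "audio", "amount_usd": 107.98},
    {"category": "audio", "amount_usd": 53.99},
    {"category": "video", "amount_usd": 199.00},
]

gmv_by_category = defaultdict(float)
for row in silver_orders:
    gmv_by_category[row["category"]] += row["amount_usd"]

print({k: round(v, 2) for k, v in gmv_by_category.items()})
# {'audio': 161.97, 'video': 199.0}
```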

The value of this separation becomes obvious when something breaks. Suppose the exchange-rate lookup API returns wrong values for three hours. Every Silver order during that window has incorrect USD amounts, and the Gold aggregate for that period is wrong. The fix: correct the exchange-rate lookup, delete the affected Silver partition, and replay the Bronze-to-Silver job. Bronze was never touched. The replay takes minutes. There is no data loss.
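That recovery story can be sketched in a few lines of plain Python, with in-memory lists standing in for Bronze and Silver partitions (names and rates are illustrative):

```python
# Minimal sketch of the replay story. In-memory lists stand in for
# Bronze/Silver partitions; this is not a real table API.

bronze = [  # immutable: appended once, never edited
    {"event_id": "evt_1", "amount": "10.0", "currency": "EUR"},
    {"event_id": "evt_2", "amount": "20.0", "currency": "EUR"},
]

def to_silver(events, eur_usd_rate):
    """Bronze -> Silver: cast types and normalize currency."""
    return [
        {"event_id": e["event_id"],
         "amount_usd": float(e["amount"]) * eur_usd_rate}
        for e in events
    ]

# A buggy rate produces a bad Silver partition...
silver = to_silver(bronze, eur_usd_rate=0.10)   # wrong rate

# ...recovery: fix the rate, drop the bad partition, replay from Bronze.
silver = to_silver(bronze, eur_usd_rate=1.08)   # replayed; Bronze untouched
gold_gmv = sum(row["amount_usd"] for row in silver)
print(round(gold_gmv, 2))  # 32.4
```

Because Bronze was never edited, the corrected Silver is byte-for-byte what a correct first run would have produced.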

🧠 Deep Dive: How Medallion's Three-Layer Contract Actually Works

The Internals: Immutability, Schema Contracts, and Partition Strategy

Bronze's immutability guarantee is the entire foundation of the replay story. In practice, this means Bronze tables are written with append mode only — no overwrite, no UPDATE, no DELETE operations are ever issued against Bronze data. This is enforced through Delta Lake table properties (delta.appendOnly = true) or IAM policies that restrict the Bronze IAM role to s3:PutObject but not s3:DeleteObject.
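As a sketch of the IAM side of that enforcement (bucket path matches this article's examples; the account ID and break-glass role name are hypothetical), a bucket-policy deny statement along these lines blocks deletes on the Bronze prefix for everyone except a break-glass role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyBronzeDeletes",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::datalake/bronze/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/break-glass"
        }
      }
    }
  ]
}
```

An explicit Deny overrides any Allow attached to a role, which is what makes the contract automatic rather than conventional.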

Silver enforces schema-on-write, the inverse of Bronze's schema-on-read. Before a Silver job writes a new row, the schema is validated against a registered contract — typically a Delta table schema or an Avro schema registered in a schema registry. Rows that fail schema validation are not silently discarded; they land in a quarantine table (silver.orders_rejected) with an error column explaining the violation. This quarantine pattern means Silver's SLA is never silently broken by bad upstream data.
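The quarantine split itself is simple; here is a Spark-free Python sketch using the same order_id format rule the Silver job applies later in this article (sample rows are illustrative):

```python
# Spark-free sketch of the quarantine pattern: rows that fail the schema
# contract are routed to a rejected set with an error_reason, never dropped.
import re

ORDER_ID_RE = re.compile(r"^ord_[a-f0-9]{8}$")

def validate(rows):
    """Split rows into (valid, rejected); rejected rows carry error_reason."""
    valid, rejected = [], []
    for row in rows:
        if ORDER_ID_RE.match(row.get("order_id", "")):
            valid.append(row)
        else:
            rejected.append({**row, "error_reason": "invalid_order_id_format"})
    return valid, rejected

rows = [
    {"order_id": "ord_9f2a11bc", "amount_usd": 49.99},
    {"order_id": "ORDER-17",     "amount_usd": 12.00},  # violates contract
]
valid, rejected = validate(rows)
print(len(valid), len(rejected))  # 1 1
```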

Partitioning strategy differs by layer. Bronze is usually partitioned by ingestion timestamp (ingest_date, ingest_hour), which keeps write performance predictable regardless of event timestamp skew. Silver is partitioned by event timestamp (order_date), which optimizes analytical scan patterns. Gold is often not partitioned — the tables are already small by virtue of being pre-aggregated — or partitioned by a single dimension like region or category.

Performance Analysis: Reprocessing Windows and Storage Trade-offs

The most expensive operation in a Medallion pipeline is a full Bronze replay — reprocessing all Bronze data to regenerate Silver and Gold from scratch. For a 12-month Bronze table of 500 GB, a full replay on a 40-core Spark cluster typically takes 30–90 minutes depending on transformation complexity. This is acceptable for disaster recovery but unacceptable for daily incremental runs.

Incremental processing is handled through watermarking: each Silver job records the last processed ingest_timestamp in a checkpoint table and reads only Bronze partitions newer than that watermark on the next run. This reduces the typical Silver job runtime from 90 minutes to under 5 minutes for platforms with a few million daily events.
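A minimal Python sketch of the watermark logic, with an in-memory dict standing in for the checkpoint table:

```python
# Sketch of watermark-based incremental processing. In-memory structures
# stand in for the checkpoint table and Bronze partitions.

checkpoint = {"last_ingest_ts": 0}  # persisted durably in a real pipeline

bronze = [
    {"ingest_ts": 1, "order_id": "ord_a"},
    {"ingest_ts": 2, "order_id": "ord_b"},
    {"ingest_ts": 3, "order_id": "ord_c"},
]

def incremental_silver_run(bronze_rows, ckpt):
    """Process only rows newer than the stored watermark, then advance it."""
    new_rows = [r for r in bronze_rows if r["ingest_ts"] > ckpt["last_ingest_ts"]]
    if new_rows:
        ckpt["last_ingest_ts"] = max(r["ingest_ts"] for r in new_rows)
    return new_rows

first = incremental_silver_run(bronze, checkpoint)   # picks up all 3 rows
second = incremental_silver_run(bronze, checkpoint)  # nothing new to do
print(len(first), len(second))  # 3 0
```

Advancing the watermark only after a successful run means a failed run simply reprocesses the same window on retry.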

Storage overhead in a Medallion setup is typically 2–3× the raw Bronze size: Silver adds ~50% due to joins and enrichment columns, Gold is usually less than 1% of Bronze size because of aggregation. Delta Lake's columnar Parquet storage and Z-ordering compress Silver data aggressively, often resulting in Silver being smaller than Bronze even with more columns.

| Metric | Bronze | Silver | Gold |
| --- | --- | --- | --- |
| Schema enforcement | None (schema-on-read) | Strict (schema-on-write) | Strict, denormalized |
| Write mode | Append-only | Upsert (merge) | Overwrite or append |
| Partition key | Ingestion timestamp | Event timestamp | Business dimension |
| Typical size ratio | 1× (baseline) | 1.3–1.5× | <0.01× |
| Replay cost | N/A (source) | Minutes to hours | Minutes |

📊 Visualizing the E-Commerce Order Pipeline Across Medallion Layers

The diagram below traces a single order event from the Kafka source through all three layers to the CFO dashboard. Notice that each transformation job is a one-way gate — data flows forward, never backward.

graph TD
    A["🏭 Source Systems<br/>Kafka · CDC · REST APIs"] --> B
    B["🟤 Bronze Layer<br/>Raw JSON events<br/>Append-only · Schema-on-read<br/>Partitioned by ingest_date"] --> C
    C["⚙️ Bronze → Silver Job<br/>Validate · Deduplicate<br/>Normalize currency · Join users"] --> D
    C --> E["🚫 silver.orders_rejected<br/>Failed validation rows<br/>with error_reason column"]
    D["⚪ Silver Layer<br/>silver.orders<br/>Typed · Deduped · USD amounts<br/>Partitioned by order_date"] --> F
    F["⚙️ Silver → Gold Job<br/>Group by category · Sum amounts<br/>Compute LTV · Funnel metrics"] --> G
    G["🟡 Gold Layer<br/>gold.daily_gmv_by_category<br/>gold.customer_ltv_scores<br/>gold.funnel_conversion_rates"] --> H["📊 BI Dashboards<br/>ML Feature Stores<br/>Executive Reports"]
    style B fill:#cd7f32,color:#fff
    style D fill:#c0c0c0,color:#222
    style G fill:#ffd700,color:#222
    style E fill:#ff6b6b,color:#fff

The quarantine path (silver.orders_rejected) is as important as the happy path. When the Silver job encounters a row it cannot transform safely, that row lands in the rejected table rather than polluting Silver or being silently dropped. Data engineers monitor orders_rejected as a quality signal; a spike means something changed upstream.

🌍 Real-World Applications: Medallion at a High-Traffic E-Commerce Platform

The scenario. An e-commerce marketplace processes 5 million orders per day, sourced from a mobile app, a web frontend, and a third-party reseller API. Each source emits events in a slightly different schema. The reseller API occasionally delivers duplicate events. The mobile app sometimes omits the device_country field in poor network conditions.

Bronze ingestion (the source of truth). Three Kafka topics — app.order_placed, web.order_placed, and reseller.order_created — are each consumed by a Spark Structured Streaming job that writes to bronze.orders with a source column tagging the origin. The table now contains 5 million rows per day in its native, unmodified form. If the mobile app's schema changes next month (say, renaming user_token to user_id), Bronze captures both the old and new schema over time without breaking anything.

Silver transformation (the trust layer). A daily Silver job reads yesterday's Bronze partition, applies 14 documented transformation rules (defined in version-controlled YAML), and writes to silver.orders. The rules cover: UUID format validation for order_id and user_id, currency normalization from GBP/EUR/JPY to USD, duplicate removal keyed on (order_id, source), null-fill for optional fields per documented defaults, and a join to silver.users for customer tier enrichment. The Finance team, Data Science team, and Operations team all read from silver.orders. They all get the same 4.97 million valid rows (30,000 duplicates removed, 3,000 quarantined) without doing any cleaning themselves.
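As a hedged sketch of what such a version-controlled rule file might look like (the field names and rule types here are hypothetical, not the schema of any real rules engine):

```yaml
# Hypothetical shape of the version-controlled Silver transformation rules.
version: 7
rules:
  - name: validate_order_id
    type: regex
    column: order_id
    pattern: "^ord_[a-f0-9]{8}$"
    on_failure: quarantine
  - name: normalize_currency
    type: fx_convert
    from: [GBP, EUR, JPY]
    to: USD
    rate_source: ecb_daily
  - name: dedupe
    type: drop_duplicates
    keys: [order_id, source]
  - name: fill_session_id
    type: null_fill
    column: session_id
    default: "UNKNOWN"
```

Keeping the rules in a file like this makes every behavior change a reviewable diff, which is what lesson 5 at the end of this article argues for.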

Gold aggregation (the self-service layer). Three Gold jobs run after Silver completes. One produces gold.daily_gmv_by_category (12 rows, one per product category, consumed by the CFO dashboard). Another produces gold.customer_ltv_scores (updated customer lifetime values, consumed by the ML recommendation engine). The third produces gold.funnel_conversion_rates (consumed by the Product team's growth dashboard). None of these jobs touch Bronze. All three complete in under 4 minutes combined.

The replay incident. Three weeks after launch, the exchange-rate lookup service returned stale rates for 6 hours. Gold GMV figures for that day were wrong. Recovery: delete the affected Silver partition, fix the exchange-rate lookup, re-run the Silver job for that day (8 minutes), re-run the Gold job (90 seconds). Total recovery time: under 12 minutes. Bronze was never touched.

πŸ› οΈ Delta Lake: How ACID Guarantees Underpin the Medallion Pattern

Delta Lake is the open-source table format that makes Medallion reliably implementable on object storage. Without Delta Lake (or a similar format like Apache Iceberg or Apache Hudi), running an UPDATE or DELETE on Parquet files in S3 is not atomic — a partial failure leaves the table in a corrupt state. Delta Lake solves this with a write-ahead transaction log (_delta_log/) that records every operation as a JSON commit entry. Every read sees a consistent snapshot regardless of concurrent writes.

The four Delta Lake features most critical to Medallion pipelines are:

  1. ACID transactions — Bronze appends, Silver merges (upserts), and Gold overwrites are all atomic. A failed Silver job either fully commits or leaves the previous state intact; there is no partial write.

  2. Time travel — Every Delta table version is queryable via VERSION AS OF or TIMESTAMP AS OF. If a Gold aggregate is wrong, you can query Silver as it existed before the bad run: SELECT * FROM silver.orders VERSION AS OF 42.

  3. Schema enforcement and evolution — Delta tables reject writes that violate the registered schema by default. When a new field arrives in Bronze, you explicitly evolve the Silver schema with mergeSchema, ensuring Silver consumers are aware of the change.

  4. Z-ordering — OPTIMIZE gold.orders ZORDER BY (customer_id, order_date) co-locates related data in the same Parquet files, reducing the data scanned by selective queries by up to 80%.

Here is a minimal PySpark + Delta Lake implementation of all three layers for the order events pipeline:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, lit, when, current_timestamp
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("Medallion-OrderEvents") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

BRONZE_PATH = "s3://datalake/bronze/orders"
SILVER_PATH = "s3://datalake/silver/orders"
GOLD_PATH   = "s3://datalake/gold/daily_gmv_by_category"

# ─────────────────────────────────────────────────────────────
# BRONZE LAYER β€” Raw ingest from Kafka, append-only
# ─────────────────────────────────────────────────────────────
def ingest_bronze(kafka_df):
    """Write raw Kafka events to Bronze, exactly as received."""
    (
        kafka_df
        .withColumn("ingest_ts", current_timestamp())
        .withColumn("ingest_date", col("ingest_ts").cast("date"))
        .write
        .format("delta")
        .mode("append")                       # NEVER overwrite Bronze
        .partitionBy("ingest_date")
        .option("mergeSchema", "false")       # Fail loudly on schema drift instead of merging
        .save(BRONZE_PATH)
    )

# ─────────────────────────────────────────────────────────────
# SILVER LAYER β€” Validate, deduplicate, normalize, enrich
# ─────────────────────────────────────────────────────────────
def transform_silver(bronze_partition_date: str):
    """Read a Bronze partition, apply quality rules, write to Silver."""
    bronze_df = spark.read.format("delta").load(BRONZE_PATH) \
        .filter(col("ingest_date") == bronze_partition_date)

    # 1. Cast types and fill nulls
    cleaned = bronze_df \
        .withColumn("amount", col("amount").cast("double")) \
        .withColumn("amount_usd",
            when(col("currency") == "EUR", col("amount") * 1.08)
           .when(col("currency") == "GBP", col("amount") * 1.27)
           .otherwise(col("amount"))
        ) \
        .withColumn("order_ts", to_timestamp("timestamp")) \
        .withColumn("order_date", col("order_ts").cast("date")) \
        .withColumn("session_id",
            when(col("session_id").isNull(), lit("UNKNOWN"))
           .otherwise(col("session_id"))
        )  # order_date is required by the partitionBy below and the Gold filter

    # 2. Validate: keep valid rows, quarantine bad rows
    valid   = cleaned.filter(col("order_id").rlike(r"^ord_[a-f0-9]{8}$"))
    invalid = cleaned.filter(~col("order_id").rlike(r"^ord_[a-f0-9]{8}$")) \
                     .withColumn("error_reason", lit("invalid_order_id_format"))

    # 3. Deduplicate on (order_id, source)
    deduped = valid.dropDuplicates(["order_id", "source"])

    # 4. Write valid rows to Silver (upsert to avoid duplicate runs)
    if DeltaTable.isDeltaTable(spark, SILVER_PATH):
        silver_table = DeltaTable.forPath(spark, SILVER_PATH)
        silver_table.alias("existing").merge(
            deduped.alias("incoming"),
            "existing.order_id = incoming.order_id AND existing.source = incoming.source"
        ).whenNotMatchedInsertAll().execute()
    else:
        deduped.write.format("delta").mode("overwrite") \
               .partitionBy("order_date").save(SILVER_PATH)

    # 5. Write rejected rows to quarantine
    if invalid.count() > 0:
        invalid.write.format("delta").mode("append") \
               .save("s3://datalake/silver/orders_rejected")

# ─────────────────────────────────────────────────────────────
# GOLD LAYER β€” Aggregate for business consumption
# ─────────────────────────────────────────────────────────────
def aggregate_gold(run_date: str):
    """Aggregate Silver orders into daily GMV by category."""
    from pyspark.sql.functions import sum as _sum, countDistinct

    silver_df = spark.read.format("delta").load(SILVER_PATH) \
        .filter(col("order_date") == run_date)

    gold_df = silver_df.groupBy("order_date", "category") \
        .agg(
            _sum("amount_usd").alias("gmv_usd"),
            countDistinct("order_id").alias("order_count"),
            countDistinct("user_id").alias("unique_customers"),
        )

    gold_df.write.format("delta") \
        .mode("overwrite") \
        .option("replaceWhere", f"order_date = '{run_date}'") \
        .save(GOLD_PATH)

Time travel in practice. If the Gold aggregate for 2024-11-15 is wrong, you can query the Silver table as it existed before the problematic run:

# Query Silver as it was at version 23 (before the bad exchange-rate run)
silver_before = spark.read.format("delta") \
    .option("versionAsOf", 23) \
    .load(SILVER_PATH)

Optimizing Gold reads with Z-ordering:

spark.sql(f"""
    OPTIMIZE delta.`{GOLD_PATH}`
    ZORDER BY (category, order_date)
""")

πŸ—οΈ Schema Evolution and Adding Layers Beyond Gold

Handling New Fields in Bronze Without Breaking Silver

When the mobile app team adds a new referral_code field to the order_placed event, it arrives in Bronze immediately with the next event. Silver does not break — because Silver reads Bronze with an explicit column list, not SELECT *. The new field is simply ignored until a Silver engineer decides to promote it.

To officially add referral_code to Silver, update the transformation job and use Delta Lake's mergeSchema option on the next write:

deduped.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(SILVER_PATH)  # mergeSchema adds the new columns to the Silver schema

After this write, referral_code is part of the Silver schema. Existing rows have a null for this column, and downstream queries that don't reference it are unaffected. This is the schema evolution contract: Bronze absorbs new fields silently, Silver adopts them explicitly and deliberately.
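Why the explicit column list protects Silver can be shown in a few lines of plain Python, with a dict projection standing in for the Spark select:

```python
# Sketch of why explicit column selection insulates Silver from new Bronze
# fields: a projection keeps unknown columns invisible until promoted.

SILVER_COLUMNS = ["order_id", "user_id", "amount_usd"]  # the contract

def project(bronze_row, columns=SILVER_COLUMNS):
    """Select only contracted columns; extra Bronze fields are ignored."""
    return {c: bronze_row.get(c) for c in columns}

bronze_row = {"order_id": "ord_9f2a11bc", "user_id": "usr_4412",
              "amount_usd": 53.99, "referral_code": "SPRING24"}  # new field

silver_row = project(bronze_row)
print("referral_code" in silver_row)  # False
```

Promoting referral_code is then a one-line change to the contract list, made deliberately and reviewed like any other code change.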

When Four or More Layers Make Sense

Some teams extend the pattern beyond three layers. The most common additions are:

Bronze-Raw vs. Bronze-Validated — Some pipelines split Bronze into two sub-zones: Bronze-Raw (byte-faithful copy with zero parsing) and Bronze-Validated (parsed and type-cast, but not cleaned or deduped). This makes sense when the raw source is binary (Avro, Protobuf) and the parsing step is expensive or brittle. Most teams don't need this.

Platinum (Serving Layer) — A fourth layer that materializes views tuned for specific query engines or APIs. For example, a Platinum layer might pre-join Gold tables into a denormalized dataset for a real-time dashboard backed by Apache Druid or ClickHouse. This is appropriate when Gold tables need further index-tuning for a specific consumption technology.

When extra layers become over-engineering. If your team has fewer than five people and fewer than 10 TB of data, adding a Platinum layer increases operational overhead without measurable benefit. Start with three layers. Add a fourth only when you can name the specific query latency or schema contract problem it solves.

βš–οΈ Trade-offs & Failure Modes in Medallion Pipelines

Performance vs. Storage Cost

Medallion's quality guarantees come at the cost of storing the same data 2–3× over. For a platform with 500 GB/day of raw events, Bronze grows at 500 GB/day, Silver at roughly 650 GB/day (additional enrichment columns), and Gold is negligible. Over a year, that's roughly 420 TB of storage. At 2024 S3 pricing of roughly $0.023/GB/month, every 100 GB of raw daily volume accumulates about 7 TB of new data per month, adding on the order of $160 to the monthly bill for each month it is retained. For most platforms, that cost is trivially justified by the analyst time saved. For extremely high-volume platforms (petabytes per day), the Bronze retention policy (e.g., "keep Bronze for 90 days, keep Silver indefinitely") manages cost without giving up the replay safety window.
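The arithmetic behind those figures, as a quick Python check (constants taken from the paragraph above):

```python
# Back-of-envelope storage math for the figures above (assumed constants).
BRONZE_GB_PER_DAY = 500
SILVER_GB_PER_DAY = 650          # Bronze volume plus ~30% enrichment overhead
S3_USD_PER_GB_MONTH = 0.023

total_gb_year = (BRONZE_GB_PER_DAY + SILVER_GB_PER_DAY) * 365
print(f"{total_gb_year / 1000:.0f} TB accumulated after one year")  # 420 TB

# Steady-state monthly bill once a full year of data is retained:
print(f"${total_gb_year * S3_USD_PER_GB_MONTH:,.0f}/month")  # $9,654/month
```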

Failure Modes and How They Cascade

Broken Bronze-to-Silver job. If the Silver transformation job fails mid-run, the Silver table is left in a partial state. Delta Lake's atomic commits prevent corrupt partial writes — the failed run does not commit anything, and the previous Silver state is intact. The pipeline simply retries from the same Bronze partition.

Schema mismatch between Bronze and Silver. A source team renames a field without notice. The Bronze-to-Silver job crashes when it tries to read the old column name. Fix: update the transformation job to handle both old and new field names during the transition period, then remove the old path once Bronze retention expires for the old schema.
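The transition-period handling reduces to a coalesce over old and new field names; here is a plain-Python sketch using the user_token to user_id rename mentioned earlier as an example:

```python
# Sketch of the dual-read transition for a renamed source field
# (user_token -> user_id, the hypothetical rename from the e-commerce
# scenario earlier in this article).

def resolve_user_id(row):
    """Prefer the new field name; fall back to the old one during transition."""
    return row.get("user_id") or row.get("user_token")

old_schema_row = {"user_token": "usr_4412"}
new_schema_row = {"user_id": "usr_4412"}

print(resolve_user_id(old_schema_row))  # usr_4412
print(resolve_user_id(new_schema_row))  # usr_4412
```

Once Bronze retention expires for the last old-schema partition, the fallback branch can be deleted.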

Gold serving stale data. The Gold aggregation job has not run since yesterday because a Silver job was delayed. The dashboard shows yesterday's numbers. Fix: add a data freshness SLA check at the dashboard layer (e.g., "if gold.daily_gmv_by_category has no partition for today, show a staleness warning instead of yesterday's total as today's").
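A sketch of that freshness check in Python (the partition-listing input is illustrative; a real check would query the Gold table's partitions):

```python
# Sketch of a dashboard-side freshness check for a Gold table.
from datetime import date

def freshness_status(available_partitions, today=None):
    """Return 'fresh' if today's Gold partition exists, else 'stale'."""
    today = today or str(date.today())
    return "fresh" if today in available_partitions else "stale"

partitions = ["2024-11-14", "2024-11-15"]
print(freshness_status(partitions, today="2024-11-15"))  # fresh
print(freshness_status(partitions, today="2024-11-16"))  # stale
```

A "stale" result should render a visible warning rather than silently presenting yesterday's total as today's.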

Medallion vs. Lambda vs. Kappa: Orthogonal Concerns

Medallion Architecture is about data quality layers — it defines what level of cleanliness data has at each zone. Lambda and Kappa architectures are about batch vs. streaming execution — they define how data is processed in time. These are orthogonal: you can implement Medallion using Lambda Architecture (Bronze via streaming, Silver via batch recompute, Gold via batch), using Kappa Architecture (all layers via streaming with continuous updates), or using pure batch ETL. The choice of execution model does not determine whether you use Medallion; the choice of whether to enforce quality zones at each layer does.

🧭 Decision Guide: When to Use Medallion Architecture

| Situation | Recommendation |
| --- | --- |
| Use when | Multiple teams query the same data with different trust requirements; you need replay capability for bug fixes; your data lake is growing into a swamp of undifferentiated files |
| Use when | You're building a data platform that will serve 3+ consumer teams (analytics, data science, product, finance) and need a single source of truth for each layer |
| Avoid when | You have a single, small team and a simple ETL pipeline — the overhead of three explicit layers is not justified for a pipeline that runs twice a day and feeds one dashboard |
| Alternative | If freshness is the primary constraint (sub-second latency) and you are willing to accept weaker cleaning guarantees, a Kappa-only streaming pipeline may serve better than a Medallion batch pipeline |
| Edge case | For regulatory or audit workloads where immutable history is legally required, Bronze's append-only guarantee is not optional — even if the rest of Medallion feels like overhead |

🧪 Practical Example: Bringing Orders from Raw Kafka Events to the CFO Dashboard

This is the complete end-to-end trace for the e-commerce scenario. Running all three jobs produces a Gold table the CFO dashboard can query with no additional transformation:

from datetime import date

TODAY = str(date.today())

# Step 1: Ingest Bronze (runs continuously via Spark Structured Streaming)
kafka_stream = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "app.order_placed,web.order_placed") \
    .load() \
    .selectExpr("CAST(value AS STRING) as raw_json", "timestamp as kafka_ts")

# Parse JSON into columns and write to Bronze
from pyspark.sql.functions import from_json
raw_schema = "order_id STRING, user_id STRING, items ARRAY<STRUCT<sku:STRING,qty:INT,price:STRING>>, currency STRING, timestamp STRING, session_id STRING"

bronze_stream = kafka_stream.withColumn("data", from_json(col("raw_json"), raw_schema)) \
    .select("data.*", col("kafka_ts").alias("ingest_ts")) \
    .withColumn("ingest_date", col("ingest_ts").cast("date"))

query = bronze_stream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://datalake/checkpoints/bronze_orders") \
    .option("path", BRONZE_PATH) \
    .partitionBy("ingest_date") \
    .start()

# Step 2: Run Silver job for today's Bronze partition (scheduled daily)
transform_silver(TODAY)

# Step 3: Run Gold aggregation for today (runs after Silver completes)
aggregate_gold(TODAY)

# Step 4: Query Gold β€” CFO dashboard query
spark.sql(f"""
    SELECT order_date, category, gmv_usd, order_count
    FROM delta.`{GOLD_PATH}`
    WHERE order_date = '{TODAY}'
    ORDER BY gmv_usd DESC
""").show()

At the end of this pipeline run, the CFO dashboard sees today's revenue broken down by product category, with duplicate orders removed, currency normalized, and all data validated against the Silver schema contract — from a single query against Gold, with zero cleaning code in the dashboard layer.

📚 Lessons Learned from Medallion in Production

1. Enforce Bronze immutability with IAM policies, not just convention. "We never mutate Bronze" is a convention that survives until the first incident where someone manually patches Bronze to fix a one-off problem. Enforce it with object storage permissions: the Bronze write role gets PutObject only; no role except a break-glass account gets DeleteObject on Bronze prefixes. Make the contract automatic, not optional.

2. The quarantine table is not an afterthought. The silver.orders_rejected table is a first-class quality signal. Set up a daily alert: if rejected rows exceed 0.5% of Bronze input volume, a data quality engineer is paged. Teams that treat rejection as an edge case discover months later that they have been quietly losing valid orders to over-strict validation rules.
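The alert rule from this lesson is a one-liner; here is a hedged Python sketch using this article's example volumes (the threshold and paging hook are illustrative):

```python
# Sketch of the daily rejection-rate alert from lesson 2.
REJECTION_THRESHOLD = 0.005  # 0.5% of Bronze input volume

def should_page(bronze_row_count, rejected_row_count,
                threshold=REJECTION_THRESHOLD):
    """Page a data-quality engineer when the rejection rate breaches the SLA."""
    if bronze_row_count == 0:
        return True  # no input at all is itself an incident
    return rejected_row_count / bronze_row_count > threshold

print(should_page(5_000_000, 3_000))   # False: 0.06% is within budget
print(should_page(5_000_000, 40_000))  # True: 0.8% breaches the 0.5% SLA
```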

3. Don't skip the Silver-to-Gold dependency check. A Gold table that runs before Silver finishes produces partially correct aggregates. Use a pipeline orchestrator (Apache Airflow, Prefect, Dagster) to model the Bronze → Silver → Gold dependency explicitly and fail Gold loudly if Silver's SLA is missed.

4. Z-order Gold tables from day one. Adding Z-ordering retroactively requires an OPTIMIZE command that rewrites the entire table, which gets more expensive every day you wait. If your Gold tables will be filtered by (region, category) in production queries, bake the OPTIMIZE ... ZORDER BY (region, category) step into the Gold job from the first deployment, while the tables are still small.

5. Document the transformation rules as code, not Confluence pages. Every Bronze-to-Silver transformation rule should live in a version-controlled configuration file (YAML, JSON, or Python constants). "We normalize EUR using the ECB daily rate" in a wiki page becomes drift the moment someone updates the wiki but not the code. Rules in code are the only rules that are actually enforced.

📌 Summary & Key Takeaways

  • Medallion Architecture is a design pattern, not a technology — it organizes a data lake into Bronze (raw), Silver (cleaned), and Gold (aggregated) zones with explicit contracts between them.
  • Bronze is sacred — it is the immutable source of truth that enables any downstream layer to be replayed from scratch when a bug is discovered. Never mutate it; enforce this with IAM policies.
  • Silver is where cross-team trust is built — one transformation job, one documented set of rules, one table that every team queries with confidence. The quarantine table is as important as the clean table.
  • Gold is the self-service layer — pre-aggregated, optimized for query tools, never requiring consumers to write their own cleaning logic. Each Gold table answers one business question cleanly.
  • Delta Lake makes the pattern reliable — ACID transactions, time travel, schema enforcement, and Z-ordering turn a conceptual pattern into a production-grade system on object storage.
  • Schema evolution is a deliberate, explicit act — Bronze absorbs new fields silently; Silver adopts them consciously via mergeSchema. This keeps upstream freedom and downstream stability in balance.
  • Medallion and Lambda/Kappa are orthogonal — choosing Medallion is about enforcing quality zones, not about choosing batch vs. streaming. Use whatever execution model fits your latency needs.

One-sentence takeaway: A data lake without explicit quality zones is a data swamp — Medallion gives every team a clear answer to "which table should I trust, and why?"

πŸ“ Practice Quiz

  1. A Silver-to-Gold aggregation job runs and produces wrong revenue numbers because the exchange-rate lookup returned stale values for 6 hours. What is the correct recovery sequence in a Medallion setup?

    • A) Delete the Bronze partition for that day, fix the lookup, replay Bronze and Silver.
    • B) Delete the affected Silver partition, fix the lookup, re-run Silver, then re-run Gold.
    • C) Update the Gold table with corrected values in-place using a SQL UPDATE statement.
    • D) Re-ingest the raw Kafka events to fix the problem at the source.

    Correct Answer: B — Silver is deleted and replayed from the intact Bronze partition. Bronze is never touched.
  2. Your Bronze Kafka consumer receives the same order_placed event twice because of a Kafka at-least-once delivery guarantee. What is the correct Medallion response?

    • A) Configure Kafka exactly-once delivery to prevent duplicates from entering Bronze.
    • B) Drop duplicates during Bronze ingestion so the raw table stays clean.
    • C) Accept both copies into Bronze (append-only), and deduplicate during the Bronze-to-Silver job.
    • D) Ignore duplicates — Gold aggregation will self-correct through averaging.

    Correct Answer: C — Bronze records what happened, including duplicates. Deduplication is Silver's responsibility.
  3. Which Delta Lake feature lets you query the Silver table as it existed before a problematic transformation run rewrote a partition?

    • A) Z-ordering
    • B) Schema enforcement
    • C) Time travel (VERSION AS OF / TIMESTAMP AS OF)
    • D) ACID merge operations

    Correct Answer: C — Delta Lake's time travel lets you query any historical version of a table by version number or timestamp.
  4. Open-ended challenge: Your team is onboarding a third source system (a third-party reseller API) that delivers order events in a different schema from your existing Kafka topics. The new schema uses customer_ref instead of user_id and prices denominated in a new currency. How would you evolve your Medallion pipeline to absorb this new source without breaking existing Silver consumers? What changes are needed in Bronze, the Bronze-to-Silver transformation, and the Silver schema? What would you put in the quarantine table, and how would you validate the integration before going live?

    (No single correct answer β€” consider schema evolution strategy, transformation rule versioning, quarantine monitoring, and backward compatibility for existing Gold tables.)


Written by Abstract Algorithms (@abstractalgorithms)