Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi
Parquet alone breaks at scale: Delta Lake, Iceberg, and Hudi bring ACID, time travel, and schema evolution to object storage.
TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, and efficient upserts on object storage. Choose Delta Lake for Databricks/Spark-centric shops, Iceberg for multi-engine portability at petabyte scale, and Hudi for near-real-time CDC upserts and streaming pipelines.
Production Nightmares That Open Table Formats Were Built to Solve
It is Friday afternoon. A daily ETL job has just re-processed three days of customer events and landed the results in your S3 data lake. You notice the numbers look wrong and roll back the Airflow DAG, but there is no rollback for Parquet files. The bad data is now merged with the good. Meanwhile a downstream ML job that started ten minutes ago read a half-written partition, trained on corrupted features, and silently produced a model with degraded accuracy. The on-call engineer opens the playbook and finds the recovery step: "manually identify and delete incorrect files, re-run from source."
This scenario repeats in production data lakes every week. Plain Parquet on object storage has five structural weaknesses that become compounding failures as pipelines grow:
| Problem | Root Cause | Real Impact |
| --- | --- | --- |
| No ACID transactions | Multiple writers have no coordination mechanism | Concurrent ETL jobs corrupt shared tables |
| No time travel | Files are mutable; overwrite destroys history | Bad data loads require full re-ingestion from source |
| No schema enforcement | Any writer can add/remove/rename columns | Downstream jobs break silently on unexpected schema drift |
| No efficient deletes | DELETE means rewriting entire partitions | GDPR "right-to-be-forgotten" takes hours per customer |
| Partial-write visibility | Readers see in-flight partitions | Streaming consumers produce wrong aggregates mid-write |
Open table formats were each built to eliminate a specific cluster of these problems: Delta Lake by Databricks in 2019 to tame Spark pipelines, Iceberg by Netflix in 2018 to manage PB-scale Hive tables across query engines, and Hudi by Uber in 2016 (donated to Apache in 2019) to handle millions of streaming upserts per second from trip and payment systems.
What Open Table Formats Actually Are: A Metadata Layer Over Parquet
A common misconception is that switching to Delta Lake or Iceberg means changing your file format. It does not. The underlying data files remain standard Parquet (or ORC). What changes is what sits alongside those files: a structured metadata layer that gives your storage system the same semantic guarantees a relational database provides.
```
s3://data-lake/orders/
├── _delta_log/                        ← Delta Lake: append-only transaction log
│   ├── 000000000000.json
│   ├── 000000000001.json
│   └── 000000000010.checkpoint.parquet
├── part-00000-abc.snappy.parquet      ← actual data files (unchanged Parquet)
└── part-00001-def.snappy.parquet
```
Every open table format stores two things alongside data files: a transaction log or snapshot tree that records every change to the table, and statistics metadata (min/max values per column, row counts) that enable query engines to skip files without scanning them. The format library interprets this metadata to provide the transactional guarantees; the storage layer (S3, ADLS, GCS) remains a dumb byte store.
```mermaid
graph TD
    W[Writer / ETL Job] --> API[Table Format API]
    API --> META[Metadata Layer\n_delta_log / snapshots / .hoodie]
    API --> DATA[Parquet Data Files\non S3 / ADLS / GCS]
    META --> ACID[ACID Guarantees]
    META --> TT[Time Travel]
    META --> SE[Schema Evolution]
    META --> SKIP[File Skipping / Pruning]
    R[Reader / Query Engine] --> META
    R --> DATA
```
The format library acts as the intelligence layer: writers and readers go through the API, which uses metadata to coordinate concurrency, expose historical versions, and prune irrelevant files before any data is read.
Deep Dive: Transaction Logs, Snapshot Trees, and Timeline Internals
The Internals: Three Architectures for Metadata Management
Each format solves the metadata problem with a different data structure. Understanding the internals tells you exactly what the operational trade-offs are.
Delta Lake: Append-Only JSON Transaction Log
Delta Lake stores its transaction log in a _delta_log/ directory alongside the data files. Every commit appends a new numbered JSON file (000000000007.json) containing add/remove actions for data files plus the updated schema, partition information, and commit metadata. Every 10 commits (configurable), Delta consolidates the log into a Parquet checkpoint file for fast log replay. To read the current table state, a query engine replays log entries from the last checkpoint forward: an O(commits-since-checkpoint) scan, typically at most 10 files.
```
_delta_log/
    000000000000.json                ← initial write: add 4 files
    000000000001.json                ← MERGE INTO: remove 2 files, add 2 files
    ...
    000000000010.checkpoint.parquet  ← full table state at v10
    000000000011.json                ← subsequent write
```
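The replay procedure described above can be sketched in a few lines of plain Python. The commit structure below is a simplified stand-in for Delta's real JSON action schema, just enough to show how add/remove actions applied on top of a checkpoint yield the live file set:

```python
# Minimal sketch of Delta-style log replay (simplified action format,
# not the real Delta JSON schema).

def replay_log(checkpoint_files, commits):
    """Start from the checkpoint's file set, then apply each commit's
    remove/add actions in version order to get the current live files."""
    live = set(checkpoint_files)
    for commit in commits:  # only commits after the checkpoint
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live

# Checkpoint at v10 captured 4 files; two later commits modify the table.
checkpoint = ["part-00.parquet", "part-01.parquet",
              "part-02.parquet", "part-03.parquet"]
commits = [
    {"add": ["part-04.parquet"]},                                 # v11: append
    {"remove": ["part-01.parquet"], "add": ["part-05.parquet"]},  # v12: MERGE
]

print(sorted(replay_log(checkpoint, commits)))
# live set: part-00, part-02, part-03, part-04, part-05
```

Because the checkpoint is a full snapshot of table state, the engine never needs to replay more JSON files than have accumulated since the last checkpoint.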
Apache Iceberg: Immutable Snapshot Tree
Iceberg organizes metadata as an immutable tree: a metadata file points to a sequence of snapshots, each snapshot contains a manifest list, each manifest list references manifest files, and each manifest file catalogues a set of data files along with their column statistics. No existing files are ever mutated; a new snapshot is written atomically, and the catalog pointer is updated. This structure means a reader at snapshot N sees a perfectly consistent view regardless of concurrent writers at snapshot N+1.
```
metadata/
    v1.metadata.json    ← points to snapshot-1
    v2.metadata.json    ← points to snapshots 1 and 2 (current = 2)
    snap-1-<uuid>.avro  ← manifest list for snapshot 1
    snap-2-<uuid>.avro  ← manifest list for snapshot 2
    <uuid>-m0.avro      ← manifest file (lists data files + stats)
```
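Resolving a read against this tree is a pointer walk from the catalog down to data files. A toy sketch, where nested dicts stand in for the real Avro metadata structures:

```python
# Toy model of Iceberg's metadata tree: metadata file -> snapshot ->
# manifest list -> manifest files -> data files. All names illustrative.

table_metadata = {
    "current_snapshot_id": 2,
    "snapshots": {
        1: {"manifest_list": ["m0"]},
        2: {"manifest_list": ["m0", "m1"]},  # snapshot 2 reuses m0, adds m1
    },
}
manifests = {
    "m0": ["orders-001.parquet", "orders-002.parquet"],
    "m1": ["new-orders-001.parquet", "new-orders-002.parquet"],
}

def data_files(metadata, snapshot_id=None):
    """Walk metadata -> snapshot -> manifests to list a snapshot's data files."""
    sid = snapshot_id or metadata["current_snapshot_id"]
    files = []
    for manifest in metadata["snapshots"][sid]["manifest_list"]:
        files.extend(manifests[manifest])
    return files

print(data_files(table_metadata))     # current snapshot: all 4 files
print(data_files(table_metadata, 1))  # time travel to snapshot 1: 2 files
```

Note how time travel falls out for free: reading an older snapshot is just starting the walk from a different snapshot id, since nothing in the tree is ever mutated.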
Apache Hudi: Event-Sourced Timeline
Hudi maintains a .hoodie/ timeline, a log of instants (timestamped actions: commit, deltacommit, compaction, clean). Each instant has three states: requested, inflight, and completed. Readers and writers check the timeline to determine which data files are visible. Hudi has two table types that directly affect read/write trade-offs: Copy-on-Write (CoW) rewrites data files on every upsert, producing clean Parquet readable by any engine; Merge-on-Read (MoR) appends delta log files and merges on read, yielding near-real-time write latency but requiring a read-time merge step.
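The timeline's visibility rule can be sketched as a simple filter: readers only see files written by instants that have reached the completed state. The instant and file names below are illustrative, not Hudi's actual on-disk layout:

```python
# Sketch of Hudi-style timeline visibility: readers see only files whose
# writing instant has reached the "completed" state. Structures illustrative.

timeline = [
    {"ts": "20240101120000", "action": "commit",      "state": "completed",
     "files": ["base-1.parquet"]},
    {"ts": "20240101130000", "action": "deltacommit", "state": "completed",
     "files": ["log-1.avro"]},
    {"ts": "20240101140000", "action": "deltacommit", "state": "inflight",
     "files": ["log-2.avro"]},  # still being written: invisible to readers
]

def visible_files(timeline, as_of=None):
    """Files from completed instants, optionally as of a point in time."""
    return [f
            for instant in timeline
            if instant["state"] == "completed"
            and (as_of is None or instant["ts"] <= as_of)
            for f in instant["files"]]

print(visible_files(timeline))                          # in-flight file excluded
print(visible_files(timeline, as_of="20240101120000"))  # point-in-time read
```

The same filter with an `as_of` timestamp is what powers Hudi's point-in-time and incremental queries: the timeline is an event log, so any prefix of it defines a consistent historical view.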
Mathematical Model: Snapshot Isolation and Optimistic Concurrency Control
All three formats implement snapshot isolation: every transaction reads from a consistent snapshot of the table taken at its start time. (Despite the common shorthand, snapshot isolation is not one of the four ANSI SQL isolation levels; it was formalized in Berenson et al.'s 1995 critique of them.) Snapshot isolation prevents the read-write conflicts that corrupt plain-Parquet lakes, but two concurrent transactions can still write to the same files, a write-write conflict that the formats must detect at commit time.
The formal guarantee: given two concurrent transactions T₁ and T₂ both reading snapshot S, and both writing data files, a conflict is detected if and only if their write sets overlap (i.e., they both modify at least one common file or partition). In that case, whichever transaction commits second is rejected (Delta Lake, for example, surfaces this as a ConcurrentModificationException or one of its subclasses).
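That commit-time check can be sketched directly from the definition: the later committer scans every commit that landed after its read snapshot and aborts if any write set overlaps its own. This is a minimal sketch of the pattern, not any format's actual implementation:

```python
# Sketch of optimistic concurrency control as all three formats apply it:
# a commit succeeds only if no commit after its read snapshot touched an
# overlapping set of files. The log structure here is illustrative.

class ConcurrentModificationError(Exception):
    pass

def try_commit(log, read_version, write_set):
    """Append a commit if no intervening commit overlaps write_set."""
    for committed in log[read_version + 1:]:  # commits this writer did not see
        if committed & write_set:             # write sets overlap -> conflict
            raise ConcurrentModificationError(committed & write_set)
    log.append(set(write_set))
    return len(log) - 1                       # new version number

log = [set()]                                 # v0: empty table
v1 = try_commit(log, 0, {"part-a.parquet"})   # writer 1 commits
v2 = try_commit(log, 0, {"part-b.parquet"})   # disjoint files: also succeeds
try:
    try_commit(log, 0, {"part-a.parquet"})    # overlaps v1: rejected
except ConcurrentModificationError:
    print("second writer rejected; must retry from the latest version")
```

The disjoint-write case is why concurrent ETL jobs on different partitions can run safely: their write sets never intersect, so both optimistic commits succeed.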
Delta Lake implements this via optimistic concurrency control (OCC) on the _delta_log/. When a writer attempts to commit version N+1, it reads the log from version N (its read snapshot) to the current version and checks whether any intervening commit touched the same files. If one did, the commit is rejected. If not, the new JSON file is written atomically via a put-if-absent primitive (S3 conditional PUTs, Azure blob leases, or a DynamoDB-backed LogStore on older S3 setups). A consequence is that Delta Lake OCC is partition-friendly by default: two writers that touch different partitions can always commit concurrently.
Iceberg uses a similar OCC model on its metadata file pointer. The catalog (Hive Metastore, Glue, REST catalog) performs an atomic compare-and-swap on the current metadata file pointer. Two writers that add files to different partitions will both succeed; two writers that update the same manifest fail on the second CAS attempt.
Hudi's timeline mechanism is slightly different: it uses file-level locking on the .hoodie/ directory for MoR delta writes, and optimistic conflict detection for CoW compaction. Hudi's record-level upsert model (keyed on a user-specified record key) means write conflicts are resolved at the record level rather than the file level, which is what enables its streaming CDC guarantees.
Performance Analysis: Compaction, File Proliferation, and Metadata Amplification
The three formats have distinct performance profiles that determine operational overhead at scale:
| Metric | Delta Lake | Apache Iceberg | Apache Hudi (MoR) |
| --- | --- | --- | --- |
| Write latency | Low: append JSON log entry | Low: write new snapshot metadata | Very low: append delta log |
| Read latency | Low after checkpoint | Low: manifest pruning | Medium: merge delta logs on read |
| Small file problem | Accumulates; OPTIMIZE + Z-ORDER needed | Accumulates; compaction job needed | Accumulates delta logs; auto-compaction |
| Metadata overhead | O(commits) JSON + checkpoint Parquet | O(snapshots) manifest files | O(instants) timeline entries |
| VACUUM / expiry | VACUUM removes old Parquet | expire_snapshots() removes old metadata | clean service removes old files |
The most critical operational concern for all three formats is small file accumulation. High-frequency streaming writes produce many small Parquet files that severely degrade read performance (S3 LIST operations plus per-file Spark task overhead). Delta Lake's OPTIMIZE command coalesces small files and applies Z-ordering; Iceberg uses rewrite_data_files; Hudi handles this automatically with its compaction service in MoR mode. At Uber's scale (millions of Hudi commits per day), the compaction service runs as a continuous Spark job to keep read performance within SLA.
How Writes Flow Through Each Format's Metadata Layer
The diagrams below trace a single write operation (an upsert of 1,000 new orders) through each format's metadata update path.
Delta Lake Write Flow
```mermaid
graph TD
    A[Spark MERGE INTO orders] --> B{Read _delta_log\nfind latest version N}
    B --> C[Write new Parquet data files\nto S3]
    C --> D{OCC check: did any\ncommit N..current\ntouch same partition?}
    D -- No conflict --> E[Write 000N+1.json\nadd/remove actions]
    D -- Conflict detected --> F[Retry with backoff\nor raise exception]
    E --> G[Readers at version >= N+1\nsee new data]
    E --> H{Is version a multiple\nof 10?}
    H -- Yes --> I[Write checkpoint.parquet\nconsolidate log]
    H -- No --> G
```
Apache Iceberg Snapshot Tree (after write)
```mermaid
graph TD
    A[Catalog pointer] --> B[v3.metadata.json\ncurrent snapshot = 3]
    B --> S3[Snapshot 3\nnew data added]
    B --> S2[Snapshot 2\nprevious state]
    B --> S1[Snapshot 1\ninitial load]
    S3 --> ML3[Manifest List]
    ML3 --> MF1[Manifest File A\n4 data files + stats]
    ML3 --> MF2[Manifest File B\n2 new files]
    MF2 --> DF1[new-orders-001.parquet]
    MF2 --> DF2[new-orders-002.parquet]
```
Apache Hudi Timeline (MoR table)
```mermaid
graph LR
    TL[.hoodie/timeline] --> I1[20240101120000.commit\nCoW base files written]
    I1 --> I2[20240101130000.deltacommit\ndelta log appended]
    I2 --> I3[20240101140000.deltacommit\ndelta log appended]
    I3 --> I4[20240101150000.compaction\ndelta merged into base files]
    I4 --> I5[20240101160000.clean\nold base files removed]
```
Delta Lake serializes concurrent writers through the transaction log. Iceberg achieves consistency through immutable snapshot trees and an atomic catalog pointer swap. Hudi's timeline records every instant of change, enabling both point-in-time queries and incremental reads.
Delta Lake, Iceberg, and Hudi: A Format-by-Format Feature Breakdown
Delta Lake (Databricks, 2019)
Delta Lake's core strength is deep Spark integration. Its MERGE INTO syntax is the most production-tested upsert mechanism across any open table format:
```python
from delta.tables import DeltaTable

delta_orders = DeltaTable.forPath(spark, "s3://data-lake/orders/")

# Upsert: update if exists, insert if new
delta_orders.alias("target").merge(
    source_df.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel: query yesterday's state
orders_yesterday = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2026-03-27") \
    .load("s3://data-lake/orders/")

# Schema evolution: add a new column without rewriting
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
new_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://data-lake/orders/")

# Z-ordering for multi-column data skipping
spark.sql("""
    OPTIMIZE delta.`s3://data-lake/orders/`
    ZORDER BY (customer_id, order_date)
""")
```
Z-ordering is Delta Lake's differentiating performance feature. By co-locating rows with similar values for a set of columns within the same data files, Z-ordering enables data skipping across multiple dimensions simultaneously, far more powerful than single-column Hive partitioning. A query filtering on both customer_id and order_date can skip 80–95% of files in a properly Z-ordered table.
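The intuition behind Z-ordering can be illustrated with the classic Morton code: interleaving the bits of two column values produces a single sort key under which rows close in either dimension tend to land in the same files. This is a simplified sketch of the underlying idea, not Delta's production implementation:

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return key

# Sorting by morton_key clusters rows that are close in BOTH dimensions,
# so each file covers a compact rectangle of (customer_id, order_date)
# space and per-file min/max statistics prune well for filters on either.
rows = [(3, 0), (0, 3), (1, 1), (2, 2), (0, 0)]
print(sorted(rows, key=lambda r: morton_key(*r)))
# -> [(0, 0), (1, 1), (3, 0), (0, 3), (2, 2)]
```

Contrast this with sorting on one column alone: files would then be tight on that column but span the full range of the other, so filters on the second column could prune almost nothing.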
Apache Iceberg (Netflix, 2018)
Iceberg's headline feature is hidden partitioning: partition values are tracked in metadata, not encoded in directory paths. Writers do not need to know the partition scheme; Iceberg derives partition values automatically. This enables partition evolution β changing the partition scheme (e.g., from monthly to daily) without rewriting any data.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "glue") \
    .getOrCreate()

# Create table with hidden partitioning (partition values live in metadata,
# not in directory paths)
spark.sql("""
    CREATE TABLE my_catalog.orders (
        order_id BIGINT,
        customer_id STRING,
        order_date DATE,
        amount DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(order_date))
""")

# Writers just append; no partition column handling needed in the data
orders_df.writeTo("my_catalog.orders").append()

# Evolve partition scheme from daily to monthly without rewriting data
spark.sql("""
    ALTER TABLE my_catalog.orders
    REPLACE PARTITION FIELD days(order_date)
    WITH months(order_date)
""")

# Time travel by snapshot ID
spark.read \
    .option("snapshot-id", "8170123456789012345") \
    .table("my_catalog.orders")

# Incremental read: changes between two snapshots
spark.read \
    .option("start-snapshot-id", "8170123456789012344") \
    .option("end-snapshot-id", "8170123456789012345") \
    .format("iceberg") \
    .load("my_catalog.orders")
```
Iceberg's multi-engine support is unmatched: the same table can be written by Spark and read by Trino, Flink, Hive, and Presto without any format conversion. Netflix runs Iceberg tables that are simultaneously written by Flink (real-time events), read by Trino (ad-hoc SQL), and batch-processed by Spark (ML feature engineering) β all against the same physical files.
Apache Hudi (Uber, 2016)
Hudi was engineered for one primary use case: high-throughput streaming upserts from CDC pipelines. Its record-level update model (every row has a recordKey and preCombineField) makes it the natural choice for systems where source events continuously update existing records.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
    .getOrCreate()

# Write with upsert semantics (MoR table type for low-latency writes)
orders_df.write \
    .format("hudi") \
    .option("hoodie.table.name", "orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "5") \
    .mode("append") \
    .save("s3://data-lake/orders/")

# Incremental query: only records changed since a timestamp
spark.read \
    .format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20260327000000") \
    .load("s3://data-lake/orders/")
```
The incremental query is Hudi's killer feature for downstream consumers: instead of scanning the entire table to find what changed, a consumer provides a timestamp and receives only the delta since that point. This makes Hudi CDC pipelines dramatically cheaper than equivalent Spark full-table scans.
Head-to-Head Feature Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| ACID transactions | ✅ OCC on _delta_log | ✅ CAS on catalog pointer | ✅ Timeline + file locking |
| Time travel | ✅ VERSION/TIMESTAMP AS OF | ✅ Snapshot ID / timestamp | ✅ Point-in-time instant |
| Upserts | MERGE INTO (Spark-native) | Row-level position/equality deletes | Native upserts, record-keyed (MoR/CoW) |
| Partition evolution | Manual repartition + rewrite | ✅ Hidden + evolve without rewrite | Limited, schema-only evolution |
| Multi-engine | Spark-primary; Trino partial | Best-in-class (Spark/Flink/Trino/Hive) | Good (Spark/Flink; Trino improving) |
| Streaming | Spark Structured Streaming | Flink native; Spark streaming | Best-in-class: built for streaming |
| Compaction | OPTIMIZE + VACUUM (manual) | rewrite_data_files (manual/scheduled) | Auto-compaction (inline or async) |
| Schema evolution | mergeSchema option | Full DDL ALTER TABLE | Schema-on-write evolution |
| Best ecosystem | Databricks / Unity Catalog | Open multi-engine lakehouse | Kafka CDC / streaming upserts |
Apache Iceberg's Snapshot Architecture: Hidden Partitioning Under the Hood
Netflix open-sourced Iceberg after discovering that Hive's partition model, which encodes partition values in directory paths (/year=2024/month=03/), caused three chronic production problems:
- Partition misidentification: jobs that forgot to filter on partition columns silently performed full table scans.
- Partition evolution pain: changing from monthly to daily partitioning required copying all historical data.
- Small partition explosion: at petabyte scale, thousands of partition directories created Hive Metastore bottlenecks on LIST operations.
Iceberg solves all three by separating the logical partition scheme from the physical file layout. The partition spec (e.g., days(event_time)) is stored in the table metadata, not in file paths. Data files live at flat paths like data/<uuid>.parquet, and each manifest file records the partition value alongside the file reference:
```
# Manifest file entry (Avro record, rendered as JSON)
{
  "status": "ADDED",
  "data_file": {
    "file_path": "s3://lake/orders/data/0001-abc.parquet",
    "partition": { "order_date_day": 19810 },   ← derived value, not in path
    "record_count": 1200,
    "column_sizes": { "order_id": 9800, "amount": 6400 },
    "lower_bounds": { "order_date": <2024-03-01>, "amount": 0.99 },
    "upper_bounds": { "order_date": <2024-03-31>, "amount": 4999.00 }
  }
}
```
When a query engine evaluates WHERE order_date = '2024-03-15', Iceberg evaluates this filter against the lower_bounds/upper_bounds in each manifest entry before opening any data file. Files that cannot possibly contain matching rows are pruned entirely. This manifest-level pruning is what makes Iceberg viable for PB-scale tables where traditional Hive would require minutes just to enumerate partitions.
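The pruning rule itself is plain interval logic. A sketch over hypothetical manifest entries, mirroring the bounds shown above:

```python
# Sketch of Iceberg-style manifest pruning: skip any data file whose
# column min/max bounds cannot contain the predicate value.
from datetime import date

manifest_entries = [
    {"file": "0001-abc.parquet",
     "lower": {"order_date": date(2024, 3, 1)},
     "upper": {"order_date": date(2024, 3, 31)}},
    {"file": "0002-def.parquet",
     "lower": {"order_date": date(2024, 4, 1)},
     "upper": {"order_date": date(2024, 4, 30)}},
]

def prune(entries, column, value):
    """Keep only files whose [lower, upper] range can contain value."""
    return [e["file"] for e in entries
            if e["lower"][column] <= value <= e["upper"][column]]

# WHERE order_date = '2024-03-15' opens only the March file.
print(prune(manifest_entries, "order_date", date(2024, 3, 15)))
```

The key property is that this filter runs entirely against metadata: no Parquet footer is opened, no object-store GET is issued, for any file the bounds rule out.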
Partition evolution in practice: when Netflix switched a 50-TB events table from monthly to daily partitioning, zero bytes of historical data were rewritten. Iceberg simply updated the partition spec in the metadata file. New writes used the daily scheme; old files retained their monthly partition values in manifests. Both old and new files co-exist transparently; queries filter correctly against both partition schemes simultaneously.
For a full exploration of multi-engine lakehouse architectures and the Medallion pattern that these table formats power, see Big Data Architecture Patterns: Lambda, Kappa, Medallion, and Data Mesh.
Real-World Applications: Netflix, Uber, and Databricks at Petabyte Scale
Netflix (Apache Iceberg): Netflix's data platform runs approximately 4,000 Iceberg tables, the largest exceeding 10 PB. The company chose Iceberg after a 2018 audit revealed that Hive Metastore bottlenecks were responsible for 30% of Spark job failures on their largest event tables. Today, Flink jobs write streaming events to Iceberg at sub-minute latency while Trino analysts run ad-hoc SQL against the same tables, a workload combination that no other format handles as cleanly.
Uber (Apache Hudi): Uber processes 100 billion+ trip events per day across their Hadoop data lake, with frequent updates as trip status changes (requested → accepted → completed → rated). Traditional batch pipelines that rewrote full partitions on every status change were both slow and expensive. Hudi's MoR upsert model reduced their data freshness lag from hours to minutes while cutting storage write amplification by 4x. Uber runs Hudi on a 500+ PB HDFS cluster and has contributed extensively to Hudi's incremental processing capabilities.
Databricks (Delta Lake): Delta Lake's most visible enterprise use case is the Medallion Architecture: Bronze (raw ingestion), Silver (cleansed/validated), Gold (aggregated for BI). Databricks reports that Medallion pipelines built on Delta Lake cut data quality incident rates by 60–70% in enterprise customers by making every layer independently recoverable through time travel. The MERGE INTO command is the Silver-layer workhorse: it idempotently upserts new records while preserving exactly-once semantics even when Spark jobs are retried.
Trade-offs and Failure Modes: Small Files, Compaction Overhead, and Vendor Lock-in
Small File Accumulation
Every streaming write creates a new set of Parquet files. A Flink job that checkpoints every 30 seconds and writes one file per partition per checkpoint generates 120 small files per hour per partition, or 2,880 per day. A table with 100 active partitions accumulates 288,000 files per day. At this scale, S3 LIST operations alone can take seconds, and Spark's planning phase must open manifests for tens of thousands of files. All three formats require scheduled compaction to remain performant; this is a non-trivial operational overhead that teams frequently underestimate.
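The arithmetic above generalizes to a simple back-of-envelope model, assuming one file per partition per checkpoint and no compaction:

```python
# Back-of-envelope file-accumulation model for a streaming writer
# (assumes one file per partition per checkpoint, no compaction).

def files_accumulated(checkpoint_interval_s, partitions, hours):
    checkpoints_per_hour = 3600 / checkpoint_interval_s
    return int(checkpoints_per_hour * partitions * hours)

print(files_accumulated(30, 1, 24))    # one partition, one day: 2,880 files
print(files_accumulated(30, 100, 24))  # 100 partitions, one day: 288,000
print(files_accumulated(15, 100, 24))  # halving the interval doubles it
```

Running this model against your own checkpoint interval and partition count before deploying is a cheap way to size the compaction job you will inevitably need.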
Metadata Amplification in Iceberg
Iceberg's manifest file design is elegant for reads but generates significant metadata write amplification. Each snapshot creates a new manifest list and potentially several new manifest files. For a table with 1,000 data files per partition and 500 active partitions, each write operation appends a new snapshot referencing up to 1 GB of manifest data. Running expire_snapshots() regularly is not optional; it is a prerequisite for keeping the metadata layer manageable.
Hudi CoW Write Amplification
CoW tables rewrite entire data files on every upsert, even if only one row changed. A Hudi CoW table with 10 GB files and a 0.1% update rate will rewrite 10 GB of data to change 10 MB of records. This is acceptable for read-heavy tables where query performance justifies the write cost, but at high upsert frequency it becomes prohibitive. The failure mode is subtle: write amplification silently increases S3 PUT costs until the bill becomes the monitoring signal.
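The amplification factor falls straight out of the arithmetic. A toy calculation, assuming the whole file is rewritten for any change:

```python
# Write amplification for a Copy-on-Write upsert: the whole file is
# rewritten no matter how few rows actually change.

def write_amplification(file_size_gb, update_rate):
    changed_gb = file_size_gb * update_rate
    return file_size_gb / changed_gb  # bytes written per byte changed

print(write_amplification(10, 0.001))  # 0.1% update rate -> 1000x
```

A 1000x multiplier on every upsert batch is the number to watch when deciding between CoW and MoR for an update-heavy table.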
Vendor Lock-in Risk
Delta Lake's tight ecosystem coupling to Databricks (Unity Catalog, Liquid Clustering, and Predictive I/O are proprietary extensions on top of the open protocol) creates migration risk. OSS Delta Lake and Databricks Delta Lake are compatible but not identical; proprietary optimizations may embed assumptions that are difficult to replicate on other platforms.
Decision Guide: Choosing Between Delta Lake, Iceberg, and Hudi
| Scenario | Recommendation | Rationale |
| --- | --- | --- |
| All workloads run on Databricks / Spark only | Delta Lake | Tightest integration, MERGE INTO, Z-ordering, Unity Catalog |
| Mix of Spark + Trino + Flink across multiple teams | Apache Iceberg | Best multi-engine interoperability; hidden partitioning eliminates engine-specific bugs |
| CDC pipeline from MySQL/Postgres, near-real-time requirements | Apache Hudi | Record-level upserts, MoR for sub-minute freshness, incremental queries |
| Medallion architecture (Bronze → Silver → Gold) | Delta Lake or Iceberg | Both support ACID + time travel; choose by query engine mix |
| GDPR deletion requests at scale (millions of users) | Delta Lake or Iceberg | Delta's DELETE + REORG; Iceberg's equality delete files handle row-level deletes cleanly |
| Petabyte-scale with partition evolution requirements | Apache Iceberg | Partition evolution without data rewrite is unique to Iceberg |
| Team starting fresh, no existing Spark investment | Apache Iceberg | Neutral governance (Apache Foundation), widest engine support, active ecosystem |
Key non-obvious consideration: if your organization plans to migrate from Databricks to another query engine in the next 3–5 years, start with Iceberg. Delta Lake files are readable by other engines, but Delta Lake's metadata is the binding dependency. An Iceberg table can be queried by Spark today and Trino tomorrow without a single config change.
GDPR-Compliant Right-to-Forget: A Worked Deletion Pipeline with Delta Lake
GDPR Article 17 requires that a data processor delete all personal data for a customer upon request, typically within 30 days. For a data lake storing hundreds of billions of events across years of Parquet files, this is one of the hardest operational problems, and the one that makes the "no efficient deletes in plain Parquet" problem most viscerally expensive.
The Plain-Parquet Problem
```python
# Without a table format: find and rewrite every partition
for partition in list_partitions("s3://lake/events/"):
    df = spark.read.parquet(partition)
    if df.filter(f"customer_id = '{customer_id}'").count() > 0:
        df.filter(f"customer_id != '{customer_id}'") \
          .write.mode("overwrite").parquet(partition)

# Result: rewrites every partition the customer appears in.
# For a customer with events across 3 years = 1,095 partitions rewritten.
```
The Delta Lake Solution
```python
from delta.tables import DeltaTable

# Silver layer: events table with MERGE INTO for ongoing upserts
def upsert_customer_events(batch_df, batch_id):
    silver = DeltaTable.forPath(spark, "s3://lake/silver/events/")
    silver.alias("target").merge(
        batch_df.alias("source"),
        "target.event_id = source.event_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

# Streaming write via Structured Streaming
spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "customer_events") \
    .load() \
    .writeStream \
    .foreachBatch(upsert_customer_events) \
    .option("checkpointLocation", "s3://lake/checkpoints/events/") \
    .start()

# GDPR deletion: surgical DELETE, no partition rewrite
spark.sql("""
    DELETE FROM delta.`s3://lake/silver/events/`
    WHERE customer_id = 'cust-4892'
""")

# Verify deletion
remaining = spark.sql("""
    SELECT COUNT(*) FROM delta.`s3://lake/silver/events/`
    WHERE customer_id = 'cust-4892'
""")
assert remaining.collect()[0][0] == 0

# Purge history so deleted data cannot be recovered via time travel
spark.sql("""
    VACUUM delta.`s3://lake/silver/events/`
    RETAIN 0 HOURS
""")
# Note: RETAIN 0 HOURS requires spark.databricks.delta.retentionDurationCheck.enabled = false
# In production, align VACUUM retention with your data retention policy (e.g., 7 days)
```
Delta Lake's DELETE FROM rewrites only the Parquet files that contain the matching rows, not every partition. For a customer present in 3 partitions out of 1,095, this is a 365x reduction in I/O. The VACUUM command removes old file versions from storage, ensuring time-travel queries cannot retrieve deleted data, a requirement for full GDPR compliance.
The Bronze layer (raw, immutable Parquet) is handled separately: GDPR deletion in Bronze typically requires a crypto-shredding approach: encrypting all events for a customer with a per-customer key, then destroying the key. The data remains in Bronze but becomes irreversibly unreadable, satisfying GDPR's right-to-erasure without rewriting 3 years of raw events.
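The crypto-shredding pattern can be sketched as follows. The XOR-keystream "cipher" here is a toy stand-in (production code should use a vetted AEAD cipher such as AES-GCM and a real KMS); the point is the key lifecycle: one random key per customer, and destroying the key erases the data:

```python
import hashlib
import os

# Toy sketch of crypto-shredding. The SHA-256 keystream XOR below is for
# illustration only and is NOT cryptographically sound -- use a vetted
# AEAD cipher in practice. What matters is the key lifecycle pattern.

key_store = {}  # per-customer keys; stands in for a KMS

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from key (illustrative only)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_event(customer_id: str, event: bytes) -> bytes:
    """Encrypt an event under the customer's key before landing it in Bronze."""
    key = key_store.setdefault(customer_id, os.urandom(32))
    return bytes(a ^ b for a, b in zip(event, keystream(key, len(event))))

def shred(customer_id: str) -> None:
    """GDPR erasure: destroy the key; stored ciphertext becomes unreadable."""
    del key_store[customer_id]

ciphertext = encrypt_event("cust-4892", b"trip completed at 12:04")
shred("cust-4892")
# The event bytes remain in Bronze, but without the key they are noise.
```

Deletion now costs one key-store operation per customer instead of a rewrite of years of raw Parquet, which is what makes the 30-day GDPR deadline tractable for the immutable layer.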
What Production Data Teams Learn the Hard Way
Lesson 1: Compaction is not optional; it is a first-class pipeline job. Every team that deploys a streaming write to Delta, Iceberg, or Hudi discovers small file accumulation within weeks. Plan for a compaction job from day one; treat it as a sidecar to every write pipeline, not as a cleanup task to address later.
Lesson 2: Schema evolution and schema enforcement are different problems. Delta Lake defaults to schema enforcement (strict mode): any writer that introduces a new column or changes a type fails. This is the correct default in production. Turning on mergeSchema or autoMerge without governance gates is how Silver-layer schemas become unpredictable over time. Define a schema-change approval process before you enable evolution.
Lesson 3: Time travel is only as useful as your retention window. Delta Lake retains transaction log entries and old data files until VACUUM is run. If you VACUUM aggressively (e.g., daily with 0-hour retention), you lose the ability to roll back to yesterday's state, which is exactly when you need it. Set your VACUUM retention to at least 7 days, align it with your incident SLA, and document it as an operational constraint.
Lesson 4: Test your ACID guarantees with concurrent writers before production. The optimistic concurrency control in all three formats serializes conflicting writes, but "conflicting" is defined differently per format and table config. Delta Lake writers that touch different partitions will never conflict; writers that both write to the same partition will. Validate your concurrency assumptions with load tests before trusting your ETL pipeline on critical tables.
Lesson 5: For multi-engine shops, Iceberg's REST catalog is the connector you need. The Hive Metastore is the legacy coupling point for Iceberg tables, and it serializes catalog operations in a way that bottlenecks at hundreds of concurrent queries. Iceberg's REST catalog (backed by a simple HTTP service) is horizontally scalable and works identically across Spark, Trino, and Flink; upgrade to it before you hit the Metastore ceiling.
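Switching an existing Spark job to a REST catalog is a configuration change. A minimal sketch, where the catalog name `lake`, the endpoint URI, and the warehouse path are placeholders (the Iceberg Spark runtime jar must also be on the classpath):

```python
from pyspark.sql import SparkSession

# Hypothetical REST catalog endpoint and warehouse location.
spark = (SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://iceberg-catalog.internal:8181")
    .config("spark.sql.catalog.lake.warehouse", "s3://lake/warehouse/")
    .getOrCreate())

# Tables then resolve through the REST catalog exactly as with the Metastore:
spark.table("lake.analytics.orders")
```

The same endpoint can be configured in Trino and Flink, which is precisely what makes the REST catalog the neutral meeting point for multi-engine shops.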
Lesson 6: Hidden partitioning in Iceberg eliminates a whole class of pipeline bugs. Hive-style partition-aware queries (WHERE year=2024 AND month=03) silently full-scan the table if the partition predicate is incorrectly typed or the partition column is aliased. Iceberg's hidden partitioning makes this impossible: the engine always pushes the filter down through metadata pruning automatically.
TLDR: Summary & Key Takeaways for Open Table Format Selection
- All three formats (Delta Lake, Iceberg, Hudi) store standard Parquet files with a metadata layer; your data format does not change, your operational semantics do.
- Delta Lake wins for Spark-native Databricks shops: MERGE INTO, Z-ordering, and the Medallion architecture integration are unmatched in that ecosystem.
- Apache Iceberg wins for multi-engine environments: hidden partitioning, partition evolution, and the REST catalog make it the most portable and scalable choice for organizations running Spark + Trino + Flink against the same tables.
- Apache Hudi wins for streaming CDC upserts: record-keyed MoR writes, automatic compaction, and incremental queries are purpose-built for near-real-time pipelines from Debezium or Kafka Connect CDC sources.
- Compaction is a required operational primitive, not an afterthought β plan for it from day one regardless of format choice.
- GDPR deletion is solvable with all three formats via surgical DELETE plus history expiry; it is fundamentally unsolvable with plain Parquet at scale.
- The emerging convergence: Delta and Iceberg are becoming increasingly interoperable (Delta's Universal Format supports Iceberg reads); in 2-3 years, format choice may matter less than catalog and governance choice.
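The DELETE-plus-expiry point deserves one concrete illustration. The toy snapshot store below is not any format's real API; it just shows why a surgical DELETE alone is not GDPR compliance: the record stays reachable via time travel until old snapshots are expired (e.g. Delta's VACUUM or Iceberg's expire_snapshots).

```python
# Toy snapshot history (illustrative only): a DELETE writes a new snapshot,
# but earlier snapshots still contain the record until history is expired.

snapshots = [
    {"version": 1, "rows": {"u1": "pii", "u2": "pii"}},
    {"version": 2, "rows": {"u2": "pii"}},   # surgical DELETE of u1
]

def visible_anywhere(key: str, snaps: list) -> bool:
    """True if any retained snapshot still exposes the record."""
    return any(key in s["rows"] for s in snaps)

# After the DELETE, u1 is gone from the latest snapshot but not from history.
assert visible_anywhere("u1", snapshots)

# Only after expiring snapshots older than the DELETE is the data truly gone.
retained = [s for s in snapshots if s["version"] >= 2]
assert not visible_anywhere("u1", retained)
```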
π Practice Quiz
A Delta Lake table's transaction log holds 23 commits, and its most recent checkpoint was written at commit 10. How many JSON log files does a query engine need to read, on top of the checkpoint, to reconstruct the current table state?
- A) 23 files: every commit since the table was created
- B) 13 files: the JSON entries for commits 11 through 23, applied on top of the checkpoint
- C) 3 files: Delta compresses log files automatically every 10 commits

Correct Answer: B
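The checkpoint arithmetic can be sketched in code. This is a simplified model of log replay (real readers also handle multi-part checkpoints and the `_last_checkpoint` hint file), but the counting logic is the same: a checkpoint folds in everything up to its version, and only later JSON entries need replaying.

```python
# Simplified model of which _delta_log files a reader must load.

def files_to_read(latest_commit: int, last_checkpoint) -> list:
    """Return the log files needed to rebuild the table at latest_commit."""
    if last_checkpoint is None:
        start = 0                       # no checkpoint yet: replay every commit
        files = []
    else:
        start = last_checkpoint + 1     # checkpoint covers 0..last_checkpoint
        files = ["%020d.checkpoint.parquet" % last_checkpoint]
    files += ["%020d.json" % v for v in range(start, latest_commit + 1)]
    return files

# 23 commits with a checkpoint at commit 10: one checkpoint Parquet plus the
# 13 JSON entries for commits 11-23.
log = files_to_read(23, 10)
assert log[0].endswith("checkpoint.parquet")
assert sum(f.endswith(".json") for f in log) == 13
```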
Your team runs Flink for real-time event ingestion, Trino for analyst SQL queries, and Spark for weekly ML batch training β all against the same customer events table. No single team "owns" the table format choice. Which format is the strongest fit and why?
- A) Delta Lake: Spark integration is best-in-class and Trino connectors exist
- B) Apache Iceberg: the REST catalog and engine-agnostic metadata let all three engines read and write natively
- C) Apache Hudi: MoR tables support both streaming ingestion and batch reads

Correct Answer: B
In Apache Hudi's Merge-on-Read (MoR) table type, a read query on a partition with 3 base files and 15 delta log files requires which operation before returning results?
- A) A full compaction run that rewrites all 3 base files and discards the delta logs
- B) A read-time merge that combines the 3 base files with applicable delta log entries to resolve the latest record state
- C) A scan of only the 15 delta log files; base files are ignored after a deltacommit

Correct Answer: B
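The read-time merge in answer B can be shown with a toy model. This is illustrative only, not Hudi's implementation: base files carry the last compacted state, delta logs carry newer record-keyed upserts, and the reader resolves the latest value per key at query time.

```python
# Toy Merge-on-Read query: merge base files with delta logs per record key.

def read_merged(base_files: list, delta_logs: list) -> dict:
    """Later log entries win per key; base files supply the compacted state."""
    state = {}
    for base in base_files:
        state.update(base)              # compacted state first
    for log in delta_logs:
        state.update(log)               # then replay upserts in commit order
    return state

base = [{"u1": "v1", "u2": "v1"}]
logs = [{"u2": "v2"}, {"u3": "v1"}, {"u2": "v3"}]
# The reader returns the latest state without rewriting any base file.
assert read_merged(base, logs) == {"u1": "v1", "u2": "v3", "u3": "v1"}
```

Compaction later rewrites the base files with this merged state, which is why MoR reads get slower as delta logs accumulate between compaction runs.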
Open-ended challenge: Your organization has a 20 TB customer activity table written by a Kafka-backed Flink pipeline (1 million upserts per minute) and read by both Trino analysts (ad-hoc SQL) and a Spark ML pipeline (weekly full-table scan). GDPR deletion requests must complete within 24 hours. Sketch which table format you would choose and the compaction + deletion strategy you would implement. What would change if the Flink write frequency doubled to 2 million upserts per minute?
π Related Posts
- Data Warehouse vs Data Lake vs Data Lakehouse - Understand the architectural evolution from warehouses to lakes to the Lakehouse pattern that open table formats enable.
- Big Data Architecture Patterns: Lambda, Kappa, Medallion, and Data Mesh - How the Medallion Architecture layers (Bronze/Silver/Gold) map directly to open table format capabilities.
- How Kafka Works - The CDC and streaming source that feeds Hudi pipelines and Delta Lake Structured Streaming jobs.

Written by Abstract Algorithms (@abstractalgorithms)