Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi
Parquet alone breaks at scale: Delta Lake, Iceberg, and Hudi bring ACID, time travel, and schema evolution to object storage.
TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, and efficient upserts on object storage. Choose Delta Lake for Databricks/Spark-centric shops, Iceberg for multi-engine portability at petabyte scale, and Hudi for near-real-time CDC upserts and streaming pipelines.
Production Nightmares That Open Table Formats Were Built to Solve
It is Friday afternoon. A daily ETL job has just re-processed three days of customer events and landed the results in your S3 data lake. You notice the numbers look wrong and roll back the Airflow DAG, but there is no rollback for Parquet files. The bad data is now merged with the good. Meanwhile a downstream ML job that started ten minutes ago read a half-written partition, trained on corrupted features, and silently produced a model with degraded accuracy. The on-call engineer opens the playbook and finds the recovery step: "manually identify and delete incorrect files, re-run from source."
This scenario repeats in production data lakes every week. Plain Parquet on object storage has five structural weaknesses that become compounding failures as pipelines grow:
| Problem | Root Cause | Real Impact |
| --- | --- | --- |
| No ACID transactions | Multiple writers have no coordination mechanism | Concurrent ETL jobs corrupt shared tables |
| No time travel | Files are mutable; overwrite destroys history | Bad data loads require full re-ingestion from source |
| No schema enforcement | Any writer can add/remove/rename columns | Downstream jobs break silently on unexpected schema drift |
| No efficient deletes | DELETE means rewriting entire partitions | GDPR "right-to-be-forgotten" takes hours per customer |
| Partial-write visibility | Readers see in-flight partitions | Streaming consumers produce wrong aggregates mid-write |
Open table formats were each built to eliminate a specific cluster of these problems: Delta Lake by Databricks in 2019 to tame Spark pipelines, Iceberg by Netflix in 2018 to manage PB-scale Hive tables across query engines, and Hudi by Uber in 2016 (donated to Apache in 2019) to handle millions of streaming upserts per second from trip and payment systems.
What Open Table Formats Actually Are: A Metadata Layer Over Parquet
A common misconception is that switching to Delta Lake or Iceberg means changing your file format. It does not. The underlying data files remain standard Parquet (or ORC). What changes is what sits alongside those files: a structured metadata layer that gives your storage system the same semantic guarantees a relational database provides.
```
s3://data-lake/orders/
├── _delta_log/                        ← Delta Lake: append-only transaction log
│   ├── 000000000000.json
│   ├── 000000000001.json
│   └── 000000000010.checkpoint.parquet
├── part-00000-abc.snappy.parquet      ← actual data files (unchanged Parquet)
└── part-00001-def.snappy.parquet
```
Every open table format stores two things alongside data files: a transaction log or snapshot tree that records every change to the table, and statistics metadata (min/max values per column, row counts) that enable query engines to skip files without scanning them. The format library interprets this metadata to provide the transactional guarantees; the storage layer (S3, ADLS, GCS) remains a dumb byte store.
```mermaid
graph TD
    W[Writer / ETL Job] --> API[Table Format API]
    API --> META[Metadata Layer\n_delta_log / snapshots / .hoodie]
    API --> DATA[Parquet Data Files\non S3 / ADLS / GCS]
    META --> ACID[ACID Guarantees]
    META --> TT[Time Travel]
    META --> SE[Schema Evolution]
    META --> SKIP[File Skipping / Pruning]
    R[Reader / Query Engine] --> META
    R --> DATA
```
The format library acts as the intelligence layer: writers and readers go through the API, which uses metadata to coordinate concurrency, expose historical versions, and prune irrelevant files before any data is read.
Deep Dive: Transaction Logs, Snapshot Trees, and Timeline Internals
The Internals: Three Architectures for Metadata Management
Each format solves the metadata problem with a different data structure. Understanding the internals tells you exactly what the operational trade-offs are.
Delta Lake: Append-Only JSON Transaction Log
Delta Lake stores its transaction log in a _delta_log/ directory alongside the data files. Every commit appends a new numbered JSON file (000000000007.json) containing add/remove actions for data files plus the updated schema, partition information, and commit metadata. Every 10 commits (configurable), Delta consolidates the log into a Parquet checkpoint file for fast log replay. To read the current table state, a query engine replays log entries from the last checkpoint forward: an O(commits-since-checkpoint) scan, typically at most 10 files.
```
_delta_log/
    000000000000.json                ← initial write: add 4 files
    000000000001.json                ← MERGE INTO: remove 2 files, add 2 files
    ...
    000000000010.checkpoint.parquet  ← full table state at v10
    000000000011.json                ← subsequent write
```
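The replay procedure described above can be sketched in a few lines of plain Python. The commit structure below is a simplified stand-in for Delta's real JSON action schema, just enough to show how add/remove actions applied on top of a checkpoint yield the live file set:

```python
# Minimal sketch of Delta-style log replay (simplified action format,
# not the real Delta JSON schema).

def replay_log(checkpoint_files, commits):
    """Start from the checkpoint's file set, then apply each commit's
    remove/add actions in version order to get the current live files."""
    live = set(checkpoint_files)
    for commit in commits:  # only commits after the checkpoint
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live

# Checkpoint at v10 captured 4 files; two later commits modify the table.
checkpoint = ["part-00.parquet", "part-01.parquet",
              "part-02.parquet", "part-03.parquet"]
commits = [
    {"add": ["part-04.parquet"]},                                 # v11: append
    {"remove": ["part-01.parquet"], "add": ["part-05.parquet"]},  # v12: MERGE
]

print(sorted(replay_log(checkpoint, commits)))
# live set: part-00, part-02, part-03, part-04, part-05
```

Because the checkpoint is a full snapshot of table state, the engine never needs to replay more JSON files than have accumulated since the last checkpoint.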
Apache Iceberg: Immutable Snapshot Tree
Iceberg organizes metadata as an immutable tree: a metadata file points to a sequence of snapshots, each snapshot contains a manifest list, each manifest list references manifest files, and each manifest file catalogues a set of data files along with their column statistics. No existing files are ever mutated; a new snapshot is written atomically, and the catalog pointer is updated. This structure means a reader at snapshot N sees a perfectly consistent view regardless of concurrent writers at snapshot N+1.
```
metadata/
    v1.metadata.json    ← points to snapshot-1
    v2.metadata.json    ← points to snapshots 1 and 2 (current = 2)
    snap-1-<uuid>.avro  ← manifest list for snapshot 1
    snap-2-<uuid>.avro  ← manifest list for snapshot 2
    <uuid>-m0.avro      ← manifest file (lists data files + stats)
```
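Resolving a read against this tree is a pointer walk from the catalog down to data files. A toy sketch, where nested dicts stand in for the real Avro metadata structures:

```python
# Toy model of Iceberg's metadata tree: metadata file -> snapshot ->
# manifest list -> manifest files -> data files. All names illustrative.

table_metadata = {
    "current_snapshot_id": 2,
    "snapshots": {
        1: {"manifest_list": ["m0"]},
        2: {"manifest_list": ["m0", "m1"]},  # snapshot 2 reuses m0, adds m1
    },
}
manifests = {
    "m0": ["orders-001.parquet", "orders-002.parquet"],
    "m1": ["new-orders-001.parquet", "new-orders-002.parquet"],
}

def data_files(metadata, snapshot_id=None):
    """Walk metadata -> snapshot -> manifests to list a snapshot's data files."""
    sid = snapshot_id or metadata["current_snapshot_id"]
    files = []
    for manifest in metadata["snapshots"][sid]["manifest_list"]:
        files.extend(manifests[manifest])
    return files

print(data_files(table_metadata))     # current snapshot: all 4 files
print(data_files(table_metadata, 1))  # time travel to snapshot 1: 2 files
```

Note how time travel falls out for free: reading an older snapshot is just starting the walk from a different snapshot id, since nothing in the tree is ever mutated.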
Apache Hudi: Event-Sourced Timeline
Hudi maintains a .hoodie/ timeline, a log of instants (timestamped actions: commit, deltacommit, compaction, clean). Each instant has three states: requested, inflight, and completed. Readers and writers check the timeline to determine which data files are visible. Hudi has two table types that directly affect read/write trade-offs: Copy-on-Write (CoW) rewrites data files on every upsert, producing clean Parquet readable by any engine; Merge-on-Read (MoR) appends delta log files and merges on read, yielding near-real-time write latency but requiring a read-time merge step.
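The timeline's visibility rule can be sketched as a simple filter: readers only see files written by instants that have reached the completed state. The instant and file names below are illustrative, not Hudi's actual on-disk layout:

```python
# Sketch of Hudi-style timeline visibility: readers see only files whose
# writing instant has reached the "completed" state. Structures illustrative.

timeline = [
    {"ts": "20240101120000", "action": "commit",      "state": "completed",
     "files": ["base-1.parquet"]},
    {"ts": "20240101130000", "action": "deltacommit", "state": "completed",
     "files": ["log-1.avro"]},
    {"ts": "20240101140000", "action": "deltacommit", "state": "inflight",
     "files": ["log-2.avro"]},  # still being written: invisible to readers
]

def visible_files(timeline, as_of=None):
    """Files from completed instants, optionally as of a point in time."""
    return [f
            for instant in timeline
            if instant["state"] == "completed"
            and (as_of is None or instant["ts"] <= as_of)
            for f in instant["files"]]

print(visible_files(timeline))                          # in-flight file excluded
print(visible_files(timeline, as_of="20240101120000"))  # point-in-time read
```

The same filter with an `as_of` timestamp is what powers Hudi's point-in-time and incremental queries: the timeline is an event log, so any prefix of it defines a consistent historical view.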
Mathematical Model: Snapshot Isolation and Optimistic Concurrency Control
All three formats implement snapshot isolation: every transaction reads from a consistent snapshot of the table taken at its start time. (Despite the common shorthand, snapshot isolation is not one of the four ANSI SQL isolation levels; it was formalized in Berenson et al.'s 1995 critique of them.) Snapshot isolation prevents the read-write conflicts that corrupt plain-Parquet lakes, but two concurrent transactions can still write to the same files, a write-write conflict that the formats must detect at commit time.
The formal guarantee: given two concurrent transactions T₁ and T₂ both reading snapshot S, and both writing data files, a conflict is detected if and only if their write sets overlap (i.e., they both modify at least one common file or partition). In that case, whichever transaction commits second is rejected (Delta Lake, for example, surfaces this as a ConcurrentModificationException or one of its subclasses).
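That commit-time check can be sketched directly from the definition: the later committer scans every commit that landed after its read snapshot and aborts if any write set overlaps its own. This is a minimal sketch of the pattern, not any format's actual implementation:

```python
# Sketch of optimistic concurrency control as all three formats apply it:
# a commit succeeds only if no commit after its read snapshot touched an
# overlapping set of files. The log structure here is illustrative.

class ConcurrentModificationError(Exception):
    pass

def try_commit(log, read_version, write_set):
    """Append a commit if no intervening commit overlaps write_set."""
    for committed in log[read_version + 1:]:  # commits this writer did not see
        if committed & write_set:             # write sets overlap -> conflict
            raise ConcurrentModificationError(committed & write_set)
    log.append(set(write_set))
    return len(log) - 1                       # new version number

log = [set()]                                 # v0: empty table
v1 = try_commit(log, 0, {"part-a.parquet"})   # writer 1 commits
v2 = try_commit(log, 0, {"part-b.parquet"})   # disjoint files: also succeeds
try:
    try_commit(log, 0, {"part-a.parquet"})    # overlaps v1: rejected
except ConcurrentModificationError:
    print("second writer rejected; must retry from the latest version")
```

The disjoint-write case is why concurrent ETL jobs on different partitions can run safely: their write sets never intersect, so both optimistic commits succeed.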
Delta Lake implements this via optimistic concurrency control (OCC) on the _delta_log/. When a writer attempts to commit version N+1, it reads the log from version N (its read snapshot) to the current version and checks whether any intervening commit touched the same files. If one did, the commit is rejected. If not, the new JSON file is written atomically via a put-if-absent primitive (S3 conditional PUTs, Azure blob leases, or a DynamoDB-backed LogStore on older S3 setups). A consequence is that Delta Lake OCC is partition-friendly by default: two writers that touch different partitions can always commit concurrently.
Iceberg uses a similar OCC model on its metadata file pointer. The catalog (Hive Metastore, Glue, REST catalog) performs an atomic compare-and-swap on the current metadata file pointer. Two writers that add files to different partitions will both succeed; two writers that update the same manifest fail on the second CAS attempt.
Hudi's timeline mechanism is slightly different: it uses file-level locking on the .hoodie/ directory for MoR delta writes, and optimistic conflict detection for CoW compaction. Hudi's record-level upsert model (keyed on a user-specified record key) means write conflicts are resolved at the record level rather than the file level, which is what enables its streaming CDC guarantees.
Performance Analysis: Compaction, File Proliferation, and Metadata Amplification
The three formats have distinct performance profiles that determine operational overhead at scale:
| Metric | Delta Lake | Apache Iceberg | Apache Hudi (MoR) |
| --- | --- | --- | --- |
| Write latency | Low: append JSON log entry | Low: write new snapshot metadata | Very low: append delta log |
| Read latency | Low after checkpoint | Low: manifest pruning | Medium: merge delta logs on read |
| Small file problem | Accumulates; OPTIMIZE + Z-ORDER needed | Accumulates; compaction job needed | Accumulates delta logs; auto-compaction |
| Metadata overhead | O(commits) JSON + checkpoint Parquet | O(snapshots) manifest files | O(instants) timeline entries |
| VACUUM / expiry | VACUUM removes old Parquet | expire_snapshots() removes old metadata | clean service removes old files |
The most critical operational concern for all three formats is small file accumulation. High-frequency streaming writes produce many small Parquet files that severely degrade read performance (S3 LIST operations plus per-file Spark task overhead). Delta Lake's OPTIMIZE command coalesces small files and applies Z-ordering; Iceberg uses rewrite_data_files; Hudi handles this automatically with its compaction service in MoR mode. At Uber's scale (millions of Hudi commits per day), the compaction service runs as a continuous Spark job to keep read performance within SLA.
How Writes Flow Through Each Format's Metadata Layer
The diagrams below trace a single write operation (an upsert of 1,000 new orders) through each format's metadata update path.
Delta Lake Write Flow
```mermaid
graph TD
    A[Spark MERGE INTO orders] --> B{Read _delta_log\nfind latest version N}
    B --> C[Write new Parquet data files\nto S3]
    C --> D{OCC check: did any\ncommit N..current\ntouch same partition?}
    D -- No conflict --> E[Write 000N+1.json\nadd/remove actions]
    D -- Conflict detected --> F[Retry with backoff\nor raise exception]
    E --> G[Readers at version >= N+1\nsee new data]
    E --> H{Is version a multiple\nof 10?}
    H -- Yes --> I[Write checkpoint.parquet\nconsolidate log]
    H -- No --> G
```
Apache Iceberg Snapshot Tree (after write)
```mermaid
graph TD
    A[Catalog pointer] --> B[v3.metadata.json\ncurrent snapshot = 3]
    B --> S3[Snapshot 3\nnew data added]
    B --> S2[Snapshot 2\nprevious state]
    B --> S1[Snapshot 1\ninitial load]
    S3 --> ML3[Manifest List]
    ML3 --> MF1[Manifest File A\n4 data files + stats]
    ML3 --> MF2[Manifest File B\n2 new files]
    MF2 --> DF1[new-orders-001.parquet]
    MF2 --> DF2[new-orders-002.parquet]
```
Apache Hudi Timeline (MoR table)
```mermaid
graph LR
    TL[.hoodie/timeline] --> I1[20240101120000.commit\nCoW base files written]
    I1 --> I2[20240101130000.deltacommit\ndelta log appended]
    I2 --> I3[20240101140000.deltacommit\ndelta log appended]
    I3 --> I4[20240101150000.compaction\ndelta merged into base files]
    I4 --> I5[20240101160000.clean\nold base files removed]
```
Delta Lake serializes concurrent writers through the transaction log. Iceberg achieves consistency through immutable snapshot trees and an atomic catalog pointer swap. Hudi's timeline records every instant of change, enabling both point-in-time queries and incremental reads.
Delta Lake, Iceberg, and Hudi: A Format-by-Format Feature Breakdown
Delta Lake (Databricks, 2019)
Delta Lake's core strength is deep Spark integration. Its MERGE INTO syntax is the most production-tested upsert mechanism across any open table format:
```python
from delta.tables import DeltaTable

delta_orders = DeltaTable.forPath(spark, "s3://data-lake/orders/")

# Upsert: update if exists, insert if new
delta_orders.alias("target").merge(
    source_df.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel: query yesterday's state
orders_yesterday = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2026-03-27") \
    .load("s3://data-lake/orders/")

# Schema evolution: add a new column without rewriting
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
new_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://data-lake/orders/")

# Z-ordering for multi-column data skipping
spark.sql("""
    OPTIMIZE delta.`s3://data-lake/orders/`
    ZORDER BY (customer_id, order_date)
""")
```
Z-ordering is Delta Lake's differentiating performance feature. By co-locating rows with similar values for a set of columns within the same data files, Z-ordering enables data skipping across multiple dimensions simultaneously, far more powerful than single-column Hive partitioning. A query filtering on both customer_id and order_date can skip 80–95% of files in a properly Z-ordered table.
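The intuition behind Z-ordering can be illustrated with the classic Morton code: interleaving the bits of two column values produces a single sort key under which rows close in either dimension tend to land in the same files. This is a simplified sketch of the underlying idea, not Delta's production implementation:

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return key

# Sorting by morton_key clusters rows that are close in BOTH dimensions,
# so each file covers a compact rectangle of (customer_id, order_date)
# space and per-file min/max statistics prune well for filters on either.
rows = [(3, 0), (0, 3), (1, 1), (2, 2), (0, 0)]
print(sorted(rows, key=lambda r: morton_key(*r)))
# -> [(0, 0), (1, 1), (3, 0), (0, 3), (2, 2)]
```

Contrast this with sorting on one column alone: files would then be tight on that column but span the full range of the other, so filters on the second column could prune almost nothing.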
Apache Iceberg (Netflix, 2018)
Iceberg's headline feature is hidden partitioning: partition values are tracked in metadata, not encoded in directory paths. Writers do not need to know the partition scheme; Iceberg derives partition values automatically. This enables partition evolution β changing the partition scheme (e.g., from monthly to daily) without rewriting any data.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "glue") \
    .getOrCreate()

# Create table with hidden partitioning (partition values live in metadata,
# not in directory paths)
spark.sql("""
    CREATE TABLE my_catalog.orders (
        order_id BIGINT,
        customer_id STRING,
        order_date DATE,
        amount DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(order_date))
""")

# Writers just append; no partition column handling needed in the data
orders_df.writeTo("my_catalog.orders").append()

# Evolve partition scheme from daily to monthly without rewriting data
spark.sql("""
    ALTER TABLE my_catalog.orders
    REPLACE PARTITION FIELD days(order_date)
    WITH months(order_date)
""")

# Time travel by snapshot ID
spark.read \
    .option("snapshot-id", "8170123456789012345") \
    .table("my_catalog.orders")

# Incremental read: changes between two snapshots
spark.read \
    .option("start-snapshot-id", "8170123456789012344") \
    .option("end-snapshot-id", "8170123456789012345") \
    .format("iceberg") \
    .load("my_catalog.orders")
```
Iceberg's multi-engine support is unmatched: the same table can be written by Spark and read by Trino, Flink, Hive, and Presto without any format conversion. Netflix runs Iceberg tables that are simultaneously written by Flink (real-time events), read by Trino (ad-hoc SQL), and batch-processed by Spark (ML feature engineering) β all against the same physical files.
Apache Hudi (Uber, 2016)
Hudi was engineered for one primary use case: high-throughput streaming upserts from CDC pipelines. Its record-level update model (every row has a recordKey and preCombineField) makes it the natural choice for systems where source events continuously update existing records.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
    .getOrCreate()

# Write with upsert semantics (MoR table type for low-latency writes)
orders_df.write \
    .format("hudi") \
    .option("hoodie.table.name", "orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "5") \
    .mode("append") \
    .save("s3://data-lake/orders/")

# Incremental query: only records changed since a timestamp
spark.read \
    .format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20260327000000") \
    .load("s3://data-lake/orders/")
```
The incremental query is Hudi's killer feature for downstream consumers: instead of scanning the entire table to find what changed, a consumer provides a timestamp and receives only the delta since that point. This makes Hudi CDC pipelines dramatically cheaper than equivalent Spark full-table scans.
Head-to-Head Feature Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| ACID transactions | ✅ OCC on _delta_log | ✅ CAS on catalog pointer | ✅ Timeline + file locking |
| Time travel | ✅ VERSION/TIMESTAMP AS OF | ✅ Snapshot ID / timestamp | ✅ Point-in-time instant |
| Upserts | MERGE INTO (Spark-native) | Row-level position/equality deletes | Native upserts, record-keyed (MoR/CoW) |
| Partition evolution | Manual repartition + rewrite | ✅ Hidden + evolve without rewrite | Limited, schema-only evolution |
| Multi-engine | Spark-primary; Trino partial | Best-in-class (Spark/Flink/Trino/Hive) | Good (Spark/Flink; Trino improving) |
| Streaming | Spark Structured Streaming | Flink native; Spark streaming | Best-in-class: built for streaming |
| Compaction | OPTIMIZE + VACUUM (manual) | rewrite_data_files (manual/scheduled) | Auto-compaction (inline or async) |
| Schema evolution | mergeSchema option | Full DDL ALTER TABLE | Schema-on-write evolution |
| Best ecosystem | Databricks / Unity Catalog | Open multi-engine lakehouse | Kafka CDC / streaming upserts |
Apache Iceberg's Snapshot Architecture: Hidden Partitioning Under the Hood
Netflix open-sourced Iceberg after discovering that Hive's partition model, which encodes partition values in directory paths (/year=2024/month=03/), caused three chronic production problems:
- Partition misidentification: jobs that forgot to filter on partition columns silently performed full table scans.
- Partition evolution pain: changing from monthly to daily partitioning required copying all historical data.
- Small partition explosion: at petabyte scale, thousands of partition directories created Hive Metastore bottlenecks on LIST operations.
Iceberg solves all three by separating the logical partition scheme from the physical file layout. The partition spec (e.g., days(event_time)) is stored in the table metadata, not in file paths. Data files live at flat paths like data/<uuid>.parquet, and each manifest file records the partition value alongside the file reference:
```
# Manifest file entry (Avro record, rendered as JSON)
{
  "status": "ADDED",
  "data_file": {
    "file_path": "s3://lake/orders/data/0001-abc.parquet",
    "partition": { "order_date_day": 19810 },   ← derived value, not in path
    "record_count": 1200,
    "column_sizes": { "order_id": 9800, "amount": 6400 },
    "lower_bounds": { "order_date": <2024-03-01>, "amount": 0.99 },
    "upper_bounds": { "order_date": <2024-03-31>, "amount": 4999.00 }
  }
}
```
When a query engine evaluates WHERE order_date = '2024-03-15', Iceberg evaluates this filter against the lower_bounds/upper_bounds in each manifest entry before opening any data file. Files that cannot possibly contain matching rows are pruned entirely. This manifest-level pruning is what makes Iceberg viable for PB-scale tables where traditional Hive would require minutes just to enumerate partitions.
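The pruning rule itself is plain interval logic. A sketch over hypothetical manifest entries, mirroring the bounds shown above:

```python
# Sketch of Iceberg-style manifest pruning: skip any data file whose
# column min/max bounds cannot contain the predicate value.
from datetime import date

manifest_entries = [
    {"file": "0001-abc.parquet",
     "lower": {"order_date": date(2024, 3, 1)},
     "upper": {"order_date": date(2024, 3, 31)}},
    {"file": "0002-def.parquet",
     "lower": {"order_date": date(2024, 4, 1)},
     "upper": {"order_date": date(2024, 4, 30)}},
]

def prune(entries, column, value):
    """Keep only files whose [lower, upper] range can contain value."""
    return [e["file"] for e in entries
            if e["lower"][column] <= value <= e["upper"][column]]

# WHERE order_date = '2024-03-15' opens only the March file.
print(prune(manifest_entries, "order_date", date(2024, 3, 15)))
```

The key property is that this filter runs entirely against metadata: no Parquet footer is opened, no object-store GET is issued, for any file the bounds rule out.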
Partition evolution in practice: when Netflix switched a 50-TB events table from monthly to daily partitioning, zero bytes of historical data were rewritten. Iceberg simply updated the partition spec in the metadata file. New writes used the daily scheme; old files retained their monthly partition values in manifests. Both old and new files co-exist transparently; queries filter correctly against both partition schemes simultaneously.
For a full exploration of multi-engine lakehouse architectures and the Medallion pattern that these table formats power, see Big Data Architecture Patterns: Lambda, Kappa, Medallion, and Data Mesh.
Real-World Applications: Netflix, Uber, and Databricks at Petabyte Scale
Netflix (Apache Iceberg): Netflix's data platform runs approximately 4,000 Iceberg tables, the largest exceeding 10 PB. The company chose Iceberg after a 2018 audit revealed that Hive Metastore bottlenecks were responsible for 30% of Spark job failures on their largest event tables. Today, Flink jobs write streaming events to Iceberg at sub-minute latency while Trino analysts run ad-hoc SQL against the same tables, a workload combination that no other format handles as cleanly.
Uber (Apache Hudi): Uber processes 100 billion+ trip events per day across their Hadoop data lake, with frequent updates as trip status changes (requested → accepted → completed → rated). Traditional batch pipelines that rewrote full partitions on every status change were both slow and expensive. Hudi's MoR upsert model reduced their data freshness lag from hours to minutes while cutting storage write amplification by 4x. Uber runs Hudi on a 500+ PB HDFS cluster and has contributed extensively to Hudi's incremental processing capabilities.
Databricks (Delta Lake): Delta Lake's most visible enterprise use case is the Medallion Architecture: Bronze (raw ingestion), Silver (cleansed/validated), Gold (aggregated for BI). Databricks reports that Medallion pipelines built on Delta Lake cut data quality incident rates by 60–70% in enterprise customers by making every layer independently recoverable through time travel. The MERGE INTO command is the Silver-layer workhorse: it idempotently upserts new records while preserving exactly-once semantics even when Spark jobs are retried.
Trade-offs and Failure Modes: Small Files, Compaction Overhead, and Vendor Lock-in
Small File Accumulation
Every streaming write creates a new set of Parquet files. A Flink job that checkpoints every 30 seconds and writes one file per partition per checkpoint generates 120 small files per hour per partition, or 2,880 per day. A table with 100 active partitions accumulates 288,000 files per day. At this scale, S3 LIST operations alone can take seconds, and Spark's planning phase must open manifests for tens of thousands of files. All three formats require scheduled compaction to remain performant; this is a non-trivial operational overhead that teams frequently underestimate.
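The arithmetic above generalizes to a simple back-of-envelope model, assuming one file per partition per checkpoint and no compaction:

```python
# Back-of-envelope file-accumulation model for a streaming writer
# (assumes one file per partition per checkpoint, no compaction).

def files_accumulated(checkpoint_interval_s, partitions, hours):
    checkpoints_per_hour = 3600 / checkpoint_interval_s
    return int(checkpoints_per_hour * partitions * hours)

print(files_accumulated(30, 1, 24))    # one partition, one day: 2,880 files
print(files_accumulated(30, 100, 24))  # 100 partitions, one day: 288,000
print(files_accumulated(15, 100, 24))  # halving the interval doubles it
```

Running this model against your own checkpoint interval and partition count before deploying is a cheap way to size the compaction job you will inevitably need.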
Metadata Amplification in Iceberg
Iceberg's manifest file design is elegant for reads but generates significant metadata write amplification. Each snapshot creates a new manifest list and potentially several new manifest files. For a table with 1,000 data files per partition and 500 active partitions, each write operation appends a new snapshot referencing up to 1 GB of manifest data. Running expire_snapshots() regularly is not optional; it is a prerequisite for keeping the metadata layer manageable.
Hudi CoW Write Amplification
CoW tables rewrite entire data files on every upsert, even if only one row changed. A Hudi CoW table with 10 GB files and a 0.1% update rate will rewrite 10 GB of data to change 10 MB of records. This is acceptable for read-heavy tables where query performance justifies the write cost, but at high upsert frequency it becomes prohibitive. The failure mode is subtle: write amplification silently increases S3 PUT costs until the bill becomes the monitoring signal.
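The amplification factor falls straight out of the arithmetic. A toy calculation, assuming the whole file is rewritten for any change:

```python
# Write amplification for a Copy-on-Write upsert: the whole file is
# rewritten no matter how few rows actually change.

def write_amplification(file_size_gb, update_rate):
    changed_gb = file_size_gb * update_rate
    return file_size_gb / changed_gb  # bytes written per byte changed

print(write_amplification(10, 0.001))  # 0.1% update rate -> 1000x
```

A 1000x multiplier on every upsert batch is the number to watch when deciding between CoW and MoR for an update-heavy table.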
Vendor Lock-in Risk
Delta Lake's tight ecosystem coupling to Databricks (Unity Catalog, Liquid Clustering, and Predictive I/O are proprietary extensions on top of the open protocol) creates migration risk. OSS Delta Lake and Databricks Delta Lake are compatible but not identical; proprietary optimizations may embed assumptions that are difficult to replicate on other platforms.
Decision Guide: Choosing Between Delta Lake, Iceberg, and Hudi
| Scenario | Recommendation | Rationale |
| --- | --- | --- |
| All workloads run on Databricks / Spark only | Delta Lake | Tightest integration, MERGE INTO, Z-ordering, Unity Catalog |
| Mix of Spark + Trino + Flink across multiple teams | Apache Iceberg | Best multi-engine interoperability; hidden partitioning eliminates engine-specific bugs |
| CDC pipeline from MySQL/Postgres, near-real-time requirements | Apache Hudi | Record-level upserts, MoR for sub-minute freshness, incremental queries |
| Medallion architecture (Bronze → Silver → Gold) | Delta Lake or Iceberg | Both support ACID + time travel; choose by query engine mix |
| GDPR deletion requests at scale (millions of users) | Delta Lake or Iceberg | Delta's DELETE + REORG; Iceberg's equality delete files handle row-level deletes cleanly |
| Petabyte-scale with partition evolution requirements | Apache Iceberg | Partition evolution without data rewrite is unique to Iceberg |
| Team starting fresh, no existing Spark investment | Apache Iceberg | Neutral governance (Apache Foundation), widest engine support, active ecosystem |
Key non-obvious consideration: if your organization plans to migrate from Databricks to another query engine in the next 3–5 years, start with Iceberg. Delta Lake files are readable by other engines, but Delta Lake's metadata is the binding dependency. An Iceberg table can be queried by Spark today and Trino tomorrow without a single config change.
GDPR-Compliant Right-to-Forget: A Worked Deletion Pipeline with Delta Lake
GDPR Article 17 requires that a data processor delete all personal data for a customer upon request, typically within 30 days. For a data lake storing hundreds of billions of events across years of Parquet files, this is one of the hardest operational problems, and the one that makes the "no efficient deletes in plain Parquet" problem most viscerally expensive.
The Plain-Parquet Problem
```python
# Without a table format: find and rewrite every partition
for partition in list_partitions("s3://lake/events/"):
    df = spark.read.parquet(partition)
    if df.filter(f"customer_id = '{customer_id}'").count() > 0:
        df.filter(f"customer_id != '{customer_id}'") \
          .write.mode("overwrite").parquet(partition)

# Result: rewrites every partition the customer appears in.
# For a customer with events across 3 years = 1,095 partitions rewritten.
```
The Delta Lake Solution
```python
from delta.tables import DeltaTable

# Silver layer: events table with MERGE INTO for ongoing upserts
def upsert_customer_events(batch_df, batch_id):
    silver = DeltaTable.forPath(spark, "s3://lake/silver/events/")
    silver.alias("target").merge(
        batch_df.alias("source"),
        "target.event_id = source.event_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

# Streaming write via Structured Streaming
spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "customer_events") \
    .load() \
    .writeStream \
    .foreachBatch(upsert_customer_events) \
    .option("checkpointLocation", "s3://lake/checkpoints/events/") \
    .start()

# GDPR deletion: surgical DELETE, no partition rewrite
spark.sql("""
    DELETE FROM delta.`s3://lake/silver/events/`
    WHERE customer_id = 'cust-4892'
""")

# Verify deletion
remaining = spark.sql("""
    SELECT COUNT(*) FROM delta.`s3://lake/silver/events/`
    WHERE customer_id = 'cust-4892'
""")
assert remaining.collect()[0][0] == 0

# Purge history so deleted data cannot be recovered via time travel
spark.sql("""
    VACUUM delta.`s3://lake/silver/events/`
    RETAIN 0 HOURS
""")
# Note: RETAIN 0 HOURS requires spark.databricks.delta.retentionDurationCheck.enabled = false
# In production, align VACUUM retention with your data retention policy (e.g., 7 days)
```
Delta Lake's DELETE FROM rewrites only the Parquet files that contain the matching rows, not every partition. For a customer present in 3 partitions out of 1,095, this is a 365x reduction in I/O. The VACUUM command removes old file versions from storage, ensuring time-travel queries cannot retrieve deleted data, a requirement for full GDPR compliance.
The Bronze layer (raw, immutable Parquet) is handled separately: GDPR deletion in Bronze typically requires a crypto-shredding approach: encrypting all events for a customer with a per-customer key, then destroying the key. The data remains in Bronze but becomes irreversibly unreadable, satisfying GDPR's right-to-erasure without rewriting 3 years of raw events.
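The crypto-shredding pattern can be sketched as follows. The XOR-keystream "cipher" here is a toy stand-in (production code should use a vetted AEAD cipher such as AES-GCM and a real KMS); the point is the key lifecycle: one random key per customer, and destroying the key erases the data:

```python
import hashlib
import os

# Toy sketch of crypto-shredding. The SHA-256 keystream XOR below is for
# illustration only and is NOT cryptographically sound -- use a vetted
# AEAD cipher in practice. What matters is the key lifecycle pattern.

key_store = {}  # per-customer keys; stands in for a KMS

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from key (illustrative only)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_event(customer_id: str, event: bytes) -> bytes:
    """Encrypt an event under the customer's key before landing it in Bronze."""
    key = key_store.setdefault(customer_id, os.urandom(32))
    return bytes(a ^ b for a, b in zip(event, keystream(key, len(event))))

def shred(customer_id: str) -> None:
    """GDPR erasure: destroy the key; stored ciphertext becomes unreadable."""
    del key_store[customer_id]

ciphertext = encrypt_event("cust-4892", b"trip completed at 12:04")
shred("cust-4892")
# The event bytes remain in Bronze, but without the key they are noise.
```

Deletion now costs one key-store operation per customer instead of a rewrite of years of raw Parquet, which is what makes the 30-day GDPR deadline tractable for the immutable layer.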
What Production Data Teams Learn the Hard Way
Lesson 1: Compaction is not optional; it is a first-class pipeline job. Every team that deploys a streaming write to Delta, Iceberg, or Hudi discovers small file accumulation within weeks. Plan for a compaction job from day one; treat it as a sidecar to every write pipeline, not as a cleanup task to address later.
Lesson 2: Schema evolution and schema enforcement are different problems. Delta Lake defaults to schema enforcement (strict mode): any writer that introduces a new column or changes a type fails. This is the correct default in production. Turning on mergeSchema or autoMerge without governance gates is how Silver-layer schemas become unpredictable over time. Define a schema-change approval process before you enable evolution.
Lesson 3: Time travel is only as useful as your retention window. Delta Lake retains transaction log entries and old data files until VACUUM is run. If you VACUUM aggressively (e.g., daily with 0-hour retention), you lose the ability to roll back to yesterday's state, which is exactly when you need it. Set your VACUUM retention to at least 7 days, align it with your incident SLA, and document it as an operational constraint.
Lesson 4: Test your ACID guarantees with concurrent writers before production. The optimistic concurrency control in all three formats serializes conflicting writes, but "conflicting" is defined differently per format and table config. Delta Lake writers that touch different partitions will never conflict; writers that both write to the same partition will. Validate your concurrency assumptions with load tests before trusting your ETL pipeline on critical tables.
Lesson 5: For multi-engine shops, Iceberg's REST catalog is the connector you need. The Hive Metastore is the legacy coupling point for Iceberg tables, and it serializes catalog operations in a way that bottlenecks at hundreds of concurrent queries. Iceberg's REST catalog (backed by a simple HTTP service) is horizontally scalable and works identically across Spark, Trino, and Flink; upgrade to it before you hit the Metastore ceiling.
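Switching an existing Spark job to a REST catalog is a configuration change. A minimal sketch, where the catalog name `lake`, the endpoint URI, and the warehouse path are placeholders (the Iceberg Spark runtime jar must also be on the classpath):

```python
from pyspark.sql import SparkSession

# Hypothetical REST catalog endpoint and warehouse location.
spark = (SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://iceberg-catalog.internal:8181")
    .config("spark.sql.catalog.lake.warehouse", "s3://lake/warehouse/")
    .getOrCreate())

# Tables then resolve through the REST catalog exactly as with the Metastore:
spark.table("lake.analytics.orders")
```

The same endpoint can be configured in Trino and Flink, which is precisely what makes the REST catalog the neutral meeting point for multi-engine shops.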
Lesson 6: Hidden partitioning in Iceberg eliminates a whole class of pipeline bugs. Hive-style partition-aware queries (WHERE year=2024 AND month=03) silently full-scan the table if the partition predicate is incorrectly typed or the partition column is aliased. Iceberg's hidden partitioning makes this impossible: the engine always pushes the filter down through metadata pruning automatically.
TLDR: Summary & Key Takeaways for Open Table Format Selection
- All three formats (Delta Lake, Iceberg, Hudi) store standard Parquet files with a metadata layer; your data format does not change, your operational semantics do.
- Delta Lake wins for Spark-native Databricks shops: MERGE INTO, Z-ordering, and the Medallion architecture integration are unmatched in that ecosystem.
- Apache Iceberg wins for multi-engine environments: hidden partitioning, partition evolution, and the REST catalog make it the most portable and scalable choice for organizations running Spark + Trino + Flink against the same tables.
- Apache Hudi wins for streaming CDC upserts: record-keyed MoR writes, automatic compaction, and incremental queries are purpose-built for near-real-time pipelines from Debezium or Kafka Connect CDC sources.
- Compaction is a required operational primitive, not an afterthought β plan for it from day one regardless of format choice.
- GDPR deletion is solvable with all three formats via surgical DELETE plus history expiry; it is fundamentally unsolvable with plain Parquet at scale.
- The emerging convergence: Delta and Iceberg are becoming increasingly interoperable (Delta's Universal Format supports Iceberg reads); in 2-3 years, format choice may matter less than catalog and governance choice.
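The DELETE-plus-expiry point deserves one concrete illustration. The toy snapshot store below is not any format's real API; it just shows why a surgical DELETE alone is not GDPR compliance: the record stays reachable via time travel until old snapshots are expired (e.g. Delta's VACUUM or Iceberg's expire_snapshots).

```python
# Toy snapshot history (illustrative only): a DELETE writes a new snapshot,
# but earlier snapshots still contain the record until history is expired.

snapshots = [
    {"version": 1, "rows": {"u1": "pii", "u2": "pii"}},
    {"version": 2, "rows": {"u2": "pii"}},   # surgical DELETE of u1
]

def visible_anywhere(key: str, snaps: list) -> bool:
    """True if any retained snapshot still exposes the record."""
    return any(key in s["rows"] for s in snaps)

# After the DELETE, u1 is gone from the latest snapshot but not from history.
assert visible_anywhere("u1", snapshots)

# Only after expiring snapshots older than the DELETE is the data truly gone.
retained = [s for s in snapshots if s["version"] >= 2]
assert not visible_anywhere("u1", retained)
```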
π Practice Quiz
A Delta Lake table's transaction log holds 23 commits, and its most recent checkpoint was written at commit 10. How many JSON log files does a query engine need to read, on top of the checkpoint, to reconstruct the current table state?
- A) 23 files: every commit since the table was created
- B) 13 files: the JSON entries for commits 11 through 23, applied on top of the checkpoint
- C) 3 files: Delta compresses log files automatically every 10 commits

Correct Answer: B
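The checkpoint arithmetic can be sketched in code. This is a simplified model of log replay (real readers also handle multi-part checkpoints and the `_last_checkpoint` hint file), but the counting logic is the same: a checkpoint folds in everything up to its version, and only later JSON entries need replaying.

```python
# Simplified model of which _delta_log files a reader must load.

def files_to_read(latest_commit: int, last_checkpoint) -> list:
    """Return the log files needed to rebuild the table at latest_commit."""
    if last_checkpoint is None:
        start = 0                       # no checkpoint yet: replay every commit
        files = []
    else:
        start = last_checkpoint + 1     # checkpoint covers 0..last_checkpoint
        files = ["%020d.checkpoint.parquet" % last_checkpoint]
    files += ["%020d.json" % v for v in range(start, latest_commit + 1)]
    return files

# 23 commits with a checkpoint at commit 10: one checkpoint Parquet plus the
# 13 JSON entries for commits 11-23.
log = files_to_read(23, 10)
assert log[0].endswith("checkpoint.parquet")
assert sum(f.endswith(".json") for f in log) == 13
```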
Your team runs Flink for real-time event ingestion, Trino for analyst SQL queries, and Spark for weekly ML batch training β all against the same customer events table. No single team "owns" the table format choice. Which format is the strongest fit and why?
- A) Delta Lake: Spark integration is best-in-class and Trino connectors exist
- B) Apache Iceberg: the REST catalog and engine-agnostic metadata let all three engines read and write natively
- C) Apache Hudi: MoR tables support both streaming ingestion and batch reads

Correct Answer: B
In Apache Hudi's Merge-on-Read (MoR) table type, a read query on a partition with 3 base files and 15 delta log files requires which operation before returning results?
- A) A full compaction run that rewrites all 3 base files and discards the delta logs
- B) A read-time merge that combines the 3 base files with applicable delta log entries to resolve the latest record state
- C) A scan of only the 15 delta log files; base files are ignored after a deltacommit

Correct Answer: B
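The read-time merge in answer B can be shown with a toy model. This is illustrative only, not Hudi's implementation: base files carry the last compacted state, delta logs carry newer record-keyed upserts, and the reader resolves the latest value per key at query time.

```python
# Toy Merge-on-Read query: merge base files with delta logs per record key.

def read_merged(base_files: list, delta_logs: list) -> dict:
    """Later log entries win per key; base files supply the compacted state."""
    state = {}
    for base in base_files:
        state.update(base)              # compacted state first
    for log in delta_logs:
        state.update(log)               # then replay upserts in commit order
    return state

base = [{"u1": "v1", "u2": "v1"}]
logs = [{"u2": "v2"}, {"u3": "v1"}, {"u2": "v3"}]
# The reader returns the latest state without rewriting any base file.
assert read_merged(base, logs) == {"u1": "v1", "u2": "v3", "u3": "v1"}
```

Compaction later rewrites the base files with this merged state, which is why MoR reads get slower as delta logs accumulate between compaction runs.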
Open-ended challenge: Your organization has a 20 TB customer activity table written by a Kafka-backed Flink pipeline (1 million upserts per minute) and read by both Trino analysts (ad-hoc SQL) and a Spark ML pipeline (weekly full-table scan). GDPR deletion requests must complete within 24 hours. Sketch which table format you would choose and the compaction + deletion strategy you would implement. What would change if the Flink write frequency doubled to 2 million upserts per minute?
π Related Posts
- Data Warehouse vs Data Lake vs Data Lakehouse - Understand the architectural evolution from warehouses to lakes to the Lakehouse pattern that open table formats enable.
- Big Data Architecture Patterns: Lambda, Kappa, Medallion, and Data Mesh - How the Medallion Architecture layers (Bronze/Silver/Gold) map directly to open table format capabilities.
- How Kafka Works - The CDC and streaming source that feeds Hudi pipelines and Delta Lake Structured Streaming jobs.

Written by Abstract Algorithms (@abstractalgorithms)