Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?
Snowflake, Databricks, or S3? We explain the evolution of data storage. Learn when to use a struc...
TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg) that bring SQL performance to raw storage: the best of both.
Three Ways to Store Data at Scale: What Problem Each Solves
As organizations grow, they end up with data coming from many sources: transactional databases, clickstreams, IoT sensors, third-party APIs. Storing and querying all of it efficiently requires choosing the right abstraction.
| Question | Warehouse | Lake | Lakehouse |
| What data format? | Structured tables, SQL | Raw files (JSON, Parquet, CSV) | Open formats (Parquet + metadata layer) |
| Who uses it? | Analysts, BI tools | Data scientists, ML engineers | Both |
| Primary operation | SQL aggregations | Batch ML training, exploration | SQL + ML on same data |
| When to update data? | ETL: clean before storing | ELT: store raw, transform later | ACID on open files |
| Cost model | Compute + managed storage | Cheap object storage | Object storage + optional compute |
OLAP, OLTP, and the Anatomy of a Data Decision
Before picking a storage architecture, it helps to understand the two fundamentally different ways organizations query data.
OLTP (Online Transaction Processing) is what your application's production database does: short, fast reads and writes, such as "insert this order" and "update this balance". PostgreSQL and MySQL are OLTP systems, optimized for low-latency single-row operations.
OLAP (Online Analytical Processing) is what happens when Finance asks "what was our revenue by product category for every quarter in the last three years?", a query scanning billions of rows. Warehouses, lakes, and lakehouses are all OLAP-oriented.
| Dimension | OLTP (Production DB) | OLAP (Warehouse / Lake) |
| Query type | Single-row reads/writes | Multi-row aggregations |
| Latency target | < 10 ms | Seconds to minutes |
| Data format | Normalized rows | Columnar, denormalized |
| Scale | GB to TB | TB to PB |
| Example | PostgreSQL, MySQL | Snowflake, S3, Delta Lake |
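The row-vs-columnar distinction in the table above can be made concrete with a toy sketch. The `orders` data and function names below are invented for illustration: an OLAP aggregation like revenue-by-category only needs two columns, so a columnar layout lets the engine scan just those arrays instead of touching every record object.

```python
# Toy illustration: the same six orders stored row-wise vs column-wise.
# OLAP queries such as SUM(amount) GROUP BY category only need two columns;
# a columnar layout scans those contiguous arrays and skips everything else.

# Row-oriented (how an OLTP database returns records)
rows = [
    {"order_id": 1, "category": "books", "amount": 12.0},
    {"order_id": 2, "category": "games", "amount": 60.0},
    {"order_id": 3, "category": "books", "amount": 8.0},
    {"order_id": 4, "category": "music", "amount": 10.0},
    {"order_id": 5, "category": "books", "amount": 20.0},
    {"order_id": 6, "category": "games", "amount": 30.0},
]

# Column-oriented (how Parquet and warehouse engines lay data out)
columns = {
    "order_id": [1, 2, 3, 4, 5, 6],
    "category": ["books", "games", "books", "music", "books", "games"],
    "amount":   [12.0, 60.0, 8.0, 10.0, 20.0, 30.0],
}

def revenue_by_category(cats, amounts):
    """OLAP-style aggregation reading only the two columns it needs."""
    totals = {}
    for cat, amt in zip(cats, amounts):
        totals[cat] = totals.get(cat, 0.0) + amt
    return totals

totals = revenue_by_category(columns["category"], columns["amount"])
print(totals)  # {'books': 40.0, 'games': 90.0, 'music': 10.0}
```

Real engines add compression, vectorized execution, and predicate pushdown on top of this layout, but the access-pattern argument is the same.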
A second key distinction is data shape:
- Structured β rows and columns with a fixed schema (orders, events, user profiles).
- Semi-structured β self-describing but variable layout (JSON, Parquet, Avro logs).
- Unstructured β no schema at all (images, PDFs, audio, raw text).
Warehouses only accept structured data. Lakes and lakehouses handle all three. If your organization generates diverse data (ML training sets, sensor readings, chat logs), a warehouse alone is a dead end.
Deep Dive: The Data Warehouse: SQL on Clean, Structured Data
A warehouse ingests pre-processed, schema-validated data. The ETL pipeline (Extract, Transform, Load) cleans and normalizes data before it enters.
flowchart LR
Source[Operational DBs APIs / Events] --> ETL[ETL Pipeline Transform + Validate]
ETL --> DW[Data Warehouse Snowflake / BigQuery Redshift]
DW --> BI[BI Tools Tableau / Looker]
DW --> Analyst[SQL Analysts]
Strengths:
- Sub-second query performance on aggregations over billions of rows.
- Strong governance: schema enforcement, row-level security, audit trails.
- Familiar SQL interface for business users.
Limitations:
- Only stores structured, pre-defined schemas.
- Poor fit for unstructured data (images, logs, text, audio).
- Re-processing raw data means going back to the source: the warehouse discards it.
The Data Lake: Raw Files for Every Use Case
A lake stores everything in its raw form, with no transformation required at write time. Data scientists can explore it, ML engineers can train on it, and analysts can query subsets of it.
flowchart LR
Source2[Any source] --> Lake[Data Lake S3 / GCS / HDFS Parquet, JSON, CSV, Images]
Lake --> Spark[Spark Jobs ML Training]
Lake --> Notebook[Notebooks Exploration]
Lake --> DW2[Copy subsets to Warehouse]
Strengths:
- Stores anything: structured, semi-structured, unstructured.
- Cheap object storage: pennies per GB.
- Supports large-scale ML training with frameworks like Spark, Ray, PyTorch.
Limitations:
- "Data swamp" risk: without governance, data becomes untrustworthy and unmaintainable.
- No ACID transactions: multiple writers can corrupt files.
- Poor for interactive SQL queries (full file scans, no indexes).
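The "multiple writers" failure mode deserves a closer look. The following is a minimal single-process sketch (not real object-store code; all names are invented) of the classic lost update on a plain lake, and of the optimistic-concurrency commit protocol that table formats like Delta Lake and Iceberg layer on top to prevent it.

```python
# Toy sketch of the lost-update problem on a plain data lake, and of a
# lakehouse-style versioned commit log that prevents it. Object stores have
# no multi-file transactions, so two jobs doing read-modify-write on shared
# state can silently drop each other's work.

# -- Plain lake: both writers read the same state, both write, one update lost.
manifest = {"files": ["part-0.parquet"]}

snapshot_a = {"files": list(manifest["files"])}  # writer A reads
snapshot_b = {"files": list(manifest["files"])}  # writer B reads
snapshot_a["files"].append("part-1.parquet")
snapshot_b["files"].append("part-2.parquet")
manifest = snapshot_a   # A commits
manifest = snapshot_b   # B overwrites A: part-1.parquet is lost
assert "part-1.parquet" not in manifest["files"]  # the lost update

# -- Lakehouse-style: commits go through an append-only, versioned log; a
#    commit succeeds only if its expected version is still current.
log = []  # one entry per committed version

def commit(expected_version, added_file):
    if len(log) != expected_version:
        return False  # someone committed first; the writer must retry
    log.append({"version": expected_version, "add": added_file})
    return True

v = len(log)
assert commit(v, "part-1.parquet")          # writer A commits version 0
assert not commit(v, "part-2.parquet")      # writer B's stale commit rejected
assert commit(len(log), "part-2.parquet")   # B retries on top of A's commit

files = [entry["add"] for entry in log]
print(files)  # ['part-1.parquet', 'part-2.parquet'] -- nothing lost
```

Real table formats implement this atomically against the storage layer (e.g. conditional writes or a catalog service), but the rejected-stale-commit logic is the core idea.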
Deep Dive: The Data Lakehouse: Open Table Formats Bridge the Gap
The lakehouse uses open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on top of cheap object storage to add the features a warehouse provides:
- ACID transactions on Parquet files using a transaction log.
- Schema enforcement and evolution.
- Time travel: query data as it existed at any past timestamp.
- Query optimization: data skipping, Z-ordering, compaction.
flowchart TB
Raw[S3 / GCS Parquet files] --> TableFmt[Delta Lake / Iceberg Transaction Log + Stats]
TableFmt --> SparkSQL[Spark SQL ML Pipelines]
TableFmt --> Presto[Presto / Trino Interactive SQL]
TableFmt --> BITool[BI Tools native connector]
Used by Databricks (Delta Lake), Netflix (Iceberg), Uber (Hudi). You keep cheap storage but gain warehouse-quality query performance and data consistency.
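A transaction log sounds abstract, but it is essentially an ordered list of JSON commits, each adding or removing data files, and "time travel" is just replaying that log up to a chosen version. Here is a deliberately minimal sketch of the idea (illustrative only; real table formats also track schema, column statistics, and partition metadata):

```python
import json

# Minimal sketch of a Delta/Iceberg-style transaction log. Each commit adds
# and/or removes data files; the table at version N is whatever files remain
# after replaying commits 0..N.

log = [
    json.dumps({"version": 0, "add": ["part-0.parquet"]}),
    json.dumps({"version": 1, "add": ["part-1.parquet"]}),
    json.dumps({"version": 2, "add": ["part-2.parquet"],
                "remove": ["part-0.parquet"]}),  # compaction rewrote part-0
]

def snapshot(as_of_version):
    """Replay the log to reconstruct which files made up the table then."""
    files = set()
    for line in log:
        commit = json.loads(line)
        if commit["version"] > as_of_version:
            break
        files.update(commit.get("add", []))
        files.difference_update(commit.get("remove", []))
    return sorted(files)

print(snapshot(1))  # ['part-0.parquet', 'part-1.parquet'] -- the table then
print(snapshot(2))  # ['part-1.parquet', 'part-2.parquet'] -- the table now
```

This also shows why time travel is cheap: old Parquet files are not deleted on commit, only dropped from later snapshots, so any past version can be reconstructed until the files are vacuumed.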
ETL vs ELT vs Lakehouse Pipelines: Seeing the Difference
The processing philosophy is as important as the storage layer. Here is how data flows in each paradigm, from the same raw sources all the way to the consumer:
flowchart TD
SRC[Data Sources APIs Databases Event Logs]
SRC --> ETL_T[Transform & Clean ETL Pipeline]
ETL_T --> WH[Warehouse Snowflake BigQuery]
WH --> BI_T[Dashboards & Reports BI Analysts]
SRC --> ELT_L[Land Raw Files S3 GCS HDFS]
ELT_L --> ELT_T[Transform in Place Spark dbt]
ELT_T --> ML[ML Training Data Science Teams]
SRC --> LH_L[Land as Parquet S3 + Delta Lake / Iceberg]
LH_L --> LH_T[ACID Table Layer Transaction Log + Stats]
LH_T --> SQL_Q[SQL Queries BI Analysts]
LH_T --> ML2[ML Pipelines Data Science Teams]
The key contrast in practice:
- ETL cleans before writing: fast to query, but the raw data is discarded.
- ELT writes raw first: flexible, but requires transformation discipline downstream.
- Lakehouse writes raw Parquet with a governance layer: you get fast queries and raw access.
Think of ETL as a hotel that only admits guests in formal attire: clean and easy to serve, but many guests can't get in. A data lake is a campground that takes everyone with no rules. A lakehouse is a hotel that accepts everyone and maintains order internally.
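The hotel analogy can be restated as a few lines of code. This is a toy contrast, with invented records and an invented `clean()` transform, showing what each paradigm keeps and what it throws away:

```python
# Toy contrast between ETL and ELT on the same raw events.

raw_events = [
    {"user": "u1", "amount": "19.99", "note": "ok"},
    {"user": "u2", "amount": "bad-value"},           # malformed record
    {"user": "u3", "amount": "5.00", "extra": {"ab_test": "B"}},
]

def clean(event):
    """Schema-enforcing transform: keep only typed, known columns."""
    try:
        return {"user": event["user"], "amount": float(event["amount"])}
    except (KeyError, ValueError):
        return None  # rejected at the door

# ETL: transform BEFORE storing. The warehouse only ever sees clean rows;
# the malformed record and the 'extra' ML signal are gone for good.
warehouse = [row for row in (clean(e) for e in raw_events) if row]

# ELT: store raw FIRST. The lake keeps every original payload; cleaning
# happens later, on demand, and can be re-run with different rules.
lake = list(raw_events)
lake_clean_view = [row for row in (clean(e) for e in lake) if row]

print(len(warehouse))                   # 2 rows survive ETL
print(any("extra" in e for e in lake))  # True: raw signal still in the lake
```

The lakehouse pattern is the second path plus the governance layer from the previous section: raw data lands first, but reads go through a schema-enforced, transactionally consistent table.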
Architecture Data Flow Comparison
flowchart LR
subgraph WH[Warehouse: ETL Path]
S1[Raw Source] --> T1[Transform & Clean]
T1 --> DW1[(Snowflake BigQuery)]
DW1 --> BI1[BI Dashboards]
end
subgraph LK[Lake: ELT Path]
S2[Raw Source] --> L2[(S3 / HDFS Raw Parquet)]
L2 --> T2[Transform in Spark]
T2 --> ML2[ML Training]
end
subgraph LH[Lakehouse: Unified Path]
S3[Raw Source] --> P3[(S3 + Delta Lake Iceberg Table)]
P3 --> Q3[SQL via Trino]
P3 --> ML3[ML via Spark]
end
This diagram compares the three architecture paths side by side: the ETL warehouse path cleans data before loading it into Snowflake or BigQuery, the ELT lake path lands raw files on S3 and transforms them later in Spark, and the lakehouse path uses Delta Lake or Iceberg on S3 to serve both SQL and ML from the same Parquet files. The three parallel subgraphs make the key trade-off visible: the warehouse discards raw data, the lake retains it but lacks governance, and the lakehouse retains it while adding the ACID layer that makes it query-ready. The takeaway is that the lakehouse eliminates the forced choice between a clean warehouse and a flexible lake.
ETL Pipeline vs ELT Pipeline
sequenceDiagram
participant Src as Data Source
participant ETL as ETL Engine
participant WH as Warehouse
participant Lake as Data Lake
participant ELT as Spark/dbt
Note over Src,WH: ETL Transform before load
Src->>ETL: Raw extract (orders, events)
ETL->>ETL: Validate + clean + schema
ETL->>WH: Load clean structured data
WH-->>ETL: Indexed, ready for SQL
Note over Src,ELT: ELT Load raw, transform later
Src->>Lake: Land raw files (JSON/Parquet)
Lake-->>ELT: Raw data available
ELT->>Lake: Transform in place (dbt/Spark)
Lake-->>ELT: Query-ready tables
This sequence diagram contrasts the ETL and ELT processing orders across the same two actors. In the ETL flow, the engine validates, cleans, and transforms before the data ever reaches the warehouse: queries are fast, but the raw source data is discarded. In the ELT flow, raw files land in the lake immediately and Spark or dbt transforms them in place on demand: the original data is always preserved and schema changes never break ingestion. This deferred transformation is the architectural shift that makes lakehouse patterns possible.
Real-World Application: Netflix, Airbnb, and Spotify
Airbnb and the SQL-First Warehouse
Airbnb runs Snowflake to power thousands of internal BI reports. Analysts write SQL to query pre-aggregated booking, pricing, and host data. Because the data is clean and typed, a complex multi-table aggregation across millions of listings completes in seconds. Business users never wait for Spark jobs.
Best fit: high-frequency SQL dashboards, regulatory reporting, executive metrics.
Spotify and the Petabyte Data Lake
Spotify stores petabytes of raw listening events in S3 as Parquet files. Machine learning teams pull that raw data directly into training pipelines with Spark or PyTorch. The raw signal matters: a cleaned-up summary would lose the nuance that separates "skip after 5 seconds" from "replayed three times immediately."
Best fit: ML training at scale, raw event archival, research workloads requiring full signal fidelity.
Netflix and the Lakehouse Migration
Netflix migrated from a plain data lake to Apache Iceberg on S3 to solve a costly problem: hundreds of tables written to simultaneously by multiple teams were occasionally corrupting each other. Iceberg's ACID transaction log eliminated the corruption and unlocked time travel: Netflix can now audit any table as it existed at any past timestamp.
Uber adopted Apache Hudi for similar reasons: incremental upserts (updating individual ride records) are impossible on a plain lake but trivial with Hudi's merge-on-read format.
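The upsert semantics Uber needed are simple to state: update records whose key already exists, insert the rest. Here is a toy in-memory sketch (invented ride records; real Hudi/Delta MERGE operates on files and commits, not dicts) of what one upsert commit accomplishes:

```python
# Minimal sketch of upsert (MERGE) semantics: update matched keys, insert
# unmatched ones. On a plain lake of immutable Parquet files this requires
# rewriting whole files; a table format makes it a single atomic commit.

table = {
    101: {"ride_id": 101, "fare": 12.5},
    102: {"ride_id": 102, "fare": 7.0},
}

updates = [
    {"ride_id": 102, "fare": 9.0},    # corrected fare -> UPDATE
    {"ride_id": 103, "fare": 15.0},   # new ride       -> INSERT
]

def merge(target, source, key="ride_id"):
    """Apply one upsert batch: matched keys update, unmatched keys insert."""
    for record in source:
        target[record[key]] = record
    return target

merge(table, updates)
print(sorted(table))       # [101, 102, 103]
print(table[102]["fare"])  # 9.0
```

Hudi's merge-on-read variant defers this reconciliation to query time by storing the update batches as delta files, which keeps writes cheap at the cost of slightly slower reads.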
| Company | Architecture | Primary Driver |
| Airbnb | Warehouse (Snowflake) | BI speed, analyst self-service |
| Spotify | Data Lake (S3 + Parquet) | ML at petabyte scale |
| Netflix | Lakehouse (Iceberg on S3) | ACID correctness + time travel |
| Uber | Lakehouse (Hudi on S3) | Incremental upserts on raw data |
Trade-offs, Failure Modes & Decision Guide: Which One to Pick
| If your team primarily needs… | Choose |
| Fast SQL dashboards for business users | Warehouse (Snowflake, BigQuery) |
| ML model training on raw, diverse data | Lake (S3 + Parquet) |
| Both SQL and ML on the same data | Lakehouse (Delta Lake, Iceberg) |
| Quick start with minimal ops overhead | Warehouse (fully managed) |
| Maximum flexibility and open standards | Lakehouse |
| Smallest storage cost, OK with management complexity | Lake |
In practice, many large organizations run all three: a warehouse for reporting, a lake for raw archival and ML, and a lakehouse for the intersection.
Choosing Your Architecture: A Three-Question Decision Walk
You do not need to memorize the theory β answering three questions about your actual use case gets you to the right answer every time.
Question 1: Who consumes the data?
- Business analysts running SQL dashboards → lean toward Warehouse.
- Data scientists training ML models → lean toward Lake.
- Both groups sharing the same dataset → lean toward Lakehouse.
Question 2: What shape is your data?
- All structured (fixed schema, rows and columns) → Warehouse works perfectly.
- Mix of structured, semi-structured, and unstructured → Lake or Lakehouse required.
Question 3: Do you need ACID correctness?
- Multiple writers, strict consistency, or incremental updates required → Warehouse or Lakehouse.
- Append-only batch jobs, OK with eventual consistency → Lake is sufficient.
Worked example: an e-commerce startup.
The startup has three data consumers:
- A finance dashboard querying daily revenue (SQL, structured data).
- A recommendation model training on raw clickstreams (ML, semi-structured data).
- A fraud detection system applying incremental risk-score updates (ACID required).
Verdict: A Lakehouse (Delta Lake or Iceberg on S3) covers all three. The same Parquet files power the BI dashboard via Trino SQL, the ML training pipeline via Spark, and the fraud model's incremental ACID writes: no data duplication, a single source of truth, one storage bill.
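The three-question walk can be written down as a function. The encoding below is this post's simplification (the names and rules mirror the bullets above, nothing more) rather than an industry-standard decision procedure:

```python
# The three-question decision walk as a function. Rules mirror the bullets
# above; this is a simplification for illustration, not a standard taxonomy.

def choose_architecture(consumers, data_shapes, needs_acid):
    """consumers: set containing 'analysts' and/or 'data_scientists';
    data_shapes: set of 'structured' / 'semi' / 'unstructured';
    needs_acid: True if multiple writers or incremental updates required."""
    mixed_shapes = data_shapes != {"structured"}
    both_consumers = {"analysts", "data_scientists"} <= consumers

    if both_consumers or (mixed_shapes and needs_acid):
        return "lakehouse"
    if mixed_shapes or consumers == {"data_scientists"}:
        return "lakehouse" if needs_acid else "lake"
    return "warehouse"

# The e-commerce startup: both consumer groups, mixed shapes, ACID needed.
print(choose_architecture({"analysts", "data_scientists"},
                          {"structured", "semi"}, True))        # lakehouse
print(choose_architecture({"analysts"}, {"structured"}, True))  # warehouse
print(choose_architecture({"data_scientists"}, {"semi"}, False))  # lake
```

Note that the SQL-only, structured, ACID case resolves to a warehouse: the warehouse already provides transactional guarantees, so ACID alone is not a reason to adopt a lakehouse.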
Apache Spark + Delta Lake: SQL and ML on the Same Data with PySpark
Apache Spark is the dominant distributed compute engine for big data analytics and ML pipelines; Delta Lake is an open-source storage layer (built by Databricks) that adds ACID transactions, schema enforcement, time travel, and MERGE/upsert support to Parquet files stored on S3, GCS, or HDFS. Together they form the canonical Python lakehouse stack.
The core advantage over a plain data lake: multiple PySpark jobs can write to the same Delta table concurrently without corrupting each other, BI analysts can query it with SQL, and ML engineers can read the same Parquet files. One storage layer serves both use cases.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp
from delta import configure_spark_with_delta_pip
# 1. Create a Spark session with Delta Lake extensions
builder = (
SparkSession.builder
.appName("LakehouseDemo")
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# 2. ELT pattern: store raw clickstream events as a Delta table (no pre-transformation)
events_df = spark.read.json("s3://data-lake/raw/clickstream/2024/")
(
events_df
.withColumn("ingested_at", current_timestamp())
.write
.format("delta")
.mode("append") # ACID-safe: concurrent writers never corrupt files
.partitionBy("event_date") # partition pruning speeds up date-range queries
.save("s3://data-lake/delta/clickstream")
)
# 3. SQL analytics on the same Delta table: a BI dashboard query
spark.read.format("delta").load("s3://data-lake/delta/clickstream") \
.createOrReplaceTempView("clickstream")
spark.sql("""
SELECT product_id,
COUNT(*) AS page_views,
COUNT(DISTINCT user_id) AS unique_visitors
FROM clickstream
WHERE event_date >= '2024-01-01'
GROUP BY product_id
ORDER BY page_views DESC
LIMIT 10
""").show()
# 4. Time travel: query an earlier snapshot of the table (ML debugging / audits)
historical_df = (
spark.read.format("delta")
.option("versionAsOf", 5) # or .option("timestampAsOf", "2024-01-15")
.load("s3://data-lake/delta/clickstream")
)
print(f"Historical version row count: {historical_df.count()}")
# 5. MERGE (upsert): update existing records, insert new ones.
# This operation is impossible on plain S3 Parquet without Delta Lake.
# The 'updates' view must exist first; here we register one from a
# hypothetical corrections feed (adjust the path to your pipeline).
updates_df = spark.read.json("s3://data-lake/raw/clickstream-corrections/")
updates_df.createOrReplaceTempView("updates")
spark.sql("""
MERGE INTO delta.`s3://data-lake/delta/clickstream` AS target
USING updates AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
dbt (data build tool) is the transformation companion in production lakehouse pipelines: SQL-based dbt models run on top of the Delta table, applying business-logic transformations with full lineage tracking and incremental materialization, without any Spark job code.
| Layer | Tool | Role |
| Compute | Apache Spark (PySpark) | Distributed read/write, ML training |
| Storage format | Delta Lake / Apache Iceberg | ACID + time travel on Parquet |
| Transformation | dbt | SQL models with lineage |
| BI / SQL | Trino, Presto, Databricks SQL | Interactive analyst queries |
For a full deep-dive on Delta Lake MERGE patterns, Apache Iceberg vs. Delta Lake trade-offs, and dbt incremental model design, a dedicated follow-up post is planned.
Five Lessons Learned From Real Data Architecture Failures
1. A data lake without a catalog becomes a data swamp. Without a metadata catalog (Apache Atlas, AWS Glue, or DataHub), teams cannot discover what data exists, how fresh it is, or who owns it. Within months, the lake fills with orphaned datasets nobody trusts. Governance is not optional: build it before the lake grows.
2. Over-engineering to a lakehouse before you need it adds cost. If your team is five analysts writing SQL with zero ML use cases, a managed warehouse (BigQuery, Snowflake) ships faster and costs less to operate. Adopt a lakehouse when your teams genuinely diverge in their data needs, not before.
3. ETL pipelines become bottlenecks as schemas evolve. Warehouse ETL pipelines hardcode transformations. Every time a source system adds a column, someone updates a pipeline. An ELT or lake-first approach stores raw data untouched: schema changes do not break ingestion, and transformations happen on demand downstream.
4. Coupling compute and storage is expensive at scale. Legacy on-premises warehouses couple compute and storage: scaling one means scaling both. Modern cloud designs (Snowflake virtual warehouses, Databricks clusters over S3) decouple them, so you pay only for active compute and store data cheaply at rest.
5. Time travel is more useful than it sounds. Delta Lake and Iceberg let you query data as it existed 30 days ago. This is not just an audit feature: it is how engineers debug ML model regressions ("why did accuracy drop last Tuesday?") and recover from accidental bulk deletes without triggering a full restore.
TLDR: Summary & Key Takeaways
- Warehouse: clean, structured, SQL, fast dashboards. Bad for unstructured or raw data.
- Lake: raw, cheap, flexible. Good for ML. Bad for interactive SQL and governance.
- Lakehouse: open table formats add ACID + SQL performance on top of cheap object storage.
- ETL (warehouse) transforms before storing; ELT (lake/lakehouse) stores raw and transforms later.
- Most serious data organizations run a hybrid: warehouse for BI, lake/lakehouse for ML.

Written by
Abstract Algorithms
@abstractalgorithms
