Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?
Snowflake, Databricks, or S3? We explain the evolution of data storage. Learn when to use a struc...
TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg) that bring SQL performance to raw storage: the best of both.
Three Ways to Store Data at Scale: What Problem Each Solves
As organizations grow, they end up with data coming from many sources: transactional databases, clickstreams, IoT sensors, third-party APIs. Storing and querying all of it efficiently requires choosing the right abstraction.
| Question | Warehouse | Lake | Lakehouse |
| What data format? | Structured tables, SQL | Raw files (JSON, Parquet, CSV) | Open formats (Parquet + metadata layer) |
| Who uses it? | Analysts, BI tools | Data scientists, ML engineers | Both |
| Primary operation | SQL aggregations | Batch ML training, exploration | SQL + ML on same data |
| When to update data? | ETL: clean before storing | ELT: store raw, transform later | ACID on open files |
| Cost model | Compute + managed storage | Cheap object storage | Object storage + optional compute |
OLAP, OLTP, and the Anatomy of a Data Decision
Before picking a storage architecture, it helps to understand the two fundamentally different ways organizations query data.
OLTP (Online Transaction Processing) is what your application's production database does: short, fast reads and writes, such as "insert this order" and "update this balance". PostgreSQL and MySQL are OLTP systems, optimized for low-latency single-row operations.
OLAP (Online Analytical Processing) is what happens when Finance asks "what was our revenue by product category for every quarter in the last three years?", a query scanning billions of rows. Warehouses, lakes, and lakehouses are all OLAP-oriented.
| Dimension | OLTP (Production DB) | OLAP (Warehouse / Lake) |
| Query type | Single-row reads/writes | Multi-row aggregations |
| Latency target | < 10 ms | Seconds to minutes |
| Data format | Normalized rows | Columnar, denormalized |
| Scale | GB to TB | TB to PB |
| Example | PostgreSQL, MySQL | Snowflake, S3, Delta Lake |
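The row-vs-columnar distinction in the table above can be made concrete with a toy sketch. The `orders` data and function names below are invented for illustration: an OLAP aggregation like revenue-by-category only needs two columns, so a columnar layout lets the engine scan just those arrays instead of touching every record object.

```python
# Toy illustration: the same six orders stored row-wise vs column-wise.
# OLAP queries such as SUM(amount) GROUP BY category only need two columns;
# a columnar layout scans those contiguous arrays and skips everything else.

# Row-oriented (how an OLTP database returns records)
rows = [
    {"order_id": 1, "category": "books", "amount": 12.0},
    {"order_id": 2, "category": "games", "amount": 60.0},
    {"order_id": 3, "category": "books", "amount": 8.0},
    {"order_id": 4, "category": "music", "amount": 10.0},
    {"order_id": 5, "category": "books", "amount": 20.0},
    {"order_id": 6, "category": "games", "amount": 30.0},
]

# Column-oriented (how Parquet and warehouse engines lay data out)
columns = {
    "order_id": [1, 2, 3, 4, 5, 6],
    "category": ["books", "games", "books", "music", "books", "games"],
    "amount":   [12.0, 60.0, 8.0, 10.0, 20.0, 30.0],
}

def revenue_by_category(cats, amounts):
    """OLAP-style aggregation reading only the two columns it needs."""
    totals = {}
    for cat, amt in zip(cats, amounts):
        totals[cat] = totals.get(cat, 0.0) + amt
    return totals

totals = revenue_by_category(columns["category"], columns["amount"])
print(totals)  # {'books': 40.0, 'games': 90.0, 'music': 10.0}
```

Real engines add compression, vectorized execution, and predicate pushdown on top of this layout, but the access-pattern argument is the same.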
A second key distinction is data shape:
- Structured β rows and columns with a fixed schema (orders, events, user profiles).
- Semi-structured β self-describing but variable layout (JSON, Parquet, Avro logs).
- Unstructured β no schema at all (images, PDFs, audio, raw text).
Warehouses only accept structured data. Lakes and lakehouses handle all three. If your organization generates diverse data (ML training sets, sensor readings, chat logs), a warehouse alone is a dead end.
Deep Dive: The Data Warehouse: SQL on Clean, Structured Data
A warehouse ingests pre-processed, schema-validated data. The ETL pipeline (Extract, Transform, Load) cleans and normalizes data before it enters.
flowchart LR
Source[Operational DBs APIs / Events] --> ETL[ETL Pipeline Transform + Validate]
ETL --> DW[Data Warehouse Snowflake / BigQuery Redshift]
DW --> BI[BI Tools Tableau / Looker]
DW --> Analyst[SQL Analysts]
Strengths:
- Sub-second query performance on aggregations over billions of rows.
- Strong governance: schema enforcement, row-level security, audit trails.
- Familiar SQL interface for business users.
Limitations:
- Only stores structured, pre-defined schemas.
- Poor fit for unstructured data (images, logs, text, audio).
- Re-processing raw data means going back to the source: the warehouse discards it.
The Data Lake: Raw Files for Every Use Case
A lake stores everything in its raw form, with no transformation required at write time. Data scientists can explore it, ML engineers can train on it, and analysts can query subsets of it.
flowchart LR
Source2[Any source] --> Lake[Data Lake S3 / GCS / HDFS Parquet, JSON, CSV, Images]
Lake --> Spark[Spark Jobs ML Training]
Lake --> Notebook[Notebooks Exploration]
Lake --> DW2[Copy subsets to Warehouse]
Strengths:
- Stores anything: structured, semi-structured, unstructured.
- Cheap object storage: pennies per GB.
- Supports large-scale ML training with frameworks like Spark, Ray, PyTorch.
Limitations:
- "Data swamp" risk: without governance, data becomes untrustworthy and unmaintainable.
- No ACID transactions: multiple writers can corrupt files.
- Poor for interactive SQL queries (full file scans, no indexes).
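The "multiple writers" failure mode deserves a closer look. The following is a minimal single-process sketch (not real object-store code; all names are invented) of the classic lost update on a plain lake, and of the optimistic-concurrency commit protocol that table formats like Delta Lake and Iceberg layer on top to prevent it.

```python
# Toy sketch of the lost-update problem on a plain data lake, and of a
# lakehouse-style versioned commit log that prevents it. Object stores have
# no multi-file transactions, so two jobs doing read-modify-write on shared
# state can silently drop each other's work.

# -- Plain lake: both writers read the same state, both write, one update lost.
manifest = {"files": ["part-0.parquet"]}

snapshot_a = {"files": list(manifest["files"])}  # writer A reads
snapshot_b = {"files": list(manifest["files"])}  # writer B reads
snapshot_a["files"].append("part-1.parquet")
snapshot_b["files"].append("part-2.parquet")
manifest = snapshot_a   # A commits
manifest = snapshot_b   # B overwrites A: part-1.parquet is lost
assert "part-1.parquet" not in manifest["files"]  # the lost update

# -- Lakehouse-style: commits go through an append-only, versioned log; a
#    commit succeeds only if its expected version is still current.
log = []  # one entry per committed version

def commit(expected_version, added_file):
    if len(log) != expected_version:
        return False  # someone committed first; the writer must retry
    log.append({"version": expected_version, "add": added_file})
    return True

v = len(log)
assert commit(v, "part-1.parquet")          # writer A commits version 0
assert not commit(v, "part-2.parquet")      # writer B's stale commit rejected
assert commit(len(log), "part-2.parquet")   # B retries on top of A's commit

files = [entry["add"] for entry in log]
print(files)  # ['part-1.parquet', 'part-2.parquet'] -- nothing lost
```

Real table formats implement this atomically against the storage layer (e.g. conditional writes or a catalog service), but the rejected-stale-commit logic is the core idea.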
Deep Dive: The Data Lakehouse: Open Table Formats Bridge the Gap
The lakehouse uses open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on top of cheap object storage to add the features a warehouse provides:
- ACID transactions on Parquet files using a transaction log.
- Schema enforcement and evolution.
- Time travel: query data as it existed at any past timestamp.
- Query optimization: data skipping, Z-ordering, compaction.
flowchart TB
Raw[S3 / GCS Parquet files] --> TableFmt[Delta Lake / Iceberg Transaction Log + Stats]
TableFmt --> SparkSQL[Spark SQL ML Pipelines]
TableFmt --> Presto[Presto / Trino Interactive SQL]
TableFmt --> BITool[BI Tools native connector]
Used by Databricks (Delta Lake), Netflix (Iceberg), Uber (Hudi). You keep cheap storage but gain warehouse-quality query performance and data consistency.
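A transaction log sounds abstract, but it is essentially an ordered list of JSON commits, each adding or removing data files, and "time travel" is just replaying that log up to a chosen version. Here is a deliberately minimal sketch of the idea (illustrative only; real table formats also track schema, column statistics, and partition metadata):

```python
import json

# Minimal sketch of a Delta/Iceberg-style transaction log. Each commit adds
# and/or removes data files; the table at version N is whatever files remain
# after replaying commits 0..N.

log = [
    json.dumps({"version": 0, "add": ["part-0.parquet"]}),
    json.dumps({"version": 1, "add": ["part-1.parquet"]}),
    json.dumps({"version": 2, "add": ["part-2.parquet"],
                "remove": ["part-0.parquet"]}),  # compaction rewrote part-0
]

def snapshot(as_of_version):
    """Replay the log to reconstruct which files made up the table then."""
    files = set()
    for line in log:
        commit = json.loads(line)
        if commit["version"] > as_of_version:
            break
        files.update(commit.get("add", []))
        files.difference_update(commit.get("remove", []))
    return sorted(files)

print(snapshot(1))  # ['part-0.parquet', 'part-1.parquet'] -- the table then
print(snapshot(2))  # ['part-1.parquet', 'part-2.parquet'] -- the table now
```

This also shows why time travel is cheap: old Parquet files are not deleted on commit, only dropped from later snapshots, so any past version can be reconstructed until the files are vacuumed.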
ETL vs ELT vs Lakehouse Pipelines: Seeing the Difference
The processing philosophy is as important as the storage layer. Here is how data flows in each paradigm, from the same raw sources all the way to the consumer:
flowchart TD
SRC[Data Sources APIs Databases Event Logs]
SRC --> ETL_T[Transform & Clean ETL Pipeline]
ETL_T --> WH[Warehouse Snowflake BigQuery]
WH --> BI_T[Dashboards & Reports BI Analysts]
SRC --> ELT_L[Land Raw Files S3 GCS HDFS]
ELT_L --> ELT_T[Transform in Place Spark dbt]
ELT_T --> ML[ML Training Data Science Teams]
SRC --> LH_L[Land as Parquet S3 + Delta Lake / Iceberg]
LH_L --> LH_T[ACID Table Layer Transaction Log + Stats]
LH_T --> SQL_Q[SQL Queries BI Analysts]
LH_T --> ML2[ML Pipelines Data Science Teams]
The key contrast in practice:
- ETL cleans before writing: fast to query, but the raw data is discarded.
- ELT writes raw first: flexible, but requires transformation discipline downstream.
- Lakehouse writes raw Parquet with a governance layer: you get fast queries and raw access.
Think of ETL as a hotel that only admits guests in formal attire: clean and easy to serve, but many guests can't get in. A data lake is a campground that takes everyone with no rules. A lakehouse is a hotel that accepts everyone and maintains order internally.
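The hotel analogy can be restated as a few lines of code. This is a toy contrast, with invented records and an invented `clean()` transform, showing what each paradigm keeps and what it throws away:

```python
# Toy contrast between ETL and ELT on the same raw events.

raw_events = [
    {"user": "u1", "amount": "19.99", "note": "ok"},
    {"user": "u2", "amount": "bad-value"},           # malformed record
    {"user": "u3", "amount": "5.00", "extra": {"ab_test": "B"}},
]

def clean(event):
    """Schema-enforcing transform: keep only typed, known columns."""
    try:
        return {"user": event["user"], "amount": float(event["amount"])}
    except (KeyError, ValueError):
        return None  # rejected at the door

# ETL: transform BEFORE storing. The warehouse only ever sees clean rows;
# the malformed record and the 'extra' ML signal are gone for good.
warehouse = [row for row in (clean(e) for e in raw_events) if row]

# ELT: store raw FIRST. The lake keeps every original payload; cleaning
# happens later, on demand, and can be re-run with different rules.
lake = list(raw_events)
lake_clean_view = [row for row in (clean(e) for e in lake) if row]

print(len(warehouse))                   # 2 rows survive ETL
print(any("extra" in e for e in lake))  # True: raw signal still in the lake
```

The lakehouse pattern is the second path plus the governance layer from the previous section: raw data lands first, but reads go through a schema-enforced, transactionally consistent table.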
Architecture Data Flow Comparison
flowchart LR
subgraph WH[Warehouse: ETL Path]
S1[Raw Source] --> T1[Transform & Clean]
T1 --> DW1[(Snowflake BigQuery)]
DW1 --> BI1[BI Dashboards]
end
subgraph LK[Lake: ELT Path]
S2[Raw Source] --> L2[(S3 / HDFS Raw Parquet)]
L2 --> T2[Transform in Spark]
T2 --> ML2[ML Training]
end
subgraph LH[Lakehouse: Unified Path]
S3[Raw Source] --> P3[(S3 + Delta Lake Iceberg Table)]
P3 --> Q3[SQL via Trino]
P3 --> ML3[ML via Spark]
end
This diagram compares the three architecture paths side by side: the ETL warehouse path cleans data before loading it into Snowflake or BigQuery, the ELT lake path lands raw files on S3 and transforms them later in Spark, and the lakehouse path uses Delta Lake or Iceberg on S3 to serve both SQL and ML from the same Parquet files. The three parallel subgraphs make the key trade-off visible: the warehouse discards raw data, the lake retains it but lacks governance, and the lakehouse retains it while adding the ACID layer that makes it query-ready. The takeaway is that the lakehouse eliminates the forced choice between a clean warehouse and a flexible lake.
ETL Pipeline vs ELT Pipeline
sequenceDiagram
participant Src as Data Source
participant ETL as ETL Engine
participant WH as Warehouse
participant Lake as Data Lake
participant ELT as Spark/dbt
Note over Src,WH: ETL Transform before load
Src->>ETL: Raw extract (orders, events)
ETL->>ETL: Validate + clean + schema
ETL->>WH: Load clean structured data
WH-->>ETL: Indexed, ready for SQL
Note over Src,ELT: ELT Load raw, transform later
Src->>Lake: Land raw files (JSON/Parquet)
Lake-->>ELT: Raw data available
ELT->>Lake: Transform in place (dbt/Spark)
Lake-->>ELT: Query-ready tables
This sequence diagram contrasts the ETL and ELT processing orders across the same two actors. In the ETL flow, the engine validates, cleans, and transforms before the data ever reaches the warehouse: queries are fast, but the raw source data is discarded. In the ELT flow, raw files land in the lake immediately and Spark or dbt transforms them in place on demand: the original data is always preserved and schema changes never break ingestion. This deferred transformation is the architectural shift that makes lakehouse patterns possible.
Real-World Application: Netflix, Airbnb, and Spotify
Airbnb and the SQL-First Warehouse
Airbnb runs Snowflake to power thousands of internal BI reports. Analysts write SQL to query pre-aggregated booking, pricing, and host data. Because the data is clean and typed, a complex multi-table aggregation across millions of listings completes in seconds. Business users never wait for Spark jobs.
Best fit: high-frequency SQL dashboards, regulatory reporting, executive metrics.
Spotify and the Petabyte Data Lake
Spotify stores petabytes of raw listening events in S3 as Parquet files. Machine learning teams pull that raw data directly into training pipelines with Spark or PyTorch. The raw signal matters: a cleaned-up summary would lose the nuance that separates "skip after 5 seconds" from "replayed three times immediately."
Best fit: ML training at scale, raw event archival, research workloads requiring full signal fidelity.
Netflix and the Lakehouse Migration
Netflix migrated from a plain data lake to Apache Iceberg on S3 to solve a costly problem: hundreds of tables written to simultaneously by multiple teams were occasionally corrupting each other. Iceberg's ACID transaction log eliminated the corruption and unlocked time travel: Netflix can now audit any table as it existed at any past timestamp.
Uber adopted Apache Hudi for similar reasons: incremental upserts (updating individual ride records) are impossible on a plain lake but trivial with Hudi's merge-on-read format.
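The upsert semantics Uber needed are simple to state: update records whose key already exists, insert the rest. Here is a toy in-memory sketch (invented ride records; real Hudi/Delta MERGE operates on files and commits, not dicts) of what one upsert commit accomplishes:

```python
# Minimal sketch of upsert (MERGE) semantics: update matched keys, insert
# unmatched ones. On a plain lake of immutable Parquet files this requires
# rewriting whole files; a table format makes it a single atomic commit.

table = {
    101: {"ride_id": 101, "fare": 12.5},
    102: {"ride_id": 102, "fare": 7.0},
}

updates = [
    {"ride_id": 102, "fare": 9.0},    # corrected fare -> UPDATE
    {"ride_id": 103, "fare": 15.0},   # new ride       -> INSERT
]

def merge(target, source, key="ride_id"):
    """Apply one upsert batch: matched keys update, unmatched keys insert."""
    for record in source:
        target[record[key]] = record
    return target

merge(table, updates)
print(sorted(table))       # [101, 102, 103]
print(table[102]["fare"])  # 9.0
```

Hudi's merge-on-read variant defers this reconciliation to query time by storing the update batches as delta files, which keeps writes cheap at the cost of slightly slower reads.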
| Company | Architecture | Primary Driver |
| Airbnb | Warehouse (Snowflake) | BI speed, analyst self-service |
| Spotify | Data Lake (S3 + Parquet) | ML at petabyte scale |
| Netflix | Lakehouse (Iceberg on S3) | ACID correctness + time travel |
| Uber | Lakehouse (Hudi on S3) | Incremental upserts on raw data |
Trade-offs, Failure Modes & Decision Guide: Which One to Pick
| If your team primarily needs… | Choose |
| Fast SQL dashboards for business users | Warehouse (Snowflake, BigQuery) |
| ML model training on raw, diverse data | Lake (S3 + Parquet) |
| Both SQL and ML on the same data | Lakehouse (Delta Lake, Iceberg) |
| Quick start with minimal ops overhead | Warehouse (fully managed) |
| Maximum flexibility and open standards | Lakehouse |
| Smallest storage cost, OK with management complexity | Lake |
In practice, many large organizations run all three: a warehouse for reporting, a lake for raw archival and ML, and a lakehouse for the intersection.
Choosing Your Architecture: A Three-Question Decision Walk
You do not need to memorize the theory β answering three questions about your actual use case gets you to the right answer every time.
Question 1: Who consumes the data?
- Business analysts running SQL dashboards → lean toward Warehouse.
- Data scientists training ML models → lean toward Lake.
- Both groups sharing the same dataset → lean toward Lakehouse.
Question 2: What shape is your data?
- All structured (fixed schema, rows and columns) → Warehouse works perfectly.
- Mix of structured, semi-structured, and unstructured → Lake or Lakehouse required.
Question 3: Do you need ACID correctness?
- Multiple writers, strict consistency, or incremental updates required → Warehouse or Lakehouse.
- Append-only batch jobs, OK with eventual consistency → Lake is sufficient.
Worked example: an e-commerce startup.
The startup has three data consumers:
- A finance dashboard querying daily revenue (SQL, structured data).
- A recommendation model training on raw clickstreams (ML, semi-structured data).
- A fraud detection system applying incremental risk-score updates (ACID required).
Verdict: A Lakehouse (Delta Lake or Iceberg on S3) covers all three. The same Parquet files power the BI dashboard via Trino SQL, the ML training pipeline via Spark, and the fraud model's incremental ACID writes: no data duplication, a single source of truth, one storage bill.
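The three-question walk can be written down as a function. The encoding below is this post's simplification (the names and rules mirror the bullets above, nothing more) rather than an industry-standard decision procedure:

```python
# The three-question decision walk as a function. Rules mirror the bullets
# above; this is a simplification for illustration, not a standard taxonomy.

def choose_architecture(consumers, data_shapes, needs_acid):
    """consumers: set containing 'analysts' and/or 'data_scientists';
    data_shapes: set of 'structured' / 'semi' / 'unstructured';
    needs_acid: True if multiple writers or incremental updates required."""
    mixed_shapes = data_shapes != {"structured"}
    both_consumers = {"analysts", "data_scientists"} <= consumers

    if both_consumers or (mixed_shapes and needs_acid):
        return "lakehouse"
    if mixed_shapes or consumers == {"data_scientists"}:
        return "lakehouse" if needs_acid else "lake"
    return "warehouse"

# The e-commerce startup: both consumer groups, mixed shapes, ACID needed.
print(choose_architecture({"analysts", "data_scientists"},
                          {"structured", "semi"}, True))        # lakehouse
print(choose_architecture({"analysts"}, {"structured"}, True))  # warehouse
print(choose_architecture({"data_scientists"}, {"semi"}, False))  # lake
```

Note that the SQL-only, structured, ACID case resolves to a warehouse: the warehouse already provides transactional guarantees, so ACID alone is not a reason to adopt a lakehouse.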
Apache Spark + Delta Lake: SQL and ML on the Same Data with PySpark
Apache Spark is the dominant distributed compute engine for big data analytics and ML pipelines; Delta Lake is an open-source storage layer (built by Databricks) that adds ACID transactions, schema enforcement, time travel, and MERGE/upsert support to Parquet files stored on S3, GCS, or HDFS. Together they form the canonical Python lakehouse stack.
The core advantage over a plain data lake: multiple PySpark jobs can write to the same Delta table concurrently without corrupting each other, BI analysts can query it with SQL, and ML engineers can read the same Parquet files. One storage layer serves both use cases.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp
from delta import configure_spark_with_delta_pip
# 1. Create a Spark session with Delta Lake extensions
builder = (
SparkSession.builder
.appName("LakehouseDemo")
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# 2. ELT pattern: store raw clickstream events as a Delta table (no pre-transformation)
events_df = spark.read.json("s3://data-lake/raw/clickstream/2024/")
(
events_df
.withColumn("ingested_at", current_timestamp())
.write
.format("delta")
.mode("append") # ACID-safe: concurrent writers never corrupt files
.partitionBy("event_date") # partition pruning speeds up date-range queries
.save("s3://data-lake/delta/clickstream")
)
# 3. SQL analytics on the same Delta table: a BI dashboard query
spark.read.format("delta").load("s3://data-lake/delta/clickstream") \
.createOrReplaceTempView("clickstream")
spark.sql("""
SELECT product_id,
COUNT(*) AS page_views,
COUNT(DISTINCT user_id) AS unique_visitors
FROM clickstream
WHERE event_date >= '2024-01-01'
GROUP BY product_id
ORDER BY page_views DESC
LIMIT 10
""").show()
# 4. Time travel: query an earlier snapshot of the table (ML debugging / audits)
historical_df = (
spark.read.format("delta")
.option("versionAsOf", 5) # or .option("timestampAsOf", "2024-01-15")
.load("s3://data-lake/delta/clickstream")
)
print(f"Historical version row count: {historical_df.count()}")
# 5. MERGE (upsert): update existing records, insert new ones.
# This operation is impossible on plain S3 Parquet without Delta Lake.
# The 'updates' view must exist first; here we register one from a
# hypothetical corrections feed (adjust the path to your pipeline).
updates_df = spark.read.json("s3://data-lake/raw/clickstream-corrections/")
updates_df.createOrReplaceTempView("updates")
spark.sql("""
MERGE INTO delta.`s3://data-lake/delta/clickstream` AS target
USING updates AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
dbt (data build tool) is the transformation companion in production lakehouse pipelines: SQL-based dbt models run on top of the Delta table, applying business-logic transformations with full lineage tracking and incremental materialization, without any Spark job code.
| Layer | Tool | Role |
| Compute | Apache Spark (PySpark) | Distributed read/write, ML training |
| Storage format | Delta Lake / Apache Iceberg | ACID + time travel on Parquet |
| Transformation | dbt | SQL models with lineage |
| BI / SQL | Trino, Presto, Databricks SQL | Interactive analyst queries |
For a full deep-dive on Delta Lake MERGE patterns, Apache Iceberg vs. Delta Lake trade-offs, and dbt incremental model design, a dedicated follow-up post is planned.
Five Lessons Learned From Real Data Architecture Failures
1. A data lake without a catalog becomes a data swamp. Without a metadata catalog (Apache Atlas, AWS Glue, or DataHub), teams cannot discover what data exists, how fresh it is, or who owns it. Within months, the lake fills with orphaned datasets nobody trusts. Governance is not optional: build it before the lake grows.
2. Over-engineering to a lakehouse before you need it adds cost. If your team is five analysts writing SQL with zero ML use cases, a managed warehouse (BigQuery, Snowflake) ships faster and costs less to operate. Adopt a lakehouse when your teams genuinely diverge in their data needs, not before.
3. ETL pipelines become bottlenecks as schemas evolve. Warehouse ETL pipelines hardcode transformations. Every time a source system adds a column, someone updates a pipeline. An ELT or lake-first approach stores raw data untouched: schema changes do not break ingestion, and transformations happen on demand downstream.
4. Coupling compute and storage is expensive at scale. Legacy on-premises warehouses couple compute and storage: scaling one means scaling both. Modern cloud designs (Snowflake virtual warehouses, Databricks clusters over S3) decouple them, so you pay only for active compute and store data cheaply at rest.
5. Time travel is more useful than it sounds. Delta Lake and Iceberg let you query data as it existed 30 days ago. This is not just an audit feature: it is how engineers debug ML model regressions ("why did accuracy drop last Tuesday?") and recover from accidental bulk deletes without triggering a full restore.
TLDR: Summary & Key Takeaways
- Warehouse: clean, structured, SQL, fast dashboards. Bad for unstructured or raw data.
- Lake: raw, cheap, flexible. Good for ML. Bad for interactive SQL and governance.
- Lakehouse: open table formats add ACID + SQL performance on top of cheap object storage.
- ETL (warehouse) transforms before storing; ELT (lake/lakehouse) stores raw and transforms later.
- Most serious data organizations run a hybrid: warehouse for BI, lake/lakehouse for ML.

Written by
Abstract Algorithms
@abstractalgorithms
