Category

big data

12 articles across 2 sub-topics

Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring

TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically-scaled execution environment. The Kubeflow Spark Operator manages SparkApplication CRDs through a reconciliation loop that creates driver and ex...

31 min read

Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy

TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM — they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOverhead. Master the UnifiedMemoryManager model, apply...

33 min read
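The executor memory split described above can be sketched in plain Python using Spark's documented defaults (spark.memory.fraction=0.6, spark.memory.storageFraction=0.5, 300 MB reserved); the function name and the 8 GB heap are illustrative, and memoryOverhead is a separate off-heap budget not modeled here.

```python
# Sketch of the UnifiedMemoryManager heap arithmetic for one executor,
# using Spark's documented defaults. Illustrative only: the real manager
# lets execution and storage borrow from each other at runtime.

RESERVED_MB = 300          # fixed reserved system memory
MEMORY_FRACTION = 0.6      # spark.memory.fraction default
STORAGE_FRACTION = 0.5     # spark.memory.storageFraction default

def unified_memory(executor_heap_mb: int) -> dict:
    """Split an executor heap into Spark's memory regions (in MB)."""
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * MEMORY_FRACTION        # execution + storage pool
    user = usable * (1 - MEMORY_FRACTION)     # user data structures, UDFs
    storage = unified * STORAGE_FRACTION      # cached RDDs/DataFrames
    execution = unified - storage             # shuffles, joins, sorts
    return {"reserved": RESERVED_MB, "user": user,
            "storage": storage, "execution": execution}

# An 8 GB (8192 MB) heap leaves far less than 8 GB for shuffle buffers:
print(unified_memory(8192))
```

Note that spark.executor.memoryOverhead (off-heap, per container) sits on top of this heap and is what Kubernetes or YARN actually kills the container for exceeding.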

Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained

TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches tasks to Executors respecting data locality, and the...

26 min read
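The stage-cutting rule in the summary above can be shown with a deliberately simplified sketch; the real DAGScheduler walks RDD lineage and distinguishes narrow from wide dependencies, so the op list and shuffle flags here are illustrative stand-ins.

```python
# Simplified stage formation: cut the operator chain into stages
# wherever a wide (shuffle) dependency appears; narrow ops pipeline
# together inside one stage. (Not the real DAGScheduler algorithm.)

ops = [("read", False), ("map", False), ("groupByKey", True),
       ("mapValues", False), ("sortByKey", True), ("save", False)]

def split_stages(ops):
    stages, current = [], []
    for name, is_shuffle in ops:
        if is_shuffle and current:
            stages.append(current)   # shuffle boundary closes a stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

print(split_stages(ops))   # three stages, split at the two shuffles
```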

Spark Adaptive Query Execution: Dynamic Coalescing, Pruning, and Skew Handling

TLDR: Before AQE, Spark compiled your entire query into a static physical plan using size estimates that were frequently wrong — and a wrong estimate at planning time meant a skewed join, 800 small tasks, or a missed broadcast opportunity that no amo...

34 min read
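The small-partition coalescing AQE performs after each shuffle can be approximated with a simple binning sketch; the 64 MB target and the partition sizes are illustrative (Spark's actual behavior is governed by spark.sql.adaptive.advisoryPartitionSizeInBytes and runtime shuffle statistics).

```python
# Illustrative AQE-style coalescing: merge adjacent small post-shuffle
# partitions until each bin approaches a target size, shrinking the
# task count. (A sketch, not Spark's exact rule.)

def coalesce(partition_sizes_mb, target_mb=64):
    bins, current = [], 0
    for size in partition_sizes_mb:
        current += size
        if current >= target_mb:
            bins.append(current)
            current = 0
    if current:
        bins.append(current)
    return bins

print(coalesce([5, 3, 60, 2, 4, 70, 1]))   # [68, 76, 1]: 7 tasks -> 3
```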

Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi

TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, and efficient upserts on object storage. Choose Delt...

23 min read
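How a transaction log turns immutable Parquet files into a table with atomic commits can be shown with a toy, Delta-flavored sketch; this is not the real Delta protocol, just the fold-the-actions idea (the file names and the add/remove schema are invented).

```python
import json
import os
import tempfile

# Toy transaction log: each commit is a numbered JSON file of actions;
# readers reconstruct the live file set by folding add/remove actions
# in version order. (Illustrative, not the Delta Lake protocol.)

def commit(log_dir: str, version: int, actions: list) -> None:
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:   # "x": fails if the version exists (atomicity)
        for a in actions:
            f.write(json.dumps(a) + "\n")

def snapshot(log_dir: str) -> set:
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                a = json.loads(line)
                if a["op"] == "add":
                    files.add(a["file"])
                elif a["op"] == "remove":
                    files.discard(a["file"])
    return files

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"op": "add", "file": "part-0.parquet"}])
commit(log_dir, 1, [{"op": "remove", "file": "part-0.parquet"},
                    {"op": "add", "file": "part-1.parquet"}])
print(snapshot(log_dir))   # only part-1.parquet is live after the upsert
```

Time travel falls out of the same structure: replay the fold only up to an earlier version number.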

Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming

TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabytes. Master partitioning, shuffle-awareness, and St...

18 min read

Medallion Architecture: Bronze, Silver, and Gold Layers in Practice

TLDR: Medallion Architecture solves the "data swamp" problem by organizing a data lake into three progressively refined zones — Bronze (raw, immutable), Silver (cleaned, conformed), Gold (aggregated, business-ready) — so teams always build on a trust...

22 min read
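A minimal sketch of the three layers, using plain Python records instead of Spark DataFrames (all field names, rules, and the revenue aggregate are illustrative):

```python
# Bronze: raw and immutable. Keep everything, even malformed rows.
bronze = [
    {"user": "a", "amount": "10.5", "ts": "2024-01-01"},
    {"user": "a", "amount": "oops", "ts": "2024-01-02"},  # malformed
    {"user": "b", "amount": "3.0",  "ts": "2024-01-02"},
]

def to_silver(rows):
    """Silver: validated and typed; bad rows are handled here, once."""
    out = []
    for r in rows:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except ValueError:
            pass  # in practice: quarantine to an error table, don't drop
    return out

def to_gold(rows):
    """Gold: a business-ready aggregate, e.g. revenue per user."""
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)   # {'a': 10.5, 'b': 3.0}
```

The point of the layering is that Gold consumers never re-implement the cleansing in to_silver; every downstream table builds on the same validated input.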

Kappa Architecture: Streaming-First Data Pipelines

TLDR: Kappa architecture replaces Lambda's batch + speed dual codebases with a single streaming pipeline backed by a replayable Kafka log. Reprocessing becomes replaying from offset 0. One codebase, no drift. Kappa is the right call when your t...

20 min read
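The "reprocessing is replaying from offset 0" claim can be demonstrated with a stand-in for the Kafka log (a plain Python list; the record schema and function names are illustrative):

```python
# Kappa's core idea in miniature: one processing function, applied both
# to live records and to a full replay of the retained log.

log = []   # replayable, append-only log (stand-in for a Kafka topic)

def process(state, record):
    """The single pipeline: the same code serves live and reprocessing."""
    state[record["key"]] = state.get(record["key"], 0) + record["value"]
    return state

# Live consumption builds state incrementally as records arrive...
state = {}
for r in [{"key": "x", "value": 1}, {"key": "y", "value": 2}]:
    log.append(r)
    state = process(state, r)

# ...and reprocessing is just replaying from offset 0 with the same code.
replayed = {}
for r in log:                 # offset 0 .. end
    replayed = process(replayed, r)

assert replayed == state      # no batch/speed drift, by construction
```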

Big Data 101: The 5 Vs, Ecosystem, and Why Scale Breaks Everything

TLDR: Traditional databases fail at big data scale for three concrete reasons — storage saturation, compute bottleneck, and write-lock contention. The 5 Vs (Volume, Velocity, Variety, Veracity, Value) frame what makes data "big." A layered ecosystem ...

20 min read

Lambda Architecture Pattern: Balancing Batch Accuracy with Streaming Freshness

TLDR: Lambda architecture is justified when replay correctness and sub-minute freshness are both non-negotiable despite dual-path complexity. It is a fit only when you need both low-latency views and deterministic recompute fro...

13 min read
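Lambda's serving-layer merge, the part that makes the dual pipelines pay off, can be sketched in a few lines (the view contents are illustrative):

```python
# Serving-layer merge: the batch view is authoritative up to its last
# run; the speed view fills in everything since. Reads combine both.

batch_view = {"x": 100, "y": 40}   # recomputed from the master dataset
speed_view = {"y": 3, "z": 7}      # incremental counts since that run

def serve(key):
    """Merged read: batch accuracy plus streaming freshness."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print([serve(k) for k in ("x", "y", "z")])   # [100, 43, 7]
```

The dual-codebase cost the TLDR warns about lives upstream of this merge: both views must implement the same counting logic, and keeping them in agreement is the hard part.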

Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh

TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how serving layers are materialized, and who owns quality over time. Lambda, Kappa, CDC, Medallion, and Data Mesh are patterns for makin...

15 min read

Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?

TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg) that bring SQL performance to raw storage — the ...

14 min read