Category: performance

12 articles across 7 sub-topics

Shuffles in Spark: Why groupBy Kills Performance

TLDR: A Spark shuffle is the most expensive operation in any distributed job — it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization barrier between every upstream and downstream stag...

29 min read

Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies

TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo — fast and uniform for well-distributed keys, catas...

23 min read

Caching and Persistence in Spark: Storage Levels and When to Use Them

TLDR: Calling cache() or persist() does not immediately store anything — Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up, LRU eviction silently drops or spills partitions. ...

22 min read

Broadcast Joins vs Sort-Merge Joins in Spark

📖 The 45-Minute Join Stage That Became 90 Seconds: A data engineering team at a retail company was running a nightly Spark job that joined their 500 GB transaction fact table against a 50 MB product dimension table. The job had been in production for...

23 min read

Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained

TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches tasks to Executors respecting data locality, and the...

26 min read

Spark Adaptive Query Execution: Dynamic Coalescing, Pruning, and Skew Handling

TLDR: Before AQE, Spark compiled your entire query into a static physical plan using size estimates that were frequently wrong — and a wrong estimate at planning time meant a skewed join, 800 small tasks, or a missed broadcast opportunity that no amo...

34 min read