Category
performance
12 articles across 7 sub-topics
Shuffles in Spark: Why groupBy Kills Performance
TLDR: A Spark shuffle is the most expensive operation in any distributed job — it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization barrier between every upstream and downstream stag...
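The routing step the TLDR describes can be sketched in a few lines of pure Python (an illustration of the mechanism, not Spark code; `shuffle` and the partition lists are made up for this sketch): every upstream partition must send each record to the downstream partition that owns its key, which is why all matching keys cross the network before any reduction can run.

```python
# Pure-Python sketch of shuffle routing (not Spark's implementation):
# each record is routed to hash(key) % num_downstream, so all records
# sharing a key converge on one downstream partition.

def shuffle(upstream_partitions, num_downstream):
    """Route every (key, value) record to its owning downstream partition."""
    downstream = [[] for _ in range(num_downstream)]
    for partition in upstream_partitions:
        for key, value in partition:
            downstream[hash(key) % num_downstream].append((key, value))
    return downstream

upstream = [
    [("a", 1), ("b", 2)],   # records produced on one executor
    [("a", 3), ("c", 4)],   # records produced on another executor
]
result = shuffle(upstream, num_downstream=2)
# Both "a" records now sit in the same downstream partition, no matter
# which executor produced them -- that convergence is the network cost.
```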
Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies
TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo — fast and uniform for well-distributed keys, catas...
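The two strategies contrasted above reduce to two routing functions, sketched here in pure Python (an illustration, not Spark's `HashPartitioner`/`RangePartitioner` code; the boundary values are made up): hash modulo scatters keys uniformly but ignores order, while range partitioning assigns each key to a bucket of sorted boundaries, so range scans touch few partitions.

```python
import bisect

# Pure-Python sketch of the two partitioning strategies.

def hash_partition(key, num_partitions):
    # Fast and uniform for well-distributed keys; a single hot key
    # always maps to the same partition, creating skew.
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    # boundaries are sorted upper bounds: [10, 20] defines 3 partitions
    # (-inf, 10), [10, 20), [20, +inf) -- order is preserved.
    return bisect.bisect_left(boundaries, key)

boundaries = [10, 20]
assert range_partition(5, boundaries) == 0
assert range_partition(15, boundaries) == 1
assert range_partition(25, boundaries) == 2
```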
Caching and Persistence in Spark: Storage Levels and When to Use Them
TLDR: Calling cache() or persist() does not immediately store anything — Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up, LRU eviction silently drops or spills partitions. ...
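The silent LRU eviction mentioned above can be sketched with an `OrderedDict` (a pure-Python illustration of the policy, not Spark's `BlockManager`; `LRUStore` and the block IDs are invented for this sketch): once the store is full, the least-recently-used block is dropped, and a later `get` for it simply misses.

```python
from collections import OrderedDict

# Pure-Python sketch of LRU block eviction.

class LRUStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # silently evict the LRU block

    def get(self, block_id):
        if block_id not in self.blocks:
            return None                      # evicted: must be recomputed
        self.blocks.move_to_end(block_id)    # touch: now most recently used
        return self.blocks[block_id]

store = LRUStore(capacity=2)
store.put("rdd_0_part_0", "A")
store.put("rdd_0_part_1", "B")
store.put("rdd_0_part_2", "C")   # capacity exceeded: part_0 is dropped
```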
Broadcast Joins vs Sort-Merge Joins in Spark
📖 The 45-Minute Join Stage That Became 90 Seconds A data engineering team at a retail company was running a nightly Spark job that joined their 500 GB transaction fact table against a 50 MB product dimension table. The job had been in production for...
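The win in that story comes from replacing a shuffle of both sides with a local hash lookup. A pure-Python sketch of the broadcast-join idea (an illustration, not Spark's API; `broadcast_join` and the sample tables are made up): copy the small dimension table to every partition as a dict, then join each fact partition independently, so the large side never moves.

```python
# Pure-Python sketch of a broadcast (map-side) join.

products = {"p1": "Coffee", "p2": "Tea"}          # small dimension table

def broadcast_join(fact_partition, dim_broadcast):
    # Each fact partition joins locally against the broadcast dict --
    # no shuffle, no sort, no network movement of the big side.
    return [
        (txn_id, dim_broadcast[product_id])
        for txn_id, product_id in fact_partition
        if product_id in dim_broadcast
    ]

partition = [(1, "p1"), (2, "p2"), (3, "p9")]     # one fact partition
joined = broadcast_join(partition, products)
```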
Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained
TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches tasks to Executors respecting data locality, and the...
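The "stages at shuffle boundaries" rule can be sketched in pure Python (an illustration of the idea, not Spark's `DAGScheduler`; the operator names and `split_into_stages` are invented for this sketch): narrow operations pipeline into one stage, and every wide (shuffle-requiring) operation opens a new one.

```python
# Pure-Python sketch of cutting a plan into stages at shuffle boundaries.

WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)   # shuffle boundary: close the stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

plan = ["read", "map", "filter", "reduceByKey", "map", "join", "save"]
stages = split_into_stages(plan)
# Three stages: narrow ops pipeline together, each wide op starts a new one.
```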
Spark Adaptive Query Execution: Dynamic Coalescing, Pruning, and Skew Handling
TLDR: Before AQE, Spark compiled your entire query into a static physical plan using size estimates that were frequently wrong — and a wrong estimate at planning time meant a skewed join, 800 small tasks, or a missed broadcast opportunity that no amo...
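The "800 small tasks" problem is what AQE's partition coalescing addresses; the idea can be sketched in pure Python (an illustration, not Spark's planner; `coalesce` and the sizes are made up): after a shuffle, merge adjacent small partitions until each group reaches a target size, using the *actual* runtime sizes rather than planning-time estimates.

```python
# Pure-Python sketch of post-shuffle partition coalescing.

def coalesce(partition_sizes_mb, target_mb):
    groups, current, current_size = [], [], 0
    for i, size in enumerate(partition_sizes_mb):
        current.append(i)
        current_size += size
        if current_size >= target_mb:   # group is big enough: close it
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)
    return groups

# 800 tiny 1 MB shuffle partitions collapse into ~64 MB task groups.
sizes = [1] * 800
groups = coalesce(sizes, target_mb=64)
```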

Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning
TLDR: Partitioning splits one logical table into smaller physical pieces called partitions. The database planner skips irrelevant partitions entirely — turning a 30-second full-table scan into a 200ms single-partition read. Range partitioning is best...
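The planner's partition skipping ("pruning") can be sketched in pure Python (an illustration, not any specific database's planner; the partition names and `prune` are invented for this sketch): given range-partitioned tables, keep only the partitions whose range can overlap the query predicate.

```python
import datetime

# Pure-Python sketch of partition pruning over range partitions.

partitions = {
    "orders_2023": (datetime.date(2023, 1, 1), datetime.date(2024, 1, 1)),
    "orders_2024": (datetime.date(2024, 1, 1), datetime.date(2025, 1, 1)),
}

def prune(partitions, query_start, query_end):
    """Keep only partitions whose [lo, hi) range overlaps the query range."""
    return [
        name for name, (lo, hi) in partitions.items()
        if lo < query_end and query_start < hi
    ]

# A one-month query touches a single partition, not the whole table.
hit = prune(partitions, datetime.date(2024, 3, 1), datetime.date(2024, 4, 1))
```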

Adapting to Virtual Threads for Spring Developers
TLDR: Platform threads (one OS thread per request) max out at a few hundred concurrent I/O-bound requests. Virtual threads (JDK 21+) allow millions — with zero I/O-blocking cost. Spring Boot 3.2 enables them with a single property. Avoid synchronized...
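The "single property" the TLDR refers to is, in Spring Boot 3.2 and later, the following (a minimal config fragment, assuming a standard `application.properties`):

```properties
# application.properties -- Spring Boot 3.2+, running on JDK 21+
spring.threads.virtual.enabled=true
```

With this set, Spring's request-handling and task-executor threads run as virtual threads instead of one OS thread per request.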
System Design: Complete Guide to Caching — Patterns, Eviction, and Distributed Strategies
TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through ARC), cache invalidation pitfalls, thundering her...
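The first pattern in that list, Cache-Aside, can be sketched in a few lines of pure Python (an illustration with an in-memory dict standing in for Redis or Memcached; `db_load` and the key format are invented for this sketch): the application checks the cache first and only falls through to the backing store on a miss, populating the cache on the way back.

```python
# Pure-Python sketch of the Cache-Aside read pattern.

cache = {}
db_reads = 0

def db_load(key):
    global db_reads
    db_reads += 1                 # count round trips to the "database"
    return f"row-for-{key}"       # stand-in for a real query

def get(key):
    if key in cache:              # cache hit: no database round trip
        return cache[key]
    value = db_load(key)          # cache miss: load from the source...
    cache[key] = value            # ...then populate the cache
    return value

get("user:42")   # miss: goes to the database
get("user:42")   # hit: served from the cache
```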
Little's Law: The Secret Formula for System Performance
TLDR: Little's Law ($L = \lambda W$) connects three metrics every system designer measures: $L$ = concurrent requests in flight, $\lambda$ = throughput (RPS), $W$ = average response time. If latency spikes, your concurrency requirement explodes with ...
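The explosion the TLDR warns about is plain arithmetic; a worked example with made-up numbers:

```python
# Little's Law: L = lambda * W
# At 1,000 requests/second and 200 ms average latency, the system holds
# 200 requests in flight; a latency spike multiplies that directly.

throughput_rps = 1_000        # lambda: throughput (requests per second)
latency_s = 0.2               # W: average time each request spends in system

in_flight = throughput_rps * latency_s     # L = lambda * W
assert in_flight == 200

# A 10x latency spike at constant throughput means 10x the concurrency:
# every thread pool, connection pool, and queue must absorb that.
spiked = throughput_rps * 2.0
assert spiked == 2_000
```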