Abstract Algorithms

An AI Powered Learning Platform

Topic

Apache Spark

16 articles across 12 sub-topics

Sub-topic

3 articles

Watermarking and Late Data Handling in Spark Structured Streaming

TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi

Apr 19, 2026•28 min read

Spark Structured Streaming: Micro-Batch vs Continuous Processing

📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth

Apr 19, 2026•27 min read

Stateful Aggregations in Spark Structured Streaming: mapGroupsWithState

TLDR: mapGroupsWithState gives each streaming key its own mutable state object, persisted in a fault-tolerant state store that checkpoints to object storage on every micro-batch. Where window aggregat

Apr 19, 2026•28 min read

Sub-topic

Big Data

3 articles

Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained

TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches ta

Apr 19, 2026•29 min read

Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi

TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, an

Mar 28, 2026•24 min read

Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming

TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabyt

Mar 28, 2026•19 min read

Sub-topic

Performance

1 article

Shuffles in Spark: Why groupBy Kills Performance

TLDR: A Spark shuffle is the most expensive operation in any distributed job — it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization

Apr 19, 2026•32 min read

Sub-topic

Parquet

1 article

Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC

TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics a

Apr 19, 2026•35 min read

Sub-topic

Partitioning

1 article

Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies

TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo —

Apr 19, 2026•26 min read

Sub-topic

Kubernetes

1 article

Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring

TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically scaled execution environment. The Kubeflow Spark Operator manages SparkApplication CRDs throug

Apr 19, 2026•37 min read

Sub-topic

Kafka

1 article

Kafka and Spark Structured Streaming: Building a Production Pipeline

📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea

Apr 19, 2026•24 min read

Sub-topic

Performance Tuning

1 article

Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy

TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM — they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver

Apr 19, 2026•37 min read

Sub-topic

DataFrames

1 article

Spark DataFrames and Spark SQL: Schema, DDL, and the Catalyst Optimizer

TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a

Apr 19, 2026•24 min read

Sub-topic

Caching

1 article

Caching and Persistence in Spark: Storage Levels and When to Use Them

TLDR: Calling cache() or persist() does not immediately store anything — Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up,

Apr 19, 2026•24 min read

Sub-topic

Joins

1 article

Broadcast Joins vs Sort-Merge Joins in Spark

📖 The 45-Minute Join Stage That Became 90 Seconds A data engineering team at a retail company was running a nightly Spark job that joined their 500 GB transaction fact table against a 50 MB product d

Apr 19, 2026•26 min read

Sub-topic

AQE

1 article

Spark Adaptive Query Execution: Dynamic Coalescing, Pruning, and Skew Handling

TLDR: Before AQE, Spark compiled your entire query into a static physical plan using size estimates that were frequently wrong — and a wrong estimate at planning time meant a skewed join, 800 small ta

Apr 19, 2026•39 min read