Abstract Algorithms

Start here

Apache Spark

Learn Apache Spark as a connected topic across chapters, concepts, simulations, and interview reasoning.


Begin with

Apache Spark for Data Engineers gives you the cleanest entry point before branching into constraints, failures, and related systems.

16 chapters · 10 concepts

Start With Apache Spark for Data Engineers

Grounding: Build the mental model.

Shape: See how the pieces depend on each other.

Consequence: Compare what improves and what breaks.

Stress: Change constraints and watch behavior.

Next: Move to the next useful edge.

Related systems

Follow the nearby ideas

Use the map as a quiet orientation layer, then move back into the articles for depth.

Guidance

Apache Spark

Continues from what you have already explored.

System behavior

HyperLogLog Cardinality Estimation

Hash values route into registers, leading-zero runs update maxima, and the harmonic mean estimates unique cardinality with bounded error.

[Diagram, step 1 of 3 (normal flow). Stages: Input Stream, Hash Function, Prefix Router, m Registers, Harmonic Mean, Cardinality Estimate. Edge labels: user id, item, prefix, bucket, max rho, estimate.]
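The HyperLogLog flow described above can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the class name `HyperLogLog`, the choice of SHA-1 as the hash, and the default of 2^12 registers are all assumptions made for the example. The bias constant and the small-range (linear counting) correction follow the standard published algorithm.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with m = 2**p registers (hypothetical class)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128 (p >= 7).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Assumption: take the first 64 bits of SHA-1 as the hash value.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The first p bits (the prefix) route the item to a register.
        bucket = h >> (64 - self.p)
        # The remaining 64 - p bits supply the leading-zero run.
        rest = h & ((1 << (64 - self.p)) - 1)
        # rho = number of leading zeros in the remaining bits, plus one.
        rho = (64 - self.p) - rest.bit_length() + 1
        # Each register keeps the maximum rho it has seen.
        self.registers[bucket] = max(self.registers[bucket], rho)

    def estimate(self) -> float:
        # Harmonic mean of 2**register across all registers.
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Adding the same item twice never changes a register, so duplicates do not affect the estimate.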

Read in sequence

1. Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming (19 min)
   TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabyt…

2. Watermarking and Late Data Handling in Spark Structured Streaming (27 min)
   TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi…

3. Spark Structured Streaming: Micro-Batch vs Continuous Processing (27 min)
   📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming. A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth…

4. Stateful Aggregations in Spark Structured Streaming: mapGroupsWithState (28 min)
   TLDR: mapGroupsWithState gives each streaming key its own mutable state object, persisted in a fault-tolerant state store that checkpoints to object storage on every micro-batch. Where window aggregat…

5. Shuffles in Spark: Why groupBy Kills Performance (31 min)
   TLDR: A Spark shuffle is the most expensive operation in any distributed job: it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization…

6. Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC (34 min)
   TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics a…

7. Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies (26 min)
   TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo…

8. Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring (36 min)
   TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically-scaled execution environment. The kubeflow Spark Operator manages SparkApplication CRDs throug…

9. Kafka and Spark Structured Streaming: Building a Production Pipeline (23 min)
   📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart. An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea…

10. Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy (37 min)
    TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM. They are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver…

11. Spark DataFrames and Spark SQL: Schema, DDL, and the Catalyst Optimizer (24 min)
    TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a…

12. Caching and Persistence in Spark: Storage Levels and When to Use Them (24 min)
    TLDR: Calling cache() or persist() does not immediately store anything. Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up, …

Showing the top 12 of 16 matching chapters.
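The partitioning chapter's summary mentions that HashPartitioner distributes keys by hash modulo. A rough pure-Python sketch of that idea follows. It is an illustration only: `partition_for` and `partition_sizes` are hypothetical helper names, and `zlib.crc32` stands in for the JVM `hashCode` that Spark actually uses, so the partition numbers here will not match a real Spark job.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Assumption: crc32 is a stand-in for the key's JVM hashCode.
    # It is non-negative, so the modulo result is a valid partition index.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions: int):
    # Count how many records land in each partition.
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


if __name__ == "__main__":
    # One hot key plus 100 distinct keys, spread over 8 partitions.
    keys = ["merchant-1"] * 900 + [f"merchant-{i}" for i in range(2, 102)]
    print(partition_sizes(keys, 8))
```

Because every record with the same key maps to the same partition, a single hot key concentrates its records in one partition no matter how many partitions exist. In this example at least 900 of the 1,000 records end up together.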

Abstract Algorithms · © 2026 · Engineering learning lab