Abstract Algorithms


Apache Spark

Learn Apache Spark as a connected topic across chapters, concepts, simulations, and interview reasoning.


Begin with

Apache Spark for Data Engineers gives you the cleanest entry point before branching into constraints, failures, and related systems.


Start With Apache Spark for Data Engineers

Grounding: Build the mental model.

Shape: See how the pieces depend on each other.

Consequence: Compare what improves and what breaks.

Stress: Change constraints and watch behavior.

Move to the next useful edge.



System behavior

HyperLogLog Cardinality Estimation

Hash values route into registers, leading-zero runs update maxima, and the harmonic mean estimates unique cardinality with bounded error.
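
The register update and harmonic-mean estimate described above can be sketched in plain Python. This is a minimal illustration, not the code behind the simulation: the class name HyperLogLog, the precision parameter p, the choice of SHA-1 truncated to 64 bits as the hash, and the small-range linear-counting fallback are all assumptions made for the sketch.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not a library API)."""

    def __init__(self, p=10):
        self.p = p                      # number of hash bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: SHA-1 truncated to 64 bits stands in for the hash function.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                    # top p bits route to a register
        rest = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = number of leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank              # keep the per-register maximum

    def estimate(self):
        # Harmonic mean of 2^register across all registers, scaled by alpha * m^2.
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for i in range(10000):
        hll.add(f"user-{i}")
    print(round(hll.estimate()))
```

With p = 10 the sketch keeps 1,024 registers, and the expected relative error is roughly 1.04 / sqrt(1024), or about 3 percent. Adding the same item again never changes a register, which is what makes the structure count unique values.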


Read in sequence

1. Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming (19 min)
   TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabyt...

2. Watermarking and Late Data Handling in Spark Structured Streaming (27 min)
   TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi...

3. Spark Structured Streaming: Micro-Batch vs Continuous Processing (27 min)
   📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming. A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth...

4. Stateful Aggregations in Spark Structured Streaming: mapGroupsWithState (28 min)
   TLDR: mapGroupsWithState gives each streaming key its own mutable state object, persisted in a fault-tolerant state store that checkpoints to object storage on every micro-batch. Where window aggregat...

5. Shuffles in Spark: Why groupBy Kills Performance (31 min)
   TLDR: A Spark shuffle is the most expensive operation in any distributed job: it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization...

6. Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC (34 min)
   TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics a...

7. Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies (26 min)
   TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo...

8. Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring (36 min)
   TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically scaled execution environment. The Kubeflow Spark Operator manages SparkApplication CRDs throug...

9. Kafka and Spark Structured Streaming: Building a Production Pipeline (23 min)
   📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart. An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea...

10. Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy (37 min)
    TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM; they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver...

11. Spark DataFrames and Spark SQL: Schema, DDL, and the Catalyst Optimizer (24 min)
    TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a...

12. Caching and Persistence in Spark: Storage Levels and When to Use Them (24 min)
    TLDR: Calling cache() or persist() does not immediately store anything: Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up, ...

Showing the top 12 of 16 matching chapters.
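
The hash-modulo idea named in chapter 7 can be illustrated without a cluster. The sketch below is plain Python, not Spark's implementation: hash_partition and partition_sizes are hypothetical helper names, and zlib.crc32 is an assumed stand-in for the JVM hashCode that Spark's HashPartitioner applies to keys. It shows that every record with the same key lands in the same partition, which is why a single hot key can skew a job.

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions) by hash modulo.

    Assumption: crc32 is used only because it is deterministic across runs;
    Spark itself hashes keys on the JVM.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records each partition would receive."""
    counts = Counter(hash_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


if __name__ == "__main__":
    # 1,000 records share one hot key; 1,000 more have distinct keys.
    keys = ["hot"] * 1000 + [f"user-{i}" for i in range(1000)]
    sizes = partition_sizes(keys, num_partitions=8)
    # One partition holds at least the 1,000 hot-key records.
    print(sizes, "largest partition:", max(sizes))
```

Raising the partition count spreads the distinct keys more thinly but does nothing for the hot key, since identical keys always hash to the same index.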
