Topic

Intermediate

Learn Intermediate as a connected topic across chapters, concepts, simulations, and interview reasoning.

10 Concepts · 16 Articles · 7h 2m

How this topic helps

Apache Spark
Python
Performance
Structured Streaming

Learning Path in this Topic

Series that contain articles from Intermediate.

Articles

16 matched articles

Article 1
Watermarking and Late Data Handling in Spark Structured Streaming
TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi…
27 min

Article 2
Spark Structured Streaming: Micro-Batch vs Continuous Processing
📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming. A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth…
27 min

Article 3
Shuffles in Spark: Why groupBy Kills Performance
TLDR: A Spark shuffle is the most expensive operation in any distributed job: it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization…
31 min

Article 4
Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC
TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics a…
34 min

Article 5
Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies
TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo…
26 min

Article 6
Kafka and Spark Structured Streaming: Building a Production Pipeline
📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart. An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea…
23 min

Page 1 of 3