Category
intermediate
15 articles across 7 sub-topics
Watermarking and Late Data Handling in Spark Structured Streaming
TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global minimum across all partitions, subtracts the thresho...
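The watermark arithmetic the TLDR describes can be sketched in a few lines of plain Python (the partition maxima and threshold below are hypothetical, not Spark API calls):

```python
from datetime import datetime, timedelta

def compute_watermark(max_event_time_per_partition, threshold):
    # Watermark = (minimum over partitions of the max event time seen) - threshold.
    # The slowest partition gates the whole query.
    return min(max_event_time_per_partition) - threshold

# Hypothetical per-partition maxima and a 10-minute lateness threshold.
maxima = [
    datetime(2024, 1, 1, 12, 30),
    datetime(2024, 1, 1, 12, 25),  # slowest partition
    datetime(2024, 1, 1, 12, 40),
]
watermark = compute_watermark(maxima, timedelta(minutes=10))  # 12:15

def is_too_late(event_time, watermark):
    # Events with timestamps older than the watermark are dropped.
    return event_time < watermark
```

An event stamped 12:14 would be discarded, while one stamped 12:16 is still within the accepted lateness window.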
Spark Structured Streaming: Micro-Batch vs Continuous Processing
📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth of transaction events from S3, scores them agains...
Shuffles in Spark: Why groupBy Kills Performance
TLDR: A Spark shuffle is the most expensive operation in any distributed job — it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization barrier between every upstream and downstream stag...
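The core of why a shuffle moves every matching key can be shown with a toy map-side bucketing sketch (plain Python standing in for Spark's shuffle write; bucket contents would become sorted spill files sent over the network):

```python
from collections import defaultdict

def shuffle_write(records, num_partitions):
    # Map side of a shuffle: route each (key, value) to a reduce bucket
    # by hash(key) % num_partitions. All records sharing a key land in
    # the same bucket, which is why groupBy must move them to one machine.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
buckets = shuffle_write(records, num_partitions=4)
```

Because every downstream task needs its complete bucket before it can start, the shuffle also acts as the synchronization barrier the TLDR mentions.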
Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC
TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics and time travel. JSON and CSV read every byte becau...
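The row-group pruning behind that 500 GB to 8 GB reduction can be sketched with hypothetical footer statistics (the dict layout below is illustrative, not the Parquet format itself):

```python
# Hypothetical per-row-group footer stats: (min, max) of one column.
row_groups = [
    {"min": 0,   "max": 99,  "rows": 1_000_000},
    {"min": 100, "max": 499, "rows": 1_000_000},
    {"min": 500, "max": 999, "rows": 1_000_000},
]

def groups_to_scan(row_groups, lo, hi):
    # Keep only row groups whose [min, max] range could contain a value
    # in [lo, hi]; the rest are skipped without reading any data pages.
    return [g for g in row_groups if g["max"] >= lo and g["min"] <= hi]

scanned = groups_to_scan(row_groups, lo=120, hi=130)
```

Only the middle row group survives the predicate, so two thirds of this toy file is never read; JSON and CSV have no such statistics, which is why they scan every byte.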
Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies
TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo — fast and uniform for well-distributed keys, catas...
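The two strategies reduce to two small functions, sketched here in plain Python (the split points are hypothetical; Spark's RangePartitioner derives them by sampling the data):

```python
import bisect

def hash_partition(key, num_partitions):
    # HashPartitioner: hash modulo. Fast and uniform for well-spread keys,
    # but every row for one hot key still lands in a single partition.
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    # RangePartitioner sketch: sorted split points sampled from the key
    # distribution; bisect finds which contiguous key range a record joins.
    return bisect.bisect_right(boundaries, key)

# Hypothetical split points dividing numeric keys into 4 partitions.
boundaries = [100, 500, 900]
```

Range partitioning keeps keys ordered across partitions (a prerequisite for sorted output), at the cost of a sampling pass to choose the boundaries.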
Kafka and Spark Structured Streaming: Building a Production Pipeline
📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The team starts with a straightforward approach: a hand-r...
Spark DataFrames and Spark SQL: Schema, DDL, and the Catalyst Optimizer
TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of algebraic rewrite rules to produce an optimized log...
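The "library of algebraic rewrite rules" idea can be illustrated with one toy rule applied to a toy expression tree; this is a conceptual stand-in for Catalyst's logical-plan nodes, not Spark code:

```python
from dataclasses import dataclass

@dataclass
class Lit:          # literal leaf node
    value: int

@dataclass
class Add:          # binary expression node
    left: object
    right: object

def constant_fold(node):
    # One rewrite rule, applied bottom-up: Add(Lit, Lit) -> Lit.
    # Catalyst applies many such rules repeatedly until the plan
    # stops changing (a fixed point).
    if isinstance(node, Add):
        left, right = constant_fold(node.left), constant_fold(node.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return node

folded = constant_fold(Add(Lit(1), Add(Lit(2), Lit(3))))
```

The same pattern-match-and-rewrite shape underlies real Catalyst rules like predicate pushdown and column pruning, just over richer plan nodes.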
Caching and Persistence in Spark: Storage Levels and When to Use Them
TLDR: Calling cache() or persist() does not immediately store anything — Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up, LRU eviction silently drops or spills partitions. ...
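The lazy-fill-then-LRU-evict behavior can be sketched with a toy block store (a conceptual stand-in for the per-executor BlockManager, with made-up block IDs):

```python
from collections import OrderedDict

class BlockStore:
    # Toy BlockManager: caches a partition only when it is first computed,
    # and silently evicts the least-recently-used block when full.
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()
        self.evicted = []

    def get_or_compute(self, block_id, compute):
        if block_id in self.blocks:              # cache hit
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        value = compute()                        # first action computes it
        self.blocks[block_id] = value
        if len(self.blocks) > self.capacity:     # silent LRU eviction
            victim, _ = self.blocks.popitem(last=False)
            self.evicted.append(victim)
        return value

store = BlockStore(capacity_blocks=2)
for pid in ["rdd_0_0", "rdd_0_1", "rdd_0_2"]:
    store.get_or_compute(pid, lambda: b"partition-bytes")
```

After the loop, `rdd_0_0` has been evicted to make room, so touching it again would silently trigger recomputation, which is exactly the surprise the TLDR warns about.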
Broadcast Joins vs Sort-Merge Joins in Spark
📖 The 45-Minute Join Stage That Became 90 Seconds A data engineering team at a retail company was running a nightly Spark job that joined their 500 GB transaction fact table against a 50 MB product dimension table. The job had been in production for...
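The fix in this story, a broadcast join, amounts to shipping the small table to every task as a hash map so the big table never shuffles; a plain-Python sketch with made-up data:

```python
# Toy map-side (broadcast) join. The small dimension table becomes a
# hash map copied to every task; the large fact table is joined locally,
# row by row, with no shuffle of the 500 GB side.
products = {"p1": "widgets", "p2": "gadgets"}            # small side
transactions = [("p1", 10.0), ("p2", 3.5), ("p1", 7.25)]  # large side

joined = [
    (product_id, amount, products[product_id])
    for product_id, amount in transactions
    if product_id in products                             # inner join
]
```

A sort-merge join would instead have to sort and shuffle both sides by key, which is where the original 45 minutes went.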
Python OOP: Classes, Dataclasses, and Dunder Methods
📖 Why Every Java Developer Writes Un-Pythonic Classes on Day One Imagine a developer — let's call him Daniel — who has written Java for six years. He sits down to write his first Python class and produces this: class BankAccount: def __init__(se...
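A hedged sketch of the Pythonic alternative this teaser builds toward: a `@dataclass` generates the `__init__`, `__repr__`, and `__eq__` boilerplate that a Java-style class writes by hand (the field names here are illustrative, taken from the teaser's `BankAccount`):

```python
from dataclasses import dataclass

@dataclass
class BankAccount:
    # @dataclass auto-generates __init__, __repr__, and __eq__
    # from these field declarations.
    owner: str
    balance: float = 0.0

    def deposit(self, amount: float) -> None:
        self.balance += amount

acct = BankAccount("Daniel")
acct.deposit(50.0)
```

Equality comes for free: two accounts with the same field values compare equal without a hand-written `equals`/`hashCode` pair.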
Decorators Explained: From Functions to Frameworks
📖 The Copy-Paste Crisis: When Timing Code Invades Twenty Functions Sofia is three months into her first Python backend role. The team runs a performance review and discovers the data-processing API is slow. The tech lead asks her to add timing instr...
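The decorator that resolves this kind of copy-paste crisis can be sketched in a few lines; the `timed` name and the `last_elapsed` attribute are illustrative choices, not from the article:

```python
import functools
import time

def timed(func):
    # Factors the copy-pasted timing code into one reusable wrapper.
    @functools.wraps(func)  # preserves func.__name__ and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Stash the duration on the wrapper for callers to inspect.
            wrapper.last_elapsed = time.perf_counter() - start
    return wrapper

@timed
def process(n):
    return sum(range(n))

result = process(10_000)
```

One decorator line per function replaces twenty hand-pasted timing blocks, and `functools.wraps` keeps introspection (names, docstrings) intact.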
Async Python: asyncio, Coroutines, and Event Loops Without the Confusion
📖 The 500-Second Problem: What Cooperative Multitasking Actually Fixes Suppose your monitoring pipeline checks the health endpoint of 1,000 internal microservices. Each HTTP call takes about 500 milliseconds — network round-trip, DNS, TLS handshake,...
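The arithmetic in this teaser (1,000 calls x 500 ms = 500 seconds sequentially) is exactly what `asyncio.gather` collapses; a sketch with `asyncio.sleep` standing in for the hypothetical HTTP health check:

```python
import asyncio
import time

async def check_health(service_id):
    # Stand-in for an HTTP health check: the await yields control to the
    # event loop instead of blocking the thread during network I/O.
    await asyncio.sleep(0.01)  # pretend round-trip latency
    return service_id, "ok"

async def main():
    # All 1,000 checks are in flight concurrently, so total wall time is
    # roughly one round-trip, not 1,000 round-trips stacked end to end.
    return await asyncio.gather(*(check_health(i) for i in range(1000)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```

Sequentially these sleeps would take ten seconds; cooperatively scheduled, the whole batch finishes in a fraction of a second.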
