Category
apache spark
3 articles
Big Data Engineering: Your Complete Learning Roadmap
TLDR: 🗺️ You want to learn Big Data Engineering, but the ecosystem feels overwhelming. This roadmap breaks down 11 posts across 4 phases: Foundations → Architecture → Pipelines → Advanced. Start with the 5 Vs and Data Lakes, then tackle Lambda Archi...
18 min read

Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi
TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, an
27 min read
Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming
TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabytes. Master partitioning, shuffle-awareness, and St...
20 min read
