Start here
Probabilistic Data Structures
Learn sketches, approximate membership, cardinality estimation, error bounds, and when approximation beats exactness.
HashingCardinality EstimationApproximation ErrorRegistersStreaming DataHyperLogLog
Begin with
HyperLogLog gives you the cleanest entry point before branching into constraints, failures, and related systems.
12
Articles
10
Concepts
Relationships
Follow the shape of the system
Move through prerequisites, dependencies, tradeoffs, and adjacent concepts without losing the thread.
Guidance
Hashing
Continues from what you have already explored.
System behavior
HyperLogLog Cardinality Estimation
Hash values route into registers, leading-zero runs update maxima, and the harmonic mean estimates unique cardinality with bounded error.
Step 1 / 3Normal flow
Read in sequence
1Probabilistic Data Structures: A Practical Guide to Bloom Filters, HyperLogLog, and Count-Min SketchTLDR: Probabilistic data structures trade a small, bounded probability of being wrong for orders-of-magnitude better memory efficiency and O(1) speed. Bloom Filters answer "definitely not in this set"14 min2HyperLogLog Explained: Counting Billions of Unique Items with 12 KBTLDR: HyperLogLog estimates the number of distinct elements in a dataset using ~12 KB of memory regardless of cardinality — with ±0.81% error. The insight: if you hash every element to a random bit st18 min3Count-Min Sketch Explained: Frequency Estimation at Streaming ScaleTLDR: Count-Min Sketch (CMS) is a fixed-size d × w counter matrix that estimates how often any element has appeared in a stream. Insert: hash the element with each of the d hash functions to get one c22 min4Bloom Filters Explained: Membership Testing with Zero False NegativesTLDR: A Bloom filter is a bit array of m bits + k independent hash functions that sets k bits on insert and checks those same k bits on lookup. If any checked bit is 0, the element is definitely not i19 min5How Bloom Filters Work: The Probabilistic SetTLDR
TLDR: A Bloom Filter is a bit array + multiple hash functions that answers "Is X in the set?" in \(O(1)\) constant space. It can return false positives (say "yes" when the answer is "no") but ne13 min6Variational Autoencoders (VAE): The Art of Compression and CreationTLDR: A VAE learns to compress data into a smooth probabilistic latent space, then generate new samples by decoding random points from that space. The reparameterization trick is what makes it trainab13 min7What are Hash Tables? Basics ExplainedTLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handl12 min8NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split DataTLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node — virtual nod24 min9SQL Partitioning: Range, Hash, List, and Composite Strategies ExplainedTLDR: SQL partitioning divides one logical table into smaller physical child tables, all accessed through the parent table name. The query optimizer skips irrelevant child tables entirely — a process 25 min10Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom StrategiesTLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo —26 min11Python Data Structures: Lists, Dicts, Sets, and TuplesTLDR: Python's four built-in collections are not interchangeable — their internals are fundamentally different. list is a dynamic array: fast at the end, slow for membership. dict is a hash table: O(126 min12Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List PartitioningTLDR: Partitioning splits one logical table into smaller physical pieces. The database skips irrelevant pieces entirely — turning a 30-second full-table scan into a sub-second single-partition read. S12 min
Related threads

