Count-Min Sketch Explained: Frequency Estimation at Streaming Scale
How a d × w counter matrix answers 'how often has X appeared?' with bounded, one-sided (never underestimating) error and O(1) operations at any stream volume.
TLDR: Count-Min Sketch (CMS) is a fixed-size d × w counter matrix that estimates how often any element has appeared in a stream. Insert: hash the element with each of the d hash functions to get one column per row, then increment those d counters. Query: read the same d counters and return the minimum — an upper bound that is never an underestimate. With d = 7 rows and w = 2000 columns, a ~112 KB sketch (64-bit counters) delivers <1% error for hashtag frequency estimation across millions of events — versus gigabytes for exact counting. Twitter trending topics, Akamai heavy-hitter detection, and Apache Flink streaming analytics all use Count-Min Sketch.
📖 The Frequency Estimation Problem: Trending at Twitter Scale
Twitter's trending topics feature shows the hashtags gaining the most momentum in the last five minutes. At peak load, tens of millions of tweets arrive per minute. Every tweet may contain one or more hashtags. For each incoming tweet, the system must:
- Increment the frequency count for each hashtag in the tweet.
- Maintain a running top-K list of the most frequent hashtags in a sliding window.
An exact approach — a HashMap<String, Long> mapping each hashtag to its count — sounds straightforward. But the number of distinct hashtags active at any moment is enormous: millions of unique tags appear and vanish continuously. Maintaining exact counts for all of them requires unbounded memory that grows with every new hashtag ever observed. At Twitter's scale, that means gigabytes to terabytes of counter state, with write contention at the hottest hashtags creating lock bottlenecks.
Count-Min Sketch solves this with a fixed memory footprint that never grows. You choose the dimensions once at startup. The sketch answers "how often has #hashtag appeared?" with a bounded over-estimate — and you trade that small overestimation for constant memory, regardless of how many distinct elements have been seen.
This is the frequency estimation problem: given a high-velocity stream of elements, estimate how often any given element has appeared, using sub-linear memory.
🔍 The d × w Counter Matrix: One Row Per Hash Function
A Count-Min Sketch has two parameters:
- d (depth) — the number of rows. Each row has its own independent hash function.
- w (width) — the number of columns per row. This determines the counter resolution.
The data structure is simply a 2D array of integer counters, all initialized to zero:
counters[d][w] — all entries start at 0
Nothing else is stored. No keys, no element identifiers, no linked lists. The entire frequency history of a stream is encoded in d × w integers.
| Parameter | Role | Example |
| --- | --- | --- |
| d (rows / depth) | Controls error probability — more rows → lower collision probability | 7 |
| w (columns / width) | Controls error magnitude — more columns → smaller per-column collision rate | 2000 |
| Memory | d × w × 4 bytes (32-bit counters) | 7 × 2000 × 4 = 56 KB |
| Hash functions | One independent function per row | h_1 … h_7 |
A sketch with d = 7 and w = 2000 uses 56 KB of counter storage for a 32-bit counter array. Compare this to a HashMap tracking exact counts for 1 million distinct hashtags: at ~64 bytes per entry, that is 64 MB — more than 1000× larger.
⚙️ Insert and Query: The Minimum Across Rows Is Always an Upper Bound
Insert: d Increments, One Per Row
To record that element x appeared, optionally with a weight weight:
for i in 0..d:
col = h_i(x) mod w
counters[i][col] += weight
Every insert touches exactly d counter cells — one per row. The column in each row is determined by that row's independent hash function. Weight is typically 1 for event counting, but weighted inserts (e.g., for byte-counting in network traffic) are supported by using the byte count as the weight.
Query: Read d Counters, Return the Minimum
To estimate how often element x has appeared:
estimate = +∞
for i in 0..d:
col = h_i(x) mod w
estimate = min(estimate, counters[i][col])
return estimate
Why the minimum? Counters can only be over-estimated, never under-estimated. When element y hashes to the same column as x in row i, it inflates counters[i][col_x]. This counter now reports a value larger than x's true frequency. But it is unlikely that every row's counter for x is inflated simultaneously — each row uses an independent hash function, so collisions are independent events. Taking the minimum across all d rows exploits the row with the fewest collisions as the best available estimate.
The minimum is always ≥ the true frequency (never an underestimate), because counters are only ever incremented, never decremented. The sketch answers "how often has X appeared?" with the guarantee: estimate ≥ true_count.
A Worked Example: Inserting Three Hashtags
Consider a 3-row × 6-column sketch tracking three hashtags: #kubernetes, #docker, and #k8s.
Suppose the hash functions produce:
- #kubernetes: row 0 → col 2, row 1 → col 5, row 2 → col 1
- #docker: row 0 → col 2, row 1 → col 3, row 2 → col 4
- #k8s: row 0 → col 4, row 1 → col 5, row 2 → col 1
After inserting #kubernetes ×3, #docker ×5, and #k8s ×2:
| | Col 0 | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 |
| --- | --- | --- | --- | --- | --- | --- |
| Row 0 | 0 | 0 | 8 | 0 | 2 | 0 |
| Row 1 | 0 | 0 | 0 | 5 | 0 | 5 |
| Row 2 | 0 | 5 | 0 | 0 | 5 | 0 |
Note that in Row 0, #kubernetes (3) and #docker (5) collide at col 2 — both increment the same counter, giving 8. Querying #kubernetes:
- Row 0: counter[0][2] = 8 ← inflated by #docker collision
- Row 1: counter[1][5] = 5 ← inflated by #k8s collision
- Row 2: counter[2][1] = 5 ← inflated by #k8s collision
min(8, 5, 5) = 5. The true count is 3. The estimate of 5 is an overestimate, but it is the best available estimate given the collision pattern. A real sketch with d = 7 and w = 2000 reduces such collisions dramatically.
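The worked example above can be reproduced in a few lines of Java. The hash mappings are hardcoded to match the table (a real sketch derives them from hash functions), so everything here is purely illustrative:

```java
import java.util.Map;

public class WorkedExample {
    // Hardcoded column mappings from the worked example above —
    // these stand in for real hash functions, purely for illustration.
    static final Map<String, int[]> COLS = Map.of(
        "#kubernetes", new int[]{2, 5, 1},
        "#docker",     new int[]{2, 3, 4},
        "#k8s",        new int[]{4, 5, 1}
    );
    static final long[][] counters = new long[3][6]; // 3 rows x 6 columns

    static void insert(String tag, long weight) {
        int[] cols = COLS.get(tag);
        for (int row = 0; row < 3; row++) counters[row][cols[row]] += weight;
    }

    static long estimate(String tag) {
        int[] cols = COLS.get(tag);
        long min = Long.MAX_VALUE;
        for (int row = 0; row < 3; row++) min = Math.min(min, counters[row][cols[row]]);
        return min;
    }

    public static void main(String[] args) {
        insert("#kubernetes", 3);
        insert("#docker", 5);
        insert("#k8s", 2);
        // Row 0, col 2 holds 3 + 5 = 8 (collision); rows 1 and 2 hold 5 each.
        System.out.println(estimate("#kubernetes")); // prints 5 — true count is 3
    }
}
```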
🧠 Deep Dive: Error Bounds, Heavy Hitters, and Performance
The Internals: Hash Independence and Counter Width
The error guarantee of Count-Min Sketch depends on the hash functions being pairwise independent — knowing that element x hashes to column c in row i tells you nothing about where it hashes in row j. If hash functions are correlated, the collision events across rows become correlated too, and the minimum-across-rows estimate loses its tightness.
In practice, row-wise hash functions are often derived by the formula:
h_i(x) = ((a_i * x + b_i) mod prime) mod w
where a_i and b_i are distinct random coefficients for each row chosen at construction time. This universal hashing family provides the required pairwise independence at low computational cost.
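In Java, such a per-row family can be sketched as follows. The prime matches the constant used in the implementation later in this post, the coefficient ranges are illustrative, and the double-mod exists because Java's `%` can return a negative value:

```java
import java.util.Random;

public class RowHashes {
    // Same prime as the CountMinSketch implementation in this post; illustrative choice.
    static final long PRIME = 2_038_074_743L;
    final long[] a, b;
    final int width;

    RowHashes(int depth, int width, long seed) {
        this.width = width;
        Random rand = new Random(seed);
        a = new long[depth];
        b = new long[depth];
        for (int i = 0; i < depth; i++) {
            a[i] = 1L + rand.nextInt(Integer.MAX_VALUE); // a_i != 0
            b[i] = rand.nextInt(Integer.MAX_VALUE);
        }
    }

    /** Column for element hash x in row i. The double-mod keeps the
     *  intermediate value in [0, PRIME) before reducing mod width. */
    int column(int i, int x) {
        long h = ((a[i] * x + b[i]) % PRIME + PRIME) % PRIME;
        return (int) (h % width);
    }

    public static void main(String[] args) {
        RowHashes rh = new RowHashes(7, 2000, 42L);
        int x = "#kubernetes".hashCode();
        for (int i = 0; i < 7; i++)
            System.out.println("row " + i + " -> col " + rh.column(i, x));
    }
}
```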
Counter width must accommodate the maximum possible frequency. For event counting with a 32-bit counter, the maximum stream size before overflow is 2^32 ≈ 4 billion events — sufficient for most windowed streaming contexts. For larger streams or weighted inserts, 64-bit counters are used at twice the memory cost.
Mathematical Model: Error Bounds and Sketch Sizing
The CMS error guarantee: with probability at least 1 − δ, the estimate for any element is within ε × n of the true frequency, where n is the total number of insertions:
P(estimate ≤ true_count + ε × n) ≥ 1 − δ
Solving for sketch dimensions:
w = ⌈e / ε⌉ — width from target error fraction ε ("epsilon")
d = ⌈ln(1 / δ)⌉ — depth from target failure probability δ ("delta")
Where e ≈ 2.718 is Euler's number. Worked example: target error ε = 0.005 (0.5% of total stream volume) with failure probability δ = 0.001 (0.1%):
w = ⌈e / 0.005⌉ = ⌈543.7⌉ = 544 columns
d = ⌈ln(1 / 0.001)⌉ = ⌈6.9⌉ = 7 rows
Memory = 544 × 7 × 4 bytes = 15,232 bytes ≈ 15 KB
15 KB for 0.5% error on any frequency query, with 99.9% confidence. An exact counter for 1 million distinct elements at 8 bytes per long counter: 8 MB — more than 500× larger.
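As a sanity check, the sizing arithmetic can be expressed as a tiny helper (a sketch; the method names are ours):

```java
public class CmsSizing {
    /** Width from target additive error fraction epsilon: w = ceil(e / eps). */
    static int width(double eps) { return (int) Math.ceil(Math.E / eps); }

    /** Depth from target failure probability delta: d = ceil(ln(1 / delta)). */
    static int depth(double delta) { return (int) Math.ceil(Math.log(1.0 / delta)); }

    public static void main(String[] args) {
        int w = width(0.005);           // 544 columns
        int d = depth(0.001);           // 7 rows
        long bytes = (long) w * d * 4;  // 32-bit counters
        System.out.printf("w=%d d=%d memory=%d bytes%n", w, d, bytes);
        // prints w=544 d=7 memory=15232 bytes
    }
}
```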
The error guarantee is additive, not multiplicative: the error is at most ε × n regardless of the element's frequency. This means rare elements may have larger relative error (their count might be inflated from 1 to 100 by collisions with frequent elements), while frequent elements (counts in the millions) see only a small relative error.
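The sizing formulas follow from a short Markov-inequality argument. For a fixed element x and row i, the "excess" in x's counter is the total weight of other elements colliding into the same column; each of the other insertions lands there with probability 1/w:

```latex
\mathbb{E}[\mathrm{excess}_i] \;\le\; \frac{n}{w}

P\!\left(\mathrm{excess}_i > \varepsilon n\right)
  \;\le\; \frac{\mathbb{E}[\mathrm{excess}_i]}{\varepsilon n}
  \;\le\; \frac{1}{w\varepsilon}
  \;\le\; \frac{1}{e}
\qquad \text{(Markov's inequality, then } w = \lceil e/\varepsilon \rceil\text{)}

P\!\left(\hat{f}_x > f_x + \varepsilon n\right)
  \;\le\; e^{-d} \;\le\; \delta
\qquad \text{(all } d \text{ independent rows must overshoot; } d = \lceil \ln(1/\delta) \rceil\text{)}
```

The minimum across rows fails only when every row's counter is inflated past the bound, which is why independent hash functions per row drive the failure probability down exponentially in d.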
Performance Analysis: Time and Space Complexity
| Operation | Time complexity | Notes |
| --- | --- | --- |
| Insert | O(d) | d hash computations + d counter increments |
| Query | O(d) | d hash computations + d counter reads + minimum |
| Space | O(d × w) | Fixed at construction; never grows |
With d = 7 and w = 2000, every insert touches 7 counter cells. All 7 cells fit in a small number of cache lines. Insert throughput on modern hardware exceeds 10 million operations per second for string keys. The fixed-size counter array is the key advantage: a HashMap must resize and rehash as the number of distinct elements grows, creating unpredictable latency spikes at high throughput. The Count-Min Sketch never resizes.
Heavy Hitters Detection with Count-Min Sketch
Heavy hitters (also called "top-K" or "elephants") are elements that appear more than n/k times in a stream of n events — the elements responsible for a disproportionate fraction of the traffic. Identifying heavy hitters is a common streaming problem: which hashtags are trending? Which source IPs are generating unusual traffic? Which products are experiencing demand spikes?
Count-Min Sketch enables efficient heavy hitter detection via a heap + sketch pattern:
- On each insert, query the sketch for the updated estimate of that element.
- If the estimate exceeds the heavy-hitter threshold (n/k for top-K), add the element to a min-heap of size K.
- Periodically evict elements from the heap whose sketch estimate has dropped below the threshold.
This pattern finds heavy hitters in O(d + log K) per event — fast enough for millions of events per second — with a fixed memory footprint of O(d × w + K).
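The pattern above can be sketched end to end. This is a simplified illustration — the sketch uses toy coefficients rather than a properly random hash family, and the stream contents are invented — but the threshold-crossing and eviction logic follows the steps listed:

```java
import java.util.*;

public class HeavyHitterDemo {
    // Minimal Count-Min Sketch with toy coefficients — illustration only.
    static final int D = 5, W = 256;
    static final long PRIME = 2_038_074_743L;
    static final long[] A = {3, 5, 7, 11, 13}, B = {17, 19, 23, 29, 31};
    final long[][] counters = new long[D][W];

    int col(int row, int hash) {
        long h = ((A[row] * hash + B[row]) % PRIME + PRIME) % PRIME;
        return (int) (h % W);
    }

    /** Increment the element's d counters and return the updated estimate. */
    long insertAndEstimate(String x) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < D; r++) {
            int c = col(r, x.hashCode());
            counters[r][c]++;
            min = Math.min(min, counters[r][c]);
        }
        return min;
    }

    long estimate(String x) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < D; r++) min = Math.min(min, counters[r][col(r, x.hashCode())]);
        return min;
    }

    public static void main(String[] args) {
        HeavyHitterDemo cms = new HeavyHitterDemo();
        int k = 3; // heavy hitter = more than n/k of the stream
        Map<String, Long> candidates = new HashMap<>(); // element -> estimate when it crossed

        List<String> stream = new ArrayList<>();
        for (int i = 0; i < 500; i++) stream.add("#kubernetes"); // heavy hitter (50%)
        for (int i = 0; i < 300; i++) stream.add("#docker");     // below threshold (30%)
        for (int i = 0; i < 200; i++) stream.add("#tag" + i);    // 200 one-off tags

        long n = 0;
        for (String tag : stream) {
            n++;
            long est = cms.insertAndEstimate(tag);
            if (est > n / k) candidates.put(tag, est); // crossed the running threshold
        }
        // Evict candidates whose current estimate fell below the final threshold n/k
        final long threshold = stream.size() / k;
        candidates.keySet().removeIf(tag -> cms.estimate(tag) <= threshold);

        // #kubernetes is guaranteed to survive: its estimate is >= 500 > 1000/3
        System.out.println(candidates.keySet());
    }
}
```

A production version would use a bounded min-heap keyed by estimate instead of an unbounded candidate map, but the sketch-query-then-threshold-check structure is the same.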
📊 Visualizing Count-Min Sketch Insert and Query
The following diagram traces the complete insert-and-query lifecycle for a hashtag event, showing how each row's hash function maps the element to an independent column, why collisions can inflate individual row counters, and why the minimum across all rows gives the best available estimate.
flowchart TD
A(["Incoming event: tweet contains #kubernetes"]) --> B["Hash #kubernetes with each of d=7 hash functions"]
B --> C["Row 0: h0('#kubernetes') → col 2\nRow 1: h1('#kubernetes') → col 5\n...\nRow 6: h6('#kubernetes') → col 1"]
C --> D["Increment counters[0][2], counters[1][5], ..., counters[6][1]"]
D --> E(["Insert complete — d=7 counter cells updated"])
A2(["Query: how often has #kubernetes appeared?"]) --> F["Hash #kubernetes with same d=7 hash functions"]
F --> G["Read counters[0][2], counters[1][5], ..., counters[6][1]"]
G --> H{"Some rows may have collisions\nfrom other elements hashing to same column"}
H -->|"collision in row i"| I["counter[i][col] is inflated\nreports count higher than true frequency"]
H -->|"no collision in row j"| J["counter[j][col] equals or approximates true frequency"]
I --> K["Return minimum across all 7 row values"]
J --> K
K --> L(["Estimated frequency — always ≥ true count\nnever an underestimate"])
style E fill:#3b82f6,color:#fff
style L fill:#22c55e,color:#fff
style I fill:#f59e0b,color:#fff
The minimum-across-rows selection is the probabilistic heart of Count-Min Sketch: each row's independent hash function randomizes collision patterns, so a counter that is inflated in row i is unlikely to also be inflated in all other rows simultaneously. The minimum captures the row least affected by collisions.
🌍 Production Deployments: Where Count-Min Sketch Runs in Industry
Twitter/X trending topics. CMS tracks hashtag occurrence frequency in sliding five-minute windows. The top-K hashtags in the trending sidebar come from Count-Min Sketch estimates, not exact counts. Exact counting at Twitter's scale — tens of millions of tweets per minute, with millions of distinct active hashtags — would require hundreds of gigabytes of counter state per analytics tier. The CMS sketch for the same workload fits in a few hundred kilobytes.
Akamai CDN heavy-hitter identification. Each edge node uses Count-Min Sketch to identify "heavy hitter" URLs — pages generating disproportionately high traffic — in megabyte-scale memory per node. Identified heavy hitters are then pre-cached aggressively. The alternative — maintaining exact per-URL counters for every distinct URL served — is impractical at CDN edge scale where a single node may serve millions of distinct URLs per hour.
Apache Flink streaming analytics. Flink streaming jobs use Count-Min Sketch for approximate top-K queries over unbounded streams: a fixed-size sketch is maintained per window, emitting updated top-K estimates on each window close without the memory unboundedness of exact counting.
Kafka Streams. Applications built on Kafka Streams use CMS for real-time frequency analytics — detecting spike events, identifying anomalous consumers, or tracking message distribution across topics — all within the stream-processing application's own JVM, without external state stores.
Network intrusion detection. High-throughput packet processors use Count-Min Sketch to identify source IP addresses sending anomalous packet volumes — a classic heavy-hitter problem. At line rate (millions of packets per second), maintaining a counter per source IP would consume gigabytes for the internet's IP address space. A CMS sketch with d = 7, w = 65536 uses 1.8 MB and delivers sub-1% error on the heaviest talkers.
⚖️ Trade-offs and Failure Modes
Always overestimates, never underestimates. The minimum-across-rows guarantee means estimates are always ≥ the true count. Whether that is acceptable depends on which direction of error hurts: slightly inflated counts are harmless when ranking trending topics, but for rate limiting, overcounting means blocking legitimate requests — there, an overestimate-only structure is the wrong fit.
Tail-item inflation from heavy hitter collisions. If one element appears extremely frequently (a global trending hashtag), it inflates many column counters. Elements that collide with a heavy hitter in any row may have their estimates inflated far above their true counts. Mitigation: use a wider sketch (w proportional to 1/ε) to reduce collision probability; pair with a separate exact counter for items above a confirmed heavy-hitter threshold.
No deletion. Counters are increment-only. If you need to remove an element from the frequency estimate (e.g., sliding-window semantics where old events expire), a plain Count-Min Sketch cannot decrement. Mitigation: use sliding-window CMS (maintain multiple sketches for time buckets; subtract the oldest bucket from the combined estimate) or the Count sketch variant (which supports decrement but has weaker error guarantees).
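The per-bucket mitigation can be sketched concretely. The following is a minimal illustration — toy hash coefficients, and a ring of bucket sketches where the window estimate sums each active bucket's minimum (still an upper bound, since each bucket's minimum is an upper bound on that bucket's true count):

```java
import java.util.Arrays;

/** Sliding-window frequency estimation from per-bucket sketches (illustrative). */
public class SlidingWindowCms {
    static final int D = 4, W = 512;
    static final long PRIME = 2_038_074_743L;
    static final long[] A = {3, 5, 7, 11}, B = {13, 17, 19, 23}; // toy coefficients

    final long[][][] buckets;   // [bucket][row][col]
    int newest = 0;

    SlidingWindowCms(int numBuckets) { buckets = new long[numBuckets][D][W]; }

    static int col(int row, int hash) {
        long h = ((A[row] * hash + B[row]) % PRIME + PRIME) % PRIME;
        return (int) (h % W);
    }

    /** Record an event in the current (newest) bucket. */
    void insert(String x) {
        for (int r = 0; r < D; r++) buckets[newest][r][col(r, x.hashCode())]++;
    }

    /** Advance the window: the oldest bucket is zeroed and reused. */
    void rotate() {
        newest = (newest + 1) % buckets.length;
        for (long[] row : buckets[newest]) Arrays.fill(row, 0);
    }

    /** Window estimate: sum of each active bucket's minimum across rows. */
    long estimate(String x) {
        long total = 0;
        for (long[][] bucket : buckets) {
            long min = Long.MAX_VALUE;
            for (int r = 0; r < D; r++) min = Math.min(min, bucket[r][col(r, x.hashCode())]);
            total += min;
        }
        return total;
    }

    public static void main(String[] args) {
        SlidingWindowCms win = new SlidingWindowCms(5); // e.g., 5 one-minute buckets
        win.insert("#kubernetes");
        win.insert("#kubernetes");
        System.out.println(win.estimate("#kubernetes")); // prints 2
        for (int i = 0; i < 5; i++) win.rotate();        // events age out of the window
        System.out.println(win.estimate("#kubernetes")); // prints 0
    }
}
```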
Sketch does not know which elements are in it. To find heavy hitters, you must either externally track candidate elements or scan all observed elements through the sketch. The sketch itself cannot enumerate elements — it can only answer "what is the frequency of element X?" for a specifically queried X.
| Failure Mode | Symptom | Mitigation |
| --- | --- | --- |
| Heavy-hitter collision inflation | Tail elements over-reported | Wider sketch; exact counters above threshold |
| Sliding window needed | Cannot expire old events | Per-bucket sketches; subtract expired bucket |
| Rate limiting use case | Overcounting blocks legitimate traffic | Use exact counter for rate limiting |
| Unknown heavy hitters | Cannot enumerate top-K without candidate set | Maintain a heap of seen elements alongside sketch |
🧭 Decision Guide: Count-Min Sketch vs Exact Counter vs HyperLogLog
| Situation | Recommendation | Reason |
| --- | --- | --- |
| Estimate element frequency in high-volume stream | Count-Min Sketch | Fixed O(d×w) memory; never underestimates |
| Need exact frequency counts — billing, SLA reporting | Exact HashMap counter | CMS over-estimates; accept memory cost for correctness |
| Count distinct elements — unique visitors, unique IPs | HyperLogLog | CMS cannot count cardinality |
| Membership check — "is X in the set?" | Bloom Filter | CMS answers "how often", not "is it present?" |
| Identify top-K frequent elements in stream | CMS + min-heap | CMS frequency estimates feed the heap comparator |
| Sliding-window frequency estimation | Multiple CMS buckets | One sketch per time bucket; subtract expired bucket |
The deciding question: Do you need to estimate how often a specific element has appeared in a stream, using fixed memory? If yes, Count-Min Sketch is the right structure. If you need exact counts or cardinality instead of frequency, choose exact counters or HyperLogLog respectively.
🧪 Implementing Count-Min Sketch in Java
The following Java implementation demonstrates the complete insert-query lifecycle with pairwise-independent hash functions per row.
public class CountMinSketch {
private final long[][] counters; // [depth][width]
private final int width; // w — columns per row
private final int depth; // d — number of rows / hash functions
private final long[] a; // hash function coefficient a_i
private final long[] b; // hash function coefficient b_i
private static final long PRIME = 2_038_074_743L; // a large prime
/**
* @param width w — number of columns (w = ⌈e/ε⌉, where e ≈ 2.718 and ε is the target error fraction)
* @param depth d — number of rows (d = ⌈ln(1/δ)⌉, where δ is the target failure probability)
*/
public CountMinSketch(int width, int depth) {
this.width = width;
this.depth = depth;
this.counters = new long[depth][width];
// Initialize random hash function coefficients
java.util.Random rand = new java.util.Random(42);
this.a = new long[depth];
this.b = new long[depth];
for (int i = 0; i < depth; i++) {
a[i] = (long) (rand.nextInt(Integer.MAX_VALUE)) + 1;
b[i] = (long) (rand.nextInt(Integer.MAX_VALUE));
}
}
/** Insert element with weight (use weight=1 for simple event counting). */
public void insert(String element, long weight) {
int hashCode = element.hashCode();
for (int i = 0; i < depth; i++) {
long h = ((a[i] * hashCode + b[i]) % PRIME + PRIME) % PRIME; // double-mod keeps the value non-negative
int col = (int) (h % width);
counters[i][col] += weight;
}
}
/**
* Estimate frequency of element.
* Returned value is always >= true frequency (upper bound guarantee).
*/
public long estimate(String element) {
int hashCode = element.hashCode();
long minCount = Long.MAX_VALUE;
for (int i = 0; i < depth; i++) {
long h = ((a[i] * hashCode + b[i]) % PRIME + PRIME) % PRIME; // same hash as insert
int col = (int) (h % width);
minCount = Math.min(minCount, counters[i][col]);
}
return minCount;
}
}
Usage example — Twitter-style trending hashtag detection:
// Size for 0.5% error (ε=0.005) with 99.9% confidence (δ=0.001)
int w = (int) Math.ceil(Math.E / 0.005); // ~544 columns
int d = (int) Math.ceil(Math.log(1 / 0.001)); // 7 rows
CountMinSketch cms = new CountMinSketch(w, d);
// Simulate tweet stream
long totalEvents = 0;
cms.insert("#kubernetes", 1); totalEvents++;
cms.insert("#docker", 1); totalEvents++;
cms.insert("#kubernetes", 1); totalEvents++;
cms.insert("#k8s", 1); totalEvents++;
// ... millions more events ...
// Query frequency
long k8sEstimate = cms.estimate("#kubernetes");
System.out.println("#kubernetes estimate: " + k8sEstimate);
// Result: >= 2 (true count), bounded by ε × totalEvents error
// Heavy hitter check: is #kubernetes responsible for >0.5% of all events?
if (k8sEstimate > 0.005 * totalEvents) {
System.out.println("#kubernetes is a heavy hitter");
}
Three additional streaming problems solved by the same structure:
Network packet rate monitoring. Insert each packet's source IP with weight = packet size in bytes. Query any source IP to get its estimated byte contribution. Alert when an IP's estimate exceeds a threshold — heavy-hitter detection for DDoS mitigation.
Top-K product views in e-commerce. Insert each product ID on every page view event. Maintain a min-heap of size K alongside the sketch. After each insert, compare the updated estimate against the heap's minimum. Update the heap if the estimate is larger.
Cache eviction priority. A cache eviction policy can prefer to keep high-frequency items. On each cache miss, query the sketch for the requested item's frequency. If the frequency estimate is high, upgrade the item's eviction priority in the cache's priority queue.
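The cache-admission idea can be sketched as a frequency-based admission check in the spirit of TinyLFU-style policies. Everything here is illustrative — the minimal sketch uses toy coefficients, and the admission rule is a simplification:

```java
public class FrequencyAdmission {
    // Minimal CMS with toy parameters — see the full implementation above.
    static final int D = 4, W = 1024;
    static final long PRIME = 2_038_074_743L;
    static final long[] A = {3, 5, 7, 11}, B = {13, 17, 19, 23};
    final long[][] counters = new long[D][W];

    int col(int r, int hash) {
        long h = ((A[r] * hash + B[r]) % PRIME + PRIME) % PRIME;
        return (int) (h % W);
    }

    /** Record one access to a key. */
    void record(String key) {
        for (int r = 0; r < D; r++) counters[r][col(r, key.hashCode())]++;
    }

    long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < D; r++) min = Math.min(min, counters[r][col(r, key.hashCode())]);
        return min;
    }

    /** Admit the candidate only if it is accessed at least as often as the victim. */
    boolean admit(String candidate, String victim) {
        return estimate(candidate) >= estimate(victim);
    }

    public static void main(String[] args) {
        FrequencyAdmission fa = new FrequencyAdmission();
        for (int i = 0; i < 10; i++) fa.record("hot-item");
        fa.record("cold-item");
        System.out.println(fa.estimate("hot-item")); // at least 10
        System.out.println(fa.admit("hot-item", "cold-item"));
    }
}
```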
🛠️ Redis CMS: Count-Min Sketch as a Native Data Type
Redis ships Count-Min Sketch as part of the RedisBloom module (included in Redis Stack) with native commands for initialization, increment, and query.
# Initialize a Count-Min Sketch with explicit dimensions
CMS.INITBYDIM trending 2000 7
# Creates a width=2000, depth=7 sketch — ~56 KB of counter storage
# Or initialize by error parameters (Redis computes dimensions automatically)
CMS.INITBYPROB trending 0.005 0.001
# target error fraction = 0.005, failure probability = 0.001
# Increment one element
CMS.INCRBY trending "#kubernetes" 1
→ (integer) 1 # the updated estimate for #kubernetes
# Increment multiple elements in one call
CMS.INCRBY trending "#docker" 5 "#k8s" 2
# Query frequency estimates
CMS.QUERY trending "#kubernetes"
→ (integer) 1
CMS.QUERY trending "#docker" "#k8s"
→ 1) (integer) 5
→ 2) (integer) 2
# Get sketch info
CMS.INFO trending
→ width: 2000
→ depth: 7
→ count: 8 (total number of increments across all inserts)
CMS.INITBYDIM creates a sketch with explicit dimensions. CMS.INITBYPROB is more convenient for production use — you specify your error budget and Redis computes the optimal width and depth. CMS.INCRBY returns the post-increment estimate for each element, enabling heavy-hitter checks inline with the insert. CMS.QUERY reads estimates without modifying the sketch.
Production pattern — Twitter-style trending in Redis:
# On each tweet event that contains #kubernetes:
CMS.INCRBY trending:window-{current_minute} "#kubernetes" 1
# Every 5 minutes, query the top candidates for trending:
CMS.QUERY trending:window-{current_minute} "#kubernetes" "#docker" "#k8s"
# Reset by deleting old window keys (CMS.RESET not available — use key expiry)
EXPIRE trending:window-{old_minute} 300 # expire after 5 minutes
For a full deep-dive on Redis probabilistic data types including Bloom Filter and HyperLogLog commands, see the companion posts in this series.
📚 Lessons Learned
- Count-Min Sketch answers frequency, not membership or cardinality. It tells you how often element X has appeared — not whether X is in the set (Bloom Filter's job) and not how many distinct elements there are (HyperLogLog's job). Match the structure to the question.
- Over-estimation is the price; fixed memory is the reward. Every estimate from a Count-Min Sketch is an upper bound on the true frequency, and the over-estimation magnitude is bounded by ε × n with high probability. For trending topics, ranking, and anomaly detection, this bounded over-estimation is a manageable trade-off for constant-memory operation.
- Width and depth are your error budget dials. Width (w) controls the error magnitude — doubling w halves the maximum per-query error. Depth (d) controls the error probability — each additional row reduces the probability of the worst case exponentially. Use the formulas w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ and size conservatively for your expected peak stream volume.
- Heavy hitter collisions are the dominant failure mode in practice. When one element dominates the stream (say, 30% of all events), it inflates many columns across all rows. Elements that collide with it in any row inherit inflated estimates. This is the "heavy hitter inflation" problem — address it by using a wider sketch or maintaining exact counters for known heavy hitters.
- No sliding window without extra structure. A plain Count-Min Sketch accumulates counts monotonically — once incremented, a counter never decreases. For sliding-window frequency estimation, maintain one sketch per time bucket (e.g., one per minute for a 5-minute window). Query the bucket sketches and sum them, then discard the oldest bucket as the window slides.
- Use library implementations for production. The error guarantees depend on the hash functions being pairwise independent, and rolling your own pairwise-independent hash family is error-prone. Use Redis RedisBloom's CMS.INITBYPROB, Apache DataSketches, or another library tested against the theoretical error bounds.
📌 Summary & Key Takeaways
Count-Min Sketch is the right structure when you need to estimate "how often has X appeared?" in a high-volume stream using fixed, pre-allocated memory.
- Core guarantee: estimates are always ≥ true frequency — the minimum across all d rows is an upper bound, never an underestimate.
- Sizing formula: w = ⌈e/ε⌉ columns for error within ε × n of true count; d = ⌈ln(1/δ)⌉ rows for failure probability ≤ δ.
- Memory is fixed at O(d × w) — never grows with stream volume or the number of distinct elements seen.
- All operations are O(d) ≈ O(1) — with d typically 5–10, inserts and queries touch fewer than 10 counter cells regardless of stream size.
- Heavy hitter detection pairs naturally with a min-heap of size K: insert into the sketch, query the estimate, update the heap if the estimate exceeds the K-th largest seen.
- Sliding windows require per-bucket sketches — one per time bucket — with periodic discard of expired buckets.
- Redis CMS commands (CMS.INITBYPROB, CMS.INCRBY, CMS.QUERY) provide production-grade Count-Min Sketch without custom implementation.
- The decisive test: if you need to estimate frequency of specific elements in a stream — trending topics, heavy hitters, anomaly detection — and exact counting would require unbounded memory, Count-Min Sketch delivers bounded-error estimates in fixed kilobytes.
🔗 Related Posts
- Bloom Filters Explained: Membership Testing with Zero False Negatives — The companion probabilistic structure for set membership — answers "is X definitely not in the set?" rather than "how often has X appeared?"
- HyperLogLog Explained: Counting Billions of Unique Items with 12 KB — Fixed-memory cardinality estimation — answers "how many distinct elements have I seen?" in the same probabilistic family
- Probabilistic Data Structures: A Practical Guide — Overview and decision guide across Bloom, Cuckoo, HyperLogLog, and Count-Min Sketch
- Sliding Window Technique — Another O(n) streaming pattern for subarray and window problems in the Data Structures and Algorithms series
Written by
Abstract Algorithms
@abstractalgorithms