All Posts

How Bloom Filters Work: The Probabilistic Set

Can you check if a username is taken without querying the database? Yes. Bloom Filters are a space-efficient probabilistic data structure.

Abstract AlgorithmsAbstract Algorithms
ยทยท5 min read
Share
Share on X / Twitter
Share on LinkedIn
Copy link

TLDR: A Bloom Filter is a bit array + multiple hash functions that answers "Is X in the set?" in $O(1)$ constant space. It can return false positives (say "yes" when the answer is "no") but never false negatives (never says "no" when the answer is "yes"). Tuning array size and hash count controls the false positive rate.


๐Ÿ“– The Nightclub Guestlist with No Memory

Imagine a nightclub with a strict guestlist. The bouncer has a system with ten switches (0 or 1). When Alice registers, her name is hashed by three different functions: switch 2, 5, and 8 are turned ON. When Bob registers, switch 1, 5, and 9 are turned ON.

At the door: "Is Charlie here?"

  • Hash Charlie โ†’ switches 2, 7, 4.
  • Switch 7 is OFF. Definitely not registered. (No false negative.)

"Is Alice here?"

  • Hash Alice โ†’ switches 2, 5, 8.
  • All three are ON. Probably registered. (Might be a false positive โ€” someone else turned on those switches.)

This is the Bloom filter. It uses space proportional to the bit array, not to the number of items.


๐Ÿ”ข How Three Hash Functions Set Bits in a Bit Array

Inserting "Alice" into a 10-bit Bloom filter with 3 hash functions:

Position:    0 1 2 3 4 5 6 7 8 9
Initial:     0 0 0 0 0 0 0 0 0 0

hash1("Alice") = 2 โ†’ set bit 2
hash2("Alice") = 5 โ†’ set bit 5
hash3("Alice") = 8 โ†’ set bit 8

After Alice:  0 0 1 0 0 1 0 0 1 0

Querying "Charlie":

hash1("Charlie") = 2 โ†’ bit 2 is 1 โœ“
hash2("Charlie") = 7 โ†’ bit 7 is 0 โœ—  โ†’ DEFINITELY NOT IN SET

Querying "Dave" (false positive case):

hash1("Dave") = 2 โ†’ bit 2 is 1 โœ“  (set by Alice)
hash2("Dave") = 5 โ†’ bit 5 is 1 โœ“  (set by Alice)
hash3("Dave") = 8 โ†’ bit 8 is 1 โœ“  (set by Alice)
โ†’ All bits set โ†’ "Probably in set" โ†’ FALSE POSITIVE

โš™๏ธ False Positives Are Guaranteed; False Negatives Are Not

Why can you never have a false negative?

When you insert an element, you set specific bits. When you query the same element, you check the exact same bits. Those bits can only be 1 (we set them). They cannot have been unset โ€” there's no delete operation.

Why can you have a false positive?

Multiple elements share the same bit positions. If Alice set bits 2, 5, 8 and Dave's query also maps to 2, 5, 8 โ€” Dave gets a false positive even though he was never inserted.

The false positive probability formula:

$$P_{\text{fp}} \approx \left(1 - e^{-kn/m}\right)^k$$

  • $k$: number of hash functions
  • $n$: number of elements inserted
  • $m$: size of the bit array
ActionEffect on $P_{\text{fp}}$
Larger array $m$Lower (fewer collisions)
More elements $n$Higher (more bits set)
More hash functions $k$Non-monotonic โ€” there's an optimal $k = (m/n) \ln 2$

๐Ÿง  Sizing a Bloom Filter: Choosing $m$ and $k$

Given $n$ expected elements and desired false positive rate $p$:

$$m = -\frac{n \ln p}{(\ln 2)^2} \qquad k = \frac{m}{n} \ln 2$$

Example: 1 million elements, 1% false positives

  • $m \approx 9.585 \times 10^6$ bits โ‰ˆ 1.2 MB
  • $k \approx 7$ hash functions

Compare: storing 1 million 100-byte strings = 100 MB. The Bloom filter is 80ร— smaller.


๐ŸŒ Where Bloom Filters Appear in Real Systems

SystemUse case
CassandraBefore a disk read, check the Bloom filter to skip SSTable files that definitely don't contain the key
Google ChromeSafe Browsing โ€” check URLs against a Bloom filter of malicious sites before full network lookup
BitcoinSPV wallets use Bloom filters to ask nodes for transactions matching their addresses without revealing their full address list
HBase / LevelDB / RocksDBPer-SSTable Bloom filters to reduce disk I/O for key lookups
CDN / web cache"Have we cached this URL before?" check without expensive database lookup

โš–๏ธ Limitations: No Delete, No Count, No Exact Answer

LimitationExplanation
No deletionUnsetting a bit might unset it for another element. (Counting Bloom filters extend this.)
No exact membershipOnly "probably yes" or "definitely no"
No retrievalCan't get the stored value โ€” it's not stored, only bits
Size is fixedStandard Bloom filters can't resize. Scalable Bloom filters (SBF) exist as a workaround.

๐Ÿ“Œ Key Takeaways

  • A Bloom filter uses a bit array + $k$ hash functions to test set membership in $O(1)$ constant space.
  • It has no false negatives and controllable false positives.
  • Optimal hash count: $k = (m/n) \ln 2$. Optimal array size: $m = -n \ln p / (\ln 2)^2$.
  • 1% FP rate for 1M elements costs only ~1.2 MB โ€” 80ร— smaller than storing the elements.
  • Used in Cassandra, Chrome Safe Browsing, Bitcoin SPV, and RocksDB.

๐Ÿงฉ Test Your Understanding

  1. A Bloom filter returns "no" for element X. Is X in the set?
  2. You double the bit array size $m$ while keeping $n$ constant. Does $P_{\text{fp}}$ increase or decrease?
  3. Why can't you delete an element from a standard Bloom filter?
  4. Cassandra uses a Bloom filter before reading from disk. What skip cost does this avoid?

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms