All Posts

How Bloom Filters Work: The Probabilistic Set

Can you check if a username is taken without querying the database? Yes. Bloom Filters are a space-efficient probabilistic data structure.

Abstract AlgorithmsAbstract Algorithms
Β·Β·13 min read

AI-assisted content.

TLDR

TLDR: A Bloom Filter is a bit array + multiple hash functions that answers "Is X in the set?" in $O(1)$ constant space. It can return false positives (say "yes" when the answer is "no") but never false negatives (never says "no" when the answer is "yes"). Tuning array size and hash count controls the false positive rate.


πŸ“– The Nightclub Guestlist with No Memory

Imagine a nightclub with a strict guestlist. The bouncer has a system with ten switches (0 or 1). When Alice registers, her name is hashed by three different functions: switch 2, 5, and 8 are turned ON. When Bob registers, switch 1, 5, and 9 are turned ON.

At the door: "Is Charlie here?"

  • Hash Charlie β†’ switches 2, 7, 4.
  • Switch 7 is OFF. Definitely not registered. (No false negative.)

"Is Alice here?"

  • Hash Alice β†’ switches 2, 5, 8.
  • All three are ON. Probably registered. (Might be a false positive β€” someone else turned on those switches.)

This is the Bloom filter. It uses space proportional to the bit array, not to the number of items.


πŸ” Bit Arrays, Hash Functions, and the Two Operations

A Bloom filter has just two parts and two operations β€” that's the whole structure.

The two parts:

  • A bit array of size m, initialized to all zeros.
  • k independent hash functions, each of which maps any element to one position in the array.

The two operations:

Insert(x): Run x through all k hash functions. Each function gives a position. Set those positions to 1.

Query(x): Run x through the same k hash functions. If every resulting position is 1, report "probably in set." If any position is 0, report "definitely not in set."

That asymmetry is the key insight: you can only set bits, never unset them. So a missing bit is conclusive proof of absence. But all bits being set could mean the element is present or that other elements happened to set the same positions.

Here is how a Bloom filter compares to an exact hash set:

PropertyHash SetBloom Filter
MemoryO(n Γ— element size)O(m bits) β€” fixed
InsertO(1)O(k) β€” k hash functions
Query resultExact: yes or noProbabilistic: "probably yes" or "definitely no"
DeleteYesNo (standard)
False negativesNeverNever
False positivesNeverYes, tunable

The memory difference is dramatic. A 10-million-element hash set of 100-byte strings uses roughly 1 GB. A well-tuned Bloom filter for the same set with a 1% false positive rate uses under 12 MB β€” about 85Γ— smaller.


πŸ”’ How Three Hash Functions Set Bits in a Bit Array

Inserting "Alice" into a 10-bit Bloom filter with 3 hash functions:

Position:    0 1 2 3 4 5 6 7 8 9
Initial:     0 0 0 0 0 0 0 0 0 0

hash1("Alice") = 2 β†’ set bit 2
hash2("Alice") = 5 β†’ set bit 5
hash3("Alice") = 8 β†’ set bit 8

After Alice:  0 0 1 0 0 1 0 0 1 0

Querying "Charlie":

hash1("Charlie") = 2 β†’ bit 2 is 1 βœ“
hash2("Charlie") = 7 β†’ bit 7 is 0 βœ—  β†’ DEFINITELY NOT IN SET

Querying "Dave" (false positive case):

hash1("Dave") = 2 β†’ bit 2 is 1 βœ“  (set by Alice)
hash2("Dave") = 5 β†’ bit 5 is 1 βœ“  (set by Alice)
hash3("Dave") = 8 β†’ bit 8 is 1 βœ“  (set by Alice)
β†’ All bits set β†’ "Probably in set" β†’ FALSE POSITIVE

πŸ“Š Bloom Filter Insert: Hash Functions Set Bits

sequenceDiagram
    participant E as Element "Alice"
    participant H1 as hash1
    participant H2 as hash2
    participant H3 as hash3
    participant B as Bit Array[0..9]

    E->>H1: compute hash
    H1-->>B: set bit 2 = 1
    E->>H2: compute hash
    H2-->>B: set bit 5 = 1
    E->>H3: compute hash
    H3-->>B: set bit 8 = 1
    Note over B: Bits {2,5,8} now = 1

This diagram shows how a single insert fans out across three independent hash functions, each writing to a different position in the bit array. All three hash operations happen for every insert β€” there are no shortcuts. The critical takeaway is that these bit writes are the only state change: once set, a bit is never cleared, which is precisely why false negatives are structurally impossible.

πŸ“Š Bloom Filter Lookup: Check All Bits

sequenceDiagram
    participant E as Query Element
    participant H1 as hash1
    participant H2 as hash2
    participant H3 as hash3
    participant B as Bit Array

    E->>H1: compute hash  pos P1
    H1-->>B: check bit P1
    E->>H2: compute hash  pos P2
    H2-->>B: check bit P2
    E->>H3: compute hash  pos P3
    H3-->>B: check bit P3
    alt any bit = 0
        B-->>E: DEFINITELY NOT in set
    else all bits = 1
        B-->>E: PROBABLY in set (may be FP)
    end

This diagram shows the asymmetric nature of a Bloom filter query: the result is either a definitive "no" (any zero bit proves absence) or a probabilistic "yes" (all bits set could mean present or could be a false positive from other elements). The alt branch at the bottom is where the entire probabilistic guarantee lives β€” every design decision about array size and hash count is aimed at keeping the false-positive branch rare.


βš™οΈ False Positives Are Guaranteed; False Negatives Are Not

Why can you never have a false negative?

When you insert an element, you set specific bits. When you query the same element, you check the exact same bits. Those bits can only be 1 (we set them). They cannot have been unset β€” there's no delete operation.

Why can you have a false positive?

Multiple elements share the same bit positions. If Alice set bits 2, 5, 8 and Dave's query also maps to 2, 5, 8 β€” Dave gets a false positive even though he was never inserted.

The false positive probability formula:

$$P_{\text{fp}} \approx \left(1 - e^{-kn/m}\right)^k$$

  • $k$: number of hash functions
  • $n$: number of elements inserted
  • $m$: size of the bit array
ActionEffect on $P_{\text{fp}}$
Larger array $m$Lower (fewer collisions)
More elements $n$Higher (more bits set)
More hash functions $k$Non-monotonic β€” there's an optimal $k = (m/n) \ln 2$

πŸ“Š Insert and Query Flow

The following diagram shows how a single element travels through a Bloom filter for both insert and query operations. The key branch at query time β€” "are all bits 1?" β€” is where the probabilistic guarantee lives.

flowchart TD
    Insert[Insert: hash(element)  k bit positions]
    Insert --> SetBits[Set bits at positions h1, h2, h3 = 1]

    Query[Query: hash(element)  k bit positions]
    Query --> CheckBits{All k bits = 1?}
    CheckBits -- "No (any bit = 0)" --> Absent[DEFINITELY NOT in set (no false negatives)]
    CheckBits -- "Yes (all bits = 1)" --> Present[PROBABLY in set (may be false positive)]

The insert path is unconditional β€” it always writes. The query path is the only place a decision is made. Because bits are never cleared, the "No" branch is a mathematical certainty: if any bit is zero, the element was never hashed to that position, so it was never inserted.


🧠 Deep Dive: Sizing a Bloom Filterβ€”Choosing m and k

Given $n$ expected elements and desired false positive rate $p$:

$$m = -\frac{n \ln p}{(\ln 2)^2} \qquad k = \frac{m}{n} \ln 2$$

Example: 1 million elements, 1% false positives

  • $m \approx 9.585 \times 10^6$ bits β‰ˆ 1.2 MB
  • $k \approx 7$ hash functions

Compare: storing 1 million 100-byte strings = 100 MB. The Bloom filter is 80Γ— smaller.


🌍 Real-World Applications: Bloom Filters in Production

SystemUse case
CassandraBefore a disk read, check the Bloom filter to skip SSTable files that definitely don't contain the key
Google ChromeSafe Browsing β€” check URLs against a Bloom filter of malicious sites before full network lookup
BitcoinSPV wallets use Bloom filters to ask nodes for transactions matching their addresses without revealing their full address list
HBase / LevelDB / RocksDBPer-SSTable Bloom filters to reduce disk I/O for key lookups
CDN / web cache"Have we cached this URL before?" check without expensive database lookup

πŸ§ͺ Sizing Chrome's Safe Browsing Filter from Scratch

Let's work through a real design problem: Google Chrome needs to check whether URLs are malicious before sending a network request. The Safe Browsing list has roughly 10 million known-bad URLs, and you want a false positive rate under 0.1% (so 1 in 1,000 safe URLs triggers a false alarm).

Step 1: Calculate the required bit array size (m)

$$m = -\frac{n \ln p}{(\ln 2)^2}$$

Plugging in $n = 10{,}000{,}000$ and $p = 0.001$:

$$m = -\frac{10{,}000{,}000 \times \ln(0.001)}{(\ln 2)^2} = \frac{10{,}000{,}000 \times 6.908}{0.480} \approx 143{,}900{,}000 \text{ bits}$$

That is roughly 17.2 MB β€” small enough to ship inside the browser and keep in memory.

Step 2: Calculate the optimal number of hash functions (k)

$$k = \frac{m}{n} \ln 2 = \frac{143{,}900{,}000}{10{,}000{,}000} \times 0.693 \approx 10 \text{ hash functions}$$

Step 3: Sanity check

ParameterValue
URLs in list (n)10,000,000
Target FP rate (p)0.1%
Bit array size (m)~143.9 million bits β‰ˆ 17.2 MB
Hash functions (k)10
Storage vs raw strings~85Γ— smaller than storing the URLs

Why this matters: Chrome hashes a URL 10 times and checks 10 bits in under a microsecond. Only if all bits are set does Chrome query Google's servers. The filter absorbs 99.9% of lookups.


βš–οΈ Trade-offs & Failure Modes: Bloom Filter Limitations

LimitationExplanation
No deletionUnsetting a bit might unset it for another element. (Counting Bloom filters extend this.)
No exact membershipOnly "probably yes" or "definitely no"
No retrievalCan't get the stored value β€” it's not stored, only bits
Size is fixedStandard Bloom filters can't resize. Scalable Bloom filters (SBF) exist as a workaround.

🧭 Decision Guide: Bloom Filter vs Hash Set

ScenarioUseWhy
Cache pre-check (avoid disk/network)Bloom filterMemory-efficient; false positive triggers a cheap lookup
Username availability checkBloom filterSub-millisecond, no DB hit for "definitely taken"
Financial ledger membershipHash setNo false positives allowed
Deletions neededCounting Bloom filterStandard Bloom doesn't support deletion
Small set (< 10k items)Hash setMemory advantage disappears at small scale

πŸ› οΈ Guava BloomFilter: Production-Ready Bloom Filters in Java

Google Guava provides a type-safe, production-ready BloomFilter<T> implementation that handles the sizing math (m and k from this post's formulas) automatically β€” you only supply the expected insertion count and desired false-positive probability. Under the hood it uses MurmurHash3-based hash functions and an atomic long[] bit array.

// build.gradle: com.google.guava:guava:33.2.1-jre

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class UsernameAvailabilityChecker {

    // Create a Bloom filter for 1 million usernames with 1% FP rate.
    // Guava auto-calculates optimal m and k β€” no manual formula needed.
    private static final BloomFilter<String> TAKEN_USERNAMES =
        BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            1_000_000,   // expected insertions (n)
            0.01         // desired false-positive probability (p = 1%)
        );

    /** Bulk-load existing usernames from the database at startup. */
    public static void seed(Iterable<String> existingUsernames) {
        existingUsernames.forEach(TAKEN_USERNAMES::put);
    }

    /**
     * Returns false β†’ username is DEFINITELY available. Skip the DB query.
     * Returns true  β†’ username is PROBABLY taken. Verify with a DB query.
     */
    public static boolean mightBeTaken(String username) {
        return TAKEN_USERNAMES.mightContain(username);
    }

    public static void main(String[] args) {
        seed(java.util.List.of("alice", "bob", "charlie"));

        System.out.println(mightBeTaken("alice"));    // β†’ true  (in filter)
        System.out.println(mightBeTaken("dave"));     // β†’ false (definitely not taken)
        System.out.println(mightBeTaken("zara"));     // β†’ false (definitely not taken)
    }
}

What Guava handles automatically:

Manual formulaGuava equivalent
$m = -n \ln p / (\ln 2)^2$Computed from expectedInsertions + fpp parameters
$k = (m/n) \ln 2$Computed internally; uses MurmurHash3 for $k$ independent positions
Bit arrayLockFreeBitArray backed by atomic long[] β€” safe for concurrent reads

Thread safety note: BloomFilter.mightContain() is thread-safe for reads. BloomFilter.put() is not thread-safe. For concurrent inserts in a server context, synchronize on the filter or use a ReentrantReadWriteLock.

For a full deep-dive on Guava's BloomFilter implementation and Counting Bloom Filter variants for deletion support, a dedicated follow-up post is planned.


πŸ“š Three Rules That Make Bloom Filters Worth Using

Working through the math and the real-world examples surfaces a few lessons that are easy to miss the first time.

1. False negatives are structurally impossible β€” not just unlikely. Once a bit is set to 1, it stays 1 forever (in a standard Bloom filter). This means if you inserted element X, its hash positions are permanently marked. Querying X again will always find all its bits set. There is no sequence of events that can produce a false negative. This guarantee is absolute, which is why Bloom filters are trusted as a pre-filter layer in Cassandra, RocksDB, and similar systems: if the filter says "not here," the disk is not touched.

2. Use a Bloom filter when you need to eliminate "definitely no" at low cost. Choose a Bloom filter when false positives are tolerable and memory is constrained. Use an exact structure (hash set, database index) when false positives are unacceptable.

3. The Counting Bloom filter enables deletion β€” at a memory cost. Standard Bloom filters replace single bits with small integer counters (typically 3–4 bits per counter). Insert increments counters; delete decrements them. A position is "set" if its counter is > 0. This makes deletion safe β€” decrementing a counter only affects the element being removed, not other elements that share other bits. The trade-off: 3–4Γ— more memory. Counting Bloom filters are used in network packet filters and real-time stream deduplication where elements need to age out.

4. Tune early, not after the fact. The FP rate is fixed at construction time by your choice of m and k. Estimate maximum n before deploying and compute m from the formula β€” you cannot reduce the FP rate without rebuilding the filter.


πŸ“Œ TLDR: Summary & Key Takeaways

  • A Bloom filter uses a bit array + $k$ hash functions to test set membership in $O(1)$ constant space.
  • It has no false negatives and controllable false positives.
  • Optimal hash count: $k = (m/n) \ln 2$. Optimal array size: $m = -n \ln p / (\ln 2)^2$.
  • 1% FP rate for 1M elements costs only ~1.2 MB β€” 80Γ— smaller than storing the elements.
  • Used in Cassandra, Chrome Safe Browsing, Bitcoin SPV, and RocksDB.

Share
Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms