Intermediate → Advanced · 18 min read · Topic 7.5

Bloom filters & probabilistic data structures

Bloom filters, count-min sketch, HyperLogLog, cuckoo filters — space-efficient set membership and cardinality

🎲 Key Takeaways

  1. Bloom filters test set membership with zero false negatives, tunable false positives, and O(k) time per operation
  2. They use ~10 bits per element for a 1% false positive rate — vastly more space-efficient than hash sets
  3. HyperLogLog counts unique elements with <1% error using only 12 KB of memory, regardless of cardinality
  4. Used everywhere: Chrome Safe Browsing, Cassandra, Bigtable, Redis, CDNs, spam filters

The Power of Probabilistic Data Structures

Sometimes you need to answer questions like "Have I seen this element before?" or "How many unique visitors today?" — but the dataset is too large for exact tracking. Probabilistic data structures trade **guaranteed exactness for dramatic space savings**.

A Bloom filter can tell you if an item is **definitely NOT in the set** or **probably in the set** — using 1/100th the memory of a hash set. A HyperLogLog can count 1 billion unique items using just 12 KB. These structures are foundational in databases, caches, networks, and streaming systems.

Bloom Filter — The Core Data Structure

A Bloom filter is a bit array of m bits, initially all zeros, with k independent hash functions.

**Insert(x):** Hash x with all k hash functions and set the bits at positions h₁(x), h₂(x), ..., hₖ(x) to 1.

**Query(x):** Hash x with all k functions. If ALL k bits are 1 → "probably in set". If ANY bit is 0 → "definitely NOT in set".

**Why false positives?** Multiple items can set the same bits. Over time, more bits become 1, increasing the chance that all k bits for a new item are coincidentally set.

**Key property:** False positives are possible; false negatives are IMPOSSIBLE. If a bit is 0, no inserted element could have set it.

For n expected elements and desired false positive rate p:

**Optimal bit array size:** m = -(n × ln p) / (ln 2)²
**Optimal hash functions:** k = (m/n) × ln 2

**Examples:**
- n = 1M, p = 1%: m = 9.6M bits (~1.2 MB), k = 7 hashes
- n = 1M, p = 0.1%: m = 14.4M bits (~1.8 MB), k = 10 hashes
- n = 1B, p = 1%: m = 9.6B bits (~1.2 GB), k = 7 hashes

Compare: a HashSet of 1M strings (avg 50 bytes) ≈ 50 MB. Bloom filter: 1.2 MB. That's a 40× saving!
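As a sanity check, the sizing formulas transcribe directly into a few lines of Python (the helper name `bloom_params` is just for illustration):

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n items at false positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # m = -n·ln(p) / (ln 2)²
    k = round(m / n * math.log(2))                      # k = (m/n)·ln 2
    return m, k

m, k = bloom_params(1_000_000, 0.01)
# m ≈ 9.6M bits (about 1.2 MB) and k = 7, matching the first example above
```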

**Standard Bloom filter limitations:**
- Cannot delete elements (clearing a bit might affect other elements)
- Cannot count occurrences
- False positive rate increases as the filter fills up

**Cuckoo Filter:** Supports deletion. Uses cuckoo hashing with fingerprints. Better space efficiency than Bloom filters at low false positive rates (<3%). Used in newer systems.

**Counting Bloom Filter:** Replaces each bit with a counter, so deletion is supported (decrement instead of clear), at the cost of 3-4× more memory than a standard Bloom filter.

**Scalable Bloom Filter:** Grows by adding new filters when the current one is too full, maintaining the desired false positive rate as the dataset grows.
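The counting variant is a small twist on the standard filter. This toy sketch (hypothetical class name, byte-wide counters rather than the 4-bit counters often used in practice) assumes you only ever remove items you previously added:

```python
import hashlib

class CountingBloomFilter:
    """Toy counting Bloom filter: one byte-wide counter per slot instead of one bit.
    Illustrative only; removing an item that was never added corrupts the counters."""

    def __init__(self, size=100_000, hash_count=7):
        self.size = size
        self.hash_count = hash_count
        self.counters = bytearray(size)  # all counters start at 0

    def _hashes(self, item):
        # Double hashing: derive k positions from two base hashes
        h1 = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for pos in self._hashes(item):
            if self.counters[pos] < 255:      # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item):
        if item in self:                      # only decrement plausible members
            for pos in self._hashes(item):
                if 0 < self.counters[pos] < 255:
                    self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] for pos in self._hashes(item))
```

A saturated counter (255) is left untouched on both add and remove, trading a permanently stuck slot for correctness of the remaining counters.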

Bloom Filter Implementation
import hashlib, math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        self.fp_rate = fp_rate
        self.size = self._optimal_size(expected_items, fp_rate)
        self.hash_count = self._optimal_hashes(self.size, expected_items)
        self.bit_array = bytearray(math.ceil(self.size / 8))
        self.count = 0
    
    @staticmethod
    def _optimal_size(n, p):
        return int(-n * math.log(p) / (math.log(2) ** 2))
    
    @staticmethod
    def _optimal_hashes(m, n):
        return max(1, int((m / n) * math.log(2)))
    
    def _hashes(self, item):
        h1 = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size  # double hashing
    
    def add(self, item):
        for pos in self._hashes(item):
            self.bit_array[pos // 8] |= (1 << (pos % 8))
        self.count += 1
    
    def __contains__(self, item):
        return all(
            self.bit_array[pos // 8] & (1 << (pos % 8))
            for pos in self._hashes(item)
        )

# Usage
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("user:12345")
bf.add("user:67890")
print("user:12345" in bf)  # True (definitely added)
print("user:99999" in bf)  # False — not added (but a ~1% false-positive chance of True exists)

Probabilistic Data Structures Overview

| Structure | Question It Answers | Space | Error | Supports Delete |
|---|---|---|---|---|
| Bloom Filter | Is X in the set? | ~10 bits/element | False positives only | No |
| Cuckoo Filter | Is X in the set? | ~12 bits/element | FP only, better at low rates | Yes |
| Count-Min Sketch | How many times has X appeared? | O(w × d) | Over-counts only | No |
| HyperLogLog | How many unique elements? | 12 KB fixed | <1% error | No |
| Top-K (Count-Min + Heap) | What are the top K elements? | O(K × w × d) | Approximate | No |
| t-digest | What is the p99 latency? | O(compression) | ~0.1-1% error | No |

Other Key Structures

**HyperLogLog:** Counts the number of unique (distinct) elements in a stream using only ~12 KB of memory.

**How:** Hashes each element and uses the position of the leftmost 1-bit as an estimator. With 2¹⁴ = 16,384 registers (6 bits each), it achieves <1% standard error.

**Example:** "How many unique visitors did we have today?" With 100M visitors, a HashSet requires gigabytes. HyperLogLog: 12 KB and ~0.8% error.

**Redis:** Built in via PFADD and PFCOUNT. `PFADD visitors "user:123"`, then `PFCOUNT visitors` → approximate unique count.

**Used by:** Google BigQuery (COUNT DISTINCT), Redis, Presto, Apache Flink, Datadog.
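The mechanism above can be sketched in a few dozen lines. This minimal version follows the 2¹⁴-register layout described, but omits the extra bias corrections (e.g. HyperLogLog++) that production implementations apply:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch; illustrative only."""

    def __init__(self, p=14):
        self.p = p                        # 2^14 = 16,384 registers, ~12 KB
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest feed the estimator
        x = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - self.p)
        rest = x & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias-correction constant
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:               # small-range correction
            est = self.m * math.log(self.m / zeros)     # fall back to linear counting
        return round(est)

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user:{i}")
# hll.count() is now approximately 50,000 (standard error ≈ 0.8% at p=14)
```

Because `add` only ever takes a max, inserting the same element twice leaves every register, and therefore the estimate, unchanged — which is exactly why the structure counts *unique* elements.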

Real-World Applications

**Cassandra:** Uses a Bloom filter on each SSTable to skip disk reads. Before reading an SSTable, the Bloom filter is checked; if it says "not present," the SSTable is skipped entirely — saving expensive disk I/O.

**Bigtable / HBase:** Similar to Cassandra: Bloom filters reduce the number of SSTables that need to be checked during a read. Configurable per column family.

**PostgreSQL:** Offers Bloom indexes (via the bloom extension) for multi-column approximate lookups. Useful when you have many columns and need to check "is there any row matching these criteria?"
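The LSM-tree read path can be illustrated with a toy sketch. The `FakeSSTable` class here is hypothetical, and a Python set stands in for the per-SSTable Bloom filter (so unlike a real filter, it has no false positives):

```python
class FakeSSTable:
    """Toy stand-in for an on-disk SSTable with an attached membership filter."""

    def __init__(self, data):
        self.data = data
        self.bloom = set(data)       # real systems use a Bloom filter here
        self.disk_reads = 0

    def get(self, key):
        self.disk_reads += 1         # count how often we touch "disk"
        return self.data.get(key)

def read_key(key, sstables):
    for sst in sstables:
        if key not in sst.bloom:     # definite miss: skip this SSTable's disk read
            continue
        value = sst.get(key)         # with a real filter, may still be a false positive
        if value is not None:
            return value
    return None

old = FakeSSTable({"a": 1})
new = FakeSSTable({"b": 2})
read_key("b", [old, new])            # returns 2; old.disk_reads stays at 0
```

The filter check turns a disk read per SSTable into a cheap in-memory test, which is exactly the saving Cassandra and HBase rely on.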

Interview Tip
When Bloom filters come up in interviews, mention these key applications: (1) Avoiding unnecessary disk reads in LSM-tree databases, (2) Cache penetration prevention, (3) Spell checking / dictionary lookup, (4) Network duplicate detection. For HyperLogLog, mention unique visitor counting and the Redis PFCOUNT command. Showing you know where these fit in real systems is more impressive than just knowing the algorithm.

Advantages

  • Extreme space efficiency: 10 bits/element vs 50+ bytes for hash sets
  • O(k) time for both insert and query (k = number of hash functions), independent of the number of elements
  • Zero false negatives — if the filter says 'no', the element definitely isn't in the set
  • HyperLogLog counts billions of unique elements in just 12 KB with <1% error

Disadvantages

  • Standard Bloom filters cannot delete elements
  • False positive rate increases as the filter fills — must size correctly upfront
  • Not suitable when exact answers are required (e.g., financial calculations)
  • Multiple hash function computations add latency for each operation

🧪 Test Your Understanding

Knowledge Check (1/3)

Can a Bloom filter produce false negatives?