🎲 Key Takeaways
- Bloom filters test set membership with zero false negatives, tunable false positives, and constant O(k) time
- They use ~10 bits per element for a 1% false positive rate, vastly more space-efficient than hash sets
- HyperLogLog counts unique elements with <1% error using only 12 KB of memory, regardless of cardinality
- Used everywhere: Chrome Safe Browsing, Cassandra, Bigtable, Redis, CDNs, spam filters
The Power of Probabilistic Data Structures
Sometimes you need to answer questions like "Have I seen this element before?" or "How many unique visitors today?" — but the dataset is too large for exact tracking. Probabilistic data structures trade **guaranteed exactness for dramatic space savings**.
A Bloom filter can tell you if an item is **definitely NOT in the set** or **probably in the set** — using 1/100th the memory of a hash set. A HyperLogLog can count 1 billion unique items using just 12 KB. These structures are foundational in databases, caches, networks, and streaming systems.
Bloom Filter — The Core Data Structure
A Bloom filter is a bit array of m bits, initially all zeros, with k independent hash functions.

**Insert(x):** Hash x with all k hash functions and set the bits at positions h₁(x), h₂(x), ..., hₖ(x) to 1.

**Query(x):** Hash x with all k functions. If ALL k bits are 1 → "probably in set". If ANY bit is 0 → "definitely NOT in set".

**Why false positives?** Multiple items can set the same bits. Over time, more bits become 1, increasing the chance that all k bits for a new item are coincidentally set.

**Key property:** False positives are possible; false negatives are IMPOSSIBLE. If a bit is 0, no inserted element could have set it.
For n expected elements and desired false positive rate p:

**Optimal bit array size:** m = -(n × ln(p)) / (ln(2))²

**Optimal hash functions:** k = (m/n) × ln(2)

**Examples:**
- n=1M, p=1%: m = 9.6M bits (~1.2 MB), k = 7 hashes
- n=1M, p=0.1%: m = 14.4M bits (~1.8 MB), k = 10 hashes
- n=1B, p=1%: m = 9.6B bits (~1.2 GB), k = 7 hashes

Compare: a HashSet of 1M strings (avg 50 bytes) = ~50 MB. Bloom filter = 1.2 MB. That's 40× savings!
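The two sizing formulas can be sanity-checked directly. This small helper (`bloom_params` is a name chosen for this sketch, not a standard API) reproduces the n=1M examples:

```python
import math

def bloom_params(n, p):
    """Optimal bit-array size m and hash count k for n items at FP rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # m = -n*ln(p)/(ln 2)^2
    k = round(m / n * math.log(2))                       # k = (m/n)*ln 2
    return m, k

m, k = bloom_params(1_000_000, 0.01)
# m is roughly 9.6 million bits (~1.2 MB), with k = 7 hash functions
```

Plugging in p=0.1% instead yields roughly 14.4M bits and k = 10, matching the examples above.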
**Standard Bloom filter limitations:**
- Cannot delete elements (setting a bit to 0 might affect other elements)
- Cannot count occurrences
- False positive rate increases as the filter fills up

**Cuckoo Filter:** Supports deletion. Uses cuckoo hashing with fingerprints. Better space efficiency than Bloom filters at low false positive rates (<3%). Used in newer systems.

**Counting Bloom Filter:** Replaces each bit with a counter, so deletion is supported (decrement instead of clear). Costs 3-4× more memory than a standard Bloom filter.

**Scalable Bloom Filter:** Grows by adding new filters when the current one is too full, maintaining the desired false positive rate as the dataset grows.
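The counting variant described above can be sketched in a few lines. This is a minimal illustration, assuming the same double-hashing scheme used in the implementation below; sizes and the `CountingBloomFilter` name are arbitrary choices for the sketch:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size=1000, hash_count=4):
        self.size = size
        self.hash_count = hash_count
        self.counters = [0] * size  # a small counter per slot instead of one bit

    def _positions(self, item):
        # Double hashing: derive hash_count positions from two base hashes
        h1 = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Caller must only remove items that were actually added;
        # otherwise counters underflow and the filter is corrupted.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

The memory cost is visible here: each slot is now a counter (typically 4 bits in practice) rather than a single bit, which is where the 3-4× overhead comes from.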
```python
import hashlib, math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        self.fp_rate = fp_rate
        self.size = self._optimal_size(expected_items, fp_rate)
        self.hash_count = self._optimal_hashes(self.size, expected_items)
        self.bit_array = bytearray(math.ceil(self.size / 8))
        self.count = 0

    @staticmethod
    def _optimal_size(n, p):
        # m = -n * ln(p) / (ln 2)^2
        return int(-n * math.log(p) / (math.log(2) ** 2))

    @staticmethod
    def _optimal_hashes(m, n):
        # k = (m / n) * ln 2
        return max(1, int((m / n) * math.log(2)))

    def _hashes(self, item):
        h1 = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size  # double hashing

    def add(self, item):
        for pos in self._hashes(item):
            self.bit_array[pos // 8] |= (1 << (pos % 8))
        self.count += 1

    def __contains__(self, item):
        return all(
            self.bit_array[pos // 8] & (1 << (pos % 8))
            for pos in self._hashes(item)
        )

# Usage
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("user:12345")
bf.add("user:67890")
print("user:12345" in bf)  # True (added items are always found)
print("user:99999" in bf)  # Almost certainly False; a True here would be a ~1% false positive
```

Probabilistic Data Structures Overview
| Structure | Question It Answers | Space | Error | Supports Delete |
|---|---|---|---|---|
| Bloom Filter | Is X in the set? | ~10 bits/element | False positives only | No |
| Cuckoo Filter | Is X in the set? | ~12 bits/element | FP only, better at low rates | Yes |
| Count-Min Sketch | How many times has X appeared? | O(w × d) | Over-counts only | No |
| HyperLogLog | How many unique elements? | 12 KB fixed | <1% error | No |
| Top-K (Count-Min + Heap) | What are the top K elements? | O(K × w × d) | Approximate | No |
| t-digest | What is the p99 latency? | O(compression) | ~0.1-1% error | No |
Other Key Structures
HyperLogLog
Counts the number of unique (distinct) elements in a stream using only ~12 KB of memory.

**How:** Hashes each element and uses the position of the leftmost 1-bit in the hash as an estimator. With 2^14 = 16,384 registers (6 bits each), it achieves <1% standard error.

**Example:** "How many unique visitors did we have today?" With 100M visitors, a HashSet requires gigabytes. HyperLogLog: 12 KB and ~0.8% error.

**Redis:** Built in via PFADD and PFCOUNT. `PFADD visitors "user:123"`, then `PFCOUNT visitors` → approximate unique count.

**Used by:** Google BigQuery (COUNT DISTINCT), Redis, Presto, Apache Flink, Datadog.
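The register mechanics described above can be sketched compactly. This is a simplified educational version (no bias correction for mid-range cardinalities, and SHA-1 stands in for a fast 64-bit hash), not a production implementation:

```python
import hashlib, math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # 2^14 = 16,384 registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for large m

    def _hash(self, item):
        # 64-bit hash taken from SHA-1 (stand-in for a fast hash like xxHash)
        return int(hashlib.sha1(str(item).encode()).hexdigest()[:16], 16)

    def add(self, item):
        x = self._hash(item)
        idx = x & (self.m - 1)               # low p bits pick a register
        w = x >> self.p                      # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in w (1-based)
        rank = (64 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:    # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=14 the standard error is about 1.04/√16384 ≈ 0.8%, which is where the "<1% error in 12 KB" figure comes from (16,384 registers × 6 bits ≈ 12 KB).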
Real-World Applications
**Cassandra:** Uses a Bloom filter on each SSTable to skip disk reads. Before reading an SSTable, the Bloom filter is checked; if it says "not present," the SSTable is skipped entirely, saving expensive disk I/O. Each SSTable has its own Bloom filter.

**Bigtable / HBase:** Similar to Cassandra. Bloom filters reduce the number of SSTables that need to be checked during a read. Configurable per column family.

**PostgreSQL:** Offers Bloom indexes (via the bloom extension) for multi-column approximate lookups. Useful when you have many columns and need to check "is there any row matching these criteria?"
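The LSM read path described for Cassandra and HBase boils down to a simple pattern. This is a hypothetical sketch: `read_key` and the `(bloom, table)` pair structure are invented for illustration, and any object supporting `in` can play the filter role:

```python
def read_key(key, sstables):
    """Look up key across SSTables, newest first.

    sstables: list of (bloom_filter, table) pairs, where bloom_filter
    supports `key in bloom_filter` and table supports .get(key).
    """
    for bloom, table in sstables:
        if key not in bloom:
            continue               # "definitely not present": skip the disk read
        value = table.get(key)     # may still miss (Bloom false positive)
        if value is not None:
            return value
    return None
```

The win is in the `continue`: for keys absent from most SSTables, the filter check (a few in-memory bit probes) replaces a disk seek per skipped table.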
Advantages
- Extreme space efficiency: ~10 bits/element vs 50+ bytes per entry in a hash set
- O(k) constant time for both insert and query (k = number of hash functions)
- Zero false negatives: if the filter says "no", the element definitely isn't in the set
- HyperLogLog counts billions of unique elements in just 12 KB with <1% error
Disadvantages
- Standard Bloom filters cannot delete elements
- False positive rate increases as the filter fills, so it must be sized correctly upfront
- Not suitable when exact answers are required (e.g., financial calculations)
- Computing multiple hash functions adds latency to every operation
🧪 Test Your Understanding
Can a Bloom filter produce false negatives?