Key Takeaways
- Cache stampede: many threads recompute the same expired key simultaneously → use locking or early recomputation
- Cache avalanche: mass expiration causes all traffic to hit the database → stagger TTLs with jitter
- Cache penetration: requests for non-existent keys always miss the cache → use Bloom filters or cache negative results
- Hot key: a single key gets extreme traffic → replicate it across multiple cache nodes
When Caching Goes Wrong
Caching seems simple: store results, serve from memory. But at scale, subtle problems emerge that can take down your entire system. Understanding these failure modes is essential for designing robust caching architectures.
Cache Failure Patterns
Cache Stampede
Problem: A popular key expires. 1,000 concurrent requests all see a cache miss and all query the database simultaneously, overwhelming it.
Solutions: (1) Distributed lock: only one thread recomputes while the others wait. (2) Early recomputation: refresh the cache before the TTL expires using a background job. (3) Stale-while-revalidate: serve stale data while recomputing in the background.
Cache Avalanche
Problem: Many keys expire at the same time (e.g., all set with the same TTL at startup). Suddenly all traffic hits the database.
Solutions: (1) Add random jitter to TTLs: TTL = base_ttl + random(0, 60 seconds). (2) Use a sliding TTL: reset the TTL on each access. (3) Warm the cache before peak traffic.
Cache Penetration
Problem: Requests for data that doesn't exist in the database (e.g., user_id = -1) always miss the cache and always query the DB.
Solutions: (1) Cache negative results: store a null marker with a short TTL. (2) Bloom filter in front of the cache: quickly check whether a key could exist. (3) Input validation: reject invalid keys before the cache lookup.
Hot Key
Problem: One key (e.g., a celebrity's profile, a trending post) gets millions of requests. Even Redis has per-key throughput limits.
Solutions: (1) Replicate hot keys across multiple cache nodes. (2) Add a local in-memory cache (L1 cache) in the application servers. (3) Split the key into hot_key_1, hot_key_2, ..., hot_key_N and randomly distribute reads.
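Key splitting (solution 3) can be sketched as follows; `FakeRedis` is a toy stand-in for a real client so the sketch is self-contained, and `N_REPLICAS = 8` is an arbitrary tuning choice:

```python
import random

N_REPLICAS = 8  # number of copies of the hot key (tuning assumption)

class FakeRedis:
    # Minimal stand-in for a Redis client, just for this sketch.
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = value  # TTL ignored in this toy
    def get(self, key):
        return self.store.get(key)

def write_hot(client, key, ttl, value):
    # Write the same value under N suffixed keys; in a cluster the
    # suffixes hash to different slots, so the copies land on
    # different nodes.
    for i in range(N_REPLICAS):
        client.setex(f"{key}:{i}", ttl, value)

def read_hot(client, key):
    # Each reader picks a random replica, dividing per-key load by N.
    return client.get(f"{key}:{random.randrange(N_REPLICAS)}")
```

The trade-off is N writes (and N invalidations) per update, so this is worth it only for genuinely hot keys.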
# Without jitter (BAD - all expire at same time)
redis.setex("user:1", 3600, data) # All keys expire at T+3600
redis.setex("user:2", 3600, data)
redis.setex("user:3", 3600, data)
# With jitter (GOOD - spread out expiration)
import random
base_ttl = 3600
jitter = random.randint(0, 300) # 0-5 minutes of randomness
redis.setex("user:1", base_ttl + jitter, data) # Expires T+3600..3900
# Stale-while-revalidate pattern (in outline):
# Store a logical_ttl inside the cached value
# On read: if now > logical_ttl:
#     serve the stale data immediately
#     trigger an async refresh in the background
Advantages
- Each problem has well-known solutions
- Understanding failure modes prevents outages
- Proper handling enables caching at massive scale
Disadvantages
- Solutions add implementation complexity
- Testing cache failure modes is difficult
- Some solutions (distributed locks) add latency
Test Your Understanding
What's the best defense against cache avalanche?