Key Takeaways
- Cache stampede: many threads recompute the same expired key simultaneously → use locking or early recomputation
- Cache avalanche: mass expiration causes all traffic to hit the database → stagger TTLs with jitter
- Cache penetration: requests for non-existent keys always miss the cache → use Bloom filters or cache negative results
- Hot key: a single key gets extreme traffic → replicate it across multiple cache nodes
When Caching Goes Wrong
Caching seems simple: store results, serve from memory. But at scale, subtle problems emerge that can take down your entire system. Understanding these failure modes is essential for designing robust caching architectures.
Cache Failure Patterns
Cache Stampede
Problem: A popular key expires. 1,000 concurrent requests all see a cache miss and all query the database simultaneously, overwhelming it.
Solutions: (1) Distributed lock: only one thread recomputes while the others wait. (2) Early recomputation: refresh the cache before the TTL expires using a background job. (3) Stale-while-revalidate: serve stale data while recomputing in the background.
Cache Avalanche
Problem: Many keys expire at the same time (e.g., all set with the same TTL at startup). Suddenly all traffic hits the database.
Solutions: (1) Add random jitter to TTLs: TTL = base_ttl + random(0, 60 seconds). (2) Use a sliding TTL: reset the TTL on each access. (3) Warm the cache before peak traffic.
Cache Penetration
Problem: Requests for data that doesn't exist in the database (e.g., user_id = -1) always miss the cache and always query the DB.
Solutions: (1) Cache negative results: store a null marker with a short TTL. (2) Bloom filter in front of the cache: quickly check whether a key could exist. (3) Input validation: reject invalid keys before the cache lookup.
Hot Key
Problem: One key (e.g., a celebrity's profile, a trending post) gets millions of requests. Even Redis has per-key throughput limits.
Solutions: (1) Replicate hot keys across multiple cache nodes. (2) Add a local in-memory cache (L1 cache) in the application servers. (3) Split the key into hot_key_1, hot_key_2, ..., hot_key_N and randomly distribute reads.
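Key splitting (solution 3) can be sketched as follows; `FakeRedis` is a toy stand-in for a real client so the sketch is self-contained, and `N_REPLICAS = 8` is an arbitrary tuning choice:

```python
import random

N_REPLICAS = 8  # number of copies of the hot key (tuning assumption)

class FakeRedis:
    # Minimal stand-in for a Redis client, just for this sketch.
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = value  # TTL ignored in this toy
    def get(self, key):
        return self.store.get(key)

def write_hot(client, key, ttl, value):
    # Write the same value under N suffixed keys; in a cluster the
    # suffixes hash to different slots, so the copies land on
    # different nodes.
    for i in range(N_REPLICAS):
        client.setex(f"{key}:{i}", ttl, value)

def read_hot(client, key):
    # Each reader picks a random replica, dividing per-key load by N.
    return client.get(f"{key}:{random.randrange(N_REPLICAS)}")
```

The trade-off is N writes (and N invalidations) per update, so this is worth it only for genuinely hot keys.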
# Without jitter (BAD - all expire at same time)
redis.setex("user:1", 3600, data) # All keys expire at T+3600
redis.setex("user:2", 3600, data)
redis.setex("user:3", 3600, data)
# With jitter (GOOD - spread out expiration)
import random
base_ttl = 3600
jitter = random.randint(0, 300) # 0-5 minutes of randomness
redis.setex("user:1", base_ttl + jitter, data) # Expires T+3600..3900
# Stale-while-revalidate pattern (in outline):
# Store a logical_ttl inside the cached value
# On read: if now > logical_ttl:
#     serve the stale data immediately
#     trigger an async refresh in the background
Advantages
- Each problem has well-known solutions
- Understanding failure modes prevents outages
- Proper handling enables caching at massive scale
Disadvantages
- Solutions add implementation complexity
- Testing cache failure modes is difficult
- Some solutions (distributed locks) add latency
Test Your Understanding
What's the best defense against cache avalanche?