Key Takeaways
- Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures
- Token Bucket is the most common algorithm: it handles bursts while enforcing an average rate
- Distributed rate limiting requires shared state (Redis) or approximate techniques (local counters + periodic sync)
- Always rate limit at multiple layers: API gateway, per-user, per-IP, and per-resource
Why Rate Limiting Matters
Every production API needs rate limiting. Without it, a single misbehaving client, bot, or attacker can overwhelm your service and affect all users. Rate limiting serves four purposes:
1. **Prevent abuse**: stop DDoS attacks and scrapers
2. **Fair usage**: ensure no single tenant monopolizes shared resources
3. **Cost control**: prevent runaway usage from inflating cloud bills
4. **System stability**: protect downstream services from traffic spikes
Rate limiting appears in nearly every system design interview, either as a stand-alone design ("Design a Rate Limiter") or as a component in larger systems.
Rate Limiting Algorithms Compared
| Algorithm | Burst Handling | Memory | Precision | Complexity | Best For |
|---|---|---|---|---|---|
| Token Bucket | Allows bursts up to bucket size | O(1) per key | Approximate | Simple | API rate limiting (most common) |
| Leaky Bucket | Smooths out bursts | O(1) per key | Smooth output | Simple | Traffic shaping, network QoS |
| Fixed Window | Allows 2× burst at window edges | O(1) per key | Coarse | Simplest | Simple counters, analytics |
| Sliding Window Log | No boundary bursts | O(N) per key | Exact | Moderate | When precision matters |
| Sliding Window Counter | Allows slight burst | O(1) per key | Approximate | Moderate | Best balance of precision and memory |
Algorithm Deep Dives
Token Bucket

The most widely used algorithm (AWS, Stripe, Google Cloud).

**How it works:**
- A bucket has capacity C (max tokens) and refill rate R (tokens/second)
- Each request consumes 1 token; if the bucket is empty, reject with HTTP 429
- Tokens are refilled at rate R, up to capacity C
- Allows bursts of up to C requests, then a sustained rate of R requests/sec

**Implementation:** Store two values per key: `tokens` (current count) and `last_refill_time`. On each request, calculate the elapsed time, add refill tokens, then check whether a token is available.

**Used by:** AWS API Gateway, Stripe, Shopify, GitHub API.
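The lazy-refill scheme described above can be sketched as a small in-memory class. This is a simplified single-process sketch; the `TokenBucket` name and `allow` method are illustrative, not from any particular library:

```python
import time

class TokenBucket:
    """In-memory token bucket: capacity C tokens, refilled at R tokens/sec."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity               # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Lazy refill: add tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A bucket of capacity 5 absorbs a burst of 5, then throttles to the refill rate
bucket = TokenBucket(capacity=5, refill_rate=1.0)
```

Note the refill happens on the read path rather than via a background timer, which is what keeps the state per key down to two values.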
Leaky Bucket

Similar to token bucket but **processes requests at a fixed rate** regardless of the input rate. Think of a bucket with a hole: water (requests) drips out at a constant rate.

**How it works:**
- Requests enter a FIFO queue (the bucket)
- Requests are processed at a fixed rate R
- If the queue is full, reject the request
- No bursts: output is always smooth at rate R

**Difference from Token Bucket:** Token bucket allows bursts; leaky bucket smooths output. Token bucket is more flexible for API rate limiting.

**Used by:** Network traffic shaping, the NGINX rate limiting module.
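A common way to implement the bucket-with-a-hole idea without an actual queue is "leaky bucket as a meter": track the current water level and drain it continuously at the leak rate. A hedged sketch (class and method names are illustrative):

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: 'water' accumulates up to `capacity`
    and drains at a fixed `leak_rate` (requests/second)."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.water = 0.0                     # current queue depth
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain water for the elapsed time, never below empty
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1                  # request admitted to the bucket
            return True
        return False                         # bucket full -> reject
```

Unlike the token bucket, a burst here fills the bucket immediately and everything beyond capacity is rejected until the water drains.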
Fixed Window

Divide time into fixed windows (e.g., 1 minute) and count requests per window. If the count exceeds the limit, reject.

**Problem: boundary burst.** If the limit is 100/min, a user can send 100 requests at 11:59:59 and 100 more at 12:00:01. That's 200 requests in 2 seconds, yet each window sees only 100. This 2× burst can be problematic.

**Pros:** Dead simple. O(1) memory. Easy to implement with Redis INCR + EXPIRE.

**Cons:** Boundary burst issue. Not suitable when burst protection is critical.
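A minimal in-memory version of the counter (mirroring the Redis INCR + EXPIRE pattern; names are illustrative) makes the boundary-burst problem easy to demonstrate with explicit timestamps:

```python
class FixedWindowCounter:
    """Fixed window counter: one integer count per window bucket."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}                   # window_start -> request count

    def allow(self, now: float) -> bool:
        # Bucket the timestamp into its window (e.g. 59.9 -> window 0, 60.1 -> window 60)
        window_start = int(now // self.window) * self.window
        count = self.counters.get(window_start, 0)
        if count >= self.limit:
            return False
        self.counters[window_start] = count + 1
        return True

fw = FixedWindowCounter(limit=2, window_seconds=60)
fw.allow(now=59.0)   # True  - window [0, 60)
fw.allow(now=59.5)   # True  - window exhausted
fw.allow(now=59.9)   # False - over the limit
fw.allow(now=60.1)   # True  - fresh window: the boundary burst
```

The two `True` results straddling the 60-second boundary are exactly the 2× burst described above: up to `2 × limit` requests can land within seconds of each other.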
Sliding Window Log

Keep a log of every request timestamp. For each new request, remove entries older than the window, count the remainder, and check against the limit.

**Pros:** Perfectly accurate. No boundary burst issue.

**Cons:** O(N) memory per key (stores every timestamp). For high-traffic APIs this can be expensive.

**Optimization:** Use Redis sorted sets (ZREMRANGEBYSCORE to trim old entries, ZCARD to count). Still more precise than fixed window, but more expensive.
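In a single process, the timestamp log is naturally a deque: evict from the front, append to the back. A hedged sketch (names illustrative), using the same explicit timestamps as before to show that the boundary burst disappears:

```python
import collections

class SlidingWindowLog:
    """Exact sliding window: store every request timestamp, trim old ones."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log = collections.deque()       # timestamps, oldest first

    def allow(self, now: float) -> bool:
        # Evict timestamps that fell out of the trailing window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

swl = SlidingWindowLog(limit=2, window_seconds=60)
swl.allow(now=59.0)   # True
swl.allow(now=59.5)   # True
swl.allow(now=60.1)   # False - still 2 requests in the trailing 60s
```

Crossing the minute boundary no longer resets anything; the third request at 60.1 is rejected because both earlier timestamps are still inside the trailing window. The cost is one stored timestamp per admitted request.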
Sliding Window Counter

A hybrid of fixed window and sliding window log, and the best trade-off.

**How it works:**
- Track a count for the current window and the previous window
- weighted_count = previous_count × overlap% + current_count
- If weighted_count > limit, reject

**Example:** Window = 1 min, limit = 100. At 12:00:42 (70% into the current window): weighted = prev_count × 0.3 + curr_count. If > 100, reject.

**Pros:** O(1) memory. No boundary burst. ~99.9% accurate versus an exact sliding window.

**Used by:** Cloudflare; most production rate limiters recommend this approach.
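The weighted-count formula above can be sketched with just three stored values per key (class and method names are illustrative, not from any particular library):

```python
class SlidingWindowCounter:
    """Approximate sliding window: weight the previous window's count
    by how much of it still overlaps the trailing window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.curr_start = 0.0
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now: float) -> bool:
        window_start = (now // self.window) * self.window
        if window_start != self.curr_start:
            # Rolled into a new window: current becomes previous.
            # If more than one full window passed, previous is effectively 0.
            adjacent = (window_start - self.curr_start == self.window)
            self.prev_count = self.curr_count if adjacent else 0
            self.curr_count = 0
            self.curr_start = window_start
        # Fraction of the previous window still inside the trailing window
        overlap = 1.0 - (now - window_start) / self.window
        weighted = self.prev_count * overlap + self.curr_count
        if weighted >= self.limit:
            return False
        self.curr_count += 1
        return True
```

Replaying the example from the text: with limit 100 and a full previous window, a request arriving 42 s into the next window sees `weighted = 100 × 0.3 + 0 = 30`, well under the limit.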
```lua
-- Redis Lua script for atomic token bucket rate limiting
-- Keys: KEYS[1] = rate limit key
-- Args: ARGV[1] = max_tokens, ARGV[2] = refill_rate, ARGV[3] = now (unix ms)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])

-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate refill
local elapsed = (now - last_refill) / 1000 -- seconds
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

-- Try to consume a token
if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('PEXPIRE', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
  return {1, math.floor(new_tokens)} -- allowed, remaining
else
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  return {0, 0} -- rejected, 0 remaining
end
```
Advantages
- Protects services from abusive traffic and DDoS attacks
- Ensures fair resource allocation across tenants in multi-tenant systems
- Token Bucket allows controlled bursts while maintaining an average rate
- Sliding Window Counter offers the best precision-to-memory trade-off
Disadvantages
- Distributed rate limiting adds latency (a Redis round trip) and complexity
- Clock skew across servers can cause inconsistent enforcement
- Race conditions are possible without atomic operations (Lua scripts or CAS)
- Over-aggressive limits hurt legitimate users; too-lenient limits don't protect
Test Your Understanding
What's the main problem with Fixed Window rate limiting?