Intermediate · 18 min read · Topic 4.5

Rate Limiting Patterns

Token bucket, leaky bucket, fixed window, sliding window log, sliding window counter, distributed rate limiting

🚦 Key Takeaways

1. Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures
2. Token Bucket is the most common algorithm: it handles bursts while enforcing an average rate
3. Distributed rate limiting requires shared state (Redis) or approximate techniques (local + sync)
4. Always rate limit at multiple layers: API gateway, per-user, per-IP, and per-resource

Why Rate Limiting Matters

Every production API needs rate limiting. Without it, a single misbehaving client, bot, or attacker can overwhelm your service and affect all users. Rate limiting serves four purposes:

1. **Prevent abuse**: stop DDoS attacks and scrapers
2. **Fair usage**: ensure no single tenant monopolizes shared resources
3. **Cost control**: prevent runaway usage from inflating cloud bills
4. **System stability**: protect downstream services from traffic spikes

Rate limiting appears in nearly every system design interview, either as a stand-alone design ("Design a Rate Limiter") or as a component in larger systems.

Rate Limiting Algorithms Compared

| Algorithm | Burst Handling | Memory | Precision | Complexity | Best For |
|---|---|---|---|---|---|
| Token Bucket | Allows bursts up to bucket size | O(1) per key | Approximate | Simple | API rate limiting (most common) |
| Leaky Bucket | Smooths out bursts | O(1) per key | Smooth output | Simple | Traffic shaping, network QoS |
| Sliding Window Log | No bursts beyond limit | O(N) per key | Exact | Higher memory | When precision matters |
| Fixed Window | Allows 2× burst at window edges | O(1) per key | Coarse | Simplest | Simple counters, analytics |
| Sliding Window Counter | Allows slight burst | O(1) per key | Approximate | Moderate | Best balance of precision and memory |

Algorithm Deep Dives

Token Bucket

The most widely used algorithm (AWS, Stripe, Google Cloud).

**How it works:**
- A bucket has capacity C (max tokens) and refill rate R (tokens/second)
- Each request consumes 1 token; if the bucket is empty → reject (429)
- Tokens are refilled at rate R, up to capacity C
- Allows bursts of up to C requests, then a sustained rate of R requests/sec

**Implementation:** Store two values per key: `tokens` (current count) and `last_refill_time`. On each request, calculate the elapsed time, add refill tokens, then check whether a token is available.

**Used by:** AWS API Gateway, Stripe, Shopify, GitHub API.
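The refill-on-read approach described above can be sketched in Python. This is a minimal, single-process illustration; the class and method names are my own, not from any library:

```python
import time

class TokenBucket:
    """In-memory token bucket: capacity C tokens, refilled at R tokens/second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)          # start full, so bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Lazily refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity 5 and refill rate 1/sec, a client can burst 5 requests immediately, then is held to one request per second.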

Leaky Bucket

Similar to the token bucket, but **processes requests at a fixed rate** regardless of the input rate. Think of a bucket with a hole: water (requests) drips out at a constant rate.

**How it works:**
- Requests enter a FIFO queue (the bucket)
- Requests are processed at a fixed rate R
- If the queue is full → reject the request
- No bursts; output is always smooth at rate R

**Difference from Token Bucket:** the token bucket allows bursts; the leaky bucket smooths output. The token bucket is more flexible for API rate limiting.

**Used by:** network traffic shaping, the NGINX rate limiting module.
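A minimal Python sketch of the "leaky bucket as a meter" variant: rather than holding requests in a queue as the shaper described above does, it tracks the water level and rejects when the bucket would overflow. Names are illustrative:

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the 'water' drains at a fixed leak_rate (req/sec)."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.water = 0.0
        self.last_leak = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket at the fixed rate since the last check
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False
```

A true traffic shaper would instead enqueue the request and release it when capacity drains; the meter form is what rate-limiting middleware typically uses.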

Fixed Window

Divide time into fixed windows (e.g., 1 minute) and count requests per window. If the count exceeds the limit → reject.

**Problem: boundary burst.** If the limit is 100/min, a user can send 100 requests at 11:59:59 and 100 more at 12:00:01. That's 200 requests in 2 seconds, yet each window sees only 100. This 2× burst can be problematic.

**Pros:** dead simple. O(1) memory. Easy to implement with Redis INCR + EXPIRE.

**Cons:** boundary burst issue. Not suitable when burst protection is critical.
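The boundary burst is easy to demonstrate with a small Python sketch. The `now` parameter is injected for determinism; names are illustrative:

```python
import time

class FixedWindow:
    """Counter per fixed time window; resets when a new window begins."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # New window: reset the counter
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a 100/min limit, 100 requests at t=59.9s and 100 more at t=60.1s are all accepted: 200 requests in 0.2 seconds, exactly the 2× burst described above.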

Sliding Window Log

Keep a log of all request timestamps. For each new request, remove entries older than the window, count the remaining entries, and check the count against the limit.

**Pros:** perfectly accurate; no boundary burst issue.

**Cons:** O(N) memory per key (stores every timestamp). For high-traffic APIs this can be expensive.

**Optimization:** use Redis sorted sets (ZREMRANGEBYSCORE to trim, ZCARD to count). Still better than fixed window for precision, but more expensive.
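A Python sketch of the timestamp log, using a deque in place of a Redis sorted set (illustrative, not production code; `now` is injected for determinism):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact rate limiter: stores a timestamp for every accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps that have fallen out of the sliding window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Note the O(N) cost: the deque holds one entry per accepted request inside the window, which is exactly why this variant gets expensive at high traffic.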

Sliding Window Counter

A hybrid of fixed window and sliding window log, and the best trade-off.

**How it works:**
- Track the count for the current window and the previous window
- Weighted count = previous_count × overlap% + current_count
- If weighted_count > limit → reject

**Example:** window = 1 min, limit = 100. At 12:00:42 (70% into the current window): weighted = prev_count × 0.3 + curr_count. If > 100 → reject.

**Pros:** O(1) memory. No boundary burst. ~99.9% accurate vs. an exact sliding window.

**Used by:** Cloudflare; the recommended approach in most production rate limiters.
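The weighted-count formula above can be sketched in Python (illustrative names; the clock is passed in as a parameter):

```python
class SlidingWindowCounter:
    """Approximates a sliding window by weighting the previous fixed window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.curr_window = None
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        wid = int(now // self.window)
        if wid != self.curr_window:
            # Roll forward; a gap of more than one window clears the previous count
            self.prev_count = self.curr_count if self.curr_window == wid - 1 else 0
            self.curr_window = wid
            self.curr_count = 0
        # Fraction of the previous window still inside the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        weighted = self.prev_count * overlap + self.curr_count
        if weighted < self.limit:
            self.curr_count += 1
            return True
        return False
```

Matching the article's example: with a full previous window (100 requests) and the clock 42 s into the current minute, the previous window contributes 100 × 0.3 = 30, so only 70 more requests fit before the weighted count reaches the limit.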

Token Bucket in Redis (Production Pattern)
```lua
-- Redis Lua script for atomic token bucket rate limiting
-- Keys: KEYS[1] = rate limit key
-- Args: ARGV[1] = max_tokens, ARGV[2] = refill_rate, ARGV[3] = now (unix ms)

local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])

-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate refill
local elapsed = (now - last_refill) / 1000  -- seconds
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

-- Try to consume a token
if new_tokens >= 1 then
    new_tokens = new_tokens - 1
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('PEXPIRE', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
    return {1, math.floor(new_tokens)}  -- allowed, remaining
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    return {0, 0}  -- rejected, 0 remaining
end
```


💡 Distributed Rate Limiting Strategies

In a multi-server environment, rate limiting must be coordinated:

1. **Centralized (Redis):** single source of truth. ~1 ms overhead per request. Used by 90% of production systems.
2. **Local + Sync:** each server maintains local counters and periodically syncs to a central store. Lower latency, but allows brief over-limit bursts.
3. **Token Bucket per partition:** route the same user to the same server (sticky sessions). No coordination needed, but fails on server changes.
4. **Approximate:** accept N/servers as the local limit per server. Simple, no coordination, tolerates up to N total (acceptable for most use cases).

Common Rate Limits in Production

• GitHub API: 5,000 requests/hour
• Twitter API: 300 requests per 15 minutes
• Stripe API: 100 requests/second
• OpenAI API: 60 requests/minute

Advantages

• Protects services from abusive traffic and DDoS attacks
• Ensures fair resource allocation across tenants in multi-tenant systems
• Token Bucket allows controlled bursts while maintaining the average rate
• Sliding Window Counter offers the best precision-to-memory trade-off

Disadvantages

• Distributed rate limiting adds latency (a Redis round-trip) and complexity
• Clock skew across servers can cause inconsistent enforcement
• Race conditions are possible without atomic operations (Lua scripts or CAS)
• Over-aggressive limits hurt legitimate users; too-lenient limits don't protect

🧪 Test Your Understanding

Knowledge Check 1/4

What's the main problem with Fixed Window rate limiting?