Intermediate · 18 min read · Topic 4.5

Rate Limiting Patterns

Token bucket, leaky bucket, fixed window, sliding window log, sliding window counter, distributed rate limiting

🚦 Key Takeaways

1. Rate limiting protects services from abuse, ensures fair usage, and prevents cascading failures
2. Token Bucket is the most common algorithm: it handles bursts while enforcing an average rate
3. Distributed rate limiting requires shared state (Redis) or approximate techniques (local + sync)
4. Always rate limit at multiple layers: API gateway, per-user, per-IP, and per-resource

Why Rate Limiting Matters

Every production API needs rate limiting. Without it, a single misbehaving client, bot, or attacker can overwhelm your service and affect all users. Rate limiting serves four purposes:

1. **Prevent abuse**: stop DDoS attacks and scrapers
2. **Fair usage**: ensure no single tenant monopolizes shared resources
3. **Cost control**: prevent runaway usage from inflating cloud bills
4. **System stability**: protect downstream services from traffic spikes

Rate limiting appears in nearly every system design interview, either as a stand-alone design ("Design a Rate Limiter") or as a component in larger systems.

Rate Limiting Algorithms Compared

| Algorithm | Burst Handling | Memory | Precision | Complexity | Best For |
|---|---|---|---|---|---|
| Token Bucket | Allows bursts up to bucket size | O(1) per key | Approximate | Simple | API rate limiting (most common) |
| Leaky Bucket | Smooths out bursts | O(1) per key | Smooth output | Simple | Traffic shaping, network QoS |
| Sliding Window Log | No bursts beyond limit | O(N) per key | Exact | Higher memory | When precision matters |
| Fixed Window | Allows 2× burst at window edges | O(1) per key | Coarse | Simplest | Simple counters, analytics |
| Sliding Window Counter | Allows slight burst | O(1) per key | Approximate | Moderate | Best balance of precision and memory |

Algorithm Deep Dives

Token Bucket

The most widely used algorithm (AWS, Stripe, Google Cloud).

**How it works:**
- A bucket has capacity C (max tokens) and refill rate R (tokens/second)
- Each request consumes 1 token; if the bucket is empty → reject (429)
- Tokens are refilled at rate R, up to capacity C
- Allows bursts of up to C requests, then a sustained rate of R requests/sec

**Implementation:** Store two values per key: `tokens` (current count) and `last_refill_time`. On each request, calculate the elapsed time, add refill tokens, then check whether a token is available.

**Used by:** AWS API Gateway, Stripe, Shopify, GitHub API.
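The refill-on-read approach described above can be sketched in Python. This is a minimal, single-process illustration; the class and method names are my own, not from any library:

```python
import time

class TokenBucket:
    """In-memory token bucket: capacity C tokens, refilled at R tokens/second."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)          # start full, so bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Lazily refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity 5 and refill rate 1/sec, a client can burst 5 requests immediately, then is held to one request per second.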

Leaky Bucket

Similar to the token bucket, but **processes requests at a fixed rate** regardless of the input rate. Think of a bucket with a hole: water (requests) drips out at a constant rate.

**How it works:**
- Requests enter a FIFO queue (the bucket)
- Requests are processed at a fixed rate R
- If the queue is full → reject the request
- No bursts; output is always smooth at rate R

**Difference from Token Bucket:** the token bucket allows bursts; the leaky bucket smooths output. The token bucket is more flexible for API rate limiting.

**Used by:** network traffic shaping, the NGINX rate limiting module.
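A minimal Python sketch of the "leaky bucket as a meter" variant: rather than holding requests in a queue as the shaper described above does, it tracks the water level and rejects when the bucket would overflow. Names are illustrative:

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the 'water' drains at a fixed leak_rate (req/sec)."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.water = 0.0
        self.last_leak = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket at the fixed rate since the last check
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False
```

A true traffic shaper would instead enqueue the request and release it when capacity drains; the meter form is what rate-limiting middleware typically uses.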

Fixed Window

Divide time into fixed windows (e.g., 1 minute) and count requests per window. If the count exceeds the limit → reject.

**Problem: boundary burst.** If the limit is 100/min, a user can send 100 requests at 11:59:59 and 100 more at 12:00:01. That's 200 requests in 2 seconds, yet each window sees only 100. This 2× burst can be problematic.

**Pros:** dead simple. O(1) memory. Easy to implement with Redis INCR + EXPIRE.

**Cons:** boundary burst issue. Not suitable when burst protection is critical.
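The boundary burst is easy to demonstrate with a small Python sketch. The `now` parameter is injected for determinism; names are illustrative:

```python
import time

class FixedWindow:
    """Counter per fixed time window; resets when a new window begins."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # New window: reset the counter
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a 100/min limit, 100 requests at t=59.9s and 100 more at t=60.1s are all accepted: 200 requests in 0.2 seconds, exactly the 2× burst described above.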

Sliding Window Log

Keep a log of all request timestamps. For each new request, remove entries older than the window, count the remaining entries, and check the count against the limit.

**Pros:** perfectly accurate; no boundary burst issue.

**Cons:** O(N) memory per key (stores every timestamp). For high-traffic APIs this can be expensive.

**Optimization:** use Redis sorted sets (ZREMRANGEBYSCORE to trim, ZCARD to count). Still better than fixed window for precision, but more expensive.
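A Python sketch of the timestamp log, using a deque in place of a Redis sorted set (illustrative, not production code; `now` is injected for determinism):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact rate limiter: stores a timestamp for every accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps that have fallen out of the sliding window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Note the O(N) cost: the deque holds one entry per accepted request inside the window, which is exactly why this variant gets expensive at high traffic.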

Sliding Window Counter

A hybrid of fixed window and sliding window log, and the best trade-off.

**How it works:**
- Track the count for the current window and the previous window
- Weighted count = previous_count × overlap% + current_count
- If weighted_count > limit → reject

**Example:** window = 1 min, limit = 100. At 12:00:42 (70% into the current window): weighted = prev_count × 0.3 + curr_count. If > 100 → reject.

**Pros:** O(1) memory. No boundary burst. ~99.9% accurate vs. an exact sliding window.

**Used by:** Cloudflare; the recommended approach in most production rate limiters.
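The weighted-count formula above can be sketched in Python (illustrative names; the clock is passed in as a parameter):

```python
class SlidingWindowCounter:
    """Approximates a sliding window by weighting the previous fixed window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.curr_window = None
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        wid = int(now // self.window)
        if wid != self.curr_window:
            # Roll forward; a gap of more than one window clears the previous count
            self.prev_count = self.curr_count if self.curr_window == wid - 1 else 0
            self.curr_window = wid
            self.curr_count = 0
        # Fraction of the previous window still inside the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        weighted = self.prev_count * overlap + self.curr_count
        if weighted < self.limit:
            self.curr_count += 1
            return True
        return False
```

Matching the article's example: with a full previous window (100 requests) and the clock 42 s into the current minute, the previous window contributes 100 × 0.3 = 30, so only 70 more requests fit before the weighted count reaches the limit.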

Token Bucket in Redis (Production Pattern)
```lua
-- Redis Lua script for atomic token bucket rate limiting
-- Keys: KEYS[1] = rate limit key
-- Args: ARGV[1] = max_tokens, ARGV[2] = refill_rate, ARGV[3] = now (unix ms)

local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])

-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate refill
local elapsed = (now - last_refill) / 1000  -- seconds
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

-- Try to consume a token
if new_tokens >= 1 then
    new_tokens = new_tokens - 1
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('PEXPIRE', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
    return {1, math.floor(new_tokens)}  -- allowed, remaining
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    return {0, 0}  -- rejected, 0 remaining
end
```


💡 Distributed Rate Limiting Strategies

In a multi-server environment, rate limiting must be coordinated:

1. **Centralized (Redis):** single source of truth. ~1 ms overhead per request. Used by 90% of production systems.
2. **Local + Sync:** each server maintains local counters and periodically syncs to a central store. Lower latency, but allows brief over-limit bursts.
3. **Token Bucket per partition:** route the same user to the same server (sticky sessions). No coordination needed, but fails on server changes.
4. **Approximate:** accept N/servers as the local limit per server. Simple, no coordination, tolerates up to N total (acceptable for most use cases).

Common Rate Limits in Production

• GitHub API: 5,000 requests/hour
• Twitter API: 300 requests per 15 minutes
• Stripe API: 100 requests/second
• OpenAI API: 60 requests/minute

Advantages

• Protects services from abusive traffic and DDoS attacks
• Ensures fair resource allocation across tenants in multi-tenant systems
• Token Bucket allows controlled bursts while maintaining the average rate
• Sliding Window Counter offers the best precision-to-memory trade-off

Disadvantages

• Distributed rate limiting adds latency (a Redis round-trip) and complexity
• Clock skew across servers can cause inconsistent enforcement
• Race conditions are possible without atomic operations (Lua scripts or CAS)
• Over-aggressive limits hurt legitimate users; too-lenient limits don't protect

🧪 Test Your Understanding

Knowledge Check 1/4

What's the main problem with Fixed Window rate limiting?