Intermediate → Advanced · 22 min read · Topic 8.4

Resilience patterns

Circuit breaker, retry with backoff, timeout, bulkhead, fallback, health checks

🛡️ Key Takeaways

  1. Circuit Breaker: stops calling a failing service. It moves from closed (normal) → open (fail-fast) → half-open (test).
  2. Retry with exponential backoff + jitter: spreads retries out so a recovering service is not hit by a thundering herd.
  3. Bulkhead: isolate resources per dependency (separate thread pools/connection pools).
  4. Timeout: always set timeouts. A call without a timeout can hang indefinitely, which is a common trigger for cascading failures.

Building Systems That Fail Gracefully

In distributed systems, failures are guaranteed. Resilience patterns prevent individual service failures from cascading across the entire system. At a minimum, every production microservice should implement timeouts, retries with backoff, and circuit breakers.

Closed (Normal)

Requests flow normally. The circuit breaker counts failures. If failures exceed the threshold (e.g., 5 failures in 10 seconds), it transitions to OPEN.
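The closed state described above, together with the open and half-open states named in the Key Takeaways, can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation: the class name, the `reset_timeout_s` cool-down, and the injectable `clock` are assumptions made for the example. Only the "5 failures in 10 seconds" threshold comes from the text, and the sketch trips as soon as that count is reached.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0,
                 reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.reset_timeout_s = reset_timeout_s  # assumed cool-down before a trial call
        self.clock = clock
        self.state = "closed"
        self.failures = []   # timestamps of recent failures
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # cool-down elapsed: allow a trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        if self.state == "half_open":
            # Trial request succeeded: resume normal operation.
            self.state = "closed"
            self.failures.clear()
        return result

    def _on_failure(self, now):
        if self.state == "half_open":
            self._trip(now)  # trial request failed: reopen immediately
            return
        # Keep only failures that fall inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

A caller would wrap each downstream call as `breaker.call(fetch_profile, user_id)` (with `fetch_profile` standing in for any remote call) and treat `CircuitOpenError` as a signal to fail fast or fall back.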

All Resilience Patterns

Timeout

Always set timeouts on external calls. Without a timeout, a slow downstream service makes your service hang while it holds on to threads and connections.

Rule of thumb: set the timeout to 2-3x the p99 latency. If p99 is 200ms, a timeout of 500ms (2.5x) is a reasonable starting point.

Many cascading outages trace back to a missing timeout somewhere in the call chain.
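The rule of thumb and the "always set a timeout" advice can be sketched with `asyncio.wait_for`, which cancels the awaited call once the deadline passes. The helper names, the 2.5x default multiplier, and the `slow_downstream` stand-in are assumptions for illustration, not part of any particular framework.

```python
import asyncio


def timeout_from_p99(p99_s, multiplier=2.5):
    """Rule-of-thumb timeout: a multiple (2-3x) of the observed p99 latency."""
    return p99_s * multiplier


async def call_with_timeout(make_call, timeout_s):
    """Await make_call(), cancelling it if it runs longer than timeout_s."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Normalise to the built-in TimeoutError on all Python versions.
        raise TimeoutError(f"call exceeded {timeout_s:.3f}s") from None


async def slow_downstream():
    # Hypothetical stand-in for a dependency that has stopped responding.
    await asyncio.sleep(5)
    return "late response"
```

With a p99 of 200ms, `timeout_from_p99(0.2)` gives 0.5 seconds, and `asyncio.run(call_with_timeout(slow_downstream, 0.5))` raises `TimeoutError` after about half a second instead of waiting for the hung call.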

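Retry with Backoff

Retry failed requests with increasing delays: 1s, 2s, 4s, 8s... (exponential backoff).

Add random jitter to each delay to prevent a thundering herd, where all clients retry at the same moment and overload the recovering service.

Keep max retries small (3-5), and account for them when choosing the circuit breaker's failure threshold, since each failed attempt can count toward it.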
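A minimal sketch of the retry loop described above. It uses "full jitter" (a random delay between zero and the exponential ceiling), which is one common jitter strategy chosen here as an assumption; the function names, the 30-second cap, and the injectable `sleep` and `rng` parameters are also illustrative.

```python
import random
import time


def backoff_delays(max_retries=4, base_s=1.0, cap_s=30.0, rng=random.random):
    """Yield one delay per retry: exponential growth with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...
        yield rng() * ceiling                        # random point below the ceiling


def retry(fn, max_retries=4, base_s=1.0, cap_s=30.0,
          retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call fn(); on failure wait a jittered, growing delay and try again."""
    delays = backoff_delays(max_retries, base_s, cap_s, rng)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:  # retries exhausted: surface the last error
                raise
            sleep(delay)
```

In practice `retry_on` should be narrowed to errors that are worth retrying (for example connection errors), since retrying a request that fails validation only adds load.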

Bulkhead

Isolate resources per dependency. If Service A calls Service B and C, give each a separate thread pool.

If B hangs and exhausts its threads, C is unaffected: the failure stays isolated.

Named after ship bulkheads that prevent a hull breach from flooding the entire vessel.
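The isolation described above can be sketched with one small `ThreadPoolExecutor` per dependency. The pool names, the pool size of 2, and the `call_dependency` helper are assumptions for illustration; real sizes depend on the traffic each dependency sees.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# One dedicated pool per downstream dependency (sizes are illustrative).
pools = {
    "service_b": ThreadPoolExecutor(max_workers=2, thread_name_prefix="svc-b"),
    "service_c": ThreadPoolExecutor(max_workers=2, thread_name_prefix="svc-c"),
}


def call_dependency(name, fn, *args, timeout_s=1.0):
    """Run fn on the pool reserved for `name` and wait at most timeout_s.

    If every worker in that pool is stuck, the task queues and this call
    times out, but pools belonging to other dependencies are untouched.
    """
    future = pools[name].submit(fn, *args)
    return future.result(timeout=timeout_s)
```

If both `service_b` workers are blocked on a hung call, further calls to B time out in the caller, while `call_dependency("service_c", ...)` still runs on its own threads.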

Fallback

When a service call still fails after retries, or is rejected by an open circuit breaker, return a degraded response instead of an error.

Examples: cached results, default values, reduced functionality. Netflix shows generic recommendations when the personalization service is down.
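A minimal sketch of the fallback idea, using the recommendations example: try the personalized call, and on failure serve the last cached result or a generic default. The function name, the in-memory cache, and the default list are all hypothetical.

```python
# Hypothetical generic list shown when nothing better is available.
DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2", "popular-3"]

_last_good = {}  # user_id -> most recent successful result


def get_recommendations(user_id, fetch_personalized):
    """Return (items, degraded).

    degraded is True when the response came from the cache or the defaults
    rather than from the personalization service.
    """
    try:
        items = fetch_personalized(user_id)
    except Exception:
        # Degrade gracefully: cached result if we have one, else defaults.
        return _last_good.get(user_id, DEFAULT_RECOMMENDATIONS), True
    _last_good[user_id] = items
    return items, False
```

Returning the `degraded` flag alongside the data lets callers log or display that a fallback was used, which helps keep a failing dependency visible rather than silently hidden.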

Advantages

  • Circuit breakers prevent cascading failures
  • Graceful degradation maintains partial service
  • Patterns compose well together

Disadvantages

  • Adding resilience patterns increases code complexity
  • Incorrect thresholds can mask real issues
  • Testing failure scenarios is difficult

🧪 Test Your Understanding

What happens when a circuit breaker is OPEN?