Beginner · 20 min read · Topic 1.2

Reliability and availability

SLAs, fault tolerance, redundancy, graceful degradation, chaos engineering

🛡️ Key Takeaways

  1. Availability = uptime percentage (99.9% = 8.76 hours of downtime/year); Reliability = probability of working correctly
  2. SLI/SLO/SLA form the measurement framework: the SLI (indicator) measures → the SLO (objective) sets the target → the SLA (agreement) is the contract
  3. Redundancy is the primary tool: replicate everything that can fail
  4. Graceful degradation > total failure: serve degraded results rather than errors
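
To make the SLI → SLO → SLA chain concrete, here is a minimal sketch. The request counts and the 99.9% / 99.5% thresholds are made-up illustrative values, and `availability_sli` is a hypothetical helper, not part of any particular monitoring tool.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


SLO = 0.999  # internal target (assumed value)
SLA = 0.995  # contractual promise, usually looser than the SLO (assumed value)

sli = availability_sli(good_requests=997_000, total_requests=1_000_000)
slo_met = sli >= SLO
sla_met = sli >= SLA

print(f"SLI = {sli:.3%}, SLO met: {slo_met}, SLA met: {sla_met}")
```

In this example the measured SLI misses the internal target while still honoring the contract, which is the usual reason to set the SLO stricter than the SLA.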

Availability vs Reliability

Availability and reliability are related but distinct concepts. Availability measures what percentage of time the system is operational and serving traffic. Reliability measures whether the system produces correct results when it is operational.

A system can be highly available but unreliable (it's always up but gives wrong answers). It can be reliable but unavailable (when it works, it's perfect, but it crashes often). Great systems are both.
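
The distinction can be sketched numerically. The example below assumes a hypothetical `WindowStats` record for one day of traffic; the field names and numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    total_seconds: int      # length of the observation window
    up_seconds: int         # time the system was serving traffic
    responses_served: int   # responses returned while up
    correct_responses: int  # responses that were actually right


def availability(s: WindowStats) -> float:
    """Share of the window during which the system was operational."""
    return s.up_seconds / s.total_seconds


def reliability(s: WindowStats) -> float:
    """Share of served responses that were correct."""
    return s.correct_responses / s.responses_served


# Always up, but one response in ten is wrong.
always_up_but_wrong = WindowStats(86_400, 86_400, 10_000, 9_000)
# Perfect answers, but down for 10% of the day.
correct_but_flaky = WindowStats(86_400, 77_760, 9_000, 9_000)

for name, stats in [("available/unreliable", always_up_but_wrong),
                    ("reliable/unavailable", correct_but_flaky)]:
    print(f"{name}: availability={availability(stats):.1%}, "
          f"reliability={reliability(stats):.1%}")
```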

Availability Nines

The cost of each '9'
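| Level    | Availability % | Downtime/year | Downtime/month | Typical for             |
|----------|----------------|---------------|----------------|-------------------------|
| Two 9s   | 99%            | 3.65 days     | 7.3 hours      | Internal tools          |
| Three 9s | 99.9%          | 8.76 hours    | 43.8 minutes   | SaaS products           |
| Four 9s  | 99.99%         | 52.6 minutes  | 4.38 minutes   | E-commerce, banking     |
| Five 9s  | 99.999%        | 5.26 minutes  | 26.3 seconds   | Telecom, critical infra |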
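
The downtime figures follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year and a month defined as one twelfth of it:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; leap years ignored for simplicity


def downtime_hours(availability_pct: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime, in hours, for a given availability over a period."""
    return (1 - availability_pct / 100) * period_hours


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(pct)
    per_month = downtime_hours(pct, HOURS_PER_YEAR / 12)
    print(f"{pct}%: {per_year:.2f} h/year, {per_month * 60:.2f} min/month")
```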

Strategies for High Availability

Redundancy: everything that can fail should have a backup, for example multiple app servers behind a load balancer, database replicas, multi-AZ deployment, and redundant network paths.

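The formula for combined availability: if each server is 99.9% available and the two fail independently, a redundant pair (for example active-passive with fast failover) is down only when both servers are down, so its availability is 1 - (0.001)² = 0.999999, or 99.9999%.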
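
The same calculation generalizes to any number of redundant components. The sketch below assumes independent failures and instantaneous failover, which real systems only approximate; the `series` helper is not discussed in the text and is included only for contrast with chained dependencies.

```python
from math import prod


def parallel(*availabilities: float) -> float:
    """Redundant components: the system is down only if all of them are down."""
    return 1 - prod(1 - a for a in availabilities)


def series(*availabilities: float) -> float:
    """Chained dependencies: the system is up only if all of them are up."""
    return prod(availabilities)


print(f"Two redundant 99.9% servers: {parallel(0.999, 0.999):.4%}")
print(f"Two chained 99.9% services:  {series(0.999, 0.999):.4%}")
```

Redundancy raises availability above that of any single component, while every extra dependency in the request path lowers it.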

Graceful degradation: when a component fails, serve a degraded experience rather than an error. Show cached results when the database is slow. Disable recommendations but still show content. Skip analytics but still process orders.

Netflix's approach: if the recommendation engine fails, show trending content instead of an error page.
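
A fallback of this kind can be sketched in a few lines. This is an illustration of the pattern only, not Netflix's implementation; `recommendations_with_fallback`, `broken_engine`, and `trending_titles` are hypothetical names.

```python
from typing import Callable, List


def recommendations_with_fallback(
    user_id: str,
    personalized: Callable[[str], List[str]],
    trending: Callable[[], List[str]],
) -> List[str]:
    """Return personalized results, or generic trending content on failure."""
    try:
        return personalized(user_id)
    except Exception:
        # Degrade instead of failing: generic content beats an error page.
        # Production code would catch narrower exceptions and log the failure.
        return trending()


def broken_engine(user_id: str) -> List[str]:
    raise TimeoutError("recommendation engine unavailable")


def trending_titles() -> List[str]:
    return ["Title A", "Title B"]


print(recommendations_with_fallback("user-42", broken_engine, trending_titles))
```

Passing the primary and fallback sources as callables keeps the degradation logic in one place and makes the failure path easy to exercise in tests.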

Chaos engineering: proactively inject failures to discover weaknesses before real outages do. Netflix's Chaos Monkey randomly kills production instances. Amazon's Game Day exercises simulate region failures.

The principle: if you're afraid to test it, you're not confident it works.
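
The idea can be tried out on a toy model before touching real infrastructure. The sketch below is an in-memory simulation with hypothetical `Instance` and `Pool` classes; it is not how Chaos Monkey itself works, only an illustration of killing a random instance and checking that the service still answers.

```python
import random


class Instance:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Pool:
    """A trivial load balancer: route to the first healthy instance."""

    def __init__(self, instances: list) -> None:
        self.instances = instances

    def handle(self, request: str) -> str:
        for instance in self.instances:
            if instance.alive:
                return instance.handle(request)
        raise RuntimeError("no healthy instances")


def kill_random_instance(pool: Pool, rng: random.Random) -> Instance:
    """Inject a failure by taking down one randomly chosen live instance."""
    victim = rng.choice([i for i in pool.instances if i.alive])
    victim.alive = False
    return victim


rng = random.Random(0)  # seeded so the experiment is repeatable
pool = Pool([Instance("app-1"), Instance("app-2"), Instance("app-3")])
victim = kill_random_instance(pool, rng)
response = pool.handle("GET /")
print(f"Killed {victim.name}; service still answered: {response}")
```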

Availability Math

  • 99.9%: 8.76 hrs downtime/yr
  • 99.99%: 52.6 min downtime/yr
  • 99.999%: 5.26 min downtime/yr
  • Roughly 10x cost per extra 9 (a common rule of thumb)

Advantages

  • High availability builds user trust and retention
  • Redundancy prevents catastrophic single-point failures
  • SLO/SLI framework provides clear, measurable targets

Disadvantages

  • Each additional 'nine' costs exponentially more
  • Redundancy increases infrastructure cost and complexity
  • Achieving five 9s requires extreme engineering discipline
