Reliability and availability

🛡️Key Takeaways

1. Availability = uptime percentage (99.9% ≈ 8.76 hours of downtime per year); reliability = probability that the system works correctly
2. SLA/SLO/SLI form the measurement framework: SLIs measure, SLOs set targets, SLAs turn targets into contracts
3. Redundancy is the primary tool: replicate everything that can fail
4. Graceful degradation beats total failure: serve degraded results rather than errors

Availability vs Reliability

Availability and reliability are related but distinct concepts. Availability measures what percentage of time the system is operational and serving traffic. Reliability measures whether the system produces correct results when it is operational.

A system can be highly available but unreliable (it's always up but gives wrong answers). It can be reliable but unavailable (when it works, it's perfect, but it crashes often). Great systems are both.

Availability Nines

The cost of each '9'

Level	Availability %	Downtime/year	Downtime/month	Typical for
Two 9s	99%	3.65 days	7.3 hours	Internal tools
Three 9s	99.9%	8.76 hours	43.8 minutes	SaaS products
Four 9s	99.99%	52.6 minutes	4.38 minutes	E-commerce, banking
Five 9s	99.999%	5.26 minutes	26.3 seconds	Telecom, critical infra
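
The downtime figures in the table follow directly from the availability percentage. As a minimal sketch (the function name `allowed_downtime_seconds` is chosen here for illustration, and a 365-day year with months of equal length is assumed), the arithmetic can be reproduced like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year


def allowed_downtime_seconds(availability_pct, period_seconds=SECONDS_PER_YEAR):
    """Return the downtime budget, in seconds, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return (1 - availability_pct / 100) * period_seconds


for pct in (99.0, 99.9, 99.99, 99.999):
    yearly = allowed_downtime_seconds(pct)
    monthly = yearly / 12  # assumption: twelve equal months
    print(f"{pct}%: {yearly / 3600:.2f} h/year, {monthly / 60:.2f} min/month")
```

Passing a different `period_seconds` gives the budget for any other window, such as a 30-day SLA period.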

Strategies for High Availability

Redundancy comes first: everything that can fail should have a backup. That means multiple app servers behind a load balancer, database replicas, multi-AZ deployment, and redundant network paths.

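The formula for combined availability: if each server is 99.9% available and the servers fail independently, two redundant servers (for example, an active-passive pair with reliable failover) are both down only 0.001 × 0.001 of the time, so combined availability = 1 - (0.001)² = 0.999999, or 99.9999%.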
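
The same calculation generalizes to any number of redundant copies. The sketch below is illustrative only: `parallel_availability` is a name chosen here, and it assumes failures are independent and failover is instantaneous, which real systems only approximate.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability of redundant components: up if at least one is up.

    Assumes independent failures; correlated failures (shared power,
    shared deploys) make the real number lower.
    """
    # The system is down only when every component is down at once.
    return 1 - prod(1 - a for a in availabilities)


pair = parallel_availability([0.999, 0.999])
triple = parallel_availability([0.999, 0.999, 0.999])
print(f"two servers:   {pair:.6%}")
print(f"three servers: {triple:.9%}")
```

Independence is the assumption that carries the result: two servers that share a rack, a config push, or a bug tend to fail together, and the formula then overstates availability.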

Graceful degradation is the second strategy: when a component fails, serve a degraded experience rather than an error. Show cached results when the database is slow. Disable recommendations but still show content. Skip analytics but still process orders.

Netflix's approach: if the recommendation engine fails, show trending content instead of an error page.
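
A fallback of this kind can be sketched in a few lines. Everything named here (`fetch_recommendations`, `fetch_trending`, `homepage_rows`) is hypothetical and stands in for real service calls; it is not Netflix's implementation, only an illustration of the pattern.

```python
def fetch_recommendations(user_id):
    """Hypothetical personalized call; always fails here to simulate an outage."""
    raise TimeoutError("recommendation engine unavailable")


def fetch_trending():
    """Hypothetical cheap fallback, e.g. a precomputed or cached list."""
    return ["trending-1", "trending-2", "trending-3"]


def homepage_rows(user_id):
    """Return (items, degraded) so a recommendation failure never reaches the user."""
    try:
        return fetch_recommendations(user_id), False
    except (TimeoutError, ConnectionError):
        # Degrade: serve generic content rather than an error page.
        return fetch_trending(), True


items, degraded = homepage_rows(user_id=42)
print(items, degraded)
```

Returning the `degraded` flag alongside the items lets the caller log or count how often the fallback is served, so a silent outage in the recommendation path still shows up in monitoring.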

Chaos engineering is the third strategy: proactively inject failures to discover weaknesses before real outages do. Netflix's Chaos Monkey randomly kills production instances. Amazon's Game Day simulates region failures.

The principle: if you're afraid to test it, you're not confident it works.
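
The idea can be tried at small scale inside a test suite. The sketch below is a toy, in-process fault injector, not how Chaos Monkey works: `FaultInjector`, `resilient_call`, and `handler` are hypothetical names, and the 30% failure rate is an arbitrary choice for the demonstration.

```python
import random


class FaultInjector:
    """Randomly fails calls at a given rate; a stand-in for killing instances."""

    def __init__(self, failure_rate, seed=None):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def call(self, func, *args, **kwargs):
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)


def resilient_call(injector, replicas, request):
    """Try each replica in turn; succeed if any one of them answers."""
    for replica in replicas:
        try:
            return injector.call(replica, request)
        except ConnectionError:
            continue  # this replica "died"; try the next one
    raise RuntimeError("all replicas failed")


def handler(request):
    return f"ok:{request}"


injector = FaultInjector(failure_rate=0.3, seed=7)
replicas = [handler, handler, handler]
failures = 0
for i in range(1000):
    try:
        resilient_call(injector, replicas, i)
    except RuntimeError:
        failures += 1
print(f"{failures} of 1000 requests failed despite 3 replicas")
```

Running the loop with one replica instead of three shows how much of the failure rate the redundancy absorbs, which is the kind of evidence a failure-injection exercise is meant to produce.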

Availability Math

•99.9%: 8.76 hrs downtime/yr
•99.99%: 52.6 min downtime/yr
•99.999%: 5.26 min downtime/yr
•10x cost per extra 9

Advantages

•High availability builds user trust and retention
•Redundancy prevents catastrophic single-point failures
•SLO/SLI framework provides clear, measurable targets

Disadvantages

•Each additional 'nine' costs exponentially more
•Redundancy increases infrastructure cost and complexity
•Achieving five 9s requires extreme engineering discipline
