🛡️ Key Takeaways
- Availability = uptime percentage (99.9% = 8.76 hours of downtime/year); Reliability = probability of working correctly
- SLA/SLO/SLI form the measurement framework: SLI measures → SLO targets → SLA contracts
- Redundancy is the primary tool: replicate everything that can fail
- Graceful degradation > total failure: serve degraded results rather than errors
Availability vs Reliability
Availability and reliability are related but distinct concepts. Availability measures what percentage of time the system is operational and serving traffic. Reliability measures whether the system produces correct results when it is operational.
A system can be highly available but unreliable (it's always up but gives wrong answers). It can be reliable but unavailable (when it works, it's perfect, but it crashes often). Great systems are both.
Availability Nines
| Level | Availability % | Downtime/year | Downtime/month | Typical for |
|---|---|---|---|---|
| Two 9s | 99% | 3.65 days | 7.3 hours | Internal tools |
| Three 9s | 99.9% | 8.76 hours | 43.8 minutes | SaaS products |
| Four 9s | 99.99% | 52.6 minutes | 4.38 minutes | E-commerce, banking |
| Five 9s | 99.999% | 5.26 minutes | 26.3 seconds | Telecom, critical infra |
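The downtime figures in the table follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_hours_per_year` is illustrative, not from any library):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


# Print the yearly budget for each level in the table.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Dividing the yearly figure by 12 gives the monthly column, which treats a month as roughly 730 hours.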
Strategies for High Availability
Everything that can fail should have a backup: multiple app servers behind a load balancer, database replicas, multi-AZ deployment, redundant network paths.
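The formula for combined availability: if each server is 99.9% available and failures are independent, two redundant servers (for example, active-passive with fast failover) give 1 - (0.001)² = 99.9999%. Correlated failures or slow failover push the real figure lower.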
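The same formula generalizes to any number of redundant servers. The sketch below assumes that failures are independent, that failover is instantaneous, and that one healthy server is enough to serve traffic; `parallel_availability` is a hypothetical helper name:

```python
def parallel_availability(per_server: float, replicas: int) -> float:
    """Availability of a redundant group where any one healthy server suffices.

    Assumes independent failures and instantaneous failover, so the group is
    down only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    all_down = (1 - per_server) ** replicas
    return 1 - all_down


for n in (1, 2, 3):
    print(f"{n} server(s): {parallel_availability(0.999, n):.7%}")
```

In practice shared dependencies (the same rack, region, or deploy pipeline) make failures correlated, so treat the result as an upper bound.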
When a component fails, serve a degraded experience rather than an error. Show cached results when the database is slow. Disable recommendations but still show content. Skip analytics but still process orders.
Netflix's approach: if the recommendation engine fails, show trending content instead of an error page.
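The fallback pattern described above can be sketched in a few lines. This is an illustration only, not Netflix's implementation; `with_fallback`, `personalized_recommendations`, and `trending_content` are hypothetical names, and the primary call is hard-coded to fail in order to simulate an outage:

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]],
                  fallback: Callable[[], List[str]]) -> List[str]:
    """Return the primary result, or the fallback result if the primary raises."""
    try:
        return primary()
    except Exception:
        # Degrade instead of surfacing the error to the user.
        return fallback()


def personalized_recommendations() -> List[str]:
    # Simulated outage of the recommendation engine.
    raise TimeoutError("recommendation engine unavailable")


def trending_content() -> List[str]:
    # Cheap, generic result that does not depend on the failing service.
    return ["Trending title A", "Trending title B"]


rows = with_fallback(personalized_recommendations, trending_content)
print(rows)
```

A production version would typically also set a timeout on the primary call and record that the fallback was used, so the degradation is visible in monitoring.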
Proactively inject failures to discover weaknesses before real outages. Netflix's Chaos Monkey randomly kills production instances. Amazon's Game Day simulates region failures.
The principle: if you're afraid to test it, you're not confident it works.
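To make the idea concrete, here is a toy in-memory simulation of failure injection. It is not how Chaos Monkey works internally; the fleet names and the `kill_random_instances` helper are assumptions made up for this sketch:

```python
import random
from typing import List, Optional


def kill_random_instances(instances: List[str], kills: int,
                          seed: Optional[int] = None) -> List[str]:
    """Return the instances that survive after `kills` random terminations."""
    rng = random.Random(seed)  # a seed makes the experiment repeatable
    victims = set(rng.sample(instances, min(kills, len(instances))))
    return [name for name in instances if name not in victims]


fleet = ["app-1", "app-2", "app-3"]  # hypothetical instance names
survivors = kill_random_instances(fleet, kills=1, seed=42)

# The experiment's hypothesis: losing one instance must not take the service down.
service_up = len(survivors) > 0
print(f"survivors={survivors}, service_up={service_up}")
```

Running the same check against a single-instance fleet leaves no survivors, which is exactly the kind of weakness such an experiment is meant to expose before a real outage does.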
Availability Math
Advantages
- High availability builds user trust and retention
- Redundancy prevents catastrophic single-point failures
- SLO/SLI framework provides clear, measurable targets
Disadvantages
- Each additional 'nine' cuts the downtime budget tenfold and costs disproportionately more to achieve
- Redundancy increases infrastructure cost and complexity
- Achieving five 9s requires extreme engineering discipline
🧪 Test Your Understanding
How much downtime per year does 99.99% availability allow?