Key Takeaways
- Vertical scaling (scale up) = bigger machine; horizontal scaling (scale out) = more machines
- Horizontal scaling is almost always preferred for production systems
- Scalability is measured by throughput (QPS) and latency (p50, p95, p99)
- Diagonal scaling = scale up first, then scale out when you hit limits
What Is Scalability?
Scalability is a system's ability to handle increased load by adding resources. A scalable system can grow from serving 100 users to 100 million users without a fundamental redesign. It's the most discussed topic in system design because every architecture decision either helps or hinders scalability.
There are two fundamental approaches to scaling: vertical (scaling up) and horizontal (scaling out). Understanding when to use each, and their trade-offs, is foundational to every system design discussion.
Scaling Strategy Comparison
| Factor | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Complexity | Low: just upgrade hardware | High: distributed systems challenges |
| Cost | Expensive: high-end servers cost a premium | Cheaper: commodity hardware |
| Limit | Hardware ceiling (biggest machine available) | Virtually unlimited |
| Downtime | Often requires downtime to upgrade | Zero-downtime with rolling updates |
| Data consistency | Simple: single node | Complex: distributed consensus needed |
| Failure impact | Single point of failure | Partial failure only affects some nodes |
| Use case | Database primary, specialized compute | Stateless web/app servers, CDNs |
Deep Dive: Measuring Scalability
Throughput measures how many requests your system handles per second. A well-scaled web tier might handle 100K QPS across 20 servers, while a database might handle 30K QPS on a single instance.
Key insight: throughput should scale linearly (or near-linearly) with the number of nodes. If adding a second server only gives you 1.3x throughput instead of 2x, you have a scaling bottleneck.
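The linearity check is simple arithmetic. A quick sketch (the QPS figures below are made up for illustration, not real measurements):

```python
# Hypothetical measurements: total QPS observed as servers are added.
observed_qps = {1: 5_000, 2: 9_600, 4: 17_800, 8: 29_000}

baseline = observed_qps[1]
for nodes, qps in observed_qps.items():
    # 1.0 means perfectly linear scaling; well below 1.0 means a bottleneck.
    efficiency = qps / (baseline * nodes)
    print(f"{nodes} node(s): {qps} QPS, scaling efficiency {efficiency:.0%}")
```

Here efficiency drops from 96% at 2 nodes to about 73% at 8, which is the kind of curve that signals a shared bottleneck (a database, a lock, a load balancer) rather than linear scale-out.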
Average latency is misleading; use percentiles instead. p50 = half of requests complete faster than this value; p95 = 95% do; p99 = 99% do.
Example: p50 = 50ms, p95 = 200ms, p99 = 1.2s. The p99 is often 10-20x worse than p50 due to GC pauses, contention, or unlucky cache misses.
SLAs are typically defined at p99: 'p99 latency must be under 500ms.'
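Percentiles are straightforward to compute from raw latency samples. A minimal sketch using the nearest-rank method, with a simulated heavy-tailed workload (the distribution parameters are illustrative assumptions):

```python
import random

# Simulate request latencies in ms: mostly fast, plus a heavy tail
# standing in for GC pauses, contention, and cache misses.
random.seed(42)
latencies = [abs(random.gauss(50, 10)) for _ in range(9_500)] + \
            [random.uniform(200, 1_500) for _ in range(500)]

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[idx]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.0f} ms")
```

Even though only 5% of requests hit the slow path, p99 lands deep inside the tail, which is why SLAs pinned to p99 are far stricter than ones pinned to the average.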
In practice, the best approach is diagonal: start by scaling up (it's simpler and faster) until you hit the limits of a single machine or the cost becomes unreasonable. Then scale out.
Example: Start with a single PostgreSQL on a 64-core machine. When you hit 30K QPS, add read replicas (horizontal). When writes become the bottleneck, shard the database.
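The read-replica step above amounts to routing reads and writes differently. A minimal sketch of such a router; the names (`primary-db`, `replica-1`) and the SELECT-based heuristic are illustrative assumptions, not a real driver API:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary; spread reads across replicas round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, query):
        # Naive heuristic: treat anything that isn't a SELECT as a write.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._replica_cycle)
        return self.primary

router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))    # replica-1
print(router.route("INSERT INTO users ..."))  # primary-db
print(router.route("SELECT 1"))               # replica-2
```

Note what this sketch glosses over: replicas lag behind the primary, so a read routed this way may return stale data, which is exactly the consistency cost listed in the table above.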
Advantages
- Horizontal scaling offers virtually unlimited capacity
- Commodity hardware is cheaper than high-end machines
- No single point of failure with proper horizontal design
- On-demand scaling reduces wasted resources
Disadvantages
- Horizontal scaling adds distributed systems complexity
- Data consistency becomes much harder across nodes
- Network partitions create new failure modes
- State management requires careful design
Test Your Understanding
What is diagonal scaling?