Key Takeaways
- Vertical scaling (scale up) = bigger machine; horizontal scaling (scale out) = more machines
- Horizontal scaling is almost always preferred for production systems
- Scalability is measured by throughput (QPS) and latency (p50, p95, p99)
- Diagonal scaling = scale up first, then scale out when you hit limits
What Is Scalability?
Scalability is a system's ability to handle increased load by adding resources. A scalable system can grow from serving 100 users to 100 million users without a fundamental redesign. It's the most discussed topic in system design because every architecture decision either helps or hinders scalability.
There are two fundamental approaches to scaling: vertical (scaling up) and horizontal (scaling out). Understanding when to use each, and their trade-offs, is foundational to every system design discussion.
Scaling Strategy Comparison
| Factor | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Complexity | Low: just upgrade hardware | High: distributed systems challenges |
| Cost | Expensive: high-end servers cost a premium | Cheaper: commodity hardware |
| Limit | Hardware ceiling (biggest machine available) | Virtually unlimited |
| Downtime | Often requires downtime to upgrade | Zero-downtime with rolling updates |
| Data consistency | Simple: single node | Complex: distributed consensus needed |
| Failure impact | Single point of failure | Partial failure only affects some nodes |
| Use case | Database primary, specialized compute | Stateless web/app servers, CDNs |
Deep Dive: Measuring Scalability
Throughput measures how many requests your system handles per second. A well-scaled web tier might handle 100K QPS across 20 servers, while a database might handle 30K QPS on a single instance.
Key insight: throughput should scale linearly (or near-linearly) with the number of nodes. If adding a second server only gives you 1.3x throughput instead of 2x, you have a scaling bottleneck.
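The linearity check is simple arithmetic. A quick sketch (the QPS figures below are made up for illustration, not real measurements):

```python
# Hypothetical measurements: total QPS observed as servers are added.
observed_qps = {1: 5_000, 2: 9_600, 4: 17_800, 8: 29_000}

baseline = observed_qps[1]
for nodes, qps in observed_qps.items():
    # 1.0 means perfectly linear scaling; well below 1.0 means a bottleneck.
    efficiency = qps / (baseline * nodes)
    print(f"{nodes} node(s): {qps} QPS, scaling efficiency {efficiency:.0%}")
```

Here efficiency drops from 96% at 2 nodes to about 73% at 8, which is the kind of curve that signals a shared bottleneck (a database, a lock, a load balancer) rather than linear scale-out.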
Average latency is misleading; use percentiles instead. p50 = half of requests complete faster than this value; p95 = 95% do; p99 = 99% do.
Example: p50 = 50ms, p95 = 200ms, p99 = 1.2s. The p99 is often 10-20x worse than p50 due to GC pauses, contention, or unlucky cache misses.
SLAs are typically defined at p99: 'p99 latency must be under 500ms.'
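Percentiles are straightforward to compute from raw latency samples. A minimal sketch using the nearest-rank method, with a simulated heavy-tailed workload (the distribution parameters are illustrative assumptions):

```python
import random

# Simulate request latencies in ms: mostly fast, plus a heavy tail
# standing in for GC pauses, contention, and cache misses.
random.seed(42)
latencies = [abs(random.gauss(50, 10)) for _ in range(9_500)] + \
            [random.uniform(200, 1_500) for _ in range(500)]

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[idx]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.0f} ms")
```

Even though only 5% of requests hit the slow path, p99 lands deep inside the tail, which is why SLAs pinned to p99 are far stricter than ones pinned to the average.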
In practice, the best approach is diagonal: start by scaling up (it's simpler and faster) until you hit the limits of a single machine or the cost becomes unreasonable. Then scale out.
Example: Start with a single PostgreSQL on a 64-core machine. When you hit 30K QPS, add read replicas (horizontal). When writes become the bottleneck, shard the database.
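The read-replica step above amounts to routing reads and writes differently. A minimal sketch of such a router; the names (`primary-db`, `replica-1`) and the SELECT-based heuristic are illustrative assumptions, not a real driver API:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary; spread reads across replicas round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, query):
        # Naive heuristic: treat anything that isn't a SELECT as a write.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._replica_cycle)
        return self.primary

router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))    # replica-1
print(router.route("INSERT INTO users ..."))  # primary-db
print(router.route("SELECT 1"))               # replica-2
```

Note what this sketch glosses over: replicas lag behind the primary, so a read routed this way may return stale data, which is exactly the consistency cost listed in the table above.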
Advantages
- Horizontal scaling offers virtually unlimited capacity
- Commodity hardware is cheaper than high-end machines
- No single point of failure with proper horizontal design
- On-demand scaling reduces wasted resources
Disadvantages
- Horizontal scaling adds distributed systems complexity
- Data consistency becomes much harder across nodes
- Network partitions create new failure modes
- State management requires careful design
Test Your Understanding
What is diagonal scaling?