Database sharding and partitioning

🧩 Key Takeaways

1. Sharding means splitting data across multiple database nodes to distribute load.
2. Partition key choice determines data distribution and query performance; it is the most critical decision.
3. Consistent hashing lets you add or remove nodes with minimal data movement.
4. Cross-shard queries and transactions are expensive, so design to minimize them.

When You Need Sharding

A single database node has finite capacity. As a rough rule of thumb that varies widely with hardware, schema, and workload, one node handles on the order of 50K QPS for reads, 10-20K QPS for writes, and terabytes of storage. When your data or load exceeds these limits, you must distribute data across multiple nodes. This is sharding, also called horizontal partitioning.

Sharding is one of the most impactful and complex decisions in system design. Getting the shard key wrong can cause hot spots, expensive cross-shard queries, and painful re-sharding.
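As a back-of-the-envelope illustration of the capacity figures above, the sketch below estimates how many nodes a workload needs. The function name `min_shards`, the 2 TB per-node storage figure, and the 70% headroom factor are assumptions made for this example, not values from the text, and real sizing depends heavily on the workload.

```python
import math

def min_shards(read_qps, write_qps, storage_tb,
               node_read_qps=50_000,    # ballpark read capacity from the text
               node_write_qps=15_000,   # midpoint of the 10-20K write range
               node_storage_tb=2.0,     # assumption: the text only says "terabytes"
               headroom=0.7):           # assumption: plan to run nodes at 70% of capacity
    """Estimate the node count from whichever resource runs out first."""
    needed = [
        read_qps / (node_read_qps * headroom),
        write_qps / (node_write_qps * headroom),
        storage_tb / (node_storage_tb * headroom),
    ]
    return max(1, math.ceil(max(needed)))

# Hypothetical workload: read-heavy, so reads decide the shard count.
print(min_shards(read_qps=200_000, write_qps=30_000, storage_tb=5))
```

The point of the sketch is that the tightest resource, not the average, sets the number of shards.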

Sharding Strategies

Range-based sharding partitions data by value ranges. Example: users A-M on Shard 1 and N-Z on Shard 2, or orders from January on Shard 1 and orders from February on Shard 2.

Pros: Range queries (for example, find all orders in a date range) are efficient because they hit only the relevant shards.

Cons: Hot spots if data is unevenly distributed. Time-based sharding makes the 'current month' shard a bottleneck.
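A minimal sketch of range-based routing, assuming the A-M / N-Z split from the example above. The names `BOUNDARIES`, `shard_for`, and `shards_for_range` are hypothetical, and shards are numbered from 0 here, so index 0 corresponds to Shard 1 in the text.

```python
import bisect

# Sorted split points: keys below "n" go to shard 0, keys from "n" upward to shard 1.
BOUNDARIES = ["n"]

def shard_for(key: str) -> int:
    """Return the index of the shard whose range contains the key."""
    return bisect.bisect_right(BOUNDARIES, key.lower())

def shards_for_range(lo: str, hi: str) -> list[int]:
    """Return the shards that a range query over [lo, hi] has to touch."""
    return list(range(shard_for(lo), shard_for(hi) + 1))

print(shard_for("alice"))           # a user in the A-M range
print(shard_for("nina"))            # a user in the N-Z range
print(shards_for_range("a", "c"))   # narrow range: a single shard
print(shards_for_range("k", "p"))   # range spanning the split point: both shards
```

Because the boundaries are sorted, a range query maps to a contiguous run of shards, which is why range sharding keeps such queries cheap.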

Consistent Hashing Ring
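The sketch below is one simple way to build a consistent hashing ring and compare it with hash-mod sharding when a fifth node is added. The class `HashRing`, the node names, the use of MD5 for ring positions, and the choice of 100 virtual nodes per physical node are all assumptions for illustration; production systems differ in these details.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Stable 64-bit ring position (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place several virtual points for the node around the ring."""
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        """Drop the node's points; its keys fall through to the next points on the ring."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the node owning the first ring position at or after the key's hash."""
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

def mod_shard(key, n):
    """Simple hash-mod sharding for comparison."""
    return _hash(key) % n

keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.lookup(k) for k in keys}

ring.add("node-e")  # grow from 4 to 5 nodes
moved_ring = sum(before[k] != ring.lookup(k) for k in keys)
moved_mod = sum(mod_shard(k, 4) != mod_shard(k, 5) for k in keys)
print(f"consistent hashing moved {moved_ring} keys, hash-mod moved {moved_mod}")
```

With the ring, only keys that now belong to the new node move, which is roughly one fifth of them in this setup. With hash-mod, changing the divisor from 4 to 5 remaps most keys, so most of the data would have to be copied between nodes.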

⚠️ Shard Key Anti-Patterns

Don't shard on: auto-incrementing IDs (with range-based sharding, all new writes go to the last shard), timestamps (the current time range is always hot), or country codes (a populous country's shard, such as the USA, can be something like 10x larger than the others). Good shard keys: user_id (close to uniform distribution when hashed, and most queries are per-user), tenant_id (multi-tenant SaaS), composite keys (user_id + date).
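To make the first anti-pattern concrete, the sketch below counts where 10,000 new writes land under two schemes. The shard count, the block size of one million IDs per shard, and the function names are hypothetical values chosen for the example.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4
ROWS_PER_RANGE = 1_000_000  # assumption: each shard owns a block of 1M sequential IDs

def range_shard(order_id: int) -> int:
    """Range sharding on an auto-incrementing ID; the last shard takes everything beyond its block."""
    return min(order_id // ROWS_PER_RANGE, NUM_SHARDS - 1)

def hash_shard(user_id: int) -> int:
    """Hash sharding on user_id."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# New order IDs continue from wherever the sequence currently is.
new_order_ids = range(3_500_000, 3_510_000)
# The same number of writes, keyed by the user who placed them.
user_ids = range(10_000)

by_range = Counter(range_shard(i) for i in new_order_ids)
by_hash = Counter(hash_shard(u) for u in user_ids)

print("range on auto-increment ID:", dict(by_range))
print("hash on user_id:", dict(sorted(by_hash.items())))
```

Every new order ID falls in the newest range, so one shard absorbs all the writes, while hashing user_id spreads the same writes roughly evenly across the four shards.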

Advantages

• Enables horizontal scaling beyond single-node limits
• Distributes both storage and query load
• Consistent hashing minimizes data movement

Disadvantages

• Cross-shard joins and transactions are expensive
• Wrong shard key creates hot spots
• Re-sharding is operationally painful
• Adds significant application complexity
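The cost of cross-shard queries can be sketched with in-memory lists standing in for database nodes. All names here (`shards`, `insert_order`, `orders_for_user`, `top_orders`) are hypothetical. A query that filters on the shard key contacts one shard, while a query without it has to scatter to every shard and merge the partial results.

```python
import hashlib

NUM_SHARDS = 4
# In-memory lists stand in for separate database nodes.
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(user_id: int) -> int:
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def insert_order(user_id: int, amount: float) -> None:
    shards[shard_for(user_id)].append({"user_id": user_id, "amount": amount})

def orders_for_user(user_id: int):
    """The shard key is in the filter, so exactly one shard is contacted."""
    rows = [r for r in shards[shard_for(user_id)] if r["user_id"] == user_id]
    return rows, 1  # (rows, number of shards contacted)

def top_orders(n: int):
    """No shard key in the filter: ask every shard for its top n, then merge."""
    partials = []
    for shard in shards:
        partials.extend(sorted(shard, key=lambda r: r["amount"], reverse=True)[:n])
    merged = sorted(partials, key=lambda r: r["amount"], reverse=True)[:n]
    return merged, len(shards)

for uid in range(100):
    insert_order(uid, amount=float(uid))
    insert_order(uid, amount=float(uid) + 0.5)

rows, contacted = orders_for_user(42)
print(len(rows), "rows from", contacted, "shard")
top, contacted = top_orders(3)
print([r["amount"] for r in top], "from", contacted, "shards")
```

In a real deployment each contacted shard is a network round trip, and the slowest shard sets the latency of the whole query, which is why designs try to keep the common queries on the single-shard path.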

🧪 Test Your Understanding

Why is consistent hashing preferred over simple hash-mod sharding?