Unique ID generation

🔑Key Takeaways

1
Auto-increment IDs fail at distributed scale — you need globally unique, decentralized ID generators
2
Snowflake IDs (64-bit: timestamp + datacenter + machine + sequence) generate 4096 IDs/ms/machine and are time-sortable
3
UUID v4 is simple but not sortable and wastes storage (128 bits). ULID combines UUID's universality with sortability
4
Choose your ID scheme based on: sortability, size, collision risk, information leakage, and database index friendliness

Why Is This Hard?

In a single-server system, auto-increment works perfectly. But in distributed systems with multiple database writers, coordinating sequential IDs requires synchronization — which kills throughput. The challenge: generate IDs that are **globally unique**, **roughly sortable by time**, **compact**, and **require no coordination** between nodes.

This comes up in almost every system design interview — "How would you generate unique IDs for URLs, messages, posts, or orders at scale?"

ID Generation Schemes Compared

Scheme	Size	Sortable	Decentralized	Format	Best For
Auto-increment	32-64 bits	Yes	No (single writer)	1, 2, 3...	Small, single-DB systems
UUID v4	128 bits	No	Yes	550e8400-e29b-41d4...	Simple distributed systems
UUID v7 (2024)	128 bits	Yes	Yes	timestamp + random	Modern distributed IDs
Snowflake ID	64 bits	Yes	Yes	timestamp+dc+machine+seq	High-throughput at scale
ULID	128 bits	Yes	Yes	01ARZ3NDEKTSV4RRFF...	DB-friendly sortable IDs
NanoID	Variable	No	Yes	V1StGXR8_Z5jdHi6B-myT	URL-safe short IDs
Sonyflake	63 bits	Yes	Yes	10ms precision variant	Lower throughput, longer lifespan

Twitter Snowflake ID Structure (64 bits)

Deep Dive Into Each Scheme

UUIDs are 128-bit identifiers typically represented as 32 hex digits: 550e8400-e29b-41d4-a716-446655440000. Version 4 uses random/pseudo-random generation. **Pros:** Zero coordination needed. Built into every language. Universally understood. **Cons:** 128 bits is large (wastes index space). Not sortable — causes B-tree page splits in databases (terrible for write perf). Encodes no information. ~2^122 possible values means collision is theoretically possible but astronomically unlikely (p = 10^-37 for 1 billion IDs). **When to use:** Prototypes, low-scale systems, when sortability doesn't matter, or when you need guaranteed uniqueness across disconnected systems.

Created by Twitter in 2010. A 64-bit ID structured as: 1 sign bit + 41 timestamp bits + 5 DC bits + 5 machine bits + 12 sequence bits. **Key properties:** - Time-sortable: IDs generated later have higher values - Compact: fits in int64 (half the size of UUID) - High throughput: 4,096 IDs per millisecond per machine - Decentralized: no coordinator needed - 69 years of usable range from custom epoch **Clock skew problem:** If a machine's clock goes backwards, you'd generate duplicate IDs. Solution: reject requests until clock catches up, or use logical clock + sequence instead. **Adopted by:** Twitter, Discord (modified), Instagram (modified), Spotify.

ULID = 48-bit timestamp (ms) + 80-bit random. Encoded as Crockford Base32 (26 characters). Example: 01ARZ3NDEKTSV4RRFFQ69G5FAV. **Why ULID over UUID?** - Lexicographically sortable (string comparison = time order) - 1.21e+24 unique ULIDs per millisecond - Monotonically increasing within millisecond (random part incremented) - DB-friendly: sequential writes, no B-tree fragmentation - URL-safe: no special characters **When to use:** When you want UUID-level uniqueness with sortability and database friendliness.

**PostgreSQL:** Use UUID v7 (pg 17+) or bigserial + logical sharding. **MySQL:** AUTO_INCREMENT with different offsets per shard (shard1: 1,3,5... shard2: 2,4,6...). **MongoDB:** ObjectId is already a 12-byte, time-sortable ID (4 timestamp + 5 random + 3 counter). **DynamoDB:** Client-generated UUIDs or KSUIDs. **The Ticket Server approach (Flickr):** Dedicated MySQL instances that only generate IDs via AUTO_INCREMENT. Simple but creates a single point of failure (solved with 2 servers using odd/even).

Snowflake ID Generator (Simplified)

import time, threading

class SnowflakeGenerator:
    EPOCH = 1288834974657  # Twitter epoch (Nov 4, 2010)
    
    def __init__(self, datacenter_id, machine_id):
        self.datacenter_id = datacenter_id & 0x1F  # 5 bits
        self.machine_id = machine_id & 0x1F         # 5 bits
        self.sequence = 0
        self.last_timestamp = -1
        self.lock = threading.Lock()
    
    def generate(self):
        with self.lock:
            timestamp = int(time.time() * 1000) - self.EPOCH
            
            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                if self.sequence == 0:  # Overflow — wait for next ms
                    while timestamp <= self.last_timestamp:
                        timestamp = int(time.time() * 1000) - self.EPOCH
            else:
                self.sequence = 0
            
            self.last_timestamp = timestamp
            
            return (
                (timestamp << 22) |
                (self.datacenter_id << 17) |
                (self.machine_id << 12) |
                self.sequence
            )

# Usage: gen = SnowflakeGenerator(datacenter_id=1, machine_id=1)
# unique_id = gen.generate()  # e.g., 1234567890123456789

⚠️Clock Skew — The Silent Killer

NTP synchronization can cause clocks to jump backward. If a Snowflake generator encounters this, it would produce duplicate IDs. Solutions: 1. **Wait:** Block until clock catches up (Twitter's approach) 2. **Logical clocks:** Use Lamport/hybrid logical clocks instead of wall time 3. **Sequence overflow:** Use the remaining sequence space to "extend" into the past 4. **Monitoring:** Alert on clock skew > threshold (usually 5ms) Discord solved this by tracking the last generated timestamp and refusing to generate if the clock goes backward.

Single Database?

Advantages

•Snowflake IDs are compact (64 bits), sortable, and support 4M IDs/sec/machine
•No coordination needed — each machine generates independently
•Time-sortable IDs enable efficient range queries and index-friendly inserts
•UUID v7 / ULID offer standardized sortable alternatives at 128 bits

Disadvantages

•Snowflake relies on wall clocks — vulnerable to clock skew
•Fixed bit allocation limits: max 32 DCs × 32 machines in standard Snowflake
•Information leakage: timestamps in IDs reveal creation time
•No scheme is perfect — every choice involves trade-offs on size, sortability, and throughput

🧪 Test Your Understanding

Knowledge Check1/4

Why are auto-increment IDs problematic in distributed systems?