Beginner18 min readยท Topic 1.6

Unique ID generation

UUID, Snowflake IDs, ULID, auto-increment at scale, time-sortable IDs, ID collision probability

๐Ÿ”‘Key Takeaways

  • 1
    Auto-increment IDs fail at distributed scale โ€” you need globally unique, decentralized ID generators
  • 2
    Snowflake IDs (64-bit: timestamp + datacenter + machine + sequence) generate 4096 IDs/ms/machine and are time-sortable
  • 3
    UUID v4 is simple but not sortable and wastes storage (128 bits). ULID combines UUID's universality with sortability
  • 4
    Choose your ID scheme based on: sortability, size, collision risk, information leakage, and database index friendliness

Why Is This Hard?

In a single-server system, auto-increment works perfectly. But in distributed systems with multiple database writers, coordinating sequential IDs requires synchronization โ€” which kills throughput. The challenge: generate IDs that are **globally unique**, **roughly sortable by time**, **compact**, and **require no coordination** between nodes.

This comes up in almost every system design interview โ€” "How would you generate unique IDs for URLs, messages, posts, or orders at scale?"

ID Generation Schemes Compared

SchemeSizeSortableDecentralizedFormatBest For
Auto-increment32-64 bitsYesNo (single writer)1, 2, 3...Small, single-DB systems
UUID v4128 bitsNoYes550e8400-e29b-41d4...Simple distributed systems
UUID v7 (2024)128 bitsYesYestimestamp + randomModern distributed IDs
Snowflake ID64 bitsYesYestimestamp+dc+machine+seqHigh-throughput at scale
ULID128 bitsYesYes01ARZ3NDEKTSV4RRFF...DB-friendly sortable IDs
NanoIDVariableNoYesV1StGXR8_Z5jdHi6B-myTURL-safe short IDs
Sonyflake63 bitsYesYes10ms precision variantLower throughput, longer lifespan
Twitter Snowflake ID Structure (64 bits)
Sign1 bitTimestamp41 bitsDC ID5 bitsMachine ID5 bitsSequence12 bitsTotal: 64 bits= ~4M IDs/sec/machine

Deep Dive Into Each Scheme

UUIDs are 128-bit identifiers typically represented as 32 hex digits: 550e8400-e29b-41d4-a716-446655440000. Version 4 uses random/pseudo-random generation. **Pros:** Zero coordination needed. Built into every language. Universally understood. **Cons:** 128 bits is large (wastes index space). Not sortable โ€” causes B-tree page splits in databases (terrible for write perf). Encodes no information. ~2^122 possible values means collision is theoretically possible but astronomically unlikely (p = 10^-37 for 1 billion IDs). **When to use:** Prototypes, low-scale systems, when sortability doesn't matter, or when you need guaranteed uniqueness across disconnected systems.

Created by Twitter in 2010. A 64-bit ID structured as: 1 sign bit + 41 timestamp bits + 5 DC bits + 5 machine bits + 12 sequence bits. **Key properties:** - Time-sortable: IDs generated later have higher values - Compact: fits in int64 (half the size of UUID) - High throughput: 4,096 IDs per millisecond per machine - Decentralized: no coordinator needed - 69 years of usable range from custom epoch **Clock skew problem:** If a machine's clock goes backwards, you'd generate duplicate IDs. Solution: reject requests until clock catches up, or use logical clock + sequence instead. **Adopted by:** Twitter, Discord (modified), Instagram (modified), Spotify.

ULID = 48-bit timestamp (ms) + 80-bit random. Encoded as Crockford Base32 (26 characters). Example: 01ARZ3NDEKTSV4RRFFQ69G5FAV. **Why ULID over UUID?** - Lexicographically sortable (string comparison = time order) - 1.21e+24 unique ULIDs per millisecond - Monotonically increasing within millisecond (random part incremented) - DB-friendly: sequential writes, no B-tree fragmentation - URL-safe: no special characters **When to use:** When you want UUID-level uniqueness with sortability and database friendliness.

**PostgreSQL:** Use UUID v7 (pg 17+) or bigserial + logical sharding. **MySQL:** AUTO_INCREMENT with different offsets per shard (shard1: 1,3,5... shard2: 2,4,6...). **MongoDB:** ObjectId is already a 12-byte, time-sortable ID (4 timestamp + 5 random + 3 counter). **DynamoDB:** Client-generated UUIDs or KSUIDs. **The Ticket Server approach (Flickr):** Dedicated MySQL instances that only generate IDs via AUTO_INCREMENT. Simple but creates a single point of failure (solved with 2 servers using odd/even).

Snowflake ID Generator (Simplified)
import time, threading

class SnowflakeGenerator:
    EPOCH = 1288834974657  # Twitter epoch (Nov 4, 2010)
    
    def __init__(self, datacenter_id, machine_id):
        self.datacenter_id = datacenter_id & 0x1F  # 5 bits
        self.machine_id = machine_id & 0x1F         # 5 bits
        self.sequence = 0
        self.last_timestamp = -1
        self.lock = threading.Lock()
    
    def generate(self):
        with self.lock:
            timestamp = int(time.time() * 1000) - self.EPOCH
            
            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                if self.sequence == 0:  # Overflow โ€” wait for next ms
                    while timestamp <= self.last_timestamp:
                        timestamp = int(time.time() * 1000) - self.EPOCH
            else:
                self.sequence = 0
            
            self.last_timestamp = timestamp
            
            return (
                (timestamp << 22) |
                (self.datacenter_id << 17) |
                (self.machine_id << 12) |
                self.sequence
            )

# Usage: gen = SnowflakeGenerator(datacenter_id=1, machine_id=1)
# unique_id = gen.generate()  # e.g., 1234567890123456789
โš ๏ธClock Skew โ€” The Silent Killer
NTP synchronization can cause clocks to jump backward. If a Snowflake generator encounters this, it would produce duplicate IDs. Solutions: 1. **Wait:** Block until clock catches up (Twitter's approach) 2. **Logical clocks:** Use Lamport/hybrid logical clocks instead of wall time 3. **Sequence overflow:** Use the remaining sequence space to "extend" into the past 4. **Monitoring:** Alert on clock skew > threshold (usually 5ms) Discord solved this by tracking the last generated timestamp and refusing to generate if the clock goes backward.

Single Database?

Advantages

  • โ€ขSnowflake IDs are compact (64 bits), sortable, and support 4M IDs/sec/machine
  • โ€ขNo coordination needed โ€” each machine generates independently
  • โ€ขTime-sortable IDs enable efficient range queries and index-friendly inserts
  • โ€ขUUID v7 / ULID offer standardized sortable alternatives at 128 bits

Disadvantages

  • โ€ขSnowflake relies on wall clocks โ€” vulnerable to clock skew
  • โ€ขFixed bit allocation limits: max 32 DCs ร— 32 machines in standard Snowflake
  • โ€ขInformation leakage: timestamps in IDs reveal creation time
  • โ€ขNo scheme is perfect โ€” every choice involves trade-offs on size, sortability, and throughput

๐Ÿงช Test Your Understanding

Knowledge Check1/4

Why are auto-increment IDs problematic in distributed systems?