System Design Glossary

64 essential terms and concepts — from ACID to ZooKeeper. Your go-to reference for system design interviews and everyday engineering.

Categories: Architecture, Caching, Data Structures, Database, Distributed Systems, Fundamentals, Infrastructure, Messaging, Networking, Storage

A

ACID

Database

Atomicity, Consistency, Isolation, Durability — properties of reliable database transactions. Atomicity ensures all-or-nothing, Consistency maintains invariants, Isolation prevents interference between concurrent transactions, Durability persists committed data.

API Gateway

Architecture

A server that acts as the single entry point for API requests. Handles routing, authentication, rate limiting, request/response transformation, and load balancing. Examples: Kong, AWS API Gateway, Envoy.

Availability

Fundamentals

The proportion of time a system is operational and accessible. Measured in 'nines': 99.9% (three nines) allows about 8.77 hours of downtime per year. Availability = Uptime / (Uptime + Downtime).
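The arithmetic behind the 'nines' can be checked with a short sketch. The function names and the 365.25-day year are assumptions made for illustration; a 365-day year gives 8.76 hours instead.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 h; averaging in leap years is an assumption


def availability(uptime: float, downtime: float) -> float:
    """Availability = Uptime / (Uptime + Downtime), as a fraction in [0, 1]."""
    return uptime / (uptime + downtime)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why moving from three to four nines is usually much harder than the numbers suggest.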

Async Messaging

Architecture

Communication pattern where the sender doesn't wait for the receiver to process the message. Enables temporal decoupling, load levelling, and fault tolerance. Implemented via message queues (Kafka, RabbitMQ, SQS).

B

Back Pressure

Distributed Systems

A mechanism for a receiver to signal to the sender to slow down when overwhelmed. Prevents cascading failures in stream processing and messaging systems. Essential in Kafka, Flink, and reactive systems.

BASE

Database

Basically Available, Soft state, Eventually consistent — an alternative to ACID for distributed databases that prioritizes availability over immediate consistency.

Bloom Filter

Data Structures

A probabilistic data structure that tests set membership. May return false positives but never false negatives. Space-efficient — uses a bit array and multiple hash functions. Used in databases, caches, and network systems to avoid expensive lookups.
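A minimal sketch of the bit array plus multiple hash functions, for illustration only. The class name, the default sizes, and the choice of seeded SHA-256 as the hash family are assumptions; production filters size the array from the expected item count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive several hash functions by prefixing a seed (an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))
```

A caller would typically check `might_contain` before an expensive disk or network lookup and skip the lookup whenever it returns `False`.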

Bulkhead

Architecture

A resilience pattern that isolates components so that a failure in one doesn't cascade to others, similar to watertight compartments in a ship. Implemented via thread pools, process isolation, or separate service instances.

C

CAP Theorem

Distributed Systems

Brewer's theorem: a distributed system can provide at most 2 of 3 guarantees — Consistency (every read gets the latest write), Availability (every request gets a response), Partition tolerance (system operates despite network splits). In practice, P is mandatory, so the choice is between CP and AP.

Cache

Caching

A high-speed data storage layer that stores a subset of data for faster retrieval. Can exist at every layer — browser, CDN, application, database. Cache hit ratio is the key metric.

Cache-Aside

Caching

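Also called 'lazy loading'. The application checks the cache first; on a miss, it reads from the DB and populates the cache. The most common caching pattern. Drawbacks: the first request for each key always misses, and cached entries can go stale unless they are invalidated or given a TTL.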
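The read path can be sketched as follows. The `CacheAside` class and the in-memory dictionaries standing in for the cache and the database are hypothetical; a real deployment would use something like Redis and a SQL store, plus an expiry or invalidation policy.

```python
class CacheAside:
    def __init__(self, database: dict):
        self.database = database  # stand-in for a real database
        self.cache = {}           # stand-in for a real cache
        self.db_reads = 0         # counts how often we fall through to the DB

    def get(self, key):
        if key in self.cache:          # cache hit
            return self.cache[key]
        self.db_reads += 1             # cache miss: read from the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value    # populate the cache for next time
        return value
```

Only the first `get` for a key touches the database; later calls are served from the cache until the entry is removed.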

Circuit Breaker

Architecture

A resilience pattern that prevents a service from calling a failing dependency repeatedly. States: Closed (normal) → Open (failing, fast-fail) → Half-Open (testing recovery). Prevents cascade failures.
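The three states can be sketched as a small state machine. The class name, the thresholds, and the injectable `clock` parameter are assumptions chosen to keep the example testable; libraries differ on details such as how many trial calls the half-open state allows.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked as failing")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.state = "closed"
        return result
```

Callers wrap each remote call in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback rather than wait on a dependency that is known to be down.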

Consistent Hashing

Distributed Systems

A hashing technique that minimizes key redistribution when nodes are added or removed. Keys and nodes are placed on a virtual ring, and each key belongs to the first node found clockwise from its position. Used in Amazon's Dynamo, Cassandra, CDNs, and distributed caches.
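A minimal ring sketch, assuming SHA-256 for placement and 100 virtual positions per node; the `HashRing` name and both parameters are illustrative choices, not taken from any particular system.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node appears at several positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node joins, the only keys that change owner are those the new node takes over; every other key stays where it was, which is the property that distinguishes this from `hash(key) % N`.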

Consensus

Distributed Systems

The process by which distributed nodes agree on a value. Essential for leader election and replication. Algorithms: Paxos, Raft, Zab (ZooKeeper). The FLP impossibility result shows that no deterministic algorithm can guarantee consensus in a fully asynchronous network if even one node may crash; practical algorithms work around it with timeouts or randomization.

CQRS

Architecture

Command Query Responsibility Segregation — separating read and write models. Commands modify state, queries read state. Enables independent scaling and optimization of reads vs writes. Often paired with event sourcing.

CRDT

Distributed Systems

Conflict-free Replicated Data Type — data structures that can be replicated across nodes and merged deterministically without coordination. Used for eventual consistency in collaborative editing, counters, and sets.
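One of the simplest CRDTs is a grow-only counter (G-Counter), sketched below. The class and method names are illustrative. Each replica increments only its own slot, and merging takes the per-node maximum, so merges can be applied in any order and any number of times.

```python
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Two replicas that have exchanged state in either direction report the same value, with no coordination needed at write time.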

CDN

Infrastructure

Content Delivery Network — a geographically distributed network of proxy servers and data centers. Caches content at edge locations close to users, reducing latency and backend load. Examples: CloudFront, Cloudflare, Akamai.

D

Dead Letter Queue

Messaging

A queue that stores messages that couldn't be processed after multiple retries. Enables manual inspection and reprocessing of failed messages without blocking the main queue.

DNS

Networking

Domain Name System — translates domain names to IP addresses. Hierarchical: root → TLD → authoritative nameserver. Supports record types: A, AAAA, CNAME, MX, NS, TXT. TTL controls caching.

Database Sharding

Database

Horizontal partitioning of data across multiple database instances. Each shard holds a subset of data. Sharding keys determine data placement. Challenges: cross-shard queries, rebalancing, hot spots.

E

Event Sourcing

Architecture

Storing all changes to application state as a sequence of immutable events rather than current state. The current state is derived by replaying events. Enables audit trails, time travel, and event-driven architectures.

Eventual Consistency

Distributed Systems

A consistency model where replicas will converge to the same value given enough time without new updates. Provides higher availability than strong consistency. Used in DNS, CDNs, NoSQL databases.

F

Failover

Infrastructure

The process of switching to a redundant/standby system when the primary fails. Active-passive: standby takes over. Active-active: both handle traffic, surviving node absorbs full load.

Fan-out

Architecture

Distributing a message/event to multiple recipients. Write fan-out (push): write to all followers' timelines at write time. Read fan-out (pull): assemble feed at read time. Trade-off between write amplification and read latency.

G

Geohash

Data Structures

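A hierarchical spatial encoding that maps geographic coordinates to a short string of letters and digits; each added character narrows the cell. Nearby locations usually share a common prefix, although points on opposite sides of a cell boundary can have very different hashes. Used for proximity searches in location-based services.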
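A sketch of the standard encoding: longitude and latitude ranges are halved alternately, each decision yields one bit, and every five bits become one base-32 character. The function name and default precision are assumptions.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

Because each character refines the previous cell, truncating a geohash gives the hash of a larger enclosing cell, which is what makes prefix queries useful for proximity search.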

gRPC

Networking

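An open-source Remote Procedure Call framework originally developed at Google. Uses HTTP/2 for transport and Protocol Buffers for serialization. Supports streaming, multiplexing, and strong typing. Often quoted as 2-10× faster than REST/JSON for internal services, though the actual gain depends on payload size and workload.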

H

Heartbeat

Distributed Systems

Periodic signal sent between nodes to detect failures. If heartbeats are missed beyond a threshold, the node is considered failed. Used in distributed systems, load balancers, and cluster management.

Horizontal Scaling

Fundamentals

Adding more machines to handle increased load (scale out). Preferred for stateless services. Requires load balancing and/or sharding. Contrast with vertical scaling (bigger machine).

Hot Spot

Distributed Systems

A node or partition receiving disproportionately more traffic than others. Caused by skewed data distribution or popular keys. Solutions: salting keys, consistent hashing with virtual nodes, splitting hot partitions.

I

Idempotency

Distributed Systems

Property where performing the same operation multiple times produces the same result. Critical for retry logic in distributed systems. Implemented via idempotency keys — unique request identifiers that prevent duplicate processing.
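A minimal sketch of idempotency keys, assuming a hypothetical `PaymentService` that stores the first response for each key in memory. A real service would persist the keys, expire them, and handle concurrent requests carrying the same key.

```python
class PaymentService:
    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.charges = []     # side effects actually performed

    def charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._responses:
            # Retry of a request we already handled: replay the old response.
            return self._responses[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount,
                    "charge_no": len(self.charges)}
        self._responses[idempotency_key] = response
        return response
```

A client that times out can resend the request with the same key and will not be charged twice.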

Inverted Index

Data Structures

A data structure mapping terms to the documents/records containing them. The foundation of full-text search engines like Elasticsearch and Lucene. Enables fast keyword lookup across millions of documents.
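A toy version of the term-to-documents mapping, assuming whitespace tokenization and lowercasing. Real search engines add analysis steps such as stemming, plus position data and relevance scoring, none of which are shown here.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query costs one dictionary lookup per term and a set intersection, rather than a scan of every document.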

K

Kafka

Messaging

A distributed event streaming platform. Append-only log with topics, partitions, and consumer groups. Provides at-least-once delivery by default (exactly-once semantics are available through idempotent producers and transactions), high throughput (millions of messages/second), and data retention.

L

Latency

Fundamentals

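Time between a request and its response. Measured in percentiles: p50 (median), p95, p99, p99.9. Tail latency (p99 and above) often matters most for user experience, because a single user action typically fans out into many requests. Amdahl's law caps how much parallelism can reduce latency, since the serial portion of the work remains.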

Leader Election

Distributed Systems

The process of choosing a single node to coordinate operations in a distributed system. Prevents split-brain. Implemented via Raft (as in etcd) or Zab (as in ZooKeeper). Key concept: fencing tokens, which let downstream resources reject requests from a stale leader.

Load Balancer

Networking

Distributes incoming traffic across multiple servers. Layer 4 (TCP/UDP) or Layer 7 (HTTP). Algorithms: round-robin, least connections, consistent hashing. Examples: HAProxy, NGINX, AWS ALB.

LSM Tree

Storage

Log-Structured Merge Tree: a write-optimized storage engine. Writes go to an in-memory table (memtable), which is later flushed to sorted, immutable files (SSTables) on disk. Reads check the memtable first, then SSTables from newest to oldest. Background compaction merges files. Used in Cassandra, RocksDB, LevelDB.
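The write, flush, read, and compaction steps can be sketched in memory. `TinyLSM` and its tiny flush threshold are hypothetical; a real engine also keeps a write-ahead log, uses tombstones for deletes, and stores SSTables on disk.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the memtable into an immutable sorted run.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

    def compact(self) -> None:
        # Merge all runs into one; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

Writes never modify an existing run, which is why this design favors write throughput at the cost of reads that may need to check several runs.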

M

Message Queue

Messaging

Middleware that enables async communication between services. Provides buffering, ordering, delivery guarantees. Point-to-point or pub/sub. Examples: RabbitMQ, Amazon SQS, Kafka, Redis Streams.

Microservices

Architecture

Architecture style where an application is composed of small, independently deployable services. Each owns its data and communicates via APIs or events. Benefits: team autonomy, independent scaling, tech diversity.

MVCC

Database

Multi-Version Concurrency Control — database technique allowing concurrent reads and writes by maintaining multiple versions of data. Readers see a consistent snapshot without locking. Used in PostgreSQL, MySQL InnoDB.

N

NoSQL

Database

Non-relational databases optimized for specific access patterns. Types: Document (MongoDB), Key-Value (Redis), Wide-Column (Cassandra), Graph (Neo4j), Time-Series (InfluxDB), Search (Elasticsearch).

O

Outbox Pattern

Architecture

Ensures reliable event publishing by writing events to a local 'outbox' table in the same DB transaction as the business operation. A separate process reads the outbox and publishes events. Solves the dual-write problem.
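A minimal sketch using the standard-library `sqlite3` module. The table layout, the `OrderPlaced` event shape, and the `publish` callback are assumptions for illustration. The relay marks a row as published only after `publish` returns, so an event can be delivered more than once after a crash, and consumers should be idempotent.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def place_order(conn: sqlite3.Connection, item: str) -> int:
    # Business row and event row commit (or roll back) together.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        event = {"type": "OrderPlaced", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return order_id


def relay(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

In practice the relay runs as a separate poller or is replaced by change data capture on the outbox table.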

P

Partition Tolerance

Distributed Systems

The system continues to function despite network partitions (message loss between nodes). In CAP theorem, this is effectively mandatory — networks are unreliable. So the real choice is CP vs AP.

Pub/Sub

Messaging

Publish-Subscribe messaging pattern. Publishers send messages to topics, subscribers receive messages from topics they've subscribed to. Enables loose coupling and fan-out. Examples: Kafka, Google Pub/Sub, Redis Pub/Sub.

Q

Quorum

Distributed Systems

The minimum number of nodes that must agree for an operation to succeed. For N replicas: write quorum W, read quorum R. If W + R > N, every read quorum overlaps every write quorum, so a read contacts at least one replica holding the latest acknowledged write. Dynamo-style stores such as Cassandra and Riak use quorum-based consistency.
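The W + R > N rule is a pigeonhole argument: a write set and a read set that together exceed N replicas must share at least one replica. The sketch below checks the rule by brute force for small N; both function names are illustrative.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The rule of thumb: read and write quorums intersect when W + R > N."""
    return w + r > n


def always_intersect(n: int, w: int, r: int) -> bool:
    """Brute force: does every W-subset share a replica with every R-subset?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5)]:
        print(n, w, r, quorums_overlap(n, w, r), always_intersect(n, w, r))
```

With N = 3, the common choice W = 2, R = 2 satisfies the rule, while W = 1, R = 1 trades that guarantee for lower latency.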

R

Rate Limiting

Networking

Controlling the rate of requests a client can make. Protects against abuse and overload. Algorithms: Token Bucket, Leaky Bucket, Fixed Window, Sliding Window Log, Sliding Window Counter.

Read Replica

Database

A copy of the primary database that handles read queries. Reduces load on the primary. Introduces replication lag — the delay between a write on primary and its availability on replicas.

Replication

Database

Copying data across multiple nodes for fault tolerance and read scalability. Modes: single-leader, multi-leader, leaderless. Trade-off between consistency and latency.

REST

Networking

Representational State Transfer — an architectural style for APIs. Uses HTTP methods (GET, POST, PUT, DELETE), stateless communication, resource-oriented URLs. Most common API style.

S

Saga Pattern

Architecture

Manages distributed transactions as a sequence of local transactions. Each step has a compensating action for rollback. Types: choreography (event-driven) and orchestration (central coordinator).
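An orchestration-style sketch, assuming each step is supplied as an (action, compensation) pair of callables. The `run_saga` helper is hypothetical; real orchestrators also persist progress so a crashed coordinator can resume, and they retry compensations that fail.

```python
def run_saga(steps) -> bool:
    """Run (action, compensate) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For example, if "reserve stock" and "charge card" succeed but "book shipping" fails, the saga refunds the card and then releases the stock. Unlike a database rollback, the intermediate states were visible to other transactions while the saga ran.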

Service Discovery

Networking

Mechanism for services to find and communicate with each other. Client-side: client queries registry (Consul, etcd). Server-side: load balancer queries registry. DNS-based: use DNS for discovery.

Service Mesh

Infrastructure

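Infrastructure layer handling service-to-service communication, usually through sidecar proxies deployed next to each service. Provides traffic management, security (mTLS), and observability without changing application code. Examples: Istio, Linkerd. Envoy is a proxy commonly used as the data plane, for instance by Istio.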

Sharding

Database

See Database Sharding — horizontal partitioning of data across multiple database instances.

SLA / SLO / SLI

Infrastructure

SLA: Service Level Agreement — the contract. SLO: Service Level Objective — the target (e.g. 99.9% uptime). SLI: Service Level Indicator — the measured metric (e.g. actual uptime %).

Snowflake ID

Data Structures

A 64-bit unique ID generation scheme created by Twitter. Components: unused sign bit (1 bit) + timestamp in milliseconds since a custom epoch (41 bits) + datacenter ID (5 bits) + machine ID (5 bits) + sequence number (12 bits). Generates up to 4096 IDs per millisecond per machine. Time-sortable.
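The bit layout can be sketched with shifts and masks. The function names are illustrative, and the timestamp is assumed to be milliseconds since whatever custom epoch the deployment chooses. The top bit is left as zero so the ID stays positive in a signed 64-bit integer.

```python
TIMESTAMP_BITS, DATACENTER_BITS, MACHINE_BITS, SEQUENCE_BITS = 41, 5, 5, 12

MACHINE_SHIFT = SEQUENCE_BITS                      # 12
DATACENTER_SHIFT = MACHINE_SHIFT + MACHINE_BITS    # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS  # 22


def _check(name: str, value: int, bits: int) -> None:
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} must fit in {bits} bits")


def pack_snowflake(timestamp_ms: int, datacenter_id: int,
                   machine_id: int, sequence: int) -> int:
    _check("timestamp_ms", timestamp_ms, TIMESTAMP_BITS)
    _check("datacenter_id", datacenter_id, DATACENTER_BITS)
    _check("machine_id", machine_id, MACHINE_BITS)
    _check("sequence", sequence, SEQUENCE_BITS)
    return ((timestamp_ms << TIMESTAMP_SHIFT)
            | (datacenter_id << DATACENTER_SHIFT)
            | (machine_id << MACHINE_SHIFT)
            | sequence)


def unpack_snowflake(snowflake: int) -> dict:
    return {
        "timestamp_ms": snowflake >> TIMESTAMP_SHIFT,
        "datacenter_id": (snowflake >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        "machine_id": (snowflake >> MACHINE_SHIFT) & ((1 << MACHINE_BITS) - 1),
        "sequence": snowflake & ((1 << SEQUENCE_BITS) - 1),
    }
```

Because the timestamp occupies the most significant bits, sorting IDs numerically also sorts them by creation time, to millisecond resolution.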

T

Throughput

Fundamentals

The number of operations/requests a system can handle per unit time. Measured in QPS (queries/second), TPS (transactions/second), or RPS (requests/second). Throughput × Latency = Concurrency (Little's Law).
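Little's Law is handy for back-of-the-envelope sizing. The helper names below are illustrative, and latency is expressed in seconds so that the units cancel.

```python
def required_concurrency(throughput_per_s: float, latency_s: float) -> float:
    """Average number of in-flight requests: Throughput x Latency."""
    return throughput_per_s * latency_s


def max_throughput(concurrency: float, latency_s: float) -> float:
    """Throughput ceiling for a fixed number of workers or connections."""
    return concurrency / latency_s


if __name__ == "__main__":
    # 2000 requests/second at 50 ms average latency
    print(required_concurrency(2000, 0.05))
    # a pool of 100 connections at 50 ms per query
    print(max_throughput(100, 0.05))
```

The same relation explains why rising latency under load quickly exhausts a fixed-size thread or connection pool.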

Token Bucket

Networking

A rate limiting algorithm. Tokens are added at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected or queued. Allows bursts up to the bucket capacity.
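A minimal sketch that refills lazily on each call instead of using a background timer. The class name and the injectable `clock` are assumptions made to keep the example deterministic; this version rejects rather than queues when the bucket is empty.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = capacity      # start full
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=3`, a client can send three requests back to back and is then limited to one request per second.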

Two-Phase Commit (2PC)

Distributed Systems

A distributed transaction protocol. Phase 1: Coordinator asks all participants to prepare (vote yes/no). Phase 2: If all voted yes, coordinator sends commit; otherwise abort. Blocking protocol — coordinator failure stalls everyone.
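The two phases can be sketched in a single process. The `Participant` class and its `will_vote_yes` flag are hypothetical, and the sketch leaves out the hard parts of the real protocol: durable logging of votes and decisions, timeouts, and recovery after a coordinator crash.

```python
class Participant:
    def __init__(self, name: str, will_vote_yes: bool = True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if able to commit later.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> str:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A participant in the `prepared` state cannot decide on its own, which is why a coordinator failure between the two phases leaves participants blocked.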

V

Vector Clock

Distributed Systems

A mechanism to track causal ordering of events in distributed systems. Each node maintains a vector of logical timestamps, one counter per node. Enables detecting concurrent events (neither happened before the other). Used in Amazon's original Dynamo design and in Riak.
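A sketch of the three basic operations, representing a clock as a dictionary from node id to counter. The function names are illustrative. Two clocks are concurrent when each is ahead of the other on at least one node.

```python
def tick(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A store that finds two versions to be concurrent cannot pick a winner by ordering alone; it has to keep both as siblings or apply a merge rule.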

Vertical Scaling

Fundamentals

Increasing the resources (CPU, RAM, disk) of a single machine (scale up). Simpler but has hardware limits. Good for databases that are hard to shard. Eventually hits a ceiling.

Virtual Node (vNode)

Distributed Systems

In consistent hashing, a physical node is represented by multiple virtual nodes on the hash ring. Improves load distribution and enables heterogeneous node capacities.

W

WAL (Write-Ahead Log)

Storage

A technique where changes are written to a sequential log before being applied to the main data structure. Ensures durability — if the system crashes, changes can be recovered from the log. Used in databases and Kafka.

WebSocket

Networking

A protocol providing full-duplex communication over a single TCP connection. Enables real-time bidirectional communication (chat, gaming, live updates). The connection is persistent, so the cost of the HTTP upgrade handshake is paid only once and later messages carry only small frame headers.

Write-Ahead Log

Storage

See WAL.

Z

ZooKeeper

Infrastructure

A distributed coordination service for maintaining configuration, naming, synchronization, and group services. Provides: leader election, distributed locks, service registry. Being replaced by etcd in many modern systems.
