Intermediate → Advanced28 min read· Topic 7.2

Storage engine internals

B-trees, LSM trees, SSTable, WAL, MVCC, column-oriented storage

⚙️Key Takeaways

  • 1
    B-trees: read-optimized, used by most SQL databases — O(log n) reads, in-place updates
  • 2
    LSM trees: write-optimized, used by Cassandra/RocksDB — sequential writes + periodic compaction
  • 3
    WAL (Write-Ahead Log): crash recovery mechanism — write to log first, then apply to data files
  • 4
    Column-oriented storage is best for analytics (OLAP); row-oriented is best for transactions (OLTP)

How Databases Store Data on Disk

Understanding storage engine internals helps you make informed database choices and troubleshoot performance issues. The two dominant structures are B-trees (used by PostgreSQL, MySQL) and LSM trees (used by Cassandra, RocksDB, LevelDB).

B-Tree vs LSM Tree

FeatureB-TreeLSM Tree
Read perfExcellent (O(log n) direct lookup)Good (may check multiple levels)
Write perfGood (random I/O for in-place update)Excellent (sequential writes)
Space usageModerate (fragmentation)Higher (duplicates before compaction)
Write amplificationLow-MediumHigh (compaction rewrites)
Used byPostgreSQL, MySQL (InnoDB), SQLiteCassandra, RocksDB, LevelDB, HBase
Best forRead-heavy, OLTPWrite-heavy, time-series, logs
💡Write-Ahead Log (WAL)
Before modifying any data page, the database writes the change to a sequential append-only log (WAL). If the system crashes mid-write, it replays the WAL on startup to recover to a consistent state. This is how PostgreSQL, MySQL, and most databases achieve durability (the D in ACID).

Advantages

  • B-trees provide predictable read performance
  • LSM trees excel at write throughput
  • WAL ensures crash recovery

Disadvantages

  • LSM compaction causes periodic latency spikes
  • B-tree random I/O is slower for writes
  • Storage engine choice is usually made by the database, not the user

🧪 Test Your Understanding

Knowledge Check1/1

Which storage engine is better for write-heavy workloads?