Advanced · 30 min read · Topic 10.4

Data-intensive system design

Data pipelines, data lakes, warehouses, Apache Spark, MapReduce, columnar databases

📊 Key Takeaways

  1. Data lake: raw, schema-on-read storage (cheap, flexible). Data warehouse: structured, schema-on-write storage (fast queries).
  2. MapReduce: the pioneer of distributed batch processing, now largely replaced by Apache Spark (often cited as 10-100x faster, mainly for in-memory workloads).
  3. ETL → ELT shift: load raw data first, then transform it inside the warehouse (the dbt, BigQuery, Snowflake model).
  4. Columnar databases (BigQuery, Redshift, ClickHouse) dominate OLAP: they can be orders of magnitude faster than row stores for analytical queries.
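To make the MapReduce takeaway concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) applied to a word count. The function names are illustrative assumptions, not the API of Hadoop or Spark; a real framework would run each phase in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["spark replaces mapreduce", "spark is fast"])
print(counts)
```

Because each map call only sees one record and each reduce call only sees one key, both phases can be split across workers; the shuffle is the step that moves data between them.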

Designing for Data at Scale

Data-intensive systems differ from computation-intensive ones: the bottleneck is data volume, throughput, and complexity, not CPU. Think petabyte-scale analytics, ML feature pipelines, and data warehousing.

Data Lake vs Data Warehouse

| Feature     | Data Lake                          | Data Warehouse                                  |
|-------------|------------------------------------|-------------------------------------------------|
| Schema      | Schema-on-read (raw data)          | Schema-on-write (structured)                    |
| Storage     | Object storage (S3), cheap         | Columnar (Redshift, BigQuery), expensive        |
| Data types  | Any (JSON, Parquet, images, logs)  | Structured (tables, columns)                    |
| Users       | Data engineers, data scientists    | Business analysts, BI tools                     |
| Query speed | Slow (full scans)                  | Fast (optimized, indexed)                       |
| Cost        | Very low for storage               | Pay per query/compute                           |
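The schema row of the comparison can be illustrated with a small sketch. Here an in-memory SQLite table stands in for a warehouse and a list of JSON strings stands in for files in a lake; the `payments` table, the field names, and the sample records are all assumptions made for illustration.

```python
import json
import sqlite3

# Raw events as they might land in a lake: nothing is validated on write.
raw_events = [
    '{"user_id": 1, "amount": 9.5}',
    '{"user_id": 2, "amount": "n/a", "coupon": "X1"}',  # messy, but still stored
]


def read_amounts(lines):
    """Schema-on-read: interpret and filter the raw records at query time."""
    total = 0.0
    for line in lines:
        amount = json.loads(line).get("amount")
        if isinstance(amount, (int, float)):
            total += amount
    return total


lake_total = read_amounts(raw_events)

# Schema-on-write: the table definition rejects bad rows at insert time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " user_id INTEGER NOT NULL,"
    " amount REAL NOT NULL CHECK (typeof(amount) = 'real'))"
)

rejected = []
for line in raw_events:
    record = json.loads(line)
    try:
        conn.execute(
            "INSERT INTO payments (user_id, amount) VALUES (?, ?)",
            (record["user_id"], record["amount"]),
        )
    except sqlite3.IntegrityError:
        rejected.append(record["user_id"])  # the bad row never enters the table

warehouse_total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(lake_total, warehouse_total, rejected)
```

Both paths arrive at the same total, but the cost sits in different places: the lake reader must handle malformed values on every query, while the warehouse pays once at load time and can then trust its columns.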
✅ Modern Architecture: Lakehouse
The Lakehouse pattern (Databricks Delta Lake, Apache Iceberg) combines data lake storage costs with data warehouse query performance. Store raw data in object storage (Parquet/Delta format) with ACID transactions, schema enforcement, and time travel. Query directly with SQL engines.
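One way to picture how time travel can work over immutable files in object storage is a transaction log that records which data files each commit adds or removes. The sketch below is a toy model of that idea only; the `commit` and `snapshot` helpers and the file names are hypothetical and are not the Delta Lake or Iceberg API.

```python
# Toy transaction log: each commit lists data files added and removed.
log = []


def commit(add=(), remove=()):
    """Append one atomic commit and return its version number."""
    log.append({"add": list(add), "remove": list(remove)})
    return len(log) - 1


def snapshot(version):
    """Replay the log up to `version` to find the files visible at that point."""
    live = set()
    for entry in log[: version + 1]:
        live.update(entry["add"])
        live.difference_update(entry["remove"])
    return live


v0 = commit(add=["part-000.parquet"])
v1 = commit(add=["part-001.parquet"])
# A rewrite replaces an old file; the old file stays on disk for older versions.
v2 = commit(add=["part-002.parquet"], remove=["part-000.parquet"])

print(sorted(snapshot(v0)))  # table as of the first commit
print(sorted(snapshot(v2)))  # current table
```

Because data files are never modified in place, reading an older version is just a matter of replaying fewer log entries.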

Advantages

  • Data lakes provide cheap, flexible storage
  • The lakehouse pattern combines low-cost lake storage with warehouse-style query performance
  • Columnar databases enable fast, often sub-second, analytics on very large datasets, up to petabyte scale
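The columnar advantage comes from data layout. The following sketch contrasts a row layout with a column layout for the same three orders, and shows why a column of repeated values compresses well. The table contents and the `run_length_encode` helper are illustrative assumptions, not how any particular engine stores data.

```python
from array import array

rows = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "US", "amount": 35.5},
    {"order_id": 3, "country": "DE", "amount": 4.5},
]

# Row layout: summing one field still walks every complete record.
row_total = sum(row["amount"] for row in rows)

# Column layout: each column is stored contiguously and can be read alone.
columns = {
    "order_id": array("q", (r["order_id"] for r in rows)),
    "country": [r["country"] for r in rows],
    "amount": array("d", (r["amount"] for r in rows)),
}
column_total = sum(columns["amount"])  # touches only the 'amount' column
bytes_scanned = columns["amount"].itemsize * len(columns["amount"])


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


encoded_country = run_length_encode(sorted(columns["country"]))
print(row_total, column_total, bytes_scanned, encoded_country)
```

An aggregate over one column reads a small, homogeneous block instead of whole rows, and values of the same type stored side by side compress far better than mixed row data.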

Disadvantages

  • Data lakes can become 'data swamps' without governance
  • ETL/ELT pipelines are complex to build and maintain
  • Data warehouse costs scale with query volume
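As a rough sketch of what an ELT pipeline looks like in practice, the example below lands untouched source rows first and then cleans them with SQL inside the database. In-memory SQLite stands in for a warehouse here, and the `raw_orders` and `orders_clean` tables and their sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows as-is, all text, with no cleaning.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, country TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "de", "20.00"), ("2", "US", "35.50"), ("3", "De", "4.50"), ("4", "US", "")],
)

# Transform: runs inside the warehouse as SQL, after the raw data is loaded.
conn.execute(
    """
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           UPPER(country)            AS country,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE amount <> ''
    """
)

revenue_by_country = dict(
    conn.execute(
        "SELECT country, SUM(amount) FROM orders_clean GROUP BY country ORDER BY country"
    ).fetchall()
)
raw_row_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(revenue_by_country, raw_row_count)
```

The raw table is kept intact, so a transformation bug can be fixed by rerunning the SQL rather than re-extracting from the source. The maintenance burden noted above remains: every such transformation is code that needs testing and versioning.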

🧪 Test Your Understanding

Knowledge Check 1/1

Why are columnar databases faster for analytics than row-oriented databases?