Data-intensive system design

📊 Key Takeaways

1. Data lake: raw, schema-on-read storage (cheap, flexible). Data warehouse: structured, schema-on-write storage (fast queries).
2. MapReduce pioneered distributed batch processing. It has largely been replaced by Apache Spark, which can be 10-100x faster, mainly on workloads that keep intermediate data in memory.
3. ETL → ELT shift: load raw data first, then transform it inside the warehouse (the dbt, BigQuery, Snowflake model).
4. Columnar databases (BigQuery, Redshift, ClickHouse) dominate OLAP: analytical queries read only the columns they reference, which can make them orders of magnitude faster than row stores.
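
To make the MapReduce takeaway concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. The function names are illustrative and are not taken from Hadoop or any other framework; a real cluster would run the map and reduce steps on many workers in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents: List[str]) -> Dict[str, int]:
    # On a cluster, each document (input split) would be mapped by a different worker.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["data lake data", "lake house"]))
```

Spark expresses the same idea with chained transformations and avoids writing intermediate results to disk between stages, which is where much of its speed advantage comes from.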

Designing for Data at Scale

Data-intensive systems differ from computation-intensive ones: the bottleneck is data volume, throughput, and complexity — not CPU. Think petabyte-scale analytics, ML feature pipelines, and data warehousing.

Data Lake vs Data Warehouse

Feature	Data Lake	Data Warehouse
Schema	Schema-on-read (raw data)	Schema-on-write (structured)
Storage	Object storage (S3) — cheap	Columnar (Redshift, BigQuery) — expensive
Data types	Any (JSON, Parquet, images, logs)	Structured (tables, columns)
Users	Data engineers, data scientists	Business analysts, BI tools
Query speed	Slow unless partitioned (full scans)	Fast (columnar, partitioned, optimized)
Cost	Very low for storage	Pay per query/compute
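
The schema row of the table is the key difference, so here is a small sketch of it. It assumes a hypothetical two-column "events" schema; `EVENT_SCHEMA`, `warehouse_insert`, `lake_write`, and `lake_read` are made-up names for illustration, not the API of any real product.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema for an "events" table: column name -> required Python type.
EVENT_SCHEMA = {"user_id": int, "action": str}


def warehouse_insert(table: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Schema-on-write: validate before storing and reject malformed records."""
    if set(record) != set(EVENT_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in EVENT_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def lake_write(files: List[str], raw_line: str) -> None:
    """Schema-on-read: store the raw text as-is, with no validation."""
    files.append(raw_line)


def lake_read(files: List[str]) -> List[Dict[str, Any]]:
    """Apply the schema at query time, skipping rows that do not fit it."""
    rows = []
    for line in files:
        try:
            record = json.loads(line)
            rows.append({"user_id": int(record["user_id"]),
                         "action": str(record["action"])})
        except (ValueError, KeyError, TypeError):
            continue  # malformed rows surface at read time, not at write time
    return rows


if __name__ == "__main__":
    lake: List[str] = []
    lake_write(lake, '{"user_id": "7", "action": "click"}')  # accepted as-is
    lake_write(lake, "not json at all")                      # also accepted
    print(lake_read(lake))  # only the row that fits the schema comes back
```

The lake accepts both writes and pays the validation cost on every read; the warehouse pays it once at write time, which is part of why its queries are faster and its ingestion less flexible.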

✅ Modern Architecture: Lakehouse

The Lakehouse pattern (Databricks Delta Lake, Apache Iceberg) aims to combine data lake storage costs with data warehouse query performance. Data stays in object storage as open file formats (typically Parquet), and a table-format layer on top adds ACID transactions, schema enforcement, and time travel. SQL engines then query those files directly, without first copying them into a separate warehouse.
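
Time travel and atomic commits are easier to picture with a toy model. The sketch below assumes immutable data files plus an append-only commit log, which is the general idea behind lakehouse table formats; `ToyTable` and `Commit` are hypothetical names, and this is not the actual on-disk format of Delta Lake or Iceberg.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Commit:
    added: Set[str]    # data files that become visible in this version
    removed: Set[str]  # data files retired by this version


@dataclass
class ToyTable:
    """Immutable data files plus an append-only commit log (toy model only)."""
    files: Dict[str, List[dict]] = field(default_factory=dict)
    log: List[Commit] = field(default_factory=list)

    def commit(self, name: str, rows: List[dict],
               replaces: Optional[Set[str]] = None) -> int:
        # Files are never edited in place; a commit adds or retires whole files.
        self.files[name] = rows
        self.log.append(Commit(added={name}, removed=set(replaces or ())))
        return len(self.log) - 1  # version number of this commit

    def read(self, version: Optional[int] = None) -> List[dict]:
        # Replay the log up to the requested version to find the live files.
        last = len(self.log) - 1 if version is None else version
        live: List[str] = []
        for entry in self.log[: last + 1]:
            live = [f for f in live if f not in entry.removed]
            live.extend(sorted(entry.added))
        return [row for f in live for row in self.files[f]]


if __name__ == "__main__":
    table = ToyTable()
    v0 = table.commit("part-0", [{"id": 1, "status": "new"}])
    table.commit("part-1", [{"id": 1, "status": "shipped"}], replaces={"part-0"})
    print(table.read())    # latest version
    print(table.read(v0))  # time travel to the first version
```

Because readers only ever see files referenced by a completed log entry, a half-written update is invisible to them, and old versions remain readable for as long as the retired files are kept.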

Advantages

•Data lakes provide cheap, flexible storage
•Lakehouse combines the best of both worlds
•Columnar databases enable interactive analytics on very large datasets, because each query scans only the columns it references
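
The columnar advantage can be illustrated by counting how many values each layout has to touch for a single aggregate. This is a simplified in-memory sketch with made-up order data; real columnar engines add compression and vectorized execution on top of the same idea.

```python
from typing import Any, Dict, List, Tuple

# Row-oriented layout: all fields of one record are stored together.
rows: List[Dict[str, Any]] = [
    {"order_id": 1, "customer": "ana", "country": "PT", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "country": "SE", "amount": 12.5},
    {"order_id": 3, "customer": "cy", "country": "PT", "amount": 7.5},
]

# Column-oriented layout: all values of one column are stored together.
columns: Dict[str, List[Any]] = {
    name: [row[name] for row in rows] for name in rows[0]
}


def total_amount_row_store(table: List[Dict[str, Any]]) -> Tuple[float, int]:
    """SUM(amount) on a row store: every field of every row is loaded."""
    total, values_read = 0.0, 0
    for row in table:
        values_read += len(row)  # the whole row comes off storage
        total += row["amount"]
    return total, values_read


def total_amount_column_store(table: Dict[str, List[Any]]) -> Tuple[float, int]:
    """SUM(amount) on a column store: only the 'amount' column is read."""
    amounts = table["amount"]
    return sum(amounts), len(amounts)


if __name__ == "__main__":
    print(total_amount_row_store(rows))         # same total, 12 values read
    print(total_amount_column_store(columns))   # same total, 3 values read
```

With four columns the row layout reads four times as many values for the same answer; on wide analytical tables with hundreds of columns the gap grows accordingly.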

Disadvantages

•Data lakes can become 'data swamps' without governance
•ETL/ELT pipelines are complex to build and maintain
•Data warehouse costs scale with query volume
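
For reference, the ELT flow mentioned in the list above can be sketched end to end. This example uses Python's built-in `sqlite3` module as a stand-in for a warehouse; the table names (`raw_orders`, `revenue_by_country`) and the sample rows are assumptions made for illustration, and tools such as dbt manage the transform step as versioned SQL models rather than inline strings.

```python
import sqlite3
from typing import List, Tuple


def run_elt(raw_rows: List[Tuple[str, str, str]]) -> List[Tuple[str, float]]:
    """Load untyped rows first, then clean and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")

    # Extract + Load: land the data untouched, every column as text.
    conn.execute("CREATE TABLE raw_orders (user_id TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

    # Transform: cast types and normalize values inside the "warehouse".
    conn.execute(
        """
        CREATE TABLE revenue_by_country AS
        SELECT UPPER(country) AS country,
               SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        GROUP BY UPPER(country)
        """
    )
    result = conn.execute(
        "SELECT country, revenue FROM revenue_by_country ORDER BY country"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("1", "19.50", "pt"), ("2", "5.00", "SE"), ("1", "0.50", "PT")]
    print(run_elt(sample))
```

Keeping the raw table means a transform bug can be fixed and rerun without re-extracting from the source system, which is one of the main arguments for ELT over ETL.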

🧪 Test Your Understanding

Why are columnar databases faster for analytics than row-oriented databases?