Key Takeaways
- Data lake: raw, schema-on-read storage (cheap, flexible); data warehouse: structured, schema-on-write (fast queries)
- MapReduce: pioneer of distributed batch processing, largely replaced by Apache Spark (10-100x faster)
- ETL → ELT shift: load raw data first, transform in the warehouse (dbt, BigQuery, Snowflake model)
- Columnar databases (BigQuery, Redshift, ClickHouse) dominate OLAP: orders of magnitude faster for analytics
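To make the MapReduce takeaway concrete, here is a minimal word-count sketch of its three phases using only the Python standard library. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are illustrative, not part of any framework; a real MapReduce job distributes each phase across many machines.

```python
# Minimal MapReduce-style word count, standard library only.
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big pipelines", "data lakes hold big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"], counts["data"])  # 3 3
```

Spark keeps the same map/shuffle/reduce model but holds intermediate data in memory across stages, which is the main source of its speedup over disk-based MapReduce.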
Designing for Data at Scale
Data-intensive systems differ from computation-intensive ones: the bottleneck is data volume, throughput, and complexity, not CPU. Think petabyte-scale analytics, ML feature pipelines, and data warehousing.
Data Lake vs Data Warehouse
| Feature | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Schema-on-read (raw data) | Schema-on-write (structured) |
| Storage | Object storage (S3); cheap | Columnar (Redshift, BigQuery); expensive |
| Data types | Any (JSON, Parquet, images, logs) | Structured (tables, columns) |
| Users | Data engineers, data scientists | Business analysts, BI tools |
| Query speed | Slow (full scans) | Fast (optimized, indexed) |
| Cost | Very low for storage | Pay per query/compute |
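The schema rows in the table can be sketched in a few lines of Python. This is a toy illustration with hypothetical helper names (`write_to_warehouse`, `write_to_lake`, `read_from_lake`), using in-memory lists in place of real storage: the warehouse validates at write time, the lake accepts anything and applies the schema only at read time.

```python
# Sketch of schema-on-write (warehouse) vs schema-on-read (lake).
import json

SCHEMA = {"user_id": int, "amount": float}

def write_to_warehouse(table, record):
    """Schema-on-write: validate and coerce before storing."""
    row = {col: typ(record[col]) for col, typ in SCHEMA.items()}
    table.append(row)  # bad data raises KeyError/ValueError here, at write time

def write_to_lake(storage, record):
    """Schema-on-read: dump the raw record as-is; no validation."""
    storage.append(json.dumps(record))

def read_from_lake(storage):
    """Apply the schema only when reading; skip records that don't fit."""
    for line in storage:
        raw = json.loads(line)
        try:
            yield {col: typ(raw[col]) for col, typ in SCHEMA.items()}
        except (KeyError, ValueError):
            continue  # malformed data surfaces at read time, not write time

warehouse, lake = [], []
write_to_warehouse(warehouse, {"user_id": "7", "amount": "9.5"})
write_to_lake(lake, {"user_id": 7, "amount": 9.5, "extra": "kept raw"})
write_to_lake(lake, {"bad": "record"})  # the lake accepts it anyway
print(len(warehouse), len(list(read_from_lake(lake))))  # 1 1
```

Note how the lake keeps the `extra` field and the malformed record: flexibility at write time is exactly what turns into a "data swamp" without governance.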
Modern Architecture: Lakehouse
The Lakehouse pattern (Databricks Delta Lake, Apache Iceberg) combines data lake storage costs with data warehouse query performance. Store raw data in object storage (Parquet/Delta format) with ACID transactions, schema enforcement, and time travel. Query directly with SQL engines.
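The time-travel idea above can be sketched with a toy transaction log: each commit records which data files were added or removed, so any past table version can be reconstructed by replaying the log. This is a simplified sketch, not Delta Lake's or Iceberg's actual on-disk format, which stores much richer metadata.

```python
# Toy lakehouse transaction log: each commit is a diff of live data files,
# so "time travel" is just replaying the log up to a chosen version.

commit_log = []  # append-only, like the JSON files in a _delta_log/ directory

def commit(added, removed=()):
    """Record a new table version as a diff against the previous one."""
    commit_log.append({"add": list(added), "remove": list(removed)})

def snapshot(version):
    """Time travel: replay the log up to `version` to list live files."""
    live = set()
    for entry in commit_log[: version + 1]:
        live |= set(entry["add"])
        live -= set(entry["remove"])
    return sorted(live)

commit(added=["part-0.parquet"])                              # version 0
commit(added=["part-1.parquet"])                              # version 1
commit(added=["part-2.parquet"], removed=["part-0.parquet"])  # version 2

print(snapshot(1))  # ['part-0.parquet', 'part-1.parquet']
print(snapshot(2))  # ['part-1.parquet', 'part-2.parquet']
```

Because commits are atomic appends to the log and old files are never mutated, readers at any version see a consistent snapshot, which is how the pattern gets ACID semantics on top of plain object storage.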
Advantages
- Data lakes provide cheap, flexible storage
- Lakehouse combines the best of both worlds
- Columnar databases enable sub-second analytics on petabytes
Disadvantages
- Data lakes can become 'data swamps' without governance
- ETL/ELT pipelines are complex to build and maintain
- Data warehouse costs scale with query volume
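The columnar advantage mentioned above comes down to data layout: an aggregate over one column only has to touch that column's values, not every field of every row. A toy in-memory sketch (real engines add compression, vectorized execution, and disk I/O savings on top):

```python
# Row vs column layout for the same table; the aggregate touches less data
# in the columnar case because 'amount' values sit in one contiguous list.

rows = [{"id": i, "region": "us", "amount": float(i)} for i in range(1000)]

# Row-oriented: SUM(amount) still walks entire row objects.
row_total = sum(r["amount"] for r in rows)

# Columnar: the same table stored as one list per column.
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
col_total = sum(columns["amount"])  # reads only the 'amount' column

print(row_total == col_total)  # True
```

On disk the effect is larger still: a row store must read every page containing the scanned rows, while a column store reads only the pages of the queried columns.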
Test Your Understanding
Why are columnar databases faster for analytics than row-oriented databases?