📁Key Takeaways
1. HDFS: designed for large files and batch processing — stores data in 128MB blocks with 3x replication
2. S3: object storage with virtually unlimited scale, 99.999999999% (11 9s) durability
3. GFS (Google File System): the predecessor to HDFS — append-optimized, single master
4. Modern alternative: use object storage (S3) for most use cases; HDFS only for Hadoop ecosystems
Storing Data Across Machines
Distributed file systems spread files across many machines to provide capacity beyond a single disk, fault tolerance through replication, and parallel read/write throughput. Understanding HDFS architecture is essential for data-intensive system design.
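To make the block model concrete, here is a minimal sketch of how HDFS-style storage accounting works, using the default 128MB block size and 3x replication from the table below. The function name and structure are illustrative, not part of any HDFS API:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MiB
REPLICATION = 3                  # HDFS default replication factor

def hdfs_storage(file_size_bytes: int) -> tuple[int, int]:
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # The last block is usually partial; HDFS stores only actual bytes,
    # so raw cluster usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

one_gib = 1024 ** 3
blocks, raw = hdfs_storage(one_gib)
print(blocks, raw // 1024 ** 2)  # 8 blocks, 3072 MiB of raw cluster storage
```

A 1 GiB file splits into 8 blocks, each replicated to 3 DataNodes, so reads of different blocks can proceed in parallel from different machines.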
Distributed Storage Systems
| System | Architecture | Durability | Best For |
|---|---|---|---|
| HDFS | NameNode + DataNodes, 128MB blocks | 3x replication | Batch analytics (MapReduce, Spark) |
| Amazon S3 | Managed object store, HTTP API | 11 nines (99.999999999%) | General file storage, data lake |
| GFS | Master + ChunkServers, 64MB chunks | 3x replication | Google internal (predecessor to HDFS) |
| Ceph | CRUSH algorithm, no single master | Configurable | OpenStack, on-premise, block/file/object |
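The "11 nines" durability figure becomes more tangible as an expected annual loss rate. This back-of-the-envelope sketch assumes a hypothetical fleet of ten million stored objects; the object count is an assumption for illustration:

```python
# 11 nines durability: an object survives a given year with probability
# 0.99999999999, i.e. an expected annual loss rate of 1e-11 per object.
durability = 0.99999999999
annual_loss_rate = 1 - durability

objects = 10_000_000  # hypothetical: ten million stored objects
expected_losses_per_year = objects * annual_loss_rate
print(expected_losses_per_year)  # ≈ 1e-4: one expected loss every ~10,000 years
```

By contrast, HDFS and GFS reach durability through 3x replication, which tolerates two simultaneous replica failures but triples raw storage cost.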
Advantages
- Massive parallel throughput for big data
- Replication provides fault tolerance
- S3 offers virtually unlimited scale at low cost
Disadvantages
- HDFS has a NameNode single point of failure (need HA)
- Not suitable for small files (overhead per block)
- S3 has higher latency than local file systems
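The small-files disadvantage is really a NameNode memory problem: the NameNode keeps an in-memory record per file and per block (~150 bytes is a commonly cited estimate, assumed here for illustration), so metadata, not disk, becomes the bottleneck:

```python
# Sketch of why millions of small files hurt HDFS: NameNode heap fills up
# with per-object metadata long before DataNode disks fill with data.
METADATA_BYTES = 150  # approximate per-object NameNode memory cost (assumption)

def namenode_memory(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate NameNode heap needed, in bytes."""
    objects = num_files * (1 + blocks_per_file)  # one file record + its block records
    return objects * METADATA_BYTES

# 100 million 1 KB files: ~30 GB of NameNode heap for only ~100 GB of data.
print(namenode_memory(100_000_000) / 1000 ** 3)
```

This is why the key takeaways recommend object storage for general use: S3 has no per-file metadata server for clients to exhaust.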
🧪 Test Your Understanding
What's the default block size in HDFS?