Intermediate → Advanced · 22 min read · Topic 7.3

Distributed file systems

HDFS, Google File System, Amazon S3 architecture, Ceph

📁 Key Takeaways

  1. HDFS: designed for large files and batch processing — stores data in 128 MB blocks with 3x replication
  2. S3: object storage with virtually unlimited scale and 99.999999999% (11 nines) durability
  3. GFS (Google File System): the predecessor to HDFS — append-optimized, single master
  4. Modern alternative: use object storage (S3) for most use cases; reserve HDFS for Hadoop ecosystems

Storing Data Across Machines

Distributed file systems spread files across many machines to provide capacity beyond a single disk, fault tolerance through replication, and parallel read/write throughput. Understanding HDFS architecture is essential for data-intensive system design.
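As a concrete sketch, the snippet below splits a file into fixed-size blocks and copies each block to several distinct nodes, in the spirit of HDFS. Real HDFS uses a rack-aware placement policy; the round-robin policy, node names, and `place_blocks` helper here are illustrative assumptions only.

```python
# Toy block placement for a distributed file system (illustrative only;
# real HDFS placement is rack-aware, not round-robin).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size: int, nodes: list[str]) -> dict[int, list[str]]:
    """Assign each block of a file to REPLICATION distinct nodes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # pick REPLICATION consecutive nodes, wrapping around the cluster
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
plan = place_blocks(1 * 1024**3, nodes)  # a 1 GB file → 8 blocks
print(len(plan))   # 8
print(plan[0])     # ['dn1', 'dn2', 'dn3']
```

Because every block lives on three different nodes, any single node can fail without losing data, and three nodes can serve reads of the same block in parallel.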

Distributed Storage Systems

| System | Architecture | Durability | Best For |
| --- | --- | --- | --- |
| HDFS | NameNode + DataNodes, 128 MB blocks | 3x replication | Batch analytics (MapReduce, Spark) |
| Amazon S3 | Managed object store, HTTP API | 11 nines (99.999999999%) | General file storage, data lakes |
| GFS | Master + ChunkServers, 64 MB chunks | 3x replication | Google internal (predecessor to HDFS) |
| Ceph | CRUSH algorithm, no single master | Configurable | OpenStack, on-premise, block/file/object |
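The durability figure in the table becomes concrete with a little arithmetic. Treating 11 nines as an annual per-object loss probability (a simplification of how AWS states its design target), storing ten million objects implies losing on the order of one object every 10,000 years:

```python
# Back-of-envelope: what 11 nines of durability means in practice.
durability = 0.99999999999           # S3's design target (11 nines)
annual_loss_rate = 1 - durability    # per-object loss probability per year

objects = 10_000_000
expected_losses_per_year = objects * annual_loss_rate
print(expected_losses_per_year)      # ≈ 0.0001 → about one object per 10,000 years
```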

Advantages

  • Massive parallel throughput for big data
  • Replication provides fault tolerance
  • S3 offers virtually unlimited scale at low cost
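To see where the parallel-throughput advantage comes from, compare reading a file from a single disk with reading its blocks from many disks at once. The 100 MB/s per-disk figure below is an assumed round number, not a benchmark:

```python
# Serial vs. parallel read time for a 1 GB file split into 128 MB blocks.
BLOCK_MB = 128
DISK_MB_S = 100                  # assumed per-disk sequential throughput
file_mb = 1024                   # 1 GB file → 8 blocks

serial_s = file_mb / DISK_MB_S   # one disk reads every block: 10.24 s
blocks = file_mb // BLOCK_MB     # 8 blocks
parallel_s = BLOCK_MB / DISK_MB_S  # 8 disks each read one block: 1.28 s
print(serial_s, parallel_s)      # 10.24 1.28
```

The speedup is bounded by the number of blocks (and by network bandwidth in practice), which is why the model pays off for large files rather than small ones.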

Disadvantages

  • HDFS's NameNode is a single point of failure unless high availability (a standby NameNode) is configured
  • Not suitable for many small files — each file and block adds per-object metadata overhead on the NameNode
  • S3 has higher latency than local file systems
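The small-file problem can be quantified. The NameNode holds metadata for every file and block in RAM; a commonly cited rough estimate is about 150 bytes per object. Under that assumption (the `namenode_bytes` helper is illustrative, not a real HDFS API):

```python
# Why small files hurt HDFS: NameNode RAM cost for file/block metadata.
BYTES_PER_OBJECT = 150          # assumption: rough per-file/per-block cost
BLOCK = 128 * 1024**2           # 128 MB default block size

def namenode_bytes(num_files: int, file_size: int) -> int:
    blocks_per_file = max(1, -(-file_size // BLOCK))      # ceiling division
    objects = num_files * (1 + blocks_per_file)           # file entry + its blocks
    return objects * BYTES_PER_OBJECT

# Roughly 1 TB stored two ways:
small = namenode_bytes(10_000_000, 100 * 1024)   # 10M x 100 KB files ≈ 3 GB of heap
large = namenode_bytes(8_192, 128 * 1024**2)     # 8,192 x 128 MB files ≈ 2.5 MB
print(small, large)
```

The same terabyte of data costs the NameNode roughly a thousand times more memory when stored as small files, which is why HDFS guidance favors large files or compaction into container formats.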

🧪 Test Your Understanding

Knowledge Check 1/1

What's the default block size in HDFS?