Intermediate → Advanced · 22 min read · Topic 7.3

Distributed file systems

HDFS, Google File System, Amazon S3 architecture, Ceph

📁 Key Takeaways

  1. HDFS: designed for large files and batch processing — stores data in 128 MB blocks with 3x replication
  2. S3: object storage with virtually unlimited scale and 99.999999999% (11 nines) durability
  3. GFS (Google File System): the predecessor to HDFS — append-optimized, single master
  4. Modern alternative: use object storage (S3) for most use cases; reserve HDFS for Hadoop ecosystems

Storing Data Across Machines

Distributed file systems spread files across many machines to provide capacity beyond a single disk, fault tolerance through replication, and parallel read/write throughput. Understanding HDFS architecture is essential for data-intensive system design.
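As a concrete sketch, the snippet below splits a file into fixed-size blocks and copies each block to several distinct nodes, in the spirit of HDFS. Real HDFS uses a rack-aware placement policy; the round-robin policy, node names, and `place_blocks` helper here are illustrative assumptions only.

```python
# Toy block placement for a distributed file system (illustrative only;
# real HDFS placement is rack-aware, not round-robin).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size: int, nodes: list[str]) -> dict[int, list[str]]:
    """Assign each block of a file to REPLICATION distinct nodes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # pick REPLICATION consecutive nodes, wrapping around the cluster
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
plan = place_blocks(1 * 1024**3, nodes)  # a 1 GB file → 8 blocks
print(len(plan))   # 8
print(plan[0])     # ['dn1', 'dn2', 'dn3']
```

Because every block lives on three different nodes, any single node can fail without losing data, and three nodes can serve reads of the same block in parallel.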

Distributed Storage Systems

| System | Architecture | Durability | Best For |
| --- | --- | --- | --- |
| HDFS | NameNode + DataNodes, 128 MB blocks | 3x replication | Batch analytics (MapReduce, Spark) |
| Amazon S3 | Managed object store, HTTP API | 11 nines (99.999999999%) | General file storage, data lakes |
| GFS | Master + ChunkServers, 64 MB chunks | 3x replication | Google internal (predecessor to HDFS) |
| Ceph | CRUSH algorithm, no single master | Configurable | OpenStack, on-premise, block/file/object |
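The durability figure in the table becomes concrete with a little arithmetic. Treating 11 nines as an annual per-object loss probability (a simplification of how AWS states its design target), storing ten million objects implies losing on the order of one object every 10,000 years:

```python
# Back-of-envelope: what 11 nines of durability means in practice.
durability = 0.99999999999           # S3's design target (11 nines)
annual_loss_rate = 1 - durability    # per-object loss probability per year

objects = 10_000_000
expected_losses_per_year = objects * annual_loss_rate
print(expected_losses_per_year)      # ≈ 0.0001 → about one object per 10,000 years
```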

Advantages

  • Massive parallel throughput for big data
  • Replication provides fault tolerance
  • S3 offers virtually unlimited scale at low cost
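To see where the parallel-throughput advantage comes from, compare reading a file from a single disk with reading its blocks from many disks at once. The 100 MB/s per-disk figure below is an assumed round number, not a benchmark:

```python
# Serial vs. parallel read time for a 1 GB file split into 128 MB blocks.
BLOCK_MB = 128
DISK_MB_S = 100                  # assumed per-disk sequential throughput
file_mb = 1024                   # 1 GB file → 8 blocks

serial_s = file_mb / DISK_MB_S   # one disk reads every block: 10.24 s
blocks = file_mb // BLOCK_MB     # 8 blocks
parallel_s = BLOCK_MB / DISK_MB_S  # 8 disks each read one block: 1.28 s
print(serial_s, parallel_s)      # 10.24 1.28
```

The speedup is bounded by the number of blocks (and by network bandwidth in practice), which is why the model pays off for large files rather than small ones.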

Disadvantages

  • HDFS's NameNode is a single point of failure unless high availability (a standby NameNode) is configured
  • Not suitable for many small files — each file and block adds per-object metadata overhead on the NameNode
  • S3 has higher latency than local file systems
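The small-file problem can be quantified. The NameNode holds metadata for every file and block in RAM; a commonly cited rough estimate is about 150 bytes per object. Under that assumption (the `namenode_bytes` helper is illustrative, not a real HDFS API):

```python
# Why small files hurt HDFS: NameNode RAM cost for file/block metadata.
BYTES_PER_OBJECT = 150          # assumption: rough per-file/per-block cost
BLOCK = 128 * 1024**2           # 128 MB default block size

def namenode_bytes(num_files: int, file_size: int) -> int:
    blocks_per_file = max(1, -(-file_size // BLOCK))      # ceiling division
    objects = num_files * (1 + blocks_per_file)           # file entry + its blocks
    return objects * BYTES_PER_OBJECT

# Roughly 1 TB stored two ways:
small = namenode_bytes(10_000_000, 100 * 1024)   # 10M x 100 KB files ≈ 3 GB of heap
large = namenode_bytes(8_192, 128 * 1024**2)     # 8,192 x 128 MB files ≈ 2.5 MB
print(small, large)
```

The same terabyte of data costs the NameNode roughly a thousand times more memory when stored as small files, which is why HDFS guidance favors large files or compaction into container formats.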

🧪 Test Your Understanding

Knowledge Check 1/1

What's the default block size in HDFS?