
Data serialization formats

JSON, Protocol Buffers, Avro, Parquet/ORC, MessagePack

📋Key Takeaways

  1. JSON: human-readable, universal, but verbose and slow to parse — use for APIs and config
  2. Protocol Buffers (protobuf): binary, fast, strongly typed, compact — use for gRPC and internal communication
  3. Avro: binary with schema evolution, used in Kafka and data pipelines — schema stored separately
  4. Parquet/ORC: columnar formats for analytics — excellent compression and query performance for OLAP

How Data Crosses Service Boundaries

Every piece of data sent between services, stored in queues, or written to disk must be serialized into bytes. The format you choose affects performance, storage cost, schema evolution capability, and developer experience.
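A minimal sketch of that round trip, using Python's standard-library `json` module (the record fields here are illustrative):

```python
import json

# A record as it might cross a service boundary.
order = {"id": 42, "sku": "A-100", "qty": 3, "price_cents": 1999}

# Serialize: in-memory object -> bytes on the wire.
payload = json.dumps(order).encode("utf-8")

# Deserialize on the receiving side: bytes -> object.
received = json.loads(payload.decode("utf-8"))
assert received == order
```

Every format in this chapter performs these same two steps; they differ in how the bytes are laid out and whether a schema travels with them.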

Serialization Formats Compared

| Format      | Type     | Human-Readable | Speed              | Schema Evolution            | Best For                      |
|-------------|----------|----------------|--------------------|-----------------------------|-------------------------------|
| JSON        | Text     | Yes            | Slow               | Weak                        | REST APIs, config files       |
| Protobuf    | Binary   | No             | Very fast          | Good (field numbering)      | gRPC, internal services       |
| Avro        | Binary   | No             | Fast               | Excellent (full + backward) | Kafka, data pipelines         |
| MessagePack | Binary   | No             | Fast               | None                        | Light binary JSON replacement |
| Parquet     | Columnar | No             | Fast for analytics | Good                        | Data lakes, Spark, analytics  |

Advantages

  • JSON is universal and debuggable
  • Protobuf serialization and parsing are typically 3-10x faster than JSON
  • Parquet dramatically reduces analytics storage costs

Disadvantages

  • Binary formats are not human-debuggable
  • Schema evolution requires careful planning
  • Multiple formats in one system add complexity
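The schema-evolution planning mentioned above boils down to one rule: new fields need defaults so old messages stay readable. Here is an Avro-style sketch of that idea in plain Python (not the Avro library; the field names and defaults are illustrative):

```python
import json

# Reader-side defaults for fields added after v1, mimicking how
# an Avro reader schema resolves records from an older writer schema.
READER_DEFAULTS = {"currency": "USD"}  # field added in schema v2

def decode_with_defaults(payload: bytes) -> dict:
    record = json.loads(payload.decode("utf-8"))
    # Fill any field missing from older writers with its default.
    for field, default in READER_DEFAULTS.items():
        record.setdefault(field, default)
    return record

# A v1 message written before "currency" existed.
old_msg = json.dumps({"id": 7, "amount_cents": 500}).encode("utf-8")
print(decode_with_defaults(old_msg))
```

Avro enforces this discipline in the schema itself; with JSON you must carry it in application code, which is one reason its evolution story is listed as "weak".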

🧪 Test Your Understanding


Which format is best for Kafka event streaming with evolving schemas?