
Data serialization formats

JSON, Protocol Buffers, Avro, Parquet/ORC, MessagePack

📋Key Takeaways

  1. JSON: human-readable, universal, but verbose and slow to parse — use for APIs and config
  2. Protocol Buffers (protobuf): binary, fast, strongly typed, compact — use for gRPC and internal communication
  3. Avro: binary with schema evolution, used in Kafka and data pipelines — schema stored separately
  4. Parquet/ORC: columnar formats for analytics — excellent compression and query performance for OLAP

How Data Crosses Service Boundaries

Every piece of data sent between services, stored in queues, or written to disk must be serialized into bytes. The format you choose affects performance, storage cost, schema evolution capability, and developer experience.
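A minimal sketch of that round trip, using Python's standard-library `json` module (the record fields here are illustrative):

```python
import json

# A record as it might cross a service boundary.
order = {"id": 42, "sku": "A-100", "qty": 3, "price_cents": 1999}

# Serialize: in-memory object -> bytes on the wire.
payload = json.dumps(order).encode("utf-8")

# Deserialize on the receiving side: bytes -> object.
received = json.loads(payload.decode("utf-8"))
assert received == order
```

Every format in this chapter performs these same two steps; they differ in how the bytes are laid out and whether a schema travels with them.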

Serialization Formats Compared

| Format      | Type     | Human-Readable | Speed              | Schema Evolution            | Best For                      |
|-------------|----------|----------------|--------------------|-----------------------------|-------------------------------|
| JSON        | Text     | Yes            | Slow               | Weak                        | REST APIs, config files       |
| Protobuf    | Binary   | No             | Very fast          | Good (field numbering)      | gRPC, internal services       |
| Avro        | Binary   | No             | Fast               | Excellent (full + backward) | Kafka, data pipelines         |
| MessagePack | Binary   | No             | Fast               | None                        | Light binary JSON replacement |
| Parquet     | Columnar | No             | Fast for analytics | Good                        | Data lakes, Spark, analytics  |

Advantages

  • JSON is universal and debuggable
  • Protobuf serialization and parsing are typically 3-10x faster than JSON
  • Parquet dramatically reduces analytics storage costs

Disadvantages

  • Binary formats are not human-debuggable
  • Schema evolution requires careful planning
  • Multiple formats in one system add complexity
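The schema-evolution planning mentioned above boils down to one rule: new fields need defaults so old messages stay readable. Here is an Avro-style sketch of that idea in plain Python (not the Avro library; the field names and defaults are illustrative):

```python
import json

# Reader-side defaults for fields added after v1, mimicking how
# an Avro reader schema resolves records from an older writer schema.
READER_DEFAULTS = {"currency": "USD"}  # field added in schema v2

def decode_with_defaults(payload: bytes) -> dict:
    record = json.loads(payload.decode("utf-8"))
    # Fill any field missing from older writers with its default.
    for field, default in READER_DEFAULTS.items():
        record.setdefault(field, default)
    return record

# A v1 message written before "currency" existed.
old_msg = json.dumps({"id": 7, "amount_cents": 500}).encode("utf-8")
print(decode_with_defaults(old_msg))
```

Avro enforces this discipline in the schema itself; with JSON you must carry it in application code, which is one reason its evolution story is listed as "weak".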

🧪 Test Your Understanding


Which format is best for Kafka event streaming with evolving schemas?