Key Takeaways
- Stream processing handles unbounded, continuous data, vs batch processing, which handles bounded datasets
- Kafka Streams: library-based, no separate cluster needed; Flink: full framework, exactly-once, complex event processing
- Windowing: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based gaps)
- Backpressure: when consumers can't keep up, slow down producers or buffer; never drop silently
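The backpressure takeaway can be sketched with a bounded buffer: when the consumer falls behind, the producer blocks instead of dropping events. This is a minimal illustration using Python's standard library, not how any particular framework implements flow control; the buffer size of 100 is an arbitrary assumption.

```python
import queue
import threading

# Bounded queue: when the consumer falls behind, put() blocks the
# producer until space frees up (backpressure) -- nothing is dropped.
BUFFER = queue.Queue(maxsize=100)  # capacity is an assumed tuning knob

def produce(events):
    for e in events:
        BUFFER.put(e)          # blocks when the buffer is full
    BUFFER.put(None)           # sentinel: no more events

def consume(results):
    while True:
        e = BUFFER.get()
        if e is None:
            break
        results.append(e * 2)  # stand-in for real per-event processing

results = []
consumer = threading.Thread(target=consume, args=(results,))
consumer.start()
produce(range(500))            # 500 events through a 100-slot buffer
consumer.join()
print(len(results))            # 500: every event was processed
```

The key property is that the producer's `put()` naturally slows to the consumer's pace; the alternative, an unbounded buffer, trades that for memory growth.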
Processing Data in Motion
Batch processing (MapReduce, Spark) processes finite datasets. Stream processing handles data as it arrives, continuously. Use cases: real-time analytics, fraud detection, monitoring, social media feeds.
The shift from batch to streaming reflects the increasing demand for real-time insights. Most modern data architectures combine both (Lambda/Kappa architecture).
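The batch-vs-stream distinction above can be shown in a few lines: a batch job computes once over a complete dataset, while a streaming job updates its result incrementally as each event arrives. This is a conceptual sketch, not any framework's API.

```python
# Batch: the whole dataset exists up front; compute once over all of it.
batch_data = [3, 1, 4, 1, 5]
batch_sum = sum(batch_data)

# Stream: events arrive one at a time; the result is updated
# incrementally and is available after every event, not just at the end.
def running_sum(events):
    total = 0
    for e in events:    # in a real system this loop never terminates
        total += e
        yield total     # emit an up-to-date result per event

stream_results = list(running_sum(batch_data))
print(batch_sum)        # 14
print(stream_results)   # [3, 4, 8, 9, 14]
```

Both end at the same answer; the difference is that the stream version had a usable (partial) answer at every intermediate point, which is what makes real-time analytics possible.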
Stream Processing Frameworks
| Framework | Model | Exactly-Once | Latency | Best For |
|---|---|---|---|---|
| Kafka Streams | Library (no cluster) | Yes (Kafka) | Low | Lightweight, Kafka-centric |
| Apache Flink | Full framework | Yes (checkpointing) | Very low | Complex event processing, large-scale |
| Spark Streaming | Micro-batch | Yes | Medium (100ms-sec) | Unifying batch + stream |
| AWS Kinesis | Managed | At-least-once | Low | AWS-native, low-ops |
Windowing Strategies
Tumbling windows: fixed-size, non-overlapping. e.g., count events per 1-minute window. Simple, but events at window boundaries can cause artifacts.
Sliding windows: fixed-size but overlapping, so a single event can fall into multiple windows.
Session windows: bounded by gaps in activity rather than by fixed sizes.
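The tumbling-window behavior described here can be sketched by bucketing event timestamps with integer division: each event lands in exactly one fixed, non-overlapping 1-minute window. This is a simplified illustration; real frameworks also handle window triggering and late arrivals.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def tumbling_counts(events):
    """Count events per fixed, non-overlapping window.

    Each event is (timestamp_ms, payload); the window key is
    timestamp_ms // WINDOW_MS, so every event maps to exactly one window.
    """
    counts = defaultdict(int)
    for ts_ms, _payload in events:
        counts[ts_ms // WINDOW_MS] += 1
    return dict(counts)

events = [(5_000, "a"), (59_999, "b"), (60_000, "c"), (125_000, "d")]
print(tumbling_counts(events))  # {0: 2, 1: 1, 2: 1}
```

Note the boundary artifact the text mentions: events at 59,999 ms and 60,000 ms are 1 ms apart yet land in different windows.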
Advantages
- Real-time insights instead of hourly/daily batches
- Flink provides exactly-once semantics with low latency
- Kafka Streams requires no additional infrastructure
Disadvantages
- Stream processing is more complex than batch
- Handling late/out-of-order data requires windowing logic
- Debugging streaming pipelines is harder
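The late/out-of-order data problem listed above is usually handled with watermarks: the stream tracks the highest timestamp seen and treats events older than the watermark minus an allowed lateness as "late". The sketch below is a heavily simplified version of that idea; real frameworks route late events to side outputs or retroactively update emitted windows rather than just setting them aside.

```python
ALLOWED_LATENESS_MS = 2_000  # assumed tolerance for out-of-order events

def split_late(events):
    """Separate events that arrive too far behind the watermark.

    The watermark is the max timestamp seen so far; an event is 'late'
    if it is more than ALLOWED_LATENESS_MS behind it.
    """
    watermark = 0
    kept, late = [], []
    for ts, payload in events:
        watermark = max(watermark, ts)
        if ts >= watermark - ALLOWED_LATENESS_MS:
            kept.append((ts, payload))
        else:
            late.append((ts, payload))
    return kept, late

events = [(1_000, "a"), (5_000, "b"), (2_000, "stale"), (4_000, "ok")]
kept, late = split_late(events)
print(kept)  # [(1000, 'a'), (5000, 'b'), (4000, 'ok')]
print(late)  # [(2000, 'stale')]
```

Choosing the allowed lateness is a trade-off: larger values absorb more disorder but delay window finalization, which is part of why streaming pipelines are harder to reason about than batch jobs.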