Key Takeaways
- Stream processing handles unbounded, continuous data, vs batch processing, which handles bounded datasets
- Kafka Streams: library-based, no separate cluster needed; Flink: full framework, exactly-once, complex event processing
- Windowing: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based gaps)
- Backpressure: when consumers can't keep up, slow down producers or buffer; never drop silently
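The backpressure takeaway can be sketched with a bounded buffer: when the consumer falls behind, the producer blocks instead of dropping events. This is a minimal illustration using Python's standard library, not how any particular framework implements flow control; the buffer size of 100 is an arbitrary assumption.

```python
import queue
import threading

# Bounded queue: when the consumer falls behind, put() blocks the
# producer until space frees up (backpressure) -- nothing is dropped.
BUFFER = queue.Queue(maxsize=100)  # capacity is an assumed tuning knob

def produce(events):
    for e in events:
        BUFFER.put(e)          # blocks when the buffer is full
    BUFFER.put(None)           # sentinel: no more events

def consume(results):
    while True:
        e = BUFFER.get()
        if e is None:
            break
        results.append(e * 2)  # stand-in for real per-event processing

results = []
consumer = threading.Thread(target=consume, args=(results,))
consumer.start()
produce(range(500))            # 500 events through a 100-slot buffer
consumer.join()
print(len(results))            # 500: every event was processed
```

The key property is that the producer's `put()` naturally slows to the consumer's pace; the alternative, an unbounded buffer, trades that for memory growth.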
Processing Data in Motion
Batch processing (MapReduce, Spark) processes finite datasets. Stream processing handles data as it arrives, continuously. Use cases: real-time analytics, fraud detection, monitoring, social media feeds.
The shift from batch to streaming reflects the increasing demand for real-time insights. Most modern data architectures combine both (Lambda/Kappa architecture).
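The batch-vs-stream distinction above can be shown in a few lines: a batch job computes once over a complete dataset, while a streaming job updates its result incrementally as each event arrives. This is a conceptual sketch, not any framework's API.

```python
# Batch: the whole dataset exists up front; compute once over all of it.
batch_data = [3, 1, 4, 1, 5]
batch_sum = sum(batch_data)

# Stream: events arrive one at a time; the result is updated
# incrementally and is available after every event, not just at the end.
def running_sum(events):
    total = 0
    for e in events:    # in a real system this loop never terminates
        total += e
        yield total     # emit an up-to-date result per event

stream_results = list(running_sum(batch_data))
print(batch_sum)        # 14
print(stream_results)   # [3, 4, 8, 9, 14]
```

Both end at the same answer; the difference is that the stream version had a usable (partial) answer at every intermediate point, which is what makes real-time analytics possible.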
Stream Processing Frameworks
| Framework | Model | Exactly-Once | Latency | Best For |
|---|---|---|---|---|
| Kafka Streams | Library (no cluster) | Yes (Kafka) | Low | Lightweight, Kafka-centric |
| Apache Flink | Full framework | Yes (checkpointing) | Very low | Complex event processing, large-scale |
| Spark Streaming | Micro-batch | Yes | Medium (100ms-sec) | Unifying batch + stream |
| AWS Kinesis | Managed | At-least-once | Low | AWS-native, low-ops |
Windowing Strategies
Tumbling windows: fixed-size, non-overlapping. e.g., count events per 1-minute window. Simple, but events at window boundaries can cause artifacts.
Sliding windows: fixed-size but overlapping, so a single event can fall into multiple windows.
Session windows: bounded by gaps in activity rather than by fixed sizes.
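The tumbling-window behavior described here can be sketched by bucketing event timestamps with integer division: each event lands in exactly one fixed, non-overlapping 1-minute window. This is a simplified illustration; real frameworks also handle window triggering and late arrivals.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def tumbling_counts(events):
    """Count events per fixed, non-overlapping window.

    Each event is (timestamp_ms, payload); the window key is
    timestamp_ms // WINDOW_MS, so every event maps to exactly one window.
    """
    counts = defaultdict(int)
    for ts_ms, _payload in events:
        counts[ts_ms // WINDOW_MS] += 1
    return dict(counts)

events = [(5_000, "a"), (59_999, "b"), (60_000, "c"), (125_000, "d")]
print(tumbling_counts(events))  # {0: 2, 1: 1, 2: 1}
```

Note the boundary artifact the text mentions: events at 59,999 ms and 60,000 ms are 1 ms apart yet land in different windows.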
Advantages
- Real-time insights instead of hourly/daily batches
- Flink provides exactly-once semantics with low latency
- Kafka Streams requires no additional infrastructure
Disadvantages
- Stream processing is more complex than batch
- Handling late/out-of-order data requires windowing logic
- Debugging streaming pipelines is harder
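The late/out-of-order data problem listed above is usually handled with watermarks: the stream tracks the highest timestamp seen and treats events older than the watermark minus an allowed lateness as "late". The sketch below is a heavily simplified version of that idea; real frameworks route late events to side outputs or retroactively update emitted windows rather than just setting them aside.

```python
ALLOWED_LATENESS_MS = 2_000  # assumed tolerance for out-of-order events

def split_late(events):
    """Separate events that arrive too far behind the watermark.

    The watermark is the max timestamp seen so far; an event is 'late'
    if it is more than ALLOWED_LATENESS_MS behind it.
    """
    watermark = 0
    kept, late = [], []
    for ts, payload in events:
        watermark = max(watermark, ts)
        if ts >= watermark - ALLOWED_LATENESS_MS:
            kept.append((ts, payload))
        else:
            late.append((ts, payload))
    return kept, late

events = [(1_000, "a"), (5_000, "b"), (2_000, "stale"), (4_000, "ok")]
kept, late = split_late(events)
print(kept)  # [(1000, 'a'), (5000, 'b'), (4000, 'ok')]
print(late)  # [(2000, 'stale')]
```

Choosing the allowed lateness is a trade-off: larger values absorb more disorder but delay window finalization, which is part of why streaming pipelines are harder to reason about than batch jobs.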