Advanced · 28 min read · Topic 10.3

Stream processing

Kafka Streams, Apache Flink, Spark Streaming, windowing, backpressure handling

🌊 Key Takeaways

  • 1
    Stream processing handles unbounded, continuous data, whereas batch processing handles bounded datasets
  • 2
    Kafka Streams: library-based, no separate cluster needed; Flink: full framework, exactly-once, complex event processing
  • 3
    Windowing: tumbling (fixed, non-overlapping), sliding (overlapping), session (activity-based gaps)
  • 4
    Backpressure: when consumers can't keep up, slow down producers or buffer; never drop data silently
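
The backpressure takeaway can be illustrated with a bounded buffer between a producer and a slower consumer. This is a minimal sketch using only the Python standard library, not how Kafka Streams or Flink implement backpressure internally; `run_pipeline`, `buffer_size`, and `consumer_delay_s` are hypothetical names chosen for the example.

```python
import queue
import threading
import time
from typing import List, Tuple


def run_pipeline(num_events: int, buffer_size: int,
                 consumer_delay_s: float = 0.0005) -> Tuple[List[int], int]:
    """Send num_events through a bounded buffer.

    Returns the events the consumer saw and the peak buffer depth observed.
    """
    buffer: "queue.Queue[object]" = queue.Queue(maxsize=buffer_size)
    sentinel = object()          # marks the end of the stream for this demo
    consumed: List[int] = []
    peak_depth = 0

    def producer() -> None:
        nonlocal peak_depth
        for i in range(num_events):
            # put() blocks while the buffer is full, so the producer is
            # slowed to the consumer's pace instead of dropping events.
            buffer.put(i)
            peak_depth = max(peak_depth, buffer.qsize())
        buffer.put(sentinel)

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            time.sleep(consumer_delay_s)  # simulate slow processing
            consumed.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, peak_depth
```

Because the buffer is bounded, memory use stays capped at `buffer_size` items, and every event still reaches the consumer in order. An unbounded buffer would hide the problem until memory ran out.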

Processing Data in Motion

Batch processing (MapReduce, Spark) operates on finite datasets. Stream processing handles data continuously, as it arrives. Use cases include real-time analytics, fraud detection, monitoring, and social media feeds.

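The shift from batch to streaming reflects the increasing demand for real-time insights. Many modern data architectures either combine a batch layer with a streaming layer (Lambda architecture) or treat all data as a stream and reprocess by replaying the log (Kappa architecture).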
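
The batch versus stream distinction can be sketched in a few lines. The example below is illustrative only and assumes a simple running average as the computation; `batch_average` and `streaming_average` are hypothetical names. The batch version needs the whole dataset before it can answer, while the streaming version emits an updated result per event and works on an input that never ends.

```python
import itertools
from typing import Iterable, Iterator, List


def batch_average(values: List[float]) -> float:
    """Batch: the full, finite dataset must be available up front."""
    return sum(values) / len(values)


def streaming_average(values: Iterable[float]) -> Iterator[float]:
    """Stream: keep small running state and emit a result per event."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


def first_n_running_averages(n: int) -> List[float]:
    """Consume an unbounded source (1, 2, 3, ...) but only take n results."""
    unbounded_source = itertools.count(1)
    return list(itertools.islice(streaming_average(unbounded_source), n))
```

Calling `batch_average` on `itertools.count(1)` would never return, which is the practical meaning of "unbounded" here.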

Stream Processing Frameworks

Framework       | Model                | Exactly-Once        | Latency                    | Best For
Kafka Streams   | Library (no cluster) | Yes (Kafka)         | Low                        | Lightweight, Kafka-centric
Apache Flink    | Full framework       | Yes (checkpointing) | Very low                   | Complex event processing, large-scale
Spark Streaming | Micro-batch          | Yes                 | Medium (100 ms to seconds) | Unifying batch + stream
AWS Kinesis     | Managed              | At-least-once       | Low                        | AWS-native, low-ops

Windowing Strategies

Tumbling windows: fixed-size, non-overlapping windows, e.g., count events per 1-minute window.

Simple, but events near window boundaries can cause artifacts: a burst that straddles a boundary is split across two windows.
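
A tumbling window assignment can be sketched as follows. This is a minimal in-memory illustration, not the API of any framework listed above; it assumes events are `(timestamp_seconds, payload)` pairs, and `tumbling_window_counts` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(events: Iterable[Tuple[int, str]],
                           window_size_s: int = 60) -> Dict[int, int]:
    """Count events per fixed, non-overlapping window.

    Each event belongs to exactly one window, identified by its start time.
    """
    counts: Counter = Counter()
    for timestamp, _payload in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_size_s) * window_size_s
        counts[window_start] += 1
    return dict(counts)


sample_events = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
window_counts = tumbling_window_counts(sample_events)
# window_counts == {0: 2, 60: 1, 120: 1}
```

The events at 59 s and 60 s are one second apart yet land in different windows, which is the boundary artifact in miniature.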

Advantages

  • Real-time insights instead of hourly/daily batch
  • Flink provides exactly-once with low latency
  • Kafka Streams requires no additional infrastructure

Disadvantages

  • Stream processing is more complex than batch
  • Handling late/out-of-order data requires windowing logic
  • Debugging streaming pipelines is harder
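
The extra windowing logic that late and out-of-order data requires can be sketched with a simple watermark. This is an illustrative simplification, not the watermark mechanism of Flink or any other framework: it assumes the watermark is the largest event time seen so far minus a fixed `allowed_lateness_s`, and that events are `(event_time_seconds, payload)` pairs processed in arrival order. All names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[int, str]  # (event_time_seconds, payload)


def windowed_counts_with_watermark(
    events: Iterable[Event],
    window_size_s: int = 60,
    allowed_lateness_s: int = 10,
) -> Tuple[List[Tuple[int, int]], List[Event]]:
    """Count events per tumbling window, tolerating bounded disorder.

    Returns (emitted, late): emitted is a list of (window_start, count) in
    the order windows were closed; late holds events whose window had
    already closed, kept for inspection rather than silently dropped.
    """
    open_windows: Dict[int, int] = {}
    emitted: List[Tuple[int, int]] = []
    late: List[Event] = []
    max_event_time: Optional[int] = None

    for timestamp, payload in events:
        window_start = (timestamp // window_size_s) * window_size_s
        if max_event_time is not None:
            watermark = max_event_time - allowed_lateness_s
            if window_start + window_size_s <= watermark:
                late.append((timestamp, payload))  # window already closed
                continue
        open_windows[window_start] = open_windows.get(window_start, 0) + 1
        if max_event_time is None or timestamp > max_event_time:
            max_event_time = timestamp
        watermark = max_event_time - allowed_lateness_s
        # Close every window that ends at or before the watermark.
        for start in sorted(open_windows):
            if start + window_size_s <= watermark:
                emitted.append((start, open_windows.pop(start)))

    # End of input (demo only): flush whatever is still open.
    for start in sorted(open_windows):
        emitted.append((start, open_windows.pop(start)))
    return emitted, late
```

With arrival order 5, 50, 65, 58, 75, 30, 130 (seconds), the event at 58 arrives out of order but is still counted in window 0, while the event at 30 arrives after window 0 has closed and is reported as late. A larger `allowed_lateness_s` catches more stragglers at the cost of holding results back longer.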

🧪 Test Your Understanding

Knowledge Check 1/1

What differentiates stream processing from batch processing?