
Stream Processing Pipeline Pattern: Stateful Real-Time Data Products

Build low-latency pipelines with windowing, state stores, and exactly-once outcomes.

Abstract Algorithms · 14 min read


TLDR: Stream pipelines succeed when event-time semantics, state management, and replay strategy are designed together — and Kafka Streams lets you build all three directly inside your Spring Boot service.

Stripe's real-time fraud detection processes each transaction in under 50 ms. Fraud detection that runs on yesterday's batch data misses a card-cloning attack tonight. Stream processing handles events as they arrive.

📖 From Batch Jobs to Live Data: Why Stream Processing Exists

Imagine your e-commerce platform's revenue dashboard refreshes every six hours. A surge of fraudulent orders hits a specific region at 2:14 AM. Your batch job picks it up at 8:00 AM — six hours too late to block the chargebacks.

Stream processing exists to close that window. Instead of accumulating data in a store and processing it later, a stream pipeline consumes and transforms events as they arrive, producing low-latency results in seconds or minutes.

The canonical contrast:

| Mode | Latency | State model | When to use |
| --- | --- | --- | --- |
| Batch | Hours | All data read at once | Nightly reports, ML training |
| Micro-batch | Seconds–minutes | Per-micro-batch window | Spark Structured Streaming pipelines |
| True streaming | Milliseconds–seconds | Per-event or windowed state | Fraud detection, live dashboards, alerting |

The pattern this post focuses on: stateful windowed aggregation — computing live revenue per region per 5-minute tumbling window from a continuous stream of order events.


πŸ” Framework Choices for Java Stream Processing

Four mature options exist for JVM teams. The right choice depends on your operational model and existing stack.

| Framework | Deployment model | Best for | Key limitation |
| --- | --- | --- | --- |
| Kafka Streams | Embedded library inside your Spring Boot app | Kafka-native teams, per-service aggregations | Kafka-only source/sink |
| Apache Flink | Dedicated cluster (or Kubernetes operator) | High-throughput analytics, CEP, ML feature pipelines | Cluster overhead; steeper ops curve |
| Spark Structured Streaming | Spark cluster | SQL-style pipelines, data lake integration | Micro-batch latency floor; requires Spark expertise |
| Spring Cloud Stream + Kafka Binder | Embedded in Spring Boot | Simple, declarative fan-out/enrichment pipelines | Limited windowing and state support |

This post uses Kafka Streams. It runs as a library inside your application — no separate cluster to provision, operate, or scale. It offers exactly-once semantics via Kafka transactions and exposes an interactive state store API that makes building a live REST query endpoint trivial.


βš™οΈ Event Time, Windows, and State: The Three Pillars of Stream Processing

Event time vs. processing time

Event time is when the event actually occurred — recorded in the event payload itself. Processing time is when the event arrives at your pipeline. Mobile clients, network partitions, and consumer lag mean these can diverge by seconds to minutes.

A pipeline relying only on processing time assigns late-arriving events to the wrong window. An order placed at 10:04:50 that arrives at 10:05:10 gets counted in the 10:05–10:10 window instead of 10:00–10:05. Over a busy traffic period, this skew misattributes revenue across adjacent windows, producing totals that never converge to the true figures.

Kafka Streams uses event time when you configure a TimestampExtractor that reads the event's own timestamp field.
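
A minimal extractor sketch, assuming the OrderEvent record defined later in this post (the class name is illustrative); wire it in via the default.timestamp.extractor Streams property:

public class OrderEventTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Prefer the timestamp carried inside the event payload (event time)
        if (record.value() instanceof OrderEvent event) {
            return event.eventTime().toEpochMilli();
        }
        // Fall back to the partition's current stream time for unexpected payloads
        return partitionTime;
    }
}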

Window types

| Type | Shape | Example use case |
| --- | --- | --- |
| Tumbling | Fixed-size, non-overlapping slots | Revenue per 5-min window |
| Sliding | Fixed-size, overlapping by a step interval | Rolling 60-sec fraud score |
| Session | Variable-length, gap-based | User session activity grouping |

For revenue aggregation, a tumbling window is ideal — each 5-minute slot is independent and maps cleanly to one dashboard data point.

📊 Window Types Compared

flowchart LR
    subgraph Tumbling
        T1[0-60s] --> T2[60-120s]
        T2 --> T3[120-180s]
    end
    subgraph Sliding
        S1[0-60s]
        S2[30-90s]
        S3[60-120s]
    end
    subgraph Session
        SE1[events grouped by idle gap]
    end

The diagram contrasts three windowing strategies side by side: tumbling windows produce non-overlapping fixed-size time buckets where every event belongs to exactly one window; sliding windows advance in smaller steps and overlap, so the same event contributes to multiple windows; session windows group events by idle gap rather than clock time, making their size variable and data-driven. For the revenue aggregation scenario in this post, tumbling windows are the right choice — each 5-minute slot maps cleanly to one independent dashboard data point with no double-counting between adjacent windows.
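
In Kafka Streams, the three shapes correspond to three window definitions; a minimal sketch (the durations are illustrative):

// Tumbling: fixed 5-minute buckets, no overlap
TimeWindows tumbling = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5));
// Sliding: bounded by the maximum time difference between two records, overlapping
SlidingWindows sliding = SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofSeconds(60));
// Session: a window closes after 30 minutes of per-key inactivity
SessionWindows sessions = SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30));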

Stateful vs. stateless operations

A stateless operation (map, filter) processes each event independently. A stateful operator — like a running tally that accumulates results across events — maintains a state store: a local RocksDB key-value store backed by a Kafka changelog topic for durability and replay. This is what makes windowed aggregation possible: the running sum for each (region, window) key accumulates in the store until the window closes and the final result is emitted downstream.
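
A quick contrast in code, as a sketch assuming the orders stream from the implementation section below (the store name is illustrative):

// Stateless: each event is evaluated on its own; no store is involved
orders.filter((orderId, event) -> event.revenueCents() > 0);

// Stateful: a per-key running count, materialized in a local RocksDB store
// that Kafka Streams backs with a changelog topic
orders.groupByKey().count(Materialized.as("order-count-store"));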


📊 Live Revenue Topology: How the Data Flows

The complete Kafka Streams topology for this post's scenario — every order event in, aggregated revenue per region per 5-minute window out:

flowchart LR
    A([order-events topic]) --> B["selectKey (by region)"]
    B --> C[groupByKey]
    C --> D["windowedBy (5-min tumbling)"]
    D --> E["aggregate (sum revenueCents)"]
    E --> F[(revenue-store WindowStore)]
    F --> G[toStream]
    G --> H([revenue-by-region topic])
    F -. "interactive query" .-> I["REST /revenue/{region}/current"]

Re-keying by region is the critical step: it ensures all events for the same region land in the same Kafka partition, which is a prerequisite for correct partition-local aggregation. Without this step, the same region's events would be spread across partitions and each partition would produce a partial — and incorrect — sum.


🧪 Java Implementation: Live Revenue Aggregation with Kafka Streams

This example demonstrates a complete, production-shaped Kafka Streams topology implementing the live revenue aggregation scenario described earlier in this post. It was chosen because Kafka Streams runs as a library inside a standard JVM service — no separate cluster to operate — making it the most accessible entry point for teams already running Kafka. Read the code top-to-bottom as a pipeline: Maven dependencies first, domain events next, then the topology definition that wires source topic → key selection → windowed aggregation → sink topic.

Maven dependencies

<!-- Versions are resolved by the Spring Boot dependency BOM -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.kafka</groupId>
  <artifactId>spring-kafka</artifactId>
</dependency>

Domain events

public record OrderEvent(String orderId, String region, long revenueCents, Instant eventTime) {}
public record RegionRevenueSummary(String region, long totalRevenueCents, Instant windowStart, Instant windowEnd) {}

Kafka Streams topology

@Configuration
public class RevenueStreamConfig {

    @Bean
    public KStream<String, OrderEvent> revenueStream(StreamsBuilder builder) {
        // StreamsBuilder is injectable once @EnableKafkaStreams is declared on the app
        // Input: key=orderId, value=OrderEvent (JSON deserialized)
        KStream<String, OrderEvent> orders = builder.stream("order-events",
            Consumed.with(Serdes.String(), orderEventSerde()));

        // Re-key by region for correct partition-level aggregation
        KStream<String, OrderEvent> byRegion = orders.selectKey(
            (orderId, event) -> event.region());

        // Tumbling 5-minute window, keyed by region
        KTable<Windowed<String>, Long> windowedRevenue = byRegion
            .groupByKey(Grouped.with(Serdes.String(), orderEventSerde()))
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .aggregate(
                () -> 0L,
                (region, event, total) -> total + event.revenueCents(),
                Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("revenue-store")
                    .withValueSerde(Serdes.Long())
            );

        // Map to output records and push to downstream topic
        windowedRevenue
            .toStream()
            .map((windowedRegion, total) -> KeyValue.pair(
                windowedRegion.key(),
                new RegionRevenueSummary(
                    windowedRegion.key(), total,
                    windowedRegion.window().startTime(),
                    windowedRegion.window().endTime()
                )))
            .to("revenue-by-region", Produced.with(Serdes.String(), revenueSummarySerde()));

        return orders;
    }
}
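
The orderEventSerde() and revenueSummarySerde() helpers are referenced but not defined above; a minimal sketch, assuming Spring Kafka's JsonSerde:

private Serde<OrderEvent> orderEventSerde() {
    // JSON serde for input events (org.springframework.kafka.support.serializer.JsonSerde)
    return new JsonSerde<>(OrderEvent.class);
}

private Serde<RegionRevenueSummary> revenueSummarySerde() {
    // JSON serde for the summaries written to revenue-by-region
    return new JsonSerde<>(RegionRevenueSummary.class);
}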

Spring Boot application.yml configuration

spring:
  kafka:
    streams:
      application-id: revenue-aggregator
      bootstrap-servers: kafka:9092
      properties:
        processing.guarantee: exactly_once_v2
        default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
        default.value.serde: org.springframework.kafka.support.serializer.JsonSerde
        # Event-time extraction (class name illustrative; see the extractor sketch earlier)
        # default.timestamp.extractor: com.example.OrderEventTimestampExtractor

Querying the state store via REST

Because the revenue-store is materialized, you can query it directly from a REST controller — no extra database round-trip required:

@RestController
public class RevenueQueryController {
    private final StreamsBuilderFactoryBean factoryBean;

    public RevenueQueryController(StreamsBuilderFactoryBean factoryBean) {
        this.factoryBean = factoryBean;
    }

    @GetMapping("/revenue/{region}/current")
    public Long getCurrentRevenue(@PathVariable String region) {
        // Resolve the running KafkaStreams instance lazily; it only exists
        // after the Streams app has started
        KafkaStreams streams = factoryBean.getKafkaStreams();
        ReadOnlyWindowStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType("revenue-store",
                QueryableStoreTypes.windowStore()));

        Instant now = Instant.now();
        long latest = 0L;
        // fetch() iterates windows oldest-first; keep the newest and close the iterator
        try (WindowStoreIterator<Long> it = store.fetch(region, now.minus(5, ChronoUnit.MINUTES), now)) {
            while (it.hasNext()) {
                latest = it.next().value;
            }
        }
        return latest;
    }
}

This is one of Kafka Streams' most practical production features: your Spring Boot service simultaneously processes the stream and serves the latest aggregated result. For many real-time read-path use cases this eliminates a Redis or PostgreSQL dependency entirely.


🧠 Deep Dive: Late Arrivals, Watermarks, and Exactly-Once Guarantees

Internals: Grace Periods, Late Arrivals, and Watermarks

TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)) is the strictest window policy: any event arriving after its window's close is silently dropped. For a live dashboard this is often acceptable — a 30-second late order is minor noise.

For financial correctness, add a grace period:

.windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofSeconds(30)))

This holds the window open an extra 30 seconds, absorbing minor network delays before emitting the final aggregate. A region revenue window that drops orders arriving 30 seconds late can undercount by 2–5% during high-traffic bursts — a material discrepancy in a live revenue report reviewed by finance.

Watermarks serve an analogous role in Flink: they are periodic timestamps — a time-boundary marker telling the engine "no earlier event is coming" — that allow the engine to safely close windows without waiting forever. Kafka Streams has no explicit watermarks; the closest knob is StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, which controls how long a task waits for records from lagging partitions before advancing its local stream time.
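
In Spring Boot's application.yml form, the setting looks like this (the 500 ms value is illustrative, not a recommendation):

spring:
  kafka:
    streams:
      properties:
        max.task.idle.ms: 500  # wait up to 500 ms for lagging partitions before advancing stream time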

Performance Analysis: Exactly-Once Semantics and Throughput Trade-offs

Setting processing.guarantee: exactly_once_v2 activates Kafka's transactional producer and the fenced consumer-group protocol. This guarantees that every input record is processed exactly once — no duplicates on crash, no dropped records on restart — even when a Streams task fails mid-window.

Important operational constraint: exactly_once_v2 requires at least 3 Kafka brokers. The transactional coordinator writes its own internal log (__transaction_state) with a replication factor of 3 by default. Running this setting against a single-broker development cluster will cause the Streams app to fail at startup with an InvalidReplicationFactorException. Always verify broker count before enabling in production.
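
For a single-broker development cluster, the broker-side defaults can be lowered instead; a dev-only sketch with correspondingly reduced durability:

# server.properties overrides for a single-broker dev cluster (never in production)
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1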

| Configuration | Throughput impact | Notes |
| --- | --- | --- |
| exactly_once_v2 | ~5–10% reduction | Requires ≥3 brokers; correct default for financial pipelines |
| Grace period set | State store held longer | Memory proportional to grace window size |
| num.standby.replicas=1 | Slightly higher Kafka I/O | Eliminates cold-start rebuild on failover |
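
Standby replicas are a one-line Streams setting; a sketch in the same application.yml style as earlier:

spring:
  kafka:
    streams:
      properties:
        num.standby.replicas: 1  # keep a warm copy of each state store on another instance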

βš–οΈ Trade-offs & Failure Modes: Failure Modes in Stream Pipelines

| Failure mode | Symptom | Root cause | Mitigation |
| --- | --- | --- | --- |
| Consumer lag grows | Dashboard values fall behind real-time | Underpowered consumers or slow aggregate logic | Scale consumer threads; profile the aggregate UDF |
| State store corrupt after restart | Wrong aggregates despite healthy Kafka | Changelog topic deleted or under-replicated | Set replication.factor ≥ 3 on all changelog topics |
| Window undercounting | Revenue totals lower than source-of-truth | Late events dropped; grace period absent | Add grace period; run periodic batch reconciliation |
| Rebalance storm on startup | Streams app crashes and restarts repeatedly | Too many partitions or sticky assignment disabled | Enable sticky assignor; tune session.timeout.ms |
| Duplicate output records | Downstream sees repeated revenue summaries | at_least_once guarantee with no idempotent sink | Upgrade to exactly_once_v2; add idempotent sink logic |

🧭 Decision Guide: Choosing the Right Framework for Your Pipeline

| Situation | Recommendation |
| --- | --- |
| Team is Kafka-native and needs per-service aggregations | Kafka Streams — embedded library, no extra cluster, interactive state store queries |
| Complex event processing (CEP), ML feature pipelines, or multi-source joins | Apache Flink — richer windowing API, true stream time, purpose-built cluster |
| Team is already on Spark and data lands in a data lake | Spark Structured Streaming — micro-batch is fine; SQL API is mature |
| Simple enrichment or fan-out with minimal windowing | Spring Cloud Stream — declarative, low boilerplate, fits Spring Boot teams |
| Pipeline must restart without manual data recovery | Any framework — ensure changelog topics are retained and num.standby.replicas ≥ 1 |

🌍 Real-World Applications: Where Stream Pipelines Are Applied

| Industry | Use case | Why stream processing fits |
| --- | --- | --- |
| E-commerce | Live revenue dashboards, fraud detection | Sub-second latency to catch anomalies before chargebacks |
| Finance | Real-time transaction monitoring, risk scoring | Windowed aggregation over event-time prevents stale risk models |
| Logistics | Fleet tracking, delivery ETA updates | Stateful joins correlate position events to live route state |
| AdTech | Impression and click counting per campaign | High-throughput tumbling windows per advertiser ID |

🔧 Operating a Stream Pipeline in Production

Ops runbook checkpoints:

| Signal | First check | Escalation action |
| --- | --- | --- |
| Consumer lag spiking | GC pause logs → thread count → aggregate UDF profiling | Scale consumer threads; add partitions |
| Wrong aggregates after restart | Changelog topic retention policy | Restore from Kafka backup; increase retention |
| Rebalance loop on startup | num.stream.threads vs. partition count | Set sticky assignor; reduce thread count |

Apache Flink is a distributed stream processing engine designed for high-throughput, low-latency analytics — it runs as a dedicated cluster (or via Kubernetes Operator) and supports true stream-time semantics with watermarks, complex event patterns, and managed state stores backed by RocksDB.

Where Kafka Streams is the right choice for per-service aggregations embedded in a Spring Boot app, Flink shines for cross-service analytics, ML feature pipelines, and multi-source joins that would require multiple Kafka Streams instances to coordinate.

public class RevenueAggregationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60s for exactly-once

        DataStream<OrderEvent> orders = env
            .addSource(new FlinkKafkaConsumer<>(
                "order-events",
                new JsonDeserializationSchema<>(OrderEvent.class),
                kafkaProperties())); // kafkaProperties() supplies bootstrap servers etc. (not shown)

        // KeyedStream: partition by region, then apply 5-minute tumbling window
        orders
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((event, ts) -> event.eventTime().toEpochMilli()))
            .keyBy(OrderEvent::region)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            // Two-argument aggregate: the AggregateFunction keeps only a compact
            // running sum; the ProcessWindowFunction attaches key and window bounds
            .aggregate(new RevenueAggregateFunction(), new RevenueWindowFunction())
            .addSink(new FlinkKafkaProducer<>(
                "revenue-by-region",
                new JsonSerializationSchema<>(),
                kafkaProperties()));

        env.execute("Live Revenue Aggregation");
    }

    // Incremental accumulator: state per (region, window) is a single long
    static class RevenueAggregateFunction
            implements AggregateFunction<OrderEvent, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(OrderEvent e, Long acc) { return acc + e.revenueCents(); }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    // Wraps the pre-aggregated sum with the region key and window bounds
    static class RevenueWindowFunction
            extends ProcessWindowFunction<Long, RegionRevenueSummary, String, TimeWindow> {
        @Override
        public void process(String region, Context ctx, Iterable<Long> sums,
                            Collector<RegionRevenueSummary> out) {
            out.collect(new RegionRevenueSummary(
                region, sums.iterator().next(),
                Instant.ofEpochMilli(ctx.window().getStart()),
                Instant.ofEpochMilli(ctx.window().getEnd())));
        }
    }
}

WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(30)) tells Flink to lag the watermark 30 seconds behind the highest event time seen, so windows fire only after out-of-order events within that bound have had a chance to arrive — the Flink analogue of Kafka Streams' .ofSizeAndGrace() policy. env.enableCheckpointing(60_000) activates Flink's distributed snapshot mechanism for exactly-once fault tolerance without a separate transaction coordinator.

| Capability | Kafka Streams | Apache Flink |
| --- | --- | --- |
| Deployment | Embedded library in Spring Boot | Dedicated cluster or K8s Operator |
| State store | RocksDB (local, REST queryable) | RocksDB (distributed, HA replicas) |
| Exactly-once | Kafka transactions | Distributed checkpoints (Chandy-Lamport) |
| Best for | Per-service aggregations | Cross-service analytics, CEP, ML features |

For a full deep-dive on Apache Flink cluster setup, stateful CEP operators, and ML feature pipeline patterns, a dedicated follow-up post is planned.

📊 Stateful Stream: Kafka to Sink

sequenceDiagram
    participant K as Kafka
    participant F as Flink
    participant SB as StateBackend
    participant Sink as OutputSink
    K->>F: Stream of events
    F->>SB: Read keyed state
    SB-->>F: Current aggregate
    F->>F: Update aggregate
    F->>SB: Write updated state
    F->>Sink: Emit aggregated result
    Note over F,SB: Exactly-once via checkpoints

The sequence diagram shows how Flink achieves exactly-once semantics across a stateful aggregation: every event consumed from Kafka triggers a read-modify-write cycle against the state backend, and the updated aggregate is written to both the state backend and the output sink within the same checkpoint boundary. The note spanning Flink and the state backend captures the mechanism — Chandy-Lamport distributed snapshots ensure that if the job crashes mid-stream, it replays from the last consistent checkpoint rather than reprocessing from the beginning and double-counting already-emitted aggregates.

📚 Lessons Learned

  • Event-time is almost always the right choice. Processing-time windows produce correct-looking but subtly wrong results during any consumer lag event or network hiccup.
  • Re-keying before groupByKey is not optional. Grouping on the original orderId key spreads events for the same region across partitions, silently breaking partition-local aggregation.
  • Grace periods belong in business requirements. Agree on late-arrival tolerance before writing the topology — reversing it later requires a state store migration.
  • State store rebuild time is a production SLA risk. Pre-populate standby replicas (num.standby.replicas=1) so failover is a hot swap, not a cold replay lasting minutes.
  • Interactive queries eliminate a database dependency. The REST endpoint querying revenue-store directly is a first-class Kafka Streams feature, not a workaround.

📌 TLDR: Summary & Key Takeaways

  • Stream processing transforms continuous event streams in real time, eliminating the latency gap of batch pipelines — live revenue windows instead of 6-hour reports.
  • Kafka Streams runs embedded in your Spring Boot service: no extra cluster, full exactly-once via Kafka transactions, and interactive state store queries on the same JVM.
  • Windowed aggregation requires re-keying by the group dimension, choosing the right window type, and a grace period that matches late-arrival tolerance.
  • exactly_once_v2 is right for financial pipelines but requires at least 3 Kafka brokers — verify before enabling in new environments.
  • Consumer lag on the input topic is the most actionable operational signal — it shows the pipeline is falling behind before the dashboard shows stale numbers.

