# Stream Processing Pipeline Pattern: Stateful Real-Time Data Products
Build low-latency pipelines with windowing, state stores, and exactly-once outcomes.
Abstract Algorithms
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: Stream pipelines succeed when event-time semantics, state management, and replay strategy are designed together, and Kafka Streams lets you build all three directly inside your Spring Boot service.
Stripe's real-time fraud detection processes each transaction in under 50 ms. Fraud detection that runs on yesterday's batch data misses a card-cloning attack tonight. Stream processing handles events as they arrive.
## From Batch Jobs to Live Data: Why Stream Processing Exists
Imagine your e-commerce platform's revenue dashboard refreshes every six hours. A surge of fraudulent orders hits a specific region at 2:14 AM. Your batch job picks it up at 8:00 AM, six hours too late to block the chargebacks.
Stream processing exists to close that window. Instead of accumulating data in a store and processing it later, a stream pipeline consumes and transforms events as they arrive, producing low-latency results in seconds or minutes.
The canonical contrast:
| Mode | Latency | State model | When to use |
| --- | --- | --- | --- |
| Batch | Hours | All data read at once | Nightly reports, ML training |
| Micro-batch | Seconds to minutes | Per-micro-batch window | Spark Structured Streaming pipelines |
| True streaming | Milliseconds to seconds | Per-event or windowed state | Fraud detection, live dashboards, alerting |
The pattern this post focuses on is stateful windowed aggregation: computing live revenue per region per 5-minute tumbling window from a continuous stream of order events.
## Framework Choices for Java Stream Processing
Four mature options exist for JVM teams. The right choice depends on your operational model and existing stack.
| Framework | Deployment model | Best for | Key limitation |
| --- | --- | --- | --- |
| Kafka Streams | Embedded library inside your Spring Boot app | Kafka-native teams, per-service aggregations | Kafka-only source/sink |
| Apache Flink | Dedicated cluster (or Kubernetes operator) | High-throughput analytics, CEP, ML feature pipelines | Cluster overhead; steeper ops curve |
| Spark Structured Streaming | Spark cluster | SQL-style pipelines, data lake integration | Micro-batch latency floor; requires Spark expertise |
| Spring Cloud Stream + Kafka Binder | Embedded in Spring Boot | Simple, declarative fan-out/enrichment pipelines | Limited windowing and state support |
This post uses Kafka Streams. It runs as a library inside your application: no separate cluster to provision, operate, or scale. It offers exactly-once semantics via Kafka transactions and exposes an interactive state store API that makes building a live REST query endpoint trivial.
## Event Time, Windows, and State: The Three Pillars of Stream Processing

### Event time vs. processing time

Event time is when the event actually occurred, recorded in the event payload itself. Processing time is when the event arrives at your pipeline. Mobile clients, network partitions, and consumer lag mean these can diverge by seconds to minutes.
A pipeline relying only on processing time assigns late-arriving events to the wrong window. An order placed at 10:04:50 that arrives at 10:05:10 gets counted in the 10:05–10:10 window instead of 10:00–10:05. Over a busy traffic period, this shift produces revenue totals that are off by minutes and never converge.
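The mismatch can be reproduced with plain window arithmetic, independent of any framework. This is an illustrative sketch (the `WindowAssignment` class is hypothetical, not a Kafka Streams API): a tumbling window's start is the event timestamp floored to the window size, so the same order lands in different windows depending on which timestamp you floor.

```java
import java.time.Duration;
import java.time.Instant;

// Framework-independent sketch of tumbling-window assignment.
// WindowAssignment is a hypothetical helper, not a Kafka Streams class.
public class WindowAssignment {

    // A tumbling window's start is the timestamp floored to the window size.
    static Instant windowStart(Instant timestamp, Duration windowSize) {
        long sizeMs = windowSize.toMillis();
        long startMs = (timestamp.toEpochMilli() / sizeMs) * sizeMs;
        return Instant.ofEpochMilli(startMs);
    }

    public static void main(String[] args) {
        Instant eventTime      = Instant.parse("2024-06-01T10:04:50Z"); // when the order happened
        Instant processingTime = Instant.parse("2024-06-01T10:05:10Z"); // when it reached the pipeline
        Duration window = Duration.ofMinutes(5);

        // Event-time assignment puts the order in the 10:00-10:05 window...
        System.out.println("event-time window:      " + windowStart(eventTime, window));
        // ...processing-time assignment shifts it into 10:05-10:10.
        System.out.println("processing-time window: " + windowStart(processingTime, window));
    }
}
```

Flooring the event's own timestamp is exactly what a correctly configured event-time pipeline does; flooring the arrival time is the subtle bug.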
Kafka Streams uses event time when you configure a TimestampExtractor that reads the event's own timestamp field.
### Window types

| Type | Shape | Example use case |
| --- | --- | --- |
| Tumbling | Fixed-size, non-overlapping slots | Revenue per 5-min window |
| Sliding | Fixed-size, overlapping by a step interval | Rolling 60-sec fraud score |
| Session | Variable-length, gap-based | User session activity grouping |

For revenue aggregation, a tumbling window is ideal: each 5-minute slot is independent and maps cleanly to one dashboard data point.
### Window Types Compared

```mermaid
flowchart LR
    subgraph Tumbling
        T1[0-60s] --> T2[60-120s]
        T2 --> T3[120-180s]
    end
    subgraph Sliding
        S1[0-60s]
        S2[30-90s]
        S3[60-120s]
    end
    subgraph Session
        SE1[events grouped by idle gap]
    end
```
The diagram contrasts three windowing strategies side by side: tumbling windows produce non-overlapping fixed-size time buckets where every event belongs to exactly one window; sliding windows advance in smaller steps and overlap, so the same event contributes to multiple windows; session windows group events by idle gap rather than clock time, making their size variable and data-driven. For the revenue aggregation scenario in this post, tumbling windows are the right choice: each 5-minute slot maps cleanly to one independent dashboard data point with no double-counting between adjacent windows.
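The gap-based behavior of session windows can be sketched in a few lines of plain Java (an illustration only; `SessionGrouper` is a hypothetical helper, not a framework class): a new session starts whenever the gap since the previous event exceeds the configured idle gap.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch of gap-based session windowing over an event-time-sorted list.
// SessionGrouper is a hypothetical helper, not a framework class.
public class SessionGrouper {

    // Starts a new session whenever the gap to the previous event exceeds maxGap.
    static List<List<Instant>> sessions(List<Instant> sortedEvents, Duration maxGap) {
        List<List<Instant>> result = new ArrayList<>();
        List<Instant> current = new ArrayList<>();
        Instant previous = null;
        for (Instant t : sortedEvents) {
            if (previous != null && Duration.between(previous, t).compareTo(maxGap) > 0) {
                result.add(current);           // gap exceeded: close the session
                current = new ArrayList<>();
            }
            current.add(t);
            previous = t;
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }
}
```

Unlike tumbling windows, the session boundaries here are determined by the data itself: five events with one 80-second pause and a 30-second gap setting yield two sessions.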
### Stateful vs. stateless operations

A stateless operation (map, filter) processes each event independently. A stateful operator, like a running tally that accumulates results across events, maintains a state store: a local RocksDB key-value store backed by a Kafka changelog topic for durability and replay. This is what makes windowed aggregation possible: the running sum for each (region, window) key accumulates in the store until the window closes and the final result is emitted downstream.
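A minimal in-memory analogue makes the mechanics concrete. This sketch stands in for the RocksDB store; the `WindowedRevenueState` class is hypothetical and ignores durability, changelogs, and window closing entirely:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory analogue of a windowed state store: a running sum per
// (region, windowStart) key. In Kafka Streams this role is played by RocksDB
// plus a changelog topic; the HashMap here is purely illustrative.
public class WindowedRevenueState {
    private final Map<String, Long> store = new HashMap<>();
    private final long windowSizeMs;

    public WindowedRevenueState(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Stateful step: read the current sum for the event's window, add the
    // event's revenue, write the new sum back, and return it.
    public long add(String region, long eventTimeMs, long revenueCents) {
        long windowStart = (eventTimeMs / windowSizeMs) * windowSizeMs;
        String key = region + "@" + windowStart;
        return store.merge(key, revenueCents, Long::sum);
    }
}
```

Two orders in the same 5-minute window accumulate into one sum; an order in the next window starts a fresh entry under a new key.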
## Live Revenue Topology: How the Data Flows

The complete Kafka Streams topology for this post's scenario: every order event in, aggregated revenue per region per 5-minute window out:
```mermaid
flowchart LR
    A([order-events topic]) --> B["selectKey (by region)"]
    B --> C[groupByKey]
    C --> D["windowedBy (5-min tumbling)"]
    D --> E["aggregate (sum revenueCents)"]
    E --> F[("revenue-store WindowStore")]
    F --> G[toStream]
    G --> H([revenue-by-region topic])
    F -. "interactive query" .-> I["REST /revenue/{region}/current"]
```
Re-keying by region is the critical step: it ensures all events for the same region land in the same Kafka partition, which is a prerequisite for correct partition-local aggregation. Without this step, the same region's events would be spread across partitions and each partition would produce a partial, and incorrect, sum.
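The colocation effect can be illustrated with a simplified partitioner. Kafka's real default partitioner hashes the serialized key with murmur2; `String.hashCode()` below is a stand-in that shows the same effect, and `PartitionSketch` is a hypothetical class:

```java
// Sketch of why re-keying matters: a record goes to partition hash(key) % N,
// so only identical keys are guaranteed to land on the same partition.
// Kafka's real default partitioner uses murmur2 on the serialized key;
// String.hashCode() is an illustrative stand-in.
public class PartitionSketch {

    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the modulo result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 6;
        // Keyed by orderId: events for the same region scatter across partitions.
        System.out.println(partitionFor("order-1001", partitions));
        System.out.println(partitionFor("order-1002", partitions));
        // Keyed by region: every eu-west event lands on the same partition.
        System.out.println(partitionFor("eu-west", partitions));
        System.out.println(partitionFor("eu-west", partitions));
    }
}
```

With `orderId` keys, two orders from the same region routinely hash to different partitions, so no single partition ever sees the full regional sum; with `region` keys, one partition sees everything it needs.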
## Java Implementation: Live Revenue Aggregation with Kafka Streams

This example demonstrates a complete, production-shaped Kafka Streams topology implementing the live revenue aggregation scenario described earlier in this post. It was chosen because Kafka Streams runs as a library inside a standard JVM service (no separate cluster to operate), making it the most accessible entry point for teams already running Kafka. Read the code top to bottom as a pipeline: Maven dependencies first, domain events next, then the topology definition that wires source topic → key selection → windowed aggregation → sink topic.
### Maven dependencies

```xml
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
```
### Domain events

```java
public record OrderEvent(String orderId, String region, long revenueCents, Instant eventTime) {}

public record RegionRevenueSummary(String region, long totalRevenueCents, Instant windowStart, Instant windowEnd) {}
```
### Kafka Streams topology

```java
@Configuration
public class RevenueStreamConfig {

    @Bean
    public KStream<String, OrderEvent> revenueStream(StreamsBuilder builder) {
        // Input: key=orderId, value=OrderEvent (JSON deserialized)
        KStream<String, OrderEvent> orders = builder.stream("order-events",
                Consumed.with(Serdes.String(), orderEventSerde()));

        // Re-key by region for correct partition-level aggregation
        KStream<String, OrderEvent> byRegion = orders.selectKey(
                (orderId, event) -> event.region());

        // Tumbling 5-minute window, keyed by region
        KTable<Windowed<String>, Long> windowedRevenue = byRegion
                .groupByKey(Grouped.with(Serdes.String(), orderEventSerde()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .aggregate(
                        () -> 0L,
                        (region, event, total) -> total + event.revenueCents(),
                        Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("revenue-store")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(Serdes.Long())
                );

        // Map to output records and push to the downstream topic
        windowedRevenue
                .toStream()
                .map((windowedRegion, total) -> KeyValue.pair(
                        windowedRegion.key(),
                        new RegionRevenueSummary(
                                windowedRegion.key(), total,
                                windowedRegion.window().startTime(),
                                windowedRegion.window().endTime())))
                .to("revenue-by-region", Produced.with(Serdes.String(), revenueSummarySerde()));

        return orders;
    }

    // orderEventSerde() and revenueSummarySerde() are JSON serde factory
    // methods, omitted here for brevity.
}
```
### Spring Boot application.yml configuration

```yaml
spring:
  kafka:
    streams:
      application-id: revenue-aggregator
      bootstrap-servers: kafka:9092
      properties:
        processing.guarantee: exactly_once_v2
        default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
        default.value.serde: org.springframework.kafka.support.serializer.JsonSerde
```
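For deployments outside Spring Boot, the same settings can be expressed directly against `StreamsConfig` (a config sketch; the broker address and application id mirror the YAML above):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;

// Plain-Kafka-Streams equivalent of the Spring Boot YAML configuration.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "revenue-aggregator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
```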
### Querying the state store via REST

Because `revenue-store` is materialized, you can query it directly from a REST controller: no extra database round-trip required.
```java
@RestController
public class RevenueQueryController {

    private final KafkaStreams streams;

    public RevenueQueryController(KafkaStreams streams) {
        this.streams = streams;
    }

    @GetMapping("/revenue/{region}/current")
    public Long getCurrentRevenue(@PathVariable String region) {
        ReadOnlyWindowStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("revenue-store",
                        QueryableStoreTypes.windowStore()));
        Instant now = Instant.now();
        // Fetch the window(s) covering the last 5 minutes for this region;
        // try-with-resources closes the iterator and releases store resources.
        try (WindowStoreIterator<Long> it = store.fetch(
                region, now.minus(5, ChronoUnit.MINUTES), now)) {
            return it.hasNext() ? it.next().value : 0L;
        }
    }
}
```
This is one of Kafka Streams' most practical production features: your Spring Boot service simultaneously processes the stream and serves the latest aggregated result. For many real-time read-path use cases this eliminates a Redis or PostgreSQL dependency entirely.
## Deep Dive: Late Arrivals, Watermarks, and Exactly-Once Guarantees

### Internals: Grace Periods, Late Arrivals, and Watermarks

`TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))` is the strictest window policy: any event arriving after its window's close is silently dropped. For a live dashboard this is often acceptable, since a 30-second-late order is minor noise.
For financial correctness, add a grace period:
```java
.windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofSeconds(30)))
```
This holds the window open an extra 30 seconds, absorbing minor network delays before emitting the final aggregate. A region revenue window that drops orders arriving 30 seconds late can undercount by 2–5% during high-traffic bursts, a material discrepancy in a live revenue report reviewed by finance.
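The drop-vs-absorb behavior can be simulated in plain Java. This toy model captures only the rule "a late event is dropped once stream time passes window end plus grace"; `GraceSimulation` is hypothetical and is not the Kafka Streams implementation:

```java
import java.util.List;

// Toy simulation of window closing with and without a grace period.
// GraceSimulation is an illustrative sketch, not Kafka Streams internals.
public class GraceSimulation {

    // Sums revenue for the window [windowStartMs, windowStartMs + windowSizeMs).
    // Events arrive in list order; stream time is the max event time seen so
    // far, and an in-window event is dropped if stream time has already passed
    // the window end plus the grace period.
    static long windowSum(List<long[]> eventsInArrivalOrder, // {eventTimeMs, revenueCents}
                          long windowStartMs, long windowSizeMs, long graceMs) {
        long windowEnd = windowStartMs + windowSizeMs;
        long streamTime = Long.MIN_VALUE;
        long sum = 0;
        for (long[] e : eventsInArrivalOrder) {
            long eventTime = e[0], revenue = e[1];
            streamTime = Math.max(streamTime, eventTime);
            boolean inWindow = eventTime >= windowStartMs && eventTime < windowEnd;
            boolean windowStillOpen = streamTime < windowEnd + graceMs;
            if (inWindow && windowStillOpen) sum += revenue;
        }
        return sum;
    }
}
```

With three events (an on-time order, an order from the next window that advances stream time, then a late order for the first window), zero grace drops the late order while a 30-second grace absorbs it.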
Watermarks serve an analogous role in Flink: they are periodic timestamps, markers telling the engine "no earlier event is coming", that let it close windows safely without waiting forever. Kafka Streams has no explicit watermarks; stream time advances with the maximum event timestamp observed, and `StreamsConfig.MAX_TASK_IDLE_MS_CONFIG` controls how long a task waits for lagging partitions before letting stream time advance.
### Performance Analysis: Exactly-Once Semantics and Throughput Trade-offs

Setting `processing.guarantee: exactly_once_v2` activates Kafka's transactional producer and the fenced consumer-group protocol. This guarantees that every input record is processed exactly once (no duplicates on crash, no dropped records on restart), even when a Streams task fails mid-window.

Important operational constraint: with default broker settings, `exactly_once_v2` requires at least 3 Kafka brokers, because the transaction coordinator's internal log (`__transaction_state`) is created with a replication factor of 3 (`transaction.state.log.replication.factor`). Running this setting against a single-broker development cluster without lowering that factor will cause the Streams app to fail at startup with an `InvalidReplicationFactorException`. Always verify broker count before enabling it in a new environment.
| Configuration | Throughput impact | Notes |
| --- | --- | --- |
| `exactly_once_v2` | ~5–10% reduction | Requires ≥3 brokers (default settings); correct default for financial pipelines |
| Grace period set | State store held longer | Memory proportional to grace window size |
| `num.standby.replicas=1` | Slightly higher Kafka I/O | Eliminates cold-start rebuild on failover |
## Trade-offs and Failure Modes in Stream Pipelines
| Failure mode | Symptom | Root cause | Mitigation |
| --- | --- | --- | --- |
| Consumer lag grows | Dashboard values fall behind real time | Underpowered consumers or slow aggregate logic | Scale consumer threads; profile the aggregate UDF |
| State store corrupt after restart | Wrong aggregates despite healthy Kafka | Changelog topic deleted or under-replicated | Set `replication.factor` ≥ 3 on all changelog topics |
| Window undercounting | Revenue totals lower than source of truth | Late events dropped; grace period absent | Add grace period; run periodic batch reconciliation |
| Rebalance storm on startup | Streams app crashes and restarts repeatedly | Too many partitions or sticky assignment disabled | Enable sticky assignor; tune `session.timeout.ms` |
| Duplicate output records | Downstream sees repeated revenue summaries | `at_least_once` guarantee with no idempotent sink | Upgrade to `exactly_once_v2`; add idempotent sink logic |
## Decision Guide: Choosing the Right Framework for Your Pipeline
| Situation | Recommendation |
| --- | --- |
| Team is Kafka-native and needs per-service aggregations | Kafka Streams: embedded library, no extra cluster, interactive state store queries |
| Complex event processing (CEP), ML feature pipelines, or multi-source joins | Apache Flink: richer windowing API, true stream time, purpose-built cluster |
| Team is already on Spark and data lands in a data lake | Spark Structured Streaming: micro-batch is fine; SQL API is mature |
| Simple enrichment or fan-out with minimal windowing | Spring Cloud Stream: declarative, low boilerplate, fits Spring Boot teams |
| Pipeline must restart without manual data recovery | Any framework: ensure changelog topics are retained and `num.standby.replicas` ≥ 1 |
## Real-World Applications: Where Stream Pipelines Are Applied
| Industry | Use case | Why stream processing fits |
| --- | --- | --- |
| E-commerce | Live revenue dashboards, fraud detection | Sub-second latency to catch anomalies before chargebacks |
| Finance | Real-time transaction monitoring, risk scoring | Windowed aggregation over event time prevents stale risk models |
| Logistics | Fleet tracking, delivery ETA updates | Stateful joins correlate position events to live route state |
| AdTech | Impression and click counting per campaign | High-throughput tumbling windows per advertiser ID |
## Operating a Stream Pipeline in Production
Ops runbook checkpoints:
| Signal | First check | Escalation action |
| --- | --- | --- |
| Consumer lag spiking | GC pause logs → thread count → aggregate UDF profiling | Scale consumer threads; add partitions |
| Wrong aggregates after restart | Changelog topic retention policy | Restore from Kafka backup; increase retention |
| Rebalance loop on startup | `num.stream.threads` vs. partition count | Set sticky assignor; reduce thread count |
## Apache Flink: DataStream API for Complex Event Processing Pipelines

Apache Flink is a distributed stream processing engine designed for high-throughput, low-latency analytics. It runs as a dedicated cluster (or via the Kubernetes Operator) and supports true event-time semantics with watermarks, complex event patterns, and managed state backed by RocksDB.

Where Kafka Streams is the right choice for per-service aggregations embedded in a Spring Boot app, Flink shines for cross-service analytics, ML feature pipelines, and multi-source joins that would require multiple Kafka Streams instances to coordinate.
```java
public class RevenueAggregationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60s for exactly-once

        DataStream<OrderEvent> orders = env
                .addSource(new FlinkKafkaConsumer<>(
                        "order-events",
                        new JsonDeserializationSchema<>(OrderEvent.class),
                        kafkaProperties()));

        // KeyedStream: partition by region, then apply a 5-minute tumbling window
        orders
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((event, ts) -> event.eventTime().toEpochMilli()))
                .keyBy(OrderEvent::region)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                .aggregate(new RevenueAggregateFunction(), new RevenueWindowFunction())
                .addSink(new FlinkKafkaProducer<>(
                        "revenue-by-region",
                        new JsonSerializationSchema<>(),
                        kafkaProperties()));

        env.execute("Live Revenue Aggregation");
    }

    // Incrementally sums revenue; the accumulator is just the running total.
    static class RevenueAggregateFunction
            implements AggregateFunction<OrderEvent, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(OrderEvent e, Long acc) { return acc + e.revenueCents(); }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    // Wraps the aggregated total with the window key and bounds, which an
    // AggregateFunction alone cannot see.
    static class RevenueWindowFunction
            extends ProcessWindowFunction<Long, RegionRevenueSummary, String, TimeWindow> {
        @Override
        public void process(String region, Context ctx, Iterable<Long> totals,
                            Collector<RegionRevenueSummary> out) {
            long total = totals.iterator().next();
            out.collect(new RegionRevenueSummary(region, total,
                    Instant.ofEpochMilli(ctx.window().getStart()),
                    Instant.ofEpochMilli(ctx.window().getEnd())));
        }
    }
}
```
`WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(30))` tells Flink to hold windows open 30 seconds past their nominal close time to absorb late-arriving events, the Flink equivalent of Kafka Streams' `.ofSizeAndGrace()` policy. `env.enableCheckpointing(60_000)` activates Flink's distributed snapshot mechanism for exactly-once fault tolerance without a separate transaction coordinator.
| Capability | Kafka Streams | Apache Flink |
| --- | --- | --- |
| Deployment | Embedded library in Spring Boot | Dedicated cluster or K8s Operator |
| State store | RocksDB (local, REST queryable) | RocksDB (distributed, HA replicas) |
| Exactly-once | Kafka transactions | Distributed checkpoints (Chandy-Lamport) |
| Best for | Per-service aggregations | Cross-service analytics, CEP, ML features |
For a full deep-dive on Apache Flink cluster setup, stateful CEP operators, and ML feature pipeline patterns, a dedicated follow-up post is planned.
### Stateful Stream: Kafka to Sink

```mermaid
sequenceDiagram
    participant K as Kafka
    participant F as Flink
    participant SB as StateBackend
    participant Sink as OutputSink
    K->>F: Stream of events
    F->>SB: Read keyed state
    SB-->>F: Current aggregate
    F->>F: Update aggregate
    F->>SB: Write updated state
    F->>Sink: Emit aggregated result
    Note over F,SB: Exactly-once via checkpoints
```
The sequence diagram shows how Flink achieves exactly-once semantics across a stateful aggregation: every event consumed from Kafka triggers a read-modify-write cycle against the state backend, and the updated aggregate is written to both the state backend and the output sink within the same checkpoint boundary. The `Note over F,SB` annotation captures the mechanism: Chandy-Lamport-style distributed snapshots ensure that if the job crashes mid-stream, it replays from the last consistent checkpoint rather than reprocessing from the beginning and double-counting already-emitted aggregates.
## Lessons Learned
- Event time is almost always the right choice. Processing-time windows produce correct-looking but subtly wrong results during any consumer lag event or network hiccup.
- Re-keying before `groupByKey` is not optional. Grouping on the original `orderId` key spreads events for the same region across partitions, silently breaking partition-local aggregation.
- Grace periods belong in business requirements. Agree on late-arrival tolerance before writing the topology; reversing it later requires a state store migration.
- State store rebuild time is a production SLA risk. Pre-populate standby replicas (`num.standby.replicas=1`) so failover is a hot swap, not a cold replay lasting minutes.
- Interactive queries eliminate a database dependency. The REST endpoint querying `revenue-store` directly is a first-class Kafka Streams feature, not a workaround.
## TLDR: Summary & Key Takeaways
- Stream processing transforms continuous event streams in real time, eliminating the latency gap of batch pipelines: live revenue windows instead of 6-hour reports.
- Kafka Streams runs embedded in your Spring Boot service: no extra cluster, full exactly-once via Kafka transactions, and interactive state store queries on the same JVM.
- Windowed aggregation requires re-keying by the group dimension, choosing the right window type, and a grace period that matches late-arrival tolerance.
- `exactly_once_v2` is right for financial pipelines but requires at least 3 Kafka brokers with default settings; verify before enabling in new environments.
- Consumer lag on the input topic is the most actionable operational signal: it shows the pipeline is falling behind before the dashboard shows stale numbers.