Kappa Architecture: Streaming-First Data Pipelines
Eliminate the batch layer entirely: how Kappa architecture uses a single streaming pipeline for both real-time and historical processing.
TLDR: Kappa architecture replaces Lambda's batch + speed dual codebases with a single streaming pipeline backed by a replayable Kafka log. Reprocessing becomes replaying from offset 0. One codebase, no drift.
TLDR: Kappa is the right call when your team is streaming-native and can't afford the logic-drift tax of two parallel pipelines.
Three months after your team shipped a Lambda pipeline, a product manager flags a discrepancy in daily active user counts. Your Spark batch job and Flink streaming job are doing nominally the same sessionization logic, but one handles late events differently, and they've been silently computing different numbers since a business-logic change in Q4. Reconciling the divergence takes two engineers a full week. Sound familiar?
This is not a bug. This is the fundamental cost of Lambda's dual-codebase design. Kappa Architecture, proposed by Jay Kreps in 2014, argues that if your streaming system is reliable enough and your event log is replayable, you don't need a separate batch layer at all.
The Two-Codebase Trap: When Lambda Architecture Becomes a Liability
Lambda Architecture solved a real problem: streaming systems circa 2011 were approximate and lossy, so a trusted batch layer provided the "truth" while the speed layer provided freshness. The serving layer merged both views.
The architecture works, until you have to change something.
Any non-trivial update to business logic (a new sessionization window, a revised revenue calculation, a changed filter predicate) must be applied to two separate systems written in potentially different APIs. Your Spark batch job uses RDD transformations. Your Flink streaming job uses DataStream operators. They may share a concept, but they don't share code.
The result is operational drag that compounds over time:
- Logic drift: Two implementations of the same computation inevitably diverge. Late-event handling, timezone normalization, and null semantics each harbor subtle differences.
- Deployment friction: Changes require coordinated rollouts across two pipelines. A staging environment for one doesn't validate the other.
- Debugging asymmetry: When batch and streaming disagree, which one is right? Tracing the discrepancy across two codebases means reading two job graphs, two state stores, and two output schemas.
The irony is that by adding a batch layer for accuracy, teams often introduce a new source of inaccuracy: the divergence between the two paths.
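The divergence is easy to reproduce in miniature. In the sketch below (illustrative; all names invented), two sessionizers share the same 30-minute gap rule; the only difference is that the speed-layer version silently drops late-arriving events, and the session counts disagree:

```python
def sessionize(events, gap=30):
    """Count sessions per user; events must be sorted by event time (minutes)."""
    sessions, last = {}, {}
    for user, ts in events:
        if user not in last or ts - last[user] > gap:
            sessions[user] = sessions.get(user, 0) + 1
        last[user] = ts
    return sessions

def batch_view(arrivals, gap=30):
    # Batch layer: recomputes from the full log, sorted by event time.
    return sessionize(sorted(arrivals), gap)

def speed_view(arrivals, gap=30):
    # Speed layer: processes in arrival order and silently drops late events,
    # a common shortcut that quietly changes the semantics.
    newest, kept = {}, []
    for user, ts in arrivals:
        if ts >= newest.get(user, float("-inf")):
            kept.append((user, ts))
            newest[user] = ts
    return sessionize(kept, gap)

# ts=25 arrives late, after ts=55 has already been processed.
arrivals = [("u1", 0), ("u1", 55), ("u1", 25)]
print(batch_view(arrivals))  # {'u1': 1}: the late event bridges 0 and 55
print(speed_view(arrivals))  # {'u1': 2}: dropping the late event splits the session
```

Both functions are "correct" in isolation; only a side-by-side replay of the same arrivals exposes the drift.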
Lambda's Three-Layer Design and Where the Complexity Lives
Lambda Architecture is built on three layers that work in parallel:
- Batch layer: Processes the complete historical dataset on a schedule (hourly, daily). Accurate by design because it recomputes from immutable source data. Slow: results lag by the batch interval.
- Speed layer: Processes events in near real-time as they arrive. Low latency, but inherently approximate because it can only see recent data and must discard state periodically.
- Serving layer: Merges results from both layers, preferring batch truth for historical windows and speed-layer estimates for recent windows.
The batch layer is typically Apache Spark, Hive, or Presto. The speed layer is Apache Flink, Spark Structured Streaming, or Kafka Streams. These are different programming models, different deployment units, and different failure domains.
Every time business logic changes, both layers change. The serving layer's merge logic must account for temporal boundaries between them. And the team needs engineers who are comfortable in both paradigms simultaneously.
Kappa's premise: what if the batch layer were never necessary in the first place?
How Kappa Architecture Collapses Two Systems into One
Jay Kreps introduced Kappa Architecture in a 2014 blog post titled "Questioning the Lambda Architecture". His observation was precise: the only reason Lambda needs a separate batch layer is that the streaming systems of the day weren't reliable enough or fast enough to reprocess history at batch scale. But what if your streaming system could do both?
The Kappa insight has two parts:
1. Treat your event log as the source of truth, not the batch store. Kafka, with configurable retention (days, weeks, or indefinitely via compaction and tiered storage), stores every event immutably in order. There is no separate "cold" historical store: the Kafka topic is the history.
2. Make reprocessing equivalent to re-consuming. If you want to recompute from scratch after a bug fix, a logic change, or a schema migration, you deploy a new version of your streaming job pointed at offset 0 on the same topic. It reads every event in order, computes the new output, and writes to a new output topic or table. When it catches up to real-time, you atomically switch the serving layer and decommission the old output.
The result: one codebase, one deployment pipeline, one set of operators to test and monitor. Late events, watermarks, and windowing are handled once, inside the streaming job, not twice across two separate systems.
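The two-part insight can be condensed into a toy model. This is not Kafka's API, just a sketch of the property Kappa relies on: the read path for a full replay is identical to the read path for live tailing.

```python
class PartitionLog:
    """Append-only, offset-addressed event log (a single toy partition)."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)
        return len(self._records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        # The same code serves a live consumer near the head and a
        # reprocessing job replaying from offset 0.
        yield from self._records[offset:]

log = PartitionLog()
for event in ["signup:u1", "click:u1", "purchase:u1"]:
    log.append(event)

live_tail = list(log.read_from(2))    # live job, already caught up
full_replay = list(log.read_from(0))  # v2 job reprocessing all history
print(live_tail)    # ['purchase:u1']
print(full_replay)  # ['signup:u1', 'click:u1', 'purchase:u1']
```

Because the log is the truth, "reprocess everything" and "consume normally" are the same operation with a different starting offset.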
| Capability | Lambda approach | Kappa approach |
| --- | --- | --- |
| Reprocess after a logic change | Re-run Spark batch job | Replay Kafka topic from offset 0 |
| Historical + real-time query | Merge batch + speed outputs | Single stream output (current or reprocessed) |
| Business logic location | Two codebases (Spark + Flink) | One codebase (Flink or Kafka Streams) |
| Late event handling | Batch catches up on next cycle | Watermarks + allowed lateness in stream job |
| Schema evolution | Coordinate across both layers | Change once, replay to migrate output |
Deep Dive: How Kappa Handles Reprocessing Without a Batch Layer
The Internals: Kafka's Append-Only Log as a Replayable Source of Truth
Kafka's log is the structural enabler of Kappa. Each partition in a Kafka topic is an ordered, immutable sequence of records identified by an integer offset. Consumers track their position by storing an offset, and that offset can be reset.
This means replay is a first-class operation: set auto.offset.reset=earliest (or seek to offset 0 programmatically), and a consumer group re-reads every event from the beginning as if it were arriving fresh. The Kafka cluster doesn't know or care whether a consumer is processing live events or replaying history โ the read path is identical.
For Kappa, this means historical reprocessing requires zero new infrastructure. You don't need an HDFS cold store, a separate Spark cluster, or a batch scheduler. The same Flink job that processes live events is pointed at the same Kafka topic with a different starting offset. From Kafka's perspective, live processing and replay are indistinguishable, and that's precisely the point.
Key Kafka configurations for Kappa deployments:
- `retention.ms=-1` (or a sufficiently long period): ensures events are retained long enough for full reprocessing. For tiered storage setups (Confluent, MSK with S3 offload), retention can be effectively unlimited without cost-prohibitive local disk.
- `cleanup.policy=compact`: for event-sourced patterns where only the latest value per key matters, compaction retains the last write per key and is suitable for reference data topics.
- `group.id` isolation: the reprocessing job uses a new consumer group ID so it doesn't disturb the live job's offset commits.
The critical constraint: Kafka retention must be long enough to cover the full reprocessing window. If reprocessing takes 8 hours and your retention is 6 hours, the earliest events are gone before the job reaches them. Size retention to at least the worst-case replay time multiplied by a safety factor, typically 2x.
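That sizing rule is simple arithmetic. A back-of-envelope helper (the rates below are illustrative assumptions, not benchmarks):

```python
def replay_hours(days_of_history, ingest_eps, replay_eps):
    """Hours to replay `days_of_history` of events at a sustained replay rate."""
    total_events = days_of_history * 24 * 3600 * ingest_eps
    return total_events / replay_eps / 3600

def min_retention_hours(days_of_history, ingest_eps, replay_eps, safety=2.0):
    """Retention covering the history window plus a safety-padded replay run."""
    return days_of_history * 24 + safety * replay_hours(
        days_of_history, ingest_eps, replay_eps
    )

# 30 days of history, 100K events/sec ingest, 3M events/sec sustained replay:
print(replay_hours(30, 100_000, 3_000_000))         # 24.0 hours of replay
print(min_retention_hours(30, 100_000, 3_000_000))  # 768.0 hours of retention
```

Run the numbers with your own ingest and replay rates before committing to a retention setting.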
Performance Analysis: Throughput, Latency, and the Reprocessing Window
In a Lambda system, historical reprocessing uses dedicated batch compute: Spark reads HDFS partitions in parallel across a large cluster and produces results in minutes or hours depending on cluster size and data volume. The batch layer runs independently of the streaming layer.
In Kappa, reprocessing uses the same Flink cluster as live processing. This has important implications:
- Throughput ceiling: A Flink job replaying from offset 0 can typically consume 10-100x faster than the original ingestion rate, because replay is bounded by sequential read throughput from Kafka rather than by the arrival rate of new events. For a topic that received 100K events/second at ingest time, a well-tuned replay job can sustain 2-5M events/second, so full reprocessing of 30 days of data takes on the order of a day rather than 30 days.
- Backpressure vs. live jobs: If the reprocessing job and the live job share the same Flink cluster, reprocessing can starve live processing of resources. Best practice: run reprocessing as a separate Flink job on a separate task manager pool, or time it during off-peak windows.
- State size during replay: Flink maintains state (e.g., windowed aggregations, session state) in RocksDB-backed state stores. During replay, this state can grow much larger than in steady-state because all historical keys are active simultaneously. Provision state backend disk (or memory) for peak replay state, not just steady-state.
- Output catch-up: The new output topic receives no queries until the reprocessing job catches up to real-time. Plan the cutover window so consumers can tolerate delayed fresh data during the replay period.
Throughput rule of thumb: Kappa reprocessing is viable when your Flink cluster can replay 30 days of events in under 24 hours. If reprocessing takes 5+ days, batch compute (Lambda) may still be faster for historical analysis.
Lambda vs Kappa: Visualizing the Architecture Contrast
Lambda Architecture (Two Parallel Paths)
graph TD
Source["Event Source\n(Kafka / S3 / Kinesis)"] --> BL["Batch Layer\nSpark / Hive"]
Source --> SL["Speed Layer\nFlink / Kafka Streams"]
BL --> Serve["Serving Layer\nMerges batch + speed views"]
SL --> Serve
Serve --> Client["Client Queries"]
BL -->|"scheduled recompute\n(hourly / daily)"| BS["Batch Store\nParquet / Delta Lake"]
BS --> Serve
Lambda's serving layer must merge two independently computed views, creating the semantic gap that causes silent divergence.
Kappa Architecture (Single Streaming Path)
graph TD
Source["Event Source\n(Kafka: immutable log)"] --> Job["Stream Processor\nFlink / Kafka Streams"]
Job --> Out["Output Store\nKafka topic / DB table"]
Out --> Client["Client Queries"]
Source -->|"Reprocess: offset 0\nnew consumer group"| Job2["Reprocessing Job\n(same codebase, new version)"]
Job2 --> Out2["New Output v2\nCatch up, then cutover"]
Out2 -->|"atomic swap\nwhen caught up"| Client
Kappa has one code path. Reprocessing is re-consuming. The serving layer always reads from a single output.
Kappa Reprocessing Workflow
sequenceDiagram
participant Ops as Operator
participant Kafka as Kafka Topic
participant LiveJob as Live Job (v1)
participant NewJob as New Job (v2)
participant Serve as Serving Layer
Ops->>NewJob: Deploy v2 job with offset=0, new consumer group
NewJob->>Kafka: Consume from offset 0 (full history)
LiveJob->>Kafka: Continue consuming live events (unaffected)
NewJob-->>Serve: Writing to output_v2 (not yet served)
Note over NewJob,Kafka: Replay runs at 20x ingestion speed
NewJob->>Ops: Lag = 0 (caught up to real-time)
Ops->>Serve: Atomic swap: point serving at output_v2
Ops->>LiveJob: Drain and decommission v1 job
Ops->>Kafka: Delete output_v1 topic after validation
Zero-downtime cutover: the live job keeps running while the reprocessing job catches up. Clients never see a gap.
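The catch-up phase in the diagram can be modeled with a single loop: the replay job must out-consume a log that keeps growing. A toy simulation (rates are illustrative, in events per tick):

```python
def ticks_to_catch_up(backlog, replay_rate, ingest_rate):
    """Ticks until a replay consuming `replay_rate` events/tick catches a log
    growing at `ingest_rate` events/tick. Requires replay_rate > ingest_rate."""
    if replay_rate <= ingest_rate:
        raise ValueError("replay must outpace ingestion or lag never reaches 0")
    ticks = 0
    while backlog > 0:
        backlog += ingest_rate - replay_rate  # net drain per tick
        ticks += 1
    return ticks

serving = {"reads_from": "output_v1"}

# Replay at 20x ingestion speed, as in the diagram's note.
ticks = ticks_to_catch_up(backlog=1_000_000, replay_rate=20, ingest_rate=1)

# Lag reached zero: atomic swap, then decommission v1.
serving["reads_from"] = "output_v2"
print(ticks, serving)
```

The guard clause encodes the hard precondition of the whole workflow: if the replay cannot outpace ingestion, the cutover never happens.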
Real-World Applications: Kappa Architecture at LinkedIn and Uber
LinkedIn (the origin story): Jay Kreps co-created both Kafka and the Kappa Architecture proposal at LinkedIn. The motivation was direct: LinkedIn's analytics pipelines had accumulated years of dual-codebase maintenance debt. When the Kafka ecosystem matured enough to support high-throughput reliable consumption with exactly-once semantics, the batch layer's rationale evaporated. LinkedIn's activity feed, skills recommendations, and connection strength signals all migrated to Kappa pipelines using Kafka as the log and Samza (later Flink) as the processor.
Uber's real-time marketplace: Uber's surge pricing and driver dispatch require sub-second aggregations of GPS events, rider demand signals, and supply availability, all of which must be queryable in historical form for ML feature engineering and pricing model retraining. A Lambda system would require a separate Spark cluster for ML feature backfills. Uber instead uses a Kappa-style design where Flink jobs write to Hudi tables on S3, which serve both real-time queries (via Hudi's incremental view) and historical batch reads (via Hudi's Copy-On-Write snapshots). Reprocessing after a model update is a Flink replay job, not a Spark recompute.
Input / Process / Output walkthrough (e-commerce example):
| Stage | Description |
| --- | --- |
| Input | Raw clickstream events: user_id, product_id, event_type, timestamp arriving in Kafka at 50K events/sec |
| Process | Flink job applies 30-minute session windows grouped by user_id, aggregates add_to_cart, view, and purchase events per session |
| Output | Session summaries written to Kafka output topic โ consumed by downstream recommendation engine and analytics DB |
| Reprocess | After business redefines "session gap" from 30 min to 20 min: new Flink job replays full Kafka history, writes corrected summaries to sessions_v2 |
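The Reprocess row is the whole point of the design: the same replayed history, run through a new gap parameter, yields corrected session counts. A minimal illustration of why the replay changes the output:

```python
def count_sessions(event_minutes, gap):
    """Sessions in one user's sorted event times, split on inactivity > gap."""
    sessions, last = 0, None
    for ts in event_minutes:
        if last is None or ts - last > gap:
            sessions += 1
        last = ts
    return sessions

history = [0, 15, 40, 65, 120]  # one user's replayed click times (minutes)
print(count_sessions(history, gap=30))  # 2 sessions under the old definition
print(count_sessions(history, gap=20))  # 4 sessions after the redefinition
```

In Lambda, this redefinition would need to land in the Spark job and the Flink job; in Kappa, it lands once and the replay does the rest.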
Trade-offs & Failure Modes of Single-Pipeline Kappa Systems
Long-Retention Costs
Kafka storage is not free. At 50K events/second with a 200-byte average payload, a single topic generates ~850 GB/day. Retaining 90 days for full replay requires ~75 TB of Kafka cluster storage, roughly 10x the cost of equivalent cold storage in S3 or GCS. Tiered storage (Confluent Cloud, AWS MSK with S3 offload) mitigates this but introduces additional latency on historical reads.
Mitigation: Segment topics by retention tier. Use Kafka for hot replay (30-90 days). Archive to object storage (S3/GCS) for cold replay. Implement a replay bridge that reconstructs a Kafka stream from archived Parquet/Avro for deeper history.
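At its core, the replay bridge just concatenates two ordered streams so the consumer never sees the tier boundary. A hypothetical generator-based sketch (names invented; a real bridge would deserialize Parquet/Avro and handle offsets):

```python
def read_archive(archived_records):
    """Cold history, e.g. restored from Parquet/Avro in S3, oldest first."""
    yield from archived_records

def read_hot_log(hot_records, from_offset=0):
    """Events still within Kafka retention."""
    yield from hot_records[from_offset:]

def bridged_replay(archived_records, hot_records):
    # The archive covers everything older than the hot log's earliest offset;
    # the consuming job sees one continuous, ordered stream.
    yield from read_archive(archived_records)
    yield from read_hot_log(hot_records)

cold = ["e1", "e2"]  # aged out of Kafka retention
hot = ["e3", "e4"]   # still in Kafka
print(list(bridged_replay(cold, hot)))  # ['e1', 'e2', 'e3', 'e4']
```

The operational risk lives at the seam: the archive must end exactly where the hot log begins, with no gap and no overlap.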
Reprocessing Resource Contention
Running a full replay job on the same Flink cluster as live production jobs creates resource competition. A misconfigured replay job can exhaust task manager slots, memory, or network bandwidth, causing the live job to fall behind on event processing.
Mitigation: Use separate Flink clusters (or Kubernetes namespaces with resource quotas) for reprocessing. Gate replay jobs behind a change management process with defined maintenance windows.
Regulatory and Audit Requirements
Some industries (financial services, healthcare) require an immutable batch audit trail that is provably separate from the operational pipeline. Regulators may specifically ask for batch compute that doesn't share infrastructure with real-time systems. In these cases, Lambda's separation is a feature, not a defect.
Mitigation: If regulatory requirements mandate batch separation, Kappa is the wrong choice. Use Lambda, but invest in shared logic libraries (e.g., a shared Flink/Spark function JAR) to minimize the dual-codebase problem.
Failure Mode: Schema Evolution During Replay
If a Kafka topic contains events serialized with an older Avro or Protobuf schema, and the new Flink job uses an updated schema, replay will encounter deserialization errors on historical events. This is not a Lambda problem because batch jobs typically operate on pre-partitioned Parquet files that can be selectively read.
Mitigation: Always use a Schema Registry (Confluent Schema Registry or AWS Glue Schema Registry) with backward/forward compatibility rules. Write deserializers that handle multiple schema versions. Test replay on a sampled historical window before switching to full replay.
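A multi-version deserializer can be sketched with JSON standing in for Avro/Protobuf. The `channel` field and version numbers are invented for illustration: v1 records predate the field, so the reader upgrades them on the fly instead of failing mid-replay.

```python
import json

def deserialize(raw: bytes) -> dict:
    """Reads v1 and v2 records; upgrades v1 records to the v2 shape."""
    event = json.loads(raw)
    if event.get("schema_version", 1) == 1:
        event.setdefault("channel", "web")  # default for the field v2 added
        event["schema_version"] = 2
    return event

v1_record = b'{"user_id": "u1", "event_type": "view"}'
v2_record = b'{"user_id": "u2", "event_type": "view", "channel": "app", "schema_version": 2}'

print(deserialize(v1_record))             # upgraded in place, channel defaults to "web"
print(deserialize(v2_record)["channel"])  # app
```

With a Schema Registry enforcing backward compatibility, this upgrade logic stays small; without one, it grows a branch per historical schema.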
Lambda vs Kappa Decision Guide
| Factor | Lambda | Kappa |
| --- | --- | --- |
| Use when | You need independent batch audit trails, or historical analysis runs against years of data that is cheaper cold | Your team is streaming-native, your event volume fits in affordable Kafka retention, and logic simplicity matters more than batch parallelism |
| Avoid when | Your team can't maintain two separate codebases and test both after every logic change | Retention requirements exceed 90 days at high volume, or regulators require physically separate batch audit infrastructure |
| Reprocessing speed | Fast: dedicated Spark cluster with separate compute for batch recompute | Depends on Flink cluster headroom and Kafka retention span; typically 20-100x ingestion speed on a dedicated replay cluster |
| Operational complexity | High: two deployment units, two monitoring dashboards, two alerting configurations, dual testing surface | Lower: one job, one deployment, one set of metrics, though Kafka retention and schema versioning add their own operational surface |
| Late data handling | Batch layer catches up on the next scheduled run; can tolerate hours of lateness | Watermarks and allowedLateness in Flink; late events trigger window re-emits if within the allowed window |
| Best fit | Enterprise data warehouses where batch accuracy is non-negotiable and teams have Spark expertise | Real-time analytics platforms, event-driven microservices architectures, streaming-native ML feature pipelines |
Deciding rule of thumb: If your team would reach for Spark first for any new data job, stay with Lambda. If your team would reach for Flink or Kafka Streams first, Kappa will be net-simpler.
E-Commerce Clickstream Pipeline with Apache Flink (PyFlink)
The following PyFlink job implements the core Kappa pipeline: consuming raw click events from Kafka, computing 30-minute session windows per user, and writing session summaries back to Kafka. This same job, with offsets reset to earliest and a new consumer group, is the reprocessing job.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaSource,
    KafkaSink,
    KafkaRecordSerializationSchema,
    DeliveryGuarantee,
)
from pyflink.common import WatermarkStrategy, Duration
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream.window import EventTimeSessionWindows
from pyflink.datastream.functions import AggregateFunction
import json

# --- Stream execution environment ---
env = StreamExecutionEnvironment.get_execution_environment()
# Event time is the default time characteristic since Flink 1.12.
env.set_parallelism(4)
# --- Kafka source (live mode: latest; reprocess mode: earliest) ---
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer

kafka_source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka:9092")
    .set_topics("clickstream.raw")
    .set_group_id("kappa-session-v2")  # New group ID for reprocessing
    .set_starting_offsets(
        # Use KafkaOffsetsInitializer.earliest() for reprocessing
        KafkaOffsetsInitializer.latest()  # live mode
    )
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)
# --- Parse and assign event-time watermarks ---
from pyflink.common.watermark_strategy import TimestampAssigner

def parse_click(raw: str):
    """Returns (user_id, event_type, ts_ms) or None on parse error."""
    try:
        ev = json.loads(raw)
        return (ev["user_id"], ev["event_type"], int(ev["timestamp_ms"]))
    except Exception:
        return None

class ClickTimestampAssigner(TimestampAssigner):
    """Extracts ts_ms from the parsed (user_id, event_type, ts_ms) tuple."""
    def extract_timestamp(self, value, record_timestamp):
        return value[2]

# Consume raw strings without source watermarks; timestamps are assigned
# after parsing, once the event-time field is available.
raw_stream = env.from_source(
    kafka_source,
    WatermarkStrategy.no_watermarks(),
    "Clickstream Source",
)

click_stream = (
    raw_stream
    .map(parse_click)
    .filter(lambda x: x is not None)
    .assign_timestamps_and_watermarks(
        WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(30))
        .with_timestamp_assigner(ClickTimestampAssigner())
    )
)
# --- Session window aggregation: 30-minute gap ---
from pyflink.common.time import Time

class SessionAggregator(AggregateFunction):
    def create_accumulator(self):
        return {"views": 0, "add_to_cart": 0, "purchases": 0,
                "start_ts": None, "end_ts": None}

    def add(self, event, acc):
        _, event_type, ts = event
        acc["views"] += 1 if event_type == "view" else 0
        acc["add_to_cart"] += 1 if event_type == "add_to_cart" else 0
        acc["purchases"] += 1 if event_type == "purchase" else 0
        acc["start_ts"] = ts if acc["start_ts"] is None else min(acc["start_ts"], ts)
        acc["end_ts"] = ts if acc["end_ts"] is None else max(acc["end_ts"], ts)
        return acc

    def get_result(self, acc):
        return json.dumps(acc)

    def merge(self, acc_a, acc_b):
        # None-safe min/max: a fresh accumulator has no timestamps yet
        starts = [t for t in (acc_a["start_ts"], acc_b["start_ts"]) if t is not None]
        ends = [t for t in (acc_a["end_ts"], acc_b["end_ts"]) if t is not None]
        return {
            "views": acc_a["views"] + acc_b["views"],
            "add_to_cart": acc_a["add_to_cart"] + acc_b["add_to_cart"],
            "purchases": acc_a["purchases"] + acc_b["purchases"],
            "start_ts": min(starts) if starts else None,
            "end_ts": max(ends) if ends else None,
        }

session_summaries = (
    click_stream
    .key_by(lambda x: x[0])  # key by user_id
    .window(EventTimeSessionWindows.with_gap(Time.minutes(30)))
    .aggregate(SessionAggregator())
)
# --- Kafka sink: write session summaries ---
kafka_sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("kafka:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("sessions.v2")  # New topic for reprocessed output
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .set_delivery_guarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .build()
)

session_summaries.sink_to(kafka_sink)

env.execute("Kappa-Clickstream-Session-Pipeline-v2")
To switch from live to reprocessing mode, change exactly three lines:

- `set_group_id("kappa-session-v3")`: new consumer group so offsets start fresh.
- `set_starting_offsets(KafkaOffsetsInitializer.earliest())`: replay from the beginning.
- `set_topic("sessions.v3")`: write to a new output topic during catchup.
When the job's consumer lag reaches zero (visible in Kafka consumer group metrics), you update the serving layer to read from sessions.v3 and stop the v2 job. No Spark cluster needed. No batch scheduler needed.
Apache Flink: The Engine That Makes Kappa Architecture Viable at Scale
Apache Flink is the open-source stream processing framework most commonly paired with Kappa deployments. It was built from the ground up to handle the properties that Kappa requires: exactly-once semantics, event-time processing, and stateful computation with durable state backends.
Why Flink specifically enables Kappa:
- Exactly-once end-to-end: Flink's checkpointing mechanism snapshots operator state and Kafka offsets atomically. After a failure, Flink restores from the last checkpoint and resumes without duplicates, which is critical for Kappa because you cannot re-run a separate batch job to correct missed events.
- Event time vs processing time: Flink explicitly separates event time (when the event occurred, embedded in the payload) from processing time (when Flink processes it). For reprocessing, event time ensures that historical windows are computed identically to live windows: the same session boundaries, the same aggregation results.
- Watermarks: A watermark carrying timestamp T asserts that no more events with timestamps at or before T are expected, so watermarks track the progress of event time through the stream. During live processing they advance gradually; during replay they advance rapidly as historical data streams through. Window triggers fire on event-time watermarks, so the session window outputs during replay are semantically identical to what they would have been at original ingestion time.
- RocksDB state backend: For stateful jobs (session windows, user-level aggregations), Flink persists state to RocksDB, a local embedded key-value store backed by disk. This allows Flink jobs to maintain state many times larger than heap memory, which is essential for Kappa jobs replaying years of user sessions without OOM errors.
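The bounded-out-of-orderness strategy used in the pipeline's watermarks is easy to state in plain Python: the watermark trails the maximum event time seen by the allowed lateness, and it advances the same way whether events arrive over hours (live) or seconds (replay). A sketch:

```python
def watermark_sequence(event_times_ms, max_out_of_orderness_ms):
    """Watermark emitted after each event under bounded out-of-orderness."""
    max_seen = None
    marks = []
    for ts in event_times_ms:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # The watermark asserts: no events at or before it are still expected.
        marks.append(max_seen - max_out_of_orderness_ms)
    return marks

# An out-of-order arrival (2_500 after 3_000) never moves the watermark backward.
print(watermark_sequence([1_000, 3_000, 2_500, 6_000], 1_000))
# [0, 2000, 2000, 5000]
```

Because the watermark is a pure function of the observed event times, a replayed stream produces the same watermark sequence and therefore fires the same window triggers.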
Minimal Flink cluster configuration for a Kappa deployment:
# flink-conf.yaml
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
execution.checkpointing.interval: 30s
execution.checkpointing.mode: EXACTLY_ONCE
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.process.size: 8g
# Max parallelism caps key groups / future scale-out; set high so replay jobs can rescale
pipeline.max-parallelism: 128
For a full deep-dive on Flink's stateful processing model, windowing internals, and exactly-once guarantees, see the Stream Processing Pipeline Pattern post.
Kappa + Medallion: One Streaming Pipeline, Three Data Tiers
Kappa architecture integrates naturally with the Medallion layering pattern (Bronze / Silver / Gold), replacing batch ETL jobs at each tier with streaming transformations:
- Bronze layer (raw ingest): A Flink job consumes from the source Kafka topic and writes raw events to a Bronze Kafka topic or Delta Lake table with no transformation, schema-on-read, and all fields preserved. This is the append-only truth log.
- Silver layer (cleaned, enriched): A second Flink job consumes Bronze, applies deduplication, schema validation, and lookup enrichments (e.g., user geo from a lookup table), and writes to a Silver Kafka topic or Iceberg table.
- Gold layer (aggregated, business-ready): A third Flink job computes session windows, daily active user counts, or revenue totals from Silver and writes to a Gold Kafka topic or analytical DB (ClickHouse, Druid).
Each layer is independently replayable. If a Silver enrichment bug is discovered, replay Silver from Bronze. Gold will automatically reprocess from the new Silver output. The Medallion layers become a streaming DAG: no Spark, no batch scheduler, no overnight runs.
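The three tiers compose naturally as a pipeline of stream transformations. A generator-based sketch (illustrative field names; in-memory stand-ins for the Kafka topics between tiers):

```python
def bronze(raw_events):
    """Raw ingest: append-only, no transformation."""
    yield from raw_events

def silver(bronze_stream):
    """Dedup + schema validation; lookup enrichment would also live here."""
    seen = set()
    for event in bronze_stream:
        key = (event["user_id"], event["ts"])
        if key not in seen and "event_type" in event:
            seen.add(key)
            yield event

def gold(silver_stream):
    """Business aggregate: distinct active users."""
    return len({event["user_id"] for event in silver_stream})

raw = [
    {"user_id": "u1", "ts": 1, "event_type": "view"},
    {"user_id": "u1", "ts": 1, "event_type": "view"},  # duplicate delivery
    {"user_id": "u2", "ts": 2, "event_type": "click"},
    {"user_id": "u3", "ts": 3},                        # fails validation
]
print(gold(silver(bronze(raw))))  # 2
```

Replaying Silver from Bronze is just re-running `silver` over the Bronze stream; downstream `gold` recomputes automatically from the new output.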
Lessons Learned from Running Kappa in Production
Start with retention sizing, not job design. The most common Kappa failure is running out of Kafka retention mid-replay. Calculate your worst-case reprocessing duration at 20x ingestion throughput before choosing retention settings. Add a 2x safety margin. Enable tiered storage before you need it; retrofitting it under pressure is painful.
New consumer group per version, always. Reusing a consumer group ID for a reprocessed version of a job means the reprocessing job inherits committed offsets from the previous version and starts from the wrong position. This is a subtle bug that produces seemingly correct but stale output. Automate consumer group ID generation as part of the CI/CD pipeline for Flink jobs.
Validate semantic equivalence before cutover. Before switching the serving layer to the new output, run both versions in parallel for 30-60 minutes and compare outputs on a sampled set of keys. If they disagree, the new job has a bug and must not be promoted. This is a uniquely Kappa discipline: Lambda teams have the batch layer as a natural validation backstop.
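A sketch of that pre-cutover check (the sample size and all-or-nothing threshold are illustrative choices, not a prescription):

```python
import random

def safe_to_promote(out_v1, out_v2, sample_size=100, seed=7):
    """Compare v1 and v2 outputs on a random sample of shared keys."""
    shared = sorted(set(out_v1) & set(out_v2))
    sample = random.Random(seed).sample(shared, min(sample_size, len(shared)))
    mismatches = sorted(k for k in sample if out_v1[k] != out_v2[k])
    return len(mismatches) == 0, mismatches

v1_output = {"u1": 3, "u2": 5, "u3": 1}
v2_output = {"u1": 3, "u2": 5, "u3": 2}  # v2 disagrees on u3: do not promote

ok, diffs = safe_to_promote(v1_output, v2_output)
print(ok, diffs)  # False ['u3']
```

Note the one caveat: if the replay intentionally fixes a bug, sampled keys touched by the fix are expected to differ, so the comparison should exclude them or assert the new values explicitly.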
Schema Registry is non-negotiable. Kappa replays data that was written weeks or months ago. Without schema evolution contracts enforced at write time, historical events will fail deserialization in future job versions. Every topic must have a registered schema with compatibility mode set to BACKWARD_TRANSITIVE or FULL before any data is written.
Don't run Kappa for regulatory-sensitive pipelines without explicit sign-off. Regulators in finance and healthcare may require batch separation. Present Kappa as an architectural choice, not a default, in compliance discussions.
Summary & Key Takeaways
- Kappa Architecture eliminates Lambda's batch layer by treating a replayable Kafka log as the universal source of truth for both real-time and historical computation.
- Reprocessing in Kappa = deploying a new version of the streaming job with a new consumer group at offset 0, writing to a new output, and atomically swapping the serving layer when caught up.
- Apache Flink enables Kappa through exactly-once semantics, event-time processing with watermarks, and durable RocksDB state, making it possible to replay history and produce semantically identical results to live processing.
- The principal advantage is a single codebase: one job graph, one deployment pipeline, one set of monitoring metrics. Logic changes are applied once and validated once.
- The principal constraint is Kafka retention cost and reprocessing resource capacity. Long retention at high volume is expensive; reprocessing contends with live jobs for Flink cluster resources.
- Choose Kappa when your team is streaming-native, your event volume fits affordable retention windows, and you value operational simplicity over batch separation.
- Kappa + Medallion creates a fully streaming Bronze/Silver/Gold pipeline where each tier is independently replayable from its upstream Kafka topic; no batch scheduler is required at any layer.
The one-liner: If your streaming system is reliable enough to be trusted, it's reliable enough to replace your batch system entirely.
Practice Quiz
What property of Apache Kafka makes it the foundational infrastructure for Kappa Architecture?
- A) Kafka supports SQL queries directly on topics
- B) Kafka stores events as an immutable, ordered, replayable log with configurable retention
- C) Kafka automatically deduplicates events across partitions
- D) Kafka provides built-in windowing and aggregation functions

Correct Answer: B
Your team deploys a new version (v3) of a Kappa session-windowing job after fixing a late-event bug. The job must reprocess the last 60 days of clickstream data. Which combination of changes is correct for the reprocessing job configuration?
- A) Reuse the existing consumer group ID; write to the same output topic
- B) Use a new consumer group ID; set starting offset to earliest; write to a new output topic
- C) Use a new consumer group ID; set starting offset to latest; write to the same output topic
- D) Reuse the existing consumer group ID; set starting offset to earliest; write to a new output topic

Correct Answer: B
In a Flink-based Kappa pipeline, why does using event time (rather than processing time) matter during historical reprocessing?
- A) Processing time is unavailable when reading from Kafka
- B) Event time ensures window boundaries and aggregation results are identical whether the job processes events live or during replay
- C) Event time is faster to compute because it avoids network round-trips
- D) Flink cannot use processing time with the RocksDB state backend

Correct Answer: B
Open-ended challenge: Your organisation has 3 years of transaction data in HDFS Parquet files and wants to migrate to a Kappa architecture. Retention for 3 years at your event volume would cost $80K/month in Kafka. Describe a hybrid strategy that achieves Kappa's single-codebase goal for new data while handling the cold archive without paying full Kafka retention costs. What are the operational risks of this hybrid, and how would you mitigate them?
Related Posts
- Lambda Architecture Pattern: Balancing Batch Accuracy with Streaming Freshness
- Big Data Architecture Patterns: Lambda, Kappa, Medallion, and Data Mesh
- How Kafka Works
- Stream Processing Pipeline Pattern: Stateful Real-Time Data Products

Written by
Abstract Algorithms
@abstractalgorithms