Kappa Architecture: Streaming-First Data Pipelines

Eliminate the batch layer entirely: how Kappa architecture uses a single streaming pipeline for both real-time and historical processing.

Abstract Algorithms
22 min read

TLDR: Kappa architecture replaces Lambda's batch + speed dual codebases with a single streaming pipeline backed by a replayable Kafka log. Reprocessing becomes replaying from offset 0: one codebase, no drift. Kappa is the right call when your team is streaming-native and can't afford the logic-drift tax of two parallel pipelines.

Three months after your team shipped a Lambda pipeline, a product manager flags a discrepancy in daily active user counts. Your Spark batch job and Flink streaming job run nominally the same sessionization logic, but one handles late events differently, and they've been silently computing different numbers since a business-logic change in Q4. Reconciling the divergence takes two engineers a full week. Sound familiar?

This is not a bug. This is the fundamental cost of Lambda's dual-codebase design. Kappa Architecture, proposed by Jay Kreps in 2014, argues that if your streaming system is reliable enough and your event log is replayable, you don't need a separate batch layer at all.

📖 The Two-Codebase Trap: When Lambda Architecture Becomes a Liability

Lambda Architecture solved a real problem: streaming systems circa 2011 were approximate and lossy, so a trusted batch layer provided the "truth" while the speed layer provided freshness. The serving layer merged both views.

The architecture works, right up until you have to change something.

Any non-trivial update to business logic (a new sessionization window, a revised revenue calculation, a changed filter predicate) must be applied to two separate systems written in potentially different APIs. Your Spark batch job uses RDD transformations. Your Flink streaming job uses DataStream operators. They may share a concept, but they don't share code.

The result is operational drag that compounds over time:

  • Logic drift: Two implementations of the same computation inevitably diverge. Late-event handling, timezone normalization, and null semantics each harbor subtle differences.
  • Deployment friction: Changes require coordinated rollouts across two pipelines. A staging environment for one doesn't validate the other.
  • Debugging asymmetry: When batch and streaming disagree, which one is right? Tracing the discrepancy across two codebases means reading two job graphs, two state stores, and two output schemas.

The irony is that by adding a batch layer for accuracy, teams often introduce a new source of inaccuracy: the divergence between the two paths.
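This drift is easy to reproduce in miniature. The sketch below is hypothetical code, not from any production pipeline: two functions that nominally implement the same 30-minute session gap, except the "speed"-style version drops late-arriving events, so the two paths compute different sessions from the same input.

```python
# Hypothetical illustration of logic drift: two "identical" sessionizers.
# Both group timestamps (ms) into sessions separated by a 30-minute gap,
# but the speed-style version silently drops late-arriving events.

GAP_MS = 30 * 60 * 1000

def sessions_batch(timestamps):
    """Batch-style: sort the full history first, so late events are included."""
    out, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > GAP_MS:
            out.append(current)
            current = []
        current.append(ts)
    if current:
        out.append(current)
    return out

def sessions_speed(timestamps):
    """Speed-style: process in arrival order, drop anything behind the high-water mark."""
    out, current, high_water = [], [], 0
    for ts in timestamps:
        if ts < high_water:
            continue  # late event discarded - the source of the drift
        high_water = ts
        if current and ts - current[-1] > GAP_MS:
            out.append(current)
            current = []
        current.append(ts)
    if current:
        out.append(current)
    return out

# The event at t=900s arrives late, after t=3000s has been seen:
events = [0, 600_000, 3_000_000, 900_000]
print(sessions_batch(events))   # [[0, 600000, 900000], [3000000]]
print(sessions_speed(events))   # [[0, 600000], [3000000]]
```

Both pipelines report two sessions, so dashboards look plausible, yet the first session's event counts (and any aggregate built on them) differ: exactly the silent divergence described above.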

๐Ÿ” Lambda's Three-Layer Design and Where the Complexity Lives

Lambda Architecture is built on three layers that work in parallel:

  • Batch layer: Processes the complete historical dataset on a schedule (hourly, daily). Accurate by design because it recomputes from immutable source data. Slow: results lag by the batch interval.
  • Speed layer: Processes events in near real-time as they arrive. Low latency, but inherently approximate because it can only see recent data and must discard state periodically.
  • Serving layer: Merges results from both layers, preferring batch truth for historical windows and speed-layer estimates for recent windows.

The batch layer is typically Apache Spark, Hive, or Presto. The speed layer is Apache Flink, Spark Structured Streaming, or Kafka Streams. These are different programming models, different deployment units, and different failure domains.

Every time business logic changes, both layers change. The serving layer's merge logic must account for temporal boundaries between them. And the team needs engineers who are comfortable in both paradigms simultaneously.

Kappa's premise: what if the batch layer were never necessary in the first place?

โš™๏ธ How Kappa Architecture Collapses Two Systems into One

Jay Kreps introduced Kappa Architecture in a 2014 blog post titled "Questioning the Lambda Architecture". His observation was precise: the only reason Lambda needs a separate batch layer is that the streaming systems of the day weren't reliable enough or fast enough to reprocess history at batch scale. But what if your streaming system could do both?

The Kappa insight has two parts:

1. Treat your event log as the source of truth, not the batch store. Kafka, with configurable retention (days, weeks, or indefinitely via compaction and tiered storage), stores every event immutably in order. There is no separate "cold" historical store: the Kafka topic is the history.

2. Make reprocessing equivalent to re-consuming. If you want to recompute from scratch (after a bug fix, a logic change, or a schema migration), you deploy a new version of your streaming job pointed at offset 0 on the same topic. It reads every event in order, computes the new output, and writes to a new output topic or table. When it catches up to real-time, you atomically switch the serving layer and decommission the old output.

The result: one codebase, one deployment pipeline, one set of operators to test and monitor. Late events, watermarks, and windowing are handled once, inside the streaming job, not twice across two separate systems.

| Capability | Lambda approach | Kappa approach |
| --- | --- | --- |
| Reprocess after a logic change | Re-run Spark batch job | Replay Kafka topic from offset 0 |
| Historical + real-time query | Merge batch + speed outputs | Single stream output (current or reprocessed) |
| Business logic location | Two codebases (Spark + Flink) | One codebase (Flink or Kafka Streams) |
| Late event handling | Batch catches up on next cycle | Watermarks + allowed lateness in stream job |
| Schema evolution | Coordinate across both layers | Change once, replay to migrate output |

🧠 Deep Dive: How Kappa Handles Reprocessing Without a Batch Layer

The Internals: Kafka's Append-Only Log as a Replayable Source of Truth

Kafka's log is the structural enabler of Kappa. Each partition in a Kafka topic is an ordered, immutable sequence of records identified by an integer offset. Consumers track their position by storing an offset, and that offset can be reset.

This means replay is a first-class operation: set auto.offset.reset=earliest (or seek to offset 0 programmatically), and a consumer group re-reads every event from the beginning as if it were arriving fresh. The Kafka cluster doesn't know or care whether a consumer is processing live events or replaying history: the read path is identical.

For Kappa, this means historical reprocessing requires zero new infrastructure. You don't need an HDFS cold store, a separate Spark cluster, or a batch scheduler. The same Flink job that processes live events is pointed at the same Kafka topic with a different starting offset. From Kafka's perspective, the replaying job is just another consumer, and that's precisely the point.
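A toy model makes the mechanics concrete. The sketch below is not Kafka client code, just an illustration of the idea: a partition is an append-only list, a consumer group is a stored integer offset, and "reprocessing" is nothing more than reading under a fresh group that starts at offset 0.

```python
# Minimal sketch of a Kafka-style partition: an append-only record list,
# with each consumer group tracking only an integer offset into it.
# Illustrative only - real Kafka adds partitioning, retention, replication.

class Partition:
    def __init__(self):
        self.log = []        # ordered, immutable records
        self.offsets = {}    # consumer group id -> next offset to read

    def append(self, record):
        self.log.append(record)

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)   # unknown group starts at 0
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

p = Partition()
for i in range(5):
    p.append(f"event-{i}")

# The live consumer reads everything; a "reprocessing" job under a NEW
# group id then starts at offset 0 and sees the identical history.
live = p.poll("live-v1")
replay = p.poll("reprocess-v2")   # fresh group -> replays from offset 0
print(live == replay)             # True: same read path for history
```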

Key Kafka configurations for Kappa deployments:

  • retention.ms=-1 (or a sufficiently long period): Ensures events are retained long enough for full reprocessing. For tiered storage setups (Confluent, MSK with S3 offload), retention can be effectively unlimited without cost-prohibitive local disk.
  • cleanup.policy=compact: For event-sourced patterns where only the latest value per key matters, log compaction retains the last write per key and is suitable for reference data topics.
  • Consumer group isolation: The reprocessing job uses a new consumer group ID (group.id) so it doesn't disturb the live job's offset commits.

The critical constraint: Kafka retention must be long enough to cover the full reprocessing window. If reprocessing takes 8 hours and your retention is 6 hours, the earliest events are gone before the job reaches them. Size retention to expected replay time × a safety factor, typically 2x.
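The sizing rule is plain arithmetic. A hedged sketch, assuming an illustrative 20x replay speedup (your measured speedup will differ):

```python
# Sketch: minimum Kafka retention so the oldest event still exists when
# the replay job reaches it. The 20x speedup is an assumption, not a
# measurement - benchmark your own cluster before relying on this.

def min_retention_days(history_days, replay_speedup=20, safety_factor=2.0):
    """Days of retention needed to replay `history_days` of events safely."""
    replay_days = history_days / replay_speedup          # time the replay itself takes
    return history_days + replay_days * safety_factor    # history + margin for the replay

# Replaying 30 days at 20x takes ~1.5 days; with a 2x margin the topic
# needs at least 33 days of retention.
print(min_retention_days(30))  # 33.0
```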

Performance Analysis: Throughput, Latency, and the Reprocessing Window

In a Lambda system, historical reprocessing uses dedicated batch compute: Spark reads HDFS partitions in parallel across a large cluster and produces results in minutes or hours depending on cluster size and data volume. The batch layer runs independently of the streaming layer.

In Kappa, reprocessing uses the same Flink cluster as live processing. This has important implications:

  • Throughput ceiling: A Flink job replaying from offset 0 can typically consume 10-100x faster than the original event ingestion rate, because replay throughput is bounded by sequential log reads and job parallelism rather than by the real-time arrival rate. For a topic receiving 100K events/second at ingest time, a well-tuned replay job can sustain 2-5M events/second, so full reprocessing of 30 days of data may take 4-8 hours rather than 30 days.
  • Backpressure vs. live jobs: If the reprocessing job and the live job share the same Flink cluster, reprocessing can starve live processing of resources. Best practice: run reprocessing as a separate Flink job on a separate task manager pool, or time it during off-peak windows.
  • State size during replay: Flink maintains state (e.g., windowed aggregations, session state) in RocksDB-backed state stores. During replay, this state can grow much larger than in steady-state because all historical keys are active simultaneously. Provision state backend disk (or memory) for peak replay state, not just steady-state.
  • Output catch-up: The new output topic receives no queries until the reprocessing job catches up to real-time. Plan the cutover window so consumers can tolerate delayed fresh data during the replay period.

Throughput rule of thumb: Kappa reprocessing is viable when your Flink cluster can replay 30 days of events in under 24 hours. If reprocessing takes 5+ days, batch compute (Lambda) may still be faster for historical analysis.
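The rule of thumb can be checked directly. The ingest and replay rates below are illustrative assumptions, not benchmarks:

```python
# Sketch: check the "replay 30 days in under 24 hours" viability rule.
# Ingest rate and replay throughput here are assumed figures.

def replay_hours(history_days, ingest_eps, replay_eps):
    """Hours needed to replay `history_days` of events at `replay_eps` events/sec."""
    total_events = history_days * 86_400 * ingest_eps   # seconds/day * rate
    return total_events / replay_eps / 3600

hours = replay_hours(history_days=30, ingest_eps=100_000, replay_eps=3_000_000)
print(round(hours, 1))   # 24.0 -> right at the viability threshold
print(hours <= 24)       # True: Kappa replay is (just) viable here
```

At a 1M events/second replay ceiling the same history would take ~72 hours, which by the rule above pushes you back toward dedicated batch compute.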

📊 Lambda vs Kappa: Visualizing the Architecture Contrast

Lambda Architecture (Two Parallel Paths)

graph TD
    Source["📥 Event Source\n(Kafka / S3 / Kinesis)"] --> BL["🗂️ Batch Layer\nSpark / Hive"]
    Source --> SL["⚡ Speed Layer\nFlink / Kafka Streams"]
    BL --> Serve["🔀 Serving Layer\nMerges batch + speed views"]
    SL --> Serve
    Serve --> Client["📊 Client Queries"]
    BL -->|"scheduled recompute\n(hourly / daily)"| BS["🗄️ Batch Store\nParquet / Delta Lake"]
    BS --> Serve

Lambda's serving layer must merge two independently computed views, creating the semantic gap that causes silent divergence.

Kappa Architecture (Single Streaming Path)

graph TD
    Source["📥 Event Source\n(Kafka, immutable log)"] --> Job["⚙️ Stream Processor\nFlink / Kafka Streams"]
    Job --> Out["🗄️ Output Store\nKafka topic / DB table"]
    Out --> Client["📊 Client Queries"]
    Source -->|"Reprocess: offset 0\nnew consumer group"| Job2["🔄 Reprocessing Job\n(same codebase, new version)"]
    Job2 --> Out2["🗄️ New Output v2\nCatchup → cutover"]
    Out2 -->|"atomic swap\nwhen caught up"| Client

Kappa has one code path. Reprocessing is re-consuming. The serving layer always reads from a single output.

Kappa Reprocessing Workflow

sequenceDiagram
    participant Ops as 🧑‍💻 Operator
    participant Kafka as 📥 Kafka Topic
    participant LiveJob as ⚙️ Live Job (v1)
    participant NewJob as 🔄 New Job (v2)
    participant Serve as 📊 Serving Layer

    Ops->>NewJob: Deploy v2 job with offset=0, new consumer group
    NewJob->>Kafka: Consume from offset 0 (full history)
    LiveJob->>Kafka: Continue consuming live events (unaffected)
    NewJob-->>Serve: Writing to output_v2 (not yet served)
    Note over NewJob,Kafka: Replay runs at 20x ingestion speed
    NewJob->>Ops: Lag = 0 (caught up to real-time)
    Ops->>Serve: Atomic swap: point serving at output_v2
    Ops->>LiveJob: Drain and decommission v1 job
    Ops->>Kafka: Delete output_v1 topic after validation

Zero-downtime cutover: the live job keeps running while the reprocessing job catches up. Clients never see a gap.

๐ŸŒ Real-World Applications: Kappa Architecture at LinkedIn and Uber

LinkedIn (the origin story): Jay Kreps co-created both Kafka and the Kappa Architecture proposal at LinkedIn. The motivation was direct: LinkedIn's analytics pipelines had accumulated years of dual-codebase maintenance debt. When the Kafka ecosystem matured enough to support high-throughput reliable consumption with exactly-once semantics, the batch layer's rationale evaporated. LinkedIn's activity feed, skills recommendations, and connection strength signals all migrated to Kappa pipelines using Kafka as the log and Samza (later Flink) as the processor.

Uber's real-time marketplace: Uber's surge pricing and driver dispatch require sub-second aggregations of GPS events, rider demand signals, and supply availability, all of which must be queryable in historical form for ML feature engineering and pricing model retraining. A Lambda system would require a separate Spark cluster for ML feature backfills. Uber instead uses a Kappa-style design where Flink jobs write to Hudi tables on S3, which serve both real-time queries (via Hudi's incremental view) and historical batch reads (via Hudi's Copy-On-Write snapshot). Reprocessing after a model update is a Flink replay job, not a Spark recompute.

Input / Process / Output walkthrough (e-commerce example):

| Stage | Description |
| --- | --- |
| Input | Raw clickstream events: user_id, product_id, event_type, timestamp arriving in Kafka at 50K events/sec |
| Process | Flink job applies 30-minute session windows grouped by user_id, aggregates add_to_cart, view, and purchase events per session |
| Output | Session summaries written to Kafka output topic, consumed by downstream recommendation engine and analytics DB |
| Reprocess | After business redefines "session gap" from 30 min to 20 min: new Flink job replays full Kafka history, writes corrected summaries to sessions_v2 |

โš–๏ธ Trade-offs & Failure Modes of Single-Pipeline Kappa Systems

Long-Retention Costs

Kafka storage is not free. At 50K events/second with a 200-byte average payload, a single topic generates ~850 GB/day. Retaining 90 days for full replay requires ~75 TB of Kafka cluster storage, roughly 10x the cost of equivalent cold storage in S3 or GCS. Tiered storage (Confluent Cloud, AWS MSK with S3 offload) mitigates this but introduces additional latency on historical reads.
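The storage figures follow from simple arithmetic (per replica; Kafka's replication factor multiplies them further):

```python
# Sketch: verify the retention-volume arithmetic from the text.
# Figures are per replica - multiply by the replication factor for
# total cluster disk (e.g. 3x with replication.factor=3).

events_per_sec = 50_000
bytes_per_event = 200

gb_per_day = events_per_sec * bytes_per_event * 86_400 / 1e9
print(round(gb_per_day))       # 864 GB/day (the text rounds to ~850)

tb_for_90_days = gb_per_day * 90 / 1000
print(round(tb_for_90_days))   # 78 TB (the text rounds to ~75)
```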

Mitigation: Segment topics by retention tier. Use Kafka for hot replay (30-90 days). Archive to object storage (S3/GCS) for cold replay. Implement a replay bridge that reconstructs a Kafka stream from archived Parquet/Avro for deeper history.

Reprocessing Resource Contention

Running a full replay job on the same Flink cluster as live production jobs creates resource competition. A misconfigured replay job can exhaust task manager slots, memory, or network bandwidth, causing the live job to fall behind on event processing.

Mitigation: Use separate Flink clusters (or Kubernetes namespaces with resource quotas) for reprocessing. Gate replay jobs behind a change management process with defined maintenance windows.

Regulatory and Audit Requirements

Some industries (financial services, healthcare) require an immutable batch audit trail that is provably separate from the operational pipeline. Regulators may specifically ask for batch compute that doesn't share infrastructure with real-time systems. In these cases, Lambda's separation is a feature, not a defect.

Mitigation: If regulatory requirements mandate batch separation, Kappa is the wrong choice. Use Lambda, but invest in shared logic libraries (e.g., a shared Flink/Spark function JAR) to minimize the dual-codebase problem.

Failure Mode: Schema Evolution During Replay

If a Kafka topic contains events serialized with an older Avro or Protobuf schema, and the new Flink job uses an updated schema, replay will encounter deserialization errors on historical events. This is not a Lambda problem because batch jobs typically operate on pre-partitioned Parquet files that can be selectively read.

Mitigation: Always use a Schema Registry (Confluent Schema Registry or AWS Glue Schema Registry) with backward/forward compatibility rules. Write deserializers that handle multiple schema versions. Test replay on a sampled historical window before switching to full replay.
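In plain Python, a version-tolerant deserializer might look like the sketch below. The field names and version differences are hypothetical; a real deployment would drive this from schemas registered in a Schema Registry rather than hand-written branches.

```python
# Sketch: a deserializer that normalizes two historical event versions
# to the current shape during replay. Hypothetical fields: "v1" events
# carried seconds in "ts" and had no event_type; "v2" events carry
# "timestamp_ms" and "event_type".
import json

def deserialize_click(raw: str) -> dict:
    """Normalize v1 and v2 click events to the current (v2) shape."""
    ev = json.loads(raw)
    if "timestamp_ms" not in ev and "ts" in ev:
        ev["timestamp_ms"] = int(ev.pop("ts")) * 1000   # v1: seconds -> ms
    ev.setdefault("event_type", "view")                 # v1 had no event_type
    return ev

old = '{"user_id": "u1", "ts": 1700000000}'
new = '{"user_id": "u1", "event_type": "purchase", "timestamp_ms": 1700000000000}'
print(deserialize_click(old)["timestamp_ms"] == deserialize_click(new)["timestamp_ms"])  # True
```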

🧭 Lambda vs Kappa Decision Guide

| Factor | Lambda | Kappa |
| --- | --- | --- |
| Use when | You need independent batch audit trails, or historical analysis runs against years of data that is cheaper cold | Your team is streaming-native, your event volume fits in affordable Kafka retention, and logic simplicity matters more than batch parallelism |
| Avoid when | Your team can't maintain two separate codebases and test both after every logic change | Retention requirements exceed 90 days at high volume, or regulators require physically separate batch audit infrastructure |
| Reprocessing speed | Fast: dedicated Spark cluster with separate compute for batch recompute | Depends on Flink cluster headroom and Kafka retention span; typically 20-100x ingestion speed on a dedicated replay cluster |
| Operational complexity | High: two deployment units, two monitoring dashboards, two alerting configurations, dual testing surface | Lower: one job, one deployment, one set of metrics, though Kafka retention and schema versioning add their own operational surface |
| Late data handling | Batch layer catches up on the next scheduled run; can tolerate hours of lateness | Watermarks and allowedLateness in Flink; late events trigger window re-emits if within the allowed window |
| Best fit | Enterprise data warehouses where batch accuracy is non-negotiable and teams have Spark expertise | Real-time analytics platforms, event-driven microservices architectures, streaming-native ML feature pipelines |

Deciding rule of thumb: If your team would reach for Spark first for any new data job, stay with Lambda. If your team would reach for Flink or Kafka Streams first, Kappa will be net-simpler.

The following PyFlink job implements the core Kappa pipeline: consuming raw click events from Kafka, computing 30-minute session windows per user, and writing session summaries back to Kafka. This same job, with its offsets reset to earliest and a new consumer group, is the reprocessing job.

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaSource,
    KafkaSink,
    KafkaOffsetsInitializer,
    KafkaRecordSerializationSchema,
    DeliveryGuarantee,
)
from pyflink.common import WatermarkStrategy, Duration
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream.window import EventTimeSessionWindows
from pyflink.datastream.functions import AggregateFunction
import json

# --- Stream execution environment ---
# Event time is the default since Flink 1.12; no TimeCharacteristic call needed.
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# --- Kafka source (live mode: latest; reprocess mode: earliest) ---
OFFSET_RESET = "latest"   # Change to "earliest" for reprocessing

starting_offsets = (
    KafkaOffsetsInitializer.earliest()     # reprocessing: replay the full log
    if OFFSET_RESET == "earliest"
    else KafkaOffsetsInitializer.latest()  # live: tail new events only
)

kafka_source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka:9092")
    .set_topics("clickstream.raw")
    .set_group_id("kappa-session-v2")        # New group ID for reprocessing
    .set_starting_offsets(starting_offsets)
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

# --- Parse, then assign event-time watermarks ---
def parse_click(raw: str):
    """Returns (user_id, event_type, ts_ms) or None on parse error."""
    try:
        ev = json.loads(raw)
        return (ev["user_id"], ev["event_type"], int(ev["timestamp_ms"]))
    except Exception:
        return None

class ClickTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, event, record_timestamp):
        return event[2]   # ts_ms from the parsed tuple

# Watermarks are assigned after parsing: the raw Kafka records are JSON
# strings, so the event-time timestamp only exists once they are parsed.
raw_stream = env.from_source(
    kafka_source, WatermarkStrategy.no_watermarks(), "Clickstream Source"
)

click_stream = (
    raw_stream
    .map(parse_click)
    .filter(lambda x: x is not None)
    .assign_timestamps_and_watermarks(
        WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(30))
        .with_timestamp_assigner(ClickTimestampAssigner())
    )
)

# --- Session window aggregation: 30-minute gap ---
class SessionAggregator(AggregateFunction):
    def create_accumulator(self):
        return {"views": 0, "add_to_cart": 0, "purchases": 0, "start_ts": None, "end_ts": None}

    def add(self, event, acc):
        _, event_type, ts = event
        acc["views"] += 1 if event_type == "view" else 0
        acc["add_to_cart"] += 1 if event_type == "add_to_cart" else 0
        acc["purchases"] += 1 if event_type == "purchase" else 0
        acc["start_ts"] = min(acc["start_ts"] or ts, ts)
        acc["end_ts"] = max(acc["end_ts"] or ts, ts)
        return acc

    def get_result(self, acc):
        return json.dumps(acc)

    def merge(self, acc_a, acc_b):
        starts = [t for t in (acc_a["start_ts"], acc_b["start_ts"]) if t is not None]
        ends = [t for t in (acc_a["end_ts"], acc_b["end_ts"]) if t is not None]
        return {
            "views": acc_a["views"] + acc_b["views"],
            "add_to_cart": acc_a["add_to_cart"] + acc_b["add_to_cart"],
            "purchases": acc_a["purchases"] + acc_b["purchases"],
            "start_ts": min(starts) if starts else None,
            "end_ts": max(ends) if ends else None,
        }

# The session gap is specified with Time in PyFlink's window API.
from pyflink.common.time import Time

session_summaries = (
    click_stream
    .key_by(lambda x: x[0])                    # key by user_id
    .window(EventTimeSessionWindows.with_gap(Time.minutes(30)))
    .aggregate(SessionAggregator())
)

# --- Kafka sink: write session summaries ---
kafka_sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("kafka:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("sessions.v2")              # New topic for reprocessed output
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .set_delivery_guarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .set_transactional_id_prefix("kappa-session-v2")  # required for EXACTLY_ONCE
    .build()
)

session_summaries.sink_to(kafka_sink)
env.execute("Kappa-Clickstream-Session-Pipeline-v2")

To switch from live to reprocessing mode, change three settings:

  1. set_group_id("kappa-session-v3"): a new consumer group so offsets start fresh
  2. OFFSET_RESET = "earliest": replay from the beginning instead of tailing the topic
  3. set_topic("sessions.v3"): write to a new output topic during catchup

When the job's consumer lag reaches zero (visible in Kafka consumer group metrics), you update the serving layer to read from sessions.v3 and stop the v2 job. No Spark cluster needed. No batch scheduler needed.

Apache Flink is the open-source stream processing framework most commonly paired with Kappa deployments. It was built from the ground up to handle the properties that Kappa requires: exactly-once semantics, event-time processing, and stateful computation with durable state backends.

Why Flink specifically enables Kappa:

  • Exactly-once end-to-end: Flink's checkpointing mechanism snapshots operator state and Kafka offsets atomically. After a failure, Flink restores from the last checkpoint and resumes without duplicates, which is critical for Kappa because you cannot re-run a separate batch job to correct missed events.
  • Event time vs processing time: Flink explicitly separates event time (when the event occurred, embedded in the payload) from processing time (when Flink processes it). For reprocessing, event time ensures that historical windows are computed identically to live windows: the same session boundaries, the same aggregation results.
  • Watermarks: A watermark asserts that no events with earlier event-time timestamps are expected, which is what lets event-time windows close deterministically. During live processing, watermarks advance gradually. During replay, watermarks advance rapidly as historical data streams through. Window triggers fire based on event-time watermarks, so the session window outputs during replay are semantically identical to what they would have been at original ingestion time.
  • RocksDB state backend: For stateful jobs (session windows, user-level aggregations), Flink persists state to RocksDB, a local embedded key-value store backed by disk. This allows Flink jobs to maintain state many times larger than heap memory, which is essential for Kappa jobs replaying years of user sessions without OOM errors.

Minimal Flink cluster configuration for a Kappa deployment:

# flink-conf.yaml
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
execution.checkpointing.interval: 30s
execution.checkpointing.mode: EXACTLY_ONCE
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.process.size: 8g
# Replay job: increase parallelism to exhaust Kafka throughput
pipeline.max-parallelism: 128

For a full deep-dive on Flink's stateful processing model, windowing internals, and exactly-once guarantees, see the Stream Processing Pipeline Pattern post.

๐Ÿ—๏ธ Kappa + Medallion: One Streaming Pipeline, Three Data Tiers

Kappa architecture integrates naturally with the Medallion layering pattern (Bronze / Silver / Gold), replacing batch ETL jobs at each tier with streaming transformations:

  • Bronze layer (raw ingest): A Flink job consumes from the source Kafka topic and writes raw events to a Bronze Kafka topic or Delta Lake table, with no transformation, schema-on-read, and all fields preserved. This is the append-only truth log.
  • Silver layer (cleaned, enriched): A second Flink job consumes Bronze, applies deduplication, schema validation, and lookup enrichments (e.g., user geo from a lookup table), and writes to a Silver Kafka topic or Iceberg table.
  • Gold layer (aggregated, business-ready): A third Flink job computes session windows, daily active user counts, or revenue totals from Silver and writes to a Gold Kafka topic or analytical DB (ClickHouse, Druid).

Each layer is independently replayable. If a Silver enrichment bug is discovered, replay Silver from Bronze. Gold will automatically reprocess from the new Silver output. The Medallion layers become a streaming DAG: no Spark, no batch scheduler, no overnight runs.
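The replay property can be modeled with ordinary functions, where each tier is a pure function of its upstream log. The tier logic below is a toy, purely illustrative:

```python
# Toy sketch of replayable Medallion tiers: each tier is a pure function
# of its upstream log, so fixing a tier means re-running it from upstream.

bronze = [
    {"user": "u1", "amt": "10"},
    {"user": "u1", "amt": "10"},   # duplicate ingest, dropped by Silver
    {"user": "u2", "amt": "5"},
]

def silver(bronze_log):
    """Illustrative 'cleaning' step: dedupe identical records, cast types."""
    seen, out = set(), []
    for ev in bronze_log:
        key = (ev["user"], ev["amt"])
        if key not in seen:
            seen.add(key)
            out.append({"user": ev["user"], "amt": int(ev["amt"])})
    return out

def gold(silver_log):
    """Illustrative aggregation: revenue per user."""
    totals = {}
    for ev in silver_log:
        totals[ev["user"]] = totals.get(ev["user"], 0) + ev["amt"]
    return totals

# A bug fix in silver() just means replaying Bronze through it again;
# Gold is then recomputed from the corrected Silver output.
print(gold(silver(bronze)))   # {'u1': 10, 'u2': 5}
```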

📚 Lessons Learned from Running Kappa in Production

Start with retention sizing, not job design. The most common Kappa failure is running out of Kafka retention mid-replay. Calculate your worst-case reprocessing duration at 20x ingestion throughput before choosing retention settings. Add a 2x safety margin. Enable tiered storage before you need it; retrofitting it under pressure is painful.

New consumer group per version, always. Reusing a consumer group ID for a reprocessed version of a job means the reprocessing job inherits committed offsets from the previous version and starts from the wrong position. This is a subtle bug that produces seemingly correct but stale output. Automate consumer group ID generation as part of the CI/CD pipeline for Flink jobs.

Validate semantic equivalence before cutover. Before switching the serving layer to the new output, run both versions in parallel for 30-60 minutes and compare outputs on a sampled set of keys. If they disagree, the new job has a bug and must not be promoted. This is a uniquely Kappa discipline; Lambda teams have the batch layer as a natural validation backstop.
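The pre-cutover comparison can be as simple as diffing sampled keys between the two outputs. A sketch, where the sampling policy and exact-equality check are assumptions (real pipelines may need tolerances for floating-point aggregates):

```python
# Sketch: validate semantic equivalence between v1 and v2 outputs on a
# sampled set of keys before cutting the serving layer over.
import random

def validate_cutover(out_v1: dict, out_v2: dict, sample_size=100, seed=42):
    """Return the sampled keys where the two job versions disagree."""
    rng = random.Random(seed)                       # deterministic sample
    keys = sorted(out_v1.keys() & out_v2.keys())    # keys present in both
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return [k for k in sample if out_v1[k] != out_v2[k]]

v1 = {"u1": 3, "u2": 7, "u3": 1}
v2 = {"u1": 3, "u2": 8, "u3": 1}   # v2 disagrees on u2 - do not promote
print(validate_cutover(v1, v2))    # ['u2']
```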

Schema Registry is non-negotiable. Kappa replays data that was written weeks or months ago. Without schema evolution contracts enforced at write time, historical events will fail deserialization in future job versions. Every topic must have a registered schema with compatibility mode set to BACKWARD_TRANSITIVE or FULL before any data is written.

Don't run Kappa for regulatory-sensitive pipelines without explicit sign-off. Regulators in finance and healthcare may require batch separation. Present Kappa as an architectural choice, not a default, in compliance discussions.

📌 Summary & Key Takeaways

  • Kappa Architecture eliminates Lambda's batch layer by treating a replayable Kafka log as the universal source of truth for both real-time and historical computation.
  • Reprocessing in Kappa = deploying a new version of the streaming job with a new consumer group at offset 0, writing to a new output, and atomically swapping the serving layer when caught up.
  • Apache Flink enables Kappa through exactly-once semantics, event-time processing with watermarks, and durable RocksDB state, making it possible to replay history and produce semantically identical results to live processing.
  • The principal advantage is a single codebase: one job graph, one deployment pipeline, one set of monitoring metrics. Logic changes are applied once and validated once.
  • The principal constraint is Kafka retention cost and reprocessing resource capacity. Long retention at high volume is expensive; reprocessing contends with live jobs for Flink cluster resources.
  • Choose Kappa when your team is streaming-native, your event volume fits affordable retention windows, and you value operational simplicity over batch separation.
  • Kappa + Medallion creates a fully streaming Bronze/Silver/Gold pipeline where each tier is independently replayable from its upstream Kafka topic, with no batch scheduler required at any layer.

The one-liner: If your streaming system is reliable enough to be trusted, it's reliable enough to replace your batch system entirely.

๐Ÿ“ Practice Quiz

  1. What property of Apache Kafka makes it the foundational infrastructure for Kappa Architecture?

    • A) Kafka supports SQL queries directly on topics
    • B) Kafka stores events as an immutable, ordered, replayable log with configurable retention
    • C) Kafka automatically deduplicates events across partitions
    • D) Kafka provides built-in windowing and aggregation functions
    Correct Answer: B
  2. Your team deploys a new version (v3) of a Kappa session-windowing job after fixing a late-event bug. The job must reprocess the last 60 days of clickstream data. Which combination of changes is correct for the reprocessing job configuration?

    • A) Reuse the existing consumer group ID; write to the same output topic
    • B) Use a new consumer group ID; set starting offset to earliest; write to a new output topic
    • C) Use a new consumer group ID; set starting offset to latest; write to the same output topic
    • D) Reuse the existing consumer group ID; set starting offset to earliest; write to a new output topic
    Correct Answer: B
  3. In a Flink-based Kappa pipeline, why does using event time (rather than processing time) matter during historical reprocessing?

    • A) Processing time is unavailable when reading from Kafka
    • B) Event time ensures window boundaries and aggregation results are identical whether the job processes events live or during replay
    • C) Event time is faster to compute because it avoids network round-trips
    • D) Flink cannot use processing time with the RocksDB state backend
    Correct Answer: B
  4. Open-ended challenge: Your organisation has 3 years of transaction data in HDFS Parquet files and wants to migrate to a Kappa architecture. Retention for 3 years at your event volume would cost $80K/month in Kafka. Describe a hybrid strategy that achieves Kappa's single-codebase goal for new data while handling the cold archive without paying full Kafka retention costs. What are the operational risks of this hybrid, and how would you mitigate them?


Written by Abstract Algorithms (@abstractalgorithms)