# Kafka and Spark Structured Streaming: Building a Production Pipeline
Kafka delivers events and Spark processes them; offset management, schema evolution, and exactly-once delivery determine production reliability.
## The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart
An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The team starts with a straightforward approach: a hand-rolled Python consumer that polls Kafka, deserializes each message, and writes records to S3. It works in staging with 1,000 events per second. On day one in production, it falls apart completely.
The consumer cannot drain partitions fast enough. Kafka lag climbs from zero to 50 million messages within four hours. The team adds more consumer instances, but now they fight partition assignment conflicts and duplicate processing. They add a deduplication layer; now its state blows up in memory. They add checkpointing, but after a crash the consumer replays from the wrong offset and writes 2 million duplicate records into the data warehouse. The data team spends two days reconciling.
The root problem was not throughput; it was the absence of a structured framework for offset management, fault tolerance, and backpressure. Spark Structured Streaming provides exactly that framework for Kafka consumers, but only if you understand how it manages offsets, enforces rate limits, and coordinates checkpoints across a distributed cluster.
This post builds that understanding from the ground up. You will leave knowing how Spark tracks Kafka offsets (hint: not in Kafka's consumer group metadata), how to tune micro-batch throughput without overloading downstream sinks, and how to diagnose consumer lag before it becomes a warehouse reconciliation nightmare.
## Kafka as a Source and Sink in Structured Streaming: Subscription Modes and Message Schema
Before examining how Spark manages a Kafka connection, it is worth understanding what Kafka actually hands Structured Streaming. Every message read from Kafka arrives as a row in a fixed schema, regardless of what the original producer put into the value bytes.
### The Kafka Source Row Schema
When you read from Kafka as a Structured Streaming source, every micro-batch DataFrame has exactly these columns:
| Column | Type | Description |
|---|---|---|
| `key` | binary | Raw bytes of the message key (null if not set) |
| `value` | binary | Raw bytes of the message payload (your actual data) |
| `topic` | string | Kafka topic name the message was read from |
| `partition` | int | Kafka partition number (0-indexed) |
| `offset` | long | Offset of this message within its partition |
| `timestamp` | timestamp | Kafka producer or broker timestamp |
| `timestampType` | int | 0 = CreateTime (producer), 1 = LogAppendTime (broker) |
The critical point: `key` and `value` arrive as raw binary. You are responsible for deserialization. For JSON payloads, you call `from_json(col("value").cast("string"), schema)`. For Avro or Protobuf payloads, you call `from_avro` or integrate with a schema registry. This two-step design (raw bytes first, typed schema second) is intentional: Spark never assumes anything about message encoding.
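PySpark performs this decode at cluster scale, but the mechanics are easy to see in miniature. The stdlib-only sketch below models one Kafka source row (the field names mirror the schema table above; the payload contents are invented for illustration) and applies the same two steps that `cast("string")` plus `from_json` perform:

```python
import json

# A Kafka source row as Spark sees it: key/value are raw bytes,
# the metadata columns (topic, partition, offset) are already typed.
row = {
    "key": b"user-42",
    "value": b'{"amount": 125.5, "currency": "EUR"}',  # raw payload bytes
    "topic": "payments-prod",
    "partition": 3,
    "offset": 1048576,
}

# Step 1: cast the binary value to a string.
raw = row["value"].decode("utf-8")

# Step 2: parse the string against the expected schema.
# (PySpark's from_json returns null on malformed input; json.loads raises,
# which is why production code pairs this with a dead letter queue.)
record = json.loads(raw)

print(record["amount"], record["currency"])
```

Nothing in the row tells Spark the payload is JSON; the pipeline author supplies that knowledge in step 2.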
### Three Ways to Subscribe to Kafka Topics
Structured Streaming offers three subscription options, each suited to different operational patterns:
| Option | Syntax | When to use |
|---|---|---|
| `subscribe` | `"topic1,topic2"` | Fixed set of known topics; partition count may change |
| `subscribePattern` | `"payments.*"` | Topic names follow a naming convention; new topics auto-discovered on each micro-batch |
| `assign` | `'{"payments-prod":[0,1,2]}'` (JSON) | Explicit partition assignment; useful when sharing a topic with another consumer |
`subscribePattern` is powerful but carries a risk: if a new topic matching the pattern appears mid-stream, Spark automatically starts reading it at the `startingOffsets` policy, which may not be what you want for a topic created by mistake. In production, most teams use `subscribe` unless topic names genuinely follow a programmatic pattern.
### Kafka as a Streaming Sink
Writing back to Kafka uses three columns in your output DataFrame: `topic` (string, optional if set via the `topic` option), `key` (binary or string, nullable), and `value` (binary or string, required). The `key` column is optional but matters for downstream consumers that depend on partition assignment by key. If you write without a key, Kafka distributes messages round-robin across partitions, which breaks ordered processing for any consumer that relies on key-based partitioning.
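Why a missing key breaks ordering is easy to demonstrate. The sketch below models Kafka's partitioner in plain Python; note that Kafka's real default partitioner hashes keys with murmur2, while `crc32` stands in here only to keep the example deterministic and dependency-free:

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key, round_robin_state):
    """Hash-based assignment with a key, round-robin without one.

    Kafka's default partitioner actually uses murmur2 (and sticky
    batching for keyless sends); crc32 and plain round-robin stand in
    here just to make the two behaviors concrete.
    """
    if key is None:
        round_robin_state[0] += 1
        return round_robin_state[0] % NUM_PARTITIONS
    return zlib.crc32(key) % NUM_PARTITIONS

state = [0]
# With a key, every message for "user-42" lands on the same partition,
# so a single consumer sees them in order.
p1 = partition_for(b"user-42", state)
p2 = partition_for(b"user-42", state)
assert p1 == p2

# Without a key, messages spread across partitions: per-key order is lost.
spread = {partition_for(None, state) for _ in range(12)}
```

Twelve keyless sends touch all six partitions, which is exactly the behavior that breaks a consumer expecting key-based ordering.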
## How Spark Structured Streaming Controls Kafka Reads, Applies Rate Limits, and Delivers Exactly-Once
The Kafka-Spark integration is more than a simple poll loop. Three mechanics distinguish it from naive consumers: offset tracking via checkpoint files (not Kafka consumer groups), micro-batch rate limiting, and end-to-end exactly-once delivery through idempotent sinks.
### `startingOffsets`: Where the Query Begins

When a streaming query starts for the first time, Spark needs to know where to begin reading each partition. The `startingOffsets` option governs this:

- `"earliest"`: read from the oldest available offset on each partition. Use this for backfill pipelines or initial data loads.
- `"latest"`: start from the current end of each partition. Use this for real-time dashboards that only care about new events.
- A JSON string like `{"payments-prod":{"0":12345,"1":67890,"2":0}}`: start from specific offsets per partition. Use this for precise recovery after manual intervention.
Once the query has checkpointed at least one micro-batch, `startingOffsets` is ignored on subsequent restarts. Spark restores the last committed offset from the checkpoint directory, not from this option. This distinction is crucial: changing `startingOffsets` in code does nothing after the first successful batch.
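The restart rule fits in a few lines. This is an illustrative model (not Spark source code) of the decision Spark makes on startup; the topic name and offsets are made up:

```python
def resolve_start_offsets(checkpoint_offsets, starting_offsets):
    """Model of Spark's restart rule: a checkpoint, when present,
    always wins over the startingOffsets option."""
    if checkpoint_offsets is not None:
        return checkpoint_offsets          # restored from offsets/ files
    if starting_offsets == "earliest":
        return "oldest available offset per partition"
    if starting_offsets == "latest":
        return "current end of each partition"
    return starting_offsets                # explicit per-partition JSON map

# First start ever: no checkpoint exists, so the option applies.
assert resolve_start_offsets(None, "latest") == "current end of each partition"

# Any later restart: the option is ignored, even if the code changed.
cp = {"payments-prod": {"0": 12345, "1": 67890}}
assert resolve_start_offsets(cp, "earliest") == cp
```

The second assertion is the one that surprises teams in production: editing `startingOffsets` and redeploying changes nothing while the checkpoint survives.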
### `maxOffsetsPerTrigger`: The Backpressure Lever
Without rate limiting, a Spark query starting at earliest on a topic with 10 billion stored messages would attempt to process all 10 billion in the first micro-batch. The executor memory would overflow and the micro-batch would fail, or it would succeed but take hours, making the latency guarantee meaningless.
`maxOffsetsPerTrigger` caps the total number of Kafka offsets consumed per micro-batch across all partitions. If you have a topic with 20 partitions and set `maxOffsetsPerTrigger` to 100,000, each micro-batch reads at most 5,000 messages per partition. Spark distributes the cap proportionally to each partition's current lag. A partition with 80,000 pending offsets gets a larger slice than a partition with 2,000 pending offsets.
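The proportional split can be sketched directly. This is an illustrative model of the rate-limiting arithmetic, not Spark's exact implementation (Spark also handles rounding remainders and per-partition minimums):

```python
def distribute_cap(lag_per_partition, max_offsets):
    """Split a maxOffsetsPerTrigger budget across partitions in
    proportion to each partition's current lag."""
    total_lag = sum(lag_per_partition.values())
    if total_lag <= max_offsets:
        return dict(lag_per_partition)      # no throttling needed
    return {
        partition: int(max_offsets * lag / total_lag)
        for partition, lag in lag_per_partition.items()
    }

# Three partitions with very uneven lag, and a 10K budget:
caps = distribute_cap({"0": 80_000, "1": 2_000, "2": 18_000}, 10_000)
# The laggiest partition receives the largest slice of the budget.
```

Here partition 0 (80% of the total lag) gets 8,000 of the 10,000-offset budget, so the laggiest partition drains fastest without blowing past the batch cap.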
### Exactly-Once Delivery: Two Requirements
Exactly-once in Spark Structured Streaming requires two guarantees working together:
1. **Idempotent re-reads.** The Kafka source itself is idempotent. If a micro-batch fails mid-execution and Spark retries it, the same offset range is re-read because the checkpoint records what was committed, not what was attempted. No offset range is processed twice.
2. **Idempotent or transactional sink.** The source guarantee is wasted if the sink allows duplicates. For Kafka sinks, enabling the idempotent producer (`kafka.enable.idempotence=true`) ensures that a retried write does not produce duplicate messages. For Delta Lake sinks, the transactional write mechanism guarantees that a retried micro-batch either commits once or not at all.

Without both, you get at-least-once semantics at best.
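The sink-side half of the guarantee can be modeled in a few lines. This toy sink (an illustration, not the Delta Lake implementation) keys commits by batch ID, which is how a transactional sink makes a retried micro-batch a no-op:

```python
class IdempotentSink:
    """Toy transactional sink: a retried micro-batch with the same
    batch_id commits once or not at all."""

    def __init__(self):
        self.committed = {}              # batch_id -> rows written

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return                       # replay after a failure: no-op
        self.committed[batch_id] = list(rows)

sink = IdempotentSink()
sink.write(7, ["evt-a", "evt-b"])
sink.write(7, ["evt-a", "evt-b"])        # Spark retries the same offset range
total_rows = sum(len(rows) for rows in sink.committed.values())
```

Even though batch 7 was written twice, only two events land in the sink. Combined with the checkpoint's guarantee that a retry re-reads the same offset range, this yields exactly-once output.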
## Deep Dive: How Spark Owns Kafka Offset State and How to Tune Pipeline Performance
### The Internals of Kafka Offset Management in Spark
One of the most common misunderstandings in Spark-Kafka pipelines is the assumption that Spark uses Kafka's consumer group commit mechanism to track progress. It does not.
Spark manages its own offset state through the checkpoint directory, completely independent of Kafka's consumer group offset tracking. The checkpoint directory for a streaming query contains an `offsets/` subdirectory that holds one JSON file per micro-batch (a `sources/0/` subdirectory holds metadata for the first source). Each file is a `KafkaSourceOffset`: a snapshot of the per-partition offset ranges consumed in that batch.
A typical `KafkaSourceOffset` file looks like:

```json
{"payments-prod":{"0":12345,"1":67890,"2":45012}}
```
The WAL (Write-Ahead Log) pattern governs commit order. Before executing a micro-batch, Spark writes the intended offset range to the WAL. After the batch executes and the sink commits, Spark advances the checkpoint pointer. On restart after a failure, Spark reads the last committed pointer and replays from those offsets, regardless of whether Kafka consumer group metadata shows a different committed offset.
This creates a critical operational implication: the `kafka.group.id` setting (which assigns a Kafka consumer group ID to the Spark query) does not control recovery. It is useful for Kafka-side monitoring (`kafka-consumer-groups.sh --describe` will show the Spark query as a consumer group, letting you observe lag from Kafka's perspective), but Spark ignores the consumer group offset in `__consumer_offsets` when it restarts. Only the checkpoint directory matters.
If you delete or corrupt the checkpoint directory, Spark treats the query as brand new and falls back to startingOffsets. This means deleting a checkpoint to "reset" a query is not a safe operation unless you also intend to reprocess all data.
The option `failOnDataLoss` (default: `true`) determines what happens if Kafka has deleted offsets that Spark is trying to read, for example because the Kafka retention period expired while the pipeline was paused. With `failOnDataLoss=true`, the query throws an exception and stops. With `failOnDataLoss=false`, Spark silently skips the missing offsets and continues from the earliest available offset. In high-throughput pipelines where brief data gaps are acceptable, `failOnDataLoss=false` prevents unnecessary outages.
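The two behaviors reduce to one comparison. The sketch below is an illustrative model of the decision (the exception class and function name are invented for the example; Spark raises its own error type with a data-loss message):

```python
class DataLossError(RuntimeError):
    """Stand-in for the exception Spark raises on detected data loss."""

def resume_offset(committed, earliest_available, fail_on_data_loss):
    """Decide where to resume when retention may have deleted offsets."""
    if committed >= earliest_available:
        return committed                   # nothing was lost; resume normally
    if fail_on_data_loss:
        raise DataLossError(
            f"offsets {committed}..{earliest_available - 1} were deleted by retention"
        )
    return earliest_available              # skip the gap silently

# Retention has not caught up: resume from the checkpointed offset.
assert resume_offset(500, 400, True) == 500

# Retention deleted offsets 100..399 while the pipeline was paused:
assert resume_offset(100, 400, False) == 400      # false: silent skip
try:
    resume_offset(100, 400, True)                 # true: query stops
except DataLossError:
    pass
```

The silent-skip branch is what makes `failOnDataLoss=false` dangerous: the 300 missing offsets leave no exception behind, only absent rows downstream.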
### Performance Analysis: Throughput vs. Latency Tuning in Kafka-backed Pipelines
Tuning a Spark Structured Streaming pipeline for Kafka involves four interdependent variables. Changing one always affects the others.
**Kafka partition count and Spark parallelism.** Spark reads one Kafka partition per task. If your topic has 20 partitions and your cluster has 10 executors each with 4 cores, Spark can process all 20 partitions in parallel comfortably. But if the topic has 200 partitions and your cluster has only 5 executor cores, Spark serializes reads into batches of 5 partitions at a time. The maximum read parallelism equals `min(Kafka partition count, total executor cores)`. This means there is a ceiling on throughput that no amount of `maxOffsetsPerTrigger` tuning can exceed unless you also increase either partition count or executor resources.
**`maxOffsetsPerTrigger` and micro-batch duration.** Setting `maxOffsetsPerTrigger` too low creates small, fast micro-batches: low latency, but high scheduling overhead. Setting it too high creates large, slow micro-batches: high throughput per batch, but increased end-to-end latency. A practical starting point: target micro-batch duration between 10 and 60 seconds. If your micro-batches consistently complete in under 5 seconds, increase `maxOffsetsPerTrigger` by 2x. If they regularly exceed 2 minutes, reduce it by 50%.
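That rule of thumb is mechanical enough to encode. The helper below is a sketch of the heuristic from the paragraph above (the thresholds are the ones stated there, applied to an average of recent batch durations; function and variable names are invented):

```python
def tune_max_offsets(current_cap, recent_batch_secs):
    """Rule-of-thumb tuner: double the cap when batches finish in
    under 5 s, halve it when they average over 2 minutes."""
    avg = sum(recent_batch_secs) / len(recent_batch_secs)
    if avg < 5:
        return current_cap * 2      # headroom: consume lag faster
    if avg > 120:
        return current_cap // 2     # overloaded: shrink the batches
    return current_cap              # inside the 10-60 s sweet spot

assert tune_max_offsets(100_000, [3, 4, 2]) == 200_000
assert tune_max_offsets(100_000, [150, 130, 140]) == 50_000
assert tune_max_offsets(100_000, [30, 40]) == 100_000
```

In practice you would feed `recent_batch_secs` from the query's progress metrics and apply changes gradually, one adjustment per observation window.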
**Kafka consumer lag as a health indicator.** The difference between the latest Kafka offset and the last committed Spark offset is the consumer lag. A lag that grows steadily means your pipeline is not keeping up: either `maxOffsetsPerTrigger` is too low for the incoming event rate, or your transformations are too expensive for the micro-batch window. A lag that spikes and then recovers is normal behavior when Spark is catching up after a backpressure event or a brief pause.
**Schema deserialization cost.** Avro and Protobuf deserialization via a schema registry adds a network round-trip per batch: the registry must be queried to resolve the schema ID embedded in each message. This cost is often overlooked during capacity planning. Schema registry responses should be cached locally (most connectors do this by default), but cold starts after a restart can add 1-3 seconds of overhead per micro-batch until the cache warms up.
| Tuning lever | Effect on throughput | Effect on latency | Risk if misconfigured |
|---|---|---|---|
| `maxOffsetsPerTrigger` increase | Higher batch throughput | Higher end-to-end latency | Executor OOM on very large batches |
| Kafka partition count increase | Higher max parallelism | Lower time per partition | Rebalancing disruption during scaling |
| Executor count increase | More parallel tasks | Lower time per micro-batch | Resource cost; diminishing returns beyond partition count |
| Trigger interval decrease | Lower amortized throughput | Lower latency floor | Driver scheduling overhead; frequent small commits |
## The Complete Kafka-to-Spark Pipeline: From Partition Read to Checkpoint Commit
The diagram below traces the full lifecycle of a single micro-batch in a Kafka-Spark pipeline. The flow moves from Kafka topic partitions through Spark's internal offset tracking machinery, through transformation and deserialization, out to either a Delta Lake table or a downstream Kafka topic, and finally to the checkpoint commit that records progress.
```mermaid
graph TD
    KP1[Kafka Partition 0]
    KP2[Kafka Partition 1]
    KP3[Kafka Partition N]
    KS[Spark Kafka Source]
    OT[Offset Tracker reads KafkaSourceOffset from checkpoint]
    RB[Rate Limiter applies maxOffsetsPerTrigger per partition]
    MB[Micro-Batch Execution Engine]
    DS[Deserialization - JSON or Avro from_json or from_avro]
    TX[Transformation - filter, join, aggregate, watermark]
    DL[Delta Lake Sink - transactional write]
    KOut[Kafka Sink - idempotent producer]
    CP[Checkpoint WAL - commit offset range]
    KP1 --> KS
    KP2 --> KS
    KP3 --> KS
    KS --> OT
    OT --> RB
    RB --> MB
    MB --> DS
    DS --> TX
    TX --> DL
    TX --> KOut
    DL --> CP
    KOut --> CP
```
Read this diagram from top to bottom. Kafka partitions are polled in parallel by the Spark Kafka source. The offset tracker reads the last committed `KafkaSourceOffset` from the checkpoint directory to determine the start offset for each partition. The rate limiter applies `maxOffsetsPerTrigger` to cap how many messages enter the micro-batch. The execution engine deserializes value bytes into typed records, applies your transformations, and writes to the configured sink. Only after the sink successfully commits does Spark record the batch in the checkpoint commit log; this ordering is what prevents data loss on failure.
The two sink paths are not mutually exclusive. A single streaming query can write to both a Delta Lake table and a downstream Kafka topic using `foreachBatch`, which lets you address multiple sinks within one micro-batch execution (the writes are coordinated by batch ID rather than being jointly atomic).
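The shape of that fan-out can be sketched without Spark. The model below (illustrative, with invented names; a real `foreachBatch` function receives a DataFrame and a batch ID) shows the ordering that matters: both sinks are written inside the batch, and the offset commit happens only after both succeed, so a crash mid-batch causes a replay that idempotent sinks absorb:

```python
def run_micro_batch(batch_id, rows, delta_sink, kafka_sink, checkpoint):
    """foreachBatch-style fan-out: write both sinks, then commit.

    If either write raises, the checkpoint is never advanced, so the
    whole batch is replayed on restart; idempotent sinks make that
    replay safe even though the two writes are not jointly atomic.
    """
    delta_sink.append((batch_id, list(rows)))   # analytical store
    kafka_sink.append((batch_id, list(rows)))   # enriched topic
    checkpoint.append(batch_id)                  # commit comes LAST

delta, kafka_out, ckpt = [], [], []
run_micro_batch(0, ["evt-1", "evt-2"], delta, kafka_out, ckpt)
```

Reversing the last two lines (committing before the Kafka write) would turn a crash between them into silent data loss, which is why the commit-last ordering is the invariant to preserve in any `foreachBatch` body.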
## Where Spark + Kafka Integration Powers Real Systems
### Real-Time ETL to Delta Lake
The most common production pattern is Kafka acting as the ingestion buffer and Delta Lake as the analytical store. Events arrive in Kafka with raw JSON or Avro payloads. Spark reads micro-batches, deserializes and validates each record, applies business-level transformations (currency normalization, user ID resolution, event deduplication), and writes structured Parquet files into a Delta table. Delta's transaction log gives the downstream analytics team a consistent read view even while Spark is actively writing.
### Kafka-to-Kafka Stream Processing
When multiple downstream systems need the same enriched event stream, the Kafka-to-Kafka pattern avoids duplicating the enrichment logic in every consumer. Spark reads raw events from a source topic, joins them against a static reference DataFrame (a product catalog, a user attributes table cached as a broadcast variable), and writes enriched events to a new topic. Downstream consumers read the enriched topic and skip the join cost entirely.
### Dead Letter Queue for Deserialization Failures
In any high-volume pipeline, some messages will fail deserialization: a malformed JSON record, a producer that forgot to register a schema, a truncated Avro message. The dead letter queue pattern handles this without stopping the pipeline. Inside a `foreachBatch` sink, the code separates records into two DataFrames: `valid_df` (records that deserialized successfully) and `failed_df` (records where `from_json` returned null). `valid_df` writes to the primary Delta table. `failed_df` writes to a `payments-dlq` Kafka topic that operations teams monitor and replay after the root cause is fixed.
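The split itself is a one-pass partition of the batch. A stdlib sketch of the same valid/failed separation (in PySpark you would filter on `from_json` returning null instead of catching an exception; the sample payloads are invented):

```python
import json

def split_batch(raw_values):
    """DLQ pattern: records that fail deserialization go to the failed
    list instead of killing the micro-batch."""
    valid, failed = [], []
    for raw in raw_values:
        try:
            valid.append(json.loads(raw))
        except json.JSONDecodeError:
            failed.append(raw)       # destined for the payments-dlq topic
    return valid, failed

valid_df, failed_df = split_batch([
    b'{"amount": 10}',
    b'{"amount": ',          # truncated message
    b'not json at all',      # producer bug
])
```

The key property: two poison messages do not stop the batch, and the raw bytes are preserved in `failed_df` so the records can be replayed after the producer bug is fixed.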
### Monitoring Kafka Consumer Lag
Two complementary monitoring surfaces exist for Spark-Kafka pipelines:

- **Spark Streaming UI**, available at `http://<driver-host>:4040/StreamingQuery`. Shows input rate (records per second), processing rate, micro-batch duration, and end-to-end lag per query. The `inputRowsPerSecond` vs `processedRowsPerSecond` ratio reveals whether the pipeline is keeping up.
- **Kafka consumer group describe**: run `kafka-consumer-groups.sh --bootstrap-server <host>:9092 --describe --group <spark-query-group-id>` to see per-partition lag from Kafka's perspective. Remember: this reflects what Spark has committed to Kafka's consumer group metadata, which lags behind Spark's actual internal checkpoint state.
## Spark Structured Streaming vs Kafka Streams vs Flink: Choosing the Right Processing Engine
No single streaming engine dominates every scenario. The three major options (Spark Structured Streaming, Kafka Streams, and Apache Flink) each make different architectural trade-offs that surface in production at different pain points.
Spark Structured Streaming is a micro-batch engine by default (Continuous Processing mode exists but is experimental). Minimum latency is bounded by the trigger interval, practically 1-5 seconds. Its strengths are pipelines that also need batch processing (unified codebase via the DataFrame API), complex multi-source joins (static DataFrames, streaming-to-streaming joins with watermarks), and large-scale stateful aggregations over hours-long windows.
Kafka Streams runs as an embedded library inside your application JVM; no cluster manager is required. It achieves very low latency because it processes events record-by-record. Its state is stored in embedded RocksDB instances on the same machine as the processing logic, which means state access is local and fast. The trade-off: scaling Kafka Streams is done by adding application instances, each of which gets a subset of partitions. Heavy stateful operations (large joins with wide time windows) can exhaust local disk.
Apache Flink is a true stream processor with a unified batch/streaming API. It achieves low latency (< 100ms) while supporting exactly-once semantics and checkpointing via its own distributed state backend (RocksDB or in-memory). Its main cost is operational complexity: Flink requires a dedicated cluster, its own checkpoint management, and significantly more tuning surface area than either Spark or Kafka Streams.
| Dimension | Spark Structured Streaming | Kafka Streams | Apache Flink |
|---|---|---|---|
| Minimum latency | 1-5 seconds (micro-batch) | < 1 ms (record-by-record) | 10-100 ms |
| State storage | External (checkpointed to HDFS/S3) | Local RocksDB | RocksDB or heap |
| Exactly-once | Yes (with idempotent sink) | Yes | Yes |
| Deployment model | Spark cluster (YARN/K8s) | Embedded library in JVM app | Flink cluster (standalone/K8s) |
| SQL/DataFrame API | Full SQL + DataFrame | Streams DSL / KTable | Table API + SQL |
| Operational complexity | Medium (shared with Spark batch ops) | Low (no separate cluster) | High |
| Multi-source joins | Strong (broadcast, stream-stream) | Limited (KTable only) | Strong (both types) |
The latency floor of Spark Structured Streaming is its most significant limitation. If your SLA requires sub-second end-to-end event processing from Kafka ingestion to sink visibility, Spark is not the right tool; Kafka Streams or Flink should be evaluated. For everything above 5 seconds of acceptable latency, which covers most analytical pipelines, reporting dashboards, and enrichment workflows, Spark's unified API and cluster integration usually win.
Schema evolution is a cross-engine challenge. When producers change Avro schemas without respecting backward compatibility rules, all three engines fail in similar ways: deserialization exceptions that stop the pipeline until the schema registry is updated or the consumer is redeployed. The difference is that Spark's micro-batch model gives a clear failure boundary (the micro-batch that first encounters the new schema fails atomically), whereas Kafka Streams' record-by-record processing may partially process a batch before hitting the incompatible record.
## Streaming Engine Selection: Mapping Throughput, Latency, and State Complexity to the Right Tool
Use this table as a first-pass decision guide. For each row, pick the cell that matches your primary constraint, then verify that the trade-offs in that column are acceptable for your secondary constraints.
| Scenario | Spark Structured Streaming | Kafka Streams | Apache Flink |
|---|---|---|---|
| Latency requirement < 1 second | Not suitable | Preferred | Acceptable |
| Latency requirement 1-30 seconds | Preferred | Acceptable | Acceptable |
| Latency requirement > 30 seconds (near-real-time) | Preferred | Acceptable | Overkill |
| Throughput > 1M events/sec, complex joins | Preferred (scale cluster) | Difficult (partition limits) | Preferred |
| Throughput 10K-500K events/sec, simple transforms | Acceptable | Preferred (low overhead) | Acceptable |
| Stateful window aggregation (hours/days) | Preferred | Acceptable (disk bound) | Preferred |
| Shared batch + stream codebase | Preferred (Delta + Spark SQL) | Not suitable | Acceptable |
| No cluster management desired | Not suitable | Preferred | Not suitable |
| Advanced CEP (complex event patterns) | Not suitable | Not suitable | Preferred |
## Diagnosing Kafka Consumer Lag in a Live Spark Streaming Pipeline
This walkthrough describes how to diagnose a Spark Structured Streaming pipeline that is falling behind its Kafka source. The scenario: a payments pipeline reads from a 20-partition Kafka topic and writes to Delta Lake. Operations alerts show the pipeline is 45 minutes behind real-time.
**Step 1: Check the Spark Streaming UI.** Navigate to `http://<driver>:4040/StreamingQuery` and find the active query. Key metrics to read:

- **Input rate vs processing rate**: if input rate consistently exceeds processing rate, the pipeline is actively falling behind.
- **Batch duration**: if micro-batches are taking 5-10x longer than the trigger interval, a downstream write or a heavy transformation is the bottleneck.
- **Total delay**: the cumulative lag between event ingestion time and processing completion.
**Step 2: Check Kafka-side lag per partition.** Run the Kafka consumer group describe command (substituting your actual group ID from the `kafka.group.id` config):
```bash
kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker:9092 \
  --describe \
  --group spark-payments-streaming-v1
```
Look for partitions with disproportionately large lag values. A single "hot" partition with 10x the lag of others suggests a key distribution imbalance (a few producers are sending far more messages to specific partitions); the solution is Kafka-side repartitioning or a custom partitioner, not Spark tuning.
**Step 3: Inspect the checkpoint directory structure.** The checkpoint directory for a Kafka-backed streaming query follows a predictable layout:
```
checkpoint/
  commits/
    0          <- batch 0 committed
    1          <- batch 1 committed
    ...
  offsets/
    0          <- KafkaSourceOffset for batch 0 (what was read)
    1          <- KafkaSourceOffset for batch 1
    ...
  sources/
    0/         <- metadata for source 0 (Kafka)
      metadata
```
If the newest file in `offsets/` is significantly ahead of the newest file in `commits/`, a micro-batch is running but has not committed; it may be stuck. If both are identical, the query may be paused or the trigger interval is very long.
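That comparison is scriptable. The sketch below (a diagnostic helper with invented names, assuming the directory layout shown above) builds a fake checkpoint where batch 2 was planned in `offsets/` but never reached `commits/`, and detects the in-flight batch:

```python
import os
import tempfile

def in_flight_batch(checkpoint_dir):
    """Return the batch id that was planned (offsets/) but never
    committed (commits/), or None if the query is fully caught up."""
    def newest(subdir):
        path = os.path.join(checkpoint_dir, subdir)
        batch_ids = [int(n) for n in os.listdir(path) if n.isdigit()]
        return max(batch_ids, default=-1)

    planned, committed = newest("offsets"), newest("commits")
    return planned if planned > committed else None

# Simulate a checkpoint where batch 2 started but never committed.
root = tempfile.mkdtemp()
for subdir, batches in [("offsets", [0, 1, 2]), ("commits", [0, 1])]:
    os.makedirs(os.path.join(root, subdir))
    for b in batches:
        open(os.path.join(root, subdir, str(b)), "w").close()

assert in_flight_batch(root) == 2
```

Pointed at a real checkpoint on local disk (or adapted to an object-store listing for S3/ADLS checkpoints), the same comparison tells you in one call whether a batch is stuck.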
**Step 4: Adjust `maxOffsetsPerTrigger` conservatively.** If batch duration is the bottleneck, reduce `maxOffsetsPerTrigger` to shrink batch size, letting each micro-batch complete faster at the cost of consuming lag more slowly. If the pipeline has capacity headroom (micro-batches finishing in 10 seconds against a 60-second trigger), increase `maxOffsetsPerTrigger` by 50% and observe whether batch duration remains stable.
## Spark-Kafka Connector Configuration Reference
The Spark-Kafka connector exposes a focused set of options that control integration behavior. These are passed as key-value pairs to `readStream` or `writeStream` format options.
```yaml
# === Kafka Source Options (readStream) ===
kafka.bootstrap.servers: "kafka-broker-1:9092,kafka-broker-2:9092"
  # Required. Comma-separated list of broker addresses.
subscribe: "payments-prod,payments-staging"
  # OR: subscribePattern: "payments.*"
  # OR: assign: '{"payments-prod":[0,1,2]}'
startingOffsets: "latest"
  # "earliest" | "latest" | JSON per-partition map
  # Ignored after first checkpoint commit.
maxOffsetsPerTrigger: 100000
  # Total offsets consumed across all partitions per micro-batch.
  # Start at 50K-200K and tune based on batch duration.
failOnDataLoss: "true"
  # "true" = fail if Kafka has deleted requested offsets (retention expired).
  # "false" = skip missing offsets, continue from earliest available.
kafka.group.id: "spark-payments-streaming-v1"
  # Consumer group ID visible in Kafka monitoring tools.
  # Does NOT control Spark's internal offset management; the checkpoint does.
spark.sql.streaming.kafka.useDeprecatedOffsetFetching: "false"
  # Set false (Spark 3.1+) to use the new partition discovery mechanism.
  # Avoids leader-not-available errors during Kafka rebalances.

# === Kafka Source Options: Schema Registry (Avro/Protobuf) ===
kafka.schema.registry.url: "http://schema-registry:8081"
  # Used with from_avro() / to_avro() functions from the spark-avro package.

# === Kafka Sink Options (writeStream) ===
kafka.bootstrap.servers: "kafka-broker-1:9092,kafka-broker-2:9092"
  # Required for the sink too.
topic: "payments-enriched"
  # Default topic if a 'topic' column is not present in the output DataFrame.
kafka.acks: "all"
  # Required for durability. "all" = wait for all in-sync replicas.
kafka.enable.idempotence: "true"
  # Enables the idempotent producer; required for exactly-once Kafka sink semantics.
kafka.retries: "3"
  # Producer retry count on transient failures.
```
For a full deep-dive on Spark's core execution model and how streaming queries interact with the DAGScheduler, see Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained.
## Lessons Learned from Production Kafka-Spark Pipelines
**Checkpoint corruption is an operational reality, not a theoretical edge case.** In containerized deployments where checkpoint directories live on network-attached storage (S3, ADLS), partial writes during instance termination can corrupt the WAL. Treat the checkpoint directory as a first-class infrastructure component: version it, back it up before schema changes, and test your recovery procedure in staging.

**`failOnDataLoss=false` is not a free pass.** Silently skipping deleted offsets means your pipeline's output is missing data that was in Kafka but expired before Spark could process it. Use this flag only when you have explicit business logic for handling gaps: for example, when the pipeline is a best-effort metrics aggregator and missing 0.1% of events is acceptable.

**Kafka partition count is a ceiling, not a suggestion.** If your topic has 8 partitions but your Spark cluster has 40 executor cores, you are using 20% of your available parallelism on the read stage. Plan Kafka partition count as a scalability parameter from the start; increasing it later requires a producer restart and may disrupt key-ordering guarantees for in-flight consumers.

**Schema evolution breaks pipelines in ways that are hard to observe.** A producer team adds a field to their Avro schema without registering a backward-compatible version in the schema registry. Spark's Avro deserializer throws a `SchemaParseException`. The micro-batch fails. The pipeline stops. The alert fires 10 minutes later when lag crosses the threshold. By then, downstream dashboards are already stale. Solve this with schema compatibility enforcement at the producer side (registry configured to reject non-backward-compatible schemas) and add a schema validation test to your producer CI pipeline.

**Monitor two lag numbers, not one.** Kafka consumer group lag (from `kafka-consumer-groups.sh`) and Spark Streaming UI lag are different metrics. Kafka lag reflects what has been committed to Kafka's `__consumer_offsets` topic, which may lag behind Spark's actual internal checkpoint state by up to one micro-batch duration. In practice, Spark's UI lag is the ground truth; Kafka's consumer group lag is useful for cross-checking and for integrating with infrastructure-level alerting tools that don't have Spark UI access.
## Key Takeaways: Kafka + Spark Structured Streaming at a Glance
> **TL;DR:** Spark Structured Streaming integrates with Kafka through a checkpoint-owned offset management system that is independent of Kafka's consumer group mechanism. Configure `maxOffsetsPerTrigger` to balance throughput against latency, align Kafka partition count with executor parallelism, use idempotent producers for exactly-once Kafka sink semantics, and monitor both Spark Streaming UI lag and Kafka consumer group lag to catch pipeline drift early.
- Spark stores Kafka offsets in its checkpoint directory (`KafkaSourceOffset` JSON files), not in Kafka's `__consumer_offsets` topic. The checkpoint is the source of truth for recovery.
- `startingOffsets` only controls the initial start position; after the first checkpoint commit, Spark ignores it and resumes from the checkpoint.
- `maxOffsetsPerTrigger` is the primary backpressure knob; tune it to keep micro-batch duration in the 10-60 second range for most production workloads.
- Exactly-once requires two components: Spark's idempotent re-read guarantee (from the checkpoint) and the sink's idempotent or transactional write guarantee.
- Spark Structured Streaming is the right choice for pipelines with latency requirements above 1-5 seconds, complex multi-source joins, or unified batch/streaming codebases. For sub-second latency, evaluate Kafka Streams or Flink.
- The dead letter queue pattern is non-negotiable in production: route deserialization failures to a dedicated topic instead of stopping the pipeline.
## Practice Quiz: Kafka and Spark Structured Streaming
- A Spark Structured Streaming query reads from Kafka and writes to Delta Lake. After a successful run of 20 micro-batches, the query crashes. On restart, where does Spark look to determine the starting offset for each partition?
Show Answer
Correct Answer: Spark reads the last committedKafkaSourceOffset from the checkpoint directory (offsets/ subdirectory), not from Kafka's __consumer_offsets topic. The startingOffsets option in the query definition is ignored once at least one checkpoint commit exists. The checkpoint WAL is the authoritative source for offset recovery.
- You have a Kafka topic with 10 partitions. Your Spark cluster has 4 executor nodes, each with 3 cores (12 total cores). What is the maximum parallelism Spark can apply when reading from this topic, and why?
Show Answer
Correct Answer: The maximum parallelism is 10 β one task per Kafka partition. Even though the cluster has 12 cores, Spark can only create 10 tasks (one per partition), leaving 2 cores idle during the Kafka read stage. To use all 12 cores, the Kafka topic would need at least 12 partitions. This is why partition count planning is a critical scalability decision made at topic creation time.- What is the difference between
subscribe,subscribePattern, andassignwhen configuring a Kafka source? Give one scenario where each is preferable.
Show Answer
Correct Answer: -subscribe: Fixed comma-separated list of topic names. Preferable when you know the exact set of topics and want to avoid accidental auto-discovery of new matching topics. Most common in production.
- subscribePattern: Regex pattern matched against all topic names on the broker. Preferable when producers create topics dynamically following a naming convention (e.g., orders-2026-*) and you want Spark to auto-discover them.
- assign: Explicit JSON map of topic to partition list. Preferable when you need to read specific partitions of a shared topic without consuming partitions owned by another consumer, or during manual recovery.
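The three modes are mutually exclusive; a query sets exactly one of them. A sketch with illustrative topic names and partition numbers:

```python
import json

# Sketch: the three mutually exclusive ways to point a Kafka source at data.
# Topic names and partition numbers are illustrative.
fixed_topics   = {"subscribe": "orders,refunds"}         # known, fixed topic list
pattern_topics = {"subscribePattern": "orders-2026-.*"}  # regex auto-discovery
pinned_parts   = {"assign": json.dumps({"orders": [0, 1]})}  # explicit partitions

# assign takes a JSON map of topic -> partition list, e.g. {"orders": [0, 1]},
# which lets one consumer read specific partitions of a shared topic.
```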
- Your Spark Structured Streaming pipeline reads from Kafka and the Kafka retention policy is set to 4 hours. The pipeline is paused for 6 hours due to a deployment. When it restarts, what happens with failOnDataLoss=true vs failOnDataLoss=false?
Show Answer
Correct Answer: With failOnDataLoss=true (the default), Spark detects that the offset it wants to resume from has been deleted by Kafka's retention policy. The query throws an OffsetOutOfRangeException and stops. The data gap must be handled manually before restarting.
With failOnDataLoss=false, Spark silently skips the unavailable offsets and resumes from the earliest available offset on each partition. The pipeline restarts, but the 6-hour gap of events is permanently lost; no exception is raised. This is acceptable only when missing data is tolerable by design.
- Explain why exactly-once semantics in a Spark-to-Kafka pipeline requires both kafka.enable.idempotence=true on the producer side AND Spark's checkpoint-based offset tracking. What breaks if only one of these is in place?
Show Answer
Correct Answer: Spark's checkpoint ensures that if a micro-batch is retried after a failure, the same offset range is re-read from Kafka, preventing duplicate input. But if the Kafka producer is not idempotent, retrying the write side of a failed micro-batch can produce duplicate messages in the output topic (the producer sends the same record twice when it retries after a transient network error). Conversely, an idempotent producer without a checkpoint means Spark might re-read the same offset range twice (because it does not know what was already read) and send the same events again; producer idempotence only deduplicates broker-level retries of the same send, not a second application-level send of the same record. Both guarantees must be active simultaneously: the checkpoint prevents duplicate reads, the idempotent producer prevents duplicate writes.
- Open-ended challenge: A team reports that their Spark Structured Streaming pipeline processes Kafka events 10 seconds slower every day. After 7 days, micro-batches that once took 15 seconds now take 85 seconds. The topic's event rate has not changed. The cluster size has not changed. Describe a systematic investigation plan: what would you check first, second, and third, and what specific Spark UI metrics and Kafka CLI commands would you use at each step?
Show Answer
Open-ended: no single correct answer. Strong answers will cover:
First: Check the Spark Streaming UI for an increasing batch duration trend vs input rate. A growing batch duration against a constant input rate suggests state accumulation or sink write amplification, not a throughput problem. Look at the "Duration" column across recent batches in the Streaming tab.
Second: Check whether the streaming query uses windowed aggregations or stateful operations. Growing state (e.g., a sliding window that accumulates records) causes every micro-batch to process an increasingly large in-memory state store. The fix is proper watermarking to evict stale state. Run df.explain() or check the physical plan in the SQL tab for StateStoreSave operators.
Third: Check Delta Lake or sink write latency. If the pipeline writes to Delta Lake, a growing number of small files (file accumulation over 7 days of micro-batch writes) can cause OPTIMIZE / compaction pressure that slows writer throughput. Run DESCRIBE HISTORY delta_table_path and check write duration trends. The fix is scheduled OPTIMIZE and VACUUM operations.
Kafka CLI commands: kafka-consumer-groups.sh --describe to check whether per-partition lag is growing uniformly or is concentrated in specific partitions, useful for ruling out Kafka-side bottlenecks.
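The per-partition lag that kafka-consumer-groups.sh --describe reports is just LOG-END-OFFSET minus CURRENT-OFFSET. A sketch of the uniform-vs-concentrated diagnosis on made-up snapshot numbers:

```python
# Sketch: per-partition lag from two offset snapshots (sample numbers made up).
log_end   = {"payments-0": 1_000_000, "payments-1": 1_000_000}  # LOG-END-OFFSET
committed = {"payments-0":   920_000, "payments-1":   615_000}  # CURRENT-OFFSET

# Lag per topic-partition.
lag = {tp: log_end[tp] - committed[tp] for tp in log_end}

# Uniform lag across partitions points at overall throughput;
# lag concentrated in one partition points at key skew or a slow task.
worst = max(lag, key=lag.get)
```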
Related Posts
- Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained (how Spark's internal components coordinate to execute the streaming micro-batches that power Kafka pipelines)
- Spark Shuffles and GroupBy Performance: What Moves Data Across the Network (understanding shuffle cost is essential for tuning stateful streaming aggregations in Kafka-backed pipelines)
- Reading and Writing Data in Spark: Parquet, Delta, and JSON (Delta Lake as the canonical sink for Spark Structured Streaming pipelines: transactional writes, schema enforcement, and time-travel queries)
