# Kafka and Spark Structured Streaming: Building a Production Pipeline
Kafka delivers events and Spark processes them; offset management, schema evolution, and exactly-once delivery determine production reliability.
## The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart
An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The team starts with a straightforward approach: a hand-rolled Python consumer that polls Kafka, deserializes each message, and writes records to S3. It works in staging with 1,000 events per second. On day one in production, it falls apart completely.
The consumer cannot drain partitions fast enough. Kafka lag climbs from zero to 50 million messages within four hours. The team adds more consumer instances, but now they fight partition assignment conflicts and duplicate processing. They add a deduplication layer; now its state blows up in memory. They add checkpointing, but after a crash the consumer replays from the wrong offset and writes 2 million duplicate records into the data warehouse. The data team spends two days reconciling.
The root problem was not throughput; it was the absence of a structured framework for offset management, fault tolerance, and backpressure. Spark Structured Streaming provides exactly that framework for Kafka consumers, but only if you understand how it manages offsets, enforces rate limits, and coordinates checkpoints across a distributed cluster.
This post builds that understanding from the ground up. You will leave knowing how Spark tracks Kafka offsets (hint: not in Kafka's consumer group metadata), how to tune micro-batch throughput without overloading downstream sinks, and how to diagnose consumer lag before it becomes a warehouse reconciliation nightmare.
## Kafka as a Source and Sink in Structured Streaming: Subscription Modes and Message Schema
Before examining how Spark manages a Kafka connection, it is worth understanding what Kafka actually hands Structured Streaming. Every message read from Kafka arrives as a row in a fixed schema, regardless of what the original producer put into the value bytes.
### The Kafka Source Row Schema
When you read from Kafka as a Structured Streaming source, every micro-batch DataFrame has exactly these columns:
| Column | Type | Description |
|---|---|---|
| `key` | binary | Raw bytes of the message key (null if not set) |
| `value` | binary | Raw bytes of the message payload (your actual data) |
| `topic` | string | Kafka topic name the message was read from |
| `partition` | int | Kafka partition number (0-indexed) |
| `offset` | long | Offset of this message within its partition |
| `timestamp` | timestamp | Kafka producer or broker timestamp |
| `timestampType` | int | 0 = CreateTime (producer), 1 = LogAppendTime (broker) |
The critical point: `key` and `value` arrive as raw binary. You are responsible for deserialization. For JSON payloads, you call `from_json(col("value").cast("string"), schema)`. For Avro or Protobuf payloads, you call `from_avro` or integrate with a schema registry. This two-step design (raw bytes first, typed schema second) is intentional: Spark never assumes anything about message encoding.
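PySpark performs this decode at cluster scale, but the mechanics are easy to see in miniature. The stdlib-only sketch below models one Kafka source row (the field names mirror the schema table above; the payload contents are invented for illustration) and applies the same two steps that `cast("string")` plus `from_json` perform:

```python
import json

# A Kafka source row as Spark sees it: key/value are raw bytes,
# the metadata columns (topic, partition, offset) are already typed.
row = {
    "key": b"user-42",
    "value": b'{"amount": 125.5, "currency": "EUR"}',  # raw payload bytes
    "topic": "payments-prod",
    "partition": 3,
    "offset": 1048576,
}

# Step 1: cast the binary value to a string.
raw = row["value"].decode("utf-8")

# Step 2: parse the string against the expected schema.
# (PySpark's from_json returns null on malformed input; json.loads raises,
# which is why production code pairs this with a dead letter queue.)
record = json.loads(raw)

print(record["amount"], record["currency"])
```

Nothing in the row tells Spark the payload is JSON; the pipeline author supplies that knowledge in step 2.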
### Three Ways to Subscribe to Kafka Topics
Structured Streaming offers three subscription options, each suited to different operational patterns:
| Option | Syntax | When to use |
|---|---|---|
| `subscribe` | `"topic1,topic2"` | Fixed set of known topics; partition count may change |
| `subscribePattern` | `"payments.*"` | Topic names follow a naming convention; new topics auto-discovered on each micro-batch |
| `assign` | `'{"payments-prod":[0,1,2]}'` (JSON) | Explicit partition assignment; useful when sharing a topic with another consumer |
`subscribePattern` is powerful but carries a risk: if a new topic matching the pattern appears mid-stream, Spark automatically starts reading it at the `startingOffsets` policy, which may not be what you want for a topic created by mistake. In production, most teams use `subscribe` unless topic names genuinely follow a programmatic pattern.
### Kafka as a Streaming Sink
Writing back to Kafka uses three columns in your output DataFrame: `topic` (string, optional if set via the `topic` option), `key` (binary or string, nullable), and `value` (binary or string, required). The `key` column is optional but matters for downstream consumers that depend on partition assignment by key. If you write without a key, Kafka distributes messages round-robin across partitions, which breaks ordered processing for any consumer that relies on key-based partitioning.
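Why a missing key breaks ordering is easy to demonstrate. The sketch below models Kafka's partitioner in plain Python; note that Kafka's real default partitioner hashes keys with murmur2, while `crc32` stands in here only to keep the example deterministic and dependency-free:

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key, round_robin_state):
    """Hash-based assignment with a key, round-robin without one.

    Kafka's default partitioner actually uses murmur2 (and sticky
    batching for keyless sends); crc32 and plain round-robin stand in
    here just to make the two behaviors concrete.
    """
    if key is None:
        round_robin_state[0] += 1
        return round_robin_state[0] % NUM_PARTITIONS
    return zlib.crc32(key) % NUM_PARTITIONS

state = [0]
# With a key, every message for "user-42" lands on the same partition,
# so a single consumer sees them in order.
p1 = partition_for(b"user-42", state)
p2 = partition_for(b"user-42", state)
assert p1 == p2

# Without a key, messages spread across partitions: per-key order is lost.
spread = {partition_for(None, state) for _ in range(12)}
```

Twelve keyless sends touch all six partitions, which is exactly the behavior that breaks a consumer expecting key-based ordering.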
## How Spark Structured Streaming Controls Kafka Reads, Applies Rate Limits, and Delivers Exactly-Once
The Kafka-Spark integration is more than a simple poll loop. Three mechanics distinguish it from naive consumers: offset tracking via checkpoint files (not Kafka consumer groups), micro-batch rate limiting, and end-to-end exactly-once delivery through idempotent sinks.
### `startingOffsets`: Where the Query Begins

When a streaming query starts for the first time, Spark needs to know where to begin reading each partition. The `startingOffsets` option governs this:

- `"earliest"`: read from the oldest available offset on each partition. Use this for backfill pipelines or initial data loads.
- `"latest"`: start from the current end of each partition. Use this for real-time dashboards that only care about new events.
- A JSON string like `{"payments-prod":{"0":12345,"1":67890,"2":0}}`: start from specific offsets per partition. Use this for precise recovery after manual intervention.
Once the query has checkpointed at least one micro-batch, `startingOffsets` is ignored on subsequent restarts. Spark restores the last committed offset from the checkpoint directory, not from this option. This distinction is crucial: changing `startingOffsets` in code does nothing after the first successful batch.
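The restart rule fits in a few lines. This is an illustrative model (not Spark source code) of the decision Spark makes on startup; the topic name and offsets are made up:

```python
def resolve_start_offsets(checkpoint_offsets, starting_offsets):
    """Model of Spark's restart rule: a checkpoint, when present,
    always wins over the startingOffsets option."""
    if checkpoint_offsets is not None:
        return checkpoint_offsets          # restored from offsets/ files
    if starting_offsets == "earliest":
        return "oldest available offset per partition"
    if starting_offsets == "latest":
        return "current end of each partition"
    return starting_offsets                # explicit per-partition JSON map

# First start ever: no checkpoint exists, so the option applies.
assert resolve_start_offsets(None, "latest") == "current end of each partition"

# Any later restart: the option is ignored, even if the code changed.
cp = {"payments-prod": {"0": 12345, "1": 67890}}
assert resolve_start_offsets(cp, "earliest") == cp
```

The second assertion is the one that surprises teams in production: editing `startingOffsets` and redeploying changes nothing while the checkpoint survives.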
### `maxOffsetsPerTrigger`: The Backpressure Lever
Without rate limiting, a Spark query starting at earliest on a topic with 10 billion stored messages would attempt to process all 10 billion in the first micro-batch. The executor memory would overflow and the micro-batch would fail, or it would succeed but take hours, making the latency guarantee meaningless.
`maxOffsetsPerTrigger` caps the total number of Kafka offsets consumed per micro-batch across all partitions. If you have a topic with 20 partitions and set `maxOffsetsPerTrigger` to 100,000, each micro-batch reads at most 5,000 messages per partition. Spark distributes the cap proportionally to each partition's current lag. A partition with 80,000 pending offsets gets a larger slice than a partition with 2,000 pending offsets.
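The proportional split can be sketched directly. This is an illustrative model of the rate-limiting arithmetic, not Spark's exact implementation (Spark also handles rounding remainders and per-partition minimums):

```python
def distribute_cap(lag_per_partition, max_offsets):
    """Split a maxOffsetsPerTrigger budget across partitions in
    proportion to each partition's current lag."""
    total_lag = sum(lag_per_partition.values())
    if total_lag <= max_offsets:
        return dict(lag_per_partition)      # no throttling needed
    return {
        partition: int(max_offsets * lag / total_lag)
        for partition, lag in lag_per_partition.items()
    }

# Three partitions with very uneven lag, and a 10K budget:
caps = distribute_cap({"0": 80_000, "1": 2_000, "2": 18_000}, 10_000)
# The laggiest partition receives the largest slice of the budget.
```

Here partition 0 (80% of the total lag) gets 8,000 of the 10,000-offset budget, so the laggiest partition drains fastest without blowing past the batch cap.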
### Exactly-Once Delivery: Two Requirements
Exactly-once in Spark Structured Streaming requires two guarantees working together:
1. **Idempotent re-reads.** The Kafka source itself is idempotent. If a micro-batch fails mid-execution and Spark retries it, the same offset range is re-read because the checkpoint records what was committed, not what was attempted. No offset range is processed twice.
2. **Idempotent or transactional sink.** The source guarantee is wasted if the sink allows duplicates. For Kafka sinks, enabling the idempotent producer (`kafka.enable.idempotence=true`) ensures that a retried write does not produce duplicate messages. For Delta Lake sinks, the transactional write mechanism guarantees that a retried micro-batch either commits once or not at all.

Without both, you get at-least-once semantics at best.
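The sink-side half of the guarantee can be modeled in a few lines. This toy sink (an illustration, not the Delta Lake implementation) keys commits by batch ID, which is how a transactional sink makes a retried micro-batch a no-op:

```python
class IdempotentSink:
    """Toy transactional sink: a retried micro-batch with the same
    batch_id commits once or not at all."""

    def __init__(self):
        self.committed = {}              # batch_id -> rows written

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return                       # replay after a failure: no-op
        self.committed[batch_id] = list(rows)

sink = IdempotentSink()
sink.write(7, ["evt-a", "evt-b"])
sink.write(7, ["evt-a", "evt-b"])        # Spark retries the same offset range
total_rows = sum(len(rows) for rows in sink.committed.values())
```

Even though batch 7 was written twice, only two events land in the sink. Combined with the checkpoint's guarantee that a retry re-reads the same offset range, this yields exactly-once output.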
## Deep Dive: How Spark Owns Kafka Offset State and How to Tune Pipeline Performance
### The Internals of Kafka Offset Management in Spark
One of the most common misunderstandings in Spark-Kafka pipelines is the assumption that Spark uses Kafka's consumer group commit mechanism to track progress. It does not.
Spark manages its own offset state through the checkpoint directory, completely independent of Kafka's consumer group offset tracking. The checkpoint directory for a streaming query contains an `offsets/` subdirectory that holds one JSON file per micro-batch (a `sources/0/` subdirectory holds metadata for the first source). Each file is a `KafkaSourceOffset`: a snapshot of the per-partition offset ranges consumed in that batch.
A typical `KafkaSourceOffset` file looks like:

```json
{"payments-prod":{"0":12345,"1":67890,"2":45012}}
```
The WAL (Write-Ahead Log) pattern governs commit order. Before executing a micro-batch, Spark writes the intended offset range to the WAL. After the batch executes and the sink commits, Spark advances the checkpoint pointer. On restart after a failure, Spark reads the last committed pointer and replays from those offsets, regardless of whether Kafka consumer group metadata shows a different committed offset.
This creates a critical operational implication: the `kafka.group.id` setting (which assigns a Kafka consumer group ID to the Spark query) does not control recovery. It is useful for Kafka-side monitoring (`kafka-consumer-groups.sh --describe` will show the Spark query as a consumer group, letting you observe lag from Kafka's perspective), but Spark ignores the consumer group offset in `__consumer_offsets` when it restarts. Only the checkpoint directory matters.
If you delete or corrupt the checkpoint directory, Spark treats the query as brand new and falls back to startingOffsets. This means deleting a checkpoint to "reset" a query is not a safe operation unless you also intend to reprocess all data.
The option `failOnDataLoss` (default: `true`) determines what happens if Kafka has deleted offsets that Spark is trying to read, for example because the Kafka retention period expired while the pipeline was paused. With `failOnDataLoss=true`, the query throws an exception and stops. With `failOnDataLoss=false`, Spark silently skips the missing offsets and continues from the earliest available offset. In high-throughput pipelines where brief data gaps are acceptable, `failOnDataLoss=false` prevents unnecessary outages.
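The two behaviors reduce to one comparison. The sketch below is an illustrative model of the decision (the exception class and function name are invented for the example; Spark raises its own error type with a data-loss message):

```python
class DataLossError(RuntimeError):
    """Stand-in for the exception Spark raises on detected data loss."""

def resume_offset(committed, earliest_available, fail_on_data_loss):
    """Decide where to resume when retention may have deleted offsets."""
    if committed >= earliest_available:
        return committed                   # nothing was lost; resume normally
    if fail_on_data_loss:
        raise DataLossError(
            f"offsets {committed}..{earliest_available - 1} were deleted by retention"
        )
    return earliest_available              # skip the gap silently

# Retention has not caught up: resume from the checkpointed offset.
assert resume_offset(500, 400, True) == 500

# Retention deleted offsets 100..399 while the pipeline was paused:
assert resume_offset(100, 400, False) == 400      # false: silent skip
try:
    resume_offset(100, 400, True)                 # true: query stops
except DataLossError:
    pass
```

The silent-skip branch is what makes `failOnDataLoss=false` dangerous: the 300 missing offsets leave no exception behind, only absent rows downstream.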
### Performance Analysis: Throughput vs. Latency Tuning in Kafka-backed Pipelines
Tuning a Spark Structured Streaming pipeline for Kafka involves four interdependent variables. Changing one always affects the others.
**Kafka partition count and Spark parallelism.** Spark reads one Kafka partition per task. If your topic has 20 partitions and your cluster has 10 executors each with 4 cores, Spark can process all 20 partitions in parallel comfortably. But if the topic has 200 partitions and your cluster has only 5 executor cores, Spark serializes reads into batches of 5 partitions at a time. The maximum read parallelism equals `min(Kafka partition count, total executor cores)`. This means there is a ceiling on throughput that no amount of `maxOffsetsPerTrigger` tuning can exceed unless you also increase either partition count or executor resources.
**`maxOffsetsPerTrigger` and micro-batch duration.** Setting `maxOffsetsPerTrigger` too low creates small, fast micro-batches: low latency, but high scheduling overhead. Setting it too high creates large, slow micro-batches: high throughput per batch, but increased end-to-end latency. A practical starting point: target micro-batch duration between 10 and 60 seconds. If your micro-batches consistently complete in under 5 seconds, increase `maxOffsetsPerTrigger` by 2x. If they regularly exceed 2 minutes, reduce it by 50%.
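That rule of thumb is mechanical enough to encode. The helper below is a sketch of the heuristic from the paragraph above (the thresholds are the ones stated there, applied to an average of recent batch durations; function and variable names are invented):

```python
def tune_max_offsets(current_cap, recent_batch_secs):
    """Rule-of-thumb tuner: double the cap when batches finish in
    under 5 s, halve it when they average over 2 minutes."""
    avg = sum(recent_batch_secs) / len(recent_batch_secs)
    if avg < 5:
        return current_cap * 2      # headroom: consume lag faster
    if avg > 120:
        return current_cap // 2     # overloaded: shrink the batches
    return current_cap              # inside the 10-60 s sweet spot

assert tune_max_offsets(100_000, [3, 4, 2]) == 200_000
assert tune_max_offsets(100_000, [150, 130, 140]) == 50_000
assert tune_max_offsets(100_000, [30, 40]) == 100_000
```

In practice you would feed `recent_batch_secs` from the query's progress metrics and apply changes gradually, one adjustment per observation window.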
**Kafka consumer lag as a health indicator.** The difference between the latest Kafka offset and the last committed Spark offset is the consumer lag. A lag that grows steadily means your pipeline is not keeping up: either `maxOffsetsPerTrigger` is too low for the incoming event rate, or your transformations are too expensive for the micro-batch window. A lag that spikes and then recovers is normal behavior when Spark is catching up after a backpressure event or a brief pause.
**Schema deserialization cost.** Avro and Protobuf deserialization via a schema registry adds a network round-trip per batch: the registry must be queried to resolve the schema ID embedded in each message. This cost is often overlooked during capacity planning. Schema registry responses should be cached locally (most connectors do this by default), but cold starts after a restart can add 1-3 seconds of overhead per micro-batch until the cache warms up.
| Tuning lever | Effect on throughput | Effect on latency | Risk if misconfigured |
|---|---|---|---|
| `maxOffsetsPerTrigger` increase | Higher batch throughput | Higher end-to-end latency | Executor OOM on very large batches |
| Kafka partition count increase | Higher max parallelism | Lower time per partition | Rebalancing disruption during scaling |
| Executor count increase | More parallel tasks | Lower time per micro-batch | Resource cost; diminishing returns beyond partition count |
| Trigger interval decrease | Lower amortized throughput | Lower latency floor | Driver scheduling overhead; frequent small commits |
## The Complete Kafka-to-Spark Pipeline: From Partition Read to Checkpoint Commit
The diagram below traces the full lifecycle of a single micro-batch in a Kafka-Spark pipeline. The flow moves from Kafka topic partitions through Spark's internal offset tracking machinery, through transformation and deserialization, out to either a Delta Lake table or a downstream Kafka topic, and finally to the checkpoint commit that records progress.
```mermaid
graph TD
    KP1[Kafka Partition 0]
    KP2[Kafka Partition 1]
    KP3[Kafka Partition N]
    KS[Spark Kafka Source]
    OT[Offset Tracker reads KafkaSourceOffset from checkpoint]
    RB[Rate Limiter applies maxOffsetsPerTrigger per partition]
    MB[Micro-Batch Execution Engine]
    DS[Deserialization - JSON or Avro from_json or from_avro]
    TX[Transformation - filter, join, aggregate, watermark]
    DL[Delta Lake Sink - transactional write]
    KOut[Kafka Sink - idempotent producer]
    CP[Checkpoint WAL - commit offset range]
    KP1 --> KS
    KP2 --> KS
    KP3 --> KS
    KS --> OT
    OT --> RB
    RB --> MB
    MB --> DS
    DS --> TX
    TX --> DL
    TX --> KOut
    DL --> CP
    KOut --> CP
```
Read this diagram from top to bottom. Kafka partitions are polled in parallel by the Spark Kafka source. The offset tracker reads the last committed `KafkaSourceOffset` from the checkpoint directory to determine the start offset for each partition. The rate limiter applies `maxOffsetsPerTrigger` to cap how many messages enter the micro-batch. The execution engine deserializes value bytes into typed records, applies your transformations, and writes to the configured sink. Only after the sink successfully commits does Spark record the batch in the checkpoint commit log; this ordering is what prevents data loss on failure.
The two sink paths are not mutually exclusive. A single streaming query can write to both a Delta Lake table and a downstream Kafka topic using `foreachBatch`, which lets you address multiple sinks within one micro-batch execution (the writes are coordinated by batch ID rather than being jointly atomic).
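The shape of that fan-out can be sketched without Spark. The model below (illustrative, with invented names; a real `foreachBatch` function receives a DataFrame and a batch ID) shows the ordering that matters: both sinks are written inside the batch, and the offset commit happens only after both succeed, so a crash mid-batch causes a replay that idempotent sinks absorb:

```python
def run_micro_batch(batch_id, rows, delta_sink, kafka_sink, checkpoint):
    """foreachBatch-style fan-out: write both sinks, then commit.

    If either write raises, the checkpoint is never advanced, so the
    whole batch is replayed on restart; idempotent sinks make that
    replay safe even though the two writes are not jointly atomic.
    """
    delta_sink.append((batch_id, list(rows)))   # analytical store
    kafka_sink.append((batch_id, list(rows)))   # enriched topic
    checkpoint.append(batch_id)                  # commit comes LAST

delta, kafka_out, ckpt = [], [], []
run_micro_batch(0, ["evt-1", "evt-2"], delta, kafka_out, ckpt)
```

Reversing the last two lines (committing before the Kafka write) would turn a crash between them into silent data loss, which is why the commit-last ordering is the invariant to preserve in any `foreachBatch` body.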
## Where Spark + Kafka Integration Powers Real Systems
### Real-Time ETL to Delta Lake
The most common production pattern is Kafka acting as the ingestion buffer and Delta Lake as the analytical store. Events arrive in Kafka with raw JSON or Avro payloads. Spark reads micro-batches, deserializes and validates each record, applies business-level transformations (currency normalization, user ID resolution, event deduplication), and writes structured Parquet files into a Delta table. Delta's transaction log gives the downstream analytics team a consistent read view even while Spark is actively writing.
### Kafka-to-Kafka Stream Processing
When multiple downstream systems need the same enriched event stream, the Kafka-to-Kafka pattern avoids duplicating the enrichment logic in every consumer. Spark reads raw events from a source topic, joins them against a static reference DataFrame (a product catalog, a user attributes table cached as a broadcast variable), and writes enriched events to a new topic. Downstream consumers read the enriched topic and skip the join cost entirely.
### Dead Letter Queue for Deserialization Failures
In any high-volume pipeline, some messages will fail deserialization: a malformed JSON record, a producer that forgot to register a schema, a truncated Avro message. The dead letter queue pattern handles this without stopping the pipeline. Inside a `foreachBatch` sink, the code separates records into two DataFrames: `valid_df` (records that deserialized successfully) and `failed_df` (records where `from_json` returned null). `valid_df` writes to the primary Delta table. `failed_df` writes to a `payments-dlq` Kafka topic that operations teams monitor and replay after the root cause is fixed.
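The split itself is a one-pass partition of the batch. A stdlib sketch of the same valid/failed separation (in PySpark you would filter on `from_json` returning null instead of catching an exception; the sample payloads are invented):

```python
import json

def split_batch(raw_values):
    """DLQ pattern: records that fail deserialization go to the failed
    list instead of killing the micro-batch."""
    valid, failed = [], []
    for raw in raw_values:
        try:
            valid.append(json.loads(raw))
        except json.JSONDecodeError:
            failed.append(raw)       # destined for the payments-dlq topic
    return valid, failed

valid_df, failed_df = split_batch([
    b'{"amount": 10}',
    b'{"amount": ',          # truncated message
    b'not json at all',      # producer bug
])
```

The key property: two poison messages do not stop the batch, and the raw bytes are preserved in `failed_df` so the records can be replayed after the producer bug is fixed.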
### Monitoring Kafka Consumer Lag
Two complementary monitoring surfaces exist for Spark-Kafka pipelines:

- **Spark Streaming UI**, available at `http://<driver-host>:4040/StreamingQuery`. Shows input rate (records per second), processing rate, micro-batch duration, and end-to-end lag per query. The `inputRowsPerSecond` vs `processedRowsPerSecond` ratio reveals whether the pipeline is keeping up.
- **Kafka consumer group describe**: run `kafka-consumer-groups.sh --bootstrap-server <host>:9092 --describe --group <spark-query-group-id>` to see per-partition lag from Kafka's perspective. Remember: this reflects what Spark has committed to Kafka's consumer group metadata, which lags behind Spark's actual internal checkpoint state.
## Spark Structured Streaming vs Kafka Streams vs Flink: Choosing the Right Processing Engine
No single streaming engine dominates every scenario. The three major options (Spark Structured Streaming, Kafka Streams, and Apache Flink) each make different architectural trade-offs that surface in production at different pain points.
Spark Structured Streaming is a micro-batch engine by default (Continuous Processing mode exists but is experimental). Minimum latency is bounded by the trigger interval, practically 1-5 seconds. Its strengths are pipelines that also need batch processing (unified codebase via the DataFrame API), complex multi-source joins (static DataFrames, streaming-to-streaming joins with watermarks), and large-scale stateful aggregations over hours-long windows.
Kafka Streams runs as an embedded library inside your application JVM; no cluster manager is required. It achieves very low latency because it processes events record-by-record. Its state is stored in embedded RocksDB instances on the same machine as the processing logic, which means state access is local and fast. The trade-off: scaling Kafka Streams is done by adding application instances, each of which gets a subset of partitions. Heavy stateful operations (large joins with wide time windows) can exhaust local disk.
Apache Flink is a true stream processor with a unified batch/streaming API. It achieves low latency (< 100ms) while supporting exactly-once semantics and checkpointing via its own distributed state backend (RocksDB or in-memory). Its main cost is operational complexity: Flink requires a dedicated cluster, its own checkpoint management, and significantly more tuning surface area than either Spark or Kafka Streams.
| Dimension | Spark Structured Streaming | Kafka Streams | Apache Flink |
|---|---|---|---|
| Minimum latency | 1-5 seconds (micro-batch) | < 1 ms (record-by-record) | 10-100 ms |
| State storage | External (checkpointed to HDFS/S3) | Local RocksDB | RocksDB or heap |
| Exactly-once | Yes (with idempotent sink) | Yes | Yes |
| Deployment model | Spark cluster (YARN/K8s) | Embedded library in JVM app | Flink cluster (standalone/K8s) |
| SQL/DataFrame API | Full SQL + DataFrame | Streams DSL / KTable | Table API + SQL |
| Operational complexity | Medium (shared with Spark batch ops) | Low (no separate cluster) | High |
| Multi-source joins | Strong (broadcast, stream-stream) | Limited (KTable only) | Strong (both types) |
The latency floor of Spark Structured Streaming is its most significant limitation. If your SLA requires sub-second end-to-end event processing from Kafka ingestion to sink visibility, Spark is not the right tool; Kafka Streams or Flink should be evaluated. For everything above 5 seconds of acceptable latency, which covers most analytical pipelines, reporting dashboards, and enrichment workflows, Spark's unified API and cluster integration usually win.
Schema evolution is a cross-engine challenge. When producers change Avro schemas without respecting backward compatibility rules, all three engines fail in similar ways: deserialization exceptions that stop the pipeline until the schema registry is updated or the consumer is redeployed. The difference is that Spark's micro-batch model gives a clear failure boundary (the micro-batch that first encounters the new schema fails atomically), whereas Kafka Streams' record-by-record processing may partially process a batch before hitting the incompatible record.
## Streaming Engine Selection: Mapping Throughput, Latency, and State Complexity to the Right Tool
Use this table as a first-pass decision guide. For each row, pick the cell that matches your primary constraint, then verify that the trade-offs in that column are acceptable for your secondary constraints.
| Scenario | Spark Structured Streaming | Kafka Streams | Apache Flink |
|---|---|---|---|
| Latency requirement < 1 second | Not suitable | Preferred | Acceptable |
| Latency requirement 1-30 seconds | Preferred | Acceptable | Acceptable |
| Latency requirement > 30 seconds (near-real-time) | Preferred | Acceptable | Overkill |
| Throughput > 1M events/sec, complex joins | Preferred (scale cluster) | Difficult (partition limits) | Preferred |
| Throughput 10K-500K events/sec, simple transforms | Acceptable | Preferred (low overhead) | Acceptable |
| Stateful window aggregation (hours/days) | Preferred | Acceptable (disk bound) | Preferred |
| Shared batch + stream codebase | Preferred (Delta + Spark SQL) | Not suitable | Acceptable |
| No cluster management desired | Not suitable | Preferred | Not suitable |
| Advanced CEP (complex event patterns) | Not suitable | Not suitable | Preferred |
## Diagnosing Kafka Consumer Lag in a Live Spark Streaming Pipeline
This walkthrough describes how to diagnose a Spark Structured Streaming pipeline that is falling behind its Kafka source. The scenario: a payments pipeline reads from a 20-partition Kafka topic and writes to Delta Lake. Operations alerts show the pipeline is 45 minutes behind real-time.
**Step 1: Check the Spark Streaming UI.** Navigate to `http://<driver>:4040/StreamingQuery` and find the active query. Key metrics to read:

- **Input rate vs processing rate**: if input rate consistently exceeds processing rate, the pipeline is actively falling behind.
- **Batch duration**: if micro-batches are taking 5-10x longer than the trigger interval, a downstream write or a heavy transformation is the bottleneck.
- **Total delay**: the cumulative lag between event ingestion time and processing completion.
**Step 2: Check Kafka-side lag per partition.** Run the Kafka consumer group describe command (substituting your actual group ID from the `kafka.group.id` config):
```bash
kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker:9092 \
  --describe \
  --group spark-payments-streaming-v1
```
Look for partitions with disproportionately large lag values. A single "hot" partition with 10x the lag of others suggests a key distribution imbalance (a few producers are sending far more messages to specific partitions); the solution is Kafka-side repartitioning or a custom partitioner, not Spark tuning.
**Step 3: Inspect the checkpoint directory structure.** The checkpoint directory for a Kafka-backed streaming query follows a predictable layout:
```
checkpoint/
  commits/
    0          <- batch 0 committed
    1          <- batch 1 committed
    ...
  offsets/
    0          <- KafkaSourceOffset for batch 0 (what was read)
    1          <- KafkaSourceOffset for batch 1
    ...
  sources/
    0/         <- metadata for source 0 (Kafka)
      metadata
```
If the newest file in `offsets/` is significantly ahead of the newest file in `commits/`, a micro-batch is running but has not committed; it may be stuck. If both are identical, the query may be paused or the trigger interval is very long.
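That comparison is scriptable. The sketch below (a diagnostic helper with invented names, assuming the directory layout shown above) builds a fake checkpoint where batch 2 was planned in `offsets/` but never reached `commits/`, and detects the in-flight batch:

```python
import os
import tempfile

def in_flight_batch(checkpoint_dir):
    """Return the batch id that was planned (offsets/) but never
    committed (commits/), or None if the query is fully caught up."""
    def newest(subdir):
        path = os.path.join(checkpoint_dir, subdir)
        batch_ids = [int(n) for n in os.listdir(path) if n.isdigit()]
        return max(batch_ids, default=-1)

    planned, committed = newest("offsets"), newest("commits")
    return planned if planned > committed else None

# Simulate a checkpoint where batch 2 started but never committed.
root = tempfile.mkdtemp()
for subdir, batches in [("offsets", [0, 1, 2]), ("commits", [0, 1])]:
    os.makedirs(os.path.join(root, subdir))
    for b in batches:
        open(os.path.join(root, subdir, str(b)), "w").close()

assert in_flight_batch(root) == 2
```

Pointed at a real checkpoint on local disk (or adapted to an object-store listing for S3/ADLS checkpoints), the same comparison tells you in one call whether a batch is stuck.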
**Step 4: Adjust `maxOffsetsPerTrigger` conservatively.** If batch duration is the bottleneck, reduce `maxOffsetsPerTrigger` to shrink batch size, letting each micro-batch complete faster at the cost of consuming lag more slowly. If the pipeline has capacity headroom (micro-batches finishing in 10 seconds against a 60-second trigger), increase `maxOffsetsPerTrigger` by 50% and observe whether batch duration remains stable.
## Spark-Kafka Connector Configuration Reference
The Spark-Kafka connector exposes a focused set of options that control integration behavior. These are passed as key-value pairs to `readStream` or `writeStream` format options.
```yaml
# === Kafka Source Options (readStream) ===
kafka.bootstrap.servers: "kafka-broker-1:9092,kafka-broker-2:9092"
  # Required. Comma-separated list of broker addresses.
subscribe: "payments-prod,payments-staging"
  # OR: subscribePattern: "payments.*"
  # OR: assign: '{"payments-prod":[0,1,2]}'
startingOffsets: "latest"
  # "earliest" | "latest" | JSON per-partition map
  # Ignored after first checkpoint commit.
maxOffsetsPerTrigger: 100000
  # Total offsets consumed across all partitions per micro-batch.
  # Start at 50K-200K and tune based on batch duration.
failOnDataLoss: "true"
  # "true" = fail if Kafka has deleted requested offsets (retention expired).
  # "false" = skip missing offsets, continue from earliest available.
kafka.group.id: "spark-payments-streaming-v1"
  # Consumer group ID visible in Kafka monitoring tools.
  # Does NOT control Spark's internal offset management; the checkpoint does.
spark.sql.streaming.kafka.useDeprecatedOffsetFetching: "false"
  # Set false (Spark 3.1+) to use the new partition discovery mechanism.
  # Avoids leader-not-available errors during Kafka rebalances.

# === Kafka Source Options: Schema Registry (Avro/Protobuf) ===
kafka.schema.registry.url: "http://schema-registry:8081"
  # Used with from_avro() / to_avro() functions from the spark-avro package.

# === Kafka Sink Options (writeStream) ===
kafka.bootstrap.servers: "kafka-broker-1:9092,kafka-broker-2:9092"
  # Required for the sink too.
topic: "payments-enriched"
  # Default topic if a 'topic' column is not present in the output DataFrame.
kafka.acks: "all"
  # Required for durability. "all" = wait for all in-sync replicas.
kafka.enable.idempotence: "true"
  # Enables the idempotent producer; required for exactly-once Kafka sink semantics.
kafka.retries: "3"
  # Producer retry count on transient failures.
```
For a full deep-dive on Spark's core execution model and how streaming queries interact with the DAGScheduler, see Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained.
## Lessons Learned from Production Kafka-Spark Pipelines
**Checkpoint corruption is an operational reality, not a theoretical edge case.** In containerized deployments where checkpoint directories live on network-attached storage (S3, ADLS), partial writes during instance termination can corrupt the WAL. Treat the checkpoint directory as a first-class infrastructure component: version it, back it up before schema changes, and test your recovery procedure in staging.

**`failOnDataLoss=false` is not a free pass.** Silently skipping deleted offsets means your pipeline's output is missing data that was in Kafka but expired before Spark could process it. Use this flag only when you have explicit business logic for handling gaps: for example, when the pipeline is a best-effort metrics aggregator and missing 0.1% of events is acceptable.

**Kafka partition count is a ceiling, not a suggestion.** If your topic has 8 partitions but your Spark cluster has 40 executor cores, you are using 20% of your available parallelism on the read stage. Plan Kafka partition count as a scalability parameter from the start; increasing it later requires a producer restart and may disrupt key-ordering guarantees for in-flight consumers.

**Schema evolution breaks pipelines in ways that are hard to observe.** A producer team adds a field to their Avro schema without registering a backward-compatible version in the schema registry. Spark's Avro deserializer throws a `SchemaParseException`. The micro-batch fails. The pipeline stops. The alert fires 10 minutes later when lag crosses the threshold. By then, downstream dashboards are already stale. Solve this with schema compatibility enforcement at the producer side (registry configured to reject non-backward-compatible schemas) and add a schema validation test to your producer CI pipeline.

**Monitor two lag numbers, not one.** Kafka consumer group lag (from `kafka-consumer-groups.sh`) and Spark Streaming UI lag are different metrics. Kafka lag reflects what has been committed to Kafka's `__consumer_offsets` topic, which may lag behind Spark's actual internal checkpoint state by up to one micro-batch duration. In practice, Spark's UI lag is the ground truth; Kafka's consumer group lag is useful for cross-checking and for integrating with infrastructure-level alerting tools that don't have Spark UI access.
## Key Takeaways: Kafka + Spark Structured Streaming at a Glance
> **TL;DR:** Spark Structured Streaming integrates with Kafka through a checkpoint-owned offset management system that is independent of Kafka's consumer group mechanism. Configure `maxOffsetsPerTrigger` to balance throughput against latency, align Kafka partition count with executor parallelism, use idempotent producers for exactly-once Kafka sink semantics, and monitor both Spark Streaming UI lag and Kafka consumer group lag to catch pipeline drift early.
- Spark stores Kafka offsets in its checkpoint directory (`KafkaSourceOffset` JSON files), not in Kafka's `__consumer_offsets` topic. The checkpoint is the source of truth for recovery.
- `startingOffsets` only controls the initial start position; after the first checkpoint commit, Spark ignores it and resumes from the checkpoint.
- `maxOffsetsPerTrigger` is the primary backpressure knob; tune it to keep micro-batch duration in the 10-60 second range for most production workloads.
- Exactly-once requires two components: Spark's idempotent re-read guarantee (from the checkpoint) and the sink's idempotent or transactional write guarantee.
- Spark Structured Streaming is the right choice for pipelines with latency requirements above 1-5 seconds, complex multi-source joins, or unified batch/streaming codebases. For sub-second latency, evaluate Kafka Streams or Flink.
- The dead letter queue pattern is non-negotiable in production: route deserialization failures to a dedicated topic instead of stopping the pipeline.
## Practice Quiz: Kafka and Spark Structured Streaming
- A Spark Structured Streaming query reads from Kafka and writes to Delta Lake. After a successful run of 20 micro-batches, the query crashes. On restart, where does Spark look to determine the starting offset for each partition?
Show Answer
Correct Answer: Spark reads the last committedKafkaSourceOffset from the checkpoint directory (offsets/ subdirectory), not from Kafka's __consumer_offsets topic. The startingOffsets option in the query definition is ignored once at least one checkpoint commit exists. The checkpoint WAL is the authoritative source for offset recovery.
- You have a Kafka topic with 10 partitions. Your Spark cluster has 4 executor nodes, each with 3 cores (12 total cores). What is the maximum parallelism Spark can apply when reading from this topic, and why?
Show Answer
Correct Answer: The maximum parallelism is 10 β one task per Kafka partition. Even though the cluster has 12 cores, Spark can only create 10 tasks (one per partition), leaving 2 cores idle during the Kafka read stage. To use all 12 cores, the Kafka topic would need at least 12 partitions. This is why partition count planning is a critical scalability decision made at topic creation time.- What is the difference between
subscribe,subscribePattern, andassignwhen configuring a Kafka source? Give one scenario where each is preferable.
Show Answer
Correct Answer: -subscribe: Fixed comma-separated list of topic names. Preferable when you know the exact set of topics and want to avoid accidental auto-discovery of new matching topics. Most common in production.
- subscribePattern: Regex pattern matched against all topic names on the broker. Preferable when producers create topics dynamically following a naming convention (e.g., orders-2026-*) and you want Spark to auto-discover them.
- assign: Explicit JSON map of topic to partition list. Preferable when you need to read specific partitions of a shared topic without consuming partitions owned by another consumer, or during manual recovery.
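The three modes are mutually exclusive; a query sets exactly one of them. A sketch with illustrative topic names and partition numbers:

```python
import json

# Sketch: the three mutually exclusive ways to point a Kafka source at data.
# Topic names and partition numbers are illustrative.
fixed_topics   = {"subscribe": "orders,refunds"}         # known, fixed topic list
pattern_topics = {"subscribePattern": "orders-2026-.*"}  # regex auto-discovery
pinned_parts   = {"assign": json.dumps({"orders": [0, 1]})}  # explicit partitions

# assign takes a JSON map of topic -> partition list, e.g. {"orders": [0, 1]},
# which lets one consumer read specific partitions of a shared topic.
```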
- Your Spark Structured Streaming pipeline reads from Kafka and the Kafka retention policy is set to 4 hours. The pipeline is paused for 6 hours due to a deployment. When it restarts, what happens with failOnDataLoss=true vs failOnDataLoss=false?
Show Answer
Correct Answer: With failOnDataLoss=true (the default), Spark detects that the offset it wants to resume from has been deleted by Kafka's retention policy. The query throws an OffsetOutOfRangeException and stops. The data gap must be handled manually before restarting.
With failOnDataLoss=false, Spark silently skips the unavailable offsets and resumes from the earliest available offset on each partition. The pipeline restarts, but the 6-hour gap of events is permanently lost; no exception is raised. This is acceptable only when missing data is tolerable by design.
- Explain why exactly-once semantics in a Spark-to-Kafka pipeline requires both kafka.enable.idempotence=true on the producer side AND Spark's checkpoint-based offset tracking. What breaks if only one of these is in place?
Show Answer
Correct Answer: Spark's checkpoint ensures that if a micro-batch is retried after a failure, the same offset range is re-read from Kafka, preventing duplicate input. But if the Kafka producer is not idempotent, retrying the write side of a failed micro-batch can produce duplicate messages in the output topic (the producer sends the same record twice when it retries after a transient network error). Conversely, an idempotent producer without a checkpoint means Spark might re-read the same offset range twice (because it does not know what was already read) and send the same events again; producer idempotence only deduplicates broker-level retries of the same send, not a second application-level send of the same record. Both guarantees must be active simultaneously: the checkpoint prevents duplicate reads, the idempotent producer prevents duplicate writes.
- Open-ended challenge: A team reports that their Spark Structured Streaming pipeline processes Kafka events 10 seconds slower every day. After 7 days, micro-batches that once took 15 seconds now take 85 seconds. The topic's event rate has not changed. The cluster size has not changed. Describe a systematic investigation plan: what would you check first, second, and third, and what specific Spark UI metrics and Kafka CLI commands would you use at each step?
Show Answer
Open-ended: no single correct answer. Strong answers will cover:
First: Check the Spark Streaming UI for an increasing batch duration trend vs input rate. A growing batch duration against a constant input rate suggests state accumulation or sink write amplification, not a throughput problem. Look at the "Duration" column across recent batches in the Streaming tab.
Second: Check whether the streaming query uses windowed aggregations or stateful operations. Growing state (e.g., a sliding window that accumulates records) causes every micro-batch to process an increasingly large in-memory state store. The fix is proper watermarking to evict stale state. Run df.explain() or check the physical plan in the SQL tab for StateStoreSave operators.
Third: Check Delta Lake or sink write latency. If the pipeline writes to Delta Lake, a growing number of small files (file accumulation over 7 days of micro-batch writes) can cause OPTIMIZE / compaction pressure that slows writer throughput. Run DESCRIBE HISTORY delta_table_path and check write duration trends. The fix is scheduled OPTIMIZE and VACUUM operations.
Kafka CLI commands: kafka-consumer-groups.sh --describe to check whether per-partition lag is growing uniformly or is concentrated in specific partitions, useful for ruling out Kafka-side bottlenecks.
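The per-partition lag that kafka-consumer-groups.sh --describe reports is just LOG-END-OFFSET minus CURRENT-OFFSET. A sketch of the uniform-vs-concentrated diagnosis on made-up snapshot numbers:

```python
# Sketch: per-partition lag from two offset snapshots (sample numbers made up).
log_end   = {"payments-0": 1_000_000, "payments-1": 1_000_000}  # LOG-END-OFFSET
committed = {"payments-0":   920_000, "payments-1":   615_000}  # CURRENT-OFFSET

# Lag per topic-partition.
lag = {tp: log_end[tp] - committed[tp] for tp in log_end}

# Uniform lag across partitions points at overall throughput;
# lag concentrated in one partition points at key skew or a slow task.
worst = max(lag, key=lag.get)
```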
Related Posts
- Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained (how Spark's internal components coordinate to execute the streaming micro-batches that power Kafka pipelines)
- Spark Shuffles and GroupBy Performance: What Moves Data Across the Network (understanding shuffle cost is essential for tuning stateful streaming aggregations in Kafka-backed pipelines)
- Reading and Writing Data in Spark: Parquet, Delta, and JSON (Delta Lake as the canonical sink for Spark Structured Streaming pipelines: transactional writes, schema enforcement, and time-travel queries)
