How Kafka Works: The Log That Never Forgets
Kafka is not just a message queue; it's a distributed streaming platform. We explain Topics, Partitions, and Offsets.
TLDR: Kafka is a distributed event store. Unlike a traditional queue (RabbitMQ) where messages disappear after reading, Kafka stores them in a persistent Log. This allows multiple consumers to read the same data at their own pace, replay history, and handle massive scale via Partitions.
The Bookshelf Analogy: Kafka vs Traditional Queues
Imagine a Magazine Subscription Service.
- Producer (Publisher): Writes articles and drops them into a pile.
- Topic (The Magazine Category): A named channel, e.g., "Sports", "Tech", "user-clicks".
- Consumer (Subscriber): Reads the magazines.
| Traditional Queue (RabbitMQ) | Kafka |
| --- | --- |
| Publisher hands you a copy. You read it. It's gone. | Publisher places every issue on a numbered bookshelf. |
| If you were on vacation, you missed it. | You read issue #3. Your friend reads #1. Both are still there. |
| One consumer per message. | Many consumers, independent pacing. |
This bookshelf model is the foundational insight: Kafka never deletes a message just because it was read.
Topics, Partitions, and Offsets: The Core Vocabulary
Topics
A Topic is a named log. Think of it as a category: user-clicks, payment-events, driver-locations.
Partitions: The Unit of Scale
A topic is split into Partitions to enable parallel processing:
- Partition 0: Stores events for Users A-M.
- Partition 1: Stores events for Users N-Z.
```mermaid
flowchart LR
    subgraph T["Topic: user-clicks"]
        P0["Partition 0 (Users A-M)"]
        P1["Partition 1 (Users N-Z)"]
    end
    Producer -->|key hash| P0
    Producer -->|key hash| P1
    P0 --> Broker1[Broker 1]
    P1 --> Broker2[Broker 2]
```
Two brokers now handle reads and writes in parallel: throughput scales horizontally.
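How does the producer pick a partition? It hashes the record key and takes it modulo the partition count, so the same key always lands on the same partition. A minimal sketch of that idea (Kafka's actual default partitioner uses a murmur2 hash of the key bytes; Python's built-in `hash` stands in here purely for illustration):

```python
# Simplified illustration of key-based partition routing.
# Real Kafka: partition = murmur2(key_bytes) % num_partitions.
# Python's hash() is salted per process, so this is not stable across runs -
# it only demonstrates the "same key -> same partition" idea.
def choose_partition(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

p = choose_partition("user-123", 2)
print(f"user-123 always routes to partition {p} within this process")
```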
Offsets: Your Bookmark in the Log
Each message in a partition gets an Offset, a monotonically increasing integer (0, 1, 2, ...).
```
Partition 0: [Msg 0] [Msg 1] [Msg 2] [Msg 3] ...
                ^                ^
             Group A          Group B
          (at offset 0)    (at offset 2)
```
Group A and Group B read the same partition independently. Kafka just tracks each group's current offset.
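A toy in-memory model of this bookshelf idea (purely illustrative, nothing like Kafka's actual storage engine): one append-only list plus one bookmark per consumer group.

```python
# One partition as an append-only list; reads never remove anything.
partition_0 = ["Msg 0", "Msg 1", "Msg 2", "Msg 3"]

# Each consumer group keeps its own committed offset (its bookmark).
offsets = {"group-a": 0, "group-b": 2}

def poll(group: str, max_records: int = 2) -> list[str]:
    start = offsets[group]
    batch = partition_0[start:start + max_records]
    offsets[group] = start + len(batch)   # advance ("commit") this group's bookmark
    return batch

print(poll("group-a"))  # ['Msg 0', 'Msg 1'] - group-b's bookmark is untouched
print(poll("group-b"))  # ['Msg 2', 'Msg 3']
```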
How Kafka Stores and Delivers Events: End-to-End Flow
```mermaid
flowchart LR
    Producer -->|write event| Broker["Kafka Broker (Leader Partition)"]
    Broker -->|replicate| F1[Follower Broker 2]
    Broker -->|replicate| F2[Follower Broker 3]
    Broker -->|ack| Producer
    CG1["Consumer Group A (Analytics)"] -->|poll offset N| Broker
    CG2["Consumer Group B (Fraud Detection)"] -->|poll offset M| Broker
```
- Producer serializes the event (typically to Avro or JSON) and sends it to a topic partition determined by the partition key hash.
- Broker appends the event to the partition log on disk and replicates it to followers.
- Consumer Groups independently poll the broker and commit offsets. Group A and Group B each maintain their own offset; they never block each other.
This decoupling is what makes Kafka powerful: producers and consumers are fully independent.
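A minimal sketch of the producer side of this flow with the confluent-kafka client, assuming a broker on localhost and a hypothetical payment-events topic; the value is JSON-serialized and the key decides the partition:

```python
import json
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({'bootstrap.servers': 'localhost:9092'})

event = {"userId": "user-123", "amount": 42.0}    # illustrative payload
producer.produce(
    topic='payment-events',                       # hypothetical topic name
    key=event["userId"],                          # key hash selects the partition
    value=json.dumps(event).encode('utf-8'),      # serialize to JSON bytes
)
producer.flush()  # block until the broker acknowledges the write
```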
Consumer Groups: Parallel Reading Without Coordination
A Consumer Group is a team of consumers that collectively read a topic.
- Kafka assigns each partition to exactly one consumer in the group at a time.
- If a consumer crashes, Kafka reassigns its partition to another group member.
Example: 4 partitions, 4 consumers in the "Analytics" group:
| Partition | Consumer |
| --- | --- |
| 0 | Consumer A |
| 1 | Consumer B |
| 2 | Consumer C |
| 3 | Consumer D |
If Consumer D crashes, Consumer A picks up Partition 3 automatically.
Important: If you add a 5th consumer to a 4-partition topic, the 5th consumer sits idle. Kafka rule: one consumer per partition per group maximum. Add more partitions to scale beyond that.
Offset commit: Kafka tracks "Group=Analytics, Partition=0, Offset=5". On restart, the group resumes from offset 5, so no committed data is lost. With the default at-least-once semantics, duplicates are still possible after a crash; exactly-once requires idempotent producers and transactions.
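That bookmark can be queried directly. A sketch using confluent-kafka's `committed()` call; the group and topic follow the Analytics example above, and the broker address is assumed:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
})

# Where is the analytics group's bookmark on partition 0 of user-clicks?
tp = TopicPartition('user-clicks', 0)
committed = consumer.committed([tp], timeout=10)[0]
print(committed.offset)  # e.g. 5 -> on restart the group resumes at offset 5
                         # (a negative value means nothing committed yet)
consumer.close()
```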
Producer to Consumer: Full Delivery Sequence
```mermaid
sequenceDiagram
    participant P as Producer
    participant L as Leader Broker
    participant F1 as Follower Broker 2
    participant F2 as Follower Broker 3
    participant CG as Consumer Group
    P->>L: Produce event (acks=all)
    L->>F1: Replicate to follower
    L->>F2: Replicate to follower
    F1-->>L: ACK
    F2-->>L: ACK
    L-->>P: Offset confirmed
    CG->>L: Poll (group_id, partition, offset N)
    L-->>CG: Batch of messages
    CG->>L: Commit offset N+batch_size
```
This sequence diagram shows the full round trip: a producer writes with acks=all, the leader replicates to its followers, and a consumer group later polls and commits offsets. The replication handshake (the Replicate/ACK exchanges) is what makes Kafka durable: the leader only acknowledges the producer after all ISR members confirm the write. The offset commit at the end is Kafka's bookmark mechanism: if the consumer crashes after that point, it resumes from the committed offset, so already-committed messages are neither lost nor reprocessed.
Deep Dive: Log Compaction and Retention Policies
By default Kafka uses time-based retention (e.g., delete messages older than 7 days). This suits event streams where history has a time horizon.
Log compaction is an alternative: Kafka retains only the latest value per key. This is ideal for changelog topics (e.g., user-profile-updates) where you only need the most recent state per user ID.
| Policy | Keeps | Best for |
| --- | --- | --- |
| Time-based retention | All messages up to N days | Audit logs, analytics pipelines |
| Size-based retention | All messages up to N GB | Bounded storage environments |
| Log compaction | Latest value per key | State snapshots, CDC topics |
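Compaction is a per-topic setting (`cleanup.policy=compact`). A sketch of creating a compacted changelog topic with the confluent-kafka AdminClient; broker address, partition count, and replication factor are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})

# Keep only the latest value per key instead of deleting by age.
topic = NewTopic(
    'user-profile-updates',
    num_partitions=3,
    replication_factor=1,                    # use 3 in a real cluster
    config={'cleanup.policy': 'compact'},
)
admin.create_topics([topic])['user-profile-updates'].result()  # raises on failure
print("compacted topic created")
```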
Real-World Applications: Kafka in Production at LinkedIn, Uber, and Netflix
LinkedIn (Kafka's birthplace): 7 trillion messages per day across pipelines for feed ranking, notifications, and metrics collection.
Uber: The driver-locations topic ingests GPS pings every 5 seconds. Three independent consumer groups read the same stream:
- ETA Service: calculate arrival times.
- Audit Service: store history for billing.
- Fraud Service: detect teleporting drivers.
Zero coordination between the three. No data duplication at the source.
Netflix: Chaos event streams flow through Kafka so multiple observability systems can react independently without coupling.
Deep Dive: How Kafka Achieves Durability and Exactly-Once Delivery
Replication and the ISR
Every partition has one leader and zero or more followers. Producers always write to the leader; followers pull from the leader and stay in sync.
Kafka tracks which followers are sufficiently caught up in the In-Sync Replicas (ISR) list. A follower is removed from the ISR if it falls too far behind (controlled by replica.lag.time.max.ms; 30 seconds by default in recent Kafka versions).
```mermaid
flowchart LR
    P[Producer] -->|acks=all| L["Partition Leader (Broker 1)"]
    L -->|replicate| F1["Follower (Broker 2, ISR)"]
    L -->|replicate| F2["Follower (Broker 3, ISR)"]
    L -->|ack to producer| P
```
With acks=all, the leader waits for all ISR members to acknowledge the write before confirming to the producer. This guarantees no data loss even if the leader crashes immediately after.
| Producer Ack Setting | Durability | Throughput |
| --- | --- | --- |
| acks=0 | None (fire and forget) | Maximum |
| acks=1 | Leader acknowledges | Medium |
| acks=all | All ISR members acknowledge | Lowest (safest) |
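In client code this is a single producer setting; a minimal sketch with confluent-kafka, broker address assumed:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'acks': 'all',   # '0' = fire and forget, '1' = leader only, 'all' = full ISR
})
```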
Exactly-Once Semantics
Kafka supports three delivery guarantees:
- At-most-once: Producer sends, no retry on failure. Messages can be lost.
- At-least-once: Producer retries on timeout. Messages can be duplicated.
- Exactly-once: Achieved via idempotent producers (each message has a sequence number, broker deduplicates retries) combined with transactions for atomic multi-partition writes.
Idempotent producers are enabled with `enable.idempotence=true`. Transactions wrap a batch of produces plus the consumer offset commit into a single atomic operation: either all succeed or none do. Kafka Streams uses this to build stateful exactly-once pipelines.
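A sketch of an idempotent, transactional producer with confluent-kafka; the transactional.id and the two topic names are illustrative, not prescribed by Kafka:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,             # broker deduplicates producer retries
    'transactional.id': 'payments-tx-1',    # required for transactions
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce('payment-events', key='user-123', value='charge:42.00')
    producer.produce('audit-events', key='user-123', value='charged')
    producer.commit_transaction()           # both writes become visible atomically
except Exception:
    producer.abort_transaction()            # neither write becomes visible
    raise
```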
Partition Leadership and Broker Failure
When a broker holding a partition leader fails, Kafka's Controller (itself a broker, elected via ZooKeeper or KRaft) detects the failure through a heartbeat timeout and promotes the first ISR follower to leader. Clients automatically receive new metadata and reconnect. Recovery typically takes seconds.
KRaft: Removing the ZooKeeper Dependency
Before Kafka 3.x, ZooKeeper managed broker metadata and leader election, an operational burden. KRaft (Kafka Raft Metadata) replaces ZooKeeper with a Raft-based metadata quorum built into Kafka itself. Benefits: fewer moving parts, faster controller failover (sub-second vs. 30-60 seconds), and cluster sizes up to 1 million partitions.
Partition Distribution Across Brokers
```mermaid
flowchart LR
    T["Topic: payments (3 partitions)"] --> P0["Partition 0 - Leader: Broker 1"]
    T --> P1["Partition 1 - Leader: Broker 2"]
    T --> P2["Partition 2 - Leader: Broker 3"]
    P0 --> R01[Replica on Broker 2]
    P0 --> R02[Replica on Broker 3]
    P1 --> R11[Replica on Broker 1]
    P2 --> R21[Replica on Broker 1]
```
This flowchart shows how a single topic's three partitions are spread across three brokers, with each partition's leader on a different broker and replicas distributed across the remaining two. This even distribution is the foundation of Kafka's horizontal scalability: each broker handles roughly one-third of both producer writes and consumer reads. The replica layout also illustrates fault tolerance: if Broker 1 fails, Partition 0's replica on Broker 2 or 3 can be promoted to leader, and the cluster continues operating without data loss.
Trade-Offs & Failure Modes: Kafka vs Alternatives
| Dimension | Kafka | RabbitMQ | AWS SQS |
| --- | --- | --- | --- |
| Message retention | Time/size-based (default 7 days) | Deleted on ack | Deleted on ack (max 14 days) |
| Multiple consumers | Native fan-out via consumer groups | Requires exchange + queue duplication | Separate queues per consumer |
| Ordering guarantee | Per partition | Per queue (with single consumer) | FIFO queues only |
| Throughput | Millions of msgs/sec | Tens of thousands/sec | Managed, auto-scales |
| Operational complexity | High (broker tuning, partition sizing) | Medium | Low (fully managed) |
| Replay | Built-in via offset reset | Not supported | Not supported |
Hot partition warning: If all producers use the same partition key (e.g., userId=null), all traffic lands on one partition. Monitor partition lag per consumer group and redesign key strategy before it becomes a production incident.
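A rough per-partition lag check, comparing the group's committed offset against the partition's latest offset (the high watermark); the group, topic, and two-partition layout reuse earlier examples, and the broker address is assumed:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
})

for partition in (0, 1):
    tp = TopicPartition('user-clicks', partition)
    committed = consumer.committed([tp], timeout=10)[0].offset
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    lag = high - max(committed, low)  # max() guards against a never-committed group
    print(f"partition={partition} lag={lag}")

consumer.close()
```

A lag that only ever grows on one partition while the others stay near zero is the hot-partition signature described above.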
Decision Guide: When to Use Kafka
Choose Kafka when:
- Multiple independent services need to react to the same event (fan-out).
- You need to replay events: audit, re-processing after a bug fix, feeding a new ML model.
- Throughput exceeds what a traditional message broker handles (>100K msgs/sec).
- You're building a streaming pipeline (Kafka Streams, Flink reading from Kafka).
- Change Data Capture (CDC): forwarding database mutations to downstream consumers.
Skip Kafka when:
- You have a simple background job queue with one consumer: use SQS, BullMQ, or Sidekiq.
- You need request/response semantics: use gRPC or REST.
- Sub-millisecond latency is required: use an in-memory queue or NATS.
- Your team is small and doesn't have the bandwidth to operate a Kafka cluster: use a managed broker (Confluent Cloud, Amazon MSK).
| Signal | Recommendation |
| --- | --- |
| Fan-out to 3+ services | Use Kafka |
| Need replay for audits/ML | Use Kafka |
| Single background worker | Use SQS / BullMQ |
| Latency < 1 ms required | Use NATS / in-memory queue |
| No infra team, small scale | Use a managed broker or RabbitMQ |
Practical Setup: Kafka Producer and Consumer in Python
Here is a minimal Kafka producer and consumer using the confluent-kafka library:
```python
# producer.py
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')

producer.produce('user-clicks', key='user-123', value='page_view', callback=delivery_report)
producer.flush()
```

```python
# consumer.py
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['user-clicks'])

while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        print(f"Key: {msg.key()}, Value: {msg.value()}, Offset: {msg.offset()}")
```
Key configuration choices:
- `group.id`: all instances with the same group ID share partition assignments. Scale by adding instances.
- `auto.offset.reset=earliest`: on first run, read from the beginning of the topic.
- `enable.auto.commit=True` (default): Kafka commits offsets automatically. Set it to `False` and call `consumer.commit()` manually after processing so a crash cannot skip messages; full exactly-once additionally requires idempotent producers or transactions (see the sketch below).
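A sketch of the manual-commit variant: auto-commit disabled and the offset committed only after the message has been handled (process() is a hypothetical handler, not part of the library):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,    # we decide when the bookmark moves
})
consumer.subscribe(['user-clicks'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())            # hypothetical business logic
    consumer.commit(message=msg, asynchronous=False)  # commit only after success
```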
Spring for Apache Kafka: KafkaTemplate and @KafkaListener in Spring Boot
Spring for Apache Kafka (spring-kafka) is the standard integration library for Kafka in Spring Boot applications. KafkaTemplate<K, V> handles serialization and partition key routing for producers; @KafkaListener turns any Spring bean method into a consumer with offset management, consumer group registration, and error handling wired automatically, with no Kafka client boilerplate needed.
```java
// build.gradle: org.springframework.kafka:spring-kafka:3.2.0

// --- Producer: KafkaTemplate -----------------------------------------------
@Service
public class UserEventProducer {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public UserEventProducer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void sendClickEvent(String userId, String page) {
        // Key = userId -> same user's events always land on the same partition.
        // This guarantees per-user ordering (see the Offsets section above).
        kafkaTemplate.send("user-clicks", userId,
                "{\"userId\":\"%s\",\"page\":\"%s\"}".formatted(userId, page))
            .whenComplete((result, ex) -> {
                if (ex == null) {
                    System.out.printf("Delivered -> partition=%d offset=%d%n",
                            result.getRecordMetadata().partition(),
                            result.getRecordMetadata().offset());
                } else {
                    System.err.println("Delivery failed: " + ex.getMessage());
                }
            });
    }
}

// --- Consumer: @KafkaListener ----------------------------------------------
@Component
public class AnalyticsConsumer {

    // concurrency = "3" -> 3 consumer threads, each assigned one partition
    @KafkaListener(
            topics = "user-clicks",
            groupId = "analytics-group",
            concurrency = "3"
    )
    public void handleClickEvent(
            @Payload String payload,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset) {
        System.out.printf("[part=%d off=%d] %s%n", partition, offset, payload);
        // Offset is auto-committed after this method returns (at-least-once)
    }
}
```
```yaml
# application.yml - minimal Spring Kafka configuration
spring:
  kafka:
    bootstrap-servers: localhost:9092
    consumer:
      group-id: analytics-group
      auto-offset-reset: earliest   # on first run, read from the beginning
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
      acks: all      # wait for all ISR members - strongest durability guarantee
      retries: 3
```
The `groupId = "analytics-group"` in `@KafkaListener` maps directly to the consumer group concept from this post. Multiple instances of the same Spring Boot service share the group ID, and Spring Kafka handles the partition rebalance on scale-out or crash automatically.
For a full deep-dive on Spring for Apache Kafka including transactions, Schema Registry integration, and dead-letter topic error handling, a dedicated follow-up post is planned.
Lessons Learned: Common Kafka Pitfalls
1. Under-partitioned topics are permanent pain. You can add partitions later, but it breaks partition-key-based ordering. Plan capacity early: production teams commonly start at 3-12 partitions per topic for 1-10 MB/s throughput.
2. Consumer lag is your primary health signal. A growing lag means consumers can't keep up with producers. Use Kafka's kafka-consumer-groups.sh or Prometheus kafka_consumer_group_lag to alert on it.
3. Don't store huge payloads in Kafka. Kafka is optimized for small events (<1 MB). For large blobs (images, videos), store the blob in S3 and put only the reference URL in Kafka.
4. Schema drift breaks consumers silently. Use the Confluent Schema Registry with Avro or Protobuf. It enforces schema compatibility so a producer can't break consumers by adding required fields.
5. Test consumer rebalance behavior. When a consumer group adds or removes a member, Kafka triggers a rebalance. During a rebalance all consumers stop processing. Ensure your application handles the `on_revoke` and `on_assign` callbacks gracefully, as sketched below.
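A sketch of wiring those callbacks with confluent-kafka; the handler bodies are illustrative, and the broker address is assumed:

```python
from confluent_kafka import Consumer

def on_assign(consumer, partitions):
    # Called after a rebalance hands partitions to this instance.
    print("assigned:", [p.partition for p in partitions])

def on_revoke(consumer, partitions):
    # Called before partitions are taken away: flush buffers and
    # commit any processed-but-uncommitted offsets here.
    print("revoked:", [p.partition for p in partitions])

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
})
consumer.subscribe(['user-clicks'], on_assign=on_assign, on_revoke=on_revoke)
```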
TLDR: Summary & Key Takeaways
- Log-based: Messages are stored on disk and retained (default: 7 days). Multiple consumers can re-read.
- Topics + Partitions: The unit of throughput. More partitions = more parallel consumers.
- Offsets + Consumer Groups: Durable bookmarks enabling crash-safe resume and independent read pacing.
- Log compaction: Alternative retention keeping only the latest value per key; great for change data capture.
Related Posts
- System Design: Message Queues and Event-Driven Architecture
- System Design: Caching and Asynchronism
- How Kubernetes Works
- System Design: Replication and Failover
