How Kafka Works: The Log That Never Forgets
Kafka is not just a message queue; it's a distributed streaming platform. We explain Topics, Partitions, and Offsets.
TLDR: Kafka is a distributed event store. Unlike a traditional queue (RabbitMQ) where messages disappear after reading, Kafka stores them in a persistent Log. This allows multiple consumers to read the same data at their own pace, replay history, and handle massive scale via Partitions.
The Bookshelf Analogy: Kafka vs Traditional Queues
Imagine a Magazine Subscription Service.
- Producer (Publisher): Writes articles and drops them into a pile.
- Topic (The Magazine Category): A named channel, e.g., "Sports", "Tech", "user-clicks".
- Consumer (Subscriber): Reads the magazines.
| Traditional Queue (RabbitMQ) | Kafka |
| --- | --- |
| Publisher hands you a copy. You read it. It's gone. | Publisher places every issue on a numbered bookshelf. |
| If you were on vacation, you missed it. | You read issue #3. Your friend reads #1. Both are still there. |
| One consumer per message. | Many consumers, independent pacing. |
This bookshelf model is the foundational insight: Kafka never deletes a message just because it was read.
Topics, Partitions, and Offsets: The Core Vocabulary
Topics
A Topic is a named log. Think of it as a category: user-clicks, payment-events, driver-locations.
Partitions: The Unit of Scale
A topic is split into Partitions to enable parallel processing:
- Partition 0: Stores events for Users A-M.
- Partition 1: Stores events for Users N-Z.
```mermaid
flowchart LR
    subgraph T["Topic: user-clicks"]
        P0["Partition 0 (Users A-M)"]
        P1["Partition 1 (Users N-Z)"]
    end
    Producer -->|key hash| P0
    Producer -->|key hash| P1
    P0 --> Broker1[Broker 1]
    P1 --> Broker2[Broker 2]
```
Two brokers now handle reads and writes in parallel: throughput scales horizontally.
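How does the producer pick a partition? It hashes the record key and takes it modulo the partition count, so the same key always lands on the same partition. A minimal sketch of that idea (Kafka's actual default partitioner uses a murmur2 hash of the key bytes; Python's built-in `hash` stands in here purely for illustration):

```python
# Simplified illustration of key-based partition routing.
# Real Kafka: partition = murmur2(key_bytes) % num_partitions.
# Python's hash() is salted per process, so this is not stable across runs -
# it only demonstrates the "same key -> same partition" idea.
def choose_partition(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

p = choose_partition("user-123", 2)
print(f"user-123 always routes to partition {p} within this process")
```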
Offsets: Your Bookmark in the Log
Each message in a partition gets an Offset, a monotonically increasing integer (0, 1, 2, ...).
```
Partition 0: [Msg 0] [Msg 1] [Msg 2] [Msg 3] ...
                ^                ^
             Group A          Group B
          (at offset 0)    (at offset 2)
```
Group A and Group B read the same partition independently. Kafka just tracks each group's current offset.
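A toy in-memory model of this bookshelf idea (purely illustrative, nothing like Kafka's actual storage engine): one append-only list plus one bookmark per consumer group.

```python
# One partition as an append-only list; reads never remove anything.
partition_0 = ["Msg 0", "Msg 1", "Msg 2", "Msg 3"]

# Each consumer group keeps its own committed offset (its bookmark).
offsets = {"group-a": 0, "group-b": 2}

def poll(group: str, max_records: int = 2) -> list[str]:
    start = offsets[group]
    batch = partition_0[start:start + max_records]
    offsets[group] = start + len(batch)   # advance ("commit") this group's bookmark
    return batch

print(poll("group-a"))  # ['Msg 0', 'Msg 1'] - group-b's bookmark is untouched
print(poll("group-b"))  # ['Msg 2', 'Msg 3']
```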
How Kafka Stores and Delivers Events: End-to-End Flow
```mermaid
flowchart LR
    Producer -->|write event| Broker["Kafka Broker (Leader Partition)"]
    Broker -->|replicate| F1[Follower Broker 2]
    Broker -->|replicate| F2[Follower Broker 3]
    Broker -->|ack| Producer
    CG1["Consumer Group A (Analytics)"] -->|poll offset N| Broker
    CG2["Consumer Group B (Fraud Detection)"] -->|poll offset M| Broker
```
- Producer serializes the event (typically to Avro or JSON) and sends it to a topic partition determined by the partition key hash.
- Broker appends the event to the partition log on disk and replicates it to followers.
- Consumer Groups independently poll the broker and commit offsets. Group A and Group B each maintain their own offset; they never block each other.
This decoupling is what makes Kafka powerful: producers and consumers are fully independent.
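A minimal sketch of the producer side of this flow with the confluent-kafka client, assuming a broker on localhost and a hypothetical payment-events topic; the value is JSON-serialized and the key decides the partition:

```python
import json
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({'bootstrap.servers': 'localhost:9092'})

event = {"userId": "user-123", "amount": 42.0}    # illustrative payload
producer.produce(
    topic='payment-events',                       # hypothetical topic name
    key=event["userId"],                          # key hash selects the partition
    value=json.dumps(event).encode('utf-8'),      # serialize to JSON bytes
)
producer.flush()  # block until the broker acknowledges the write
```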
Consumer Groups: Parallel Reading Without Coordination
A Consumer Group is a team of consumers that collectively read a topic.
- Kafka assigns each partition to exactly one consumer in the group at a time.
- If a consumer crashes, Kafka reassigns its partition to another group member.
Example: 4 partitions, 4 consumers in the "Analytics" group:
| Partition | Consumer |
| --- | --- |
| 0 | Consumer A |
| 1 | Consumer B |
| 2 | Consumer C |
| 3 | Consumer D |
If Consumer D crashes, Consumer A picks up Partition 3 automatically.
Important: If you add a 5th consumer to a 4-partition topic, the 5th consumer sits idle. Kafka rule: one consumer per partition per group maximum. Add more partitions to scale beyond that.
Offset commit: Kafka tracks "Group=Analytics, Partition=0, Offset=5". On restart, the group resumes from offset 5, so no committed data is lost. With the default at-least-once semantics, duplicates are still possible after a crash; exactly-once requires idempotent producers and transactions.
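That bookmark can be queried directly. A sketch using confluent-kafka's `committed()` call; the group and topic follow the Analytics example above, and the broker address is assumed:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
})

# Where is the analytics group's bookmark on partition 0 of user-clicks?
tp = TopicPartition('user-clicks', 0)
committed = consumer.committed([tp], timeout=10)[0]
print(committed.offset)  # e.g. 5 -> on restart the group resumes at offset 5
                         # (a negative value means nothing committed yet)
consumer.close()
```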
Producer to Consumer: Full Delivery Sequence
```mermaid
sequenceDiagram
    participant P as Producer
    participant L as Leader Broker
    participant F1 as Follower Broker 2
    participant F2 as Follower Broker 3
    participant CG as Consumer Group
    P->>L: Produce event (acks=all)
    L->>F1: Replicate to follower
    L->>F2: Replicate to follower
    F1-->>L: ACK
    F2-->>L: ACK
    L-->>P: Offset confirmed
    CG->>L: Poll (group_id, partition, offset N)
    L-->>CG: Batch of messages
    CG->>L: Commit offset N+batch_size
```
This sequence diagram shows the full round trip: a producer writes with acks=all, the leader replicates to its followers, and a consumer group later polls and commits offsets. The replication handshake (the Replicate/ACK exchanges) is what makes Kafka durable: the leader only acknowledges the producer after all ISR members confirm the write. The offset commit at the end is Kafka's bookmark mechanism: if the consumer crashes after that point, it resumes from the committed offset, so already-committed messages are neither lost nor reprocessed.
Deep Dive: Log Compaction and Retention Policies
By default Kafka uses time-based retention (e.g., delete messages older than 7 days). This suits event streams where history has a time horizon.
Log compaction is an alternative: Kafka retains only the latest value per key. This is ideal for changelog topics (e.g., user-profile-updates) where you only need the most recent state per user ID.
| Policy | Keeps | Best for |
| --- | --- | --- |
| Time-based retention | All messages up to N days | Audit logs, analytics pipelines |
| Size-based retention | All messages up to N GB | Bounded storage environments |
| Log compaction | Latest value per key | State snapshots, CDC topics |
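Compaction is a per-topic setting (`cleanup.policy=compact`). A sketch of creating a compacted changelog topic with the confluent-kafka AdminClient; broker address, partition count, and replication factor are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})

# Keep only the latest value per key instead of deleting by age.
topic = NewTopic(
    'user-profile-updates',
    num_partitions=3,
    replication_factor=1,                    # use 3 in a real cluster
    config={'cleanup.policy': 'compact'},
)
admin.create_topics([topic])['user-profile-updates'].result()  # raises on failure
print("compacted topic created")
```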
Real-World Applications: Kafka in Production at LinkedIn, Uber, and Netflix
LinkedIn (Kafka's birthplace): 7 trillion messages per day across pipelines for feed ranking, notifications, and metrics collection.
Uber: The driver-locations topic ingests GPS pings every 5 seconds. Three independent consumer groups read the same stream:
- ETA Service: calculate arrival times.
- Audit Service: store history for billing.
- Fraud Service: detect teleporting drivers.
Zero coordination between the three. No data duplication at the source.
Netflix: Chaos event streams flow through Kafka so multiple observability systems can react independently without coupling.
Deep Dive: How Kafka Achieves Durability and Exactly-Once Delivery
Replication and the ISR
Every partition has one leader and zero or more followers. Producers always write to the leader; followers pull from the leader and stay in sync.
Kafka tracks which followers are sufficiently caught up in the In-Sync Replicas (ISR) list. A follower is removed from the ISR if it falls too far behind (controlled by replica.lag.time.max.ms; 30 seconds by default in recent Kafka versions).
```mermaid
flowchart LR
    P[Producer] -->|acks=all| L["Partition Leader (Broker 1)"]
    L -->|replicate| F1["Follower (Broker 2, ISR)"]
    L -->|replicate| F2["Follower (Broker 3, ISR)"]
    L -->|ack to producer| P
```
With acks=all, the leader waits for all ISR members to acknowledge the write before confirming to the producer. This guarantees no data loss even if the leader crashes immediately after.
| Producer Ack Setting | Durability | Throughput |
| --- | --- | --- |
| acks=0 | None (fire and forget) | Maximum |
| acks=1 | Leader acknowledges | Medium |
| acks=all | All ISR members acknowledge | Lowest (safest) |
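In client code this is a single producer setting; a minimal sketch with confluent-kafka, broker address assumed:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'acks': 'all',   # '0' = fire and forget, '1' = leader only, 'all' = full ISR
})
```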
Exactly-Once Semantics
Kafka supports three delivery guarantees:
- At-most-once: Producer sends, no retry on failure. Messages can be lost.
- At-least-once: Producer retries on timeout. Messages can be duplicated.
- Exactly-once: Achieved via idempotent producers (each message has a sequence number, broker deduplicates retries) combined with transactions for atomic multi-partition writes.
Idempotent producers are enabled with `enable.idempotence=true`. Transactions wrap a batch of produces plus the consumer offset commit into a single atomic operation: either all succeed or none do. Kafka Streams uses this to build stateful exactly-once pipelines.
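A sketch of an idempotent, transactional producer with confluent-kafka; the transactional.id and the two topic names are illustrative, not prescribed by Kafka:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,             # broker deduplicates producer retries
    'transactional.id': 'payments-tx-1',    # required for transactions
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce('payment-events', key='user-123', value='charge:42.00')
    producer.produce('audit-events', key='user-123', value='charged')
    producer.commit_transaction()           # both writes become visible atomically
except Exception:
    producer.abort_transaction()            # neither write becomes visible
    raise
```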
Partition Leadership and Broker Failure
When a broker holding a partition leader fails, Kafka's Controller (itself a broker, elected via ZooKeeper or KRaft) detects the failure through a heartbeat timeout and promotes the first ISR follower to leader. Clients automatically receive new metadata and reconnect. Recovery typically takes seconds.
KRaft: Removing the ZooKeeper Dependency
Before Kafka 3.x, ZooKeeper managed broker metadata and leader election, an operational burden. KRaft (Kafka Raft Metadata) replaces ZooKeeper with a Raft-based metadata quorum built into Kafka itself. Benefits: fewer moving parts, faster controller failover (sub-second vs. 30-60 seconds), and cluster sizes up to 1 million partitions.
Partition Distribution Across Brokers
```mermaid
flowchart LR
    T["Topic: payments (3 partitions)"] --> P0["Partition 0 - Leader: Broker 1"]
    T --> P1["Partition 1 - Leader: Broker 2"]
    T --> P2["Partition 2 - Leader: Broker 3"]
    P0 --> R01[Replica on Broker 2]
    P0 --> R02[Replica on Broker 3]
    P1 --> R11[Replica on Broker 1]
    P2 --> R21[Replica on Broker 1]
```
This flowchart shows how a single topic's three partitions are spread across three brokers, with each partition's leader on a different broker and replicas distributed across the remaining two. This even distribution is the foundation of Kafka's horizontal scalability: each broker handles roughly one-third of both producer writes and consumer reads. The replica layout also illustrates fault tolerance: if Broker 1 fails, Partition 0's replica on Broker 2 or 3 can be promoted to leader, and the cluster continues operating without data loss.
Trade-Offs & Failure Modes: Kafka vs Alternatives
| Dimension | Kafka | RabbitMQ | AWS SQS |
| --- | --- | --- | --- |
| Message retention | Time/size-based (default 7 days) | Deleted on ack | Deleted on ack (max 14 days) |
| Multiple consumers | Native fan-out via consumer groups | Requires exchange + queue duplication | Separate queues per consumer |
| Ordering guarantee | Per partition | Per queue (with single consumer) | FIFO queues only |
| Throughput | Millions of msgs/sec | Tens of thousands/sec | Managed, auto-scales |
| Operational complexity | High (broker tuning, partition sizing) | Medium | Low (fully managed) |
| Replay | Built-in via offset reset | Not supported | Not supported |
Hot partition warning: If all producers use the same partition key (e.g., userId=null), all traffic lands on one partition. Monitor partition lag per consumer group and redesign key strategy before it becomes a production incident.
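A rough per-partition lag check, comparing the group's committed offset against the partition's latest offset (the high watermark); the group, topic, and two-partition layout reuse earlier examples, and the broker address is assumed:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
})

for partition in (0, 1):
    tp = TopicPartition('user-clicks', partition)
    committed = consumer.committed([tp], timeout=10)[0].offset
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    lag = high - max(committed, low)  # max() guards against a never-committed group
    print(f"partition={partition} lag={lag}")

consumer.close()
```

A lag that only ever grows on one partition while the others stay near zero is the hot-partition signature described above.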
Decision Guide: When to Use Kafka
Choose Kafka when:
- Multiple independent services need to react to the same event (fan-out).
- You need to replay events: audit, re-processing after a bug fix, feeding a new ML model.
- Throughput exceeds what a traditional message broker handles (>100K msgs/sec).
- You're building a streaming pipeline (Kafka Streams, Flink reading from Kafka).
- Change Data Capture (CDC): forwarding database mutations to downstream consumers.
Skip Kafka when:
- You have a simple background job queue with one consumer: use SQS, BullMQ, or Sidekiq.
- You need request/response semantics: use gRPC or REST.
- Sub-millisecond latency is required: use an in-memory queue or NATS.
- Your team is small and doesn't have the bandwidth to operate a Kafka cluster: use a managed broker (Confluent Cloud, Amazon MSK).
| Signal | Recommendation |
| --- | --- |
| Fan-out to 3+ services | Use Kafka |
| Need replay for audits/ML | Use Kafka |
| Single background worker | Use SQS / BullMQ |
| Latency < 1 ms required | Use NATS / in-memory queue |
| No infra team, small scale | Use a managed broker or RabbitMQ |
Practical Setup: Kafka Producer and Consumer in Python
Here is a minimal Kafka producer and consumer using the confluent-kafka library:
```python
# producer.py
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')

producer.produce('user-clicks', key='user-123', value='page_view', callback=delivery_report)
producer.flush()
```

```python
# consumer.py
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['user-clicks'])

while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        print(f"Key: {msg.key()}, Value: {msg.value()}, Offset: {msg.offset()}")
```
Key configuration choices:
- `group.id`: all instances with the same group ID share partition assignments. Scale by adding instances.
- `auto.offset.reset=earliest`: on first run, read from the beginning of the topic.
- `enable.auto.commit=True` (default): Kafka commits offsets automatically. Set it to `False` and call `consumer.commit()` manually after processing so a crash cannot skip messages; full exactly-once additionally requires idempotent producers or transactions (see the sketch below).
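A sketch of the manual-commit variant: auto-commit disabled and the offset committed only after the message has been handled (process() is a hypothetical handler, not part of the library):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,    # we decide when the bookmark moves
})
consumer.subscribe(['user-clicks'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())            # hypothetical business logic
    consumer.commit(message=msg, asynchronous=False)  # commit only after success
```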
Spring for Apache Kafka: KafkaTemplate and @KafkaListener in Spring Boot
Spring for Apache Kafka (spring-kafka) is the standard integration library for Kafka in Spring Boot applications. KafkaTemplate<K, V> handles serialization and partition key routing for producers; @KafkaListener turns any Spring bean method into a consumer with offset management, consumer group registration, and error handling wired automatically, with no Kafka client boilerplate needed.
```java
// build.gradle: org.springframework.kafka:spring-kafka:3.2.0

// --- Producer: KafkaTemplate -----------------------------------------------
@Service
public class UserEventProducer {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public UserEventProducer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void sendClickEvent(String userId, String page) {
        // Key = userId -> same user's events always land on the same partition.
        // This guarantees per-user ordering (see the Offsets section above).
        kafkaTemplate.send("user-clicks", userId,
                "{\"userId\":\"%s\",\"page\":\"%s\"}".formatted(userId, page))
            .whenComplete((result, ex) -> {
                if (ex == null) {
                    System.out.printf("Delivered -> partition=%d offset=%d%n",
                            result.getRecordMetadata().partition(),
                            result.getRecordMetadata().offset());
                } else {
                    System.err.println("Delivery failed: " + ex.getMessage());
                }
            });
    }
}

// --- Consumer: @KafkaListener ----------------------------------------------
@Component
public class AnalyticsConsumer {

    // concurrency = "3" -> 3 consumer threads, each assigned one partition
    @KafkaListener(
            topics = "user-clicks",
            groupId = "analytics-group",
            concurrency = "3"
    )
    public void handleClickEvent(
            @Payload String payload,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset) {
        System.out.printf("[part=%d off=%d] %s%n", partition, offset, payload);
        // Offset is auto-committed after this method returns (at-least-once)
    }
}
```
```yaml
# application.yml - minimal Spring Kafka configuration
spring:
  kafka:
    bootstrap-servers: localhost:9092
    consumer:
      group-id: analytics-group
      auto-offset-reset: earliest   # on first run, read from the beginning
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
      acks: all      # wait for all ISR members - strongest durability guarantee
      retries: 3
```
The `groupId = "analytics-group"` in `@KafkaListener` maps directly to the consumer group concept from this post. Multiple instances of the same Spring Boot service share the group ID, and Spring Kafka handles the partition rebalance on scale-out or crash automatically.
For a full deep-dive on Spring for Apache Kafka including transactions, Schema Registry integration, and dead-letter topic error handling, a dedicated follow-up post is planned.
Lessons Learned: Common Kafka Pitfalls
1. Under-partitioned topics are permanent pain. You can add partitions later, but it breaks partition-key-based ordering. Plan capacity early: production teams commonly start at 3-12 partitions per topic for 1-10 MB/s throughput.
2. Consumer lag is your primary health signal. A growing lag means consumers can't keep up with producers. Use Kafka's kafka-consumer-groups.sh or Prometheus kafka_consumer_group_lag to alert on it.
3. Don't store huge payloads in Kafka. Kafka is optimized for small events (<1 MB). For large blobs (images, videos), store the blob in S3 and put only the reference URL in Kafka.
4. Schema drift breaks consumers silently. Use the Confluent Schema Registry with Avro or Protobuf. It enforces schema compatibility so a producer can't break consumers by adding required fields.
5. Test consumer rebalance behavior. When a consumer group adds or removes a member, Kafka triggers a rebalance. During a rebalance all consumers stop processing. Ensure your application handles the `on_revoke` and `on_assign` callbacks gracefully, as sketched below.
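A sketch of wiring those callbacks with confluent-kafka; the handler bodies are illustrative, and the broker address is assumed:

```python
from confluent_kafka import Consumer

def on_assign(consumer, partitions):
    # Called after a rebalance hands partitions to this instance.
    print("assigned:", [p.partition for p in partitions])

def on_revoke(consumer, partitions):
    # Called before partitions are taken away: flush buffers and
    # commit any processed-but-uncommitted offsets here.
    print("revoked:", [p.partition for p in partitions])

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-group',
})
consumer.subscribe(['user-clicks'], on_assign=on_assign, on_revoke=on_revoke)
```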
TLDR: Summary & Key Takeaways
- Log-based: Messages are stored on disk and retained (default: 7 days). Multiple consumers can re-read.
- Topics + Partitions: The unit of throughput. More partitions = more parallel consumers.
- Offsets + Consumer Groups: Durable bookmarks enabling crash-safe resume and independent read pacing.
- Log compaction: Alternative retention keeping only the latest value per key; great for change data capture.
Related Posts
- System Design: Message Queues and Event-Driven Architecture
- System Design: Caching and Asynchronism
- How Kubernetes Works
- System Design: Replication and Failover
