
A Guide to Raft, Paxos, and Consensus Algorithms

How do distributed databases agree on data? We explain the Leader Election and Log Replication mechanisms.

Abstract Algorithms · 13 min read

AI-assisted content.

TLDR

Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard — correct but notoriously hard to understand. Raft is the practical standard — designed for understandability, used in Kubernetes (etcd), Kafka (KRaft), and CockroachDB.


📖 The Restaurant Order Problem: Why Distributed Agreement Is Hard

Imagine a restaurant with three waiters and no central ticket system. A table orders "steak." Two waiters hear "steak," one hears "fish." They all run to the kitchen with different tickets.

This is the distributed systems problem: three servers receiving one write need to agree on the final value before confirming to the client. If one crashes mid-write, the others must not contradict each other.

Consensus algorithms solve this. They provide two fundamental guarantees:

  • Safety: no two nodes ever commit different values for the same slot — even during a crash or network partition.
  • Liveness: the cluster eventually makes progress as long as a majority of nodes are alive and can communicate.

Both Raft and Paxos satisfy these properties. Where they differ is in how readable and debuggable the protocol is for practising engineers.


🔍 Safety, Liveness, and Quorums: What Consensus Must Guarantee

The key rule that makes consensus possible is the quorum — a simple majority. The cluster only commits a value once a quorum acknowledges it. You can lose up to floor((N-1)/2) nodes and still write; lose one more and writes stall.

| Cluster size | Quorum (majority) | Max node failures tolerated |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |

Both Raft and Paxos rely on quorums. Neither sacrifices safety; they differ in liveness and protocol readability.
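The quorum arithmetic in the table can be sketched in a few lines of Java (the class and method names here are illustrative, not from any library):

```java
// Illustrative quorum arithmetic for majority-based consensus.
public class QuorumMath {

    // Majority quorum: strictly more than half the nodes must agree.
    static int quorum(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Failures tolerated while a quorum can still be formed.
    static int maxFailures(int clusterSize) {
        return clusterSize - quorum(clusterSize);  // = floor((N-1)/2)
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5, 7}) {
            System.out.printf("N=%d  quorum=%d  tolerates=%d%n",
                    n, quorum(n), maxFailures(n));
        }
    }
}
```

Note that an even cluster size buys nothing: N=4 tolerates the same single failure as N=3 but needs one more ACK per write.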


⚙️ Raft's Three Roles: Follower, Candidate, and Leader

Raft simplifies consensus by restricting who can write. At any moment every node is in exactly one state:

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : election timeout (no heartbeat)
    Candidate --> Leader : majority votes received
    Candidate --> Follower : higher term seen
    Leader --> Follower : higher term seen

| Role | Responsibility |
|---|---|
| Follower | Passive; accepts log entries and heartbeats from the leader |
| Candidate | Temporary state during elections; requests votes from peers |
| Leader | Handles all writes; sends periodic heartbeats (interval well below the election timeout) |

Only the leader accepts client writes. This single-writer design makes the protocol easy to reason about — at any given term there is at most one authoritative source of truth.
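The transitions in the state diagram above can be sketched as a tiny state machine — a minimal illustration under assumed names, not any library's API:

```java
// Minimal sketch of Raft's three roles and their transitions.
public class RaftRoles {

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static class Node {
        Role role = Role.FOLLOWER;  // every node starts as a follower
        long currentTerm = 0;

        // Election timeout fired with no heartbeat: stand for election.
        void onElectionTimeout() {
            if (role != Role.LEADER) {
                currentTerm++;
                role = Role.CANDIDATE;
            }
        }

        // Majority of votes received in the current term.
        void onMajorityVotes() {
            if (role == Role.CANDIDATE) role = Role.LEADER;
        }

        // Any message carrying a higher term demotes the node.
        void onHigherTermSeen(long term) {
            if (term > currentTerm) {
                currentTerm = term;
                role = Role.FOLLOWER;
            }
        }
    }

    public static void main(String[] args) {
        Node n = new Node();
        n.onElectionTimeout();   // Follower -> Candidate, term 1
        n.onMajorityVotes();     // Candidate -> Leader
        n.onHigherTermSeen(2);   // Leader -> Follower on seeing term 2
        System.out.println(n.role + " term=" + n.currentTerm);
    }
}
```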


🔢 Leader Election: Terms, Votes, and the Role of Randomness

Raft divides time into Terms — monotonically increasing integers. Think of them as electoral cycles. A new term begins whenever a follower suspects the leader has failed.

Election flow:

  1. A follower's election timer expires (no heartbeat received from a leader).
  2. It increments its term, transitions to candidate, votes for itself.
  3. It sends RequestVote RPCs to all other nodes.
  4. A node grants its vote if it has not already voted in this term, the candidate's term is at least as high, and the candidate's log is at least as up-to-date.
  5. First candidate to get (N/2 + 1) votes wins and becomes leader for that term.
Cluster: 5 nodes  →  quorum = 3 votes required
Term 7, Candidate A asks nodes B, C, D, E
B votes yes, C votes yes → A wins with 3/5 (including itself)
A sends heartbeats to all → everyone transitions to follower

Split votes (two candidates tie) are resolved by randomised timeouts — each follower waits a different random delay (e.g., 150–300 ms) before starting an election. The first to time out usually wins before others wake up, breaking any tie without a coordinator.
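The randomised timeout is one line of code. A sketch using the example bounds from the text (150–300 ms):

```java
import java.util.concurrent.ThreadLocalRandom;

// Each node independently draws its election timeout from [150, 300) ms,
// so a repeated tie between two candidates is unlikely.
public class ElectionTimeout {
    static final int MIN_MS = 150;
    static final int MAX_MS = 300;

    static int nextTimeoutMs() {
        return ThreadLocalRandom.current().nextInt(MIN_MS, MAX_MS);
    }

    public static void main(String[] args) {
        // Five nodes each pick an independent timeout; the smallest draw
        // becomes a candidate first and usually wins before the rest wake.
        for (int node = 1; node <= 5; node++) {
            System.out.printf("node %d: timeout=%d ms%n", node, nextTimeoutMs());
        }
    }
}
```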

📊 Raft Leader Election Sequence

sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3

    Note over F1: Election timeout fires
    F1->>F1: become Candidate, term++
    F1->>F2: RequestVote (term=8)
    F1->>F3: RequestVote (term=8)
    F2-->>F1: vote granted
    F3-->>F1: vote granted
    Note over F1: majority (3/3) reached, become Leader
    F1->>F2: AppendEntries heartbeat
    F1->>F3: AppendEntries heartbeat
    Note over F2,F3: remain Followers in the new term

🧠 Deep Dive: How Raft Replicates Logs to Followers

Once a leader is elected, client writes flow through the AppendEntries RPC. The sequence below is the core of Raft's safety guarantee: the leader only acknowledges a write to the client after a majority of nodes have persisted it.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Write X=5
    L->>F1: AppendEntries (X=5)
    L->>F2: AppendEntries (X=5)
    F1-->>L: ACK
    F2-->>L: ACK
    L->>L: Commit X=5
    L-->>C: OK

The leader only confirms to the client once a quorum holds the entry — the leader itself plus at least one of F1/F2 in this 3-node example. The write is durable even if the leader crashes immediately after replying.

The five-step write path:

  1. Client sends write to leader.
  2. Leader appends entry to its local log (as uncommitted).
  3. Leader broadcasts AppendEntries RPC to all followers in parallel.
  4. Once a majority ACK the entry, leader marks it committed and applies it to the state machine.
  5. Leader replies OK to client; followers learn of the commit on the next heartbeat.

If a follower crashes before ACKing, the leader retries indefinitely. If the leader crashes after committing, the new leader will still carry the entry — it won election precisely because it had the most up-to-date log.
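The commit rule in step 4 boils down to counting distinct nodes that hold the entry, starting from the leader itself. A sketch with hypothetical names:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative commit guard: an entry is committed once the leader plus
// enough follower ACKs form a majority of the cluster.
public class CommitGuard {
    final int clusterSize;
    final Set<String> holders = new HashSet<>();

    CommitGuard(int clusterSize, String leaderId) {
        this.clusterSize = clusterSize;
        holders.add(leaderId);  // the leader has the entry in its own log
    }

    // Record a follower's ACK; returns true once a quorum holds the entry.
    boolean onAck(String followerId) {
        holders.add(followerId);
        return holders.size() >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        CommitGuard entry = new CommitGuard(5, "leader");
        System.out.println(entry.onAck("f1"));  // 2 of 5 — not yet committed
        System.out.println(entry.onAck("f2"));  // 3 of 5 — quorum, committed
    }
}
```

Duplicate ACKs from a retrying follower are harmless here because the set deduplicates node IDs — only distinct holders count toward the quorum.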


📊 The Full Raft Write Path: From Client Request to Durable Commit

The sequence diagram shows the happy path. This flowchart adds the commit guard and the retry loop for lagging followers:

flowchart TD
    A[Client sends write to Leader] --> B[Leader appends to local log uncommitted]
    B --> C[Leader broadcasts AppendEntries to all followers in parallel]
    C --> D{Quorum ACKs received?}
    D -- No --> E[Wait and retry lagging followers]
    E --> D
    D -- Yes --> F[Leader marks entry committed applies to state machine]
    F --> G[Leader replies OK to client]
    G --> H[Followers apply on next AppendEntries heartbeat]

Writes are blocked at the commit guard until a majority respond — this is what makes Raft strongly consistent. A client that gets OK can be certain the value survived any single-node failure.


⚖️ Trade-offs & Failure Modes: Raft vs Paxos

| Dimension | Raft | Paxos |
|---|---|---|
| Designed for | Understandability | Formal proof of correctness |
| Leader model | Single strong leader | Flexible, multi-proposer |
| Complexity | Lower — one spec, one paper | Higher — many competing variants |
| Common variants | Multi-Raft (per-shard groups) | Multi-Paxos, Fast Paxos, Flexible Paxos |
| Used in production | etcd, CockroachDB, TiKV, Consul | Chubby (Google), ZooKeeper (Zab) |

Failure modes to know before you operate a Raft cluster:

  • Leader isolation: A leader cut off from the majority keeps receiving client writes but cannot commit them (quorum ACKs will never arrive). The majority partition elects a new leader in a higher term. When the old leader reconnects, it sees the higher term and steps down, discarding all its uncommitted log entries. No committed data is ever lost.
  • Follower lag: A slow or restarting follower doesn't block commits — the leader only needs a quorum. The lagging follower replays the log from its last known index on reconnect.
  • Split-brain prevention: becoming leader requires a majority vote, and each node grants at most one vote per term — so two nodes can never act as leader in the same term.

🧭 Decision Guide: When to Use Raft vs Paxos vs ZooKeeper

| Situation | Recommendation |
|---|---|
| Building a new distributed store or metadata service | Use Raft — mature libraries exist (etcd/raft, hashicorp/raft); easy to debug |
| Need a formal correctness proof for a safety-critical system | Consider Paxos — richer academic tooling for proving safety properties |
| Already running ZooKeeper for coordination | Evaluate etcd as a migration target; stay if the system is stable and the team knows ZooKeeper |
| Multi-partition database (per-shard replication) | Use Multi-Raft (CockroachDB / TiKV model) — one Raft group per 64 MB range |
| System tolerates eventual consistency | Skip consensus entirely — use last-write-wins or CRDTs to avoid the per-write latency cost |

The rule of thumb: reach for Raft when you need strong consistency and want a protocol you can actually read, trace, and debug in a postmortem.


🌍 Real-World Applications: etcd, Kafka KRaft, and CockroachDB

  • Kubernetes / etcd: Every kubectl command that changes cluster state is linearised through etcd's Raft log. A 3-node etcd quorum tolerates one node failure; production deployments typically run 5 nodes for resilience during rolling upgrades.
  • Kafka KRaft mode (3.x+): Replaces ZooKeeper with a built-in Raft-based metadata log. The controller quorum (typically 3 nodes) runs Raft; each broker is a follower for metadata changes. This eliminates the separate ZooKeeper cluster and cuts operational complexity.
  • CockroachDB / TiKV: Each 64 MB data range runs its own Raft group across 3 replicas. Node failures trigger elections only for affected ranges — the rest of the cluster keeps serving normally.
  • Consul: Distributed service registry uses Raft for consistent key-value state across server nodes.

Operational consequence: Writes stall until a quorum of nodes is reachable. Lose a majority of nodes and the cluster enters read-only mode until quorum is restored.


🧪 Walking Through a Kubernetes Control-Plane Write via etcd

When you run kubectl apply -f deployment.yaml, here is what happens at the consensus layer:

  1. The Kubernetes API server receives the request, validates it, and calls etcd.Put("/registry/deployments/default/my-app", serializedSpec).
  2. etcd's Raft leader appends the key-value write to its log (uncommitted).
  3. The leader broadcasts AppendEntries to the other two etcd nodes in the quorum.
  4. On receiving ACKs from both followers (quorum = 2 of 3), the leader commits the entry and applies it to the in-memory key-value state.
  5. etcd returns OK to the API server, which confirms to kubectl. Kubernetes controllers watching the key prefix see the change and begin reconciling — scheduling pods, updating replica sets.

If the etcd leader crashes between steps 3 and 5, the new leader replays the uncommitted entry (it won election because it had the most up-to-date log). The API server retries and gets OK once the new leader commits.
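The "most up-to-date log" tiebreak that decides such elections is a two-key comparison: last log term first, then last log index. A standalone sketch of that rule (names are illustrative):

```java
// Raft's "at least as up-to-date" comparison used when granting votes:
// a candidate's log wins on a higher last term, or, at equal terms,
// on an equal-or-longer log.
public class LogUpToDate {

    static boolean candidateLogOk(long candLastTerm, long candLastIndex,
                                  long myLastTerm, long myLastIndex) {
        if (candLastTerm != myLastTerm) {
            return candLastTerm > myLastTerm;  // newer term wins outright
        }
        return candLastIndex >= myLastIndex;   // same term: longer log wins
    }

    public static void main(String[] args) {
        // A higher last term beats a longer log from an older term.
        System.out.println(candidateLogOk(8, 10, 7, 50));  // true
        // Same term: the shorter log is less up-to-date, vote denied.
        System.out.println(candidateLogOk(7, 40, 7, 50));  // false
    }
}
```

This is why a leader that committed an entry can crash safely: any candidate missing that entry has a less up-to-date log and cannot gather a majority of votes.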


🛠️ Apache ZooKeeper, etcd, and Hazelcast: Raft Consensus in Java Production Systems

Apache ZooKeeper is the battle-tested coordination service behind Kafka (pre-3.x), Hadoop, and HBase — it implements Zab (ZooKeeper Atomic Broadcast), a Paxos variant. etcd is the Raft-based key-value store that powers Kubernetes. Hazelcast is a Java-native in-process distributed data grid whose CP subsystem implements Raft — the lowest-friction path to adding consensus to a Spring Boot application.

Hazelcast CP subsystem — distributed leader election and linearizable counters in Java:

// Maven: com.hazelcast:hazelcast:5.3.6
import com.hazelcast.config.*;
import com.hazelcast.core.*;
import com.hazelcast.cp.CPSubsystem;
import com.hazelcast.cp.lock.FencedLock;
import com.hazelcast.cp.IAtomicLong;

public class ConsensusDemo {

    public static void main(String[] args) {
        // Configure CP subsystem — minimum 3 members for Raft quorum.
        // Note: with CPMemberCount(3), CP operations block until three
        // cluster members have joined, so run three instances of this app.
        Config config = new Config();
        config.getCPSubsystemConfig().setCPMemberCount(3);  // 3-node Raft group

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        CPSubsystem cp = hz.getCPSubsystem();

        // 1. Distributed atomic counter — linearizable across all 3 members
        //    Every incrementAndGet() goes through a full Raft round-trip
        IAtomicLong requestCounter = cp.getAtomicLong("request-counter");
        long requestId = requestCounter.incrementAndGet();
        System.out.println("Assigned ID (globally unique): " + requestId);

        // 2. Distributed leader lock — only one node holds this lock at a time
        //    FencedLock uses Raft to ensure mutual exclusion even across network partitions
        FencedLock leaderLock = cp.getLock("scheduler-leader-lock");
        leaderLock.lock();
        try {
            System.out.println("I am the scheduler leader — running distributed cron job");
            // Only one node in the cluster runs this block simultaneously
        } finally {
            leaderLock.unlock();
        }
    }
}

etcd Java client (jetcd) — reading Kubernetes cluster state written via Raft:

// Maven: io.etcd:jetcd-core:0.7.7
import io.etcd.jetcd.Client;
import io.etcd.jetcd.ByteSequence;
import java.nio.charset.StandardCharsets;

Client etcd = Client.builder().endpoints("http://etcd:2379").build();

// Read the Kubernetes Deployment spec stored via Raft by the API server
ByteSequence key = ByteSequence.from(
    "/registry/deployments/default/my-app", StandardCharsets.UTF_8);

etcd.getKVClient().get(key).thenAccept(response ->
    response.getKvs().forEach(kv ->
        System.out.println("Deployment spec: " + kv.getValue().toString(StandardCharsets.UTF_8))
    )
).join();  // join() blocks until the read completes; drop it to stay async

Choosing between ZooKeeper, etcd, and Hazelcast:

| System | Protocol | Java integration | Best for |
|---|---|---|---|
| Apache ZooKeeper | Zab (Paxos variant) | Apache Curator client | Kafka metadata, HBase, legacy Hadoop |
| etcd | Raft | jetcd Java client | Kubernetes-native services, metadata stores |
| Hazelcast CP | Raft | In-process (same JVM) | Spring Boot services needing distributed locks/counters |

For new Spring Boot services, Hazelcast CP requires no external cluster to operate — the Raft group runs inside your application nodes. For Kubernetes-native services, jetcd reaches the existing etcd cluster directly.

For a full deep-dive on Hazelcast CP subsystem configuration, Apache Curator for ZooKeeper, and operating etcd clusters in production, a dedicated follow-up post is planned.


📚 Lessons from Running Consensus Clusters in Production

  • Always run an odd number of nodes. A 4-node cluster tolerates the same 1 failure as a 3-node cluster but requires 3 ACKs instead of 2 — higher write latency with no additional fault tolerance. Use 3 or 5.
  • Raft does not fix slow disks. Followers persist entries to disk before ACKing. An NVMe SSD versus a spinning disk can be the difference between 1 ms and 12 ms consensus round trips.
  • Span nodes across failure domains, not just machines. A 3-node cluster with 2 nodes in AZ-A and 1 in AZ-B loses quorum if AZ-A fails. Distribute 1 node per AZ across 3 zones to survive a full AZ outage.
  • Watch out for large Raft logs. etcd recommends keeping the total key-value space under 8 GB. Beyond that, snapshotting and log compaction slow recovery. Compact aggressively and monitor etcd_mvcc_db_total_size_in_bytes.

📌 TLDR: Summary & Key Takeaways

  • Consensus algorithms let a cluster agree on a single value even when nodes fail — the primitive behind distributed databases, service registries, and metadata stores.
  • Raft uses a single leader per term; all writes flow through it, making the protocol straightforward to reason about and debug.
  • Leader election uses term numbers, randomised timeouts, and majority votes to elect exactly one leader per term without a coordinator.
  • Log replication commits only after a quorum ACKs — clients never observe uncommitted writes.
  • Paxos is the academic foundation; Raft is the production standard. For new systems, start with an existing Raft library rather than implementing from scratch.
  • Always run an odd number of nodes (3 or 5) and spread them across independent failure domains.
