A Guide to Raft, Paxos, and Consensus Algorithms

How do distributed databases agree on data? We explain the Leader Election and Log Replication mechanisms that make it possible.


TLDR

Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard: correct but notoriously hard to understand. Raft is the practical standard: designed for understandability, and used in Kubernetes (etcd), Kafka (KRaft), and CockroachDB.


📖 The Restaurant Order Problem: Why Distributed Agreement Is Hard

Imagine a restaurant with three waiters and no central ticket system. A table orders "steak." Two waiters hear "steak," one hears "fish." They all run to the kitchen with different tickets.

This is the distributed systems problem: three servers receiving one write need to agree on the final value before confirming to the client. If one crashes mid-write, the others must not contradict each other.

Consensus algorithms solve this. They provide two fundamental guarantees:

  • Safety: no two nodes ever commit different values for the same slot, even during a crash or network partition.
  • Liveness: the cluster eventually makes progress as long as a majority of nodes are alive and can communicate.

Both Raft and Paxos satisfy these properties. Where they differ is in how readable and debuggable the protocol is for practising engineers.


๐Ÿ” Safety, Liveness, and Quorums: What Consensus Must Guarantee

The key rule that makes consensus possible is quorum: a simple majority. The cluster only commits a value once a quorum acknowledges it. You can lose floor(N/2) nodes and still write, but lose one more and writes stall.

Cluster Size | Quorum (majority) | Max node failures tolerated
3            | 2                 | 1
5            | 3                 | 2
7            | 4                 | 3

Both Raft and Paxos rely on quorums. Neither sacrifices safety; they differ in liveness and protocol readability.
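
If you prefer code to tables, here is a minimal Java sketch of the same quorum arithmetic. The class and method names are purely illustrative and not taken from any Raft library:

// Quorum arithmetic for an N-node cluster (illustrative helper, not from any library)
public final class QuorumMath {

    // Majority needed to commit a write or win an election
    static int quorumSize(int clusterSize) {
        return clusterSize / 2 + 1;                    // 3 -> 2, 5 -> 3, 7 -> 4
    }

    // Nodes that can fail while the cluster stays writable
    static int faultTolerance(int clusterSize) {
        return clusterSize - quorumSize(clusterSize);  // 3 -> 1, 5 -> 2, 7 -> 3
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("N=%d: quorum=%d, tolerates %d failures%n",
                    n, quorumSize(n), faultTolerance(n));
        }
    }
}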


โš™๏ธ Raft's Three Roles: Follower, Candidate, and Leader

Raft simplifies consensus by restricting who can write. At any moment every node is in exactly one state:

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : election timeout (no heartbeat)
    Candidate --> Leader : majority votes received
    Candidate --> Follower : higher term seen
    Leader --> Follower : higher term seen

Role      | Responsibility
Follower  | Passive; accepts log entries and heartbeats from the leader
Candidate | Temporary state during elections; requests votes from peers
Leader    | Handles all writes; sends heartbeats every ~150 ms

Only the leader accepts client writes. This single-writer design makes the protocol easy to reason about: at any given term there is at most one authoritative source of truth.
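
A tiny, hypothetical sketch of that single-writer rule: a follower refuses client writes and points at the node it currently believes is the leader. Real servers usually return a redirect hint or proxy the request rather than throw, and every name below is invented for illustration:

// Hypothetical single-writer guard (illustrative only)
class WriteGuard {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    Role role = Role.FOLLOWER;
    String leaderAddress = "node-2:2380";   // learned from the latest heartbeat

    void handleClientWrite(String key, String value) {
        if (role != Role.LEADER) {
            // Not authoritative: tell the client (or an internal proxy) who the leader is
            throw new IllegalStateException("Not the leader; retry against " + leaderAddress);
        }
        // Leader path: append to local log, replicate via AppendEntries, commit on quorum ACK
    }
}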


🔢 Leader Election: Terms, Votes, and the Role of Randomness

Raft divides time into Terms: monotonically increasing integers. Think of them as electoral cycles. A new term begins whenever a follower suspects the leader has failed.

Election flow:

  1. A follower's election timer expires (no heartbeat received from a leader).
  2. It increments its term, transitions to candidate, votes for itself.
  3. It sends RequestVote RPCs to all other nodes.
  4. A node grants its vote if the candidate's term is at least as high as its own and the candidate's log is at least as up-to-date as its own.
  5. The first candidate to collect floor(N/2) + 1 votes wins and becomes leader for that term.

Cluster: 5 nodes  →  quorum = 3 votes required
Term 7, Candidate A asks nodes B, C, D, E
B votes yes, C votes yes → A wins with 3/5 (including itself)
A sends heartbeats to all → the other nodes settle into the follower state

Split votes (two candidates tie) are resolved by randomised timeouts: each follower waits a different random delay (e.g., 150–300 ms) before starting an election. The first to time out usually wins before others wake up, breaking any tie without a coordinator.
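
The sketch below pulls those two mechanics together: the randomised election timer and the vote-granting rule from step 4. It is a simplified illustration with invented names; production implementations such as etcd/raft or hashicorp/raft structure this differently and add persistence and locking:

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of Raft election mechanics (illustrative, not a real library)
class ElectionSketch {

    long currentTerm = 7;
    Integer votedFor = null;        // candidate ID we voted for in currentTerm, if any
    long lastLogTerm = 3;           // term of our last log entry
    long lastLogIndex = 42;         // index of our last log entry

    // Randomised timeout (e.g. 150-300 ms) keeps followers from timing out together,
    // so split votes are rare and resolve quickly.
    long nextElectionTimeoutMillis() {
        return ThreadLocalRandom.current().nextLong(150, 300);
    }

    // A follower grants its vote only if the candidate's term is current and its log
    // is at least as up-to-date as ours (higher last term, or same term and >= index).
    boolean grantVote(int candidateId, long candTerm, long candLastLogTerm, long candLastLogIndex) {
        if (candTerm < currentTerm) return false;                         // stale candidate
        if (candTerm > currentTerm) { currentTerm = candTerm; votedFor = null; }
        boolean logUpToDate = candLastLogTerm > lastLogTerm
                || (candLastLogTerm == lastLogTerm && candLastLogIndex >= lastLogIndex);
        boolean alreadyVotedForOther = votedFor != null && votedFor != candidateId;
        if (logUpToDate && !alreadyVotedForOther) {
            votedFor = candidateId;
            return true;
        }
        return false;
    }
}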

📊 Raft Leader Election Sequence

sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3

    Note over F1: Election timeout fires
    F1->>F1: become Candidate, term++
    F1->>F2: RequestVote (term=8)
    F1->>F3: RequestVote (term=8)
    F2-->>F1: vote granted
    F3-->>F1: vote granted
    Note over F1: majority (3/3) → become Leader
    F1->>F2: AppendEntries heartbeat
    F1->>F3: AppendEntries heartbeat
    Note over F2,F3: revert to Follower

🧠 Deep Dive: How Raft Replicates Logs to Followers

Once a leader is elected, client writes flow through the AppendEntries RPC. The sequence below is the core of Raft's safety guarantee: the leader only acknowledges a write to the client after a majority of nodes have persisted it.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Write X=5
    L->>F1: AppendEntries (X=5)
    L->>F2: AppendEntries (X=5)
    F1-->>L: ACK
    F2-->>L: ACK
    L->>L: Commit X=5
    L-->>C: OK

The leader only confirms to the client after receiving ACKs from a quorum (F1 + F2). The write is durable even if the leader crashes immediately after replying.

The five-step write path:

  1. Client sends write to leader.
  2. Leader appends entry to its local log (as uncommitted).
  3. Leader broadcasts AppendEntries RPC to all followers in parallel.
  4. Once a majority ACK the entry, leader marks it committed and applies it to the state machine.
  5. Leader replies OK to client; followers learn of the commit on the next heartbeat.

If a follower crashes before ACKing, the leader retries indefinitely. If the leader crashes after committing, the new leader will still carry the entry: it won election precisely because it had the most up-to-date log.
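
Here is a hedged sketch of the commit decision in step 4: the leader counts its own copy plus follower ACKs and commits once that count reaches a majority. The names are invented; a real implementation also tracks per-follower match indexes and checks terms before counting an ACK:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the leader's commit rule for one log entry (illustrative only)
class CommitTracker {

    final int clusterSize;                   // e.g. 3 or 5
    final Set<Integer> acked = new HashSet<>();
    boolean committed = false;

    CommitTracker(int clusterSize, int leaderId) {
        this.clusterSize = clusterSize;
        this.acked.add(leaderId);            // the leader has already appended the entry locally
    }

    // Called when a follower acknowledges AppendEntries for this entry
    void onAck(int followerId) {
        acked.add(followerId);
        if (!committed && acked.size() >= clusterSize / 2 + 1) {
            committed = true;                // quorum reached: apply to state machine, reply OK to client
        }
    }
}

With a 3-node cluster, a single follower ACK (plus the leader's own copy) is enough to commit; the remaining follower learns of the commit on the next heartbeat.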


📊 The Full Raft Write Path: From Client Request to Durable Commit

The sequence diagram shows the happy path. This flowchart adds the commit guard and the retry loop for lagging followers:

flowchart TD
    A[Client sends write to Leader] --> B[Leader appends to local log\nuncommitted]
    B --> C[Leader broadcasts AppendEntries\nto all followers in parallel]
    C --> D{Quorum ACKs\nreceived?}
    D -- No --> E[Wait and retry\nlagging followers]
    E --> D
    D -- Yes --> F[Leader marks entry committed\napplies to state machine]
    F --> G[Leader replies OK to client]
    G --> H[Followers apply on next\nAppendEntries heartbeat]

Writes are blocked at the commit guard until a majority respond; this is what makes Raft strongly consistent. A client that gets OK can be certain the value survives any single-node failure.


โš–๏ธ Trade-offs & Failure Modes: Raft vs Paxos

Dimension          | Raft                             | Paxos
Designed for       | Understandability                | Formal proof of correctness
Leader model       | Single strong leader             | Flexible, multi-proposer
Complexity         | Lower: one spec, one paper       | Higher: many competing variants
Common variants    | Multi-Raft (per-shard groups)    | Multi-Paxos, Fast Paxos, Flexible Paxos
Used in production | etcd, CockroachDB, TiKV, Consul  | Chubby (Google), ZooKeeper (Zab)

Failure modes to know before you operate a Raft cluster:

  • Leader isolation: A leader cut off from the majority keeps receiving client writes but cannot commit them (quorum ACKs will never arrive). The majority partition elects a new leader in a higher term. When the old leader reconnects, it sees the higher term and steps down, discarding all its uncommitted log entries. No committed data is ever lost (see the step-down sketch after this list).
  • Follower lag: A slow or restarting follower doesn't block commits; the leader only needs a quorum. The lagging follower replays the log from its last known index on reconnect.
  • Split-brain prevention: Raft prevents two leaders in the same term by requiring a majority vote. You cannot have two nodes simultaneously acting as leader in the same term.
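
As promised above, here is a one-method sketch of the step-down rule behind the leader-isolation scenario. The names are hypothetical, but the rule itself (any RPC carrying a higher term demotes a leader) is standard Raft behaviour:

// Hypothetical sketch: an RPC carrying a higher term forces a leader back to follower
class StepDownSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    Role role = Role.LEADER;
    long currentTerm = 7;

    // Invoked on every incoming RPC (RequestVote or AppendEntries)
    void onRpcTerm(long rpcTerm) {
        if (rpcTerm > currentTerm) {
            currentTerm = rpcTerm;
            role = Role.FOLLOWER;   // step down; uncommitted entries from the old term
                                    // will be overwritten by the new leader's log
        }
    }
}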

🧭 Decision Guide: When to Use Raft vs Paxos vs ZooKeeper

Situation | Recommendation
Building a new distributed store or metadata service | Use Raft: mature libraries exist (etcd/raft, hashicorp/raft); easy to debug
Need a formal correctness proof for a safety-critical system | Consider Paxos: richer academic tooling for proving safety properties
Already running ZooKeeper for coordination | Evaluate etcd as a migration target; stay if the system is stable and the team knows ZooKeeper
Multi-partition database (per-shard replication) | Use Multi-Raft (CockroachDB / TiKV model): one Raft group per 64 MB range
System tolerates eventual consistency | Skip consensus entirely; use last-write-wins or CRDTs to avoid the per-write latency cost

The rule of thumb: reach for Raft when you need strong consistency and want a protocol you can actually read, trace, and debug in a postmortem.


๐ŸŒ Real-World Applications: etcd, Kafka KRaft, and CockroachDB

  • Kubernetes / etcd: Every kubectl command that changes cluster state is linearised through etcd's Raft log. A 3-node etcd quorum tolerates one node failure; production deployments typically run 5 nodes for resilience during rolling upgrades.
  • Kafka KRaft mode (3.x+): Replaces ZooKeeper with a built-in Raft-based metadata log. The controller quorum (typically 3 nodes) runs Raft; each broker is a follower for metadata changes. This eliminates the separate ZooKeeper cluster and cuts operational complexity.
  • CockroachDB / TiKV: Each 64 MB data range runs its own Raft group across 3 replicas. Node failures trigger elections only for affected ranges; the rest of the cluster keeps serving normally.
  • Consul: Distributed service registry uses Raft for consistent key-value state across server nodes.

Operational consequence: Writes stall until a quorum of nodes is reachable. Lose more than floor(N/2) nodes and the cluster enters read-only mode until quorum is restored.


🧪 Walking Through a Kubernetes Control-Plane Write via etcd

When you run kubectl apply -f deployment.yaml, here is what happens at the consensus layer:

  1. The Kubernetes API server receives the request, validates it, and calls etcd.Put("/registry/deployments/default/my-app", serializedSpec).
  2. etcd's Raft leader appends the key-value write to its log (uncommitted).
  3. The leader broadcasts AppendEntries to the other two etcd nodes in the quorum.
  4. On receiving ACKs from both followers (quorum = 2 of 3), the leader commits the entry and applies it to the in-memory key-value state.
  5. etcd returns OK to the API server, which confirms to kubectl. Kubernetes controllers watching the key prefix see the change and begin reconciling: scheduling pods, updating replica sets.

If the etcd leader crashes between steps 3 and 5, the new leader replays the uncommitted entry (it won election because it had the most up-to-date log). The API server retries and gets OK once the new leader commits.
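
For symmetry with the jetcd read example later in this post, here is a minimal sketch of the write in step 1 as it might look through jetcd. The endpoint and the JSON value are assumptions for illustration; the real API server performs this write with its own Go etcd client:

// Maven: io.etcd:jetcd-core:0.7.7  (illustrative write mirroring step 1 above)
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import java.nio.charset.StandardCharsets;

public class EtcdWriteDemo {
    public static void main(String[] args) throws Exception {
        Client etcd = Client.builder().endpoints("http://etcd:2379").build();

        ByteSequence key = ByteSequence.from(
                "/registry/deployments/default/my-app", StandardCharsets.UTF_8);
        ByteSequence value = ByteSequence.from(
                "{\"kind\":\"Deployment\",\"spec\":\"...\"}", StandardCharsets.UTF_8);  // placeholder spec

        // put() resolves only after the etcd leader has replicated the entry to a quorum
        // and committed it -- i.e. steps 2-4 of the write path above have completed.
        etcd.getKVClient().put(key, value).get();
        System.out.println("Committed via Raft");

        etcd.close();
    }
}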


๐Ÿ› ๏ธ Apache ZooKeeper, etcd, and Hazelcast: Raft Consensus in Java Production Systems

Apache ZooKeeper is the battle-tested coordination service behind Kafka (pre-3.x), Hadoop, and HBase; it implements Zab (ZooKeeper Atomic Broadcast), a Paxos variant. etcd is the Raft-based key-value store that powers Kubernetes. Hazelcast is a Java-native in-process distributed data grid whose CP subsystem implements Raft, the lowest-friction path to adding consensus to a Spring Boot application.

Hazelcast CP subsystem (distributed leader election and linearizable counters in Java):

// Maven: com.hazelcast:hazelcast:5.3.6
import com.hazelcast.config.*;
import com.hazelcast.core.*;
import com.hazelcast.cp.CPSubsystem;
import com.hazelcast.cp.lock.FencedLock;
import com.hazelcast.cp.IAtomicLong;

public class ConsensusDemo {

    public static void main(String[] args) {
        // Configure CP subsystem: minimum 3 members for Raft quorum
        Config config = new Config();
        config.getCPSubsystemConfig().setCPMemberCount(3);  // 3-node Raft group

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        CPSubsystem cp = hz.getCPSubsystem();

        // 1. Distributed atomic counter, linearizable across all 3 members
        //    Every incrementAndGet() goes through a full Raft round-trip
        IAtomicLong requestCounter = cp.getAtomicLong("request-counter");
        long requestId = requestCounter.incrementAndGet();
        System.out.println("Assigned ID (globally unique): " + requestId);

        // 2. Distributed leader lock: only one node holds this lock at a time
        //    FencedLock uses Raft to ensure mutual exclusion even across network partitions
        FencedLock leaderLock = cp.getLock("scheduler-leader-lock");
        leaderLock.lock();
        try {
            System.out.println("I am the scheduler leader โ€” running distributed cron job");
            // Only one node in the cluster runs this block simultaneously
        } finally {
            leaderLock.unlock();
        }
    }
}

etcd Java client (jetcd), reading Kubernetes cluster state written via Raft:

// Maven: io.etcd:jetcd-core:0.7.7
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import java.nio.charset.StandardCharsets;

public class EtcdReadDemo {
    public static void main(String[] args) throws Exception {
        Client etcd = Client.builder().endpoints("http://etcd:2379").build();

        // Read the Kubernetes Deployment spec stored via Raft by the API server
        ByteSequence key = ByteSequence.from(
            "/registry/deployments/default/my-app", StandardCharsets.UTF_8);

        etcd.getKVClient().get(key).thenAccept(response ->
            response.getKvs().forEach(kv ->
                System.out.println("Deployment spec: " + kv.getValue().toString(StandardCharsets.UTF_8))
            )
        ).get();  // .get() blocks until the read completes; drop it to stay asynchronous in production

        etcd.close();
    }
}

Choosing between ZooKeeper, etcd, and Hazelcast:

System           | Protocol            | Java integration      | Best for
Apache ZooKeeper | Zab (Paxos variant) | Apache Curator client | Kafka metadata, HBase, legacy Hadoop
etcd             | Raft                | jetcd Java client     | Kubernetes-native services, metadata stores
Hazelcast CP     | Raft                | In-process (same JVM) | Spring Boot services needing distributed locks/counters

For new Spring Boot services, Hazelcast CP requires no external cluster to operate; the Raft group runs inside your application nodes. For Kubernetes-native services, jetcd reaches the existing etcd cluster directly.

For a full deep-dive on Hazelcast CP subsystem configuration, Apache Curator for ZooKeeper, and operating etcd clusters in production, a dedicated follow-up post is planned.


📚 Lessons from Running Consensus Clusters in Production

  • Always run an odd number of nodes. A 4-node cluster tolerates the same 1 failure as a 3-node cluster but requires 3 ACKs instead of 2: higher write latency with no additional fault tolerance. Use 3 or 5.
  • Raft does not fix slow disks. Followers persist entries to disk before ACKing. An NVMe SSD versus a spinning disk can be the difference between 1 ms and 12 ms consensus round trips.
  • Span nodes across failure domains, not just machines. A 3-node cluster with 2 nodes in AZ-A and 1 in AZ-B loses quorum if AZ-A fails. Distribute 1 node per AZ across 3 zones to survive a full AZ outage.
  • Watch out for large Raft logs. etcd recommends keeping the total key-value space under 8 GB. Beyond that, snapshotting and log compaction slow recovery. Compact aggressively and monitor etcd_mvcc_db_total_size_in_bytes.

📌 TLDR: Summary & Key Takeaways

  • Consensus algorithms let a cluster agree on a single value even when nodes fail; they are the primitive behind distributed databases, service registries, and metadata stores.
  • Raft uses a single leader per term; all writes flow through it, making the protocol straightforward to reason about and debug.
  • Leader election uses term numbers, randomised timeouts, and majority votes to elect exactly one leader per term without a coordinator.
  • Log replication commits only after a quorum ACKs; clients never observe uncommitted writes.
  • Paxos is the academic foundation; Raft is the production standard. For new systems, start with an existing Raft library rather than implementing from scratch.
  • Always run an odd number of nodes (3 or 5) and spread them across independent failure domains.

๐Ÿ“ Practice Quiz

  1. What triggers a follower to start a leader election in Raft?

    A) A client write request arrives directly at the follower
    B) The follower's election timeout expires without receiving a heartbeat
    C) The follower detects a network partition between two other nodes
    D) The follower's log falls behind the leader by more than 10 entries

    Correct Answer: B. Raft followers only start elections when the election timer fires, signalling the leader may be down. Heartbeats from an active leader continuously reset this timer.

  2. A 5-node Raft cluster has 2 nodes unreachable. Can it still accept writes?

    A) No โ€” the cluster needs all 5 nodes to commit
    B) Yes โ€” any single node can accept and commit writes independently
    C) Yes โ€” the remaining 3 nodes form a quorum (majority of 5) and can commit
    D) No โ€” Raft requires at least 4 nodes to be online in a 5-node cluster

    Correct Answer: C. Quorum for 5 nodes is floor(5/2) + 1 = 3. With 3 healthy nodes the cluster remains fully writable.

  3. Why do randomised election timeouts prevent split votes from repeating indefinitely?

    A) They give the node with the longest log automatic priority to become leader
    B) They ensure only one candidate starts an election at a time on average, reducing tied races
    C) They synchronise all node clocks so every candidate starts at exactly the same moment
    D) They force the cluster into a read-only state until a winner is chosen

    Correct Answer: B. Each follower waits a different random duration before starting an election. The first to time out typically wins before others even begin campaigning, statistically breaking ties without coordination.

  4. A Raft leader crashes immediately after sending AppendEntries to two followers but before receiving any ACKs back. What happens to the uncommitted log entry?

    A) The entry is permanently lost โ€” no ACK means it was never persisted anywhere
    B) The followers discard the entry on the next heartbeat from the new leader
    C) The followers already persisted the entry locally; the new leader will commit it once it discovers a quorum already has it
    D) The client automatically retries and the new leader receives a fresh write request

    Correct Answer: C. Followers write entries to disk before ACKing, so the data is already durable on both followers. The new leader wins election (it has the most up-to-date log) and commits the entry when it confirms a quorum already holds it via its own AppendEntries pass.


Written by Abstract Algorithms (@abstractalgorithms)