
A Guide to Raft, Paxos, and Consensus Algorithms

How do distributed databases agree on data? We explain the Leader Election and Log Replication mechanisms.

Abstract Algorithms · 13 min read

AI-assisted content.

TLDR

Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard — correct but notoriously hard to understand. Raft is the practical standard — designed for understandability, used in Kubernetes (etcd), Kafka (KRaft), and CockroachDB.


📖 The Restaurant Order Problem: Why Distributed Agreement Is Hard

Imagine a restaurant with three waiters and no central ticket system. A table orders "steak." Two waiters hear "steak," one hears "fish." They all run to the kitchen with different tickets.

This is the distributed systems problem: three servers receiving one write need to agree on the final value before confirming to the client. If one crashes mid-write, the others must not contradict each other.

Consensus algorithms solve this. They provide two fundamental guarantees:

  • Safety: no two nodes ever commit different values for the same slot — even during a crash or network partition.
  • Liveness: the cluster eventually makes progress as long as a majority of nodes are alive and can communicate.

Both Raft and Paxos satisfy these properties. Where they differ is in how readable and debuggable the protocol is for practising engineers.


🔍 Safety, Liveness, and Quorums: What Consensus Must Guarantee

The key rule that makes consensus possible is the quorum — a simple majority. The cluster only commits a value once a quorum acknowledges it. You can lose up to floor((N-1)/2) nodes and still write; lose one more and writes stall.

| Cluster size | Quorum (majority) | Max node failures tolerated |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |

Both Raft and Paxos rely on quorums. Neither sacrifices safety; they differ in liveness and protocol readability.
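The quorum arithmetic in the table can be sketched in a few lines of Java (the class and method names here are illustrative, not from any library):

```java
// Illustrative quorum arithmetic for majority-based consensus.
public class QuorumMath {

    // Majority quorum: strictly more than half the nodes must agree.
    static int quorum(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Failures tolerated while a quorum can still be formed.
    static int maxFailures(int clusterSize) {
        return clusterSize - quorum(clusterSize);  // = floor((N-1)/2)
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5, 7}) {
            System.out.printf("N=%d  quorum=%d  tolerates=%d%n",
                    n, quorum(n), maxFailures(n));
        }
    }
}
```

Note that an even cluster size buys nothing: N=4 tolerates the same single failure as N=3 but needs one more ACK per write.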


⚙️ Raft's Three Roles: Follower, Candidate, and Leader

Raft simplifies consensus by restricting who can write. At any moment every node is in exactly one state:

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : election timeout (no heartbeat)
    Candidate --> Leader : majority votes received
    Candidate --> Follower : higher term seen
    Leader --> Follower : higher term seen

| Role | Responsibility |
|---|---|
| Follower | Passive; accepts log entries and heartbeats from the leader |
| Candidate | Temporary state during elections; requests votes from peers |
| Leader | Handles all writes; sends periodic heartbeats (interval well below the election timeout) |

Only the leader accepts client writes. This single-writer design makes the protocol easy to reason about — at any given term there is at most one authoritative source of truth.
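The transitions in the state diagram above can be sketched as a tiny state machine — a minimal illustration under assumed names, not any library's API:

```java
// Minimal sketch of Raft's three roles and their transitions.
public class RaftRoles {

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static class Node {
        Role role = Role.FOLLOWER;  // every node starts as a follower
        long currentTerm = 0;

        // Election timeout fired with no heartbeat: stand for election.
        void onElectionTimeout() {
            if (role != Role.LEADER) {
                currentTerm++;
                role = Role.CANDIDATE;
            }
        }

        // Majority of votes received in the current term.
        void onMajorityVotes() {
            if (role == Role.CANDIDATE) role = Role.LEADER;
        }

        // Any message carrying a higher term demotes the node.
        void onHigherTermSeen(long term) {
            if (term > currentTerm) {
                currentTerm = term;
                role = Role.FOLLOWER;
            }
        }
    }

    public static void main(String[] args) {
        Node n = new Node();
        n.onElectionTimeout();   // Follower -> Candidate, term 1
        n.onMajorityVotes();     // Candidate -> Leader
        n.onHigherTermSeen(2);   // Leader -> Follower on seeing term 2
        System.out.println(n.role + " term=" + n.currentTerm);
    }
}
```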


🔢 Leader Election: Terms, Votes, and the Role of Randomness

Raft divides time into Terms — monotonically increasing integers. Think of them as electoral cycles. A new term begins whenever a follower suspects the leader has failed.

Election flow:

  1. A follower's election timer expires (no heartbeat received from a leader).
  2. It increments its term, transitions to candidate, votes for itself.
  3. It sends RequestVote RPCs to all other nodes.
  4. A node grants its vote if it has not already voted in this term, the candidate's term is at least as high, and the candidate's log is at least as up-to-date.
  5. First candidate to get (N/2 + 1) votes wins and becomes leader for that term.
Cluster: 5 nodes  →  quorum = 3 votes required
Term 7, Candidate A asks nodes B, C, D, E
B votes yes, C votes yes → A wins with 3/5 (including itself)
A sends heartbeats to all → everyone transitions to follower

Split votes (two candidates tie) are resolved by randomised timeouts — each follower waits a different random delay (e.g., 150–300 ms) before starting an election. The first to time out usually wins before others wake up, breaking any tie without a coordinator.
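The randomised timeout is one line of code. A sketch using the example bounds from the text (150–300 ms):

```java
import java.util.concurrent.ThreadLocalRandom;

// Each node independently draws its election timeout from [150, 300) ms,
// so a repeated tie between two candidates is unlikely.
public class ElectionTimeout {
    static final int MIN_MS = 150;
    static final int MAX_MS = 300;

    static int nextTimeoutMs() {
        return ThreadLocalRandom.current().nextInt(MIN_MS, MAX_MS);
    }

    public static void main(String[] args) {
        // Five nodes each pick an independent timeout; the smallest draw
        // becomes a candidate first and usually wins before the rest wake.
        for (int node = 1; node <= 5; node++) {
            System.out.printf("node %d: timeout=%d ms%n", node, nextTimeoutMs());
        }
    }
}
```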

📊 Raft Leader Election Sequence

sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3

    Note over F1: Election timeout fires
    F1->>F1: become Candidate, term++
    F1->>F2: RequestVote (term=8)
    F1->>F3: RequestVote (term=8)
    F2-->>F1: vote granted
    F3-->>F1: vote granted
    Note over F1: majority (3/3) reached, become Leader
    F1->>F2: AppendEntries heartbeat
    F1->>F3: AppendEntries heartbeat
    Note over F2,F3: remain Followers in the new term

🧠 Deep Dive: How Raft Replicates Logs to Followers

Once a leader is elected, client writes flow through the AppendEntries RPC. The sequence below is the core of Raft's safety guarantee: the leader only acknowledges a write to the client after a majority of nodes have persisted it.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Write X=5
    L->>F1: AppendEntries (X=5)
    L->>F2: AppendEntries (X=5)
    F1-->>L: ACK
    F2-->>L: ACK
    L->>L: Commit X=5
    L-->>C: OK

The leader only confirms to the client once a quorum holds the entry — the leader itself plus at least one of F1/F2 in this 3-node example. The write is durable even if the leader crashes immediately after replying.

The five-step write path:

  1. Client sends write to leader.
  2. Leader appends entry to its local log (as uncommitted).
  3. Leader broadcasts AppendEntries RPC to all followers in parallel.
  4. Once a majority ACK the entry, leader marks it committed and applies it to the state machine.
  5. Leader replies OK to client; followers learn of the commit on the next heartbeat.

If a follower crashes before ACKing, the leader retries indefinitely. If the leader crashes after committing, the new leader will still carry the entry — it won election precisely because it had the most up-to-date log.
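The commit rule in step 4 boils down to counting distinct nodes that hold the entry, starting from the leader itself. A sketch with hypothetical names:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative commit guard: an entry is committed once the leader plus
// enough follower ACKs form a majority of the cluster.
public class CommitGuard {
    final int clusterSize;
    final Set<String> holders = new HashSet<>();

    CommitGuard(int clusterSize, String leaderId) {
        this.clusterSize = clusterSize;
        holders.add(leaderId);  // the leader has the entry in its own log
    }

    // Record a follower's ACK; returns true once a quorum holds the entry.
    boolean onAck(String followerId) {
        holders.add(followerId);
        return holders.size() >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        CommitGuard entry = new CommitGuard(5, "leader");
        System.out.println(entry.onAck("f1"));  // 2 of 5 — not yet committed
        System.out.println(entry.onAck("f2"));  // 3 of 5 — quorum, committed
    }
}
```

Duplicate ACKs from a retrying follower are harmless here because the set deduplicates node IDs — only distinct holders count toward the quorum.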


📊 The Full Raft Write Path: From Client Request to Durable Commit

The sequence diagram shows the happy path. This flowchart adds the commit guard and the retry loop for lagging followers:

flowchart TD
    A[Client sends write to Leader] --> B[Leader appends to local log uncommitted]
    B --> C[Leader broadcasts AppendEntries to all followers in parallel]
    C --> D{Quorum ACKs received?}
    D -- No --> E[Wait and retry lagging followers]
    E --> D
    D -- Yes --> F[Leader marks entry committed applies to state machine]
    F --> G[Leader replies OK to client]
    G --> H[Followers apply on next AppendEntries heartbeat]

Writes are blocked at the commit guard until a majority respond — this is what makes Raft strongly consistent. A client that gets OK can be certain the value survived any single-node failure.


⚖️ Trade-offs & Failure Modes: Raft vs Paxos

| Dimension | Raft | Paxos |
|---|---|---|
| Designed for | Understandability | Formal proof of correctness |
| Leader model | Single strong leader | Flexible, multi-proposer |
| Complexity | Lower — one spec, one paper | Higher — many competing variants |
| Common variants | Multi-Raft (per-shard groups) | Multi-Paxos, Fast Paxos, Flexible Paxos |
| Used in production | etcd, CockroachDB, TiKV, Consul | Chubby (Google), ZooKeeper (Zab) |

Failure modes to know before you operate a Raft cluster:

  • Leader isolation: A leader cut off from the majority keeps receiving client writes but cannot commit them (quorum ACKs will never arrive). The majority partition elects a new leader in a higher term. When the old leader reconnects, it sees the higher term and steps down, discarding all its uncommitted log entries. No committed data is ever lost.
  • Follower lag: A slow or restarting follower doesn't block commits — the leader only needs a quorum. The lagging follower replays the log from its last known index on reconnect.
  • Split-brain prevention: becoming leader requires a majority vote, and each node grants at most one vote per term — so two nodes can never act as leader in the same term.

🧭 Decision Guide: When to Use Raft vs Paxos vs ZooKeeper

| Situation | Recommendation |
|---|---|
| Building a new distributed store or metadata service | Use Raft — mature libraries exist (etcd/raft, hashicorp/raft); easy to debug |
| Need a formal correctness proof for a safety-critical system | Consider Paxos — richer academic tooling for proving safety properties |
| Already running ZooKeeper for coordination | Evaluate etcd as a migration target; stay if the system is stable and the team knows ZooKeeper |
| Multi-partition database (per-shard replication) | Use Multi-Raft (CockroachDB / TiKV model) — one Raft group per 64 MB range |
| System tolerates eventual consistency | Skip consensus entirely — use last-write-wins or CRDTs to avoid the per-write latency cost |

The rule of thumb: reach for Raft when you need strong consistency and want a protocol you can actually read, trace, and debug in a postmortem.


🌍 Real-World Applications: etcd, Kafka KRaft, and CockroachDB

  • Kubernetes / etcd: Every kubectl command that changes cluster state is linearised through etcd's Raft log. A 3-node etcd quorum tolerates one node failure; production deployments typically run 5 nodes for resilience during rolling upgrades.
  • Kafka KRaft mode (3.x+): Replaces ZooKeeper with a built-in Raft-based metadata log. The controller quorum (typically 3 nodes) runs Raft; each broker is a follower for metadata changes. This eliminates the separate ZooKeeper cluster and cuts operational complexity.
  • CockroachDB / TiKV: Each 64 MB data range runs its own Raft group across 3 replicas. Node failures trigger elections only for affected ranges — the rest of the cluster keeps serving normally.
  • Consul: Distributed service registry uses Raft for consistent key-value state across server nodes.

Operational consequence: Writes stall until a quorum of nodes is reachable. Lose a majority of nodes and the cluster enters read-only mode until quorum is restored.


🧪 Walking Through a Kubernetes Control-Plane Write via etcd

When you run kubectl apply -f deployment.yaml, here is what happens at the consensus layer:

  1. The Kubernetes API server receives the request, validates it, and calls etcd.Put("/registry/deployments/default/my-app", serializedSpec).
  2. etcd's Raft leader appends the key-value write to its log (uncommitted).
  3. The leader broadcasts AppendEntries to the other two etcd nodes in the quorum.
  4. On receiving ACKs from both followers (quorum = 2 of 3), the leader commits the entry and applies it to the in-memory key-value state.
  5. etcd returns OK to the API server, which confirms to kubectl. Kubernetes controllers watching the key prefix see the change and begin reconciling — scheduling pods, updating replica sets.

If the etcd leader crashes between steps 3 and 5, the new leader replays the uncommitted entry (it won election because it had the most up-to-date log). The API server retries and gets OK once the new leader commits.
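The "most up-to-date log" tiebreak that decides such elections is a two-key comparison: last log term first, then last log index. A standalone sketch of that rule (names are illustrative):

```java
// Raft's "at least as up-to-date" comparison used when granting votes:
// a candidate's log wins on a higher last term, or, at equal terms,
// on an equal-or-longer log.
public class LogUpToDate {

    static boolean candidateLogOk(long candLastTerm, long candLastIndex,
                                  long myLastTerm, long myLastIndex) {
        if (candLastTerm != myLastTerm) {
            return candLastTerm > myLastTerm;  // newer term wins outright
        }
        return candLastIndex >= myLastIndex;   // same term: longer log wins
    }

    public static void main(String[] args) {
        // A higher last term beats a longer log from an older term.
        System.out.println(candidateLogOk(8, 10, 7, 50));  // true
        // Same term: the shorter log is less up-to-date, vote denied.
        System.out.println(candidateLogOk(7, 40, 7, 50));  // false
    }
}
```

This is why a leader that committed an entry can crash safely: any candidate missing that entry has a less up-to-date log and cannot gather a majority of votes.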


🛠️ Apache ZooKeeper, etcd, and Hazelcast: Raft Consensus in Java Production Systems

Apache ZooKeeper is the battle-tested coordination service behind Kafka (pre-3.x), Hadoop, and HBase — it implements Zab (ZooKeeper Atomic Broadcast), a Paxos variant. etcd is the Raft-based key-value store that powers Kubernetes. Hazelcast is a Java-native in-process distributed data grid whose CP subsystem implements Raft — the lowest-friction path to adding consensus to a Spring Boot application.

Hazelcast CP subsystem — distributed leader election and linearizable counters in Java:

// Maven: com.hazelcast:hazelcast:5.3.6
import com.hazelcast.config.*;
import com.hazelcast.core.*;
import com.hazelcast.cp.CPSubsystem;
import com.hazelcast.cp.lock.FencedLock;
import com.hazelcast.cp.IAtomicLong;

public class ConsensusDemo {

    public static void main(String[] args) {
        // Configure CP subsystem — minimum 3 members for Raft quorum.
        // Note: with CPMemberCount(3), CP operations block until three
        // cluster members have joined, so run three instances of this app.
        Config config = new Config();
        config.getCPSubsystemConfig().setCPMemberCount(3);  // 3-node Raft group

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        CPSubsystem cp = hz.getCPSubsystem();

        // 1. Distributed atomic counter — linearizable across all 3 members
        //    Every incrementAndGet() goes through a full Raft round-trip
        IAtomicLong requestCounter = cp.getAtomicLong("request-counter");
        long requestId = requestCounter.incrementAndGet();
        System.out.println("Assigned ID (globally unique): " + requestId);

        // 2. Distributed leader lock — only one node holds this lock at a time
        //    FencedLock uses Raft to ensure mutual exclusion even across network partitions
        FencedLock leaderLock = cp.getLock("scheduler-leader-lock");
        leaderLock.lock();
        try {
            System.out.println("I am the scheduler leader — running distributed cron job");
            // Only one node in the cluster runs this block simultaneously
        } finally {
            leaderLock.unlock();
        }
    }
}

etcd Java client (jetcd) — reading Kubernetes cluster state written via Raft:

// Maven: io.etcd:jetcd-core:0.7.7
import io.etcd.jetcd.Client;
import io.etcd.jetcd.ByteSequence;
import java.nio.charset.StandardCharsets;

Client etcd = Client.builder().endpoints("http://etcd:2379").build();

// Read the Kubernetes Deployment spec stored via Raft by the API server
ByteSequence key = ByteSequence.from(
    "/registry/deployments/default/my-app", StandardCharsets.UTF_8);

etcd.getKVClient().get(key).thenAccept(response ->
    response.getKvs().forEach(kv ->
        System.out.println("Deployment spec: " + kv.getValue().toString(StandardCharsets.UTF_8))
    )
).join();  // join() blocks until the read completes; drop it to stay async

Choosing between ZooKeeper, etcd, and Hazelcast:

| System | Protocol | Java integration | Best for |
|---|---|---|---|
| Apache ZooKeeper | Zab (Paxos variant) | Apache Curator client | Kafka metadata, HBase, legacy Hadoop |
| etcd | Raft | jetcd Java client | Kubernetes-native services, metadata stores |
| Hazelcast CP | Raft | In-process (same JVM) | Spring Boot services needing distributed locks/counters |

For new Spring Boot services, Hazelcast CP requires no external cluster to operate — the Raft group runs inside your application nodes. For Kubernetes-native services, jetcd reaches the existing etcd cluster directly.

For a full deep-dive on Hazelcast CP subsystem configuration, Apache Curator for ZooKeeper, and operating etcd clusters in production, a dedicated follow-up post is planned.


📚 Lessons from Running Consensus Clusters in Production

  • Always run an odd number of nodes. A 4-node cluster tolerates the same 1 failure as a 3-node cluster but requires 3 ACKs instead of 2 — higher write latency with no additional fault tolerance. Use 3 or 5.
  • Raft does not fix slow disks. Followers persist entries to disk before ACKing. An NVMe SSD versus a spinning disk can be the difference between 1 ms and 12 ms consensus round trips.
  • Span nodes across failure domains, not just machines. A 3-node cluster with 2 nodes in AZ-A and 1 in AZ-B loses quorum if AZ-A fails. Distribute 1 node per AZ across 3 zones to survive a full AZ outage.
  • Watch out for large Raft logs. etcd recommends keeping the total key-value space under 8 GB. Beyond that, snapshotting and log compaction slow recovery. Compact aggressively and monitor etcd_mvcc_db_total_size_in_bytes.

📌 TLDR: Summary & Key Takeaways

  • Consensus algorithms let a cluster agree on a single value even when nodes fail — the primitive behind distributed databases, service registries, and metadata stores.
  • Raft uses a single leader per term; all writes flow through it, making the protocol straightforward to reason about and debug.
  • Leader election uses term numbers, randomised timeouts, and majority votes to elect exactly one leader per term without a coordinator.
  • Log replication commits only after a quorum ACKs — clients never observe uncommitted writes.
  • Paxos is the academic foundation; Raft is the production standard. For new systems, start with an existing Raft library rather than implementing from scratch.
  • Always run an odd number of nodes (3 or 5) and spread them across independent failure domains.
