A Guide to Raft, Paxos, and Consensus Algorithms

How do distributed databases agree on data? We explain the Leader Election and Log Replication mechanisms behind Raft, and how it compares to Paxos.


TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard — correct but notoriously hard to understand. Raft is the practical standard — designed for understandability, used in Kubernetes (etcd), Kafka (KRaft), and CockroachDB.


📖 The Restaurant Order Problem: Why Distributed Agreement Is Hard

Imagine a restaurant with three waiters and no central ticket system. A table orders "steak." Two waiters hear "steak," one hears "fish." They all run to the kitchen with different tickets.

This is the distributed systems problem: three servers receiving one write need to agree on the final value before confirming to the client. If one crashes mid-write, the others must not contradict each other.

Consensus algorithms solve this. They guarantee:

  • Safety: no two nodes commit different values for the same slot
  • Liveness: the cluster eventually makes progress as long as a majority of nodes are alive

⚙️ Raft's Three Roles: Follower, Candidate, and Leader

Raft simplifies consensus by restricting who can write. At any moment every node is in exactly one state:

```mermaid
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : election timeout (no heartbeat)
    Candidate --> Leader : majority votes received
    Candidate --> Follower : higher term seen
    Leader --> Follower : higher term seen
```
| Role | Responsibility |
| --- | --- |
| Follower | Passive; accepts log entries and heartbeats from the leader |
| Candidate | Temporary state during elections; requests votes from peers |
| Leader | Handles all client writes; sends periodic heartbeats (kept well below the election timeout) to suppress new elections |

Only the leader accepts client writes. This single-writer design makes the protocol easy to reason about.
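The state diagram above can be sketched as a tiny Go type. This is a minimal illustration, not code from any real Raft library; the names `Node` and `ObserveTerm` are made up for this example. It captures the one rule that appears on two of the transitions: seeing a higher term always demotes a node to follower.

```go
package main

import "fmt"

// Role models the three Raft node states.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Node holds the minimal state needed for role transitions.
type Node struct {
	role Role
	term int
}

// ObserveTerm applies the demotion rule from the diagram:
// a higher term seen in any message forces the node back to follower.
func (n *Node) ObserveTerm(term int) {
	if term > n.term {
		n.term = term
		n.role = Follower
	}
}

func main() {
	n := &Node{role: Leader, term: 5}
	n.ObserveTerm(7) // a peer reports term 7: this leader steps down
	fmt.Println(n.role == Follower, n.term) // true 7
}
```

Because every message carries a term, this single rule is enough to guarantee that a deposed leader from an old term can never keep acting as leader once it hears from the new one.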


🔢 Leader Election: Terms, Votes, and Quorum

Raft divides time into Terms — monotonically increasing integers. Think of them as electoral cycles.

Election flow:

  1. A follower's election timer expires (no heartbeat from leader).
  2. It increments its term, transitions to candidate, votes for itself.
  3. It sends RequestVote RPCs to all other nodes.
  4. A node grants its vote if the candidate's term is at least as high as its own, it has not already voted for someone else in that term, and the candidate's log is at least as up-to-date.
  5. The first candidate to gather a majority (floor(N/2) + 1 votes) wins and becomes leader.
```text
Cluster: 5 nodes  →  quorum = 3 votes required
Term 7: Candidate A asks nodes B, C, D, E for votes
B votes yes, C votes yes → A wins with 3/5 (including its own vote)
A sends heartbeats to all → the other nodes settle into follower state
```

Split votes (no candidate reaches a majority) are resolved by randomised timeouts: each node waits a different random delay before starting the next election, so one candidate almost always gets a head start.


🧠 Log Replication: How Raft Keeps Data Consistent

Once a leader is elected, writes flow like this:

  1. Client sends write to leader.
  2. Leader appends entry to its local log (uncommitted).
  3. Leader sends AppendEntries RPC to all followers.
  4. Once a majority acknowledge the entry, leader marks it committed.
  5. Leader applies the entry and replies to the client; followers learn the new commit index from subsequent AppendEntries messages and apply it too.
```mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Write X=5
    L->>F1: AppendEntries (X=5)
    L->>F2: AppendEntries (X=5)
    F1-->>L: ACK
    F2-->>L: ACK
    L->>L: Commit X=5
    L-->>C: OK
```

If a follower crashes before acknowledging, the leader retries. If the leader crashes, the new leader catches followers up using the log — ensuring no committed entry is ever lost.
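The commit decision at the heart of this flow is simple enough to model directly. The Go sketch below simulates steps 1–4 for one entry under heavy assumptions (synchronous ACKs, no retries, no log conflicts); the `leader` and `entry` types are invented for the example.

```go
package main

import "fmt"

type entry struct {
	term int
	cmd  string
}

// leader tracks, per log index, how many nodes have stored the entry.
type leader struct {
	log         []entry
	acks        []int // acks[i] = nodes (incl. leader) holding log[i]
	commitIndex int   // highest index known committed; -1 = none
	clusterSize int
}

// replicate runs steps 1-4 of the flow above for a single entry,
// with follower responses passed in as a slice of booleans.
func (l *leader) replicate(term int, cmd string, followerAcks []bool) {
	l.log = append(l.log, entry{term, cmd}) // step 2: append locally
	l.acks = append(l.acks, 1)              // the leader counts as one ack
	idx := len(l.log) - 1
	for _, ok := range followerAcks { // steps 3-4: gather ACKs
		if ok {
			l.acks[idx]++
		}
	}
	if l.acks[idx] >= l.clusterSize/2+1 { // majority reached → commit
		l.commitIndex = idx
	}
}

func main() {
	l := &leader{clusterSize: 3, commitIndex: -1}
	l.replicate(1, "X=5", []bool{true, true}) // both followers ACK
	fmt.Println(l.commitIndex) // 0: the entry is committed
}
```

The key property: an entry is committed once it sits on a majority, so any future leader (which also needs a majority of votes) must share at least one node with that majority and therefore cannot have missed the entry.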


⚖️ Raft vs Paxos: Simplicity vs Formal Rigor

| Dimension | Raft | Paxos |
| --- | --- | --- |
| Designed for | Understandability | Formal proof of correctness |
| Leader model | Single strong leader | Flexible, multi-proposer |
| Complexity | Lower | Higher |
| Variants | Raft, Multi-Raft | Multi-Paxos, Fast Paxos, Flexible Paxos |
| Used in production | etcd, CockroachDB, TiKV | Chubby (Google); ZooKeeper uses ZAB, a closely related protocol |

Raft is the answer to "I need a consensus library I can actually implement and debug." Paxos is the answer to "I need to prove this protocol is correct."


🌍 Where You'll Find Raft in the Wild

  • Kubernetes: etcd (cluster state store) uses Raft. Every kubectl command that changes cluster state goes through etcd's Raft log.
  • Kafka: KRaft mode (Kafka 3.x) replaces ZooKeeper with a built-in Raft-based metadata log.
  • CockroachDB / TiKV: Each shard (range) runs its own Raft group to replicate data.
  • Consul: Distributed service registry uses Raft for consistent key-value state.

Operational consequence: In a Raft cluster, writes stall until a quorum of nodes is reachable. If more than floor(N/2) nodes fail simultaneously, the survivors cannot form a quorum, so the cluster stops accepting writes (and cannot serve linearizable reads) until you restore quorum.
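The quorum arithmetic behind that consequence fits in two one-line functions. A small Go helper (illustrative names, not from any library) makes the sizing trade-off explicit:

```go
package main

import "fmt"

// quorum is the majority needed to elect a leader or commit a write.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance is how many nodes may fail before writes stall.
func faultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("nodes=%d quorum=%d tolerates=%d failures\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that a 4-node cluster needs a quorum of 3 yet still tolerates only 1 failure, the same as a 3-node cluster, which is why Raft deployments almost always use odd cluster sizes.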


📌 Key Takeaways

  • Consensus algorithms let a cluster agree on a value even when nodes fail.
  • Raft uses a single leader per term; all writes go through it.
  • Leader election is driven by term numbers, randomised timeouts, and majority voting.
  • Log replication requires acknowledgement from a quorum before committing.
  • Raft is the production standard (etcd, Kafka KRaft, CockroachDB); Paxos is the academic foundation.

🧩 Test Your Understanding

  1. What triggers a follower to start an election in Raft?
  2. A 5-node Raft cluster has 2 nodes unreachable. Can it still accept writes?
  3. Why do randomised election timeouts prevent split votes from looping forever?
  4. What is a "term" in Raft and why does it always increase?

Written by

Abstract Algorithms

@abstractalgorithms