
Split Brain Explained: When Two Nodes Both Think They Are Leader

How network partitions create two leaders, why that produces irrecoverable data loss, and the quorum, fencing, and STONITH patterns that prevent it

Abstract Algorithms · 20 min read · Intermediate · AI-assisted content

For developers with some experience. Builds on fundamentals.

TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader — each accepting writes the other never sees. Prevent it with quorum consensus (at least ⌊N/2⌋+1 nodes must agree before leadership is granted), fencing tokens (each new leader issues a monotonically increasing token; storage layers reject writes below the current high-water mark), and STONITH (physically power off the old leader before the new one begins). Raft, Zookeeper, and etcd enforce quorum by design. MongoDB with w=1 and Redis Sentinel with quorum=1 do not — and production incidents prove it.


Picture this: a network glitch hits your primary database node at 3 a.m. The five nodes in your cluster lose contact with each other. Two subgroups form. Each group holds an election, each elects a leader, and now two nodes are simultaneously accepting writes as "primary." One records a payment of $500. The other records the same account receiving $750. When the network heals, you have irreconcilable history — and no way to know which number is right.

This is split brain. It is not a theoretical curiosity. It has taken down MongoDB in production, quietly corrupted Redis Sentinel deployments for hours, and caused Elasticsearch clusters to surface stale search results while silently losing indexed documents. Understanding it — and the surprisingly elegant mathematics that prevents it — is essential for anyone designing or operating distributed systems.


📖 The Consistency Trap: What Split Brain Really Means

Split brain is the condition in which two or more nodes in a cluster simultaneously believe they are the authoritative leader, with no mechanism to arbitrate between them. Each leader accepts reads and writes independently, producing divergent state. When the partition heals and the nodes reconnect, the divergence is often irrecoverable without human intervention or an explicit rollback policy.

A useful analogy: imagine two branch managers of the same bank, cut off from each other by a phone outage. Both continue approving loans using the same customer account balances they last saw. When phone service returns, each has approved different loans against the same collateral. Neither manager did anything wrong locally; the problem is structural — there was no single source of truth during the outage.

Split brain sits at the availability–integrity boundary in the CAP theorem taxonomy. Systems that prefer availability over consistency (AP systems) may survive a partition by allowing both halves to accept writes, but they pay with eventual data divergence. Systems that prefer consistency over availability (CP systems) refuse writes if they cannot confirm a quorum, preventing split brain at the cost of rejecting legitimate requests during the outage. There is no free lunch — every distributed system must choose where on this spectrum it sits, and the choice determines split brain exposure.


🔍 What Is Split Brain? The Core Failure Mode

A split brain is a distributed systems failure where two or more nodes simultaneously believe they are the authoritative leader — each accepting writes with no knowledge of what the other is doing.

The name comes from neurology: when the corpus callosum connecting the two brain hemispheres is severed, each hemisphere operates independently, unaware of the other's decisions. The same thing happens in a distributed cluster when the network that connects nodes is severed.

The three conditions that create split brain:

  1. A network partition — nodes cannot reach each other, but continue running
  2. No quorum enforcement — the cluster allows leadership without a majority
  3. No leader lease or fencing — the old leader isn't told to stop accepting writes

The partition itself is unavoidable; the other two items are safeguards you control. When both safeguards are missing, any partition can produce two leaders. With quorum enforcement alone, two simultaneous majorities are mathematically impossible, and fencing closes the window in which a deposed leader has not yet noticed its demotion.

⚙️ How a Five-Node Partition Creates Two Leaders Step by Step

Consider a five-node Raft cluster: one leader (P) and four followers (S1, S2, S3, S4). The cluster is healthy. Then a network partition divides the cluster asymmetrically:

  • Partition A contains: P and S1 (two nodes — a minority)
  • Partition B contains: S2, S3, and S4 (three nodes — a majority)

Here is what happens, step by step:

Step 1 — Heartbeats stop crossing the partition. S2, S3, and S4 stop receiving heartbeats from P. Each starts an election timeout countdown — a randomised timer, typically 150–300 ms. The randomisation means one of them will expire first.

Step 2 — Partition B elects a new leader. S2's timer fires first. S2 increments its Raft term counter to Term 2 (the cluster was operating in Term 1) and sends RequestVote to S3 and S4. Both grant their votes. S2 now becomes the leader of Partition B with three votes — a valid majority of the five-node cluster. S2 begins accepting writes.

Step 3 — Partition A's old leader keeps accepting writes. P and S1, isolated in Partition A, do not know the election happened. P is still in Term 1 and still believes it is the leader. If clients route to P, it continues accepting writes and replicating them to S1.

Step 4 — Conflicting writes accumulate. Both P (in Term 1) and S2 (in Term 2) are accepting writes. They are writing to divergent logs. A user's account balance might be modified on both sides with different values.

Step 5 — The partition heals. P receives a message from S2 carrying Term 2. Raft's rule is absolute: a node that discovers a higher term must immediately step down and convert to follower. P reverts to follower status. Its divergent log entries — every write it accepted since the partition began — are rolled back and overwritten with S2's log. Any write that was committed only to P and S1 is permanently lost.

This rollback is not optional. It is the price of returning to a consistent state. Data that clients received ACKs for is gone.
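To see concretely what the rollback discards, here is a toy sketch (illustrative Python, not Raft's actual reconciliation code) that compares the two logs after the partition heals:

# Each log entry is (term, command). The deposed leader keeps only the prefix
# it shares with the new leader; everything after that point is discarded.
def reconcile(old_leader_log, new_leader_log):
    prefix = 0
    while (prefix < min(len(old_leader_log), len(new_leader_log))
           and old_leader_log[prefix] == new_leader_log[prefix]):
        prefix += 1
    lost = old_leader_log[prefix:]      # ACKed on the minority side, never committed
    return new_leader_log, lost

old = [(1, "x=1"), (1, "balance=500")]  # written while partitioned, still Term 1
new = [(1, "x=1"), (2, "balance=750")]  # committed by the Term-2 quorum
surviving, lost = reconcile(old, new)
print(lost)   # [(1, 'balance=500')]: the acknowledged write that no longer exists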


🧠 Deep Dive: Why Quorum Math Makes Two Simultaneous Majorities Impossible

The reason Raft's quorum rule prevents true split brain (as opposed to the temporary confusion described above) is rooted in a simple mathematical identity. For a cluster of N nodes, a majority quorum requires ⌊N/2⌋ + 1 nodes. Two majorities drawn from the same N nodes must overlap by at least one node.

In a five-node cluster, a quorum is 3. Two groups cannot simultaneously each have 3 nodes from a pool of 5 — that would require 6 nodes. This means that at most one side of any partition can form a quorum and elect a leader. The minority side — P and S1 in our example — cannot elect a new leader because S1's single vote plus P's own vote equals only 2, short of the required 3. P remains leader in its own view, but it is effectively frozen: it cannot replicate writes to a quorum, so any write it tentatively accepts cannot be durably committed.
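To make the arithmetic concrete, here is a tiny illustrative script (not taken from any real implementation) that enumerates every way a five-node cluster can split and checks whether both sides could ever hold a quorum at once:

# Any two majorities of the same N nodes must overlap, so at most one side
# of a partition can assemble a quorum.
def majority(n: int) -> int:
    return n // 2 + 1          # floor(N/2) + 1

N = 5
for side_a in range(N + 1):
    side_b = N - side_a
    a_has_quorum = side_a >= majority(N)
    b_has_quorum = side_b >= majority(N)
    print(f"A={side_a} B={side_b}  quorum A: {a_has_quorum}  quorum B: {b_has_quorum}")
# No split ever prints True for both sides: two quorums of 3 would need 6 nodes.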

Raft uses term numbers as lightweight fencing. Every message carries the sender's current term. Any node receiving a message from a higher term immediately updates and defers. This makes stale leaders self-evicting the moment connectivity returns. The term number is the distributed system's equivalent of a version gate: old leaders cannot sneak writes past it.

Split Brain Detection Internals: How Raft and ZooKeeper Identify Rogue Leaders

Both Raft and ZooKeeper solve the same underlying problem — detecting a stale leader before it causes damage — but they use different internal mechanisms.

Raft's approach: term-based self-eviction. Every Raft RPC carries the sender's term number. The rule is absolute: if a node receives any message with a term higher than its own, it immediately converts to follower and updates its term. This makes the split-brain detection passive — the old leader does not need to "notice" anything. The moment the network heals and the new leader's heartbeat arrives, the old leader self-evicts. The detection latency is at most roughly one heartbeat interval (typically 50–150 ms). There is no polling, no timeout negotiation, and no coordinator — just a comparison of two integers.
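As a rough illustration of that rule (a toy sketch, not the code of etcd or any real Raft library), the entire detection step reduces to a single integer comparison:

from dataclasses import dataclass

@dataclass
class RaftNode:
    current_term: int = 1
    role: str = "leader"

    def on_message(self, sender_term: int) -> None:
        # Self-eviction rule: any message from a higher term demotes this node.
        if sender_term > self.current_term:
            self.current_term = sender_term
            self.role = "follower"

old_leader = RaftNode(current_term=1, role="leader")
old_leader.on_message(sender_term=2)              # first heartbeat from the new leader
print(old_leader.role, old_leader.current_term)   # follower 2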

ZooKeeper's approach: epoch-stamped transactions. ZooKeeper identifies every transaction with a zxid (ZooKeeper transaction ID) whose high 32 bits encode the leader's epoch and whose low 32 bits count transactions within that epoch. A newly elected leader bumps the epoch, so any proposal stamped with an older epoch is recognisably stale: followers refuse to acknowledge it, and the zxid can double as a fencing token when clients coordinate access to external resources. The fencing happens on the data path, not through a separate health check; a deposed leader's writes are simply refused.
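For intuition, a small illustrative snippet (not ZooKeeper source) showing how packing the epoch into the high bits of a 64-bit zxid makes staleness a single comparison:

# High 32 bits of a zxid = leader epoch, low 32 bits = counter within the epoch.
def zxid_epoch(zxid: int) -> int:
    return zxid >> 32

def is_stale(incoming_zxid: int, current_epoch: int) -> bool:
    return zxid_epoch(incoming_zxid) < current_epoch

stale_write = (3 << 32) | 42                   # stamped in epoch 3, counter 42
print(is_stale(stale_write, current_epoch=5))  # True: refused without any lookup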

The shared principle: both systems push detection to the data path. There is no separate "split brain checker" process — the ordinary read/write flow carries the epoch or term, and any staleness is rejected immediately.

Performance Analysis: The Latency Cost of Quorum Enforcement

Quorum consensus prevents split brain but imposes a measurable latency overhead. Understanding the numbers helps you make the right trade-off.

| Replication Mode | P50 Latency | P99 Latency | Split Brain Risk | Durability |
|---|---|---|---|---|
| w=1 (primary only) | ~1 ms | ~5 ms | High — primary death loses writes | Weakest |
| w=majority async | ~5–15 ms | ~30 ms | Low — quorum required | Strong |
| w=majority + journal | ~10–25 ms | ~50 ms | Low | Very strong |
| Raft full commit | ~15–30 ms | ~60 ms | Eliminated by design | Strongest |

The key insight from these numbers: quorum acknowledgement adds approximately one network round-trip to write latency (the primary must wait for a majority of replicas to confirm). For most services, 10–30 ms is acceptable. For sub-millisecond latency requirements (high-frequency trading, in-process caches), quorum consensus is architecturally incompatible — those systems accept split brain risk explicitly and resolve divergence at read time.

The replication acknowledgement mode you configure directly controls your exposure:

| Write Mode | Latency | Durability | Split Brain Risk |
|---|---|---|---|
| w=1 (ack from primary only) | Lowest | Weakest — data lost if primary dies before replication | Highest — primary may accept writes it alone holds |
| w=majority (ack from quorum) | Moderate | Strong — at least one replica in any future quorum has the write | Low — write cannot be lost without losing a quorum node |
| w=all (ack from every replica) | Highest | Strongest — write is everywhere | Lowest — any single node failure blocks writes entirely |
| w=majority + j=true (journalled) | Moderate–high | Very strong — persisted to disk on quorum members | Low — combines quorum durability with crash recovery |

The critical insight: w=1 does not mean the write is unsafe in normal operation. It means the write is unsafe during a partition or failover. Teams often configure w=1 for latency reasons and only discover the implication when a leader election surfaces data that was never replicated.
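If you write to MongoDB from Python, the acknowledgement mode is set per client, database, or collection. A minimal sketch using pymongo, assuming a hypothetical replica set rs0 and a collection named orders:

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Hypothetical connection string; substitute your own replica set members.
client = MongoClient("mongodb://db1,db2,db3/?replicaSet=rs0")

orders = client.shop.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", j=True),  # quorum ack + on-disk journal
)

# Acknowledged only once a majority of members have journalled the write,
# so a later failover cannot silently roll it back.
orders.insert_one({"order_id": 1234, "amount": 500})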


📊 Visualizing Split Brain: From Partition to Rollback

The Timeline of a Split-Brain Event and Recovery

The sequence diagram below traces the full lifecycle of a split-brain event in a simplified cluster: the old leader, the newly elected leader on the other side of the partition, and the two clients whose acknowledged writes end up diverging.

sequenceDiagram
    participant C1 as Client A
    participant OL as "Old Leader (Term 1)"
    participant NL as "New Leader (Term 2)"
    participant C2 as Client B

    Note over OL,NL: Network partition begins
    C1->>OL: Write balance=500
    OL-->>C1: ACK (accepted in Term 1)
    C2->>NL: Write balance=750
    NL-->>C2: ACK (accepted in Term 2)
    Note over OL,NL: Both leaders hold divergent state
    Note over OL,NL: Partition heals — Term 2 propagates
    NL->>OL: AppendEntries term=2
    OL-->>NL: Step down — reverting to follower
    Note over OL: balance=500 write ROLLED BACK
    Note over NL: balance=750 is the durable truth

After the partition heals, the old leader receives an AppendEntries RPC from the new leader carrying Term 2. Because Term 2 > Term 1, the old leader immediately steps down, discards its divergent log entries (including the balance=500 write), and overwrites with the new leader's log. Client A received an ACK for a write that no longer exists.

How Quorum and Fencing Prevent the Problem from Starting

The flowchart below shows the decision path a node takes when it detects that the network has partitioned — the correct path that keeps split brain from ever occurring in a well-configured cluster.

flowchart TD
    A["Network partition detected"] --> B{"Can node reach a quorum?"}
    B -->|"Yes — majority reachable"| C["Continue normal operation"]
    B -->|"No — minority partition"| D["Reject all client writes"]
    D --> E["Increment election timeout counter"]
    E --> F{"Election timeout expired?"}
    F -->|"No"| E
    F -->|"Yes"| G["Send RequestVote with incremented term"]
    G --> H{"Votes received from majority?"}
    H -->|"No"| I["Remain follower — cannot become leader"]
    H -->|"Yes"| J["Elected leader in new term"]
    J --> K{"STONITH configured?"}
    K -->|"Yes"| L["Fence old leader node via IPMI or PDU"]
    K -->|"No"| M["Issue new fencing token to self"]
    L --> N["Old leader cannot write — hardware isolated"]
    M --> O["Old zombie writes rejected by token high-water mark"]
    N --> P["Split brain prevented"]
    O --> P

This diagram illustrates two layers of defence working in sequence. The quorum check is the primary gate: a minority partition simply cannot accumulate enough votes to elect a leader, so split brain never gets started. STONITH and fencing tokens are the backup layer for cases where the primary gate fails — such as a misconfigured quorum value, or the zombie leader scenario described later.


🌍 Real-World Incidents: When Split Brain Escaped into Production

MongoDB's Two-Primary Bug (2012)

In MongoDB's replica set implementation prior to version 2.6, a specific race condition in the election protocol could produce two simultaneous primaries under high network jitter. Both primaries accepted writes. When the bug surfaced publicly, the MongoDB team confirmed that any data written exclusively to the deposed primary was permanently lost, because its oplog entries were simply overwritten during resync. The fix required both a protocol correction — tightening the quorum check — and a recommendation that all operators use w=majority acknowledgement.

Redis Sentinel with quorum=1 in a Three-Sentinel Setup

Redis Sentinel uses a quorum value to decide when a master is truly unreachable and should be failed over. A quorum of 1 means a single Sentinel can trigger a failover unilaterally. In a three-Sentinel deployment with quorum=1, a Sentinel that experiences a local network issue (not the master) can falsely declare the master down and promote a replica — while the original master continues accepting writes from clients that can still reach it. The result is two masters, both acknowledging writes, with no automatic reconciliation. The correct quorum for three Sentinels is 2 (majority), ensuring at least one other Sentinel must agree before a failover proceeds.

Elasticsearch minimum_master_nodes Misconfiguration

Elasticsearch clusters prior to version 7.0 required manual configuration of minimum_master_nodes — the quorum for master election. The formula is ⌊N/2⌋ + 1, so a five-node cluster needs the value set to 3. Teams often left it at the default of 1 for ease of initial setup and forgot to update it as the cluster grew. With minimum_master_nodes=1, any single node could declare itself master independently, producing multiple masters. Elasticsearch 7.0 replaced this manual setting with automatic quorum calculation specifically because misconfiguration was so common and the consequences so severe.


⚖️ Comparing Defences: Quorum, Fencing, STONITH, Consensus, and Manual Procedures

Each split-brain defence has a different cost profile and failure envelope. Use this table to choose the right layer for your architecture:

| Defence Mechanism | Split Brain Prevention | Write Latency Cost | Operational Complexity | Failure of the Defence |
|---|---|---|---|---|
| Quorum writes | Strong — requires majority for commit | Moderate (extra round trips) | Low — configured once | Misconfigured quorum value (e.g., quorum=1) |
| Fencing tokens | Strong — stale leaders self-reject | Negligible (local token check) | Moderate — requires token infrastructure | Token service itself becomes unavailable |
| STONITH | Very strong — physically isolates node | None on write path | High — hardware integration required | STONITH device fails or is misconfigured |
| Raft / Paxos consensus | Protocol-level — mathematically guaranteed | Moderate (log replication latency) | Low — built into managed engines | Bugs in implementation (rare but documented) |
| Manual runbook | Weak — depends on operator response time | None | Very high — requires 24/7 on-call discipline | Human error, alert fatigue, slow response |

No single mechanism is sufficient on its own. Production-grade systems layer quorum writes (primary gate) with fencing tokens or STONITH (backup gate for zombie scenarios).


🧭 Choosing Your Split Brain Defence: A Decision Guide

| Situation | Recommended Approach |
|---|---|
| Running a managed consensus engine (etcd, ZooKeeper, CockroachDB) | Trust the built-in Raft/Paxos quorum — configure election timeouts and heartbeat intervals; no additional fencing needed |
| Self-managed replica set (MongoDB, PostgreSQL Patroni) | Set w=majority for all writes + configure STONITH via IPMI or cloud provider fencing agent |
| Redis with Sentinel | Set quorum to a majority (⌊N/2⌋ + 1) of N Sentinels; never use quorum=1 in any multi-Sentinel setup |
| Elasticsearch / OpenSearch | Use version 7.0+ with auto-quorum; pin cluster.initial_master_nodes at bootstrap only |
| Multi-region active–active | Accept eventual consistency explicitly (use CRDTs or application-level conflict resolution) — quorum across regions has a latency cost that may be unacceptable |
| Kubernetes control plane | etcd must run on an odd number of nodes (3 or 5), never 2 or 4; enable --auto-compaction-retention |
| Low-latency trading or payment systems | Raft + synchronous replication + STONITH; the latency cost is non-negotiable — async replication is not acceptable |

🧪 The Zombie Leader: When a GC Pause Outlasts the Election Timeout

Split brain does not require a network partition in the traditional sense. It can emerge from a single node's extended pause — the zombie leader scenario.

Here is the sequence: the leader node enters a stop-the-world garbage collection pause. The JVM (or Go runtime) stops all application threads, including the thread that sends Raft heartbeats. The pause lasts 800 ms. The election timeout is 500 ms. Before the GC pause finishes, the followers time out, elect a new leader, and begin operating in Term 2.

The original leader resumes from the GC pause completely unaware that 800 ms have elapsed. From its perspective, it just processed some garbage and is ready to continue. It has no idea that it has been deposed. It begins processing queued client requests and attempting to append log entries to followers — all in Term 1.

Every follower that receives a Term 1 message rejects it immediately, because they are now operating in Term 2. The zombie leader's writes go nowhere. This is Raft's term-based fencing doing exactly what it was designed for.

But the scenario becomes dangerous if the zombie leader is writing to a resource that is not protected by term-based fencing. A leader that directly writes to an external database, a message queue, or a file system using cached credentials from before its term expired can produce writes that the cluster never agreed to — and those systems have no way to detect the stale term.

The defence is a fencing token high-water mark: every time a new leader is elected, the token service issues a monotonically increasing token. Any external resource that accepts writes validates the incoming token against the highest token it has seen. If the zombie leader presents token 4 and the resource already accepted token 5, the write is rejected. This is why distributed lock services like Chubby and etcd always return monotonically increasing lease revisions.
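A minimal sketch of the receiving side, with a made-up FencedStore class standing in for any external resource that honours fencing tokens:

class FencedStore:
    """Rejects any write whose fencing token is below the high-water mark."""

    def __init__(self) -> None:
        self.highest_token_seen = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token_seen:
            return False                    # stale token: a zombie leader is writing
        self.highest_token_seen = token
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(token=5, key="balance", value="750"))  # True: current leader
print(store.write(token=4, key="balance", value="500"))  # False: token 4 is stale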

The operational implication: tune GC pause budgets below your election timeout. In JVM-based systems, use G1GC or ZGC with soft pause targets well below the Raft heartbeat interval. Monitor GC pause duration as a first-class SLO, not a diagnostic detail.


🛠️ etcd and Redis Sentinel: Configuring Split-Brain Defences in Practice

etcd: Heartbeat Interval, Election Timeout, and Quorum

etcd is a distributed key-value store built entirely around Raft and is the backing store for Kubernetes cluster state. Its split-brain resistance comes from correct timing configuration and an odd-node cluster size. The following etcd cluster member configuration demonstrates the critical parameters:

# etcd member configuration — /etc/etcd/etcd.conf.yml
name: etcd-node-1
data-dir: /var/lib/etcd

# Cluster membership
initial-cluster: etcd-node-1=https://10.0.0.1:2380,etcd-node-2=https://10.0.0.2:2380,etcd-node-3=https://10.0.0.3:2380

# Raft timing — heartbeat must be << election timeout
# Rule: election-timeout >= 10 * heartbeat-interval
heartbeat-interval: 100      # milliseconds — leader sends heartbeat every 100ms
election-timeout: 1000       # milliseconds — follower times out after 1000ms without heartbeat

# Quorum is automatic for a 3-node cluster: majority = 2
# Avoid even node counts — a 2-node cluster tolerates no failures, and a 4-node
# cluster tolerates only one (no better than 3 nodes)

# Client TLS — prevents rogue nodes from joining and disrupting quorum
client-transport-security:
  cert-file: /etc/etcd/server.crt
  key-file: /etc/etcd/server.key
  trusted-ca-file: /etc/etcd/ca.crt
  client-cert-auth: true

The ratio rule — election timeout must be at least ten times the heartbeat interval — ensures that transient network jitter does not trigger spurious elections. Spurious elections are not split brain, but they are the precursor: each unnecessary election increments the term counter and causes a brief window of unavailability while the cluster re-stabilises.

Redis Sentinel: Setting the Correct Quorum

Redis Sentinel monitors a Redis master and promotes a replica on failure. The quorum value controls how many Sentinels must agree that the master is unreachable before a failover begins. For a three-Sentinel deployment, the value must be 2:

# /etc/redis/sentinel.conf — three-Sentinel deployment

# Monitor the master; quorum = 2 means 2 of 3 Sentinels must agree
sentinel monitor mymaster 10.0.0.10 6379 2

# How long (ms) master must be unreachable before marking it subjectively down
sentinel down-after-milliseconds mymaster 5000

# How long (ms) to wait for a replica to complete promotion
sentinel failover-timeout mymaster 60000

# How many replicas can be reconfigured simultaneously after failover
sentinel parallel-syncs mymaster 1

Setting quorum 1 on a three-Sentinel setup is the single most common Redis split-brain misconfiguration in production. With quorum 1, any Sentinel experiencing a local network issue can independently trigger a failover, resulting in two masters. With quorum 2, a single misbehaving Sentinel is outvoted.

For a full deep-dive on Redis Sentinel architecture, including cross-datacenter Sentinel placement, see a planned companion post on Redis high availability patterns.


📚 Lessons Hard-Learned from Split Brain in Production

A quorum of 1 is not a quorum. Any system configured with quorum=1 — whether Redis Sentinel, Elasticsearch minimum_master_nodes, or a custom consensus protocol — has split-brain protection equivalent to no quorum at all. Audit every quorum setting every time you change cluster size.

GC pause duration is a distributed systems parameter, not just a JVM tuning concern. If your GC pause can exceed your election timeout, you have a zombie leader risk. Track 99th-percentile GC pause time in your dashboards alongside replication lag and leader election frequency.

Audit every write path for term or token checks. Raft protects the consensus log, but any write path that bypasses the log — direct database writes, external API calls made by the leader — is invisible to Raft's term fencing. These paths need fencing tokens.

STONITH is a backup, not a primary defence. STONITH (Shoot The Other Node In The Head — physically powering off or network-isolating a node) is reliable when it works, but STONITH devices can fail, be misconfigured, or cause split-brain of their own if both nodes attempt to fence each other simultaneously (the "fencing duel"). It should be your second layer, not your first.

Test failover, not uptime. A cluster that has never failed over in production is a cluster whose failover behaviour is unknown. Run quarterly chaos tests — kill the leader, partition nodes, inject GC pauses — and measure recovery time and data loss against your RPO/RTO targets. Uptime metrics tell you nothing about your split-brain exposure.

Odd node counts are a correctness requirement, not a preference. A two-node cluster has no majority: losing one node leaves you with 50%, which is not enough to elect a leader or proceed with quorum writes. A four-node cluster wastes a node — it can only tolerate one failure, the same as a three-node cluster, but costs more. Always deploy 3, 5, or 7 nodes.
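The arithmetic behind that rule, as a quick illustrative calculation:

# Failures tolerated by a quorum cluster of N nodes: N - (floor(N/2) + 1).
def failures_tolerated(n: int) -> int:
    return n - (n // 2 + 1)

for n in range(2, 8):
    print(f"{n} nodes: quorum {n // 2 + 1}, tolerates {failures_tolerated(n)} failure(s)")
# 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3: even sizes buy nothing extra.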


📌 TLDR: Six Things to Remember About Split Brain

  • Split brain happens when a network partition causes two nodes to simultaneously believe they are the authoritative leader, accepting conflicting writes with no way to merge them deterministically.
  • Quorum math is the primary defence: in a cluster of N nodes, a majority requires ⌊N/2⌋ + 1 votes. Two partitions cannot simultaneously each hold a majority — they would need more than N nodes combined.
  • Raft term numbers are lightweight fencing: any node receiving a message from a higher term immediately steps down. Stale leaders self-evict the moment connectivity returns.
  • The zombie leader scenario arises when a GC pause outlasts the election timeout, deposing a leader that does not know it has been replaced. Fencing tokens with monotonically increasing values are the correct defence.
  • Every quorum value must be audited: quorum=1 in Redis Sentinel, minimum_master_nodes=1 in old Elasticsearch, or any even-number cluster size all create invisible split-brain exposure that only manifests under failure.
  • Layer your defences: quorum writes as the primary gate, fencing tokens for external write paths, STONITH as hardware-level backup, and chaos-tested failover runbooks as operational assurance.

Written by Abstract Algorithms (@abstractalgorithms)