
System Design Replication and Failover: Keep Services Alive When a Primary Dies

A clear guide to replicas, failover detection, replica lag, and the trade-offs behind reliable database designs.

Abstract Algorithms
· 11 min read

TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupting data. If one database going down takes your product down, replication is the next design step, and failover is how you make that redundancy usable.

📖 Why Replication Shows Up in Every Serious System Design Conversation

In a beginner system design interview, a single database is often a perfectly fine starting point. It is easy to explain, easy to operate, and often good enough for the first version of a product.

The trouble starts when that one database becomes both the storage layer and the single point of failure. If the node crashes, the application can still be healthy, the cache can still be warm, and the API can still be reachable, but users will experience an outage because the source of truth is gone.

That is why replication usually enters the conversation right after the simple design. If you came here from System Design Interview Basics, this is the deeper follow-up to the line "add replication for reliability."

Replication answers one question: how do we keep another copy of the same data available elsewhere? Failover answers the next question: how do we decide which copy becomes authoritative when the primary fails?

| Without replication | With replication |
| --- | --- |
| One node holds all writes and all reads | A primary handles writes while replicas provide redundancy |
| Any crash can cause a full outage | A crash can trigger promotion of a healthy replica |
| Reads compete with writes on one machine | Reads can be spread across replicas |
| Recovery depends on backups alone | Recovery can use already-running replicas |

The crucial interview insight is that replication is not just about uptime. It also affects latency, read scale, consistency, recovery complexity, and operational risk.

🔁 Primary, Replicas, and the Three Questions You Must Answer

Every replicated system has to answer three practical questions.

Question 1: Where do writes go? In the simplest model, all writes go to a single primary node. The primary serializes changes and ships them to replicas.

Question 2: Where do reads go? Some systems send all reads to the primary for strong consistency. Others send read traffic to replicas to reduce load. That improves scale, but it creates the risk of stale reads when replicas lag.

Question 3: Who decides that the primary is dead? This is the core of failover. If every node can promote itself independently, split-brain becomes possible. A good design needs a leader election rule, quorum, or external coordinator.

Here is the common vocabulary:

| Term | Meaning | Why it matters |
| --- | --- | --- |
| Primary | The authoritative node for writes | Keeps write ordering simple |
| Replica | A follower that replays changes from the primary | Adds redundancy and read scale |
| Synchronous replication | Primary waits for one or more replicas before acknowledging commit | Safer, but slower writes |
| Asynchronous replication | Primary acknowledges first and replicas catch up later | Faster, but can lose recent writes on failover |
| Failover | Promote a replica and redirect traffic | Restores service after primary loss |

This is why replication is always a trade-off discussion. The more aggressively you protect against data loss, the more latency and coordination you usually add.

⚙️ How Writes, Replicas, and Promotion Actually Work

At a high level, replicated databases usually follow a log-driven pattern.

  1. A client sends a write to the primary.
  2. The primary appends the change to a durable log.
  3. The primary applies the change locally.
  4. Replicas receive the change log and replay it in order.
  5. Reads may go to the primary, to replicas, or to both.
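The log-driven pattern above can be sketched in a few lines. This is a toy model, not any real database's API: `Primary` stands in for a node with a WAL/binlog, and `Replica` converges by replaying the same ordered change stream.

```python
# Toy model of log-driven replication: the primary appends every change
# to an ordered log, and replicas replay that log to converge on its state.

class Primary:
    def __init__(self):
        self.log = []   # durable, ordered change log (WAL/binlog stand-in)
        self.data = {}  # current state

    def write(self, key, value):
        self.log.append((key, value))  # 1. append to the log first
        self.data[key] = value         # 2. then apply locally

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # log position this replica has replayed up to

    def catch_up(self, log):
        # Replay only the entries we have not seen yet, in log order.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

primary = Primary()
replica = Replica()

primary.write("order_123", "pending")
primary.write("order_123", "paid")
replica.catch_up(primary.log)

assert replica.data == primary.data  # same ordered log -> same state
```

Because replicas apply the identical ordered stream, any replica that has consumed the whole log is byte-for-byte interchangeable with the primary's state, which is exactly what makes promotion possible.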

That sounds simple until you trace the timing.

| Step | What happens | Operational consequence |
| --- | --- | --- |
| t0 | Client sends `INSERT order_123` | User is waiting for confirmation |
| t1 | Primary writes to its local WAL/binlog | Durability starts here |
| t2 | Primary optionally waits for replica acknowledgment | Higher safety, higher latency |
| t3 | Primary replies success to client | User sees commit complete |
| t4 | Replica applies the change | Read replicas may still be behind |

If the primary crashes between t3 and t4, the answer depends on the replication mode. With asynchronous replication, the newest write may not exist on any surviving replica. With synchronous replication, the client waited longer, but the system is more likely to preserve that write during failover.
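That timing gap can be made concrete with a minimal simulation. This is deliberately simplified (lists stand in for logs, a comment stands in for the crash), but it shows why an acknowledged write can vanish under asynchronous replication and survive under synchronous replication:

```python
# Simulate the t3/t4 gap: with asynchronous replication, a write can be
# acknowledged to the client and then lost if the primary dies before
# shipping it; with synchronous replication, the ack implies the replica
# already holds the entry.

def commit(primary_log, replica_log, entry, synchronous):
    primary_log.append(entry)        # t1: durable on the primary
    if synchronous:
        replica_log.append(entry)    # t2: wait for replica before acking
    return "ack"                     # t3: client sees success
    # In async mode, shipping to the replica would happen after the ack (t4).

# Async: primary crashes right after acking, before the replica caught up.
primary, replica = [], []
commit(primary, replica, "order_123", synchronous=False)
# ... primary crashes here; failover promotes the replica ...
assert "order_123" not in replica    # acknowledged write is gone

# Sync: the ack was only sent once the replica had the entry.
primary, replica = [], []
commit(primary, replica, "order_123", synchronous=True)
assert "order_123" in replica        # failover preserves the write
```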

That single timing gap is the reason replication discussions matter in interviews. A strong answer does not just say "I would add replicas." It says whether those replicas are protecting availability, read throughput, or data durability.

🧠 Deep Dive: What Makes Failover Safe Instead of Chaotic

Failover is not merely switching traffic to another box. It is a correctness problem disguised as an availability feature.

The Internals: Write-Ahead Logs, Heartbeats, and Promotion

Most production databases replicate ordered change records rather than shipping whole tables after every write. PostgreSQL uses a write-ahead log. MySQL uses binlogs. The idea is the same: replicas apply the same ordered stream of changes to converge toward primary state.

On top of that replication stream, the system needs health signals. Common signals include:

  • Heartbeats between nodes.
  • A replication lag metric such as seconds behind primary.
  • A quorum-based lease or vote so only one node can become leader.
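A heartbeat check can be sketched as a staleness test. The thresholds below are illustrative guesses, not recommendations; the point is that a detector should tolerate a few missed beats so a merely slow primary is not declared dead:

```python
# Sketch of heartbeat-based failure detection: the primary is only
# *suspected* dead after several missed heartbeats, which guards against
# confusing a slow node with a failed one. Thresholds are illustrative.

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
MISSED_BEFORE_SUSPECT = 3  # tolerate transient slowness

def primary_suspect(last_heartbeat_at, now):
    """True once enough heartbeats were missed to suspect primary failure."""
    return (now - last_heartbeat_at) > HEARTBEAT_INTERVAL * MISSED_BEFORE_SUSPECT

assert primary_suspect(last_heartbeat_at=100.0, now=101.5) is False  # just slow
assert primary_suspect(last_heartbeat_at=100.0, now=104.5) is True   # suspect
```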

When the primary appears unhealthy, a failover controller typically asks three questions:

  1. Is the primary really unreachable, or just slow?
  2. Which replica is most up to date?
  3. Can we promote exactly one replica without creating split-brain?

That third question is the dangerous one. If two replicas promote themselves in different network partitions, clients may write divergent states. Merging that later is painful or impossible depending on the workload.

This is why many systems use a consensus service or a strong election protocol. Even if you do not mention every technology by name in an interview, you should explain that leader promotion needs coordination, not guesswork.
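The coordination requirement can be sketched as a tiny promotion rule. This is a simplified stand-in for a real consensus protocol (field names like `log_pos` are illustrative), but it captures both safety checks: a majority must agree the primary is down, and exactly one replica, the freshest, gets promoted.

```python
# Sketch of a failover controller: require a quorum of observers to agree
# the primary is unreachable, then promote the single most up-to-date
# replica. "log_pos" is an illustrative stand-in for a WAL/binlog position.

def choose_new_primary(replicas, down_votes, total_observers):
    # 1. Quorum check: a strict majority must agree the primary is down,
    #    so one partitioned observer cannot trigger promotion on its own.
    if down_votes <= total_observers // 2:
        return None  # not enough agreement; do nothing (avoids split-brain)
    # 2. Freshness check: promote the replica with the highest replicated
    #    log position, so the fewest acknowledged writes are lost.
    return max(replicas, key=lambda r: r["log_pos"])["name"]

replicas = [
    {"name": "replica_a", "log_pos": 120},
    {"name": "replica_b", "log_pos": 118},
]
assert choose_new_primary(replicas, down_votes=1, total_observers=3) is None
assert choose_new_primary(replicas, down_votes=2, total_observers=3) == "replica_a"
```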

Performance Analysis: Commit Latency, Replica Lag, and Read Scaling

Replication changes both write and read performance.

Write latency: synchronous replication increases the commit path because the primary waits for confirmation from one or more replicas. If the network is slow or one zone is unhealthy, p95 and p99 write latency rise.

Read scale: replicas are great for read-heavy workloads such as catalogs, dashboards, and reporting. They let you separate hot reads from critical writes.

Replica lag: asynchronous replicas can trail the primary by milliseconds or seconds. That means a user could create an object successfully and then not see it immediately if the next read goes to a lagging replica.

| Metric | What to watch | Why it matters |
| --- | --- | --- |
| Commit latency | p95 and p99 write time | Replication can slow the critical path |
| Replica lag | Seconds or log positions behind primary | Determines staleness of reads |
| Read QPS per replica | Distribution of traffic | Shows whether replicas are actually offloading the primary |
| Failover time | Detection + promotion + reroute | Directly affects user-visible downtime |
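Two of these metrics reduce to simple arithmetic. The sketch below uses a nearest-rank percentile and a bytes-behind lag estimate; both are back-of-envelope formulas for illustration, not any monitoring system's exact method.

```python
import math

# Back-of-envelope monitoring math: replica lag estimated from log
# positions, and tail latency from a sample of commit times.

def replica_lag_seconds(primary_pos, replica_pos, bytes_per_second):
    """Approximate lag from how many log bytes the replica trails."""
    return (primary_pos - replica_pos) / bytes_per_second

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]

lat_ms = [4, 5, 5, 6, 6, 7, 7, 8, 9, 120]  # one slow replicated commit
assert percentile(lat_ms, 50) == 6     # the median looks healthy
assert percentile(lat_ms, 99) == 120   # the tail hides in p99, not the average
assert replica_lag_seconds(10_000, 4_000, bytes_per_second=2_000) == 3.0
```

This is why the table stresses p95/p99 rather than averages: a single synchronous confirmation stall shows up only in the tail.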

For interviews, the cleanest summary is this: replication improves resilience and read scale, but it forces you to choose how much stale data and write latency you can tolerate.

📊 The Failover Journey From Normal Writes to Recovery

```mermaid
flowchart TD
  A[Client sends write] --> B[Primary appends WAL or binlog]
  B --> C[Primary applies change locally]
  C --> D[Replica receives and replays log]
  D --> E{Primary healthy?}
  E -->|Yes| F[Keep serving writes from primary]
  E -->|No| G[Failover controller checks replica freshness]
  G --> H[Promote healthiest replica]
  H --> I[Route new writes to promoted node]
```

This diagram hides a lot of detail, but it captures the correct story for an interview: writes are ordered, replicated, evaluated for freshness, and then rerouted through a controller that promotes one healthy node.

🌍 Real-World Applications: Catalogs, Payment Metadata, and Control Planes

Replication appears in many different forms depending on the workload.

E-commerce catalog: product reads dominate writes, so read replicas help scale traffic while the primary handles inventory updates and editorial changes.

Payment metadata systems: even if money movement uses stronger guarantees elsewhere, supporting metadata such as transaction views, audit logs, or reconciliation dashboards often needs highly available replicas for reporting and operations.

SaaS control planes: tenants expect admin dashboards to keep working during partial failures. Replicas and automated failover keep configuration reads available while preserving a well-defined primary for updates.

The pattern stays the same: a workload that needs durability, moderate write coordination, and meaningful read traffic benefits from replication.

⚖️ Trade-offs & Failure Modes: The Price of Reliability

| Trade-off or failure mode | What goes wrong | First mitigation |
| --- | --- | --- |
| Replica lag | Users read stale data | Read recent writes from primary or use read-after-write routing |
| Split-brain | Two nodes accept writes as leader | Use quorum or external consensus |
| Slow synchronous commit | Writes become slower at peak | Replicate to one synchronous follower, others async |
| Failover to stale replica | Recent writes disappear | Promote the most up-to-date replica only |
| Manual failover playbooks | Recovery takes too long | Automate promotion and traffic switch |
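Read-after-write routing, the first mitigation above, can be sketched as a sticky-read rule: pin a user's reads to the primary for a short window after their own write. The 5-second window is an illustrative guess at worst-case replica lag, not a recommended value.

```python
import time

# Sketch of read-after-write routing: after a user writes, route that
# user's reads to the primary for a short window so they never observe
# a replica that has not caught up yet. Window size is illustrative.

STICKY_WINDOW_SECONDS = 5.0
last_write_at = {}  # user_id -> timestamp of that user's most recent write

def record_write(user_id, now=None):
    last_write_at[user_id] = now if now is not None else time.time()

def choose_read_target(user_id, now=None):
    now = now if now is not None else time.time()
    wrote_recently = now - last_write_at.get(user_id, float("-inf")) < STICKY_WINDOW_SECONDS
    return "primary" if wrote_recently else "replica"

record_write("alice", now=100.0)
assert choose_read_target("alice", now=102.0) == "primary"  # replica may lag
assert choose_read_target("alice", now=110.0) == "replica"  # window passed
assert choose_read_target("bob", now=102.0) == "replica"    # never wrote
```

The design choice here is deliberate: only writers pay the primary-read cost, so the bulk of read traffic still lands on replicas.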

The mature interview answer is not "replication solves outages." It is "replication changes the outage shape from hard downtime into a coordination problem I can manage with failover rules."

🧭 Decision Guide: Which Replication Pattern Fits the Problem?

| Situation | Recommendation |
| --- | --- |
| Read-heavy app with moderate consistency needs | Single primary with read replicas |
| Money-critical workflow | Single primary with stronger synchronous confirmation |
| Global writes from many regions | Consider multi-primary only if conflicts are acceptable or carefully resolved |
| Small startup app | Start with backup + restore, add replication when availability becomes a real requirement |

The practical interview trick is sequencing. Start simple. Add read replicas when the primary is overloaded by reads. Add automated failover when downtime costs become meaningful.

🧪 Practical Example: Evolving a URL Shortener Beyond One Database

Suppose your first URL shortener stores everything in one relational database. That is a good first answer.

But traffic grows:

  • Redirect reads climb into the tens of thousands per second.
  • Marketing campaigns create bursty hot links.
  • The business now cares about uptime because every outage breaks customer campaigns.

The first upgrade path looks like this:

  1. Keep one primary for new short-link writes.
  2. Add read replicas for redirect lookups that do not require the newest possible state.
  3. Put hot redirects behind a cache.
  4. Add automated failover so the primary is not a single point of outage.
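The upgraded read path from the list above can be sketched end to end. Everything here is illustrative (in-memory dicts stand in for the cache, a replica, and the primary); the shape to notice is that writes go only to the primary while redirects try the cache first and fall back to a replica.

```python
# Sketch of the upgraded URL-shortener read path: cache first, then a
# read replica, with the primary reserved for writes. Names are toy
# stand-ins, not a real storage API.

cache = {}                                           # hot redirects
replica_db = {"abc": "https://example.com/launch"}   # may lag the primary
primary_db = dict(replica_db)                        # authoritative store

def create_link(slug, url):
    primary_db[slug] = url       # writes always go to the primary

def resolve(slug):
    if slug in cache:            # 1. hot path: cache absorbs bursty links
        return cache[slug]
    url = replica_db.get(slug)   # 2. replicas serve the redirect fan-out
    if url is not None:
        cache[slug] = url        # warm the cache for the next burst
    return url

assert resolve("abc") == "https://example.com/launch"
assert "abc" in cache            # repeat lookups never touch a database
```

Note the built-in trade-off: a freshly created link may briefly miss on a lagging replica, which is exactly the replica-lag discussion from earlier applied to this workload.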

That sequence is interview gold because it shows evolution rather than premature complexity. It also connects directly back to System Design Interview Basics: start simple, then justify the next layer only when the bottleneck appears.

📚 Lessons Learned

  • Replication and failover are separate ideas: one copies data, the other restores authority.
  • Read replicas help scale reads, but they do not guarantee fresh data.
  • Asynchronous replication reduces write latency but can lose the newest acknowledged data during a crash.
  • Safe failover depends on leader election and freshness checks, not just health probes.
  • A strong interview answer explains both availability gains and correctness risks.

📌 Summary & Key Takeaways

  • Replication keeps multiple copies of data so the system can survive node loss.
  • Failover promotes a healthy replica and reroutes traffic after primary failure.
  • Synchronous replication favors safety; asynchronous replication favors speed.
  • Replica lag, split-brain, and stale failover targets are the main risks to discuss.
  • The best interview answers introduce replication only when reliability or read scale actually require it.

📝 Practice Quiz

  1. Why do systems add read replicas before they add database sharding in many read-heavy workloads?

A) Replicas are usually simpler and offload reads without changing the write model
B) Replicas eliminate the need for a primary database
C) Replicas guarantee globally strong consistency at zero cost

Correct Answer: A

  2. What is the main danger of asynchronous replication during failover?

A) It always doubles read latency
B) The promoted replica may not have the latest acknowledged writes
C) It prevents any scaling of reads

Correct Answer: B

  3. What problem is split-brain describing?

A) Two nodes simultaneously acting as leader and accepting conflicting writes
B) A replica serving stale reads
C) A backup being too old to restore

Correct Answer: A

  4. Open-ended challenge: if your product requires near-zero data loss and low write latency, what replication compromise would you propose, and what business trade-off would that imply?