System Design Replication and Failover: Keep Services Alive When a Primary Dies
A clear guide to replicas, failover detection, replica lag, and the trade-offs behind reliable database designs.
TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupting data.
If one database going down takes your product down, replication is the next design step. Failover is how you make that redundancy usable.
Why Replication Shows Up in Every Serious System Design Conversation
In a beginner system design interview, a single database is often a perfectly fine starting point. It is easy to explain, easy to operate, and often good enough for the first version of a product.
The trouble starts when that one database becomes both the storage layer and the single point of failure. If the node crashes, the application can still be healthy, the cache can still be warm, and the API can still be reachable, but users will experience an outage because the source of truth is gone.
That is why replication usually enters the conversation right after the simple design. If you came here from System Design Interview Basics, this is the deeper follow-up to the line "add replication for reliability."
Replication answers one question: how do we keep another copy of the same data available elsewhere? Failover answers the next question: how do we decide which copy becomes authoritative when the primary fails?
| Without replication | With replication |
| --- | --- |
| One node holds all writes and all reads | A primary handles writes while replicas provide redundancy |
| Any crash can cause a full outage | A crash can trigger promotion of a healthy replica |
| Reads compete with writes on one machine | Reads can be spread across replicas |
| Recovery depends on backups alone | Recovery can use already-running replicas |
The crucial interview insight is that replication is not just about uptime. It also affects latency, read scale, consistency, recovery complexity, and operational risk.
Primary, Replicas, and the Three Questions You Must Answer
Every replicated system has to answer three practical questions.
Question 1: Where do writes go? In the simplest model, all writes go to a single primary node. The primary serializes changes and ships them to replicas.
Question 2: Where do reads go? Some systems send all reads to the primary for strong consistency. Others send read traffic to replicas to reduce load. That improves scale, but it creates the risk of stale reads when replicas lag.
Question 3: Who decides that the primary is dead? This is the core of failover. If every node can promote itself independently, split-brain becomes possible. A good design needs a leader election rule, quorum, or external coordinator.
Here is the common vocabulary:
| Term | Meaning | Why it matters |
| --- | --- | --- |
| Primary | The authoritative node for writes | Keeps write ordering simple |
| Replica | A follower that replays changes from the primary | Adds redundancy and read scale |
| Synchronous replication | Primary waits for one or more replicas before acknowledging commit | Safer, but slower writes |
| Asynchronous replication | Primary acknowledges first and replicas catch up later | Faster, but can lose recent writes on failover |
| Failover | Promote a replica and redirect traffic | Restores service after primary loss |
This is why replication is always a trade-off discussion. The more aggressively you protect against data loss, the more latency and coordination you usually add.
How Writes, Replicas, and Promotion Actually Work
At a high level, replicated databases usually follow a log-driven pattern.
- A client sends a write to the primary.
- The primary appends the change to a durable log.
- The primary applies the change locally.
- Replicas receive the change log and replay it in order.
- Reads may go to the primary, to replicas, or to both.
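The steps above can be sketched as a tiny log-shipping model. This is an illustrative sketch, not any real database's API: the class and field names (`Primary`, `Replica`, `applied`) are invented for the example, and a plain Python list stands in for the durable WAL/binlog.

```python
# Minimal sketch of log-shipping replication (illustrative names, not a real DB API).
class Primary:
    def __init__(self):
        self.log = []        # ordered change records (stands in for a WAL/binlog)
        self.state = {}

    def write(self, key, value):
        self.log.append((key, value))   # 1. append to the durable log first
        self.state[key] = value         # 2. then apply locally


class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0     # log position this replica has replayed up to

    def replay(self, log):
        # Replay only the records this replica has not yet seen, in order.
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)


primary = Primary()
replica = Replica()
primary.write("order_123", "pending")
replica.replay(primary.log)
assert replica.state == primary.state   # replica converges to primary state
```

The ordering matters: because the replica replays the same log in the same order, it converges to the primary's state without ever seeing the primary's tables directly.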
That sounds simple until you trace the timing.
| Step | What happens | Operational consequence |
| --- | --- | --- |
| t0 | Client sends INSERT order_123 | User is waiting for confirmation |
| t1 | Primary writes to its local WAL/binlog | Durability starts here |
| t2 | Primary optionally waits for replica acknowledgment | Higher safety, higher latency |
| t3 | Primary replies success to client | User sees commit complete |
| t4 | Replica applies the change | Read replicas may still be behind |
If the primary crashes between t3 and t4, the answer depends on the replication mode. With asynchronous replication, the newest write may not exist on any surviving replica. With synchronous replication, the client waited longer, but the system is more likely to preserve that write during failover.
That single timing gap is the reason replication discussions matter in interviews. A strong answer does not just say "I would add replicas." It says whether those replicas are protecting availability, read throughput, or data durability.
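That timing gap can be made concrete with a toy commit path. This sketch is purely illustrative (the `commit` function and its log lists are invented for the example); it shows why a crash between t3 and t4 loses an acknowledged write under asynchronous replication but not under synchronous replication.

```python
def commit(primary_log, replica_log, value, synchronous):
    """Toy commit path: returns once the client has been acknowledged."""
    primary_log.append(value)          # t1: durable on the primary
    if synchronous:
        replica_log.append(value)      # t2: wait for replica ack before replying
    # t3: client sees success here.
    # t4 (replica apply) happens later in async mode, and may never happen
    # if the primary crashes right after acknowledging.


# Async: the primary crashes between t3 and t4.
p_log, r_log = [], []
commit(p_log, r_log, "INSERT order_123", synchronous=False)
# The primary is gone; only the replica survives the failover.
assert "INSERT order_123" not in r_log   # acknowledged write is lost

# Sync: the same crash cannot lose the acknowledged write.
p_log, r_log = [], []
commit(p_log, r_log, "INSERT order_123", synchronous=True)
assert "INSERT order_123" in r_log       # the surviving replica has it
```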
Deep Dive: What Makes Failover Safe Instead of Chaotic
Failover is not merely switching traffic to another box. It is a correctness problem disguised as an availability feature.
The Internals: Write-Ahead Logs, Heartbeats, and Promotion
Most production databases replicate ordered change records rather than shipping whole tables after every write. PostgreSQL uses a write-ahead log. MySQL uses binlogs. The idea is the same: replicas apply the same ordered stream of changes to converge toward primary state.
On top of that replication stream, the system needs health signals. Common signals include:
- Heartbeats between nodes.
- A replication lag metric such as seconds behind primary.
- A quorum-based lease or vote so only one node can become leader.
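A heartbeat check is the simplest of these signals, and it only ever yields suspicion, not certainty. A minimal sketch, assuming an arbitrary 5-second timeout (the constant and function name are illustrative choices, not any system's defaults):

```python
# Heartbeat-based failure detection sketch.
HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before we suspect the primary (illustrative)

def is_suspect(last_heartbeat_at, now):
    """A node is only ever *suspected* dead, never known dead: it may just be slow,
    or the network between us and it may be partitioned."""
    return (now - last_heartbeat_at) > HEARTBEAT_TIMEOUT

assert not is_suspect(last_heartbeat_at=100.0, now=103.0)  # recent heartbeat: healthy
assert is_suspect(last_heartbeat_at=100.0, now=110.0)      # silence past timeout: suspect
```

The asymmetry is the whole point: a missed heartbeat justifies investigation, but promotion needs stronger evidence, which is where quorum comes in.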
When the primary appears unhealthy, a failover controller typically asks three questions:
- Is the primary really unreachable, or just slow?
- Which replica is most up to date?
- Can we promote exactly one replica without creating split-brain?
That third question is the dangerous one. If two replicas promote themselves in different network partitions, clients may write divergent states. Merging that later is painful or impossible depending on the workload.
This is why many systems use a consensus service or a strong election protocol. Even if you do not mention every technology by name in an interview, you should explain that leader promotion needs coordination, not guesswork.
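The coordination rule can be sketched in a few lines: require a majority of the cluster to participate before promoting anyone, and among the voters, promote the freshest replica. This is an illustrative sketch, not a real election protocol such as Raft; the function and field names (`choose_new_primary`, `log_position`) are invented for the example.

```python
# Sketch: quorum-gated promotion of the most up-to-date replica.
def choose_new_primary(votes, cluster_size):
    """votes: {node_id: log_position} reported by the replicas we can reach.

    Promotion is allowed only if a strict majority of the cluster participated,
    which prevents a minority partition from electing its own leader."""
    if len(votes) <= cluster_size // 2:
        return None  # no quorum: refuse to promote rather than risk split-brain
    # Among the voters, promote the replica with the highest replayed log position.
    return max(votes, key=votes.get)


# Majority partition (3 of 5 nodes reachable): promote the freshest replica.
assert choose_new_primary({"r1": 120, "r2": 118, "r3": 119}, cluster_size=5) == "r1"
# Minority partition (2 of 5 nodes reachable): must not promote anyone.
assert choose_new_primary({"r4": 119, "r5": 117}, cluster_size=5) is None
```

Refusing to promote without a quorum trades availability in the minority partition for correctness everywhere, which is exactly the trade-off the interview answer should name.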
Performance Analysis: Commit Latency, Replica Lag, and Read Scaling
Replication changes both write and read performance.
Write latency: synchronous replication increases the commit path because the primary waits for confirmation from one or more replicas. If the network is slow or one zone is unhealthy, p95 and p99 write latency rise.
Read scale: replicas are great for read-heavy workloads such as catalogs, dashboards, and reporting. They let you separate hot reads from critical writes.
Replica lag: asynchronous replicas can trail the primary by milliseconds or seconds. That means a user could create an object successfully and then not see it immediately if the next read goes to a lagging replica.
| Metric | What to watch | Why it matters |
| --- | --- | --- |
| Commit latency | p95 and p99 write time | Replication can slow the critical path |
| Replica lag | Seconds or log positions behind primary | Determines staleness of reads |
| Read QPS per replica | Distribution of traffic | Shows whether replicas are actually offloading the primary |
| Failover time | Detection + promotion + reroute | Directly affects user-visible downtime |
For interviews, the cleanest summary is this: replication improves resilience and read scale, but it forces you to choose how much stale data and write latency you can tolerate.
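One common way to bound that staleness is read-after-write routing: compare the log position of the client's last write with the position the replica has replayed, and fall back to the primary when the replica is behind. A minimal sketch with invented parameter names:

```python
# Sketch of read-after-write routing: a client that just wrote is pinned to
# the primary until a replica's replayed log position catches up.
def route_read(client_last_write_pos, replica_applied_pos):
    """Return which node should serve this read."""
    if replica_applied_pos >= client_last_write_pos:
        return "replica"   # replica already has the client's own writes: safe to offload
    return "primary"       # replica is lagging: route to primary to avoid a stale read

assert route_read(client_last_write_pos=42, replica_applied_pos=40) == "primary"
assert route_read(client_last_write_pos=42, replica_applied_pos=42) == "replica"
```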
The Failover Journey From Normal Writes to Recovery
```mermaid
flowchart TD
A[Client sends write] --> B[Primary appends WAL or binlog]
B --> C[Primary applies change locally]
C --> D[Replica receives and replays log]
D --> E{Primary healthy?}
E -->|Yes| F[Keep serving writes from primary]
E -->|No| G[Failover controller checks replica freshness]
G --> H[Promote healthiest replica]
H --> I[Route new writes to promoted node]
```
This diagram hides a lot of detail, but it captures the correct story for an interview: writes are ordered, replicated, evaluated for freshness, and then rerouted through a controller that promotes one healthy node.
Real-World Applications: Catalogs, Payment Metadata, and Control Planes
Replication appears in many different forms depending on the workload.
E-commerce catalog: product reads dominate writes, so read replicas help scale traffic while the primary handles inventory updates and editorial changes.
Payment metadata systems: even if money movement uses stronger guarantees elsewhere, supporting metadata such as transaction views, audit logs, or reconciliation dashboards often needs highly available replicas for reporting and operations.
SaaS control planes: tenants expect admin dashboards to keep working during partial failures. Replicas and automated failover keep configuration reads available while preserving a well-defined primary for updates.
The pattern stays the same: a workload that needs durability, moderate write coordination, and meaningful read traffic benefits from replication.
Trade-offs & Failure Modes: The Price of Reliability
| Trade-off or failure mode | What goes wrong | First mitigation |
| --- | --- | --- |
| Replica lag | Users read stale data | Read recent writes from primary or use read-after-write routing |
| Split-brain | Two nodes accept writes as leader | Use quorum or external consensus |
| Slow synchronous commit | Writes become slower at peak | Replicate to one synchronous follower, others async |
| Failover to stale replica | Recent writes disappear | Promote the most up-to-date replica only |
| Manual failover playbooks | Recovery takes too long | Automate promotion and traffic switch |
The mature interview answer is not "replication solves outages." It is "replication changes the outage shape from hard downtime into a coordination problem I can manage with failover rules."
Decision Guide: Which Replication Pattern Fits the Problem?
| Situation | Recommendation |
| --- | --- |
| Read-heavy app with moderate consistency needs | Single primary with read replicas |
| Money-critical workflow | Single primary with stronger synchronous confirmation |
| Global writes from many regions | Consider multi-primary only if conflicts are acceptable or carefully resolved |
| Small startup app | Start with backup + restore, add replication when availability becomes a real requirement |
The practical interview trick is sequencing. Start simple. Add read replicas when the primary is overloaded by reads. Add automated failover when downtime costs become meaningful.
Practical Example: Evolving a URL Shortener Beyond One Database
Suppose your first URL shortener stores everything in one relational database. That is a good first answer.
But traffic grows:
- Redirect reads climb into the tens of thousands per second.
- Marketing campaigns create bursty hot links.
- The business now cares about uptime because every outage breaks customer campaigns.
The first upgrade path looks like this:
- Keep one primary for new short-link writes.
- Add read replicas for redirect lookups that do not require the newest possible state.
- Put hot redirects behind a cache.
- Add automated failover so the primary is not a single point of outage.
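The upgrade path above can be sketched as a toy request router. Everything here is illustrative (dictionaries stand in for the cache, the primary, and an asynchronously refreshed replica); the point is the routing order, not the storage.

```python
# Toy request router for the upgraded URL shortener (all names illustrative).
cache = {}        # hot redirects
primary = {}      # authoritative short-link store: all writes go here
replica = {}      # read replica, refreshed asynchronously

def create_link(slug, url):
    primary[slug] = url              # writes always go to the primary

def sync_replica():
    replica.update(primary)          # stand-in for asynchronous log replay

def redirect(slug):
    if slug in cache:                # 1. hot path: serve from cache
        return cache[slug]
    url = replica.get(slug)          # 2. redirect reads go to a replica
    if url is None:
        url = primary.get(slug)      # 3. fall back to primary for very fresh links
    if url is not None:
        cache[slug] = url            # populate the cache for the next request
    return url


create_link("a1", "https://example.com/campaign")
assert redirect("a1") == "https://example.com/campaign"  # fresh link: primary fallback
sync_replica()
assert redirect("a1") == "https://example.com/campaign"  # later reads: cache/replica
```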
That sequence is interview gold because it shows evolution rather than premature complexity. It also connects directly back to System Design Interview Basics: start simple, then justify the next layer only when the bottleneck appears.
Lessons Learned
- Replication and failover are separate ideas: one copies data, the other restores authority.
- Read replicas help scale reads, but they do not guarantee fresh data.
- Asynchronous replication reduces write latency but can lose the newest acknowledged data during a crash.
- Safe failover depends on leader election and freshness checks, not just health probes.
- A strong interview answer explains both availability gains and correctness risks.
Summary & Key Takeaways
- Replication keeps multiple copies of data so the system can survive node loss.
- Failover promotes a healthy replica and reroutes traffic after primary failure.
- Synchronous replication favors safety; asynchronous replication favors speed.
- Replica lag, split-brain, and stale failover targets are the main risks to discuss.
- The best interview answers introduce replication only when reliability or read scale actually require it.
Practice Quiz
- Why do systems add read replicas before they add database sharding in many read-heavy workloads?
A) Replicas are usually simpler and offload reads without changing the write model
B) Replicas eliminate the need for a primary database
C) Replicas guarantee globally strong consistency at zero cost
Correct Answer: A
- What is the main danger of asynchronous replication during failover?
A) It always doubles read latency
B) The promoted replica may not have the latest acknowledged writes
C) It prevents any scaling of reads
Correct Answer: B
- What problem is split-brain describing?
A) Two nodes simultaneously acting as leader and accepting conflicting writes
B) A replica serving stale reads
C) A backup being too old to restore
Correct Answer: A
- Open-ended challenge: if your product requires near-zero data loss and low write latency, what replication compromise would you propose, and what business trade-off would that imply?
Written by Abstract Algorithms