Designing for High Availability: The Road to 99.99% Reliability
Beyond redundancy: How to build systems that survive regional outages and infrastructure failures.
Abstract Algorithms
TLDR: High Availability (HA) is the art of eliminating Single Points of Failure (SPOFs). By using Active-Active redundancy, automated health checks, and global failover via GSLB, you can achieve "Four Nines" (99.99%) reliability, limiting downtime to just 52 minutes per year.
The "Everything Fails" Problem
Imagine you are the Lead Architect for a global payment processor. It's Black Friday, and traffic is peaking. Your system is distributed across 20 nodes, and your database is a robust, sharded SQL cluster. Everything is green on the dashboard. Until it isn't.
At 2:14 PM, a major cloud provider suffers a networking "blip" in your primary region. It's not a total blackout, but latency spikes to 10 seconds, and 30% of packets are being dropped. Because your system was designed with a single-region database and a synchronous write path, every transaction globally starts timing out. For the next 20 minutes, your "robust" system is effectively dead.
The cost? $50,000 in lost transaction fees and a permanent dent in customer trust.
This is the single point of failure (SPOF) trap. Most developers build for scale (handling many users) but forget to build for availability (handling failure). They assume the infrastructure is stable. In reality, in a large enough system, failure is not a possibility; it is a statistical certainty. Designing for High Availability (HA) is the art of ensuring that when a component fails, the system as a whole keeps moving.
Why You Need High Availability (HA)
High Availability is often confused with Reliability, but they are distinct concepts. Think of a car: Reliability is how likely the engine is to start every morning for five years. Availability is whether the car is actually in your driveway and ready to drive when you need to go to work.
- Reliability is the probability that a component performs its function without failure.
- Availability is the percentage of time the system is operational.
You need HA because downtime is expensive. Whether it's direct revenue loss (e-commerce), safety risks (healthcare systems), or regulatory fines (banking), the world no longer accepts "scheduled maintenance" or "regional outages" as excuses. If your system is the backbone of a business, it must be designed to survive the failure of any single component, including an entire data centre.
| Goal | Availability % | Yearly Downtime | Business Context |
|---|---|---|---|
| Basic | 99% | 3.65 days | Internal tools, non-critical CRUD apps |
| Silver | 99.9% | 8.77 hours | Standard SaaS, most consumer apps |
| Gold (HA) | 99.99% | 52.56 minutes | Payment gateways, Auth services, Ad-tech |
| Platinum | 99.999% | 5.26 minutes | Financial exchanges, Telecommunications |
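The downtime column in the table above is just arithmetic on the availability percentage. A quick sketch (figures assume a 365.25-day year, so results differ from the table by rounding):

```python
def yearly_downtime_minutes(availability_pct: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {yearly_downtime_minutes(pct):.1f} min/yr")
```

Notice the pattern: every extra nine divides your error budget by ten, which is why each step up the table costs disproportionately more engineering effort.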
The Basics of Redundancy and Health
The foundation of HA is simple: Never have just one of anything. If you have one load balancer, one database, or one region, you have a SPOF.
To move beyond SPOFs, we use Redundancy. This isn't just about having "more servers"; it's about how those servers interact. In an Active-Passive setup, a "Warm Standby" waits for the primary to fail. In an Active-Active setup, all nodes share the load, making the system naturally more resilient.
However, redundancy is useless without Health Checks. A load balancer that continues to send traffic to a crashed server isn't providing HA; it's providing a 50% failure rate. A system must be "self-healing," meaning it can detect a dead node and reroute traffic in milliseconds without human intervention.
How Automated Failover Keeps the Lights On
The core mechanic of HA is the Failover Loop. This is a continuous cycle of monitoring, detection, and reconfiguration.
- Heartbeating: Nodes constantly send small "I am alive" packets to a controller or to each other.
- Detection: If a heartbeat is missed $X$ times, the node is marked as "Unhealthy."
- Reconfiguration: The traffic director (Load Balancer or Service Mesh) updates its routing table to exclude the unhealthy node.
- Promotion (Passive only): In a database setup, a replica is "promoted" to become the new primary.
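The heartbeat, detection, and reconfiguration steps can be sketched in a few lines. This is a minimal in-memory model, not a production monitor: the miss threshold and node names are illustrative, and a real system would also debounce recovery.

```python
class FailoverMonitor:
    """Toy sketch of the failover loop: count missed heartbeats,
    and drop a node from the routing table once it crosses the threshold."""

    MISS_THRESHOLD = 3  # the "$X$ missed heartbeats" from the detection step

    def __init__(self, nodes):
        self.misses = {n: 0 for n in nodes}
        self.routing_table = set(nodes)  # nodes eligible to receive traffic

    def record_heartbeat(self, node):
        self.misses[node] = 0
        self.routing_table.add(node)  # node recovered: route to it again

    def record_miss(self, node):
        self.misses[node] += 1
        if self.misses[node] >= self.MISS_THRESHOLD:
            self.routing_table.discard(node)  # reconfiguration: stop routing

mon = FailoverMonitor(["app1", "app2"])
for _ in range(3):
    mon.record_miss("app1")       # three consecutive missed heartbeats...
print(sorted(mon.routing_table))  # ...and app1 is out of rotation
```

The threshold matters: mark nodes dead after one miss and a single dropped packet triggers a needless failover; wait too long and users eat the outage.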
Visualizing the Flow of Automated Failover

```mermaid
graph TD
    User((User)) --> LB[Load Balancer]
    LB -->|Health Check: OK| S1[Server 1: Active]
    LB -->|Health Check: OK| S2[Server 2: Active]
    subgraph Failure_Scenario
        S1 -.->|Crash| S1_Down((X))
        LB -->|Timeout| S1_Down
        LB -->|Reroute| S2
    end
```
Explanation of the Diagram: The diagram shows a standard horizontal failover flow. The Load Balancer acts as the central traffic director. It performs regular health checks on all active nodes. When Server 1 fails to respond (detected via a timeout or heartbeat miss), the Load Balancer marks it as unhealthy and reroutes 100% of the traffic to Server 2. For end-users, disruption is limited to the brief detection window rather than a full outage.
Deep Dive: The Internals of High Availability
Achieving 99.99% requires looking under the hood at how traffic actually moves across the wire during a failure.
The Internals: Virtual IPs and VRRP
How does a Load Balancer fail over if it is the one that crashes? We use a Virtual IP (VIP). Two physical load balancers share a single IP address using a protocol called VRRP (Virtual Router Redundancy Protocol). Both nodes listen, but only the "Master" responds to ARP requests for that IP. If the Master stops heartbeating, the "Backup" node takes over the IP in less than a second.
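VRRP's election rule boils down to "highest priority among nodes still advertising wins." A toy sketch of that rule (node names and priorities are illustrative, mirroring the Keepalived example later in the post):

```python
def vip_owner(nodes):
    """Pick which node should answer ARP for the Virtual IP: among nodes
    still sending VRRP advertisements, the highest priority wins."""
    alive = [n for n in nodes if n["advertising"]]
    return max(alive, key=lambda n: n["priority"])["name"] if alive else None

nodes = [
    {"name": "lb-master", "priority": 101, "advertising": True},
    {"name": "lb-backup", "priority": 100, "advertising": True},
]
print(vip_owner(nodes))          # lb-master holds the VIP
nodes[0]["advertising"] = False  # master stops heartbeating...
print(vip_owner(nodes))          # ...backup takes over the same IP
```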
Performance Analysis: The Cost of Checks
Health checks aren't free. If you have 1,000 microservices checking each other every 1 second, you've created a "Distributed Denial of Service" (DDoS) on your own network.
- Time Complexity: Failover detection is $O(1)$ per node, but the total network load is $O(N^2)$ in a mesh or $O(N)$ with a centralized observer.
- Bottlenecks: The "Observer" (Load Balancer) becomes the bottleneck if it has to manage too many health check states. This is why modern systems use Gossip Protocols (like in Cassandra or Consul) where nodes tell each other who is healthy, distributing the workload.
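The $O(N^2)$ versus $O(N)$ difference is easy to underestimate until you plug in numbers. A quick back-of-the-envelope calculation (topology names are mine, not a real library's):

```python
def checks_per_second(n_nodes: int, interval_s: float, topology: str) -> float:
    """Health-check messages per second for two monitoring topologies."""
    if topology == "mesh":
        # every node probes every other node: O(N^2)
        return n_nodes * (n_nodes - 1) / interval_s
    if topology == "centralized":
        # one observer probes each node: O(N)
        return n_nodes / interval_s
    raise ValueError(f"unknown topology: {topology}")

print(checks_per_second(1000, 1.0, "mesh"))         # ~999,000 probes/sec
print(checks_per_second(1000, 1.0, "centralized"))  # 1,000 probes/sec
```

Gossip protocols land between the two: each node probes a small random sample per round, so total load stays $O(N)$ while avoiding a single observer bottleneck.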
Advanced Concepts: Global Traffic Management (GSLB)
To survive an entire region going dark, you need Global Server Load Balancing (GSLB). This happens at the DNS level.
When a user in London requests api.example.com, a GSLB-enabled DNS server (like Cloudflare or Route53) looks at the health of your data centres. If the London region is at 90% capacity or reporting errors, the DNS server returns the IP address for the Dublin region instead.
Anycast takes this further by allowing multiple data centres across the world to announce the exact same IP address via BGP (Border Gateway Protocol). The internet's routing infrastructure naturally sends the user to the "closest" healthy data centre. If one data centre stops announcing the IP (because it crashed), the internet "fails over" automatically to the next closest one.
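The DNS-level decision described above is essentially "prefer the nearest region, skip anything unhealthy." A minimal sketch of that resolution logic; region names and the documentation-range IPs are illustrative, not real infrastructure:

```python
def resolve(region_health: dict, client_region: str, fallback_order: list) -> str:
    """GSLB-style resolution: answer with the client's home region if it is
    healthy, otherwise fall back to the next healthy region in order."""
    for region in [client_region] + fallback_order:
        health = region_health.get(region)
        if health and health["healthy"]:
            return health["ip"]
    raise RuntimeError("no healthy region available")

regions = {
    "london": {"healthy": False, "ip": "203.0.113.10"},  # reporting errors
    "dublin": {"healthy": True,  "ip": "203.0.113.20"},
}
print(resolve(regions, "london", ["dublin"]))  # London is sick -> Dublin's IP
```

One caveat the sketch hides: DNS answers are cached, so a real GSLB setup pairs this logic with short TTLs (30 to 60 seconds) to keep failover fast.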
Real-World Applications: Designing for 99.99%
Case Study 1: The Payment Gateway (Active-Passive DB)
A payment gateway uses a primary SQL database in us-east-1 and a synchronous replica in us-west-2.
- Input: Transaction request.
- Process: Write to Primary → Sync to Replica → Commit.
- Failure Scenario: us-east-1 goes down.
- HA Logic: The monitoring system detects the outage, promotes the us-west-2 replica to Primary, and updates the application connection strings.
- Result: Transactions resume within 30 seconds.
Case Study 2: The Edge CDN (Active-Active Stateless)
A CDN like Cloudflare uses Anycast to serve static assets.
- Input: Image request.
- Process: BGP routes user to the nearest Edge POP (Point of Presence).
- Failure Scenario: The London POP loses power.
- HA Logic: BGP routes automatically shift to the Paris or Amsterdam POPs.
- Result: Zero downtime; slight increase in latency for London users.
Trade-offs & Failure Modes
High Availability isn't free. It introduces new risks:
- Performance vs. Cost: Running an Active-Active multi-region setup roughly doubles your cloud bill. Each region must carry enough spare headroom to absorb the other's traffic, capacity that sits mostly unused unless a disaster actually happens.
- Split-Brain Syndrome: This is the most dangerous failure mode. If two nodes in a cluster lose connection to each other but are both still running, they might both think they are the "Leader." If they both start writing to the same storage, you get data corruption.
- Mitigation: Use Quorum-based voting (majority rule). A node can only become Leader if it can see a majority of its peers.
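The quorum rule fits in one line of code. A sketch (counting the node itself toward the majority, as Raft-style systems do):

```python
def can_lead(visible_peers: int, cluster_size: int) -> bool:
    """Quorum rule: a node may become Leader only if it, plus the peers it
    can still see, forms a strict majority of the cluster."""
    return visible_peers + 1 > cluster_size // 2

# A 5-node cluster partitions into a 3-node side and a 2-node side:
print(can_lead(visible_peers=2, cluster_size=5))  # True: majority side elects
print(can_lead(visible_peers=1, cluster_size=5))  # False: minority stays follower
```

Because only one side of any partition can hold a majority, at most one Leader exists at a time, which is exactly what rules out split-brain. This is also why HA clusters use odd node counts: a 4-node cluster tolerates the same single failure as a 3-node one, but costs more.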
Decision Guide: Choosing your HA Strategy
| Situation | Recommendation |
|---|---|
| Use when | Uptime is tied to revenue (e-commerce, payments, ads). |
| Avoid when | Building internal prototypes or non-critical batch jobs where 1-hour downtime is acceptable. |
| Alternative | "Cold Standby" (Backups) — cheaper, but recovery takes hours (a long Recovery Time Objective, RTO). |
| Edge cases | High-security government systems where data must be destroyed rather than risk an unverified failover. |
Practical Example: HAProxy and Keepalived
The gold standard for a highly available ingress layer is the combination of Keepalived (for IP management) and HAProxy (for load balancing).
Example 1: The Keepalived Config
This configuration ensures that two servers share a single "Floating IP" (192.168.1.100).
```
# /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101        # Higher priority wins
    advert_int 1        # Advertise every 1 second
    virtual_ipaddress {
        192.168.1.100
    }
}
```
Example 2: The HAProxy Health Check
HAProxy uses this config to ensure it only sends traffic to healthy backend servers.
```
backend app_servers
    balance roundrobin
    # Check /health every 2s, fail after 3 misses, recover after 2 passes
    option httpchk GET /health
    server app1 10.0.0.1:8080 check inter 2s fall 3 rise 2
    server app2 10.0.0.2:8080 check inter 2s fall 3 rise 2
```
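On the application side, the backend servers need an endpoint for that httpchk probe to hit. A minimal sketch in Python's standard library; the `check_dependencies` stub and port are illustrative, and a real service would probe its database and downstream APIs:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies() -> bool:
    # Illustrative stub: a real service would ping its database,
    # cache, and critical downstream APIs here.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    """Answers a GET /health probe: 200 means "keep routing to me",
    503 means "take me out of rotation"."""

    def do_GET(self):
        healthy = self.path == "/health" and check_dependencies()
        body = b"ok" if healthy else b"unhealthy"
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the request logs

# To serve: HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

A deliberately shallow check like this only proves the process is up; deciding whether /health should also verify dependencies is a trade-off, since a flapping database could otherwise knock every app server out of rotation at once.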
Lessons Learned
- Hardware will fail, so build software that doesn't care. Assume every disk, cable, and power supply has a timer counting down to zero.
- The "Passive" node is usually the weakest link. If you don't test your standby server, it won't work when you need it. Configuration drift is the silent killer of HA.
- 99.99% is about MTTR (Mean Time to Recovery). You can't prevent every crash, but you can automate the recovery so it happens in seconds, not minutes.
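The MTTR point is just arithmetic: steady-state availability is MTBF / (MTBF + MTTR). A sketch with illustrative numbers (one crash every 30 days), showing why automating recovery buys a nine that preventing crashes cannot:

```python
def availability(mtbf_hours: float, mttr_minutes: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_minutes / 60)

# Same crash rate (one failure per 720 hours), different recovery speed:
manual    = availability(720, 30)   # human gets paged, 30-minute recovery
automated = availability(720, 0.5)  # scripted failover in 30 seconds
print(f"manual:    {manual:.4%}")   # roughly three nines
print(f"automated: {automated:.4%}")  # comfortably past four nines
```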
Summary & Key Takeaways
- Eliminate SPOFs: Redundancy at every layer (DNS, LB, App, DB).
- Automate Detection: Use health checks with aggressive but safe timeouts.
- Fail Over, Not Down: Use Active-Active where possible; use Quorum for Active-Passive.
- GSLB for Disaster Recovery: Regional outages require DNS/BGP-level traffic shifting.
- Remember: Availability is a choice you make during design, not a feature you buy later.
Practice Quiz
A system with 99.9% availability is allowed how much downtime per year?
- A) 52 minutes
- B) 8.77 hours
- C) 3.65 days
- D) 5.26 minutes

Correct Answer: B
What is the primary purpose of the VRRP protocol in a High Availability stack?
- A) To encrypt data between regions.
- B) To allow two nodes to share a single Virtual IP for failover.
- C) To compress database logs.
- D) To prevent SQL injection.

Correct Answer: B
In an Active-Passive database setup, what is "Split-Brain"?
- A) When a database is sharded across too many nodes.
- B) When two nodes both believe they are the Master due to a network partition.
- C) When the standby replica is faster than the primary.
- D) When a developer deletes the wrong index.

Correct Answer: B
[Open-ended] Your e-commerce site currently runs in one AWS region. Describe the three most important changes you would make to move from 99.9% to 99.99% availability.