
System Design Networking: DNS, CDNs, and Load Balancers

The internet's traffic control system. We explain how DNS resolves names, CDNs cache content, and Load Balancers distribute traffic.

Abstract Algorithms · 16 min read

TLDR: When you hit a URL, DNS translates the name to an IP, CDNs serve static assets from the edge nearest to you, and Load Balancers spread traffic across many servers so no single machine becomes a bottleneck. These three layers are the traffic control system of the modern internet.


๐ŸŒ Three Layers That Stand Between a User and Your Server

In October 2021, Facebook went dark for nearly six hours. The cause was a misconfigured BGP update that withdrew the routes to Facebook's DNS servers, meaning no request could resolve facebook.com to an IP address at all, let alone reach an application server. The outage was not a code bug or a database crash; it was a failure in the infrastructure layers that execute before your application code ever runs.

Understanding DNS, CDNs, and load balancers, and critically, how they fail, is what separates engineers who can debug production outages from those who just restart services and hope.

Here is what a DNS failure looks like from a client's perspective:

$ dig facebook.com
;; connection timed out; no servers could be reached

That single timeout means every user on every device sees a blank page, not because the application servers were down, but because the phone book pointing to those servers was unreachable. No DNS resolution, no request ever gets off the ground.

Every web request travels through at least three invisible layers before it reaches your application code:

  1. DNS (Domain Name System): translates a human-readable hostname like github.com into a machine-readable IP address. Think of it as the internet's phone book: without it, nothing is addressable by name.
  2. CDN (Content Delivery Network): serves cached assets (images, JS, CSS) from a server geographically close to the user, cutting the round-trip distance to your origin.
  3. Load Balancer: distributes live requests across a pool of backend servers, removing any single point of failure.

Remove any one of these, and your system either breaks under load, slows to a crawl for distant users, or collapses on a single node.

| Layer | Role | What breaks without it |
|---|---|---|
| DNS | Name → IP resolution | Nothing is reachable by hostname |
| CDN | Edge caching of static content | All requests hit origin; high latency for distant users |
| Load Balancer | Traffic distribution | Single-server bottleneck; no fault tolerance |

๐Ÿ” The Building Blocks: What Each Component Does

Every web request to your application routes through three core network layers before reaching your code.

DNS (Domain Name System) is the phone book of the internet: it translates human-readable hostnames like api.example.com into machine-readable IP addresses. Without DNS, users would need to memorize IP addresses directly to reach your service.

CDN (Content Delivery Network) is a distributed cache of static assets. By placing copies of your images, CSS, and JavaScript at dozens of edge locations worldwide, a CDN serves those files from the server physically closest to each user, dramatically reducing round-trip latency.

Load Balancer is the traffic manager. It accepts all incoming connections and distributes them across a pool of application servers. If one server fails a health check, the load balancer stops sending it traffic automatically, providing fault tolerance without operator intervention.

Together, DNS resolves where to send traffic, the CDN handles static content at the edge, and the load balancer distributes dynamic requests across healthy backend instances. These three layers form the foundation of any horizontally scalable web architecture.


📖 DNS: The Internet's Phone Book

DNS maps example.com → 192.0.2.1. The resolution chain has four hops:

  1. Recursive resolver (usually your ISP or 8.8.8.8) receives the query.
  2. Root server directs to the TLD server.
  3. TLD server (.com) directs to the authoritative name server.
  4. Authoritative NS returns the A record (IPv4) or AAAA record (IPv6).

TTL (Time To Live) controls how long resolvers cache the result. Lower TTL = faster failover; higher TTL = lower resolver load.
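The caching behaviour above can be sketched in a few lines of Java. This is an illustrative toy, not a real resolver API; all class and method names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a resolver-side DNS cache honouring TTL.
public class DnsCacheSketch {
    // a cached answer: the IP plus its absolute expiry time
    static final class Entry {
        final String ip;
        final long expiresAtMillis;
        Entry(String ip, long expiresAtMillis) {
            this.ip = ip;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // store an A record with the TTL (seconds) returned by the authoritative NS
    public void put(String hostname, String ip, long ttlSeconds, long nowMillis) {
        cache.put(hostname, new Entry(ip, nowMillis + ttlSeconds * 1000));
    }

    // returns the cached IP, or null on a miss / expired entry, which in a
    // real resolver would trigger the root -> TLD -> authoritative chain
    public String lookup(String hostname, long nowMillis) {
        Entry e = cache.get(hostname);
        if (e == null || nowMillis >= e.expiresAtMillis) {
            cache.remove(hostname);
            return null;
        }
        return e.ip;
    }
}
```

The key behaviour to notice: within the TTL window every lookup is answered locally, and only after expiry does the expensive multi-hop resolution repeat.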

| DNS record type | Purpose |
|---|---|
| A | Hostname → IPv4 |
| AAAA | Hostname → IPv6 |
| CNAME | Alias one hostname to another |
| MX | Mail server routing |
| NS | Delegated authoritative name server |
| TXT | Arbitrary text (SPF, DKIM, verification) |

DNSSEC adds cryptographic signatures to prevent cache poisoning; always enable it on public zones.

📊 DNS Resolution: Full Lookup Chain

sequenceDiagram
    participant B as Browser
    participant OS as OS Cache
    participant RR as Recursive Resolver
    participant Root as Root NS (.)
    participant TLD as TLD NS (.com)
    participant Auth as Authoritative NS

    B->>OS: Resolve api.example.com
    OS-->>B: Cache miss
    B->>RR: Query api.example.com
    RR->>Root: Where is .com?
    Root-->>RR: TLD NS address for .com
    RR->>TLD: Where is example.com?
    TLD-->>RR: Auth NS for example.com
    RR->>Auth: A record for api.example.com?
    Auth-->>RR: 203.0.113.10 (TTL 300s)
    RR-->>B: 203.0.113.10 (cached per TTL)

This diagram traces a full DNS resolution from the browser through five hops to the authoritative nameserver. The browser first checks the OS cache; on a miss, the recursive resolver fans out through root, TLD, and authoritative nameservers before returning the final IP address with its TTL. Notice that once the result is cached at the recursive resolver, all subsequent lookups skip these hops entirely; this is why a high TTL reduces resolver load but delays failover propagation when you change an IP address.


📦 CDNs: Bringing Content Closer to Users

A CDN is a globally distributed cache. When a user requests /static/logo.png, the CDN serves it from an edge PoP (Point of Presence) in their city rather than your origin server in a distant datacenter.

CDN hit ratio is the fraction of requests served from cache:

$$\text{Avg Latency} = r \cdot L_\text{edge} + (1 - r) \cdot L_\text{origin}$$

where $r$ is the hit ratio. Raising $r$ from 0.80 to 0.95 cuts origin-bound requests from 20% to 5%, a fourfold reduction, and average latency drops almost proportionally when $L_\text{origin} \gg L_\text{edge}$.
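To make the formula concrete, here is a quick Java sketch assuming 5 ms to the edge and 120 ms to origin. The latency figures are illustrative assumptions, not benchmarks:

```java
// Effective latency as a function of CDN hit ratio r:
// avgLatency = r * L_edge + (1 - r) * L_origin
public class CdnLatencySketch {
    public static double avgLatencyMs(double hitRatio, double edgeMs, double originMs) {
        return hitRatio * edgeMs + (1 - hitRatio) * originMs;
    }

    public static void main(String[] args) {
        // assumed round trips: 5 ms to the edge PoP, 120 ms to origin
        System.out.printf("r=0.80 -> %.1f ms%n", avgLatencyMs(0.80, 5, 120)); // 28.0 ms
        System.out.printf("r=0.95 -> %.1f ms%n", avgLatencyMs(0.95, 5, 120)); // 10.8 ms
    }
}
```

Under these assumptions, the 15-point improvement in hit ratio nearly triples the effective speed seen by users.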

Cache invalidation strategies:

  • Versioned filenames (app.v3.js): cached indefinitely; invalidate by renaming.
  • Cache-Control headers: max-age=86400 for a 24-hour TTL.
  • CDN purge API: force-invalidate specific paths (use sparingly; frequent purges defeat caching).
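The versioned-filename strategy is typically automated at build time by deriving the version from a content hash, so the name changes exactly when the bytes change. A minimal Java sketch; the naming scheme and class are hypothetical:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Build-time sketch: derive a cache-busting filename from the asset's content.
// Any change to the bytes yields a new name, so edges and browsers can cache
// the old name forever (max-age=31536000) without ever serving stale content.
public class AssetVersioner {
    public static String versionedName(String baseName, byte[] content) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(content);
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) { // 8 hex chars is plenty for cache busting
                hex.append(String.format("%02x", digest[i]));
            }
            int dot = baseName.lastIndexOf('.');
            return baseName.substring(0, dot) + "." + hex + baseName.substring(dot);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A bundler would emit, say, app.3f2a9c1b.js and rewrite the HTML reference, making stale-bundle bugs structurally impossible.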

Edge compute (Cloudflare Workers, Lambda@Edge) allows running lightweight logic at the edge (A/B tests, request rewriting, or authentication) without round-tripping to origin.


โš™๏ธ Load Balancers: Distributing Traffic Intelligently

A load balancer sits in front of your server pool and routes incoming requests:

flowchart TD
    A[User Browser] -->|DNS lookup returns LB IP| B[Load Balancer L7]
    B -->|Route /api| C[App Server 1]
    B -->|Route /api| D[App Server 2]
    B -->|Route /static| E[CDN Edge Node]
    C --> F[Origin DB]
    D --> F

Layer 4 vs Layer 7:

| | Layer 4 (TCP/UDP) | Layer 7 (HTTP) |
|---|---|---|
| Inspects | IP + port | Full HTTP headers, URL, cookies |
| Routing | By connection | By path, host, method |
| Use case | Raw throughput, TCP forwarding | Application-aware routing, TLS termination |

Routing algorithms:

| Algorithm | How it works | Best for |
|---|---|---|
| Round-Robin | Rotate through servers sequentially | Homogeneous servers |
| Least Connections | Route to server with fewest active connections | Variable request duration |
| IP Hash | Hash client IP → server | Session affinity (sticky sessions) |
| Weighted | Assign traffic % per server | Heterogeneous server capacities |
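The first three algorithms in the table can be sketched in a few lines of Java. This is a toy illustration of the selection logic only, not a working balancer:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketches of three routing strategies: round-robin, least connections, IP hash.
public class RoutingSketch {
    // Round-robin: rotate through the pool regardless of load
    private final AtomicInteger rrCounter = new AtomicInteger();
    public String roundRobin(List<String> servers) {
        return servers.get(Math.floorMod(rrCounter.getAndIncrement(), servers.size()));
    }

    // Least connections: pick the backend with the fewest active connections
    public String leastConnections(Map<String, Integer> activeConns) {
        return activeConns.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }

    // IP hash: the same client IP always maps to the same backend (sticky sessions)
    public String ipHash(String clientIp, List<String> servers) {
        return servers.get(Math.floorMod(clientIp.hashCode(), servers.size()));
    }
}
```

Note the trade-off visible even in this sketch: round-robin needs no backend state, least-connections needs live connection counts, and IP-hash sacrifices balance for affinity.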

Health checks probe each backend (HTTP /health, TCP ping) on a configurable interval. Unhealthy servers are removed from rotation without human intervention.

Load imbalance metric:

$$\text{Imbalance} = \frac{\max(\text{server QPS})}{\text{avg}(\text{server QPS})}$$

A ratio near 1.0 means excellent distribution; above 2.0 is a warning sign.
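A quick worked example of the imbalance metric, with invented QPS figures:

```java
import java.util.Arrays;

// Imbalance = max(server QPS) / avg(server QPS); near 1.0 is well balanced.
public class ImbalanceSketch {
    public static double imbalance(double[] qpsPerServer) {
        double max = Arrays.stream(qpsPerServer).max().orElseThrow();
        double avg = Arrays.stream(qpsPerServer).average().orElseThrow();
        return max / avg;
    }

    public static void main(String[] args) {
        // balanced pool: every server near the mean of 100 QPS
        System.out.println(imbalance(new double[]{100, 105, 95, 100})); // 1.05
        // skewed pool: one hot server; a ratio above 2.0 is a warning sign
        System.out.println(imbalance(new double[]{240, 40, 60, 60}));   // 2.4
    }
}
```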

📊 CDN Cache Miss and Hit Flow

sequenceDiagram
    participant C as Client
    participant E as CDN Edge PoP
    participant O as Origin Server

    C->>E: GET /static/logo.png
    Note over E: Cache MISS (first request)
    E->>O: Forward to origin
    O-->>E: 200 OK + logo.png payload
    E-->>C: 200 OK (served + cached at edge)
    Note over E: Cached for max-age=86400s

    C->>E: GET /static/logo.png
    Note over E: Cache HIT
    E-->>C: 200 OK (from edge, ~5ms)
    Note over O: Origin not contacted on hit

This sequence diagram contrasts a CDN cache miss on the first request with a cache hit on the second request for the same static asset. On the first request, the edge PoP has no cached copy and must forward the request to origin, incurring full round-trip latency; on the second request the edge serves directly from its cache in roughly 5 ms without contacting origin at all. The key takeaway: maximising the cache hit ratio by setting appropriate Cache-Control headers and using versioned filenames is the most direct lever for reducing both user-facing latency and origin infrastructure load.


🧠 Deep Dive: Inside a DNS Resolution and Cache Lifecycle

When you type a URL, the OS checks its local DNS cache first. On a miss, the query travels to a recursive resolver, which fans out to root → TLD → authoritative nameserver. Each hop adds a few milliseconds, but the result is cached per TTL, so repeat visits skip all hops. This is why TTL tuning matters: a 24-hour TTL means stale IPs persist for up to 24 hours after a server change.

| Cache stage | Managed by | TTL impact |
|---|---|---|
| OS / browser | Local machine | Very short; typically 30-60 s |
| Recursive resolver | ISP or public DNS | Set by authoritative NS record |
| Authoritative NS | You | Lower = faster failover; higher = less resolver load |

📊 The Full Request Journey: DNS → CDN → Load Balancer → App

sequenceDiagram
    participant U as User Browser
    participant R as Recursive Resolver
    participant A as Authoritative NS
    participant LB as Load Balancer
    participant CDN as CDN Edge
    participant App as App Server

    U->>R: Resolve example.com
    R->>A: Query A record
    A-->>R: Returns LB IP
    R-->>U: LB IP (cached per TTL)
    U->>LB: GET /api/data
    LB->>App: Forward request
    App-->>LB: JSON response
    LB-->>U: JSON response
    U->>CDN: GET /static/logo.png
    CDN-->>U: Serve from edge cache

This diagram shows how a single user interaction splits into two parallel paths: the API request travels through DNS resolution to the load balancer and then to an app server, while the static asset request is intercepted and served directly by the nearest CDN edge node. Notice that the CDN request never touches the load balancer or app server; this is the core value proposition of a CDN, removing an entire class of requests from the origin. The combined effect of DNS caching, edge caching, and load-balanced app servers is what allows a modest backend cluster to serve millions of users globally.


โš–๏ธ Trade-offs & Failure Modes: Trade-offs and Failure Modes

| Layer | Common failure | Mitigation |
|---|---|---|
| DNS | Stale cached IP after failover | Set TTL ≤ 60s before planned change |
| DNS | Cache poisoning | Enable DNSSEC |
| CDN | Stale content after deploy | Versioned asset filenames + purge on deploy |
| CDN | Cache miss storm on cold start | Warm cache before traffic shift |
| Load Balancer | Health-check lag → routing to dead server | Aggressive health checks + circuit breaker |
| Load Balancer | Session breakage | Sticky sessions or stateless session design |

๐ŸŒ Real-World Applications: Real-World Deployments: DNS, CDN, and Load Balancer in Action

Major web platforms rely on all three layers working in concert.

E-commerce (Amazon, Shopify): A CDN caches product images and CSS globally, reducing page load times for users worldwide. A Layer-7 load balancer routes checkout API requests to dedicated payment-processing servers. GeoDNS routes users to the nearest regional datacenter, keeping latency below 50 ms for 99% of requests.

Streaming platforms (Netflix, YouTube): DNS Anycast routes users to the nearest Point of Presence. The CDN edge stores cached video segments. Adaptive bitrate algorithms request different quality segments based on available bandwidth, all served from edge nodes, not the origin.

SaaS platforms (Slack, Notion): Load balancers distribute WebSocket connections across stateful nodes with sticky sessions, ensuring each user remains connected to the same backend throughout their session. CDNs cache static app bundles so browser reloads are instant.

Startup MVP: A single load balancer in front of two application servers, backed by a CDN like Cloudflare's free tier, handles most early-stage traffic without dedicated infrastructure investment. Start simple and add DNS-based geo-routing as your user base grows internationally.


🧭 Decision Guide: When to Use What

| Situation | Recommendation | Why |
|---|---|---|
| Serving static assets | Deploy a CDN (Cloudflare, Fastly, CloudFront) | Edge caching cuts latency dramatically |
| Horizontal scaling of API servers | Layer 7 Load Balancer | Smart routing, health checks, TLS termination |
| Global user base | GeoDNS + Anycast + regional LB | Routes users to nearest edge, minimizing RTT |
| Session affinity needed | IP-Hash or Cookie-Based LB | Guarantees subsequent requests hit same backend |
| Rapid failover | TTL ≤ 60s + DNS health-monitored failover | Reduces stale records during outages |
| Dynamic content caching | CDN edge compute (ESI, Cloudflare Workers) | Caches fragments while personalizing at edge |

🧪 Practical Setup: CDN and Load Balancer in 4 Steps

This example walks through the four-step sequence for wiring DNS, a CDN, and a load balancer into an existing web application: the configuration that connects all three layers discussed in this post. The order mirrors how engineers roll these changes out in production: cache layer first, then traffic distribution, then DNS last, because pointing DNS at a load balancer before it is healthy is a common cause of self-inflicted outages during infrastructure changes. Each step builds on the previous one, and the health-check endpoint configured in Step 2 is the signal that makes Step 4's failover test meaningful.

Step 1 โ€” Configure the CDN: point your CDN provider (Cloudflare, CloudFront, Fastly) at your origin server. Set Cache-Control: max-age=31536000 on static assets and use versioned filenames like app.v4.js so browsers never serve stale bundles without intentional invalidation.

Step 2 โ€” Deploy a load balancer: provision a managed load balancer (AWS ALB, GCP Load Balancing, NGINX). Add at least two backend servers for redundancy. Enable HTTPS termination at the load balancer and configure a /health endpoint on each backend that returns 200 only when the server is genuinely ready.
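As one concrete shape for Step 2, an open-source NGINX configuration might look like the sketch below. All addresses, ports, and certificate paths are placeholders; note that plain NGINX does passive health checking via max_fails/fail_timeout, while active /health probes require NGINX Plus or a managed load balancer:

```nginx
# Hypothetical pool of two backends behind one TLS-terminating listener.
upstream app_pool {
    least_conn;                      # route to the backend with fewest connections
    server 10.0.1.10:8080 max_fails=3 fail_timeout=10s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=10s;
}

server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate     /etc/nginx/tls/fullchain.pem;   # TLS terminated at the LB
    ssl_certificate_key /etc/nginx/tls/privkey.pem;

    location / {
        proxy_pass http://app_pool;
        # retry the next backend if one returns a gateway error or times out
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```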

Step 3 โ€” Update DNS: point your domain's A record to the load balancer IP. Set TTL to 300 seconds initially; lower it to 60 seconds before planned failover events to reduce stale-record impact.

Step 4 โ€” Test failover: take one backend server out of rotation manually and verify the load balancer detects the failure within one health-check interval and stops routing traffic to it before restoring the server.


🎯 What to Learn Next


๐Ÿ› ๏ธ Spring Boot + Spring Cloud Gateway: Health Checks and Dynamic Routing in Java

Spring Boot is the standard Java framework for building services behind load balancers, and Spring Cloud Gateway is its companion API gateway that implements Layer-7 routing, health checks, and circuit-breaker patterns, making the load balancer concepts from this post directly configurable in Java code.

The /health endpoint that load balancers probe is provided automatically by Spring Boot Actuator; Spring Cloud Gateway handles path-based routing and can integrate with a discovery service for dynamic backend registration:

// dependencies: spring-boot-starter-actuator, spring-cloud-starter-gateway,
//               spring-cloud-starter-circuitbreaker-resilience4j

// application.yml: Spring Cloud Gateway routing config (NGINX equivalent in Java)
/*
spring:
  cloud:
    gateway:
      routes:
        - id: api-route
          uri: lb://backend-service       # lb:// = Spring Cloud LoadBalancer
          predicates:
            - Path=/api/**
          filters:
            - StripPrefix=1
            - name: CircuitBreaker
              args:
                name: backendCB
                fallbackUri: forward:/fallback
        - id: static-route
          uri: https://cdn.example.com
          predicates:
            - Path=/static/**
*/

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Custom health indicator: the LB probes GET /actuator/health
// Returns 200 only when DB connection is healthy (not just process liveness)
@Component("database")
public class DatabaseHealthIndicator implements HealthIndicator {

    private final javax.sql.DataSource dataSource;

    public DatabaseHealthIndicator(javax.sql.DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        try (var conn = dataSource.getConnection()) {
            conn.createStatement().execute("SELECT 1");
            return Health.up().withDetail("db", "reachable").build();
        } catch (Exception e) {
            // LB removes server from rotation when this returns DOWN
            return Health.down().withDetail("error", e.getMessage()).build();
        }
    }
}

// Fallback controller called by circuit breaker when backend is unhealthy
@RestController
public class FallbackController {
    @GetMapping("/fallback")
    public String fallback() {
        return "{\"error\": \"Service temporarily unavailable. Please retry.\"}";
    }
}

Expose the health endpoint to the load balancer in application.yml:

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,info
  endpoint:
    health:
      show-details: always     # exposes per-component status to the LB probe

The DatabaseHealthIndicator makes /actuator/health return HTTP 503 when the database is unreachable, which is the correct signal for a load balancer to remove the server from rotation, implementing the "health checks must reflect true readiness" lesson from the Lessons section.

For a full deep-dive on Spring Cloud Gateway, a dedicated follow-up post is planned.


📚 Lessons from Operating These Layers

Teams that run DNS, CDNs, and load balancers in production converge on the same operational lessons.

DNS TTL discipline matters: set TTL low (60-300 s) before planned changes; let it drift back up during normal operation to reduce resolver load. Many outages last longer than necessary because an engineer forgot to lower a high TTL before a migration, so stale records lingered for hours.

Cache invalidation is harder than it looks: versioned filenames are the most reliable cache-busting strategy. Relying on CDN purge APIs for time-sensitive invalidations introduces propagation delay of seconds to minutes across distributed edge nodes.

Health checks must reflect true readiness: a /health endpoint that returns 200 even when the database is down will route traffic to a broken server. Health checks should probe critical dependencies, not just process liveness.

Monitor the cache hit ratio daily: a ratio below 0.80 indicates cache misconfiguration or too many unique URLs bypassing edge caching. Tuning Cache-Control headers is usually the first and highest-impact fix.


📌 TLDR: Summary & Key Takeaways

  • DNS = global phone book; Anycast + low TTL = fast, resilient name resolution.
  • CDN = edge cache; versioned filenames and cache-control headers keep content fresh and fast.
  • Load Balancer = traffic manager; choose algorithm (Round-Robin, Least-Conn, IP-Hash) based on session needs.
  • Combine all three: DNS → CDN → Load Balancer → Origin app servers → backend services.
  • Monitor latency at each layer; cache-hit ratio and health-check lag are the most actionable signals.

๐Ÿ“ Practice Quiz

  1. Q1: Which DNS record type aliases one hostname to another?

    • A) A record
    • B) CNAME record
    • C) MX record

    Correct Answer: B

  2. Q2: Why does a CDN reduce latency for users far from the origin?

    • A) It compresses responses at the origin
    • B) It serves cached content from an edge server geographically close to the user
    • C) It upgrades the user's network connection

    Correct Answer: B

  3. Q3: When is IP-Hash load balancing the right choice?

    • A) When all servers have identical capacity
    • B) When you need session affinity so users consistently hit the same backend
    • C) When you want to minimize active connections per server

    Correct Answer: B



Written by Abstract Algorithms (@abstractalgorithms)