System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances

Learn how clients find services safely with registries, heartbeats, and health-aware load balancing.

Abstract Algorithms

·Mar 12, 2026·12 min read

AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.

TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infrastructure from guesswork into deterministic routing.

TLDR: If you scale beyond static IPs, discovery plus health-aware routing becomes a core reliability primitive.

📖 Why Service Discovery Is the Invisible Backbone of Modern Systems

In small systems, service communication can start with fixed hostnames and static configuration. That model breaks quickly once autoscaling, rolling deploys, and multi-zone failover enter the picture.

In production, service instances come and go all day:

New instances launch during traffic spikes.
Old instances terminate during scale-down.
Deployments replace instances in waves.
Network partitions make some endpoints temporarily unreachable.

If clients keep a stale list of backends, requests fail even when healthy capacity exists elsewhere. Service discovery solves this by making endpoint lookup dynamic and health-aware.

Static endpoint model	Discovery-driven model
Manually maintained host lists	Registry-backed live instance view
Slow reaction to failures	Automatic unhealthy-instance eviction
Risky deploy coordination	Safer rolling updates and failover
Works for small fixed fleets	Works for elastic and multi-zone fleets

For interviews, this is a key signal: strong candidates explain that scaling services is not only about compute. It is also about continuously correct routing decisions.

🔍 The Two Discovery Models You Must Distinguish in Interviews

Service discovery usually appears in one of two patterns.

Client-side discovery: the client queries a service registry and chooses a backend instance directly. This is common in microservice SDKs where clients include load-balancing logic.

Server-side discovery: the client calls a stable endpoint (for example, a load balancer or API gateway), and that component resolves healthy backends.

Pattern	How lookup works	Operational trade-off
Client-side discovery	Client asks registry and picks instance	Better client control, higher client complexity
Server-side discovery	Proxy or LB resolves target instance	Simpler clients, centralized routing layer
DNS-based discovery	Name resolves to rotating endpoints	Easy integration, slower convergence in some setups
Mesh-integrated discovery	Sidecar/proxy handles lookup and routing	Strong control plane, higher platform complexity

Interview-friendly takeaway: neither model is universally better. The right choice depends on organizational maturity, traffic behavior, and operational ownership.

⚙️ How Discovery and Health Checks Work End-to-End

A robust discovery path is usually a loop, not a one-time lookup.

Service instance starts and registers itself with metadata.
Registry stores endpoint, zone, version, and status.
Clients or proxies query for candidate instances.
Health checks evaluate liveness/readiness continuously.
Unhealthy nodes are removed from traffic until recovery.

Health checks are often split into two types:

Liveness check: is the process alive enough to restart decision logic?
Readiness check: can this instance safely serve real traffic now?

Check type	Purpose	Failure action
Liveness	Detect stuck/crashed process	Restart instance
Readiness	Detect dependency or warmup issues	Stop routing traffic
Dependency check	Validate database/cache reachability	Mark degraded or not ready
Synthetic check	Validate user-journey behavior	Trigger alert/escalation

A frequent production pitfall is using only liveness checks. That can keep a process alive but still route traffic to an instance that cannot serve real requests because dependencies are down.

🧠 Deep Dive: What Actually Makes Discovery Reliable Under Failure

The Internals: Registries, Heartbeats, TTLs, and Routing Metadata

Most systems maintain a control plane with these pieces:

Registry store for service instances and metadata.
Heartbeat protocol to refresh instance presence.
TTL eviction logic to remove stale endpoints.
Watch/stream mechanism to push updates to clients or proxies.

When an instance registers, it usually publishes metadata like zone, version, and tags (canary, stable, gpu). Routing layers can then enforce traffic policies, such as zone-affinity or canary rollout splits.

A practical sequence looks like this:

Instance sends heartbeat every N seconds.
Registry updates last_seen timestamp.
If heartbeat expires beyond TTL, endpoint is marked unhealthy.
Load balancer excludes endpoint from selection set.

This flow is simple but safety-critical. Aggressive TTLs reduce stale routing risk but can amplify flapping during transient network spikes. Conservative TTLs lower churn but keep bad endpoints in circulation longer.

Performance Analysis: Lookup Latency, Convergence Time, and Flapping

Discovery systems are often judged by three metrics.

Metric	Why it matters
Lookup latency	Impacts request path when cache misses occur
Convergence time	Measures how quickly routing reflects real health
Flap rate	Indicates instability in health signals

Lookup latency: if discovery calls are synchronous and slow, p95 request latency rises. Many systems cache discovery results briefly to reduce lookup overhead.

Convergence time: this is the delay between a backend failure and traffic stop. Faster convergence improves reliability but requires aggressive health-check cadence and low-control-plane lag.

Flapping: if health checks are too strict, instances bounce between healthy/unhealthy states, creating churn and cascading retries. Hysteresis and multi-sample thresholds help avoid this.

In interviews, saying "I would optimize for stable convergence, not just fastest possible eviction" shows operational maturity.

📊 Discovery Flow: Registration to Health-Aware Routing

flowchart TD
    A[Instance boots] --> B[Register with service registry]
    B --> C[Heartbeat and metadata updates]
    C --> D{Healthy and ready?}
    D -->|Yes| E[Add to routing pool]
    D -->|No| F[Exclude from routing pool]
    E --> G[Client or proxy resolves target]
    G --> H[Request served]
    F --> I[Recovery or restart]
    I --> C

This model captures the key principle: discovery and health checks are continuous control loops, not setup-time configuration.

📊 Service Registration and Client Discovery

sequenceDiagram
    participant S as OrderService
    participant R as Service Registry
    participant C as Client
    S->>R: Register: host port tags
    R-->>S: ACK registration
    S->>R: Heartbeat every 5s
    C->>R: Discover order-service
    R-->>C: Return healthy endpoints
    C->>S: Route request
    S-->>C: Response

This sequence diagram traces the full lifecycle of service registration and client-driven discovery. OrderService registers with the registry and sends heartbeats every 5 seconds to maintain its healthy status; when a Client queries for available endpoints, the registry returns only healthy instances, and the Client routes its request directly. Discovery is not a one-time lookup — it is a continuous health-maintained contract that ensures clients never route to stale or unhealthy backends.

📊 Health Check Lifecycle: Unhealthy to Deregister

sequenceDiagram
    participant R as Registry
    participant S as Service Instance
    R->>S: GET /ready every 5s
    S-->>R: 200 OK
    Note over S: DB connection lost
    R->>S: GET /ready
    S-->>R: 503 Unhealthy
    R->>R: Mark instance DOWN
    R->>R: Remove from routing pool
    Note over S: DB connection restored
    S->>R: Re-register
    R->>S: GET /ready
    S-->>R: 200 OK
    R->>R: Add to routing pool

This sequence diagram shows what happens when a service instance loses a critical dependency. The registry polls the instance every 5 seconds; when a database connection is lost and the instance returns 503, the registry marks it DOWN and removes it from the routing pool — no manual intervention required. Once the database connection is restored and the instance re-registers and returns 200 OK, the registry automatically adds it back to the pool, completing the self-healing loop.

🌍 Real-World Applications: API Gateways, Payments, and Internal Platforms

HashiCorp Consul at scale: Consul's gossip protocol propagates health changes across a cluster in ~200ms on typical LAN deployments. The deregister_critical_service_after field automatically removes services that remain unhealthy beyond a configurable window — preventing stale endpoints from silently accumulating in the registry.

Consul service registration (JSON):

{
  "service": {
    "name": "orders-api",
    "port": 8080,
    "tags": ["v2", "stable"],
    "check": {
      "http": "http://localhost:8080/ready",
      "interval": "5s",
      "timeout": "2s",
      "deregister_critical_service_after": "30s"
    }
  }
}

Kubernetes endpoint controller: Kubernetes removes a failing pod from its EndpointSlice within the kube-proxy sync interval — typically < 1 second on a healthy cluster — when the pod's readiness probe fails. This is faster than any DNS TTL-based failover mechanism.

Kubernetes readiness and liveness probes (YAML):

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

The /ready endpoint returns 200 only when the service has established its database connection and warmed its local caches. The /healthz endpoint returns 200 as long as the process is responsive — a liveness failure triggers a pod restart, which is destructive and should be reserved for genuinely deadlocked or unrecoverable processes.

Envoy xDS health propagation: Envoy's Endpoint Discovery Service (EDS) receives health status from a control plane such as Istio, Consul Connect, or a custom xDS server. Status changes propagate to all connected Envoy proxies in < 50ms in a well-tuned mesh — orders of magnitude faster than DNS TTL expiry. This speed is what enables effective circuit-breaking and near-instant unhealthy-instance removal in a service mesh.

Failure scenario: a payments team used only liveness probes — no readiness probes — on their transaction-processing service. During a scheduled database maintenance window, pods stayed alive but could not process transactions. The load balancer continued routing requests to these pods for 6 minutes until engineers manually drained them. Adding a readiness probe that checks database connectivity eliminated this failure class entirely over the following 12 months.

⚖️ Trade-offs & Failure Modes: Where Discovery Can Go Wrong

Failure mode	Symptom	Root cause	First mitigation
Stale endpoint routing	Requests hit dead instances	Slow TTL or missed deregistration	Faster heartbeat + TTL tuning
Health-check flapping	Repeated traffic churn	Overly strict check thresholds	Hysteresis and consecutive-fail windows
Registry outage blast radius	New instances never get traffic	Discovery control plane as single point	Highly available registry deployment
Readiness blind spots	Alive but broken instances serve traffic	Liveness-only checks	Add dependency-aware readiness probes
Zone imbalance	One zone overloaded unexpectedly	No zone-aware routing policy	Weighted and zone-local balancing

The interview-quality answer always includes one sentence like: "I would define clear health semantics and failure thresholds before tuning load-balancer algorithms."

🧭 Decision Guide: Choosing a Discovery Strategy

Situation	Recommendation
Small internal system with stable topology	DNS or server-side discovery is often enough
Rapidly scaling microservices with frequent deploys	Registry + health-aware proxy routing
Team comfortable with rich client SDKs	Client-side discovery with local caching
Strong platform team and mesh investment	Service mesh with control-plane discovery

When unsure in interviews, start with server-side discovery for simpler client behavior, then discuss where client-side control may be worth the complexity.

🧪 Practical Example: Evolving a Checkout Service Beyond Static Backends

Imagine a checkout service initially routed via hardcoded backend IPs.

Problems appear during traffic spikes:

New app instances launch but receive no traffic.
One bad instance still receives requests for minutes.
Rolling deploys create intermittent errors from stale endpoint lists.

A safer evolution path:

Introduce a service registry with instance metadata.
Route through a load balancer that consumes registry updates.
Add readiness checks that include payment-db connectivity.
Add zone-aware balancing to reduce cross-zone latency.

Expected outcome:

Before	After
Manual endpoint updates	Automatic registration and eviction
Inconsistent failover	Deterministic health-aware rerouting
Deploy-induced error spikes	Smoother rolling deployments

This is a strong interview answer because it keeps architecture evolution incremental and justified by failures.

🛠️ Spring Cloud Netflix Eureka and Spring Cloud Consul: Dynamic Discovery for Java Microservices

Spring Cloud Netflix Eureka is a client-side service registry built into the Spring Cloud ecosystem; Spring Cloud Consul provides the same programming model backed by HashiCorp Consul's gossip-based registry. Both integrate with @EnableDiscoveryClient and Spring Boot's HealthIndicator to expose liveness/readiness semantics to the control plane automatically.

How it solves the problem: A Spring Boot microservice annotated with @EnableDiscoveryClient registers itself with the registry on startup, refreshes its heartbeat on a configurable interval, and is automatically evicted when the heartbeat stops. Spring Boot's HealthIndicator lets each service publish dependency-aware readiness — so a payment service that lost its database connection reports DOWN before the load balancer routes a real transaction to it.

// Enable discovery client (works for both Eureka and Consul)
@SpringBootApplication
@EnableDiscoveryClient
public class PaymentServiceApp {
    public static void main(String[] args) {
        SpringApplication.run(PaymentServiceApp.class, args);
    }
}

// Custom HealthIndicator: readiness check includes DB connectivity
@Component("paymentDatabase")
public class PaymentDbHealthIndicator implements HealthIndicator {

    private final DataSource dataSource;

    public PaymentDbHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        try (Connection conn = dataSource.getConnection()) {
            conn.isValid(1);   // 1-second timeout
            return Health.up()
                .withDetail("db", "reachable")
                .build();
        } catch (SQLException ex) {
            // Registry marks this instance DOWN → removed from routing pool
            return Health.down()
                .withDetail("db", "unreachable")
                .withException(ex)
                .build();
        }
    }
}

// Client: discover and call another service without hardcoded URLs
@Service
public class OrderFulfillmentClient {

    private final RestTemplate restTemplate;   // @LoadBalanced — resolves via registry

    public FulfillmentResponse fulfill(String orderId) {
        // "fulfillment-service" resolves to a healthy instance via registry lookup
        return restTemplate.postForObject(
            "http://fulfillment-service/api/fulfill/" + orderId,
            null, FulfillmentResponse.class);
    }
}

Spring Cloud Consul registration configuration:

spring:
  cloud:
    consul:
      host: consul.internal
      port: 8500
      discovery:
        health-check-path: /actuator/health
        health-check-interval: 5s
        deregister: true                # auto-deregister on shutdown
        instance-id: ${spring.application.name}-${server.port}
        tags:
          - v2
          - stable

The deregister: true flag ensures Spring's @PreDestroy lifecycle hook deregisters the instance gracefully, preventing stale endpoints during rolling deployments — one of the most common sources of 502 errors in blue-green and canary rollouts.

For a full deep-dive on Spring Cloud service discovery with Eureka and Consul, a dedicated follow-up post is planned.

📚 Lessons Learned

Service discovery is a control-plane capability, not just a DNS trick.
Health checks must distinguish process liveness from real request readiness.
Faster failover is useful only when flapping is controlled.
Registry availability and correctness directly affect data-plane reliability.
Discovery design should align with team ownership and platform maturity.

📌 TLDR: Summary & Key Takeaways

Dynamic systems need dynamic endpoint resolution.
Discovery and health checks are tightly coupled reliability primitives.
Readiness semantics matter more than raw check frequency.
Control-plane failures can become data-plane outages if unmanaged.
Start simple, then add richer routing metadata and policies as scale grows.

Test Your Knowledge

🧠

Ready to test what you just learned?

AI will generate 4 questions based on this article's content.

RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)

TLDR: RAG gives LLMs access to current knowledge at inference time; fine-tuning changes how they reason and write. Use RAG when your data changes. Use fine-tuning when you need consistent style, tone, or domain reasoning. Use both for production assi...

Apr 19, 2026•27 min read

Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive

TLDR: LoRA freezes the base model and trains two tiny matrices per layer — 0.1 % of parameters, 70 % less GPU memory, near-identical quality. QLoRA adds 4-bit NF4 quantization of the frozen base, enabling 70B fine-tuning on 2× A100 80 GB instead of 8...

Apr 19, 2026•29 min read

Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs

TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M tokens/day with a dedicated MLOps team. The build ...

Apr 19, 2026•30 min read

Watermarking and Late Data Handling in Spark Structured Streaming

TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global minimum across all partitions, subtracts the thresho...

Apr 19, 2026•23 min read