
System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances

Learn how clients find services safely with registries, heartbeats, and health-aware load balancing.

Abstract Algorithms
· 8 min read

TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infrastructure from guesswork into deterministic routing.

In short: once you scale beyond static IPs, discovery plus health-aware routing becomes a core reliability primitive.

📖 Why Service Discovery Is the Invisible Backbone of Modern Systems

In small systems, service communication can start with fixed hostnames and static configuration. That model breaks quickly once autoscaling, rolling deploys, and multi-zone failover enter the picture.

In production, service instances come and go all day:

  • New instances launch during traffic spikes.
  • Old instances terminate during scale-down.
  • Deployments replace instances in waves.
  • Network partitions make some endpoints temporarily unreachable.

If clients keep a stale list of backends, requests fail even when healthy capacity exists elsewhere. Service discovery solves this by making endpoint lookup dynamic and health-aware.

| Static endpoint model | Discovery-driven model |
| --- | --- |
| Manually maintained host lists | Registry-backed live instance view |
| Slow reaction to failures | Automatic unhealthy-instance eviction |
| Risky deploy coordination | Safer rolling updates and failover |
| Works for small fixed fleets | Works for elastic and multi-zone fleets |

For interviews, this is a key signal: strong candidates explain that scaling services is not only about compute. It is also about continuously correct routing decisions.

🔍 The Two Discovery Models You Must Distinguish in Interviews

Service discovery usually appears in one of two patterns.

Client-side discovery: the client queries a service registry and chooses a backend instance directly. This is common in microservice SDKs where clients include load-balancing logic.

Server-side discovery: the client calls a stable endpoint (for example, a load balancer or API gateway), and that component resolves healthy backends.

| Pattern | How lookup works | Operational trade-off |
| --- | --- | --- |
| Client-side discovery | Client asks registry and picks instance | Better client control, higher client complexity |
| Server-side discovery | Proxy or LB resolves target instance | Simpler clients, centralized routing layer |
| DNS-based discovery | Name resolves to rotating endpoints | Easy integration, slower convergence in some setups |
| Mesh-integrated discovery | Sidecar/proxy handles lookup and routing | Strong control plane, higher platform complexity |

Interview-friendly takeaway: neither model is universally better. The right choice depends on organizational maturity, traffic behavior, and operational ownership.
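As a concrete illustration, client-side discovery can be sketched in a few lines of Python. The registry contents, field names, and random selection here are hypothetical stand-ins for a real registry (such as Consul or etcd) and a real SDK's load-balancing policy:

```python
import random

# Hypothetical in-memory registry standing in for a real one
# (e.g. Consul or etcd); names and shapes are illustrative only.
REGISTRY = {
    "checkout": [
        {"host": "10.0.1.5", "port": 8080, "healthy": True},
        {"host": "10.0.2.9", "port": 8080, "healthy": False},
        {"host": "10.0.3.4", "port": 8080, "healthy": True},
    ]
}

def resolve(service: str) -> dict:
    """Client-side discovery: query the registry, filter out unhealthy
    instances, then pick one (random here; real SDKs often use
    round-robin or least-loaded selection)."""
    candidates = [i for i in REGISTRY.get(service, []) if i["healthy"]]
    if not candidates:
        raise LookupError(f"no healthy instances for {service}")
    return random.choice(candidates)

instance = resolve("checkout")
print(f"routing to {instance['host']}:{instance['port']}")
```

The key property to call out in an interview: the selection logic, and therefore the complexity, lives in the client.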

⚙️ How Discovery and Health Checks Work End-to-End

A robust discovery path is usually a loop, not a one-time lookup.

  1. Service instance starts and registers itself with metadata.
  2. Registry stores endpoint, zone, version, and status.
  3. Clients or proxies query for candidate instances.
  4. Health checks evaluate liveness/readiness continuously.
  5. Unhealthy nodes are removed from traffic until recovery.

Health checks are often split into two types:

  • Liveness check: is the process alive, or does it need to be restarted?
  • Readiness check: can this instance safely serve real traffic now?

| Check type | Purpose | Failure action |
| --- | --- | --- |
| Liveness | Detect stuck/crashed process | Restart instance |
| Readiness | Detect dependency or warmup issues | Stop routing traffic |
| Dependency check | Validate database/cache reachability | Mark degraded or not ready |
| Synthetic check | Validate user-journey behavior | Trigger alert/escalation |
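The liveness/readiness split can be sketched as two separate probe handlers. The warmup rule and the dependency names (`db`, `cache`) are illustrative assumptions, not a specific framework's API:

```python
import time

class HealthState:
    """Minimal sketch of split health endpoints. The dependency
    flags and warmup rule are hypothetical examples."""
    def __init__(self, warmup_seconds: float = 5.0):
        self.started_at = time.monotonic()
        self.warmup_seconds = warmup_seconds
        self.dependencies = {"db": True, "cache": True}

    def livez(self) -> bool:
        # Liveness: the process is running and responsive.
        # Failing this typically triggers a restart.
        return True

    def readyz(self) -> bool:
        # Readiness: warmed up AND dependencies reachable.
        # Failing this removes the instance from routing only.
        warmed = time.monotonic() - self.started_at >= self.warmup_seconds
        return warmed and all(self.dependencies.values())

state = HealthState(warmup_seconds=0.0)
state.dependencies["db"] = False
print("live:", state.livez(), "ready:", state.readyz())  # live but not ready
```

Note how a dependency outage flips readiness without touching liveness: the instance stops receiving traffic but is not pointlessly restarted.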

A frequent production pitfall is using only liveness checks. That can keep a process alive but still route traffic to an instance that cannot serve real requests because dependencies are down.

🧠 Deep Dive: What Actually Makes Discovery Reliable Under Failure

The Internals: Registries, Heartbeats, TTLs, and Routing Metadata

Most systems maintain a control plane with these pieces:

  • Registry store for service instances and metadata.
  • Heartbeat protocol to refresh instance presence.
  • TTL eviction logic to remove stale endpoints.
  • Watch/stream mechanism to push updates to clients or proxies.

When an instance registers, it usually publishes metadata like zone, version, and tags (canary, stable, gpu). Routing layers can then enforce traffic policies, such as zone-affinity or canary rollout splits.
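A metadata-aware routing filter might look like the following sketch; the `zone` and `tags` field names are assumptions for illustration, not a specific registry's schema:

```python
def select_by_policy(instances, zone=None, tag=None):
    """Filter registry entries by routing metadata, e.g. for
    zone affinity or canary splits. Field names are illustrative."""
    out = instances
    if zone is not None:
        out = [i for i in out if i["zone"] == zone]
    if tag is not None:
        out = [i for i in out if tag in i["tags"]]
    return out

fleet = [
    {"id": "a", "zone": "us-east-1a", "tags": {"stable"}},
    {"id": "b", "zone": "us-east-1b", "tags": {"canary"}},
    {"id": "c", "zone": "us-east-1a", "tags": {"canary"}},
]
# Zone-affine canary selection: only instance "c" matches both filters.
print(select_by_policy(fleet, zone="us-east-1a", tag="canary"))
```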

A practical sequence looks like this:

  1. Instance sends heartbeat every N seconds.
  2. Registry updates last_seen timestamp.
  3. If heartbeat expires beyond TTL, endpoint is marked unhealthy.
  4. Load balancer excludes endpoint from selection set.

This flow is simple but safety-critical. Aggressive TTLs reduce stale routing risk but can amplify flapping during transient network spikes. Conservative TTLs lower churn but keep bad endpoints in circulation longer.

Performance Analysis: Lookup Latency, Convergence Time, and Flapping

Discovery systems are often judged by three metrics.

| Metric | Why it matters |
| --- | --- |
| Lookup latency | Impacts request path when cache misses occur |
| Convergence time | Measures how quickly routing reflects real health |
| Flap rate | Indicates instability in health signals |

Lookup latency: if discovery calls are synchronous and slow, p95 request latency rises. Many systems cache discovery results briefly to reduce lookup overhead.

Convergence time: this is the delay between a backend failing and traffic actually stopping to it. Faster convergence improves reliability but requires an aggressive health-check cadence and low control-plane lag.

Flapping: if health checks are too strict, instances bounce between healthy/unhealthy states, creating churn and cascading retries. Hysteresis and multi-sample thresholds help avoid this.
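Hysteresis via consecutive-sample thresholds can be sketched as a small state machine; the threshold values here are arbitrary examples:

```python
class HealthGate:
    """Sketch of flap damping: an instance changes state only after
    N consecutive failed probes (or M consecutive passing probes),
    rather than on every individual probe result."""
    def __init__(self, fail_threshold: int = 3, pass_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.healthy = True
        self._fails = 0
        self._passes = 0

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._passes += 1
            self._fails = 0
            if not self.healthy and self._passes >= self.pass_threshold:
                self.healthy = True
        else:
            self._fails += 1
            self._passes = 0
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy

gate = HealthGate(fail_threshold=3, pass_threshold=2)
# A single failed probe (a "blip") is absorbed; eviction happens only
# after three straight failures, and recovery needs two straight passes.
results = [gate.observe(ok) for ok in [False, True, False, False, False, True, True]]
print(results)  # [True, True, True, True, False, False, True]
```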

In interviews, saying "I would optimize for stable convergence, not just fastest possible eviction" shows operational maturity.

📊 Discovery Flow: Registration to Health-Aware Routing

```mermaid
flowchart TD
    A[Instance boots] --> B[Register with service registry]
    B --> C[Heartbeat and metadata updates]
    C --> D{Healthy and ready?}
    D -->|Yes| E[Add to routing pool]
    D -->|No| F[Exclude from routing pool]
    E --> G[Client or proxy resolves target]
    G --> H[Request served]
    F --> I[Recovery or restart]
    I --> C
```

This model captures the key principle: discovery and health checks are continuous control loops, not setup-time configuration.

🌐 Real-World Applications: API Gateways, Payments, and Internal Platforms

API gateway pathing: gateways use discovery to route to per-service backends whose instance set changes with autoscaling.

Payments and checkout APIs: readiness checks often include dependency health (database and fraud service) so traffic avoids partially broken instances.

Platform teams in multi-tenant SaaS: discovery metadata allows routing policies by region, tenant tier, and canary version.

Different domains tune thresholds differently, but all rely on the same fundamentals: live endpoint inventory and trustworthy health signals.

⚖️ Trade-offs & Failure Modes: Where Discovery Can Go Wrong

| Failure mode | Symptom | Root cause | First mitigation |
| --- | --- | --- | --- |
| Stale endpoint routing | Requests hit dead instances | Slow TTL or missed deregistration | Faster heartbeat + TTL tuning |
| Health-check flapping | Repeated traffic churn | Overly strict check thresholds | Hysteresis and consecutive-fail windows |
| Registry outage blast radius | New instances never get traffic | Discovery control plane as single point | Highly available registry deployment |
| Readiness blind spots | Alive but broken instances serve traffic | Liveness-only checks | Add dependency-aware readiness probes |
| Zone imbalance | One zone overloaded unexpectedly | No zone-aware routing policy | Weighted and zone-local balancing |

The interview-quality answer always includes one sentence like: "I would define clear health semantics and failure thresholds before tuning load-balancer algorithms."

🧭 Decision Guide: Choosing a Discovery Strategy

| Situation | Recommendation |
| --- | --- |
| Small internal system with stable topology | DNS or server-side discovery is often enough |
| Rapidly scaling microservices with frequent deploys | Registry + health-aware proxy routing |
| Team comfortable with rich client SDKs | Client-side discovery with local caching |
| Strong platform team and mesh investment | Service mesh with control-plane discovery |

When unsure in interviews, start with server-side discovery for simpler client behavior, then discuss where client-side control may be worth the complexity.

🧪 Practical Example: Evolving a Checkout Service Beyond Static Backends

Imagine a checkout service initially routed via hardcoded backend IPs.

Problems appear during traffic spikes:

  • New app instances launch but receive no traffic.
  • One bad instance still receives requests for minutes.
  • Rolling deploys create intermittent errors from stale endpoint lists.

A safer evolution path:

  1. Introduce a service registry with instance metadata.
  2. Route through a load balancer that consumes registry updates.
  3. Add readiness checks that include payment-db connectivity.
  4. Add zone-aware balancing to reduce cross-zone latency.
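Step 4 above, zone-aware balancing, might be sketched as a weighted zone-local preference. The 80/20 split and the field names are illustrative assumptions:

```python
import random

def pick_zone_aware(instances, client_zone, local_weight=0.8):
    """Sketch of zone-local preference: route most traffic to
    same-zone instances when any exist, spilling the remainder
    cross-zone so remote capacity is not stranded."""
    local = [i for i in instances if i["zone"] == client_zone]
    remote = [i for i in instances if i["zone"] != client_zone]
    if local and remote:
        pool = local if random.random() < local_weight else remote
    else:
        pool = local or remote  # fall back to whatever exists
    return random.choice(pool)

fleet = [
    {"host": "10.0.1.5", "zone": "us-east-1a"},
    {"host": "10.0.2.9", "zone": "us-east-1b"},
]
choice = pick_zone_aware(fleet, client_zone="us-east-1a")
print(choice["host"])
```

In production this selection would run over the registry's live, health-filtered instance set rather than a static list.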

Expected outcome:

| Before | After |
| --- | --- |
| Manual endpoint updates | Automatic registration and eviction |
| Inconsistent failover | Deterministic health-aware rerouting |
| Deploy-induced error spikes | Smoother rolling deployments |

This is a strong interview answer because it keeps architecture evolution incremental and justified by failures.

📚 Lessons Learned

  • Service discovery is a control-plane capability, not just a DNS trick.
  • Health checks must distinguish process liveness from real request readiness.
  • Faster failover is useful only when flapping is controlled.
  • Registry availability and correctness directly affect data-plane reliability.
  • Discovery design should align with team ownership and platform maturity.

📌 Summary & Key Takeaways

  • Dynamic systems need dynamic endpoint resolution.
  • Discovery and health checks are tightly coupled reliability primitives.
  • Readiness semantics matter more than raw check frequency.
  • Control-plane failures can become data-plane outages if unmanaged.
  • Start simple, then add richer routing metadata and policies as scale grows.

📝 Practice Quiz

  1. What is the main purpose of service discovery in distributed systems?

A) Encrypt all service-to-service traffic
B) Dynamically resolve available service instances
C) Replace logging and metrics systems

Correct Answer: B

  2. Why is readiness checking different from liveness checking?

A) Readiness checks only CPU usage
B) Liveness checks block all traffic automatically
C) Readiness determines if an instance can safely serve requests

Correct Answer: C

  3. Which risk is most associated with overly aggressive health-check thresholds?

A) Permanent cache misses
B) Endpoint flapping and traffic churn
C) Guaranteed strong consistency

Correct Answer: B

  4. Open-ended challenge: if your registry is healthy but clients still route to stale endpoints, where would you instrument first to isolate control-plane propagation delay versus client-side cache staleness?

Written by

Abstract Algorithms

@abstractalgorithms