System Design HLD Example: API Gateway for Microservices
A practical HLD for centralized routing, auth, throttling, and observability across services.
TLDR: An API Gateway centralizes "cross-cutting concerns" like authentication, rate limiting, and routing at the edge of your infrastructure. The architectural crux is the separation of the Control Plane (managing configurations) from the Data Plane (high-performance request proxying). By using an in-memory Radix Tree for routing and Redis for distributed rate limiting, the gateway ensures that backend services remain decoupled, secure, and focused solely on domain logic.
The "Messy Front Door" Problem
Imagine you've successfully migrated your monolith to 200 microservices. Each service has its own URL, its own way of checking user permissions, and its own rate limits. Now, your mobile app needs to render a single "Home Screen" that requires data from five different services: User Profile, Notifications, Order History, Recommendations, and Promotional Banners.
Without a gateway, the mobile app must:
- Manage five separate connections, increasing battery drain and latency.
- Handle five different auth tokens or complex handshakes.
- Know the internal IP or DNS of every service, making refactoring impossible.
If you change the "Notifications" service from /v1/notify to /v2/alerts, you have to push a new version of the mobile app to the App Store and wait weeks for users to update. This is the "Messy Front Door" problem. An API Gateway solves this by acting as a single, stable ingress point that translates a single external request into multiple internal ones, enforcing security and policy uniformly.
API Gateway: Use Cases & Requirements
Actors & Journeys
- External Client: A mobile app, browser, or third-party partner calling public APIs.
- Service Developer: Owns a backend microservice; wants to expose an endpoint without writing auth/rate-limiting boilerplate.
- Platform Engineer: Manages the gateway's global policies, such as "Block all traffic from IP range X" or "Enable Canary for Service Y."
- Security Auditor: Needs a centralized log of every request that entered the system.
In/Out Scope
- In-Scope: Request routing, protocol translation (HTTP to gRPC), authentication/authorization, rate limiting, request/response transformation, and canary traffic splitting.
- Out-of-Scope: Business logic processing, long-term data persistence, and heavy analytical processing (which belongs in a data warehouse).
Functional Requirements
- Authentication: Verify JWTs, API keys, or OAuth2 tokens at the edge.
- Routing: Match requests by path patterns (e.g., /users/**) and HTTP methods.
- Throttling: Enforce per-client and per-service rate limits.
- Observability: Generate correlation IDs (X-Request-Id) and log every request asynchronously.
- Canary Release: Support weighted traffic splitting (e.g., 5% to v2, 95% to v1).
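The canary requirement above has a subtle implementation detail: traffic splitting should be sticky per client, not random per request, so a user does not flip between v1 and v2 mid-session. A minimal sketch (the hashing scheme and function names are ours, not from the post):

```python
import hashlib

def pick_upstream(client_id: str, canary_weight: int) -> str:
    """Deterministically route a client to v1 or v2 given a canary
    weight of 0-100. Hashing the client ID pins each client to the
    same version across requests."""
    bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_weight else "v1"

# With weight 5, roughly 5% of clients should land on v2.
hits = sum(pick_upstream(f"client-{i}", 5) == "v2" for i in range(10_000))
```

Because the hash is deterministic, rolling the weight from 5 to 10 only moves new clients onto v2; clients already on v2 stay there.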
Non-Functional Requirements (NFRs)
- Low Overhead: Gateway latency must be $< 10ms$ (p99).
- High Availability: 99.999% uptime; as the sole ingress point, the gateway is a potential single point of failure for the entire platform.
- Stateless Scaling: Scale horizontally by adding more instances behind a Load Balancer.
- Configuration Hot-Reload: Update routing rules without restarting the gateway or dropping active connections.
Foundations: Data Plane vs. Control Plane
The most critical architectural distinction in a modern gateway is the split between the Data Plane and the Control Plane.
- The Data Plane: This is the "hot path." It handles the live traffic. It must be written in high-performance, non-blocking code (e.g., Netty, Envoy, or Go) and should have almost no dependencies on slow external databases.
- The Control Plane: This is the "management path." It stores the routing table and policies in a database (e.g., Postgres). When a platform engineer changes a rule, the Control Plane pushes the update to all Data Plane instances.
Why the split? If your database goes down, your Control Plane is broken (you can't add new routes), but your Data Plane is fine (it keeps routing traffic based on its last cached config). This is called Static Stability.
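Static stability can be captured in a few lines: the data plane holds the last-known-good routing table and simply keeps serving it when a refresh fails. A minimal sketch (the class, the `fetch` callable, and the table shape are all illustrative, not a real gateway API):

```python
class RoutingConfigCache:
    """Data-plane config holder: serves the last-known-good routing
    table even when the control plane is unreachable."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable returning the latest table, raising on failure
        self._routes = {}      # last successfully loaded routing table
        self.stale = False

    def refresh(self):
        try:
            self._routes = self._fetch()
            self.stale = False
        except Exception:
            # Control plane down: keep routing on the cached table.
            self.stale = True

    def lookup(self, path: str):
        return self._routes.get(path)

# Demo: first refresh succeeds, second fails; routing keeps working.
_state = {"up": True}
def _fetch():
    if not _state["up"]:
        raise ConnectionError("control plane unreachable")
    return {"/users": "user-svc"}

cache = RoutingConfigCache(_fetch)
cache.refresh()          # loads config
_state["up"] = False
cache.refresh()          # fails; last-known-good table survives
```

A real gateway would also alert on a persistently stale config, since "still routing" and "routing on week-old rules" are very different operational states.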
The Gateway Mechanics: The Plugin Pipeline
A request doesn't just "jump" to the backend. It passes through a Linear Filter Chain.
- Pre-Auth Filters: IP Whitelisting, DDoS protection.
- Auth Filters: JWT signature verification, API key lookup.
- Routing Filters: Path pattern matching, service discovery lookup.
- Transformation Filters: Header injection (X-User-Id), body masking.
- Execution: Forwarding the request to the upstream service.
- Post-Execution Filters: Logging, metric collection, and response header cleanup.
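The linear filter chain above can be sketched as a list of callables applied in order, with a short-circuit mechanism for rejections. All filter names and the request shape here are illustrative assumptions:

```python
class Halt(Exception):
    """Raised by a filter to short-circuit the chain (e.g. auth failure)."""
    def __init__(self, status: int):
        self.status = status

def run_chain(request: dict, filters) -> dict:
    """Run the request through a linear filter chain; each filter may
    mutate the request in place or halt with an error status."""
    for f in filters:
        f(request)
    return request

# Illustrative filters matching two of the stages above:
def ip_allowlist(req):
    if req["ip"] in {"203.0.113.9"}:      # hypothetical blocked address
        raise Halt(403)

def jwt_auth(req):
    if req.get("token") != "valid":
        raise Halt(401)
    req["headers"]["X-User-Id"] = "u-123"  # header-injection stage

req = {"ip": "198.51.100.7", "token": "valid", "headers": {}}
out = run_chain(req, [ip_allowlist, jwt_auth])
```

Ordering matters: cheap rejections (IP checks) run before expensive ones (signature verification), so abusive traffic is shed at minimum CPU cost.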
Estimations & Design Goals
Capacity Math (The Enterprise Scale)
- Requests per Second (RPS): 100,000 (Steady state).
- Peak RPS: 500,000.
- Number of Routes: 2,000.
- Active Consumers: 50,000 API Keys.
- Log Volume: $100K \text{ requests/sec} \times 1KB \text{ per log} = \mathbf{100 \text{ MB/sec}}$ of log data.
Scaling Targets
- Max Latency Overhead: 5ms.
- Redis Throttling Latency: $< 1ms$ (Requires Redis to be in the same VPC/Region).
- Config Sync Time: $< 5s$ across all 100+ gateway instances.
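The capacity math works out as follows (1 KB = 1000 bytes for estimation; the peak and daily figures are derived by us from the post's steady-state numbers):

```python
# Back-of-envelope capacity math from the estimates above.
steady_rps = 100_000
peak_rps = 500_000
log_bytes = 1_000            # ~1 KB per log line

steady_log_mb_s = steady_rps * log_bytes / 1_000_000   # MB/s at steady state
peak_log_mb_s = peak_rps * log_bytes / 1_000_000       # MB/s at peak
daily_log_tb = steady_log_mb_s * 86_400 / 1_000_000    # TB/day at steady state
```

At roughly 8.6 TB of raw logs per day, the async Kafka path is not optional: no gateway can afford synchronous disk or network writes of that volume on the request path.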
High-Level Design: The Distributed Gateway Architecture
```mermaid
graph TD
    Client[Mobile/Web Client] --> LB[Network Load Balancer]
    LB --> GW[API Gateway Cluster]
    subgraph "Data Plane (Stateless)"
        GW --> Auth[Auth Cache: Redis]
        GW --> RL[Rate Limit: Redis]
        GW --> Logic[Plugin Pipeline]
    end
    subgraph "Control Plane (Stateful)"
        Admin[Admin API] --> DB[(Postgres: Routes & Policies)]
        DB --> ConfigSvc[Config Distribution Svc]
        ConfigSvc -- Push --> GW
    end
    GW -- Proxy --> SvcA[User Service]
    GW -- Proxy --> SvcB[Order Service]
    GW -- Async Log --> Kafka{Kafka}
    Kafka --> Logger[Logging Service]
```
The gateway cluster maintains a local in-memory Radix Tree copy of the routing table. The Config Distribution Service pushes routing updates to every instance via a Redis pub/sub channel within five seconds of any Admin API change; no gateway restart is required. All auth token lookups and rate-limit counter reads resolve from Redis, never from the Postgres control plane, so the data plane continues routing even when the config database is temporarily unreachable. The async log path to Kafka keeps Kafka I/O completely off the client response path, ensuring the p99 overhead stays below 5ms.
Route and Policy Data Model
Every route is stored as a structured record in the Postgres control plane database and cached locally on each gateway instance:
| Column | Type | Description |
| --- | --- | --- |
| route_id | UUID | Unique route identifier |
| path_pattern | VARCHAR | Glob pattern, e.g. /api/v1/users/** |
| http_methods | TEXT[] | Allowed HTTP methods for this route |
| upstream_url | VARCHAR | Target service URL or load-balanced name |
| auth_required | BOOLEAN | Whether JWT or API key validation is required |
| rate_limit_rps | INTEGER | Per-consumer requests-per-second cap |
| canary_weight | SMALLINT | Percentage of traffic routed to the canary version (0-100) |
| plugin_chain | JSONB | Ordered list of filter plugin names to execute |
| updated_at | TIMESTAMP | Last configuration change timestamp |
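The gateway's in-memory copy of a route record might look like the following dataclass. The field names mirror the table above; the validation rule and defaults are our assumptions:

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4
from datetime import datetime, timezone

@dataclass
class Route:
    """In-memory mirror of one row of the routes table above."""
    path_pattern: str
    upstream_url: str
    http_methods: list
    auth_required: bool = True
    rate_limit_rps: int = 100
    canary_weight: int = 0
    plugin_chain: list = field(default_factory=list)
    route_id: UUID = field(default_factory=uuid4)
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        # Reject malformed config at load time, before it reaches the data plane.
        if not 0 <= self.canary_weight <= 100:
            raise ValueError("canary_weight must be between 0 and 100")

r = Route("/api/v1/users/**", "http://user-svc:8080", ["GET", "POST"])
```

Validating records as they are deserialized from the control plane is one cheap defense against the "hot-reload of a bad config" failure mode discussed later.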
Deep Dive: Radix Tree Routing, Token Bucket Rate Limiting, and JWT Cache Under Load
The gateway's two most performance-critical components are the routing engine and the distributed rate limiter. Together they explain how a single gateway instance sustains hundreds of thousands of requests per second with sub-5ms overhead per request.
Internals: Radix Tree Path Lookup and Token Authentication Caching
A naive routing implementation stores routes in a list and searches linearly, O(N) per request. With 2,000 routes and 100,000 RPS, this burns significant CPU. A Radix Tree (Patricia Trie) compresses route paths into a branching prefix structure. Looking up /api/v1/users/42 traverses the tree character-group by character-group until a leaf node bound to an upstream URL is reached. Time complexity is O(K), where K is the path length, completely independent of total route count. Both Nginx and Envoy implement this internally; Go's http.ServeMux uses a similar trie. For wildcard paths like /api/v1/users/**, the tree stores a wildcard edge at the point of divergence, matching any continuation after that prefix.
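A per-path-segment trie captures the same O(K) property and is easier to read than a byte-level radix tree, which compresses shared prefixes further. This is a sketch, not a production router; the `"$upstream"` sentinel key is our convention:

```python
class TrieRouter:
    """Segment-level prefix-tree router: lookup cost grows with path
    depth, not with the number of registered routes."""

    def __init__(self):
        self.root = {}

    def add(self, pattern: str, upstream: str):
        node = self.root
        for seg in pattern.strip("/").split("/"):
            node = node.setdefault(seg, {})
        node["$upstream"] = upstream        # sentinel: route terminates here

    def match(self, path: str):
        node = self.root
        for seg in path.strip("/").split("/"):
            if seg in node:
                node = node[seg]
            elif "**" in node:
                # Wildcard edge: matches any continuation of the path.
                return node["**"].get("$upstream")
            else:
                return None
        return node.get("$upstream")

router = TrieRouter()
router.add("/api/v1/users/**", "http://user-svc")
router.add("/api/v1/orders", "http://order-svc")
```

Matching /api/v1/users/42 walks exactly four segments regardless of whether the router holds 2 routes or 2,000, which is the entire point of the structure.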
JWT validation presents a parallel internals challenge. Verifying an RS256 JWT signature requires a public-key cryptographic operation costing roughly 0.5-1ms of CPU per request. At 100,000 RPS, that adds up to as much as 100 CPU-seconds of verification work per second across the cluster, an unacceptable overhead. The gateway resolves this by caching the validated token payload (user ID, scopes, and expiry timestamp) in Redis with a 5-minute TTL. On a cache hit, the gateway injects the identity headers directly into the upstream request and skips all cryptographic work. For API keys, the gateway keeps a local in-memory hash map refreshed every 60 seconds from Redis, eliminating the network round-trip for the most frequent authentication pattern entirely.
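The caching pattern can be sketched with the expensive signature check stubbed out (a real gateway would call a JWT library and use Redis rather than a local dict; everything here is illustrative):

```python
import time

def expensive_verify(token: str) -> dict:
    """Stand-in for RS256 signature verification (~0.5-1ms of CPU).
    The stub counts invocations so the cache's effect is visible."""
    expensive_verify.calls += 1
    return {"user_id": "u-123", "scopes": ["read"]}
expensive_verify.calls = 0

class TokenCache:
    """Cache validated token claims with a TTL (the design above uses
    5 minutes; the unit here is seconds for easy experimentation)."""

    def __init__(self, ttl_s: float = 300.0):
        self.ttl = ttl_s
        self.store = {}   # token -> (claims, expiry)

    def validate(self, token: str) -> dict:
        hit = self.store.get(token)
        now = time.monotonic()
        if hit and hit[1] > now:
            return hit[0]                    # cache hit: no crypto work
        claims = expensive_verify(token)     # cache miss: full verification
        self.store[token] = (claims, now + self.ttl)
        return claims

cache = TokenCache(ttl_s=300)
cache.validate("tok-abc")
cache.validate("tok-abc")   # second call is served from the cache
```

Note the trade-off the TTL encodes: a revoked token remains valid for up to the TTL window, which is why shorter TTLs (or explicit cache invalidation on revocation) matter for sensitive scopes.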
Performance Analysis: Atomic Token Bucket Enforcement via Redis Lua Scripts
The gateway enforces per-consumer rate limits using the Token Bucket algorithm. Each API consumer has a virtual bucket with a configurable capacity. Each request consumes one token. Tokens replenish at a fixed rate per second. The critical correctness requirement is atomicity: without it, two concurrent requests could both read a count of "1" and both proceed past the limit, resulting in a 2× burst that violates the contract. The gateway executes a Redis Lua script that reads the current token count, decrements it if tokens are available, updates the next-replenishment timestamp, and returns the result, all as a single indivisible operation. Redis guarantees that no other command can interleave during Lua script execution. At p99, Redis Lua script execution completes in under 0.5ms when the cache cluster is colocated in the same data center as the gateway nodes, making the rate-limit check negligible even at 500,000 RPS.
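The logic the Lua script performs can be mirrored in pure Python to make the lazy-refill mechanics concrete. In production this read-modify-write must run inside Redis to be atomic across gateway instances; this local sketch is only the algorithm:

```python
import time

class TokenBucket:
    """Pure-Python mirror of the token-bucket logic: refill lazily based
    on elapsed time, then try to consume one token per request."""

    def __init__(self, capacity: int, refill_per_s: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)
        self.now = now            # injectable clock for deterministic tests
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Lazy replenishment: credit tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_per_s)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock instead of wall time:
clock = {"t": 0.0}
bucket = TokenBucket(capacity=2, refill_per_s=1.0, now=lambda: clock["t"])
burst = [bucket.allow() for _ in range(3)]   # two tokens, then rejected
clock["t"] = 1.0
later = bucket.allow()                       # one token has refilled
```

Lazy refill is what makes the Redis version cheap: no background timer per consumer, just arithmetic on the timestamp stored alongside the counter.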
Real-World Deployments: Kong, AWS API Gateway, and Netflix Zuul
Kong is the most widely deployed open-source API gateway. Its plugin system mirrors the filter chain described above: teams attach rate limiting, OAuth, request transformation, and distributed tracing to any route via a single REST Admin API call, with no gateway process restart required. Kong uses PostgreSQL for control plane config and pushes updates to worker nodes via a polling sync loop with a configurable interval, typically set to 1-5 seconds.
AWS API Gateway represents the fully managed extreme. When you deploy a stage, AWS compiles your routing configuration into an immutable, pre-resolved snapshot and distributes it to edge Points of Presence globally. The data plane at runtime has zero live database dependency: it serves entirely from compiled in-memory routing tables. This is why AWS API Gateway continues routing during large-scale AWS regional control plane outages; the data plane is architecturally static-stable. The trade-off is inflexibility: you must redeploy a new stage rather than hot-patching a single route entry.
Netflix Zuul pioneered dynamic filter loading at production scale. Groovy-based filters can be pushed to a running Zuul instance without a process restart, making Zuul the first truly hot-reloadable filter chain operating on internet-scale traffic. Netflix's Zuul 2 migration from blocking I/O to non-blocking Netty demonstrated that blocking thread-per-request I/O is a hard architectural ceiling for gateways handling more than 50,000 concurrent connections at sub-10ms latency.
Trade-offs and Failure Modes in API Gateway Design
The gateway is the single ingress point for your entire platform. Every architectural choice has amplified blast-radius consequences.
| Dimension | Option A | Option B | Recommended Default |
| --- | --- | --- | --- |
| Deployment model | Centralized gateway | Sidecar proxy per service (Envoy) | Centralized for fewer than 50 services; sidecar for 50+ |
| Auth strategy | Gateway as sole auth authority | Gateway edge check + service-level RBAC | Both layers for defense-in-depth |
| Logging | Synchronous (adds latency) | Async to Kafka (risk losing logs on crash) | Async with at-least-once Kafka delivery |
| Config propagation | Hot-reload | Immutable deploy | Immutable for production safety; hot-reload for development speed |
| Rate limit enforcement | Distributed via Redis | Local in-memory per instance | Redis for billing-critical limits; local for DDoS defense |
Cascade Failure Risk: If the gateway cluster becomes CPU-saturated, every upstream service simultaneously becomes unreachable: a platform-wide outage triggered by any single misbehaving client. Mitigations include per-upstream circuit breakers, request shedding under CPU pressure (shed analytics and telemetry traffic before payment and auth traffic), and auto-scaling driven by active connection count rather than RPS alone.
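A per-upstream circuit breaker with a half-open probe state can be sketched as a small state machine (thresholds, state names, and the injectable clock are our illustrative choices):

```python
import time

class CircuitBreaker:
    """Per-upstream breaker: trips OPEN after `threshold` consecutive
    failures, moves to HALF_OPEN after `cooldown_s` so a single probe
    request can test recovery."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0,
                 now=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown_s
        self.now = now
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.now() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"    # let one probe through
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.now()
            self.failures = 0

# Deterministic demo with a fake clock:
clock = {"t": 0.0}
cb = CircuitBreaker(threshold=2, cooldown_s=30.0, now=lambda: clock["t"])
cb.record_failure(); cb.record_failure()   # second failure trips OPEN
blocked = cb.allow_request()               # rejected while cooling down
clock["t"] = 31.0
probe = cb.allow_request()                 # half-open probe allowed
cb.record_success()                        # probe succeeded: back to CLOSED
```

The half-open state is the crucial part: it converts recovery from an all-or-nothing gamble into a measured, single-request experiment.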
Redis Rate-Limit Dependency Failure: When the Redis cluster is unreachable, the gateway must choose between full permissiveness (DDoS exposure) and full rejection (self-inflicted outage). The correct design is a local in-memory token bucket fallback with a conservative rate (50% of the configured value), combined with immediate alerting. Never architect a system where one dependency failure collapses into only two binary outcomes.
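The degraded-mode behavior described above can be sketched as a wrapper around the distributed check. The `redis_allow` callable stands in for the Lua-script call; the per-second reset of the local budget is elided:

```python
class DegradingRateLimiter:
    """Wraps a distributed (Redis-backed) check with a conservative
    local fallback: on connection failure, enforce a per-instance
    budget at 50% of the configured rate instead of failing fully
    open or fully closed."""

    def __init__(self, redis_allow, limit_rps: int):
        self.redis_allow = redis_allow
        self.fallback_budget = limit_rps // 2   # conservative 50%
        self.local_used = 0                     # reset each second in a real gateway

    def allow(self, key: str) -> bool:
        try:
            return self.redis_allow(key)
        except ConnectionError:
            # Degraded mode: local per-instance budget.
            if self.local_used < self.fallback_budget:
                self.local_used += 1
                return True
            return False

def redis_down(key):
    """Simulated outage: every distributed check fails."""
    raise ConnectionError("redis unreachable")

limiter = DegradingRateLimiter(redis_down, limit_rps=10)
results = [limiter.allow("client-1") for _ in range(6)]
```

The fallback is deliberately stricter than the configured limit because each gateway instance enforces its budget independently during the outage, and the sum across instances should not exceed the global contract.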
Decision Guide: Matching Gateway Strategy to Platform Scale
| System Characteristics | Recommended Gateway Approach |
| --- | --- |
| Fewer than 10 microservices, single team | AWS API Gateway or Kong Cloud (managed, zero ops overhead) |
| 10-50 services, multiple teams | Self-hosted Kong or Traefik with a shared config repository |
| 50-200 services, multi-region | Envoy-based gateway (Ambassador, Emissary, or custom control plane) |
| 200+ services, global active-active | Service mesh (Istio/Linkerd) for east-west traffic plus a dedicated edge gateway for north-south |
| Latency budget under 1ms per hop | Sidecar proxies (Envoy) to eliminate the centralized gateway network hop entirely |
| API monetization or developer portal | Kong with Rate Limiting Advanced and Developer Portal plugins |
| gRPC-first microservices | Envoy with gRPC-JSON transcoding for HTTP/1.1 backward compatibility |
In an interview, always clarify the boundary between gateway-level authentication (who is the caller?) and service-level authorization (what is this caller allowed to do within this domain?). The gateway should never serve as the sole authority for business permissions that require domain context only the backend service holds.
Interview Delivery Example: Walking Through an API Gateway Design in 45 Minutes
Structure your whiteboard session to maximize signal at each phase:
Minutes 1-5, Frame the Problem: Open with the "Messy Front Door" scenario. Without a gateway, mobile clients manage separate connections per service, auth logic is duplicated across every service team, and refactoring any internal URL requires an app-store update. Establish NFRs: sub-10ms overhead, 99.999% availability, config hot-reload within 5 seconds.
Minutes 6-15, Data Plane vs. Control Plane Split: Draw the split before any service boxes. Explain static stability: the Data Plane continues routing on its last cached config even when the Control Plane database is unreachable. This is the highest-signal observation you can deliver in the first 15 minutes of a system design interview.
Minutes 16-30, Walk the Plugin Pipeline: Enumerate filter chain stages in order. Explain Radix Tree routing (O(K) vs. O(N) for list-based routing). Explain Token Bucket via Redis Lua atomics. Describe JWT validation caching with a 5-minute TTL to eliminate signature verification overhead.
Minutes 31-40, Failure Modes (Proactively): Raise the Redis rate-limiting failure scenario before the interviewer asks. Propose the local in-memory fallback. Raise the cascade failure risk from upstream saturation and propose circuit breakers and CPU-pressure-based request shedding.
Minutes 41-45, Trade-offs and OSS Comparison: Compare centralized gateway vs. sidecar (Envoy/Istio). Recommend sidecar for Netflix-scale deployments, centralized for teams under 50 services. Reference Kong, Envoy, and AWS API Gateway as production implementations each representing a different point in the control-plane vs. data-plane design space.
Open-Source Gateway Implementations Worth Knowing
- Kong: Nginx/OpenResty-based. Plugin ecosystem covers auth, rate limiting, logging, and request transformation without code changes. PostgreSQL or Cassandra for config storage.
- Envoy Proxy: C++-based, non-blocking I/O. Foundation of Istio and AWS App Mesh. Uses the xDS protocol for control plane communication. Best for service-mesh-adjacent deployments.
- Traefik: Go-based. Auto-discovers routes from Docker labels and Kubernetes Ingress annotations. Zero-config for container-native teams who want automatic route registration.
- Apache APISIX: etcd-based config, Lua plugin support, dynamic route updates without control plane restart. Strong choice for teams needing high plugin extensibility at lower operational cost than Kong Enterprise.
Lessons Learned from API Gateway Failures in Production
No Circuit Breakers Caused a Platform-Wide Outage. A major e-commerce platform's gateway had no circuit breakers on upstream connections. When the Order Service began timing out, the gateway kept forwarding requests into the failure. Within two minutes, the gateway's connection pool was exhausted. All services sharing the gateway, including completely unrelated services like User Profile, became simultaneously unreachable. Every upstream, regardless of perceived reliability, needs a circuit breaker with a half-open probe state that allows measured recovery testing.
Config Hot-Reload Without Validation Broke Payments for Four Minutes. A financial platform pushed a routing rule with a typo in the upstream URL. Hot-reload applied it immediately to all gateway instances. 100% of Payment Service traffic returned HTTP 502 for four minutes until the team manually rolled back. Always validate config changes against a staging environment before propagation, and implement a two-phase deploy (validate first, then apply with a canary rollout) with automatic rollback triggered by error-rate spike detection.
Rate Limits by IP Let Bots Through While Blocking Enterprise Customers. A SaaS API applied rate limits by source IP. Distributed bot traffic from 50,000 unique IPs each sending one request per minute bypassed the per-IP threshold entirely. Meanwhile, a legitimate enterprise customer routing all traffic through a single corporate NAT IP was blocked. Apply rate limits on authenticated consumer identity (API key or OAuth client ID) as the primary dimension. Source IP is a secondary DDoS defense layer, not the primary rate-control mechanism.
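The lesson reduces to a key-selection function: authenticated identity is the primary rate-limit dimension, source IP only a fallback for anonymous traffic. A minimal sketch (the request shape and key prefixes are illustrative):

```python
def rate_limit_key(request: dict) -> str:
    """Choose the rate-limit dimension: authenticated consumer
    identity first, source IP only for anonymous traffic."""
    if request.get("api_key"):
        return f"key:{request['api_key']}"
    if request.get("oauth_client_id"):
        return f"client:{request['oauth_client_id']}"
    return f"ip:{request['ip']}"    # secondary DDoS layer only

k1 = rate_limit_key({"api_key": "ak-42", "ip": "203.0.113.7"})
k2 = rate_limit_key({"ip": "203.0.113.7"})
```

With this keying, the enterprise customer behind one NAT IP gets one generous per-key budget, while each bot API key (or anonymous IP) is throttled independently.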
TLDR & Key Takeaways
- Split the gateway into a stateless Data Plane (Redis-backed, handles live traffic) and a stateful Control Plane (Postgres-backed, manages configuration). The Data Plane must operate independently when the Control Plane database is down; this is called static stability.
- Use a Radix Tree for O(K) path routing and a Token Bucket implemented as a Redis Lua atomic script for accurate distributed rate limiting without race conditions across instances.
- Cache JWT validation results in Redis with a 5-minute TTL to eliminate per-request cryptographic overhead on the hot request path.
- The gateway's greatest production risk is becoming a cascade failure amplifier. Per-upstream circuit breakers and CPU-pressure-based request shedding are non-negotiable safety mechanisms.
- Apply rate limits on authenticated consumer identity, not source IP, to correctly target both abuse and enterprise traffic patterns.
- In interviews: draw the data plane / control plane split first, walk the plugin pipeline second, surface failure modes proactively third.
Written by Abstract Algorithms (@abstractalgorithms)