System Design HLD Example: Distributed Rate Limiter
Design a production-grade HLD for distributed rate limiting with fairness and low latency.
TLDR: A distributed rate limiter protects APIs from abuse and "noisy neighbors" by enforcing request quotas across a cluster of servers. The core technical challenge is Atomic State Management, solved by using Redis Lua scripts to perform a "check-and-increment" operation in a single atomic step, preventing race conditions that would allow users to bypass their limits.
The "Thundering Herd" API Crash
Imagine your company just launched a new public API. You have 1,000 users on the "Free Tier," each allowed 100 requests per minute. Suddenly, a popular blog writes about your API, and 500 of those users simultaneously start running automated scripts.
If your rate limiter is naive and runs locally on each of your 10 web servers, each server will independently allow every user 100 requests per minute.
- The Result: 500 users × 100 requests × 10 servers = 500,000 requests per minute hitting your database.
- The Fallout: Your primary database CPU hits 100%, your payment service times out, and your "Paid Tier" enterprise customers are now seeing 500 Internal Server Error responses.
This is the Distributed Quota Leak. A rate limiter isn't just a safety check; it's a "Load Shedder" that ensures your system stays alive during spikes. If it's not distributed and atomic, it's just a suggestion, not a limit.
Rate Limiting: Use Cases & Requirements
Actors
- API Client: The application or user calling the API.
- Service Owner: Defines the limits (e.g., "100 RPM for Free, 10k RPM for Pro").
- Gateway: The component that intercepts requests and asks the Rate Limiter, "Is this allowed?"
Functional Requirements
- Distributed Enforcement: Limits must be shared across all servers in a cluster.
- Flexible Algorithms: Support for Token Bucket, Leaky Bucket, or Sliding Window.
- Granular Policies: Rate limit by API Key, IP Address, or specific endpoints.
- Error Feedback: Return 429 Too Many Requests with a Retry-After header.
Non-Functional Requirements
- Ultra-Low Latency: Every single API call passes through the rate limiter. It must respond in < 2ms.
- High Availability: If the rate limiter fails, the system should "Fail Open" (allow requests) rather than blocking everything.
- Scalability: Must handle millions of limit checks per second.
- Accuracy: Minimize the "Race Condition" window where two requests can slip through the limit simultaneously.
Basics: Common Algorithms
There are three primary ways to count requests:
- Fixed Window: Counts requests in 1-minute blocks (e.g., 0:00-0:59). Problem: You can double your effective rate at the edge of the window (e.g., with a limit of 50, send 50 requests at 0:59 and 50 more at 1:00: 100 requests land within two seconds).
- Token Bucket: A bucket holds $N$ tokens. Each request takes one. Tokens refill at a steady rate. Pros: Allows for "bursts" of traffic.
- Sliding Window Log: Stores the timestamp of every request. Pros: Most accurate. Cons: Uses massive amounts of memory.
For most distributed systems, the Sliding Window Counter (a hybrid approach) provides the best balance of memory and accuracy.
Mechanics: Atomic Increment with Redis
In a distributed environment, you cannot use a local variable to count requests. You must use a shared store like Redis.
The most critical mechanic is the Atomic Read-Modify-Write.
If two servers both see count = 99, and both increment it to 100, they might both allow the request even though the limit was 100.
To solve this, we use Redis Lua Scripts. Redis executes the entire script in a single thread, ensuring that the "Check" and the "Increment" happen as one indivisible operation.
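A minimal sketch of this pattern with redis-py, assuming a simple fixed-window counter; the key pattern, limit, and window length here are illustrative, not a prescribed schema:

```python
import redis

r = redis.Redis()

# Check-and-increment as one indivisible step: Redis runs the whole
# script on its single thread, so no other client's command can land
# between the GET and the INCR.
CHECK_AND_INCREMENT = r.register_script("""
local count = tonumber(redis.call('GET', KEYS[1]) or '0')
if count >= tonumber(ARGV[1]) then
  return 0                                -- reject: limit reached
end
count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])  -- first hit starts the window
end
return 1                                  -- allow
""")

def allow_request(api_key: str, limit: int = 100, window_sec: int = 60) -> bool:
    return CHECK_AND_INCREMENT(keys=[f"rl:fixed:{api_key}"],
                               args=[limit, window_sec]) == 1
```

Because the read, the comparison, and the increment all execute inside one script invocation, the "two servers both see 99" race described above cannot occur.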
Estimations & Design Goals
The Math of Protection
- Peak Traffic: 1 Million requests per second (RPS).
- Users: 10 Million active API keys.
- Memory per User: One Redis key (approx 100 bytes).
- Total Redis Memory: 10M keys × 100 bytes = 1 GB of RAM.
Design Goal: The rate limiter must be a Stateless Sidecar or Middleware. It should reside as close to the Load Balancer as possible to "Drop" bad traffic before it even reaches your application servers.
High-Level Design: The API Gateway Filter
The rate limiter typically lives within or adjacent to the API Gateway.
graph TD
User((User)) --> LB[Load Balancer]
LB --> GW[API Gateway / Sidecar]
subgraph Enforcement_Path
GW --> RLS[Rate Limiter Service]
RLS --> RC[(Distributed Counter: Redis)]
end
subgraph Control_Plane
Admin[Admin UI] --> PS[Policy Service]
PS --> PDB[(Policy DB: Postgres)]
PS --> RC
end
The diagram shows the two-layer architecture of a production rate limiter. The Enforcement Path sits inline on every API request: the Gateway consults the Rate Limiter Service, which performs an atomic check-and-increment against the Distributed Counter in Redis. If the counter exceeds the configured policy limit, the gateway immediately returns a 429 Too Many Requests response without forwarding the request to the application tier, effectively shedding load before it reaches downstream services. The Control Plane is the management layer: the Policy Service reads limit configurations from Postgres and writes updated policies into Redis, enabling policy changes to propagate within seconds without a deployment.
Deep Dive: How Each Algorithm Works and When the Atomic Redis Pattern Is Required
Internals: How Redis Lua Scripts Provide Atomic Enforcement Across a Distributed Cluster
Redis executes Lua scripts on its single-threaded event loop, the same loop that handles all GET, SET, INCR, and EXPIRE commands. When a Rate Limiter Service instance sends an EVALSHA command with a Lua script, Redis processes it from start to finish without executing any other command in between. This is stronger than a MULTI/EXEC transaction for this use case: a transaction queues commands and runs them atomically, but it cannot branch on a value read mid-transaction, whereas a Lua script can read the counter, compare it to the limit, and conditionally increment, all in one atomic step.
The Sliding Window Counter Lua script reads two keys, the current window counter and the previous window counter, calculates the weighted estimate (prev_count × remaining_fraction) + curr_count, and either increments the current counter (allow) or returns a denial code (reject), all within a single atomic operation. The script also sets the TTL on the current window key if it does not already exist, ensuring that idle API keys' Redis entries expire automatically and do not accumulate indefinitely.
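A sketch of such a script, embedded in Python as a string; it assumes the caller computes the elapsed fraction of the current window, and that window rotation (promoting the current key to the previous key at each minute boundary) is handled elsewhere. Key names follow the rl:sw_curr / rl:sw_prev schema shown later:

```python
# Sliding Window Counter check (illustrative sketch, not a reference impl).
# KEYS[1] = rl:sw_curr:{api_key}, KEYS[2] = rl:sw_prev:{api_key}
# ARGV[1] = limit, ARGV[2] = window seconds, ARGV[3] = elapsed fraction (0.0-1.0)
SLIDING_WINDOW_LUA = """
local curr = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local elapsed = tonumber(ARGV[3])
-- Weighted estimate: the previous window counts in proportion to how
-- much of it still overlaps the sliding one-minute window.
local estimate = prev * (1 - elapsed) + curr
if estimate >= tonumber(ARGV[1]) then
  return 0                                              -- reject
end
redis.call('INCR', KEYS[1])
if redis.call('TTL', KEYS[1]) < 0 then
  redis.call('EXPIRE', KEYS[1], 2 * tonumber(ARGV[2]))  -- set TTL once
end
return 1                                                -- allow
"""
```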
Redis's single-threaded model means that while a Lua script runs, every other command queues behind it (the default lua-time-limit is 5 seconds, after which Redis starts answering other clients with BUSY errors). Production rate limiter scripts must complete in under 0.1ms, achievable given that they perform only 2–4 O(1) Redis data structure operations.
Performance Analysis: Latency Impact and Memory Footprint at 1M Requests Per Second
Adding a rate limiter to every API call creates a mandatory latency overhead. Here is the production cost breakdown:
| Component | Latency Added | Notes |
|---|---|---|
| Rate Limiter Service network round-trip (gRPC) | 0.3–0.5ms | In-region service call |
| Redis Lua script execution | 0.05–0.1ms | Single-threaded O(1) ops |
| Local policy cache lookup | < 0.01ms | In-process LRU cache hit |
| Total overhead (happy path, cache hit) | ~0.5ms | Well within 2ms NFR |
| Total overhead (policy cache miss + Redis policy read) | ~1.5ms | Policy read adds one extra Redis round-trip |
Memory cost at scale:
| Data | Keys | Size per Key | Total Redis Memory |
|---|---|---|---|
| Sliding Window counters (current + previous) | 10M active API keys × 2 | 50 bytes | 1 GB |
| Policy configurations | 10M API keys | 200 bytes | 2 GB |
| Total Rate Limiter Redis footprint | | | ~3 GB |
A 3-node Redis Cluster with 8GB RAM per node comfortably holds the entire rate limiter state with more than 60% memory headroom. At 1 million requests per second, each generating one Redis EVALSHA call, the cluster must handle 1M atomic script executions per second, approximately 333K per node, well within Redis's documented throughput of 500K+ simple commands per second on modern hardware.
Three algorithms dominate rate limiter implementations in production. Understanding their mechanics, memory footprints, and edge cases is the most important technical depth an interviewer will probe.
Algorithm Comparison Table
| Algorithm | Memory Per User | Burst Handling | Edge Case | Best For |
|---|---|---|---|---|
| Fixed Window Counter | O(1), 1 Redis key per window | Allows 2× burst at window boundary | Window-edge doubling exploit | Simple internal APIs, low-stakes limits |
| Sliding Window Log | O(n), stores every timestamp | Perfect accuracy, no burst exploit | Prohibitive memory at high throughput | Low-volume, high-accuracy use cases |
| Token Bucket | O(1), 2 Redis fields (tokens, last_refill) | Designed for controlled bursts | Requires atomic refill calculation | APIs that allow burst within overall limit |
| Sliding Window Counter | O(1), 2 keys per window | Approximation, not exact | Small inaccuracy at window boundary | High-throughput, production APIs |
The Sliding Window Counter, a hybrid of Fixed Window and Sliding Window Log, is the production standard. It uses two counters: one for the current minute window and one for the previous minute window. The estimated count is calculated as: (previous_count × remaining_fraction_of_previous_window) + current_count. For example, 15 seconds into the current minute, with 80 requests in the previous window and 30 in the current one, the estimate is 80 × 0.75 + 30 = 90. This approximation is accurate to within 1% at scale while using only two O(1) Redis keys per user.
Why Atomic Operations Are Non-Negotiable in Distributed Systems
Consider what happens without atomicity. Server A reads the counter for user api_key_123 and sees count = 99 against a limit of 100. Server B also reads count = 99 simultaneously. Both servers conclude the request is allowed and increment to 100. The result: two requests slip through when only one should have been permitted, and the limit is violated.
In Redis, the INCR command is atomic: it reads and increments in a single indivisible operation. However, a simple INCR does not handle the "check before increment" logic needed for rate limiting (increment only if below limit). This is why Redis Lua Scripts are the standard solution: Redis executes an entire Lua script as a single atomic operation on a single thread, ensuring that the check and increment happen with no interleaving.
Redis Key Schema for Rate Limiting
| Redis Key Pattern | Value | TTL | Algorithm |
|---|---|---|---|
| rl:fixed:{api_key}:{window_minute} | Integer count | 60 seconds | Fixed Window Counter |
| rl:sw_curr:{api_key} | Integer count (current window) | 60 seconds | Sliding Window Counter |
| rl:sw_prev:{api_key} | Integer count (previous window) | 120 seconds | Sliding Window Counter |
| rl:bucket:{api_key} | Hash {tokens, last_refill_ts} | No TTL (lazy expiry on access) | Token Bucket |
| rl:policy:{api_key} | JSON {limit, window_sec, algorithm} | No TTL (updated by Control Plane) | All algorithms |
The rl:policy:{api_key} key stores the per-API-key configuration. The Rate Limiter Service reads this key before executing the algorithm, allowing different users to have different limits (Free: 100 RPM, Pro: 10,000 RPM, Enterprise: custom) with no code changes, just Redis key updates from the Policy Service.
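As a sketch, one policy entry might look like this (key name and field values are hypothetical):

```python
import json
import redis

r = redis.Redis()

# A Free-tier policy as the Policy Service might write it.
r.set("rl:policy:key_abc123",
      json.dumps({"limit": 100, "window_sec": 60, "algorithm": "sliding_window"}))

# The Rate Limiter Service parses it before running the chosen algorithm.
policy = json.loads(r.get("rl:policy:key_abc123"))
```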
The Token Bucket: A Visual Walk-Through
graph TD
A[Request Arrives] --> B{Check Tokens in Bucket}
B -->|Tokens >= 1| C[Consume 1 Token]
C --> D[Allow Request]
B -->|Tokens < 1| E[Reject with 429]
F[Refill Timer] -->|Add tokens at fixed rate| B
G[Burst Scenario] -->|Multiple requests arrive simultaneously| B
The Token Bucket diagram shows how burst traffic is handled. When multiple requests arrive at once, each consumes one token from the available pool. As long as tokens remain, all requests are allowed; this is the "burst" capacity. When the bucket is empty, additional requests receive a 429 response until the refill timer adds new tokens at the configured rate. Unlike the Fixed Window algorithm, Token Bucket never allows the boundary-edge doubling exploit: the token count is the authoritative limit regardless of when within a window a request arrives.
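A sketch of a lazy-refill Token Bucket: instead of an actual refill timer, each check tops the bucket up arithmetically based on elapsed time, then tries to consume one token. The capacity and refill rate below are illustrative:

```python
import time
import redis

r = redis.Redis()

# Lazy refill: no background process; refill is computed at check time.
# The whole read-refill-consume-write sequence runs as one atomic script.
TOKEN_BUCKET = r.register_script("""
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])   -- tokens added per second
local now      = tonumber(ARGV[3])   -- caller-supplied unix time (seconds)
local b = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill_ts')
local tokens = tonumber(b[1]) or capacity
local last   = tonumber(b[2]) or now
tokens = math.min(capacity, tokens + (now - last) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill_ts', now)
return allowed
""")

def take_token(api_key: str) -> bool:
    # 10-token burst capacity, refilling at 100 tokens per minute.
    return TOKEN_BUCKET(keys=[f"rl:bucket:{api_key}"],
                        args=[10, 100 / 60, time.time()]) == 1
```

Passing the timestamp from the caller keeps the script simple; clock skew between application servers slightly perturbs the refill, which a real deployment might avoid by reading the time inside Redis instead.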
Real-World Applications: Stripe, GitHub, and Cloudflare at Rate Limiting Scale
Stripe enforces API rate limits across all of their payment and payout endpoints, using a combination of Token Bucket (for smooth burst control on critical write endpoints like charge creation) and Sliding Window Counter (for read endpoints). The limiter is implemented as a Redis-backed sidecar that runs on every API server, and Stripe's public documentation specifies exact limits per endpoint tier and includes Retry-After headers on 429 responses so well-behaved clients can back off correctly.
GitHub uses rate limiting to protect their API from automated scraping and CI pipeline abuse. GitHub's rate limiter distinguishes between authenticated requests (5,000 per hour per API token) and unauthenticated requests (60 per hour per IP). GitHub's implementation uses a Fixed Window Counter with a 1-hour window, accepting the window-edge doubling risk (10,000 requests in a 2-minute span at the boundary) because their primary concern is aggregate hourly abuse, not sub-minute burst control.
Cloudflare operates rate limiting at a fundamentally different scale: they process over 55 million HTTP requests per second across their global edge network. Cloudflare's rate limiter cannot use a centralized Redis cluster (the round-trip latency to a central store would add 50–200ms to every request). Instead, they use local approximate counters per edge node with periodic gossip synchronization. This trades some accuracy (a user might get 5–10% more requests than their limit for a short window) for zero-added-latency enforcement on the global edge.
Trade-offs and Failure Modes in Distributed Rate Limiting
Fail-Open vs. Fail-Closed: The Most Important Operational Decision
If the Redis cluster goes down, what should the rate limiter do? Two options:
| Failure Mode | Behavior | Risk |
|---|---|---|
| Fail-Open (allow all traffic) | Rate limiter circuit-breaks and allows all requests through | Your API becomes unprotected; a DDoS or runaway client can overwhelm downstream services |
| Fail-Closed (deny all traffic) | Rate limiter blocks all requests until Redis recovers | Your API becomes unavailable; legitimate users cannot make requests |
The industry standard for external APIs is Fail-Open: it is better to have a brief window of unprotected access than to have your API appear down to customers. Implement a short circuit-breaker timeout (2–5 seconds) and alert immediately when Redis fails. For internal security-critical rate limiters (auth endpoint brute-force protection), Fail-Closed may be the correct choice.
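A sketch of that guard, reusing the hypothetical allow_request check from the earlier sketch; alert_oncall stands in for whatever paging hook the team uses:

```python
import redis

def check_with_failure_mode(api_key: str, fail_open: bool = True) -> bool:
    try:
        return allow_request(api_key)  # atomic Redis check from the earlier sketch
    except (redis.ConnectionError, redis.TimeoutError):
        alert_oncall("rate limiter: Redis unreachable")  # hypothetical hook
        # Fail-Open admits traffic unprotected; Fail-Closed (fail_open=False)
        # rejects everything, the right choice for auth brute-force limits.
        return fail_open
```

A production version would sit behind a circuit breaker so that, after repeated failures, the Redis call is skipped entirely for a few seconds rather than timing out on every request.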
The Race Condition Window and Its Real-World Impact
Even with Redis Lua atomic scripts, there is a small timing window for race conditions when using cluster-mode Redis. When a key's hash slot is migrating between nodes during a cluster rebalance, operations targeting that slot may return MOVED errors and require client-side retry. During the retry window (typically < 1ms), two concurrent requests may both pass the limit check before either commits the increment. This results in a small over-admission window (typically 1–5 extra requests per limit boundary). Production systems accept this as tolerable: eliminating it would require distributed transactions with prohibitive latency cost.
Local Cache for Policy Lookups
The Policy Service stores rate limit configurations in Postgres and pushes them to Redis. But reading the policy from Redis on every single request adds one Redis round-trip to every API call's latency. Production systems add a local in-process cache (LRU with a 30-second TTL) on each Rate Limiter Service instance. This means a policy change takes up to 30 seconds to propagate to all instances, an acceptable trade-off for eliminating a Redis read on the hot path.
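A sketch using the third-party cachetools library (an assumption; any TTL-bounded in-process cache works). The 30-second ttl is the staleness bound described above:

```python
import json
import redis
from cachetools import TTLCache

r = redis.Redis()
policy_cache = TTLCache(maxsize=100_000, ttl=30)  # entries expire after 30s

def get_policy(api_key: str) -> dict:
    try:
        return policy_cache[api_key]          # hot path: no Redis round-trip
    except KeyError:
        raw = r.get(f"rl:policy:{api_key}")   # one Redis read on cache miss
        # Fallback limits for unknown keys are illustrative.
        policy = json.loads(raw) if raw else {"limit": 100, "window_sec": 60}
        policy_cache[api_key] = policy
        return policy
```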
Decision Guide: Choosing the Right Algorithm and Enforcement Strategy
| Scenario | Recommended Algorithm | Enforcement Mode | Reason |
|---|---|---|---|
| Public API, high throughput, DDoS protection | Sliding Window Counter | Gateway sidecar | Best balance of accuracy and O(1) memory |
| Auth endpoint brute-force protection | Token Bucket with small capacity | Fail-Closed on Redis failure | Allows human login speed (1–3 req/min) while blocking automated attacks |
| Marketing API with burst allowance | Token Bucket with large capacity | Fail-Open on Redis failure | Clients need burst headroom for batch sends |
| Multi-tenant SaaS with tier-based limits | Fixed Window Counter per tenant tier | Policy Service-managed Redis keys | Simple to understand and audit per customer |
| Global edge network (CDN-level) | Local approximate counter + gossip sync | Distributed, no central Redis | Centralized Redis round-trip latency is unacceptable at edge scale |
| Developer sandbox / low-stakes internal API | In-memory local counter | No Redis needed | Simplest implementation; server restart resets counters acceptably |
Interview Delivery Example: Presenting a Distributed Rate Limiter in 45 Minutes
Minutes 1–5: Requirements scoping. Ask: "Is this for a public external API or an internal service? Do different user tiers need different limits? Is the enforcement point at the API gateway or within each service? What is the acceptable latency overhead: 1ms? 5ms?" These questions reveal whether you need a centralized shared limiter or a sidecar per service.
Minutes 6–15: Algorithm selection. Explain the trade-offs between Fixed Window (simple, double-burst risk), Token Bucket (burst-friendly, stateful), and Sliding Window Counter (accurate, O(1) memory). Conclude: "For a high-throughput public API with 1 million RPS, I would use the Sliding Window Counter. It provides near-perfect accuracy with only two O(1) Redis keys per user, and it has no exploitable boundary edge case."
Minutes 16–30: The atomicity problem and Redis solution. Say: "If two servers both read count = 99 and both increment to 100, both allow the request and the limit is violated. Redis Lua scripts solve this: the entire check-and-increment sequence executes in a single atomic operation on Redis's single-threaded event loop. No other operation can interleave between the check and the write."
Minutes 31–40: Architecture walkthrough. Describe the two layers: the Enforcement Path (inline on every request, Redis atomic check) and the Control Plane (Policy Service manages limit configurations per API key stored in both Postgres and Redis). Explain the local policy cache: "Reading the policy from Redis on every request adds 1ms of latency. A 30-second local LRU cache on each Rate Limiter instance eliminates this overhead, with the trade-off that policy changes take up to 30 seconds to propagate, acceptable for most use cases."
Minutes 41–45: Failure modes. Address the Redis failure scenario: "I would implement Fail-Open: if Redis is unavailable, the circuit breaker opens and all requests pass through. This protects API availability at the cost of temporary unprotected access. I'd alert immediately and restore Redis within the SLA window." For auth endpoints, flip to Fail-Closed and explain why.
Redis and Envoy: How the Production Rate Limiting Stack Works Together
Redis is the distributed counter store that makes shared state possible across all Rate Limiter Service instances. Its single-threaded event loop guarantees that Lua scripts execute atomically, the foundation of correct distributed rate limiting. In production, Redis Cluster spreads rate limit keys across nodes via hash slots, so API keys hash to slots evenly and load is distributed, while O(1) GET/INCR/EXPIRE operations keep rate limit checks under 1ms including the network round-trip.
Envoy Proxy is the most widely deployed platform for implementing rate limiting as a sidecar proxy in Kubernetes-native architectures. Envoy integrates with an external rate limit service via gRPC, where the Rate Limiter Service (backed by Redis) is the gRPC server. Every request that passes through Envoy triggers a gRPC check to the rate limit service before being forwarded upstream. Envoy supports configurable failure mode (fail-open or fail-closed), timeout budgets for the rate limit check, and local shadow mode (count without enforcing) for testing new limits before activation.
NGINX and Kong provide built-in rate limiting modules for teams that do not use Envoy. Kong's rate limiting plugin supports both local (per-node) and cluster-level (Redis-backed) enforcement modes and handles policy configuration through the Kong Admin API β enabling limit changes without touching application code or infrastructure configurations.
Lessons Learned from Production Rate Limiter Deployments
The Policy Service Is as Critical as the Enforcement Engine. Teams that hardcode rate limits in application configuration files spend hours debugging why limit changes require a full deployment. A Policy Service that pushes limit configurations to Redis at runtime, without a deployment, is the enabler for operational agility. At Stripe, the ability to temporarily increase rate limits for a specific API key during a customer incident (without a deployment) has prevented multiple escalations from becoming outages.
Retry-After Headers Are Not Optional. When a client receives a 429 response without a Retry-After header, it has to invent its own backoff, often exponential, often misconfigured. This creates a thundering herd when thousands of clients synchronously retry after the same interval. Always include Retry-After: {seconds} and X-RateLimit-Reset: {unix_timestamp} headers in 429 responses to enable correct client-side backoff.
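A sketch of building those headers, assuming the reset timestamp is derived from the current window's TTL:

```python
import time

def rate_limit_headers(reset_unix_ts: int) -> dict:
    # Retry-After is relative (seconds from now); X-RateLimit-Reset is absolute.
    return {
        "Retry-After": str(max(0, reset_unix_ts - int(time.time()))),
        "X-RateLimit-Reset": str(reset_unix_ts),
    }
```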
The Window-Edge Doubling Bug Hits Real Customers. The Fixed Window algorithm's double-burst edge case is not theoretical. At GitHub, engineers discovered that a popular CI tool was designed to make exactly 5,000 API calls (the hourly limit) starting at 23:59:30 every night, then immediately make another 5,000 calls starting at 00:00:00. The tool was exploiting the window boundary to get 10,000 calls in a 30-second window. Switching to Sliding Window Counter eliminated this behavior without changing the documented limit.
Local Counters Lie Under Horizontal Scale. A rate limiter that uses only in-process counters per server works correctly when there is one server. With 10 servers, each allows 100 requests per minute, giving the user an effective limit of 1,000 requests per minute. This is the Distributed Quota Leak described in the hook. Never use local-only counters for enforcing shared limits across a cluster.
Key Takeaways: Distributed Rate Limiter System Design
- A distributed rate limiter must use shared state (Redis), not local per-server counters, to enforce limits correctly across a cluster of application servers.
- Redis Lua scripts provide the atomic check-and-increment operation required to prevent race conditions where two concurrent requests both slip past a limit boundary.
- Algorithm selection matters: Sliding Window Counter provides the best balance of accuracy and O(1) memory for high-throughput public APIs. Token Bucket is preferred when controlled burst allowance is a product requirement.
- Fail-Open vs. Fail-Closed on Redis failure is a critical operational decision: Fail-Open preserves API availability during Redis outages; Fail-Closed is appropriate for security-critical endpoints like authentication.
- Always include Retry-After headers in 429 responses. Without them, clients implement uncoordinated backoff that creates thundering-herd retry storms.
- The Control Plane (Policy Service with Postgres-backed configuration pushed to Redis at runtime) enables limit changes without deployments, an essential operational capability.