System Design Advanced: Security, Rate Limiting, and Reliability
How do you protect your API from hackers and traffic spikes? We cover Rate Limiting algorithms (T...

Abstract Algorithms
TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate failure blast radius. Knowing when and how to combine them separates junior from senior system design.
The TLDR: A Layered Defense for Distributed Systems
A house electrical panel has three layers of protection:
- Fuse/breaker per circuit: no single appliance can knock out the house.
- Main breaker: kills everything if total load is too dangerous.
- Surge protector: absorbs voltage spikes before they reach appliances.
Distributed systems need the same layered defense at the API gateway, service-to-service, and individual thread pool level. Rate limiting is your surge protector at the edge. Circuit breakers are the fuse on each service call. Bulkheads are the per-circuit isolation that keeps one slow dependency from tripping everything else.
Rate Limiting and Circuit Breaking: The Two Inbound Guards
Rate limiting is enforced at the API Gateway or reverse proxy layer before requests reach your application.
Token Bucket Algorithm
Each client gets a bucket of tokens. One token equals one request. Tokens refill at a fixed rate.
Each client starts with a full bucket and spends one token per request. When the bucket is empty, the rate limiter rejects the request with HTTP 429 Too Many Requests and returns a Retry-After header indicating when tokens will replenish. The client must wait until the refill cycle adds enough tokens before retrying.
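The refill-and-spend behavior described above can be sketched in a few lines. This is a minimal, single-threaded illustration rather than a production limiter; the names `TokenBucket`, `try_acquire`, and `FakeClock` are hypothetical, and the injectable clock exists only to make the demo deterministic.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self.clock = clock
        self.tokens = float(capacity)  # each client starts with a full bucket
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_acquire(self):
        """Spend one token if available; return (allowed, retry_after_seconds)."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Seconds until the bucket holds one whole token again.
        return False, (1.0 - self.tokens) / self.refill_rate


class FakeClock:
    """Manually advanced clock (an assumption of this sketch, for repeatable output)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
bucket = TokenBucket(capacity=3, refill_rate=1.0, clock=clock)
burst = [bucket.try_acquire()[0] for _ in range(4)]  # three allowed, fourth rejected
allowed, retry_after = bucket.try_acquire()          # bucket is still empty
clock.now += 1.0                                     # one second passes, one token refills
after_wait = bucket.try_acquire()[0]
```

The second return value is what a gateway would surface in the Retry-After header when it answers with 429.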
| Algorithm | Burst Handling | Use Case |
| Token Bucket | Allows small bursts up to bucket size | API rate limits per user |
| Leaky Bucket | No bursts, constant output rate | Smoothing traffic, QoS |
| Fixed Window | Large bursts possible at window boundary | Simple, low-overhead admin limits |
| Sliding Window | Smooth rate, no boundary spikes | Production API gateways (most common) |
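To make the boundary-burst row concrete, here is a hedged comparison of a fixed window counter and a sliding window log. Both classes are hypothetical sketches, and the sliding window log is only one way to implement a sliding window (gateways often use a cheaper counter-based approximation instead).

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the count resets at each window boundary."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLimiter:
    """Allows at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that have fallen out of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


# Five requests just before the 60s boundary and five just after it.
times = [59.0] * 5 + [61.0] * 5
fixed = FixedWindowLimiter(limit=5, window=60)
sliding = SlidingWindowLimiter(limit=5, window=60)
fixed_allowed = sum(fixed.allow(t) for t in times)      # 10: double the limit in 2 seconds
sliding_allowed = sum(sliding.allow(t) for t in times)  # 5: the limit holds across the boundary
```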
DDoS Defense: Layered Response
flowchart LR
Internet[Internet Traffic] --> CDN["CDN<br/>(absorb volumetric attacks)"]
CDN --> WAF["WAF<br/>(block malicious patterns)"]
WAF --> RL["Rate Limiter<br/>(per-IP / per-token limits)"]
RL --> App[Application Servers]
RL -->|IP repeatedly violates| BH["Blackholing<br/>(drop to /dev/null)"]
Blackholing routes the attacker's traffic to a null interface: no response, minimal server overhead. It is used by ISPs and CDN providers against volumetric attacks.
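The "IP repeatedly violates" edge in the diagram implies some bookkeeping on the rate limiter side. The sketch below shows only that escalation decision, under the assumption of a simple per-IP violation counter; `ViolationTracker` is a hypothetical name, and the actual blackholing (a null route) happens at the network layer, not in application code.

```python
from collections import Counter


class ViolationTracker:
    """Marks an IP for blackholing once it has exceeded the rate limit too often."""

    def __init__(self, max_violations):
        self.max_violations = max_violations
        self.violations = Counter()
        self.blocked = set()

    def record_violation(self, ip):
        # Called each time the rate limiter rejects a request from this IP.
        self.violations[ip] += 1
        if self.violations[ip] >= self.max_violations:
            self.blocked.add(ip)

    def is_blocked(self, ip):
        return ip in self.blocked


tracker = ViolationTracker(max_violations=3)
for _ in range(3):
    tracker.record_violation("203.0.113.7")   # documentation-range address
tracker.record_violation("198.51.100.20")     # a single violation is not enough
```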
Circuit Breaker States
Without a circuit breaker, a slow downstream service blocks all your threads, fills your thread pool, and the cascade propagates upward. A circuit breaker short-circuits this by tracking error rates and failing fast once a threshold is crossed.
stateDiagram-v2
[*] --> CLOSED : System healthy
CLOSED --> OPEN : Error rate > threshold (e.g., 50% in 10s)
OPEN --> HALF_OPEN : After timeout (e.g., 30s)
HALF_OPEN --> CLOSED : Probe request succeeds
HALF_OPEN --> OPEN : Probe request fails
| State | Behavior | When |
| CLOSED | All requests pass through | Normal operation |
| OPEN | All requests fail fast (no actual call) | After too many failures |
| HALF-OPEN | One probe request allowed | After recovery timeout |
Once the circuit breaker library detects that a downstream call has failed beyond the configured threshold (commonly five consecutive failures or a 50% error rate over a sliding window), it opens the circuit and immediately rejects further calls without making any actual network request. This eliminates blocked threads and returns a fast failure to the caller, which can then invoke a fallback response or queue the operation for retry.
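The three states can be sketched as a small state machine. This is a simplified, single-threaded illustration that trips on consecutive failures; the class and parameter names are hypothetical, and a real library would also limit concurrent probes in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("failing fast; downstream not called")
            self.state = "HALF_OPEN"  # timeout elapsed: let one probe through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()


def flaky():
    raise ConnectionError("downstream unavailable")


def healthy():
    return "ok"


now = [0.0]  # fake clock so the demo does not sleep
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=30.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(healthy)
    rejected = False
except CircuitOpenError:
    rejected = True  # the healthy function was never invoked

now[0] = 31.0  # recovery timeout has elapsed
probe_result = breaker.call(healthy)
state_after_probe = breaker.state
```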
Deep Dive: Reliability Patterns Under the Hood
The Bulkhead pattern is named after ship hull compartments: if one compartment floods, the rest stay dry. In software, give different traffic types separate thread pools and separate connection pools.
For example, critical payment operations might get a dedicated thread pool of around 20 threads, analytics processing a smaller isolated pool of around 5, and background jobs a separate pool of 10. Each pool is bounded so it cannot consume resources beyond its allocation.
If the analytics pool saturates, the payment pool is unaffected. Without bulkheads, one slow operation starves everything.
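The isolation property can be shown with a small sketch. Note the swap: instead of dedicated thread pools, this version uses a semaphore per traffic type to cap concurrency, which is a simpler stand-in for the same idea. `Bulkhead` and `BulkheadFullError` are hypothetical names, and the nested calls in `saturate` exist only to occupy slots deterministically without spawning threads.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing when every slot is taken.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


payments = Bulkhead("payments", max_concurrent=20)
analytics = Bulkhead("analytics", max_concurrent=5)


def saturate(depth):
    """Hold `depth` analytics slots via nested calls, then probe both bulkheads."""
    if depth > 0:
        return analytics.run(saturate, depth - 1)
    try:
        analytics.run(lambda: "analytics ok")
        analytics_result = "accepted"
    except BulkheadFullError:
        analytics_result = "rejected"
    return analytics_result, payments.run(lambda: "payment ok")


outcome = saturate(5)                                # analytics is full, payments is not
after_release = analytics.run(lambda: "analytics ok")  # slots are free again
```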
Internals: How Circuit Breakers Track State
A circuit breaker's core data structure is a sliding window counter, typically a ring buffer of the last N request outcomes. Each completion (success or failure) is written into the current slot, and once the buffer is full each new outcome overwrites the oldest one, so the window always covers the most recent N calls. A time-based variant instead groups outcomes into time buckets and discards the oldest bucket as the window slides.
The error rate is computed as the ratio of failures to total calls within the window. When this ratio crosses the configured threshold (commonly 50%) and the minimum call count floor has been met, the breaker transitions to the OPEN state. The half-open probe is a time-delayed single request that asks "is it safe to close again?" without flooding a recovering service.
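A count-based version of that window fits in a few lines. This is a sketch under stated assumptions: `FailureRateWindow` is a hypothetical name, a `deque` with `maxlen` stands in for the ring buffer, and the breaker trips only when the rate strictly exceeds the threshold.

```python
from collections import deque


class FailureRateWindow:
    """Tracks the last `size` outcomes and decides whether the breaker should open."""

    def __init__(self, size, threshold, min_calls):
        # True = failure. With maxlen set, appending to a full deque drops the oldest entry.
        self.outcomes = deque(maxlen=size)
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_open(self):
        # Both conditions must hold: enough calls observed, and too many of them failed.
        return len(self.outcomes) >= self.min_calls and self.failure_rate() > self.threshold


window = FailureRateWindow(size=10, threshold=0.5, min_calls=5)
for _ in range(3):
    window.record(failed=True)
below_floor = window.should_open()    # 100% failures, but only 3 calls recorded
for _ in range(2):
    window.record(failed=False)
at_floor = window.should_open()       # 3 failures out of 5 calls = 60%
for _ in range(10):
    window.record(failed=False)
after_recovery = window.should_open() # old failures have been overwritten
```

The `min_calls` floor is what keeps three failed calls on a quiet night from opening the circuit.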
Bulkhead internals are simpler: a bounded ThreadPoolExecutor per service tier. Calls exceeding the pool queue depth are rejected immediately with a BulkheadFullException rather than queueing indefinitely.
Performance Analysis: Overhead Per Reliability Layer
| Pattern | Typical Overhead | Notes |
| Rate Limiter (in-process) | < 0.05 ms | Atomic counter + timestamp |
| Rate Limiter (Redis) | 1-2 ms | 2x Redis round trips per request |
| Circuit Breaker (local) | ~0.1 ms | Lock-free ring buffer read/write |
| Bulkhead (thread pool) | < 0.01 ms | queue.offer() + thread dispatch |
Use local in-process rate limiting for high-throughput endpoints and reserve Redis for cross-node consistency across a fleet of servers.
Mathematical Model: Token Bucket Formalism
Let C = bucket capacity (tokens), r = refill rate (tokens/second), t0 = time of last refill, and tokens(t0) = token count at t0.
Mathematically, the available token count at any time t equals the minimum of the bucket capacity C and the sum of remaining tokens at the last refill time plus the product of the refill rate r and the elapsed seconds since that refill. A request is permitted if at least one token is available; otherwise it is rejected with HTTP 429.
Worked example: with a capacity of 10 tokens and a refill rate of 5 tokens per second, a client starting with an empty bucket can make 5 requests after 1 second and, having spent those, 5 more after 2 seconds. A rapid burst of 6 requests sent while the bucket is still empty, before any refill occurs, results in all 6 being rejected immediately.
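Written as an equation, the model and the worked numbers look like this. The idle-until-t=3 case is an added illustration of the capacity cap, not part of the original example, and `\text` assumes the amsmath package.

```latex
\[
\text{tokens}(t) = \min\bigl(C,\; \text{tokens}(t_0) + r\,(t - t_0)\bigr)
\]
\[
C = 10,\quad r = 5,\quad \text{tokens}(0) = 0:
\qquad
\text{tokens}(1) = \min(10,\; 0 + 5 \cdot 1) = 5,
\qquad
\text{tokens}(3) = \min(10,\; 0 + 5 \cdot 3) = 10
\]
```

The second result shows the cap at work: 15 tokens would have accrued by t = 3, but the bucket never holds more than C.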
For circuit breakers, the error rate is measured over a sliding time window. The breaker trips when that rate exceeds the threshold and total call volume exceeds a minimum floor, which prevents false opens during low-traffic periods such as overnight deployments.
Token Bucket Rate Limiting: Request Flow
sequenceDiagram
participant C as Client
participant RL as Rate Limiter
participant B as Token Bucket
participant API as API Server
C->>RL: Request (API key: user_42)
RL->>B: Check tokens for user_42
B-->>RL: tokens = 5 (bucket not empty)
RL->>B: Decrement token count
RL->>API: Forward request
API-->>C: 200 OK
C->>RL: Request (burst: 10 rapid calls)
RL->>B: Check tokens for user_42
B-->>RL: tokens = 0 (bucket empty)
RL-->>C: 429 Too Many Requests (Retry-After: 1s)
Note over B: Tokens refill at r/sec
B->>B: Refill +10 tokens (1s elapsed)
C->>RL: Retry after 1s
RL->>B: Check tokens
B-->>RL: tokens = 10
RL->>API: Forward request
API-->>C: 200 OK
This sequence diagram walks through two distinct token bucket scenarios for the same client. In the first scenario, tokens are available and the request is forwarded normally. In the second, a burst of 10 rapid calls exhausts the bucket and the rate limiter returns 429 Too Many Requests with a Retry-After header, critical UX that tells the client exactly when to retry. After 1 second of token refill, the next request succeeds. The takeaway is that a well-implemented rate limiter communicates when to try again, not just whether to block.
Advanced Concepts: Combining Patterns for Defense in Depth
Each pattern in isolation solves one failure mode. Together they form defense in depth: no single layer is the only thing standing between your system and failure.
The retry storm problem illustrates why combining patterns matters. Suppose every service call retries with exponential backoff and a downstream service degrades. All upstream clients now retry on top of their normal traffic, and even with growing delays the synchronized retries arrive as a thundering herd against a service that is already struggling. A circuit breaker solves this: once OPEN, retries stop entirely until the half-open probe confirms recovery.
Backpressure signals upstream callers to slow their send rate rather than dropping requests silently. Bulkheads emit BulkheadFullException; Kafka consumers use lag metrics to throttle producers; gRPC uses HTTP/2 flow control.
Jitter on circuit close prevents thundering herd on recovery. Instead of all clients retrying at t + 30s, each uses t + 30s + random(0, 5s), smoothing the re-entry spike across a 5-second window.
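Both kinds of jitter are short enough to sketch. The defaults below follow the values used in this article (base 100 ms, cap 30 s, 30 s recovery timeout, 0 to 5 s spread); the function names are hypothetical, and the seeded `random.Random` is only there to make the demo repeatable.

```python
import random


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Full jitter: a uniform delay between 0 and min(cap, base * 2**attempt) seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def recovery_delay(recovery_timeout=30.0, max_jitter=5.0, rng=random):
    """Spread half-open probes across [recovery_timeout, recovery_timeout + max_jitter]."""
    return recovery_timeout + rng.uniform(0.0, max_jitter)


rng = random.Random(42)  # seeded generator, an assumption for repeatability
retry_delays = [backoff_delay(attempt, rng=rng) for attempt in range(12)]
probe_times = [recovery_delay(rng=rng) for _ in range(5)]
```

Each client draws its own delay, so retries and recovery probes land at different moments instead of in one synchronized spike.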
| Failure Mode | Pattern Combination | Key Config |
| DDoS / API spam | Rate Limiter + WAF | Per-IP limits, burst cap |
| Cascading slow dependency | Circuit Breaker + timeout | 50% error rate, 200ms timeout |
| Thread starvation | Bulkhead + Circuit Breaker | Isolated pools, fail-fast on full |
| Retry storms | Circuit Breaker + backoff + jitter | Base 100ms, max 30s, full jitter |
| Thundering herd on recovery | Jitter on recovery timeout | 0-5s random spread per client |
System Flow: Request Through the Reliability Stack
Every inbound request traverses all three layers. The order matters: rate limit first, then bulkhead, then circuit breaker.
flowchart LR
Client[Client Request] --> RL["Rate Limiter<br/>Token Bucket"]
RL -->|429 if limit hit| Reject[Reject: 429]
RL --> BH["Bulkhead<br/>Thread Pool"]
BH -->|503 if pool full| Full[Reject: 503]
BH --> CB["Circuit Breaker<br/>CLOSED / OPEN / HALF-OPEN"]
CB -->|OPEN: fail fast| FB[Fallback Response]
CB -->|CLOSED: call through| Svc[Downstream Service]
Svc -->|success| CB
Svc -->|failure threshold hit| CB
At each gate, a rejection is cheap and deterministic. By the time a request reaches the downstream service it has already proven: within rate limits, a thread is available, and the target is believed healthy. This is fail early, fail fast.
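The gate ordering can be sketched as a plain function that stops at the first rejection. Everything here is hypothetical scaffolding: the gates are simple callables that return True or False, and serving the fallback with a 200 status is an assumption, since the flow above does not specify one.

```python
def handle_request(request, rate_limit_ok, bulkhead_ok, circuit_closed, downstream):
    """Run the gates in order and return (status, body) at the first rejection."""
    if not rate_limit_ok(request):
        return 429, "Too Many Requests"
    if not bulkhead_ok(request):
        return 503, "Service Unavailable"
    if not circuit_closed(request):
        return 200, "fallback response"  # assumption: fallback served as a normal response
    return 200, downstream(request)


trace = []  # records which gates were actually consulted


def gate(name, verdict):
    def check(request):
        trace.append(name)
        return verdict
    return check


def downstream(request):
    trace.append("downstream")
    return f"result for {request}"


rejected = handle_request("req-1", gate("rate_limit", False), gate("bulkhead", True),
                          gate("circuit", True), downstream)
trace_when_limited = list(trace)  # later gates and the service were never touched
trace.clear()
served = handle_request("req-2", gate("rate_limit", True), gate("bulkhead", True),
                        gate("circuit", True), downstream)
trace_when_served = list(trace)
```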
Real-World Applications: Where These Patterns Run in Production
Rate Limiting in the wild:
- GitHub API enforces 5,000 requests/hour per authenticated token. Unauthenticated requests are limited to 60/hour to force authentication.
- Stripe limits to 100 requests/second per secret key. Exceeding this returns 429 with a Retry-After header; Stripe SDKs implement automatic exponential backoff.
- AWS API Gateway offers per-method throttling and account-level throttling with separate burst limits per deployment stage.
Circuit Breakers in the wild:
- Netflix Hystrix was the first widely adopted circuit breaker library, used to isolate microservices so that a recommendations failure would not take down video streaming. Hystrix is now in maintenance mode, and Resilience4j is its commonly used successor.
- Istio service mesh implements circuit breaking at the Envoy sidecar proxy so every service-to-service call is guarded without application code changes.
Bulkheads in the wild:
- Kubernetes resource limits and namespace quotas act as infrastructure-level bulkheads preventing rogue jobs from consuming all cluster CPU.
- Payment processing APIs universally use isolated thread pools so payment authorization never competes with analytics or background jobs.
Trade-offs & Failure Modes: Choosing the Right Pattern
| Scenario | Pattern | Notes |
| Public API with free and paid tiers | Rate Limiting (sliding window, per API key) | Redis-backed for cross-node consistency |
| Microservice calling an unreliable external API | Circuit Breaker | Set timeout <= your SLA budget |
| High-value transaction isolation | Bulkhead (dedicated thread + connection pool) | Payment pool must never share with analytics |
| Protecting origin from DDoS | CDN + WAF + Rate Limiter layered | Blackholing for repeat offenders |
| Service-to-service timeout cascade | Circuit Breaker + timeout (aggressive: 200ms) | Timeout must be less than circuit breaker window |
| Queue consumer falling behind | Backpressure | Consumer signals producer to slow down |
Decision Guide: Which Pattern for Which Problem
| Problem | Pattern | Configuration |
| Protect API from spam / DDoS | Rate Limiting | Token bucket, per-IP + per-token limits |
| Prevent cascading failure | Circuit Breaker | 50% error rate threshold, 30s timeout |
| Isolate critical from non-critical work | Bulkhead | Separate thread pools per service tier |
| Handle burst traffic gracefully | Token Bucket | Capacity = max burst, refill = sustained rate |
| Retry safely without overloading | Exponential backoff + Jitter | Base 100ms, max 30s, full jitter |
| Circuit breaker keeps false-opening | Raise min_calls floor | Require 20+ calls before evaluating error rate |
| Recovery thundering herd | Jitter on half-open timeout | 0-5s random spread per client |
Non-obvious edge cases:
- Set circuit breaker recovery_timeout <= your SLA budget or users wait the full timeout before recovery is even attempted.
- Rate limiting without a Retry-After header is UX-hostile: clients cannot back off intelligently without it.
- A bulkhead with an unbounded queue is a time bomb. Always set queue_capacity explicitly.
Practical: Resilience4j Configuration Approach
Resilience4j provides composable annotations that layer circuit breaking, rate limiting, and bulkhead isolation directly on service methods. Each reliability pattern is configured independently by specifying key thresholds. The circuit breaker watches a sliding window of recent calls and trips when the failure rate exceeds a percentage (commonly 50%) across a minimum number of calls (commonly 10), then waits a configured duration in the open state before probing recovery. The rate limiter is set with a call limit per refresh period, for example 100 requests per second per service instance. The bulkhead caps the maximum number of concurrent calls and defines how long a request waits for a slot before being rejected.
Each service method is decorated with annotations for all three concerns. When the circuit is open, a fallback method returns a safe default response, such as "service temporarily unavailable, please retry." When the rate limit is exceeded, the fallback returns an explicit error with a Retry-After hint. When the bulkhead is full, the fallback rejects immediately rather than queuing indefinitely.
Key observability signals to monitor:
| Metric | What It Signals |
| Circuit breaker state | Whether the breaker is CLOSED, OPEN, or HALF_OPEN |
| Rate limiter waiting threads | Requests blocked waiting for a rate limit token |
| Bulkhead available concurrent calls | Remaining bulkhead capacity before rejection |
When the circuit opens during a rolling deployment or degraded dependency, all three metrics become visible at once: the breaker moves to OPEN, waiting threads increase at the rate limiter, and bulkhead capacity drops. Probing recovery through the HALF_OPEN state resets these counters as service health is confirmed.
Defense Stack: Authentication, Rate Limiting, and Reliability in Layers
A complete defense stack for a production API service combines authentication, per-client rate limiting, bulkhead isolation, and circuit breaking into a sequential request pipeline. Each layer is independent and can fail fast without involving the layers that follow it.
Authentication runs first. An incoming request presents a JWT or API key, which is validated against a signing secret or public key. If authentication fails, the request is rejected immediately with a 401 Unauthorized response before any business logic is executed.
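As a rough illustration of that first gate, the sketch below validates an HS256-signed JWT with only the standard library. It is a teaching aid under stated assumptions, not a recommendation to hand-roll token validation: a production service would normally use a maintained JWT library. The `sign_hs256` helper exists only to make the example self-contained, and all function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment):
    padding = "=" * (-len(segment) % 4)  # JWT segments omit base64 padding
    return base64.urlsafe_b64decode(segment + padding)


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload, secret):
    """Build a token for the demo; real tokens come from an identity provider."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(payload).encode())
    signature = hmac.new(secret, f"{header}.{body}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(signature)}"


def authenticate(token, secret, now):
    """Return (200, claims) for a valid token, otherwise (401, None)."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return 401, None
        signing_input = f"{header_b64}.{body_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return 401, None
        claims = json.loads(_b64url_decode(body_b64))
        if claims.get("exp", 0) <= now:
            return 401, None
    except (ValueError, TypeError, AttributeError):
        return 401, None  # malformed token
    return 200, claims


secret = b"demo-signing-secret"
token = sign_hs256({"sub": "user_42", "exp": 1000}, secret)
valid = authenticate(token, secret, now=500)
wrong_key = authenticate(token, b"another-secret", now=500)
expired = authenticate(token, secret, now=1500)
malformed = authenticate("garbage", secret, now=0)
```

Every rejection path returns before any business logic or downstream call is reached, which is the point of running this gate first.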
Rate limiting runs second, scoped to the authenticated client identity. A token bucket per client identity is stored in a shared in-memory data store like Redis so that enforcement is consistent across all application nodes in the fleet. When a client exhausts its token budget, the response includes a Retry-After header specifying the exact number of seconds until the bucket refills, allowing well-behaved clients to back off cleanly rather than retrying in a tight loop.
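The Retry-After value follows directly from the bucket state. A minimal sketch, assuming the limiter knows the client's current (possibly fractional) token count and refill rate; the function names are hypothetical, and the value is rounded up because the header carries whole seconds.

```python
import math


def retry_after_seconds(tokens, refill_rate):
    """Whole seconds until at least one token is available again."""
    if tokens >= 1:
        return 0
    return math.ceil((1 - tokens) / refill_rate)


def too_many_requests(tokens, refill_rate):
    """Build a (status, headers) pair for a rejected request."""
    wait = retry_after_seconds(tokens, refill_rate)
    return 429, {"Retry-After": str(wait)}


# An empty bucket refilling at one token every four seconds.
response = too_many_requests(tokens=0.0, refill_rate=0.25)
```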
Bulkhead isolation runs third, capping the number of concurrent calls that can reach critical downstream services. A dedicated thread pool per service tier (for example, separate pools for payment processing, analytics, and background jobs) prevents a slow or saturated dependency from starving all other request handling.
Circuit breaking runs last, wrapping the actual downstream call. If the downstream service has been failing beyond the configured error rate and call volume thresholds, the circuit opens and the request is returned a fallback immediately without any network call being made. After the recovery timeout elapses, a single probe request tests whether the service has recovered, and the circuit closes if the probe succeeds.
The combined sequence (authentication → rate limiting → bulkhead → circuit breaker) means that by the time a request reaches the actual downstream call, it has already passed through every protective gate. Rejections at any early stage are cheap and deterministic.
For a full deep-dive on distributed rate limiting with Redis, OAuth2 resource server patterns, and Bucket4j configuration, a dedicated follow-up post is planned.
Key Lessons from Reliability Pattern Failures in Production
- Rate limiting without a Retry-After header breaks clients: they cannot back off intelligently without knowing when to retry. Always return the header.
- Circuit breakers need careful threshold tuning: too sensitive (low minimumNumberOfCalls) causes false opens during rolling deployments; too loose means cascading failures still propagate.
- Bulkheads only work if you correctly identify which work is critical. Equal-priority pools for payment and analytics defeat the purpose.
- Never set circuit breaker recovery_timeout longer than your SLA: a 60-second recovery_timeout on a 500ms SLA means your fallback runs for a full minute before recovery is attempted.
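- The order of Resilience4j annotations matters. With the default Spring aspect order, Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) ), the circuit breaker wraps the rate limiter, so a rate limit rejection can be counted as a circuit breaker failure unless that exception is ignored or the aspect order is reconfigured.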
- Combine patterns at both the service mesh level (Istio/Envoy) for zero-code-change protection and at the library level for per-method control.
TLDR: Summary & Key Takeaways
- Token Bucket enforces per-client rate limits with allowance for small bursts; tokens(t) = min(C, tokens(t0) + r*(t-t0)).
- Circuit Breaker (CLOSED to OPEN to HALF-OPEN) short-circuits failing calls using a sliding window error rate; set min_calls to avoid false opens.
- Bulkhead compartmentalizes thread pools so slow dependencies cannot starve critical paths; always configure bounded queues.
- DDoS defense is layered: CDN absorbs volume, WAF filters patterns, rate limiter blocks persistent abusers, blackholing drops the worst offenders.
- Combine all three for defense in depth and add jitter to prevent thundering herd on circuit recovery.
- Operational discipline matters as much as the patterns: monitor CB state, rate limit hits, and bulkhead rejections via Micrometer; tune thresholds against production traffic, not guesses.