System Design Advanced: Security, Rate Limiting, and Reliability
How do you protect your API from hackers and traffic spikes? We cover Rate Limiting algorithms (T...
TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate failure blast radius. Knowing when and how to combine them separates junior from senior system design.
The TLDR: A Layered Defense for Distributed Systems
A house electrical panel has three layers of protection:
- Fuse/breaker per circuit: no single appliance can knock out the house.
- Main breaker: kills everything if the total load becomes dangerous.
- Surge protector: absorbs voltage spikes before they reach appliances.
Distributed systems need the same layered defense at the API gateway, service-to-service, and individual thread pool level. Rate limiting is your surge protector at the edge. Circuit breakers are the fuse on each service call. Bulkheads are the per-circuit isolation that keeps one slow dependency from tripping everything else.
Rate Limiting and Circuit Breaking: The Two Inbound Guards
Rate limiting is enforced at the API Gateway or reverse proxy layer before requests reach your application.
Token Bucket Algorithm
Each client gets a bucket of tokens. One token equals one request. Tokens refill at a fixed rate.
```text
Bucket capacity = 100 requests
Refill rate    = 10 tokens/second
If tokens > 0:  allow request, decrement token count
If tokens == 0: return HTTP 429 Too Many Requests
```
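The logic above fits in a few lines of Python. This is an illustrative, single-threaded sketch (no locking, lazy refill on each check), not a production limiter:

```python
import time

class TokenBucket:
    """Token bucket: capacity caps bursts, refill_rate sets sustained throughput."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate      # tokens per second
        self.tokens = capacity              # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazy refill: credit tokens for the elapsed interval, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # spend one token for this request
            return True
        return False                        # caller should return HTTP 429

bucket = TokenBucket(capacity=100, refill_rate=10)
```

Refilling lazily on each `allow()` call avoids a background timer thread, which is why most in-process limiters are implemented this way.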
| Algorithm | Burst Handling | Use Case |
|---|---|---|
| Token Bucket | Allows small bursts up to bucket size | API rate limits per user |
| Leaky Bucket | No bursts, constant output rate | Smoothing traffic, QoS |
| Fixed Window | Large bursts possible at window boundary | Simple, low-overhead admin limits |
| Sliding Window | Smooth rate, no boundary spikes | Production API gateways (most common) |
DDoS Defense: Layered Response
```mermaid
flowchart LR
    Internet[Internet Traffic] --> CDN["CDN<br/>(absorb volumetric attacks)"]
    CDN --> WAF["WAF<br/>(block malicious patterns)"]
    WAF --> RL["Rate Limiter<br/>(per-IP / per-token limits)"]
    RL --> App[Application Servers]
    RL -->|IP repeatedly violates| BH["Blackholing<br/>(drop to /dev/null)"]
```

Blackholing routes the attacker's traffic to a null interface: no response, minimal server overhead. It is used by ISPs and CDN providers against volumetric attacks.
Circuit Breaker States
Without a circuit breaker, a slow downstream service blocks all your threads, fills your thread pool, and the cascade propagates upward. A circuit breaker short-circuits this by tracking error rates and failing fast once a threshold is crossed.
```mermaid
stateDiagram-v2
    [*] --> CLOSED : System healthy
    CLOSED --> OPEN : Error rate > threshold (e.g., 50% in 10s)
    OPEN --> HALF_OPEN : After timeout (e.g., 30s)
    HALF_OPEN --> CLOSED : Probe request succeeds
    HALF_OPEN --> OPEN : Probe request fails
```
| State | Behavior | When |
|---|---|---|
| CLOSED | All requests pass through | Normal operation |
| OPEN | All requests fail fast (no actual call) | After too many failures |
| HALF-OPEN | One probe request allowed | After recovery timeout |
```python
import requests
from circuitbreaker import circuit

PAYMENT_URL = "https://payments.internal/api/charge"  # placeholder endpoint

@circuit(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order_id: str):
    return requests.post(PAYMENT_URL, json={"order_id": order_id}, timeout=2)
```

When call_payment_service() fails 5 times, subsequent calls raise CircuitBreakerError immediately: no actual network call, no blocked threads.
Deep Dive: Reliability Patterns Under the Hood
The Bulkhead pattern is named after ship hull compartments: if one compartment floods, the rest stay dry. In software, give different traffic types separate thread pools and separate connection pools.
```text
Critical Payments Pool: 20 threads (isolated)
Analytics Pool:          5 threads (isolated)
Background Job Pool:    10 threads (isolated)
```
If the analytics pool saturates, the payment pool is unaffected. Without bulkheads, one slow operation starves everything.
Internals: How Circuit Breakers Track State
A circuit breaker's core data structure is a sliding window counter โ a ring buffer of the last N request outcomes. Each completion (success or failure) is written into the current slot. The window slides periodically and the oldest slot is discarded.
Error rate is computed as:
error_rate = failures_in_window / total_in_window
When error_rate crosses the threshold (commonly 50%), the state transitions to OPEN and all requests short-circuit without touching the downstream service. The half-open probe is a time-delayed single request that asks "is it safe to close again?" without flooding a recovering service.
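A count-based sketch of that bookkeeping, using a bounded `deque` as the ring buffer (names are illustrative; real libraries also track time-based windows and the OPEN-to-HALF_OPEN timer):

```python
from collections import deque

class WindowedBreakerState:
    """Track the last N call outcomes and decide when to trip OPEN."""

    def __init__(self, window_size: int = 20, threshold: float = 0.5,
                 min_calls: int = 10):
        self.outcomes = deque(maxlen=window_size)  # ring buffer: oldest slot auto-evicted
        self.threshold = threshold
        self.min_calls = min_calls
        self.state = "CLOSED"

    def record(self, success: bool):
        self.outcomes.append(success)
        total = len(self.outcomes)
        failures = total - sum(self.outcomes)
        # Only evaluate once the window has enough samples (the min_calls floor)
        if total >= self.min_calls and failures / total >= self.threshold:
            self.state = "OPEN"
```

The `maxlen` on the deque is what makes it a sliding window: writing the newest outcome implicitly discards the oldest one.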
Bulkhead internals are simpler: a bounded ThreadPoolExecutor per service tier. Calls exceeding the pool queue depth are rejected immediately with a BulkheadFullException rather than queueing indefinitely.
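In Python terms, that bounded pool plus immediate rejection can be sketched with a semaphore guarding a `ThreadPoolExecutor`. `BulkheadFullException` is defined locally here to mirror the name such libraries use:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class BulkheadFullException(Exception):
    """Raised when the pool and its queue are saturated."""

class Bulkhead:
    """Bounded pool: fail fast instead of queueing indefinitely."""

    def __init__(self, max_concurrent: int, max_queued: int):
        self.executor = ThreadPoolExecutor(max_workers=max_concurrent)
        # One permit per running slot plus per queue slot
        self.slots = threading.Semaphore(max_concurrent + max_queued)

    def submit(self, fn, *args):
        if not self.slots.acquire(blocking=False):
            raise BulkheadFullException("bulkhead saturated, failing fast")
        future = self.executor.submit(fn, *args)
        future.add_done_callback(lambda _: self.slots.release())
        return future
```

The non-blocking `acquire` is the whole pattern: a saturated bulkhead costs the caller one failed semaphore check, not a blocked thread.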
Performance Analysis: Overhead Per Reliability Layer
| Pattern | Typical Overhead | Notes |
|---|---|---|
| Rate Limiter (in-process) | < 0.05 ms | Atomic counter + timestamp |
| Rate Limiter (Redis) | 1-2 ms | 2x Redis round trips per request |
| Circuit Breaker (local) | ~0.1 ms | Lock-free ring buffer read/write |
| Bulkhead (thread pool) | < 0.01 ms | queue.offer() + thread dispatch |
Use local in-process rate limiting for high-throughput endpoints and reserve Redis for cross-node consistency across a fleet of servers.
Mathematical Model: Token Bucket Formalism
Let C = bucket capacity (tokens), r = refill rate (tokens/second), t0 = time of last refill, and tokens(t0) = token count at t0.
At time t, the available token count is:
tokens(t) = min(C, tokens(t0) + r * (t - t0))
A request is allowed if tokens(t) >= 1, and tokens(t) is then decremented by 1. A request is rejected (HTTP 429) if tokens(t) < 1.
Worked example: C = 10, r = 5 tokens/sec, tokens(t0) = 0 at t0 = 0.
- At t = 1s: tokens = min(10, 5) = 5, so 5 requests proceed.
- At t = 2s after those 5 are consumed: tokens = 5 again.
- Rapid burst of 6 requests at t = 0.1s with empty bucket: all 6 rejected with 429.
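The formula and the worked example can be checked mechanically:

```python
def tokens_at(t: float, t0: float, tokens_t0: float,
              capacity: float, rate: float) -> float:
    """tokens(t) = min(C, tokens(t0) + r * (t - t0))"""
    return min(capacity, tokens_t0 + rate * (t - t0))

# C = 10, r = 5 tokens/sec, empty bucket at t0 = 0
print(tokens_at(1.0, 0.0, 0, 10, 5))   # 5 tokens: 5 requests can proceed
print(tokens_at(0.1, 0.0, 0, 10, 5))   # 0.5 tokens: a burst of 6 is fully rejected
print(tokens_at(5.0, 0.0, 0, 10, 5))   # capped at capacity C = 10
```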
For circuit breakers, the error rate over sliding window W is:
error_rate(t) = errors_in_[t-W, t] / total_calls_in_[t-W, t]
Trip the breaker when error_rate >= threshold AND total_calls >= min_calls. The min_calls floor prevents false opens during low-traffic periods.
Token Bucket Rate Limiting: Request Flow
```mermaid
sequenceDiagram
    participant C as Client
    participant RL as Rate Limiter
    participant B as Token Bucket
    participant API as API Server
    C->>RL: Request (API key: user_42)
    RL->>B: Check tokens for user_42
    B-->>RL: tokens = 5 (bucket not empty)
    RL->>B: Decrement token count
    RL->>API: Forward request
    API-->>C: 200 OK
    C->>RL: Request (burst: 10 rapid calls)
    RL->>B: Check tokens for user_42
    B-->>RL: tokens = 0 (bucket empty)
    RL-->>C: 429 Too Many Requests (Retry-After: 1s)
    Note over B: Tokens refill at r/sec
    B->>B: Refill +10 tokens (1s elapsed)
    C->>RL: Retry after 1s
    RL->>B: Check tokens
    B-->>RL: tokens = 10
    RL->>API: Forward request
    API-->>C: 200 OK
```
This sequence diagram walks through two distinct token bucket scenarios for the same client. In the first, tokens are available and the request is forwarded normally. In the second, a burst of 10 rapid calls exhausts the bucket and the rate limiter returns 429 Too Many Requests with a Retry-After header, a critical UX detail that tells the client exactly when to retry. After 1 second of token refill, the next request succeeds. The takeaway: a well-implemented rate limiter communicates when to try again, not just whether to block.
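On the client side, honoring that header is a small loop. A sketch where the `send` callable stands in for the real HTTP call (e.g. `lambda: requests.get(url)`); `sleep` is injectable only to keep the sketch testable:

```python
import time

def request_with_retry_after(send, max_attempts: int = 5, sleep=time.sleep):
    """Retry on 429, waiting exactly as long as the server's Retry-After asks.
    `send` is a zero-arg callable returning an object with .status_code and .headers."""
    resp = send()
    attempts = 1
    while resp.status_code == 429 and attempts < max_attempts:
        # Server-specified backoff; default to 1s if the header is missing
        sleep(float(resp.headers.get("Retry-After", 1)))
        resp = send()
        attempts += 1
    return resp
```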
Advanced Concepts: Combining Patterns for Defense in Depth
Each pattern in isolation solves one failure mode. Together they form defense in depth: no single layer is the only thing standing between your system and failure.
The retry storm problem illustrates why combining patterns matters. Add exponential backoff retries to every service call. A downstream service degrades. All upstream clients retry with growing delays: a thundering herd. A circuit breaker solves this: once OPEN, retries stop entirely until the half-open probe confirms recovery.
Backpressure signals upstream callers to slow their send rate rather than dropping requests silently. Bulkheads emit BulkheadFullException; Kafka consumers use lag metrics to throttle producers; gRPC uses HTTP/2 flow control.
Jitter on circuit close prevents thundering herd on recovery. Instead of all clients retrying at t + 30s, each uses t + 30s + random(0, 5s), smoothing the re-entry spike across a 5-second window.
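The backoff-plus-jitter recipe is essentially a one-liner. This sketch implements the "full jitter" variant, which samples uniformly over the whole capped exponential interval:

```python
import random

def backoff_with_full_jitter(attempt: int, base: float = 0.1,
                             cap: float = 30.0) -> float:
    """Full jitter: delay is uniform in [0, min(cap, base * 2**attempt)].
    Spreads retries across the entire interval instead of synchronizing clients."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Compared with adding a small fixed jitter on top of a deterministic delay, full jitter de-correlates clients more aggressively, which is why it is the usual recommendation for retry storms.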
| Failure Mode | Pattern Combination | Key Config |
|---|---|---|
| DDoS / API spam | Rate Limiter + WAF | Per-IP limits, burst cap |
| Cascading slow dependency | Circuit Breaker + timeout | 50% error rate, 200ms timeout |
| Thread starvation | Bulkhead + Circuit Breaker | Isolated pools, fail-fast on full |
| Retry storms | Circuit Breaker + backoff + jitter | Base 100ms, max 30s, full jitter |
| Thundering herd on recovery | Jitter on recovery timeout | +/-5s random spread per client |
System Flow: Request Through the Reliability Stack
Every inbound request traverses all three layers. The order matters: rate limit first, then bulkhead, then circuit breaker.
```mermaid
flowchart LR
    Client[Client Request] --> RL["Rate Limiter<br/>Token Bucket"]
    RL -->|429 if limit hit| Reject[Reject: 429]
    RL --> BH["Bulkhead<br/>Thread Pool"]
    BH -->|503 if pool full| Full[Reject: 503]
    BH --> CB["Circuit Breaker<br/>CLOSED / OPEN / HALF-OPEN"]
    CB -->|OPEN: fail fast| FB[Fallback Response]
    CB -->|CLOSED: call through| Svc[Downstream Service]
    Svc -->|success| CB
    Svc -->|failure threshold hit| CB
```
At each gate, a rejection is cheap and deterministic. By the time a request reaches the downstream service it has already proven: within rate limits, a thread is available, and the target is believed healthy. This is fail early, fail fast.
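The three gates compose into one function. An illustrative sketch where the callables stand in for real limiter, pool, and breaker objects:

```python
def handle_request(request, rate_limit_ok, acquire_thread,
                   breaker_state, call_downstream):
    """Order matters: rate limit first (cheapest check), then bulkhead, then breaker."""
    if not rate_limit_ok(request):
        return (429, "Too Many Requests")            # cheap, deterministic rejection
    if not acquire_thread(request):
        return (503, "Bulkhead full")                # no thread available, fail fast
    if breaker_state() == "OPEN":
        return (503, "Fallback: downstream unavailable")  # no network call made
    return (200, call_downstream(request))           # request has earned a real call
```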
Real-World Applications: Where These Patterns Run in Production
Rate Limiting in the wild:
- GitHub API enforces 5,000 requests/hour per authenticated token. Unauthenticated requests are limited to 60/hour to force authentication.
- Stripe limits to 100 requests/second per secret key. Exceeding this returns 429 with a Retry-After header; Stripe SDKs implement automatic exponential backoff.
- AWS API Gateway offers per-method throttling and account-level throttling with separate burst limits per deployment stage.
Circuit Breakers in the wild:
- Netflix Hystrix was the original widely-adopted circuit breaker library, used to isolate microservices so a recommendations failure would not take down video streaming. Succeeded by Resilience4j.
- Istio service mesh implements circuit breaking at the Envoy sidecar proxy so every service-to-service call is guarded without application code changes.
Bulkheads in the wild:
- Kubernetes resource limits and namespace quotas act as infrastructure-level bulkheads preventing rogue jobs from consuming all cluster CPU.
- Payment processing APIs universally use isolated thread pools so payment authorization never competes with analytics or background jobs.
Trade-offs and Failure Modes: Choosing the Right Pattern
| Scenario | Pattern | Notes |
|---|---|---|
| Public API with free and paid tiers | Rate Limiting (sliding window, per API key) | Redis-backed for cross-node consistency |
| Microservice calling an unreliable external API | Circuit Breaker | Set timeout <= your SLA budget |
| High-value transaction isolation | Bulkhead (dedicated thread + connection pool) | Payment pool must never share with analytics |
| Protecting origin from DDoS | CDN + WAF + Rate Limiter layered | Blackholing for repeat offenders |
| Service-to-service timeout cascade | Circuit Breaker + timeout (aggressive: 200ms) | Timeout must be less than circuit breaker window |
| Queue consumer falling behind | Backpressure | Consumer signals producer to slow down |
Decision Guide: Which Pattern for Which Problem
| Problem | Pattern | Configuration |
|---|---|---|
| Protect API from spam / DDoS | Rate Limiting | Token bucket, per-IP + per-token limits |
| Prevent cascading failure | Circuit Breaker | 50% error rate threshold, 30s timeout |
| Isolate critical from non-critical work | Bulkhead | Separate thread pools per service tier |
| Handle burst traffic gracefully | Token Bucket | Capacity = max burst, refill = sustained rate |
| Retry safely without overloading | Exponential backoff + Jitter | Base 100ms, max 30s, full jitter |
| Circuit breaker keeps false-opening | Raise min_calls floor | Require 20+ calls before evaluating error rate |
| Recovery thundering herd | Jitter on half-open timeout | +/-5s random spread per client |
Non-obvious edge cases:
- Set circuit breaker recovery_timeout <= your SLA budget or users wait the full timeout before recovery is even attempted.
- Rate limiting without a Retry-After header is UX-hostile: clients cannot back off intelligently without it.
- A bulkhead with an unbounded queue is a time bomb. Always set queue_capacity explicitly.
Practical: Configuring Resilience4j in Spring Boot
Resilience4j provides composable annotations that layer circuit breaking, rate limiting, and bulkhead isolation directly on service methods.
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 20
        minimumNumberOfCalls: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
  ratelimiter:
    instances:
      paymentService:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 20
        maxWaitDuration: 10ms
```
```java
@Service
public class PaymentService {

    @Bulkhead(name = "paymentService", fallbackMethod = "bulkheadFallback")
    @CircuitBreaker(name = "paymentService", fallbackMethod = "circuitFallback")
    @RateLimiter(name = "paymentService", fallbackMethod = "rateFallback")
    public PaymentResult processPayment(Order order) {
        return paymentGatewayClient.charge(order);
    }

    // Referenced by the @Bulkhead annotation above
    public PaymentResult bulkheadFallback(Order order, BulkheadFullException ex) {
        return PaymentResult.retry("Payment capacity saturated. Please retry.");
    }

    public PaymentResult circuitFallback(Order order, CallNotPermittedException ex) {
        return PaymentResult.retry("Payment gateway temporarily unavailable.");
    }

    public PaymentResult rateFallback(Order order, RequestNotPermitted ex) {
        return PaymentResult.error("Rate limit exceeded. Retry-After: 1s");
    }
}
```
Testing the circuit breaker: simulate failures by mocking the gateway to throw 500 errors. After `minimumNumberOfCalls` (10) with a >= 50% failure rate, the circuit opens. Verify via Actuator: `curl localhost:8080/actuator/health | jq '.components.circuitBreakers'`. The `paymentService` entry shows `state: OPEN`. After 30 seconds it transitions to HALF_OPEN and probe calls determine whether it closes.
Micrometer metrics to monitor:
| Metric | What It Signals |
|---|---|
| resilience4j_circuitbreaker_state | Current CB state (0=CLOSED, 1=OPEN, 2=HALF_OPEN) |
| resilience4j_ratelimiter_waiting_threads | Requests blocked waiting for a token |
| resilience4j_bulkhead_available_concurrent_calls | Remaining bulkhead capacity |
Spring Security, Bucket4j, and Resilience4j: A Complete Spring Boot Defense Stack
Spring Security is the standard authentication and authorization framework for Spring Boot, providing a filter chain that intercepts every HTTP request. Bucket4j is a Java rate-limiting library built on the Token Bucket algorithm, with optional Redis-backed distributed buckets for fleet-wide enforcement. Together with Resilience4j (shown in the practical section above), these three libraries form a complete, annotation-driven reliability stack for Spring Boot microservices.
```java
// Spring Security JWT filter + Bucket4j per-client rate limiter
@Component
@RequiredArgsConstructor
public class JwtRateLimitingFilter extends OncePerRequestFilter {

    private final JwtTokenProvider tokenProvider;
    private final BucketRepository buckets; // Bucket4j + Redis backend

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain)
            throws ServletException, IOException {
        // 1. Validate JWT: authenticate the client
        String token = resolveToken(request);
        if (token == null || !tokenProvider.validateToken(token)) {
            response.sendError(HttpServletResponse.SC_UNAUTHORIZED, "Invalid token");
            return;
        }
        String clientId = tokenProvider.getSubject(token);

        // 2. Enforce per-client rate limit via Bucket4j Token Bucket
        Bucket bucket = buckets.getOrCreate(clientId,
                Bandwidth.classic(100, Refill.greedy(100, Duration.ofMinutes(1))));
        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);
        if (!probe.isConsumed()) {
            // Round up so clients never retry before the refill actually lands
            long retryAfterSec =
                    (probe.getNanosToWaitForRefill() + 999_999_999) / 1_000_000_000;
            response.setHeader("X-RateLimit-Remaining", "0");
            response.setHeader("Retry-After", String.valueOf(retryAfterSec));
            response.sendError(429, "Rate limit exceeded. Retry-After: " + retryAfterSec + "s");
            return;
        }
        response.setHeader("X-RateLimit-Remaining",
                String.valueOf(probe.getRemainingTokens()));
        chain.doFilter(request, response);
    }

    private String resolveToken(HttpServletRequest req) {
        String bearer = req.getHeader("Authorization");
        return (bearer != null && bearer.startsWith("Bearer "))
                ? bearer.substring(7) : null;
    }
}
```
The filter executes in the Spring Security filter chain: JWT validation runs first (authentication gate), then Bucket4j checks the per-client token bucket before any request reaches business logic. Combine it with the Resilience4j @CircuitBreaker and @Bulkhead from the Practical section above for complete defense in depth: authentication, then rate limiting, then bulkhead, then circuit breaker.
For a full deep-dive on Spring Security OAuth2 resource server configuration and Bucket4j distributed Redis mode, a dedicated follow-up post is planned.
Key Lessons from Reliability Pattern Failures in Production
- Rate limiting without a Retry-After header breaks clients: they cannot back off intelligently without knowing when to retry. Always return the header.
- Circuit breakers need careful threshold tuning: too sensitive (low minimumNumberOfCalls) causes false opens during rolling deployments; too loose means cascading failures still propagate.
- Bulkheads only work if you correctly identify which work is critical: equal-priority pools for payments and analytics defeat the purpose.
- Never set circuit breaker recovery_timeout longer than your SLA: a 60-second recovery_timeout against a 500ms SLA means your fallback runs for a full minute before recovery is even attempted.
- The order of Resilience4j annotations matters: Bulkhead, then Circuit Breaker, then Rate Limiter (outermost first). Inverted, a rate limit rejection counts as a circuit breaker failure.
- Combine patterns at both the service mesh level (Istio/Envoy) for zero-code-change protection and at the library level for per-method control.
TLDR: Summary & Key Takeaways
- Token Bucket enforces per-client rate limits with allowance for small bursts; tokens(t) = min(C, tokens(t0) + r*(t-t0)).
- Circuit Breaker (CLOSED to OPEN to HALF-OPEN) short-circuits failing calls using a sliding window error rate; set min_calls to avoid false opens.
- Bulkhead compartmentalizes thread pools so slow dependencies cannot starve critical paths; always configure bounded queues.
- DDoS defense is layered: CDN absorbs volume, WAF filters patterns, rate limiter blocks persistent abusers, blackholing drops the worst offenders.
- Combine all three for defense in depth and add jitter to prevent thundering herd on circuit recovery.
- Operational discipline matters as much as the patterns: monitor CB state, rate limit hits, and bulkhead rejections via Micrometer; tune thresholds against production traffic, not guesses.
Related Posts
- System Design: Caching and Asynchronism
- System Design: Sharding Strategy
- System Design: Replication and Failover
- Capacity Estimation Guide

Written by
Abstract Algorithms
@abstractalgorithms