System Design: Complete Guide to Caching — Patterns, Eviction, and Distributed Strategies
Cache-Aside to Refresh-Ahead, LRU to ARC — every caching pattern, eviction policy, and distributed strategy you need for system design
Abstract Algorithms
TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through ARC), cache invalidation pitfalls, thundering herd mitigations, multi-level topologies, and how Twitter, Netflix, and Facebook run caches at planetary scale. Read once; reference forever.
📖 Why Your Database Is Begging for a Cache
Every relational database has the same story: it starts fast, traffic grows, queries get slower, you add indexes, you tune joins, and eventually you hit the wall — the database is doing too much I/O for too many concurrent readers. The standard answer is a cache.
A cache is a key-value store that lives in RAM. It exploits one simple truth about most web traffic: 80% of reads touch 20% of the data (the Pareto principle applied to access patterns). By keeping that hot 20% in fast memory, you absorb the vast majority of read load without touching the database at all.
The numbers are staggering. A Redis GET command returns in ~100 microseconds over a local network. A PostgreSQL query that hits disk can take 10–50 milliseconds — up to 500× slower. For an application serving 10,000 requests per second, the difference between a 95% cache hit rate and a 75% hit rate is the difference between the database absorbing 500 queries per second and 2,500 — a 5× swing in the database capacity you must provision.
But caching is not just "stick Redis in front of your database." The details — which read/write pattern you use, how you evict stale entries, how you handle thousands of concurrent misses on the same key — determine whether your cache helps or becomes its own source of failure. This guide covers every pattern, every policy, and every failure mode you need to know.
🔍 How Caches Fit Into the Read and Write Path
Before diving into patterns, ground yourself in what a cache is not: it is not a database. Caches are:
- Volatile — data survives only as long as the cache process is running (unless persistence is explicitly configured, as in Redis RDB/AOF modes).
- Bounded — caches have a fixed memory budget and cannot grow infinitely. When the budget fills, the eviction policy decides what to drop.
- Eventually consistent with the DB by default — unless you implement write-through, the cache may serve data that lags behind the source of truth.
Four metrics define cache health at a glance:
| Metric | Formula | Healthy Target |
| --- | --- | --- |
| Hit Ratio | hits / (hits + misses) | ≥ 90% for a well-tuned cache |
| Eviction Rate | evictions per second | Near 0 in steady state; high eviction signals the cache is undersized |
| Miss Latency | end-to-end time to serve a cache miss | < 20 ms total (DB fetch + cache populate + response) |
| Memory Utilization | used / max memory | Stay below 80% to avoid eviction pressure |
A hit ratio below 80% usually means one of four things: the cache is too small for the working set, TTLs are too short so keys expire before being reused, the access pattern is nearly uniform (no hot keys to exploit), or a new code path is bypassing the cache entirely.
⚙️ Every Cache Pattern Explained: From Cache-Aside to Refresh-Ahead
There are six canonical caching patterns. Each solves a different problem and carries different consistency trade-offs. A typical production system mixes them — Cache-Aside for general reads, Write-Through for high-value writes, and Refresh-Ahead for peak-traffic hot keys.
Pattern 1: Cache-Aside (Lazy Loading)
The most common pattern. The application owns the cache interaction: check cache first; on a miss, load from DB, populate the cache, return data. The cache is populated lazily — only data that is actually read ever enters it.
The diagram below shows both paths through Cache-Aside. The left branch is a cache miss (cold path); the right branch is a cache hit (warm path).
```mermaid
flowchart TD
    Request[Incoming Request] --> CheckCache{Check Cache}
    CheckCache -->|Hit| ReturnCached[Return cached data instantly]
    CheckCache -->|Miss| QueryDB[Query Database]
    QueryDB --> PopulateCache[Write result to cache with TTL]
    PopulateCache --> ReturnFresh[Return fresh data to caller]
```
On a miss the application makes two sequential round trips — one to the cache (which returns nothing) and one to the database — before it can respond and warm the cache. On a subsequent hit the database is bypassed entirely. The TTL is the critical safety valve: without it, stale data lives in cache indefinitely.
Pros: Simple to implement; only hot data occupies memory; resilient to cache restarts since the DB is always authoritative.
Cons: The first request on a cold cache always hits the DB; stale data can be served for up to TTL seconds after a DB write.
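The read path is short enough to sketch directly. In this minimal in-process version, a plain dict stands in for Redis and `fetch_user_from_db` for the real database query (both names are illustrative):

```python
import time

TTL_SECONDS = 300
cache = {}  # key -> (value, expires_at); stands in for a Redis client

def fetch_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                               # hit: warm path, DB bypassed
    value = fetch_user_from_db(user_id)               # miss: cold path, two round trips
    cache[key] = (value, time.time() + TTL_SECONDS)   # populate with the TTL safety valve
    return value
```

Note that the application owns every step: neither the cache nor the database knows the other exists.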
Pattern 2: Read-Through
The cache sits transparently in front of the database. The application only ever talks to the cache. On a miss, the cache itself fetches from the database, populates itself, and returns the data. The application code never issues a direct DB query.
Used by Hibernate's second-level cache, NCache read-through configurations, and effectively every CDN (the edge node fetches from origin on a miss).
Pros: Application code is cleaner — one data source for all reads.
Cons: The first miss still incurs full DB latency; tight coupling between the cache tier and the DB connector layer.
Pattern 3: Write-Through
Every write goes to the cache and the database synchronously — in the same operation. Both stores are updated before the write is acknowledged to the caller. The cache is always consistent with the database for recently written keys.
The diagram below shows write-through on the left and write-back on the right, illustrating the synchronous vs. asynchronous distinction.
```mermaid
flowchart LR
    subgraph WT[Write-Through]
        WT_App[Application] -->|Write| WT_Cache[Cache]
        WT_Cache -->|Sync write| WT_DB[Database]
        WT_DB -->|Acknowledge| WT_App
    end
    subgraph WB[Write-Back]
        WB_App[Application] -->|Write| WB_Cache[Cache]
        WB_Cache -->|Acknowledge immediately| WB_App
        WB_Cache -.->|Async flush later| WB_DB[Database]
    end
```
In write-through, the write is not complete until both the cache and DB have confirmed it — guaranteeing consistency. In write-back, the cache acknowledges immediately and defers the DB write, trading consistency for low write latency.
Pros: No stale reads after writes; cache is warm for data that is read immediately after being written.
Cons: Higher write latency (every write waits for both a cache write and a DB write); cache gets filled with data that may never be read.
Pattern 4: Write-Back (Write-Behind)
Writes go to the cache immediately. The cache acknowledges the write and asynchronously flushes to the database in a background job — often batched for efficiency. Typical flush intervals range from 100 ms to several seconds.
Pros: Extremely low write latency (write to RAM only, acknowledge immediately); batch DB writes reduce I/O pressure; excellent for write-heavy workloads like counters and leaderboards.
Cons: Data loss risk if the cache crashes before flushing; DB temporarily lags behind cache — dangerous for inventory, financial balances, or any data where the DB must be the truth.
Real-world users: game leaderboard score updates, social media like counts, analytics counters, and session activity tracking — all workloads where a small window of potential loss is acceptable.
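The mechanics fit in a small sketch. Here `flush_to_db` stands in for a batched database write, and flushing on batch size alone is a simplification (real implementations also flush on a timer):

```python
from collections import defaultdict

class WriteBackCounter:
    """Write-back sketch: increments land in RAM and are flushed to the DB in batches."""

    def __init__(self, flush_to_db, batch_size=100):
        self.flush_to_db = flush_to_db   # callable taking {key: delta}
        self.batch_size = batch_size
        self.pending = defaultdict(int)  # un-flushed deltas (lost if the process crashes)
        self.pending_ops = 0

    def incr(self, key, delta=1):
        self.pending[key] += delta       # acknowledged immediately: RAM only
        self.pending_ops += 1
        if self.pending_ops >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_to_db(dict(self.pending))  # one batched DB write
        self.pending.clear()
        self.pending_ops = 0
```

The `pending` dict is exactly the window of potential loss: anything in it when the cache dies never reaches the database.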
Pattern 5: Write-Around
Writes skip the cache entirely and go directly to the database. The cache is only populated when data is subsequently read (via Cache-Aside). The cache is never polluted with data that was just written but may never be read again.
Best for: Write-once, read-rarely data — audit logs, batch-imported records, compliance archives, telemetry. Writing this data to cache would evict frequently accessed hot keys without providing any read benefit.
Pattern 6: Refresh-Ahead
The cache proactively refreshes entries that are approaching TTL expiry — before the TTL actually elapses. When an entry is accessed and its remaining TTL has dropped below a configured threshold (e.g., 20% of original TTL remaining), the cache triggers a background fetch from the database and silently updates the entry.
Pros: Eliminates the cold-miss penalty for frequently accessed hot keys; high-traffic keys are perpetually warm.
Cons: Fetches data that may not be needed again (wasted DB work) if traffic drops; requires accurately estimating refresh thresholds.
Implemented by Caffeine's refreshAfterWrite, Varnish's Grace Mode, and HTTP's stale-while-revalidate Cache-Control directive.
Six-Pattern Comparison Table
| Pattern | Read Latency | Write Latency | Consistency Risk | Best Use Case |
| --- | --- | --- | --- | --- |
| Cache-Aside | Low (hit) / High (miss) | Normal — DB only | Stale during TTL window | General-purpose reads |
| Read-Through | Low (hit) / High (miss) | Normal — DB only | Stale during TTL window | Simplified application code |
| Write-Through | Low | Higher — cache + DB sync | None for recent writes | Data read immediately after write |
| Write-Back | Low | Very Low — cache only | Window of potential loss | Write-heavy, loss-tolerant workloads |
| Write-Around | Low (hit) / High (miss) | Normal — DB only | Cache never stale (never populated on write) | Write-once, read-rarely data |
| Refresh-Ahead | Very Low — always warm | Normal — DB only | Minimal — background refresh before expiry | High-traffic hot keys with TTL pressure |
🧠 Deep Dive: Eviction Policies, Invalidation, and the Thundering Herd
The Internals: LRU, LFU, ARC, SLRU, and How They Actually Work
When the cache is full and a new key must be inserted, the eviction policy selects the victim to remove. Choosing the wrong policy wastes memory, degrades hit rates, and in pathological cases (scan pollution), can evict your entire working set.
LRU — Least Recently Used
The most widely deployed policy. Internally implemented as a doubly linked list + hash map: the hash map provides O(1) key lookup; the linked list maintains access recency from most-recent (head) to least-recent (tail). Every access unlinks the node and moves it to the head. Eviction removes the tail node.
```mermaid
flowchart LR
    Head[HEAD - Most Recent] --> N1[Key C] --> N2[Key A] --> N3[Key F] --> Tail[TAIL - Evict First]
```
This diagram represents the internal LRU linked list. Key C was accessed most recently (it's at the head); Key F was accessed least recently and will be the first victim when the cache needs room for a new entry. Every GET operation promotes the accessed node to the HEAD position in O(1) time.
The clock approximation used in Redis allkeys-lru avoids maintaining a full linked list by storing a 24-bit LRU clock value per key — cheaper at scale but approximate (good enough for most workloads).
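Python's `OrderedDict` packages the hash map and the recency-ordered linked list into a single structure, so the full policy fits in a few lines as a sketch:

```python
from collections import OrderedDict

class LRUCache:
    """Doubly linked list + hash map LRU; OrderedDict provides both in one structure."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # iteration order == recency order (first = oldest)

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # O(1) promotion to most-recently-used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least-recently-used "tail"
```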
LFU — Least Frequently Used
Tracks an access frequency counter per key. Evicts the key with the lowest counter. The naïve implementation uses a min-heap (O(log n) for eviction), but the O(1) LFU algorithm (Ketan Shah, 2010) uses doubly linked lists of frequency buckets for constant-time operations.
The critical weakness: a viral post accessed 50,000 times yesterday but not at all today retains its high frequency counter and will not be evicted. Frequency aging — periodic counter halving — mitigates this. Caffeine's W-TinyLFU uses a Count-Min Sketch for approximate frequency counting with O(1) updates and bounded memory.
FIFO — First In, First Out
The simplest policy: evict the oldest inserted key regardless of access recency or frequency. Implemented as a plain queue. Appropriate for time-series data where age directly correlates with staleness — sensor readings with a 5-minute relevance window, rate-limit counters, rolling log windows.
TTL — Time-To-Live
Not a replacement-on-full policy but a per-key expiry mechanism. Two expiry strategies coexist in Redis:
- Lazy expiry: The key is not deleted when it expires. It is deleted on the next access attempt if the TTL has elapsed — zero background cost, but expired keys occupy memory until touched.
- Active expiry: A background cycle (runs 10 times per second in Redis, sampling 20 random keys per cycle) finds and deletes expired keys proactively, reclaiming memory without waiting for access.
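Both strategies can be sketched over a single dict of `(value, expires_at)` pairs. The `active_expire_cycle` method here samples keys in the spirit of Redis's background cycle, with the sampling details simplified:

```python
import time

class TTLCache:
    """Per-key TTL with lazy expiry on access plus an active sampling cycle."""

    def __init__(self):
        self.data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self.data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key, default=None):
        entry = self.data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.data[key]   # lazy expiry: deleted on access, not in the background
            return default
        return value

    def active_expire_cycle(self, sample_size=20):
        """Active expiry sketch: scan a bounded sample and reclaim expired keys."""
        now = time.monotonic()
        for key in list(self.data)[:sample_size]:
            if self.data[key][1] <= now:
                del self.data[key]
```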
ARC — Adaptive Replacement Cache
Developed by IBM Research (Nimrod Megiddo and Dharmendra Modha, 2003). ARC maintains four lists:
- T1 — recently accessed keys (seen once only)
- T2 — frequently accessed keys (seen more than once)
- B1 — ghost entries: keys recently evicted from T1 (metadata only, no data)
- B2 — ghost entries: keys recently evicted from T2 (metadata only, no data)
The split between T1 and T2 is dynamically tuned based on observed hit patterns in the ghost lists: if evictions from T1 are being requested again soon (hit in B1), ARC expands T1 space. ARC is scan-resistant (a sequential table scan does not evict the entire working set), adapts to workload shifts automatically, and needs no tuning parameter. Used in ZFS, IBM DS8000 storage systems, and several database buffer pools.
SLRU — Segmented LRU
Divides the cache into two segments: a protected segment (for keys accessed at least twice — proven hot) and a probationary segment (for keys accessed only once — unproven). New keys enter probationary. On a second access they are promoted to protected. Eviction always takes from probationary first, shielding keys that have demonstrated their value from accidental eviction.
Caffeine's W-TinyLFU extends SLRU with an admission filter: a Count-Min Sketch estimates the historical frequency of both the new candidate key and the eviction candidate. The new key is only admitted to the protected segment if its estimated frequency exceeds the eviction candidate's — preventing a low-frequency key from displacing a proven hot key.
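The Count-Min Sketch at the heart of that admission filter is compact enough to sketch. This toy version (the blake2b-based row hashing is an illustrative choice, not Caffeine's actual implementation) can over-count on collisions but never under-counts:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: collisions inflate counts, never deflate them."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        for row in range(self.depth):
            # One independent-ish hash per row, derived by salting blake2b.
            digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "big")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        for row, idx in self._indexes(key):
            self.rows[row][idx] += 1

    def estimate(self, key):
        # The minimum across rows is the least-inflated estimate.
        return min(self.rows[row][idx] for row, idx in self._indexes(key))
```

A W-TinyLFU-style admission check then reduces to comparing `sketch.estimate(candidate)` against `sketch.estimate(victim)` and admitting only when the candidate wins.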
Random Replacement
Evicts a uniformly random key. Zero metadata overhead — no linked lists, no counters, no timestamps. Surprisingly competitive with LRU on workloads with flat access distributions (all keys equally likely). Redis offers it as the allkeys-random policy for exactly these memory-constrained, flat-access cases.
Performance Analysis: Choosing the Right Eviction Policy
| Policy | Memory Overhead | Hit/Miss O() | Scan Resistant | Frequency Aware | Real-World Users |
| --- | --- | --- | --- | --- | --- |
| LRU | O(n) linked list | O(1) | ❌ | ❌ | Redis, Varnish, Nginx |
| LFU | O(n) counters | O(1) with O(1) LFU | ❌ | ✅ | Redis 4.0+, Guava |
| FIFO | O(1) per key | O(1) | N/A | ❌ | Simple rate-limit windows |
| TTL | Per-key timestamp | O(1) | N/A | N/A | Redis, Memcached |
| ARC | ~2× LRU (ghost lists) | O(1) | ✅ | Partially | ZFS, IBM DS8000, some DB pools |
| W-TinyLFU | O(n) sketch | O(1) | ✅ | ✅ | Caffeine |
| Random | None | O(1) | N/A | ❌ | Redis allkeys-random |
Rule of thumb: Start with LRU (allkeys-lru in Redis). Upgrade to W-TinyLFU (Caffeine) for JVM in-process caches where mixed recency/frequency workloads are common. Use ARC at the storage or database buffer-pool layer where access patterns are opaque and sequential scans occur regularly.
Cache Invalidation: The Hardest Problem in Computer Science
Phil Karlton's famous quip — "There are only two hard things in computer science: cache invalidation and naming things" — is not hyperbole. Invalidation is hard because you must coordinate two systems (cache + database) that have no shared transaction boundary.
The classic failure mode: the database row is updated, then the application tries to delete the cache key, but the delete fails (network error, process crash, key not found). Users continue reading stale data until TTL expires. And if you reverse the order — delete cache first, then update DB — a concurrent read between the delete and the DB update will re-populate the cache with the old value.
TTL-Based Invalidation
The simplest approach. Every key has an expiry time. Stale data is served for at most TTL seconds. Safe, predictable, and self-healing. Works well when users can tolerate eventual consistency — social feed counters, recommendation lists, product catalog pages.
Event-Driven Invalidation via CDC
A Change Data Capture (CDC) pipeline watches the database's transaction log (e.g., Debezium on MySQL binlog, or Postgres logical replication). When a row is updated, a Kafka event triggers the cache layer to delete the relevant key. Achieves near-real-time consistency without write-through overhead. The risk: if the Kafka consumer lags behind, the cache remains stale until the event is processed.
Cache Busting (URL Fingerprinting)
Used exclusively for CDN and static asset caches. Asset URLs embed a content hash: styles.a1b2c3d4.css. When the file changes, the hash changes, the URL changes, and CDN nodes serve fresh content immediately — no explicit invalidation required. CDN nodes may keep the old asset cached under the old URL indefinitely, but no browser will ever request that URL again. Zero-cost invalidation at CDN scale.
Key Versioning
Cache keys include a version number: user:42:v3. When the user's record is updated, the application increments the version to v4. The old key is never explicitly deleted — it simply expires via TTL and stops being requested. The new key starts with a miss and populates on first access. Simple and crash-safe, but version state must be stored somewhere (often a counter in Redis or the DB row itself).
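A sketch of the scheme, with plain dicts standing in for Redis (both the cache and the version counter) and `load_from_db` / `update_db` as illustrative callables:

```python
cache = {}     # stands in for Redis
versions = {}  # per-entity version counter (could live in Redis or on the DB row)

def cache_key(entity, entity_id):
    v = versions.get((entity, entity_id), 1)
    return f"{entity}:{entity_id}:v{v}"

def read_user(user_id, load_from_db):
    key = cache_key("user", user_id)
    if key not in cache:
        cache[key] = load_from_db(user_id)   # new version's key starts with a miss
    return cache[key]

def write_user(user_id, update_db):
    update_db(user_id)
    # Bump the version instead of deleting the old key; the old key ages out via TTL.
    versions[("user", user_id)] = versions.get(("user", user_id), 1) + 1
```

Crash-safety falls out naturally: if the version bump is lost, readers simply keep seeing the old version until the write is retried, and no half-invalidated state exists.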
Why strong consistency is nearly impossible without 2PC: To guarantee a cache is never stale, every DB write must atomically also invalidate the cache key. Atomicity across two separate systems — a relational database and an in-memory cache — requires a two-phase commit (2PC), which is expensive, complex, and a common source of deadlocks. Most production systems accept a brief stale window rather than pay this cost.
Thundering Herd and Cache Stampede: When TTL Expiry Becomes a Self-Inflicted DDoS
A cache stampede occurs when a high-traffic key expires and thousands of concurrent requests simultaneously find a cache miss, all rush to the database to regenerate the value, and overload it. The database receives a sudden burst of identical, duplicate queries — effectively a self-inflicted DDoS.
The problem is acute for: high-traffic homepages where a single cached rendering absorbs enormous traffic; popular content (celebrity profiles, trending articles); and coordinated deployments where all cache nodes restart simultaneously, invalidating the entire hot working set.
Mitigation 1: Probabilistic Early Expiration (XFetch)
Introduced by Vattani et al. (2015). Each request computes now − delta × beta × log(rand()), where delta is the measured cost of recomputing the value, beta ≥ 1 controls eagerness, and log(rand()) is always ≤ 0; a background refresh is triggered if the result has reached or passed the expiry time. The probability of triggering a refresh increases smoothly as TTL approaches zero, spreading the cold-miss load across multiple requests rather than concentrating it at the exact moment of expiry. No locks, no coordination — purely probabilistic.
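The whole technique reduces to one inequality. In this sketch, `delta` is the measured recomputation cost and `rnd` is injectable so the decision is deterministic under test:

```python
import math
import random

def should_refresh(now, expiry, delta, beta=1.0, rnd=None):
    """XFetch check: refresh early with probability rising as expiry approaches.

    delta: observed cost of recomputing the value (same time unit as now/expiry).
    beta:  > 1 refreshes more eagerly, < 1 less eagerly.
    rnd:   uniform (0, 1] sample; injectable for deterministic testing.
    """
    if rnd is None:
        rnd = random.random()
    # -log(rnd) is a positive, exponentially distributed head start.
    return now - delta * beta * math.log(rnd) >= expiry
```

On a hit, the caller serves the cached value regardless, and kicks off a background recompute only when `should_refresh` returns True.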
Mitigation 2: Mutex Lock (Cache Lock)
The first thread to detect a miss acquires a distributed lock (implemented via Redis SET lock:key token NX PX timeout_ms). Only the lock holder queries the database; all other threads either wait or receive a stale value. On completion, the lock is released and the cache is populated. Subsequent waiters hit the now-warm cache. Trade-off: waiters see a latency spike during the lock window; if the lock holder crashes, the lock must time out before another thread can proceed.
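The same idea applies in-process: collapse concurrent misses on one key so only one loader runs. A minimal sketch with threading primitives (a distributed deployment would use the Redis lock instead of an in-process one):

```python
import threading

class SingleFlight:
    """Collapse concurrent loads of the same key: one loader runs, the rest wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event guarding an in-progress load
        self._results = {}

    def load(self, key, loader):
        with self._lock:
            event = self._inflight.get(key)
            owner = event is None
            if owner:
                event = threading.Event()
                self._inflight[key] = event
        if owner:
            try:
                self._results[key] = loader()   # only the "lock holder" hits the DB
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        else:
            event.wait()                        # everyone else waits for the warm value
        return self._results[key]
```

If the loader raises, waiters here would fail on the missing result; a production version needs to propagate the error explicitly.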
Mitigation 3: Stale-While-Revalidate (Background Refresh)
Serve the stale cached value immediately to every waiting request. In parallel, trigger an asynchronous background job to refresh the key. Users see slightly stale data, but the response is instant and the database receives exactly one refresh query. Caffeine's refreshAfterWrite, Varnish's Grace Mode, and the HTTP stale-while-revalidate Cache-Control directive all implement this pattern.
Mitigation 4: TTL Jitter
When populating a large batch of keys (after a deployment, cold start, or large import), add a random offset to each key's TTL. Instead of all keys expiring at t + 300s, they expire uniformly in the range t + 270s to t + 330s. The synchronized expiry cascade is spread over a 60-second window instead of triggering as a single simultaneous surge.
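The fix is one line of arithmetic, sketched here with a ±10% default window:

```python
import random

def jittered_ttl(base_ttl_seconds, jitter_fraction=0.1):
    """Base TTL plus a uniform random offset in [-fraction, +fraction] of the base."""
    offset = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + random.uniform(-offset, offset)
```

With a 300-second base TTL this yields expiries spread over 270–330 seconds, turning one synchronized expiry surge into a gentle 60-second ramp.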
📊 Cache Topologies: Local, Distributed, CDN, and Multi-Level Hierarchies
Caches exist at multiple levels in a real system. Understanding the latency and consistency characteristics of each level is essential for designing the right topology for your workload. The four standard layers, ordered from fastest to slowest, are shown below.
```mermaid
flowchart LR
    Client[Client Request] --> CDN[CDN Edge Cache]
    CDN -->|Cache miss| LB[Load Balancer]
    LB --> App[App Server]
    App --> L1[L1 Local Cache]
    L1 -->|Miss| L2[L2 Distributed Cache]
    L2 -->|Miss| DB[Database]
```
Each hop in the diagram represents an order-of-magnitude latency jump: L1 local cache (nanoseconds), L2 distributed cache (1–2 ms), database (10–50 ms), CDN edge (20–80 ms to nearest PoP vs. 150–300 ms to origin). The goal is to answer as many requests as possible at the leftmost — fastest — layer.
L1 — In-Process Local Cache (~100 ns)
Lives inside the application process itself. Examples: Caffeine (Java), Guava Cache (Java), Node.js lru-cache, Python functools.lru_cache. Zero network overhead — a local cache lookup is a hash table read in the JVM heap, typically 50–200 nanoseconds.
The key limitation: it is not shared. In a 32-server fleet, each server maintains its own private L1 cache. A write that invalidates a key on server 1 does not affect servers 2–32. Local caches work best for immutable or slowly changing reference data: country lists, feature flags, currency exchange rates, config values.
L2 — Distributed Cache (~1 ms)
A shared, networked cache tier. Examples: Redis Cluster, Memcached, Apache Ignite. All application servers in the fleet read from and write to the same cache cluster. Network round-trip adds ~0.5–2 ms but ensures consistency across the fleet — a write on server 1 is immediately visible to servers 2–32.
Redis Cluster shards keys across nodes using CRC16 hash slots (16,384 total) and replicates each shard to standby nodes for high availability. Memcached uses client-side consistent hashing with no built-in replication.
L3 — CDN Cache (~20–80 ms edge, ~150–300 ms origin fallback)
A globally distributed network of cache nodes (Points of Presence, PoPs) positioned close to end users. Examples: Cloudflare, CloudFront, Fastly, Akamai. CDN caches serve static and semi-static content (HTML, CSS, JS, images, API responses) with edge latency of 20–80 ms — versus 150–300 ms to a single-region origin for geographically distant users.
CDN behavior is controlled by HTTP headers: Cache-Control: public, max-age=86400 instructs every caching layer (browser, CDN, proxy) to store the response for 24 hours. Cache-Control: private prevents CDN caching for user-specific content.
Multi-Level (Hierarchical) Caching
Production systems stack L1 and L2: check local cache first; on miss, check Redis; on miss, query DB and populate both levels. The key consistency challenge: when a key is invalidated in Redis (L2), L1 caches across all servers still hold the stale value.
Two standard solutions:
- Short L1 TTLs (1–5 seconds) for frequently changing data — the stale window is brief and self-healing.
- Pub/Sub invalidation — on every DB write, publish a message to a Redis Pub/Sub channel; all app servers subscribe and evict the key from their L1 cache immediately.
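The L1 + L2 read path with Pub/Sub invalidation can be sketched in-process, with a shared dict standing in for Redis and plain callbacks standing in for a Pub/Sub channel:

```python
class TwoLevelCache:
    """Per-process L1 dict over a shared L2 store, with callback-based invalidation."""

    def __init__(self, l2, subscribe):
        self.l1 = {}
        self.l2 = l2                      # shared store, stands in for Redis
        subscribe(self._on_invalidate)    # register for invalidation messages

    def _on_invalidate(self, key):
        self.l1.pop(key, None)            # evict the stale local copy immediately

    def get(self, key, load_from_db):
        if key in self.l1:
            return self.l1[key]           # ~100 ns path
        if key in self.l2:
            self.l1[key] = self.l2[key]   # ~1 ms path, then warm L1
            return self.l1[key]
        value = load_from_db(key)         # slow path: populate both levels
        self.l2[key] = value
        self.l1[key] = value
        return value
```

On every DB write, the writer updates L2 and publishes the key; every subscribed process drops its L1 copy and re-reads the fresh value from L2 on next access.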
| Cache Level | Typical Latency | Shared Across Fleet | Primary Use |
| --- | --- | --- | --- |
| L1 Local (Caffeine) | ~100 ns | No — per-process | Reference data, config, feature flags |
| L2 Distributed (Redis) | ~1 ms | Yes — cluster-wide | Session data, user profiles, hot API responses |
| L3 CDN (Cloudflare) | ~50 ms edge | Yes — global PoP | Static assets, public API responses, HTML pages |
| Database (PostgreSQL) | ~10–50 ms | Yes | Source of truth — all cache layers miss |
🌍 How Twitter, Netflix, Facebook, and Cloudflare Cache at Scale
Twitter — Redis Sorted Sets for Timeline Assembly
Twitter's home timeline is one of the most complex caching problems at scale. Assembling a feed from tens of thousands of followed accounts in real time would require expensive fan-in joins across hundreds of database shards on every page load. Instead, Twitter pre-computes each user's timeline and caches it in Redis as a sorted set (scores = tweet timestamps). Opening the app reads at most 800 timeline entries from Redis — no database joins, purely cache reads answering in sub-millisecond time. Heavy accounts with millions of followers use a hybrid push/pull model: the most popular accounts' tweets are fetched from Redis at read time rather than pre-written to millions of followers' sorted sets (too expensive to write).
Netflix — EVCache at 99% Cache Hit Rate
Netflix runs EVCache, a distributed Memcached derivative built for multi-region replication. Every piece of video metadata (titles, thumbnails, subtitles), personalization data, and subscriber preferences lives in EVCache. Netflix reports >99% cache hit rates for EVCache, meaning the databases backing these caches handle less than 1% of read traffic. EVCache replicates writes across AWS regions asynchronously, accepting a brief eventual consistency window in exchange for sub-millisecond reads globally. The system handles millions of requests per second across 190+ countries.
Facebook — Lease Mechanism for Thundering Herd at 100,000-Server Scale
Facebook's 2013 Memcached paper (Nishtala et al.) describes a lease mechanism that directly solves thundering herd in their look-aside cache. When a thread misses a key, it receives a lease token from Memcached. Only the lease-holding thread is permitted to query the database and repopulate the cache. Other threads that concurrently miss the same key receive either a "wait and retry" response (short sleep, then retry the cache) or a stale value if one is available. This guarantees at most one database query per cold key regardless of how many concurrent threads are waiting. Facebook's Memcached fleet spans 100,000+ servers and stores the social graph, photo metadata, and user profile data at a scale that no database tier alone could absorb.
Cloudflare — Tiered Cache to Protect Origin
Cloudflare's Tiered Cache adds a middle layer (upper-tier PoPs, also called shield PoPs) between the 300+ global edge PoPs and the customer's origin server. When an edge PoP in London misses a key, it queries an upper-tier PoP in Amsterdam before falling back to the customer's origin. This collapses the thundering herd from 300 simultaneous origin requests (one per PoP) down to a single request — reducing origin bandwidth by up to 70% on popular content. The topology directly implements hierarchical caching at global infrastructure scale.
⚖️ Consistency vs. Performance: The Core Caching Trade-off
Every caching decision sits on the consistency–performance axis. There is no configuration that delivers both perfect consistency and zero additional latency — every gain on one axis costs something on the other.
| Dimension | Maximise Consistency | Maximise Performance |
| --- | --- | --- |
| Pattern | Write-Through | Write-Back |
| Invalidation | Event-driven CDC | TTL-based |
| TTL duration | Short (seconds) | Long (hours) |
| Cache scope | Distributed only (L2) | Multi-level L1 + L2 |
| Eviction policy | Short TTL + explicit delete | LRU/LFU with long TTL |
| Stampede mitigation | Mutex lock (one writer, consistent) | Stale-while-revalidate (instant, slightly stale) |
The sharpest trade-off — write-through vs. write-back: Write-through guarantees cache = DB at all times but every write waits for both a cache write and a DB write before returning. Write-back cuts write latency to sub-millisecond but risks losing un-flushed writes. For financial transactions, inventory decrements, and payment state, write-through is non-negotiable. For game scores, reaction counts, and session activity, write-back is the right call.
Consistency requirements mapped to data type:
| Data Type | Acceptable Staleness | Recommended Pattern |
| --- | --- | --- |
| Inventory / payment state | 0 seconds — must be exact | Write-Through + DB transaction |
| User profile fields | 1–60 seconds | Cache-Aside + short TTL + event invalidation on write |
| Social feed counts (likes, views) | 1–5 minutes | Write-Back + periodic flush |
| Static HTML / CSS / JS | Hours to days | CDN + cache busting on deploy |
| Feature flags / config | 1–5 minutes | Local cache + Pub/Sub invalidation |
| Search index data | Minutes to hours | Write-Around + scheduled cache warm-up |
🧭 Decision Guide: Which Cache Pattern and Eviction Policy to Pick
Selecting a Cache Pattern by Scenario
| Scenario | Recommended Pattern | Why |
| --- | --- | --- |
| Read-heavy, writes are rare | Cache-Aside | Simple; only hot data occupies memory; DB is always authoritative |
| Data is read immediately after being written | Write-Through | Cache stays warm for fresh data; no stale reads after writes |
| Write-heavy workload, loss is acceptable | Write-Back | Minimises write latency; batch flushes reduce DB I/O pressure |
| Write-once, read-rarely (audit logs, archives) | Write-Around | Prevents cache pollution; preserves capacity for hot reads |
| High-traffic hot keys under TTL pressure | Refresh-Ahead | Eliminates cold-miss latency; keys are always warm |
| Application simplicity is the priority | Read-Through | Single data source for all reads; no dual-write logic in application |
Selecting an Eviction Policy by Workload
| Workload Characteristics | Recommended Policy | Redis maxmemory-policy |
| --- | --- | --- |
| General web traffic — mixed recency | LRU | allkeys-lru |
| Long-lived viral content, stable hot set | LFU | allkeys-lfu |
| Only evict keys that have a TTL set | LRU on volatile keys | volatile-lru |
| Evict keys closest to expiry first | TTL proximity | volatile-ttl |
| JVM in-process cache with mixed workload | W-TinyLFU | Caffeine default |
| Storage-layer buffer pool (ZFS, databases) | ARC | ZFS / DB default |
| Extremely memory-constrained, flat access | Random | allkeys-random |
Selecting a Cache Topology by Need
| What You Need | Recommended Topology |
| --- | --- |
| Lowest possible read latency (nanoseconds) | L1 local cache — Caffeine or Guava |
| Shared state across a multi-server fleet | L2 distributed — Redis Cluster |
| Global static asset or API response delivery | L3 CDN — Cloudflare or CloudFront |
| Low latency + fleet consistency | L1 + L2 with Pub/Sub invalidation |
| Thundering herd protection on hot keys | XFetch or mutex lock at L2; stale-while-revalidate at CDN |
🧪 Sizing and Tuning Your Cache Before It Breaks in Production
Cache sizing is not optional — a cache that runs out of memory at peak load starts evicting aggressively, hit rate collapses, and the database absorbs a sudden 10× read spike exactly when you can least afford it. Size before you deploy.
Step 1 — Measure the working set size.
The working set is the set of keys that receive at least one access per TTL window. Start by logging distinct cache keys accessed over a 5-minute window at peak traffic. Multiply the key count by average value size. Most web workloads follow a power law: 10% of keys absorb 90% of traffic. Caching the top 10% of keys typically achieves 80–90% hit ratio; the top 25% typically reaches 95%.
Step 2 — Apply the sizing formulas.
| Resource | Sizing Formula | Worked Example |
| --- | --- | --- |
| Redis memory | (working_set_size) × 1.2 headroom | 500K keys × 2 KB × 1.2 = 1.2 GB |
| Caffeine local cache size | peak_rps × avg_ttl_seconds × 0.1 | 5,000 rps × 60 s × 0.1 = 30,000 entries |
| CDN TTL | acceptable_staleness_seconds / 2 | 10 min acceptable → 5 min TTL |
| Alert threshold (eviction rate) | > 0 sustained evictions/sec | Alert immediately; sustained eviction = cache is undersized |
Step 3 — Set the eviction policy before you hit the memory limit.
Redis defaults to maxmemory-policy noeviction, which returns OOM errors to the application when memory is full. This is almost never what you want in production. Set allkeys-lru as your default before the first deployment. Revisit after observing your actual access pattern in production.
Step 4 — Monitor these four signals continuously:
- Hit ratio — alert if it drops below 80%.
- Eviction rate — alert on any sustained nonzero eviction in steady state.
- Memory utilization — alert above 80% of `maxmemory`.
- Keyspace size — a sudden spike indicates an application bug writing unexpected keys.
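Three of these signals (hit ratio, eviction rate, memory utilization) reduce to a single alert predicate; a minimal illustration, not a real monitoring API, with thresholds mirroring the text (keyspace-size anomaly detection needs a baseline, so it is omitted here):

```java
// Alert predicate over the monitoring signals above. Thresholds mirror the
// text: hit ratio below 80%, any sustained evictions in steady state, or
// memory above 80% of maxmemory.
class CacheAlerts {

    static double hitRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 1.0 : (double) hits / total;
    }

    static boolean shouldAlert(long hits, long misses,
                               long sustainedEvictionsPerSec,
                               long usedMemoryBytes, long maxMemoryBytes) {
        return hitRatio(hits, misses) < 0.80
                || sustainedEvictionsPerSec > 0
                || usedMemoryBytes > 0.80 * maxMemoryBytes;
    }
}
```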
🛠️ Redis and Caffeine: Production Configuration Reference
Redis Eviction and Memory Configuration
Redis ships with conservative defaults that are wrong for most production caching workloads. The five settings below are the minimum you must review before deploying Redis as a cache.
```
# redis.conf — core caching parameters

maxmemory 4gb
# Hard memory cap. Redis refuses writes (noeviction) or evicts (all other policies)
# when this limit is reached. Always set this explicitly — never leave it unbounded.

maxmemory-policy allkeys-lru
# Evict least-recently-used keys from ALL keyspaces when maxmemory is reached.
# Alternatives: allkeys-lfu (frequency-aware), volatile-lru (only TTL keys),
# volatile-ttl (evict keys closest to expiry), allkeys-random.

activerehashing yes
# Incrementally rehash the main hash table in the background.
# Prevents latency spikes when the hash table needs to resize at peak traffic.

lazyfree-lazy-eviction yes
# Perform evictions in a background thread rather than the main event loop.
# Critical for large values: evicting a 10 MB list synchronously blocks
# all other commands for milliseconds. Set to yes in all production deployments.

lazyfree-lazy-expire yes
# Delete expired keys in a background thread instead of blocking on expiry.
# Complements active expiry; reduces tail latency from key deletion.
```
`lazyfree-lazy-eviction yes` is the single most important change for latency-sensitive caches. Without it, evicting large values blocks Redis's single-threaded event loop and produces latency spikes that cascade to the application layer.
Caffeine In-Process Cache Configuration for JVM Services
Caffeine is the standard L1 in-process cache for JVM applications, replacing Guava Cache. The builder parameters below cover the four most important production settings.
```java
// Caffeine cache builder — L1 in-process cache configuration.
// build(loader) returns a LoadingCache, which is required for
// refreshAfterWrite to perform background reloads.
LoadingCache<String, UserProfile> cache = Caffeine.newBuilder()
    .maximumSize(50_000)            // Max entries; W-TinyLFU eviction when full
    .expireAfterWrite(5, MINUTES)   // TTL: hard expiry 5 min after write
    .refreshAfterWrite(3, MINUTES)  // Refresh-Ahead: background reload at 3 min,
                                    // before expiry; serves stale value during refresh
    .recordStats()                  // Enable CacheStats (hit ratio, eviction count);
                                    // wire to Micrometer for Prometheus/Grafana
    .build(key -> userRepository.findById(key)); // Auto-load on miss (read-through)
```
`refreshAfterWrite` is deliberately set lower than `expireAfterWrite`: the first read after the 3-minute mark triggers an asynchronous reload while the current value is still served, so frequently accessed entries are renewed before the 5-minute hard expiry ever fires. This implements Refresh-Ahead and eliminates the cold-miss latency spike for hot keys; note the refresh is access-triggered, not timer-driven, so a key nobody reads simply expires at 5 minutes. `recordStats()` exposes a `CacheStats` object with per-cache hit/miss/eviction metrics — always enable it in production and export to your observability stack.
For a full deep-dive on Caffeine tuning, W-TinyLFU internals, and multi-level cache topology with Spring, see the companion post: LLD for LRU Cache: Designing a High-Performance Cache.
📚 Lessons Learned from Running Caches in Production
Fix the slow query before you cache it. A 500 ms query cached for 5 minutes still hammers the database every 5 minutes — and with thundering herd potential every time the TTL fires. Optimize the query first; cache the fast result. A cache on top of a slow query is technical debt with a TTL.
Never cache error responses. If the database is temporarily unavailable and your application returns an error, caching that error means every user sees the error until TTL expires — even after the database recovers. Always inspect the result before writing to cache. A null or error return should never be stored.
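The guard is easy to get wrong, so here is a minimal cache-aside read that refuses to store a failure. The `ConcurrentMap` stands in for a real cache client; only a successful, non-null load is written back:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Cache-aside read that never stores failures. A thrown exception or a
// null result is surfaced as an empty Optional and leaves the cache
// untouched, so the next request retries the source of truth.
class SafeCacheAside {

    static <K, V> Optional<V> get(ConcurrentMap<K, V> cache, K key,
                                  Function<K, V> loader) {
        V cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        try {
            V loaded = loader.apply(key);   // may throw during a DB outage
            if (loaded != null) {
                cache.put(key, loaded);     // cache only on success
            }
            return Optional.ofNullable(loaded);
        } catch (RuntimeException e) {
            return Optional.empty();        // never cache the error
        }
    }
}
```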
A cache without TTL is a slow memory leak. Data that never expires accumulates until the server hits maxmemory and begins evicting unpredictably. Set a TTL on every key, even if it is 24 or 48 hours, to give the cache a natural floor for reclamation.
Namespace cache keys from day one. Never use a bare ID such as `42` as a cache key. Use qualified keys with entity type, version, and where needed, tenant: `user:42:profile:v3`, `tenant:acme:config:v1`. Bare ID keys cause silent data leakage (user A's response served to user B) and make keyspace debugging impossible at scale.
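Centralizing key construction in one helper makes the convention enforceable; a sketch, with illustrative names:

```java
// Qualified cache-key builders following the user:42:profile:v3 convention
// from the text. Centralizing construction prevents bare-ID collisions and
// makes a cache-busting version bump a one-line change.
class CacheKeys {

    static String entity(String type, String id, String field, int version) {
        return type + ":" + id + ":" + field + ":v" + version;
    }

    static String tenantScoped(String tenant, String key) {
        return "tenant:" + tenant + ":" + key;
    }
}
```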
Test cache failure modes in staging. Most cache-related production outages stem from scenarios that were never tested: what happens when Redis hits maxmemory? When a cache node is restarted mid-traffic? When 70% of keys expire simultaneously after a long weekend? Run these scenarios deliberately in staging before they surprise you in production.
Monitor hit ratio as a first-class SLO. A hit ratio that drops from 92% to 75% overnight is a signal — perhaps a new feature introduced uncached access patterns, TTLs were shortened too aggressively, or the data size grew past the cache budget. Set an alert at 80% and treat a sustained drop as an incident.
Use short TTLs for mutable, user-specific data. A 24-hour TTL on a user profile means a user who changes their display name sees their old name for up to 24 hours after the update. For mutable personalized data, keep TTLs under 60 seconds or implement event-driven invalidation on write. The performance gain from a long TTL is rarely worth the user-facing inconsistency.
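A minimal write path with invalidation on write; the maps stand in for a real database and cache client, and deleting (rather than overwriting) the cache entry lets the next read repopulate from the source of truth:

```java
import java.util.concurrent.ConcurrentMap;

// Event-driven invalidation: commit the durable write first, then DELETE
// the cache entry. Deleting instead of overwriting avoids racing a
// concurrent reader that is about to write back a stale value.
class WriteInvalidate {

    static void updateDisplayName(ConcurrentMap<String, String> db,
                                  ConcurrentMap<String, String> cache,
                                  String userId, String newName) {
        db.put(userId, newName);                         // 1. database first
        cache.remove("user:" + userId + ":profile:v1");  // 2. then invalidate
    }
}
```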
Encrypt sensitive values before writing to cache. A Redis `KEYS *` or `DEBUG OBJECT` command on an unsecured instance exposes every cached value in plaintext. Never store raw PII (emails, phone numbers, SSNs, payment tokens) in cache without field-level encryption. Treat cache as a public bus, not a vault.
📌 TLDR: The Caching Reference Card
- Cache-Aside is the safe default: check cache → miss → load DB → populate cache → return.
- Write-Through ensures no stale reads after writes; every write pays cache + DB latency.
- Write-Back minimises write latency at the cost of a data-loss window; use only for loss-tolerant workloads (counters, scores).
- Write-Around prevents cache pollution for write-once, read-rarely data (audit logs, archives).
- Refresh-Ahead eliminates cold-miss latency for hot keys; Caffeine's `refreshAfterWrite` implements it natively.
- LRU (`allkeys-lru`) is the safe default eviction policy for Redis; W-TinyLFU (Caffeine) is superior for mixed workloads in JVM services.
- ARC self-tunes between recency and frequency; ideal for opaque storage-layer caches where access patterns are unpredictable.
- Cache invalidation is genuinely hard: TTL + event-driven CDC is the standard hybrid; 2PC across cache + DB is not practical.
- Thundering herd destroys databases at scale; mitigate with XFetch probabilistic expiry, mutex locks, stale-while-revalidate, or TTL jitter.
- Multi-level topology (L1 local → L2 distributed → L3 CDN) stacks latency savings across orders of magnitude.
- Hit ratio < 80% = cache is undersized, TTLs are too short, working set is too large, or a new code path bypasses the cache.
- `lazyfree-lazy-eviction yes` in Redis is mandatory at scale; prevents eviction-induced latency spikes on the event loop.
- Never cache error responses. Never use infinite TTLs. Always namespace cache keys.
🔗 Related Posts
- System Design HLD: Distributed Cache Example
- LLD for LRU Cache: Designing a High-Performance Cache
- Partitioning Approaches in SQL and NoSQL
- How Kafka Works: The Log That Never Forgets
- System Design Advanced: Security, Rate Limiting, and Reliability

Written by
Abstract Algorithms
@abstractalgorithms