System Design: Complete Guide to Caching — Patterns, Eviction, and Distributed Strategies
Cache-Aside to Refresh-Ahead, LRU to ARC — every caching pattern, eviction policy, and distributed strategy you need for system design
Abstract Algorithms
TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through ARC), cache invalidation pitfalls, thundering herd mitigations, multi-level topologies, and how Twitter, Netflix, and Facebook run caches at planetary scale. Read once; reference forever.
📖 Why Your Database Is Begging for a Cache
Every relational database has the same story: it starts fast, traffic grows, queries get slower, you add indexes, you tune joins, and eventually you hit the wall — the database is doing too much I/O for too many concurrent readers. The standard answer is a cache.
A cache is a key-value store that lives in RAM. It exploits one simple truth about most web traffic: 80% of reads touch 20% of the data (the Pareto principle applied to access patterns). By keeping that hot 20% in fast memory, you absorb the vast majority of read load without touching the database at all.
The numbers are staggering. A Redis GET command returns in ~100 microseconds over a local network. A PostgreSQL query that hits disk can take 10–50 milliseconds — up to 500× slower. For an application serving 10,000 requests per second, the difference between a 95% cache hit rate and a 75% hit rate is the difference between the database absorbing 500 queries per second and 2,500 — a 5× swing in the database capacity you must provision.
But caching is not just "stick Redis in front of your database." The details — which read/write pattern you use, how you evict stale entries, how you handle thousands of concurrent misses on the same key — determine whether your cache helps or becomes its own source of failure. This guide covers every pattern, every policy, and every failure mode you need to know.
🔍 How Caches Fit Into the Read and Write Path
Before diving into patterns, ground yourself in what a cache is not: it is not a database. Caches are:
- Volatile — data survives only as long as the cache process is running (unless persistence is explicitly configured, as in Redis RDB/AOF modes).
- Bounded — caches have a fixed memory budget and cannot grow infinitely. When the budget fills, the eviction policy decides what to drop.
- Eventually consistent with the DB by default — unless you implement write-through, the cache may serve data that lags behind the source of truth.
Four metrics define cache health at a glance:
| Metric | Formula | Healthy Target |
| --- | --- | --- |
| Hit Ratio | hits / (hits + misses) | ≥ 90% for a well-tuned cache |
| Eviction Rate | evictions per second | Near 0 in steady state; high eviction signals the cache is undersized |
| Miss Latency | end-to-end time to serve a cache miss | < 20 ms total (DB fetch + cache populate + response) |
| Memory Utilization | used / max memory | Stay below 80% to avoid eviction pressure |
A hit ratio below 80% usually means one of four things: the cache is too small for the working set, TTLs are too short so keys expire before being reused, the access pattern is nearly uniform (no hot keys to exploit), or a new code path is bypassing the cache entirely.
⚙️ Every Cache Pattern Explained: From Cache-Aside to Refresh-Ahead
There are six canonical caching patterns. Each solves a different problem and carries different consistency trade-offs. A typical production system mixes them — Cache-Aside for general reads, Write-Through for high-value writes, and Refresh-Ahead for peak-traffic hot keys.
Pattern 1: Cache-Aside (Lazy Loading)
The most common pattern. The application owns the cache interaction: check cache first; on a miss, load from DB, populate the cache, return data. The cache is populated lazily — only data that is actually read ever enters it.
The diagram below shows both paths through Cache-Aside. The left branch is a cache miss (cold path); the right branch is a cache hit (warm path).
```mermaid
flowchart TD
    Request[Incoming Request] --> CheckCache{Check Cache}
    CheckCache -->|Hit| ReturnCached[Return cached data instantly]
    CheckCache -->|Miss| QueryDB[Query Database]
    QueryDB --> PopulateCache[Write result to cache with TTL]
    PopulateCache --> ReturnFresh[Return fresh data to caller]
```
On a miss the application makes two sequential round trips — one to the cache (which returns nothing) and one to the database — before it can respond and warm the cache. On a subsequent hit the database is bypassed entirely. The TTL is the critical safety valve: without it, stale data lives in cache indefinitely.
Pros: Simple to implement; only hot data occupies memory; resilient to cache restarts since the DB is always authoritative.
Cons: The first request on a cold cache always hits the DB; stale data can be served for up to TTL seconds after a DB write.
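The read path is short enough to sketch directly. In this minimal in-process version, a plain dict stands in for Redis and `fetch_user_from_db` for the real database query (both names are illustrative):

```python
import time

TTL_SECONDS = 300
cache = {}  # key -> (value, expires_at); stands in for a Redis client

def fetch_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                               # hit: warm path, DB bypassed
    value = fetch_user_from_db(user_id)               # miss: cold path, two round trips
    cache[key] = (value, time.time() + TTL_SECONDS)   # populate with the TTL safety valve
    return value
```

Note that the application owns every step: neither the cache nor the database knows the other exists.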
Pattern 2: Read-Through
The cache sits transparently in front of the database. The application only ever talks to the cache. On a miss, the cache itself fetches from the database, populates itself, and returns the data. The application code never issues a direct DB query.
Used by Hibernate's second-level cache, NCache read-through configurations, and effectively every CDN (the edge node fetches from origin on a miss).
Pros: Application code is cleaner — one data source for all reads.
Cons: The first miss still incurs full DB latency; tight coupling between the cache tier and the DB connector layer.
Pattern 3: Write-Through
Every write goes to the cache and the database synchronously — in the same operation. Both stores are updated before the write is acknowledged to the caller. The cache is always consistent with the database for recently written keys.
The diagram below shows write-through on the left and write-back on the right, illustrating the synchronous vs. asynchronous distinction.
```mermaid
flowchart LR
    subgraph WT[Write-Through]
        WT_App[Application] -->|Write| WT_Cache[Cache]
        WT_Cache -->|Sync write| WT_DB[Database]
        WT_DB -->|Acknowledge| WT_App
    end
    subgraph WB[Write-Back]
        WB_App[Application] -->|Write| WB_Cache[Cache]
        WB_Cache -->|Acknowledge immediately| WB_App
        WB_Cache -.->|Async flush later| WB_DB[Database]
    end
```
In write-through, the write is not complete until both the cache and DB have confirmed it — guaranteeing consistency. In write-back, the cache acknowledges immediately and defers the DB write, trading consistency for low write latency.
Pros: No stale reads after writes; cache is warm for data that is read immediately after being written.
Cons: Higher write latency (every write waits for both a cache write and a DB write); cache gets filled with data that may never be read.
Pattern 4: Write-Back (Write-Behind)
Writes go to the cache immediately. The cache acknowledges the write and asynchronously flushes to the database in a background job — often batched for efficiency. Typical flush intervals range from 100 ms to several seconds.
Pros: Extremely low write latency (write to RAM only, acknowledge immediately); batch DB writes reduce I/O pressure; excellent for write-heavy workloads like counters and leaderboards.
Cons: Data loss risk if the cache crashes before flushing; DB temporarily lags behind cache — dangerous for inventory, financial balances, or any data where the DB must be the truth.
Real-world users: game leaderboard score updates, social media like counts, analytics counters, and session activity tracking — all workloads where a small window of potential loss is acceptable.
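The mechanics fit in a small sketch. Here `flush_to_db` stands in for a batched database write, and flushing on batch size alone is a simplification (real implementations also flush on a timer):

```python
from collections import defaultdict

class WriteBackCounter:
    """Write-back sketch: increments land in RAM and are flushed to the DB in batches."""

    def __init__(self, flush_to_db, batch_size=100):
        self.flush_to_db = flush_to_db   # callable taking {key: delta}
        self.batch_size = batch_size
        self.pending = defaultdict(int)  # un-flushed deltas (lost if the process crashes)
        self.pending_ops = 0

    def incr(self, key, delta=1):
        self.pending[key] += delta       # acknowledged immediately: RAM only
        self.pending_ops += 1
        if self.pending_ops >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_to_db(dict(self.pending))  # one batched DB write
        self.pending.clear()
        self.pending_ops = 0
```

The `pending` dict is exactly the window of potential loss: anything in it when the cache dies never reaches the database.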
Pattern 5: Write-Around
Writes skip the cache entirely and go directly to the database. The cache is only populated when data is subsequently read (via Cache-Aside). The cache is never polluted with data that was just written but may never be read again.
Best for: Write-once, read-rarely data — audit logs, batch-imported records, compliance archives, telemetry. Writing this data to cache would evict frequently accessed hot keys without providing any read benefit.
Pattern 6: Refresh-Ahead
The cache proactively refreshes entries that are approaching TTL expiry — before the TTL actually elapses. When an entry is accessed and its remaining TTL has dropped below a configured threshold (e.g., 20% of original TTL remaining), the cache triggers a background fetch from the database and silently updates the entry.
Pros: Eliminates the cold-miss penalty for frequently accessed hot keys; high-traffic keys are perpetually warm.
Cons: Fetches data that may not be needed again (wasted DB work) if traffic drops; requires accurately estimating refresh thresholds.
Implemented by Caffeine's refreshAfterWrite, Varnish's Grace Mode, and HTTP's stale-while-revalidate Cache-Control directive.
Six-Pattern Comparison Table
| Pattern | Read Latency | Write Latency | Consistency Risk | Best Use Case |
| --- | --- | --- | --- | --- |
| Cache-Aside | Low (hit) / High (miss) | Normal — DB only | Stale during TTL window | General-purpose reads |
| Read-Through | Low (hit) / High (miss) | Normal — DB only | Stale during TTL window | Simplified application code |
| Write-Through | Low | Higher — cache + DB sync | None for recent writes | Data read immediately after write |
| Write-Back | Low | Very Low — cache only | Window of potential loss | Write-heavy, loss-tolerant workloads |
| Write-Around | Low (hit) / High (miss) | Normal — DB only | Cache never stale (never populated on write) | Write-once, read-rarely data |
| Refresh-Ahead | Very Low — always warm | Normal — DB only | Minimal — background refresh before expiry | High-traffic hot keys with TTL pressure |
🧠 Deep Dive: Eviction Policies, Invalidation, and the Thundering Herd
The Internals: LRU, LFU, ARC, SLRU, and How They Actually Work
When the cache is full and a new key must be inserted, the eviction policy selects the victim to remove. Choosing the wrong policy wastes memory, degrades hit rates, and in pathological cases (scan pollution), can evict your entire working set.
LRU — Least Recently Used
The most widely deployed policy. Internally implemented as a doubly linked list + hash map: the hash map provides O(1) key lookup; the linked list maintains access recency from most-recent (head) to least-recent (tail). Every access unlinks the node and moves it to the head. Eviction removes the tail node.
```mermaid
flowchart LR
    Head[HEAD - Most Recent] --> N1[Key C] --> N2[Key A] --> N3[Key F] --> Tail[TAIL - Evict First]
```
This diagram represents the internal LRU linked list. Key C was accessed most recently (it's at the head); Key F was accessed least recently and will be the first victim when the cache needs room for a new entry. Every GET operation promotes the accessed node to the HEAD position in O(1) time.
The clock approximation used in Redis allkeys-lru avoids maintaining a full linked list by storing a 24-bit LRU clock value per key — cheaper at scale but approximate (good enough for most workloads).
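Python's `OrderedDict` packages the hash map and the recency-ordered linked list into a single structure, so the full policy fits in a few lines as a sketch:

```python
from collections import OrderedDict

class LRUCache:
    """Doubly linked list + hash map LRU; OrderedDict provides both in one structure."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # iteration order == recency order (first = oldest)

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # O(1) promotion to most-recently-used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least-recently-used "tail"
```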
LFU — Least Frequently Used
Tracks an access frequency counter per key. Evicts the key with the lowest counter. The naïve implementation uses a min-heap (O(log n) for eviction), but the O(1) LFU algorithm (Ketan Shah, 2010) uses doubly linked lists of frequency buckets for constant-time operations.
The critical weakness: a viral post accessed 50,000 times yesterday but not at all today retains its high frequency counter and will not be evicted. Frequency aging — periodic counter halving — mitigates this. Caffeine's W-TinyLFU uses a Count-Min Sketch for approximate frequency counting with O(1) updates and bounded memory.
FIFO — First In, First Out
The simplest policy: evict the oldest inserted key regardless of access recency or frequency. Implemented as a plain queue. Appropriate for time-series data where age directly correlates with staleness — sensor readings with a 5-minute relevance window, rate-limit counters, rolling log windows.
TTL — Time-To-Live
Not a replacement-on-full policy but a per-key expiry mechanism. Two expiry strategies coexist in Redis:
- Lazy expiry: The key is not deleted when it expires. It is deleted on the next access attempt if the TTL has elapsed — zero background cost, but expired keys occupy memory until touched.
- Active expiry: A background cycle (runs 10 times per second in Redis, sampling 20 random keys per cycle) finds and deletes expired keys proactively, reclaiming memory without waiting for access.
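Both strategies can be sketched over a single dict of `(value, expires_at)` pairs. The `active_expire_cycle` method here samples keys in the spirit of Redis's background cycle, with the sampling details simplified:

```python
import time

class TTLCache:
    """Per-key TTL with lazy expiry on access plus an active sampling cycle."""

    def __init__(self):
        self.data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self.data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key, default=None):
        entry = self.data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.data[key]   # lazy expiry: deleted on access, not in the background
            return default
        return value

    def active_expire_cycle(self, sample_size=20):
        """Active expiry sketch: scan a bounded sample and reclaim expired keys."""
        now = time.monotonic()
        for key in list(self.data)[:sample_size]:
            if self.data[key][1] <= now:
                del self.data[key]
```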
ARC — Adaptive Replacement Cache
Developed by IBM Research (Nimrod Megiddo and Dharmendra Modha, 2003). ARC maintains four lists:
- T1 — recently accessed keys (seen once only)
- T2 — frequently accessed keys (seen more than once)
- B1 — ghost entries: keys recently evicted from T1 (metadata only, no data)
- B2 — ghost entries: keys recently evicted from T2 (metadata only, no data)
The split between T1 and T2 is dynamically tuned based on observed hit patterns in the ghost lists: if evictions from T1 are being requested again soon (hit in B1), ARC expands T1 space. ARC is scan-resistant (a sequential table scan does not evict the entire working set), adapts to workload shifts automatically, and needs no tuning parameter. Used in ZFS, IBM DS8000 storage systems, and several database buffer pools.
SLRU — Segmented LRU
Divides the cache into two segments: a protected segment (for keys accessed at least twice — proven hot) and a probationary segment (for keys accessed only once — unproven). New keys enter probationary. On a second access they are promoted to protected. Eviction always takes from probationary first, shielding keys that have demonstrated their value from accidental eviction.
Caffeine's W-TinyLFU extends SLRU with an admission filter: a Count-Min Sketch estimates the historical frequency of both the new candidate key and the eviction candidate. The new key is only admitted to the protected segment if its estimated frequency exceeds the eviction candidate's — preventing a low-frequency key from displacing a proven hot key.
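The Count-Min Sketch at the heart of that admission filter is compact enough to sketch. This toy version (the blake2b-based row hashing is an illustrative choice, not Caffeine's actual implementation) can over-count on collisions but never under-counts:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: collisions inflate counts, never deflate them."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        for row in range(self.depth):
            # One independent-ish hash per row, derived by salting blake2b.
            digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "big")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        for row, idx in self._indexes(key):
            self.rows[row][idx] += 1

    def estimate(self, key):
        # The minimum across rows is the least-inflated estimate.
        return min(self.rows[row][idx] for row, idx in self._indexes(key))
```

A W-TinyLFU-style admission check then reduces to comparing `sketch.estimate(candidate)` against `sketch.estimate(victim)` and admitting only when the candidate wins.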
Random Replacement
Evicts a uniformly random key. Zero metadata overhead — no linked lists, no counters, no timestamps. Surprisingly competitive with LRU on workloads with flat access distributions (all keys equally likely). Redis offers it as the allkeys-random policy for exactly these memory-constrained, flat-access cases.
Performance Analysis: Choosing the Right Eviction Policy
| Policy | Memory Overhead | Hit/Miss O() | Scan Resistant | Frequency Aware | Real-World Users |
| --- | --- | --- | --- | --- | --- |
| LRU | O(n) linked list | O(1) | ❌ | ❌ | Redis, Varnish, Nginx |
| LFU | O(n) counters | O(1) with O(1) LFU | ❌ | ✅ | Redis 4.0+, Guava |
| FIFO | O(1) per key | O(1) | N/A | ❌ | Simple rate-limit windows |
| TTL | Per-key timestamp | O(1) | N/A | N/A | Redis, Memcached |
| ARC | ~2× LRU (ghost lists) | O(1) | ✅ | Partially | ZFS, IBM DS8000, some DB pools |
| W-TinyLFU | O(n) sketch | O(1) | ✅ | ✅ | Caffeine |
| Random | None | O(1) | N/A | ❌ | Redis allkeys-random |
Rule of thumb: Start with LRU (allkeys-lru in Redis). Upgrade to W-TinyLFU (Caffeine) for JVM in-process caches where mixed recency/frequency workloads are common. Use ARC at the storage or database buffer-pool layer where access patterns are opaque and sequential scans occur regularly.
Cache Invalidation: The Hardest Problem in Computer Science
Phil Karlton's famous quip — "There are only two hard things in computer science: cache invalidation and naming things" — is not hyperbole. Invalidation is hard because you must coordinate two systems (cache + database) that have no shared transaction boundary.
The classic failure mode: the database row is updated, then the application tries to delete the cache key, but the delete fails (network error, process crash, key not found). Users continue reading stale data until TTL expires. And if you reverse the order — delete cache first, then update DB — a concurrent read between the delete and the DB update will re-populate the cache with the old value.
TTL-Based Invalidation
The simplest approach. Every key has an expiry time. Stale data is served for at most TTL seconds. Safe, predictable, and self-healing. Works well when users can tolerate eventual consistency — social feed counters, recommendation lists, product catalog pages.
Event-Driven Invalidation via CDC
A Change Data Capture (CDC) pipeline watches the database's transaction log (e.g., Debezium on MySQL binlog, or Postgres logical replication). When a row is updated, a Kafka event triggers the cache layer to delete the relevant key. Achieves near-real-time consistency without write-through overhead. The risk: if the Kafka consumer lags behind, the cache remains stale until the event is processed.
Cache Busting (URL Fingerprinting)
Used exclusively for CDN and static asset caches. Asset URLs embed a content hash: styles.a1b2c3d4.css. When the file changes, the hash changes, the URL changes, and CDN nodes serve fresh content immediately — no explicit invalidation required. CDN nodes may keep the old asset cached under the old URL indefinitely, but no browser will ever request that URL again. Zero-cost invalidation at CDN scale.
Key Versioning
Cache keys include a version number: user:42:v3. When the user's record is updated, the application increments the version to v4. The old key is never explicitly deleted — it simply expires via TTL and stops being requested. The new key starts with a miss and populates on first access. Simple and crash-safe, but version state must be stored somewhere (often a counter in Redis or the DB row itself).
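A sketch of the scheme, with plain dicts standing in for Redis (both the cache and the version counter) and `load_from_db` / `update_db` as illustrative callables:

```python
cache = {}     # stands in for Redis
versions = {}  # per-entity version counter (could live in Redis or on the DB row)

def cache_key(entity, entity_id):
    v = versions.get((entity, entity_id), 1)
    return f"{entity}:{entity_id}:v{v}"

def read_user(user_id, load_from_db):
    key = cache_key("user", user_id)
    if key not in cache:
        cache[key] = load_from_db(user_id)   # new version's key starts with a miss
    return cache[key]

def write_user(user_id, update_db):
    update_db(user_id)
    # Bump the version instead of deleting the old key; the old key ages out via TTL.
    versions[("user", user_id)] = versions.get(("user", user_id), 1) + 1
```

Crash-safety falls out naturally: if the version bump is lost, readers simply keep seeing the old version until the write is retried, and no half-invalidated state exists.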
Why strong consistency is nearly impossible without 2PC: To guarantee a cache is never stale, every DB write must atomically also invalidate the cache key. Atomicity across two separate systems — a relational database and an in-memory cache — requires a two-phase commit (2PC), which is expensive, complex, and a common source of deadlocks. Most production systems accept a brief stale window rather than pay this cost.
Thundering Herd and Cache Stampede: When TTL Expiry Becomes a Self-Inflicted DDoS
A cache stampede occurs when a high-traffic key expires and thousands of concurrent requests simultaneously find a cache miss, all rush to the database to regenerate the value, and overload it. The database receives a sudden burst of identical, duplicate queries — effectively a self-inflicted DDoS.
The problem is acute for: high-traffic homepages where a single cached rendering absorbs enormous traffic; popular content (celebrity profiles, trending articles); and coordinated deployments where all cache nodes restart simultaneously, invalidating the entire hot working set.
Mitigation 1: Probabilistic Early Expiration (XFetch)
Introduced by Vattani et al. (2015). Each request computes now − delta × beta × log(rand()), where delta is the measured cost of recomputing the value, beta ≥ 1 controls eagerness, and log(rand()) is always ≤ 0; a background refresh is triggered if the result has reached or passed the expiry time. The probability of triggering a refresh increases smoothly as TTL approaches zero, spreading the cold-miss load across multiple requests rather than concentrating it at the exact moment of expiry. No locks, no coordination — purely probabilistic.
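The whole technique reduces to one inequality. In this sketch, `delta` is the measured recomputation cost and `rnd` is injectable so the decision is deterministic under test:

```python
import math
import random

def should_refresh(now, expiry, delta, beta=1.0, rnd=None):
    """XFetch check: refresh early with probability rising as expiry approaches.

    delta: observed cost of recomputing the value (same time unit as now/expiry).
    beta:  > 1 refreshes more eagerly, < 1 less eagerly.
    rnd:   uniform (0, 1] sample; injectable for deterministic testing.
    """
    if rnd is None:
        rnd = random.random()
    # -log(rnd) is a positive, exponentially distributed head start.
    return now - delta * beta * math.log(rnd) >= expiry
```

On a hit, the caller serves the cached value regardless, and kicks off a background recompute only when `should_refresh` returns True.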
Mitigation 2: Mutex Lock (Cache Lock)
The first thread to detect a miss acquires a distributed lock (implemented via Redis SET lock:key token NX PX timeout_ms). Only the lock holder queries the database; all other threads either wait or receive a stale value. On completion, the lock is released and the cache is populated. Subsequent waiters hit the now-warm cache. Trade-off: waiters see a latency spike during the lock window; if the lock holder crashes, the lock must time out before another thread can proceed.
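The same idea applies in-process: collapse concurrent misses on one key so only one loader runs. A minimal sketch with threading primitives (a distributed deployment would use the Redis lock instead of an in-process one):

```python
import threading

class SingleFlight:
    """Collapse concurrent loads of the same key: one loader runs, the rest wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event guarding an in-progress load
        self._results = {}

    def load(self, key, loader):
        with self._lock:
            event = self._inflight.get(key)
            owner = event is None
            if owner:
                event = threading.Event()
                self._inflight[key] = event
        if owner:
            try:
                self._results[key] = loader()   # only the "lock holder" hits the DB
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        else:
            event.wait()                        # everyone else waits for the warm value
        return self._results[key]
```

If the loader raises, waiters here would fail on the missing result; a production version needs to propagate the error explicitly.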
Mitigation 3: Stale-While-Revalidate (Background Refresh)
Serve the stale cached value immediately to every waiting request. In parallel, trigger an asynchronous background job to refresh the key. Users see slightly stale data, but the response is instant and the database receives exactly one refresh query. Caffeine's refreshAfterWrite, Varnish's Grace Mode, and the HTTP stale-while-revalidate Cache-Control directive all implement this pattern.
Mitigation 4: TTL Jitter
When populating a large batch of keys (after a deployment, cold start, or large import), add a random offset to each key's TTL. Instead of all keys expiring at t + 300s, they expire uniformly in the range t + 270s to t + 330s. The synchronized expiry cascade is spread over a 60-second window instead of triggering as a single simultaneous surge.
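The fix is one line of arithmetic, sketched here with a ±10% default window:

```python
import random

def jittered_ttl(base_ttl_seconds, jitter_fraction=0.1):
    """Base TTL plus a uniform random offset in [-fraction, +fraction] of the base."""
    offset = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + random.uniform(-offset, offset)
```

With a 300-second base TTL this yields expiries spread over 270–330 seconds, turning one synchronized expiry surge into a gentle 60-second ramp.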
📊 Cache Topologies: Local, Distributed, CDN, and Multi-Level Hierarchies
Caches exist at multiple levels in a real system. Understanding the latency and consistency characteristics of each level is essential for designing the right topology for your workload. The four standard layers, ordered from fastest to slowest, are shown below.
```mermaid
flowchart LR
    Client[Client Request] --> CDN[CDN Edge Cache]
    CDN -->|Cache miss| LB[Load Balancer]
    LB --> App[App Server]
    App --> L1[L1 Local Cache]
    L1 -->|Miss| L2[L2 Distributed Cache]
    L2 -->|Miss| DB[Database]
```
Each hop in the diagram represents an order-of-magnitude latency jump: L1 local cache (nanoseconds), L2 distributed cache (1–2 ms), database (10–50 ms), CDN edge (20–80 ms to nearest PoP vs. 150–300 ms to origin). The goal is to answer as many requests as possible at the leftmost — fastest — layer.
L1 — In-Process Local Cache (~100 ns)
Lives inside the application process itself. Examples: Caffeine (Java), Guava Cache (Java), Node.js lru-cache, Python functools.lru_cache. Zero network overhead — a local cache lookup is a hash table read in the JVM heap, typically 50–200 nanoseconds.
The key limitation: it is not shared. In a 32-server fleet, each server maintains its own private L1 cache. A write that invalidates a key on server 1 does not affect servers 2–32. Local caches work best for immutable or slowly changing reference data: country lists, feature flags, currency exchange rates, config values.
L2 — Distributed Cache (~1 ms)
A shared, networked cache tier. Examples: Redis Cluster, Memcached, Apache Ignite. All application servers in the fleet read from and write to the same cache cluster. Network round-trip adds ~0.5–2 ms but ensures consistency across the fleet — a write on server 1 is immediately visible to servers 2–32.
Redis Cluster shards keys across nodes using CRC16 hash slots (16,384 total) and replicates each shard to standby nodes for high availability. Memcached uses client-side consistent hashing with no built-in replication.
L3 — CDN Cache (~20–80 ms edge, ~150–300 ms origin fallback)
A globally distributed network of cache nodes (Points of Presence, PoPs) positioned close to end users. Examples: Cloudflare, CloudFront, Fastly, Akamai. CDN caches serve static and semi-static content (HTML, CSS, JS, images, API responses) with edge latency of 20–80 ms — versus 150–300 ms to a single-region origin for geographically distant users.
CDN behavior is controlled by HTTP headers: Cache-Control: public, max-age=86400 instructs every caching layer (browser, CDN, proxy) to store the response for 24 hours. Cache-Control: private prevents CDN caching for user-specific content.
Multi-Level (Hierarchical) Caching
Production systems stack L1 and L2: check local cache first; on miss, check Redis; on miss, query DB and populate both levels. The key consistency challenge: when a key is invalidated in Redis (L2), L1 caches across all servers still hold the stale value.
Two standard solutions:
- Short L1 TTLs (1–5 seconds) for frequently changing data — the stale window is brief and self-healing.
- Pub/Sub invalidation — on every DB write, publish a message to a Redis Pub/Sub channel; all app servers subscribe and evict the key from their L1 cache immediately.
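The L1 + L2 read path with Pub/Sub invalidation can be sketched in-process, with a shared dict standing in for Redis and plain callbacks standing in for a Pub/Sub channel:

```python
class TwoLevelCache:
    """Per-process L1 dict over a shared L2 store, with callback-based invalidation."""

    def __init__(self, l2, subscribe):
        self.l1 = {}
        self.l2 = l2                      # shared store, stands in for Redis
        subscribe(self._on_invalidate)    # register for invalidation messages

    def _on_invalidate(self, key):
        self.l1.pop(key, None)            # evict the stale local copy immediately

    def get(self, key, load_from_db):
        if key in self.l1:
            return self.l1[key]           # ~100 ns path
        if key in self.l2:
            self.l1[key] = self.l2[key]   # ~1 ms path, then warm L1
            return self.l1[key]
        value = load_from_db(key)         # slow path: populate both levels
        self.l2[key] = value
        self.l1[key] = value
        return value
```

On every DB write, the writer updates L2 and publishes the key; every subscribed process drops its L1 copy and re-reads the fresh value from L2 on next access.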
| Cache Level | Typical Latency | Shared Across Fleet | Primary Use |
| --- | --- | --- | --- |
| L1 Local (Caffeine) | ~100 ns | No — per-process | Reference data, config, feature flags |
| L2 Distributed (Redis) | ~1 ms | Yes — cluster-wide | Session data, user profiles, hot API responses |
| L3 CDN (Cloudflare) | ~50 ms edge | Yes — global PoP | Static assets, public API responses, HTML pages |
| Database (PostgreSQL) | ~10–50 ms | Yes | Source of truth — all cache layers miss |
🌍 How Twitter, Netflix, Facebook, and Cloudflare Cache at Scale
Twitter — Redis Sorted Sets for Timeline Assembly
Twitter's home timeline is one of the most complex caching problems at scale. Assembling a feed from tens of thousands of followed accounts in real time would require expensive fan-in joins across hundreds of database shards on every page load. Instead, Twitter pre-computes each user's timeline and caches it in Redis as a sorted set (scores = tweet timestamps). Opening the app reads at most 800 timeline entries from Redis — no database joins, purely cache reads answering in sub-millisecond time. Heavy accounts with millions of followers use a hybrid push/pull model: the most popular accounts' tweets are fetched from Redis at read time rather than pre-written to millions of followers' sorted sets (too expensive to write).
Netflix — EVCache at 99% Cache Hit Rate
Netflix runs EVCache, a distributed Memcached derivative built for multi-region replication. Every piece of video metadata (titles, thumbnails, subtitles), personalization data, and subscriber preferences lives in EVCache. Netflix reports >99% cache hit rates for EVCache, meaning the databases backing these caches handle less than 1% of read traffic. EVCache replicates writes across AWS regions asynchronously, accepting a brief eventual consistency window in exchange for sub-millisecond reads globally. The system handles millions of requests per second across 190+ countries.
Facebook — Lease Mechanism for Thundering Herd at 100,000-Server Scale
Facebook's 2013 Memcached paper (Nishtala et al.) describes a lease mechanism that directly solves thundering herd in their look-aside cache. When a thread misses a key, it receives a lease token from Memcached. Only the lease-holding thread is permitted to query the database and repopulate the cache. Other threads that concurrently miss the same key receive either a "wait and retry" response (short sleep, then retry the cache) or a stale value if one is available. This guarantees at most one database query per cold key regardless of how many concurrent threads are waiting. Facebook's Memcached fleet spans 100,000+ servers and stores the social graph, photo metadata, and user profile data at a scale that no database tier alone could absorb.
Cloudflare — Tiered Cache to Protect Origin
Cloudflare's Tiered Cache adds a middle layer (upper-tier PoPs, also called shield PoPs) between the 300+ global edge PoPs and the customer's origin server. When an edge PoP in London misses a key, it queries an upper-tier PoP in Amsterdam before falling back to the customer's origin. This collapses the thundering herd from 300 simultaneous origin requests (one per PoP) down to a single request — reducing origin bandwidth by up to 70% on popular content. The topology directly implements hierarchical caching at global infrastructure scale.
⚖️ Consistency vs. Performance: The Core Caching Trade-off
Every caching decision sits on the consistency–performance axis. There is no configuration that delivers both perfect consistency and zero additional latency — every gain on one axis costs something on the other.
| Dimension | Maximise Consistency | Maximise Performance |
| --- | --- | --- |
| Pattern | Write-Through | Write-Back |
| Invalidation | Event-driven CDC | TTL-based |
| TTL duration | Short (seconds) | Long (hours) |
| Cache scope | Distributed only (L2) | Multi-level L1 + L2 |
| Eviction policy | Short TTL + explicit delete | LRU/LFU with long TTL |
| Stampede mitigation | Mutex lock (one writer, consistent) | Stale-while-revalidate (instant, slightly stale) |
The sharpest trade-off — write-through vs. write-back: Write-through guarantees cache = DB at all times but every write waits for both a cache write and a DB write before returning. Write-back cuts write latency to sub-millisecond but risks losing un-flushed writes. For financial transactions, inventory decrements, and payment state, write-through is non-negotiable. For game scores, reaction counts, and session activity, write-back is the right call.
Consistency requirements mapped to data type:
| Data Type | Acceptable Staleness | Recommended Pattern |
| --- | --- | --- |
| Inventory / payment state | 0 seconds — must be exact | Write-Through + DB transaction |
| User profile fields | 1–60 seconds | Cache-Aside + short TTL + event invalidation on write |
| Social feed counts (likes, views) | 1–5 minutes | Write-Back + periodic flush |
| Static HTML / CSS / JS | Hours to days | CDN + cache busting on deploy |
| Feature flags / config | 1–5 minutes | Local cache + Pub/Sub invalidation |
| Search index data | Minutes to hours | Write-Around + scheduled cache warm-up |
🧭 Decision Guide: Which Cache Pattern and Eviction Policy to Pick
Selecting a Cache Pattern by Scenario
| Scenario | Recommended Pattern | Why |
| --- | --- | --- |
| Read-heavy, writes are rare | Cache-Aside | Simple; only hot data occupies memory; DB is always authoritative |
| Data is read immediately after being written | Write-Through | Cache stays warm for fresh data; no stale reads after writes |
| Write-heavy workload, loss is acceptable | Write-Back | Minimises write latency; batch flushes reduce DB I/O pressure |
| Write-once, read-rarely (audit logs, archives) | Write-Around | Prevents cache pollution; preserves capacity for hot reads |
| High-traffic hot keys under TTL pressure | Refresh-Ahead | Eliminates cold-miss latency; keys are always warm |
| Application simplicity is the priority | Read-Through | Single data source for all reads; no dual-write logic in application |
Selecting an Eviction Policy by Workload
| Workload Characteristics | Recommended Policy | Redis maxmemory-policy |
| --- | --- | --- |
| General web traffic — mixed recency | LRU | allkeys-lru |
| Long-lived viral content, stable hot set | LFU | allkeys-lfu |
| Only evict keys that have a TTL set | LRU on volatile keys | volatile-lru |
| Evict keys closest to expiry first | TTL proximity | volatile-ttl |
| JVM in-process cache with mixed workload | W-TinyLFU | Caffeine default |
| Storage-layer buffer pool (ZFS, databases) | ARC | ZFS / DB default |
| Extremely memory-constrained, flat access | Random | allkeys-random |
Selecting a Cache Topology by Need
| What You Need | Recommended Topology |
| --- | --- |
| Lowest possible read latency (nanoseconds) | L1 local cache — Caffeine or Guava |
| Shared state across a multi-server fleet | L2 distributed — Redis Cluster |
| Global static asset or API response delivery | L3 CDN — Cloudflare or CloudFront |
| Low latency + fleet consistency | L1 + L2 with Pub/Sub invalidation |
| Thundering herd protection on hot keys | XFetch or mutex lock at L2; stale-while-revalidate at CDN |
🧪 Sizing and Tuning Your Cache Before It Breaks in Production
Cache sizing is not optional — a cache that runs out of memory at peak load starts evicting aggressively, hit rate collapses, and the database absorbs a sudden 10× read spike exactly when you can least afford it. Size before you deploy.
Step 1 — Measure the working set size.
The working set is the set of keys that receive at least one access per TTL window. Start by logging distinct cache keys accessed over a 5-minute window at peak traffic. Multiply the key count by average value size. Most web workloads follow a power law: 10% of keys absorb 90% of traffic. Caching the top 10% of keys typically achieves 80–90% hit ratio; the top 25% typically reaches 95%.
Step 2 — Apply the sizing formulas.
| Resource | Sizing Formula | Worked Example |
| --- | --- | --- |
| Redis memory | (working_set_size) × 1.2 headroom | 500K keys × 2 KB × 1.2 = 1.2 GB |
| Caffeine local cache size | peak_rps × avg_ttl_seconds × 0.1 | 5,000 rps × 60 s × 0.1 = 30,000 entries |
| CDN TTL | acceptable_staleness_seconds / 2 | 10 min acceptable → 5 min TTL |
| Alert threshold (eviction rate) | > 0 sustained evictions/sec | Alert immediately; sustained eviction = cache is undersized |
Step 3 — Set the eviction policy before you hit the memory limit.
Redis defaults to maxmemory-policy noeviction, which returns OOM errors to the application when memory is full. This is almost never what you want in production. Set allkeys-lru as your default before the first deployment. Revisit after observing your actual access pattern in production.
Step 4 — Monitor these four signals continuously:
- Hit ratio — alert if it drops below 80%.
- Eviction rate — alert on any sustained nonzero eviction in steady state.
- Memory utilization — alert above 80% of `maxmemory`.
- Keyspace size — a sudden spike indicates an application bug writing unexpected keys.
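Three of these signals (hit ratio, eviction rate, memory utilization) reduce to a single alert predicate; a minimal illustration, not a real monitoring API, with thresholds mirroring the text (keyspace-size anomaly detection needs a baseline, so it is omitted here):

```java
// Alert predicate over the monitoring signals above. Thresholds mirror the
// text: hit ratio below 80%, any sustained evictions in steady state, or
// memory above 80% of maxmemory.
class CacheAlerts {

    static double hitRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 1.0 : (double) hits / total;
    }

    static boolean shouldAlert(long hits, long misses,
                               long sustainedEvictionsPerSec,
                               long usedMemoryBytes, long maxMemoryBytes) {
        return hitRatio(hits, misses) < 0.80
                || sustainedEvictionsPerSec > 0
                || usedMemoryBytes > 0.80 * maxMemoryBytes;
    }
}
```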
🛠️ Redis and Caffeine: Production Configuration Reference
Redis Eviction and Memory Configuration
Redis ships with conservative defaults that are wrong for most production caching workloads. The five settings below are the minimum you must review before deploying Redis as a cache.
```
# redis.conf — core caching parameters

maxmemory 4gb
# Hard memory cap. Redis refuses writes (noeviction) or evicts (all other policies)
# when this limit is reached. Always set this explicitly — never leave it unbounded.

maxmemory-policy allkeys-lru
# Evict least-recently-used keys from ALL keyspaces when maxmemory is reached.
# Alternatives: allkeys-lfu (frequency-aware), volatile-lru (only TTL keys),
# volatile-ttl (evict keys closest to expiry), allkeys-random.

activerehashing yes
# Incrementally rehash the main hash table in the background.
# Prevents latency spikes when the hash table needs to resize at peak traffic.

lazyfree-lazy-eviction yes
# Perform evictions in a background thread rather than the main event loop.
# Critical for large values: evicting a 10 MB list synchronously blocks
# all other commands for milliseconds. Set to yes in all production deployments.

lazyfree-lazy-expire yes
# Delete expired keys in a background thread instead of blocking on expiry.
# Complements active expiry; reduces tail latency from key deletion.
```
`lazyfree-lazy-eviction yes` is the single most important change for latency-sensitive caches. Without it, evicting large values blocks Redis's single-threaded event loop and produces latency spikes that cascade to the application layer.
Caffeine In-Process Cache Configuration for JVM Services
Caffeine is the standard L1 in-process cache for JVM applications, replacing Guava Cache. The builder parameters below cover the four most important production settings.
```java
// Caffeine cache builder — L1 in-process cache configuration.
// build(loader) returns a LoadingCache, which is required for
// refreshAfterWrite to perform background reloads.
LoadingCache<String, UserProfile> cache = Caffeine.newBuilder()
    .maximumSize(50_000)            // Max entries; W-TinyLFU eviction when full
    .expireAfterWrite(5, MINUTES)   // TTL: hard expiry 5 min after write
    .refreshAfterWrite(3, MINUTES)  // Refresh-Ahead: background reload at 3 min,
                                    // before expiry; serves stale value during refresh
    .recordStats()                  // Enable CacheStats (hit ratio, eviction count);
                                    // wire to Micrometer for Prometheus/Grafana
    .build(key -> userRepository.findById(key)); // Auto-load on miss (read-through)
```
`refreshAfterWrite` is deliberately set lower than `expireAfterWrite`: the first read after the 3-minute mark triggers an asynchronous reload while the current value is still served, so frequently accessed entries are renewed before the 5-minute hard expiry ever fires. This implements Refresh-Ahead and eliminates the cold-miss latency spike for hot keys; note the refresh is access-triggered, not timer-driven, so a key nobody reads simply expires at 5 minutes. `recordStats()` exposes a `CacheStats` object with per-cache hit/miss/eviction metrics — always enable it in production and export to your observability stack.
For a full deep-dive on Caffeine tuning, W-TinyLFU internals, and multi-level cache topology with Spring, see the companion post: LLD for LRU Cache: Designing a High-Performance Cache.
📚 Lessons Learned from Running Caches in Production
Fix the slow query before you cache it. A 500 ms query cached for 5 minutes still hammers the database every 5 minutes — and with thundering herd potential every time the TTL fires. Optimize the query first; cache the fast result. A cache on top of a slow query is technical debt with a TTL.
Never cache error responses. If the database is temporarily unavailable and your application returns an error, caching that error means every user sees the error until TTL expires — even after the database recovers. Always inspect the result before writing to cache. A null or error return should never be stored.
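The guard is easy to get wrong, so here is a minimal cache-aside read that refuses to store a failure. The `ConcurrentMap` stands in for a real cache client; only a successful, non-null load is written back:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Cache-aside read that never stores failures. A thrown exception or a
// null result is surfaced as an empty Optional and leaves the cache
// untouched, so the next request retries the source of truth.
class SafeCacheAside {

    static <K, V> Optional<V> get(ConcurrentMap<K, V> cache, K key,
                                  Function<K, V> loader) {
        V cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        try {
            V loaded = loader.apply(key);   // may throw during a DB outage
            if (loaded != null) {
                cache.put(key, loaded);     // cache only on success
            }
            return Optional.ofNullable(loaded);
        } catch (RuntimeException e) {
            return Optional.empty();        // never cache the error
        }
    }
}
```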
A cache without TTL is a slow memory leak. Data that never expires accumulates until the server hits maxmemory and begins evicting unpredictably. Set a TTL on every key, even if it is 24 or 48 hours, to give the cache a natural floor for reclamation.
Namespace cache keys from day one. Never use a bare ID such as `42` as a cache key. Use qualified keys with entity type, version, and where needed, tenant: `user:42:profile:v3`, `tenant:acme:config:v1`. Bare ID keys cause silent data leakage (user A's response served to user B) and make keyspace debugging impossible at scale.
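Centralizing key construction in one helper makes the convention enforceable; a sketch, with illustrative names:

```java
// Qualified cache-key builders following the user:42:profile:v3 convention
// from the text. Centralizing construction prevents bare-ID collisions and
// makes a cache-busting version bump a one-line change.
class CacheKeys {

    static String entity(String type, String id, String field, int version) {
        return type + ":" + id + ":" + field + ":v" + version;
    }

    static String tenantScoped(String tenant, String key) {
        return "tenant:" + tenant + ":" + key;
    }
}
```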
Test cache failure modes in staging. Most cache-related production outages stem from scenarios that were never tested: what happens when Redis hits maxmemory? When a cache node is restarted mid-traffic? When 70% of keys expire simultaneously after a long weekend? Run these scenarios deliberately in staging before they surprise you in production.
Monitor hit ratio as a first-class SLO. A hit ratio that drops from 92% to 75% overnight is a signal — perhaps a new feature introduced uncached access patterns, TTLs were shortened too aggressively, or the data size grew past the cache budget. Set an alert at 80% and treat a sustained drop as an incident.
Use short TTLs for mutable, user-specific data. A 24-hour TTL on a user profile means a user who changes their display name sees their old name for up to 24 hours after the update. For mutable personalized data, keep TTLs under 60 seconds or implement event-driven invalidation on write. The performance gain from a long TTL is rarely worth the user-facing inconsistency.
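A minimal write path with invalidation on write; the maps stand in for a real database and cache client, and deleting (rather than overwriting) the cache entry lets the next read repopulate from the source of truth:

```java
import java.util.concurrent.ConcurrentMap;

// Event-driven invalidation: commit the durable write first, then DELETE
// the cache entry. Deleting instead of overwriting avoids racing a
// concurrent reader that is about to write back a stale value.
class WriteInvalidate {

    static void updateDisplayName(ConcurrentMap<String, String> db,
                                  ConcurrentMap<String, String> cache,
                                  String userId, String newName) {
        db.put(userId, newName);                         // 1. database first
        cache.remove("user:" + userId + ":profile:v1");  // 2. then invalidate
    }
}
```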
Encrypt sensitive values before writing to cache. A Redis `KEYS *` or `DEBUG OBJECT` command on an unsecured instance exposes every cached value in plaintext. Never store raw PII (emails, phone numbers, SSNs, payment tokens) in cache without field-level encryption. Treat cache as a public bus, not a vault.
📌 TLDR: The Caching Reference Card
- Cache-Aside is the safe default: check cache → miss → load DB → populate cache → return.
- Write-Through ensures no stale reads after writes; every write pays cache + DB latency.
- Write-Back minimises write latency at the cost of a data-loss window; use only for loss-tolerant workloads (counters, scores).
- Write-Around prevents cache pollution for write-once, read-rarely data (audit logs, archives).
- Refresh-Ahead eliminates cold-miss latency for hot keys; Caffeine's `refreshAfterWrite` implements it natively.
- LRU (`allkeys-lru`) is the safe default eviction policy for Redis; W-TinyLFU (Caffeine) is superior for mixed workloads in JVM services.
- ARC self-tunes between recency and frequency; ideal for opaque storage-layer caches where access patterns are unpredictable.
- Cache invalidation is genuinely hard: TTL + event-driven CDC is the standard hybrid; 2PC across cache + DB is not practical.
- Thundering herd destroys databases at scale; mitigate with XFetch probabilistic expiry, mutex locks, stale-while-revalidate, or TTL jitter.
- Multi-level topology (L1 local → L2 distributed → L3 CDN) stacks latency savings across orders of magnitude.
- Hit ratio < 80% = cache is undersized, TTLs are too short, working set is too large, or a new code path bypasses the cache.
- `lazyfree-lazy-eviction yes` in Redis is mandatory at scale; prevents eviction-induced latency spikes on the event loop.
- Never cache error responses. Never use infinite TTLs. Always namespace cache keys.
🔗 Related Posts
- System Design HLD: Distributed Cache Example
- LLD for LRU Cache: Designing a High-Performance Cache
- Partitioning Approaches in SQL and NoSQL
- How Kafka Works: The Log That Never Forgets
- System Design Advanced: Security, Rate Limiting, and Reliability

Written by
Abstract Algorithms
@abstractalgorithms