System Design HLD Example: Distributed Cache Platform
Interview HLD for a distributed cache with eviction, invalidation, and resilience trade-offs.
TLDR: Distributed caches trade strict consistency for sub-millisecond read latency, using consistent hashing to scale horizontally without causing database-shattering "cache stampedes" during cluster rebalancing.
A single viral post once triggered a "cache miss storm" against Instagram's primary database, driving user profile reads to 28,000 QPS and CPU utilization to 95% within seconds, threatening to take down the entire platform. The fix wasn't a larger database; it was a smarter cache topology. Without a distributed cache absorbing read amplification, read-heavy systems hit their database ceiling within months of meaningful traffic growth, regardless of hardware tier.
Imagine a high-traffic e-commerce site on Black Friday. Every product page view requires a database query for price and stock. If your database handles 5,000 requests per second but you have 50,000 users per second, your site crashes. A distributed cache stores these hot items in RAM, serving them in under 1 millisecond and reducing the database load by 95% or more.
Designing a distributed cache teaches you the core tension in every high-scale read path: how to keep data consistent enough to be correct while keeping it close enough to be fast. By the end of this guide, you will know how to design a cache that handles millions of operations per second with sub-millisecond latency.
Distributed Cache: Use Cases & Requirements
A distributed cache is an in-memory, key-value store shared across multiple servers.
Functional Requirements
- Standard Operations: Support GET(key), SET(key, value, TTL), and DELETE(key).
- TTL (Time-To-Live): Support per-key expiration to ensure data doesn't stay stale forever.
- Eviction Policies: Support strategies like LRU (Least Recently Used) to manage memory when the cache is full.
- Scaling: Support horizontal scaling (adding nodes) without invalidating the entire cache.
- High Throughput: Handle millions of reads and writes per second.
Non-Functional Requirements
- Sub-millisecond Latency: Cache operations must be significantly faster than database queries.
- High Availability: If a cache node fails, the system should degrade gracefully rather than crashing the database.
- Consistency: Accept eventual consistency between the cache and the source of truth, but provide mechanisms for explicit invalidation.
- Scalability: Linearly scale performance as more nodes are added to the cluster.
Basics: Baseline Architecture
In a distributed environment, the "Cache" is not one server, but a cluster of servers.
- Cache Client: A library integrated into the application that handles the logic of where to find a key.
- Cache Nodes: Independent servers (like Redis or Memcached instances) that store data in RAM.
- Consistency Mechanism: Logic to ensure that when a database record is updated, the corresponding cache entry is invalidated or updated.
The most common pattern is Cache-Aside:
- Read: Check cache. If hit, return. If miss, read from DB, write to cache, return.
- Write: Update DB, then invalidate (delete) the cache entry.
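A minimal Cache-Aside sketch, assuming the redis-py client and a hypothetical `db` object with `query` and `update` methods standing in for the data layer:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # illustrative per-key TTL


def get_product(product_id, db):
    """Cache-Aside read: check cache, fall back to the DB on a miss, backfill."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:          # hit: served from RAM
        return json.loads(cached)
    row = db.query(product_id)      # miss: read the source of truth
    r.set(key, json.dumps(row), ex=CACHE_TTL_SECONDS)  # backfill with a TTL
    return row


def update_product(product_id, fields, db):
    """Cache-Aside write: update the DB first, then invalidate the stale entry."""
    db.update(product_id, fields)
    r.delete(f"product:{product_id}")  # the next read repopulates the cache
```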
Mechanics: Key Logic
1. Consistent Hashing
The core challenge is: Which node stores key "user:123"?
- Naive approach: hash(key) % N. If you add a node (N becomes N+1), almost every key's location changes. This causes a near-total cache miss: a "Cache Stampede."
- Consistent Hashing: Keys and nodes are mapped onto a logical circle (the "Ring"). A key is stored on the first node it encounters moving clockwise. When a node is added, only a small fraction of keys (~1/N) move.
2. Eviction Policies
When RAM is full, the cache must decide what to delete.
- LRU (Least Recently Used): Discard the least recently accessed items. Great for most web workloads.
- LFU (Least Frequently Used): Discard items accessed the least number of times. Good for stable popularity items.
- FIFO (First In First Out): Simple but often inefficient.
3. Expiration (TTL)
- Passive Expiration: Delete the key only when someone tries to access it and sees it's expired.
- Active Expiration: A background thread periodically scans and deletes expired keys to free up memory.
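To make the two modes concrete, here is a toy sketch of a single node's expiry logic using an in-process dict as the store; it illustrates the idea only and is not how any real cache node is implemented internally:

```python
import random
import time

store = {}  # key -> (value, expiry_unix_ts); toy stand-in for a node's in-RAM table


def passive_get(key):
    """Passive expiration: an expired key is reclaimed only when someone reads it."""
    entry = store.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.time() >= expires_at:
        del store[key]  # lazily free the memory on access
        return None
    return value


def active_expiration_cycle(sample_size=20):
    """Active expiration: a background loop samples keys and purges expired ones."""
    now = time.time()
    for key in random.sample(list(store), min(sample_size, len(store))):
        if store[key][1] <= now:
            del store[key]
```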
Estimations & Design Goals
- QPS: 10M reads/sec, 1M writes/sec.
- Storage: 100 nodes * 128GB RAM = 12.8TB total capacity.
- Object Size: Average 2KB per object. Total objects = ~6 Billion.
- Hit Rate Goal: 95%+. A 1% drop in hit rate at 10M QPS adds 100,000 queries per second to the database.
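The arithmetic behind these targets, as a quick back-of-envelope check (decimal units assumed):

```python
nodes, ram_per_node_gb, object_kb = 100, 128, 2

total_capacity_gb = nodes * ram_per_node_gb                  # 12,800 GB = 12.8 TB
total_objects = total_capacity_gb * 1_000_000 / object_kb    # ~6.4 billion 2KB objects

reads_qps = 10_000_000
db_load_per_1pct_miss = reads_qps * 0.01                     # 100,000 extra DB QPS
print(total_capacity_gb, int(total_objects), int(db_load_per_1pct_miss))
```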
High-Level Design
```mermaid
graph LR
  App[Application Service] -->|1. Get Key| Client[Cache Client Library]
  Client -->|2. Hash Key| Ring((Consistent Hash Ring))
  Ring -->|3. Route| NodeA[Node A]
  Ring -->|4. Route| NodeB[Node B]
  Ring -->|5. Route| NodeC[Node C]
  App -->|6. Miss| DB[(Database)]
  DB -->|7. Backfill| Client
  Client -->|8. Set Key| NodeA
```
The diagram shows the Cache-Aside read path. The application checks the Cache Client Library first; the library hashes the key and consults the Consistent Hash Ring to determine which physical node owns that key. On a miss (step 6), the application queries the database directly, and the result is written back to the owning cache node (step 8). This backfill ensures that subsequent reads for the same key are served from cache without touching the database.
Cache Entry Schema
Every cached record follows this structure in Redis. The schema is enforced by the application layer; the cache itself stores opaque byte values.
| Field | Type | Example Value | Notes |
| --- | --- | --- | --- |
| key | String | product:9f3a1 | Namespaced by entity type to avoid key collisions |
| value | Bytes (JSON/Protobuf) | {"id":"9f3a1","price":2999} | Serialization format chosen per latency budget |
| ttl_seconds | Integer | 3600 | Per-key TTL; 0 means no expiry |
| version | Integer | 7 | Optional optimistic concurrency token for invalidation |
| last_accessed | Timestamp | 2026-03-13T10:00:00Z | Used by LRU eviction bookkeeping |
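For reference, the same schema as an application-layer wrapper; the cache only ever sees the serialized bytes, so this class is purely illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CacheEntry:
    key: str                  # namespaced, e.g. "product:9f3a1"
    value: bytes              # serialized JSON or Protobuf payload
    ttl_seconds: int = 3600   # 0 means no expiry
    version: int = 0          # optional optimistic-concurrency token
    last_accessed: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # LRU bookkeeping
    )


entry = CacheEntry(key="product:9f3a1", value=b'{"id":"9f3a1","price":2999}')
```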
Deep Dive: Consistent Hashing Ring Internals and LRU Eviction Performance Under Scale
Two mechanisms dominate the engineering complexity of a distributed cache: determining which node owns a key without causing re-hashing storms when topology changes, and deciding which keys to evict when RAM is full while keeping the hit rate above 95%.
Internals: The Consistent Hash Ring and Virtual Nodes
Naive modulo sharding (hash(key) % N) has a fatal flaw: when the node count N changes by even 1, almost every key's owning node changes. At 10M QPS this produces a near-total cache miss storm for tens of seconds, hammering the database. Consistent hashing places both keys and nodes on a conceptual ring using a hash function. A key is owned by the first node found clockwise from its hash position. When one node is added or removed, only the keys between the new node and its predecessor move (approximately 1/N of all keys), leaving the rest undisturbed.
The practical enhancement is virtual nodes: instead of placing each physical node once on the ring, each node is placed at K positions (typically 150-300). This achieves a statistically uniform distribution of keys across nodes even with a small cluster. Without virtual nodes, a 3-node cluster has uneven load distribution; with 200 virtual nodes per physical node, load variance drops below 5%. Redis Cluster implements a deterministic variant called hash slots: the ring is divided into 16,384 fixed slots, and each node owns a range of slots. This makes resharding explicit and predictable rather than probabilistic.
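A minimal hash ring with virtual nodes, as a sketch; the node names, MD5 hash, and replica count of 200 are illustrative choices rather than what any particular client library uses:

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=200):
        self.ring = {}         # ring position -> physical node
        self.positions = []    # sorted positions for clockwise lookup
        for node in nodes:
            for i in range(vnodes):
                pos = self._hash(f"{node}#{i}")
                self.ring[pos] = node
                bisect.insort(self.positions, pos)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_node(self, key):
        """Walk clockwise from the key's position to the first virtual node."""
        idx = bisect.bisect(self.positions, self._hash(key)) % len(self.positions)
        return self.ring[self.positions[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:123"))  # only ~1/N of keys change owners when a node joins
```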
Performance Analysis: LRU Eviction as a Doubly-Linked HashMap
When the cache is full and a new key must be stored, the eviction policy determines what gets deleted. LRU (Least Recently Used) is optimal for web workloads where access patterns exhibit temporal locality: recently accessed items are likely to be accessed again soon. A pure LRU implementation uses a doubly-linked list plus a hash map: the list maintains access order (most recent at head, least recent at tail), and the hash map provides O(1) lookup by key. On every cache hit, the accessed node is unlinked and moved to the list head, an O(1) pointer update. On eviction, the tail node is removed and its key deleted from the hash map, also O(1). Redis implements a variant called approximated LRU: rather than maintaining a full access-order list (which requires pointer manipulation on every read), Redis samples N random keys and evicts the one with the oldest access timestamp. This reduces memory overhead dramatically while achieving near-identical hit rates to true LRU in practice. At 10M reads per second, avoiding per-read pointer manipulation is the difference between 1ms and 10ms average latency.
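A sketch of true LRU built on Python's OrderedDict, which pairs a hash map with a doubly-linked list exactly as described above (Redis itself uses the sampled approximation instead):

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)        # O(1): promote to most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # O(1): evict the least recently used
```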
Real-World Cache Strategies: How Twitter, Facebook, and Netflix Use Distributed Caches
Twitter's Twemproxy is a lightweight cache proxy that sits between application servers and Redis/Memcached instances. Twemproxy handles consistent hashing, connection multiplexing, and automatic failover transparently to the application. The key insight Twitter solved was connection fan-out: 10,000 application server threads each opening direct connections to 100 cache nodes creates 1,000,000 concurrent connections. Twemproxy reduces this to 100 connections from the proxy pool.
Facebook's Memcached (TAO) architecture served billions of social graph reads per second at its peak. Facebook's critical contribution was the Lease mechanism: when a cache miss triggers a database read, the system issues a lease token to the requesting client. Other clients requesting the same missing key wait briefly rather than all simultaneously hitting the database, a controlled defense against cache stampede. Facebook described this pattern in its 2013 NSDI paper, and the same idea underlies the SETNX-based stampede locks commonly built on Redis today.
Netflix's EVCache demonstrates geographic replication of a cache cluster. Netflix replicates cache writes across AWS availability zones synchronously and across regions asynchronously. This means a regional cache failure causes slightly stale data (acceptable for recommendations and thumbnails) rather than a database miss storm. Netflix found it could sustain roughly a 99% cache hit rate even through a full AZ failure by routing reads to the nearest surviving replica cluster.
Trade-offs and Failure Modes in Distributed Cache Design
| Dimension | Trade-off | Implication |
| --- | --- | --- |
| Cache-Aside vs. Write-Through | Cache-Aside: simple, accepts stale data window; Write-Through: always consistent, double write latency | Use Cache-Aside for read-heavy workloads; Write-Through for financial or inventory data |
| Replication vs. Sharding | Replication: high read throughput, expensive storage; Sharding: handles data sets beyond single-node capacity | Most production systems combine both: shard first, then replicate each shard |
| TTL length | Short TTL: low staleness risk, high miss rate; Long TTL: high hit rate, risk of serving stale data | Use short TTL for prices, inventory; long TTL for user profile, product metadata |
| Eviction policy (LRU vs. LFU) | LRU: optimal for recency-biased workloads; LFU: optimal for stable popularity items (e.g., viral content) | LRU is the default; LFU helps when a few items dominate access patterns long-term |
Cache Stampede (Thundering Herd): When a popular key expires, dozens or thousands of simultaneous requests all miss the cache and hit the database concurrently. Mitigations include probabilistic early expiration (refresh a key before it expires with some probability), distributed locking (only one client refreshes while others wait), or a short stale-while-revalidate window.
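A sketch of probabilistic early expiration with redis-py; the refresh window, probability, and the `recompute` loader are illustrative assumptions:

```python
import json
import random

import redis

r = redis.Redis()
TTL = 300            # seconds
EARLY_WINDOW = 0.2   # consider refreshing during the last 20% of the TTL
REFRESH_PROB = 0.05  # per-read chance of refreshing early


def get_with_early_refresh(key, recompute):
    """recompute() is a placeholder that rebuilds the value from the database."""
    value = r.get(key)
    remaining = r.ttl(key)  # seconds left; negative if missing or no expiry
    near_expiry = 0 <= remaining <= TTL * EARLY_WINDOW
    if value is None or (near_expiry and random.random() < REFRESH_PROB):
        fresh = recompute()                  # only a few readers pay this cost
        r.set(key, json.dumps(fresh), ex=TTL)
        return fresh
    return json.loads(value)
```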
Cold Start Problem: When deploying a new cache cluster or after a full cache flush, the hit rate drops to 0% for minutes or hours. During this window, the database receives full traffic load. Mitigation: warm the cache with the top-N most frequently accessed keys from the previous hour's access logs before cutting traffic to the new cluster.
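A pre-warming pass can be as simple as this sketch, assuming redis-py, a list of hot keys mined from access logs, and a hypothetical `load_from_db` loader:

```python
import json

import redis

r = redis.Redis()


def warm_cache(hot_keys, load_from_db, ttl=3600):
    """Backfill the hottest keys before routing production traffic to a new cluster."""
    pipe = r.pipeline()
    for key in hot_keys:
        pipe.set(key, json.dumps(load_from_db(key)), ex=ttl)
    pipe.execute()  # batched writes keep the warm-up window short
```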
Decision Guide: Redis vs. Memcached and Choosing the Right Cache Pattern
| System Needs | Recommended Strategy |
| --- | --- |
| Simple key-value caching, maximum throughput | Memcached: simpler, multi-threaded, no persistence overhead |
| Rich data types (sorted sets, hashes, lists) | Redis: native support, enables features like leaderboards and session storage |
| Persistence required (survive restarts) | Redis with AOF (Append-Only File) or RDB snapshots |
| Sub-millisecond latency for millions of items | Memcached or Redis with no persistence, colocated in the same AZ |
| Read-heavy social graph or recommendation queries | Cache-Aside pattern with TTL of 1-24 hours depending on staleness tolerance |
| Write-heavy financial or inventory records | Write-Through pattern with a short TTL or explicit invalidation on every write |
| Cluster size < 10 nodes | Redis Sentinel for high availability |
| Cluster size 10+ nodes, horizontal scaling | Redis Cluster (16,384 hash slots) or sharded Memcached via Twemproxy |
The most common interview mistake is proposing Write-Through caching for all workloads. Write-Through doubles the write latency for every database write, which is unacceptable for high-write services like messaging or event logging. Cache-Aside remains the default for most systems.
Interview Delivery Example: Designing the Cache Layer for a 10 Million QPS Social Feed
Situation framing: You are designing the cache for a social media feed that serves 100 million DAU. Each feed view requires reading the latest 20 posts from 500 followed users, up to 10,000 database reads per page load without a cache. The database can handle 50,000 QPS but you need to support 10,000,000 QPS.
Step 1: Compute the cache hit requirement. To stay within the database's 50,000 QPS limit at 10M QPS, you need a hit rate of at least (10M - 50K) / 10M = 99.5%. This drives every downstream decision toward maximizing hit rate over minimizing staleness.
Step 2: Choose Cache-Aside with a pre-computed feed. Rather than caching individual post records (granular, low hit rate), cache the pre-computed feed list per user as a Redis Sorted Set scored by timestamp. One cache read returns 20 posts with a single ZREVRANGE call, reducing the cache lookup to one operation per page view.
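A sketch of that pre-computed feed with redis-py; the fan-out helper, feed length cap, and key names are illustrative:

```python
import time

import redis

r = redis.Redis()
FEED_TTL = 300  # seconds; inactive users simply age out


def push_post_to_feed(follower_id, post_id):
    """Fan-out on write: append the new post to one follower's pre-computed feed."""
    key = f"feed:{follower_id}"
    r.zadd(key, {post_id: time.time()})  # score by timestamp
    r.zremrangebyrank(key, 0, -801)      # cap the feed at the newest 800 posts
    r.expire(key, FEED_TTL)


def read_feed(user_id, count=20):
    """One ZREVRANGE call serves an entire page view."""
    return r.zrevrange(f"feed:{user_id}", 0, count - 1)
```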
Step 3: Choose a TTL strategy. Feed data tolerates staleness of 60-300 seconds for most users. Set TTL to 300 seconds for inactive users, and use push-based invalidation for celebrities (users with >10,000 followers) whose feed updates trigger a direct cache write rather than relying on TTL expiry.
Step 4: Handle stampede on celebrity posts. Use a distributed lock (Redis SETNX) on the cache key for any user with follower count above 10,000. Only one request refreshes the cache while other threads wait up to 500ms, then retry the cache read.
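A sketch of that lock-and-wait refresh using redis-py's SET with NX and an expiry; the key names, 5-second lock TTL, and `recompute` loader are assumptions:

```python
import json
import time

import redis

r = redis.Redis()


def get_hot_key(key, recompute, ttl=300, wait_ms=500):
    """Only one caller recomputes a missing hot key; the rest poll the cache briefly."""
    value = r.get(key)
    if value is not None:
        return json.loads(value)
    if r.set(f"lock:{key}", "1", nx=True, ex=5):  # SETNX-style lock with expiry
        try:
            fresh = recompute()
            r.set(key, json.dumps(fresh), ex=ttl)
            return fresh
        finally:
            r.delete(f"lock:{key}")
    deadline = time.time() + wait_ms / 1000       # losers wait up to 500 ms
    while time.time() < deadline:
        time.sleep(0.05)
        value = r.get(key)
        if value is not None:
            return json.loads(value)
    return recompute()                            # give up waiting; hit the DB
```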
Open-Source Caching Solutions Worth Knowing
- Redis: Open-source, in-memory data structure store. Supports strings, hashes, sorted sets, lists, and pub/sub. Redis Cluster handles horizontal sharding natively with 16,384 hash slots.
- Memcached: High-performance, multi-threaded key-value cache. Simpler than Redis but lacks persistence, replication, and rich data types. Best for pure caching workloads at maximum throughput.
- Dragonfly: A Redis-compatible in-memory store written in C++ using a shared-nothing architecture. The project's benchmarks report up to 25x higher throughput than Redis on multi-core hardware for simple GET/SET operations.
- Apache Ignite: Distributed in-memory computing platform supporting SQL queries over cached data. Used when cache queries require joins or aggregations beyond simple key lookups.
Lessons Learned from Cache Failures at Scale
The Stampede That Took Down Instagram's Database. Instagram's user profile cache used a fixed 1-hour TTL on all keys. When a viral post caused millions of users to simultaneously view the poster's profile for the first time (all with cold caches), the database received 28,000 QPS in under 10 seconds, 4x its maximum capacity. The fix was probabilistic early expiration: when a key is within 20% of its TTL, each read has a small probability of proactively refreshing the cache, spreading renewal load across time rather than concentrating it at the expiry boundary.
A Cache Flush During a Deploy Caused a 45-Minute Outage. An engineering team flushed the Redis cache as part of a schema migration without pre-warming it. The resulting cold-start dropped the hit rate from 97% to 0%, sending the full 8M QPS load directly to the database. The database exhausted connections in 90 seconds. Lesson: never flush a production cache without a pre-warming step. Treat the cache like stateful infrastructure that requires a migration plan, not a side-car that can be arbitrarily reset.
Inconsistent TTLs Across Services Created a Data Correctness Bug. Two different microservices cached the same product price key with different TTLs: one used 60 seconds, the other 3600 seconds. After a price update, one service served the new price while the other served the old price for up to an hour. The fix: centralize TTL policy configuration in a shared constants library, and use explicit invalidation on any write that modifies a shared entity rather than relying on TTL for correctness.
TLDR & Key Takeaways
- The core architectural pattern is Cache-Aside: check cache first, miss falls through to the database, and the application backfills the cache. Prefer this for read-heavy workloads; use Write-Through only when strong consistency is required on every write.
- Use Consistent Hashing with virtual nodes (or Redis Cluster hash slots) to distribute keys across nodes. Adding or removing a node should redistribute only 1/N of keys, never the entire key space.
- LRU eviction is optimal for temporal-locality workloads; Redis uses approximated LRU with random sampling to eliminate per-read pointer overhead at millions of QPS.
- The cache stampede (thundering herd) is the #1 failure mode. Use probabilistic early expiration, distributed locking (SETNX), or stale-while-revalidate to prevent it.
- Cold start after a cache flush or new deployment is as dangerous as a stampede. Always pre-warm from recent access logs before cutting traffic to a new cache cluster.
- Target a hit rate above 95%. A 1% drop at 10M QPS adds 100,000 database queries per second, enough to breach most database capacity limits.