System Design HLD Example: Real-Time Leaderboard
A practical interview-ready HLD for a real-time gaming leaderboard with fast rank queries.
Abstract AlgorithmsIntermediate
For developers with some experience. Builds on fundamentals.
Estimated read time: 15 min
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: Real-time leaderboards for 10M+ active users require an in-memory ranking engine. Redis Sorted Sets (ZSET) are the industry standard, providing $O(\log N)$ updates and rank lookups via an internal Skip List data structure. Relational databases fail at this scale because calculating rank is an $O(N)$ operation (counting rows with higher scores) that becomes a bottleneck during peak traffic spikes.
π The Last-Minute Snipe
Imagine itβs the final minute of a week-long gaming tournament. Ten million players are grinding for the top spot. Youβre currently ranked #10, and you just scored 5,000 points. You expect to see your rank jump to #2 immediately.
If the system uses a standard SQL database, your score update triggers a massive re-calculation. The database has to count exactly how many rows are now below you. With millions of players submitting scores every second, the database's write-ahead log (WAL) gets backed up, and the "Rank" query starts taking 5 seconds. By the time the leaderboard refreshes, the tournament is over, and youβve been "sniped" by someone whose score processed faster because their database partition wasn't locked.
The challenge of a leaderboard isn't storing scoresβit's ranking. You need a data structure that can handle 50,000 updates per second while providing any player's exact global rank in logarithmic time, ensuring the competition remains fair and the experience remains instantaneous.
π Real-Time Leaderboard: Use Cases & Requirements
Actors & Journeys
- Player: Plays matches, earns points, and checks their global and friends-list ranking.
- Game Server: The authoritative source that validates gameplay and submits the final score to the leaderboard service.
- Tournament Admin: Configures seasons, resets scores, and monitors for cheating behavior.
Functional Requirements
- Score Submission: Update a player's score after a match.
- Top-N Query: Retrieve the top 10, 100, or 1000 players for the global leaderboard.
- Personal Rank Query: Find a specific player's exact rank (e.g., "You are #4,502 in the world").
- Social Leaderboard: View rankings specifically among a subset of users (friends).
- Time-Bounded Views: Daily, Weekly, and All-time rankings.
Non-Functional Requirements (NFRs)
- Real-Time Accuracy: Score updates must reflect in rankings within $< 500$ ms.
- High Read Availability: 99.99% (Players check their rank far more often than they play matches).
- Scalability: Handle 100 million total registered players and 50,000 score updates/sec during peak events.
- Durability: A cache failure should not permanently lose player scores; a persistent store must act as the backup.
π Basics: How Ranking Differs from Sorting
At first glance, a leaderboard seems like a simple sorting problem. Just sort the table by score and pick the top rows. However, at scale, there is a massive difference between sorting and ranking.
- Sorting: Ordering a set of elements. In SQL, this is
ORDER BY score DESC. - Ranking: Determining the relative position of a single element within that set. In SQL, this requires counting all elements with a higher score.
In a 100M row table, sorting is slow, but ranking is catastrophic. Even with an index on the score, a database cannot easily tell you that you are "exactly the 4,502,123rd player" without traversing millions of index entries. We need a data structure that maintains the order as scores are added, rather than recalculating it on every read.
βοΈ Mechanics: The Logic of Score Updates
The core mechanic of a modern leaderboard is the Accumulation and Fan-out of score events.
- Incremental Updates: Instead of overwriting the score, we often send "score deltas" (e.g., +50 points). This allows for easier concurrent updates.
- Tie-Breaking: To ensure deterministic ranking, we use a composite score. If two players have the same points, the one who achieved it first is ranked higher. We implement this by appending a timestamp to the score:
Total_Score = (Points << 32) | (MAX_INT - Timestamp). - Pre-computed Windows: For weekly leaderboards, we don't recalculate the week from scratch. We maintain a specific Redis ZSET for each week, allowing for $O(\log N)$ updates to the active leaderboard.
π Estimations & Design Goals
The Math of Massive Ranking
- Total Players: 100 Million.
- Daily Active Users (DAU): 10 Million.
- Peak Write Throughput: 50,000 score updates/sec.
- Peak Read Throughput: 100,000 rank queries/sec.
- Memory Footprint: Each entry in Redis (Player ID + Score) $\approx 100$ bytes.
- $100M \times 100 \text{ bytes} = \mathbf{10 \text{ GB}}$. This easily fits in a single high-memory Redis instance, but we'll shard for high availability.
Design Goals
- $O(\log N)$ Rank Retrieval: Avoid $O(N)$ scans at all costs.
- Write-Back Durability: Use a message queue to ensure that even if the ranking engine is busy, no score update is ever lost.
π High-Level Design: The Ranking Pipeline
The architecture separates the fast-path ranking engine from the durable-path storage.
graph TD
Game[Game Server] -->|Score| Ingest[Ingestion Service]
Ingest -->|Kafka| MQ[Message Queue]
MQ --> Processor[Score Processor]
Processor -->|ZADD| Redis[(Redis ZSET Index)]
Processor -->|Write| Postgres[(Postgres: Score History)]
Player -->|Get Rank| API[Leaderboard API]
API -->|ZREVRANK| Redis
API -.->|Fallback| Postgres
The Ranking Pipeline above separates two concerns that must never share a bottleneck: the write path (score ingestion via Kafka β Score Processor β Redis ZSET) and the read path (Leaderboard API β Redis ZREVRANK). The Game Server never writes directly to Redis β it goes through the Ingestion Service and Kafka, ensuring durability before the score is considered committed. If Redis is unavailable, the API falls back to Postgres for correctness at the cost of higher latency.
π§ Deep Dive: Why Redis Sorted Sets Achieve What SQL Cannot
The performance gulf between Redis ZSET and SQL for leaderboard rank queries is not a matter of hardware speed β it is a fundamental algorithmic difference. Understanding the internals of the Redis Skip List reveals why this gap exists and why no amount of database tuning closes it.
Internals: The Skip List and Hash Map That Power ZREVRANK
A Redis Sorted Set is implemented internally as two data structures working in tandem:
A Skip List for ordered traversal by score. A skip list maintains multiple levels of linked lists, where each higher level is a coarser "express lane" that allows the traversal algorithm to jump over large sections of the sorted order. When you call ZREVRANK, Redis traverses the skip list from the highest score downward, counting the number of elements above the target. Because the traversal uses the express lanes, this count takes O(log N) time β not O(N) as a full SQL COUNT(*) would.
A Hash Map for O(1) member lookups by key. When you call ZADD with a new score for an existing player, Redis uses the hash map to locate the player's current skip list node in O(1), removes it from its current position, and re-inserts it at the new score position in O(log N). The combination means both updates and rank queries are logarithmic.
| Operation | Redis ZSET | SQL with Index |
| Update a player's score | O(log N) β skip list re-insertion | O(log N) β index update |
| Get player's exact rank | O(log N) β skip list traversal | O(N) β COUNT(*) of all rows with higher score |
| Get top-K players | O(log N + K) β start from top, take K | O(K log N) β index scan |
| Get players around rank R | O(log N + K) β ZRANGE with offset | O(N) β no efficient SQL equivalent |
| Memory per entry | ~100 bytes | ~200 bytes (row + index overhead) |
The SQL O(N) rank calculation is catastrophic at 100M players. Even with a B-tree index on the score column, counting every entry above a given score requires examining millions of index nodes. At 50,000 score updates per second, the index is constantly being rewritten, making these COUNT(*) queries fight for access to the same index pages β creating severe lock contention.
Performance Analysis: Scaling the Ranking Engine to 100 Million Players
A single Redis instance storing a 100M-entry ZSET requires approximately 10 GB of RAM (100M Γ ~100 bytes per entry). At 50,000 ZADD operations per second, each taking ~0.2 ms of Redis CPU time, a single Redis node operates at ~10,000 ZADD/sec per CPU core β well within the capacity of a modern high-memory Redis instance with multiple cores.
However, a single Redis instance is a single point of failure. The production architecture requires sharding and replication:
| Layer | Configuration | Capacity | Failure Behavior |
| Primary shard | 1 leader per shard | 10M players / shard | Automatic failover to replica in < 30 seconds |
| Read replicas | 2 replicas per shard | Absorbs all ZREVRANK read traffic | Slight stale reads (< 100 ms replication lag) |
| Sharding key | Hash of leaderboard key | 10 shards for 100M players | Re-sharding requires a migration pipeline |
| Write throughput | 5,000 updates/sec per shard | Linear scaling with shard count | β |
For time-bounded leaderboards (daily, weekly, all-time), each time window has its own ZSET key. A daily ZSET is created at midnight UTC and discarded at end of day with DEL or via EXPIRE. This avoids the complexity of retroactively removing old scores from a single global set and allows each time window's memory to be reclaimed automatically.
| ZSET Key Pattern | Scope | Lifecycle |
leaderboard:global:alltime | All registered players | Permanent; grows with player count |
leaderboard:global:weekly:2024-W12 | Active week's accumulation | Expires at end of week + 24h grace period |
leaderboard:global:daily:2024-03-29 | Single day's score deltas | Expires after 48 hours |
leaderboard:friends:{user_id} | Per-user friend subset | Rebuilt on friend list change |
π Real-World Leaderboards: Stack Overflow, League of Legends, and Duolingo
Stack Overflow uses a reputation leaderboard where every upvote increments a user's score and every downvote decrements it. Their system uses a Redis ZSET for the real-time reputation index with Postgres as the source of truth. The interesting challenge: reputation can decrease (downvotes, vote reversals), so the ZADD must support score replacement with potentially lower values. Redis handles this correctly β ZADD always updates the score in-place, regardless of direction.
League of Legends maintains seasonal leaderboards for 150 million players across multiple geographic regions (NA, EUW, KR, etc.). Each regional server has its own ZSET, and a separate cross-region aggregation pipeline runs hourly to build a global "World Ranking." The per-region isolation ensures that a Redis failure in Europe does not bring down NA player rankings. Season resets β where all scores are zeroed and a new ZSET key is activated β are carefully coordinated operational procedures involving weeks of planning.
Duolingo uses a friend-list leaderboard where each user sees only their friends' rankings. Because friend lists are small (average 50 friends), Duolingo computes friend-list leaderboards on-demand: fetch each friend's score via ZSCORE (O(1) each), sort in application memory (O(K log K) where K β€ 50), and cache the result in Redis with a 30-second TTL. This approach avoids maintaining O(users Γ average_friends) separate ZSETs β a critical storage optimization at Duolingo's scale.
βοΈ In-Memory Speed vs. Durability: Trade-offs in Leaderboard Design
| Design Decision | Advantage | Risk |
| Redis ZSET as ranking engine | O(log N) rank update and retrieval | RAM-resident; large datasets require multi-GB high-memory nodes |
| Composite score for tie-breaking | Deterministic ordering; no secondary sort needed | Score field loses direct human-readable meaning; requires documentation |
| Kafka for score ingestion | Durable buffer; handles 50k/sec burst smoothly | Adds 10β50 ms latency vs. direct Redis write |
| Separate ZSET per time window | Cheap daily/weekly reset; automatic memory reclaim | Many ZSET keys to manage; expiry timing must be coordinated |
| Postgres as durable fallback | Survive Redis failures without permanent score loss | Full re-hydration from Postgres takes 3β20 minutes |
Critical Failure Mode β The Redis Restart Without Persistence: If the Redis instance restarts and AOF (Append Only File) persistence was not enabled, the entire ZSET is lost. Re-hydrating 100M entries from Postgres requires streaming all score records, computing each player's cumulative score, and executing 100M ZADD commands. At Redis's sustained ingestion rate of ~500K ZADDs/sec, this process takes approximately 200 seconds (~3 minutes) for 100M players β during which rank queries return empty or stale results. Mitigation: Enable AOF persistence with appendfsync everysec and maintain at least one read replica per shard with continuous replication. AOF recovery is measured in seconds, not minutes.
π§ Choosing the Right Ranking Technology for Your Player Count and Latency Requirements
Choose Redis ZSET when:
- Player count is up to 500M (manageable in a sharded Redis cluster with reasonable RAM budget).
- Rank queries require sub-10 ms latency to support real-time "your rank just changed" notifications.
- Score updates occur at more than 1,000 per second during peak play.
- You need exact real-time rank β not an approximation.
Choose approximate ranking (percentile buckets) when:
- Player count exceeds 500M and the exact rank number is not a core product feature.
- Showing "Top 1%" is as valuable to the user as showing "Rank #4,502,123."
- Storage cost of a 500M-entry ZSET (50 GB) is prohibitive given your infrastructure budget.
Use SQL for leaderboards only when:
- Update rate is under 100 per second and player count is under 1M.
- Rich multi-dimensional filtering is needed β by country, age bracket, device type β and a ZSET-per-segment approach would create an unmanageable number of keys.
- The leaderboard is a secondary feature with relaxed latency requirements (> 500 ms is acceptable).
π§ͺ Delivering This Design in a System Design Interview
Act 1 β The SQL Leaderboard Failure (2 minutes): Open with the last-minute tournament scenario. Draw the SQL table with 100M rows. Write out the rank query: SELECT COUNT(*) FROM players WHERE score > (SELECT score FROM players WHERE user_id = ?). Annotate it as O(N). Show that at 50,000 score updates per second, this query is backed up by index contention β rank results lag the actual scores by 5+ seconds. This immediately motivates the need for a purpose-built ranking data structure.
Act 2 β The Skip List Architecture (5 minutes): Draw the Ranking Pipeline diagram. Explain why the Game Server writes to the Ingestion Service rather than Redis directly β you need Kafka durability before confirming the score to the player. Walk through ZADD (O(log N) update) and ZREVRANK (O(log N) rank retrieval). Contrast with the O(N) SQL COUNT(*) to reinforce the algorithmic advantage.
Act 3 β Edge Cases (3 minutes): Address tie-breaking (composite score encoding), time-bounded leaderboards (separate ZSET per window), and the Redis failure recovery procedure.
| Interviewer Question | Strong Answer |
| How do you handle score decreases (e.g., cheating penalty)? | ZADD with the lower new score; ZSET updates in-place in O(log N) β works for decreases just as well as increases |
| How do you implement weekly leaderboards? | A separate ZSET per week key; Score Processor writes to both the alltime ZSET and the current week's ZSET simultaneously |
| How do you detect cheating (score anomalies)? | Stream score events to an anomaly detection service as a Kafka consumer; flag statistical outliers for manual review |
π οΈ Open Source Tools for Leaderboard Infrastructure
Redis with its native Sorted Set commands (ZADD, ZREVRANK, ZREVRANGE, ZRANGEBYSCORE) is the industry-standard choice for the ranking engine. Redis Cluster provides horizontal sharding and automatic failover with minimal configuration.
Apache Kafka serves as the durable score ingestion buffer between the Game Server and the Score Processor. Kafka's consumer group model allows multiple Score Processor instances to consume in parallel, partitioning the update load across workers. The Postgres write consumer runs as a separate consumer group from the Redis write consumer, ensuring each can scale and fail independently.
ClickHouse is increasingly used as an alternative to Postgres for the score history store, especially when analytics queries (e.g., "top players per game mode over the last 30 days") become expensive on a transactional Postgres schema. ClickHouse's columnar storage and OLAP query engine handle these aggregations 10β100Γ faster than Postgres at leaderboard scale.
π Lessons Learned From Operating Real-Time Leaderboards at Scale
Lesson 1 β The composite score encoding must be documented prominently. A developer joining the team months later will see a score like 892000001707868800 in Redis and have no idea it encodes both points and a timestamp. Document the encoding formula in the service README, a comment in the Score Processor, and the data dictionary.
Lesson 2 β Season resets are operationally complex events that require weeks of planning. At the end of a competitive season, zeroing 100M scores atomically requires coordinating between the Leaderboard Service, Kafka (to drain any in-flight score updates before reset), and Postgres (to archive the season's score history). A poorly executed reset can cause score data loss or contaminate the new season with old scores.
Lesson 3 β The "surrounding players" query is not free. Displaying the 3 players above and below a player's rank requires a ZRANGE with an offset β still O(log N + K), but involves two Redis calls. Cache this "neighborhood view" at the CDN or application cache layer with a 5-second TTL to absorb the read amplification from players who obsessively refresh their rank during tournaments.
Lesson 4 β Monitor ZSET cardinality and memory growth weekly. As players join the game, the ZSET grows linearly. Set up a weekly alert if ZSET memory usage grows faster than the expected new-player rate. An anomalously fast growth rate may indicate score entries from deleted accounts not being cleaned up, or a bug creating phantom player entries.
π TLDR & Key Takeaways for Real-Time Leaderboard Design
- Core problem: SQL COUNT(*) for rank calculation is O(N) β catastrophically slow at 100M players under 50k updates/sec.
- Solution: Redis Sorted Set (ZSET) backed by a Skip List provides O(log N) ZADD (score update) and ZREVRANK (rank lookup) β the fundamental algorithmic win.
- Architecture: Game Server β Ingestion Service β Kafka β Score Processor β Redis ZSET (ranking) + Postgres (durability).
- Tie-breaking: Composite score
(points << 32) | (MAX_INT - timestamp_ms)ensures deterministic ordering when two players share the same point total. - Redis memory: 100M players Γ 100 bytes β 10 GB β fits on a single high-memory Redis node; shard by leaderboard key for HA.
- Critical failure: Redis restart without AOF requires 3+ minutes to rebuild from Postgres β enable
appendfsync everysecand maintain read replicas.
π Related Posts
- System Design HLD: Distributed Cache β Redis architecture fundamentals β sharding, replication, eviction, and persistence β that underpin the leaderboard's ranking engine.
- System Design HLD: News Feed β Fan-out write patterns and Kafka-driven event pipelines that share the same ingestion architecture as the score update path.
- System Design HLD: Proximity Service β Geospatial Redis data structures that complement leaderboard design for location-aware tournament features.
Test Your Knowledge
Ready to test what you just learned?
AI will generate 4 questions based on this article's content.

Written by
Abstract Algorithms
@abstractalgorithms
More Posts
Stale Reads and Cascading Failures in Distributed Systems
TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redistributed load. Both are preventable β stale reads...
NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data
TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node β virtual nodes (vnodes) make rebalancing smooth. DynamoDB mana...
Clock Skew and Causality Violations: Why Distributed Clocks Lie
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions β but under load, across datacenters, or after a VM pause, the drift can reach seconds. When s...
Split Brain Explained: When Two Nodes Both Think They Are Leader
TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader β each accepting writes the other never sees. Prevent it with quorum consensus (at least βN/2β+1 nodes must agree before leadership is g...
