Series

System Design Interview Prep

A comprehensive series to help you design scalable, reliable, and fault-tolerant systems.

Articles: 72
Estimated reading: 22h 24m
Knowledge level: Intermediate to Advanced
Readers: 4,580

About this series

Learn with real world examples
Connect articles into a structured path
Best practices and trade-offs
Interview focused insights
Continuously updated content

Who is this for?

Software engineers and developers learning this topic.

Last Updated

Jun 18, 2026

Created by

Abstract Algorithms

All Articles

Lesson 1

Foundation

Real-Time Communication: WebSockets, SSE, and Long Polling Explained

TLDR: 🔌 WebSockets = bidirectional persistent channel; use for chat, gaming, collaborative editing. SSE = one-way server push over HTTP with built-in reconnect; use for AI streaming, live logs, not

23 min read

Lesson 2

Intermediate

System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers

TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple desi

13 min read

Lesson 3

Intermediate

Elasticsearch vs Time-Series DB: Key Differences Explained

TLDR: Elasticsearch is built for search: full-text log queries, fuzzy matching, and relevance ranking via an inverted index. InfluxDB and Prometheus are built for metrics: numeric time series with a

14 min read

Lesson 4

Intermediate

Write Skew Explained: The Anomaly That Requires Serializable Isolation

TLDR: Write skew is the hardest concurrency anomaly to reason about: two concurrent transactions each read a shared condition, decide they can safely proceed, and then write to different rows. No indi

23 min read

Lesson 5

Intermediate

API Gateway vs. Load Balancer vs. Reverse Proxy: What's the Difference?

TLDR: A Reverse Proxy hides your servers and handles caching/SSL. A Load Balancer spreads traffic across server instances. An API Gateway manages API concerns: auth, rate limiting, routing, and proto

14 min read

Lesson 6

Intermediate

CosmosDB Partition Internals: Logical vs Physical Partitions Explained

🔥 When Your Database Bill Triples Overnight: A retail engineering team ships a flash-sale feature. Traffic spikes 10×. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now ta

16 min read

Lesson 7

Intermediate

System Design HLD Example: API Gateway for Microservices

TLDR: An API Gateway centralizes "cross-cutting concerns" like authentication, rate limiting, and routing at the edge of your infrastructure. The architectural crux is the separation of the Control Pl

16 min read

Lesson 8

Intermediate

System Design HLD Example: Real-Time Leaderboard

TLDR: Real-time leaderboards for 10M+ active users require an in-memory ranking engine. Redis Sorted Sets (ZSET) are the industry standard, providing \(O(\log N)\) updates and rank lookups via an inte

16 min read

Lesson 9

Intermediate

System Design HLD Example: Ride-Sharing (Uber/Lyft)

TLDR: A ride-sharing platform is a high-velocity geospatial matching engine. Drivers stream GPS coordinates every 5 seconds into a Redis Geospatial Index. When a rider requests a trip, the Matching Se

16 min read

Lesson 10

Intermediate

Non-Repeatable Read Explained: When the Same Query Returns Different Results

TLDR: A non-repeatable read happens when the same SELECT returns different results within a single transaction because a concurrent transaction committed an update between the two reads. Read Committe

26 min read

Lesson 11

Intermediate

System Design HLD Example: Proximity Service (Yelp/Google Places)

TLDR: A proximity service (Yelp/Google Places) solves the 2D search problem by encoding locations into Geohash strings, which are indexed in a standard B-tree. To guarantee results near grid boundarie

17 min read

Lesson 12

Intermediate

System Design HLD Example: Video Streaming (YouTube/Netflix)

TLDR: A video streaming platform is a two-sided architectural beast: a batch-oriented transcoding pipeline that converts raw uploads into multi-resolution segments, and a real-time global delivery net

17 min read

Lesson 13

Intermediate

Isolation Levels in Databases: Read Committed, Repeatable Read, Snapshot, and Serializable Explained

TLDR: Isolation levels control which concurrency anomalies a transaction can see. Read Committed (PostgreSQL and Oracle's default) prevents dirty reads but still silently allows non-repeatable reads,

28 min read

Lesson 14

Intermediate

Dirty Write Explained: When Uncommitted Data Gets Overwritten

TLDR: A dirty write occurs when Transaction B overwrites data that Transaction A has written but not yet committed. The result is not a rollback or an error; it is silently inconsistent committed dat

28 min read

Lesson 15

Intermediate

Dirty Read Explained: How Uncommitted Data Corrupts Transactions

TLDR: A dirty read occurs when Transaction B reads data written by Transaction A before A has committed. If A rolls back, B has made decisions on data that, from the database's perspective, never ex

30 min read

Lesson 16

Intermediate

Phantom Read Explained: When New Rows Appear Mid-Transaction

TLDR: A phantom read occurs when a transaction runs the same range query twice and gets a different set of rows, because a concurrent transaction inserted or deleted matching rows and committed in be

32 min read

Lesson 17

Intermediate

Lost Update Explained: When Two Writes Become One

TLDR: A lost update occurs when two concurrent read-modify-write transactions both read the same committed value, both compute a new value from it, and both write back, with the second write silently

38 min read

Lesson 18

Intermediate

Read Skew Explained: Inconsistent Snapshots Across Multiple Objects

TLDR: Read skew occurs when a transaction reads two logically related objects at different points in time (one before and one after a concurrent transaction commits), producing a view that never exis

34 min read

Lesson 19

Intermediate

Split Brain Explained: When Two Nodes Both Think They Are Leader

TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader, each accepting writes the other never sees. Prevent it with quorum consensus (at lea

22 min read

Lesson 20

Intermediate

SQL Partitioning: Range, Hash, List, and Composite Strategies Explained

TLDR: SQL partitioning divides one logical table into smaller physical child tables, all accessed through the parent table name. The query optimizer skips irrelevant child tables entirely, a process

25 min read

Lesson 21

Advanced

Distributed Transactions: 2PC, Saga, and XA Explained

TLDR: Distributed transactions require you to choose a consistency model before choosing a protocol. 2PC and XA give atomic all-or-nothing commits but block all participants on coordinator failure. Sa

26 min read

Lesson 22

Advanced

NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data

TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node; virtual nod

24 min read

Lesson 23

Advanced

Key Terms in Distributed Systems: The Definitive Glossary

TLDR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. T

51 min read

Lesson 24

Advanced

Choosing the Right Database: CAP Theorem and Practical Use Cases

TLDR: Database selection is a trade-off between consistency, availability, and scalability. By using the CAP Theorem as a compass and matching your data access patterns to the right storage engine (Re

7 min read

Lesson 25

Advanced

Little's Law: The Secret Formula for System Performance

TLDR: Little's Law (\(L = \lambda W\)) connects three metrics every system designer measures: \(L\) = concurrent requests in flight, \(\lambda\) = throughput (RPS), \(W\) = average response time. If l

9 min read

Lesson 26

Advanced

Designing for High Availability: The Road to 99.99% Reliability

TLDR: High Availability (HA) is the art of eliminating Single Points of Failure (SPOFs). By using Active-Active redundancy, automated health checks, and global failover via GSLB, you can achieve "Four

9 min read

Lesson 27

Advanced

High-Level Design: Building a Real-Time Ad Click Aggregator at Scale

TLDR: Scaling an ad click aggregator requires processing massive event streams (billions of clicks per day) with exactly-once delivery guarantees. We achieve this using Kafka-based event ingestion, Ap

9 min read

Lesson 28

Advanced

System Design Requirements and Constraints: Ask Better Questions Before You Draw

TLDR: In system design interviews, weak answers fail early because requirements are fuzzy. Strong answers start by turning vague prompts into explicit functional scope, measurable non-functional targe

11 min read

Lesson 29

Advanced

High-Level Design: Scaling a Concert Ticket Booking System under Flash Load

TLDR: Designing a high-scale ticket booking system requires balancing high read traffic (seat map lookups) with extreme write concurrency (seat lock attempts) during popular concert drops. We achieve

11 min read

Lesson 30

Advanced

System Design API Design for Interviews: Contracts, Idempotency, and Pagination

TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error sema

12 min read

Lesson 31

Advanced

Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning

TLDR: Partitioning splits one logical table into smaller physical pieces. The database skips irrelevant pieces entirely, turning a 30-second full-table scan into a sub-second single-partition read. S

12 min read

Lesson 32

Advanced

The 8 Fallacies of Distributed Systems

TLDR: In 1994, L. Peter Deutsch at Sun Microsystems listed 8 assumptions that developers make about distributed systems, all of which are false. Believing them leads to hard-to-reproduce bugs,

13 min read

Lesson 33

Advanced

System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions

TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no

13 min read

Lesson 34

Advanced

System Design Sharding Strategy: Choosing Keys, Avoiding Hot Spots, and Resharding Safely

TLDR: Sharding means splitting one logical dataset across multiple physical databases so no single node carries all the data and traffic. The hard part is not adding more nodes. The hard part is choos

13 min read

Lesson 35

Advanced

System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust

TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together,

13 min read

Lesson 36

Advanced

System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances

TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infr

13 min read

Lesson 37

Advanced

System Design HLD Example: E-Commerce Platform (Amazon)

TLDR: A large-scale e-commerce platform separates catalog, cart, inventory, orders, and payments into independent microservices. The core architectural challenge is Inventory Correctness during flash

13 min read

Lesson 38

Advanced

Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More

TLDR: Distributed systems produce anomalies not because the code is buggy but because physics makes perfect consistency impossible across network boundaries. Split brain, stale reads, clock skew, ca

13 min read

Lesson 39

Advanced

System Design: Designing a Financial Ledger with Double-Entry Constraints

TLDR: Designing a financial ledger requires strict double-entry compliance, high consistency, and complete auditability. Unlike traditional databases where records are updated in-place, a financial le

13 min read

Lesson 40

Advanced

System Design Core Concepts: Scalability, CAP, and Consistency

TLDR: 🚀 Scalability, the CAP Theorem, and consistency models are the three concepts that determine whether a distributed system can grow, stay reliable, and deliver correct results. Get these three r

14 min read

Lesson 41

Advanced

The Role of Data in Precise Capacity Estimations for System Design

TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work

14 min read

Lesson 42

Advanced

System Design Data Modeling and Schema Evolution: Query-Driven Storage That Survives Change

TLDR: In system design interviews, data modeling is where architecture meets reality. A good model starts from query patterns, chooses clear entity boundaries, defines indexes deliberately, and includ

14 min read

Lesson 43

Advanced

System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems

TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue; it is definin

14 min read

Lesson 44

Advanced

System Design HLD Example: Collaborative Document Editing (Google Docs)

TLDR: Real-time collaborative editing relies on Operational Transformation (OT) or CRDTs to resolve concurrent edits without data loss. The core trade-off is Latency vs. Consistency: we use optimistic

14 min read

Lesson 45

Advanced

System Design: Designing an Autonomous AI Coding Agent (Devin at Scale)

TLDR: Designing an autonomous AI coding agent at scale is not a prompt engineering task; it is a complex systems problem. The system requires secure multitenancy via Firecracker microVMs, a low-latenc

14 min read

Lesson 46

Advanced

System Design Networking: DNS, CDNs, and Load Balancers

TLDR: When you hit a URL, DNS translates the name to an IP, CDNs serve static assets from the edge nearest to you, and Load Balancers spread traffic across many servers so no single machine becomes a

15 min read

Lesson 47

Advanced

System Design Databases: SQL vs NoSQL and Scaling

TLDR: SQL gives you ACID guarantees and powerful relational queries; NoSQL gives you horizontal scale and flexible schemas. The real decision is not "which is better"; it is "which trade-offs align w

15 min read

Lesson 48

Advanced

System Design Replication and Failover: Keep Services Alive When a Primary Dies

TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica,

15 min read

Lesson 49

Advanced

System Design HLD Example: Distributed Cache Platform

TLDR: Distributed caches trade strict consistency for sub-millisecond read latency, using consistent hashing to scale horizontally without causing database-shattering "cache stampedes" during cluster

15 min read

Lesson 50

Advanced

System Design HLD Example: Search Autocomplete (Google/Amazon)

TLDR: Search autocomplete must respond in sub-10ms to feel "instant." The core trade-off is Latency vs. Data Freshness: we use an offline pipeline (Spark) to pre-calculate prefix-to-suggestion mapping

15 min read

Lesson 51

Advanced

System Design HLD Example: Hotel Booking System (Airbnb)

TLDR: A robust hotel booking system must guarantee atomicity in inventory subtraction. The core trade-off is Consistency vs. Availability: we prioritize strong consistency for the booking path (Postgr

15 min read

Lesson 52

Advanced

The Ultimate Guide to Acing the System Design Interview

TLDR: System Design interviews are collaborative whiteboard sessions, not trick-question coding tests. Follow the framework: Requirements → Estimations → API → Data Model → High-Level Architecture →

16 min read

Lesson 53

Advanced

System Design Advanced: Security, Rate Limiting, and Reliability

TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate fai

16 min read

Lesson 54

Advanced

System Design Protocols: REST, RPC, and TCP/UDP

TLDR: 🎯 Use REST (HTTP + JSON) for public, browser-facing APIs where interoperability matters. Choose gRPC (HTTP/2 + Protobuf) for internal microservice communication when latency counts. Under the h

17 min read

Lesson 55

Advanced

System Design HLD Example: Distributed Job Scheduler

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row

17 min read

Lesson 56

Advanced

System Design HLD Example: File Storage and Sync (Dropbox and Google Drive)

TLDR: Cloud sync systems separate immutable blob storage (S3) from atomic metadata operations (PostgreSQL), using chunk-level deduplication to optimize storage costs and delta-sync events to minimize

18 min read

Lesson 57

Advanced

System Design HLD Example: Payment Processing Platform

TLDR: Payment systems optimize for correctness first, then throughput. This guide covers idempotency, double-entry ledgers, and reconciliation. Stripe processes over 250 million API requests per day,

18 min read

Lesson 58

Advanced

System Design HLD Example: Web Crawler

TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fet

18 min read

Lesson 59

Advanced

Write-Time vs Read-Time Fan-Out: How Social Feeds Scale

TLDR: Fan-out is the act of distributing one post to many followers' feeds. Write-time fan-out (push) pre-computes feeds at post time: fast reads but catastrophic write amplification for celebrities.

18 min read

Lesson 60

Advanced

System Design for Agentic AI Systems: From Distributed Systems Principles to Production

TLDR: Agentic AI systems are distributed systems with non-deterministic workers. If you design them with queue-first execution, explicit state machines, idempotency keys, bounded retries, and strong o

18 min read

Lesson 61

Advanced

System Design HLD Example: Chat and Messaging Platform

TLDR: A distributed chat system must balance low-latency delivery with strong per-conversation ordering. The architectural crux is the WebSocket Gateway for persistent stateful connections and Cassand

19 min read

Lesson 62

Advanced

System Design HLD Example: Notification Service (Email, SMS, Push)

TLDR: A notification platform routes events to per-channel Kafka queues, deduplicates with Redis, and tracks delivery via webhooks, ensuring that critical alerts like password resets never get blocke

19 min read

Lesson 63

Advanced

System Design HLD Example: Distributed Rate Limiter

TLDR: A distributed rate limiter protects APIs from abuse and "noisy neighbors" by enforcing request quotas across a cluster of servers. The core technical challenge is Atomic State Management, solved

19 min read

Lesson 64

Advanced

System Design HLD Example: URL Shortener (TinyURL and Bitly)

TLDR: A URL shortener is a read-heavy system (100:1 ratio) that maps long URLs to short, unique aliases. The core scaling challenge is generating unique IDs without database contention, solved using a

19 min read

Lesson 65

Advanced

Clock Skew and Causality Violations: Why Distributed Clocks Lie

TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions, but under load, across datacenters, or aft

19 min read

Lesson 66

Advanced

System Design HLD Example: News Feed (Home Timeline)

TLDR: A news feed system builds personalized timelines by combining content publishing, graph relationships, and ranking. The scalability crux is the fan-out amplified write path: a single celebrity p

20 min read

Lesson 67

Advanced

Microservices Architecture: Decomposition, Communication, and Trade-offs

TLDR: Microservices let teams deploy and scale services independently, but every service boundary you draw costs you a network hop, a consistency challenge, and an operational burden. The architectur

22 min read

Lesson 68

Advanced

Database Anomalies: How SQL and NoSQL Handle Dirty Reads, Phantom Reads, and Write Skew

TLDR: Database anomalies are the predictable side-effects of concurrent transactions: dirty reads, phantom reads, write skew, and lost updates. SQL databases use MVCC and isolation levels to prevent

31 min read

Lesson 69

Advanced

Stale Reads and Cascading Failures in Distributed Systems

TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redi

25 min read

Lesson 70

Advanced

ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond

TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong

26 min read

Lesson 71

Advanced

Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared

TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose - range, hash, consistent hashing, or di

29 min read

Lesson 72

Advanced

System Design: Complete Guide to Caching - Patterns, Eviction, and Distributed Strategies

TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through

33 min read

System Design Interview Prep: Learning Roadmap

Most engineers don't fail system design interviews because they lack content; they fail because they read topics in the wrong order. Sharding before access patterns, consensus papers before requirements, tools before trade-offs. This roadmap fixes that with a dependency-first learning path organized into three tracks based on your timeline.

TLDR: This roadmap organizes 52 system design posts into three learning paths: a 2-week interview sprint, a 4-week backend depth plan, and full mastery, covering foundations, APIs, data, async/microservices, and 19 HLD worked examples.

What You'll Learn

Understand System Design Interview Prep through real published examples

Follow a sequence of 72 articles from fundamentals to deeper topics

Connect related concepts: System Design, networking, WebSockets

Practice explaining trade-offs and implementation decisions

Prerequisites

Basic backend engineering knowledge
Familiarity with APIs, databases, and caching
Comfort reading architecture trade-offs

FAQs

How should I read this series?

Start from the first article if you are new, or use the article list to jump into the most relevant topic.
