Topic
system design
114 articles across 36 sub-topics
Sub-topic
52 articles

Designing for High Availability: The Road to 99.99% Reliability
TLDR: High Availability (HA) is the art of eliminating Single Points of Failure (SPOFs). By using Active-Active redundancy, automated health checks, and global failover via GSLB, you can achieve "Four Nines" (99.99%) reliability—limiting downtime to ...

Big Data 101: The 5 Vs, Ecosystem, and Why Scale Breaks Everything
TLDR: Traditional databases fail at big data scale for three concrete reasons — storage saturation, compute bottleneck, and write-lock contention. The 5 Vs (Volume, Velocity, Variety, Veracity, Value) frame what makes data "big." A layered ecosystem ...
Microservices Architecture: Decomposition, Communication, and Trade-offs
TLDR: Microservices let teams deploy and scale services independently — but every service boundary you draw costs you a network hop, a consistency challenge, and an operational burden. The architecture pays off only when your team and traffic scale h...

System Design HLD Example: Web Crawler
TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fetcher pool. By combining Bloom Filters for URL dedu...
System Design HLD Example: Video Streaming (YouTube/Netflix)
TLDR: A video streaming platform is a two-sided architectural beast: a batch-oriented transcoding pipeline that converts raw uploads into multi-resolution segments, and a real-time global delivery network that serves those segments via CDNs. The tech...
System Design HLD Example: Ride-Sharing (Uber/Lyft)
TLDR: A ride-sharing platform is a high-velocity geospatial matching engine. Drivers stream GPS coordinates every 5 seconds into a Redis Geospatial Index. When a rider requests a trip, the Matching Service executes a GEORADIUS query to find the 10 cl...
Sub-topic
13 articles
ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond
TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong strategy either causes B-tree fragmentation at 50M...
System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances
TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infrastructure from guesswork into deterministic routi...
System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust
TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together, they convert operational chaos into measurable, re...
System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems
TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue — it is defining delivery semantics, retry behavior, and idempote...
System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions
TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no longer just scaling compute. It is coordinating r...
System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers
TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple design, explain trade-offs, and improve it when constr...
Sub-topic
7 articles
SQL Partitioning: Range, Hash, List, and Composite Strategies Explained
TLDR: SQL partitioning divides one logical table into smaller physical child tables, all accessed through the parent table name. The query optimizer skips irrelevant child tables entirely — a process called partition pruning — turning a 30-second ful...

Partitioning Approaches in SQL and NoSQL: Horizontal, Vertical, Range, Hash, and List Partitioning
TLDR: Partitioning splits one logical table into smaller physical pieces called partitions. The database planner skips irrelevant partitions entirely — turning a 30-second full-table scan into a 200ms single-partition read. Range partitioning is best...

Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared
TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose — range, hash, consistent hashing, or directory — determines whether range queries stay ch...

Key Terms in Distributed Systems: The Definitive Glossary
TLDR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. This glossary organises every critical term into co...
System Design Sharding Strategy: Choosing Keys, Avoiding Hot Spots, and Resharding Safely
TLDR: Sharding means splitting one logical dataset across multiple physical databases so no single node carries all the data and traffic. The hard part is not adding more nodes. The hard part is choosing a shard key that keeps data balanced and queri...
System Design Replication and Failover: Keep Services Alive When a Primary Dies
TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupti...
Sub-topic
4 articles
Probabilistic Data Structures: A Practical Guide to Bloom Filters, HyperLogLog, and Count-Min Sketch
TLDR: Probabilistic data structures trade a small, bounded probability of being wrong for orders-of-magnitude better memory efficiency and O(1) speed. Bloom Filters answer "definitely not in this set" in constant time with zero false negatives. Hyper...
What are Hash Tables? Basics Explained
TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handled, and a full hash table must be resized. 📖 Th...
Understanding Inverted Index and Its Benefits in Software Development
TLDR: An Inverted Index maps every word to the list of documents containing it, the same structure as the back-of-the-book index. It is the core data structure behind every full-text search engine, including Elasticsearch, Lucene, and PostgreS...
How Bloom Filters Work: The Probabilistic Set
TLDR: A Bloom Filter is a bit array + multiple hash functions that answers "Is X in the set?" in $O(1)$ time using a fixed amount of space. It can return false positives (say "yes" when the answer is "no") but never false negatives (never says "no" when the answer...
Sub-topic
3 articles

Redis Sorted Sets Explained: Skip Lists, Scores, and Real-World Use Cases
TLDR: Redis Sorted Sets (ZSETs) store unique members each paired with a floating-point score, kept in sorted order at all times. Internally they use a skip list for O(log N) range queries and a hash table for O(1) score lookup — giving you the best o...

Write-Time vs Read-Time Fan-Out: How Social Feeds Scale
TLDR: Fan-out is the act of distributing one post to many followers' feeds. Write-time fan-out (push) pre-computes feeds at post time — fast reads but catastrophic write amplification for celebrities. Read-time fan-out (pull) computes feeds on demand...
System Design: Complete Guide to Caching — Patterns, Eviction, and Distributed Strategies
TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through ARC), cache invalidation pitfalls, thundering her...
Sub-topic
2 articles
Split Brain Explained: When Two Nodes Both Think They Are Leader
TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader — each accepting writes the other never sees. Prevent it with quorum consensus (at least ⌊N/2⌋+1 nodes must agree before leadership is g...

The Consistency Continuum: From Read-Your-Own-Writes to Leaderless Replication
TLDR: In distributed systems, consistency is a spectrum of trade-offs between latency, availability, and correctness. By leveraging session-based patterns like Read-Your-Own-Writes and formal Quorum logic ($W+R > N$), architects can provide the illus...
Sub-topic
2 articles

Choosing the Right Database: CAP Theorem and Practical Use Cases
TLDR: Database selection is a trade-off between consistency, availability, and scalability. By using the CAP Theorem as a compass and matching your data access patterns to the right storage engine (Relational, Document, KV, or Wide-Column), you can b...
BASE Theorem Explained: How it Stands Against ACID
TLDR: ACID (Atomicity, Consistency, Isolation, Durability) is the gold standard for banking. BASE (Basically Available, Soft state, Eventual consistency) is the standard for social media. BASE intentionally sacrifices instant accuracy in exchan...
Sub-topic
2 articles
System Design API Design for Interviews: Contracts, Idempotency, and Pagination
TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error semantics, and versioning decisions that survive scale...
Backend for Frontend (BFF): Tailoring APIs for UI
TLDR: A "one-size-fits-all" API causes bloated mobile payloads and underpowered desktop dashboards. The Backend for Frontend (BFF) pattern solves this by creating a dedicated API server for each client type — the mobile BFF reshapes data for small sc...
Sub-topic
2 articles
LLD for Movie Booking System: Designing BookMyShow
TLDR: A Movie Booking System (like BookMyShow) is an inventory management problem with an expiry: seats expire when the show starts. The core engineering challenge is preventing double-booking under concurrent user load with a 3-state seat mode...

Types of Locks Explained: Optimistic vs. Pessimistic Locking
TLDR: Pessimistic locking locks the record before editing — safe but slower under low contention. Optimistic locking checks for changes before saving using a version number — fast but can fail and require retry under high contention. Choosing correct...
Sub-topic
1 article
NoSQL Partitioning: How Cassandra, DynamoDB, and MongoDB Split Data
TLDR: Every NoSQL database hides a partitioning engine behind a deceptively simple API. Cassandra uses a consistent hashing ring where a Murmur3 hash of your partition key selects a node — virtual nodes (vnodes) make rebalancing smooth. DynamoDB mana...
Sub-topic
1 article
Clock Skew and Causality Violations: Why Distributed Clocks Lie
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions — but under load, across datacenters, or after a VM pause, the drift can reach seconds. When s...
Sub-topic
1 article
Stale Reads and Cascading Failures in Distributed Systems
TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redistributed load. Both are preventable — stale reads...
Sub-topic
1 article
CosmosDB Partition Internals: Logical vs Physical Partitions Explained
🔥 When Your Database Bill Triples Overnight: A retail engineering team ships a flash-sale feature. Traffic spikes 10×. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now take 800ms. The on-call engineer bumps provisioned R...
Sub-topic
1 article

Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More
TLDR: Distributed systems produce anomalies not because the code is buggy — but because physics makes it impossible to be perfectly consistent, available, and partition-tolerant simultaneously. Split brain, stale reads, clock skew, causality violatio...
Sub-topic
1 article
The Dual Write Problem: Why Two Writes Always Fail Eventually — and How to Fix It
TLDR: Any service that writes to a database and publishes a message in the same logical operation has a dual write problem. try/catch retries don't fix it — they turn failures into duplicates. The Transactional Outbox pattern co-writes business data ...
Sub-topic
1 article
How CDC Works Across Databases: PostgreSQL, MySQL, MongoDB, and Beyond
A data engineering team at a fintech company built what they believed was a robust Change Data Capture pipeline: three source databases (PostgreSQL, MongoDB, and Cassandra), Debezium connectors wired to Kafka, and a downstream data warehouse receivin...
Sub-topic
1 article
Real-Time Communication: WebSockets, SSE, and Long Polling Explained
TLDR: 🔌 WebSockets = bidirectional persistent channel — use for chat, gaming, collaborative editing. SSE = one-way server push over HTTP with built-in reconnect — use for AI streaming, live logs, notifications. Long Polling = held HTTP requests — th...
Sub-topic
1 article
MLOps Model Serving and Monitoring Patterns for Production Readiness
TLDR: Production ML reliability depends on joining inference serving, data-quality signals, and rollback automation into one operating loop. This dedicated deep dive focuses on the internals, failure behavior, performance trade-offs, and rollou...
Sub-topic
1 article
AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails
TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation. Those layers determine safety, latency, cost, and traceability far more than model choice alone. Prod...
Sub-topic
1 article
System Design HLD Example: API Gateway for Microservices
TLDR: An API Gateway centralizes "cross-cutting concerns" like authentication, rate limiting, and routing at the edge of your infrastructure. The architectural crux is the separation of the Control Plane (managing configurations) from the Data Plane ...
Sub-topic
1 article
System Design Data Modeling and Schema Evolution: Query-Driven Storage That Survives Change
TLDR: In system design interviews, data modeling is where architecture meets reality. A good model starts from query patterns, chooses clear entity boundaries, defines indexes deliberately, and includes a schema evolution path so the system can chang...
Sub-topic
1 article
The Role of Data in Precise Capacity Estimations for System Design
TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work: DAU, QPS, Storage/day, and Bandwidth/day. 📖 T...
Sub-topic
1 article
System Design Advanced: Security, Rate Limiting, and Reliability
TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate failure blast radius. Knowing when and how to combine...
Sub-topic
1 article
LLD for URL Shortener: Designing TinyURL
TLDR: A URL Shortener maps long URLs to short IDs. The core challenge is generating a globally unique, short, collision-free ID at scale. We use Base62 encoding on auto-incrementing database IDs for deterministic, collision-free short codes. ...
Sub-topic
1 article
X.509 Certificates: A Deep Dive into How They Work
TLDR: An X.509 Certificate is a digital document that binds a Public Key to an Identity (e.g., google.com). It is digitally signed by a trusted Certificate Authority (CA). It prevents attackers from impersonating websites via man-in-the-middle attack...
Sub-topic
1 article
How SSL/TLS Works: The Handshake Explained
TLDR: SSL (now TLS) secures data between your browser and a server. It uses Asymmetric Encryption (Public/Private keys) once — to safely exchange a fast Symmetric Session Key. Everything after the handshake is encrypted with the session key. 📖 The...
Sub-topic
1 article
How OAuth 2.0 Works: The Valet Key Pattern
TLDR: OAuth 2.0 is an authorization protocol. It lets a third-party app (like Spotify) access your resources (like Facebook Friends) without you giving it your Facebook password. It uses short-lived Access Tokens as scoped, revocable keys. 📖 The V...
Sub-topic
1 article
How Kubernetes Works: The Container Orchestrator
TLDR: Kubernetes (K8s) is an operating system for the cloud. It manages clusters of computers (Nodes) and schedules applications (Pods) onto them via a continuous declarative control loop: you describe what you want, and Kubernetes continuousl...
Sub-topic
1 article
How GPT (LLM) Works: The Next Word Predictor
TLDR: At its core, GPT asks one question, repeated: "Given everything so far, what is the most likely next token?" Tokens are not words — they're subword units. The Transformer architecture uses self-attention to weigh how much each token should infl...
Sub-topic
1 article
How Fluentd Works: The Unified Logging Layer
TLDR: Fluentd is an open-source data collector that decouples log sources from destinations. It ingests logs from 100+ sources (Nginx, Docker, syslog), normalizes them to JSON, applies filters and transformations, and routes them to 100+ outputs (Ela...
Sub-topic
1 article
How Apache Lucene Works: The Engine Behind Elasticsearch
TLDR: Lucene is a search library. Its core innovation is the inverted index — a reverse map from words to documents, like the index at the back of a textbook. Documents are stored in immutable segments that Lucene merges in the background to keep que...
Sub-topic
1 article
A Guide to Raft, Paxos, and Consensus Algorithms
TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard: correct but notoriously hard to understand. Raft is the practical standard, designed for understa...
Sub-topic
1 article

API Gateway vs. Load Balancer vs. Reverse Proxy: What's the Difference?
TLDR: A Reverse Proxy hides your servers and handles caching/SSL. A Load Balancer spreads traffic across server instances. An API Gateway manages API concerns — auth, rate limiting, routing, and protocol translation. Modern tools (Nginx, AWS ALB, Kon...
Sub-topic
1 article

LLD for Ride Booking App: Designing Uber/Lyft
TLDR: A ride-booking system (Uber/Lyft-style) needs three interleaved sub-systems: real-time driver location tracking (Observer Pattern), nearest-driver matching (geospatial query), and dynamic pricing (Strategy Pattern). Getting state transitions ri...
Sub-topic
1 article

Java Memory Model Demystified: Stack vs. Heap
TLDR: Java memory is split into two main areas: the Stack for method execution frames and primitives, and the Heap for all objects. Understanding their differences is essential for avoiding stack overflow errors, memory leaks, and garbage collection ...

