Topic
distributed systems
71 articles across 24 sub-topics
Sub-topic
22 articles

Big Data 101: The 5 Vs, Ecosystem, and Why Scale Breaks Everything
TLDR: Traditional databases fail at big data scale for three concrete reasons — storage saturation, compute bottleneck, and write-lock contention. The 5 Vs (Volume, Velocity, Variety, Veracity, Value) frame what makes data "big." A layered ecosystem ...
Microservices Architecture: Decomposition, Communication, and Trade-offs
TLDR: Microservices let teams deploy and scale services independently — but every service boundary you draw costs you a network hop, a consistency challenge, and an operational burden. The architecture pays off only when your team and traffic scale h...

System Design HLD Example: Web Crawler
TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fetcher pool. By combining Bloom Filters for URL dedu...
System Design HLD Example: Distributed Job Scheduler
TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row-level locking (UPDATE WHERE status='SCHEDULED'). ...
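A minimal sketch of the conditional-update claim mentioned in the teaser above, using an in-memory SQLite database as a stand-in for the Job Store. The `jobs` table layout, the `owner` column, the `'RUNNING'` status, and the `try_claim` helper are all hypothetical names chosen for illustration; only the `UPDATE ... WHERE status='SCHEDULED'` idea comes from the teaser.

```python
import sqlite3

# In-memory stand-in for the durable Job Store (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, owner TEXT)")
conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'SCHEDULED')")
conn.commit()


def try_claim(conn: sqlite3.Connection, job_id: int, scheduler_id: str) -> bool:
    """Claim a job only if it is still SCHEDULED; return True if this caller won."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING', owner = ? "
        "WHERE id = ? AND status = 'SCHEDULED'",
        (scheduler_id, job_id),
    )
    conn.commit()
    # Exactly one row changes for the winner; zero rows for everyone else.
    return cur.rowcount == 1


first = try_claim(conn, 1, "scheduler-a")
second = try_claim(conn, 1, "scheduler-b")
print(first, second)
```

The second caller's update matches no rows because the status is no longer `'SCHEDULED'`, so it can tell from the affected-row count that another instance already took the job.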
Distributed Transactions: 2PC, Saga, and XA Explained
TLDR: Distributed transactions require you to choose a consistency model before choosing a protocol. 2PC and XA give atomic all-or-nothing commits but block all participants on coordinator failure. Saga gives eventual consistency with explicit compen...
Modernization Architecture Patterns: Strangler Fig, Anti-Corruption Layers, and Modular Monoliths
TLDR: Large-scale modernization usually fails when teams try to replace an entire legacy platform in one synchronized rewrite. The safer approach is to create seams, translate old contracts into stable new ones, and move traffic gradually with measur...
Sub-topic
10 articles

Read Skew Explained: Inconsistent Snapshots Across Multiple Objects
TLDR: Read skew occurs when a transaction reads two logically related objects at different points in time — one before and one after a concurrent transaction commits — producing a view that never existed as a committed whole. Read Committed isolation...

Phantom Read Explained: When New Rows Appear Mid-Transaction
TLDR: A phantom read occurs when a transaction runs the same range query twice and gets a different set of rows — because a concurrent transaction inserted or deleted matching rows and committed in between. Row locks cannot stop this because the phan...

Write Skew Explained: The Anomaly That Requires Serializable Isolation
TLDR: Write skew is the hardest concurrency anomaly to reason about: two concurrent transactions each read a shared condition, decide they can safely proceed, and then write to different rows. No individual operation is wrong. No row was overwritten....
Dirty Read Explained: How Uncommitted Data Corrupts Transactions
TLDR: A dirty read occurs when Transaction B reads data written by Transaction A before A has committed. If A rolls back, B has made decisions on data that — from the database's perspective — never existed. Read Committed isolation (the default in Po...
Non-Repeatable Read Explained: When the Same Query Returns Different Results
TLDR: A non-repeatable read happens when the same SELECT returns different results within a single transaction because a concurrent transaction committed an update between the two reads. Read Committed isolation, the default in PostgreSQL an...

Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared
TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose — range, hash, consistent hashing, or directory — determines whether range queries stay ch...
Sub-topic
9 articles
System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances
TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infrastructure from guesswork into deterministic routi...
System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust
TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together, they convert operational chaos into measurable, re...
System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions
TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no longer just scaling compute. It is coordinating r...
System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers
TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple design, explain trade-offs, and improve it when constr...

System Design Databases: SQL vs NoSQL and Scaling
TLDR: SQL gives you ACID guarantees and powerful relational queries; NoSQL gives you horizontal scale and flexible schemas. The real decision is not "which is better" — it is "which trade-offs align w

System Design Protocols: REST, RPC, and TCP/UDP
TLDR: 🎯 Use REST (HTTP + JSON) for public, browser-facing APIs where interoperability matters. Choose gRPC (HTTP/2 + Protobuf) for internal microservice communication when latency counts. Under the h
Sub-topic
3 articles
HyperLogLog Explained: Counting Billions of Unique Items with 12 KB
TLDR: HyperLogLog estimates the number of distinct elements in a dataset using ~12 KB of memory regardless of cardinality — with ±0.81% error. The insight: if you hash every element to a random bit string, the maximum length of leading zeros you obse...
Count-Min Sketch Explained: Frequency Estimation at Streaming Scale
TLDR: Count-Min Sketch (CMS) is a fixed-size d × w counter matrix that estimates how often any element has appeared in a stream. Insert: hash the element with each of the d hash functions to get one column per row, increment those d counters. Query: ...
Bloom Filters Explained: Membership Testing with Zero False Negatives
TLDR: A Bloom filter is a bit array of m bits + k independent hash functions that sets k bits on insert and checks those same k bits on lookup. If any checked bit is 0, the element is definitely not in the set — false negatives are mathematically imp...
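The insert and lookup steps described in the teaser above can be sketched as follows. This is an illustrative toy, not the linked article's implementation: the `BloomFilter` class, its method names, and the use of salted SHA-256 as a substitute for k independent hash functions are all assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit array of m bits with k hash positions per element."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: salting SHA-256 with the index i stands in for
        # k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Insert: set the k bits selected by the hash functions.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Lookup: if any of the k bits is 0, the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("https://example.com/a")
print(bf.might_contain("https://example.com/a"))  # True: added items are always found
```

A `True` result for an item that was never added is still possible (a false positive), because other insertions may have set the same k bits.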
Sub-topic
3 articles
Split Brain Explained: When Two Nodes Both Think They Are Leader
TLDR: Split brain happens when a network partition causes two nodes to simultaneously believe they are the leader — each accepting writes the other never sees. Prevent it with quorum consensus (at least ⌊N/2⌋+1 nodes must agree before leadership is g...
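As a small worked example of the ⌊N/2⌋+1 rule in the teaser above, the sketch below checks which side of a partition may elect a leader. The function names `quorum_size` and `has_quorum` are hypothetical, and the model ignores everything except vote counting.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1


def has_quorum(reachable: int, n: int) -> bool:
    """True if a partition that can reach `reachable` nodes may grant leadership."""
    return reachable >= quorum_size(n)


# A 5-node cluster split by a network partition into groups of 3 and 2.
n = 5
majority_side, minority_side = 3, 2
print(quorum_size(n))                 # 3
print(has_quorum(majority_side, n))   # True
print(has_quorum(minority_side, n))   # False
```

Because two disjoint groups cannot both contain more than half of the nodes, at most one side of any partition can pass this check.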
Database Anomalies: How SQL and NoSQL Handle Dirty Reads, Phantom Reads, and Write Skew
TLDR: Database anomalies are the predictable side-effects of concurrent transactions — dirty reads, phantom reads, write skew, and lost updates. SQL databases use MVCC and isolation levels to prevent them; PostgreSQL's Serializable Snapshot Isolation...

The Consistency Continuum: From Read-Your-Own-Writes to Leaderless Replication
TLDR: In distributed systems, consistency is a spectrum of trade-offs between latency, availability, and correctness. By leveraging session-based patterns like Read-Your-Own-Writes and formal Quorum logic (W + R > N), architects can provide the illus...
Sub-topic
3 articles

Redis Sorted Sets Explained: Skip Lists, Scores, and Real-World Use Cases
TLDR: Redis Sorted Sets (ZSETs) store unique members each paired with a floating-point score, kept in sorted order at all times. Internally they use a skip list for O(log N) range queries and a hash table for O(1) score lookup — giving you the best o...

Write-Time vs Read-Time Fan-Out: How Social Feeds Scale
TLDR: Fan-out is the act of distributing one post to many followers' feeds. Write-time fan-out (push) pre-computes feeds at post time — fast reads but catastrophic write amplification for celebrities. Read-time fan-out (pull) computes feeds on demand...
System Design: Complete Guide to Caching — Patterns, Eviction, and Distributed Strategies
TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through ARC), cache invalidation pitfalls, thundering her...
Sub-topic
2 articles

Dirty Write Explained: When Uncommitted Data Gets Overwritten
TLDR: A dirty write occurs when Transaction B overwrites data that Transaction A has written but not yet committed. The result is not a rollback or an error — it is silently inconsistent committed data: one table reflects Transaction B's intent, anot...

Lost Update Explained: When Two Writes Become One
TLDR: A lost update occurs when two concurrent read-modify-write transactions both read the same committed value, both compute a new value from it, and both write back — with the second write silently discarding the first. No error is raised. Both tr...
Sub-topic
2 articles
Azure Cosmos DB Consistency Levels Explained: Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual
TLDR: Cosmos DB offers five consistency levels — Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — each with precise, non-obvious internal mechanics. Session does not mean HTTP session; it means a client-side token that tracks what yo...
Azure Cosmos DB API Modes Explained: NoSQL, MongoDB, Cassandra, PostgreSQL, Gremlin, and Table
TLDR: Cosmos DB's six API modes are wire-protocol compatibility layers over one shared ARS storage engine — except PostgreSQL (Citus), which is genuinely different. Every API emulates its native database incompletely, and those gaps are structural, n...
Sub-topic
2 articles
The Dual Write Problem in NoSQL: MongoDB, DynamoDB, and Cassandra
TLDR: NoSQL databases trade cross-entity atomicity for scale — and every database draws that atomicity boundary in a different place. MongoDB's boundary is the document (pre-4.0) or the replica set (4.0+ multi-doc transactions). DynamoDB's boundary i...
The Dual Write Problem: Why Two Writes Always Fail Eventually — and How to Fix It
TLDR: Any service that writes to a database and publishes a message in the same logical operation has a dual write problem. try/catch retries don't fix it — they turn failures into duplicates. The Transactional Outbox pattern co-writes business data ...
Sub-topic
1 article
Clock Skew and Causality Violations: Why Distributed Clocks Lie
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions — but under load, across datacenters, or after a VM pause, the drift can reach seconds. When s...
Sub-topic
1 article
Stale Reads and Cascading Failures in Distributed Systems
TLDR: Stale reads return superseded data from replicas that haven't yet applied the latest write. Cascading failures turn one overloaded node into a cluster-wide collapse through retry storms and redistributed load. Both are preventable — stale reads...
Sub-topic
1 article

Compare-and-Swap and Optimistic Locking: How Every Database Implements It
TLDR: Compare-and-Swap (CAS) is the CPU-level atomic instruction that makes lock-free concurrency possible. Optimistic locking builds on it at the database layer: read freely, compute locally, write only if the record has not changed. Every major dat...
Sub-topic
1 article

Change Feed vs Change Stream: CDC Internals, Reliability, and When to Avoid Each
In the summer of 2023, the platform team at a fast-growing e-commerce company was handling 100,000 orders per day across three microservices: Order Service, Inventory Service, and Billing Service. All three needed to react to the same database mutati...
Sub-topic
1 article

Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More
TLDR: Distributed systems produce anomalies not because the code is buggy — but because physics makes it impossible to be perfectly consistent, available, and partition-tolerant simultaneously. Split brain, stale reads, clock skew, causality violatio...
Sub-topic
1 article

ACID Transactions in Distributed Databases: DynamoDB, Cosmos DB, and Spanner Compared
TLDR: ACID transactions in distributed databases are not equal. DynamoDB provides multi-item atomicity scoped to 25 items using two-phase commit with a coordinator item, but only within a single region. Cosmos DB wraps partition-scoped operations ins...
Sub-topic
1 article
ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond
TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong strategy either causes B-tree fragmentation at 50M...
Sub-topic
1 article
System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems
TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue — it is defining delivery semantics, retry behavior, and idempote...
Sub-topic
1 article
System Design Data Modeling and Schema Evolution: Query-Driven Storage That Survives Change
TLDR: In system design interviews, data modeling is where architecture meets reality. A good model starts from query patterns, chooses clear entity boundaries, defines indexes deliberately, and includes a schema evolution path so the system can chang...
Sub-topic
1 article
System Design API Design for Interviews: Contracts, Idempotency, and Pagination
TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error semantics, and versioning decisions that survive scale...
Sub-topic
1 article
The Role of Data in Precise Capacity Estimations for System Design
TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work: DAU, QPS, Storage/day, and Bandwidth/day. 📖 T...
Sub-topic
1 article
System Design Advanced: Security, Rate Limiting, and Reliability
TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate failure blast radius. Knowing when and how to combine...
Sub-topic
1 article
How Kafka Works: The Log That Never Forgets
TLDR: Kafka is a distributed event store. Unlike a traditional queue (RabbitMQ) where messages disappear after reading, Kafka stores them in a persistent Log. This allows multiple consumers to read the same data at their own pace, replay history, and...
Sub-topic
1 article
Consistent Hashing: Scaling Without Chaos
TLDR: Standard hashing (key % N) breaks when N changes: adding or removing a server reshuffles almost all keys. Consistent Hashing maps both servers and keys onto a ring (0–360°). When a server is added, only its immediate neighbors' keys move, mi...
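To make the reshuffling claim in the teaser above concrete, the sketch below counts how many keys land on a different server under `key % N` when a cluster grows from 4 to 5 servers. It assumes plain integers as stand-ins for already-hashed keys, and `moved_fraction` is a hypothetical helper name.

```python
from typing import Iterable


def moved_fraction(keys: Iterable[int], n_before: int, n_after: int) -> float:
    """Fraction of keys whose server index changes under key % N placement."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % n_before != k % n_after)
    return moved / len(keys)


# Assumption: integers 0..9999 stand in for hashed keys.
fraction = moved_fraction(range(10_000), n_before=4, n_after=5)
print(f"{fraction:.0%} of keys change servers")  # 80% of keys change servers
```

Only keys whose remainders happen to agree under both moduli stay put, which is why adding a single server relocates most of the data under modulo placement.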
Sub-topic
1 article
A Guide to Raft, Paxos, and Consensus Algorithms
TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard, correct but notoriously hard to understand. Raft is the practical standard, designed for understa...
