Abstract Algorithms

distributed systems

65 articles across 24 sub-topics

Big Data 101: The 5 Vs, Ecosystem, and Why Scale Breaks Everything

TLDR: Traditional databases fail at big data scale for three concrete reasons — storage saturation, compute bottleneck, and write-lock contention. The 5 Vs (Volume, Velocity, Variety, Veracity, Value) frame what makes data "big." A layered ecosystem ...

Mar 28, 2026•20 min read

Microservices Architecture: Decomposition, Communication, and Trade-offs

TLDR: Microservices let teams deploy and scale services independently — but every service boundary you draw costs you a network hop, a consistency challenge, and an operational burden. The architecture pays off only when your team and traffic scale h...

Mar 28, 2026•20 min read

System Design HLD Example: Web Crawler

TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fetcher pool. By combining Bloom Filters for URL dedu...

Mar 28, 2026•17 min read

Distributed Transactions: 2PC, Saga, and XA Explained

TLDR: Distributed transactions require you to choose a consistency model before choosing a protocol. 2PC and XA give atomic all-or-nothing commits but block all participants on coordinator failure. Saga gives eventual consistency with explicit compen...

Mar 28, 2026•25 min read

Modernization Architecture Patterns: Strangler Fig, Anti-Corruption Layers, and Modular Monoliths

TLDR: Large-scale modernization usually fails when teams try to replace an entire legacy platform in one synchronized rewrite. The safer approach is to create seams, translate old contracts into stable new ones, and move traffic gradually with measur...

Mar 13, 2026•12 min read

Integration Architecture Patterns: Orchestration, Choreography, Schema Contracts, and Idempotent Receivers

TLDR: Integration failures usually come from weak contracts, unsafe retries, and missing ownership rather than from choosing the wrong transport. Orchestration, choreography, schema contracts, and idempotent receivers are patterns for making cross-bo...

Mar 13, 2026•14 min read

Event Sourcing Pattern: Auditability, Replay, and Evolution of Domain State

TLDR: Event sourcing pays off when regulatory audit history and replay are first-class requirements — but it demands strict schema evolution, a snapshot strategy, and a framework that owns aggregate lifecycle. Spring Boot + Axon Framework is the fast...

Mar 13, 2026•15 min read

CQRS Pattern: Separating Write Models from Query Models at Scale

TLDR: CQRS works when read and write workloads diverge, but only with explicit freshness budgets and projection reliability. The hard part is not separating models — it is operating lag, replay, and rollback safely. An e-commerce platform's order se...

Mar 13, 2026•14 min read

Cloud Architecture Patterns: Cells, Control Planes, Sidecars, and Queue-Based Load Leveling

TLDR: Cloud scale is not created by sprinkling managed services around a diagram. It comes from isolating failure domains, separating coordination from request serving, and smoothing bursty work before it overloads synchronous paths. TLDR: Cloud patt...

Mar 13, 2026•15 min read

Bulkhead Pattern: Isolating Capacity to Protect Critical Workloads

TLDR: Bulkheads isolate capacity so one overloaded dependency or workload class cannot consume every thread, queue slot, or connection in the service. TLDR: Use bulkheads when different workloads do not deserve equal blast radius. The practical goal ...

Mar 13, 2026•15 min read

System Design HLD Example: Payment Processing Platform

TLDR: Payment systems optimize for correctness first, then throughput. This guide covers idempotency, double-entry ledgers, and reconciliation. Stripe processes over 250 million API requests per day, and every single payment must be idempotent: a us...

Mar 13, 2026•15 min read

System Design HLD Example: Notification Service (Email, SMS, Push)

TLDR: A notification platform routes events to per-channel Kafka queues, deduplicates with Redis, and tracks delivery via webhooks — ensuring that critical alerts like password resets never get blocked by marketing batches. Uber sends over 1 million...

Mar 13, 2026•17 min read

System Design HLD Example: File Storage and Sync (Dropbox and Google Drive)

TLDR: Cloud sync systems separate immutable blob storage (S3) from atomic metadata operations (PostgreSQL), using chunk-level deduplication to optimize storage costs and delta-sync events to minimize bandwidth. Dropbox serves 700 million registered ...

Mar 13, 2026•15 min read

System Design HLD Example: Distributed Cache Platform

TLDR: Distributed caches trade strict consistency for sub-millisecond read latency, using consistent hashing to scale horizontally without causing database-shattering "cache stampedes" during cluster rebalancing. Instagram's primary database once se...

Mar 13, 2026•14 min read

System Design Requirements and Constraints: Ask Better Questions Before You Draw

TLDR: In system design interviews, weak answers fail early because requirements are fuzzy. Strong answers start by turning vague prompts into explicit functional scope, measurable non-functional targets, and clear trade-off boundaries before any arch...

Mar 12, 2026•11 min read

Understanding Consistency Patterns: An In-Depth Analysis

TLDR TLDR: Consistency is about whether all nodes in a distributed system show the same data at the same time. Strong consistency gives correctness but costs latency. Eventual consistency gives speed but requires tolerance for briefly stale reads. C...

Mar 9, 2026•13 min read

Little's Law: The Secret Formula for System Performance

TLDR: Little's Law ($L = \lambda W$) connects three metrics every system designer measures: $L$ = concurrent requests in flight, $\lambda$ = throughput (RPS), $W$ = average response time. If latency spikes, your concurrency requirement explodes with ...

Mar 9, 2026•12 min read

The 8 Fallacies of Distributed Systems

TLDR TLDR: In 1994, L. Peter Deutsch at Sun Microsystems listed 8 assumptions that developers make about distributed systems — all of which are false. Believing them leads to hard-to-reproduce bugs, timeout cascades, and security holes. Knowing them...

Mar 9, 2026•13 min read

Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?

TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg) that bring SQL performance to raw storage — the ...

Mar 9, 2026•14 min read

Databases(10)

Read Skew Explained: Inconsistent Snapshots Across Multiple Objects

TLDR: Read skew occurs when a transaction reads two logically related objects at different points in time — one before and one after a concurrent transaction commits — producing a view that never existed as a committed whole. Read Committed isolation...

Apr 11, 2026•31 min read

Phantom Read Explained: When New Rows Appear Mid-Transaction

TLDR: A phantom read occurs when a transaction runs the same range query twice and gets a different set of rows — because a concurrent transaction inserted or deleted matching rows and committed in between. Row locks cannot stop this because the phan...

Apr 11, 2026•29 min read

Write Skew Explained: The Anomaly That Requires Serializable Isolation

TLDR: Write skew is the hardest concurrency anomaly to reason about: two concurrent transactions each read a shared condition, decide they can safely proceed, and then write to different rows. No individual operation is wrong. No row was overwritten....

Apr 11, 2026•22 min read

Dirty Read Explained: How Uncommitted Data Corrupts Transactions

TLDR: A dirty read occurs when Transaction B reads data written by Transaction A before A has committed. If A rolls back, B has made decisions on data that — from the database's perspective — never existed. Read Committed isolation (the default in Po...

Apr 11, 2026•28 min read

Non-Repeatable Read Explained: When the Same Query Returns Different Results

TLDR: A non-repeatable read happens when the same SELECT returns different results within a single transaction because a concurrent transaction committed an update between the two reads. Read Committed isolation — the default in PostgreSQL, MySQL, an...

Apr 11, 2026•24 min read

Sharding Approaches in SQL and NoSQL: Range, Hash, and Directory-Based Strategies Compared

TLDR: Sharding splits your database across multiple physical nodes so no single machine carries all the data or absorbs all the writes. The strategy you choose — range, hash, consistent hashing, or directory — determines whether range queries stay ch...

Apr 5, 2026•27 min read

Key Terms in Distributed Systems: The Definitive Glossary

TLDR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. This glossary organises every critical term into co...

Apr 5, 2026•45 min read

System Design Sharding Strategy: Choosing Keys, Avoiding Hot Spots, and Resharding Safely

TLDR: Sharding means splitting one logical dataset across multiple physical databases so no single node carries all the data and traffic. The hard part is not adding more nodes. The hard part is choosing a shard key that keeps data balanced and queri...

Mar 12, 2026•13 min read

System Design Replication and Failover: Keep Services Alive When a Primary Dies

TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupti...

Mar 12, 2026•13 min read

Elasticsearch vs Time-Series DB: Key Differences Explained

TLDR: Elasticsearch is built for search — full-text log queries, fuzzy matching, and relevance ranking via an inverted index. InfluxDB and Prometheus are built for metrics — numeric time series with aggressive compression. Picking the wrong one waste...

Mar 9, 2026•13 min read

Interview-prep(9)

System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances

TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infrastructure from guesswork into deterministic routi...

Mar 12, 2026•12 min read

System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust

TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together, they convert operational chaos into measurable, re...

Mar 12, 2026•11 min read

System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions

TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no longer just scaling compute. It is coordinating r...

Mar 12, 2026•12 min read

System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers

TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple design, explain trade-offs, and improve it when constr...

Mar 12, 2026•12 min read

System Design Databases: SQL vs NoSQL and Scaling

TLDR: SQL gives you ACID guarantees and powerful relational queries; NoSQL gives you horizontal scale and flexible schemas. The real decision is not "which is better" — it is "which trade-offs align with your workload." Understanding replication, sha...

Feb 8, 2026•13 min read

System Design Protocols: REST, RPC, and TCP/UDP

TLDR: 🎯 Use REST (HTTP + JSON) for public, browser-facing APIs where interoperability matters. Choose gRPC (HTTP/2 + Protobuf) for internal microservice communication when latency counts. Under the hood, TCP guarantees reliable ordered delivery; UDP...

Feb 8, 2026•16 min read

System Design Networking: DNS, CDNs, and Load Balancers

TLDR: When you hit a URL, DNS translates the name to an IP, CDNs serve static assets from the edge nearest to you, and Load Balancers spread traffic across many servers so no single machine becomes a bottleneck. These three layers are the traffic con...

Feb 8, 2026•15 min read

System Design Core Concepts: Scalability, CAP, and Consistency

TLDR: 🚀 Scalability, the CAP Theorem, and consistency models are the three concepts that determine whether a distributed system can grow, stay reliable, and deliver correct results. Get these three right and you can reason about any system design qu...

Feb 8, 2026•12 min read

The Ultimate Guide to Acing the System Design Interview

TLDR: System Design interviews are collaborative whiteboard sessions, not trick-question coding tests. Follow the framework — Requirements → Estimations → API → Data Model → High-Level Architecture → Deep-Dive — and you turn vague product ideas into ...

Feb 8, 2026•13 min read

Caching(3)

Redis Sorted Sets Explained: Skip Lists, Scores, and Real-World Use Cases

TLDR: Redis Sorted Sets (ZSETs) store unique members each paired with a floating-point score, kept in sorted order at all times. Internally they use a skip list for O(log N) range queries and a hash table for O(1) score lookup — giving you the best o...

Mar 29, 2026•19 min read

Write-Time vs Read-Time Fan-Out: How Social Feeds Scale

TLDR: Fan-out is the act of distributing one post to many followers' feeds. Write-time fan-out (push) pre-computes feeds at post time — fast reads but catastrophic write amplification for celebrities. Read-time fan-out (pull) computes feeds on demand...

Mar 29, 2026•18 min read

System Design: Complete Guide to Caching — Patterns, Eviction, and Distributed Strategies

TLDR: Caching is the single highest-leverage performance tool in distributed systems. This guide covers every read/write pattern (Cache-Aside through Refresh-Ahead), every eviction policy (LRU through ARC), cache invalidation pitfalls, thundering her...

Mar 9, 2026•29 min read

Concurrency(2)

Dirty Write Explained: When Uncommitted Data Gets Overwritten

TLDR: A dirty write occurs when Transaction B overwrites data that Transaction A has written but not yet committed. The result is not a rollback or an error — it is silently inconsistent committed data: one table reflects Transaction B's intent, anot...

Apr 11, 2026•26 min read

Lost Update Explained: When Two Writes Become One

TLDR: A lost update occurs when two concurrent read-modify-write transactions both read the same committed value, both compute a new value from it, and both write back — with the second write silently discarding the first. No error is raised. Both tr...

Apr 11, 2026•34 min read

Consistency(2)

Database Anomalies: How SQL and NoSQL Handle Dirty Reads, Phantom Reads, and Write Skew

TLDR: Database anomalies are the predictable side-effects of concurrent transactions — dirty reads, phantom reads, write skew, and lost updates. SQL databases use MVCC and isolation levels to prevent them; PostgreSQL's Serializable Snapshot Isolation...

Apr 5, 2026•29 min read

The Consistency Continuum: From Read-Your-Own-Writes to Leaderless Replication

TLDR: In distributed systems, consistency is a spectrum of trade-offs between latency, availability, and correctness. By leveraging session-based patterns like Read-Your-Own-Writes and formal Quorum logic ($W+R > N$), architects can provide the illus...

Apr 5, 2026•8 min read

Azure(2)

Azure Cosmos DB Consistency Levels Explained: Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual

TLDR: Cosmos DB offers five consistency levels — Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — each with precise, non-obvious internal mechanics. Session does not mean HTTP session; it means a client-side token that tracks what yo...

Apr 5, 2026•23 min read

Azure Cosmos DB API Modes Explained: NoSQL, MongoDB, Cassandra, PostgreSQL, Gremlin, and Table

TLDR: Cosmos DB's six API modes are wire-protocol compatibility layers over one shared ARS storage engine — except PostgreSQL (Citus), which is genuinely different. Every API emulates its native database incompletely, and those gaps are structural, n...

Apr 5, 2026•23 min read

Architecture Patterns(2)

The Dual Write Problem in NoSQL: MongoDB, DynamoDB, and Cassandra

TLDR: NoSQL databases trade cross-entity atomicity for scale — and every database draws that atomicity boundary in a different place. MongoDB's boundary is the document (pre-4.0) or the replica set (4.0+ multi-doc transactions). DynamoDB's boundary i...

Apr 5, 2026•35 min read

The Dual Write Problem: Why Two Writes Always Fail Eventually — and How to Fix It

TLDR: Any service that writes to a database and publishes a message in the same logical operation has a dual write problem. try/catch retries don't fix it — they turn failures into duplicates. The Transactional Outbox pattern co-writes business data ...

Apr 5, 2026•22 min read

Compare And Swap(1)

Compare-and-Swap and Optimistic Locking: How Every Database Implements It

TLDR: Compare-and-Swap (CAS) is the CPU-level atomic instruction that makes lock-free concurrency possible. Optimistic locking builds on it at the database layer: read freely, compute locally, write only if the record has not changed. Every major dat...

Apr 17, 2026•31 min read

Cdc(1)

Change Feed vs Change Stream: CDC Internals, Reliability, and When to Avoid Each

In the summer of 2023, the platform team at a fast-growing e-commerce company was handling 100,000 orders per day across three microservices: Order Service, Inventory Service, and Billing Service. All three needed to react to the same database mutati...

Apr 17, 2026•46 min read

Anomalies(1)

Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More

TLDR: Distributed systems produce anomalies not because the code is buggy — but because physics makes it impossible to be perfectly consistent, available, and partition-tolerant simultaneously. Split brain, stale reads, clock skew, causality violatio...

Apr 5, 2026•34 min read

Acid(1)

ACID Transactions in Distributed Databases: DynamoDB, Cosmos DB, and Spanner Compared

TLDR: ACID transactions in distributed databases are not equal. DynamoDB provides multi-item atomicity scoped to 25 items using two-phase commit with a coordinator item, but only within a single region. Cosmos DB wraps partition-scoped operations ins...

Apr 5, 2026•35 min read

Id Generation(1)

ID Generation Strategies in System Design: Base62, UUID, Snowflake, and Beyond

TLDR: Short shareable IDs need Base62 (URL shorteners). Database primary keys at scale need time-ordered IDs (Snowflake, UUID v7). Security tokens need random IDs (UUID v4, NanoID). Picking the wrong strategy either causes B-tree fragmentation at 50M...

Apr 5, 2026•23 min read

Job Scheduler(1)

System Design HLD Example: Distributed Job Scheduler

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row-level locking (UPDATE WHERE status='SCHEDULED'). ...

Mar 28, 2026•14 min read

Rate Limiter(1)

System Design HLD Example: Distributed Rate Limiter

TLDR: A distributed rate limiter protects APIs from abuse and "noisy neighbors" by enforcing request quotas across a cluster of servers. The core technical challenge is Atomic State Management—solved by using Redis Lua scripts to perform a "check-and...

Mar 13, 2026•18 min read

Chat System(1)

System Design HLD Example: Chat and Messaging Platform

TLDR: A distributed chat system must balance low-latency delivery with strong per-conversation ordering. The architectural crux is the WebSocket Gateway for persistent stateful connections and Cassandra for append-heavy message storage partitioned by...

Mar 13, 2026•17 min read

Event-driven-architecture(1)

System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems

TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue — it is defining delivery semantics, retry behavior, and idempote...

Mar 12, 2026•13 min read

Data-modeling(1)

System Design Data Modeling and Schema Evolution: Query-Driven Storage That Survives Change

TLDR: In system design interviews, data modeling is where architecture meets reality. A good model starts from query patterns, chooses clear entity boundaries, defines indexes deliberately, and includes a schema evolution path so the system can chang...

Mar 12, 2026•13 min read

Api Design(1)

System Design API Design for Interviews: Contracts, Idempotency, and Pagination

TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error semantics, and versioning decisions that survive scale...

Mar 12, 2026•11 min read

Capacity Planning(1)

The Role of Data in Precise Capacity Estimations for System Design

TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work: DAU, QPS, Storage/day, and Bandwidth/day. 📖 T...

Mar 9, 2026•13 min read

Circuit Breaker(1)

System Design Advanced: Security, Rate Limiting, and Reliability

TLDR: Three reliability tools every backend system needs: Rate Limiting prevents API spam and DDoS, Circuit Breakers stop cascading failures when downstream services degrade, and Bulkheads isolate failure blast radius. Knowing when and how to combine...

Mar 9, 2026•14 min read

Kafka(1)

How Kafka Works: The Log That Never Forgets

TLDR: Kafka is a distributed event store. Unlike a traditional queue (RabbitMQ) where messages disappear after reading, Kafka stores them in a persistent Log. This allows multiple consumers to read the same data at their own pace, replay history, and...

Mar 9, 2026•13 min read

Hashing(1)

Consistent Hashing: Scaling Without Chaos

TLDR: Standard hashing (key % N) breaks when $N$ changes — adding or removing a server reshuffles almost all keys. Consistent Hashing maps both servers and keys onto a ring (0–360°). When a server is added, only its immediate neighbors' keys move, mi...

Mar 9, 2026•13 min read

Consensus(1)

A Guide to Raft, Paxos, and Consensus Algorithms

TLDR TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard — correct but notoriously hard to understand. Raft is the practical standard — designed for understa...

Mar 9, 2026•13 min read