Key Terms in Distributed Systems: The Definitive Glossary
Every term you'll encounter in distributed systems interviews and production engineering — defined clearly with examples
TL;DR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. This glossary organises every critical term into concept families — with concrete scenarios, prevention strategies, and visual aids — so the vocabulary becomes second nature.
📖 How Vocabulary Breaks System Design Interviews (and Production Systems)
You are 35 minutes into a system design interview. You have just proposed a multi-leader replication setup for a globally distributed social feed. The interviewer pauses and says: "Your design doesn't account for write skew. Walk me through how you'd prevent it."
You nod. You say "I'd use Serializable isolation." But inside, a small alarm is going off: is write skew the same as a lost update? Does Serializable actually prevent it? What even is the difference?
This moment — the instant where blurry vocabulary collapses your design argument — is the most common way strong engineers fail distributed systems interviews. It is not a knowledge gap; it is a vocabulary gap. You understand the concepts intuitively but do not have the precise words or the precise relationships between them.
The same problem shows up in production. When a Cassandra runbook says "ensure R + W > N for strong consistency" and nobody on the team is certain whether that formula applies in a multi-datacenter setup, configuration mistakes follow. When a junior engineer confuses "replica lag" with "split brain," they diagnose the wrong failure mode.
This glossary exists to close that gap. Rather than presenting terms alphabetically — which treats the vocabulary like a dictionary and hides the relationships between concepts — every term here is grouped by concept family. Isolation anomalies are grouped together so you can see how they escalate. Replication models are grouped so you can compare their trade-offs side-by-side. By the time you finish, you will not just recognise the words; you will know how they connect.
🔍 Consistency Anomalies: The Ways Transactions Can Lie to You
Before you can understand isolation levels, you need to understand what they protect against. Each anomaly below describes a class of correctness violation that can occur when multiple transactions execute concurrently.
Dirty Read
A dirty read occurs when transaction B reads data that transaction A has written but not yet committed. If A rolls back, B has read data that, as far as the database is concerned, never existed.
Scenario: Transaction A is transferring $500 from account 1 to account 2. It debits account 1 but has not yet credited account 2. Transaction B reads account 1's new (lower) balance before A commits. A then fails and rolls back. B has acted on a balance that, officially, never existed.
Prevention: Any isolation level above Read Uncommitted prevents dirty reads.
Non-Repeatable Read
A non-repeatable read occurs when a transaction reads the same row twice within one transaction and gets different values each time, because another transaction committed a change between the two reads.
Scenario: An audit job reads a payment record at 10:00:01 and again at 10:00:02 to verify it has not changed. Between those two reads, a concurrent transaction updates the payment amount. The audit report is now internally inconsistent.
Prevention: Repeatable Read isolation or higher.
Phantom Read
A phantom read is the row-range equivalent of a non-repeatable read. A transaction executes the same range query twice. The second execution returns rows that did not exist the first time (or is missing rows that did exist), because another transaction inserted or deleted rows in between.
Scenario: A booking system reads all available seats in a row WHERE section = 'A' AND available = true. Another transaction books one of those seats and commits. The first transaction re-queries and sees a smaller result set. It is now operating on stale availability data.
Prevention: Serializable isolation (or predicate/gap locking under Repeatable Read in some engines, such as MySQL InnoDB's next-key locks).
Read Skew
Read skew occurs when a transaction performs multiple reads across two different objects, and those reads are taken from different points in time — so they represent an internally inconsistent snapshot.
The canonical example is a bank invariant: account A plus account B must always total $1,000 ($500 in each). A transfer transaction deducts $200 from account A (now $300) but has not yet credited account B (still $500). A concurrent read transaction reads account A's new value ($300) and account B's old value ($500). It sees $800 in total — $200 has apparently vanished. The reads straddled another transaction's writes and observed a state that never existed.
Prevention: Snapshot Isolation (MVCC) — reads are served from a single consistent snapshot taken at transaction start, so a transaction never sees partial writes from another.
Write Skew
Write skew is the most subtle of the isolation anomalies and the one most frequently confused with a lost update in interviews.
Write skew occurs when two concurrent transactions each read a shared condition, each check that it is safe to proceed, and each then write a change that, individually, would be valid — but together violate a global invariant.
The classic scenario: A hospital requires at least one doctor to be on-call at all times. Two doctors, Alice and Bob, both check the on-call roster concurrently. Both see the other doctor is on-call and decide it is safe for them to go off-call. Both transactions commit. Now zero doctors are on-call — a violated invariant that neither transaction could detect individually.
Write skew differs from a lost update because neither transaction overwrites the other's data. They write to different rows. The conflict is logical, not physical.
Prevention: Serializable isolation (Serializable Snapshot Isolation / 2PL). Snapshot Isolation does not prevent write skew, which surprises many engineers who assume it does.
The following diagram illustrates the write skew scenario step-by-step:
sequenceDiagram
participant Alice as Alice Txn
participant DB as Database
participant Bob as Bob Txn
Alice->>DB: BEGIN
Bob->>DB: BEGIN
Alice->>DB: SELECT count(*) WHERE on_call = true
DB-->>Alice: 2 doctors on call
Bob->>DB: SELECT count(*) WHERE on_call = true
DB-->>Bob: 2 doctors on call
Alice->>DB: UPDATE doctors SET on_call=false WHERE name=Alice
Bob->>DB: UPDATE doctors SET on_call=false WHERE name=Bob
Alice->>DB: COMMIT
Bob->>DB: COMMIT
Note over DB: Invariant violated - 0 doctors on call
Both Alice and Bob read a valid state (two doctors on-call) and make a write that looks locally safe. The invariant breaks only when both commits land. This is why write-skew-prone workloads need Serializable isolation (or explicit locking that materialises the conflict, such as SELECT FOR UPDATE on the rows being checked) — the database must detect that the two reads and two writes form a conflicting cycle.
Lost Update
A lost update occurs when two transactions concurrently read the same value, compute a new value, and write back — with the second write silently overwriting the first.
Scenario: A counter starts at 5. Transaction A reads 5 and plans to write 6. Transaction B also reads 5 and plans to write 6. A writes 6. B writes 6. The counter should be 7; it is 6. A's increment is lost.
Prevention: Use atomic operations (UPDATE counter SET value = value + 1 WHERE id = X), optimistic locking (compare-and-swap), or pessimistic locking. Repeatable Read prevents it in some databases; Serializable prevents it universally.
A classic compare-and-swap guard against a lost update looks like this:
-- Read current value
SELECT value, version FROM counters WHERE id = 1;
-- value = 5, version = 42
-- Write only if version has not changed (no lost update possible)
UPDATE counters SET value = 6, version = 43
WHERE id = 1 AND version = 42;
-- If 0 rows affected, another transaction won the race — retry
Dirty Write
A dirty write occurs when a transaction overwrites data that another transaction has written but not yet committed. If either transaction then rolls back, there is no well-defined correct value, and the database can be left in an inconsistent state.
Scenario: Two buyers concurrently purchase the last item in inventory. Transaction A updates the item's owner to "Alice" (uncommitted). Transaction B, seeing the item row as writable, updates the owner to "Bob". A rolls back. The item now belongs to Bob, but the invoice may have been generated for Alice. The physical data and financial records diverge.
Prevention: Databases prevent dirty writes at every isolation level, including Read Uncommitted, typically by holding a row-level lock on each uncommitted write until the writing transaction commits or rolls back.
⚙️ Isolation Levels: A Five-Way Dial Between Safety and Speed
Isolation levels are the database's knob for trading safety against concurrency and performance. The SQL standard defines four levels; real databases (PostgreSQL, MySQL, Oracle) add Snapshot Isolation as a practical middle ground.
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Read Skew | Write Skew | Lost Update |
| --- | --- | --- | --- | --- | --- | --- |
| Read Uncommitted | Possible | Possible | Possible | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible | Possible | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible* | Prevented† | Possible | Prevented† |
| Snapshot Isolation (MVCC) | Prevented | Prevented | Prevented | Prevented | Possible | Prevented |
| Serializable | Prevented | Prevented | Prevented | Prevented | Prevented | Prevented |
*Phantom reads may be prevented by gap locks in some engines (MySQL InnoDB). †PostgreSQL's Repeatable Read prevents lost updates and read skew via MVCC.
Read Uncommitted is the lowest level. Transactions can read rows written by concurrent uncommitted transactions. Almost no production system uses this level; it exists for bulk reporting where approximate numbers are acceptable and speed matters above all else.
Read Committed is the PostgreSQL and Oracle default. Each statement within a transaction gets a fresh snapshot of committed data. This prevents dirty reads but allows non-repeatable reads within a single transaction, since the snapshot advances with each new statement.
Repeatable Read guarantees that if you read a row in a transaction, re-reading the same row later in the same transaction will return the same data. PostgreSQL implements this via MVCC snapshots. It does not prevent phantom reads in the SQL standard, though PostgreSQL's MVCC implementation effectively prevents most phantom reads in practice.
Snapshot Isolation (MVCC) gives every transaction a consistent view of the database as it existed at transaction start time. All reads come from this frozen snapshot. It prevents dirty reads, non-repeatable reads, phantom reads, and read skew — but not write skew, which is the key gotcha.
Serializable is the strongest level. All concurrent transactions produce results equivalent to some serial (one-at-a-time) execution order. PostgreSQL implements this via Serializable Snapshot Isolation (SSI), which uses read/write dependency tracking to detect and abort transactions that would violate serializability. It is safe for write skew.
📊 Replication Models and the Consistency Spectrum Across Databases
Replication is the act of maintaining copies of the same data on multiple nodes. Every replicated system must answer: who accepts writes, and how quickly must replicas reflect them?
The following diagram maps the consistency spectrum from weakest to strongest, positioned against the replication models that produce each consistency level:
flowchart TD
A[Eventual Consistency] --> B[Causal Consistency]
B --> C[Monotonic Reads / Read-Your-Writes]
C --> D[Session Consistency]
D --> E[Bounded Staleness]
E --> F[Sequential Consistency]
F --> G[Linearizability - Strongest]
R1[Leaderless async replication] -.-> A
R2[Multi-leader with per-session routing] -.-> D
R3[Single-leader sync replication] -.-> G
The spectrum above runs from eventual consistency (weakest — replicas converge eventually, with no timing guarantee) through to linearizability (strongest — the system appears as a single machine; reads always reflect the most recent write). Each step upward adds correctness guarantees but increases latency or availability cost.
Single-Leader Replication
One designated primary node accepts all writes. Replicas receive a replication stream and apply changes in the same order. This is the model used by PostgreSQL streaming replication and MySQL binlog replication.
Trade-off: Simple write ordering and conflict-free. But the single leader is a bottleneck — you cannot scale write throughput beyond what one node can handle, and leader failure requires a failover.
Multi-Leader Replication
Multiple nodes accept writes simultaneously. This enables writes to continue during network partitions between datacenters, and it eliminates the single-leader bottleneck for geo-distributed systems. MySQL multi-source replication and CockroachDB multi-region writes use variants of this model.
Trade-off: Write conflicts are inevitable when two leaders accept conflicting writes to the same row. The system must define a conflict resolution policy: last-write-wins, custom merge, CRDTs, or expose conflicts to application code.
Leaderless Replication (Dynamo-Style)
Any node can accept any read or write. Used by Apache Cassandra, Amazon DynamoDB, and Riak. Consistency is achieved by requiring reads and writes to contact a quorum of nodes.
Quorum condition — W + R > N: Where N is the replication factor (number of copies), W is the number of nodes that must acknowledge a write before it is confirmed, and R is the number of nodes contacted for a read. If W + R > N, at least one node in every read quorum must have the latest write. With N=3, W=2, R=2: any read overlaps with any write by at least one node.
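The overlap argument is just the pigeonhole principle, which makes it easy to sanity-check configurations in code. The following is a small illustrative sketch (the function names are mine, not from any client library):

```python
def quorum_overlap(n: int, w: int, r: int) -> int:
    """Minimum number of nodes guaranteed to be in BOTH a write quorum
    of size w and a read quorum of size r, out of n replicas.
    By pigeonhole: w + r - n (floored at zero)."""
    return max(0, w + r - n)

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """W + R > N: every read quorum intersects every write quorum in
    at least one node that holds the latest acknowledged write."""
    return quorum_overlap(n, w, r) >= 1

# N=3, W=2, R=2: reads and writes always overlap in at least one node
assert is_strongly_consistent(3, 2, 2)
# N=3, W=1, R=1: quorums can miss each other entirely -> stale reads possible
assert not is_strongly_consistent(3, 1, 1)
```

Note this reasoning assumes a single consistent membership view; it is exactly the formula that becomes subtle in multi-datacenter deployments, where per-DC consistency levels change what N, W, and R mean.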
Hinted handoff: If a target node is down during a write, a neighbour node stores the write with a hint (a note saying "deliver to node X when it recovers"). When the downed node comes back, the hint is forwarded. This allows writes to succeed during temporary node failures without sacrificing durability.
Read repair: When a read query contacts multiple replicas and detects a version mismatch (some replicas have an older value), the coordinator updates the stale replicas with the newer value before returning to the client. This lazily brings replicas back into sync.
Anti-entropy: A background process that periodically compares Merkle tree hashes of replica data and syncs any divergent ranges. Read repair handles inconsistencies detected during reads; anti-entropy handles key ranges that have not been recently read.
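The core idea — compare hashes of key ranges and sync only the ranges that differ — can be sketched without a full Merkle tree. This toy version (flat range hashes rather than a hash tree; all names are illustrative) shows how an anti-entropy pass narrows down what to stream:

```python
import hashlib

def range_hashes(replica: dict, ranges: list) -> list:
    """Hash each key range's sorted contents; equal hashes mean in sync."""
    out = []
    for lo, hi in ranges:
        items = sorted((k, v) for k, v in replica.items() if lo <= k < hi)
        out.append(hashlib.sha256(repr(items).encode()).hexdigest())
    return out

def divergent_ranges(a: dict, b: dict, ranges: list) -> list:
    """Return only the key ranges whose hashes differ -- the data an
    anti-entropy pass would actually transfer between replicas."""
    ha, hb = range_hashes(a, ranges), range_hashes(b, ranges)
    return [r for r, x, y in zip(ranges, ha, hb) if x != y]

ranges = [(0, 100), (100, 200)]
replica_a = {10: "x", 150: "y"}
replica_b = {10: "x", 150: "stale"}
# Only the second range diverges, so only that range needs syncing
assert divergent_ranges(replica_a, replica_b, ranges) == [(100, 200)]
```

A real Merkle tree makes this comparison hierarchical — compare root hashes first, descend only into mismatched subtrees — so that in-sync replicas exchange O(1) data.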
Synchronous vs. Asynchronous Replication
Synchronous replication means the primary waits for at least one replica to confirm receipt (and optionally apply) the write before acknowledging the client. PostgreSQL's synchronous_standby_names enables this. Zero data loss on primary failure, but added write latency for every commit.
Asynchronous replication means the primary acknowledges the client immediately and ships changes to replicas independently. Faster writes, but a primary crash before replication occurs means recent writes can be lost.
Replication lag is how far behind a replica is from the primary — measured in seconds or bytes of WAL. High lag means reads routed to replicas may return stale data. In PostgreSQL: pg_stat_replication.write_lag. In MySQL: Seconds_Behind_Source from SHOW REPLICA STATUS (formerly Seconds_Behind_Master from SHOW SLAVE STATUS).
Replication factor (RF): The number of copies of each piece of data. RF=3 means three nodes hold each row. Higher RF increases fault tolerance (the system can lose RF-1 nodes without data loss) at the cost of write amplification.
📦 Partitioning and Sharding: Distributing Data Across Nodes
As data volumes grow beyond what a single node can store or serve, you partition data across multiple nodes. Understanding the vocabulary here prevents the most common interview stumble — conflating partitioning with sharding.
Partitioning vs. Sharding: Partitioning is the general concept of splitting a dataset into subsets. Sharding is the specific technique of horizontal partitioning across multiple database nodes (physical machines or instances). All sharding is partitioning; not all partitioning is sharding (e.g., a single-node table partitioned by date is partitioned but not sharded).
Horizontal partitioning: Rows are split across nodes by a partition key. Node 1 holds users 1–1,000,000; node 2 holds users 1,000,001–2,000,000. This is what people typically mean when they say "sharding."
Vertical partitioning: Columns are split. A wide user record might have its frequently-accessed fields on one node (profile, username) and infrequently-accessed fields on another (audit logs, raw clickstream). Used to optimise I/O patterns and cache efficiency.
Range partitioning: Rows are assigned to partitions based on key ranges. Easy to support range queries (give me all orders from March). Risk: hot partitions (all recent activity hits the latest range shard).
Hash partitioning: Rows are assigned by hash(key) mod N. Distributes load evenly. Downsides: range queries are expensive because adjacent keys end up on different nodes, and changing the node count N remaps almost every key — which is the redistribution problem consistent hashing exists to solve.
Consistent hashing solves the redistribution problem with hash partitioning. Nodes and keys are mapped to positions on a conceptual ring. A key is assigned to the first node clockwise on the ring from the key's hash position. When a node is added or removed, only keys near that node's ring position need to be redistributed — not the entire dataset. Used by Cassandra and DynamoDB.
Virtual nodes (vnodes): Instead of each physical node occupying one contiguous arc on the ring, it occupies many small arcs (typically 256 per node). This ensures more even key distribution and makes redistribution smoother when nodes are added or removed.
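A minimal ring with vnodes fits in a few lines and makes the redistribution claim measurable. This is an illustrative sketch (the Ring class and helper names are mine), using Python's bisect to find the first vnode clockwise of a key:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring with virtual nodes (sketch)."""
    def __init__(self, nodes, vnodes=64):
        # Each physical node occupies many small arcs on the ring
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.positions = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First vnode clockwise from the key's hash (wrapping at the end)
        i = bisect.bisect(self.positions, h(key)) % len(self.ring)
        return self.ring[i][1]

before = Ring(["n1", "n2", "n3"])
after = Ring(["n1", "n2", "n3", "n4"])   # add one node
keys = [f"user:{i}" for i in range(1000)]
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Roughly 1/4 of keys move to the new node -- with mod-N hashing,
# going from 3 to 4 nodes would remap about 3/4 of all keys
assert 0 < moved < 500
```

Production systems layer replication on top of this: each key is stored on the owning node plus the next RF-1 distinct nodes clockwise.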
Hot partition / hot spot: One partition receives a disproportionate share of reads or writes. Common cause: a shard key that is temporally biased (e.g., created_at) or a celebrity problem (one user ID generates millions of events). Detection: monitor per-shard throughput and latency in your database's observability dashboards.
Shard key: The attribute used to route a row to its partition. Choosing the right shard key is the single most important sharding decision. A bad shard key creates hot spots; a good shard key distributes load evenly and co-locates frequently-joined data.
Resharding: Redistributing data when partition boundaries change — either because you are adding nodes (expanding) or because hot spots require redistribution. Resharding in a live system is operationally complex; consistent hashing minimises but does not eliminate this.
🕐 Ordering and Time: Why "Now" Has No Meaning in a Distributed System
In a single-machine system, time is simple: events have a total order enforced by a single clock and a single CPU. In a distributed system, machines have independent clocks that drift, and messages arrive out of order. The vocabulary in this section defines how distributed systems reason about causality and ordering without trusting wall clocks.
Wall clock time: The system clock on a machine (e.g., System.currentTimeMillis()). Wall clocks are unreliable for ordering distributed events because clock skew — different machines having different time values — can exceed the gap between events. Google's Spanner measured clock skew up to 7 ms across datacenters; events separated by less than 7 ms cannot be confidently ordered by clock alone.
Logical clock (Lamport clock): A monotonically increasing integer counter maintained by each process. Rules: (1) increment the counter before each event; (2) when sending a message, include the current counter; (3) on receiving a message, set your counter to max(local, received) + 1. Lamport clocks establish a partial order: if A happened before B, A's timestamp is lower than B's. But the converse is not guaranteed — a lower timestamp does not necessarily mean it happened first in real time.
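The three rules translate directly into code. Here is a minimal sketch (class and method names are mine for illustration):

```python
class LamportClock:
    """Lamport logical clock: increment before each local event, attach
    the counter to outgoing messages, and on receive take
    max(local, received) + 1."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Rule 1: increment before each local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Rule 2: a send is an event; return the timestamp to attach."""
        return self.tick()

    def receive(self, ts: int) -> int:
        """Rule 3: jump past the sender's clock."""
        self.time = max(self.time, ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
ts = a.send()                # a.time is now 1
assert b.receive(ts) == 2    # max(0, 1) + 1 -- b is now "after" the send
assert a.tick() == 2         # an independent local event on a
```

Note that a's second event and b's receive both carry timestamp 2 — they are concurrent, and Lamport timestamps alone cannot distinguish concurrency from causal ordering. That is the gap vector clocks close.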
Vector clock: An extension of Lamport clocks where each node maintains a vector of counters — one per node in the system. [nodeA: 3, nodeB: 1, nodeC: 2]. Vector clocks can determine causal relationships: event X causally precedes event Y if every entry in X's vector is ≤ the corresponding entry in Y's vector, with at least one strictly less. Vector clocks enable conflict detection in leaderless systems (used by Amazon Dynamo).
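The comparison rule — every entry ≤, at least one strictly <, with missing entries treated as 0 — is easy to implement and worth internalising. A sketch (function names are illustrative):

```python
def happens_before(x: dict, y: dict) -> bool:
    """x -> y iff every entry of x is <= the matching entry of y and
    at least one is strictly less (missing entries count as 0)."""
    nodes = set(x) | set(y)
    all_le = all(x.get(n, 0) <= y.get(n, 0) for n in nodes)
    any_lt = any(x.get(n, 0) < y.get(n, 0) for n in nodes)
    return all_le and any_lt

def concurrent(x: dict, y: dict) -> bool:
    """Neither event causally precedes the other: a genuine conflict."""
    return not happens_before(x, y) and not happens_before(y, x)

# {A:1} causally precedes {A:2, B:1} -- the second saw the first
assert happens_before({"A": 1}, {"A": 2, "B": 1})
# Each clock is ahead on a different axis: concurrent writes, i.e. a
# conflict a leaderless store must surface or merge
assert concurrent({"A": 2, "B": 1}, {"A": 1, "B": 2})
```

When a Dynamo-style store finds two versions whose clocks are concurrent, it cannot pick a winner by causality alone — it must merge them or hand both back to the application.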
TrueTime (Spanner): Google Spanner's approach to distributed time uses GPS receivers and atomic clocks to provide a bounded uncertainty window [earliest, latest] around the current time. Spanner's external consistency guarantee relies on waiting out this uncertainty interval before committing, ensuring any reader that starts after a commit will observe it. It remains one of the very few production systems providing externally consistent (linearizable) transactions at global scale.
Happens-before (→): Formalized by Leslie Lamport. Event A → B means A causally influenced B: either A and B are on the same process and A occurred before B, A sent a message that B received, or A → C and C → B. Two events with no happens-before relationship are concurrent — neither could have influenced the other.
Causality: The property that captures which events could have influenced which. Causal consistency guarantees that if you see the effect of an event, you also see the cause. This is stronger than eventual consistency but weaker than linearizability.
Epoch: In consensus protocols, an epoch (called a term in Raft, a ballot in Paxos) is a monotonically increasing round number. A new epoch is started whenever a new leader is elected. Messages from a leader with a stale epoch number are rejected, preventing two leaders from both believing they are authoritative.
Monotonic reads: A session guarantee: once a client reads a value at version V, it will never read a value older than V in subsequent reads. Prevents the disorienting scenario where refreshing a page shows older data than the previous page load.
Monotonic writes: A session guarantee: a client's writes are applied in the order they were issued. Without this, write 2 can arrive at a replica before write 1, corrupting the sequence.
Read-your-writes: A session guarantee: after a client writes a value, any subsequent read by the same client returns that value (or a newer one). Essential for user-facing applications — submitting a form and not seeing the update on the next page load is a jarring experience. Typically implemented by routing a client's reads to the same replica that received its writes, or by using a version token.
🗳️ Consensus and Leader Election: Deciding Who Gets the Final Say
Consensus is the problem of getting multiple distributed nodes to agree on a single value, even when some nodes may crash or messages may be delayed. It underpins leader election, distributed transactions, and replicated state machines.
Consensus: Formally, a consensus algorithm must satisfy: (1) Validity — any decided value was proposed; (2) Agreement — no two nodes decide differently; (3) Termination — every non-faulty node eventually decides. The FLP impossibility result proves that deterministic consensus is impossible in a fully asynchronous system with even one crash failure — which is why practical implementations use timeouts.
Paxos: The first widely-studied consensus algorithm, published by Leslie Lamport. Paxos has three roles: Proposer (initiates a value proposal), Acceptor (votes on proposals), and Learner (observes the decided value). The algorithm proceeds in two phases: Prepare/Promise (establishing authority for an epoch) and Accept/Accepted (getting a quorum to accept a value). Paxos is notoriously difficult to implement correctly in production, leading to multi-Paxos variants (used by Google Chubby).
Raft: Designed explicitly for understandability as an alternative to Paxos. Raft separates consensus into three sub-problems: leader election (elect one node per term), log replication (leader accepts entries and replicates to followers), and safety (a leader must have all committed entries from previous terms). Used by etcd, CockroachDB, and TiDB. The Raft paper's figure 2 is the specification; every production implementation derives from it.
Leader election: The process of choosing one node as the single authoritative coordinator for writes or decisions. A well-designed election must be safe (at most one leader per epoch) and live (a new leader is eventually elected after a failure). Raft achieves safety by requiring a candidate to receive votes from a majority before becoming leader for a given term.
Split brain: The catastrophic scenario where two nodes simultaneously believe they are the leader. In a replicated database, this means two nodes each accepting writes with no reconciliation — both believe their writes are authoritative. The result is divergent data that cannot be automatically merged. Split brain is prevented by requiring a majority quorum for leadership (a cluster of 5 cannot produce two majorities of 3 simultaneously).
Fencing token: Even with leader election and epoch numbers, a slow leader that was paused (e.g., garbage collection, VM live migration) can resume believing it is still the leader after a newer leader has been elected. A fencing token is a monotonically increasing integer issued by the lock service (e.g., ZooKeeper's zxid) when a leader acquires a lock. The storage system rejects any write that includes a token lower than the highest token it has seen. This prevents a stale leader from overwriting data committed by a newer leader.
The diagram below shows the fencing token mechanism in action:
sequenceDiagram
participant L1 as Leader 1 (stale)
participant ZK as ZooKeeper
participant L2 as Leader 2 (current)
participant ST as Storage
L1->>ZK: Acquire lock, receive token=33
Note over L1: Leader 1 pauses (GC / network)
ZK-->>L2: Leader 1 timeout - new election
L2->>ZK: Acquire lock, receive token=34
L2->>ST: Write data with token=34
ST-->>L2: OK - token 34 accepted
Note over L1: Leader 1 resumes, still thinks it has lock
L1->>ST: Write data with token=33
ST-->>L1: REJECTED - token 33 less than seen token 34
ZooKeeper issues monotonically increasing tokens. Storage accepts writes only from the most recent token holder. A stale leader trying to write with an old token is rejected at the storage layer, regardless of what ZooKeeper's session state says.
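The storage-side check in the diagram is a one-line comparison. A minimal sketch of the fencing logic (the FencedStorage class is illustrative, not a real ZooKeeper or storage API):

```python
class FencedStorage:
    """Storage that rejects any write carrying a token lower than the
    highest token it has ever seen (the fencing check, sketched)."""
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        if token < self.max_token:
            return False          # stale leader: reject at the storage layer
        self.max_token = token    # remember the newest authority seen
        self.data[key] = value
        return True

st = FencedStorage()
assert st.write(34, "k", "from current leader")    # token 34 accepted
assert not st.write(33, "k", "from stale leader")  # token 33 rejected
assert st.data["k"] == "from current leader"
```

The crucial property is that the check happens at the storage layer: it works even if the stale leader's lock session, heartbeats, and local state all still claim it is in charge.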
🔁 Fault Tolerance and Recovery: Building Systems That Survive Failures
Distributed systems must be designed with the assumption that individual components will fail. The vocabulary in this section covers how systems detect failures, recover from them, and guarantee message delivery semantics.
Checkpointing: Periodically saving the complete state of a running system to stable storage, so recovery can restart from the checkpoint rather than replaying from the beginning of time. Used in Apache Flink (barrier-based checkpoints every few seconds), Apache Spark Streaming (DStream checkpoints), and traditional databases (fuzzy checkpoints in PostgreSQL). The checkpoint interval trades recovery time against checkpoint overhead.
Write-Ahead Log (WAL): Changes are written to a sequential, durable log before being applied to the main data structure. On crash recovery, the system replays the WAL from the last checkpoint forward. Used by PostgreSQL, MySQL InnoDB, RocksDB, and almost every durable storage system. The WAL is also the replication stream in PostgreSQL streaming replication — the same log that enables crash recovery enables replicas.
Idempotency: An operation is idempotent if applying it multiple times produces the same result as applying it once. SET balance = 500 is idempotent; ADD 50 TO balance is not. Idempotency is the fundamental safety property required for at-least-once delivery systems.
A canonical idempotency key pattern for a payment API:
// Client includes a unique idempotency key with each request
POST /payments
{
"idempotency_key": "client-generated-uuid-abc123",
"amount": 500,
"recipient": "acct_xyz"
}
// Server checks: has this key been processed?
IF EXISTS(processed_keys WHERE key = "client-generated-uuid-abc123"):
RETURN cached_response // return same result, do not recharge
ELSE:
process_payment()
store_key("client-generated-uuid-abc123", response)
RETURN response
The idempotency key is the client-generated UUID that uniquely identifies the logical operation. On retry (network timeout, client crash), the client resubmits with the same key. The server deduplicates by key and returns the original response without re-processing.
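The check-then-store flow above can be sketched as a tiny in-process server; this is an illustrative model (class name and response shape are mine), not a production handler — a real one needs an atomic shared store and key expiry:

```python
class IdempotentPaymentServer:
    """Dedupe by idempotency key: process once, cache the response,
    replay the cached response on retries (sketch)."""
    def __init__(self):
        self.processed = {}   # idempotency_key -> cached response
        self.charges = 0      # side effect we must not repeat

    def handle(self, key: str, amount: int) -> dict:
        if key in self.processed:
            return self.processed[key]   # retry: same response, no recharge
        self.charges += 1                # the side effect happens once
        resp = {"status": "ok", "amount": amount, "charge_id": self.charges}
        self.processed[key] = resp
        return resp

srv = IdempotentPaymentServer()
r1 = srv.handle("uuid-abc123", 500)
r2 = srv.handle("uuid-abc123", 500)   # client retried after a timeout
assert r1 == r2 and srv.charges == 1  # charged once, identical response
```

In a multi-process deployment the `key in self.processed` check and the store must be a single atomic operation (e.g. a unique-constraint insert), or two concurrent retries can both slip past the check.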
Exactly-once delivery: The guarantee that a message is processed exactly one time. Achievable in practice via idempotent consumers + deduplication at the application layer, or Kafka's exactly-once semantics (idempotent producers + transactions spanning consume-process-produce). True network-level exactly-once delivery is theoretically impossible — what is achieved is effectively exactly-once by making re-delivery safe.
At-least-once delivery: The system guarantees delivery of every message, but may deliver the same message more than once (due to retries, network failures, or consumer restarts). The receiver must be idempotent.
At-most-once delivery: The system sends a message once and does not retry. A message may be lost but is never processed twice. Used where loss is acceptable (UDP metrics, log sampling).
Heartbeat: A periodic signal (typically a small TCP packet or HTTP ping) sent by a node to indicate it is alive. A node that has not sent a heartbeat within a configurable timeout is assumed to have failed.
Timeout: The duration after which a node is assumed dead if no heartbeat or response is received. Setting the right timeout is a classic operational challenge: too short causes false positives (healthy but slow nodes are marked dead); too long delays failure detection and recovery.
Fail-stop failure: A node that fails by stopping entirely and sending no further messages. This is the simplest failure mode — silent, unambiguous, and detectable via heartbeat timeout.
Byzantine failure: A node that fails by behaving arbitrarily — sending incorrect, conflicting, or maliciously crafted messages. Byzantine fault tolerance requires a supermajority (3f+1 nodes to tolerate f Byzantine faults) and is the hardest failure model to design for. Used in blockchain consensus protocols (PBFT, Tendermint) but rare in traditional data infrastructure.
🧠 Storage Engine Internals: Why LSM Trees and B-Trees Make Different Bets
The choice of storage engine determines write throughput, read latency, compaction overhead, and operational complexity. Understanding LSM trees vs. B-trees is a standard intermediate systems question.
Internals of the LSM Tree Write Path
flowchart TD
W[Client Write] --> M[MemTable - in-memory sorted buffer]
M -->|MemTable full - flush| S0[L0 SSTable - immutable on disk]
S0 -->|L0 compaction trigger| S1[L1 SSTables - sorted - compacted]
S1 -->|tiered compaction| S2[L2 SSTables - larger - merged]
S2 --> S3[L3 SSTables - largest - most compacted]
W -.->|WAL for durability| WAL[Write-Ahead Log]
The LSM tree write path starts in memory (MemTable), flushes to immutable disk files (SSTables) when the MemTable fills, and merges those files through compaction levels to reclaim space and reduce read amplification. The WAL runs in parallel to the MemTable, ensuring durability even if the process crashes before the MemTable is flushed.
LSM Tree (Log-Structured Merge-Tree)
What it is: A write-optimised storage engine architecture used by Apache Cassandra, RocksDB, LevelDB, ScyllaDB, and Bigtable. The design principle is to make writes sequential (the fastest disk I/O operation) by never updating data in-place.
Write path: All writes go to an in-memory sorted buffer called the MemTable. The MemTable is also protected by a sequential WAL on disk for crash durability. When the MemTable reaches a size threshold, it is flushed to disk as an immutable sorted file called an SSTable (Sorted String Table). Over time, many SSTables accumulate. A background compaction process merges SSTables, discards deleted keys (represented by tombstones — special markers that signal deletion), and rewrites the merged result into a new, larger SSTable.
Read path: More expensive than writes. A read must check the MemTable first (most recent data), then SSTables from newest to oldest (L0, L1, L2...) until the key is found. A Bloom filter — a probabilistic data structure that answers "is this key definitely not in this SSTable?" — eliminates most unnecessary disk reads. If the Bloom filter says no, the SSTable can be skipped entirely with a single bit check.
Trade-off: Write amplification is low; read amplification is higher. Compaction consumes I/O and CPU in the background, which can affect latency under heavy write load.
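The write and read paths above can be sketched in a few dozen lines. This is a toy illustration only, not a real storage engine: a dict stands in for the MemTable, a sorted list of pairs for each SSTable, and a plain key-set plays the role a Bloom filter plays in a real engine (answering "definitely not here").

```python
# Toy LSM tree sketch: writes land in an in-memory MemTable; when full,
# it is flushed as an immutable sorted SSTable; reads check the MemTable
# first, then SSTables newest-to-oldest, skipping any whose key-set
# "filter" rules the key out.

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # newest last; each entry is (keyset, sorted pairs)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        table = sorted(self.memtable.items())
        # a real engine uses a Bloom filter; a plain set plays that role here
        self.sstables.append((set(self.memtable), table))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                       # most recent data first
            return self.memtable[key]
        for keyset, table in reversed(self.sstables):  # newest to oldest
            if key not in keyset:                      # "definitely not here": skip
                continue
            for k, v in table:
                if k == key:
                    return v
        return None

db = ToyLSM()
db.put("a", 1)
db.put("b", 2)      # MemTable full: flushed as an L0 SSTable
db.put("a", 3)      # newer version of "a" lives only in the MemTable
print(db.get("a"))  # 3 - the MemTable shadows the older SSTable version
print(db.get("b"))  # 2 - found in the flushed SSTable
```

Note what is missing compared with a real engine: compaction (merging SSTables and dropping tombstones), the WAL, and a probabilistic Bloom filter instead of an exact key set.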
B-Tree
What it is: A read-optimised, balanced tree structure used by PostgreSQL (heap + B-tree indexes), MySQL InnoDB, Oracle, and SQLite. B-trees update data in-place on disk.
Write path: Finds the appropriate leaf page in the tree and writes the new or updated value directly to that page. Pages that are too full are split. The WAL records the change for crash recovery, but the actual data page is modified on disk.
Read path: Efficient O(log n) traversal from root to leaf. All data is in sorted order on disk, making range queries fast with sequential I/O.
Trade-off: Read performance is excellent; write performance is good for random updates but requires random I/O (writing to a specific page wherever it is on disk). Writes in B-trees go to exactly the right place; LSM writes always go to the end of the current WAL/MemTable.
| Property | LSM Tree | B-Tree |
| :--- | :--- | :--- |
| Write performance | Excellent (sequential append) | Good (random in-place write) |
| Read performance | Good with Bloom filters | Excellent (direct B-tree traversal) |
| Space amplification | Higher during compaction | Lower (in-place updates) |
| Write amplification | Higher (compaction rewrites data) | Lower |
| Best for | High write throughput (Cassandra, RocksDB, ClickHouse) | Mixed read/write OLTP (PostgreSQL, MySQL) |
Performance Analysis: LSM Tree vs B-Tree Trade-offs
The LSM tree's sequential-write design produces high write throughput but introduces read amplification and write amplification as structural costs. Read amplification occurs because a read must consult the MemTable plus potentially all SSTable levels before finding the target key (mitigated by Bloom filters). Write amplification occurs because compaction re-writes the same data multiple times as it moves through levels (L0 → L1 → L2 → …). In RocksDB with level compaction, a single user write can result in 10–30 bytes written to disk per byte of actual data.
B-trees have lower read amplification (O(log n) direct traversal to the correct page) but suffer from random write I/O — the engine must locate and update the specific disk page where the key lives, which translates to random seeks on spinning disks or random-access write patterns on SSDs. B-trees also tend to have higher space amplification than a fully compacted LSM tree because partially-filled pages are never reclaimed without explicit VACUUM/page rebalancing.
Operational implication: In Cassandra or RocksDB under high write load, monitor compaction_pending_tasks and compaction_executor_active_tasks. When pending tasks accumulate faster than compaction can drain them, read latency degrades because there are more SSTables to scan per key lookup. Plan compaction throughput (in MB/s) as part of capacity planning for any LSM-based system.
MVCC (Multi-Version Concurrency Control)
What it is: A concurrency control technique where the database maintains multiple versions of each row rather than updating in-place. When a transaction modifies a row, a new version is created; old versions are kept until no active transaction needs them.
Why it matters: MVCC allows readers and writers to operate concurrently without blocking each other. A reader gets a consistent snapshot of the database at its start time (seeing only versions committed before it started). Writers create new versions. Garbage collection (VACUUM in PostgreSQL, purge in MySQL InnoDB) removes versions that are no longer needed by any active transaction.
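The snapshot-visibility rule can be made concrete with a minimal sketch, modeled loosely on PostgreSQL's xmin/xmax scheme (the dict shape here is hypothetical): a row version is visible to a snapshot iff the transaction that created it had committed when the snapshot was taken, and no transaction in the snapshot has deleted it.

```python
# Minimal MVCC visibility sketch. A "snapshot" is the set of transaction
# IDs (xids) that had committed when the reading transaction started.

def visible(version, snapshot):
    created_ok = version["xmin"] in snapshot                 # creator committed
    deleted_ok = version["xmax"] is None or version["xmax"] not in snapshot
    return created_ok and deleted_ok

# Version chain for one row: xid 10 created it, xid 20 later replaced it.
versions = [
    {"value": "old", "xmin": 10, "xmax": 20},
    {"value": "new", "xmin": 20, "xmax": None},
]

early_snapshot = {10}       # taken before xid 20 committed
late_snapshot = {10, 20}    # taken after xid 20 committed

print([v["value"] for v in versions if visible(v, early_snapshot)])  # ['old']
print([v["value"] for v in versions if visible(v, late_snapshot)])   # ['new']
```

Both readers see exactly one version of the row without blocking the writer; garbage collection would eventually remove the "old" version once no active snapshot can see it.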
📡 Observability: Reading the Vital Signs of a Distributed Production System
Building a distributed system is only half the work. Being able to observe, diagnose, and understand it in production is the other half.
Replication lag (revisited with measurement tools): In PostgreSQL, pg_stat_replication exposes write_lag, flush_lag, and replay_lag per replica — the wall-clock time between when the primary wrote a WAL record and when the replica wrote/flushed/applied it. In MySQL: SHOW REPLICA STATUS returns Seconds_Behind_Source (Seconds_Behind_Master in older versions). In Kafka: the MirrorMaker consumer lag metric. High replication lag is often the first warning sign of a replica falling behind before it becomes a consistency problem.
Consumer lag (Kafka): The number of messages between a consumer group's current committed offset and the latest offset on the partition. A consumer lag of 500,000 messages means the consumer is 500,000 events behind the latest. Monitored via Kafka's __consumer_offsets topic, JMX metrics, or Confluent Control Center. Spikes in consumer lag indicate the consumer is too slow — possible causes: processing bottleneck, broker GC pause, consumer rebalance.
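The lag arithmetic is simple enough to sketch directly (the partition numbers and offsets below are made-up sample data):

```python
# Consumer lag per partition, as defined above:
#   lag = latest offset on the partition - consumer group's committed offset

latest_offsets    = {0: 1_250_000, 1: 980_000, 2: 1_100_000}
committed_offsets = {0: 1_249_100, 1: 480_000, 2: 1_100_000}

lag = {p: latest_offsets[p] - committed_offsets[p] for p in latest_offsets}
print(lag)                    # {0: 900, 1: 500000, 2: 0}
print(max(lag, key=lag.get))  # 1 - the partition whose consumer is furthest behind
```

In practice these numbers come from the broker (JMX metrics or the __consumer_offsets topic), but the per-partition view matters: an aggregate lag number can hide one badly stalled partition behind many healthy ones.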
Backpressure: The mechanism by which a downstream consumer signals to an upstream producer to slow down. Without backpressure, a fast producer overwhelms a slow consumer's buffers and causes crashes or data loss. Reactive Streams (Java), Akka Streams, and Apache Flink all have first-class backpressure support.
Fanout: One write that triggers many downstream reads or writes. Classic example: a user with 10 million followers posts a tweet. The write fanout — writing that tweet to 10 million followers' home timeline feeds — represents 10 million downstream writes. Systems that handle fanout-heavy workloads must be designed to queue and throttle fanout workers (Twitter's Finagle/Flock architecture). Contrast push fanout (write to all follower caches on post) vs. pull fanout (users read from a global feed and merge at read time).
Thundering herd: When many clients simultaneously try to access a resource — often triggered by a cache expiry, server restart, or a popular item suddenly being requested. If 10,000 clients all miss the cache simultaneously and all hit the database, the database receives 10,000 concurrent queries it was not designed to handle. Mitigation: probabilistic early expiry (refresh cache slightly before it expires), distributed locks on cache population, request coalescing.
Hot key: A single cache or database key receiving disproportionate traffic. The celebrity problem — every user querying a single celebrity's profile simultaneously. Mitigation: fan out the hot key to multiple shards (read from a random shard among hot-key replicas), use a local in-process cache for the hot key, or add client-side caching.
Cache stampede: The specific thundering herd scenario where many threads simultaneously discover a cache miss for the same key and all start computing/fetching the value to populate the cache. Mitigation: lock the cache key during population (first thread to detect the miss holds the lock; others wait or return stale data).
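The lock-the-key-during-population mitigation can be sketched as a single-flight wrapper. This is a single-process, thread-based illustration (the function names are hypothetical); a distributed deployment would use a distributed lock or request coalescing at a shared caching tier instead.

```python
# Single-flight cache population: the first thread to miss acquires a
# per-key lock and computes the value; threads arriving meanwhile wait
# on the lock and reuse the cached result instead of all hitting the
# database in parallel.

import threading
from collections import defaultdict

cache = {}
key_locks = defaultdict(threading.Lock)
db_calls = 0

def expensive_db_fetch(key):
    global db_calls
    db_calls += 1          # count how often the backing store is actually hit
    return f"value-for-{key}"

def get_with_single_flight(key):
    if key in cache:
        return cache[key]
    with key_locks[key]:           # only one thread populates per key
        if key in cache:           # re-check: another thread may have filled it
            return cache[key]
        cache[key] = expensive_db_fetch(key)
        return cache[key]

threads = [threading.Thread(target=get_with_single_flight, args=("hot",))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(db_calls)  # 1 - fifty concurrent misses, one database query
```

The double-check inside the lock is the essential detail: without it, every thread that queued on the lock would still run its own fetch after acquiring it.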
🌐 CAP Theorem and PACELC: The Consistency–Availability Trade-off Decoded
The CAP theorem and its successor PACELC are the theoretical frameworks behind every "which database should I use" decision in a distributed system design discussion.
CAP theorem (Brewer, 2000): In the presence of a network partition (P), a distributed system must choose between Consistency (C — every read returns the most recent write) and Availability (A — every request receives a response, though it may not be the latest data). A system cannot guarantee both simultaneously during a partition.
Important nuance: CAP's "Consistency" refers specifically to linearizability — not eventual consistency, not read-your-writes, not causal consistency. Many discussions muddy this. When an interviewer asks about CAP, they mean linearizability.
CP system: Prioritises consistency over availability during partitions. During a partition, a CP system will refuse to respond (or return an error) rather than return potentially stale data. Examples: HBase, Apache ZooKeeper, etcd, Google Bigtable.
AP system: Prioritises availability over consistency during partitions. During a partition, the system continues serving requests using whatever local data it has, accepting that responses may be stale or conflict when the partition heals. Examples: Apache Cassandra, Amazon DynamoDB (default), CouchDB, Riak.
PACELC (Daniel Abadi, 2012): Extends CAP to also characterise the latency-consistency trade-off when there is no partition. The full acronym: if there is a Partition (P), choose between Availability (A) and Consistency (C). Else (E), even without a partition, choose between Latency (L) and Consistency (C).
| Database | Partition behaviour | No-partition behaviour | PACELC class |
| :--- | :--- | :--- | :--- |
| Cassandra | Available | Low latency over consistency | PA/EL |
| DynamoDB (eventual) | Available | Low latency | PA/EL |
| HBase | Consistent | High consistency, higher latency | PC/EC |
| ZooKeeper / etcd | Consistent | High consistency | PC/EC |
| PostgreSQL (sync replication) | Consistent | Higher latency | PC/EC |
| CockroachDB | Consistent | Balanced | PC/EC |
| MongoDB (default) | Available | Low latency | PA/EL |
| Spanner | Consistent | Consistent (TrueTime) | PC/EC |
The following diagram shows the consistency model spectrum from weakest to strongest, with real-world database examples at each level:
```mermaid
flowchart LR
    EC[Eventual Consistency - Cassandra default] --> SEC[Strong Eventual - CRDTs]
    SEC --> BS[Bounded Staleness - Azure Cosmos]
    BS --> SS[Session Consistency - DynamoDB sessions]
    SS --> CC[Causal Consistency - MongoDB causal]
    CC --> SC[Sequential Consistency - ZooKeeper]
    SC --> LIN[Linearizability - Spanner - etcd]
```
Each step from left to right adds stronger guarantees but increases coordination cost (and therefore latency). Eventual consistency at the far left requires no cross-node coordination on reads. Linearizability at the far right requires that a read confirm the current value with a quorum of nodes before returning.
Linearizability: The system behaves as if there is a single copy of the data and all operations take effect instantaneously at some point between their invocation and their response. The strongest consistency model. Any read returns the most recent committed write, globally. Achieved by single-leader replication with synchronous reads to the leader, or by consensus-based protocols (Raft, Paxos, Spanner TrueTime).
Sequential consistency: All processes see operations in the same global order, but that order does not need to match real-time order. Weaker than linearizability. Two clients may observe an order that does not match wall-clock time, but they agree on a single consistent sequence.
Causal consistency: Operations that have a causal relationship (A → B) are seen in that order by all nodes. Concurrent operations (neither causally precedes the other) may be seen in different orders by different nodes. Stronger than eventual, weaker than sequential.
Eventual consistency: Given no new writes, all replicas will eventually converge to the same value. No timing guarantee. The weakest useful consistency model. Used where availability and low latency matter more than freshness.
Strong eventual consistency (SEC): Eventual consistency with the additional property that all nodes that have received the same set of updates will have the same state — even if updates arrived in different orders. Achieved using CRDTs (Conflict-free Replicated Data Types), which define merge operations that are commutative, associative, and idempotent.
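The simplest CRDT makes those three merge properties concrete: a G-Counter (grow-only counter) where each node increments its own slot and merge takes the element-wise maximum. This is a minimal sketch, with replicas represented as plain dicts:

```python
# G-Counter CRDT: merge is commutative, associative, and idempotent, so
# replicas that receive the same updates in any order converge to the
# same state - strong eventual consistency.

def merge(a, b):
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Three replicas increment independently, then gossip in different orders.
n1, n2, n3 = {"n1": 3}, {"n2": 5}, {"n3": 2}

state_a = merge(merge(n1, n2), n3)   # one gossip order
state_b = merge(n3, merge(n2, n1))   # a different order
print(state_a == state_b)            # True - order does not matter
print(value(state_a))                # 10
print(merge(state_a, state_a) == state_a)  # True - idempotent
```

Because each node only ever writes its own slot and max never loses information, no conflict can arise; richer CRDTs (PN-Counters, OR-Sets) extend the same idea to decrements and removals.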
Bounded staleness: Reads may lag behind the latest write by at most X seconds or X operations. Azure Cosmos DB exposes this as a named consistency level. Provides a predictable freshness window for analytics or reporting use cases that can tolerate known-bounded lag.
Session consistency: Within a single client session, all four session guarantees hold: read-your-writes, monotonic reads, monotonic writes, and writes-follow-reads (causal ordering within the session). Outside the session, weaker consistency applies. This is the most common consistency level for user-facing applications.
🌍 Where These Concepts Show Up in Production Systems You Know
These are not academic terms. Each definition in this glossary maps directly to a production decision made by engineers at companies you use every day.
Stripe and Idempotency Keys: Stripe's payment API mandates idempotency keys precisely because at-least-once delivery is the only safe guarantee over HTTPS. A payment request that times out cannot be retried safely without an idempotency key — re-sending the request without one would charge the customer twice. Every Stripe API client must generate a UUID per logical payment and include it as the Idempotency-Key header. Stripe's backend deduplicates by key, returning the original response for any retry.
Discord and Read Skew in Message History: Discord migrated from Cassandra to ScyllaDB (then to a custom Rust storage layer) partly to address read-path anomalies under high fan-out. In leaderless systems, a client reading message history can receive results from replicas at different replication positions — the distributed equivalent of read skew. Messages can appear out-of-order or temporarily invisible between retries. Discord's architecture evolved toward per-channel causal consistency, tracking the last_seen_message_id in the session layer to implement monotonic reads.
Google Spanner and Linearizability at Scale: Spanner is the only widely-deployed system that provides external consistency (linearizability) across globally distributed datacenters. It achieves this by using TrueTime to assign globally unique commit timestamps with bounded uncertainty. Every transaction waits out the TrueTime uncertainty interval before committing, ensuring that any external observer who queries after a commit will see it. The cost: commit latency includes the uncertainty window (~7 ms across datacenters), but write throughput remains practical for financial and inventory systems.
Netflix and Eventual Consistency in the Catalog: Netflix's content catalog is stored in Cassandra with eventual consistency. Because catalog metadata (title descriptions, artwork, ratings) changes infrequently and being briefly stale is acceptable, PA/EL (Available + Low Latency) is the right trade-off. A user seeing an outdated thumbnail for 30 seconds is far less harmful than a catalog service being unavailable while waiting for cross-datacenter confirmation.
ZooKeeper and Leader Election in Kafka: Apache Kafka uses Apache ZooKeeper (or, since Kafka 3.3, KRaft — its own Raft implementation) for controller election. The Kafka controller is the broker responsible for managing partition leadership and broker registration. Under ZooKeeper, the first broker to create the ephemeral /controller znode becomes controller, and a monotonically increasing controller epoch serves as the fencing token; in the general ZooKeeper leader-election recipe, candidates create ephemeral sequential znodes and the one holding the lowest sequence number is the leader. When the controller crashes, its ephemeral node is deleted automatically (session expiry), triggering re-election.
⚖️ Trade-offs in Distributed Systems: The Decisions That Define Your Architecture
Every term in this glossary exists because there is a trade-off underneath it. Understanding the trade-off is more useful in an interview than memorising the definition.
Consistency vs. Availability (CAP): Choosing linearizability (CP) means that during a network partition, the system refuses requests rather than serve stale data. Choosing availability (AP) means stale data is served but the system stays responsive. The trade-off is not just theoretical: during an AWS AZ partition, a CP database (HBase, etcd) will go read-only or refuse writes; an AP database (Cassandra, DynamoDB) continues to serve and accept writes, reconciling conflicts after the partition heals. Choose based on whether stale data is tolerable or whether a partial view of reality is acceptable.
Isolation Level vs. Throughput: Upgrading from Snapshot Isolation to Serializable in PostgreSQL adds SSI overhead (tracking read/write dependency sets). In write-heavy workloads, SSI can increase abort rates (transactions are rolled back more frequently to prevent serializability violations), reducing effective throughput. The rule: use Serializable only when your application has cross-row invariants that must hold. Use Snapshot Isolation for everything else — in PostgreSQL, that means explicitly upgrading from the Read Committed default to Repeatable Read, which is implemented as Snapshot Isolation.
Replication Factor vs. Write Latency: In Cassandra with RF=3 and QUORUM consistency, every write must be confirmed by 2 out of 3 replicas. Higher RF protects against more concurrent node failures but costs write latency (more round-trips) and disk space (more copies). The rule: RF=3 is the standard production minimum; RF=5 is used for critical data in multi-datacenter deployments. Never use RF=1 for data that cannot be reconstructed.
LSM Write Amplification vs. B-Tree Read Performance: LSM trees sacrifice write amplification (compaction re-writes data multiple times) for sequential write throughput. B-trees sacrifice random write performance for O(log n) reads. The rule: if your write:read ratio is above 10:1, an LSM engine (Cassandra, RocksDB) is likely faster. If your read:write ratio is above 10:1 or you need strong range-query performance, a B-tree engine (PostgreSQL, MySQL InnoDB) is likely better.
Quorum Size vs. Fault Tolerance: Increasing W or R in a quorum system improves consistency at the cost of latency and availability. W=N means no write can succeed unless all replicas are reachable — maximum consistency, minimum availability. W=1 means any single node can acknowledge a write — maximum availability, minimum consistency. The sweet spot for most systems is W=⌊N/2⌋+1 (majority write quorum), which keeps accepting writes with up to ⌊N/2⌋ nodes down for odd N.
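The majority-quorum arithmetic is worth computing explicitly. A short sketch, assuming a majority write quorum of ⌊N/2⌋ + 1 and that writes keep succeeding as long as at least W replicas are reachable:

```python
# Majority write quorum and the node failures it tolerates.

def majority_quorum(n):
    return n // 2 + 1            # floor(N/2) + 1

def tolerated_failures(n, w):
    # writes still succeed as long as at least w replicas are reachable
    return n - w

for n in (3, 5, 7):
    w = majority_quorum(n)
    print(n, w, tolerated_failures(n, w))
# 3 2 1
# 5 3 2
# 7 4 3
```

This is why RF=3 tolerates one node failure under QUORUM writes, and RF=5 tolerates two — each extra pair of replicas buys one more tolerable failure.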
Failure Mode: Split Brain: The most catastrophic failure in a replicated system. Prevention requires strict quorum enforcement (a node that cannot reach a quorum must refuse to accept writes, even if it believes it is the leader) and fencing tokens at the storage layer. Two prevention layers are better than one: epoch-based leader election in the consensus layer, plus monotonic fencing tokens enforced by the storage backend.
🧭 Decision Guide: Choosing the Right Consistency and Storage Model
Use this guide when you face a design question about consistency guarantees, isolation levels, or storage engines.
| Situation | Recommendation | Key reason |
| :--- | :--- | :--- |
| Financial transactions with multi-row invariants (e.g., double-entry ledger) | Serializable isolation | Write skew can violate ledger invariants; Snapshot Isolation is insufficient |
| High-throughput analytics reads on eventually consistent data | Eventual consistency (AP database) | Freshness is not critical; availability and throughput matter more |
| Single-datacenter OLTP with mixed reads and writes | PostgreSQL with Read Committed (default) or Repeatable Read | Good balance of concurrency and safety; upgrade to Serializable only if write skew is possible |
| Globally distributed system where reads must be fast in every region | Leaderless replication (Cassandra/DynamoDB) with LOCAL_QUORUM | Cross-datacenter synchronous writes are too slow; local quorum provides regional consistency |
| Write-heavy time-series ingestion (IoT sensors, log aggregation) | LSM-based engine (Cassandra, RocksDB, ClickHouse) | Sequential writes minimise random I/O; compaction reclaims space |
| Mixed OLTP with frequent range queries and point lookups | B-tree engine (PostgreSQL, MySQL InnoDB) | Sorted pages optimise range scans; B-tree traversal optimises point lookups |
| Distributed lock / leader election | ZooKeeper or etcd (CP systems) with fencing tokens | You need linearizability for lock safety; AP systems cannot provide it |
| User session data (read-your-writes, monotonic reads) | Session consistency (DynamoDB sessions, sticky routing) | Users expect to see their own writes; strong global consistency is unnecessary overhead |
| Avoid | Reason |
| :--- | :--- |
| Read Uncommitted in any production system | Dirty reads allow acting on uncommitted, potentially rolled-back data |
| Snapshot Isolation for scheduling or on-call systems with shared invariants | Write skew can violate invariants even when each transaction looks locally safe |
| Leaderless replication with W=1 for financial data | A single node failure before replication can cause data loss |
| RF=1 for any data that is not easily reconstructable | A single disk failure permanently loses the data |
🧪 Scenario Walkthroughs: Applying the Vocabulary to Real Design Problems
These scenarios test whether you can use the vocabulary correctly in context — the way an interviewer would present a problem.
Scenario 1: The On-Call Scheduling Bug (Write Skew in Practice)
Your team operates a hospital scheduling application. The invariant: at least one doctor must be on-call at all times. The application checks the on-call count before allowing a doctor to go off-call. Two doctors, Alice and Bob, both open the "Go Off-Call" dialog at the same moment. Both see 2 doctors on-call. Both transactions proceed.
What happened? Write skew. Each transaction read the shared invariant (on-call count ≥ 2), determined it was safe to write (decrement), and committed. Neither write individually violated the local pre-condition. The global invariant broke because the two transactions' reads and writes formed a conflicting dependency cycle that Snapshot Isolation cannot detect.
Fix: Use SELECT ... FOR UPDATE (pessimistic locking) on the on-call roster row before checking the count, or upgrade the isolation level to Serializable. Serializable Snapshot Isolation (PostgreSQL's SSI) will detect the read-write dependency cycle and abort one of the two transactions with a serialization failure, prompting a retry.
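The anomaly itself can be simulated in a few lines. This sketch models only the essential behaviour of Snapshot Isolation: both transactions run their check against the same start-of-transaction snapshot, and because they write disjoint rows, no write-write conflict is ever detected.

```python
# Write skew simulation: two transactions each read the same snapshot
# (2 doctors on-call), each sees the safety check pass, and each writes
# a different row - together breaking the invariant.

on_call = {"alice": True, "bob": True}

def go_off_call_under_snapshot_isolation(doctor, snapshot):
    # the check runs against the transaction's snapshot, not live state
    if sum(snapshot.values()) >= 2:      # "at least one other doctor remains"
        on_call[doctor] = False          # commits: rows are disjoint, so
                                         # Snapshot Isolation sees no conflict

snapshot = dict(on_call)                 # both transactions start here
go_off_call_under_snapshot_isolation("alice", snapshot)
go_off_call_under_snapshot_isolation("bob", snapshot)
print(sum(on_call.values()))  # 0 - the global invariant is violated
```

Under Serializable (SSI), the read-write dependency between the two transactions would form a dangerous cycle and one of them would abort instead of committing.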
Scenario 2: The Stripe Retry That Charged Twice (Idempotency Failure)
A mobile payment client sends a POST /charge request to a payment API. The network times out after 3 seconds. The client retries. The server processed the original request successfully but the response was lost in transit. The server receives the retry, treats it as a new request, and charges the customer again.
What happened? Missing idempotency key. The payment API was not idempotent. At-least-once delivery (the default for any retryable HTTP call) combined with a non-idempotent server handler produces duplicate charges.
Fix: The client generates a UUID for the logical payment intent before the first attempt. Every retry includes the same UUID as the Idempotency-Key header. The server stores (idempotency_key, response) in a durable key-value store with a TTL. On receiving a request, the server checks for the key first. If found, return the cached response. If not found, process and store. The idempotency key makes the endpoint safe for at-least-once delivery.
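The fix can be sketched end to end. An in-memory dict stands in for the durable key-value store, and charge() with its response shape is hypothetical; the structure (check, process, store, replay) is the part that matters.

```python
# Idempotency-key deduplication: retries with the same key replay the
# original response instead of re-running the side effect.

import uuid

idempotency_store = {}   # idempotency_key -> cached response
charges_processed = []

def charge(amount_cents):
    charges_processed.append(amount_cents)   # the side effect we must not repeat
    return {"status": "succeeded", "amount": amount_cents}

def handle_charge(idempotency_key, amount_cents):
    if idempotency_key in idempotency_store:       # retry: replay original response
        return idempotency_store[idempotency_key]
    response = charge(amount_cents)
    idempotency_store[idempotency_key] = response  # store before acknowledging
    return response

key = str(uuid.uuid4())                 # one UUID per logical payment intent
first = handle_charge(key, 500)
retry = handle_charge(key, 500)         # network timeout -> client retried
print(first == retry)                   # True - same cached response replayed
print(len(charges_processed))           # 1 - the customer was charged once
```

A production version must store the (key, response) pair durably and atomically with the charge itself; if the server crashes between charging and storing the key, the retry would still double-charge.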
Scenario 3: The Quorum That Was Not Strong Enough
A Cassandra cluster with N=3, W=1, R=1 is storing user profile data. One node receives a write (user changes their email address) and acknowledges it. The node crashes before replication. A subsequent read with R=1 contacts one of the two remaining nodes — neither of which has the new email. The user sees their old email address even though the write was confirmed.
What happened? W + R = 2, which is not greater than N = 3. The quorum condition W + R > N is not satisfied. A write acknowledged by 1 node and a read contacting 1 different node have no guaranteed overlap.
Fix: Set W=2, R=2 (W + R = 4 > N = 3). Now any read quorum of 2 must include at least one of the 2 nodes that confirmed the write. Alternatively, W=1, R=3 (read all nodes) also satisfies the condition, but at higher read latency cost. For user profile data where read-your-writes is important, W=2, R=2 with LOCAL_QUORUM in a single datacenter is the standard configuration.
🔍 How Open-Source Systems Implement These Concepts
Understanding the vocabulary in isolation is not enough. Seeing how mature open-source systems implement each concept cements the definitions.
PostgreSQL and Isolation Levels: PostgreSQL implements Read Committed, Repeatable Read, and Serializable using MVCC. Each transaction is assigned a transaction ID (xid). Row versions carry xmin (the xid that created the row) and xmax (the xid that deleted it). A transaction's snapshot includes all xids committed before it started. Serializable is implemented via Serializable Snapshot Isolation (SSI), which tracks read/write dependencies and aborts transactions that form a dangerous cycle. Configure with:
```sql
-- Set isolation for a single transaction
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- or per-session
SET default_transaction_isolation = 'repeatable read';
```
Cassandra and Quorum: Cassandra implements leaderless replication with tunable consistency. Each read or write operation specifies a consistency level that determines the quorum size. QUORUM in a single datacenter means floor(RF/2) + 1 replicas must respond. LOCAL_QUORUM means a quorum within the local datacenter only (used for multi-datacenter deployments where cross-datacenter latency would be unacceptable). Hinted handoff and read repair handle temporary node failures. For a full Cassandra deep-dive, see System Design Replication and Failover.
```yaml
# cassandra.yaml - hinted handoff configuration
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000  # 3 hours - max time to store hints
hinted_handoff_throttle_in_kb: 1024
```
ZooKeeper and Fencing Tokens: ZooKeeper provides distributed coordination via sequential ephemeral znodes. Every create of a sequential node appends a monotonically increasing sequence number to the node name. A service acquiring a distributed lock creates a sequential ephemeral node and receives its sequence number — this is the fencing token. The storage backend that this lock protects rejects operations carrying a fencing token lower than the highest it has seen. Ephemeral nodes are automatically deleted when the client session expires, so a crashed leader automatically releases its lock.
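The storage-side check is the part that is easy to get wrong, and it is tiny. A minimal sketch (class and method names are illustrative):

```python
# Fencing-token enforcement at the storage backend: record the highest
# token seen and reject writes carrying a lower one, so a paused
# ex-leader holding an old token cannot corrupt data.

class FencedStore:
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                 # stale leader: reject the write
        self.highest_token = token       # tokens only move forward
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(33, "config", "v1"))   # True  - leader holding token 33
print(store.write(34, "config", "v2"))   # True  - new leader, higher token
print(store.write(33, "config", "v3"))   # False - old leader wakes up, rejected
print(store.data["config"])              # v2
```

Note that the check lives in the storage system, not the lock service: a lock service that hands out tokens but is never consulted on the write path provides no fencing at all.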
Kafka and Consumer Lag / Exactly-Once: Kafka tracks consumer progress via committed offsets stored in the __consumer_offsets internal topic. Consumer lag for a partition = latest_offset - consumer_committed_offset. Kafka's exactly-once semantics (EOS) combines idempotent producers (each message carries a producer ID + sequence number; the broker deduplicates retries) with transactional APIs (atomically write to multiple partitions and commit the consumer offset in the same transaction). For deeper Kafka internals, see How Kafka Works.
📚 Lessons Learned: What Production Systems Teach You About This Vocabulary
These are the hard-won lessons that distinguish engineers who understand the definitions from engineers who have operated these systems at scale.
Snapshot Isolation is not Serializable — and production bugs live in that gap. Many engineers assume that MVCC with Snapshot Isolation (the default in many cloud databases) prevents all isolation anomalies. It does not prevent write skew. The write skew scenario — two concurrent transactions each reading a shared condition and writing in a way that together violates an invariant — is exactly the class of bug that a banking or scheduling system can silently produce under Snapshot Isolation. Always check whether your application has invariants that span multiple rows or multiple objects. If it does, escalate to Serializable.
CAP is about partitions, not about consistency in steady state. A common interview mistake is to say "Cassandra is AP so it doesn't support consistency." Cassandra supports strong consistency within a datacenter when you set QUORUM and RF=3. CAP's trade-off only activates during a network partition. In steady state (no partition), AP systems can be made consistent by setting a high quorum. PACELC is the better model for steady-state reasoning.
Quorum math is about overlaps, not majorities. The W + R > N condition is not about majority vote — it is about ensuring that at least one node in a read quorum must have participated in a write quorum. With N=5, W=3, R=3: any three-node read set must overlap with any three-node write set by at least one node (since 3+3=6 > 5). This is why W=1, R=5 is valid quorum: every read contacts every node, so it must hit the one node that received the write.
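The overlap claim can be verified exhaustively for small N. This sketch enumerates every possible write set and read set and checks that they intersect:

```python
# Exhaustive check of the overlap argument: with N=5, W=3, R=3, every
# 3-node read set intersects every 3-node write set, while W=2, R=2
# (W+R=4, not > 5) admits disjoint sets.

from itertools import combinations

def always_overlaps(n, w, r):
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

print(always_overlaps(5, 3, 3))  # True  - W+R=6 > 5
print(always_overlaps(5, 2, 2))  # False - W+R=4, a read can miss the write
print(always_overlaps(5, 1, 5))  # True  - reading all nodes always overlaps
```

The failing case makes the lesson concrete: with W=2, R=2 out of N=5, the write can land on nodes {0, 1} while the read contacts {2, 3} — no overlap, no freshness guarantee.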
Fencing tokens are required even after leader election. A freshly-elected leader may still have a stale predecessor attempting writes. Leader election alone does not prevent split brain at the storage layer. The fencing token (a monotonically increasing epoch number validated by the storage system) is the actual prevention mechanism. Always ensure the storage backend participates in the fencing check — a lock service that does not enforce fencing provides incomplete protection.
Replication lag is not just a performance metric — it is a correctness metric. High replication lag in a system where reads are routed to replicas means users may be reading data that is seconds or minutes out of date. If your SLA requires read-your-writes consistency and your application routes reads to replicas, you need a session-routing mechanism (always read from the primary after a write, or use a replication lag threshold before allowing replica reads).
LSM compaction pauses are a production reliability concern. A Cassandra or RocksDB node under heavy write load accumulates SSTables faster than compaction can merge them. When compaction falls behind, read amplification grows (more SSTables to check per read), disk space grows, and eventually read latency degrades significantly. Monitor compaction pending task counts and tune compaction throughput for your write/read ratio.
📌 TLDR: The 60-Second Distributed Systems Vocabulary Reference
- Read Skew — two reads in the same transaction see different snapshots; prevented by Snapshot Isolation.
- Write Skew — two concurrent transactions each check a shared condition and write in a way that violates it together; prevented only by Serializable isolation.
- Lost Update — two concurrent read-modify-write cycles; second write overwrites first; prevented by atomic operations, CAS, or Serializable isolation.
- Isolation levels escalate from Read Uncommitted → Read Committed → Repeatable Read → Snapshot Isolation → Serializable. Each step costs latency and throughput in exchange for preventing a new class of anomaly.
- Snapshot Isolation ≠ Serializable. Write skew is the gap between them.
- Quorum: W + R > N — the arithmetic condition that guarantees at least one node in a read quorum has the latest write.
- Split brain requires fencing tokens to prevent, not just epoch numbers in the lock service.
- LSM trees favour writes (sequential, in-memory first); B-trees favour reads (in-place, sorted on disk).
- CAP is a partition-time trade-off. PACELC covers both partition-time and steady-state latency/consistency trade-offs.
- Linearizability is the strongest consistency model (real-time, globally ordered). Eventual consistency is the weakest (converges eventually, no timing guarantee).
📝 Practice Quiz: The Five Most Commonly Confused Distributed Systems Concepts
Test your grasp of the concepts most frequently muddled in interviews and production discussions.
Transaction A reads account balances for accounts 1 and 2. In between those two reads, Transaction B transfers money from account 1 to account 2 and commits. Transaction A's two reads now see an inconsistent snapshot. Which anomaly is this?
- A) Write skew
- B) Read skew
- C) Lost update
- D) Phantom read

Correct Answer: B) Read skew — Transaction A's two reads span the boundary of Transaction B's commit, producing an internally inconsistent snapshot.
Two hospital scheduling transactions both read the on-call count, both see 2 doctors on-call, and both proceed to mark themselves off-call. After both commit, zero doctors are on-call. Which isolation level is the minimum required to prevent this?
- A) Read Committed
- B) Repeatable Read
- C) Snapshot Isolation (MVCC)
- D) Serializable

Correct Answer: D) Serializable — write skew involves concurrent transactions reading a shared condition and writing in a way that violates a combined invariant. Only Serializable (via 2PL or SSI) detects this conflict cycle.
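The hospital scenario can be replayed in miniature. A toy sketch (names are illustrative; in a real database each doctor would be a row): both transactions check the shared condition, then write to *different* rows, so Snapshot Isolation sees no write-write conflict and happily commits both.

```python
# Toy write-skew illustration: the invariant is "at least 1 doctor on-call".
on_call = {"alice": True, "bob": True}

def can_go_off_call():
    # Shared condition both transactions check: at least 2 doctors on-call.
    return sum(on_call.values()) >= 2

# Both transactions read the condition first — both see 2 on-call...
alice_ok = can_go_off_call()
bob_ok = can_go_off_call()

# ...then each writes to a *different* row, so there is no write-write
# conflict for Snapshot Isolation to detect.
if alice_ok:
    on_call["alice"] = False
if bob_ok:
    on_call["bob"] = False

# The combined invariant is now violated: zero doctors on-call.
assert sum(on_call.values()) == 0
```

Serializable isolation catches the read-then-write dependency cycle between the two transactions and aborts one of them, which is exactly the capability the weaker levels lack.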
A Cassandra cluster has replication factor N=5. You want reads to always see the latest confirmed write. What is the minimum valid combination of W and R?
- A) W=1, R=1
- B) W=2, R=2
- C) W=3, R=3
- D) W=5, R=1

Correct Answer: C) W=3, R=3 — with N=5, W + R must exceed 5, and 3 + 3 = 6 > 5. Note that option D also satisfies the arithmetic (5 + 1 = 6 > 5), but W=5 means a single node failure blocks every write, so C is the balanced choice. Other valid combinations: W=2, R=4; W=4, R=2; W=1, R=5.
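The quorum condition is pure arithmetic, so the quiz options can be checked mechanically. A minimal sketch (the function name is mine): W + R > N guarantees the read set and write set overlap in at least one node.

```python
def read_sees_latest_write(w, r, n):
    """Quorum overlap condition: read and write sets share >= 1 node."""
    return w + r > n

N = 5
valid = [(w, r) for w in range(1, N + 1) for r in range(1, N + 1)
         if read_sees_latest_write(w, r, N)]

# The quiz options, checked against the condition:
assert not read_sees_latest_write(1, 1, N)  # A) 1+1=2: sets may not overlap
assert not read_sees_latest_write(2, 2, N)  # B) 2+2=4: still <= 5
assert read_sees_latest_write(3, 3, N)      # C) 3+3=6 > 5: guaranteed overlap
assert read_sees_latest_write(5, 1, N)      # D) arithmetic holds, but W=5
                                            #    means any node failure blocks writes
```

Enumerating `valid` for N=5 shows the full menu of trade-offs: low W favours write availability, low R favours read latency, and the product W + R > N is what buys the overlap.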
A leader is elected in a Raft cluster. Moments later, the leader is paused by a JVM garbage collection pause lasting 12 seconds. During that time, a new leader is elected for the next term. Why is the old leader still dangerous even though its Raft term has expired, and what prevents it from corrupting data?
- A) The old leader's heartbeats will notify the new leader to yield
- B) Raft term numbers prevent the old leader from writing to any follower
- C) A fencing token from the lock service allows the storage backend to reject the stale leader's writes
- D) Network timeouts will automatically terminate the old leader's connections

Correct Answer: C) The fencing token is the storage-layer enforcement mechanism. Raft term numbers alone do not prevent a paused leader from resuming and attempting writes after a new leader has committed data. The storage backend validates the monotonically increasing fencing token and rejects writes from the stale leader.
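The fencing pattern described in the answer fits in a dozen lines. A hypothetical sketch (the `FencedStore` class and its field names are illustrative, not any real system's API): the storage backend remembers the highest token it has seen and rejects anything older, so a GC-paused zombie leader cannot clobber the new leader's writes.

```python
class FencedStore:
    """Toy storage backend that enforces monotonically increasing
    fencing tokens, rejecting writes from stale (e.g. GC-paused) leaders."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                 # stale leader: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(33, "x", "old-leader")   # leader holding token 33 writes
assert store.write(34, "x", "new-leader")   # new leader elected with token 34
# Old leader wakes from its 12-second GC pause and retries with its stale token:
assert not store.write(33, "x", "zombie")
assert store.data["x"] == "new-leader"      # committed data survives
```

The key design point: the check lives in the storage layer, not the lock service, because only the storage layer sees the write at the moment it would do damage.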
Open-ended: You are choosing a storage engine for a new service. The service ingests 500,000 sensor readings per second and needs to query the last 24 hours of data for a given sensor ID (low read volume, extremely high write volume). How would you reason about LSM tree vs. B-tree for this workload, and what compaction considerations would you factor into your capacity planning?
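One way to start on the open-ended question is a back-of-envelope calculation. A rough sketch — the record size and the write-amplification factor below are illustrative assumptions, not benchmarks — shows why compaction dominates the capacity plan for a write-heavy LSM workload:

```python
# Illustrative capacity sketch for the sensor-ingest question.
# All numbers are assumptions for demonstration, not measured values.
writes_per_sec = 500_000
bytes_per_reading = 100                  # assumed: sensor id + timestamp + value
ingest_mb_per_sec = writes_per_sec * bytes_per_reading / 1e6
assert ingest_mb_per_sec == 50.0         # 50 MB/s of raw ingest

# LSM compaction rewrites each byte several times as it moves down levels,
# so disks must sustain far more than the raw ingest rate. A write
# amplification around 10 is an assumed ballpark for leveled compaction.
write_amplification = 10
disk_mb_per_sec = ingest_mb_per_sec * write_amplification
assert disk_mb_per_sec == 500.0          # sustained disk write bandwidth needed

# The 24-hour retention window the queries must cover:
raw_day_gb = ingest_mb_per_sec * 86_400 / 1e3
assert round(raw_day_gb) == 4320         # ~4.3 TB/day before compression
```

The arithmetic favours the LSM tree: sequential flushes absorb the 500k writes/sec, and keying SSTables by (sensor ID, timestamp) keeps the low-volume 24-hour reads to a handful of sorted-file scans — but the budget must include compaction I/O, not just raw ingest.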
🔗 Related Posts: Build Your Distributed Systems Knowledge Graph
- System Design Replication and Failover
- Consistent Hashing Explained
- System Design Core Concepts: Scalability, CAP, and Consistency
- Types of Locks Explained: Tips for Maintaining Consistent Systems and Avoiding Write Conflicts
- Capacity Estimation Guide

Written by
Abstract Algorithms
@abstractalgorithms