Key Terms in Distributed Systems: The Definitive Glossary
Every term you'll encounter in distributed systems interviews and production engineering — defined clearly with examples
TL;DR: Distributed systems vocabulary is precise for a reason. Mixing up read skew and write skew costs you an interview. Confusing Snapshot Isolation with Serializable costs you a production outage. This glossary organises every critical term into concept families — with concrete scenarios, prevention strategies, and visual aids — so the vocabulary becomes second nature.
📖 How Vocabulary Breaks System Design Interviews (and Production Systems)
You are 35 minutes into a system design interview. You have just proposed a multi-leader replication setup for a globally distributed social feed. The interviewer pauses and says: "Your design doesn't account for write skew. Walk me through how you'd prevent it."
You nod. You say "I'd use Serializable isolation." But inside, a small alarm is going off: is write skew the same as a lost update? Does Serializable actually prevent it? What even is the difference?
This moment — the instant where blurry vocabulary collapses your design argument — is the most common way strong engineers fail distributed systems interviews. It is not a knowledge gap; it is a vocabulary gap. You understand the concepts intuitively but do not have the precise words or the precise relationships between them.
The same problem shows up in production. When a Cassandra runbook says "ensure R + W > N for strong consistency" and nobody on the team is certain whether that formula applies in a multi-datacenter setup, configuration mistakes follow. When a junior engineer confuses "replica lag" with "split brain," they diagnose the wrong failure mode.
This glossary exists to close that gap. Rather than presenting terms alphabetically — which treats the vocabulary like a dictionary and hides the relationships between concepts — every term here is grouped by concept family. Isolation anomalies are grouped together so you can see how they escalate. Replication models are grouped so you can compare their trade-offs side-by-side. By the time you finish, you will not just recognise the words; you will know how they connect.
🔍 Consistency Anomalies: The Ways Transactions Can Lie to You
Before you can understand isolation levels, you need to understand what they protect against. Each anomaly below describes a class of correctness violation that can occur when multiple transactions execute concurrently.
Dirty Read
A dirty read occurs when transaction B reads data that transaction A has written but not yet committed. If A rolls back, B has read data that, as far as the database is concerned, never existed.
Scenario: Transaction A is transferring $500 from account 1 to account 2. It debits account 1 but has not yet credited account 2. Transaction B reads account 1's new (lower) balance before A commits. A then fails and rolls back. B has acted on a balance that, officially, never existed.
Prevention: Any isolation level above Read Uncommitted prevents dirty reads.
Non-Repeatable Read
A non-repeatable read occurs when a transaction reads the same row twice within one transaction and gets different values each time, because another transaction committed a change between the two reads.
Scenario: An audit job reads a payment record at 10:00:01 and again at 10:00:02 to verify it has not changed. Between those two reads, a concurrent transaction updates the payment amount. The audit report is now internally inconsistent.
Prevention: Repeatable Read isolation or higher.
Phantom Read
A phantom read is the row-range equivalent of a non-repeatable read. A transaction executes the same range query twice. The second execution returns rows that did not exist the first time (or is missing rows that did exist), because another transaction inserted or deleted rows in between.
Scenario: A booking system reads all available seats in a row WHERE section = 'A' AND available = true. Another transaction books one of those seats and commits. The first transaction re-queries and sees a smaller result set. It is now operating on stale availability data.
Prevention: Serializable isolation (or predicate/gap locking under Repeatable Read in some engines, such as MySQL InnoDB's next-key locks).
Read Skew
Read skew occurs when a transaction performs multiple reads across two different objects, and those reads are taken from different points in time — so they represent an internally inconsistent snapshot.
The canonical example is a bank invariant: account A plus account B must always total $1,000 ($500 in each). A transfer transaction deducts $200 from account A (now $300) but has not yet credited account B (still $500). A concurrent read transaction reads account A's new value ($300) and account B's old value ($500). It sees $800 in total — $200 has apparently vanished. The reads straddled another transaction's writes and observed a state that never existed.
Prevention: Snapshot Isolation (MVCC) — reads are served from a single consistent snapshot taken at transaction start, so a transaction never sees partial writes from another.
Write Skew
Write skew is the most subtle of the isolation anomalies and the one most frequently confused with a lost update in interviews.
Write skew occurs when two concurrent transactions each read a shared condition, each check that it is safe to proceed, and each then write a change that, individually, would be valid — but together violate a global invariant.
The classic scenario: A hospital requires at least one doctor to be on-call at all times. Two doctors, Alice and Bob, both check the on-call roster concurrently. Both see the other doctor is on-call and decide it is safe for them to go off-call. Both transactions commit. Now zero doctors are on-call — a violated invariant that neither transaction could detect individually.
Write skew differs from a lost update because neither transaction overwrites the other's data. They write to different rows. The conflict is logical, not physical.
Prevention: Serializable isolation (Serializable Snapshot Isolation / 2PL). Snapshot Isolation does not prevent write skew, which surprises many engineers who assume it does.
The following diagram illustrates the write skew scenario step-by-step:
sequenceDiagram
participant Alice as Alice Txn
participant DB as Database
participant Bob as Bob Txn
Alice->>DB: BEGIN
Bob->>DB: BEGIN
Alice->>DB: SELECT count(*) WHERE on_call = true
DB-->>Alice: 2 doctors on call
Bob->>DB: SELECT count(*) WHERE on_call = true
DB-->>Bob: 2 doctors on call
Alice->>DB: UPDATE doctors SET on_call=false WHERE name=Alice
Bob->>DB: UPDATE doctors SET on_call=false WHERE name=Bob
Alice->>DB: COMMIT
Bob->>DB: COMMIT
Note over DB: Invariant violated - 0 doctors on call
Both Alice and Bob read a valid state (two doctors on-call) and make a write that looks locally safe. The invariant breaks only when both commits land. This is why write-skew-prone workloads need Serializable isolation (or explicit locking that materialises the conflict, such as SELECT FOR UPDATE on the rows being checked) — the database must detect that the two reads and two writes form a conflicting cycle.
Lost Update
A lost update occurs when two transactions concurrently read the same value, compute a new value, and write back — with the second write silently overwriting the first.
Scenario: A counter starts at 5. Transaction A reads 5 and plans to write 6. Transaction B also reads 5 and plans to write 6. A writes 6. B writes 6. The counter should be 7; it is 6. A's increment is lost.
Prevention: Use atomic operations (UPDATE counter SET value = value + 1 WHERE id = X), optimistic locking (compare-and-swap), or pessimistic locking. Repeatable Read prevents it in some databases; Serializable prevents it universally.
A classic compare-and-swap guard against a lost update looks like this:
-- Read current value
SELECT value, version FROM counters WHERE id = 1;
-- value = 5, version = 42
-- Write only if version has not changed (no lost update possible)
UPDATE counters SET value = 6, version = 43
WHERE id = 1 AND version = 42;
-- If 0 rows affected, another transaction won the race — retry
Dirty Write
A dirty write occurs when a transaction overwrites data that another transaction has written but not yet committed. If either transaction then rolls back, there is no well-defined correct value, and the database can be left in an inconsistent state.
Scenario: Two buyers concurrently purchase the last item in inventory. Transaction A updates the item's owner to "Alice" (uncommitted). Transaction B, seeing the item row as writable, updates the owner to "Bob". A rolls back. The item now belongs to Bob, but the invoice may have been generated for Alice. The physical data and financial records diverge.
Prevention: Databases prevent dirty writes at every isolation level, including Read Uncommitted, typically by holding a row-level lock on each uncommitted write until the writing transaction commits or rolls back.
⚙️ Isolation Levels: A Five-Way Dial Between Safety and Speed
Isolation levels are the database's knob for trading safety against concurrency and performance. The SQL standard defines four levels; real databases (PostgreSQL, MySQL, Oracle) add Snapshot Isolation as a practical middle ground.
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Read Skew | Write Skew | Lost Update |
| --- | --- | --- | --- | --- | --- | --- |
| Read Uncommitted | Possible | Possible | Possible | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible | Possible | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible* | Prevented† | Possible | Prevented† |
| Snapshot Isolation (MVCC) | Prevented | Prevented | Prevented | Prevented | Possible | Prevented |
| Serializable | Prevented | Prevented | Prevented | Prevented | Prevented | Prevented |
*Phantom reads may be prevented by gap locks in some engines (MySQL InnoDB). †PostgreSQL's Repeatable Read prevents lost updates and read skew via MVCC.
Read Uncommitted is the lowest level. Transactions can read rows written by concurrent uncommitted transactions. Almost no production system uses this level; it exists for bulk reporting where approximate numbers are acceptable and speed matters above all else.
Read Committed is the PostgreSQL and Oracle default. Each statement within a transaction gets a fresh snapshot of committed data. This prevents dirty reads but allows non-repeatable reads within a single transaction, since the snapshot advances with each new statement.
Repeatable Read guarantees that if you read a row in a transaction, re-reading the same row later in the same transaction will return the same data. PostgreSQL implements this via MVCC snapshots. It does not prevent phantom reads in the SQL standard, though PostgreSQL's MVCC implementation effectively prevents most phantom reads in practice.
Snapshot Isolation (MVCC) gives every transaction a consistent view of the database as it existed at transaction start time. All reads come from this frozen snapshot. It prevents dirty reads, non-repeatable reads, phantom reads, and read skew — but not write skew, which is the key gotcha.
Serializable is the strongest level. All concurrent transactions produce results equivalent to some serial (one-at-a-time) execution order. PostgreSQL implements this via Serializable Snapshot Isolation (SSI), which uses read/write dependency tracking to detect and abort transactions that would violate serializability. It is safe for write skew.
📊 Replication Models and the Consistency Spectrum Across Databases
Replication is the act of maintaining copies of the same data on multiple nodes. Every replicated system must answer: who accepts writes, and how quickly must replicas reflect them?
The following diagram maps the consistency spectrum from weakest to strongest, positioned against the replication models that produce each consistency level:
flowchart TD
A[Eventual Consistency] --> B[Causal Consistency]
B --> C[Monotonic Reads / Read-Your-Writes]
C --> D[Session Consistency]
D --> E[Bounded Staleness]
E --> F[Sequential Consistency]
F --> G[Linearizability - Strongest]
R1[Leaderless async replication] -.-> A
R2[Multi-leader with per-session routing] -.-> D
R3[Single-leader sync replication] -.-> G
The spectrum above runs from eventual consistency (weakest — replicas converge eventually, with no timing guarantee) through to linearizability (strongest — the system appears as a single machine; reads always reflect the most recent write). Each step upward adds correctness guarantees but increases latency or availability cost.
Single-Leader Replication
One designated primary node accepts all writes. Replicas receive a replication stream and apply changes in the same order. This is the model used by PostgreSQL streaming replication and MySQL binlog replication.
Trade-off: Simple write ordering and conflict-free. But the single leader is a bottleneck — you cannot scale write throughput beyond what one node can handle, and leader failure requires a failover.
Multi-Leader Replication
Multiple nodes accept writes simultaneously. This enables writes to continue during network partitions between datacenters, and it eliminates the single-leader bottleneck for geo-distributed systems. MySQL multi-source replication and CockroachDB multi-region writes use variants of this model.
Trade-off: Write conflicts are inevitable when two leaders accept conflicting writes to the same row. The system must define a conflict resolution policy: last-write-wins, custom merge, CRDTs, or expose conflicts to application code.
Leaderless Replication (Dynamo-Style)
Any node can accept any read or write. Used by Apache Cassandra, Amazon DynamoDB, and Riak. Consistency is achieved by requiring reads and writes to contact a quorum of nodes.
Quorum condition — W + R > N: Where N is the replication factor (number of copies), W is the number of nodes that must acknowledge a write before it is confirmed, and R is the number of nodes contacted for a read. If W + R > N, at least one node in every read quorum must have the latest write. With N=3, W=2, R=2: any read overlaps with any write by at least one node.
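The overlap argument is just the pigeonhole principle, which makes it easy to sanity-check configurations in code. The following is a small illustrative sketch (the function names are mine, not from any client library):

```python
def quorum_overlap(n: int, w: int, r: int) -> int:
    """Minimum number of nodes guaranteed to be in BOTH a write quorum
    of size w and a read quorum of size r, out of n replicas.
    By pigeonhole: w + r - n (floored at zero)."""
    return max(0, w + r - n)

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """W + R > N: every read quorum intersects every write quorum in
    at least one node that holds the latest acknowledged write."""
    return quorum_overlap(n, w, r) >= 1

# N=3, W=2, R=2: reads and writes always overlap in at least one node
assert is_strongly_consistent(3, 2, 2)
# N=3, W=1, R=1: quorums can miss each other entirely -> stale reads possible
assert not is_strongly_consistent(3, 1, 1)
```

Note this reasoning assumes a single consistent membership view; it is exactly the formula that becomes subtle in multi-datacenter deployments, where per-DC consistency levels change what N, W, and R mean.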
Hinted handoff: If a target node is down during a write, a neighbour node stores the write with a hint (a note saying "deliver to node X when it recovers"). When the downed node comes back, the hint is forwarded. This allows writes to succeed during temporary node failures without sacrificing durability.
Read repair: When a read query contacts multiple replicas and detects a version mismatch (some replicas have an older value), the coordinator updates the stale replicas with the newer value before returning to the client. This lazily brings replicas back into sync.
Anti-entropy: A background process that periodically compares Merkle tree hashes of replica data and syncs any divergent ranges. Read repair handles inconsistencies detected during reads; anti-entropy handles key ranges that have not been recently read.
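The core idea — compare hashes of key ranges and sync only the ranges that differ — can be sketched without a full Merkle tree. This toy version (flat range hashes rather than a hash tree; all names are illustrative) shows how an anti-entropy pass narrows down what to stream:

```python
import hashlib

def range_hashes(replica: dict, ranges: list) -> list:
    """Hash each key range's sorted contents; equal hashes mean in sync."""
    out = []
    for lo, hi in ranges:
        items = sorted((k, v) for k, v in replica.items() if lo <= k < hi)
        out.append(hashlib.sha256(repr(items).encode()).hexdigest())
    return out

def divergent_ranges(a: dict, b: dict, ranges: list) -> list:
    """Return only the key ranges whose hashes differ -- the data an
    anti-entropy pass would actually transfer between replicas."""
    ha, hb = range_hashes(a, ranges), range_hashes(b, ranges)
    return [r for r, x, y in zip(ranges, ha, hb) if x != y]

ranges = [(0, 100), (100, 200)]
replica_a = {10: "x", 150: "y"}
replica_b = {10: "x", 150: "stale"}
# Only the second range diverges, so only that range needs syncing
assert divergent_ranges(replica_a, replica_b, ranges) == [(100, 200)]
```

A real Merkle tree makes this comparison hierarchical — compare root hashes first, descend only into mismatched subtrees — so that in-sync replicas exchange O(1) data.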
Synchronous vs. Asynchronous Replication
Synchronous replication means the primary waits for at least one replica to confirm receipt (and optionally apply) the write before acknowledging the client. PostgreSQL's synchronous_standby_names enables this. Zero data loss on primary failure, but added write latency for every commit.
Asynchronous replication means the primary acknowledges the client immediately and ships changes to replicas independently. Faster writes, but a primary crash before replication occurs means recent writes can be lost.
Replication lag is how far behind a replica is from the primary — measured in seconds or bytes of WAL. High lag means reads routed to replicas may return stale data. In PostgreSQL: pg_stat_replication.write_lag. In MySQL: Seconds_Behind_Source from SHOW REPLICA STATUS (formerly Seconds_Behind_Master from SHOW SLAVE STATUS).
Replication factor (RF): The number of copies of each piece of data. RF=3 means three nodes hold each row. Higher RF increases fault tolerance (the system can lose RF-1 nodes without data loss) at the cost of write amplification.
📦 Partitioning and Sharding: Distributing Data Across Nodes
As data volumes grow beyond what a single node can store or serve, you partition data across multiple nodes. Understanding the vocabulary here prevents the most common interview stumble — conflating partitioning with sharding.
Partitioning vs. Sharding: Partitioning is the general concept of splitting a dataset into subsets. Sharding is the specific technique of horizontal partitioning across multiple database nodes (physical machines or instances). All sharding is partitioning; not all partitioning is sharding (e.g., a single-node table partitioned by date is partitioned but not sharded).
Horizontal partitioning: Rows are split across nodes by a partition key. Node 1 holds users 1–1,000,000; node 2 holds users 1,000,001–2,000,000. This is what people typically mean when they say "sharding."
Vertical partitioning: Columns are split. A wide user record might have its frequently-accessed fields on one node (profile, username) and infrequently-accessed fields on another (audit logs, raw clickstream). Used to optimise I/O patterns and cache efficiency.
Range partitioning: Rows are assigned to partitions based on key ranges. Easy to support range queries (give me all orders from March). Risk: hot partitions (all recent activity hits the latest range shard).
Hash partitioning: Rows are assigned by hash(key) mod N. Distributes load evenly. Downsides: range queries are expensive because adjacent keys end up on different nodes, and changing the node count N remaps almost every key — which is the redistribution problem consistent hashing exists to solve.
Consistent hashing solves the redistribution problem with hash partitioning. Nodes and keys are mapped to positions on a conceptual ring. A key is assigned to the first node clockwise on the ring from the key's hash position. When a node is added or removed, only keys near that node's ring position need to be redistributed — not the entire dataset. Used by Cassandra and DynamoDB.
Virtual nodes (vnodes): Instead of each physical node occupying one contiguous arc on the ring, it occupies many small arcs (typically 256 per node). This ensures more even key distribution and makes redistribution smoother when nodes are added or removed.
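A minimal ring with vnodes fits in a few lines and makes the redistribution claim measurable. This is an illustrative sketch (the Ring class and helper names are mine), using Python's bisect to find the first vnode clockwise of a key:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring with virtual nodes (sketch)."""
    def __init__(self, nodes, vnodes=64):
        # Each physical node occupies many small arcs on the ring
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.positions = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First vnode clockwise from the key's hash (wrapping at the end)
        i = bisect.bisect(self.positions, h(key)) % len(self.ring)
        return self.ring[i][1]

before = Ring(["n1", "n2", "n3"])
after = Ring(["n1", "n2", "n3", "n4"])   # add one node
keys = [f"user:{i}" for i in range(1000)]
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Roughly 1/4 of keys move to the new node -- with mod-N hashing,
# going from 3 to 4 nodes would remap about 3/4 of all keys
assert 0 < moved < 500
```

Production systems layer replication on top of this: each key is stored on the owning node plus the next RF-1 distinct nodes clockwise.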
Hot partition / hot spot: One partition receives a disproportionate share of reads or writes. Common cause: a shard key that is temporally biased (e.g., created_at) or a celebrity problem (one user ID generates millions of events). Detection: monitor per-shard throughput and latency in your database's observability dashboards.
Shard key: The attribute used to route a row to its partition. Choosing the right shard key is the single most important sharding decision. A bad shard key creates hot spots; a good shard key distributes load evenly and co-locates frequently-joined data.
Resharding: Redistributing data when partition boundaries change — either because you are adding nodes (expanding) or because hot spots require redistribution. Resharding in a live system is operationally complex; consistent hashing minimises but does not eliminate this.
🕐 Ordering and Time: Why "Now" Has No Meaning in a Distributed System
In a single-machine system, time is simple: events have a total order enforced by a single clock and a single CPU. In a distributed system, machines have independent clocks that drift, and messages arrive out of order. The vocabulary in this section defines how distributed systems reason about causality and ordering without trusting wall clocks.
Wall clock time: The system clock on a machine (e.g., System.currentTimeMillis()). Wall clocks are unreliable for ordering distributed events because clock skew — different machines having different time values — can exceed the gap between events. Google's Spanner measured clock skew up to 7 ms across datacenters; events separated by less than 7 ms cannot be confidently ordered by clock alone.
Logical clock (Lamport clock): A monotonically increasing integer counter maintained by each process. Rules: (1) increment the counter before each event; (2) when sending a message, include the current counter; (3) on receiving a message, set your counter to max(local, received) + 1. Lamport clocks establish a partial order: if A happened before B, A's timestamp is lower than B's. But the converse is not guaranteed — a lower timestamp does not necessarily mean it happened first in real time.
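The three rules translate directly into code. Here is a minimal sketch (class and method names are mine for illustration):

```python
class LamportClock:
    """Lamport logical clock: increment before each local event, attach
    the counter to outgoing messages, and on receive take
    max(local, received) + 1."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Rule 1: increment before each local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Rule 2: a send is an event; return the timestamp to attach."""
        return self.tick()

    def receive(self, ts: int) -> int:
        """Rule 3: jump past the sender's clock."""
        self.time = max(self.time, ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
ts = a.send()                # a.time is now 1
assert b.receive(ts) == 2    # max(0, 1) + 1 -- b is now "after" the send
assert a.tick() == 2         # an independent local event on a
```

Note that a's second event and b's receive both carry timestamp 2 — they are concurrent, and Lamport timestamps alone cannot distinguish concurrency from causal ordering. That is the gap vector clocks close.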
Vector clock: An extension of Lamport clocks where each node maintains a vector of counters — one per node in the system. [nodeA: 3, nodeB: 1, nodeC: 2]. Vector clocks can determine causal relationships: event X causally precedes event Y if every entry in X's vector is ≤ the corresponding entry in Y's vector, with at least one strictly less. Vector clocks enable conflict detection in leaderless systems (used by Amazon Dynamo).
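The comparison rule — every entry ≤, at least one strictly <, with missing entries treated as 0 — is easy to implement and worth internalising. A sketch (function names are illustrative):

```python
def happens_before(x: dict, y: dict) -> bool:
    """x -> y iff every entry of x is <= the matching entry of y and
    at least one is strictly less (missing entries count as 0)."""
    nodes = set(x) | set(y)
    all_le = all(x.get(n, 0) <= y.get(n, 0) for n in nodes)
    any_lt = any(x.get(n, 0) < y.get(n, 0) for n in nodes)
    return all_le and any_lt

def concurrent(x: dict, y: dict) -> bool:
    """Neither event causally precedes the other: a genuine conflict."""
    return not happens_before(x, y) and not happens_before(y, x)

# {A:1} causally precedes {A:2, B:1} -- the second saw the first
assert happens_before({"A": 1}, {"A": 2, "B": 1})
# Each clock is ahead on a different axis: concurrent writes, i.e. a
# conflict a leaderless store must surface or merge
assert concurrent({"A": 2, "B": 1}, {"A": 1, "B": 2})
```

When a Dynamo-style store finds two versions whose clocks are concurrent, it cannot pick a winner by causality alone — it must merge them or hand both back to the application.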
TrueTime (Spanner): Google Spanner's approach to distributed time uses GPS receivers and atomic clocks to provide a bounded uncertainty window [earliest, latest] around the current time. Spanner's external consistency guarantee relies on waiting out this uncertainty interval before committing, ensuring any reader that starts after a commit will observe it. It remains one of the very few production systems providing externally consistent (linearizable) transactions at global scale.
Happens-before (→): Formalized by Leslie Lamport. Event A → B means A causally influenced B: either A and B are on the same process and A occurred before B, A sent a message that B received, or A → C and C → B. Two events with no happens-before relationship are concurrent — neither could have influenced the other.
Causality: The property that captures which events could have influenced which. Causal consistency guarantees that if you see the effect of an event, you also see the cause. This is stronger than eventual consistency but weaker than linearizability.
Epoch: In consensus protocols, an epoch (called a term in Raft, a ballot in Paxos) is a monotonically increasing round number. A new epoch is started whenever a new leader is elected. Messages from a leader with a stale epoch number are rejected, preventing two leaders from both believing they are authoritative.
Monotonic reads: A session guarantee: once a client reads a value at version V, it will never read a value older than V in subsequent reads. Prevents the disorienting scenario where refreshing a page shows older data than the previous page load.
Monotonic writes: A session guarantee: a client's writes are applied in the order they were issued. Without this, write 2 can arrive at a replica before write 1, corrupting the sequence.
Read-your-writes: A session guarantee: after a client writes a value, any subsequent read by the same client returns that value (or a newer one). Essential for user-facing applications — submitting a form and not seeing the update on the next page load is a jarring experience. Typically implemented by routing a client's reads to the same replica that received its writes, or by using a version token.
🗳️ Consensus and Leader Election: Deciding Who Gets the Final Say
Consensus is the problem of getting multiple distributed nodes to agree on a single value, even when some nodes may crash or messages may be delayed. It underpins leader election, distributed transactions, and replicated state machines.
Consensus: Formally, a consensus algorithm must satisfy: (1) Validity — any decided value was proposed; (2) Agreement — no two nodes decide differently; (3) Termination — every non-faulty node eventually decides. The FLP impossibility result proves that deterministic consensus is impossible in a fully asynchronous system with even one crash failure — which is why practical implementations use timeouts.
Paxos: The first widely-studied consensus algorithm, published by Leslie Lamport. Paxos has three roles: Proposer (initiates a value proposal), Acceptor (votes on proposals), and Learner (observes the decided value). The algorithm proceeds in two phases: Prepare/Promise (establishing authority for an epoch) and Accept/Accepted (getting a quorum to accept a value). Paxos is notoriously difficult to implement correctly in production, leading to multi-Paxos variants (used by Google Chubby).
Raft: Designed explicitly for understandability as an alternative to Paxos. Raft separates consensus into three sub-problems: leader election (elect one node per term), log replication (leader accepts entries and replicates to followers), and safety (a leader must have all committed entries from previous terms). Used by etcd, CockroachDB, and TiDB. The Raft paper's figure 2 is the specification; every production implementation derives from it.
Leader election: The process of choosing one node as the single authoritative coordinator for writes or decisions. A well-designed election must be safe (at most one leader per epoch) and live (a new leader is eventually elected after a failure). Raft achieves safety by requiring a candidate to receive votes from a majority before becoming leader for a given term.
Split brain: The catastrophic scenario where two nodes simultaneously believe they are the leader. In a replicated database, this means two nodes each accepting writes with no reconciliation — both believe their writes are authoritative. The result is divergent data that cannot be automatically merged. Split brain is prevented by requiring a majority quorum for leadership (a cluster of 5 cannot produce two majorities of 3 simultaneously).
Fencing token: Even with leader election and epoch numbers, a slow leader that was paused (e.g., garbage collection, VM live migration) can resume believing it is still the leader after a newer leader has been elected. A fencing token is a monotonically increasing integer issued by the lock service (e.g., ZooKeeper's zxid) when a leader acquires a lock. The storage system rejects any write that includes a token lower than the highest token it has seen. This prevents a stale leader from overwriting data committed by a newer leader.
The diagram below shows the fencing token mechanism in action:
sequenceDiagram
participant L1 as Leader 1 (stale)
participant ZK as ZooKeeper
participant L2 as Leader 2 (current)
participant ST as Storage
L1->>ZK: Acquire lock, receive token=33
Note over L1: Leader 1 pauses (GC / network)
ZK-->>L2: Leader 1 timeout - new election
L2->>ZK: Acquire lock, receive token=34
L2->>ST: Write data with token=34
ST-->>L2: OK - token 34 accepted
Note over L1: Leader 1 resumes, still thinks it has lock
L1->>ST: Write data with token=33
ST-->>L1: REJECTED - token 33 less than seen token 34
ZooKeeper issues monotonically increasing tokens. Storage accepts writes only from the most recent token holder. A stale leader trying to write with an old token is rejected at the storage layer, regardless of what ZooKeeper's session state says.
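The storage-side check in the diagram is a one-line comparison. A minimal sketch of the fencing logic (the FencedStorage class is illustrative, not a real ZooKeeper or storage API):

```python
class FencedStorage:
    """Storage that rejects any write carrying a token lower than the
    highest token it has ever seen (the fencing check, sketched)."""
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        if token < self.max_token:
            return False          # stale leader: reject at the storage layer
        self.max_token = token    # remember the newest authority seen
        self.data[key] = value
        return True

st = FencedStorage()
assert st.write(34, "k", "from current leader")    # token 34 accepted
assert not st.write(33, "k", "from stale leader")  # token 33 rejected
assert st.data["k"] == "from current leader"
```

The crucial property is that the check happens at the storage layer: it works even if the stale leader's lock session, heartbeats, and local state all still claim it is in charge.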
🔁 Fault Tolerance and Recovery: Building Systems That Survive Failures
Distributed systems must be designed with the assumption that individual components will fail. The vocabulary in this section covers how systems detect failures, recover from them, and guarantee message delivery semantics.
Checkpointing: Periodically saving the complete state of a running system to stable storage, so recovery can restart from the checkpoint rather than replaying from the beginning of time. Used in Apache Flink (barrier-based checkpoints every few seconds), Apache Spark Streaming (DStream checkpoints), and traditional databases (fuzzy checkpoints in PostgreSQL). The checkpoint interval trades recovery time against checkpoint overhead.
Write-Ahead Log (WAL): Changes are written to a sequential, durable log before being applied to the main data structure. On crash recovery, the system replays the WAL from the last checkpoint forward. Used by PostgreSQL, MySQL InnoDB, RocksDB, and almost every durable storage system. The WAL is also the replication stream in PostgreSQL streaming replication — the same log that enables crash recovery enables replicas.
Idempotency: An operation is idempotent if applying it multiple times produces the same result as applying it once. SET balance = 500 is idempotent; ADD 50 TO balance is not. Idempotency is the fundamental safety property required for at-least-once delivery systems.
A canonical idempotency key pattern for a payment API:
// Client includes a unique idempotency key with each request
POST /payments
{
"idempotency_key": "client-generated-uuid-abc123",
"amount": 500,
"recipient": "acct_xyz"
}
// Server checks: has this key been processed?
IF EXISTS(processed_keys WHERE key = "client-generated-uuid-abc123"):
RETURN cached_response // return same result, do not recharge
ELSE:
process_payment()
store_key("client-generated-uuid-abc123", response)
RETURN response
The idempotency key is the client-generated UUID that uniquely identifies the logical operation. On retry (network timeout, client crash), the client resubmits with the same key. The server deduplicates by key and returns the original response without re-processing.
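The check-then-store flow above can be sketched as a tiny in-process server; this is an illustrative model (class name and response shape are mine), not a production handler — a real one needs an atomic shared store and key expiry:

```python
class IdempotentPaymentServer:
    """Dedupe by idempotency key: process once, cache the response,
    replay the cached response on retries (sketch)."""
    def __init__(self):
        self.processed = {}   # idempotency_key -> cached response
        self.charges = 0      # side effect we must not repeat

    def handle(self, key: str, amount: int) -> dict:
        if key in self.processed:
            return self.processed[key]   # retry: same response, no recharge
        self.charges += 1                # the side effect happens once
        resp = {"status": "ok", "amount": amount, "charge_id": self.charges}
        self.processed[key] = resp
        return resp

srv = IdempotentPaymentServer()
r1 = srv.handle("uuid-abc123", 500)
r2 = srv.handle("uuid-abc123", 500)   # client retried after a timeout
assert r1 == r2 and srv.charges == 1  # charged once, identical response
```

In a multi-process deployment the `key in self.processed` check and the store must be a single atomic operation (e.g. a unique-constraint insert), or two concurrent retries can both slip past the check.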
Exactly-once delivery: The guarantee that a message is processed exactly one time. Achievable in practice via idempotent consumers + deduplication at the application layer, or Kafka's exactly-once semantics (idempotent producers + transactions spanning consume-process-produce). True network-level exactly-once delivery is theoretically impossible — what is achieved is effectively exactly-once by making re-delivery safe.
At-least-once delivery: The system guarantees delivery of every message, but may deliver the same message more than once (due to retries, network failures, or consumer restarts). The receiver must be idempotent.
At-most-once delivery: The system sends a message once and does not retry. A message may be lost but is never processed twice. Used where loss is acceptable (UDP metrics, log sampling).
Heartbeat: A periodic signal (typically a small TCP packet or HTTP ping) sent by a node to indicate it is alive. A node that has not sent a heartbeat within a configurable timeout is assumed to have failed.
Timeout: The duration after which a node is assumed dead if no heartbeat or response is received. Setting the right timeout is a classic operational challenge: too short causes false positives (healthy but slow nodes are marked dead); too long delays failure detection and recovery.
Fail-stop failure: A node that fails by stopping entirely and sending no further messages. This is the simplest failure mode — silent, unambiguous, and detectable via heartbeat timeout.
Byzantine failure: A node that fails by behaving arbitrarily — sending incorrect, conflicting, or maliciously crafted messages. Byzantine fault tolerance requires a supermajority (3f+1 nodes to tolerate f Byzantine faults) and is the hardest failure model to design for. Used in blockchain consensus protocols (PBFT, Tendermint) but rare in traditional data infrastructure.
🧠 Storage Engine Internals: Why LSM Trees and B-Trees Make Different Bets
The choice of storage engine determines write throughput, read latency, compaction overhead, and operational complexity. Understanding LSM trees vs. B-trees is a standard intermediate systems question.
Internals of the LSM Tree Write Path
flowchart TD
W[Client Write] --> M[MemTable - in-memory sorted buffer]
M -->|MemTable full - flush| S0[L0 SSTable - immutable on disk]
S0 -->|L0 compaction trigger| S1[L1 SSTables - sorted - compacted]
S1 -->|tiered compaction| S2[L2 SSTables - larger - merged]
S2 --> S3[L3 SSTables - largest - most compacted]
W -.->|WAL for durability| WAL[Write-Ahead Log]
The LSM tree write path starts in memory (MemTable), flushes to immutable disk files (SSTables) when the MemTable fills, and merges those files through compaction levels to reclaim space and reduce read amplification. The WAL runs in parallel to the MemTable, ensuring durability even if the process crashes before the MemTable is flushed.
LSM Tree (Log-Structured Merge-Tree)
What it is: A write-optimised storage engine architecture used by Apache Cassandra, RocksDB, LevelDB, ScyllaDB, and Bigtable. The design principle is to make writes sequential (the fastest disk I/O operation) by never updating data in-place.
Write path: All writes go to an in-memory sorted buffer called the MemTable. The MemTable is also protected by a sequential WAL on disk for crash durability. When the MemTable reaches a size threshold, it is flushed to disk as an immutable sorted file called an SSTable (Sorted String Table). Over time, many SSTables accumulate. A background compaction process merges SSTables, discards deleted keys (represented by tombstones — special markers that signal deletion), and rewrites the merged result into a new, larger SSTable.
Read path: More expensive than writes. A read must check the MemTable first (most recent data), then SSTables from newest to oldest (L0, L1, L2...) until the key is found. A Bloom filter — a probabilistic data structure that answers "is this key definitely not in this SSTable?" — eliminates most unnecessary disk reads. If the Bloom filter says no, the SSTable can be skipped entirely with a single bit check.
Trade-off: Write amplification is low; read amplification is higher. Compaction consumes I/O and CPU in the background, which can affect latency under heavy write load.
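The write and read paths above can be sketched in a few dozen lines. This is a toy illustration only, not a real storage engine: a dict stands in for the MemTable, a sorted list of pairs for each SSTable, and a plain key-set plays the role a Bloom filter plays in a real engine (answering "definitely not here").

```python
# Toy LSM tree sketch: writes land in an in-memory MemTable; when full,
# it is flushed as an immutable sorted SSTable; reads check the MemTable
# first, then SSTables newest-to-oldest, skipping any whose key-set
# "filter" rules the key out.

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # newest last; each entry is (keyset, sorted pairs)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        table = sorted(self.memtable.items())
        # a real engine uses a Bloom filter; a plain set plays that role here
        self.sstables.append((set(self.memtable), table))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                       # most recent data first
            return self.memtable[key]
        for keyset, table in reversed(self.sstables):  # newest to oldest
            if key not in keyset:                      # "definitely not here": skip
                continue
            for k, v in table:
                if k == key:
                    return v
        return None

db = ToyLSM()
db.put("a", 1)
db.put("b", 2)      # MemTable full: flushed as an L0 SSTable
db.put("a", 3)      # newer version of "a" lives only in the MemTable
print(db.get("a"))  # 3 - the MemTable shadows the older SSTable version
print(db.get("b"))  # 2 - found in the flushed SSTable
```

Note what is missing compared with a real engine: compaction (merging SSTables and dropping tombstones), the WAL, and a probabilistic Bloom filter instead of an exact key set.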
B-Tree
What it is: A read-optimised, balanced tree structure used by PostgreSQL (heap + B-tree indexes), MySQL InnoDB, Oracle, and SQLite. B-trees update data in-place on disk.
Write path: Finds the appropriate leaf page in the tree and writes the new or updated value directly to that page. Pages that are too full are split. The WAL records the change for crash recovery, but the actual data page is modified on disk.
Read path: Efficient O(log n) traversal from root to leaf. All data is in sorted order on disk, making range queries fast with sequential I/O.
Trade-off: Read performance is excellent; write performance is good for random updates but requires random I/O (writing to a specific page wherever it is on disk). Writes in B-trees go to exactly the right place; LSM writes always go to the end of the current WAL/MemTable.
| Property | LSM Tree | B-Tree |
| :--- | :--- | :--- |
| Write performance | Excellent (sequential append) | Good (random in-place write) |
| Read performance | Good with Bloom filters | Excellent (direct B-tree traversal) |
| Space amplification | Higher during compaction | Lower (in-place updates) |
| Write amplification | Higher (compaction rewrites data) | Lower |
| Best for | High write throughput (Cassandra, RocksDB, ClickHouse) | Mixed read/write OLTP (PostgreSQL, MySQL) |
Performance Analysis: LSM Tree vs B-Tree Trade-offs
The LSM tree's sequential-write design produces high write throughput but introduces read amplification and write amplification as structural costs. Read amplification occurs because a read must consult the MemTable plus potentially all SSTable levels before finding the target key (mitigated by Bloom filters). Write amplification occurs because compaction re-writes the same data multiple times as it moves through levels (L0 → L1 → L2 → …). In RocksDB with level compaction, a single user write can result in 10–30 bytes written to disk per byte of actual data.
B-trees have lower read amplification (O(log n) direct traversal to the correct page) but suffer from random write I/O — the engine must locate and update the specific disk page where the key lives, which translates to random seeks on spinning disks or random-access write patterns on SSDs. B-trees also tend to have higher space amplification than a fully compacted LSM tree because partially-filled pages are never reclaimed without explicit VACUUM/page rebalancing.
Operational implication: In Cassandra or RocksDB under high write load, monitor compaction_pending_tasks and compaction_executor_active_tasks. When pending tasks accumulate faster than compaction can drain them, read latency degrades because there are more SSTables to scan per key lookup. Plan compaction throughput (in MB/s) as part of capacity planning for any LSM-based system.
MVCC (Multi-Version Concurrency Control)
What it is: A concurrency control technique where the database maintains multiple versions of each row rather than updating in-place. When a transaction modifies a row, a new version is created; old versions are kept until no active transaction needs them.
Why it matters: MVCC allows readers and writers to operate concurrently without blocking each other. A reader gets a consistent snapshot of the database at its start time (seeing only versions committed before it started). Writers create new versions. Garbage collection (VACUUM in PostgreSQL, purge in MySQL InnoDB) removes versions that are no longer needed by any active transaction.
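The snapshot-visibility rule can be made concrete with a minimal sketch, modeled loosely on PostgreSQL's xmin/xmax scheme (the dict shape here is hypothetical): a row version is visible to a snapshot iff the transaction that created it had committed when the snapshot was taken, and no transaction in the snapshot has deleted it.

```python
# Minimal MVCC visibility sketch. A "snapshot" is the set of transaction
# IDs (xids) that had committed when the reading transaction started.

def visible(version, snapshot):
    created_ok = version["xmin"] in snapshot                 # creator committed
    deleted_ok = version["xmax"] is None or version["xmax"] not in snapshot
    return created_ok and deleted_ok

# Version chain for one row: xid 10 created it, xid 20 later replaced it.
versions = [
    {"value": "old", "xmin": 10, "xmax": 20},
    {"value": "new", "xmin": 20, "xmax": None},
]

early_snapshot = {10}       # taken before xid 20 committed
late_snapshot = {10, 20}    # taken after xid 20 committed

print([v["value"] for v in versions if visible(v, early_snapshot)])  # ['old']
print([v["value"] for v in versions if visible(v, late_snapshot)])   # ['new']
```

Both readers see exactly one version of the row without blocking the writer; garbage collection would eventually remove the "old" version once no active snapshot can see it.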
📡 Observability: Reading the Vital Signs of a Distributed Production System
Building a distributed system is only half the work. Being able to observe, diagnose, and understand it in production is the other half.
Replication lag (revisited with measurement tools): In PostgreSQL, pg_stat_replication exposes write_lag, flush_lag, and replay_lag per replica — the wall-clock time between when the primary wrote a WAL record and when the replica wrote/flushed/applied it. In MySQL: SHOW REPLICA STATUS returns Seconds_Behind_Source (Seconds_Behind_Master in older versions). In Kafka: the MirrorMaker consumer lag metric. High replication lag is often the first warning sign of a replica falling behind before it becomes a consistency problem.
Consumer lag (Kafka): The number of messages between a consumer group's current committed offset and the latest offset on the partition. A consumer lag of 500,000 messages means the consumer is 500,000 events behind the latest. Monitored via Kafka's __consumer_offsets topic, JMX metrics, or Confluent Control Center. Spikes in consumer lag indicate the consumer is too slow — possible causes: processing bottleneck, broker GC pause, consumer rebalance.
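The lag arithmetic is simple enough to sketch directly (the partition numbers and offsets below are made-up sample data):

```python
# Consumer lag per partition, as defined above:
#   lag = latest offset on the partition - consumer group's committed offset

latest_offsets    = {0: 1_250_000, 1: 980_000, 2: 1_100_000}
committed_offsets = {0: 1_249_100, 1: 480_000, 2: 1_100_000}

lag = {p: latest_offsets[p] - committed_offsets[p] for p in latest_offsets}
print(lag)                    # {0: 900, 1: 500000, 2: 0}
print(max(lag, key=lag.get))  # 1 - the partition whose consumer is furthest behind
```

In practice these numbers come from the broker (JMX metrics or the __consumer_offsets topic), but the per-partition view matters: an aggregate lag number can hide one badly stalled partition behind many healthy ones.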
Backpressure: The mechanism by which a downstream consumer signals to an upstream producer to slow down. Without backpressure, a fast producer overwhelms a slow consumer's buffers and causes crashes or data loss. Reactive Streams (Java), Akka Streams, and Apache Flink all have first-class backpressure support.
Fanout: One write that triggers many downstream reads or writes. Classic example: a user with 10 million followers posts a tweet. The write fanout — writing that tweet to 10 million followers' home timeline feeds — represents 10 million downstream writes. Systems that handle fanout-heavy workloads must be designed to queue and throttle fanout workers (Twitter's Finagle/Flock architecture). Contrast push fanout (write to all follower caches on post) vs. pull fanout (users read from a global feed and merge at read time).
Thundering herd: When many clients simultaneously try to access a resource — often triggered by a cache expiry, server restart, or a popular item suddenly being requested. If 10,000 clients all miss the cache simultaneously and all hit the database, the database receives 10,000 concurrent queries it was not designed to handle. Mitigation: probabilistic early expiry (refresh cache slightly before it expires), distributed locks on cache population, request coalescing.
Hot key: A single cache or database key receiving disproportionate traffic. The celebrity problem — every user querying a single celebrity's profile simultaneously. Mitigation: fan out the hot key to multiple shards (read from a random shard among hot-key replicas), use a local in-process cache for the hot key, or add client-side caching.
Cache stampede: The specific thundering herd scenario where many threads simultaneously discover a cache miss for the same key and all start computing/fetching the value to populate the cache. Mitigation: lock the cache key during population (first thread to detect the miss holds the lock; others wait or return stale data).
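The lock-the-key-during-population mitigation can be sketched as a single-flight wrapper. This is a single-process, thread-based illustration (the function names are hypothetical); a distributed deployment would use a distributed lock or request coalescing at a shared caching tier instead.

```python
# Single-flight cache population: the first thread to miss acquires a
# per-key lock and computes the value; threads arriving meanwhile wait
# on the lock and reuse the cached result instead of all hitting the
# database in parallel.

import threading
from collections import defaultdict

cache = {}
key_locks = defaultdict(threading.Lock)
db_calls = 0

def expensive_db_fetch(key):
    global db_calls
    db_calls += 1          # count how often the backing store is actually hit
    return f"value-for-{key}"

def get_with_single_flight(key):
    if key in cache:
        return cache[key]
    with key_locks[key]:           # only one thread populates per key
        if key in cache:           # re-check: another thread may have filled it
            return cache[key]
        cache[key] = expensive_db_fetch(key)
        return cache[key]

threads = [threading.Thread(target=get_with_single_flight, args=("hot",))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(db_calls)  # 1 - fifty concurrent misses, one database query
```

The double-check inside the lock is the essential detail: without it, every thread that queued on the lock would still run its own fetch after acquiring it.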
🌐 CAP Theorem and PACELC: The Consistency–Availability Trade-off Decoded
The CAP theorem and its successor PACELC are the theoretical frameworks behind every "which database should I use" decision in a distributed system design discussion.
CAP theorem (Brewer, 2000): In the presence of a network partition (P), a distributed system must choose between Consistency (C — every read returns the most recent write) and Availability (A — every request receives a response, though it may not be the latest data). A system cannot guarantee both simultaneously during a partition.
Important nuance: CAP's "Consistency" refers specifically to linearizability — not eventual consistency, not read-your-writes, not causal consistency. Many discussions muddy this. When an interviewer asks about CAP, they mean linearizability.
CP system: Prioritises consistency over availability during partitions. During a partition, a CP system will refuse to respond (or return an error) rather than return potentially stale data. Examples: HBase, Apache ZooKeeper, etcd, Google Bigtable.
AP system: Prioritises availability over consistency during partitions. During a partition, the system continues serving requests using whatever local data it has, accepting that responses may be stale or conflict when the partition heals. Examples: Apache Cassandra, Amazon DynamoDB (default), CouchDB, Riak.
PACELC (Daniel Abadi, 2012): Extends CAP to also characterise the latency-consistency trade-off when there is no partition. The full acronym: if there is a Partition (P), choose between Availability (A) and Consistency (C). Else (E), even without a partition, choose between Latency (L) and Consistency (C).
| Database | Partition behaviour | No-partition behaviour | PACELC class |
| :--- | :--- | :--- | :--- |
| Cassandra | Available | Low latency over consistency | PA/EL |
| DynamoDB (eventual) | Available | Low latency | PA/EL |
| HBase | Consistent | High consistency, higher latency | PC/EC |
| ZooKeeper / etcd | Consistent | High consistency | PC/EC |
| PostgreSQL (sync replication) | Consistent | Higher latency | PC/EC |
| CockroachDB | Consistent | Balanced | PC/EC |
| MongoDB (default) | Available | Low latency | PA/EL |
| Spanner | Consistent | Consistent (TrueTime) | PC/EC |
The following diagram shows the consistency model spectrum from weakest to strongest, with real-world database examples at each level:
```mermaid
flowchart LR
    EC[Eventual Consistency - Cassandra default] --> SEC[Strong Eventual - CRDTs]
    SEC --> BS[Bounded Staleness - Azure Cosmos]
    BS --> SS[Session Consistency - DynamoDB sessions]
    SS --> CC[Causal Consistency - MongoDB causal]
    CC --> SC[Sequential Consistency - ZooKeeper]
    SC --> LIN[Linearizability - Spanner - etcd]
```
Each step from left to right adds stronger guarantees but increases coordination cost (and therefore latency). Eventual consistency at the far left requires no cross-node coordination on reads. Linearizability at the far right requires that a read confirm the current value with a quorum of nodes before returning.
Linearizability: The system behaves as if there is a single copy of the data and all operations take effect instantaneously at some point between their invocation and their response. The strongest consistency model. Any read returns the most recent committed write, globally. Achieved by single-leader replication with synchronous reads to the leader, or by consensus-based protocols (Raft, Paxos, Spanner TrueTime).
Sequential consistency: All processes see operations in the same global order, but that order does not need to match real-time order. Weaker than linearizability. Two clients may observe an order that does not match wall-clock time, but they agree on a single consistent sequence.
Causal consistency: Operations that have a causal relationship (A → B) are seen in that order by all nodes. Concurrent operations (neither causally precedes the other) may be seen in different orders by different nodes. Stronger than eventual, weaker than sequential.
Eventual consistency: Given no new writes, all replicas will eventually converge to the same value. No timing guarantee. The weakest useful consistency model. Used where availability and low latency matter more than freshness.
Strong eventual consistency (SEC): Eventual consistency with the additional property that all nodes that have received the same set of updates will have the same state — even if updates arrived in different orders. Achieved using CRDTs (Conflict-free Replicated Data Types), which define merge operations that are commutative, associative, and idempotent.
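The simplest CRDT makes those three merge properties concrete: a G-Counter (grow-only counter) where each node increments its own slot and merge takes the element-wise maximum. This is a minimal sketch, with replicas represented as plain dicts:

```python
# G-Counter CRDT: merge is commutative, associative, and idempotent, so
# replicas that receive the same updates in any order converge to the
# same state - strong eventual consistency.

def merge(a, b):
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Three replicas increment independently, then gossip in different orders.
n1, n2, n3 = {"n1": 3}, {"n2": 5}, {"n3": 2}

state_a = merge(merge(n1, n2), n3)   # one gossip order
state_b = merge(n3, merge(n2, n1))   # a different order
print(state_a == state_b)            # True - order does not matter
print(value(state_a))                # 10
print(merge(state_a, state_a) == state_a)  # True - idempotent
```

Because each node only ever writes its own slot and max never loses information, no conflict can arise; richer CRDTs (PN-Counters, OR-Sets) extend the same idea to decrements and removals.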
Bounded staleness: Reads may lag behind the latest write by at most X seconds or X operations. Azure Cosmos DB exposes this as a named consistency level. Provides a predictable freshness window for analytics or reporting use cases that can tolerate known-bounded lag.
Session consistency: Within a single client session, all four session guarantees hold: read-your-writes, monotonic reads, monotonic writes, and writes-follow-reads (causal ordering within the session). Outside the session, weaker consistency applies. This is the most common consistency level for user-facing applications.
🌍 Where These Concepts Show Up in Production Systems You Know
These are not academic terms. Each definition in this glossary maps directly to a production decision made by engineers at companies you use every day.
Stripe and Idempotency Keys: Stripe's payment API mandates idempotency keys precisely because at-least-once delivery is the only safe guarantee over HTTPS. A payment request that times out cannot be retried safely without an idempotency key — re-sending the request without one would charge the customer twice. Every Stripe API client must generate a UUID per logical payment and include it as the Idempotency-Key header. Stripe's backend deduplicates by key, returning the original response for any retry.
Discord and Read Skew in Message History: Discord migrated from Cassandra to ScyllaDB (then to a custom Rust storage layer) partly to address read-path anomalies under high fan-out. In leaderless systems, a client reading message history can receive results from replicas at different replication positions — the distributed equivalent of read skew. Messages can appear out-of-order or temporarily invisible between retries. Discord's architecture evolved toward per-channel causal consistency, tracking the last_seen_message_id in the session layer to implement monotonic reads.
Google Spanner and Linearizability at Scale: Spanner is the only widely-deployed system that provides external consistency (linearizability) across globally distributed datacenters. It achieves this by using TrueTime to assign globally unique commit timestamps with bounded uncertainty. Every transaction waits out the TrueTime uncertainty interval before committing, ensuring that any external observer who queries after a commit will see it. The cost: commit latency includes the uncertainty window (~7 ms across datacenters), but write throughput remains practical for financial and inventory systems.
Netflix and Eventual Consistency in the Catalog: Netflix's content catalog is stored in Cassandra with eventual consistency. Because catalog metadata (title descriptions, artwork, ratings) changes infrequently and being briefly stale is acceptable, PA/EL (Available + Low Latency) is the right trade-off. A user seeing an outdated thumbnail for 30 seconds is far less harmful than a catalog service being unavailable while waiting for cross-datacenter confirmation.
ZooKeeper and Leader Election in Kafka: Apache Kafka uses Apache ZooKeeper (or, since Kafka 3.3, KRaft — its own Raft implementation) for controller election. The Kafka controller is the broker responsible for managing partition leadership and broker registration. Under ZooKeeper, the first broker to create the ephemeral /controller znode becomes controller, and a monotonically increasing controller epoch serves as the fencing token; in the general ZooKeeper leader-election recipe, candidates create ephemeral sequential znodes and the one holding the lowest sequence number is the leader. When the controller crashes, its ephemeral node is deleted automatically (session expiry), triggering re-election.
⚖️ Trade-offs in Distributed Systems: The Decisions That Define Your Architecture
Every term in this glossary exists because there is a trade-off underneath it. Understanding the trade-off is more useful in an interview than memorising the definition.
Consistency vs. Availability (CAP): Choosing linearizability (CP) means that during a network partition, the system refuses requests rather than serve stale data. Choosing availability (AP) means stale data is served but the system stays responsive. The trade-off is not just theoretical: during an AWS AZ partition, a CP database (HBase, etcd) will go read-only or refuse writes; an AP database (Cassandra, DynamoDB) continues to serve and accept writes, reconciling conflicts after the partition heals. Choose based on whether stale data is tolerable or whether a partial view of reality is acceptable.
Isolation Level vs. Throughput: Upgrading from Snapshot Isolation to Serializable in PostgreSQL adds SSI overhead (tracking read/write dependency sets). In write-heavy workloads, SSI can increase abort rates (transactions are rolled back more frequently to prevent serializability violations), reducing effective throughput. The rule: use Serializable only when your application has cross-row invariants that must hold. Use Snapshot Isolation for everything else — in PostgreSQL, that means explicitly upgrading from the Read Committed default to Repeatable Read, which is implemented as Snapshot Isolation.
Replication Factor vs. Write Latency: In Cassandra with RF=3 and QUORUM consistency, every write must be confirmed by 2 out of 3 replicas. Higher RF protects against more concurrent node failures but costs write latency (more round-trips) and disk space (more copies). The rule: RF=3 is the standard production minimum; RF=5 is used for critical data in multi-datacenter deployments. Never use RF=1 for data that cannot be reconstructed.
LSM Write Amplification vs. B-Tree Read Performance: LSM trees sacrifice write amplification (compaction re-writes data multiple times) for sequential write throughput. B-trees sacrifice random write performance for O(log n) reads. The rule: if your write:read ratio is above 10:1, an LSM engine (Cassandra, RocksDB) is likely faster. If your read:write ratio is above 10:1 or you need strong range-query performance, a B-tree engine (PostgreSQL, MySQL InnoDB) is likely better.
Quorum Size vs. Fault Tolerance: Increasing W or R in a quorum system improves consistency at the cost of latency and availability. W=N means no write can succeed unless all replicas are reachable — maximum consistency, minimum availability. W=1 means any single node can acknowledge a write — maximum availability, minimum consistency. The sweet spot for most systems is W=⌊N/2⌋+1 (majority write quorum), which keeps accepting writes with up to ⌊N/2⌋ nodes down for odd N.
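The majority-quorum arithmetic is worth computing explicitly. A short sketch, assuming a majority write quorum of ⌊N/2⌋ + 1 and that writes keep succeeding as long as at least W replicas are reachable:

```python
# Majority write quorum and the node failures it tolerates.

def majority_quorum(n):
    return n // 2 + 1            # floor(N/2) + 1

def tolerated_failures(n, w):
    # writes still succeed as long as at least w replicas are reachable
    return n - w

for n in (3, 5, 7):
    w = majority_quorum(n)
    print(n, w, tolerated_failures(n, w))
# 3 2 1
# 5 3 2
# 7 4 3
```

This is why RF=3 tolerates one node failure under QUORUM writes, and RF=5 tolerates two — each extra pair of replicas buys one more tolerable failure.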
Failure Mode: Split Brain: The most catastrophic failure in a replicated system. Prevention requires strict quorum enforcement (a node that cannot reach a quorum must refuse to accept writes, even if it believes it is the leader) and fencing tokens at the storage layer. Two prevention layers are better than one: epoch-based leader election in the consensus layer, plus monotonic fencing tokens enforced by the storage backend.
🧭 Decision Guide: Choosing the Right Consistency and Storage Model
Use this guide when you face a design question about consistency guarantees, isolation levels, or storage engines.
| Situation | Recommendation | Key reason |
| :--- | :--- | :--- |
| Financial transactions with multi-row invariants (e.g., double-entry ledger) | Serializable isolation | Write skew can violate ledger invariants; Snapshot Isolation is insufficient |
| High-throughput analytics reads on eventually consistent data | Eventual consistency (AP database) | Freshness is not critical; availability and throughput matter more |
| Single-datacenter OLTP with mixed reads and writes | PostgreSQL with Read Committed (default) or Repeatable Read | Good balance of concurrency and safety; upgrade to Serializable only if write skew is possible |
| Globally distributed system where reads must be fast in every region | Leaderless replication (Cassandra/DynamoDB) with LOCAL_QUORUM | Cross-datacenter synchronous writes are too slow; local quorum provides regional consistency |
| Write-heavy time-series ingestion (IoT sensors, log aggregation) | LSM-based engine (Cassandra, RocksDB, ClickHouse) | Sequential writes minimise random I/O; compaction reclaims space |
| Mixed OLTP with frequent range queries and point lookups | B-tree engine (PostgreSQL, MySQL InnoDB) | Sorted pages optimise range scans; B-tree traversal optimises point lookups |
| Distributed lock / leader election | ZooKeeper or etcd (CP systems) with fencing tokens | You need linearizability for lock safety; AP systems cannot provide it |
| User session data (read-your-writes, monotonic reads) | Session consistency (DynamoDB sessions, sticky routing) | Users expect to see their own writes; strong global consistency is unnecessary overhead |
| Avoid | Reason |
| :--- | :--- |
| Read Uncommitted in any production system | Dirty reads allow acting on uncommitted, potentially rolled-back data |
| Snapshot Isolation for scheduling or on-call systems with shared invariants | Write skew can violate invariants even when each transaction looks locally safe |
| Leaderless replication with W=1 for financial data | A single node failure before replication can cause data loss |
| RF=1 for any data that is not easily reconstructable | A single disk failure permanently loses the data |
🧪 Scenario Walkthroughs: Applying the Vocabulary to Real Design Problems
These scenarios test whether you can use the vocabulary correctly in context — the way an interviewer would present a problem.
Scenario 1: The On-Call Scheduling Bug (Write Skew in Practice)
Your team operates a hospital scheduling application. The invariant: at least one doctor must be on-call at all times. The application checks the on-call count before allowing a doctor to go off-call. Two doctors, Alice and Bob, both open the "Go Off-Call" dialog at the same moment. Both see 2 doctors on-call. Both transactions proceed.
What happened? Write skew. Each transaction read the shared invariant (on-call count ≥ 2), determined it was safe to write (decrement), and committed. Neither write individually violated the local pre-condition. The global invariant broke because the two transactions' reads and writes formed a conflicting dependency cycle that Snapshot Isolation cannot detect.
Fix: Use SELECT ... FOR UPDATE (pessimistic locking) on the on-call roster row before checking the count, or upgrade the isolation level to Serializable. Serializable Snapshot Isolation (PostgreSQL's SSI) will detect the read-write dependency cycle and abort one of the two transactions with a serialization failure, prompting a retry.
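The anomaly itself can be simulated in a few lines. This sketch models only the essential behaviour of Snapshot Isolation: both transactions run their check against the same start-of-transaction snapshot, and because they write disjoint rows, no write-write conflict is ever detected.

```python
# Write skew simulation: two transactions each read the same snapshot
# (2 doctors on-call), each sees the safety check pass, and each writes
# a different row - together breaking the invariant.

on_call = {"alice": True, "bob": True}

def go_off_call_under_snapshot_isolation(doctor, snapshot):
    # the check runs against the transaction's snapshot, not live state
    if sum(snapshot.values()) >= 2:      # "at least one other doctor remains"
        on_call[doctor] = False          # commits: rows are disjoint, so
                                         # Snapshot Isolation sees no conflict

snapshot = dict(on_call)                 # both transactions start here
go_off_call_under_snapshot_isolation("alice", snapshot)
go_off_call_under_snapshot_isolation("bob", snapshot)
print(sum(on_call.values()))  # 0 - the global invariant is violated
```

Under Serializable (SSI), the read-write dependency between the two transactions would form a dangerous cycle and one of them would abort instead of committing.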
Scenario 2: The Stripe Retry That Charged Twice (Idempotency Failure)
A mobile payment client sends a POST /charge request to a payment API. The network times out after 3 seconds. The client retries. The server processed the original request successfully but the response was lost in transit. The server receives the retry, treats it as a new request, and charges the customer again.
What happened? Missing idempotency key. The payment API was not idempotent. At-least-once delivery (the default for any retryable HTTP call) combined with a non-idempotent server handler produces duplicate charges.
Fix: The client generates a UUID for the logical payment intent before the first attempt. Every retry includes the same UUID as the Idempotency-Key header. The server stores (idempotency_key, response) in a durable key-value store with a TTL. On receiving a request, the server checks for the key first. If found, return the cached response. If not found, process and store. The idempotency key makes the endpoint safe for at-least-once delivery.
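The fix can be sketched end to end. An in-memory dict stands in for the durable key-value store, and charge() with its response shape is hypothetical; the structure (check, process, store, replay) is the part that matters.

```python
# Idempotency-key deduplication: retries with the same key replay the
# original response instead of re-running the side effect.

import uuid

idempotency_store = {}   # idempotency_key -> cached response
charges_processed = []

def charge(amount_cents):
    charges_processed.append(amount_cents)   # the side effect we must not repeat
    return {"status": "succeeded", "amount": amount_cents}

def handle_charge(idempotency_key, amount_cents):
    if idempotency_key in idempotency_store:       # retry: replay original response
        return idempotency_store[idempotency_key]
    response = charge(amount_cents)
    idempotency_store[idempotency_key] = response  # store before acknowledging
    return response

key = str(uuid.uuid4())                 # one UUID per logical payment intent
first = handle_charge(key, 500)
retry = handle_charge(key, 500)         # network timeout -> client retried
print(first == retry)                   # True - same cached response replayed
print(len(charges_processed))           # 1 - the customer was charged once
```

A production version must store the (key, response) pair durably and atomically with the charge itself; if the server crashes between charging and storing the key, the retry would still double-charge.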
Scenario 3: The Quorum That Was Not Strong Enough
A Cassandra cluster with N=3, W=1, R=1 is storing user profile data. One node receives a write (user changes their email address) and acknowledges it. The node crashes before replication. A subsequent read with R=1 contacts one of the two remaining nodes — neither of which has the new email. The user sees their old email address even though the write was confirmed.
What happened? W + R = 2, which is not greater than N = 3. The quorum condition W + R > N is not satisfied. A write acknowledged by 1 node and a read contacting 1 different node have no guaranteed overlap.
Fix: Set W=2, R=2 (W + R = 4 > N = 3). Now any read quorum of 2 must include at least one of the 2 nodes that confirmed the write. Alternatively, W=1, R=3 (read all nodes) also satisfies the condition, but at higher read latency cost. For user profile data where read-your-writes is important, W=2, R=2 with LOCAL_QUORUM in a single datacenter is the standard configuration.
🔍 How Open-Source Systems Implement These Concepts
Understanding the vocabulary in isolation is not enough. Seeing how mature open-source systems implement each concept cements the definitions.
PostgreSQL and Isolation Levels: PostgreSQL implements Read Committed, Repeatable Read, and Serializable using MVCC. Each transaction is assigned a transaction ID (xid). Row versions carry xmin (the xid that created the row) and xmax (the xid that deleted it). A transaction's snapshot includes all xids committed before it started. Serializable is implemented via Serializable Snapshot Isolation (SSI), which tracks read/write dependencies and aborts transactions that form a dangerous cycle. Configure with:
```sql
-- Set isolation for a single transaction
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- or per-session
SET default_transaction_isolation = 'repeatable read';
```
Cassandra and Quorum: Cassandra implements leaderless replication with tunable consistency. Each read or write operation specifies a consistency level that determines the quorum size. QUORUM in a single datacenter means floor(RF/2) + 1 replicas must respond. LOCAL_QUORUM means a quorum within the local datacenter only (used for multi-datacenter deployments where cross-datacenter latency would be unacceptable). Hinted handoff and read repair handle temporary node failures. For a full Cassandra deep-dive, see System Design Replication and Failover.
```yaml
# cassandra.yaml - hinted handoff configuration
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000  # 3 hours - max time to store hints
hinted_handoff_throttle_in_kb: 1024
```
ZooKeeper and Fencing Tokens: ZooKeeper provides distributed coordination via sequential ephemeral znodes. Every create of a sequential node appends a monotonically increasing sequence number to the node name. A service acquiring a distributed lock creates a sequential ephemeral node and receives its sequence number — this is the fencing token. The storage backend that this lock protects rejects operations carrying a fencing token lower than the highest it has seen. Ephemeral nodes are automatically deleted when the client session expires, so a crashed leader automatically releases its lock.
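The storage-side check is the part that is easy to get wrong, and it is tiny. A minimal sketch (class and method names are illustrative):

```python
# Fencing-token enforcement at the storage backend: record the highest
# token seen and reject writes carrying a lower one, so a paused
# ex-leader holding an old token cannot corrupt data.

class FencedStore:
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                 # stale leader: reject the write
        self.highest_token = token       # tokens only move forward
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(33, "config", "v1"))   # True  - leader holding token 33
print(store.write(34, "config", "v2"))   # True  - new leader, higher token
print(store.write(33, "config", "v3"))   # False - old leader wakes up, rejected
print(store.data["config"])              # v2
```

Note that the check lives in the storage system, not the lock service: a lock service that hands out tokens but is never consulted on the write path provides no fencing at all.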
Kafka and Consumer Lag / Exactly-Once: Kafka tracks consumer progress via committed offsets stored in the __consumer_offsets internal topic. Consumer lag for a partition = latest_offset - consumer_committed_offset. Kafka's exactly-once semantics (EOS) combines idempotent producers (each message carries a producer ID + sequence number; the broker deduplicates retries) with transactional APIs (atomically write to multiple partitions and commit the consumer offset in the same transaction). For deeper Kafka internals, see How Kafka Works.
📚 Lessons Learned: What Production Systems Teach You About This Vocabulary
These are the hard-won lessons that distinguish engineers who understand the definitions from engineers who have operated these systems at scale.
Snapshot Isolation is not Serializable — and production bugs live in that gap. Many engineers assume that MVCC with Snapshot Isolation (the default in many cloud databases) prevents all isolation anomalies. It does not prevent write skew. The write skew scenario — two concurrent transactions each reading a shared condition and writing in a way that together violates an invariant — is exactly the class of bug that a banking or scheduling system can silently produce under Snapshot Isolation. Always check whether your application has invariants that span multiple rows or multiple objects. If it does, escalate to Serializable.
CAP is about partitions, not about consistency in steady state. A common interview mistake is to say "Cassandra is AP so it doesn't support consistency." Cassandra supports strong consistency within a datacenter when you set QUORUM and RF=3. CAP's trade-off only activates during a network partition. In steady state (no partition), AP systems can be made consistent by setting a high quorum. PACELC is the better model for steady-state reasoning.
Quorum math is about overlaps, not majorities. The W + R > N condition is not about majority vote — it is about ensuring that at least one node in a read quorum must have participated in a write quorum. With N=5, W=3, R=3: any three-node read set must overlap with any three-node write set by at least one node (since 3+3=6 > 5). This is why W=1, R=5 is valid quorum: every read contacts every node, so it must hit the one node that received the write.
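The overlap claim can be verified exhaustively for small N. This sketch enumerates every possible write set and read set and checks that they intersect:

```python
# Exhaustive check of the overlap argument: with N=5, W=3, R=3, every
# 3-node read set intersects every 3-node write set, while W=2, R=2
# (W+R=4, not > 5) admits disjoint sets.

from itertools import combinations

def always_overlaps(n, w, r):
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

print(always_overlaps(5, 3, 3))  # True  - W+R=6 > 5
print(always_overlaps(5, 2, 2))  # False - W+R=4, a read can miss the write
print(always_overlaps(5, 1, 5))  # True  - reading all nodes always overlaps
```

The failing case makes the lesson concrete: with W=2, R=2 out of N=5, the write can land on nodes {0, 1} while the read contacts {2, 3} — no overlap, no freshness guarantee.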
Fencing tokens are required even after leader election. A freshly-elected leader may still have a stale predecessor attempting writes. Leader election alone does not prevent split brain at the storage layer. The fencing token (a monotonically increasing epoch number validated by the storage system) is the actual prevention mechanism. Always ensure the storage backend participates in the fencing check — a lock service that does not enforce fencing provides incomplete protection.
Replication lag is not just a performance metric — it is a correctness metric. High replication lag in a system where reads are routed to replicas means users may be reading data that is seconds or minutes out of date. If your SLA requires read-your-writes consistency and your application routes reads to replicas, you need a session-routing mechanism (always read from the primary after a write, or use a replication lag threshold before allowing replica reads).
LSM compaction pauses are a production reliability concern. A Cassandra or RocksDB node under heavy write load accumulates SSTables faster than compaction can merge them. When compaction falls behind, read amplification grows (more SSTables to check per read), disk space grows, and eventually read latency degrades significantly. Monitor compaction pending task counts and tune compaction throughput for your write/read ratio.
📌 TLDR: The 60-Second Distributed Systems Vocabulary Reference
- Read Skew — two reads in the same transaction see different snapshots; prevented by Snapshot Isolation.
- Write Skew — two concurrent transactions each check a shared condition and write in a way that violates it together; prevented only by Serializable isolation.
- Lost Update — two concurrent read-modify-write cycles; second write overwrites first; prevented by atomic operations, CAS, or Serializable isolation.
- Isolation levels escalate from Read Uncommitted → Read Committed → Repeatable Read → Snapshot Isolation → Serializable. Each step costs latency and throughput in exchange for preventing a new class of anomaly.
- Snapshot Isolation ≠ Serializable. Write skew is the gap between them.
- Quorum: W + R > N — the arithmetic condition that guarantees at least one node in a read quorum has the latest write.
- Split brain requires fencing tokens to prevent, not just epoch numbers in the lock service.
- LSM trees favour writes (sequential, in-memory first); B-trees favour reads (in-place, sorted on disk).
- CAP is a partition-time trade-off. PACELC covers both partition-time and steady-state latency/consistency trade-offs.
- Linearizability is the strongest consistency model (real-time, globally ordered). Eventual consistency is the weakest (converges eventually, no timing guarantee).
📝 Practice Quiz: The Five Most Commonly Confused Distributed Systems Concepts
Test your grasp of the concepts most frequently muddled in interviews and production discussions.
Transaction A reads account balances for accounts 1 and 2. In between those two reads, Transaction B transfers money from account 1 to account 2 and commits. Transaction A's two reads now see an inconsistent snapshot. Which anomaly is this?
- A) Write skew
- B) Read skew
- C) Lost update
- D) Phantom read

Correct Answer: B) Read skew — Transaction A's two reads span the boundary of Transaction B's commit, producing an internally inconsistent snapshot.
Two hospital scheduling transactions both read the on-call count, both see 2 doctors on-call, and both proceed to mark themselves off-call. After both commit, zero doctors are on-call. Which isolation level is the minimum required to prevent this?
- A) Read Committed
- B) Repeatable Read
- C) Snapshot Isolation (MVCC)
- D) Serializable

Correct Answer: D) Serializable — write skew involves concurrent transactions reading a shared condition and writing in a way that violates a combined invariant. Only Serializable (via 2PL or SSI) detects this conflict cycle.
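The hospital scenario can be replayed in miniature. A toy sketch (names are illustrative; in a real database each doctor would be a row): both transactions check the shared condition, then write to *different* rows, so Snapshot Isolation sees no write-write conflict and happily commits both.

```python
# Toy write-skew illustration: the invariant is "at least 1 doctor on-call".
on_call = {"alice": True, "bob": True}

def can_go_off_call():
    # Shared condition both transactions check: at least 2 doctors on-call.
    return sum(on_call.values()) >= 2

# Both transactions read the condition first — both see 2 on-call...
alice_ok = can_go_off_call()
bob_ok = can_go_off_call()

# ...then each writes to a *different* row, so there is no write-write
# conflict for Snapshot Isolation to detect.
if alice_ok:
    on_call["alice"] = False
if bob_ok:
    on_call["bob"] = False

# The combined invariant is now violated: zero doctors on-call.
assert sum(on_call.values()) == 0
```

Serializable isolation catches the read-then-write dependency cycle between the two transactions and aborts one of them, which is exactly the capability the weaker levels lack.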
A Cassandra cluster has replication factor N=5. You want reads to always see the latest confirmed write. What is the minimum valid combination of W and R?
- A) W=1, R=1
- B) W=2, R=2
- C) W=3, R=3
- D) W=5, R=1

Correct Answer: C) W=3, R=3 — with N=5, W + R must exceed 5, and 3 + 3 = 6 > 5. Note that option D also satisfies the arithmetic (5 + 1 = 6 > 5), but W=5 means a single node failure blocks every write, so C is the balanced choice. Other valid combinations: W=2, R=4; W=4, R=2; W=1, R=5.
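The quorum condition is pure arithmetic, so the quiz options can be checked mechanically. A minimal sketch (the function name is mine): W + R > N guarantees the read set and write set overlap in at least one node.

```python
def read_sees_latest_write(w, r, n):
    """Quorum overlap condition: read and write sets share >= 1 node."""
    return w + r > n

N = 5
valid = [(w, r) for w in range(1, N + 1) for r in range(1, N + 1)
         if read_sees_latest_write(w, r, N)]

# The quiz options, checked against the condition:
assert not read_sees_latest_write(1, 1, N)  # A) 1+1=2: sets may not overlap
assert not read_sees_latest_write(2, 2, N)  # B) 2+2=4: still <= 5
assert read_sees_latest_write(3, 3, N)      # C) 3+3=6 > 5: guaranteed overlap
assert read_sees_latest_write(5, 1, N)      # D) arithmetic holds, but W=5
                                            #    means any node failure blocks writes
```

Enumerating `valid` for N=5 shows the full menu of trade-offs: low W favours write availability, low R favours read latency, and the product W + R > N is what buys the overlap.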
A leader is elected in a Raft cluster. Moments later, the leader is paused by a JVM garbage collection pause lasting 12 seconds. During that time, a new leader is elected for the next term. Why is the old leader still dangerous even though its Raft term has expired, and what prevents it from corrupting data?
- A) The old leader's heartbeats will notify the new leader to yield
- B) Raft term numbers prevent the old leader from writing to any follower
- C) A fencing token from the lock service allows the storage backend to reject the stale leader's writes
- D) Network timeouts will automatically terminate the old leader's connections

Correct Answer: C) The fencing token is the storage-layer enforcement mechanism. Raft term numbers alone do not prevent a paused leader from resuming and attempting writes after a new leader has committed data. The storage backend validates the monotonically increasing fencing token and rejects writes from the stale leader.
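The fencing pattern described in the answer fits in a dozen lines. A hypothetical sketch (the `FencedStore` class and its field names are illustrative, not any real system's API): the storage backend remembers the highest token it has seen and rejects anything older, so a GC-paused zombie leader cannot clobber the new leader's writes.

```python
class FencedStore:
    """Toy storage backend that enforces monotonically increasing
    fencing tokens, rejecting writes from stale (e.g. GC-paused) leaders."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                 # stale leader: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(33, "x", "old-leader")   # leader holding token 33 writes
assert store.write(34, "x", "new-leader")   # new leader elected with token 34
# Old leader wakes from its 12-second GC pause and retries with its stale token:
assert not store.write(33, "x", "zombie")
assert store.data["x"] == "new-leader"      # committed data survives
```

The key design point: the check lives in the storage layer, not the lock service, because only the storage layer sees the write at the moment it would do damage.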
Open-ended: You are choosing a storage engine for a new service. The service ingests 500,000 sensor readings per second and needs to query the last 24 hours of data for a given sensor ID (low read volume, extremely high write volume). How would you reason about LSM tree vs. B-tree for this workload, and what compaction considerations would you factor into your capacity planning?
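One way to start on the open-ended question is a back-of-envelope calculation. A rough sketch — the record size and the write-amplification factor below are illustrative assumptions, not benchmarks — shows why compaction dominates the capacity plan for a write-heavy LSM workload:

```python
# Illustrative capacity sketch for the sensor-ingest question.
# All numbers are assumptions for demonstration, not measured values.
writes_per_sec = 500_000
bytes_per_reading = 100                  # assumed: sensor id + timestamp + value
ingest_mb_per_sec = writes_per_sec * bytes_per_reading / 1e6
assert ingest_mb_per_sec == 50.0         # 50 MB/s of raw ingest

# LSM compaction rewrites each byte several times as it moves down levels,
# so disks must sustain far more than the raw ingest rate. A write
# amplification around 10 is an assumed ballpark for leveled compaction.
write_amplification = 10
disk_mb_per_sec = ingest_mb_per_sec * write_amplification
assert disk_mb_per_sec == 500.0          # sustained disk write bandwidth needed

# The 24-hour retention window the queries must cover:
raw_day_gb = ingest_mb_per_sec * 86_400 / 1e3
assert round(raw_day_gb) == 4320         # ~4.3 TB/day before compression
```

The arithmetic favours the LSM tree: sequential flushes absorb the 500k writes/sec, and keying SSTables by (sensor ID, timestamp) keeps the low-volume 24-hour reads to a handful of sorted-file scans — but the budget must include compaction I/O, not just raw ingest.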
🔗 Related Posts: Build Your Distributed Systems Knowledge Graph
- System Design Replication and Failover
- Consistent Hashing Explained
- System Design Core Concepts: Scalability, CAP, and Consistency
- Types of Locks Explained: Tips for Maintaining Consistent Systems and Avoiding Write Conflicts
- Capacity Estimation Guide

Written by
Abstract Algorithms
@abstractalgorithms