
Read Skew Explained: Inconsistent Snapshots Across Multiple Objects

How a transaction can read two objects at different points in time, violating invariants that no single committed state ever broke.

Abstract Algorithms

TLDR: Read skew occurs when a transaction reads two logically related objects at different points in time — one before and one after a concurrent transaction commits — producing a view that never existed as a committed whole. Read Committed isolation cannot prevent it because each SQL statement gets a fresh snapshot. Snapshot isolation (PostgreSQL's REPEATABLE READ, CockroachDB's default SERIALIZABLE) fixes it by giving every transaction a single consistent point-in-time view for its entire duration.


It was 11:47 PM on a Tuesday when Meridian Bank's automated risk engine began generating fraud alerts — not a trickle, but a cascade. Within four minutes, 340 customer accounts had been frozen. The trigger: a batch compliance job had computed total balances across linked checking-and-savings account pairs and found a $200 discrepancy in each one. The invariant Meridian's core banking system enforced was non-negotiable: for every linked pair, balance(checking) + balance(savings) = $1,000. The reconciliation job was reporting $800 in hundreds of pairs simultaneously.

No money had actually disappeared. The bank's ledger was correct, every row on disk was valid, and no transaction had been aborted. What the risk engine saw was a ghost state — a snapshot assembled from two different points in time, straddling a transfer in progress. This is read skew. Restoring service required 90 minutes of manual account review. The root cause was three lines of application code missing a transaction wrapper and a default isolation level that permitted exactly this kind of inconsistency.


📖 The Meridian Bank Failure: What Read Skew Looks Like in Production

Meridian's overnight transfer service moved $200 from checking to savings accounts at high volume. The service had been written without a single wrapping database transaction: it performed two sequential, auto-committing SQL updates.

Step 1 — Commit the debit: UPDATE accounts SET balance = 300 WHERE type = 'checking' AND customer_id = ? The checking account balance drops from $500 to $300. This commit is immediately visible to every new reader.

Step 2 — Credit pending: UPDATE accounts SET balance = 700 WHERE type = 'savings' AND customer_id = ? This update is executing but has not yet committed. The savings account still shows $500 in the database's committed state.

Between those two commits — a window that lasted anywhere from a few milliseconds to tens of milliseconds under load — Meridian's compliance job executed its own database transaction. Under READ COMMITTED isolation, each SELECT statement gets a fresh snapshot of whatever data is committed at the moment that statement runs. The compliance transaction issued two queries:

SELECT balance FROM accounts WHERE type = 'checking' AND customer_id = ?;
SELECT balance FROM accounts WHERE type = 'savings'  AND customer_id = ?;

The first query returned $300 — the committed debit was already visible. The second query returned $500 — the credit had not yet committed, so the database returned the pre-transfer value. The compliance job computed $300 + $500 = $800, flagged a $200 discrepancy, and triggered the account freeze.

Both values were committed, valid, and fully consistent with the database's rules. Yet together they formed a picture that was never true at any single moment in time. The checking account had been debited but the savings account had not yet been credited — the compliance transaction caught the transfer midway through.

This is the defining characteristic of read skew: two objects with a logical relationship are read at different committed states, producing a composite view that violates the invariant both objects were designed to maintain together.
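The straddle above can be reproduced with a toy model — a sketch of statement-level snapshot behaviour, not any real database engine. The dict holds only committed state, and each "SELECT" looks at the store as it is at that instant, which is exactly what Read Committed permits:

```python
# Toy model of read skew under statement-level snapshots (Read Committed).
# "db" holds only committed state; each read sees the store as it is *now*.

db = {"checking": 500, "savings": 500}  # invariant: sum == 1000

def read_committed_select(account):
    """Each SELECT takes a fresh snapshot of the committed state."""
    return db[account]

# Reader's first query runs after the debit commit...
db["checking"] = 300                                # T1 step 1: debit commits
seen_checking = read_committed_select("checking")   # -> 300

# ...but the second query runs before the credit commit.
seen_savings = read_committed_select("savings")     # -> 500 (stale half)
db["savings"] = 700                                 # T1 step 2: credit commits

total = seen_checking + seen_savings
print(total)  # 800 -- a composite that never existed as a committed whole
```

The final committed state sums to 1,000; only the reader's assembled view is wrong.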


🔍 Why Read Skew Is Not the Same as a Non-Repeatable Read

Both anomalies are about reads returning unexpected values, which makes them easy to conflate. The distinction is precise and matters enormously for choosing a fix.

A non-repeatable read involves re-reading the same single row within a transaction and getting a different value. Transaction T2 reads row R, concurrent transaction T1 updates and commits R, T2 reads R again — and gets a new value. The anomaly is that one row changed underneath a reader. Preventing this requires locking or versioning that single row.

A read skew involves reading two different objects that share a logical invariant — and reading them at different committed states. T2 reads object A, then reads object B. Between those two reads, concurrent transaction T1 commits updates to both A and B. T2 holds A's value from before T1 and B's value from after T1 (or vice versa). The resulting pair violates the invariant even though every individual value T2 holds was, at some moment, a committed truth.

| Anomaly | Objects involved | What changes | Invariant at risk |
| --- | --- | --- | --- |
| Non-repeatable read | 1 row | The same row returns a different value on re-read | Row-level value consistency over time |
| Read skew | 2+ rows | Different rows are read from different committed states | Cross-row invariant (e.g., A + B = constant) |

The critical implication: preventing non-repeatable reads is not sufficient to prevent read skew. The ANSI SQL definition of REPEATABLE READ only guarantees that re-reading the same row returns the same value. It does not guarantee that two different rows are read from the same committed state. This is why read skew is also called the snapshot isolation anomaly — it is the specific anomaly that motivated the design of snapshot isolation.

Read skew is invisible to most monitoring. The database returns no error. Each value returned is a committed, valid row. The anomaly only surfaces when application logic combines the two values and checks an invariant — exactly what Meridian's compliance job did at the worst possible moment.
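The insufficiency of row-level repeatability can be shown directly. In this sketch (a toy model, not a real isolation implementation), the transaction caches each row on first read, which guarantees repeatable reads of any single row — yet the cross-row invariant still breaks:

```python
# Toy model: caching each row on first read prevents non-repeatable reads
# (re-reading a row always returns the same value) but NOT read skew,
# because the two rows can still be *first* read from different states.

db = {"a": 500, "b": 500}   # invariant: a + b == 1000
cache = {}                  # per-transaction row cache

def repeatable_read(row):
    """ANSI-style repeatable read: same row, same value within the txn."""
    if row not in cache:
        cache[row] = db[row]
    return cache[row]

first_a = repeatable_read("a")       # reads a == 500 (old state)
db["a"], db["b"] = 300, 700          # concurrent txn commits both atomically
first_b = repeatable_read("b")       # reads b == 700 (new state!)

assert repeatable_read("a") == first_a  # re-read is stable: no NRR anomaly
print(first_a + first_b)                # 1200 -- cross-row invariant broken
```

Every single-row guarantee holds, yet the pair (500, 700) was never a committed state.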


⚙️ How a Transfer Produces Two Incompatible Snapshots

The sequence diagram below traces the exact interaction that caused Meridian's outage. T1 is the transfer service performing a non-atomic debit-then-credit. T2 is the compliance reader. T2 uses READ COMMITTED isolation, which means each of its SELECT statements takes an independent snapshot of the currently-committed database state.

sequenceDiagram
    participant T1 as Transfer Service
    participant DB as Accounts Database
    participant T2 as Risk Engine Reader

    T1->>DB: Debit checking. Set balance 500 to 300. COMMIT.
    Note over DB: Checking=300 committed. Savings=500 unchanged. Invariant window open.
    T2->>DB: BEGIN
    T2->>DB: SELECT checking account balance
    DB-->>T2: 300
    T1->>DB: Credit savings. Set balance 500 to 700. Not yet committed.
    T2->>DB: SELECT savings account balance
    DB-->>T2: 500. Savings credit not yet committed.
    T1->>DB: COMMIT savings credit
    Note over T2: Computed total = 300 + 500 = 800. Invariant expects 1000.

The diagram shows the invariant window: the gap between when T1's debit becomes visible and when T1's credit completes. T2 starts its transaction inside that window and issues its two SELECT statements before the window closes. Under READ COMMITTED, T2's first SELECT takes a snapshot at the moment it runs — catching the committed debit — and T2's second SELECT takes a new snapshot at the moment it runs — missing the still-pending credit. T2 now holds two individually correct values that are mutually inconsistent.

Three conditions must hold simultaneously for read skew to occur:

  1. A write is non-atomic — two related objects are updated in separate commits, or a single atomic commit becomes visible between a reader's two queries.
  2. The reader issues at least two separate queries targeting both updated objects.
  3. The reader's isolation level takes a fresh snapshot per query (Read Committed) rather than pinning to a single transaction-start snapshot (Snapshot Isolation).

Remove any one condition and read skew disappears. Snapshot isolation removes the third condition entirely: T2's snapshot is pinned at the moment BEGIN is called, making it immune to all subsequent commits regardless of how many queries T2 issues or how long the transaction runs.
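Condition 3's removal can be sketched as follows. A real MVCC engine does not physically copy data at BEGIN — it filters row versions by timestamp — but a deep copy models the resulting visibility exactly:

```python
import copy

# Sketch of snapshot isolation: a transaction pins a view of the committed
# state at BEGIN. (A real MVCC engine filters row versions by timestamp
# rather than copying; the deepcopy here models the visibility *result*.)

db = {"checking": 500, "savings": 500}

class SnapshotTxn:
    def __init__(self, store):
        self.snapshot = copy.deepcopy(store)  # pinned at BEGIN (T_start)

    def select(self, key):
        return self.snapshot[key]             # later commits are invisible

t2 = SnapshotTxn(db)              # BEGIN: snapshot pinned
db["checking"] = 300              # T1 commits debit after T_start
checking = t2.select("checking")  # still 500
db["savings"] = 700               # T1 commits credit after T_start
savings = t2.select("savings")    # still 500

print(checking + savings)         # 1000: invariant holds for the reader
```

No matter how many queries T2 issues or how long it runs, every read resolves against the same pinned view.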


❓ Why Read Committed Is Designed to Allow This Behaviour

Read Committed is the default isolation level in PostgreSQL, Oracle, SQL Server, and most cloud-hosted databases. Its defining characteristic is statement-level snapshots: every individual SQL statement sees the most recently committed data at the precise moment that statement begins executing. When the statement completes, the snapshot is released. The next statement takes a new one.

This design exists for a deliberate reason: it maximises concurrency. In a lock-based implementation, Read Committed holds row-level read locks only for the duration of the statement, releasing them immediately rather than holding them until the transaction commits. In an MVCC implementation, it simply takes a fresh read snapshot at each statement boundary. Either way, the result is fewer contention bottlenecks and higher throughput.

The guarantee Read Committed provides is narrow but useful: you will never read uncommitted data. Every value a SELECT statement returns was committed at some point before the statement started. This prevents the most severe anomaly, dirty reads, where a transaction reads another transaction's in-flight, uncommitted changes that might be rolled back.

The problem is the phrase "at some point." Two consecutive SELECT statements in the same transaction can legally refer to two different "some points," separated by however long elapsed between them. This is not a defect — it is a documented trade-off. The ANSI SQL specification acknowledges that Read Committed permits non-repeatable reads and, by extension, read skew.

The appropriate mental model is this: under Read Committed, your BEGIN...COMMIT block provides atomicity for writes but does not pin reads to a shared snapshot. Each SELECT is effectively issued by an independent micro-transaction that sees the world at that precise instant. This is why no amount of careful query ordering within a Read Committed transaction can prevent read skew — the isolation level is structurally incapable of providing the guarantee you need.


🧠 MVCC and the Consistent Snapshot: How Snapshot Isolation Ends the Straddle

Multi-Version Concurrency Control (MVCC) is the mechanism that makes snapshot isolation practical at scale without needing read locks. Every major database — PostgreSQL, Oracle, MySQL InnoDB, CockroachDB, and virtually all cloud-native databases — maintains multiple committed versions of each row. Each version is tagged with the transaction ID (or timestamp) of the transaction that wrote it.

The Internals of MVCC Row Versioning

When a row is updated, the database does not simply discard the existing value. PostgreSQL writes a new tuple into the heap and leaves the old one in place; MySQL InnoDB updates the row in place and copies the prior version into an undo record. Either way, each version is tagged with the writing transaction's commit information, and for a SELECT query the database walks the version chain and returns the most recent version whose commit timestamp satisfies the reader's visibility rules.

Under Snapshot Isolation, the database records a T_start value for each transaction — typically the system transaction ID at the moment BEGIN is called. For every row access, the database applies one rule: return the most recent version whose commit timestamp is ≤ T_start. New versions written by transactions that committed after T_start are skipped entirely.

The implementation differs by database:

  • PostgreSQL: Old row versions ("dead tuples") live in the same heap pages as live tuples. Each tuple has xmin (the transaction that inserted it) and xmax (the transaction that deleted/updated it) system columns. The visibility check compares xmin and xmax against the transaction's snapshot object — a list of all transaction IDs that were in-progress at T_start.
  • MySQL InnoDB: The undo log stores rollback segments containing old column values. When a snapshot read needs an old version, InnoDB reconstructs it on demand by applying undo records backward from the current row version.
  • CockroachDB: MVCC is implemented at the storage layer (RocksDB), keyed by (key, timestamp). Every write appends a new versioned key. Reads at a specific timestamp retrieve the latest version at or before that timestamp across all nodes in the distributed cluster.

To restate the rule concretely: for every query in the transaction, regardless of when it executes or which rows it touches, each row resolves to the most recent version whose commit timestamp is ≤ T_start. Commits by other transactions after T_start produce versions with commit timestamps > T_start, and those versions stay invisible to this transaction for its entire duration.
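The visibility rule is small enough to express as a function over a row's version chain. This is a sketch of the idea, not any engine's actual tuple format:

```python
# The MVCC visibility rule as a function: given a row's version chain
# (newest first) and a transaction's start timestamp, return the newest
# version committed at or before T_start.

def visible_version(version_chain, t_start):
    """version_chain: list of (commit_ts, value) pairs, newest first."""
    for commit_ts, value in version_chain:
        if commit_ts <= t_start:
            return value
    return None  # no version existed at T_start (row not yet inserted)

# Savings account: credit committed at t=3 on top of the t=0 original.
savings_versions = [(3, 700), (0, 500)]

# A reader that began at t=0 skips the t=3 version entirely:
assert visible_version(savings_versions, t_start=0) == 500
# A reader that began at t=4 sees the committed credit:
assert visible_version(savings_versions, t_start=4) == 700
```

Both of the transaction's reads apply the same t_start, which is why they can never straddle another transaction's commit.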

The table below compares what T2 sees under each isolation level during the Meridian transfer scenario. T2 starts at t=0. T1 commits the debit at t=1 and the credit at t=4.

| Time | Event | T2 under Read Committed | T2 under Snapshot Isolation |
| --- | --- | --- | --- |
| t=0 | T2 starts. Checking=500, Savings=500 | — | Snapshot pinned: C=500, S=500 |
| t=1 | T1 commits debit. Checking=300 | Next SELECT sees C=300 | Commit after T_start — invisible |
| t=2 | T2 SELECT checking | 300 (reads T1's debit) | 500 (snapshot pinned at t=0) |
| t=3 | T2 SELECT savings | 500 (T1's credit still pending) | 500 (snapshot pinned at t=0) |
| t=4 | T1 commits credit. Savings=700 | Too late: T2 already read 500 | Commit after T_start — invisible |
| t=5 | T2 computes total | 300 + 500 = 800 (anomaly) | 500 + 500 = 1000 (correct) |

Under Snapshot Isolation, T2 sees the world as it was at T_start for every object, for every query, for the entire transaction. The straddle is structurally impossible: T2 cannot see T1's debit without also seeing T1's credit, because T1's entire commit either falls before T_start (both writes visible) or after T_start (both writes invisible). There is no in-between.

Performance Analysis: The Overhead of Keeping Multiple Row Versions

Snapshot Isolation has a well-deserved reputation for low overhead under MVCC, but the cost is not zero. Understanding where it is spent matters for production capacity planning.

Read overhead: Negligible for most workloads. A snapshot read requires one additional comparison per row access (check xmin/xmax against the snapshot object). No locks are acquired. Readers never block writers, and writers never block readers — the signature performance advantage of MVCC over lock-based Repeatable Read.

Write overhead: Each UPDATE or DELETE creates a new row version, slightly increasing I/O compared to an in-place update. The extra version takes storage (heap page space in PostgreSQL, undo log space in MySQL). For write-heavy tables, the version chain can grow long, slowing version chain traversals for old snapshots.

Garbage collection bottleneck: Old row versions must be cleaned up. PostgreSQL's AUTOVACUUM daemon scans tables and removes dead tuples that are no longer visible to any active transaction. The critical issue: long-running Snapshot Isolation transactions delay garbage collection. A transaction pinned at T_start = 5000 prevents any row version that was still visible at T_start = 5000 from being recycled, even if millions of newer transactions have committed. In high-throughput PostgreSQL databases, a forgotten or stuck long-running transaction can cause table bloat — dead tuples accumulate faster than AUTOVACUUM can remove them — degrading scan performance progressively. Set statement_timeout and idle_in_transaction_session_timeout so that runaway or abandoned transactions cannot hold old snapshots indefinitely.

Transaction ID wraparound (PostgreSQL-specific): PostgreSQL uses 32-bit transaction IDs, which wrap around after ~2 billion transactions. Long-running Snapshot Isolation transactions that pin very old snapshots can approach this wraparound horizon, requiring emergency VACUUM FREEZE operations. Monitor age(datfrozenxid) on all databases and keep it well below the autovacuum_freeze_max_age threshold.


📊 Which Isolation Level Stops Read Skew: The Complete Anomaly Prevention Map

The ANSI SQL standard defines four isolation levels. Most production databases implement an additional level — Snapshot Isolation — which sits between Repeatable Read and Serializable in terms of strength. The table below maps each level to the anomalies it prevents or allows.

| Isolation Level | Dirty Read | Non-Repeatable Read | Read Skew | Phantom Read | Write Skew |
| --- | --- | --- | --- | --- | --- |
| Read Uncommitted | Possible | Possible | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible | Possible | Possible |
| ANSI Repeatable Read | Prevented | Prevented | Possible | Possible | Possible |
| Snapshot Isolation | Prevented | Prevented | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented | Prevented | Prevented |

Three observations critical for system design decisions:

Snapshot Isolation is the minimum viable fix for read skew. It costs almost nothing in a well-tuned MVCC database but closes the straddle window entirely. For read-heavy compliance or reporting transactions, upgrading from Read Committed to Snapshot Isolation is the correct first response.

Snapshot Isolation still allows write skew. If two transactions each read a consistent snapshot, make independent decisions based on what they read, and then write conflicting updates — both writes can commit successfully even though their combined effect violates a constraint. If write skew is also a risk in your system (common in booking, scheduling, and allocation patterns), you need Serializable.

PostgreSQL's Repeatable Read is stronger than the ANSI definition. PostgreSQL's REPEATABLE READ delivers true Snapshot Isolation — the entire transaction sees a single pinned snapshot. This is stronger than what the ANSI standard requires and specifically prevents read skew. Oracle does not have a standard Repeatable Read level; it moves directly from Read Committed to Serializable. Never assume the isolation level name means the same thing across databases.


🏗️ The Multi-Service Read Path Where Inconsistency Enters the System

In a monolithic application backed by a single database, read skew is a database-level concern: choose the right isolation level and it disappears. In a microservices architecture, the attack surface is larger. Services may query different databases, read from different replicas, or call independent services that each apply their own isolation rules. No single database transaction can span a service boundary.

The diagram below shows the service topology at the root of Meridian's outage:

graph TD
    TransferSvc[Transfer Service] -->|debit committed| AccDB[(Accounts Database)]
    TransferSvc -->|credit pending| AccDB
    RiskEngine[Risk Engine] -->|reads checking 300| AccDB
    RiskEngine -->|reads savings 500 stale| AccDB
    RiskEngine --> Calc[Computed total 800]
    Calc -->|expected 1000 discrepancy| AlertSvc[Alert Service]
    AlertSvc --> FreezeOp[Account Freeze]

The diagram highlights two compounding failures: the Transfer Service makes two sequential non-atomic commits (creating an invariant window), and the Risk Engine reads both accounts in a single READ COMMITTED transaction whose two SELECT statements land on opposite sides of the debit commit. These two failures are independent — either one alone can be fixed without touching the other — but together they guaranteed the anomaly under high concurrency.

Even with a corrected Transfer Service using a single atomic transaction, the Risk Engine remains vulnerable: under Read Committed, T2's two SELECT statements can still straddle the atomic commit window, reading one account's old value and one account's new value. The complete fix requires both: atomic writes at the source and Snapshot Isolation for the reader.

In microservices systems where the Risk Engine calls two separate services — AccountService and BalanceService — each backed by their own database, no database-level isolation level helps at all. The invariant crossing happens at the application layer, across service boundaries. The remedies available are:

  • Route invariant reads through a single service that enforces a consistent snapshot internally using Snapshot Isolation.
  • Idempotent reconciliation jobs that detect impossible states and trigger compensation, using Snapshot Isolation for the reconciliation reads themselves.
  • Saga coordination with explicit post-saga consistency checks, verifying all service writes completed before releasing the composite read to downstream consumers.

🌍 How PostgreSQL, Oracle, MySQL InnoDB, and CockroachDB Handle Read Skew

Understanding how each database names, implements, and defaults its isolation levels is essential for knowing what "fixing" read skew actually requires in your stack.

PostgreSQL Default: Read Committed. Read skew is possible with the default. Setting REPEATABLE READ activates true Snapshot Isolation — the transaction receives a consistent point-in-time view from BEGIN to COMMIT. PostgreSQL's SERIALIZABLE adds predicate lock tracking (Serializable Snapshot Isolation / SSI), which detects and aborts transactions that would produce non-serializable outcomes, returning error code 40001 (serialization failure) for retries.

Oracle Database Default: Read Committed. Oracle has no standard REPEATABLE READ level. The available levels are Read Committed and Serializable. Oracle's Serializable uses MVCC snapshot semantics at the transaction level, which is equivalent to PostgreSQL's Repeatable Read for read skew prevention. To prevent read skew in Oracle, you must use SERIALIZABLE — there is no intermediate Snapshot Isolation level to select.

MySQL InnoDB Default: Repeatable Read. MySQL InnoDB's implementation is stronger than the ANSI definition: it pins a "consistent read view" at the time of the first consistent read in the transaction (the first non-locking SELECT). Subsequent reads in the same transaction reuse that snapshot. In practice, MySQL InnoDB's default level does prevent read skew for standard SELECT queries, making it the safest default among mainstream databases for this specific anomaly.

CockroachDB Default: Serializable. CockroachDB prevents all known anomalies including read skew and write skew at the default level. It uses Multi-Version Timestamp Ordering (MVTO) combined with optimistic concurrency control — conflicting transactions are detected and retried automatically. The cost is occasional transaction retry overhead, but correctness is never compromised.

| Database | Default Isolation | Read Skew Prevented at Default? | Fix Required |
| --- | --- | --- | --- |
| PostgreSQL | Read Committed | No | Use REPEATABLE READ |
| Oracle | Read Committed | No | Use SERIALIZABLE |
| MySQL InnoDB | Repeatable Read | Yes | Default is sufficient |
| CockroachDB | Serializable | Yes | Default is sufficient |
| SQL Server | Read Committed | No | Use SNAPSHOT isolation level |
| Amazon Aurora (PostgreSQL mode) | Read Committed | No | Use REPEATABLE READ (same as PostgreSQL) |
| Amazon Aurora (MySQL mode) | Repeatable Read | Yes | Default is sufficient |

⚠️ Multi-Datacenter Amplification: How Replica Lag Turns a Race Into a Certainty

In a single-datacenter deployment, read skew requires unfortunate timing: the reader's two queries must straddle a writer's commit window. With sub-millisecond inter-node latency, this window may be microseconds wide. The anomaly is possible but not guaranteed during any given request.

In a geographically distributed deployment with multi-region asynchronous replication, read skew stops being a timing race and becomes a structural property of the system.

Consider Meridian's extended infrastructure: the accounts database was replicated to three regions — US-East (primary), EU-West (replica), and AP-Southeast (replica) — using asynchronous streaming replication. Average replication lag from US-East to EU-West was 80–150 milliseconds at peak. AP-Southeast lagged 200–300 milliseconds.

Meridian's Risk Engine routed compliance reads geographically to reduce latency: checking accounts were stored in US-East and savings accounts in EU-West. When the compliance job read Account A from the US-East primary and Account B from the EU-West replica, it was reading two objects from different points in time — the EU-West replica's state was 80–150 milliseconds behind the primary — without any concurrent write even being necessary. The replication lag guaranteed the two reads came from different database states. Read skew was not a race condition; it was a permanent feature of the architecture.

The key distinction between single-region and multi-region read skew:

| Scenario | Read Skew Trigger | Frequency | Fix |
| --- | --- | --- | --- |
| Single-region, atomic writes | Concurrent write commits between reader's two queries | Rare; requires timing coincidence | Snapshot Isolation for reader |
| Single-region, non-atomic writes | Any write commits the debit before the credit | Common; proportional to write volume | Atomic writes + Snapshot Isolation |
| Multi-region, async replication | Replication lag between primary and replica | Always present; structural | Read from primary for invariant reads |
| Multi-region, bounded staleness | Replica lag exceeds invariant window | Proportional to lag | Bounded-staleness reads with tight lag bound |

Mitigations specific to multi-datacenter read skew:

  • Primary-only reads for invariant-critical queries. Accept the added latency; eliminate replica lag as a variable entirely.
  • Bounded-staleness reads. Databases such as Google Spanner and CockroachDB support reading from replicas at a guaranteed staleness bound. If both accounts in a pair are read at "no older than T milliseconds," and T is smaller than the minimum transfer commit duration, the lag is contained.
  • Synchronous replication for critical account data. Accept write latency in exchange for replica freshness. Appropriate for the small subset of tables that participate in invariants.
  • Affinity sharding. Ensure all accounts in a logical invariant pair live on the same node and region, eliminating cross-region reads for that pair at the cost of less even load distribution.

🔭 Catching Read Skew in Production Before It Freezes Accounts

Read skew is treacherous precisely because the database reports no errors, logs no warnings, and returns values that are individually correct. Every monitoring tool that watches for failed queries, slow queries, or high error rates will report a clean system even while read skew is occurring continuously. Detection must happen at the application and invariant layers, not at the database layer.

Invariant monitoring jobs. For every multi-object invariant your system enforces (A + B = K, allocated ≤ total_capacity, parent_total = SUM(children)), schedule a periodic consistency check that reads all participating objects in a single Snapshot Isolation transaction. If the invariant is violated, log the offending account IDs, the computed vs. expected values, and the timestamp. Do not trigger an alarm on the first failure — run the check twice with a five-second gap. A violation that resolves between the first and second check is almost certainly read skew. A violation that persists is a genuine data error.
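The run-twice rule can be sketched as a small function. Here `read_pair` is a hypothetical callable standing in for however your monitor fetches the two balances of one linked pair; the 1,000 total and five-second gap follow the article's example:

```python
import time

# Sketch of the run-twice rule for invariant alerts: a violation that
# clears on re-check is likely read skew in the monitor's own reads; a
# violation that persists is a genuine data error. `read_pair` is a
# hypothetical callable returning (checking, savings) for one linked pair.

def check_invariant(read_pair, expected_total=1000, recheck_delay=5.0):
    if sum(read_pair()) == expected_total:
        return "ok"
    time.sleep(recheck_delay)          # give in-flight commits time to land
    if sum(read_pair()) == expected_total:
        return "transient"             # likely read skew -- log, don't page
    return "violation"                 # persistent -- genuine data error

# Simulated pair whose second read shows the completed transfer:
reads = iter([(300, 500), (300, 700)])
print(check_invariant(lambda: next(reads), recheck_delay=0.01))  # transient
```

Routing the paging alert only on "violation" is what keeps a read-skew ghost from freezing 340 accounts.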

Cross-table consistency cross-checks. Where two tables must always be consistent (e.g., orders and order_line_items, account_balances and ledger_entries), add a reconciliation step to your data pipeline that counts, sums, and hash-checks both tables under a single snapshot. Schedule this check as part of your nightly ETL or as a dedicated reconciliation service that runs every few minutes. Discrepancies that appear and disappear across consecutive runs are read skew. Persistent discrepancies are data bugs.

Replication lag alerting. Instrument your replication lag metric for every replica. Set an alert threshold equal to your expected invariant window — the typical duration between a writer's first and second commit. If replication lag exceeds that threshold, route invariant reads to the primary until lag recovers. Most databases expose replication lag via pg_stat_replication (PostgreSQL), SHOW REPLICA STATUS (MySQL 8.0.22+; formerly SHOW SLAVE STATUS), or built-in metrics dashboards (CockroachDB, Spanner).

Distributed trace context for multi-service reads. In microservices systems, add trace metadata to every read that participates in a multi-object invariant check — including which replica, which service instance, and what read timestamp each value came from. When a compliance violation fires, the trace will immediately reveal whether the two reads crossed replicas, straddled a commit, or came from a genuine data error. Without this, diagnosing the root cause requires hours of log correlation.


🧪 Tracing the Anomaly Step by Step: Two Scenarios Side by Side

This section walks through the same four-step read sequence under two different isolation levels, showing exactly what each reader observes. No special setup is needed — this is the behaviour of any MVCC database with default Read Committed configuration.

Scenario A — Vulnerable: Read Committed reader, non-atomic writer

The transfer service issues two sequential auto-commit updates. The compliance reader runs its two SELECT statements in a single transaction while the credit update is in flight.

| Step | Who Acts | What Happens | Committed State of DB |
| --- | --- | --- | --- |
| 1 | Transfer Service | Debit checking: 500 → 300. COMMIT. | Checking=300, Savings=500 |
| 2 | Compliance Reader | BEGIN. SELECT checking. Returns 300 (debit committed, visible). | Checking=300, Savings=500 |
| 3 | Transfer Service | Credit savings: 500 → 700. BEGIN, not yet committed. | Checking=300, Savings=500 (credit pending) |
| 4 | Compliance Reader | SELECT savings. Returns 500 (old value). | Checking=300, Savings=500 |
| 5 | Transfer Service | COMMIT credit. | Checking=300, Savings=700 |
| 6 | Compliance Reader | Computes total: 300 + 500 = 800. Flags discrepancy. | DB is consistent. Reader is not. |

The reader's two queries straddled the credit commit. From T2's perspective, $200 has vanished. From the database's perspective, the data is fully correct — the reader simply assembled values from two different committed states.

Scenario B — Fixed: Snapshot Isolation reader, same non-atomic writer

The transfer service is unchanged. Only the compliance reader's isolation level is raised to REPEATABLE READ (Snapshot Isolation).

| Step | Who Acts | What Happens | What Compliance Reader Sees |
| --- | --- | --- | --- |
| 1 | Compliance Reader | BEGIN. Snapshot pinned: Checking=500, Savings=500. | Snapshot: C=500, S=500 |
| 2 | Transfer Service | Debit checking: 500 → 300. COMMIT. | Commit is after T_start — invisible to reader |
| 3 | Transfer Service | Credit savings: 500 → 700. COMMIT. | Commit is after T_start — invisible to reader |
| 4 | Compliance Reader | SELECT checking. Returns 500 (snapshot value). | 500 |
| 5 | Compliance Reader | SELECT savings. Returns 500 (snapshot value). | 500 |
| 6 | Compliance Reader | Computes total: 500 + 500 = 1000. No discrepancy. | Invariant satisfied |

The compliance reader never saw any version of either row written after T_start; both of its reads came from the single committed state that existed the moment its transaction began. One caveat: because the writer is still non-atomic, the half-transferred state (Checking=300, Savings=500) briefly exists as a genuine committed state, and a Snapshot Isolation reader whose snapshot happens to be pinned inside that window will faithfully report it. Closing that window is a writer-side fix; what Snapshot Isolation guarantees is that the reader never assembles a composite view that no committed state ever contained.

The practical implication: upgrading a compliance or reporting reader from Read Committed to Snapshot Isolation is the minimum fix that eliminates the read skew anomaly for that reader, even without changing the writer. It is the safest starting point because it requires no write-path changes, no schema modifications, and no cross-team coordination.


⚖️ Prevention Strategies and Their Trade-offs

Prevention is not one-size-fits-all. The right strategy depends on your workload's consistency requirements, your tolerance for retry overhead, and whether you own the write side, the read side, or both.

Option 1: Atomic writes at the source (highest impact, zero overhead) Wrap every multi-object update in a single BEGIN...COMMIT block. A single atomic transaction closes the invariant window to zero: no reader, at any isolation level, can observe the half-finished transfer, because it never exists as a committed state. This is the most impactful fix for application-layer read skew and adds no database overhead. For microservices, this means collapsing the two-service update into a single service that performs both writes atomically, or using a distributed transaction protocol (2PC, Saga with locking) to coordinate across services. Note the limit of this fix: a Read Committed reader can still stitch together one value read before the atomic commit and one read after it, so the reader-side snapshot fix in Option 2 remains necessary for invariant reads.
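A sketch of the write-side fix, again with sqlite3 standing in for the production database (names invented; WAL mode lets the reader poll while the writer's transaction is open):

```python
import os
import sqlite3
import tempfile

# Illustrative sketch: both updates inside one transaction, so no committed
# state ever violates checking + savings = 1000.
path = os.path.join(tempfile.mkdtemp(), "bank.db")
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE accounts(type TEXT PRIMARY KEY, balance INT)")
writer.execute("INSERT INTO accounts VALUES ('checking', 500), ('savings', 500)")

reader = sqlite3.connect(path, isolation_level=None)

writer.execute("BEGIN")  # atomic transfer starts
writer.execute("UPDATE accounts SET balance = balance - 200 WHERE type = 'checking'")
# The half-done transfer is uncommitted and therefore invisible:
mid_total = reader.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]  # 1000
writer.execute("UPDATE accounts SET balance = balance + 200 WHERE type = 'savings'")
writer.execute("COMMIT")
final_total = reader.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]  # 1000
```

At every point the reader could observe, the pair sums to 1,000; the inconsistent intermediate state exists only inside the writer's uncommitted transaction.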

Option 2: Snapshot Isolation for invariant readers (recommended starting point) Upgrade readers that check invariants from Read Committed to REPEATABLE READ (PostgreSQL) or SNAPSHOT (SQL Server). This is the minimum viable database-layer fix, adds negligible overhead on MVCC databases, and requires no schema or write-path changes. The limitation: Snapshot Isolation does not prevent write skew. If write skew is also a risk in your system, step up to Serializable.

Option 3: Serializable Isolation for complete anomaly prevention PostgreSQL's SERIALIZABLE level uses Serializable Snapshot Isolation (SSI), which detects read-write conflicts at predicate level and aborts transactions that would produce non-serializable outcomes. This is the strongest guarantee available and prevents all known anomalies, including write skew. The cost: occasional transaction retries on error code 40001. Design retry logic into any transaction that uses SERIALIZABLE. For most OLTP workloads, the retry rate is under 1%.
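The retry logic for SQLSTATE 40001 can be as small as a wrapper like the following sketch (the exception class and function names are invented for illustration; a real driver such as psycopg raises its own serialization-failure error type, which you would catch instead):

```python
import time

class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""

def run_with_retry(txn_fn, max_attempts=5, base_delay=0.01):
    """Run txn_fn, retrying with exponential backoff on serialization failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# Simulated transaction that conflicts twice before committing:
attempts = {"count": 0}

def reconciliation_txn():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

result = run_with_retry(reconciliation_txn)  # succeeds on the third attempt
```

The important design point is that the whole transaction body, not just the failing statement, must be re-run: the abort means the snapshot the transaction read from is no longer usable.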

Option 4: Primary-only routing for invariant reads For geographically distributed deployments, route all invariant-critical reads to the primary region. Accept the added latency (typically 50–200 ms for cross-region round trips) in exchange for eliminating replication lag as a source of cross-time reads.


🧭 When to Use Snapshot Isolation, Serializable, or Application-Level Fixes

The right prevention strategy depends on where in the stack you own the problem — the writer, the reader, or both — and how much consistency overhead your workload can absorb. The table below maps common situations to the recommended approach.

| Situation | Recommended Prevention |
|-----------|------------------------|
| Single-database OLTP; invariant reads span multiple rows | Use REPEATABLE READ for reader transactions (PostgreSQL, MySQL default) |
| Write skew is also a risk alongside read skew | Use SERIALIZABLE; add retry logic for SQLSTATE 40001 |
| Non-atomic writes from application code | Wrap debit + credit in a single BEGIN...COMMIT — highest-impact, zero-overhead fix |
| Multi-service reads that cross invariants | Route invariant reads through a single service using Snapshot Isolation internally |
| Geographically distributed replicas with replication lag | Pin invariant reads to the primary or use bounded-staleness reads (Spanner, CockroachDB) |
| High-throughput reads; eventual consistency is acceptable | Add periodic invariant monitoring jobs under Snapshot Isolation; alert on persistent violations |
| Legacy application that cannot be changed | Add a read-through caching layer or materialised view refreshed under Snapshot Isolation |

The most important principle: fix the write side first, then the read side. A single atomic transaction for the writer guarantees that every committed state satisfies the invariant; Snapshot Isolation for the reader guarantees that the reader observes exactly one committed state. Each fix closes a gap the other leaves open: the atomic writer alone still lets a Read Committed reader straddle the commit, and the snapshot reader alone can still capture a committed-but-inconsistent state left by a non-atomic writer.


🛠️ PostgreSQL and CockroachDB: Configuring Snapshot Isolation in Practice

PostgreSQL — Snapshot Isolation per transaction

Set the isolation level inside the transaction before the first statement executes. This targets only the transactions that need it, leaving Read Committed as the default for everything else.

-- Per-transaction: upgrade to Snapshot Isolation for this reader
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SELECT balance FROM accounts WHERE type = 'checking' AND customer_id = 42;
SELECT balance FROM accounts WHERE type = 'savings'  AND customer_id = 42;

COMMIT;

PostgreSQL — Session-level default for compliance reader roles

For a dedicated compliance or reporting database role that always requires snapshot-consistent reads:

-- Apply to a specific role (superuser, CREATEROLE, or the role itself)
ALTER ROLE compliance_reader SET default_transaction_isolation = 'repeatable read';

-- Verify the active level in any connection as this role
SHOW transaction_isolation;
-- Returns: repeatable read

PostgreSQL — Serializable for write-skew-free transactions

When write skew is also a concern, upgrade further. Handle the retry error in your application.

-- Set Serializable at session level for a high-stakes reconciliation role
ALTER ROLE reconciliation_job SET default_transaction_isolation = 'serializable';

-- PostgreSQL returns SQLSTATE 40001 on serialization failure — retry the transaction
-- Check pg_stat_activity to monitor serialization conflict rate

CockroachDB — Verify the default Serializable level

CockroachDB defaults to Serializable and requires no configuration to prevent read skew. Verify the active level and confirm no session has been downgraded:

-- Confirm the active isolation level for the current session
SHOW TRANSACTION ISOLATION LEVEL;
-- Returns: serializable

-- Confirm cluster-wide default (already serializable — shown for documentation)
SHOW CLUSTER SETTING sql.defaults.transaction_isolation;
-- Returns: serializable

CockroachDB — Monitoring serialization retries

Under Serializable, CockroachDB returns error 40001 (serialization failure) for conflicting transactions. Monitor retry frequency at the cluster level:

-- Count serialization retry events per node (CockroachDB metrics)
SELECT node_id, value
FROM   crdb_internal.node_metrics
WHERE  name = 'txn.restarts.serializable';

For a full deep-dive on PostgreSQL's MVCC implementation, vacuum behaviour, and SSI internals, see Isolation Levels in Databases.


📚 Lessons Learned: What Production Outages Teach About Snapshot Consistency

1. "Read Committed" is a misleading name for invariant-crossing reads. The label sounds safe — you are only reading committed data. But committed data can come from two different committed states. For any workload that computes aggregates, checks constraints, or reads multiple rows with a logical relationship, Read Committed is the wrong default to leave in place without additional controls.

2. Non-atomic writes amplify read skew risk by orders of magnitude. Meridian's transfer service committed its debit and credit in two separate auto-commit statements. The invariant window — the gap where checking + savings ≠ $1,000 — was open for the full duration between the two commits. A single atomic BEGIN...COMMIT wrapping both updates reduces that window to zero, regardless of what the reader's isolation level is. If every team knew only one rule about transactions, "wrap multi-object updates in a single transaction" would prevent more production incidents than any isolation level tuning.

3. Replication lag in distributed systems is perpetual structural read skew. A single-datacenter read skew is a timing race. A multi-region read skew is a permanent condition: every cross-replica read is a cross-time read because the replica is always some milliseconds behind the primary. This distinction is critical for architecture decisions — it means "choose the right isolation level" is insufficient for distributed reads. You must also address routing.
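Lesson 3 can be made concrete with a toy model (pure illustration: the dictionaries stand in for the primary and replica row stores, and the lag state is assumed):

```python
# Toy model of asynchronous replication: the credit has committed on the
# primary but has not yet been applied on the replica.
primary = {"checking": 300, "savings": 700}  # both transfer legs applied
replica = {"checking": 300, "savings": 500}  # credit still in flight

# An invariant read that mixes sources assembles a cross-time view:
mixed_total = primary["checking"] + replica["savings"]   # 800 (structural read skew)

# Routing both reads to the primary restores a single point in time:
pinned_total = primary["checking"] + primary["savings"]  # 1000
```

No isolation level on either node can repair the mixed read; the fix is routing, as the prevention table above recommends.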

4. Snapshot Isolation is not the same as Serializable. Snapshot Isolation prevents read skew but explicitly allows write skew. Two transactions can each read a consistent snapshot, decide to write, and both commit — even if the combination of their writes is logically incoherent. Booking systems, inventory allocation, and concurrent approval workflows are all vulnerable to write skew under Snapshot Isolation. Confirm which anomalies your workload requires protection from before choosing an isolation level.
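Lesson 4's gap can be sketched in a few lines using the classic on-call rota example (illustrative only; all names are invented):

```python
# Two transactions under Snapshot Isolation each read the same consistent
# snapshot, write different rows, and both commit without any row conflict.
on_call = {"alice": True, "bob": True}  # invariant: at least one doctor on call

snapshot_t1 = dict(on_call)  # T1's pinned snapshot
snapshot_t2 = dict(on_call)  # T2's pinned snapshot

# T1: its snapshot shows two doctors on call, so Alice may leave.
if sum(snapshot_t1.values()) >= 2:
    on_call["alice"] = False  # commits; touches only Alice's row

# T2: its snapshot *also* shows two on call, so Bob may leave.
if sum(snapshot_t2.values()) >= 2:
    on_call["bob"] = False    # commits; no row conflict with T1

invariant_holds = any(on_call.values())  # False: write skew
```

Each transaction saw a valid snapshot and wrote a disjoint row, so Snapshot Isolation's write-write conflict check never fires; only Serializable detects that the two reads and writes cannot be ordered consistently.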

5. Detection must be built into the data pipeline, not bolted on after an incident. Read skew produces no database-level errors, no unusual latency spikes, and no anomalous query patterns. By the time application logic fires an alert based on an incorrect invariant value, the damage is done. Periodic invariant checks under Snapshot Isolation — running every few minutes against the live database — catch the anomaly before it reaches customers. Treat invariant monitoring as a first-class operational responsibility, not an optional audit job.
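A minimal invariant monitor, per lesson 5, runs its cross-row check inside a single read transaction so both SELECTs share one snapshot (a sketch with sqlite3 standing in for the production database; the schema and names are invented):

```python
import sqlite3

def check_pair_invariant(conn, expected_total=1000):
    """Read both balances inside one read transaction and verify the invariant."""
    conn.execute("BEGIN")  # one snapshot for both reads
    checking = conn.execute(
        "SELECT balance FROM accounts WHERE type = 'checking'").fetchone()[0]
    savings = conn.execute(
        "SELECT balance FROM accounts WHERE type = 'savings'").fetchone()[0]
    conn.execute("COMMIT")
    total = checking + savings
    return total == expected_total, total

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts(type TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('checking', 300), ('savings', 700)")

ok, total = check_pair_invariant(conn)
```

Running the check without the surrounding transaction would reintroduce the very anomaly it is meant to detect, producing false alerts exactly like Meridian's.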

6. PostgreSQL's REPEATABLE READ is an outlier — do not generalise across databases. PostgreSQL's Repeatable Read delivers full Snapshot Isolation. Oracle's Serializable delivers snapshot-level consistency but Oracle has no Repeatable Read. MySQL InnoDB's Repeatable Read pins the snapshot at first read. SQL Server's Repeatable Read uses lock-based semantics and does not prevent phantom reads. The isolation level names in the ANSI standard do not map uniformly to the same guarantees across vendors. Always read your specific database's documentation, not the ANSI definition.


📌 Summary & Key Takeaways: Read Skew, Snapshot Isolation, and the Invariants That Cannot Be Broken

  • Read skew is a cross-object inconsistency: a transaction assembles two logically related values from different committed states, producing a view that never existed as a whole at any single moment in time.
  • The root cause is statement-level snapshots: Read Committed gives each SQL statement its own fresh snapshot, allowing a multi-statement transaction to straddle a concurrent writer's commit window.
  • The key distinction from non-repeatable reads: non-repeatable reads affect the same row read twice; read skew affects two different rows read once each, with their combination violating a cross-row invariant.
  • Snapshot Isolation is the fix: pinning the transaction's snapshot at BEGIN makes every read within the transaction see the same consistent state, structurally preventing the straddle (in PostgreSQL the snapshot is actually taken at the transaction's first query rather than at BEGIN itself, but the guarantee is the same).
  • PostgreSQL's Repeatable Read = Snapshot Isolation: it is stronger than the ANSI definition and does prevent read skew. Oracle, SQL Server, and older MySQL versions require specific configuration to match this protection.
  • Multi-datacenter replication lag is structural read skew: cross-replica reads are always cross-time reads when replication is asynchronous; invariant reads must be routed to the primary or use bounded-staleness reads.
  • Silent failure mode: read skew produces no database errors, no warnings, no anomalous metrics — only incorrect application-level logic built on a ghost snapshot. Detection requires explicit invariant monitoring under Snapshot Isolation.

One-liner to remember: Read skew is what happens when your transaction sees the world before a transfer on one row and after a transfer on another — and Snapshot Isolation is the weakest isolation-level guarantee that makes that impossible.


📝 Practice Quiz: Test Your Understanding of Read Skew

  1. Which isolation level is the minimum required to prevent read skew?

    • A) Read Uncommitted
    • B) Read Committed
    • C) Snapshot Isolation (Repeatable Read in PostgreSQL)
    • D) Write Committed
    Correct Answer: C
  2. A compliance job reads Account A ($300) and Account B ($500) and reports a total of $800, even though the invariant is A + B = $1,000. The job uses Read Committed and no concurrent writes are in progress when it reads B. What is the most likely root cause?

    • A) The database returned a corrupt value for B
    • B) Account B was at $500 before a transfer started, and the compliance job read B before the credit committed
    • C) Read Committed prevents dirty reads, so this anomaly is impossible
    • D) The compliance job read both values from the same pinned snapshot
    Correct Answer: B
  3. A developer switches a PostgreSQL reporting transaction from READ COMMITTED to REPEATABLE READ. Which anomaly does this specifically prevent that Read Committed could not?

    • A) Dirty reads from uncommitted concurrent writes
    • B) Lost updates caused by concurrent writers overwriting each other
    • C) Read skew — seeing two related rows from different committed states
    • D) Phantom rows appearing when a concurrent INSERT adds a new matching row
    Correct Answer: C
  4. CockroachDB defaults to Serializable isolation. A team proposes downgrading to Read Committed to reduce retry overhead. Which specific risk are they accepting?

    • A) Uncommitted data will become visible to readers
    • B) Cross-row invariants may be violated during high-concurrency invariant reads
    • C) Write throughput will also decrease significantly as a side effect
    • D) CockroachDB does not support Read Committed as a valid isolation level
    Correct Answer: B
  5. Under MVCC Snapshot Isolation, when is a transaction's read snapshot established?

    • A) When the first SELECT statement executes
    • B) When the transaction issues its first write operation
    • C) At the moment BEGIN is called — transaction start
    • D) Dynamically, based on which table is accessed in each query
    Correct Answer: C
  6. A banking system wraps its debit and credit in a single atomic BEGIN...COMMIT transaction. A concurrent compliance reader uses Read Committed. Can read skew still occur?

    • A) No — a single atomic commit cannot be partially visible; once committed, both rows are visible together
    • B) Yes — under Read Committed, the reader's two SELECT statements can straddle the atomic commit window, seeing one account at the old value and one at the new value
    • C) No — Read Committed prevents any read from spanning a concurrent writer's commit window
    • D) Yes — but only if the reader explicitly holds a lock on the debit account row
    Correct Answer: B
  7. Snapshot Isolation prevents read skew but not write skew. Which isolation level is required to prevent both anomalies?

    • A) Read Committed
    • B) ANSI Repeatable Read
    • C) Snapshot Isolation
    • D) Serializable
    Correct Answer: D
  8. In a geographically distributed database with 200ms average replication lag between primary and replica, a compliance job reads Account A from the primary and Account B from the replica. No concurrent writes are in progress. Can read skew occur?

    • A) No — with no concurrent writes, the replica will have identical data to the primary at all times
    • B) No — replication lag affects write acknowledgement latency, not read consistency
    • C) Yes — the replica may be 200ms behind, meaning Account B is read from an older committed state than Account A
    • D) Yes — but only if the compliance job explicitly requests stale reads via a flag
    Correct Answer: C
  9. Which of the following multi-service patterns most directly creates a read skew risk at the application layer, independent of database isolation level?

    • A) A service that reads the same row twice within a single database transaction
    • B) A risk engine that calls two separate microservices — each backed by its own database — to read related account balances, then computes an invariant across both values
    • C) A service that writes to two tables in a single atomic transaction before returning
    • D) A service that uses a read replica for SELECT queries and the primary for writes
    Correct Answer: B
  10. (Open-ended challenge — no single correct answer) Your system maintains a three-table invariant: total_reserved + total_available = total_capacity. Users are reporting impossible states — the three values do not sum correctly on some reads, but the issue resolves on page refresh. Your database uses Read Committed. Describe a comprehensive strategy to both detect and prevent this class of inconsistency — covering isolation level changes, write-side fixes, monitoring, and what to do at service boundaries if the three tables are owned by different microservices.


Written by Abstract Algorithms (@abstractalgorithms)