
System Design HLD Example: Collaborative Document Editing (Google Docs)

A practical interview-ready HLD for real-time collaborative editing with conflict resolution and sync.

Abstract Algorithms · 33 min read

TLDR: Real-time collaborative document editing uses Operational Transformation (OT) or CRDTs to ensure all clients converge to the same document state regardless of concurrent edit ordering. The key insight: every operation must be transformed against concurrent unacknowledged operations before application — not simply sequenced by timestamp. This HLD covers the full system: OT algorithm and convergence proofs, WebSocket fan-out, cursor presence via Redis pub/sub, version history via event sourcing with snapshots, and the bottlenecks that emerge at Google Docs scale.

📖 The Concurrent Edit Problem: Why Collaborative Documents Are a Hard Distributed Systems Challenge

Two users are editing the same document at 2am. User A inserts "Hello" starting at position 5. Simultaneously, User B deletes the character at position 5. Both operations are issued concurrently — neither has seen the other's edit yet. If the server naively applies them in arrival order, A's insert lands first, shifting every subsequent character right by 5 positions. User B's delete now targets position 5 — which is the first character of A's newly inserted "Hello" — silently corrupting the document. Last-writer-wins fails entirely here. There is no "last" writer; there are two concurrent writers whose intentions must be merged, not arbitrated.

This is the core engineering challenge of collaborative editing: how do you reconcile concurrent operations that were each valid against a shared-but-diverged document state, producing a result that is correct, convergent, and non-destructive to either user's intent?

Use Cases and System Scope

| Actor | Core Journeys |
| --- | --- |
| Document author | Create document, set sharing permissions, transfer ownership |
| Co-editor | Type, delete, format text in real-time alongside other users |
| Viewer | Read-only access; see live edits as they happen |
| Presence service | Track who is in the document and where each cursor is |
| Version history | Reconstruct any past state of the document from the edit log |

In scope for this HLD:

  • Create, read, update, and delete documents (full CRUD)
  • Real-time multi-user co-editing with cursor presence (who is editing where)
  • Sharing and permission management: owner, editor, viewer roles
  • Version history and point-in-time document restore
  • Reconnect sync: replay missed operations when a client returns from a network drop

Out of scope (v1 boundary):

  • Spell check, grammar suggestions, and AI writing assistance
  • Full offline-first editing with prolonged-disconnect merge (CRDT extension)
  • End-to-end encryption of document content
  • Voice and video calling (separate signalling layer)

Design Goals

| Goal | Specification |
| --- | --- |
| Convergence | All replicas must converge to identical document state within 200 ms of a network partition healing |
| Edit latency | Operation round-trip (client → server → broadcast to peers) under 100 ms at p99 |
| Durability | No committed operation may be lost; the ops log is the source of truth |
| Availability | Active editing sessions survive WebSocket server restarts via stateless reconnect |
| Throughput | Support 1 M concurrent active editing sessions; up to 50 concurrent editors per document |

Capacity Estimation

Google Docs serves approximately 1 billion registered users with an estimated 50 million daily active editors. At peak:

  • Concurrent editing sessions: ~5 million (10% of DAU actively typing)
  • Operations per second (write): An active typist generates ~2 ops/sec. At 5 M sessions: ~10 million ops/sec fleet-wide
  • Storage for the ops log: Each operation averages ~200 bytes (type, position, content, metadata). At 10 M ops/sec: ~2 GB/sec of raw op log writes — snapshots are critical to prevent unbounded growth and unbounded reconstruction time
  • WebSocket connections: 5 M persistent connections; at ~50 KB RAM per connection: ~250 GB RAM across WebSocket servers → horizontal sharding by document ID is required
  • Document reads (open + reconnect): ~100 M document opens per day → ~1,200 document reconstructions/sec from snapshot + op replay
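These estimates can be sanity-checked with a few lines of arithmetic, using only the figures quoted above:

```python
# Back-of-envelope check of the capacity numbers above (pure arithmetic).
sessions = 5_000_000            # concurrent active editing sessions at peak
ops_per_sec = sessions * 2      # ~2 ops/sec per active typist
op_bytes = 200                  # average serialized op size

log_write_rate = ops_per_sec * op_bytes     # raw op-log bytes/sec
ws_ram = sessions * 50_000                  # ~50 KB RAM per connection

opens_per_day = 100_000_000
reconstructions_per_sec = opens_per_day / 86_400

print(f"{ops_per_sec:,} ops/sec")                             # 10,000,000
print(f"{log_write_rate / 1e9:.0f} GB/sec op log")            # 2
print(f"{ws_ram / 1e9:.0f} GB WebSocket RAM")                 # 250
print(f"~{reconstructions_per_sec:.0f} reconstructions/sec")  # ~1157
```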

βš™οΈ Operational Transformation: The Algorithm That Prevents Silent Data Loss

Operational Transformation (OT) is the mechanism that reconciles concurrent edits. The key insight: when two operations are issued concurrently against the same document version, one must be transformed to account for the positional effects of the other before it can be safely applied.

The Core Conflict: A Concrete Example

Consider the document "ABCDE" (character positions 0–4).

  • User A issues: INSERT("X", position=3) → intent: "ABCXDE"
  • User B simultaneously issues: DELETE(position=3, length=1) → intent: "ABCE"

Both operations are stamped against document version v1. The server receives A's op first, applies it, and the document becomes "ABCXDE". Now B's op arrives: DELETE(position=3, length=1). Applied naively, it deletes "X" — not "D" as B intended. B wanted to delete "D", which is now at position 4 after A's insert shifted everything right.

OT transforms B's delete against A's insert before applying it:

def transform_delete_against_insert(delete_op, insert_op):
    """
    insert_op inserted len(insert_op.text) characters at insert_op.position.
    If the insert happened at or before the delete target, the delete
    target has shifted right by the length of the inserted text.
    """
    if insert_op.position <= delete_op.position:
        # Insert is at or before our delete target — shift right
        delete_op.position += len(insert_op.text)
    # else: insert is after our delete target — delete position unchanged
    return delete_op

Applying this: insert_op.position (3) <= delete_op.position (3) → delete_op.position += 1 = 4. The transformed delete removes the character at position 4 ("D"), producing "ABCXE". Both users' intentions are preserved and the document converges.
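The full example can be reproduced end to end. The `Insert`/`Delete` classes and the apply helpers below are illustrative stand-ins rather than a real OT library; the transform is the same function shown above:

```python
# Reproducing the "ABCDE" example. Insert/Delete and the apply helpers are
# illustrative stand-ins; the transform is the function from the article.
from dataclasses import dataclass

@dataclass
class Insert:
    position: int
    text: str

@dataclass
class Delete:
    position: int
    length: int

def apply_insert(doc: str, op: Insert) -> str:
    return doc[:op.position] + op.text + doc[op.position:]

def apply_delete(doc: str, op: Delete) -> str:
    return doc[:op.position] + doc[op.position + op.length:]

def transform_delete_against_insert(d: Delete, i: Insert) -> Delete:
    if i.position <= d.position:
        return Delete(d.position + len(i.text), d.length)  # shift right
    return d

doc = "ABCDE"
a = Insert(position=3, text="X")    # User A's op, issued at v1
b = Delete(position=3, length=1)    # User B's concurrent op, issued at v1

doc = apply_insert(doc, a)                   # server applies A first: "ABCXDE"
b_t = transform_delete_against_insert(b, a)  # B's delete shifts to position 4
doc = apply_delete(doc, b_t)                 # removes "D"
print(doc)  # ABCXE
```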

The Full Transform Matrix

| Op₁ (applied first) | Op₂ (to transform against Op₁) | Transform rule |
| --- | --- | --- |
| INSERT(p₁, len) | INSERT(p₂) | If p₁ ≤ p₂: p₂ += len; else unchanged |
| INSERT(p₁, len) | DELETE(p₂, l₂) | If p₁ ≤ p₂: p₂ += len; if p₁ falls inside the deleted range: split the delete around the insert |
| DELETE(p₁, l₁) | INSERT(p₂) | If p₂ ≥ p₁ + l₁: p₂ -= l₁; if p₂ falls inside the deleted range: p₂ = p₁ |
| DELETE(p₁, l₁) | DELETE(p₂, l₂) | Overlap must be resolved: chars deleted by both ops are counted exactly once; non-overlapping ranges shift independently |

The four-case matrix is deceptively compact — production OT implementations for rich text (attributes, embedded objects, list formatting) handle many more edge cases through composite operation types.
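As a sketch of the first matrix row, here is the INSERT-vs-INSERT rule in isolation. The `<=` comparison doubles as the tie-break when both inserts target the same position; production systems often break such ties by client or site ID instead, so that the rule can be applied symmetrically:

```python
# The INSERT-vs-INSERT row in isolation. The <= acts as the tie-break:
# the op applied first stays left, the transformed op shifts right.
# (Tie-breaking by client/site ID instead is a common production choice.)
def transform_insert_against_insert(op2_pos: int, op1_pos: int, op1_len: int) -> int:
    """Position of Op2 after Op1 (applied first) inserted op1_len chars."""
    if op1_pos <= op2_pos:
        return op2_pos + op1_len
    return op2_pos

assert transform_insert_against_insert(5, 3, 2) == 7  # Op1 before Op2: shift
assert transform_insert_against_insert(3, 5, 2) == 3  # Op1 after Op2: unchanged
assert transform_insert_against_insert(4, 4, 1) == 5  # tie: first writer stays left
```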

The OT Server as the Central Arbiter

The server is the single serialization point for every document. Every client operation includes a revision number — the document version the operation was based on. If a client is behind (its operation is at revision r but the server is now at r+3), the server transforms the incoming op against ops r+1, r+2, and r+3 in sequence before applying it.

def server_receive_op(doc_id: str, client_op, client_revision: int):
    server_revision = get_current_revision(doc_id)

    # Fetch all ops the client hasn't yet seen
    concurrent_ops = get_ops(doc_id, from_seq=client_revision + 1,
                             to_seq=server_revision)

    # Transform client's op against each concurrent server op in order
    transformed_op = client_op
    for server_op in concurrent_ops:
        transformed_op = transform(transformed_op, server_op)

    # Assign next revision, persist, broadcast to all subscribers
    new_revision = server_revision + 1
    persist_op(doc_id, transformed_op, seq=new_revision)
    update_revision(doc_id, new_revision)
    broadcast_to_subscribers(doc_id, transformed_op, seq=new_revision)
    return new_revision

This model is known as the Jupiter Protocol — a client–server OT architecture where all transforms happen on the server, no peer-to-peer transform is ever required, and the server never needs to understand document semantics beyond the transform function.

🧠 Deep Dive: OT Internals, Convergence Proofs, and Computation Cost at Scale

🔬 Internals: How the Client and Server OT Engines Stay Synchronized

Client-side OT engine — three key data structures:

pending_ops:      Queue of ops sent to server but not yet acknowledged
client_revision:  Last revision the client confirmed from the server
local_doc_state:  Optimistically applied document (pending ops included)

The client applies every typed operation optimistically to local_doc_state for zero-latency local rendering. It simultaneously adds the op to pending_ops and sends it to the server. When the server broadcasts an op from another user, the client must:

  1. Transform every op in pending_ops against the incoming op (so the pending queue stays valid against the growing server sequence)
  2. Apply the (possibly transformed) incoming op to local_doc_state
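These two steps can be sketched as a client-side receive handler. Everything here is a simplified stand-in: ops are insert-only (position, text) tuples, and the transform covers only the insert-vs-insert case with non-tying positions (equal positions would additionally need a priority convention):

```python
# Simplified client receive handler. Ops are insert-only (position, text)
# tuples; transform handles only insert-vs-insert and ignores position ties.

def transform(op, against):
    """Rewrite `op` as if `against` had been applied first."""
    pos, text = op
    a_pos, a_text = against
    return (pos + len(a_text), text) if a_pos <= pos else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

class Client:
    def __init__(self, doc, revision):
        self.local_doc_state = doc
        self.client_revision = revision
        self.pending_ops = []            # sent, not yet acknowledged

def on_server_op(client, incoming_op, incoming_seq):
    # Pairwise dual transform: each pending op is rewritten against the
    # incoming op while the incoming op is rewritten past each pending op.
    new_pending, op = [], incoming_op
    for p in client.pending_ops:
        new_pending.append(transform(p, op))   # step 1: keep queue valid
        op = transform(op, p)
    client.pending_ops = new_pending
    client.local_doc_state = apply_op(client.local_doc_state, op)  # step 2
    client.client_revision = incoming_seq

c = Client("ABC", revision=7)
c.pending_ops = [(3, "Z")]                                 # optimistic local insert
c.local_doc_state = apply_op(c.local_doc_state, (3, "Z"))  # "ABCZ"
on_server_op(c, (0, "Y"), incoming_seq=8)                  # remote insert arrives
print(c.local_doc_state, c.pending_ops)                    # YABCZ [(4, 'Z')]
```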

Server-side OT engine — state per document:

ops_log:          Ordered log of all committed ops; each has a seq number
current_revision: Monotonically incrementing sequence counter
active_sessions:  Map of connected client_id β†’ last_ack_revision

The server holds no rendered document content in memory — it stores only the op log and the current revision number. Document content is reconstructed on demand from the nearest snapshot + subsequent ops.

Reconnect sync protocol:

When a client reconnects after a network interruption:

  1. Client sends {doc_id, last_seen_revision: r}
  2. Server fetches the nearest snapshot at or before revision r, plus all ops from snapshot to present
  3. Server sends the catchup bundle: {snapshot_content, ops_since_snapshot, current_revision}
  4. Client replays the ops against the snapshot content to reconstruct current state
  5. Client resubmits its locally pending (unacknowledged) ops with updated client_revision
  6. Server re-transforms those pending ops against everything it received while the client was disconnected
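Steps 1 through 4 can be sketched with an in-memory stand-in for the snapshot and op stores (the dict layout and insert-only op encoding here are illustrative, not the real wire format):

```python
# In-memory sketch of the catchup bundle (steps 1-4 above). The storage
# layout and the (position, text) insert-only op encoding are illustrative.

STORE = {
    "doc-1": {
        "snapshots": [(0, "")],                   # (seq, content)
        "ops": {1: (0, "Hi"), 2: (2, "!"), 3: (2, " there")},
        "revision": 3,
    }
}

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

def build_catchup_bundle(doc_id, last_seen_revision):
    d = STORE[doc_id]
    # Nearest snapshot at or before the client's last seen revision
    seq, content = max(s for s in d["snapshots"] if s[0] <= last_seen_revision)
    return {
        "snapshot_content": content,
        "ops_since_snapshot": [d["ops"][i]
                               for i in range(seq + 1, d["revision"] + 1)],
        "current_revision": d["revision"],
    }

def client_apply_catchup(bundle):
    content = bundle["snapshot_content"]
    for op in bundle["ops_since_snapshot"]:       # replay over the snapshot
        content = apply_op(content, op)
    return content, bundle["current_revision"]

content, rev = client_apply_catchup(build_catchup_bundle("doc-1", last_seen_revision=1))
print(content, rev)  # Hi there! 3
```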

Vector clocks for cursor causality (presence events, not OT sequencing):

While OT uses monotonic revision numbers for operation ordering, vector clocks are used for presence events (cursor moves, user join/leave) where strict sequencing through the OT log would add unnecessary latency.

def happens_before(vc1: dict, vc2: dict) -> bool:
    """Returns True if vc1 causally precedes vc2."""
    keys = set(vc1) | set(vc2)   # must compare over the union of node IDs
    return (all(vc1.get(k, 0) <= vc2.get(k, 0) for k in keys)
            and any(vc1.get(k, 0) < vc2.get(k, 0) for k in keys))

def are_concurrent(vc1: dict, vc2: dict) -> bool:
    return not happens_before(vc1, vc2) and not happens_before(vc2, vc1)

# Example: User A and B both move cursors after different ops
# A's clock: {A: 3, B: 1}  — A applied 3 of A's ops, 1 of B's
# B's clock: {A: 1, B: 3}  — B applied 1 of A's ops, 3 of B's
# are_concurrent({A:3,B:1}, {A:1,B:3}) → True: show both cursors independently

πŸ“ Mathematical Model: Convergence Guarantees and Why TP2 Is the Hard Problem

OT requires two formal properties to guarantee convergence across all topologies:

Property TP1 (Transformation Property 1) — required for all OT systems:

For any two concurrent operations O₁ and O₂ issued at the same document state S:

apply(apply(S, O₁), transform(O₂, O₁)) = apply(apply(S, O₂), transform(O₁, O₂))

In plain terms: it does not matter which concurrent op you apply first. As long as you transform the second one against the first, you always arrive at the same final state. TP1 is the invariant every production OT system must satisfy, and it is the oracle for fuzz-testing transform correctness.
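A minimal version of that fuzz test, restricted to insert-vs-insert pairs for brevity: generate random concurrent inserts and assert that both application orders converge. Note the complementary tie-break between the two transform directions; without it, equal-position inserts diverge. A real harness would cover all four rows of the transform matrix, e.g. with a property-based testing library:

```python
# TP1 as a fuzz oracle, restricted to insert-vs-insert pairs. The two
# transform directions use complementary tie-breaks (O1 wins equal-position
# ties); without that, concurrent same-position inserts diverge.
import random
import string

def apply_insert(doc, pos, text):
    return doc[:pos] + text + doc[pos:]

def transform_pos(pos, other_pos, other_len, wins_tie):
    if other_pos < pos or (other_pos == pos and not wins_tie):
        return pos + other_len
    return pos

random.seed(42)
for _ in range(1000):
    doc = "".join(random.choices(string.ascii_uppercase, k=10))
    p1, p2 = random.randint(0, 10), random.randint(0, 10)
    # Order 1: apply O1, then O2 transformed against O1 (O2 loses ties)
    s1 = apply_insert(apply_insert(doc, p1, "x"),
                      transform_pos(p2, p1, 1, wins_tie=False), "y")
    # Order 2: apply O2, then O1 transformed against O2 (O1 wins ties)
    s2 = apply_insert(apply_insert(doc, p2, "y"),
                      transform_pos(p1, p2, 1, wins_tie=True), "x")
    assert s1 == s2, (doc, p1, p2, s1, s2)
print("TP1 holds for 1,000 random insert pairs")
```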

Property TP2 (Transformation Property 2) — required only for multi-master OT:

If three concurrent ops O₁, O₂, O₃ exist, transforming O₃ against both O₁ and O₂ must produce the same result regardless of the order in which you perform those transforms.

TP2 is notoriously difficult to satisfy correctly. The earliest published OT systems (including Jupiter's academic predecessors) had subtle TP2 violations that caused occasional document corruption. Google's production approach — the Jupiter Protocol — sidesteps TP2 entirely by using a single central server. All clients communicate only with the server; no client ever transforms an op against another client's op directly. This reduces the convergence guarantee from TP1+TP2 to TP1 only, which is tractable to implement correctly and to verify via property-based tests.

Convergence trace for the "ABCDE" example:

Initial state:  "ABCDE"  (revision r₀)

Client A →  INSERT("X", pos=3)  at r₀  →  sent to server
Client B →  DELETE(pos=3, len=1) at r₀  →  sent to server

Server receives INSERT first:
  Apply INSERT("X", 3) → "ABCXDE"  (revision r₁)
  Broadcast INSERT("X", 3, seq=r₁) to all clients

Server receives DELETE(3, 1) based on r₀:
  concurrent_ops since r₀ = [INSERT("X", 3)]
  transform DELETE(3,1) against INSERT("X",3):
    insert.pos(3) ≤ delete.pos(3)  →  delete.pos += 1  →  DELETE(4, 1)
  Apply DELETE(4, 1) → "ABCXE"  (revision r₂)
  Broadcast DELETE(4, 1, seq=r₂) to all clients

Client A receives broadcast DELETE(4, 1):
  pending_ops is empty (INSERT already acked)
  Apply DELETE(4, 1) → local doc = "ABCXE"  ✓

Client B receives broadcast INSERT("X", 3):
  pending_ops = [DELETE(3, 1)]
  Transform pending DELETE against incoming INSERT:
    same transform as server → DELETE(4, 1)
  Apply INSERT("X", 3) → "ABCXDE"
  Apply transformed DELETE(4, 1) → "ABCXE"  ✓

All clients + server converge to "ABCXE". TP1 satisfied. ✓

📈 Performance Analysis: Where OT Gets Expensive at Scale

Transform cost per op: O(1) per operation pair. However, if a client is behind by k server revisions when it submits an op, the transform loop runs k iterations — O(k). A client that was disconnected for 10 minutes on a busy document (200 ops/min) returns with k = 2,000 ops to transform against. This is the client reconnect cost, not the steady-state cost.

Fan-out cost: A document with 50 concurrent editors generates ~100 ops/sec. Each op must be broadcast to 49 other clients. At 10,000 hot documents fleet-wide: 49 million WebSocket pushes per second. This is the primary throughput bottleneck — not OT computation itself.

Snapshot cost vs. reconstruction cost:

| Strategy | Reconstruction time | Storage per document |
| --- | --- | --- |
| No snapshots | O(total ops), unbounded | Minimal (ops only, ~200 B each) |
| Snapshot every 100 ops | O(100 op replays) | ~1 snapshot per 100 ops |
| Snapshot every 1,000 ops | O(1,000 op replays) | ~1 snapshot per 1,000 ops |

At 200 ops/min, snapshotting every 100 ops means one snapshot every 30 seconds — a reasonable trade-off between storage and reconstruction latency.

πŸ—οΈ Scaling Collaborative Editing: Hot Documents, Session Affinity, and Version History

Hot Documents and the Fan-Out Bottleneck

A document with 50+ simultaneous editors cannot be served by a single WebSocket process without head-of-line blocking. The solution is document-partitioned routing: all WebSocket connections for a given document are pinned to the same WebSocket server shard using consistent hashing on doc_id.

# Route all clients for document D to one WS shard (hash-mod shown for
# brevity; a true consistent-hash ring also minimizes remapping when
# the shard set changes)
shard_id = consistent_hash(doc_id) % NUM_WS_SHARDS
ws_server = websocket_shard_ring[shard_id]

# The OT service broadcasts to the owning shard only
# The shard fans out to all local connections for doc D

For exceptionally hot documents (>100 concurrent editors), a single shard becomes a bottleneck. The mitigation: a fan-out relay tier — lightweight relay processes that subscribe to the OT service's pub/sub channel for a document and forward to their slice of connected clients. The relay tier scales horizontally per document without any OT logic.
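For reference, a minimal consistent-hash ring sketch (an illustrative helper, not a specific library): each shard owns several virtual points on a ring, and a document routes to the first shard clockwise from its hash, so changing the shard set only remaps keys between adjacent points:

```python
# Minimal consistent-hash ring (illustrative, not a specific library).
# Each shard owns several virtual points; a doc routes to the first shard
# clockwise from its hash, so shard changes only remap adjacent ranges.
import bisect
import hashlib

class HashRing:
    def __init__(self, shards, vnodes=64):
        self.ring = sorted((self._h(f"{s}#{v}"), s)
                           for s in shards for v in range(vnodes))
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def shard_for(self, doc_id: str) -> str:
        i = bisect.bisect(self.points, self._h(doc_id)) % len(self.points)
        return self.ring[i][1]

ring = HashRing(["ws-1", "ws-2", "ws-3"])
print(ring.shard_for("doc-abc-123"))  # stable: same doc, same shard, every call
```

With virtual nodes, removing one of three shards remaps only roughly a third of documents, versus nearly all of them under plain hash-mod sharding.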

WebSocket Session Management: Stateless by Design

The WebSocket server itself holds zero document state. All persistent state lives in:

  • Redis: Current op buffer (last N ops per document), cursor positions, session metadata
  • PostgreSQL: The authoritative ops_log and snapshots tables

Because WebSocket servers are stateless, a server restart triggers reconnections that are routed to any available shard. No state migration is needed. The client's reconnect payload ({doc_id, last_seen_revision}) contains everything needed to resume the session from any server.

Version History: Event Sourcing Where the Op Log Is the Record

Every edit is an immutable, append-only row in ops_log. Version history requires no separate versioning system — it is inherent in the log structure. Restoring to any past state is a deterministic replay:

def restore_to_revision(doc_id: str, target_revision: int) -> str:
    # 1. Find the latest snapshot at or before target_revision
    snapshot = get_latest_snapshot_before(doc_id, target_revision)

    # 2. Replay ops from snapshot.seq + 1 to target_revision
    ops = get_ops(doc_id, from_seq=snapshot.seq + 1, to_seq=target_revision)

    # 3. Apply each op to the snapshot content
    content = snapshot.content
    for op in ops:
        content = apply_op(content, op)
    return content

Snapshot trigger — bounding reconstruction latency:

SNAPSHOT_INTERVAL = 100  # take a snapshot every 100 committed ops

def on_op_committed(doc_id: str, new_seq: int, current_content: str):
    if new_seq % SNAPSHOT_INTERVAL == 0:
        store_snapshot(doc_id, seq=new_seq, content=current_content)
        # Optionally prune ops older than 2 snapshot generations
        prune_ops_before(doc_id, seq=new_seq - (2 * SNAPSHOT_INTERVAL))

This bounds reconstruction to at most 100 op replays from the nearest snapshot — a constant-time guarantee regardless of how long the document has existed.

Cursor Presence: Ephemeral State via Redis Pub/Sub

Cursor positions are ephemeral — high-frequency, non-critical, and naturally lossy. Routing them through the OT pipeline would add unnecessary latency and log bloat. Redis pub/sub with a short TTL is the correct tool:

import redis, json, time

r = redis.Redis()

def broadcast_cursor(doc_id: str, user_id: str, position: int, color: str):
    payload = json.dumps({
        "user_id": user_id,
        "position": position,
        "color": color,
        "ts": time.time()
    })
    # Publish to the document's cursor channel for real-time broadcast
    r.publish(f"doc:{doc_id}:cursors", payload)
    # Cache latest cursor per user; new joiners can read this immediately
    r.setex(f"cursor:{doc_id}:{user_id}", 5, payload)  # 5-second TTL

def subscribe_cursors(doc_id: str, on_cursor_event):
    """Each WebSocket shard subscribes to cursor events for its active documents."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"doc:{doc_id}:cursors")
    for message in pubsub.listen():
        if message["type"] == "message":
            on_cursor_event(doc_id, json.loads(message["data"]))

The 5-second TTL on cached cursor positions means a newly connected client can immediately see all active cursors without waiting for the next cursor move event from each user.
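The new-joiner read can be sketched with an in-memory stand-in for the SETEX/GET pattern (TTL bookkeeping done by hand here purely for illustration; production uses the Redis keys shown above):

```python
# In-memory stand-in for SETEX/GET with TTL handled manually, purely to
# illustrate the presence snapshot a new joiner reads on connect.
import json
import time

CURSOR_TTL = 5.0
_store = {}                       # key -> (expiry, payload)

def setex(key, ttl, payload):
    _store[key] = (time.monotonic() + ttl, payload)

def get_active_cursors(doc_id):
    """What a new joiner reads: every cursor whose TTL has not expired."""
    now = time.monotonic()
    return [json.loads(p) for k, (exp, p) in _store.items()
            if k.startswith(f"cursor:{doc_id}:") and exp > now]

setex("cursor:d1:alice", CURSOR_TTL, json.dumps({"user_id": "alice", "position": 12}))
setex("cursor:d1:bob", CURSOR_TTL, json.dumps({"user_id": "bob", "position": 40}))
print([c["user_id"] for c in get_active_cursors("d1")])  # ['alice', 'bob']
```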

📊 End-to-End Architecture: From Keystroke to Broadcast

The following diagram traces the full collaborative editing data flow, from a user's keystroke to its propagation to all connected peers.

graph TD
    C1[Client A - Browser] -->|WebSocket: op + revision| WSS[WebSocket Server Shard]
    C2[Client B - Browser] -->|WebSocket: op + revision| WSS
    C3[Client C - Browser] -->|WebSocket: cursor move| WSS

    WSS -->|Forward op via gRPC| OTS[OT Service]
    WSS -->|Publish cursor event| RPS[Redis Pub/Sub]

    OTS -->|Transform op against concurrent ops| OTS
    OTS -->|Write to ops_log| OL[(ops_log - PostgreSQL)]
    OTS -->|Update op buffer| RC[Redis - Active Op Buffer]
    OTS -->|Publish transformed op| RPS

    RPS -->|Fan-out to subscribers| WSS
    WSS -->|Push op to all doc clients| C1
    WSS -->|Push op to all doc clients| C2
    WSS -->|Push cursor event| C3

    OTS -->|Every N ops: trigger snapshot| SS[Snapshot Service]
    SS -->|Write full doc content| SNAP[(snapshots - PostgreSQL)]

    C4[Client D - Reconnect] -->|GET /docs/reconstruct| RS[Read Service]
    RS -->|Load nearest snapshot| SNAP
    RS -->|Load ops since snapshot| OL
    RS -->|Return catchup bundle| C4

The diagram separates the two primary data flows. The write path (left side, C1/C2/C3 → WSS → OTS → RPS → back to clients) is the hot path: every keystroke traverses this path in under 100 ms. The read/reconnect path (bottom right, Client D) is the cold path: it reconstructs document state from persistent storage and is bounded to O(100) op replays by the snapshot strategy.

Component Roles at a Glance

| Component | Responsibility | State |
| --- | --- | --- |
| WebSocket Server (Shard) | Connection management, routing, relay | Stateless — no document content |
| OT Service | Transform, serialize, and broadcast ops per document | Redis op buffer (last N ops) |
| ops_log (PostgreSQL) | Durable, ordered op history per document | Persistent — primary source of truth |
| Snapshot Service | Periodically captures full document state | PostgreSQL snapshots table |
| Redis Pub/Sub + Cache | Cursor positions, session registry, fast op buffer | Ephemeral — TTL-managed |
| Read Service | Document reconstruction for open and reconnect | Stateless — queries snapshots + ops_log |

🌍 Real-World Applications: How Google Docs and Notion Scale Collaborative Editing

Google Docs: The Jupiter Protocol at Scale

Google Docs uses an OT variant based on the Jupiter Protocol (Xerox PARC, 1995; evolved at Google into the Wave OT format). The architecture is deliberately centralized:

  • A single OT server process per document serializes all operations — no peer-to-peer transforms
  • Clients communicate only with the server, reducing the convergence requirement to TP1 only
  • Operations are represented as sequences of retain(n), insert(text), and delete(n) instructions over a string with rich text attributes, allowing a single OT type to handle plain text and formatting simultaneously

Scale observations: Google Docs processes billions of edits per day. Document servers are sharded by document ID, with hot documents promoted to higher-capacity shard hosts. The bottleneck at this scale is not OT computation (which is O(1) per transform) but fan-out: broadcasting a single op to 50 concurrent editors on 10,000 simultaneous hot documents drives tens of millions of WebSocket pushes per second.

Notion: Migrating from OT Toward CRDT Structures

Notion initially built on an OT backbone but has progressively migrated page blocks and nested structures toward CRDT-based data types. The motivation: CRDTs require no central transform server. Convergence is guaranteed by the data structure's mathematical properties β€” any two replicas, having received the same set of operations in any order, will converge to the same state.

A CRDT for text (e.g., Yjs or Automerge) assigns a globally unique identifier to every character insertion. When two users insert characters at the same position, the CRDT breaks the tie deterministically using character IDs — no transform logic is needed, and no central server is required.

OT vs. CRDT at a glance:

| Dimension | OT (Jupiter / ShareDB) | CRDT (Yjs / Automerge) |
| --- | --- | --- |
| Central server required | Yes — single serialization point | No — peer-to-peer is possible |
| Correctness model | Transform function must satisfy TP1 | Structural; convergence is algebraic |
| Memory overhead | Compact (ops are small deltas) | High — tombstones for deleted chars persist indefinitely |
| Offline-first | Difficult — server needed for transforms | Natural fit |
| Rich text maturity | Mature (Wave OT, json0) | Emerging (Yjs has good rich-text support) |
| Garbage collection | Not required | Required; coordination needed across peers |

βš–οΈ Trade-offs and Failure Modes: OT vs. CRDT vs. Last-Writer-Wins

Why Last-Writer-Wins Is Wrong for Document Editing

LWW resolves conflicts using wall-clock timestamps. In distributed systems, clocks diverge (NTP drift of 100–500 ms is common). An edit from a node with a slightly faster clock always wins — silently discarding concurrent edits from nodes with slightly slower clocks. More fundamentally, LWW discards one of two concurrent edits; it has no mechanism to merge them. Collaborative editing requires merging; LWW is structurally incapable of providing it.
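A five-line sketch makes the failure concrete (illustrative resolver, not any particular database's implementation):

```python
# Illustrative LWW resolver: whichever write carries the larger wall-clock
# timestamp wins outright; the other concurrent edit vanishes unmerged.
def lww_resolve(a: dict, b: dict) -> dict:
    return a if a["ts"] >= b["ts"] else b

edit_a = {"text": "the quick brown fox", "ts": 1_700_000_000.120}  # faster clock
edit_b = {"text": "the quick red fox",   "ts": 1_700_000_000.050}  # slower clock

winner = lww_resolve(edit_a, edit_b)
print(winner["text"])  # B's concurrent, equally valid edit is silently gone
```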

Failure Modes in Production OT Systems

| Failure Mode | Root Cause | Mitigation |
| --- | --- | --- |
| Transform function bug | Edge case in overlapping DELETE vs. DELETE range handling | Property-based fuzz testing using TP1 as the invariant: for any randomly generated concurrent ops, both transform orderings must produce identical final states |
| Client-server revision drift | Client submits an op far behind current server revision after a long reconnect | Enforce a max client_revision_lag; force a full page reload if the gap exceeds the safe transform window |
| OT server bottleneck | Single-threaded OT loop can't keep up with 100+ concurrent editors | Partition hot documents; introduce a fan-out relay tier that scales horizontally without OT logic |
| Redis pub/sub message loss | Redis restart drops in-flight cursor and op broadcasts | Op durability is guaranteed by PostgreSQL write-first; clients detect missing ops via revision sequence gaps and trigger a catchup request |
| Split-brain during OT server failover | Two OT replicas briefly both accept ops for the same document | Leader election per document via etcd or ZooKeeper; at-most-one active OT writer enforced at the coordination layer |

CAP Trade-Off: Consistency Over Availability at the OT Layer

The collaborative editing system deliberately prioritizes Consistency over Availability at the OT Service:

  • During an OT service failure, the active editing session is paused (clients buffer local ops and display a "reconnecting…" spinner) rather than allowing two OT processes to diverge
  • This is a deliberate engineering choice: a split-brain OT scenario would require a CRDT-style merge of two diverged op histories post-failover — an extremely complex recovery path
  • Users tolerate a few seconds of pause; silent document corruption is unacceptable

🧭 Decision Guide: Choosing Between OT, CRDT, and Optimistic Locking

| Situation | Recommendation |
| --- | --- |
| Use OT when | You have a reliable central server, need mature tooling (ShareDB, json0, Wave OT), and your document is primarily linear text or structured JSON with a known schema |
| Use CRDT when | You need offline-first support, peer-to-peer sync without a server, or your data maps cleanly to a CRDT type (sequences, sets, counters, maps) |
| Avoid LWW when | Multiple concurrent users can edit overlapping regions — LWW silently discards one user's valid edits with no merge semantics |
| Avoid multi-master OT when | You cannot rigorously verify TP2 correctness across your entire operation set — multi-master OT bugs are notoriously subtle and hard to reproduce |
| Use optimistic locking instead when | A document is rarely edited by more than one person at a time — version-based conflict detection ("document changed, please reload") is far simpler than OT and sufficient for low-contention cases |

🧪 Data Model, Cache Strategy, and the Write Path in Detail

Data Model

-- Core document record
CREATE TABLE documents (
    doc_id      UUID PRIMARY KEY,
    title       TEXT NOT NULL,
    owner_id    UUID NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    latest_seq  INT  DEFAULT 0    -- latest committed operation sequence number
);

-- Every edit ever made — the immutable event log
CREATE TABLE ops_log (
    doc_id       UUID NOT NULL,
    seq          INT  NOT NULL,           -- server-assigned; monotonically increasing per doc
    op_type      TEXT NOT NULL,           -- 'insert' | 'delete' | 'retain'
    position     INT  NOT NULL,
    content      TEXT,                    -- for inserts; NULL for deletes
    length       INT,                     -- for deletes; NULL for inserts
    user_id      UUID NOT NULL,
    client_id    TEXT NOT NULL,           -- unique per browser tab/client instance
    applied_at   TIMESTAMPTZ DEFAULT NOW(),
    vector_clock JSONB,                   -- logical clock snapshot at time of op
    PRIMARY KEY  (doc_id, seq)
);

-- Periodic full-content snapshots to bound reconstruction time
CREATE TABLE snapshots (
    doc_id      UUID NOT NULL,
    seq         INT  NOT NULL,            -- op sequence at which this snapshot was captured
    content     TEXT NOT NULL,            -- full document content at this seq
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (doc_id, seq)
);

-- Document access control
CREATE TABLE doc_permissions (
    doc_id      UUID NOT NULL,
    user_id     UUID NOT NULL,
    role        TEXT NOT NULL,            -- 'owner' | 'editor' | 'viewer'
    granted_at  TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (doc_id, user_id)
);

Cache Design

| Cache Key Pattern | Cached Value | TTL | Purpose |
| --- | --- | --- | --- |
| doc:{doc_id}:ops_buffer | Last 200 ops as sorted set (score = seq) | 10 min sliding | Fast catchup for recently reconnected clients without hitting PostgreSQL |
| doc:{doc_id}:latest_seq | Current server revision integer | Invalidated on each commit | OT service reads this atomically before accepting new ops |
| cursor:{doc_id}:{user_id} | {position, color, ts} JSON | 5 s | Stale cursor auto-eviction; initial presence snapshot for new joiners |
| session:{client_id} | {user_id, doc_id, last_ack_seq} | 30 min | WebSocket session metadata for reconnect routing |

The Write Path: Full Flow from Keystroke to Convergence

A user types "e" at position 47 in a document currently at revision 314:

  1. Client creates op {type: "insert", position: 47, content: "e", client_revision: 314}
  2. Client applies it optimistically to local doc state (zero-latency local render)
  3. Client sends op over WebSocket to the assigned shard
  4. WebSocket Server forwards op to the OT Service via internal gRPC
  5. OT Service atomically reads doc:{doc_id}:latest_seq from Redis → seq = 314; client is current, no transforms needed
  6. OT Service assigns seq = 315, persists op to ops_log (PostgreSQL write-ahead log ensures durability)
  7. OT Service updates doc:{doc_id}:latest_seq in Redis to 315
  8. OT Service appends op to doc:{doc_id}:ops_buffer sorted set in Redis
  9. OT Service publishes {seq: 315, op: ...} to Redis pub/sub channel doc:{doc_id}:ops
  10. WebSocket Server receives pub/sub message, pushes {seq: 315, op: ...} to all 49 other connected clients for this document
  11. All peer clients apply the incoming op, retransforming any pending ops in their local queues

Target round-trip: < 100 ms at p99 from step 1 (keypress) to step 11 (all peers updated).
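Steps 5 through 9 can be condensed into one function, with a plain dict standing in for both Redis and the pub/sub channel (stand-ins only; the real system uses the Redis keys and channel names described above):

```python
# Steps 5-9 condensed; a plain dict stands in for Redis and a list for the
# pub/sub channel (illustrative stand-ins using the same key names).

def commit_op(state, doc_id, op, client_revision):
    seq_key = f"doc:{doc_id}:latest_seq"
    latest = state.get(seq_key, 0)                  # step 5: read revision
    assert client_revision == latest, "client behind: transform first"
    new_seq = latest + 1                            # step 6: assign next seq
    state.setdefault(f"doc:{doc_id}:ops_log", []).append((new_seq, op))
    state[seq_key] = new_seq                        # step 7: bump revision
    state.setdefault(f"doc:{doc_id}:ops_buffer", []).append((new_seq, op))  # step 8
    state.setdefault(f"doc:{doc_id}:ops_channel", []).append(  # step 9: publish
        {"seq": new_seq, "op": op})
    return new_seq

state = {}
seq = commit_op(state, "d1", {"type": "insert", "position": 47, "content": "e"},
                client_revision=0)
print(seq, state["doc:d1:latest_seq"])  # 1 1
```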

The Read Path: Document Reconstruction on Open

When a user opens a document cold:

  1. Client sends GET /docs/{doc_id} with auth token
  2. Read Service verifies permissions in doc_permissions
  3. Read Service fetches the latest snapshot from the snapshots table
  4. Read Service fetches all ops from ops_log where seq > snapshot.seq
  5. Read Service replays ops against the snapshot content to reconstruct current state
  6. Read Service returns {content, current_seq} to the client
  7. Client opens WebSocket connection and subscribes to ops starting at current_seq + 1

πŸ› οΈ ShareDB: How It Implements Operational Transformation

ShareDB is the leading open-source OT framework for Node.js. It implements OT over JSON documents using the json0 operation type — a composable set of operations over JSON structures: insert/delete on strings and arrays, set/unset on object keys, and numeric increment. ShareDB handles the full OT cycle: server-side transform and sequencing, client-side pending op queue, retransform on reconnect, and pluggable persistence backends.

Minimal ShareDB Server Integration

// server.js β€” ShareDB OT server backed by PostgreSQL
const ShareDB = require('sharedb');
const PostgresDB = require('sharedb-postgres');
const WebSocket = require('ws');
const WebSocketJSONStream = require('@teamwork/websocket-json-stream');

const db = new PostgresDB({ connectionString: process.env.DATABASE_URL });
const backend = new ShareDB({ db });

// ShareDB handles all OT negotiation over the WebSocket stream
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
  const stream = new WebSocketJSONStream(ws);
  backend.listen(stream);
});

Minimal ShareDB Client Integration

// client.js — real-time collaborative text editing in the browser
const sharedb = require('sharedb/lib/client');
const ReconnectingWebSocket = require('reconnecting-websocket');

const socket = new ReconnectingWebSocket('ws://localhost:8080');
const connection = new sharedb.Connection(socket);

// Subscribe to the document
const doc = connection.get('documents', 'doc-abc-123');
doc.subscribe((err) => {
  if (err) throw err;

  // Submit an insert at string index 5 (json0 format: si = string insert)
  doc.submitOp([{ p: ['body', 5], si: 'Hello' }], (err) => {
    if (err) console.error('Submit failed:', err);
    // op is immediately applied locally; server transforms it against
    // any concurrent ops and broadcasts the result to all subscribers
  });
});

// All concurrent remote ops arrive here — already transformed by ShareDB
doc.on('op', (op, source) => {
  if (!source) {
    // Remote op β€” apply to editor UI
    console.log('Remote op applied:', op);
  }
});

submitOp encapsulates the entire OT write cycle: optimistic local application, server submission with the current revision, server-side transform against concurrent ops, and broadcast. The application developer never writes transform logic directly — that is ShareDB's contract.

For a full deep-dive on ShareDB's json0 OT types, projections, and access control middleware, see the ShareDB README and the json0 specification. A companion post exploring CRDTs via Yjs is a planned follow-up in this series.

📚 Lessons Learned: What Engineers Get Wrong About Collaborative Editing

  1. Don't conflate operation ordering with operation merging. The most common interview pitfall is proposing wall-clock timestamps plus last-writer-wins. Timestamps tell you when operations were issued; they have no semantics for how to combine them. Two concurrent inserts at the same position both represent valid user intentions — neither is wrong, and discarding one is a data loss bug, not conflict resolution.

  2. The transform function is load-bearing code. Every bug in the transform matrix is a potential silent document corruption vector. Production OT implementations use property-based fuzz testing: generate thousands of random concurrent op pairs (O₁, O₂), and for each pair verify that applying O₁ followed by transform(O₂, O₁) produces the same final state as applying O₂ followed by transform(O₁, O₂). TP1 is the fuzz oracle.

  3. Snapshots are not optional. A document with 50,000 ops that takes 30 seconds to reconstruct will cause visible loading delays on every open. Snapshot every 50–200 ops depending on average op size. Treating snapshots as an "optimization" rather than a correctness requirement for latency SLAs leads to silent degradation as documents age.

  4. WebSocket servers must be stateless. If the OT transform state lives inside a WebSocket server process, you've coupled document consistency to that process's availability. When the process restarts, the document's editing session is corrupted. All OT state — op buffer, current revision, transform context — belongs in Redis and PostgreSQL. WebSocket servers are routing infrastructure, not business logic.

  5. Cursor positions are not edits. A common over-engineering mistake is routing cursor events through the OT pipeline for causal ordering. Cursor events are ephemeral (a stale cursor position is invisible to users after 5 seconds), high-frequency (every mouse move), and naturally lossy. Put them in Redis pub/sub with a 5-second TTL. Ordering cursor events through the durable op log wastes compute, adds latency, and pollutes version history.

  6. CRDT memory overhead compounds over time. CRDTs for text maintain tombstones for every deleted character — these can never be garbage-collected without coordination across all live replicas. A 1 KB document that has undergone thousands of edits can balloon to 100 KB of CRDT state. Plan for compaction. Yjs has a garbage collection mechanism, but it requires all peers to have acknowledged receipt of the relevant ops before pruning — a coordination requirement that reintroduces complexity in peer-to-peer deployments.
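Point 2 above, the load-bearing transform, can be made concrete. The sketch below implements a minimal single-character transform matrix and uses TP1 as a fuzz oracle. The op shapes and the side tie-breaking convention are invented for illustration and are far simpler than json0 or any production op type.

```javascript
// Minimal single-character OT transform matrix plus a TP1 fuzz check.
// Illustrative only; production op types (json0, rich text) are far richer.
function transform(a, b, side) {
  if (a.type === 'ins' && b.type === 'ins') {
    if (a.pos < b.pos || (a.pos === b.pos && side === 'left')) return a;
    return { ...a, pos: a.pos + 1 }; // shifted right by the other insert
  }
  if (a.type === 'ins' && b.type === 'del') {
    return a.pos <= b.pos ? a : { ...a, pos: a.pos - 1 };
  }
  if (a.type === 'del' && b.type === 'ins') {
    return a.pos < b.pos ? a : { ...a, pos: a.pos + 1 };
  }
  // del vs del
  if (a.pos < b.pos) return a;
  if (a.pos > b.pos) return { ...a, pos: a.pos - 1 };
  return { type: 'noop' }; // both sides deleted the same character
}

function apply(s, op) {
  if (op.type === 'ins') return s.slice(0, op.pos) + op.ch + s.slice(op.pos);
  if (op.type === 'del') return s.slice(0, op.pos) + s.slice(op.pos + 1);
  return s; // noop
}

function randomOp(len, ch) {
  if (len > 0 && Math.random() < 0.5) {
    return { type: 'del', pos: Math.floor(Math.random() * len) };
  }
  return { type: 'ins', pos: Math.floor(Math.random() * (len + 1)), ch };
}

// TP1 oracle: apply(apply(S, O1), transform(O2, O1)) must equal
// apply(apply(S, O2), transform(O1, O2)) for every concurrent pair.
function fuzzTP1(rounds) {
  for (let i = 0; i < rounds; i++) {
    const s = 'abcdefgh';
    const o1 = randomOp(s.length, 'X');
    const o2 = randomOp(s.length, 'Y');
    const left = apply(apply(s, o1), transform(o2, o1, 'right'));
    const right = apply(apply(s, o2), transform(o1, o2, 'left'));
    if (left !== right) return false;
  }
  return true;
}
```

Note the side argument: concurrent inserts at the same position need a deterministic tie-break, and the two transform orderings must use opposite sides or TP1 fails.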

📌 TLDR Summary & Key Takeaways: The Eight Design Decisions That Define This System

  • OT over last-writer-wins: Collaborative editing requires merging concurrent operations, not discarding one. OT's transform function reconciles edits while preserving both users' intentions β€” LWW has no such merge semantics.
  • Central arbiter (Jupiter Protocol): A single OT server per document serializes operations, reducing the convergence requirement to TP1 only. This is the only OT architecture that has been proven correct and deployed at billion-user scale.
  • WebSocket for real-time delivery: HTTP polling introduces 200–1,000 ms of additional round-trip latency per operation. A collaborative editor operating on polling is perceptibly broken. Persistent WebSocket channels are a hard architectural requirement, not a convenience.
  • Stateless WebSocket servers: All document state lives in Redis and PostgreSQL. WebSocket servers are replaceable routing infrastructure. No document consistency is coupled to the lifecycle of any single process.
  • Event sourcing for the op log: Every edit is an immutable, append-only row in ops_log. Version history, point-in-time restore, and audit trails come for free from this single design choice.
  • Snapshots every N ops: Without snapshots, document reconstruction is O(total ops ever). With snapshots every 100 ops, reconstruction is bounded to O(100) β€” a latency guarantee that holds regardless of document age or edit volume.
  • Redis for ephemeral state: Cursor positions, session metadata, and the recent op buffer all live in Redis with TTLs. Redis pub/sub fans out broadcasts to WebSocket shards; Redis delivers the sub-millisecond read latency that PostgreSQL cannot meet for hot-path operations.
  • CRDT as the alternative path: CRDTs (Yjs, Automerge) eliminate the central arbiter requirement and enable offline-first and peer-to-peer topologies, at the cost of memory overhead from tombstones and added coordination for garbage collection. Choose CRDTs when you need offline-first; choose OT (ShareDB/Jupiter) when you have a reliable central server and need battle-tested tooling.

πŸ“ Practice Quiz: Test Your Collaborative Editing Design Knowledge

  1. User A inserts "X" at position 3. Concurrently, User B also inserts "Y" at position 3. Both were issued against revision r. The server applies A's op first. After OT transformation, what position does B's insert target?

    • A) Position 3 — unchanged, because both ops are independent inserts
    • B) Position 4 — shifted right by 1 because A's insert at position 3 displaced B's target
    • C) Position 2 — shifted left to avoid overwriting A's character
    • D) The operation is rejected with a conflict error and B must re-submit

    Correct Answer: B
  2. Why does this collaborative editing system use WebSocket persistent connections rather than HTTP short-polling or HTTP long-polling for operation delivery?

    • A) WebSockets are the only protocol that guarantees exactly-once delivery semantics
    • B) HTTP polling adds a minimum of one full HTTP round-trip per operation (200–1,000 ms), making the editor perceptibly laggy; WebSockets maintain a persistent bidirectional channel that delivers ops in < 10 ms
    • C) WebSockets automatically perform OT transform logic; HTTP does not
    • D) HTTP connections cannot carry user identity across requests without session cookies

    Correct Answer: B
  3. A heavily edited document has 80,000 operations in ops_log and no snapshots have ever been taken. What is the worst-case time complexity to reconstruct its current content for a user opening it now?

    • A) O(1) — the document state is always cached in Redis
    • B) O(log n) — the ops log is indexed with a B-tree and can be binary searched
    • C) O(n) — every one of the 80,000 operations must be replayed sequentially from the beginning
    • D) O(n²) — each operation must be transformed against all previous operations before application

    Correct Answer: C
  4. During an OT Service outage, 30 users are actively editing document D. The load balancer detects the failure and promotes a warm standby OT Service replica that has been independently receiving op writes. What critical problem does this introduce?

    • A) The standby replica will reject all ops because it lacks the primary's session tokens
    • B) Two independent OT processes will have each committed ops during the overlap window, creating a split-brain: both replicas hold a different canonical revision sequence for the same document, requiring a CRDT-style merge to reconcile — a recovery path of extreme complexity
    • C) WebSocket clients will automatically negotiate a tie-breaking protocol using their local vector clocks
    • D) No problem: the standby's ops log is a perfect superset of the primary's, so state is already consistent

    Correct Answer: B
  5. Open-ended: Your team proposes replacing the OT Service with a Yjs CRDT engine to eliminate the single-point-of-failure concern. What specific operational trade-offs would you evaluate before approving this change? Address at minimum: memory growth from tombstones on heavily edited documents, the coordination requirements introduced by Yjs garbage collection, the maturity of CRDT support for rich text attributes (bold, inline comments, nested lists), and any new monitoring or debugging complexity introduced by a decentralized topology.

    Correct Answer: (Open-ended — no single correct answer. Strong answers address: (1) tombstone accumulation — a 1 KB doc with thousands of edits can balloon to 100 KB of CRDT state; plan for GC coordination; (2) Yjs GC requires all live peers to acknowledge receipt of relevant ops before pruning — this reintroduces a form of coordination overhead; (3) Yjs has solid rich-text support via y-prosemirror / y-quill but is less battle-tested than json0/Wave OT for large-scale production deployments; (4) decentralized topology changes debugging: with OT, the server op log is the single forensic record; with peer-to-peer CRDTs, causality reconstruction requires collecting vector clock state across all peers.)

Written by Abstract Algorithms (@abstractalgorithms)