System Design HLD Example: Collaborative Document Editing (Google Docs)
A practical interview-ready HLD for real-time collaborative editing with conflict resolution and sync.
TL;DR: Real-time collaborative document editing uses Operational Transformation (OT) or CRDTs to ensure all clients converge to the same document state regardless of concurrent edit ordering. The key insight: every operation must be transformed against concurrent unacknowledged operations before application, not simply sequenced by timestamp. This HLD covers the full system: OT algorithm and convergence proofs, WebSocket fan-out, cursor presence via Redis pub/sub, version history via event sourcing with snapshots, and the bottlenecks that emerge at Google Docs scale.
The Concurrent Edit Problem: Why Collaborative Documents Are a Hard Distributed Systems Challenge
Two users are editing the same document at 2am. User A inserts "Hello" starting at position 5. Simultaneously, User B deletes the character at position 5. Both operations are issued concurrently; neither has seen the other's edit yet. If the server naively applies them in arrival order, A's insert lands first, shifting every subsequent character right by 5 positions. User B's delete now targets position 5, which is the first character of A's newly inserted "Hello", silently corrupting the document. Last-writer-wins fails entirely here. There is no "last" writer; there are two concurrent writers whose intentions must be merged, not arbitrated.
This is the core engineering challenge of collaborative editing: how do you reconcile concurrent operations that were each valid against a shared-but-diverged document state, producing a result that is correct, convergent, and non-destructive to either user's intent?
Use Cases and System Scope
| Actor | Core Journeys |
| --- | --- |
| Document author | Create document, set sharing permissions, transfer ownership |
| Co-editor | Type, delete, format text in real-time alongside other users |
| Viewer | Read-only access; see live edits as they happen |
| Presence service | Track who is in the document and where each cursor is |
| Version history | Reconstruct any past state of the document from the edit log |
In scope for this HLD:
- Create, read, update, and delete documents (full CRUD)
- Real-time multi-user co-editing with cursor presence (who is editing where)
- Sharing and permission management: owner, editor, viewer roles
- Version history and point-in-time document restore
- Reconnect sync: replay missed operations when a client returns from a network drop
Out of scope (v1 boundary):
- Spell check, grammar suggestions, and AI writing assistance
- Full offline-first editing with prolonged-disconnect merge (CRDT extension)
- End-to-end encryption of document content
- Voice and video calling (separate signalling layer)
Design Goals
| Goal | Specification |
| --- | --- |
| Convergence | All replicas must converge to identical document state within 200 ms of a network partition healing |
| Edit latency | Operation round-trip (client → server → broadcast to peers) under 100 ms at p99 |
| Durability | No committed operation may be lost; the ops log is the source of truth |
| Availability | Active editing sessions survive WebSocket server restarts via stateless reconnect |
| Throughput | Support 5 M concurrent active editing sessions; up to 50 concurrent editors per document |
Capacity Estimation
Google Docs serves approximately 1 billion registered users with an estimated 50 million daily active editors. At peak:
- Concurrent editing sessions: ~5 million (10% of DAU actively typing)
- Operations per second (write): An active typist generates ~2 ops/sec. At 5 M sessions: ~10 million ops/sec fleet-wide
- Storage for the ops log: Each operation averages ~200 bytes (type, position, content, metadata). At 10 M ops/sec: ~2 GB/sec of raw op-log writes, so snapshots are critical to prevent unbounded growth and unbounded reconstruction time
- WebSocket connections: 5 M persistent connections; at ~50 KB RAM per connection: ~250 GB RAM across WebSocket servers, so horizontal sharding by document ID is required
- Document reads (open + reconnect): ~100 M document opens per day, or ~1,200 document reconstructions/sec from snapshot + op replay
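These estimates can be sanity-checked with quick arithmetic; the inputs below are the stated assumptions, not measured values.

```python
# Back-of-the-envelope check of the capacity estimates above.
sessions = 5_000_000       # concurrent editing sessions at peak
ops_per_session = 2        # an active typist generates ~2 ops/sec
op_size = 200              # average serialized op size in bytes
conn_ram_kb = 50           # RAM per WebSocket connection

ops_per_sec = sessions * ops_per_session            # fleet-wide write rate
log_rate_gb = ops_per_sec * op_size / 1e9           # raw op-log write rate
ws_ram_gb = sessions * conn_ram_kb * 1_000 / 1e9    # total WebSocket RAM

print(ops_per_sec, log_rate_gb, ws_ram_gb)  # 10000000 2.0 250.0
```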
Operational Transformation: The Algorithm That Prevents Silent Data Loss
Operational Transformation (OT) is the mechanism that reconciles concurrent edits. The key insight: when two operations are issued concurrently against the same document version, one must be transformed to account for the positional effects of the other before it can be safely applied.
The Core Conflict: A Concrete Example
Consider the document "ABCDE" (character positions 0–4).
- User A issues INSERT("X", position=3); intended result: "ABCXDE"
- User B simultaneously issues DELETE(position=3, length=1); intended result: "ABCE"
Both operations are stamped against document version v1. The server receives A's op first, applies it, and the document becomes "ABCXDE". Now B's op arrives: DELETE(position=3, length=1). Applied naively, it deletes "X", not "D" as B intended. B wanted to delete "D", which now sits at position 4 after A's insert shifted everything right.
OT transforms B's delete against A's insert before applying it:
def transform_delete_against_insert(delete_op, insert_op):
    """
    insert_op inserted len(insert_op.text) characters at insert_op.position.
    If the insert happened at or before the delete target, the delete
    target has shifted right by the length of the inserted text.
    """
    if insert_op.position <= delete_op.position:
        # Insert is at or before our delete target: shift right
        delete_op.position += len(insert_op.text)
    # else: insert is after our delete target, so the delete position is unchanged
    return delete_op
Applying this: insert_op.position (3) <= delete_op.position (3), so delete_op.position becomes 3 + 1 = 4. The transformed delete removes the character at position 4 ("D"), producing "ABCXE". Both users' intentions are preserved and the document converges.
The Full Transform Matrix
| Op₁ (applied first) | Op₂ (to transform against Op₁) | Transform rule |
| --- | --- | --- |
| INSERT(p₁, len) | INSERT(p₂) | If p₁ ≤ p₂: p₂ += len; else unchanged |
| INSERT(p₁, len) | DELETE(p₂, l₂) | If p₁ ≤ p₂: p₂ += len; if p₁ is inside the deleted range: split the delete around the insert |
| DELETE(p₁, l₁) | INSERT(p₂) | If p₁ < p₂: p₂ -= l₁; if p₂ is inside the deleted range: p₂ = p₁ |
| DELETE(p₁, l₁) | DELETE(p₂, l₂) | Overlap must be resolved: chars deleted by both ops are counted exactly once; non-overlapping ranges shift independently |
The four-case matrix is deceptively compact; production OT implementations for rich text (attributes, embedded objects, list formatting) handle many more edge cases through composite operation types.
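A minimal executable sketch of the matrix follows, restricted to point inserts and single-character deletes. The range-delete split case is omitted, and tie-breaking between equal-position inserts (which real systems resolve asymmetrically, e.g. by client ID) is not modeled; the `Insert`/`Delete` classes are invented for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Insert:
    position: int
    text: str

@dataclass(frozen=True)
class Delete:
    position: int  # deletes one character; real ops carry a length

def transform(op, against):
    """Rewrite `op` so it applies correctly after `against` has been applied."""
    if isinstance(against, Insert):
        # Matrix rows 1-2: a prior insert shifts later positions right
        if against.position <= op.position:
            return replace(op, position=op.position + len(against.text))
    else:
        # Matrix rows 3-4: a prior single-char delete shifts later positions left
        if against.position < op.position:
            return replace(op, position=op.position - 1)
        if against.position == op.position and isinstance(op, Delete):
            return None  # both ops deleted the same character: count it once
    return op

def apply_op(doc: str, op):
    if op is None:
        return doc
    if isinstance(op, Insert):
        return doc[:op.position] + op.text + doc[op.position:]
    return doc[:op.position] + doc[op.position + 1:]

# The worked "ABCDE" example: B's delete is transformed past A's insert
a, b = Insert(3, "X"), Delete(3)
doc = apply_op("ABCDE", a)            # "ABCXDE"
doc = apply_op(doc, transform(b, a))  # delete shifts to position 4
print(doc)  # ABCXE
```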
The OT Server as the Central Arbiter
The server is the single serialization point for every document. Every client operation includes a revision number: the document version the operation was based on. If a client is behind (its operation is based on revision r but the server is now at r+3), the server transforms the incoming op against ops r+1, r+2, and r+3 in sequence before applying it.
def server_receive_op(doc_id: str, client_op, client_revision: int):
    server_revision = get_current_revision(doc_id)
    # Fetch all ops the client hasn't yet seen
    concurrent_ops = get_ops(doc_id, from_seq=client_revision + 1,
                             to_seq=server_revision)
    # Transform client's op against each concurrent server op in order
    transformed_op = client_op
    for server_op in concurrent_ops:
        transformed_op = transform(transformed_op, server_op)
    # Assign next revision, persist, broadcast to all subscribers
    new_revision = server_revision + 1
    persist_op(doc_id, transformed_op, seq=new_revision)
    update_revision(doc_id, new_revision)
    broadcast_to_subscribers(doc_id, transformed_op, seq=new_revision)
    return new_revision
This model is known as the Jupiter Protocol: a client-server OT architecture in which all transforms happen on the server, no peer-to-peer transform is ever required, and the server never needs to understand document semantics beyond the transform function.
Deep Dive: OT Internals, Convergence Proofs, and Computation Cost at Scale
Internals: How the Client and Server OT Engines Stay Synchronized
The client-side OT engine maintains three key data structures:
- pending_ops: Queue of ops sent to the server but not yet acknowledged
- client_revision: Last revision the client confirmed from the server
- local_doc_state: Optimistically applied document (pending ops included)
The client applies every typed operation optimistically to local_doc_state for zero-latency local rendering. It simultaneously adds the op to pending_ops and sends it to the server. When the server broadcasts an op from another user, the client must:
- Transform every op in pending_ops against the incoming op (so the pending queue stays valid against the growing server sequence)
- Apply the (possibly transformed) incoming op to local_doc_state
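These pieces can be wired into a small state machine. This is an illustrative sketch, not ShareDB's actual client: the insert-only op model (a `(position, text)` tuple) and the method names `local_edit`/`remote_op`/`ack` are invented for the example, and tie-breaking between equal-position inserts is not modeled.

```python
def transform(op, against):
    # Insert-only model: a concurrent insert at or before our position shifts us right
    pos, text = op
    apos, atext = against
    return (pos + len(atext), text) if apos <= pos else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

class ClientOTEngine:
    def __init__(self, doc: str, revision: int):
        self.local_doc_state = doc
        self.client_revision = revision
        self.pending_ops = []  # sent to server, not yet acknowledged

    def local_edit(self, op):
        # Optimistic apply for zero-latency rendering, then queue for the server
        self.local_doc_state = apply_op(self.local_doc_state, op)
        self.pending_ops.append(op)
        return op  # caller ships this over the WebSocket

    def remote_op(self, op, seq: int):
        # 1. Keep the pending queue valid against the growing server sequence;
        #    the incoming op is symmetrically shifted past each pending op
        rebased = []
        for pending in self.pending_ops:
            rebased.append(transform(pending, op))
            op = transform(op, pending)
        self.pending_ops = rebased
        # 2. Apply the (possibly transformed) incoming op locally
        self.local_doc_state = apply_op(self.local_doc_state, op)
        self.client_revision = seq

    def ack(self, seq: int):
        # Server committed our oldest pending op at revision `seq`
        self.pending_ops.pop(0)
        self.client_revision = seq
```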
The server-side OT engine keeps this state per document:
- ops_log: Ordered log of all committed ops; each has a seq number
- current_revision: Monotonically incrementing sequence counter
- active_sessions: Map of connected client_id → last_ack_revision
The server holds no rendered document content in memory; it stores only the op log and the current revision number. Document content is reconstructed on demand from the nearest snapshot plus subsequent ops.
Reconnect sync protocol:
When a client reconnects after a network interruption:
- Client sends {doc_id, last_seen_revision: r}
- Server fetches the nearest snapshot at or before revision r, plus all ops from the snapshot to the present
- Server sends the catchup bundle: {snapshot_content, ops_since_snapshot, current_revision}
- Client replays the ops against the snapshot content to reconstruct current state
- Client resubmits its locally pending (unacknowledged) ops with an updated client_revision
- Server re-transforms those pending ops against everything it received while the client was disconnected
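The client side of this protocol might look like the following sketch, using the same simplified insert-only op model (a `(position, text)` tuple). The function name and bundle shape mirror the steps above but are otherwise invented; here `ops_since_snapshot` is treated as the ops the client has not yet applied.

```python
def transform(op, against):
    # Insert-only model: a concurrent insert at or before our position shifts us right
    pos, text = op
    apos, atext = against
    return (pos + len(atext), text) if apos <= pos else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

def handle_catchup(bundle, pending_ops):
    """bundle = {snapshot_content, ops_since_snapshot, current_revision}."""
    # Replay the missed ops against the snapshot to rebuild current server state
    doc = bundle["snapshot_content"]
    for op in bundle["ops_since_snapshot"]:
        doc = apply_op(doc, op)
    # Re-base locally pending ops past everything missed while offline;
    # they are then resubmitted stamped with the new client_revision
    rebased = []
    for pending in pending_ops:
        for missed in bundle["ops_since_snapshot"]:
            pending = transform(pending, missed)
        rebased.append(pending)
    return doc, rebased, bundle["current_revision"]
```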
Vector clocks for cursor causality (presence events, not OT sequencing):
While OT uses monotonic revision numbers for operation ordering, vector clocks are used for presence events (cursor moves, user join/leave) where strict sequencing through the OT log would add unnecessary latency.
def happens_before(vc1: dict, vc2: dict) -> bool:
    """Returns True if vc1 causally precedes vc2."""
    keys = set(vc1) | set(vc2)  # compare over the union of entries
    return (all(vc1.get(k, 0) <= vc2.get(k, 0) for k in keys)
            and any(vc1.get(k, 0) < vc2.get(k, 0) for k in keys))

def are_concurrent(vc1: dict, vc2: dict) -> bool:
    return not happens_before(vc1, vc2) and not happens_before(vc2, vc1)

# Example: User A and B both move cursors after different ops
# A's clock: {A: 3, B: 1} → A applied 3 of A's ops, 1 of B's
# B's clock: {A: 1, B: 3} → B applied 1 of A's ops, 3 of B's
# are_concurrent({A:3,B:1}, {A:1,B:3}) → True: show both cursors independently
Mathematical Model: Convergence Guarantees and Why TP2 Is the Hard Problem
OT requires two formal properties to guarantee convergence across all topologies:
Property TP1 (Transformation Property 1), required for all OT systems: for any two concurrent operations O₁ and O₂ issued at the same document state S:
apply(apply(S, O₁), transform(O₂, O₁)) = apply(apply(S, O₂), transform(O₁, O₂))
In plain terms: it does not matter which concurrent op you apply first. As long as you transform the second one against the first, you always arrive at the same final state. TP1 is the invariant every production OT system must satisfy, and it is the oracle for fuzz-testing transform correctness.
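TP1 lends itself directly to property-based testing. The sketch below fuzzes an insert-only op model with an explicit, asymmetric tie-break for equal positions (that asymmetry is exactly what makes TP1 hold); a production harness would cover the full insert/delete matrix.

```python
import random

def transform(op, against, op_wins_ties):
    # Insert-only model: (position, text). The op being transformed shifts
    # right when the other insert lands strictly before it, or at the same
    # position when the other op wins the tie.
    pos, text = op
    apos, atext = against
    shifts = apos < pos or (apos == pos and not op_wins_ties)
    return (pos + len(atext), text) if shifts else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

def tp1_holds(doc, o1, o2):
    # Both application orders must converge; o1 is arbitrarily given
    # tie priority over o2 (as a server would, by arrival order).
    left = apply_op(apply_op(doc, o1), transform(o2, o1, op_wins_ties=False))
    right = apply_op(apply_op(doc, o2), transform(o1, o2, op_wins_ties=True))
    return left == right

random.seed(0)
for _ in range(10_000):
    doc = "ab" * random.randint(0, 4)
    o1 = (random.randint(0, len(doc)), random.choice(["x", "yy"]))
    o2 = (random.randint(0, len(doc)), random.choice(["z", "ww"]))
    assert tp1_holds(doc, o1, o2), (doc, o1, o2)
print("TP1 holds on 10,000 random concurrent pairs")
```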
Property TP2 (Transformation Property 2), required only for multi-master OT: if three concurrent ops O₁, O₂, O₃ exist, transforming O₃ against both O₁ and O₂ must produce the same result regardless of the order in which those transforms are performed.
TP2 is notoriously difficult to satisfy correctly. The earliest published OT systems (including Jupiter's academic predecessors) had subtle TP2 violations that caused occasional document corruption. Google's production approach, the Jupiter Protocol, sidesteps TP2 entirely by using a single central server. All clients communicate only with the server; no client ever transforms an op against another client's op directly. This reduces the convergence guarantee from TP1+TP2 to TP1 only, which is tractable to implement correctly and to verify via property-based tests.
Convergence trace for the "ABCDE" example:
Initial state: "ABCDE" (revision r₀)
Client A → INSERT("X", pos=3) at r₀ → sent to server
Client B → DELETE(pos=3, len=1) at r₀ → sent to server

Server receives INSERT first:
  Apply INSERT("X", 3) → "ABCXDE" (revision r₁)
  Broadcast INSERT("X", 3, seq=r₁) to all clients

Server receives DELETE(3, 1) based on r₀:
  concurrent_ops since r₀ = [INSERT("X", 3)]
  transform DELETE(3,1) against INSERT("X",3):
    insert.pos(3) ≤ delete.pos(3) → delete.pos += 1 → DELETE(4, 1)
  Apply DELETE(4, 1) → "ABCXE" (revision r₂)
  Broadcast DELETE(4, 1, seq=r₂) to all clients

Client A receives broadcast DELETE(4, 1):
  pending_ops is empty (INSERT already acked)
  Apply DELETE(4, 1) → local doc = "ABCXE" ✓

Client B receives broadcast INSERT("X", 3):
  pending_ops = [DELETE(3, 1)]
  Transform pending DELETE against incoming INSERT:
    same transform as server → DELETE(4, 1)
  Apply INSERT("X", 3) → "ABCXDE"
  Apply transformed DELETE(4, 1) → "ABCXE" ✓

All clients and the server converge to "ABCXE". TP1 satisfied. ✓
Performance Analysis: Where OT Gets Expensive at Scale
Transform cost per op: O(1) per operation pair. However, if a client is behind by k server revisions when it submits an op, the transform loop runs k iterations: O(k). A client that was disconnected for 10 minutes on a busy document (200 ops/min) returns with k = 2,000 ops to transform against. This is the client reconnect cost, not the steady-state cost.
Fan-out cost: A document with 50 concurrent editors generates ~100 ops/sec. Each op must be broadcast to the other 49 clients. At 10,000 hot documents fleet-wide: 49 million WebSocket pushes per second. This is the primary throughput bottleneck, not OT computation itself.
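The fan-out figure follows directly from the stated assumptions:

```python
editors_per_doc = 50
ops_per_editor_sec = 2        # active-typist rate from the capacity section
hot_docs = 10_000

ops_per_doc_sec = editors_per_doc * ops_per_editor_sec        # ~100 ops/sec
pushes_per_doc_sec = ops_per_doc_sec * (editors_per_doc - 1)  # 49 peers each
fleet_pushes_sec = pushes_per_doc_sec * hot_docs
print(fleet_pushes_sec)  # 49000000
```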
Snapshot cost vs. reconstruction cost:
| Strategy | Reconstruction time | Storage per document |
| --- | --- | --- |
| No snapshots | O(total ops, unbounded) | Minimal (ops only, ~200 B each) |
| Snapshot every 100 ops | O(100 op replays) | ~1 snapshot per 100 ops |
| Snapshot every 1,000 ops | O(1,000 op replays) | ~1 snapshot per 1,000 ops |
At 200 ops/min, snapshotting every 100 ops means one snapshot every 30 seconds, a reasonable trade-off between storage and reconstruction latency.
Scaling Collaborative Editing: Hot Documents, Session Affinity, and Version History
Hot Documents and the Fan-Out Bottleneck
A document with 50+ simultaneous editors cannot be served by a single WebSocket process without head-of-line blocking. The solution is document-partitioned routing: all WebSocket connections for a given document are pinned to the same WebSocket server shard using consistent hashing on doc_id.
# Consistent hashing: route all clients for document D to one WS shard
shard_id = consistent_hash(doc_id) % NUM_WS_SHARDS
ws_server = websocket_shard_ring[shard_id]
# The OT service broadcasts to the owning shard only
# The shard fans out to all local connections for doc D
For exceptionally hot documents (>100 concurrent editors), a single shard becomes a bottleneck. The mitigation is a fan-out relay tier: lightweight relay processes that subscribe to the OT service's pub/sub channel for a document and forward ops to their slice of connected clients. The relay tier scales horizontally per document without any OT logic.
WebSocket Session Management: Stateless by Design
The WebSocket server itself holds zero document state. All persistent state lives in:
- Redis: Current op buffer (last N ops per document), cursor positions, session metadata
- PostgreSQL: The authoritative ops_log and snapshots tables
Because WebSocket servers are stateless, a server restart triggers reconnections that are routed to any available shard. No state migration is needed. The client's reconnect payload ({doc_id, last_seen_revision}) contains everything needed to resume the session from any server.
Version History: Event Sourcing Where the Op Log Is the Record
Every edit is an immutable, append-only row in ops_log. Version history requires no separate versioning system; it is inherent in the log structure. Restoring to any past state is a deterministic replay:
def restore_to_revision(doc_id: str, target_revision: int) -> str:
    # 1. Find the latest snapshot at or before target_revision
    snapshot = get_latest_snapshot_before(doc_id, target_revision)
    # 2. Replay ops from snapshot.seq + 1 to target_revision
    ops = get_ops(doc_id, from_seq=snapshot.seq + 1, to_seq=target_revision)
    # 3. Apply each op to the snapshot content
    content = snapshot.content
    for op in ops:
        content = apply_op(content, op)
    return content
Snapshot trigger (bounding reconstruction latency):
SNAPSHOT_INTERVAL = 100  # take a snapshot every 100 committed ops

def on_op_committed(doc_id: str, new_seq: int, current_content: str):
    if new_seq % SNAPSHOT_INTERVAL == 0:
        store_snapshot(doc_id, seq=new_seq, content=current_content)
        # Optionally prune ops older than 2 snapshot generations
        prune_ops_before(doc_id, seq=new_seq - (2 * SNAPSHOT_INTERVAL))
This bounds reconstruction to at most 100 op replays from the nearest snapshot: a constant bound on replay work regardless of how long the document has existed.
Cursor Presence: Ephemeral State via Redis Pub/Sub
Cursor positions are ephemeral: high-frequency, non-critical, and naturally lossy. Routing them through the OT pipeline would add unnecessary latency and log bloat. Redis pub/sub with a short TTL is the right tool:
import redis, json, time

r = redis.Redis()

def broadcast_cursor(doc_id: str, user_id: str, position: int, color: str):
    payload = json.dumps({
        "user_id": user_id,
        "position": position,
        "color": color,
        "ts": time.time()
    })
    # Publish to the document's cursor channel for real-time broadcast
    r.publish(f"doc:{doc_id}:cursors", payload)
    # Cache latest cursor per user; new joiners can read this immediately
    r.setex(f"cursor:{doc_id}:{user_id}", 5, payload)  # 5-second TTL

def subscribe_cursors(doc_id: str, on_cursor_event):
    """Each WebSocket shard subscribes to cursor events for its active documents."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"doc:{doc_id}:cursors")
    for message in pubsub.listen():
        if message["type"] == "message":
            on_cursor_event(doc_id, json.loads(message["data"]))
The 5-second TTL on cached cursor positions means a newly connected client can immediately see all active cursors without waiting for the next cursor move event from each user.
End-to-End Architecture: From Keystroke to Broadcast
The following diagram traces the full collaborative editing data flow, from a user's keystroke to its propagation to all connected peers.
graph TD
C1[Client A - Browser] -->|WebSocket: op + revision| WSS[WebSocket Server Shard]
C2[Client B - Browser] -->|WebSocket: op + revision| WSS
C3[Client C - Browser] -->|WebSocket: cursor move| WSS
WSS -->|Forward op via gRPC| OTS[OT Service]
WSS -->|Publish cursor event| RPS[Redis Pub/Sub]
OTS -->|Transform op against concurrent ops| OTS
OTS -->|Write to ops_log| OL[(ops_log - PostgreSQL)]
OTS -->|Update op buffer| RC[Redis - Active Op Buffer]
OTS -->|Publish transformed op| RPS
RPS -->|Fan-out to subscribers| WSS
WSS -->|Push op to all doc clients| C1
WSS -->|Push op to all doc clients| C2
WSS -->|Push cursor event| C3
OTS -->|Every N ops: trigger snapshot| SS[Snapshot Service]
SS -->|Write full doc content| SNAP[(snapshots - PostgreSQL)]
C4[Client D - Reconnect] -->|GET /docs/reconstruct| RS[Read Service]
RS -->|Load nearest snapshot| SNAP
RS -->|Load ops since snapshot| OL
RS -->|Return catchup bundle| C4
The diagram separates the two primary data flows. The write path (left side, C1/C2/C3 → WSS → OTS → RPS → back to clients) is the hot path: every keystroke traverses this path in under 100 ms. The read/reconnect path (bottom right, Client D) is the cold path: it reconstructs document state from persistent storage and is bounded to O(100) op replays by the snapshot strategy.
Component Roles at a Glance
| Component | Responsibility | State |
| --- | --- | --- |
| WebSocket Server (Shard) | Connection management, routing, relay | Stateless; no document content |
| OT Service | Transform, serialize, and broadcast ops per document | Redis op buffer (last N ops) |
| ops_log (PostgreSQL) | Durable, ordered op history per document | Persistent; primary source of truth |
| Snapshot Service | Periodically captures full document state | PostgreSQL snapshots table |
| Redis Pub/Sub + Cache | Cursor positions, session registry, fast op buffer | Ephemeral; TTL-managed |
| Read Service | Document reconstruction for open and reconnect | Stateless; queries snapshots + ops_log |
Real-World Applications: How Google Docs and Notion Scale Collaborative Editing
Google Docs: The Jupiter Protocol at Scale
Google Docs uses an OT variant based on the Jupiter Protocol (Xerox PARC, 1995; evolved at Google into the Wave OT format). The architecture is deliberately centralized:
- A single OT server process per document serializes all operations; no peer-to-peer transforms
- Clients communicate only with the server, reducing the convergence requirement to TP1 only
- Operations are represented as sequences of retain(n), insert(text), and delete(n) instructions over a string with rich-text attributes, allowing a single OT type to handle plain text and formatting simultaneously
Scale observations: Google Docs processes billions of edits per day. Document servers are sharded by document ID, with hot documents promoted to higher-capacity shard hosts. The bottleneck at this scale is not OT computation (which is O(1) per transform) but fan-out: broadcasting a single op to 50 concurrent editors on 10,000 simultaneous hot documents drives tens of millions of WebSocket pushes per second.
Notion: Migrating from OT Toward CRDT Structures
Notion initially built on an OT backbone but has progressively migrated page blocks and nested structures toward CRDT-based data types. The motivation: CRDTs require no central transform server. Convergence is guaranteed by the data structure's mathematical properties: any two replicas, having received the same set of operations in any order, will converge to the same state.
A CRDT for text (e.g., Yjs or Automerge) assigns a globally unique identifier to every character insertion. When two users insert characters at the same position, the CRDT breaks the tie deterministically using character IDs; no transform logic is needed, and no central server is required.
OT vs. CRDT at a glance:
| Dimension | OT (Jupiter / ShareDB) | CRDT (Yjs / Automerge) |
| --- | --- | --- |
| Central server required | Yes; single serialization point | No; peer-to-peer is possible |
| Correctness model | Transform function must satisfy TP1 | Structural; convergence is algebraic |
| Memory overhead | Compact (ops are small deltas) | High; tombstones for deleted chars persist indefinitely |
| Offline-first | Difficult; server needed for transforms | Natural fit |
| Rich text maturity | Mature (Wave OT, json0) | Emerging (Yjs has good rich-text support) |
| Garbage collection | Not required | Required; coordination needed across peers |
Trade-offs and Failure Modes: OT vs. CRDT vs. Last-Writer-Wins
Why Last-Writer-Wins Is Wrong for Document Editing
LWW resolves conflicts using wall-clock timestamps. In distributed systems, clocks diverge (NTP drift of 100–500 ms is common). An edit from a node with a slightly faster clock always wins, silently discarding concurrent edits from nodes with slightly slower clocks. More fundamentally, LWW discards one of two concurrent edits; it has no mechanism to merge them. Collaborative editing requires merging, and LWW is structurally incapable of providing it.
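A few lines make the failure concrete; the timestamps and merge function are illustrative, reusing the "ABCDE" example from earlier.

```python
def lww_merge(a, b):
    # Last-writer-wins: the higher wall-clock timestamp simply replaces the other
    return a if a["ts"] >= b["ts"] else b

# Concurrent edits to "ABCDE": A inserted "X" at 3, B deleted "D".
# A's node clock runs 1 ms fast, so A "wins" and B's edit vanishes.
edit_a = {"content": "ABCXDE", "ts": 1000.002}
edit_b = {"content": "ABCE",   "ts": 1000.001}
merged = lww_merge(edit_a, edit_b)
print(merged["content"])  # ABCXDE -- the converged OT result "ABCXE" is unreachable
```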
Failure Modes in Production OT Systems
| Failure Mode | Root Cause | Mitigation |
| --- | --- | --- |
| Transform function bug | Edge case in overlapping DELETE vs. DELETE range handling | Property-based fuzz testing using TP1 as the invariant: for any randomly generated concurrent ops, both transform orderings must produce identical final states |
| Client-server revision drift | Client submits an op far behind current server revision after a long reconnect | Enforce a max client_revision_lag; force a full page reload if the gap exceeds the safe transform window |
| OT server bottleneck | Single-threaded OT loop can't keep up with 100+ concurrent editors | Partition hot documents; introduce a fan-out relay tier that scales horizontally without OT logic |
| Redis pub/sub message loss | Redis restart drops in-flight cursor and op broadcasts | Op durability is guaranteed by PostgreSQL write-first; clients detect missing ops via revision sequence gaps and trigger a catchup request |
| Split-brain during OT server failover | Two OT replicas briefly both accept ops for the same document | Leader election per document via etcd or ZooKeeper; at-most-one active OT writer enforced at the coordination layer |
CAP Trade-Off: Consistency Over Availability at the OT Layer
The collaborative editing system deliberately prioritizes Consistency over Availability at the OT Service:
- During an OT service failure, the active editing session is paused (clients buffer local ops and display a "reconnecting…" spinner) rather than allowing two OT processes to diverge
- This is a deliberate engineering choice: a split-brain OT scenario would require a CRDT-style merge of two diverged op histories post-failover, an extremely complex recovery path
- Users tolerate a few seconds of pause; silent document corruption is unacceptable
Decision Guide: Choosing Between OT, CRDT, and Optimistic Locking
| Situation | Recommendation |
| --- | --- |
| Use OT when | You have a reliable central server, need mature tooling (ShareDB, json0, Wave OT), and your document is primarily linear text or structured JSON with a known schema |
| Use CRDT when | You need offline-first support, peer-to-peer sync without a server, or your data maps cleanly to a CRDT type (sequences, sets, counters, maps) |
| Avoid LWW when | Multiple concurrent users can edit overlapping regions; LWW silently discards one user's valid edits with no merge semantics |
| Avoid multi-master OT when | You cannot rigorously verify TP2 correctness across your entire operation set; multi-master OT bugs are notoriously subtle and hard to reproduce |
| Use optimistic locking instead when | A document is rarely edited by more than one person at a time; version-based conflict detection ("document changed, please reload") is far simpler than OT and sufficient for low-contention cases |
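For the low-contention case, version-based conflict detection is only a few lines. This sketch uses an in-memory store with invented names (DocumentStore, ConflictError); with a SQL backend the same check is an UPDATE ... WHERE version = :expected that reports zero affected rows on conflict.

```python
class ConflictError(Exception):
    """Raised when the document changed since the client last read it."""

class DocumentStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, content)

    def read(self, doc_id):
        return self._docs.get(doc_id, (0, ""))

    def save(self, doc_id, content, expected_version):
        current, _ = self.read(doc_id)
        if current != expected_version:
            # Another writer got there first: "document changed, please reload"
            raise ConflictError(f"expected v{expected_version}, found v{current}")
        self._docs[doc_id] = (current + 1, content)
        return current + 1
```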
Data Model, Cache Strategy, and the Write Path in Detail
Data Model
-- Core document record
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
title TEXT NOT NULL,
owner_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
latest_seq INT DEFAULT 0 -- latest committed operation sequence number
);
-- Every edit ever made β the immutable event log
CREATE TABLE ops_log (
doc_id UUID NOT NULL,
seq INT NOT NULL, -- server-assigned; monotonically increasing per doc
op_type TEXT NOT NULL, -- 'insert' | 'delete' | 'retain'
position INT NOT NULL,
content TEXT, -- for inserts; NULL for deletes
length INT, -- for deletes; NULL for inserts
user_id UUID NOT NULL,
client_id TEXT NOT NULL, -- unique per browser tab/client instance
applied_at TIMESTAMPTZ DEFAULT NOW(),
vector_clock JSONB, -- logical clock snapshot at time of op
PRIMARY KEY (doc_id, seq)
);
-- Periodic full-content snapshots to bound reconstruction time
CREATE TABLE snapshots (
doc_id UUID NOT NULL,
seq INT NOT NULL, -- op sequence at which this snapshot was captured
content TEXT NOT NULL, -- full document content at this seq
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (doc_id, seq)
);
-- Document access control
CREATE TABLE doc_permissions (
doc_id UUID NOT NULL,
user_id UUID NOT NULL,
role TEXT NOT NULL, -- 'owner' | 'editor' | 'viewer'
granted_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (doc_id, user_id)
);
Cache Design
| Cache Key Pattern | Cached Value | TTL | Purpose |
| --- | --- | --- | --- |
| doc:{doc_id}:ops_buffer | Last 200 ops as sorted set (score = seq) | 10 min sliding | Fast catchup for recently reconnected clients without hitting PostgreSQL |
| doc:{doc_id}:latest_seq | Current server revision integer | Invalidated on each commit | OT service reads this atomically before accepting new ops |
| cursor:{doc_id}:{user_id} | {position, color, ts} JSON | 5 s | Stale cursor auto-eviction; initial presence snapshot for new joiners |
| session:{client_id} | {user_id, doc_id, last_ack_seq} | 30 min | WebSocket session metadata for reconnect routing |
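The ops_buffer row implies a two-tier catchup read: serve from the buffer when it still covers the client's gap, otherwise fall back to ops_log. The sketch below uses plain dicts as stand-ins for the Redis sorted set and the PostgreSQL table; the real system would issue a ZRANGEBYSCORE and a SQL range query.

```python
OPS_BUFFER_SIZE = 200  # last N ops kept hot per document

def catchup_ops(buffer: dict, ops_log: dict, from_seq: int, to_seq: int):
    """buffer and ops_log both map seq -> op; buffer holds only recent seqs."""
    wanted = range(from_seq, to_seq + 1)
    if all(seq in buffer for seq in wanted):
        return [buffer[seq] for seq in wanted], "redis"     # fast path
    return [ops_log[seq] for seq in wanted], "postgres"     # slow path
```

A client reconnecting at revision r requests from_seq=r+1 through the current revision; only clients behind by more than the buffer window touch the database.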
The Write Path: Full Flow from Keystroke to Convergence
A user types "e" at position 47 in a document currently at revision 314:
1. Client creates op {type: "insert", position: 47, content: "e", client_revision: 314}
2. Client applies it optimistically to local doc state (zero-latency local render)
3. Client sends the op over WebSocket to the assigned shard
4. WebSocket Server forwards the op to the OT Service via internal gRPC
5. OT Service atomically reads doc:{doc_id}:latest_seq from Redis: seq = 314, so the client is current and no transforms are needed
6. OT Service assigns seq = 315 and persists the op to ops_log (PostgreSQL's write-ahead log ensures durability)
7. OT Service updates doc:{doc_id}:latest_seq in Redis to 315
8. OT Service appends the op to the doc:{doc_id}:ops_buffer sorted set in Redis
9. OT Service publishes {seq: 315, op: ...} to the Redis pub/sub channel doc:{doc_id}:ops
10. WebSocket Server receives the pub/sub message and pushes {seq: 315, op: ...} to all 49 other connected clients for this document
11. All peer clients apply the incoming op, retransforming any pending ops in their local queues
Target round-trip: < 100 ms at p99 from step 1 (keypress) to step 11 (all peers updated).
The Read Path: Document Reconstruction on Open
When a user opens a document cold:
- Client sends GET /docs/{doc_id} with an auth token
- Read Service verifies permissions in doc_permissions
- Read Service fetches the latest snapshot from the snapshots table
- Read Service fetches all ops from ops_log where seq > snapshot.seq
- Read Service replays the ops against the snapshot content to reconstruct current state
- Read Service returns {content, current_seq} to the client
- Client opens a WebSocket connection and subscribes to ops starting at current_seq + 1
current_seq + 1
ShareDB: How It Implements Operational Transformation
ShareDB is the leading open-source OT framework for Node.js. It implements OT over JSON documents using the json0 operation type: a composable set of operations over JSON structures covering insert/delete on strings and arrays, set/unset on object keys, and numeric increment. ShareDB handles the full OT cycle: server-side transform and sequencing, the client-side pending op queue, retransform on reconnect, and pluggable persistence backends.
Minimal ShareDB Server Integration
// server.js: ShareDB OT server backed by PostgreSQL
const ShareDB = require('sharedb');
const PostgresDB = require('sharedb-postgres');
const WebSocket = require('ws');
const WebSocketJSONStream = require('@teamwork/websocket-json-stream');

const db = new PostgresDB({ connectionString: process.env.DATABASE_URL });
const backend = new ShareDB({ db });

// ShareDB handles all OT negotiation over the WebSocket stream
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
  const stream = new WebSocketJSONStream(ws);
  backend.listen(stream);
});
Minimal ShareDB Client Integration
// client.js β real-time collaborative text editing in the browser
const sharedb = require('sharedb/lib/client');
const ReconnectingWebSocket = require('reconnecting-websocket');
const socket = new ReconnectingWebSocket('ws://localhost:8080');
const connection = new sharedb.Connection(socket);
// Subscribe to the document
const doc = connection.get('documents', 'doc-abc-123');
doc.subscribe((err) => {
if (err) throw err;
// Submit an insert at string index 5 (json0 format: si = string insert)
doc.submitOp([{ p: ['body', 5], si: 'Hello' }], (err) => {
if (err) console.error('Submit failed:', err);
// op is immediately applied locally; server transforms it against
// any concurrent ops and broadcasts the result to all subscribers
});
});
// All concurrent remote ops arrive here, already transformed by ShareDB
doc.on('op', (op, source) => {
if (!source) {
// Remote op β apply to editor UI
console.log('Remote op applied:', op);
}
});
submitOp encapsulates the entire OT write cycle: optimistic local application, server submission with the current revision, server-side transform against concurrent ops, and broadcast. The application developer never writes transform logic directly; that is ShareDB's contract.
For a full deep-dive on ShareDB's json0 OT types, projections, and access control middleware, see the ShareDB README and the json0 specification. A companion post exploring CRDTs via Yjs is a planned follow-up in this series.
Lessons Learned: What Engineers Get Wrong About Collaborative Editing
Don't conflate operation ordering with operation merging. The most common interview pitfall is proposing wall-clock timestamps plus last-writer-wins. Timestamps tell you when operations were issued; they have no semantics for how to combine them. Two concurrent inserts at the same position both represent valid user intentions; neither is wrong, and discarding one is a data loss bug, not conflict resolution.
The transform function is load-bearing code. Every bug in the transform matrix is a potential silent document corruption vector. Production OT implementations use property-based fuzz testing: generate thousands of random concurrent op pairs, and for each pair verify that both transform orderings (transform(O₁, O₂) and transform(O₂, O₁)) produce identical final states. TP1 is the fuzz oracle.
Snapshots are not optional. A document with 50,000 ops that takes 30 seconds to reconstruct will cause visible loading delays on every open. Snapshot every 50–200 ops depending on average op size. Treating snapshots as an "optimization" rather than a correctness requirement for latency SLAs leads to silent degradation as documents age.
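The fuzz-testing idea can be demonstrated with a toy insert-only transform. This is a sketch, not a production transform matrix: ops are {pos, text, site}, where the hypothetical site field breaks ties at equal positions so that both transform orderings converge (the TP1 property).

```javascript
// Apply a toy insert op to a string
function apply(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Transform op `a` so it applies correctly after concurrent op `b`:
// shift right if b landed before a, or at the same spot with priority
function transform(a, b) {
  if (b.pos < a.pos || (b.pos === a.pos && b.site < a.site)) {
    return { ...a, pos: a.pos + b.text.length };
  }
  return a;
}

// Property-based TP1 check: for random concurrent pairs (a, b),
// apply(apply(d, a), transform(b, a)) must equal
// apply(apply(d, b), transform(a, b)).
function checkTP1(trials = 10000) {
  for (let i = 0; i < trials; i++) {
    const doc = 'abcdefgh'.slice(0, 1 + Math.floor(Math.random() * 8));
    const a = { pos: Math.floor(Math.random() * (doc.length + 1)), text: 'X', site: 1 };
    const b = { pos: Math.floor(Math.random() * (doc.length + 1)), text: 'YY', site: 2 };
    const left = apply(apply(doc, a), transform(b, a));
    const right = apply(apply(doc, b), transform(a, b));
    if (left !== right) return { doc, a, b, left, right }; // counterexample
  }
  return null; // TP1 held for every trial
}

console.log(checkTP1() === null ? 'TP1 held' : 'TP1 violated');
```

A real fuzz suite would also cover deletes, multi-character ranges, and composed ops, but the oracle is the same equality check.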
WebSocket servers must be stateless. If the OT transform state lives inside a WebSocket server process, you've coupled document consistency to that process's availability. When the process restarts, the document's editing session is corrupted. All OT state (op buffer, current revision, transform context) belongs in Redis and PostgreSQL. WebSocket servers are routing infrastructure, not business logic.
Cursor positions are not edits. A common over-engineering mistake is routing cursor events through the OT pipeline for causal ordering. Cursor events are ephemeral (a stale cursor position is invisible to users after 5 seconds), high-frequency (every mouse move), and naturally lossy. Put them in Redis pub/sub with a 5-second TTL. Ordering cursor events through the durable op log wastes compute, adds latency, and pollutes version history.
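The cursor-presence pattern is simple enough to sketch. A real deployment would use Redis (SET with a 5-second EX expiry, plus pub/sub fan-out to WebSocket shards); this in-memory stand-in is only meant to illustrate the lossy TTL semantics, and all names here are hypothetical.

```javascript
// In-memory model of ephemeral cursor presence with TTL expiry.
// Stand-in for Redis SET key value EX 5; not a real Redis client.
class CursorPresence {
  constructor(ttlMs = 5000) {
    this.ttlMs = ttlMs;
    this.cursors = new Map(); // userId -> { pos, expiresAt }
  }

  // Called on every cursor move; overwrites the entry and refreshes the TTL
  update(userId, pos, now = Date.now()) {
    this.cursors.set(userId, { pos, expiresAt: now + this.ttlMs });
  }

  // Return only live cursors; stale entries simply vanish (lossy by design)
  live(now = Date.now()) {
    const out = {};
    for (const [userId, c] of this.cursors) {
      if (c.expiresAt > now) out[userId] = c.pos;
    }
    return out;
  }
}

const presence = new CursorPresence();
presence.update('alice', 42, 0);    // alice's cursor at t=0
presence.update('bob', 7, 4000);    // bob's cursor at t=4000
console.log(presence.live(6000));   // { bob: 7 }: alice's entry has expired
```

Note what is absent: no sequencing, no durability, no transform. Losing a cursor update is harmless because the next mouse move replaces it.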
CRDT memory overhead compounds over time. CRDTs for text maintain tombstones for every deleted character; these can never be garbage-collected without coordination across all live replicas. A 1 KB document that has undergone thousands of edits can balloon to 100 KB of CRDT state. Plan for compaction. Yjs has a garbage collection mechanism, but it requires all peers to have acknowledged receipt of the relevant ops before pruning, a coordination requirement that reintroduces complexity in peer-to-peer deployments.
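The tombstone effect is easy to see in a toy model. This is not a real CRDT (no site IDs, no merge function, unlike Yjs or Automerge); it only shows how delete-as-tombstone makes replica state grow monotonically even as the visible document shrinks.

```javascript
// Toy tombstoned text: deletes mark characters dead instead of removing them
class TombstoneText {
  constructor() {
    this.chars = []; // { ch, deleted } — one entry per insert, kept forever
  }
  insert(visibleIndex, ch) {
    // Map the visible index to a slot in the full (tombstoned) array
    let i = 0, seen = 0;
    while (i < this.chars.length && seen < visibleIndex) {
      if (!this.chars[i].deleted) seen++;
      i++;
    }
    this.chars.splice(i, 0, { ch, deleted: false });
  }
  delete(visibleIndex) {
    let seen = -1;
    for (const c of this.chars) {
      if (!c.deleted && ++seen === visibleIndex) { c.deleted = true; return; }
    }
  }
  text() { return this.chars.filter((c) => !c.deleted).map((c) => c.ch).join(''); }
  stateSize() { return this.chars.length; } // includes tombstones
}

const doc = new TombstoneText();
for (const ch of 'hello') doc.insert(doc.text().length, ch);
for (let i = 0; i < 5; i++) doc.delete(0); // delete every visible character
console.log(doc.text());      // "": the document looks empty
console.log(doc.stateSize()); // 5: but five tombstones remain in memory
```

Every edit cycle leaves residue, which is why compaction (and the peer-acknowledgement coordination it requires) has to be part of the capacity plan.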
TLDR Summary & Key Takeaways: The Eight Design Decisions That Define This System
- OT over last-writer-wins: Collaborative editing requires merging concurrent operations, not discarding one. OT's transform function reconciles edits while preserving both users' intentions; LWW has no such merge semantics.
- Central arbiter (Jupiter Protocol): A single OT server per document serializes operations, reducing the convergence requirement to TP1 only. This is the only OT architecture that has been proven correct and deployed at billion-user scale.
- WebSocket for real-time delivery: HTTP polling introduces 200–1,000 ms of additional round-trip latency per operation. A collaborative editor operating on polling is perceptibly broken. Persistent WebSocket channels are a hard architectural requirement, not a convenience.
- Stateless WebSocket servers: All document state lives in Redis and PostgreSQL. WebSocket servers are replaceable routing infrastructure. No document consistency is coupled to the lifecycle of any single process.
- Event sourcing for the op log: Every edit is an immutable, append-only row in ops_log. Version history, point-in-time restore, and audit trails come for free from this single design choice.
- Snapshots every N ops: Without snapshots, document reconstruction is O(total ops ever). With snapshots every 100 ops, reconstruction is bounded to O(100), a latency guarantee that holds regardless of document age or edit volume.
- Redis for ephemeral state: Cursor positions, session metadata, and the recent op buffer all live in Redis with TTLs. Redis pub/sub fans out broadcasts to WebSocket shards; Redis delivers the sub-millisecond read latency that PostgreSQL cannot meet for hot-path operations.
- CRDT as the alternative path: CRDTs (Yjs, Automerge) eliminate the central arbiter requirement and enable offline-first and peer-to-peer topologies, at the cost of memory overhead from tombstones and added coordination for garbage collection. Choose CRDTs when you need offline-first; choose OT (ShareDB/Jupiter) when you have a reliable central server and need battle-tested tooling.
Practice Quiz: Test Your Collaborative Editing Design Knowledge
User A inserts "X" at position 3. Concurrently, User B also inserts "Y" at position 3. Both were issued against revision r. The server applies A's op first. After OT transformation, what position does B's insert target?
- A) Position 3: unchanged, because both ops are independent inserts
- B) Position 4: shifted right by 1 because A's insert at position 3 displaced B's target
- C) Position 2: shifted left to avoid overwriting A's character
- D) The operation is rejected with a conflict error and B must re-submit
Correct Answer: B
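The answer can be checked mechanically with a toy insert-vs-insert transform. This is a sketch under one common tie-breaking convention (the already-applied op wins equal positions); real implementations vary in how they break the tie.

```javascript
// Transform `op` against a concurrent insert that the server already applied.
// An applied insert at or before op.pos shifts op right by its length.
function transformInsert(op, applied) {
  return applied.pos <= op.pos
    ? { ...op, pos: op.pos + applied.text.length }
    : op;
}

const opA = { pos: 3, text: 'X' }; // applied first by the server
const opB = { pos: 3, text: 'Y' }; // concurrent, issued against revision r
console.log(transformInsert(opB, opA).pos); // 4: shifted right by 1
```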
Why does this collaborative editing system use WebSocket persistent connections rather than HTTP short-polling or HTTP long-polling for operation delivery?
- A) WebSockets are the only protocol that guarantees exactly-once delivery semantics
- B) HTTP polling adds a minimum of one full HTTP round-trip per operation (200–1,000 ms), making the editor perceptibly laggy; WebSockets maintain a persistent bidirectional channel that delivers ops in < 10 ms
- C) WebSockets automatically perform OT transform logic; HTTP does not
- D) HTTP connections cannot carry user identity across requests without session cookies
Correct Answer: B
A heavily edited document has 80,000 operations in ops_log and no snapshots have ever been taken. What is the worst-case time complexity to reconstruct its current content for a user opening it now?
- A) O(1): the document state is always cached in Redis
- B) O(log n): the ops log is indexed with a B-tree and can be binary searched
- C) O(n): every one of the 80,000 operations must be replayed sequentially from the beginning
- D) O(n²): each operation must be transformed against all previous operations before application
Correct Answer: C
During an OT Service outage, 30 users are actively editing document D. The load balancer detects the failure and promotes a warm standby OT Service replica that has been independently receiving op writes. What critical problem does this introduce?
- A) The standby replica will reject all ops because it lacks the primary's session tokens
- B) Two independent OT processes will have each committed ops during the overlap window, creating a split-brain: both replicas hold a different canonical revision sequence for the same document, requiring a CRDT-style merge to reconcile, a recovery path of extreme complexity
- C) WebSocket clients will automatically negotiate a tie-breaking protocol using their local vector clocks
- D) No problem: the standby's ops log is a perfect superset of the primary's, so state is already consistent
Correct Answer: B
Open-ended: Your team proposes replacing the OT Service with a Yjs CRDT engine to eliminate the single-point-of-failure concern. What specific operational trade-offs would you evaluate before approving this change? Address at minimum: memory growth from tombstones on heavily edited documents, the coordination requirements introduced by Yjs garbage collection, the maturity of CRDT support for rich text attributes (bold, inline comments, nested lists), and any new monitoring or debugging complexity introduced by a decentralized topology.
Correct Answer: (Open-ended; no single correct answer. Strong answers address: (1) tombstone accumulation: a 1 KB doc with thousands of edits can balloon to 100 KB of CRDT state, so plan for GC coordination; (2) Yjs GC requires all live peers to acknowledge receipt of relevant ops before pruning, which reintroduces a form of coordination overhead; (3) Yjs has solid rich-text support via y-prosemirror / y-quill but is less battle-tested than json0/Wave OT for large-scale production deployments; (4) decentralized topology changes debugging: with OT, the server op log is the single forensic record; with peer-to-peer CRDTs, causality reconstruction requires collecting vector clock state across all peers.)