System Design HLD Example: Collaborative Document Editing (Google Docs)
A practical interview-ready HLD for real-time collaborative editing with conflict resolution and sync.
TL;DR: Real-time collaborative document editing uses Operational Transformation (OT) or CRDTs to ensure all clients converge to the same document state regardless of concurrent edit ordering. The key insight: every operation must be transformed against concurrent unacknowledged operations before application, not simply sequenced by timestamp. This HLD covers the full system: OT algorithm and convergence proofs, WebSocket fan-out, cursor presence via Redis pub/sub, version history via event sourcing with snapshots, and the bottlenecks that emerge at Google Docs scale.
The Concurrent Edit Problem: Why Collaborative Documents Are a Hard Distributed Systems Challenge
Two users are editing the same document at 2am. User A inserts "Hello" starting at position 5. Simultaneously, User B deletes the character at position 5. Both operations are issued concurrently; neither has seen the other's edit yet. If the server naively applies them in arrival order, A's insert lands first, shifting every subsequent character right by 5 positions. User B's delete now targets position 5, which is the first character of A's newly inserted "Hello", silently corrupting the document. Last-writer-wins fails entirely here. There is no "last" writer; there are two concurrent writers whose intentions must be merged, not arbitrated.
This is the core engineering challenge of collaborative editing: how do you reconcile concurrent operations that were each valid against a shared-but-diverged document state, producing a result that is correct, convergent, and non-destructive to either user's intent?
Use Cases and System Scope
| Actor | Core Journeys |
| --- | --- |
| Document author | Create document, set sharing permissions, transfer ownership |
| Co-editor | Type, delete, format text in real-time alongside other users |
| Viewer | Read-only access; see live edits as they happen |
| Presence service | Track who is in the document and where each cursor is |
| Version history | Reconstruct any past state of the document from the edit log |
In scope for this HLD:
- Create, read, update, and delete documents (full CRUD)
- Real-time multi-user co-editing with cursor presence (who is editing where)
- Sharing and permission management: owner, editor, viewer roles
- Version history and point-in-time document restore
- Reconnect sync: replay missed operations when a client returns from a network drop
Out of scope (v1 boundary):
- Spell check, grammar suggestions, and AI writing assistance
- Full offline-first editing with prolonged-disconnect merge (CRDT extension)
- End-to-end encryption of document content
- Voice and video calling (separate signalling layer)
Design Goals
| Goal | Specification |
| --- | --- |
| Convergence | All replicas must converge to identical document state within 200 ms of a network partition healing |
| Edit latency | Operation round-trip (client → server → broadcast to peers) under 100 ms at p99 |
| Durability | No committed operation may be lost; the ops log is the source of truth |
| Availability | Active editing sessions survive WebSocket server restarts via stateless reconnect |
| Throughput | Support 5 M concurrent active editing sessions; up to 50 concurrent editors per document |
Capacity Estimation
Google Docs serves approximately 1 billion registered users with an estimated 50 million daily active editors. At peak:
- Concurrent editing sessions: ~5 million (10% of DAU actively typing)
- Operations per second (write): An active typist generates ~2 ops/sec. At 5 M sessions: ~10 million ops/sec fleet-wide
- Storage for the ops log: Each operation averages ~200 bytes (type, position, content, metadata). At 10 M ops/sec: ~2 GB/sec of raw op-log writes, so snapshots are critical to prevent unbounded growth and unbounded reconstruction time
- WebSocket connections: 5 M persistent connections; at ~50 KB RAM per connection: ~250 GB RAM across WebSocket servers, so horizontal sharding by document ID is required
- Document reads (open + reconnect): ~100 M document opens per day, or ~1,200 document reconstructions/sec from snapshot + op replay
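These estimates can be sanity-checked with quick arithmetic; the inputs below are the stated assumptions, not measured values.

```python
# Back-of-the-envelope check of the capacity estimates above.
sessions = 5_000_000       # concurrent editing sessions at peak
ops_per_session = 2        # an active typist generates ~2 ops/sec
op_size = 200              # average serialized op size in bytes
conn_ram_kb = 50           # RAM per WebSocket connection

ops_per_sec = sessions * ops_per_session            # fleet-wide write rate
log_rate_gb = ops_per_sec * op_size / 1e9           # raw op-log write rate
ws_ram_gb = sessions * conn_ram_kb * 1_000 / 1e9    # total WebSocket RAM

print(ops_per_sec, log_rate_gb, ws_ram_gb)  # 10000000 2.0 250.0
```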
Operational Transformation: The Algorithm That Prevents Silent Data Loss
Operational Transformation (OT) is the mechanism that reconciles concurrent edits. The key insight: when two operations are issued concurrently against the same document version, one must be transformed to account for the positional effects of the other before it can be safely applied.
The Core Conflict: A Concrete Example
Consider the document "ABCDE" (character positions 0–4).
- User A issues INSERT("X", position=3); intended result: "ABCXDE"
- User B simultaneously issues DELETE(position=3, length=1); intended result: "ABCE"
Both operations are stamped against document version v1. The server receives A's op first, applies it, and the document becomes "ABCXDE". Now B's op arrives: DELETE(position=3, length=1). Applied naively, it deletes "X", not "D" as B intended. B wanted to delete "D", which now sits at position 4 after A's insert shifted everything right.
OT transforms B's delete against A's insert before applying it:
def transform_delete_against_insert(delete_op, insert_op):
    """
    insert_op inserted len(insert_op.text) characters at insert_op.position.
    If the insert happened at or before the delete target, the delete
    target has shifted right by the length of the inserted text.
    """
    if insert_op.position <= delete_op.position:
        # Insert is at or before our delete target: shift right
        delete_op.position += len(insert_op.text)
    # else: insert is after our delete target, so the delete position is unchanged
    return delete_op
Applying this: insert_op.position (3) <= delete_op.position (3), so delete_op.position becomes 3 + 1 = 4. The transformed delete removes the character at position 4 ("D"), producing "ABCXE". Both users' intentions are preserved and the document converges.
The Full Transform Matrix
| Op₁ (applied first) | Op₂ (to transform against Op₁) | Transform rule |
| --- | --- | --- |
| INSERT(p₁, len) | INSERT(p₂) | If p₁ ≤ p₂: p₂ += len; else unchanged |
| INSERT(p₁, len) | DELETE(p₂, l₂) | If p₁ ≤ p₂: p₂ += len; if p₁ is inside the deleted range: split the delete around the insert |
| DELETE(p₁, l₁) | INSERT(p₂) | If p₁ < p₂: p₂ -= l₁; if p₂ is inside the deleted range: p₂ = p₁ |
| DELETE(p₁, l₁) | DELETE(p₂, l₂) | Overlap must be resolved: chars deleted by both ops are counted exactly once; non-overlapping ranges shift independently |
The four-case matrix is deceptively compact; production OT implementations for rich text (attributes, embedded objects, list formatting) handle many more edge cases through composite operation types.
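A minimal executable sketch of the matrix follows, restricted to point inserts and single-character deletes. The range-delete split case is omitted, and tie-breaking between equal-position inserts (which real systems resolve asymmetrically, e.g. by client ID) is not modeled; the `Insert`/`Delete` classes are invented for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Insert:
    position: int
    text: str

@dataclass(frozen=True)
class Delete:
    position: int  # deletes one character; real ops carry a length

def transform(op, against):
    """Rewrite `op` so it applies correctly after `against` has been applied."""
    if isinstance(against, Insert):
        # Matrix rows 1-2: a prior insert shifts later positions right
        if against.position <= op.position:
            return replace(op, position=op.position + len(against.text))
    else:
        # Matrix rows 3-4: a prior single-char delete shifts later positions left
        if against.position < op.position:
            return replace(op, position=op.position - 1)
        if against.position == op.position and isinstance(op, Delete):
            return None  # both ops deleted the same character: count it once
    return op

def apply_op(doc: str, op):
    if op is None:
        return doc
    if isinstance(op, Insert):
        return doc[:op.position] + op.text + doc[op.position:]
    return doc[:op.position] + doc[op.position + 1:]

# The worked "ABCDE" example: B's delete is transformed past A's insert
a, b = Insert(3, "X"), Delete(3)
doc = apply_op("ABCDE", a)            # "ABCXDE"
doc = apply_op(doc, transform(b, a))  # delete shifts to position 4
print(doc)  # ABCXE
```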
The OT Server as the Central Arbiter
The server is the single serialization point for every document. Every client operation includes a revision number: the document version the operation was based on. If a client is behind (its operation is based on revision r but the server is now at r+3), the server transforms the incoming op against ops r+1, r+2, and r+3 in sequence before applying it.
def server_receive_op(doc_id: str, client_op, client_revision: int):
    server_revision = get_current_revision(doc_id)
    # Fetch all ops the client hasn't yet seen
    concurrent_ops = get_ops(doc_id, from_seq=client_revision + 1,
                             to_seq=server_revision)
    # Transform client's op against each concurrent server op in order
    transformed_op = client_op
    for server_op in concurrent_ops:
        transformed_op = transform(transformed_op, server_op)
    # Assign next revision, persist, broadcast to all subscribers
    new_revision = server_revision + 1
    persist_op(doc_id, transformed_op, seq=new_revision)
    update_revision(doc_id, new_revision)
    broadcast_to_subscribers(doc_id, transformed_op, seq=new_revision)
    return new_revision
This model is known as the Jupiter Protocol: a client-server OT architecture in which all transforms happen on the server, no peer-to-peer transform is ever required, and the server never needs to understand document semantics beyond the transform function.
Deep Dive: OT Internals, Convergence Proofs, and Computation Cost at Scale
Internals: How the Client and Server OT Engines Stay Synchronized
The client-side OT engine maintains three key data structures:
- pending_ops: Queue of ops sent to the server but not yet acknowledged
- client_revision: Last revision the client confirmed from the server
- local_doc_state: Optimistically applied document (pending ops included)
The client applies every typed operation optimistically to local_doc_state for zero-latency local rendering. It simultaneously adds the op to pending_ops and sends it to the server. When the server broadcasts an op from another user, the client must:
- Transform every op in pending_ops against the incoming op (so the pending queue stays valid against the growing server sequence)
- Apply the (possibly transformed) incoming op to local_doc_state
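These pieces can be wired into a small state machine. This is an illustrative sketch, not ShareDB's actual client: the insert-only op model (a `(position, text)` tuple) and the method names `local_edit`/`remote_op`/`ack` are invented for the example, and tie-breaking between equal-position inserts is not modeled.

```python
def transform(op, against):
    # Insert-only model: a concurrent insert at or before our position shifts us right
    pos, text = op
    apos, atext = against
    return (pos + len(atext), text) if apos <= pos else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

class ClientOTEngine:
    def __init__(self, doc: str, revision: int):
        self.local_doc_state = doc
        self.client_revision = revision
        self.pending_ops = []  # sent to server, not yet acknowledged

    def local_edit(self, op):
        # Optimistic apply for zero-latency rendering, then queue for the server
        self.local_doc_state = apply_op(self.local_doc_state, op)
        self.pending_ops.append(op)
        return op  # caller ships this over the WebSocket

    def remote_op(self, op, seq: int):
        # 1. Keep the pending queue valid against the growing server sequence;
        #    the incoming op is symmetrically shifted past each pending op
        rebased = []
        for pending in self.pending_ops:
            rebased.append(transform(pending, op))
            op = transform(op, pending)
        self.pending_ops = rebased
        # 2. Apply the (possibly transformed) incoming op locally
        self.local_doc_state = apply_op(self.local_doc_state, op)
        self.client_revision = seq

    def ack(self, seq: int):
        # Server committed our oldest pending op at revision `seq`
        self.pending_ops.pop(0)
        self.client_revision = seq
```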
The server-side OT engine keeps this state per document:
- ops_log: Ordered log of all committed ops; each has a seq number
- current_revision: Monotonically incrementing sequence counter
- active_sessions: Map of connected client_id → last_ack_revision
The server holds no rendered document content in memory; it stores only the op log and the current revision number. Document content is reconstructed on demand from the nearest snapshot plus subsequent ops.
Reconnect sync protocol:
When a client reconnects after a network interruption:
- Client sends {doc_id, last_seen_revision: r}
- Server fetches the nearest snapshot at or before revision r, plus all ops from the snapshot to the present
- Server sends the catchup bundle: {snapshot_content, ops_since_snapshot, current_revision}
- Client replays the ops against the snapshot content to reconstruct current state
- Client resubmits its locally pending (unacknowledged) ops with an updated client_revision
- Server re-transforms those pending ops against everything it received while the client was disconnected
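The client side of this protocol might look like the following sketch, using the same simplified insert-only op model (a `(position, text)` tuple). The function name and bundle shape mirror the steps above but are otherwise invented; here `ops_since_snapshot` is treated as the ops the client has not yet applied.

```python
def transform(op, against):
    # Insert-only model: a concurrent insert at or before our position shifts us right
    pos, text = op
    apos, atext = against
    return (pos + len(atext), text) if apos <= pos else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

def handle_catchup(bundle, pending_ops):
    """bundle = {snapshot_content, ops_since_snapshot, current_revision}."""
    # Replay the missed ops against the snapshot to rebuild current server state
    doc = bundle["snapshot_content"]
    for op in bundle["ops_since_snapshot"]:
        doc = apply_op(doc, op)
    # Re-base locally pending ops past everything missed while offline;
    # they are then resubmitted stamped with the new client_revision
    rebased = []
    for pending in pending_ops:
        for missed in bundle["ops_since_snapshot"]:
            pending = transform(pending, missed)
        rebased.append(pending)
    return doc, rebased, bundle["current_revision"]
```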
Vector clocks for cursor causality (presence events, not OT sequencing):
While OT uses monotonic revision numbers for operation ordering, vector clocks are used for presence events (cursor moves, user join/leave) where strict sequencing through the OT log would add unnecessary latency.
def happens_before(vc1: dict, vc2: dict) -> bool:
    """Returns True if vc1 causally precedes vc2."""
    keys = set(vc1) | set(vc2)  # compare over the union of entries
    return (all(vc1.get(k, 0) <= vc2.get(k, 0) for k in keys)
            and any(vc1.get(k, 0) < vc2.get(k, 0) for k in keys))

def are_concurrent(vc1: dict, vc2: dict) -> bool:
    return not happens_before(vc1, vc2) and not happens_before(vc2, vc1)

# Example: User A and B both move cursors after different ops
# A's clock: {A: 3, B: 1} → A applied 3 of A's ops, 1 of B's
# B's clock: {A: 1, B: 3} → B applied 1 of A's ops, 3 of B's
# are_concurrent({A:3,B:1}, {A:1,B:3}) → True: show both cursors independently
Mathematical Model: Convergence Guarantees and Why TP2 Is the Hard Problem
OT requires two formal properties to guarantee convergence across all topologies:
Property TP1 (Transformation Property 1), required for all OT systems: for any two concurrent operations O₁ and O₂ issued at the same document state S:
apply(apply(S, O₁), transform(O₂, O₁)) = apply(apply(S, O₂), transform(O₁, O₂))
In plain terms: it does not matter which concurrent op you apply first. As long as you transform the second one against the first, you always arrive at the same final state. TP1 is the invariant every production OT system must satisfy, and it is the oracle for fuzz-testing transform correctness.
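TP1 lends itself directly to property-based testing. The sketch below fuzzes an insert-only op model with an explicit, asymmetric tie-break for equal positions (that asymmetry is exactly what makes TP1 hold); a production harness would cover the full insert/delete matrix.

```python
import random

def transform(op, against, op_wins_ties):
    # Insert-only model: (position, text). The op being transformed shifts
    # right when the other insert lands strictly before it, or at the same
    # position when the other op wins the tie.
    pos, text = op
    apos, atext = against
    shifts = apos < pos or (apos == pos and not op_wins_ties)
    return (pos + len(atext), text) if shifts else op

def apply_op(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

def tp1_holds(doc, o1, o2):
    # Both application orders must converge; o1 is arbitrarily given
    # tie priority over o2 (as a server would, by arrival order).
    left = apply_op(apply_op(doc, o1), transform(o2, o1, op_wins_ties=False))
    right = apply_op(apply_op(doc, o2), transform(o1, o2, op_wins_ties=True))
    return left == right

random.seed(0)
for _ in range(10_000):
    doc = "ab" * random.randint(0, 4)
    o1 = (random.randint(0, len(doc)), random.choice(["x", "yy"]))
    o2 = (random.randint(0, len(doc)), random.choice(["z", "ww"]))
    assert tp1_holds(doc, o1, o2), (doc, o1, o2)
print("TP1 holds on 10,000 random concurrent pairs")
```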
Property TP2 (Transformation Property 2), required only for multi-master OT: if three concurrent ops O₁, O₂, O₃ exist, transforming O₃ against both O₁ and O₂ must produce the same result regardless of the order in which those transforms are performed.
TP2 is notoriously difficult to satisfy correctly. The earliest published OT systems (including Jupiter's academic predecessors) had subtle TP2 violations that caused occasional document corruption. Google's production approach, the Jupiter Protocol, sidesteps TP2 entirely by using a single central server. All clients communicate only with the server; no client ever transforms an op against another client's op directly. This reduces the convergence guarantee from TP1+TP2 to TP1 only, which is tractable to implement correctly and to verify via property-based tests.
Convergence trace for the "ABCDE" example:
Initial state: "ABCDE" (revision r₀)
Client A → INSERT("X", pos=3) at r₀ → sent to server
Client B → DELETE(pos=3, len=1) at r₀ → sent to server

Server receives INSERT first:
  Apply INSERT("X", 3) → "ABCXDE" (revision r₁)
  Broadcast INSERT("X", 3, seq=r₁) to all clients

Server receives DELETE(3, 1) based on r₀:
  concurrent_ops since r₀ = [INSERT("X", 3)]
  transform DELETE(3,1) against INSERT("X",3):
    insert.pos(3) ≤ delete.pos(3) → delete.pos += 1 → DELETE(4, 1)
  Apply DELETE(4, 1) → "ABCXE" (revision r₂)
  Broadcast DELETE(4, 1, seq=r₂) to all clients

Client A receives broadcast DELETE(4, 1):
  pending_ops is empty (INSERT already acked)
  Apply DELETE(4, 1) → local doc = "ABCXE" ✓

Client B receives broadcast INSERT("X", 3):
  pending_ops = [DELETE(3, 1)]
  Transform pending DELETE against incoming INSERT:
    same transform as server → DELETE(4, 1)
  Apply INSERT("X", 3) → "ABCXDE"
  Apply transformed DELETE(4, 1) → "ABCXE" ✓

All clients and the server converge to "ABCXE". TP1 satisfied. ✓
Performance Analysis: Where OT Gets Expensive at Scale
Transform cost per op: O(1) per operation pair. However, if a client is behind by k server revisions when it submits an op, the transform loop runs k iterations: O(k). A client that was disconnected for 10 minutes on a busy document (200 ops/min) returns with k = 2,000 ops to transform against. This is the client reconnect cost, not the steady-state cost.
Fan-out cost: A document with 50 concurrent editors generates ~100 ops/sec. Each op must be broadcast to the other 49 clients. At 10,000 hot documents fleet-wide: 49 million WebSocket pushes per second. This is the primary throughput bottleneck, not OT computation itself.
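The fan-out figure follows directly from the stated assumptions:

```python
editors_per_doc = 50
ops_per_editor_sec = 2        # active-typist rate from the capacity section
hot_docs = 10_000

ops_per_doc_sec = editors_per_doc * ops_per_editor_sec        # ~100 ops/sec
pushes_per_doc_sec = ops_per_doc_sec * (editors_per_doc - 1)  # 49 peers each
fleet_pushes_sec = pushes_per_doc_sec * hot_docs
print(fleet_pushes_sec)  # 49000000
```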
Snapshot cost vs. reconstruction cost:
| Strategy | Reconstruction time | Storage per document |
| --- | --- | --- |
| No snapshots | O(total ops, unbounded) | Minimal (ops only, ~200 B each) |
| Snapshot every 100 ops | O(100 op replays) | ~1 snapshot per 100 ops |
| Snapshot every 1,000 ops | O(1,000 op replays) | ~1 snapshot per 1,000 ops |
At 200 ops/min, snapshotting every 100 ops means one snapshot every 30 seconds, a reasonable trade-off between storage and reconstruction latency.
Scaling Collaborative Editing: Hot Documents, Session Affinity, and Version History
Hot Documents and the Fan-Out Bottleneck
A document with 50+ simultaneous editors cannot be served by a single WebSocket process without head-of-line blocking. The solution is document-partitioned routing: all WebSocket connections for a given document are pinned to the same WebSocket server shard using consistent hashing on doc_id.
# Consistent hashing: route all clients for document D to one WS shard
shard_id = consistent_hash(doc_id) % NUM_WS_SHARDS
ws_server = websocket_shard_ring[shard_id]
# The OT service broadcasts to the owning shard only
# The shard fans out to all local connections for doc D
For exceptionally hot documents (>100 concurrent editors), a single shard becomes a bottleneck. The mitigation is a fan-out relay tier: lightweight relay processes that subscribe to the OT service's pub/sub channel for a document and forward ops to their slice of connected clients. The relay tier scales horizontally per document without any OT logic.
WebSocket Session Management: Stateless by Design
The WebSocket server itself holds zero document state. All persistent state lives in:
- Redis: Current op buffer (last N ops per document), cursor positions, session metadata
- PostgreSQL: The authoritative ops_log and snapshots tables
Because WebSocket servers are stateless, a server restart triggers reconnections that are routed to any available shard. No state migration is needed. The client's reconnect payload ({doc_id, last_seen_revision}) contains everything needed to resume the session from any server.
Version History: Event Sourcing Where the Op Log Is the Record
Every edit is an immutable, append-only row in ops_log. Version history requires no separate versioning system; it is inherent in the log structure. Restoring to any past state is a deterministic replay:
def restore_to_revision(doc_id: str, target_revision: int) -> str:
    # 1. Find the latest snapshot at or before target_revision
    snapshot = get_latest_snapshot_before(doc_id, target_revision)
    # 2. Replay ops from snapshot.seq + 1 to target_revision
    ops = get_ops(doc_id, from_seq=snapshot.seq + 1, to_seq=target_revision)
    # 3. Apply each op to the snapshot content
    content = snapshot.content
    for op in ops:
        content = apply_op(content, op)
    return content
Snapshot trigger (bounding reconstruction latency):
SNAPSHOT_INTERVAL = 100  # take a snapshot every 100 committed ops

def on_op_committed(doc_id: str, new_seq: int, current_content: str):
    if new_seq % SNAPSHOT_INTERVAL == 0:
        store_snapshot(doc_id, seq=new_seq, content=current_content)
        # Optionally prune ops older than 2 snapshot generations
        prune_ops_before(doc_id, seq=new_seq - (2 * SNAPSHOT_INTERVAL))
This bounds reconstruction to at most 100 op replays from the nearest snapshot: a constant bound on replay work regardless of how long the document has existed.
Cursor Presence: Ephemeral State via Redis Pub/Sub
Cursor positions are ephemeral: high-frequency, non-critical, and naturally lossy. Routing them through the OT pipeline would add unnecessary latency and log bloat. Redis pub/sub with a short TTL is the right tool:
import redis, json, time

r = redis.Redis()

def broadcast_cursor(doc_id: str, user_id: str, position: int, color: str):
    payload = json.dumps({
        "user_id": user_id,
        "position": position,
        "color": color,
        "ts": time.time()
    })
    # Publish to the document's cursor channel for real-time broadcast
    r.publish(f"doc:{doc_id}:cursors", payload)
    # Cache latest cursor per user; new joiners can read this immediately
    r.setex(f"cursor:{doc_id}:{user_id}", 5, payload)  # 5-second TTL

def subscribe_cursors(doc_id: str, on_cursor_event):
    """Each WebSocket shard subscribes to cursor events for its active documents."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"doc:{doc_id}:cursors")
    for message in pubsub.listen():
        if message["type"] == "message":
            on_cursor_event(doc_id, json.loads(message["data"]))
The 5-second TTL on cached cursor positions means a newly connected client can immediately see all active cursors without waiting for the next cursor move event from each user.
End-to-End Architecture: From Keystroke to Broadcast
The following diagram traces the full collaborative editing data flow, from a user's keystroke to its propagation to all connected peers.
graph TD
C1[Client A - Browser] -->|WebSocket: op + revision| WSS[WebSocket Server Shard]
C2[Client B - Browser] -->|WebSocket: op + revision| WSS
C3[Client C - Browser] -->|WebSocket: cursor move| WSS
WSS -->|Forward op via gRPC| OTS[OT Service]
WSS -->|Publish cursor event| RPS[Redis Pub/Sub]
OTS -->|Transform op against concurrent ops| OTS
OTS -->|Write to ops_log| OL[(ops_log - PostgreSQL)]
OTS -->|Update op buffer| RC[Redis - Active Op Buffer]
OTS -->|Publish transformed op| RPS
RPS -->|Fan-out to subscribers| WSS
WSS -->|Push op to all doc clients| C1
WSS -->|Push op to all doc clients| C2
WSS -->|Push cursor event| C3
OTS -->|Every N ops: trigger snapshot| SS[Snapshot Service]
SS -->|Write full doc content| SNAP[(snapshots - PostgreSQL)]
C4[Client D - Reconnect] -->|GET /docs/reconstruct| RS[Read Service]
RS -->|Load nearest snapshot| SNAP
RS -->|Load ops since snapshot| OL
RS -->|Return catchup bundle| C4
The diagram separates the two primary data flows. The write path (left side, C1/C2/C3 → WSS → OTS → RPS → back to clients) is the hot path: every keystroke traverses this path in under 100 ms. The read/reconnect path (bottom right, Client D) is the cold path: it reconstructs document state from persistent storage and is bounded to O(100) op replays by the snapshot strategy.
Component Roles at a Glance
| Component | Responsibility | State |
| --- | --- | --- |
| WebSocket Server (Shard) | Connection management, routing, relay | Stateless; no document content |
| OT Service | Transform, serialize, and broadcast ops per document | Redis op buffer (last N ops) |
| ops_log (PostgreSQL) | Durable, ordered op history per document | Persistent; primary source of truth |
| Snapshot Service | Periodically captures full document state | PostgreSQL snapshots table |
| Redis Pub/Sub + Cache | Cursor positions, session registry, fast op buffer | Ephemeral; TTL-managed |
| Read Service | Document reconstruction for open and reconnect | Stateless; queries snapshots + ops_log |
Real-World Applications: How Google Docs and Notion Scale Collaborative Editing
Google Docs: The Jupiter Protocol at Scale
Google Docs uses an OT variant based on the Jupiter Protocol (Xerox PARC, 1995; evolved at Google into the Wave OT format). The architecture is deliberately centralized:
- A single OT server process per document serializes all operations; no peer-to-peer transforms
- Clients communicate only with the server, reducing the convergence requirement to TP1 only
- Operations are represented as sequences of retain(n), insert(text), and delete(n) instructions over a string with rich-text attributes, allowing a single OT type to handle plain text and formatting simultaneously
Scale observations: Google Docs processes billions of edits per day. Document servers are sharded by document ID, with hot documents promoted to higher-capacity shard hosts. The bottleneck at this scale is not OT computation (which is O(1) per transform) but fan-out: broadcasting a single op to 50 concurrent editors on 10,000 simultaneous hot documents drives tens of millions of WebSocket pushes per second.
Notion: Migrating from OT Toward CRDT Structures
Notion initially built on an OT backbone but has progressively migrated page blocks and nested structures toward CRDT-based data types. The motivation: CRDTs require no central transform server. Convergence is guaranteed by the data structure's mathematical properties: any two replicas, having received the same set of operations in any order, will converge to the same state.
A CRDT for text (e.g., Yjs or Automerge) assigns a globally unique identifier to every character insertion. When two users insert characters at the same position, the CRDT breaks the tie deterministically using character IDs; no transform logic is needed, and no central server is required.
OT vs. CRDT at a glance:
| Dimension | OT (Jupiter / ShareDB) | CRDT (Yjs / Automerge) |
| --- | --- | --- |
| Central server required | Yes; single serialization point | No; peer-to-peer is possible |
| Correctness model | Transform function must satisfy TP1 | Structural; convergence is algebraic |
| Memory overhead | Compact (ops are small deltas) | High; tombstones for deleted chars persist indefinitely |
| Offline-first | Difficult; server needed for transforms | Natural fit |
| Rich text maturity | Mature (Wave OT, json0) | Emerging (Yjs has good rich-text support) |
| Garbage collection | Not required | Required; coordination needed across peers |
Trade-offs and Failure Modes: OT vs. CRDT vs. Last-Writer-Wins
Why Last-Writer-Wins Is Wrong for Document Editing
LWW resolves conflicts using wall-clock timestamps. In distributed systems, clocks diverge (NTP drift of 100–500 ms is common). An edit from a node with a slightly faster clock always wins, silently discarding concurrent edits from nodes with slightly slower clocks. More fundamentally, LWW discards one of two concurrent edits; it has no mechanism to merge them. Collaborative editing requires merging, and LWW is structurally incapable of providing it.
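A few lines make the failure concrete; the timestamps and merge function are illustrative, reusing the "ABCDE" example from earlier.

```python
def lww_merge(a, b):
    # Last-writer-wins: the higher wall-clock timestamp simply replaces the other
    return a if a["ts"] >= b["ts"] else b

# Concurrent edits to "ABCDE": A inserted "X" at 3, B deleted "D".
# A's node clock runs 1 ms fast, so A "wins" and B's edit vanishes.
edit_a = {"content": "ABCXDE", "ts": 1000.002}
edit_b = {"content": "ABCE",   "ts": 1000.001}
merged = lww_merge(edit_a, edit_b)
print(merged["content"])  # ABCXDE -- the converged OT result "ABCXE" is unreachable
```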
Failure Modes in Production OT Systems
| Failure Mode | Root Cause | Mitigation |
| --- | --- | --- |
| Transform function bug | Edge case in overlapping DELETE vs. DELETE range handling | Property-based fuzz testing using TP1 as the invariant: for any randomly generated concurrent ops, both transform orderings must produce identical final states |
| Client-server revision drift | Client submits an op far behind current server revision after a long reconnect | Enforce a max client_revision_lag; force a full page reload if the gap exceeds the safe transform window |
| OT server bottleneck | Single-threaded OT loop can't keep up with 100+ concurrent editors | Partition hot documents; introduce a fan-out relay tier that scales horizontally without OT logic |
| Redis pub/sub message loss | Redis restart drops in-flight cursor and op broadcasts | Op durability is guaranteed by PostgreSQL write-first; clients detect missing ops via revision sequence gaps and trigger a catchup request |
| Split-brain during OT server failover | Two OT replicas briefly both accept ops for the same document | Leader election per document via etcd or ZooKeeper; at-most-one active OT writer enforced at the coordination layer |
CAP Trade-Off: Consistency Over Availability at the OT Layer
The collaborative editing system deliberately prioritizes Consistency over Availability at the OT Service:
- During an OT service failure, the active editing session is paused (clients buffer local ops and display a "reconnecting…" spinner) rather than allowing two OT processes to diverge
- This is a deliberate engineering choice: a split-brain OT scenario would require a CRDT-style merge of two diverged op histories post-failover, an extremely complex recovery path
- Users tolerate a few seconds of pause; silent document corruption is unacceptable
Decision Guide: Choosing Between OT, CRDT, and Optimistic Locking
| Situation | Recommendation |
| --- | --- |
| Use OT when | You have a reliable central server, need mature tooling (ShareDB, json0, Wave OT), and your document is primarily linear text or structured JSON with a known schema |
| Use CRDT when | You need offline-first support, peer-to-peer sync without a server, or your data maps cleanly to a CRDT type (sequences, sets, counters, maps) |
| Avoid LWW when | Multiple concurrent users can edit overlapping regions; LWW silently discards one user's valid edits with no merge semantics |
| Avoid multi-master OT when | You cannot rigorously verify TP2 correctness across your entire operation set; multi-master OT bugs are notoriously subtle and hard to reproduce |
| Use optimistic locking instead when | A document is rarely edited by more than one person at a time; version-based conflict detection ("document changed, please reload") is far simpler than OT and sufficient for low-contention cases |
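For the low-contention case, version-based conflict detection is only a few lines. This sketch uses an in-memory store with invented names (DocumentStore, ConflictError); with a SQL backend the same check is an UPDATE ... WHERE version = :expected that reports zero affected rows on conflict.

```python
class ConflictError(Exception):
    """Raised when the document changed since the client last read it."""

class DocumentStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, content)

    def read(self, doc_id):
        return self._docs.get(doc_id, (0, ""))

    def save(self, doc_id, content, expected_version):
        current, _ = self.read(doc_id)
        if current != expected_version:
            # Another writer got there first: "document changed, please reload"
            raise ConflictError(f"expected v{expected_version}, found v{current}")
        self._docs[doc_id] = (current + 1, content)
        return current + 1
```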
Data Model, Cache Strategy, and the Write Path in Detail
Data Model
-- Core document record
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
title TEXT NOT NULL,
owner_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
latest_seq INT DEFAULT 0 -- latest committed operation sequence number
);
-- Every edit ever made β the immutable event log
CREATE TABLE ops_log (
doc_id UUID NOT NULL,
seq INT NOT NULL, -- server-assigned; monotonically increasing per doc
op_type TEXT NOT NULL, -- 'insert' | 'delete' | 'retain'
position INT NOT NULL,
content TEXT, -- for inserts; NULL for deletes
length INT, -- for deletes; NULL for inserts
user_id UUID NOT NULL,
client_id TEXT NOT NULL, -- unique per browser tab/client instance
applied_at TIMESTAMPTZ DEFAULT NOW(),
vector_clock JSONB, -- logical clock snapshot at time of op
PRIMARY KEY (doc_id, seq)
);
-- Periodic full-content snapshots to bound reconstruction time
CREATE TABLE snapshots (
doc_id UUID NOT NULL,
seq INT NOT NULL, -- op sequence at which this snapshot was captured
content TEXT NOT NULL, -- full document content at this seq
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (doc_id, seq)
);
-- Document access control
CREATE TABLE doc_permissions (
doc_id UUID NOT NULL,
user_id UUID NOT NULL,
role TEXT NOT NULL, -- 'owner' | 'editor' | 'viewer'
granted_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (doc_id, user_id)
);
Cache Design
| Cache Key Pattern | Cached Value | TTL | Purpose |
| --- | --- | --- | --- |
| doc:{doc_id}:ops_buffer | Last 200 ops as sorted set (score = seq) | 10 min sliding | Fast catchup for recently reconnected clients without hitting PostgreSQL |
| doc:{doc_id}:latest_seq | Current server revision integer | Invalidated on each commit | OT service reads this atomically before accepting new ops |
| cursor:{doc_id}:{user_id} | {position, color, ts} JSON | 5 s | Stale cursor auto-eviction; initial presence snapshot for new joiners |
| session:{client_id} | {user_id, doc_id, last_ack_seq} | 30 min | WebSocket session metadata for reconnect routing |
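The ops_buffer row implies a two-tier catchup read: serve from the buffer when it still covers the client's gap, otherwise fall back to ops_log. The sketch below uses plain dicts as stand-ins for the Redis sorted set and the PostgreSQL table; the real system would issue a ZRANGEBYSCORE and a SQL range query.

```python
OPS_BUFFER_SIZE = 200  # last N ops kept hot per document

def catchup_ops(buffer: dict, ops_log: dict, from_seq: int, to_seq: int):
    """buffer and ops_log both map seq -> op; buffer holds only recent seqs."""
    wanted = range(from_seq, to_seq + 1)
    if all(seq in buffer for seq in wanted):
        return [buffer[seq] for seq in wanted], "redis"     # fast path
    return [ops_log[seq] for seq in wanted], "postgres"     # slow path
```

A client reconnecting at revision r requests from_seq=r+1 through the current revision; only clients behind by more than the buffer window touch the database.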
The Write Path: Full Flow from Keystroke to Convergence
A user types "e" at position 47 in a document currently at revision 314:
1. Client creates op {type: "insert", position: 47, content: "e", client_revision: 314}
2. Client applies it optimistically to local doc state (zero-latency local render)
3. Client sends the op over WebSocket to the assigned shard
4. WebSocket Server forwards the op to the OT Service via internal gRPC
5. OT Service atomically reads doc:{doc_id}:latest_seq from Redis: seq = 314, so the client is current and no transforms are needed
6. OT Service assigns seq = 315 and persists the op to ops_log (PostgreSQL's write-ahead log ensures durability)
7. OT Service updates doc:{doc_id}:latest_seq in Redis to 315
8. OT Service appends the op to the doc:{doc_id}:ops_buffer sorted set in Redis
9. OT Service publishes {seq: 315, op: ...} to the Redis pub/sub channel doc:{doc_id}:ops
10. WebSocket Server receives the pub/sub message and pushes {seq: 315, op: ...} to all 49 other connected clients for this document
11. All peer clients apply the incoming op, retransforming any pending ops in their local queues
Target round-trip: < 100 ms at p99 from step 1 (keypress) to step 11 (all peers updated).
The Read Path: Document Reconstruction on Open
When a user opens a document cold:
- Client sends GET /docs/{doc_id} with an auth token
- Read Service verifies permissions in doc_permissions
- Read Service fetches the latest snapshot from the snapshots table
- Read Service fetches all ops from ops_log where seq > snapshot.seq
- Read Service replays the ops against the snapshot content to reconstruct current state
- Read Service returns {content, current_seq} to the client
- Client opens a WebSocket connection and subscribes to ops starting at current_seq + 1
current_seq + 1
ShareDB: How It Implements Operational Transformation
ShareDB is the leading open-source OT framework for Node.js. It implements OT over JSON documents using the json0 operation type: a composable set of operations over JSON structures covering insert/delete on strings and arrays, set/unset on object keys, and numeric increment. ShareDB handles the full OT cycle: server-side transform and sequencing, the client-side pending op queue, retransform on reconnect, and pluggable persistence backends.
Minimal ShareDB Server Integration
// server.js: ShareDB OT server backed by PostgreSQL
const ShareDB = require('sharedb');
const PostgresDB = require('sharedb-postgres');
const WebSocket = require('ws');
const WebSocketJSONStream = require('@teamwork/websocket-json-stream');

const db = new PostgresDB({ connectionString: process.env.DATABASE_URL });
const backend = new ShareDB({ db });

// ShareDB handles all OT negotiation over the WebSocket stream
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
  const stream = new WebSocketJSONStream(ws);
  backend.listen(stream);
});
Minimal ShareDB Client Integration
// client.js β real-time collaborative text editing in the browser
const sharedb = require('sharedb/lib/client');
const ReconnectingWebSocket = require('reconnecting-websocket');
const socket = new ReconnectingWebSocket('ws://localhost:8080');
const connection = new sharedb.Connection(socket);
// Subscribe to the document
const doc = connection.get('documents', 'doc-abc-123');
doc.subscribe((err) => {
if (err) throw err;
// Submit an insert at string index 5 (json0 format: si = string insert)
doc.submitOp([{ p: ['body', 5], si: 'Hello' }], (err) => {
if (err) console.error('Submit failed:', err);
// op is immediately applied locally; server transforms it against
// any concurrent ops and broadcasts the result to all subscribers
});
});
// All concurrent remote ops arrive here, already transformed by ShareDB
doc.on('op', (op, source) => {
if (!source) {
// Remote op β apply to editor UI
console.log('Remote op applied:', op);
}
});
submitOp encapsulates the entire OT write cycle: optimistic local application, server submission with the current revision, server-side transform against concurrent ops, and broadcast. The application developer never writes transform logic directly; that is ShareDB's contract.
For a full deep-dive on ShareDB's json0 OT types, projections, and access control middleware, see the ShareDB README and the json0 specification. A companion post exploring CRDTs via Yjs is a planned follow-up in this series.
Lessons Learned: What Engineers Get Wrong About Collaborative Editing
Don't conflate operation ordering with operation merging. The most common interview pitfall is proposing wall-clock timestamps plus last-writer-wins. Timestamps tell you when operations were issued; they have no semantics for how to combine them. Two concurrent inserts at the same position both represent valid user intentions; neither is wrong, and discarding one is a data loss bug, not conflict resolution.
The transform function is load-bearing code. Every bug in the transform matrix is a potential silent document corruption vector. Production OT implementations use property-based fuzz testing: generate thousands of random concurrent op pairs, and for each pair verify that both transform orderings (transform(O₁, O₂) and transform(O₂, O₁)) produce identical final states. TP1 is the fuzz oracle.
Snapshots are not optional. A document with 50,000 ops that takes 30 seconds to reconstruct will cause visible loading delays on every open. Snapshot every 50–200 ops depending on average op size. Treating snapshots as an "optimization" rather than a correctness requirement for latency SLAs leads to silent degradation as documents age.
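The fuzz-testing idea can be demonstrated with a toy insert-only transform. This is a sketch, not a production transform matrix: ops are {pos, text, site}, where the hypothetical site field breaks ties at equal positions so that both transform orderings converge (the TP1 property).

```javascript
// Apply a toy insert op to a string
function apply(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Transform op `a` so it applies correctly after concurrent op `b`:
// shift right if b landed before a, or at the same spot with priority
function transform(a, b) {
  if (b.pos < a.pos || (b.pos === a.pos && b.site < a.site)) {
    return { ...a, pos: a.pos + b.text.length };
  }
  return a;
}

// Property-based TP1 check: for random concurrent pairs (a, b),
// apply(apply(d, a), transform(b, a)) must equal
// apply(apply(d, b), transform(a, b)).
function checkTP1(trials = 10000) {
  for (let i = 0; i < trials; i++) {
    const doc = 'abcdefgh'.slice(0, 1 + Math.floor(Math.random() * 8));
    const a = { pos: Math.floor(Math.random() * (doc.length + 1)), text: 'X', site: 1 };
    const b = { pos: Math.floor(Math.random() * (doc.length + 1)), text: 'YY', site: 2 };
    const left = apply(apply(doc, a), transform(b, a));
    const right = apply(apply(doc, b), transform(a, b));
    if (left !== right) return { doc, a, b, left, right }; // counterexample
  }
  return null; // TP1 held for every trial
}

console.log(checkTP1() === null ? 'TP1 held' : 'TP1 violated');
```

A real fuzz suite would also cover deletes, multi-character ranges, and composed ops, but the oracle is the same equality check.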
WebSocket servers must be stateless. If the OT transform state lives inside a WebSocket server process, you've coupled document consistency to that process's availability. When the process restarts, the document's editing session is corrupted. All OT state (op buffer, current revision, transform context) belongs in Redis and PostgreSQL. WebSocket servers are routing infrastructure, not business logic.
Cursor positions are not edits. A common over-engineering mistake is routing cursor events through the OT pipeline for causal ordering. Cursor events are ephemeral (a stale cursor position is invisible to users after 5 seconds), high-frequency (every mouse move), and naturally lossy. Put them in Redis pub/sub with a 5-second TTL. Ordering cursor events through the durable op log wastes compute, adds latency, and pollutes version history.
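The cursor-presence pattern is simple enough to sketch. A real deployment would use Redis (SET with a 5-second EX expiry, plus pub/sub fan-out to WebSocket shards); this in-memory stand-in is only meant to illustrate the lossy TTL semantics, and all names here are hypothetical.

```javascript
// In-memory model of ephemeral cursor presence with TTL expiry.
// Stand-in for Redis SET key value EX 5; not a real Redis client.
class CursorPresence {
  constructor(ttlMs = 5000) {
    this.ttlMs = ttlMs;
    this.cursors = new Map(); // userId -> { pos, expiresAt }
  }

  // Called on every cursor move; overwrites the entry and refreshes the TTL
  update(userId, pos, now = Date.now()) {
    this.cursors.set(userId, { pos, expiresAt: now + this.ttlMs });
  }

  // Return only live cursors; stale entries simply vanish (lossy by design)
  live(now = Date.now()) {
    const out = {};
    for (const [userId, c] of this.cursors) {
      if (c.expiresAt > now) out[userId] = c.pos;
    }
    return out;
  }
}

const presence = new CursorPresence();
presence.update('alice', 42, 0);    // alice's cursor at t=0
presence.update('bob', 7, 4000);    // bob's cursor at t=4000
console.log(presence.live(6000));   // { bob: 7 }: alice's entry has expired
```

Note what is absent: no sequencing, no durability, no transform. Losing a cursor update is harmless because the next mouse move replaces it.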
CRDT memory overhead compounds over time. CRDTs for text maintain tombstones for every deleted character; these can never be garbage-collected without coordination across all live replicas. A 1 KB document that has undergone thousands of edits can balloon to 100 KB of CRDT state. Plan for compaction. Yjs has a garbage collection mechanism, but it requires all peers to have acknowledged receipt of the relevant ops before pruning, a coordination requirement that reintroduces complexity in peer-to-peer deployments.
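The tombstone effect is easy to see in a toy model. This is not a real CRDT (no site IDs, no merge function, unlike Yjs or Automerge); it only shows how delete-as-tombstone makes replica state grow monotonically even as the visible document shrinks.

```javascript
// Toy tombstoned text: deletes mark characters dead instead of removing them
class TombstoneText {
  constructor() {
    this.chars = []; // { ch, deleted } — one entry per insert, kept forever
  }
  insert(visibleIndex, ch) {
    // Map the visible index to a slot in the full (tombstoned) array
    let i = 0, seen = 0;
    while (i < this.chars.length && seen < visibleIndex) {
      if (!this.chars[i].deleted) seen++;
      i++;
    }
    this.chars.splice(i, 0, { ch, deleted: false });
  }
  delete(visibleIndex) {
    let seen = -1;
    for (const c of this.chars) {
      if (!c.deleted && ++seen === visibleIndex) { c.deleted = true; return; }
    }
  }
  text() { return this.chars.filter((c) => !c.deleted).map((c) => c.ch).join(''); }
  stateSize() { return this.chars.length; } // includes tombstones
}

const doc = new TombstoneText();
for (const ch of 'hello') doc.insert(doc.text().length, ch);
for (let i = 0; i < 5; i++) doc.delete(0); // delete every visible character
console.log(doc.text());      // "": the document looks empty
console.log(doc.stateSize()); // 5: but five tombstones remain in memory
```

Every edit cycle leaves residue, which is why compaction (and the peer-acknowledgement coordination it requires) has to be part of the capacity plan.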
TLDR Summary & Key Takeaways: The Eight Design Decisions That Define This System
- OT over last-writer-wins: Collaborative editing requires merging concurrent operations, not discarding one. OT's transform function reconciles edits while preserving both users' intentions; LWW has no such merge semantics.
- Central arbiter (Jupiter Protocol): A single OT server per document serializes operations, reducing the convergence requirement to TP1 only. This is the only OT architecture that has been proven correct and deployed at billion-user scale.
- WebSocket for real-time delivery: HTTP polling introduces 200–1,000 ms of additional round-trip latency per operation. A collaborative editor operating on polling is perceptibly broken. Persistent WebSocket channels are a hard architectural requirement, not a convenience.
- Stateless WebSocket servers: All document state lives in Redis and PostgreSQL. WebSocket servers are replaceable routing infrastructure. No document consistency is coupled to the lifecycle of any single process.
- Event sourcing for the op log: Every edit is an immutable, append-only row in ops_log. Version history, point-in-time restore, and audit trails come for free from this single design choice.
- Snapshots every N ops: Without snapshots, document reconstruction is O(total ops ever). With snapshots every 100 ops, reconstruction is bounded to O(100), a latency guarantee that holds regardless of document age or edit volume.
- Redis for ephemeral state: Cursor positions, session metadata, and the recent op buffer all live in Redis with TTLs. Redis pub/sub fans out broadcasts to WebSocket shards; Redis delivers the sub-millisecond read latency that PostgreSQL cannot meet for hot-path operations.
- CRDT as the alternative path: CRDTs (Yjs, Automerge) eliminate the central arbiter requirement and enable offline-first and peer-to-peer topologies, at the cost of memory overhead from tombstones and added coordination for garbage collection. Choose CRDTs when you need offline-first; choose OT (ShareDB/Jupiter) when you have a reliable central server and need battle-tested tooling.
Practice Quiz: Test Your Collaborative Editing Design Knowledge
User A inserts "X" at position 3. Concurrently, User B also inserts "Y" at position 3. Both were issued against revision r. The server applies A's op first. After OT transformation, what position does B's insert target?
- A) Position 3: unchanged, because both ops are independent inserts
- B) Position 4: shifted right by 1 because A's insert at position 3 displaced B's target
- C) Position 2: shifted left to avoid overwriting A's character
- D) The operation is rejected with a conflict error and B must re-submit
Correct Answer: B
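The answer can be checked mechanically with a toy insert-vs-insert transform. This is a sketch under one common tie-breaking convention (the already-applied op wins equal positions); real implementations vary in how they break the tie.

```javascript
// Transform `op` against a concurrent insert that the server already applied.
// An applied insert at or before op.pos shifts op right by its length.
function transformInsert(op, applied) {
  return applied.pos <= op.pos
    ? { ...op, pos: op.pos + applied.text.length }
    : op;
}

const opA = { pos: 3, text: 'X' }; // applied first by the server
const opB = { pos: 3, text: 'Y' }; // concurrent, issued against revision r
console.log(transformInsert(opB, opA).pos); // 4: shifted right by 1
```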
Why does this collaborative editing system use WebSocket persistent connections rather than HTTP short-polling or HTTP long-polling for operation delivery?
- A) WebSockets are the only protocol that guarantees exactly-once delivery semantics
- B) HTTP polling adds a minimum of one full HTTP round-trip per operation (200–1,000 ms), making the editor perceptibly laggy; WebSockets maintain a persistent bidirectional channel that delivers ops in < 10 ms
- C) WebSockets automatically perform OT transform logic; HTTP does not
- D) HTTP connections cannot carry user identity across requests without session cookies
Correct Answer: B
A heavily edited document has 80,000 operations in ops_log and no snapshots have ever been taken. What is the worst-case time complexity to reconstruct its current content for a user opening it now?
- A) O(1): the document state is always cached in Redis
- B) O(log n): the ops log is indexed with a B-tree and can be binary searched
- C) O(n): every one of the 80,000 operations must be replayed sequentially from the beginning
- D) O(n²): each operation must be transformed against all previous operations before application
Correct Answer: C
During an OT Service outage, 30 users are actively editing document D. The load balancer detects the failure and promotes a warm standby OT Service replica that has been independently receiving op writes. What critical problem does this introduce?
- A) The standby replica will reject all ops because it lacks the primary's session tokens
- B) Two independent OT processes will have each committed ops during the overlap window, creating a split-brain: both replicas hold a different canonical revision sequence for the same document, requiring a CRDT-style merge to reconcile, a recovery path of extreme complexity
- C) WebSocket clients will automatically negotiate a tie-breaking protocol using their local vector clocks
- D) No problem: the standby's ops log is a perfect superset of the primary's, so state is already consistent
Correct Answer: B
Open-ended: Your team proposes replacing the OT Service with a Yjs CRDT engine to eliminate the single-point-of-failure concern. What specific operational trade-offs would you evaluate before approving this change? Address at minimum: memory growth from tombstones on heavily edited documents, the coordination requirements introduced by Yjs garbage collection, the maturity of CRDT support for rich text attributes (bold, inline comments, nested lists), and any new monitoring or debugging complexity introduced by a decentralized topology.
Correct Answer: (Open-ended; no single correct answer. Strong answers address: (1) tombstone accumulation: a 1 KB doc with thousands of edits can balloon to 100 KB of CRDT state, so plan for GC coordination; (2) Yjs GC requires all live peers to acknowledge receipt of relevant ops before pruning, which reintroduces a form of coordination overhead; (3) Yjs has solid rich-text support via y-prosemirror / y-quill but is less battle-tested than json0/Wave OT for large-scale production deployments; (4) decentralized topology changes debugging: with OT, the server op log is the single forensic record; with peer-to-peer CRDTs, causality reconstruction requires collecting vector clock state across all peers.)