System Design HLD Example: Collaborative Document Editing (Google Docs)
A senior-level HLD for real-time collaborative editing with Operational Transformation (OT) and conflict resolution.
TLDR: Real-time collaborative editing relies on Operational Transformation (OT) or CRDTs to resolve concurrent edits without data loss. The core trade-off is Latency vs. Consistency: we use optimistic local updates for zero-latency typing and a centralized OT server to ensure all clients eventually converge to the exact same document state.
The "Hello World" Race Condition
Imagine two users, Alice and Bob, are editing the same sentence: "The cat is happy." Alice wants to change it to "The fat cat is happy," so she inserts "fat " at position 4. Simultaneously, Bob wants to change it to "The cat is very happy," so he inserts "very " at position 11.
In a naive system:
- Alice's Edit: The server applies INSERT("fat ", pos: 4). Doc becomes: "The fat cat is happy."
- Bob's Edit: The server applies Bob's original INSERT("very ", pos: 11) with no adjustment.
- The Result: "The fat catvery  is happy."
Wait, what happened? Alice's insert shifted everything at or after position 4 four characters to the right, so the "h" of "happy" now lives at position 15. Bob's operation still says position 11, which in the new document is the space right after "cat", so his word lands glued onto "cat" instead of in front of "happy". This is the Position Shift Problem. In a collaborative environment, every edit changes the coordinate system for every other concurrent edit. Without a specialized resolution algorithm, the document quickly becomes a corrupted mess of overlapping characters.
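To make the failure concrete, here is a toy reproduction in TypeScript; the applyInsert helper and operation shape are illustrative, not taken from any real editor.

```typescript
// Toy model of the naive server: operations are applied at their original
// positions with no transformation against concurrent edits.
type InsertOp = { position: number; text: string };

function applyInsert(doc: string, op: InsertOp): string {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position);
}

const original = "The cat is happy.";
const aliceOp: InsertOp = { position: 4, text: "fat " };  // intent: "The fat cat is happy."
const bobOp: InsertOp = { position: 11, text: "very " };  // intent: "The cat is very happy."

// Naive server: apply Alice, then Bob, with no position adjustment.
const afterAlice = applyInsert(original, aliceOp);        // "The fat cat is happy."
const corrupted = applyInsert(afterAlice, bobOp);         // "The fat catvery  is happy."
console.log(corrupted); // Bob's word lands after "cat", not before "happy".
```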
Real-Time Collaboration: Use Cases & Requirements
Actors
- Author / Editor: Creates and modifies document content in real-time.
- Collaborator: Views live changes and contributes simultaneously.
- Viewer: Consumes the document without editing rights but still expects live updates.
Functional Requirements
- Real-Time Editing: Multiple users can edit the same document simultaneously.
- Conflict Resolution: All users must eventually see the exact same document state (Convergence).
- Presence: Show which users are currently active and their cursor positions.
- Version History: Ability to see past revisions and restore previous states.
- Offline Support: (Partial) Allow users to continue typing during brief network drops.
Non-Functional Requirements
- Ultra-Low Latency: Keystrokes should appear on other screens in < 100ms.
- High Availability: The service must be available 99.99% of the time.
- Scalability: Support 10,000+ concurrent editors on a single "hot" document (e.g., a viral public doc).
- Durability: No committed edit should ever be lost.
Basics: Operational Transformation (OT)
Operational Transformation is the "classic" approach used by Google Docs. The core idea is that the server doesn't just apply an operation; it transforms it against all other operations that happened since the client's last known version.
Every operation is represented as a triple: (Type, Position, Value).
When the server receives an operation based on version V, but the current server version is V+k, it runs the new operation through a transform function T(op_new, op_concurrent) for all $k$ operations. This ensures that the user's intent is preserved even if the index has changed.
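A minimal sketch of that server-side loop, assuming a transform(opNew, opConcurrent) function exists for every operation pair (the transform matrix is covered next); the data shapes here are illustrative, not a production schema.

```typescript
// Sketch of the arbiter's core loop: transform the incoming op against every
// op committed since the version the client last saw, then append and broadcast.
interface Op {
  type: "INSERT" | "DELETE";
  position: number;
  text?: string;
  clientVersion: number; // last server version the client acknowledged
}

interface DocState {
  serverVersion: number;
  log: Op[]; // ops indexed by the server version at which they were applied
}

function integrate(doc: DocState, incoming: Op, transform: (a: Op, b: Op) => Op): Op {
  // There are k = serverVersion - clientVersion concurrent ops to transform against.
  let op = incoming;
  for (let v = incoming.clientVersion; v < doc.serverVersion; v++) {
    op = transform(op, doc.log[v]);
  }
  doc.log.push(op);
  doc.serverVersion++;
  return op; // this transformed op is what gets broadcast to all clients
}
```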
Mechanics: The Transform Matrix
To implement OT, you must define how every operation type interacts with every other operation type.
- Insert vs. Insert: If User A inserts at position 5 and User B inserts at position 10, no change is needed for A, but B's position must be incremented by the length of A's insert.
- Insert vs. Delete: If a character is deleted before an insert, the insert position must be decremented.
- Delete vs. Delete: If both users delete the same character, the second operation becomes a "no-op" to avoid double-deletion.
This "Matrix" of rules ensures that no matter what order operations arrive in, the final result is deterministic.
Estimations & Design Goals
The Math of Real-Time Sync
- Average Typist: 60 words per minute $\approx 5$ characters per second.
- Active Users per Doc: 10 (average) to 1,000 (hot docs).
- Traffic per Doc: 10 users $\times$ 5 chars/sec = 50 operations/sec.
- Fan-out: Every 1 operation must be broadcast to the other 9 users.
- Total Broadcasts: $50 \times 9 = 450$ messages per second per document.
Design Goal: Use WebSockets for bidirectional, low-latency communication. Standard HTTP polling would be too slow and would overwhelm the server with headers.
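A minimal client-side sketch of that WebSocket channel; the endpoint URL and the op/serverVersion message envelope are assumptions for illustration, not a defined protocol.

```typescript
// Minimal browser client: one WebSocket per open document, operations sent as
// JSON envelopes. The message shape here is an assumption, not a spec.
const docId = "doc-123";
const ws = new WebSocket(`wss://collab.example.com/docs/${docId}`);

let clientVersion = 0; // last server version this client has acknowledged

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "op") {
    applyRemoteOp(msg.op);            // apply the already-transformed op locally
    clientVersion = msg.serverVersion;
  }
};

function sendLocalInsert(position: number, text: string): void {
  ws.send(JSON.stringify({ type: "op", op: { type: "INSERT", position, text }, clientVersion }));
}

declare function applyRemoteOp(op: unknown): void; // editor-specific, out of scope here
```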
High-Level Design: The Centralized OT Architecture
The following architecture utilizes a centralized "Orderer" to ensure all operations are sequenced correctly.
```mermaid
graph TD
    UserA((User A)) <--> WSS1[WS Server Shard 1]
    UserB((User B)) <--> WSS1
    UserC((User C)) <--> WSS2[WS Server Shard 2]
    WSS1 <--> LB[Internal Load Balancer]
    WSS2 <--> LB
    LB <--> OTS[OT Service: The Arbiter]
    OTS --> PDB[(Document Store: Postgres)]
    OTS --> RC[(Active Doc Cache: Redis)]
    OTS --> PubSub[Redis Pub/Sub]
    PubSub --> WSS1
    PubSub --> WSS2
```
The diagram above captures the full topology. The critical insight is the separation of concerns: WebSocket Servers handle raw TCP connections and serialization, the OT Service is the single stateful arbiter that runs all transformations, and Redis Pub/Sub is the fan-out mechanism that broadcasts transformed operations back to all connected shards without routing every message through the OT Service a second time.
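As a sketch of that fan-out path on one WebSocket shard, here is how the Pub/Sub subscription could look, assuming the ioredis and ws libraries and an illustrative doc:&lt;id&gt; channel convention.

```typescript
// Fan-out sketch for one WebSocket shard: subscribe to a document's channel on
// the first local editor, then push published (transformed) ops to local sockets.
import Redis from "ioredis";
import { WebSocketServer, WebSocket } from "ws";

const subscriber = new Redis();                      // dedicated connection for pub/sub
const wss = new WebSocketServer({ port: 8080 });
const connectionsByDoc = new Map<string, Set<WebSocket>>();

wss.on("connection", (socket, request) => {
  const docId = (request.url ?? "/").split("/").pop() ?? "";
  if (!connectionsByDoc.has(docId)) {
    connectionsByDoc.set(docId, new Set());
    subscriber.subscribe(`doc:${docId}`);            // first local editor joins the channel
  }
  connectionsByDoc.get(docId)!.add(socket);
  socket.on("close", () => connectionsByDoc.get(docId)?.delete(socket));
});

// Transformed ops published by the OT Service are pushed to every local socket.
subscriber.on("message", (channel, payload) => {
  const docId = channel.slice("doc:".length);
  for (const socket of connectionsByDoc.get(docId) ?? []) {
    socket.send(payload);
  }
});
```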
Deep Dive: How the OT Server Guarantees Document Convergence
The "magic" of collaborative editing lies entirely inside the OT Service labeled "The Arbiter" in the architecture diagram. Understanding its internals reveals why a simple message broker is not enough: you need a stateful transformation engine with an ordered operation log.
Internals: The Operation Log and Version Clock
The OT Server maintains two critical data structures for each active document: an ordered operation log and a document snapshot. The operation log is an append-only ledger of every transformation ever applied to the document. Each entry is stamped with a monotonically increasing server_version. When a client sends an operation, it includes its client_version (the last server version it acknowledged). The OT server retrieves every operation applied since that version and transforms the incoming operation against each one in sequence before committing it.
| Field | Type | Description |
| --- | --- | --- |
| op_id | UUID | Unique identifier for this operation |
| doc_id | UUID | Identifies the document being modified |
| client_version | Integer | Server version the client based this op on |
| server_version | Integer | Version assigned by the OT Server after transform |
| author_id | UUID | Identity of the editing user |
| op_type | ENUM | INSERT, DELETE, RETAIN, or FORMAT |
| position | Integer | Character index after transformation |
| content | Text | The character(s) affected (null for DELETE) |
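The same record, expressed as a TypeScript shape for reference; field names mirror the table above, and everything else is illustrative.

```typescript
// The operation-log record from the table above, as a TypeScript shape.
type OpType = "INSERT" | "DELETE" | "RETAIN" | "FORMAT";

interface OperationRecord {
  opId: string;            // op_id: unique identifier for this operation
  docId: string;           // doc_id: the document being modified
  clientVersion: number;   // server version the client based this op on
  serverVersion: number;   // version assigned by the OT Server after transform
  authorId: string;        // identity of the editing user
  opType: OpType;
  position: number;        // character index after transformation
  content: string | null;  // affected character(s); null for DELETE
}
```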
Periodically (every 100 operations is a common threshold) the server saves a checkpoint snapshot of the document's full state. This prevents the operation log from growing unboundedly and allows new clients joining a session to fast-load the latest snapshot rather than replaying millions of operations from the beginning. Without checkpoints, a heavily edited document would require minutes of replay time for every new session.
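A sketch of that checkpointing rule and the fast-load path for a joining client; the storage helpers are placeholders for whatever persistence layer backs the service.

```typescript
// Checkpointing sketch: snapshot every CHECKPOINT_INTERVAL committed ops, and
// let new sessions load the latest snapshot plus only the ops after it.
const CHECKPOINT_INTERVAL = 100;

interface Snapshot { serverVersion: number; content: string }
interface LoggedOp { type: string; position: number; text?: string }

async function maybeCheckpoint(docId: string, serverVersion: number, content: string): Promise<void> {
  if (serverVersion % CHECKPOINT_INTERVAL === 0) {
    await saveSnapshot(docId, { serverVersion, content });
  }
}

async function loadForNewSession(docId: string): Promise<{ content: string; version: number }> {
  const snapshot = await latestSnapshot(docId);
  const tail = await opsAfter(docId, snapshot.serverVersion); // usually < CHECKPOINT_INTERVAL ops
  let content = snapshot.content;
  for (const op of tail) content = applyOp(content, op);
  return { content, version: snapshot.serverVersion + tail.length };
}

// Storage and apply helpers are assumed to exist elsewhere in the service.
declare function saveSnapshot(docId: string, s: Snapshot): Promise<void>;
declare function latestSnapshot(docId: string): Promise<Snapshot>;
declare function opsAfter(docId: string, version: number): Promise<LoggedOp[]>;
declare function applyOp(content: string, op: LoggedOp): string;
```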
Performance Analysis: The Latency Budget for a Keystroke
Every keystroke travels a full round-trip before it appears on another user's screen. The system must complete this round-trip in under 100 milliseconds to feel instantaneous.
| Stage | Latency Target | Bottleneck Risk |
| --- | --- | --- |
| Client to WebSocket Server (network) | 10-30 ms | Geographic distance to nearest WS shard |
| WebSocket Server to OT Service (internal) | 1-5 ms | Internal network hop |
| OT Transformation (CPU) | < 1 ms per concurrent op | Transform function complexity per op-pair |
| DB write to operation log (Postgres) | 5-20 ms | Write amplification under high concurrency |
| Fan-out via Redis Pub/Sub | 1-3 ms | Redis Pub/Sub throughput at scale |
| Broadcast from WS Server to other clients | 10-30 ms | Geographic distance |
Total P95 target: < 100 ms end-to-end. The most critical optimization is keeping the active document's recent operation log in Redis. This allows the OT transformation step to read concurrent operations from memory rather than hitting Postgres on every keystroke, reducing the critical path from ~25 ms to ~3 ms for the transform step alone.
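A sketch of that hot path, assuming ioredis and an illustrative doc:&lt;id&gt;:ops key layout; the Postgres fallback is left as a placeholder.

```typescript
// Hot-path read for the transform step: recent ops live in a Redis list,
// with Postgres as the fallback for very stale clients.
import Redis from "ioredis";

const redis = new Redis();
const RECENT_WINDOW = 500; // roughly a few checkpoints' worth of ops kept in memory

async function appendOp(docId: string, op: object): Promise<void> {
  const key = `doc:${docId}:ops`;
  await redis.rpush(key, JSON.stringify(op));
  await redis.ltrim(key, -RECENT_WINDOW, -1);       // cap memory per active document
}

async function opsSince(docId: string, clientVersion: number, serverVersion: number): Promise<object[]> {
  const needed = serverVersion - clientVersion;
  if (needed === 0) return [];
  if (needed <= RECENT_WINDOW) {
    const raw = await redis.lrange(`doc:${docId}:ops`, -needed, -1);
    return raw.map((entry) => JSON.parse(entry));   // in-memory path, ~1-3 ms
  }
  return loadOpsFromPostgres(docId, clientVersion); // rare slow path for very stale clients
}

declare function loadOpsFromPostgres(docId: string, sinceVersion: number): Promise<object[]>;
```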
Real-World Collaborative Editors: Google Docs, Notion, and Figma
Google Docs pioneered OT at consumer scale. Google's production system uses a variant called "Operational Transformation on Trees" (OT-T) that extends character-level OT to handle rich text structures like nested lists, tables, and embedded images. Google reportedly routes all OT processing for a given document through a single geographic region to avoid the combinatorial complexity of multi-master OT, a design decision that trades some latency for dramatic correctness simplification.
Notion migrated to a CRDT-based architecture (specifically, the YATA sequence CRDT) around 2020. CRDTs allow offline editing and peer-to-peer sync because they do not require a central arbiter. The trade-off: every deleted character must retain a tombstone marker forever, which causes memory growth in long-lived documents. Notion's migration also required a full re-ingestion of all existing documents into the new CRDT format, a months-long engineering effort.
Figma handles collaborative vector design by assigning each design element a unique stable ID and tracking property changes as key-value patches. Because design objects are independent entities (unlike contiguous text characters), Figma avoided full OT or CRDT and instead used last-write-wins with a logical clock for concurrent edits to the same property. This pragmatic approach works because "move this rectangle" and "change its color" rarely conflict in a meaningful way.
Consistency vs. Latency: Trade-offs That Define Collaborative Systems
| Trade-off Axis | OT (Centralized Arbiter) | CRDT (Distributed) |
| --- | --- | --- |
| Offline Support | Limited: server required for transformation | Excellent: clients sync independently |
| Conflict Resolution | Deterministic: server orders all operations | Eventual: nodes merge without coordination |
| Memory Footprint | Low: only active ops in RAM | High: tombstones accumulate permanently |
| Implementation Complexity | High: transform matrix for every op-pair | Very High: formal correctness proofs required |
| Latency (Online) | Low: single round-trip via WebSocket | Low: local apply, background async sync |
| Hot Document Scale | Limited by single OT node throughput | Scales horizontally with document sharding |
Critical Failure Mode: The Thundering Herd on Reconnect. If a hot document has 1,000 simultaneous users and the OT service node crashes, all 1,000 WebSocket connections reconnect at the same instant. This generates a massive burst of reconnect and re-subscription requests. Mitigation requires exponential backoff with jitter on the client reconnection logic and a session resume token so clients attach to the nearest server version checkpoint rather than replaying from version 0.
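A client-side sketch of that mitigation: exponential backoff with full jitter, plus a resume message carrying the last acknowledged server version. The message shape and endpoint are assumptions.

```typescript
// Reconnect sketch: exponential backoff with full jitter, resuming from the
// last acknowledged server version instead of replaying from version 0.
let attempt = 0;
let lastServerVersion = 0; // updated whenever an op broadcast is applied

function reconnect(docId: string): void {
  const base = Math.min(30_000, 1_000 * 2 ** attempt); // cap backoff at 30s
  const delay = Math.random() * base;                  // full jitter spreads the herd
  setTimeout(() => {
    const ws = new WebSocket(`wss://collab.example.com/docs/${docId}`);
    ws.onopen = () => {
      attempt = 0;
      // Resume token: ask the server only for ops after the last version we saw.
      ws.send(JSON.stringify({ type: "resume", fromVersion: lastServerVersion }));
    };
    ws.onclose = () => {
      attempt++;
      reconnect(docId);
    };
  }, delay);
}
```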
OT vs. CRDT: A Decision Framework for Your Architecture
Use this guide to choose the right synchronization strategy before committing to either path.
Choose OT (Operational Transformation) when:
- Your document structure is primarily linear text: rich text editors, code editors, Markdown editors.
- You have a reliable internet connection as a baseline user assumption.
- You need character-level precision with deterministic conflict resolution and a single source of truth.
- You want simpler client code: clients only apply transforms received from the server, never coordinate peer-to-peer.
Choose CRDTs when:
- Your application must support offline-first editing with seamless merge on reconnect.
- You are building a peer-to-peer collaborative tool where a central server is undesirable or cost-prohibitive.
- Your data model is set-like or map-like: collaborative whiteboards, JSON document editors, property panels.
- You are willing to accept higher memory overhead and the engineering cost of CRDT correctness for the distributed model.
Hybrid Approach: Figma uses a CRDT-inspired model for independent objects (no central transform needed) but resolves same-property conflicts with a logical clock (last-write-wins). This is the pragmatic middle ground for complex structured documents that mix independent and dependent elements.
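A toy model of that hybrid approach: last-write-wins per property, ordered by a Lamport-style logical clock with the author ID as a tie-breaker. All names here are illustrative, not Figma's actual schema.

```typescript
// Last-write-wins per property with a logical clock, as a toy model of the
// hybrid approach: independent objects never conflict; same-property writes
// are resolved by (clock, authorId) ordering.
interface PropertyPatch {
  objectId: string;   // stable ID of the design element
  property: string;   // e.g. "fill" or "x"
  value: unknown;
  clock: number;      // logical clock of the writer
  authorId: string;   // tie-breaker when clocks are equal
}

type ObjectState = Map<string, PropertyPatch>; // property name -> winning patch

function merge(state: ObjectState, incoming: PropertyPatch): void {
  const current = state.get(incoming.property);
  const incomingWins =
    !current ||
    incoming.clock > current.clock ||
    (incoming.clock === current.clock && incoming.authorId > current.authorId);
  if (incomingWins) state.set(incoming.property, incoming);
}
```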
Delivering This Design in a System Design Interview
When presenting the Collaborative Document system, structure your delivery in three acts to demonstrate both breadth and depth.
Act 1 - Frame the Core Problem (2 minutes): Start with the Position Shift Problem. Draw Alice and Bob editing "The cat is happy" concurrently. Show how a naive system glues Bob's word onto "cat" and produces "The fat catvery  is happy." This demonstrates systems thinking before you draw a single architecture box.
Act 2 - Propose the Architecture (5 minutes): Draw the diagram above on the whiteboard. Highlight the OT Service as the central arbiter. Explain why you need multiple WebSocket shards (connection limits per node), why Redis Pub/Sub is used for fan-out (avoiding routing every broadcast through the OT service again), and why Postgres is the durable store (operation log = version history).
Act 3 - Defend Your Trade-offs (3 minutes): The interviewer will ask: "Why not CRDTs? Why not just use a database lock?" Walk through the OT vs. CRDT decision matrix above. Then address the hot document problem: what happens when 10,000 users edit the same viral document? Answer: shard the OT Service by document ID so that all editors of one document route to the same OT partition.
| Interviewer Question | Strong Answer |
| --- | --- |
| How do you handle a user going offline mid-edit? | Buffer operations locally with the client version; replay and transform against the server log on reconnect |
| How do you scale the OT service beyond one node? | Shard by document ID so that all editors of the same document route to the same OT node |
| How do you implement version history? | The operation log IS the version history; replay from any checkpoint to reconstruct any past state |
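The sharding answer above can be as simple as a stable hash of the document ID. A sketch follows, with plain hash-mod shown for clarity; production systems typically upgrade to consistent (or rendezvous) hashing so that resharding moves fewer documents.

```typescript
// Route all editors of one document to the same OT partition.
function shardForDocument(docId: string, shardCount: number): number {
  // FNV-1a string hash: stable, cheap, and good enough for routing.
  let hash = 0x811c9dc5;
  for (let i = 0; i < docId.length; i++) {
    hash ^= docId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % shardCount;
}

// Every WebSocket shard uses the same function, so editors of "doc-123"
// always land on the same OT node regardless of which shard accepted them.
const partition = shardForDocument("doc-123", 8);
```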
Open Source Foundations: ShareDB, Yjs, and Automerge
The collaborative editing ecosystem has several battle-tested OSS options for different architectural philosophies.
ShareDB (by the ShareJS team) implements a server-centric OT architecture in Node.js, backed by MongoDB or PostgreSQL for operation log persistence. It handles the full transform matrix for plain text and JSON document types and powers several production collaborative tools out of the box.
Yjs is a high-performance CRDT framework that supports pluggable network providers: WebSockets for server-mediated sync, WebRTC for peer-to-peer. Yjs is used by Tiptap, ProseMirror integrations, and a growing ecosystem of no-code and collaborative SaaS platforms.
Automerge is an academic-grade CRDT library providing a JSON-like document with full change history. It offers the most formal correctness guarantees but carries the highest memory overhead, so it is best suited for smaller documents or archival-first use cases where history preservation is more important than runtime memory efficiency.
Lessons Learned From Building Collaborative Systems at Scale
Lesson 1 - Start with a single OT node and shard by document ID from day one. Adding document-level routing as an afterthought means rewriting the client session resume protocol. Design the routing layer before you have users.
Lesson 2 - The operation log is your debugging superpower. Because every edit is logged with author ID and timestamp, replaying the log lets you diagnose exactly which concurrent operation caused a corruption. Never truncate the log in production without archiving it first.
Lesson 3 - Presence is harder than edits. Broadcasting cursor positions and user avatars requires a separate, extremely low-latency channel. At 1,000 users, cursor position broadcasts can generate more traffic than the actual document operations. Implement presence throttling: update cursor positions at most 10 times per second per user.
Lesson 4 - Snapshots are not optional at scale. An operation log with 10 million entries takes minutes to replay. Checkpoint snapshots every 100-500 operations are essential for fast session initialization and disaster recovery.
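A sketch of the presence throttle from Lesson 3: a trailing-edge timer that coalesces cursor moves into at most 10 broadcasts per second per user. The message shape is an assumption.

```typescript
// Presence throttle: at most one cursor broadcast per PRESENCE_INTERVAL_MS,
// coalescing intermediate positions so only the latest one is sent.
const PRESENCE_INTERVAL_MS = 100; // 10 updates/sec per user

let pendingCursor: { line: number; column: number } | null = null;
let flushTimer: ReturnType<typeof setTimeout> | null = null;

function onLocalCursorMove(
  position: { line: number; column: number },
  send: (payload: object) => void,
): void {
  pendingCursor = position;               // coalesce: only the newest position survives
  if (flushTimer === null) {
    flushTimer = setTimeout(() => {
      send({ type: "presence", cursor: pendingCursor });
      pendingCursor = null;
      flushTimer = null;
    }, PRESENCE_INTERVAL_MS);
  }
}
```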
TLDR & Key Takeaways for Collaborative Document Design
- The core problem is the Position Shift Problem: concurrent edits invalidate each other's character indices, producing corrupted documents in naive systems.
- OT solves this by transforming each incoming operation against all concurrent operations before applying it, guaranteeing all clients converge to the same state.
- The architecture uses WebSocket shards for connection handling, a centralized OT Service as the arbiter, Redis for active doc caching and Pub/Sub fan-out, and Postgres for durable operation log storage.
- Scale the OT service by sharding on document ID so that all editors of one document always route to the same partition.
- The key trade-off is OT (simpler for online, linear text) vs. CRDT (better for offline and peer-to-peer use cases).
- Never neglect presence: cursor broadcasting is a separate high-frequency channel that needs independent throttling and its own latency budget.
Related Posts
- System Design HLD: Chat & Messaging - Real-time WebSocket fan-out patterns that directly mirror the broadcast mechanics of collaborative editing at scale.
- System Design HLD: Notification Service - How to architect the presence and alert delivery layer that complements a collaborative document editor.
- System Design HLD: File Storage & Sync - The companion problem of syncing binary document blobs, version snapshots, and media attachments at cloud scale.