System Design HLD Example: File Storage and Sync (Dropbox and Google Drive)
Build an HLD for file storage, metadata consistency, and cross-device synchronization.
By Abstract Algorithms. AI-assisted content: this post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: Cloud sync systems separate immutable blob storage (S3) from atomic metadata operations (PostgreSQL), using chunk-level deduplication to optimize storage costs and delta-sync events to minimize bandwidth.
Dropbox serves 700 million registered users who edit files simultaneously across desktop, mobile, and web clients. The hard problem is not storing bytes; object stores like S3 handle petabyte durability reliably. The real complexity is metadata consistency: if a user edits a file offline on a laptop while the same file is modified on a phone, the system must detect the conflict, preserve both versions, and let the user resolve it without any data loss.
Imagine a user uploading a 2GB 4K video. If the upload is a single HTTP request and the network drops at 99%, the user must restart from zero. If your system doesn't deduplicate, and 1,000 users upload the same viral video, you pay for 2TB of redundant storage. If your sync logic isn't delta-based, every time a user changes one character in a 100MB document, the entire file is re-transmitted.
Designing a file sync system teaches you how to separate blob durability from metadata correctness, a separation pattern that recurs in object databases, CDNs, and distributed filesystems at every scale. By the end of this guide, you will know how to design for petabyte-scale durability while maintaining sub-second synchronization across a global fleet of devices.
File Storage & Sync: Use Cases & Requirements
A Dropbox-style system must bridge the gap between high-latency file transfers and low-latency metadata operations.
Functional Requirements
- File Upload/Download: Support files up to 50GB via chunked, resumable transfers.
- Versioning: Maintain a history of file versions (e.g., last 30 days) to allow recovery from accidental edits or ransomware.
- Deduplication: Save storage by only storing identical chunks once, even across different user accounts.
- Multi-device Sync: Automatically propagate changes to all devices owned by a user.
- Sharing & Permissions: Support private sharing, public links, and granular permissions (View/Edit).
- Conflict Resolution: Detect and handle concurrent edits to the same file.
Non-Functional Requirements
- Durability (11 9s): Files must never be lost. We rely on object storage with erasure coding.
- Strong Consistency (for Metadata): Users must never see an inconsistent file tree (e.g., a file appearing in two folders or a version pointing to non-existent chunks).
- Availability (99.9%): Users expect to access their files at all times.
- Bandwidth Efficiency: Minimize data transfer using delta-sync and compression.
Basics: Baseline Architecture
The system is split into two distinct planes: the Data Plane (heavy bytes) and the Control Plane (metadata).
- Block Service: Handles file chunking, hashing, and storage in an object store (like AWS S3).
- Metadata Service: Manages the file hierarchy, version history, and sharing permissions in a relational database (PostgreSQL).
- Sync Service: Propagates metadata change events to clients using WebSockets or Long Polling.
- Client App: Handles local file watching, chunking, hashing, and delta-application.
Mechanics: Key Logic
1. The 4 MB Chunking Strategy
Instead of treating a file as a single blob, we split it into fixed-size chunks (e.g., 4 MB).
- Benefit: If an upload fails at chunk 10 of 100, we only retry chunk 10.
- Benefit: Parallelism. A client can upload 4 chunks simultaneously to saturate bandwidth.
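A minimal sketch of this chunk-and-hash step in Python (it assumes the whole file fits in memory; a real client would stream from disk, and chunk_and_hash is a name chosen here for illustration):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks

def chunk_and_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[str, bytes]]:
    """Split a byte stream into fixed-size chunks and SHA-256 hash each one.

    Returns an ordered list of (hex_digest, chunk_bytes) pairs. If an upload
    fails partway, only the unacknowledged chunks need to be retried.
    """
    chunks = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks
```

Because each chunk hashes independently, a client can also upload several chunks in parallel without coordinating state between them.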
2. Content-Addressed Storage (Deduplication)
Every chunk is hashed (SHA-256). This hash becomes the unique ID of the chunk in the object store.
- Before uploading chunk H, the client asks the server: "Do you have a chunk with hash H?"
- If the server says "Yes," the client skips the upload and simply links the file's metadata to the existing chunk.
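On the server side, this check-before-upload exchange reduces to a set-membership filter. A minimal sketch (in production the known-hash set would live in Redis rather than an in-process Python set, and the function name is illustrative):

```python
def chunks_to_upload(proposed_hashes: list[str], known_hashes: set[str]) -> list[str]:
    """Dedup gate: given the client's proposed chunk hashes, return only
    the ones the server has never stored, preserving upload order."""
    return [h for h in proposed_hashes if h not in known_hashes]

# If every proposed hash is already known, the client uploads nothing and
# the commit step just links metadata to the existing chunks.
```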
3. Delta Synchronization (The Cursor)
To keep devices in sync without constant polling:
- Every metadata change is assigned a monotonic change_id.
- The client stores a sync_cursor (the last change_id it processed).
- On reconnect, the client asks: "Give me all changes since change_id = X."
- The server returns a list of deltas (e.g., "File A renamed to B", "Chunk 5 of File C updated").
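The cursor protocol can be sketched with an in-memory change log (ChangeLog and Change are illustrative names; the real log would be a PostgreSQL sequence plus a changes table):

```python
from dataclasses import dataclass

@dataclass
class Change:
    change_id: int
    delta: str  # e.g. "File A renamed to B"

class ChangeLog:
    """In-memory stand-in for the server's ordered metadata change log."""

    def __init__(self) -> None:
        self._changes: list[Change] = []
        self._next_id = 1  # monotonic counter, like a DB sequence

    def append(self, delta: str) -> int:
        change = Change(self._next_id, delta)
        self._changes.append(change)
        self._next_id += 1
        return change.change_id

    def since(self, cursor: int) -> list[Change]:
        """Everything a device missed: changes with change_id > its cursor."""
        return [c for c in self._changes if c.change_id > cursor]
```

A device that reconnects simply calls since(sync_cursor) once, applies the deltas in order, and advances its cursor to the highest change_id returned.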
Estimations & Design Goals
- Scale: 100M Daily Active Users.
- Storage: 100M users * 10GB avg = 1 Exabyte total storage.
- Throughput: 1M writes per second (metadata), 10GB/s aggregate upload.
- Dedup Ratio: Typically 30-50% in consumer clouds due to common OS files and shared media.
- Goal: Minimize S3 API costs by caching chunk metadata in Redis.
High-Level Design
graph TD
Client[Client App] -->|1. Init Upload| Meta[Metadata API]
Client -->|2. Upload Chunks| Block[Block Service]
Block -->|3. Check Dedup| Redis[(Redis Hash Cache)]
Block -->|4. Store Bytes| S3[Object Storage - S3]
Block -->|5. ACK Chunks| Client
Client -->|6. Commit| Meta
Meta -->|7. Update Tree| DB[(PostgreSQL)]
Meta -->|8. Emit Event| Kafka{Kafka}
Kafka -->|9. Fan Out| Sync[Sync Service]
Sync -->|10. Notify| OtherClients[Other Devices]
The upload flow separates chunk durability from metadata correctness into two parallel tracks. The Block Service handles the heavy byte operations: checking Redis for a known chunk hash, writing new chunks to S3, and acknowledging receipt to the client. The Metadata Service handles the logical file tree: atomically updating the folder structure in PostgreSQL and emitting a change event to Kafka. The Sync Service consumes those Kafka events and fans them out to all other devices via WebSocket, completing the cross-device propagation path within seconds of the original save.
File and Chunk Metadata Schema
These two tables are the core of the metadata plane. The chunk table is content-addressed: the chunk hash serves as both its unique identifier and its storage key in S3.
| Table | Column | Type | Description |
| --- | --- | --- | --- |
| files | file_id | UUID | Unique identifier for a file version |
| files | user_id | UUID | Owner of the file |
| files | parent_folder_id | UUID | Parent folder in the file tree |
| files | name | VARCHAR | Display name of the file |
| files | size_bytes | BIGINT | Total file size |
| files | chunk_ids | TEXT[] | Ordered list of chunk SHA-256 hashes |
| files | version | INTEGER | Monotonic version counter for conflict detection |
| files | created_at | TIMESTAMP | Version creation time |
| chunks | chunk_hash | CHAR(64) | SHA-256 hash; also the S3 object key |
| chunks | size_bytes | INTEGER | Chunk size (max 4MB) |
| chunks | ref_count | INTEGER | Number of files referencing this chunk (for GC) |
Deep Dive: Content-Addressed Storage, Delta Sync Cursor, and Conflict Resolution Mechanics
The three hardest problems in a cloud file sync system are storing bytes efficiently without duplication, propagating changes to devices without polling, and resolving conflicts when two devices edit the same file concurrently offline.
Internals: Content-Addressed Chunk Storage and the SHA-256 Deduplication Gate
Every file uploaded to the system is split into fixed-size chunks (typically 4MB) before any byte reaches the network. Each chunk is hashed with SHA-256 to produce a 64-character hex digest that serves as both the chunk's unique identifier and its object key in S3. Before uploading any chunk, the client sends the server a list of proposed chunk hashes (a "check-before-upload" request). The server checks Redis, which maintains a set of all known chunk hashes, and responds with the subset the client actually needs to upload. For a user uploading the same 2GB video they uploaded last month, or for 1,000 users uploading the same viral clip, the deduplication gate returns "you need to upload 0 of these chunks" and the commit step simply links the file's metadata to the existing chunk records without any S3 API call. This is content-addressed storage (CAS): the content determines the address, and identical content shares one address.
The ref_count column in the chunk table tracks how many file records reference each chunk. When a file is deleted or overwritten, its chunk references are decremented. A background garbage collector periodically scans for chunks with ref_count = 0 and issues S3 delete calls. This prevents orphaned bytes from accumulating and controls storage cost at petabyte scale.
Performance Analysis: The Delta Sync Cursor for Bandwidth-Efficient Cross-Device Propagation
The naive synchronization approach is polling: every connected device asks the server every few seconds "has anything changed?" At 100 million devices polling every 10 seconds, the Sync API receives 10M RPS of pure overhead, most of it returning "no changes." The delta sync cursor eliminates this by inverting the model: devices register a persistent WebSocket connection and wait for the server to push change events to them.
Every metadata write in PostgreSQL is assigned a monotonically increasing change_id using a database sequence. When the Metadata Service commits a file change, it emits a Kafka event containing the change_id, the affected user_id, and the change delta (rename, content update, new version). The Sync Service consumes this event and pushes it to all WebSocket connections registered for that user_id. Each device tracks a sync_cursor: the highest change_id it has processed. On reconnect after being offline, the device sends its current cursor to the Sync API, which returns all changes since that cursor ID in a single batch. Devices that were offline for days catch up in one round trip. This cursor pattern is O(changes since last sync) rather than O(all files), making reconnection efficient even for users with 100,000 files who were offline for a week.
Real-World File Storage Architectures: Dropbox, Google Drive, and iCloud
Dropbox Magic Pocket is Dropbox's proprietary object storage system, built to replace their dependence on AWS S3 for cost reasons. Magic Pocket is a custom erasure-coded storage cluster that stores file chunks across spinning-disk JBOD servers in multiple data centers. Dropbox's key architectural decision was to separate the chunk storage layer (Magic Pocket) from the metadata layer (MySQL with Vitess for sharding), exactly mirroring the pattern described in this guide. Moving from S3 to Magic Pocket reduced Dropbox's storage costs by approximately 40% while improving their control over durability guarantees.
Google Drive uses Google's Bigtable for chunk metadata and Colossus (GFS successor) for byte storage. Google's approach to deduplication operates at the block level within each user's storage quota rather than globally across users for privacy reasons. Google's conflict resolution model is server-authoritative: when two devices write different versions of the same file simultaneously, Drive keeps both as separate versioned files and surfaces the conflict to the user rather than attempting automatic merge.
iCloud Drive uses a similar chunk-based architecture but integrates deeply with Apple's CloudKit framework for metadata synchronization. iCloud's sync model uses operational transforms for document-level conflict resolution in Pages and Numbers, falling back to full-version forking for binary files where merge is impossible.
Trade-offs and Failure Modes in File Storage and Sync Design
| Dimension | Trade-off | Recommendation |
| --- | --- | --- |
| Chunk size (4MB fixed vs. variable) | Fixed: simple but inefficient for small edits in large files; Variable (CDC): optimal dedup but complex implementation | Use 4MB fixed for MVP; variable chunking (CDC) for mature systems with many large documents |
| Global dedup vs. per-user dedup | Global: maximum storage savings; Per-user: avoids privacy leakage of cross-user chunk sharing | Per-user dedup for privacy; global only for internal metadata like OS files and common binaries |
| Strong vs. eventual metadata consistency | Strong: users always see a correct file tree; Eventual: faster writes, risk of inconsistent state | Strong consistency for metadata (PostgreSQL with ACID); eventual consistency is not safe for file tree integrity |
| WebSocket vs. Long Polling for sync | WebSocket: true push, sub-second propagation; Long Polling: simpler, higher server load | WebSocket for sync-critical paths; Long Polling as fallback for clients behind restrictive firewalls |
Conflict on Concurrent Offline Edits: When Device A renames a file while Device B deletes it offline, both sync on reconnect. The server must resolve this deterministically. The recommended strategy is "last writer wins" for rename vs. delete conflicts (the rename survives if its change_id is higher) combined with "fork both versions" for content conflicts (both edits are preserved as separate versions with a conflict marker).
S3 Eventual Consistency for Reads-After-Writes: AWS S3 now provides strong read-after-write consistency for new object PUTs. However, for update and delete operations, ensure that the metadata commit to PostgreSQL happens only after the S3 write acknowledgment. Never mark a file commit as complete in the metadata layer before the bytes are durably stored in S3 โ this is the "Persist-Before-Commit" pattern applied to blob storage.
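The Persist-Before-Commit ordering can be made concrete with a small sketch; the three callbacks here are hypothetical stand-ins for an S3 client and a PostgreSQL session:

```python
def commit_file(chunks, s3_put, db_insert_chunk, db_commit_metadata):
    """Persist-before-commit ordering for a file upload.

    Bytes go to the object store first; the metadata commit that makes
    the file visible runs only after every chunk is durably acknowledged.
    A crash at any point leaves at worst orphaned chunks (reclaimable by
    GC), never a visible file pointing at missing bytes.
    """
    for chunk_hash, data in chunks:
        s3_put(chunk_hash, data)       # 1. durable bytes first
        db_insert_chunk(chunk_hash)    # 2. then the chunk record
    db_commit_metadata([h for h, _ in chunks])  # 3. file becomes visible last
```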
Decision Guide: Chunk Size, Storage Backend, and Sync Strategy
| System Characteristics | Recommended Design Choice |
| --- | --- |
| Files primarily under 1MB | Use 1MB fixed chunks; large 4MB chunks waste storage for small files |
| Files primarily over 100MB (video, 4K images) | Use 8–16MB chunks; reduces chunk count in metadata and S3 API call overhead |
| Sub-second sync propagation required | WebSocket-based push with Kafka fan-out |
| Sync latency of under 30 seconds acceptable | Long Polling with 20-second timeout intervals (simpler ops overhead) |
| Multi-region active-active write support | Conflict-free Replicated Data Types (CRDTs) for metadata; extremely complex โ prefer active-passive multi-region for most systems |
| Storage cost is the primary constraint | Enable global deduplication; use variable-size chunking (CDC algorithm) for maximum dedup efficiency |
| Durability constraint is 99.999999999% (11 9s) | Store chunks on AWS S3 Standard or GCS with multi-region replication enabled |
Interview Delivery Example: Designing a Cloud File Sync System in 45 Minutes
Minutes 1–5 (Frame the Problem): Open with the multi-device failure scenario. A user edits a 100MB document on their laptop and closes the lid before the upload completes. They open their phone on a train with intermittent connectivity. The sync system must handle partial uploads, resume from the last successful chunk, and not re-transmit bytes already safely stored. Establish NFRs: 11-nines durability, eventual sync within 30 seconds, support for 50GB files.
Minutes 6–15 (Two-Plane Architecture): State the separation between the Block Service (byte plane) and the Metadata Service (control plane) before drawing any boxes. Explain why these must be separate: the Block Service is stateless and horizontally scalable, while the Metadata Service requires ACID guarantees on a relational database for file tree integrity.
Minutes 16–30 (Chunking and Deduplication): Walk through the chunking algorithm (4MB fixed, SHA-256 hash). Explain the check-before-upload optimization. Show how the chunk table's ref_count enables safe garbage collection. Mention that deduplication saves 30–50% of storage in consumer clouds.
Minutes 31–40 (Delta Sync Cursor): Draw the Kafka → Sync Service → WebSocket path. Explain the sync_cursor and how offline devices catch up in one batch read rather than polling. Describe the conflict resolution strategy for simultaneous offline edits.
Minutes 41–45 (Trade-offs): Compare fixed chunk size vs. content-defined chunking. Discuss global vs. per-user deduplication privacy implications. Mention how Dropbox, Google Drive, and iCloud implement this pattern with different storage backends.
Open-Source File Storage and Sync Tools Worth Knowing
- MinIO: S3-compatible object storage deployable on bare metal or Kubernetes. Used as the chunk storage backend for self-hosted Nextcloud and private Dropbox-like systems.
- Nextcloud: Open-source Dropbox alternative. Implements chunked uploads via the Nextcloud Chunked Upload protocol, compatible with the design described here.
- Rclone: A command-line tool for syncing files between cloud storage providers. Its delta-sync algorithm mirrors the cursor-based approach described above: it uses file modification time and size as change indicators.
- SeaweedFS: A distributed blob storage system optimized for storing billions of small files efficiently, addressing S3's small-file performance overhead.
Lessons Learned from File Sync Failures in Production
A Missing Chunk Reference Caused Silent Data Loss for 0.01% of Files. A file storage platform had a race condition between the chunk upload acknowledgment and the metadata commit. If the server crashed between these two steps, the metadata record pointed to a chunk hash that was never successfully written to S3. When the user downloaded the file later, the Block Service returned a 404 for the missing chunk, and the client displayed a corrupted file. The fix: enforce a strict write order: write the chunk to S3, write the chunk record to PostgreSQL, and only then allow the metadata commit to proceed. The metadata commit must be gated on chunk durability.
Sync Loops from Timestamp Precision Mismatch. A desktop client on macOS used nanosecond file modification timestamps, while the server stored millisecond timestamps in PostgreSQL. On every sync, the client detected a "change" (nanoseconds differed from milliseconds) and re-uploaded unmodified files, creating an infinite sync loop. The fix: normalize all timestamps to millisecond precision at the client before comparing. Timestamp precision mismatches between OS filesystems and server databases are among the most common sources of sync bugs.
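The fix amounts to truncating both sides to the same precision before comparing. A minimal sketch (function names are illustrative):

```python
def normalize_mtime_ms(mtime_s: float) -> int:
    """Truncate a filesystem modification time (seconds, possibly with
    nanosecond precision as on macOS/APFS) to whole milliseconds."""
    return int(mtime_s * 1000)

def mtimes_equal(client_mtime_s: float, server_mtime_ms: int) -> bool:
    # Compare at the coarser precision both sides can represent, so a
    # nanosecond-precision client never flags an unmodified file as changed.
    return normalize_mtime_ms(client_mtime_s) == server_mtime_ms
```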
Garbage Collection Deleted Referenced Chunks. A garbage collector ran a batch job to delete chunks with ref_count = 0. Due to a race with an in-progress upload that had not yet committed its metadata, a chunk was deleted while a file was mid-upload. The user's file became permanently unrestorable. The fix: implement a "grace period" for chunks: when a chunk's ref_count hits 0, it is only scheduled for deletion after 24 hours, allowing any in-flight uploads to complete their metadata commit.
TLDR & Key Takeaways
- Separate blob storage (S3, durable, immutable, cheap at scale) from metadata storage (PostgreSQL, ACID, file tree integrity). These two planes scale independently and fail independently.
- Use 4MB fixed-size chunking with SHA-256 content-addressing for deduplication. The chunk hash is both the dedup key and the S3 object key; identical content shares one address regardless of which user or device uploaded it.
- The check-before-upload deduplication gate (query Redis for known chunk hashes before any S3 API call) saves 30–50% of storage and bandwidth in consumer clouds.
- Use a monotonic change_id cursor for delta sync. Devices that reconnect after being offline catch up in one batch read rather than polling. WebSocket push delivery keeps propagation under 30 seconds.
- Conflicts between concurrent offline edits must be handled with a deterministic resolution strategy. "Fork both versions" is safest for content conflicts; "last writer wins" is acceptable for metadata-only conflicts like renames.
- Never mark a file commit complete in the metadata layer before the chunk bytes are durably confirmed by S3; this is the file storage equivalent of the Persist-Before-Call pattern.