System Design HLD Example: File Storage and Sync (Dropbox and Google Drive)
Build an HLD for file storage, metadata consistency, and cross-device synchronization.
By Abstract Algorithms. AI-assisted content: this post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: Cloud sync systems separate immutable blob storage (S3) from atomic metadata operations (PostgreSQL), using chunk-level deduplication to optimize storage costs and delta-sync events to minimize bandwidth.
Dropbox serves 700 million registered users who edit files simultaneously across desktop, mobile, and web clients. The hard problem is not storing bytes; object stores like S3 handle petabyte durability reliably. The real complexity is metadata consistency: if a user edits a file offline on a laptop while the same file is modified on a phone, the system must detect the conflict, preserve both versions, and let the user resolve it without any data loss.
Imagine a user uploading a 2GB 4K video. If the upload is a single HTTP request and the network drops at 99%, the user must restart from zero. If your system doesn't deduplicate, and 1,000 users upload the same viral video, you pay for 2TB of redundant storage. If your sync logic isn't delta-based, every time a user changes one character in a 100MB document, the entire file is re-transmitted.
Designing a file sync system teaches you how to separate blob durability from metadata correctness, a separation pattern that recurs in object databases, CDNs, and distributed filesystems at every scale. By the end of this guide, you will know how to design for petabyte-scale durability while maintaining sub-second synchronization across a global fleet of devices.
File Storage & Sync: Use Cases & Requirements
A Dropbox-style system must bridge the gap between high-latency file transfers and low-latency metadata operations.
Functional Requirements
- File Upload/Download: Support files up to 50GB via chunked, resumable transfers.
- Versioning: Maintain a history of file versions (e.g., last 30 days) to allow recovery from accidental edits or ransomware.
- Deduplication: Save storage by only storing identical chunks once, even across different user accounts.
- Multi-device Sync: Automatically propagate changes to all devices owned by a user.
- Sharing & Permissions: Support private sharing, public links, and granular permissions (View/Edit).
- Conflict Resolution: Detect and handle concurrent edits to the same file.
Non-Functional Requirements
- Durability (11 9s): Files must never be lost. We rely on object storage with erasure coding.
- Strong Consistency (for Metadata): Users must never see an inconsistent file tree (e.g., a file appearing in two folders or a version pointing to non-existent chunks).
- Availability (99.9%): Users expect to access their files at all times.
- Bandwidth Efficiency: Minimize data transfer using delta-sync and compression.
Basics: Baseline Architecture
The system is split into two distinct planes: the Data Plane (heavy bytes) and the Control Plane (metadata).
- Block Service: Handles file chunking, hashing, and storage in an object store (like AWS S3).
- Metadata Service: Manages the file hierarchy, version history, and sharing permissions in a relational database (PostgreSQL).
- Sync Service: Propagates metadata change events to clients using WebSockets or Long Polling.
- Client App: Handles local file watching, chunking, hashing, and delta-application.
Mechanics: Key Logic
1. The 4 MB Chunking Strategy
Instead of treating a file as a single blob, we split it into fixed-size chunks (e.g., 4 MB).
- Benefit: If an upload fails at chunk 10 of 100, we only retry chunk 10.
- Benefit: Parallelism. A client can upload 4 chunks simultaneously to saturate bandwidth.
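A minimal sketch of this chunk-and-hash step in Python (it assumes the whole file fits in memory; a real client would stream from disk, and chunk_and_hash is a name chosen here for illustration):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks

def chunk_and_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[tuple[str, bytes]]:
    """Split a byte stream into fixed-size chunks and SHA-256 hash each one.

    Returns an ordered list of (hex_digest, chunk_bytes) pairs. If an upload
    fails partway, only the unacknowledged chunks need to be retried.
    """
    chunks = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks
```

Because each chunk hashes independently, a client can also upload several chunks in parallel without coordinating state between them.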
2. Content-Addressed Storage (Deduplication)
Every chunk is hashed (SHA-256). This hash becomes the unique ID of the chunk in the object store.
- Before uploading chunk H, the client asks the server: "Do you have a chunk with hash H?"
- If the server says "Yes," the client skips the upload and simply links the file's metadata to the existing chunk.
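On the server side, this check-before-upload exchange reduces to a set-membership filter. A minimal sketch (in production the known-hash set would live in Redis rather than an in-process Python set, and the function name is illustrative):

```python
def chunks_to_upload(proposed_hashes: list[str], known_hashes: set[str]) -> list[str]:
    """Dedup gate: given the client's proposed chunk hashes, return only
    the ones the server has never stored, preserving upload order."""
    return [h for h in proposed_hashes if h not in known_hashes]

# If every proposed hash is already known, the client uploads nothing and
# the commit step just links metadata to the existing chunks.
```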
3. Delta Synchronization (The Cursor)
To keep devices in sync without constant polling:
- Every metadata change is assigned a monotonic change_id.
- The client stores a sync_cursor (the last change_id it processed).
- On reconnect, the client asks: "Give me all changes since change_id = X."
- The server returns a list of deltas (e.g., "File A renamed to B", "Chunk 5 of File C updated").
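The cursor protocol can be sketched with an in-memory change log (ChangeLog and Change are illustrative names; the real log would be a PostgreSQL sequence plus a changes table):

```python
from dataclasses import dataclass

@dataclass
class Change:
    change_id: int
    delta: str  # e.g. "File A renamed to B"

class ChangeLog:
    """In-memory stand-in for the server's ordered metadata change log."""

    def __init__(self) -> None:
        self._changes: list[Change] = []
        self._next_id = 1  # monotonic counter, like a DB sequence

    def append(self, delta: str) -> int:
        change = Change(self._next_id, delta)
        self._changes.append(change)
        self._next_id += 1
        return change.change_id

    def since(self, cursor: int) -> list[Change]:
        """Everything a device missed: changes with change_id > its cursor."""
        return [c for c in self._changes if c.change_id > cursor]
```

A device that reconnects simply calls since(sync_cursor) once, applies the deltas in order, and advances its cursor to the highest change_id returned.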
Estimations & Design Goals
- Scale: 100M Daily Active Users.
- Storage: 100M users * 10GB avg = 1 Exabyte total storage.
- Throughput: 1M writes per second (metadata), 10GB/s aggregate upload.
- Dedup Ratio: Typically 30-50% in consumer clouds due to common OS files and shared media.
- Goal: Minimize S3 API costs by caching chunk metadata in Redis.
High-Level Design
graph TD
Client[Client App] -->|1. Init Upload| Meta[Metadata API]
Client -->|2. Upload Chunks| Block[Block Service]
Block -->|3. Check Dedup| Redis[(Redis Hash Cache)]
Block -->|4. Store Bytes| S3[Object Storage - S3]
Block -->|5. ACK Chunks| Client
Client -->|6. Commit| Meta
Meta -->|7. Update Tree| DB[(PostgreSQL)]
Meta -->|8. Emit Event| Kafka{Kafka}
Kafka -->|9. Fan Out| Sync[Sync Service]
Sync -->|10. Notify| OtherClients[Other Devices]
The upload flow separates chunk durability from metadata correctness into two parallel tracks. The Block Service handles the heavy byte operations: checking Redis for a known chunk hash, writing new chunks to S3, and acknowledging receipt to the client. The Metadata Service handles the logical file tree: atomically updating the folder structure in PostgreSQL and emitting a change event to Kafka. The Sync Service consumes those Kafka events and fans them out to all other devices via WebSocket, completing the cross-device propagation path within seconds of the original save.
File and Chunk Metadata Schema
These two tables are the core of the metadata plane. The chunk table is content-addressed: the chunk hash serves as both its unique identifier and its storage key in S3.
| Table | Column | Type | Description |
| --- | --- | --- | --- |
| files | file_id | UUID | Unique identifier for a file version |
| files | user_id | UUID | Owner of the file |
| files | parent_folder_id | UUID | Parent folder in the file tree |
| files | name | VARCHAR | Display name of the file |
| files | size_bytes | BIGINT | Total file size |
| files | chunk_ids | TEXT[] | Ordered list of chunk SHA-256 hashes |
| files | version | INTEGER | Monotonic version counter for conflict detection |
| files | created_at | TIMESTAMP | Version creation time |
| chunks | chunk_hash | CHAR(64) | SHA-256 hash; also the S3 object key |
| chunks | size_bytes | INTEGER | Chunk size (max 4MB) |
| chunks | ref_count | INTEGER | Number of files referencing this chunk (for GC) |
Deep Dive: Content-Addressed Storage, Delta Sync Cursor, and Conflict Resolution Mechanics
The three hardest problems in a cloud file sync system are storing bytes efficiently without duplication, propagating changes to devices without polling, and resolving conflicts when two devices edit the same file concurrently offline.
Internals: Content-Addressed Chunk Storage and the SHA-256 Deduplication Gate
Every file uploaded to the system is split into fixed-size chunks (typically 4MB) before any byte reaches the network. Each chunk is hashed with SHA-256 to produce a 64-character hex digest that serves as both the chunk's unique identifier and its object key in S3. Before uploading any chunk, the client sends the server a list of proposed chunk hashes (a "check-before-upload" request). The server checks Redis, which maintains a set of all known chunk hashes, and responds with the subset the client actually needs to upload. For a user uploading the same 2GB video they uploaded last month, or for 1,000 users uploading the same viral clip, the deduplication gate returns "you need to upload 0 of these chunks" and the commit step simply links the file's metadata to the existing chunk records without any S3 API call. This is content-addressed storage (CAS): the content determines the address, and identical content shares one address.
The ref_count column in the chunk table tracks how many file records reference each chunk. When a file is deleted or overwritten, its chunk references are decremented. A background garbage collector periodically scans for chunks with ref_count = 0 and issues S3 delete calls. This prevents orphaned bytes from accumulating and controls storage cost at petabyte scale.
Performance Analysis: The Delta Sync Cursor for Bandwidth-Efficient Cross-Device Propagation
The naive synchronization approach is polling: every connected device asks the server every few seconds "has anything changed?" At 100 million devices polling every 10 seconds, the Sync API receives 10M RPS of pure overhead, most of it returning "no changes." The delta sync cursor eliminates this by inverting the model: devices register a persistent WebSocket connection and wait for the server to push change events to them.
Every metadata write in PostgreSQL is assigned a monotonically increasing change_id using a database sequence. When the Metadata Service commits a file change, it emits a Kafka event containing the change_id, the affected user_id, and the change delta (rename, content update, new version). The Sync Service consumes this event and pushes it to all WebSocket connections registered for that user_id. Each device tracks a sync_cursor: the highest change_id it has processed. On reconnect after being offline, the device sends its current cursor to the Sync API, which returns all changes since that cursor ID in a single batch. Devices that were offline for days catch up in one round trip. This cursor pattern is O(changes since last sync) rather than O(all files), making reconnection efficient even for users with 100,000 files who were offline for a week.
Real-World File Storage Architectures: Dropbox, Google Drive, and iCloud
Dropbox Magic Pocket is Dropbox's proprietary object storage system, built to replace their dependence on AWS S3 for cost reasons. Magic Pocket is a custom erasure-coded storage cluster that stores file chunks across spinning-disk JBOD servers in multiple data centers. Dropbox's key architectural decision was to separate the chunk storage layer (Magic Pocket) from the metadata layer (MySQL with Vitess for sharding), exactly mirroring the pattern described in this guide. Moving from S3 to Magic Pocket reduced Dropbox's storage costs by approximately 40% while improving their control over durability guarantees.
Google Drive uses Google's Bigtable for chunk metadata and Colossus (GFS successor) for byte storage. Google's approach to deduplication operates at the block level within each user's storage quota rather than globally across users for privacy reasons. Google's conflict resolution model is server-authoritative: when two devices write different versions of the same file simultaneously, Drive keeps both as separate versioned files and surfaces the conflict to the user rather than attempting automatic merge.
iCloud Drive uses a similar chunk-based architecture but integrates deeply with Apple's CloudKit framework for metadata synchronization. iCloud's sync model uses operational transforms for document-level conflict resolution in Pages and Numbers, falling back to full-version forking for binary files where merge is impossible.
Trade-offs and Failure Modes in File Storage and Sync Design
| Dimension | Trade-off | Recommendation |
| --- | --- | --- |
| Chunk size (4MB fixed vs. variable) | Fixed: simple but inefficient for small edits in large files; Variable (CDC): optimal dedup but complex implementation | Use 4MB fixed for MVP; variable chunking (CDC) for mature systems with many large documents |
| Global dedup vs. per-user dedup | Global: maximum storage savings; Per-user: avoids privacy leakage of cross-user chunk sharing | Per-user dedup for privacy; global only for internal metadata like OS files and common binaries |
| Strong vs. eventual metadata consistency | Strong: users always see a correct file tree; Eventual: faster writes, risk of inconsistent state | Strong consistency for metadata (PostgreSQL with ACID); eventual consistency is not safe for file tree integrity |
| WebSocket vs. Long Polling for sync | WebSocket: true push, sub-second propagation; Long Polling: simpler, higher server load | WebSocket for sync-critical paths; Long Polling as fallback for clients behind restrictive firewalls |
Conflict on Concurrent Offline Edits: When Device A renames a file while Device B deletes it offline, both sync on reconnect. The server must resolve this deterministically. The recommended strategy is "last writer wins" for rename vs. delete conflicts (the rename survives if its change_id is higher) combined with "fork both versions" for content conflicts (both edits are preserved as separate versions with a conflict marker).
S3 Eventual Consistency for Reads-After-Writes: AWS S3 now provides strong read-after-write consistency for new object PUTs. However, for update and delete operations, ensure that the metadata commit to PostgreSQL happens only after the S3 write acknowledgment. Never mark a file commit as complete in the metadata layer before the bytes are durably stored in S3 โ this is the "Persist-Before-Commit" pattern applied to blob storage.
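The Persist-Before-Commit ordering can be made concrete with a small sketch; the three callbacks here are hypothetical stand-ins for an S3 client and a PostgreSQL session:

```python
def commit_file(chunks, s3_put, db_insert_chunk, db_commit_metadata):
    """Persist-before-commit ordering for a file upload.

    Bytes go to the object store first; the metadata commit that makes
    the file visible runs only after every chunk is durably acknowledged.
    A crash at any point leaves at worst orphaned chunks (reclaimable by
    GC), never a visible file pointing at missing bytes.
    """
    for chunk_hash, data in chunks:
        s3_put(chunk_hash, data)       # 1. durable bytes first
        db_insert_chunk(chunk_hash)    # 2. then the chunk record
    db_commit_metadata([h for h, _ in chunks])  # 3. file becomes visible last
```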
Decision Guide: Chunk Size, Storage Backend, and Sync Strategy
| System Characteristics | Recommended Design Choice |
| --- | --- |
| Files primarily under 1MB | Use 1MB fixed chunks; large 4MB chunks waste storage for small files |
| Files primarily over 100MB (video, 4K images) | Use 8–16MB chunks; reduces chunk count in metadata and S3 API call overhead |
| Sub-second sync propagation required | WebSocket-based push with Kafka fan-out |
| Sync latency of under 30 seconds acceptable | Long Polling with 20-second timeout intervals (simpler ops overhead) |
| Multi-region active-active write support | Conflict-free Replicated Data Types (CRDTs) for metadata; extremely complex โ prefer active-passive multi-region for most systems |
| Storage cost is the primary constraint | Enable global deduplication; use variable-size chunking (CDC algorithm) for maximum dedup efficiency |
| Durability constraint is 99.999999999% (11 9s) | Store chunks on AWS S3 Standard or GCS with multi-region replication enabled |
Interview Delivery Example: Designing a Cloud File Sync System in 45 Minutes
Minutes 1–5 (Frame the Problem): Open with the multi-device failure scenario. A user edits a 100MB document on their laptop and closes the lid before the upload completes. They open their phone on a train with intermittent connectivity. The sync system must handle partial uploads, resume from the last successful chunk, and not re-transmit bytes already safely stored. Establish NFRs: 11-nines durability, eventual sync within 30 seconds, support for 50GB files.
Minutes 6–15 (Two-Plane Architecture): State the separation between the Block Service (byte plane) and the Metadata Service (control plane) before drawing any boxes. Explain why these must be separate: the Block Service is stateless and horizontally scalable, while the Metadata Service requires ACID guarantees on a relational database for file tree integrity.
Minutes 16–30 (Chunking and Deduplication): Walk through the chunking algorithm (4MB fixed, SHA-256 hash). Explain the check-before-upload optimization. Show how the chunk table's ref_count enables safe garbage collection. Mention that deduplication saves 30–50% of storage in consumer clouds.
Minutes 31–40 (Delta Sync Cursor): Draw the Kafka → Sync Service → WebSocket path. Explain the sync_cursor and how offline devices catch up in one batch read rather than polling. Describe the conflict resolution strategy for simultaneous offline edits.
Minutes 41–45 (Trade-offs): Compare fixed chunk size vs. content-defined chunking. Discuss global vs. per-user deduplication privacy implications. Mention how Dropbox, Google Drive, and iCloud implement this pattern with different storage backends.
Open-Source File Storage and Sync Tools Worth Knowing
- MinIO: S3-compatible object storage deployable on bare metal or Kubernetes. Used as the chunk storage backend for self-hosted Nextcloud and private Dropbox-like systems.
- Nextcloud: Open-source Dropbox alternative. Implements chunked uploads via the Nextcloud Chunked Upload protocol, compatible with the design described here.
- Rclone: A command-line tool for syncing files between cloud storage providers. Its delta-sync algorithm mirrors the cursor-based approach described above: it uses file modification time and size as change indicators.
- SeaweedFS: A distributed blob storage system optimized for storing billions of small files efficiently, addressing S3's small-file performance overhead.
Lessons Learned from File Sync Failures in Production
A Missing Chunk Reference Caused Silent Data Loss for 0.01% of Files. A file storage platform had a race condition between the chunk upload acknowledgment and the metadata commit. If the server crashed between these two steps, the metadata record pointed to a chunk hash that was never successfully written to S3. When the user downloaded the file later, the Block Service returned a 404 for the missing chunk, and the client displayed a corrupted file. The fix: enforce a strict write order: write the chunk to S3, write the chunk record to PostgreSQL, and only then allow the metadata commit to proceed. The metadata commit must be gated on chunk durability.
Sync Loops from Timestamp Precision Mismatch. A desktop client on macOS used nanosecond file modification timestamps, while the server stored millisecond timestamps in PostgreSQL. On every sync, the client detected a "change" (nanoseconds differed from milliseconds) and re-uploaded unmodified files, creating an infinite sync loop. The fix: normalize all timestamps to millisecond precision at the client before comparing. Timestamp precision mismatches between OS filesystems and server databases are among the most common sources of sync bugs.
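The fix amounts to truncating both sides to the same precision before comparing. A minimal sketch (function names are illustrative):

```python
def normalize_mtime_ms(mtime_s: float) -> int:
    """Truncate a filesystem modification time (seconds, possibly with
    nanosecond precision as on macOS/APFS) to whole milliseconds."""
    return int(mtime_s * 1000)

def mtimes_equal(client_mtime_s: float, server_mtime_ms: int) -> bool:
    # Compare at the coarser precision both sides can represent, so a
    # nanosecond-precision client never flags an unmodified file as changed.
    return normalize_mtime_ms(client_mtime_s) == server_mtime_ms
```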
Garbage Collection Deleted Referenced Chunks. A garbage collector ran a batch job to delete chunks with ref_count = 0. Due to a race with an in-progress upload that had not yet committed its metadata, a chunk was deleted while a file was mid-upload. The user's file became permanently unrestorable. The fix: implement a "grace period" for chunks: when a chunk's ref_count hits 0, it is only scheduled for deletion after 24 hours, allowing any in-flight uploads to complete their metadata commit.
TLDR & Key Takeaways
- Separate blob storage (S3, durable, immutable, cheap at scale) from metadata storage (PostgreSQL, ACID, file tree integrity). These two planes scale independently and fail independently.
- Use 4MB fixed-size chunking with SHA-256 content-addressing for deduplication. The chunk hash is both the dedup key and the S3 object key; identical content shares one address regardless of which user or device uploaded it.
- The check-before-upload deduplication gate (query Redis for known chunk hashes before any S3 API call) saves 30–50% of storage and bandwidth in consumer clouds.
- Use a monotonic change_id cursor for delta sync. Devices that reconnect after being offline catch up in one batch read rather than polling. WebSocket push delivery keeps propagation under 30 seconds.
- Conflicts between concurrent offline edits must be handled with a deterministic resolution strategy. "Fork both versions" is safest for content conflicts; "last writer wins" is acceptable for metadata-only conflicts like renames.
- Never mark a file commit complete in the metadata layer before the chunk bytes are durably confirmed by S3; this is the file storage equivalent of the Persist-Before-Call pattern.