The Role of Data in Precise Capacity Estimations for System Design
Don't guess. Calculate. We explain how to estimate QPS, storage, and bandwidth for your system design.
TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work: DAU, QPS, Storage/day, and Bandwidth/day.
The Restaurant Napkin Math
Before opening a restaurant, you estimate:
- Expected customers per day.
- Average orders per customer.
- Average order size (for kitchen capacity).
- Peak hours (for staffing).
Software capacity estimation follows the same logic. You are sizing the kitchen before building the restaurant.
The Four-Step Estimation Pipeline
Every system design capacity estimation follows this flow:
```mermaid
flowchart LR
    DAU["Daily Active Users\n(e.g., 10M)"] --> QPS["Convert to QPS\n(requests/sec)"]
    QPS --> Storage["Storage/day\n(data written)"]
    QPS --> Bandwidth["Bandwidth/day\n(data read/transferred)"]
    Storage --> Total["Total infra sizing\n(servers, DB, cache)"]
    Bandwidth --> Total
```
Step 1: DAU to QPS
$$\text{QPS} = \frac{\text{DAU} \times \text{requests per user per day}}{86400 \text{ seconds}}$$
Example: Twitter-scale
- 100M DAU, each user generates 10 requests/day (timelines, searches, posts).
- $QPS = (100M \times 10) / 86400 \approx 11{,}600~\text{RPS}$
- Peak is typically 2-3× average: ~35,000 RPS peak.
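The Step 1 formula is a one-liner; a small helper makes the Twitter-scale numbers easy to check. A minimal sketch — the DAU and requests/day figures are the example's assumptions, not measurements:

```python
SECONDS_PER_DAY = 86_400

def dau_to_qps(dau: float, requests_per_user_per_day: float) -> float:
    """Average requests per second implied by daily active users."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY

avg_qps = dau_to_qps(100_000_000, 10)  # Twitter-scale example from above
peak_qps = avg_qps * 3                 # rule of thumb: peak is 2-3x average
print(round(avg_qps))   # -> 11574  (~11,600 RPS)
print(round(peak_qps))  # -> 34722  (~35,000 RPS)
```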
Step 2: Storage per Day
$$\text{Storage/day} = \text{write QPS} \times \text{record size} \times 86400$$
Example: URL Shortener
- 1,000 new URLs/day. Each record = ~500 bytes.
- $1000 \times 500 = 500 \text{ KB/day}$
- Over 5 years: $500 \text{ KB} \times 365 \times 5 \approx 900 \text{ MB}$ → fits in a single DB.
Example: Image Platform (Instagram-scale)
- 1M uploads/day, average image = 1 MB.
- $1M \times 1 \text{ MB} = 1 \text{ TB/day}$ → 365 TB/year. Object storage (S3), not a relational DB.
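Both storage examples are the same multiplication at different scales. A sketch, using decimal units (1 MB = 10^6 bytes) as in the reference table:

```python
MB = 10**6
TB = 10**12

def storage_per_day(writes_per_day: float, record_size_bytes: float) -> float:
    """Bytes written per day = write volume times record size."""
    return writes_per_day * record_size_bytes

# URL shortener: 1,000 new URLs/day at ~500 bytes each
shortener_daily = storage_per_day(1_000, 500)      # 500,000 B = 500 KB/day
shortener_5yr = shortener_daily * 365 * 5          # ~912 MB over 5 years -> one DB

# Image platform: 1M uploads/day at 1 MB each
images_daily = storage_per_day(1_000_000, 1 * MB)  # 10^12 B = 1 TB/day -> object storage
print(shortener_5yr / MB)  # -> 912.5
print(images_daily / TB)   # -> 1.0
```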
Step 3: Bandwidth
$$\text{Read Bandwidth} = \text{read QPS} \times \text{average response size}$$
If read:write ratio is 100:1 (social media timeline):
- Write QPS = 1,000/sec at 100 bytes each → 100 KB/s write.
- Read QPS = 100,000/sec at 10 KB each → 1 GB/s read. A CDN is mandatory.
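The 100:1 example works out like this (a sketch with the example's assumed request sizes):

```python
KB, GB = 10**3, 10**9

write_qps = 1_000
read_write_ratio = 100
read_qps = write_qps * read_write_ratio  # 100,000 reads/sec

write_bandwidth = write_qps * 100        # 100 B per write -> 100,000 B/s
read_bandwidth = read_qps * 10 * KB      # 10 KB per read  -> 10^9 B/s

print(write_bandwidth / KB)  # -> 100.0 (KB/s)
print(read_bandwidth / GB)   # -> 1.0 (GB/s): serve this from a CDN, not app servers
```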
Reference Numbers to Memorize
| Quantity | Approximate Value |
| --- | --- |
| Seconds in a day | 86,400 (~10^5) |
| Bytes in 1 MB | 10^6 |
| Bytes in 1 GB | 10^9 |
| Bytes in 1 TB | 10^12 |
| Average SSD latency | 1 ms |
| Average DB query (indexed) | 1–10 ms |
| Average network request (same DC) | 1–5 ms |
| Typical API response size | 1–50 KB |
| Typical image size (compressed) | 200 KB – 2 MB |
| Video (1080p, 1 hour) | ~1.5 GB |
Worked Example: Design a Pastebin
Assumptions:
- 1M DAU. Read:Write = 10:1. Average paste = 10 KB.
- Write QPS = (1M × 1 paste/day) / 86400 ≈ 12 writes/sec
- Read QPS = 12 × 10 = 120 reads/sec
- Storage: 12 writes/sec × 10 KB × 86400 ≈ 10 GB/day → ~10 TB over 3 years.
- Read bandwidth: 120 reads/sec × 10 KB = 1.2 MB/sec → no CDN needed at this scale.
What this tells you:
- A single PostgreSQL instance can comfortably handle sub-1,000 writes/sec; 12 writes/sec is trivial.
- Storage backend should be durable object storage to handle 10 TB over 3 years.
- No CDN or caching tier needed at this scale: 120 RPS fits in a single app instance.
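The whole napkin calculation can be reproduced end to end. A sketch of the same estimate; note the article rounds write QPS up to 12 before multiplying, so its 120 reads/sec and 10 TB figures are slightly coarser than the raw numbers here:

```python
SECONDS_PER_DAY = 86_400

dau = 1_000_000
pastes_per_user_per_day = 1
read_write_ratio = 10
paste_size = 10_000          # 10 KB per paste, in bytes

write_qps = dau * pastes_per_user_per_day / SECONDS_PER_DAY  # ~11.6 -> "12"
read_qps = write_qps * read_write_ratio                      # ~116  -> "120"
daily_storage = dau * pastes_per_user_per_day * paste_size   # 10^10 B = 10 GB/day
storage_3yr = daily_storage * 365 * 3                        # ~1.1e13 B
read_bandwidth = read_qps * paste_size                       # ~1.2e6 B/s

print(f"{write_qps:.0f} writes/sec, {read_qps:.0f} reads/sec")
print(f"{daily_storage / 10**9:.0f} GB/day, {read_bandwidth / 10**6:.1f} MB/s read")
```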
Common Estimation Mistakes
| Mistake | Why It Matters |
| --- | --- |
| Ignoring peak-to-average ratio | Sizing for average means you can't handle 3× traffic spikes |
| Forgetting replication overhead | A 1 TB DB with 3 replicas = 3 TB stored |
| Treating all writes as equal | Writes to a hot row (stock ticker, popular post) create hotspots |
| Not accounting for growth | A system sized for today will be undersized in 12 months; plan for 3–5× |
| Ignoring Pareto: 1% of users drive 90% of traffic | A few power users can dominate the system |
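Two of these mistakes, replication overhead and growth, are simple multipliers you can bake into any raw estimate. A minimal sketch; the 3 replicas and 3× growth factor are the rules of thumb from the table above, not universal constants:

```python
def provisioned_bytes(raw_bytes: float, replicas: int = 3, growth: float = 3.0) -> float:
    """Raw storage estimate adjusted for replica copies and growth headroom."""
    return raw_bytes * replicas * growth

# A "1 TB" database actually needs ~9 TB of provisioned storage
print(provisioned_bytes(10**12) / 10**12)  # -> 9.0
```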
Summary
- DAU → QPS → Storage → Bandwidth is the standard four-step pipeline.
- Peak QPS = 2-3× average; always design for peak.
- 10^5 seconds/day is the key constant: it converts user behavior to per-second rates.
- Compare storage requirements early: 1 GB/day → relational DB. 1 TB/day → object storage.
- High read bandwidth → CDN. Low bandwidth → single server is fine.
Practice Quiz
A social network has 50M DAU. Each user reads 20 posts per day. What is the approximate read QPS?
- A) ~580 RPS
- B) ~11,600 RPS
- C) ~1,000,000 RPS
Answer: B ((50M × 20) / 86400 ≈ 11,574 ≈ 11,600 RPS)
Your system writes 100 bytes per transaction at 1,000 writes/sec. How much DB storage do you need per year?
- A) ~3 GB
- B) ~3 TB
- C) ~3 PB
Answer: B (100 bytes × 1,000/sec × 86,400 × 365 ≈ 3.15 × 10^12 bytes ≈ 3 TB)
Your service has a read:write ratio of 1000:1. Writes are 10 RPS, each response is 50 KB. What is the read bandwidth in GB/s?
- A) 0.5 GB/s
- B) 500 GB/s
- C) 5 GB/s
Answer: A (10 × 1,000 reads/sec × 50 KB = 500,000 KB/s = 500 MB/s = 0.5 GB/s)

Written by
Abstract Algorithms
@abstractalgorithms