
The Role of Data in Precise Capacity Estimations for System Design

Don't guess. Calculate. We explain how to estimate QPS, Storage, and Bandwidth for your system de...

Abstract Algorithms
· 13 min read

AI-assisted content.

TLDR: Capacity estimation is the skill of back-of-the-envelope math that tells you whether your system design will survive its traffic before you write a line of code. Four numbers do most of the work: DAU, QPS, Storage/day, and Bandwidth/day.


📖 The Restaurant Napkin Math

Before opening a restaurant, you estimate:

  • Expected customers per day.
  • Average orders per customer.
  • Average order size (for kitchen capacity).
  • Peak hours (for staffing).

Software capacity estimation follows the same logic. You are sizing the kitchen before building the restaurant.


🔍 The Basics: Numbers Every Engineer Should Know

Before you can estimate with confidence, you need a mental model of how numbers scale. You don't need to memorize formulas; you need to internalize a few key relationships.

Data Size Powers of 10

Engineers work in powers of 10 (decimal) for storage and bandwidth estimates:

| Unit | Approximate Value | Mental Model |
|---|---|---|
| Kilobyte (KB) | 10^3 bytes | A plain-text email |
| Megabyte (MB) | 10^6 bytes | A high-resolution photo |
| Gigabyte (GB) | 10^9 bytes | A feature-length movie download |
| Terabyte (TB) | 10^12 bytes | A data center rack shelf |
| Petabyte (PB) | 10^15 bytes | Google processes ~20 PB/day |

Latency Comparison: Where Time Goes

This is one of the most useful reference tables in systems engineering. These numbers help you reason about where bottlenecks live before you write a single line of code:

| Operation | Latency | Insight |
|---|---|---|
| L1 cache hit | 0.5 ns | Nearly instant |
| Main RAM access | 100 ns | 200× slower than L1 |
| SSD random read | 1 ms | 10,000× slower than RAM |
| HDD random read | 10 ms | 10× slower than SSD |
| Same-datacenter network | 1–5 ms | Similar to SSD |
| Cross-region network | 50–150 ms | ~100× a local request |

If a design requires cross-region synchronous calls on every user request, those latency numbers tell you it will feel slow before you test it.
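To make that concrete, here is a minimal sketch of how sequential synchronous hops accumulate, using illustrative mid-range values from the table above (the hop counts and latencies are assumptions for the example, not measurements):

```java
// Sketch: how sequential synchronous network hops accumulate, using
// mid-range values from the latency table above (illustrative only).
public class LatencyBudget {

    static final double SAME_DC_MS = 2.0;        // same-datacenter hop
    static final double CROSS_REGION_MS = 100.0; // cross-region hop

    // Total latency of n sequential (blocking) network hops
    static double sequentialMs(int hops, double perHopMs) {
        return hops * perHopMs;
    }

    public static void main(String[] args) {
        System.out.printf("3 same-DC calls:      %.0f ms%n", sequentialMs(3, SAME_DC_MS));
        System.out.printf("3 cross-region calls: %.0f ms%n", sequentialMs(3, CROSS_REGION_MS));
        // Three chained cross-region calls already cost ~300 ms per user request.
    }
}
```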

From Daily Users to Per-Second Load

The most common beginner mistake is treating DAU as simultaneous users. Ten million daily active users does not mean ten million simultaneous requests. Activity spreads across 86,400 seconds, and most users are inactive most of the day.

Rule of thumb: Roughly 10–20% of daily users are active during peak hour. Of those, each user sends a request every few seconds. The effective peak concurrency is a small fraction of your DAU.

Peak QPS formula: Once you calculate average QPS, multiply by 2–3× to model peak traffic: the surge during lunch hours, major sports events, or viral product launches that no average-case estimate will anticipate.
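As a sketch, the two rules of thumb above look like this in code (the 10M DAU, 15% active fraction, and 3× multiplier are hypothetical inputs):

```java
// Sketch of the two rules of thumb above, with hypothetical inputs:
// peak-hour actives as a fraction of DAU, and the 2–3× peak multiplier.
public class PeakTraffic {

    // Users online during peak hour, assuming 10–20% of DAU are active
    static long peakConcurrentUsers(long dau, double activeFraction) {
        return Math.round(dau * activeFraction);
    }

    // Peak request rate from the average, using the standard 2–3× multiplier
    static double peakQps(double avgQps, double peakMultiplier) {
        return avgQps * peakMultiplier;
    }

    public static void main(String[] args) {
        long dau = 10_000_000L;                              // hypothetical 10M DAU
        System.out.println(peakConcurrentUsers(dau, 0.15));  // ~1.5M online at peak, not 10M
        System.out.println(peakQps(1_157, 3));               // 3× an average ~1,157 RPS
    }
}
```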

Read-Heavy vs. Write-Heavy Workloads

Most consumer apps are read-heavy. Knowing the read:write ratio upfront shapes every downstream estimate. Storage needs, cache sizing, and bandwidth calculations all change dramatically when reads dominate writes:

| App Type | Typical Read:Write Ratio |
|---|---|
| Social media timeline | 100:1 |
| URL shortener | 10:1 |
| Collaborative document editor | 3:1 |
| Financial transaction system | 1:1 or write-heavy |
| IoT sensor pipeline | Write-heavy |

🔢 The Four-Step Estimation Pipeline

Every system design capacity estimation follows this flow:

flowchart LR
    DAU["Daily Active Users (e.g., 10M)"] --> QPS["Convert to QPS (requests/sec)"]
    QPS --> Storage["Storage/day (data written)"]
    QPS --> Bandwidth["Bandwidth/day (data read/transferred)"]
    Storage --> Total["Total infra sizing (servers, DB, cache)"]
    Bandwidth --> Total

Step 1 — DAU to QPS

$$\text{QPS} = \frac{\text{DAU} \times \text{requests per user per day}}{86400 \text{ seconds}}$$

Example โ€” Twitter-scale:

  • 100M DAU, each user generates 10 requests/day (timelines, searches, posts).
  • $QPS = (100M \times 10) / 86400 \approx 11,600~\text{RPS}$
  • Peak is typically 2–3× average: ~35,000 RPS peak.
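The Step 1 formula can be replayed as a minimal code sketch, using the Twitter-scale numbers above:

```java
// The Step 1 formula, applied to the Twitter-scale numbers above.
public class Step1Qps {

    // QPS = DAU × requests per user per day / 86,400
    static double avgQps(long dau, double requestsPerUserPerDay) {
        return dau * requestsPerUserPerDay / 86_400.0;
    }

    public static void main(String[] args) {
        double avg = avgQps(100_000_000L, 10);   // 100M DAU × 10 requests/day
        System.out.printf("Average: %.0f RPS, peak (3x): %.0f RPS%n", avg, avg * 3);
        // ≈ 11,600 RPS average and ≈ 35,000 RPS peak, matching the estimate above
    }
}
```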

Step 2 — Storage per Day

$$\text{Storage/day} = \text{write QPS} \times \text{record size} \times 86400$$

Example โ€” URL Shortener:

  • 1,000 new URLs/day. Each record = ~500 bytes.
  • $1000 \times 500 = 500 \text{ KB/day}$
  • Over 5 years: $500 \text{ KB} \times 365 \times 5 \approx 900 \text{ MB}$, which fits in a single DB.

Example โ€” Image Platform (Instagram-scale):

  • 1M uploads/day, average image = 1 MB.
  • $1M \times 1 \text{ MB} = 1 \text{ TB/day}$ → 365 TB/year. Object storage (S3), not a relational DB.
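Both Step 2 examples run through the same two lines of arithmetic, sketched here with the numbers from the bullets above:

```java
// The Step 2 formula, applied to both examples above:
// URL shortener (1,000 × 500 B/day) and image platform (1M × 1 MB/day).
public class Step2Storage {

    static double storagePerDayBytes(double writesPerDay, double recordSizeBytes) {
        return writesPerDay * recordSizeBytes;
    }

    static double projectedBytes(double perDayBytes, int years) {
        return perDayBytes * 365.0 * years;
    }

    public static void main(String[] args) {
        double shortener = storagePerDayBytes(1_000, 500);        // 500 KB/day
        System.out.printf("Shortener: %.0f KB/day, %.0f MB over 5 years%n",
                shortener / 1e3, projectedBytes(shortener, 5) / 1e6);

        double images = storagePerDayBytes(1_000_000, 1e6);       // 1 TB/day
        System.out.printf("Images: %.0f TB/day, %.0f TB/year%n",
                images / 1e12, projectedBytes(images, 1) / 1e12);
    }
}
```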

Step 3 — Bandwidth

$$\text{Read Bandwidth} = \text{read QPS} \times \text{average response size}$$

If read:write ratio is 100:1 (social media timeline):

  • Write QPS = 1,000/sec at 100 bytes each → 100 KB/s write.
  • Read QPS = 100,000/sec at 10 KB each → 1 GB/s read → CDN is mandatory.
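The bandwidth formula, sketched with the timeline numbers above:

```java
// The Step 3 formula, applied to the 100:1 timeline example above.
public class Step3Bandwidth {

    // Bandwidth = QPS × average payload size
    static double bandwidthBytesPerSec(double qps, double payloadBytes) {
        return qps * payloadBytes;
    }

    public static void main(String[] args) {
        double writeBw = bandwidthBytesPerSec(1_000, 100);       // 100 KB/s of writes
        double readBw  = bandwidthBytesPerSec(100_000, 10_000);  // 1 GB/s of reads
        System.out.printf("Write: %.0f KB/s, Read: %.1f GB/s%n", writeBw / 1e3, readBw / 1e9);
        // 1 GB/s of sustained reads is the point where a CDN stops being optional.
    }
}
```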

⚙️ Reference Numbers to Memorize

| Quantity | Approximate Value |
|---|---|
| Seconds in a day | 86,400 (~10^5) |
| Bytes in 1 MB | 10^6 |
| Bytes in 1 GB | 10^9 |
| Bytes in 1 TB | 10^12 |
| Average SSD latency | 1 ms |
| Average DB query (indexed) | 1–10 ms |
| Average network request (same DC) | 1–5 ms |
| Typical API response size | 1–50 KB |
| Typical image size (compressed) | 200 KB – 2 MB |
| Video (1080p, 1 hour) | ~1.5 GB |

📊 Estimation Decision Flow

Use this decision tree to structure any capacity estimation from scratch. Start with DAU and follow the branches to determine what infrastructure tier your system actually needs: single server, horizontally scaled, or fully distributed.

flowchart TD
    Start[Start Estimation] --> DAU[Estimate DAU]
    DAU --> RW[Estimate Read/Write Ratio]
    RW --> QPS["Calculate QPS = DAU × actions / 86400"]
    QPS --> Peak["Multiply by 2–3× for peak"]
    Peak --> Storage["Storage/day = write QPS × record size × 86400"]
    Storage --> BW["Bandwidth = read QPS × response size"]
    BW --> Scale{Scale tier?}
    Scale -->|"< 1,000 RPS"| Small[Single server + PostgreSQL]
    Scale -->|"1k–100k RPS"| Med[App servers + cache + sharding]
    Scale -->|"> 100k RPS"| Large["Distributed system: CDN + microservices"]

The key decision point is the scale tier. A system handling 100 RPS and one handling 100,000 RPS require fundamentally different architectures: not just bigger servers, but different data stores, caching strategies, and deployment models.


🌍 Real-World Application: Capacity Estimation Across Scale Tiers

The best way to internalize the four-step pipeline is to run it on systems you already use every day. Here are three scenarios at very different scales that demonstrate how the numbers drive architecture decisions.

Twitter/X-Scale Feed

  • DAU: 400M users. Each user reads ~10 timeline requests/day; each user writes ~0.1 tweets/day.
  • Read QPS: (400M × 10) / 86,400 ≈ 46,300 RPS
  • Write QPS: (400M × 0.1) / 86,400 ≈ 463 RPS
  • Read:Write ratio: ~100:1
  • Tweet text storage: 463 writes/sec × 300 bytes/tweet × 86,400 ≈ 12 GB/day (text only)
  • Media bandwidth: Each timeline renders ~50 KB of mixed media → 46,300 × 50 KB ≈ 2.3 GB/s read → CDN is not optional at this scale.

Uber-Scale Real-Time Location Tracking

  • Active rides: 1M simultaneously active drivers sending GPS coordinates every 5 seconds.
  • Write QPS: 1M ÷ 5 = 200,000 GPS writes/sec
  • Record size: ~100 bytes (lat, lng, timestamp, driver ID)
  • Storage/day: 200,000 × 100 bytes × 86,400 ≈ 1.7 TB/day
  • Key insight: This is write-heavy, append-only data. A time-series database or Kafka-backed pipeline outperforms a relational DB here. Object storage handles historical data; an in-memory store (Redis) handles the live feed.

WhatsApp-Scale Messaging

  • Messages/day: 100 billion (100B).
  • Average message size: ~50 bytes (most text messages are short).
  • Storage/day: 100B × 50 bytes = 5 TB/day
  • Over 1 year (uncompressed): ~1.8 PB
  • With 6:1 compression ratio: ~300 TB/year
  • Key insight: At this scale, every design choice about compression, retention policy, and cold vs. hot storage directly affects cost by hundreds of millions of dollars per year. The math makes the priority explicit.
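The WhatsApp arithmetic above (daily volume, yearly total, and the effect of the 6:1 compression ratio) can be sketched as:

```java
// The WhatsApp-scale numbers above: 100B messages/day × 50 bytes,
// projected over a year, raw and at a 6:1 compression ratio.
public class MessagingStorage {

    static double perDayBytes(double messagesPerDay, double avgMessageBytes) {
        return messagesPerDay * avgMessageBytes;
    }

    static double perYearCompressedBytes(double perDayBytes, double compressionRatio) {
        return perDayBytes * 365.0 / compressionRatio;
    }

    public static void main(String[] args) {
        double day = perDayBytes(100e9, 50);                       // 5 TB/day
        System.out.printf("Per day: %.0f TB%n", day / 1e12);
        System.out.printf("Per year, raw: %.2f PB%n", day * 365 / 1e15);        // ~1.8 PB
        System.out.printf("Per year, 6:1 compressed: %.0f TB%n",
                perYearCompressedBytes(day, 6) / 1e12);            // ~300 TB
    }
}
```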

🧠 Deep Dive: Worked Example — Design a Pastebin

Assumptions:

  • 1M DAU. Read:Write = 10:1. Average paste = 10 KB.
  • Write QPS = (1M × 1 paste/day) / 86400 ≈ 12 writes/sec
  • Read QPS = 12 × 10 = 120 reads/sec
  • Storage: 12 writes/sec × 10 KB × 86400 ≈ 10 GB/day → ~10 TB over 3 years.
  • Read bandwidth: 120 reads/sec × 10 KB = 1.2 MB/sec → no CDN needed at this scale.

What this tells you:

  • A single PostgreSQL can comfortably handle sub-1000 writes/sec.
  • Storage backend should be durable object storage to handle 10 TB over 3 years.
  • No CDN or caching tier needed at this scale; 120 RPS fits in a single app instance.
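The whole pastebin estimate can be replayed end-to-end in a few lines, using the assumptions stated above:

```java
// The pastebin assumptions above, replayed through the four-step pipeline.
public class PastebinEstimate {

    static double writeQps(long dau, double writesPerUserPerDay) {
        return dau * writesPerUserPerDay / 86_400.0;
    }

    static double storagePerDayBytes(double writeQps, double recordBytes) {
        return writeQps * recordBytes * 86_400;
    }

    public static void main(String[] args) {
        double wq = writeQps(1_000_000L, 1);             // ≈ 12 writes/sec
        double rq = wq * 10;                             // 10:1 read:write ≈ 120 reads/sec
        double perDay = storagePerDayBytes(wq, 10_000);  // 10 KB pastes → ~10 GB/day
        System.out.printf("Write QPS %.1f, Read QPS %.1f%n", wq, rq);
        System.out.printf("Storage: %.0f GB/day, %.1f TB over 3 years%n",
                perDay / 1e9, perDay * 365 * 3 / 1e12);  // low tens of TB raw
        System.out.printf("Read bandwidth: %.1f MB/s%n", rq * 10_000 / 1e6);
    }
}
```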

⚖️ Trade-offs & Failure Modes: Common Estimation Mistakes

| Mistake | Why It Matters |
|---|---|
| Ignoring peak-to-average ratio | Sizing for average means you can't handle 3× traffic spikes |
| Forgetting replication overhead | A 1 TB DB with 3 replicas = 3 TB stored |
| Treating all writes as equal | Writes to a hot row (stock ticker, popular post) create hotspots |
| Not accounting for growth | A system sized for today will be undersized in 12 months; plan for 3–5× |
| Ignoring Pareto: 1% of users drive 90% of traffic | A few power users can dominate the system |

🧪 Practical: Interview Estimation Template

In a system design interview, the interviewer is watching whether you have a structured process, not whether you arrive at the exact right number. Here is a reusable five-step template you can follow verbatim:

Step 1 — State assumptions:
  - DAU: X million
  - Read:Write ratio: X:1
  - Average object size: X KB/MB

Step 2 — Calculate QPS:
  - Write QPS = DAU × writes_per_user / 86400
  - Read QPS = Write QPS × read_write_ratio
  - Peak QPS = Average QPS × 3

Step 3 — Calculate Storage:
  - Storage/day = write QPS × record_size × 86400
  - 5-year storage = storage/day × 365 × 5
  - Account for replication: × 3

Step 4 — Calculate Bandwidth:
  - Read bandwidth = read QPS × response_size
  - CDN threshold: > 1 GB/s read bandwidth

Step 5 — Determine scale tier:
  - < 1k RPS: single server
  - 1k-100k RPS: horizontal scaling + cache
  - > 100k RPS: distributed + CDN
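The five steps above condense into one pass of arithmetic. Here is a sketch; the photo-sharing service and all its input numbers are hypothetical, chosen only to exercise the template:

```java
// The five-step interview template as one pass of arithmetic.
// All inputs are the Step 1 assumptions; the example service is hypothetical.
public class EstimationTemplate {

    // Returns {peak QPS, 5-year replicated storage in bytes, read bandwidth in bytes/sec}
    static double[] estimate(long dau, double writesPerUser, double readWriteRatio,
                             double recordBytes, double responseBytes) {
        double writeQps = dau * writesPerUser / 86_400.0;        // Step 2
        double readQps = writeQps * readWriteRatio;
        double peakQps = (writeQps + readQps) * 3;               // peak = 3× average
        double storagePerDay = writeQps * recordBytes * 86_400;  // Step 3
        double fiveYearReplicated = storagePerDay * 365 * 5 * 3; // × 3 for replication
        double readBandwidth = readQps * responseBytes;          // Step 4
        return new double[] { peakQps, fiveYearReplicated, readBandwidth };
    }

    public static void main(String[] args) {
        // Hypothetical photo service: 10M DAU, 1 upload/day, 50 reads per write,
        // 500 KB per upload, 100 KB per timeline response.
        double[] r = estimate(10_000_000L, 1, 50, 500_000, 100_000);
        System.out.printf("Peak QPS: %.0f%n", r[0]);             // Step 5: pick the tier
        System.out.printf("5y storage (replicated): %.0f TB%n", r[1] / 1e12);
        System.out.printf("Read bandwidth: %.2f GB/s%n", r[2] / 1e9);
    }
}
```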

Quick Reference Decision Rules

Three threshold rules that short-circuit most infrastructure architecture debates:

  • If storage > 1 TB/day → use object storage (S3/GCS). A relational database is not designed for multi-terabyte raw ingest at sustained write rates.
  • If read bandwidth > 1 GB/s → put a CDN in front. No single origin server reliably sustains multi-gigabit read throughput at acceptable latency.
  • If write QPS > 10,000/sec → consider sharding or a write-optimized store (Cassandra, DynamoDB). A single PostgreSQL primary tops out around 5k–10k sustained writes/sec under real workloads.

State these thresholds explicitly during your interview; it demonstrates that you understand the inflection points, not just the arithmetic.
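The three rules above are simple enough to express as a decision function. This is a sketch of the bullet-point heuristics, not hard limits:

```java
// The three threshold rules above as a decision function.
// The cut-offs (1 TB/day, 1 GB/s, 10k writes/sec) are heuristics, not hard limits.
public class ScaleThresholds {

    static String storageTier(double bytesPerDay) {
        return bytesPerDay > 1e12 ? "object storage (S3/GCS)" : "relational DB is fine";
    }

    static String servingTier(double readBytesPerSec) {
        return readBytesPerSec > 1e9 ? "put a CDN in front" : "origin can serve directly";
    }

    static String writeTier(double writeQps) {
        return writeQps > 10_000 ? "shard or use a write-optimized store"
                                 : "single primary can keep up";
    }

    public static void main(String[] args) {
        System.out.println(storageTier(1.7e12));   // Uber-scale GPS ingest
        System.out.println(servingTier(2.3e9));    // Twitter-scale media reads
        System.out.println(writeTier(200_000));    // 200k GPS writes/sec
    }
}
```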


🛠️ Spring Boot + JMeter: From Capacity Estimates to Measured Throughput

Spring Boot is the standard Java framework for building the HTTP services whose QPS, storage, and latency you estimate. Apache JMeter is the open-source load testing tool that validates those estimates against a running service by generating synthetic traffic at the calculated peak load and measuring actual throughput and p95 latency.

The workflow: estimate peak QPS on paper → implement the endpoint → run a JMeter test plan at estimated load → compare results against your SLA targets.

Instrument the endpoint to compare live p95/p99 against your paper estimate:

// Close the estimation feedback loop: @Timed captures actual latency from live traffic
// so you can compare measured p95 against the "p95 < 50ms at 11,600 RPS" estimate.
// (UserDto and UserRepository are the service's own types, assumed to exist.)

import io.micrometer.core.annotation.Timed;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import static java.util.concurrent.TimeUnit.MINUTES;

@RestController
public class UserController {

    private final RedisTemplate<String, UserDto> cache;   // L1: Redis
    private final UserRepository userRepository;          // L2: PostgreSQL

    UserController(RedisTemplate<String, UserDto> cache, UserRepository repo) {
        this.cache = cache;
        this.userRepository = repo;
    }

    @GetMapping("/api/users/{id}")
    @Timed(value = "user.lookup.latency",
           percentiles = {0.50, 0.95, 0.99},    // exported to Prometheus/Grafana/Actuator
           description = "Cache-aside: Redis ~0.5ms hit, DB ~5ms miss")
    public ResponseEntity<UserDto> getUser(@PathVariable String id) {
        // L1 - Redis: serves ~90%+ of requests at sub-millisecond cost
        UserDto cached = cache.opsForValue().get("user:" + id);
        if (cached != null) return ResponseEntity.ok(cached);

        // L2 - PostgreSQL: cache miss, fills cache for next request (5-minute TTL)
        return userRepository.findById(id)
            .map(u -> { cache.opsForValue().set("user:" + id, u, 5, MINUTES); return ResponseEntity.ok(u); })
            .orElse(ResponseEntity.notFound().build());
    }
}
// If Actuator shows p99 > 100ms -> add a read replica or extend Redis TTL

JMeter test plan — simulating estimated peak load:

<!-- Thread Group (schematic, not a literal .jmx file): 1000 concurrent users, 5-minute sustained load at peak QPS -->
<!-- This maps to: 11,600 RPS ÷ 1,000 users ≈ 11.6 requests/sec per user -->
<ThreadGroup>
  <numThreads>1000</numThreads>         <!-- virtual users -->
  <rampTime>60</rampTime>               <!-- ramp up over 60 seconds -->
  <duration>300</duration>              <!-- hold load for 5 minutes -->

  <HTTPSamplerProxy>
    <domain>localhost</domain>
    <port>8080</port>
    <path>/api/users/${userId}</path>   <!-- CSV Data Set: randomised user IDs -->
    <method>GET</method>
  </HTTPSamplerProxy>

  <!-- Add Listeners: Aggregate Report, Response Time Graph, Active Threads Graph -->
</ThreadGroup>

Capacity estimate vs. JMeter measurement, as a decision table:

| Metric | Estimate (paper) | JMeter result | Action |
|---|---|---|---|
| Throughput | 11,600 RPS | 9,200 RPS | Under capacity: add one more app instance |
| p95 latency | < 50 ms | 43 ms | ✅ Within SLA |
| Error rate | 0% at peak | 0.3% | Investigate: likely DB connection pool saturation |
| Cache hit rate | 90% | 88% | ✅ Close to target; warm-up period is normal |

If measured throughput is significantly below estimated peak QPS, use Spring Boot Actuator (/actuator/metrics) to find the bottleneck: typically the DB connection pool, Redis max connections, or JVM thread pool exhaustion.

For a full deep-dive on JMeter test plan design, Gatling as a code-first load testing alternative, and Spring Boot Actuator metrics integration for capacity validation, a dedicated follow-up post is planned.


📚 Key Lessons from Back-of-Envelope Estimation

Five lessons that separate engineers who nail capacity estimation from those who overcomplicate it:

  1. The goal is the scale tier, not the exact number. Whether your estimate lands at 80,000 RPS or 120,000 RPS doesn't change the architecture decision: both require distributed infrastructure with a load balancer and horizontal scaling. A 10× error is usually tolerable; a 100× error is a design problem.

  2. State your assumptions first, every time. Interviewers and teammates care about your reasoning chain, not your final number. Starting with "I'm assuming 10M DAU, each user makes 20 requests per day" tells them you know how to structure an ambiguous problem before solving it.

  3. Design for peak, not average. Systems handle average load fine until they don't, and "don't" always happens at the worst moment: product launch, breaking news, Black Friday. Peak QPS = 2–3× average QPS is the standard safety margin, and it is non-negotiable for anything customer-facing.

  4. Bandwidth and compute are usually the bottlenecks, not storage. Storage is cheap and easy to scale horizontally. Bandwidth costs money at every CDN egress point, and compute saturates under QPS load. If your estimate shows 10 GB/s of read bandwidth, that is a harder problem than 100 TB of raw storage.

  5. Napkin math works because architecture decisions are coarse-grained. The difference between PostgreSQL and DynamoDB isn't a matter of being off by 5% in your estimate. It's a threshold: one works reliably at thousands of QPS, the other at millions. Your estimate only needs to be accurate enough to identify the right side of that boundary.

📊 Estimation Calculation: QPS to Infrastructure

flowchart TD
    Start["Start: State Assumptions"] --> DAU["DAU × actions/user"]
    DAU --> QPS["Avg Write QPS = DAU × writes / 86400"]
    QPS --> Peak["Peak QPS = Avg × 2–3"]
    Peak --> RQPS["Read QPS = Write QPS × read:write ratio"]
    RQPS --> Storage["Storage/day = write QPS × record size × 86400"]
    RQPS --> BW["Bandwidth/day = read QPS × response size"]
    Storage --> Scale{Scale tier?}
    BW --> CDN{"BW > 1 GB/s?"}
    CDN -->|Yes| AddCDN[Add CDN layer]
    CDN -->|No| NoCDN[Direct serving OK]
    Scale -->|"under 1k RPS"| Small[Single server]
    Scale -->|"1k–100k RPS"| Med[Horizontal + cache]
    Scale -->|"over 100k RPS"| Large[Distributed + CDN]

📌 TLDR: Summary & Key Takeaways

  • DAU → QPS → Storage → Bandwidth is the standard four-step pipeline.
  • Peak QPS = 2–3× average; always design for peak.
  • 10^5 seconds/day is the key constant; it converts user behavior to per-second rates.
  • Compare storage requirements early: 1 GB/day → relational DB; 1 TB/day → object storage.
  • High read bandwidth → CDN. Low bandwidth → single server is fine.


Written by Abstract Algorithms (@abstractalgorithms)