System Design HLD Example: Video Streaming (YouTube/Netflix)
A practical interview-ready HLD for a video streaming platform with adaptive bitrate and CDN delivery.
Abstract Algorithms · Intermediate
For developers with some experience. Builds on fundamentals.
Estimated read time: 16 min
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: A video streaming platform is a two-sided architectural beast: a batch-oriented transcoding pipeline that converts raw uploads into multi-resolution segments, and a real-time global delivery network that serves those segments via CDNs. The technical linchpin is Adaptive Bitrate Streaming (ABR), which enables the client player to seamlessly switch quality based on network fluctuations, ensuring a buffer-free experience for millions of concurrent users.
The "Game of Thrones" Crash
It's Sunday night at 9:00 PM. Millions of people simultaneously hit "Play" on the season finale of the world's most popular show. For the first five minutes, everything is fine. Then, the Twitter complaints start: "Buffering...", "Pixelated mess!", "Server error 500."
Behind the scenes, the origin servers are melting. Every edge node in the Content Delivery Network (CDN) is trying to fetch the same 4K video segment from the central storage at once. This is the Cache Thundering Herd problem. If your system is designed to serve a few thousand users, it will crumble when a viral event spikes traffic by 100x in seconds.
A video platform isn't just a website; it's a global distribution engine. The challenge isn't just storing the bytes; it's moving those bytes across the world's oceans and through congested ISP networks to a smartphone on a shaky 3G connection, all without a single "Buffering" spinner. At the scale of Netflix or YouTube, you aren't just optimizing code; you are optimizing the physics of data movement.
Video Streaming: Use Cases & Requirements
Actors & Journeys
- Content Creator: Uploads high-quality raw video files (often $>100$ GB). They require a reliable, resumable upload path.
- Viewer: Consumes content across diverse devices (4K TV, 720p Laptop, 360p Smartphone). They require "instant-on" playback and no buffering.
- Platform Admin: Manages content moderation, transcoding priorities, and CDN cost-efficiency.
In/Out Scope
- In-Scope: Video ingestion (upload), distributed transcoding, segment storage, global delivery via CDN, and adaptive bitrate logic.
- Out-of-Scope: Content recommendation engines (AI/ML), complex copyright management (DMCA takedown workflows), and live interactive chat.
Functional Requirements
- Multipart Upload: Support for uploading large files with resume capability.
- Automated Transcoding: Convert raw video into multiple resolutions (360p, 720p, 1080p, 4K) and streaming formats (HLS, DASH).
- Adaptive Playback: Seamlessly serve the best possible quality based on the user's real-time bandwidth.
- Metadata Management: Searchable titles, descriptions, thumbnails, and view counts.
Non-Functional Requirements (NFRs)
- High Availability: 99.99% for the playback path (viewers shouldn't notice if the upload pipeline is down).
- Ultra-Low Latency: Playback startup should be $< 2$ seconds globally.
- Massive Scalability: Handle 500 hours of video uploaded per minute and 1 billion views per day.
- Cost Efficiency: Optimize storage and egress bandwidth (the single largest expense).
Foundations: How Video Streaming Actually Works
Unlike a simple file download where you wait for the whole file to arrive, modern streaming uses Segmented Delivery.
The baseline architecture involves three main pillars:
- The Bitrate Ladder: A single video is converted into multiple files with different resolutions and bitrates.
- Segmentation: Each of these files is sliced into 2-10 second "chunks" or "segments."
- The Manifest: A text file (like .m3u8 for HLS) that tells the player where to find these segments.
When you press "Play," you aren't downloading movie.mp4. You are downloading manifest.m3u8, which points to segment_1_1080p.ts, then segment_2_1080p.ts, and so on. This architecture allows the player to jump to any part of the video instantly by just requesting the relevant segment, and it's the foundation for adaptive quality.
The Mechanics of Adaptive Bitrate (ABR)
The quality-switching mechanism runs entirely on the client side.
- HLS (HTTP Live Streaming): Developed by Apple, it uses .ts segments and is the standard for iOS/Safari.
- DASH (Dynamic Adaptive Streaming over HTTP): An international standard that is more flexible and widely used on Android and Smart TVs.
- The Switching Logic: The player maintains a "Buffer Health" counter (e.g., 20 seconds of video pre-downloaded). If the download speed of the last segment was slower than the segment's duration, the player switches to a lower bitrate rendition in the bitrate ladder for the next segment to prevent the buffer from hitting zero.
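To make the switching rule concrete, here is a minimal buffer-health heuristic in Python. It is an illustrative sketch, not any particular player's algorithm: the ladder values, thresholds, and function names are assumptions, and production players (hls.js, ExoPlayer, Shaka) add bandwidth estimation and hysteresis on top of this.

```python
# Minimal buffer-based ABR heuristic (illustrative, not a production player).
# Assumes a bitrate ladder sorted from lowest to highest bitrate.
LADDER = [  # (rendition name, bitrate in kbps) -- hypothetical values
    ("360p", 800),
    ("720p", 2500),
    ("1080p", 5000),
    ("2160p", 16000),
]

def next_rendition(current_idx: int, segment_duration_s: float,
                   last_download_s: float, buffer_s: float) -> int:
    """Pick the ladder index for the next segment.

    If the last segment downloaded slower than real-time, or the buffer is
    nearly empty, step down. If it downloaded much faster than real-time and
    the buffer is healthy, step up one rung.
    """
    if last_download_s > segment_duration_s or buffer_s < 5.0:
        return max(current_idx - 1, 0)                 # defend the buffer
    if last_download_s < 0.5 * segment_duration_s and buffer_s > 15.0:
        return min(current_idx + 1, len(LADDER) - 1)   # headroom: go up
    return current_idx                                 # hold steady

# Example: 4-second segments, last one took 5.2 s to download, buffer at 8 s
idx = next_rendition(current_idx=2, segment_duration_s=4.0,
                     last_download_s=5.2, buffer_s=8.0)
print(LADDER[idx])  # ('720p', 2500) -- stepped down from 1080p
```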
Estimations & Design Goals
The Math of YouTube-Scale
- Ingest Volume: 500 hours/min = 30,000 hours/hour.
- Raw Storage: 30,000 hours $\times$ 5 GB/hour (high-bitrate 1080p) = 150 TB/hour.
- Transcoding Expansion: Each video is transcoded into $\approx 6$ resolutions. Total storage after processing $\approx 3\times$ the raw size.
- Egress Bandwidth: 1B views/day. If average view is 10 mins at 2 Mbps:
- $1B \times 600s \times 2Mbps / 8 \text{ (bits to bytes)} = \mathbf{150 \text{ Petabytes/day}}$ of egress traffic.
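These figures are easy to sanity-check with a short script; a back-of-envelope sketch using the same assumptions (500 hours/min ingest, 5 GB/hour raw, 1B views/day averaging 10 minutes at 2 Mbps):

```python
# Back-of-envelope estimates for the ingest and egress numbers quoted above.
HOURS_PER_MIN_INGEST = 500               # hours of video uploaded per minute
RAW_GB_PER_HOUR = 5                      # high-bitrate 1080p master
VIEWS_PER_DAY = 1_000_000_000
AVG_VIEW_SECONDS = 600                   # 10-minute average view
AVG_BITRATE_MBPS = 2

ingest_tb_per_hour = HOURS_PER_MIN_INGEST * 60 * RAW_GB_PER_HOUR / 1_000
egress_bits_per_day = VIEWS_PER_DAY * AVG_VIEW_SECONDS * AVG_BITRATE_MBPS * 1_000_000
egress_pb_per_day = egress_bits_per_day / 8 / 1e15

print(f"Raw ingest: {ingest_tb_per_hour:,.0f} TB/hour")   # 150 TB/hour
print(f"Egress:     {egress_pb_per_day:,.0f} PB/day")     # 150 PB/day
```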
Design Goals
- 99% Cache Hit Rate: The origin server should almost never see a request from a user; the CDN must handle the load.
- Parallel Processing: Transcoding must be chunked to allow a 1-hour video to be processed in $< 10$ minutes.
- Cold vs. Hot Storage: Move rarely watched videos to S3 Glacier to save costs.
High-Level Design: The Twin Pipeline Architecture
The architecture is split into a Write Path (Ingestion & Processing) and a Read Path (Discovery & Delivery).
graph TD
Creator -->|Upload| LB[Load Balancer]
LB --> US[Upload Service]
US --> Raw[(Raw Store: S3)]
Raw --> MQ[Message Queue: Kafka]
MQ --> TW[Transcoding Workers]
TW --> Segs[(Segment Store: S3)]
Segs --> OS[Origin Shield]
OS --> CDN[Global CDN Nodes]
Viewer -->|Metadata| API[API Gateway]
Viewer -->|Stream| CDN
API --> DB[(Metadata DB: Postgres)]
API --> Cache[(Redis Cache)]
The architecture divides cleanly into a Write Path and a Read Path. On the Write Path, a creator's raw upload flows through a Load Balancer into the Upload Service, which stores the raw file in S3 and publishes a transcoding job to Kafka. A pool of Transcoding Workers consumes these jobs, producing multi-resolution segments stored in a separate S3 bucket. On the Read Path, a viewer's playback request hits the CDN edge first. If the edge has the segment cached, it returns immediately. If not, the edge fetches from the Origin Shield, which serializes misses to protect S3 from a thundering herd. Metadata queries (title, thumbnail, recommended next videos) flow through the API Gateway backed by Postgres and Redis.
Deep Dive: The Transcoding DAG and the Origin Shield Thundering-Herd Fix
The transcoding pipeline is the most computationally intensive component in a video platform. A naive implementation assigns a single worker to transcode an entire 1-hour video into 6 resolutions sequentially, a process that could take 6 or more hours on a single machine. Production systems break this into a Directed Acyclic Graph (DAG) of parallel jobs.
Each uploaded video is first split into 1-minute raw chunks at the Upload Service. Each chunk is published as a separate message to the Kafka topic video.transcoding.jobs. A pool of Transcoding Workers (typically instances with GPU acceleration) picks up these messages and processes them in parallel. A 1-hour video becomes 60 independent 1-minute jobs running simultaneously across 60 workers, reducing wall-clock transcoding time from 6 hours to approximately 6 minutes.
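A minimal sketch of that fan-out, assuming the kafka-python client and a hypothetical job schema; a production pipeline would add message schemas, retries, and idempotency keys:

```python
import json
from kafka import KafkaProducer  # assumes the kafka-python client

RESOLUTIONS = ["360p", "720p", "1080p", "2160p"]
CHUNK_SECONDS = 60  # 1-minute chunks, as described above

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_transcoding_jobs(video_id: str, duration_seconds: int, raw_s3_key: str):
    """Fan a single upload out into one message per (chunk, resolution)."""
    num_chunks = -(-duration_seconds // CHUNK_SECONDS)  # ceiling division
    for chunk_idx in range(num_chunks):
        for resolution in RESOLUTIONS:
            producer.send("video.transcoding.jobs", {
                "video_id": video_id,
                "chunk_index": chunk_idx,
                "chunk_offset_s": chunk_idx * CHUNK_SECONDS,
                "chunk_length_s": CHUNK_SECONDS,
                "resolution": resolution,
                "source_key": raw_s3_key,
            })
    producer.flush()

# A 1-hour upload becomes 60 chunks x 4 resolutions = 240 independent jobs.
publish_transcoding_jobs("vid-123", 3600, "raw/vid-123.mp4")
```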
Internals: How HLS Segmentation Enables Adaptive Bitrate Playback
When a Transcoding Worker finishes encoding a 1-minute chunk at a given resolution, it does not produce a single large video file. It produces a series of short .ts segment files (typically 4-6 seconds each) and a .m3u8 manifest file that lists the URLs of all segments for that resolution. The .m3u8 manifest is a simple text file: each line is either a segment duration hint or a relative URL pointing to the next segment.
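To make this concrete, here is a hypothetical media playlist for one rendition and a tiny parser that extracts (duration, URL) pairs; real playlists carry many more tags, but this is the core structure:

```python
# A hypothetical HLS media playlist for one rendition (illustrative only).
MANIFEST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.0,
segment_00001_1080p.ts
#EXTINF:4.0,
segment_00002_1080p.ts
#EXTINF:4.0,
segment_00003_1080p.ts
#EXT-X-ENDLIST
"""

def parse_media_playlist(text: str):
    """Return (duration_seconds, segment_url) pairs from a media playlist."""
    segments, pending = [], None
    for line in text.splitlines():
        if line.startswith("#EXTINF:"):
            pending = float(line.split(":", 1)[1].rstrip(","))
        elif line and not line.startswith("#") and pending is not None:
            segments.append((pending, line))
            pending = None
    return segments

print(parse_media_playlist(MANIFEST))
# [(4.0, 'segment_00001_1080p.ts'), (4.0, 'segment_00002_1080p.ts'), (4.0, 'segment_00003_1080p.ts')]
```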
The ABR player on the viewer's device downloads only the manifest, never the full video. It reads the first segment URL, downloads that 4-second chunk, plays it, then checks its buffer health before deciding which resolution to request next. If the last download was faster than real-time (meaning the buffer is growing), the player may switch to a higher-resolution rendition for the next segment. If the buffer is shrinking toward empty, it switches down. This switching logic runs entirely on the client, with zero server coordination required, making ABR a distributed system built out of stateless HTTP requests and client-side heuristics.
The Transcoding Job DAG
graph TD
Upload[Raw Video Upload] --> Split[Chunk Splitter: 1-min segments]
Split --> Kafka[Kafka: video.transcoding.jobs]
Kafka --> W1[Worker: 360p Encoding]
Kafka --> W2[Worker: 720p Encoding]
Kafka --> W3[Worker: 1080p Encoding]
Kafka --> W4[Worker: 4K Encoding]
W1 --> Merge[Manifest Builder]
W2 --> Merge
W3 --> Merge
W4 --> Merge
Merge --> S3[Segment Store: S3]
The Manifest Builder is triggered after all encoding workers for a given chunk have completed. It updates the HLS .m3u8 manifest with the new segment URLs, making those segments immediately playable, even before the full video finishes processing. This is how YouTube can show a video as "watchable" within minutes of upload while the 4K version is still being generated.
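A minimal sketch of that completion tracking, using an in-memory structure for illustration; a real Manifest Builder would persist this state (in Postgres or Redis) and write the updated .m3u8 files back to S3:

```python
from collections import defaultdict

REQUIRED_RESOLUTIONS = {"360p", "720p", "1080p", "2160p"}

class ManifestBuilder:
    """Tracks per-chunk encoding completion and publishes updated playlists
    once every resolution for a chunk has finished."""

    def __init__(self):
        self.completed = defaultdict(set)    # (video_id, chunk_idx) -> finished resolutions
        self.playlists = defaultdict(list)   # (video_id, resolution) -> ordered segment URLs

    def on_chunk_encoded(self, video_id: str, chunk_idx: int,
                         resolution: str, segment_urls: list[str]):
        key = (video_id, chunk_idx)
        self.completed[key].add(resolution)
        self.playlists[(video_id, resolution)].extend(segment_urls)
        if self.completed[key] == REQUIRED_RESOLUTIONS:
            # Every rendition of this chunk exists -> rewrite the .m3u8 files,
            # making the chunk playable before the rest of the video is done.
            print(f"{video_id} chunk {chunk_idx}: manifests updated, now playable")
```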
How the Origin Shield Prevents a Thundering Herd
Without an Origin Shield, a viral video release creates a Cache Thundering Herd: thousands of CDN edge nodes simultaneously experience cache misses for the same segment and all race to the S3 origin at once. A 1 Gbps S3 bucket receiving 10,000 simultaneous segment requests produces queue depths that can cascade into an origin outage, exactly the "Game of Thrones crash" scenario described at the opening.
The Origin Shield inserts a single intermediary cache layer between the CDN edge nodes and S3. All CDN edge nodes that miss their local cache forward the miss to the Origin Shield rather than to S3. The Shield serializes these requests: the first edge to miss acquires a fill lock, fetches from S3, caches the segment, and all waiting edges receive the segment from the Shield. This reduces S3 origin requests from O(number of edge nodes) to O(1) per segment per cache miss, a reduction from thousands of requests to one.
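The fill lock is essentially the single-flight pattern. A threaded sketch, assuming a hypothetical fetch_from_origin callable standing in for the S3 GET:

```python
import threading

class OriginShieldCache:
    """Request coalescing: concurrent misses for the same key trigger exactly
    one origin fetch; every other caller waits and reuses the result."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin   # hypothetical S3 GET
        self.cache = {}
        self.inflight = {}                           # key -> threading.Event
        self.lock = threading.Lock()

    def get(self, key: str) -> bytes:
        with self.lock:
            if key in self.cache:
                return self.cache[key]               # shield hit
            event = self.inflight.get(key)
            if event is None:                        # first miss wins the fill lock
                event = threading.Event()
                self.inflight[key] = event
                owner = True
            else:
                owner = False
        if owner:
            data = self.fetch_from_origin(key)       # exactly one origin request
            with self.lock:
                self.cache[key] = data
                del self.inflight[key]
            event.set()
            return data
        event.wait()                                 # piggyback on the in-flight fill
        with self.lock:
            return self.cache[key]
```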
Video Metadata and Segment Storage Models
Video Metadata Table (Postgres)
| Column | Type | Notes |
|---|---|---|
| video_id | UUID | Primary key |
| title | VARCHAR(500) | Full-text indexed for search |
| creator_id | UUID | FK to users table |
| status | ENUM | PROCESSING, READY, FAILED |
| duration_seconds | INTEGER | Populated after transcoding completes |
| view_count | BIGINT | Updated asynchronously via counter service |
| storage_tier | ENUM | HOT, WARM, COLD (Glacier) |
| created_at | TIMESTAMP | Upload timestamp |
Performance Analysis: Egress Cost and CDN Cache Hit Rate Economics
The dominant operational cost for a video streaming platform is not compute or storage; it is egress bandwidth. At 1 billion daily views averaging 10 minutes at 2 Mbps, the platform generates approximately 150 petabytes of egress per day. At a typical CDN egress price of $0.008 per GB, this equates to roughly $1.2 million per day in bandwidth costs. Every percentage point of CDN cache hit rate improvement directly reduces this cost.
A 99% CDN cache hit rate means 1% of requests reach the Origin Shield, which means 0.01% reach S3 (assuming the Shield has a 99% hit rate of its own). For a viral video with 10 million views in the first hour, this translates to approximately 100,000 Origin Shield requests and roughly 1,000 S3 requests, compared to 10 million S3 requests without any caching. Achieving 99% cache hit rate requires segment TTLs of at least 1 hour at the edge, which is feasible because video segments are immutable once created. The HLS manifest file, which changes during transcoding, uses a much shorter TTL of 5-10 seconds to ensure new segments become discoverable quickly.
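A quick script makes the economics tangible; the hit rates and prices are the assumptions stated above, not measured values:

```python
# Origin load and egress cost at the hit rates and prices assumed above.
EDGE_HIT_RATE = 0.99
SHIELD_HIT_RATE = 0.99
SEGMENT_REQUESTS = 10_000_000          # viral video, first hour
EGRESS_PB_PER_DAY = 150
CDN_PRICE_PER_GB = 0.008

shield_requests = SEGMENT_REQUESTS * (1 - EDGE_HIT_RATE)       # ~100,000
s3_requests = shield_requests * (1 - SHIELD_HIT_RATE)          # ~1,000
daily_cdn_bill = EGRESS_PB_PER_DAY * 1e6 * CDN_PRICE_PER_GB    # ~$1.2M

print(f"Origin Shield requests: {shield_requests:,.0f}")
print(f"S3 requests:            {s3_requests:,.0f}")
print(f"Daily CDN egress bill:  ${daily_cdn_bill:,.0f}")
```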
Segment and Manifest Cache (Redis)
| Key Pattern | Redis Type | Value | TTL |
|---|---|---|---|
| manifest:{video_id} | String | HLS m3u8 manifest content | 300 seconds |
| video:meta:{video_id} | Hash | Title, creator, duration, view count | 60 seconds |
| trending:global | Sorted Set | Video IDs scored by view velocity | 30 seconds |
| cdn:segment:{segment_key} | String | Cached at Origin Shield level | 3600 seconds |
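A minimal read-through sketch for the manifest key, assuming the redis-py client and a hypothetical load_manifest_from_s3 helper:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
MANIFEST_TTL_SECONDS = 300  # matches the manifest:{video_id} row above

def get_manifest(video_id: str, load_manifest_from_s3) -> str:
    """Read-through cache: serve the HLS manifest from Redis, falling back to S3."""
    key = f"manifest:{video_id}"
    cached = r.get(key)
    if cached is not None:
        return cached                                # cache hit
    manifest = load_manifest_from_s3(video_id)       # hypothetical S3 loader
    r.setex(key, MANIFEST_TTL_SECONDS, manifest)     # write back with TTL
    return manifest
```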
How Netflix, YouTube, and Twitch Deploy This Architecture
Netflix is one of the world's largest CDN operators. Their Open Connect program places appliance servers directly inside ISP data centers worldwide. Before a show releases, Netflix pre-positions all resolutions of all episodes onto Open Connect appliances using a Push CDN model. A viewer watching a new Netflix series is typically served by hardware sitting in a rack at their ISP, eliminating most internet traversal and the origin shield problem entirely. Netflix's engineering blog documents that over 95% of their traffic is served from Open Connect appliances, not from Netflix's own data centers.
YouTube processes 500+ hours of video uploaded every minute. Their transcoding infrastructure employs custom ASICs (Application-Specific Integrated Circuits) for H.264 and VP9 encoding, achieving energy efficiency 10x better than general-purpose GPUs. The DAG approach described in this HLD mirrors YouTube's actual implementation, where chunk boundaries align with GOPs (Groups of Pictures), the standard unit of independently decodable frames in an encoded stream. YouTube also uses a Pull CDN model because their long-tail library is too vast to pre-position; most videos receive very few views after their first week.
Twitch imposes a real-time constraint absent from YouTube or Netflix: live streams must be transcoded and delivered with end-to-end latency below 4 seconds. Their architecture bypasses the "chunk and store to S3" step entirely for live content, using a real-time ingest pipeline where segments are encoded and pushed to CDN edge nodes within 1-2 seconds of capture. The HLS delivery mechanism is the same as for VOD, but the segment source changes from an S3 bucket to a live ingest server running at the CDN edge.
The Trade-offs That Define Video Platform Architecture
HLS vs. DASH: HLS is Apple's standard, required for iOS and Safari. DASH is the international open standard, preferred for Android and Smart TVs. Most platforms transcode into both formats, accepting the storage cost of maintaining two segment sets. Choosing only HLS sacrifices Android market share; choosing only DASH blocks all iOS users. The dominant production choice is to support both, amortizing the storage cost against the requirement for universal device coverage.
Push CDN vs. Pull CDN: A Push CDN pre-positions content at edge nodes before any user requests it (Netflix Open Connect). A Pull CDN fetches from origin on the first request and caches at the edge for subsequent requests (YouTube's model for the long tail). Push CDN eliminates the first-request miss and the thundering herd problem but requires predicting which content to pre-position: feasible for a curated library, impractical for a platform where any of 800 million videos could go viral. Pull CDN with an Origin Shield is the correct architecture for long-tail video libraries.
Hot vs. Cold Storage Tiering: A video that received 1 billion views in its first week may have 10 views per day six months later. Keeping it in S3 Standard (hot storage) costs roughly $23 per TB per month. Moving it to S3 Glacier (cold storage) costs roughly $4 per TB per month, an 83% reduction. The trade-off is a 3-5 minute restore time for Glacier content, which is acceptable for low-traffic videos and can be mitigated with a pre-warming job triggered when a video's view velocity rises above a threshold.
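A sketch of how such a tiering policy and pre-warm trigger might look; the thresholds are illustrative assumptions, not production values:

```python
from datetime import datetime, timedelta

def choose_storage_tier(views_last_7_days: int, uploaded_at: datetime) -> str:
    """Illustrative tiering policy driven by recent view counts and age."""
    age = datetime.utcnow() - uploaded_at
    if views_last_7_days > 10_000:
        return "HOT"                     # S3 Standard: serve CDN misses directly
    if age < timedelta(days=90) or views_last_7_days > 100:
        return "WARM"                    # S3 Infrequent Access
    return "COLD"                        # S3 Glacier: 3-5 minute restore on demand

def should_prewarm(view_velocity_per_hour: int, current_tier: str) -> bool:
    """Trigger a Glacier restore before a cold video goes viral again."""
    return current_tier == "COLD" and view_velocity_per_hour > 500
```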
Choosing the Right Video Delivery Architecture
| Scenario | Recommended Architecture | Avoid |
|---|---|---|
| Under 1 million monthly views | Single CDN with pull-through cache | Custom transcoding infrastructure |
| Viral-scale global delivery | Push CDN + Origin Shield | Single-origin serving without CDN |
| Live streaming under 4 seconds latency | Real-time ingest pipeline, edge encoding | VOD-style S3 chunk store for live |
| Cost-sensitive long-tail library | S3 Intelligent Tiering + Pull CDN | Pre-positioned Push CDN for all content |
| UGC platform with 500+ hours/min upload | DAG-based parallel transcoding with Kafka | Sequential single-worker transcoding |
Delivering the Video Streaming HLD in a 45-Minute Interview
Open with scale numbers that frame the problem as infrastructure-first: "500 hours of video uploaded per minute, 1 billion views per day, an estimated 150 petabytes of daily egress traffic." These numbers communicate that you understand this is a physics and logistics problem, not an application development problem.
Draw the two pipelines explicitly and label them "Write Path" and "Read Path." Interviewers almost always interrupt at this point to drill into one or the other; labeling them gives the interviewer clear entry points and shows that you have organized your thinking into two distinct problem domains.
When discussing CDN architecture, proactively introduce the Origin Shield pattern and the thundering herd problem. Most candidates describe a CDN as "caching near the user" and stop there. Introducing the Origin Shield and explaining why it exists (to serialize cache misses during viral events) demonstrates that you have thought through the failure modes, not just the happy path.
Close with the storage tiering discussion: "Content has an exponential view-count decay curve after release. I would implement S3 Intelligent Tiering to automatically migrate rarely accessed videos to Glacier after 90 days of low activity, which I estimate would reduce storage costs by 60% for a platform with a large long-tail library."
What Video Streaming Architecture Reveals About Data Gravity
The core insight of video streaming system design is that bandwidth is more expensive than compute. The 150 petabytes of daily egress traffic is not a compute problem: modern servers can transcode video faster than real-time with GPU acceleration. It is a physics and economics problem: moving bits across ocean cables and through congested ISP peering points costs money proportional to distance and volume.
This is why every major architectural decision in a video platform ultimately reduces to "how do we serve these bytes from as close to the user as physically possible?" The CDN is not a performance optimization bolted onto the side of the real system; it is the primary architecture. The transcoding pipeline, the Postgres metadata database, and the API gateway exist to produce bytes that the CDN then delivers. Building a video platform that treats the CDN as an afterthought will fail at scale regardless of how well-optimized the application code is.
TLDR & Key Takeaways
- Video streaming is a two-pipeline problem: a batch-oriented transcoding Write Path and a real-time CDN delivery Read Path.
- Parallel DAG-based transcoding reduces a 6-hour serial job to approximately 6 minutes by splitting video into 1-minute chunks processed simultaneously.
- Adaptive Bitrate Streaming (ABR) switches the client between quality levels mid-playback based on buffer health, ensuring no buffering spinner.
- The Origin Shield eliminates CDN cache thundering herds by serializing all edge cache misses through a single intermediary before hitting S3.
- Storage tiering (Hot → Warm → Cold) is essential for cost management in platforms with large long-tail video libraries.
- At 150 PB/day of egress, CDN architecture is the primary design constraint, not an implementation detail.
- HLS and DASH are both required for full device coverage; production platforms maintain segment sets in both formats.
Related Posts
- System Design HLD: Web Crawler. Content discovery at scale; understanding how video URLs and metadata reach an index is a natural complement to understanding how they are delivered.
- System Design HLD: Distributed Cache. Deep dive into the Redis caching layer that backs metadata responses and Origin Shield segment caching in this design.
- System Design HLD: File Storage & Sync. The S3 blob storage patterns and multipart upload mechanics used for raw video ingestion and segment storage.