System Design HLD Example: Web Crawler
A practical interview-ready HLD for a distributed web crawler handling politeness, deduplication, and scale.
TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fetcher pool. By combining Bloom Filters for URL deduplication and SimHash for content near-duplicate detection, the system can crawl billions of pages while maintaining a minimal storage footprint and respecting the stability of the target websites.
The Infinite Web Problem
Imagine you've built a simple crawler. It works great on your personal blog. You point it at a small news site, and it finishes in minutes. Then, you point it at a major e-commerce site with millions of products.
Within seconds, your crawler starts hitting the "Calendar Trap." The site has a calendar page for every product, and your crawler is clicking "Next Month" infinitely. Meanwhile, the site's webmaster is seeing 5,000 requests per second from your IP address. Their server CPU spikes to 100%, their database locks up, and their legitimate customers can no longer buy products.
You haven't just crawled the site; you've accidentally launched a Distributed Denial of Service (DDoS) attack.
The design paradox at the heart of a web crawler is this: you need to be as fast as possible globally, but as slow as necessary per domain. If you design for speed, you break the web's politeness rules. If you design for politeness, you'll never finish crawling the billions of pages that exist. This tension drives every architectural decision in a production-grade web crawler. At 1 billion pages per month, failure to handle these constraints doesn't just result in a slow system; it results in being blacklisted by ISPs and potential legal ramifications.
Web Crawler: Use Cases & Requirements
Actors & Journeys
- Search Engine Indexer: The main actor that consumes crawled content to build a search index. It provides "Seed URLs" to start the process.
- Webmaster: The entity that owns the content being crawled. They communicate via robots.txt to define which parts of their site are off-limits.
- Data Scientist: Analyzes web-scale data for trends, using the crawler's output as a raw data lake.
In/Out Scope
- In-Scope: Distributed crawling, URL deduplication, content near-duplicate detection, politeness, and robots.txt handling.
- Out-of-Scope: Building the search index itself (Indexing), real-time ranking algorithms, and deep-web crawling (password-protected sites).
Functional Requirements
- Scalable Discovery: Identify and extract all valid hyperlinks (HTTP/HTTPS) from fetched pages.
- Polite Fetching: Ensure the system respects robots.txt and never exceeds a specified QPS (Queries Per Second) per domain.
- Deduplication: Skip URLs already crawled and content that is "substantially similar" to previously indexed pages.
- Fault Tolerance: If a crawler node fails, another must pick up the work without losing the frontier state.
Non-Functional Requirements (NFRs)
- High Throughput: Handle 1 billion pages/month (~400 QPS globally).
- Scalability: The architecture must scale horizontally by adding more fetcher nodes.
- Extensibility: The modular design should allow adding support for PDF, FTP, or other protocols with minimal refactoring.
- Robustness: Gracefully handle "crawler traps" (infinite loops), malformed HTML, and slow network responses.
Foundations: How Crawling Actually Works
At its most basic level, a web crawler is a graph traversal engine. The web is a directed graph where pages are nodes and hyperlinks are edges.
The baseline architecture follows a simple loop:
- Pick a URL from the list of known URLs (the "Frontier").
- Download the page from the internet.
- Extract new URLs found in that page.
- Filter & Store those URLs back into the Frontier.
However, moving from a single-threaded script to a production system requires solving "The Three Ps": Persistence (storing billions of URLs), Politeness (not overwhelming servers), and Parallelism (using hundreds of nodes). Without a structured "Frontier" management system, your crawler would either run out of memory or get blocked by every major firewall on the planet.
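To make the baseline loop concrete, here is a minimal single-threaded sketch in Python. It is illustrative only, assuming the requests and BeautifulSoup libraries are available; there is no politeness, no persistence, and deduplication is a plain in-memory set — exactly the gaps the Frontier, rate limiter, and Bloom Filter close later in this post.

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Naive single-threaded crawl loop: no politeness, no persistence."""
    frontier = deque(seed_urls)      # the "Frontier": URLs waiting to be fetched
    seen = set(seed_urls)            # naive URL dedupe (replaced by a Bloom Filter at scale)
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()                      # 1. pick a URL
        try:
            resp = requests.get(url, timeout=10)      # 2. download the page
        except requests.RequestException:
            continue                                  # skip unreachable pages
        pages[url] = resp.text

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):       # 3. extract new URLs
            link = urljoin(url, a["href"])
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)                 # 4. filter & store back into the Frontier
    return pages
```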
The Fetcher Mechanics: DNS & URL Normalization
The mechanism of fetching isn't as simple as GET /. To operate at scale, we must optimize the "pre-fetch" phase.
- URL Normalization: Before checking if a URL has been crawled, we must normalize it. This includes converting to lowercase, removing default ports (e.g., :80), and stripping session IDs or tracking parameters (e.g., ?utm_source=...). Without this, the crawler would fetch the same page thousands of times. A minimal normalization sketch follows this list.
- Asynchronous DNS: Standard DNS lookups are synchronous and blocking. A production fetcher uses an asynchronous DNS resolver (like Netty's) and maintains a local cache of IP addresses to avoid hitting public DNS servers 400 times a second.
- Protocol Agnosticism: The fetcher should be a pluggable module. While 99% of the web is HTTP/HTTPS, a robust crawler should be able to handle FTP or Gopher by simply swapping the protocol handler.
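The sketch below shows what such a normalization pass might look like in Python. The lists of tracking and session parameters are illustrative assumptions, and it omits relative-path resolution and percent-decoding, which are covered in the deep dive later.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)                      # assumption: strip only utm_* and session params
SESSION_PARAMS = {"sessionid", "phpsessid", "sid"}
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL so equivalent forms dedupe to one key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""

    # Drop default ports (:80 for HTTP, :443 for HTTPS)
    port = parts.port
    netloc = host if port is None or str(port) == DEFAULT_PORTS.get(scheme) else f"{host}:{port}"

    # Strip tracking / session query parameters, keep the rest in stable order
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES) and k.lower() not in SESSION_PARAMS
    ]
    query.sort()

    path = parts.path or "/"                       # empty path resolves to "/"
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))

# All of these collapse to http://example.com/page
assert normalize_url("http://Example.com:80/page?utm_source=google") == "http://example.com/page"
```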
Estimations & Design Goals
Capacity Math (The 1B Page Scale)
To crawl 1 billion pages per month, we need to understand the physical constraints of our hardware and network.
- Global QPS: $1,000,000,000 \text{ pages} / 30 \text{ days} / 86,400 \text{ seconds} \approx \mathbf{385 \text{ pages/sec}}$.
- Storage (Raw Content): Assuming 250 KB per page (HTML + metadata), $1B \times 250KB = \mathbf{250 \text{ TB/month}}$. We likely need S3 or a distributed file system (HDFS) for this.
- URL Storage: If a URL is 100 bytes on average, 1B URLs = 100 GB.
- Bloom Filter Memory: To keep track of 10B URLs (including discovered but not yet crawled) with a 1% false positive rate, we need $\approx 12 \text{ GB}$ of RAM.
Scaling Targets
- Minimum Delay per Domain: at least 5 seconds between consecutive requests to the same domain (industry-standard politeness).
- DNS Resolution: Must be $< 10ms$ (requires local caching/asynchronous resolvers).
- Fetcher Utilization: Target 80% CPU/Network utilization per node.
High-Level Design: The Distributed Crawl Architecture
The following diagram shows the separation between the URL management (The Frontier) and the execution (The Fetchers).
graph TD
Seeds[Seed URLs] --> Frontier[URL Frontier]
Frontier --> Fetchers[Distributed Fetcher Pool]
Fetchers --> DNS[Local DNS Resolver]
Fetchers --> Robots[Robots.txt Cache]
Fetchers --> ContentStore[(Content Store: S3)]
Fetchers --> ContentDedupe[SimHash Engine]
ContentDedupe --> LinkExt[Link Extractor]
LinkExt --> URLDedupe[URL Dedupe: Bloom Filter]
URLDedupe --> Frontier
The diagram captures the closed feedback loop at the heart of every web crawler. Seed URLs enter the URL Frontier, which orchestrates dispatch to the Distributed Fetcher Pool. Before each fetch, the pool consults a Local DNS Resolver (to avoid 400 public DNS lookups per second) and a Robots.txt Cache (to enforce domain-level access rules without re-downloading the robots file on every request). After each successful fetch, the content passes through the SimHash Engine for near-duplicate detection and the Link Extractor for new URL discovery. Newly extracted URLs flow through the Bloom Filter Deduplication layer before re-entering the Frontier. Content that clears deduplication is written to the Content Store (S3 or HDFS).
Deep Dive: The Two-Queue URL Frontier and the Mathematics of Bloom Filter Deduplication
The URL Frontier is the most architecturally complex component of a distributed web crawler. It must simultaneously satisfy two contradictory requirements: it must function as a priority queue (high-value pages should be crawled before low-value pages) and as a rate limiter (no single domain should receive more than N requests per unit time). No single data structure satisfies both requirements, which is why production systems use a Two-Queue Architecture.
Queue 1 – The Priority Queue (Front Queue): URLs are assigned a priority score based on factors such as estimated PageRank, domain authority, inbound link count from already-crawled pages, and content freshness signals. This queue is typically implemented as a max-heap keyed on priority score, so the highest-value URL is always dequeued first. High-authority domains such as wikipedia.org and major news sites occupy the top; low-authority blogs and comment spam occupy the bottom.
Queue 2 – The Politeness Queue (Back Queue): Before any URL is dispatched to a fetcher, it passes through the Politeness Queue, a collection of per-domain FIFO queues. Each domain gets exactly one queue. A Scheduler monitors the last-fetch timestamp for each domain and prevents its queue from releasing the next URL until the configured minimum delay (typically 5 seconds for production crawlers) has elapsed. This prevents any fetcher from accidentally issuing concurrent requests to the same domain, regardless of how many of that domain's pages appear in the Priority Queue.
Internals: How Robots.txt Enforcement and URL Normalization Work at Scale
Before a URL enters the Fetcher, two pre-flight checks must complete. First, the Robots.txt Cache is consulted to determine whether the target path is accessible. Each domain's robots.txt file is fetched once and cached with a 24-hour TTL. The cache maps each domain to a parsed rule set (a list of Allow and Disallow directives per user agent). If the path is disallowed, the URL is dropped silently and never enters the Content Store.
Second, URL normalization is applied before any deduplication check. Normalization rules include: converting the scheme and host to lowercase, removing default ports (:80 for HTTP, :443 for HTTPS), stripping session IDs and tracking parameters (any query parameter matching patterns like utm_*, sessionid=, PHPSESSID=), resolving relative paths (../), and decoding percent-encoded characters where unambiguous. Without normalization, http://Example.com/page, http://example.com:80/page, and http://example.com/page?utm_source=google would be treated as three distinct URLs requiring three fetches of identical content.
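A minimal in-process version of the robots.txt cache might look like the following sketch, using Python's standard urllib.robotparser and the 24-hour TTL described above; a production deployment would back this cache with Redis as in the data model table later in this post.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 3600  # 24-hour TTL, matching the cache policy described above

class RobotsCache:
    """Per-domain robots.txt cache: fetch once, reuse for 24 hours."""
    def __init__(self, user_agent: str = "my-crawler"):
        self.user_agent = user_agent
        self._cache = {}   # domain -> (parsed RobotFileParser or None, fetched_at)

    def allowed(self, url: str) -> bool:
        domain = urlsplit(url).netloc.lower()
        entry = self._cache.get(domain)
        if entry is None or time.time() - entry[1] > ROBOTS_TTL:
            parser = RobotFileParser(f"https://{domain}/robots.txt")
            try:
                parser.read()                 # one network fetch per domain per TTL window
            except OSError:
                parser = None                 # unreachable robots.txt: assume allowed
            self._cache[domain] = (parser, time.time())
            entry = self._cache[domain]
        parser = entry[0]
        return True if parser is None else parser.can_fetch(self.user_agent, url)
```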
The Two-Queue Frontier Architecture
graph TD
New[New URLs Discovered] --> PQ[Priority Queue: Score-Sorted Heap]
PQ --> Selector[Domain Selector]
Selector --> DQ1[Domain Queue: wikipedia.org]
Selector --> DQ2[Domain Queue: nytimes.com]
Selector --> DQ3[Domain Queue: example.com]
DQ1 --> Throttle[Rate Limiter: min 5s per domain]
DQ2 --> Throttle
DQ3 --> Throttle
Throttle --> Fetcher[Distributed Fetcher Pool]
The Selector is a hash map in memory mapping each domain to its queue identifier. The Rate Limiter maintains a Redis hash of domain → last_fetch_timestamp and blocks the queue from releasing the next URL until now - last_fetch > min_delay. This enforcement happens at the scheduler level, not the fetcher level, which is a critical distinction. If enforcement were at the fetcher level, multiple fetcher threads could independently decide it is "time" to fetch from the same domain and issue simultaneous requests, violating politeness even if each individual fetcher respected the delay.
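A sketch of that scheduler-side politeness gate, assuming the redis-py client and the domain:lastfetch key from the data model table later in this post:

```python
import time
import redis

MIN_DELAY_SECONDS = 5.0                      # minimum gap between requests to one domain
r = redis.Redis(decode_responses=True)       # assumes a local Redis instance

def may_release(domain: str) -> bool:
    """Scheduler-side politeness gate: only release the next URL for `domain`
    if at least MIN_DELAY_SECONDS have passed since its last fetch."""
    now = time.time()
    last = r.hget("domain:lastfetch", domain)
    if last is not None and now - float(last) < MIN_DELAY_SECONDS:
        return False                          # too soon: keep the URL at the head of its queue
    # Note: a production scheduler would wrap this read-check-write in a Lua
    # script so two scheduler threads cannot race on the same domain.
    r.hset("domain:lastfetch", domain, now)
    return True
```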
The Mathematics of Bloom Filter Deduplication
A Bloom Filter is a probabilistic data structure that answers "Has this URL been seen before?" in O(1) time with sub-linear space requirements. It consists of a bit array of size M and K independent hash functions. To insert a URL, all K hash functions are applied and the resulting K bit positions are set to 1. To query, the same K positions are checked: if all are 1, the URL has "probably" been seen; if any bit is 0, it has definitely not been seen.
Key Property: Bloom Filters have zero false negatives (a URL that was crawled will never be falsely reported as "not seen") but a configurable false positive rate. In web crawling, a false positive means skipping a URL we should crawl, which is acceptable for 1% of cases. A false negative (re-crawling a page already indexed) wastes resources but never happens with a Bloom Filter.
Sizing for 10 Billion URLs at 1% False Positive Rate
| Parameter | Value |
| --- | --- |
| Total URLs (N) | 10 billion |
| Target false positive rate (p) | 1% (0.01) |
| Required bit array size (M) | ~95.85 billion bits (~12 GB) |
| Optimal hash function count (K) | ~7 |
| Memory cost per URL | ~9.6 bits |
This 12 GB Bloom Filter fits on a single high-memory server instance and can be loaded into Redis as a BitSet or maintained in the crawling process's own heap. In practice, crawlers shard the Bloom Filter across multiple machines using a consistent hashing scheme when the URL set exceeds what a single node can hold.
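The table's numbers follow from the standard Bloom Filter sizing formulas, $M = -N \ln p / (\ln 2)^2$ and $K = (M/N) \ln 2$. A few lines of Python reproduce them:

```python
import math

def bloom_filter_size(n: int, p: float):
    """Standard Bloom Filter sizing: bit array size m (bits) and hash count k
    for n elements at target false positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)   # required bits
    k = (m / n) * math.log(2)                    # optimal number of hash functions
    return m, k

m, k = bloom_filter_size(10_000_000_000, 0.01)
print(f"bits = {m / 1e9:.1f} billion (~{m / 8 / 1e9:.1f} GB), hashes = {k:.1f}")
# -> bits = 95.9 billion (~12.0 GB), hashes = 6.6 (rounded up to 7 in practice)
```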
Performance Analysis: Throughput, DNS Cost, and Fetcher Utilization at 400 QPS
The system must sustain 385 pages per second globally (1 billion pages per month). At 250 KB per page and 100 Mbps per fetcher node, a single fetcher can sustain approximately 50 pages per second when accounting for DNS resolution, TCP handshake, and TLS overhead. Reaching 385 QPS therefore requires a minimum of 8 fetcher nodes, but production deployments use 100+ nodes to handle the long tail of slow-responding domains and provide fault tolerance when individual nodes fail.
DNS resolution is a hidden bottleneck. A naive fetcher resolves every hostname via the public DNS infrastructure, which adds 20-200 ms per request at scale. At 385 QPS, this means 385 DNS lookups per second, enough to get the crawler's IP flagged as a DNS abuser. The mitigation is a local Unbound or Bind9 resolver running on each fetcher node, caching responses with a TTL that respects the target domain's DNS TTL (typically 60-3600 seconds). A well-tuned local DNS cache reduces DNS lookup cost to under 1 ms for any domain that has been visited in the past hour.
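A minimal TTL-based DNS cache illustrates the saving. Python's socket API does not expose record TTLs, so a fixed TTL is assumed here; a real fetcher would rely on a local Unbound/Bind9 resolver and asynchronous lookups as described above.

```python
import socket
import time

DNS_CACHE_TTL = 300          # assumption: fixed 5-minute TTL, since the socket API hides real TTLs
_dns_cache = {}              # hostname -> (ip_address, resolved_at)

def resolve(hostname: str) -> str:
    """Return a cached IP for hostname, resolving at most once per TTL window."""
    entry = _dns_cache.get(hostname)
    if entry and time.time() - entry[1] < DNS_CACHE_TTL:
        return entry[0]                           # cache hit: microseconds instead of 20-200 ms
    ip = socket.gethostbyname(hostname)           # blocking lookup; a real fetcher uses async DNS
    _dns_cache[hostname] = (ip, time.time())
    return ip
```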
URL Frontier and Deduplication Data Model
| Component | Storage | Key Format | Notes |
| --- | --- | --- | --- |
| Priority Queue | Redis Sorted Set | frontier:priority | Score equals URL priority value |
| Domain Queues | Redis List per domain | frontier:domain:{domain_hash} | FIFO queue per domain |
| Last Fetch Timestamp | Redis Hash | domain:lastfetch | Polled by Rate Limiter |
| Bloom Filter | Redis BitSet or in-process | dedupe:bloom | ~12 GB for 10 B URLs at 1% FPR |
| Robots.txt Cache | Redis Hash | robots:{domain} | TTL 24 hours |
| SimHash Fingerprints | Cassandra | {fingerprint_64bit} | Queried for near-duplicate detection |
How Googlebot, Bingbot, and CommonCrawl Deploy Crawler Architecture at Scale
Google has reported knowing of more than 130 trillion web pages, and Googlebot is the world's most sophisticated crawler. Google's published patents and engineering papers describe a URL Frontier nearly identical to the two-queue architecture above, augmented with a "freshness" layer that dynamically adjusts re-crawl frequency based on observed change rate. News sites are re-crawled hourly, product listing pages daily, and static documentation monthly. Google also uses proprietary signals, such as server response time trends and historical crawl success rates, to dynamically throttle its crawl rate per domain, not just the fixed 5-second delay used by simpler crawlers.
Bingbot follows Microsoft's publicly documented Bing Webmaster Guidelines. Their architecture emphasizes "crawl budget management" for large sites with millions of auto-generated pages, for example e-commerce sites where every product variant generates a unique URL. Without a crawl budget limit, Bingbot could spend its entire allocation for a single domain crawling millions of near-identical product pages, starving the rest of its index.
CommonCrawl is a non-profit open-source web crawl that releases monthly data sets used by researchers, AI training pipelines, and academic institutions. Unlike commercial crawlers, CommonCrawl prioritizes breadth over freshness: it aims to crawl as much of the web as possible without prioritizing re-crawl frequency. Their architecture is built on Apache Nutch running on Hadoop clusters, using HDFS for the URL frontier state and MapReduce for link extraction. This distributed-systems-first approach sacrifices real-time responsiveness for horizontal scale across hundreds of machines.
The Core Trade-offs in Web Crawler Architecture
Depth-First vs. Breadth-First Traversal: A depth-first crawler follows the first link on every page as deep as possible before backtracking. It is highly susceptible to crawler traps: infinite calendar pages, endlessly paginated results, and dynamic parameter combinations. A breadth-first crawler processes all links at the current depth before going deeper. It is more trap-resistant but requires a significantly larger in-memory frontier queue. Production crawlers use priority-weighted BFS, which is functionally breadth-first but biased toward high-authority domains, combining trap-resistance with value maximization.
Bloom Filter False Positive Rate vs. Memory Cost: Reducing the false positive rate from 1% to 0.1% increases the required bit array size by roughly 50%, from about 12 GB to about 18 GB. For most production crawlers, a 1% skip rate (missing 10 million URLs in a 1-billion-page crawl) is an acceptable trade-off against the infrastructure cost of the larger filter. If completeness is critical (for example, a legal discovery crawl where missing any document carries liability), a persistent hash set in Cassandra provides zero false positives at the cost of much higher storage and O(log N) lookup latency.
SimHash Near-Duplicate Threshold Sensitivity: SimHash generates a 64-bit fingerprint from page content using locality-sensitive hashing. Two pages are considered near-duplicates if their Hamming distance is 3 or fewer bits out of 64. This threshold is configurable: a threshold of 3 catches near-identical pages (same article, different URL); a threshold of 10 catches pages that are substantially similar but with different boilerplate sections. Setting the threshold too high causes genuine unique content to be incorrectly skipped; too low allows near-duplicate pollution in the index.
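A compact illustration of SimHash fingerprinting and the Hamming-distance check appears below. The token weighting (term frequency) and the 64-bit token hash (truncated MD5) are illustrative choices, not the exact scheme a production crawler would use.

```python
import hashlib
import re
from collections import Counter

def simhash64(text: str) -> int:
    """Build a 64-bit SimHash fingerprint from whitespace-tokenized text."""
    weights = Counter(re.findall(r"\w+", text.lower()))   # token -> term frequency weight
    vector = [0] * 64
    for token, weight in weights.items():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")  # 64-bit token hash
        for bit in range(64):
            vector[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if vector[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def is_near_duplicate(fp_a: int, fp_b: int, threshold: int = 3) -> bool:
    """Two pages are near-duplicates if their fingerprints differ in <= threshold bits."""
    return bin(fp_a ^ fp_b).count("1") <= threshold
```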
URL Deduplication Strategy Decision Framework
| Scale | Best Deduplication Strategy | Storage Cost | Trade-off |
| --- | --- | --- | --- |
| Fewer than 10 M URLs | In-process HashSet | ~800 MB RAM | Zero false positives; does not survive restarts |
| 10 M to 1 B URLs | Redis Set with URL hashes | ~100 bytes per URL; ~100 GB | Persistent, O(1) lookup, exact match |
| 1 B to 100 B URLs | Bloom Filter | ~12 GB at 1% FPR | Sub-linear space; 1% false positive skip rate |
| Over 100 B URLs | Distributed Bloom Filter (sharded) | ~120 GB+ | Network overhead for cross-shard lookups |
Presenting the Web Crawler HLD in a System Design Interview
Start every web crawler answer by drawing the core loop on the whiteboard: Frontier → Fetcher → Content Store → Link Extractor → Frontier. This loop is the entire system. Every other component (Bloom Filter, SimHash Engine, Robots.txt cache, DNS cache) is a correctness or performance guard on one of the loop's stages.
Interviewers reliably probe three specific areas in every web crawler question. Prepare for each:
Politeness: Explain the two-queue architecture before the interviewer asks. The key insight to communicate is that politeness must be enforced at the scheduler level, not the fetcher level. If enforcement were at the fetcher level, multiple fetchers independently checking the clock could both decide it is safe to fetch from the same domain simultaneously, violating politeness even if each individual fetcher appears to respect the delay.
Deduplication: Distinguish clearly between URL deduplication (Bloom Filter: "have we crawled this URL before?") and content deduplication (SimHash: "have we seen this content before, possibly at a different URL?"). These are separate problems with separate data structures and are frequently conflated. Both are required: URL deduplication prevents re-fetching; content deduplication prevents index pollution from mirror sites, AMP pages, and copy-paste spam.
Scale: Ground your answer with the capacity math. "1 billion pages per month is roughly 385 pages per second globally. At 250 KB per page, that is approximately 96 MB/sec of sustained bandwidth per fetcher cluster. Distributed across 100 fetcher nodes, each node handles about 960 KB/sec, trivially within the capacity of a single data center rack."
What Building a Web Crawler Teaches About Adversarial Distributed Systems
A web crawler is a distributed system that must cooperate with millions of external systems (the web's servers) over which it has zero control. This forces architectural decisions that most internal distributed systems never encounter: adversarial environments, unpredictable response times, resource-constrained counterparties, and legal constraints from robots.txt and terms of service.
Every design choice in a web crawler is a negotiation between the crawler's appetite for throughput and the web's need for stability. The politeness constraint is not merely an ethical guideline; it is a self-preservation mechanism. A crawler that ignores politeness quickly finds its IP ranges blacklisted by CDNs, blocked by Web Application Firewalls, and reported to ISP abuse teams. The constraint enforces disciplined engineering in a way that no internal system ever requires. Building a crawler teaches you to design software that is not just correct in isolation, but well-behaved as a participant in a larger ecosystem.
TLDR & Key Takeaways
- A web crawler is a graph traversal engine on the world's largest directed graph, constrained by politeness, deduplication, and fault tolerance requirements.
- The Two-Queue URL Frontier cleanly separates priority scheduling (which URLs to crawl next) from politeness rate limiting (how fast to crawl each individual domain).
- Bloom Filters enable O(1) URL deduplication for 10 billion URLs in approximately 12 GB of RAM, with a configurable false positive skip rate.
- SimHash detects near-duplicate content using a 64-bit fingerprint regardless of URL, preventing duplicate AMP pages, mirrors, and copy-paste spam from polluting the index.
- Per-domain rate limiting (minimum 5 seconds between requests) is enforced at the scheduler level to prevent accidental DDoS against target sites.
- At 1 billion pages per month (~400 QPS), the system requires approximately 100 distributed fetcher nodes for adequate parallelism and fault tolerance.
- Separate Bloom Filter tuning for URL deduplication and SimHash threshold tuning for content deduplication are independent architectural decisions driven by different correctness requirements.
Related Posts
- System Design HLD: Search Autocomplete. The downstream consumer of web crawler output, turning the indexed content into instant search suggestions and type-ahead completions.
- System Design HLD: Distributed Cache. The Redis patterns used for the Bloom Filter bit array, Robots.txt cache, domain queue management, and last-fetch timestamp tracking in this design.
- System Design HLD: File Storage & Sync. The S3 and HDFS blob storage patterns used for the raw content store where crawled pages and extracted link graphs are persisted.
Written by
Abstract Algorithms
@abstractalgorithms