System Design HLD Example: Web Crawler
A practical interview-ready HLD for a distributed web crawler handling politeness, deduplication, and scale.
TLDR: A distributed web crawler needs three things to work at scale: a priority-aware URL Frontier that drives crawl order, a Politeness Module that enforces per-domain rate limits so crawlers don't take down small sites, and a two-layer deduplication stack (Bloom filters for URL deduplication, SimHash for near-duplicate content detection). Everything else (DNS caching, connection pooling, distributed fetcher workers) is a performance multiplier on top of those three.
Googlebot crawls roughly 15 billion pages per month. That works out to roughly 5,700 page fetches per second, continuously, around the clock. Google does not ask individual websites for permission before crawling them, yet if Googlebot hit a small news blog at 5,700 req/sec it would saturate the server within seconds.
The design paradox at the heart of a web crawler is this: you need to be as fast as possible globally, but as slow as necessary per domain. That tension drives almost every architectural decision in this HLD.
By the end of this walkthrough you'll understand why the URL Frontier is the hardest component to scale, why a Bloom filter beats a hash set for URL deduplication at a billion-URL scale, and why SimHash catches near-duplicate pages that MD5 checksums silently miss.
What a Crawler Is Actually Doing: Use Cases and Actors
Before picking any technology, we need to know who uses the system and what jobs they need it to do.
Actors
| Actor | Role |
| --- | --- |
| Seed URL provider | Injects starting URLs into the Frontier (editorial team, sitemap parser, or prior crawl result) |
| Fetcher worker | Downloads the raw HTML of a single URL; reports success, failure, or redirect |
| Parser | Extracts outbound links and structured metadata from fetched HTML |
| Politeness Module | Enforces per-domain crawl-delay rules and robots.txt directives |
| Deduplication layer | Checks Bloom filter (URL-level) and SimHash index (content-level) before storing a page |
| Content Store | Persists raw HTML + metadata for downstream indexing pipelines |
| Scheduler | Reads the URL Frontier; selects the next URL to fetch respecting domain rate limits |
Use Cases
- URL discovery: extract all `<a href>` links from a fetched page and enqueue new, unseen URLs
- Page download: fetch the raw HTML at a given URL, follow redirects (up to a configurable hop limit), and store the response
- Content change detection: compare the content hash of the newly-fetched page with the stored hash; only write to the Content Store if the page changed
- robots.txt compliance: before fetching any URL on a domain, read and cache the domain's robots.txt; honour Disallow rules and Crawl-delay directives
- Duplicate URL prevention: check a URL against the Bloom filter before enqueuing; reject if already seen
- Near-duplicate page detection: compute SimHash for each fetched page; if Hamming distance from a known page is ≤ 3 bits, mark as near-duplicate and skip content storage
This template starts with actors and use cases because architecture only makes sense once the crawl workload shape is explicit. In interviews, naming the actors forces precision about write-path vs. read-path responsibilities before any component is named.
Functional Requirements: What's In and What's Out
In Scope (v1)
- Broad web crawl starting from a set of seed URLs (BFS-based URL expansion)
- robots.txt parsing and Crawl-delay enforcement per domain
- URL deduplication via Bloom filter (reject already-crawled URLs before enqueuing)
- Content deduplication via SimHash (skip near-duplicate pages before writing to Content Store)
- Raw HTML storage with URL, crawl timestamp, HTTP status, and content hash
- Per-domain politeness: maximum 1 req/sec per domain by default, configurable per domain
Out of Scope (v1)
| Excluded Feature | Reason |
| --- | --- |
| JavaScript rendering (SPAs, dynamic pages) | Requires a full headless browser (Chromium); separate service with entirely different resource model |
| Content indexing pipeline | Text extraction, NLP, ranking signals are downstream consumers of the raw HTML; not part of the crawler |
| Login-gated pages | Authentication flows require credential management; a separate access-controlled crawl tier |
| Real-time link graph analysis | Graph processing at this scale (PageRank updates) belongs in an offline batch pipeline |
Naming non-goals explicitly is a high-signal move in interviews: it tells the interviewer you understand the system boundary and have consciously traded scope for simplicity.
Capacity Estimation and Design Goals
Capacity Estimation
Assume: 1 billion pages per month crawl target.
| Metric | Calculation | Result |
| --- | --- | --- |
| Pages per second (QPS) | 1B / (30 × 24 × 3600) | ~385 pages/sec |
| Average page size (HTML) | Empirical web average | ~250 KB |
| Raw HTML storage per month | 1B × 250 KB | ~250 TB/month |
| URL Frontier queue depth (1-week buffer) | 385 × 7 × 86400 | ~233M URLs in queue |
| Bloom filter size (10B URLs, 1% false-positive) | m = -n·ln(p) / (ln 2)² | ~12 GB (fits in Redis) |
| DNS cache entries (unique domains) | ~200M domains estimated | ~200M entries in Redis |
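The Bloom filter row follows directly from the standard sizing formulas; a quick sketch to sanity-check the 12 GB and hash-count numbers (the function name is illustrative):

```python
import math

def bloom_filter_params(n: int, p: float) -> tuple[int, int]:
    """Return (bits, hash functions) for n elements at false-positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)  # optimal bit count
    k = (m / n) * math.log(2)                  # optimal number of hash functions
    return math.ceil(m), round(k)

m_bits, k = bloom_filter_params(10_000_000_000, 0.01)
print(f"{m_bits / 8 / 1e9:.1f} GB, {k} hash functions")  # 12.0 GB, 7 hash functions
```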
Design Goals
- Politeness enforcer caps at max 1 req/sec per domain by default; configurable up to 10 req/sec for domains that have opted in via Crawl-delay: 0
- Bloom filter deduplication keeps the URL re-crawl false-positive rate at 1% using 7 hash functions over 12 GB of bits
- SimHash near-duplicate detection marks pages as duplicate if Hamming distance ≤ 3 bits out of 64; threshold tunable per content vertical
- Fetcher SLA: p99 fetch latency ≤ 5 seconds per page (including DNS resolution); timeout + retry with exponential backoff on failure
- Content Store write throughput: sustain 400 writes/sec to object storage (S3-compatible) for raw HTML blobs
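The "retry with exponential backoff" goal above can be made concrete; a minimal sketch of the full-jitter variant (the base and cap values are assumptions, not from the original spec):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: the retry window doubles per attempt, capped.
    Jitter spreads retries out so failed fetches don't re-hit a server in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Retry windows grow 1s, 2s, 4s, 8s... and never exceed the 60s cap.
```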
High-Level Architecture: How All the Pieces Fit Together
The crawler is a pipeline: URLs enter the Frontier, get scheduled by a politeness-aware Scheduler, are fetched by a pool of stateless Fetcher workers, parsed for new links and content hash, and then either stored or discarded based on deduplication checks.
The diagram below shows the full data flow. Each arrow represents a message queue or direct call. Notice the two feedback loops: the URL Extractor feeds back into the Frontier (BFS expansion), and the Content Dedup layer feeds back to skip writes.
```mermaid
graph TD
    SEEDS([Seed URLs]) --> FRONTIER
    subgraph URL Frontier
        FRONTIER[Priority Queue\nper-domain FIFO buckets]
    end
    FRONTIER --> SCHEDULER[Scheduler\nPolite rate limiter]
    SCHEDULER -->|next URL| FETCHER_POOL
    subgraph Fetcher Pool
        FETCHER_POOL[Fetcher Workers\nHTTP + DNS cache]
    end
    FETCHER_POOL -->|raw HTML| PARSER[Parser\nLink extractor + metadata]
    FETCHER_POOL -->|error| RETRY[Retry Queue\nexponential backoff]
    RETRY --> FRONTIER
    PARSER -->|new URLs| URL_DEDUP{Bloom Filter\nURL Dedup}
    URL_DEDUP -->|unseen| FRONTIER
    URL_DEDUP -->|seen| DISCARD1[Discard]
    PARSER -->|page content| CONTENT_DEDUP{SimHash\nContent Dedup}
    CONTENT_DEDUP -->|new/changed| CONTENT_STORE[(Content Store\nS3 raw HTML)]
    CONTENT_DEDUP -->|near-duplicate| DISCARD2[Skip write]
    SCHEDULER --> POLITENESS[Politeness Module\nrobots.txt + Crawl-delay]
    POLITENESS -->|allowed| FETCHER_POOL
    POLITENESS -->|blocked| BACKOFF[Domain back-off queue]
    BACKOFF --> FRONTIER
    FETCHER_POOL --> DNS_CACHE[(DNS Cache\nRedis TTL 5 min)]
```
The Scheduler is the chokepoint between the Frontier and the Fetcher pool: it is the only component that enforces per-domain politeness. Every URL must pass through the Politeness Module before a Fetcher worker is dispatched.
Component Deep Dives
Internals: How Each Crawl Subsystem Manages State
The crawler pipeline is a chain of stateful components, and each one manages a different slice of the crawl's memory:
| Subsystem | State it owns | Where state lives |
| --- | --- | --- |
| URL Frontier | Per-domain FIFO queues + priority heap | Redis sorted sets (hot); PostgreSQL url_frontier table (overflow) |
| Politeness Module | Last-fetch timestamp per domain; robots.txt rules | Redis (STRING + EXPIRE for rate window; HASH for robots.txt cache) |
| URL Dedup (Bloom filter) | Bit array of seen URL hashes | Redis (RedisBloom BF, 12 GB persistent) |
| Content Dedup (SimHash) | 64-bit fingerprint per crawled page | Redis (HASH, sharded; cold in PostgreSQL crawled_pages.content_hash) |
| DNS Cache | Hostname → IP mappings | Redis (STRING, 5-min TTL) |
| Fetcher Workers | Warm HTTP connection pool per domain | In-process (per worker node); re-created on restart |
Understanding state ownership clarifies which components are safe to scale horizontally (stateless Fetcher workers) versus which require careful partitioning (Frontier, Politeness Module).
The URL Frontier: Driving Crawl Order at Scale
The URL Frontier is the heart of the crawler. It answers one question on every scheduling tick: which URL should we fetch next?
A naïve FIFO queue fails immediately: it will pull 1,000 URLs from the same domain in sequence, hammer that server, and violate politeness rules. The Frontier needs two layers of queuing:
- Priority queues (front-end): URLs are bucketed by crawl priority (freshness score, PageRank estimate, or recency of the domain's last update). High-priority URLs (e.g., a major news homepage) jump the queue.
- Per-domain FIFO queues (back-end): Each domain has its own FIFO queue. The Scheduler pulls from the front-end priority queues, then dispatches a URL from the appropriate domain queue, but only if the domain's politeness timer has expired.
BFS vs DFS vs Priority-based traversal
| Strategy | Behaviour | Best for |
| --- | --- | --- |
| BFS | Discovers broad, shallow pages first | General-purpose broad web crawl (default) |
| DFS | Goes deep on a single domain before backtracking | Focused crawl of a single site's entire archive |
| Priority-based | Scores URLs by freshness + authority; crawls highest-value pages first | News crawl, sitemap-guided re-crawl |
For a broad web crawl, BFS is the correct default: it surfaces diverse high-authority pages quickly rather than exhausting one domain before moving on.
Partitioning the Frontier across multiple crawl nodes: Partition by hash(domain) % N where N is the number of crawl workers. All URLs belonging to nytimes.com always route to the same worker, which centralises per-domain state (politeness timer, robots.txt cache) without any cross-node coordination.
The Politeness Module: Keeping Crawlers from Crashing Websites
Every time the Scheduler picks a URL, the Politeness Module makes three checks:
- robots.txt: has the crawler fetched and cached this domain's robots.txt? Is the target path allowed for the crawler's User-Agent?
- Crawl-delay: what is the minimum wait between requests to this domain? Default: 1 second. If Crawl-delay is declared in robots.txt, honour it.
- Domain rate limiter: has the configured inter-request delay elapsed since the last fetch from this domain?
```python
# Pseudocode: robots.txt compliance check
def is_allowed(domain: str, path: str, user_agent: str = "MyCrawler") -> bool:
    rules = robots_cache.get(domain)
    if rules is None:
        rules = fetch_and_parse_robots_txt(domain)  # HTTP GET /robots.txt
        robots_cache.set(domain, rules, ttl=86400)  # cache for 24 hours
    for rule_group in rules:
        if user_agent_matches(rule_group.agents, user_agent):
            for disallow in rule_group.disallow:
                if path.startswith(disallow):
                    return False  # path is explicitly disallowed
    return True
```
The Redis rate limiter enforces the per-domain crawl delay. The INCR + EXPIRE pattern gives an atomic, TTL-bounded request counter per domain key:
```lua
-- Redis Lua script: per-domain politeness gate (atomic)
local key = "crawl:ratelimit:" .. KEYS[1]  -- e.g., crawl:ratelimit:nytimes.com
local limit = tonumber(ARGV[1])            -- max requests per window (e.g., 1)
local ttl = tonumber(ARGV[2])              -- window in seconds (e.g., 1)

local count = redis.call("INCR", key)
if count == 1 then
    redis.call("EXPIRE", key, ttl)
end

if count > limit then
    return 0  -- rate-limited: scheduler should back off this domain
else
    return 1  -- allowed: proceed with fetch
end
```
A 0 return from the Lua script pushes the URL back into the domain's back-off queue with a timestamp. The Scheduler re-examines it only after the TTL expires.
URL Deduplication: Why a Bloom Filter Beats a Hash Set at Ten Billion URLs
After the Parser extracts new URLs from a fetched page, each candidate URL passes through the deduplication layer. The question is: have we already crawled or enqueued this URL?
A plain hash set would work, until it doesn't. At 10 billion URLs, a Python set holding 50-byte URL hashes consumes ~500 GB of RAM. That's not deployable.
A Bloom filter solves this with a fixed-size bit array. The false-positive probability is:
P(false_positive) ≈ (1 - e^(-kn/m))^k
where:
k = number of hash functions
n = number of inserted elements
m = number of bits in the filter
For 10 billion URLs at 1% false-positive rate, the optimal parameters are:
- m ≈ 95.8 billion bits ≈ 12 GB
- k = 7 hash functions (optimal: k = (m/n) × ln 2)
A 12 GB Redis Bloom filter (using the RedisBloom module) handles 10 billion URLs with a false-positive rate of 1%, meaning 1 in 100 unseen URLs is incorrectly flagged as already-crawled. At our 385 pages/sec crawl rate, that means roughly 3-4 valid URLs are missed per second, an acceptable loss for the enormous memory savings.
Bloom filter properties to remember in interviews: Bloom filters have no false negatives. If the filter says "not seen", it is definitely not seen. The only failure mode is false positives (reported as seen when not actually seen), which results in a small number of valid pages being skipped, not silent corruption.
Content Deduplication: SimHash for Near-Duplicate Page Detection
MD5 or SHA-256 checksums catch exact duplicates: pages with byte-for-byte identical content. They miss near-duplicates: a product page that differs only in a sidebar ad, a news article reprinted with a changed byline, or a paginated article where only the page number differs.
SimHash produces a 64-bit fingerprint where similar documents produce fingerprints with similar bit patterns. The Hamming distance between two SimHashes correlates with document similarity.
```python
# SimHash algorithm sketch
import mmh3  # third-party MurmurHash3 bindings (pip install mmh3)

def simhash(tokens: list[str]) -> int:
    vector = [0] * 64                            # 64-bit output
    for token in tokens:
        h = mmh3.hash64(token, signed=False)[0]  # unsigned 64-bit MurmurHash3
        weight = 1                               # uniform weight; use TF-IDF weight in production
        for i in range(64):
            if h & (1 << i):
                vector[i] += weight              # bit is 1: add weight
            else:
                vector[i] -= weight              # bit is 0: subtract weight
    fingerprint = 0
    for i in range(64):
        if vector[i] > 0:
            fingerprint |= (1 << i)              # set bit if net positive
    return fingerprint

def is_near_duplicate(new_hash: int, stored_hash: int, threshold: int = 3) -> bool:
    xor = new_hash ^ stored_hash
    return bin(xor).count("1") <= threshold      # Hamming distance <= 3 bits = near-duplicate
```
At interview time, the key decision to articulate is: use MD5 for exact-duplicate detection as a fast first pass, then SimHash for near-duplicate detection as a second pass. Running SimHash on every page and storing all 64-bit fingerprints for 10 billion pages consumes only 80 GB (10B × 8 bytes), manageable as a distributed in-memory index.
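The two-pass decision above can be sketched as follows (the in-process set and list stand in for the Redis-backed MD5 and SimHash indexes; a production SimHash lookup would use a banded/permuted index rather than this linear scan):

```python
import hashlib

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def dedup_decision(html: str, new_simhash: int,
                   md5_index: set[str], simhash_index: list[int],
                   threshold: int = 3) -> str:
    """Two-pass content dedup: cheap exact match first, SimHash second."""
    digest = hashlib.md5(html.encode()).hexdigest()
    if digest in md5_index:
        return "exact-duplicate"   # byte-identical page: skip the store entirely
    if any(hamming(new_simhash, h) <= threshold for h in simhash_index):
        return "near-duplicate"    # within 3 bits of a known page: skip the write
    md5_index.add(digest)
    simhash_index.append(new_simhash)
    return "new"                   # genuinely new content: write to Content Store
```

The ordering matters: the MD5 set lookup is O(1) and filters out the common byte-identical case before the more expensive SimHash comparison runs.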
Distributed Fetcher Pool: Throughput Without Overloading the Scheduler
Fetcher workers are stateless. Each worker:
- Receives a URL from the Scheduler
- Resolves DNS (from the DNS cache first)
- Opens a persistent HTTP/1.1 or HTTP/2 connection (connection pool per domain)
- Downloads the HTML body (max configurable size: e.g., 10 MB; truncate beyond that)
- Passes the raw HTML to the Parser
- Reports status back to the Scheduler (success, HTTP 4xx, HTTP 5xx, timeout)
Workers scale horizontally. Each worker node handles one domain partition (per the hash(domain) % N assignment), which means it can maintain a warm connection pool and DNS cache for its assigned domains without cross-node invalidation.
DNS caching: DNS lookups add 50-200 ms of latency per page if uncached. A Redis DNS cache with a 5-minute TTL cuts DNS resolution to a sub-millisecond local lookup for hot domains. The cache key is the hostname; the value is the resolved IP address.
Write Path: From Discovered URL to Content Store
This sequence shows what happens when the Parser discovers a brand-new URL on a page that was just fetched.
```mermaid
sequenceDiagram
    participant P as Parser
    participant BF as Bloom Filter
    participant SH as SimHash Index
    participant F as URL Frontier
    participant S as Scheduler
    participant PM as Politeness Module
    participant W as Fetcher Worker
    participant CH as Content Store
    P->>BF: CHECK url_hash
    BF-->>P: NOT_SEEN
    P->>BF: ADD url_hash
    P->>F: ENQUEUE(url, priority)
    F-->>S: next URL (priority order)
    S->>PM: is_allowed(domain, path)
    PM-->>S: ALLOWED (rate limit OK)
    S->>W: FETCH(url)
    W-->>P: raw HTML + headers
    P->>SH: CHECK content_hash (SimHash)
    SH-->>P: NEW_CONTENT
    P->>CH: WRITE(url, html, metadata)
```
The write path has two deduplication gates: one at URL enqueue time (Bloom filter check on the URL hash) and one at content write time (SimHash comparison). A URL that passes the first gate but produces a near-duplicate page is fetched but not stored, saving Content Store space while still consuming a fetch slot.
Read Path: How the Scheduler Picks the Next URL to Fetch
The Scheduler loop runs continuously:
- Pop the highest-priority URL from the front-end priority queue in the Frontier
- Look up the domain's last-fetch timestamp in Redis (domain:last_fetch:{domain})
- If now - last_fetch < crawl_delay, push the URL into the domain's back-off queue and pop the next URL instead
- Check the robots.txt cache; skip if the path is disallowed
- Dispatch the URL to an available Fetcher worker in the domain's worker partition
- Update domain:last_fetch:{domain} = now
The Scheduler never blocks on a single domain. If nytimes.com is in its crawl-delay window, the Scheduler skips to the next domain. This is the key insight that keeps throughput high: domain-level politeness does not pause the entire crawl.
Performance Analysis: Crawl Throughput Ceiling and Latency Contributors
Understanding the latency budget per page fetch is essential for sizing the fetcher pool and diagnosing production slowdowns.
| Latency Contributor | Typical Cost | Mitigation |
| --- | --- | --- |
| DNS resolution (cold) | 50-200 ms | Redis DNS cache (5-min TTL) reduces to < 1 ms on cache hit |
| TCP handshake (cold) | 100-300 ms | Persistent connection pool per domain; reuse across fetches |
| TLS handshake (HTTPS) | 50-150 ms | TLS session resumption; HTTP/2 multiplexing |
| HTTP response (server-side) | 50-500 ms | Configurable timeout (2 s default); skip slow servers after 3 timeouts |
| HTML download (250 KB avg) | 20-100 ms | Content size cap (10 MB); truncate at limit |
| Total p50 per fetch | ~300-800 ms | Target: p99 ≤ 5 s |
At 385 fetches/sec with a p50 latency of 500 ms, you need at least 193 concurrent Fetcher workers to keep the pipeline saturated (Little's Law: N = λ × W = 385 × 0.5). In practice, size the pool at 2× the theoretical minimum (≈ 400 workers) to absorb bursts and slow-domain back-off delays.
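The sizing arithmetic above as a small helper (the function name and the 2× headroom default are assumptions drawn from this section, not a standard API):

```python
import math

def required_workers(throughput_per_sec: float, p50_latency_sec: float,
                     headroom: float = 2.0) -> int:
    """Little's Law: concurrency N = lambda x W, then apply burst headroom."""
    n_min = throughput_per_sec * p50_latency_sec
    return math.ceil(n_min * headroom)

print(required_workers(385, 0.5))  # 385 -- i.e. 2x the ~193 theoretical minimum
```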
Throughput ceiling: the Scheduler's single-threaded dispatch loop becomes the bottleneck above approximately 2,000 dispatches/sec. Above that, partition the Scheduler by domain range into multiple Scheduler instances (each managing a slice of the domain hash space).
Bloom filter lookup latency: a single BF.EXISTS call to Redis takes 0.1-0.5 ms, negligible at 385 checks/sec. At 100× scale (38,500/sec), batch lookups via Redis pipeline or a local in-process Bloom filter replica reduce this to effectively zero.
Data Model: The Three Tables That Drive Crawl State
```sql
-- URL Frontier: persistent queue state (overflow to DB when in-memory queue is full)
CREATE TABLE url_frontier (
    url TEXT NOT NULL,
    url_hash BIGINT NOT NULL,             -- FNV-64 hash for fast dedup lookup
    domain TEXT NOT NULL,
    priority FLOAT NOT NULL DEFAULT 0.5,
    enqueued_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status TEXT NOT NULL DEFAULT 'pending',  -- pending | in_flight | done | error
    PRIMARY KEY (url_hash)
);

CREATE INDEX idx_frontier_priority ON url_frontier (priority DESC, enqueued_at ASC)
    WHERE status = 'pending';

-- Crawled pages: source of truth for fetched content
CREATE TABLE crawled_pages (
    url TEXT NOT NULL PRIMARY KEY,
    url_hash BIGINT NOT NULL UNIQUE,
    domain TEXT NOT NULL,
    http_status SMALLINT NOT NULL,
    content_hash BIGINT NOT NULL,         -- SimHash fingerprint (64-bit)
    md5_checksum CHAR(32),                -- exact-duplicate fast check
    content_size INT,                     -- raw HTML size in bytes
    last_crawled TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    crawl_count INT NOT NULL DEFAULT 1    -- incremented on re-crawl
);

CREATE INDEX idx_pages_domain ON crawled_pages (domain, last_crawled DESC);

-- Domain metadata: politeness and robots.txt state
CREATE TABLE domain_metadata (
    domain TEXT NOT NULL PRIMARY KEY,
    robots_txt TEXT,                      -- cached robots.txt content
    crawl_delay_sec FLOAT NOT NULL DEFAULT 1.0,
    last_robots_fetch TIMESTAMPTZ,
    is_blocked BOOLEAN NOT NULL DEFAULT FALSE,  -- manually blocked domain
    total_pages_crawled INT NOT NULL DEFAULT 0
);
```
The content_hash column in crawled_pages stores the 64-bit SimHash fingerprint. On re-crawl, if the new SimHash is within 3 bits of the stored value, the page is marked as near-duplicate and the Content Store write is skipped; only last_crawled and crawl_count are updated.
Cache Design: DNS, robots.txt, and the Bloom Filter
| Cache Layer | What it Stores | Backend | TTL | Eviction |
| --- | --- | --- | --- | --- |
| DNS cache | hostname → IP address | Redis (STRING) | 5 minutes | TTL-based; lazy expiry |
| robots.txt cache | domain → parsed rules | Redis (HASH) | 24 hours | TTL-based; refresh on crawl |
| URL Bloom filter | Seen URL bit array (12 GB) | Redis (RedisBloom BF) | Persistent (no TTL) | Never evicted; rebuilt on restart from crawled_pages |
| SimHash index | content_hash → url (for near-dup lookup) | Redis (HASH, sharded) | 30 days | LRU; evict stale pages |
| Domain rate limiter | domain → request count in window | Redis (STRING + EXPIRE) | 1 second (window) | TTL-based; auto-resets |
The Bloom filter is the single most important cache in the system. It is always kept warm: on cold start, it is rebuilt by scanning the url_hash column in crawled_pages. This takes about 30 minutes for 10 billion entries but happens only on full restarts; normal rolling deploys keep the Bloom filter alive in Redis.
Trade-offs, Failure Modes, and Scaling Limits
URL Frontier as the Primary Bottleneck
The Frontier is the single coordination point for all crawl workers. If the Frontier is a single Redis sorted set, it becomes a write bottleneck above ~10K enqueue ops/sec.
Solution: Partition the Frontier by hash(domain) % N. Each partition owns a subset of domains and is managed by a dedicated Scheduler instance. Workers for a given domain always communicate with the same Scheduler, so no cross-partition coordination is needed.
Distributed Bloom Filter: Scaling Beyond 12 GB
A single 12 GB Bloom filter fits in one Redis instance today. At 100 billion URLs it would require 120 GB. Options:
| Approach | Trade-off |
| --- | --- |
| Sharded Bloom filters (N filters, partition by URL hash) | Scales linearly; each shard is independent; requires routing layer |
| Counting Bloom filter | Supports deletion (useful for re-crawl expiry); 4× memory overhead |
| Cuckoo filter (RedisBloom CF) | Better space efficiency than Bloom; supports deletion; slightly higher constant overhead |
For this design, sharded Bloom filters with domain-based partitioning are the recommended scaling path: the sharding key aligns with the Frontier partitioning scheme, so no routing logic is needed beyond what the Frontier already provides.
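A toy version of the sharded approach, assuming a plain bit-array Bloom filter per shard (the sizes here are tiny and illustrative; production would run one RedisBloom filter per shard):

```python
import hashlib
from urllib.parse import urlparse

class ShardedBloom:
    """N independent Bloom filters, one per domain shard."""
    def __init__(self, num_shards: int = 4, bits_per_shard: int = 1 << 20, k: int = 7):
        self.shards = [bytearray(bits_per_shard // 8) for _ in range(num_shards)]
        self.num_shards, self.bits, self.k = num_shards, bits_per_shard, k

    def _shard(self, url: str) -> bytearray:
        domain = urlparse(url).netloc
        idx = int(hashlib.sha1(domain.encode()).hexdigest(), 16) % self.num_shards
        return self.shards[idx]           # same domain always hits the same shard

    def _positions(self, url: str):
        h = hashlib.sha256(url.encode()).digest()
        for i in range(self.k):           # derive k bit positions from one wide hash
            yield int.from_bytes(h[i * 4:(i + 1) * 4], "big") % self.bits

    def add(self, url: str) -> None:
        shard = self._shard(url)
        for pos in self._positions(url):
            shard[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        shard = self._shard(url)
        return all(shard[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))
```

Because the shard key is the domain, the routing decision is the same hash(domain) computation the Frontier already performs.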
Fetcher Pool Scaling
Add fetcher workers horizontally. Each new worker node registers itself with the Scheduler's domain assignment table. The hash(domain) % N function is recomputed on worker join/leave; a consistent hashing ring avoids re-assigning all domains on every topology change.
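A minimal consistent hashing ring illustrating why topology changes move only a fraction of the domains (class and parameter names are illustrative):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing ring: adding a worker moves only ~1/N of the domains."""
    def __init__(self, workers: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []
        for w in workers:
            self.add(w, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def add(self, worker: str, vnodes: int = 100) -> None:
        for i in range(vnodes):           # virtual nodes smooth the distribution
            bisect.insort(self.ring, (self._hash(f"{worker}#{i}"), worker))

    def owner(self, domain: str) -> str:
        h = self._hash(domain)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]          # first vnode clockwise from the domain
```

With a naive hash(domain) % N, adding a fourth worker would re-assign roughly 3/4 of all domains; with the ring, only about 1/4 move, so most warm connection pools and politeness timers stay where they are.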
Content Store Throughput
At 400 writes/sec of 250 KB pages, the Content Store absorbs ~100 MB/sec of raw write throughput. S3-compatible object storage handles this easily. The content hash is the object key β exact-duplicate pages never generate a new S3 write (same key = same content, already stored).
Decision Guide: Key Architectural Choices
| Choice | Recommendation | Avoid |
| --- | --- | --- |
| URL Deduplication | Bloom filter (Redis): O(1) lookup, fixed 12 GB memory | Hash set: O(n) memory growth; 500+ GB at 10B URLs |
| Frontier traversal order | BFS with priority scoring for broad crawl | DFS: exhausts single domains, violates politeness |
| Frontier partitioning | hash(domain) % N: centralises per-domain state per worker | Random partitioning: scatters domain state across workers, breaks rate limiting |
| Content dedup first pass | MD5 checksum: O(1) exact match before SimHash | SimHash only: misses exact duplicates and is more expensive than MD5 |
| robots.txt caching | Redis with 24-hour TTL | Fetch on every request: adds latency and hammers small servers |
| Crawl-delay enforcement | Redis Lua script (atomic INCR + EXPIRE per domain) | Application-level sleep: blocks the Scheduler thread; non-atomic under concurrent workers |
| DNS caching | Redis with 5-minute TTL | No cache: 50-200 ms added per fetch; catastrophic at 385 fetches/sec |
Real-World Applications: How Googlebot, Common Crawl, and Bingbot Scale
The design in this HLD is not hypothetical: it maps directly to how the three dominant production crawlers are architected.
Googlebot operates a geographically distributed fleet with separate crawl workers for different content verticals (news, shopping, images). Politeness is enforced per-IP, not just per-domain, so that Google's distributed fleet doesn't accidentally aggregate into a DDoS from multiple geographic locations hitting the same server. Google uses a custom URL Frontier backed by Bigtable (essentially a distributed priority queue at Google scale) and a proprietary content deduplication pipeline that runs SimHash at web scale.
Common Crawl (commoncrawl.org) is the open-source reference crawl: it crawls approximately 3 billion pages per month and makes the raw WARC (Web ARChive) files publicly available on S3. Common Crawl uses Apache Nutch (covered below) for crawl coordination and stores the raw HTML + metadata in the WARC format, exactly the Content Store pattern described in this HLD. The WARC files are consumed by downstream ML training pipelines (GPT-3 was partially trained on Common Crawl data).
Bingbot (Microsoft) differs from Googlebot primarily in its adaptive crawl-rate policy: Bingbot reads the X-Robots-Tag HTTP response header in addition to robots.txt, and dynamically backs off per-domain based on server response time. If a server's p95 latency exceeds 500 ms, Bingbot halves its crawl rate for that domain. This is a production-grade politeness heuristic that goes beyond simple Crawl-delay enforcement.
| Crawler | Scale | URL Frontier backend | Dedup strategy | Politeness mechanism |
| --- | --- | --- | --- | --- |
| Googlebot | ~15B pages/month | Bigtable (distributed priority queue) | SimHash + content hash | Per-IP rate limit; robots.txt; adaptive backoff |
| Common Crawl | ~3B pages/month | Apache Nutch / HDFS CrawlDB | URL hash + WARC dedup | robots.txt + Crawl-delay |
| Bingbot | Undisclosed | Proprietary | Content hash | robots.txt + server latency adaptive backoff |
The unifying principle across all three: the URL Frontier and politeness module are the hardest components to get right, while the Fetcher workers are straightforward horizontal scale problems.
Apache Nutch: How It Implements Distributed Crawling
Apache Nutch is the reference open-source implementation of a distributed web crawler; it is the project from which Apache Hadoop was originally spun out. Understanding Nutch's architecture validates the design choices in this HLD.
Nutch uses Hadoop MapReduce (running on YARN in modern deployments) to distribute crawl phases across a cluster:
| Nutch Phase | MapReduce Job | What it does |
| --- | --- | --- |
| Inject | Single mapper | Seeds the URL database with the initial URL list |
| Generate | Partitioned mapper | Selects URLs from the CrawlDB for the next fetch batch; enforces politeness limits per domain |
| Fetch | Multi-threaded mapper | Downloads pages using Nutch's built-in HTTP fetcher; respects robots.txt and Crawl-delay |
| Parse | Mapper + Reducer | Extracts links and metadata from raw HTML; runs pluggable parser plugins |
| UpdateDB | Reducer | Merges fetch results back into the CrawlDB; updates crawl scores and status |
| Invert Links | Reducer | Builds the inbound link graph for PageRank-style scoring |
A minimal Nutch crawl configuration showing politeness settings:
```xml
<!-- nutch-site.xml: politeness and fetch configuration -->
<configuration>
  <!-- Per-domain minimum fetch delay in seconds: 1.0 (max 1 req/sec) -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <!-- Maximum threads per queue (domain): prevents hammering a single server -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
  <!-- Treat an HTTP 403 on robots.txt as "allow": crawl proceeds when robots.txt is forbidden -->
  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>
  <!-- Maximum content size to download: 10 MB per page -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
  <!-- Apply URL normalizers before CrawlDB updates (canonicalise URLs for dedup) -->
  <property>
    <name>crawldb.url.normalizers</name>
    <value>true</value>
  </property>
</configuration>
```
Nutch's CrawlDB is an HDFS-backed persistent store that maps directly to the url_frontier and crawled_pages tables in this HLD's data model. The fetcher.threads.per.queue=1 configuration is exactly the "max 1 req/sec per domain" politeness rule, implemented as a thread quota rather than a timer, but with the same effect.
For a deep dive on Apache Nutch's Hadoop integration and plugin system, see the Apache Nutch documentation or the planned follow-up companion post.
Practical Example: Walking https://news.ycombinator.com Through the Full Crawl Pipeline
A concrete trace of one URL from seed injection to Content Store write: the kind of step-by-step walkthrough that wins points in interviews.
Step 1 (Seed injection): The URL https://news.ycombinator.com is injected as a seed with priority 0.9 (high-authority domain). It enters the Frontier's per-domain FIFO queue for news.ycombinator.com.
Step 2 (Bloom filter check, URL dedup): The Scheduler calls BF.EXISTS for the FNV-64 hash of the URL. Result: 0 (not seen). The URL is added to the Bloom filter (BF.ADD) and enqueued.
Step 3 (Politeness check): The Scheduler looks up news.ycombinator.com in the robots.txt cache (Redis HASH key robots:news.ycombinator.com). Cache miss: it fetches /robots.txt, finds Crawl-delay: 30 for most bots but no Disallow for the root path. The crawl delay is stored; the URL is allowed.
Step 4 (Rate limit gate): The Redis Lua script checks crawl:ratelimit:news.ycombinator.com. INCR creates the key with count 1 and a 30-second EXPIRE is set. The gate returns 1 (allowed). A Fetcher worker is dispatched.
Step 5 (Fetch): The Fetcher resolves DNS (cache hit: 45.79.22.115), opens a persistent HTTPS connection, and downloads the HTML. Response: 200 OK, 42 KB, latency 120 ms.
Step 6 (Content dedup): The Parser computes MD5 (a3f9...) and SimHash (0x3B4A...) for the page. It checks crawled_pages: no existing entry. New content confirmed; write to Content Store (S3).
Step 7 (Link extraction): The Parser finds 30 outbound links (story URLs on the HN homepage). Each link goes through the Bloom filter check. 2 are already present in the filter; 28 are new. The 28 new URLs are enqueued in the Frontier.
Outcome: 1 page stored, 28 new URLs discovered, Bloom filter updated with 29 new entries (1 page URL + 28 link URLs), rate-limit window set to 30 seconds before the next news.ycombinator.com fetch is allowed.
This trace is the right level of detail for an interview: it shows you understand the data flow at every gate, not just the high-level components.
📚 Lessons Learned from Building Crawlers at Scale
- Never skip robots.txt compliance, even in internal tools. A crawler that ignores robots.txt will get IP-banned by CDNs and Cloudflare within minutes at scale. Always cache and honour it.
- The URL Frontier is harder than the Fetcher. Most engineers focus on HTTP fetching performance, but the Frontier is the true scaling bottleneck. Partition it early and make it domain-aware from day one.
- Bloom filters have no false negatives; design around their false-positive direction. A false positive means you skip a valid URL. Set your false-positive rate target based on acceptable crawl miss rate, not memory preference.
- SimHash thresholds need tuning per content vertical. A threshold of 3 Hamming bits works well for news articles but is too aggressive for e-commerce product pages, which differ legitimately by SKU with nearly identical templates.
- DNS caching is mandatory, not optional. At 385 fetches/sec, uncached DNS lookups at an average 100 ms each add roughly 38.5 seconds of cumulative lookup latency for every second of crawl throughput. Cache aggressively with a 5-minute TTL.
- Connection pooling per domain matters more than per-worker. Keep persistent HTTP connections to each domain warm in the assigned worker. A cold TCP handshake costs 100-300 ms.
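The Bloom filter lesson above becomes concrete with the standard sizing formulas: an optimal filter needs m = -n·ln(p)/ln(2)² bits and k = (m/n)·ln 2 hash functions for n items at false-positive rate p. A quick check of the numbers this article uses:

```python
import math

def bloom_sizing(n: int, p: float):
    """Optimal Bloom filter size m (bits) and hash count k for n items
    at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_sizing(10_000_000_000, 0.01)
# m / 8 bytes comes out just under 12 GB (decimal) with k = 7 hash
# functions, matching the 12 GB figure quoted for 10B URLs at 1% FP.
```

Halving p to 0.5% adds only about one more hash function and ~1.2 GB, which is why the FP target, not memory, should drive the choice.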
📌 TLDR: Summary & Key Takeaways
- A distributed web crawler has three core design challenges: priority-aware URL Frontier, politeness enforcement per domain, and two-layer deduplication (URL-level Bloom filter + content-level SimHash)
- Partition everything by domain: Frontier queues, Fetcher workers, DNS caches, and rate limiters; domain is the natural sharding key because all politeness state is domain-scoped
- Bloom filters beat hash sets at billion-URL scale: 12 GB for 10B URLs at 1% false-positive rate vs. 500+ GB for a hash set, with the trade-off that ~1% of valid URLs are silently skipped
- SimHash catches what MD5 misses: near-duplicate pages with minor differences (ads, pagination, sidebars) have SimHash fingerprints within 3 bits of each other (Hamming distance) but completely different MD5 checksums
- The Scheduler is the politeness enforcer, not individual Fetcher workers; centralising rate-limit decisions per domain partition prevents coordination overhead across workers
- Apache Nutch validates this architecture: its Hadoop-based MapReduce phases map directly to the Frontier → Scheduler → Fetcher → Parser → UpdateDB pipeline
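The partition-by-domain takeaway reduces to a tiny routing function: hash only the domain, never the full URL, so every URL of a given domain lands on the same partition and all its politeness state stays co-located. The function name and hash choice below are illustrative:

```python
import hashlib
from urllib.parse import urlsplit

def assign_partition(url: str, num_partitions: int) -> int:
    """Route a URL to a Frontier/Fetcher partition by its domain, so the
    robots.txt cache, rate-limit window, and warm connection pool for one
    domain all live in exactly one partition."""
    domain = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Hashing the full URL instead would scatter one domain's URLs across every partition, forcing all partitions to coordinate on that domain's crawl delay.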
📝 Practice Quiz
Why does a web crawler use a Bloom filter for URL deduplication instead of storing seen URLs in a hash set?
- A) Bloom filters support O(1) deletion, making it easy to remove expired URLs
- B) A Bloom filter uses a fixed amount of memory (e.g., 12 GB for 10B URLs) with a configurable false-positive rate, while a hash set grows linearly with the number of URLs, consuming 500+ GB at 10B entries
- C) Bloom filters are faster than hash sets for exact lookups in all cases
Correct Answer: B
Why is per-domain rate limiting applied at the Scheduler rather than inside each Fetcher worker?
- A) Fetcher workers don't have network access to the domain metadata
- B) Centralising rate-limit decisions at the Scheduler (one decision point per domain partition) prevents two workers assigned to the same domain from each independently approving a fetch, which would double the effective request rate
- C) The Scheduler runs on faster hardware than Fetcher workers
Correct Answer: B
A crawler fetches a news article that has been republished on 50 different sites with only the publication date changed in the byline. Which deduplication mechanism catches these near-duplicates, and why doesn't MD5 work here?
- A) The URL Bloom filter catches them because the article title appears in all 50 URLs
- B) MD5 works fine β any byte-level difference in the page produces a completely different checksum, which is exactly what we want for detecting changes
- C) SimHash catches them: the 64-bit fingerprints of two near-identical documents differ by ≤ 3 bits (Hamming distance), while MD5 produces completely different hashes for documents that differ by even a single byte
Correct Answer: C
Why does a broad web crawl use BFS (breadth-first search) for URL expansion rather than DFS (depth-first search)?
- A) BFS is always faster than DFS because it uses a queue instead of a stack
- B) DFS would crawl all 500,000 pages of a single domain before visiting any page on another domain β concentrating fetch load on one server for extended periods and violating politeness rules; BFS distributes fetches across many domains naturally
- C) DFS cannot handle redirects, making it unsuitable for web crawling
Correct Answer: B
Open-ended challenge: A domain's robots.txt specifies Crawl-delay: 10 (10 seconds between requests). Your crawler has 50 URLs queued for that domain. How does the Scheduler handle this without blocking crawl throughput to all other domains? Walk through the data flow.
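One workable answer, sketched under the assumption of a single Scheduler per domain partition: keep a min-heap of (next-allowed-time, domain), pop only domains that are currently due, and re-arm a domain after each fetch. The 50 queued URLs then drain one every 10 seconds while every other domain keeps flowing. The class below is an illustrative sketch, not a reference solution:

```python
import heapq
from collections import deque

class PoliteScheduler:
    """Non-blocking per-domain scheduling: domains wait in a min-heap keyed
    by the earliest time their next fetch is allowed, so one slow domain
    never stalls the others."""

    def __init__(self):
        self.queues = {}   # domain -> deque of (url, crawl_delay)
        self.ready = []    # heap of (next_allowed_time, domain)

    def enqueue(self, domain, url, now, crawl_delay=0.0):
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.ready, (now, domain))
        self.queues[domain].append((url, crawl_delay))

    def next_fetch(self, now):
        """Return (url, domain) if some domain is due, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready)
        url, delay = self.queues[domain].popleft()
        if self.queues[domain]:  # re-arm the domain after its crawl delay
            heapq.heappush(self.ready, (now + delay, domain))
        return url, domain
```

Because a domain sits out of the heap until its window reopens, the Scheduler's hot loop only ever sees domains that are fetchable right now.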