System Design HLD Example: Web Crawler

A practical interview-ready HLD for a distributed web crawler handling politeness, deduplication, and scale.

Abstract Algorithms · 32 min read

TLDR: A distributed web crawler needs three things to work at scale: a priority-aware URL Frontier that drives crawl order, a Politeness Module that enforces per-domain rate limits so crawlers don't take down small sites, and a two-layer deduplication stack β€” Bloom filters for URL deduplication and SimHash for near-duplicate content detection. Everything else (DNS caching, connection pooling, distributed fetcher workers) is a performance multiplier on top of those three.

Googlebot crawls roughly 15 billion pages per month. That works out to roughly 5,700 page fetches per second, continuously, around the clock. Google does not ask individual websites for permission before crawling them β€” yet if Googlebot hit a small news blog at 5,700 req/sec it would saturate the server within seconds.

The design paradox at the heart of a web crawler is this: you need to be as fast as possible globally, but as slow as necessary per domain. That tension drives almost every architectural decision in this HLD.

By the end of this walkthrough you'll understand why the URL Frontier is the hardest component to scale, why a Bloom filter beats a hash set for URL deduplication at a billion-URL scale, and why SimHash catches near-duplicate pages that MD5 checksums silently miss.

πŸ“– What a Crawler Is Actually Doing: Use Cases and Actors

Before picking any technology, we need to know who uses the system and what jobs they need it to do.

Actors

| Actor | Role |
| --- | --- |
| Seed URL provider | Injects starting URLs into the Frontier (editorial team, sitemap parser, or prior crawl result) |
| Fetcher worker | Downloads the raw HTML of a single URL; reports success, failure, or redirect |
| Parser | Extracts outbound links and structured metadata from fetched HTML |
| Politeness Module | Enforces per-domain crawl-delay rules and robots.txt directives |
| Deduplication layer | Checks Bloom filter (URL-level) and SimHash index (content-level) before storing a page |
| Content Store | Persists raw HTML + metadata for downstream indexing pipelines |
| Scheduler | Reads the URL Frontier; selects the next URL to fetch respecting domain rate limits |

Use Cases

  • URL discovery β€” extract all <a href> links from a fetched page and enqueue new, unseen URLs
  • Page download β€” fetch the raw HTML at a given URL, follow redirects (up to a configurable hop limit), and store the response
  • Content change detection β€” compare the content hash of the newly-fetched page with the stored hash; only write to the Content Store if the page changed
  • robots.txt compliance β€” before fetching any URL on a domain, read and cache the domain's robots.txt; honour Disallow rules and Crawl-delay directives
  • Duplicate URL prevention β€” check a URL against the Bloom filter before enqueuing; reject if already seen
  • Near-duplicate page detection β€” compute SimHash for each fetched page; if Hamming distance from a known page is ≀ 3 bits, mark as near-duplicate and skip content storage

This template starts with actors and use cases because architecture only makes sense once the crawl workload shape is explicit. In interviews, naming the actors forces precision about write-path vs. read-path responsibilities before any component is named.

πŸ” Functional Requirements: What's In and What's Out

In Scope (v1)

  • Broad web crawl starting from a set of seed URLs (BFS-based URL expansion)
  • robots.txt parsing and Crawl-delay enforcement per domain
  • URL deduplication via Bloom filter (reject already-crawled URLs before enqueuing)
  • Content deduplication via SimHash (skip near-duplicate pages before writing to Content Store)
  • Raw HTML storage with URL, crawl timestamp, HTTP status, and content hash
  • Per-domain politeness: maximum 1 req/sec per domain by default, configurable per domain

Out of Scope (v1)

| Excluded Feature | Reason |
| --- | --- |
| JavaScript rendering (SPAs, dynamic pages) | Requires a full headless browser (Chromium); separate service with an entirely different resource model |
| Content indexing pipeline | Text extraction, NLP, ranking signals are downstream consumers of the raw HTML; not part of the crawler |
| Login-gated pages | Authentication flows require credential management; a separate access-controlled crawl tier |
| Real-time link graph analysis | Graph processing at this scale (PageRank updates) belongs in an offline batch pipeline |

Naming non-goals explicitly is a high-signal move in interviews β€” it tells the interviewer you understand the system boundary and have consciously traded scope for simplicity.

βš™οΈ Capacity Estimation and Design Goals

Capacity Estimation

Assume: 1 billion pages per month crawl target.

| Metric | Calculation | Result |
| --- | --- | --- |
| Pages per second (QPS) | 1B / (30 × 24 × 3600) | ~385 pages/sec |
| Average page size (HTML) | Empirical web average | ~250 KB |
| Raw HTML storage per month | 1B × 250 KB | ~250 TB/month |
| URL Frontier queue depth (1-week buffer) | 385 × 7 × 86400 | ~233M URLs in queue |
| Bloom filter size (10B URLs, 1% false-positive) | m = -n·ln(p) / (ln 2)² | ~12 GB (fits in Redis) |
| DNS cache entries (unique domains) | ~200M domains estimated | ~200M entries in Redis |
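The table's arithmetic can be reproduced in a few lines. This is a back-of-envelope sketch; the 250 KB average page size, 30-day month, and 1-week buffer are the assumptions stated above:

```python
def crawl_capacity(pages_per_month=1_000_000_000, avg_page_kb=250):
    """Back-of-envelope crawl sizing using the assumptions above."""
    seconds_per_month = 30 * 24 * 3600
    qps = pages_per_month / seconds_per_month               # ~385 pages/sec
    storage_tb_per_month = pages_per_month * avg_page_kb / 1e9   # KB -> TB
    frontier_depth = qps * 7 * 86400                        # 1-week buffer
    return qps, storage_tb_per_month, frontier_depth

qps, storage_tb, frontier = crawl_capacity()
# qps ~385, storage ~250 TB/month, frontier ~233M queued URLs
```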

Design Goals

  • Politeness enforcer caps at max 1 req/sec per domain by default; configurable up to 10 req/sec for domains that have opted in via Crawl-delay: 0
  • Bloom filter deduplication keeps the URL re-crawl false-positive rate at ~1% using 7 hash functions over a 12 GB bit array
  • SimHash near-duplicate detection marks pages as duplicate if Hamming distance ≀ 3 bits out of 64; threshold tunable per content vertical
  • Fetcher SLA: p99 fetch latency ≀ 5 seconds per page (including DNS resolution); timeout + retry with exponential backoff on failure
  • Content Store write throughput: sustain 400 writes/sec to object storage (S3-compatible) for raw HTML blobs

πŸ“Š High-Level Architecture: How All the Pieces Fit Together

The crawler is a pipeline: URLs enter the Frontier, get scheduled by a politeness-aware Scheduler, are fetched by a pool of stateless Fetcher workers, parsed for new links and content hash, and then either stored or discarded based on deduplication checks.

The diagram below shows the full data flow. Each arrow represents a message queue or direct call. Notice the two feedback loops: the URL Extractor feeds back into the Frontier (BFS expansion), and the Content Dedup layer feeds back to skip writes.

graph TD
    SEEDS([🌱 Seed URLs]) --> FRONTIER

    subgraph URL Frontier
        FRONTIER[Priority Queue\nper-domain FIFO buckets]
    end

    FRONTIER --> SCHEDULER[⏱️ Scheduler\nPolite rate limiter]
    SCHEDULER -->|next URL| FETCHER_POOL

    subgraph Fetcher Pool
        FETCHER_POOL[Fetcher Workers\nHTTP + DNS cache]
    end

    FETCHER_POOL -->|raw HTML| PARSER[πŸ” Parser\nLink extractor + metadata]
    FETCHER_POOL -->|error| RETRY[Retry Queue\nexponential backoff]
    RETRY --> FRONTIER

    PARSER -->|new URLs| URL_DEDUP{Bloom Filter\nURL Dedup}
    URL_DEDUP -->|unseen| FRONTIER
    URL_DEDUP -->|seen| DISCARD1[πŸ—‘οΈ Discard]

    PARSER -->|page content| CONTENT_DEDUP{SimHash\nContent Dedup}
    CONTENT_DEDUP -->|new/changed| CONTENT_STORE[(πŸ“¦ Content Store\nS3 raw HTML)]
    CONTENT_DEDUP -->|near-duplicate| DISCARD2[πŸ—‘οΈ Skip write]

    SCHEDULER --> POLITENESS[🚦 Politeness Module\nrobots.txt + Crawl-delay]
    POLITENESS -->|allowed| FETCHER_POOL
    POLITENESS -->|blocked| BACKOFF[Domain back-off queue]
    BACKOFF --> FRONTIER

    FETCHER_POOL --> DNS_CACHE[(πŸ”‘ DNS Cache\nRedis TTL:5min)]

The Scheduler is the chokepoint between the Frontier and the Fetcher pool β€” it is the only component that enforces per-domain politeness. Every URL must pass through the Politeness Module before a Fetcher worker is dispatched.

🧠 Component Deep Dives

Internals: How Each Crawl Subsystem Manages State

The crawler pipeline is a chain of stateful components, and each one manages a different slice of the crawl's memory:

| Subsystem | State it owns | Where state lives |
| --- | --- | --- |
| URL Frontier | Per-domain FIFO queues + priority heap | Redis sorted sets (hot); PostgreSQL url_frontier table (overflow) |
| Politeness Module | Last-fetch timestamp per domain; robots.txt rules | Redis (STRING + EXPIRE for rate window; HASH for robots.txt cache) |
| URL Dedup (Bloom filter) | Bit array of seen URL hashes | Redis (RedisBloom BF, 12 GB persistent) |
| Content Dedup (SimHash) | 64-bit fingerprint per crawled page | Redis (HASH, sharded); cold in PostgreSQL crawled_pages.content_hash |
| DNS Cache | Hostname → IP mappings | Redis (STRING, 5-min TTL) |
| Fetcher Workers | Warm HTTP connection pool per domain | In-process (per worker node); re-created on restart |

Understanding state ownership clarifies which components are safe to scale horizontally (stateless Fetcher workers) versus which require careful partitioning (Frontier, Politeness Module).

The URL Frontier: Driving Crawl Order at Scale

The URL Frontier is the heart of the crawler. It answers one question on every scheduling tick: which URL should we fetch next?

A naΓ―ve FIFO queue fails immediately β€” it will pull 1,000 URLs from the same domain in sequence, hammer that server, and violate politeness rules. The Frontier needs two layers of queuing:

  1. Priority queues (front-end): URLs are bucketed by crawl priority β€” freshness score, PageRank estimate, or recency of the domain's last update. High-priority URLs (e.g., a major news homepage) jump the queue.
  2. Per-domain FIFO queues (back-end): Each domain has its own FIFO queue. The Scheduler pulls from the front-end priority queues, then dispatches a URL from the appropriate domain queue β€” but only if the domain's politeness timer has expired.
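The two-layer queue can be sketched in memory. Class and method names here are illustrative, and the production Frontier would back these structures with Redis sorted sets as described later; the key behaviour is that a domain in its crawl-delay window is skipped, not waited on:

```python
import heapq
from collections import deque
from urllib.parse import urlparse

class Frontier:
    """Two-layer frontier sketch: a priority heap over domains (front-end)
    with a FIFO queue per domain behind each heap entry (back-end)."""
    def __init__(self):
        self.domain_queues = {}   # domain -> deque of URLs (back-end FIFO)
        self.heap = []            # (-priority, domain) front-end entries

    def enqueue(self, url, priority=0.5):
        domain = urlparse(url).netloc
        q = self.domain_queues.setdefault(domain, deque())
        if not q:                                   # first URL for this domain
            heapq.heappush(self.heap, (-priority, domain))
        q.append(url)

    def next_url(self, politeness_ok=lambda domain: True):
        """Pop from the highest-priority domain whose politeness timer allows it."""
        deferred, url = [], None
        while self.heap:
            prio, domain = heapq.heappop(self.heap)
            if politeness_ok(domain):
                url = self.domain_queues[domain].popleft()
                if self.domain_queues[domain]:      # domain still has URLs
                    heapq.heappush(self.heap, (prio, domain))
                break
            deferred.append((prio, domain))         # in crawl-delay window: skip
        for entry in deferred:                      # keep deferred domains queued
            heapq.heappush(self.heap, entry)
        return url
```

Note how `next_url` defers a rate-limited domain and moves on rather than blocking, which is the property the Scheduler section below relies on.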

BFS vs DFS vs Priority-based traversal

| Strategy | Behaviour | Best for |
| --- | --- | --- |
| BFS | Discovers broad, shallow pages first | General-purpose broad web crawl (default) |
| DFS | Goes deep on a single domain before backtracking | Focused crawl of a single site's entire archive |
| Priority-based | Scores URLs by freshness + authority; crawls highest-value pages first | News crawl, sitemap-guided re-crawl |

For a broad web crawl, BFS is the correct default β€” it surfaces diverse high-authority pages quickly rather than exhausting one domain before moving on.

Partitioning the Frontier across multiple crawl nodes: Partition by hash(domain) % N where N is the number of crawl workers. All URLs belonging to nytimes.com always route to the same worker β€” this centralises per-domain state (politeness timer, robots.txt cache) without any cross-node coordination.
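One subtlety the hash(domain) % N notation glosses over: the hash must be stable across processes and machines. Python's built-in hash() is salted per interpreter run, so a sketch would use a deterministic hash instead; CRC32 here is an illustrative choice, any stable hash works:

```python
import zlib

def worker_for(domain: str, num_workers: int) -> int:
    """Route a domain to a crawl worker partition. Python's built-in hash()
    is salted per process, so a stable hash (CRC32 here) is required for
    all nodes to agree on the domain -> worker assignment."""
    return zlib.crc32(domain.encode("utf-8")) % num_workers
```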

The Politeness Module: Keeping Crawlers from Crashing Websites

Every time the Scheduler picks a URL, the Politeness Module makes three checks:

  1. robots.txt β€” has the crawler fetched and cached this domain's robots.txt? Is the target path allowed for the crawler's User-Agent?
  2. Crawl-delay β€” what is the minimum wait between requests to this domain? Default: 1 second. If Crawl-delay is declared in robots.txt, honour it.
  3. Domain rate limiter β€” has the configured inter-request delay elapsed since the last fetch from this domain?
# Pseudocode: robots.txt compliance check
def is_allowed(domain: str, path: str, user_agent: str = "MyCrawler") -> bool:
    rules = robots_cache.get(domain)
    if rules is None:
        rules = fetch_and_parse_robots_txt(domain)     # HTTP GET /robots.txt
        robots_cache.set(domain, rules, ttl=86400)     # cache for 24 hours

    for rule_group in rules:
        if user_agent_matches(rule_group.agents, user_agent):
            for disallow in rule_group.disallow:
                if path.startswith(disallow):
                    return False               # path is explicitly disallowed
    return True

The Redis rate limiter enforces the per-domain crawl delay. The INCR + EXPIRE pattern gives an atomic, TTL-bounded request counter per domain key:

-- Redis Lua script: per-domain politeness gate (atomic)
local key   = "crawl:ratelimit:" .. KEYS[1]          -- e.g., crawl:ratelimit:nytimes.com
local limit = tonumber(ARGV[1])                       -- max requests per window (e.g., 1)
local ttl   = tonumber(ARGV[2])                       -- window in seconds (e.g., 1)

local count = redis.call("INCR", key)
if count == 1 then
    redis.call("EXPIRE", key, ttl)
end

if count > limit then
    return 0    -- rate-limited: scheduler should back off this domain
else
    return 1    -- allowed: proceed with fetch
end

A 0 return from the Lua script pushes the URL back into the domain's back-off queue with a timestamp. The Scheduler re-examines it only after the TTL expires.
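For unit-testing Scheduler logic without a Redis instance, the gate's INCR + EXPIRE semantics can be modelled in pure Python. This is a test double, not the production path (which stays in the atomic Lua script above); the injectable clock is an assumption for testability:

```python
import time

class PolitenessGate:
    """In-process model of the Redis Lua gate's INCR + EXPIRE semantics."""
    def __init__(self, limit=1, window_sec=1.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window_sec, clock
        self.counters = {}                     # domain -> (count, window_expiry)

    def allow(self, domain: str) -> bool:
        now = self.clock()
        count, expiry = self.counters.get(domain, (0, 0.0))
        if now >= expiry:                      # window expired: reset counter
            count, expiry = 0, now + self.window
        count += 1
        self.counters[domain] = (count, expiry)
        return count <= self.limit             # False -> back-off queue
```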

URL Deduplication: Why a Bloom Filter Beats a Hash Set at Ten Billion URLs

After the Parser extracts new URLs from a fetched page, each candidate URL passes through the deduplication layer. The question is: have we already crawled or enqueued this URL?

A plain hash set would work β€” until it doesn't. At 10 billion URLs, a Python set holding 50-byte URL hashes consumes ~500 GB of RAM. That's not deployable.

A Bloom filter solves this with a fixed-size bit array. The false-positive probability is:

P(false_positive) β‰ˆ (1 - e^(-kn/m))^k

where:
  k = number of hash functions
  n = number of inserted elements
  m = number of bits in the filter

For 10 billion URLs at 1% false-positive rate, the optimal parameters are:

  • m β‰ˆ 95.8 billion bits β†’ ~12 GB
  • k = 7 hash functions (optimal: k = (m/n) Γ— ln 2)
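These optimal-parameter formulas can be checked numerically with a small sketch:

```python
import math

def bloom_params(n: int, p: float):
    """Optimal Bloom filter sizing: m bits and k hash functions
    for n elements at target false-positive rate p."""
    m = -n * math.log(p) / math.log(2) ** 2    # bits
    k = (m / n) * math.log(2)                  # hash functions
    return m, math.ceil(k)

m, k = bloom_params(10_000_000_000, 0.01)
# m is ~95.8 billion bits (~12 GB); k rounds up to 7
```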

A 12 GB Redis Bloom filter (using RedisBloom module) handles 10 billion URLs with a false-positive rate of 1% β€” meaning 1 in 100 unseen URLs is incorrectly flagged as already-crawled. At our 385 pages/sec crawl rate, that means roughly 3-4 valid URLs are missed per second β€” an acceptable loss for the enormous memory savings.

Bloom filter properties to remember in interviews: Bloom filters have no false negatives — if the filter says "not seen", it is definitely not seen. The only failure mode is false positives (a URL reported as seen when it never was), which results in a small number of valid pages being skipped, not silent corruption.

Content Deduplication: SimHash for Near-Duplicate Page Detection

MD5 or SHA-256 checksums catch exact duplicates β€” pages with byte-for-byte identical content. They miss near-duplicates: a product page that differs only in a sidebar ad, a news article reprinted with a changed byline, or a paginated article where only the page number differs.

SimHash produces a 64-bit fingerprint where similar documents produce fingerprints with similar bit patterns. The Hamming distance between two SimHashes correlates with document similarity.

# SimHash algorithm sketch (requires the mmh3 package: pip install mmh3)
import mmh3

def simhash(tokens: list[str]) -> int:
    vector = [0] * 64                          # 64-bit output
    for token in tokens:
        h = mmh3.hash64(token)[0]              # 64-bit MurmurHash3
        weight = 1                             # uniform weight; use TF-IDF weight in production
        for i in range(64):
            if h & (1 << i):
                vector[i] += weight            # bit is 1: add weight
            else:
                vector[i] -= weight            # bit is 0: subtract weight

    fingerprint = 0
    for i in range(64):
        if vector[i] > 0:
            fingerprint |= (1 << i)            # set bit if net positive
    return fingerprint

def is_near_duplicate(new_hash: int, stored_hash: int, threshold: int = 3) -> bool:
    xor = new_hash ^ stored_hash
    return bin(xor).count('1') <= threshold    # Hamming distance ≀ 3 bits = near-duplicate

At interview time, the key decision to articulate is: use MD5 for exact-duplicate detection as a fast first pass, then SimHash for near-duplicate detection as a second pass. Running SimHash on every page and storing all 64-bit fingerprints for 10 billion pages consumes only 80 GB β€” manageable as a distributed in-memory index.
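That two-pass decision could look like the following sketch, with a plain dict standing in for the distributed fingerprint index (names and the return-string convention are illustrative):

```python
import hashlib

def dedup_decision(html: bytes, simhash_fp: int, stored: dict,
                   hamming_threshold: int = 3) -> str:
    """Two-pass content dedup: cheap MD5 exact match first, then SimHash
    near-duplicate check. `stored` maps md5 hex -> SimHash fingerprint."""
    md5 = hashlib.md5(html).hexdigest()
    if md5 in stored:
        return "exact-duplicate"               # pass 1: byte-identical page
    for fp in stored.values():
        if bin(simhash_fp ^ fp).count("1") <= hamming_threshold:
            return "near-duplicate"            # pass 2: Hamming distance <= 3
    stored[md5] = simhash_fp                   # genuinely new: index it
    return "new"
```

A linear scan over fingerprints is only for illustration; at 10 billion pages the SimHash index would be sharded and probed by fingerprint prefix.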

Distributed Fetcher Pool: Throughput Without Overloading the Scheduler

Fetcher workers are stateless. Each worker:

  1. Receives a URL from the Scheduler
  2. Resolves DNS (from the DNS cache first)
  3. Opens a persistent HTTP/1.1 or HTTP/2 connection (connection pool per domain)
  4. Downloads the HTML body (max configurable size: e.g., 10 MB; truncate beyond that)
  5. Passes the raw HTML to the Parser
  6. Reports status back to the Scheduler (success, HTTP 4xx, HTTP 5xx, timeout)

Workers scale horizontally. Each worker node handles one domain partition (per the hash(domain) % N assignment), which means it can maintain a warm connection pool and DNS cache for its assigned domains without cross-node invalidation.

DNS caching: DNS lookups add 50–200 ms of latency per page if uncached. A Redis DNS cache with a 5-minute TTL cuts DNS resolution to a sub-millisecond local lookup for hot domains. The cache key is the hostname; the value is the resolved IP address.
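The same TTL semantics can be sketched with a dict-backed cache; the HLD keeps this in Redis, and the injectable resolver and clock here are assumptions made so the behaviour is testable without network access:

```python
import socket
import time

class DNSCache:
    """TTL cache sketch for hostname -> IP lookups."""
    def __init__(self, ttl_sec=300, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl, self.resolver, self.clock = ttl_sec, resolver, clock
        self.entries = {}                      # host -> (ip, expiry)

    def resolve(self, host: str) -> str:
        entry = self.entries.get(host)
        if entry and self.clock() < entry[1]:  # cache hit within TTL
            return entry[0]
        ip = self.resolver(host)               # miss: real DNS lookup
        self.entries[host] = (ip, self.clock() + self.ttl)
        return ip
```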

Write Path: From Discovered URL to Content Store

This sequence shows what happens when the Parser discovers a brand-new URL on a page that was just fetched.

sequenceDiagram
    participant P as Parser
    participant BF as Bloom Filter
    participant F as URL Frontier
    participant S as Scheduler
    participant PM as Politeness Module
    participant W as Fetcher Worker
    participant SH as SimHash Index
    participant CS as Content Store

    P->>BF: CHECK url_hash
    BF-->>P: NOT_SEEN
    P->>BF: ADD url_hash
    P->>F: ENQUEUE(url, priority)
    F-->>S: next URL (priority order)
    S->>PM: is_allowed(domain, path)
    PM-->>S: ALLOWED (rate limit OK)
    S->>W: FETCH(url)
    W-->>P: raw HTML + headers
    P->>SH: CHECK content_hash (SimHash)
    SH-->>P: NEW_CONTENT
    P->>CS: WRITE(url, html, metadata)

The write path has two deduplication gates: one at URL enqueue time (Bloom filter check on the URL hash) and one at content write time (SimHash comparison). A URL that passes the first gate but produces a near-duplicate page is fetched but not stored β€” saving Content Store space while still consuming a fetch slot.

Read Path: How the Scheduler Picks the Next URL to Fetch

The Scheduler loop runs continuously:

  1. Pop the highest-priority URL from the front-end priority queue in the Frontier
  2. Look up the domain's last-fetch timestamp in Redis (domain:last_fetch:{domain})
  3. If now - last_fetch < crawl_delay, push the URL into the domain's back-off queue and pop the next URL instead
  4. Check robots.txt cache; skip if the path is disallowed
  5. Dispatch the URL to an available Fetcher worker in the domain's worker partition
  6. Update domain:last_fetch:{domain} = now

The Scheduler never blocks on a single domain. If nytimes.com is in its crawl-delay window, the Scheduler skips to the next domain. This is the key insight that keeps throughput high: domain-level politeness does not pause the entire crawl.

Performance Analysis: Crawl Throughput Ceiling and Latency Contributors

Understanding the latency budget per page fetch is essential for sizing the fetcher pool and diagnosing production slowdowns.

| Latency Contributor | Typical Cost | Mitigation |
| --- | --- | --- |
| DNS resolution (cold) | 50–200 ms | Redis DNS cache (5-min TTL) reduces to < 1 ms on cache hit |
| TCP handshake (cold) | 100–300 ms | Persistent connection pool per domain; reuse across fetches |
| TLS handshake (HTTPS) | 50–150 ms | TLS session resumption; HTTP/2 multiplexing |
| HTTP response (server-side) | 50–500 ms | Configurable timeout (2 s default); skip slow servers after 3 timeouts |
| HTML download (250 KB avg) | 20–100 ms | Content size cap (10 MB); truncate at limit |
| Total p50 per fetch | ~300–800 ms | Target: p99 ≤ 5 s |

At 385 fetches/sec with a p50 latency of 500 ms, you need at least 193 concurrent Fetcher workers to keep the pipeline saturated (Little's Law: N = Ξ» Γ— W = 385 Γ— 0.5). In practice, size the pool at 2Γ— the theoretical minimum (β‰ˆ 400 workers) to absorb bursts and slow-domain back-off delays.
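The sizing arithmetic as a small helper, with the 2× burst headroom from the paragraph above as the default:

```python
import math

def fetcher_pool_size(qps: float, p50_latency_sec: float,
                      headroom: float = 2.0) -> int:
    """Little's Law: N = lambda * W concurrent workers to sustain `qps`
    at the given per-fetch latency, times a burst-headroom multiplier."""
    return math.ceil(qps * p50_latency_sec * headroom)

fetcher_pool_size(385, 0.5, headroom=1.0)   # theoretical minimum: 193 workers
fetcher_pool_size(385, 0.5)                 # with 2x headroom: 385 workers
```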

Throughput ceiling: the Scheduler's single-threaded dispatch loop becomes the bottleneck above approximately 2,000 dispatches/sec. Above that, partition the Scheduler by domain range into multiple Scheduler instances (each managing a slice of the domain hash space).

Bloom filter lookup latency: a single BF.EXISTS call to Redis takes 0.1–0.5 ms β€” negligible at 385 checks/sec. At 100Γ— scale (38,500/sec), batch lookups via Redis pipeline or a local in-process Bloom filter replica reduce this to effectively zero.

🗄️ Data Model: The Three Tables That Drive Crawl State

-- URL Frontier: persistent queue state (overflow to DB when in-memory queue is full)
CREATE TABLE url_frontier (
    url             TEXT        NOT NULL,
    url_hash        BIGINT      NOT NULL,           -- FNV-64 hash for fast dedup lookup
    domain          TEXT        NOT NULL,
    priority        FLOAT       NOT NULL DEFAULT 0.5,
    enqueued_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status          TEXT        NOT NULL DEFAULT 'pending',  -- pending | in_flight | done | error
    PRIMARY KEY (url_hash)
);
CREATE INDEX idx_frontier_priority ON url_frontier (priority DESC, enqueued_at ASC)
    WHERE status = 'pending';

-- Crawled pages: source of truth for fetched content
CREATE TABLE crawled_pages (
    url             TEXT        NOT NULL PRIMARY KEY,
    url_hash        BIGINT      NOT NULL UNIQUE,
    domain          TEXT        NOT NULL,
    http_status     SMALLINT    NOT NULL,
    content_hash    BIGINT      NOT NULL,           -- SimHash fingerprint (64-bit)
    md5_checksum    CHAR(32),                       -- exact-duplicate fast check
    content_size    INT,                            -- raw HTML size in bytes
    last_crawled    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    crawl_count     INT         NOT NULL DEFAULT 1  -- incremented on re-crawl
);
CREATE INDEX idx_pages_domain ON crawled_pages (domain, last_crawled DESC);

-- Domain metadata: politeness and robots.txt state
CREATE TABLE domain_metadata (
    domain          TEXT        NOT NULL PRIMARY KEY,
    robots_txt      TEXT,                           -- cached robots.txt content
    crawl_delay_sec FLOAT       NOT NULL DEFAULT 1.0,
    last_robots_fetch TIMESTAMPTZ,
    is_blocked      BOOLEAN     NOT NULL DEFAULT FALSE,  -- manually blocked domain
    total_pages_crawled INT     NOT NULL DEFAULT 0
);

The content_hash column in crawled_pages stores the 64-bit SimHash fingerprint. On re-crawl, if the new SimHash is within 3 bits of the stored value, the page is marked as near-duplicate and the Content Store write is skipped β€” only last_crawled and crawl_count are updated.

πŸ”‘ Cache Design: DNS, robots.txt, and the Bloom Filter

| Cache Layer | What it Stores | Backend | TTL | Eviction |
| --- | --- | --- | --- | --- |
| DNS cache | hostname → IP address | Redis (STRING) | 5 minutes | TTL-based; lazy expiry |
| robots.txt cache | domain → parsed rules | Redis (HASH) | 24 hours | TTL-based; refresh on crawl |
| URL Bloom filter | Seen URL bit array (12 GB) | Redis (RedisBloom BF) | Persistent (no TTL) | Never evicted; rebuilt on restart from crawled_pages |
| SimHash index | content_hash → url (for near-dup lookup) | Redis (HASH, sharded) | 30 days | LRU; evict stale pages |
| Domain rate limiter | domain → request count in window | Redis (STRING + EXPIRE) | 1 second (window) | TTL-based; auto-resets |

The Bloom filter is the single most important cache in the system. It is always kept warm: on cold start, it is rebuilt by scanning the url_hash column in crawled_pages. This takes about 30 minutes for 10 billion entries but happens only on full restarts β€” normal rolling deploys keep the Bloom filter alive in Redis.

βš–οΈ Trade-offs, Failure Modes, and Scaling Limits

URL Frontier as the Primary Bottleneck

The Frontier is the single coordination point for all crawl workers. If the Frontier is a single Redis sorted set, it becomes a write bottleneck above ~10K enqueue ops/sec.

Solution: Partition the Frontier by hash(domain) % N. Each partition owns a subset of domains and is managed by a dedicated Scheduler instance. Workers for a given domain always communicate with the same Scheduler β€” no cross-partition coordination needed.

Distributed Bloom Filter: Scaling Beyond 12 GB

A single 12 GB Bloom filter fits in one Redis instance today. At 100 billion URLs it would require 120 GB. Options:

| Approach | Trade-off |
| --- | --- |
| Sharded Bloom filters (N filters, partition by URL hash) | Scales linearly; each shard is independent; requires a routing layer |
| Counting Bloom filter | Supports deletion (useful for re-crawl expiry); 4× memory overhead |
| Cuckoo filter (RedisBloom CF) | Better space efficiency than Bloom; supports deletion; slightly higher constant overhead |

For this design, sharded Bloom filters partitioned by domain are the recommended scaling path: the scheme aligns with the Frontier partitioning and reuses its routing logic rather than introducing a new routing layer.

Fetcher Pool Scaling

Add fetcher workers horizontally. Each new worker node registers itself with the Scheduler's domain assignment table. The hash(domain) % N function is recomputed on worker join/leave β€” a consistent hashing ring avoids re-assigning all domains on every topology change.
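A consistent-hashing ring can be sketched as follows; the 100 virtual nodes per worker and the worker names are illustrative choices, and the point is that adding a worker remaps only a fraction of domains instead of nearly all of them:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing for domain -> worker assignment: joining or
    leaving workers remaps only a small slice of the domain space."""
    def __init__(self, workers, vnodes=100):
        self.ring = sorted(
            (self._h(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

    def worker_for(self, domain: str) -> str:
        # first ring point clockwise from the domain's hash
        idx = bisect.bisect(self.keys, self._h(domain)) % len(self.keys)
        return self.ring[idx][1]
```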

Content Store Throughput

At 400 writes/sec of 250 KB pages, the Content Store absorbs ~100 MB/sec of raw write throughput. S3-compatible object storage handles this easily. The content hash is the object key β€” exact-duplicate pages never generate a new S3 write (same key = same content, already stored).

🧭 Decision Guide: Key Architectural Choices

| Choice | Recommendation | Avoid |
| --- | --- | --- |
| URL deduplication | Bloom filter (Redis) — O(1) lookup, fixed 12 GB memory | Hash set — O(n) memory growth; 500+ GB at 10B URLs |
| Frontier traversal order | BFS with priority scoring for broad crawl | DFS — exhausts single domains, violates politeness |
| Frontier partitioning | hash(domain) % N — centralises per-domain state per worker | Random partitioning — scatters domain state across workers, breaks rate limiting |
| Content dedup first pass | MD5 checksum — O(1) exact match before SimHash | SimHash only — far more expensive per page than an MD5 fast path |
| robots.txt caching | Redis with 24-hour TTL | Fetch on every request — adds latency and hammers small servers |
| Crawl-delay enforcement | Redis Lua script (atomic INCR + EXPIRE per domain) | Application-level sleep — blocks the Scheduler thread; non-atomic under concurrent workers |
| DNS caching | Redis with 5-minute TTL | No cache — 50–200 ms added per fetch; catastrophic at 385 fetches/sec |

🌍 Real-World Applications: How Googlebot, Common Crawl, and Bingbot Scale

The design in this HLD is not hypothetical β€” it maps directly to how the three dominant production crawlers are architected.

Googlebot operates a geographically distributed fleet with separate crawl workers for different content verticals (news, shopping, images). Politeness is enforced per-IP, not just per-domain, so that Google's distributed fleet doesn't accidentally aggregate into a DDoS from multiple geographic locations hitting the same server. Google uses a custom URL Frontier backed by Bigtable (essentially a distributed priority queue at Google scale) and a proprietary content deduplication pipeline that runs SimHash at the web scale.

Common Crawl (commoncrawl.org) is the open-source reference crawl: it crawls approximately 3 billion pages per month and makes the raw WARC (Web ARChive) files publicly available on S3. Common Crawl uses Apache Nutch (covered below) for crawl coordination and stores the raw HTML + metadata in the WARC format β€” exactly the Content Store pattern described in this HLD. The WARC files are consumed by downstream ML training pipelines (GPT-3 was partially trained on Common Crawl data).

Bingbot (Microsoft) differs from Googlebot primarily in its adaptive crawl-rate policy: Bingbot reads the X-Robots-Tag HTTP response header in addition to robots.txt, and dynamically backs off per-domain based on server response time β€” if a server's p95 latency exceeds 500 ms, Bingbot halves its crawl rate for that domain. This is a production-grade politeness heuristic that goes beyond simple Crawl-delay enforcement.

| Crawler | Scale | URL Frontier backend | Dedup strategy | Politeness mechanism |
| --- | --- | --- | --- | --- |
| Googlebot | ~15B pages/month | Bigtable (distributed priority queue) | SimHash + content hash | Per-IP rate limit; robots.txt; adaptive backoff |
| Common Crawl | ~3B pages/month | Apache Nutch / HDFS CrawlDB | URL hash + WARC dedup | robots.txt + Crawl-delay |
| Bingbot | Undisclosed | Proprietary | Content hash | robots.txt + server latency adaptive backoff |

The unifying principle across all three: the URL Frontier and politeness module are the hardest components to get right, while the Fetcher workers are straightforward horizontal scale problems.

πŸ› οΈ Apache Nutch: How It Implements Distributed Crawling

Apache Nutch is the reference open-source implementation of a distributed web crawler, and the project from which Apache Hadoop was originally spun out. Understanding Nutch's architecture validates the design choices in this HLD.

Nutch uses Hadoop MapReduce (or Apache YARN) to distribute crawl phases across a cluster:

| Nutch Phase | MapReduce Job | What it does |
| --- | --- | --- |
| Inject | Single mapper | Seeds the URL database with the initial URL list |
| Generate | Partitioned mapper | Selects URLs from the CrawlDB for the next fetch batch; enforces politeness limits per domain |
| Fetch | Multi-threaded mapper | Downloads pages using Nutch's built-in HTTP fetcher; respects robots.txt and Crawl-delay |
| Parse | Mapper + Reducer | Extracts links and metadata from raw HTML; runs pluggable parser plugins |
| UpdateDB | Reducer | Merges fetch results back into the CrawlDB; updates crawl scores and status |
| Invert Links | Reducer | Builds the inbound link graph for PageRank-style scoring |

A minimal Nutch crawl configuration showing politeness settings:

<!-- nutch-site.xml: politeness and fetch configuration -->
<configuration>
  <!-- Per-domain minimum fetch delay: 1000 ms (1 req/sec max) -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>

  <!-- Maximum threads per queue (domain): prevents hammering a single server -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>

  <!-- Treat HTTP 403 on /robots.txt as "allow" instead of blocking the domain -->
  <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
  </property>

  <!-- Maximum content size to download: 10 MB per page -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>

  <!-- Apply URL normalizers when updating the CrawlDB (canonicalise URLs before dedup) -->
  <property>
    <name>crawldb.url.normalizers</name>
    <value>true</value>
  </property>
</configuration>

Nutch's CrawlDB is an HDFS-backed persistent store that maps directly to the url_frontier and crawled_pages tables in this HLD's data model. The fetcher.threads.per.queue=1 configuration is exactly the "max 1 req/sec per domain" politeness rule β€” implemented as a thread quota rather than a timer, but with the same effect.

For a deep dive on Apache Nutch's Hadoop integration and plugin system, see the Apache Nutch documentation or the planned follow-up companion post.

πŸ§ͺ Practical Example: Walking https://news.ycombinator.com Through the Full Crawl Pipeline

A concrete trace of one URL from seed injection to Content Store write β€” the kind of step-by-step walkthrough that wins points in interviews.

Step 1 β€” Seed injection: The URL https://news.ycombinator.com is injected as a seed with priority 0.9 (high-authority domain). It enters the Frontier's per-domain FIFO queue for news.ycombinator.com.

Step 2 β€” Bloom filter check (URL dedup): The Scheduler calls BF.EXISTS for the FNV-64 hash of the URL. Result: 0 (not seen). The URL is added to the Bloom filter (BF.ADD) and enqueued.

Step 3 β€” Politeness check: The Scheduler looks up news.ycombinator.com in the robots.txt cache (Redis HASH key robots:news.ycombinator.com). Cache miss β†’ fetches /robots.txt, finds Crawl-delay: 30 for most bots but no Disallow for the root path. The crawl delay is stored; the URL is allowed.

Step 4 β€” Rate limit gate: Redis Lua script checks crawl:ratelimit:news.ycombinator.com. Count = 0 β†’ set to 1 with 30-second EXPIRE. Gate returns 1 (allowed). A Fetcher worker is dispatched.

Step 5 β€” Fetch: The Fetcher resolves DNS (cache hit: 45.79.22.115), opens a persistent HTTPS connection, and downloads the HTML. Response: 200 OK, 42 KB, latency 120 ms.

Step 6 β€” Content dedup: Parser computes MD5 (a3f9...) and SimHash (0x3B4A...) for the page. Checks crawled_pages β€” no existing entry. New content confirmed; write to Content Store (S3).
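The SimHash half of Step 6 can be sketched as a toy fingerprint function: each token is hashed to 64 bits, every bit position accumulates +1/βˆ’1 votes across tokens, and the sign of each total becomes one fingerprint bit. The tokenizer, the choice of MD5 as the token hash, and the sample strings are all assumptions for illustration; production SimHash typically weights tokens (e.g. by TF-IDF).

```python
import hashlib
import re

def simhash64(text: str) -> int:
    """Toy 64-bit SimHash: per-bit voting over token hashes."""
    votes = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

page_a = "Show HN: a distributed web crawler written in Rust"
page_b = "Show HN: a distributed web crawler written in Rust (updated)"
page_c = "Completely unrelated text about cooking pasta at home"

# Near-duplicates land close in Hamming distance; unrelated pages land
# near the random baseline of ~32 flipped bits.
print(hamming(simhash64(page_a), simhash64(page_b)))  # small distance
print(hamming(simhash64(page_a), simhash64(page_c)))  # large distance
```

This is the property MD5 cannot give you: page_a and page_b have wildly different MD5 checksums, but their SimHash fingerprints stay close because almost all token votes are shared.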

Step 7 β€” Link extraction: Parser finds 30 outbound links (story URLs on the HN homepage). Each link goes through the Bloom filter check. 2 are already seen (return 0); 28 are new. The 28 new URLs are enqueued in the Frontier.

Outcome: 1 page stored, 28 new URLs discovered, Bloom filter updated with 29 new entries (1 page URL + 28 link URLs), rate-limit window set to 30 seconds before the next news.ycombinator.com fetch is allowed.

This trace is the right level of detail for an interview: it shows you understand the data flow at every gate, not just the high-level components.

πŸ“š Lessons Learned from Building Crawlers at Scale

  • Never skip robots.txt compliance, even in internal tools. A crawler that ignores robots.txt will get IP-banned by CDNs and Cloudflare within minutes at scale. Always cache and honour it.
  • The URL Frontier is harder than the Fetcher. Most engineers focus on HTTP fetching performance, but the Frontier is the true scaling bottleneck. Partition it early and make it domain-aware from day one.
  • Bloom filters have no false negatives β€” design around their false-positive direction. A false positive means you skip a valid URL. Set your false-positive rate target based on acceptable crawl miss rate, not memory preference.
  • SimHash thresholds need tuning per content vertical. A threshold of 3 Hamming bits works well for news articles but is too aggressive for e-commerce product pages, which differ legitimately by SKU with nearly identical templates.
  • DNS caching is mandatory, not optional. At 385 fetches/sec, uncached DNS adds an average 100 ms to every fetch = 38 seconds of wasted latency per second of crawl throughput. Cache aggressively with a 5-minute TTL.
  • Connection pooling per domain matters more than per-worker. Keep persistent HTTP connections to each domain warm in the assigned worker. A cold TCP handshake costs 100–300 ms.

πŸ“Œ TLDR: Summary & Key Takeaways

  • A distributed web crawler has three core design challenges: priority-aware URL Frontier, politeness enforcement per domain, and two-layer deduplication (URL-level Bloom filter + content-level SimHash)
  • Partition everything by domain: Frontier queues, Fetcher workers, DNS caches, and rate limiters β€” domain is the natural sharding key because all politeness state is domain-scoped
  • Bloom filters beat hash sets at billion-URL scale: 12 GB for 10B URLs at 1% false-positive rate vs. 500+ GB for a hash set β€” with the trade-off that ~1% of valid URLs are silently skipped
  • SimHash catches what MD5 misses: near-duplicate pages with minor differences (ads, pagination, sidebars) have identical SimHash fingerprints within 3 bits but completely different MD5 checksums
  • The Scheduler is the politeness enforcer, not individual Fetcher workers β€” centralising rate-limit decisions per domain partition prevents coordination overhead across workers
  • Apache Nutch validates this architecture β€” its Hadoop-based MapReduce phases map directly to the Frontier β†’ Scheduler β†’ Fetcher β†’ Parser β†’ UpdateDB pipeline

πŸ“ Practice Quiz

  1. Why does a web crawler use a Bloom filter for URL deduplication instead of storing seen URLs in a hash set?

    • A) Bloom filters support O(1) deletion, making it easy to remove expired URLs
    • B) A Bloom filter uses a fixed amount of memory (e.g., 12 GB for 10B URLs) with a configurable false-positive rate, while a hash set grows linearly with the number of URLs β€” consuming 500+ GB at 10B entries
    • C) Bloom filters are faster than hash sets for exact lookups in all cases

    Correct Answer: B

  2. Why is per-domain rate limiting applied at the Scheduler rather than inside each Fetcher worker?

    • A) Fetcher workers don't have network access to the domain metadata
    • B) Centralising rate-limit decisions at the Scheduler (one decision point per domain partition) prevents two workers assigned to the same domain from each independently approving a fetch, which would double the effective request rate
    • C) The Scheduler runs on faster hardware than Fetcher workers

    Correct Answer: B

  3. A crawler fetches a news article that has been republished on 50 different sites with only the publication date changed in the byline. Which deduplication mechanism catches these near-duplicates, and why doesn't MD5 work here?

    • A) The URL Bloom filter catches them because the article title appears in all 50 URLs
    • B) MD5 works fine β€” any byte-level difference in the page produces a completely different checksum, which is exactly what we want for detecting changes
    • C) SimHash catches them β€” the 64-bit fingerprint of two near-identical documents differs by ≀ 3 bits (Hamming distance), while MD5 produces completely different hashes for documents that differ by even a single byte

    Correct Answer: C

  4. Why does a broad web crawl use BFS (breadth-first search) for URL expansion rather than DFS (depth-first search)?

    • A) BFS is always faster than DFS because it uses a queue instead of a stack
    • B) DFS would crawl all 500,000 pages of a single domain before visiting any page on another domain β€” concentrating fetch load on one server for extended periods and violating politeness rules; BFS distributes fetches across many domains naturally
    • C) DFS cannot handle redirects, making it unsuitable for web crawling

    Correct Answer: B

  5. Open-ended challenge: A domain's robots.txt specifies Crawl-delay: 10 (10 seconds between requests). Your crawler has 50 URLs queued for that domain. How does the Scheduler handle this without blocking crawl throughput to all other domains? Walk through the data flow.

Written by Abstract Algorithms (@abstractalgorithms)