System Design Notes All designs

Core Systems

Web Crawler

A distributed crawler that starts from seed URLs, traverses the web breadth-first, extracts and stores page text (a corpus for LLM training), finishing within 5 days across 10 billion pages — fault-tolerant and polite to every host it visits.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

Quantity Value Derivation
Pages 10B (10¹⁰) given
Avg page size 2 MB given
Throughput ~10,000 pages/s 10¹⁰ ÷ ~10⁶ s
Bandwidth (bottleneck) ~160 Gbps 10⁴ × 2 MB = 20 GB/s → ~4 fat-NIC fetchers
Raw HTML storage ~20 PB 10¹⁰ × 2 MB (lifecycle-expire after parse)
Seen-set memory ~200 GB 10¹⁰ × 20 B/hash → sharded Redis / Bloom filter

Crawling is network/IO-bound, not CPU-bound — a handful of fat-pipe machines saturate the bandwidth, which is why the fetcher fleet is just ~4 machines.

Core entities

This is a batch system: input = seed URLs, output = text corpus. Internal queue contracts are { url } (frontier) and { url, s3Link } (parsing).

High-level design

A naive monolith couples network-bound fetching and CPU-bound parsing in one process. The final design decouples them through queues so each scales on its own signal — bandwidth vs. queue depth — and raw HTML is persisted to S3 before parsing so a parser crash never loses a fetched page.

flowchart TD
    SEED["Seed URLs"] --> FQ["URL Frontier (SQS)"]
    FQ --> CR["Crawler x4 (fetcher)"]
    CR -->|resolve IP| DNS["DNS + Redis cache"]
    CR -->|GET page| WEB["Webpage"]
    CR -->|rate limit| RL["Redis politeness"]
    CR -->|save raw HTML| S3R["S3 Raw HTML"]
    CR -->|update status| DDB["DynamoDB URL + Domain"]
    CR -->|enqueue msg| PQ["Parsing Queue"]
    CR -->|5th retry| DLQ["Dead Letter Queue"]
    PQ --> PW["Parsing Workers (autoscale)"]
    PW -->|fetch HTML| S3R
    PW -->|save text| S3T["S3 Text Data"]
    PW -->|dedup check| DDB
    PW -->|new URLs| FQ
      

Fetch/parse decoupled by a queue; the Parsing Queue is a shock absorber for fetch bursts.

Deep dive · the URL frontier (politeness + priority)

The frontier is the heart of a crawler — not just "a queue." It enforces politeness (don't overload a host) and priority (fetch important pages first) while staying BFS-ordered.

flowchart TD
    IN["Incoming URLs"] --> PR["Priority Router"]
    PR -->|high| HQ["High Priority Queue"]
    PR -->|low| LQ["Low Priority Queue"]
    HQ --> HOST["Per-Host Queues"]
    LQ --> HOST
    HOST --> RL["Politeness Rate Limiter (Redis)"]
    RL -->|delay ok| CR["Crawler Fetch"]
    RL -->|too soon| WAIT["Delay then requeue"]
    WAIT --> HOST
      

Deep dive · dedup (URLs and content)

Two independent dedup problems map to the two efficiency rules: don't crawl a URL already crawled; don't parse content already parsed.

flowchart LR
    U["Extracted URL"] --> N["Normalize URL"]
    N --> SEEN{"Seen before?"}
    SEEN -->|yes| DROP["Drop URL"]
    SEEN -->|no| ADD["Add to Frontier"]
    ADD --> FETCH["Fetch then hash HTML"]
    FETCH --> DUP{"Content hash exists?"}
    DUP -->|yes| SKIP["Skip parse: dup content"]
    DUP -->|no| PARSE["Parse and store text"]
      

Deep dive · robots.txt & crawler traps

Fetch and cache each domain's /robots.txt once; store rules on the Domain record so a single fetch governs millions of that domain's URLs.

User-agent: *
Disallow: /private/
Crawl-delay: 10

The web is adversarial — defend against infinite/dynamic URLs (calendar "next month", faceted search) with URL normalization, max depth and per-domain page caps; cap redirect chains; detect spider-traps/link-farms by abnormal fan-out; check Content-Type/Content-Length to skip binaries; enforce strict fetch timeouts.

Deep dive · fault tolerance & CAP posture

sequenceDiagram
    participant F as Frontier (SQS)
    participant C as Crawler
    participant W as Website
    participant S as S3
    participant M as DynamoDB
    participant P as Parsing Queue
    participant PW as Parser
    F->>C: dequeue URL
    C->>M: seen URL? (dedup)
    C->>W: GET page (after politeness wait)
    W-->>C: HTML
    C->>S: store raw HTML
    C->>M: update lastCrawlTime + hash
    C->>P: enqueue url and s3Link
    P->>PW: deliver message
    PW->>S: fetch raw HTML
    PW->>S: store text data
    PW->>F: enqueue new URLs
      

CAP choice: AP

Under a partition, fetchers/parsers keep making progress rather than blocking on a globally consistent "what's been crawled" view. The cost — occasional duplicate work — is harmless because writes are idempotent. This is why DynamoDB (tunable, highly available) fits the metadata store.

Data model

URL table (DynamoDB)

Field Notes
url Partition key — normalized URL; powers URL dedup
s3Link Pointer to raw HTML blob
lastCrawlTime Recrawl + status
hash Content hash — GSI for content dedup
status crawled / pending / failed

Domain table (DynamoDB)

Field Notes
domain Partition key
lastCrawledTime Drives per-domain politeness pacing
disallow Disallowed path prefixes
crawlDelay Min seconds between requests

Plus S3 (raw HTML, text data), Redis (DNS cache, rate-limit counters, seen-set), and SQS queues (frontier + parsing, each with a DLQ).

Bottleneck summary

Bandwidth (~160 Gbps) is the dominant constraint → ~4 fat-NIC fetchers. DNS → aggressive Redis cache. Politeness → per-host queues + rate limiting. Parsing CPU → autoscale on queue depth. Dedup memory (200 GB) → sharded Redis/Bloom. Raw HTML (20 PB) → S3 with lifecycle expiry.