Core Systems

Web Crawler

A distributed crawler that starts from seed URLs, traverses the web breadth-first, extracts and stores page text (a corpus for LLM training), finishing within 5 days across 10 billion pages — fault-tolerant and polite to every host it visits.

Requirements

Functional

Crawl the web starting from a set of seed URLs.
Extract text from each page and store it (LLM-training corpus).
Discover new URLs from each page and continue BFS until the reachable web is crawled.

Non-functional

Fault tolerant — failed fetches/parses retry and eventually park; no lost work.
Polite — honor robots.txt, per-domain crawl-delay, rate limits.
Scale to 10B pages; every component scales horizontally.
Efficient — finish in < 5 days; never re-crawl or re-parse duplicates.
AP-leaning — availability >> consistency; crawling a URL twice is acceptable, stalling is not.

Scale & back-of-the-envelope

Quantity	Value	Derivation
Pages	10B (10¹⁰)	given
Avg page size	2 MB	given
Throughput	~10,000 pages/s	10¹⁰ ÷ ~10⁶ s
Bandwidth (bottleneck)	~160 Gbps	10⁴ × 2 MB = 20 GB/s → ~4 fat-NIC fetchers
Raw HTML storage	~20 PB	10¹⁰ × 2 MB (lifecycle-expire after parse)
Seen-set memory	~200 GB	10¹⁰ × 20 B/hash → sharded Redis / Bloom filter

Crawling is network/IO-bound, not CPU-bound — a handful of fat-pipe machines saturate the bandwidth, which is why the fetcher fleet is just ~4 machines.

Core entities

Text Data — cleaned text extracted from a page (the deliverable in S3).
URL Metadata — raw-HTML location, last crawl time, content hash, status.
Domain Metadata — politeness rules parsed from robots.txt.

This is a batch system: input = seed URLs, output = text corpus. Internal queue contracts are { url } (frontier) and { url, s3Link } (parsing).

High-level design

A naive monolith couples network-bound fetching and CPU-bound parsing in one process. The final design decouples them through queues so each scales on its own signal — bandwidth vs. queue depth — and raw HTML is persisted to S3 before parsing so a parser crash never loses a fetched page.

flowchart TD
    SEED["Seed URLs"] --> FQ["URL Frontier (SQS)"]
    FQ --> CR["Crawler x4 (fetcher)"]
    CR -->|resolve IP| DNS["DNS + Redis cache"]
    CR -->|GET page| WEB["Webpage"]
    CR -->|rate limit| RL["Redis politeness"]
    CR -->|save raw HTML| S3R["S3 Raw HTML"]
    CR -->|update status| DDB["DynamoDB URL + Domain"]
    CR -->|enqueue msg| PQ["Parsing Queue"]
    CR -->|5th retry| DLQ["Dead Letter Queue"]
    PQ --> PW["Parsing Workers (autoscale)"]
    PW -->|fetch HTML| S3R
    PW -->|save text| S3T["S3 Text Data"]
    PW -->|dedup check| DDB
    PW -->|new URLs| FQ

Fetch/parse decoupled by a queue; the Parsing Queue is a shock absorber for fetch bursts.

Deep dive · the URL frontier (politeness + priority)

The frontier is the heart of a crawler — not just "a queue." It enforces politeness (don't overload a host) and priority (fetch important pages first) while staying BFS-ordered.

flowchart TD
    IN["Incoming URLs"] --> PR["Priority Router"]
    PR -->|high| HQ["High Priority Queue"]
    PR -->|low| LQ["Low Priority Queue"]
    HQ --> HOST["Per-Host Queues"]
    LQ --> HOST
    HOST --> RL["Politeness Rate Limiter (Redis)"]
    RL -->|delay ok| CR["Crawler Fetch"]
    RL -->|too soon| WAIT["Delay then requeue"]
    WAIT --> HOST

Politeness: each domain has a crawlDelay (e.g. 10 s). Redis tracks lastCrawledTime / a token bucket; if hit too recently, the URL is delayed and requeued.
Per-host queues: one busy domain (millions of links) never starves others.
Priority: rank by authority/freshness/depth into high/low queues for the 5-day budget.
Implementation: SQS (durable, visibility timeouts, built-in DLQ) + Redis timing layer.

Deep dive · dedup (URLs and content)

Two independent dedup problems map to the two efficiency rules: don't crawl a URL already crawled; don't parse content already parsed.

flowchart LR
    U["Extracted URL"] --> N["Normalize URL"]
    N --> SEEN{"Seen before?"}
    SEEN -->|yes| DROP["Drop URL"]
    SEEN -->|no| ADD["Add to Frontier"]
    ADD --> FETCH["Fetch then hash HTML"]
    FETCH --> DUP{"Content hash exists?"}
    DUP -->|yes| SKIP["Skip parse: dup content"]
    DUP -->|no| PARSE["Parse and store text"]

URL dedup: normalize (lowercase host, strip ports/fragments/tracking params), then check a sharded Redis set / Bloom filter (~200 GB), backed by the url partition key in DynamoDB. A bloom false positive just skips a truly-new URL — acceptable.
Content dedup: hash the HTML at fetch; a GSI on the hash lets parsers skip byte-identical bodies (mirrors, session-id URLs, printer-friendly variants). Use SimHash/MinHash for near-duplicates.

Deep dive · robots.txt & crawler traps

Fetch and cache each domain's /robots.txt once; store rules on the Domain record so a single fetch governs millions of that domain's URLs.

User-agent: *
Disallow: /private/
Crawl-delay: 10

The web is adversarial — defend against infinite/dynamic URLs (calendar "next month", faceted search) with URL normalization, max depth and per-domain page caps; cap redirect chains; detect spider-traps/link-farms by abnormal fan-out; check Content-Type/Content-Length to skip binaries; enforce strict fetch timeouts.

Deep dive · fault tolerance & CAP posture

sequenceDiagram
    participant F as Frontier (SQS)
    participant C as Crawler
    participant W as Website
    participant S as S3
    participant M as DynamoDB
    participant P as Parsing Queue
    participant PW as Parser
    F->>C: dequeue URL
    C->>M: seen URL? (dedup)
    C->>W: GET page (after politeness wait)
    W-->>C: HTML
    C->>S: store raw HTML
    C->>M: update lastCrawlTime + hash
    C->>P: enqueue url and s3Link
    P->>PW: deliver message
    PW->>S: fetch raw HTML
    PW->>S: store text data
    PW->>F: enqueue new URLs

Retry with exponential backoff; after the 5th attempt a URL goes to the DLQ.
At-least-once + idempotency — storage keyed by url/content hash, so reprocessing overwrites the same object.
Persist-before-parse — a parser crash costs only a cheap re-parse, never a re-fetch.

CAP choice: AP

Under a partition, fetchers/parsers keep making progress rather than blocking on a globally consistent "what's been crawled" view. The cost — occasional duplicate work — is harmless because writes are idempotent. This is why DynamoDB (tunable, highly available) fits the metadata store.

Data model

URL table (DynamoDB)

Field	Notes
`url`	Partition key — normalized URL; powers URL dedup
`s3Link`	Pointer to raw HTML blob
`lastCrawlTime`	Recrawl + status
`hash`	Content hash — GSI for content dedup
`status`	crawled / pending / failed

Domain table (DynamoDB)

Field	Notes
`domain`	Partition key
`lastCrawledTime`	Drives per-domain politeness pacing
`disallow`	Disallowed path prefixes
`crawlDelay`	Min seconds between requests

Plus S3 (raw HTML, text data), Redis (DNS cache, rate-limit counters, seen-set), and SQS queues (frontier + parsing, each with a DLQ).

Bottleneck summary

Bandwidth (~160 Gbps) is the dominant constraint → ~4 fat-NIC fetchers. DNS → aggressive Redis cache. Politeness → per-host queues + rate limiting. Parsing CPU → autoscale on queue depth. Dedup memory (200 GB) → sharded Redis/Bloom. Raw HTML (20 PB) → S3 with lifecycle expiry.