Core Systems
Web Crawler
A distributed crawler that starts from seed URLs, traverses the web breadth-first, extracts and stores page text (a corpus for LLM training), finishing within 5 days across 10 billion pages — fault-tolerant and polite to every host it visits.
Requirements
Functional
- Crawl the web starting from a set of seed URLs.
- Extract text from each page and store it (LLM-training corpus).
- Discover new URLs from each page and continue BFS until the reachable web is crawled.
Non-functional
- Fault tolerant — failed fetches/parses retry and eventually park; no lost work.
-
Polite — honor
robots.txt, per-domaincrawl-delay, rate limits. - Scale to 10B pages; every component scales horizontally.
- Efficient — finish in < 5 days; never re-crawl or re-parse duplicates.
- AP-leaning — availability >> consistency; crawling a URL twice is acceptable, stalling is not.
Scale & back-of-the-envelope
| Quantity | Value | Derivation |
|---|---|---|
| Pages | 10B (10¹⁰) | given |
| Avg page size | 2 MB | given |
| Throughput | ~10,000 pages/s | 10¹⁰ ÷ ~10⁶ s |
| Bandwidth (bottleneck) | ~160 Gbps | 10⁴ × 2 MB = 20 GB/s → ~4 fat-NIC fetchers |
| Raw HTML storage | ~20 PB | 10¹⁰ × 2 MB (lifecycle-expire after parse) |
| Seen-set memory | ~200 GB | 10¹⁰ × 20 B/hash → sharded Redis / Bloom filter |
Crawling is network/IO-bound, not CPU-bound — a handful of fat-pipe machines saturate the bandwidth, which is why the fetcher fleet is just ~4 machines.
Core entities
- Text Data — cleaned text extracted from a page (the deliverable in S3).
- URL Metadata — raw-HTML location, last crawl time, content hash, status.
-
Domain Metadata — politeness rules parsed from
robots.txt.
This is a batch system: input = seed URLs, output = text corpus.
Internal queue contracts are { url } (frontier) and
{ url, s3Link } (parsing).
High-level design
A naive monolith couples network-bound fetching and CPU-bound parsing in one process. The final design decouples them through queues so each scales on its own signal — bandwidth vs. queue depth — and raw HTML is persisted to S3 before parsing so a parser crash never loses a fetched page.
flowchart TD
SEED["Seed URLs"] --> FQ["URL Frontier (SQS)"]
FQ --> CR["Crawler x4 (fetcher)"]
CR -->|resolve IP| DNS["DNS + Redis cache"]
CR -->|GET page| WEB["Webpage"]
CR -->|rate limit| RL["Redis politeness"]
CR -->|save raw HTML| S3R["S3 Raw HTML"]
CR -->|update status| DDB["DynamoDB URL + Domain"]
CR -->|enqueue msg| PQ["Parsing Queue"]
CR -->|5th retry| DLQ["Dead Letter Queue"]
PQ --> PW["Parsing Workers (autoscale)"]
PW -->|fetch HTML| S3R
PW -->|save text| S3T["S3 Text Data"]
PW -->|dedup check| DDB
PW -->|new URLs| FQ
Fetch/parse decoupled by a queue; the Parsing Queue is a shock absorber for fetch bursts.
Deep dive · the URL frontier (politeness + priority)
The frontier is the heart of a crawler — not just "a queue." It enforces politeness (don't overload a host) and priority (fetch important pages first) while staying BFS-ordered.
flowchart TD
IN["Incoming URLs"] --> PR["Priority Router"]
PR -->|high| HQ["High Priority Queue"]
PR -->|low| LQ["Low Priority Queue"]
HQ --> HOST["Per-Host Queues"]
LQ --> HOST
HOST --> RL["Politeness Rate Limiter (Redis)"]
RL -->|delay ok| CR["Crawler Fetch"]
RL -->|too soon| WAIT["Delay then requeue"]
WAIT --> HOST
-
Politeness: each domain has a
crawlDelay(e.g. 10 s). Redis trackslastCrawledTime/ a token bucket; if hit too recently, the URL is delayed and requeued. - Per-host queues: one busy domain (millions of links) never starves others.
- Priority: rank by authority/freshness/depth into high/low queues for the 5-day budget.
- Implementation: SQS (durable, visibility timeouts, built-in DLQ) + Redis timing layer.
Deep dive · dedup (URLs and content)
Two independent dedup problems map to the two efficiency rules: don't crawl a URL already crawled; don't parse content already parsed.
flowchart LR
U["Extracted URL"] --> N["Normalize URL"]
N --> SEEN{"Seen before?"}
SEEN -->|yes| DROP["Drop URL"]
SEEN -->|no| ADD["Add to Frontier"]
ADD --> FETCH["Fetch then hash HTML"]
FETCH --> DUP{"Content hash exists?"}
DUP -->|yes| SKIP["Skip parse: dup content"]
DUP -->|no| PARSE["Parse and store text"]
-
URL dedup: normalize (lowercase host, strip
ports/fragments/tracking params), then check a sharded
Redis set / Bloom filter (~200 GB), backed by the
urlpartition key in DynamoDB. A bloom false positive just skips a truly-new URL — acceptable. - Content dedup: hash the HTML at fetch; a GSI on the hash lets parsers skip byte-identical bodies (mirrors, session-id URLs, printer-friendly variants). Use SimHash/MinHash for near-duplicates.
Deep dive · robots.txt & crawler traps
Fetch and cache each domain's /robots.txt once; store
rules on the Domain record so a single fetch governs millions of that
domain's URLs.
User-agent: *
Disallow: /private/
Crawl-delay: 10
The web is adversarial — defend against infinite/dynamic URLs
(calendar "next month", faceted search) with URL normalization,
max depth and per-domain page caps; cap redirect
chains; detect spider-traps/link-farms by abnormal fan-out; check
Content-Type/Content-Length to skip
binaries; enforce strict fetch timeouts.
Deep dive · fault tolerance & CAP posture
sequenceDiagram
participant F as Frontier (SQS)
participant C as Crawler
participant W as Website
participant S as S3
participant M as DynamoDB
participant P as Parsing Queue
participant PW as Parser
F->>C: dequeue URL
C->>M: seen URL? (dedup)
C->>W: GET page (after politeness wait)
W-->>C: HTML
C->>S: store raw HTML
C->>M: update lastCrawlTime + hash
C->>P: enqueue url and s3Link
P->>PW: deliver message
PW->>S: fetch raw HTML
PW->>S: store text data
PW->>F: enqueue new URLs
- Retry with exponential backoff; after the 5th attempt a URL goes to the DLQ.
-
At-least-once + idempotency — storage keyed by
url/content hash, so reprocessing overwrites the same object. - Persist-before-parse — a parser crash costs only a cheap re-parse, never a re-fetch.
CAP choice: AP
Under a partition, fetchers/parsers keep making progress rather than blocking on a globally consistent "what's been crawled" view. The cost — occasional duplicate work — is harmless because writes are idempotent. This is why DynamoDB (tunable, highly available) fits the metadata store.
Data model
URL table (DynamoDB)
| Field | Notes |
|---|---|
url |
Partition key — normalized URL; powers URL dedup |
s3Link |
Pointer to raw HTML blob |
lastCrawlTime |
Recrawl + status |
hash |
Content hash — GSI for content dedup |
status |
crawled / pending / failed |
Domain table (DynamoDB)
| Field | Notes |
|---|---|
domain |
Partition key |
lastCrawledTime |
Drives per-domain politeness pacing |
disallow |
Disallowed path prefixes |
crawlDelay |
Min seconds between requests |
Plus S3 (raw HTML, text data), Redis (DNS cache, rate-limit counters, seen-set), and SQS queues (frontier + parsing, each with a DLQ).
Bottleneck summary
Bandwidth (~160 Gbps) is the dominant constraint → ~4 fat-NIC fetchers. DNS → aggressive Redis cache. Politeness → per-host queues + rate limiting. Parsing CPU → autoscale on queue depth. Dedup memory (200 GB) → sharded Redis/Bloom. Raw HTML (20 PB) → S3 with lifecycle expiry.