System Design Notes All designs

Core Systems

Bit.ly — URL Shortener

Convert long URLs into compact, unique short codes and redirect visitors back to the original destination in ~200 ms — at 100M DAU and 1B stored URLs, with optional custom aliases, expiration, and analytics.

Requirements

Functional

Non-functional

Below the line (deferred)

User accounts (assumed to exist), rich click analytics, and link editing are treated as extensions.

Scale & back-of-the-envelope

Metric Value Derivation
Daily active users 100M (10⁸) given
Total URLs 1B (10⁹) given
Avg redirect QPS ~1,000 RPS 10⁸ ÷ 10⁵ s/day
Peak redirect QPS ~10k–100k RPS avg × 10–100 peak factor
Read : write ~100 : 1 reads dominate
Storage ~0.5 TB 1B rows × ~500 B

Short-code length (base62 = 0-9 A-Z a-z)

Random generation is subject to the birthday paradox (~expected collisions grow with n²/2M), which is why a uniqueness strategy matters (see deep dive).

Core entities

API design

Shorten

POST /urls
{
  "originalUrl":    "https://example.com/very/long/path?x=1",
  "alias":          "my-link",                 // optional custom alias
  "expirationTime": "2026-12-31T00:00:00Z"     // optional
}

201 Created  → { "shortUrl": "https://bit.ly/abc123" }
409 Conflict → requested custom alias already taken

Redirect

GET /{shortCode}
302 Found
Location: <originalUrl>
404 / 410 → unknown or expired code

Write endpoints sit behind auth + rate limiting to prevent abuse and keyspace scraping.

High-level design

A client hits an API Gateway (TLS, auth, rate limiting, routing) that splits traffic into a Read Service (redirects — the hot path) and a Write Service (creating short URLs). The Read Service is fronted by a Redis read-through LRU cache keyed shortCode → longUrl; on a miss it falls back to the database. The Write Service generates a unique code via an atomic global counter, persists the mapping, and returns the short URL. Redirects are returned as 302.

flowchart LR
    Client["Client"]
    GW["API Gateway"]
    Read["Read Service"]
    Write["Write Service"]
    Cache["Redis Cache (LRU)"]
    Counter["Global Counter (INCR)"]
    DB[("Database")]

    Client -->|"shorten / visit"| GW
    GW --> Read
    GW --> Write
    Read -->|"read-through"| Cache
    Read --> DB
    Write --> DB
    Write -->|"INCR"| Counter
    GW -->|"302 redirect"| Client
      

Read/write split behind a gateway; Redis fronts the redirect hot path.

Deep dives

1 · Short-code generation

Codes must be fast, unique, and short. Four candidates:

Approach How Verdict
Prefix of long URL Leading chars ❌ Not unique — shared prefixes collide.
Random → base62 Random int, encode ⚠️ Birthday-paradox collisions; needs a DB check + retry.
Hash → base62 md5(url), truncate to 6 ⚠️ Deterministic dedup, but truncation re-introduces collisions.
Counter + base62 Atomic INCR, encode ✅ Guaranteed unique, no check, fast. ❌ Predictable.

Chosen: counter + base62, mitigating predictability with a bijective scramble (e.g. sqids/Hashids) so codes are non-sequential yet still unique & reversible, plus rate limiting and a "don't shorten private URLs" caveat. Scale the counter with Redis INCR and range/block leasing (each writer leases 1,000 IDs at a time) to remove per-request contention.

flowchart TD
    Start["Need short code"]
    Q1{"Custom alias?"}
    UseAlias["Use alias, verify unique"]
    Gen["Counter + base62"]
    Scramble["Bijective scramble (sqids)"]
    Done["Store mapping"]

    Start --> Q1
    Q1 -->|yes| UseAlias
    Q1 -->|no| Gen
    Gen --> Scramble
    Scramble --> Done
    UseAlias --> Done
      

2 · Read path & caching (scaling the hot path)

Redirects carry ~100× the traffic of writes, so the path is cache-first. URL popularity follows a power law, so a modest Redis cache yields a high hit ratio and keeps most redirects in single-digit ms. On a miss, read the DB and populate the cache with a TTL aligned to the URL's expiration; add read replicas and optionally a CDN/edge layer.

sequenceDiagram
    participant C as Client
    participant R as Read Service
    participant Cache as Redis
    participant DB as Database
    C->>R: GET /shortCode
    R->>Cache: GET shortCode
    alt cache hit
        Cache-->>R: longUrl
    else miss
        Cache-->>R: nil
        R->>DB: SELECT longUrl WHERE shortCode
        DB-->>R: longUrl
        R->>Cache: SET shortCode longUrl
    end
    R-->>C: 302 redirect to longUrl
      

3 · 301 vs 302 redirect

301 Permanent 302 Found ✅
Browser caches redirect Yes No
Later clicks hit our service No Yes
Analytics / expiry ❌ Lost / hard ✅ Every click counted
Server load Lower Slightly higher

Use 302 because expiration and analytics require every click to reach the service. Choose 301 only when raw redirect performance outweighs tracking.

4 · Expiration & analytics

Lazy deletion on read (now > expirationTime → 410 + evict from cache) plus a background purge job. Analytics are emitted asynchronously (fire-and-forget to Kafka → stream processing) so they never block the redirect.

Data model

Column Type Notes
short_url VARCHAR(7) Primary key — code or custom alias; redirect is a single point-read by PK
long_url VARCHAR(2048) Original destination
expiration_time TIMESTAMP NULL Optional expiry
creation_time TIMESTAMP When created
user_id BIGINT Owner (FK → users)

Indexes: clustered PK on short_url (the O(1) redirect lookup), secondary on user_id ("my links"), optional on expiration_time for cleanup.

DB choice: the access pattern is a key-value lookup by short code, so either relational (Postgres/MySQL with replicas, shard by hash(short_code)) or a NoSQL KV store (DynamoDB/Cassandra with native TTL) works. Shard by short_code for even load.

Bottlenecks & scaling

Concern Risk Mitigation
Read hot path 10k–100k peak RPS Redis read-through + LRU; DB read replicas; CDN/edge
Hot keys Viral links dominate Cache absorbs; replicate hot keys; edge caching
Counter SPOF / contention Redis INCR + per-writer range leasing; replicate
Predictable codes Enumeration / scraping Bijective scramble + rate limiting
Write scale DB throughput Shard by hash(short_code)
Expired data Table bloat Lazy delete + background purge; cache TTL

Why split read & write services?

They scale independently — the Read Service scales out for the 100:1 read ratio and 200 ms target, while a Write outage never takes down redirects.