System Design Notes All designs

Analytics & Streaming

Ad Click Aggregator

When a user clicks an ad, redirect them to the advertiser and aggregate per-ad click counts over time windows for near-real-time advertiser analytics. The central tension is real-time vs. accurate — resolved with a hybrid Lambda architecture: a streaming speed layer plus a batch layer that recomputes exact counts.

Requirements

Functional

Non-functional

Scale & back-of-the-envelope

API design

GET /click?adId={adId}&impressionId={impId}&signature={sig}
  -> 302 Found  Location: {redirectUrl}     # the ad links HERE, not direct to advertiser

GET /api/v1/ads/{adId}/aggregates?from&to&granularity=minute
  -> { series: [ { minute, clicks }, ... ] }

302, never 301

A 301 (permanent) is cached by the browser, so later clicks skip our server → undercounting. 302/307 guarantees every click hits the Click Processor. impressionId + signature make the click verifiable and idempotent.

High-level design

The Click Processor validates (verify signature), dedups (Redis), looks up the redirect URL, issues the 302, and emits the click to a stream. Flink windows + aggregates into an OLAP store (speed layer); raw events also land in S3 for a daily Spark reconciliation (batch layer).

flowchart LR
    U["User Browser"] -->|"GET /click?adId"| GW["API Gateway"]
    GW -->|"302 redirect"| U
    GW --> CP["Click Processor Svc"]
    CP -->|"getRedirectUrl(adId)"| ADB[("Ad DB")]
    CP -->|"dedup check"| R[("Redis")]
    CP -->|"click event"| K["Kinesis (Kafka-equiv)"]
    K --> F["Flink (stream agg)"]
    F --> OLAP[("OLAP store")]
    K --> S3[("S3 raw events")]
    CRON["Cron Scheduler"] --> SP["Spark (map reduce)"]
    S3 --> SP
    SP --> RW["Reconciliation Worker"]
    RW --> OLAP
    ADV["Advertiser Browser"] --> QS["Query Svc"]
    QS --> OLAP
      

Deep dive · stream aggregation with windowing

Flink keys by adId and maintains 1-minute tumbling windows on event time (with watermarks), but flushes partial results every 10 s so dashboards update in near real time. The OLAP sink upserts keyed by (adId, minute); combined with Flink checkpointing this gives exactly-once — a replay overwrites rather than double-counts.

Flush trade-off: smaller interval = fresher but more OLAP write amplification; 10 s is the chosen balance. Event-time (not processing-time) keeps counts correct even when ingestion lags.

Deep dive · idempotency / dedup

A single click must count once despite double-clicks, retries, and replay. At impression time the Ad service generates a signed impressionId (HMAC) embedded in the click URL; the Click Processor verifies the signature, then does an atomic SETNX impressionId in Redis — first write wins, duplicates are dropped (the user is still redirected).

sequenceDiagram
    participant U as User
    participant CP as Click Processor
    participant AD as Ad DB
    participant R as Redis
    participant K as Kinesis
    U->>CP: GET /click adId impressionId sig
    CP->>CP: verify signature
    CP->>R: SETNX impressionId
    alt new click
        R-->>CP: stored (1)
        CP->>AD: getRedirectUrl(adId)
        CP->>K: emit click event
        CP-->>U: 302 redirect
    else duplicate
        R-->>CP: exists (0)
        CP-->>U: 302 redirect, drop event
    end
      

Redis TTL bounds memory (10k/s × 600 s ≈ 6M keys — trivial). The batch layer dedups on eventId over the full S3 archive as the durable backstop if a key expired or Redis lost data.

Deep dive · speed layer vs batch reconciliation

The streaming path is fast but approximate (drifts on failures, very-late events, hot-key skew). A batch path recomputes ground truth from the immutable S3 log and fixes the OLAP store.

flowchart LR
    K["Kinesis Stream"] --> SPD["Speed layer: Flink"]
    K --> S3[("S3 raw archive 7d")]
    SPD -->|"approx, realtime"| OLAP[("OLAP serving")]
    S3 --> BAT["Batch layer: Spark daily"]
    BAT --> RW["Reconciliation Worker"]
    RW -->|"exact, corrected"| OLAP
    OLAP --> DASH["Advertiser Dashboard"]
      
Kappa (stream-only) Lambda (chosen)
Idea One continuous stream Batch (historical) + speed (real-time)
Pros Simpler, one code path Highest accuracy; real-time + correct
Cons Correctness rides entirely on the stream Two pipelines to maintain

Billing-grade integrity demands the reconciling batch layer: real-time numbers appear in seconds; the daily Spark job overwrites any incorrect (adId, minute) rows so the dashboard self-heals. Late/out-of-order events are absorbed by watermarks + allowed lateness; the long tail is swept up by reconciliation.

Deep dive · hot-ad partitioning

A viral ad can exceed a single shard's 1,000 rec/s and create a hot key that overwhelms one Flink task. For hot ads only, salt the key adId:0..N to spread load across N shards/tasks, then sum the partial counts.

flowchart TD
    HOT["Hot ad adId=42 (viral)"] --> SUF["Append suffix 0..N to key"]
    SUF --> S0["shard key 42:0"]
    SUF --> S1["shard key 42:1"]
    SUF --> S2["shard key 42:N"]
    S0 --> F0["Flink task 0"]
    S1 --> F1["Flink task 1"]
    S2 --> F2["Flink task N"]
    F0 --> M["Merge partial counts"]
    F1 --> M
    F2 --> M
    M --> OLAP[("OLAP: adId=42 total")]
      

Apply salting selectively (only detected hot keys) to avoid multiplying shard count for cold ads; Kinesis resharding + Flink rescaling handle sustained growth.

Key decisions recap

302 not 301; signed impressionId + Redis SETNX (+ batch dedup on eventId); Kinesis sharded by adId with adId:0..N salting; Flink 1-min tumbling / 10-s flush / event-time / exactly-once via checkpoints + idempotent upserts; OLAP keyed on (adId, minute); hybrid Lambda for real-time and accurate.