Fundamentals

Kafka Deep Dive

Kafka is a distributed, partitioned, replicated append-only commit log that serves as both a durable message queue and a real-time event stream. It is the default answer whenever a design needs async processing, ordered processing, decoupling, or stream processing. Almost every "deep dive" reduces to questions about that log: how it scales, survives failure, handles retries, performs, and gets cleaned up.

When to use Kafka

Two umbrella reasons. If your problem matches one, name Kafka and move on.

Need a queue	Need a stream
Async processing — YouTube transcoding	Process lots of data in real time — Ad Click Aggregator
In-order processing — Ticketmaster waiting queue	Many consumers read the same stream — Messenger, FB Live comments
Decouple producer/consumer to scale independently — LeetCode judge	Replayable continuous history

When NOT to: tiny scale (a DB-backed queue is simpler), strict request/response RPC, per-message priorities or TTL/delays (use RabbitMQ/SQS), or millions of distinct queues (every topic needs at least one partition, and cluster overhead grows with the total partition count).

Core concepts

Term	Definition	Nuance
Broker	A server that holds part of the log	Each broker is leader for some partitions, follower for others
Partition	An ordered, immutable, append-only log	The unit of parallelism, ordering & replication; order only within a partition
Topic	A logical grouping of partitions	Name + partition count + config; repartitioning later is disruptive
Producer	Writes records to topics	Picks partition by key hash; controls durability via `acks`
Consumer	Reads records from topics	Pulls; tracks an offset per partition; joins a group

A record = key, value, timestamp, headers + broker-assigned partition & offset. The key drives partition assignment and therefore ordering and compaction.

Architecture at a glance

Producers append records into a topic's partitions; a consumer group spreads those partitions across its consumers so each partition is owned by exactly one consumer in the group.

flowchart LR
    P1["Producer 1"] --> T
    P2["Producer 2"] --> T
    P3["Producer 3"] --> T
    subgraph T["Topic"]
        PA["Partition 0"]
        PB["Partition 1"]
    end
    subgraph CG["Consumer Group"]
        C3["Consumer A"]
        C4["Consumer B"]
    end
    PA --> C3
    PB --> C4

Within a group, each partition → exactly one consumer; max useful parallelism = partition count (extra consumers idle).
Different groups read independently with their own offsets — that's pub/sub fan-out.
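
The one-owner-per-partition rule can be sketched in a few lines. This is a toy round-robin assignment, not Kafka's actual assignor code; `assign_round_robin` and the consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give every partition exactly one owner within a single consumer group."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# 4 partitions, 2 consumers: each consumer owns two partitions.
two_consumers = assign_round_robin(range(4), ["A", "B"])

# 2 partitions, 3 consumers: the third consumer has nothing to read.
extra_consumer = assign_round_robin(range(2), ["A", "B", "C"])

print(two_consumers)   # {'A': [0, 2], 'B': [1, 3]}
print(extra_consumer)  # {'A': [0], 'B': [1], 'C': []}
```

Adding consumers past the partition count only produces empty assignments, which is why the partition count caps a group's parallelism.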

Partitioning & ordering

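Partitions are simultaneously the unit of parallelism, ordering, and replication. A producer picks a partition in this order: an explicit partition if one is given; else hash(key) mod n (same key ⇒ same partition ⇒ ordered); else, for keyless records, it spreads them across partitions (round-robin in older clients, sticky per-batch in newer ones).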

Ordering guarantee: total order within a partition, none across partitions. To preserve per-entity order, key by that entity (e.g. by gameId). You can increase but not decrease a topic's partition count, and increasing it changes hash(key) mod n, so new records for a key may land in a different partition than its older records (breaking per-key order). Right-size early. More partitions also mean longer rebalances and more leader elections.
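
A minimal sketch of the selection order described above. It assumes `zlib.crc32` as a stand-in hash (the Java client hashes the serialized key with murmur2) and plain round-robin for keyless records; `choose_partition` is a hypothetical helper, not a client API.

```python
import itertools
import zlib

_keyless_counter = itertools.count()


def choose_partition(num_partitions, key=None, partition=None):
    """Explicit partition, else hash(key) mod n, else round-robin."""
    if partition is not None:
        return partition
    if key is not None:
        return zlib.crc32(key) % num_partitions  # stand-in for murmur2
    return next(_keyless_counter) % num_partitions


# Same key, same partition count: always the same partition.
stable = choose_partition(6, key=b"game-42") == choose_partition(6, key=b"game-42")

# Growing the topic from 6 to 8 partitions remaps many keys.
keys = [f"game-{i}".encode() for i in range(100)]
moved = [k for k in keys if choose_partition(6, key=k) != choose_partition(8, key=k)]
print(stable, len(moved))
```

Keys that appear in `moved` would have older records in one partition and newer records in another, which is the per-key ordering break to avoid by sizing the topic up front.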

Consumer groups & rebalancing

The consumer group is both the unit of scaling and the queue-vs-pubsub switch: consumers sharing a group.id divide partitions (competing consumers); each extra group gets its own copy of the stream (fan-out).

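Rebalancing: triggered by membership or partition changes. Eager assignors stop the world; cooperative/incremental assignors move only the partitions that must move. Static membership (group.instance.id) avoids rebalances on brief restarts.
Liveness: a consumer must keep heartbeating (session.timeout.ms) and must call poll() again within max.poll.interval.ms, or it is presumed dead and its partitions are reassigned. Slow per-batch processing is the classic trap.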
Pull model: consumers fetch batches at their own pace; slow consumers fall behind gracefully (bounded by retention).

Offsets & delivery semantics

An offset is the consumer's bookmark. The order of {process, commit} determines semantics:

Semantic	How	Failure outcome
At-most-once	Commit before processing	Crash after commit → message lost
At-least-once (default)	Process then commit	Crash before commit → redelivered (dup)
Exactly-once	Idempotent producer + transactions	No loss, no dupes

Most systems run at-least-once + idempotent consumers (dedupe by id / upsert by key). True exactly-once is only "free" Kafka→Kafka (Streams/transactions); external sinks still need idempotent writes.
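
To make the process/commit ordering concrete, here is a small in-memory simulation (no Kafka client involved). The `consume` function, the `crash_at` parameter, and the use of the record value as a dedupe id are all illustrative assumptions.

```python
def consume(log, committed, commit_first, crash_at=None):
    """Run one consumer session from the committed offset.

    crash_at is the offset at which the consumer dies between its two steps.
    Returns (records processed in this session, committed offset).
    """
    processed = []
    offset = committed
    while offset < len(log):
        if commit_first:
            committed = offset + 1            # commit, then process
            if offset == crash_at:
                return processed, committed   # died before processing
            processed.append(log[offset])
        else:
            processed.append(log[offset])     # process, then commit
            if offset == crash_at:
                return processed, committed   # died before committing
            committed = offset + 1
        offset += 1
    return processed, committed


def run_with_restart(log, commit_first):
    """Crash once at offset 1, then restart from the committed offset."""
    first, committed = consume(log, 0, commit_first, crash_at=1)
    second, _ = consume(log, committed, commit_first)
    return first + second


log = ["a", "b", "c"]
at_most_once = run_with_restart(log, commit_first=True)    # ['a', 'c']
at_least_once = run_with_restart(log, commit_first=False)  # ['a', 'b', 'b', 'c']

# An idempotent consumer drops the redelivery (record value used as the id here).
deduped = list(dict.fromkeys(at_least_once))               # ['a', 'b', 'c']
```

The only difference between the two runs is which step comes first, yet one loses "b" and the other sees it twice.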

sequenceDiagram
    participant P as Producer
    participant B as Broker Leader
    participant R as Follower ISR
    participant C as Consumer
    P->>B: Send record PID seq n
    B->>R: Replicate
    R-->>B: Ack
    B-->>P: Ack committed
    Note over B,R: Dedup by PID and seq
    C->>B: Fetch from offset
    B-->>C: Records
    C->>C: Process
    C->>B: Commit offset
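
The "Dedup by PID and seq" note in the diagram can be sketched as follows. This is a simplified model that keeps one last-seen sequence number per producer id; real brokers track more state (for example producer epochs), and `PartitionLog` is a hypothetical class.

```python
class PartitionLog:
    """Toy partition leader that rejects retried sends from an idempotent producer."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> last sequence number appended

    def append(self, pid, seq, value):
        last = self.last_seq.get(pid, -1)
        if seq <= last:
            return "duplicate"      # a retry of something already appended
        if seq != last + 1:
            return "out_of_order"   # a gap: an earlier send is missing
        self.records.append(value)
        self.last_seq[pid] = seq
        return "appended"


partition = PartitionLog()
results = [
    partition.append(pid=7, seq=0, value="x"),
    partition.append(pid=7, seq=1, value="y"),
    partition.append(pid=7, seq=1, value="y"),  # producer retried after a lost ack
]
print(results)            # ['appended', 'appended', 'duplicate']
print(partition.records)  # ['x', 'y']
```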

Replication, leader/follower & ISR

Each partition has R replicas (typically 3) on different brokers. By default one leader handles all reads and writes; followers fetch from it to stay in sync.

flowchart TD
    Prod["Producer"] --> L["Leader Broker 1"]
    subgraph ISR["In-Sync Replicas"]
        L --> F1["Follower Broker 2"]
        L --> F2["Follower Broker 3"]
    end
    L --> Cons["Consumer"]

ISR = replicas caught up to the leader. With acks=all, a write is committed only after all ISR persist it → never lost while one ISR survives.
Leader failure → controller elects a new leader from the ISR (no committed data lost). Keep unclean.leader.election.enable=false to avoid electing a stale replica.
Consumers only read up to the high-water mark (replicated to all ISR).
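
One way to picture the high-water mark is as the smallest log end offset among the in-sync replicas. The sketch below assumes that simplified model; the broker names and offsets are made up for illustration.

```python
def high_water_mark(log_end_offsets, isr):
    """Offsets below the returned value are replicated to every in-sync replica."""
    return min(log_end_offsets[replica] for replica in isr)


# Log end offset = the next offset each replica would write.
log_end_offsets = {"broker1": 10, "broker2": 8, "broker3": 5}

hwm_full_isr = high_water_mark(log_end_offsets, {"broker1", "broker2", "broker3"})
hwm_shrunk_isr = high_water_mark(log_end_offsets, {"broker1", "broker2"})

print(hwm_full_isr)    # 5: consumers can read offsets 0..4
print(hwm_shrunk_isr)  # 8: the lagging broker3 has dropped out of the ISR
```

A slow follower holds the high-water mark back only while it remains in the ISR; once it is removed, committed progress follows the remaining replicas.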

Durable-by-default config to recite

replication.factor=3, min.insync.replicas=2, acks=all, enable.idempotence=true, unclean.leader.election.enable=false. Survives one broker failure with zero data loss. Modern Kafka replaces ZooKeeper with KRaft.
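
As a rough sketch, the settings above can be grouped by where they are applied, with the failure budget worked out from them. The dictionary layout and the `brokers_that_can_fail` helper are illustrative, not a client or admin API.

```python
# Set on the topic (or as broker-level defaults).
topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
    "unclean.leader.election.enable": False,
}

# Set on the producer.
producer_config = {
    "acks": "all",
    "enable.idempotence": True,
}


def brokers_that_can_fail(topic):
    """Replicas that can be down while acks=all writes are still accepted."""
    return topic["replication.factor"] - topic["min.insync.replicas"]


print(brokers_that_can_fail(topic_config))  # 1
```

With 3 replicas and a minimum of 2 in sync, one broker can be lost and writes continue; losing a second makes acks=all producers fail rather than accept under-replicated writes.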

Retention vs log compaction

Kafka decouples consumption from deletion via the topic's cleanup policy:

delete (time/size): keep records for retention.ms then drop oldest segments. Enables replay — re-read the window for reprocessing/backfills.
compact: retain at least the latest value per key (older values GC'd) — turns a topic into a changelog / latest-state store; a null value is a tombstone.

Use delete when you care about the event-history window; compact when you care about the current value per key.
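
The two policies can be contrasted on a tiny in-memory log. This is a simplification: Kafka deletes whole segments rather than single records and keeps tombstones for a while before removing them. Records here are hypothetical `(timestamp_ms, key, value)` tuples.

```python
def apply_delete(records, now_ms, retention_ms):
    """Time-based retention: drop everything older than the retention window."""
    return [r for r in records if now_ms - r[0] <= retention_ms]


def apply_compact(records):
    """Compaction: keep the latest record per key; a None value deletes the key."""
    latest_index = {}
    for index, (_, key, _) in enumerate(records):
        latest_index[key] = index
    survivors = [records[i] for i in sorted(latest_index.values())]
    return [r for r in survivors if r[2] is not None]


records = [
    (0, "u1", "a"),
    (10, "u2", "b"),
    (20, "u1", "c"),
    (30, "u2", None),  # tombstone for u2
]

windowed = apply_delete(records, now_ms=35, retention_ms=20)
compacted = apply_compact(records)

print(windowed)   # [(20, 'u1', 'c'), (30, 'u2', None)]
print(compacted)  # [(20, 'u1', 'c')]
```

The delete policy keeps a recent slice of history regardless of key, while compaction keeps one current value per key regardless of age.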

Worked use-cases

Async queue — YouTube transcoding

Upload returns fast; raw video lands in S3; Kafka buffers transcode jobs; a worker fleet consumes in parallel and writes renditions back. Kafka decouples upload from slow, bursty transcoding.

flowchart LR
    Client["Client"] --> Upload["Upload Server"]
    Upload --> S3in["S3 raw video"]
    Upload --> K["Kafka topic"]
    K --> T1["Transcoder 1"]
    K --> T2["Transcoder 2"]
    T1 --> S3out["S3 transcoded"]
    T2 --> S3out

In-order — Ticketmaster waiting queue keyed by eventId for fair arrival order.
Real-time stream — Ad Click Aggregator: click stream → Flink windows/aggregates → DB; retention gives replay.
Fan-out — one comment stream read by many independent consumer groups (moderation, ranking, notifications).

Common pitfalls

Hot partitions / key skew — a low-cardinality key overloads one consumer. Use higher-cardinality / composite / salted keys.
Rebalancing storms — frequent join/leave stalls consumption. Use cooperative assignor + static membership + tuned timeouts.
Poison pills — a bad record crash-loops a partition (head-of-line blocking). Use try/catch + DLQ + retry topics.
Consumer lag blowups — if lag exceeds retention, data is dropped before it's read. Monitor lag; scale consumers (≤ partition count).
acks=1 + leader failure — silent data loss. Use acks=all + min.insync.replicas≥2.
Large messages — Kafka isn't a blob store; keep payloads small, store blobs in S3 and pass a pointer.
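
For the poison-pill case, a minimal sketch of the try/catch + retry + DLQ pattern might look like this. It runs against a plain list instead of a real topic; `consume_with_dlq` and the in-memory dead-letter list are assumptions for illustration.

```python
def consume_with_dlq(records, handler, max_attempts=3):
    """Process records in order; park a record after max_attempts failures."""
    done, dead_letters = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                done.append(handler(record))
                break
            except Exception as err:
                if attempt == max_attempts:
                    # Give up on this record so the partition keeps moving.
                    dead_letters.append((record, repr(err)))
    return done, dead_letters


done, dead_letters = consume_with_dlq(["1", "2", "oops", "4"], int)

print(done)                               # [1, 2, 4]
print([rec for rec, _ in dead_letters])   # ['oops']
```

The bad record is set aside with its error instead of crash-looping the consumer, so the records behind it are still processed.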