Fundamentals
Kafka Deep Dive
Kafka is a distributed, partitioned, replicated append-only commit log — both a durable message queue and a real-time event stream. The default answer whenever a design needs async processing, ordered processing, decoupling, or stream processing. Almost every "deep dive" reduces to questions about that log: how it scales, survives failure, retries, performs, and gets cleaned up.
When to use Kafka
Two umbrella reasons. If your problem matches one, name Kafka and move on.
| Need a queue | Need a stream |
|---|---|
| Async processing — YouTube transcoding | Process lots of data in real time — Ad Click Aggregator |
| In-order processing — Ticketmaster waiting queue | Many consumers read the same stream — Messenger, FB Live comments |
| Decouple producer/consumer to scale independently — LeetCode judge | Replayable continuous history |
When NOT to: tiny scale (a DB-backed queue is simpler), strict request/response RPC, per-message priorities or TTL/delays (use RabbitMQ/SQS), or millions of distinct queues (Kafka scales with partitions, not topics).
Core concepts
| Term | Definition | Nuance |
|---|---|---|
| Broker | Servers that hold the log | Each broker is leader for some partitions, follower for others |
| Partition | An ordered, immutable, append-only log | The unit of parallelism, ordering & replication; order only within a partition |
| Topic | A logical grouping of partitions | Name + partition count + config; repartitioning later is disruptive |
| Producer | Writes records to topics |
Picks partition by key hash; controls durability via
acks
|
| Consumer | Reads records from topics | Pulls; tracks an offset per partition; joins a group |
A record = key, value, timestamp, headers + broker-assigned partition & offset. The key drives partition assignment and therefore ordering and compaction.
Architecture at a glance
Producers append records into a topic's partitions; a consumer group spreads those partitions across its consumers so each partition is owned by exactly one consumer in the group.
flowchart LR
P1["Producer 1"] --> T
P2["Producer 2"] --> T
P3["Producer 3"] --> T
subgraph T["Topic"]
PA["Partition 0"]
PB["Partition 1"]
end
subgraph CG["Consumer Group"]
C3["Consumer A"]
C4["Consumer B"]
end
PA --> C3
PB --> C4
- Within a group, each partition → exactly one consumer; max useful parallelism = partition count (extra consumers idle).
- Different groups read independently with their own offsets — that's pub/sub fan-out.
Partitioning & ordering
Partitions are simultaneously the unit of
parallelism, ordering, and
replication. A producer picks a partition by:
explicit partition → else hash(key) mod n (same key ⇒
same partition ⇒ ordered) → else round-robin.
Ordering guarantee: total order within a partition,
none across partitions. To preserve per-entity order,
key by that entity (e.g. by gameId). You
can increase but not decrease partitions, and
increasing rehashes keys (breaks per-key order) — so right-size early.
More partitions also mean longer rebalances and more leader elections.
Consumer groups & rebalancing
The consumer group is both the unit of scaling and
the queue-vs-pubsub switch: consumers sharing a
group.id divide partitions (competing consumers); each
extra group gets its own copy of the stream (fan-out).
- Rebalancing on membership/partition change. Eager assignors stop the world; cooperative/incremental (modern default) move only what must move. Static membership avoids rebalances on brief restarts.
-
Liveness: heartbeats + must call
poll()withinmax.poll.interval.msor be presumed dead (a classic trap). - Pull model: consumers fetch batches at their own pace; slow consumers fall behind gracefully (bounded by retention).
Offsets & delivery semantics
An offset is the consumer's bookmark. The order of {process, commit} determines semantics:
| Semantic | How | Failure outcome |
|---|---|---|
| At-most-once | Commit before processing | Crash after commit → message lost |
| At-least-once (default) | Process then commit | Crash before commit → redelivered (dup) |
| Exactly-once | Idempotent producer + transactions | No loss, no dupes |
Most systems run at-least-once + idempotent consumers (dedupe by id / upsert by key). True exactly-once is only "free" Kafka→Kafka (Streams/transactions); external sinks still need idempotent writes.
sequenceDiagram
participant P as Producer
participant B as Broker Leader
participant R as Follower ISR
participant C as Consumer
P->>B: Send record PID seq n
B->>R: Replicate
R-->>B: Ack
B-->>P: Ack committed
Note over B,R: Dedup by PID and seq
C->>B: Fetch from offset
B-->>C: Records
C->>C: Process
C->>B: Commit offset
Replication, leader/follower & ISR
Each partition has R replicas (typically 3) on different brokers. One leader handles reads + writes; followers fetch from it.
flowchart TD
Prod["Producer"] --> L["Leader Broker 1"]
subgraph ISR["In-Sync Replicas"]
L --> F1["Follower Broker 2"]
L --> F2["Follower Broker 3"]
end
L --> Cons["Consumer"]
-
ISR = replicas caught up to the leader. With
acks=all, a write is committed only after all ISR persist it → never lost while one ISR survives. -
Leader failure → controller elects a new leader
from the ISR (no committed data lost). Keep
unclean.leader.election=falseto avoid electing a stale replica. - Consumers only read up to the high-water mark (replicated to all ISR).
Durable-by-default config to recite
replication.factor=3,
min.insync.replicas=2, acks=all,
enable.idempotence=true,
unclean.leader.election.enable=false. Survives one
broker failure with zero data loss. Modern Kafka replaces ZooKeeper
with KRaft.
Retention vs log compaction
Kafka decouples consumption from deletion via the topic's cleanup policy:
-
delete(time/size): keep records forretention.msthen drop oldest segments. Enables replay — re-read the window for reprocessing/backfills. -
compact: retain at least the latest value per key (older values GC'd) — turns a topic into a changelog / latest-state store; a null value is a tombstone.
Use delete when you care about the event-history window; compact when you care about the current value per key.
Worked use-cases
Async queue — YouTube transcoding
Upload returns fast; raw video lands in S3; Kafka buffers transcode jobs; a worker fleet consumes in parallel and writes renditions back. Kafka decouples upload from slow, bursty transcoding.
flowchart LR
Client["Client"] --> Upload["Upload Server"]
Upload --> S3in["S3 raw video"]
Upload --> K["Kafka topic"]
K --> T1["Transcoder 1"]
K --> T2["Transcoder 2"]
T1 --> S3out["S3 transcoded"]
T2 --> S3out
-
In-order — Ticketmaster waiting queue keyed by
eventIdfor fair arrival order. - Real-time stream — Ad Click Aggregator: click stream → Flink windows/aggregates → DB; retention gives replay.
- Fan-out — one comment stream read by many independent consumer groups (moderation, ranking, notifications).
Common pitfalls
- Hot partitions / key skew — a low-cardinality key overloads one consumer. Use higher-cardinality / composite / salted keys.
- Rebalancing storms — frequent join/leave stalls consumption. Use cooperative assignor + static membership + tuned timeouts.
- Poison pills — a bad record crash-loops a partition (head-of-line blocking). Use try/catch + DLQ + retry topics.
- Consumer lag blowups — if lag exceeds retention, data is dropped before it's read. Monitor lag; scale consumers (≤ partition count).
-
acks=1+ leader failure — silent data loss. Useacks=all+min.insync.replicas≥2. - Large messages — Kafka isn't a blob store; keep payloads small, store blobs in S3 and pass a pointer.