DARK MODE

Apache Kafka

// event streaming · partitions · consumer groups · delivery semantics · stream processing · senior → principal

Overview
Deep Dive
Q & A
Scenarios
Core Concepts
📡 What Kafka Is
A distributed, append-only log used as a high-throughput, durable message bus. Producers write events; consumers read them at their own pace. Events are persisted to disk — a slow or offline consumer doesn't lose data, it just has lag. Kafka is optimized for sequential disk I/O, which is why it outperforms in-memory queues at scale.
event streaming durable high-throughput distributed log
📦 Topics, Partitions & Offsets
A topic is a named channel (e.g. payments). Each topic is split into partitions — ordered, immutable sequences of records. Each record has an offset, its unique position within a partition. Ordering is guaranteed within a partition; none across partitions. Consumers track their own offset — they can replay from any point.
Topic Partition 0..N Offset 0..∞
👥 Consumer Groups
Consumers in the same group share the work — each partition is consumed by exactly one consumer in the group. More consumers than partitions = idle consumers. Different groups each get all messages independently — fan-out is free. Max parallelism per group = partition count.
Partition 0 Consumer A
Partition 1 Consumer B
🏗️ Brokers, Leaders & Replication
A Kafka cluster is a set of brokers. Each partition has one leader (handles all reads/writes) and zero or more followers (replicas). The ISR (In-Sync Replicas) is the set of replicas caught up with the leader. If the leader fails, a follower in the ISR is elected. Replication factor controls durability — typically 3 in production.
RF=3 tolerates 2 broker failures
✅ Delivery Guarantees
At-most-once — send and forget. May lose messages, never duplicates. Fastest. At-least-once — retry on failure. No data loss, but duplicates possible. Default for most systems. Exactly-once (EOS) — idempotent producer + transactional API. No loss, no duplicates. Highest cost. Covers Kafka-to-Kafka only; your sink must also be idempotent for true end-to-end EOS.
at-most-once at-least-once exactly-once (EOS)
🔁 Log Retention vs Log Compaction
Time/size retention — delete records older than retention.ms or beyond retention.bytes. Use when you care about the event stream up to a time window. Log compaction — retain only the latest value per key. Tombstone (null-value) records signal deletion. Use for changelog topics and materializing current state.
compaction = KV store retention = time-bounded log
🔌 Kafka Connect
A framework for moving data between Kafka and external systems without writing producer/consumer code. Source connectors pull data in (databases via CDC, S3, APIs); sink connectors push data out (Elasticsearch, JDBC, S3). Connectors run as distributed workers and handle parallelism and fault tolerance. Use Connect before writing a custom consumer for standard integrations.
🌊 Kafka Streams & ksqlDB
Kafka Streams is a Java library for stateful stream processing — aggregations, joins, windowing — embedded in your application with no external cluster. ksqlDB is a SQL layer on top for simpler transformations. Both materialize state in local RocksDB stores backed by Kafka changelog topics for fault tolerance. Use for real-time aggregations, enrichment, and event-driven pipelines. Not for heavy ML inference or complex batch workloads — use Flink or Spark there.
Gotchas & Failure Modes
Ordering is per-partition only Cross-partition ordering is your domain's responsibility. If you need global order, you need one partition — which caps throughput at one consumer and one broker. Most systems that think they need global order actually need per-entity order (model it as a key).
Rebalancing pauses all consumers in the group Eager rebalancing (default before 2.4) stops all consumers, drops partitions, then reassigns everything. A group of 50 consumers rebalancing takes seconds of downtime. Migrate to CooperativeStickyAssignor. Monitor rebalance frequency — frequent rebalancing is a consumer stability problem, not a Kafka problem.
Hot partitions from bad key choice Choosing a low-cardinality key (e.g. country code for a US-heavy workload) concentrates load on a few partitions. One broker becomes the bottleneck. Diagnose with broker-level disk and network metrics. Fix by adding entropy to the key or using a custom partitioner.
max.poll.interval.ms thrashing If processing a batch takes longer than this timeout, the broker kicks the consumer out and rebalances. The consumer rejoins, reprocesses from last committed offset, hits the timeout again — a rebalancing loop with no progress. Fix by reducing max.poll.records, optimizing processing, or moving slow work off the poll loop.
auto.offset.reset=latest is a silent data loss footgun A new consumer group with latest misses all historical data and starts only from the moment it's deployed. Teams often don't notice until days later. For any consumer that needs a complete view, use earliest. Always verify committed offsets exist before deploying new consumer groups to production.
Exactly-once at the consumer is harder than at the broker Kafka's EOS covers Kafka-to-Kafka flows only. If your consumer writes to a database or calls an HTTP API, you're outside the transaction boundary. True end-to-end EOS requires an idempotent sink — upsert by event ID, deduplication table, or conditional writes.
When to Use / When Not To
✓ Use Kafka When
  • High-throughput event streaming (millions of events/sec)
  • Multiple independent consumers need the same data (fan-out)
  • You need replay — consumers can re-read historical events
  • Decoupling producers from consumers with different processing rates
  • Event sourcing or audit log that must be durable and ordered per entity
  • Stream processing pipelines (aggregation, enrichment, joining streams)
✗ Don't Use Kafka When
  • Request/reply RPC — Kafka is not a synchronous messaging system
  • Low-volume, low-latency task queues — RabbitMQ or SQS are simpler
  • Sub-millisecond latency requirements — broker round-trips add overhead
  • Large payloads (images, videos) — store in S3, put the reference in Kafka
  • You need per-message TTL or priority queues — Kafka supports neither natively
Quick Reference & Comparisons
⚙️ Producer Config Reference
acks0 = none, 1 = leader, all = full ISR. Use `all` for durability.
enable.idempotencetrue — broker deduplicates retries using PID + sequence number.
retriesSet high with idempotence. Default MAX_INT in modern clients.
linger.msWait before flushing a batch. Higher = better throughput, more latency.
batch.sizeMax bytes per batch per partition. Larger batches = better compression.
compression.typenone / gzip / snappy / lz4 / zstd. snappy or lz4 for throughput; zstd for ratio.
max.in.flight.requestsKeep ≤5 with idempotence. >1 without idempotence risks reordering on retry.
buffer.memoryTotal bytes the producer can buffer before blocking. Tune for burst traffic.
🔄 Consumer Config Reference
auto.offset.resetearliest (full replay) / latest (from now) / none (throw if no offset). Set intentionally.
enable.auto.commitfalse for correctness. Commit manually after processing.
max.poll.interval.msMax time between poll() calls. Processing must complete within this window.
max.poll.recordsMax records per poll. Tune down if processing is slow.
session.timeout.msMax time without heartbeat before broker considers consumer dead.
heartbeat.interval.msHow often the consumer sends a heartbeat. ~1/3 of session timeout.
isolation.levelread_committed — only see committed transactions. Required for EOS consumers.
🏗️ Broker / Topic Config Reference
replication.factor3 in production. 1 = no fault tolerance.
min.insync.replicas2 with RF=3 is the standard safety setting.
retention.msHow long to keep messages. Default 7 days. Set per topic.
cleanup.policydelete (time/size retention) or compact (keep latest per key).
unclean.leader.electionfalse (default). Never elect an out-of-ISR replica — avoids data loss.
replica.lag.time.max.msHow long before a straggling replica is dropped from ISR.
💻 CLI Commands
Topic management
kafka-topics.sh --create --topic my-topic --partitions 12 --replication-factor 3 --bootstrap-server broker:9092 kafka-topics.sh --describe --topic my-topic --bootstrap-server broker:9092 kafka-topics.sh --alter --topic my-topic --partitions 24 --bootstrap-server broker:9092
Consumer group inspection
kafka-consumer-groups.sh --describe --group my-group --bootstrap-server broker:9092 kafka-consumer-groups.sh --list --bootstrap-server broker:9092
Offset management
kafka-consumer-groups.sh --reset-offsets --group my-group --topic my-topic --to-earliest --execute --bootstrap-server broker:9092
Testing / debugging
kafka-console-producer.sh --topic my-topic --bootstrap-server broker:9092 kafka-console-consumer.sh --topic my-topic --from-beginning --bootstrap-server broker:9092
⚖️ Kafka vs RabbitMQ vs Amazon SQS vs Apache Pulsar
Dimension Apache Kafka RabbitMQ Amazon SQS Apache Pulsar
Throughput Very high (millions/sec) Medium (tens of thousands/sec) Medium (managed, auto-scales) Very high — comparable to Kafka
Message ordering Guaranteed within a partition Per queue (single consumer) No guarantee (FIFO queue: per group ID) Guaranteed within a partition
Retention & replay Configurable days/weeks. Full replay anytime. Consumed = deleted. No replay. Consumed = deleted. Max 14-day retention. Tiered storage — near-unlimited retention.
Consumer model Pull. Consumer groups share partitions. Push or pull. Competing consumers. Pull (long-poll). Competing consumers. Push + pull. Shared, failover, key-shared.
Delivery semantics At-least-once default. Exactly-once via transactions. At-least-once with manual acks. At-least-once. Standard queue can duplicate. At-least-once default. Exactly-once with transactions.
Ops complexity High — cluster management, partition tuning. Medium — easier at smaller scale. Very low — fully managed by AWS. High — BookKeeper adds operational complexity.
Best for Event streaming, audit logs, multi-consumer fan-out. Task queues, work distribution, complex routing. Cloud-native decoupling, serverless triggers. Kafka replacement with tiered storage, multi-tenancy.
Interview Q & A
Senior Engineer — Execution Depth
S-01 How does Kafka guarantee message ordering, and what breaks it? Senior
Kafka guarantees ordering only within a partition — records are appended in order and read in order. A producer sending records with the same key always routes to the same partition (consistent hashing of the key), so all events for a given entity (user, order, account) arrive in order at the consumer. What breaks it: sending without a key causes round-robin routing across partitions, so two records for the same entity can land in different partitions and be consumed out of order. A partition count change after data exists also remaps keys. Sticky partitioning (default since 2.4) batches keyless records to the same partition per batch — this improves throughput, but is not an ordering guarantee.
Cross-partition ordering is your domain's responsibility, not Kafka's. The key design question is: do you actually need global ordering across all entities, or just ordering per entity? Almost always it's the latter — and that's exactly what key-based partitioning gives you. If you think you need global order, push back: a single partition caps your throughput to one consumer and one leader broker. Model the domain so ordering is required only within a natural key boundary.
S-02 What is consumer group rebalancing, when does it happen, and how do you minimize its impact? Senior
A rebalance happens when the group membership changes: a consumer joins or leaves, a consumer times out (max.poll.interval.ms exceeded or missed heartbeat), or partition count changes. During an eager rebalance (default pre-2.4), all consumers stop consuming, drop all partitions, and wait for the group coordinator to reassign — a full stop-the-world pause. Cooperative rebalancing (CooperativeStickyAssignor) only revokes and moves the partitions that need to change, leaving the rest untouched. Minimize impact: use cooperative rebalancing, tune session.timeout.ms and heartbeat.interval.ms appropriately, and ensure max.poll.records is low enough that processing finishes within max.poll.interval.ms.
Frequent rebalancing is a consumer stability problem, not a Kafka problem. At scale, a group of 50 consumers rebalancing takes several seconds — during which consumer lag grows. Design for it: make consumers idempotent (rebalances cause re-processing of uncommitted offsets), and monitor rebalance frequency as a first-class metric. If rebalances are triggered by max.poll.interval.ms, the real fix is reducing batch size or moving slow processing off the poll loop — not raising the timeout.
S-03 What is an idempotent producer, and how does Kafka prevent duplicate messages from retries? Senior
Without idempotence, a producer that retries a failed send may cause the broker to write the same record twice if the first write succeeded but the ack was lost. With enable.idempotence=true, the broker assigns each producer a PID (Producer ID) and tracks a sequence number per partition. If a message arrives with the same PID and sequence number as one already written, the broker silently deduplicates it. This gives exactly-once semantics at the producer-to-broker level. Modern Kafka clients enable idempotence automatically when acks=all and retries is high.
Idempotent producer only covers the producer-to-broker hop. It doesn't help when a consumer processes a message, writes to a database, and crashes before committing its offset — on restart it re-reads and re-processes the same message. Covering that gap requires either Kafka transactions (for Kafka-to-Kafka flows) or an idempotent sink — upsert by event ID, conditional writes, or a deduplication table. Most production systems use at-least-once + idempotent sinks. Kafka transactions add latency and operational complexity that's rarely justified outside Kafka Streams topologies.
S-04 What is ISR, and what happens when a replica falls out of ISR? Senior
ISR (In-Sync Replicas) is the set of replicas that are fully caught up with the partition leader. A replica falls out of ISR when it hasn't fetched from the leader within replica.lag.time.max.ms (default 30s) — typically due to being overloaded, slow disk, or a network partition. min.insync.replicas defines the minimum ISR members that must acknowledge a write before the leader returns success (when acks=all). If ISR drops below min.insync.replicas, the partition becomes unavailable for writes — producers receive NotEnoughReplicasException.
The triangle of replication.factor, min.insync.replicas, and acks=all is how you tune the availability vs. durability dial. Standard production setup: RF=3, min.insync.replicas=2, acks=all — you tolerate one broker failure while still requiring two replicas to ack every write. You lose write availability before you lose data. Setting min.insync.replicas=1 restores availability but loses the durability guarantee. Know this triangle cold — it comes up in every Kafka reliability discussion.
S-05 How do you choose partition count for a topic, and what are the tradeoffs? Senior
More partitions = more consumer parallelism (one consumer per partition per group) and higher throughput, but more overhead: more open file handles on brokers, more memory in clients, longer leader election time. You cannot reduce partition count once set — you can only add. Adding partitions on a live topic remaps key-based routing, breaking ordering for in-flight messages. Rule of thumb: target throughput ÷ throughput per partition. Common starting points: 6–12 for moderate load, 24–48 for high-throughput topics.
Partition count is a capacity decision that also sets your parallelism ceiling. If you have 12 partitions, a consumer group can have at most 12 active members. Over-partitioning has real costs: Confluent benchmarks show significant metadata overhead past ~4000 partitions per broker. If you must repartition a critical live topic, the safest approach is a new topic with a migration period, not in-place resize.
S-06 What is log compaction, how does it work internally, and when do you choose it over time-based retention? Senior
Log compaction retains the latest value for each key and discards older values. A background cleaner thread scans dirty (uncompacted) log segments and rewrites them, keeping only the most recent record per key. A null-value record (tombstone) signals deletion — retained for delete.retention.ms so consumers can observe deletions. Use compaction when you care about current state, not history: entity state changes, user settings, feature flags, CDC events where only the latest row matters.
Compaction has subtleties that matter in production. Compaction doesn't run continuously — it waits until the dirty ratio (log.cleaner.min.dirty.ratio, default 0.5) is reached, so the active segment is never compacted and can still contain duplicates. Tombstones are retained after compaction for delete.retention.ms — consumers must read them or they'll miss deletes during bootstrap. If you're using a compacted topic to rebuild state (Kafka Streams, cache reload), test the bootstrap path explicitly — it's where compaction-based systems quietly lose data.
S-07 What causes consumer lag, and how do you diagnose and address it? Senior
Consumer lag = end offset − committed offset for a partition. It grows when: producer throughput bursts beyond the consumer's processing rate; consumer processing is slow (expensive DB writes, synchronous HTTP calls, GC pauses); rebalancing paused consumption; or a consumer is stuck on a poison-pill message. Monitor lag per consumer group per partition — a single hot partition's lag is invisible in aggregates. Tools: kafka-consumer-groups.sh --describe, Datadog Kafka integration, Burrow.
Lag is a leading indicator — alert on it before it becomes an incident. Response depends on root cause: burst → temporary, monitor for recovery; slow consumer → scale out (up to partition count), optimize processing, or profile the bottleneck; stuck on poison pill → implement a DLQ pattern so the consumer can skip and continue; systematic growth over days → capacity problem. Design your lag SLA per topic: "payments lag must not exceed 30s" is meaningful. Set alert thresholds derived from business impact, not arbitrary numbers.
S-08 What is a Dead Letter Queue in a Kafka context, and how do you implement one correctly? Senior
A DLQ is a separate topic where messages that fail processing after N retries are written, so a bad message doesn't block the main consumer indefinitely. Pattern: consumer reads from payments, processing fails → write to payments.DLQ with headers capturing original topic, partition, offset, error type, retry count, and timestamp → commit the offset on the main topic so the consumer advances. A separate DLQ consumer handles alerting, inspection, and reprocessing. Committing the offset is critical — without it, you loop on the same record forever.
DLQ design deserves the same rigor as the main path. Decisions to make explicitly: which exceptions go to DLQ vs. retry? (Transient network errors should retry; deserialization errors are poison pills that should DLQ immediately.) How long is DLQ retention? Who monitors it and gets alerted? How does reprocessing work? For financial or compliance events, a message silently sitting in DLQ with no action is a data integrity risk, not just technical debt.
S-09 A consumer is being repeatedly kicked out of the group due to max.poll.interval.ms. What causes this and how do you fix it? Senior
max.poll.interval.ms is the maximum time a consumer can go between poll() calls. If processing a batch takes longer than this, the coordinator considers the consumer dead, triggers a rebalance, and reassigns the partition to another consumer — which starts from the last committed offset, picks up the same slow batch, and also times out. The result is a rebalance loop with no progress. Root causes: batch size too large (max.poll.records), individual record processing is expensive, or GC pauses interrupting the poll loop. Don't just raise the timeout — that masks the problem.
This is a design smell. Kafka consumers are designed for fast, lightweight processing. If processing routinely takes seconds per record, the consumer model is wrong. Fix patterns: reduce max.poll.records; move slow downstream I/O to an async worker with careful offset management; or redesign the pipeline — consume fast into an internal queue, let a thread pool process, commit after confirmed completion. If the bottleneck is an external API with rate limits, consider whether a task queue is better suited than Kafka for this consumer.
S-10 Walk through acks=0, acks=1, and acks=all — what you gain and lose at each. Senior
acks=0: producer doesn't wait for any acknowledgment. Highest throughput, lowest latency, zero durability — leader crash means data loss with no visibility. Use only for non-critical metrics or logs where some loss is acceptable. acks=1: leader acknowledges before replicating to followers. Loss is possible if the leader crashes before followers catch up. Reasonable middle ground. acks=all (or -1): all ISR members must acknowledge. Data survives any single broker failure in a RF=3 cluster. Higher latency but correct for any data you cannot lose.
acks=all means "durable as long as ISR is healthy." If ISR shrinks below min.insync.replicas, producers block with NotEnoughReplicasException — this is intended behavior, trading availability for safety. Design your producer error handling for this: should producers block and retry (tolerate backpressure), circuit-break and write to a local buffer, or fail fast to the caller? The answer depends on whether availability or consistency is the stronger requirement for that topic. Make it an explicit decision.
S-11 What does enable.auto.commit=false force you to think about? What are the failure modes of manual offset management? Senior
With auto.commit (default true), offsets are committed on a schedule regardless of whether processing succeeded — you can commit an offset for a record whose processing failed (at-most-once behavior). With manual commit, you commit only after confirming processing succeeded (at-least-once). The failure mode: if you commit before processing completes (commit-before-write), a crash after commit but before write means data loss. If you process then crash before committing (the correct order), you reprocess on restart — duplicates, which your sink must handle idempotently.
Manual offset management is a contract with your sink. Once you commit, you're telling Kafka "this was handled." If your sink write fails after a commit, you have a gap with no retry mechanism. The canonical pattern: write to sink (idempotently) → then commit offset. There's no way to make this transactional without Kafka transactions — and even then, only for Kafka-to-Kafka. Teams that get into trouble here treat offset commits as a progress marker rather than a durability contract.
S-12 What is a Kafka transaction — what does it guarantee, and what does it not cover? Senior
Kafka transactions let a producer write to multiple partitions atomically — either all writes commit or none do. The transactional producer registers a transactional.id, coordinates with a transaction coordinator broker, and uses a two-phase commit. Consumers with isolation.level=read_committed only see committed transaction messages. In a consume-transform-produce pattern (Kafka Streams), the consumer offset commit and producer writes can happen in the same transaction — giving exactly-once for Kafka-to-Kafka flows.
Transactions are expensive and operationally complex. They add latency (coordinator round-trips), and aborted transactions occupy log space until cleanup. More critically: Kafka transactions do not cover external sinks. If your consumer writes to PostgreSQL or calls an HTTP API, that write is not inside the transaction boundary. The practical answer for most systems is at-least-once + idempotent sink — simpler to reason about and operate. Use transactions when you need atomic multi-partition writes and all sinks are Kafka topics.
S-13 What happens when you increase partition count on a live topic? Senior
Adding partitions to a live topic causes: all consumer groups rebalance (brief processing pause); and key-based routing changes — keys that hashed to partition 3 may now go to partition 7. Historical data and new data for the same key are now in different partitions. Any consumer depending on key-based ordering (entity processing, Kafka Streams) will see inconsistency during and after the transition. You cannot decrease partition count — only increase.
The safest approach to repartitioning a critical topic: create a new topic with the target partition count, run a dual-write period (produce to both simultaneously), cutover consumers once they've caught up, then decommission the old topic. For Kafka Streams stateful applications, you must rebuild all state stores after a repartition — plan for that rebuild time in your migration window. Never resize a high-value production topic without a documented rollback plan.
S-14 How does auto.offset.reset interact with consumer groups, and what are the dangerous failure scenarios? Senior
auto.offset.reset only applies when a consumer group has no committed offsets for a partition — either a brand new group, or a group whose committed offsets have expired. earliest: start from offset 0 (full replay). latest: start from the current end (miss all history). none: throw NoOffsetForPartitionException. A group with existing committed offsets ignores this setting entirely. Dangerous scenario: a consumer group's committed offsets expire (default offset retention is 7 days). On restart, the group has no offsets, auto.offset.reset=latest kicks in, and the consumer silently starts from the current end — missing potentially days of events.
auto.offset.reset=latest is the default in many framework setups and is a footgun for teams building event-sourced or stateful systems. For any consumer that builds state from the event history, always use earliest and test the bootstrap path explicitly. Operationally: before deploying a new consumer group to production, verify committed offsets exist with kafka-consumer-groups.sh --describe; if they don't, seed them with --reset-offsets --to-earliest before starting the consumer. Make this a deployment checklist item.
Staff Engineer — Design & Cross-System Thinking
ST-01 Design exactly-once semantics end-to-end from producer through to a relational database sink. Staff
Kafka's EOS only covers Kafka-to-Kafka. For a database sink, the approach is at-least-once + idempotent consumer: the producer uses enable.idempotence=true. The consumer uses manual offset commit. Processing pattern: read record → upsert to database using the Kafka offset or a business key as the idempotency key → commit offset. An upsert (INSERT ON CONFLICT UPDATE) ensures that reprocessing the same record produces the same database state. The database transaction and the Kafka offset commit are separate — there's an at-least-once gap — which is why the upsert must be idempotent.
The cleanest approach beyond upserts is the transactional outbox pattern: the consumer writes to the database in a transaction that also records a processed_event_id. On restart, the consumer queries this table before processing — if the event ID is already there, skip and commit the offset. This moves deduplication into the database layer, which is transactionally consistent. For high-throughput sinks, bloom filters or in-memory deduplication windows can replace the database check for recent events.
ST-02 How do you handle schema evolution across producers and consumers in production without downtime? Staff
Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) to store and version schemas separately from messages. Each message includes a schema ID in its header — consumers fetch the schema once and cache it. Schema evolution is governed by compatibility modes: BACKWARD (new schema can read old data — add optional fields with defaults), FORWARD (old schema can read new data), FULL (both directions). BACKWARD compatibility is the safest default — deploy new consumers before old producers are updated. Never use NONE compatibility in production — it allows breaking changes that corrupt consumers silently.
Schema governance is an organizational problem as much as a technical one. In a system with dozens of producers and consumers, schema changes become a cross-team coordination event. The schema registry enforces compatibility at registration time, but teams still need a process: who approves schema changes, how are breaking changes communicated, and what's the migration path? For complex evolutions (rename a field, change a type), the cleanest pattern is a new topic version with a migration consumer that reads the old schema and re-publishes with the new one — avoiding in-place schema migration entirely.
ST-03 Design a multi-region Kafka topology. What are the options and what breaks first? Staff
Two models. Active-passive — one primary cluster, a replica cluster in a second region via MirrorMaker 2 or Confluent Replicator. Consumers in the secondary region read from the replica. On failover, consumers are redirected to the secondary (offset translation required — offsets don't map 1:1 across clusters). Active-active — both regions produce and consume, with bidirectional replication and a namespace prefix per region. Complex to operate: cycle detection, ordering guarantees across regions are gone, and conflict resolution for shared state is your problem.
What breaks first in multi-region Kafka: offset translation during failover. MirrorMaker 2 maintains offset mapping topics, but failover requires redirecting consumers to translated offsets — operationally complex and easy to get wrong under pressure. Test failover regularly, not just in theory. Active-active's real complexity is ordering: if a user's events are split between regions, rebuilding their state requires merging two ordered streams. For most use cases, active-passive with a well-tested failover runbook is the right answer.
ST-04 What is your Kafka observability strategy — which metrics matter and what do you alert on first? Staff
Three tiers. Broker: under-replicated partitions (non-zero = durability risk, alert immediately), ISR shrink/expand rate, active controller count (must be exactly 1), bytes in/out per broker (detect hotspots), produce and fetch request latency. Consumer: lag per group per partition (primary SLI), rebalance rate, commit rate. Producer: record error rate, request latency, batch size (too small = inefficient). Don't aggregate lag across partitions as your primary alert — one partition with high lag is an incident; ten partitions with low lag is noise.
Lag is a leading indicator; under-replicated partitions are a fire alarm. Build alerting in that order. Add ISR shrink/expand as a broker health signal (frequent shrink = broker instability) and consumer rebalance rate as a leading indicator of consumer instability. For capacity planning, track bytes in/out per broker over 30-day windows — clusters at 60%+ utilization on a single broker are one traffic spike away from a partition leadership cascade. Instrument your consumers to emit processing time and error rate alongside the Kafka client metrics.
ST-05 Your consumer must process Kafka events and call an external HTTP API exactly once. The API isn't idempotent. How do you design this? Staff
You cannot get true exactly-once with a non-idempotent external API — at-least-once delivery means retries are possible. Options: (1) Make the API call idempotent at the call site by including a client-generated idempotency key in the request (UUID derived from the Kafka offset or message key) — if the API supports it, duplicate calls with the same key are rejected or deduplicated. (2) Use an intermediate durable store: consumer writes a "pending" record to a database, a separate worker processes them and calls the API, marks them done. Kafka offset commits are tied to the database write, not the API call. (3) Accept at-least-once and build deduplication into a downstream reconciliation job.
This is a fundamental distributed systems problem: you cannot have exactly-once across two systems without a shared transaction boundary or idempotency. The honest answer is: negotiate idempotency into the API design. If the API is internal, add an idempotency key header and implement deduplication on the server side. If it's a third-party API, document the at-least-once semantics and build a reconciliation process to detect and handle duplicates. "Exactly-once to a non-idempotent API" is not achievable without some form of deduplication — the question is where that deduplication lives and who owns it.
ST-06 Design a retry and DLQ strategy for a consumer processing critical financial events. Staff
Classify errors first: deserialization failures are non-retryable — DLQ immediately. Transient errors (DB unavailable, network timeout) are retryable — use exponential backoff with jitter up to a maximum retry count. Use retry topics for backoff: route failed records to payments.retry-1 (retry after 1m), payments.retry-2 (after 5m), payments.retry-3 (after 30m), then payments.DLQ. Each retry topic is consumed by a worker that waits the appropriate delay before reprocessing. Headers on DLQ records carry: original topic, partition, offset, error cause, attempt count, timestamps.
For financial events, the DLQ is not the end of the road — it's a queue requiring human or automated action. Design the operational path explicitly: who is alerted when records land in DLQ, what's the SLA for investigation, and how does reprocessing work? The retry topic pattern keeps ordering within a key intact during retry — but note that during the backoff window, newer records for the same key on the main topic have already been processed. If ordering across a key's lifecycle is critical, your retry design must account for this gap. Document the ordering guarantee you offer explicitly.
ST-07 Kafka Streams vs Apache Flink vs Spark Streaming — how do you decide? Staff
Kafka Streams: embedded in your Java application, no separate cluster. Best for stateful processing within a single bounded domain — aggregations, enrichments, joins between a few topics. State is local RocksDB, backed by Kafka changelogs. Operationally simple. Limit: large state joins hit RocksDB memory limits; no native SQL. Apache Flink: dedicated cluster, rich stateful streaming with large state support, event-time processing, complex windowing. Best for complex CEP, large state joins, ML inference pipelines. Higher ops overhead. Spark Streaming: micro-batch model (not true streaming). Better for teams with existing Spark infrastructure or mixed batch+streaming workloads. Higher latency.
The decision is as much organizational as technical. Kafka Streams requires no new infrastructure and integrates naturally with existing Kafka expertise — the right choice for 80% of stateful streaming use cases. Reach for Flink when: your state exceeds what Kafka Streams can manage locally, you need complex event time semantics with late data handling, or you're building a data platform that serves multiple teams. Spark Streaming is rarely the right greenfield choice if you can tolerate any latency overhead of micro-batching.
ST-08 When would you NOT use Kafka for a problem that looks like a fit? Staff
Kafka is overkill when: volume is low and a database outbox or lightweight queue (SQS, RabbitMQ) would do — the operational overhead isn't justified for hundreds of events per hour; the pattern is request/reply RPC — Kafka can fake it but it's awkward and slower than HTTP or gRPC; you need per-message TTL or priority queues — Kafka supports neither natively; your team lacks Kafka operational expertise; or sub-millisecond latency is required.
The "use Kafka for everything" instinct is common in organizations that have invested in a cluster. Push back when appropriate. The outbox pattern (write events to a database table, tail with CDC or a poller) often replaces a Kafka topic for intra-service coordination with fewer moving parts. Redis Streams cover low-volume real-time fan-out without a full Kafka cluster. The question to ask: do I need replay, high throughput, or multiple independent consumer groups? If the answer to all three is no, something simpler probably fits. Kafka's value comes from durability, replayability, and fan-out — if you're not using those properties, you're paying the operational cost without the benefit.
ST-09 Walk me through capacity planning for a Kafka cluster. What inputs do you need? Staff
Inputs: peak bytes-in per second per topic (producer throughput), replication factor (disk and network multiplier — RF=3 means 3x write bytes on network), retention period (disk = bytes/s × retention_seconds × RF), number of partitions (drives file handles, memory, controller load), and number of consumer groups (each group consumes the full topic bytes — adds to network egress). Disk per broker = (total bytes/s × RF × retention_seconds) / broker_count. CPU is rarely the bottleneck; disk I/O and network are. Size for peak × 1.5–2x headroom.
The tricky part is accounting for all consumer groups — teams add them informally and each one multiplies your egress. Establish a catalog of consumer groups per topic as part of your capacity model. Also model for the worst case: a consumer group that falls behind and catches up at maximum speed will consume your full retained history in a burst — that burst throughput must fit within your broker's disk read and network capacity. Run this calculation for your highest-retention, highest-throughput topics to surface wrong capacity assumptions.
ST-10 One broker in your cluster is consistently hot while others are idle. How do you diagnose and fix it without downtime? Staff
Diagnose: check which partitions have their leader on the hot broker. Check if the broker holds a disproportionate number of high-throughput partition leaders — this happens when partition leaders aren't rebalanced after broker restarts. Check disk I/O, network, and CPU independently. Fix for leader imbalance: kafka-leader-election.sh --election-type preferred — triggers preferred leader election, redistributing leaders to their original assignment without moving data. For persistent imbalance, use kafka-reassign-partitions.sh to move partition replicas to underutilized brokers.
Preferred leader election is the zero-risk fix when the hot broker holds too many leaders. Partition reassignment (moving replicas) requires data replication — generates significant network traffic and can itself become a performance event. Throttle reassignment with --throttle to avoid impacting producers and consumers. Longer-term: enable auto.leader.rebalance so leaders redistribute automatically after broker restarts. The root cause of persistent hot brokers is usually initial topic creation without considering broker distribution — address this in your topic provisioning process.
ST-11 How do you implement multi-tenancy on a shared Kafka cluster — and where does it break down at org scale? Staff
Multi-tenancy relies on: topic naming conventions (team prefix enforced by policy or ACL), ACLs (per-topic read/write grants to service accounts), and quota enforcement (producer/consumer byte rate per client-id to prevent one tenant from saturating the cluster). Kafka's ACL system is coarse — no namespace isolation or resource group concept. Quota enforcement at the client-id level is the main protection against noisy neighbors.
Multi-tenancy breaks down at org scale in predictable ways: ACL management becomes unmanageable without automation; quota tuning is a constant negotiation; a single cluster means every cluster upgrade and incident blast radius affects all tenants. The mature answer at large scale is cluster-per-team or cluster-per-domain. Multi-tenancy on a shared cluster is a reasonable starting point — define the scale at which you'll federate to multiple clusters before you hit that wall.
ST-12 Describe a production-grade Kafka security model — authentication, authorization, and encryption. Staff
Authentication: SASL/SCRAM-SHA-512 for username/password (stored hashed), or mTLS for mutual certificate-based authentication. mTLS is stronger but requires PKI management. SASL/PLAIN must never be used in production (credentials in plaintext). Authorization: ACLs via kafka-acls.sh grant DESCRIBE, READ, WRITE, CREATE per principal per topic/group. More fine-grained: Confluent RBAC or Ranger. Encryption: TLS (SSL listener) for data in transit between clients and brokers. Encryption at rest is handled at the disk/OS layer, not by Kafka natively.
Security configuration is easy to get wrong and hard to audit. Common gaps: brokers with a plaintext listener alongside TLS — clients that don't specify TLS silently fall back. Require TLS-only listeners and remove plaintext in production. ACL sprawl: teams create service accounts with wildcard permissions to avoid ACL provisioning friction — enforce least-privilege via automation. Certificate rotation: mTLS requires a certificate rotation process; design for it from day one or you'll have an expiry incident. Kafka has no built-in audit log for data access — ship access logs to your SIEM if you have compliance requirements.
Principal Engineer — Architecture & Org-Scale Thinking
P-01 How do you evaluate whether to run self-managed Kafka vs. migrate to a managed service (Confluent Cloud, MSK, Aiven)? Principal
Self-managed Kafka: full control over config, no per-GB egress pricing, data stays in your infrastructure. Cost is engineering time — cluster provisioning, upgrades, broker failure recovery, monitoring, and on-call burden (0.5–2 FTE depending on cluster count and size). Managed services shift operational burden to the vendor. Confluent Cloud is the most feature-complete (RBAC, ksqlDB, Connectors, Schema Registry, multi-region replication). MSK has lower egress within AWS, tighter IAM integration, lower baseline cost. Evaluate on: total cost (compute + storage + egress + engineering time), compliance requirements (data residency, certifications), and team's Kafka expertise depth.
The hidden cost of self-managed Kafka is incident response. A cluster degradation at 2am requires engineers with deep Kafka operational knowledge — a rare skill set. Managed services shift that cost to a SLA contract. The calculation changes at very high scale: at hundreds of TB/day, the per-GB cost of Confluent Cloud may exceed self-managed infrastructure cost significantly. But below that threshold — which is most organizations — the operational savings from managed services usually outweigh the cost premium. Re-evaluate the decision annually; managed service cost curves have been dropping consistently.
P-02 Kafka is becoming the central data bus for your org — 50 teams, 400 topics. What governance model do you put in place? Principal
Governance axes: topic ownership (every topic has a declared owner team, SLA for availability, and retention policy), schema contracts (schema registry with enforced compatibility modes — no breaking schema changes without a deprecation period), consumer SLAs (producing teams document their throughput and retention commitments), and topic lifecycle (creation requires justification, partition count estimate, and retention policy). Access control: self-service topic creation within a team's namespace via automation, not manual ACL provisioning. A platform team owns the cluster and tooling; domain teams own their topics.
The organizational failure mode is treating Kafka as shared infrastructure with no clear contracts between teams. A data catalog (Schema Registry + DataHub or Amundsen) makes contracts visible. More importantly: establish architectural review for topics that cross team boundaries — a "public" topic that other teams consume is a public API and should be treated with the same rigor as an HTTP endpoint. Conway's Law applies: your topic ownership map will reflect your team structure whether you design it that way or not.
P-03 Design a disaster recovery strategy for Kafka with an RPO of 30 seconds and RTO of 5 minutes. Principal
RPO=30s means you can lose at most 30 seconds of events. RTO=5min means consumers must resume within 5 minutes of declaring a disaster. Architecture: active-passive with near-synchronous replication via MirrorMaker 2 or Confluent Replicator, with replication lag monitored and alerted if it exceeds 15s. The secondary cluster is pre-warmed with consumers in standby state. Offset translation topic enables consumers to resume from the correct position. Failover runbook: detect degradation → declare disaster → redirect producers (DNS/load balancer update) → resume secondary consumers with translated offsets → verify lag converging. Automated where possible; manually triggered for safety.
The 5-minute RTO will be violated the first time you run failover in production without rehearsal. Treat DR runbooks as code: version-controlled, tested quarterly with real traffic in staging, owned by the on-call rotation. The most common DR failures: offset translation doesn't work for compacted topics; producers can't redirect quickly due to hardcoded broker addresses; secondary consumers haven't been tested at production throughput and can't catch up within the RTO. RPO and RTO are commitments to your business — price them against the cost of the DR infrastructure and the test cadence required to actually achieve them.
P-04 Your event-sourced system has been in production 3 years. The log is 2TB, replay takes 4 hours, and teams are coupling directly to internal event schemas. What do you do? Principal
Three problems to solve separately. Replay time: 4-hour bootstrap is operationally risky. Address with snapshotting — periodically write a point-in-time state snapshot to S3 or a database. New consumers bootstrap from the snapshot + replaying only the delta since the snapshot. Schema coupling: introduce a public event contract layer — a separate set of topics with stable, versioned schemas that teams outside the domain may consume. Internal events can evolve freely; public events have a deprecation process. Migration: audit which external consumers are reading internal topics and migrate them to the public layer before enforcing access controls.
This is the natural aging failure mode of event sourcing at scale: the log becomes a liability. The snapshot pattern solves replay without changing the event model. The public/private topic boundary solves coupling without a big-bang migration. Neither is quick — both are multi-quarter initiatives. Prioritize based on risk: if schema coupling is blocking feature development, that's higher priority. The deeper lesson for new event-sourced systems: design for snapshot-based bootstrap from day one, treat public event schemas as APIs from the first external consumer, and define retention policy as a first-class decision.
P-05 How does Conway's Law apply to Kafka topic design? Principal
Conway's Law: systems mirror the communication structure of the org that built them. In Kafka, topics tend to follow team boundaries, schemas evolve in ways that reflect internal team priorities rather than consumer needs, and cross-team topics become coordination bottlenecks that mirror org chart friction. You can read an org's team structure from its topic map. This is not a problem to eliminate — it's a signal to use. If your topic design should reflect domain boundaries, and domain boundaries should reflect your team structure (per Team Topologies), then Kafka topics are a forcing function for getting your service and team boundaries right.
Use topic design as an architectural diagnostic. When two teams are producing to and consuming from each other's topics in a complex web, it reveals a team boundary problem. When a topic has 20 consumer groups from 15 different teams, it's a de facto platform API and should be treated as one — with a versioning policy, deprecation process, and SLA. The worst pattern is a "God topic" that carries events from multiple domains because it was convenient — it creates a coupling point that makes every schema change a cross-org event. Many small, domain-aligned topics are better than a few large, multi-purpose ones.
P-06 An upstream service starts producing at 5x normal volume. Consumer lag cascades across unrelated consumer groups. How do you contain blast radius and prevent recurrence? Principal
Immediate containment: throttle the misbehaving producer at the Kafka quota level (kafka-configs.sh --entity-type clients --entity-name producer-id --add-config producer_byte_rate=...) without restarting it. Check if the surge is intentional (a backfill job) or a bug. Scale out consumers that are lagging (up to partition count). If specific broker partitions are stressed, check broker disk I/O — high-retention topics competing for disk bandwidth. Prevent cascading to unrelated groups: separate high-throughput, bursty topics onto their own broker rack or cluster so a surge doesn't contend for shared broker resources.
The architectural fix is isolation. Topics with different throughput profiles and SLAs should not share brokers or compete for the same disk and network budget. A burstable analytics topic on the same cluster as a payment processing topic means a sudden analytics workload can degrade payment consumer latency — this is a blast radius design failure. Long-term: quota enforcement on all producers by default (not retroactively applied), capacity buffers per cluster accounting for burst behavior, and separate clusters for topics with different reliability tiers. Ask "what's the blast radius if this topic surges?" at topic creation time, not after an incident.
P-07 A team wants to use Kafka for request/reply RPC between services. How do you respond — technically and organizationally? Principal
Technically: Kafka can do request/reply via a reply-to topic pattern (producer sends a request with a correlation ID and reply-to topic header; the consumer processes and publishes the response; the original caller polls filtering by correlation ID). It works, but it's awkward — the caller blocks waiting for a response, consumer group management becomes complex for reply topics, and debugging a lost response is painful. Latency is significantly higher than HTTP or gRPC. For anything requiring a response within hundreds of milliseconds, gRPC or HTTP is the right tool.
The organizational response matters as much as the technical one. "We already have Kafka so let's use it for everything" is a common attractor. The principled answer: match the communication pattern to the tool. Kafka is for async event streaming where the producer doesn't need to know the outcome. Request/reply is synchronous by nature — use a synchronous protocol. Document this as an architectural decision record: Kafka for events, gRPC/HTTP for queries. Enforce it in architecture reviews, not one-off conversations.
System Design Scenarios
📊 Scenario 1 — Real-Time Fraud Detection Pipeline
Problem
Design a system that processes payment events from a Kafka topic and flags potentially fraudulent transactions in near-real-time. Flagged transactions must be available for downstream review within 500ms of the payment event being produced.
Constraints
  • Peak load: 50,000 payment events per second
  • End-to-end latency target: <500ms from produce to fraud verdict
  • Must not lose a payment event — detection can be delayed, not skipped
  • Fraud rules are updated frequently without system restart
  • False positive rate must be <0.1%
Key Discussion Points
  • Partitioning strategy: partition by account or user ID so all events for a given entity land in the same partition — enables stateful per-entity analysis in a single consumer without cross-partition coordination
  • Consumer group design: dedicated consumer group for fraud detection, sized to partition count; separate consumer groups for downstream (audit log, analytics) to isolate failure domains
  • Stateful processing: Kafka Streams or Flink for windowed aggregations (transaction velocity, unusual amounts, new device) — state must survive consumer restarts via changelog topics
  • Rule updates: load fraud rules from a separate compacted topic (a Kafka Streams GlobalKTable) — rules update without restart
  • DLQ: events that fail processing go to a DLQ topic with full context for reprocessing; never block the main consumer
🚩 Red Flags
  • Single partition for 'global ordering' — immediately caps throughput at one consumer; fraud detection doesn't require global order, only per-entity order
  • Synchronous DB call on every event in the hot path — introduces latency and a hard dependency on DB availability; use async writes or in-memory state with periodic flush
  • No DLQ — a single malformed event blocks all processing for that partition indefinitely
  • Ignoring consumer lag as a metric — growing lag means fraud verdicts are silently delayed past the 500ms SLA
🛒 Scenario 2 — Event-Driven Order Processing System
Problem
Decouple an order creation service from fulfillment, inventory reservation, and customer notification using Kafka. The system must handle 5,000 orders per minute at peak. Order state must be consistent — a confirmed order must eventually result in inventory being reserved and a notification being sent, even if downstream services are temporarily unavailable.
Constraints
  • Eventual consistency is acceptable — not every step must be synchronous
  • Inventory reservation must not double-book — exactly-once reservation per order
  • Customer notification is best-effort but must be at-least-once
  • The order service must not fail if fulfillment or inventory is down
  • An order can be cancelled after creation but before fulfillment
Key Discussion Points
  • Topic design: one topic per domain event type (order-created, order-cancelled, inventory-reserved, inventory-reservation-failed) is cleaner than one multipurpose 'orders' topic — consumers subscribe only to what they need
  • Saga pattern: coordinate multi-step workflows via a choreography saga — each service publishes domain events that trigger the next step; failure compensation (inventory-reservation-failed triggers order-failed, which triggers refund and notification)
  • Inventory idempotency: inventory service must upsert by order ID to prevent double-booking on consumer replay; store reservation state with a unique constraint on order ID
  • Cancellation: order-cancelled event consumed by inventory (release reservation) and notifications; consumers must handle out-of-order events — design state machines, not assumed ordering
  • Partitioning: partition by order ID so all events for an order land in the same partition
🚩 Red Flags
  • One 'orders' topic with event type in the payload — consumers must consume everything and filter; schema evolution affects all consumers at once
  • Synchronous calls between saga steps — defeats the purpose of decoupling; if fulfillment is down, the order service call fails
  • No compensation logic for saga failures — an inventory-reservation-failed event with no handler means the customer was charged for an order that will never ship
  • No idempotency on inventory reservation — at-least-once delivery means a consumer restart can replay the reservation event, double-booking inventory
🚨 Scenario 3 — Production Incident: Consumer Lag Spiking Across All Groups
Problem
You receive an alert at 2am: consumer lag is growing across multiple consumer groups on different topics. The lag started 20 minutes ago and is accelerating. Some groups are for payment processing; others are for analytics. No deployments have happened in the past 6 hours.
Constraints
  • Cannot restart Kafka brokers — they are shared with other teams in production
  • Payment processing consumer groups have a lag SLA of 30 seconds
  • Analytics lag can tolerate hours of delay
  • You have access to broker metrics, consumer group describe output, and application logs
Key Discussion Points
  • Is it a producer surge or consumer slowdown? Check bytes-in per broker — if producer throughput is elevated, consumers are simply outpaced; if bytes-in is normal, consumers are slower or stopped
  • Broker health: check under-replicated partitions (non-zero = escalate immediately), active controller count (must be 1), and broker CPU/disk I/O for resource exhaustion
  • Consumer group status: kafka-consumer-groups.sh --describe — look for partitions showing LAG=∞ or CONSUMER-ID=- (no active consumer assigned)
  • Isolate payment groups first: triage by SLA — get payment consumers healthy before diagnosing analytics lag
  • Common root causes at simultaneous multi-group impact: network partition between consumers and brokers, a broker going into resource bottleneck, or a shared downstream system (a database all consumers write to) becoming slow
🚩 Red Flags
  • Restarting all consumers as the first action — triggers rebalances across all groups simultaneously, potentially making lag worse and causing a thundering herd on a stressed downstream resource
  • Increasing partition count on the live topic to add consumer parallelism — requires all consumers to rebalance, breaks key-based ordering, not a safe action under incident pressure
  • Treating analytics and payment lag with equal urgency — triage by SLA; payment lag at 30s is an incident; analytics lag is not
  • Scaling consumers past partition count — adding more consumers than partitions gains nothing; excess consumers sit idle while the others lag
  • Not checking broker health first when all groups are simultaneously affected — simultaneous impact across groups points to shared infrastructure, not a consumer bug