// decomposition · communication · resilience · sagas · service mesh · senior → principal
orders-service.default.svc.cluster.local) that resolves to the ClusterIP, which load-balances to healthy pods. Service registry (Consul, Eureka) is typically needed only outside Kubernetes or for multi-cluster scenarios.
requests/limits) are a bulkhead at the infrastructure level — a runaway pod can't starve its neighbors.
Combine with circuit breaker: bulkhead limits concurrency; circuit breaker stops calls when the dependency is failing.
| Retry | Retry transient failures with exponential backoff + jitter. Jitter prevents thundering herd (all retries hitting at the same time). Retry only idempotent operations. Max 3 retries with backoff: 100ms, 200ms, 400ms + rand(0,100ms). |
| Timeout | Every external call must have a timeout. Without it, slow downstream exhausts threads indefinitely. Set timeout < caller's own SLO. Propagate deadlines with context (Go context, gRPC deadline). Default: fail open after timeout, not hang. |
| Circuit Breaker | Fail fast when a downstream is unhealthy. Closed → Open when failure rate > threshold. Open → Half-Open after wait duration. Half-Open → Closed on probe success. Resilience4j, Polly. Combine with fallback (cached response, default value, graceful degradation). |
| Bulkhead | Isolate resource pools per dependency. Thread pool bulkhead: dedicated pool per downstream; slow call saturates its pool, not the shared one. Semaphore bulkhead: limits concurrent calls without threads. Prevents one slow dependency from starving all others. |
| Fallback | Return a degraded but acceptable response when primary fails. Options: cached previous response, default/empty response, static data, alternative data source. Fallback quality should be transparent to callers — log a metric but don't fail the request. |
| Rate Limiter | Protect a service from being overloaded by a single caller. Applied outbound (limit calls to downstream) or inbound (limit calls from upstream). Resilience4j RateLimiter: token bucket. Different from circuit breaker — limits load proactively, not reactively. |
| Hedged Requests | Send the same request to multiple replicas after a timeout and use the first response. Reduces tail latency at the cost of extra load. Useful for read-only, idempotent calls where P99 latency matters. Google's 'backup requests' pattern. |
| Synchronous REST | Simple request/response over HTTP. Best for queries needing immediate results. Caller blocked during call. Timeout and retry required. Adds latency per hop. Use for: auth checks, pricing, inventory reads. |
| Synchronous gRPC | HTTP/2 + Protobuf. 5–10× faster than REST for internal calls. Strongly typed contracts (.proto). Bidirectional streaming. Browser requires grpc-web proxy. Best for high-frequency internal service communication. |
| Async (Kafka/SQS) | Producer publishes; consumer processes independently. Decouples availability. Enables fan-out, replay, audit. Eventual consistency. Requires DLQ for failed messages. Best for: notifications, analytics, downstream side effects, event-driven workflows. |
| Outbox Pattern | Write event + DB record in a single local transaction. A relay (Debezium CDC or polling job) publishes the event to the broker. Guarantees at-least-once delivery without distributed transactions. Prevents lost events if broker is down at write time. |
| Request-Reply (async) | Async with correlation ID. Caller publishes to a request queue with a reply-to address; consumer processes and publishes response to reply queue. Decouples caller and callee without synchronous blocking. Useful for long-running operations. |
| Database per Service | Each service owns its schema. No shared tables. Other services get data via APIs or events. Enables independent schema evolution, polyglot persistence, and independent scaling. Data duplication is intentional and managed via eventual consistency. |
| Saga (Choreography) | Each service emits an event after completing its step. Other services react. No central coordinator. Simple, but flow is implicit. Hard to debug; requires distributed tracing. Use for simple linear flows with few steps. |
| Saga (Orchestration) | Saga orchestrator explicitly commands each step. Tracks state machine. Handles compensations. Flow is explicit and testable. Tools: Temporal, AWS Step Functions, Camunda. Risk: orchestrator is a bottleneck. |
| CQRS | Separate write (commands) and read (queries) models. Read models are denormalized projections optimized for specific queries. Enables independent scaling of reads/writes. Read model is eventually consistent with write model. |
| Event Sourcing | Store append-only event log instead of current state. Derive state by replaying events. Full audit, temporal queries, replay to rebuild projections. Hard: event schema evolution, query complexity, snapshot management for long-lived aggregates. |
| API Composition | Read from multiple services and join in-memory (at API gateway or BFF). Simple to implement. Risk: coupling to multiple services, N+1 if done naively. Use for reads only. For writes, use saga. |
| mTLS | Mutual TLS between every service pair. Sidecar handles cert rotation. Zero-trust: no implicit trust between internal services. |
| Traffic management | Canary deployments (route 5% to v2), header-based routing (internal testers get v2), weighted traffic split, fault injection for chaos testing. |
| Observability | Automatic distributed traces, metrics (request rate, error rate, latency), access logs — without instrumentation in app code. |
| Retries & timeouts | Configured in mesh policy, not app code. Apply consistently across all services without per-service implementation. |
| AuthorizationPolicy | Declares which services can call which (e.g., orders-service can call inventory-service; nothing else can). Enforced by sidecars. |
| Tools | Istio (full-featured, complex), Linkerd (simpler, lighter), Consul Connect (multi-platform), AWS App Mesh (AWS-native). |
| Coordination | Implicit via events | Explicit via orchestrator commands |
| Coupling | Services coupled to event contracts | Services coupled to orchestrator |
| Flow visibility | Distributed; need tracing to see flow | Centralized; explicit state machine |
| Failure handling | Each service handles own compensation events | Orchestrator drives compensations |
| Debugging | Hard; events span services and time | Easier; orchestrator has full saga state |
| Testing | Integration tests across services needed | Unit-testable orchestrator state machine |
| Scalability | High; no central bottleneck | Orchestrator can become bottleneck at scale |
| Best for | Simple linear flows, few steps | Complex flows, many compensation paths |
| Tools | Kafka, RabbitMQ events | Temporal, AWS Step Functions, Camunda |
failureRateThreshold: % failures to open (default 50) - slidingWindowSize: number of calls in the window - waitDurationInOpenState: how long to stay open before probing - permittedNumberOfCallsInHalfOpenState: probe count
Fallback: always pair with a fallback — return cached data, a default, or a degraded response so the caller doesn't surface a raw failure.Problem: after a service writes to its DB, it must publish an event. If the app publishes to the broker and then the process crashes before the DB commits (or vice versa), you get inconsistency: DB updated but no event, or event published but DB rolled back.
Outbox pattern: 1. Within a single local DB transaction, write both the business record AND an
outbox_events row: {id, aggregate_type, aggregate_id, event_type, payload, created_at}
2. Transaction commits atomically — both or neither 3. A separate relay process reads unpublished outbox rows and publishes them to the
broker (Kafka, SQS)
4. On successful publish, mark the row as published (or delete it)
Relay options: - Polling: relay polls outbox_events WHERE published = false every N seconds.
Simple, slightly delayed.
- CDC (Debezium): captures DB changes via transaction log (binlog/WAL) and streams
them to Kafka without polling. Near-real-time, lower DB load.
At-least-once: if the relay crashes after publishing but before marking as published, it republishes. Consumers must be idempotent (deduplicate by event ID).
Distributed tracing connects a single user request across all the services it touches into a single trace, composed of spans (one per operation).
How it works: 1. The first service (or API gateway) generates a trace-id and a span-id 2. These are propagated via HTTP headers: traceparent (W3C standard), or
X-B3-TraceId/X-B3-SpanId (Zipkin B3 format)
3. Each downstream service reads the incoming headers, creates a child span
(inheriting the trace-id), does its work, and reports the span to a collector
4. The tracing backend (Jaeger, Tempo, Zipkin) assembles spans into a trace waterfall
Instrumentation: - Auto-instrumentation: agent-based (Java agent, OpenTelemetry SDK) — instruments
HTTP clients, DB drivers, message consumers automatically
- Manual spans: add spans around business-critical operations not covered by
auto-instrumentation
OpenTelemetry is now the standard: vendor-neutral SDK that exports to any backend (Jaeger, Zipkin, Tempo, Datadog, Honeycomb). Sampling: never trace 100% in production. Use adaptive sampling: trace 100% of errors, 1–5% of successful requests. Head-based or tail-based sampling.
traceparent). The consumer extracts it and creates a linked span — this is a different trace than the producer's (async), but they're linked via the message ID. Tools like OpenTelemetry's Kafka instrumentation handle this automatically. Also: discuss sampling strategy design — tail-based sampling (keep traces with errors or high latency) is more useful but harder to implement than head-based.A saga step failure leaves some local transactions committed and some not. Consistency is restored via compensating transactions that undo the committed steps. Compensation requirements: - Every saga step must have a corresponding compensating transaction defined upfront - Compensations must be idempotent — they may be retried on failure - Compensations must be semantically correct: undoing "charge $100" is "refund $100", not just "delete the charge record" - Some operations can't be undone (sending an email). Use a pivot transaction: schedule the irreversible action only after all preceding steps succeed, or use a flag to suppress the action if the saga is later rolled back
Failure handling in orchestration: - Orchestrator detects failure (timeout, explicit failure response) - Triggers compensations in reverse order - Compensations are retried with backoff on failure; if all retries fail, saga enters a "stuck" state requiring manual intervention — alert on this
Semantic lock (countermeasure): mark the resource as "pending" during the saga. Prevents other transactions from reading a stale committed intermediate state.
java BulkheadConfig config = BulkheadConfig.custom()
.maxConcurrentCalls(10)
.maxWaitDuration(Duration.ofMillis(500))
.build();
Infrastructure level: Kubernetes resource requests/limits — each pod gets bounded CPU/memory so a runaway service can't starve neighbors.
Pair bulkhead with circuit breaker: bulkhead limits load; circuit breaker stops calls to failed services. Together they contain failures within defined boundaries.CQRS (Command Query Responsibility Segregation) separates write operations (commands that change state) from read operations (queries that return data). The write model handles validation and business rules; the read model is a denormalized projection optimized for specific query patterns. Why separate them? - Write model enforces invariants (requires consistency); read model optimizes for query performance (can be denormalized, cached, eventually consistent) - Read and write loads are typically asymmetric (10:1 reads:writes) — scale independently - Read projections can be tailored per consumer (dashboard view, search index, mobile summary) without changing the write model
Implementation patterns: - Simple CQRS (no event sourcing): same DB, separate code paths. Write path updates normalized tables; read path queries denormalized views or materialized views. - Full CQRS + Event Sourcing: commands produce events; events update separate read stores (Elasticsearch, Redis, read DB). Read stores are eventually consistent.
When NOT to use: - Simple CRUD applications — adds complexity without benefit - When strong consistency between write and read is required - Small teams without the operational capacity to maintain dual stores CQRS is a pattern, not a requirement. Apply it to specific aggregates/domains with high read/write asymmetry, not globally across the entire system.
CreateOrderSaga:
1. orders-service: create order (status=PENDING)
2. inventory-service: reserve items
3. payment-service: charge customer
4. orders-service: confirm order (status=CONFIRMED)
5. notification-service: send confirmation (async, fire-and-forget)
Compensations (reverse order on failure): - Payment fails: release inventory → cancel order - Inventory fails: cancel order (no payment taken yet) - Confirmation fails: release payment → release inventory → cancel order
Implementation details: - Orchestrator: Temporal workflow. Each activity is idempotent (activity ID as
idempotency key). Temporal handles retries, timeouts, and state persistence.
- Each service exposes: reserve(), release() (compensation), charge(), refund().
All idempotent by saga step ID.
- Order status machine: PENDING → INVENTORY_RESERVED → PAYMENT_CHARGED → CONFIRMED
| INVENTORY_FAILED → CANCELLED | PAYMENT_FAILED → INVENTORY_RELEASED → CANCELLED
- Semantic lock: order in PENDING blocks concurrent saga on same order.
Observability: - Trace ID propagated through all saga steps - Saga state persisted in Temporal — visible in Temporal UI - Metrics: saga success rate, step failure rate, compensation rate, saga duration P99 - Alert: sagas stuck in non-terminal state > 5 min → manual intervention queueUPDATE inventory SET reserved = reserved + N and INSERT INTO outbox (saga_id, step='inventory_reserved') in one transaction. The orchestrator gets the step acknowledgment reliably. Also: what happens when Temporal itself is down? Design for this: sagas in-flight pause; no new sagas start; existing inventory holds don't expire (or have a generous timeout). Temporal's durable execution model resumes sagas transparently after recovery.yaml # VirtualService: 5% to v2, 95% to v1 - destination: { host: orders-service, subset: v1 }
weight: 95
- destination: { host: orders-service, subset: v2 }
weight: 5
--- # DestinationRule: defines v1 and v2 subsets by pod label subsets: - name: v1
labels: { version: v1 }
- name: v2
labels: { version: v2 }
Progressive delivery automation (Flagger/Argo Rollouts): - Start at 5% → auto-promote to 10% → 20% → 50% → 100% on success - Rollback trigger: error rate > 1% OR P99 latency > SLO threshold - Analysis period: 5 minutes at each step before promoting - Metrics from Prometheus/Istio telemetry — no manual intervention needed
What to measure per canary step: - HTTP 5xx error rate (primary signal) - P99 request latency - Business metrics (order conversion rate, payment success rate via custom Prometheus metrics) - DB error rate, downstream dependency error rate
Header-based routing: before percentage rollout, route internal testers to v2 via X-Canary: true header. Validates v2 with real data before any external traffic.
Database migration: canary deployments require backward-compatible DB changes. v1 and v2 must read/write the same schema. Use expand-contract for schema changes.Kubernetes resource model: every container declares requests (guaranteed) and limits (maximum). The scheduler places pods on nodes with sufficient requested resources. requests = what the container needs under normal load. limits = cap to prevent runaway consumption.
Setting requests correctly: - Profile under realistic load (load test): P95 CPU and memory during peak - Set requests to P95 steady-state, limits to 2× requests (headroom for bursts) - For memory: set requests == limits to avoid OOMKilled evictions under pressure
(memory is not compressible like CPU)
Bulkheads at cluster level: - Namespace resource quotas: cap total CPU/memory per team namespace. Prevents one team's runaway service from exhausting cluster capacity. - LimitRange: sets default requests/limits for pods that don't specify them. - Priority classes: critical services (auth, payment) get higher priority — scheduler evicts lower-priority pods first under pressure.
Autoscaling: - HPA (Horizontal Pod Autoscaler): scale pods on CPU, memory, or custom metrics (Kafka consumer lag, queue depth via KEDA) - VPA (Vertical Pod Autoscaler): recommends correct requests/limits based on observed usage - Cluster Autoscaler: adds/removes nodes based on pending pod pressure Capacity planning process: - Baseline: measure CPU/memory/RPS per service under current load - Forecast: project traffic growth (30/60/90 day) - Load test: validate the new service handles projected peak × 1.5 safety margin - Review: revisit resource settings quarterly; shrink over-provisioned services
Schema evolution strategies:
Backward-compatible (producer adds): - Add optional fields: consumers using older schema ignore unknown fields
(if consumer is lenient — Avro, Protobuf handle this natively; JSON requires
lenient deserialization)
- Avro/Protobuf evolution rules: add fields with defaults; never remove or rename
Breaking changes (producer removes/renames): - Dual-write transition: producer publishes both v1 and v2 events simultaneously.
Consumers migrate to v2 at their own pace. Producer drops v1 after all consumers
migrate (tracked via consumer group lag metrics).
- Event versioning: include schema_version: "v2" in event headers. Consumers
route by version — handle v1 and v2 with separate handlers during transition.
- Schema registry (Confluent): enforces compatibility rules (BACKWARD, FORWARD,
FULL) before a schema can be registered. CI validates new schema against registry
before deploy.
Consumer-driven contract testing (Pact): - Consumers publish their expected event schema as a pact - Producer CI runs against the pact broker; build fails if new event schema breaks a consumer contract - Catches breaking changes before they reach production Pitfall: never share a code library of event classes between services — it couples service deployments. Each service owns its own event schema representation. Use a schema registry for the canonical schema, not a shared library.
Conway's Law: your architecture mirrors your communication structure. A monolith built by one team; microservices built by many teams. The implication: design team structure and service boundaries together, not separately. Team Topologies framework: - Stream-aligned teams: each team owns a product domain end-to-end (Orders, Payments, Catalog). Services align to team boundaries. - Platform team: owns shared infrastructure (auth, observability, CI/CD, service mesh). Other teams consume it as self-service. - Enabling teams: temporary; help stream teams adopt new patterns. - Complicated-subsystem teams: own high-expertise components (ML pipeline, real-time pricing engine).
Service granularity principles: - One service per team (not per class or DB table). If two services are always deployed together, they're one service owned by one team. - Service size: small enough to be understood by the team, large enough to be worth the operational overhead. "How long to onboard a new engineer?" is a good proxy. - Start coarser, split later. Merging overly-split services is much harder than splitting an oversized one.
Wrong granularity signals: - Teams routinely coordinate multi-service deployments → boundaries are wrong - Teams step on each other's changes in the same service → split needed - A service has very low traffic but high operational cost → merge - Services that can't be tested independently → excessive coupling My framing for the interview: microservices are an organizational scaling pattern as much as a technical one. You don't adopt microservices because it's better architecture; you adopt it because your team structure demands independent deployment velocity. If you have 5 engineers, you almost certainly shouldn't have 20 services.
Immediate diagnosis: - Pull stuck sagas from Temporal/Step Functions dashboard. Group by: which step fails, which service, what error type (timeout? business exception? infra error?) - Correlate with: deployment times (did failure rate increase after a deploy?), infrastructure events (DB slow query, network blip), specific customers/order types - Check compensation success rate: are compensations themselves failing? Double-failure is the worst case — payment charged, inventory not reserved, compensation can't refund
Root cause categories: - Flaky downstream (payment processor 2% timeout): add retry with backoff inside the saga step; ensure idempotency key on processor call so retries don't double-charge - Non-idempotent step being retried: audit every step — does retrying produce duplicate side effects? Fix: idempotency key per saga step ID - Timeout too tight: step timeout < downstream P99. Instrument downstream P99 per saga step; set timeout at P99 + 20% buffer - Data race in semantic lock: two sagas on the same order_id competing. Fix: advisory lock or optimistic concurrency (version field) on the order record
Operational fixes (immediate): - Manual resolution workflow: ops console to inspect stuck sagas, see exact state, trigger retry or rollback. This unblocks customers now. - Alert: stuck saga > 10 min → PagerDuty. SLA: resolve or escalate within 30 min. Systemic fixes (weeks): - Saga observability dashboard: success rate, step failure rate, median/P99 duration, compensation rate — per saga type, per step - Chaos engineering: inject failures at each saga step; validate compensations work; validate stuck saga detection and alerting fire correctly - Backpressure: if saga failure rate > 5%, pause new saga creation (circuit breaker at saga entry point) rather than accumulating more stuck sagas
saga_id + "_charge". The payment processor is called with this key. On retry (Temporal retries activities automatically), the same key is sent — the processor deduplicates and returns the original result. No double-charge possible.saga_id + "_release_inventory" etc).scheduleToCloseTimeout per activity. Inventory check and payment authorization can be parallelized (fan-out within the Temporal workflow) if business rules allow — reduces sequential latency to 2 calls deep instead of 4.OrderConfirmed event to Kafka. Email service consumes asynchronously. Failure doesn't affect the checkout response. Separate from the saga — don't block the customer on email delivery.inventory-pool (20 threads), pricing-pool (10 threads), recommendation-pool (5 threads). Recommendation slowness can exhaust only its 5-thread pool. Inventory and pricing pools are unaffected.[] or cached recommendations from Redis
(last known good result per user). Degrade gracefully — page loads without recs.
- Inventory/pricing: no fallback — these are critical. If they fail, surface an error
page with retry. Do not show stale inventory to avoid overselling.transactions table and outbox_events in a single local transaction. Relay publishes to Kafka for downstream consumers (reconciliation, audit). No event is lost even if the consumer is down. Dual-write during migration: write to both old schema and new service's schema; reconciliation job validates they match.