// pods · scheduling · networking · storage · RBAC · autoscaling · operators · senior → principal
StatefulSet: gives each pod a stable identity (pod-0, pod-1) and stable persistent storage. Pods are created/deleted in order. Use for databases, Kafka, Zookeeper: anything with per-instance state.
DaemonSet: runs exactly one pod per (matching) node. Use for node-level agents: log collectors, monitoring daemons, CNI plugins.
Job / CronJob: runs pods to completion. Job for one-off tasks; CronJob for scheduled work.
kube-proxy (or eBPF with Cilium) programs iptables/ipvs rules on every node to forward traffic to pod IPs.
Ingress: HTTP/HTTPS routing from outside the cluster to Services, handled by an Ingress Controller (nginx, Traefik, ALB). Supports path-based and host-based routing, TLS termination.
DNS: CoreDNS resolves service.namespace.svc.cluster.local to the Service ClusterIP. Pods resolve short names within the same namespace automatically.
requiredDuringScheduling is hard — scheduling fails if unmet. preferredDuringScheduling is soft.
Taints & tolerations: nodes repel pods unless the pod explicitly tolerates the taint. Used to reserve nodes (GPU nodes, spot nodes, infra nodes).
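A minimal sketch of reserving GPU nodes with a taint and a matching toleration. The taint key `dedicated=gpu`, the node label, the pod name, and the image are all hypothetical placeholders, not values from this guide.

```yaml
# Assumed to have been applied beforehand (hypothetical key/value):
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                 # hypothetical name
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule        # lets the scheduler place this pod on the tainted nodes
  nodeSelector:
    dedicated: gpu              # assumed node label; steers the pod onto the reserved nodes
  containers:
    - name: worker
      image: registry.example.com/gpu-worker:1.4.2   # placeholder image
```

A toleration only permits placement on tainted nodes; it does not attract the pod to them, which is why the sketch pairs it with a nodeSelector.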
volumeBindingMode: WaitForFirstConsumer delays provisioning until the pod is scheduled — ensures the volume is in the same AZ as the node.
Access modes: ReadWriteOnce (one node), ReadOnlyMany (many nodes read), ReadWriteMany (many nodes read/write — requires NFS or shared storage like EFS).
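As an illustration of the binding mode and access modes above, here is a sketch of a StorageClass and a claim against it. The class name, the size, and the choice of the AWS EBS CSI provisioner are assumptions; substitute the provisioner your cluster actually runs.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd                        # hypothetical name
provisioner: ebs.csi.aws.com             # assumption: AWS EBS CSI driver
volumeBindingMode: WaitForFirstConsumer  # provision only once the pod has a node (and therefore an AZ)
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                             # hypothetical name
spec:
  storageClassName: zonal-ssd
  accessModes:
    - ReadWriteOnce                      # mounted read-write by a single node
  resources:
    requests:
      storage: 20Gi                      # illustrative size
```

With this binding mode the claim stays Pending until a pod that uses it is scheduled, which is expected behavior rather than a fault.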
The ServiceAccount token is mounted at /var/run/secrets/kubernetes.io/serviceaccount/token and used to authenticate to the API server.
Role / ClusterRole: a set of allowed verbs on resources. Role is namespace-scoped; ClusterRole is cluster-wide.
RoleBinding / ClusterRoleBinding: binds a Role to a subject (user, group, ServiceAccount).
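Pod Security Standards: replace PodSecurityPolicy (deprecated in 1.21, removed in 1.25). Three levels: privileged (unrestricted), baseline (middle ground, blocks known privilege escalations), restricted (hardened: drops all capabilities, requires non-root, forbids privilege escalation). Enforced via namespace labels read by Pod Security Admission.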
Network Policies: default is all traffic allowed between pods. A NetworkPolicy selects pods and restricts ingress/egress to specified sources/destinations.
A container with limits.cpu: 500m is throttled by the Linux CFS scheduler once it has used its quota for the current scheduling period (50 ms of CPU time per default 100 ms period), even if the node has idle CPU. This causes latency spikes invisible to CPU utilization metrics (the container appears to use less CPU than the limit, yet is being throttled). Many teams set CPU limits equal to requests and get surprised. For latency-sensitive services, either set limits much higher than requests or omit CPU limits entirely and rely on requests for scheduling.
OOMKilled: the container exceeded limits.memory and the Linux OOM killer terminated it. The container restarts in place; it's a container-level event. Eviction: the node's actual memory is under pressure, so the kubelet evicts pods, starting with those whose usage most exceeds their requests.memory. An evicted pod is terminated and its controller creates a replacement, usually on another node; it is not restarted in place. If you don't set memory requests, your pod is the first eviction candidate under node pressure. Always set memory requests.
A liveness probe with an aggressive timeoutSeconds or failureThreshold will restart a pod that is temporarily slow (GC pause, slow DB query) rather than truly dead. The restart makes things worse: the pod loses in-flight requests and the spike repeats. Liveness probes should only detect deadlock or unrecoverable states. Use a failureThreshold of at least 3 and a generous timeout. Don't use liveness probes for dependency health (if the database is down, killing and restarting the pod won't help).
A pod with no requests or limits gets the BestEffort QoS class and is the first to be evicted under memory pressure. The scheduler also cannot do meaningful bin-packing, leading to hot nodes. Set requests for both CPU and memory on every container. Use VPA in recommendation mode to right-size requests based on actual usage.
imagePullPolicy: Always contacts the registry on every pod start. If the registry is slow or unavailable (network partition, rate limiting), pods cannot start even when the image is already cached on the node. Use IfNotPresent with immutable image tags (a digest or a version tag, not latest). This also prevents latest from silently running different images on different nodes.
On pod deletion, the kubelet sends SIGTERM and waits terminationGracePeriodSeconds (default 30 s) before sending SIGKILL. But traffic may still be routed to the pod for a few seconds after SIGTERM because endpoint removal propagates asynchronously. Add a preStop hook with a short sleep (5 s) before the application begins shutdown so the load balancer has time to drain. Without this, every rolling deploy or scale-down drops a small percentage of in-flight requests.
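The preStop delay described above could look like the following sketch. The pod name, image, and the 45-second grace period are illustrative assumptions, and the exec form assumes the image ships a `sleep` binary.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api                               # hypothetical name
spec:
  terminationGracePeriodSeconds: 45       # assumed budget: 5 s preStop + app shutdown time
  containers:
    - name: api
      image: registry.example.com/api:2.3.1   # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]       # runs before SIGTERM is delivered to the container
```

The grace-period clock starts when termination begins, so the preStop sleep consumes part of it; size terminationGracePeriodSeconds to cover both.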

| Workload | Description |
|---|---|
| Deployment | Stateless replicas. Rolling updates with configurable maxSurge/maxUnavailable. Rollback via revision history. Pods are interchangeable. |
| StatefulSet | Stable pod identity (pod-0..N), stable DNS (pod-0.svc), ordered start/stop, stable PVCs. Required for databases, Kafka, Zookeeper, Elasticsearch. |
| DaemonSet | One pod per node (or matching nodes). Used for node agents: Fluentd, Prometheus node-exporter, Calico, Cilium. Respects node taints and affinity. |
| Job | Runs pods to completion. completions + parallelism control concurrency. backoffLimit caps retries. Use for one-off data migrations, report generation. |
| CronJob | Creates Jobs on a cron schedule. concurrencyPolicy controls overlapping runs (Allow/Forbid/Replace). successfulJobsHistoryLimit prevents history accumulation. |
| ReplicaSet | Maintains N pod replicas. Rarely used directly — owned by Deployments. Only manage ReplicaSets directly if you need custom rollout logic. |

| Service type | Behavior |
|---|---|
| ClusterIP | Default. Virtual IP reachable only within the cluster. All inter-service communication should use ClusterIP. kube-proxy programs iptables/ipvs rules on every node. |
| NodePort | Exposes the service on every node's IP at a static port (30000–32767). Reachable externally via |
| LoadBalancer | Provisions a cloud load balancer (NLB, GCP LB, Azure LB) pointing to the NodePort. ALBs are provisioned through Ingress, not through this Service type. One LB per Service = expensive at scale. Use Ingress with a single LB instead for HTTP workloads. |
| ExternalName | DNS CNAME alias to an external hostname. No proxying. Use to abstract external services (RDS endpoint, third-party API) behind a Kubernetes service name. |
| Headless (clusterIP: None) | No virtual IP. DNS returns individual pod IPs directly. Used by StatefulSets so clients (databases, Kafka clients) can address specific pods. |

| Field | Purpose |
|---|---|
| livenessProbe | Is the container alive? Failure → restart. Use for deadlock/frozen process detection only. Aggressive thresholds cause cascading restarts under load. |
| readinessProbe | Is the container ready to serve traffic? Failure → removed from Service endpoints (no traffic routed). Use for startup completion and dependency health. |
| startupProbe | Runs repeatedly during startup until it first succeeds. Disables liveness/readiness until then. Use for slow-starting applications (JVM warmup, large model loading) to prevent premature liveness failures. |
| requests.cpu / memory | Used by scheduler for bin-packing. Sets QoS class. Always set on every container. |
| limits.cpu | Hard ceiling enforced by CFS throttling. Can cause latency spikes. Set significantly above requests or omit for latency-sensitive services. |
| limits.memory | Exceeding this → OOMKilled (exit 137). Set equal to or slightly above requests for predictable behavior. |
| terminationGracePeriodSeconds | Time between SIGTERM and SIGKILL. Default 30 s. Set to cover your app's shutdown time + preStop hook duration. |
| preStop hook | Runs before SIGTERM. Use for sleep to allow endpoint propagation, or to drain connections gracefully. |
| topologySpreadConstraints | Spread pods across zones/nodes. maxSkew=1 ensures at most 1 more pod in any zone than others. Preferred over podAntiAffinity for spreading. |
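
The requests, limits, and topologySpreadConstraints rows above might combine as in this sketch. The Deployment name, labels, image, and resource figures are illustrative assumptions; omitting the CPU limit follows the guidance for latency-sensitive services.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                       # zones may differ by at most one matching pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # best-effort spread; DoNotSchedule makes it a hard rule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m                    # illustrative values
              memory: 256Mi
            limits:
              memory: 256Mi                # memory limit equal to request; no CPU limit, so no CFS throttling
```

Because the CPU limit is absent, this pod is Burstable rather than Guaranteed; set limits.cpu equal to requests.cpu if the Guaranteed class matters more than burst headroom.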
| Dimension | Kubernetes | Docker Swarm | HashiCorp Nomad | Amazon ECS |
|---|---|---|---|---|
| Complexity | High — steep learning curve, many concepts | Low — simple Compose-like model | Medium — simpler than K8s, richer than Swarm | Low-medium — managed control plane |
| Workload types | Containers, pods, jobs, CronJobs | Containers only | Containers, binaries, VMs (with Nomad driver) | Containers (EC2 and Fargate) |
| Networking model | Flat pod network via CNI; Services; Ingress | Overlay network; routing mesh | CNI plugins; Consul service mesh integration | VPC networking; ALB/NLB integration |
| Storage | PV/PVC/StorageClass; CSI plugins | Named volumes; limited cloud integration | Host volumes; CSI support | EBS, EFS via task definitions |
| Autoscaling | HPA, VPA, KEDA, Cluster Autoscaler | Manual; limited built-in scaling | Horizontal scaling; Nomad Autoscaler | ECS Service Autoscaling; Fargate auto-provision |
| Multi-tenancy | Namespaces, RBAC, NetworkPolicy, resource quotas | Limited namespace isolation | Namespaces + ACL policies | Accounts/IAM for isolation; no native namespacing |
| Ecosystem | Vast — Helm, operators, service meshes, GitOps | Limited; largely superseded | Growing; strong with Consul/Vault integration | AWS ecosystem only |
| Managed offerings | GKE, EKS, AKS, DOKS (fully managed control plane) | None actively maintained | HCP Nomad | ECS is fully managed; EKS for Kubernetes on AWS |
| Best for | Large-scale, multi-team, cloud-native platforms | Simple deployments, small teams | Mixed workloads (containers + VMs), HashiCorp stack | AWS-native, teams avoiding K8s complexity |
1. kubectl serializes the manifest and sends a PATCH request (or a POST when the object does not yet exist) to the API server over HTTPS.
2. The API server authenticates the request (client cert or token), authorizes it (RBAC), then runs admission controllers: mutating webhooks (inject sidecars, set defaults) run first, then validating webhooks (enforce policy). If any admission controller rejects the request, it fails here.
3. The API server writes the desired state to etcd and returns 200 OK (201 Created for a new object) to kubectl.
4. The Deployment controller (part of kube-controller-manager) watches the API server for Deployment changes; only the API server talks to etcd directly. It computes the desired ReplicaSet and creates/updates it.
5. The ReplicaSet controller notices the desired replica count isn't met and creates Pod objects through the API server (spec only, no node assigned).
6. The scheduler watches for pods with no nodeName. It filters nodes (enough CPU/memory, tolerations match, affinity satisfied), then scores them and binds the pod to the winning node, which sets nodeName on the Pod object.
7. The kubelet on the assigned node watches for pods bound to its node. It calls the container runtime (containerd) via CRI to pull the image and start the container.
8. The kubelet reports back pod status. Once containers pass readiness probes, the endpoints controller adds the pod's IP to the Service's Endpoints object.
9. kube-proxy (or Cilium) on every node watches Endpoints and updates iptables/eBPF rules so traffic to the ClusterIP is forwarded to the new pod.
kubectl apply may take seconds to fully propagate: each step has its own watch-react cycle.
Deployment for stateless workloads where all replicas are identical and interchangeable. Pod names are random hashes. Pods can be killed and rescheduled in any order. Use for APIs, web frontends, stateless workers.
StatefulSet for workloads that require:
- Stable network identity: pod-0, pod-1, pod-2, with DNS names like pod-0.svc.ns.svc.cluster.local that persist across restarts
- Stable persistent storage: each pod gets its own PVC (data-pod-0, data-pod-1) that is not deleted when the pod is rescheduled
- Ordered startup/shutdown: pod-0 must be Running before pod-1 is created; shutdown proceeds in reverse order. Critical for leader-election-based systems (Kafka, etcd).
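The three guarantees above map onto a headless Service plus volumeClaimTemplates. This is a sketch only: the name `db`, the port, the image, and the mount path are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db                       # hypothetical name
spec:
  clusterIP: None                # headless: DNS returns the individual pod IPs
  selector:
    app: db
  ports:
    - name: sql
      port: 5432                 # illustrative port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                # gives pods DNS names db-0.db.<namespace>.svc.cluster.local
  replicas: 3                    # creates db-0, db-1, db-2 in order
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: registry.example.com/db:1.0.0   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data           # assumed data directory
  volumeClaimTemplates:
    - metadata:
        name: data               # yields PVCs data-db-0, data-db-1, data-db-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi        # illustrative size
```

Scaling the StatefulSet down leaves the PVCs in place, so scaling back up reattaches each pod to its previous volume.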
Use StatefulSets for: databases, Kafka, Zookeeper, Elasticsearch, Redis Cluster.
DaemonSet when you need exactly one pod on every node (or a subset of nodes via node selector). Pods are created as nodes join the cluster and removed when nodes leave. Use for log shippers, metrics collectors, CNI plugins, storage drivers.
RollingUpdate with partition for canary, or OnDelete for manual control) is appropriate for your workload.
api.example.com). An external DNS record points to the cloud Load Balancer IP provisioned for the Ingress.
2. Load balancer → Ingress Controller: the cloud LB forwards the request to the Ingress Controller pods (nginx, Traefik, AWS ALB Ingress Controller). The Ingress Controller watches Ingress resources and routes based on host/path rules.
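3. Ingress → Service: the Ingress Controller resolves the matching Service. Many controllers (ingress-nginx, for example) then proxy straight to the pod IPs in the Service's endpoints instead of going through the ClusterIP; in that case step 4 applies only to in-cluster callers of the Service.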
4. Service → Pod (kube-proxy): kube-proxy has programmed iptables/ipvs rules on every node that DNAT the ClusterIP:port to one of the pod IPs in the Endpoints list. Selection is random (iptables) or round-robin (ipvs).
5. Pod processes the request and returns a response through the same path in reverse.
Key detail (Endpoints): the Service only routes to pods whose readiness probe is passing. The endpoints controller removes pods failing readiness from the Endpoints list, so the Service's load balancing naturally excludes unhealthy pods.
- Guaranteed: requests == limits for both CPU and memory in all containers. Highest priority, last to be evicted.
- Burstable: requests set but less than limits (or limits not set for some containers).
- BestEffort: no requests or limits. First to be evicted under node pressure.
CPU limit exceeded: the container is throttled by the Linux CFS scheduler — it's rate-limited to the specified millicores per scheduling period. The process keeps running but gets less CPU time. This causes latency spikes, not crashes.
Memory limit exceeded: the container is OOMKilled (exit code 137) by the Linux OOM killer. The pod restarts according to restartPolicy. Repeated OOMKills trigger CrashLoopBackOff with exponential backoff.
Node memory pressure (eviction): kubelet evicts pods starting with BestEffort, then Burstable with the highest memory usage relative to requests, then Guaranteed only as a last resort. The pod is terminated and its replacement is scheduled on another node.
A container with limits.cpu: 200m and actual usage of 180m can still be heavily throttled if it has bursts within a 100ms CFS scheduling period. The metric to watch is not CPU utilization but container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total: a throttled ratio above 25% indicates the limit is too low for the workload's burst pattern. Many teams discover their service is constantly throttled only after adding this metric to dashboards. For latency-sensitive services, consider setting CPU limits to 2–4× requests, or removing CPU limits entirely and relying on requests for scheduling.
Startup probe: runs at pod start. Disables liveness and readiness until it succeeds. Use for slow-starting applications (JVM warmup, model loading) to give them time to initialize without being prematurely killed by liveness. Once it succeeds, liveness takes over.
Readiness probe: determines if the pod should receive traffic. Failure removes the pod from the Service Endpoints: traffic stops routing to it, but the pod keeps running. Use for: app startup completion, dependency availability (cache warm-up), graceful degradation.
Liveness probe: determines if the container is alive. Failure → container restart. Use sparingly, only for deadlock or truly unrecoverable states.
Liveness misconfiguration failure mode:
- Probe checks an external dependency (database, upstream API). Database goes down → liveness fails → pod restarts → pod is still unhealthy → restart loop. Now you have a CrashLoopBackOff AND a database outage. Restarting the pod cannot fix the database.
- Probe timeout too short (e.g., 1s) → a GC pause or slow request causes a timeout → pod restarted unnecessarily → in-flight requests dropped → under load, many pods restart simultaneously → cascading failure.
Rule: liveness probes should only check internal state that restart can actually fix. A simple /healthz endpoint returning 200 when the main loop is running is sufficient.
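A sketch of the three probes configured along these lines. The `/healthz` path comes from the rule above; the `/ready` path, port 8080, image, and all timing values are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                                 # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5
        failureThreshold: 15                # up to 15 x 5 s = 75 s to start before the container is restarted
      readinessProbe:
        httpGet: { path: /ready, port: 8080 }   # hypothetical endpoint; may include dependency checks
        periodSeconds: 5
        failureThreshold: 2                 # failing only removes the pod from Service endpoints
      livenessProbe:
        httpGet: { path: /healthz, port: 8080 } # internal state only, no database or upstream calls
        periodSeconds: 10
        timeoutSeconds: 5                   # generous, so a GC pause does not trigger a restart
        failureThreshold: 3
```

Keeping dependency checks in the readiness endpoint and out of the liveness endpoint is what prevents a database outage from turning into a restart loop.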
ServiceAccount: the pod's identity. Every pod runs as a ServiceAccount (default ServiceAccount in the namespace if not specified). Its JWT token is mounted into the pod and sent with API server requests.
Authentication: the API server verifies the JWT's signature against the cluster's service-account signing keys. The token contains sub: system:serviceaccount:<namespace>:<name>.
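Authorization (RBAC):
- The API server finds every RoleBinding and ClusterRoleBinding whose subjects include the ServiceAccount.
- Each bound Role or ClusterRole contributes its (apiGroups, resources, verbs) rules.
- The request is allowed if any rule permits the requested verb on the requested resource in the requested namespace.
RBAC is deny by default: no matching rule means the request is denied (403).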
Admission control runs after authorization (mutating then validating webhooks).
Example: a pod running as ServiceAccount: my-app in namespace prod calls GET /api/v1/namespaces/prod/configmaps. A GET on the collection maps to the list verb (get applies to a single named object), so the API server checks: does any RoleBinding in prod (or ClusterRoleBinding) bind my-app to a Role that allows list on configmaps?
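For the example request to succeed, objects along these lines would need to exist. The ServiceAccount `my-app` and namespace `prod` come from the example; the Role and RoleBinding names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader          # hypothetical name
  namespace: prod
rules:
  - apiGroups: [""]               # core API group
    resources: ["configmaps"]
    verbs: ["get", "list"]        # list covers GET on the collection; get covers a single named ConfigMap
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-configmap-reader   # hypothetical name
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

One way to check the result is kubectl auth can-i list configmaps -n prod --as=system:serviceaccount:prod:my-app.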
Run kubectl auth can-i --list --as=system:serviceaccount:<ns>:<sa> to see what a ServiceAccount can do. For platform teams, audit all ClusterRoleBindings for cluster-admin grants; this is the first thing to check after a security incident. Automate this check in CI with tools like rbac-police or audit2rbac.
A PodDisruptionBudget (PDB) limits the number of pods of a workload that can be simultaneously unavailable during voluntary disruptions: node drains, cluster upgrades, node autoscaler scale-down events.
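```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api        # illustrative name; required for the manifest to apply
spec:
  minAvailable: 2          # or maxUnavailable: 1
  selector:
    matchLabels:
      app: payment-api     # must match the labels on the workload's pods
```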
With this PDB, kubectl drain will wait before evicting a payment-api pod if doing so would leave fewer than 2 pods available. The drain will block until a replacement pod is running and ready.
Critical for:
- Services where N < minReplicas causes user-facing impact (quota errors, degraded SLA)
- StatefulSets where losing too many replicas causes quorum loss (etcd, Kafka, Zookeeper)
- Any service with an SLO: PDBs are how you tell Kubernetes "this service has an SLO, respect it during maintenance"
Without a PDB: a node drain can evict all pods of a Deployment simultaneously (especially if they're all on one node), causing a complete outage during routine maintenance.
minAvailable: 2 on a Deployment with replicas: 2 will block all node drains indefinitely because there's no slack for eviction. Always set minAvailable < replicas (or maxUnavailable >= 1) so drains can proceed. The right pattern is replicas: 3, minAvailable: 2, which allows one pod to be evicted at a time. PDBs pair with topologySpreadConstraints to ensure pods are spread across nodes/zones so a single node drain doesn't breach the PDB in the first place.
Phase 1: Filtering (eliminate nodes that cannot run the pod):
- Resource fit (pod requests fit within Allocatable)
- nodeSelector labels present on the node
- nodeAffinity requiredDuringScheduling rules satisfied
- Pod tolerates all NoSchedule and NoExecute taints on the node
- Volume zone constraints (volume must be in the same AZ as the node)
- Pod's hostPort not already used on the node
Phase 2: Scoring (rank remaining nodes 0–100):
- LeastAllocated: prefer nodes with the most remaining capacity
- NodeAffinity: weight preferredDuringScheduling rules
- InterPodAffinity: prefer nodes near (or away from) specified pods
- TopologySpread: balance pods across zones/nodes per topologySpreadConstraints
Scheduling tools:
- nodeSelector: simple key-value label match, a hard constraint
- nodeAffinity: richer expressions (In, NotIn, Exists); required or preferred
- podAffinity / podAntiAffinity: schedule near/away from pods with matching labels
- topologySpreadConstraints: spread pods evenly across topology domains (zone, node)
- taints + tolerations: reserve nodes for specific workloads (GPU nodes, spot nodes)
- priorityClass: high-priority pods can preempt lower-priority pods when cluster is full
topologySpreadConstraints with maxSkew=1 across zones. Without it, a Deployment of 6 replicas may land all 6 on nodes in the same AZ, and one AZ outage takes down the service. podAntiAffinity with requiredDuringScheduling achieves strict spreading but prevents scheduling when a zone has no capacity (scheduling fails instead of allowing imbalance). topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway is more resilient: it spreads best-effort but doesn't block scheduling. Pair it with PDBs to ensure that during a zone failure, the remaining pods across other zones satisfy the PDB.
Pending
3. CA detects Pending pods → provisions a new node
4. New pods schedule and become Ready → HPA stabilizes
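For the HPA side of the sequence above, a minimal sketch. The target Deployment `web`, the replica bounds, and the 60% CPU target are assumptions; the 300-second scale-down window is written out explicitly even though it is the default.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                         # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                       # assumed target Deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # percent of requests.cpu, averaged across pods
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min of lower load before removing pods
```

Utilization is measured against requests, not limits, so this target only behaves sensibly when every container sets requests.cpu.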
Common failure modes:
HPA targeting too-high CPU utilization (e.g., 90%): utilization is measured against requests, so pods are consistently running near their request and HPA triggers scale-out only when already overloaded. By the time new pods are ready (node provisioning takes 2–5 min on cloud providers), users have experienced degradation. Target 50–70% CPU so there's headroom before scale-out.
HPA and CA fighting: HPA scales down pods → CA sees underutilized nodes → CA scales down nodes → HPA scales up pods → CA scales up nodes. The stabilization window on HPA (default 5 min for scale-down) must be longer than CA's node drain time to avoid this.
Pod requests too low: HPA sees CPU at 80% of request and scales out, but the actual CPU is low — the request was wrong. CA adds a node unnecessarily. Fix by right-sizing requests with VPA in recommendation mode.
CA can't scale down: PDBs are too restrictive for eviction to proceed, pods have local storage (emptyDir, hostPath), or pods have cluster-autoscaler.kubernetes.io/safe-to-evict: "false". The node sits idle but CA can't remove it.
Namespace per team/environment is the primary isolation unit. Each team gets one or more namespaces (e.g., team-payments-prod, team-payments-staging).
Resource isolation:
- ResourceQuota: cap total CPU, memory, PVC storage, and object counts per namespace. Prevents one team from consuming cluster resources and starving others.
- LimitRange: set default requests/limits for pods that don't specify them, and enforce min/max bounds. Ensures BestEffort pods don't exist in the namespace.
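A sketch of the two objects for one tenant namespace. The namespace name reuses the team-payments-prod example; every number is an illustrative assumption rather than a recommendation.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                 # hypothetical name
  namespace: team-payments-prod
spec:
  hard:
    requests.cpu: "20"             # total CPU requests across the namespace
    requests.memory: 40Gi
    limits.memory: 60Gi
    requests.storage: 500Gi        # total PVC storage
    pods: "100"                    # object-count cap
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults         # hypothetical name
  namespace: team-payments-prod
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container sets no limits
        memory: 256Mi
      max:
        memory: 4Gi                # upper bound per container
```

Once a quota covers requests.cpu or limits.memory, pods that omit those fields are rejected, so the LimitRange defaults are what keep unannotated workloads deployable.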
Access isolation (RBAC):
- Bind each team's users/service accounts to namespace-scoped roles only.
- Use ClusterRole + RoleBinding (not ClusterRoleBinding) to grant namespace-scoped permissions without cluster-wide access.
- Restrict ClusterRoleBinding to the platform team only.
Network isolation:
- Default NetworkPolicy: deny all ingress and egress within the namespace.
- Allow egress to CoreDNS (pods labeled k8s-app: kube-dns in the kube-system namespace), cluster-internal services, and the internet as needed.
- Allow ingress from the Ingress Controller namespace only.
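The first two rules could be expressed as in the sketch below. Policy names and the namespace are hypothetical, and the DNS selector assumes CoreDNS runs in kube-system with the conventional `k8s-app: kube-dns` label. NetworkPolicies only take effect with a CNI plugin that enforces them.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all            # hypothetical name
  namespace: team-payments-prod
spec:
  podSelector: {}                   # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]  # no rules listed, so all traffic is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress            # hypothetical name
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # label set automatically on namespaces
          podSelector:
            matchLabels:
              k8s-app: kube-dns     # assumed CoreDNS pod label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Policies are additive, so the DNS allowance opens only port 53 to the DNS pods while the default-deny policy continues to block everything else.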
Admission control for policy:
- Enforce Pod Security Standards (restricted profile) per namespace via namespace labels.
- Use OPA/Gatekeeper or Kyverno for custom policies: require image from approved registries, require labels, require non-root containers.
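The namespace labels that turn on the restricted profile look like this. The namespace name is reused from the earlier example; setting warn and audit alongside enforce is an optional choice shown for illustration.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject pods that violate the profile
    pod-security.kubernetes.io/warn: restricted      # also return warnings to the client
    pod-security.kubernetes.io/audit: restricted     # also annotate audit log entries
```

A gentler rollout is to start with only warn and audit set, review what would be rejected, and add enforce afterwards.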
netassert or similar tools to assert connectivity rules in CI.
PostgresCluster) and write a controller that watches for it and reconciles the desired state: creating Deployments, Services, Secrets, and PVCs, and responding to failures.
Operators encode operational knowledge: backup procedures, failover logic, schema migrations, scaling policies. The controller loop runs continuously — if a pod fails, the operator restores it according to the domain-specific rules, not generic Kubernetes logic.
Helm is a templating and release management tool. It renders YAML manifests with values and tracks releases. It doesn't reconcile state after install — if a resource is deleted outside Helm, Helm doesn't notice until the next helm upgrade. Helm is for packaging and distributing applications, not for complex lifecycle management.
When to use Helm: installing off-the-shelf applications (Prometheus, cert-manager, Vault), managing configuration across environments, packaging your own services for deployment.
When to build an Operator: your application has complex operational logic that Helm can't encode (automatic failover, backup/restore, rolling upgrades with data migrations). Or you're a software vendor shipping a Kubernetes-native product. Use the Operator SDK or Kubebuilder framework.
Platform team owns:
- Cluster lifecycle: provisioning, upgrades, node pool management, multi-region topology
- Core infrastructure: CNI, CSI, DNS (CoreDNS), Ingress controllers, cert-manager
- Security baseline: OPA/Gatekeeper/Kyverno policies, Pod Security Standards enforcement, image scanning pipeline, RBAC governance
- Observability stack: Prometheus/Thanos, Grafana, Loki/OpenSearch, distributed tracing
- GitOps tooling: ArgoCD or Flux managing application deployments
- Namespace onboarding automation: self-service (Terraform module, operator) that provisions namespace, ResourceQuota, LimitRange, default NetworkPolicies, RBAC
Application teams own:
- Deployment manifests (Helm charts, Kustomize overlays) in their own Git repos
- HPA/KEDA configuration and tuning
- Readiness/liveness probes and resource requests (guided by VPA recommendations)
- Service-level SLOs and alerting rules in their namespaces
Key design decisions:
- Single large cluster vs many small clusters: large clusters are more efficient (bin-packing, shared infra overhead) but harder to isolate and upgrade. Small clusters reduce blast radius but multiply operational burden. Common pattern: one cluster per environment tier (prod, staging, dev) per region, with namespace-level team isolation.
- GitOps for all changes: no kubectl apply in production. ArgoCD/Flux enforces that the cluster state matches Git. Drift detection alerts on out-of-band changes.
- Self-service with guardrails: platform team provides a Terraform module or a Namespace CRD that teams fill in. The platform operator provisions everything. Teams don't need Kubernetes expertise to onboard.
kubectl convert or use pluto to find deprecated API versions in your manifests.
- Validate PDBs exist for all critical workloads, since the upgrade will drain nodes one at a time.
- Ensure the cluster has enough spare capacity for one node's worth of pods to reschedule.
- Test the upgrade in a staging cluster with the same workload profile first.
- Verify all addons (CNI, CSI, cert-manager, CoreDNS) have versions compatible with the target K8s version.
Control plane upgrade (managed clusters like GKE/EKS handle this): API server, scheduler, controller-manager upgrade in a rolling fashion. A brief period of control plane unavailability (seconds) is possible, during which kubectl get pods may time out. Running workloads are unaffected: kubelet and pods continue running without the API server.
Node upgrade (the risky part):
1. Add a new node pool/group with the new version.
2. Cordon old nodes (mark unschedulable).
3. Drain old nodes one at a time: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data. Kubernetes evicts pods respecting PDBs; if a PDB blocks eviction, drain waits.
4. New pods schedule on new nodes.
5. Verify workloads are healthy on new nodes.
6. Delete old nodes.
Zero downtime requires: PDBs on all critical workloads, topologySpreadConstraints so pods aren't all on the nodes being drained, graceful shutdown (preStop hook + adequate terminationGracePeriodSeconds), and readiness probes so traffic doesn't route to not-yet-ready pods on new nodes.
networking.k8s.io/v1beta1 Ingress (removed in 1.22) will fail to apply after the upgrade even if the running workload is unaffected. Run pluto detect-all-in-cluster before every upgrade to enumerate all objects using deprecated APIs; this is the single highest-ROI pre-upgrade check. The second failure mode is insufficient spare capacity. If the cluster runs at 90% utilization, draining one node can leave that node's pods (10, say) Pending with nowhere to schedule. Build spare capacity into your cluster sizing (target 70–75% utilization) or use overprovisioning to keep buffer nodes warm.
CrashLoopBackOff. The on-call engineer is paged. The service handles live payment traffic. Walk through your diagnostic and recovery process.
kubectl rollout undo deployment/payment-api. If the previous version was healthy, this is the fastest path to recovery. Don't spend time diagnosing while users are impacted: restore, then investigate.
kubectl logs <pod> --previous -n prod. The --previous flag returns logs from the last terminated container. If the container dies in 2 s, look for startup errors, missing env vars, connection refused, or panic stack traces.
kubectl describe pod <pod> -n prod. Events section shows: OOMKilled (exit 137 = memory limit), image pull error, volume mount failure, liveness probe failure. The exit code tells you a lot: 1 = app error, 137 = SIGKILL (usually OOM), 143 = SIGTERM.
kubectl diff -f deployment.yaml or check the GitOps PR/commit that triggered the deploy. What changed? New image, changed env var, modified resource limits, a new secret reference?
kubectl exec into another pod in the same namespace to verify secrets are accessible, or run locally with docker run and exported env vars).
kubectl get events for Error: secret not found or similar. Add the missing secret and redeploy rather than rolling back if the old version can't use the new secret either.
failureThreshold so the next slow start doesn't look like a crash-loop. Add structured logging at startup that logs all required env vars (values masked) so future startups are diagnosable in 2 seconds of logs.
--previous
PostgresCluster with 3 instances, one instance per AZ (using topologySpreadConstraints). Synchronous replication to at least one standby ensures RPO=0: the primary waits for WAL acknowledgment before confirming commits.
minAvailable: 2 ensures at most 1 pod is evicted during node drains. Combined with AZ spread, AZ failure leaves 2 pods in the remaining AZs, above the quorum threshold.
volumeClaimTemplates with WaitForFirstConsumer binding ensures the PVC is provisioned in the same AZ as the pod. Use a StorageClass with allowVolumeExpansion: true so you can grow PVCs without downtime.
initContainer or pre-deployment step. Use backward-compatible migrations (add column before removing old one, never rename columns directly). With CNPG, the primary is always available during migrations; never run migrations against a replica.
scheduledBackups to S3 with WAL archiving. Test restore monthly. RTO of 60s is achievable with failover (not restore-from-backup); restore-from-backup is the disaster recovery path with a different (higher) RTO.
startupProbe with failureThreshold: 15 and periodSeconds: 5 gives 75 seconds before liveness kicks in. Without this, liveness kills the pod during JVM warmup on every start.
requests.memory = 2.5 GB (heap + metaspace + off-heap overhead). Set limits.memory = 3 GB. Set -Xmx to 2 GB explicitly. If the heap is instead sized as a percentage of the container limit (the JVM default is 25% via MaxRAMPercentage; 75% is a common override), it either wastes memory or crowds out off-heap memory. Set GOMAXPROCS equivalent for JVM: -XX:ActiveProcessorCount if limiting CPU.
-XX:+PrintFlagsFinal that ActiveProcessorCount matches your limits.cpu (in cores, not millicores). A JVM on a 2-CPU limit that thinks it has 64 CPUs creates too many threads.
maxSurge: 1, maxUnavailable: 0 for zero-downtime. This creates a new pod before removing an old one. With 30-second startup, the rolling update is slower, so budget for it in deployment pipelines.
/actuator/prometheus endpoint provides most of what's needed. A spike in GC pause time is often the first signal of a misconfigured heap.
-Xmx) equal to the container memory limit: OOMKilled before the app can respond to a single request