Jenkins / CI-CD — Field Guide

Core Concepts

🔁 CI vs CD vs CD

Continuous Integration (CI): every code commit triggers an automated build and test run. The goal is to detect integration failures early. All developers merge to a shared branch frequently (at least daily). CI is a practice, not just a tool.

Continuous Delivery (CD): every successful CI build produces a deployable artifact. Deployment to production is automated up to a manual approval gate: you can deploy at any time with one click. The artifact is always production-ready.

Continuous Deployment (CD): every successful CI build is automatically deployed to production with no manual gate. Requires high test confidence, feature flags for risk mitigation, and fast rollback capability. Used by companies like Netflix and GitHub.

The difference between Delivery and Deployment is the human approval step. Most organizations target Continuous Delivery; Continuous Deployment is adopted incrementally as confidence in automated quality gates grows.

CI = integrate often · Delivery = manual gate · Deployment = fully automated

📋 Jenkins Declarative Pipeline

Jenkins Pipelines defined in a Jenkinsfile (committed to the repo) are the standard. Two syntaxes: Declarative (structured, easier to read, required pipeline {} block) and Scripted (Groovy DSL, full flexibility, harder to read). Prefer Declarative — use script {} blocks for complex Groovy logic within a Declarative pipeline.

```groovy
pipeline {
  agent { label 'linux' }
  stages {
    stage('Build') {
      steps { sh 'mvn package -DskipTests' }
    }
    stage('Test') {
      steps { sh 'mvn test' }
      post { always { junit 'target/surefire-reports/*.xml' } }
    }
    stage('Deploy') {
      when { branch 'main' }
      steps { sh './deploy.sh production' }
    }
  }
}
```

Jenkinsfile in repo · Declarative preferred · when{} for conditions

🖥️ Agents & Executors

Jenkins has a controller (formerly master) that orchestrates builds, and agents (formerly slaves) that execute them. Never run builds on the controller: it is a security and stability risk.

Agent types: any (any available agent), label 'linux' (agents with that label), docker { image 'maven:3.9' } (run in a container on the agent), kubernetes {} (spin up a pod per build; highly scalable, ephemeral).

Kubernetes agents (Jenkins Kubernetes Plugin): each build spawns a fresh Kubernetes pod with a jnlp container (the Jenkins agent) plus any tool containers you declare. The pod is destroyed after the build. This enables parallel builds limited only by cluster capacity, clean environments per build, and no agent config drift.

Agent labeling: tag agents by capability (docker, gpu, windows, high-mem). Pipeline steps request the right label, so no manual assignment is needed.
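
A minimal sketch of a per-build Kubernetes agent, assuming the Jenkins Kubernetes Plugin is installed and a Kubernetes cloud is already configured. The container name maven, the sleep command that keeps the tool container alive, and the Maven goal are illustrative choices, not taken from the original.

```groovy
pipeline {
  agent {
    kubernetes {
      // Pod spec for this build; the plugin adds the jnlp agent container itself
      yaml '''
        apiVersion: v1
        kind: Pod
        spec:
          containers:
          - name: maven
            image: maven:3.9
            command: ["sleep"]
            args: ["99d"]
      '''
      // Run sh steps in the maven container instead of jnlp
      defaultContainer 'maven'
    }
  }
  stages {
    stage('Build') {
      steps {
        sh 'mvn -B package'
      }
    }
  }
}
```

The pod exists only for the duration of the build, so nothing installed or cached in the container carries over to the next run.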

never build on controller · k8s agents = ephemeral · label by capability

🚀 Deployment Strategies

Rolling update: replace instances one by one. Kubernetes default. Zero downtime if health checks pass. Risk: both v1 and v2 serve traffic simultaneously, so they must be backward-compatible.

Blue/Green: maintain two identical environments. Switch traffic from Blue (current) to Green (new) instantly. Rollback = switch back. Cost: double the infrastructure. Eliminates version coexistence but requires a full duplicate stack.

Canary: route a small % of traffic (1–5%) to the new version. Monitor error rate and latency. Progressively increase to 100%. Rollback by routing all traffic back to stable. Best risk management. Requires a traffic-splitting mechanism (Nginx, Istio, AWS ALB).

Feature flags: deploy code to 100% of users but control feature activation per user/segment. Decouple deployment from release. Allows dark launches (code deployed, feature off) and targeted rollouts (5% of users, internal team only).

canary = lowest risk · blue/green = fast rollback · feature flags decouple

🔐 Pipeline Security

Secrets management: never hardcode credentials in Jenkinsfiles or pipeline config. Use the Jenkins Credentials Store (username/password, secret text, SSH key, certificates). Reference with withCredentials([string(credentialsId: 'aws-key', variable: 'AWS_KEY')]). Better: integrate with Vault or AWS SSM. Jenkins fetches secrets at runtime, so they are never stored in Jenkins.

Pipeline permissions: use Matrix Authorization Plugin or Role-Based Strategy. Developers can trigger builds; only ops can approve production deployments. Use the input step for manual approval gates:

```groovy
stage('Deploy Production') {
  input { message 'Deploy to production?'; ok 'Deploy'; submitter 'ops-team' }
  steps { sh './deploy.sh prod' }
}
```

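Script approval: pipelines loaded from a Jenkinsfile run in the Groovy sandbox by default (Script Security plugin). Calls to methods that are not on the sandbox allow-list, typically made from script {} blocks, fail until an administrator approves the signature. This stops a malicious Jenkinsfile from reaching into Jenkins internals on the controller. It does not restrict what sh steps run on the agent, so agent isolation still matters.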

credentials store · input for approvals · no secrets in code

📦 Artifact Management

CI produces artifacts: JAR files, Docker images, Helm charts, ZIPs. These must be versioned, stored, and traceable.

Versioning strategy: tag artifacts with the git commit SHA and/or semantic version, for example image:${BUILD_NUMBER}-${GIT_COMMIT[0..7]}. Never use :latest in production artifacts.

Artifact repositories: Nexus or Artifactory for JARs/Maven artifacts; ECR/GCR/Docker Hub for Docker images; Helm repository (ChartMuseum, OCI registry) for Helm charts.

Traceability: link artifact → git commit → CI build → deployment. Record in: MANIFEST.MF (JAR), Docker image labels (LABEL git.commit=${GIT_COMMIT}), deployment annotation in Kubernetes. Essential for incident response: "which commit is running in prod?"

Retention policy: keep the last N builds' artifacts in the artifact store. Prune older ones. Exception: release artifacts (tagged versions) are kept indefinitely.
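
The tagging and labeling scheme above could look roughly like this in a pipeline. It is a sketch only: it assumes the job checks out from Git (so GIT_COMMIT is set), that Docker is available on the agent, and that registry.example.com/myapp is a placeholder image name.

```groovy
pipeline {
  agent any
  stages {
    stage('Build Image') {
      steps {
        // Single-quoted Groovy string: the shell, not Groovy, expands the variables
        sh '''
          SHORT_SHA=$(echo "$GIT_COMMIT" | cut -c1-8)
          IMAGE="registry.example.com/myapp:${BUILD_NUMBER}-${SHORT_SHA}"
          # Record the full commit as an image label for traceability
          docker build --label "git.commit=${GIT_COMMIT}" -t "$IMAGE" .
          docker push "$IMAGE"
        '''
      }
    }
  }
}
```

Running docker inspect on a deployed image then answers "which commit is this?" without consulting the CI server.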

git SHA in artifact tag · never :latest in prod · traceability commit→deploy

🌿 GitOps

GitOps uses Git as the single source of truth for infrastructure and application configuration. The desired state (Kubernetes manifests, Helm values, Terraform) lives in Git. An operator (Argo CD, Flux) continuously reconciles the actual cluster state with the desired state in Git.

Push model (traditional CI/CD): the CI pipeline pushes changes to the cluster. CI needs cluster credentials. The cluster is driven from outside.

Pull model (GitOps): an operator inside the cluster pulls from Git and applies changes. The cluster pulls its own config. No cluster credentials needed in CI. More secure and self-healing (if someone manually changes the cluster, the operator reverts it).

Argo CD: watches a Git repo for changes in Kubernetes manifests. On change, syncs the cluster. Supports: multi-cluster, RBAC, health status, rollback to any Git commit, SSO. Each application in Argo CD points to a path in a Git repo (or Helm chart) and a target cluster + namespace.

Git = source of truth · pull model = no CI creds · Argo CD / Flux

🧪 Quality Gates

Quality gates are automated checks that must pass before a pipeline proceeds. Common gates:

Code quality: SonarQube/SonarCloud analysis with a coverage threshold (e.g., > 80%), no new critical bugs, no new security vulnerabilities. Block merge on gate failure.

Security scanning: OWASP Dependency Check (CVE scan of dependencies), Trivy (Docker image CVE scan), SAST (static analysis security testing: Semgrep, Checkmarx).

Test gates: unit tests pass (100%), integration tests pass, contract tests pass (Pact), performance regression test (< 10% latency increase vs baseline).

Policy gates (OPA/Conftest): validate Kubernetes manifests, Terraform plans, and Dockerfiles against policy rules before applying. Prevents: missing resource limits, containers running as root, missing required labels.

Approval gates: manual approval from a specific team (security, QA, ops) before deploying to production. Logged with who approved and when.
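
As one illustration of a blocking code-quality gate, the sketch below uses the withSonarQubeEnv and waitForQualityGate steps from the SonarQube Scanner plugin for Jenkins. It assumes that plugin is installed, that a server is registered under the hypothetical name 'sonar', and that the server has a webhook pointing back at Jenkins.

```groovy
pipeline {
  agent any
  stages {
    stage('SonarQube Analysis') {
      steps {
        // 'sonar' is a placeholder for the server name configured in Jenkins
        withSonarQubeEnv('sonar') {
          sh 'mvn -B verify sonar:sonar'
        }
      }
    }
    stage('Quality Gate') {
      steps {
        // Bound the wait so a missing webhook cannot hang the build forever
        timeout(time: 10, unit: 'MINUTES') {
          // Fails the pipeline when the gate status is not OK
          waitForQualityGate abortPipeline: true
        }
      }
    }
  }
}
```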

gate = fail fast · SonarQube quality gate · policy as code

Gotchas & Failure Modes

Long-running monolithic pipelines
A single pipeline running all stages sequentially (build → unit test → integration test → security scan → deploy) that takes 45 minutes gives developers feedback too slowly. Parallelize independent stages. Run fast checks (lint, unit tests) first; fail early. Reserve slow checks (integration tests, security scans) for post-merge or parallel jobs.
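
A sketch of that split, assuming a Maven project in a multibranch job: fast checks run in parallel on every branch and stop early, slow checks run in parallel on main only. The integration profile name and the choice of scanner are illustrative.

```groovy
pipeline {
  agent any
  stages {
    stage('Fast Checks') {
      // Abort the sibling branch as soon as one of them fails
      failFast true
      parallel {
        stage('Lint') { steps { sh 'mvn -B checkstyle:check' } }
        stage('Unit Tests') { steps { sh 'mvn -B test' } }
      }
    }
    stage('Slow Checks') {
      // Post-merge only, so pull requests get feedback from the fast stage
      when { branch 'main' }
      parallel {
        stage('Integration Tests') { steps { sh 'mvn -B verify -Pintegration' } }
        stage('Dependency Scan') { steps { sh 'mvn -B org.owasp:dependency-check-maven:check' } }
      }
    }
  }
}
```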

Storing secrets in Jenkinsfiles or environment variables logged by CI
echo $SECRET_KEY in a pipeline step prints the secret to the build log, which is often accessible to all developers. Jenkins masks credentials injected via the Credentials Store, but only if you use withCredentials{} properly, and only for the exact value: a transformed or encoded copy is not masked. Never echo credentials. Pass values to tools through environment variables, env files (for example docker run --env-file), or a secrets manager rather than printing them or placing them on a command line.
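
One common way withCredentials gets used improperly is Groovy string interpolation. The sketch below contrasts the two quoting styles; the credential ID api-token and the URL are hypothetical.

```groovy
pipeline {
  agent any
  stages {
    stage('Call API') {
      steps {
        withCredentials([string(credentialsId: 'api-token', variable: 'API_TOKEN')]) {
          // Risky (left commented out): a double-quoted Groovy string interpolates
          // the secret into the command text before the shell ever runs
          // sh "curl -H 'Authorization: Bearer ${API_TOKEN}' https://api.example.com/deploy"

          // Safer: a single-quoted Groovy string leaves $API_TOKEN for the shell
          // to expand from the environment
          sh 'curl -H "Authorization: Bearer $API_TOKEN" https://api.example.com/deploy'
        }
      }
    }
  }
}
```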

Flaky tests breaking CI reliability
Flaky tests (pass sometimes, fail sometimes without code changes) erode trust in CI. Developers start re-running failed builds without investigating. The CI pipeline becomes a noise generator. Track test flakiness (ReportPortal, Allure). Quarantine known-flaky tests to a separate non-blocking suite. Fix or delete them; do not tolerate them in the blocking pipeline.

No artifact promotion: rebuilding for each environment
Building fresh artifacts for each environment (dev, staging, prod) means what you tested in staging is not what runs in prod. Always build once and promote the same artifact. The artifact (Docker image, JAR) built on CI is tagged with the commit SHA and promoted to production. Environment-specific config comes from Kubernetes ConfigMaps/Secrets or environment variables, not from rebuilding.

Missing rollback strategy in the pipeline
Deploying to production without an automated rollback path is high risk. Before writing the deploy step, define how to roll back. In Kubernetes: kubectl rollout undo. In GitOps: revert the Git commit in the environment repo. The rollback should be a one-command operation from the CI/CD tool, not a 30-minute manual process.

When to Use / When Not To

✓ Use CI/CD When

Any software project where manual building, testing, and deployment steps exist — automate them
Teams with multiple developers merging to shared branches where integration conflicts must be caught early
Multi-environment deployments (dev → staging → production) requiring consistent, repeatable processes
Organizations needing audit trails of what was deployed, when, by whom, and from what commit

✗ Don't Use CI/CD When

Throwaway scripts or one-off experiments where the overhead of a pipeline exceeds the benefit
Purely exploratory data science notebooks where reproducibility is not a current priority

Quick Reference & Comparisons

Jenkins Pipeline Syntax Reference

agent	Where to run: any, none, label 'x', docker{image}, kubernetes{yaml}. Stage-level agent overrides pipeline-level.
stages / stage	Sequence of named stages. Each stage runs steps. Visible in Blue Ocean / Stage View.
steps	Actual work: sh, bat, script, checkout, withCredentials, input, build, echo.
post	Run after stage/pipeline: always (cleanup), success, failure, unstable, changed. Use for: JUnit publish, Slack notify, cleanup.
when	Conditional execution: branch('main'), environment(name:'ENV',value:'prod'), expression{}, not{}, allOf{}, anyOf{}.
parallel	Run stages in parallel: parallel { stage('A'){...} stage('B'){...} }. Significant time savings for independent stages.
input	Pause for human approval: input(message:'Deploy?', ok:'Yes', submitter:'ops-team'). Logs who approved.
environment	Declare env vars for the block: environment { DEPLOY_ENV = 'prod' }. Access as env.DEPLOY_ENV.
options	Build settings: timeout(time:30,unit:'MINUTES'), retry(3), timestamps(), skipDefaultCheckout().
parameters	Pipeline parameters: string, booleanParam, choice. Access as params.PARAM_NAME.
triggers	Auto-trigger: cron('H 2 * * 1-5'), pollSCM('H/5 * * * *'), upstream(upstreamProjects:'job', threshold:hudson.model.Result.SUCCESS).
withCredentials	Inject credentials: withCredentials([string(credentialsId:'id',variable:'VAR')]){sh 'use $VAR'}. Masked in logs.

Deployment Strategy Comparison

Recreate	Stop all v1, start all v2. Downtime during transition. Use only for non-prod or when version coexistence is impossible.
Rolling	Replace instances incrementally. Kubernetes default. Zero downtime. v1 and v2 serve traffic simultaneously — requires backward compatibility.
Blue/Green	Two full environments. Instant cutover. Fast rollback (switch DNS/LB back). Double infrastructure cost. No version coexistence.
Canary	Route small % to new version. Progressive rollout. Lowest risk. Requires traffic splitting (Istio, nginx, ALB). Best for high-traffic services.
Shadow	Duplicate live traffic to new version (no user-visible impact). Compare responses. No production risk. High infra cost. Use before canary for risky changes.
A/B Testing	Route different user segments to different versions. Business experiment, not just risk mitigation. Requires feature flag / routing by header/cookie.

GitHub Actions vs Jenkins vs GitLab CI

Config format	GitHub: YAML in .github/workflows/. Jenkins: Groovy in Jenkinsfile. GitLab: YAML in .gitlab-ci.yml.
Hosting	GitHub Actions: SaaS (GitHub-hosted runners + self-hosted). Jenkins: self-hosted only. GitLab: SaaS + self-hosted.
Runners/Agents	GitHub: GitHub-hosted (Ubuntu/Windows/Mac) + self-hosted. Jenkins: controller+agents, Kubernetes pods. GitLab: shared runners + self-hosted.
Secret management	GitHub: Secrets in repo/org settings. Jenkins: Credentials Store + Vault integration. GitLab: CI/CD variables (masked, protected).
Marketplace/plugins	GitHub: 20k+ Actions in Marketplace. Jenkins: 1800+ plugins (mature, some unmaintained). GitLab: native features + includes.
Multi-repo pipelines	GitHub: reusable workflows. Jenkins: shared libraries, multibranch pipelines. GitLab: include + extends across projects.
Cost	GitHub: free for public; paid by minutes for private (2k free/month). Jenkins: infra cost only. GitLab: free tier + paid tiers.
Best for	GitHub: teams already on GitHub, OSS. Jenkins: existing enterprise installs, complex pipelines. GitLab: integrated DevSecOps platform.

GitOps Tools Comparison

Argo CD	Kubernetes-native. Rich UI. App-of-apps pattern. Sync waves. RBAC. SSO. Multi-cluster. Most popular GitOps tool.
Flux	CNCF graduated. CLI-driven. Stronger multi-tenancy. Kustomize and Helm support. Notification controller. Leaner than Argo CD.
Argo Rollouts	Progressive delivery controller (canary, blue/green) for Kubernetes. Integrates with Argo CD and Istio/Nginx for traffic splitting.
Jenkins X	GitOps-based CI/CD platform for Kubernetes built on Tekton (recent versions no longer run a Jenkins controller, despite the name). Opinionated, complex. Less popular than Argo/Flux.

💻 CLI Commands

java -jar jenkins-cli.jar -s http://jenkins:8080 build my-job -s -v	Trigger build and wait for completion with verbose output
java -jar jenkins-cli.jar -s http://jenkins:8080 console my-job 42	Get console output of build #42
java -jar jenkins-cli.jar -s http://jenkins:8080 list-jobs	List all jobs
java -jar jenkins-cli.jar -s http://jenkins:8080 reload-configuration	Reload Jenkins config from disk

argocd app list	List all Argo CD applications and their sync status
argocd app sync my-app --prune	Sync application with Git; prune resources removed from Git
argocd app rollback my-app 42	Rollback to history revision 42
argocd app set my-app --sync-policy automated --auto-prune	Enable automated sync and pruning for an application

Interview Q & A

Senior Engineer — Execution Depth

S-01 What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?

Continuous Integration: developers merge code to a shared branch frequently (daily or more). Every merge triggers an automated build and test run. Failures are fixed immediately. The goal: detect integration issues early, keep the main branch always buildable.

Continuous Delivery: the CI pipeline produces a deployable artifact that is always production-ready. Deployment to production is automated but triggered by a human (manual approval gate or button click). The organization can deploy at any time; the decision is business-driven, not technical.

Continuous Deployment: extends Continuous Delivery by removing the manual gate. Every commit that passes all automated tests is automatically deployed to production. Requires extremely high test confidence, feature flags for risk management, and instant rollback capability.

The spectrum: most orgs target Continuous Delivery, automated all the way to staging with a manual approval for production. Netflix, GitHub, and others practice Continuous Deployment. The distinction is cultural (trust in automation) as much as technical.

Key enabler for both: small, frequent commits. Deploying once a week makes each deploy high-risk. Deploying 10 times a day makes each change small and easy to roll back.

S-02 Explain the structure of a Jenkins Declarative Pipeline. What are the key sections?

A Declarative Pipeline has a mandatory pipeline {} wrapper. Key sections:

```groovy
pipeline {
  agent { label 'linux' }                      // Where to run
  options {
    timeout(time: 30, unit: 'MINUTES')
    timestamps()
  }
  environment { APP_ENV = 'staging' }          // Pipeline-wide env vars

  stages {
    stage('Build') {
      steps {
        sh 'mvn package -DskipTests'
        stash name: 'artifact', includes: 'target/*.jar'
      }
    }
    stage('Test') {
      parallel {                               // Parallel sub-stages
        stage('Unit') { steps { sh 'mvn test' } }
        stage('Lint') { steps { sh 'mvn checkstyle:check' } }
      }
      post { always { junit 'target/surefire-reports/**/*.xml' } }
    }
    stage('Deploy') {
      when { branch 'main' }                   // Condition
      input { message 'Deploy?'; submitter 'ops' }
      steps {
        unstash 'artifact'
        withCredentials([string(credentialsId: 'deploy-key', variable: 'KEY')]) {
          sh './deploy.sh $KEY'
        }
      }
    }
  }
  post {
    success { slackSend message: "Build ${BUILD_NUMBER} succeeded" }
    failure { slackSend color: 'danger', message: "Build failed" }
  }
}
```

Key sections: agent (execution environment), options (timeout, retry, timestamps), environment (env vars), stages (ordered stages), parallel (concurrent stages), when (conditions), input (approval gate), post (cleanup/notification).

S-03 How would you implement a multi-environment promotion pipeline (dev → staging → production)?

Build once, promote the same artifact — this is the core principle. Do not rebuild for each environment. The artifact tested in staging must be identical to what's deployed in production. Pipeline structure:

```groovy
stages {
  stage('Build & Test') {
    steps {
      sh 'mvn verify'
      sh "docker build -t myapp:${GIT_COMMIT} ."
      sh "docker push myapp:${GIT_COMMIT}"
    }
  }
  stage('Deploy Dev') {
    steps { sh "kubectl set image deployment/app app=myapp:${GIT_COMMIT} -n dev" }
  }
  stage('Integration Tests') {
    steps { sh "mvn verify -Pintegration -Dbase.url=https://dev.internal" }
  }
  stage('Deploy Staging') {
    steps { sh "kubectl set image deployment/app app=myapp:${GIT_COMMIT} -n staging" }
  }
  stage('Smoke Tests Staging') {
    steps { sh './smoke-test.sh staging' }
  }
  stage('Deploy Production') {
    when { branch 'main' }
    input { message 'Approve prod deploy?'; submitter 'ops-team' }
    steps { sh "kubectl set image deployment/app app=myapp:${GIT_COMMIT} -n prod" }
  }
}
```

Environment config: Kubernetes ConfigMaps and Secrets per namespace provide environment-specific config. The image is the same; only the config differs.

Rollback: kubectl rollout undo deployment/app -n prod reverts the Deployment to its previous revision, and with it the previous image. In GitOps: revert the image tag commit in the environment repo.

S-04 What is GitOps and how does it differ from traditional CI/CD push deployments?

Traditional push CI/CD: the CI pipeline builds the artifact and calls kubectl apply or similar to push the change to the cluster. The pipeline needs cluster credentials. If the pipeline is down, deployments can't happen. Drift between desired and actual state is possible (someone edits a resource with kubectl edit: no record, no revert).

GitOps: a Git repository is the single source of truth for the desired cluster state (Kubernetes manifests, Helm values). An operator (Argo CD, Flux) runs inside the cluster, watches the Git repo, and continuously reconciles the cluster to match Git.

Key differences:
- No cluster credentials in CI: the operator pulls from Git; CI only pushes to Git (updates the image tag in manifests). Much smaller attack surface.
- Self-healing: if someone manually changes something in the cluster, the operator reverts it to match Git. Drift is corrected automatically.
- Audit trail in Git: every deployment is a Git commit with author, timestamp, and diff. Rollback = git revert; Argo CD syncs the revert automatically.
- Separation of concerns: CI (build and test) → updates Git manifest. GitOps operator → deploys from Git to cluster. Two separate tools, two separate concerns.

In practice: CI builds myapp:abc123, updates deployment.yaml image tag to abc123, commits to the infra repo. Argo CD sees the commit and syncs the cluster.
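
The CI half of that hand-off might be sketched as below. Everything specific here is an assumption for illustration: the infra repository URL, the dev/deployment.yaml path, the image: myapp:<tag> line format that sed rewrites, the ci-bot identity, and the credential ID. gitUsernamePassword is the credentials binding provided by the Jenkins Git plugin.

```groovy
pipeline {
  agent any
  stages {
    stage('Update Manifest') {
      steps {
        // Git credentials for the infra repo only; no cluster credentials involved
        withCredentials([gitUsernamePassword(credentialsId: 'infra-repo-creds')]) {
          sh '''
            git clone https://git.example.com/org/infra.git
            cd infra
            # Point the dev manifest at the image built from this commit
            sed -i "s|image: myapp:.*|image: myapp:${GIT_COMMIT}|" dev/deployment.yaml
            git -c user.name="ci-bot" -c user.email="ci-bot@example.com" \
              commit -am "Deploy myapp ${GIT_COMMIT} to dev"
            git push origin main
          '''
        }
      }
    }
  }
}
```

From this point the pipeline is done; the operator in the cluster notices the new commit and performs the actual rollout.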

S-05 How do you implement a canary deployment strategy in a CI/CD pipeline?

Requirements: traffic splitting mechanism (Kubernetes + Istio, nginx-ingress, AWS ALB weighted target groups) and automated rollback trigger.

Pipeline flow:
1. Build and push myapp:v1.2.0
2. Deploy canary: create a separate myapp-canary Deployment with replicas: 1 running v1.2.0. Existing myapp Deployment still runs v1.1.0 at replicas: 9.
3. Configure Istio VirtualService: weight: 10 to canary, weight: 90 to stable.
4. Observe metrics for 10 minutes: error rate, P99 latency on canary pods (Datadog/Prometheus monitor scoped to canary pod label).
5. Gate check: if canary error rate > 1% → automated rollback (delete canary deployment, reset weights to 100% stable). If metrics are healthy → continue.
6. Progressive promotion: 10% → 25% → 50% → 100%. At each step, repeat the metric gate.
7. Full promotion: update stable Deployment to v1.2.0, remove canary Deployment, reset weights.

Automated rollback in the pipeline:

```groovy
stage('Canary Monitor') {
  steps {
    script {
      def errorRate = sh(script: './check-canary-error-rate.sh', returnStdout: true).trim()
      if (errorRate.toFloat() > 1.0) {
        sh './rollback-canary.sh'
        error "Canary rollback: error rate ${errorRate}%"
      }
    }
  }
}
```

Argo Rollouts automates this entire flow with built-in Prometheus/Datadog metric analysis and automatic promotion/rollback.
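
For a hand-rolled pipeline without Argo Rollouts, steps 5 and 6 (repeat the metric gate at each traffic weight) can be sketched as a loop. The set-canary-weight.sh script is hypothetical; check-canary-error-rate.sh and rollback-canary.sh are the helper scripts already used in the stage above, and the 1% threshold and 10-minute soak come from the flow described in the answer.

```groovy
pipeline {
  agent any
  stages {
    stage('Progressive Rollout') {
      steps {
        script {
          for (weight in [10, 25, 50, 100]) {
            // Hypothetical helper that updates the traffic split (e.g. the VirtualService weights)
            sh "./set-canary-weight.sh ${weight}"
            // Let metrics accumulate at this weight before judging it
            sleep time: 10, unit: 'MINUTES'
            def errorRate = sh(script: './check-canary-error-rate.sh', returnStdout: true).trim()
            if (errorRate.toFloat() > 1.0) {
              sh './rollback-canary.sh'
              error "Canary rollback at ${weight}%: error rate ${errorRate}%"
            }
          }
        }
      }
    }
  }
}
```

The error step ends the loop and fails the build, so a later weight is never applied after a bad reading.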

S-06 How do you handle secrets in a CI/CD pipeline securely?

Anti-patterns:
- Hardcoding secrets in Jenkinsfile or .github/workflows/*.yml
- Storing secrets in environment variables that get logged (echo $SECRET)
- Committing .env files with production credentials

Jenkins Credentials Store: store credentials in Jenkins (encrypted at rest). Reference in pipeline:

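```groovy
withCredentials([
  // The AWS CLI reads these two variable names from the environment;
  // 'aws-secret-key' is an illustrative credential ID
  string(credentialsId: 'aws-access-key', variable: 'AWS_ACCESS_KEY_ID'),
  string(credentialsId: 'aws-secret-key', variable: 'AWS_SECRET_ACCESS_KEY'),
  usernamePassword(credentialsId: 'db-creds', usernameVariable: 'DB_USER', passwordVariable: 'DB_PASS')
]) {
  // No key on the command line: the CLI has no flag for it and it would show up in process listings
  sh 'aws s3 cp ...'
}
```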

Jenkins masks the credential values in logs (replaces with ****).

Better: HashiCorp Vault. Jenkins fetches secrets from Vault at runtime using the Jenkins Vault plugin. Vault can issue short-lived, audited, automatically rotated credentials, so the only long-lived secret left in Jenkins is the one it uses to authenticate to Vault.

GitHub Actions: store in repo/org Secrets (encrypted, masked in logs). Access as ${{ secrets.MY_SECRET }}. Never expose in echo steps. Use OIDC (OpenID Connect) with AWS/GCP to avoid static credentials entirely: GitHub Actions exchanges an OIDC token for temporary cloud credentials.

Principle: treat pipeline secrets like any other production secrets. Audit access. Rotate regularly. Prefer short-lived credentials over long-lived API keys.

S-07 How do you reduce flaky tests in a CI pipeline?

Identify flaky tests: track pass/fail rates per test over N runs. Tests that pass < 98% of the time without code changes are candidates. Tools: Gradle Test Retry Plugin, pytest-rerunfailures, ReportPortal, GitHub Actions test summary.

Common causes and fixes:
- Time-dependent tests: Thread.sleep(500) hoping a background process finishes. Fix: use await().atMost(5, SECONDS).until(condition) (Awaitility).
- Shared test state: static variables modified by tests; database not cleaned between tests. Fix: @BeforeEach resets state; use @Transactional + rollback for DB tests; Testcontainers for fully isolated DBs per test run.
- Port conflicts: tests hardcode localhost:8080; parallel test runs clash. Fix: use random ports (webEnvironment = RANDOM_PORT with @LocalServerPort in Spring Boot tests).
- Order-dependent tests: test B passes only after test A ran. Fix: each test must set up its own preconditions; never rely on test execution order.

Quarantine strategy: move known-flaky tests to a @Flaky suite that runs separately and doesn't block the main pipeline. Alert the owning team. Give them 1 sprint to fix before the test is deleted. Deleting a flaky test is better than ignoring it.

S-08 What is 'shift left' in the context of CI/CD pipelines?

Shift left means moving quality and security checks earlier in the development process: further left on the timeline from development → testing → staging → production.

Why: the cost of fixing a bug or vulnerability increases dramatically the later it's found. A bug caught in a pre-commit hook costs minutes. The same bug caught in production costs hours of incident response, potential data loss, and reputation damage.

Shift-left practices:

Pre-commit (developer machine):
- Git hooks (pre-commit, Husky): run linters, formatters, secret scanners (git-secrets) before the commit is even created.

PR / CI:
- Unit tests, static analysis (Checkstyle, SpotBugs), SAST (Semgrep, Checkmarx)
- Dependency CVE scan (OWASP, Snyk) on every PR, not just nightly
- Contract tests (Pact) on every PR to catch API breaking changes before merge

Pre-staging:
- DAST (Dynamic Application Security Testing) against a staging environment
- Performance test to catch regressions before production

Outcome: production incidents decrease because issues are caught and fixed when context is fresh and the cost is low. Developer velocity increases because late-stage bugs don't cause emergency rollbacks.

Staff Engineer — Design & Cross-System Thinking

ST-01 How do you design a CI/CD platform for 50 teams to use self-service with governance and cost control?

Core challenge: 50 teams need autonomy to ship fast, but the platform must enforce security, compliance, and cost guardrails without becoming a bottleneck.

Paved road approach: provide pre-built pipeline templates for common patterns: Spring Boot service, React SPA, Python ML model. Teams use the template (one-line include/reference) and get the following without knowing any of the implementation details:
- Build, test, Docker image build, scan, push
- Security gates (SAST, Trivy, OWASP)
- Argo CD GitOps deployment
- Standard observability (Datadog integration)

Teams who need custom steps can extend the template, not replace it.

Jenkins shared libraries / GitHub Actions reusable workflows: the platform team maintains company/pipeline-library. Teams call @Library('pipeline-library') import company.StandardPipeline. Updates to the library apply to all teams' pipelines on next run.

Self-service infra: teams create new pipelines via a portal (click-ops or terraform apply). No platform team approval needed for standard pipelines. Non-standard pipelines require review.

Governance gates (non-negotiable):
- All builds must pass security scan (CVE gate, SAST)
- All production deployments must come from a signed artifact (cosign)
- All prod deploys require approval from two engineers (enforced via GitHub protected branches + Argo CD sync policy)
- Audit log of every deployment: who triggered, what artifact, what commit

Cost control:
- CI runner quotas per team (GitHub Actions: billing by minutes)
- Build timeout enforced by shared library (default 30 min; teams can request extension)
- Monthly cost report per team; alert when budget exceeded
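
One way the shared-library template could be shaped is a global variable that wraps a whole Declarative pipeline. This is a sketch, not the platform's actual library: the file name vars/standardPipeline.groovy and the mavenGoals config key are hypothetical, and only the 30-minute timeout comes from the text above.

```groovy
// vars/standardPipeline.groovy in the shared library repository (hypothetical name)
def call(Map config = [:]) {
  // Fall back to a default Maven goal when the calling team passes none
  def mavenGoals = config.mavenGoals ?: 'verify'

  pipeline {
    agent { label 'linux' }
    options {
      // Platform-wide default build timeout, enforced for every team
      timeout(time: 30, unit: 'MINUTES')
    }
    stages {
      stage('Build & Test') {
        steps {
          sh "mvn -B ${mavenGoals}"
        }
      }
      // Security gates, image build, and deployment stages would follow here
    }
  }
}
```

A team's Jenkinsfile then shrinks to a library reference and one call:

```groovy
@Library('pipeline-library') _
standardPipeline(mavenGoals: 'verify')
```

Because the stages live in the library, a change to the timeout or a new security gate reaches every team on their next build without edits to their repositories.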

ST-02 How do you implement database schema migrations safely in a CI/CD pipeline?

Core principle: backward-compatible migrations. Because rolling deployments mean v1 and v2 of the application run simultaneously against the same database, schema changes must be compatible with both versions during the transition period.

Expand-Contract pattern:
- Phase 1 (Expand): add new column as nullable with default. Both v1 (ignores new column) and v2 (uses new column) can coexist.
- Phase 2 (Migrate): backfill data; application code fully uses new column.
- Phase 3 (Contract): make column NOT NULL; remove old column. Only after all v1 instances are gone.

Migration tools in the pipeline: Flyway or Liquibase. Migrations are SQL files versioned in Git alongside application code. The application runs migrations on startup (spring.flyway.enabled=true) OR a pipeline step runs them before deployment:

```groovy
stage('DB Migrate') {
  steps { sh 'flyway -url=$DB_URL -user=$DB_USER migrate' }
}
stage('Deploy') {
  steps { sh 'kubectl rollout ...' }
}
```

Safety in CI:
- Run migrations against a copy of production data in staging before touching prod
- Pre-check: flyway validate confirms that the migrations already applied to the database match the migration files in Git
- Rollback: Flyway Community doesn't support undo; design migrations to be forward-only. An "undo" is a new migration that reverts the schema change.

Never: run DDL in the middle of a rolling deployment. Add a column before deploying new code; remove an old column only after all old code is gone.

ST-03 How do you handle a failed production deployment in a CI/CD pipeline?

The goal: restore service within minutes, investigate without pressure, prevent recurrence.

Automated detection: the pipeline's deployment step must verify success before marking the job green. Kubernetes: wait for the rollout to complete and all pods to pass health checks: kubectl rollout status deployment/myapp -n prod --timeout=5m. If this times out or pods enter CrashLoopBackOff → deployment step fails.

Automated rollback (preferred): if the deploy step fails: kubectl rollout undo deployment/myapp -n prod. In GitOps: revert the image tag commit in the environment repo; Argo CD auto-syncs the revert. Pipeline sends alert: "Deploy v1.2.0 failed. Rolled back to v1.1.9. See [runbook]." Total time to restored service: 2–5 minutes.

Post-incident (within 24h):
- Root cause: what failed? Health check? OOMKilled? DB migration? Config error?
- Canary deployment: if this had been a 5% canary, would the health check have caught it before 100% rollout? If yes, add canary as a required step.
- Add the specific failure mode to the test suite (regression test).
- Update the runbook with the specific failure pattern and resolution.

Never: hotfix directly on the cluster without going through the pipeline. Hotfixes in prod that aren't in Git create drift; the next deploy overwrites the hotfix. Fast-track a hotfix through the pipeline and keep the golden path intact even under pressure.
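
The detection and rollback steps above might be wired together like this in a push-style pipeline. It is a sketch under assumptions: the container name app inside deployment/myapp and the image reference are illustrative, and the agent is assumed to have kubectl access to the prod namespace.

```groovy
pipeline {
  agent any
  stages {
    stage('Deploy Production') {
      steps {
        sh "kubectl set image deployment/myapp app=myapp:${env.GIT_COMMIT} -n prod"
        // Fails the stage if pods do not become ready within 5 minutes
        sh 'kubectl rollout status deployment/myapp -n prod --timeout=5m'
      }
      post {
        failure {
          // Return the Deployment to its previous revision
          sh 'kubectl rollout undo deployment/myapp -n prod'
        }
      }
    }
  }
}
```

The build still ends red after the rollback, which is intentional: service is restored, but the failed release remains visible for the post-incident review.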

Principal Engineer — Architecture & Org-Scale Thinking

P-01 How do you design a DORA-metrics-driven engineering improvement program using CI/CD telemetry?

DORA four key metrics (from the State of DevOps research) measure software delivery performance. The tier thresholds shift between annual reports, so treat the figures below as indicative.

- Deployment Frequency: how often deploys happen per service. Elite: multiple times/day. High: daily–weekly. Medium: weekly–monthly. Low: monthly+. Measure: count deploy events per service per week (Argo CD sync history, Jenkins build log).
- Lead Time for Changes: time from code commit to running in production. Elite: < 1 hour. High: < 1 day. Medium: < 1 week. Low: > 1 month. Measure: commit timestamp → production deploy timestamp. Extract from Git + deploy events.
- Change Failure Rate: % of deployments causing incidents/rollbacks. Elite: 0–15%. High: 0–15%. Medium: 16–30%. Low: > 30%. Measure: link PagerDuty incidents to the preceding deploy (within 1 hour).
- Failed Deployment Recovery Time (MTTR): time from incident start to service restored. Elite: < 1 hour. High: < 1 day. Low: > 1 week. Measure: PagerDuty alert → resolution timestamp.

Operationalizing the metrics:
- Dashboard: DORA metrics per team, per service, trending over 90 days
- Monthly engineering all-hands: publish org-wide DORA percentile and celebrate improvements
- Targeted improvement programs: teams in the "Low" tier get a platform engineer embedded for one quarter
- Connecting to practices: slow lead time → investigate pipeline stages (where is it waiting?); high CFR → invest in canary deployments and better tests; long MTTR → invest in observability and runbooks

Anti-pattern: using DORA metrics for individual performance evaluation. They measure system outcomes, not individual effort. A team in a "Low" tier may be working on critical compliance infrastructure with mandatory change windows — context matters.

System Design Scenarios

Design a CI/CD Pipeline for a Spring Boot Microservice

Problem

Design a complete CI/CD pipeline for a Spring Boot microservice deployed to Kubernetes. Requirements: automated build and test on every PR, Docker image build and push, deployment to dev automatically on merge to main, staging after manual approval, production after a second approval. Rollback must be possible in under 5 minutes.

Constraints

GitHub Actions for CI, Argo CD for GitOps CD
ECR for Docker images
SonarCloud for code quality gate
No production deploys without two-person approval

Key Discussion Points

PR pipeline (.github/workflows/pr.yml): Trigger: pull_request. Steps: checkout → setup Java 21 (actions/setup-java) → Maven cache (actions/cache, ~/.m2) → mvn verify (unit + integration tests) → SonarCloud analysis → Trivy scan (no image exists at this stage, so either run a filesystem scan with trivy fs or build the image first and scan it with trivy image) → post build summary. Gate: all must pass before merge is allowed (branch protection rules).
Main branch pipeline (.github/workflows/main.yml): Trigger: push to main. Steps: mvn package -DskipTests → docker build -t ecr/myapp:${GITHUB_SHA} → docker push → update infra/dev/deployment.yaml image tag to ${GITHUB_SHA} → commit + push to infra repo. Argo CD detects the commit and syncs dev automatically.
Staging promotion: Manual workflow dispatch or GitHub Environment with required reviewers. Job: update infra/staging/deployment.yaml image tag → commit → Argo CD syncs staging. GitHub Environment "staging" requires one reviewer from the qa-team.
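Production promotion: GitHub Environment "production" lists ops-team as required reviewers and restricts deployments to the main branch. An Environment's required-reviewers rule is satisfied by a single approval, so enforce the two-person requirement separately, for example by making the promotion a pull request against the infra repo with branch protection that requires two approving reviews. Time-window restrictions are not a built-in Environment setting and typically need a custom deployment protection rule. After approval: update infra/prod/deployment.yaml → Argo CD syncs, using sync wave annotations to run the DB migration before the app deployment.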
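Rollback: Argo CD History and Rollback view → select previous revision → Rollback. Argo CD updates the deployment; Kubernetes rolls out the previous image. Under 3 minutes. Argo CD refuses a rollback while automated sync is enabled, so disable auto-sync on the application first and re-enable it once Git matches the cluster. Alternatively, and more in keeping with GitOps: git revert the image tag commit in the infra repo → Argo CD auto-syncs.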
OIDC for AWS auth: no static AWS credentials in GitHub secrets. Configure OIDC trust between GitHub Actions and AWS IAM. Actions exchange the GitHub OIDC token for temporary ECR push credentials. Credentials are short-lived (one hour by default; the role session can be shortened to the 15-minute STS minimum), scoped to ECR push only, and audited in CloudTrail.
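The "update the image tag, commit, let Argo CD sync" step that recurs in the dev, staging, and production promotions can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the manifest is assumed to contain a single line of the form image: <repo>:<tag>, and the repo string is used as a regular expression (so a dot matches any character). A real setup might use a YAML-aware tool instead of sed.

```shell
# get_image_tag <manifest> <image-repo>
# Prints the tag currently referenced for <image-repo>.
get_image_tag() {
  sed -n -E "s|.*image:[[:space:]]*${2}:([^[:space:]]+).*|\1|p" "$1" | head -n 1
}

# set_image_tag <manifest> <image-repo> <tag>
# Rewrites "image: <image-repo>:<old>" to "image: <image-repo>:<tag>".
set_image_tag() {
  local tmp
  tmp="$(mktemp)"
  sed -E "s|(image:[[:space:]]*${2}):[^[:space:]]+|\1:${3}|" "$1" > "$tmp" \
    && cat "$tmp" > "$1"   # write back in place, keeping file permissions
  rm -f "$tmp"
}

# promote_image <from-manifest> <to-manifest> <image-repo>
# Copies the tag from one environment to the next; nothing is rebuilt.
promote_image() {
  set_image_tag "$2" "$3" "$(get_image_tag "$1" "$3")"
}
```

In the main-branch workflow this would be set_image_tag infra/dev/deployment.yaml ecr/myapp "$GITHUB_SHA" followed by a commit and push to the infra repo. For staging, promote_image infra/dev/deployment.yaml infra/staging/deployment.yaml ecr/myapp carries the already-tested tag forward, so the image that was tested is the image that ships.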

🚩 Red Flags

Building a new Docker image for staging/production instead of promoting the same image from dev — what you tested is not what you deploy
Static AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in GitHub secrets — use OIDC for short-lived credentials
No rollback plan — rolling out to 100% with no way to undo quickly
SonarCloud quality gate not blocking PR merge — analysis without enforcement is theater

Recover from a Broken CI/CD Pipeline Blocking All Deployments

Problem

It's Friday afternoon. A Jenkins pipeline change broke the shared library used by all 50 teams' pipelines. Every build is now failing with NoSuchMethodError in the shared library. No team can deploy. A critical security patch needs to go to production today.

Constraints

Shared Jenkins library is used by all 50 pipelines
The security patch is ready and tested — only deployment is blocked
Cannot deploy directly to the cluster bypassing the pipeline (SOC 2 requirement)
The pipeline engineer who made the change is unavailable

Key Discussion Points

Immediate: revert the shared library change. Find the breaking commit in the shared library repo: git log --oneline -10. Revert it by SHA: git revert <sha> --no-edit (git revert HEAD only works if the breaking commit is the most recent one) → push to main. Pipelines that load the library's default version pick up the revert on their next run; pipelines pinned to a specific tag are unaffected either way. Verify: re-run one pipeline to confirm it passes.
Emergency deploy the security patch: Once the library is reverted, the security patch's pipeline runs normally. Deploy through the standard pipeline (now unblocked). This maintains the audit trail.
Root cause analysis (next week): The shared library had no automated tests. The breaking change was tested manually by one person against one pipeline. Fix: add a test suite for the shared library (run all pipelines in dry-run mode against the new library version in a staging Jenkins). Require PR approval from two people for shared library changes.
Library versioning: Don't pin all pipelines to main of the shared library, because then every library change immediately affects all 50 teams. Version the library with semantic versioning. Pipelines pin to a major version: @Library('company-lib@v2'), where v2 is a branch or tag in the library repo. Breaking changes bump the major version, and teams upgrade at their own pace. A breaking change released as v3 only affects pipelines that have opted into v3.
Circuit breaker for the platform: If the shared library pipeline itself is the gate, the platform team needs a "break-glass" procedure: documented, approved, audited way to do an emergency deploy without the standard pipeline. Requires CISO approval and is logged. Used only in genuine emergencies — not as a shortcut.
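The immediate revert from the first discussion point can be sketched as a guarded helper. This is a minimal sketch: the function name revert_library_commit is hypothetical, and it assumes the breaking commit's SHA has already been identified with git log. Pushing is left to the caller so the revert can be reviewed first.

```shell
# revert_library_commit <repo-dir> <bad-sha>
# Creates a new commit that undoes <bad-sha>; history stays intact,
# which keeps the audit trail that the SOC 2 constraint depends on.
revert_library_commit() {
  local repo="$1"
  local bad_sha="$2"

  # Refuse to run with uncommitted changes to tracked files, so the
  # revert commit contains only the undo.
  if [ -n "$(git -C "$repo" status --porcelain --untracked-files=no)" ]; then
    echo "working tree not clean: ${repo}" >&2
    return 1
  fi

  git -C "$repo" revert --no-edit "$bad_sha"
}
```

Typical use would be revert_library_commit . <sha> followed by git push origin main, then re-running one consuming pipeline to confirm the NoSuchMethodError is gone before announcing the all-clear.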

🚩 Red Flags

Deploying directly to the cluster via kubectl bypassing the pipeline — violates SOC 2 controls
Not reverting first before investigating root cause — restoring service is the first priority
Shared library pinned to 'main' with no versioning — one bad commit breaks all 50 teams
No test suite for the shared library — changes are tested manually in production
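One way to catch the "pinned to main" red flag before it bites is a small audit over the consuming repositories. The sketch below is an assumption-laden illustration: the function name find_unpinned_libs is hypothetical, it only looks at files named Jenkinsfile*, and it only recognizes the single-library form @Library('name') or @Library('name@main'), not the list form.

```shell
# find_unpinned_libs <dir> <library-name>
# Prints every Jenkinsfile under <dir> that loads <library-name> with no
# version, or pinned to main. Inspect the output, not the exit status.
find_unpinned_libs() {
  find "$1" -type f -name 'Jenkinsfile*' \
    -exec grep -lE "@Library\\(['\"]${2}(@main)?['\"]\\)" {} +
}
```

Running find_unpinned_libs ./checkouts company-lib over a directory of cloned repos lists the pipelines that would break on the next bad commit to main; a pipeline using @Library('company-lib@v2') is not reported.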