// pipelines · stages · agents · GitOps · deployment strategies · senior → principal
A Jenkinsfile committed to the repo is the standard way to define a pipeline. Two syntaxes: Declarative (structured, easier to read, requires a top-level pipeline {} block) and Scripted (Groovy DSL, full flexibility, harder to read). Prefer Declarative, and use script {} blocks for complex Groovy logic within a Declarative pipeline.
```groovy
pipeline {
    agent { label 'linux' }
    stages {
        stage('Build') {
            steps { sh 'mvn package -DskipTests' }
        }
        stage('Test') {
            steps { sh 'mvn test' }
            post { always { junit 'target/surefire-reports/*.xml' } }
        }
        stage('Deploy') {
            when { branch 'main' }
            steps { sh './deploy.sh production' }
        }
    }
}
```
Agent options: any (any available agent), label 'linux' (agents with that label), docker { image 'maven:3.9' } (run in a container on the agent), kubernetes {} (spin up a pod per build; highly scalable, ephemeral).
Kubernetes agents (Jenkins Kubernetes Plugin): Each build spawns a fresh Kubernetes pod with a jnlp container (Jenkins agent) plus any tool containers you declare. The pod is destroyed after the build. Enables: parallel builds limited only by cluster capacity, clean environments per build, no agent config drift.
Agent labeling: tag agents by capability (docker, gpu, windows, high-mem). Pipeline steps request the right label — no manual assignment needed.
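A Kubernetes pod agent as described above might look roughly like the following. This is a minimal sketch that assumes the Kubernetes plugin is installed and a cloud is configured; the container name `maven` and the pod spec are illustrative choices, not taken from the text above.

```groovy
pipeline {
    agent {
        kubernetes {
            // Pod spec for this build. The plugin adds the jnlp agent container itself.
            yaml '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: maven
    image: maven:3.9
    command: ['sleep']
    args: ['infinity']
'''
        }
    }
    stages {
        stage('Build') {
            steps {
                // Run the step inside the tool container rather than the jnlp container.
                container('maven') {
                    sh 'mvn package -DskipTests'
                }
            }
        }
    }
}
```

The `sleep infinity` command only keeps the tool container alive so that steps can be executed in it; the pod is removed when the build ends.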
withCredentials([string(credentialsId: 'aws-key', variable: 'AWS_KEY')]). Better: integrate with Vault or AWS SSM — Jenkins fetches secrets at runtime, they're never stored in Jenkins.
Pipeline permissions: use Matrix Authorization Plugin or Role-Based Strategy. Developers can trigger builds; only ops can approve production deployments. Use input step for manual approval gates:
```groovy
stage('Deploy Production') {
    input {
        message 'Deploy to production?'
        ok 'Deploy'
        submitter 'ops-team'
    }
    steps { sh './deploy.sh prod' }
}
```
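Script approval: Pipelines run in the Groovy sandbox by default. script {} blocks that call Groovy methods outside the sandbox allowlist require admin approval (Script Security plugin). This prevents malicious Jenkinsfiles from running arbitrary Groovy against the Jenkins controller; shell commands in sh steps still run with the agent's permissions.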
image:${BUILD_NUMBER}-${GIT_COMMIT[0..7]}. Never use :latest in production artifacts.
Artifact repositories: Nexus or Artifactory for JARs/Maven artifacts; ECR/GCR/Docker Hub for Docker images; Helm repository (ChartMuseum, OCI registry) for Helm charts.
Traceability: link artifact → git commit → CI build → deployment. Record in: MANIFEST.MF (JAR), Docker image labels (LABEL git.commit=${GIT_COMMIT}), deployment annotation in Kubernetes. Essential for incident response: "which commit is running in prod?"
Retention policy: keep last N builds' artifacts in the artifact store. Prune older ones. Exception: release artifacts (tagged versions) are kept indefinitely.
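The tagging and traceability conventions above could be wired into a build stage along these lines. This is a sketch, assuming `GIT_COMMIT` is populated by the job's SCM checkout; `IMAGE_TAG` and the image name `myapp` are hypothetical names introduced for illustration.

```groovy
pipeline {
    agent { label 'linux' }
    stages {
        stage('Build Image') {
            steps {
                script {
                    // Build number plus the first 8 characters of the commit (same slice as GIT_COMMIT[0..7]).
                    env.IMAGE_TAG = "${env.BUILD_NUMBER}-${env.GIT_COMMIT.substring(0, 8)}"
                }
                // --label stores the full commit hash in the image metadata for traceability.
                sh 'docker build --label git.commit=$GIT_COMMIT -t myapp:$IMAGE_TAG .'
            }
        }
    }
}
```

With the label in place, `docker inspect` on a running image shows which commit produced it.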
echo $SECRET_KEY in a pipeline step prints the secret to the build log, which is often accessible to all developers. Jenkins masks credentials injected from the Credentials Store, but only when they are bound with withCredentials {} and appear in the log verbatim. Never echo credentials. Pass secrets to tools through environment files (for example docker run --env-file) or a secrets manager instead of printing values.
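One common way secrets leak despite masking is Groovy string interpolation. The sketch below contrasts the two quoting styles; the credentials ID `deploy-token` and the URL are hypothetical.

```groovy
withCredentials([string(credentialsId: 'deploy-token', variable: 'DEPLOY_TOKEN')]) {
    // Single quotes: the shell expands $DEPLOY_TOKEN at runtime, so the secret
    // never passes through Groovy string interpolation.
    sh 'curl -fsS -H "Authorization: Bearer $DEPLOY_TOKEN" https://deploy.example.internal/release'

    // Avoid: double quotes make Groovy insert the secret into the command string
    // before the shell runs, which can expose it in logs and process listings.
    // sh "curl -fsS -H 'Authorization: Bearer ${DEPLOY_TOKEN}' https://deploy.example.internal/release"
}
```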
kubectl rollout undo. In GitOps: revert the Git commit in the environment repo. The rollback should be a one-command operation from the CI/CD tool — not a 30-minute manual process.

| Directive | Purpose |
|---|---|
| agent | Where to run: any, none, label 'x', docker{image}, kubernetes{yaml}. Stage-level agent overrides pipeline-level. |
| stages / stage | Sequence of named stages. Each stage runs steps. Visible in Blue Ocean / Stage View. |
| steps | Actual work: sh, bat, script, checkout, withCredentials, input, build, echo. |
| post | Run after stage/pipeline: always (cleanup), success, failure, unstable, changed. Use for: JUnit publish, Slack notify, cleanup. |
| when | Conditional execution: branch('main'), environment(name:'ENV',value:'prod'), expression{}, not{}, allOf{}, anyOf{}. |
| parallel | Run stages in parallel: parallel { stage('A'){...} stage('B'){...} }. Significant time savings for independent stages. |
| input | Pause for human approval: input(message:'Deploy?', ok:'Yes', submitter:'ops-team'). Logs who approved. |
| environment | Declare env vars for the block: environment { DEPLOY_ENV = 'prod' }. Access as env.DEPLOY_ENV. |
| options | Build settings: timeout(time:30,unit:'MINUTES'), retry(3), timestamps(), skipDefaultCheckout(). |
| parameters | Pipeline parameters: string, booleanParam, choice. Access as params.PARAM_NAME. |
| triggers | Auto-trigger: cron('H 2 * * 1-5'), pollSCM('H/5 * * * *'), upstream(upstreamProjects:'job', threshold: hudson.model.Result.SUCCESS). |
| withCredentials | Inject credentials: withCredentials([string(credentialsId:'id',variable:'VAR')]){sh 'use $VAR'}. Masked in logs. |
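
Several of the directives in the table can be combined as in the following sketch. The parameter names (`TARGET_ENV`, `RUN_SMOKE_TESTS`, `LOG_LEVEL`) are hypothetical, and the scripts are the ones used as placeholders elsewhere in these notes.

```groovy
pipeline {
    agent any
    options {
        timeout(time: 30, unit: 'MINUTES')
    }
    parameters {
        string(name: 'TARGET_ENV', defaultValue: 'staging', description: 'Environment to deploy to')
        booleanParam(name: 'RUN_SMOKE_TESTS', defaultValue: true, description: 'Run smoke tests after deploy')
        choice(name: 'LOG_LEVEL', choices: ['info', 'debug'], description: 'Deploy script verbosity')
    }
    triggers {
        // Weekdays, some minute during the 02:00 hour; H spreads load across jobs.
        cron('H 2 * * 1-5')
    }
    stages {
        stage('Deploy') {
            steps {
                sh "LOG_LEVEL=${params.LOG_LEVEL} ./deploy.sh ${params.TARGET_ENV}"
            }
        }
        stage('Smoke Tests') {
            // Runs only when the boolean parameter is ticked.
            when { expression { params.RUN_SMOKE_TESTS } }
            steps {
                sh "./smoke-test.sh ${params.TARGET_ENV}"
            }
        }
    }
}
```

Scheduled runs use the parameter defaults, so the defaults should describe a safe target.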

| Strategy | Characteristics |
|---|---|
| Recreate | Stop all v1, start all v2. Downtime during transition. Use only for non-prod or when version coexistence is impossible. |
| Rolling | Replace instances incrementally. Kubernetes default. Zero downtime. v1 and v2 serve traffic simultaneously — requires backward compatibility. |
| Blue/Green | Two full environments. Instant cutover. Fast rollback (switch DNS/LB back). Double infrastructure cost. No version coexistence. |
| Canary | Route small % to new version. Progressive rollout. Lowest risk. Requires traffic splitting (Istio, nginx, ALB). Best for high-traffic services. |
| Shadow | Duplicate live traffic to new version (no user-visible impact). Compare responses. No user-facing risk, provided side effects (writes, emails, payments) are stubbed or isolated. High infra cost. Use before canary for risky changes. |
| A/B Testing | Route different user segments to different versions. Business experiment, not just risk mitigation. Requires feature flag / routing by header/cookie. |

| Aspect | GitHub Actions | Jenkins | GitLab CI |
|---|---|---|---|
| Config format | YAML in .github/workflows/ | Groovy in Jenkinsfile | YAML in .gitlab-ci.yml |
| Hosting | SaaS (GitHub-hosted runners + self-hosted) | Self-hosted only | SaaS + self-hosted |
| Runners/Agents | GitHub-hosted (Ubuntu/Windows/Mac) + self-hosted | Controller + agents, Kubernetes pods | Shared runners + self-hosted |
| Secret management | Secrets in repo/org settings | Credentials Store + Vault integration | CI/CD variables (masked, protected) |
| Marketplace/plugins | 20k+ Actions in Marketplace | 1800+ plugins (mature, some unmaintained) | Native features + includes |
| Multi-repo pipelines | Reusable workflows | Shared libraries, multibranch pipelines | include + extends across projects |
| Cost | Free for public; paid by minutes for private (2k free/month) | Infra cost only | Free tier + paid tiers |
| Best for | Teams already on GitHub, OSS | Existing enterprise installs, complex pipelines | Integrated DevSecOps platform |

| Tool | Notes |
|---|---|
| Argo CD | Kubernetes-native. Rich UI. App-of-apps pattern. Sync waves. RBAC. SSO. Multi-cluster. Most popular GitOps tool. |
| Flux | CNCF graduated. CLI-driven. Stronger multi-tenancy. Kustomize and Helm support. Notification controller. Leaner than Argo CD. |
| Argo Rollouts | Progressive delivery controller (canary, blue/green) for Kubernetes. Integrates with Argo CD and Istio/Nginx for traffic splitting. |
| Jenkins X | Full GitOps platform built on Jenkins + Tekton. Opinionated, complex. Less popular than Argo/Flux. |
A Declarative Pipeline has a mandatory pipeline {} wrapper:
```groovy
pipeline {
    agent { label 'linux' }                   // Where to run
    options {
        timeout(time: 30, unit: 'MINUTES')
        timestamps()
    }
    environment { APP_ENV = 'staging' }       // Pipeline-wide env vars
    stages {
        stage('Build') {
            steps {
                sh 'mvn package -DskipTests'
                stash name: 'artifact', includes: 'target/*.jar'
            }
        }
        stage('Test') {
            parallel {                        // Parallel sub-stages
                stage('Unit') { steps { sh 'mvn test' } }
                stage('Lint') { steps { sh 'mvn checkstyle:check' } }
            }
            post { always { junit 'target/surefire-reports/**/*.xml' } }
        }
        stage('Deploy') {
            when { branch 'main' }            // Condition
            input { message 'Deploy?'; submitter 'ops' }
            steps {
                unstash 'artifact'
                withCredentials([string(credentialsId: 'deploy-key', variable: 'KEY')]) {
                    sh './deploy.sh $KEY'
                }
            }
        }
    }
    post {
        success { slackSend message: "Build ${BUILD_NUMBER} succeeded" }
        failure { slackSend color: 'danger', message: "Build failed" }
    }
}
```
Key sections: agent (execution environment), options (timeout, retry, timestamps), environment (env vars), stages (ordered stages), parallel (concurrent stages), when (conditions), input (approval gate), post (cleanup/notification).
```groovy
stages {
    stage('Build & Test') {
        steps {
            sh 'mvn verify'
            sh "docker build -t myapp:${GIT_COMMIT} ."
            sh "docker push myapp:${GIT_COMMIT}"
        }
    }
    stage('Deploy Dev') {
        steps { sh "kubectl set image deployment/app app=myapp:${GIT_COMMIT} -n dev" }
    }
    stage('Integration Tests') {
        steps { sh "mvn verify -Pintegration -Dbase.url=https://dev.internal" }
    }
    stage('Deploy Staging') {
        steps { sh "kubectl set image deployment/app app=myapp:${GIT_COMMIT} -n staging" }
    }
    stage('Smoke Tests Staging') {
        steps { sh './smoke-test.sh staging' }
    }
    stage('Deploy Production') {
        when { branch 'main' }
        input { message 'Approve prod deploy?'; submitter 'ops-team' }
        steps { sh "kubectl set image deployment/app app=myapp:${GIT_COMMIT} -n prod" }
    }
}
```
Environment config: Kubernetes ConfigMaps and Secrets per namespace provide environment-specific config. The image is the same; only the config differs.
Rollback: kubectl rollout undo deployment/app -n prod reverts to the previous image. In GitOps: revert the image tag commit in the environment repo.

Traditional push CI/CD: the CI pipeline builds the artifact and calls kubectl apply or similar to push the change to the cluster. The pipeline needs cluster credentials. If the pipeline is down, deployments can't happen. Drift between desired and actual state is possible (someone edits a resource manually with kubectl edit: no record, no revert).
GitOps: a Git repository is the single source of truth for the desired cluster state (Kubernetes manifests, Helm values). An operator (Argo CD, Flux) runs inside the cluster, watches the Git repo, and continuously reconciles the cluster to match Git.
Key differences:

- No cluster credentials in CI: the operator pulls from Git; CI only pushes to Git (updates the image tag in manifests). Much smaller attack surface.
- Self-healing: if someone manually changes something in the cluster, the operator reverts it to match Git (in Argo CD, when automated sync with self-heal is enabled), so drift does not persist.
- Audit trail in Git: every deployment is a Git commit with author, timestamp, and diff. Rollback = git revert; Argo CD syncs the revert automatically.
- Separation of concerns: CI (build and test) → updates Git manifest. GitOps operator → deploys from Git to cluster. Two separate tools, two separate concerns.

In practice: CI builds myapp:abc123, updates deployment.yaml image tag to abc123, commits to the infra repo. Argo CD sees the commit and syncs the cluster.
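The CI side of that hand-off could be sketched as a Jenkins stage like the one below. It assumes an earlier stage exported `IMAGE_TAG`, a Git plugin version that provides the `gitUsernamePassword` binding, and a hypothetical infra repo URL, credentials ID, and manifest path.

```groovy
stage('Update Manifest') {
    steps {
        dir('infra') {
            // Hypothetical infra repository and credentials ID.
            git url: 'https://git.example.com/company/infra.git', branch: 'main', credentialsId: 'infra-repo'
            // Point the dev manifest at the image built earlier in this pipeline.
            sh 'sed -i "s|image: myapp:.*|image: myapp:$IMAGE_TAG|" dev/deployment.yaml'
            withCredentials([gitUsernamePassword(credentialsId: 'infra-repo')]) {
                sh '''
                    git config user.name "jenkins"
                    git config user.email "jenkins@example.com"
                    git commit -am "Deploy myapp:$IMAGE_TAG to dev"
                    git push origin HEAD:main
                '''
            }
        }
    }
}
```

Note that the pipeline holds only Git credentials here, not cluster credentials. `git commit` exits non-zero when the tag is unchanged, which fails the stage; whether that is desirable depends on the team's conventions.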
myapp:v1.2.0

2. Deploy canary: create a separate myapp-canary Deployment with replicas: 1 running v1.2.0. Existing myapp Deployment still runs v1.1.0 at replicas: 9.
3. Configure Istio VirtualService: weight: 10 to canary, weight: 90 to stable.
4. Observe metrics for 10 minutes: error rate, P99 latency on canary pods (Datadog/Prometheus monitor scoped to canary pod label).
5. Gate check: if canary error rate > 1% → automated rollback (delete canary deployment, reset weights to 100% stable). If metrics are healthy → continue.
6. Progressive promotion: 10% → 25% → 50% → 100%. At each step, repeat the metric gate.
7. Full promotion: update stable Deployment to v1.2.0, remove canary Deployment, reset weights.

Automated rollback in the pipeline:
```groovy
stage('Canary Monitor') {
    steps {
        script {
            def errorRate = sh(script: './check-canary-error-rate.sh', returnStdout: true).trim()
            if (errorRate.toFloat() > 1.0) {
                sh './rollback-canary.sh'
                error "Canary rollback: error rate ${errorRate}%"
            }
        }
    }
}
```
Argo Rollouts automates this entire flow with built-in Prometheus/Datadog metric analysis and automatic promotion/rollback.

Jenkinsfile or .github/workflows/*.yml
- Storing secrets in environment variables that get logged (echo $SECRET)
- Committing .env files with production credentials

Jenkins Credentials Store: Store credentials in Jenkins (encrypted at rest). Reference in pipeline:
```groovy
withCredentials([
    string(credentialsId: 'aws-access-key', variable: 'AWS_KEY'),
    usernamePassword(credentialsId: 'db-creds', usernameVariable: 'DB_USER', passwordVariable: 'DB_PASS')
]) {
    // The AWS CLI reads credentials from environment variables, not from a command-line flag.
    sh 'AWS_ACCESS_KEY_ID="$AWS_KEY" aws s3 cp ...'
}
```
Jenkins masks the credential values in logs (replaces with ****).
Better: HashiCorp Vault: Jenkins fetches secrets from Vault at runtime using the Jenkins Vault plugin. Vault credentials can be short-lived, audited, and rotated automatically. Application secrets are not stored in Jenkins; only the credential Jenkins uses to authenticate to Vault is.
GitHub Actions: Store in repo/org Secrets (encrypted, masked in logs). Access as ${{ secrets.MY_SECRET }}. Never expose in echo steps. Use OIDC (OpenID Connect) with AWS/GCP — no static credentials at all; GitHub Actions exchanges an OIDC token for temporary cloud credentials.
Principle: treat pipeline secrets like any other production secrets. Audit access. Rotate regularly. Prefer short-lived credentials over long-lived API keys.

Thread.sleep(500) hoping a background process finishes. Fix: use await().atMost(5, SECONDS).until(condition) (Awaitility).
Shared test state: static variables modified by tests; database not cleaned between tests. Fix: @BeforeEach resets state; use @Transactional + rollback for DB tests; Testcontainers for fully isolated DBs per test run.
Port conflicts: tests hardcode localhost:8080; parallel test runs clash. Fix: use random ports (@LocalServerPort in Spring Boot tests).
Order-dependent tests: test B passes only after test A ran. Fix: each test must set up its own preconditions; never rely on test execution order.
Quarantine strategy: move known-flaky tests to a @Flaky suite that runs separately and doesn't block the main pipeline. Alert the owning team. Give them 1 sprint to fix before the test is deleted. Deleting a flaky test > ignoring it.

Shift left means moving quality and security checks earlier in the development process, further left on the timeline from development → testing → staging → production.

Why: the cost of fixing a bug or vulnerability increases dramatically the later it's found. A bug caught in a pre-commit hook costs minutes. The same bug caught in production costs hours of incident response, potential data loss, and reputation damage.

Shift-left practices:

Pre-commit (developer machine):
- Git hooks (pre-commit, Husky): run linters, formatters, secret scanners (git-secrets) before the commit is even created.

PR / CI:
- Unit tests, static analysis (Checkstyle, SpotBugs), SAST (Semgrep, Checkmarx)
- Dependency CVE scan (OWASP Dependency-Check, Snyk) on every PR, not just nightly
- Contract tests (Pact) on every PR to catch API breaking changes before merge

Pre-staging:
- DAST (Dynamic Application Security Testing) against a staging environment
- Performance test to catch regressions before production

Outcome: production incidents decrease because issues are caught and fixed when context is fresh and the cost is low. Developer velocity increases because late-stage bugs don't cause emergency rollbacks.

company/pipeline-library. Teams call @Library('pipeline-library') import company.StandardPipeline. Updates to the library apply to all teams' pipelines on next run.
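One way a shared library can expose a standard pipeline is through a global variable in its `vars/` directory. The sketch below is an illustration, not the library referenced above: the file name `standardPipeline.groovy` and the `buildCommand` option are hypothetical.

```groovy
// vars/standardPipeline.groovy in the shared library (hypothetical file)
def call(Map config = [:]) {
    // Teams may override the build command; everything else is fixed by the platform team.
    String buildCommand = config.buildCommand ?: 'mvn package -DskipTests'

    pipeline {
        agent { label 'linux' }
        options {
            // Default build timeout enforced for every consuming pipeline.
            timeout(time: 30, unit: 'MINUTES')
        }
        stages {
            stage('Build') {
                steps { sh buildCommand }
            }
        }
    }
}
```

A consuming Jenkinsfile would then shrink to two lines:

```groovy
@Library('pipeline-library') _
standardPipeline(buildCommand: 'mvn verify')
```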
Self-service infra: Teams create new pipelines via a portal (click-ops or terraform apply). No platform team approval needed for standard pipelines. Non-standard pipelines require review.
Governance gates (non-negotiable):
- All builds must pass security scan (CVE gate, SAST)
- All production deployments must come from a signed artifact (cosign)
- All prod deploys require approval from two engineers (enforced via GitHub protected branches + Argo CD sync policy)
- Audit log of every deployment: who triggered, what artifact, what commit

Cost control:
- CI runner quotas per team (GitHub Actions: billing by minutes)
- Build timeout enforced by shared library (default 30 min; teams can request extension)
- Monthly cost report per team; alert when budget exceeded

Core principle: backward-compatible migrations. Because rolling deployments mean v1 and v2 of the application run simultaneously against the same database, schema changes must be compatible with both versions during the transition period.

Expand-Contract pattern:
- Phase 1 (Expand): add new column as nullable with default. Both v1 (ignores new column) and v2 (uses new column) can coexist.
- Phase 2 (Migrate): backfill data; application code fully uses new column.
- Phase 3 (Contract): make column NOT NULL; remove old column. Only after all v1 instances are gone.

Migration tools in the pipeline: Flyway or Liquibase. Migrations are SQL files versioned in Git alongside application code. The application runs migrations on startup (spring.flyway.enabled=true) OR a pipeline step runs them before deployment:
```groovy
stage('DB Migrate') {
    steps { sh 'flyway -url=$DB_URL -user=$DB_USER migrate' }
}
stage('Deploy') {
    steps { sh 'kubectl rollout ...' }
}
```
Safety in CI:
- Run migrations against a copy of production data in staging before touching prod
- Migration dry-run: flyway validate to check migration files match DB state
- Rollback: Flyway Community doesn't support undo; design migrations to be forward-only. "Undo" migrations are a new migration that reverts the schema change.

Never: run DDL in the middle of a rolling deployment. Add a column before deploying new code; remove an old column only after all old code is gone.
kubectl rollout status deployment/myapp -n prod --timeout=5m. If this times out or pods enter CrashLoopBackOff → deployment step fails.
Automated rollback (preferred): If the deploy step fails: kubectl rollout undo deployment/myapp -n prod. In GitOps: revert the image tag commit in the environment repo; Argo CD auto-syncs the revert. Pipeline sends alert: "Deploy v1.2.0 failed. Rolled back to v1.1.9. See [runbook]." Total time to restored service: 2–5 minutes.
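The detect-then-undo flow could be expressed as a single stage along these lines. This is a sketch: `IMAGE_TAG` is assumed to be exported by an earlier stage, the container name `app` is an assumption, and `slackSend` requires the Slack Notification plugin.

```groovy
stage('Deploy Production') {
    steps {
        sh 'kubectl set image deployment/myapp app=myapp:$IMAGE_TAG -n prod'
        // Fails the stage if the rollout does not become healthy within 5 minutes.
        sh 'kubectl rollout status deployment/myapp -n prod --timeout=5m'
    }
    post {
        failure {
            // Return to the previous revision, then wait for it to stabilise.
            sh 'kubectl rollout undo deployment/myapp -n prod'
            sh 'kubectl rollout status deployment/myapp -n prod --timeout=5m'
            slackSend color: 'danger', message: "Deploy ${env.IMAGE_TAG} failed and was rolled back. See runbook."
        }
    }
}
```

One caveat with this shape: the `failure` block also runs if `kubectl set image` itself fails before changing anything, in which case the undo would step back a revision unnecessarily. A stricter version would record whether the image update succeeded before undoing.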
Post-incident (within 24h):
- Root cause: what failed? Health check? OOMKilled? DB migration? Config error?
- Canary deployment: if this had been a 5% canary, would the health check have caught it before 100% rollout? If yes, add canary as a required step.
- Add the specific failure mode to the test suite (regression test).
- Update the runbook with the specific failure pattern and resolution.

Never: hotfix directly on the cluster without going through the pipeline. Hotfixes in prod that aren't in Git create drift; the next deploy overwrites the hotfix. Fast-track a hotfix through the pipeline and keep the golden path intact even under pressure.

DORA four key metrics (from the State of DevOps research) measure software delivery performance:

- Deployment Frequency: how often deploys happen per service. Elite: multiple times/day. High: daily–weekly. Medium: weekly–monthly. Low: monthly+. Measure: count deploy events per service per week (Argo CD sync history, Jenkins build log).
- Lead Time for Changes: time from code commit to running in production. Elite: < 1 hour. High: < 1 day. Medium: < 1 week. Low: > 1 month. Measure: commit timestamp → production deploy timestamp. Extract from Git + deploy events.
- Change Failure Rate: % of deployments causing incidents/rollbacks. Elite: 0–15%. High: 0–15%. Medium: 16–30%. Low: > 30%. Measure: link PagerDuty incidents to the preceding deploy (within 1 hour).
- Failed Deployment Recovery Time (MTTR): time from incident start to service restored. Elite: < 1 hour. High: < 1 day. Low: > 1 week. Measure: PagerDuty alert → resolution timestamp.

Operationalizing the metrics:

- Dashboard: DORA metrics per team, per service, trending over 90 days
- Monthly engineering all-hands: publish org-wide DORA percentile and celebrate improvements
- Targeted improvement programs: teams in "Low" tier get a platform engineer embedded for one quarter
- Connecting to practices: slow lead time → investigate pipeline stages (where is it waiting?); high CFR → invest in canary deployments and better tests; long MTTR → invest in observability and runbooks

Anti-pattern: using DORA metrics for individual performance evaluation. They measure system outcomes, not individual effort. A team in a "Low" tier may be working on critical compliance infrastructure with mandatory change windows — context matters.
mvn verify (unit + integration tests) → SonarCloud analysis → Trivy image scan → post build summary. Gate: all must pass before merge is allowed (branch protection rules).

mvn package -DskipTests → docker build -t ecr/myapp:${GITHUB_SHA} → docker push → update infra/dev/deployment.yaml image tag to ${GITHUB_SHA} → commit + push to infra repo. Argo CD detects the commit and syncs dev automatically.

infra/staging/deployment.yaml image tag → commit → Argo CD syncs staging. GitHub Environment "staging" requires one reviewer from the qa-team.

ops-team. Includes deployment protection rules (only from main branch, time window restrictions). After approval: update infra/prod/deployment.yaml → Argo CD syncs with sync wave annotation to run DB migration before app deployment.

git revert the image tag commit in the infra repo → Argo CD auto-syncs.

NoSuchMethodError in the shared library. No team can deploy. A critical security patch needs to go to production today.

git log --oneline -10. git revert HEAD --no-edit → push to main. Jenkins automatically uses the latest library version (or pin to a specific tag in consuming pipelines). Verify: re-run one pipeline to confirm it passes.

main of the shared library, since that means every library change immediately affects all 50 teams. Version the library with semantic versioning. Pipelines pin to a major version: @Library('company-lib@v2'). Breaking changes bump the major version; teams upgrade at their own pace. A breaking change in v2 only affects pipelines that opted into v2.