// containers · images · Dockerfile · networking · volumes · compose · security · senior → principal
RUN, COPY, ADD). Layers are content-addressed (SHA256) and cached: if a layer hasn't changed, Docker reuses the cache. This makes builds fast but also means layer order matters — put infrequently changing instructions (install dependencies) before frequently changing ones (copy source code). A container is an image plus a thin writable layer on top. Images are stored in a registry (Docker Hub, ECR, GCR); pulled by digest or tag. Tags are mutable (:latest can point to any image); digests are immutable (image@sha256:abc123). Always pin by digest in production.
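The tag-versus-digest rule above lends itself to a small CI lint. The following is a minimal sketch, not an established tool: the function names `is_digest_pinned` and `lint_image_refs` are hypothetical, and the check only looks for the `@sha256:` marker in the reference string rather than contacting a registry.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when the reference contains a digest.
is_digest_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical lint: prints every reference that is not digest-pinned and
# returns non-zero if at least one was found.
lint_image_refs() {
  status=0
  for ref in "$@"; do
    if ! is_digest_pinned "$ref"; then
      echo "not pinned: $ref"
      status=1
    fi
  done
  return "$status"
}
```

A pipeline could call `lint_image_refs` with the image references extracted from its deployment manifests and fail the job on a non-zero exit status.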
pom.xml / package.json and run dependency install before copying source — dependencies are cached until they change.
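Other practices: use .dockerignore to exclude node_modules, .git, test files. Prefer COPY over ADD (ADD has implicit tar extraction and URL fetch). Combine RUN instructions with && to reduce layers. Run as non-root: create the user with RUN adduser -D appuser (BusyBox/Alpine syntax), then switch with a separate USER appuser instruction. USER is a Dockerfile instruction, not a shell command, so it cannot be chained with &&. Use specific base image tags, never :latest.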
-p hostPort:containerPort binds to host interface. -p 8080:8080 binds all interfaces; -p 127.0.0.1:8080:8080 binds loopback only (more secure). Expose ports in Dockerfile (EXPOSE) as documentation only — it doesn't publish.
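To make the difference between the two `-p` forms concrete, here is a rough sketch that classifies a publish spec by the address it binds. The function name `publish_scope` is hypothetical, and the pattern match is deliberately simple: it does not treat IPv6 forms such as `[::1]` specially.

```shell
#!/bin/sh
# Hypothetical helper: report which host address a -p spec would bind.
# Accepts the forms containerPort, hostPort:containerPort and
# ip:hostPort:containerPort.
publish_scope() {
  case "$1" in
    127.0.0.1:*:*) echo "loopback" ;;
    *:*:*)         echo "specific-address" ;;
    *)             echo "all-interfaces" ;;
  esac
}
```

For example, `publish_scope 8080:8080` reports all interfaces, while `publish_scope 127.0.0.1:8080:8080` reports loopback.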
docker rm. Three persistence mechanisms: Volumes (managed by Docker, stored in /var/lib/docker/volumes/, best for production data, can be shared between containers, driver plugins for NFS/EBS). Bind mounts (mount host path into container — great for dev, poor for prod: ties container to host filesystem layout). tmpfs (in-memory, not persisted, useful for secrets or temp files).
Volumes survive container removal. Named volumes (-v mydata:/data) are preferred over anonymous volumes. Use docker volume inspect and docker volume prune to manage lifecycle.
docker-compose.yml. Key concepts: services (each container), networks (services on the same network resolve by service name), volumes (shared or named), depends_on (start order, but not health — use healthcheck + condition: service_healthy).
docker compose up -d starts all services; docker compose logs -f tails logs; docker compose down -v stops and removes containers + volumes.
Compose is for local dev and CI — not for production. In production use Kubernetes or Docker Swarm. Compose v2 (plugin, docker compose) replaces the deprecated v1 (docker-compose binary). Use profiles to start optional services (e.g., --profile debug starts a monitoring container).
--cap-drop=ALL) and add only what's needed (--cap-add=NET_BIND_SERVICE). Use read-only filesystem (--read-only) with tmpfs for writable paths.
Image security: scan images for CVEs (Trivy, Snyk, Docker Scout). Use minimal base images (distroless, scratch) to reduce attack surface. Never store secrets in image layers — they persist in history even if deleted in a later layer.
Runtime security: set resource limits (--memory, --cpus) to prevent DoS. Use --security-opt=no-new-privileges to prevent privilege escalation. Enable user namespace remapping to remap root inside container to unprivileged UID on host. Seccomp profiles restrict syscalls (Docker applies a default profile).
dockerd) → containerd → runc (OCI runtime that actually creates the container using Linux namespaces and cgroups). Kubernetes uses containerd or CRI-O directly, bypassing Docker daemon. "Docker is deprecated in Kubernetes" means the Docker shim (CRI adapter) was removed — images still work because they're OCI-compliant.
Namespaces provide isolation: PID, Network, Mount, UTS, IPC, User. cgroups enforce resource limits: CPU, memory, I/O.
--memory=512m --memory-swap=512m (swap=memory means no swap). --cpus=0.5 limits to half a CPU core. In Compose: deploy.resources.limits.
Health checks let Docker know whether a running container is actually working:
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
Plain Docker only marks the container as unhealthy; it does not restart it. Docker Swarm replaces unhealthy tasks, and Kubernetes ignores HEALTHCHECK and relies on its own liveness and readiness probes. In Compose: depends_on with condition: service_healthy waits for a dependency to pass its health check before starting the dependent service.
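The retry rule behind `--retries` can be sketched in plain shell. This is an illustration of the idea only, not Docker's implementation: the function name `health_state` is hypothetical, and the waiting between probes (`--interval`) and the per-probe deadline (`--timeout`) are left out.

```shell
#!/bin/sh
# Sketch of the retry rule only; interval and timeout handling are omitted.
# Usage: health_state RETRIES PROBE_COMMAND [ARGS...]
health_state() {
  retries=$1
  shift
  failures=0
  # Keep probing until the probe succeeds or RETRIES consecutive failures occur.
  while [ "$failures" -lt "$retries" ]; do
    if "$@"; then
      echo "healthy"
      return 0
    fi
    failures=$((failures + 1))
  done
  echo "unhealthy"
  return 1
}
```

Running `health_state 3 curl -f http://localhost:8080/health` would report unhealthy only after three failed probes in a row.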
USER nonroot in your Dockerfile. Many base images now provide a non-root user (e.g., node image has node user, openjdk has no default — create one explicitly).
docker inspect, process listings, and logs. Secrets baked into image layers persist in history (docker history). Use Docker secrets (Swarm), Kubernetes Secrets, or a secrets manager (Vault, AWS SSM). At build time: use BuildKit's --secret flag to mount a secret without baking it in.
RUN creates a layer. Deleting files in a later RUN does NOT reduce image size: the data is still in the earlier layer. Clean up in the same RUN: RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*. Better yet: use multi-stage builds and only copy final artifacts to the runtime image.
.dockerignore, COPY . . sends the entire build context (including node_modules, .git, test fixtures, IDE files) to the Docker daemon. This bloats the build context, invalidates the layer cache unnecessarily, and can bake development credentials into the image.
:latest is mutable — it changes with every push. docker pull myimage:latest today and tomorrow may pull different images. This breaks reproducibility and makes rollbacks ambiguous. Pin to specific semantic version tags or image digests (image@sha256:...) in CI/CD and deployment manifests.

| Instruction | Purpose |
| --- | --- |
| FROM | Base image. Use specific version tags. Multi-stage: multiple FROM lines with AS aliases. |
| RUN | Execute command in a new layer. Combine with && to minimize layers. Clean up in same RUN. |
| COPY | Copy files from build context. Preferred over ADD. Supports --chown=user:group. |
| ADD | Like COPY but also auto-extracts tar archives and fetches URLs. Use COPY unless you need these features. |
| ENV | Set environment variables available at build and runtime. Visible in docker inspect. |
| ARG | Build-time variable. Not available in the running container, but values can show up in docker history. Pass with --build-arg. Don't use for secrets. |
| EXPOSE | Documents the port the container listens on. Does NOT publish the port — use -p at runtime. |
| ENTRYPOINT | Main command. Always executed. Use exec form: ["java", "-jar", "app.jar"]. Prefer over CMD alone. |
| CMD | Default arguments to ENTRYPOINT (or default command if no ENTRYPOINT). Overridable at runtime. |
| USER | Set UID/GID for subsequent instructions and container runtime. Always set to non-root. |
| WORKDIR | Set working directory. Creates it if absent. Prefer absolute paths. |
| HEALTHCHECK | Command Docker runs to check container health. --interval, --timeout, --retries. Exit 0 = healthy. |

| Command | Description |
| --- | --- |
| docker build -t name:tag --no-cache . | Build image from Dockerfile in current dir. --no-cache forces fresh build. |
| docker run -d -p 8080:8080 --name app image | Run container detached, port mapped, named. |
| docker exec -it app /bin/sh | Open interactive shell in running container. |
| docker logs -f --tail=100 app | Tail last 100 lines of container logs. |
| docker inspect app | Full container metadata: IP, mounts, env vars, health status. |
| docker stats | Live CPU, memory, network, block I/O per container. |
| docker image prune -a | Remove all unused images. -a includes images not referenced by any container. |
| docker system df | Disk usage: images, containers, volumes, build cache. |
| docker buildx build --platform linux/amd64,linux/arm64 | Multi-platform build (BuildKit). Push manifest list to registry. |
| docker cp app:/app/logs ./logs | Copy files from container to host. |

| Network driver | Behavior and use case |
| --- | --- |
| bridge (default) | Isolated network. NAT to host. Containers on same bridge resolve by name (user-defined only). Use for most local workloads. |
| host | Shares host network stack. No isolation. Best performance (no NAT). Risk: container can bind any host port. Use for high-throughput network services where NAT overhead matters. |
| none | No networking. Completely isolated. Use for batch jobs that need compute but no network. |
| overlay | Cross-host network (Docker Swarm). Containers on different hosts communicate transparently. Encrypted with --opt encrypted. Used in Swarm clusters. |
| macvlan | Container gets its own MAC and IP on the physical network. Appears as a physical device. Use when containers must be directly reachable on the LAN. |

| Aspect | Docker | containerd | Podman |
| --- | --- | --- | --- |
| Daemon | dockerd (background daemon) | containerd daemon | Daemonless |
| Root requirement | Daemon runs as root by default (rootless mode available) | Daemon runs as root by default | Rootless by default |
| CLI | docker | ctr / nerdctl | podman (Docker-compatible) |
| Compose support | docker compose | nerdctl compose | podman-compose |
| Kubernetes CRI | No (dockershim removed in k8s 1.24) | Yes (primary CRI) | Not directly (CRI-O is the related CRI runtime) |
| Image compatibility | OCI + Docker format | OCI standard | OCI standard |
| Security | Daemon root = risk | Daemon root = risk | Rootless = better isolation |
| Best for | Local dev, CI | Production k8s runtime | Rootless environments, RHEL |
RUN, COPY, ADD) creates a new layer. Layers are content-addressed (SHA256 of contents). Docker compares the cache key for each instruction — if unchanged, it reuses the cached layer.
Cache invalidation rules:
- FROM: invalidated if base image digest changes
- RUN: cache key is the command string; any change invalidates it and all subsequent layers
- COPY/ADD: cache key includes file content checksums; any file change invalidates it
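These rules can be illustrated with a toy model in shell. It is a sketch under stated assumptions, not Docker's actual key computation: the function names `run_cache_key`, `copy_cache_key` and `chain_key` are hypothetical, and `sha256sum` from GNU coreutils is assumed to be installed.

```shell
#!/bin/sh
# Toy model only: Docker's real cache keys are computed differently in detail.

# RUN: the key depends on the command string.
run_cache_key() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# COPY/ADD: the key depends on the contents of the copied files.
copy_cache_key() {
  cat "$@" | sha256sum | cut -d' ' -f1
}

# Each layer's key also depends on its parent's key, which is why one
# changed layer invalidates every layer after it.
chain_key() {
  printf '%s\n%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}
```

In this model, editing a copied file changes `copy_cache_key`, and feeding that into `chain_key` changes the key of every later layer even when its own instruction is untouched.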
Optimization: order from stable to volatile:
```dockerfile
# A Maven image is used here because the build steps run mvn
FROM maven:3.9-eclipse-temurin-21
WORKDIR /app
# 1. Copy dependency manifest only (changes rarely)
COPY pom.xml .
RUN mvn dependency:go-offline -q
# 2. Copy source (changes frequently, but the deps layer is cached)
COPY src ./src
RUN mvn package -DskipTests
```
Without this order: every source change invalidates the dependency download layer, re-downloading hundreds of MB each build. With this order: only the final two layers are rebuilt on source changes.

Multi-stage builds use multiple FROM instructions in one Dockerfile. Each FROM starts a new stage. You can COPY --from=<stage> to copy artifacts between stages. Only the final stage becomes the image.
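```dockerfile
# Stage 1: build
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline
COPY src ./src
RUN mvn package -DskipTests

# Stage 2: runtime (reconstructed; the JAR path below is an assumption)
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=builder /app/target/*.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
```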
Why it matters:
- Builder image: ~600MB (JDK + Maven + dependencies + source)
- Runtime image: ~100MB (JRE + JAR only)
- No build tools, source code, or intermediate artifacts in production image
- Smaller attack surface, faster pull, less storage cost
ENTRYPOINT sets the main executable — it's always run and cannot be overridden at docker run time (without --entrypoint flag). CMD provides default arguments to ENTRYPOINT, or the default command if no ENTRYPOINT is set. CMD is easily overridden: docker run image arg1 arg2.
Shell form vs exec form:
- Shell form: ENTRYPOINT java -jar app.jar runs as /bin/sh -c "java -jar app.jar". PID 1 is the shell, not Java. Signals (SIGTERM) don't reach Java, so graceful shutdown breaks.
- Exec form: ENTRYPOINT ["java", "-jar", "app.jar"] makes Java PID 1. Signals reach Java directly. Always use exec form.
Best practice:
```dockerfile
ENTRYPOINT ["java", "-jar", "app.jar"]
CMD ["--spring.profiles.active=prod"]
```
Override CMD at runtime: docker run image --spring.profiles.active=dev
Use CMD alone (no ENTRYPOINT) for flexible base images where the command varies.
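The PID difference between shell form and exec form can be observed without Docker. The sketch below uses two hypothetical helper functions that each print the PID of a wrapper shell and then the PID of the process it launches. It is the same mechanism an entrypoint script relies on when it ends with `exec "$@"`.

```shell
#!/bin/sh
# With exec, the wrapper shell is replaced, so both lines show the same PID.
pids_with_exec() {
  sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# Without exec, the wrapper shell stays alive as the parent and the launched
# process gets a new PID (the trailing ":" keeps the wrapper from exiting early).
pids_without_exec() {
  sh -c 'echo $$; sh -c "echo \$\$"; :'
}
```

In a container the wrapper shell would be PID 1, so only the exec variant leaves the application in the position where it receives SIGTERM directly.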
- ENV SECRET=value is visible in docker inspect, process listings, and logs
- ARG SECRET=value + RUN use $SECRET is baked into the layer (visible in docker history)
- COPY secrets.txt . leaves the file in the image
Build-time secrets (BuildKit):
```dockerfile
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm install
```
```bash
docker buildx build --secret id=npmrc,src=$HOME/.npmrc .
```
The secret is mounted only for that RUN step and is never written to a layer.
Runtime secrets:
- Docker Swarm secrets: docker secret create, mounted as files in /run/secrets/. Never exposed as env vars.
- Kubernetes Secrets: mounted as files or env vars (prefer files). Use external secrets operator to sync from Vault/AWS SSM.
- Environment injection at runtime from a secrets manager: the container's entrypoint fetches secrets from Vault/SSM on startup, sets them in the process environment. They never touch the image.

Default bridge network (docker0): containers get an IP in the 172.17.0.0/16 range. They can reach each other by IP but NOT by name (no DNS). Traffic out goes via NAT. Avoid for multi-container setups.
User-defined bridge network:
```bash
docker network create mynet
# The postgres image refuses to start without a password; the value here is a placeholder
docker run -d --network=mynet --name=db -e POSTGRES_PASSWORD=example postgres
docker run -d --network=mynet --name=app myapp
```
Now app can reach db by name (ping db). Docker's embedded DNS server (127.0.0.11) resolves container names within the network.
Network isolation: containers on different user-defined networks cannot communicate by default. Explicitly connect a container to multiple networks if cross-network communication is needed.
Port publishing: -p 8080:8080 creates an iptables rule forwarding host port 8080 to container port 8080. -p 127.0.0.1:8080:8080 binds only loopback, which is better for local dev where you don't want to expose to the network.

- CPU: --cpus=0.5 → CFS quota (cpu.cfs_quota_us on cgroup v1, cpu.max on cgroup v2); the relative weight cpu.shares is set by --cpu-shares instead
- Memory: --memory=512m → OOM killer triggers at limit
- Block I/O, network bandwidth
seccomp profiles restrict system calls (Docker applies a default profile blocking ~40 dangerous syscalls). AppArmor/SELinux provide mandatory access control.
Key insight: all containers share the host kernel. A kernel exploit potentially affects all containers on the host; this is the fundamental difference from VMs.

Multi-stage builds: most impactful. Build in a fat image; copy artifact to a minimal runtime image.

Minimal base images:
- alpine (~5MB) for most languages
- distroless (Google): no shell, no package manager, minimal attack surface
- scratch: literally empty; use for Go binaries that are statically compiled

Combine RUN instructions and clean up in the same layer:
```dockerfile
RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*
```
Use .dockerignore to exclude node_modules, .git, tests, docs.

Use --no-install-recommends for apt.

Inspect with docker image history image:tag to see which layer is large. Use dive tool for interactive layer exploration.

Compress static assets before COPY rather than inside the container.
--cache-from and --cache-to allow sharing build cache across machines (e.g., CI workers). Remote cache backends (registry, S3, GHA).
Secret mounting (as described above): RUN --mount=type=secret — secrets never baked into layers.
SSH forwarding: RUN --mount=type=ssh — forward SSH agent for private git dependencies without baking SSH keys.
Cache mounts: RUN --mount=type=cache,target=/root/.m2 — persist a directory between builds (Maven/npm cache) without it becoming part of the image.
Multi-platform: docker buildx build --platform linux/amd64,linux/arm64 builds images for multiple architectures in one command.
Enable: set DOCKER_BUILDKIT=1 env var or use docker buildx build. On Docker Engine 23.0 and later, BuildKit is already the default builder.

```bash
git diff --name-only origin/main...HEAD | grep '^services/' | cut -d/ -f2 | sort -u
```
Only build changed services. For a PR touching services/orders/, build only orders.
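The changed-service pipeline above can be wrapped into reusable functions and exercised without a Git repository. This is a sketch under assumptions: the function names `changed_services` and `plan_builds` are hypothetical, each service is assumed to keep its Dockerfile in `services/<name>`, and the build step only prints the command it would run.

```shell
#!/bin/sh
# Reads changed file paths on stdin and prints the unique service names.
changed_services() {
  grep '^services/' | cut -d/ -f2 | sort -u
}

# Reads service names on stdin and prints one build command per service,
# tagged with the given commit SHA. This is a dry run: nothing is built.
plan_builds() {
  sha=$1
  while IFS= read -r svc; do
    echo "docker build -t $svc:$sha services/$svc"
  done
}
```

In CI this could be driven as `git diff --name-only origin/main...HEAD | changed_services | plan_builds "$GIT_SHA"`, where `GIT_SHA` is a placeholder for whatever variable the CI system provides.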
Shared base images: Extract common dependencies (JDK + framework dependencies) into a versioned base image. Services FROM company/java-base:1.4. Base image changes rarely — built on its own pipeline. Services only rebuild their app layer. Cache hit rate dramatically improves.
BuildKit cache layers in CI: Use registry-based cache: --cache-from type=registry,ref=ecr/orders:cache --cache-to type=registry,ref=ecr/orders:cache,mode=max. Each CI build restores cache from registry, saving dependency install time.
Parallel builds: build changed services in parallel CI jobs (matrix strategy). 20 services → 20 parallel 2-minute jobs = 2 minutes total wall time.
Tag strategy: image:${git_sha} for traceability. Promote to image:v1.2.3 on release. Never push to :latest in production pipelines.

/bin/sh).
Runtime hardening:
- USER nonroot, never root
- --read-only filesystem + tmpfs for /tmp
- --cap-drop=ALL --cap-add=<only what's needed>
- --security-opt=no-new-privileges
- --security-opt seccomp=custom-profile.json for fine-grained syscall restriction
- Resource limits: --memory, --cpus, --pids-limit

Network:
- User-defined networks; containers only on networks they need
- No --network=host unless unavoidable
- Publish only required ports; bind to 127.0.0.1 for services not publicly exposed

Secrets:
- Never in env vars or image. Use Docker secrets / Kubernetes secrets / Vault agent.

Audit:
- Enable Docker daemon audit logging
- Use Falco for runtime threat detection (anomalous syscalls, container escapes)
- Regular docker system prune to remove stale images with known CVEs

company/java-base:21-v3, company/node-base:20-v2). Each base image includes: security hardening (non-root, minimal packages, seccomp profile), observability agents (OpenTelemetry Java agent pre-configured), CA certificates, timezone data. Services inherit all compliance without knowing the details.
Image governance:
- Mandatory base image: CI rejects images not based on approved bases (OPA/Conftest policy)
- Mandatory scan gate: Trivy in CI, block on CRITICAL
- Image signing: cosign signs all approved images; admission controller (Kyverno/OPA Gatekeeper) rejects unsigned images in production clusters
- SBOM (Software Bill of Materials) generation on every build; stored in registry
Registry strategy: private registry per environment (dev/prod). Immutable tags in prod registry — once pushed, cannot overwrite. Retention policy: keep last 10 versions per service; auto-delete older.
Build platform: centralized BuildKit fleet (depot.dev or self-hosted) with remote cache backend. Teams get fast builds without managing build infrastructure.
Developer experience: make docker-build in every service repo just works — pre-configured Makefile target that uses the right base, BuildKit cache, and registry. Paved road: easy to do the right thing, hard to do the wrong thing.
Cost governance: track image size trends per service. Large images = slow deploys, expensive registry storage, higher CVE surface. Alert teams when image grows > 20% between releases.

FROM maven:3.9-eclipse-temurin-21 (builder + runtime in one) with a two-stage build: Maven stage builds the JAR; eclipse-temurin:21-jre-alpine stage copies only the JAR. Expected result: 1.2GB → ~120MB. Smaller image = faster pull in Kubernetes.

COPY pom.xml . then RUN mvn dependency:go-offline before COPY src .. Maven dependencies (the slow part) are cached as a layer and only re-downloaded when pom.xml changes.

docker/build-push-action with cache-from/cache-to pointing to ECR:
```yaml
cache-from: type=registry,ref=ecr/myapp:cache
cache-to: type=registry,ref=ecr/myapp:cache,mode=max
```
Each runner restores from ECR cache; dependency layer hit rate > 90% for most commits.

FROM eclipse-temurin:21-jre-alpine + OpenTelemetry agent + non-root user setup into company/java-base:21. Build it once on change; 15 services use it. Cache hit for the base layer across all service builds.

RUN groupadd -r payment && useradd -r -g payment payment and USER payment. Verify with docker run image whoami → payment.

trivy image payment-service:latest. For CRITICAL CVEs in OS packages: RUN apt-get update && apt-get upgrade -y in Dockerfile to pull patched packages. For application dependency CVEs: update pom.xml/package.json. Add Trivy scan as a CI gate: trivy image --exit-code 1 --severity CRITICAL.

```yaml
secrets:
  db_password:
    external: true
services:
  payment:
    secrets: [db_password]
```
Secret mounted at /run/secrets/db_password. Application reads file, not env var.

--cap-drop=ALL --cap-add=NET_BIND_SERVICE (only if binding privileged port), --read-only --tmpfs /tmp, --security-opt=no-new-privileges, --memory=512m --cpus=1 --pids-limit=100.

--log-driver=journald and ship container logs to a SIEM. Log all container start/stop events. Install Falco for runtime threat detection: alerts on exec into containers, unexpected file writes, etc.
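The hardening flags listed in this section can be gathered into one place so that every service is started the same way. The following is a minimal sketch: the function name `hardened_run_cmd` is hypothetical, the limits are the example values from the text, and the function only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print a docker run command with the hardening flags
# from this section applied to the given image. Dry run only.
hardened_run_cmd() {
  image=$1
  printf '%s\n' "docker run -d --read-only --tmpfs /tmp --cap-drop=ALL --security-opt=no-new-privileges --memory=512m --cpus=1 --pids-limit=100 $image"
}
```

A service that must bind a privileged port would additionally need `--cap-add=NET_BIND_SERVICE`, which this sketch leaves out on purpose so that capabilities stay opt-in.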