Load Testing — Field Guide

Core Concepts

📋 Test Types

Load test: apply expected production load. Validate that the system meets SLOs under normal conditions. Baseline: "does this handle our typical peak?"
Stress test: gradually increase load beyond normal levels until the system degrades or breaks. Find the breaking point and failure mode. "What happens at 2× peak?"
Soak test (endurance): run normal load for a long time (hours to days). Detect: memory leaks, connection pool exhaustion, disk fill, log rotation issues, gradual degradation. "Does it hold up over 24 hours?"
Spike test: sudden dramatic increase in load (10× in seconds). Validate autoscaling, circuit breakers, queue behavior. "What happens when Black Friday traffic hits?"
Breakpoint test: find the exact throughput where latency or error rate crosses the SLO threshold. Precise capacity number for planning. "What's our maximum sustainable RPS?"
Smoke test: minimal load (1–2 users). Quick sanity check that the system is functional before running a full load test.

soak for leaks · stress = breaking point · smoke first

📊 Key Metrics

Throughput: requests per second (RPS) or transactions per second (TPS) successfully processed. The primary capacity metric.
Latency percentiles: P50 (median, the typical user experience), P95 (worst 5%), P99 (worst 1%), P99.9 (worst 0.1%). Never use average latency: it hides outliers and tail latency. A service with a 99ms average and a 5000ms P99 is failing its worst users.
Error rate: % of requests that fail (5xx, timeouts, connection refused). Distinguish client errors (4xx), server errors (5xx), and network timeouts.
Saturation: resource utilization (CPU %, memory %, DB connection pool %, thread pool queue depth). Saturation approaching 100% is a leading indicator of degradation.
Concurrency: active users / virtual users at a point in time. Not to be confused with throughput: a system with 100 concurrent users but slow responses has low throughput.
APDEX (Application Performance Index): (Satisfied + Tolerating/2) / Total. Satisfied: response time within the threshold T (e.g., 500ms). Tolerating: between T and 4T. Anything slower counts as frustrated. Score 0–1. A human-readable summary of performance against the threshold.

P99 not average · throughput = RPS · saturation = headroom
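
The APDEX formula above can be sketched in a few lines of JavaScript. This is an illustration under stated assumptions rather than a standard library API: the function name `apdex`, the sample latencies, and the 500ms threshold are all made up for the example.

```javascript
// Apdex = (satisfied + tolerating / 2) / total
// satisfied:  latency <= T
// tolerating: T < latency <= 4T
// frustrated: everything slower (counts as zero)
function apdex(latenciesMs, thresholdMs) {
  if (latenciesMs.length === 0) throw new Error('no samples');
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= thresholdMs) satisfied += 1;
    else if (ms <= 4 * thresholdMs) tolerating += 1;
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}

// Hypothetical sample: 3 satisfied, 2 tolerating, 1 frustrated at T = 500ms
const sample = [120, 300, 450, 800, 1500, 2500];
console.log(apdex(sample, 500).toFixed(2)); // (3 + 2/2) / 6 = 0.67
```

A single score is convenient for dashboards, but it should sit next to the percentiles rather than replace them.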

🛠️ Load Testing Tools

k6 (Grafana): scripts in JavaScript. Outputs metrics to Grafana Cloud, InfluxDB, Datadog. Lightweight binary. Git-friendly scripts. Best for: API load testing, CI integration.
Gatling: Scala/Java DSL. High performance (async, non-blocking). Excellent HTML reports out of the box. Maven/Gradle plugin for CI integration. Best for: high-throughput testing, teams comfortable with the JVM.
JMeter: GUI + XML scripts. Long-established, large plugin ecosystem. Heavy resource usage (JVM + threads). Best for: existing JMeter knowledge, complex protocol support (JDBC, JMS, LDAP, FTP).
Artillery: YAML + JS hooks. Easy to get started. Good for: quick API tests, simple workflows. Less scalable than k6/Gatling.
Locust (Python): scripts in Python. Distributed mode. Great for: custom behavior, complex user flows, Python teams.
AWS Distributed Load Testing / Azure Load Testing: managed cloud services. No infrastructure to manage. For bursting to millions of RPS.

k6 for CI-friendly · Gatling for JVM · JMeter = legacy, still common

📈 Load Model Design

A realistic load model is the difference between a useful test and a misleading one.
Virtual Users (VUs) vs Arrival Rate: VU-based models simulate N concurrent users, each making requests sequentially. Arrival rate models inject a fixed RPS regardless of response time. Prefer arrival rate (open model) for most services, because it reflects real traffic better. In a VU (closed) model each user waits for a response before sending the next request, so the offered load falls as responses slow down and the test understates how badly the system degrades.
Ramp-up period: always start with a ramp (e.g., 0 → 500 RPS over 2 minutes). Immediate full load can overwhelm JVM warmup and cause misleading initial failures.
Think time: real users pause between actions. For API tests without a browser, think time is often 0. For browser-level tests, add realistic think time (1–5s).
Scenario mix: production has a mix of endpoints (read-heavy, some writes). Replicate it, for example 70% GET, 20% search, 10% POST. Testing only one endpoint misses interactions.
Test data: use realistic data volume and distribution. A test with 10 users in the DB vs 10 million users in production gives different cache hit rates, query plans, and index behavior.

arrival rate = open model · always ramp up · realistic scenario mix
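
To make the closed-versus-open distinction concrete, the sketch below compares what each model does when latency grows. It assumes an idealized closed loop (each VU sends, waits, thinks, repeats) and uses hypothetical function names and numbers; it is not how any particular tool computes its load.

```javascript
// Closed model: throughput is bounded by how fast each VU can loop.
function closedModelRps(vus, latencySec, thinkTimeSec = 0) {
  return vus / (latencySec + thinkTimeSec);
}

// Open model: arrivals stay fixed, so in-flight requests grow with latency
// (Little's Law: L = lambda * W).
function openModelInFlight(arrivalRps, latencySec) {
  return arrivalRps * latencySec;
}

// Illustrative values: 100 VUs vs a fixed 500 RPS arrival rate
for (const latencySec of [0.2, 0.4, 0.8]) {
  const closed = closedModelRps(100, latencySec).toFixed(0);
  const open = openModelInFlight(500, latencySec).toFixed(0);
  console.log(`latency ${latencySec}s: closed model offers ${closed} RPS, ` +
              `open model holds ${open} requests in flight at 500 RPS`);
}
```

As latency rises from 0.2s to 0.8s, the closed model quietly drops from about 500 RPS to about 125 RPS, while the open model keeps pushing 500 RPS and lets concurrency build up, which is the saturation behavior the test is meant to expose.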

🔍 Bottleneck Identification

After running a load test, correlate performance degradation with resource utilization:
CPU-bound: CPU > 80% while latency degrades. No room for more load. Fix: optimize hot code paths (JFR profiling), add more compute, scale horizontally.
Memory-bound / GC: heap utilization high, GC pauses spike under load. Latency spikes correlate with Full GC events. Fix: tune heap size and GC algorithm, fix memory leaks.
Database bottleneck: application CPU is low, but DB CPU/connections are high. Slow queries appear under load (lock contention, index scans switch to seq scans at scale). Fix: add indexes, optimize queries, add read replicas, add a caching layer.
Thread pool exhaustion: request queue depth grows. Response time increases proportionally to queue depth. CPU is not saturated. Fix: increase thread pool size (if I/O-bound), or add capacity (if CPU-bound tasks are exhausting threads).
Downstream dependency: your service is fast, but a downstream (payment API, third party) slows or fails under load. Circuit breaker trips. Fix: cache downstream responses, add retry/fallback, negotiate capacity with the downstream.
Network saturation: NIC throughput maxed out. Large response payloads at high RPS. Fix: enable compression (gzip), paginate large responses, move large payloads to S3.

correlate resource + latency · DB is often the bottleneck · thread pool vs CPU bound

📉 Little's Law

Little's Law: L = λ × W, where L = average number of requests in the system (concurrency), λ = arrival rate (throughput, RPS), and W = average time a request spends in the system (average latency).
Practical implications:
- At 100 RPS with 200ms average latency: L = 100 × 0.2 = 20 concurrent requests in flight. You need at least 20 threads (or connections) to handle this.
- If latency doubles (degradation), concurrency doubles for the same RPS, so a pool sized for the original latency can run out of threads.
- To handle 1000 RPS at 200ms: 200 concurrent requests needed.
Using it in capacity planning: measure current throughput and latency → calculate required concurrency → verify the thread pool/connection pool is sized appropriately. If thread pool = 50 but Little's Law says you need 200 → the thread pool is the bottleneck.
Latency under load: as load increases, a queue forms, W increases, and L increases. The relationship is not linear at saturation: latency increases superlinearly as utilization approaches 100% (queueing theory, M/M/1 queue model).

L = λW · concurrency = RPS × latency · nonlinear at saturation
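
The superlinear growth near saturation can be illustrated with the textbook M/M/1 mean response time, W = S / (1 − ρ), where S is the mean service time and ρ the utilization. The sketch below assumes the M/M/1 conditions hold (Poisson arrivals, exponential service times, a single server), which real services only approximate; the 20ms service time is an illustrative value.

```javascript
// Mean time in system for an M/M/1 queue: W = S / (1 - rho)
function mm1ResponseTimeMs(serviceTimeMs, utilization) {
  if (utilization < 0) throw new RangeError('utilization must be >= 0');
  if (utilization >= 1) return Infinity; // queue grows without bound
  return serviceTimeMs / (1 - utilization);
}

const SERVICE_TIME_MS = 20; // assumed mean service time
for (const rho of [0.5, 0.8, 0.9, 0.95, 0.99]) {
  const w = mm1ResponseTimeMs(SERVICE_TIME_MS, rho);
  console.log(`utilization ${rho}: ~${w.toFixed(0)} ms`);
}
```

Going from 50% to 90% utilization multiplies the mean response time by five (about 40ms to about 200ms), and the last few percent before saturation cost far more than that, which is why headroom matters.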

🏗️ Test Environment Considerations

Production-like environment: load tests on undersized staging environments give misleading results. The system must be production-like in: hardware (same instance type and count), data volume (similar record counts in DB), network topology (same AZs and latency), third-party integrations (or realistic mocks).
Test isolation: load test traffic must not contaminate production monitoring or customer data. Use a dedicated test environment with its own databases and monitoring. If you must test in production: use a separate tenant/account, tag all test requests with a header (X-Load-Test: true), and filter them from business metrics.
Load generator sizing: the load generator itself can become the bottleneck. A single k6 process on a 2-core VM can generate ~10k RPS for simple requests; for complex flows, less. Distribute the load generator: k6 cloud, k6 Operator (Kubernetes), Gatling Enterprise. Ensure load generator CPU and network are not the constraint.
Warm up first: send light traffic (10% of target) for 2–5 minutes before the main test phase. This allows JVM warmup, connection pool fill, cache warm, CDN warm. Metrics from the ramp-up phase should be excluded from the pass/fail evaluation.

production-like or misleading · generator can be bottleneck · always warm up

✅ Performance SLOs & Pass/Fail Criteria

A load test without pass/fail criteria is a reporting exercise, not a quality gate. Define acceptance criteria before running the test.

Example SLO-based criteria:
- P99 latency ≤ 500ms at target load (500 RPS)
- P95 latency ≤ 200ms
- Error rate < 0.1%
- No request timeouts (connection refused, socket timeout)
- System recovers to baseline within 2 minutes after load stops

Regression criteria:
- P99 latency must not increase > 10% vs previous baseline
- Throughput must not decrease > 5% vs baseline

In CI: automate the pass/fail. k6 thresholds block the pipeline:

```javascript
export const options = {
  thresholds: {
    http_req_duration: ['p(99)<500'], // P99 latency below 500ms
    http_req_failed: ['rate<0.001'],  // error rate below 0.1%
  },
};
```

A failing threshold makes k6 exit with a non-zero code (99), so CI marks the build as failed.
Baselining: run a full load test on a known-good build. Store results as baseline. Every subsequent test compares against it. Catch regressions before production.

define criteria before test · k6 thresholds in CI · baseline for regression
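
The regression criteria above (P99 up by no more than 10%, throughput down by no more than 5%) can be checked by a small comparison script. This is a minimal sketch: the result shape `{ p99Ms, rps }` and the function name `compareToBaseline` are hypothetical, and a real pipeline would first extract these two numbers from the tool's output.

```javascript
// Compare a current run against a stored baseline.
// Assumed input shape (hypothetical): { p99Ms: number, rps: number }
function compareToBaseline(
  baseline,
  current,
  limits = { maxP99IncreasePct: 10, maxRpsDropPct: 5 },
) {
  const p99ChangePct = ((current.p99Ms - baseline.p99Ms) / baseline.p99Ms) * 100;
  const rpsChangePct = ((current.rps - baseline.rps) / baseline.rps) * 100;

  const failures = [];
  if (p99ChangePct > limits.maxP99IncreasePct) {
    failures.push(`P99 latency up ${p99ChangePct.toFixed(1)}%`);
  }
  if (rpsChangePct < -limits.maxRpsDropPct) {
    failures.push(`throughput down ${Math.abs(rpsChangePct).toFixed(1)}%`);
  }
  return { passed: failures.length === 0, p99ChangePct, rpsChangePct, failures };
}

// Illustrative numbers: P99 regressed from 400ms to 460ms (+15%), RPS dipped 2%
const baseline = { p99Ms: 400, rps: 500 };
const result = compareToBaseline(baseline, { p99Ms: 460, rps: 490 });
console.log(result.passed, result.failures);
```

In a CI job, the caller would set a non-zero exit code when `passed` is false, so the regression blocks the build the same way a failed threshold does.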

Gotchas & Failure Modes

When to Use / When Not To

✓ Use Load Testing When

Before a major release or feature launch expected to significantly increase traffic
After performance-critical changes (new GC algorithm, DB query changes, new caching layer)
As part of CI for critical paths — catch regressions before they reach production
During capacity planning — determine how many more instances are needed for projected growth

✗ Don't Use Load Testing When

As a substitute for production monitoring — load tests use synthetic traffic; real user behavior differs
Against production systems with real customer data unless carefully isolated (use staging)
For unit-level performance microbenchmarks — use JMH (Java Microbenchmark Harness) instead

Quick Reference & Comparisons

Test Type Reference

Smoke	1-2 VUs, few minutes. Sanity check the system is functional. Run before any full load test.
Load	Expected production peak load, 30-60 min. Validate SLOs are met under normal conditions.
Stress	Increase load until failure (150%, 200%, 300% of peak). Find breaking point and failure mode.
Soak/Endurance	Normal load for 4-24 hours. Detect memory leaks, connection exhaustion, slow degradation.
Spike	Sudden 10× traffic increase, short duration. Validate autoscaling, circuit breakers, queues.
Breakpoint	Ramp load until SLO threshold is crossed. Precise capacity number: 'We handle up to 850 RPS at P99 < 500ms.'
Volume	Normal RPS but very large data payloads. Validate streaming, chunking, memory usage.
Scalability	Add capacity (pods/instances), re-run test. Measure linear vs sublinear scaling. Find coordination bottlenecks.

Key Load Testing Metrics

Throughput (RPS/TPS)	Requests successfully served per second. Primary capacity metric. Distinguish: attempted vs successful.
P50 (median) latency	Half of requests are faster than this. Typical user experience.
P95 latency	95% of requests are faster. Most users' worst case.
P99 latency	99% of requests are faster. Tail latency. Use for SLOs. High P99 = some users have bad experience.
P99.9 latency	Worst 0.1%. Important for high-volume services: at 1M req/day, 0.1% = 1000 bad requests.
Error rate	% of requests returning errors. Separate HTTP errors (4xx, 5xx) from timeouts.
Saturation	Resource utilization: CPU %, memory %, DB connections %, thread pool queue. Leading indicator.
Time to first byte (TTFB)	Time from request sent to first byte received. Separates server processing from network/transfer.
Requests in flight	Concurrency. Little's Law: in flight = RPS × avg_latency. Thread pool / connection pool must be at least this large.

k6 Script Structure

Options block	vus (virtual users), duration, stages (ramp), thresholds (pass/fail criteria), scenarios.
Default function	The scenario each VU executes in a loop. Needs: import http from 'k6/http'.
Thresholds	{ 'http_req_duration': ['p(99)<500'], 'http_req_failed': ['rate<0.001'] }. Non-zero exit code (99) on failure.
Stages (ramp)	[{ duration: '2m', target: 500 }, { duration: '10m', target: 500 }, { duration: '2m', target: 0 }]. Ramp up, hold, ramp down. Each stage ramps linearly to its target, so a hold repeats the previous target.
Checks	check(response, { 'status 200': r => r.status === 200 }). Tracks pass rate as a metric.
Groups	group('login flow', () => { ... }). Groups metrics by user journey step.
Output	k6 run --out influxdb=http://influx:8086/k6 --out json=results.json script.js
Scenarios	Define multiple scenarios with different executors: constant-arrival-rate, ramping-vus, per-vu-iterations.

Bottleneck Diagnosis Guide

High CPU + high latency	CPU-bound. Profile with async-profiler/JFR to find hot methods. Optimize or scale horizontally.
Normal CPU + high latency + DB slow queries	Database bottleneck. Check slow query log, explain plans, connection pool utilization. Add index, cache, read replica.
Normal CPU + increasing latency over time	Soak issue: memory leak (GC overhead increasing) or connection leak (pool exhaustion). Monitor GC and pool metrics.
High CPU + low RPS (generator)	Load generator is the bottleneck. Distribute load generation. Generator CPU must stay < 70%.
Thread pool queue depth growing	Thread pool too small for the offered load, or downstream is slow (threads blocked in I/O). Diagnose with thread dump.
Error rate spike at specific RPS	Saturation point found. Resource (thread pool, DB connections, file descriptors) exhausted. Increase limit or add capacity.
Latency OK, error rate rises	Circuit breaker tripping, downstream quota exceeded, or connection refused on downstream. Check downstream health.

💻 CLI Commands

k6 run script.js	Run k6 script with default options
k6 run --vus 100 --duration 5m script.js	Override VUs and duration from CLI
k6 run --out json=results.json script.js	Output metrics to JSON for analysis
k6 run --out influxdb=http://localhost:8086/k6 script.js	Stream metrics to InfluxDB (visualize in Grafana)
k6 cloud script.js	Run distributed test on Grafana Cloud (auto-provisions generators)

mvn gatling:test	Run Gatling simulations via Maven plugin
mvn gatling:test -Dgatling.simulationClass=MySimulation	Run specific simulation class
./gradlew gatlingRun	Run Gatling simulations via Gradle plugin
open target/gatling/simulation-timestamp/index.html	Open Gatling HTML report

wrk -t12 -c400 -d30s http://localhost:8080/api	12 threads, 400 connections, 30 second load test
wrk -t4 -c100 -d60s -s post.lua http://localhost:8080/orders	POST requests using a Lua script
curl -o /dev/null -s -w '%{time_total}\n' http://localhost:8080/health	Measure a single request's total time

Load Testing Tools Comparison

Tool	k6	Gatling	JMeter	Locust
Language	JavaScript	Scala/Kotlin/Java	XML (GUI) / Groovy	Python
Protocol support	HTTP, WebSocket, gRPC	HTTP, WebSocket, JMS	HTTP, JDBC, JMS, FTP, LDAP	HTTP, WebSocket
Performance	High (Go engine)	Very high (async Akka)	Moderate (JVM threads)	Moderate (Python async)
Reporting	Grafana integration, JSON	Excellent built-in HTML	Plugins required	Web UI, plugin
CI integration	Excellent (binary, exit code)	Good (Maven/Gradle)	Moderate (XML + plugins)	Good (CLI)
Distributed	k6 Cloud / k6 Operator	Gatling Enterprise	Controller + remote agents	Built-in master/worker
Learning curve	Low (JS)	Medium (Scala DSL)	Low (GUI) / Medium (scripting)	Low (Python)
Best for	API tests, CI pipelines	High-throughput, JVM teams	Existing JMeter adoption, complex protocols	Python teams, complex flows

Interview Q & A

Senior Engineer — Execution Depth

S-01 What is the difference between a load test, stress test, and soak test? When would you run each?

Load test: apply the expected production load (e.g., 500 RPS) for a realistic duration (30–60 minutes). Validate that the system meets its SLOs under normal conditions. Run: before every major release and after significant infrastructure changes.

Stress test: incrementally increase load beyond normal levels (150%, 200%, 300%) until the system degrades significantly or breaks. Goals: (1) find the breaking point (actual capacity ceiling), (2) observe the failure mode: does it fail gracefully (queue fills, returns 429) or catastrophically (OOM, cascading failure)? Run: during capacity planning, when preparing for a high-traffic event (product launch).

Soak test (endurance): apply normal production load for an extended period (4–24 hours). Goals: detect issues that only appear over time, such as memory leaks (heap trend upward over hours), connection pool exhaustion (leaked connections accumulate), disk fill (log files grow), and gradual degradation (GC becomes less effective as heap fills with long-lived objects). Run: monthly baseline, before a product launch where sustained high traffic is expected.

Spike test: sudden dramatic load increase (e.g., 50 → 500 RPS in 10 seconds). Validate autoscaling response time, circuit breaker behavior, request queuing. Run: before Black Friday/Cyber Monday, before a marketing campaign that generates sudden traffic.

S-02 Why should you never use average latency as a performance metric? What should you use instead?

The problem with averages: average latency is distorted by outliers and hides the tail. Consider 100 requests: 95 complete in 50ms, 5 complete in 5000ms. Average: (95×50 + 5×5000) / 100 = 297.5ms. This looks like a slow service. But 95% of users had a great experience (50ms); 5% had a terrible one (5000ms). The average tells you neither story accurately.

Worse example: 99 requests at 10ms, 1 request at 100,000ms (100 seconds, a timeout). Average: (99×10 + 100000) / 100 ≈ 1010ms. Looks slow. In reality, 99% of users see 10ms; one request is hung.

Use percentiles instead:
- P50 (median): 50% of requests complete within this time. Typical user.
- P95: 95% of requests complete within this time. Most users' worst case.
- P99: 99% of requests. Your SLO should be here. "99% of requests complete in < 500ms."
- P99.9: important at scale. At 1M req/hour, 0.1% = 1000 bad requests per hour.
- Max: can be misleading (GC pause, cold start) but useful for detecting extreme outliers.

When to care about which percentile:
- User experience SLO: P95 or P99
- Internal microservice SLO: P99 (failures propagate and amplify)
- Tail latency for high-volume services: P99.9

In an interview, do not present average latency as a performance metric; lead with percentiles.
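
The first example (95 requests at 50ms, 5 at 5000ms) can be reproduced with a short script. It uses the nearest-rank percentile definition, which is one of several common definitions; load testing tools may interpolate and report slightly different values. The helper names are illustrative.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samplesMs, p) {
  if (samplesMs.length === 0) throw new Error('no samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samplesMs) {
  return samplesMs.reduce((sum, ms) => sum + ms, 0) / samplesMs.length;
}

// 95 fast requests and 5 very slow ones
const latencies = [...Array(95).fill(50), ...Array(5).fill(5000)];

console.log('mean:', mean(latencies));           // 297.5
console.log('P50: ', percentile(latencies, 50)); // 50
console.log('P95: ', percentile(latencies, 95)); // 50
console.log('P99: ', percentile(latencies, 99)); // 5000
```

The mean sits at a value that no request actually experienced, while P50 and P99 together describe both groups of users.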

S-03 How would you write a k6 load test script for an e-commerce checkout flow?

```javascript
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { Rate } from 'k6/metrics';

// Custom metric: share of requests that did not return the expected status
const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '2m', target: 200 },  // ramp up to 200 VUs
    { duration: '10m', target: 200 }, // hold at 200 VUs
    { duration: '2m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'], // P99 < 500ms
    errors: ['rate<0.01'],            // error rate < 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://staging.example.com';

export default function () {
  group('Browse products', () => {
    const res = http.get(`${BASE_URL}/api/products?category=electronics`);
    check(res, { 'products 200': (r) => r.status === 200 });
    errorRate.add(res.status !== 200);
  });

  sleep(1); // think time between steps

  group('Add to cart', () => {
    const payload = JSON.stringify({ productId: 'prod-123', qty: 1 });
    const res = http.post(`${BASE_URL}/api/cart/items`, payload, {
      headers: { 'Content-Type': 'application/json', 'Authorization': `Bearer ${__ENV.TEST_TOKEN}` },
    });
    check(res, { 'add to cart 201': (r) => r.status === 201 });
    errorRate.add(res.status !== 201);
  });

  group('Checkout', () => {
    const payload = JSON.stringify({ paymentMethod: 'test_card', shippingAddress: 'test_addr' });
    const res = http.post(`${BASE_URL}/api/orders`, payload, {
      headers: { 'Content-Type': 'application/json' },
    });
    check(res, { 'order created 201': (r) => r.status === 201 });
    errorRate.add(res.status !== 201);
  });

  sleep(2);
}
```

Key patterns: stages for ramp-up/hold/ramp-down; thresholds for pass/fail gate; group() to see latency by step; separate custom error rate metric; environment variables for config (URL, token); check() for response validation.

S-04 How do you identify and fix a database bottleneck discovered during a load test?

Detection: during the load test, application CPU is low (< 30%) but latency climbs above SLO. Check the database: high CPU, slow query log entries appearing, connection pool at max capacity.

Step 1: Identify slow queries. Enable the slow query log (MySQL: slow_query_log=ON, long_query_time=0.5; PostgreSQL: log_min_duration_statement=500). Under load, queries that are fast with 10 users become slow with 500 users due to lock contention and data volume.

Step 2: Analyze query plans. EXPLAIN ANALYZE the slow queries. Look for:
- Seq Scan on a large table → missing index
- Row estimates far from actual row counts → stale statistics (run ANALYZE on the table)
- Nested loop joins with large tables → bad join order, missing index

Step 3: Common fixes.
- Missing index: add a covering index for the query's WHERE + ORDER BY + SELECT columns. Re-run the load test; a query can drop from, say, 800ms to 5ms.
- N+1 queries: use JOIN or batch fetch instead of querying in a loop. (JPA: use fetch join or @EntityGraph.)
- Lock contention: queries blocking each other. Optimize transaction scope (shorter transactions). Use SELECT ... FOR UPDATE SKIP LOCKED for queue patterns.
- Connection pool exhaustion: HikariCP maximumPoolSize too small. Add connections, but also check whether adding connections actually helps (if the DB itself is saturated, more connections don't help; fix the query).
- Read replicas: route read-heavy queries to read replicas. Reduces primary load.

Step 4: Verify the fix. Re-run the load test. Confirm: DB CPU reduced, slow queries gone from the log, latency at SLO.

S-05 Explain Little's Law and how you apply it to thread pool sizing.

Little's Law: L = λ × W, where:
- L = average number of requests in the system (concurrency / requests in flight)
- λ = arrival rate (throughput in RPS)
- W = average time a request spends in the system (average latency in seconds)

Application to thread pool sizing: if your service handles 200 RPS and average latency is 150ms, L = 200 × 0.15 = 30 concurrent requests. You need at least 30 threads to handle this load. Each thread handles one request at a time; at 150ms per request, one thread handles 6.7 req/s. For 200 RPS: 200 / 6.7 ≈ 30 threads.

Under degradation: if latency doubles to 300ms (DB slow), L = 200 × 0.3 = 60 concurrent requests. A thread pool of 50, which had headroom at 30, is now too small. Threads queue up waiting for the DB. Latency increases further. The queue grows. This is the cascading failure pattern.

Sizing with headroom: target RPS = 500 (peak), P99 latency target = 200ms (0.2s): L = 500 × 0.2 = 100 concurrent requests needed. Using the P99 target instead of the average is deliberately conservative, since Little's Law is stated for average latency. Set the thread pool to 120–150 for headroom (spikes, GC pauses).

The practical lesson: thread pool size = target_RPS × target_latency_in_seconds × safety_factor. I/O-bound services with high latency (DB-heavy, 200ms average) need larger thread pools than CPU-bound services with low latency (compute-heavy, 10ms average).
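
The sizing rule above translates directly into code. The sketch below takes latency in milliseconds to keep the arithmetic exact for the worked numbers; the function names and the default safety factor of 1.3 are assumptions chosen to land inside the 120–150 range mentioned in the answer.

```javascript
// Little's Law: concurrency = arrival rate (RPS) x time in system (seconds)
function requiredConcurrency(rps, latencyMs) {
  return (rps * latencyMs) / 1000;
}

// Thread pool size = required concurrency x safety factor, rounded up.
// The 1.3 default is an illustrative choice, not a recommendation.
function threadPoolSize(targetRps, targetLatencyMs, safetyFactor = 1.3) {
  return Math.ceil(requiredConcurrency(targetRps, targetLatencyMs) * safetyFactor);
}

console.log(requiredConcurrency(200, 150)); // 30 threads at normal latency
console.log(requiredConcurrency(200, 300)); // 60 threads once latency doubles
console.log(threadPoolSize(500, 200));      // 130 threads with headroom
```

Running the same calculation with the degraded latency, not just the healthy one, shows how much slack the pool has before requests start to queue.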

S-06 How do you integrate load tests into a CI/CD pipeline without slowing down the developer workflow?

The problem: full load tests take 30–60 minutes, which is too slow for every PR. But shipping without any performance validation causes regressions.

Tiered approach:
- Tier 1, fast smoke test (every PR, 2–5 min): 1–5 VUs for 2 minutes. Just validates that the endpoints are functional under minimal load. Catches: deployment failures, catastrophic regressions (10× slower), basic errors. k6 exits with a non-zero code (99) if any threshold fails → PR build fails.
- Tier 2, targeted regression test (merges to main, 10–15 min): load test at 50% of production peak for 10 minutes. Compare P99 latency vs stored baseline. Fail if P99 increases > 10% or error rate > 0.1%. Runs after unit/integration tests.
- Tier 3, full load test (nightly, 60+ min): full production load + stress test + soak test. Runs against the latest main branch. Results from passing runs are stored as the new baseline. Alert on any regression vs the previous night.
- Tier 4, pre-release load test (before major releases): full suite including spike test and soak test (4 hours). Manual trigger by the release manager.

Implementation in GitHub Actions:
- Tiers 1 and 2: automated steps in the main workflow
- Tier 3: scheduled workflow (on.schedule with cron: '0 2 * * *')
- Tier 4: workflow_dispatch (manual trigger with optional approver)

Baseline storage: store k6 JSON output in S3 per build. A comparison script reads the previous baseline and the current results and outputs a diff. Fail the build if regressions exceed thresholds.

S-07 What is the difference between throughput and concurrency? Why does this matter for capacity planning?

Throughput: requests completed per second. Measures how much work the system does.
Concurrency: number of requests in flight simultaneously at any given moment. Measures how many requests the system is holding at once.

They're related by Little's Law: Concurrency = Throughput × Latency. A system with 100 RPS throughput and 200ms latency has 20 concurrent requests. A system with 100 RPS throughput and 2000ms latency has 200 concurrent requests.

Why the distinction matters:

For thread pool sizing: you need enough threads to hold the concurrency (not the throughput). If you have 200 RPS but 500ms average latency: 200 × 0.5 = 100 concurrent requests. Thread pool must be ≥ 100. Many engineers mistakenly size thread pools to throughput (200) instead of concurrency (100 at 500ms latency, 10 at 50ms latency).

For capacity planning: doubling throughput with the same latency requires doubling concurrency (threads, connections, memory). But if throughput doubles AND latency doubles (system under stress), concurrency quadruples. The system saturates much faster than expected.

Closed vs open model load tests:
- Closed model (VUs): throughput = VUs / latency. As latency increases, throughput drops. This is how individual users behave (they wait for a response before the next request).
- Open model (arrival rate): throughput is fixed. As latency increases, concurrency builds up. This is how aggregate traffic arrives (independent requests, not waiting users).

For most backend services, use the open model (arrival rate), because it reveals saturation behavior correctly.

S-08 How do you design and run a soak test to detect memory leaks?

Goal: confirm that memory usage is stable over time under sustained load. A leak manifests as heap growth that GC cannot reclaim.

Setup:
- Duration: 4–8 hours minimum. Memory leaks in Java are often slow (MB/hour).
- Load: 50–75% of production peak. Enough to exercise the code paths, not enough to trigger resource exhaustion that masks the leak signal.
- Monitoring: heap used (post-GC, via JVM metrics), Old Gen %, GC pause frequency and duration. Use Datadog/Prometheus with jvm.memory.used{area="heap"} and jvm.gc.memory.promoted.

What to watch for:
- Sawtooth pattern (healthy): heap rises as requests are processed, drops after GC. The drop baseline stays constant. Memory is being reclaimed.
- Rising floor (leak): heap still shows a sawtooth, but the post-GC baseline rises steadily over hours. GC reclaims Young Gen but Old Gen grows. A memory leak.
- GC overhead growing: GC runs more frequently and for longer as Old Gen fills. Manifests as increasing GC pause time and an increasing jvm.gc.memory.promoted rate.

Response to a detected leak:
1. Take a heap dump once the post-GC floor has clearly risen: jmap -dump:live,format=b,file=heap.hprof <pid>. The live option forces a full GC first, so the dump contains only reachable objects, which is what leak analysis needs.
2. Analyze with Eclipse MAT: Leak Suspects → Dominator Tree → Path to GC roots
3. Find the retaining reference chain and the code responsible
4. Fix and re-run the soak test to validate the fix

Automating the pass/fail: export jvm.memory.used{area="heap"} to a time series. Fit a linear regression to the post-GC values over the test duration. If slope > threshold (e.g., 1MB/min), mark the test as failing (leak detected).
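
The automated pass/fail can be sketched as an ordinary least-squares slope over post-GC heap samples. The sample shape `{ minutes, heapMb }`, the function names, and the synthetic data are assumptions for illustration; a real check would pull the samples from the metrics store.

```javascript
// Least-squares slope of heap (MB) over time (minutes).
function heapSlopeMbPerMin(samples) {
  const n = samples.length;
  if (n < 2) throw new Error('need at least two samples');
  const meanX = samples.reduce((s, p) => s + p.minutes, 0) / n;
  const meanY = samples.reduce((s, p) => s + p.heapMb, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.minutes - meanX) * (p.heapMb - meanY);
    den += (p.minutes - meanX) ** 2;
  }
  if (den === 0) throw new Error('samples must span more than one timestamp');
  return num / den;
}

// Flag a leak when the post-GC floor grows faster than the allowed slope.
function detectLeak(samples, maxSlopeMbPerMin = 1) {
  const slope = heapSlopeMbPerMin(samples);
  return { slope, leaking: slope > maxSlopeMbPerMin };
}

// Synthetic 4-hour runs sampled every 30 minutes
const leaky = Array.from({ length: 9 }, (_, i) => ({
  minutes: i * 30,
  heapMb: 500 + 2 * i * 30, // floor rises 2 MB/min
}));
const stable = Array.from({ length: 9 }, (_, i) => ({
  minutes: i * 30,
  heapMb: 500 + (i % 2 === 0 ? 5 : -5), // noise around a flat floor
}));

console.log(detectLeak(leaky));  // slope ~2, leaking: true
console.log(detectLeak(stable)); // slope ~0, leaking: false
```

The slope threshold should be tuned per service: a cache that is still filling can look like a leak in the first hour, so it helps to fit only the samples taken after warm-up.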

Staff Engineer — Design & Cross-System Thinking

ST-01 How do you design a load testing strategy for a new microservices platform before its first production launch?

Start with the threat model: which services are on the critical path for checkout? Which have external dependencies (payment processor, shipping API)? Where are the single points of failure? These are the highest-value test targets.

Phase 1, baseline each service individually: test each service in isolation (with mocked or stubbed dependencies). Find each service's max sustainable throughput and latency at SLO. Document: "Orders-service handles 800 RPS at P99 < 200ms with 4 replicas."

Phase 2, integration load test: test the critical path end-to-end (checkout flow through all real services). Identify: which service is the bottleneck of the chain? Does the chain's P99 latency equal the sum of all hops' P99 latencies (worst case) or their max?

Phase 3, dependency capacity: coordinate with the payment processor and shipping API: what is their rate limit? Can they handle your projected peak? If not, test with circuit breakers and fallbacks.

Phase 4, failure mode testing: inject failures during the load test: kill one service instance, introduce 500ms latency on the DB, saturate the payment processor. Validate: circuit breakers trip correctly, retries don't cascade, autoscaling responds.

Phase 5, capacity for peak: project traffic: launch + marketing = 5× normal traffic. Run a spike test at 5× normal. Validate: Kubernetes HPA scales in time (usually 2–3 min), no requests lost during scale-up. If HPA is too slow: pre-scale before launch, use KEDA for faster scaling signals.

Success criteria for launch:
- All services meet SLOs at 2× projected launch traffic (safety margin)
- Spike test at 5× passes with < 0.1% error rate
- Soak test at 1× for 4 hours passes (no memory leaks)
- Autoscaling validated to respond within 3 minutes

ST-02 How do you use load testing results to make a capacity planning recommendation?

Capacity planning inputs from load tests:

1. Baseline capacity per instance: from the breakpoint test: "A single orders-service pod handles 200 RPS at P99 < 200ms. At 250 RPS, P99 degrades to 800ms." Baseline = 200 RPS per pod.
2. Current production traffic + growth projection: current: 600 RPS peak. Projected growth: 15% MoM compounded over 6 months ≈ 2.3× ≈ 1380 RPS in 6 months. Add safety margin (2×): need capacity for 2760 RPS in 6 months.
3. Required instances: ceiling(2760 / 200) = 14 pods. Currently running 4 pods (600 / 200 = 3 + 1 buffer).
4. Autoscaling configuration: HPA target: 70% of max capacity per pod (140 RPS as the scale trigger). Min replicas: 4 (covers current peak with 1 spare). Max replicas: 20 (covers the 6-month projection at the 140 RPS trigger).
5. Cost model: 14 pods × pod cost ($X/month) = total compute cost. Present with and without caching optimization (if adding Redis reduces per-pod load by 40%, you need only 9 pods → about 35% cost saving).
6. Database scaling: load test showed DB CPU at 60% under peak load. At 2.3× traffic: DB CPU at ~138%, which means saturated. Recommendation: add a read replica, upgrade the instance class, or shard.

The recommendation format: present as: current state → projected need in 6/12 months → required changes → cost. Include: what happens if we don't scale (user impact at specific RPS thresholds). Never present just "we need more servers"; quantify exactly how many, when, and why.
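
The arithmetic in steps 2 to 4 can be scripted so the recommendation is reproducible when inputs change. This is a sketch with hypothetical function names; it compounds the growth exactly, so the projection comes out at about 1388 RPS rather than the rounded 1380 RPS used in the prose, which does not change the pod count.

```javascript
// Compound monthly growth: current x (1 + g)^months
function projectPeakRps(currentRps, monthlyGrowth, months) {
  return currentRps * Math.pow(1 + monthlyGrowth, months);
}

// Pods required to serve targetRps when each pod sustains rpsPerPod
function podsNeeded(targetRps, rpsPerPod) {
  return Math.ceil(targetRps / rpsPerPod);
}

const projected = projectPeakRps(600, 0.15, 6); // ~1388 RPS
const withMargin = projected * 2;               // 2x safety margin, ~2776 RPS

const podsAtBreakpoint = podsNeeded(withMargin, 200);       // 200 RPS per pod at SLO
const podsAtHpaTarget = podsNeeded(withMargin, 200 * 0.7);  // 140 RPS scale trigger

console.log(Math.round(projected)); // 1388
console.log(podsAtBreakpoint);      // 14
console.log(podsAtHpaTarget);       // 20
```

Sizing at the HPA trigger (140 RPS per pod) instead of the breakpoint (200 RPS per pod) is what takes the ceiling from 14 to 20 pods, matching the max replica setting.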

ST-03 How do you run distributed load tests at 100,000+ RPS? Staff ▾

Single machine limits: a single k6 instance on a 4-core machine can generate ~10k–30k RPS for simple HTTP requests. Complex scenarios with encryption, JSON parsing, think time: much less. For 100k RPS, you need distributed load generation. k6 distributed options: k6 Cloud (Grafana): SaaS solution. Specify vus: 10000; Grafana provisions cloud generators automatically. Aggregates results centrally. Simple but has cost per test. k6 Operator (Kubernetes): run k6 as a Kubernetes job with N parallel pods. yaml apiVersion: k6.io/v1alpha1 kind: K6 spec: parallelism: 20 # 20 k6 pods script: { configMap: { name: test-script } } 20 pods × 5k RPS = 100k RPS. Results aggregated via the operator. Self-hosted. Gatling Enterprise: official enterprise version. Cluster of Gatling injectors. Each injector node is async (Akka-based) — very high RPS per node. Centralized report. Architecture for 100k RPS tests: - Load generators: 10–20 EC2 instances or Kubernetes pods, same region as target - Monitoring: separate monitoring stack (don't load the monitoring with test traffic) - Network: generators and target must have sufficient NIC bandwidth (100k RPS × 5KB avg = 500 MB/s) - Stateful test: if test uses auth tokens, pre-generate 100k test accounts and distribute across generator pods (each pod gets a slice of the test user pool)

Controlling blast radius:
- Start at 10% (10k RPS). Validate generators and system are healthy.
- Ramp to 25%, 50%, 75%, 100%. Stop if error rate > 1% at any stage.
- Never start at 100k RPS: generators might not be configured correctly, and you'd DDoS your own infrastructure without realizing it.
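The staged ramp with an abort condition can be sketched as a small control loop. This is an illustration only: `measure_error_rate` is a hypothetical callback standing in for "run this stage and read the error rate from monitoring", and the stage fractions and 1% limit are the ones listed above.

```python
def run_ramp(target_rps, measure_error_rate,
             stages=(0.10, 0.25, 0.50, 0.75, 1.00), max_error_rate=0.01):
    """Step through ramp stages and stop at the first unhealthy one.

    measure_error_rate(stage_rps) is a caller-supplied function that
    drives load at stage_rps and returns the observed error rate (0..1).
    """
    completed = []
    for fraction in stages:
        stage_rps = round(target_rps * fraction)
        error_rate = measure_error_rate(stage_rps)
        if error_rate > max_error_rate:
            # Abort: do not proceed to higher load levels.
            return {"aborted_at_rps": stage_rps, "completed_rps": completed}
        completed.append(stage_rps)
    return {"aborted_at_rps": None, "completed_rps": completed}


if __name__ == "__main__":
    # Fake measurement: the system starts failing at 75k RPS.
    def fake_measure(rps):
        return 0.02 if rps >= 75_000 else 0.001

    print(run_ramp(100_000, fake_measure))
```

Most load tools express the same idea declaratively (ramping stages plus an abort-on-threshold setting); the loop above only makes the decision logic explicit.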

Principal Engineer — Architecture & Org-Scale Thinking

P-01 How do you build a performance engineering function for an organization with 50 teams and 200 services? Principal

The scaling problem: performance testing can't be a centralized team that each of 50 teams must file requests with. That's a bottleneck that never clears. But without standardization, every team invents its own approach: incomparable results, no baseline, no regression detection.

Platform approach: performance testing as infrastructure.

Standardized test framework: Publish a company k6 library with: auth helpers (get a test token with one call), standard scenarios (checkout flow, search, browse; reusable across services), standard thresholds (company SLO defaults), and reporting to the shared Grafana instance. Teams extend the library; they don't start from scratch.

CI integration (tier 1 and 2 tests): Platform team provides a GitHub Actions reusable workflow: uses: company/perf-test-action@v2. Teams include it in their pipeline. Tier 1 (smoke) and Tier 2 (regression vs baseline) run automatically. Teams don't configure anything unless they need custom scenarios.

Centralized performance baseline store: One S3 prefix per service, e.g. s3://perf-baselines/orders-service/. Each successful nightly test uploads results. The regression test compares to the last successful baseline. The performance team owns the comparison logic and threshold defaults.

Nightly dashboard: Grafana dashboard showing all services: last test date, P99 vs SLO, trend (improving/degrading). Engineering leads review weekly. Red services are flagged for the owning team.

Performance engineering team role: Not to run tests for teams. Instead: maintain the platform (library, CI action, baseline store), run performance reviews for complex scenarios (new architecture, scaling decisions), own the performance testing standards, and embed in squads for major launches.

Incentive structure: Track performance regression rate per team. Teams that ship P99 regressions get a "performance debt" metric. An SLO breach in production links back to the test that should have caught it. This makes performance visible as an engineering quality metric alongside test coverage and incident rate.
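The baseline comparison that the platform team owns could look roughly like the sketch below. The function name, the status labels, and the 10% tolerance are assumptions made for illustration; the source does not specify the comparison logic or its defaults.

```python
def classify_run(baseline_p99_ms, current_p99_ms, slo_p99_ms, tolerance=0.10):
    """Compare a nightly P99 against the stored baseline and the SLO.

    Returns one of:
      "slo_breach"  - current P99 exceeds the SLO (red on the dashboard)
      "regression"  - within SLO, but worse than baseline by > tolerance
      "ok"          - within SLO and within tolerance of the baseline
    The 10% default tolerance is an illustrative assumption.
    """
    if current_p99_ms > slo_p99_ms:
        return "slo_breach"
    if current_p99_ms > baseline_p99_ms * (1 + tolerance):
        return "regression"
    return "ok"


if __name__ == "__main__":
    print(classify_run(baseline_p99_ms=180, current_p99_ms=190, slo_p99_ms=300))
    print(classify_run(baseline_p99_ms=180, current_p99_ms=210, slo_p99_ms=300))
    print(classify_run(baseline_p99_ms=180, current_p99_ms=320, slo_p99_ms=300))
```

Checking the SLO before the baseline keeps the two signals separate: a service can be inside its SLO and still be drifting, which is exactly what the "performance debt" metric is meant to surface.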

System Design Scenarios

Design a Load Testing Strategy for a Black Friday Sale

Problem

An e-commerce platform expects 10× normal traffic on Black Friday. Normal peak is 500 RPS on the checkout API. The team has 6 weeks. Last year, the site went down 45 minutes into the sale — no load testing was done. This year, leadership wants zero downtime.

Constraints

6 weeks to prepare
Staging environment is 50% of production capacity (scale separately for testing)
Payment processor can handle max 2000 RPS before rate-limiting
Kubernetes with HPA autoscaling

Key Discussion Points

Week 1 — Baseline: Run load test at current production peak (500 RPS). Establish P99 latency and error rate per service as the baseline. Identify current bottlenecks. Document: which services are headroom-constrained, DB connection pool sizes, HPA configuration, payment processor limit.
Week 2–3 — Fix known bottlenecks: From baseline tests: DB query slow at 300+ RPS (missing index found, fixed). Payment service rate-limited at 800 RPS (add circuit breaker + queue). HPA min replicas too low — scale event takes 3 minutes (add pre-scaling job for Black Friday window).
Week 4 — Scale test on production-sized environment: Provision production-scale staging (or temporarily scale down prod + run test after hours). Run load test at 2× peak (1000 RPS) → 5× (2500 RPS) → 10× (5000 RPS). Observe: which service breaks first? DB? Thread pool? HPA lag?
Week 5 — Chaos + spike test: Inject failures during 5× load test: kill one checkout-service pod, simulate DB slow, saturate payment processor (test circuit breaker). Validate: requests are queued or gracefully rejected (429 with Retry-After), not silently dropped. HPA scales within SLA.
Week 6 — Dress rehearsal: Full Black Friday simulation: pre-scale to Black Friday levels at midnight Thursday. Run 10× spike over 30 minutes. Run at 5× sustained for 2 hours. Incident team on standby. Run-through of the runbook: who does what if checkout degrades?
Black Friday day: Pre-scale (don't rely on HPA for the initial surge). Monitor: Datadog dashboard with checkout error rate, P99, DB connection pool, payment processor throughput. Pre-agreed thresholds: error rate > 0.5% → P1 alert → who gets paged and what they do.
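The payment processor limit is worth quantifying before the Week 4 and Week 5 tests. The sketch below assumes every checkout request makes exactly one payment call, which the scenario does not state; adjust `payment_calls_per_checkout` to the measured ratio. Function names are hypothetical.

```python
def payment_overflow_rps(checkout_rps, processor_limit_rps,
                         payment_calls_per_checkout=1.0):
    """RPS that exceeds the processor cap and must be queued or rejected."""
    payment_rps = checkout_rps * payment_calls_per_checkout
    return max(0.0, payment_rps - processor_limit_rps)


def queued_backlog(overflow_rps, duration_s):
    """Requests accumulated if all overflow is queued for duration_s seconds."""
    return overflow_rps * duration_s


if __name__ == "__main__":
    normal_peak = 500      # RPS on the checkout API
    processor_limit = 2000  # RPS before the processor rate-limits
    for multiple in (2, 5, 10):
        rps = normal_peak * multiple
        overflow = payment_overflow_rps(rps, processor_limit)
        backlog = queued_backlog(overflow, 30 * 60)
        print(f"{multiple}x ({rps} RPS): overflow {overflow:.0f} RPS, "
              f"backlog after 30 min {backlog:.0f}")
```

Under that one-call-per-checkout assumption, 10× traffic exceeds the cap by 3000 RPS, so a queue alone grows without bound during the spike. That is the reason the plan pairs the queue with graceful rejection (429 with Retry-After) and tests both.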

🚩 Red Flags

Testing at 50% staging without scaling to prod-like capacity — results won't predict production behavior
Not testing the payment processor rate limit scenario — it's a hard cap that will cause silent failures under load
Relying on HPA to scale from normal to 10× during the surge — HPA takes 2-3 min; initial spike overwhelms the baseline pod count
No chaos testing — you don't know how the system fails until you make it fail in a controlled environment

Diagnose Why Your Service Passes Load Tests but Fails in Production

Problem

The orders-service passes all load tests in staging (500 RPS, P99 < 300ms). But in production at the same load, P99 is 1200ms and error rate is 2%. Staging and production use the same code. The team is confused and losing trust in the tests.

Constraints

Both environments use 4 pods, same instance type
Same JVM config, same database engine
Production DB has 50M records; staging DB has 100K records

Key Discussion Points

Root cause #1 — Data volume difference: 50M production records vs 100K staging. Queries that run in milliseconds with 100K rows (even a full table scan is fast at small scale) become slow with 50M rows. A sequential scan or a poor join order that's invisible at small scale becomes catastrophic at large scale. Solution: load a representative data sample into staging (at least 10% of production volume, with realistic distribution). Re-run the load test. Query plans at 5M rows will usually reveal the same index issues as 50M rows.
Root cause #2 — Cache hit rate: With 100K records, the load test's requests populate the Redis cache quickly, and the hit rate settles around 95%. With 50M records, 5000 distinct requests touch only 0.01% of the keyspace, so the cache barely helps. Production is cache-cold for most requests; staging is always cache-warm. Solution: test with a working set larger than the cache size. Or pre-populate cache before the load test starts and use a realistic request distribution (Zipf distribution: a few products are accessed far more often than others).
Root cause #3 — Production connections: Production has other services sharing the DB (analytics jobs, reporting queries, admin tools) that don't exist in staging. DB connection pool is shared. Under load, production DB is also serving analytics reads — staging is not. Solution: test production-like by simulating background load alongside the primary load test. Or isolate the production DB from analytics traffic (read replica for analytics).
Root cause #4 — Network path: Staging is in a single AZ; production spans 3 AZs. Cross-AZ calls add 1–2ms per hop. An order service with 10 microservice calls: 10 × 1.5ms = 15ms additional latency under cross-AZ routing — invisible in single-AZ staging. Solution: staging must mirror production network topology (multi-AZ), or account for this latency offset when comparing results.
The lesson: A load test is only as good as its resemblance to production. The primary divergences are: data volume, data distribution, cache state, background noise, and network topology. Document these gaps explicitly for every environment. Where gaps are unavoidable, account for them in the SLO thresholds (e.g., staging SLO is 20% tighter than production SLO to account for cache-warm advantage).
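To see why the request distribution in root cause #2 matters, the sketch below compares the expected hit rate of an idealized cache under uniform and Zipf-distributed access. It assumes the cache holds exactly the most popular items and a Zipf exponent of 1.0; both are simplifications, and the catalog size is scaled down so the example runs quickly.

```python
import math


def uniform_hit_rate(n_items, cache_size):
    """Expected hit rate when every item is equally likely to be requested."""
    return min(cache_size, n_items) / n_items


def zipf_hit_rate(n_items, cache_size, exponent=1.0):
    """Expected hit rate when popularity follows a Zipf law.

    Item at popularity rank r is requested with weight 1 / r**exponent.
    The cache is idealized: it holds the cache_size most popular items.
    """
    cached = min(cache_size, n_items)
    top = math.fsum(1.0 / rank ** exponent for rank in range(1, cached + 1))
    rest = math.fsum(1.0 / rank ** exponent
                     for rank in range(cached + 1, n_items + 1))
    return top / (top + rest)


if __name__ == "__main__":
    n_items, cache_size = 1_000_000, 10_000  # cache covers 1% of the catalog
    print(f"uniform: {uniform_hit_rate(n_items, cache_size):.2%}")
    print(f"zipf:    {zipf_hit_rate(n_items, cache_size):.2%}")
```

With the cache covering the same 1% of the catalog, the skewed distribution yields a far higher hit rate than uniform access. A load test that picks IDs uniformly at random therefore misrepresents cache behavior in the opposite direction from a tiny staging dataset; neither matches production unless the request mix does.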

🚩 Red Flags

Concluding the load test is useless — it caught real bottlenecks; the environment gaps need fixing, not the test approach
Only fixing the staging DB data without checking cache behavior — multiple root causes, fix all
Not documenting the environmental differences in the test report — next engineer will repeat this investigation