// test types · metrics · tools · bottleneck analysis · capacity planning · senior → principal
Apdex = (Satisfied + Tolerating/2) / Total. Satisfied: latency < threshold T (e.g., 500ms). Tolerating: latency between T and 4T. Anything slower counts as frustrated and contributes 0. Score 0–1. Human-readable summary of performance vs threshold.
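The score above can be computed directly from raw latency samples. The following is a minimal sketch, assuming latencies are collected in milliseconds; the function name `apdex` and the sample values are illustrative, and the strict `<` comparisons follow the definition given here.

```javascript
// Apdex-style score: (satisfied + tolerating / 2) / total.
// satisfied: latency < T; tolerating: T <= latency < 4T; the rest are frustrated.
function apdex(latenciesMs, thresholdMs) {
  if (latenciesMs.length === 0) return null; // no samples, no score
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms < thresholdMs) satisfied += 1;
    else if (ms < 4 * thresholdMs) tolerating += 1;
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}

// Made-up sample: 3 satisfied, 2 tolerating, 1 frustrated at T = 500ms.
const apdexSample = [120, 300, 450, 700, 1900, 2500];
console.log(apdex(apdexSample, 500).toFixed(2));
```

A single score like this is convenient for dashboards, but it hides the shape of the tail, so it works best alongside percentiles rather than instead of them.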
X-Load-Test: true), filter them from business metrics.
Load generator sizing: the load generator itself can become the bottleneck. A single k6 process on a 2-core VM can generate ~10k RPS for simple requests; for complex flows, less. Distribute the load generator: k6 cloud, k6 Operator (Kubernetes), Gatling Enterprise. Ensure load generator CPU and network are not the constraint.
Warm up first: send light traffic (10% of target) for 2–5 minutes before the main test phase. Allows JVM warmup, connection pool fill, cache warm, CDN warm. Metrics from the ramp-up phase should be excluded from the pass/fail evaluation.
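```javascript
thresholds: {
  http_req_duration: ['p(99)<500'],  // P99 latency below 500ms
  http_req_failed: ['rate<0.001'],   // fewer than 0.1% failed requests
}
```

A failing threshold makes k6 exit with a non-zero code (99 in current releases), so CI marks the build as failed.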
Baselining: run a full load test on a known-good build. Store results as baseline. Every subsequent test compares against it. Catch regressions before production.
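A baseline comparison can be as small as a function that diffs two result summaries. The sketch below assumes results have already been reduced to a flat object of metric name to value; the metric names, the 10% tolerance, and the `findRegressions` helper are hypothetical, not part of k6.

```javascript
// Flag metrics that got worse than the baseline by more than `tolerance`
// (relative change). Assumes higher values are worse and baseline values > 0.
function findRegressions(baseline, current, tolerance) {
  const regressions = [];
  for (const [metric, baseValue] of Object.entries(baseline)) {
    const value = current[metric];
    if (value === undefined || baseValue <= 0) continue; // nothing to compare
    const change = (value - baseValue) / baseValue;
    if (change > tolerance) {
      regressions.push({ metric, baseValue, value, changePct: change * 100 });
    }
  }
  return regressions;
}

// Illustrative numbers only.
const baselineRun = { p95_ms: 180, p99_ms: 420, error_rate: 0.001 };
const currentRun = { p95_ms: 185, p99_ms: 510, error_rate: 0.001 };
const regressions = findRegressions(baselineRun, currentRun, 0.10);
console.log(regressions.map((r) => r.metric));
```

A relative tolerance keeps run-to-run noise from failing builds; the right value depends on how stable the test environment is.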

| Test type | Description |
|---|---|
| Smoke | 1-2 VUs, few minutes. Sanity check the system is functional. Run before any full load test. |
| Load | Expected production peak load, 30-60 min. Validate SLOs are met under normal conditions. |
| Stress | Increase load until failure (150%, 200%, 300% of peak). Find breaking point and failure mode. |
| Soak/Endurance | Normal load for 4-24 hours. Detect memory leaks, connection exhaustion, slow degradation. |
| Spike | Sudden 10× traffic increase, short duration. Validate autoscaling, circuit breakers, queues. |
| Breakpoint | Ramp load until SLO threshold is crossed. Precise capacity number: 'We handle up to 850 RPS at P99 < 500ms.' |
| Volume | Normal RPS but very large data payloads. Validate streaming, chunking, memory usage. |
| Scalability | Add capacity (pods/instances), re-run test. Measure linear vs sublinear scaling. Find coordination bottlenecks. |

| Metric | What it tells you |
|---|---|
| Throughput (RPS/TPS) | Requests successfully served per second. Primary capacity metric. Distinguish: attempted vs successful. |
| P50 (median) latency | Half of requests are faster than this. Typical user experience. |
| P95 latency | 95% of requests are faster. Most users' worst case. |
| P99 latency | 99% of requests are faster. Tail latency. Use for SLOs. High P99 = some users have bad experience. |
| P99.9 latency | Worst 0.1%. Important for high-volume services: at 1M req/day, 0.1% = 1000 bad requests. |
| Error rate | % of requests returning errors. Separate HTTP errors (4xx, 5xx) from timeouts. |
| Saturation | Resource utilization: CPU %, memory %, DB connections %, thread pool queue. Leading indicator. |
| Time to first byte (TTFB) | Time from request sent to first byte received. Separates server processing from network/transfer. |
| Requests in flight | Concurrency. Little's Law: concurrency = RPS × avg_latency (in seconds). Thread pool / connection pool size must be at least this value. |

| k6 concept | Usage |
|---|---|
| Options block | vus (virtual users), duration, stages (ramp), thresholds (pass/fail criteria), scenarios. |
| Default function | The scenario each VU executes in a loop. Import: http from 'k6/http'. |
| Thresholds | { 'http_req_duration': ['p(99)<500'], 'http_req_failed': ['rate<0.001'] }. Non-zero exit code (99) on failure. |
| Stages (ramp) | [{ duration: '2m', target: 500 }, { duration: '10m', target: 500 }, { duration: '2m', target: 0 }]. Ramp up, hold, ramp down. Each stage ramps linearly to its target, so a stage holds only when its target equals the previous one. |
| Checks | check(response, { 'status 200': r => r.status === 200 }). Tracks pass rate as a metric. |
| Groups | group('login flow', () => { ... }). Groups metrics by user journey step. |
| Output | k6 run --out influxdb=http://influx:8086/k6 --out json=results.json script.js |
| Scenarios | Define multiple scenarios with different executors: constant-arrival-rate, ramping-vus, per-vu-iterations. |

| Symptom | Likely cause and next step |
|---|---|
| High CPU + high latency | CPU-bound. Profile with async-profiler/JFR to find hot methods. Optimize or scale horizontally. |
| Normal CPU + high latency + DB slow queries | Database bottleneck. Check slow query log, explain plans, connection pool utilization. Add index, cache, read replica. |
| Normal CPU + increasing latency over time | Soak issue: memory leak (GC overhead increasing) or connection leak (pool exhaustion). Monitor GC and pool metrics. |
| High CPU + low RPS (generator) | Load generator is the bottleneck. Distribute load generation. Generator CPU must stay < 70%. |
| Thread pool queue depth growing | Thread pool too small for the offered load, or downstream is slow (threads blocked in I/O). Diagnose with thread dump. |
| Error rate spike at specific RPS | Saturation point found. Resource (thread pool, DB connections, file descriptors) exhausted. Increase limit or add capacity. |
| Latency OK, error rate rises | Circuit breaker tripping, downstream quota exceeded, or connection refused on downstream. Check downstream health. |

| Attribute | k6 | Gatling | JMeter | Locust |
|---|---|---|---|---|
| Language | JavaScript | Scala/Kotlin/Java | XML (GUI) / Groovy | Python |
| Protocol support | HTTP, WebSocket, gRPC | HTTP, WebSocket, JMS | HTTP, JDBC, JMS, FTP, LDAP | HTTP, WebSocket |
| Performance | High (Go engine) | Very high (async Akka) | Moderate (JVM threads) | Moderate (Python async) |
| Reporting | Grafana integration, JSON | Excellent built-in HTML | Plugins required | Web UI, plugin |
| CI integration | Excellent (binary, exit code) | Good (Maven/Gradle) | Moderate (XML + plugins) | Good (CLI) |
| Distributed | k6 Cloud / k6 Operator | Gatling Enterprise | Controller + remote agents | Built-in master/worker |
| Learning curve | Low (JS) | Medium (Scala DSL) | Low (GUI) / Medium (scripting) | Low (Python) |
| Best for | API tests, CI pipelines | High-throughput, JVM teams | Existing JMeter adoption, complex protocols | Python teams, complex flows |
(95×50 + 5×5000) / 100 = 297.5ms. This looks like a slow service. But 95% of users had a great experience (50ms); 5% had a terrible one (5000ms). The average tells you neither story accurately.
Worse example: 99 requests at 10ms, 1 request at 100,000ms (100 seconds — timeout). Average: (99×10 + 100000) / 100 = ~1010ms. Looks slow. In reality, 99% of users see 10ms; one request is hung.
Use percentiles instead:
- P50 (median): 50% of requests complete within this time. Typical user.
- P95: 95% of requests complete within this time. Most users' worst case.
- P99: 99% of requests. Your SLO should be here. "99% of users get < 500ms."
- P99.9: important at scale. At 1M req/hour, 0.1% = 1000 bad requests per hour.
- Max: can be misleading (GC pause, cold start) but useful for detecting extreme outliers.
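The gap between the average and the percentiles is easy to reproduce. This is a small sketch using the 95-fast / 5-slow distribution from the example above; the nearest-rank percentile method is one common definition, chosen here for simplicity, and load testing tools may interpolate differently.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 95 requests at 50ms and 5 requests at 5000ms.
const latencies = [...Array(95).fill(50), ...Array(5).fill(5000)];

console.log('mean', mean(latencies));
console.log('p50 ', percentile(latencies, 50));
console.log('p95 ', percentile(latencies, 95));
console.log('p99 ', percentile(latencies, 99));
```

The mean lands between the two groups and describes neither, while P50 and P99 each describe one of them.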
When to care about which percentile:
- User experience SLO: P95 or P99
- Internal microservice SLO: P99 (failures propagate and amplify)
- Tail latency for high-volume services: P99.9

Never mention average latency in an interview when discussing performance metrics.

```javascript
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { Rate } from 'k6/metrics';

// Custom metric: share of requests that did not return the expected status.
const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '2m', target: 200 },  // ramp up to 200 VUs
    { duration: '10m', target: 200 }, // hold at 200 VUs
    { duration: '2m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'], // P99 < 500ms
    errors: ['rate<0.01'],            // error rate < 1%
  },
};

// Target and token come from the environment so the script is reusable.
const BASE_URL = __ENV.BASE_URL || 'https://staging.example.com';

export default function () {
  group('Browse products', () => {
    const res = http.get(`${BASE_URL}/api/products?category=electronics`);
    check(res, { 'products 200': (r) => r.status === 200 });
    errorRate.add(res.status !== 200);
  });

  sleep(1); // think time between steps

  group('Add to cart', () => {
    const payload = JSON.stringify({ productId: 'prod-123', qty: 1 });
    const res = http.post(`${BASE_URL}/api/cart/items`, payload, {
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${__ENV.TEST_TOKEN}`,
      },
    });
    check(res, { 'add to cart 201': (r) => r.status === 201 });
    errorRate.add(res.status !== 201);
  });

  group('Checkout', () => {
    const payload = JSON.stringify({
      paymentMethod: 'test_card',
      shippingAddress: 'test_addr',
    });
    const res = http.post(`${BASE_URL}/api/orders`, payload, {
      headers: { 'Content-Type': 'application/json' },
    });
    check(res, { 'order created 201': (r) => r.status === 201 });
    errorRate.add(res.status !== 201);
  });

  sleep(2); // think time before the next iteration
}
```

Key patterns: stages for ramp-up/hold/ramp-down; thresholds for pass/fail gate; group() to see latency by step; separate custom error rate metric; environment variables for config (URL, token); check() for response validation.
slow_query_log=ON, long_query_time=0.5; PostgreSQL: log_min_duration_statement=500). Under load, queries that are fast with 10 users become slow with 500 users due to lock contention and data volume.
Step 2: Analyze query plans. EXPLAIN ANALYZE the slow queries. Look for:
- Seq Scan on a large table → missing index
- High row estimates vs actual → stale statistics (ANALYZE table)
- Nested loop joins with large tables → bad join order, missing index
Step 3: Common fixes.
- Missing index: add a covering index for the query's WHERE + ORDER BY + SELECT columns. Re-run load test: query goes from 800ms to 5ms.
- N+1 queries: use JOIN or batch fetch instead of querying in a loop. (JPA: use fetch join or @EntityGraph.)
- Lock contention: queries blocking each other. Optimize transaction scope (shorter transactions). Use SELECT FOR UPDATE SKIP LOCKED for queue patterns.
- Connection pool exhaustion: HikariCP maxPoolSize too small. Add connections, but also check whether adding connections actually helps (if the DB itself is saturated, more connections don't help; fix the query).
- Read replicas: route read-heavy queries to read replicas. Reduces primary load.
Step 4: Verify fix. Re-run the load test. Confirm: DB CPU reduced, slow queries gone from log, latency at SLO.

L = 200 × 0.15 = 30 concurrent requests
You need at least 30 threads to handle this load. Each thread handles one request; at 150ms per request, one thread handles 6.7 req/s. For 200 RPS: 200 / 6.7 ≈ 30 threads.
Under degradation: if latency doubles to 300ms (DB slow): L = 200 × 0.3 = 60 concurrent requests
Thread pool of 50 is now too small. Threads queue up waiting for DB. Latency increases further. Queue grows. This is the cascading failure pattern.
Sizing with headroom: Target RPS = 500 (peak). P99 latency target = 200ms (0.2s): L = 500 × 0.2 = 100 concurrent requests needed. Set thread pool to 120–150 for headroom (spikes, GC pauses).
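The sizing arithmetic above can be wrapped in two small helpers. This is a sketch under the same assumptions as the worked numbers (one thread per in-flight request); the function names are illustrative, and latency is passed in milliseconds to keep the arithmetic in integers.

```javascript
// Little's Law: L = arrival rate × time in system.
function requiredConcurrency(rps, latencyMs) {
  return (rps * latencyMs) / 1000;
}

// Thread pool size = concurrency × safety factor, rounded up.
function threadPoolSize(rps, latencyMs, safetyFactor) {
  return Math.ceil(requiredConcurrency(rps, latencyMs) * safetyFactor);
}

console.log(requiredConcurrency(200, 150)); // normal case from the text
console.log(requiredConcurrency(200, 300)); // latency doubled
console.log(threadPoolSize(500, 200, 1.5)); // peak sizing with 50% headroom
```

Running the degraded case next to the normal one makes the failure mode visible: the same traffic needs twice the threads once latency doubles.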
The practical lesson: thread pool size = target_RPS × target_latency_in_seconds × safety_factor. I/O-bound services with high latency (DB-heavy, 200ms average) need larger thread pools than CPU-bound services with low latency (compute-heavy, 10ms average).

on: schedule: cron: '0 2 * * *')
- Tier 4: workflow_dispatch (manual trigger with optional approver)
Baseline storage: store k6 JSON output in S3 per build. A comparison script reads previous baseline and current results, outputs a diff. Fail the build if regressions exceed thresholds.

Concurrency = Throughput × Latency. A system with 100 RPS throughput and 200ms latency has 20 concurrent requests. A system with 100 RPS throughput and 2000ms latency has 200 concurrent requests.
Why the distinction matters:
For thread pool sizing: you need enough threads to hold the concurrency (not the throughput). If you have 200 RPS but 500ms average latency: 200 × 0.5 = 100 concurrent requests. Thread pool must be ≥ 100. Many engineers mistakenly size thread pools to throughput (200) instead of concurrency (100 at 500ms latency, 10 at 50ms latency).
For capacity planning: doubling throughput with the same latency requires doubling concurrency (threads, connections, memory). But if throughput doubles AND latency doubles (system under stress), concurrency quadruples. The system saturates much faster than expected.
Closed vs open model load tests:
- Closed model (VUs): throughput = VUs / latency. As latency increases, throughput drops. This is how real users behave (they wait for response before next request).
- Open model (arrival rate): throughput is fixed. As latency increases, concurrency builds up. This is how real traffic arrives (independent requests, not waiting users).
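The difference between the two models shows up as soon as latency changes. The sketch below is a simplified steady-state calculation, not a simulation; the 50 VUs, 500 RPS, and latency values are made-up inputs, and the function names are illustrative.

```javascript
// Closed model: a fixed number of VUs, each waiting for its response
// (plus optional think time) before sending the next request.
function closedModelThroughput(vus, latencyMs, thinkTimeMs = 0) {
  return (vus * 1000) / (latencyMs + thinkTimeMs);
}

// Open model: requests arrive at a fixed rate regardless of response time,
// so in-flight requests grow with latency (Little's Law).
function openModelConcurrency(arrivalRps, latencyMs) {
  return (arrivalRps * latencyMs) / 1000;
}

for (const latencyMs of [100, 200, 400]) {
  console.log(
    `${latencyMs}ms:`,
    `closed (50 VUs) -> ${closedModelThroughput(50, latencyMs)} RPS,`,
    `open (500 RPS) -> ${openModelConcurrency(500, latencyMs)} in flight`
  );
}
```

In the closed model the load backs off as the system slows down, which can hide a saturation point; in the open model the backlog keeps growing, which is what exposes it.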
For most backend services, use the open model (arrival rate): it reveals saturation behavior correctly.

Goal: confirm that memory usage is stable over time under sustained load. A leak manifests as heap growth that GC cannot reclaim.
Setup:
- Duration: 4–8 hours minimum. Memory leaks in Java are often slow (MB/hour).
- Load: 50–75% of production peak. Enough to exercise the code paths, not enough to trigger resource exhaustion that masks the leak signal.
- Monitoring: heap used (post-GC, via JVM metrics), Old Gen %, GC pause frequency and duration. Use Datadog/Prometheus with jvm.memory.used{area=heap} and jvm.gc.memory.promoted.
What to watch for:
- Sawtooth pattern (healthy): heap rises as requests are processed, drops after GC. The drop baseline stays constant. Memory is being reclaimed.
- Rising floor (leak): heap still shows sawtooth, but the post-GC baseline rises steadily over hours. GC reclaims Young Gen but Old Gen grows. A memory leak.
- GC overhead growing: GC runs more frequently and longer as Old Gen fills. Manifests as increasing GC pause time and increasing jvm.gc.memory.promoted rate.
Response to detected leak:
1. Take a heap dump late in the test, when the post-GC floor is at its highest: jmap -dump:live,format=b,file=heap.hprof <pid>. The live option forces a full GC first, so the dump contains only reachable objects, which is what leak analysis needs.
2. Analyze with Eclipse MAT: Leak Suspects → Dominator Tree → Path to GC roots.
3. Find the retaining reference chain and the code responsible.
4. Fix and re-run the soak test to validate the fix.
Automating the pass/fail: Export jvm.memory.used{area="heap"} to a time series. Fit a linear regression to the post-GC values over the test duration. If slope > threshold (e.g., 1MB/min), mark the test as failing (leak detected).
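The regression check can be sketched in a few lines. This example assumes the post-GC heap samples have already been extracted as (minute, heapMb) pairs; the sample data is invented, and `detectLeak` is a hypothetical helper rather than part of any monitoring tool.

```javascript
// Ordinary least-squares slope of y over x.
function slope(points) {
  if (points.length < 2) throw new Error('need at least two samples');
  const n = points.length;
  const meanX = points.reduce((s, p) => s + p.x, 0) / n;
  const meanY = points.reduce((s, p) => s + p.y, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of points) {
    num += (p.x - meanX) * (p.y - meanY);
    den += (p.x - meanX) ** 2;
  }
  return num / den;
}

// Fail when the post-GC heap floor grows faster than the allowed slope.
function detectLeak(postGcSamples, maxSlopeMbPerMin) {
  const slopeMbPerMin = slope(postGcSamples.map((s) => ({ x: s.minute, y: s.heapMb })));
  return { slopeMbPerMin, leaking: slopeMbPerMin > maxSlopeMbPerMin };
}

// Invented samples: one stable floor, one rising floor.
const stableRun = [0, 60, 120, 180, 240].map((minute, i) => ({ minute, heapMb: [512, 515, 511, 514, 512][i] }));
const leakyRun = [0, 60, 120, 180, 240].map((minute, i) => ({ minute, heapMb: [512, 640, 770, 895, 1020][i] }));

console.log(detectLeak(stableRun, 1).leaking);
console.log(detectLeak(leakyRun, 1).leaking);
```

Using only post-GC samples matters: fitting a line through the raw sawtooth would mostly measure allocation noise rather than retained memory.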
ceiling(2760 / 200) = 14 pods. Currently running 4 pods (600 / 200 = 3 + 1 buffer).
4. Autoscaling configuration: HPA target: 70% of max capacity per pod (140 RPS as the scale trigger). Min replicas: 4 (3 pods cover the current 600 RPS peak, plus 1 spare). Max replicas: 20 (covers 6-month projection).
5. Cost model: 14 pods × pod cost ($X/month) = total compute cost. Present with and without caching optimization (if adding Redis reduces per-pod load by 40%, need only 9 pods → 35% cost saving).
6. Database scaling: Load test showed DB CPU at 60% under peak load. At 2.3× traffic: DB CPU at ~138% — saturated. Recommendation: add read replica, upgrade instance class, or shard.
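The pod and database numbers in the steps above follow from a few lines of arithmetic. This sketch reuses the figures from the text (2760 RPS projected, 200 RPS per pod, 70% HPA target, DB CPU at 60% with 2.3× growth); the function names are illustrative.

```javascript
// Pods needed to serve targetRps when each pod is run at `utilization`
// of its measured maximum (1.0 = flat out, 0.7 = HPA target).
function podsNeeded(targetRps, perPodMaxRps, utilization = 1.0) {
  return Math.ceil(targetRps / (perPodMaxRps * utilization));
}

// Naive linear projection of a resource's utilization under traffic growth.
function projectedUtilizationPct(currentPct, growthFactor) {
  return currentPct * growthFactor;
}

console.log(podsNeeded(2760, 200));      // pods at full per-pod capacity
console.log(podsNeeded(2760, 200, 0.7)); // pods at the 70% HPA target
console.log(Math.round(projectedUtilizationPct(60, 2.3))); // DB CPU %, above 100 means saturated
```

The linear projection is a simplification: real systems usually degrade before reaching 100%, so the projected figure is best read as a lower bound on how soon the database saturates.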
The recommendation format: Present as: current state → projected need in 6/12 months → required changes → cost. Include: what happens if we don't scale (user impact at specific RPS thresholds). Never present just "we need more servers": quantify exactly how many, when, and why.

Single machine limits: a single k6 instance on a 4-core machine can generate ~10k–30k RPS for simple HTTP requests. Complex scenarios with encryption, JSON parsing, think time: much less. For 100k RPS, you need distributed load generation.
k6 distributed options:
k6 Cloud (Grafana): SaaS solution. Specify vus: 10000; Grafana provisions cloud generators automatically. Aggregates results centrally. Simple but has cost per test.
k6 Operator (Kubernetes): run k6 as a Kubernetes job with N parallel pods.

```yaml
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: load-test        # placeholder resource name
spec:
  parallelism: 20        # 20 k6 pods
  script:
    configMap:
      name: test-script
```

20 pods × 5k RPS = 100k RPS. Results aggregated via the operator. Self-hosted.
Gatling Enterprise: official enterprise version. Cluster of Gatling injectors. Each injector node is async (Akka-based) — very high RPS per node. Centralized report.
Architecture for 100k RPS tests:
- Load generators: 10–20 EC2 instances or Kubernetes pods, same region as target
- Monitoring: separate monitoring stack (don't load the monitoring with test traffic)
- Network: generators and target must have sufficient NIC bandwidth (100k RPS × 5KB avg = 500 MB/s)
- Stateful test: if test uses auth tokens, pre-generate 100k test accounts and distribute across generator pods (each pod gets a slice of the test user pool)
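Splitting the test user pool across generators comes down to giving each pod a non-overlapping index range. The following is a minimal sketch, assuming users are addressable by a zero-based index and each pod knows its own index and the pod count; `userSliceForPod` is a hypothetical helper, not a k6 API.

```javascript
// Contiguous, non-overlapping slice of the user pool for one generator pod.
// When the pool does not divide evenly, the first pods take one extra user.
function userSliceForPod(totalUsers, podCount, podIndex) {
  const base = Math.floor(totalUsers / podCount);
  const remainder = totalUsers % podCount;
  const start = podIndex * base + Math.min(podIndex, remainder);
  const size = base + (podIndex < remainder ? 1 : 0);
  return { start, end: start + size }; // end is exclusive
}

console.log(userSliceForPod(100000, 20, 0));
console.log(userSliceForPod(100000, 20, 19));
console.log(userSliceForPod(10, 3, 1));
```

Non-overlapping slices avoid two generators logging in as the same user, which can otherwise produce session conflicts that look like application errors.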
Controlling blast radius:
- Start at 10% (10k RPS). Validate generators and system are healthy.
- Ramp to 25%, 50%, 75%, 100%. Stop if error rate > 1% at any stage.
- Never start at 100k RPS: generators might not be configured correctly, and you'd DDoS your own infrastructure without realizing it.
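The staged ramp with an abort condition can be expressed as a small control loop. This sketch assumes a `measureErrorRate(rps)` callback that runs one stage and reports the observed error rate; that callback, the function names, and the fake system below are all hypothetical stand-ins for a real test harness.

```javascript
// Turn a target RPS into the staged plan: 10%, 25%, 50%, 75%, 100%.
function rampPlan(targetRps, fractions = [0.1, 0.25, 0.5, 0.75, 1.0]) {
  return fractions.map((f) => Math.round(targetRps * f));
}

// Run stages in order; stop at the first stage whose error rate exceeds the limit.
function runRamp(plan, measureErrorRate, maxErrorRate = 0.01) {
  const completed = [];
  for (const rps of plan) {
    const errorRate = measureErrorRate(rps);
    if (errorRate > maxErrorRate) {
      return { completed, abortedAt: rps, errorRate };
    }
    completed.push(rps);
  }
  return { completed, abortedAt: null, errorRate: null };
}

// Fake system for illustration: starts failing above 60k RPS.
const fakeMeasure = (rps) => (rps > 60000 ? 0.05 : 0.002);
const rampResult = runRamp(rampPlan(100000), fakeMeasure);
console.log(rampResult);
```

Aborting at the first unhealthy stage keeps the blast radius at the last known-good level instead of pushing a failing system harder.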
uses: company/perf-test-action@v2. Teams include it in their pipeline. Tier 1 (smoke) and Tier 2 (regression vs baseline) run automatically. Teams don't configure anything unless they need custom scenarios.
Centralized performance baseline store: S3 bucket per service: s3://perf-baselines/orders-service/. Each successful nightly test uploads results. The regression test compares to the last successful baseline. The performance team owns the comparison logic and threshold defaults.
Nightly dashboard: Grafana dashboard showing all services: last test date, P99 vs SLO, trend (improving/degrading). Engineering leads review weekly. Red services are flagged for the owning team.
Performance engineering team role: not to run tests for teams, but to maintain the platform (library, CI action, baseline store), run performance reviews for complex scenarios (new architecture, scaling decisions), own the performance testing standards, and embed in squads for major launches.
Incentive structure: Track performance regression rate per team. Teams that ship P99 regressions get a "performance debt" metric. SLO breach in production links back to the test that should have caught it. Makes performance visible as an engineering quality metric alongside test coverage and incident rate.