// metrics · APM · logs · traces · dashboards · monitors · alerting · senior → principal
env:prod, service:orders, region:us-east-1). Tags are the primary grouping and filtering mechanism. Metrics are stored for 15 months; high-resolution (1-second) retained for 15 hours.
/_log endpoint).
Log pipeline: raw logs → parsing (Grok patterns extract structured fields) → enrichment (add host/service tags) → indexing or archiving. Not all logs need to be indexed, which is expensive: route low-value logs to archives (S3) or to the cheaper Flex Logs tier.
Log attributes: extracted structured fields (status_code, duration, user_id) become searchable facets. Reserved attributes automatically map to Datadog conventions: status → log level, message, host, service, source, timestamp.
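A minimal sketch of the parse → enrich → route flow, using a Python regex as a stand-in for a Grok pattern. The log layout, the `process` function, and the "health checks go to the archive" rule are assumptions made for illustration, not Datadog's pipeline API.

```python
import re
from typing import Optional

# Hypothetical access-log layout, standing in for a Grok pattern.
LINE = re.compile(
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status_code>\d{3}) (?P<duration>\d+)ms"
)

def process(raw: str, host: str, service: str) -> Optional[dict]:
    """Parse one raw line, enrich it with tags, and pick a destination."""
    match = LINE.search(raw)
    if match is None:
        return None  # unparsed lines would stay as plain messages
    event = match.groupdict()
    # Typed attributes are what become searchable facets.
    event["status_code"] = int(event["status_code"])
    event["duration"] = int(event["duration"])
    # Enrichment: attach host and service.
    event.update(host=host, service=service)
    # Routing: successful health checks carry no signal, so skip indexing.
    low_value = event["path"] == "/health" and event["status_code"] == 200
    event["route"] = "archive" if low_value else "index"
    return event
```

Calling `process("GET /orders/42 500 812ms", "web-1", "orders")` yields a structured event routed to the index, while a successful `/health` line is routed to the archive.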
Log-Trace correlation: Datadog auto-injects dd.trace_id and dd.span_id into logs (via MDC in Java). Click a trace span → "View Logs" shows all logs from that exact request. Critical for debugging distributed failures.
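To make the correlation mechanism concrete, here is a sketch of what the injection amounts to in Python's standard `logging` module. In a real service the tracer does this automatically; the `current_trace` context variable and both classes below are hypothetical stand-ins, and only the `dd.trace_id` / `dd.span_id` field names come from the text above.

```python
import json
import logging
from contextvars import ContextVar

# Hypothetical holder for the active (trace_id, span_id); a real tracer owns this.
current_trace: ContextVar = ContextVar("current_trace", default=(0, 0))

class TraceContextFilter(logging.Filter):
    """Copy the active trace context onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, carrying the correlation ids."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "status": record.levelname.lower(),
            "dd.trace_id": str(record.trace_id),
            "dd.span_id": str(record.span_id),
        })
```

Attach the filter and formatter to a handler, and every line logged while a trace is active carries the ids that the "View Logs" pivot relies on.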
$env, $service) make dashboards reusable across environments
- Use Formulas and Functions for derived metrics (error rate = errors / requests)
- Share dashboards via public URL or embed via iframe
- Screenboards (freeform layout, good for NOC walls) vs Timeboards (time-synced, good for deep investigation: changing the time range applies to all widgets)
@team-oncall in monitor messages. Priority P1–P5 classifies severity.
Monitor best practices:
- Alert on symptoms (latency up, error rate up), not causes (CPU up)
- Set a Warning threshold before the Alert threshold to give early warning
- Always include a runbook link and context in the alert message
- Suppress flapping by requiring a full evaluation window before alerting
- Use downtimes to suppress alerts during planned maintenance
successful_requests / total_requests). More accurate than monitor-based.
Burn rate alerts: alert when the error budget is being consumed faster than is sustainable. A burn rate of 1 means consuming budget at exactly the pace that exhausts it at the end of the window. Alert at burn rate 14.4 measured over a 1-hour window (against a 30-day SLO window, that consumes 2% of the budget in one hour). This catches incidents early, without waiting for the SLO to actually breach.
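The arithmetic behind the 14.4 figure can be checked with a few lines. This is a worked sketch; the function names are illustrative and not part of any Datadog API.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget spent during the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)

def days_to_exhaustion(burn_rate: float, slo_window_days: int = 30) -> float:
    """How long the budget lasts if the current burn rate is sustained."""
    return slo_window_days / burn_rate

# Burn rate 14.4 for one hour against a 30-day window:
print(f"{budget_consumed(14.4, 1):.1%} of budget per hour")
print(f"{days_to_exhaustion(14.4):.2f} days until the budget is gone")
```

At burn rate 1 the budget lasts exactly the 30-day window; at 14.4 it lasts about two days, which is why that threshold is treated as page-worthy.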
user_id as a tag to a request metric with 1M users creates 1M custom metrics — potentially thousands of dollars per month. Tags on metrics must have bounded cardinality (env, service, region, status_code are good; user_id, request_id, order_id are not). Use traces and logs for high-cardinality analysis.
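A quick way to reason about cardinality before shipping a metric is to multiply the number of distinct values per tag. The sketch below computes that upper bound; the tag counts are made-up example numbers, and the real billed count depends on which combinations actually appear.

```python
from math import prod

def timeseries_upper_bound(tag_cardinalities: dict) -> int:
    """Worst-case number of distinct tag-value combinations for one metric name."""
    return prod(tag_cardinalities.values())

# Hypothetical cardinalities for a request metric.
bounded = {"env": 3, "service": 20, "region": 4, "status_code": 10}
unbounded = {**bounded, "user_id": 1_000_000}

print(timeseries_upper_bound(bounded))    # bounded tags stay in the thousands
print(timeseries_upper_bound(unbounded))  # one unbounded tag multiplies everything
```

The design point is that a single unbounded tag does not add to the total, it multiplies it.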
| COUNT | Number of events in a flush interval. Resets each interval. Use for: requests, errors, logins. Submit: statsd.increment('requests') |
| RATE | COUNT per second. Datadog calculates automatically from COUNT. Normalized across flush intervals. |
| GAUGE | Point-in-time value. Last value wins in flush interval. Use for: memory, CPU, queue depth, active connections. |
| HISTOGRAM | Distribution of values per flush interval. Generates: .avg, .count, .median, .95percentile, .max. Use for: latency, response sizes. |
| DISTRIBUTION | Like histogram but aggregated globally (not per-agent). Accurate percentiles across hosts. Preferred for latency at scale. Counts as custom metric per percentile. |
| SET | Count of unique values per flush interval. Use for: unique users, unique IPs. Submit: statsd.set('users', user_id) |
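The per-interval semantics in the rows above (counts sum, the last gauge wins, sets count unique values, histograms expand into several series) can be sketched as a toy flush function. This is a simplified model written for illustration, not the Agent's implementation; it omits the `.95percentile` series and the globally aggregated DISTRIBUTION type.

```python
from statistics import median

def flush(samples: list) -> dict:
    """Aggregate (name, type, value) samples received during one flush interval."""
    counts, gauges, sets, hists = {}, {}, {}, {}
    for name, mtype, value in samples:
        if mtype == "count":
            counts[name] = counts.get(name, 0) + value   # events add up
        elif mtype == "gauge":
            gauges[name] = value                          # last value wins
        elif mtype == "set":
            sets.setdefault(name, set()).add(value)       # unique values only
        elif mtype == "histogram":
            hists.setdefault(name, []).append(value)
    out = {**counts, **gauges}
    out.update({name: len(values) for name, values in sets.items()})
    for name, values in hists.items():
        out[f"{name}.count"] = len(values)
        out[f"{name}.avg"] = sum(values) / len(values)
        out[f"{name}.median"] = median(values)
        out[f"{name}.max"] = max(values)
    return out
```

Note how one histogram name turns into several output series, which is also why histograms and distributions weigh more heavily on custom metric counts than a plain count or gauge.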
| Trace | Complete request journey across all services. Has a trace_id. Composed of spans. |
| Span | One unit of work (HTTP call, DB query, function). Has span_id, parent_span_id, start, duration, resource, service, error flag. |
| Service | Logical grouping of spans. Configured via DD_SERVICE env var or tracer config. Appears as a node in the Service Map. |
| Resource | The specific operation within a service. For HTTP: 'GET /orders/{id}'. For DB: the SQL query template. |
| Sampling rate | Fraction of traces ingested. Default: automatic (prioritize errors and slow traces). Override per service with DD_TRACE_SAMPLE_RATE=0.1. |
| Retention filter | Server-side rule keeping specific traces: all errors, P95 latency outliers, specific service. Ingestion and indexing are billed separately. |
| Continuous Profiler | CPU/memory/lock profiles captured continuously and linked to traces. Identifies hot code paths without separate profiling sessions. |
| Threshold | Alert when metric crosses a static threshold for N minutes. Simple and predictable. Needs manual tuning for seasonal traffic. |
| Change | Alert when metric changes by X% or X units compared to N minutes/hours ago. Good for detecting sudden spikes. |
| Anomaly | ML-based. Alert when metric deviates from expected pattern (learned from historical data). Good for cyclical metrics (day-of-week patterns). |
| Forecast | Predicts when metric will cross threshold in the future. Use for capacity planning (disk will fill in 3 days). |
| Outlier | Alert when one entity (host, container) behaves differently from its peers. Good for detecting bad pods in a fleet. |
| Composite | Boolean combination of other monitors. 'Alert when error rate is high AND latency is high'. Reduces false positives. |
| SLO Alert | Alert on burn rate: error budget consumed faster than safe rate. Two-window burn rate: fast (1hr) + slow (6hr) for precision + recall. |
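As a sketch of how the Threshold and Composite rows behave, the functions below evaluate one window of points against warning and alert thresholds and then AND two monitor states together. The names and the "every point must breach" rule are assumptions for illustration (it mirrors an "at all times" style evaluation), not Datadog's evaluation engine.

```python
def evaluate(points: list, warning: float, alert: float) -> str:
    """State of a threshold monitor for the points in one evaluation window.

    Every point must exceed the threshold, so a single spike does not fire.
    """
    if not points:
        return "NO DATA"
    lowest = min(points)
    if lowest > alert:
        return "ALERT"
    if lowest > warning:
        return "WARN"
    return "OK"

def composite(*states: str) -> str:
    """Boolean AND of monitor states: page only when all inputs are alerting."""
    return "ALERT" if states and all(s == "ALERT" for s in states) else "OK"

error_rate = evaluate([1.2, 1.5, 1.1], warning=0.5, alert=1.0)   # sustained breach
latency_p99 = evaluate([0.4, 2.6, 0.5], warning=1.0, alert=2.0)  # one-point blip
print(error_rate, latency_p99, composite(error_rate, latency_p99))
```

Here the error rate is alerting but the latency spike is a blip, so the composite stays quiet.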
| Increment count | statsd.increment('orders.created', tags=['env:prod','region:us-east']) |
| Gauge | statsd.gauge('queue.depth', queue.size(), tags=['queue:orders']) |
| Histogram | statsd.histogram('request.duration', elapsed_ms, tags=['service:api']) |
| Distribution | statsd.distribution('db.query.time', query_ms, tags=['table:orders']) |
| Timed decorator (Python) | @statsd.timed('function.duration', tags=['func:checkout']) |
| Event | statsd.event('Deploy complete', 'v1.4.2 deployed', alert_type='info', tags=['env:prod']) |
| Service check | statsd.service_check('payment.health', statsd.OK, tags=['env:prod']) |
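Under the client calls in the table, each metric travels to the Agent as a small text datagram over UDP in the form `name:value|type|@sample_rate|#tag1,tag2`. The sketch below builds and sends such datagrams with only the standard library; the helper names `datagram` and `send` are illustrative, and the address assumes an Agent listening on the default DogStatsD port 8125.

```python
import socket

def datagram(name: str, value, mtype: str, tags=None, sample_rate=None) -> str:
    """Format one DogStatsD metric line.

    mtype is the wire code: c (count), g (gauge), h (histogram),
    d (distribution), s (set), ms (timer).
    """
    parts = [f"{name}:{value}", mtype]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)

def send(payload: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; nothing fails if no Agent is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))

print(datagram("orders.created", 1, "c", tags=["env:prod", "region:us-east"]))
```

Because the transport is UDP, instrumentation never blocks the request path, at the price of silently dropped metrics when the Agent is down.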
| Type | SaaS, fully managed | Open-source, self-hosted | SaaS, fully managed | SaaS, fully managed |
| Metrics | Agent + integrations | Scrape-based pull model | Agent + integrations | OneAgent auto-discovery |
| APM | dd-trace language agents | OpenTelemetry + Tempo | Language agents | OneAgent (auto, no code) |
| Logs | Agent + pipeline | Loki (label-based) | Logs UI | Log management |
| Dashboards | Polished, built-in | Grafana (powerful, complex) | Polished | Polished, auto-generated |
| Alerting | Monitors + PD/Slack | Alertmanager | Alert conditions | Davis AI (auto-detect) |
| Auto-instrumentation | Good (agents) | Manual/OTel needed | Good | Best (OneAgent bytecode) |
| Cost model | Per host + custom metrics | Infrastructure cost only | Per GB ingested | Per host (DEM/APM) |
| Best for | All-in-one, fast time to value | Cost control, OSS ecosystem | Simple setup | Enterprise, AI-ops |
trace.http.request.duration histogram; custom distribution metric from your code.
Traffic: demand on the system. Requests per second, events per second, active users. Datadog: trace.http.request.hits (from APM); custom COUNT metric.
Errors: rate of failing requests. 5xx rate, exception rate, business error rate. Datadog: trace.http.request.errors from APM; log-based monitor on ERROR-level logs; custom COUNT metric tagged with status:error.
Saturation: how "full" the service is. CPU, memory, thread pool utilization, queue depth. Datadog: system metrics from the agent (system.cpu.user, system.mem.used); JVM metrics (jvm.heap_memory, jvm.thread_count); custom gauge for queue depth.
Dashboard template: one timeseries widget per signal, scoped with the $service template variable. Add an SLO widget showing error budget. This becomes the default on-call dashboard for every service.
error_rate > 1% is a symptom. cpu > 80% is a cause (high CPU doesn't always mean user impact). Users experience latency and errors, so monitor those directly.
2. Set evaluation window. Don't alert on a single data point. Alert if error_rate > 1%
for the last 5 minutes (sustained, not a blip).
3. Warning before Alert. Warning at 0.5% error rate, Alert at 1%. Gives early signal
without waking on-call immediately.
4. Use anomaly detection for metrics with variable baseline (traffic is lower at 3am;
anomaly detection knows this and doesn't fire for normal low-traffic lulls).
5. Composite monitors. Alert only when error_rate > 1% AND latency_p99 > 2s.
Single signal spikes don't page; correlated signals do.
6. Include runbook in alert message: link to runbook, recent deploys (via event overlay),
related dashboards.
7. Review monthly: mute monitors that fired > 5 times with no action. That's noise.
Head-based sampling: the sampling decision is made at the start of the trace (first span). A percentage of traces are sampled in; the rest are dropped entirely. Problem: a 1% head-based sample means 99% of error traces are never captured, so you're flying blind on most incidents.
Tail-based sampling (Datadog "Tracing Without Limits"):
- Datadog ingests ALL spans from all services
- After the trace is complete (or within a window), a decision is made on what to store
- Retention filters define what's kept: "keep all traces with errors", "keep all traces > 5s P95", "keep 5% of healthy traces"
- Errors and outliers are always kept; normal traffic is sampled down
In practice:
- Default: Datadog's adaptive sampling ingests 100% up to a volume limit, then applies intelligent sampling prioritizing errors
- DD_TRACE_SAMPLE_RATE=0.1 sets the head-based rate for a service (10% of traces enter Datadog at all). Combine with DD_TRACE_ANALYTICS_ENABLED=true so key spans are still counted for metrics even when the trace is dropped.
Recommendation: let Datadog manage sampling with retention filters. Manually forcing low sample rates saves cost but blind-spots errors.
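The difference between the two decision points can be shown with a toy simulation. Everything here is an assumption made for illustration: the `Trace` type, the modulo-based sampler (a real sampler hashes the id), and the synthetic traffic, in which error traces deliberately never land on a sampled id. It only demonstrates that a decision made before the outcome is known cannot favor errors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trace:
    trace_id: int
    error: bool
    duration_s: float

def head_sampled(trace: Trace, rate_percent: int) -> bool:
    """Decided at the first span: only the id is known, not the outcome."""
    return trace.trace_id % 100 < rate_percent

def retained(trace: Trace, healthy_percent: int = 5, slow_s: float = 5.0) -> bool:
    """Decided after completion: errors and slow traces are always kept."""
    if trace.error or trace.duration_s > slow_s:
        return True
    return trace.trace_id % 100 < healthy_percent

# Synthetic traffic: 1000 fast traces, 2% of them failing.
traces = [Trace(i, error=(i % 50 == 7), duration_s=0.2) for i in range(1000)]
errors = [t for t in traces if t.error]
head_kept_errors = [t for t in errors if head_sampled(t, 1)]
tail_kept_errors = [t for t in errors if retained(t)]

print(len(errors), len(head_kept_errors), len(tail_kept_errors))
```

With genuinely random 1% head sampling, roughly one error trace in a hundred would survive rather than none, which is the same blind spot in milder form.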
java
@Bean
public StatsDClient statsd() {
    return new NonBlockingStatsDClientBuilder()
        .prefix("myapp")
        .hostname("localhost") // or DD_AGENT_HOST env
        .port(8125)
        .build();
}

// In service:
statsd.incrementCounter("orders.created", "env:prod", "payment:card");
statsd.recordHistogramValue("checkout.duration", durationMs, "region:us-east");
statsd.recordGaugeValue("cart.items", cart.size());
Option 2: Micrometer (Spring Boot Actuator)
Spring Boot auto-configures Micrometer. Add the micrometer-registry-datadog dependency. All existing @Timed and MeterRegistry metrics automatically ship to Datadog.
yaml
management.datadog.metrics.export.api-key: ${DD_API_KEY}
management.datadog.metrics.export.step: 10s
Option 3: dd-trace Java agent custom spans
java
Tracer tracer = GlobalTracer.get();
Span span = tracer.buildSpan("payment.process").start();
try {
    span.setTag("payment.method", "card");
    span.setTag("amount", amount);
    // ... do work ...
} finally {
    span.finish(); // always close the span, even if the work throws
}
Tag discipline: never add high-cardinality tags (user_id, order_id) to metrics. Use traces for per-request detail; metrics for aggregated signals.
Structured investigation workflow:
1. Alert fires → click the alert notification → lands on the triggered monitor with the anomalous time window highlighted.
2. Metrics context: look at the timeseries in the monitor. Add correlated metrics to a dashboard (error rate + latency + DB connection pool + downstream error rate). When did it start? What else changed at the same time?
3. Event overlay: add deployment events to the dashboard (@deployment.env:prod). Did the metric change right after a deploy?
4. Traces: from the APM service page, filter to the error time window. Sort by error rate. Open a sample failing trace. The flame graph shows exactly which span failed and what the error was.
5. Logs: from the failing span, click "View Related Logs". Datadog uses the trace_id injected in logs to show only log lines from that exact request. The stack trace or error message is there.
6. Infrastructure correlation: from the span, pivot to the host/container metrics. Was the container OOMKilled? Was the DB instance at 100% CPU?
This flow — Alert → Metrics → Traces → Logs → Infrastructure — is the structured on-call investigation path. The critical enabler is trace_id propagated into logs.
sum:trace.http.request.hits{service:orders,http.status_code:2xx}.as_count()
- Total events: sum:trace.http.request.hits{service:orders}.as_count()
- SLI = good / total
Step 2: Set SLO target
- Target: 99.9% (43.2 minutes of downtime per 30-day window)
- Window: 30 days rolling
- Error budget: 0.1% of requests = allowed failures
Step 3: Burn rate alert
Alert when error budget is being consumed too fast. Multi-window burn rate (Datadog SLO alert):
- Fast burn: burn rate > 14.4 over 1-hour window (consumes 2% error budget in 1 hour)
- Slow burn: burn rate > 6 over 6-hour window (consumes 5% error budget in 6 hours)
Alert if EITHER fires: this catches both sudden spikes and slow degradation.
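The either-window rule can be expressed directly from request counts. In this sketch the burn rate is the observed error ratio divided by the error budget (1 minus the 99.9% target); the function names and the sample counts are hypothetical.

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # allowed failure ratio

def burn_rate(bad: int, total: int) -> float:
    """Observed error ratio expressed as a multiple of the allowed ratio."""
    if total == 0:
        return 0.0
    return (bad / total) / ERROR_BUDGET

def should_page(fast_bad: int, fast_total: int,
                slow_bad: int, slow_total: int) -> bool:
    """Fast window is the last hour, slow window the last six hours."""
    fast = burn_rate(fast_bad, fast_total) > 14.4
    slow = burn_rate(slow_bad, slow_total) > 6
    return fast or slow

# Sudden spike: 2% of the last hour's requests failed.
print(should_page(fast_bad=20, fast_total=1000, slow_bad=20, slow_total=6000))
# Slow degradation: about 0.67% failing steadily for six hours.
print(should_page(fast_bad=5, fast_total=1000, slow_bad=40, slow_total=6000))
```

The first case trips only the fast condition and the second only the slow one, which is the reason for keeping both.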
Step 4: SLO widget in dashboard
Add an SLO widget showing: current burn rate, error budget remaining (%), and time to exhaustion. This is the team's reliability health at a glance.
Burn rate = 1 means consuming budget at exactly the sustainable rate. Burn rate 14.4 means you'll exhaust a 30-day budget in ~2 days. Page immediately.
cluster-agent, and application logs.
Cluster Agent: a separate deployment providing: Kubernetes state metrics (aggregated, not per-node), HPA (Horizontal Pod Autoscaler) custom metrics support, and admission controller for auto-injecting APM config into pods.
Auto-discovery: the agent discovers pods and applies integrations based on annotations:
yaml
annotations:
  # http_check instances need a name as well as a url; "app-health" is a placeholder
  ad.datadoghq.com/app.check_names: '["http_check"]'
  ad.datadoghq.com/app.init_configs: '[{}]'
  ad.datadoghq.com/app.instances: '[{"name": "app-health", "url": "http://%%host%%:8080/health"}]'
APM injection: enable DD_APM_INSTRUMENTATION_ENABLED=true on the Cluster Agent. Admission controller automatically injects the APM init container and env vars into new pods — no Dockerfile changes needed.
Key Kubernetes monitors:
- Pod restarts > 3 in 15 minutes (CrashLoopBackOff signal)
- Deployment unavailable replicas > 0 (rollout failure)
- Node not ready (node problem)
- HPA at max replicas for 30 minutes (capacity ceiling hit)
jvm.gc.pause_time spike matches latency spike?)
7. Infrastructure correlation: Was the host/pod at capacity? Check CPU, memory, thread pool metrics in the same time window. OOMKilled pods cause latency spikes as requests queue during restart.
env (prod/staging/dev), service (service name), version (deploy version/git SHA). Apply to ALL telemetry: metrics, traces, logs. This enables: filtering by environment, correlating a deploy version with a latency spike, filtering logs to a single service.
Implementation:
- Kubernetes: set DD_ENV, DD_SERVICE, DD_VERSION env vars from pod labels
- Helm: standard values block in all service charts
- Java agent: DD_TRACE_GLOBAL_TAGS=env:${ENV},service:${SERVICE}
Additional tag taxonomy:
- team: (e.g., team:checkout) to route alerts to the right team
- region: / az: to identify regional issues
- tier: (critical/standard) for priority-based alert routing
- feature_flag: on spans, to compare performance with flag on vs off
Governance rules:
- Banned on metrics: user_id, request_id, order_id, trace_id (unbounded cardinality)
- Allowed on traces/logs: any cardinality (stored per-event, not per-combination)
- Tag values must be lowercase, snake_case, bounded (< 1000 unique values on metrics)
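The first and third governance rules are mechanical enough to check in code. Below is a minimal validator sketch of the kind a PR check might run; the function name, the regex, and the message wording are assumptions. The "< 1000 unique values" rule is not covered, since it needs usage data rather than a single tag list.

```python
import re

BANNED_METRIC_TAG_KEYS = {"user_id", "request_id", "order_id", "trace_id"}
# Lowercase snake_case: letters and digits, separated by single underscores.
VALUE_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")

def validate_metric_tags(tags: list) -> list:
    """Return a list of human-readable problems; empty means the tags pass."""
    problems = []
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            problems.append(f"{tag}: expected key:value")
        elif key in BANNED_METRIC_TAG_KEYS:
            problems.append(f"{tag}: '{key}' has unbounded cardinality")
        elif not VALUE_PATTERN.match(value):
            problems.append(f"{tag}: value must be lowercase snake_case")
    return problems

print(validate_metric_tags(["env:prod", "team:checkout", "user_id:42"]))
```

Returning the full list of problems, rather than failing on the first, gives the author of a change everything to fix in one pass.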
Tag governance in practice:
- OPA/Conftest policy in CI that validates DD_ENV, DD_SERVICE, DD_VERSION are set in Kubernetes manifests before merge
- Custom metric registry: teams register new metric names + tag schema; reviewed for cardinality before shipping to production
Identify the spend drivers: Use Datadog's Cost Estimator and the datadog.estimated_usage.* metrics to understand current spend breakdown: hosts, APM hosts, custom metrics, indexed logs, Synthetics.
Custom metrics (often the biggest surprise):
- Audit with datadog.estimated_usage.metrics.custom grouped by metric name
- Find the offenders: metrics with cardinality > 10k are suspect
- Fix: remove high-cardinality tags (user_id, request_id) from metrics
- Use Log-based metrics: extract a metric from logs (count ERROR logs by service), which is cheaper than emitting from code if the log already exists
Log cost control:
- Audit: which services generate the most log volume? (@source facet)
- Sampling: for high-volume debug logs, run the application at INFO level in prod and sample DEBUG at 10% (DD_LOG_LEVEL only controls the Agent's own logs)
- Exclusion filters: drop health check logs, static asset logs (always 200, no signal)
- Flex Logs: store in the cheaper Flex tier instead of standard indexing; query ad hoc when needed
APM cost control:
- APM host count = hosts running the tracer. Audit unexpected services with APM enabled.
- Ingestion volume: set DD_TRACE_SAMPLE_RATE=0.1 on high-throughput services; always keep errors via retention filter.
Infrastructure hygiene:
- Agent running on stopped/terminated instances still counts as a host for ~2 hours. Ensure proper agent shutdown in auto-scaling termination hooks.
The core problem: 20 teams, each instrumenting differently: some using Datadog, some using home-grown logging, inconsistent tagging, no shared dashboards. The result: every incident requires different investigation paths per service.
Standardize on OpenTelemetry as the instrumentation layer: Instrument with OTel SDKs; export to Datadog (or any backend) via the OTel Collector. This decouples instrumentation from the vendor: if you switch from Datadog to Grafana Cloud in 3 years, the instrumentation stays. Teams write OTel; the platform team decides the backend.
Platform team responsibilities:
- Maintain the OTel Collector fleet (DaemonSet in Kubernetes, aggregation layer)
- Provide language-specific base libraries that wrap OTel with company conventions (auto-set service name from pod labels, inject correlation IDs, apply tag governance)
- Manage Datadog org: API keys, access controls, cost budgets per team
- Maintain golden path dashboards (one template per service type: HTTP service, Kafka consumer, batch job)
Paved roads:
- Helm chart for all services includes the Datadog agent sidecar config automatically
- Golden Prometheus recording rules and Datadog monitors deployed via Terraform module: teams include the module, get standard monitors for free
- Runbook templates in a shared repo; monitors link to them
SLO governance:
- All customer-facing services must have a published SLO
- SLO definitions live in a central Terraform repository, reviewed by the SRE team
- Monthly SLO review with engineering VPs: burn rate trends, error budget health
Cost governance:
- Monthly cost allocation report per team (custom metric cardinality, log volume, APM hosts)
- Budget alerts: team lead is notified if their team's estimated monthly cost exceeds budget
- Cardinality governance: PR check validates new metrics for tag cardinality before merge
DD_ENV, DD_SERVICE=payment-service, DD_VERSION=${GIT_SHA} as Kubernetes pod env vars (via Helm values). Apply the same labels as pod labels for auto-discovery. This ensures all telemetry (metrics, traces, logs) is correlated by service and version.
-javaagent:/dd-java-agent.jar). It auto-instruments Spring Boot HTTP endpoints, JDBC, and common libraries. No code changes needed for basic tracing.
micrometer-registry-datadog. Instrument critical business operations: payment success/failure counters, payment amount distribution, fraud score histogram. Tag all metrics with payment_method (bounded cardinality).
trace_id and span_id fields (Datadog Java agent injects these into MDC automatically with Logback). Set log level INFO in prod. Ship via Kubernetes log pipeline (Datadog agent DaemonSet tails pod logs).
$env template variable.
{{deploy_link}}", escalation path.
resource (endpoint). Which endpoint has 8% errors? POST /checkout. Click → top error traces. Open a failing trace.
NullPointerException in FraudCheckService.evaluate() line 142. New code in v2.1.0 introduced a null check bug in fraud evaluation.
kubectl rollout undo deployment/payment-service. Monitor error rate; it drops to 0.1% within 3 minutes.