The observability strategy decision record: why the metrics, traces, and logs platform you chose determines what questions you can answer when incidents happen

Observability platform adoption is treated as a DevOps improvement, not an architecture decision. The metrics backend is chosen when the first dashboard is needed. The distributed tracing library is chosen when the first latency mystery appears. The log aggregation tool is chosen when stdout starts overflowing. Two years later, the cardinality limits of the metrics backend determine whether you can answer "which customer is experiencing the slowest API response?" during an incident. The tracing format you chose determines whether you can switch backends without re-instrumenting every service. None of this was visible at the beginning. None of it is written down.

An on-call engineer receives a PagerDuty alert at 2:47 AM: API p99 latency has exceeded the SLO threshold for 8 minutes. They open the Grafana dashboard. The latency graph shows a spike starting at 2:39 AM, affecting the checkout endpoint. They need to answer three questions in order: which service in the request path is slow, which customers are affected, and whether the issue is isolated to a specific region or instance.

The first question they can answer. The distributed tracing dashboard shows a trace waterfall with elevated spans in the payments service — a downstream RPC call to the tax calculation service is taking 4 seconds instead of 200 milliseconds. The tax service appears to be the culprit.

The second question they cannot answer. They navigate to the metrics dashboard and search for a way to filter the latency histogram by customer ID. There is no customer_id label on the checkout latency metric. The metric has labels for region, instance, and status_code, but adding customer_id was discussed and deferred eight months ago when someone raised a concern about cardinality limits. The engineer cannot determine which customers are experiencing checkout latencies above 4 seconds — they can only determine that the aggregate p99 is affected across all customers. An accurate customer impact statement for the incident report requires a manual query against the application database after the incident is resolved.

The third question they can partially answer. There are instance labels on the metrics, but the checkout service has 40 instances across three regions and the metrics dashboard does not expose a per-region aggregate view. The engineer suspects the issue is regional because the tax service's external API vendor has a status page history of regional outages, but they cannot confirm this from the observability platform. They page the tax service on-call engineer, who is in a different timezone, to ask which region is affected.

Like most foundational infrastructure decisions, the observability strategy is visible as a fact — "we use Prometheus for metrics, Jaeger for tracing, Elasticsearch for logs" — but invisible as a decision. The fact tells the on-call engineer where to look. The decision record answers the questions that determine whether they can find what they need: what are the cardinality constraints on the metrics platform, what trace context propagation policy ensures end-to-end trace coverage, and what is the explicit statement of which incident response questions the current platform can and cannot answer? Without the record, the cardinality limit is a surprise discovered during an incident. With it, the on-call engineer knows before opening the runbook that per-customer metrics require a trace query rather than a metrics query — and the runbook was written with that constraint in mind.

What "we have observability" means across three signal types

Observability is not a binary property. The industry shorthand "three pillars of observability" — metrics, traces, and logs — describes three different signal types with different query characteristics, different cardinality profiles, and different cost structures. A system with all three pillars deployed is not necessarily an observable system; it is a system with three independent data sources that may or may not answer the incident response questions the on-call engineer needs to ask.

Metrics are aggregated numerical measurements recorded at regular intervals: request counts, latency histograms, error rates, CPU utilization, memory usage. The key characteristic of metrics is that they are pre-aggregated at collection time: a latency histogram collapses every individual request latency from a 10-second scrape interval into a set of bucket counts and a sum. The aggregation makes metrics fast to query (stored as time series, indexed by label set, retrievable with a single range query) and cheap to store (the size of a time series is proportional to the retention window and the scrape frequency, not the volume of underlying requests). The cost of the pre-aggregation is that individual request context is lost: a latency histogram cannot tell you which specific request caused the 4-second outlier, or which customer submitted it.
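A minimal Python sketch of that trade-off, under assumed values: the bucket bounds and the customer IDs are hypothetical, and the aggregation imitates the cumulative bucket layout of a Prometheus-style histogram rather than reproducing any library's implementation.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (an assumption, not a real config).
BUCKET_BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def aggregate(latencies):
    """Collapse individual latencies into cumulative bucket counts, a count, and a sum."""
    counts = [0] * (len(BUCKET_BOUNDS) + 1)  # the last slot acts as the +Inf bucket
    for value in latencies:
        # bisect_left finds the first bound that is >= value ("less than or equal" buckets)
        counts[bisect_left(BUCKET_BOUNDS, value)] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running)
    return {"buckets": cumulative, "count": len(latencies), "sum": sum(latencies)}


# Requests seen during one scrape interval: (customer_id, latency_seconds).
requests = [("cust-17", 0.18), ("cust-42", 0.21), ("cust-99", 4.0), ("cust-17", 0.19)]

# Only the latencies reach the aggregate; the customer IDs are dropped here.
snapshot = aggregate([latency for _, latency in requests])
```

The snapshot still shows that one request landed in the 5-second bucket, but nothing in it can say that the request belonged to cust-99.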

Distributed traces are records of individual requests propagated across service boundaries. A trace is a directed acyclic graph of spans, where each span represents a unit of work — an RPC call, a database query, a queue publish — with a start time, duration, service name, and a set of key-value attributes. Traces preserve individual request context, which is what metrics discard: a trace for the 4-second checkout request contains the customer ID, the request parameters, the specific SQL query that ran slowly, and the sequence of service calls in the causal chain. The cost of preserving individual request context is storage: at high request volumes, storing every trace is expensive. Most production systems apply a sampling strategy that records only a fraction of traces — typically 1% to 10% by default — which means that the specific trace for the incident-causing request may not be in the system.
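The span structure described above can be sketched as a small data model. This is an illustration only: the Span fields, the service names, and the attribute values are hypothetical, and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: int
    attributes: dict


def self_time_ms(span, spans):
    """Time spent in the span itself, assuming its children run sequentially."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def slowest_span(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# A hypothetical trace for one slow checkout request.
trace = [
    Span("a1", None, "checkout", 4300, {"customer_id": "cust-99"}),
    Span("b2", "a1", "payments", 4200, {}),
    Span("c3", "b2", "tax-calculation", 4000, {"rpc.method": "Calculate"}),
]
culprit = slowest_span(trace)
```

Unlike the histogram, the trace keeps both the causal chain (the slow work sits in the tax-calculation span) and the per-request context (the customer_id attribute on the root span).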

Logs are timestamped records of discrete events — structured JSON lines or unstructured text emitted when the application code reaches an instrumented point. Logs can capture arbitrary context that metrics and traces cannot: the full request body, the specific configuration value that produced an unexpected result, the stack trace for an exception. The cost of logs is query performance: without an indexing strategy, finding the log entries relevant to an incident requires scanning terabytes of data. The log aggregation tool's indexing model — whether it indexes all structured fields (Elasticsearch, Datadog Logs) or only deployment-level labels (Grafana Loki) — determines whether a query by customer_id or tenant_id returns results in seconds or requires a minutes-long content scan.

The three pillars are most valuable when they are correlated: a metrics anomaly leads to a trace that shows which service is slow, and the trace's trace ID appears in the relevant log entries that show the specific error message. Correlation requires that the trace ID propagated through the request chain is recorded in the log entry — a deliberate instrumentation decision, not a default behavior in most logging frameworks. Without a documented trace-log correlation policy, the correlation capability exists as potential rather than as a reliable incident response tool.

The metrics backend decision: cardinality, query language, and storage cost

The metrics backend selection is the most consequential observability decision for day-to-day operational costs and for the incident response question "how granular can my metrics be?" It is typically made when the engineering team first wants a latency dashboard — a low-urgency context where the cardinality implications of the label design are not yet visible.

Self-hosted Prometheus is the most common starting point for teams building their own observability stack. Prometheus scrapes metrics endpoints on a configured interval and stores them in a local time-series database (TSDB). The query language, PromQL, is expressive and well-documented. Cardinality is Prometheus's central operational constraint: the TSDB keeps every active time series in memory, so a Prometheus instance with 10 million active time series requires roughly 10 GB of RAM just for the series heads, before accounting for chunks, WAL, and OS overhead. Teams that add high-cardinality labels (customer_id, user_id, request_id, session_id) to widely-used metrics trigger cardinality explosions that exhaust Prometheus memory and cause scrape failures. The cardinality limit is not a configuration parameter that can be raised; it is a function of the available RAM and the cardinality of the label set. A rough working limit for a given Prometheus deployment is the RAM divided by approximately 1 KB per active time series, which for a 16 GB Prometheus instance is approximately 16 million series. In practice, teams targeting high cardinality at production scale need either a Prometheus-compatible long-term storage backend that handles horizontal scaling (VictoriaMetrics, Thanos, Cortex/Mimir) or a purpose-built high-cardinality backend.
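The capacity arithmetic can be sketched in a few lines of Python. The 1 KB per series figure is the rough estimate quoted above, not a measured value, and the label cardinalities (40 instances, 5 status codes, 50,000 customers) are illustrative assumptions.

```python
from math import prod

BYTES_PER_SERIES = 1024  # rough estimate from the text; real usage varies with labels and churn


def max_active_series(ram_gib):
    """Ceiling on active series if all RAM were spent on series heads."""
    return ram_gib * 1024**3 // BYTES_PER_SERIES


def worst_case_series(label_cardinalities):
    """Product of label cardinalities: the ceiling if every label combination occurs."""
    return prod(label_cardinalities.values())


labels = {"instance": 40, "status_code": 5}
before = worst_case_series(labels)                               # 200 series
after = worst_case_series({**labels, "customer_id": 50_000})     # 10,000,000 series
budget = max_active_series(16)                                   # 16,777,216 series
```

Under these assumptions a single metric with a customer_id label could claim more than half of the 16 GB instance's series budget, which is why the label is usually rejected at design time.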

Managed SaaS metrics backends (Datadog, New Relic, Grafana Cloud, Dynatrace) remove the operational burden of self-hosting but introduce a cost model that scales with the number of active custom metrics and the number of active hosts. Datadog's pricing as of late 2025 charges per custom metric per host — a team with 50 services, 20 instances each, and 200 custom metrics per service pays for 200,000 active custom metric-host pairs per month. Adding a new label to an existing metric multiplies the time series count by the cardinality of the new label, which multiplies the cost proportionally. The cost implication of a cardinality explosion in a managed SaaS is a billing spike rather than a memory exhaustion failure, but the economic consequence is the same: the label design choices made when instrumentation was first added determine the operational cost of the metrics platform at production scale. A team that adds a customer_id label to a latency metric with 50,000 active customers multiplies the time series count for that metric by 50,000, which at Datadog's per-custom-metric pricing produces a cost increase that is often the discovery event for the cardinality constraint.
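The billing arithmetic in this paragraph, worked as a short Python sketch. No prices are included because they vary by contract; the assumption that the latency metric has 20 series before the new label is purely illustrative.

```python
def metric_host_pairs(services, instances_per_service, metrics_per_service):
    """Billable custom metric-host pairs under a per-metric-per-host model."""
    return services * instances_per_service * metrics_per_service


def multiplied_series(series_for_metric, new_label_cardinality):
    """Adding a label multiplies a metric's series count by the label's cardinality."""
    return series_for_metric * new_label_cardinality


# The example from the text: 50 services, 20 instances each, 200 custom metrics per service.
baseline_pairs = metric_host_pairs(50, 20, 200)

# Hypothetical: a latency metric with 20 series gains a customer_id label (50,000 customers).
latency_series = multiplied_series(20, 50_000)
```

One new label on one metric produces five times as many series as the entire baseline in this example, which is the shape of the billing spike described above.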

High-cardinality observability backends (Honeycomb, Grafana Tempo with trace-based exemplars, ClickHouse with a custom schema) are built for workloads where per-request, per-customer, or per-tenant dimensions are necessary for incident diagnosis. Honeycomb stores individual events with arbitrary key-value fields and queries them through a filter-and-group-by query builder, with BubbleUp for comparing outlier events against the baseline, rather than through a time-series query language. The query model is column-store based: filtering by customer_id on a dataset of 100 million events is fast because the customer_id column is stored independently and the filter scans only that column before intersecting with other filter columns. The cost model is per-event or per-gigabyte rather than per-active-series, which makes high-cardinality workloads affordable but makes low-cardinality high-volume workloads expensive relative to a Prometheus-based stack. The tradeoff between Prometheus-style pre-aggregated metrics and Honeycomb-style per-event queryability is the central metrics backend decision that most teams never document explicitly: they adopt one model at the beginning and discover the other model's advantages during an incident where the chosen model cannot answer the needed question.

The query language adoption is a secondary lock-in dimension. A team that has built 200 Grafana dashboard panels using PromQL cannot migrate to a non-Prometheus-compatible backend without rebuilding all dashboards. The metric naming conventions, the alert expressions, the recording rules, and the runbook references to specific PromQL queries are all PromQL-specific artifacts that couple the operational knowledge base to the backend. Like the service mesh decision's coupling between the tracing backend integration format and the mesh version, the metrics query language decision determines which operational artifacts must be rebuilt on a backend migration — a migration cost that is invisible at the time the query language is first adopted.

The distributed tracing decision: format, sampling strategy, and propagation policy

Distributed tracing is the observability signal most directly linked to the ability to diagnose cross-service latency regressions. The tracing decision has three components that are each independently load-bearing: the instrumentation format, the sampling strategy, and the trace context propagation policy. All three are typically chosen together when tracing is first adopted, and all three are rarely documented.

The instrumentation format determines which backends the tracing data can be sent to and whether the instrumentation libraries can be replaced without re-instrumenting the application. Three format families are in common use: OpenTelemetry (OTLP), Jaeger's native format, and Zipkin B3. OpenTelemetry is a vendor-neutral instrumentation standard with exporters for every major tracing backend — a service instrumented with the OpenTelemetry SDK can send traces to Jaeger, Grafana Tempo, Datadog, Honeycomb, Lightstep, or any OTLP-compatible backend by changing the exporter configuration, without changing the instrumentation code. Jaeger's native format and Zipkin B3 are backend-specific: a service instrumented with the Jaeger Go client library sends data in Jaeger's Thrift or Protobuf format; migrating to a Tempo or Honeycomb backend requires either deploying a format translator or replacing the instrumentation library in every service. The format lock-in implication is not visible at adoption time because the first tracing backend is chosen without migration in mind — but two years later, when a cost analysis or a capability requirement triggers a backend evaluation, the migration cost includes re-instrumenting every service that uses the backend-specific library.

The sampling strategy determines what fraction of traces are recorded and under what conditions. Three strategies are in common use. Head-based probabilistic sampling makes the sampling decision at the trace root (the entry point of the first service) before any downstream spans are generated: if the root decides to sample at 1%, then 1% of traces are recorded in full, and 99% generate no spans anywhere in the system. Head-based sampling is computationally cheap (one random number per request at the entry point) but produces a sample that is uniformly distributed across the traffic volume rather than targeted at the interesting traffic. An error that affects 0.1% of requests makes up only 0.1% of the 1% sample: in a system handling 1,000 requests per second, that is one failing request per second, of which 1% is sampled, so an error-causing trace is recorded approximately once every 100 seconds. At 100 requests per second the same error is recorded about once every 1,000 seconds, or roughly every 17 minutes. Tail-based sampling makes the sampling decision after the full trace is assembled, using the completed trace's properties (error status, latency, specific span attributes) to decide whether to retain it. In a typical policy, every error trace is retained; every trace with latency above a threshold is retained; normal-latency success traces are sampled at 1%. Tail-based sampling requires a trace aggregation component (the OpenTelemetry Collector's tail sampling processor or a custom service) that holds traces in memory until they are complete, which adds infrastructure complexity and a memory cost proportional to the trace completion window. Always-on sampling records every trace for every request, which is practical only at low request volumes or in systems where the trace data is stored in a cost-efficient columnar backend.
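The difference between the two decision points can be sketched in Python. This is an illustration under stated assumptions, not the OpenTelemetry Collector's implementation: the span dictionaries, the 2,000 ms threshold, and the injectable rng parameter are all hypothetical.

```python
import random


def head_decision(sample_rate, rng=random.random):
    """Head-based: decided once at the trace root, before any span exists."""
    return rng() < sample_rate


def tail_decision(spans, latency_threshold_ms=2000, baseline_rate=0.01, rng=random.random):
    """Tail-based: decided after the trace is complete, using its properties."""
    if any(span.get("error", False) for span in spans):
        return True  # retain every error trace
    start = min(span["start_ms"] for span in spans)
    end = max(span["start_ms"] + span["duration_ms"] for span in spans)
    if end - start > latency_threshold_ms:
        return True  # retain every slow trace
    return rng() < baseline_rate  # sample the unremarkable remainder


# Hypothetical completed traces, each a list of span dicts.
failed = [{"start_ms": 0, "duration_ms": 180, "error": True}]
slow = [{"start_ms": 0, "duration_ms": 4300}, {"start_ms": 50, "duration_ms": 4000}]
normal = [{"start_ms": 0, "duration_ms": 190}]
```

The head decision never sees the spans, so it cannot favor the failed or slow trace; the tail decision can, at the price of buffering every span until the trace is complete.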

The sampling strategy determines which incident response scenarios the tracing platform can serve. A head-based 1% sampler makes it likely that a latency regression affecting 10% of requests will have enough samples to diagnose — 0.1% of traffic is still a large number of traces for a high-volume system. The same sampler makes it likely that an error affecting 0.1% of requests will not have a trace in the system when the on-call engineer searches for it. Without a documented sampling strategy, the on-call engineer who searches for "the trace for this specific failed checkout" does not know in advance whether the trace exists or whether the sampling probability means it was discarded.
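Whether a searched-for trace exists depends on request volume and on the length of the window being searched. A short Python sketch of that estimate, assuming each request fails and is sampled independently (a simplification; real errors tend to cluster):

```python
def p_at_least_one_trace(requests_per_second, error_rate, sample_rate, window_seconds):
    """Probability that at least one failing request is sampled within the window."""
    requests = requests_per_second * window_seconds
    p_sampled_error = error_rate * sample_rate  # per-request chance of a recorded error trace
    return 1 - (1 - p_sampled_error) ** requests


# A 0.1% error under 1% head-based sampling, searched over a 5-minute window.
low_volume = p_at_least_one_trace(10, 0.001, 0.01, 300)
mid_volume = p_at_least_one_trace(100, 0.001, 0.01, 300)
```

At 10 requests per second the chance of finding any error trace in the window is about 3%; at 100 requests per second it is about 26%. A documented sampling strategy lets the on-call engineer make this estimate before searching.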

The trace context propagation policy is the most commonly underdocumented aspect of distributed tracing. A trace that spans multiple services requires each service to extract the trace context from the incoming request and inject it into all outgoing requests. The standard mechanism is HTTP header propagation: the trace ID, span ID, and sampling decision travel as W3C Trace Context headers (traceparent and tracestate) or as B3 or Jaeger headers (X-B3-TraceId and uber-trace-id, respectively). Without explicit propagation, each service starts a new trace at its entry point, and the distributed trace graph is fragmented: the frontend service's trace is not linked to the API service's trace, which is not linked to the database service's trace. The incident response consequence is that a latency regression visible in the frontend trace cannot be directly correlated to the slow span in the API trace without manual cross-referencing of trace IDs in logs, which loses the primary diagnostic advantage of distributed tracing.
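To make the extract-and-inject step concrete, here is a minimal Python sketch of handling the W3C traceparent header by hand. It accepts only version 00, ignores tracestate, and assumes header names arrive lower-cased; in a real service the OpenTelemetry propagator does this work, and the function names here are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def extract(headers: dict) -> Optional[TraceContext]:
    """Read the caller's trace context from incoming headers, or None if absent or invalid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return TraceContext(trace_id, parent_span_id, bool(int(flags, 16) & 0x01))


def inject(ctx: TraceContext, current_span_id: str, headers: dict) -> None:
    """Write the context into outgoing headers, with this service's span as the parent."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{current_span_id}-{flags}"


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
ctx = extract(incoming)
outgoing = {}
inject(ctx, "b7ad6b7169203331", outgoing)
```

The trace ID and sampling flag pass through unchanged while the span ID is replaced at each hop; a service that skips either half of this step is where the trace graph fragments.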

Propagation failures are particularly common at asynchronous service boundaries. HTTP-to-HTTP propagation is well-supported by instrumentation libraries. HTTP-to-queue propagation (a request that publishes a message to a Kafka topic or SQS queue, where a consumer service picks up the message and performs work) requires the trace context to be serialized into the message metadata. If the publisher service does not write the trace context headers into the message attributes, and if the consumer service does not extract them, the consumer's processing span is orphaned — a root span with no parent, disconnected from the originating request trace. Like the error handling strategy's failure mode decision, the trace context propagation policy is a decision that must be made before the first asynchronous boundary is crossed — the propagation implementation is much harder to retrofit across an established codebase than to build in from the start.
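The asynchronous case can be sketched with a plain dictionary standing in for a queue message. This is a simplified illustration: publish, consume, and the message shape are hypothetical, and a real Kafka or SQS client would carry the value in message headers or message attributes.

```python
import secrets


def publish(payload, current_traceparent=None):
    """Build a message; the trace context travels in its attributes if the publisher adds it."""
    attributes = {}
    if current_traceparent is not None:
        attributes["traceparent"] = current_traceparent
    return {"body": payload, "attributes": attributes}


def consume(message):
    """Start the consumer's processing span, linked to the publisher's trace when possible."""
    span_id = secrets.token_hex(8)
    parent = message["attributes"].get("traceparent")
    if parent is None:
        # No context in the message: an orphaned root span in a brand-new trace.
        return {"trace_id": secrets.token_hex(16), "span_id": span_id, "parent_span_id": None}
    _version, trace_id, parent_span_id, _flags = parent.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "parent_span_id": parent_span_id}


tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
linked = consume(publish({"order": 1}, tp))
orphaned = consume(publish({"order": 1}))
```

Both halves are required: a publisher that writes the attribute is of no use if the consumer never reads it, and the reverse is equally true.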

The log aggregation decision and cross-pillar correlation

The log aggregation decision record deserves its own treatment, and the indexing model — whether structured fields are indexed for sub-second query performance or whether only deployment labels are indexed with content scan for application fields — is the most consequential constraint for incident response. The dimension specific to the observability strategy is how log aggregation integrates with the metrics and tracing pillars for cross-signal correlation.

Cross-signal correlation requires that the same identifiers appear across all three signals. A metrics anomaly at a specific timestamp should be investigable by finding traces from that time window and then finding log entries for those traces. This chain requires: (1) the trace ID appears in the log entry — the logging framework must extract the trace ID from the request context and include it as a structured field in every log record; (2) the Grafana or observability console surfaces a "view logs" link from a specific trace span — the log aggregation backend must be queryable by trace ID, which requires trace ID to be an indexed field; (3) the exemplar mechanism or equivalent links a specific metrics data point to a specific trace — Prometheus supports exemplars (trace IDs attached to histogram observations), which Grafana surfaces as a link from a latency graph to the representative trace, but exemplars require explicit configuration in the instrumentation code.
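Requirement (1) is the part most often left to chance, so a minimal Python sketch using only the standard library may help. The ContextVar stands in for whatever request context the tracing SDK maintains, and the logger name, field names, and build_logger helper are assumptions for illustration.

```python
import contextvars
import io
import json
import logging

# Stand-in for the tracing SDK's request context (an assumption for this sketch).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, with trace_id as a structured field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("tax calculation call exceeded timeout")
entry = json.loads(buffer.getvalue())
```

Putting the filter in shared middleware means application code never passes the trace ID explicitly, which is what makes the correlation reliable rather than dependent on each developer remembering it.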

Without cross-signal correlation, the three pillars function as three independent data sources that require manual coordination: the on-call engineer separately queries the metrics dashboard, the tracing dashboard, and the log search interface, attempting to connect events by timestamp rather than by shared identifiers. The manual coordination adds minutes to each diagnostic step — minutes that matter when an SLO breach is accumulating and customer impact is growing. For platform teams that own the observability infrastructure, cross-signal correlation is a platform capability that every product team inherits automatically if it is configured correctly, and that every product team must implement manually if it is not — a platform-level decision with product-team-level operational consequences.

The observability contract: what questions you've committed to answering

The observability contract is the document that most clearly distinguishes an observability strategy from an observability installation. An observability installation describes what is deployed — "we run Prometheus with a 30-day retention window, Jaeger with 1% head-based sampling, and Grafana Loki with 7-day log retention." An observability contract describes what the installation can and cannot answer during an incident.

An example observability contract, written as explicit capability statements:

We can answer: "Which service in the request path is responsible for a latency regression?" (distributed tracing with service-level span duration data, sampled at 1% head-based). "Is the error rate elevated above the SLO threshold?" (Prometheus error rate metrics with 30-day retention, alerting via Alertmanager). "What was the full stack trace for an exception in the payments service?" (structured log entries in Grafana Loki with a 7-day retention window, queryable by service name and log level). "Was this incident region-specific?" (Prometheus metrics with a region label, allowing per-region aggregate queries).

We cannot answer without additional investigation: "Which specific customers are experiencing the latency regression?" (the checkout latency metric does not carry a customer_id label due to cardinality constraints; per-customer impact requires a manual query against the application database using the time window of the incident). "What was the database query that caused the slow span?" (Jaeger spans for database calls do not capture the full SQL query as a span attribute; the slow query log in PostgreSQL is the authoritative source, accessible via SSH to the database host). "Is this incident isolated to requests from a specific user agent or client library version?" (no user_agent label on request metrics; trace search by attribute is possible but requires the 1% sample to contain a representative trace from the affected client).

The observability contract converts undocumented capability gaps from surprises discovered under incident pressure into known constraints that the on-call engineer can consult before starting the diagnosis. Like the postmortem's role in converting an incident's root cause into a documented decision, the observability contract converts the "I can't find that data" discovery from an incident event into a design-time acknowledgment that the gap exists, with an explicit decision about whether to close it with additional instrumentation or accept it as a known limitation.

The SLO and SLI decisions belong in the observability contract. A service-level objective ("the checkout API p99 latency must remain below 2 seconds for 99.9% of requests over a rolling 28-day window") is only meaningful if the observability platform can measure it. The SLI definition — which metric, which label set, which percentile, measured over which time window — is an observability contract element: it names the specific capability the platform must provide to support the SLO. Like the security ADR's compliance scope decision, the SLO definition is a commitment that must be backed by a platform capability — making a p99 SLO commitment without a metrics backend that retains enough granularity to compute p99 over a 28-day window is a compliance gap waiting to be discovered at the next SLO review.
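One way to make the SLI definition checkable is to write it down as data. The Python sketch below is an assumption-laden illustration: it reads the example as a request-based objective (99.9% of requests under 2 seconds over 28 days), and the metric name and label set are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    metric: str
    labels: tuple
    threshold_seconds: float
    window_days: int
    objective: float  # fraction of requests that must meet the threshold


def error_budget_requests(sli, total_requests):
    """Number of requests allowed to miss the threshold within the window."""
    return round(total_requests * (1 - sli.objective))


def measurable(sli, retention_days, available_labels):
    """The SLO is backed by the platform only if retention and labels cover the SLI."""
    return retention_days >= sli.window_days and set(sli.labels) <= set(available_labels)


checkout = Sli("checkout_request_duration_seconds", ("region", "status_code"), 2.0, 28, 0.999)
platform_labels = {"region", "instance", "status_code"}

budget = error_budget_requests(checkout, 10_000_000)
backed = measurable(checkout, retention_days=30, available_labels=platform_labels)
```

With 10 million requests in the window the budget is 10,000 slow requests, and the check fails as soon as retention drops below the SLO window or the SLI names a label the platform does not carry.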

Writing the observability strategy decision record

The Nygard ADR format adapts for observability strategy decisions with five sections that most observability deployments leave entirely undocumented.

The observability contract. Before naming the tools, name the commitments. "The observability strategy is evaluated against four incident response questions that the on-call engineer must be able to answer within 5 minutes of alert receipt: (1) which service in the request path is responsible for elevated latency? (2) what is the current error rate for the checkout API, and how does it compare to the 7-day baseline? (3) did the latency regression affect all customers or a subset of customers? (4) is the issue isolated to a specific region or deployment? The current platform answers questions 1, 2, and 4 with the tooling below. Question 3 is answered with a 10-minute lag via a manual database query, because cardinality constraints prevent per-customer metrics labels at the current data volume. This gap is an accepted limitation documented here; it will be closed if the platform migrates to a high-cardinality backend (see Revisitation Conditions)."

The metrics backend decision with alternatives evaluated. "Prometheus 2.45 with a 30-day hot retention and VictoriaMetrics for 12-month long-term storage. Selected in August 2023. Alternatives evaluated: (1) Datadog — evaluated for 30 days on a trial; rejected on cost: at current scale (180 services, 25 instances average, 150 custom metrics per service) the projected monthly cost was $18,000/month; the team does not have budget for managed observability at this cost; Datadog APM distributed tracing was also evaluated and rejected on the same basis; (2) Grafana Cloud — evaluated as a Prometheus-compatible managed option at lower cost; rejected because the free tier's 10,000 active series limit was reached in the first day of a trial, and the paid tier's cardinality limit of 1M active series was projected to be exceeded within 6 months at current label growth rate; (3) Honeycomb — evaluated as a high-cardinality alternative; rejected because the event-based query model requires a different dashboard and alert expression approach than PromQL and the migration cost of existing dashboards was estimated at 3 engineer-weeks. Cardinality limit: the production Prometheus instance is provisioned with 32 GB RAM, supporting approximately 30 million active series. Current active series count is 4.2 million (as of 2024-01-15). Labels permitted on high-volume metrics: region, service, instance, environment, http_method, status_code, endpoint (for endpoints with a bounded set, not parameterized by ID). Labels explicitly prohibited on high-volume metrics: customer_id, user_id, request_id, session_id, tenant_id — these labels produce cardinality explosions at current customer volume (85,000 active customers). Per-entity-dimension analysis requires distributed traces or log queries."

The distributed tracing decision. "OpenTelemetry SDK (Go: go.opentelemetry.io/otel v1.22, Node.js: @opentelemetry/sdk-node v0.48) with Jaeger as the backend (self-hosted Jaeger 1.52, all-in-one mode). Exporter: OTLP/gRPC to the OpenTelemetry Collector, which fans out to Jaeger and to a custom trace-metrics bridge that emits histogram observations with exemplars for Prometheus. Selected March 2024, migrating from Jaeger Go client library used previously. Migration motivation: the Jaeger client library was vendor-specific and would require re-instrumentation to migrate backends; OpenTelemetry allows backend-independent instrumentation — we can change the OTEL Collector exporter configuration to switch backends without touching service code. Sampling strategy: 1% head-based probabilistic sampling for all requests (configured in the OTEL Collector using the probabilistic_sampler processor). Exceptions: always-on sampling for requests that produce an error (5xx or business-logic error flag set in span attributes) — configured in the OTEL Collector using the tail_sampling processor with a rules set. Sampling decision caveat: a 1% head-based sample means that for a low-frequency error (< 1% of requests), there may be no error trace in the system within a given 5-minute window. The always-on error rule mitigates this for server errors; business-level errors that are classified as 4xx rather than 5xx are not covered by the always-on rule and may be under-represented in the trace sample. Trace context propagation policy: W3C Trace Context headers (traceparent/tracestate) are the canonical propagation format. All services must extract W3C headers from incoming HTTP requests and inject them into all outgoing HTTP requests using the OpenTelemetry propagator API. 
Asynchronous boundaries: for Kafka-published messages, the trace context is serialized into message headers using the OpenTelemetry Go propagation package; Kafka consumer services extract the context from message headers as the root span parent. SQS message attributes are used for the same purpose. Services that do not propagate trace context correctly produce orphaned root spans — a service team that discovers their spans are not connected to the request trace in Jaeger should check the propagation middleware. Trace-log correlation: each log entry must include a trace_id field populated from the OpenTelemetry trace context for the current request. This is implemented in the shared logging middleware (pkg/observability/logger.go in the monorepo); services using the shared middleware get trace-log correlation automatically. Services with custom logging outside the shared middleware are responsible for extracting and logging the trace ID — this is a known gap with 3 identified services as of 2024-03-01."

The log aggregation integration. "Grafana Loki 2.9 with a 7-day retention window and a 30-day archive to S3 (read-only via Loki's S3 backend, not queryable directly). Loki indexes service name, environment, region, and pod_name deployment labels — these are the only fields available as instant filters in the Grafana Explore interface. Application-level fields (customer_id, tenant_id, user_id, request_id, trace_id) are stored in the log line content as structured JSON and require a LogQL JSON extraction filter (| json | trace_id="...") for field-level filtering. The LogQL content scan is slower than an indexed label query — a trace_id filter over 1 hour of logs from 20 service replicas takes approximately 8 seconds. For trace-log correlation during incidents, the recommended workflow is: (1) identify the affected time window from Prometheus alerts or the Jaeger trace; (2) filter Loki by service name label (instant, indexed); (3) filter the result by trace_id using a LogQL JSON extractor (8-second scan). The Grafana Loki data source is configured to support derived fields: a trace_id field in a log line is automatically linked to the Jaeger trace for that ID in the Grafana Explore interface. Cross-pillar correlation relationship: the log aggregation indexing model's constraint — that only deployment-level labels are indexed — is the reason per-tenant and per-customer log filtering is slower than per-service filtering. This constraint is consistent with the metrics cardinality constraint; both reflect the same decision that per-entity-dimension analysis uses traces rather than indexed signals."
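Steps (2) and (3) of that workflow combine into a single LogQL query. The Python helper below only builds the query string; it assumes the indexed label is named service (the record says service name is indexed but does not give the label key), and the function name is hypothetical.

```python
import re


def trace_logs_query(service, trace_id):
    """Build a LogQL query: indexed label selector first, then the JSON field filter."""
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    if not re.fullmatch(r"[A-Za-z0-9_-]+", service):
        raise ValueError("unexpected characters in service name")
    # The label matcher narrows the streams via the index; "| json" then parses
    # each line so that trace_id can be compared as an extracted field.
    return f'{{service="{service}"}} | json | trace_id="{trace_id}"'


query = trace_logs_query("checkout", "4bf92f3577b34da6a3ce929d0e0e4736")
```

Putting the label selector first matters: it limits the content scan to one service's streams, which is what keeps the trace_id filter in the range of seconds rather than minutes.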

The revisitation conditions. "Re-evaluate the observability strategy if any of the following triggers occur: (1) Prometheus active series count exceeds 25 million (the 80% capacity threshold for the current 32 GB instance) — this triggers an evaluation of either adding a Prometheus federation tier or migrating to a backend that scales cardinality horizontally without additional hardware; (2) the monthly cost of VictoriaMetrics long-term storage exceeds $1,500/month — at this point, a managed long-term storage option (Grafana Cloud for long-term only, Datadog with limited custom metrics) should be re-evaluated for cost competitiveness; (3) a compliance requirement is identified that requires trace data to be retained for longer than the current 72-hour Jaeger retention window — SOC 2 Type 2 or a customer's enterprise security questionnaire may impose trace retention requirements; the current backend cannot meet retention requirements beyond 72 hours without storage cost that exceeds the Honeycomb alternative; (4) the number of services instrumented with the Jaeger native client library (pre-OpenTelemetry migration) exceeds 0 — all remaining non-OTel-instrumented services should be migrated before the Jaeger client library falls out of security support; current count is 3 services as of 2024-03-01; (5) an incident occurs in which a question in the observability contract's 'cannot answer' list was the critical missing piece — this triggers an immediate evaluation of whether closing that specific capability gap is worth the instrumentation cost."

Finding observability decisions in AI chat

The WhyChose extractor surfaces observability decisions from four session types that contain the reasoning most teams cannot reconstruct when a new on-call engineer asks why the current platform cannot answer a specific incident question, or when a cardinality incident or an unexpected cost spike triggers a platform re-evaluation.

The initial instrumentation session. "How do I add Prometheus metrics to a Node.js Express app?", "OpenTelemetry vs. Jaeger vs. Zipkin: which distributed tracing library should I use?", "how do I set up Grafana with a Prometheus data source?", "should I use Datadog or build my own monitoring stack?", "how do I instrument a Go microservice with distributed tracing?", "what's the difference between OpenTelemetry and Jaeger?", "should I use head-based or tail-based sampling for distributed traces?". These sessions contain the metrics backend selection, the tracing library choice, the sampling strategy, and the alternatives evaluated before the decision was made. The initial instrumentation session is the highest-value recovery target because it contains the specific rejection reasons for each alternative. Those reasons are what stop a new platform engineer, arriving two years later, from re-evaluating the same alternatives after wondering why Datadog wasn't used instead of Prometheus, or why Honeycomb wasn't evaluated for the high-cardinality use case. The rejection reasons are captured only in the initial evaluation session and are scattered within weeks of adoption. They are rarely consolidated into documentation because the team moves on to the next priority as soon as the first dashboard is live.

The incident response session. "Our API latency spiked — how do I find which service is slow in Jaeger?", "how do I filter Prometheus metrics by customer ID?", "why aren't my traces showing up in Grafana Tempo?", "how do I correlate a Loki log entry with a Jaeger trace?", "my Grafana dashboard shows elevated p99 but I can't find the slow requests in the traces", "why are some of my spans missing in the distributed trace waterfall?". These sessions contain the incident diagnosis workflow and, crucially, the moments when the current observability platform could not answer a question the engineer needed. Like the postmortem ADR that converts a production incident's lessons into a documented decision, the incident response AI session contains the raw evidence for which capability gaps exist in the current observability platform. The question "how do I filter Prometheus metrics by customer ID?" is the discovery event for the cardinality constraint — finding this session in the AI chat history gives the observability contract its "we cannot answer: which specific customers are affected?" section, documented with the actual incident context rather than as a theoretical limitation.

The cardinality incident session. "Prometheus is running out of memory — could it be a cardinality explosion?", "how do I find which metrics are causing high cardinality in Prometheus?", "Datadog costs spiked this month — how do I find which custom metrics are responsible?", "should I use histograms or gauges to avoid cardinality issues?", "how do I reduce the number of active time series in Prometheus?", "what are the Prometheus cardinality best practices?", "how do I use prom-label-proxy to enforce label cardinality limits?". These sessions document the specific cardinality incident — the event that surfaced the cardinality constraint as an operational problem rather than a theoretical risk — and the policy change that resulted. Like the cache stampede incident that reveals the missing stampede protection policy, the cardinality incident reveals the missing label cardinality policy and the prohibited label set that should have been in the observability strategy ADR from the beginning. Recovering this session produces the cardinality limit section of the metrics backend ADR without requiring the team to reconstruct the policy from first principles.

The platform migration session. "How do I migrate from Datadog to Grafana Cloud?", "how do I move from Jaeger B3 propagation format to OpenTelemetry W3C Trace Context?", "is there a way to export Prometheus metrics to Honeycomb without re-instrumenting?", "how do I avoid re-instrumenting all services when changing the tracing backend?", "can I use the OpenTelemetry Collector to fan out to multiple tracing backends during a migration?", "how do I migrate from the Jaeger Go client library to the OpenTelemetry Go SDK?". These sessions contain the migration cost analysis and the format lock-in consequences that should have been in the original decision record. Like the ADR lifecycle record that captures what made the prior decision obsolete, the migration session contains the accumulated cost of the original format choice and the specific pain that drove the migration decision. The migration session also reveals the new decision — which backend, which format, which sampling strategy — which becomes the superseding ADR that the original record points to.

What the decision record prevents

A documented observability strategy prevents three recurring problems that teams encounter as their systems grow and their engineering team turns over.

It prevents the cardinality surprise under incident pressure. A team without a documented cardinality policy adds per-customer dimensions to metrics when the first customer-specific incident requires them — not knowing that the addition will exhaust the Prometheus instance within hours of deployment. The cardinality explosion surfaces at the worst possible moment: during the incident that prompted the per-customer metrics addition. The decision record that names "labels prohibited on high-volume metrics: customer_id, user_id, request_id" with the explicit reason ("cardinality limits at current customer volume") converts the cardinality constraint from a surprise to a known design parameter that engineers consult before adding a new label. The on-call engineer who wants per-customer visibility reads the observability contract, finds the "we cannot answer: which specific customers are affected?" entry with the documented reason, and knows to query the application database rather than attempting to add a label that will cause an incident.
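
The arithmetic behind the prohibited-label policy is worth making explicit. The sketch below computes a worst-case series count as the product of distinct label values, and checks a proposed label set against the prohibited labels named in the record. The label value counts and the 12 series per label combination are illustrative assumptions, not measurements from any real system, and the product is an upper bound because not every combination occurs in practice.

```python
from math import prod
from typing import Dict, Iterable, Set

# Labels the decision record prohibits on high-volume metrics.
PROHIBITED_LABELS = {"customer_id", "user_id", "request_id"}


def worst_case_series(label_value_counts: Dict[str, int], series_per_combination: int = 1) -> int:
    """Upper bound on active series for one metric.

    label_value_counts maps each label to its number of distinct values.
    series_per_combination covers histograms, where every label combination
    emits one series per bucket plus the sum and count series.
    """
    return prod(label_value_counts.values()) * series_per_combination


def prohibited_in(labels: Iterable[str]) -> Set[str]:
    """Return the proposed labels that the policy prohibits."""
    return PROHIBITED_LABELS & set(labels)


# Hypothetical checkout latency histogram: 3 regions, 20 instances, 5 status codes.
current = {"region": 3, "instance": 20, "status_code": 5}
before = worst_case_series(current, series_per_combination=12)

# The same metric after adding customer_id with an assumed 10,000 customers.
after = worst_case_series({**current, "customer_id": 10_000}, series_per_combination=12)
```

Under these assumed numbers, one metric goes from 3,600 series to 36 million, which is why a single label on a single high-volume histogram can be enough to exhaust an instance. A check like prohibited_in could run in code review or CI before the label ever reaches production.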

It prevents the trace propagation gap from becoming a diagnostic blind spot. A team without a documented trace context propagation policy accumulates asynchronous service boundaries where trace context is not propagated: Kafka consumers that start new root spans rather than continuing the publishing service's trace, HTTP outbound calls from background workers that carry no parent context, and third-party webhook handlers that receive requests with no trace context and emit orphaned spans. Each unconnected span is a diagnostic blind spot: an on-call engineer investigating a latency regression that crosses an asynchronous boundary will find the originating trace ending at the Kafka publish and the consumer trace starting as an unconnected root, with nothing in Jaeger linking the two. Without a documented propagation policy, the gap is discovered during the incident, when the engineer searches for the connected trace and finds a broken chain. Like other foundational infrastructure decisions, the propagation policy is easiest to implement correctly at the moment the first asynchronous boundary is introduced. Retrofitting trace context propagation to an established asynchronous codebase requires identifying every publish and consume site across every service, an archaeological exercise that a documented policy would have prevented.
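
What "continuing the publishing service's trace" means at a message boundary can be shown with the W3C Trace Context traceparent header, which has the form version, trace ID, parent span ID, and flags. The sketch below is a hand-rolled, standard-library illustration of the mechanism, not the OpenTelemetry SDK's API. The function names are hypothetical, and message headers are modelled as a plain dict, whereas a real Kafka client carries them as key and bytes pairs.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> Dict[str, str]:
    """Publisher side: write the current span's context into the message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Consumer side: read (trace_id, parent_span_id, sampled), or None if unusable."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def start_consumer_span(headers: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Continue the publisher's trace when context exists; otherwise start a root."""
    context = extract(headers)
    span_id = secrets.token_hex(8)  # 16 hex characters
    if context is None:
        # This branch is the blind spot: a new, unconnected root span.
        return {"trace_id": secrets.token_hex(16), "span_id": span_id, "parent_id": None}
    trace_id, parent_id, _sampled = context
    return {"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
```

A propagation policy amounts to requiring that every publish site calls something like inject and every consume site calls something like extract, so the fallback branch is only ever reached for traffic that genuinely originates outside the system.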

It prevents the format lock-in from becoming a migration debt surprise. A team without a documented tracing format decision does not know, when evaluating a new tracing backend, whether their existing instrumentation is vendor-specific or vendor-neutral. A team that instrumented with the Jaeger Go client library discovers during a migration evaluation that switching to Grafana Tempo or Honeycomb requires replacing the instrumentation library in every service, a migration scope that was not anticipated because the format lock-in was never documented as a property of the original decision. The decision record that names "instrumentation format: OpenTelemetry SDK, selected to avoid vendor lock-in; backend migration requires only OpenTelemetry Collector exporter reconfiguration, not service re-instrumentation" converts the format flexibility from an undocumented benefit to a documented commitment. The next platform evaluation begins knowing that a backend change is a Collector reconfiguration exercise, not a service re-instrumentation project, which changes the scope estimate from weeks to hours before the evaluation begins. A technical leader who inherits an observability stack without a documented format decision cannot assess the migration cost without instrumenting a test service against each candidate backend and measuring how much instrumentation code has to change, a manual exercise that a format decision record makes unnecessary.

Further reading