2026-06-26 · ~19 min read

The logging strategy decision record: why the log structure you chose determines your incident investigation latency and your storage cost at scale

Log format, log level policy, correlation IDs, sampling rate, and retention tiers are decided by default: console.log strings in development that ship to production, DEBUG enabled because nobody wrote a policy, no traceId contract because OpenTelemetry wasn't in the initial stack. None of these decisions are documented. Three years later, the log format is the primary reason a production incident takes four hours to diagnose instead of fifteen minutes, and the absence of a log level policy is why $34.50 a month in unexpected CloudWatch charges appeared in the cost review. The logging strategy is an architectural decision. It should be documented before the first service ships to production.

A 28-person SaaS company had been running a document collaboration product for two years. Logging had grown organically from the initial Node.js service: console.log statements using template literals, formatted as human-readable sentences. "User 4291 requested document export at 14:32:07." "Document 8831 export job queued, queue depth 14." "Export worker 3 started job for document 8831." The logs were readable when a developer watched them stream in a terminal during debugging. They were the only kind of log entry the company had ever emitted, and nobody had written down a decision about why.

When a production incident happened (intermittent 500 errors on document exports, affecting approximately 8% of export requests over a 40-minute window), the on-call engineer opened CloudWatch and began investigating. The first step was to find all log entries related to the failing exports. Without structured fields, a field query such as filter document_id = 8831 was not available. The engineer used CloudWatch Insights with a case-insensitive regex pattern against the message string: filter @message like /(?i)export/. This returned 12,000 entries from the 40-minute window: all export-related messages from all services, for all documents, with no way to separate the successful exports from the failing ones except by reading each entry. The regex pattern matched entries from four different services that had slightly different message formats: the web server logged "document export request received," the job dispatcher logged "export job queued," the worker logged "export worker started job," and the storage service logged "export file written." None of these entries shared a common request identifier. Correlating the web server, job dispatcher, worker, and storage entries for the same export therefore required comparing timestamps, which is inexact when multiple exports are in flight simultaneously, and user IDs, which meant the full set of 12,000 entries had to be scanned to group them by user.

The root cause was a race condition in the job dispatcher: when the queue depth exceeded 10, the dispatcher emitted a WARN-level message that included the job ID and a flag indicating the job was delayed. But the WARN message format used a slightly different template from the INFO messages — "WARN: dispatcher queue depth exceeded threshold, job ${jobId} delayed" — while the worker's entry for the same job used the format "Export worker ${workerId} started job for document ${docId}." There was no shared identifier between the dispatcher's WARN entry and the worker's INFO entry. The engineer identified the race condition only after manually tracing seven individual exports through the logs, matching timestamps within a 200ms window and cross-referencing user IDs, to establish that exports that coincided with a queue depth WARN were failing at a 40% rate. The investigation took four hours and sixteen minutes. A logging strategy that required a traceId field in every log entry — assigned at the web server when the export request arrived, propagated through the job metadata to the dispatcher and the worker — would have made the query "filter trace_id = [the trace ID from any one of the failing 500 responses]" return every relevant log entry across all four services in a single result set. The investigation would have taken fifteen minutes.

After the incident, the team added structured logging to all new services. But the existing four services kept their unstructured log format, because migrating hundreds of log emission sites across four production services was not prioritized in the sprint queue. Eighteen months later, the production environment had two log formats: structured JSON logs from services created after the incident, and unstructured template-literal strings from the original services. Dashboards that queried by field value worked only for the new services. Dashboards that used regex patterns worked across all services but were slower and failed silently when the string format changed. The split log format was the direct consequence of an initial logging strategy that was never documented — no one could point to a decision record that said "the log format is template-literal strings, the rationale was development velocity, the re-evaluation trigger is when incident investigation requires field-level queries." Without that record, the migration was indefinitely deferred rather than scheduled, because there was no documented definition of "correct" to measure the existing format against. The pattern is the one described in decisions never written down: the logging format exists as a set of individual console.log calls whose design rationale is distributed across the first developer's personal style and the institutional memory of engineers who have since left the company.

The second incident was a cost and performance failure. A 19-person team maintained a REST API that processed 3 million requests per day. The logging strategy, insofar as one existed, was documented only in a comment in the logger configuration: "debug level for development, info for production." There was no enforcement mechanism — no CI check, no PR review criterion — to prevent DEBUG-level statements from reaching production. A senior engineer added DEBUG-level logging to a particularly tricky payment processing flow during a debugging session: the log entries captured the full serialized request payload, the intermediate state of the payment object at each processing step, and the outbound API call parameters to the payment provider. The average DEBUG log entry size was 2.4 kilobytes. The DEBUG logging was committed alongside the bug fix it was written to investigate, deployed in the same PR, and the DEBUG log entries began appearing in production.

The payment processing flow handled 320,000 requests per day. At 2.4 kilobytes per entry, with three DEBUG entries emitted per request, the flow was generating 2.3 gigabytes of debug logs per day. At CloudWatch Logs ingestion pricing of $0.50 per gigabyte, this added $1.15 per day — $34.50 per month — to the team's AWS bill. The cost was not zero, but it was small enough to be lost in invoice rounding. The more serious problem was performance. The team's CloudWatch Insights dashboards queried across the same log groups where the payment processor was now contributing 2.3 gigabytes per day of DEBUG entries. Dashboard queries that previously ran in 2 to 3 seconds began timing out: CloudWatch Insights has a query result size limit, and the increased log volume caused queries to scan and discard large volumes of debug entries before reaching the INFO-level entries the dashboards were actually monitoring. Three production dashboards became unreliable within two weeks of the DEBUG deployment, failing with timeout errors at peak traffic times. An on-call alert that depended on one of those dashboards missed a genuine payment processing slowdown for 23 minutes because the alerting query had timed out before the alert evaluated.
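The arithmetic in this paragraph can be reproduced in a few lines. This is a minimal sketch: the function name, the decimal-gigabyte convention, and the 30-day month are assumptions, and the $0.50 rate is the figure quoted above rather than a current price list.

```python
def daily_ingestion_gb(requests_per_day: int, entries_per_request: int, kb_per_entry: float) -> float:
    """Daily log volume in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return requests_per_day * entries_per_request * kb_per_entry / 1_000_000


PRICE_PER_GB = 0.50  # ingestion price used in the text (assumed constant here)

gb_per_day = daily_ingestion_gb(320_000, 3, 2.4)
cost_per_day = gb_per_day * PRICE_PER_GB
cost_per_month = cost_per_day * 30  # assumes a 30-day month

print(f"{gb_per_day:.3f} GB/day, ${cost_per_day:.3f}/day, ${cost_per_month:.2f}/month")
```

The unrounded result lands a few cents above the $34.50 in the text, which comes from rounding the daily cost to $1.15 before multiplying by 30.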

The investigation to identify which service was generating the excess log volume took three days, because CloudWatch log group ingestion costs were not monitored per service. The team's monthly AWS Cost Explorer showed the total CloudWatch Logs line item growing, but the drill-down to per-log-group ingestion required a custom Cost Allocation Tag query that the team had never configured. Once the source was identified and the DEBUG statements removed, the team added a CI check: a script that searched new commits for logger.debug( calls in files within the payment processing module and rejected the PR if the count exceeded the baseline. The check addressed the specific case that caused the incident but did not address the general policy gap — there was still no written definition of what DEBUG-level logging was permitted to emit, no cardinality policy for log field values, and no per-service cost ceiling that would have triggered a review before three production dashboards failed. The new engineering lead onboarding problem applies here: the engineer who joins the team after this incident finds a CI check that rejects logger.debug( calls in one specific module, with no explanation of the broader logging policy it was meant to enforce, no ability to judge whether the pattern applies to other modules, and no record of what incident caused the rule to exist.
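A check of the kind the team added might look like the sketch below. It is an illustration, not the team's actual script: the function names, the "*.js" file pattern, and the idea of passing the baseline as a number are all assumptions.

```python
import re
from pathlib import Path

# Matches calls such as logger.debug(...) but not, say, mylogger.debug(...).
DEBUG_CALL = re.compile(r"\blogger\.debug\(")


def count_debug_calls(source: str) -> int:
    """Count logger.debug( call sites in one source file's text."""
    return len(DEBUG_CALL.findall(source))


def check_module(root: Path, baseline: int, pattern: str = "*.js") -> tuple[bool, int]:
    """Return (passed, total): passed is False when the module exceeds its baseline."""
    total = sum(
        count_debug_calls(path.read_text(encoding="utf-8"))
        for path in root.rglob(pattern)
    )
    return total <= baseline, total
```

A CI step would call check_module on the payment module directory and fail the build when passed is False. The sketch shares the limitation described above: it enforces a count in one module and says nothing about what DEBUG entries are allowed to contain.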

The three structural properties that the logging strategy determines

When teams add logging to a service for the first time, the scope is narrow: emit some information that helps diagnose problems during development. A console.log statement achieves this in seconds. The structural properties that the choice sets — log queryability, correlation graph coverage, and ingestion cost trajectory — are not visible when the product has one service and ten requests per second. They become visible when the product has eight services, 3 million requests per day, and a production incident that needs to be diagnosed in under fifteen minutes to meet the SLA.

Log queryability and field cardinality. The log format determines what questions can be answered in the log aggregation system without writing a new parsing expression or exporting the log data. Unstructured string logs answer questions expressible as substring matches and regex patterns: "find all log entries containing 'payment failed'" — useful for counting occurrences but not for querying "find all payment failures for user 4291 in the last hour where payment_method equals stripe" without a regex that correctly parses the user ID and payment method from the string format. Structured JSON logs answer field-equality and range queries natively, because the log aggregation system indexes the JSON fields and can evaluate "filter user_id = 4291 AND payment_method = 'stripe' AND level = 'ERROR'" as an indexed lookup rather than a full-scan regex. The query performance difference — seconds versus minutes — is the investigation latency difference between a 15-minute incident resolution and a 4-hour one.
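The difference between the two query styles can be shown in memory. This sketch is only an illustration of field equality versus string parsing; it does not model how any aggregation system indexes data, and the sample log lines are invented for the example.

```python
import json
import re

raw_json_lines = [
    '{"level": "ERROR", "user_id": "4291", "payment_method": "stripe", "message": "payment failed"}',
    '{"level": "ERROR", "user_id": "4291", "payment_method": "paypal", "message": "payment failed"}',
    '{"level": "INFO", "user_id": "77", "payment_method": "stripe", "message": "payment ok"}',
]
entries = [json.loads(line) for line in raw_json_lines]


def filter_entries(entries, **criteria):
    """Field-equality filter: every named field must match exactly."""
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


structured_hits = filter_entries(entries, user_id="4291", payment_method="stripe", level="ERROR")

# The unstructured equivalent depends on one exact sentence template.
text_lines = [
    "Payment failed for user 4291 via stripe",
    "Payment failed for user 4291 via paypal",
    "stripe payment failed (user=4291)",  # same event, different template
]
template = re.compile(r"Payment failed for user (\d+) via (\w+)")
regex_hits = [
    line for line in text_lines
    if (m := template.search(line)) and m.groups() == ("4291", "stripe")
]
```

The third text line describes a matching failure but uses a different template, so the regex skips it without any error. That is the silent-failure mode of string-format queries.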

Queryability also depends on field cardinality. A structured log field whose values are unbounded (a request payload body serialized as a string field, a SQL query string with parameter values embedded) produces a high-cardinality field that log aggregation systems store but cannot index efficiently. CloudWatch Logs Insights can filter on any JSON field but only indexes a subset of fields for fast lookup; Datadog Logs indexes fields up to a per-field cardinality limit before falling back to full-text search; Grafana Loki does not index field values at all beyond stream labels, relying on query-time filtering and regex matching for any query that goes beyond the indexed labels. A logging strategy that logs full request payloads at INFO level produces logs that are queryable by the payload's contents but defeats field indexing for high-frequency services, where the cardinality of payload content is effectively unbounded. The logging strategy ADR must define which fields are permitted at which log levels, with an explicit policy that high-cardinality values (request bodies, API response payloads, SQL query strings with parameter values) are logged only at DEBUG level (with the expectation that DEBUG is not enabled in production at steady state) or are truncated to a fixed-length fingerprint (a hash or a fixed-length prefix) at INFO level. The logging infrastructure decision record covers tool selection (CloudWatch versus Datadog versus Loki) and the cost and query characteristics of each; the logging strategy ADR covers what data is emitted to that tool, which is a decision independent of the tool chosen.
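The fingerprint policy can be sketched as a small filter applied before emission. The key names in HIGH_CARDINALITY_KEYS, the "_sha256" suffix, and the 16-character digest length are hypothetical choices for illustration, not part of any library or of the ADR text.

```python
import hashlib

# Hypothetical list of fields the policy treats as unbounded.
HIGH_CARDINALITY_KEYS = {"request_body", "response_body", "sql"}


def fingerprint(value: str, length: int = 16) -> str:
    """Fixed-length fingerprint: a truncated SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def fields_for_level(level: str, fields: dict) -> dict:
    """Pass everything through at DEBUG; replace unbounded values with fingerprints otherwise."""
    if level == "DEBUG":
        return dict(fields)
    safe = {}
    for key, value in fields.items():
        if key in HIGH_CARDINALITY_KEYS:
            safe[f"{key}_sha256"] = fingerprint(str(value))
        else:
            safe[key] = value
    return safe
```

Two INFO entries with the same payload share a fingerprint, so repeated payloads can still be grouped, while the entry size stays bounded regardless of payload size.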

Correlation graph coverage. A correlation ID — a unique identifier that propagates through every service, database call, and background job that participates in handling a single user request — determines whether the logs from a production incident can be reassembled into a coherent execution trace. Without correlation IDs, the log evidence for an incident is a set of log entries from multiple services where the relationship between entries is reconstructed by timestamp proximity and shared context fields (user ID, resource ID). Timestamp proximity is inexact: two services with 50ms clock skew and concurrent requests will produce log entries that overlap in time, making it impossible to determine with confidence which web server log entry corresponds to which worker log entry without an explicit shared identifier. Shared context fields narrow the candidate set but do not eliminate ambiguity: a user_id field returns all log entries for that user, not just the entries from the specific request being investigated.

The OpenTelemetry tracing model provides the correlation vocabulary that makes log entries linkable across service boundaries. A traceId — a 128-bit identifier assigned at the entry point of a request (the API gateway, the webhook receiver, the cron job trigger) — is propagated via HTTP headers (traceparent header per the W3C Trace Context specification) to every downstream service call. A spanId — a 64-bit identifier assigned to each unit of work within the trace — links log entries within a service to the specific function call, database query, or outbound HTTP request they were emitted during. Every log entry that includes traceId and spanId can be joined to the distributed trace if the trace was sampled, or queried independently across services if it was not. The correlation graph coverage — the percentage of log entries that carry a traceId — is a function of how completely the traceId propagation is implemented. If HTTP middleware injects traceId into the logging context for synchronous service calls but background job processing does not propagate the originating traceId through the job metadata, then the portion of a request's execution that happens asynchronously is invisible in the correlation graph. An export that triggers a background job that calls a third-party service — three steps, only the first carries the traceId if the job metadata doesn't propagate it — produces a log trail that stops at the job queue boundary. The correlation gap is where the root cause evidence most often lives: the background step that failed is the one that cannot be correlated back to the originating user request. The service mesh decision record is relevant here: a service mesh that handles inter-service HTTP calls can inject and propagate the traceparent header automatically, reducing the surface area over which engineers must remember to propagate the traceId manually.
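The job queue boundary is where propagation most often breaks, so it is worth seeing the mechanics spelled out. The sketch below hand-parses the version 00 traceparent layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and carries the header through job metadata. The function names and the job dictionary shape are assumptions; a production service would normally use the OpenTelemetry SDK's propagators rather than a hand-rolled parser.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the trace context as a dict, or None if the header is malformed."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {"trace_id": trace_id, "parent_span_id": parent_id, "flags": flags}


def enqueue_export(queue: list, document_id: str, traceparent: str) -> dict:
    """Web server side: copy the incoming header into the job metadata."""
    valid = parse_traceparent(traceparent) is not None
    job = {"document_id": document_id, "metadata": {"traceparent": traceparent if valid else None}}
    queue.append(job)
    return job


def worker_log_context(job: dict) -> dict:
    """Worker side: recover the trace ID and start a new span for this unit of work."""
    ctx = parse_traceparent(job["metadata"].get("traceparent") or "")
    if ctx is None:
        return {}
    return {"trace_id": ctx["trace_id"], "span_id": secrets.token_hex(8)}
```

Every log entry the worker emits with worker_log_context(job) merged in carries the same trace_id as the originating web request, which closes the gap at the queue boundary.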

Ingestion cost trajectory. Log ingestion cost is a function of log volume, and log volume is a function of log verbosity and request throughput; the retention configuration then determines how long that volume keeps accruing storage cost. Each of these is controllable through policy, but only if the policy is written down and enforced. Log verbosity without a written policy tends toward higher volume over time: each debugging session adds log entries that are never removed after the session ends; each new service is initialized with the log level of the developer's local environment (DEBUG) rather than the production default (INFO); each new feature adds INFO-level log entries that include more context than is needed for production monitoring. At 100,000 requests per day the resulting volume is affordable; at 10 million requests per day the same logging density is expensive and may degrade the log aggregation system's query performance for dashboards and alerts that depend on fast queries over recent log data.

The cost trajectory is non-linear because it combines two variables that grow with product success — request throughput and service count — with a policy variable that tends toward more-verbose over time unless actively managed. A logging strategy ADR that establishes a per-service monthly cost ceiling, an ingestion cost alert (fire when a service's log group ingestion rate increases by more than 50% week-over-week), and a policy that requires a PR comment explaining the log volume impact of any new INFO-level emission in a hot path is the mechanism that keeps the cost trajectory controllable. Without it, the cost grows until someone notices the monthly invoice, at which point the remediation is a reactive audit of every log emission site to identify and remove the highest-volume offenders — the kind of work that the performance optimization decision record describes as expensive to retrofit and cheap to prevent with an upfront policy. The observability strategy decision record establishes the overall observability model — logs, metrics, traces, and how they relate — and the logging strategy ADR documents the specific structural decisions within the logging dimension of that model.
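The week-over-week alert rule described above reduces to a one-line comparison. This is a sketch of the rule only; the function name and the handling of a service with no previous ingestion are assumptions.

```python
def ingestion_alert(previous_week_gb: float, current_week_gb: float, threshold: float = 0.50) -> bool:
    """Fire when ingestion grew by more than `threshold` (50% by default) week over week."""
    if previous_week_gb <= 0:
        # Assumption: a service that logged nothing last week alerts on any new volume.
        return current_week_gb > 0
    growth = (current_week_gb - previous_week_gb) / previous_week_gb
    return growth > threshold
```

For example, ingestion_alert(10.0, 16.0) fires, while ingestion_alert(10.0, 15.0) does not, because the rule is "more than 50%" rather than "at least 50%".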

Logging strategy options and their structural properties

Unstructured string logging is the path of least initial friction: emit a human-readable message at each log site, possibly with interpolated values. It is correct for single-service applications with low throughput and a single developer who knows every log format by memory. It becomes incorrect as soon as a second service is added (no shared field schema), as soon as throughput makes grep-based investigation impractical, or as soon as a second developer needs to write a query against the logs. The migration cost from unstructured to structured logging is proportional to the number of log emission sites and the number of downstream consumers (dashboards, alerts, saved queries) that parse the string format — both of which grow with time. Teams that discover the cost of unstructured logging during a production incident and then migrate only new services to structured JSON leave a split-format log environment that persists for years, because the migration of existing services never rises to priority against the backlog of product work. The correct choice is structured logging from the first service, not as a premature optimization but as the minimum viable log format that supports the investigation queries that every production system will eventually require.

Structured JSON logging with a mandatory field contract is the standard approach for production services with more than one developer. Each log entry is a JSON object with a documented set of required fields: timestamp (ISO 8601, UTC), level (enum: DEBUG/INFO/WARN/ERROR/FATAL), service (string identifier for the emitting service), trace_id (128-bit hex, from the OpenTelemetry context), span_id (64-bit hex), message (string, human-readable summary). Additional contextual fields are permitted based on the log level and the event type, with a cardinality policy that prohibits unbounded-value fields (request bodies, query strings with parameter values) at INFO level and above. The mandatory field contract is enforced at two levels: the logger library wrapper (a thin module that wraps the underlying logger — Winston, pino, logrus, zap — to validate that required fields are present before emission) and a CI check that asserts the JSON structure of sampled log entries in the test suite. Log entries that fail validation are emitted with a log_format_error: true field that is alertable, so field contract violations are discovered in development and testing rather than during an incident investigation. The test strategy decision record should specify that log format tests assert field presence and type by querying the emitted JSON object, not by string-matching the message — a test that checks log.fields.trace_id is not null is resilient to message wording changes, while a test that checks log.message contains 'trace_id' is not.
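A logger wrapper that enforces the contract at emission time might look like the following sketch. It uses only the standard library rather than any of the logger libraries named above; the function names and the missing_fields key are assumptions added for illustration, while log_format_error is the flag described in the text.

```python
import json
import sys
from datetime import datetime, timezone

REQUIRED = ("timestamp", "level", "service", "trace_id", "span_id", "message")


def build_entry(level, service, message, trace_id=None, span_id=None, **fields) -> dict:
    """Assemble one log entry and flag it if any required field is absent or empty."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "span_id": span_id,
        "message": message,
        **fields,
    }
    missing = [key for key in REQUIRED if not entry.get(key)]
    if missing:
        entry["log_format_error"] = True    # alertable flag from the contract
        entry["missing_fields"] = missing   # hypothetical helper field
    return entry


def emit(entry: dict, stream=sys.stdout) -> None:
    """Write the entry as one JSON object per line."""
    stream.write(json.dumps(entry) + "\n")
```

The entry is still emitted when validation fails, so the event is not lost; the flag makes the violation visible to an alert instead of to the engineer investigating the next incident.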

Structured logging with schema registration extends the mandatory field contract with a schema registry that versions the log format and validates each emission against the registered schema at CI time. The schema is defined in Avro or Protobuf and registered per service per version in the schema registry; a pre-commit hook or CI step calls the schema registry's compatibility API to verify that the new log schema is backwards-compatible with the registered schema before the commit is accepted. Schema registration adds the ability to detect breaking log schema changes at CI time — a field rename, a type change from string to integer, a required field removed — before they reach production and break dashboards or alerts that query the renamed field. The operational overhead is the schema registry itself (Confluent Schema Registry or AWS Glue Schema Registry are the common choices) and the discipline of updating the registered schema when adding new log fields. Schema registration is correct for teams with multiple services where different teams own the log consumers (dashboards, alerting rules, data pipelines) and the log producers (the services), and where a breaking change to the log format in one service can silently break a cross-service dashboard maintained by a different team. For single-team products with fewer than ten services, the mandatory field contract enforced by the logger wrapper and CI assertions is typically sufficient without the full registry overhead.
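The kinds of breaking change listed above can be detected by comparing two versions of a field map. The sketch below is a simplified stand-in for a registry's compatibility check, not the API of Confluent Schema Registry or AWS Glue; the schema representation (field name mapped to a type name) is an assumption.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List changes that would break consumers querying fields from the old schema."""
    problems = []
    for name, old_type in old.items():
        if name not in new:
            # A rename looks like a removal from the consumer's point of view.
            problems.append(f"field removed or renamed: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed for {name}: {old_type} -> {new[name]}")
    return problems


registered = {"user_id": "string", "duration_ms": "integer"}
proposed = {"userId": "string", "duration_ms": "string", "error_code": "string"}
problems = breaking_changes(registered, proposed)
```

Adding error_code is compatible and is not reported; the rename of user_id and the type change on duration_ms are, which is the point at which a CI step would reject the commit.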

Log level policy variants determine what information is available at what cost in production. The four common models are: (1) INFO-and-above in production at steady state, with temporary DEBUG elevation for specific services during incident investigation: the standard model for services where INFO-level logs provide sufficient context for monitoring and most investigations, and DEBUG elevation is a controlled operational action rather than a default; (2) INFO-and-above in production with tail-based sampling for DEBUG on error traces: emit DEBUG-level log entries only for traces that result in an error response, using a tail-based sampling agent (the OpenTelemetry Collector supports this) that buffers log entries and releases the full trace (including DEBUG entries) when it observes an error span, which provides DEBUG context for failures without the cost of DEBUG at 100% throughput; (3) WARN-and-above in production for cost-sensitive services: correct for high-throughput services where even INFO-level logs at 100% throughput produce unacceptable ingestion cost, with a compensating strategy of emitting metrics at high resolution for signals that would otherwise appear in INFO logs; (4) dynamic log level adjustment via an operator API: each service exposes an endpoint that allows an operator to change the active log level at runtime without redeployment, enabling temporary DEBUG elevation for a specific service during incident investigation without a deploy cycle. The choice among these models is a function of throughput, cost tolerance, and the investigation latency target. A service with 100 requests per second can afford INFO at 100% and DEBUG on error traces without significant cost. A service with 100,000 requests per second at 2.4 kilobytes per INFO entry generates 240 megabytes per second of log data at full verbosity, which is 864 gigabytes per hour, or about $432 per hour (roughly $10,368 per day) at the $0.50 per gigabyte CloudWatch ingestion price used above, a cost that requires either WARN-and-above or aggressive sampling.
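The buffering idea behind model (2) can be sketched in memory. This is an illustration of the technique only, not how the OpenTelemetry Collector implements it: the class and method names are hypothetical, and a real agent would also need timeouts and memory bounds for traces that never finish.

```python
from collections import defaultdict


class TailBuffer:
    """Hold DEBUG entries per trace; release them only if the trace ends in an error."""

    def __init__(self):
        self._pending = defaultdict(list)

    def record(self, entry: dict) -> list:
        """Return the entries to emit now: INFO and above pass straight through."""
        if entry["level"] == "DEBUG":
            self._pending[entry["trace_id"]].append(entry)
            return []
        return [entry]

    def finish(self, trace_id: str, had_error: bool) -> list:
        """Called when the trace completes: release held DEBUG entries on error, else drop them."""
        held = self._pending.pop(trace_id, [])
        return held if had_error else []
```

Successful requests pay only for their INFO entries, while failed requests arrive in the aggregation system with their full DEBUG context attached.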

Correlation ID propagation models range from manual propagation to framework-automatic injection. Manual propagation requires each service to explicitly extract the traceparent header from the incoming request, create a span in the local tracing context, and include the traceId and spanId in every log entry emitted during the request. Manual propagation is correct but requires developer discipline at every new log emission site and at every asynchronous handoff (job queue, message bus, scheduled task). Framework-automatic injection uses the logger library's integration with the OpenTelemetry SDK context: when the OTel SDK instruments the HTTP framework (Express middleware, Go's net/http, Django middleware), it creates a span for each request and stores the trace context in a context propagation mechanism (Node.js AsyncLocalStorage, Go context.Context, Python contextvars) that the logger can read automatically when emitting a log entry. With framework injection, developers do not need to explicitly pass the traceId to the logger — the logger wrapper reads it from the current OTel context. The risk is incomplete context propagation: code that spawns goroutines, threads, or async callbacks outside the OTel-instrumented framework may not carry the OTel context, producing log entries with a missing traceId. The logging strategy ADR must specify the propagation mechanism and the monitoring criteria: a CI check that counts log entries with a missing trace_id field in the test suite and fails if the count exceeds zero for INFO-level entries is the enforcement mechanism for the propagation contract. The CI/CD pipeline decision record should include the log format and correlation coverage checks as first-class pipeline gates.
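Both halves of this paragraph, automatic injection and the missing-trace_id check, can be sketched with Python's contextvars and the standard logging module. The context variable here is a stand-in for the OpenTelemetry context (a real setup would read the current span from the OTel SDK), and the class and function names are assumptions.

```python
import contextvars
import logging

# Stand-in for the OTel context: holds the trace ID for the current request.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every record, without the caller passing it."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class ListHandler(logging.Handler):
    """Collect records in memory so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def count_missing_trace_ids(records, min_level=logging.INFO) -> int:
    """The CI metric: entries at INFO or above that carry no trace ID."""
    return sum(1 for r in records if r.levelno >= min_level and not getattr(r, "trace_id", None))


logger = logging.getLogger("export-worker")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
handler.addFilter(TraceContextFilter())
logger.addHandler(handler)


def handle_request(trace_id: str) -> None:
    token = trace_id_var.set(trace_id)
    try:
        logger.info("export job started")  # trace ID is injected by the filter
    finally:
        trace_id_var.reset(token)


handle_request("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("emitted outside any request context")  # simulates lost context
missing = count_missing_trace_ids(handler.records)
```

The second log call stands in for code running outside the instrumented framework. A pipeline gate would fail the build whenever missing is greater than zero.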

AI chat session types and what each one misses

The logging strategy decision follows a consistent pattern of AI chat sessions. The WhyChose extractor surfaces these sessions from chat export files, and the structural decisions they omit are consistent across the decision records reviewed. The logging format is typically chosen during the first "how do I add logging to my Node.js app" or "how do I set up structured logging" session, is not revisited as a decision until an incident investigation reveals the format's limitations, and is then patched reactively — add correlation IDs to new services but not old ones, add a CI check for the specific pattern that caused the incident but not for the underlying policy gap.

The initial logging setup session covers: how to add a logger library to a service (Winston, pino, logrus, zap, log4j), what logger configuration options are available, how to set the log level from an environment variable, how to make logs appear in the terminal during development. The session ends when logs appear and are readable. What the session does not cover: whether the log format should be structured JSON or human-readable strings, what fields should be mandatory in every log entry, how correlation IDs will propagate through service calls, what the log level policy will be for production versus development, or what the ingestion cost implications are for different verbosity settings at production scale. These questions are not visible in the session because the product has one service and the log volume is small enough that every log entry is readable in a terminal. The decision made in this session — use pino's default JSON format with whatever fields pino includes by default — is the logging format the product will carry until an incident investigation or a cost alert reveals that the default format is missing the fields the investigation needs.

The "add logging to the new service" session is the moment when the logging format diverges across services if no format contract is documented. The team has added a second service (a background job processor, an async notification service, a data pipeline worker) and needs to add logging. The AI session covers: how to set up logging in this service with the same library used in the first service, how to connect the logs to the same aggregation destination. The session ends when logs appear in CloudWatch or Datadog. What the session misses: the new service uses the same logger library but configures it independently, adding fields in a different order, using slightly different key names for the same concepts (the first service uses user_id in snake_case, the second uses userId in camelCase), and not propagating the traceId because the first service's traceId was implemented as a manual pass-through that the second service's developer was not aware of. The second service's logs are now in the same log aggregation system as the first service's logs, but cross-service queries that filter on user_id return only the first service's entries, and correlation by traceId is impossible because the second service does not emit the field. The gap is not discovered until an incident requires cross-service investigation. This is the decisions never written down pattern again: the logging format conventions in the first service existed as developer practice rather than documented contract, so the second service's developer had no source of truth for what the format should be.

The "investigate the production incident" session is where the limitations of the logging strategy become visible under operational pressure. The AI session covers: how to write a CloudWatch Insights query to find the relevant log entries, how to filter by user ID, how to narrow the time range. The session may produce a working query that finds the relevant entries — by regex against the message string, by timestamp correlation across services — after multiple iterations. What the session misses: the queries the engineer writes to investigate this incident are not documented. The query that finally surfaces the root cause — a regex pattern that extracts the queue delay duration from the WARN message string format — is closed with the incident. The next engineer who investigates a similar incident starts from scratch. The postmortem and incident decision log framing applies: the investigation queries, the fields that were missing and made the investigation slower, and the log format gaps that caused the latency are the outputs of the postmortem that should drive the logging strategy ADR update. Without the ADR, postmortem action items ("add correlation IDs to all services") are tracked as tickets that get deprioritized, because there is no documented specification of what the logging system is required to guarantee — only a list of individual fixes without a shared model to update.

The "optimize the AWS bill" session covers: why the CloudWatch Logs line item increased, how to identify which log group is generating the most ingestion, how to set a log retention policy that deletes old logs after N days to reduce storage cost. The session ends with a Terraform change that sets log group retention to 30 days and a cost projection showing the monthly saving. What the session misses: the retention policy change is the first written-down logging policy in the system — set via a Terraform resource attribute rather than an ADR — and it addresses storage cost but not ingestion cost. The DEBUG logging in the payment processing service is still generating 2.3 gigabytes per day of ingestion; the retention policy reduces storage cost for logs older than 30 days but does not reduce the $34.50 per month in ingestion cost or the query performance degradation that the debug volume is causing. The ingestion cost optimization requires a different intervention — removing the DEBUG statements from the hot path — but that intervention is not surfaced in the billing optimization session because the two problems (ingestion cost and storage cost) appear as different line items in Cost Explorer and are investigated in separate sessions. The infrastructure-as-code strategy context applies: the retention policy set in Terraform is a logging infrastructure decision, while the log level policy that should prevent DEBUG from reaching production is a logging strategy decision. The absence of a logging strategy ADR means there is no place to document that both the retention policy and the log level policy are components of the same logging cost model, and that a retention policy change without a log level policy review addresses the symptom (storage accumulation) without the cause (ingestion volume).

Five ADR sections for logging strategy selection

A logging strategy ADR that prevents the slow incident investigations and cost surprises described in this post covers five sections that teams consistently omit.

First, the log format specification and mandatory field contract. The ADR documents the JSON schema every log entry must conform to, the required fields and their types, the cardinality policy for optional fields, and the validation enforcement mechanism. "Log format: structured JSON. Every log entry must include the following fields: timestamp (string, ISO 8601 UTC, required), level (string, enum: DEBUG | INFO | WARN | ERROR | FATAL, required), service (string, the service name as defined in the service registry, required), trace_id (string, 128-bit lowercase hex, from the OpenTelemetry context, required for INFO and above — a missing trace_id at INFO level is a contract violation), span_id (string, 64-bit lowercase hex, from the OpenTelemetry context, required for INFO and above), message (string, human-readable description of the event, required, maximum 512 characters, must not contain interpolated request payload data). Optional contextual fields permitted at all levels: user_id (string, the authenticated user's UUID), resource_id (string, the primary resource ID involved in the event), error_code (string, machine-readable error identifier for ERROR-level entries), duration_ms (integer, elapsed time in milliseconds for timed operations). Prohibited fields at INFO level and above: any field whose value is a full request or response body, a SQL query string with inline parameter values, a stack trace longer than 10 frames, or any field whose cardinality is unbounded by design. These fields are permitted only at DEBUG level. Field naming convention: snake_case. The logger wrapper module validates the required fields before emission and emits a log_format_error: true entry if a required field is absent. CI: the test suite includes a log format smoke test for each service that emits a test request, captures the log entries, and asserts that each entry's JSON structure satisfies the schema." 
The cardinality policy is the section most often missing: teams define the required fields but do not specify what is prohibited at which levels, so the first engineer who adds a full request body to an INFO log entry has no written policy to check against. The secrets management decision record is directly related: sensitive values — API keys, authentication tokens, personally identifiable information, payment card data — must be explicitly prohibited from all log levels as a corollary to the log format contract, because a log format that permits arbitrary contextual fields at DEBUG level will eventually emit a credential that appears in the log aggregation system's storage and crosses the compliance boundary. The log format contract must state: "no credential, token, PII, or payment data in any log entry at any level; scrub these values at the logger wrapper layer before emission."
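A logger wrapper that enforces this contract could look roughly like the sketch below. It is an illustration under stated assumptions, not the wrapper the ADR describes: the names contractViolations, scrub, and emit, the deny-list of sensitive keys, and the contract_problems field are all hypothetical. It also marks the offending entry with log_format_error rather than emitting a separate one, and it does not validate that timestamp is ISO 8601.

```typescript
type LogEntry = Record<string, unknown>;

const LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"];
const TRACE_ID = /^[0-9a-f]{32}$/; // 128-bit lowercase hex
const SPAN_ID = /^[0-9a-f]{16}$/; // 64-bit lowercase hex
// Hypothetical deny-list; a real one would come from the secrets policy.
const SENSITIVE_KEYS = ["password", "token", "authorization", "api_key", "card_number"];

// Returns the names of required fields that are absent or malformed.
function contractViolations(entry: LogEntry): string[] {
  const problems: string[] = [];
  for (const key of ["timestamp", "service"]) {
    const value = entry[key];
    if (typeof value !== "string" || value === "") problems.push(key);
  }
  const level = entry["level"];
  const levelOk = typeof level === "string" && LEVELS.indexOf(level) !== -1;
  if (!levelOk) problems.push("level");
  const message = entry["message"];
  if (typeof message !== "string" || message === "" || message.length > 512) {
    problems.push("message");
  }
  // trace_id and span_id are required at INFO and above, optional at DEBUG.
  if (levelOk && level !== "DEBUG") {
    const traceId = entry["trace_id"];
    const spanId = entry["span_id"];
    if (typeof traceId !== "string" || !TRACE_ID.test(traceId)) problems.push("trace_id");
    if (typeof spanId !== "string" || !SPAN_ID.test(spanId)) problems.push("span_id");
  }
  return problems;
}

// Replaces values of deny-listed keys before anything is written.
function scrub(entry: LogEntry): LogEntry {
  const clean: LogEntry = {};
  for (const key of Object.keys(entry)) {
    clean[key] = SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1 ? "[REDACTED]" : entry[key];
  }
  return clean;
}

// Scrubs, validates, and writes one JSON line; flags contract violations.
function emit(entry: LogEntry, write: (line: string) => void): void {
  const clean = scrub(entry);
  const problems = contractViolations(clean);
  if (problems.length > 0) {
    clean["log_format_error"] = true;
    clean["contract_problems"] = problems; // hypothetical diagnostic field
  }
  write(JSON.stringify(clean));
}
```

Passing the writer in as a parameter keeps the sketch testable: the CI smoke test can capture lines in an array and assert that none carries log_format_error.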

Second, the log level production policy. The ADR documents the default log level for each environment, the definition of each level, the policy for merging DEBUG statements into production-deployed code, and the procedure for temporary verbosity elevation during incident investigation. "Default log level by environment: production: INFO; staging: INFO; development: DEBUG; load testing: WARN. Log level definitions: DEBUG — detailed diagnostic information useful during development and targeted incident investigation; not permitted in hot paths (code executed more than 1,000 times per second in production) without a documented exception; intended for temporary use and must be removed before merge or marked with a // KEEP: [reason] comment that is reviewed in the PR. INFO — events that are significant for production monitoring but not errors: request received, job queued, external API called, background task completed; each INFO entry should represent a distinct named event, not a progress narrative within a single operation. WARN — events that are abnormal but recoverable and do not require immediate investigation: rate limit approaching threshold, cache miss on a normally-cached resource, retry attempt on a transient failure; WARN entries should include the context needed to assess severity without requiring a follow-up INFO query. ERROR — events that indicate a failure that requires investigation; every ERROR entry must include error_code and error_message fields; errors that are expected and handled (e.g., user-provided invalid input, 404 on a resource the user searched for) are logged at WARN, not ERROR. FATAL — events that indicate the service is unable to continue operating and will exit; emitted once before the process exit; include the full error chain. 
Hot path DEBUG policy: a DEBUG log statement in a function that is called more than 1,000 times per second in production requires a PR comment stating: the function's estimated calls-per-second at production scale, the estimated log entry size, the resulting per-day ingestion volume at CloudWatch pricing, and the justification for why the DEBUG information is necessary at that call frequency. The PR comment is the mechanism that forces cost awareness at the time of writing the log statement, rather than at the time of the monthly invoice review. Temporary verbosity elevation: during incident investigation, the on-call engineer may elevate a specific service's log level to DEBUG using the POST /ops/log-level endpoint (returns to INFO after 30 minutes automatically). This elevation is logged as an operational event in the ops audit log. The elevated service must be returned to INFO before the incident postmortem closes." The hot path DEBUG policy is the mechanism that would have prevented the payment processing cost incident: the engineer who added the DEBUG statements to the hot loop would have been required to calculate the ingestion volume before merge, surfacing a $34.50/month addition as a PR discussion item rather than a post-deployment cost anomaly.
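The estimate the PR comment asks for is a short calculation. The sketch below shows one way to produce it, assuming decimal gigabytes (10^9 bytes) and the $0.50/GB ingestion price used in this post; the function names and the example figures (1,200 calls per second, 250-byte entries) are hypothetical.

```typescript
const BYTES_PER_GB = 1e9; // decimal GB, an assumption
const USD_PER_GB_INGESTED = 0.5; // ingestion price used in this post
const SECONDS_PER_DAY = 86400;

interface HotPathEstimate {
  gbPerDay: number;
  usdPerMonth: number;
}

// Estimates daily ingestion volume and monthly cost for one log statement.
function estimateHotPathLogging(
  callsPerSecond: number,
  bytesPerEntry: number,
  entriesPerCall: number = 1,
  daysPerMonth: number = 30
): HotPathEstimate {
  const bytesPerDay = callsPerSecond * entriesPerCall * bytesPerEntry * SECONDS_PER_DAY;
  const gbPerDay = bytesPerDay / BYTES_PER_GB;
  return { gbPerDay, usdPerMonth: gbPerDay * daysPerMonth * USD_PER_GB_INGESTED };
}

// The policy threshold: more than 1,000 calls per second counts as a hot path.
function requiresHotPathReview(callsPerSecond: number): boolean {
  return callsPerSecond > 1000;
}

// Hypothetical example: a loop at 1,200 calls/s emitting 250-byte entries.
const hotLoop = estimateHotPathLogging(1200, 250);
console.log(requiresHotPathReview(1200), hotLoop.gbPerDay.toFixed(2), hotLoop.usdPerMonth.toFixed(2));
```

Running the numbers at review time is the whole point of the policy: the author sees the monthly figure before merge rather than on the invoice.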

Third, the correlation ID propagation contract. The ADR documents the traceId and spanId source, the propagation mechanism for synchronous and asynchronous calls, the coverage requirement, and the CI enforcement. "Correlation IDs: all log entries at INFO level and above must include trace_id and span_id from the OpenTelemetry context. Source: the HTTP ingress middleware (the load balancer, the API gateway) generates a traceId for every incoming request using the W3C Trace Context traceparent header format; if the incoming request already carries a traceparent header (from an upstream service or a webhook sender), that traceId is used. The OTel SDK's HTTP server instrumentation creates a root span for the request and stores the trace context in the request-scoped context object (Node.js: AsyncLocalStorage; Go: context.Context; Python: contextvars). The logger wrapper reads the trace context from the request-scoped context at emission time — developers do not pass the traceId explicitly to log calls. Asynchronous propagation: when a request enqueues a background job, the job record includes a trace_id field populated with the originating request's traceId. The job processor extracts this field and creates a child span under the originating trace before processing begins, restoring the OTel context for all log entries emitted during job processing. This makes the background job's log entries queryable as part of the originating request's trace. When the job processor calls external services, it injects the traceparent header with the current span's traceId and spanId. Coverage requirement: 100% of INFO-level log entries from production services must include a valid trace_id field. Coverage is monitored as a metric: the logger wrapper increments a counter log_missing_trace_id for any INFO-or-above entry emitted without a trace context. 
This counter is exported as a CloudWatch metric and alerts when it exceeds zero for more than 5 minutes, indicating a new code path that is not correctly propagating the OTel context. CI enforcement: the integration test suite for each service includes a trace context assertion: after exercising the service's HTTP endpoints and job processing paths, the test asserts that every log entry captured during the test run includes a non-null trace_id. A service cannot pass its integration tests if any code path emits an INFO-level log entry without a traceId." The background job propagation requirement — storing the originating traceId in the job record — is the section that would have prevented the incident investigation gap where the document export job's log entries were disconnected from the originating web request. It requires a small schema addition to the job queue (a trace_id column) but produces a fundamental improvement in incident investigation capability for any asynchronous processing. The queue and messaging decision record should reference the logging strategy's asynchronous trace propagation requirement as a constraint on the job record schema design.
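The queue-boundary part of this contract can be sketched without any SDK. The code below is a simplified illustration, assuming plain functions stand in for the OpenTelemetry propagation API: parseTraceparent accepts only the four-field version-00 style header, and the ExportJob shape, its parent_span_id field, and the function names are hypothetical. Only the trace_id field on the job record comes from the ADR text above.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  flags: string; // 2 lowercase hex characters
}

// W3C Trace Context header: version-traceid-parentid-flags.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO = /^0+$/;

// Returns the trace context carried by the header, or null if it is invalid.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (header === undefined) return null;
  const m = TRACEPARENT.exec(header.trim());
  if (m === null) return null;
  if (m[1] === "ff") return null; // version ff is invalid per the spec
  if (ALL_ZERO.test(m[2]) || ALL_ZERO.test(m[3])) return null; // all-zero ids are invalid
  return { traceId: m[2], spanId: m[3], flags: m[4] };
}

// Hypothetical job record: trace_id is the column the ADR requires.
interface ExportJob {
  documentId: string;
  trace_id: string | null;
  parent_span_id: string | null; // assumed extra field, used to parent the worker's span
}

// Called by the web request handler when it enqueues the job.
function enqueueExportJob(documentId: string, ctx: TraceContext | null): ExportJob {
  return {
    documentId,
    trace_id: ctx ? ctx.traceId : null,
    parent_span_id: ctx ? ctx.spanId : null,
  };
}

// Called by the worker: fields to attach to every log entry for this job.
function logFieldsForJob(job: ExportJob): { trace_id?: string } {
  return job.trace_id ? { trace_id: job.trace_id } : {};
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
const queuedJob = enqueueExportJob("8831", incoming);
console.log(logFieldsForJob(queuedJob).trace_id); // prints "4bf92f3577b34da6a3ce929d0e0e4736"
```

With the trace_id stored on the job record, the worker's entries and the web server's entries for the same export share one identifier, so a single filter on trace_id returns both sides of the queue.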

Fourth, the sampling and ingestion cost policy. The ADR documents the per-level sampling rates, the ingestion cost ceiling per service per month, the alerting thresholds, and the review process for high-volume log emission sites. "Sampling rates by level: FATAL: 100% (always emit). ERROR: 100% (always emit). WARN: 100% at steady state; reduce to 10% if a single WARN code is emitting more than 10,000 entries per minute (a log rate spike that indicates an ongoing condition being surfaced repeatedly — reduce the rate but alert on the condition). INFO: 100% at steady state for services with fewer than 1 million requests per day; for services above this threshold, INFO sampling is configurable per-service in the service's configuration file, defaulting to 100% and reducible to a minimum of 10% with a documented justification in the logging strategy ADR appendix. DEBUG: not emitted in production at steady state; emitted at 100% during operator-initiated verbosity elevation, which is time-limited to 30 minutes. Cost ceiling: each service has a monthly log ingestion budget of $25 at CloudWatch pricing ($0.50/GB). A CloudWatch cost alarm fires when a service's log group's estimated monthly ingestion cost exceeds $20 (80% of budget), triggering a log verbosity review for that service. Alarm resolution requires either a reduction in log volume (remove high-cardinality fields, reduce sample rate, remove DEBUG statements from hot paths) or a documented exception approved by the platform team that updates the service's budget ceiling in the logging strategy ADR appendix. New INFO-level log emission sites in hot paths (functions called more than 1,000 times per second): require a PR comment that states the estimated calls-per-second, the estimated log entry size, and the monthly ingestion cost contribution. The PR comment is reviewed as part of the logging cost governance process; the reviewer adds a cost estimate tag to the PR for tracking. 
High-cardinality field review: any PR that adds a new field to a log entry at INFO level where the field value is not from a bounded enum (a UUID, a user-provided string, a resource name) requires a reviewer comment explaining why the field is necessary at INFO level and not DEBUG level, and estimating the cardinality impact on query performance in the team's log aggregation system." The cost ceiling and the 80%-of-budget alert are the mechanisms that surface the ingestion cost problem before the monthly invoice, rather than after. The alert fires at $20/month per service, a number that is still small enough to act on before it compounds, rather than after three months of accumulation. The build-versus-buy framing applies to the sampling decision: a tail-based sampling agent (OpenTelemetry Collector with a tail sampling processor) is the buy option that automates DEBUG emission for error traces without requiring each service to implement its own sampling logic; the alternative (per-service DEBUG sampling configuration managed by each team) is the build option that produces inconsistent sampling policies across services.
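The per-level sampling rules quoted above reduce to a small decision function. The following is a sketch under stated assumptions: the SamplingInputs shape and the function names are hypothetical, the WARN counter is assumed to be tracked per WARN code by the caller, and the random source is injectable so the behaviour can be tested deterministically.

```typescript
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

interface SamplingInputs {
  requestsPerDay: number; // service traffic, used for the INFO threshold
  configuredInfoRate: number; // per-service setting, clamped to [0.1, 1]
  warnEntriesLastMinute: number; // entries for this WARN code in the last minute
  debugElevated: boolean; // operator-initiated verbosity elevation is active
}

// Returns the fraction of entries to emit for a level, per the policy above.
function sampleRateFor(level: LogLevel, s: SamplingInputs): number {
  if (level === "FATAL" || level === "ERROR") return 1;
  if (level === "WARN") return s.warnEntriesLastMinute > 10000 ? 0.1 : 1;
  if (level === "INFO") {
    if (s.requestsPerDay < 1000000) return 1; // small services always emit INFO
    return Math.min(1, Math.max(0.1, s.configuredInfoRate));
  }
  return s.debugElevated ? 1 : 0; // DEBUG
}

// random() is expected to return a value in [0, 1), as Math.random does.
function shouldEmit(
  level: LogLevel,
  s: SamplingInputs,
  random: () => number = Math.random
): boolean {
  return random() < sampleRateFor(level, s);
}
```

Clamping the configured INFO rate to 0.1 inside the function means a misconfigured service cannot silently drop below the ADR's floor; the sketch does not cover the separate requirement to alert on the underlying WARN condition.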

Fifth, the retention tier and audit log policy. The ADR documents the retention period per log level, the retention tiers for compliance-sensitive events, the access control model for log data, and the log export policy for compliance audits. "Retention tiers: FATAL and ERROR: 90 days live in CloudWatch Logs, then archived to S3 Glacier with a 7-year retention for compliance (financial and healthcare customers have contractual audit obligations that require 7-year retention of error events). WARN and INFO: 30 days in CloudWatch Logs, then deleted (no archival; INFO logs at 30+ days have low investigative value relative to their storage cost). DEBUG: 7 days in CloudWatch Logs, then deleted (DEBUG is only emitted during operator-requested verbosity elevation; 7 days is sufficient to close the incident that triggered the elevation). Access control: CloudWatch log group read access is granted to the engineering team on-call role and the security team; full CloudWatch log group list and query access is granted to the platform team. Log data is not accessible to non-engineering roles (customer support, sales, operations) because log entries may contain internal system state including user IDs, resource IDs, and error context that is not appropriate for non-technical access. Direct S3 Glacier access for archived error logs is restricted to the security team and the compliance team. Audit log for security events: security-relevant events (authentication success/failure, authorization denial, permission grant/revocation, admin action) are logged to a separate log group with a 7-year retention in all environments. Security audit log entries are immutable (CloudWatch log group with resource policy preventing log stream deletion), and access is restricted to the security team and the compliance team. 
Compliance audit export: when a compliance audit requires log data older than 30 days, the request goes through the platform team, who retrieves the archived data from S3 Glacier. The retrieval SLA is 4 hours for Glacier Standard restore. Audit requests are logged as operational events. Log format for the compliance export is JSON with the same mandatory field schema as production logs — no reprocessing or reformatting before delivery to the compliance team." The separation of security audit events into a dedicated log group with longer retention and immutability is the section that addresses compliance obligations without requiring the full production log volume to be retained for 7 years. The security ADR and threat model should reference the logging strategy's audit log policy as a complementary control: the authorization model ADR describes what authorization denials generate an audit event, and the logging strategy ADR describes where that audit event is stored, for how long, and who can access it. The two ADRs together constitute the audit trail architecture that compliance auditors will ask to review.
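Because CloudWatch Logs sets retention per log group rather than per entry, per-level retention implies routing levels to separate log groups. The sketch below shows one way to express the tiers as data; the log group naming scheme (/app/service/tier), the tier names, and the function logGroupFor are hypothetical, while the day counts come from the ADR text above.

```typescript
interface RetentionTier {
  tier: string; // hypothetical tier name, used in the log group path
  retentionInDays: number; // live retention in CloudWatch Logs
  archiveToGlacier: boolean; // whether entries are archived after live retention
}

const ERRORS: RetentionTier = { tier: "errors", retentionInDays: 90, archiveToGlacier: true };
const EVENTS: RetentionTier = { tier: "events", retentionInDays: 30, archiveToGlacier: false };
const DEBUG_TIER: RetentionTier = { tier: "debug", retentionInDays: 7, archiveToGlacier: false };

// Level-to-tier routing, following the retention tiers in the ADR excerpt.
const TIER_BY_LEVEL: { [level: string]: RetentionTier } = {
  FATAL: ERRORS,
  ERROR: ERRORS,
  WARN: EVENTS,
  INFO: EVENTS,
  DEBUG: DEBUG_TIER,
};

// Security audit events go to their own group: 7 years is roughly 2557 days.
const SECURITY_AUDIT: RetentionTier = {
  tier: "security-audit",
  retentionInDays: 2557,
  archiveToGlacier: false,
};

// Resolves the log group and retention for one entry.
function logGroupFor(
  service: string,
  level: string,
  isSecurityEvent: boolean = false
): { name: string; retentionInDays: number } {
  const tier = isSecurityEvent ? SECURITY_AUDIT : TIER_BY_LEVEL[level];
  if (tier === undefined) throw new Error("unknown log level: " + level);
  return { name: "/app/" + service + "/" + tier.tier, retentionInDays: tier.retentionInDays };
}

console.log(logGroupFor("export-worker", "ERROR").name); // prints "/app/export-worker/errors"
```

Keeping the mapping in one table gives the Terraform retention settings and the logger's routing a single source to agree with.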

None of these five sections are visible in the logger library configuration, the log aggregation tool setup, or the individual log emission statements in the services. They are the logging reasoning that every engineer who adds a new service, debugs a production incident, responds to a compliance audit, or optimizes infrastructure cost depends on to understand what the logging system is designed to guarantee and what trade-offs were made in its design. The four-hour incident investigation and the three-day cost root-cause analysis are not caused by poor engineering in the individual sessions. They are caused by a logging strategy that was chosen without being documented — without specifying the log format contract, the correlation ID propagation requirement, the log level production policy, or the ingestion cost model — so that each engineer who later added a service, added a log statement, or investigated an incident had no source of truth for what the logging system was designed to guarantee. The WhyChose extractor surfaces the initial logging setup session, the new service logging configuration session, and the cost optimization session from AI chat history; the logging strategy ADR is what takes the reasoning from those sessions and makes it legible to the team that inherits the logging infrastructure and must add to it, investigate with it, or explain it to an auditor.

FAQs

What is the difference between unstructured, structured, and schema-registered log formats?

Unstructured logs are plain text strings where fields (user ID, action, timestamp) are embedded in a human-readable sentence. They are readable by a developer watching a terminal but not queryable by field value in a log aggregation system — finding all log entries where user_id equals 4291 requires a regex pattern match against the full string, which is slower than a field index lookup and fails when the string format changes between services.

Structured JSON logs emit each field as a key-value pair. They are queryable by field value natively: "filter user_id = 4291 AND level = ERROR" returns matching entries in seconds using the aggregation system's field index, without regex. That difference, seconds versus minutes per query, compounds over an investigation into the gap between a 15-minute root cause identification and a 4-hour one.

Schema-registered logs register the JSON field schema (required fields, types, compatibility rules) in a schema registry (Confluent Schema Registry, AWS Glue Schema Registry). Breaking changes to the log format (a field rename, a type change, a required field removed) are detected at CI time by the registry's compatibility API before they reach production and break dashboards or alerts that query the renamed field. This format is the right choice for teams where different teams own log producers (services) and log consumers (dashboards, alerts, data pipelines), and where a format change in one service can silently break a cross-service dashboard maintained by another team.
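The gap between the first two formats shows up even in a toy example. The sketch below is an in-memory illustration, not a log aggregation query: the first unstructured line is the sentence format described earlier in this post, the second is a hypothetical variant from another service, and the structured entries and function names are invented for the comparison.

```typescript
// Two services describing the same user in different sentence formats.
const unstructuredLines: string[] = [
  "User 4291 requested document export at 14:32:07.",
  "document export request received for user_id=4291", // hypothetical second format
  "User 5120 requested document export at 14:32:09.",
];

interface StructuredEntry {
  level: string;
  service: string;
  user_id: string;
  message: string;
}

const structuredEntries: StructuredEntry[] = [
  { level: "INFO", service: "web", user_id: "4291", message: "document export requested" },
  { level: "INFO", service: "dispatcher", user_id: "4291", message: "export job queued" },
  { level: "INFO", service: "web", user_id: "5120", message: "document export requested" },
];

// Regex search: tied to one service's wording (assumes userId is purely numeric).
function findByUserRegex(lines: string[], userId: string): string[] {
  const pattern = new RegExp("User " + userId + "\\b");
  return lines.filter((line) => pattern.test(line));
}

// Field search: independent of how each service words its message.
function findByUserField(entries: StructuredEntry[], userId: string): StructuredEntry[] {
  return entries.filter((entry) => entry.user_id === userId);
}

console.log(findByUserRegex(unstructuredLines, "4291").length); // prints 1: the second format is missed
console.log(findByUserField(structuredEntries, "4291").length); // prints 2
```

The regex version silently undercounts as soon as a second service words its message differently, which is the failure mode described in the first paragraph of this answer.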

Why are correlation IDs (traceId and spanId) required in every log entry, not just in traces?

Without correlation IDs, identifying which log entries from the API gateway, the business logic service, the queue worker, and the external API call all belong to the same user request requires correlating by timestamp (inexact when multiple requests are in flight simultaneously) or by user ID (which returns all entries for that user, not the entries from the specific failing request). With a traceId — a 128-bit identifier assigned at the entry point and propagated to every downstream service via HTTP headers — the investigation query is "filter trace_id = abc123" across all services, returning the complete chronological record of everything that happened during that request in a single result set.

Correlation IDs belong in logs as well as traces because traces are frequently sampled — high-traffic systems cannot store a full trace for every request — while ERROR-level log entries are almost never sampled. An ERROR log entry with a traceId links the error to the full trace if the trace was sampled, and still allows field-level querying of the error context (user_id, resource_id, error_code) if it was not. The background job propagation requirement — storing the originating request's traceId in the job record so the job processor can restore the OTel context during processing — is the extension that makes asynchronous processing investigable as part of the originating request's trace, preventing the investigation gap where the log trail stops at the queue boundary.

What should a logging strategy ADR document that teams typically skip?

Teams typically document their logger library choice and their log aggregation tool. The five sections that prevent slow incident investigations and cost surprises: first, the log format and mandatory field contract — the specific JSON schema every entry must conform to, which fields are required (including trace_id and span_id at INFO and above), what is prohibited at which levels (no request bodies at INFO), and how the schema is validated in CI; second, the log level production policy — the default level per environment, the definition of what qualifies as ERROR versus WARN versus INFO, the review requirement before merging DEBUG statements into hot paths, and the temporary verbosity elevation procedure for incident investigation; third, the correlation ID propagation contract — that trace_id and span_id are required at INFO and above, how they propagate through async job processing, and the CI assertion that enforces 100% coverage; fourth, the sampling and ingestion cost policy — per-level sampling rates, the per-service monthly cost ceiling, the alert threshold that fires before the invoice, and the PR review requirement for new INFO-level emission sites in hot paths; fifth, the retention tier and audit log policy — different retention periods per log level (ERROR 90 days, INFO 30 days, DEBUG 7 days), a separate immutable log group for security audit events with 7-year retention, and the access control model that restricts log data to engineering and security roles.

None of these sections are visible in the logger configuration or the individual log statements. They are the logging reasoning that every engineer who adds a service, debugs an incident, responds to a compliance audit, or optimizes infrastructure cost depends on to understand what the logging system guarantees and where its designed limits are.