Why does log aggregation tool selection need an architecture decision record if it can be changed later?

Log aggregation appears changeable because the aggregation infrastructure is separate from the application code: you can swap Loki for Elasticsearch without touching a single service. In practice, the tool choice is deeply embedded in three places that are expensive to change. First, the structured logging instrumentation in every service: field names, log formats, and context injection patterns are written to match the aggregation tool's query model, and changing aggregation tools requires auditing and often rewriting these across every service. Second, the alert and dashboard definitions written against the aggregation tool's query language: LogQL in Grafana Loki, KQL in Kibana on Elasticsearch, and Datadog's log search syntax are not interchangeable; a tool migration requires recreating every alert and dashboard. Third, the institutional knowledge of which queries answer which incident questions: engineers who have internalized 'to find all errors for customer X, run this KQL query' lose that knowledge on migration and must rebuild it. None of this prevents migration, but documenting the original decision makes future migrations deliberate rather than reactive, and it names the specific costs (query rewrites, alert migrations, team retraining) that the decision involves.

What is the difference between field-indexed and label-indexed log aggregation, and why does it affect incident response?

Field-indexed aggregation (Elasticsearch, Datadog Logs, Splunk) parses log lines at ingest and stores them in an inverted index where every field in every log record is queryable. A query like 'find all ERROR logs from the checkout service for customer X in the last hour' can filter on service, level, customerId, and time range in a single pass without scanning raw log content. Label-indexed aggregation (Grafana Loki, AWS CloudWatch Logs with metric filters) stores raw log content and indexes only a fixed set of labels defined at ingest time — typically deployment-level metadata like service name, namespace, and environment. A query for customer X requires either that customerId was defined as a label at ingest (fixing the cardinality to the number of distinct customers, which is often too high for a label), or a full-text content scan across all log lines in the selected label set, which is effective for small volumes and slow for large ones. The practical incident response consequence: if the aggregation tool is label-indexed and customerId was not included as a label at deployment time, 'find all errors for customer X in the last hour' is a multi-minute query that may time out on high-volume services — and this limitation is discovered during an active incident, not during aggregation tool evaluation.

What should a logging infrastructure architecture decision record include?

A logging ADR needs five sections that most infrastructure setups omit. First, the aggregation mechanism decision: the log routing pattern (application writes to stdout, a collector agent ships to the aggregator), the aggregation backend chosen with alternatives and rejection reasons, and the query interface model (field-indexed versus label-indexed). Second, the structured logging policy: the required fields every log record must include (service name, request ID, log level, timestamp, trace ID), the optional fields for specific contexts (user ID, tenant ID, duration), and the field naming convention that applies across all services. Third, the log level contract: a precise definition of each level — what DEBUG, INFO, WARN, ERROR, and FATAL mean for this system — plus the on-call response that each level triggers, because log levels without an explicit contract become whatever each engineer needs them to be to make their service visible in production filters. Fourth, the retention policy with cost projection: the number of days at each storage tier (hot, warm, cold), the approximate cost at the current daily log volume, and the condition that triggers a retention policy review. Fifth, the revisitation conditions: the specific circumstances that would make the current aggregation tool the wrong choice — typically a cost threshold, a query latency threshold during incidents, or a volume threshold that changes the cost model.

How do logging infrastructure decisions appear in AI chat history?

Logging decisions surface in four session types. Aggregation tool selection sessions contain the mechanism decision: 'should we use ELK or Loki for our Kubernetes cluster?', 'Datadog Logs vs self-hosted ELK for a 50-person startup', 'how does Grafana Loki compare to Elasticsearch for log queries?' These sessions hold the alternatives evaluated and the criteria that drove the selection. Structured logging adoption sessions contain the field and format decisions: 'how do I add structured JSON logging to Express?', 'what log fields should microservices include?', 'how do I propagate request IDs across service boundaries?' Incident debugging sessions contain the revealed query constraints: 'how do I search Elasticsearch logs by user ID?', 'why is my Kibana query timing out?', 'how do I query Loki for a specific customer ID?' — these are the sessions where the team discovered what the aggregation tool cannot answer efficiently. Cost investigation sessions contain the retention policy decision: 'why is our Datadog Logs bill increasing?', 'how do I reduce Elasticsearch storage costs?', 'how to implement Loki log retention and deletion policies?' The cost session is typically where the retention trade-off is first made explicit — often under budget pressure rather than in the calm of initial planning.

2026-06-18 · ~17 min read

The logging infrastructure decision record: why the log aggregation tool you chose shapes what questions you can answer when something breaks at 3am

Log aggregation tool selection is treated as infrastructure configuration, not architecture. An engineer sets up Fluentd with Elasticsearch or installs the Datadog agent during the initial cluster setup and moves on. Two years later, the choice determines which incident response questions the on-call engineer can answer without writing a new query, whether the team can afford to retain logs for 90 days, and whether the structured fields the application teams added to their log output can actually be filtered efficiently at query time. None of this was visible at selection time. None of it is written down.

An on-call engineer gets paged at 3am. A checkout service is returning 500s for a subset of users. He opens the log aggregation dashboard and searches for errors from the checkout service. He finds 400 error log lines in the last hour. He needs to narrow them to a specific customer who filed a support ticket. He types a filter on customerId. The aggregation tool returns an error: the field is not indexed. He can search log content for the customer ID as a free-text string, but at 50,000 log lines per hour across the cluster, the full-text scan takes four minutes and intermittently times out. He cannot tell which errors belong to the affected customer and which are unrelated. The incident runs for two hours while he reasons about the error pattern without being able to isolate it to a single customer's request chain.

The next morning, the team learns that the log aggregation tool was set up 18 months ago using Grafana Loki. Loki is a label-indexed log aggregator: it stores raw log content and indexes a fixed set of labels defined at deployment time. The labels were set to service, namespace, and environment — standard deployment metadata that Fluentd injects automatically. Nobody documented that customerId could not be added as a label at this volume (label cardinality limits are an explicit Loki design constraint), and nobody wrote down that this made customer-scoped queries a full-text scan rather than an indexed lookup. The aggregation tool selection, and the label schema decision, were made in a single afternoon and never revisited.

Like most infrastructure decisions that accumulate detail by detail, the logging configuration is visible as a fact — the cluster has Loki, the Fluentd config is in a Helm chart — but invisible as a decision. The fact answers "what is true now?" The decision record answers "what was evaluated, what was chosen, what the label schema implies for query capability, what is expensive to add in production, and when the decision should be revisited." Without the record, the 3am limitation is discovered as a surprise constraint rather than a documented trade-off.

What "we use a logging stack" actually means across five patterns

The first decision inside "we have centralized logging" is the aggregation mechanism — the architectural pattern by which log output is collected from services, transported to storage, indexed, and made queryable. The mechanism choice is often made by whoever set up the initial cluster, and it carries specific query model commitments that determine what the on-call engineer can ask at incident time.

Print-to-stdout with external aggregation is the standard Kubernetes-native pattern. Services write log output to stdout and stderr; a node-level collector agent (Fluent Bit, Fluentd, the Datadog agent, the OpenTelemetry Collector) reads from the container log files, optionally parses and enriches the lines, and ships them to the aggregation backend. The application has no knowledge of the aggregation destination — it writes to stdout, the infrastructure handles the rest. This pattern is nearly universal in Kubernetes deployments and it is the right default. What varies — and what most teams leave undocumented — is what happens inside the collector agent (is structured parsing happening there, or is the aggregation backend receiving raw strings?) and which aggregation backend the collector ships to, which determines the query model for everything downstream.

Field-indexed aggregation (Elasticsearch via the ELK or EFK stack, Datadog Logs, Splunk, OpenSearch) parses each log record at ingest and stores it in an inverted index where every field — service, level, customerId, requestId, duration — is a first-class queryable dimension. A filter like service:checkout AND level:ERROR AND customerId:acme-corp is an indexed lookup that returns in milliseconds regardless of total log volume. The trade-off is operational complexity (Elasticsearch cluster tuning, shard management, index lifecycle management) and storage cost — field-indexed storage is larger than raw log storage because the index overhead is significant. Self-hosted Elasticsearch requires engineering effort to keep healthy at production scale; this operational cost is real and is frequently underestimated at initial adoption.
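To make the "indexed lookup" claim concrete, here is a minimal sketch of an inverted index in Python. It is a toy model, not how Elasticsearch is implemented: the `FieldIndex` class and its methods are hypothetical names chosen for illustration, and exact-match string fields are assumed. The point it shows is that a multi-field filter becomes a set intersection over posting lists, so no raw log content is scanned at query time.

```python
from collections import defaultdict


class FieldIndex:
    """Toy inverted index: maps (field, value) to the ids of matching records."""

    def __init__(self):
        self.records = []
        self.postings = defaultdict(set)

    def ingest(self, record: dict) -> None:
        record_id = len(self.records)
        self.records.append(record)
        # Every field of every record becomes a queryable dimension at ingest.
        for field, value in record.items():
            self.postings[(field, value)].add(record_id)

    def query(self, **predicates) -> list:
        # Intersect the posting lists; raw log content is never scanned.
        id_sets = [self.postings.get((f, v), set()) for f, v in predicates.items()]
        if not id_sets:
            return list(self.records)
        matching = set.intersection(*id_sets)
        return [self.records[i] for i in sorted(matching)]


index = FieldIndex()
index.ingest({"service": "checkout", "level": "ERROR", "customerId": "acme-corp"})
index.ingest({"service": "checkout", "level": "INFO", "customerId": "acme-corp"})
index.ingest({"service": "payment", "level": "ERROR", "customerId": "globex"})

hits = index.query(service="checkout", level="ERROR", customerId="acme-corp")
```

The storage trade-off described above is visible even in the toy: `postings` holds one entry per field per record, which is the index overhead that raw log storage does not pay.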

Label-indexed aggregation (Grafana Loki, usually fed by Promtail or Fluentd, and AWS CloudWatch Logs without Logs Insights) stores raw log content compressed in object storage and indexes only a small, fixed set of labels defined at deployment time. Ingestion cost is very low: Loki's headline claim is that it costs an order of magnitude less per GB than Elasticsearch because it does not build an inverted index. The query model is a label-first filter followed by a content search: {service="checkout", namespace="production"} |= "ERROR" selects the Loki streams for the checkout service in production and then scans their content for the string "ERROR". This works efficiently for queries that filter on the indexed label dimensions. It does not work efficiently for queries that filter on fields embedded in the log content that were not included in the label set at deployment time. Adding a new label in production requires a Fluentd/Promtail configuration change plus a restart of all log collectors, and the new label applies only to logs ingested after the change; making it available on historical data means re-ingesting that data. Both are operational events. The label cardinality design constraint is the key limitation: Loki's architecture assumes labels are low-cardinality deployment-level dimensions (service, namespace, environment) rather than high-cardinality application dimensions (customerId, tenantId, userId). Teams that discover this constraint after deployment, when they try to add a high-cardinality field as a label, are encountering a query model decision they did not know they made.
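The label-first query model can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Loki's implementation: `LabelStore`, `push`, and `query` are hypothetical names, and the log lines are invented samples. It shows the two behaviours the paragraph describes: one stream exists per unique label combination, and any filter on a non-label field is a substring scan over every line in the selected streams.

```python
from collections import defaultdict


class LabelStore:
    """Toy label-indexed store: only the label set is indexed, content is raw."""

    def __init__(self):
        # One stream per unique combination of label values.
        self.streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str):
        wanted = set(selector.items())
        matches, scanned = [], 0
        for stream_labels, lines in self.streams.items():
            if not wanted <= stream_labels:
                continue  # stream skipped via the label index, no content read
            for line in lines:
                scanned += 1  # every line in a selected stream is examined
                if contains in line:
                    matches.append(line)
        return matches, scanned


store = LabelStore()
prod_checkout = {"service": "checkout", "namespace": "production"}
prod_payment = {"service": "payment", "namespace": "production"}
store.push(prod_checkout, 'level=ERROR customerId=acme-corp msg="card declined"')
store.push(prod_checkout, 'level=INFO customerId=globex msg="checkout complete"')
store.push(prod_payment, 'level=ERROR customerId=acme-corp msg="gateway timeout"')

matches, scanned = store.query({"service": "checkout"}, "customerId=acme-corp")
```

If customerId were promoted to a label in this model, `streams` would gain one entry per customer per service, which is the cardinality growth that the label design constraint is meant to prevent.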

SaaS aggregation (Datadog Logs, Logz.io, New Relic Logs, Papertrail) eliminates the operational overhead of self-hosting an aggregation backend. The Datadog agent or a compatible shipper sends logs directly to the SaaS backend; query, alerting, and dashboard infrastructure is provided as a service. The operational trade-off is well understood: low setup cost, no cluster management, higher per-GB cost than self-hosted options. The less visible trade-off is the cost growth model: log volume grows roughly proportionally to application traffic, but most engineering teams scale their application traffic significantly faster than they revise their logging cost budget. A Datadog Logs deployment that costs $300/month at launch may cost $3,000/month or more three years later on the same pricing plan at 10× traffic, not because Datadog's pricing changed, but because log volume grew with traffic and the retention period was never revisited. The cost model at initial adoption determines whether the tool remains appropriate at scale.

No centralized aggregation — services log to stdout, stdout is not collected beyond what the node's container runtime retains (typically the last few hundred megabytes), and incident response uses kubectl logs directly against running pods — is a legitimate pattern for small teams running a small number of services. It does not scale past a handful of services because kubectl logs cannot query across services, cannot search historical logs from terminated pods, and cannot filter by content. The decision to not adopt centralized aggregation deserves the same documentation as the decision to adopt a specific tool, because the constraints of the no-aggregation pattern will eventually drive the adoption decision and the team that hasn't written down why they chose it cannot evaluate whether the reasons still apply.

The incident-response constraint: the 3am query model

The incident-response constraint is the logging infrastructure decision's most direct consequence: the aggregation tool's query model determines which incident questions the on-call engineer can answer without a multi-minute scan, and which questions require workarounds, escalation, or waiting until morning when the right person is available.

In a field-indexed aggregation tool, an incident question like "find all requests from tenant X that failed with a 500 in the last two hours" is a query with three indexed predicates: tenantId:tenant-x, statusCode:500, and a time range filter. The query planner uses the inverted index to retrieve only the matching records — typically returning in seconds regardless of whether the matching set is 30 records or 30,000. The on-call engineer can refine iteratively: narrow by service, add a trace ID filter, join with application-specific fields that were logged alongside the error. The query capability tracks the quality of the structured fields the application emits.

In a label-indexed aggregation tool where tenantId was not included in the label set at deployment time, the same question requires a full-text content scan over every log line in the selected streams for the time range. The scan is bounded by the label dimensions that are indexed (if service is a label, the scan can be restricted to the checkout service's streams rather than all service streams), but within that constraint, every log line must be examined for the presence of the tenant ID string. At 50,000 lines per hour for a busy service, a two-hour scan is 100,000 lines of content evaluation. The query is accurate but slow, and at high enough volume, it times out.

The practical consequence is that two teams — one running Elasticsearch, one running Loki — can have identical application logging (same structured fields, same log levels, same request ID propagation) and dramatically different incident response capability, because the aggregation tool's query model determines how efficiently those fields can be used. Like performance decisions where the baseline measurement determines the optimization headroom, the aggregation tool's query model determines the upper bound on incident response efficiency. Writing it down as a decision means the team knows which questions the tool can answer efficiently and which require a workaround — before an incident reveals the gap.

The secondary incident constraint is log availability for terminated pods. Kubernetes retains logs from terminated pods only as long as the node's container runtime log rotation policy allows — typically until the node's log directory fills to a threshold, at which point older logs are evicted. A service that crashed and was restarted three times before the on-call engineer opens their laptop may have had its initial crash logs rotated away. Centralized aggregation that captures logs continuously solves this problem; the decision record that specifies the collector's buffering policy (what happens when the aggregation backend is unavailable — does Fluent Bit buffer to disk? for how long? how large?) determines whether the logs from a crashed pod that occurred during a brief network partition are recoverable.

The retention cost constraint: what log volume actually costs

Log retention is a decision most teams make implicitly when they accept the aggregation tool's default configuration, then revisit reactively when a cost alert fires or a compliance requirement asks for 90-day log history. The decision has three components that interact: the daily log volume, the retention period, and the cost per GB at each storage tier — and all three change over time in ways that the initial decision did not account for.

Daily log volume grows with application traffic, but not necessarily linearly: per-service verbosity decisions (log every incoming request at INFO, or only log requests that exceed a latency threshold?) multiply the log volume produced per request, and as those decisions accumulate, volume can grow faster than traffic. A service that logs every database query at DEBUG in development and every incoming request at INFO in production can produce 10× more log volume than a service that logs only errors and significant events at equivalent traffic levels. The verbosity level is rarely documented as a deliberate decision; it is typically inherited from the framework's default logging configuration and adjusted by individual engineers who want visibility into their service's behavior.

The cost model for self-hosted aggregation (Elasticsearch, Loki) is primarily storage cost plus compute for the aggregation cluster itself. Elasticsearch's default index lifecycle management policy keeps indices in a "hot" tier (NVMe SSDs, queryable at full speed) for a configurable retention window, then moves them to "warm" and "cold" tiers (slower, cheaper storage). For a cluster receiving 100 GB/day of raw logs, a 30-day hot retention at $0.13/GB-month on an NVMe storage class means approximately $390/month in hot storage alone — before indexing overhead (Elasticsearch indices at default settings are typically 2–3× the raw log size), before compute for the Elasticsearch nodes, and before the warm and cold tier costs for logs retained beyond the hot window. Teams that set up Elasticsearch with a 90-day retention policy and did not project the storage cost at their expected log volume three years out are commonly surprised by the cost.

Loki's architectural separation of log ingestion from log storage (ingest via Fluentd or Promtail, store in object storage like S3 at $0.023/GB-month versus $0.13/GB-month for NVMe) produces a dramatically lower storage cost for the same retention window — but a higher query cost for full-text scans that touch large volumes of stored data. The cost model trade-off between Loki and Elasticsearch is not "Loki is cheaper" but "Loki has a lower storage cost at the cost of slower full-text queries at scale." As with service mesh selection, the cost model that is correct for the team's scale at adoption may be wrong three years later — and the revisitation condition in the decision record is what makes that reassessment happen deliberately rather than in response to a budget crisis.
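The storage arithmetic in the two paragraphs above can be reproduced with a short Python sketch. It assumes a steady-state model (volume stored equals daily volume times retention days) and ignores compression, compute, and request charges; the function name and the `index_overhead` parameter are illustrative, and the prices are the ones quoted in the text.

```python
def monthly_storage_cost(gb_per_day: float, retention_days: int,
                         price_per_gb_month: float,
                         index_overhead: float = 1.0) -> float:
    """Steady-state monthly storage cost in dollars.

    index_overhead multiplies the raw volume to account for index size
    (1.0 means raw logs only).
    """
    stored_gb = gb_per_day * retention_days * index_overhead
    return stored_gb * price_per_gb_month


# Figures from the text: 100 GB/day, 30-day retention.
nvme_raw = monthly_storage_cost(100, 30, 0.13)           # raw logs on NVMe
nvme_indexed_low = monthly_storage_cost(100, 30, 0.13, index_overhead=2.0)
nvme_indexed_high = monthly_storage_cost(100, 30, 0.13, index_overhead=3.0)
s3_raw = monthly_storage_cost(100, 30, 0.023)            # raw logs in S3

print(f"NVMe raw: ${nvme_raw:,.2f}/month")
print(f"NVMe with 2-3x index overhead: "
      f"${nvme_indexed_low:,.2f} to ${nvme_indexed_high:,.2f}/month")
print(f"S3 raw: ${s3_raw:,.2f}/month")
```

Running the same function with the projected volume three years out, rather than the current one, is the projection the text says teams tend to skip.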

SaaS aggregation cost models add a retention pricing dimension that self-hosted options do not: Datadog Logs charges for log ingestion and log indexing separately, with indexed logs priced by retention period (15-day, 30-day) on top of ingestion costs, plus rehydration costs for logs kept in archives (compressed in object storage, not directly queryable, available for rehydration to indexed storage on demand). A team that set up Datadog Logs at launch with a 15-day retention policy and 50 GB/day ingest may have made the correct cost decision at the time, and the wrong policy decision for a compliance requirement that arrives two years later asking for 90-day audit log retention. Decisions that interact with compliance requirements need revisitation conditions that name the compliance triggers explicitly: "re-evaluate the retention policy if a compliance requirement asks for retention longer than 15 days."

Structured logging as the aggregation tool's access key

The most consequential sub-decision inside the logging infrastructure decision is rarely discussed as a decision at all: whether services write structured log records (JSON objects with named fields) or unstructured text strings (human-readable messages with embedded data). The aggregation tool's query capabilities are only accessible if the application produces the structured fields that the query model expects. A field-indexed aggregation tool with 1,000 services that all log unstructured text is operationally equivalent to a full-text search over all log content — the index exists, but there is nothing structured to index.

Structured logging adoption is a team-level decision that plays out service by service. The first service to add requestId as a structured field gains the ability to trace a single request's path through that service's logs. If no other service propagates requestId as a structured field, tracing a request across service boundaries requires assembling the chain manually from timestamps and IP addresses — a multi-minute exercise at incident time. The value of structured logging is proportional to how consistently it is applied across all services; inconsistent adoption produces a cluster where some services have queryable structured fields and others do not, and on-call engineers must know which query strategy applies to which service.
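As an illustration of what "structured field" means in practice, here is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields through `extra={"fields": {...}}` are assumptions made for this example, not a prescribed library. The sketch writes to an in-memory buffer so the output can be inspected; a real service would write to stdout.

```python
import io
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.error("payment declined",
          extra={"fields": {"requestId": "req-123", "customerId": "acme-corp"}})
line = json.loads(buffer.getvalue())
```

Here `requestId` is a named field rather than text embedded in the message, which is what lets an aggregation backend filter on it instead of pattern-matching the message string.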

The field naming convention is a sub-decision that produces subtle inconsistency without a documented standard. Different engineers building different services write user_id, userId, user-id, and customerId for the same conceptual field. In a field-indexed aggregation tool, these are four different fields — a query for userId:acme-corp does not return records where the field was named customerId. The on-call engineer running an incident query must know the convention used by each service they are querying, or they miss records. The field naming convention, documented once as a shared standard and enforced via a shared logging library, eliminates this class of incident query error. Without documentation, the convention is reconstructed from existing log samples — which produces a different answer depending on which services the person sampling happened to look at.
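One way a shared logging library might enforce the naming convention is to normalize known aliases to a single canonical name before a record is emitted. The sketch below is hypothetical: the alias table, the choice of `userId` as the canonical name, and the `normalize_fields` function are assumptions for illustration, using the four spellings from the paragraph above.

```python
# Assumed alias table: every spelling on the left is rewritten to the
# canonical name on the right before the record is written.
FIELD_ALIASES = {
    "user_id": "userId",
    "user-id": "userId",
    "customerId": "userId",
}


def normalize_fields(fields: dict) -> dict:
    """Return a copy of fields with known aliases mapped to canonical names."""
    normalized = {}
    for name, value in fields.items():
        canonical = FIELD_ALIASES.get(name, name)
        if canonical in normalized and normalized[canonical] != value:
            # Two aliases with different values indicate an instrumentation bug.
            raise ValueError(f"conflicting values for field {canonical!r}")
        normalized[canonical] = value
    return normalized


from_service_a = normalize_fields({"user_id": "acme-corp", "level": "ERROR"})
from_service_b = normalize_fields({"customerId": "acme-corp", "level": "ERROR"})
```

With both services emitting `userId`, a single query on that field returns records from both, which is the incident query error class the documented standard removes.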

Trace ID propagation across service boundaries is the highest-value structured field decision and the one most commonly applied inconsistently. A distributed request ID — a unique string generated at the ingress point and passed via HTTP header to every downstream service — enables correlating all log records for a single user request across all services, regardless of which services the request touched. If the checkout service propagates the request ID but the payment service does not extract it from the incoming header and include it in its log records, the trace breaks at the payment service boundary. Every incident that crosses the checkout-to-payment boundary is harder to diagnose because the log records on the payment side cannot be associated with the specific request that caused the error. Cross-service decisions like trace ID propagation require a documented standard that each team can implement independently but consistently — which means the standard must be written down rather than communicated via example.
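The propagation requirement has three parts: extract the ID from the incoming request (or generate one at the ingress point), attach it to every log record in the request scope, and forward it on outbound calls. A minimal, framework-agnostic sketch in Python follows. The function names are hypothetical, the header name `X-Request-ID` is an assumption for this example, and plain dictionaries stand in for real HTTP header objects.

```python
import contextvars
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name for this sketch

# Holds the current request's ID for the duration of the request scope.
_request_id = contextvars.ContextVar("request_id", default=None)


def begin_request(incoming_headers: dict) -> str:
    """Extract the request ID from incoming headers, or mint one at ingress."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    request_id = lowered.get(REQUEST_ID_HEADER.lower()) or uuid.uuid4().hex
    _request_id.set(request_id)
    return request_id


def log_fields() -> dict:
    """Fields to merge into every log record written inside the request."""
    return {"requestId": _request_id.get()}


def outbound_headers(extra=None) -> dict:
    """Headers for a downstream call, carrying the current request ID."""
    headers = dict(extra or {})
    request_id = _request_id.get()
    if request_id is not None:
        headers[REQUEST_ID_HEADER] = request_id
    return headers


received = begin_request({"x-request-id": "req-42"})
downstream = outbound_headers({"Accept": "application/json"})
```

The break described above corresponds to a service that never calls the equivalent of `begin_request`: its log records carry no `requestId`, so the chain cannot be followed past that boundary.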

The log level contract: the missing policy

Log levels — the DEBUG / INFO / WARN / ERROR / FATAL taxonomy present in every logging framework — appear self-explanatory. They are not. Without a written log level contract, the taxonomy becomes whatever each engineer needs it to be to make their service visible in production dashboards and alert filters. The result is a level distribution that is inconsistent across services, misleading about actual error rates, and expensive to use as a production signal.

The most common log level violation is WARN misuse. WARN in most logging framework documentation means "something unexpected happened that was handled gracefully — the application is continuing normally, but this condition is worth investigating." In practice, WARN is used as "I want this line to be visible in a production log filter without triggering an alert." A service that logs "connection pool at 80% capacity" at WARN every time a connection is used (because the developer wanted pool utilization visible without setting up a metric) produces hundreds of WARN-level log lines per minute on a busy service. An on-call engineer whose alert fires on a WARN spike investigates a pool saturation alert and finds 99% of the WARN lines are routine capacity reporting. The alert is accurate and misleading simultaneously: something is flagged at WARN, but the WARN level cannot be trusted to mean "investigate this."

The log level contract is a document that names what each level means for a specific system and what on-call action each level triggers. It is the decision that converts log levels from a suggestion into a production signal. A minimal contract might read: "DEBUG: execution path detail for active debugging sessions only — never enabled in production. INFO: normal operations that are significant to understanding the system's current state — service startup, configuration load, major request milestones. WARN: an unexpected condition was handled — the system is continuing, but the condition should be investigated within 24 hours during business hours. ERROR: an operation failed — user-visible impact or data integrity risk is possible, requires investigation within the current on-call shift. FATAL: an unrecoverable condition — the process is exiting." The contract then specifies the alert policy: "a WARN rate spike (>100% increase over 5-minute rolling average) creates a non-paging ticket for morning review. An ERROR rate spike (>50% increase over 5-minute rolling average) creates a paging alert." The contract makes the log level semantics explicit rather than implicit — which means new engineers can instrument their services to match the system's expectations rather than their own mental model of what WARN means.
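The alert policy in the sample contract can be expressed as a small decision function. The sketch below is one possible reading, with assumptions stated: the baseline is the mean of per-minute counts over the preceding five minutes, the thresholds are the two quoted above, and a zero baseline with any current events is treated as a spike. The names `ALERT_POLICY` and `alert_action` are hypothetical.

```python
from statistics import mean

# level -> (fractional increase over the rolling average, resulting action)
ALERT_POLICY = {
    "WARN": (1.00, "ticket"),   # >100% increase: non-paging ticket
    "ERROR": (0.50, "page"),    # >50% increase: paging alert
}


def alert_action(level: str, baseline_counts: list, current_count: int):
    """Return "ticket", "page", or None for the current one-minute count.

    baseline_counts holds the per-minute counts from the rolling 5-minute window.
    """
    if level not in ALERT_POLICY or not baseline_counts:
        return None
    threshold, action = ALERT_POLICY[level]
    baseline = mean(baseline_counts)
    if baseline == 0:
        # Assumption: any events after a silent window count as a spike.
        return action if current_count > 0 else None
    increase = (current_count - baseline) / baseline
    return action if increase > threshold else None


quiet_window = [10, 10, 10, 10, 10]
error_decision = alert_action("ERROR", quiet_window, 16)   # 60% increase
warn_decision = alert_action("WARN", quiet_window, 19)     # 90% increase
```

Writing the policy as code makes the routine-WARN problem visible: a service that emits WARN on every connection raises the baseline itself, so real anomalies at that level have to be proportionally larger before anything fires.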

The error handling strategy decision and the log level contract are companion decisions: the error handling strategy defines which errors are surfaced to users, which are retried internally, and which are logged and monitored; the log level contract defines what log level each category of error receives and what on-call response it triggers. A team that documented its error handling strategy without documenting its log level contract has defined what happens when errors occur but not how those errors appear in the aggregation tool — which is where the on-call engineer will look for them.

Writing the logging infrastructure decision record

The Nygard ADR format adapts to logging infrastructure decisions with five sections that most infrastructure setups leave undocumented.

The aggregation mechanism decision. Name the log routing pattern, the aggregation backend, and the alternatives evaluated with rejection reasons. "We evaluated Grafana Loki 2.8 (label-indexed, self-hosted), Elasticsearch 8.x via the EFK stack (field-indexed, self-hosted), and Datadog Logs (SaaS, field-indexed). Loki was selected for three reasons: (1) storage cost at our current daily log volume (30 GB/day) is approximately $21/month in S3 at 30-day retention, versus $117/month in Elasticsearch hot storage with equivalent retention; (2) Loki's operational model (no index management, no shard tuning) matches the team's operational capacity — we have one part-time platform engineer; (3) Datadog Logs was evaluated and rejected because the per-GB cost at 3× current volume (projected for 18 months) was $1,800/month, which exceeds our observability budget. Elasticsearch was evaluated and rejected on operational overhead grounds — maintaining an Elasticsearch cluster requires routine index lifecycle management that the current team does not have capacity for." The rejection reasons make a future Elasticsearch or Datadog proposal engage with specific trade-offs rather than starting from neutral ground.

The structured logging policy. Name the required fields, the optional fields, and the field naming convention that applies across all services. "Every log record must include: service (string, the deploying service's name — must match the Kubernetes service label for correlation), level (string, one of DEBUG / INFO / WARN / ERROR / FATAL per the log level contract below), timestamp (ISO 8601 UTC, injected by the shared logging library — do not set manually), requestId (string, the distributed trace request ID propagated from the ingress gateway via the X-Request-ID header — inject into every log record within a request handler scope). For user-context operations, additionally include: userId (string, the authenticated user's ID), tenantId (string, the user's tenant for multi-tenant contexts). For database and external service calls, additionally include: duration (integer, milliseconds). Note on Loki label schema: service, namespace, and env are injected as Loki labels by Promtail; all other fields are log content and are queryable only via full-text scan, not via label filter. Queries that filter on requestId or userId in production should use |= "requestId=" content filters rather than label filters. This is a known query performance limitation of label-indexed aggregation at our log volume." Documenting the Loki-specific query constraint converts the 3am incident discovery into a documented limitation with a documented workaround.
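A policy like the sample above is easier to keep consistent if it can be checked mechanically, for example in a unit test for the shared logging library. The following Python sketch validates a record against the required fields named in the sample policy. The `validate_record` function is hypothetical, and type checks are deliberately shallow (it does not verify that the timestamp is valid ISO 8601).

```python
REQUIRED_FIELDS = {
    "service": str,
    "level": str,
    "timestamp": str,
    "requestId": str,
}
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}


def validate_record(record: dict) -> list:
    """Return a list of policy violations; an empty list means compliant."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"field {name} must be {expected_type.__name__}")
    level = record.get("level")
    if isinstance(level, str) and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")
    # duration is optional, but when present it must be integer milliseconds.
    if "duration" in record and not isinstance(record["duration"], int):
        problems.append("field duration must be integer milliseconds")
    return problems


compliant = validate_record({
    "service": "checkout",
    "level": "ERROR",
    "timestamp": "2026-06-18T03:00:00+00:00",
    "requestId": "req-123",
    "tenantId": "tenant-x",
})
violations = validate_record({"service": "checkout", "level": "WARNING"})
```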

The log level contract. Name what each level means and what on-call response it triggers. "DEBUG: enabled in development environments only — never in staging or production. Detailed execution path information for active debugging. INFO: normal significant operations — service startup (log configuration at startup), successful completion of a significant multi-step operation (checkout complete, payment processed), external API responses that are slow but successful. INFO is the default level in production. WARN: an unexpected condition was handled and the application continued normally. A WARN should be investigated within one business day. Examples: a database query took longer than the P99 threshold but completed; a retry succeeded after one failure; a cache miss required a full database lookup. WARN-level log lines must represent recoverable anomalies, not routine high-frequency events — do not log connection pool utilization as WARN unless pool exhaustion is imminent. ERROR: an operation failed with user-visible impact or a data integrity risk. Requires investigation within the current on-call shift. On-call is paged if the ERROR rate exceeds the alert threshold. Examples: a database write failed after all retries, a payment processor returned a fatal error, a required external service is unreachable. FATAL: the service is exiting due to an unrecoverable error. Requires immediate on-call response. Examples: cannot connect to database on startup, required configuration is invalid, unhandled exception in the main request handler."

The retention policy with cost projection. Name the retention period at each tier, the current cost, and the review trigger. "Current policy: Loki chunks stored in S3 with 30-day retention. No warm or cold tier configured: the Loki compactor enforces the retention period and deletes chunks older than 30 days. Current daily log volume: 30 GB/day. Steady-state storage cost: 30 GB/day × 30 days = 900 GB stored × $0.023/GB-month = $20.70/month. Query performance note: Loki chunk queries over 30 days of content at 30 GB/day are slow for full-text scans; queries scoped to a 24-hour window are acceptable. Revisit retention policy if: (1) a compliance or security audit requirement asks for retention longer than 30 days, because the current policy cannot support forensics on slow-burn security incidents that span more than 30 days of log history; (2) daily log volume exceeds 100 GB/day, because at that volume full-text scans over 30 days of content are expected to exceed the 60-second query timeout during incident response; consider migration to a field-indexed backend or adoption of Loki's query acceleration features (bloom filters, TSDB index)."
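The cost arithmetic in the policy can be reproduced with a short calculation. This sketch assumes the stored volume equals daily volume times retention days (no compression, no request or index costs) and uses the $0.023/GB-month rate quoted above; the function name is illustrative.

```python
USD_PER_GB_MONTH = 0.023  # storage rate quoted in the policy


def monthly_storage_cost(daily_gb: float, retention_days: int,
                         usd_per_gb_month: float = USD_PER_GB_MONTH) -> float:
    """Steady-state monthly cost: everything ingested within the retention window is stored."""
    stored_gb = daily_gb * retention_days
    return stored_gb * usd_per_gb_month


current = monthly_storage_cost(30, 30)    # 30 GB/day, 30-day retention
extended = monthly_storage_cost(30, 90)   # same volume, 90-day retention
print(f"current:  ${current:.2f}/month")             # $20.70
print(f"extended: ${extended:.2f}/month")            # $62.10
print(f"increase: ${extended - current:.2f}/month")  # $41.40
```

Running the same function at the projected volume, for example monthly_storage_cost(100, 30), shows what the bill would look like at the 100 GB/day review trigger.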

The revisitation conditions. Name the specific circumstances that make the current tool the wrong choice. "Re-evaluate the aggregation backend if: (1) monthly storage + infrastructure cost exceeds $500/month, the approximate crossover point where Datadog Logs SaaS becomes cost-competitive after accounting for self-hosting operational overhead; (2) on more than two occasions per quarter, on-call engineers report that an incident response query filtering on user or tenant context took longer than 60 seconds, which is the signal that the label-indexed query model has reached its limit; (3) the team's platform engineering capacity increases such that Elasticsearch cluster management becomes feasible, since Elasticsearch provides field-indexed query capability that is qualitatively superior for incidents that filter on application-level fields. Do not re-evaluate purely on the basis of team preference or tool familiarity: the current tool's cost model is the primary constraint, and any migration must model the cost at the projected log volume in 18 months, not at current volume."
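Conditions written this precisely can be checked mechanically in a quarterly review. The sketch below is one way to do that; QuarterlySnapshot and its field names are hypothetical, and how each number is collected is left open.

```python
from dataclasses import dataclass


@dataclass
class QuarterlySnapshot:
    monthly_cost_usd: float          # storage + infrastructure
    slow_context_queries: int        # user/tenant-filtered incident queries that took over 60 s
    can_operate_elasticsearch: bool  # platform team judges cluster management feasible


def reevaluation_triggers(snapshot: QuarterlySnapshot) -> list[str]:
    """Return the revisitation conditions that the snapshot satisfies."""
    triggers = []
    if snapshot.monthly_cost_usd > 500:
        triggers.append("cost exceeds $500/month")
    if snapshot.slow_context_queries > 2:
        triggers.append("more than two slow context queries this quarter")
    if snapshot.can_operate_elasticsearch:
        triggers.append("Elasticsearch operation is now feasible")
    return triggers
```

An empty list means the record's conditions have not been met and the current backend stands; team preference alone never appears in the result.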

Finding logging decisions in AI chat

The WhyChose extractor surfaces logging infrastructure decisions from four session types that contain the reasoning most teams cannot reconstruct when a new engineer joins or when a compliance review asks for the decision rationale.

The aggregation tool selection session. "Should we use ELK or Loki for our Kubernetes cluster?", "Datadog Logs vs self-hosted ELK for a 50-person startup", "How does Grafana Loki compare to Elasticsearch for log queries?", "Is Loki production-ready for high-volume logging?", "What is the difference between Loki, Elasticsearch, and Datadog for Kubernetes logging?" These sessions contain the alternatives evaluated and the criteria that drove the selection. The selection session is typically the most important session to recover because the mechanism choice — field-indexed versus label-indexed — determines the incident response capability of every future on-call rotation.

The structured logging adoption session. "How do I add structured JSON logging to Express?", "What log fields should I include in microservices?", "How do I propagate request IDs across service boundaries in a Node.js application?", "What is the correct way to log user context without logging PII?", "How do I add correlation IDs to all log records in Spring Boot?" These sessions contain the field naming decisions, the request ID propagation approach, and the initial structured logging policy. They often happen per-service at different times, which means the field decisions are made independently rather than from a shared standard — and recovering multiple engineers' structured logging sessions reveals the field naming divergence that makes cross-service queries unreliable. For platform teams defining logging standards, recovering these sessions from multiple application engineers identifies the fields that teams actually need at incident time versus the fields that were added for convenience.

The incident debugging session. "How do I search Elasticsearch logs by user ID?", "Why is my Kibana query timing out for a large time range?", "How do I query Loki for logs from a specific customer ID?", "Loki query taking too long — how do I speed it up?", "How do I find all requests from a specific user across multiple services in Datadog?" These sessions contain the revealed query constraints — the moments when the team first discovers what the aggregation tool cannot answer efficiently. They are the most valuable sessions for documenting the logging infrastructure's actual limitations, because they describe real incident scenarios rather than hypothetical query patterns. The 3am constraint is most reliably documented from these sessions: not "Loki cannot filter on high-cardinality fields efficiently" (which is abstract) but "when we tried to find all errors for customer acme-corp during the November incident, the Loki query for the content filter took four minutes and then timed out, and we ended up using kubectl logs directly on the checkout pod" (which is concrete).

The cost investigation session. "Why is our Datadog Logs bill increasing faster than our traffic?", "How do I reduce Elasticsearch storage costs without losing retention?", "How do I implement Loki log retention and deletion policies?", "What is Datadog Logs Flex retention and how does it reduce cost?", "How do I reduce log volume without losing important log lines?" These sessions contain the retention policy decision — typically made under budget pressure after the initial deployment, when cost has grown to a point where the default retention configuration is no longer acceptable. The session reveals which retention compromise the team made: 15 days instead of 30, Flex tier for older data, reduced ingestion via sampling or filtering. A new technical leader who reads the logging cost budget without the retention history cannot determine whether the current retention policy was a deliberate cost decision or an accidental default — the cost session is the record of the trade-off that was actually made.

What the decision record prevents

A documented logging infrastructure decision prevents three recurring problems that teams encounter as they grow and their logging needs evolve.

It prevents the incident response surprise. The on-call engineer who discovers at 3am that the aggregation tool cannot filter efficiently on customer ID is discovering a query model constraint that was known (implicitly) at selection time and never documented. A decision record that names the label schema (which fields are indexed as Loki labels and which are log content requiring full-text scan) converts the 3am surprise into a documented limitation with documented workarounds: "for queries that require filtering on customerId, use the content filter pattern |= "customerId=" scoped to a 24-hour window; for multi-day incident forensics, run the same line filter over the longer range with an increased query timeout." Like test strategy decisions that determine which kinds of failures the test suite can detect, the aggregation tool's query model determines which kinds of incidents the on-call engineer can investigate efficiently, and both need to be documented before the production condition reveals the gap.
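The documented workaround can be captured as a small helper so that on-call engineers do not have to reconstruct the query under pressure. This is a sketch that builds a LogQL string from a label selector and a line filter; the function name is hypothetical, and it assumes log content carries key=value text such as customerId=acme-corp (a JSON log format would need a different search string). The 24-hour window is not part of the query string and would be supplied as the start and end of the query range.

```python
def logql_content_query(labels: dict[str, str], needle: str) -> str:
    """Build a LogQL query: label selector first, then a |= line filter on raw content."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes for a double-quoted LogQL string.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    selector = ", ".join(
        f"{name}={quote(value)}" for name, value in sorted(labels.items())
    )
    return f"{{{selector}}} |= {quote(needle)}"


query = logql_content_query(
    {"service": "checkout", "env": "production"},
    "customerId=acme-corp",
)
print(query)  # {env="production", service="checkout"} |= "customerId=acme-corp"
```

Narrowing the label selector as far as possible before the line filter limits how many chunks the content scan has to read.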

It prevents the retention policy crisis. A compliance audit that asks for 90-day log retention is not a logging problem; it is a retention policy problem. The team that set up Loki with 30-day retention made a cost-versus-retention trade-off that was correct at the time. Without documentation, the audit finding arrives as a surprise: "we don't have logs older than 30 days." With documentation, the finding arrives as a recoverable decision gap: "our current retention policy was set to 30 days in December 2023 to keep storage costs below $50/month; the compliance requirement is for 90 days; the estimated cost increase for extending retention to 90 days is $41/month; we recommend extending the retention policy." The decision record makes the audit finding a scoped remediation rather than an unexplained gap. Security and compliance requirements interact with infrastructure decisions most violently when the infrastructure decision was never documented — without the record, there is no baseline to compare against the new requirement.

It prevents the log level erosion. The log level contract erodes without a documented standard because each engineer who adds logging to a new service applies their personal interpretation of what WARN means. Over two years, a cluster where the log level semantics were never formalized accumulates dozens of services with incompatible log level usage. Some services log every cache miss at WARN; others log all request latencies above 100ms at WARN; others reserve WARN for genuine anomalies. The aggregation tool's WARN-rate alert fires with mixed signal — some WARN spikes represent genuine problems, others represent a busy cache layer. The on-call engineer cannot tell which kind of WARN spike they are looking at without knowing which service's WARNs are trustworthy. The log level contract is one of the decisions most likely to never be written down because it feels like a matter of engineering judgment rather than a policy — but engineering judgment applied inconsistently across teams produces a monitoring signal that is difficult to trust.