Why does error handling need an architecture decision record if it's just implementation detail?

Error handling is not an implementation detail — it is a cross-cutting policy that determines what your users see when something fails, what your engineers see in logs, and which failure modes are silent vs. visible. The decision about whether to fail fast or degrade gracefully determines user experience during partial outages. The decision about which errors are logged vs. alerted vs. ignored determines your team's operational visibility. The decision about which operations are retried with what policy determines whether transient failures are invisible to users or produce duplicate submissions, double charges, or lost data. These decisions are made implicitly in every application, but when they are undocumented, new engineers make inconsistent local decisions — one endpoint throws a 422, another returns a 400 for the same class of validation error; one service retries with exponential backoff, another retries immediately three times. The accumulation of inconsistent local decisions becomes the policy your users experience, which is why the strategy needs a decision record.

What is the difference between the fail-fast and graceful degradation error handling strategies?

Fail-fast means that when a required dependency fails — the database is unavailable, a required API call returns an error — the operation immediately returns an error to the caller rather than attempting to proceed with partial data or default values. The user sees an error message. The engineer sees a clear stack trace pointing to the failure. The fail-fast strategy makes failures visible and makes the system's dependency graph explicit: if the system cannot proceed without the database, it says so, rather than returning silently degraded results that look correct but are incomplete. Graceful degradation means that when a non-required dependency fails, the operation continues with reduced functionality rather than failing entirely. The recommendation sidebar doesn't load, but the product page does. The analytics event doesn't fire, but the checkout completes. The audit log write fails, but the action is taken. Both strategies are correct for their intended domain — fail-fast for operations where partial success would be worse than no success, graceful degradation for features where a degraded experience is better than a missing one. The problem is that most applications have neither strategy documented: individual engineers make local decisions about whether each operation should fail-fast or degrade, and the decisions accumulate inconsistently.

How do you find error handling decisions in AI chat history?

Error handling decisions appear in AI chat in four recognizable session shapes: (1) early design sessions — 'what should I return when the user tries to access something they don't have permission to?', 'should I throw here or return null?', 'what's the right HTTP status code for this case?' — where individual error handling choices are made without reference to a strategy; (2) production incident sessions — 'users are seeing a white screen after the payment fails, how do I fix this?', 'why is this 500 error swallowing the real error message?' — where the cost of a particular error handling decision becomes visible for the first time; (3) retry logic sessions — 'how should I handle rate limiting from this API?', 'should I retry this database query if it fails?', 'how do I implement exponential backoff?' — where the retry strategy is set for one specific integration and rarely generalized to a policy; (4) observability sessions — 'what level should I log this at — info, warn, or error?', 'how do I know when to alert vs. just log?', 'why are we getting so many error alerts that nobody reads?' — where the team debates the signal-to-noise problem that an undocumented logging strategy produces. These sessions collectively contain the reasoning behind the de facto error handling policy the application has accumulated, and extracting them with the WhyChose extractor surfaces the implicit decisions before the strategy can be documented.

What should an error handling strategy ADR include?

An error handling strategy ADR should document four decisions that most teams leave implicit: (1) the error taxonomy — the named categories of errors the application recognizes (validation errors, authorization errors, not-found errors, upstream dependency failures, internal unexpected errors) with the associated HTTP status codes, log levels, and user-visible messages for each; (2) the failure mode decision — which operations fail-fast when a dependency fails and which operations degrade gracefully with reduced functionality, stated as a general principle plus the specific dependency classifications that determine which rule applies; (3) the retry and idempotency policy — which operations are safe to retry (idempotent reads, safe writes with idempotency keys) and which are not (non-idempotent mutations), with the backoff strategy and retry budget for each category; (4) the observability contract — what gets logged at what level, what triggers an alert, what is tracked as a metric vs. a log line, and what is explicitly ignored (expected errors that don't need engineer attention). Without these four decisions documented, every engineer makes their own local determination and the application ends up with inconsistent behavior across endpoints and services.

2026-06-15 · ~16 min read

The error handling strategy decision record: why "we'll handle errors properly later" becomes the policy your users experience in production

Every application has an error handling strategy. Most teams chose theirs by not choosing it — by writing "TODO: better error handling" and shipping, by returning null instead of throwing because throwing felt premature, by catching every exception at the top level and logging it because that was the easiest thing to do. Six months later, that pattern of local decisions is the de facto error handling policy the application has. Changing it requires touching every layer of the stack at once. The policy was never documented as a decision, because it was never made as a decision.

The phrase "we'll handle errors properly later" is one of the most consequential things engineers say, because it is almost never followed by "later." By the time "later" arrives, error handling has been woven into every layer of the stack through the accumulation of individual local decisions. The service returns 500 for every unhandled exception because that was the framework default. The UI shows "Something went wrong" because that was the copy in the first error component someone built. Database failures silently return empty arrays because the original developer wrapped the query in a try-catch and returned an empty list on any exception. Each of these is a reasonable local decision. Together, they are the error handling strategy.

The problem is not that these decisions are wrong — many of them are defensible. The problem is that they are invisible as decisions. A new engineer sees a service that swallows database exceptions and returns empty arrays and doesn't know whether this is a deliberate graceful degradation strategy or an accident of the first implementation. They don't know what the team decided about fail-fast behavior. They don't know whether they should follow the pattern or fix it. So they ask the engineer who built the feature, who doesn't remember the original reasoning, or they import their prior team's convention, or they make a new local decision that doesn't match the surrounding code. The error handling policy accumulates another inconsistency.

This is the same pattern as every undocumented architectural decision: the decision is invisible as a decision because it was made through the accumulation of implementation choices rather than through a deliberate policy. Error handling is one of the most consequential places where this pattern appears, because the accumulated policy is what users experience when the system fails — and the system will fail.

The implicit default policy: what happens when you don't handle an error

Every framework, language, and runtime has a default behavior for unhandled errors. In most web frameworks, an unhandled exception in a request handler produces a 500 Internal Server Error response. An unhandled rejected Promise in Node.js before version 15 produced only a warning while the process kept running; since version 15 it terminates the process by default. An unrecovered panic in Go unwinds the goroutine's stack, running deferred functions along the way, and crashes the whole process unless a deferred recover call stops it. An uncaught exception in a plain Python program terminates the process, while frameworks like Django and Flask catch it and return a 500 response, with the DEBUG setting deciding whether that response includes a detailed traceback page.

These defaults are the error handling policy for any team that hasn't made a different decision. They are not wrong by default — 500 for unhandled exceptions is a reasonable behavior. But they are not a policy that was deliberately designed for the application's specific failure modes, its users, or its operational context. The default policy treats all unhandled errors the same way, regardless of whether the error is a database connection failure that should trigger an alert, a validation error that should be a 422 with a specific message, or an expected race condition that should be retried transparently.

When the team hasn't decided which of these categories applies in which context, the default applies everywhere. Users see 500 for validation errors. Engineers get alerted for expected race conditions. Database failures return the same generic error page as payment processing failures. The user can't tell whether to retry, fix their input, or contact support. The engineer can't tell whether the alert is a real outage or expected noise. The on-call rotation starts ignoring alerts because the signal-to-noise ratio has degraded to the point where every alert requires investigation to determine whether it matters.

This is the cost of the implicit default policy: it treats every error the same, which means the team gets no signal differentiation, users get no actionable feedback, and the operational visibility into what is actually failing and why is poor. The team knows the application is producing errors; they don't know which errors are normal and which are anomalous, because the policy never drew that line.

The error surface decision: what users see versus what engineers see

The most consequential decision in any error handling strategy is the boundary between what users see and what engineers see from the same error. This decision is almost never made explicitly. Most teams discover they haven't made it when a user reports an error and pastes a stack trace into a support ticket — the full stack trace, including internal file paths, database query strings, and connection credentials embedded in the connection error message — because the application was sending the raw exception to the browser.

The error surface decision has two sides. The user-facing side is a product decision as much as a technical one: what does the user need to know when something fails? A validation error needs enough specificity to let the user fix their input — "Email address is already in use" is actionable; "422 Unprocessable Entity" is not. A payment failure needs enough specificity to let the user know whether to retry, use a different card, or contact their bank — "Your card was declined" covers the user's next action; "Stripe API error: insufficient_funds" is implementation detail that doesn't help the user and may leak information they shouldn't have. A server error needs enough information to communicate that the problem is on the system's side, not the user's — "We're having trouble processing this right now, please try again in a few minutes" is the correct user-facing message for a database connection failure; "ECONNREFUSED 127.0.0.1:5432" is not.

The engineer-facing side is an observability decision: what does an engineer need to know when the same error fires? They need the full exception chain, the stack trace, the request context (user ID, endpoint, payload shape, timestamp), the system context (which server, which deployment, which version of the code), and ideally the preceding log entries that led to the error. This is the opposite of what the user needs — more context, not less; raw technical detail, not human-readable explanation.

The error surface decision draws the line between these two outputs from the same error. Most applications serve both from the same code path, which means either users get too much technical detail (a security risk that also produces confusion) or engineers get too little context (an observability failure that makes incidents harder to diagnose). The decision that separates the two — "what goes in the HTTP response body" versus "what goes in the log and the error tracking system" — is one of the most important undocumented decisions in most applications.
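One way to make that separation concrete is a single function that produces both outputs from the same exception. The following is a minimal sketch, not a prescribed implementation: the function name `split_error_surface`, the `error_id` correlation field, and the logger name are assumptions made for illustration, and the user-facing copy is borrowed from the example earlier in this section.

```python
import logging
import traceback
import uuid
from dataclasses import dataclass

logger = logging.getLogger("app.errors")  # hypothetical logger name

# User-facing copy is a product decision; this wording is only an example.
GENERIC_MESSAGE = (
    "We're having trouble processing this right now, "
    "please try again in a few minutes"
)


@dataclass
class ErrorSurface:
    response_body: dict  # what goes in the HTTP response body
    log_record: dict     # what goes to the log and the error tracking system


def split_error_surface(exc: Exception, request_context: dict) -> ErrorSurface:
    """Produce the user-facing and engineer-facing views of one error."""
    # The correlation ID lets support match a user report to the log entry
    # without exposing any internal detail in the response.
    error_id = str(uuid.uuid4())
    response_body = {"message": GENERIC_MESSAGE, "error_id": error_id}
    log_record = {
        **request_context,
        "error_id": error_id,
        "exception_type": type(exc).__name__,
        "exception_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
    logger.error("unhandled error %s", error_id, extra={"error_context": log_record})
    return ErrorSurface(response_body, log_record)


# Usage: the raw connection error reaches the log, never the response body.
try:
    raise ConnectionRefusedError("ECONNREFUSED 127.0.0.1:5432")
except ConnectionRefusedError as exc:
    surface = split_error_surface(exc, {"endpoint": "/checkout", "user_id": "u_123"})
```

The point of routing every error through one function is that the boundary is drawn in one place, so a reviewer can check what leaves the system without reading every handler.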

This intersects directly with the security architecture decision record: error messages that expose internal system state are a real attack surface. An error message that reveals whether a user account exists (different message for "wrong password" versus "account not found") enables user enumeration. An error message that exposes a SQL query fragment enables an attacker to probe the database schema. An error message that returns the full internal exception chain in a JSON API response exposes the library versions and internal file paths that an attacker can use to identify known vulnerabilities. The error surface decision is a security boundary, not just a UX concern, and it belongs in a decision record that is accessible to the security review process.
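The user enumeration case can be sketched in a few lines. The names here (`LoginFailure`, `login_failure_surface`), the 401 status, and the message wording are assumptions for illustration only. The idea is that the response is identical for both failure reasons, while the log keeps the distinction for engineers.

```python
from enum import Enum
from typing import Tuple


class LoginFailure(Enum):
    ACCOUNT_NOT_FOUND = "account_not_found"
    WRONG_PASSWORD = "wrong_password"


# One message for every login failure, so the response cannot be used
# to learn whether an account exists. Wording is illustrative.
LOGIN_FAILED_MESSAGE = "Incorrect email or password"


def login_failure_surface(reason: LoginFailure) -> Tuple[dict, dict]:
    """Return (http_response, log_fields) for a failed login attempt."""
    response = {"status": 401, "body": {"message": LOGIN_FAILED_MESSAGE}}
    # The real reason is recorded only on the engineer-facing side.
    log_fields = {"event": "login_failed", "reason": reason.value}
    return response, log_fields
```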

The failure mode decision: fail-fast versus graceful degradation versus silent

Every operation that depends on an external resource — a database, a cache, a third-party API, a message queue — must make a decision about what to do when that resource is unavailable. The team has three options: fail fast and return an error to the caller, degrade gracefully by continuing with reduced functionality, or fail silently by proceeding as if the dependency succeeded. All three strategies are appropriate in some contexts. The problem is that most teams apply them inconsistently, because the choice is made individually for each integration rather than as a policy.

Fail-fast is correct when partial success would be worse than no success. A checkout operation that writes an order to the database but fails to deduct inventory should not silently succeed — the resulting state (an order with no inventory reservation) is worse than a failed order that the user can retry. A data migration that fails on row 500 of 10,000 should not continue processing the remaining rows with the failed row missing from the output — silent partial completion is harder to detect and recover from than an explicit failure that identifies the point of failure. Fail-fast is the appropriate strategy when the operation's contract guarantees that either all parts succeed or none do — when the semantic meaning of success depends on the entire operation completing.

Graceful degradation is correct when a feature is genuinely optional relative to the user's primary goal. If the recommendation sidebar fails to load because the recommendation engine is down, the user can still browse products. If the analytics event fails to fire because the analytics endpoint is unreachable, the business loses tracking data but the user completes their task. If the audit log write fails because the audit logging service is temporarily unavailable, the action should still proceed — the missing audit entry is a compliance concern to address separately, but failing the primary operation to protect the audit log inverts the priority. Graceful degradation requires clear thinking about which features are on the critical path for the user's goal and which are ancillary — a distinction that is easy to assert and hard to maintain without documentation.

Silent failure is correct in exactly one scenario: when the operation is genuinely fire-and-forget with no business consequence to the caller if it fails, and when the failure is expected frequently enough that logging it would create noise rather than signal. In practice, silent failure is almost always wrong, because operations that seem fire-and-forget at implementation time often turn out to have business consequences that become visible only when they fail consistently. The most common version of this is the silent analytics failure — "if the tracking event fails, it doesn't matter, just ignore it" — which becomes very visible when the analytics dashboard shows zero events for a feature that is clearly being used, or when the team tries to attribute revenue to a marketing campaign and discovers the tracking was silently failing for three months.

The failure mode decision needs to be made as a policy, not case by case, because the case-by-case approach produces inconsistency. A team that has documented "all operations on the critical payment path fail fast; all analytics and recommendation features degrade gracefully; no failure is ever silent — every caught exception is logged at warning or above" has a policy that new engineers can apply without a judgment call. A team that has not documented this leaves every engineer to determine, for each new integration, which category applies — and different engineers make different determinations, producing an application whose failure behavior is inconsistent across features and whose failure modes are hard to characterize to users or on-call engineers.
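A policy like the one quoted above can be encoded so that engineers look up the classification instead of deciding it. This is a rough sketch under stated assumptions: the dependency names, the `call_dependency` helper, and the choice to treat unclassified dependencies as fail-fast are all hypothetical, not something the policy itself dictates.

```python
import logging
from enum import Enum
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("app.dependencies")  # hypothetical logger name


class FailureMode(Enum):
    FAIL_FAST = "fail_fast"
    DEGRADE = "degrade"


# Hypothetical classifications; the real table belongs in the decision record.
DEPENDENCY_POLICY = {
    "payments_db": FailureMode.FAIL_FAST,
    "recommendations": FailureMode.DEGRADE,
    "analytics": FailureMode.DEGRADE,
}


def call_dependency(
    name: str, operation: Callable[[], T], fallback: Optional[T] = None
) -> Optional[T]:
    """Run an operation under the documented failure mode for its dependency."""
    # Assumption: an unclassified dependency fails fast, so a missing
    # classification shows up as a loud error rather than a silent default.
    mode = DEPENDENCY_POLICY.get(name, FailureMode.FAIL_FAST)
    try:
        return operation()
    except Exception:
        if mode is FailureMode.FAIL_FAST:
            raise
        # Degrade, but never silently: every caught exception is logged
        # at warning or above.
        logger.warning("dependency %s failed; degrading", name, exc_info=True)
        return fallback
```

A call such as `call_dependency("recommendations", fetch_sidebar, fallback=[])` (with `fetch_sidebar` standing in for any callable) returns the empty list when the recommendation engine is down, while the same failure on `"payments_db"` propagates to the caller.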

The error taxonomy problem: when teams invent local conventions that conflict

One of the clearest signals that an application lacks a documented error handling strategy is HTTP status code inconsistency. One endpoint returns 400 for invalid input. Another returns 422. A third returns 409 for a conflict that a different endpoint handles with a 400. A fourth returns 200 with a JSON body containing an "error" field because the original developer thought non-2xx responses were inappropriate for a JSON API. An engineer building the frontend cannot write a single error handling component for these cases — they need to handle each endpoint's specific behavior, often discovered by reading the source or by observing production errors.

The status code inconsistency is not the core problem — it is a symptom of the core problem, which is that the team has no error taxonomy. An error taxonomy names the categories of errors the application recognizes and assigns consistent handling to each category. "Validation errors are 422 with a JSON body containing an 'errors' array where each entry names the field and the validation message. Authorization errors are 403 with a JSON body containing a human-readable 'message' field. Not-found errors are 404 with a consistent body shape. Dependency failures are 503 when the dependency is external to the system and 500 when the failure is internal. Conflict errors — for cases where the request is valid but the action cannot be taken because of the current state — are 409."

A named taxonomy gives every engineer building an endpoint the same reference. It gives frontend engineers a stable contract: validation errors look like this, authorization errors look like this, server errors look like this. It gives the API documentation a schema to describe. It gives the error handling tests a specification to verify against. Without a taxonomy, each of these groups makes their own local determination and the application ends up with behavioral inconsistency that requires per-endpoint special-casing across every consumer.
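To illustrate the stable contract, here is a small sketch of the body shapes described in the example taxonomy, plus the single client-side handler they make possible. The function names are hypothetical and the message strings are example copy; only the status codes and the `errors` and `message` fields come from the taxonomy quoted above.

```python
from typing import List, Tuple


def validation_error(field_errors: dict) -> Tuple[int, dict]:
    """422 with an 'errors' array naming each field and its message."""
    body = {
        "errors": [
            {"field": field, "message": message}
            for field, message in field_errors.items()
        ]
    }
    return 422, body


def authorization_error() -> Tuple[int, dict]:
    return 403, {"message": "You don't have permission to do this"}


def not_found_error() -> Tuple[int, dict]:
    return 404, {"message": "We couldn't find that"}


def describe_for_user(status: int, body: dict) -> List[str]:
    """One consumer-side handler is enough when every endpoint follows the taxonomy."""
    if status == 422:
        return [f"{e['field']}: {e['message']}" for e in body["errors"]]
    return [body["message"]]
```

With inconsistent endpoints, `describe_for_user` would need a branch per endpoint; with the taxonomy it needs a branch per category.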

The error taxonomy problem is closely related to the interface contract decision record: the error contract is part of the interface contract, and an interface whose error behavior is undocumented is an interface whose callers cannot reliably handle failures. When a service changes its error format — moves from flat error strings to structured error arrays, changes a 400 to a 422, starts returning 503 instead of 500 for database failures — callers that were handling the old format break silently or loudly depending on how they were written. The error taxonomy is a versioned contract that must be maintained alongside the rest of the interface contract.

This is also where the wrong constraint problem most often appears in error handling: a constraint that was correct at one system scale becomes incorrect at another. "We always return 200 because our frontend client library didn't handle non-2xx responses correctly" was a valid constraint for the first API client. When the API gained a mobile client, a third-party integration, and a command-line tool, the constraint propagated to all three consumers — who now have to implement a second layer of error detection inside successful HTTP responses. The constraint was never documented as a constraint, so the later engineers who built those clients didn't know they were inheriting it, and didn't know to question it.

The retry and idempotency decision

Every operation that can fail transiently — network calls, database writes under lock contention, rate-limited API calls — must make a decision about whether and how to retry. This decision has two parts: whether the operation is safe to retry (the idempotency decision) and how retries should be attempted if the operation is safe (the retry policy decision). Both parts are almost never documented, which means both parts are determined locally by the engineer implementing each integration.

The idempotency decision is the more consequential one. An idempotent operation can be retried any number of times and leave the system in the same state as a single successful attempt: reading a record from a database, updating a user's display name to a specific value, deleting a resource that may or may not exist. A non-idempotent operation changes state each time it runs, so every retry adds another effect: creating a new payment, sending an email notification, incrementing a counter, appending to a log. Retrying a non-idempotent operation on transient failure without an idempotency mechanism produces duplicate payments, duplicate emails, and inflated counters. The user's card is charged twice. The welcome email is sent three times. The event count in the analytics system is wrong.
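The difference is easy to see in a toy simulation. The sketch below uses a hypothetical in-memory `state` dictionary and simulates a client that sends each request twice because the first response was lost in transit.

```python
# Hypothetical in-memory state standing in for a database.
state = {"display_name": "old", "counter": 0}


def set_display_name(name: str) -> None:
    # Idempotent: repeating the call leaves the same final state.
    state["display_name"] = name


def increment_counter() -> None:
    # Non-idempotent: every call adds another effect.
    state["counter"] += 1


# Each request is delivered twice (original attempt plus one retry).
for _ in range(2):
    set_display_name("Ada")
for _ in range(2):
    increment_counter()

# display_name is "Ada", as intended; counter is 2, although the caller
# intended a single increment.
```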

The engineering solution to non-idempotent retries is the idempotency key, a unique token that the caller generates and includes with the request. The system uses the key to deduplicate: if a request with the same idempotency key has already been processed, it returns the original response rather than processing the request again. This is the correct mechanism for making non-idempotent operations safe to retry. The decision that it is required (that all payment operations, all email sends, and all inventory mutations must accept and process idempotency keys) is an architectural decision that must be made at the system level before the feature is built. Making it retroactively, after the endpoint is live and callers are retrying without idempotency keys, requires either accepting duplicate state or building a deduplication mechanism after the fact with imperfect coverage.
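The deduplication step can be sketched as follows. This is a deliberately simplified, in-memory illustration: `PaymentService` and its fields are hypothetical, and a production version would need an atomic check-and-store (for example a unique constraint on the key), key expiry, and handling for a retry that arrives while the first attempt is still in flight.

```python
import uuid


class PaymentService:
    """Hypothetical payment service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._responses = {}  # idempotency key -> original response
        self.charges = []     # every charge actually performed

    def create_payment(self, idempotency_key: str, amount_cents: int) -> dict:
        # A key seen before means this is a retry: replay the stored
        # response instead of charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        payment_id = str(uuid.uuid4())
        self.charges.append((payment_id, amount_cents))
        response = {"payment_id": payment_id, "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response


# Usage: the caller generates one key per logical payment and reuses it on retry.
service = PaymentService()
key = str(uuid.uuid4())
first = service.create_payment(key, 1999)
retry = service.create_payment(key, 1999)  # e.g. after a timeout on the first call
```

After the retry, `service.charges` holds a single entry and both calls returned the same response, which is the property that makes the retry safe.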

The retry policy decision determines the backoff strategy, the retry budget, and the conditions under which retries are attempted. The canonical approach — exponential backoff with jitter — is well known, but the specific parameters are not: how long is the initial delay, what is the backoff multiplier, what is the maximum delay, how many total attempts, which error conditions trigger a retry and which don't? A 429 rate limit error should trigger a retry after the Retry-After header interval. A 500 internal server error from an upstream service should trigger a retry with backoff. A 400 bad request should never trigger a retry — the request is malformed and will fail the same way every time. A 503 service unavailable should trigger retries for a limited period, after which the request should be failed and the caller should decide whether to present a "try again later" message or surface a degraded experience.
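The status code rules and the backoff parameters above can be written down as one small function. The sketch below is one possible encoding, with assumed parameter values (0.5 s base delay, multiplier 2, 30 s cap, 5 total attempts) that a real decision record would set for itself. It uses the "full jitter" variant of exponential backoff and assumes the `Retry-After` header has already been parsed into seconds.

```python
import random
from typing import Optional

# Statuses the policy above treats as retryable; 400 is deliberately absent.
RETRYABLE_STATUSES = {429, 500, 503}


def backoff_delay(
    attempt: int, base: float = 0.5, multiplier: float = 2.0, max_delay: float = 30.0
) -> float:
    """Full-jitter backoff: uniform between 0 and the capped exponential ceiling."""
    ceiling = min(max_delay, base * multiplier ** attempt)
    return random.uniform(0.0, ceiling)


def next_retry_delay(
    status: int,
    attempt: int,
    max_attempts: int = 5,
    retry_after: Optional[float] = None,
) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to stop retrying.

    `attempt` is the zero-based index of the attempt that just failed.
    """
    if status not in RETRYABLE_STATUSES:
        return None  # e.g. 400: the request will fail the same way every time
    if attempt + 1 >= max_attempts:
        return None  # retry budget exhausted; the caller decides what to show
    if status == 429 and retry_after is not None:
        return retry_after  # honor the server's Retry-After interval
    return backoff_delay(attempt)
```

Returning `None` rather than raising keeps the decision about what to present to the user ("try again later" or a degraded experience) with the caller, as the policy describes.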

When the retry policy is made locally for each integration, the application ends up with retry behavior that varies by dependency: the Stripe integration has exponential backoff with jitter because the engineer who built it had worked with payment APIs before; the internal microservice call has three immediate retries because the engineer who built it thought three attempts seemed reasonable; the email send has no retry because the engineer who built it thought email was fire-and-forget. The result is a system whose behavior under partial failure — the most common production failure mode — is hard to characterize, hard to test, and hard to explain to on-call engineers. Platform teams that provide shared HTTP clients, queue consumers, or task schedulers are the natural place to encode the retry policy decision into infrastructure rather than leaving it to each team to decide independently — but only if the policy has been decided and documented.

The observability contract: signal versus noise

The most common symptom of an undocumented error handling strategy is alert fatigue. The on-call rotation receives alerts that nobody reads because the error rate is always elevated and "elevated" has been the baseline for so long that it no longer signals anything actionable. The Slack error channel has alerts firing continuously that the team has learned to ignore. A real production incident starts and the initial alerts are dismissed as noise, until an engineer checks the dashboard and realizes the error rate is ten times the usual elevated baseline.

Alert fatigue is not primarily a tooling problem — it is a consequence of the undocumented observability contract. The observability contract answers three questions: what gets logged (and at what level), what triggers an alert, and what is explicitly ignored. Without a documented contract, each engineer makes their own determination when implementing a feature: "this seems like it should be an error log," "this failure is probably not worth alerting on," "I'll just let the framework handle this." The accumulation of these individual determinations produces a logging strategy where the same class of event is sometimes logged at error, sometimes at warning, sometimes at info, and sometimes not logged at all, depending on who implemented that feature. It produces an alerting strategy where some expected errors trigger pages and some unexpected errors produce no alert at all.

The observability contract is often treated as an operations problem rather than a product development problem, which is why it tends to be addressed reactively after alert fatigue has set in rather than proactively as a design decision. The reactive response, "let's review our alerts and prune the noisy ones," is a clean-up operation, not a strategy. Without a documented strategy for what should and should not alert, the next feature will recreate the same noise, because the next engineer has no guidance on the question "should I alert on this?"

A documented observability contract gives that guidance in the form of a taxonomy. "Error-level logs are for unexpected failures that require engineer investigation — failures that should not happen in normal operation and that indicate either a bug or an infrastructure problem. Warning-level logs are for expected failures that may require operational attention — rate limit hits on external APIs, validation errors from internal callers, resource contention that resolved — but that are individually expected and do not require immediate engineer investigation. Info-level logs are for significant events in normal operation — successful operations on the critical path, state transitions, external API calls. Debug-level logs are for implementation detail that is only useful during active debugging and should not appear in production. Alerts fire on: error-level events above a threshold rate per endpoint; latency above SLA thresholds on the critical path; health check failures. No alert fires on: validation errors (422s); authorization failures (403s); not-found errors (404s); any expected operational event that has been explicitly categorized as warning or below."
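The alerting half of such a contract is mechanical enough to sketch. In the illustration below, the threshold value and the `endpoints_to_alert` function are assumptions; the rule it applies comes from the contract quoted above, where only error-level events count toward an alert and anything categorized as warning or below never does, however frequent.

```python
import logging
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical threshold: error-level events per endpoint per evaluation window.
ERROR_ALERT_THRESHOLD = 5


def endpoints_to_alert(
    events: Iterable[Tuple[str, int]], threshold: int = ERROR_ALERT_THRESHOLD
) -> Set[str]:
    """Given (endpoint, log_level) pairs from one window, return endpoints to alert on."""
    # Warning-level and lower events (validation errors, 403s, 404s, rate
    # limit hits) are expected by the contract and are ignored here.
    error_counts = Counter(
        endpoint for endpoint, level in events if level >= logging.ERROR
    )
    return {
        endpoint for endpoint, count in error_counts.items() if count > threshold
    }
```

Under this rule a burst of a hundred validation warnings on one endpoint produces no alert, while six error-level events on another endpoint in the same window does.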

This taxonomy is not technically complex — it is four categories with one-sentence descriptions. Its value is that every engineer who builds a feature can apply it without judgment. The judgment has been recorded in the decision, not deferred to each implementation.

The user-facing error as a product decision, not just a technical one

The decision about what users see when something fails is often made by the engineer implementing the feature rather than by the product team, because it feels like a technical implementation detail. It is not a technical detail — it is a product decision that determines user experience at the moments when the product is failing, which are the moments when users form their strongest impressions of whether the product is trustworthy and competent.

An error message that says "Something went wrong. Please try again." is a product decision: the team decided that the generic message was acceptable across all failure modes. An error message that says "We couldn't process your payment. Please check your card details and try again, or contact your bank." is also a product decision: the team decided that payment failures warrant enough specificity to reduce support ticket volume and to help users resolve the issue themselves. Neither decision is wrong in the abstract, but neither should be made individually by the engineer implementing the payment endpoint — it should be made at the product level, informed by user experience principles and support ticket data, and applied consistently across similar failure modes.

The error message decision connects to the product decision record: user-facing error copy is a product artifact that deserves the same decision record treatment as any other product copy. The decision about the copy implies decisions about specificity, tone, and the action the message asks the user to take — decisions that interact with brand voice, support process, and the product's promise about what it can do for the user. Leaving these decisions to individual engineer implementations produces inconsistent voice and tone across failure modes, inconsistent guidance about what the user should do next, and inconsistent specificity that gives users more detail in some cases and less in others without a principled reason.

How error handling strategy degrades over time

An error handling strategy that starts well-designed degrades through the same mechanism that any undocumented strategy degrades: gradual accumulation of exceptions that are not recognized as exceptions at the time they are made. The strategy says "all operations on the payment path fail fast." A new feature is added that calls a feature-flag service before processing a payment — "to determine whether the new pricing algorithm applies." The engineer implementing the feature makes the reasonable judgment that the feature flag service is an ancillary dependency: if it fails, the old pricing algorithm should apply as a default, rather than blocking the payment. The operation degrades gracefully on feature flag service failure, which contradicts the documented strategy for the payment path.

This exception is defensible: degrading gracefully to the old pricing algorithm is probably the right behavior. The problem is that it was made locally, without updating the strategy to reflect the new classification. The next engineer who reads the strategy and sees "all operations on the payment path fail fast" doesn't know about the exception. When the feature flag service fails in a way that degrades to an incorrect default (a bug in the fallback logic, for example), the engineer investigating assumes the failure is an unexpected system failure rather than the execution of an intentional exception they didn't know about.
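Updating the strategy document is the real fix, but the code can also carry a pointer to the exception so that the investigating engineer finds it. The sketch below is a hypothetical illustration: the function name, the logger name, and the log wording are assumptions, and `fetch_flag` stands in for whatever client the feature flag service provides.

```python
import logging
from typing import Callable

logger = logging.getLogger("app.pricing")  # hypothetical logger name


def use_new_pricing(fetch_flag: Callable[[], bool]) -> bool:
    """Decide whether the new pricing algorithm applies.

    Documented exception to "all operations on the payment path fail fast":
    if the feature flag lookup fails, default to the old pricing algorithm
    instead of blocking the payment.
    """
    try:
        return bool(fetch_flag())
    except Exception:
        # The log line names the fallback explicitly, so an incident
        # responder can tell intentional degradation from an unexpected failure.
        logger.warning(
            "feature flag lookup failed; defaulting to old pricing "
            "(documented exception to payment-path fail-fast)",
            exc_info=True,
        )
        return False
```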

The degradation of error handling strategy is often what the new technical leader encounters when they audit the codebase: error handling behavior that varies across features without an apparent reason, a mix of explicit and implicit failure modes, retry logic that exists in some places and not others, error taxonomies that are mostly consistent but have unexplained exceptions. Without a record of the original strategy and of the exceptions made to it since, the variation looks like inconsistency or oversight. With the strategy and the exception log, it looks like a living system that has accreted reasonable local decisions that simply went undocumented. The technical leader who understands this distinction makes better decisions about what to standardize and what to preserve than one who reads variation as simply "poor engineering."

Writing the error handling strategy decision record

The Nygard ADR format adapts for error handling strategy with four sections that most teams leave entirely undocumented.

The error taxonomy. Name the categories of errors the application produces and the consistent handling for each. "We recognize five error categories: (1) validation errors — the request is syntactically correct but semantically invalid; HTTP 422, log level: warning, user message: specific field-level feedback naming what was invalid; (2) authorization errors — the authenticated user lacks permission for the requested action; HTTP 403, log level: info, user message: 'You don't have permission to do this'; (3) not-found errors — the requested resource does not exist; HTTP 404, log level: info, user message: 'We couldn't find that'; (4) dependency failures — a required upstream service or database is unavailable; HTTP 503, log level: error, user message: 'We're having trouble right now — please try again in a few minutes', alert on: yes; (5) unexpected internal errors — exceptions that fall outside the above categories; HTTP 500, log level: error with full stack trace, user message: 'Something went wrong on our end', alert on: yes."
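One way to keep a taxonomy like this from drifting is to encode it once as data and have every endpoint look its response up there. The following is a minimal sketch of that idea, not a prescribed implementation: the names ErrorCategory, ErrorPolicy, TAXONOMY, and render_error are hypothetical, and the "alert: False" values for the first three categories are an assumption, since the example policy only states "alert on: yes" for the last two.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class ErrorCategory(Enum):
    VALIDATION = "validation"
    AUTHORIZATION = "authorization"
    NOT_FOUND = "not_found"
    DEPENDENCY_FAILURE = "dependency_failure"
    UNEXPECTED = "unexpected"


@dataclass(frozen=True)
class ErrorPolicy:
    http_status: int
    log_level: int
    user_message: Optional[str]  # None means: built per request from field-level detail
    alert: bool


# One row per category, copied from the example policy text.
# alert=False on the first three rows is an assumption (the text is silent there).
TAXONOMY: Dict[ErrorCategory, ErrorPolicy] = {
    ErrorCategory.VALIDATION: ErrorPolicy(422, logging.WARNING, None, False),
    ErrorCategory.AUTHORIZATION: ErrorPolicy(
        403, logging.INFO, "You don't have permission to do this", False),
    ErrorCategory.NOT_FOUND: ErrorPolicy(
        404, logging.INFO, "We couldn't find that", False),
    ErrorCategory.DEPENDENCY_FAILURE: ErrorPolicy(
        503, logging.ERROR,
        # \u2014 is the dash used in the policy's own wording of this message.
        "We're having trouble right now \u2014 please try again in a few minutes",
        True),
    ErrorCategory.UNEXPECTED: ErrorPolicy(
        500, logging.ERROR, "Something went wrong on our end", True),
}


def render_error(category: ErrorCategory,
                 field_detail: Optional[str] = None) -> Tuple[int, str]:
    """Return (HTTP status, user message) for a category and log at its level."""
    policy = TAXONOMY[category]
    if policy.user_message is None:
        # Validation errors carry specific field-level feedback, so the caller
        # must supply it; a generic message would violate the taxonomy.
        if not field_detail:
            raise ValueError("validation errors require field-level detail")
        message = field_detail
    else:
        message = policy.user_message
    logging.getLogger("errors").log(policy.log_level, "%s error", category.value)
    return policy.http_status, message
```

With a table like this, "should this be a 400 or a 422?" stops being a per-endpoint question: the endpoint names the category and the table answers.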

The failure mode policy. Name the general rule for when operations fail fast and when they degrade gracefully, along with the specific list of ancillary features that are exempt from the fail-fast rule. "Operations on the critical user path (any operation that completes a user's stated intent: checkout, save, submit, publish) fail fast on any dependency failure. Operations that provide ancillary enrichment (recommendations, analytics, feature flags, activity feeds) degrade gracefully if their dependency fails, with a defined fallback behavior named in the feature's own documentation. No operation fails silently: every caught exception is logged at warning level or above, even for operations on the graceful degradation path."
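The two halves of a failure mode policy can be made visible at the call site by routing every dependency call through one of two wrappers. This is a sketch under assumptions: call_critical, call_ancillary, and DependencyFailure are hypothetical names, and the logger name is arbitrary. What it illustrates is that the fail-fast path re-raises, the graceful path returns a named fallback, and neither path is silent.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("failure_policy")  # hypothetical logger name


class DependencyFailure(Exception):
    """Raised when a dependency on the critical user path is unavailable."""


def call_critical(name: str, operation: Callable[[], T]) -> T:
    """Fail fast: log at error level and surface the failure to the caller."""
    try:
        return operation()
    except Exception as exc:
        logger.error("critical dependency %s failed", name, exc_info=True)
        raise DependencyFailure(name) from exc


def call_ancillary(name: str, operation: Callable[[], T], fallback: T) -> T:
    """Degrade gracefully: log at warning level and return the fallback."""
    try:
        return operation()
    except Exception:
        # Never silent: the policy requires warning level or above here.
        logger.warning("ancillary dependency %s failed; using fallback",
                       name, exc_info=True)
        return fallback
```

A call such as call_ancillary("recommendations", fetch_recommendations, fallback=[]) (where fetch_recommendations is whatever the feature already uses) makes the classification reviewable: reclassifying a dependency means changing which wrapper is called, and that change shows up in code review.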

The retry and idempotency policy. Name the general rule for idempotency and retry. "All non-idempotent operations on the critical path require idempotency keys: the caller generates a unique key for each logical action, and the server deduplicates by key for a 24-hour window. All retries use exponential backoff starting at 100ms, doubling to a maximum of 30 seconds, with ±25% jitter. Retry conditions: 429 (after the Retry-After header interval), 503, and network-level timeouts. No retry on: 4xx errors other than 429; 500 errors (which indicate a server-side problem that retry will not resolve). Retry budget: maximum 5 attempts for synchronous user-facing operations, maximum 20 attempts for background jobs."
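The numbers in a retry policy like this translate almost directly into code. The sketch below is illustrative only: is_retryable, backoff_delay, and IdempotencyStore are hypothetical names, a status of None stands in for a network-level timeout, the in-memory dictionary stands in for whatever shared store a real server would use, and applying the ±25% jitter to the already-capped delay is one reading of the policy text.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple

BASE_DELAY_S = 0.1                 # 100ms starting delay
MAX_DELAY_S = 30.0                 # cap on the un-jittered delay
JITTER = 0.25                      # +/-25%
MAX_ATTEMPTS_SYNC = 5              # synchronous user-facing operations
MAX_ATTEMPTS_BACKGROUND = 20       # background jobs
DEDUP_WINDOW_S = 24 * 60 * 60      # 24-hour deduplication window


def is_retryable(status: Optional[int]) -> bool:
    """Retry on 429, 503, and timeouts (modelled here as status None)."""
    return status is None or status in (429, 503)


def backoff_delay(attempt: int, retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return retry_after         # for 429, honour the Retry-After interval
    base = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    # Assumption: jitter is applied after the cap, so a delay can reach 37.5s.
    return base * random.uniform(1 - JITTER, 1 + JITTER)


class IdempotencyStore:
    """In-memory stand-in for server-side deduplication by idempotency key."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: Dict[str, Tuple[float, object]] = {}

    def run_once(self, key: str, action: Callable[[], object]) -> object:
        """Run `action` at most once per key within the dedup window."""
        now = self._clock()
        hit = self._seen.get(key)
        if hit is not None and now - hit[0] < DEDUP_WINDOW_S:
            return hit[1]          # replay the stored result, no side effect
        result = action()
        self._seen[key] = (now, result)
        return result
```

The two pieces depend on each other: retrying a 503 or a timeout on a payment call is only safe because the server replays the stored result for a repeated key instead of charging again.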

The observability contract. Name the four log levels with their trigger conditions and the alert threshold. "Error level: unexpected failures that require engineer investigation. Warning level: expected operational failures that may require attention but are individually unremarkable. Info level: significant events in normal operation. Debug level: implementation detail, disabled in production. Alerts: error-level events above 0.1% of requests on any endpoint for more than 2 minutes; latency above 500ms p99 on the critical path for more than 1 minute; any dependency health check failure. Explicitly not alerting: 4xx responses of any type; warning-level events below a 5% error rate per endpoint; expected operational retries below the maximum retry budget."
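An observability contract is easier to triage against when its thresholds are written as an executable check instead of living only in a dashboard configuration. The following is a rough sketch, not a monitoring integration: MinuteSample and alert_reasons are hypothetical names, the input is assumed to be one sample per minute for a single endpoint, and "for more than N minutes" is approximated as the most recent N+1 one-minute samples all breaching the threshold.

```python
from dataclasses import dataclass
from typing import List, Sequence

ERROR_RATE_THRESHOLD = 0.001   # error-level events above 0.1% of requests
ERROR_RATE_MINUTES = 2         # ...for more than 2 minutes
LATENCY_P99_MS = 500.0         # p99 latency above 500ms on the critical path
LATENCY_MINUTES = 1            # ...for more than 1 minute


@dataclass(frozen=True)
class MinuteSample:
    """One minute of metrics for one endpoint (assumed sampling interval)."""
    requests: int
    error_events: int
    p99_latency_ms: float


def _breached_for_more_than(flags: Sequence[bool], minutes: int) -> bool:
    # Assumption: "more than N minutes" means the last N+1 samples all breach.
    needed = minutes + 1
    return len(flags) >= needed and all(flags[-needed:])


def alert_reasons(samples: Sequence[MinuteSample],
                  critical_path: bool,
                  dependency_healthy: bool) -> List[str]:
    """Return the contract clauses currently violated (empty means no alert)."""
    reasons: List[str] = []
    error_flags = [
        s.requests > 0 and s.error_events / s.requests > ERROR_RATE_THRESHOLD
        for s in samples
    ]
    if _breached_for_more_than(error_flags, ERROR_RATE_MINUTES):
        reasons.append("error_rate")
    if critical_path:
        latency_flags = [s.p99_latency_ms > LATENCY_P99_MS for s in samples]
        if _breached_for_more_than(latency_flags, LATENCY_MINUTES):
            reasons.append("latency_p99")
    if not dependency_healthy:
        reasons.append("dependency_health")
    return reasons
```

Because 4xx responses and warning-level events never enter error_events in this sketch, the "explicitly not alerting" list is enforced by what gets counted, and an alert always arrives with the name of the clause that fired.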

The revisitation conditions for error handling strategy follow the same pattern as test strategy revisitation conditions: "Re-evaluate this strategy if: (1) a class of production incidents consistently escapes the current alerting policy — if the team discovers after the fact that the observability contract's thresholds were too high or too low to surface the incident during its early stages; (2) a new integration category is added that doesn't fit the existing taxonomy — an async job processor or a webhook receiver may require a separate error classification from the synchronous API; (3) the support ticket volume for a specific error category is consistently high — user-facing error messages that generate support tickets are not specific enough for the user to self-resolve; (4) the on-call alert volume reaches a level where genuine incidents take more than 15 minutes to be identified because the alert signal is below the noise floor."

Finding error handling decisions in AI chat

The WhyChose extractor surfaces error handling decisions from four session types in AI chat.

The design question session is where individual error handling choices are first made. "What should I return when the database is down?", "Should this be a 400 or a 422?", "Is it okay to just throw an error here?", "How should I handle the case where the user ID doesn't exist in the database?" These sessions contain the reasoning behind individual choices — the engineer's understanding of the context, the options they considered, the recommendation they received. Individually, each session represents a local decision. Extracted together, they reveal the pattern of local decisions that constitute the implicit error handling strategy. The pattern may be internally consistent or inconsistent, and the review process of extracting and examining the full set is often the first time the team sees the whole picture at once.

The production incident session is where the cost of a specific error handling decision first becomes visible. "Users are seeing a blank screen after checkout fails — how do I show them a proper error?", "Why is the 500 error in the logs not giving me enough context to debug this?", "We had a duplicate charge — why did the retry happen twice?", "The on-call got paged at 3am for what turned out to be expected rate limiting from the Stripe API — how do we stop alerting on this?" Each of these sessions identifies a specific failure in the implicit error handling strategy and often contains a fix that is applied locally without being reflected in a documented policy update. The post-mortem to ADR pipeline is the mechanism for converting these local fixes into documented policy changes — but only if the team knows to look for the pattern of incidents, not just the individual incidents.

The retry and idempotency session is where the team first confronts the non-idempotent operation problem, usually triggered by a duplicate payment or duplicate email complaint. "We got a report that a user was charged twice — how do I add idempotency to the payment endpoint?" is one of the most valuable sessions in any payment product's chat history. It contains the engineer's first explicit engagement with idempotency as a concept, the specific solution they implemented, and the reasoning for the scope of the change — did they add idempotency keys to just the payment endpoint, or to all mutations on the payment path? The scope decision is the policy decision, and it is almost always implicit in the conversation.

The observability frustration session is where the alert fatigue problem surfaces. "Our Slack error channel is so noisy that nobody reads it — how do we fix this?", "We keep getting paged for errors that are just expected behavior — how do we filter those out?", "The error logs have so much detail that it's hard to find the important ones." These sessions contain explicit naming of what the team considers "expected" versus "unexpected" errors — the raw material for the observability contract — but the naming is typically applied to the immediate noise problem rather than generalized to a policy. The quarterly review is the mechanism for extracting these sessions and asking: what category system did we implicitly apply when we decided which alerts to suppress? That implicit category system is the observability contract that should be documented.

What the error handling strategy record protects

A documented error handling strategy protects three things that the implicit strategy leaves vulnerable.

It protects users from variable experience quality at failure moments. Users who interact with an application whose error handling strategy is documented and consistently implemented get specific, actionable error messages when they make mistakes, clear feedback about whether the problem is on their end or the system's, and consistent behavior across features that allows them to build mental models about what to do when something fails. Users who interact with an application whose error handling is inconsistent get generic messages in some places and overly technical detail in others, different messages for the same class of error depending on which feature they're using, and no reliable signal about whether retry will help or not.

It protects the on-call engineer from false alerts and missed incidents. A team with a documented observability contract knows what an alert means: when an alert fires, the system has crossed a named threshold for a named category of failure. The alert can be triaged against the contract ("is the error rate above our documented threshold for this endpoint?") rather than requiring investigative work to determine whether the alert is expected noise or a real incident. The team with no documented observability contract spends that investigative work on every alert, which is why on-call rotations are exhausting even when there is no real outage.

It protects the design review process from late discovery of error handling requirements. A team that includes error handling strategy in their design review — "how does this feature handle database failure? what does the user see? what gets logged? does any operation on this path require an idempotency key?" — discovers implementation requirements before the feature is built, when they are cheap to design in. A team that doesn't include error handling in design review discovers these requirements after the feature is live, in the form of support tickets, production incidents, and refactoring work to retrofit consistent error behavior onto a feature that was designed without it. The decision record is the reference that makes the design review questions concrete rather than aspirational.