The incident response playbook decision record: why the on-call rotation design you chose determines your MTTD floor and your runbook staleness rate

2026-07-03 · Decision record · Incident response · On-call operations

The on-call rotation and escalation policy are usually decided in the aftermath of a production incident — not before one. The first major outage produces a flurry of decisions: who should have been paged, who should have been escalated to, how the runbook should have been structured, what the severity classification should have been. These decisions are made under pressure, implemented immediately in PagerDuty and Confluence, and never written down as a deliberate process choice with explicit maintenance obligations. Eighteen months later, the engineer who set up the rotation has left, the runbooks reference dashboard panels that were renamed when the monitoring stack was upgraded, the escalation policy routes all database alerts to the infrastructure team including the ones caused by application-layer query behavior that the infrastructure team cannot fix, and the effective MTTD is determined not by how fast the alert fires but by how long it takes the on-call engineer to navigate from an accurate alert to an actionable understanding of what is wrong.

The incident response playbook is not a single document — it is the set of decisions that govern what happens between the moment an alert fires and the moment the incident is closed and reviewed. The on-call rotation model, the alert routing rules, the runbook structure, the severity classification policy, the escalation path, and the post-incident review process are each a decision with structural consequences. None of these are technical implementations in the sense of code that can be read to understand the current state; they are process decisions that live in PagerDuty configuration, in Confluence pages, in Slack norms, and in the institutional memory of the engineers who have been on-call long enough to know what the runbooks actually mean. When those engineers leave, the playbook goes with them.

The midnight runbook navigation problem

A 25-person SaaS company built a single-person engineering on-call rotation in their first year. The rotation was designed around the assumption that the primary on-call engineer — a senior engineer who had been at the company since founding and understood every service — would handle all incidents with a secondary escalation path to the CTO. The runbooks were written by the same senior engineer in a burst of documentation effort after the third major outage and lived in a Confluence space organized by service name. Each runbook had three sections: symptoms, remediation steps, and escalation contact. The remediation steps referenced Grafana dashboard panels by name.

Fourteen months later, the senior engineer went on a two-week vacation. The engineer covering the rotation during that vacation was a mid-level engineer who had joined four months earlier and had handled two incidents, both minor, both covered by runbooks whose instructions she could follow without prior context. At 2:17am on a Wednesday, PagerDuty paged her for a database connection pool exhaustion alert on the primary API service. The alert had fired before; the runbook said: "1. Open the db_connections panel in Grafana and verify the pool utilization percentage is above 90%. 2. Check the slow_queries panel for any queries consuming more than 500ms. 3. If pool utilization exceeds 95%, proceed to step 4."

There was no db_connections panel. The Grafana instance had been migrated from version 8 to version 10 six months earlier; the dashboard had been rebuilt during the migration and the panels renamed for consistency with the new naming convention. The connection pool panel was now called "DB Pool: Active / Max" on a dashboard called "Database Health" rather than the old "PostgreSQL Overview" dashboard. The runbook had been written against the old dashboard and never updated after the migration because the migration was treated as a display-layer change with no incident response implications.

The engineer spent 22 minutes finding the equivalent panel on the new dashboard. She found it — pool utilization was at 93% and rising. Step 2 referenced the slow_queries panel, which on the new dashboard was "Query Performance: P50 / P95 / P99 Latency." She found the equivalent. P95 query latency was normal. Step 3 directed her to step 4 at 95% pool utilization. Step 4 said: "If slow queries are not the cause, check the background job queue for stuck jobs. See the background jobs runbook at runbook.internal/background-jobs." The runbook.internal domain had been decommissioned four months earlier when the company migrated from a self-hosted wiki to Confluence. The URL returned a DNS resolution failure.

At 3:01am, pool utilization reached 98%. The engineer escalated to the CTO via PagerDuty. The CTO was asleep; the escalation notification arrived at 3:01am; he saw it at 3:17am and opened the Slack thread. By 3:19am, the database had exhausted all available connections and the primary API endpoint was returning 500 for every request. The CTO, who knew the background job system, identified within four minutes that a bulk export job added to the codebase three months earlier was holding connections open for the duration of long-running exports. The fix, adding a timeout and connection release to the export job, was deployed by 4:02am. Total incident duration from pool exhaustion to service restoration: 43 minutes. Total elapsed time from first alert to service restoration: 1 hour and 45 minutes. The runbooks had not been updated after the Grafana migration, the background-jobs runbook link had not been fixed when the wiki was decommissioned, and nobody had reviewed the runbooks after the bulk export job was added to the codebase.

The follow-the-sun handoff problem

A 40-person B2B SaaS company expanded internationally over three years, adding engineering teams in the UK and India so that a follow-the-sun on-call model could cover the hours outside the US-Pacific working day. The rotation was designed to hand off at 15:00 UTC (UK to US-Pacific), at 23:00 UTC (US-Pacific to India), and at 07:00 UTC (India to UK). Each team maintained its own section of the incident response playbook in Confluence; the US team's playbook was the original document, and the UK and India sections were added by the respective teams when they joined the rotation, without coordination on the overall structure. The escalation policies were team-specific: UK escalation went to the UK engineering lead, US escalation went to the VP Engineering, India escalation went to the India team lead. There was no documented cross-team escalation policy for incidents that started in one team's window and remained active at the handoff.

At 14:52 UTC on a Thursday, an alert fired for a sustained error rate on the payment webhook receiver endpoint. The UK on-call engineer, Priya, acknowledged the alert at 14:55 UTC. She opened the webhook service dashboard, identified elevated error rates (11% of webhook deliveries returning 500), and began investigating the application logs. At 15:02 UTC — two minutes past the UK-to-US-Pacific handoff — PagerDuty rotated the on-call to the US-Pacific engineer, Marcus. PagerDuty showed the original alert as acknowledged (by Priya) and did not re-page Marcus. Marcus's PagerDuty view showed one active incident, acknowledged, no action required from him unless the acknowledging engineer explicitly transferred it.

Priya, still investigating, posted to the #incidents Slack channel at 15:04 UTC: "Looking at webhook errors, seeing Stripe signature verification failures — checking if this is a key rotation issue." Marcus, who had just come on-call, was not monitoring #incidents actively at the start of his shift; the handoff protocol called for the outgoing engineer to post a handoff note, but the protocol was informal and not consistently followed — the UK team had developed the habit of posting handoff notes only for open P1 incidents. A P2 incident (which this had been classified as, based on the error rate being below the P1 threshold of 15%) was commonly handed off by the simple act of PagerDuty rotating the on-call assignment.

At 15:09 UTC, the error rate crossed 15% and PagerDuty generated a new alert for the same underlying incident (a separate alert rule for the P1 threshold). This alert paged Marcus, who acknowledged it. Marcus opened the webhook service dashboard, saw the elevated error rate, and without seeing Priya's Slack thread — which was in #incidents while Marcus was checking #engineering-alerts — restarted the webhook service process at 15:13 UTC. The restart cleared the in-memory metric counters and briefly reduced the visible error rate as the service initialized, which Marcus noted as a potential resolution. At 15:14 UTC, the error rate began climbing again as the restarted service encountered the same Stripe signature verification failures. Marcus was now confused: the restart should have cleared a transient issue, but the errors were returning.

At 15:17 UTC, Priya — still in the #incidents thread — posted: "Found it — the STRIPE_WEBHOOK_SECRET env var was rotated two days ago but the webhook service wasn't redeployed to pick up the new value. The service is still using the old signing secret. Need to force a redeploy with the new env var." Marcus, who had been investigating the wrong hypothesis (transient process crash rather than configuration mismatch), saw Priya's message at 15:19 UTC. He applied the fix — a forced redeploy to pick up the current environment variable — at 15:21 UTC. Error rate returned to normal by 15:23 UTC.

Total incident duration from first alert to resolution: 31 minutes. But the useful investigation work had been done by Priya in the 22 minutes between her acknowledgment at 14:55 UTC and her diagnosis at 15:17 UTC; the restart Marcus applied at 15:13 UTC consumed the 8 minutes until the fix at 15:21 UTC and produced no progress toward resolution while briefly masking the symptom. The root cause of the coordination failure was the undocumented handoff zone policy: the playbook had a rotation model but no documented protocol for incidents that start near the handoff time, no mechanism for the outgoing engineer to transfer active investigation context to the incoming engineer, and no alert deduplication policy that would have prevented Marcus from receiving a new P1 page for an incident Priya was actively investigating.

Three structural properties set at playbook design time

Both incidents share the same root structure: the incident response process was designed once, implemented in configuration and documents, and allowed to drift without a maintenance mechanism tied to the infrastructure changes that made the original design incorrect. Three properties of the playbook design decision determine whether the on-call rotation produces consistent incident response or consistently surprised responders.

1. On-call rotation model and MTTD floor

The on-call rotation model is the decision that determines who receives the first page for a given alert type and what context they have when it arrives. The rotation model sets an MTTD floor (the minimum achievable time from alert to substantive first response) because it determines the gap between alert arrival and the on-call engineer having enough context to take the first diagnostic action.

The MTTD floor is not determined by the alert firing latency (how quickly the monitoring system detects the problem and fires the alert) but by the cognitive distance between the on-call engineer's current knowledge and the knowledge required to begin investigation. An on-call engineer who owns the alerted service, has seen this alert type before, and has a runbook whose dashboard links are accurate can achieve an MTTD of 5–10 minutes from alert to first diagnostic action: acknowledge the alert, open the dashboard, verify the anomaly, identify the first remediation candidate. An on-call engineer covering a broad rotation that includes services they do not own, using a runbook whose Confluence links are stale, has to take a longer path to the same outcome: from the alert to the service owner (escalation), from the service owner to the runbook (if the service owner shares it), and from the runbook to the monitoring view (if the runbook's dashboard links are accurate). That path sets an MTTD floor of 20–45 minutes for the same underlying alert type, independent of the engineer's skill level.

The rotation model choices exist on a spectrum between breadth and depth. A breadth-first rotation — all engineers rotate on-call for all services — distributes the on-call burden evenly but produces high MTTD for every alert that falls outside the current on-call engineer's primary domain. This model is appropriate for small teams where every engineer knows every service well enough to respond, but degrades as the service surface grows and individual engineers develop specializations. A depth-first rotation — engineers are only on-call for services they own or have primary domain expertise in — produces low MTTD but creates coverage gaps when service owners are unavailable and generates high burden for services with few owners. A tiered model (primary on-call is the service owner, secondary on-call is a generalist who escalates quickly to the service owner) is the structural compromise: the generalist handles the alert classification and first diagnostic steps, escalates to the service owner for remediation decisions, and the service owner's MTTD contribution is the escalation latency rather than the full investigation. This model requires the runbook to support a responder without domain expertise in its first-five-actions section — the generalist needs to be able to assess whether the alert is genuine and what service is affected without knowing the service's internal architecture.

The handoff zone is the period where the rotation model's MTTD properties are at their worst. An alert that fires 5 minutes before a rotation handoff will be handled by the outgoing engineer, who is cognitively preparing to go off-call, or by the incoming engineer, who has no context from the shift just ending. A rotation handoff protocol — requiring the outgoing engineer to post a structured handoff note covering all open alerts and their investigation state before the handoff time, and requiring the incoming engineer to read this note before acknowledging new alerts in the first 15 minutes of their shift — converts the handoff from a cold transfer of alert ownership to a warm transfer of incident context. The handoff note is not optional overhead; it is the mechanism that prevents the parallel investigation problem that cost Marcus and Priya 8 minutes of recovery time.
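The handoff zone can be made explicit by computing it from the rotation schedule. The sketch below is one possible way to do that, not a description of any PagerDuty feature: it uses the three handoff times from the follow-the-sun narrative, and the 15-minute window on either side and the function name `in_handoff_zone` are assumptions chosen for illustration.

```python
from datetime import datetime, time, timedelta, timezone

# Handoff times (UTC) taken from the follow-the-sun narrative.
HANDOFFS_UTC = [time(7, 0), time(15, 0), time(23, 0)]
# Assumed window: 15 minutes either side of each handoff.
ZONE = timedelta(minutes=15)


def in_handoff_zone(alert_at: datetime) -> bool:
    """Return True if the alert fired within ZONE of a scheduled handoff.

    `alert_at` is expected to be a timezone-aware datetime.
    """
    alert_at = alert_at.astimezone(timezone.utc)
    for handoff in HANDOFFS_UTC:
        # Check the previous, same, and next day so that alerts close to
        # midnight are compared against the nearest handoff.
        for day_offset in (-1, 0, 1):
            day = alert_at.date() + timedelta(days=day_offset)
            scheduled = datetime.combine(day, handoff, tzinfo=timezone.utc)
            if abs(alert_at - scheduled) <= ZONE:
                return True
    return False
```

An alert flagged this way could, for example, page both the outgoing and the incoming engineer, or require a handoff note before the rotation changes; which of those applies is a policy choice the playbook has to state.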

2. Runbook staleness model and the decay rate

A runbook is a point-in-time snapshot of three things: the monitoring view that represents the service's health, the remediation steps that address the alert's likely causes, and the escalation path for causes the on-call engineer cannot address. All three decay at rates proportional to the rate of change in the service's infrastructure and codebase. The db_connections panel reference in the first narrative decayed when the Grafana migration renamed it. The runbook.internal/background-jobs link decayed when the wiki was decommissioned. The runbook's coverage of background job connection leaks decayed when the bulk export job was added to the codebase without a runbook update.

The decay rate is not constant — it accelerates with team velocity. A team shipping weekly deployments and quarterly infrastructure upgrades generates more runbook-invalidating changes per month than a team shipping monthly. A runbook written for a service in its first year, when the team is experimenting rapidly, may be substantially stale within three months. A runbook written for a stable service with infrequent infrastructure changes may remain accurate for a year. The error is treating all runbooks with the same maintenance cadence regardless of the rate of change of the service they cover.

The maintenance trigger that works is event-driven rather than calendar-driven. Every deployment or infrastructure change that affects a monitoring dashboard (new panels, renamed panels, dashboard reorganization), an alert threshold (alert conditions that affect which runbook section applies), a service dependency (a downstream service that the runbook assumes is present), or a remediation procedure (a new restart command, a new environment variable check, a new connection drain procedure) must include a runbook review as a mandatory step. This is not a separate documentation task — it is a checkpoint in the deployment review process for the categories of changes that have historically invalidated runbooks. The Grafana migration and the bulk export job addition were both in this category; neither triggered a runbook review because the policy was not documented and the team did not know to apply it.
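As a rough illustration of an event-driven trigger, the sketch below maps the files touched by a change to the four categories named above. The path prefixes are hypothetical and would have to be replaced with the layout of the actual repository; the function name is also an assumption.

```python
# Hypothetical path prefixes, one tuple per trigger category from the text.
RUNBOOK_TRIGGERS = {
    "dashboard": ("monitoring/dashboards/",),
    "alert_threshold": ("monitoring/alerts/",),
    "service_dependency": ("deploy/services/",),
    "remediation_procedure": ("scripts/ops/",),
}


def runbook_review_reasons(changed_paths):
    """Return the sorted trigger categories touched by a change set.

    An empty list means the change does not require a runbook review
    under this (assumed) mapping.
    """
    reasons = set()
    for path in changed_paths:
        for category, prefixes in RUNBOOK_TRIGGERS.items():
            # str.startswith accepts a tuple of prefixes.
            if path.startswith(prefixes):
                reasons.add(category)
    return sorted(reasons)
```

Under this mapping, a Grafana migration that rewrites files under the dashboards directory would return `["dashboard"]` and therefore require a runbook review, while a change confined to application source would return an empty list.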

The staleness detection mechanism — the quarterly audit that finds stale entries — is not the primary mechanism; it is the fallback for changes that slipped through the event-driven trigger. For P1 services, a runbook not updated within 90 days generates an automatic ticket assigned to the service owner. For P2 services, the threshold is 180 days. These are not arbitrary intervals; they represent the typical time for a medium-velocity team's infrastructure changes to meaningfully invalidate runbook references. They are not guarantees of accuracy — a runbook updated yesterday can be stale if a migration happened this morning — but they are a floor that prevents the 14-month staleness that produced the first narrative's outcome.
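The fallback audit reduces to a date comparison. The sketch below uses the 90-day and 180-day thresholds from the paragraph above; the tuple shape of the runbook records and the function name are assumptions, and ticket creation is left out.

```python
from datetime import date

# Thresholds from the text: 90 days for P1 services, 180 days for P2.
MAX_AGE_DAYS = {"P1": 90, "P2": 180}


def stale_runbooks(runbooks, today):
    """Return the names of runbooks older than the threshold for their tier.

    `runbooks` is a list of (name, tier, last_updated) tuples; that shape
    is an assumption made for this sketch. Tiers without a threshold are
    ignored.
    """
    stale = []
    for name, tier, last_updated in runbooks:
        limit = MAX_AGE_DAYS.get(tier)
        if limit is not None and (today - last_updated).days > limit:
            stale.append(name)
    return stale
```

Each returned name would become a ticket assigned to the service owner. The check only measures age, so it cannot tell whether a recently edited runbook still matches the dashboards; that is why it stays a fallback rather than the primary mechanism.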

3. Escalation path design and the wrong-escalation cost

The escalation path documents who gets paged when the on-call engineer cannot resolve the incident within the escalation time threshold. The structural decision is whether the escalation target is role-based (escalate to the on-call secondary, who may or may not own the service) or ownership-based (escalate to the engineer or team who owns the affected service). The cost difference between these two models becomes visible in incidents where the affected service's expertise is not represented in the standard escalation chain.

A role-based escalation to the secondary on-call is appropriate when the secondary is a technical lead with broad service knowledge who can either resolve the incident or identify the correct domain expert faster than the primary can through alert investigation alone. This model works at small team sizes (5–10 engineers) where the secondary's broad knowledge covers most of the alert surface. It degrades at medium team sizes (15–40 engineers) where service specialization means the secondary on-call may have no prior context on 60–70% of the services that can generate P1 alerts. In the first narrative, the correct escalation target was not the CTO (who was reachable as the terminal escalation point) but the senior engineer on vacation — who knew the background job system — or an engineer who had worked on the bulk export job. The CTO knew the answer, but the 16-minute gap between escalation page and response is structurally embedded in the model: the CTO escalation fires after the primary on-call engineer has exhausted their local diagnostic options, and the CTO response time at 3am is a feature of human availability, not of the escalation design. An ownership-based escalation that paged the engineer who authored the bulk export job would have produced a faster resolution path.

The wrong-escalation cost is not just the escalation latency. When the escalation reaches an engineer who does not own the affected service, that engineer must either spend time becoming oriented enough to help (adding to MTTD) or spend time identifying the correct owner and transferring the escalation (adding one full escalation timeout to MTTD, because the transfer requires the second engineer to acknowledge the incident and initiate a new escalation or direct contact). In PagerDuty, the escalation timeout between levels is typically 15–30 minutes. A wrong first escalation adds this timeout to MTTD before the correct engineer is reached. For a P1 incident with a 60-minute recovery time objective, a single wrong escalation can consume 25–50% of the RTO before the right person is engaged.
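The 25–50% figure follows directly from dividing the escalation timeout by the recovery time objective. A small worked example, using only the numbers quoted above (the function name is illustrative):

```python
def wrong_escalation_share(timeout_minutes: float, rto_minutes: float) -> float:
    """Fraction of the RTO consumed by one wrong escalation hop."""
    return timeout_minutes / rto_minutes


# Escalation timeouts of 15 and 30 minutes against a 60-minute RTO.
for timeout in (15, 30):
    share = wrong_escalation_share(timeout, 60)
    print(f"{timeout}-minute timeout: {share:.0%} of a 60-minute RTO")
```

This counts only the timeout itself; any time the wrongly paged engineer spends orienting before transferring the incident comes on top of it.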

The escalation path design must specify the ownership mapping: which alerts route to which service owner, what the escalation timeout is per severity level, and what the terminal escalation (the fallback when service ownership is ambiguous or the owner is unreachable) is for each alert category. This mapping is not derivable from the service registry alone — it requires decisions about how cross-service incidents (where multiple service owners are implicated) are coordinated, who serves as the incident commander for multi-team incidents, and how alert deduplication prevents the parallel investigation problem the follow-the-sun narrative produced.

Five sections the incident response playbook ADR must contain

1. On-call rotation model and coverage policy

The first section documents the rotation model selection and its structural implications. The rotation model selection is not a PagerDuty configuration decision — it is an architectural decision about who holds domain context and how that context is transferred when context-holders are unavailable. The ADR should specify: the rotation model (breadth-first, depth-first, or tiered primary/secondary), the rotation cadence (weekly, biweekly, or follow-the-sun by timezone), the coverage scope per rotation slot (which services and alert types are covered by which rotation), and the shadow on-call protocol for engineers newly added to the rotation.

The coverage policy must address what happens when the on-call engineer's rotation scope overlaps with a service they have never operated. The shadow on-call protocol, which requires a new on-call engineer to handle their first 2–3 rotation slots while paired with an experienced on-call who is contactable but not primary, converts the first live incident from a solo navigation exercise into a supervised apprenticeship. The shadow protocol is documented in the ADR because it affects staffing decisions: a team that adds two engineers per quarter needs to budget shadow on-call slots for those engineers, which means the experienced engineers on the rotation absorb additional on-call burden for the quarters when new engineers are onboarding to the rotation. Without this explicit accounting, the rotation model degrades in practice: new engineers are added to primary on-call before they have sufficient context, and incidents during their rotation slots produce the same MTTD expansion that the first narrative illustrated.

The handoff protocol is also specified in this section: the format of the handoff note (open alerts with their investigation state, any in-progress remediation, any anomalies noticed during the shift that did not generate alerts), the channel where handoff notes are posted, the timing requirement (15 minutes before rotation handoff), and the incoming on-call engineer's first-action obligation (read the handoff note before acknowledging any new alerts in the first 15 minutes of the shift). For follow-the-sun models, the handoff protocol must also specify what happens to an active incident at the handoff time: the standard is that an active P1 incident does not transfer at handoff until the incident commander explicitly hands off with a context transfer in the incident Slack thread, acknowledged by the incoming engineer.
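The handoff note fields listed above can be captured as a small data structure, which makes the active-P1 rule checkable. The sketch below is one possible shape under stated assumptions: the class names, field names, and the `may_transfer` helper are all hypothetical, not part of PagerDuty or Slack.

```python
from dataclasses import dataclass, field


@dataclass
class OpenAlert:
    alert_id: str
    severity: str              # "P1" through "P4"
    investigation_state: str   # free-text summary of what is known so far


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    open_alerts: list = field(default_factory=list)
    in_progress_remediation: list = field(default_factory=list)
    unalerted_anomalies: list = field(default_factory=list)
    # Set once the incoming engineer has read and acknowledged the note.
    acknowledged_by_incoming: bool = False


def may_transfer(note: HandoffNote) -> bool:
    """Ownership transfers freely unless a P1 is active.

    With an active P1, transfer waits until the incoming engineer has
    acknowledged the context handed over in the note.
    """
    has_active_p1 = any(a.severity == "P1" for a in note.open_alerts)
    return note.acknowledged_by_incoming or not has_active_p1
```

A note that lists a P1 alert and has not been acknowledged blocks the transfer; the same note with only P2 alerts does not, which matches the standard described above but would not have helped in the follow-the-sun narrative until the incident crossed the P1 threshold.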

The coverage policy must include a capacity constraint: the on-call engineer during their rotation slot must have less than 50% of their sprint workload committed to feature work that requires deep focus during potential alert hours. This is an organizational constraint, not a PagerDuty setting, and must be documented as an explicit expectation rather than an implicit assumption — engineering managers who schedule sprints without accounting for on-call capacity create a systematic under-resourcing of incident response that produces engineer burnout and missed rotation SLAs.

2. Alert routing rules and service ownership mapping

The second section documents how alerts are routed from the monitoring system to the on-call engineer, how severity is classified at alert definition time, and how alert deduplication prevents parallel investigation from parallel pages. Alert routing is not a technical configuration that can be inferred from PagerDuty — it is a set of policy decisions about which engineer should receive which alert and under what conditions the routing should escalate beyond the primary on-call.

The severity classification taxonomy must be explicit and agreed upon before incidents occur, not assigned ad hoc during live incidents. A common framework: P1 — user-visible complete failure of a core user flow (authentication failure, payment processing failure, data loss, complete API unavailability); P2 — degraded core functionality affecting a significant portion of users (elevated error rate above threshold, latency above SLA, partial feature unavailability); P3 — non-critical degradation affecting a minority of users or non-core flows; P4 — service anomaly not causing user impact, investigation recommended but no immediate action required. The classification criteria must be specific enough to make the classification unambiguous for a responder seeing the alert for the first time: an error rate of 15% meets the P1 threshold; an error rate of 5% does not. Ambiguous severity criteria produce inconsistent escalation behavior and post-incident disputes about whether the incident was handled at the appropriate urgency.

The ownership mapping translates alert type to service owner contact. The mapping is not the service registry — it is the service registry enriched with on-call contact information, escalation path, and a known-alert section that lists the alert types this service generates and which escalation path applies to each. A database connection pool exhaustion alert for the primary API service routes to the on-call primary (first), then to the API service owner (escalation), then to the infrastructure team (terminal escalation if the service owner is unreachable). A database connection pool exhaustion alert caused by application-layer query behavior cannot be fixed by the infrastructure team — the terminal escalation must route to the application team, not the infrastructure team, for alert types where the root cause is above the infrastructure layer.
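One way to represent the ownership mapping is a lookup from (service, alert type) to an ordered escalation chain. In the sketch below the service names, alert type names, and contact names are hypothetical; the first chain mirrors the example in the paragraph above, and the second shows an application-layer alert whose terminal escalation is the application team.

```python
# Hypothetical services, alert types, and contacts, for illustration only.
ESCALATION_CHAINS = {
    ("primary-api", "db_pool_exhaustion"): [
        "oncall-primary",
        "api-service-owner",
        "infrastructure-team",
    ],
    ("primary-api", "app_query_pool_pressure"): [
        "oncall-primary",
        "api-service-owner",
        "application-team",   # root cause sits above the infrastructure layer
    ],
}
# Assumed fallback when ownership is ambiguous.
DEFAULT_CHAIN = ["oncall-primary", "incident-commander"]


def next_contact(service, alert_type, level):
    """Return who to page at a given escalation level (0 is the first page)."""
    chain = ESCALATION_CHAINS.get((service, alert_type), DEFAULT_CHAIN)
    # Stay on the terminal escalation once the chain is exhausted.
    return chain[min(level, len(chain) - 1)]
```

Keeping the chains in one reviewed structure makes a wrong terminal escalation visible in code review rather than at 3am.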

The alert deduplication policy must specify what happens when multiple alert rules fire for the same underlying incident. The follow-the-sun narrative illustrated the failure mode: a P2 alert is acknowledged by the outgoing engineer, a P1 threshold triggers a new alert that pages the incoming engineer, and the incoming engineer has no context from the P2 investigation. The deduplication policy must answer three questions. When a new alert is generated for an already-acknowledged incident, should it page the acknowledging engineer, the current on-call engineer, or neither? How long after an incident is acknowledged does a new alert for the same service escalate to P1? How are alerts grouped in PagerDuty so that the incoming engineer can see the relationship between the new P1 alert and the already-acknowledged P2 alert? These are PagerDuty configuration decisions, but they must be documented as policy decisions in the ADR so that future configuration changes are made with an understanding of the policies they implement.
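To make the policy questions concrete, the sketch below encodes one possible answer, which is an assumption and not a recommendation from the source: a new alert for a service with an open incident is attached to that incident; if it is more severe than the incident, both the acknowledging engineer and the current on-call are paged; otherwise nobody is paged again. The names are hypothetical and this is plain Python, not PagerDuty configuration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    acknowledged_by: str
    severity: str  # "P1" through "P4"


def route_new_alert(service, severity, current_oncall, open_incidents):
    """Decide who is paged when a new alert may belong to an open incident."""
    for incident in open_incidents:
        if incident.service != service:
            continue
        # Lower number means higher severity: "P1" outranks "P2".
        escalated = int(severity[1]) < int(incident.severity[1])
        if not escalated:
            return {"attach_to": incident, "page": []}
        recipients = sorted({incident.acknowledged_by, current_oncall})
        return {"attach_to": incident, "page": recipients}
    # No open incident for this service: a normal first page.
    return {"attach_to": None, "page": [current_oncall]}
```

Applied to the narrative, the P1 alert at 15:09 UTC would have been attached to Priya's P2 incident and paged both Priya and Marcus, so Marcus would have seen that an investigation was already under way.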

3. Runbook structure and the mandatory maintenance trigger

The third section specifies the standard structure of each service runbook and the policy that triggers a mandatory runbook update. The runbook structure is standardized across all services so that a responder who has never responded to an alert for a given service can navigate the runbook without prior training. The standard sections: service description (one paragraph on what the service does, who uses it, and what the failure impact is), alert inventory (each alert this service can generate, its classification, its typical cause, and a link to the relevant runbook section), first-response protocol (a numbered checklist of the first five actions to take when any alert fires for this service, using actions a non-owner can perform: verify the alert is genuine, identify the scope of impact, check the health dashboard, verify the recent deployment history), known causes with remediation steps (one subsection per known alert cause, with numbered remediation steps, accurate dashboard references — panel names quoted exactly as they appear in the current dashboard — and expected outcome of each step), escalation path (for this service specifically: primary on-call, service owner escalation contact, infrastructure team escalation for infrastructure-layer causes, application team escalation for application-layer causes), rollback procedure (exact commands or deployment steps to roll back the last deployment if it is identified as the cause), post-incident verification checklist (steps to confirm the service is fully recovered before closing the incident).
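A standardized structure can be checked mechanically. The sketch below assumes a runbook has been parsed into a mapping of section title to body text, and that panel references are written as the word panel followed by the exact panel name in double quotes; both the parsing and the quoting convention are assumptions for illustration.

```python
import re

# Section titles taken from the standard structure described above.
REQUIRED_SECTIONS = [
    "service description",
    "alert inventory",
    "first-response protocol",
    "known causes",
    "escalation path",
    "rollback procedure",
    "post-incident verification",
]


def check_runbook(sections, current_panels):
    """Return a list of problems found in a parsed runbook.

    `sections` maps section title to body text; `current_panels` is the
    set of panel names that exist in the live dashboards.
    """
    problems = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in sections
    ]
    for title, body in sections.items():
        # Assumed convention: panel "Exact Panel Name"
        for panel in re.findall(r'panel "([^"]+)"', body):
            if panel not in current_panels:
                problems.append(f"{title}: unknown panel {panel!r}")
    return problems
```

Run against the first narrative's runbook, with the set of panel names exported from the rebuilt dashboard, a check like this would have reported the db_connections reference as unknown on the day of the Grafana migration.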

The mandatory maintenance trigger policy specifies the categories of change that require a runbook update before the change is merged. The categories: any change to a Grafana dashboard (new panels, renamed panels, reorganized dashboard structure — update runbook references to match the new panel names and dashboard location before the migration is complete); any change to alert rules (new alert rules need new runbook sections, modified thresholds need updated classification guidance, removed alert rules need runbook section removal); any change to service topology (split or merged services need runbook scope updates, new dependencies need runbook sections covering dependency failure scenarios); any change to the remediation procedure for a known cause (environment variable renames, new connection drain procedures, changed restart commands); any incident where the runbook was found to be inaccurate during response (the incident review must include a runbook update as a required action item, not a recommended one).

The trigger policy is implemented as a checklist item in the deployment review template for the change categories listed. Not as a reminder in a Slack channel, not as a quarterly audit, but as a blocking check in the pull request or change management process: infrastructure changes to monitoring dashboards, alert configurations, and service topology require the submitter to link the updated runbook section or document that no runbook changes are required with a brief justification. The Grafana migration in the first narrative was executed as an infrastructure upgrade with no runbook implications flagged because the policy was not documented. The policy's enforcement mechanism must be as visible as the deployment checklist itself.
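A blocking check of this kind can be as small as a pattern match on the change description. In the sketch below the `Runbook-Updated:` and `Runbook-No-Change:` markers are invented for illustration, as is the rule that a justification must be at least ten characters long; a real gate would hook into whatever review tooling the team already uses.

```python
import re

# Hypothetical markers a change description must contain.
RUNBOOK_LINK = re.compile(r"^Runbook-Updated:[ \t]*\S+", re.MULTILINE)
# A no-change declaration needs a justification of at least 10 characters.
NO_CHANGE = re.compile(r"^Runbook-No-Change:[ \t]*\S.{9,}", re.MULTILINE)


def runbook_gate(description, touches_trigger_category):
    """Return True if the change may merge under the trigger policy."""
    if not touches_trigger_category:
        return True
    return bool(RUNBOOK_LINK.search(description) or NO_CHANGE.search(description))
```

The gate does not verify that the linked runbook section is correct; it only forces the submitter to state, in a place reviewers will read, whether the runbook was considered.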

4. Incident severity classification and escalation protocol

The fourth section documents the severity classification criteria and the time-bound obligations that follow from each severity level. The classification criteria must be objective enough to be applied consistently by a responder who has been awake for 45 minutes in the middle of the night. Subjective criteria ("significant impact to users") produce inconsistent classification and inconsistent escalation behavior; objective criteria ("API endpoint returning 5xx at a rate above 5% for more than 2 consecutive minutes") produce consistent behavior even under cognitive load.
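An objective criterion such as the example above can be evaluated without judgment. The sketch below assumes the error rate is sampled once per minute and reads "more than 2 consecutive minutes" as at least three consecutive samples above the threshold; both the sampling interval and that reading are assumptions.

```python
def breaches_threshold(per_minute_error_rates, threshold=0.05, minutes=2):
    """True if the rate stays above `threshold` for more than `minutes`
    consecutive one-minute samples."""
    run = 0
    for rate in per_minute_error_rates:
        # Extend the current run, or reset it when the rate recovers.
        run = run + 1 if rate > threshold else 0
        if run > minutes:
            return True
    return False
```

The same function evaluated by the alerting system and quoted in the runbook gives the responder and the alert the same definition of the condition.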

The time-bound obligations per severity level define the playbook's operational commitments: P1 — acknowledge within 5 minutes of alert firing; post incident status update to the internal status channel within 15 minutes of acknowledgment; escalate to service owner if not mitigated within 20 minutes; post customer-facing status page update within 30 minutes if user impact is confirmed; brief engineering management within 30 minutes; resolve or produce a credible mitigation plan within 60 minutes. P2 — acknowledge within 15 minutes; escalate to service owner if not mitigated within 45 minutes; resolve within 4 hours. P3 — acknowledge within 30 minutes; resolve within the current business day; no stakeholder communication required unless requested. P4 — acknowledge within 2 business hours; schedule investigation within current sprint.
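The time-bound obligations translate into concrete deadlines once an alert's firing time is known. The sketch below covers only the acknowledgment and service-owner escalation limits quoted above, and it assumes both clocks start when the alert fires; the text does not say whether the escalation clock starts at firing or at acknowledgment, so that choice is an assumption. P4, which is measured in business hours, is left out.

```python
from datetime import datetime, timedelta

# Acknowledgment limits in minutes, from the text.
ACK_MINUTES = {"P1": 5, "P2": 15, "P3": 30}
# Minutes without mitigation before escalating to the service owner.
OWNER_ESCALATION_MINUTES = {"P1": 20, "P2": 45}


def deadlines(severity, fired_at):
    """Return {obligation: deadline} for an alert of the given severity."""
    result = {}
    if severity in ACK_MINUTES:
        result["acknowledge"] = fired_at + timedelta(minutes=ACK_MINUTES[severity])
    if severity in OWNER_ESCALATION_MINUTES:
        result["escalate_to_owner"] = fired_at + timedelta(
            minutes=OWNER_ESCALATION_MINUTES[severity]
        )
    return result
```

For a P1 alert firing at 15:09, this yields an acknowledgment deadline of 15:14 and a service-owner escalation deadline of 15:29.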

The incident commander role must be defined for multi-service incidents. When a P1 incident involves more than one service — the payment webhook failure in the second narrative involved both the webhook receiver service and the secrets management system that stored the Stripe signing secret — someone must be designated as the coordinator who owns the Slack thread, tracks the investigation state, prevents parallel investigations from conflicting, and makes the call on when the incident is resolved. Without an explicit incident commander designation, multi-service incidents produce the follow-the-sun narrative's failure mode: multiple engineers investigating simultaneously without coordination, with actions that can conflict (a restart that masks the symptom while another engineer is pursuing the root cause).

The escalation to management must also be tiered: at 30 minutes for a P1 with no mitigation, the VP Engineering or equivalent receives a page to be aware of the incident; at 60 minutes for a P1 with no mitigation, the CTO or equivalent receives a page to be available for decision authority (whether to roll back a deployment, whether to engage a vendor's emergency support, whether to post a customer communication that commits to a remediation timeline). Management escalation is not a request for technical help — it is a delegation of decision authority for business decisions that require seniority to make quickly. The CTO in the first narrative was paged at the wrong time and for the wrong reason: he was the repository of technical knowledge that the on-call engineer needed, not the decision authority for a business commitment. The correct escalation would have been to the engineer who knew the background job system, with a management escalation for customer communication at the 30-minute mark.
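The two management tiers can be sketched as a lookup that answers one question: which pages are due right now? The role names and purposes come from the paragraph above. The function name and the `already_paged` bookkeeping are hypothetical, and the elapsed time is assumed to be measured from alert firing.

```python
from datetime import timedelta

# (elapsed time without mitigation, role to page, purpose of the page)
P1_MANAGEMENT_TIERS = [
    (timedelta(minutes=30), "VP Engineering", "awareness"),
    (timedelta(minutes=60), "CTO", "decision authority"),
]


def management_pages_due(severity, elapsed, mitigated, already_paged=()):
    """Return the (role, purpose) pages that are due now and not yet sent."""
    if severity != "P1" or mitigated:
        return []
    return [
        (role, purpose)
        for threshold, role, purpose in P1_MANAGEMENT_TIERS
        if elapsed >= threshold and role not in already_paged
    ]
```

Carrying the purpose alongside the role keeps the distinction in the text explicit: the page states whether the recipient is being informed or asked to make a decision, and neither tier is a request for technical help.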

5. Post-incident review cadence and playbook update protocol

The fifth section documents the obligations that activate when an incident closes: which incidents require a postmortem, what the postmortem must contain, and how the postmortem's findings feed back into the playbook. The post-incident review process is the mechanism that converts incident experience into playbook improvement. Without it, each incident teaches the engineers who responded but produces no institutional improvement to the process that governed the response.

The postmortem requirement threshold: P1 incidents require a postmortem within 72 hours of close, without exception; P2 incidents require a postmortem if the incident was a repeat of a previous P2 incident (same service, same alert type) or if the MTTD or MTTR exceeded the target in the SLA by more than 50%; P3 incidents require a postmortem at team discretion, with a recommendation for incidents where the root cause reveals a systemic gap in runbook coverage or alert routing. The 72-hour deadline for P1 postmortems is a commitment, not a recommendation: investigations completed within 72 hours, while the incident is still in recent memory, produce substantially more actionable findings than investigations conducted the following week through Slack thread archaeology.
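The threshold rules can be expressed as a single decision function, which makes the edge cases explicit. This is a sketch with assumptions: the `ClosedIncident` fields are hypothetical, "exceeded the target by more than 50%" is read as strictly greater than 1.5 times the target, and severities the text does not cover (such as P4) return "unspecified" rather than a guessed policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ClosedIncident:
    severity: str                      # "P1", "P2", "P3", ...
    closed_at: datetime
    mttd: timedelta                    # measured detection time
    mttd_target: timedelta             # SLA target for detection
    mttr: timedelta                    # measured resolution time
    mttr_target: timedelta             # SLA target for resolution
    repeat_of_prior_p2: bool = False   # same service, same alert type


def postmortem_policy(incident):
    """Return (status, due_by). status is 'required', 'discretionary',
    'not_required', or 'unspecified'; due_by is set only for P1."""
    if incident.severity == "P1":
        return "required", incident.closed_at + timedelta(hours=72)
    if incident.severity == "P2":
        over_target = (
            incident.mttd > 1.5 * incident.mttd_target
            or incident.mttr > 1.5 * incident.mttr_target
        )
        if incident.repeat_of_prior_p2 or over_target:
            return "required", None
        return "not_required", None
    if incident.severity == "P3":
        return "discretionary", None
    return "unspecified", None
```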

The postmortem format must be standardized to ensure that each postmortem produces the same categories of output. Required sections: incident timeline (minute-by-minute from first alert to all-clear, reconstructed from PagerDuty timestamps, Slack thread timestamps, and deployment logs — not from memory); contributing factors (the conditions that allowed the incident to occur, categorized as code, configuration, process, communication, or external); MTTD and MTTR analysis (what the actual detection-to-response and response-to-resolution times were, what caused them to exceed the target if they did, what would need to change to achieve the target consistently); runbook review (was the runbook accurate and sufficient; what sections need updating; what new sections need to be created); alert review (did the alert fire at the right time; was the severity classification correct; should the threshold be adjusted); escalation review (did the escalation path reach the right person; was the escalation timeout appropriate for the incident; did parallel investigations occur, and if so, what handoff process failure caused them).

The mandatory action items from a P1 postmortem: at minimum, a runbook update or creation; an alert threshold review for the alert that fired; an escalation path review if the escalation reached the wrong engineer or exceeded the SLA. These are not optional improvements — they are the maintenance debt that the incident revealed and that must be paid before the next incident in the same service occurs. The playbook ADR is updated as postmortem action items close: the on-call rotation model, alert routing rules, runbook maintenance trigger, severity classification criteria, and escalation protocol evolve through postmortem findings rather than through advance design. The initial ADR is the foundation; the postmortem cadence is the maintenance mechanism.
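As a small illustration, the minimum action items for a P1 postmortem can be derived mechanically from the escalation review, so that none of them depends on someone remembering to file it. The function name, parameters, and item wording are hypothetical.

```python
def mandatory_p1_action_items(alert_name, escalation_reached_wrong_engineer,
                              escalation_exceeded_sla):
    """Return the minimum action items a P1 postmortem must produce."""
    items = [
        "Update or create the runbook for the affected service",
        f"Review the threshold for alert: {alert_name}",
    ]
    # The escalation path review is conditional, per the text above.
    if escalation_reached_wrong_engineer or escalation_exceeded_sla:
        items.append("Review the escalation path")
    return items
```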

What a ChatGPT session on incident response usually leaves out

Most AI chat sessions on incident response produce a PagerDuty setup guide, an escalation matrix template, and a runbook template with placeholder sections. The session covers the implementation mechanics — how to configure alert routing, how to structure an escalation policy in PagerDuty, how to write a runbook in the correct format. What the session rarely produces is the maintenance policy, the handoff protocol, the alert deduplication design for multi-alert incidents, and the ownership mapping that distinguishes infrastructure-layer escalation from application-layer escalation for the same alert type. These details require knowledge of the team's service topology, the team's incident history, and the organizational dynamics that determine who can actually make a remediation decision for a specific service at 3am on a Wednesday.

The decisions that matter most — who is on-call for which services, what the escalation path is for database errors caused by application-layer behavior, what triggers a mandatory runbook update after a Grafana migration — are made implicitly through the team's first several incidents and the ad hoc configurations that result from them. These implicit decisions are the incident response strategy; the PagerDuty configuration is their implementation. The strategy decisions live in the team's AI chat history: the post-incident discussion where someone said "we need to add a secondary on-call because this would have gone better if we had escalated to the backend team earlier," the runbook review session where someone said "the dashboard links are all wrong because of the Grafana upgrade, should we update them or wait until we have time to do it properly," the follow-up discussion two months later where someone said "wait, did we ever update those runbook links?"

An export of the AI chat sessions around incident response decisions typically surfaces the original on-call rotation design conversation, the escalation path debate (who should be the terminal escalation and why), and the series of post-incident discussions that each produced a runbook update commitment that may or may not have been completed. These conversations contain the documented version of the team's incident response strategy — the reasoning behind the choices, the alternatives that were considered and rejected, the specific incidents that drove each policy change. The five ADR sections above are the structured form of that strategy. Writing them down before the next incident, rather than reconstructing them from chat history after it, is the difference between an incident response process and an incident response aspiration.

Further reading