2026-06-26 · ~19 min read

The disaster recovery decision record: why the RPO and RTO targets you chose determine your cross-region replication cost and your recovery playbook gaps

Q: Why does a stated RPO of 5 minutes not guarantee that only 5 minutes of data will be lost in a real incident?

A stated RPO of 5 minutes means the team intends to lose no more than 5 minutes of data in a recovery scenario. Whether that target is achievable depends on the measured replication lag under production load, not the intended replication frequency. Asynchronous database replication has a lag that is a function of the write throughput, the network bandwidth between the primary and the replica, and the replica's apply throughput. Under normal load, the lag may be 30 seconds. During a write-heavy migration, a high-throughput event, or a network congestion period — exactly the conditions that precede many database incidents — the lag may be 30 minutes or more. A team that set a 5-minute RPO based on normal-load measurements and has never measured the replication lag under peak-write conditions will discover the true lag when they look at the replica's position relative to the primary during their first real incident. If the primary fails when the replica is 47 minutes behind, 47 minutes of data is lost, regardless of the stated RPO. The documented RPO must include the measurement methodology: the replication mechanism, the measured lag at P50 and P99 under production load, and the specific load condition at which lag was measured. A CloudWatch or Datadog monitor on replication lag that alerts when it exceeds the RPO target gives the operations team visibility into whether the RPO is currently achievable before an incident — an alert that fires regularly means the RPO target is aspirational, not contractual.

RPO and RTO targets are decided in the initial architecture session, written on a whiteboard, and never documented. A 340 GB database restore from S3 takes 7 hours when the assumed RTO was 2 hours. A cross-region replication lag of 47 minutes invalidates a 5-minute RPO that was set as a target without being measured under production load. The disaster recovery decision determines what data you lose and how long you are down when the incident you did not expect happens. It should be documented before you need it.

A 31-person SaaS company hosted a document management product on a single AWS region. The team's informal understanding of their disaster recovery posture was "we have nightly S3 backups, we can be back up in a couple of hours if something goes wrong." This understanding was not documented. It was not derived from a measurement of how long a restore actually took. It was an intuition formed in the initial infrastructure setup session, when the engineer who configured the RDS automated backup to S3 estimated the restore time based on the database size at that time — roughly 12 GB. The estimate was reasonable for a 12 GB database. Eighteen months later, the database was 340 GB.

At 11:17 PM on a Tuesday, the RDS primary instance failed due to a disk corruption event. The on-call engineer paged a colleague and they began the restore procedure. They discovered immediately that the procedure was documented only as "restore from S3 backup using AWS console" — there were no specific steps, no documented instance class for the recovery RDS instance, and no documented list of which services needed to be restarted after the database came up. The engineer chose a db.r5.2xlarge instance class (the same class as the failed primary) and started the restore.

The restore of a 340 GB snapshot from S3 to a new RDS instance took 6 hours and 53 minutes. This was not the "couple of hours" assumed in the team's informal RTO. The database had grown from 12 GB to 340 GB in 18 months, and no one had re-measured the restore time since the initial estimate. The outage was the sum of every step, not just the restore: detecting the failure (11 minutes), paging on-call (4 minutes), assessing the situation and starting the restore (14 minutes), waiting for the restore to complete (6 hours 53 minutes), working out which application services needed environment variable updates pointing to the new database endpoint (37 minutes), and restarting and smoke testing services (28 minutes). Along the way the team also verified that the most recent backup was only 18 minutes old, because the failure occurred shortly after the nightly backup completed. The total outage came to roughly 8 hours and 25 minutes. The team's enterprise customer, who had a contract with an uptime SLA, opened a breach-of-contract claim the following morning. The actual SLA commitment in the contract was 99.9% uptime, which allows approximately 8.76 hours of downtime per year. A single incident consumed nearly the entire year's allowance.

The disaster recovery posture of "nightly backup, a couple of hours to restore" was a decision that was never written down. It was never documented as a deliberate choice with the alternatives considered, the assumptions underlying the estimate, or the conditions under which the estimate would be invalidated. There was no record of what the team had chosen to accept: nightly backups meant a worst-case data loss window of up to 24 hours, with the actual loss equal to the time between the last backup and the failure. The 18 minutes the incident actually produced was lucky timing. There was no record of what the RTO was based on, what database size the estimate assumed, or when the estimate had last been validated. The engineer who configured the backup had left the company five months before the incident. No one currently on the team had been present for the decision.

A 22-person fintech had documented a disaster recovery commitment in their SOC 2 report: RPO of 5 minutes, RTO of 1 hour. The team had implemented cross-region replication from us-east-1 to us-west-2 using RDS read replica promotion. The 5-minute RPO was stated in the architecture document based on the team's understanding that asynchronous replication typically produces "low single-digit minute" lag. The 1-hour RTO was set based on a walkthrough of the recovery steps (estimated time for each step summed to 47 minutes, leaving a 13-minute buffer). Both targets were set in an architecture planning session and documented in a Google Doc that became the SOC 2 evidence artifact.

The team's annual DR test — required by their SOC 2 auditor — was scheduled for a Saturday morning in Q3. Three engineers joined a video call and began the recovery procedure. The first issue surfaced 8 minutes into the test: the IAM role referenced in the playbook for cross-region replica promotion — arn:aws:iam::123456789012:role/dr-failover-role — did not exist. It had been deleted seven months earlier during a permission audit when a junior engineer ran an automated scan for unused IAM roles; the role had not been used since the DR test the previous year and appeared dormant. The automated scan had not cross-referenced IAM role usage against DR playbooks. The deletion had been logged as a routine cleanup task, merged without review by a second engineer, and recorded as a completed infrastructure improvement.

Recreating the IAM role required a senior engineer who had not joined the test call, plus Terraform changes and a deployment. The reconstruction took 1 hour and 12 minutes, consuming the entire RTO budget before the failover itself had begun. The test was suspended and rescheduled. During the investigation that followed, the team also discovered that the replication lag on the us-west-2 read replica was being measured once per day by a CloudWatch metric with a 1-day evaluation period — not continuously. The one-day snapshot showed lag of 2 minutes. Continuous monitoring added after the failed test showed lag that varied between 2 minutes at low-write periods and 51 minutes during the end-of-month batch processing runs, when write throughput spiked 8× above the daily baseline. The 5-minute RPO was achievable at P50 throughput but not at P99. Under the conditions most likely to cause a regional failure — high-traffic events, batch processing — the RPO was closer to 51 minutes, not 5 minutes. The new CTO who joined two months after the failed DR test asked what the company's actual RPO was in production. No one could answer the question with a number they had measured.

The three structural properties that the disaster recovery decision determines

When teams set RPO and RTO targets in an initial architecture session, they are making a commitment about the cost and complexity of the recovery infrastructure they are willing to build and maintain. A 5-minute RPO requires a replication mechanism that can deliver 5-minute lag under P99 production load — not P50, not average, not light load. A 1-hour RTO requires that every step in the recovery process, from failure detection through smoke test and traffic restoration, can be completed end-to-end in under 60 minutes — as tested, not as estimated. The gap between stated targets and achievable targets is the structural consequence of three properties the DR decision determines.

Replication lag versus recovery point. The RPO is achievable only if the replication mechanism delivers that lag under the write throughput conditions that are most likely during a real incident, which are often peak conditions, not baseline. Asynchronous replication lag is a function of write throughput, network bandwidth between regions, and replica apply throughput. Under a normal weekday write load of 1,000 transactions per minute, asynchronous replication lag may be consistently under 60 seconds. Under the end-of-month batch processing write load of 8,000 transactions per minute, the same replication configuration may accumulate a lag of 47 minutes before the batch completes. A team that measures replication lag during business hours on a normal Tuesday and documents that measurement as validation of their 5-minute RPO has validated the RPO under the conditions most unlike a real failure scenario. The measurement that matters is the P99 replication lag under peak write throughput, sustained for the duration of a realistic high-activity period. The RPO documentation must state which measurement it is based on: the replication mechanism, the measured P50 and P99 lag values, the write throughput at which those values were measured, and the monitoring mechanism that produces a continuous lag signal so the operations team knows whether the RPO is currently achievable before they are asked to claim it during an incident. A CloudWatch monitor on replication lag that alerts when it exceeds 50% of the RPO target gives the team real-time visibility into RPO achievability. An alert that fires for three hours every end-of-month is not a noise issue. Three hours is only about 0.4% of a 30-day month, but it is a signal that the documented RPO is aspirational during exactly the window when write volume, and therefore the data at risk, is highest.
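
As a rough illustration of why the measurement window matters, the sketch below computes P50 and P99 from a list of lag samples and compares them with an RPO target. The sample data is hypothetical (one sample per minute, 30 seconds of lag outside a three-hour batch window at 47 minutes), and the function names are illustrative, not part of any monitoring tool.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one lag sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def rpo_report(lag_seconds, rpo_seconds):
    """Summarize lag samples against an RPO target (all values in seconds)."""
    p99 = percentile(lag_seconds, 99)
    over = sum(1 for lag in lag_seconds if lag > rpo_seconds)
    return {
        "p50": percentile(lag_seconds, 50),
        "p99": p99,
        "p99_within_rpo": p99 <= rpo_seconds,
        "fraction_over_rpo": over / len(lag_seconds),
    }

# Hypothetical data: one sample per minute for 30 days, with 30 s of lag
# except during a 3-hour batch window that sits at 47 minutes of lag.
batch_window = [47 * 60] * 180
whole_month = [30] * (30 * 24 * 60 - 180) + batch_window

month_report = rpo_report(whole_month, rpo_seconds=300)
batch_report = rpo_report(batch_window, rpo_seconds=300)
```

With this made-up data, the month-wide P99 stays at 30 seconds and looks healthy, while the P99 inside the batch window is 47 minutes. That is the reason to record the load condition alongside the percentile: a percentile taken over a long quiet period can hide the peak window entirely.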

Recovery time composition. The RTO is the sum of all steps in the recovery process, each of which must be pre-measured to produce a credible commitment. The steps are: failure detection time (how long before the monitoring alert fires), paging delay (time from alert to on-call engineer beginning active response), assessment time (determining the failure mode and confirming DR activation is the right response), failover decision authority (who has the authority to activate DR, how they are reached at 2 AM, and what the escalation path is if they are unreachable), DNS propagation time (how long after the failover record is updated before traffic reaches the recovery region — a function of the TTL on DNS records, which is a setting that must be pre-configured for fast failover, not left at the default 300 seconds), service startup time (how long each service takes to pass its health check after the recovery environment's infrastructure is active), database restore or replica promotion time (the longest component: measured restore time for the current backup size, not the size at the time the estimate was made), smoke test time (the minimum functional verification before traffic is restored), and traffic restoration time (updating load balancer targets, re-enabling health check routing, or completing DNS cutover). The sum of these pre-measured components is the realistic RTO floor. An RTO commitment below this floor is not achievable without reducing the time of one or more components. The multi-region deployment decision record addresses the infrastructure layer that makes the failover components possible — standby environment pre-provisioned versus created on demand, database replica pre-provisioned versus restored from backup — and directly determines the recovery time floor.
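
The composition above can be sketched as a simple sum over measured components. The step names follow the list in the paragraph, but every duration below is a hypothetical placeholder, not a measurement from either team in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryStep:
    name: str
    measured_minutes: int  # taken from the most recent DR test, not an estimate

def rto_floor_minutes(steps):
    """The realistic RTO floor is the sum of all measured components."""
    return sum(step.measured_minutes for step in steps)

def rto_gap(steps, committed_rto_minutes):
    """Compare the measured floor with the committed RTO and name the longest step."""
    floor = rto_floor_minutes(steps)
    longest = max(steps, key=lambda step: step.measured_minutes)
    return {
        "floor_minutes": floor,
        "over_by_minutes": max(0, floor - committed_rto_minutes),
        "longest_step": longest.name,
    }

# Hypothetical warm-standby measurements, in minutes.
steps = [
    RecoveryStep("failure detection", 5),
    RecoveryStep("paging delay", 5),
    RecoveryStep("assessment", 10),
    RecoveryStep("failover decision authority", 5),
    RecoveryStep("replica promotion", 12),
    RecoveryStep("service startup", 8),
    RecoveryStep("DNS propagation", 1),
    RecoveryStep("smoke test", 10),
    RecoveryStep("traffic restoration", 5),
]

gap = rto_gap(steps, committed_rto_minutes=60)
```

In this example the floor is 61 minutes against a 60-minute commitment, so the commitment is not achievable until at least one component gets faster. The longest step is the natural first candidate.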

Recovery playbook drift. A DR playbook is correct on the day it is tested. Infrastructure changes accumulate after that test: IAM roles are deleted during permission audits, S3 bucket paths are restructured when backup tools change, EC2 instance types are discontinued, EKS cluster contexts are renamed during upgrades, DNS hosted zone IDs are replaced during provider migrations. Each change is made by an engineer who is not thinking about the DR playbook — they are solving a local infrastructure problem. The change is logged in Terraform or CloudFormation state and in git, but there is no automated cross-reference between infrastructure change events and the DR playbook. A playbook that was fully validated by a DR test 11 months ago and has not been tested since has accumulated whatever infrastructure drift occurred in those 11 months. The drift is invisible until the playbook is run. During a quarterly DR test, a step that fails because it references a deleted IAM role produces a 40-minute delay and a playbook correction — a manageable outcome. During a real incident, the same failed step produces an outage extension and a SOC 2 finding. The test cadence is the only mechanism that converts invisible drift into visible maintenance work before the incident that makes it visible at the worst possible time. The postmortem that follows a DR test failure should produce an ADR update that adds the missing infrastructure as a dependency tracked in the playbook, so that future infrastructure changes in that dependency class are cross-referenced against the playbook before being merged.
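
One possible shape for the missing cross-reference is a check that runs on the JSON form of a Terraform plan (the output of `terraform show -json`) and flags any playbook dependency the plan would delete or replace. This is a minimal sketch: the resource addresses and the idea of keeping a list of playbook dependencies next to the playbook are assumptions made for illustration.

```python
import json

def playbook_resources_at_risk(plan, playbook_addresses):
    """Return playbook dependencies that the plan would delete or replace.

    `plan` is the parsed JSON plan; each entry in `resource_changes` carries an
    `address` and a `change.actions` list such as ["delete"] or ["delete", "create"].
    """
    at_risk = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions and change.get("address") in playbook_addresses:
            at_risk.append(change["address"])
    return sorted(at_risk)

# Hypothetical list of Terraform addresses the DR playbook depends on.
playbook_addresses = {"aws_iam_role.dr_failover", "aws_s3_bucket.dr_backups"}

# Hypothetical plan excerpt: a cleanup PR that removes two roles.
plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_iam_role.dr_failover", "change": {"actions": ["delete"]}},
    {"address": "aws_iam_role.old_ci_runner", "change": {"actions": ["delete"]}},
    {"address": "aws_s3_bucket.dr_backups", "change": {"actions": ["update"]}}
  ]
}
""")

at_risk = playbook_resources_at_risk(plan, playbook_addresses)
```

A CI job could fail the pull request whenever the returned list is non-empty, which forces the playbook owner to review the change before it merges.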

Disaster recovery options and their structural properties

No formal DR (implicit default). Many teams have no documented DR posture. They have backups — because the cloud provider's managed database service has automatic backups turned on by default — but no documented targets, no tested recovery procedure, and no named recovery infrastructure. This is not a decision that was made deliberately; it is the absence of a decision. The implicit RPO is "whenever the last backup ran," which may be daily or weekly depending on what defaults were left in place. The implicit RTO is "however long it takes to figure out how to restore," which includes the time to find the backup location, determine the current restore procedure, provision recovery infrastructure, and restart dependent services — work that has never been rehearsed. The document management team's 8-hour-25-minute outage is the canonical outcome of the implicit default: the posture was assumed to be adequate because no one had ever measured what it would actually produce. The implicit default is only appropriate for non-production environments and internal tools where extended outages have no business impact and no SLA commitments. Any service with an enterprise customer, an uptime SLA, or regulatory compliance requirements has already committed to a DR posture — the commitment is just implicit in the contract, not documented as a decision the team made with explicit awareness of the cost and trade-offs.

Backup and restore. The team takes regular backups (daily, hourly, or continuous with point-in-time recovery) and restores from the most recent backup when a failure occurs. The RPO is bounded by the backup interval: daily backups produce a maximum RPO of 24 hours; hourly backups produce a maximum RPO of 1 hour; continuous backup with point-in-time recovery reduces the RPO to near-zero for the data itself, but the achievable recovery point depends on the log tail shipping rate and durability. The RTO is the sum of restore time plus all subsequent recovery steps. For large databases, restore time dominates: the 340 GB RDS snapshot restore in the incident above took 6 hours and 53 minutes, and a larger instance class would not necessarily have shortened it, because restore throughput can be bounded by the S3 read rate and the RDS instance's initialization process rather than by instance size. Backup-and-restore is the correct DR approach when the RPO tolerance is measured in hours, the RTO tolerance is measured in hours, and the cost of maintaining warm standby infrastructure is not justified by the business impact of extended outages. It is not appropriate when the SLA commitment or regulatory requirement specifies an RTO below the database restore time for the current data volume. The database vendor decision constrains the backup and restore options: Postgres RDS automated backup uses AWS RDS snapshot and point-in-time recovery; a MySQL binary log replication approach has different restore semantics and different measured restore times for the same data volume. The restore time must be measured against the current backup size, not the size at the time the approach was chosen, and re-measured annually or after any significant data growth event.
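
The re-measurement rule can be turned into a small check. The sketch below flags a restore measurement as stale when it is more than a year old or the database has grown past a threshold since it was taken. The 25% growth threshold, the dates, and the 510 GB current size are hypothetical, and the linear extrapolation is only an assumption for rough planning. It does not replace an actual timed restore.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RestoreMeasurement:
    measured_on: date
    database_gb: float
    restore_minutes: float

def needs_remeasure(measurement, current_gb, today, max_age_days=365, max_growth=0.25):
    """True when the measurement is older than max_age_days or the data has outgrown it."""
    too_old = (today - measurement.measured_on).days > max_age_days
    outgrown = current_gb > measurement.database_gb * (1 + max_growth)
    return too_old or outgrown

def naive_restore_estimate_minutes(measurement, current_gb):
    """Linear extrapolation by size. An assumption: real restore time may not scale linearly."""
    return measurement.restore_minutes * current_gb / measurement.database_gb

# Size and duration taken from the incident described earlier (340 GB in 6 h 53 min);
# the measurement date and the current size are hypothetical.
last = RestoreMeasurement(date(2025, 1, 14), database_gb=340, restore_minutes=413)

stale = needs_remeasure(last, current_gb=510, today=date(2025, 7, 1))
estimate = naive_restore_estimate_minutes(last, current_gb=510)
```

Here the measurement is under six months old but the database is 50% larger, so the check asks for a new timed restore. The extrapolated figure of roughly 10 hours is a prompt to schedule that test, not a number to put in a contract.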

Warm standby. The team provisions a recovery environment in a second region and maintains a database replica in that region. In normal operation, the replica receives replication from the primary but does not serve application traffic. On failure, the team promotes the replica to primary, updates the application configuration to point to the recovery region, and restores traffic. The RPO is bounded by the replication lag on the replica at the time of the failure. The RTO is the sum of: failure detection, replica promotion time (typically minutes for a managed service like RDS read replica promotion), application configuration update, DNS propagation (bounded by the pre-configured DNS TTL — a 30-second TTL requires pre-configuration before the incident, not during it), and smoke test. A well-configured warm standby can produce an RTO of 15-45 minutes depending on the application startup time and smoke test procedure. The cost is maintaining two environments simultaneously — the recovery region's infrastructure (compute, networking, database replica) runs continuously even when not serving traffic. The infrastructure-as-code decision directly determines the cost of maintaining warm standby: recovery region infrastructure defined in Terraform and applied to a second workspace can be provisioned and maintained with the same code that manages the primary region, ensuring the two environments stay in sync as infrastructure changes are made. Warm standby is the correct choice for services with an RTO requirement of 30 minutes to 2 hours that cannot be met by backup-and-restore (because the backup restore time exceeds the RTO) but whose business model does not justify the cost of active-active. The fintech's 1-hour RTO commitment was achievable with warm standby under normal conditions but required an IAM role that was deleted — a playbook drift problem, not a fundamental architecture problem.

Active-active multi-region. The team serves traffic from multiple regions simultaneously, with each region able to absorb the full load if another fails. The RPO is near-zero (synchronous replication or eventual consistency with conflict resolution) and the RTO is near-zero (traffic is rerouted to surviving regions via DNS health checks or load balancer failover). The cost is 2× or more the production infrastructure cost (two or more full production environments running simultaneously), plus the complexity of distributed state management — a write in us-east-1 must be consistent with a write in us-west-2, which requires either synchronous cross-region replication (which adds latency to every write proportional to the inter-region network RTT) or an eventual consistency model (which requires the application to handle write conflicts and read-your-writes consistency violations). Active-active is the correct architecture for products where the cost of any regional outage — measured in revenue, regulatory exposure, or enterprise SLA penalty — exceeds the cost of maintaining double the infrastructure and solving distributed consistency. For most sub-100 engineer companies, active-active is over-engineered: the complexity of distributed consistency introduces a class of bugs more likely to cause data loss than the single-region incident active-active is designed to prevent. The observability strategy decision record must extend to active-active: monitoring in both regions, replication lag monitoring between regions, conflict detection for eventual-consistency models, and alerting on divergence between regional data states are prerequisites for operating active-active safely. The decision to use active-active should be documented with the consistency model chosen, the conflict resolution strategy, and the latency penalty accepted for cross-region synchronous writes — not just "we're active-active for high availability."

AI chat session types and what each one misses

The disaster recovery decision follows a consistent pattern in AI chat session history. The initial session establishes the backup approach. A later session adds a "high availability" requirement. A crisis session during the first real outage becomes the moment the undocumented assumptions are discovered. The WhyChose extractor surfaces these sessions from AI chat exports — the initial infrastructure session, the HA session, and the incident session are present in most engineering teams' chat histories, and the structural decisions each one omits are consistent across the decision records reviewed.

The initial infrastructure setup session covers: what cloud provider and region to use, how to configure the managed database service, what backup interval to set, and whether to enable multi-AZ for the primary instance. Multi-AZ is often enabled in this session because the cloud provider's console recommends it for production. What the session does not cover: the difference between multi-AZ (high availability within a region — automatic failover from primary to standby AZ when the primary AZ has a problem) and disaster recovery (recovery from a regional failure — a different class of event requiring a different class of infrastructure). A team that enables multi-AZ and documents "we have high availability" has conflated availability with disaster recovery. Multi-AZ provides automatic failover within a region in 60-120 seconds. It provides no protection against a regional failure that affects both the primary and standby AZ, a data corruption event that replicates to the standby, or a deployment error that corrupts the database schema. The session also does not cover: what the restore time will be for the current backup size, what the restore time will be in 18 months when the database is 28× larger, or who will lead the recovery process when the incident happens at 2 AM. These questions are not asked because the session's scope is "get the database running," not "document the recovery contract the team is implicitly signing by choosing this configuration."

The "we need to be highly available" session covers: adding a read replica, load balancing across multiple application instances, and in some cases, cross-region replication. The session ends when the team has an architecture diagram that shows redundancy. What the session misses: the replication lag measurement under production load conditions. The session typically contains a statement like "asynchronous replication has low latency, usually a few seconds" — which is true at the throughput level of the current session's discussion and false at P99 under peak conditions. The RPO that gets documented (or stated verbally and forgotten) in this session is based on this assumption: "replication is fast, so our RPO is a few minutes." The assumption is never validated against a continuous lag monitor. The database migration strategy decision record context applies here: a major schema migration that runs a multi-hour backfill on the primary while replication is active will drive replication lag to the duration of the backfill for the entire execution window, temporarily invalidating any RPO target under a few hours. The connection between migration strategy and recovery point is a cross-cutting concern that belongs in the DR ADR's replication lag section, not assumed away in the HA session.

The "we should test our DR" session — when it happens — covers: agreeing on a test date, walking through the recovery steps on paper, and sometimes scheduling the test. What the session misses: an audit of the infrastructure the playbook references to confirm it still exists as described. The test plan session reads the playbook and annotates it with time estimates. It does not verify that the IAM role on line 14 of the playbook is still the ARN currently deployed, that the S3 bucket path on line 22 matches the current backup location, or that the instance type on line 31 is still available in the recovery region. The verification gap between the planning session and the test execution is where IAM role deletions, bucket restructurings, and instance type discontinuations hide. The CI/CD pipeline decision record can reduce this drift with a pre-test playbook validation script — a shell script or Terraform plan that verifies each infrastructure resource referenced in the playbook exists with the expected properties before the test begins. This is the automation equivalent of running the test on paper before running it in production, and it converts the "did the test reveal drift?" question into a structured check with a pass/fail output.
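
A pre-test validation of this kind could start by pulling every ARN out of the playbook text and asking whether each one still exists. The sketch below is in Python rather than shell and keeps the existence check as a caller-supplied function, so it makes no claim about which AWS calls a real implementation would use. The playbook excerpt and the bucket name are hypothetical. The role ARN is the one from the failed test described earlier.

```python
import re

# Matches arn:partition:service:region:account:resource. Region and account may be empty.
ARN_PATTERN = re.compile(r"arn:aws[a-z-]*:[a-z0-9-]+:[a-z0-9-]*:\d{0,12}:[^\s,;]+")

def referenced_arns(playbook_text):
    """Return the unique ARNs mentioned in the playbook, sorted for stable output."""
    return sorted({match.rstrip(".") for match in ARN_PATTERN.findall(playbook_text)})

def validate_playbook(playbook_text, resource_exists):
    """resource_exists is a callable(arn) -> bool supplied by the caller."""
    missing = [arn for arn in referenced_arns(playbook_text) if not resource_exists(arn)]
    return {"passed": not missing, "missing": missing}

# Hypothetical playbook excerpt.
playbook = """
Step 3: assume arn:aws:iam::123456789012:role/dr-failover-role
Step 5: read backups from arn:aws:s3:::example-dr-backups/nightly/
"""

# Stand-in for a real lookup: the set of ARNs that currently exist.
deployed = {"arn:aws:s3:::example-dr-backups/nightly/"}

result = validate_playbook(playbook, resource_exists=lambda arn: arn in deployed)
```

The result is the structured pass or fail described above: `passed` is false and `missing` names the deleted role before anyone joins the test call.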

The incident session — recovery under pressure. This is not a planned AI session. It is the sequence of queries fired into a chat interface at 2 AM while the on-call engineer is simultaneously monitoring the recovery progress, responding to the CTO on Slack, and checking AWS console status. The session covers: how do I restore an RDS snapshot, what instance type should I use, how do I update the application's database connection string after the restore, how do I restart the application services. These queries are the improvised equivalent of the playbook steps that were never written — or were written and are now two steps ahead of the engineer's current position in the recovery. What the session misses: documenting any of the decisions made during the recovery as an ADR update, so that the playbook reflects the actual restore procedure and actual instance type chosen, not the assumed procedure from the initial setup session. The incident session is the highest-value source of DR ADR content — the engineer discovered, under real conditions, what the recovery process actually requires — but the content is discarded when the chat session ends and the incident is closed. The WhyChose extractor surfaces this session as one of the most consequential in an infrastructure team's history: the "how do I restore" session at 2 AM contains the actual procedure, the actual time measurements, and the actual stumbling blocks. Those are the inputs to a DR ADR that would prevent the next incident from taking the same 8 hours and 25 minutes.

Five ADR sections for disaster recovery

A disaster recovery ADR that produces achievable targets, a tested playbook, and a sustainable drift management process covers five sections that teams consistently omit from their initial infrastructure documentation.

First, RPO and RTO targets per data tier, with the basis for each target. The ADR documents the RPO and RTO for each data tier, not a single pair of targets for the entire system. Customer transaction data (the most critical tier — subscription events, payment records, user-created content) may have an RPO of 5 minutes and an RTO of 1 hour. Audit log data may have an RPO of 24 hours (logs can be reconstructed from application events for the lost window) but a higher durability requirement (7-year retention, immutable after write). Derived data — computed aggregates, search indexes, analytics materialized views — may have an RPO of 24 hours and an RTO of 4 hours (can be rebuilt from the primary transaction data, does not need to be a focus of the replication strategy). Application code and configuration (in git and S3) may have an RPO of 0 and an RTO of 30 minutes (code is never lost; recovery time is the time to redeploy into the recovery region). Each target must include the basis: what replication or backup mechanism is used to achieve the RPO, and what has been measured to validate that the mechanism achieves the target under production load conditions. The stated RPO is a business requirement. The validated RPO is a measured property of the current infrastructure. The ADR documents both and notes when they diverge. A validated RPO that is worse than the stated RPO is a gap that requires either infrastructure investment to close or a business-side acceptance of the wider gap — a decision that must be documented explicitly, not papered over by leaving the stated target as the only number in the document.
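
One way to keep the stated and validated numbers side by side is a small per-tier record. The sketch below is illustrative only: the field names are assumptions, the customer-transaction figures reuse the fintech's 5-minute target and 51-minute measured peak lag, and the other tiers' validated values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TierTarget:
    tier: str
    stated_rpo_min: float               # the business requirement
    validated_rpo_min: Optional[float]  # measured under production load; None if never measured
    basis: str                          # mechanism and measurement behind the validated value

def rpo_gaps(tiers):
    """List tiers whose validated RPO is missing or worse than the stated RPO."""
    findings = []
    for t in tiers:
        if t.validated_rpo_min is None:
            findings.append((t.tier, "unvalidated"))
        elif t.validated_rpo_min > t.stated_rpo_min:
            findings.append((t.tier, f"gap of {t.validated_rpo_min - t.stated_rpo_min:g} min"))
    return findings

tiers = [
    TierTarget("customer transactions", 5.0, 51.0, "async cross-region replica, peak lag at month-end batch"),
    TierTarget("audit logs", 1440.0, None, "daily export, never measured"),
    TierTarget("derived data", 1440.0, 60.0, "rebuilt from transaction data"),
]

findings = rpo_gaps(tiers)
```

Each finding is a decision the ADR needs to record: close the gap with infrastructure investment, or accept the wider number explicitly.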

Second, replication mechanism and measured lag with continuous monitoring. The ADR documents the replication mechanism for each data tier: RDS cross-region read replica with asynchronous replication, point-in-time recovery with WAL log shipping, DynamoDB global tables with eventual consistency, S3 cross-region replication, Elasticsearch or OpenSearch cross-cluster replication. For each mechanism, the ADR documents the measured replication lag at P50 and P99 under production load, the load conditions at which the measurement was taken (date, request throughput, write throughput, whether a migration or batch job was running), and the monitoring setup that provides continuous visibility into lag. The monitoring setup must produce a metric per data tier's replication lag that is scraped at a frequency appropriate for the RPO target: an RPO of 5 minutes requires a replication lag metric updated every 30 seconds, not every 5 minutes, so that lag accumulation is detectable before it breaches the RPO boundary. Alert thresholds: alert at 50% of the RPO target, not at 100%, to provide time for investigation before the target is breached. For the fintech's 5-minute RPO on customer transaction data, the alert threshold is 2.5 minutes of replication lag — an alert that fires continuously during end-of-month batch runs is the signal that the RPO target requires either a tighter replication mechanism (synchronous replication, not asynchronous) or an accepted exception for batch processing windows documented in the ADR. The secrets management decision record intersects here: the KMS encryption key used to encrypt the primary database must be replicated to the recovery region before a failure, not provisioned during recovery. The KMS key replication status is a pre-condition for a successful warm standby promotion — a pre-condition that belongs in the replication mechanism documentation, not discovered during the test when the promotion fails with a key not found error.
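
The 50% threshold rule can be expressed as the arguments for a CloudWatch alarm on the RDS `ReplicaLag` metric, which is reported in seconds. The sketch below only builds the argument dictionary; the replica identifier and alarm name are hypothetical, and the 60-second period assumes the standard one-minute granularity of RDS metrics, so a 30-second signal would need a different source.

```python
def replica_lag_alarm(replica_id, rpo_seconds, fraction=0.5, period_seconds=60):
    """Build keyword arguments for a CloudWatch alarm at a fraction of the RPO target."""
    return {
        "AlarmName": f"{replica_id}-replica-lag-over-{int(fraction * 100)}pct-rpo",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": rpo_seconds * fraction,
        "ComparisonOperator": "GreaterThanThreshold",
        # If the metric stops arriving, treat that as a breach instead of staying silent.
        "TreatMissingData": "breaching",
    }

# Hypothetical replica name. A 5-minute RPO gives a 150-second threshold.
alarm = replica_lag_alarm("orders-replica-usw2", rpo_seconds=300)

# Applying it would look roughly like this (not executed here):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-west-2").put_metric_alarm(**alarm)
```

Using `Maximum` instead of `Average` is a deliberate choice in this sketch, because an average over the period can smooth away the spike that matters for the recovery point.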

Third, recovery infrastructure specification. The ADR documents the recovery environment: what exists pre-provisioned in the recovery region, what is created during recovery, and the specification for each component. Pre-provisioned components for warm standby: the database replica (instance class, parameter group, subnet group, security group — all matching the primary's configuration), the networking infrastructure (VPC, subnets, security groups, NAT gateway — required for services to start), and the DNS records in both primary and recovery configurations with a low TTL (30 seconds) pre-set on the failover records so propagation time is bounded when failover is executed. Created during recovery: the application compute tier (ECS task count, EKS node group, or EC2 Auto Scaling group started at the minimum instance count documented in the playbook), and any external service configurations that must be updated to point to the recovery region (CDN origin, external webhook endpoints, third-party integration base URLs). Each component is specified with the Terraform resource reference so the recovery infrastructure is traceable to its infrastructure-as-code definition and maintained in sync with the primary as infrastructure changes are made. The instance classes and types in the recovery specification must match the current primary specification, not the specification at the time the ADR was written — a recovery instance class smaller than the primary will produce a recovery environment that cannot handle the production load when traffic is restored, an incident compounded by a performance problem. The infrastructure-as-code strategy is what makes maintaining the recovery specification in sync with the primary tractable: a Terraform workspace per region, sharing the same module definitions, ensures that an instance class change in the primary workspace is applied to the recovery workspace in the same PR, not discovered during the DR test as a specification drift.

Fourth, the recovery playbook with ownership, time budget, and pre-conditions. The ADR documents the recovery playbook as a structured procedure, not as a general description of the recovery approach. Each step is documented with: the exact action (including the specific AWS console path, CLI command, or Terraform command with parameters), the time budget for the step (measured from the most recent DR test), the responsible role (on-call engineer, DR lead, or specific named team), the expected outcome that confirms the step succeeded, and the escalation path if the step fails or exceeds its time budget. The total time budget across all steps must sum to less than the RTO target, not equal to it, because the RTO budget must also cover the detection and paging time that precedes the playbook execution. Pre-conditions documented before step 1: the IAM roles referenced in steps 3, 7, and 12 must exist (ARN for each listed in the playbook); the S3 bucket paths used in step 5 must exist and the on-call engineer must have read access; the DNS hosted zone ID used in step 9 must match the current Route 53 configuration; the encryption key ARN used in step 6 must be present in the recovery region. Pre-conditions are checked at the start of each DR test by a pre-condition check script that verifies each named resource exists and is accessible by the IAM principal that will execute the recovery. A pre-condition that fails at the start of a DR test is caught before the 90-minute recovery window has started; the same pre-condition that fails during a real incident is caught only when the playbook step referencing it fails, at whatever point in the recovery that step falls. Pre-condition checks are the mechanism for detecting playbook drift between tests.
The named DR lead — a specific engineer who has ownership of the DR plan, is trained on the full recovery procedure, and is the escalation point for on-call engineers who encounter steps they cannot complete — is the governance structure that prevents the "we figured it out but it took three people two hours to improvise the missing steps" outcome.
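The time-budget rule and the pre-condition check can be sketched together. This is a minimal illustration under stated assumptions: the `Step` fields, the step descriptions, the budgets, and the 15-minute detection allowance are hypothetical, and each pre-condition is represented as a plain callable returning `True` or `False` so that the real lookups (IAM, S3, Route 53, KMS) can be supplied by whatever client the team already uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    owner: str
    budget_minutes: float  # measured in the most recent DR test


def remaining_margin_minutes(
    steps: List[Step], rto_minutes: float, detection_and_paging_minutes: float
) -> float:
    """Minutes left in the RTO after detection, paging, and every step budget."""
    return rto_minutes - detection_and_paging_minutes - sum(s.budget_minutes for s in steps)


def failed_preconditions(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check; a check that raises counts as failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def _denied() -> bool:
    raise PermissionError("access denied")  # stand-in for a failing lookup


# Hypothetical playbook against a 90-minute RTO.
steps = [
    Step(1, "Promote database replica", "on-call engineer", 10),
    Step(2, "Start application compute tier", "on-call engineer", 25),
    Step(3, "Switch DNS and run smoke test", "DR lead", 30),
]
print(remaining_margin_minutes(steps, rto_minutes=90, detection_and_paging_minutes=15))

checks = {
    "iam_role_exists": lambda: True,
    "kms_key_in_recovery_region": lambda: False,
    "s3_backup_path_readable": _denied,
}
print(failed_preconditions(checks))
```

A negative or zero margin means the playbook as budgeted cannot meet the RTO even if every step goes to plan, which is worth knowing before the test rather than after it.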

Fifth, playbook drift management and test cadence. The ADR documents the quarterly DR test schedule, the pre-test validation procedure, the post-test update procedure, and the infrastructure change hook that triggers playbook review. Quarterly test cadence: a full DR test on the first Saturday of each quarter, with the recovery environment restored from the current backup and the recovery procedure run start to finish as written. The test is not a tabletop exercise: it is a live recovery that either succeeds in meeting the RTO target or identifies the gap between the documented procedure and the current infrastructure. Post-test update: within 5 business days of the DR test, the DR lead updates the playbook to reflect any steps that had to be improvised, any pre-condition checks that failed, and any time measurements that differ significantly from the budgeted time. The ADR version is incremented with each post-test update, creating a changelog of how the playbook has evolved. Infrastructure change hook: any Terraform change to infrastructure in the categories listed in the playbook pre-conditions (IAM roles, S3 buckets, KMS keys, DNS hosted zones, EKS cluster names, RDS parameter groups) triggers a required reviewer assignment to the DR lead on the PR. The DR lead's review task is to check whether the infrastructure change invalidates any step or pre-condition in the playbook, and to update the playbook before the PR is merged if it does. This is the change management mechanism that prevents IAM role deletions and S3 path restructurings from accumulating as playbook drift between tests. The quarterly decision review should include the DR test results as a standing agenda item: were the RPO and RTO targets met in the most recent test, what drift was found and corrected, and has the business growth since the last test changed the assumptions (database size, peak write throughput, consumer count) that the targets were based on.
A business that was 22 people when the DR plan was written and is now 60 people with 4× the customer data has likely outgrown the restore-time assumption that underlay the RTO target, and the quarterly review is when that gap should be surfaced and re-evaluated — not when the 8-hour restore begins at 2 AM.
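The infrastructure change hook described above reduces to a small classification step: given the resource addresses a Terraform plan touches, decide whether the DR lead must review the PR. The sketch below is one hedged way to express that. The mapping from the playbook's categories to Terraform AWS provider resource types is an assumption for illustration, the address parsing is deliberately simplified (it does not handle module index keys that contain dots), and how the changed addresses are obtained from the plan is left out.

```python
from typing import Iterable, List

# Assumed mapping of the playbook's pre-condition categories to resource types.
DR_SENSITIVE_TYPES = frozenset({
    "aws_iam_role",
    "aws_s3_bucket",
    "aws_kms_key",
    "aws_route53_zone",
    "aws_eks_cluster",
    "aws_db_parameter_group",
})


def resource_type(address: str) -> str:
    """Extract the resource type from an address such as
    'module.db.aws_db_parameter_group.primary' (simplified parsing)."""
    parts = address.split(".")
    # Skip leading 'module.<name>' pairs, however deeply nested.
    while len(parts) >= 2 and parts[0] == "module":
        parts = parts[2:]
    return parts[0] if parts else ""


def changes_needing_dr_review(changed_addresses: Iterable[str]) -> List[str]:
    """Return the changed addresses that should pull the DR lead into review."""
    return sorted(a for a in changed_addresses if resource_type(a) in DR_SENSITIVE_TYPES)


# Hypothetical set of addresses changed by one PR.
changed = [
    "aws_instance.web",
    "module.db.aws_db_parameter_group.primary",
    "aws_iam_role.dr_restore",
    "data.aws_iam_role.readonly",  # data source, not a managed resource
]
print(changes_needing_dr_review(changed))
```

A non-empty result is the trigger for the required-reviewer assignment; an empty result lets the PR proceed without involving the DR lead.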

None of these five sections appears in the initial infrastructure setup documentation or in the high availability architecture diagram. They are the recovery contract that the team has implicitly signed with every enterprise customer who has an uptime SLA, every regulatory auditor who has seen the SOC 2 report with the stated RPO and RTO, and every engineer who will be paged at 2 AM and asked to execute a recovery procedure they have never tested. The document management company's 8-hour-25-minute outage and the fintech's failed DR test are not caused by poor engineering in the individual sessions. They are caused by a disaster recovery posture that was assumed rather than documented: no targets based on current data volume, no replication lag measurement under production load, no recovery infrastructure specification, no tested playbook with pre-condition checks, and no drift management process that keeps the playbook current between tests. The WhyChose extractor surfaces the initial infrastructure session, the "we need HA" session, and the 2 AM incident session from AI chat history; the disaster recovery ADR is what takes the assumptions from those sessions and converts them into a documented commitment the team can validate, test, and maintain before an incident tests it for them.

FAQs

What is the difference between RPO and RTO and why do both need to be documented separately?

RPO (Recovery Point Objective) is the maximum amount of data loss the business can accept, measured in time: if your RPO is 1 hour, a recovery is acceptable if it restores the system to the state it was in no more than 1 hour before the failure. RTO (Recovery Time Objective) is the maximum amount of time the business can accept the system being unavailable: if your RTO is 4 hours, the system must be restored and serving requests within 4 hours of the failure, a window that includes the time taken to detect and declare it.

The two are independent dimensions of the recovery contract and are met by different technical mechanisms. RPO is met by the replication or backup frequency: a 1-minute RPO requires synchronous or near-synchronous replication; a 1-hour RPO may be met by hourly backups. RTO is met by the speed of the recovery process: detection time, failover decision, DNS propagation, service startup, database restore or replica promotion, and smoke test — all of which must be pre-measured, not estimated.

Both targets must be set per data tier, not as a single number for the system. Customer transaction data may require a 5-minute RPO and a 1-hour RTO. Derived data — materialized views, search indexes — may tolerate a 24-hour RPO and a 4-hour RTO (they can be rebuilt from the primary transaction data). The recovery contract is incomplete without both dimensions specified per tier, validated against measured infrastructure behavior, and documented before the first incident that tests the commitment.
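Per-tier targets are easiest to validate when they are written down as data. The following sketch assumes a simple record per tier and two dictionaries of measured values; the tier names and the measured figures are hypothetical, and treating a missing measurement as an unmet target is a design choice made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TierTargets:
    tier: str
    rpo_minutes: float
    rto_minutes: float


def unmet_targets(
    targets: List[TierTargets],
    measured_lag_p99_minutes: Dict[str, float],
    measured_recovery_minutes: Dict[str, float],
) -> List[str]:
    """List every target that is exceeded or has never been measured."""
    gaps = []
    for t in targets:
        lag = measured_lag_p99_minutes.get(t.tier)
        recovery = measured_recovery_minutes.get(t.tier)
        if lag is None or lag > t.rpo_minutes:
            gaps.append(f"{t.tier}: RPO")
        if recovery is None or recovery > t.rto_minutes:
            gaps.append(f"{t.tier}: RTO")
    return gaps


# Targets from the text: 5 min / 1 h for transactions, 24 h / 4 h for derived data.
targets = [
    TierTargets("customer_transactions", rpo_minutes=5, rto_minutes=60),
    TierTargets("search_index", rpo_minutes=24 * 60, rto_minutes=4 * 60),
]
# Hypothetical measurements; the search index recovery has never been timed.
lag_p99 = {"customer_transactions": 47, "search_index": 600}
recovery = {"customer_transactions": 55}

print(unmet_targets(targets, lag_p99, recovery))
```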

Why does a stated RPO of 5 minutes not guarantee that only 5 minutes of data will be lost in a real incident?

A stated RPO of 5 minutes is an intent. Whether it is achievable depends on the measured replication lag under production load — specifically, under the peak-write conditions most likely during a real incident, not under the average conditions when the architecture was reviewed.

Asynchronous replication lag is a function of write throughput, network bandwidth between regions, and replica apply throughput. Under normal load the lag may be 30 seconds. During a high-write event — an end-of-month batch, a major product launch, a database migration backfill — the lag may be 47 minutes. If the primary fails when the replica is 47 minutes behind, 47 minutes of data is lost regardless of the stated RPO. A team that has never measured lag at P99 throughput has never validated their RPO under the conditions most likely to coincide with a failure.

The RPO documentation must specify the measurement methodology: which replication mechanism, the P50 and P99 measured lag values, the write throughput at which those values were measured, and the continuous monitoring setup. A CloudWatch monitor on replication lag that alerts at 50% of the RPO target gives real-time visibility into RPO achievability before the incident. An alert that fires during every end-of-month batch is not a nuisance; it is a notification that the documented RPO is not achievable for the share of operational time those batches occupy.
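Given a series of lag samples, the P50 and P99 values and the share of time spent over the RPO take only a few lines to compute. The sketch below uses the nearest-rank percentile definition, which is one of several conventions, and a synthetic sample set built for illustration (90 samples at 30 seconds, 10 at 47 minutes); real figures would come from the team's own monitoring export.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_over(samples: Sequence[float], limit_seconds: float) -> float:
    """Share of samples whose lag exceeds the limit."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s > limit_seconds) / len(samples)


# Synthetic lag samples in seconds: mostly normal load, some batch-window lag.
lag_samples: List[float] = [30.0] * 90 + [2820.0] * 10
rpo_seconds = 300

print("P50:", percentile(lag_samples, 50))
print("P99:", percentile(lag_samples, 99))
print("share of samples over RPO:", fraction_over(lag_samples, rpo_seconds))
```

With this synthetic data the median looks comfortable while the P99 is far beyond the target, which is the gap between a stated and a validated RPO that the answer above describes.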

What makes a DR playbook drift and how does a team discover the drift before an incident?

A DR playbook drifts when the infrastructure it describes changes after the playbook was last tested. An IAM role referenced in the playbook is deleted during a permission audit. An S3 bucket path used for backup restore is restructured when the team migrates backup tools. An EC2 instance type specified for the standby environment is discontinued by the provider. A DNS hosted zone ID is replaced when the team migrates DNS providers. None of these changes are made with the intent to invalidate the playbook — each is a routine infrastructure maintenance task.

The drift accumulates silently because the playbook is not consulted during normal operations and infrastructure change reviews do not cross-reference the DR playbook. The only way to discover drift before an incident is to run the playbook. A quarterly DR test that executes the full recovery procedure surfaces every step that references infrastructure that no longer exists. A step that fails during a quarterly test produces a playbook correction. The same step that fails at 2 AM during a real incident produces an extended outage and an SLA breach.

Two mechanisms reduce drift accumulation between tests: a pre-condition check script that verifies each infrastructure resource named in the playbook exists before the test begins, and an infrastructure change hook that assigns the DR lead as a required reviewer on any PR that modifies infrastructure in the categories the playbook references — IAM roles, S3 buckets, KMS keys, DNS hosted zones. The hook converts the question "did anyone update the playbook for this change" from a social convention into a code review requirement.