The capacity planning decision record: why the provisioning model you chose determines your cost floor and your headroom ceiling under traffic spikes
In the summer of year three, a 22-person B2B SaaS company receives a cloud bill that is 340% of the previous month's bill. The explanation, once the finance team escalates it to the engineering lead, takes four hours to reconstruct: the company has been running all of its application servers on m5.4xlarge instances since the founding engineer provisioned the first production environment in year one. The m5.4xlarge was chosen because the founding engineer's previous job ran on m5.4xlarge instances, the application fit, and there was no reason to change it. Three years of growth later, the service runs on 14 of them — not because 14 m5.4xlarge instances is what the traffic requires, but because the auto-scaling minimum was set to 2 in year one, the traffic grew, and the auto-scaling policy has been adding instances at peak and never scaling back in because the scale-in cooldown is set to 60 minutes and peaks occur every 45 minutes. The cost floor — the minimum spend required to serve the actual traffic — is approximately 4 m5.4xlarge instances. The team is running 14.
The overprovisioning was invisible for three years because nobody had written down what the provisioning model was supposed to achieve. The instance size was chosen for historical reasons. The auto-scaling minimum was chosen for safety, with no record of what risk it was hedging against. The scale-in cooldown was copied from an AWS blog post about a workload with a different traffic pattern. The cost floor had never been calculated. The headroom ceiling — the maximum traffic the cluster could absorb before running out of capacity — had never been measured. The 340% bill spike was caused by a traffic-driven scale-out event that triggered correctly, added 8 instances in 20 minutes, and then never scaled back in because the cooldown policy was misconfigured for the traffic cadence. The correct fix took 90 minutes to implement and reduced the monthly bill by 65%. The investigation to understand why the configuration was what it was took four hours, because nobody had written down why the configuration was what it was.
The second story is the inverse failure. A 40-person marketplace runs a promotional event in November: a one-day flash sale affecting 15% of their catalog. Traffic projections suggest a 4× peak relative to the previous Tuesday. The engineering team reviews the auto-scaling configuration two days before the event and confirms that the maximum instance count is set to 20, which they calculate is sufficient for a 4× peak. What they do not calculate is the scale-out latency: the time from the moment the CloudWatch CPU alarm triggers to the moment a new EC2 instance is registered with the load balancer and passing health checks is 4 minutes and 12 seconds. The promotional email goes out at 10:00am. Traffic doubles in 35 seconds, a spike profile driven by email open rates in the first minute, not the gradual growth the auto-scaling policy was designed for. By 10:01am, the existing instances are at 94% CPU. The first scale-out event triggers at 10:00:48 (the alarm evaluation window is 2 × 60-second periods). New instances begin serving traffic at 10:05:00, 4 minutes and 12 seconds after the trigger. Between 10:01am and 10:05am, the existing instances at 94–98% CPU begin shedding requests: the checkout service starts returning 503s at 10:02:14. By the time the new instances are serving traffic, 1,847 checkout requests have failed. The auto-scaling maximum was correct. The steady-state instance count was not: there was no static headroom to absorb a spike faster than the scale-out latency.
Both failures trace to the same undocumented decision: the provisioning model was chosen once under the constraints of the moment and was never written down as a set of explicit choices with recorded reasoning. The cost floor was never calculated. The headroom ceiling was never measured. The scale-out latency was never tested against the observed spike shape. The capacity planning decision record is the document that makes these choices explicit, records the reasoning behind them, and specifies the conditions under which the choices should be revisited.
What a capacity planning decision record covers
Capacity planning is not a single decision — it is a family of interconnected decisions that together determine how much the service costs to run and how much unexpected traffic it can absorb. The decisions interact: the instance sizing affects the scale-out latency (larger instances take longer to launch); the auto-scaling trigger threshold affects the steady-state utilization (a lower threshold means more idle capacity and a higher cost floor); the scale-in policy affects the overprovisioning cost (an aggressive scale-in policy reduces cost but increases the frequency of scale-out events, each carrying the scale-out latency cost). Writing them down separately in five different Slack threads or Terraform comments means the interactions are never visible and the tradeoffs are never made explicit.
The five decisions that belong in a capacity planning ADR are:
- Provisioning model: on-demand vs. reserved instances vs. spot vs. serverless, and the cost-reliability tradeoff at the time of the decision.
- Instance sizing policy: which instance family and size for each tier, what the sizing rationale was, and what the right-sizing review cadence is.
- Horizontal vs. vertical scaling policy: when to scale out vs. scale up for each tier, with the trigger metric and threshold.
- Auto-scaling configuration: trigger threshold, cooldown periods, minimum and maximum counts, scale-in policy, and the reasoning behind each parameter.
- Capacity buffer policy: what percentage of idle capacity to maintain at steady state, what the buffer is protecting against (spike shape, scale-out latency, or both), and the cost the buffer represents.
Three structural properties that the provisioning model decides
1. The cost floor and the buffer cost
The cost floor is the minimum monthly compute spend required to maintain the service at the target availability level. It is not the cost of running at average load. It is the cost of running with enough idle capacity to absorb the observed spike distribution without capacity-caused errors during the scale-out latency window. If the scale-out latency is 4 minutes and the steepest observed traffic spike doubles in 2 minutes, the static buffer must cover a 2× load increase entirely from pre-provisioned capacity. If each instance handles 500 requests per second at 80% CPU utilization, and steady-state traffic is 2,000 rps, then 4 instances running at 80% handle steady-state traffic. A 2× spike to 4,000 rps requires 8 instances. The buffer requirement (4 additional instances available before auto-scaling completes) is the portion of the cost floor that sits above the minimum serving capacity.
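The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names are hypothetical, and it assumes per-instance throughput is constant and that the spike multiplier comes from measured traffic logs.

```python
import math

def instances_needed(traffic_rps, rps_per_instance):
    """Smallest whole number of instances that can serve traffic_rps."""
    return math.ceil(traffic_rps / rps_per_instance)

def static_buffer_instances(steady_rps, spike_multiplier, rps_per_instance):
    """Return (baseline, peak, buffer) instance counts.

    The buffer is the capacity that must already be running because the
    spike arrives faster than auto-scaling can add instances.
    """
    baseline = instances_needed(steady_rps, rps_per_instance)
    peak = instances_needed(steady_rps * spike_multiplier, rps_per_instance)
    return baseline, peak, peak - baseline

# Figures from the text: 2,000 rps steady state, 2x spike, 500 rps per instance.
baseline, peak, buffer = static_buffer_instances(2000, 2.0, 500)
print(baseline, peak, buffer)  # 4 8 4
```

The useful property of writing it this way is that the three inputs are exactly the three numbers the ADR should record with their measurement dates.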
The buffer cost is the delta between the cost floor and the cost of serving steady-state traffic with no buffer. For a service where the difference is 4 m5.xlarge instances at $0.192/hour, the buffer costs about $553/month to maintain (4 × $0.192 × 720 hours). The question documented in the ADR is: what does a capacity-caused incident cost, and is $553/month a rational insurance premium against that risk? If a capacity-caused checkout failure during a promotional event generates $40,000 in lost revenue and support escalation cost, $553/month is clearly rational. If the service has no revenue events and the cost of a 5-minute degradation is one support ticket, the buffer cost may be larger than the risk it covers. Writing the calculation down makes the tradeoff explicit: the buffer policy is a financial decision with an explicit cost and an explicit risk model, not an operational intuition about what "safe" means.
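A short sketch of the insurance-premium comparison, under stated assumptions: a 720-hour month, the hourly rate and incident cost quoted in the paragraph above, and hypothetical function names. The break-even figure is the number of incidents per year at which the expected loss equals the annual buffer spend.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month

def monthly_buffer_cost(buffer_instances, hourly_rate, hours=HOURS_PER_MONTH):
    """Monthly cost of keeping the buffer instances running continuously."""
    return buffer_instances * hourly_rate * hours

def breakeven_incidents_per_year(monthly_cost, incident_cost):
    """Incidents per year at which expected loss equals annual buffer spend."""
    return (monthly_cost * 12) / incident_cost

cost = monthly_buffer_cost(4, 0.192)
print(round(cost, 2))                                        # 552.96
print(round(breakeven_incidents_per_year(cost, 40_000), 3))  # 0.166
```

Read the second number as "the buffer pays for itself if it prevents one $40,000 incident roughly every six years". For a service whose incidents cost one support ticket, the same function returns a break-even frequency far higher than any realistic incident rate, which is the case for shrinking the buffer.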
2. The headroom ceiling and scale-out latency
The headroom ceiling is the maximum traffic increase the service can absorb without a capacity-caused error rate increase, given the current provisioning configuration. It has two components that interact: the static headroom (idle capacity already provisioned above steady-state utilization) and the dynamic headroom (capacity that auto-scaling can add, but only at the rate determined by scale-out latency).
Scale-out latency has four components for EC2-based workloads: instance launch time (60–90 seconds for common instance types), application startup time (depends heavily on the application — a JVM application initializing connection pools and warming caches can add 45–180 seconds; a stateless Go binary serving immediately at process start adds near zero), health check grace period (typically 60–300 seconds depending on configuration), and load balancer registration propagation (15–60 seconds). The sum is the scale-out latency: the time from scaling trigger to new capacity serving production traffic. For a JVM application with a 120-second warmup, the total scale-out latency is typically 4–6 minutes. For a stateless Node.js application with a 5-second startup, the total is 2–3 minutes. For a serverless function, the cold start is 200ms–2000ms depending on runtime and package size.
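As a sketch, the four components can be recorded as a small structure and summed. The specific values below are illustrative picks from inside the ranges quoted above, not measurements, and the class and field names are hypothetical. Summing is the simple model used in the text; in practice some phases can overlap, so a measured end-to-end value should replace the sum once it is available.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScaleOutLatency:
    instance_launch_s: float   # EC2 launch until the OS is up
    app_startup_s: float       # process start until ready to serve
    health_check_s: float      # time until health checks pass
    lb_registration_s: float   # load balancer registration propagation

    def total_s(self):
        """Time from scaling trigger to new capacity serving traffic."""
        return (self.instance_launch_s + self.app_startup_s
                + self.health_check_s + self.lb_registration_s)

# Illustrative values only (assumptions within the quoted ranges).
jvm = ScaleOutLatency(75, 120, 60, 30)
node = ScaleOutLatency(75, 5, 60, 30)
print(jvm.total_s() / 60)             # 4.75 minutes
print(round(node.total_s() / 60, 2))  # 2.83 minutes
```

Keeping the components separate in the ADR shows which one dominates, and therefore whether the cheaper fix is a faster application startup or a shorter health check configuration.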
The headroom ceiling is a function of four inputs: the spike rise time observed historically, the static buffer percentage, the scale-out latency, and the maximum instance count. A service with a 3-minute scale-out latency and a 20% static buffer can absorb a traffic spike that rises no faster than 20% per 3 minutes without capacity-caused errors. Any faster rise will exceed the static buffer before new capacity is available. This is the headroom ceiling: the maximum traffic spike rise rate the configuration can absorb without errors. It should be calculated from measured values (actual scale-out latency from CloudWatch, actual spike shapes from traffic logs) and documented in the ADR as a number, not as "we have auto-scaling configured so we should be fine."
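A minimal sketch of that calculation follows. It makes simplifying assumptions that should be stated in the ADR if the same model is used: the buffer is expressed as a fraction of steady-state load, the spike rises linearly, and the alarm evaluation delay is folded into the scale-out latency. The function names are hypothetical.

```python
def max_rise_rate_per_min(buffer_fraction, scale_out_latency_min):
    """Steepest sustained rise (fraction of steady-state load per minute)
    that the static buffer can absorb until new capacity is serving."""
    return buffer_fraction / scale_out_latency_min

def spike_exceeds_static_buffer(spike_increase, rise_time_min,
                                buffer_fraction, scale_out_latency_min):
    """True if a linearly rising spike outgrows the static buffer before
    auto-scaled capacity can serve traffic."""
    window = min(rise_time_min, scale_out_latency_min)
    increase_in_window = spike_increase * window / rise_time_min
    return increase_in_window > buffer_fraction

# 20% buffer, 3-minute latency: about 6.7% of steady-state load per minute.
print(round(max_rise_rate_per_min(0.20, 3), 4))  # 0.0667

# Flash-sale shape from the opening: +100% in 35 seconds, 4.2-minute latency.
print(spike_exceeds_static_buffer(1.0, 35 / 60, 0.20, 4.2))  # True

# A gentler ramp: +15% over 3 minutes.
print(spike_exceeds_static_buffer(0.15, 3, 0.20, 3))  # False
```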
3. The provisioning model and its cost-reliability tradeoff
On-demand instances offer full flexibility (scale-in and scale-out at any time) at the highest per-hour cost. Reserved instances (1-year or 3-year commitment) reduce per-hour cost by 30–60% but eliminate the flexibility for the committed capacity: the reservation is a billing commitment, so if traffic drops below the commitment level and the instances it covered are terminated, the reserved hours are still charged. Spot instances offer 60–90% discounts but can be reclaimed with a 2-minute warning, making them appropriate for stateless workloads with graceful shutdown capability and inappropriate for stateful workloads, session-bearing workloads, or workloads where a 2-minute interruption causes data loss. Serverless pricing eliminates idle capacity cost entirely (you pay per invocation, not per instance-hour) but introduces cold start latency as a headroom constraint (the first request to a cold function pays the cold start cost, which affects p99 latency for spiky workloads) and per-invocation pricing that becomes expensive at high sustained throughput.
The tradeoff is documented as the specific reason the chosen model was selected at the specific scale the service was at when the decision was made: "We chose on-demand because at 10 instances our monthly on-demand cost is $X, our reserved savings would be $Y, but we do not have 12-month traffic certainty to commit to reserved pricing for the baseline. We will revisit at 20 instances where the reserved savings justify the commitment risk." This reasoning becomes stale as traffic grows, which is why the capacity planning ADR should include the specific traffic volume at which the model should be re-evaluated — not "periodically" but "when monthly on-demand spend exceeds $3,000, which is the threshold where a 1-year reserved commitment for the baseline capacity produces positive ROI within the commitment window."
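The re-evaluation threshold can be derived rather than guessed. The sketch below assumes a flat reserved discount (40% here, an illustrative value inside the 30–60% range mentioned earlier), no upfront-payment effects, and hypothetical function names. The key quantity is the break-even utilization: the fraction of committed hours the capacity must actually be needed for the reservation to beat on-demand pricing.

```python
HOURS_PER_YEAR = 8760

def reserved_breakeven_utilization(discount):
    """Fraction of committed hours the instance must be needed for
    reserved pricing to cost less than on-demand."""
    return 1.0 - discount

def annual_costs(instances, on_demand_hourly, discount, utilization):
    """Return (on_demand_cost, reserved_cost) for one year.

    On-demand is billed only for hours actually used; the reservation
    is billed for every hour of the commitment regardless of use.
    """
    on_demand = instances * on_demand_hourly * HOURS_PER_YEAR * utilization
    reserved = instances * on_demand_hourly * (1 - discount) * HOURS_PER_YEAR
    return on_demand, reserved

print(reserved_breakeven_utilization(0.40))  # 0.6

on_demand, reserved = annual_costs(10, 0.192, 0.40, utilization=0.5)
print(round(on_demand, 2), round(reserved, 2))  # 8409.6 10091.52
```

In this illustrative case, capacity that is only needed half the time is cheaper on-demand, which is the usual argument for reserving the baseline and leaving burst capacity on-demand.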
Five ADR sections for a capacity planning decision record
1. Provisioning model and cost-reliability tradeoff
Document the provisioning model for each tier: application servers, database, cache, queue workers, CDN/edge. For each tier, record: the chosen model (on-demand / reserved / spot / serverless), the current monthly cost under the chosen model, the monthly cost under the alternative models at current scale, the reasoning for the choice given the cost delta and the reliability requirements, and the scale or time threshold at which the model should be re-evaluated. For mixed models (a common pattern: reserved instances for the baseline instance count, on-demand for auto-scaling capacity above the baseline), document the split ratio and the reasoning for the split point.
Record the spot instance policy explicitly if spot is used for any tier: which tiers use spot, what the interruption handling strategy is (graceful shutdown hook duration, in-flight request draining policy, session stickiness handling), what percentage of the cluster can be spot before an interruption wave would reduce capacity below the minimum serving threshold, and what the auto-scaling policy does when a spot interruption notice arrives (launch on-demand replacement immediately, or defer until auto-scaling threshold would have triggered independently). The interruption handling design is a decision with a non-obvious failure mode: a fleet with 70% spot capacity may lose up to 70% of its capacity at once during an AZ-wide spot interruption event, because the 2-minute interruption notice is shorter than the time it takes replacement capacity to start serving traffic.
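The spot ceiling follows directly from the minimum serving threshold. This sketch uses a worst-case assumption (every spot instance is reclaimed before any replacement is serving) and hypothetical function names.

```python
import math

def max_spot_instances(total_instances, min_serving_instances):
    """Largest spot count such that losing every spot instance at once
    still leaves min_serving_instances of on-demand capacity."""
    return max(0, total_instances - min_serving_instances)

def surviving_capacity_fraction(total_instances, spot_instances,
                                fraction_of_spot_reclaimed):
    """Capacity fraction left after an interruption wave, assuming no
    replacement is serving yet."""
    lost = math.ceil(spot_instances * fraction_of_spot_reclaimed)
    return (total_instances - lost) / total_instances

# A 10-instance tier that needs 4 instances to serve steady-state traffic.
print(max_spot_instances(10, 4))              # 6

# 70% spot with every spot instance reclaimed leaves 30% of capacity.
print(surviving_capacity_fraction(10, 7, 1.0))  # 0.3
```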
2. Instance sizing policy and right-sizing review cadence
Document the instance family and size for each tier, the rationale at the time of sizing (performance requirement, cost target, or historical inheritance), the measured utilization at steady state (CPU, memory, network, disk I/O — all four, not just CPU), and the right-sizing review trigger. The right-sizing review trigger is event-driven, not calendar-driven: a steady-state utilization below 30% for any resource on any tier for 30 consecutive days triggers a right-sizing review for that tier; a steady-state utilization above 70% for any resource triggers a scale-up or scale-out review; a cost delta exceeding $200/month between the current instance type and the next-smaller type triggers a right-sizing review regardless of utilization.
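The three review triggers are simple enough to encode as a check that runs against exported utilization metrics. The sketch below is one possible encoding with hypothetical names; it assumes the caller has already computed steady-state utilization per resource and the number of consecutive days each resource has been under the low threshold.

```python
LOW, HIGH = 0.30, 0.70        # utilization thresholds from the policy
LOW_DAYS = 30                 # consecutive days under LOW
COST_DELTA_USD = 200.0        # monthly saving that forces a review

def review_triggers(steady_state, consecutive_low_days,
                    monthly_saving_next_smaller):
    """Return the list of reviews the policy requires for one tier."""
    triggers = []
    for resource, util in steady_state.items():
        if util > HIGH:
            triggers.append(f"scale-review:{resource}")
        elif util < LOW and consecutive_low_days.get(resource, 0) >= LOW_DAYS:
            triggers.append(f"right-size:{resource}")
    if monthly_saving_next_smaller > COST_DELTA_USD:
        triggers.append("right-size:cost-delta")
    return triggers

print(review_triggers(
    {"cpu": 0.22, "memory": 0.81, "network": 0.10, "disk_io": 0.05},
    {"cpu": 45, "network": 12, "disk_io": 30},
    monthly_saving_next_smaller=150.0,
))
# ['right-size:cpu', 'scale-review:memory', 'right-size:disk_io']
```

The example tier is the pattern described in the next paragraph: CPU is idle while memory is the bottleneck, which points to an instance family change rather than a size change.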
Document the instance family choice explicitly. Instance families are not interchangeable: m5 instances are general-purpose; c5 instances have a higher CPU-to-memory ratio and lower per-vCPU cost for CPU-bound workloads; r5 instances have a higher memory-to-CPU ratio for memory-bound workloads. A JVM application running a large in-memory cache on m5 instances may be overprovisioning CPU while underprovisioning memory. Because r5 provides twice the memory per vCPU (8 GiB vs. 4 GiB), a smaller r5 instance may supply the same memory at lower cost, with adequate CPU for the actual CPU utilization profile. The instance family decision should record which resource is the actual bottleneck for the specific workload, not assume that general-purpose instances are always the right choice.
3. Auto-scaling configuration and cooldown policy
Document each auto-scaling parameter with its specific value and its reasoning. The parameters that matter most are: the scale-out trigger metric and threshold (CPU utilization at 70% is common but may be wrong for a latency-sensitive workload where CPU at 70% produces acceptable throughput but p99 latency has already reached 400ms); the scale-out cooldown period (the minimum time between consecutive scale-out events: too short causes thrashing, too long causes sustained underprovisioning during extended spikes); the scale-in cooldown period (the minimum time after a scale-out event before a scale-in event can occur, which should be longer than the traffic spike duration to prevent scale-in while a spike is still active); the minimum instance count and the reasoning for the floor (availability requirement, cost floor, or baseline reserved instance commitment); and the maximum instance count and the reasoning for the ceiling (cost cap, database connection pool limit, or external API rate limit that limits horizontal scaling benefit).
The scale-out cooldown deserves particular attention. A cooldown set to 300 seconds on a workload where traffic spikes last 8 minutes means that after the first scale-out event fires and the cooldown begins, the second scale-out event cannot fire until the cooldown expires. If the spike is still growing at 305 seconds, the second scale-out event fires immediately, but the new instances take 4 minutes to serve traffic. The effective scaling capacity during an 8-minute (480-second) spike with a 300-second cooldown and a 4-minute scale-out latency is one scale-out event: the second event fires at 305 seconds but contributes capacity at 545 seconds (just over 9 minutes), after the spike has receded. Document whether the cooldown policy is designed for the observed spike duration, and note the specific historical spike that drove the cooldown value.
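The timeline in this example can be reproduced with a small simulation. It is a sketch under simplifying assumptions: the first event fires at the start of the spike, every later event fires as soon as the cooldown allows, and `retrigger_delay_s` (a hypothetical parameter) stands in for the few seconds of alarm evaluation after the cooldown expires.

```python
def capacity_arrival_times(spike_duration_s, cooldown_s, scale_out_latency_s,
                           retrigger_delay_s=0, first_trigger_s=0):
    """Return [(arrival_time_s, arrives_during_spike), ...] for each
    scale-out event that fires while the spike is still active."""
    events = []
    trigger = first_trigger_s
    while trigger < spike_duration_s:
        arrival = trigger + scale_out_latency_s
        events.append((arrival, arrival <= spike_duration_s))
        trigger += cooldown_s + retrigger_delay_s
    return events

# 8-minute spike, 300 s cooldown, 4-minute scale-out latency,
# second event firing at 305 s as in the text.
events = capacity_arrival_times(480, 300, 240, retrigger_delay_s=5)
print(events)  # [(240, True), (545, False)]
print(sum(1 for _, useful in events if useful))  # 1
```

Running the same function with a candidate cooldown and the measured scale-out latency shows how many scale-out events can actually contribute capacity within the observed spike duration, which is the number the ADR should record next to the cooldown value.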
4. Capacity buffer policy and cost calculation
Document the target steady-state utilization for each auto-scaling dimension (CPU, memory, concurrent connections, request queue depth). The target steady-state utilization is the inverse of the capacity buffer: a 70% CPU target means 30% idle capacity as the buffer. The buffer calculation should record: what spike shape the buffer is designed to absorb (derived from historical traffic logs — the 95th-percentile spike magnitude and rise time over the last 12 months); whether the buffer covers the full spike or only the first scale-out latency window; the monthly cost of the buffer at current instance count and pricing; and the cost of a capacity-caused incident relative to the buffer cost (the comparison that justifies the buffer policy as a financial decision).
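The relationship between the utilization target and the buffer is worth making concrete, because idle capacity as a fraction of the fleet is not the same number as the traffic increase the fleet can absorb. The sketch below assumes throughput scales linearly with utilization (so an instance that serves 500 rps at 80% CPU is treated as saturating at 625 rps); that linearity and the function names are assumptions for illustration and should be checked against load test results.

```python
import math

def idle_buffer_fraction(target_utilization):
    """Share of provisioned capacity left idle at steady state."""
    return 1.0 - target_utilization

def absorbable_traffic_increase(target_utilization):
    """Relative traffic increase that fills the idle capacity."""
    return 1.0 / target_utilization - 1.0

def instances_for_target(load_rps, max_rps_per_instance, target_utilization):
    """Instances needed to serve load_rps at or below the target."""
    return math.ceil(load_rps / (max_rps_per_instance * target_utilization))

print(round(idle_buffer_fraction(0.70), 2))         # 0.3
print(round(absorbable_traffic_increase(0.70), 3))  # 0.429
print(instances_for_target(2000, 625, 0.70))        # 5
```

A 70% target leaves 30% of the fleet idle, which absorbs roughly a 43% rise in traffic before saturation. Recording both numbers avoids confusion about which one "30% buffer" refers to.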
For services with predictable traffic patterns (B2B SaaS with business-hours traffic, e-commerce with known promotional calendar), document scheduled scaling actions separately from reactive auto-scaling. A scheduled scale-out to 150% of steady-state capacity at 9:00am UTC every weekday costs less than reactive auto-scaling for the morning ramp (because the instances are available before the traffic arrives, eliminating the scale-out latency cost from the first-hour spike) and is simpler to reason about because the capacity decision is a calendar entry, not an algorithm. Document which traffic patterns are handled by scheduled scaling and which are handled by reactive auto-scaling, with the reasoning for the split.
5. Load testing methodology and traffic forecast model
Document the load testing approach that was used to validate the provisioning model: the tool used (k6, Locust, Artillery, or equivalent), the traffic pattern simulated (constant load to find throughput ceiling, or realistic spike shape to test scale-out latency), the specific test results that informed the instance sizing and scaling policy, and the date of the last load test. A load test result from the instance sizing decision two years ago is not evidence that the current provisioning model is correct — the application has changed, the traffic pattern has changed, and the instance count has changed. Document the last load test date and the condition under which a new load test is required: any instance type change, any application change that significantly alters resource consumption (new caching layer, new background job, dependency version upgrade with known performance regression), or any auto-scaling parameter change that affects the headroom calculation.
Document the traffic forecast model explicitly: what is the assumed monthly traffic growth rate, what event traffic multiplier is assumed for promotional events, and what is the process for updating the forecast when the observed growth deviates from the assumption. A service growing at 15% per month, compounding, grows more than 5× in 12 months (1.15^12 ≈ 5.35). The instance sizing that is appropriate at month 1 is severely underprovisioned at month 12 if the provisioning model was not designed with growth headroom. The forecast model in the ADR should include the calculation: at the assumed growth rate, when will the current maximum instance count become insufficient, and what is the lead time to provision additional capacity or switch to a higher-capacity instance type?
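The forecast calculation is a compound-growth formula. The sketch below assumes instance demand scales linearly with traffic and that the growth rate holds constant; the function names and the 4-instance and 20-instance figures are illustrative assumptions.

```python
import math

def growth_multiple(monthly_rate, months):
    """Traffic multiple after compounding monthly growth."""
    return (1 + monthly_rate) ** months

def months_until_exhausted(current_instances, max_instances, monthly_rate):
    """Months until required instances reach the configured maximum,
    assuming instance demand scales linearly with traffic."""
    return math.log(max_instances / current_instances) / math.log(1 + monthly_rate)

print(round(growth_multiple(0.15, 12), 2))           # 5.35
print(round(months_until_exhausted(4, 20, 0.15), 1))  # 11.5
```

With these inputs, a service that needs 4 instances today reaches a maximum of 20 in under a year. Subtracting the provisioning lead time from that figure gives the date by which the next capacity review has to happen.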
The decisions that look like operations but are actually architecture
Capacity planning decisions are often treated as operational configuration — the kind of work that gets done during an incident or a cost review, then handed to a junior engineer to maintain. This classification produces the failure modes described in the opening: instance sizes chosen for historical reasons, auto-scaling parameters copied from generic blog posts, buffer policies set by intuition rather than calculation. The cost floor is discovered during a finance escalation. The headroom ceiling is discovered during a traffic spike that generates customer-facing errors.
The reason capacity planning is architecture rather than operations is that its decisions cascade into the system's reliability and economics in ways that are not visible from the individual configuration parameters. The instance family affects the cost floor (an r5.2xlarge vs. an m5.2xlarge for a memory-bound workload changes the right-sizing calculation entirely). The scale-out latency — driven by application startup time — connects the application architecture (a JVM application with a slow startup vs. a Go binary that serves immediately) to the operational headroom ceiling (the static buffer required to cover the longer startup adds fixed cost). The auto-scaling maximum connects to external dependencies (the database connection pool maximum determines the horizontal scaling ceiling before connection exhaustion becomes the failure mode, not compute capacity).
These cross-cutting connections are the reason the capacity planning decision record belongs next to the database connection pooling decision record, the infrastructure architecture decision record, and the cost optimization decision record — not in the operations runbook. The decisions interact across all three, and the reasoning for each is only legible when the others are also written down.
The decisions that matter most — the static buffer percentage, the scale-out latency as a function of application startup time, the instance family choice relative to the workload's resource bottleneck — are made implicitly during the first production deployment and the first few scaling events. They live in the AI chat history of those sessions: the conversation where the founding engineer said "let's use m5.4xlarge because that's what we used at the last job and it's proven," the post-incident discussion where someone said "we should probably increase the auto-scaling minimum so this doesn't happen again," the cost review conversation where someone said "the bill is too high, let's reduce the minimum back down." These implicit decisions are the capacity model; the Terraform configuration is their implementation.
An export of the AI chat sessions around infrastructure provisioning decisions typically surfaces the original instance sizing conversation, the auto-scaling configuration discussion, and the series of capacity incidents that each produced a configuration change with no written rationale. The five ADR sections above are the structured form of those decisions. Writing them down before the next capacity event, rather than reconstructing them from Terraform history and chat logs after it, is the difference between a capacity model and a capacity accident waiting to happen.
Further reading
- The database connection pooling decision record — connection pool sizing determines the horizontal scaling ceiling before compute capacity
- The infrastructure decision record — the broader infrastructure architecture choices that constrain the provisioning model
- The cost optimization decision record — how right-sizing, reserved instance strategy, and idle resource elimination fit together as a cost model
- The database sharding decision record — how shard key design affects the capacity model for database tiers under horizontal scaling
- The multi-region deployment decision record — multi-region topology and how active-active vs. active-passive affects capacity planning for cross-region failover
- The CI/CD pipeline decision record — deployment frequency and how it interacts with auto-scaling cooldown design during rolling deploys
- The observability strategy decision record — the metrics and monitoring infrastructure that feeds the auto-scaling trigger and capacity alerting
- The incident response playbook decision record — capacity events as a category of production incident and how the runbook handles scale-out latency failures
- The feature flag decision record — feature flags as a capacity mitigation tool during traffic spikes (load shedding, graceful degradation)
- The microservices vs. monolith decision record — how service boundary design affects per-service capacity planning and scale-out granularity
- The test strategy decision record — load testing as a first-class testing discipline alongside unit and integration tests
- The decisions that never get written down — how capacity model choices join the class of consequential undocumented architectural decisions
- The WhyChose open-source extractor — recover the original capacity planning discussion from your AI chat history