Why does container orchestration need an architecture decision record?

The container orchestration decision is one of the highest-leverage architectural choices a team makes in year one, because it determines the operational complexity floor — the minimum ongoing maintenance work — for the life of the product. A team that chooses Kubernetes is committing to a continuous operational investment in control plane upgrades, node group maintenance, RBAC management, Helm release management, and the organizational knowledge of Kubernetes-specific tooling. A team that chooses ECS Fargate is committing to AWS-specific operational patterns with a lower knowledge floor but a tighter coupling to AWS abstractions. A team that chooses serverless containers is committing to the constraints of per-invocation billing, cold start latency, and execution time limits. These are structural commitments that shape hiring decisions, on-call runbooks, CI/CD pipelines, and how long it takes a new engineer to become productive in the production environment. Without a decision record, the team in year three cannot reason about whether to migrate to a different orchestration model, because the rationale for the original choice — the constraints that were present, the alternatives that were considered, the tradeoffs that were acceptable — is gone from the organizational memory.

What is the difference between Kubernetes and ECS from a team operations perspective?

From a team operations perspective, the primary difference between Kubernetes (including managed Kubernetes services like EKS, GKE, and AKS) and ECS is the operational complexity floor: the minimum ongoing work required to keep the platform running correctly. Kubernetes is a general-purpose container orchestration system with a rich extension model: the concepts required to operate it productively (Pods, Deployments, Services, Ingresses, PersistentVolumes, StorageClasses, RBAC, NetworkPolicies, ConfigMaps, Secrets, HorizontalPodAutoscalers, CustomResourceDefinitions) have steep learning curves, and the ecosystem of tooling built on top of Kubernetes (Helm, Argo CD, Flux, Istio, Karpenter, Prometheus Operator, cert-manager, external-dns) adds additional operational surface area. The control plane requires version upgrades on a regular cadence; managed services perform the control plane upgrade itself, but the team still initiates it and tests workloads for compatibility, and node groups require explicit maintenance even on managed services. ECS with Fargate eliminates the node management problem entirely, because AWS manages the underlying compute, and uses a simpler conceptual model (task definitions, service definitions, capacity providers) that requires less organizational knowledge to operate correctly. The tradeoff is that ECS is tightly coupled to AWS abstractions and has a narrower ecosystem of tooling; operational patterns and community solutions that work for Kubernetes are not directly portable to ECS. Teams choosing between them must weigh the operational investment that Kubernetes requires against the flexibility and ecosystem advantages it provides.

What autoscaling capabilities should the container orchestration ADR specify?

The container orchestration ADR must specify the autoscaling mechanism, the metrics used to trigger scaling, the scale-out latency, the scale-in behavior, and the infrastructure autoscaling required to support pod or task scale-out. For Kubernetes, it should state whether the team uses the Horizontal Pod Autoscaler (HPA) with CPU and memory metrics, KEDA (Kubernetes Event-driven Autoscaling) with queue depth or HTTP request rate metrics, or the Vertical Pod Autoscaler (VPA) for right-sizing. It should also name the node autoscaling mechanism (Cluster Autoscaler or Karpenter) that provides compute capacity for new pods, along with the instance type selection strategy and the expected node provisioning time. For ECS, it should state whether service autoscaling uses target tracking (for example, maintain average CPU at 60%) or step scaling (add N tasks when queue depth exceeds a threshold), and whether Fargate or EC2 capacity providers supply compute. Finally, it should record the expected end-to-end scale-out latency: from the metric threshold being breached, to the scaling decision being made, to new tasks or pods being scheduled and ready to serve traffic. This latency number, which for Kubernetes can range from about 90 seconds when pods land on a pre-provisioned node to 5 minutes when a new EC2 node must be provisioned and warmed up, determines how the product behaves during sudden traffic spikes, and it must be measured rather than assumed.
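One way to keep these fields from being skipped is to give the autoscaling section of the ADR a fixed shape. The sketch below is only an illustration: the class name, field names, and the example service are hypothetical and do not come from any ADR standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AutoscalingSpec:
    """Hypothetical shape for the autoscaling section of an orchestration ADR."""
    service: str
    mechanism: str           # e.g. "HPA", "KEDA", "ECS target tracking"
    trigger_metric: str      # e.g. "cpu_utilization", "sqs_queue_depth"
    target_value: float
    scale_in_behavior: str   # free-text description of cooldowns and limits
    capacity_layer: str      # e.g. "Karpenter", "Fargate", "EC2 capacity provider"
    measured_scale_out_seconds: Optional[float] = None  # None means never measured

    def problems(self) -> List[str]:
        """Return the gaps that leave this ADR section incomplete."""
        issues = []
        if self.measured_scale_out_seconds is None:
            issues.append("scale-out latency is assumed, not measured")
        if not self.scale_in_behavior.strip():
            issues.append("scale-in behavior is unspecified")
        return issues


# Example entry with two deliberate gaps.
spec = AutoscalingSpec(
    service="ingestion-worker",
    mechanism="KEDA",
    trigger_metric="sqs_queue_depth",
    target_value=100,
    scale_in_behavior="",
    capacity_layer="Karpenter",
)
print(spec.problems())
```

A check like this could run in CI against the ADR's structured appendix, so a missing latency measurement is flagged before an incident reveals it.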

When should a startup use serverless containers instead of Kubernetes or ECS?

Serverless container platforms — AWS Lambda, Google Cloud Run, AWS App Runner, Azure Container Apps — are appropriate when the workload has three properties: variable traffic with periods of near-zero load (the per-invocation billing model produces lower costs than always-on container platforms when average utilization is below 20–30%), stateless execution that can complete within the platform's maximum execution time limit (15 minutes for Lambda, 60 minutes for Cloud Run with request streaming), and tolerance for cold start latency (100ms–3s depending on the platform and runtime, though Lambda SnapStart, minimum instances, and provisioned concurrency can reduce cold starts to under 100ms at additional cost). Startups whose workloads meet these criteria and who want to minimize operational complexity in year one are appropriate candidates for serverless containers. The cases where serverless containers are inappropriate: long-running background jobs that exceed execution time limits, workloads requiring persistent local state (a websocket server, a streaming data processor that accumulates in-memory state across requests), teams whose observability and debugging tooling assumes always-on containers rather than ephemeral execution environments, and workloads where the cold start latency under sudden traffic bursts — every new execution environment starts cold — is unacceptable for the product's SLO. The container orchestration ADR must specify which services use serverless containers and why, including the expected cold start frequency and the acceptable cold start latency for each service.
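The utilization threshold can be reasoned about with a simple break-even calculation. The sketch below assumes a deliberately simplified cost model: the always-on platform costs a fixed amount per hour, and the serverless platform costs a fixed amount per hour of actual busy time. The prices are hypothetical units, not real list prices.

```python
def break_even_utilization(always_on_cost_per_hour: float,
                           serverless_cost_per_busy_hour: float) -> float:
    """Utilization below which per-use billing is cheaper than always-on.

    Serverless cost per wall-clock hour is utilization * busy-hour cost,
    so the two models cost the same when utilization equals the price ratio.
    """
    if serverless_cost_per_busy_hour <= 0:
        raise ValueError("serverless cost must be positive")
    return min(1.0, always_on_cost_per_hour / serverless_cost_per_busy_hour)


def cheaper_option(utilization: float,
                   always_on_cost_per_hour: float,
                   serverless_cost_per_busy_hour: float) -> str:
    """Name the cheaper billing model at a given average utilization (0..1)."""
    threshold = break_even_utilization(always_on_cost_per_hour,
                                       serverless_cost_per_busy_hour)
    return "serverless" if utilization < threshold else "always-on"


# Hypothetical prices: serverless compute is 4x the always-on hourly rate.
print(break_even_utilization(4.0, 16.0))
print(cheaper_option(0.10, 4.0, 16.0))
print(cheaper_option(0.40, 4.0, 16.0))
```

With a 4x price ratio the break-even lands at 25% utilization, inside the 20–30% band quoted above; the real ratio depends on the platform, memory size, and request pricing, so it has to be computed from the team's own bill.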

2026-06-21 · ~22 min read

The container orchestration decision record: why the Kubernetes vs ECS vs serverless choice you made determines your operational complexity floor and your autoscaling behavior under load

Container orchestration looks like an infrastructure choice until a traffic spike reveals that your autoscaling model adds four minutes of latency to scale-out, your on-call engineer doesn't know why Kubernetes was chosen over ECS in the first place, and the incident postmortem session where the migration was decided is buried somewhere in a closed ChatGPT window from eighteen months ago. The orchestration platform you chose in year one sets the operational complexity floor — the minimum ongoing maintenance work your team carries for the life of the product — and it determines the autoscaling behavior your users experience when demand changes suddenly, not just when it grows predictably.

A twelve-person marketing analytics company ran their entire platform on a single DigitalOcean droplet for fourteen months. One docker-compose file, eight containers, 16 GB of RAM, and a backup cron job to S3. It was enough. The event ingestion pipeline processed 3,000 events per second at peak, the API served a few hundred concurrent client dashboard sessions, and the team shipped new features every two weeks without touching the infrastructure. The operational complexity floor was as low as it gets: restart a container, tail a log file, upgrade a base image. Any engineer on the team could do it in ten minutes.

Then a major e-commerce client ran a Black Friday campaign. Attribution traffic hit 47,000 events per second at 9am EST. The droplet's CPU pinned at 100% in six minutes. The event queue backed up faster than the ingestion workers could drain it. By 11am, the queue had accumulated an eight-hour backlog of unprocessed attribution events. The client's attribution data for their largest revenue day of the year had a four-hour gap. The incident post-mortem produced a three-hour ChatGPT session where the CTO and the lead infrastructure engineer decided they needed Kubernetes.

They chose Amazon EKS. Six weeks of migration, a new Helm chart repository, a new CI/CD structure, Karpenter for node autoscaling, and a fresh on-call runbook. In year two, the platform handled traffic spikes without dropping events. But in year three, the pattern the migration introduced had become visible: the team spent eight hours per week on EKS-specific operational work — control plane upgrade testing, node group AMI updates, Karpenter drift remediation, certificate rotation via cert-manager, Argo CD application sync debugging. When they hired a new infrastructure engineer, three months of their ramp was spent on Kubernetes-specific concepts before they could contribute to production incidents. And when a routine autoscaling investigation revealed that their Karpenter-provisioned nodes took four minutes from metric threshold to new-pod-ready — four minutes being exactly the gap between the SLA breach and the remediation during load events — the CTO opened a new investigation: should they move back to ECS Fargate?

The three-hour ChatGPT session where EKS was chosen over ECS Fargate, Cloud Run, and App Runner no longer existed in the company's accessible history. The rationale — whatever constraints, priorities, and tradeoffs were on the table that day — was gone. The team evaluating the migration back to ECS was starting from scratch, unable to reason from the original decision. The decisions that aren't written down don't disappear; they become the unexplained constraints that shape every subsequent decision until someone hits them hard enough to notice they were there.

Why container orchestration is a decision cluster, not a configuration

Container orchestration is typically framed as a configuration problem: pick a platform, write YAML, deploy containers. The decisions embedded in the platform choice — what the operational complexity floor will be, how autoscaling will work under sudden load, what the migration cost will be if the platform is wrong, what knowledge the team needs to carry in its heads to respond to incidents — are not surfaced in the initial configuration conversation. They are revealed incrementally over months and years as the team encounters the structural properties of the choice they made.

The operational complexity floor is a commitment, not a starting point. Kubernetes does not become simpler as the team grows more familiar with it; the surface area of Kubernetes concepts, tooling, and operational patterns expands as the team uses more of the platform's capability. A team that starts with a simple Kubernetes deployment discovers that production-grade Kubernetes requires: RBAC configuration to control who can do what to which resources; NetworkPolicies to implement zero-trust pod networking; PodDisruptionBudgets to prevent cluster maintenance from disrupting service availability; Pod Security Admission policies to prevent containers from running as root; HorizontalPodAutoscaler configuration with appropriate resource requests and limits to make autoscaling work correctly; resource quotas and LimitRanges to prevent runaway resource consumption; storage class configuration for any stateful workloads; and ingress controller selection and configuration (nginx, Traefik, the AWS Load Balancer Controller, formerly the ALB Ingress Controller, or Envoy-based options). Each of these is a separate decision with its own operational implications. The infrastructure-as-code ADR must specify how all of these Kubernetes-specific resources are managed, versioned, and audited, which adds its own operational surface area on top of the Kubernetes platform itself.

The autoscaling model is determined by the orchestration choice, not configured independently. How the application scales in response to traffic changes is a function of the autoscaling mechanism available on the chosen platform, the metric types that mechanism can consume, the minimum scale-out latency the mechanism achieves, and the infrastructure autoscaling required to provide compute for scaled-out workloads. A team that chooses Kubernetes gets the HPA and KEDA as the primary scaling mechanisms. A team that chooses ECS gets target tracking and step scaling policies. A team that chooses Lambda gets concurrency-based scaling that responds in milliseconds but with cold start latency on new execution environments. The team does not choose these mechanisms independently — they inherit them when they choose the orchestration platform.

The migration cost is front-loaded into the initial choice. Every operational pattern, every CI/CD pipeline stage, every monitoring dashboard, every alerting rule, and every on-call runbook that the team builds around the orchestration platform is a migration cost embedded into the platform choice. A team migrating from ECS to Kubernetes must rewrite deployment configuration, retrain the team, rebuild pipelines and dashboards, and carry a period of dual-platform operational complexity during the migration. A team migrating from Kubernetes to ECS faces the same costs in reverse, plus the cost of disentangling from the Kubernetes-native tooling ecosystem (Argo CD, Flux, Istio, Helm) that may not have direct ECS equivalents. The orchestration ADR must acknowledge this migration cost as a constraint on future decisions — not because migration is impossible, but because the cost must be justified by the problems being solved, and that justification requires knowing why the original choice was made.

The platform determines the knowledge ceiling for incident response. When something fails in production at 2am, the on-call engineer's ability to diagnose and remediate is bounded by their knowledge of the orchestration platform. An ECS Fargate incident requires: knowing how to inspect stopped task exit codes, how to read ECS service event logs, how to check CloudWatch Container Insights metrics, how to update a task definition and force a service deployment. A Kubernetes incident requires: knowing how to read pod events and logs, how to describe a deployment and check its rollout status, how to identify a CrashLoopBackOff and its root cause, how to check node conditions and events, how to drain a node for maintenance, and — for complex failures — how to inspect etcd health, DNS resolution, and CNI plugin behavior. The Kubernetes incident surface is larger and requires more specialized knowledge. That knowledge requirement is not a criticism of Kubernetes — it is a structural property of the platform that belongs in the orchestration ADR alongside the autoscaling behavior and the migration cost.

The orchestration options and their structural properties

The orchestration landscape has consolidated around four categories: managed Kubernetes, AWS ECS (with Fargate and EC2 capacity), serverless container platforms, and self-managed Kubernetes. Each has structural properties that determine the operational complexity floor, the autoscaling behavior, and the team knowledge requirement.

Managed Kubernetes (EKS, GKE, AKS) is the most common choice for teams that want Kubernetes' ecosystem and extensibility without operating the control plane themselves. The managed control plane handles etcd, the Kubernetes API server, the scheduler, and controller-manager — the components that require the most specialized knowledge to operate. What remains for the team: node group management (instance types, AMI versions, node count, autoscaling configuration), cluster add-on management (CoreDNS, kube-proxy, VPC CNI on EKS, or equivalent), Kubernetes version upgrades (managed services handle control plane upgrades but the team initiates them and tests for compatibility), workload configuration (Deployments, Services, Ingresses, HPAs, RBAC, NetworkPolicies), and the operational tooling installed on the cluster (Helm chart releases, operators, CRDs). The critical structural property: even with a managed control plane, Kubernetes operational complexity does not reduce to zero. The team carries the ongoing work of keeping the cluster healthy, up-to-date, and correctly configured. GKE Autopilot and EKS Auto Mode reduce the node management overhead significantly — node provisioning, sizing, and maintenance are handled automatically — but they impose constraints on pod security contexts, compute class availability, and resource request requirements that must be understood before adoption.

AWS ECS with Fargate is the serverless-compute container orchestration model: the team defines what runs (task definitions) and how it is deployed (service definitions, load balancer configuration, autoscaling policies), and AWS manages the underlying compute entirely. There are no nodes to maintain, no AMI versions to track, no control plane to upgrade. The operational model for Fargate tasks is closer to Lambda than to Kubernetes: the team specifies the CPU and memory allocation, the container image, the environment variables, the task role, and the network configuration, and ECS places the task on Fargate compute that AWS manages. The structural properties: the operational complexity floor is substantially lower than Kubernetes — a team can operate ECS Fargate productively with a fraction of the Kubernetes-specific knowledge. The constraints: ECS is tightly coupled to AWS abstractions (IAM, VPC, ALB, CloudWatch, ECS-specific service discovery via AWS Cloud Map) and has a narrower ecosystem than Kubernetes. Multi-cluster, multi-region, and hybrid deployments require more custom tooling on ECS than on Kubernetes, where multi-region topology is better supported by the ecosystem. The Kubernetes tooling ecosystem — Argo CD for GitOps, Istio for service mesh, Argo Rollouts for canary deployments, Karpenter for intelligent node provisioning — does not apply to ECS. Teams that want the specific capabilities of Kubernetes-native tooling must either build equivalent capabilities on ECS or accept the operational overhead of Kubernetes to access them.

ECS with EC2 capacity providers combines ECS's deployment model with EC2-backed compute that the team partially manages. The team defines EC2 Auto Scaling groups as capacity providers, and ECS scales the cluster by launching new instances when insufficient capacity exists for pending tasks. This model is appropriate when Fargate cost per vCPU-hour is prohibitive at scale, when workloads require specific instance types (GPU instances for ML inference, high-memory instances for in-memory data processing, instances with specific hardware), or when network performance requirements exceed what Fargate offers. The operational complexity is higher than Fargate but lower than Kubernetes: the team manages EC2 Auto Scaling group configuration, launch templates, and instance lifecycle hooks, without the Kubernetes conceptual surface area. The autoscaling behavior is more complex than Fargate: scale-out requires both ECS service autoscaling (adding new tasks) and EC2 cluster autoscaling (adding new instances to provide capacity), with the EC2 instance launch time adding to the end-to-end scale-out latency.

Serverless container platforms — AWS Lambda (with container image support), Google Cloud Run, AWS App Runner, Azure Container Apps — treat each request or invocation as the unit of execution rather than the container instance. The platform scales from zero to thousands of concurrent executions without any capacity management, and billing is per invocation rather than per running instance. The structural properties: scale-out latency for new execution environments is limited only by the cold start time of the container image (100ms to 3 seconds depending on the platform and runtime), making them uniquely suited for workloads with sudden traffic spikes. The constraints: maximum execution time limits (15 minutes for Lambda, 60 minutes for Cloud Run), stateless execution requirements (local file system writes are ephemeral and not shared across invocations), and cold start latency that affects the first request to a new execution environment. Lambda SnapStart for Java functions and minimum instance configurations for Cloud Run can reduce cold start latency to under 100ms for critical paths, at additional cost. Teams whose workloads fit within these constraints and whose traffic patterns include significant idle periods (where per-invocation billing produces lower costs than always-on containers) are appropriate candidates for serverless containers as the primary orchestration model.

Self-managed Kubernetes, meaning the team runs the Kubernetes control plane on its own VMs, whether in a cloud provider or on-premises, is appropriate for regulatory requirements that prohibit managed cloud services, for air-gapped environments without internet access to a managed service's APIs, or for very large-scale deployments where the managed service cost model is prohibitive. The operational complexity floor is the highest of any option: the team is responsible for etcd health (including backup and restore procedures for the most critical data store in the cluster), control plane component availability (high availability control plane configuration requires at least three control plane nodes), Kubernetes version upgrades including etcd upgrades, and the full operational surface area of Kubernetes itself. This model is not appropriate for early-stage teams: for teams running fewer than several hundred nodes and without specialized compliance requirements, the engineering cost of operating a self-managed Kubernetes cluster exceeds the benefit.

Autoscaling behavior under load

The autoscaling behavior is the property of the orchestration choice that has the most direct effect on the product's behavior during traffic spikes. The opening narrative's four-minute scale-out latency is not a Kubernetes misconfiguration — it is the expected result of the Karpenter node provisioning time added to the HPA reaction time. Understanding this latency is a requirement, not an optimization, for the orchestration ADR.
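The additive nature of this latency can be sketched as a budget of stages, each with a best and worst case. The stage durations below are illustrative assumptions (the pod startup range in particular is a guess), so a real ADR would replace them with measured values.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (best case, worst case) in seconds


def total_latency(stages: Dict[str, Range]) -> Range:
    """Sum per-stage best and worst cases into an end-to-end range."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return (best, worst)


# Assumed stage durations for a Kubernetes HPA scale-out.
hpa_reaction = {
    "metric collection": (15, 15),
    "HPA control loop": (15, 15),
    "pod scheduling and startup": (20, 60),  # assumption: image already cached
}
# Assumed extra cost when no existing node has room for the new pods.
node_provisioning = {"new node provisioning": (120, 180)}

warm = total_latency(hpa_reaction)
cold = total_latency({**hpa_reaction, **node_provisioning})
print("warm nodes:", warm)
print("new node required:", cold)
```

Under these assumptions a four-minute (240-second) scale-out sits inside the range for the path that needs a new node, which is the point of the paragraph above: the latency is a sum of expected stages, not a fault.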

Kubernetes HPA (Horizontal Pod Autoscaler) scales the number of pod replicas in a Deployment or ReplicaSet based on metrics from the metrics server (CPU and memory utilization, aggregated from Kubelet) or from external metrics providers (custom metrics from Prometheus via the custom metrics API, external metrics from cloud provider monitoring). The HPA control loop runs every 15 seconds by default and computes the desired replica count based on the observed metric value relative to the target. The scale-out latency has three components: (1) the metric collection interval (15 seconds from Kubelet to metrics server), (2) the HPA control loop evaluation (another 15 seconds), and (3) the pod scheduling and startup time (new pods must be scheduled onto existing nodes, pull the container image if not cached, pass the readiness probe, and receive traffic from the service). For workloads with a warm node pool and a cached container image, end-to-end scale-out can be under 90 seconds. For workloads that require new nodes to accommodate the new pods, the node provisioning time is added. Karpenter can launch a new EC2 instance in 60–90 seconds under favorable conditions, but the instance must still join the cluster and pull the container image without a warm cache before the pod becomes ready, so needing a new node adds roughly 2–3 minutes to the end-to-end scale-out latency. Added to the warm-path latency, that puts the four-minute scale-out in the opening narrative directly in the expected range.
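The replica computation the control loop performs follows the formula given in the Kubernetes documentation: desired replicas equal the current replicas multiplied by the ratio of observed to target metric, rounded up. The sketch below reproduces that core formula with the default 10% tolerance; it leaves out stabilization windows, scaling policies, and the handling of unready pods, and the function name is only illustrative.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int = 1,
                         max_replicas: int = 10,
                         tolerance: float = 0.1) -> int:
    """Core HPA formula: ceil(current_replicas * current_metric / target_metric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


print(hpa_desired_replicas(4, 90, 60))    # CPU at 90% against a 60% target
print(hpa_desired_replicas(4, 63, 60))    # within tolerance
print(hpa_desired_replicas(10, 180, 60))  # clamped at max_replicas
```

The formula also shows why resource requests matter: CPU utilization is measured relative to the pod's request, so a wrong request shifts the ratio and the scaling decision with it.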

KEDA (Kubernetes Event-driven Autoscaling) extends the HPA model to scale on queue depth, HTTP request rate, cron schedules, and dozens of other external signal sources. A KEDA ScaledObject targeting an SQS queue can scale the Deployment from zero to N replicas based on the queue depth, enabling true scale-to-zero for workloads that are idle during off-hours. The KEDA controller polls the external scaler (SQS, Kafka, RabbitMQ, Prometheus, Datadog) at a configurable interval and computes the desired replica count. The scale-out latency for KEDA-driven scaling is similar to HPA: the polling interval plus pod scheduling plus startup time, with node provisioning added if required. KEDA is the right autoscaling mechanism for queue-driven background workers, scheduled batch jobs, and any workload whose load signal is not CPU utilization. The orchestration ADR must specify which workloads use HPA, which use KEDA, and which KEDA scaler and target metric applies to each — the choice of target metric is an implicit decision about what "high load" means for each service.
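For a queue-driven worker, the target metric is usually expressed as "messages per replica". The sketch below shows the arithmetic under that assumption; it is a simplification of what KEDA and the HPA do between them, and the function name and default bounds are hypothetical.

```python
import math


def replicas_for_queue(queue_depth: int,
                       target_per_replica: int,
                       min_replicas: int = 0,
                       max_replicas: int = 20) -> int:
    """Replica count that keeps each replica at or below its target backlog."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    desired = math.ceil(queue_depth / target_per_replica)
    # min_replicas=0 allows scale-to-zero when the queue is empty.
    return max(min_replicas, min(max_replicas, desired))


print(replicas_for_queue(0, 100))     # idle: scale to zero
print(replicas_for_queue(250, 100))   # 250 messages, 100 per replica
print(replicas_for_queue(5000, 100))  # capped by max_replicas
```

Choosing 100 rather than 1,000 messages per replica is exactly the implicit definition of "high load" that the ADR should make explicit for each service.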

ECS service autoscaling uses Application Auto Scaling to adjust the desired task count for an ECS service. Target tracking policies maintain an average metric value (CPU utilization at 60%, request count per target at 1000, SQS queue depth per task at 100) by adding or removing tasks as needed. Step scaling policies respond to CloudWatch alarm threshold crossings with predefined capacity adjustments. The end-to-end scale-out latency for ECS Fargate is the sum of the CloudWatch metric collection interval (60 seconds for standard resolution, 10 seconds for high-resolution custom metrics), the Application Auto Scaling evaluation interval (1–10 minutes depending on the cooldown period configuration), and the Fargate task launch time (30–90 seconds from task launch request to healthy task receiving traffic). Under typical configuration, ECS Fargate scale-out completes in 2–4 minutes from metric threshold breach to new task serving traffic. This is somewhat slower than the sub-90-second warm-node case for Kubernetes HPA, and comparable to or faster than Kubernetes HPA when new node provisioning is required. The structural property that makes ECS Fargate autoscaling simpler is that there is no node layer to autoscale: the team configures only the ECS service autoscaling, not both pod autoscaling and node autoscaling as in Kubernetes.
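Step scaling can be pictured as a lookup from metric value to capacity adjustment. The sketch below assumes absolute queue-depth bounds for readability; real step scaling policies define their bounds relative to the CloudWatch alarm threshold, and the thresholds and function name here are hypothetical.

```python
from typing import List, Tuple

# (queue depth lower bound, tasks to add): hypothetical policy steps.
STEPS: List[Tuple[int, int]] = [(1000, 1), (5000, 3), (20000, 10)]


def step_adjustment(metric_value: float, steps: List[Tuple[int, int]]) -> int:
    """Return how many tasks to add for the highest step the metric has reached."""
    adjustment = 0
    for lower_bound, tasks_to_add in sorted(steps):
        if metric_value >= lower_bound:
            adjustment = tasks_to_add
    return adjustment


for depth in (500, 1000, 7500, 50000):
    print(depth, "->", step_adjustment(depth, STEPS))
```

Target tracking needs no such table because the service computes the adjustment itself, which is why it tends to be the default and step scaling the choice for workloads that need a sharper response to large breaches.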

ECS with EC2 capacity providers adds the EC2 Auto Scaling group to the scaling chain. When ECS requires more tasks than the current cluster capacity can accommodate, the capacity provider requests additional EC2 instances from the Auto Scaling group. EC2 instance launch time from the Auto Scaling request to the instance joining the ECS cluster and being available for task placement is typically 2–4 minutes for standard instances, longer for larger instance types or when Spot capacity is constrained. The end-to-end scale-out latency for ECS on EC2 with on-demand capacity is 4–7 minutes from metric threshold to new task serving traffic under typical configuration — comparable to Kubernetes with Karpenter, and slower than ECS Fargate for the same workload. Managed Spot capacity via ECS capacity providers adds the Spot interruption risk and the compensating logic (draining tasks gracefully on a Spot interruption notice) as an operational concern.
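How many instances the capacity provider has to launch depends on how many tasks fit on one instance. The sketch below is a rough estimate only: it ignores memory reserved for the OS and ECS agent, free capacity on existing instances, and placement constraints. CPU is expressed in ECS CPU units (1024 per vCPU), and the sizes are hypothetical.

```python
import math


def instances_needed(pending_tasks: int,
                     task_cpu_units: int,
                     task_memory_mib: int,
                     instance_cpu_units: int,
                     instance_memory_mib: int) -> int:
    """Estimate how many new instances are needed to place the pending tasks."""
    # A task must fit in both dimensions, so the tighter one sets the limit.
    tasks_per_instance = min(instance_cpu_units // task_cpu_units,
                             instance_memory_mib // task_memory_mib)
    if tasks_per_instance < 1:
        raise ValueError("a single task does not fit on this instance type")
    return math.ceil(pending_tasks / tasks_per_instance)


# 20 tasks of 0.5 vCPU / 1 GiB on 4 vCPU / 8 GiB instances.
print(instances_needed(20, 512, 1024, 4096, 8192))
# 10 memory-heavy tasks of 0.25 vCPU / 4 GiB on 4 vCPU / 16 GiB instances.
print(instances_needed(10, 256, 4096, 4096, 16384))
```

Each of those instances has to launch and join the cluster before any pending task can start, which is where the extra minutes in the EC2-backed scaling chain come from.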

Serverless container autoscaling is concurrency-driven: a Lambda execution environment serves one request at a time, while a Cloud Run instance serves up to a configured number of concurrent requests. When concurrent requests exceed what the running environments can absorb, the platform scales out by launching new execution environments in parallel. The scale-out latency to a new execution environment is the cold start time: 100ms to 3 seconds depending on the platform, the runtime, the image size, and the VPC attachment configuration. This is dramatically faster than Kubernetes or ECS scale-out under sudden load, at the cost of the cold start latency on the first request to each new execution environment. Lambda Provisioned Concurrency pre-warms a specified number of execution environments, eliminating cold start latency for up to that concurrency level at the cost of paying for the provisioned concurrency continuously. Cloud Run minimum instances serve the same function. The orchestration ADR must specify whether provisioned concurrency or minimum instances are used, the target concurrency level, and the cost implication, because provisioned concurrency at scale can exceed the cost of an always-on ECS Fargate service.
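The target concurrency level can be estimated with Little's law: average concurrency equals request rate multiplied by average request duration. The sketch below applies that estimate and then counts how many environments would start cold given a pre-warmed pool; the function names and traffic figures are hypothetical.

```python
import math


def required_environments(requests_per_second: float,
                          avg_duration_seconds: float,
                          requests_per_environment: int = 1) -> int:
    """Little's law: concurrency = arrival rate * time in system.

    requests_per_environment is 1 for a one-request-at-a-time model and
    higher for platforms that multiplex requests onto one instance.
    """
    concurrency = requests_per_second * avg_duration_seconds
    return math.ceil(concurrency / requests_per_environment)


def cold_environments(required: int, pre_warmed: int) -> int:
    """Environments that must cold start once the pre-warmed pool is used up."""
    return max(0, required - pre_warmed)


needed = required_environments(400, 0.25)  # 400 req/s at 250 ms each
print(needed)
print(cold_environments(needed, pre_warmed=60))
print(required_environments(400, 0.25, requests_per_environment=80))
```

Running the same numbers at the expected spike rate rather than the average rate gives the concurrency level that a provisioned or minimum-instance setting would have to cover, and therefore its continuous cost.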

Networking and service discovery

The networking model for container-to-container communication and external traffic routing is determined by the orchestration platform and carries its own operational complexity and configuration decisions.

Kubernetes networking is built on the CNI (Container Network Interface) plugin model. The CNI plugin implements pod networking: how pods get IP addresses, how pods communicate with each other across nodes, and how NetworkPolicies are enforced. AWS VPC CNI on EKS assigns each pod a real VPC IP address, allowing pods to be reachable from other VPC resources without NAT, at the cost of consuming VPC IP space at the rate of one IP per pod (a constraint in IP-address-constrained VPCs). Flannel, Calico, and Cilium are alternative CNI plugins with different networking models, NetworkPolicy support, and operational characteristics. Kubernetes Services provide stable DNS names and IP addresses for pod groups, with kube-proxy implementing the load balancing via iptables or IPVS rules. Ingress controllers (nginx, Traefik, the AWS Load Balancer Controller, formerly the ALB Ingress Controller) handle external traffic routing from outside the cluster to Services. The orchestration ADR must specify the CNI plugin, the service networking model, and the ingress controller choice. Each is a separate decision with operational implications that affect the team's ability to implement service mesh policies and zero-trust networking in the future.
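The one-IP-per-pod constraint is easy to quantify for a given subnet. The sketch below gives an optimistic upper bound: it accounts for the five addresses AWS reserves in every subnet and one address per node, but ignores the VPC CNI's warm IP pool, per-instance ENI limits, and any other resources sharing the subnet. The CIDR blocks are hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def usable_ips(cidr: str) -> int:
    """Addresses in the subnet that AWS allows to be assigned."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def max_pods(cidr: str, node_count: int) -> int:
    """Upper bound on pods when every pod and every node takes one subnet IP."""
    return max(0, usable_ips(cidr) - node_count)


print(usable_ips("10.0.1.0/24"))
print(max_pods("10.0.1.0/24", node_count=10))
print(max_pods("10.0.16.0/20", node_count=10))
```

A /24 that looked generous for a handful of EC2 instances caps out at a few hundred pods, which is why subnet sizing belongs in the ADR alongside the CNI choice.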

ECS networking uses VPC networking directly: each ECS task (in awsvpc network mode) gets an Elastic Network Interface with a VPC IP address, analogous to Kubernetes VPC CNI. ECS service discovery uses AWS Cloud Map to register task IP addresses and port mappings as DNS records, or AWS Application Load Balancer (ALB) target groups for HTTP/HTTPS traffic routing. ALB path-based and host-based routing rules replace the Ingress controller abstraction. The networking model is simpler than Kubernetes — no CNI plugin to manage, no kube-proxy, no Ingress controller configuration — at the cost of less flexibility. Complex routing scenarios (traffic splitting by header, canary weight-based routing, gRPC load balancing) that Kubernetes Ingress controllers and service mesh implementations handle natively require additional ALB listener rule configuration or a separate service mesh layer on ECS.

Service mesh placement relative to the orchestration choice is an important sequencing decision. Istio and Linkerd are Kubernetes-native service meshes with no direct ECS equivalent. A team that chooses ECS and later needs traffic splitting for canary deployments, mTLS for zero-trust networking, or distributed tracing via service mesh sidecars must either add these capabilities through ECS-compatible mechanisms (AWS App Mesh for mTLS and traffic splitting, AWS Distro for OpenTelemetry for tracing) or migrate to Kubernetes to access the native service mesh ecosystem. The orchestration ADR must acknowledge the service mesh dependency if any of these capabilities are anticipated requirements, because the dependency constrains the orchestration choice.

The operational complexity floor in practice

The operational complexity floor is the most significant structural property for teams making the orchestration choice in year one, because it determines how much ongoing engineering time the infrastructure platform consumes and how accessible incident response is to engineers with varying levels of infrastructure experience.

Kubernetes ongoing maintenance includes: cluster version upgrades (managed Kubernetes services release new minor versions every 3–4 months; clusters must stay within the supported version window — typically N-2 or N-3 versions — or lose patch support and eventually lose access to cloud provider integrations); node group maintenance (AMI versions for EKS managed node groups are updated independently of the cluster version; applying AMI updates requires rolling node group upgrades that drain and terminate each node while scheduling pods onto other nodes, with the risk of pod evictions that exceed PodDisruptionBudgets or fail due to insufficient capacity); add-on management (CoreDNS, kube-proxy, VPC CNI, and the AWS Load Balancer Controller have independent version lifecycles that must be kept within compatibility ranges with the cluster version); custom resource definition management (operators installed via Helm charts add CRDs to the cluster that must be upgraded when the operator is upgraded — CRD upgrades can be destructive if not handled correctly); and certificate management (TLS certificates for Ingress resources, mTLS for service mesh, and the Kubernetes API server's serving certificate have expiration dates that must be monitored and renewed, typically via cert-manager).
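The supported version window lends itself to an automated check. The sketch below compares minor versions only and treats the window size as a parameter, since the exact policy differs between providers; the version strings are hypothetical examples.

```python
def minor_versions_behind(cluster_version: str, latest_version: str) -> int:
    """Number of minor releases between the cluster and the latest version."""
    cluster_major, cluster_minor = (int(p) for p in cluster_version.split(".")[:2])
    latest_major, latest_minor = (int(p) for p in latest_version.split(".")[:2])
    if cluster_major != latest_major:
        raise ValueError("major versions differ; compare manually")
    return latest_minor - cluster_minor


def within_support_window(cluster_version: str,
                          latest_version: str,
                          window: int = 2) -> bool:
    """True when the cluster is no more than `window` minors behind (N-window)."""
    behind = minor_versions_behind(cluster_version, latest_version)
    return 0 <= behind <= window


print(within_support_window("1.29", "1.31"))            # N-2: still inside
print(within_support_window("1.27", "1.31"))            # N-4: outside
print(within_support_window("1.28", "1.31", window=3))  # N-3 policy
```

Run on a schedule, a check like this turns "we should upgrade soon" into an alert with a deadline, which is the kind of recurring work the complexity floor describes.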

ECS Fargate ongoing maintenance includes: task definition updates when the container image base is updated (patching the OS in the container image is the team's responsibility, but there are no underlying EC2 nodes to patch); ECS agent version updates (relevant only for EC2 capacity providers, where the ECS agent running on each instance must be kept current); and CloudWatch metric configuration updates when new autoscaling targets are added. The maintenance surface area is significantly smaller than Kubernetes. A team on ECS Fargate can go multiple months without ECS-specific infrastructure maintenance and encounter no accumulating technical debt in the platform layer — the same is not true for Kubernetes, where deferring upgrades for more than a few months creates a compounding compatibility risk.

The incident knowledge floor is the set of concepts and tooling an on-call engineer must understand to diagnose and remediate production incidents on each platform. For ECS Fargate: CloudWatch Container Insights for container metrics and logs, ECS console or CLI for service and task state inspection, ALB access logs and target group health checks for traffic routing issues, and AWS CloudTrail for API-level audit of ECS configuration changes. For Kubernetes: kubectl commands for pod, deployment, service, ingress, and node inspection; understanding of pod scheduling (why is a pod Pending?), image pull errors, CrashLoopBackOff root causes, readiness probe failures, resource request and limit exhaustion, and eviction due to node memory pressure; Kubernetes API server audit logs for authorization failures; and the metrics and alerting surface area of whatever monitoring stack is installed (Prometheus, Grafana, or a managed equivalent). The observability ADR and the orchestration ADR are coupled: the observability tooling that provides signal for incident response is platform-specific, and switching orchestration platforms requires rebuilding the observability layer.

Team onboarding time is an operational cost that the orchestration choice imposes on every new hire who will work with the production infrastructure. The Kubernetes learning curve for an engineer who has not previously operated Kubernetes in production is 4–12 weeks before they can diagnose novel incidents confidently, contribute to cluster configuration changes, and operate the platform without senior engineer supervision. The ECS learning curve is 2–4 weeks for an engineer familiar with AWS. The difference is not a criticism of Kubernetes' design — it reflects the breadth of the platform's capability. But for a twelve-person team that hires one or two infrastructure engineers per year, the onboarding time difference is meaningful organizational capital. The orchestration ADR must acknowledge the onboarding time as a cost of the choice, because it affects the team's ability to grow the infrastructure function without creating a bottleneck on the existing Kubernetes-experienced engineers.
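The onboarding figures above can be turned into a rough annual cost. This is a back-of-the-envelope sketch, assuming ramp-up time simply adds across hires; the function name is hypothetical, and the ranges passed in are the ones quoted in the paragraph above rather than measurements.

```python
def onboarding_weeks_per_year(hires_per_year: int,
                              ramp_weeks: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) total ramp-up weeks absorbed per year.

    Assumes each hire ramps independently, so the totals simply add up.
    """
    low, high = ramp_weeks
    return hires_per_year * low, hires_per_year * high


# Ranges quoted in the text: 4-12 weeks for Kubernetes, 2-4 weeks for ECS.
kubernetes_range = onboarding_weeks_per_year(2, (4, 12))
ecs_range = onboarding_weeks_per_year(2, (2, 4))
```

For two hires per year this gives 8 to 24 ramp-up weeks on Kubernetes against 4 to 8 on ECS. These are weeks of reduced independence rather than weeks of zero output, so the result is an upper bound on the cost, not an exact figure.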

The migration cost and lock-in analysis

The migration cost analysis is the component of the orchestration ADR that teams most often skip, because migration is not the use case the initial decision is solving. It becomes relevant when the original choice turns out to be wrong — when the operational complexity floor is too high, when the autoscaling behavior doesn't meet the product's SLO, or when a new requirement (GPU support, specific compliance certification, multi-region active-active) is better served by a different platform.

Migrating from ECS to Kubernetes requires: rewriting all deployment configuration from ECS task definitions and service definitions to Kubernetes Deployments, Services, ConfigMaps, Secrets, and Ingress resources; selecting and deploying a CNI plugin, an ingress controller, and the operator tooling ecosystem; migrating CI/CD pipeline stages from ECS deploy commands to Helm releases or kubectl apply; rebuilding monitoring dashboards and alerting rules from ECS/CloudWatch metrics to Kubernetes/Prometheus metrics; retraining the team on Kubernetes concepts and the new tooling; and carrying a migration period where both platforms are operational. The migration effort for a twelve-service application is typically 6–10 engineer-weeks of focused migration work plus 2–4 months of parallel-running both platforms while the team gains confidence in the Kubernetes environment.

Migrating from Kubernetes to ECS involves the reverse costs, plus the cost of leaving behind the Kubernetes-native tooling that may have become load-bearing. If the team has adopted Argo CD for GitOps deployments, Istio for service mesh, or Argo Rollouts for canary deployments, migrating away from Kubernetes means either finding ECS-compatible equivalents (AWS App Mesh for some Istio use cases, CodeDeploy for some Argo Rollouts use cases) or accepting that those capabilities are lost in the migration. The Kubernetes ecosystem's depth is its primary advantage over ECS, and that depth is also the primary migration lock-in: the more Kubernetes-native tooling the team adopts, the higher the cost of migrating away.

Serverless container platform lock-in is primarily at the operational and tooling level rather than the application code level. A containerized application that runs on Lambda from a container image can often run on Cloud Run or App Runner with modest changes, provided the business logic is separated from the entry point: a Lambda image answers invocations through the Lambda Runtime API, while Cloud Run and App Runner expect the container to serve HTTP on a port. The lock-in is in the operational patterns (Lambda-specific cold start handling, Lambda concurrency limits, Lambda VPC attachment behavior), the monitoring and observability integration (Lambda CloudWatch metrics and X-Ray tracing versus Cloud Run Cloud Monitoring and Cloud Trace), and the CI/CD pipeline (Lambda deployment via the AWS CLI versus Cloud Run deployment via gcloud). Platform migration between serverless container platforms is less costly than migration between Kubernetes and ECS, but the migration from serverless containers to always-on containers (or the reverse) requires re-evaluating the autoscaling model, the cost model, and the operational patterns from scratch.

Container orchestration decisions in AI chat history

Container orchestration decisions produce a specific and recognizable pattern in AI chat history: an initial high-stakes architecture session where the platform choice is made under time pressure, followed by a long tail of platform-specific debugging, configuration, and optimization sessions that embed hundreds of smaller decisions into the operational configuration. Three months of AI chat history on Kubernetes operations may contain more architectural decisions than the original EKS setup session — but those decisions are invisible because they look like debugging rather than architecture.

The initial architecture session is often a single 60–120 minute ChatGPT or Claude conversation that establishes the platform choice. The pattern: "We need to move from docker-compose on a single server to something that can scale. What should we use — Kubernetes or ECS?" followed by a comparison that weighs the tradeoffs, a platform selection, and an initial setup guide. The problem is not the quality of the comparison — the AI produces a thorough analysis of the tradeoffs. The problem is that the specific constraints that were in play — the team's current AWS usage, the technical sophistication of the engineers, the autoscaling requirements, the budget, the compliance requirements — are context in the chat session that never gets written down anywhere outside it. The decision that emerges — "we'll go with EKS" — is recorded in the infrastructure code, but the context that justified that decision over ECS Fargate, Cloud Run, or App Runner is lost.

The incident-driven configuration sessions contain the most consequential operational decisions in the platform's history. The pattern: "Our pods are taking 5 minutes to scale out during traffic spikes — how do we speed up autoscaling?" followed by a diagnosis and a configuration change. The diagnosis might reveal that the HPA evaluation interval, the Karpenter NodePool configuration (called a Provisioner in older Karpenter releases), or the pod readiness probe delay is the bottleneck. The configuration change (adjusting the HPA metric aggregation window, restricting Karpenter's instance selection to types that become ready faster, or pre-warming a pool of standby nodes) is applied and the problem is solved. But the decision embedded in that change, "we will maintain a standing fleet of medium-sized nodes to absorb sudden scale-out demand, accepting the ongoing cost of unused node capacity", is a capacity and cost decision, not just a configuration change. It belongs in the orchestration ADR alongside the original platform choice, because it determines the ongoing cost floor and the autoscaling behavior the team relies on. Treating the postmortem as an ADR is the mechanism for capturing these incident-driven configuration decisions before they become invisible operational assumptions.

The cost optimization sessions are where the most expensive hidden decisions live. The pattern: "Our EKS cluster is costing $4,200 per month and we're only at 30% average utilization — how do we reduce costs?" followed by a multi-part optimization that introduces Spot instances, adjusts resource requests and limits, enables Karpenter's consolidation mode, and reduces the node pool size. Each of these changes is an implicit decision about the reliability, availability, and operational complexity tradeoffs the team is willing to accept. Spot instance adoption introduces Spot interruption events as a new operational concern. Consolidation mode introduces node churn as Karpenter repacks workloads onto fewer nodes. Tighter resource requests reduce scheduling headroom. These are decisions about the platform's operational behavior, not just cost optimization configurations — and they are almost never written down as decisions. The open-source extractor surfaces these sessions from the AI chat history so the cost-versus-reliability tradeoffs made in optimization sessions are preserved alongside the architecture decisions made in the initial setup session.
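The utilization figure in the example prompt above implies a specific amount of idle spend, which is worth computing before any optimization is attempted. The sketch below uses a deliberately naive model: it assumes cost scales linearly with provisioned capacity, and the 60% target utilization in the usage note is an illustrative assumption, not a recommendation. Both function names are hypothetical.

```python
def idle_spend(monthly_cost: float, avg_utilization: float) -> float:
    """Return the share of monthly spend that buys capacity idle on average."""
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    return round(monthly_cost * (1.0 - avg_utilization), 2)


def cost_at_target(monthly_cost: float, avg_utilization: float,
                   target_utilization: float) -> float:
    """Estimate monthly cost if the same load ran at a higher utilization.

    Naive linear model: ignores minimum node counts, bin-packing losses,
    and the headroom needed to absorb traffic spikes.
    """
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return round(monthly_cost * avg_utilization / target_utilization, 2)
```

For the $4,200 cluster at 30% utilization, `idle_spend(4200, 0.30)` is about $2,940 per month, and `cost_at_target(4200, 0.30, 0.60)` is about $2,100. The gap between the current cost and the target cost is the budget the team is trading against scheduling headroom, which is why the chosen target belongs in the ADR.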

The compliance and security sessions contain the RBAC, NetworkPolicy, and audit log decisions that are driven by external requirements rather than internal engineering judgment. The pattern: "Our SOC 2 auditor is asking about our Kubernetes RBAC model and our network segmentation — what should we configure?" followed by RBAC role and binding configuration, NetworkPolicy definitions, and API server audit log configuration. These sessions contain the explicit compliance rationale for security decisions — which RBAC bindings exist, why specific network paths are allowed or denied, what is captured in the audit log. The rationale is exactly the information that the next SOC 2 audit will require, and without the extracted decision record, the compliance team will spend audit preparation time reconstructing decisions that were made and justified once, in a closed chat session, eighteen months earlier. The security ADR must be consistent with the orchestration ADR's RBAC model and NetworkPolicy configuration.

Writing the container orchestration ADR

The container orchestration ADR needs five sections, covering the decisions that different stakeholders will return to as the platform matures: the initial platform choice rationale, the autoscaling model, the networking and service discovery model, the operational complexity acceptance, and the migration criteria.

Section 1: Platform choice and rationale. The orchestration platform chosen — managed Kubernetes (EKS, GKE, AKS), ECS Fargate, ECS with EC2 capacity providers, serverless containers (Lambda, Cloud Run, App Runner), or self-managed Kubernetes — with the specific service configuration. The constraints that were present at decision time: team size, existing cloud provider relationships, compliance requirements, workload characteristics (stateful versus stateless, traffic pattern, resource profile), budget, and the technical sophistication of the engineering team. The alternatives that were considered and the rationale for rejection. The decision criteria that could trigger a platform reconsideration: autoscaling latency exceeding a threshold, operational maintenance consuming more than a specified percentage of engineering time, a new workload requirement that the current platform cannot satisfy, or a team growth pattern that changes the onboarding cost equation. The current team's Kubernetes or ECS knowledge depth and the plan for maintaining or growing that knowledge as the team scales.

Section 2: Autoscaling model. The autoscaling mechanism for each service type: HPA with CPU/memory metrics, KEDA with specific external metric scalers, ECS target tracking or step scaling, or Lambda concurrency with or without provisioned concurrency. The target metric and threshold for each service, with the rationale for why that metric is the correct load signal (CPU utilization is a lagging indicator; queue depth is a leading indicator; request count per target is a direct load signal). The expected scale-out latency end-to-end, measured in a controlled test rather than estimated from platform documentation, broken down by component: metric collection interval, autoscaling controller evaluation interval, new pod or task scheduling time, container startup time (including image pull if not cached and readiness probe settling time), and node or instance provisioning time if required. The scale-in behavior: how aggressively the platform removes capacity when load decreases, the minimum replica or task count for each service, and the scale-in cooldown to prevent oscillation. The node or compute autoscaling mechanism (Karpenter, Cluster Autoscaler, ECS capacity providers, Fargate on-demand) and the instance type or Fargate configuration that determines the cost per unit of additional capacity. The CI/CD pipeline ADR must be consistent with the autoscaling model on the question of deployment strategy: a rolling update deployment on a platform with a 4-minute scale-out latency may require a higher minimum replica count than a blue-green deployment that keeps a warm secondary environment running.
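The component breakdown of scale-out latency described in Section 2 can be recorded as a small structure, so that the total and the dominant stage are computed rather than estimated. This is a minimal sketch, assuming the stages run sequentially in the worst case; the class and field names are hypothetical, and any numbers fed into it should come from the team's own controlled test.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScaleOutLatency:
    """Measured seconds for each stage of one scale-out event."""
    metric_collection: float
    controller_evaluation: float
    scheduling: float
    container_startup: float        # includes image pull and readiness settling
    node_provisioning: float = 0.0  # zero when spare capacity already exists

    def total(self) -> float:
        """Worst-case end-to-end latency, assuming stages run back to back."""
        return sum(getattr(self, f.name) for f in fields(self))

    def dominant_stage(self) -> str:
        """Name of the stage contributing the most latency."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Placeholder values for illustration only, not measurements.
example = ScaleOutLatency(
    metric_collection=60,
    controller_evaluation=15,
    scheduling=5,
    container_startup=45,
    node_provisioning=90,
)
```

With these placeholder values, `example.total()` is 215 seconds and `example.dominant_stage()` is `node_provisioning`, which would point the team toward standby capacity rather than HPA tuning. Recording the breakdown in the ADR makes it clear which stage a later configuration change was meant to shorten.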

Section 3: Networking and service discovery. The CNI plugin (for Kubernetes) or VPC networking mode (for ECS) and the IP address model. The service discovery mechanism: Kubernetes DNS (CoreDNS) with ClusterIP Services, ECS with AWS Cloud Map, or serverless with API Gateway. The ingress and external traffic routing model: Kubernetes Ingress controller (nginx, Traefik, or the AWS Load Balancer Controller, formerly named the ALB Ingress Controller) with specific routing rules, ECS with ALB listener rules, or serverless with API Gateway routes. The internal service-to-service communication model: direct DNS resolution within the cluster, a service mesh for mTLS and traffic shaping, or a simple HTTP/gRPC library without network-level encryption. The NetworkPolicy or security group model for restricting pod-to-pod or task-to-task communication, which is the record of the network segmentation decisions that compliance frameworks require. The service mesh ADR must reference the orchestration networking model as the layer below it.

Section 4: Operational complexity acceptance. An explicit acknowledgment of the operational complexity floor that comes with the chosen platform. For Kubernetes: the estimated weekly hours spent on cluster maintenance (version upgrades, node group updates, add-on management, certificate rotation), the cadence for control plane version upgrades, the tooling installed on the cluster and the team responsible for maintaining each tool's Helm releases. For ECS: the operational maintenance that remains (task definition updates for base image patches, autoscaling policy reviews, IAM role audits). For serverless: the cold start monitoring and provisioned concurrency review cadence. The knowledge requirement for on-call engineers and the onboarding plan for new engineers who will participate in on-call. The runbook location and the cadence for testing runbooks (because a runbook that has not been tested since it was written is not a runbook — it is documentation of how things worked at the time of writing, which may differ from how they work now). The escalation path for incidents that exceed the on-call engineer's Kubernetes or ECS knowledge boundary.

Section 5: Migration criteria. The conditions under which the team would reconsider the orchestration platform. These criteria must be specific and measurable, not qualitative ("if Kubernetes becomes too hard") — specific thresholds that, when crossed, trigger a formal migration evaluation. Examples: "if cluster maintenance consumes more than 15% of the infrastructure team's engineering time per week, trigger a migration evaluation to ECS Fargate"; "if scale-out latency under 10x normal load exceeds 5 minutes for any tier-1 service, trigger an autoscaling model review"; "if the team grows to a point where fewer than 25% of engineers on call have sufficient Kubernetes knowledge to diagnose incidents independently, trigger an orchestration complexity review." The migration evaluation is not a commitment to migrate — it is a commitment to assess the alternatives against the current constraints and make a documented decision, the same decision that was made in year one but with the benefit of two years of operational experience. The first year of decisions establishes the operational patterns that will be hardest to change; the orchestration ADR's migration criteria are the mechanism for keeping those patterns under deliberate review rather than defaulting to inertia.
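Criteria that are specific and measurable can be written down as data and evaluated against observed values. The sketch below encodes the three example thresholds from Section 5; the class, the metric names, and the `triggered_actions` helper are hypothetical, and how each observation is measured is left to the team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One measurable migration criterion from the ADR."""
    metric: str
    threshold: float
    breach_when: str  # "above" or "below"
    action: str

    def breached(self, observed: float) -> bool:
        if self.breach_when == "above":
            return observed > self.threshold
        if self.breach_when == "below":
            return observed < self.threshold
        raise ValueError(f"unknown direction: {self.breach_when}")


# Thresholds taken from the examples in the text; metric names are made up.
CRITERIA = [
    Criterion("maintenance_time_pct", 15.0, "above",
              "migration evaluation to ECS Fargate"),
    Criterion("scale_out_latency_min", 5.0, "above",
              "autoscaling model review"),
    Criterion("oncall_k8s_capable_pct", 25.0, "below",
              "orchestration complexity review"),
]


def triggered_actions(observations: dict[str, float]) -> list[str]:
    """Return the reviews triggered by the observed metric values."""
    return [c.action for c in CRITERIA
            if c.metric in observations and c.breached(observations[c.metric])]
```

For example, observations of 18% maintenance time, 3 minutes of scale-out latency, and 40% of on-call engineers with sufficient Kubernetes knowledge would trigger only the migration evaluation. Keeping the criteria in a reviewable file next to the ADR makes a threshold change visible as a decision in its own right.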

A container orchestration platform is the runtime environment for every service the team builds, and it is the context in which every production incident is diagnosed and resolved. The choice among Kubernetes, ECS Fargate, and serverless containers is not primarily a technical question about features. It is an organizational question about how much operational complexity the team can carry productively, how much autoscaling latency the product's SLO can tolerate, and how much onboarding time the team's hiring model can absorb. These are questions about the team and the business, not just the platform. The decisions about orchestration that are made in a single ChatGPT session in year one, often under the pressure of an incident like the Black Friday event in the opening narrative, shape the team's operational reality for years. Without a record of those decisions, the team in year three cannot reason about whether to migrate, cannot explain to a new engineer why the platform was chosen, and cannot evaluate new requirements against the constraints that justified the original choice. The open-source extractor surfaces the AI chat sessions where those decisions were made (the initial architecture conversation, the incident-driven optimization sessions, the compliance-driven security configuration sessions) and connects the reasoning from the moment of decision to the operational record that will inform the next platform evolution.