The service mesh decision record: why the inter-service communication infrastructure you chose in year two constrains your observability and zero-trust security posture in year four

Service mesh adoption is rarely treated as an architecture decision. It is treated as a platform upgrade — the team deploys Istio or Linkerd to solve a specific problem (circuit breaking, mTLS, traffic shaping) and moves on. The mesh becomes invisible infrastructure: the platform team maintains it, the application teams rely on it, and no one writes down what was chosen, why, what was deliberately excluded from the mesh, or when the decision should be revisited. Two years later, the mesh is load-bearing for observability, security policy, and Kubernetes version compatibility — and no one can reconstruct what is intentional configuration and what is accumulated default.

A new VP of Engineering joins a company in year four. The company runs twelve microservices on Kubernetes. She opens the runbook and reads "all services communicate over mTLS — handled by Istio." She asks: why Istio? The platform engineer who deployed it left eight months ago. The Slack thread where it was evaluated is archived and partially deleted. She finds a Confluence page titled "Service Mesh Evaluation — 2022" with a dead link to a Google Doc. The current platform team member says it was chosen because "Istio was the industry standard at the time and the team had prior Envoy experience." That is the entire institutional record.

Eight months later, the company pursues a SOC 2 audit. The auditor asks the team to confirm that all inter-service traffic is encrypted via mTLS. The platform engineer opens the Istio configuration and discovers that three of the twelve services communicate over plaintext: their sidecars were never injected. The original deployer did this intentionally. The three services are batch jobs running as Kubernetes CronJobs, and sidecar injection kept their pods from ever terminating, because the Envoy proxy stayed alive waiting for connections after the main container exited. Each job then ran indefinitely and blocked the next scheduled run. This is a known Istio behavior, and the fix requires per-pod injection opt-out annotations. The original platform engineer applied them and moved on. The decision to exclude these workloads from the mTLS perimeter, with a named reason, was never written down. The audit stalls for two weeks while the team determines what is intentional exclusion and what is overlooked gap. Both produce a finding; only one is remediable by documentation.

Like most infrastructure decisions that accumulate feature by feature, the service mesh configuration is visible as a fact — the cluster state is observable — but invisible as a decision. The cluster state answers "what is true now?" The decision record answers "what was considered, what was chosen and why, what was deliberately excluded, and what would make us reconsider." Without the record, every operational question that requires context collapses into archaeology: Slack searches, git blame on YAML files, and interviews with engineers who may have left.

What "we use a service mesh" actually means across four patterns

The first decision inside "we use a service mesh" is the mechanism — the architectural pattern by which inter-service communication is intercepted, encrypted, observed, and controlled. The pattern choice is often driven by what was familiar when the platform team first evaluated meshes, but it is a real choice with real trade-offs, and a team that deployed Istio made different architectural decisions than a team that deployed Linkerd or a team that embedded mTLS in each service's application code.

Running without a service mesh is a legitimate architectural choice, although most teams do not recognize it as a choice until they transition away from it. Services call each other over plain HTTP or TCP; TLS is configured per-service if at all; service discovery is handled by DNS; traffic shaping (retries, circuit breaking, timeouts) is handled by libraries in each service's codebase. This is the standard pattern for teams in the early stages of microservices adoption, and for teams with a small number of services that communicate in simple, well-understood patterns. The decision not to adopt a mesh is a real decision, one that becomes harder to revisit as the number of services grows, and it deserves a record that names the evaluation criteria and the condition under which the decision should be reopened.

Sidecar-proxy meshes (Istio with Envoy, Linkerd 2.x with its Rust micro-proxy) inject a lightweight proxy container into each application pod at admission time. All inbound and outbound traffic passes through the sidecar, which handles mTLS termination, traffic shaping, telemetry collection, and policy enforcement — transparently to the application via iptables rules that redirect traffic through the proxy. The application communicates as if no mesh exists; the sidecar intercepts everything at the network layer. The resource cost is real: each pod carries an additional container, and the iptables interception adds latency on every request. The operational benefit is that the mesh functionality is entirely outside the application code — enabling, disabling, or changing the mesh requires no application changes. The failure mode is that the sidecar's lifecycle is coupled to the pod's lifecycle: changing the proxy version requires rolling restarts across all injected pods, which in large clusters is a significant operational event.

Sidecar-less and eBPF-based meshes (Cilium, Istio ambient mode) move mesh functionality out of the application pod and into node-level components. Cilium does this with eBPF programs that intercept and process traffic at the kernel level; Istio ambient mode does it with a per-node ztunnel proxy. No sidecar container is injected into application pods. The per-pod resource overhead is eliminated, and so is the requirement to restart pods for proxy upgrades. The constraints are a minimum kernel version requirement and a different security model. The kernel floor depends on the mesh release and on which eBPF features are enabled, so a figure such as Linux 5.10 or later should be confirmed against the release being evaluated; either way it is a real constraint for clusters running on older LTS kernel versions. Istio's ambient mode separates L4 encryption (handled at the node level by the ztunnel) from L7 policy enforcement (handled by an optional per-namespace waypoint proxy), which means teams that need L7 authorization policies still need to deploy waypoint proxies, partially restoring the per-namespace operational overhead the ambient model was designed to eliminate. A team evaluating Cilium or ambient mode in 2022 was evaluating less mature software than a team evaluating the same options today. Documenting which version was evaluated and what the maturity concerns were is part of what makes the decision record useful two years later.

Application-layer mesh implements mesh functionality inside the application code rather than in infrastructure. Go services use gRPC with TLS options and the circuit breaking logic from a shared library; Java services use Resilience4J for circuit breaking and the Spring Boot TLS configuration for mTLS; observability is handled by OpenTelemetry SDKs in each service. This is sometimes called the "library pattern" versus the "platform pattern." The benefit is no infrastructure control plane to operate and no per-pod sidecar overhead. The cost is that consistency requires every team to apply the library correctly — mTLS that is inconsistently applied is worse than no mTLS because it creates a false sense of coverage. Updating the shared library requires per-service deployments; enforcing a library version requires organizational coordination. Teams that chose the application-layer pattern for a five-service system and then grew to thirty services typically find that the consistency problem reaches an inflection point where a platform pattern becomes worth the operational overhead — but without a decision record that names the consistency condition as the revisitation trigger, the migration happens reactively after an audit finding rather than proactively.

The observability constraint

Once a sidecar-proxy mesh is deployed, it becomes the primary telemetry collection point for inter-service communication. Istio and Linkerd emit distributed tracing spans, access logs, and traffic metrics from the sidecar proxies — before the application code receives the request. This produces a consistent observability baseline that does not depend on each service's telemetry library. It is also a constraint: the mesh's telemetry model determines what observability data is available, at what granularity, and in what format.

The constraint appears most visibly in distributed tracing. A service mesh generates a span for every request it proxies, recording the source service, destination service, response code, and latency. But the mesh can only observe the network boundary — it cannot follow a request's path through the application's internal processing and into downstream calls. When a service receives a request and makes three downstream calls, the mesh sees one inbound connection and three outbound connections. Without the application extracting the trace parent header from the inbound request and injecting it into each outbound call, the three downstream spans appear as disconnected root traces rather than children of the original span. Teams that adopt a service mesh believing it "handles distributed tracing" and skip adding trace header propagation to their services discover this gap when they first try to trace a request end-to-end and find a forest of disconnected root spans — observability theater that shows traffic volume but not request flow.
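The propagation step described above can be sketched in a few lines. This is a minimal illustration, not a replacement for an OpenTelemetry SDK: the header list is a subset of the trace-context headers commonly listed for forwarding in mesh documentation (an assumption to verify against your mesh version), and `propagation_headers` is a hypothetical helper name.

```python
# Trace-context headers an application typically forwards so that
# proxy-generated spans join a single trace. Assumption: verify this
# list against the documentation for your mesh and tracing backend.
TRACE_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
)


def propagation_headers(inbound_headers):
    """Return the trace-context headers to copy onto every outbound call."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in inbound_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        "X-Request-Id": "abc-123",
        "Content-Type": "application/json",
    }
    # Merge the result into the headers of each of the three downstream calls.
    print(propagation_headers(inbound))
```

A service that merges this dictionary into every outbound request gives the sidecars what they need to attach the three downstream spans to the original trace instead of starting new roots.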

The tracing backend integration is where the mesh version coupling becomes visible. Istio's default tracing configuration changed across major versions: older Istio deployments emit Zipkin B3 trace headers and write to a Zipkin or Jaeger endpoint; newer versions use OpenTelemetry with OTLP export. A team running Istio 1.12 integrated with Jaeger that wants to migrate to Datadog APM discovers that their mesh version's native tracing integration does not support OTLP endpoints — the migration requires an Istio upgrade, which is a Kubernetes-compatibility-constrained operation. The tracing backend decision and the mesh version decision are coupled in ways that are not visible at deployment time. Like performance optimization decisions where the baseline measurement matters as much as the optimization itself, the mesh's observability integration is a baseline that constrains future observability migrations.

Metrics cardinality is the observability consequence teams most often discover after deployment rather than during evaluation. Istio's Envoy proxies can emit hundreds of Prometheus time series per service pair (request counts, latency histograms, TCP connection counts) with label dimensions for source service, destination service, response code, and protocol. For a cluster with twenty services, this produces tens of thousands of time series. Teams that deploy a service mesh and then notice their Prometheus storage costs increasing by 3–5x within a month are discovering the cardinality consequence of Istio's default telemetry configuration. Reducing cardinality means customizing which metrics and labels the proxies emit, through Istio's `Telemetry` API on recent releases or `EnvoyFilter` patches on older ones. Most application teams are unaware of either mechanism when evaluating whether to deploy a mesh. A decision record that notes "metrics cardinality was evaluated against our Prometheus retention budget; we accepted the default configuration with a revisitation trigger if monthly Prometheus storage exceeds $X" converts a surprise operational cost into a documented trade-off.
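A back-of-the-envelope estimate makes the cardinality trade-off concrete during evaluation. The sketch below is illustrative arithmetic only: the function name and every input value (communicating pairs, response codes, bucket counts) are assumptions, not measurements of any particular Istio release.

```python
def estimated_series(service_pairs, response_codes, histogram_metrics,
                     buckets_per_histogram, counter_metrics):
    """Rough count of Prometheus time series from mesh request metrics."""
    # Each histogram exposes one series per bucket plus _sum and _count.
    per_histogram = buckets_per_histogram + 2
    per_label_set = histogram_metrics * per_histogram + counter_metrics
    # One label set per (source, destination, response code) combination.
    return service_pairs * response_codes * per_label_set


if __name__ == "__main__":
    # Assumed inputs: 20 services with 60 communicating pairs, 5 distinct
    # response codes, 3 histograms of 20 buckets each, and 1 counter.
    print(estimated_series(60, 5, 3, 20, 1))  # prints 20100
```

Per-pod or per-version labels multiply the result further, which is why the estimate belongs next to the Prometheus retention budget in the decision record.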

The zero-trust and mTLS constraint

The most common primary reason teams adopt a service mesh is mTLS: automatic mutual TLS between services without requiring each service to manage its own certificate lifecycle. The mesh mints short-lived certificates tied to each service's Kubernetes service account identity, rotates them automatically, and verifies that both sides of every connection present valid certificates. This is real security value — and it is a decision with three underspecified components that produce the most common compliance failures in service mesh deployments.

Permissive versus strict mTLS mode. Istio's default configuration is PERMISSIVE: services accept both mTLS and plaintext traffic. This enables gradual adoption — services without sidecars can still reach sidecar-equipped services during the mesh rollout. It is security theater for any compliance requirement that requires mTLS to be enforced rather than offered. STRICT mode rejects plaintext traffic. A team that moves from PERMISSIVE to STRICT without auditing which services lack sidecars — and without documenting which services were intentionally excluded from injection — breaks those services. The permissive-to-strict migration is a zero-trust readiness audit packaged as a configuration change, and it requires knowing what is intentional configuration before it can proceed safely. As with any security architecture decision, the undocumented exclusion is indistinguishable from an oversight until someone writes down the difference.
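As a sketch of what the mode switch looks like as configuration, the helper below builds a namespace-wide `PeerAuthentication` manifest as a Python dictionary. The function name and the `staging` namespace are hypothetical, and the `security.istio.io/v1beta1` API version is an assumption to check against the installed Istio release.

```python
import json


def peer_authentication(namespace, mode="STRICT"):
    """Build a namespace-wide PeerAuthentication manifest as a dict."""
    if mode not in ("STRICT", "PERMISSIVE", "DISABLE"):
        raise ValueError(f"unsupported mTLS mode: {mode}")
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        # With no selector, the policy applies to every workload in the namespace.
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to
    # `kubectl apply -f -` once the exclusion list has been audited.
    print(json.dumps(peer_authentication("staging"), indent=2))
```

Generating the manifest per namespace keeps the rollout order (non-production first) explicit in code rather than in someone's memory.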

Authorization policy scope. mTLS ensures that service A's connections to service B are authenticated — service B knows the request is from a service with a valid certificate issued by the cluster's certificate authority. Authorization policy (Istio's `AuthorizationPolicy` resource) ensures that service A is permitted to call service B. Without authorization policies, mTLS is authentication without least-privilege access control: every service with a valid certificate can reach every other service in the mesh. Compromising any single service grants network access to all other services. This is mutual authentication but not zero-trust network access. Teams that deploy a service mesh for mTLS and defer authorization policy configuration indefinitely have improved the authentication layer without reducing blast radius — which is an improvement, but not the security model that "we have mTLS" implies to an auditor. The authorization policy decision — whether to deploy a default-deny posture with per-service allow policies, or a default-allow posture with specific deny rules — is an architecture decision with direct compliance implications that belongs in the same record as the mTLS mode decision.
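The default-deny posture with per-service allow policies can be sketched as two manifest builders. The function names, labels, namespaces, and service account names are hypothetical; the principal format assumes the default `cluster.local` trust domain, and the API version should be checked against the installed Istio release.

```python
import json


def default_deny(namespace):
    """An ALLOW-by-default policy with no rules matches nothing, so every request is denied."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {},
    }


def allow(namespace, app_label, source_ns, source_sa, methods, paths):
    """Permit one source service account to call specific paths on one workload."""
    # Assumes the default trust domain; adjust if the mesh uses another.
    principal = f"cluster.local/ns/{source_ns}/sa/{source_sa}"
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"allow-{source_sa}-to-{app_label}",
                     "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            "action": "ALLOW",
            "rules": [{
                "from": [{"source": {"principals": [principal]}}],
                "to": [{"operation": {"methods": methods, "paths": paths}}],
            }],
        },
    }


if __name__ == "__main__":
    policies = [
        default_deny("backend"),
        allow("backend", "user-service", "frontend", "frontend-sa",
              ["GET"], ["/api/v1/users/*"]),
    ]
    print(json.dumps(policies, indent=2))
```

Each `allow` call corresponds to one documented communication path, which is what lets a reviewer compare the cluster state against the intended dependency map.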

External traffic handling. Services that receive traffic from outside the cluster (public-facing endpoints, internal load balancers receiving traffic from non-mesh clients) require a decision about how the mesh's mTLS boundary interacts with the cluster's ingress layer. Istio's ingress gateway can terminate external TLS and re-encrypt inbound connections via mTLS to the destination service, creating a TLS handoff at the cluster boundary where the gateway is the one place inside the cluster that handles the external request in decrypted form. Alternatively, TLS passthrough carries the external TLS connection to the service without the gateway decrypting and re-encrypting, and the service terminates TLS itself. The ingress boundary decision determines where the trust boundary is, what the mesh can observe (if the gateway re-encrypts, the mesh sees decrypted request metadata; if TLS passes through, the mesh sees only connection-level metadata and the encrypted payload), and what the application is responsible for. This is a security architecture decision that most mesh deployments make implicitly through the choice of ingress configuration.
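The two ingress options differ by a single field in the gateway configuration, which is part of why the decision is so easy to make implicitly. The sketch below builds one `servers` entry for an Istio `Gateway` resource; the function name, host, and secret name are hypothetical.

```python
def gateway_server(host, tls_mode, credential_name=None):
    """Build one `servers` entry for an Istio Gateway resource."""
    if tls_mode == "SIMPLE":
        # The gateway terminates external TLS, so it needs a certificate secret.
        if credential_name is None:
            raise ValueError("SIMPLE mode needs a credential_name")
        tls = {"mode": "SIMPLE", "credentialName": credential_name}
    elif tls_mode == "PASSTHROUGH":
        # The gateway routes on SNI only; the destination service terminates TLS.
        tls = {"mode": "PASSTHROUGH"}
    else:
        raise ValueError(f"unsupported TLS mode for this sketch: {tls_mode}")
    return {
        "port": {"number": 443, "name": "https", "protocol": "HTTPS"},
        "hosts": [host],
        "tls": tls,
    }


if __name__ == "__main__":
    print(gateway_server("api.example.com", "SIMPLE", "api-example-cert"))
    print(gateway_server("api.example.com", "PASSTHROUGH"))
```

Recording which mode each external host uses, and why, states the trust boundary explicitly instead of leaving it to be inferred from the manifest.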

The Kubernetes version coupling constraint

Istio is explicitly versioned against Kubernetes minor versions. Each Istio release supports a specific range of Kubernetes versions; the Istio project declines to support configurations outside this range and will not debug issues reported against unsupported combinations. This coupling means a Kubernetes minor version upgrade may require an Istio version upgrade first, and an Istio upgrade that spans several releases may require reviewing and updating every Istio CRD configuration in the cluster, because the API versions and fields of VirtualService, DestinationRule, AuthorizationPolicy, and PeerAuthentication change over time in ways that can require manual migration of existing configurations.

Linkerd 2.x is less tightly coupled to Kubernetes versions but depends on Kubernetes mutating admission webhooks for sidecar injection. Changes to admission webhook behavior across Kubernetes minor versions can affect injection reliability, and Linkerd upgrades occasionally raise the minimum Kubernetes version beyond what the original deployment assumed.

The team that does not document the version coupling at deployment time discovers it when the cluster's Kubernetes version reaches end-of-life and the platform team tries to upgrade. The upgrade path now includes an unplanned service mesh version assessment — how does the current mesh version map to the target Kubernetes version? Is a mesh version upgrade required first? What does the mesh version upgrade involve? — that the team is unprepared for because no record names the supported Kubernetes version range or the mesh upgrade history. As with any architectural decision that is expensive to supersede, the coupling constraint makes the initial deployment decision more important to document, not less.
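Once the supported range is written down, the first of those questions becomes a mechanical check. The sketch below assumes the record stores the range as two `major.minor` strings; the function names and the example versions are hypothetical.

```python
def parse_minor(version):
    """Turn '1.28' or 'v1.28.3' into the tuple (1, 28)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def mesh_upgrade_needed_first(target_k8s, supported_min, supported_max):
    """True when the target Kubernetes version is outside the mesh's recorded range."""
    target = parse_minor(target_k8s)
    low, high = parse_minor(supported_min), parse_minor(supported_max)
    return not (low <= target <= high)


if __name__ == "__main__":
    # Example range as it might appear in a decision record.
    print(mesh_upgrade_needed_first("1.29", "1.25", "1.28"))
    print(mesh_upgrade_needed_first("v1.27.4", "1.25", "1.28"))
```

Running a check like this in upgrade planning, rather than mid-upgrade, is what turns the coupling from a surprise into a scheduled prerequisite.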

The version coupling constraint is also an evaluation criterion that most teams do not include when they select a mesh. eBPF-based meshes are not tightly coupled to Kubernetes minor versions in the same way — Cilium's compatibility matrix is wider, and eBPF program updates do not require pod restarts. For teams that run on managed Kubernetes and upgrade frequently to stay current with cloud provider support windows, the version coupling constraint may be a stronger selection criterion than the feature set difference between Istio and Linkerd. A decision record that names version coupling as an evaluated criterion — even if it did not drive the final selection — answers the "why didn't we choose X?" question that arises when a platform engineer researches alternatives during an upgrade planning session.

The sidecar exclusion policy: the most commonly missing section

Every production Kubernetes cluster running a sidecar-proxy mesh has workloads that are excluded from sidecar injection. Some exclusions are imposed by the mesh's own constraints: Kubernetes DaemonSet pods running with host networking cannot be injected because the iptables rules that redirect traffic through the sidecar conflict with host network mode. Some exclusions are imposed by application behavior: batch jobs implemented as Kubernetes CronJobs often cannot be injected because the Envoy proxy, which remains alive waiting for connections after the main container exits, prevents the pod from terminating and the CronJob from completing. Some exclusions are policy decisions: teams may decide that infrastructure agents, log collectors, or monitoring agents should not carry a proxy sidecar because the additional hop adds latency to high-volume observability data without security benefit.
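For the CronJob case, the opt-out lives on the pod template inside the job template, which makes it easy to miss in review. The sketch below builds such a manifest as a dictionary. The function name, image, and the `example.com/mesh-exclusion-reason` annotation key are hypothetical; only the `sidecar.istio.io/inject` annotation comes from Istio.

```python
import json


def cronjob_without_sidecar(name, schedule, image, reason):
    """Build a CronJob manifest whose pods opt out of sidecar injection."""
    pod_template = {
        "metadata": {"annotations": {
            # Istio's per-pod opt-out; the value must be the string "false".
            "sidecar.istio.io/inject": "false",
            # Hypothetical key: keeps the reason next to the exclusion itself.
            "example.com/mesh-exclusion-reason": reason,
        }},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{"name": name, "image": image}],
        },
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {"spec": {"template": pod_template}},
        },
    }


if __name__ == "__main__":
    manifest = cronjob_without_sidecar(
        "nightly-report", "0 2 * * *", "registry.example.com/report:1",
        "Envoy keeps the pod alive after the job container exits")
    print(json.dumps(manifest, indent=2))
```

Storing the reason as an annotation is one design option, not a substitute for the decision record, but it means the rationale travels with the manifest.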

Each exclusion is a gap in the mTLS perimeter. The workloads excluded from injection communicate over plaintext, receive plaintext connections, and are outside the authorization policy scope. This is often correct — the batch job that only writes to an external database has no legitimate inbound service mesh traffic and the exclusion is sound — but it is only verifiable as correct if the exclusion is documented with a reason. Without documentation, the exclusion is indistinguishable from an oversight: a service that was supposed to be in the mesh but was excluded by accident during a migration.
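The distinction between a documented exclusion and an oversight can be checked mechanically once the exclusions are recorded somewhere. The sketch below assumes a hypothetical registry that maps workload names to reasons; the function name and all workload names are illustrative.

```python
def classify_uninjected(uninjected, documented):
    """Split uninjected workloads into documented exclusions and undocumented gaps."""
    intentional = {w: documented[w] for w in uninjected if w in documented}
    gaps = sorted(w for w in uninjected if w not in documented)
    # Registry entries with no matching workload suggest a stale record.
    stale = sorted(w for w in documented if w not in uninjected)
    return intentional, gaps, stale


if __name__ == "__main__":
    documented = {
        "batch/nightly-report": "CronJob: Envoy blocks pod termination",
        "monitoring/datadog-agent": "DaemonSet with host networking",
    }
    uninjected = ["batch/nightly-report", "payments/ledger-api"]
    intentional, gaps, stale = classify_uninjected(uninjected, documented)
    print("intentional:", intentional)
    print("gaps:", gaps)
    print("stale:", stale)
```

Anything that lands in `gaps` is a workload for which an auditor would currently receive no answer.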

The sidecar exclusion policy is the section that most directly affects SOC 2 and zero-trust readiness. An auditor who asks "which services are outside the mTLS perimeter?" is asking for the exclusion list. A team that can produce a named list with documented reasons for each exclusion demonstrates control; a team that must reconstruct the list from cluster state and git history demonstrates the absence of control. For platform teams that make decisions that constrain product teams, the exclusion policy is a governance artifact — it determines what security guarantees the platform can offer and what guarantees remain the responsibility of each application team.

Writing the service mesh decision record

The Nygard ADR format adapts to service mesh decisions with four sections that most teams leave entirely undocumented.

The mesh selection and topology decision. Name the mechanism chosen and the alternatives evaluated with rejection reasons. "We evaluated Istio 1.17 (sidecar), Linkerd 2.13 (sidecar), and Cilium 1.14 (eBPF). Istio was chosen for three reasons: (1) the platform team had prior experience operating Envoy in a standalone deployment, reducing the operational knowledge acquisition cost; (2) Istio's AuthorizationPolicy and PeerAuthentication resources map directly to the network segmentation requirements we were targeting for SOC 2; (3) the cluster nodes run kernel 4.19 LTS, which excludes eBPF-based options (Cilium requires ≥5.10). Linkerd was evaluated and rejected because its L7 authorization policy model (based on Server and ServerAuthorization resources in Linkerd 2.12) required per-server configuration rather than namespace-wide policy configuration — adding per-service platform overhead that the team did not want to absorb for twelve services. Cilium was evaluated and rejected on kernel version grounds — not on capability grounds; the evaluation noted that Cilium ambient-mode-equivalent functionality was the preferred option if kernel version constraints were lifted." This record makes a future Linkerd or Cilium proposal engage with specific rejection reasons rather than starting from neutral ground.

The sidecar exclusion policy. Name each excluded workload type, the technical reason for exclusion, and the security implication. "Sidecar injection is enabled cluster-wide via the `istio-injection: enabled` label on all application namespaces. The following workload categories are excluded via per-pod `sidecar.istio.io/inject: 'false'` annotations: (1) CronJob batch processing jobs — excluded because Envoy proxy containers prevent CronJob pods from terminating after the main container exits, causing CronJobs to run indefinitely. These jobs communicate outbound to the external database (port 5432, PostgreSQL) and have no inbound service mesh traffic. Security implication: outbound database traffic from these jobs is plaintext from the cluster network layer; database access is controlled by database credentials and IP allowlist rather than by mesh authorization policy. (2) Datadog agent DaemonSet — excluded because DaemonSet pods run with host networking and iptables-based sidecar injection does not function in host network mode. Security implication: the Datadog agent is an infrastructure component that receives no inbound service traffic and sends outbound observability data to the Datadog API endpoint; it is outside the inter-service authorization scope." Named exclusions with named security implications make the SOC 2 question answerable in minutes.

The mTLS mode and authorization policy model. Name the current mode, the target posture, and the migration plan. "Current mode: PERMISSIVE across all namespaces (services accept both mTLS and plaintext). Target mode: STRICT (enforced mTLS, plaintext rejected). This migration has not yet been executed. Migration sequence: (1) identify any remaining non-excluded services that are not injected by running `kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.containers[*].name}{"\n"}{end}' | grep -v istio-proxy | grep -v excluded-namespaces`; (2) inject sidecars into any remaining non-excluded services; (3) switch namespaces to STRICT mode in order, starting with non-production namespaces, verifying no plaintext traffic via Kiali's security graph before advancing to production; (4) deploy default-deny AuthorizationPolicies per namespace; (5) add per-service allow policies for each legitimate inter-service communication path (map is in `docs/service-dependencies.md`). Authorization policy model: default-deny with explicit allow. Each allow policy permits the specific source service account to reach the destination service. Allows are scoped to the method paths required — not wildcard on `*`." The migration plan, documented as a decision, is the difference between a zero-trust initiative and a zero-trust intention.
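Step (1) of a migration sequence like this one can also be done against structured output rather than with `grep`. The sketch below takes the parsed result of `kubectl get pods -A -o json` and lists pods with no proxy container. The function name is hypothetical, and checking `initContainers` is an assumption meant to cover clusters where the proxy runs as a native sidecar.

```python
def uninjected_pods(pod_list, proxy_name="istio-proxy"):
    """Return 'namespace/name' for every pod that has no mesh proxy container."""
    missing = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        # Check both lists: some setups place the proxy in initContainers.
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        if not any(c.get("name") == proxy_name for c in containers):
            meta = pod["metadata"]
            missing.append(f"{meta['namespace']}/{meta['name']}")
    return missing


if __name__ == "__main__":
    sample = {"items": [
        {"metadata": {"namespace": "shop", "name": "cart-1"},
         "spec": {"containers": [{"name": "cart"}, {"name": "istio-proxy"}]}},
        {"metadata": {"namespace": "batch", "name": "report-1"},
         "spec": {"containers": [{"name": "report"}]}},
    ]}
    print(uninjected_pods(sample))  # prints ['batch/report-1']
```

In practice the input would come from `json.load(sys.stdin)` with the `kubectl` output piped in, and the result would be compared against the documented exclusion list before any namespace is switched to STRICT.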

The Kubernetes version compatibility and upgrade trigger. Name the supported Kubernetes version range and the conditions for re-evaluation. "This deployment uses Istio 1.18, which supports Kubernetes 1.25–1.28 (per Istio's support policy: current release supports n-2 minor versions). Istio version upgrade is required before the cluster Kubernetes version exceeds 1.28. Istio upgrades that span several minor releases require reviewing all VirtualService, DestinationRule, AuthorizationPolicy, and PeerAuthentication configurations for deprecated field removal; budget two platform-engineer-weeks for this review plus a staged rollout. Re-evaluate the mesh selection entirely if: (1) the cluster nodes are upgraded to kernel ≥5.10, which makes eBPF-based options viable and means the resource overhead comparison should be revisited; (2) Istio control plane resource consumption (istiod pods) exceeds 15% of the control-plane node's allocatable resources; (3) the time required to execute a mesh version upgrade (from upgrade-start to all-pods-restarted) exceeds four hours, because at this scale the per-pod restart overhead of sidecar upgrades may justify migration to an ambient-mode or eBPF approach that does not require pod restarts for proxy updates; (4) mesh-related service incidents (connection failures attributable to sidecar configuration) exceed two per quarter for two consecutive quarters."
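Re-evaluation triggers are most useful when someone can evaluate them without rereading the record. The sketch below encodes the four example triggers from the sample text as a function; the `MeshHealth` type, its field names, and the sample readings are hypothetical, and the thresholds are the ones quoted in the example record.

```python
from dataclasses import dataclass


@dataclass
class MeshHealth:
    node_kernel: tuple          # (major, minor) of the oldest node kernel
    istiod_share: float         # fraction of control-plane node allocatable
    upgrade_hours: float        # last mesh upgrade, start to all pods restarted
    incidents_by_quarter: list  # mesh-related incidents, most recent last


def triggered(health):
    """Return the re-evaluation triggers that the current readings satisfy."""
    reasons = []
    if health.node_kernel >= (5, 10):
        reasons.append("kernel >= 5.10: eBPF-based options are now viable")
    if health.istiod_share > 0.15:
        reasons.append("istiod exceeds 15% of allocatable resources")
    if health.upgrade_hours > 4:
        reasons.append("mesh upgrade took longer than four hours")
    recent = health.incidents_by_quarter[-2:]
    if len(recent) == 2 and all(count > 2 for count in recent):
        reasons.append("more than two incidents per quarter, two quarters running")
    return reasons


if __name__ == "__main__":
    print(triggered(MeshHealth((4, 19), 0.10, 2.5, [1, 3])))
    print(triggered(MeshHealth((5, 15), 0.20, 5.0, [3, 4])))
```

A quarterly run of a check like this gives the record a scheduled reader, which is the difference between a trigger and a footnote.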

Finding service mesh decisions in AI chat

The WhyChose extractor surfaces service mesh decisions from four session types that contain the reasoning most teams cannot reconstruct two years later.

The initial evaluation session. "Should we use Istio or Linkerd?", "What's the difference between Istio and Cilium?", "How does Envoy sidecar injection work?", "Do we need a service mesh or can we handle mTLS in the application?", "What's Istio ambient mode and is it production-ready?" These sessions contain the alternatives the platform engineer considered and the criteria that drove the selection. The evaluation session is typically the most important session to recover because the mechanism choice — which all subsequent mesh decisions build on — is rarely revisited once the first service is injected with a sidecar.

The CronJob or batch job incident session. "Why isn't my Kubernetes CronJob completing?", "Istio sidecar not terminating after job finishes", "How to disable sidecar injection for specific pods without disabling it for the namespace", "iptables rules cleanup after main container exits — Envoy still running." These sessions identify the workloads that were excluded from the mesh and the technical reasons for each exclusion. They are the primary data source for the sidecar exclusion policy section — without recovering these sessions, the exclusions are invisible cluster configuration with no documented rationale, indistinguishable from accidents.

The mTLS migration session. "How do I move Istio from permissive to strict mTLS?", "How do I find which services are sending plaintext traffic to the mesh?", "Istio AuthorizationPolicy ALLOW vs DENY — which takes precedence?", "How do I write an AuthorizationPolicy that allows service A to call service B but not service C?", "PeerAuthentication strict mode breaking connections — how to debug?" These sessions contain the migration reasoning, the services that caused problems during migration, and the authorization policy decisions that were made service by service. For teams where the platform team and product teams work asynchronously, recovering these sessions from multiple engineers — the platform engineer who designed the policy model and the product engineer who debugged why their service could no longer reach the auth service — produces a more complete picture of the authorization model than any single person's export.

The upgrade planning session. "How do I upgrade Istio from 1.16 to 1.18?", "What changes between Istio 1.17 and 1.18 that I need to review?", "How do I check if my Istio CRDs are compatible with the new version?", "Canary upgrade for Istio — how does the revision label approach work?" These sessions contain the Kubernetes version compatibility reasoning, the CRD migration steps, and the upgrade strategy. They are the source material for the version coupling section of the ADR — and recovering them is the difference between a documented upgrade history and a cluster where no one is confident about what mesh version upgrade steps have been executed and which CRD schemas have been migrated.

What the decision record prevents

A documented service mesh decision prevents three recurring problems that teams without a mesh ADR encounter as they grow.

It prevents the compliance audit surprise. A SOC 2 audit that asks "confirm all inter-service traffic is encrypted" is a lookup operation if the sidecar exclusion policy is documented with named workloads and security implications. Without documentation, the audit requires reconstructing cluster configuration from kubectl output and Slack history — a multi-day exercise that produces the audit finding "unable to confirm compliance due to incomplete documentation," which is itself a control failure finding regardless of whether the actual mTLS coverage is adequate. Security architecture decisions that are documented as decisions rather than cluster configuration artifacts survive team turnover and produce audit evidence rather than audit findings.

It prevents the Kubernetes upgrade freeze. A cluster approaching the end of its supported Kubernetes version range does not produce an unplanned service mesh version assessment if the version coupling is documented with an upgrade trigger. Without the trigger, the coupling is discovered mid-upgrade: the Kubernetes upgrade has already started when the operations team finds that the mesh version is unsupported on the target Kubernetes version. The mesh upgrade then becomes a blocking prerequisite that was not included in the upgrade plan. Cross-team dependencies that become visible only when they block a high-stakes operation are the most expensive kind, and the coupling between the platform team's mesh decision and the infrastructure team's Kubernetes upgrade timeline is a canonical example.

It prevents authorization policy sprawl and blast radius regression. Teams that move to strict mTLS without a documented authorization policy model tend to write broad allow policies under time pressure: "allow all traffic from the frontend namespace to the backend namespace" rather than "allow the frontend service account to reach the /api/v1/users endpoint on the user service." The broad policy is correct enough to unblock the migration but expands blast radius relative to least-privilege. Without a documented policy model that names the intended scope of each allow rule, the team cannot distinguish the broad rules that were written intentionally from the broad rules that were written as a shortcut. A pattern chosen under constraint and a pattern chosen as policy look identical in the cluster configuration; only the decision record names which it was.
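Broad rules can at least be surfaced automatically, even though only the record can say whether they were intended. The sketch below flags `AuthorizationPolicy` manifests whose ALLOW rules lack a named principal or an operation scope. The function names and the definition of "broad" are assumptions for illustration, not an Istio feature.

```python
def _scoped(operation):
    """True when an operation names methods or non-wildcard paths."""
    paths = operation.get("paths") or []
    return bool(operation.get("methods")) or any(p != "*" for p in paths)


def is_broad(policy):
    """Flag ALLOW policies whose rules lack a principal or an operation scope."""
    spec = policy.get("spec", {})
    if spec.get("action", "ALLOW") != "ALLOW":
        return False
    for rule in spec.get("rules", []):
        sources = [f.get("source", {}) for f in rule.get("from", [])]
        operations = [t.get("operation", {}) for t in rule.get("to", [])]
        has_principal = bool(sources) and all(s.get("principals") for s in sources)
        has_scope = bool(operations) and all(_scoped(o) for o in operations)
        if not (has_principal and has_scope):
            return True
    # A policy with no rules is default-deny, which is not broad.
    return False


if __name__ == "__main__":
    namespace_wide = {"spec": {"action": "ALLOW", "rules": [
        {"from": [{"source": {"namespaces": ["frontend"]}}]}]}}
    print(is_broad(namespace_wide))  # prints True
```

The output is a review queue: each flagged policy either gets a sentence in the decision record explaining why it is broad, or gets narrowed.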

Further reading