The real-time architecture decision record: why the transport layer you chose determines your horizontal scaling ceiling and your client reconnection behavior in production
WebSocket versus SSE versus long-polling is decided in the first-sprint AI session ("how do I push updates to the client in real time?") and never documented. Sticky sessions are enabled at the load balancer because the server holds connection state in process memory, which quietly undermines the auto-scaling policy. The client reconnection policy is not implemented because the developer testing with thirty concurrent connections never observed a thundering herd. The transport layer decision determines your horizontal scaling ceiling and what your users experience the first time a rolling deploy drops their connections at production scale.
A 19-person SaaS company built a collaborative document editor. The real-time sync feature was built in the second sprint, following a ChatGPT session in which the engineer asked "what's the best way to push document changes to all connected users in real time?" The session recommended WebSockets. The engineer chose the ws library for Node.js, configured the connection to accept subscriptions to document channels, and held per-document subscriber maps in process memory. The implementation worked correctly. In staging with four concurrent connections, there were no problems. The product launched and users liked the real-time collaboration feature.
Eight months later, the product had grown to 1,200 concurrent connections during peak hours. The team was running two Node.js instances behind an AWS Application Load Balancer. The DevOps engineer configured auto-scaling to add a third instance when CPU exceeded 70% on the existing two. The third instance was added. The ALB started routing new connections to it. Old connections remained on the original two instances. Three weeks later the CTO noticed that 73% of connections were on instance-1, 25% were on instance-2, and 2% were on the new instance-3. The load balancer was distributing new connections evenly, but the absolute number of new connections was small relative to the existing pinned connections. The two original instances held every connection established before the third instance was added, and instance-1, which had been running the longest, held most of them. Adding more instances provided negligible relief because existing load did not move off the overloaded original instances; only connections established after the scale-out event could land on the new ones.
The DevOps engineer investigated and discovered that sticky sessions had been enabled at the ALB level at some point — the ALB configuration showed stickiness.enabled = true with a 1-day duration. The engineer could not locate when or why this was set. The git history showed no Terraform change that added stickiness. The most likely explanation was that an engineer had enabled it through the console during an incident and the state was never reconciled back to infrastructure-as-code. Stickiness was not in the original architecture notes for the real-time feature. It had not been identified as a requirement when WebSockets were chosen. Enabling it may have been a response to a bug where users experienced connection drops — sticky sessions ensured they reconnected to the same instance and recovered their in-memory subscription state — but there was no record of this reasoning.
Disabling sticky sessions would fix the auto-scaling problem but would cause connected users to land on different instances on reconnect, where their document subscription state did not exist. The subscriber map for each document was held in the memory of the instance that had accepted the initial connection. A client reconnecting to a different instance would receive no updates for the document they were editing until the server-side subscription was re-established. The architecture could not be made stateless without externalizing the subscriber maps to a shared store — Redis pub-sub, a managed broker, or a similar mechanism. That refactor was estimated at two to three weeks. The real-time feature, as built, had a horizontal scaling ceiling that was visible in retrospect as an inherent property of the transport choice combined with per-instance state storage. That ceiling had not been identified as a constraint when the implementation was chosen.
A 31-person fintech was building a notifications panel for their banking dashboard. The first implementation used WebSockets. A month into development, a frontend engineer raised a problem: the corporate proxy servers that many of their enterprise customers routed traffic through were blocking WebSocket connections. The WebSocket protocol upgrade request (the Upgrade: websocket HTTP header) was being dropped or rejected by HTTP-only corporate proxies, leaving those users with no notifications. The team investigated Server-Sent Events as a fallback. SSE uses standard HTTP with a text/event-stream content type and works through HTTP-only proxies that are transparent to long-lived HTTP connections. The decision to switch from WebSockets to SSE was made in a Slack thread over two days and executed in a week. It was documented in a Confluence page that no one could find 18 months later. When a new engineer joined the team and asked why they were using SSE instead of WebSockets, no one could reconstruct the corporate proxy rationale. The team remembered that the original implementation had been WebSockets and assumed the switch had been made for performance reasons. The actual reason, proxy compatibility for their enterprise ideal customer profile (ICP), had been decided correctly and then lost.
The three structural properties the transport decision determines
When a team chooses a real-time transport in an early sprint session, they are making a decision with three structural consequences that become load-bearing constraints as the product scales. None of these consequences are visible at the scale of a staging environment or an early-stage user base. All three become visible at the scale where they matter.
The scaling model: stateful per-instance versus stateless shared state. The real-time transport decision is inseparable from the connection state model. A WebSocket server that holds per-connection state in process memory — user subscriptions, channel memberships, presence data, last-seen sequence numbers — is a stateful service. Stateful services cannot scale horizontally by adding instances and distributing connections freely, because a client reconnecting to a different instance finds none of its state. The team has two options: sticky sessions (session affinity at the load balancer, which pins each client to the instance it originally connected to and defeats free horizontal distribution) or externalized state (all connection state written to a shared store — Redis, a managed broker, or a database — so any instance can reconstruct a client's context on reconnect). The stickiness option allows the application to remain stateful at the instance level and defers the externalization refactor, but it creates a scaling ceiling: the ceiling is reached when the distribution of pinned connections across instances is so uneven that adding new instances provides no relief. The externalization option requires the initial implementation to write subscription state to the external store on establishment and to read it on reconnect, and it requires the external store to handle the pub-sub fan-out (when a message arrives for a document, the publishing instance writes to the Redis channel, and each subscriber instance receives it and forwards it to the locally connected client). The choice between these models at implementation time determines whether the service can auto-scale freely or requires a two-to-three-week refactor when the scaling ceiling appears. 
The infrastructure-as-code decision record and the container orchestration decision record both become load-bearing against the scaling model: an auto-scaling policy that adds instances to distribute load is correct for stateless services and ineffective for stateful services with pinned connections.
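The externalized fan-out described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the collaborative editor's actual code: FanoutBroker, InMemoryFanoutBroker, SyncInstance, and LocalClient are hypothetical names, and the in-memory broker only stands in for Redis pub-sub or a managed broker so the example runs without dependencies.

```typescript
// Hypothetical broker interface. In production this role would be played by
// Redis pub-sub or a managed broker; the in-memory version is only a stand-in.
interface FanoutBroker {
  subscribe(channel: string, handler: (payload: string) => void): void;
  publish(channel: string, payload: string): void;
}

class InMemoryFanoutBroker implements FanoutBroker {
  private handlers = new Map<string, Array<(payload: string) => void>>();

  subscribe(channel: string, handler: (payload: string) => void): void {
    const list = this.handlers.get(channel) ?? [];
    list.push(handler);
    this.handlers.set(channel, list);
  }

  publish(channel: string, payload: string): void {
    for (const handler of this.handlers.get(channel) ?? []) handler(payload);
  }
}

// Stand-in for a connected socket: it just records what it was sent.
type LocalClient = { id: string; received: string[] };

class SyncInstance {
  // Only the sockets connected to THIS instance live in process memory.
  private localClients = new Map<string, Set<LocalClient>>();
  private broker: FanoutBroker;

  constructor(broker: FanoutBroker) {
    this.broker = broker;
  }

  // Called on connect and on every reconnect, on whichever instance the
  // load balancer happened to pick.
  attach(client: LocalClient, documentId: string): void {
    const channel = `doc:${documentId}`;
    let clients = this.localClients.get(channel);
    if (clients === undefined) {
      const created = new Set<LocalClient>();
      this.localClients.set(channel, created);
      // First local subscriber for this document: listen on the shared channel.
      this.broker.subscribe(channel, (payload) => {
        created.forEach((c) => c.received.push(payload));
      });
      clients = created;
    }
    clients.add(client);
  }

  // Changes are published to the broker, never written straight to sockets,
  // so subscribers on other instances see them too.
  publishChange(documentId: string, payload: string): void {
    this.broker.publish(`doc:${documentId}`, payload);
  }
}

const broker = new InMemoryFanoutBroker();
const instanceA = new SyncInstance(broker);
const instanceB = new SyncInstance(broker);
const editor: LocalClient = { id: "editor-1", received: [] };
instanceB.attach(editor, "42");        // the client landed on instance B
instanceA.publishChange("42", "op-1"); // the change arrived at instance A
// editor.received is now ["op-1"]
```

Because attach runs again on every reconnect, a client that lands on a different instance starts receiving updates as soon as it re-subscribes, which is the property that makes session affinity unnecessary.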
Client reconnection behavior and thundering herd dynamics. Every real-time transport experiences disconnections. Server deploys cycle instances. Network interruptions drop TCP connections. Load balancer health checks cause brief routing changes. A client's response to a disconnect event — how quickly it attempts to reconnect, and with what backoff policy — is a decision that has server-side load management consequences. A client that attempts immediate reconnection — retry after 0ms, or after a fixed 500ms — produces a tight reconnect loop when the server is temporarily unavailable. At 2,000 concurrent connections, an instance restart that disconnects all clients and triggers immediate reconnection attempts produces 2,000 simultaneous TCP handshakes, 2,000 simultaneous TLS negotiations, 2,000 simultaneous authentication token verifications, and 2,000 simultaneous subscription state reconstructions in the same 200ms window. The CPU and database query load of this spike is qualitatively different from the load of 2,000 connections arriving gradually over ten minutes of normal user activity. Without jitter in the reconnection window, exponential backoff alone does not fully solve the problem: if all 2,000 clients implement the same exponential backoff starting from the same disconnect timestamp, they arrive at the same reconnection window simultaneously. The reconnection policy must include jitter — a random offset within the current backoff window — to spread reconnect attempts across time. This is the server-side load management policy encoded as a client-side behavior. It belongs in the transport ADR as a specified requirement for client implementations, not left to each client team to implement independently. 
The absence of a documented reconnection policy produces a situation where mobile client, web client, and desktop client each implement their own retry logic — one with fixed 1s retry, one with exponential backoff without jitter, one with a hard limit of three retries before surfacing an error to the user — each producing different server-side load profiles and different user-facing error experiences at the first production deploy that drops connections at scale. The error handling strategy decision record intersects here: the reconnection failure mode — what the client shows the user when reconnection exceeds the maximum retry window — is a product decision that belongs in the same ADR as the transport choice, not in a separate ticket raised by the mobile team at the first production incident.
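The difference between backoff alone and backoff with jitter can be made concrete with a small simulation. This is a sketch under assumptions: the policy values (500 ms base, 30 s cap), the 100 ms bucket size, and all function names are illustrative placeholders, and "full jitter" (a uniform random delay inside the current window) is one common way to apply the random offset the text calls for.

```typescript
// Placeholder values; the ADR would fix the real bounds.
type BackoffPolicy = { baseMs: number; capMs: number };

// The backoff window doubles per failed attempt, up to the cap.
function backoffWindowMs(attempt: number, policy: BackoffPolicy): number {
  return Math.min(policy.capMs, policy.baseMs * Math.pow(2, attempt));
}

// Without jitter every client waits the full window, so they all return at
// the same instant. With full jitter each client picks a uniformly random
// delay inside the window. `random` is injectable for deterministic tests.
function reconnectDelayMs(
  attempt: number,
  policy: BackoffPolicy,
  useJitter: boolean,
  random: () => number = Math.random
): number {
  const windowMs = backoffWindowMs(attempt, policy);
  return useJitter ? Math.floor(random() * windowMs) : windowMs;
}

// Largest number of reconnects that land in any single time bucket.
function peakArrivals(delaysMs: number[], bucketMs: number): number {
  const counts = new Map<number, number>();
  let peak = 0;
  for (const delay of delaysMs) {
    const bucket = Math.floor(delay / bucketMs);
    const next = (counts.get(bucket) ?? 0) + 1;
    counts.set(bucket, next);
    if (next > peak) peak = next;
  }
  return peak;
}

const demoPolicy: BackoffPolicy = { baseMs: 500, capMs: 30000 };
const lockstepDelays: number[] = [];
const jitteredDelays: number[] = [];
for (let i = 0; i < 2000; i++) {
  lockstepDelays.push(reconnectDelayMs(3, demoPolicy, false));
  jitteredDelays.push(reconnectDelayMs(3, demoPolicy, true));
}
// Lockstep: all 2,000 clients land in the same 100 ms bucket.
const lockstepPeak = peakArrivals(lockstepDelays, 100);
// Jittered: the same clients are spread across the 4,000 ms window.
const jitteredPeak = peakArrivals(jitteredDelays, 100);
```

On the third retry the window is 4,000 ms, so the lockstep peak equals the full client count while the jittered peak is a small fraction of it. The exact jittered figure varies from run to run.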
Proxy and network intermediary compatibility. The real-time transport choice determines whether the connection works through the network infrastructure between the client and the server. Corporate HTTP proxies, CDN edge nodes, and some mobile carrier networks handle WebSocket connections differently from HTTP connections. The WebSocket protocol starts with an HTTP/1.1 upgrade request; a proxy that does not understand the WebSocket upgrade either drops the connection or completes the HTTP response and closes the connection before the WebSocket handshake is finished. An HTTPS WebSocket connection (WSS) tunnels through the proxy as an opaque TLS stream, which most proxies pass through without inspecting, so the proxy compatibility problem is more severe for WS (unencrypted) than for WSS. Enterprise customers, however, may route all traffic through inspecting proxies that terminate TLS and re-encrypt; these intercept WSS connections and may not forward the WebSocket handshake correctly. Server-Sent Events and long-polling use standard HTTP requests and responses that nearly all proxies pass, although a proxy that buffers responses can still delay SSE events; their compatibility range is broader at the cost of the unidirectional constraint (SSE) or the polling overhead (long-polling). The ICP determines which compatibility range matters: a B2C consumer product with home internet users will rarely encounter corporate proxies; a B2B product targeting enterprise customers who route all traffic through their IT-managed network will encounter them regularly. The fintech's SSE decision was correct given their enterprise ICP: the proxy compatibility constraint was real and material for their specific customer segment. The decision was made correctly and then lost.
A documented ADR that recorded "we chose SSE over WebSockets because our enterprise customers route through HTTP-only corporate proxies, confirmed by reports from three enterprise customers in Q2 2024" would have preserved the reasoning through employee turnover and made the trade-off visible to the new engineer who joined 18 months later.
Transport options and their structural properties
WebSockets. A WebSocket connection is a full-duplex, persistent TCP connection. The client sends the HTTP upgrade request; the server responds with 101 Switching Protocols; from that point, both sides can send frames independently without the request-response overhead of HTTP. WebSockets are the correct choice for bidirectional real-time communication where the client sends data at high frequency or in response to events: collaborative editing (all participants send cursor positions, keystrokes, and selection ranges as they type), multiplayer games (client sends controller input at 60Hz), and chat applications where message threading requires per-user typing indicators and read receipts sent from client to server. The structural constraint is per-connection state management: a WebSocket server that holds subscription or channel membership state in process memory must either use sticky sessions or externalize state to a broker before it can scale horizontally. The connection count per instance is the primary resource constraint: a Node.js process can handle tens of thousands of idle WebSocket connections (the connection itself is cheap if it is not sending frequent messages), but the state associated with each connection — if held in memory — scales the process's heap proportionally. The performance optimization decision record becomes relevant when connection counts grow: memory profiling often reveals that the per-connection state objects account for a disproportionate share of heap, and the optimization path is either reducing the state stored per connection or moving it to a shared store — which is the same architectural change required for stateless scaling. The horizontal scaling ceiling imposed by in-memory per-connection state appears first in performance optimization sessions, before it is recognized as an architecture constraint. 
The WebSocket ADR must specify, at adoption time, whether state will be held per-instance or in a shared store, because that choice directly determines the scaling model and the operational policy for load balancer session affinity.
Server-Sent Events (SSE). SSE is a one-way persistent HTTP connection from server to client. The server sends text event frames, each terminated by a blank line, with optional event ID, event type, and data fields. The client receives events in order and automatically reconnects if the connection drops, sending the Last-Event-ID header so the server can resume from where delivery was interrupted. SSE is the correct choice for unidirectional real-time delivery: live dashboards, notification feeds, AI-generated text streaming (where the server produces tokens and streams them to the client), build log streaming, and progress indicators. The browser's EventSource API reconnects automatically after a reconnection delay, which the server can set with the retry field, so the application developer does not implement basic reconnection logic for browser clients. The specification permits but does not require browsers to back off exponentially after repeated failures, and it defines no jitter, so EventSource alone does not protect the server from a reconnect spike. The Last-Event-ID header tells the server the last event the client received; a server that retains recent events can replay from that point without the client needing to re-request missed events. The structural constraint is the HTTP/1.1 per-origin connection limit: browsers typically allow six concurrent HTTP/1.1 connections per origin, and each SSE connection consumes one. A web application that opens multiple SSE connections (one per data feed) can exhaust the browser's connection pool, blocking other HTTP requests on the same origin. HTTP/2 largely removes this constraint by multiplexing multiple streams over a single TCP connection; SSE over HTTP/2 behaves as a single stream among many and does not consume additional connections, subject to the negotiated concurrent stream limit. The infrastructure prerequisite is HTTP/2 on the hop the browser actually connects to (the CDN edge or the load balancer), plus unbuffered streaming on every hop behind it. An SSE adoption decision made without confirming that the browser-facing hop speaks HTTP/2 may add a browser connection limit constraint to production before the HTTP/2 requirement is met in the infrastructure.
The CDN decision record intersects here: a CDN or proxy that buffers responses from the origin will hold SSE event frames rather than streaming them, introducing latency that may defeat the real-time requirement. SSE through a CDN requires the CDN to support unbuffered streaming pass-through (HTTP/2 server push is a separate feature and does not provide it), a capability that must be verified against the specific CDN configuration, not assumed from a general HTTP/2 support claim.
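The event framing and the Last-Event-ID resume behavior described for SSE can be sketched as follows. The field layout (id, event, data, blank-line terminator) follows the text/event-stream format; SseReplayBuffer and its capacity are hypothetical, shown only to illustrate that resume works when the server retains recent events.

```typescript
type SseEvent = { id: string; type?: string; data: string };

// Serialize one event in text/event-stream framing: one "field: value" line
// per field, one "data:" line per line of payload, and a blank line to
// terminate the event.
function formatSseEvent(ev: SseEvent): string {
  const lines: string[] = [`id: ${ev.id}`];
  if (ev.type !== undefined) lines.push(`event: ${ev.type}`);
  for (const part of ev.data.split("\n")) lines.push(`data: ${part}`);
  return lines.join("\n") + "\n\n";
}

// Hypothetical bounded buffer of recent events, kept so a reconnecting
// client can be caught up from its Last-Event-ID header.
class SseReplayBuffer {
  private events: SseEvent[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  append(ev: SseEvent): void {
    this.events.push(ev);
    if (this.events.length > this.capacity) this.events.shift();
  }

  // Events after lastEventId. A fresh client (no header) has nothing to
  // replay. null means the id is no longer in the buffer, so the server must
  // send a full snapshot instead of a partial replay.
  eventsAfter(lastEventId: string | undefined): SseEvent[] | null {
    if (lastEventId === undefined) return [];
    const index = this.events.findIndex((e) => e.id === lastEventId);
    if (index === -1) return null;
    return this.events.slice(index + 1);
  }
}

const frame = formatSseEvent({ id: "7", type: "notice", data: "line one\nline two" });
// frame === "id: 7\nevent: notice\ndata: line one\ndata: line two\n\n"
```

The null return is the case worth writing into the ADR: how long events are retained determines how long a client can be disconnected and still resume without a full reload.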
Long-polling. Long-polling is an HTTP request that the server holds open until it has an event to deliver, at which point it completes the response and the client immediately issues a new long-poll request. From the user's perspective, long-polling delivers near-real-time updates without a persistent connection. From the infrastructure's perspective, long-polling is a sequence of ordinary HTTP requests, each of which completes as soon as an event is available or a hold timeout expires. Long-polling works through all HTTP infrastructure: proxies, CDNs, and load balancers handle it as ordinary HTTP requests, with no protocol-level WebSocket support required. The structural constraint of long-polling is request overhead: each event delivery requires a new HTTP request and response, plus a new TCP connection and TLS negotiation whenever keep-alive is not enabled or the connection pool is exhausted. For high-frequency events (ten or more events per second per client), long-polling is less efficient than a persistent connection because the per-request overhead grows in proportion to the event frequency. For low-frequency events (one notification per minute per user), long-polling's overhead per event is negligible, and its infrastructure compatibility and simplicity are often worth that overhead. Long-polling is also the correct fallback when WebSocket connections are blocked by corporate proxies: a transport negotiation mechanism (the approach used by Socket.io and similar libraries) starts with long-polling and upgrades to WebSockets when the WebSocket handshake succeeds, staying on long-polling if the upgrade fails. The transport negotiation approach provides proxy compatibility without giving up WebSocket efficiency when proxies permit it, but the negotiation mechanism adds complexity that must be maintained: the client library must be kept at a version that supports both transports, and the server must handle both protocols simultaneously.
The dependency upgrade decision record becomes relevant for transport negotiation libraries: Socket.io major version upgrades have historically broken the transport negotiation protocol in ways that require coordinated client and server upgrades, and the upgrade timing constraint is invisible if the library's role in the transport architecture is not documented.
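The hold-and-respond cycle that defines long-polling can be sketched on the server side as a per-client mailbox. This is an illustrative sketch only: LongPollMailbox and its method names are hypothetical, the responder callback stands in for completing an HTTP response, and a real server would call expireHeldRequest from a timer set below the shortest idle timeout of any intermediary in the path.

```typescript
// Completing the HTTP response is modeled as calling this function.
type PollResponder = (events: string[]) => void;

class LongPollMailbox {
  private queued: string[] = [];
  private held: PollResponder | null = null;

  // A long-poll request arrived from the client.
  poll(respond: PollResponder): void {
    if (this.queued.length > 0) {
      // Events accumulated between polls: answer immediately with the batch.
      const batch = this.queued;
      this.queued = [];
      respond(batch);
      return;
    }
    // Nothing to deliver yet: hold the request open. Only one request is
    // held per client, so any earlier one is released first.
    this.expireHeldRequest();
    this.held = respond;
  }

  // An event was produced for this client.
  push(eventData: string): void {
    if (this.held !== null) {
      const respond = this.held;
      this.held = null;
      respond([eventData]); // completes the held request
    } else {
      this.queued.push(eventData); // delivered on the next poll
    }
  }

  // Hold timeout: answer with an empty batch so the client polls again
  // before a proxy or load balancer drops the idle request.
  expireHeldRequest(): void {
    if (this.held !== null) {
      const respond = this.held;
      this.held = null;
      respond([]);
    }
  }
}
```

Each call to the responder corresponds to one full request and response, which is where the per-event overhead discussed above comes from: at one event per minute the cost is negligible, at ten per second it dominates.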
HTTP streaming (chunked transfer encoding). HTTP streaming sends a single HTTP response and writes event data incrementally without closing the response: as chunks under Transfer-Encoding: chunked on HTTP/1.1, or as data frames on HTTP/2. The client reads chunks as they arrive. This is the mechanism used by AI model APIs to stream generated tokens: the server sends each token as a chunk as it is produced, and the client renders them progressively. HTTP streaming is appropriate for single-producer, single-consumer streams where the content is produced in sequence and the consumer reads it in order: token streaming, file streaming, event log streaming. It is not appropriate for multi-publisher fan-out (where multiple events from different sources must be interleaved for a single client) or for bidirectional communication. HTTP streaming shares the proxy and CDN compatibility constraints of SSE, and buffering is the one that matters most. HTTP proxies and CDN edge nodes may buffer the entire response body before forwarding it to the client, defeating the streaming requirement. Streaming requires every CDN and proxy in the path to pass the response through without buffering, a configuration that often must be explicitly enabled. For example, nginx buffers proxied responses by default and streams them only when proxy_buffering is off or the origin sends the X-Accel-Buffering: no response header; CDN behavior varies by provider and configuration and should be confirmed against the provider's documentation and with a test through the real path. A team that builds HTTP streaming without verifying the CDN buffering behavior may ship a feature that streams correctly in development (where the CDN is not in the path) and delivers the full response body as a single event in production (after the CDN buffers the complete response).
The observability strategy decision record is the mechanism for detecting CDN buffering in production: streaming delivery should be measured as time to first chunk at the client and as the spread between the first and last chunk arrivals, not only as time to response completion. Those metrics distinguish a buffered delivery (all chunks arrive together at response completion) from a streamed delivery (chunks arrive progressively as the server produces them).
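A client-side measurement that separates buffered from streamed delivery might look like the following sketch. The type names, the function names, and the 10% spread threshold are all assumptions chosen for illustration; a real threshold should come from observing known-good streams for the feature in question.

```typescript
type ChunkTiming = {
  requestStartMs: number;
  chunkArrivalsMs: number[]; // client-side arrival time of each chunk
};

type DeliveryMetrics = {
  chunkCount: number;
  timeToFirstChunkMs: number;
  timeToLastChunkMs: number;
  chunkSpreadMs: number; // last arrival minus first arrival
};

function deliveryMetrics(t: ChunkTiming): DeliveryMetrics | null {
  if (t.chunkArrivalsMs.length === 0) return null;
  const firstMs = Math.min(...t.chunkArrivalsMs);
  const lastMs = Math.max(...t.chunkArrivalsMs);
  return {
    chunkCount: t.chunkArrivalsMs.length,
    timeToFirstChunkMs: firstMs - t.requestStartMs,
    timeToLastChunkMs: lastMs - t.requestStartMs,
    chunkSpreadMs: lastMs - firstMs,
  };
}

// Heuristic: a multi-chunk response whose chunks all arrive within a small
// fraction of the total response time looks like it was buffered upstream.
// The default threshold is an assumption, not a measured value.
function looksBuffered(m: DeliveryMetrics, spreadRatioThreshold: number = 0.1): boolean {
  if (m.chunkCount < 2 || m.timeToLastChunkMs <= 0) return false;
  return m.chunkSpreadMs / m.timeToLastChunkMs < spreadRatioThreshold;
}
```

Both deliveries in the tests below complete at the same time, so a metric based only on response completion could not tell them apart; time to first chunk and chunk spread can.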
Managed real-time brokers (Ably, Pusher, PubNub, Soketi). A managed broker externalizes the transport, connection management, and fan-out concerns to a third-party service. The application backend publishes events to the broker via REST or WebSocket; the broker delivers them to subscribed clients. The application server is stateless relative to client connections — it does not hold open WebSocket connections to clients. The scaling ceiling of the in-process WebSocket server disappears: the broker scales client connections horizontally, and the application's published events are delivered to all subscribers regardless of which application instance published them. The structural trade-off is third-party dependency: the broker's availability is now in the critical path of the real-time feature. A broker outage does not just degrade the real-time feature — it may block core product functionality if the real-time transport is used for mandatory user interactions rather than optional live updates. The ADR must document the fallback behavior for broker unavailability: does the product degrade gracefully (core functionality continues without real-time updates, which appear when the broker recovers), or does the broker outage block core functionality (user cannot complete an action that requires a real-time confirmation event from the broker)? A managed broker that is in the critical path of a financial transaction confirmation, a two-factor authentication code delivery, or a payment status update is a third-party availability dependency that belongs in the payment processor decision record category of third-party risk, not treated as an infrastructure convenience. 
The decision to use a managed broker must include the fallback behavior, the circuit breaker policy (when to stop publishing to the broker and fall back to polling), and the data sovereignty implications (are the events the application publishes to the broker subject to data residency requirements that the broker must be configured to honor?).
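The circuit breaker policy mentioned above can be sketched as a small state machine around the publish call. Everything here is a hypothetical illustration: PublishCircuitBreaker, the failure threshold, and the cooldown are placeholders, and no particular broker SDK is assumed. When allowPublish returns false, the caller would skip the broker and let clients fall back to polling.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class PublishCircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  // `now` is injectable so the cooldown can be tested without real time.
  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Should the next event be published to the broker?
  allowPublish(): boolean {
    if (this.state === "open" && this.now() - this.openedAtMs >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow a trial publish
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed trial publish, or too many failures in a row, opens the circuit.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = this.now();
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

The two numbers passed to the constructor are the policy the ADR should record, along with what the product does while the circuit is open.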
AI chat session types and what each one misses
The real-time transport decision follows a consistent pattern in AI chat history. An early feature session establishes the transport. A scale event triggers an investigation that surfaces the scaling constraint. A proxy or compatibility incident triggers a transport change. Each of these sessions contains a decision that belongs in a real-time architecture ADR — and each one is missing the constraints that would have changed the decision or informed the refactor. The WhyChose extractor surfaces these sessions because they are among the most consequential in a product's infrastructure history and among the most consistently underdocumented.
The "how do I push real-time updates?" session. This is the founding session — the first time the engineer asks an AI assistant how to deliver server-initiated events to a web client. The session covers: the transport options (WebSockets, SSE, long-polling, managed services), the library choices for the chosen transport, and a basic implementation example. The session recommends WebSockets for most real-time use cases because they support bidirectional communication and have broad library support. What the session misses: the engineer does not ask about the scaling model because they are thinking about getting the feature to work, not about what happens at 2,000 concurrent connections. The session does not cover sticky sessions because the engineer has not encountered load balancers yet. It does not cover the client reconnection policy because the engineer's test environment has a single server instance that never restarts. The transport choice from this session determines the architecture for the life of the product, but the session is scoped to the implementation sprint, not to the multi-year operational consequences. The ADR that should be written after this session captures the implementation choice and explicitly addresses the questions the session did not raise: will connection state be held per-instance or in a shared store, what is the client reconnection policy, and what is the fallback behavior for clients that cannot establish the primary transport.
The "our load balancer isn't distributing connections evenly" session. This is the scaling ceiling discovery session. An engineer queries an AI assistant with "our WebSocket server isn't distributing load evenly across instances, some instances are overloaded while new ones are empty." The session explains sticky sessions and the per-instance state problem. The session produces two paths: enable sticky sessions (quick fix, preserves in-memory state, defers the scaling ceiling) or externalize state to Redis pub-sub (correct fix, enables stateless scaling, requires refactoring the subscriber map). The engineer chooses the quick fix, enabling sticky sessions, because the refactor is two to three weeks of work and the scaling ceiling is a few months away at current growth rates. This decision is not documented. Sticky sessions are enabled in the load balancer configuration, possibly through the console rather than infrastructure-as-code. Three months later, the engineer who made this decision has moved to a different team. The DevOps engineer inherits a load balancer with stickiness enabled and no record of why it was set or what would break if it were disabled. The infrastructure-as-code strategy decision record addresses part of this problem: load balancer configuration in Terraform means the stickiness setting is visible in version control with a commit message. But the commit message is not a substitute for an ADR that explains why stickiness was enabled, what the alternative was, and under what conditions the sticky session approach should be replaced with externalized state. A future engineer reading the Terraform commit knows that stickiness was enabled; the ADR tells them why it was a deliberate deferral of the externalization refactor and when to revisit it.
The "our enterprise customers can't connect" session. This is the proxy compatibility incident session. The session is triggered by a support ticket from an enterprise customer whose IT infrastructure blocks WebSocket connections. The AI session covers proxy compatibility: WebSocket upgrade requests are blocked by HTTP-only proxies, WSS connections may be intercepted by TLS-terminating corporate proxies, and SSE or long-polling provide broader proxy compatibility at the cost of unidirectionality or polling overhead. The session recommends SSE as a fallback for the enterprise customer's use case (if the notification flow is server-to-client only) or Socket.io-style transport negotiation (which starts with long-polling and upgrades to WebSockets when available). The engineer implements the SSE fallback for the affected enterprise customers. What the session misses: the engineer does not ask whether the enterprise ICP is a general characteristic of the product's target market or an edge case for a single customer. For a B2B product targeting enterprises, the proxy compatibility constraint is a structural ICP requirement, not an edge case. The transport decision for the core product should have been SSE or long-polling from the start, with the proxy compatibility requirement as a documented constraint in the ADR. Instead, the constraint is discovered in production through customer support, resolved for the specific customer, and not generalized to a documented product-wide transport policy. The new CTO who joins later asks "what transport does our notification system use and why?" and receives the answer "WebSockets, except for some enterprise customers who use SSE, I think because of some proxy issue." The reasoning — that the enterprise ICP has a structural proxy compatibility requirement that should have driven the original transport choice — is not available because the session that discovered it was not documented as a decision.
The "our reconnect is hammering the database" session. This session is triggered by a database CPU spike during a rolling deploy. An engineer queries "after we deploy, there's a huge spike in database queries that lasts about 30 seconds, then everything returns to normal." The AI session identifies the thundering herd reconnect pattern: all disconnected clients reconnect simultaneously, and each reconnection triggers authentication token verification, session restoration, and subscription state reconstruction, all of which hit the database. The session recommends exponential backoff with jitter. The engineer implements the reconnection policy on the web client but not on the mobile client (a different team) or the desktop client (a third team). Three months later, a rolling deploy during peak hours still produces a database spike: the mobile client is reconnecting with a fixed 500ms retry. The reconnection policy was not documented as a product-wide requirement; each client team implemented their own version. The real-time architecture ADR is the correct place to specify the reconnection policy as a requirement for all client implementations: the minimum backoff window, the maximum backoff window, the jitter calculation, the maximum retry count before surfacing an error, and the event the server emits when it is ready to accept reconnections (so that clients can reconnect as soon as the server signals readiness instead of waiting out the maximum backoff window). A specification in the ADR converts the reconnection policy from a suggestion discovered in one client's incident postmortem to a documented interface contract that all client implementations comply with.
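The reconnection requirements listed above can be written down as a shared contract that every client team implements against. The sketch below is one possible shape, not a prescribed one: ReconnectContract, decideReconnect, and the numbers in the usage note are placeholders, and treating the server's readiness signal as "collapse to the minimum window, still jittered" is an interpretation of the text rather than something it specifies.

```typescript
// The fields mirror the items the ADR is expected to specify.
type ReconnectContract = {
  minBackoffMs: number; // minimum backoff window
  maxBackoffMs: number; // maximum backoff window
  maxRetries: number;   // attempts before surfacing an error to the user
};

type ReconnectDecision =
  | { kind: "retry"; delayMs: number }
  | { kind: "surface-error" };

// attempt is zero-based: 0 is the first retry after a disconnect.
function decideReconnect(
  attempt: number,
  contract: ReconnectContract,
  serverSignalledReady: boolean,
  random: () => number = Math.random
): ReconnectDecision {
  if (attempt >= contract.maxRetries) return { kind: "surface-error" };
  // Assumption: once the server has signalled readiness, clients use the
  // minimum window. The delay is still jittered so they do not return in
  // lockstep.
  const windowMs = serverSignalledReady
    ? contract.minBackoffMs
    : Math.min(contract.maxBackoffMs, contract.minBackoffMs * Math.pow(2, attempt));
  return { kind: "retry", delayMs: Math.floor(random() * windowMs) };
}
```

With a contract of { minBackoffMs: 500, maxBackoffMs: 30000, maxRetries: 8 }, the fourth retry draws its delay from a 4,000 ms window and the ninth consecutive failure surfaces an error. Web, mobile, and desktop clients that port this one function produce the same server-side load profile.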
Five ADR sections for real-time architecture
A real-time architecture ADR that prevents the sticky session surprise, the thundering herd incident, the proxy compatibility discovery, and the state externalization refactor covers five sections that teams consistently omit from the initial transport implementation.
First, transport choice with ICP compatibility rationale. The ADR documents the transport mechanism chosen — WebSockets, SSE, long-polling, HTTP streaming, or managed broker — with the alternatives evaluated and the explicit rejection reasons. The rejection reasons must reference the ICP's network environment: a B2B product targeting enterprise customers whose IT infrastructure routes traffic through HTTP-only proxies cannot choose WebSockets as the primary transport without a fallback, because the proxy compatibility problem is not an edge case for that ICP — it is a predictable consequence of the market segment. The compatibility envelope of each transport must be documented: WebSockets require network infrastructure that passes the WebSocket upgrade header or allows opaque TLS tunneling; SSE requires HTTP/2 or per-origin connection limit management on HTTP/1.1; long-polling requires no protocol-level support beyond standard HTTP. The transport choice for the core product, the transport for the enterprise fallback path (if applicable), and the transport negotiation mechanism (if the product must support multiple transports dynamically) are all documented in this section. A product that discovers mid-development that corporate proxies block their chosen transport and switches to SSE has made two transport decisions — the decision that was logged in the initial sprint and the decision that replaced it — and both should appear in the ADR with the context that drove the change.
Second, connection state model and scaling policy. The ADR documents where connection state is held and what auto-scaling policy that choice enables. Per-instance in-memory state requires sticky sessions at the load balancer, must document the stickiness duration setting, and must name the scaling ceiling at which the sticky session approach requires replacement — expressed as a connection count or a deployment frequency. Externalized state requires documentation of the shared store (Redis cluster, managed broker, database table), the pub-sub channel model (one channel per user, per document, per room), the state written to the store on connection establishment, and the state read from the store on reconnect. The auto-scaling policy that the state model enables is documented explicitly: "stateless — ALB can route any client to any instance, auto-scaling adds instances and existing connections distribute over time as clients reconnect" or "sticky sessions — auto-scaling adds capacity for new connections only; existing connections do not migrate until clients reconnect." The connection count per instance limit is documented for the specific runtime and instance type: a Node.js process on a t3.medium with 2 GB RAM and the current per-connection state size can support approximately N concurrent connections before memory pressure forces a restart — a number derived from load testing, not assumed. This section is the mechanism that prevents the "we enabled sticky sessions once and now nobody knows why" problem: the stickiness setting is a deliberate, documented choice with a named trigger for revisitation (the scaling ceiling), not an undocumented configuration change made in the console during an incident. 
The multi-tenancy decision record intersects here: a multi-tenant product that holds per-tenant channel state in memory must consider the tenant isolation model in the connection state design — cross-tenant event delivery through a shared Redis pub-sub channel is a data isolation failure, and the pub-sub channel design must enforce tenant boundaries as explicitly as the application's data access layer does.
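One way to make the tenant boundary explicit in the pub-sub channel design is to build every channel name from the tenant ID and to check that prefix before any subscription is accepted. The sketch below is a minimal illustration under assumed conventions: the "tenant:<id>:<scope>:<resource>" naming scheme and the channelName and maySubscribe helpers are hypothetical, and the code deliberately does not call any Redis client.

```typescript
// Channel scopes mirroring the "per user, per document, per room" models.
type ChannelScope = "user" | "document" | "room";

// Build a tenant-scoped channel name. The colon is reserved as a separator,
// so IDs containing it are rejected to keep prefixes unambiguous.
function channelName(tenantId: string, scope: ChannelScope, resourceId: string): string {
  for (const part of [tenantId, resourceId]) {
    if (part.length === 0 || part.includes(":")) {
      throw new Error(`invalid channel component: "${part}"`);
    }
  }
  return `tenant:${tenantId}:${scope}:${resourceId}`;
}

// Allow a subscription only when the channel belongs to the session's tenant.
function maySubscribe(sessionTenantId: string, channel: string): boolean {
  if (sessionTenantId.length === 0 || sessionTenantId.includes(":")) {
    return false;
  }
  return channel.startsWith(`tenant:${sessionTenantId}:`);
}

const docChannel = channelName("acme", "document", "42");
console.log(docChannel, maySubscribe("acme", docChannel), maySubscribe("ac", docChannel));
```

Because the separator cannot appear inside an ID, a tenant named "ac" can never match a channel that belongs to "acme".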
Third, client reconnection policy specification. The ADR specifies the reconnection policy as an interface contract for all client implementations — web, mobile, and desktop — not as implementation guidance for a single client. The specification covers: the initial reconnection delay (the minimum backoff window before the first reconnect attempt, recommended 1–5 seconds), the backoff multiplier (the factor by which the delay increases with each failed attempt, recommended 2×), the maximum backoff window (the ceiling on the reconnection delay, recommended 30–60 seconds), the jitter function (a random offset added to each backoff duration, drawn from a uniform distribution over the current window, to spread reconnect attempts across time), the maximum retry count (the number of reconnection attempts before surfacing a visible error to the user or switching to a degraded-mode polling fallback), and the server-ready signal (an event or endpoint the server exposes when it is ready to accept new connections after a restart, allowing clients to reconnect proactively rather than waiting for the maximum backoff window). The specification includes the load calculation at the target connection count: with N concurrent connections and the specified backoff parameters, the peak reconnect request rate following an instance restart is approximately N / (jitter_window_seconds), which must be within the server's connection acceptance capacity per second. A specification that produces a reconnect storm at the expected connection count is incorrect at design time, not after the first production incident. The API rate limiting decision record intersects here: a rate limiter that applies to the WebSocket upgrade endpoint or the SSE connection endpoint must be configured with the peak reconnect rate in mind — a rate limit too low for the reconnect storm will prevent legitimate reconnections after a deploy, compounding the downtime with reconnection failures.
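The parameters in this specification can be captured in a few lines that every client team ports to its own platform. The following TypeScript sketch is one possible reading, not a reference implementation: the ReconnectPolicy shape and function names are hypothetical, the example values are picked from the recommended ranges (the retry count of 8 is an assumption), and the jitter is implemented as "full jitter", meaning the delay is drawn uniformly from zero up to the current window.

```typescript
// Hypothetical policy shape; the field names are illustrative, not from any library.
interface ReconnectPolicy {
  initialDelayMs: number; // minimum backoff window
  multiplier: number;     // growth factor applied after each failed attempt
  maxDelayMs: number;     // ceiling on the backoff window
  maxRetries: number;     // attempts allowed before surfacing an error
}

// Example values from the recommended ranges; maxRetries is an assumption.
const examplePolicy: ReconnectPolicy = {
  initialDelayMs: 1000,
  multiplier: 2,
  maxDelayMs: 30000,
  maxRetries: 8,
};

// Size of the backoff window for a zero-based attempt number, capped at the ceiling.
function backoffWindowMs(policy: ReconnectPolicy, attempt: number): number {
  const grown = policy.initialDelayMs * Math.pow(policy.multiplier, attempt);
  return Math.min(policy.maxDelayMs, grown);
}

// Delay before the next attempt, drawn uniformly from [0, window).
// Returns null once the retry budget is spent, so the caller can show an error
// or switch to a degraded polling mode.
function nextDelayMs(
  policy: ReconnectPolicy,
  attempt: number,
  random: () => number = Math.random,
): number | null {
  if (attempt >= policy.maxRetries) {
    return null;
  }
  return random() * backoffWindowMs(policy, attempt);
}

// Design-time load estimate: N connections spread evenly over the jitter window.
function peakReconnectsPerSecond(connections: number, jitterWindowMs: number): number {
  return connections / (jitterWindowMs / 1000);
}
```

With these helpers, checking the specification against the server's connection acceptance capacity is a single call to peakReconnectsPerSecond at the target connection count.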
Fourth, message delivery semantics and ordering guarantees. The ADR documents what the transport guarantees about message delivery and ordering, and what the application must implement to provide stronger guarantees when the transport's guarantees are insufficient. WebSockets deliver messages in order and exactly once within a single connection, but a connection break resets the delivery sequence; messages sent while the client was disconnected are lost unless the server implements a buffer. SSE's Last-Event-ID mechanism provides at-least-once delivery across reconnections when the server assigns sequential IDs to events and buffers recent events, but the buffer size and retention window must be documented, because an event that expires from the buffer before the client reconnects is lost. Long-polling appears to deliver each response exactly once (the server completes the response and the client acknowledges receipt implicitly by issuing a new long-poll request), but the real guarantee is at-most-once: a network failure that drops the response after it leaves the server but before it reaches the client produces a silent loss, in which the server believes the event was delivered and the client never received it. The delivery semantics documented in the ADR determine what the application layer must add to provide the user-facing guarantee: exactly-once delivery of notifications to the user's notification center, or at-most-once delivery of presence updates where the latest state supersedes any lost intermediate states. A notification system that requires exactly-once delivery of each notification must implement application-layer sequence numbers, client-side acknowledgment, and server-side event buffering regardless of the transport, because the transport's delivery guarantee is not sufficient on its own.
Documenting this requirement at transport adoption time prevents the "some users are missing notifications and we don't know why" investigation, which is the application-layer delivery gap surfacing in production because the transport's delivery semantics were assumed rather than documented. The message broker decision record intersects here: a message broker in the server-side event pipeline adds delivery guarantees that the real-time transport does not provide — the broker ensures each event is delivered to the application server at least once, but the application server-to-client delivery gap is still determined by the transport and the application-layer acknowledgment model.
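The combination of sequence numbers and server-side buffering described here can be sketched as a bounded in-memory buffer that replays everything after the client's last acknowledged ID and reports a gap when the needed events have already expired. This is an assumption-laden illustration: the EventBuffer class and its method names are hypothetical, the buffer is bounded by event count and not by time, and a production version would live in a shared store, not in process memory.

```typescript
interface BufferedEvent<T> {
  id: number; // application-layer sequence number, starting at 1
  payload: T;
}

// Either the events the client missed, or a signal that some of them have expired.
type ReplayResult<T> =
  | { kind: "replay"; events: BufferedEvent<T>[] }
  | { kind: "gap" };

class EventBuffer<T> {
  private readonly capacity: number;
  private events: BufferedEvent<T>[] = [];
  private nextId = 1;

  constructor(capacity: number) {
    if (capacity < 1) {
      throw new Error("capacity must be at least 1");
    }
    this.capacity = capacity;
  }

  // Assign the next sequence number and drop the oldest event when full.
  append(payload: T): BufferedEvent<T> {
    const event: BufferedEvent<T> = { id: this.nextId++, payload };
    this.events.push(event);
    if (this.events.length > this.capacity) {
      this.events.shift();
    }
    return event;
  }

  // Replay everything after lastSeenId, or report a gap if the event
  // immediately following lastSeenId is no longer in the buffer.
  replaySince(lastSeenId: number): ReplayResult<T> {
    const oldest = this.events[0];
    const oldestId = oldest ? oldest.id : this.nextId;
    if (lastSeenId + 1 < oldestId) {
      return { kind: "gap" };
    }
    return { kind: "replay", events: this.events.filter((e) => e.id > lastSeenId) };
  }
}
```

The "gap" result is the case the ADR's buffer size and retention window have to account for: the client must fall back to a full state refresh, because replay alone can no longer close the hole.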
Fifth, third-party dependency and degraded-mode fallback. The ADR documents the third-party dependencies in the real-time architecture and the degraded-mode behavior when those dependencies are unavailable. For self-hosted WebSocket infrastructure, the dependency is the server instances and the shared state store; the degraded mode is a loss of real-time updates, with the client falling back to polling at a reduced frequency. For a managed broker, the dependency is the broker's availability; the degraded mode must be specified explicitly: does the product surface real-time updates from the broker and nothing otherwise (the real-time feature is optional and the core product works without it), or does the product route mandatory confirmations through the broker (the core product cannot function without the broker's delivery guarantee)? A managed broker in the critical path of a mandatory user interaction, such as a two-factor code delivered via the broker or a payment confirmation that the client waits for before completing the checkout flow, is a single point of failure for the core product, governed by a third-party SLA. The ADR must document the fallback for broker unavailability: a polling endpoint that the client switches to when the broker connection fails, a server-specified timeout after which the client completes the interaction without the real-time confirmation, or an explicit error to the user with instructions to retry. The fallback behavior must be tested, not just documented, to confirm it activates correctly when the broker is unavailable. The CI/CD pipeline decision record must include a test stage that simulates broker unavailability (blocking outbound connections to the broker's domain) to verify that the fallback activates and the core product functionality continues. A fallback documented in the ADR but not verified in the CI pipeline is unproven, and an unproven fallback should be assumed not to work during an incident.
The "decisions that were never written down" pattern applies here as much as it does to any other architectural choice: the third-party broker dependency is a decision whose consequences are invisible until the first broker outage during peak traffic, at which point the question "what does our product do when the broker is unavailable?" will be answered by whatever the code happens to do, which may be a 500 error loop, a spinner that never resolves, or a silent hang. Documenting and testing the answer before that question is asked under pressure is what separates a recoverable degradation from an incident.
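Writing the degraded-mode behavior down as an explicit transition function is one way to make it testable in CI. The sketch below assumes a client with three modes and three broker-related events; the Mode and BrokerEvent names and the nextMode function are hypothetical and only illustrate the shape such a specification could take.

```typescript
// Hypothetical client modes: live updates, polling fallback, visible error.
type Mode = "realtime" | "polling" | "error";

// Hypothetical events the client reacts to.
type BrokerEvent = "broker_unavailable" | "broker_recovered" | "polling_failed";

// Pure transition function, so every path can be asserted in a test stage.
function nextMode(mode: Mode, event: BrokerEvent): Mode {
  if (event === "broker_recovered") {
    // Recovery always restores the real-time path.
    return "realtime";
  }
  if (event === "broker_unavailable") {
    // Only a client that was receiving live updates needs to start polling.
    return mode === "realtime" ? "polling" : mode;
  }
  // Remaining event is "polling_failed": the fallback itself failed,
  // so surface an explicit error to the user.
  return mode === "polling" ? "error" : mode;
}

// Walk through an outage: broker drops, polling fails, broker comes back.
let mode: Mode = "realtime";
mode = nextMode(mode, "broker_unavailable");
mode = nextMode(mode, "polling_failed");
mode = nextMode(mode, "broker_recovered");
console.log(mode);
```

Because the function has no side effects, a CI stage can assert each transition without a live broker, and a separate integration stage can confirm that blocked broker connections actually emit the "broker_unavailable" event.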
None of these five sections appear in the "how do I push real-time updates?" AI session that initiated the transport choice. The initial session covers the mechanics of getting a WebSocket or SSE connection working. It does not cover what happens at 2,000 connections with sticky sessions, what the client does when the server restarts, whether corporate proxies block the connection, what happens to messages sent during a disconnect window, or what the product does when the managed broker is unavailable. These are not advanced concerns — they are the normal operating conditions of a real-time feature in a production product at any scale beyond a single instance and a handful of users. The WhyChose extractor surfaces the initial session, the sticky session investigation, the proxy incident, and the thundering herd postmortem from AI chat history; the real-time architecture ADR takes the decisions buried in those sessions and converts them into a documented transport contract, scaling policy, and reconnection specification — written before the incidents that make those constraints visible at the worst possible moment.
FAQs
Why do WebSockets require sticky sessions and why does that prevent horizontal auto-scaling?
A WebSocket connection is a persistent, stateful TCP connection between a specific client and a specific server instance. Once the handshake completes, all subsequent messages for that connection travel on that TCP socket. If the server holds per-connection state in process memory — subscription filters, channel memberships, presence data, sequence numbers — that state exists only on the instance that accepted the handshake.
Sticky sessions ensure a client reconnects to the same instance, preserving its in-memory state. This creates the scaling ceiling: auto-scaling adds new instances and routes new connections to them, but existing connections remain pinned to original instances. The new instances are underloaded; the original instances remain overloaded. Adding instances provides no relief for the existing load distribution. Removing an overloaded instance requires migrating all its pinned connections — dropping and re-establishing them — which produces a reconnect storm.
The alternative that enables free horizontal scaling is externalizing connection state to a shared store: Redis pub-sub, a managed broker, or a similar mechanism. Any instance can reconstruct a client's context on reconnect because the state is in the shared store, not in the instance that previously held the connection. The WebSocket ADR must document which model was chosen at implementation time — the choice directly constrains the auto-scaling policy and cannot be changed without a refactor of the connection state management approach.
What is the thundering herd reconnect problem and why does it only appear at the first rolling deploy?
The thundering herd reconnect problem occurs when a large number of clients disconnect simultaneously and all attempt to reconnect at approximately the same moment. A rolling deploy cycles through server instances and drops each instance's connections in sequence. Without a reconnection backoff policy, clients that detect a disconnect attempt to reconnect immediately — within milliseconds of the disconnect event.
At 2,000 concurrent connections, an instance restart produces 2,000 simultaneous TCP handshakes, TLS negotiations, authentication token verifications, and subscription state reconstructions in the same 200ms window. This spike is qualitatively different from the load of 2,000 connections arriving gradually over normal user activity. The behavior is first observed during the first rolling production deploy at scale — staging environments with 50 concurrent connections and a single instance never expose it.
An exponential backoff policy with jitter spreads reconnect attempts across tens of seconds instead of tens of milliseconds. Exponential backoff alone is insufficient if all clients disconnect at the same timestamp: they arrive at the same reconnection window simultaneously. Jitter — a random offset within the current backoff window — is required to spread the reconnect attempts across time. The reconnection policy is a server-side load management decision encoded as a client-side behavior; it belongs in the real-time architecture ADR as a specified requirement for all client implementations, not left to each client team to discover independently.
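The difference between backoff alone and backoff with jitter can be checked with a small simulation. The TypeScript sketch below is an illustration under simplifying assumptions: every client disconnects at time zero, each makes exactly one reconnect attempt, and the function names peakPerSecond and simulatePeak are hypothetical. It counts how many reconnects land in the busiest one-second bucket.

```typescript
// Count reconnect attempts per one-second bucket and return the largest count.
function peakPerSecond(reconnectTimesMs: readonly number[]): number {
  const buckets = new Map<number, number>();
  let peak = 0;
  for (const t of reconnectTimesMs) {
    const second = Math.floor(t / 1000);
    const count = (buckets.get(second) ?? 0) + 1;
    buckets.set(second, count);
    if (count > peak) {
      peak = count;
    }
  }
  return peak;
}

// All clients disconnect at t = 0. Without jitter each waits the full window,
// so they all return at the same instant. With jitter each waits a uniformly
// random fraction of the window.
function simulatePeak(
  clients: number,
  windowMs: number,
  useJitter: boolean,
  random: () => number = Math.random,
): number {
  const times: number[] = [];
  for (let i = 0; i < clients; i++) {
    times.push(useJitter ? random() * windowMs : windowMs);
  }
  return peakPerSecond(times);
}

// 2,000 clients and a 10-second window.
const withoutJitter = simulatePeak(2000, 10000, false);
const withJitter = simulatePeak(2000, 10000, true);
console.log({ withoutJitter, withJitter });
```

Without jitter the peak equals the full client count, because the backoff only delays the herd. With jitter the peak lands near the client count divided by the window length in seconds, varying slightly from run to run.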
When is Server-Sent Events a better choice than WebSockets and what are its operational limitations?
SSE is a better choice than WebSockets for unidirectional real-time data flows: live dashboards, activity feeds, notification streams, progress updates, and AI-generated text streaming. SSE connections are plain HTTP connections — they work through HTTP/2 multiplexing, standard reverse proxies, and CDN infrastructure without requiring WebSocket protocol support. Corporate HTTP proxies that block WebSocket upgrade requests pass SSE connections through as ordinary long-lived HTTP responses.
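SSE has a built-in Last-Event-ID resume mechanism: the browser automatically sends the last received event ID on reconnect, and the server can resume the stream from that point without client-side bookkeeping. The browser's EventSource API also reconnects automatically after a dropped connection, waiting a reconnection time that the server can set with the retry field, so the application developer does not need to implement basic reconnection for browser clients. That reconnection time is a fixed delay by default; the specification permits but does not require browsers to back off further, so exponential backoff with jitter cannot be assumed. A product that needs the reconnect spread described in the reconnection policy must vary the retry value on the server or manage reconnection in application code.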
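On the wire, the event ID that drives the resume mechanism is just a text field in the event stream, alongside the optional retry field that tells the browser how long to wait before reconnecting. The sketch below formats a single event in the text/event-stream format; the SseEvent interface and formatSseEvent function are hypothetical helpers, and the code assumes that id and event contain no newline characters.

```typescript
// Hypothetical event shape. `id` and `event` are assumed to contain no newlines.
interface SseEvent {
  id?: string;      // echoed back by the browser as Last-Event-ID on reconnect
  event?: string;   // optional event type; the browser defaults to "message"
  retryMs?: number; // reconnection time the browser should use, in milliseconds
  data: string;     // payload; may span multiple lines
}

// Serialize one event in text/event-stream format.
// Each payload line gets its own "data:" field, and a blank line ends the event.
function formatSseEvent(e: SseEvent): string {
  const lines: string[] = [];
  if (e.id !== undefined) {
    lines.push(`id: ${e.id}`);
  }
  if (e.event !== undefined) {
    lines.push(`event: ${e.event}`);
  }
  if (e.retryMs !== undefined) {
    lines.push(`retry: ${Math.round(e.retryMs)}`);
  }
  for (const line of e.data.split("\n")) {
    lines.push(`data: ${line}`);
  }
  return lines.join("\n") + "\n\n";
}

const frame = formatSseEvent({ id: "5", data: "first line\nsecond line" });
console.log(JSON.stringify(frame));
```

A server that wants to spread reconnects after a deploy could send a different retryMs value to each client before closing the stream, which moves the jitter decision to the server side.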
The operational limitation is the HTTP/1.1 per-origin connection limit: browsers typically allow six concurrent HTTP/1.1 connections per origin, shared across all tabs, and each SSE connection consumes one. HTTP/2 removes this bottleneck by multiplexing multiple streams over one TCP connection; the ceiling becomes the negotiated maximum number of concurrent streams, which is commonly around 100. SSE adoption over HTTP/1.1 without managing this limit can exhaust the browser's connection pool and block other HTTP requests on the same origin. The infrastructure prerequisite, HTTP/2 support end-to-end through CDN, load balancer, and origin server, must be verified before SSE is chosen as the transport, not after the connection limit appears in production.