Why does database connection pooling need an architecture decision record?

Database connection pooling is one of the least-documented architectural choices in most production systems, despite being one of the most consequential. The pooler you chose — PgBouncer, RDS Proxy, application-level pooling, or no external pooler — determines three structural properties of your database layer: the connection exhaustion ceiling (how many concurrent application threads or processes can be active before PostgreSQL rejects new connections with FATAL: sorry, too many clients already), the server RAM commitment per connection (each PostgreSQL backend process requires 5–10MB RAM whether it is actively executing a query or idle), and the failover recovery behavior (how fast the application reconnects after the database restarts, the primary fails over to a replica, or a maintenance window forces a connection drop). Without a decision record, the team in year three cannot reason about why the connection pool is sized the way it is, why a specific pooler was chosen over alternatives, or whether the pooling mode (transaction, session, or statement) is safe to change when a new ORM or framework is introduced. The connection pooling ADR is also where prepared statement behavior under PgBouncer transaction mode — a source of subtle and hard-to-diagnose application failures — must be documented.

What is the difference between PgBouncer transaction mode and session mode?

PgBouncer supports three connection pooling modes that determine when a server connection (a real PostgreSQL backend process) is returned to the pool for reuse by a different client. In session mode, each client connection holds a dedicated server connection for the duration of the client connection lifetime — semantically equivalent to no pooling at all, with the only benefit being connection multiplexing at the OS level. Session mode is safe for all PostgreSQL features including prepared statements, advisory locks, listen/notify, and session-local configuration settings. In transaction mode, the server connection is held only for the duration of a single transaction — the connection is returned to the pool between transactions, allowing many more client connections to share a smaller number of server connections. Transaction mode is the mode that provides genuine connection multiplexing benefits: a pool of 50 server connections can serve hundreds or thousands of client connections if each client spends time between transactions. The constraint: transaction mode breaks features that assume a persistent server connection across transactions. Prepared statements created with PREPARE and used with EXECUTE require a persistent server connection because the prepared statement is stored in the server process memory — in transaction mode, the client may be assigned a different server connection for EXECUTE than for PREPARE, causing a 'prepared statement does not exist' error. Named cursors declared with DECLARE CURSOR and used with FETCH require the same server connection across multiple statements for the same reason. Applications using these features must either use protocol-level prepared statements (which PgBouncer 1.21+ handles via a prepared statement tracking mechanism) or disable them in favor of inline queries. In statement mode, the server connection is returned to the pool after each individual SQL statement, which is more aggressive than transaction mode and incompatible with any multi-statement transaction — it is appropriate only for read-only workloads or single-statement-per-request architectures.

When should a team use RDS Proxy instead of PgBouncer?

RDS Proxy is AWS's fully managed connection pooler for RDS and Aurora databases. It is the appropriate choice when three conditions are present: the application runs on AWS (RDS Proxy is an AWS-only service with no self-hosted equivalent), the team wants to avoid operating and monitoring a separate PgBouncer process (RDS Proxy is fully managed — AWS handles availability, upgrades, and scaling), and the application requires IAM-based authentication for database connections (RDS Proxy supports IAM authentication as a first-class feature, eliminating the need to manage database credentials in application configuration). RDS Proxy is also the better choice for Lambda-backed applications: Lambda functions scale to hundreds of concurrent executions without connection pooling, and each Lambda execution environment holding its own database connection would exhaust max_connections instantly. RDS Proxy multiplexes Lambda connections to a manageable number of database backend processes, with pinning behavior that holds a connection for the Lambda execution environment when session-level state is detected. PgBouncer remains the better choice when the application is not on AWS, when the team needs granular control over pooling mode and pool sizing per database user or application, when per-connection overhead of RDS Proxy (which adds 1–2ms of latency per query) is unacceptable for latency-sensitive workloads, or when the team has existing PgBouncer operational experience and monitoring already in place. The connection pooling ADR must specify the rationale for this choice because a team migrating from EC2-hosted PostgreSQL to RDS later must evaluate whether to migrate from PgBouncer to RDS Proxy at the same time, and the original rationale for PgBouncer — if absent from any decision record — cannot inform that evaluation.

How does connection pooling affect database failover recovery time?

The connection pooler is the first component that must detect a database restart or primary failover and reconnect application clients to the new primary. The failover recovery time visible to the application — the period from the old primary becoming unavailable to the first successful query on the new primary — is determined by three intervals: the time for the database to complete failover (Patroni or AWS Aurora automatic failover typically completes in 15–60 seconds), the time for the pooler to detect the connection failure and drain its connection pool, and the time for the pooler to establish new connections to the new primary endpoint. PgBouncer detects server connection failures when the next query attempt fails, and reconnects clients transparently if reconnect_db_connections is enabled — but the reconnect uses the configured host parameter, which must be a DNS name that points to the current primary (a floating IP via keepalived, the Patroni-managed primary endpoint, or a load balancer VIP) rather than a static host:port pointing to the old primary's IP address. RDS Proxy handles Aurora failover automatically — it detects the primary change via the Aurora cluster endpoint, reconnects its pooled connections to the new primary, and reduces application-visible downtime to under 30 seconds for Aurora Global Database. Application-level poolers (SQLAlchemy connection pool, HikariCP) detect connection failures per-connection when a query attempt throws a database error — the recovery behavior depends on the pool's connection validation and eviction configuration. SQLAlchemy's pool_pre_ping option tests each connection before returning it from the pool, evicting connections that have died since being pooled, but this test adds one round-trip latency to each connection checkout during the recovery period. HikariCP's keepaliveTime and connectionTimeout settings determine how quickly dead connections are detected and replaced. The connection pooling ADR must document the failover detection mechanism and the expected application-visible outage duration for each failure mode — database restart, primary failover, and connection pool exhaustion — because these numbers determine whether the database tier meets the availability SLO.

2026-06-21 · ~20 min read

The database connection pooling decision record: why the pooler you chose determines your connection exhaustion ceiling and your failover behavior during database restarts

Database connection pooling looks like a performance tuning detail until your application hits FATAL: sorry, too many clients already at 2am and your on-call engineer doesn't know whether the pooler is PgBouncer in transaction mode or the ORM's built-in pool — and the ChatGPT session where that decision was made is buried in a window that closed eighteen months ago. The pooler you chose in year one determines your connection exhaustion ceiling, your server RAM commitment per idle connection, and how fast your application recovers when the database restarts.

A sixteen-person B2B SaaS company ran their primary PostgreSQL database on a single db.r6g.xlarge RDS instance: 4 vCPUs, 32 GB RAM, and a max_connections value of 340 set automatically by RDS based on instance RAM. The application was a Rails monolith. ActiveRecord's connection pool was configured at the default — 5 connections per worker process — and they ran 12 Puma workers across three application servers. That's 60 application-side connections. The background job system was Sidekiq: 25 concurrency, 5 connections per thread, 4 Sidekiq processes across two servers — another 500 connections on paper, though Sidekiq's actual active concurrency under typical load was a fraction of that. For thirteen months, the math worked. Peak observed connections hovered around 180. Nobody thought about connection pooling.

Then the sales team closed a deal that required a data migration: importing three years of historical records from a legacy system into the PostgreSQL database. The migration script was a Python process that opened 50 concurrent database connections to parallelize the INSERT batches. The engineering team ran it at 6pm on a Friday. At 6:17pm, the application began returning 500 errors. The error logs showed PG::ConnectionBad: FATAL: sorry, too many clients already. The Puma workers, Sidekiq processes, the migration script, and the database health check endpoint had collectively exceeded 340 connections. PostgreSQL was rejecting every new connection attempt. The migration script was killed. The existing connections, most of them idle, did not release fast enough for the health checks to pass. For eleven minutes, the application was completely unavailable while the on-call engineer scrambled to figure out which process was holding the excess connections.

The post-mortem revealed three decisions that had never been documented: no external connection pooler had been deployed, the Sidekiq connection pool was misconfigured (Sidekiq concurrency × pool_size per process produces far more connections than the concurrency ceiling requires), and no connection limit had been set at the database user level to prevent a runaway process from consuming the entire max_connections budget. The ChatGPT session where the team decided to defer PgBouncer — "we'll add it when we need it" — was gone. The session that configured Sidekiq's connection pool — which had been copy-pasted from a Stack Overflow answer and never revisited — no longer existed. The decisions that aren't written down don't disappear; they become the undocumented constraints that fail at the worst possible moment.

Why connection pooling is a decision cluster, not a configuration detail

Connection pooling is routinely treated as an operational detail — something you configure once, forget about, and only revisit when something breaks. This framing misses the three structural properties that the pooling choice determines, each of which constrains later decisions in ways that are invisible until the constraint is hit.

The connection exhaustion ceiling is determined by the pooler topology, not by max_connections alone. PostgreSQL's max_connections parameter is the hard limit on simultaneous backend processes. Hitting it produces an immediate FATAL error that the application cannot retry its way out of — the connection cannot be established, period. The connection exhaustion ceiling for your application is the gap between max_connections and the maximum number of connections your application processes will attempt to open under peak load, stress test conditions, or runaway script execution. Without an external pooler, that gap depends on every team knowing and respecting the connection budget; with an external pooler in transaction mode, the application can have thousands of client connections multiplexed onto a small, controlled number of server connections. The pooler is the enforcement mechanism that makes the ceiling reliable rather than aspirational.

The server RAM commitment is determined by the number of server-side connections, not client-side connections. Each PostgreSQL backend process — each actual server connection — requires approximately 5–10 MB of RAM for the shared memory segment, the work memory allocation, and the process stack. For max_connections = 340, that's potentially 1.7–3.4 GB of RAM committed to connection overhead, consumed whether those connections are executing queries or sitting idle waiting for the application to use them. An external pooler in transaction mode reduces the server-side connection count to a fraction of the client-side count: 50 server connections can serve 500 or 5,000 client connections if each client transaction takes milliseconds. The RAM freed by reducing server connections is RAM available for shared_buffers and effective_cache_size — the PostgreSQL parameters that actually determine query performance. The performance optimization decisions made at the database layer are directly constrained by how many server connections the pooler architecture requires.

The failover recovery behavior is determined by the pooler's reconnection logic, not by the database's failover speed. When the database restarts for a maintenance window, a PostgreSQL version upgrade, or a Patroni primary failover, the pooler must detect the connection failure, drain its server connections, and establish new connections to the new primary endpoint. The application-visible outage — the window during which queries fail with connection errors — is the sum of the database's failover time and the pooler's reconnection time. A PgBouncer deployment pointing at a static host IP rather than a floating DNS name will fail to reconnect to the new primary automatically, extending the application outage until the configuration is updated manually. An RDS Proxy deployment will handle Aurora failover reconnection automatically. An application-level pool with no connection validation will return broken connections to application threads that then fail with cryptic errors rather than a clean reconnection attempt. The database migration and upgrade strategy must account for the pooler's failover behavior because every planned maintenance window is a test of the reconnection logic.

The pooling mode determines which PostgreSQL features the application can safely use. This is the decision most teams make implicitly and document never. PgBouncer's transaction pooling mode — the mode that provides the multiplexing benefits — breaks any PostgreSQL feature that assumes a persistent server connection across transaction boundaries: named prepared statements, advisory locks, LISTEN/NOTIFY for pub/sub, temporary tables, and session-level configuration settings (SET LOCAL works within a transaction; SET outside a transaction does not persist across the pool boundary). These are not obscure PostgreSQL features; they are used by every major ORM and migration framework. The architecture decision record for connection pooling must specify the pooling mode and enumerate the PostgreSQL features explicitly declared off-limits as a consequence — because a new team member who introduces a feature that uses server-side prepared statements under PgBouncer transaction mode will spend hours debugging intermittent errors before discovering the constraint.

The PostgreSQL connection model and max_connections math

PostgreSQL uses a process-per-connection model: each client connection spawns a dedicated operating system process on the database server (a "backend process"). This is fundamentally different from the thread-per-connection model used by MySQL and the async worker model used by systems like Nginx. The process model provides strong isolation — a crashing backend process does not affect other connections — but it makes connections expensive. Each backend process has a fixed overhead before any query is executed: process startup time, shared memory mapping, authentication overhead, and the per-process RAM commitment described above.

The max_connections parameter limits the total number of backend processes PostgreSQL will allow. When the limit is hit, PostgreSQL returns a FATAL error to any new connection attempt immediately — there is no queue, no backpressure, no graceful degradation. The setting is a hard ceiling with an immediate cliff edge. AWS RDS and Aurora set max_connections automatically based on instance RAM using the formula LEAST({DBInstanceClassMemory/9531392}, 5000) — approximately one connection per 9.5 MB of instance RAM. For a db.r6g.xlarge (32 GB), this gives approximately 340 connections. For a db.r6g.2xlarge (64 GB), it gives approximately 680. Doubling the instance size to get more connection headroom is a legitimate but expensive strategy — a strategy that an external connection pooler makes unnecessary for most workloads.

Three processes compete for the max_connections budget in a typical production deployment: application servers (each process or thread in the application framework holds a pool of connections), background job workers (background job infrastructure like Sidekiq, Celery, and BullMQ each hold their own connection pools, often misconfigured to hold far more connections than the job concurrency requires), and operational access (human database access via psql, migration scripts, analytics queries from BI tools, and health-check queries from load balancers and monitoring systems). The connection pooling ADR must enumerate all three categories with explicit connection counts and document the total headroom budget remaining for unexpected processes — because the system that exceeded max_connections in the opening narrative was a migration script that nobody had included in the connection budget calculation.

PgBouncer: the standard external connection pooler

PgBouncer is the most widely deployed external connection pooler for PostgreSQL. It is a lightweight single-threaded proxy process (written in C) that sits between the application and PostgreSQL, accepting client connections and multiplexing them onto a smaller pool of real PostgreSQL server connections. PgBouncer is open source, extensively tested, and has been the production-standard choice for PostgreSQL connection pooling for over a decade.

Transaction mode is the mode that provides genuine multiplexing benefit: the server connection is held only for the duration of a single database transaction, then returned to the pool. A client with 1,000 concurrent application threads can multiplex onto 50 or 100 server connections if each thread's transactions complete in milliseconds. The multiplexing ratio depends entirely on the average transaction duration and the query throughput — a workload where each thread holds a transaction open for 500 milliseconds needs far more server connections than a workload where each transaction completes in 5 milliseconds. Transaction mode is the right default for OLTP web application workloads where most queries are short INSERT, UPDATE, SELECT operations within milliseconds-long transactions.

The constraint that must be documented in the ADR: transaction mode is incompatible with named prepared statements, advisory locks, LISTEN/NOTIFY, temporary tables, and SET configuration outside of transactions. ORMs and migration frameworks make heavy use of these features. Django's migrations use advisory locks to prevent concurrent migration runs. Ruby on Rails' ActiveRecord::Locking::Advisory uses advisory locks for optimistic locking. SQLAlchemy's compiled query cache uses server-side prepared statements for performance. When these features are used with PgBouncer in transaction mode, they fail with non-obvious errors: advisory lock operations succeed on one server connection and fail silently on another; prepared statement names are unknown on the server connection that receives the EXECUTE call. The team that deploys PgBouncer transaction mode without documenting the feature restrictions will encounter these failures months later, when a new feature or library introduces the restricted behavior without knowing the restriction exists.

PgBouncer 1.21 introduced prepared statement tracking — PgBouncer intercepts Parse messages in the PostgreSQL protocol and transparently manages prepared statements across the server connection pool, allowing transaction mode to be used with frameworks that use the extended query protocol for prepared statements. This significantly reduces the practical restrictions of transaction mode for modern ORMs that use protocol-level prepared statements (as opposed to PREPARE/EXECUTE SQL commands). The ADR must specify the PgBouncer version deployed and whether prepared statement tracking is enabled, because the behavior of the pooler under transaction mode depends on this.

Session mode maps each client connection to a dedicated server connection for the lifetime of the client connection. This provides none of the multiplexing benefit but does provide connection multiplexing at the OS level (the operating system's TCP connection overhead is paid at the PgBouncer level rather than the PostgreSQL level) and centralizes connection management for monitoring and administrative purposes. Session mode is the safe default for applications that use any of the features incompatible with transaction mode — it is better to use session mode and carry the server-side connection count than to use transaction mode and break application features silently.

Deployment topology decisions that the ADR must record: whether PgBouncer runs on each application server (co-located) or as a separate dedicated process (standalone). Co-located PgBouncer simplifies networking (each application server connects to localhost:5432 instead of a remote PgBouncer endpoint) and eliminates the PgBouncer process as a single point of failure. Standalone PgBouncer allows centralized connection pool sizing and monitoring across all application servers, but requires network connectivity from application servers to PgBouncer and high availability configuration (two PgBouncer processes with a floating IP via keepalived or a load balancer) to avoid making PgBouncer itself the availability bottleneck. The infrastructure-as-code strategy determines whether PgBouncer configuration — pool sizes, auth_file, server host, pooling mode — is version-controlled and applied consistently across environments or managed ad-hoc on the server.

Application-level connection pooling

Every major application framework and ORM includes a built-in connection pool that manages a set of long-lived database connections across request or job handlers. SQLAlchemy's engine pool, HikariCP for Java/JVM applications, node-postgres Pool for Node.js, GORM's ConnPool for Go, ActiveRecord's connection pool for Rails, Ecto's connection pool for Elixir — each manages a pool of database connections within the application process, reusing connections across queries rather than opening a new connection per query.

Application-level pooling is appropriate as the sole connection management strategy when three conditions hold: the total number of application processes times the pool size per process fits comfortably within the database's connection budget with headroom, there is no concern about connection exhaustion from unexpected external processes (migration scripts, analytics tools, monitoring agents), and the application does not need to scale application processes independently of the database connection budget. For small teams with a single application server and a reasonably sized database instance, application-level pooling alone is the correct starting point — introducing PgBouncer or RDS Proxy before the connection exhaustion ceiling becomes a real constraint adds operational complexity without meaningful benefit.

The critical misconfiguration that application-level pooling produces at scale is connection proliferation in background job systems. Sidekiq (Ruby), Celery (Python), BullMQ (Node.js), and similar systems multiplex many concurrent jobs across a pool of worker threads, and each thread may hold a database connection. The naive configuration — pool_size equal to the thread concurrency — is correct only if every concurrent job uses the database simultaneously. In practice, most background jobs alternate between database-intensive phases and CPU-bound or I/O-wait phases. A Sidekiq worker with 25 concurrency and pool_size = 25 across 4 processes holds 100 server connections even when only 15 of the 100 concurrent job slots are actively executing a database query at any given moment. The connection pooling ADR must document the background job system's concurrency and pool_size configuration and justify the pool_size against the actual observed database query concurrency — not just set it to match the worker concurrency as a default. The observability infrastructure must include a dashboard for active database connections broken down by application component so the team can observe actual connection utilization versus the configured ceiling.

RDS Proxy: the managed option for AWS workloads

AWS RDS Proxy is a fully managed connection pooler that sits between the application and an RDS or Aurora database. RDS Proxy manages a pool of long-lived connections to the database and multiplexes application connections onto them, with multiplexing behavior determined by the RDS Proxy pinning rules. RDS Proxy is the correct default for teams running on AWS who want the benefits of connection pooling without operating a PgBouncer process.

RDS Proxy supports two authentication modes: native PostgreSQL password authentication and IAM authentication. IAM authentication replaces database password management with short-lived IAM-signed tokens — the application requests a database authentication token from the AWS SDK using its IAM role, and RDS Proxy validates the token against IAM rather than the PostgreSQL pg_hba.conf password file. This eliminates the need to rotate database credentials stored in application configuration or Secrets Manager, because the token is derived from the IAM role identity rather than a static credential. The authentication strategy for database access should document whether RDS Proxy IAM authentication is used and how the IAM role trust policy is scoped to prevent unauthorized database access from other application roles.

RDS Proxy handles Aurora failover automatically: when the Aurora primary fails and a replica is promoted, RDS Proxy reconnects its pooled connections to the new primary endpoint within the Aurora failover window (typically 15–30 seconds for in-region failover). The application connects to the RDS Proxy endpoint (a stable DNS name that does not change during failover) rather than the Aurora cluster writer endpoint directly. The application-visible outage during a failover is reduced to the period between the old primary's last committed transaction and the first successful query through RDS Proxy to the new primary — typically under 30 seconds, compared to the 60–90 seconds that a PgBouncer deployment without a DNS-level failover mechanism might take to detect and reconnect.

The latency cost of RDS Proxy is 1–2 milliseconds added to each query round-trip — the overhead of the proxy hop between the application and the database. For latency-sensitive read paths where the baseline query time is under 5 milliseconds, this overhead is significant as a percentage of total query time. For write-heavy OLTP workloads where most queries take 5–50 milliseconds, the overhead is negligible. The ADR must document whether the 1–2ms proxy overhead is acceptable for the application's latency SLO — the performance optimization record for database queries provides the baseline query latency measurements that this judgment requires.

RDS Proxy's pinning behavior is the analog to PgBouncer's pooling mode restrictions and must be documented in the ADR. RDS Proxy pins a client connection to a specific server connection — bypassing the multiplexing behavior — when it detects session-level state that cannot safely be shared across connections: prepared statements created with the protocol-level Prepare message, session-local configuration settings set with SET, temporary tables, and advisory locks. A pinned connection is held for the client for the duration of the client connection, not just the transaction, reducing the multiplexing ratio. RDS Proxy's enhanced logging mode records pinning events with the reason, making it possible to identify which application code is triggering pinning and whether it can be refactored to avoid it. The ADR must specify whether pinning event monitoring is configured and what the acceptable pinning rate is for the application's connection budget.

Newer poolers: PgCat and Odyssey

PgCat (written in Rust) and Odyssey (written in C by Yandex) represent the current generation of PostgreSQL connection poolers that go beyond PgBouncer's feature set. Both support multi-threaded operation (PgBouncer is single-threaded and can be a CPU bottleneck on very high connection rate workloads), query routing to read replicas, and improved prepared statement handling under transaction mode.

PgCat adds read/write query routing — a single PgCat endpoint can route read queries to a pool of read replicas and write queries to the primary, based on query type detection or explicit annotation. This eliminates the need for separate application configuration for read and write endpoints and enables horizontal read scaling without application changes. PgCat's read replica load balancing is configurable: round-robin, random, or least-connections strategy. The operational complexity of PgCat's read routing — which requires accurate topology detection for the primary/replica configuration, and handling of replica lag for read-after-write consistency — is higher than PgBouncer's simpler pass-through model. The ADR must document whether read/write splitting is enabled and the consistency model for read-after-write queries (are they always routed to the primary to avoid replica lag, or is replica lag acceptable for certain query categories?).

For the majority of teams evaluating connection poolers for the first time, PgBouncer in transaction mode (with prepared statement tracking on PgBouncer 1.21+) or RDS Proxy (for AWS deployments) covers the decision space. PgCat and Odyssey are appropriate when PgBouncer's single-threaded model becomes a throughput bottleneck at very high connection rates (thousands of new connections per second) or when read replica routing at the pooler level is a requirement. The ADR must document which pooler was chosen and why, including whether the alternatives were evaluated and what specific capabilities or constraints ruled them out.

Failover behavior during database restarts

The connection pooler's behavior during database restarts is the most operationally significant property for teams running with high-availability database setups (Patroni for self-hosted PostgreSQL, AWS Aurora Multi-AZ, or RDS Multi-AZ). The failure modes that occur during database restarts are systematically different from the connection exhaustion failures described above — and they require the pooler to be specifically configured to handle reconnection correctly.

When the database primary restarts (for a planned maintenance window, a PostgreSQL version upgrade, or a Patroni-triggered failover following primary unavailability), all existing server connections are terminated. The connection pooler holds a pool of server connections to the database; when the database restarts, all those server connections become invalid simultaneously. The pooler's behavior at this point determines the application-visible outage duration.

PgBouncer's default behavior: server connections are validated only when a client tries to use them. A client submitting a query on a server connection that became invalid during the database restart will get an error, PgBouncer will discard the invalid connection, and PgBouncer will attempt to establish a new connection to the database server. If the database has restarted on the same host (a planned restart), reconnection succeeds immediately after the database is available. If the database has failed over to a new primary (a Patroni promotion), reconnection to the original host fails — the original host is now either offline or the standby. PgBouncer must be pointed at a DNS name or virtual IP that resolves to the current primary: a Patroni-managed floating IP via keepalived, a HAProxy or Consul health-checked endpoint that tracks Patroni primary election, or a cloud load balancer that routes to the current primary only. Without this DNS-level indirection, PgBouncer cannot reconnect to a promoted replica automatically — manual intervention is required to update PgBouncer's host configuration and reload the process, extending the application outage until the update is applied.

The CI/CD pipeline must include database failover as a tested failure scenario — not just in the context of application code deployment, but as a test of the connection pooler's reconnection behavior. A team that has never tested failover in a staging environment that mirrors the pooler topology will discover the reconnection behavior for the first time during a production incident.

RDS Proxy's failover behavior is automatic and does not require DNS configuration changes. When Aurora detects a primary failure and promotes a replica (typically within 15–30 seconds), RDS Proxy detects the topology change via the Aurora control plane, drains its server connections to the old primary, and establishes new connections to the new primary — all transparently, with the RDS Proxy endpoint DNS name remaining stable throughout. Application connections to RDS Proxy experience a brief period of failed queries (typically under 30 seconds) during which RDS Proxy returns connection errors, followed by automatic recovery once the new primary is accepting connections. This is the most operationally straightforward failover behavior available for AWS-hosted PostgreSQL workloads and is one of the strongest arguments for RDS Proxy over PgBouncer for Aurora deployments where failover correctness is critical.

Application-level pool failover behavior depends on the pool's connection eviction configuration. SQLAlchemy's pool_pre_ping=True issues a lightweight SELECT 1 query before returning each connection from the pool — if the connection is dead, it is evicted and a new connection is established. This adds one round-trip latency to each connection checkout but ensures that application code never receives a dead connection from the pool. The tradeoff: during the period immediately after a database restart, every connection in the pool will fail the pre-ping check and attempt reconnection simultaneously — a thundering herd against the database as it comes back up. The incident post-mortem process should include connection pool behavior during restart as a line item in any post-mortem for database-related outages, because the reconnection thundering herd is a common source of extended recovery time that is fixable with connection pool backoff configuration.

Connection limit enforcement at the database user level

The connection budget enforcement mechanism that most teams defer and then regret is connection limits on the database user (role) level. PostgreSQL allows per-role connection limits via CREATE ROLE worker_user WITH LOGIN CONNECTION LIMIT 50. This limits the number of simultaneous connections by a specific user to 50, regardless of what max_connections permits. Per-role connection limits create isolation guarantees that pooler-level limits alone cannot provide: a runaway migration script running as the migration_user role cannot consume connections reserved for the app_user role, because migration_user has a lower connection limit that caps its blast radius.

The pattern that prevents the opening narrative's eleven-minute outage: the production application role (app_user) is limited to max_connections × 0.6 — enough for normal operation with headroom for spikes. A migration role (migration_user) is limited to max_connections × 0.1. A monitoring role (monitoring_user) is limited to 10 connections. Superuser connections are reserved at the PostgreSQL level (the superuser_reserved_connections parameter, default 3, holds back that many connections for superuser administrative access even when the connection limit is hit). This structure ensures that no single consumer of connections — however misconfigured or runaway — can exhaust the entire connection budget and lock out all other processes. The connection pooling ADR must specify the per-role connection limits and the rationale for each limit, because the limits constrain the maximum pool size that can be configured for each application component and must be updated whenever the application scales out.

The caching layer decisions interact with the connection budget in a specific way: applications that cache aggressively reduce the number of database queries per request, which reduces the number of connections that need to be simultaneously active to achieve a given request throughput. A team that caches query results in Redis may be able to reduce their active server connection count significantly while maintaining the same request throughput — which makes the difference between needing an external pooler immediately versus in six months. The connection pooling ADR must acknowledge this interaction: the correct time to introduce an external pooler is when the caching layer is already in place and the remaining database connection traffic still threatens the connection exhaustion ceiling, not as a substitute for caching that eliminates the database queries in the first place.

The AI chat sessions that create undocumented pooling decisions

Connection pooling decisions emerge from AI chat sessions in four recognizable patterns, each of which should produce an ADR update but typically does not.

The initial scaling conversation. The team's first connection pooling discussion happens when the database connection count starts climbing toward max_connections. An engineer opens a ChatGPT or Claude session and asks: "We have a Rails app with 8 Puma workers and 5 Sidekiq processes, each with 25 concurrency. We're at 180 connections out of 340 max. Should we add PgBouncer now or wait?" The session produces a recommendation — sometimes PgBouncer now, sometimes "you have plenty of headroom, add it when you need it," sometimes a Sidekiq pool_size reduction as an interim fix. The recommendation is implemented or deferred. The session closes. The rationale — the workload characteristics, the headroom calculation, the alternative the team explicitly chose not to pursue — is gone.

The pooling mode misconfiguration discovery. After deploying PgBouncer, the team encounters intermittent failures: "ERROR: prepared statement 'a1' does not exist" or advisory lock behavior that is inconsistent across requests. An engineer opens a Claude session and diagnoses the issue: PgBouncer transaction mode is incompatible with the prepared statements the ORM is creating. The session produces a fix: switch to session mode, or add prepared_statements = false to the ORM configuration, or enable PgBouncer 1.21's prepared statement tracking. The fix is applied. The session closes. The constraint — that transaction mode is now safe because prepared statements are disabled, or that session mode is being used and multiplexing benefits are limited — is never added to the ADR. The next engineer to tune the pool size does so without knowing which pooling mode is active or why.

The failover incident post-mortem. A database restart during a planned maintenance window causes a longer-than-expected application outage. An engineer uses Claude to diagnose why PgBouncer is not reconnecting to the new primary. The session produces the root cause: PgBouncer's host configuration points to the old primary's static IP rather than the floating IP managed by keepalived, so PgBouncer cannot reach the promoted replica. The fix is applied. The session closes. The root cause — that the original PgBouncer deployment was misconfigured for failover — is documented in the incident post-mortem but not in the connection pooling ADR, so the next deployment of PgBouncer (in the staging environment or for a new service) repeats the same misconfiguration.

The connection budget recalculation. The team adds a new application component — a reporting service, a new Sidekiq queue type, a data export worker — and observes the connection count climbing again. An engineer uses ChatGPT to recalculate the connection budget and determine whether the new component fits within the headroom or requires a pool size reduction elsewhere. The session produces a revised pool_size configuration for the Sidekiq workers and a confirmation that the headroom is adequate. The session closes. The revised budget — the new connection ceiling, the per-component allocations, the rationale for the pool_size reduction — is never added to the ADR. The next budget recalculation starts from zero because the previous calculation exists only in a closed chat window. The WhyChose extractor surfaces these connection budget sessions from your chat history and structures them as ADR updates — the per-component allocations, the headroom calculations, and the constraint decisions that are otherwise lost to the session close.

Writing the connection pooling ADR

The connection pooling architecture decision record needs five sections. The Nygard ADR template provides the right structure; these are the connection pooling-specific fields that must be populated in each section.

Context and problem statement. The total connection budget: max_connections value and how it is set (RDS auto-calculation, manual Terraform variable, PostgreSQL default). The connection consumers enumerated with their pool sizes: application web servers (process count × pool_size per process), background job workers (worker count × pool_size per worker), migration tooling (peak connections used by the migration framework), operational access (DBA, monitoring, analytics). The current peak observed connection count and the headroom remaining. The specific event or threshold that triggered the pooling decision: first deployment, connection count reaching a percentage of max, an outage, or an architectural review.

Decision and alternatives considered. The pooler chosen (PgBouncer, RDS Proxy, PgCat, Odyssey, or application-level-only) and the alternatives explicitly evaluated with rejection rationale. The pooling mode (transaction, session, statement for PgBouncer; pinning behavior for RDS Proxy). The server-side connection pool size — the number of actual PostgreSQL backend processes the pooler will maintain. The deployment topology (co-located or standalone for PgBouncer; RDS Proxy endpoint for AWS). The authentication mechanism (password-based or IAM for RDS Proxy).

PostgreSQL features declared off-limits under the chosen pooling mode. If PgBouncer transaction mode is used: which features are explicitly prohibited (named prepared statements, advisory locks, LISTEN/NOTIFY, temporary tables, session-level SET), which alternatives are approved for each (protocol-level prepared statements if PgBouncer 1.21+ with prepared statement tracking, GET_LOCK alternatives, polling-based notification instead of LISTEN, CTEs or subqueries instead of temporary tables, transaction-level SET LOCAL). This section must be signed off by the engineering team and linked in onboarding documentation — it is the constraint that new team members and new ORMs must know before using the database layer.

Failover behavior and recovery SLO. The primary failover mechanism (Patroni, Aurora Multi-AZ, RDS Multi-AZ, single-AZ with manual recovery). How the pooler detects and reconnects after a failover (DNS-level floating IP for PgBouncer, automatic for RDS Proxy, pre_ping for application-level pools). The expected application-visible outage duration during a planned primary restart and during an unplanned failover. The test procedure for verifying failover behavior in the staging environment. The monitoring alert that fires when pooler reconnection is taking longer than the expected recovery time. The queue and messaging layer must also be reviewed for database dependency during failover — background jobs that retry against a down database can amplify the connection exhaustion problem rather than waiting for recovery.

Connection budget enforcement and recalculation trigger. The per-role connection limits set in the database and the rationale for each limit. The monitoring metric that tracks the active connection count per role and alerts when any role approaches its limit. The event that triggers a connection budget recalculation: adding a new application component, increasing application server count, changing the background job worker count, or upgrading the database instance type. The recalculation must produce an updated ADR revision — not just a configuration change — so that the next recalculation has a documented starting point rather than an undocumented running state.