2026-06-25 · ~21 min read

The email infrastructure decision record: why the email provider you chose determines your deliverability ceiling and your suppression list portability

Email infrastructure setup looks like a 30-minute Mailgun or SendGrid API key until a shared IP pool reputation event drops your transactional open rate by eleven points and routes password reset emails to spam folders, or a provider migration re-sends to 1,847 addresses that had previously unsubscribed and pushes your Gmail spam rate above Google's enforcement threshold. The provider you chose at setup time sets your IP reputation isolation model, your suppression list portability architecture, and your bounce and complaint webhook model, none of which were visible when the first email arrived in the inbox.

A 12-person SaaS product built its email infrastructure in 2023. The backend engineer who set up transactional email had used Mailgun at a previous job, so Mailgun was chosen without evaluation. The API key was configured, DKIM and SPF DNS records were added, a smoke test sent a password reset email to a test address, and the session was complete. The email system worked. For eight months, it continued to work.

In month nine, the head of product noticed that the weekly activation email — sent to users who had signed up but not completed their first export — had an open rate that had dropped from 41% to 30% over a ten-day window. No copy change, no timing change, no volume change. The engineering team opened Mailgun's domain dashboard and found no anomaly in the send counts or bounce rates. Two days later, a user support ticket arrived: "your password reset email went to my spam folder." Then another. Then three in one day. The team checked Gmail Postmaster Tools — a free Google service that shows domain-level reputation metrics for email authenticated against your domain — and saw that the domain's IP reputation had shifted from "High" to "Low" in Google's classification system.

A support ticket to Mailgun revealed the root cause. The team's account was on Mailgun's shared sending pool — the default configuration for new accounts. Another sender on the same shared pool had generated a spike of spam complaints from Gmail users. Gmail's FBL (Feedback Loop) had flagged the pool's IPs, and the reputation impact propagated to every sender on those IPs, including the product's transactional emails. The sending pool's IP reputation score had degraded, and the team's password reset and activation emails were being routed to spam folders by Gmail's spam filter as a consequence — not because of anything the team had done.

The fix was upgrading to Mailgun's dedicated IP tier at $35 per month additional cost, then completing a 30-day IP warming sequence before routing full send volume through the new IP: starting at 200 emails on day one and raising the daily cap in steps until the new IP carried the actual daily send volume of approximately 800 emails. The warm-up protocol was required because a brand-new dedicated IP address has no reputation history with Gmail, Outlook, or Yahoo: receiving servers apply conservative filtering to unknown IPs until they have observed sufficient volume of consistent, complaint-free email to establish a positive reputation. The upgrade also required re-configuring the DKIM record to the new sending domain assigned to the dedicated IP pool. The total engineering time was two days spread across three weeks of warm-up monitoring. The transactional open rate returned to 38% over the six weeks following the dedicated IP activation. None of this work would have been necessary if the IP allocation model (shared versus dedicated, the risk profile of each, and the volume threshold that justifies dedicated IPs) had been in the decision record at setup time. See the pattern in decisions never written down: the setup session closes when the first email arrives in the inbox, not when the deliverability architecture has been considered.

The second incident was a provider migration. A 19-person developer-tools company had been using SendGrid for transactional email since 2022: API integration, template management, and Event Webhook for bounce events. In early 2024, the team decided to migrate to Postmark. The motivation was deliverability: Postmark's transactional inbox placement is consistently rated higher than SendGrid's in the developer community, and the team had observed that a subset of their password reset emails was being routed to Outlook's junk folder at a rate they attributed to SendGrid's shared IP pool. The migration was planned as a one-day engineering task: update the SMTP credentials in the application configuration, migrate the email templates from SendGrid's template system to Postmark's Handlebars format, update the webhook endpoint URL in Postmark's settings, and test with a handful of manual sends.

The migration went live on a Tuesday. On Friday of the same week, the team received a Google Postmaster Tools alert by email: their domain's spam rate had risen from 0.1% to 0.4%. That jump crossed a significant line: in February 2024, Google began enforcing a 0.3% spam rate threshold for bulk senders, defined as senders who send 5,000 or more messages per day to Gmail addresses. Above 0.3%, Google applies more aggressive spam filtering to the sending domain; sustained violations can result in Gmail rejecting all email from the domain. The team's spam rate had crossed this threshold three days after the provider migration.

The investigation took four hours. The team pulled the Postmaster Tools spam rate timeline, confirmed the spike started the day of the migration cutover, and pulled Postmark's activity log for that Tuesday. The activity log showed the weekly product digest email sent on Tuesday, the first marketing email sent from Postmark after the migration. Cross-referencing that batch's recipients against SendGrid's suppression list showed that 1,847 of them had previously sent an unsubscribe signal to SendGrid. Those addresses were in SendGrid's suppression list and had not received email from the product since their unsubscribe. When the migration moved the email send to Postmark, the suppression list was not imported. Postmark had no record of these addresses being suppressed, so the weekly digest email was delivered to all of them. Gmail users in that group who received an email they had unsubscribed from months earlier marked it as spam, producing the complaint rate spike. The GDPR and CAN-SPAM unsubscribe obligation is tied to the email address, not to the sending platform. A previously-unsubscribed address must not receive email regardless of which provider is used to send it. The suppression list is a compliance artifact that must be migrated with the same care as the application's database, not left behind in the old provider's dashboard.

The fix required three steps: first, exporting the full suppression list from SendGrid (a CSV download of all suppressed addresses); second, importing the CSV into Postmark's suppression list via the Postmark Suppressions API; third, waiting approximately 30 days for the Google domain reputation metric to recover as the complaint rate returned to baseline. The migration engineer who planned the one-day task had not known that a suppression list existed as a distinct artifact requiring migration. It was not in the decision record. It was not in the runbook. It was knowledge that lived in the head of the engineer who had set up the original SendGrid integration, and who had left the company eight months before the migration was planned. See the new CTO onboarding problem for the same structural pattern: the decisions made at setup time become invisible once they are working and stay invisible until a migration or incident reveals the gap.

The three structural properties that email provider selection determines

When teams choose an email provider, the evaluation focuses on API documentation quality, pricing, and whether a previous team member has used the provider before. These are the selection criteria that appear in the AI chat session that ends when the first email is delivered. The structural properties that determine whether the selection ages well — whether the deliverability model isolates the product from neighbor-sender reputation events, whether the suppression list is portable when the team eventually wants to switch providers, and whether bounce and complaint data flows reliably into the application before a compliance audit or a Gmail enforcement action — are set at selection time and are almost never revisited until an incident reveals them.

IP reputation model and deliverability isolation. Every email provider's sending infrastructure is organized around IP addresses — the addresses from which outbound email is delivered. Receiving mail servers (Gmail, Outlook, Yahoo, corporate mail servers running Exchange or Postfix) maintain reputation scores for sending IP addresses based on observed complaint rates, bounce rates, spam trap hits, and engagement metrics. An IP with a history of low complaint rates and high engagement (recipients opening and clicking) is classified as high-reputation and receives preferred inbox routing. An IP with a history of high complaint rates or spam trap hits is classified as low-reputation and is subject to spam folder routing, throttling, or outright rejection.

Shared IP pools aggregate multiple sending accounts onto a common set of IP addresses. The reputation of the pool reflects the aggregate behavior of all senders on the pool. A pool where 95% of senders are compliant and 5% are generating complaints will have a reputation that sits below what a fully compliant sender could achieve on a dedicated IP, because the pool's reputation is a blend of all senders' behavior. The severity of the neighbor effect depends on the provider's pool management practices: providers that monitor accounts for complaint rate spikes and remove or isolate bad senders quickly limit the contamination window. Providers with weaker monitoring allow a single bad actor to degrade the pool's reputation for days or weeks before taking action. Transactional email senders on shared pools are more vulnerable to the neighbor effect than marketing senders, because transactional email — password reset, invoice, security notification — has a low complaint rate baseline that is easy to degrade relative to marketing email, where users expect to see unwanted messages occasionally. A transactional sender whose complaint rate climbs from 0.05% to 0.15% because of a shared pool reputation event will see that increase reflected in inbox placement rates before it crosses any provider enforcement threshold.
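The blending described above can be made concrete with a small calculation. This is a minimal sketch that assumes a pool's reputation tracks a volume-weighted complaint rate, which is a simplification of how receiving servers actually score IPs; the function name and the sender figures are hypothetical.

```python
def pooled_complaint_rate(senders):
    """Volume-weighted complaint rate for a shared pool.

    senders: list of (messages_sent, complaint_rate) pairs, one per sender.
    Assumption: the pool's effective rate is total complaints / total sends.
    """
    total_sent = sum(sent for sent, _ in senders)
    if total_sent == 0:
        return 0.0
    total_complaints = sum(sent * rate for sent, rate in senders)
    return total_complaints / total_sent


# Hypothetical pool: 95 compliant senders at 0.05% and 5 senders at 1.5%,
# each sending the same volume.
pool = [(10_000, 0.0005)] * 95 + [(10_000, 0.015)] * 5
blended = pooled_complaint_rate(pool)
print(f"blended complaint rate: {blended:.4%}")
```

Under these assumed numbers the blended rate works out to roughly 0.12%, more than double what the compliant senders would show on their own, even though they make up 95% of the pool.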

Dedicated IPs eliminate the neighbor effect by isolating the sending account's reputation on an IP used exclusively by that account. The tradeoff is IP warming: a new dedicated IP has no reputation history, and email receiving servers apply conservative filtering until they have processed sufficient volume of consistent, complaint-free email from the IP to establish confidence in its reputation. Most providers require a warm-up schedule that takes 2–6 weeks depending on desired daily send volume. Postmark's pre-warmed dedicated IP model is the exception: Postmark manages a pool of dedicated IPs per account that it maintains in good standing, so accounts get the reputation isolation of dedicated IPs without the warm-up cost. This is the structural reason Postmark commands a higher per-message price than SendGrid or Mailgun. The build-versus-buy framing applies directly: SES with BYOI (Bring Your Own IP) gives full reputation control at the cost of owning the entire warm-up and management process; Postmark buys the warm-up and management for you at a premium per-message rate; SendGrid and Mailgun sit in between with shared pools as default and dedicated IPs available at volume thresholds.
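A warm-up schedule is easier to reason about when it is written down as data. The sketch below generates daily send caps that grow geometrically until they reach the target volume. The starting volume and growth factor are assumptions chosen for illustration; real schedules come from the provider's guidance and are usually slower than a daily doubling.

```python
def warmup_schedule(start_volume, target_volume, growth=2.0):
    """Return a list of daily send caps ending at target_volume.

    start_volume and growth are illustrative defaults, not provider guidance.
    """
    if start_volume <= 0 or target_volume <= 0:
        raise ValueError("volumes must be positive")
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1")
    caps = []
    cap = float(start_volume)
    while cap < target_volume:
        caps.append(int(cap))
        cap *= growth
    # The final day carries the full production volume.
    caps.append(target_volume)
    return caps


for day, cap in enumerate(warmup_schedule(200, 5000), start=1):
    print(f"day {day}: send at most {cap} emails")
```

Storing the resulting list in the decision record gives the next engineer the ramp that was actually used, not just the fact that a warm-up happened.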

Suppression list architecture and compliance portability. A suppression list is the set of email addresses that must not receive email from the sending account, regardless of the email category or the sending system used. Entries are added to the suppression list when: a recipient reports the email as spam (a complaint, also called a feedback loop report); a recipient unsubscribes from any email via the List-Unsubscribe header or a one-click unsubscribe link; or a hard bounce occurs (the recipient's address does not exist, the domain does not accept email, or the receiving server rejects delivery permanently). GDPR treats an unsubscribe request as a withdrawal of consent to receive marketing email — the obligation to honor it is permanent and does not expire. CAN-SPAM requires that unsubscribe requests be processed within 10 business days and that the address not be added back to any list without a new opt-in. Neither regulation provides an exception for provider migrations: an address that unsubscribed from your marketing email in 2022 when you were on Mailgun must not receive marketing email in 2025 when you are on Resend, even though Resend's system has no record of the prior unsubscribe.

The compliance portability problem is created by how providers store suppression data. SendGrid stores suppressions in five distinct lists (Global Unsubscribes, Group Unsubscribes, Bounces, Spam Reports, Blocks/Invalids) each with different API endpoints and different CSV export formats. Mailgun stores suppressions in three lists (Bounces, Unsubscribes, Complaints) accessible via REST API or CSV bulk export. Amazon SES delivers bounce and complaint notifications to an SNS topic — it does not maintain a suppression list for the sender; the sender is responsible for subscribing to the SNS topic, receiving the notifications, and storing the suppression data in their own database. Postmark provides a Suppressions API with GET and POST endpoints and stores suppressions per Message Stream. When migrating from one provider to another, the migration plan must include: exporting the complete suppression list from the source provider, normalizing the format to the destination provider's import requirements, and importing the full list before the first send from the new provider. A migration that is planned as "update the API credentials" and executed before this step fails compliance and risks the complaint rate spike that the 19-person company experienced.
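The export, normalize, and import sequence can be sketched as a merge of per-list CSV exports into one canonical mapping. This is an illustrative sketch only: the column names (`email`, `address`) and the reason labels are hypothetical, and each provider's real export format should be checked before relying on it.

```python
import csv
import io


def load_suppressions(csv_text, email_column, reason):
    """Read one exported list and return {normalized_email: reason}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {}
    for row in reader:
        email = (row.get(email_column) or "").strip().lower()
        if email:
            result[email] = reason
    return result


def merge_suppressions(*lists):
    """Merge several lists; the first reason seen for an address wins."""
    merged = {}
    for entries in lists:
        for email, reason in entries.items():
            merged.setdefault(email, reason)
    return merged


def filter_recipients(recipients, suppressed):
    """Drop any recipient that appears in the merged suppression mapping."""
    return [r for r in recipients if r.strip().lower() not in suppressed]
```

Normalizing case and whitespace before merging matters because the same address can appear with different capitalization in different exports, and a missed match is a send to someone who opted out.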

The suppression list portability problem points toward a more robust architecture for teams that anticipate provider switches: maintaining the canonical suppression list in the application's own database, synchronized from provider events in real time. When the application owns the suppression list, a provider migration requires only re-registering the webhook endpoint with the new provider and seeding the new provider's suppression system from the application's database — not extracting and importing from the old provider's format. The data retention decision record has a relevant constraint here: suppression records must be retained indefinitely, not deleted. GDPR's right to erasure (Article 17) does not apply to suppression records because deleting them would cause the organization to re-send to an address that previously withdrew consent — the opposite of erasure compliance. A suppression record is retained to prevent a regulatory violation, not as a data asset. The retention policy for suppression data must explicitly exempt it from any automated data deletion pipeline.
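An application-owned suppression list can be very small. The following sketch uses SQLite to show the shape of the idea; the table name, columns, and function names are assumptions for illustration, and a production system would use the application's existing database and migration tooling.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS suppressions (
    email      TEXT PRIMARY KEY,
    reason     TEXT NOT NULL,   -- e.g. 'unsubscribe', 'bounce', 'complaint'
    source     TEXT NOT NULL,   -- which provider or endpoint reported it
    created_at TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def suppress(conn, email, reason, source):
    """Record a suppression; an existing record is never overwritten."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR IGNORE INTO suppressions VALUES (?, ?, ?, ?)",
        (email.strip().lower(), reason, source, now),
    )
    conn.commit()


def is_suppressed(conn, email):
    """Check to run before every send, whichever provider is in use."""
    row = conn.execute(
        "SELECT 1 FROM suppressions WHERE email = ?", (email.strip().lower(),)
    ).fetchone()
    return row is not None


def export_for_seeding(conn):
    """Addresses to load into a new provider's suppression system."""
    return [r[0] for r in conn.execute("SELECT email FROM suppressions ORDER BY email")]
```

There is deliberately no delete function in this sketch, which mirrors the retention constraint above: any automated deletion pipeline would need to skip this table explicitly.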

Bounce and complaint webhook model. Bounce and complaint data is the feedback signal that tells the application which recipient addresses are non-deliverable and which recipients have flagged the email as unwanted. Hard bounces — permanent delivery failures where the recipient address does not exist, the domain does not accept email, or the receiving server has blocked the sending IP permanently — must trigger immediate suppression: the address must not receive any further email from the application. Continuing to send to hard-bounced addresses damages IP reputation (receiving servers classify repeated delivery attempts to known-invalid addresses as a spam signal) and wastes send volume. CAN-SPAM's 10-business-day unsubscribe processing deadline applies equally to complaint feedback loop signals: an address that submits a spam complaint must be suppressed. Soft bounces — temporary delivery failures caused by the recipient's mailbox being full, the receiving server being temporarily unavailable, or rate-limiting by the receiving domain — should be retried with exponential backoff and converted to hard-bounce suppressions after a configurable number of failed attempts.
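The hard versus soft bounce policy above can be expressed as a small state tracker. This is a sketch under stated assumptions: the retry limit of five and the five-minute base delay are placeholder values, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field

MAX_SOFT_BOUNCES = 5       # assumed limit before a soft bounce is treated as hard
BASE_DELAY_SECONDS = 300   # assumed first retry delay


@dataclass
class BounceTracker:
    soft_counts: dict = field(default_factory=dict)
    suppressed: set = field(default_factory=set)

    def record(self, email, kind):
        """Return seconds to wait before retrying, or None if now suppressed."""
        email = email.strip().lower()
        if kind in ("hard", "complaint"):
            # Permanent failures and complaints suppress immediately.
            self.suppressed.add(email)
            return None
        if kind != "soft":
            raise ValueError(f"unknown bounce kind: {kind}")
        count = self.soft_counts.get(email, 0) + 1
        self.soft_counts[email] = count
        if count >= MAX_SOFT_BOUNCES:
            # Repeated temporary failures are converted to a suppression.
            self.suppressed.add(email)
            return None
        # Exponential backoff: 300s, 600s, 1200s, ...
        return BASE_DELAY_SECONDS * 2 ** (count - 1)
```

Returning `None` for every suppressed outcome keeps the caller's logic simple: a number means schedule a retry, anything else means stop sending to that address.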

Email providers deliver these signals via webhooks: HTTP POST requests to an endpoint the application configures in the provider's dashboard. The webhook payload format differs by provider in ways that are not trivially compatible. Mailgun's bounce webhook sends a JSON object with an event-data.event field set to "failed", an event-data.severity field set to "permanent" or "temporary", and the recipient address in event-data.recipient. SendGrid's bounce webhook sends an array of event objects where each object has an event field set to "bounce" and the address in the email field. Postmark's bounce webhook sends a single JSON object with a Type field set to "HardBounce" and the address in the Email field. Amazon SES delivers bounce and complaint events via SNS in a nested JSON structure where the bounce type is in bounce.bounceType and the bounced addresses are in bounce.bouncedRecipients as an array. None of these are compatible at the JSON structure level. An application that processes bounce webhooks from Mailgun and is migrated to Postmark without updating the webhook handler will silently fail to process bounce events: no error, no alert, just missing suppressions. The observability strategy framing applies: a missing bounce suppression is a failure that produces no error and generates no alert. The symptom is discovered when IP reputation degrades over weeks of sending to invalid addresses, or when a compliance audit surfaces unprocessed unsubscribe requests. The webhook handler, the suppression action it triggers, and the monitoring that confirms bounce events are being processed are all part of the email infrastructure decision record, not implementation details left to the engineer who builds the first integration.
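One way to contain the format differences is a single normalizing function that fails loudly on an unexpected shape. The sketch below uses only the field names listed above, plus two assumptions that should be verified against each provider's current documentation: that each SES bouncedRecipients entry carries an emailAddress field and a bounceType of "Permanent" marks a hard bounce, and that the SNS envelope has already been unwrapped. Treating every SendGrid bounce event as hard is a simplification.

```python
def normalize_bounces(provider, payload):
    """Return a list of (email, kind) pairs, kind being 'hard' or 'soft'.

    Raises KeyError or ValueError on an unexpected payload shape so that a
    provider switch cannot silently drop bounce events.
    """
    if provider == "mailgun":
        data = payload["event-data"]
        if data["event"] != "failed":
            return []
        kind = "hard" if data["severity"] == "permanent" else "soft"
        return [(data["recipient"].lower(), kind)]
    if provider == "sendgrid":
        if not isinstance(payload, list):
            raise ValueError("SendGrid payload must be a list of events")
        return [(e["email"].lower(), "hard") for e in payload if e["event"] == "bounce"]
    if provider == "postmark":
        if payload["Type"] != "HardBounce":
            return []
        return [(payload["Email"].lower(), "hard")]
    if provider == "ses":
        bounce = payload["bounce"]
        kind = "hard" if bounce["bounceType"] == "Permanent" else "soft"
        return [(r["emailAddress"].lower(), kind) for r in bounce["bouncedRecipients"]]
    raise ValueError(f"unknown provider: {provider}")
```

Indexing with `payload["Type"]` rather than `payload.get("Type")` is a deliberate choice here: a Mailgun payload sent to the Postmark branch raises a `KeyError` that shows up in error tracking, where a lenient lookup would return an empty list and reproduce the silent failure described above.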

Provider-specific complaint rate thresholds are the enforcement mechanism the application must monitor against. Google's Gmail enforces a 0.3% spam rate policy for bulk senders (5,000+ daily emails to Gmail addresses) — above this threshold, Gmail increases spam folder routing, and sustained violations result in domain-level rejection. This threshold was established and publicly enforced beginning in February 2024 as part of Google's bulk sender requirements. Yahoo enforces a similar 0.3% threshold. Email providers themselves have independent enforcement thresholds: Mailgun suspends accounts with complaint rates above 0.08%; SendGrid's trust and compliance team contacts senders above 0.1%; Postmark terminates accounts for sustained transactional email complaint rates above 0.3%, but because Postmark's strict policy bars marketing senders, the baseline rate is much lower and the threshold matters less in practice. The parallel between email complaint rate thresholds and the payment dispute rate thresholds covered in the payment processor decision record is structural: both are provider-enforced compliance thresholds where crossing the threshold triggers escalating consequences, and both require proactive monitoring before the threshold is breached rather than reactive response after.
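Proactive monitoring against a threshold can start as a single comparison with an early-warning band. This sketch uses the 0.3% Gmail figure cited above as the default limit; the warning fraction of one half and the function name are assumptions, and the delivered and complaint counts would come from whatever reporting source the team trusts.

```python
GMAIL_SPAM_RATE_LIMIT = 0.003  # the 0.3% bulk sender threshold cited above


def complaint_status(delivered, complaints, limit=GMAIL_SPAM_RATE_LIMIT, warn_fraction=0.5):
    """Classify a complaint rate as 'ok', 'warn', 'breach', or 'no-data'.

    warn_fraction is an assumed early-warning band: alert once the rate
    reaches that fraction of the limit, before the limit itself is crossed.
    """
    if delivered <= 0:
        return "no-data"
    rate = complaints / delivered
    if rate >= limit:
        return "breach"
    if rate >= limit * warn_fraction:
        return "warn"
    return "ok"
```

With the numbers from the migration incident, a 0.1% rate reads as "ok" and a 0.4% rate reads as "breach"; the "warn" band exists so that an alert fires somewhere in between rather than after enforcement begins.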

Email provider options and their structural properties

Amazon SES is the correct default for teams already on AWS who send high volumes of transactional email and are willing to own the operational infrastructure around bounce handling, suppression management, and complaint monitoring. SES costs $0.10 per 1,000 emails, whether the application sends from an EC2 instance or Lambda function in the same AWS region or calls the SES API or SMTP interface from outside AWS. That is the lowest unit cost of any managed transactional email provider, by a factor of 5–10x compared to SendGrid, Postmark, or Resend at equivalent volume. SES routes through Amazon's IP infrastructure, which benefits from Amazon's own sending reputation: one of the highest baseline IP reputations available, because Amazon sends enormous volumes of AWS transactional email (account notifications, billing alerts, service announcements) from the same infrastructure. The operational model is the key distinction from other providers: SES is a raw sending API, not an opinionated email platform. SES does not maintain a suppression list on the sender's behalf; it delivers bounce and complaint notifications to an AWS SNS topic that the sending application must subscribe to, receive events from, and store in its own database. BYOI (Bring Your Own IP) for dedicated sending IPs requires dedicated IP groups configured in the SES console, with the team responsible for the warm-up sequence and ongoing reputation monitoring. SES does not provide built-in template management (the native SES template system is basic and lacks versioning), does not provide a dashboard for open rate and click rate analytics, and does not provide a webhook for open and click events, only for bounces, complaints, and delivery confirmations via SNS. Teams that choose SES and do not build the bounce and complaint processing pipeline, the suppression list database, and the complaint rate monitoring correctly will discover the gaps when deliverability degrades silently. The startup decision log framing: SES is the right infrastructure choice for a team at the stage where the engineering capacity to own these systems exists, not for a two-person team shipping its first transactional email whose priority is to deliver the feature rather than build the surrounding operational stack.

SendGrid is the correct default for teams that want a managed email platform with suppression list management, Event Webhook for bounce and complaint events, domain authentication wizard, and template management at moderate per-message cost. SendGrid's v3 API is well-documented, widely integrated, and has mature client libraries in every major language. The Event Webhook delivers bounce, complaint, delivery, open, and click events to an HTTP endpoint of the sending application's choice in a consistent JSON format (array of event objects). SendGrid's Suppression Manager UI allows manual management of bounce, complaint, and unsubscribe lists and provides CSV export and import for all suppression types. The Subuser system allows a single SendGrid account to create isolated sending contexts for different products or teams, with separate IP pools, separate suppression lists, and separate event webhooks per subuser, which is useful for multi-product companies that want reputation isolation between product lines. The tradeoff is the shared IP pool default at lower sending volumes: SendGrid offers dedicated IPs at higher plan tiers (Pro plan at $89.95/month and above), and below that threshold the account shares IPs with other SendGrid customers. SendGrid's customer base includes marketing email senders, transactional senders, and notification senders in the same shared pools, so the neighbor effect risk is real for transactional senders who require consistent inbox placement. The Twilio acquisition has added organizational complexity to SendGrid's roadmap and support operations without a clear product benefit for pure transactional sending teams. The API versioning framing applies: SendGrid has a notable API version history (v2 SMTP API, v3 Mail Send API, legacy Marketing Campaigns API, new Marketing Campaigns API) with different documentation quality per version. Applications built against SendGrid's v2 API in 2018–2020 may need migration work if they depend on features that are not available in v3.

Postmark is the correct choice when transactional email deliverability is a product-critical requirement. Postmark's strict acceptable use policy bars bulk and marketing email from using the Postmark infrastructure — only transactional email (triggered by user actions: account confirmation, password reset, invoice delivery, security notification, subscription confirmation) is permitted. This policy means Postmark's IP pools contain only transactional senders, and the pre-warmed dedicated IPs per account provide the reputation isolation of dedicated IPs from day one without the warm-up sequence. Postmark's inbox placement rates are consistently higher than SendGrid's and Mailgun's in independent deliverability benchmark tests, particularly for Gmail and Outlook. The Suppression API provides clean programmatic access to bounce and unsubscribe management per Message Stream. The tradeoff is the highest per-message cost in the managed provider category — approximately $1.25 per 1,000 emails compared to SendGrid's $0.90 per 1,000 at comparable volume tiers — and the strict usage terms that require pre-approval for any email category that could be interpreted as marketing. Teams that attempt to use Postmark for lifecycle email (weekly digest, re-engagement sequence, feature announcement) will receive account warnings and potential suspension. The correct architecture for teams that need both high-deliverability transactional email and marketing email is to use Postmark for transactional sends and a separate provider (SendGrid, Mailchimp, Klaviyo) for marketing sends, with the two streams configured on separate sending domains to maintain reputation isolation. This is the subdomain segmentation architecture described in the ADR sections below.

Resend is the correct choice for teams that value developer experience above all other criteria at early stage, and for teams using React Email for template development. Resend launched in 2022 with a developer-first API design that reflects the lessons of SES's verbosity and SendGrid's v2/v3 fragmentation: the API surface is small, the documentation is clear, the authentication is straightforward, and the webhook event schema is consistently structured. Resend uses Amazon SES as the underlying sending infrastructure, so the sending reputation benefits from Amazon's IP standing. The free tier (100 emails per day, 3,000 emails per month) makes evaluation zero-cost. The React Email integration is a genuine developer experience advantage: email template development in React JSX with hot-reload preview, type-checked props, and component composition is substantially more productive than managing HTML templates in a provider's web dashboard or maintaining Handlebars files. The tradeoffs: Resend's IP reputation track record is still accumulating. The service is newer than SES, SendGrid, and Postmark, and the aggregate sending reputation of the shared IP pool is not as established. Resend's suppression management and bounce handling are less mature than SendGrid's and less documented than Postmark's. Teams building email infrastructure that must be compliant and auditable should plan to validate that Resend's suppression export and bounce handling meet their requirements before committing to it as the sole provider. The structural pattern repeats: Resend is the correct choice when developer productivity at setup time is the primary criterion, and the operational maturity questions become relevant when the system scales or faces a compliance review.

AI chat session types and what each one misses

The email infrastructure setup follows a predictable pattern of AI chat sessions. The WhyChose extractor surfaces these sessions from chat export files, and the structural decisions they omit are consistent across the decision records reviewed. The setup session closes when the first email is delivered — not when the deliverability architecture, the suppression list ownership model, or the bounce handling pipeline has been considered.

The initial email setup session covers: choosing an email provider based on a quick cost comparison or previous experience; adding the API key or SMTP credentials to the application configuration; adding DKIM and SPF DNS records per the provider's documentation; sending a test email to a personal inbox; and confirming delivery. The session ends when the first email arrives. What the session does not cover: whether the account is configured on a shared IP pool or dedicated IPs, and what the risk profile of each is at the team's current send volume; whether bounce and complaint events will be delivered to the application and if so, by what mechanism; where the suppression list will be stored and who owns the canonical copy; whether the sending domain is the root domain (yourdomain.com) or a subdomain (mail.yourdomain.com or transactional.yourdomain.com), and what the implications of using the root domain for email reputation are if the domain is also used for the product's web presence; whether the List-Unsubscribe header will be present in outbound emails from day one (Google's bulk sender requirements mandate it for marketing email; its presence also signals compliance intent to receiving servers); and whether SPF, DKIM, and DMARC are all configured correctly and passing before the first production send (many setups configure SPF and DKIM but not DMARC, leaving the sending domain exposed to spoofing and failing one of Google's bulk sender requirements).
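The DMARC gap in particular is easy to check once the TXT record has been retrieved. The sketch below parses a record string and lists basic problems; it does not perform the DNS lookup itself, the function names are hypothetical, and treating a missing rua tag as a problem is a recommendation assumed here rather than a requirement of the DMARC specification.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, _, value = part.partition("=")
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_problems(record):
    """Return a list of human-readable problems; an empty list means none found."""
    tags = parse_dmarc(record)
    problems = []
    if tags.get("v") != "DMARC1":
        problems.append("missing or invalid v=DMARC1")
    if tags.get("p") not in ("none", "quarantine", "reject"):
        problems.append("missing or invalid policy (p=)")
    if "rua" not in tags:
        problems.append("no aggregate report address (rua=)")
    return problems


# Hypothetical record as it would appear at _dmarc.yourdomain.com
print(dmarc_problems("v=DMARC1; p=none; rua=mailto:dmarc@yourdomain.com"))
```

A check like this fits naturally into a pre-launch script alongside SPF and DKIM verification, so that "configured and passing" is something the setup session can demonstrate rather than assume.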

The email template design session covers: writing HTML email templates, adding variable substitution for personalized fields (recipient name, account-specific URLs), configuring templates in the provider's template management system, and previewing the output in email clients. What the session misses: whether the template includes a functional one-click unsubscribe link that routes to a suppression endpoint in the application's backend — not just a mailto: link, but an HTTP endpoint that confirms the suppression and acknowledges the request; whether the List-Unsubscribe and List-Unsubscribe-Post headers are included in the MIME headers of all outbound emails, as required by Google for bulk senders and strongly recommended for transactional senders as of 2024; whether the plain text alternative (multipart/alternative MIME part) is populated — missing a plain text version is a mild spam signal in Gmail's content scoring; and whether the template's unsubscribe flow adds the address to the application's canonical suppression list, not just to the provider's suppression list, so that the suppression survives a future provider migration.
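The header and plain text requirements above can be illustrated with Python's standard email library. This is a minimal sketch: the function name and the unsubscribe URL are hypothetical, and most provider SDKs expose their own way to set custom headers, so the exact mechanism will differ in practice.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, text_body, html_body, unsubscribe_url):
    """Build a multipart/alternative message with one-click unsubscribe headers."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # HTTPS endpoint in the application's backend, not only a mailto: link.
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    # Signals that the endpoint accepts a one-click POST.
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
    # Plain text part first, then the HTML alternative.
    msg.set_content(text_body)
    msg.add_alternative(html_body, subtype="html")
    return msg


msg = build_message(
    "digest@mail.yourdomain.com",
    "user@example.com",
    "Your weekly digest",
    "Plain text version of the digest.",
    "<p>HTML version of the digest.</p>",
    "https://yourdomain.com/unsubscribe?token=abc",  # hypothetical endpoint
)
```

The endpoint behind the URL is where the suppression should be written to the application's canonical list, so that the opt-out survives a later provider change.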

The deliverability debugging session covers: a specific symptom — emails going to spam, emails not being delivered — and a diagnosis of the immediate cause: DKIM failure, SPF failure, blacklisted IP, content flagging. The session produces a fix for the observed symptom. What the session misses: whether the symptom is a one-time event or evidence of a structural problem; specifically, whether the observed spam routing is caused by IP reputation (shared pool neighbor effect — observable in Google Postmaster Tools as IP reputation decline) versus domain reputation (caused by past high complaint rates from the sending domain — observable as domain reputation decline in Postmaster Tools) versus content scoring (triggered by specific template content patterns that receiving servers classify as spam-like) versus engagement decline (Gmail increasingly treats low open rate as a signal that recipients consider the email unwanted — a feedback signal that affects inbox placement before complaint rate crosses enforcement thresholds). The distinction between these root causes determines the correct fix: an IP reputation problem is fixed by dedicated IP isolation or provider migration; a domain reputation problem requires suppression list cleanup and reduced send volume to re-engage only high-engagement segments; a content problem is fixed in the template; an engagement problem requires list hygiene and re-permission campaigns. Diagnosing the wrong root cause and applying the wrong fix produces a session that closes when the symptom is temporarily reduced, with the underlying structural cause intact. The observability strategy framing: an email deliverability problem is not observable from application logs. Gmail Postmaster Tools domain reputation, provider dashboard bounce and complaint rate trends, and per-domain inbox placement tracking are separate monitoring systems that must be set up independently and checked on a regular cadence, not only when a user reports a problem.

The provider migration session covers: the business rationale for switching providers (cost, deliverability, features); updating the application's API credentials and SMTP configuration; migrating email templates to the new provider's format; and sending test emails. What the session misses: the full suppression list migration (exporting all bounce, complaint, and unsubscribe records from the source provider, normalizing to the destination provider's import format, and importing before the first production send); DNS record updates for DKIM at the new provider: because the new provider signs with a different DKIM key pair, a DKIM record carrying the new provider's public key must be published at that provider's selector before the first send, or emails will fail DKIM authentication at receiving servers; IP warm-up if the new provider uses new dedicated IPs: a migration to Postmark from a shared-pool provider, for example, will route sends through Postmark's pre-warmed IPs without a warm-up requirement, but a migration to SES with BYOIP (bring your own IP) requires configuring and warming new dedicated IPs before routing full volume; webhook handler updates for the new provider's event format: the application's bounce and complaint handler was written for the old provider's webhook JSON structure and will silently fail or misparse events from the new provider until updated; and a post-migration monitoring window of at least 30 days watching Google Postmaster Tools domain reputation and spam rate, provider complaint rate, and inbox placement. A migration that introduces an unintended suppression gap will show in the spam rate metric within 1–2 sends, and the faster the gap is detected, the shorter the recovery window.
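One way to keep the skipped steps from being skipped is a cutover gate that refuses to pass until every step is recorded as done. The sketch below is illustrative: the step names are hypothetical labels for the items listed above, not identifiers from any provider's tooling.

```python
# Hypothetical step labels that mirror the migration items described above.
MIGRATION_STEPS = [
    "suppression_list_imported",
    "dkim_record_published",
    "ip_warmup_planned_or_not_required",
    "templates_migrated",
    "webhook_handler_updated",
    "monitoring_window_scheduled",
]


def blocking_steps(completed):
    """Return the steps that still block the first production send, in order."""
    return [step for step in MIGRATION_STEPS if step not in completed]


def ready_for_cutover(completed):
    """True only when every migration step has been recorded as complete."""
    return not blocking_steps(completed)


# The "update credentials, migrate templates" plan from the incident.
credentials_only_plan = {"templates_migrated"}
missing = blocking_steps(credentials_only_plan)
```

Run against the credentials-and-templates plan, the gate lists the suppression import first among the missing steps.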

The consistent pattern across all four session types is the same one described in decisions never written down: the session closes when the immediate task works, not when the structural decisions have been made explicit. The first email arrives, the migration test passes, the spam folder symptom resolves — and the session is over. The suppression list ownership model, the IP warm-up schedule, the bounce handler format dependency, and the complaint rate monitoring cadence are invisible at the moment the success criterion is met. They become visible only when a deliverability event or a provider migration reveals the gap.

Five ADR sections for email infrastructure selection

An email infrastructure ADR that prevents the incidents described in this post covers five sections that teams consistently skip.

First, the provider selection with alternatives and rejection reasons, and volume-based re-evaluation triggers. The ADR records which provider was chosen, which alternatives were evaluated, the rejection reasons for each, and the specific conditions under which the selection should be re-evaluated. "Postmark chosen over SendGrid because: transactional deliverability is a product-critical requirement (password reset is a core user flow; inbox placement failure creates support tickets and lost activations); Postmark's pre-warmed dedicated IPs eliminate the shared pool neighbor effect risk that affects SendGrid's transactional senders; at current send volume (800 transactional emails per day), Postmark's higher per-message cost is material but not prohibitive ($24/month at 19,200 monthly sends, i.e. 800 per day across 24 sending days, versus $8/month at SendGrid pricing). SendGrid evaluated and rejected: shared IP pool default at current send volume; Twilio acquisition has created support quality regression per team reports. Amazon SES evaluated and rejected: operational overhead of owning bounce and complaint pipeline does not justify the cost saving at current volume; re-evaluate SES when monthly send volume exceeds 500,000 emails (at which point Postmark cost is $625/month versus SES cost at approximately $50/month, a $575/month differential that justifies the engineering investment in owning the operational stack). Re-evaluate the provider choice when: transactional email open rate on Gmail falls below 85% for two consecutive weeks; monthly send volume crosses 100,000 and provider cost exceeds $100/month; provider experiences two or more deliverability incidents in one quarter; or the team adds marketing email sends requiring a separate marketing email provider." Rejection reasons prevent the re-evaluation session from starting at zero, and the re-evaluation triggers make the switch criteria explicit rather than driven by whoever advocates loudest in a planning meeting.
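Re-evaluation triggers are easier to act on when they are written as a check rather than as prose. The following sketch encodes the four triggers from the sample ADR text; the ProviderMetrics fields are hypothetical names, and the numbers would have to be collected by hand or by a reporting job that is assumed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProviderMetrics:
    weekly_gmail_open_rates: List[float]  # fractions, most recent week last
    monthly_sends: int
    monthly_cost_usd: float
    incidents_this_quarter: int
    marketing_sends_planned: bool


def triggered_reasons(m: ProviderMetrics) -> List[str]:
    """Return the re-evaluation triggers from the sample ADR that currently fire."""
    reasons = []
    last_two = m.weekly_gmail_open_rates[-2:]
    if len(last_two) == 2 and all(rate < 0.85 for rate in last_two):
        reasons.append("gmail_open_rate_below_85pct_two_weeks")
    if m.monthly_sends > 100_000 and m.monthly_cost_usd > 100:
        reasons.append("volume_and_cost_threshold")
    if m.incidents_this_quarter >= 2:
        reasons.append("two_or_more_incidents_in_quarter")
    if m.marketing_sends_planned:
        reasons.append("marketing_email_added")
    return reasons


steady_state = ProviderMetrics([0.91, 0.90, 0.89], 19_200, 24.0, 0, False)
degraded = ProviderMetrics([0.90, 0.84, 0.80], 120_000, 150.0, 2, False)
```

An empty list means the original selection stands; a non-empty list is the agenda for the re-evaluation session.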

Second, the IP allocation model and reputation monitoring cadence. The ADR documents whether the account uses a shared IP pool or dedicated IPs, the minimum daily send volume threshold that justifies evaluating a dedicated IP, the warm-up schedule if dedicated IPs are in use or planned, and the monitoring systems that track IP and domain reputation. "Current configuration: Postmark pre-warmed dedicated IPs (no warm-up required). If migrating to SendGrid: shared pool acceptable below 500 daily sends; evaluate dedicated IP upgrade when daily transactional volume exceeds 1,000 consistently. If migrating to SES BYOIP (bring your own IP): warm-up schedule is 200/500/1,000/2,000/5,000/10,000/full per day for seven days before routing full production volume; warm-up sends should be to the most engaged segment (recent signups, most active accounts) to maximize positive engagement signals during the reputation-building period. Reputation monitoring: Google Postmaster Tools domain reputation checked weekly, with the domain reputation classification (High/Medium/Low/Bad) logged in the analytics notes; provider complaint rate checked monthly in the provider dashboard; alert fires in Slack #infra-alerts when Gmail Postmaster Tools domain reputation drops from High to Medium or below, or when provider-dashboard complaint rate exceeds 0.1% over any rolling 7-day window." The monitoring cadence is not a security posture requirement; it is the mechanism that catches a reputation degradation event before it reaches Gmail's 0.3% enforcement threshold or the provider's suspension threshold. The ADR Consequences section should state the specific complaint rate ceiling and the specific inbox placement degradation that the configuration is designed to prevent.
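The warm-up schedule and the alert rule in the sample ADR text translate directly into two small functions. This is a sketch that assumes the schedule and thresholds quoted above; the function names are hypothetical, and the reputation labels are assumed to be entered from a manual weekly check.

```python
# Daily caps for warm-up days 1-6 from the sample ADR; day 7 onward is full volume.
WARMUP_DAILY_CAPS = [200, 500, 1_000, 2_000, 5_000, 10_000]


def warmup_cap(day: int, full_volume: int) -> int:
    """Maximum sends allowed on a 1-indexed warm-up day."""
    if day < 1:
        raise ValueError("day is 1-indexed")
    if day <= len(WARMUP_DAILY_CAPS):
        # Never plan more than the real daily volume, even if the cap is higher.
        return min(WARMUP_DAILY_CAPS[day - 1], full_volume)
    return full_volume


def should_alert(previous_reputation: str, current_reputation: str,
                 complaint_rate_7d: float) -> bool:
    """Alert on a drop from High, or a 7-day complaint rate above 0.1%."""
    dropped_from_high = previous_reputation == "High" and current_reputation != "High"
    return dropped_from_high or complaint_rate_7d > 0.001


plan = [warmup_cap(day, full_volume=50_000) for day in range(1, 8)]
```

For a sender whose full volume is 50,000 per day, the plan reproduces the seven-day sequence from the ADR text, ending at full volume on day seven.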

Third, the suppression list architecture and migration procedure. The ADR documents who owns the canonical suppression list, how the canonical list is populated and queried, the export format at the current provider, and the migration procedure for future provider switches. "Canonical suppression list: owned by application database (table: email_suppressions, columns: address, suppression_type [bounce/complaint/unsubscribe], suppressed_at, source_provider, source_event_id). Synchronization: the application table is updated from provider bounce and complaint webhooks in real time via the bounce handler at /webhooks/email/{provider}. Provider-side suppression list: used as the primary send filter (provider's API enforces suppression on send); application database is the durable backup and migration artifact. When switching email providers: (1) export full suppression list from current provider (SendGrid: Suppressions API GET /v3/suppression/bounces + /unsubscribes + /spam_reports; Postmark: Suppressions API GET /suppressions per Message Stream); (2) import into new provider's suppression system before the first production send; (3) verify import count matches export count ±5%; (4) verify that a test send to a known-suppressed address is rejected by the new provider before routing production traffic. Retention policy: suppression records must be retained indefinitely; they are compliance artifacts exempt from the standard 2-year data deletion policy. See the data retention decision record for the policy that explicitly exempts suppression records from automated deletion pipelines." This section is the direct mitigation of the 1,847-address incident. Its absence from the decision record was the direct cause of the incident. Its presence is the artifact that makes the migration a routine procedure rather than a compliance near-miss.
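The table named in the sample ADR text, plus the count check from step (3), can be sketched with SQLite. The column names come from the text above; everything else (the function names, one row per address, lowercasing addresses before comparison) is an assumption made for the illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the canonical suppression table described in the sample ADR."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS email_suppressions (
            address          TEXT PRIMARY KEY,
            suppression_type TEXT NOT NULL
                CHECK (suppression_type IN ('bounce', 'complaint', 'unsubscribe')),
            suppressed_at    TEXT NOT NULL,
            source_provider  TEXT NOT NULL,
            source_event_id  TEXT
        )
        """
    )
    return conn


def suppress(conn, address, suppression_type, suppressed_at, source_provider,
             source_event_id=None):
    # Simplification: the first suppression recorded for an address wins.
    conn.execute(
        "INSERT OR IGNORE INTO email_suppressions VALUES (?, ?, ?, ?, ?)",
        (address.strip().lower(), suppression_type, suppressed_at,
         source_provider, source_event_id),
    )
    conn.commit()


def is_suppressed(conn, address):
    """Check the canonical list before handing a message to any provider."""
    row = conn.execute(
        "SELECT 1 FROM email_suppressions WHERE address = ?",
        (address.strip().lower(),),
    ).fetchone()
    return row is not None


def import_count_ok(exported, imported, tolerance=0.05):
    """Step (3): imported count must be within the tolerance of the export."""
    if exported == 0:
        return imported == 0
    return abs(imported - exported) / exported <= tolerance


conn = open_store()
suppress(conn, "Former.User@example.com", "unsubscribe",
         "2024-01-15T10:00:00Z", "mailgun", "evt_123")
```

Because the application checks its own table before sending, a suppression recorded under one provider still blocks sends after a switch to another.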

Fourth, the bounce and complaint webhook model with rate monitoring. The ADR documents the webhook endpoint, the event types it handles, the suppression action per event type, the retry policy for soft bounces, and the complaint rate thresholds with their escalation procedures. "Bounce webhook: POST /webhooks/email/bounces, authenticated via the provider's webhook verification mechanism (Mailgun: HMAC-SHA256 signature over the timestamp and token carried in the payload's signature object; SendGrid: ECDSA signature in the X-Twilio-Email-Event-Webhook-Signature and X-Twilio-Email-Event-Webhook-Timestamp headers; Postmark: HTTP basic auth credentials configured on the webhook URL). Hard bounce action: suppress address in application suppression table within 60 seconds of webhook receipt; do not retry delivery to hard-bounced addresses. Soft bounce action: increment soft-bounce counter for address; suppress after 3 soft bounces within 7 days; otherwise retry delivery on next scheduled send. Complaint action: suppress address within 60 seconds of complaint webhook receipt; same suppression table as bounces; set suppression_type = 'complaint'. Rate monitoring: complaint rate alert at 0.08% (7-day rolling average against total sends in the same window); escalation path: page on-call SRE, suspend the send campaign that generated the complaint spike, review complaint details in provider dashboard to identify sending pattern or segment that drove the increase. Google enforcement threshold: 0.3% spam rate (as measured by Google Postmaster Tools, not provider complaint rate; these differ because not all Gmail spam reports generate a complaint FBL event). Provider suspension threshold: Mailgun 0.08%; SendGrid 0.1%; Postmark 0.3% for transactional (stricter in practice). Bounce rate alert at 2% (7-day rolling); action: pause sends to the segment driving the bounce spike, run list hygiene query to identify stale or invalid addresses in the affected cohort."
The webhook handler format and the suppression action are the operational mechanisms that prevent the IP reputation degradation described in the first incident. The rate monitoring and escalation path are the mechanisms that prevent the Google enforcement action described in the second incident. Together they constitute the reactive infrastructure that makes the email sending system self-correcting rather than self-degrading over time.
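The suppression actions and the complaint rate alert from the sample ADR text can be sketched as a small in-memory policy. The event type names are hypothetical normalized labels, assumed to be produced by a provider-specific parser that is not shown, and the in-memory dictionaries stand in for the application's suppression table.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SOFT_BOUNCE_LIMIT = 3
SOFT_BOUNCE_WINDOW = timedelta(days=7)


class BouncePolicy:
    """Applies the per-event suppression actions from the sample ADR."""

    def __init__(self):
        self.suppressed = {}             # address -> suppression_type
        self._soft = defaultdict(list)   # address -> recent soft bounce times

    def handle(self, event_type, address, at):
        address = address.strip().lower()
        if event_type == "hard_bounce":
            self.suppressed.setdefault(address, "bounce")
            return "suppressed"
        if event_type == "complaint":
            self.suppressed[address] = "complaint"
            return "suppressed"
        if event_type == "soft_bounce":
            recent = [t for t in self._soft[address] if at - t < SOFT_BOUNCE_WINDOW]
            recent.append(at)
            self._soft[address] = recent
            if len(recent) >= SOFT_BOUNCE_LIMIT:
                self.suppressed.setdefault(address, "bounce")
                return "suppressed"
            return "retry"
        # Fail loudly: an unrecognized event must never be dropped silently.
        raise ValueError(f"unknown event type: {event_type}")


def complaint_alert(complaints_7d, sends_7d, threshold=0.0008):
    """True when the 7-day complaint rate reaches the 0.08% alert threshold."""
    if sends_7d == 0:
        return False
    return complaints_7d / sends_7d >= threshold


policy = BouncePolicy()
start = datetime(2024, 3, 1)
outcomes = [policy.handle("soft_bounce", "a@example.com", start + timedelta(days=d))
            for d in (0, 2, 5)]
```

Raising on an unknown event type is a deliberate choice here: after a provider switch, a handler that cannot parse the new format should produce an error, not a quiet gap in suppression.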

Fifth, the email sending categories and subdomain segmentation policy. The ADR documents which email categories exist, which sending domain each category uses, and the rationale for the segmentation. "Email categories and sending domains: transactional email (password reset, email confirmation, security alert, invoice delivery) — sending domain: mail.yourdomain.com; never used for marketing sends; DMARC policy: p=reject. Lifecycle and product email (activation nudge, feature announcement, onboarding sequence) — sending domain: updates.yourdomain.com; distinct from transactional domain to isolate reputation impact. Marketing and newsletter — sending domain: news.yourdomain.com; separate provider (SendGrid, Klaviyo, or Mailchimp) from transactional provider; opt-in required. Rationale for segmentation: a marketing campaign that generates an elevated complaint rate degrades the sending domain's reputation at Gmail. If transactional and marketing email share the same sending domain, the campaign's reputation impact propagates to password reset and invoice delivery — the highest-priority email categories in terms of product function. Domain segmentation contains the blast radius of any single category's reputation event to that category's subdomain only. The product's transactional domain is never exposed to marketing-driven complaint spikes. DMARC p=reject on the transactional subdomain prevents spoofing of the highest-trust email category." This policy is not derivable from the application's email configuration files. It represents a deliberate architecture decision that must be documented for any engineer who adds a new email sending context — an onboarding sequence, a promotional campaign, a system notification — to understand which domain to route it through and why. Without documentation, the default is to add all new sends to the existing sender, which progressively erodes the reputation isolation that the segmentation was designed to maintain. 
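The authentication strategy context applies here: SPF and DKIM are configured per sending domain, so each new sending subdomain needs its own SPF record and DKIM selector rather than inheriting them from the primary domain. DMARC is the partial exception: a subdomain without its own DMARC record falls back to the organizational domain's policy (the sp= tag if present, otherwise p=), so the ADR should state explicitly which policy each sending subdomain is meant to carry.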
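The segmentation policy can be made hard to bypass by routing every send through a lookup that has no default. The sketch below uses the three subdomains from the sample ADR text; the email kind names and the function name are hypothetical.

```python
# Sending subdomains from the sample ADR text.
SENDING_DOMAINS = {
    "transactional": "mail.yourdomain.com",
    "lifecycle": "updates.yourdomain.com",
    "marketing": "news.yourdomain.com",
}

# Hypothetical email kinds mapped to the category that owns them.
EMAIL_CATEGORIES = {
    "password_reset": "transactional",
    "email_confirmation": "transactional",
    "security_alert": "transactional",
    "invoice": "transactional",
    "activation_nudge": "lifecycle",
    "feature_announcement": "lifecycle",
    "newsletter": "marketing",
}


def sending_domain_for(email_kind):
    """Return the sending subdomain for an email kind; refuse unknown kinds."""
    try:
        category = EMAIL_CATEGORIES[email_kind]
    except KeyError:
        # No fallback domain: a new send must be categorized before it ships.
        raise ValueError(f"uncategorized email kind: {email_kind}") from None
    return SENDING_DOMAINS[category]


reset_domain = sending_domain_for("password_reset")
```

Leaving out a fallback is the point: an engineer adding a new send has to pick a category, which is the decision the segmentation policy exists to force.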

None of these five sections are in the application's SMTP configuration, the provider's API key settings, or the DNS records. They are the email infrastructure reasoning that every engineer who sends a new email category, migrates a provider, debugs a deliverability problem, or joins the team after the original setup depends on. The WhyChose extractor surfaces the email setup, template, and deliverability sessions from AI chat history; the ADR is what takes the reasoning from those sessions and makes it legible to the team inheriting the decisions. The incidents in this post — the shared pool reputation degradation and the suppression list migration gap — are not caused by engineering errors in the setup session. They are caused by structural decisions that were not made explicit at setup time and that therefore could not be applied during the event that revealed them. The ADR is the artifact that closes that gap.

FAQs

What is the difference between shared IP pools and dedicated IPs for email deliverability?

A shared IP pool is a set of IP addresses used by multiple sending accounts on the same email service provider platform. When you sign up for Mailgun, SendGrid, or Amazon SES without configuring a dedicated IP, your outbound emails are delivered from IPs that other customers also send from. Email receiving servers — Gmail, Outlook, Yahoo — evaluate the sending IP's reputation as one signal in their spam filtering decision. If another sender on the same shared pool generates a high complaint rate or hits spam traps, the pool's IP reputation degrades and your emails inherit that degradation. Your inbox placement falls even though your own sending practices are compliant. The severity depends on how aggressively the provider monitors and removes bad actors from the pool.

A dedicated IP is an IP address used exclusively by your account. Your deliverability reputation depends only on your own sending behavior, with no neighbor effect. The tradeoff is IP warming: a new dedicated IP has no reputation history with receiving servers. Most providers require a warm-up sequence (200 emails day one, 500 day two, 1,000 day three, then roughly doubling each day) before routing full production volume through the new IP, to build a positive reputation before sending at scale. Postmark's pre-warmed dedicated IPs per account are the exception: Postmark maintains the IP reputation on behalf of each account, so new accounts send from pre-warmed dedicated IPs without a warm-up sequence. This is the structural reason Postmark costs more per message than SendGrid or Mailgun. The decision between shared and dedicated IP is set at provider selection time and determines the deliverability ceiling the infrastructure can achieve. A team on a shared pool cannot eliminate the neighbor effect without upgrading to dedicated IPs or switching providers; neither is zero-cost. The email infrastructure ADR should document the IP allocation model and the daily send volume threshold that triggers re-evaluation of the dedicated IP upgrade.

Why does migrating email providers require more than just updating the API credentials?

A complete provider migration involves five distinct operations, each carrying a deliverability or compliance risk if skipped. First, DNS authentication records: every provider signs with its own DKIM key pair. A DKIM DNS record (a TXT record at a provider-specific selector subdomain like pm._domainkey.yourdomain.com) carrying the new provider's public key must be published before the first send, or emails will fail DKIM authentication and Gmail will increase spam filtering on the domain. Second, the suppression list: addresses that previously hard-bounced, complained, or unsubscribed must not receive email regardless of which provider sends it. GDPR unsubscribe obligations are permanent and provider-agnostic. The suppression list must be exported from the old provider and imported into the new provider before the first production send. Third, IP warm-up if the new provider introduces new dedicated IPs. Fourth, template migration: email template syntax differs by provider (SendGrid Handlebars dynamic templates, Postmark Mustachio templates, Resend React Email, SES templates), and templates must be migrated and tested before cutover. Fifth, webhook handler updates: the application's bounce and complaint handler was written for the old provider's JSON event structure, which will not match the new provider's structure. A handler that silently misparses bounce events produces no error: it simply fails to suppress bounced addresses, and the IP reputation degrades over weeks of sending to invalid addresses before the symptom becomes visible.

The 1,847-address incident in this post was caused specifically by skipping the suppression list migration. The migration was planned as "update credentials, migrate templates" — the suppression list was unknown to the engineer who planned it because it was never documented in the email infrastructure decision record. A migration checklist that covers all five steps is not a bureaucratic safeguard; it is the operationalization of the ADR. Teams that document the suppression list architecture (who owns the canonical copy, how to export it, where to import it) can execute a provider migration safely. Teams that do not will discover the gap when the complaint rate spikes.

When should a startup use Amazon SES vs SendGrid vs Postmark vs Resend?

The choice depends primarily on three factors: the team's capacity to own operational infrastructure, the send volume, and whether deliverability or developer experience is the primary constraint. Amazon SES is correct when the team is AWS-primary, send volume is high enough to justify the engineering investment in owning bounce handling and suppression management, and the cost difference is material. SES's $0.10/1,000 email cost versus Postmark's $1.25/1,000 saves $1,150/month at one million monthly sends. Below 100,000 monthly sends, the absolute dollar saving is under $115/month — rarely worth the operational overhead for a team that has other engineering priorities. SendGrid is correct when the team wants a full-featured email platform with suppression management, event webhooks, and template management, without the SES ownership cost, at a per-message cost that is below Postmark's. The risk is the shared IP pool default at lower sending volumes and the neighbor effect exposure that entails. Postmark is correct when transactional email deliverability is a product-critical requirement — specifically when password reset, account confirmation, or billing emails going to spam is a direct product experience failure — and the team is willing to pay the per-message premium for pre-warmed dedicated IP isolation. Resend is correct when developer productivity at setup time is the primary criterion: the API design is the cleanest of the four, the React Email integration eliminates template maintenance overhead, and the free tier reduces evaluation friction. The tradeoff is a less established reputation track record and less mature operational tooling for suppression and bounce management.

For most early-stage B2B SaaS teams sending under 50,000 monthly transactional emails, the correct default is Postmark for transactional (password reset, invoice, security alert) and SendGrid or Mailchimp for any marketing sends — using separate sending subdomains to maintain reputation isolation between the categories. The ADR should document the monthly volume threshold that triggers re-evaluation of Amazon SES — typically when Postmark cost exceeds $300/month (around 240,000 monthly sends), at which point the engineering investment in owning the operational stack becomes cost-justified.
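The cost figures quoted in this answer follow from the two per-thousand rates, and the arithmetic is short enough to check in code. This sketch assumes the rates stated above ($1.25 and $0.10 per 1,000 emails) as flat rates with no plan tiers or free allowances, which is a simplification; the function names are illustrative.

```python
# Flat per-1,000 rates quoted in the text, expressed in cents.
POSTMARK_CENTS_PER_1000 = 125
SES_CENTS_PER_1000 = 10


def monthly_cost_usd(sends, cents_per_1000):
    """Monthly cost in dollars at a flat per-1,000 rate."""
    return sends * cents_per_1000 / 1000 / 100


def monthly_saving_usd(sends):
    """Dollar saving from sending the same volume through SES."""
    return (monthly_cost_usd(sends, POSTMARK_CENTS_PER_1000)
            - monthly_cost_usd(sends, SES_CENTS_PER_1000))


def sends_at_budget(budget_usd, cents_per_1000):
    """Monthly volume at which the flat rate reaches a given dollar budget."""
    return budget_usd * 100 * 1000 // cents_per_1000


saving_at_one_million = monthly_saving_usd(1_000_000)
saving_at_100k = monthly_saving_usd(100_000)
volume_at_300_usd = sends_at_budget(300, POSTMARK_CENTS_PER_1000)
```

Under those flat-rate assumptions the results match the figures in the text: a $1,150 monthly saving at one million sends, $115 at 100,000 sends, and a $300 Postmark bill at 240,000 monthly sends.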

What should an email infrastructure ADR document that teams typically skip?

Teams typically document the provider name, the API key rotation schedule, and the DNS records (SPF, DKIM, DMARC). The sections that prevent the incidents in this post are: first, the IP allocation model — whether the account uses a shared pool or dedicated IPs, the daily volume threshold for a dedicated IP upgrade evaluation, and the warm-up schedule if dedicated IPs are in use; second, the suppression list architecture — where the canonical suppression list is stored (provider-managed vs application-owned), the export format and API access at the current provider, and the explicit migration procedure for future provider switches including the suppression list import step that must occur before the first production send; third, the bounce and complaint webhook model — the endpoint, the HMAC authentication method per provider, the suppression action per event type, the soft bounce retry count before hard-bounce conversion, and the complaint rate monitoring cadence with the specific thresholds (Google 0.3%, provider-specific) that trigger an escalation; fourth, the sending category and subdomain segmentation policy — which email categories (transactional, lifecycle, marketing) send from which subdomain and why, with the rationale that a marketing complaint rate spike must not affect the transactional subdomain's reputation; fifth, the re-evaluation triggers that make the provider switch criteria explicit and measurable rather than driven by whoever advocates for a different provider in a planning session.

None of these sections are visible in the application's configuration files or the provider's dashboard. They are the email infrastructure reasoning that every engineer who adds a new send, migrates a provider, or debugs a deliverability problem needs — and that is missing when the WhyChose extractor surfaces the email setup and template sessions from AI chat history without a corresponding ADR to explain why the suppression list is stored where it is, what the bounce handler expects, and what the migration procedure is when the team eventually wants to switch providers.