2026-06-26 · ~18 min read

The notification system decision record: why the channel architecture you chose determines your user engagement surface and your deliverability failure modes

The choice among in-app, email, push, SMS, and webhook delivery is usually made in the first AI session about notifying users ("how do I send users notifications about X?") and never documented as a deliberate channel architecture choice with the alternatives evaluated. APNs device tokens can expire when users reinstall the app or upgrade iOS; the 410 Gone response must be processed to remove stale tokens from the database, or the push budget is silently spent on addresses that receive nothing. Email deliverability depends on SPF, DKIM, and DMARC configuration decisions made before the first send, on the IP reputation of the sending infrastructure, and on a suppression list that is a legal compliance requirement under CAN-SPAM and GDPR, not an optional deliverability optimization. The notification channel decision determines what your users see, on which surfaces, when, and whether the message arrives or silently fails.

A 28-person SaaS company built an analytics platform with a weekly insights digest as the primary user engagement feature. The first engineer set up the email notification system in a weekend sprint following a ChatGPT session in which they asked "how do I send weekly digest emails to users with SendGrid in Node.js." The session provided a working integration: API key configuration, a templated HTML email, and a cron job that ran at 9 AM Monday. The first digest sent successfully. The unsubscribe link was not included — the engineer noted it as a follow-up ticket. The ticket remained open for four months.

Three months after launch, the weekly digest open rate was 3.2%. The team spent the next quarter planning a mobile app with push notifications to improve engagement. Meanwhile, the email system had four compounding problems that no one had investigated. First, the sending address was analytics-team@insights.saas.io, on a subdomain with no SPF record, no DKIM configuration, and no DMARC policy. Second, the email service provider account was on the shared IP pool, inheriting the reputation of every other sender on that pool; at 800 messages per week, the digest volume was insufficient to build dedicated IP reputation but large enough to be filtered whenever the shared pool's reputation dipped. Third, there was no bounce processing webhook: hard-bounced addresses stayed in the user table and continued to receive sends, which pushed the email service provider's internal bounce rate metric toward the threshold where the provider would throttle the account. Fourth, the unsubscribe link was still absent, so users who wanted to stop receiving the digest clicked "Report Spam" instead, the only signal available to them, and each click added a spam complaint event to the shared pool's reputation score.

The 3.2% open rate was the aggregate output of these four infrastructure failures, not a signal about subject line quality or content relevance. When a new growth engineer joined and investigated, they discovered that 61% of business email domains among the user base were filtering the digest to spam based on the subdomain's authentication record gaps and the shared pool's reputation. The engineer who had configured the integration had left the company six months earlier. There were no architecture notes for the email notification system — only the working code and the open unsubscribe ticket. When the growth engineer asked why the team had chosen a subdomain instead of the primary domain for sending, why DKIM had not been configured, and why a suppression list had not been integrated from the start, the answer was "that's how the original implementation worked and nobody changed it." The reasoning had never existed in written form; the initial implementation had been shaped entirely by the code examples in the ChatGPT session, which demonstrated a minimal working integration but did not cover deliverability infrastructure.

The CAN-SPAM compliance gap surfaced separately during due diligence for a Series A investment round. The data room review flagged that the weekly digest sent to 800 users had no unsubscribe mechanism, which violated CAN-SPAM's requirement for a clear and conspicuous opt-out mechanism in every commercial email, and GDPR's requirements for consent and the ability to withdraw it for users in the EU. The legal team estimated the maximum exposure from the four-month compliance gap. The unsubscribe link was added to the digest template that week. The suppression list integration, needed so that future unsubscribes were honored and no digest was sent to a user who had opted out, required changes to the user schema, the email sending pipeline, and the cron job logic. None of this had been identified as a requirement in the original notification implementation session, and none of it appeared in any documentation that subsequent engineers could have read before the due diligence discovery.
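
The suppression list integration described above can be sketched as a filter that runs between the recipient query and the send call. This is a minimal illustration, not the company's actual pipeline: the `Recipient` type, the in-memory `set` standing in for a suppression table, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipient:
    user_id: int
    email: str


def normalize(email: str) -> str:
    """Lowercase and trim so ' A@x.io ' and 'a@x.io' match the same entry."""
    return email.strip().lower()


def record_unsubscribe(suppressed: set[str], email: str) -> None:
    """Hypothetical hook for the unsubscribe endpoint and bounce/complaint webhooks."""
    suppressed.add(normalize(email))


def filter_suppressed(recipients: list[Recipient], suppressed: set[str]) -> list[Recipient]:
    """Return only recipients whose normalized address is not suppressed."""
    blocked = {normalize(address) for address in suppressed}
    return [r for r in recipients if normalize(r.email) not in blocked]
```

The point of the shape is that the cron job never sees a suppressed address: the filter sits in the pipeline itself, so every future email type inherits it without extra work.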

A 19-person B2B SaaS adopted APNs push notifications for mobile users when their iOS app reached version 2.0. The engineer who built the APNs integration configured the APNs connection, registered device tokens on app launch, stored them in a device_tokens database table with a user_id foreign key, and sent notifications by querying all tokens for a given user and posting to the APNs HTTP/2 gateway. The implementation worked correctly. Push notification click-through rates averaged 22% in the three months after launch — substantially higher than their email open rate. The integration engineer moved to a different company six months later.

Fourteen months after launch, a major iOS version was released. Users upgraded over the following three weeks. The iOS upgrade triggered widespread app reinstalls on devices with limited storage, and app reinstalls invalidate APNs device tokens — when an app is reinstalled, iOS issues a new device token for the new installation. Sending a push notification to the old token produces an HTTP 410 Gone response from the APNs gateway with a reason of Unregistered, indicating the token is no longer associated with any active installation. The application server was not processing 410 responses: it was catching the non-200 HTTP status, logging the error as a warning, and continuing to the next token in the send loop. The stale tokens remained in the device_tokens table. They continued to receive send attempts with each notification. Each attempt produced a 410 response. Each 410 was logged and ignored.

Over the three weeks following the iOS upgrade wave, the push notification click-through rate declined from 22% to 4%. The on-call engineer investigated the application error logs and found thousands of APNs 410 errors per day, but they were logged as warnings and no alert had been configured on the error type. Cross-referencing the token database against the APNs response log revealed that 78% of the stored device tokens were consistently producing 410 responses. The remaining 22% of tokens were still active and delivering successfully at a 19% click-through rate, nearly unchanged from the pre-iOS-upgrade baseline. The aggregate click-through rate of 4% was that 19% rate earned on only the 22% of tokens that still reached a device (0.22 × 0.19 ≈ 0.04); the 78% stale majority contributed send attempts but no deliveries. The fix was straightforward: a database query to delete all tokens that had produced 410 responses in the past 30 days, followed by a handler in the notification send loop to delete tokens immediately on 410 receipt. The fix took one day. The impact had been running for three weeks, through two on-call rotations, before the root cause was identified. The APNs token lifecycle documentation (that tokens expire on reinstall and that 410 responses must trigger immediate token deletion) was not linked from any architecture document. It had not appeared in the initial implementation's ChatGPT session, which covered the happy path of token registration and notification delivery without covering the token invalidation lifecycle.
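
The send-loop handler that the fix added can be sketched as follows. This is an illustration under stated assumptions rather than the team's code: `send` is a hypothetical transport callable that returns the HTTP status and reason from APNs, and `delete_token` is a hypothetical callback that removes a row from the device_tokens table.

```python
from typing import Callable, Iterable

# Hypothetical transport: takes a device token and payload, returns (status, reason).
SendFn = Callable[[str, dict], tuple[int, str]]


def send_to_user(
    tokens: Iterable[str],
    payload: dict,
    send: SendFn,
    delete_token: Callable[[str], None],
) -> dict:
    """Send to every token for one user; delete tokens reported as 410 Gone."""
    summary = {"delivered": 0, "removed": 0, "failed": 0}
    for token in list(tokens):
        status, _reason = send(token, payload)
        if status == 200:
            summary["delivered"] += 1
        elif status == 410:
            # 410 Gone: the token no longer maps to an active installation.
            delete_token(token)
            summary["removed"] += 1
        else:
            # Other statuses are counted and left for the retry path.
            summary["failed"] += 1
    return summary
```

Returning a summary instead of only logging makes the 410 count a number that a metric or alert can be attached to, which is the part the original warning-level log line lacked.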

The four structural properties the channel decision determines

When a team chooses notification channels in an early sprint AI session, they are making a decision with four structural properties that create maintenance surfaces as the product scales. Each property generates a failure mode that becomes visible only after the system is in production and the infrastructure has accumulated drift away from its initial state.

Delivery infrastructure ownership and deliverability requirements. Each notification channel has an infrastructure layer between the application and the user that introduces deliverability dependencies the application team must manage. For email, the deliverability infrastructure is the sending domain's authentication records (SPF, DKIM, DMARC), the sending IP reputation, the bounce and complaint processing webhook configuration, and the suppression list that honors opt-out signals across all email sends. These are not configuration details — they are architectural decisions about what the application owns versus what it delegates to the email service provider, and what the operational team is responsible for monitoring. For push notifications, the deliverability infrastructure is the APNs provider certificate or token authentication credentials (which expire on different schedules depending on the authentication method chosen), the FCM project credentials, the device token database and its stale token cleanup mechanism, and the token registration flow for handling app reinstalls. For SMS, the deliverability infrastructure is the phone number provisioning (toll-free number versus short code, where short code requires a six-to-eight-week carrier approval process), the STOP/HELP opt-out handling required by carrier regulations in the US and the TCPA, and the international number formatting compliance. For webhooks, the deliverability infrastructure is the HMAC-SHA256 signature on outgoing payloads (so the receiver can verify authenticity), the retry policy with exponential backoff and a dead letter queue for failed deliveries, and the idempotency key header that enables receivers to deduplicate webhook deliveries without requiring exactly-once delivery guarantees from the sender. Each channel's infrastructure must be designed before the first send. 
Retrofitting email authentication records after a reputation event, or adding APNs token lifecycle management after a stale token buildup, or implementing webhook signature verification after an unsigned delivery incident, requires coordination across the sending infrastructure, the database schema, and the client-side registration flow — work that is substantially more complex after the system has active users than before the first notification is sent.

Token and subscription lifecycle management. Push notifications and webhooks depend on client-issued tokens or endpoint registrations that expire, rotate, and become invalid through external events that the application server does not control and is not directly notified about. APNs device tokens can expire on app reinstall, major iOS upgrade, and permission revoke; the application learns about expiry through the 410 Gone response to a send attempt, not through a proactive notification from Apple. FCM registration tokens rotate on a similar lifecycle: tokens are refreshed by the Firebase SDK, and the application must update the stored token when the SDK issues a new one, typically via the onNewToken callback (the successor to the deprecated onTokenRefresh), which must be implemented in the app. A mobile app that does not implement the token refresh callback accumulates stale FCM tokens in the same way as an APNs implementation that does not handle 410 responses, except that the FCM staleness accumulates through SDK-initiated token rotation rather than through user actions. Webhook endpoint URLs go stale when a customer's server URL changes, when they move to a new subdomain, or when the endpoint handler is redeployed to a new URL path. The application must treat webhook delivery failures (repeated non-200 responses from the endpoint) as a signal to disable the webhook subscription and notify the customer to re-register a working endpoint, rather than continuing to attempt delivery to an endpoint that will never respond successfully. The lifecycle management for each channel's registration mechanism must be specified before the first registration is stored: what the token refresh flow is, how stale tokens are detected, what the cleanup policy is, and which monitoring alerts fire on high 410 or webhook failure rates before the stale token buildup grows large enough to materially degrade delivery metrics.
The decisions that were never written down pattern applies to token lifecycle management with particular force: the engineer who built the token registration flow knows that tokens expire and that 410 responses must be processed; the engineer who inherits the notification system six months later knows only that there is a device_tokens table and a notification send loop. The lifecycle behavior — what 410 means, when to run a cleanup job, what alert to configure on the error rate — is not derivable from the code without reading the APNs documentation. It belongs in the notification ADR.
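
The webhook half of the lifecycle, disabling a subscription after repeated failures, could look roughly like this. The threshold of five consecutive failures is an assumption chosen for illustration, treating any 2xx status as success is a simplification, and the `WebhookSubscription` type is hypothetical.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 5  # assumed threshold, not taken from the article


@dataclass
class WebhookSubscription:
    url: str
    consecutive_failures: int = 0
    enabled: bool = True


def record_delivery(sub: WebhookSubscription, status_code: int) -> bool:
    """Update failure state after one delivery attempt.

    Returns True when this attempt caused the subscription to be disabled,
    which is the moment to notify the customer to re-register an endpoint.
    """
    if 200 <= status_code < 300:
        sub.consecutive_failures = 0
        return False
    sub.consecutive_failures += 1
    if sub.enabled and sub.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        sub.enabled = False
        return True
    return False
```

Counting consecutive failures rather than total failures means a single success resets the counter, so an endpoint with occasional errors is not disabled while a dead endpoint is.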

Retry semantics and idempotency requirements. A notification that fails on the first delivery attempt (a temporary network error, a provider rate limit, a transient APNs gateway timeout) requires a retry. The retry semantics differ by channel and have consequences for the user experience if they are not specified correctly. For email, the email service provider handles retry after acceptance: once the application server has handed the message to the provider's SMTP gateway or API, the provider is responsible for retry and eventual delivery or bounce notification. The application-side retry concern is whether to re-send a digest email if the API call to the email service provider fails before acceptance, which requires idempotency at the API call level (using a message ID to prevent duplicate sends if the application server retries after a network timeout where the provider received the first request but the response was lost). For push notifications, APNs and FCM deliver on a best-effort basis and do not guarantee delivery: a success response means the provider accepted the notification, not that the device displayed it, and neither provider retries a request that the application failed to hand off. For high-priority notifications where delivery confirmation matters, the application must implement its own retry with the knowledge that a retried notification may be delivered as a new notification, visible to the user as a duplicate, unless the request includes an apns-collapse-id header (APNs) or a collapse_key (FCM) that instructs the platform to replace a previously delivered notification with the same key. For webhooks, the application is responsible for all retry logic: the retry policy must be specified with an initial backoff window, a backoff multiplier, a maximum backoff window, a maximum retry count, and a dead letter queue for webhooks that exhaust the retry budget without successful delivery.
The idempotency key — a stable identifier included in the webhook payload and as an HTTP header — enables the receiver to deduplicate webhook deliveries when the sender retries after a timeout where the first delivery succeeded but the acknowledgment was lost. Without an idempotency key, a webhook receiver that processed an event and returned a 200 response that was lost in transit will process the same event again when the sender retries, potentially triggering duplicate side effects. The retry and idempotency specification for each channel belongs in the notification ADR because it defines the interface contract between the notification system and both the delivery infrastructure and the webhook receivers. The background job infrastructure decision record intersects here: the notification retry queue is a background job that must be durable across server restarts, which requires a persistent job queue (not in-memory queuing) and a dead letter queue inspection mechanism for failed deliveries that exceed the retry budget.
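
The five retry policy parameters named above map directly onto a small data structure. The sketch below uses illustrative default values (30 seconds, doubling, a one-hour cap, eight retries) that are assumptions, not recommendations from the article, and it omits jitter for clarity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    initial_backoff_s: float = 30.0
    multiplier: float = 2.0
    max_backoff_s: float = 3600.0
    max_retries: int = 8


def backoff_schedule(policy: RetryPolicy) -> list[float]:
    """Delay before each retry, growing geometrically and capped at max_backoff_s."""
    return [
        min(policy.initial_backoff_s * policy.multiplier ** attempt, policy.max_backoff_s)
        for attempt in range(policy.max_retries)
    ]


def next_action(policy: RetryPolicy, retries_so_far: int) -> tuple[str, Optional[float]]:
    """Decide whether a failed delivery is retried or moved to the dead letter queue."""
    if retries_so_far >= policy.max_retries:
        return ("dead_letter", None)
    return ("retry", backoff_schedule(policy)[retries_so_far])
```

Publishing the output of `backoff_schedule` in the webhook documentation gives receivers the exact retry timeline, which is the contract the paragraph above asks the ADR to define.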

Frequency capping and user preference model. Every notification channel has a point at which increased send volume decreases user engagement rather than increasing it. Email open rates decline when weekly digest frequency increases to daily. Push notification permission revocation rates increase when push volume exceeds the user's tolerance threshold. SMS unsubscribe rates increase when message frequency exceeds the user's expectation at the time they provided their phone number. The frequency cap converts the system's event volume — which is driven by system activity, not by user engagement capacity — into a per-user delivery budget that prevents the notification system from training users to ignore or disable notifications by sending too many. The per-type preference model extends frequency control to per-user, per-channel, per-notification-type settings: users can disable one type of notification without disabling all notifications; they can receive email digests but not push notifications for the same events; they can opt into SMS for security alerts only. The preference storage schema must be designed before the first notification is sent, because adding per-type preferences retroactively requires migrating all existing users' implicit preferences into explicit rows without changing their experience. The decision about which notification types are transactional — mandatory sends that bypass user preferences because they are required by the product contract (password reset, billing failure, security alert, two-factor authentication code) — must be made explicitly, because a transactional exemption that is too broadly applied defeats the user's preference settings, and a transactional exemption that is too narrowly applied means security-critical notifications require user opt-in. 
The preference model is a product architecture decision with retention consequences: a notification system that sends every event to every channel for every user until individual users manually disable notifications trains the worst-engaged users to disable notifications entirely, removing the channel permanently for the moments when the notification is most needed.
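
One way to picture the preference model is a single decision function that every send passes through. The sketch below is illustrative only: the set of transactional types is copied from the examples in the text, while the storage shape, the function names, and the per-channel cap are assumptions.

```python
from dataclasses import dataclass, field

# Transactional types taken from the examples in the text; the exact set is a product decision.
TRANSACTIONAL = {"password_reset", "billing_failure", "security_alert", "two_factor_code"}


@dataclass
class UserNotificationState:
    # (channel, notification_type) -> enabled; a missing row falls back to the default.
    preferences: dict = field(default_factory=dict)
    # channel -> count of non-transactional sends in the current window.
    sent_in_window: dict = field(default_factory=dict)


def should_send(state: UserNotificationState, channel: str, notification_type: str,
                cap_per_window: int, default_enabled: bool = True) -> bool:
    """Apply the transactional exemption, then preferences, then the frequency cap."""
    if notification_type in TRANSACTIONAL:
        return True
    if not state.preferences.get((channel, notification_type), default_enabled):
        return False
    return state.sent_in_window.get(channel, 0) < cap_per_window


def record_send(state: UserNotificationState, channel: str, notification_type: str) -> None:
    """Count a send against the cap; transactional sends do not consume budget."""
    if notification_type not in TRANSACTIONAL:
        state.sent_in_window[channel] = state.sent_in_window.get(channel, 0) + 1
```

Keeping the transactional set as an explicit, named constant makes the exemption reviewable: widening or narrowing it is a visible change rather than a side effect of a new notification type.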

Channel options and their structural properties

In-app notifications. In-app notifications — a bell icon, notification center, or in-context alert — are displayed when the user is active in the application. Delivery is guaranteed for active users and zero for inactive users: there is no delivery infrastructure between the application and the user beyond the application server and the database query that populates the notification center. The deliverability requirement is the response time of the notification center query: if the notification center is populated by a full-table scan on the notifications table for each user's session, the query performance degrades as notification volume accumulates without pagination or archiving. The notification center state — which notifications are read, which are unread, how many are displayed before pagination — is stored in the database, and the schema design affects query cost per page load permanently. In-app notifications are the correct primary channel for events that require immediate action while the user is in the application (a teammate mentions the user in a comment, a background job completes, an error occurs in a workflow the user is watching) and the correct supplementary channel for events delivered via email or push (the notification center confirms delivery and allows the user to navigate to the relevant context from within the product). The failure modes for in-app notifications are primarily architectural rather than deliverability-related: a notification center that accumulates unactioned notifications without pagination or archiving becomes a dead letter queue of items the user will never process, reducing the signal value of new notifications; read state that does not synchronize across multiple browser tabs or devices shows stale unread counts; and a notification center that queries the full notification history on each page load without caching or indexed pagination creates database load proportional to notification accumulation per user. 
The real-time architecture decision record intersects with in-app notifications directly: delivering notification events to the in-app notification center in real time requires the same transport decision — WebSocket, SSE, or long-polling — with the same scaling model, reconnection policy, and proxy compatibility constraints documented there.
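
The indexed pagination mentioned above can be sketched with SQLite from the Python standard library. The table layout, index name, and function names are hypothetical; the technique shown is keyset pagination, where each page continues below the last id already displayed instead of using a growing OFFSET.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE notifications (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            body TEXT NOT NULL,
            read_at TEXT
        );
        -- Composite index so a page is an index range scan, not a table scan.
        CREATE INDEX idx_notifications_user_id_id ON notifications (user_id, id DESC);
        """
    )


def fetch_page(conn: sqlite3.Connection, user_id: int, page_size: int = 20,
               before_id=None) -> list:
    """Newest first; pass the smallest id of the previous page as before_id."""
    if before_id is None:
        cursor = conn.execute(
            "SELECT id, body, read_at FROM notifications "
            "WHERE user_id = ? ORDER BY id DESC LIMIT ?",
            (user_id, page_size),
        )
    else:
        cursor = conn.execute(
            "SELECT id, body, read_at FROM notifications "
            "WHERE user_id = ? AND id < ? ORDER BY id DESC LIMIT ?",
            (user_id, before_id, page_size),
        )
    return cursor.fetchall()
```

With this shape the cost of loading a page depends on the page size rather than on how many notifications the user has accumulated.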

Email notifications. Email has the broadest reach of any notification channel — every user has an email address — and the most complex deliverability infrastructure. The deliverability prerequisites are: SPF, DKIM, and DMARC records on the sending domain (authentication), a warm-up plan for dedicated sending IPs at volume above 10,000 messages per day (reputation building), bounce and complaint webhook processing (list hygiene that prevents ISP throttling), and a suppression list integrated across all email sends (compliance with CAN-SPAM and GDPR opt-out requirements). The choice between dedicated and shared IP infrastructure should be made at adoption time based on projected send volume: shared IP pools are appropriate for volumes below 10,000 messages per day where the cost of dedicated IP warm-up exceeds the deliverability improvement; dedicated IPs are appropriate for higher volumes where control over the sending IP's reputation is worth the warm-up investment. The email service provider choice — SES, SendGrid, Postmark, Mailgun — affects the deliverability tooling available: SES is the most cost-effective ($0.10 per thousand messages) but has minimal deliverability tooling; SendGrid and Postmark provide domain authentication setup guidance, bounce and complaint monitoring dashboards, and dedicated IP warm-up scheduling. The email infrastructure decision record covers the sending infrastructure in depth; the notification ADR's email section should reference that record and specify the email-specific decisions: which notification types send email, whether email is the primary or supplementary channel, the digest consolidation threshold (how many individual events to consolidate into a single digest email rather than sending each as a separate message), and the transactional email types that must bypass user preferences and the suppression list (password reset, billing failure, security alert). 
The unsubscribe mechanism — a one-click unsubscribe link in the email footer that writes to the suppression list, required by CAN-SPAM for all commercial email and by GDPR for all marketing email — must be present in the first email send, not added as a follow-up ticket.
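
As an illustration of shipping the unsubscribe mechanism with the first send, the sketch below builds a digest message with both a footer link and the standard list headers (List-Unsubscribe from RFC 2369 and List-Unsubscribe-Post from RFC 8058). The subject line, the function name, and the URL are placeholders; handing the message to a provider is out of scope here.

```python
from email.message import EmailMessage


def build_digest(to_addr: str, from_addr: str, unsubscribe_url: str, body: str) -> EmailMessage:
    """Build a digest email that carries its opt-out in headers and in the footer."""
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = "Your weekly insights digest"
    # RFC 2369: mail clients can surface this as a native unsubscribe control.
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    # RFC 8058: signals that a POST to the URL unsubscribes without further steps.
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
    msg.set_content(f"{body}\n\nUnsubscribe: {unsubscribe_url}\n")
    return msg
```

The endpoint behind `unsubscribe_url` is where the write to the suppression list happens; the headers only make that endpoint reachable from the mail client's own interface.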

Push notifications (APNs and FCM). Push notifications deliver to mobile devices when the application is in the background or not running, providing reach that in-app notifications cannot. Click-through rates for well-targeted push notifications are four to nine times higher than email open rates for time-sensitive events. The structural requirement is permission: iOS requires explicit user permission for push notifications, with an opt-in rate that typically ranges from 40% to 60% depending on when in the user journey the permission request is shown and how the value proposition is framed. Android historically defaulted to push notification permission granted, but Android 13 and later require explicit permission, reducing the difference between iOS and Android opt-in rate dynamics. The APNs integration decision — provider certificate authentication versus token-based authentication — determines the rotation schedule: provider certificates expire annually and must be renewed and redeployed before expiry; token-based authentication uses a signing key that does not expire, eliminating the annual certificate rotation risk. Token-based authentication is the correct choice for new integrations; certificate-based authentication is a legacy approach with a recurring operational risk. FCM (Firebase Cloud Messaging) provides a unified API for Android, iOS, and web push, reducing the integration surface at the cost of a Firebase dependency in the critical notification path. Direct APNs integration without FCM reduces third-party dependencies for iOS-specific notifications at the cost of maintaining two separate integrations (APNs and FCM) for cross-platform products. The token lifecycle management — processing 410 Gone responses for APNs and token refresh callbacks for FCM, running periodic cleanup jobs, and alerting on delivery rate declines greater than 20% week-over-week — must be specified in the notification ADR before the first token is stored. 
The secrets management decision record intersects here: the APNs private key (for token authentication) and the FCM service account credentials are high-value secrets that grant the ability to send push notifications to all of the application's users; their storage, rotation policy, and access control must be documented and managed with the same rigor as database credentials and payment processor API keys.
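
The delivery-rate alert mentioned above reduces to a small comparison between two weekly windows. This sketch reads "greater than 20% week-over-week" as a relative drop in the delivered-to-attempted ratio, which is an interpretation rather than something the text specifies; the function names are hypothetical.

```python
def delivery_rate(delivered: int, attempted: int) -> float:
    """Share of send attempts that the provider accepted for delivery."""
    return delivered / attempted if attempted else 0.0


def should_alert_on_decline(
    prev_delivered: int,
    prev_attempted: int,
    cur_delivered: int,
    cur_attempted: int,
    max_relative_drop: float = 0.20,
) -> bool:
    """True when this week's delivery rate fell by more than the allowed fraction."""
    previous = delivery_rate(prev_delivered, prev_attempted)
    current = delivery_rate(cur_delivered, cur_attempted)
    if previous == 0.0:
        return False  # no baseline to compare against
    return (previous - current) / previous > max_relative_drop
```

An alert on this ratio would have fired within the first week of a stale-token buildup, well before the click-through rate made the problem visible.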

SMS notifications. SMS provides near-universal delivery (no app required, no permission opt-in beyond the user's consent at number collection time), with an open rate of approximately 98% and messages typically read within three minutes of delivery. The cost per message is $0.0075 to $0.05 depending on the carrier and destination country, making SMS cost-effective for high-value notifications (two-factor authentication codes, billing alerts, security notifications) and expensive for high-volume engagement notifications (weekly activity digests sent via SMS would cost ten to fifty times the cost of equivalent email sends). The number provisioning decision (toll-free number versus short code versus long code) determines the carrier approval timeline and the deliverability characteristics: short codes (five or six digit numbers) have the highest throughput (100 messages per second), the highest carrier trust level, and a six-to-eight-week carrier approval process that must be initiated before launch; toll-free numbers have intermediate throughput and trust levels and a shorter approval period; long codes (standard ten-digit phone numbers) have carrier filtering applied for commercial messaging and are inappropriate for high-volume or promotional SMS sends. TCPA compliance in the United States requires explicit prior written consent for marketing SMS messages, restricts automated sends to the hours between 8 AM and 9 PM in the recipient's time zone, and requires STOP keyword opt-out handling: the carrier automatically processes STOP responses for short codes, but the application must update the suppression list when the carrier reports an opt-out. Twilio and AWS SNS both provide STOP/HELP opt-out handling, but the application must configure the suppression list webhook to process opt-out events and remove opted-out numbers from future send queries.
International SMS delivery varies by destination country: carrier filtering, regulatory requirements, and character encoding (GSM-7 for standard SMS, UCS-2 for messages containing non-GSM characters which reduces the 160-character limit to 70 characters) all affect deliverability and cost. The multi-tenancy decision record intersects with SMS compliance: a multi-tenant SaaS that sends SMS notifications on behalf of tenants may be the sending entity under TCPA, even if the tenant collected the user's phone number, creating compliance exposure that belongs in the notification ADR's compliance section rather than treated as a tenant configuration detail.
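
The sending-hours restriction described above can be enforced with a check that runs before each marketing SMS is queued. This is a minimal sketch: it assumes the recipient's time zone is already known and passed in as a `tzinfo`, and it conservatively treats exactly 9 PM as outside the window. The function name is hypothetical.

```python
from datetime import datetime, time, tzinfo

EARLIEST = time(8, 0)   # 8 AM in the recipient's local time
LATEST = time(21, 0)    # 9 PM in the recipient's local time


def within_sending_hours(now_utc: datetime, recipient_tz: tzinfo) -> bool:
    """True when a timezone-aware UTC instant falls inside the recipient's window."""
    local = now_utc.astimezone(recipient_tz).time()
    return EARLIEST <= local < LATEST
```

A send that fails this check would normally be deferred to the next permitted window rather than dropped, which is a scheduling decision that belongs with the retry queue.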

Webhook notifications. Webhooks are HTTP callbacks that deliver notification events to a developer or API user's endpoint in real time, enabling programmatic processing of events without polling. Webhooks are the correct channel for developer-facing products (API platforms, CI/CD integrations, data pipeline tools) where users need to trigger automated workflows on notification events. The delivery contract for webhooks — what HTTP status codes indicate success, what the retry policy is on failure, how long the receiver has to respond before the sender times out the request, and whether delivery is at-most-once or at-least-once — defines the interface between the notification system and all webhook receivers. A webhook system that does not publish its retry policy and idempotency key specification creates an ambiguous interface: receivers that experience a delivery where the sender retried after a timeout do not know whether the event was delivered once (the sender's first request succeeded but the response was lost) or whether a duplicate event was sent (the sender confirmed the first attempt failed and genuinely retried). HMAC-SHA256 signature on the webhook payload — using a per-endpoint signing secret that the customer configures and can rotate — allows the receiver to verify that the payload originated from the application server and was not modified in transit. Without signature verification, a webhook endpoint is an unauthenticated HTTP endpoint that accepts event payloads from anyone who can POST to its URL. The signing secret must be delivered to the customer through the application's dashboard at registration time and must be rotatable independently of the endpoint URL — if the signing secret is compromised, the customer must be able to rotate it without changing the endpoint URL, which would require re-registration in systems that have the URL hard-coded. 
The API rate limiting decision record intersects with webhook delivery: outbound webhook sends must be rate-limited per endpoint to prevent a burst of system events from exceeding the receiver's endpoint capacity, and the rate limit must be documented in the webhook contract so receivers can design their endpoint capacity accordingly.
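
The signing scheme described above can be sketched with the standard library's `hmac` module. The function names are hypothetical, and accepting a list of secrets on the verifying side is one assumed way to let an old and a new secret overlap during rotation.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secrets: list[bytes], body: bytes, received_signature: str) -> bool:
    """Accept any currently valid secret so a rotation can overlap old and new."""
    received = received_signature.encode("utf-8")
    return any(
        # compare_digest runs in constant time to avoid leaking match position.
        hmac.compare_digest(sign_payload(secret, body).encode("utf-8"), received)
        for secret in secrets
    )
```

The signature must be computed over the exact bytes sent on the wire; a receiver that parses and re-serializes the JSON before verifying will see mismatches even for authentic payloads.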

AI chat session types and what each one misses

The notification channel decision follows a consistent pattern in AI chat history. The founding session establishes the channel based on the immediate use case. A deliverability problem triggers an investigation that surfaces infrastructure gaps. A compliance review discovers opt-out handling gaps. A user complaint triggers a frequency capping discussion. Each session addresses a specific symptom without generalizing to the channel architecture that would prevent the next variant of the same failure. The WhyChose extractor surfaces these sessions because they contain decisions that belong in a notification system ADR and that are consistently left in conversational form — visible in the chat history but not accessible to the team member who inherits the notification system without context.

The "how do I notify users about X?" session. This is the founding session — the first time an engineer asks an AI assistant how to deliver a notification to users for a specific event type. The session covers: the channel options for the specific event (email, push, SMS), the library or service provider for the chosen channel, and a working implementation example. The session recommends the most obvious channel for the specific event — email for a weekly digest, push for a real-time alert, SMS for a two-factor code — and provides a minimal working integration. What the session misses: the engineer is solving the immediate notification problem, not designing a notification architecture. The session does not cover the full taxonomy of notification types the product will need to send — only the one type that prompted the session. It does not cover the deliverability infrastructure prerequisites — SPF, DKIM, DMARC, IP warm-up, suppression list — because the engineer's immediate goal is a working send, not a production-ready sending infrastructure. It does not cover the token lifecycle for push notifications, the STOP handling for SMS, or the retry and idempotency contract for webhooks. The channel choice from this session becomes the implicit architecture for all future notifications of similar types, even when subsequent notification types have different channel requirements. An ADR written after this session documents the channel choice, names the event types it covers, and explicitly addresses the deliverability prerequisites, lifecycle management requirements, and compliance obligations for each channel adopted — before the first production send at volume.

The "our email open rate is low" session. This session is triggered by a metric review. An engineer queries an AI assistant: "our weekly digest email open rate is 3%, what can we do to improve it?" The session covers subject line optimization, send time optimization, A/B testing frameworks, and content personalization. What the session misses: the engineer is treating the symptom (low open rate) as a content problem rather than a deliverability infrastructure problem. The session does not prompt the engineer to investigate whether the emails are reaching the inbox — whether the authentication records are configured correctly, whether the shared IP pool's reputation is depressed, whether the bounce rate has climbed to the threshold where ISP throttling is active. An email that is correctly delivered to the inbox and ignored because of a weak subject line has a different root cause than an email that is classified as spam before the user's inbox ever receives it. The correct investigation sequence is: verify that authentication records (SPF/DKIM/DMARC) are configured, check the bounce and complaint rates against the service provider's threshold documentation, review the inbox placement rate using the email service provider's inbox testing tool, and only then investigate content and subject line optimization. A session that recommends subject line A/B testing before the team has verified inbox placement is optimizing the wrong variable. The notification ADR's email deliverability section is the mechanism for ensuring the infrastructure investigation is completed at adoption time rather than discovered in a low-open-rate postmortem six months after launch.
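The investigation order described above can be written down as a short checklist that blocks content work until the infrastructure checks pass. The following is a minimal Python sketch: the `DeliverabilitySnapshot` fields and the three thresholds are illustrative assumptions, not values taken from any provider's documentation.

```python
from dataclasses import dataclass


@dataclass
class DeliverabilitySnapshot:
    # Hypothetical inputs, gathered by hand or from provider dashboards.
    auth_records_configured: bool  # SPF, DKIM, and DMARC all in place
    bounce_rate: float             # fraction of sends that hard-bounced
    complaint_rate: float          # fraction of sends reported as spam
    inbox_placement_rate: float    # fraction of test sends landing in the inbox


# Assumed thresholds; replace them with the limits your provider documents.
MAX_BOUNCE_RATE = 0.02
MAX_COMPLAINT_RATE = 0.001
MIN_INBOX_PLACEMENT = 0.80


def next_step(s: DeliverabilitySnapshot) -> str:
    """Return the first unresolved item, in the order the checks should run."""
    if not s.auth_records_configured:
        return "configure SPF, DKIM, and DMARC"
    if s.bounce_rate > MAX_BOUNCE_RATE or s.complaint_rate > MAX_COMPLAINT_RATE:
        return "fix bounce and complaint handling"
    if s.inbox_placement_rate < MIN_INBOX_PLACEMENT:
        return "investigate inbox placement"
    return "optimize subject lines and content"
```

Subject line testing is reachable only as the last branch, which is the point of the ordering: every earlier branch names a problem that content changes cannot fix.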

The "push notifications aren't working for some users" session. This session is triggered by user support tickets or a decline in push notification click-through rate. An engineer queries: "some users say they stopped receiving push notifications, what could cause this?" The session covers permission revocation (the user disabled push notifications in iOS Settings), token expiry on app reinstall (the most common cause), and APNs certificate expiry (for implementations using certificate authentication). The session recommends checking the APNs response for 410 errors and implementing a token cleanup mechanism. What the session misses: the engineer is investigating the problem for specific users rather than for the token database as a whole. After the iOS upgrade wave that triggered 78% token staleness, the problem is not a handful of users who reinstalled their apps. It is a systematic accumulation of stale tokens across a substantial fraction of the user base, one that had been building for weeks before the click-through rate metric surfaced the signal. The session-level investigation confirms the root cause and provides the fix, but it prompts the engineer only to add the 410 handler going forward, not to run a bulk cleanup of all tokens that have produced 410 responses in the past thirty days. The bulk cleanup, the periodic cleanup job, and the monitoring alert on delivery rate decline must be added as explicit action items, because none of them falls within the narrow scope of "fix the 410 handler." The notification ADR's push token lifecycle section, written at adoption time, would have included the 410 handler as a required component of the initial implementation rather than a fix discovered after the first iOS upgrade wave.
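The difference between the forward-looking 410 handler and the one-off bulk cleanup can be sketched in a few lines of Python. This is a hedged illustration rather than a production client: the token store is an in-memory set, the function names are hypothetical, and treating 429, 500, and 503 as retryable is an assumption of the sketch.

```python
from typing import Iterable, Set

# Status codes treated as transient here (an assumption of this sketch).
TRANSIENT_STATUSES = {429, 500, 503}


def handle_apns_response(tokens: Set[str], token: str, status: int) -> str:
    """Apply one APNs HTTP response to the in-memory token set."""
    if status == 200:
        return "accepted"
    if status == 410:
        # The token is no longer valid for this app: remove it immediately.
        tokens.discard(token)
        return "deleted"
    if status in TRANSIENT_STATUSES:
        return "retry"
    return "failed"


def bulk_cleanup(tokens: Set[str], tokens_seen_with_410: Iterable[str]) -> int:
    """One-off backfill: remove every token that already produced a 410."""
    before = len(tokens)
    tokens.difference_update(tokens_seen_with_410)
    return before - len(tokens)
```

The handler only protects future sends. The backfill needs a record of which tokens already returned 410, which is why logging those responses matters even before the handler exists.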

The "users are complaining about too many notifications" session. This session is triggered by user feedback or an increase in push notification permission revocation rate. An engineer queries: "users are complaining that we send too many notifications, what's the best way to handle this?" The session recommends adding a notification preferences settings screen with per-type on/off toggles and a global notification frequency control. What the session misses: the preference model design — what types to expose, what the default is for new types added later, which types are transactional and cannot be opted out of, and how preferences interact with frequency caps — is an architectural decision that affects the database schema, the notification send pipeline, and the user experience. A settings screen implementation built on top of an existing schema that has no per-type preference rows will require a database migration and a pipeline change to honor the new preferences. An implementation that adds per-type preferences without specifying the transactional exemption — which types must bypass preferences because they are required by the product contract — will accidentally allow users to opt out of password reset emails or security alerts. The frequency cap — the maximum number of non-transactional notifications per user per day or week — requires a queue or buffer in the notification pipeline, a priority ranking among pending notifications, and a digest consolidation mechanism for batching low-priority events into a single send. None of these are settings screen details; they are notification system architecture decisions that should be specified before the first notification is sent rather than designed under the constraint of an existing send pipeline that was not built to support them.

Five ADR sections for notification system architecture

A notification system ADR that prevents the deliverability infrastructure gap, the APNs token staleness buildup, the compliance exposure, and the preference model migration covers five sections that teams consistently omit from the initial notification implementation session.

First, channel taxonomy and routing logic. The ADR documents the full taxonomy of notification types the product sends, the primary and supplementary channel for each type, the routing logic that selects channels based on event context and user availability, and the transactional exemptions that bypass user preferences. The channel taxonomy forces the team to enumerate every notification type before the first implementation, identifying types that require different channels than the founding session's initial type — a product whose first notification was an email digest may also need push notifications for real-time alerts and webhooks for API integrations, each with their own deliverability infrastructure. The routing logic documents how the system selects channels for a user who has opted into multiple channels: does a real-time alert send to push and email simultaneously, or push first with email as a fallback after a delivery timeout, or email only if push fails? The transactional exemptions must be named explicitly: password reset, billing failure notification, two-factor authentication code, and security alert are typically transactional and bypass user preferences and suppression lists (except legal suppression — a user who has taken legal action to be removed from all communication cannot receive transactional email from the company). The routing logic and transactional exemptions belong in the ADR because they are not derivable from the implementation code without reading the decision history. The new CTO who inherits the product and asks "why do we send both email and push for the same event?" should be able to read the ADR's routing logic section rather than reconstructing the reasoning from the send pipeline code.
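One way to make the taxonomy and routing logic reviewable is to keep them as data. The Python sketch below assumes a "primary first, fallback second" ordering and three example notification types; the type names, the `Route` structure, and the `channels_for` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: Optional[str]
    transactional: bool  # transactional types bypass user preferences


# Illustrative taxonomy; a real ADR would enumerate every type the product sends.
ROUTES: Dict[str, Route] = {
    "weekly_digest": Route("email", None, transactional=False),
    "realtime_alert": Route("push", "email", transactional=False),
    "password_reset": Route("email", None, transactional=True),
}


def channels_for(event_type: str, opted_in: Set[str],
                 legally_suppressed: bool = False) -> List[str]:
    """Return the channels to try, in order, for one user and one event type."""
    if legally_suppressed:
        # Legal suppression overrides everything, including transactional sends.
        return []
    route = ROUTES[event_type]
    ordered = [c for c in (route.primary, route.fallback) if c is not None]
    if route.transactional:
        return ordered
    return [c for c in ordered if c in opted_in]
```

For example, `channels_for("realtime_alert", {"email"})` yields only the email fallback, while `channels_for("password_reset", set())` still yields email because the type is marked transactional.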

Second, email deliverability infrastructure. The ADR documents the sending domain and subdomain strategy, the SPF/DKIM/DMARC configuration status at launch, the email service provider and IP infrastructure (shared versus dedicated, warm-up plan if dedicated), the bounce and complaint webhook configuration (the endpoint that processes hard bounces and spam complaints into suppression list entries), the suppression list integration (the database table or service that stores opted-out addresses and is queried before every non-transactional send), and the compliance requirements met at launch (unsubscribe link in all non-transactional sends, GDPR consent model for EU users, CAN-SPAM content requirements). The email deliverability section is not a one-time checklist; it is a reference document for the operational team. When the bounce rate rises above the service provider's warning threshold, the ADR tells the operator which webhook to check, which suppression table to audit, and what the warm-up plan assumed about daily send volume. When the GDPR consent model is questioned in a due diligence review, the ADR documents when the model was implemented, what it covers, and what the team's legal review concluded. The security and compliance decision record intersects with email deliverability at the CAN-SPAM and GDPR boundary: the consent model for marketing email (explicit opt-in, double opt-in, or implicit consent by account creation) is a legal decision that the notification ADR should document with reference to the legal review that approved the chosen model, not a product decision made by the engineering team without legal context.
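The bounce and complaint webhook and the pre-send suppression check are two halves of one mechanism. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the table name, the event names, and the rule that a hard bounce blocks transactional sends too are assumptions made for illustration, and real provider payloads will differ.

```python
import sqlite3

# Event kinds that create a suppression entry (names are hypothetical).
SUPPRESSING_EVENTS = {"hard_bounce", "spam_complaint", "unsubscribe"}


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE suppressions (email TEXT PRIMARY KEY, reason TEXT NOT NULL)"
    )
    return db


def record_event(db: sqlite3.Connection, email: str, event: str) -> None:
    """Webhook side: turn a provider event into a suppression entry."""
    if event in SUPPRESSING_EVENTS:
        db.execute(
            "INSERT OR IGNORE INTO suppressions (email, reason) VALUES (?, ?)",
            (email.lower(), event),
        )


def may_send(db: sqlite3.Connection, email: str, transactional: bool) -> bool:
    """Send side: consult the suppression list before every send."""
    row = db.execute(
        "SELECT reason FROM suppressions WHERE email = ?", (email.lower(),)
    ).fetchone()
    if row is None:
        return True
    if row[0] == "hard_bounce":
        # Assumption: a dead address is never worth sending to again.
        return False
    # Opt-outs and complaints block only non-transactional mail in this sketch.
    return transactional
```

The first event recorded for an address wins because of `INSERT OR IGNORE`; a team that wants a later hard bounce to override an earlier unsubscribe would need an explicit precedence rule.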

Third, push token lifecycle management. The ADR documents the APNs authentication method (provider token authentication versus certificate authentication, with provider token authentication recommended), the token registration flow (the application registers the device token with the backend on each app launch, to handle iOS-initiated token rotation, rather than only on first launch), the 410 Gone response handler (which deletes the token from the device_tokens table immediately on receipt), the periodic stale token cleanup job (which deletes tokens that have not been re-registered by an app launch within a defined inactivity window, because the legacy APNs feedback service was retired along with the binary provider protocol and the HTTP/2 API reports invalid tokens only in the response to each send; the job runs at a frequency appropriate for the user base churn rate, weekly for stable user bases and daily for products with high reinstall rates), and the monitoring alert configuration (an alert that fires when the weekly push notification delivery success rate drops more than 20% below the prior week baseline, indicating a token staleness accumulation event rather than a content quality issue). The FCM integration section covers the token refresh callback registration in the app (onNewToken in the current Firebase Android SDK, which replaced the deprecated onTokenRefresh), the backend endpoint that receives token refresh events and updates the stored token for the user, and the FCM delivery failure handling (which mirrors the APNs 410 handling with FCM's equivalent error responses). The push token lifecycle section is operationally critical because the failure mode is invisible without the monitoring alert and the periodic cleanup job: a token database accumulates stale entries silently, degrading delivery rates across the user base without a specific user-visible error. Both must be included in the initial implementation; they cannot be added after the first stale token event without a cleanup backfill run that must be sized against the full existing token table.
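The monitoring alert reduces to a week-over-week comparison. The sketch below reads "more than 20% below the prior week baseline" as a relative drop, which is an interpretation; the function names are hypothetical, and the rate is computed from provider acceptances (HTTP 200), since that is the signal the sending server actually sees.

```python
def acceptance_rate(accepted: int, attempted: int) -> float:
    """Fraction of send attempts the push provider accepted."""
    return accepted / attempted if attempted else 0.0


def should_alert(this_week: float, last_week: float,
                 max_relative_drop: float = 0.20) -> bool:
    """Fire when this week's rate fell more than the allowed fraction below last week's."""
    if last_week <= 0:
        # No usable baseline yet, so there is nothing to compare against.
        return False
    return (last_week - this_week) / last_week > max_relative_drop
```

With a prior-week rate of 0.95, a fall to 0.70 is a relative drop of about 26% and fires the alert, while a fall to 0.90 is about 5% and does not.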

Fourth, frequency capping and user preference model. The ADR documents the frequency cap policy (the maximum number of non-transactional notifications per user per day and per week across all channels), the preference storage schema (user_id, notification_type, channel, enabled, updated_at, with one row per user per notification type per channel combination), the default preference for new notification types (opted-in by default with the user notified of the new type, or opted-out by default requiring the user to explicitly enable it), the transactional notification types that bypass the preference model entirely, the digest consolidation threshold (the number of pending events for a user within a time window that triggers batching into a digest rather than individual sends), and the time-zone-aware send window (the hours during which non-urgent notifications may be delivered, respecting the user's local time zone rather than the server's time zone). The preference model schema must be implemented before the first notification is sent because schema changes after launch require migrations that must correctly populate the preference rows for all existing users. The migration choice is a product decision with user experience consequences that the migration cannot determine from the user data alone. The options are to opt all existing users into all notification types (preserving their current experience of receiving all notifications), to opt all existing users out of all non-transactional types (requiring them to explicitly re-enable), or to inherit a binary global preference (if a user had previously disabled all notifications, their per-type preferences default to disabled). Documenting the migration policy in the ADR before the schema is built converts a future migration decision made under time pressure into a planned behavior specified at adoption time. The data retention decision record intersects with the notification preference model: notification history records (a log of every notification sent, to which user, on which channel, at what time) have a retention policy that balances user-facing audit trail value against storage cost growth. The ADR should specify the notification history retention period and whether notification history is user-accessible (allowing users to review past notifications) or internal-only (used for delivery debugging and compliance audit).
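The preference schema and the three migration policies can be sketched with Python's built-in `sqlite3`. The column names follow the schema named above; the table name, the policy identifiers, and the `backfill` function are hypothetical, and a production migration would run in batches against the real database rather than in memory.

```python
import sqlite3
from typing import Iterable, Sequence, Tuple

SCHEMA = """
CREATE TABLE notification_preferences (
    user_id           INTEGER NOT NULL,
    notification_type TEXT    NOT NULL,
    channel           TEXT    NOT NULL,
    enabled           INTEGER NOT NULL,
    updated_at        TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, notification_type, channel)
)
"""


def backfill(db: sqlite3.Connection,
             users: Iterable[Tuple[int, bool]],  # (user_id, old global setting)
             types: Sequence[str],
             channels: Sequence[str],
             policy: str) -> None:
    """Populate preference rows for existing users under one named policy."""
    for user_id, global_enabled in users:
        if policy == "opt_in_all":
            enabled = 1
        elif policy == "opt_out_all":
            enabled = 0
        elif policy == "inherit_global":
            enabled = 1 if global_enabled else 0
        else:
            raise ValueError(f"unknown migration policy: {policy}")
        for notification_type in types:
            for channel in channels:
                # OR IGNORE keeps any row the user has already set explicitly.
                db.execute(
                    "INSERT OR IGNORE INTO notification_preferences "
                    "(user_id, notification_type, channel, enabled) "
                    "VALUES (?, ?, ?, ?)",
                    (user_id, notification_type, channel, enabled),
                )
```

The policy is an explicit argument rather than something inferred from the data, which mirrors the point above: the code can apply a migration policy, but it cannot choose one.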

Fifth, retry policy and idempotency contract. The ADR documents the retry policy for each channel, the idempotency mechanism for each channel, and the dead letter queue handling for notifications that exhaust the retry budget. For the send-side email retry (retrying an API call to the email service provider that failed before the provider acknowledged receipt), the idempotency key is a stable message identifier generated at notification creation time that the email service provider uses to detect and deduplicate re-submitted messages. For push notifications, the retry policy covers the application-server-to-APNs retry (transient APNs gateway errors that warrant a retry versus permanent errors like 410 Gone that indicate the token is invalid and must not be retried), and the notification expiry time (the apns-expiration header, a UNIX timestamp after which APNs stops attempting delivery and discards the notification, appropriate for time-sensitive alerts that are irrelevant after the event has passed). For webhooks, the retry policy specifies the initial backoff window, the backoff multiplier, the maximum backoff window, the maximum retry count, the dead letter queue for exhausted deliveries, and the idempotency key header (Webhook-Id or X-Idempotency-Key) that enables receivers to deduplicate deliveries from retries following lost responses. The dead letter queue for webhooks (the set of deliveries that exhausted the retry budget without a successful receiver acknowledgment) must generate an alert to the customer (in-app notification, email, or both) that their webhook endpoint is not receiving deliveries, with a link to the webhook delivery log in the dashboard where they can inspect the failure responses and re-register a working endpoint. A dead letter queue that accumulates silently without alerting the customer produces a support ticket when the customer notices that their automated workflows have stopped triggering, and the ticket arrives without the diagnostic information needed to tell whether the endpoint is returning errors, timing out, or unreachable. The observability strategy decision record covers the monitoring instrumentation for the notification system. Per-channel send rate, delivery success rate, retry rate, dead letter queue depth, per-type user preference opt-in rate, and frequency cap trigger rate (the proportion of notification events that are suppressed by the frequency cap rather than delivered) are all operational metrics that must be instrumented at launch. Together they provide the visibility needed to diagnose deliverability problems, token staleness events, and frequency cap calibration issues before they accumulate into material user experience degradation. The queue and messaging decision record intersects with the notification retry infrastructure: a notification pipeline that requires durable retry across server restarts and dead letter queue inspection needs a persistent job queue, the same infrastructure as any other durable background task, not in-memory retry queuing that loses pending retries on server restart.
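The webhook retry parameters and the idempotency key fit together as follows. This Python sketch assumes a capped exponential backoff and uses the X-Idempotency-Key header named above; the function names and the delivery dictionary are hypothetical, and the receiver is modeled as an in-memory set purely to show the deduplication.

```python
import uuid
from typing import Dict, List, Set


def backoff_schedule(initial_s: float, multiplier: float,
                     max_s: float, max_retries: int) -> List[float]:
    """Delay before each retry: exponential growth, capped at max_s."""
    delays = []
    delay = initial_s
    for _ in range(max_retries):
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


def new_delivery(event_id: str) -> Dict[str, str]:
    """Create the key once; every retry of this delivery reuses it."""
    return {"event_id": event_id, "X-Idempotency-Key": str(uuid.uuid4())}


def receive(seen_keys: Set[str], headers: Dict[str, str]) -> bool:
    """Receiver side: process a delivery only the first time its key appears."""
    key = headers["X-Idempotency-Key"]
    if key in seen_keys:
        return False  # a retry after a lost response; already processed
    seen_keys.add(key)
    return True
```

With an initial window of 10 seconds, a multiplier of 2, a 60-second cap, and five retries, the schedule is 10, 20, 40, 60, 60 seconds. A delivery that is still failing after the last entry goes to the dead letter queue.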

None of these five sections appear in the "how do I notify users about X?" AI session that established the notification channel. The initial session covers the minimum viable integration — a working send from the application code to the channel provider. It does not cover the sending domain authentication requirements, the token expiry handling, the compliance opt-out mechanism, the per-type preference schema, or the retry idempotency contract. These are not advanced optimization concerns — they are the operational requirements of a notification system that continues to work correctly as users reinstall apps, change email providers, opt out of marketing communications, and receive notifications across multiple channels over the lifecycle of the product. The WhyChose extractor surfaces the founding session, the deliverability investigation, the token staleness incident, and the preference model discussion from AI chat history; the notification system ADR takes the channel choices buried in those sessions and converts them into a documented delivery infrastructure specification, token lifecycle management policy, compliance obligation record, and retry contract — written before the incidents that make those requirements visible at the worst possible moment.

FAQs

Why do APNs device tokens expire and how does failing to handle 410 Gone responses cause push notification failure to accumulate silently?

An APNs device token is issued by Apple to a specific installation of an app on a specific device. The token becomes invalid when the user uninstalls the app or reinstalls it (which creates a new installation and a new token). Revoking push notification permission in iOS Settings also stops notifications from appearing, but it does so on the device rather than by invalidating the token, so it does not by itself produce a 410. When an application server sends a push notification to an expired token, the APNs gateway returns an HTTP 410 Gone response with a reason of Unregistered, indicating the token should be removed from the application's database.

If the application server does not process 410 responses — logging the error and continuing rather than deleting the stale token — the stale token remains in the database and continues to receive send attempts. Each attempt produces another 410 that is again not processed. After an iOS major version release that triggers widespread app reinstalls, a database with no 410 handling can reach 70–80% stale tokens within three to four weeks. The click-through rate metric shows the aggregate effect: if 78% of tokens are stale and deliver nothing, the aggregate click-through rate is 22% of the real-user rate — the metric looks like a content quality problem when the root cause is database staleness.
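The arithmetic behind that figure is a single multiplication, shown here as a small Python sketch. It assumes stale tokens produce no clicks at all and that the real-user click-through rate is unchanged; the 10% rate in the example is an illustrative number, not one from the incident.

```python
def observed_ctr(real_user_ctr: float, stale_fraction: float) -> float:
    """Aggregate click-through rate when a fraction of tokens deliver nothing."""
    return real_user_ctr * (1.0 - stale_fraction)


# Example: a 10% click-through rate among reachable users, 78% stale tokens.
example = observed_ctr(0.10, 0.78)  # about 0.022, that is, 2.2%
```

A dashboard showing a fall from 10% to roughly 2.2% looks like a collapse in content quality, even though reachable users are clicking at the same rate as before.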

The correct implementation processes every APNs response code: 200 OK indicates acceptance for delivery; 410 Gone requires immediate token deletion. A periodic cleanup job that removes tokens not re-registered within a defined inactivity window provides a second path, covering tokens that are rarely sent to and so rarely get the chance to return a 410 (the legacy APNs feedback service that once filled this role was retired with the binary provider protocol). Both the real-time 410 handler and the periodic cleanup job are required components of the initial implementation; adding them after the first stale token wave requires a bulk cleanup backfill of the entire token table.

What email infrastructure decisions determine deliverability before the first send, and why do they matter more than subject line optimization?

Email deliverability is determined primarily by authentication record configuration (SPF, DKIM, DMARC), IP reputation, and suppression list hygiene, all established before the first send. SPF is a DNS TXT record listing the hosts authorized to send for the domain. DKIM is a cryptographic signature on each outgoing message, verified by the recipient's mail server against a public key published in DNS. DMARC is a DNS-published policy that tells recipient servers what to do when a message passes neither SPF nor DKIM in alignment with the visible From domain. A domain without all three configured is treated as low-trust by major inbox providers regardless of content quality.
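A DMARC record is a semicolon-separated list of tag=value pairs published as a TXT record at _dmarc.<domain>, so a launch checklist can read the policy with a few lines of code. The Python sketch below parses a record string that has already been fetched; the function names and the example.com addresses are placeholders, and fetching the record over DNS is deliberately left out.

```python
from typing import Dict, Optional


def parse_tags(record: str) -> Dict[str, str]:
    """Split a 'tag=value; tag=value' TXT record into a dictionary."""
    tags: Dict[str, str] = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_policy(record: str) -> Optional[str]:
    """Return the p= policy (none, quarantine, or reject), or None if not DMARC."""
    tags = parse_tags(record)
    if tags.get("v") != "DMARC1":
        return None
    return tags.get("p")


# Placeholder record for illustration only.
EXAMPLE = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"
```

Here `dmarc_policy(EXAMPLE)` returns "quarantine", and an SPF record passed by mistake returns None because its version tag is not DMARC1.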

IP reputation is the second factor: a fresh dedicated sending IP has no positive history, causing inbox providers to apply caution during the warm-up period. Sending high volumes from a new dedicated IP without a gradual warm-up schedule (starting low and increasing over four to eight weeks) causes deliverability problems that appear as low delivery rates and spam classification, not as content feedback. Suppression list management adds a legal compliance dimension: CAN-SPAM requires unsubscribe requests to be honored within ten business days; GDPR requires explicit consent and the ability to withdraw it. A suppression list that is not shared across every system that sends marketing email allows one system to keep mailing users who opted out through another, creating legal exposure that surfaces in compliance reviews rather than in deliverability metrics.

Subject line optimization is a secondary variable that matters only after the infrastructure delivers the email to the inbox. An email classified as spam before the user's inbox receives it cannot be rescued by subject line testing. The correct investigation sequence is: verify authentication records, check bounce and complaint rates, confirm inbox placement rates, and only then optimize content — the order in which these factors affect the open rate is the order in which they should be investigated and fixed.

What is notification frequency capping and why is the per-type preference model a product architecture decision, not a settings screen detail?

Notification frequency capping limits the maximum number of notifications delivered to a user within a time window, converting the system's event volume — driven by system activity, not by user engagement capacity — into a per-user delivery budget. Without a frequency cap, a product that generates many system events trains high-activity users to ignore or disable notifications by exceeding their tolerance threshold. The cap requires a priority queue for pending notifications, a selection mechanism for which events fit within the daily budget, and a digest consolidation path for batching low-priority events that exceed the cap into a single delivery.
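The priority queue, the budget selection, and the digest path can be sketched together with Python's `heapq`. The rule used here, that overflow is folded into one digest which itself consumes one slot of the budget, is an assumption of this sketch, as are the function name and the tuple layout.

```python
import heapq
from typing import List, Tuple


def plan_sends(pending: List[Tuple[int, str]],
               daily_cap: int) -> Tuple[List[str], List[str]]:
    """Split pending (priority, event_id) pairs into individual sends and a digest.

    Lower priority numbers are more important. If everything fits under the
    cap, nothing is digested; otherwise one slot is reserved for the digest.
    """
    if daily_cap < 1:
        raise ValueError("daily_cap must be at least 1")
    heap = list(pending)
    heapq.heapify(heap)
    slots = daily_cap if len(heap) <= daily_cap else daily_cap - 1
    individual = [heapq.heappop(heap)[1] for _ in range(min(slots, len(heap)))]
    digest = [heapq.heappop(heap)[1] for _ in range(len(heap))]
    return individual, digest
```

With four pending events and a cap of three, the two highest-priority events are sent individually and the other two are batched into a single digest, for three deliveries in total.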

The per-type preference model extends frequency control to per-user, per-channel, per-notification-type settings. The preference storage schema — one row per user per notification type per channel — must be designed before the first notification is sent because adding per-type preferences after launch requires a database migration that must correctly assign preferences to all existing users without silently changing their experience. The migration choice — opt all existing users into all types, opt them out, or inherit a prior global preference — is a product decision that the migration code cannot determine from the user data alone. It must be specified in the ADR before the schema is built.

The transactional exemptions — notification types that bypass user preferences because they are required by the product contract (password reset, security alert, billing failure) — must be named explicitly. An exemption too broadly applied defeats user preference settings; one too narrowly applied allows users to opt out of security-critical sends. These boundaries are not settings screen details; they are the architectural decisions that determine what the notification system will and will not allow a user to disable, with legal and security implications that belong in the notification ADR rather than left to the engineer implementing the settings screen.