Why does a message queue need an architecture decision record?

A message queue looks like infrastructure configuration — choose a broker, configure a connection string, publish a message, consume it in a worker. This framing hides the architectural decisions embedded in the messaging design. The delivery guarantee (at-most-once, at-least-once, exactly-once) is not a checkbox — it determines what your consumer code must do when a message arrives twice, which determines whether your consumer is safe to retry without a side effect check. The dead-letter strategy (no DLQ, a DLQ per queue, a shared DLQ with message metadata, automatic requeue with backoff) determines whether a malformed message that causes a consumer crash is isolated for inspection or silently lost or causes an infinite retry loop that blocks all subsequent messages. The consumer group model (competing consumers sharing a queue vs. independent consumers each receiving all messages vs. partitioned consumers maintaining per-partition ordering) determines whether adding a new subscriber to an event requires a code change in the existing consumers. The message schema evolution policy (no schema enforcement, JSON schema validation, Protobuf or Avro with a schema registry) determines whether a producer change can be deployed independently of a consumer change or requires a coordinated rollout. The ordering guarantee (no ordering, per-partition ordering, global ordering) determines whether your consumer can assume causality between messages or must handle messages that arrive out of the sequence they were produced. None of these decisions are visible in the code as design rationale — they appear as middleware configuration, worker bootstrap code, and message format choices scattered across the codebase. The queue ADR holds all of them together with the reasoning that justifies each choice, making it possible to evolve the messaging design safely as the system grows.

What should a queue and messaging architecture decision record include?

A queue and messaging ADR needs six sections. First, the queue selection and delivery model: the broker chosen (Kafka, RabbitMQ, SQS, Redis Streams, or other), the delivery guarantee (at-most-once, at-least-once, exactly-once with scope defined), and the rejection reasons for each alternative considered — not a comparison table but a specific statement of what was wrong with each alternative for this specific use case. Second, the dead-letter strategy: whether a dead-letter queue is configured, the maximum retry count before a message is routed to the DLQ, the backoff policy between retries (fixed delay, exponential backoff, backoff with jitter), whether retries are in-place (the original queue entry is retried) or re-queued (a copy is written to a retry queue), and the operational procedure for processing DLQ contents (is the DLQ monitored, what triggers review, is replay manual or automated). Third, the consumer group model: whether consumers are competing (one consumer in a group processes each message) or fan-out (each subscriber receives all messages), whether consumers maintain per-partition state, the consumer concurrency model (single-threaded consumer maintaining ordering vs. multi-threaded consumer with concurrent message processing that may process later messages before earlier ones), and the consumer failure model (what happens when one consumer in a group fails). Fourth, the message schema and evolution policy: the message format (JSON, Protobuf, Avro), whether schema validation occurs at produce time, consume time, or neither, whether a schema registry is in use, the schema compatibility mode configured (backward-compatible only, forward-compatible only, full compatibility, none), and the coordination procedure for schema changes (can producer and consumer be deployed independently or must they be deployed together). 
Fifth, the ordering guarantee and its consumers: whether ordering is guaranteed (no ordering, per-partition ordering, global ordering), what business operations require ordering (if payment.initiated must precede payment.completed in processing, ordering is a correctness requirement, not a performance preference), and the design decisions that preserve ordering under consumer concurrency and partition rebalancing. Sixth, the replay and retention policy: how long messages are retained in the queue (controls the replay window), whether event replay is a supported operational procedure (and how it is triggered), and what replay means for idempotency (replaying events that have already been processed requires that consumer idempotency checks remain valid for the retention period).

How do messaging architecture decisions appear in AI chat history?

Messaging decisions appear in AI chat history in four session types. First, the initial integration session: 'how do I add a background job queue to my Node.js app?', 'should I use RabbitMQ or Kafka?', 'how do I publish events when a payment is completed?', 'what is the difference between a queue and a topic?' These sessions contain the broker selection, the delivery model assumption, and the initial consumer design — all made under the framing of decoupling a synchronous operation without yet encountering the failure modes that the delivery semantics determine. The initial session rarely documents what happens when the consumer crashes, because at the time of the initial integration the consumer has never crashed. Second, the incident session: 'messages are being processed twice and it's creating duplicate orders', 'my consumer crashed and I lost 200 messages', 'one bad message is blocking all subsequent messages and I can't figure out how to skip it', 'the queue is filling up faster than the consumer is draining it'. These sessions document the failure mode that the initial delivery semantic choice produces in production, and the fix adopted under incident pressure — often a change that is correct for the immediate incident but incomplete as a general design. The at-least-once delivery semantic was chosen because it was the default; the idempotency requirement it imposes appears in chat only when duplicates become an incident. The dead-letter queue is designed in the incident session because it was not designed in the initial integration session. Third, the scaling session: 'how do I process messages faster?', 'can I have multiple consumers processing the same queue?', 'I need to add a new service that also needs to receive these payment events — do I need to change the queue?', 'some consumers need to receive all events and some need to receive only events for their region'. 
These sessions surface the consumer group model implications: adding a concurrent consumer changes the ordering guarantee (multiple consumers processing the same queue may process messages out of producer order); adding a new subscriber may require changing from a queue (competing consumers) to a topic (fan-out consumers), which is an architectural change, not a configuration change. Fourth, the audit or compliance session: 'we need to be able to replay events from the last three months for a new consumer we are adding', 'the auditor wants a complete event log of all payment state changes', 'GDPR requires that we be able to delete a user's data from all our systems including the event queue'. These sessions reveal whether the retention policy and replay capability were designed into the messaging architecture or assumed to be available when needed — and for systems where message retention is the only event log, the retention policy is a compliance artifact whose 90-day or 7-year horizon was never recorded as a decision.

2026-06-19 · ~20 min read

The queue and messaging decision record: why the message queue you chose determines your delivery guarantees and your dead-letter handling posture

Message queues look like infrastructure configuration — choose a broker, write a publisher, write a consumer, deploy a worker. This framing conceals the architectural decisions embedded in the choice: the delivery semantic that determines what your consumer must do when a message arrives twice; the dead-letter strategy that determines whether a malformed message is isolated for inspection or loops forever blocking subsequent processing; the schema evolution policy that determines whether producer and consumer can be deployed independently. Most teams encounter these decisions not when choosing the queue, but when something goes wrong in production and there is no decision record to explain why the system is behaving the way it is.

A payments team ships a new event-driven architecture. When a payment completes, a payment.completed event is published to a queue. Two consumers listen: a fulfillment service that ships the order, and a billing service that generates the invoice. The design is straightforward. The decoupling is real. The first week goes smoothly.

Three weeks after launch, the billing consumer starts crashing. The crash happens on a specific payment format — a recurring subscription charge where the currency field is null because the legacy billing system that generates recurring charges predates the currency-normalization pass that was added six months ago. The consumer restarts, re-picks the same message from the queue, crashes again, restarts again. It loops indefinitely.

Orders are being fulfilled correctly. Invoices are not being generated. The revenue leak is invisible: the monitoring dashboard shows "billing consumer status: running" because the container is up. It shows "billing consumer error rate: 0 per minute" because the consumer crashes before it reaches the error metric emission code. The only signal is in the raw log output, which nobody reads in real time.

The revenue leak runs for 72 hours before someone notices the accounts receivable balance is wrong.

The post-mortem surfaces the decision that was never made: there is no dead-letter queue. The message that caused the crash cannot be isolated and inspected — it is still in the main queue, being retried every 30 seconds. The only recovery options are: fix the consumer and deploy it (which stops future crashes but does not tell the team whether the malformed message has been processed zero times or 200 times), purge the queue (which loses all in-flight messages), or replay from the original payment system (which the payment system does not support, because event replay was never a requirement). The dead-letter handling design was an afterthought that was supposed to be added later. It was never added.

Like most foundational infrastructure decisions, the messaging design is visible as a working system and invisible as a set of choices. The at-least-once delivery semantic (the default for the broker they chose), the no-DLQ configuration (the default when no DLQ is explicitly configured), the consumer restart-on-crash behavior (the default in the container orchestration platform), the absence of an idempotency check in the consumer (nobody thought to add one because nobody thought the consumer would receive a message twice) — each was a default, adopted without documentation of why it was chosen or what it implied. When the first malformed message arrived, there was no decision record explaining what the intended behavior was and no design document describing what the recovery procedure should be.

Why messaging is an architectural decision, not infrastructure configuration

A message queue appears to be a plumbing choice: connect two services without coupling them. The plumbing framing hides the fact that every messaging integration embeds at least five distinct architectural decisions, each with consequences that are invisible until a specific failure mode triggers them.

The delivery guarantee is a programming contract for every consumer. At-least-once delivery guarantees that a message is delivered to a consumer at least once; the consumer may receive it more than once. This guarantee imposes a requirement on every consumer ever written for that queue: the consumer must be idempotent. Processing the same message twice must produce the same result as processing it once. This requirement is not in the queue configuration — it is in the consumer code, or it is missing from the consumer code, or it is partially implemented in some consumers and missing in others. An at-least-once queue with non-idempotent consumers is a system that works correctly under normal conditions and silently generates incorrect state when a consumer crashes after processing and before acknowledging a message. The delivery guarantee decision and the idempotency requirement it creates must both appear in the messaging ADR, or the idempotency requirement will be discovered in production by a duplicate processing incident.

The dead-letter strategy is a failure routing decision. When a consumer fails to process a message, the broker can redeliver the message indefinitely, route it to a dead-letter queue after N retries, or discard it. Each choice has a different failure mode: indefinite redelivery blocks subsequent messages if the queue has ordering guarantees; discarding loses messages that cannot be processed; a DLQ isolates problematic messages but requires an operational procedure for reviewing and replaying them. The dead-letter strategy is a failure-mode design: it specifies what happens to messages that cannot be processed, which is a correctness and reliability decision, not a configuration detail. Teams that defer this decision ("add a DLQ later, when we need it") discover that they needed it when the first unprocessable message arrives.

The consumer group model determines the fan-out architecture. A queue with competing consumers (multiple worker instances reading from one queue, each message delivered to exactly one worker) has different semantics than a topic with independent consumer groups (multiple services each subscribed independently, each receiving every message). Adding a second service that needs to receive the same events requires either adding a competing consumer to the existing queue (wrong, if the second service should receive all events, not just the ones the first service didn't process) or migrating from a queue model to a topic or pub/sub model (an architectural change, not a configuration change). The consumer group model at the time of the first integration determines the migration cost when the second subscriber is added — which is usually not considered when there is only one subscriber.

The schema evolution policy is the producer-consumer coupling contract. A message is an interface between a producer and a consumer. If the schema of that interface is not enforced and versioned, a producer change can break consumers without any warning at deploy time. Adding a required field to a message schema, removing a field that consumers depend on, renaming a field, or changing a field's type can break consumers in ways that are not detected until messages are consumed. A schema registry (Confluent Schema Registry or AWS Glue Schema Registry, both of which support Avro, Protobuf, and JSON Schema, or a custom JSON Schema validation layer) provides a compatibility gate at produce time: a producer whose schema change violates the configured compatibility mode is rejected before the message enters the queue. The decision to use schema enforcement is an infrastructure investment; the consequence of not using it is a producer-consumer coupling that is not visible in the code but appears as a runtime failure when the producer is deployed without coordinating with the consumer.

The ordering guarantee determines the consumer concurrency model. A queue with no ordering guarantee allows messages to be delivered to consumers in any order — a consumer processing a batch of messages may receive them out of the order they were produced. This is acceptable for idempotent, order-independent operations (sending a notification, incrementing a counter, updating a search index). It is not acceptable for state machine transitions (a payment.refunded event that arrives before the payment.completed event it references requires the consumer to handle an event whose precondition state does not yet exist). Per-partition ordering (Kafka's model: messages in the same partition are delivered in producer order, messages in different partitions are unordered) is a partial guarantee — it requires that related messages be routed to the same partition by a partition key. The partition key is a design decision: for payment events, the payment ID or the customer ID may be the correct partition key, depending on whether per-customer ordering or per-payment ordering is required. The ordering guarantee and the partition key strategy together form the consumer correctness model, and they belong in the messaging ADR because they determine what the consumer can safely assume about the sequencing of messages it receives.

Delivery semantics: what they actually guarantee and what they require

Delivery guarantees are one of the most misunderstood properties in distributed systems. "At-least-once" and "exactly-once" are claims about the broker's behavior, not about the end-to-end correctness of message processing. Understanding what each guarantee actually requires from the consumer is more important than knowing which guarantee the broker supports.

At-most-once delivery means a message is delivered zero or one times. The broker acknowledges the message before delivery, or the consumer acknowledges immediately upon receipt without waiting for processing to complete. If the consumer crashes after acknowledgement and before processing, the message is lost — the broker will not redeliver it. At-most-once is appropriate when message loss is acceptable: telemetry and metrics (a dropped data point does not affect the aggregate), best-effort notifications (a user does not receive a non-critical push notification), and real-time events where a stale retry is worse than no delivery (a live sports score update that is one minute old when retried). For business-critical operations (payments, order state changes, user account events), at-most-once delivery is not acceptable because message loss produces invisible incorrect state — the customer's order status is wrong in the database, and there is no mechanism for detecting or correcting it.

At-least-once delivery means a message is delivered to a consumer at least once and may be delivered more than once. The broker delivers the message and waits for an explicit acknowledgement before removing the message from the queue. If the consumer processes the message and crashes before sending the acknowledgement, the broker redelivers the message to the next available consumer. The consumer must be idempotent — processing the same message twice must be safe. At-least-once is the delivery semantic of most message brokers in their default configuration (RabbitMQ with manual acknowledgement, Kafka consumers that commit offsets after processing, AWS SQS standard queues). The key architectural requirement it creates is the idempotency design for every consumer. An idempotency implementation has several options: (1) a unique constraint on a processed_events table keyed by message ID — fast check, requires a write, correct under concurrent consumer processes because the unique constraint enforces mutual exclusion; (2) a Redis SET NX check on the message ID with a TTL set to the retention period of the queue — lower write latency than a database, allows the idempotency key to expire when the message is old enough that redelivery is no longer a risk; (3) a conditional database update that applies the event only if the current state matches the expected precondition state — correct when the event is a state machine transition but requires that the state machine be designed to express idempotency as a precondition check. The idempotency pattern chosen must be documented in the messaging ADR, because every consumer must implement it consistently — a consumer that does not implement idempotency in a system that guarantees at-least-once delivery is a correctness bug waiting for the first redelivery incident to trigger.
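Option (1) above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it uses an in-memory SQLite database, and the `invoices` table, the `handle_payment_completed` function, and the message fields are hypothetical names chosen for the example. The point it shows is that the deduplication marker and the side effect commit in the same transaction, so a redelivered message is detected by the unique constraint and skipped.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a dedup table and a side-effect table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE invoices (payment_id TEXT, amount_cents INTEGER)")
    conn.commit()
    return conn


def handle_payment_completed(conn: sqlite3.Connection, message_id: str,
                             payment_id: str, amount_cents: int) -> bool:
    """Apply the message once. Return False if it was already processed."""
    try:
        # `with conn` commits on success and rolls back on any exception,
        # so the dedup marker and the invoice are written atomically.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (message_id) VALUES (?)",
                (message_id,),
            )
            conn.execute(
                "INSERT INTO invoices (payment_id, amount_cents) VALUES (?, ?)",
                (payment_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a second insert: this is a redelivery.
        return False
```

Calling the handler twice with the same message ID produces one invoice row; the second call returns False and should be acknowledged like a success, so the broker stops redelivering it.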

Exactly-once delivery is the guarantee that each message is processed exactly once, with no duplicates and no loss. True exactly-once between two independent systems (a message broker and a database) requires a distributed transaction: the message acknowledgement and the state change must commit atomically. Most systems do not support this. Kafka with transactions (using the idempotent, transactional producer API, with downstream consumers reading only committed messages) provides exactly-once within a single Kafka cluster: a producer can write to multiple topics and commit the consumed offsets in a single atomic transaction. The guarantee holds only as long as every read and write involved goes through the Kafka transactional protocol; a side effect in an external system (a database write, an HTTP call) is outside the transaction. The transactional outbox pattern closes the gap between a database and a message queue without requiring the queue to support distributed transactions: the event is written to an outbox table in the same database transaction as the business state change, and a separate log-tailer (Debezium, a custom CDC process) reads the outbox table and publishes messages to the queue. The event is recorded exactly once because the outbox entry commits atomically with the state change; publication to the queue is still at-least-once (the CDC process may republish on failure), so the consumer must implement idempotency. Understanding the scope of the exactly-once guarantee and whether it is achieved via broker transactions or via the outbox pattern is a design decision that must be in the messaging ADR, because the implementation complexity, operational requirements, and failure modes are different in each case.
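The outbox pattern described above can be sketched as follows. This is a simplified illustration under stated assumptions: SQLite stands in for the application database, a polling function stands in for a CDC tool such as Debezium, and the table layout, `complete_payment`, `relay_once`, and the `publish` callback are hypothetical names. It shows the two halves of the pattern: the state change and the outbox row commit together, and the relay marks a row as published only after the publish call returns.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT UNIQUE,"
        " payload TEXT,"
        " published INTEGER DEFAULT 0)"
    )
    conn.commit()
    return conn


def complete_payment(conn: sqlite3.Connection, payment_id: str) -> None:
    """Write the business state and the event in one database transaction."""
    with conn:  # both inserts commit together or neither does
        conn.execute(
            "INSERT INTO payments (id, status) VALUES (?, 'completed')",
            (payment_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"payment.completed:{payment_id}",
             json.dumps({"type": "payment.completed", "payment_id": payment_id})),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish unpublished outbox rows in order. Return how many were found."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        # If the process dies after publish() but before the UPDATE commits,
        # the row is published again on the next run: at-least-once.
        publish(event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

The comment inside `relay_once` marks the window in which a duplicate publish can occur, which is why the consumer side still needs an idempotency check keyed on something like `event_id`.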

Dead-letter queues: the failure routing architecture

A dead-letter queue is the destination for messages that cannot be successfully processed after a configured number of attempts. The decision to use a DLQ, and the design of how messages reach it and are processed from it, is a failure routing architecture decision that is made once and applies to every consumer failure mode that will occur over the lifetime of the system.

The retry count and backoff policy determine how aggressively the broker retries a failing consumer before routing the message to the DLQ. A maximum of one delivery attempt (no retries) means any consumer exception immediately routes the message to the DLQ, which is useful when consumer failures are usually caused by bad message data rather than transient infrastructure issues, because retrying a bad message does not fix it. A retry count of 10 with exponential backoff means the broker will retry a message for several minutes before routing to the DLQ, which is useful when consumer failures are sometimes caused by transient infrastructure issues (a downstream database is briefly unavailable, a third-party API returns a 503) that will resolve before the retry attempts are exhausted. Exponential backoff with jitter prevents a synchronized retry surge when many consumers simultaneously fail and all retry at the same scheduled interval. The retry count and backoff policy must be calibrated to the consumer's failure modes: a retry count set for transient infrastructure failures will cause a bad message to cycle through the full backoff schedule before reaching the DLQ; a retry count set for bad message data will route transient failures to the DLQ before they would have self-resolved.
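As a rough sketch of the backoff calculation, the function below uses the "full jitter" variant: the delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, the cap, and the function names are illustrative assumptions, not values taken from any particular broker.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random) -> float:
    """Delay in seconds before retry number `attempt` (0-based), with full jitter.

    The ceiling doubles with each attempt and is clamped at `cap`; the actual
    delay is uniform in [0, ceiling] so simultaneous failures spread out.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def should_dead_letter(attempts_made: int, max_attempts: int) -> bool:
    """Route to the DLQ once the configured number of attempts is used up."""
    return attempts_made >= max_attempts
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests. The `max_attempts` value is the knob the paragraph above describes: low for consumers whose failures are mostly bad data, higher for consumers whose failures are mostly transient.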

The DLQ structure determines how easy it is to inspect and replay failed messages. A single DLQ per queue provides isolation: failed messages from different source queues do not mix, and the source queue context is preserved. A single shared DLQ for all queues simplifies operations (one place to check for failures) but makes it harder to distinguish failures from different sources. A DLQ with message metadata (the original queue name, the failure reason, the stack trace, the retry count, the timestamp of each retry attempt) enables rapid diagnosis of failure patterns without requiring access to consumer logs. The metadata schema for DLQ messages is a design decision that should be specified in the messaging ADR and enforced by the queue configuration — a DLQ message without failure metadata is harder to triage than a DLQ message with the full failure context.
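One way to pin down the DLQ metadata schema is to define it as a typed envelope that every consumer uses when dead-lettering a message. The sketch below is an assumption about what such an envelope could look like; the `DeadLetter` class, its field names, and `to_dead_letter` are hypothetical, and a real system would serialize the result into whatever headers or body format the broker supports.

```python
import traceback
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class DeadLetter:
    source_queue: str          # queue the message was originally consumed from
    message_id: str
    payload: str               # original message body, unchanged
    failure_reason: str        # exception type and message
    stack_trace: str
    retry_count: int
    attempt_timestamps: list   # ISO-8601 timestamp of each failed attempt
    dead_lettered_at: str


def to_dead_letter(source_queue: str, message_id: str, payload: str,
                   exc: BaseException, attempt_timestamps: list) -> DeadLetter:
    """Build the DLQ envelope from the final exception and the attempt history."""
    return DeadLetter(
        source_queue=source_queue,
        message_id=message_id,
        payload=payload,
        failure_reason=f"{type(exc).__name__}: {exc}",
        stack_trace="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        retry_count=len(attempt_timestamps),
        attempt_timestamps=list(attempt_timestamps),
        dead_lettered_at=datetime.now(timezone.utc).isoformat(),
    )
```

`asdict(envelope)` gives a plain dictionary that can be serialized to JSON, which keeps the triage information with the message rather than in consumer logs.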

The DLQ operational procedure is as important as the DLQ configuration. A DLQ that is not monitored provides no value: messages accumulate invisibly, the same way they would if there were no DLQ. A DLQ alert that fires when the message count exceeds a threshold provides the signal that a failure is occurring. A DLQ replay mechanism (a tool or procedure that re-publishes DLQ messages to the source queue after the underlying issue is fixed) closes the recovery loop; without replay, fixing the consumer bug still leaves the affected messages in the DLQ permanently. The DLQ procedure should answer: how are DLQ alerts routed (to the on-call rotation, to the owning team's Slack channel, as a critical alert or a warning), how are DLQ messages inspected (a CLI tool, an internal dashboard, direct broker query), what is the standard replay procedure, and whether replay is safe to repeat. A message may have been partially processed before the crash that sent it to the DLQ (some side effects applied, no acknowledgement sent); replaying it runs the consumer on it again, so the consumer's idempotency check must handle this case. The DLQ operational procedure belongs in the postmortem action items and the messaging ADR, not only in the runbook.

In-place retry versus re-queue retry is a structural decision with ordering implications. In-place retry (the same message is retried in its original position in the queue) preserves ordering for ordered queues but blocks subsequent messages if the failing message cannot be processed. Re-queue retry (a copy of the message is written to a retry queue after each failure, and the original is acknowledged to unblock subsequent messages) allows subsequent messages to be processed but breaks ordering guarantees for ordered queues. The choice between in-place and re-queue retry depends on whether ordering is required and whether blocking subsequent messages is acceptable — and both of those properties come from other decisions in the messaging ADR, which is why the dead-letter strategy cannot be designed in isolation from the delivery model and the ordering guarantee.
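The ordering difference between the two retry styles is easy to see in a small simulation. The sketch below is an in-memory model, not broker code: `run`, `flaky_handler`, and the single retry queue are simplifying assumptions. A message that fails transiently is either retried at the head of the queue or moved behind the remaining messages.

```python
from collections import deque


def flaky_handler(failures_for: dict):
    """Return a handler that fails the given number of times per message."""
    remaining = dict(failures_for)

    def handle(msg):
        if remaining.get(msg, 0) > 0:
            remaining[msg] -= 1
            raise RuntimeError(f"transient failure on {msg}")

    return handle


def run(messages, handler, max_attempts: int, requeue_on_failure: bool):
    """Process messages; return (processed order, dead-lettered messages)."""
    main = deque((m, 1) for m in messages)
    retry = deque()
    processed, dead = [], []
    while main or retry:
        queue = main if main else retry   # retry queue drains after main
        msg, attempt = queue.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if attempt >= max_attempts:
                dead.append(msg)                      # give up: route to DLQ
            elif requeue_on_failure:
                retry.append((msg, attempt + 1))      # later messages go first
            else:
                main.appendleft((msg, attempt + 1))   # head of line: blocks
    return processed, dead
```

With messages `a`, `b`, `c` and a handler that fails twice on `b`, in-place retry processes them as `a, b, c`, while re-queue retry processes them as `a, c, b`: the queue keeps moving, and producer order is lost.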

Consumer group models and fan-out architecture

The consumer group model is the architectural decision that determines how multiple services receive events from a shared messaging system. The choice made at the first integration shapes whether adding a second subscriber requires a configuration change or an architectural migration.

Competing consumers (a pool of worker instances reading from a single queue, each message delivered to exactly one worker) is the correct model for work distribution: a task queue where multiple workers process jobs in parallel, a webhook delivery system where each webhook is processed by one worker, a background job queue where the goal is throughput rather than fan-out. In a competing consumer model, adding a second service that needs to receive the same events requires that both services share a single queue (wrong, because each service would only receive a fraction of the events, not all of them) or that the producer publishes to two separate queues (one per subscriber, which requires modifying the producer when each new subscriber is added). The competing consumer model does not scale to fan-out use cases without architectural modification.

Publish-subscribe fan-out (a topic or exchange that routes every message to all subscribed consumer groups, each group receiving an independent copy) is the correct model for event distribution: a payment.completed event that needs to reach both the fulfillment service and the billing service, a user.created event that needs to reach the onboarding service, the analytics service, and the email service. In the pub/sub model, adding a new subscriber means creating a new subscription to the topic, with no changes to the producer and no changes to existing subscribers. The structural requirement of the pub/sub model is that the topic retains messages long enough for all subscribers to consume them: a subscriber that falls behind by more than the retention window loses the messages that have expired, and a new subscriber cannot receive messages that were no longer retained when it subscribed. Kafka's consumer group model (each consumer group maintains its own offset into a topic partition, consuming messages independently) is a pub/sub architecture with retention-bounded replay; RabbitMQ exchanges (fanout, direct, topic, headers) route messages to multiple bound queues. The choice between competing consumers and pub/sub fan-out determines the subscriber addition cost for the lifetime of the system.
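The difference between the two models comes down to who receives each message. The following in-memory sketch makes that concrete; the function names are hypothetical, and round-robin assignment is only one of several ways a broker might distribute work among competing consumers.

```python
from itertools import cycle


def competing_consumers(messages, workers):
    """Each message is delivered to exactly one worker (round-robin here)."""
    received = {w: [] for w in workers}
    for msg, worker in zip(messages, cycle(workers)):
        received[worker].append(msg)
    return received


def fan_out(messages, subscriptions):
    """Each subscription receives its own copy of every message."""
    return {s: list(messages) for s in subscriptions}
```

Putting the billing and fulfillment services behind `competing_consumers` would give each service only part of the event stream, which is the mistake described above; `fan_out` gives each of them the full stream, and adding a third subscription does not touch the other two.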

Consumer concurrency and per-partition ordering intersect in the Kafka model in a specific way that is a common source of production surprises. In Kafka, messages in the same partition are delivered in producer order. A consumer group with N consumers reading from a topic with M partitions assigns partitions to consumers — each partition is consumed by exactly one consumer in the group at any time. This means that ordering is guaranteed within a partition, not across partitions. If related messages (all events for a specific customer) must be processed in order, they must be routed to the same partition by a consistent partition key (the customer ID). The partition key choice is a correctness decision: if messages for the same customer are routed to different partitions, a customer.upgraded event and a customer.feature_granted event for the same customer may be processed by different consumer instances in different orders. The partition key strategy — what determines which messages are co-located in the same partition — belongs in the messaging ADR alongside the ordering guarantee it supports.
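A partition key strategy reduces to a deterministic function from key to partition number. The sketch below uses SHA-256 purely for illustration; it is not Kafka's default partitioner, which uses a different hash, and `partition_for` is a hypothetical name. What it demonstrates is the property that matters for ordering: the same key always lands in the same partition.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index, deterministically."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer and reduce modulo the
    # partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keying both customer.upgraded and customer.feature_granted by customer ID sends them to the same partition, so a single consumer sees them in producer order. Note that any modulo-based mapping assigns keys differently when the partition count changes, which is one reason the partition count and the key belong in the same decision record.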

Schema evolution: the invisible coupling

A message schema is an interface definition between a producer and all consumers subscribed to that queue or topic. When the schema changes, the compatibility of that change with existing consumers is the same kind of breaking change risk as a REST API change — but it is harder to detect because the schema is embedded in the message bytes, not in an API contract enforced at the HTTP layer.

Unstructured JSON without schema enforcement is the most common starting point and the most common source of producer-consumer coupling bugs. A producer adds a required field to a message without coordinating with consumers. A consumer reads a field that the producer has renamed or removed. A producer changes a field type (a string that was always a valid integer becomes an actual integer) and downstream consumers that do JSON string comparison fail silently. These failures are not detected at deploy time — they are detected when a message is consumed with the new schema by a consumer that expected the old schema, which may be minutes or hours after the producer deployment. The absence of schema enforcement is a design decision: it prioritizes producer deployment speed (no schema compatibility gate) at the cost of consumer compatibility risk. This trade-off belongs in the messaging ADR.

Schema registries with compatibility enforcement (Confluent Schema Registry for Avro, Protobuf, and JSON Schema; AWS Glue Schema Registry; a custom JSON Schema validation layer) enforce schema compatibility at produce time. Before a producer publishes with a new schema version, that version is checked against the versions already registered for the topic under the configured compatibility mode. Backward compatibility (the default in Confluent Schema Registry) means consumers using the new schema can read messages written with the old schema: optional fields (fields with defaults) may be added and fields may be removed, but a new required field may not be added and field types may not change incompatibly. A producer that attempts to register a schema that violates the configured mode is rejected, so the incompatibility is detected before the message enters the queue. Forward compatibility means consumers still using the old schema can read messages written with the new schema: fields may be added (consumers using the old schema will ignore them) and optional fields may be removed. Full compatibility requires both backward and forward compatibility: only optional fields may be added or removed. The compatibility mode chosen determines the constraints on producer and consumer evolution, and those constraints determine whether producer and consumer can be deployed independently (backward compatibility with consumers upgraded before producers, forward compatibility with producers upgraded first) or must be coordinated (no schema enforcement, requiring synchronized deployment).
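The compatibility modes can be made concrete with a toy checker. The sketch below is a deliberate simplification and not Avro's or Protobuf's actual resolution rules: a schema is modelled as a dictionary of field name to type and a `required` flag (meaning the reader has no default for it), and all function names are hypothetical. It takes backward compatibility to mean that a reader on the new schema can read data written with the old schema, which is how Confluent Schema Registry documents the term, and forward compatibility to mean the reverse.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if a consumer on `reader` can decode data written with `writer`."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False          # type changed under the reader
        elif spec["required"]:
            return False              # reader needs a field the data lacks
    return True                       # extra writer fields are ignored


def is_backward_compatible(old: dict, new: dict) -> bool:
    """New-schema consumers can read old-schema data."""
    return can_read(reader=new, writer=old)


def is_forward_compatible(old: dict, new: dict) -> bool:
    """Old-schema consumers can read new-schema data."""
    return can_read(reader=old, writer=new)


def is_fully_compatible(old: dict, new: dict) -> bool:
    return is_backward_compatible(old, new) and is_forward_compatible(old, new)


V1 = {"payment_id": {"type": "string", "required": True}}
V2_OPTIONAL = {**V1, "currency": {"type": "string", "required": False}}
V2_REQUIRED = {**V1, "currency": {"type": "string", "required": True}}
```

In this model, adding the optional `currency` field is fully compatible, while adding it as required is forward compatible only: old consumers ignore it, but a consumer upgraded to the new schema cannot read old messages that lack it.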

Schema versioning and migration strategy for existing queues with messages in flight must be addressed when a backward-incompatible change is required. If a required field must be renamed, or a field type must change, the migration typically requires running two schema versions simultaneously. First the producer publishes messages with both field names and the consumer is updated to read both field names; once the old messages in the queue have been consumed, the old field is removed from the producer; finally the consumer is updated to require only the new field. This kind of migration must be designed explicitly; it does not happen automatically, and the duration of the two-version window (the time required to drain old-schema messages from the queue) depends on the message retention period and the consumer's processing rate. The migration procedure for schema changes belongs in the messaging ADR alongside the schema format and compatibility mode, because the procedure determines how long a backward-incompatible change takes to complete safely.
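A minimal sketch of the two-version window for a field rename might look like the following. The field names `cust_id` (old) and `customer_id` (new) and both function names are hypothetical; the point is only that the producer writes both names and the consumer accepts either until the old messages have drained.

```python
def publish_payload(customer_id: str) -> dict:
    """Producer during the window: write the value under both field names."""
    return {"customer_id": customer_id, "cust_id": customer_id}


def read_customer_id(message: dict) -> str:
    """Consumer during the window: accept the new name, fall back to the old."""
    if "customer_id" in message:  # new field name
        return message["customer_id"]
    if "cust_id" in message:  # old field name, still present in undrained messages
        return message["cust_id"]
    raise KeyError("message has neither customer_id nor cust_id")
```

Once no old-schema messages remain in the queue, the `cust_id` line can be dropped from the producer and the fallback branch removed from the consumer.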

Queue selection and the trade-offs that matter

The queue broker selection is the decision that most teams treat as the primary choice — Kafka or RabbitMQ or SQS — while the delivery semantics, dead-letter strategy, consumer model, and schema policy are treated as secondary configuration. The broker selection matters, but it matters specifically because each broker has structural properties that make it more or less suited to specific delivery models, fan-out requirements, and replay capabilities. Understanding the broker selection in terms of these structural properties is more useful than a benchmark comparison.

Kafka is a distributed log with consumer-controlled offset management. Messages are retained on disk for a configurable period (days to weeks to indefinitely with tiered storage), regardless of whether consumers have read them. Each consumer group maintains its own offsets, which means any consumer group can replay from any point in the retention window by resetting its offset. This makes Kafka the natural choice when event replay is a requirement: adding a new consumer group that needs to process historical events, backfilling a new data store from the event log, or reprocessing events after a consumer bug is fixed. The structural cost is operational complexity: Kafka requires ZooKeeper (older versions) or KRaft (newer versions) for cluster management, partition assignment and rebalancing require careful configuration, and maintaining a Kafka cluster in production requires operator expertise that a managed queue service (SQS, Cloud Pub/Sub) does not. For teams without existing Kafka expertise, a managed Kafka service (Confluent Cloud, Amazon MSK, Aiven for Kafka) reduces the operational burden while preserving the structural properties.
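The offset model can be illustrated with a toy in-memory log. This is not Kafka client code: the `Log` class and its methods are invented for the sketch, there is a single partition, and a new group starts at offset 0 here, whereas in Kafka the starting position of a new group is governed by the `auto.offset.reset` setting.

```python
class Log:
    """Toy append-only log: records stay in place after being read."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group, max_records=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Replay: move one group's offset without affecting other groups."""
        self.offsets[group] = offset
```

Because reading never removes a record, a group added months later sees the same history as the first group, and resetting one group's offset after a bug fix reprocesses events without disturbing anyone else.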

RabbitMQ is a traditional message broker with routing flexibility. Messages are routed from producers to exchanges (fanout, direct, topic, headers) and then to queues bound to the exchange with routing keys. The routing model is more expressive than Kafka's topic-partition model: a single producer can route different messages to different queues based on message attributes without the consumer knowing about the routing. RabbitMQ's push-based delivery (the broker pushes messages to consumers) differs from Kafka's pull-based model (consumers poll for messages), which affects the behavior under consumer backpressure: a RabbitMQ consumer that is slow to process messages accumulates delivered but unacknowledged messages, buffered in the client and still tracked by the broker, which can exhaust memory if the consumer does not limit its prefetch count. RabbitMQ supports at-most-once (auto-ack) and at-least-once (manual ack) delivery; it does not have a native exactly-once mechanism. It is a good fit when routing flexibility and the push delivery model match the use case, and when event replay from a persistent log is not required.
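To make the routing model concrete, here is a small pure-Python sketch of topic-exchange matching, where `*` stands for exactly one dot-separated word and `#` for zero or more words. It does not talk to a broker; the functions `topic_matches` and `route`, and the binding and queue names used with them, are hypothetical.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Match a dot-separated routing key against a topic binding pattern."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)  # pattern consumed: key must be consumed too
        if p[i] == "#":
            # '#' absorbs zero or more words; try every possible split point.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)  # '*' or a literal consumes one word
        return False

    return match(0, 0)


def route(bindings, routing_key):
    """Return the queues whose binding pattern matches the routing key."""
    return [queue for pattern, queue in bindings if topic_matches(pattern, routing_key)]
```

With bindings such as `("payment.*.eu", "eu-ledger")` and `("payment.#", "audit")`, a message keyed `payment.completed.eu` reaches both queues while `payment.completed.us` reaches only the audit queue, and the producer code is identical in both cases.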

AWS SQS (and its equivalents: Google Cloud Pub/Sub, Azure Service Bus) is a managed queue service that eliminates broker operations entirely. SQS standard queues provide at-least-once delivery with no ordering guarantee; SQS FIFO queues provide exactly-once processing (duplicates are suppressed within the 5-minute FIFO deduplication interval) with per-message-group ordering. SQS has a native dead-letter queue configuration: a redrive policy specifies the maximum receive count before a message is routed to the DLQ, which is another SQS queue. The SQS model is the lowest-friction starting point for teams that need a message queue without Kafka's operational complexity: no cluster to manage, pay-per-request pricing, native AWS IAM integration, and built-in DLQ support. The structural limitation is that SQS does not retain messages beyond 14 days and does not support consumer-controlled replay from an offset: a new consumer cannot process historical messages that were consumed before it subscribed. For use cases where event replay is a future requirement, starting with SQS and migrating to Kafka later is an architectural migration (a rewrite of the producer and consumer interfaces, a period of dual-publishing to both systems during cutover), not a configuration change. The expected future requirement for event replay belongs in the queue selection reasoning in the messaging ADR.
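As a rough illustration of the redrive policy, the sketch below builds the queue attribute that attaches a DLQ to a source queue. `RedrivePolicy`, `deadLetterTargetArn`, and `maxReceiveCount` are the SQS attribute names; the helper name `redrive_attributes`, the default of 3, and any ARN passed to it are assumptions for the example, and no AWS call is made here.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build the attributes that route a message to the DLQ after N receives."""
    return {
        # SQS expects the policy as a JSON string, not a nested mapping.
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": max_receive_count,
            }
        )
    }
```

The resulting dictionary is what would be passed as `Attributes` to boto3's `set_queue_attributes` for the source queue, with the DLQ itself created beforehand as an ordinary SQS queue.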

Redis Streams is a lightweight append-only log with consumer group semantics, similar to Kafka in its consumer offset model but with different operational properties: Redis is a single-node or cluster in-memory data structure, which means Redis Streams has low latency and simple operations but limited retention capacity (bounded by available memory) and different durability properties than a Kafka cluster with disk-backed storage. Redis Streams is the correct choice when the team already operates Redis and the messaging volume and retention requirements are within Redis's capacity — adding a Redis Streams consumer to an existing Redis deployment is operationally simpler than introducing a new Kafka cluster. The capacity bound (messages bounded by available memory) is the structural limit that determines whether Redis Streams or a disk-backed broker is appropriate for the use case.

Ordering guarantees and when you actually need them

Message ordering is one of the most frequently misunderstood requirements in messaging system design. Teams often request ordering guarantees by default, without considering whether the operations they are performing are actually order-dependent — and ordering guarantees come at a throughput cost (limiting the degree of consumer parallelism) and a complexity cost (partition key selection, handling partition rebalancing during consumer failures).

An operation is order-dependent when the correctness of processing message B depends on message A having been processed first. A payment state machine that transitions from pending → processing → completed is order-dependent: processing payment.completed before payment.processing produces an incorrect state because the transition from pending to completed is not a valid state machine step. A search index update that replaces the current document with the latest version is not order-dependent, provided each update carries a version number and the consumer discards updates older than the version already indexed: whether the update for version 3 arrives before or after the update for version 4 does not matter because the final state is version 4 either way. Performing an order-dependent operation on a queue with no ordering guarantee requires that the consumer implement its own ordering mechanism (track the last processed sequence number and buffer out-of-order messages until the gap is filled), which is complex and introduces latency. Performing an order-independent operation on an ordered queue limits throughput to the throughput of a single consumer per partition. Matching the ordering guarantee to the actual requirement is a correctness and performance decision that belongs in the messaging ADR.
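The consumer-side ordering mechanism mentioned above (track the last processed sequence number, buffer out-of-order messages until the gap is filled) could be sketched as follows. The `ReorderBuffer` class is hypothetical and assumes the producer stamps each message with a gap-free sequence number starting at 1; it has no timeout for a gap that never fills, which a real consumer would need.

```python
class ReorderBuffer:
    """Release messages strictly in sequence order, buffering early arrivals."""

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq  # the sequence number we are waiting for
        self.pending = {}  # seq -> message, for messages that arrived early

    def accept(self, seq: int, message):
        """Return the messages that are now safe to process, in order."""
        if seq < self.next_seq:
            return []  # duplicate of something already released
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

The latency cost is visible in the structure: a message that arrives early sits in `pending` until every earlier sequence number has shown up.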

The error handling strategy and the ordering guarantee interact in a specific way that is worth calling out explicitly. In an ordered queue with in-place retry, a message that cannot be processed blocks all subsequent messages in the same partition until the retry limit is exhausted and the message is routed to the DLQ. If the retry backoff is exponential and the maximum retry count is high, a single bad message can block its partition for tens of minutes. This interaction — ordering guarantees combined with retry semantics and DLQ configuration — is the design space that produces the 72-hour silent failure in the opening story. Each of these decisions is individually reasonable; the combination of all of them with defaults produces a silent partition-blocking failure. Only the messaging ADR, which holds all five decisions together, makes the interaction visible.
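The blocking time is simple arithmetic over the retry schedule. The helper below is a back-of-the-envelope sketch with hypothetical parameter names; it assumes a fixed multiplier between retries and no cap on the delay.

```python
def partition_block_seconds(base_delay_s, multiplier, max_retries, attempt_s=0.0):
    """Time a poison message holds its partition before reaching the DLQ.

    Sums the waits between attempts, plus the time spent in each attempt
    (the initial delivery and every retry).
    """
    delays = [base_delay_s * multiplier ** i for i in range(max_retries)]
    return sum(delays) + attempt_s * (max_retries + 1)
```

For example, a 30-second base delay doubling over 6 retries gives 30 + 60 + 120 + 240 + 480 + 960 = 1890 seconds, so one bad message blocks its partition for about 31.5 minutes before it is dead-lettered.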

Retention, replay, and the compliance implications

Message retention policy determines how long the broker holds messages after they are produced, regardless of whether they have been consumed. This appears to be an operational decision (how much disk space to allocate) but it is also a capability decision (event replay is only possible within the retention window) and a compliance decision (message contents may include personal data subject to GDPR deletion requirements).

Event replay as an architectural requirement must be planned before the first consumer is deployed, not added later. A broker that deletes messages once they are consumed, or that cannot hold them long enough (RabbitMQ queues that discard each message once it is acknowledged; SQS, which deletes consumed messages and caps retention at 14 days; Redis Streams beyond the available memory), cannot support replay of consumed messages. A team that discovers it needs to backfill a new service from historical events, or to reprocess events after a consumer bug, has no replay mechanism available if the broker was not configured for it. Kafka with a configurable retention period (or with infinite retention using tiered storage) supports replay by resetting the consumer group offset; the replay capability is only available if the retention period is long enough to include the messages to be replayed. The retention period and the replay requirement belong together in the messaging ADR: if event replay is a future requirement (even a hypothetical one), the broker and retention configuration must support it.

Personal data in message payloads creates a specific GDPR compliance requirement that the messaging ADR must address. If messages contain personal data (user IDs, email addresses, IP addresses, behavioral data), GDPR's right to erasure requires that the personal data be removed from all systems, including from message payloads in the broker's retention window. A message in a Kafka topic with a 90-day retention period that contains a user's email address must have that email address erased from the retained message when the user exercises their erasure right. This can be achieved by not embedding personal data in message payloads (embed a user ID that is looked up at consume time, not the email address itself), by using a compacted topic with tombstone records (a record with a null value for a key, after which log compaction removes the earlier retained records for that key), or by encrypting the personal data in the message payload with a per-user encryption key and deleting the key (the "crypto-shredding" technique: the encrypted data remains in the retained messages but becomes unreadable). The approach chosen must be in the data retention decision record and the messaging ADR, because it determines how the queue is configured and how producers and consumers handle personal data. Discovering this requirement after a 90-day-retention Kafka cluster has been in production with email addresses in message payloads requires a message re-encoding migration that is significantly more complex than designing for it from the start.
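The tombstone approach can be illustrated with a toy model of log compaction. This is a sketch of the idea rather than of Kafka's implementation: the `compact` function is hypothetical, it runs in one pass where real compaction happens in the background some time after the tombstone is written, and it only helps if the personal data appears solely in records keyed by that user.

```python
def compact(log):
    """Keep only the latest record per key.

    `log` is a list of (key, value) pairs in offset order; a value of None
    is a tombstone marking the key for deletion.
    """
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset  # later offsets overwrite earlier ones
    return [record for offset, record in enumerate(log) if latest[record[0]] == offset]
```

After a tombstone for `user-1` is appended and compaction runs, the earlier record carrying that user's email address is gone and only the tombstone remains for that key, while records for other keys are untouched.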

Finding queue and messaging decisions in AI chat history

Messaging decisions are some of the most valuable decisions buried in AI chat history because they are made early, under the assumption that the infrastructure choice is low-stakes, and they have consequences that compound over months. Three months of AI chat history for a team that shipped an event-driven feature typically contains the full archaeology of the messaging design: the initial broker selection, the delivery model default, the consumer implementation, the first incident that revealed a missing dead-letter queue or a non-idempotent consumer, and the ad-hoc fix applied under pressure.

The initial integration session contains the decisions with the longest tail: "how do I publish events when a payment is completed?", "should I use RabbitMQ or Kafka for this?", "how do I have two services both receive the same event?", "how do I process messages in the background without blocking the API?" The AI responses in these sessions reflect the default configuration of the broker library used — and defaults are not decisions. A Kafka consumer library defaults to at-least-once delivery; the developer who follows the quickstart inherits this guarantee without understanding what it requires from their consumer code. The initial integration session is the place to look for the delivery model assumption that was never surfaced as a decision.

The incident session contains the decisions made under pressure: "messages are being processed twice and we're generating duplicate orders", "a bad message is stuck in the queue and blocking everything behind it", "the consumer crashed and we lost 200 messages — how do we recover them?", "the queue is growing faster than the consumer is draining it and we're running out of memory". These sessions document the retroactive DLQ addition, the idempotency check added after the first duplicate incident, the consumer prefetch limit added after the first memory exhaustion, the partition rebalancing configuration added after the first rebalance-during-deployment caused messages to be processed out of order. Each fix addresses the immediate incident correctly; none of them are recorded as architectural decisions with their reasoning.

The growth session contains the fan-out architectural decisions: "we have a new service that also needs to receive payment events — how do we add it?", "we need some consumers to only receive events for US customers and some for EU customers — how do we route by region?", "we need to replay all payment events from the last 30 days to populate the new analytics database — how do we do that?" These sessions reveal whether the initial messaging design supports the new requirement (a pub/sub topic model allows adding a new consumer group without changing the producer) or requires an architectural migration (a competing consumer queue model requires the producer to fan out to multiple queues, or a topic migration). The fan-out requirement and the replay requirement are often hypothetical in the initial integration session and concrete in the growth session — by which point the cost of migration is higher than the cost of designing for them from the start would have been.

Writing the queue and messaging ADR

A messaging ADR is more complex than most single-component ADRs because it documents a set of interconnected decisions whose interactions produce the failure modes the system will encounter. The structure must make those interactions explicit.

The first section documents the broker selection and delivery model with the rejection reasons for each alternative. Not a comparison table, but a specific statement of what was wrong with each alternative for this use case: RabbitMQ was rejected because event replay is a future requirement and RabbitMQ's message retention model does not support offset-based replay; SQS was rejected because the 14-day retention limit is shorter than the compliance requirement for event log retention; Redis Streams was rejected because the memory-bounded retention would require over-provisioning Redis to support the expected message volume. The selection and rejection reasons establish why the broker is correct for this use case, which is the context a future engineer needs to evaluate whether the decision still holds when the requirements change.

The second section documents the delivery guarantee and the idempotency requirement it imposes. At-least-once delivery chosen; the idempotency check implemented in consumers uses a database unique constraint on processed_events(message_id); the constraint applies to all consumers of all queues in this system; new consumers must implement the idempotency check before processing any message. This section converts the delivery semantic from a broker configuration into a programming contract — and names the contract explicitly so that it is not rediscovered as a bug.
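A minimal sketch of that contract, using SQLite in place of whatever database the system actually runs on, might look like this. The `processed_events(message_id)` constraint comes from the ADR text; the `orders` table, the `make_db` and `handle` functions, and the use of an in-memory database are assumptions for the example.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_events (message_id TEXT NOT NULL UNIQUE)")
    db.execute("CREATE TABLE orders (order_id TEXT NOT NULL)")
    return db


def handle(db, message_id: str, order_id: str) -> bool:
    """Apply a message's side effect once; return False on a redelivery."""
    try:
        # One transaction: the marker row and the side effect commit together,
        # or neither does.
        with db:
            db.execute(
                "INSERT INTO processed_events (message_id) VALUES (?)", (message_id,)
            )
            db.execute("INSERT INTO orders (order_id) VALUES (?)", (order_id,))
        return True
    except sqlite3.IntegrityError:
        # The unique constraint fired: this message_id was already processed.
        return False
```

Putting the marker insert and the side effect in the same transaction is the design choice that matters: if the consumer crashes between the two, the redelivered message is processed as if it were new instead of being skipped with its side effect missing.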

The third section documents the dead-letter strategy: DLQ configured for each source queue; maximum receive count of 3 before routing to DLQ; exponential backoff (30s, 2m, 10m) between retries; DLQ monitored with a CloudWatch alarm when message count exceeds 1; standard replay procedure (diagnose from DLQ metadata, fix consumer bug, re-publish DLQ messages to source queue using the replay tool at scripts/replay-dlq.sh); idempotency check in consumers ensures replay safety.
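The receive-count rule in this section can be sketched as a small in-memory simulation. The `drain` function and the shape of the DLQ entries are hypothetical, the backoff delays are omitted, and failed messages are requeued at the back, which suits an unordered queue but would break per-partition ordering.

```python
from collections import deque


def drain(source, handler, max_receive_count=3):
    """Deliver each message up to max_receive_count times, then dead-letter it."""
    dlq = []
    queue = deque((message, 0) for message in source)
    while queue:
        message, receives = queue.popleft()
        receives += 1
        try:
            handler(message)
        except Exception as exc:
            if receives >= max_receive_count:
                # Keep the metadata needed to diagnose the failure later.
                dlq.append(
                    {"message": message, "receive_count": receives, "error": repr(exc)}
                )
            else:
                queue.append((message, receives))  # retry after the other messages
    return dlq
```

With a handler that rejects one malformed message, the healthy messages are processed on their first delivery and the malformed one lands in the DLQ after its third receive, with the error recorded alongside it.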

The fourth section documents the consumer group model and the fan-out architecture. Pub/sub topic model chosen; each service creates its own consumer group subscribed to the shared topic; adding a new subscriber requires only creating a new consumer group (no producer changes); the topic retains messages for 30 days, which is the maximum replay window for a new subscriber backfilling from historical events; consumer concurrency is per-partition (one consumer instance per partition), which maintains per-partition ordering with a partition key of customer ID.
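The link between the partition key and per-customer ordering comes down to a deterministic mapping from key to partition. The sketch below uses SHA-256 as a stand-in hash (Kafka's default partitioner uses a different hash function), and the function name `partition_for` is made up for the example.

```python
import hashlib


def partition_for(customer_id: str, num_partitions: int) -> int:
    """Map a customer ID to a partition; the same ID always maps to the same one."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because every event for a given customer lands on the same partition, and each partition is read by one consumer instance, events for that customer are processed in the order they were produced. Changing `num_partitions` changes the mapping, which is one reason the partition count is worth recording in the ADR.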

The fifth section documents the schema evolution policy. Avro schema with Confluent Schema Registry; backward compatibility mode (new optional fields may be added, existing fields may not be removed or renamed); producer deployments that include schema changes must be validated against the schema registry before deployment; the producer CI pipeline includes a schema compatibility check step; consumers are deployed before producers for backward-compatible changes; the schema migration procedure for backward-incompatible changes requires a two-version window process described in the runbook.

The sixth section documents the retention policy and its compliance implications. 30-day retention for operational queues; messages contain customer ID references but not email addresses or PII directly (PII is looked up from the customer service at consume time, not embedded in the message payload); GDPR erasure of a customer's records does not require message payload modification because no PII is embedded; the customer ID is a stable internal identifier that is not itself PII under the legal team's interpretation (documented in the data retention ADR).

The WhyChose extractor surfaces the messaging decisions buried in team AI chat history — the initial broker selection that set the delivery semantic, the incident session that revealed the missing DLQ, the growth session that discovered the fan-out architecture limitation — and associates them with the chat context that explains why each choice was made. The decisions are not always in a single session; the messaging architecture is assembled across multiple sessions over months. The extractor connects them into a coherent decision record so the team can see the full design, not just the code it produced.