2026-06-21 · ~22 min read

The message broker decision record: why the broker you chose determines your delivery guarantee and your consumer group isolation capability

Q: When should a team use AWS SQS instead of Kafka?

AWS SQS is the appropriate choice when three conditions hold: the team is already operating in AWS and wants managed infrastructure without running a Kafka cluster or paying for a managed Kafka service; the message processing pattern is work queue distribution (one consumer processes each message exactly once) rather than event log fan-out (multiple consumer groups consume the same message stream independently); and replay of previously consumed messages is not a requirement. SQS Standard queues provide near-unlimited throughput, managed scaling, and at-least-once delivery with best-effort ordering. SQS FIFO queues provide exactly-once processing within a 5-minute deduplication window and strict ordering within message groups, at the cost of a 300 transactions-per-second throughput limit per API action (3,000 with message batching). SQS does not retain messages after they have been successfully consumed — the delivery model is destructive, meaning consumed messages are gone from the queue. If a new downstream consumer needs to process historical messages, those messages are not available in SQS; they must be replayed from the source system (the database, an S3 event archive, or another durable store). This makes SQS appropriate for fire-and-forget work distribution (sending emails, resizing images, triggering webhooks, processing payment callbacks) and inappropriate for event sourcing, audit trails, or patterns where new consumers need to backfill from history. The operational simplicity argument for SQS is strong for teams without Kafka expertise: no cluster to size, no ZooKeeper or KRaft quorum to manage, no partition rebalancing to understand, no consumer lag alerting to wire up from scratch — the broker is entirely managed and scales transparently.

Q: What should a message broker ADR document that teams typically skip?

Teams typically document the broker name and nothing else: 'We use Kafka for async messaging.' The ADR that prevents incidents documents the structural decisions that are invisible in the codebase: (1) the delivery guarantee and its idempotency contract — at-least-once means every consumer must be idempotent; exactly-once via transactions means read_committed isolation is required; this contract must be stated explicitly so that every future service that consumes from this broker understands what assumptions are safe to make; (2) the partition count and the reasoning behind it — the partition count is a hard ceiling on consumer group parallelism, and the reasoning ('8 partitions because we expected 4 consumer instances with 2x headroom') decays fast as the team grows; future engineers who try to add parallelism need to know this ceiling exists and why it was set where it was; (3) the retention policy and the replay window it provides — 24-hour retention is a very different capability from 7-day retention or 30-day retention; the retention policy is usually set once at topic creation and forgotten; when a new consumer needs replay, the answer to 'how far back can we go?' depends entirely on this policy; (4) the consumer group naming convention and isolation model — ad-hoc consumer group names create invisible coupling; two services that accidentally share a consumer group name will silently split the message stream between them, each seeing only half the messages; the naming convention and isolation rules must be documented before the third team starts consuming from the topic; (5) the dead letter queue strategy — what happens when a consumer fails to process a message after the maximum retry count; where do poison pill messages go; who monitors them; the absence of a DLQ strategy means failed messages either block the consumer or disappear silently, and neither failure mode is acceptable.

Message broker selection looks like an infrastructure detail until a payment reconciliation service needs 30 days of event replay and you discover the topic retention policy was set to 24 hours at initial setup — and the engineer who set it hasn't worked there in eight months. The broker you chose in year one determines your delivery guarantee, your replay window, your consumer group isolation model, and how many partitions stand between you and the throughput ceiling you'll hit when the team tries to scale.

A nine-person startup built their first event-driven service in month four. The team lead had worked at a company that used Kafka, and when the question came up — "should we use a message queue for the order processing pipeline?" — the answer arrived quickly: "We'll use Kafka. It's what Netflix uses for high-throughput event streaming." Nobody pushed back. Kafka was set up on a single broker for development, three brokers for production. The order topic got four partitions because four felt like a reasonable round number. The retention policy defaulted to 168 hours — seven days — and nobody changed it. Consumer groups were named by whoever wrote the service that consumed them: order-processor-group, notification-svc, analytics-consumer.

Fourteen months later, the company raised a Series A and hired a Head of Finance who wanted automated payment reconciliation. The reconciliation service needed to process every order.placed, payment.captured, and refund.issued event from the beginning of the company's history to generate correct monthly reports. The engineering team estimated two weeks to build the reconciliation service. On day one of the build, they discovered that the longest any event could live in the Kafka topic was seven days. Fourteen months of events did not exist in the broker. They existed in the PostgreSQL database — but the reconciliation service had been designed to consume from Kafka, not to query Postgres directly.

Three engineer-weeks went into a data migration path: exporting historical events from the database, transforming them into the Kafka message format, producing them onto a dedicated backfill topic, and wiring the reconciliation service to consume from the backfill topic first, then switch to the live topic. The backfill revealed a second problem: the payment.captured events had a schema that had evolved in two undocumented ways since month four. The consumer broke on messages from the first six months because the currency_code field had been renamed to currency in a refactor, and the amount field had changed from a float to an integer (cents) after a floating-point precision incident. Neither schema change had been documented anywhere. The ChatGPT sessions where those decisions had been made — "how should we represent money amounts in our event payloads?" — were gone.

Two more weeks. The reconciliation launch date slipped by six weeks. Four engineer-weeks of work were required because three undocumented decisions — the retention policy, the partition count strategy, and the message schema evolution policy — had been set in passing during infrastructure setup sessions and never written down. The ADR for the Kafka adoption, if it had existed, would have been one document. The six weeks of recovery work it would have prevented was not an engineering problem. It was a documentation problem dressed up as an engineering problem.

Four months after the reconciliation launch, the product team added a loyalty points service that needed to consume from the same order.placed topic. A junior engineer set up the consumer and named the consumer group order-processor-group — the same name as the existing order processing service. For three days, the order processing service and the loyalty points service were silently splitting the order event stream between them. Each service was receiving approximately half the orders. The order processing service processed 48% of orders fully; the loyalty points service processed 52% of orders. The remaining 52% of orders were processed by the loyalty service and the remaining 48% were not processed for loyalty at all. The error did not produce visible errors — both consumers were processing messages successfully, just different messages. The issue was discovered only when a customer complained that they had placed five orders but only received loyalty points for two of them.

The consumer group naming collision was invisible in the codebase. There was no documentation of what consumer group names were in use, what naming convention they followed, or why consumer group isolation mattered. The engineer had picked a sensible-sounding name that happened to already be taken.

These four incidents — the retention policy gap, the backfill schema breaks, and the consumer group collision — were all downstream consequences of the same upstream omission: the message broker was selected and configured in a series of informal setup sessions, and the structural decisions made during those sessions were never recorded. A message broker decision record that documented the retention policy, the schema evolution contract, the partition count rationale, and the consumer group naming convention would have prevented all four incidents.

The three structural properties that broker selection determines

When teams evaluate message broker options, the discussion centers on throughput benchmarks, managed vs self-hosted operational burden, and familiarity from previous jobs. These are real factors. But the structural properties that determine whether your broker choice ages well are different — and they're set at selection time, not during feature development.

Delivery guarantee and its idempotency contract

The delivery guarantee is the broker's contractual promise about how many times a message will be delivered to consumers. This guarantee is not a configuration option that you can freely choose after selecting a broker — it is a property of the broker's architecture, and it determines what application-level contract your consumers must uphold.

At-least-once delivery is the default for Kafka, RabbitMQ, and SQS Standard queues. The broker guarantees that every message will be delivered to a consumer at least one time. In failure scenarios — consumer crashes after processing but before acknowledging, network partition between consumer and broker, broker leader failover — the broker redelivers the message. The consumer may therefore receive the same message more than once. The application contract required by at-least-once delivery is idempotency: applying the message a second time must produce the same observable result as applying it the first time. For a payment capture event, idempotent processing means checking whether this payment has already been captured before processing it. For an email send event, idempotent processing means checking whether this email has already been sent. The idempotency contract is an application responsibility, not a broker responsibility — and it applies to every consumer of every topic, forever, for as long as the broker uses at-least-once delivery.

Exactly-once delivery eliminates the duplicate redelivery problem but introduces coordination overhead. Kafka's transactional API — introduced in version 0.11 — allows a producer to write messages to multiple partitions atomically and allows consumers configured with isolation.level=read_committed to see only messages that have been fully committed by producers. The combination prevents duplicates: if a producer fails mid-transaction, the partial transaction is aborted and rolled back, so consumers with read_committed isolation never see the partial write. The cost is latency: read_committed consumers cannot read messages from open transactions, which means their end-to-end latency includes the time waiting for producers to commit. For high-throughput pipelines where producers batch many messages per transaction, this latency overhead is material.

SQS FIFO queues achieve a form of exactly-once via message deduplication IDs: messages with the same deduplication ID within a 5-minute window are delivered only once. This is deduplication, not true exactly-once semantics — it only applies within the 5-minute window, and it requires the producer to supply a stable deduplication ID for each message. For most work queue patterns, SQS FIFO's deduplication is sufficient. For event sourcing patterns with long windows and replay requirements, it is not.

The practical implication: when you choose a broker with at-least-once delivery, you are choosing idempotency as a permanent application contract. Every engineer who writes a consumer, now and in the future, must understand this contract and implement idempotent processing. That understanding needs to be in the ADR, not assumed from general Kafka knowledge, because "general Kafka knowledge" is not uniformly distributed across every engineer who will ever touch the codebase.

Consumer group isolation and partition ceiling

Kafka and Redis Streams use a consumer group model that provides strong isolation between independent consumers of the same message stream. Every consumer group maintains its own committed offset — its own read position — into the topic. When consumer group A advances past message 1,000, consumer group B is unaffected; it still reads from wherever its own offset is committed. Multiple consumer groups can consume the same topic simultaneously, each at their own pace, each seeing every message, without any coordination between them. This is Kafka's fan-out model: a single order.placed topic serves the order processor, the notification service, the analytics pipeline, the loyalty points service, and any future consumer — without any configuration change to the topic and without any awareness between consumers.

The partition count creates the parallelism ceiling within a consumer group. Each partition is assigned to exactly one consumer instance within a consumer group at a time. A topic with four partitions supports a maximum of four active consumer instances per consumer group. The fifth consumer instance in a group sits idle, receiving no messages, serving only as a standby for failover. Increasing consumer parallelism beyond four requires increasing the partition count — which is a topic-level change with consequences for keyed messages. Kafka guarantees that all messages with the same key are delivered to the same partition in order, which is how you achieve per-entity ordering (all events for order ID 1234 arrive in sequence). Increasing the partition count redistributes keys to different partitions, which breaks in-flight ordering for messages that were already in transit at the time of the repartitioning. Partition count is therefore a decision that must be made with the growth trajectory in mind, not set to the first round number that felt reasonable.

RabbitMQ uses a queue-based fan-out model rather than an offset-based consumer group model. In RabbitMQ, fan-out requires routing the same message to multiple queues via an exchange: a fanout exchange broadcasts every message to every queue bound to it; a topic exchange routes messages to queues whose binding key matches the message routing key. Each queue is then consumed independently. The structural difference from Kafka's model is that the fan-out topology must be configured before the message is published — if you add a new downstream consumer six months after the topic was created, you need to add a new queue binding to the exchange and then decide how to handle the messages that were published before the binding was added (they are gone, because they were already consumed by the existing queues or expired). In Kafka, a new consumer group that is added six months later can reset its offset to the beginning of the retained log and replay all messages within the retention window. In RabbitMQ, there is no equivalent: messages consumed by existing queues are removed from those queues and cannot be replayed.

Replay capability and retention model

Kafka's log-based architecture retains messages for a configured duration after publication, regardless of whether they have been consumed. The retention policy — how long messages persist in the log — determines your replay window: how far back in time a new consumer group can start reading. A seven-day retention policy means any new consumer group created today can read messages published in the last seven days. A thirty-day retention policy extends that window to thirty days. An "indefinite" retention policy (retention.bytes capped, retention.ms set to -1) keeps messages until the log fills the allocated storage.

The retention policy is set per topic at creation time. It defaults to seven days in most Kafka configurations. Teams that use Kafka for audit trails, compliance records, or reconciliation workflows frequently discover that the default retention is too short only when they need to replay beyond seven days — at which point the messages are gone and the replay must come from a different source. Setting a longer retention policy costs storage; the right retention duration depends on the longest backfill window any downstream consumer could legitimately need, which is a product requirement, not a technical default.

SQS's retention model is fundamentally different. SQS messages are deleted after successful consumption. There is no replay; there is no log. If you need to process a message again, you must republish it from the source. This is appropriate for transient work queues — send an email, process a payment webhook, resize an image — where the message represents a one-time unit of work. It is inappropriate for event sourcing, audit trails, or any pattern where consumers might legitimately need to replay the event history.

Redis Streams provides a middle ground. The XADD command appends messages to an immutable stream. Messages remain in the stream until explicitly deleted via XTRIM or XDEL, or until the stream exceeds a configured MAXLEN. Consumer groups (XGROUP CREATE, XREAD GROUP) maintain per-group read positions and support replay by resetting the group's last-delivered-ID. The replay capability is bounded by what is still in the stream — messages trimmed by MAXLEN are gone. Redis's primary durability caveat is persistence configuration: by default, Redis fsync is either every second (AOF with everysec) or never (RDB snapshots only), and a crash in the wrong moment can lose up to one second of messages. For applications where losing a message is acceptable, Redis Streams' replay model is powerful. For applications where durability is a hard requirement, Kafka's replication factor and ACK model provide stronger guarantees at the cost of higher operational complexity.

The options and their structural tradeoffs

Apache Kafka (and managed variants: Amazon MSK, Confluent Cloud)

Kafka is the dominant choice for high-throughput event streaming and event sourcing architectures. Its log-based retention model, consumer group isolation, and partition-level ordering give it structural capabilities that queue-based brokers cannot replicate. The tradeoffs are operational complexity and a steep learning curve for teams new to the partition and offset model.

Self-hosted Kafka on KRaft (Kafka without ZooKeeper, available from version 2.8 and production-stable from 3.3) eliminates ZooKeeper as a separate operational dependency. Prior to KRaft, self-hosted Kafka required running a separate ZooKeeper cluster alongside Kafka — a significant operational surface area that was the primary source of "Kafka is too complex for our team" complaints. KRaft mode consolidates metadata management into Kafka's own Raft-based controller quorum, reducing the operational footprint to the Kafka brokers themselves. For teams with Kubernetes experience, Strimzi (an open-source Kafka operator) automates Kafka cluster lifecycle management — provisioning, scaling, rolling upgrades, topic management — as Kubernetes custom resources. This brings Kafka's operational model closer to what teams comfortable with infrastructure-as-code strategy decisions expect.

Managed Kafka services — Amazon MSK and Confluent Cloud — eliminate cluster management in exchange for a per-broker-hour or per-CKU cost model. MSK is appropriate for teams already in AWS who want Kafka semantics without running their own brokers. Confluent Cloud adds Schema Registry (for enforcing and evolving Avro/Protobuf message schemas), Kafka Connect (for streaming data between Kafka and external systems), and ksqlDB (stream processing without a separate Flink or Spark cluster). The Schema Registry is the managed solution to the schema evolution problem that caused the reconciliation service backfill breaks in the opening narrative — every message is tagged with a schema ID, consumers retrieve the schema for that ID from the registry, and breaking schema changes fail at publish time rather than at consume time during backfill six months later.

The minimum viable Kafka cluster for production is three brokers (for a replication factor of three, which tolerates one broker failure without data loss). Running three brokers costs more than running one SQS queue — this is the correct starting point for the cost comparison, not raw per-message pricing. For teams publishing fewer than 10,000 messages per second with simple fan-out requirements and no need for exactly-once semantics, the operational investment in Kafka frequently does not justify the capabilities it provides over simpler alternatives.

RabbitMQ

RabbitMQ is the AMQP-native message broker that predates Kafka's log-based model. Its exchange-based routing model — direct, fanout, topic, and headers exchanges — provides flexible message routing that Kafka does not natively support. In Kafka, routing a subset of messages to a specific consumer requires the consumer to filter messages after consumption; in RabbitMQ, routing rules live in the broker and are applied before message delivery. This model reduces consumer-side filtering logic for complex routing requirements.

RabbitMQ's queue-based delivery model is a strong fit for work queue patterns: tasks that should be processed exactly once by exactly one worker, with backpressure signaling when the worker pool is overwhelmed (the queue depth increases; the producer slows). The acknowledgment model (manual ACK, NACK, REJECT) gives consumers explicit control over redelivery: a consumer can NACK a message to put it back in the queue for redelivery, REJECT it to move it to the dead letter exchange, or ACK it to remove it permanently. This per-message control is more expressive than Kafka's offset-commit model for work queue semantics.

The absence of a native replay model is RabbitMQ's structural limitation for event-driven architectures where downstream consumers might need to backfill from history. Once a message is ACK'd and removed from a queue, it is gone. If a new consumer needs to process historical messages, those messages must be republished from the source. For audit trails, event sourcing, or any pattern that requires adding new consumers retroactively, RabbitMQ's architecture requires maintaining a separate event store alongside the broker — a common pattern (RabbitMQ for live delivery + PostgreSQL or S3 event archive for replay) but a more complex overall system than Kafka's self-contained log.

RabbitMQ's operational characteristics differ from Kafka's in ways that matter for small teams. A single RabbitMQ node is sufficient for development and moderate production workloads; quorum queues (RabbitMQ's modern HA model, recommended over classic mirrored queues since RabbitMQ 3.8) provide replication with a Raft-based leader election model that tolerates minority-node failures. The operational model is simpler than Kafka's for low-to-moderate throughput workloads. For workloads that push beyond a few thousand messages per second, Kafka's partition model scales more linearly than RabbitMQ's queue model under concurrent consumer pressure.

AWS SQS (Standard and FIFO)

SQS is the correct default for teams that need reliable work queue distribution within AWS, do not need event replay, and want zero operational overhead. SQS is fully managed — no servers to size, no cluster to monitor, no partition count to choose. It scales transparently to millions of messages per second for Standard queues. Dead letter queues are a first-class feature: configure a redrive policy with a maximum receive count, and messages that fail processing after N delivery attempts are automatically moved to a DLQ without any application code change.

Standard queues deliver at-least-once with best-effort ordering. Messages may arrive out of sequence; duplicates are possible in failure scenarios. FIFO queues deliver exactly-once (within a 5-minute deduplication window) with strict ordering within message groups, at the cost of a 3,000 message-per-second throughput limit with batching. For most application-level work queue patterns — sending emails, processing webhooks, triggering background jobs — Standard queue semantics are sufficient with idempotent consumers, and the throughput limit is not a constraint.

The visibility timeout is SQS's mechanism for preventing duplicate concurrent processing. When a consumer receives a message, the message becomes invisible to other consumers for the duration of the visibility timeout. If the consumer processes and deletes the message before the timeout expires, the message is gone. If the consumer fails before deleting the message, the visibility timeout expires and the message becomes visible again for another consumer to pick up. The visibility timeout must be longer than the maximum processing time for any message in the queue — a mismatch produces double-processing because the broker assumes the consumer failed and redelivers to another instance while the original consumer is still working.

SQS's primary structural limitation is the absence of replay and the absence of consumer group fan-out. A message consumed from an SQS queue is deleted and cannot be consumed by another service. Fan-out to multiple independent consumers requires publishing the same message to SNS (Simple Notification Service) and having SNS deliver to multiple SQS queues — one per consumer. This SNS-to-SQS fan-out must be configured before the message is published; new consumers added later receive only messages published after their SQS subscription was created. For systems that start with one downstream consumer and grow to six over two years, this topology must be planned for at design time, not discovered during the sixth consumer's onboarding.

Redis Streams

Redis Streams is the least-discussed option in most broker selection conversations, which undersells its fit for specific use cases. XADD appends messages to an immutable, ordered stream. XREAD and XREAD GROUP provide both broadcast (every reader sees every message from their last-read position) and consumer group semantics (messages are distributed across consumer group members with per-consumer acknowledgment via XACK). XPENDING reports unacknowledged messages per consumer — the primary mechanism for detecting consumer lag and dead consumers. XCLAIM allows reassigning unacknowledged messages from a slow or failed consumer to a healthy one.

Redis Streams is appropriate for high-velocity event streams where latency matters more than durability and where message volume is bounded. The in-memory architecture produces sub-millisecond publish and consume latencies — materially lower than Kafka's disk-backed log or SQS's network round-trips. For real-time use cases — live leaderboard updates, real-time notification delivery, sub-second event triggers — Redis Streams' latency profile is competitive with any alternative.

The durability caveat is the primary selection disqualifier for high-stakes event streams. Redis AOF persistence with fsync everysec (the default) can lose up to one second of writes on crash. AOF with fsync always (on every write) provides stronger durability at the cost of write throughput reduction. For an application where losing one second of order events during a Redis crash is acceptable — because the source of truth is the PostgreSQL orders table and Redis Streams is used only for real-time delivery — the durability tradeoff is manageable. For an application where the message broker is the system of record for events, Redis Streams' durability model is not appropriate without explicit mitigation (separate event archive, synchronous fsync with throughput constraints accepted).

The MAXLEN TRIM model for retention is less granular than Kafka's time-based retention. MAXLEN caps the stream at a message count; older messages are trimmed when the count is exceeded. This means the effective replay window depends on the publish rate: a stream with MAXLEN 1,000,000 on a topic publishing 100,000 messages per day provides roughly 10 days of replay; the same stream on a topic publishing 1,000,000 messages per day provides roughly 24 hours. Teams that rely on Redis Streams' replay capability need to size MAXLEN based on the worst-case publish rate and the required replay window — and document both in the ADR, because the relationship between MAXLEN, publish rate, and replay window is non-obvious.

The AI chat sessions that produced undocumented decisions

Message broker configuration produces a specific pattern of undocumented decisions: the sessions where the configuration happens are focused on getting the infrastructure running, not on documenting the structural constraints. The decisions feel like implementation details at the time. They reveal themselves as architectural decisions only when something changes — a new consumer, a scale requirement, a compliance audit.

The initial infrastructure session — "how do I set up Kafka for a Node.js application?" — produces the partition count, the replication factor, the retention policy, and the default consumer group configuration. These are answered as one-time setup questions. The session closes when the Kafka cluster is running and the first message is flowing. The ChatGPT session log contains the configuration values but not the reasoning: why four partitions instead of eight, why seven-day retention instead of thirty, why replication factor three instead of two. Next session, different topic. The infrastructure session is closed. See the pattern described in decisions never written down.

The scaling incident session — "my Kafka consumer is falling behind, how do I add more consumer instances?" — produces an undocumented discovery of the partition ceiling. The engineer learns that adding a fifth consumer to a four-partition topic doesn't help, adds a caveat comment in the consumer code, fixes the immediate throughput problem by a different means (larger batch sizes, faster processing logic), and closes the session without documenting the partition count as a scaling constraint. The next engineer who hits the same ceiling doesn't have the comment, can't find the ChatGPT session, and repeats the same diagnosis. This is structurally identical to the database connection pooling decision record pattern where the connection limit is discovered in production rather than documented at setup time.

The new service onboarding session — "how do I write a consumer for an existing Kafka topic in Python?" — produces the consumer group name. The session gives the engineer a working code sample that sets group.id to a sensible string. The engineer uses the string from the sample, or picks a name that makes sense in isolation, without knowing what consumer group names are already in use or what the naming convention is. The collision that caused the nine-person startup's loyalty points bug was produced by exactly this session type. The ADR format that would have prevented it requires exactly one field: a section on consumer group naming convention and the isolation model it enforces.

The compliance preparation session — "how do we demonstrate that all payment events are captured and auditable?" — produces a retrospective discovery of the retention policy. The engineer checks the topic configuration, finds seven-day retention, and either accepts it (with documentation that audit logs must be backed up separately before day seven) or extends it (with a storage cost that wasn't anticipated in the infrastructure budget). The decision is made under time pressure, in a session focused on compliance artifact production, and is recorded in neither the infrastructure code nor the ADR. The next compliance audit, if it asks the same question, produces the same scramble. This is the same pattern as the CI/CD pipeline decision record where the rollback capability is discovered during an incident rather than designed in advance.

What to actually document in the message broker ADR

A message broker ADR that prevents the incidents described above is not the same document as the configuration README for the Kafka cluster. The README documents how to connect; the ADR documents why this broker was chosen over the alternatives, what constraints that choice imposes on every consumer, and what decisions were made during configuration that a future engineer would not be able to infer from the codebase alone.

The delivery guarantee contract is the most critical section. State the guarantee explicitly — at-least-once, exactly-once, best-effort-ordering — and state the consumer contract it requires. "At-least-once: every consumer must be idempotent. Use the event ID from the message envelope to gate processing; if the event ID is already present in the processed_events table, skip the message and ACK it." This is an operational contract that applies to every consumer written against this broker now and in the future. Future engineers will not infer it from "we use Kafka"; they need it stated.

The partition count and its rationale belong in the ADR. Not just the number — the reasoning. "Four partitions: chosen for an initial target of two consumer instances with 2x headroom for peak traffic. Increasing partitions beyond four requires careful planning for keyed messages (customer_id is the partition key for the order topic) because repartitioning changes the key-to-partition mapping and breaks in-flight ordering. Review before any consumer scale-out that requires more than four concurrent instances." This is the sentence that prevents the "let me just add more consumer instances" response to a falling-behind consumer — and it's the sentence that will never appear in the codebase, only in the ADR. For broader context on build-vs-buy decisions in infrastructure choices, the same reasoning about capturing tradeoffs applies directly.

The retention policy and replay window must be documented with their product implications. "Seven-day retention on all topics. This provides a replay window of seven days for new consumers, outage recovery, and consumer group resets. Any consumer that needs to replay further than seven days must source events from the PostgreSQL events archive table (schema: events(id, type, aggregate_id, payload, created_at), indexed on created_at). New consumers with backfill requirements must account for the archive-then-live consume pattern in their design." This is documentation that the backfill engineer in the opening narrative would have found in week one and used to build the right design from the start, instead of discovering it after the consumer was already built.

The consumer group naming convention section is where the partition collision bug lives. Specify the convention, why it matters, and the isolation guarantee it provides. "Consumer group names follow the pattern {service-name}-{environment}-group (e.g., notification-service-prod-group). Consumer group names must be unique per service — sharing a group name between two services causes them to split the partition assignment and each receive approximately half the messages, which appears to work correctly at low volume and fails silently at production traffic levels. Register new consumer group names in the message_broker_consumers table before using them in production." The registration step is optional; the explanation of the failure mode is not. Understanding the failure mode is what allows future engineers to recognize the collision risk when they're choosing a group name. Guidance on how to document architecture decisions consistently suggests this level of operational context over bare configuration values.

The dead letter queue strategy deserves its own section because its absence is the most common broker configuration gap. "Every queue/topic consumer is configured with a DLQ. Messages that fail processing after three delivery attempts are moved to {topic-name}-dlq. An alert fires to the #infrastructure Slack channel when the DLQ depth exceeds zero. The on-call engineer's first action on a DLQ alert is to inspect the message payload and the consumer error log, not to delete the DLQ messages. DLQ messages represent application bugs or data corruption; they are not operational noise. Procedures for DLQ investigation and replay are in the runbook at docs/runbooks/message-broker-dlq.md." The alert-on-zero-depth requirement is the key: a DLQ that silently fills with poison pill messages while the team doesn't know it exists is not a safety net — it's a deferred incident.

The schema evolution policy closes the gap that caused the reconciliation service backfill breaks. "Message schemas follow backward-compatible evolution: adding fields is allowed; renaming or removing fields requires a deprecation period with both old and new fields present simultaneously. Breaking schema changes require a new topic version (e.g., order-placed-v2) rather than an in-place modification of the existing topic schema. Schema changes are registered in the message_schemas table with the date of introduction and a description of the change, so backfill consumers can handle schema variants by checking the message's schema_version field." This policy does not require Schema Registry or Avro — it only requires a documented convention and a schema_version field in the message envelope. Both are cheaper to implement than six engineer-weeks of retroactive backfill repair.

The open-source WhyChose extractor surfaces these decisions from existing AI chat history — the setup sessions, the scaling incident sessions, the onboarding sessions — where the decisions were made. Running the extractor on Kafka-related chat threads frequently surfaces the original partition count rationale ("four partitions because we expected two instances"), the original retention rationale ("seven days is fine for now"), and the original consumer group name discussion — the decisions that exist in the chat history but have never been written into an ADR. The extraction output gives teams a starting point for writing the broker ADR retroactively, before the next compliance audit or backfill requirement makes the gaps expensive.

The ADR template for message broker selection

The template below follows the Nygard format extended with broker-specific sections. It is not a complete substitute for a decision record tailored to the actual broker and context — it is a checklist of the sections that prevent the incidents described above.

# ADR-NNN: Message broker selection

## Status
Accepted / Proposed / Superseded by ADR-NNN

## Context
[What event-driven patterns does this broker need to support?
What are the message volume, throughput, and latency requirements?
What downstream consumers need to consume from this broker, now and
in the foreseeable future? What replay or audit requirements exist?]

## Decision
We will use [Kafka / RabbitMQ / SQS / Redis Streams] for [scope].

## Delivery guarantee
[at-least-once / exactly-once / best-effort-ordering]
Consumer contract: [what idempotency or deduplication approach
each consumer must implement]

## Partition count and rationale (Kafka / Redis Streams)
[Number of partitions, partition key strategy, rationale for count,
conditions under which partition count should be reviewed]

## Retention policy and replay window
[Retention duration, what replay window it provides, where to source
events for replay beyond the retention window]

## Consumer group naming convention
[Convention, uniqueness requirement, registration process,
description of the failure mode when names collide]

## Dead letter queue strategy
[DLQ configuration, alert threshold, investigation procedure,
link to runbook]

## Schema evolution policy
[Backward-compatible change rules, breaking change procedure,
schema versioning in message envelope]

## Consequences
[Positive: capabilities this broker provides.
Negative: operational complexity, cost, constraints on consumers.
Risks: what needs to be monitored or revisited as load grows.]

The sections that teams consistently skip are delivery guarantee (the consumer contract is assumed), partition count rationale (the number is set but the reasoning is not), and the schema evolution policy (assumed to be "don't break things" until something breaks). Those three sections are the ones whose absence causes the incidents. Write them before the first consumer is built, not after the first backfill fails.

Message broker decisions share the same structural characteristic as queue and messaging pattern decisions: the choice looks reversible at the beginning ("we can always switch later") and reveals its true migration cost only after the topology has been built out — after twelve services are consuming from eight topics with established schema contracts and consumer group conventions. The ADR is not bureaucratic overhead. It is the document that makes "switch later" possible without six weeks of archaeology.

Frequently asked questions

What is the difference between at-least-once and exactly-once message delivery?

At-least-once delivery means the broker guarantees that every message will be delivered to consumers at least one time, but does not guarantee that it will be delivered exactly one time. A message may be delivered more than once in failure scenarios: if a consumer receives a message, processes it successfully, but fails before sending an acknowledgment back to the broker, the broker will redeliver the message to another consumer. The consumer's processing logic must be idempotent. Exactly-once delivery means the broker guarantees that every message is delivered to consumers exactly one time, regardless of failures — achieved via transactional coordination between producer, broker, and consumer. Kafka's exactly-once requires the producer transactional API and consumers with isolation.level=read_committed. SQS FIFO provides deduplication within a 5-minute window via message deduplication IDs. Most applications achieve acceptable behavior with at-least-once delivery plus idempotent consumers, which has better throughput and simpler operational characteristics than exactly-once transactional semantics.

Why does Kafka's partition model limit how many consumers can process a topic in parallel?

In Apache Kafka, each partition can be consumed by exactly one consumer within a consumer group at a time. The partition count is therefore a hard ceiling on the parallelism achievable within a single consumer group. A topic with four partitions supports a maximum of four active consumer instances per consumer group — the fifth instance sits idle. Increasing consumer parallelism beyond the partition count requires either increasing the partition count (which may break keyed message ordering) or restructuring the application to use multiple consumer groups. Consumer group isolation — each group maintaining its own committed offset — is Kafka's fan-out model: multiple independent services can each consume every message from the same topic without affecting each other's read position.

When should a team use AWS SQS instead of Kafka?

SQS is the appropriate choice when: the team is already in AWS and wants managed infrastructure without running a Kafka cluster; the message processing pattern is work queue distribution (one consumer processes each message once) rather than event log fan-out (multiple consumer groups each see every message); and replay of previously consumed messages is not a requirement. SQS does not retain messages after successful consumption — consumed messages are gone, with no replay capability. SQS is appropriate for fire-and-forget work distribution (sending emails, resizing images, processing payment callbacks) and inappropriate for event sourcing, audit trails, or patterns where new consumers need to backfill from history. The operational simplicity advantage for SQS is significant for teams without Kafka expertise: no cluster to size, no partition count to choose, no consumer lag alerting to wire up.

What should a message broker ADR document that teams typically skip?

Teams typically document the broker name and nothing else. The ADR sections that prevent incidents are: (1) the delivery guarantee and its idempotency contract — every consumer must understand whether they need to handle duplicates; (2) the partition count and its rationale — the partition count is a hard ceiling on consumer group parallelism, and future engineers who try to scale consumers need to know this ceiling and why it was set; (3) the retention policy and the replay window it provides — set at topic creation and often forgotten until a new consumer needs to replay beyond the window; (4) the consumer group naming convention — sharing a consumer group name between two services silently splits the message stream, a failure mode that is invisible at low volume; (5) the dead letter queue strategy — the absence of a DLQ strategy means failed messages either block the consumer or disappear silently; (6) the schema evolution policy — breaking schema changes discovered during backfill are expensive to recover from if there is no documented evolution contract.