The event-driven architecture decision record: why the event schema you chose determines your consumer coupling surface and your schema evolution cost

2026-07-03 · Decision record · Event-driven architecture · Schema design · Microservices

Fourteen months after a 35-person fintech company adopts an event-driven architecture for their payment processing service, eleven microservices consume the payment.completed event: fraud detection, loyalty points accrual, tax calculation, financial reporting, email notifications, CRM sync, analytics ingestion, billing reconciliation, order fulfillment, shipping trigger, and audit logging. The event schema the payments team designed on the first day is a flat structure: { paymentId, userId, amount, currency, status, timestamp }. It was designed in an afternoon and never reviewed as a contract — it was just the fields the first consumer, the email notification service, happened to need at the time.

At that point, the payments team begins building multi-vendor support. Transactions will now carry a merchantId that is required for tax calculation (different merchants are registered in different tax jurisdictions), for fraud detection (merchant-specific risk models must be applied), and for financial reporting (regulatory reports must be broken down by merchant). The field does not exist in the current event schema. The payments team drafts a schema change: add merchantId as a required field. The schema team reviews the change and discovers that the fraud detection service uses strict deserialization: it validates that the incoming event contains exactly the six expected fields and rejects events with unexpected fields. Adding merchantId as a required field makes the new events structurally incompatible with the fraud detection consumer as currently deployed. The tax service stores the event payload in a raw JSONB column, and its downstream transformations hard-code the exact column set; adding a new field does not break ingestion, but the tax transformation pipeline silently drops the new field. Two analytical pipelines register the full event schema as a schema-on-read spec that is hard-coded in their Spark jobs; adding a required field triggers a schema validation error at the job's deserialization step.

The rollout takes eleven weeks. It requires synchronized deployment coordination across six teams, a dual-publish window where the payments service publishes to both payment.completed.v1 and payment.completed.v2 topics simultaneously for eight weeks while the consumer services migrate, and two weekend maintenance windows for the services that cannot be updated without taking a maintenance hold on their processing pipelines. The engineering manager who coordinates the migration estimates the effort at 340 person-hours across all involved teams. Nobody had written down what the contract of payment.completed was, which services consumed it, what field-level coupling each consumer had, or whether the schema guaranteed backward compatibility for additive changes.

The second failure looks different but traces to the same undocumented event boundary decision. An e-commerce platform's inventory service begins publishing inventory.updated events when stock levels change. The team that designs the event decides to embed a complete product snapshot: { inventoryId, productId, sku, quantity, warehouseLocation, productName, productDescription, productImages, productCategories, price, weight, dimensions }. The rationale, never written down, is that embedding the full snapshot is convenient — consumers receive everything they need without making additional calls to the catalog service. Three consumer services build their product display logic directly against the embedded snapshot fields, treating the event as their authoritative source of product data.

Sixty days later, the catalog team completes a product service rewrite. The new service changes the productImages representation from an array of URL strings to an array of objects with { url, alt, width, height, format }. The catalog team updates their API responses. They do not update the inventory service — the inventory service is a different team's codebase, and the catalog team is not aware that the inventory service embeds a copy of the product schema in its events. For six weeks, the inventory service continues to embed productImages in the old format. The three consumer services that built against the embedded snapshot are now producing incorrect alt texts and missing image dimensions in their rendering output. The failure is discovered during a mobile app accessibility audit. The root cause investigation takes three days because the coupling was not documented: nothing recorded that the inventory event embedded a copy of the product schema structure, that the embedded schema was implicitly owned by the catalog team but maintained by the inventory team, or that the image field format change would propagate through the event to all inventory event consumers.

Both systems adopted event-driven architecture to decouple services. Both designed their initial event schemas quickly without writing down the contract. The event-driven architecture decision record is the document that makes the schema contract explicit — including the fields and their types, the consumers and their coupling depth, the versioning guarantee, the payload boundary policy, and the migration procedure for breaking changes.

What an event-driven architecture decision record covers

Adopting event-driven architecture does not replace service coupling with decoupling — it replaces synchronous coupling with asynchronous coupling against a schema contract. The difference is that synchronous coupling is visible at runtime (a broken API call fails immediately and the error is easy to trace), while asynchronous coupling against an event schema is invisible until a schema change breaks a consumer (and the break may be silent if the consumer uses lenient deserialization). The event schema is the contract, and the event-driven architecture ADR is the document that makes the contract explicit.

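The five decisions that belong in an event-driven architecture ADR are: event schema structure and field ownership; schema versioning strategy and backward-compatibility policy; consumer registry and coupling surface documentation; event granularity and payload boundary policy; and schema evolution procedure and breaking-change migration plan. Each gets its own section below, after the three structural properties that make these decisions necessary.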

Three structural properties that the event schema decides

1. The consumer coupling surface and its visibility

The consumer coupling surface is the complete set of consumers that depend on specific fields in a published event schema, together with the type of coupling each consumer has. Coupling type matters as much as the number of consumers: a consumer that uses lenient deserialization and only accesses two specific fields has a narrow, shallow coupling — it can tolerate additive changes, field reordering, and the addition of new nested structures without modification. A consumer that uses strict deserialization (validating that the event contains exactly the expected field set and no others), that depends on the complete field set rather than a named subset, or that stores the raw event payload for downstream processing has broad, deep coupling — almost any schema change, including additive ones, may require coordinated migration.
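The difference between the two deserialization styles can be sketched in a few lines. This is a minimal illustration, assuming JSON-encoded events and the six-field payment.completed schema from the opening example; the function names and the choice of paymentId and amount as the "two specific fields" are hypothetical.

```python
import json

EXPECTED_FIELDS = {"paymentId", "userId", "amount", "currency", "status", "timestamp"}


def parse_strict(raw: str) -> dict:
    """Reject any event whose field set is not exactly the expected six fields."""
    event = json.loads(raw)
    if set(event) != EXPECTED_FIELDS:
        difference = sorted(set(event) ^ EXPECTED_FIELDS)
        raise ValueError(f"unexpected field set, differing fields: {difference}")
    return event


def parse_lenient(raw: str, needed: tuple[str, ...] = ("paymentId", "amount")) -> dict:
    """Read only the named fields and ignore everything else in the payload."""
    event = json.loads(raw)
    missing = [name for name in needed if name not in event]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {name: event[name] for name in needed}


v1 = json.dumps({
    "paymentId": "p-1", "userId": "u-1", "amount": 1250,
    "currency": "EUR", "status": "completed",
    "timestamp": "2026-07-03T10:00:00Z",
})
# The additive change: the same event with merchantId appended.
v2 = json.dumps({**json.loads(v1), "merchantId": "m-7"})
```

With these definitions, parse_lenient(v2) succeeds and returns only the two fields it asked for, while parse_strict(v2) raises ValueError. The same additive change is harmless to one consumer and breaking to the other, which is why coupling type has to be recorded per consumer.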

The coupling surface is typically invisible at schema design time because it is determined by how consumers are implemented, not by how the schema is documented. The schema owner knows what fields they publish. The schema owner does not know whether the fraud detection team uses strict or lenient deserialization, whether the analytics team registered the event schema as a hard-coded Spark job spec, or whether the billing team stores the raw payload in a column that has a downstream transformation pipeline with its own field expectations. This information is only in the consumer codebases, which the schema owner may not have read.

Making the coupling surface visible requires a consumer registry: a document that the schema owner maintains (or that is automatically populated by a schema registry) that lists, for each event type, every known consumer service, the fields each consumer accesses by name, and the deserialization behavior each consumer uses. The consumer registry does not prevent coupling — it makes existing coupling discoverable before a schema change rather than after. The practical value is the ability to evaluate the blast radius of a proposed schema change before beginning migration: "This change affects 3 of our 11 consumers. Of those 3, two use strict deserialization and require coordinated migration; one uses lenient deserialization but stores the raw payload and has a downstream transformation that must be updated. Estimated migration effort: 4 weeks." Without the registry, that estimate is not available until each consumer team has reviewed the proposed change independently.
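A consumer registry can start as a small structured record per consumer plus a query over it. The sketch below is one possible shape, not a prescribed format: the ConsumerEntry fields mirror the attributes described above, and the per-service field lists are invented for illustration (only the strict deserialization in fraud detection and the raw payload storage in tax calculation come from the opening example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerEntry:
    service: str
    fields_accessed: frozenset[str]
    strict_deserialization: bool
    stores_raw_payload: bool


def blast_radius(registry: list[ConsumerEntry], changed_fields: set[str],
                 additive: bool) -> list[str]:
    """Return the services that must review a proposed schema change."""
    affected = []
    for entry in registry:
        touches_field = bool(entry.fields_accessed & changed_fields)
        whole_payload = entry.strict_deserialization or entry.stores_raw_payload
        # Additive changes reach only consumers tied to the whole payload;
        # other changes also reach consumers that read the changed fields.
        if whole_payload or (not additive and touches_field):
            affected.append(entry.service)
    return affected


# Hypothetical registry entries for payment.completed.
registry = [
    ConsumerEntry("fraud-detection", frozenset({"paymentId", "amount"}), True, False),
    ConsumerEntry("tax-calculation", frozenset({"amount", "currency"}), False, True),
    ConsumerEntry("email-notifications", frozenset({"userId", "status"}), False, False),
]
```

Calling blast_radius(registry, {"merchantId"}, additive=True) returns the fraud detection and tax calculation services, which is the kind of answer the payments team in the opening example needed before drafting the change rather than after.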

2. The schema evolution cost and the backward-compatibility ceiling

Schema evolution cost is the effort required to change a published event schema while keeping all existing consumers working correctly. It is determined by three factors: consumer count, coupling depth, and the versioning strategy's compatibility ceiling. Consumer count is the most obvious factor — more consumers means more teams that must review the change, coordinate their migration, and deploy their updated service before the migration window can close. But coupling depth and versioning strategy often dominate cost even with a small consumer count.

The versioning strategy sets the backward-compatibility ceiling: the maximum change that can be made without requiring consumer migration. A pure backward-compatible evolution strategy (additive changes only, all changes must be backward-compatible with existing consumers using lenient deserialization) has a ceiling that allows adding new optional fields and new event types, but prohibits removing fields, renaming fields, changing field types, or changing field semantics. A versioned-topic strategy (publish breaking changes to a new topic such as payment.completed.v2) has no ceiling on the schema change itself but has a migration cost for every breaking change: dual-publish windows, consumer migration coordination, and topic cleanup after all consumers have migrated. A schema registry with compatibility enforcement (the BACKWARD compatibility mode in Confluent Schema Registry, for example) has a machine-enforced ceiling that catches compatibility violations before they reach production, but only if all consumers use the schema registry for deserialization.

The cost calculation that belongs in the ADR is: for the chosen versioning strategy, what is the expected cost of a breaking schema change? This is not a constant — it depends on the consumer count and coupling depth at the time of the change, which will grow over the life of the event type. An event type with 3 consumers today may have 12 consumers in two years. The migration cost of a breaking change grows proportionally with consumer count unless the consumer registry is maintained and coupling depth is actively managed (by requiring lenient deserialization in consumer codebases and prohibiting direct dependency on the complete field set). The ADR should document both the current cost estimate and the cost growth model as the consumer base scales.

3. The event boundary and embedded-snapshot risk

The event boundary determines what information belongs in the event payload and what consumers should fetch from the owning service at processing time. This is a spectrum: at one end, events carry only an identifier and a minimal event context (a stable event type name and the timestamp); at the other end, events carry complete entity snapshots including all attributes of every referenced entity at the time of the event. Most events fall somewhere between these extremes.

The risk of embedding entity snapshots in events is schema coupling to the embedded entity. When an event embeds a product snapshot, the event schema is implicitly coupling to the product schema's structure. If the catalog team changes the product schema — adding a field, renaming a field, restructuring a nested object — the embedded snapshot in the event is now stale or structurally different from the current product schema. If the event producer does not update the embedded snapshot format when the product schema changes, consumers that depend on the embedded format receive outdated data silently. If the event producer does update the embedded format, the schema change propagates through the event to all consumers of the event, even consumers that did not intend to depend on the product schema at all.

The correct event boundary is determined by the schema change rate of the referenced entity and the ownership model of the embedded fields. Fields that are semantically part of the event itself — the fields that describe what happened, not the complete state of entities involved — have low schema change risk and belong in the event payload. Fields that describe the complete current state of a referenced entity owned by a different team have high schema change risk and should be referenced by identifier rather than embedded. A practical rule: if a field could change in a separate team's codebase without the event producer team necessarily knowing about it, that field should be a lookup (the consumer fetches current state) rather than an embedding (the producer includes a copy of state). This rule keeps the event schema's field ownership entirely within the event producer's team, eliminating the class of silent coupling failures that arise from cross-team schema embedding.
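The practical rule above can be applied mechanically once each candidate field has a named owner. The following sketch assumes a simple ownership map for the inventory.updated candidates; the team assignments (including treating price as catalog-owned and productId as part of the event itself) are assumptions made for illustration.

```python
def apply_boundary_rule(candidates: dict[str, str],
                        producer_team: str) -> tuple[list[str], list[str]]:
    """Split candidate fields into (embed, lookup) by owning team.

    Fields owned by the producer stay in the payload; fields owned by any
    other team are left for consumers to fetch by identifier.
    """
    embed = [name for name, owner in candidates.items() if owner == producer_team]
    lookup = [name for name, owner in candidates.items() if owner != producer_team]
    return embed, lookup


# Hypothetical ownership map: field name -> team that governs its structure.
candidates = {
    "inventoryId": "inventory",
    "productId": "inventory",   # the reference identifier travels with the event
    "quantity": "inventory",
    "warehouseLocation": "inventory",
    "productName": "catalog",
    "productImages": "catalog",
    "price": "catalog",
}

embed, lookup = apply_boundary_rule(candidates, producer_team="inventory")
```

Here embed holds the four inventory-owned fields and lookup holds productName, productImages, and price. Under this rule the productImages format change in the second failure would never have travelled through the inventory event.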

Five ADR sections for an event-driven architecture decision record

1. Event schema structure and field ownership

Document the complete field set for each event type, the data type and nullability of each field, and the owning team for each field. Field ownership is the most critical documentation in the event schema ADR because it determines who must be consulted for any schema change and who bears the migration responsibility when a change is required. A field that is owned by the event producer team can be changed or removed by that team alone. A field that describes the state of an entity owned by a different team is implicitly co-owned — the event producer has publishing responsibility but the field's meaning and structure are governed by the owning team's schema decisions.

Document which fields are part of the canonical event contract and which are convenience inclusions. The canonical event contract is the minimum field set that makes the event meaningful: the identifier of the entity that changed, the type of change, and the timestamp. Convenience inclusions are additional fields added to save consumers from making a lookup call — they describe context that the consumer could obtain from another service but that would require an additional API call. Canonical contract fields are permanent: removing them is a breaking change that requires full consumer migration. Convenience inclusions should be explicitly marked as such so that future schema owners understand that their removal may be feasible with lower migration cost than canonical field removal.
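One way to keep this documentation checkable is to hold it as data next to the schema. The sketch below records type, nullability, owner, and canonical status for the payment.completed fields. The classification shown is purely illustrative: treating userId as a convenience inclusion and assigning it to a hypothetical "accounts" team are assumptions, not facts about the system described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: str
    nullable: bool
    owner: str       # team that governs the field's meaning and structure
    canonical: bool  # False marks a convenience inclusion


# Illustrative record; owners and the canonical split are assumptions.
PAYMENT_COMPLETED = [
    FieldSpec("paymentId", "string", False, "payments", True),
    FieldSpec("status", "string", False, "payments", True),
    FieldSpec("timestamp", "string (ISO 8601)", False, "payments", True),
    FieldSpec("amount", "integer (minor units)", False, "payments", True),
    FieldSpec("currency", "string (ISO 4217)", False, "payments", True),
    FieldSpec("userId", "string", False, "accounts", False),
]


def convenience_fields(schema: list[FieldSpec]) -> list[str]:
    """Fields whose removal may be feasible at lower migration cost."""
    return [spec.name for spec in schema if not spec.canonical]


def teams_to_consult(schema: list[FieldSpec], field_names: set[str]) -> list[str]:
    """Owning teams that must be consulted before changing the named fields."""
    return sorted({spec.owner for spec in schema if spec.name in field_names})
```

A proposed change touching amount and userId would then surface both owning teams through teams_to_consult before any design work starts.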

For each event type, document the rejected field candidates and the reasons for rejection. The most common rejection reasons are: "this field is owned by a different team and embedding it creates implicit coupling to their schema" (the event boundary decision), "this field requires a database lookup at publish time and increases publish latency by 30ms at our event volume" (the performance cost of embedding), and "this field is derivable from the existing fields and the consumer's own service state" (reducing coupling to the minimum information set). Documenting rejected fields prevents them from being re-proposed without understanding why they were previously declined.

2. Schema versioning strategy and backward-compatibility policy

Document the versioning strategy explicitly: whether the event schema uses backward-compatible evolution (additive changes only), explicit version fields embedded in the event payload, versioned topic names (payment.completed.v2), or a schema registry with machine-enforced compatibility rules. Each strategy has a different compatibility ceiling and a different migration procedure for changes that exceed the ceiling.

For backward-compatible evolution, document the precise definition of a backward-compatible change for this event type. The general definition (adding optional fields is backward-compatible; removing fields is not) is insufficient because consumer behavior determines what is actually compatible. If any consumer uses strict deserialization that rejects events with unexpected fields, then adding an optional field is not backward-compatible for that consumer. The backward-compatibility definition in the ADR must therefore reflect the actual deserialization behavior of all consumers, not the theoretical schema compatibility rules. If any consumer uses strict deserialization, document this in the consumer registry (section 3) and reflect it in the versioning policy: this event type cannot guarantee backward-compatible evolution for additive changes until the strict-deserializing consumers are updated to use lenient deserialization.

For versioned-topic strategies, document the dual-publish window duration and the consumer migration deadline. The dual-publish window is the period during which the event producer publishes events to both the old and new topic versions simultaneously. This window must be long enough for every consumer team to complete their migration — testing, review, and deployment in their release cadence. The consumer migration deadline is the date after which the old topic version will stop receiving events. Both dates must be communicated to all consumer teams at the start of the migration, not discovered through a topic deprecation warning in production. Document the notification lead time requirement: how far in advance must consumer teams receive notice of a versioned-topic migration before the migration begins?

3. Consumer registry and coupling surface documentation

Maintain a consumer registry for each event type that enumerates every known consumer service, the fields they access by name, their deserialization approach (strict or lenient), and whether they store the raw event payload for downstream processing. The consumer registry is the primary tool for evaluating the blast radius of a schema change before beginning migration. It should be updated when a new consumer is added, when an existing consumer changes its field access pattern, or when a consumer is decommissioned.

Document the coupling depth for each consumer entry. Coupling depth has three levels: identity coupling (the consumer only checks for the event type, not specific field values), field coupling (the consumer accesses specific named fields and has business logic that depends on their type and semantics), and structural coupling (the consumer depends on the complete field structure, uses strict deserialization, or stores the raw payload for processing by a system that has its own hard-coded field expectations). Identity-coupled consumers can be ignored for most schema changes. Field-coupled consumers require review for any change to the fields they access. Structural-coupled consumers require review for almost any schema change, including additive ones.
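The three levels and their review rules can be written down as a small classification function. This is a sketch under simplifying assumptions: any consumer that stores the raw payload is treated as structurally coupled (a conservative reading of the definition above), and the parameter names are hypothetical.

```python
from enum import Enum


class Coupling(Enum):
    IDENTITY = "identity"
    FIELD = "field"
    STRUCTURAL = "structural"


def coupling_depth(fields_accessed: set[str], strict: bool,
                   stores_raw_payload: bool,
                   needs_full_field_set: bool = False) -> Coupling:
    """Classify a registry entry into one of the three coupling levels."""
    if strict or stores_raw_payload or needs_full_field_set:
        return Coupling.STRUCTURAL
    if fields_accessed:
        return Coupling.FIELD
    return Coupling.IDENTITY


def needs_review(depth: Coupling, touches_accessed_field: bool) -> bool:
    """Apply the review rule: structural always, field only when touched."""
    if depth is Coupling.STRUCTURAL:
        return True
    if depth is Coupling.FIELD:
        return touches_accessed_field
    return False
```

For example, a consumer that reads only userId and status with lenient deserialization classifies as FIELD and needs review only when one of those two fields changes.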

Document the consumer update latency for each consumer entry: how long does it take from the time a consumer team receives a migration request to the time they can deploy an updated consumer in production? This is determined by the team's release cadence, their review and testing requirements, and their deployment pipeline speed. A consumer team with a two-week sprint cadence and a one-week deployment pipeline has a minimum update latency of three weeks from request to production. The dual-publish window for any schema migration must exceed the maximum consumer update latency across all affected consumers, not the average. Documenting these latencies at registration time makes migration timeline planning a lookup rather than a coordination effort.
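The arithmetic behind that lookup is short. The sketch below assumes latencies are recorded in whole weeks and that "exceed" is implemented as a one-week buffer on top of the slowest consumer; the buffer size and the per-service figures are illustrative assumptions.

```python
def update_latency_weeks(sprint_weeks: int, pipeline_weeks: int) -> int:
    """Request-to-production latency: one sprint plus the deployment pipeline."""
    return sprint_weeks + pipeline_weeks


def minimum_window_weeks(latencies: dict[str, int], buffer_weeks: int = 1) -> int:
    """The dual-publish window is sized by the slowest consumer, not the average."""
    return max(latencies.values()) + buffer_weeks


# Hypothetical latencies recorded in the consumer registry.
latencies = {
    "fraud-detection": update_latency_weeks(sprint_weeks=2, pipeline_weeks=1),
    "tax-calculation": 2,
    "email-notifications": 1,
}
```

With these figures the slowest consumer needs three weeks, so minimum_window_weeks(latencies) gives a four-week window, even though the average latency is only two weeks.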

4. Event granularity and payload boundary policy

Document the event granularity decision: whether the event type represents a fine-grained state transition (every field change generates an event) or a coarse-grained business fact (only semantically meaningful business events are published). Fine-grained events have lower information loss (consumers can reconstruct the complete state change history) but higher volume and more fragile consumer logic (consumers must filter for the specific state transitions they care about, and the filter logic is coupled to the event structure). Coarse-grained events have lower volume and more stable consumer logic (consumers receive exactly the business events they care about) but require the event producer to determine which state transitions constitute meaningful business events — a judgment call that may be wrong for some consumer use cases.

Document the payload boundary policy as a rule that applies to all events published by this service. The rule should specify: which entity types may be embedded in events (typically: entities that are fully owned by the publishing service and whose schema changes require the publishing team's involvement anyway), which entity types must be referenced by identifier only (entities owned by other teams), whether derived fields may be embedded (fields computed from the canonical event data and the publishing service's own database), and the maximum payload size target and how it is enforced. A payload size ceiling (for example, 64 KB) prevents events from growing unboundedly as convenience inclusions accumulate over time.
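A payload size ceiling is only useful if something enforces it. One option, sketched here under the assumption of JSON-encoded events and the 64 KB example figure, is to check the serialized size at publish time; encode_event is a hypothetical helper, not part of any broker client.

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024  # the example ceiling; tune to the broker's limits


def encode_event(event: dict, max_bytes: int = MAX_PAYLOAD_BYTES) -> bytes:
    """Serialize an event and refuse to publish it if it exceeds the ceiling."""
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    if len(payload) > max_bytes:
        raise ValueError(
            f"event payload is {len(payload)} bytes, over the {max_bytes}-byte ceiling"
        )
    return payload
```

Running the same check in the producer's test suite against representative events catches payload growth from new convenience inclusions before it reaches production.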

Document the entity reference convention: when a consumer needs to look up the current state of a referenced entity from the event, which service is the authoritative source, what endpoint or API should be called, and what SLA does that endpoint carry? The consumer lookup approach shifts the latency cost from publish time to processing time, but it also creates a processing-time dependency on the owning service's availability and response time. If the owning service is unavailable at event processing time, the consumer must decide whether to retry (delay processing), proceed without the referenced data (accept incomplete data), or dead-letter the event (skip processing until the owning service recovers). The event boundary policy should specify the expected consumer behavior for each of these cases so that all consumers make a consistent choice rather than each consumer team independently deciding how to handle lookup failures.
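The three lookup-failure behaviors can be captured in one shared helper so that every consumer makes the same choice. This is a sketch, assuming the owning service's unavailability surfaces as ConnectionError and that the event carries a productId; fetch is a hypothetical stand-in for the real client call.

```python
import time
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    PROCESSED = "processed"
    PROCESSED_PARTIAL = "processed_without_reference_data"
    DEAD_LETTERED = "dead_lettered"


def process_with_lookup(event: dict, fetch: Callable[[str], dict],
                        max_attempts: int = 3, allow_partial: bool = False,
                        backoff_seconds: float = 0.0) -> tuple[Outcome, Optional[dict]]:
    """Retry the lookup, then either proceed without the data or dead-letter."""
    for attempt in range(max_attempts):
        try:
            return Outcome.PROCESSED, fetch(event["productId"])
        except ConnectionError:
            if attempt < max_attempts - 1:
                # Exponential backoff between attempts delays processing.
                time.sleep(backoff_seconds * (2 ** attempt))
    if allow_partial:
        return Outcome.PROCESSED_PARTIAL, None
    return Outcome.DEAD_LETTERED, None
```

The allow_partial flag is where the policy lives: a consumer that renders product pages might set it to False, while an analytics consumer that can backfill later might set it to True.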

5. Schema evolution procedure and breaking-change migration plan

Document the definition of a breaking change for this event type, given the actual consumer implementations recorded in the consumer registry. A breaking change is any schema change that requires one or more consumers to update their processing code before the change is deployed in production. For this event type, given the consumer implementations currently registered, the following changes are breaking: removing any existing field; changing the data type of any existing field; renaming any existing field; changing the semantics of any existing field (for example, changing the unit of the amount field from cents to dollars without changing the field name); and adding a required field to the canonical contract. The following changes are additive and non-breaking for lenient-deserializing consumers but must be verified against the consumer registry before deployment: adding a new optional field, adding a new nested object, and adding a new event type.
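Most of these rules can be checked automatically by comparing the old and new schema. The sketch below assumes a simple schema representation of the form {field: {"type": ..., "required": ...}}, which is a hypothetical format chosen for brevity. It cannot detect semantic changes such as a unit change in amount, and a rename shows up as a removal plus an addition.

```python
def classify_change(old: dict[str, dict], new: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare two schemas and label each difference.

    Returns (label, description) pairs, where label is "breaking" or
    "verify-against-registry".
    """
    findings = []
    for name, spec in old.items():
        if name not in new:
            findings.append(("breaking", f"removed field {name}"))
        elif new[name]["type"] != spec["type"]:
            findings.append(("breaking", f"changed type of {name}"))
    for name, spec in new.items():
        if name not in old:
            if spec["required"]:
                findings.append(("breaking", f"added required field {name}"))
            else:
                findings.append(("verify-against-registry", f"added optional field {name}"))
    return findings


v1_schema = {
    "paymentId": {"type": "string", "required": True},
    "amount": {"type": "integer", "required": True},
}
v2_schema = {**v1_schema, "merchantId": {"type": "string", "required": True}}
```

Here classify_change(v1_schema, v2_schema) reports the merchantId addition as breaking. Had the field been optional, it would be flagged for verification against the consumer registry instead, since strict-deserializing consumers may still reject it.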

Document the notification and migration lead time requirement. Before any breaking schema change is deployed, every consumer team listed in the consumer registry for the affected event type must receive a migration request that includes: the exact change being made, the reason for the change, the proposed dual-publish window start and end dates, the actions required by the consumer team, the contacts for questions, and a confirmation deadline (the date by which the consumer team must confirm they have reviewed the migration request and assigned migration work to a sprint). The lead time between the notification and the dual-publish window start must be at least equal to the maximum consumer update latency across all affected consumers, plus a buffer for unanticipated complications. The minimum required lead time for this event type is specified in this section, derived from the consumer update latencies recorded in the consumer registry.

Document the dual-publish window procedure step by step. The procedure should specify: which topic names are involved (old and new, or old and versioned), how the event producer is configured to publish to both simultaneously (a feature flag, a configuration parameter, or an application code change), the monitoring that confirms both topics are receiving events during the window, the consumer migration confirmation process (each consumer team marks their migration as complete in a shared tracking document or a schema registry consumer group), the routing switch mechanism that stops publishing to the old topic after all consumers have migrated, and the old topic retention policy after the switch (how long the old topic is retained for late-migrating consumers before it is deleted). Document the rollback procedure for each step: if the dual-publish configuration fails, how is single-topic publishing restored? If a consumer migration produces unexpected behavior after routing to the new topic, what is the procedure for temporarily routing that consumer back to the old topic while the issue is resolved?
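The producer side of the dual-publish window can be reduced to a three-state switch. The sketch below assumes the switch is driven by a configuration value and that the v1 event is the v2 event without merchantId; send is a hypothetical callable standing in for the real broker client, and the topic names follow the payment.completed example.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    OLD_ONLY = "old_only"  # before the window, and the rollback state
    DUAL = "dual"          # during the dual-publish window
    NEW_ONLY = "new_only"  # after every consumer has confirmed migration


def publish(event: dict, mode: Mode, send: Callable[[str, dict], None],
            old_topic: str = "payment.completed.v1",
            new_topic: str = "payment.completed.v2") -> list[str]:
    """Publish one event according to the current mode; return the topics used."""
    targets = []
    if mode in (Mode.OLD_ONLY, Mode.DUAL):
        # Down-convert to the v1 shape so strict v1 consumers keep working.
        v1_event = {key: value for key, value in event.items() if key != "merchantId"}
        send(old_topic, v1_event)
        targets.append(old_topic)
    if mode in (Mode.DUAL, Mode.NEW_ONLY):
        send(new_topic, event)
        targets.append(new_topic)
    return targets
```

Rolling back a failed dual-publish configuration then means setting the mode back to OLD_ONLY, and the returned topic list gives monitoring a direct signal of which topics each event reached.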

Further reading