What is a saga pattern and why does it become necessary when a business operation crosses service boundaries?

A saga is a sequence of local transactions, one per participating service, coordinated to complete a business operation that spans multiple service boundaries. When a business operation must be atomic — either all changes succeed or none persist — and all of the data involved lives in a single database, the operation can be wrapped in a database transaction and either committed or rolled back atomically. When the data is split across multiple services, each with its own database, there is no shared transaction boundary. The saga pattern replaces the atomic transaction with a sequence of local transactions and a set of compensating transactions that undo each local commit if a later step fails. If an Orders service creates an order, an Inventory service reserves stock, and a Payments service charges the customer — and the payment fails — the saga's compensation logic must explicitly cancel the inventory reservation and cancel the order. Each compensation step is a new write operation, not a rollback; if a compensation step fails, the saga is in a partially compensated state that must be detected and resolved. Sagas are correct implementations of distributed coordination — they are not workarounds. But they are significantly more complex than a database transaction: the compensation logic for a three-service saga is typically 300-500 lines of application code that handles every failure permutation, including the case where a compensation step itself fails. This implementation cost must be estimated at service boundary design time, not discovered when the first cross-service business operation is built. If the service boundary is drawn so that all data required for a business operation lives in one service's database, the saga is not needed and the operation is a single local transaction. The decision of where to draw the boundary determines which operations require sagas.
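The order, inventory, and payment example above can be sketched as a small orchestrated saga. This is a minimal illustration under stated assumptions, not a production implementation: `Step`, `SagaFailed`, and `run_saga` are hypothetical names, the "services" are in-memory lambdas, and saga state is not persisted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # local transaction in one service
    compensate: Callable[[], None]   # a new write that semantically undoes it


class SagaFailed(Exception):
    """Raised after a step fails; records what was and was not undone."""

    def __init__(self, failed_step: str, compensated: List[str], uncompensated: List[str]):
        super().__init__(f"saga failed at {failed_step}")
        self.failed_step = failed_step
        self.compensated = compensated
        self.uncompensated = uncompensated


def run_saga(steps: List[Step]) -> None:
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            compensated: List[str] = []
            uncompensated: List[str] = []
            # Undo committed steps in reverse order; each undo is itself a write.
            for prior in reversed(done):
                try:
                    prior.compensate()
                    compensated.append(prior.name)
                except Exception:
                    # Partially compensated state: surface it, do not swallow it.
                    uncompensated.append(prior.name)
            raise SagaFailed(step.name, compensated, uncompensated)


log: List[str] = []


def charge_card() -> None:
    raise RuntimeError("card declined")  # simulated payment failure


steps = [
    Step("create_order", lambda: log.append("order created"), lambda: log.append("order cancelled")),
    Step("reserve_stock", lambda: log.append("stock reserved"), lambda: log.append("reservation cancelled")),
    Step("charge_payment", charge_card, lambda: log.append("charge refunded")),
]

outcome: Optional[SagaFailed] = None
try:
    run_saga(steps)
except SagaFailed as err:
    outcome = err
```

After the run, `log` shows the two forward writes followed by the two compensating writes in reverse order. A non-empty `outcome.uncompensated` list would be the partially compensated state that has to be detected and resolved separately.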

2026-07-01 · ~20 min read

The microservices vs monolith decision record: why the service boundary you drew in the founding sprint determines your deployment coupling surface and your distributed transaction surface area

Q: When do microservices provide genuine value over a monolith, and when do they add operational complexity without commensurate benefit?

Microservices provide genuine value in three scenarios: (1) Independent scaling requirements exist at the service level. A payments service processes 1,000 requests per minute; a reporting service processes 10. Separate deployments allow separate scaling policies. In a monolith, the reporting logic scales with the payments logic even if it does not need the capacity. (2) Team ownership boundaries match service boundaries. A team of six owns a specific business domain end-to-end: the service is their unit of deployment, they set their own release cadence, they own their data store. The independence benefit materializes only when the team boundary matches the service boundary. Conway's Law applies in both directions: if your organization is a single 8-person team, a microservices architecture forces coordination overhead that a monolith does not. (3) The service has genuinely divergent operational requirements (different languages, different runtime environments, different security boundaries, different compliance scopes) that cannot be cleanly isolated within a monolith's deployment unit.

Microservices add complexity without commensurate benefit in four situations: when the team is too small to absorb the coordination overhead (below roughly 15-20 engineers, microservices typically add more overhead than they remove); when service boundaries cut across frequently co-changing code (features regularly touch multiple services simultaneously, requiring coordinated deployments); when most operations span multiple services and require distributed transactions that a monolith could handle with a single database transaction; or when the primary motivation is 'industry best practice' rather than a specific scaling, ownership, or isolation requirement that a monolith demonstrably cannot satisfy.

Microservices are adopted in the founding sprint because a previous employer used them, a conference talk praised their independence, or the founding engineer wants to avoid the "big ball of mud" they inherited at their last job. The service boundaries are drawn in one session without asking which business operations cross them or how a cross-boundary operation that requires atomicity will be handled. The boundary placement determines whether placing an order is a database transaction or a three-service saga with 400 lines of compensation logic. Both are implementable — but only one was estimated before the services were deployed.

A 14-person B2B SaaS team building a field service management platform made the microservices decision in their first week. The two co-founders had spent their careers at companies that used microservices — one at a 600-person logistics company, one at a 400-person e-commerce platform. Both associated microservices with engineering maturity. They were building a platform for managing work orders, technician dispatching, parts inventory, and customer invoicing. The natural service boundaries seemed obvious: a Work Orders service, a Scheduling service, an Inventory service, and a Billing service. Each service would own its data, deploy independently, and scale to the load its specific domain generated. The platform would be built the right way from the start.

The first eight months validated the separation for read-heavy operations. The Work Orders service returned job lists to the mobile app. The Scheduling service showed technician calendars. The Inventory service reported parts availability by warehouse location. Each service served its queries from its own database without cross-service calls. The API gateway routed requests and the services were genuinely independent for the read path. The founding engineers were pleased with the architecture's cleanliness.

The operational complexity emerged around month nine, when the product manager defined the "complete job" workflow: a technician marks a job as complete in the mobile app, the system simultaneously closes the work order, releases the reserved technician slot in the schedule, decrements the parts used from inventory, and generates a customer invoice. Four operations, four services, one business event. The team's first implementation used synchronous HTTP calls from a "workflow coordinator" in the Work Orders service: close the order, then call Scheduling to release the slot, then call Inventory to decrement parts, then call Billing to generate the invoice. If any call failed, the coordinator would attempt to undo the completed steps by making additional HTTP calls to the same services with reversal payloads.

The coordinator worked in the happy path. The failure handling was the problem. If the Billing call failed after the work order was closed, the technician slot released, and the inventory decremented, the coordinator tried to reopen the work order and re-reserve the slot. But the re-reservation call to Scheduling could fail if the technician had been booked for a new job in the milliseconds between the slot release and the compensation attempt. The system was then left in an inconsistent state: closed work order, released slot, decremented inventory, no invoice, failed re-reservation. No database transaction could fix this, because the data was in four separate databases. The team spent three weeks building a proper saga with explicit state tracking, idempotent compensation handlers, and a recovery process for sagas that stalled mid-execution. The compensation logic for the four-service "complete job" saga was 480 lines. That figure had not appeared anywhere in the founding session that drew the service boundaries; it was a cost discovered when the first business operation that required transactional coordination was built.

Three more cross-service operations emerged over the following six months: "cancel job" (compensate billing if invoice was generated, re-stock inventory if parts were reserved, reopen technician slot), "transfer job" (change technician assignment, update two scheduling records, notify both technicians), and "bulk close jobs" (batch "complete job" operations for end-of-month invoicing). Each required its own saga. By month eighteen the saga implementations totaled 2,100 lines, maintained by two engineers who were the only people who understood the compensation logic well enough to modify it safely. A new engineer who joined in month sixteen described the saga codebase as "the part of the repo where I need to run changes past someone who was here before." The service boundaries had been drawn based on data ownership intuition, not on an analysis of which business operations crossed them and what coordination mechanism those crossings would require.

A 9-person marketplace startup building a peer-to-peer equipment rental platform adopted microservices on day one. The CTO had joined from a large company that ran 200 microservices and viewed them as the natural unit of software organization. The founding team of four engineers each owned a service: Users, Listings, Bookings, and Payments. Each service had its own repository, its own PostgreSQL database, its own Docker container, and its own deployment pipeline. The platform would grow by adding new services, not by expanding existing ones.

Local development was the first friction point. Running the platform locally required starting four PostgreSQL instances, four Node processes, a Redis instance for session management, and an nginx reverse proxy for routing. The team's solution was a docker-compose file with eleven containers. A new engineer's onboarding checklist included a forty-minute step to pull all images and start the compose stack. When the Bookings service's database schema changed, the engineer developing it needed to restart three containers in a specific order to avoid migration conflicts with the shared volume mount. The docker-compose file became a carefully maintained artifact, not an operational afterthought.

The deployment problem compounded as features matured. Adding a "verified listing" badge — a label shown on a listing when the owner's identity had been verified through the Users service — required changes to the Listings service (to store and query the verified flag), the Users service (to emit a verification event), and the Bookings service (to show a "booking a verified listing" confirmation screen). Three services, three deployments, one feature. The team instituted a practice of tagging related deployments in the deployment log, but a feature that required coordinated deployments across three services five times in a month was not delivering the independence that microservices were supposed to enable. The engineers were spending more time on deployment coordination than on product functionality.

By month eight the team had eleven services — the original four plus a Notifications service, a Search service, an Analytics service, a Reviews service, a Messaging service, an Admin service, and a Pricing service. Most features touched three to five services. The monthly deployment count was 340, of which 290 were coordinated multi-service deployments. Two of the four founding engineers estimated they were spending 30% of their time on infrastructure, deployment, and service coordination rather than product features. The CTO proposed consolidating the Listings, Search, and Reviews services into a single Catalog service, the Notifications and Messaging services into a single Communications service, and the Analytics and Admin services into a single Operations service. The consolidation reduced eleven services to seven and required migrating three sets of database schemas — a two-sprint project that produced no user-visible features. After the consolidation the team ran the remaining seven services as a near-monolith for six months before the company raised a Series A and hired enough engineers to justify the service boundaries they originally drew.

Structural properties set by the service boundary decision

Four structural properties are determined when a service boundary is drawn. None of them appear explicitly in the founding session that split a system into services — they are the operational consequences of a design choice made in response to an instinct about independence, a previous employer's architecture, or a desire to avoid a future big-ball-of-mud problem.

Property 1: Deployment coupling surface. A service boundary is also a deployment boundary. Code that lives in the same service deploys together. Code in different services can deploy independently — but only if no other service must change simultaneously for the deployment to be safe. The deployment coupling surface is the set of features that require coordinated deployments across multiple services. A feature that changes the Listings service's API response shape without versioning requires the Bookings service to deploy simultaneously if it reads that field. A feature that adds a new required field to a message that the Notifications service consumes requires both the publisher and consumer to deploy in a specific order. The coupling surface is determined at boundary-draw time: if two domains frequently co-evolve — their data models change together for the same business reasons — placing them in different services does not make their deployments independent; it makes their deployments require coordination. The CI/CD pipeline decision record documents the deployment pipeline per service; in a multi-service system the pipeline must also document which services are deployment-coupled and require coordinated release sequences, not just independent build-and-push steps.
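One way to make the coupling surface visible is to count which service pairs ship together. The sketch below assumes a hypothetical deployment log in which each entry is the set of services released for one feature; the service names and the `coupling_pairs` helper are illustrative only.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Set, Tuple


def coupling_pairs(releases: Iterable[Set[str]]) -> Counter:
    """Count how often each pair of services was deployed for the same feature."""
    pairs: Counter = Counter()
    for services in releases:
        # sorted() gives each pair a stable (a, b) key regardless of set order.
        for pair in combinations(sorted(services), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical log: one set of services per released feature.
releases = [
    {"listings", "users", "bookings"},  # e.g. a badge that touches three services
    {"listings", "bookings"},
    {"payments"},
    {"listings", "users"},
]

pairs = coupling_pairs(releases)
coordinated_share = sum(1 for r in releases if len(r) > 1) / len(releases)
most_coupled: Tuple[str, str] = pairs.most_common(1)[0][0]
```

Pairs with high counts are candidates for the "which pairs of services are coupled and why" section of the boundary ADR, and `coordinated_share` is the fraction of releases that needed more than one service to deploy.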

The deployment coupling surface compounds with team size. A single team that owns multiple services can coordinate deployments because the same people own both ends of the dependency. A boundary that crosses team ownership requires cross-team deployment coordination — a ticket, a calendar, a synchronization meeting. When service boundaries match team ownership boundaries and those boundaries match data and co-evolution boundaries, microservices provide their advertised independence. When any of those three alignments is missing, the coupling surface grows and the independence benefit is partially or fully cancelled. The container orchestration decision record documents the runtime environment for multiple services; the service boundary ADR must document the deployment coupling surface — which pairs of services are coupled and why — so that future engineers understand the constraint before modifying a service boundary or introducing a new inter-service dependency.

Property 2: Distributed transaction surface area. A database transaction can atomically commit or roll back any set of write operations on data within a single database. When a business operation requires atomically modifying data in two different databases, one per service, there is no shared transaction boundary and no database-native rollback mechanism. The distributed transaction surface area is the set of business operations that require atomicity across service boundaries. For each such operation, the team must implement one of three mechanisms: a choreography-based saga (services publish events and react to each other's events, with each service responsible for compensating its own local transaction when it detects a failure), an orchestration-based saga (a dedicated orchestrator service directs each participant to execute or compensate its local transaction in sequence), or the two-phase commit protocol (distributed locking across databases, almost never appropriate in practice due to coordination overhead and lock duration).
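The choreography variant can be sketched with an in-memory event bus standing in for a real broker. Everything here is an assumption made for illustration: the `Bus` class, the event names, and the synchronous delivery are not how a production broker behaves, and the payment is hard-coded to fail.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class Bus:
    """Toy synchronous event bus; a real broker is asynchronous and durable."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, name: str, handler: Handler) -> None:
        self.handlers[name].append(handler)

    def publish(self, name: str, event: Event) -> None:
        self.history.append(name)
        for handler in list(self.handlers[name]):
            handler(event)


bus = Bus()
orders: Dict[str, str] = {}
reserved: Dict[str, int] = {}


# Orders service: owns order state and compensates by cancelling.
def place_order(order_id: str, qty: int) -> None:
    orders[order_id] = "pending"
    bus.publish("OrderCreated", {"order_id": order_id, "qty": qty})


def orders_on_payment_failed(event: Event) -> None:
    orders[str(event["order_id"])] = "cancelled"


# Inventory service: reserves on OrderCreated, releases on PaymentFailed.
def inventory_on_order_created(event: Event) -> None:
    reserved[str(event["order_id"])] = int(event["qty"])
    bus.publish("StockReserved", event)


def inventory_on_payment_failed(event: Event) -> None:
    reserved.pop(str(event["order_id"]), None)  # safe to apply more than once


# Payments service: simulated decline.
def payments_on_stock_reserved(event: Event) -> None:
    bus.publish("PaymentFailed", event)


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("StockReserved", payments_on_stock_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)
bus.subscribe("PaymentFailed", orders_on_payment_failed)

place_order("o-1", 2)
```

No component holds the whole flow: each service knows only which events it reacts to, which is why choreography gets harder to reason about as failure permutations multiply.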

Saga implementation costs are not trivial. A two-service saga with a single failure mode requires writing two local transaction handlers, two compensation handlers, and a mechanism to detect whether the overall saga succeeded or failed. A four-service saga with multiple failure modes — where each step can fail before or after committing, and compensation steps can themselves fail — requires tracking saga state, implementing idempotent compensation handlers (a compensation applied twice must produce the same result as a compensation applied once), handling the case where a compensation fails and the saga is in a partially compensated state, and building a monitoring and recovery mechanism for stalled sagas. The queue and messaging decision record documents the message broker for asynchronous inter-service communication; choreography-based sagas rely on this broker and its delivery guarantees to coordinate compensation without a central orchestrator. If the broker delivers at-least-once (most do), every saga participant must implement idempotent handlers. The message broker decision record documents the delivery guarantee and consumer group model; the service boundary ADR must document which business operations are in the distributed transaction surface area, which saga pattern is used for each, and the estimated implementation cost per operation — so the cost of the boundary decision is visible at design time rather than discovered when the first cross-service transactional feature is built.
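The idempotency requirement under at-least-once delivery can be illustrated with a compensation handler that remembers which compensations it has already applied. This is a sketch under assumptions: `InventoryCompensator` is a hypothetical class, and the in-memory `applied` set stands in for a record that a real service would persist in the same local transaction as the stock update.

```python
from typing import Dict, Set, Tuple


class InventoryCompensator:
    """Restocks parts for a failed saga; duplicate deliveries are no-ops."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        # Assumption: in a real service this set is a table written atomically
        # with the stock change, so a crash cannot separate the two.
        self.applied: Set[Tuple[str, str]] = set()

    def restock(self, saga_id: str, sku: str, qty: int) -> bool:
        key = (saga_id, sku)
        if key in self.applied:
            return False  # already compensated; applying again must change nothing
        self.stock[sku] = self.stock.get(sku, 0) + qty
        self.applied.add(key)
        return True


compensator = InventoryCompensator({"filter": 3})
first = compensator.restock("saga-1", "filter", 2)
second = compensator.restock("saga-1", "filter", 2)  # redelivered message
```

The second call returns `False` and leaves the stock at 5, so a compensation applied twice produces the same result as one applied once.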

Property 3: Local development and operational complexity. A monolith runs as a single process with a single database. A developer starts one process, one database, and runs tests against both. A ten-service system runs as ten processes with up to ten databases plus shared infrastructure — message broker, service discovery, API gateway, secrets manager. A developer starting the full system locally must start every dependency the feature under development interacts with, or mock those dependencies in a way that accurately reflects their behavior. Docker Compose solves the startup problem but adds the maintenance problem: the compose file becomes a source of truth for service topology that must be updated every time a service's dependencies change, and a 15-container compose stack takes 4–7 minutes to start cleanly on typical developer hardware. The observability strategy decision record documents distributed tracing across services; in a multi-service local development environment, distributed traces are the primary tool for understanding how a request traverses the service graph, which means the local observability stack must include a trace collector and viewer to make debugging tractable.

The operational complexity compounds in production. Each service is a separate deployment target with its own health checks, its own scaling policy, its own CPU and memory profile, its own restart behavior, and its own failure mode. A production incident in a microservices system requires determining which service is failing, whether the failure is originating in that service or propagating from a dependency, and how to isolate the failing service without cascading the failure to its consumers. The service mesh decision record documents the traffic management layer for inter-service communication; circuit breakers, retry policies, and timeout configurations that prevent a single slow service from exhausting the connection pools of its callers must be explicitly configured and tested. In a monolith, a slow function is a performance problem. In a microservices system, a slow downstream service is a distributed systems reliability problem. The boundary ADR must document the minimum viable local development stack and the fallback for engineers who cannot run the full stack locally — service virtualization, contract testing, or a shared development environment.
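The circuit breaker mentioned above can be sketched in a few lines. This is a deliberately simplified illustration, not a replacement for a service mesh or a maintained library: the class name, thresholds, and injectable `clock` are assumptions, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised instead of calling a dependency that is known to be failing."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast; dependency not called")
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def slow_billing() -> str:
    raise TimeoutError("billing did not answer")


for _ in range(2):
    try:
        breaker.call(slow_billing)
    except TimeoutError:
        pass

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpen:
    rejected = True  # the open circuit protects the caller's connection pool

now[0] = 11.0
recovered = breaker.call(lambda: "ok")
```

While the circuit is open the caller fails immediately instead of waiting on a timeout, which is what keeps one slow service from exhausting the connection pools of its callers.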

Property 4: Team cognitive load and the co-evolution test. Service boundaries should match co-evolution boundaries. Two modules that change together for the same reasons — that receive related feature requests in the same sprint, that share data models, that are developed by the same engineer most of the time — belong in the same service. Splitting co-evolving code across service boundaries does not make it independent; it makes it coupled at a distance, connected by an inter-service API that adds a versioning contract and a deployment dependency to what was previously a function call. The co-evolution test is the primary criterion for service boundary placement: before drawing a boundary between two domains, ask how frequently features have required changes to both in the past six months. If the answer is "most features touch both," the boundary is wrong and will produce coupling overhead without independence benefit.
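The co-evolution test can be run against feature history rather than intuition. The sketch below assumes a hypothetical record of which domains each shipped feature touched; the 0.5 cut-off for "most features touch both" is an illustrative assumption, not a figure from the text.

```python
from typing import List, Set


def co_change_ratio(features: List[Set[str]], a: str, b: str) -> float:
    """Of the features touching domain a or b, the fraction that touched both."""
    touching_either = [f for f in features if a in f or b in f]
    if not touching_either:
        return 0.0
    both = sum(1 for f in touching_either if a in f and b in f)
    return both / len(touching_either)


# Hypothetical six-month feature history: domains changed per feature.
features = [
    {"listings", "search"},
    {"listings", "search", "reviews"},
    {"listings"},
    {"payments"},
    {"search", "listings"},
]

ratio = co_change_ratio(features, "listings", "search")
boundary_suspect = ratio > 0.5  # assumed threshold for "most features touch both"
unrelated = co_change_ratio(features, "payments", "search")
```

A high ratio for a pair that sits on opposite sides of a service boundary is the signal that the boundary is producing coupling overhead without independence benefit.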

The modular monolith is the alternative that preserves co-location benefits while maintaining internal structure. A modular monolith organizes code into modules with explicit internal APIs — the orders module exposes an OrdersService interface, the inventory module exposes an InventoryService interface — but both modules compile and deploy as a single process sharing a single database. Business operations that span domains are local function calls, not distributed transactions. The modules can be independently tested but are not independently deployed. A modular monolith can be decomposed into separate services later, when the co-evolution patterns have stabilized and the team has grown large enough to maintain independent deployment pipelines and operate separate services. The dependency injection decision record documents the module boundary enforcement mechanism in a monolith; a modular monolith enforces its internal API contracts through dependency injection containers or module visibility rules, and the inter-module contract quality determines how cleanly a module can later be extracted into a service. The boundary ADR must document whether the system is a monolith, a modular monolith (with explicit module boundaries but shared deployment), or a set of independently deployed services — and for the third option, the criterion for which domains are separate services and which are collocated.
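A minimal sketch of the modular monolith idea, using Python's standard `sqlite3` module as a stand-in for the shared database. The module classes, table names, and SKU are hypothetical; the point is only that a cross-domain operation is one local transaction, so a failure in one module rolls back the other module's write.

```python
import sqlite3


class InventoryModule:
    """Only this module touches the stock table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def reserve(self, sku: str, qty: int) -> None:
        cur = self.conn.execute(
            "UPDATE stock SET qty = qty - ? WHERE sku = ? AND qty >= ?",
            (qty, sku, qty),
        )
        if cur.rowcount != 1:
            raise ValueError(f"insufficient stock for {sku}")


class OrdersModule:
    """Depends on InventoryModule through its public method, not its tables."""

    def __init__(self, conn: sqlite3.Connection, inventory: InventoryModule) -> None:
        self.conn = conn
        self.inventory = inventory

    def place_order(self, order_id: str, sku: str, qty: int) -> None:
        # One transaction spans both modules: commit on success,
        # rollback on any exception raised inside the block.
        with self.conn:
            self.conn.execute(
                "INSERT INTO orders (id, sku, qty) VALUES (?, ?, ?)",
                (order_id, sku, qty),
            )
            self.inventory.reserve(sku, qty)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, sku TEXT, qty INTEGER)")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO stock VALUES ('filter', 3)")
conn.commit()

orders = OrdersModule(conn, InventoryModule(conn))
orders.place_order("o1", "filter", 2)      # succeeds; stock drops to 1
try:
    orders.place_order("o2", "filter", 5)  # fails; the order insert is rolled back too
except ValueError:
    pass

remaining = conn.execute("SELECT qty FROM stock WHERE sku = 'filter'").fetchone()[0]
order_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
```

No compensation handler exists anywhere in this example: the failed second order leaves no trace because the database undoes the insert, which is exactly the work a saga has to reimplement in application code once the two modules own separate databases.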

What the founding session records and what it omits

The service architecture decision is almost always made in one of the first three engineering sessions of a new system. It is almost never documented as an explicit decision — it is embedded in the initial repository structure, the initial docker-compose file, or the initial CI/CD pipeline configuration. The founding session that sets the architecture records the service names and their responsibilities. It does not record the criteria for the boundaries, the expected distributed transaction surface area, or the plan for services that turn out to have incorrect boundaries.

Four types of AI chat sessions generate these gaps:

The "how should we structure our services?" session. The founding engineer asks how to split the application into services. The session explains the single responsibility principle, suggests splitting by business domain, and produces a diagram of four to six services with their data stores. The session does not ask: how frequently do features require changes to multiple of these domains simultaneously? Which business operations require atomicity across domain boundaries, and how will those be implemented? How many engineers are on the team and which services will each person own? The answers to these questions would change the boundary recommendations. A team of six engineers with two domains that co-evolve constantly should hear "start with a modular monolith" — but the session answers the question asked ("how do we structure services") rather than the question that should have been asked ("should we use microservices at all given our team size and domain co-evolution patterns"). The resulting architecture is technically correct in isolation and operationally expensive in practice. The authorization model decision record documents the cross-cutting concern of who can access what; in a microservices system authorization must be enforced at the service level, not just at the API gateway, which adds implementation surface to every service that a monolith can implement once.

The "how do we implement this cross-service operation?" session. A feature requires modifying data in two services atomically. The engineer asks how to handle this. The session introduces the saga pattern — explains choreography versus orchestration, produces a code sample for the happy path. The session produces one saga implementation without establishing a saga policy: which pattern will be used for all cross-service transactions (choreography or orchestration), where saga state will be persisted, how stalled sagas will be detected and recovered, and what the idempotency contract is for compensation handlers. Without a policy, each cross-service operation is implemented independently by the engineer who builds it. The compensation patterns diverge. The stall detection is missing on some operations. The idempotency handling is inconsistent. When a production incident causes multiple sagas to stall simultaneously, the recovery is manual and service-specific. The error handling strategy decision record documents the system-wide error handling and retry policy; in a saga, the retry policy for individual steps must be coordinated with the compensation trigger — too many retries before triggering compensation means more data to unwind; too few means spurious compensations for transient failures.
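The interaction between step retries and the compensation trigger can be sketched as a bounded retry wrapper. This is an illustration under assumptions: `TransientError`, the attempt limit, and the backoff values are hypothetical, and a real policy would classify errors per dependency.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A failure worth retrying, such as a timeout."""


def call_with_retries(action: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], object] = time.sleep) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: the caller starts compensation
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")  # loop always returns or raises


attempts: Dict[str, int] = {"n": 0}
delays: List[float] = []


def generate_invoice() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("billing timeout")
    return "invoice-1"


# delays.append replaces time.sleep so the example runs instantly.
result = call_with_retries(generate_invoice, max_attempts=3, sleep=delays.append)
```

Any exception other than `TransientError` propagates on the first attempt, so a permanent failure triggers compensation immediately while a transient one gets a bounded number of retries first. A saga policy would fix `max_attempts` and the error classification once rather than per operation.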

The "how do we set up local development?" session. The team is running four services and local development is becoming cumbersome. The engineer asks how to simplify it. The session produces a docker-compose file that orchestrates the services and their dependencies. The session does not ask: as the service count grows to eleven, will this compose file remain the primary local development mechanism, or does the team need a service virtualization strategy for services not under active development? What is the startup time budget for the local stack and when will the compose approach exceed it? The session solves the current problem (four services, manageable) without projecting to the likely future (eleven services, problematic). The contract testing pattern — where each service publishes a machine-readable contract of its API behavior and consumers verify against the contract without running the producer — is not mentioned because the current problem is solvable with docker-compose and the future problem is not visible yet. The test strategy decision record documents the testing approach for the system; in a multi-service system, contract tests between services are the mechanism that allows services to be tested in isolation without running their dependencies, and they must be established before the compose stack becomes unwieldy rather than after.
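The contract testing idea can be shown with a toy contract check. This is not Pact or any real tool's format: `LISTING_CONTRACT`, the field names, and the `violations` helper are hypothetical, and the contract here covers only field presence and Python type.

```python
from typing import Dict, List

# Hypothetical contract: the fields a consumer relies on and their types.
LISTING_CONTRACT: Dict[str, type] = {
    "id": str,
    "title": str,
    "verified": bool,
}


def violations(contract: Dict[str, type], response: Dict[str, object]) -> List[str]:
    """Return every way a response breaks the contract (empty list means it passes)."""
    problems: List[str] = []
    for field, expected in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


# Provider side: verify a sample response in the provider's own pipeline.
good = violations(LISTING_CONTRACT, {"id": "l-1", "title": "Drill", "verified": True})
bad = violations(LISTING_CONTRACT, {"id": "l-1", "title": "Drill"})

# Consumer side: test against a stub that satisfies the same contract,
# so the provider does not need to be running.
stub_response = {"id": "l-2", "title": "Ladder", "verified": False, "extra": 1}
stub_ok = violations(LISTING_CONTRACT, stub_response) == []
```

Because both sides test against the shared contract rather than against each other, neither needs the other's container in its local stack. Extra fields in a response are ignored, which is what lets a provider add data without breaking consumers.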

The "how do we version inter-service APIs?" session. A service's API is changing and a consuming service must be updated simultaneously. The engineer asks how to version the API to allow the two services to deploy independently. The session explains semantic versioning for HTTP APIs, URL versioning, and header-based versioning. The session produces a versioning scheme for the specific API. It does not establish a deprecation policy: how long must a deprecated API version remain available before the consuming service's deployment catches up, who is responsible for monitoring consumer migration off the old version, and what happens to a consuming service that has not migrated when the deprecated version is removed. Without a deprecation policy, old API versions accumulate. The API versioning decision record documents the versioning scheme; the service boundary ADR must document the inter-service contract management process — how contracts are published, how consumers discover current contract versions, and what the deprecated-version retention policy is — to prevent the deployment independence of the boundary from being cancelled by the implicit coupling of an unmanaged deprecation timeline.

The WhyChose extractor surfaces the founding "how should we structure our services?" session, the first cross-service saga session, the first contract versioning session, and the local development setup session from AI chat history. The service boundary ADR converts the implicit decisions embedded in those sessions — service names, data ownership assignments, the first saga implementation, the first versioning scheme — into a documented boundary criterion, a distributed transaction policy, a local development contract, and a migration path for boundaries that turn out to be wrong.

The five sections of a service boundary ADR

Section 1: Architecture model and boundary criterion. Document the architecture model selected — monolith, modular monolith, or independently deployed microservices — with the specific rationale. If independently deployed services: document the criterion for where a boundary belongs. The co-evolution test is the primary criterion: code that changes together for the same business reasons belongs in the same service; code that changes independently for different reasons can be separated. Document the current boundary map — which domains are separate services and which are collocated — with the rationale for each boundary. For boundaries that exist primarily for future independence ("we expect these to diverge eventually"), document the signal that would trigger extracting the currently collocated module: team ownership transfer, sustained divergence in release cadence, divergence in scaling requirements, or a specific headcount threshold. The boundary map should be a living section of the ADR, reviewed when a feature consistently requires changes to multiple services simultaneously — that pattern is evidence that the boundary is in the wrong place. The API gateway decision record documents the routing layer in front of the services; the boundary ADR must document which services are client-facing (exposed through the gateway) and which are internal (called only by other services), since a client-facing service boundary has a higher versioning cost than an internal one.

Section 2: Cross-service communication model. Document the communication mechanism for each type of inter-service interaction. Synchronous communication (HTTP/REST or gRPC) is appropriate for request-response interactions where the caller requires an immediate response to continue its work — querying a service for data, triggering an action and waiting for confirmation. Asynchronous communication (message broker) is appropriate for event-driven interactions where the publisher does not need the consumer's response before continuing and where the consumer may be temporarily unavailable without blocking the publisher. Document the contract format: OpenAPI specification for HTTP APIs (versioned and published to a shared registry), Protobuf for gRPC (version-controlled in a shared schema repository), JSON Schema or Avro for message broker events. Document the versioning policy for inter-service contracts: the deprecation period for old API versions (minimum time between marking a version deprecated and removing it), the consumer registry (how the team tracks which services consume each contract version), and the breaking change definition for each format (removing a field from an OpenAPI response schema is breaking for consumers that read the field; adding an optional field with a documented default is non-breaking). The API schema design decision record documents the external API schema design; the service boundary ADR must document the internal inter-service contract design separately, since internal contracts can evolve faster than external ones but still require a deprecation process to allow independent deployment.
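The breaking change definition can be made mechanical with a small schema diff. The sketch assumes a simplified, hypothetical schema representation (field name mapped to a type and a `required` flag) rather than real OpenAPI or Avro documents. Treating a new required field as breaking is a conservative assumption that fits request and message schemas more than response schemas.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, object]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes between two contract versions that can break existing consumers."""
    problems: List[str] = []
    for field, spec in old.items():
        if field not in new:
            problems.append(f"removed: {field}")
        elif new[field].get("type") != spec.get("type"):
            problems.append(f"type changed: {field}")
    for field, spec in new.items():
        # Assumption: a newly required field breaks producers or consumers
        # that were built against the old version.
        if field not in old and spec.get("required", False):
            problems.append(f"new required field: {field}")
    return problems


v1: Schema = {
    "id": {"type": "string", "required": True},
    "price": {"type": "number", "required": True},
}
v2_additive: Schema = {**v1, "badge": {"type": "string", "required": False}}
v2_removal: Schema = {"id": {"type": "string", "required": True}}

additive_result = breaking_changes(v1, v2_additive)
removal_result = breaking_changes(v1, v2_removal)
```

A check like this could gate a provider's pipeline: an empty list means the new version can ship without waiting for consumers, and a non-empty list means the deprecation process applies.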

Section 3: Distributed transaction strategy. Document the business operations that span service boundaries and require atomicity. For each such operation: identify the participating services, document the failure modes (which step can fail, what state is left if it fails, what compensation is required), select the saga pattern (choreography-based for simpler flows with fewer failure modes, orchestration-based for complex flows with multiple failure mode permutations), document where saga state is persisted (a dedicated saga state table in a shared database or in the orchestrating service's database), and document the idempotency contract for each participant (what happens if a step is executed twice due to a retry — the participant must produce the same result both times). Document the stall detection and recovery mechanism: how long before a saga in a partially completed state is flagged as stalled, who is notified, and what the manual recovery process is. Document the boundary condition: which operations the team considers acceptable to leave non-atomic (operations where a partial completion is tolerable and will be resolved by eventual consistency) versus which require saga coordination. Not every cross-service write requires a saga — an operation that writes to Service A and then publishes an event for Service B to process, where a failure in Service B's processing is tolerable and will be retried later, may be handled by the message broker's delivery guarantee without an explicit saga.
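Stall detection reduces to a query over persisted saga state. The sketch below uses an in-memory list in place of the saga state table; the `SagaRecord` fields, the state names, and the five-minute threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# Assumed terminal states; anything else means the saga is still in flight.
TERMINAL_STATES = {"completed", "compensated"}


@dataclass
class SagaRecord:
    saga_id: str
    state: str
    updated_at: datetime


def find_stalled(records: List[SagaRecord], now: datetime,
                 threshold: timedelta) -> List[str]:
    """Return ids of sagas that are not finished and have not progressed recently."""
    return [
        r.saga_id
        for r in records
        if r.state not in TERMINAL_STATES and now - r.updated_at > threshold
    ]


now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
records = [
    SagaRecord("s1", "running", now - timedelta(minutes=2)),        # still fresh
    SagaRecord("s2", "compensating", now - timedelta(minutes=30)),  # stuck mid-undo
    SagaRecord("s3", "completed", now - timedelta(hours=2)),        # finished
]

stalled = find_stalled(records, now, threshold=timedelta(minutes=5))
```

Run on a schedule, a check like this answers the first part of the stall policy (how long before a saga is flagged). Who is notified and how recovery proceeds still have to be written down separately.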

Section 4: Local development and testing contract. Document the minimum viable local development stack: which services must be running locally to develop and test any given service, what the startup sequence is, and what the startup time target is for the full stack. Document the service virtualization strategy for services not under active development: whether engineers run a lightweight stub, a recorded response replay, or a shared environment instance. Document the contract testing approach: whether services publish machine-readable API contracts (Pact, OpenAPI, Protobuf) that consumers verify against independently, and how contract test results gate deployments (a consumer cannot deploy if it fails the contract test for a service it depends on; a provider cannot deploy if it breaks the contract of any registered consumer). The CI/CD pipeline decision record documents the build and deploy pipeline for each service; in a contract-tested system the pipeline must run contract verification against published contracts before any service is promoted to staging or production. Document the observability baseline for local development: at minimum, a log aggregator that collects structured logs from all running services into a single stream, and a trace viewer that shows request paths across service boundaries. An engineer debugging a cross-service bug without distributed traces must correlate log lines from multiple services manually; with distributed traces, a debugging session that would otherwise take hours can often be reduced to reading a single trace view. The observability strategy decision record documents the trace collection infrastructure; the boundary ADR must document that trace context propagation is a required implementation detail in every service, not an optional enhancement.
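Trace context propagation is the part of this section that every service has to implement, so a concrete shape helps. The sketch below assumes the W3C Trace Context `traceparent` header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and lowercased header keys. `outgoing_headers` is a hypothetical helper; a real service would more likely rely on its tracing library's propagator.

```python
import re
import secrets

# version "00" - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_headers(incoming_headers: dict) -> dict:
    """Build headers for a downstream call, continuing the caller's trace if present."""
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    if match:
        # Keep the trace ID and flags so the downstream span joins the same trace.
        trace_id, flags = match.group(1), match.group(3)
    else:
        # No valid context arrived: start a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new span ID for this service's outgoing call
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

A service that drops the incoming header starts a new trace ID, which splits the request path into disconnected fragments in the trace viewer.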

Section 5: Migration path for incorrect boundaries. Document the process for recognizing that a boundary is wrong and correcting it. A boundary is wrong when it fails the co-evolution test — features consistently require coordinated changes to both services — or when a business operation that crosses the boundary requires a saga whose complexity exceeds the value of the boundary's independence. The correction process has two directions: merging services that were incorrectly separated (folding Service B's code and data into Service A, migrating Service B's database schema into Service A's database, removing the inter-service API and replacing it with module-level function calls) and extracting a module from a service that has grown too large to own independently. Document the data migration approach: whether service merges use online schema migrations with dual-write periods (both the old and new data location are written during the migration, allowing rollback) or offline migrations (service goes read-only during migration, data is moved in bulk, service resumes on the new schema). Document the API deprecation process for the inter-service API that is eliminated in a service merge: the consuming services must be updated before the provider can be removed, and the deprecation period must be long enough for all consumers to migrate. The database migration strategy decision record documents the schema migration tooling and process; in a service merge the schema migration is a cross-database operation that must be carefully sequenced to avoid data loss during the transition. Without a documented migration path, incorrect boundaries persist indefinitely because the cost of correcting them is undefined and therefore always deferred in favor of work with a clearer scope and timeline.
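The dual-write option for a service merge can be sketched as follows. This is an assumption-laden illustration: `DualWriteRepository` and `backfill` are hypothetical names, plain dictionaries stand in for the two databases, and handling of a failure between the two writes is omitted.

```python
class DualWriteRepository:
    """Writes go to both stores; reads come from whichever store is authoritative."""

    def __init__(self, old_store: dict, new_store: dict, read_from_new: bool = False):
        self.old_store = old_store
        self.new_store = new_store
        self.read_from_new = read_from_new  # flipped at cutover; flip back to roll back

    def save(self, key: str, value) -> None:
        # Both locations receive every write during the migration window.
        self.old_store[key] = value
        self.new_store[key] = value

    def get(self, key: str):
        store = self.new_store if self.read_from_new else self.old_store
        return store.get(key)


def backfill(old_store: dict, new_store: dict) -> int:
    """Copy rows written before dual-write began, without overwriting newer rows."""
    copied = 0
    for key, value in old_store.items():
        if key not in new_store:
            new_store[key] = value
            copied += 1
    return copied
```

The sequence in this sketch is: enable dual writes, run the backfill, verify that the stores match, then switch reads to the new location. Because the old store keeps receiving writes, switching reads back remains a valid rollback until dual writes are turned off.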

None of these five sections appear in the founding sprint session that drew the service boundaries based on data domain intuition, previous employer convention, or a desire to avoid a future monolith. The session records which code belongs in which service. It does not ask what happens when a business operation requires atomicity across three of those services, how engineers will run eleven services locally when the team has grown from four to nine, or what the process is when the Listings and Search services turn out to co-evolve on every sprint and should have been a single Catalog service from the beginning. The microservices vs monolith ADR is the document that converts the implicit architecture choice into the operational parameters that determine whether the second year of the system is characterized by autonomous teams shipping independently or by coordinated release trains, saga compensation bugs, and 30% of engineering capacity consumed by deployment infrastructure. The WhyChose extractor recovers the founding session, the first saga session, the first contract versioning session, and the local development setup session from AI chat history; the service boundary ADR extracts the durable architectural constraints from those sessions and documents them where engineers encounter them: next to the service boundary definitions and the inter-service contract schemas, not in the Slack thread from the founding week.

FAQs

When do microservices provide genuine value over a monolith?

Microservices provide genuine value in three scenarios: independent scaling requirements exist at the service level (a payments service at 1,000 req/min and a reporting service at 10 req/min benefit from separate scaling policies); team ownership boundaries match service boundaries (a team of six owns a service end-to-end, sets its own release cadence, and owns its data store, and the independence benefit materializes only when team and service boundaries align); or the service has genuinely divergent operational requirements (different languages, different security boundaries, different compliance scopes) that cannot be cleanly isolated within a single deployment unit.

Microservices add operational complexity without commensurate benefit when the team is too small to absorb the coordination overhead (below roughly 15-20 engineers, the deployment pipeline overhead and local development friction typically exceed the autonomy benefit), when service boundaries cut through code that frequently changes together (features regularly touch multiple services, requiring coordinated deployments), when most operations that require atomicity span multiple services (forcing saga implementations for what would be local transactions in a monolith), or when the primary motivation is "industry best practice" rather than a specific scaling, ownership, or isolation requirement the current architecture demonstrably cannot satisfy. The modular monolith, a single deployable with explicit internal module boundaries, is the alternative that provides internal structure without distributed systems complexity and can be decomposed later when team size and domain stability justify it.

What is a saga and why is its implementation cost determined at service boundary draw time?

A saga is a sequence of local transactions coordinated across multiple services to complete a business operation that requires atomicity. When all data is in one database, the operation is wrapped in a database transaction and either committed or rolled back. When the data is split across service-owned databases, there is no shared transaction boundary. The saga replaces the atomic transaction with a sequence of local commits and compensating transactions that undo each commit if a later step fails. A compensating transaction is not a rollback: it is a new write that explicitly undoes the previous commit, and if the compensation itself fails, the saga is left in a partially compensated state requiring manual intervention.
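The control flow described above can be sketched as a small orchestrator. This is a minimal illustration under stated assumptions, not a production implementation: `Step`, `run_saga`, and the returned state names are hypothetical, the steps are plain callables standing in for calls to each service, and persistence of saga state is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One local transaction plus the compensating write that undoes it."""
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: list[Step]) -> str:
    """Run steps in order; on failure, compensate committed steps in reverse order."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            failed_compensations = []
            for done in reversed(completed):
                try:
                    done.compensate()  # a new write, not a rollback
                except Exception:
                    failed_compensations.append(done.name)
            # A failed compensation leaves state that needs detection and recovery.
            return "PARTIALLY_COMPENSATED" if failed_compensations else "COMPENSATED"
    return "COMPLETED"
```

Each additional participant adds another action, another compensation, and more combinations of the two failing, which is why the cost tracks the number of services the boundary places in the operation.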

The implementation cost of a saga is determined by the number of participating services and the number of failure mode permutations. A two-service saga is typically 150-200 lines of application code. A four-service saga in which each step can fail before or after committing, compensation steps can themselves fail, and a stall detection and recovery mechanism is required is typically 400-600 lines. This cost is fixed at boundary draw time: if the boundary places Orders, Inventory, Payments, and Billing in separate services, then the "place order" operation requires a four-service saga regardless of when it is implemented. If the boundary places all four in one service, the same operation is a local transaction of roughly 30 lines. The boundary decision determines the saga budget before any saga is written; the ADR must document the distributed transaction surface area so this cost is visible at design time.

What should a microservices vs monolith ADR document that a general architecture decision does not?

A general architecture decision records that the system uses microservices (or a monolith) and the high-level motivation. The service boundary ADR must document: (1) The co-evolution criterion — which code changes together for the same reasons and therefore belongs in the same service, versus which code changes independently and can be separated. (2) The distributed transaction surface area — which business operations cross service boundaries and require atomicity, which saga pattern is used for each, and the idempotency and stall-recovery contract. (3) The inter-service communication model — synchronous versus asynchronous per interaction type, the contract format, and the deprecation policy. (4) The local development contract — the minimum viable stack, the service virtualization strategy for services not under development, and the contract testing approach. (5) The migration path — how the team recognizes an incorrect boundary, the process for merging incorrectly-separated services, and the data migration approach. None of these appear in the founding sprint session that drew the boundaries; all of them determine whether the second year of development is characterized by autonomous deployment and clear ownership or by saga debugging at 2am, coordinated release trains across three teams, and a local development stack that takes seven minutes to start.