The ML model serving infrastructure decision record: why the inference serving model you chose determines your latency floor and your model update frequency ceiling
Model serving architecture is decided when the team deploys the first model. A pickle file behind a Flask endpoint, a nightly batch scoring job, a gRPC microservice, or a managed inference platform — each is reached for because it is familiar or fast to implement, not because its latency floor, hardware requirements, and model update frequency ceiling were evaluated against the product's serving SLA. The serving architecture sets constraints that cannot be resolved by tuning: a batch scoring job cannot serve real-time predictions, a CPU Flask service cannot serve a transformer model within a 50-millisecond SLA, and a rolling-restart deployment mechanism cannot roll back a model regression within minutes. These constraints are discovered months after the founding deployment, when the product team adds a feature that the serving architecture cannot support.
A 12-person e-commerce startup building a product recommendation engine deployed their first model in the way most ML teams do: a data scientist trained a gradient-boosted tree model in a Jupyter notebook, serialized it with pickle, and handed the file to a backend engineer who wrapped it in a Flask endpoint. The endpoint loaded the model file on startup, received a POST request with a user ID and the current session's viewed product IDs, looked up user features from a Redis hash, and called model.predict() to return a ranked list of ten product recommendations. Median inference latency was 12 milliseconds. The product page loaded recommendations in under 50 milliseconds end-to-end. The system worked well.
Over the next fourteen months the team retrained the model four times, each time by the same process: the data scientist trained a new model on updated data, saved a new pickle file, and uploaded it to the server, and the backend engineer restarted the Flask process. The restart took 2–4 seconds while the Flask process initialized and loaded the model; during that window the recommendations endpoint returned a 503 and the product page fell back to showing the store's bestsellers. The team accepted the brief degradation. Their model update frequency was roughly once per quarter, and a 4-second serving gap was an acceptable cost for a quarterly refresh.
In month fifteen, the product team decided to replace the gradient-boosted tree with a two-tower neural network — a model architecture that encodes users and items into dense vector embeddings and scores them by dot product similarity. The new architecture delivered a 22% improvement in click-through rate in offline evaluation. The backend engineer began integrating it into the Flask endpoint. The PyTorch model's CPU inference time was 180 milliseconds at p99 for a batch of 50 candidate items — compared to 12 milliseconds for the gradient-boosted tree. The product page's 50-millisecond end-to-end recommendation SLA could not accommodate 180 milliseconds of inference time. The team needed GPU inference: the same model on a GPU produced p99 inference of 8 milliseconds for the same batch size.
Deploying GPU inference required changes that extended far beyond adding a GPU instance. The Flask/pickle serving model assumed CPU compute — pickle-serialized PyTorch models are not directly portable to GPU serving frameworks without format conversion. GPU serving required exporting the model to TorchScript or ONNX, deploying a GPU-capable inference server (TorchServe, Triton Inference Server, or a GPU-enabled container with a custom ONNX Runtime setup), configuring GPU memory management and batch sizing for optimal utilization, rewriting the serving endpoint to communicate with the inference server over gRPC rather than calling model.predict() in-process, and provisioning GPU instances in the production environment. The migration took six weeks. The neural network model that had been ready for deployment sat waiting for a serving infrastructure that was not ready for it — because the original decision to use Flask/pickle/CPU had never been written down as a choice with a latency-floor consequence, and nobody had evaluated whether that infrastructure could serve a neural architecture when the time came to replace the gradient-boosted tree.
A 20-person fintech startup building a personal loan origination platform deployed their first credit risk scoring model as a nightly batch job. An Airflow DAG ran at 2:00 AM, read all active loan applications from the database, computed creditworthiness scores using a logistic regression model, and wrote the scores back to a PostgreSQL table. The loan officer interface read from this table when reviewing applications. The architecture was appropriate for the product: loan officers reviewed applications during business hours, the overnight score refresh was fresh enough for the next morning's queue, and the batch computation ran comfortably on a single database server in under four minutes. The team shipped the model and moved on to building other features.
Eight months later, the business expanded from assisted lending — where a loan officer reviewed each application — to instant consumer lending, where a borrower submitted an application on their phone and received an approval or rejection within three seconds. The product team assumed that the credit risk model was already built and that wiring it to the instant approval flow was a one-week integration task. The engineering team discovered that this was not the case. The batch-scored table contained scores computed the previous night for applicants who were already in the system — it had no score for a new applicant who had submitted their application 10 seconds ago. Serving the instant approval flow required scoring new applicants in real time, which required online inference — querying the credit bureau, computing features, and running the model prediction within the three-second approval SLA.
The logistic regression model itself was simple and fast enough for online inference. The bottleneck was the feature computation pipeline. The batch job computed features by joining applicant data against three months of transaction history in the data warehouse, a query that took 8–14 seconds per applicant in batch mode and would take the same 8–14 seconds in real-time mode for a new applicant. Reducing this to under two seconds required pre-computing and caching applicant features in a low-latency feature store, a piece of infrastructure that did not exist and that the team had not planned to build. The migration to real-time model serving required building four things: a feature caching layer for pre-computed features, an online inference endpoint to replace the batch scoring job for new applicants, a credit bureau integration with a sub-two-second response SLA, and an application state machine that could distinguish between new applicants (requiring real-time scoring) and applicants already in the batch-scored table (returning cached scores). The project took three months. The instant lending product launched three months after the date the product team had planned, because the serving architecture, a nightly batch job, had never been documented as a decision with a latency-floor consequence of "up to 24 hours until the next batch run."
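The routing decision made by the application state machine described above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: `score_applicant`, `ScoreResult`, the in-memory `batch_scores` mapping, and the `score_online` callable are all hypothetical names standing in for the batch-scored table and the online inference endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass
class ScoreResult:
    applicant_id: str
    score: float
    source: str  # "batch" if read from the nightly table, "online" if computed now


def score_applicant(
    applicant_id: str,
    batch_scores: Mapping[str, float],      # stand-in for the batch-scored table
    score_online: Callable[[str], float],   # stand-in for the online inference call
) -> ScoreResult:
    """Return the cached batch score if one exists, otherwise score in real time."""
    cached = batch_scores.get(applicant_id)
    if cached is not None:
        return ScoreResult(applicant_id, cached, "batch")
    return ScoreResult(applicant_id, score_online(applicant_id), "online")
```

The `source` field is kept on the result so that downstream monitoring can separate the latency and score distributions of the two paths.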
Structural properties set by the model serving architecture decision
Four structural properties are determined when a team chooses how to serve ML model predictions in production. None appear in the founding deployment session — the session that asks how to get the trained model responding to requests and focuses entirely on getting a working endpoint by the end of the sprint.
Property 1: Prediction latency floor and the inference compute contract. The minimum achievable prediction latency is determined by three compounding constraints at serving time: the model architecture and parameter count, the serialization and runtime format, and the hardware. A logistic regression model or a small gradient-boosted tree runs in under 5 milliseconds on CPU in any mainstream runtime. A large ensemble (1000 trees, depth 8) runs in 30–80 milliseconds on CPU. A small transformer (12 layers, 110M parameters) runs in 80–250 milliseconds on CPU and 5–20 milliseconds on GPU. These numbers are architectural properties of the model, not implementation details — they cannot be reduced at the serving layer without changing the model. The serving framework adds its own latency floor: a Python Flask endpoint with a single-threaded HTTP server adds 8–20 milliseconds of overhead from HTTP parsing, Python GIL contention, and serialization; a gRPC endpoint with a compiled protobuf schema adds under 2 milliseconds; an in-process library call adds negligible overhead. The product's serving SLA — the maximum acceptable end-to-end latency for a prediction, measured at p99 — must be satisfiable by the sum of the model's inference latency floor and the serving framework's overhead. If the SLA is 50 milliseconds and the model's CPU inference floor is 80 milliseconds, no serving framework optimization can close the gap; the team must either use GPU inference, quantize the model to reduce the computational cost, distill it into a smaller architecture, or relax the SLA.
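The feasibility check in this paragraph is simple subtraction, sketched below. The function name is hypothetical, and the inputs are the lower ends of the ranges quoted above (80 ms CPU and 5 ms GPU for a small transformer, 8 ms Flask overhead, 2 ms gRPC overhead), so the result is the most optimistic case for each option.

```python
def latency_headroom_ms(sla_p99_ms, inference_floor_ms, framework_overhead_ms):
    """Positive: the SLA is satisfiable. Negative: serving-layer tuning cannot close the gap."""
    return sla_p99_ms - (inference_floor_ms + framework_overhead_ms)


# (inference floor, framework overhead) in milliseconds, best case from the text
candidates = {
    "small transformer on CPU behind Flask": (80, 8),
    "small transformer on GPU behind gRPC": (5, 2),
}

for name, (floor_ms, overhead_ms) in candidates.items():
    headroom = latency_headroom_ms(50, floor_ms, overhead_ms)
    verdict = "fits" if headroom >= 0 else "cannot meet the SLA"
    print(f"{name}: headroom {headroom} ms, {verdict}")
```

Feature lookup and network time are left out here; they only reduce the headroom further.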
The serialization format determines which runtimes can serve the model and the compatibility surface for future model architecture changes. A pickle-serialized scikit-learn model can only be served reliably by a Python runtime with the same scikit-learn version; a version mismatch between the training environment and the serving environment can produce a deserialization error or silently inconsistent behavior that is discovered only at deployment time. An ONNX-exported model can be served from any language with an ONNX Runtime binding (Python, C++, Java, and .NET officially, Go through community bindings), and ONNX Runtime's graph optimizations and hardware-specific execution paths (AVX-512 kernels on Intel CPUs, the TensorRT execution provider on NVIDIA GPUs) can reduce inference latency by 20–60% compared to the framework's default runtime. TorchScript, TensorFlow SavedModel, and CoreML each have similar runtime portability and optimization tradeoffs. The format choice made when the first model is serialized for deployment is a commitment to a serving runtime ecosystem; switching formats later requires re-exporting the model, and sometimes retraining it with export hooks that produce the target format, which is straightforward for simple models but may be infeasible for models that use custom Python layers or training-time tricks that are not expressible in the target format's computation graph. The feature store decision record documents the feature computation and serving architecture that feeds the model; the model serving ADR must document the serving latency SLA at p99 given the model architecture's inference floor on the target hardware, and the format that enables that hardware target, so that a future team replacing the model knows whether a new architecture choice is compatible with the existing serving infrastructure before the new model is trained.
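One way to surface a training/serving version mismatch explicitly, rather than as an opaque failure at load time, is to store the training-time library versions next to the pickled model and compare them before the model bytes are deserialized. The sketch below uses only the standard library; `pack_model`, `unpack_model`, and the envelope layout are hypothetical, and the version strings in any usage are placeholders.

```python
import pickle


def pack_model(model, library_versions):
    """Wrap the pickled model with the library versions used at training time."""
    envelope = {
        "library_versions": dict(library_versions),
        "model_bytes": pickle.dumps(model),
    }
    return pickle.dumps(envelope)


def unpack_model(blob, serving_versions):
    """Compare versions first; the model bytes are only unpickled if they match."""
    envelope = pickle.loads(blob)  # only unpickle artifacts from a trusted store
    mismatched = {
        name: (trained, serving_versions.get(name))
        for name, trained in envelope["library_versions"].items()
        if serving_versions.get(name) != trained
    }
    if mismatched:
        raise RuntimeError(f"training/serving version mismatch: {mismatched}")
    return pickle.loads(envelope["model_bytes"])
```

The model is pickled separately inside the envelope so that the version check runs before any library-specific classes are reconstructed.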
Property 2: Model update frequency ceiling and the deployment protocol. How often model weights can be refreshed in production is constrained by the deployment mechanism chosen when the first model was deployed. A batch scoring job can be updated by replacing the model artifact before the next batch run — the update frequency ceiling is once per batch cycle, typically daily or weekly. A REST microservice with rolling restart has an update frequency ceiling determined by the deployment pipeline duration and the acceptable serving gap during restart: if restarting the Flask process takes 4 seconds and the team will not accept more than a 1-second serving gap, the update frequency is constrained to once per day during low-traffic windows. A blue-green deployment — standing up a new serving instance with the new model weights, routing traffic to it, and terminating the old instance — has no serving gap and an update frequency ceiling limited only by the time to initialize a new serving instance (typically 30–120 seconds for a Python serving process with a GPU model load). A canary deployment — routing a fraction of traffic to a new model version while the remainder goes to the current version — requires traffic splitting infrastructure (a load balancer or service mesh with per-version routing) and adds the complexity of maintaining two model versions in production simultaneously, but enables safe high-frequency updates with online metric validation before full rollout.
The model update frequency is a product constraint, not just an infrastructure preference. A fraud model that can only be updated weekly accumulates a growing blind spot as new fraud patterns emerge and are observed in transaction data — the model cannot be retrained to detect them until the weekly update window. A recommendation model updated daily incorporates yesterday's user behavior into today's recommendations; one updated monthly serves recommendations calibrated to last month's behavior for users whose interests have shifted. The update frequency ceiling is set when the deployment mechanism is chosen and cannot be increased without changing the deployment infrastructure. The CI/CD pipeline decision record documents the deployment automation infrastructure; the model serving ADR must document the model update frequency ceiling given the deployment mechanism, the process for validating a new model version before full rollout (shadow serving, canary percentage, monitoring window duration, rollback criterion), and the ownership of the deployment pipeline — who triggers a model update, who monitors the rollout, and who has the authority to roll back.
Property 3: Inference compute hardware and the cost model at scale. The hardware required to serve a model at production traffic volumes is determined by the model's inference latency floor on that hardware and the required QPS. If the recommendation endpoint must serve 2,000 requests per second at p99 under 50 milliseconds, and the model's CPU inference takes 30 milliseconds per request, a single CPU core can serve approximately 33 requests per second (1000ms / 30ms per request). Serving 2,000 requests per second requires 61 CPU cores just for inference computation, plus additional capacity for headroom and failure tolerance. At a cost of $0.04 per CPU-core-hour on commodity cloud compute, 70 cores cost $2.80 per hour, about $67 per day, and about $2,000 per month. If the same model on GPU takes 4 milliseconds per request and a GPU can execute 16 requests in parallel in a batched inference call, a single GPU can serve 4,000 requests per second, so the same traffic fits on 1 GPU at $3 per GPU-hour: $72 per day, about $2,200 per month. At these illustrative prices the two options cost roughly the same at 2,000 requests per second, but the GPU is running at half its capacity while the CPU fleet is nearly full, and the CPU cost grows linearly with the model's inference time: a neural model that needs 180 milliseconds per request on CPU would need about 360 cores, roughly six times the cost, for the same traffic. The cost ratio between CPU and GPU serving is therefore determined at model selection time, not at serving infrastructure time.
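The capacity and cost arithmetic in this paragraph can be reproduced with a short sketch, which is useful for re-running the comparison with real prices and measured latencies. The helper names, the 30-day month, and the flat hourly prices are assumptions for illustration.

```python
import math

HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, instances running continuously


def cpu_cores_needed(qps, inference_ms):
    """Cores needed if each core handles one request at a time, before headroom."""
    return math.ceil(qps * inference_ms / 1000)


def gpu_throughput_qps(inference_ms, batch_size):
    """Requests per second for one GPU running back-to-back full batches."""
    return batch_size * 1000 / inference_ms


def monthly_cost(units, price_per_unit_hour):
    """Flat hourly pricing, rounded to cents."""
    return round(units * price_per_unit_hour * HOURS_PER_MONTH, 2)


# Exact arithmetic gives 60 cores; the text's 61 comes from rounding 33.3 req/s down to 33.
cores = cpu_cores_needed(qps=2000, inference_ms=30)
gpus = math.ceil(2000 / gpu_throughput_qps(inference_ms=4, batch_size=16))

print("CPU cores before headroom:", cores)
print("CPU monthly cost at 70 cores:", monthly_cost(70, 0.04))
print("GPUs needed:", gpus)
print("GPU monthly cost:", monthly_cost(gpus, 3.00))
```

Changing `inference_ms` for the CPU case shows how quickly the core count, and with it the cost, scales with model complexity.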
Batched inference (grouping multiple prediction requests and running them through the model in a single forward pass) amortizes per-call GPU overhead, such as kernel launches and host-to-device transfers, across multiple requests, improving GPU utilization and reducing per-request cost. Batching introduces a wait latency: requests arriving at the inference server are held until a full batch accumulates or a timeout expires. A batch size of 32 with a maximum wait time of 5 milliseconds means that a single request may wait up to 5 milliseconds before being processed, adding a queue latency component that is not present in per-request inference. The batch size and maximum wait time are tuning parameters that trade GPU utilization against tail latency: larger batches improve utilization but increase the maximum queue wait; smaller batches reduce queue wait but under-utilize the GPU. These parameters must be calibrated against the serving SLA at p99, not just at p50, because p99 tail latency in a batched system is often dominated by the queue wait time, not the inference time. The container orchestration decision record documents the Kubernetes configuration for scaling inference pods; the model serving ADR must document the autoscaling policy for the inference tier: the metric used to trigger scale-out (CPU utilization, GPU utilization, request queue depth, inference latency percentile), the scale-out lag (the time between the trigger condition and the additional capacity being available), and the scale-in delay (the buffer before scaling down to avoid thrashing during traffic spikes). A serving tier that autoscales too slowly produces a latency spike during traffic ramps that the business experiences as a product outage, not a capacity management event.
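The queue-wait behavior of a size-or-timeout batcher can be illustrated with a small offline simulation. This is a simplified sketch: `queue_waits` is a hypothetical helper, arrival times must be sorted, and the inference server is assumed to be free whenever a batch is dispatched, so inference time and back-pressure are not modeled.

```python
def queue_waits(arrivals_ms, max_batch_size, max_wait_ms):
    """Return the queue wait of each request under size-or-timeout batching.

    A batch opens when its first request arrives. It is dispatched when it
    reaches max_batch_size, or max_wait_ms after it opened, whichever is first.
    """
    waits = []
    i, n = 0, len(arrivals_ms)
    while i < n:
        opened = arrivals_ms[i]
        deadline = opened + max_wait_ms
        j = i
        # Collect requests until the batch is full or the deadline has passed.
        while j < n and j - i < max_batch_size and arrivals_ms[j] <= deadline:
            j += 1
        is_full = (j - i) == max_batch_size
        dispatch = arrivals_ms[j - 1] if is_full else deadline
        waits.extend(dispatch - t for t in arrivals_ms[i:j])
        i = j
    return waits


# Sparse traffic: every request waits the full timeout because batches never fill.
print(queue_waits([0, 10, 20], max_batch_size=32, max_wait_ms=5))
# Dense traffic: the batch fills at t=3, so later arrivals wait less.
print(queue_waits([0, 1, 2, 3], max_batch_size=4, max_wait_ms=5))
```

Feeding recorded arrival timestamps through a simulation like this gives a first estimate of the queue-wait percentile for a candidate batch size and timeout before testing on real hardware.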
Property 4: Model rollback and the versioning contract. When a newly deployed model produces degraded online metrics — lower click-through rate, higher false negative rate on fraud, worse conversion — the time required to return to the previous model's serving behavior is determined by the versioning mechanism chosen when the first model was deployed. A model file replaced in the serving pod is gone: rolling back requires the previous artifact to be stored somewhere accessible (a model registry, an object store, the deployment pipeline's artifact cache) and a new deployment triggered for the previous version, which takes as long as any deployment — potentially 10–30 minutes for a GPU model with a slow cold-start. A model registry with version pinning — where each deployment references a version identifier in a registry rather than a specific file path — allows rollback by updating the version pointer and triggering a restart, which is faster than a full artifact upload but still requires a serving process restart. A canary deployment with traffic weight control allows rollback without restarting any process: adjusting the traffic split from 10% new model / 90% old model back to 0% new model / 100% old model takes seconds and the old model never stopped serving.
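The last mechanism, rollback by adjusting a traffic weight, can be sketched as an in-process router. In production the split usually lives in a load balancer or service mesh rather than application code; `CanaryRouter` and its methods are hypothetical names used only to show why this kind of rollback needs no restart.

```python
import random


class CanaryRouter:
    """Routes each request to the stable or canary model by a traffic fraction."""

    def __init__(self, stable, canary, canary_fraction=0.0, rng=None):
        self.stable = stable
        self.canary = canary
        self._rng = rng or random.Random()
        self.set_canary_fraction(canary_fraction)

    def set_canary_fraction(self, fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("canary fraction must be between 0 and 1")
        self.canary_fraction = fraction

    def rollback(self):
        """Send all traffic back to the stable model; both models stay loaded."""
        self.set_canary_fraction(0.0)

    def predict(self, request):
        # random() is in [0, 1), so fraction 0.0 never selects the canary.
        use_canary = self._rng.random() < self.canary_fraction
        model = self.canary if use_canary else self.stable
        return model(request)


router = CanaryRouter(lambda r: "old", lambda r: "new", canary_fraction=0.1)
router.rollback()  # takes effect on the next request
print(router.predict({"user_id": 1}))
```

Because the stable model never stopped serving, the rollback is a single attribute change rather than a deployment.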
The rollback window — the time between identifying a model regression and returning to the previous model — determines the blast radius of a bad model deployment. A 30-minute rollback window during which the degraded model is serving 2,000 requests per second means 3.6 million predictions were served at degraded accuracy before the rollback completed. The rollback window is set by the deployment mechanism and the availability of the previous version's artifact: it cannot be reduced below the deployment duration without changing the deployment infrastructure. The model serving ADR must document the version retention policy (how many previous model versions are retained in the artifact store and for how long), the rollback procedure (the sequence of steps to execute, who executes them, the time estimate for each step), the rollback criterion (the specific online metric threshold that triggers a rollback, measured over what monitoring window, by what team member), and the communication protocol for notifying stakeholders when a model rollback is triggered. The disaster recovery decision record documents the infrastructure recovery runbook; the model serving ADR must document an equivalent model accuracy recovery runbook that is distinct from infrastructure recovery — a serving infrastructure that is fully healthy and serving requests at normal latency can simultaneously be serving a model whose accuracy has degraded, and the detection and remediation of model accuracy regression requires a different escalation path than infrastructure failure.
What the founding session records and what it omits
The model serving decision is made in the founding sprint, in the week the first model is trained. It is almost always made by asking a practical question: "how do we get predictions from this model into the application?" The session answers this question efficiently and correctly — it produces a working serving endpoint. What it does not record is what the serving choice prevents and what it requires.
Four types of AI chat sessions generate the gaps:
The "how do we deploy our first ML model?" session. The team has a trained model and needs to serve predictions from the application. They ask how to wrap the model in an endpoint. The session explains Flask or FastAPI for Python serving, pickle or joblib for serialization, and how to structure a predict endpoint that accepts feature values and returns a prediction. The session produces a working deployment. It does not ask: what is the model architecture's inference latency floor on CPU, and does the product's SLA require GPU hardware? What is the maximum acceptable serving gap during model updates, and does the restart-based deployment mechanism produce a gap that exceeds that tolerance? If the model is replaced with a neural architecture in 12 months, is this serving infrastructure compatible with it, or will the serving layer need to be rebuilt from scratch? The session answers the question asked — "how do we get predictions running?" — without documenting the serving architecture as a decision with consequences for future model development. The resulting endpoint works correctly until the team trains a model that it cannot serve, at which point the serving infrastructure must be rebuilt on a timeline that was not planned because the architectural constraint was never documented. The feature store decision record documents the feature computation layer upstream of model serving; the model serving ADR must document the serving architecture choice and its latency-floor consequence as part of the same founding deployment decision, so that the model development roadmap reflects the serving infrastructure's current capability.
The "how do we add a new model type to our serving layer?" session. The team has trained a new model architecture that outperforms the current model in offline evaluation and wants to deploy it. The engineer asks how to serve the new model from the existing endpoint. The session recommends loading the new model file alongside the existing architecture. It does not ask: is the new model's serialization format compatible with the existing serving runtime, or does it require a different inference library? Does the new model's CPU inference time satisfy the serving SLA, or does it require GPU hardware that the existing serving fleet does not have? Can the existing deployment mechanism handle the new model's larger artifact size within the acceptable deployment time? The session assumes that "serving a model" is a solved problem and the new model is a different instance of the same problem. For models of the same architecture, this is often true. For a change from a gradient-boosted tree to a transformer, it is not — the new architecture may be incompatible with the existing serialization format, the existing hardware, and the existing deployment pipeline, making the serving layer migration a prerequisite for the model migration. The API schema design decision record documents the external API versioning contract; the model serving ADR must document an equivalent compatibility contract for the serving infrastructure: the model architectures and serialization formats the infrastructure is designed to serve, and the migration path when a new model architecture falls outside the current capability.
The "how do we reduce our prediction latency?" session. The team has received complaints that recommendations are slow or that risk scores are taking too long to return. They ask how to make the model faster. The session recommends model quantization (reducing numerical precision from float32 to int8), pruning (removing low-weight connections), or distillation (training a smaller student model to match the larger model's outputs). It does not ask: is the latency problem in the model inference or in the serving framework overhead, the feature lookup, or the network round-trip? If the model is already at its architecture's theoretical minimum inference time, no serving optimization will reduce latency below the floor. If the latency problem is in the serving framework overhead — a Flask server with high Python GIL contention — the fix is a different serving framework, not a different model. If the latency problem is in the feature lookup — a sequential database query that blocks model inference — the fix is the feature caching layer documented in the feature store decision record. The session optimizes the wrong component because the latency budget per component was never allocated in the founding serving ADR. Without a documented breakdown of the end-to-end latency budget (feature lookup: 8ms, serving framework overhead: 5ms, model inference: 30ms, network: 2ms, total budget: 50ms), every latency investigation starts from scratch by profiling the entire serving path. The observability strategy decision record documents the metrics infrastructure; the model serving ADR must document the per-component latency budget and the instrumentation that measures each component's contribution to end-to-end prediction latency, so that latency investigations can identify the bottleneck rather than optimizing the component that is easiest to profile.
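A documented per-component budget makes the first step of a latency investigation mechanical: compare measured component latencies against the budget and see which component is over. The sketch below uses the example budget from the paragraph above; the function name and the measured values are hypothetical.

```python
# Example budget from the text, in milliseconds at p99.
BUDGET_MS = {
    "feature_lookup": 8,
    "framework_overhead": 5,
    "model_inference": 30,
    "network": 2,
}
TOTAL_SLA_MS = 50


def over_budget(measured_ms, budget_ms):
    """Return {component: overshoot in ms} for every component above its budget."""
    return {
        name: measured_ms[name] - allowed
        for name, allowed in budget_ms.items()
        if measured_ms.get(name, 0) > allowed
    }


# Hypothetical measurements: inference is within budget, feature lookup is not.
measured = {"feature_lookup": 25, "framework_overhead": 4, "model_inference": 28, "network": 2}
print(over_budget(measured, BUDGET_MS))   # the component to investigate first
print(sum(measured.values()) <= TOTAL_SLA_MS)
```

In this hypothetical case quantizing the model would not help: the overshoot is entirely in the feature lookup.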
The "how do we update the model more frequently?" session. The team wants to retrain and redeploy the model more often — daily instead of weekly, or hourly instead of daily — to incorporate more recent data. They ask how to automate model retraining and deployment. The session recommends a retraining pipeline triggered by a data freshness check, a model evaluation step that validates the new model meets a minimum quality bar before deployment, and a deployment trigger that pushes the new artifact to the serving fleet. It does not ask: does the current deployment mechanism produce a serving gap that is acceptable at the target update frequency? If the model is updated daily and each deployment causes a 4-second serving gap, the product experiences 4 seconds of degraded serving every 24 hours — acceptable for quarterly updates, potentially unacceptable for daily ones. Does the serving infrastructure support running two model versions simultaneously during the rollout, so that online metrics can be validated before the new version serves all traffic? Does the rollback procedure execute quickly enough to contain a bad model deployment that is discovered minutes after full rollout? None of these questions arise in the "how do we automate retraining?" session because they are serving infrastructure questions, not training pipeline questions. The background job infrastructure decision record documents the batch job execution framework for the retraining pipeline; the model serving ADR must document the deployment frequency ceiling given the deployment mechanism, and the minimum serving infrastructure requirements for increasing update frequency beyond that ceiling, so that the retraining automation investment is not blocked by an undocumented serving infrastructure constraint discovered after the automation is built.
The WhyChose extractor surfaces the founding model deployment session, the first "make it faster" session, the first "new model architecture" session, and the first "update more frequently" session from AI chat history. The model serving ADR converts the implicit choices in those sessions — Flask endpoint, pickle serialization, CPU hardware, rolling restart — into documented latency-floor constraints, model update frequency ceilings, hardware cost projections, and rollback windows, so that the next engineer who proposes a new model architecture can check whether the serving infrastructure supports it before training a model that cannot be deployed on the existing fleet.
The five sections of a model serving infrastructure ADR
Section 1: Serving architecture and prediction latency SLA. Document the serving architecture: batch scoring (entity set pre-scored on a schedule, scores written to a database for downstream read) or online inference (prediction requested at serving time, computed synchronously within the SLA). For batch serving, document the scoring cadence, the entity set, the feature materialization requirement (features must be fully computed before the batch job runs), and the staleness constraint (predictions are at most N hours old at serving time). For online inference, document the model architecture and its inference latency floor on the target hardware, the serving framework and its overhead contribution to end-to-end latency, the per-component latency budget allocation (feature lookup + framework overhead + model inference + network ≤ total SLA), and the latency SLA at p99. Document the serialization format and the serving runtime: the format determines which hardware targets and runtime optimizations are available, and the runtime determines which model architectures and layer types are compatible. Document the feature serving dependency: which features are pre-computed in the feature store and looked up at serving time, which features are computed at serving time from request context, and the latency contribution of each feature group lookup. The feature store decision record documents the feature serving architecture; the model serving ADR must document the integration point — how feature values are passed from the feature store to the inference call, and the timeout and fallback behavior if the feature store does not respond within the latency budget.
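The timeout and fallback behavior at the feature store integration point can be sketched with the standard library. This is one possible shape under stated assumptions: `lookup_with_fallback` is a hypothetical helper, `lookup` stands in for whatever client call fetches features, and the fallback is a set of default feature values chosen by the team.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for feature lookups (size is an arbitrary example value).
_executor = ThreadPoolExecutor(max_workers=4)


def lookup_with_fallback(lookup, entity_id, timeout_s, fallback):
    """Return (features, used_fallback).

    If the lookup does not finish within timeout_s, the fallback features are
    returned so that inference can proceed within the latency budget.
    """
    future = _executor.submit(lookup, entity_id)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        future.cancel()  # best effort: a call that is already running is not interrupted
        return fallback, True
```

Returning the `used_fallback` flag lets the caller log how often predictions are made on default features, which is itself a signal worth monitoring.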
Section 2: Model update frequency and deployment protocol. Document the deployment mechanism: in-place model file replacement with process restart (documents the serving gap duration and the acceptable serving gap tolerance), blue-green deployment (documents the instance initialization time and the traffic cutover mechanism), or canary deployment (documents the traffic split infrastructure, the minimum canary percentage, and the monitoring window before full rollout). Document the model update frequency ceiling given the deployment mechanism: the maximum frequency at which model weights can be refreshed in production without the deployment overhead exceeding the acceptable serving impact. Document the deployment pipeline: the steps from "new model artifact is validated" to "new model is serving 100% of traffic," who triggers each step, what automated validations are run (minimum quality score, comparison against the current production model on a held-out validation set, latency regression check), and the manual approval gate if one exists. Document the canary and shadow serving protocols: the traffic percentage served by the new model during validation, the online metrics monitored during the canary period, the decision criterion for advancing to full rollout versus rolling back, and the monitoring window duration. The CI/CD pipeline decision record documents the deployment automation infrastructure; the model serving ADR must document the model-specific deployment steps that are distinct from application code deployment — model artifact upload, serving process warm-up (models with slow initialization must serve a minimum number of warm-up requests before they are added to the traffic pool), and post-deployment validation metrics.
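The automated validations listed above (minimum quality score, comparison against production on a held-out set, latency regression check) can be expressed as a single gate that returns the reasons a candidate is blocked. The names, fields, and thresholds below are hypothetical; the metric is assumed to be one where higher is better.

```python
from dataclasses import dataclass


@dataclass
class ModelReport:
    quality: float         # offline metric on the held-out validation set, higher is better
    p99_latency_ms: float  # measured on the target serving hardware


def failed_checks(candidate, production, min_quality, max_latency_regression_ms):
    """Return the list of failed pre-deployment checks; empty means the gate passes."""
    failures = []
    if candidate.quality < min_quality:
        failures.append("below minimum quality score")
    if candidate.quality < production.quality:
        failures.append("worse than production on held-out set")
    if candidate.p99_latency_ms > production.p99_latency_ms + max_latency_regression_ms:
        failures.append("latency regression")
    return failures


production = ModelReport(quality=0.80, p99_latency_ms=20)
candidate = ModelReport(quality=0.82, p99_latency_ms=21)
print(failed_checks(candidate, production, min_quality=0.75, max_latency_regression_ms=5))
```

Returning reasons rather than a boolean gives the person at the manual approval gate something concrete to review.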
Section 3: Inference compute hardware selection and cost model. Document the hardware target: CPU (commodity compute, no GPU initialization overhead, latency floor determined by model complexity and CPU clock speed) or GPU (lower latency for neural models, higher fixed cost, GPU availability and quota constraints in cloud environments). For GPU serving, document the GPU type (GPU memory, CUDA compute capability, the minimum VRAM required to hold the model in memory), the batching configuration (minimum batch size, maximum batch size, maximum wait time), and the GPU utilization target at steady-state traffic (the GPU utilization that maximizes throughput while satisfying the p99 latency SLA). Document the autoscaling policy: the metric used to trigger scale-out (inference latency p99 exceeding the SLA, request queue depth, GPU utilization), the scale-out lag (the time between the trigger and additional capacity serving traffic), and the minimum instance count (the floor that prevents scale-in to zero during low-traffic periods, avoiding cold-start latency on the next traffic ramp). Document the cost model: the per-request inference cost at current traffic volume (hardware cost per hour / requests per hour per instance), the cost at peak traffic, the cost at 10× growth, and the cost inflection point where GPU serving becomes cheaper than CPU serving given the model's inference latency characteristics. The container orchestration decision record documents the Kubernetes configuration; the model serving ADR must document the GPU resource request and limit configuration per inference pod, the node pool or instance type required, and the GPU quota headroom available for traffic spikes.
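The scale-out decision and the minimum instance count can be sketched with a proportional rule of the same shape as the one documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)). The function name and the example metric (request queue depth per instance) are assumptions; scale-out lag and scale-in delay are not modeled here.

```python
import math


def desired_instances(current, observed_metric, target_metric, min_instances, max_instances):
    """Proportional scaling, clamped to the documented floor and ceiling."""
    raw = math.ceil(current * observed_metric / target_metric)
    return max(min_instances, min(max_instances, raw))


# Queue depth is 3x the target on 4 instances: scale toward 12, capped at 10.
print(desired_instances(4, observed_metric=30, target_metric=10, min_instances=2, max_instances=10))
# No traffic: the floor prevents scale-in to zero and the cold start that would follow.
print(desired_instances(4, observed_metric=0, target_metric=10, min_instances=2, max_instances=10))
```

The ceiling in this sketch corresponds to the GPU quota headroom the ADR is asked to record: a policy that wants 12 instances when quota allows 10 will not meet the SLA during that spike.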
Section 4: Model versioning, rollback, and canary serving contract. Document the version retention policy: how many previous model versions are retained in the artifact store (model registry, object storage), the retention duration for each version, and the access mechanism for retrieving a previous version during a rollback. Document the rollback procedure: the sequence of steps to execute (who identifies the regression, who initiates the rollback, the step-by-step process for each deployment mechanism, the expected time to complete), the rollback criterion (the specific online metric, the threshold, the monitoring window duration, and the minimum number of observations required before a rollback is considered statistically reliable rather than noise), and the communication protocol for notifying stakeholders. Document the shadow serving protocol when a new model type requires extended validation: the infrastructure for serving the new model in parallel without returning its predictions to callers, the metrics collected from shadow requests (prediction distribution, latency, error rate), and the exit criterion for graduating from shadow to canary serving. Document the multi-version serving capability: whether the serving infrastructure can simultaneously route traffic to two model versions with different weights (required for canary deployment and rollback without serving gaps), the traffic split mechanism, and the per-version metrics collection needed to compare model performance during canary periods. The API rate limiting decision record documents the rate limiting configuration for serving endpoints; the model serving ADR must document the rate limit behavior during a canary rollout when one version has degraded latency — whether rate limiting applies per version or to the aggregate endpoint, and whether a rate limit on a slow canary version blocks traffic from reaching the healthy production version.
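One way to make the rollback criterion concrete is to encode the metric, the threshold, and the minimum number of observations in a single function. The sketch below assumes the online metric is click-through rate and uses a one-sided two-proportion z-test to separate a real regression from noise. The function name, the 5,000-view minimum, and the 2.33 threshold (roughly 1% one-sided significance) are illustrative assumptions, and a different metric would call for a different test.

```python
import math


def should_roll_back(control_clicks: int, control_views: int,
                     canary_clicks: int, canary_views: int,
                     min_views: int = 5000, z_threshold: float = 2.33) -> bool:
    """True when the canary's click-through rate is significantly below control.

    Returns False while either arm has fewer than `min_views` observations,
    so that an early noisy reading does not trigger a rollback.
    """
    if min(control_views, canary_views) < min_views:
        return False
    p_control = control_clicks / control_views
    p_canary = canary_clicks / canary_views
    # Pooled click-through rate under the hypothesis that both versions are equal.
    pooled = (control_clicks + canary_clicks) / (control_views + canary_views)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_views + 1 / canary_views))
    if std_err == 0:
        return False  # no variation at all, so nothing to compare
    z = (p_control - p_canary) / std_err
    return z > z_threshold


if __name__ == "__main__":
    # Control at 5% CTR, canary at 4% CTR over 5,000 views.
    print(should_roll_back(5000, 100_000, 200, 5000))
```

Keeping the minimum-observation check inside the criterion means the monitoring window in the ADR and the code that enforces it cannot drift apart.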
Section 5: Model serving observability — latency, error rate, and prediction distribution monitoring. Document three distinct monitoring surfaces. The first is serving infrastructure health: inference latency at p50, p99, and p999 per model version, serving error rate (model load failures, inference timeouts, serialization errors), and hardware utilization metrics (CPU utilization, GPU utilization, GPU memory usage). These metrics indicate that the serving infrastructure is healthy and processing requests within the SLA; they do not indicate that the model is producing accurate predictions. The second is prediction distribution monitoring: the statistical distribution of predicted values (prediction mean, standard deviation, fraction of predictions at extreme values) tracked over time, with alerts when the distribution shifts significantly from the baseline established during the model's first week in production. A sudden shift in the prediction distribution — all recommendations converging to a single item, fraud scores collapsing to near zero, credit risk scores bimodally clustering at the extremes — is often the first detectable signal of a model accuracy regression that has not yet produced degraded business metrics. The logging strategy decision record documents the structured logging contract; prediction distribution monitoring requires logging each prediction's output value alongside the entity ID, model version, and timestamp, so that distribution statistics can be computed as aggregations over the prediction log rather than requiring a separate sampling mechanism. The third monitoring surface is online business metrics: the downstream metrics that the model is intended to improve (click-through rate for a recommendation model, fraud detection recall for a fraud model, loan default rate for a credit risk model). 
These are the ground-truth signal for model accuracy, but they arrive with a delay: loan default metrics require waiting for loans to mature, and click-through rate requires sufficient traffic to be statistically reliable. The model serving ADR must document the expected monitoring window for each metric (the minimum observation period required to detect a meaningful regression with statistical confidence) and the escalation path when an online metric regression is detected: how long to monitor before initiating rollback, who makes the rollback decision, and how to communicate the rollback and its cause to stakeholders. The observability strategy decision record documents the metrics infrastructure; the model serving ADR must specify which of the three monitoring surfaces are implemented in the standard metrics pipeline and which require separate instrumentation, since prediction log analysis and online business metric collection typically require data pipelines that are distinct from the infrastructure monitoring stack.
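The prediction distribution monitoring described in this section can be computed as a plain aggregation over the prediction log. The sketch below summarizes the logged output values (mean, standard deviation, fraction at extreme values) and compares a current window against the baseline. The cutoffs for "extreme" and all three alert thresholds are hypothetical; a production system might use a formal drift statistic instead of these simple comparisons.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence


def summarize(scores: Sequence[float], low: float = 0.05, high: float = 0.95) -> Dict[str, float]:
    """Summary statistics for one window of logged prediction values."""
    extreme = sum(1 for s in scores if s <= low or s >= high)
    return {
        "mean": fmean(scores),
        "std": pstdev(scores),
        "extreme_fraction": extreme / len(scores),
    }


def drift_alerts(baseline: Dict[str, float], current: Dict[str, float],
                 max_mean_shift_in_std: float = 0.5,
                 min_std_ratio: float = 0.5,
                 max_extreme_delta: float = 0.10) -> List[str]:
    """Compare the current window to the baseline; thresholds are assumptions."""
    alerts = []
    scale = baseline["std"] or 1e-9  # avoid dividing by zero on a flat baseline
    if abs(current["mean"] - baseline["mean"]) / scale > max_mean_shift_in_std:
        alerts.append("mean shift")
    if current["std"] < min_std_ratio * baseline["std"]:
        alerts.append("variance collapse")
    if abs(current["extreme_fraction"] - baseline["extreme_fraction"]) > max_extreme_delta:
        alerts.append("extreme-value shift")
    return alerts


if __name__ == "__main__":
    baseline = summarize([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
    collapsed = summarize([0.01] * 9)  # e.g. fraud scores collapsing to near zero
    print(drift_alerts(baseline, collapsed))
```

In practice the `scores` input would be the output-value column of the prediction log, filtered by model version and time window, which is why the text asks for those fields to be logged with every prediction.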
None of these five sections appears in the record of the founding deployment session, the session in which the model first began serving requests. The session records that Flask is the serving framework, that the model is loaded with pickle, and that the endpoint returns a ranked list of predictions. It does not document that CPU inference for a neural architecture will exceed the SLA by 130 milliseconds, that batch scoring cannot serve real-time predictions when the product team adds an instant approval flow, that the 4-second serving restart gap becomes a constraint when the business needs daily model updates, or that a model rollback after a bad deployment will take 20 minutes because the previous artifact was overwritten and must be retrieved from a backup. The model serving ADR converts the implicit architectural choice (Flask endpoint, pickle serialization, rolling restart) into the operational parameters that determine whether a team can deploy a new model architecture without rebuilding the serving infrastructure, update model weights daily without manual intervention, and roll back a bad deployment within minutes while the previous model continues serving at full quality. The WhyChose extractor recovers the founding deployment session, the first latency investigation, the first new model architecture session, and the first automated retraining proposal from AI chat history; the model serving ADR extracts the durable architectural constraints from those sessions and documents them where engineers encounter them: next to the serving infrastructure configuration and the deployment pipeline, not in a Slack thread from the founding sprint or a Jupyter notebook that the backend engineer who built the Flask endpoint has never read.
FAQs
When should a team use batch scoring versus online inference for model serving?
Batch scoring is appropriate when the consumer of predictions tolerates staleness equal to the batch cadence, the entity set is finite and enumerable, features are fully available at batch time rather than depending on real-time request context, and the prediction request rate makes per-request online inference prohibitively expensive. Online inference is required when predictions must be delivered within a latency SLA shorter than the batch cadence, features depend on real-time context (current session state, exact request timestamp, live request content), or the entity space is open-ended and new entities must receive predictions immediately.
The failure mode of choosing batch serving when online inference is needed is not a gradual degradation but a hard architectural ceiling. When the product team adds a feature requiring real-time predictions, the batch infrastructure cannot be incrementally extended — it must be replaced with an online inference service. That migration is typically two to four months for a production-grade implementation, and it blocks every product feature requiring real-time predictions until it completes.
What determines the inference latency floor for a model serving endpoint?
The inference latency floor is the sum of four compounding constraints: the model architecture's computational complexity on the target hardware (logistic regression under 1ms on CPU; gradient-boosted tree 5–80ms on CPU; transformer 80–250ms on CPU, 5–20ms on GPU); the serving framework overhead (Flask adds 8–20ms per request; gRPC with protobuf under 2ms; in-process library call negligible); the hardware's per-operation throughput (CPU cores vs GPU parallel execution units); and the serialization and runtime format (ONNX Runtime with hardware-specific optimization is 20–60% faster than the framework's default Python runtime for the same model).
The latency floor cannot be reduced below the model architecture's theoretical minimum on the target hardware without changing the model (quantization, pruning, distillation, architectural simplification) or changing the hardware (GPU instead of CPU). Both remediations require non-trivial engineering investment. The serving ADR must document the floor given the current model architecture and hardware, so that the team knows in advance whether a proposed new model architecture is compatible with the existing serving SLA.
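As a rough illustration of how the documented floor can be used to screen a proposed architecture, the sketch below adds the model compute range to the serving framework overhead range, using the indicative figures quoted in this answer, and compares the result with the SLA. Treating the floor as a simple sum of these two ranges is a simplifying assumption (it ignores feature lookup, network time, and runtime-format speedups), and the table keys and function names are hypothetical.

```python
from typing import Tuple

# Indicative (best, worst) latencies in milliseconds, taken from the ranges above.
MODEL_MS = {
    ("logistic_regression", "cpu"): (0.0, 1.0),
    ("gradient_boosted_tree", "cpu"): (5.0, 80.0),
    ("transformer", "cpu"): (80.0, 250.0),
    ("transformer", "gpu"): (5.0, 20.0),
}
FRAMEWORK_MS = {
    "flask": (8.0, 20.0),
    "grpc": (0.0, 2.0),
    "in_process": (0.0, 0.0),
}


def latency_floor_range(model: str, hardware: str, framework: str) -> Tuple[float, float]:
    """Best-case and worst-case floor: model compute plus framework overhead."""
    model_best, model_worst = MODEL_MS[(model, hardware)]
    framework_best, framework_worst = FRAMEWORK_MS[framework]
    return model_best + framework_best, model_worst + framework_worst


def sla_verdict(model: str, hardware: str, framework: str, sla_ms: float) -> str:
    """Classify a proposed combination against the serving SLA."""
    best, worst = latency_floor_range(model, hardware, framework)
    if worst <= sla_ms:
        return "fits"
    if best > sla_ms:
        return "cannot fit"
    return "needs measurement"


if __name__ == "__main__":
    print(sla_verdict("transformer", "cpu", "flask", sla_ms=50.0))
    print(sla_verdict("transformer", "gpu", "grpc", sla_ms=50.0))
```

A "cannot fit" verdict means even the best case exceeds the SLA, so no amount of tuning on that hardware and framework will help; "needs measurement" means the indicative ranges straddle the SLA and only a benchmark of the actual model can decide.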
What should an ML model serving ADR document that a general API infrastructure decision does not?
A general API infrastructure decision records the web framework, load balancer, autoscaling policy, and deployment mechanism. An ML model serving ADR must document five additional structural properties: (1) the serving architecture — batch vs online, the model architecture's inference latency floor on the target hardware, and the per-component latency budget; (2) the model update frequency ceiling and deployment protocol — how often weights can be refreshed in production given the deployment mechanism, the canary and shadow serving procedures, and the rollback criterion and window; (3) the inference compute hardware selection — CPU vs GPU, the autoscaling policy, the cost model at peak QPS, and the GPU batching configuration; (4) the model versioning and rollback contract — version retention policy, the rollback procedure step-by-step, and whether multi-version simultaneous serving is supported; (5) the model serving observability SLA — infrastructure health metrics, prediction distribution monitoring, and online business metric collection with the expected monitoring window for statistical significance.
None of these appear in the founding deployment session. All of them determine whether a team can serve a new model architecture without infrastructure migration, update weights daily without a serving gap, and detect a model accuracy regression before it degrades business metrics at scale.