Why does API rate limiting need an architecture decision record?

API rate limiting looks like a configuration choice — set a number, pick a window, deploy a middleware — but it is actually a set of architectural decisions whose consequences compound over time and cannot be reversed without breaking clients. The rate limit identifier (IP address, API key, user ID, or organization ID) determines who gets a shared bucket and who gets an isolated one; choosing IP address works correctly in development and breaks behind corporate proxies in production because thousands of users share one IP. The rate limiting algorithm (fixed window, sliding window, token bucket, leaky bucket) determines the burst behavior: a fixed window allows a burst of 2x the declared rate at window boundaries, which is the normal behavior for any client that queues requests, and the burst is invisible until a traffic spike reveals it. The storage backend (in-process, Redis, a distributed counter) determines whether limits are enforced consistently across a horizontally scaled application or whether each instance keeps its own count, allowing a client to send N requests per second to each of K instances for an effective rate of N*K. The 429 response format — which headers are included, whether a Retry-After header is present, what the response body contains — is a client contract that published API consumers build retry logic around, and changing it is a breaking change. The override and bypass policy (which API keys get higher limits, which IP ranges are exempt, whether there is an emergency override mechanism at 2am without a deployment) is a security decision disguised as an operational convenience. Each of these decisions is made at different points in time — often under pressure, after an incident, or in response to a partner complaint — and none of them are visible in the code. 
The rate limiting ADR is the document that holds the full set together, makes the trade-offs legible, and gives the on-call engineer the context to make a safe change at 2am rather than guessing which limit controls which behavior.

What is the difference between IP-based and API-key-based rate limiting and why does it matter?

IP-based rate limiting applies the rate limit counter to the client's IP address. It requires no authentication — it works for unauthenticated endpoints, it stops bots that do not have API credentials, and it is the only option for endpoints that must be reachable before authentication (login endpoints, password reset, public content). Its structural limitation is that IP address is not a reliable proxy for a single client. A corporate office of 2,000 employees shares a single egress IP through its NAT gateway. A mobile carrier may route thousands of users through a small pool of shared IPs. A CDN or API gateway proxies requests from an unpredictable range of client IPs behind a small number of infrastructure IPs. In each case, a rate limit that was calibrated for a single user becomes a shared quota for an arbitrarily large group of users, and a legitimate traffic spike can trigger the limit across the entire group simultaneously. API-key-based rate limiting applies the counter to the API key presented in the request. It requires that the client authenticate before rate limits are applied, which means it cannot protect unauthenticated endpoints and cannot stop a client that creates multiple keys to circumvent limits. Its structural advantage is that each key holder gets an isolated quota regardless of network topology: the 2,000-person office has one key and one quota, not 2,000 people sharing one IP quota. In B2B SaaS products where the customer's integration is identified by an API key and the customer is the billing unit, API-key-based limiting is the correct semantic: the customer's integration has a published rate, the customer can monitor their own usage, and the sales conversation about upgrading to a higher-rate tier is a commercial conversation, not a support ticket. 
User-ID-based and organization-ID-based rate limiting extend this principle: in a multi-tenant application where the API key identifies the integration but the requests are made on behalf of distinct users or organizations, the rate limit counter should match the accountability unit. The choice between these identifiers belongs in the rate limiting ADR because it is a data model decision — the counter storage key is a claim extracted from the authenticated request — and changing it after the API is published requires a coordinated migration of both the rate limiting infrastructure and the clients that built retry logic around the existing limit semantics.

What should an API rate limiting decision record include?

An API rate limiting ADR needs five sections.
First, the rate limit scope: which endpoints are rate limited (all endpoints, authenticated endpoints only, expensive endpoints separately from cheap endpoints), whether rate limits are global or per-endpoint, and the reasoning for any endpoint-specific limits that differ from the global default.
Second, the identifier and algorithm: the rate limit counter key (IP, API key, user ID, organization ID, or a composite) with the rejection reason for each alternative; the algorithm (fixed window, sliding window, token bucket, or leaky bucket) with the specific burst behavior that algorithm implies at the declared limit; and the storage backend (in-process, Redis with INCR+EXPIRE, Redis with a Lua script for atomicity, a distributed approximate counter) with the consistency guarantee each provides across horizontal scale.
Third, the limit values and their derivation: the specific limit values for each tier (unauthenticated, free, paid, partner) and the load test or capacity analysis that justifies each value; limits set without derivation are guesses that become technical debt when the first customer complaint about throttling arrives.
Fourth, the 429 response specification: the response headers (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset or Retry-After), the response body format and which limit was hit, whether the response varies by the identifier type (a per-IP limit hit vs. a per-key limit hit should be distinguishable so clients can take different action), and the documented client behavior the response is designed to produce (exponential backoff, immediate retry after Retry-After, circuit break).
Fifth, the bypass and override policy: which API keys or IP ranges are exempt from rate limiting and why (partner webhooks from known IPs, internal service accounts, load testing keys), the mechanism for granting an emergency rate limit increase without a deployment (a feature flag, a database row, a Redis key), and the approval process for permanent limit changes to ensure the values in the ADR match the values in production.

How do rate limiting decisions appear in AI chat history?

Rate limiting decisions appear in AI chat history in four session types.
First, the initial security session: 'how do I prevent someone from spamming my API?', 'should I rate limit by IP or by user?', 'what is a good rate limit for a REST API?', 'how do I add rate limiting to an Express.js middleware?' These sessions contain the identifier choice (IP vs API key), the algorithm choice, and the initial limit values, all made under the framing of abuse prevention without the context of legitimate traffic spikes or partner integration requirements. The security framing produces different defaults than a capacity planning framing: a security engineer sets the limit as low as abuse prevention requires; a platform engineer sets the limit as high as the upstream service can sustain. The initial session captures whichever framing was used and why.
Second, the traffic spike session: 'my API is being hammered, how do I add rate limiting quickly?', 'we are getting thousands of requests per second and the server is falling over', 'how do I rate limit without blocking legitimate users?' These sessions document the incident-driven addition of rate limiting, often implemented under pressure with values chosen to stop the immediate spike rather than calibrated against normal traffic patterns. The choices made in this session (IP-based because it was fastest to implement, fixed window because the middleware had a simple config, limits set to 100/minute because it stopped the bots) become the permanent rate limiting strategy because the incident resolves and the team moves on.
Third, the partner integration session: 'our partner's webhook delivery is getting rate limited and they're losing events', 'we need to whitelist a specific IP range for a partner', 'our integration partner says they need 10,000 requests per minute and we're set to 1,000', 'how do we give one customer a higher rate limit than others?' These sessions document the bypass and override policy, typically added piecemeal as individual partner requirements are raised, without a general framework. The decisions made here (add the partner's IP to an allowlist, bump their API key's limit in a database table, create a 'partner' tier) become the rate limiting tier structure, often without any general policy that governs future requests.
Fourth, the SLA and compliance session: 'we need to publish our API rate limits in our developer docs', 'a customer's legal team is asking for our rate limiting policy in the contract', 'our SOC 2 auditor wants to see our API security controls including rate limiting', 'we had a brute force attack on the login endpoint and need to document the controls we have'. These sessions document the transition of rate limiting from an internal technical measure to a published commitment with contractual and compliance implications, which changes the decision-making context for all future rate limit modifications.

2026-06-19 · ~20 min read

The API rate limiting decision record: why the rate limiting approach you chose determines your abuse surface and your SLO degradation behavior under traffic spikes

Rate limiting is almost never planned — it is added reactively, after the first abuse incident or the first traffic spike that takes the service down. The approach chosen at that moment, under pressure, without context about legitimate traffic patterns or future partner requirements, determines whether the service degrades gracefully when traffic spikes, whether the abuse surface can be narrowed without breaking existing integrations, and whether the on-call engineer can change a rate limit value at 2am without triggering a deployment. None of these properties are visible in the code without the decision record.

A developer API launches quietly and accumulates a small community. Four months after launch, a prominent technical newsletter mentions the product. The newsletter goes out on a Tuesday morning and 50,000 people click the link. Ten thousand of them try the API within the first hour.

The rate limiting was added three weeks earlier, after a bot scraped the entire public dataset over a weekend. The implementation took an afternoon: an Express middleware that counts requests by IP address in a fixed one-minute window, with a limit of 60 requests per minute. The limit was chosen because the scraping bot was sending about 120 requests per minute. Halve it, block the bot. Done.

The newsletter traffic is different. It is not bots — it is engineers in offices, at their laptops, behind corporate NAT gateways. A fintech firm in London has 800 developers; all 800 share a single egress IP. The rate limit fires immediately. All 800 engineers see 429 Too Many Requests. The support queue fills with tickets from people who have never heard of rate limits: "your API is broken." The team is debugging a flood of support tickets for a service that is, technically, working exactly as designed.

Making it worse: the 429 response body says "rate limit exceeded" with no indication of which limit was hit, no Retry-After header, and no documentation link. The firm's engineers assume it is a bug. They retry immediately. They hit the limit again. They escalate to their engineering leadership, who sends an email to the company's sales contact. The sales contact escalates to the founder. The founder is on a call with an investor.

The rate limit value is in an environment variable. Changing it requires a deployment. The deployment pipeline takes twelve minutes. While the team debates whether to disable rate limiting entirely (which would expose the API to the scraping bot again) or raise the limit (which would require deciding a new value, right now, with no data about what legitimate traffic looks like), the newsletter moment is passing. Half the engineers who tried the API have moved on.

Like most foundational infrastructure decisions, the rate limiting implementation is visible as a working system but invisible as a set of decisions. The IP-based identifier, the fixed-window algorithm, the 60-requests-per-minute value, the minimal 429 response format, the environment-variable-only configuration with no runtime override — each was a decision, made in the context of stopping a specific bot, without documentation of the reasoning or the constraints. When a different kind of traffic arrived, there was no decision record to consult. The on-call engineer had no map of why the system worked the way it did, no authority to change the limit without a deployment, and no context for what "safe" looked like.

Why rate limiting is an architectural decision, not a middleware configuration

Rate limiting appears to be a configuration problem: set a number, choose a window, deploy a middleware. The configuration framing obscures what is actually being decided. There are at least five distinct architectural decisions embedded in a rate limiting implementation, each with consequences that compound over time.

The rate limit identifier is a data model decision. The counter key in the rate limiting store is derived from a claim extracted from the authenticated (or unauthenticated) request — the client IP address, the API key value, the authenticated user ID, the organization ID for a multi-tenant B2B product. This choice determines who shares a quota and who gets an isolated one. IP-based limiting is a reasonable default for unauthenticated endpoints; it is a harmful default for authenticated endpoints used by enterprise customers with thousands of employees behind a single egress IP. Changing the identifier after the API is published requires migrating clients who built retry logic against the existing limit semantics — the quota was per-IP, and they sized their request rate accordingly; per-key limits have different semantics that require different client behavior.

The rate limiting algorithm determines the burst behavior. A fixed-window counter resets at the boundary of each time window: 60 requests per minute means 60 requests between 00:00 and 00:59, and another 60 requests between 01:00 and 01:59. A client that makes 60 requests in the last second of one window and 60 requests in the first second of the next window sends 120 requests in two seconds without violating the limit. This is the window boundary burst, and it is not an edge case — it is the normal behavior of any client that batches requests or retries a queue at window reset. A sliding window counter tracks the request count over the trailing N seconds from the current time, eliminating the burst at window boundaries, but requires either more complex storage (a sorted set of request timestamps) or an approximate algorithm that introduces a small counting error. The choice of algorithm determines the actual burst rate that the upstream service must absorb, which determines the upstream service's capacity requirement.

The storage backend determines the consistency of enforcement across scale. An in-process counter (a hash map in the application instance) is exact for a single instance and meaningless for a horizontally scaled application: each instance keeps its own count, and a client that round-robins across K instances is allowed K times the declared rate. A Redis counter (INCR on a key with a TTL) provides near-exact enforcement across all instances, with the caveat that Redis becomes a synchronous dependency on the critical request path — a Redis latency spike adds latency to every API request, and a Redis unavailability blocks all rate limiting decisions, which means either all requests are allowed (fail open) or all requests are blocked (fail closed). The fail-open vs. fail-closed policy for rate limiting store unavailability is a security decision that must be made explicitly, and it belongs in the rate limiting ADR alongside the storage backend choice.
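The fail-open versus fail-closed choice can be written down in code as an explicit parameter rather than left to whatever happens on a timeout. The following is a minimal sketch in Python, not a definitive implementation: `StoreUnavailable`, `check_limit`, and the `count_request` callable are hypothetical names introduced for illustration and do not belong to any particular Redis client.

```python
from typing import Callable, Dict


class StoreUnavailable(Exception):
    """Hypothetical error raised when the counter store cannot be reached."""


def check_limit(count_request: Callable[[str], int], key: str,
                limit: int, fail_open: bool) -> bool:
    """Return True when the request should be allowed.

    count_request increments and returns the counter for key, or raises
    StoreUnavailable. fail_open is the policy recorded in the ADR.
    """
    try:
        current = count_request(key)
    except StoreUnavailable:
        # The store is down: apply the documented policy instead of guessing.
        return fail_open
    return current <= limit


def unreachable_store(key: str) -> int:
    raise StoreUnavailable("counter store timed out")


counts: Dict[str, int] = {}


def healthy_store(key: str) -> int:
    counts[key] = counts.get(key, 0) + 1
    return counts[key]


print(check_limit(unreachable_store, "key-123", limit=60, fail_open=True))   # True
print(check_limit(unreachable_store, "key-123", limit=60, fail_open=False))  # False
print(check_limit(healthy_store, "key-123", limit=60, fail_open=False))      # True
```

Making `fail_open` a required argument forces every call site to state the policy, which keeps the code aligned with the decision recorded in the ADR.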

The 429 response format is a client contract. The headers included in a 429 response (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, Retry-After) are the signals that well-behaved API clients use to implement backoff. A 429 response without a Retry-After header leaves the client guessing when to retry; clients typically fall back to exponential backoff with jitter, which is correct behavior but means that many retries land before the limit has reset and are rejected again, adding load without making progress. A 429 response with a precise Retry-After value allows clients to back off exactly to the moment the limit resets; when thousands of clients were limited in the same window and none of them adds jitter, this produces a synchronized retry surge that can exceed the original traffic. The response format determines the retry behavior at scale; the retry behavior at scale determines the load the service receives after each rate limit window resets.
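On the client side, one way to respect Retry-After without retrying in lockstep with every other limited client is to add a small random offset to the server's value. The sketch below is an assumption about how a well-behaved client might do this, not a prescribed algorithm; `retry_delay` and its parameters (`base`, `cap`) are hypothetical names.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 1.0, cap: float = 60.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before the next retry.

    With a Retry-After value: wait at least that long, plus up to `base`
    seconds of jitter so clients limited together do not retry together.
    Without one: exponential backoff with full jitter, capped at `cap`.
    """
    rng = rng or random.Random()
    if retry_after is not None:
        return retry_after + rng.uniform(0.0, base)
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


rng = random.Random(0)  # seeded only to make the example repeatable
with_header = retry_delay(attempt=0, retry_after=47, rng=rng)
without_header = retry_delay(attempt=3, rng=rng)
print(47 <= with_header <= 48)      # True
print(0 <= without_header <= 8)     # True
```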

The override and bypass policy is a security boundary. Exemptions from rate limiting — partner webhooks from known IP ranges, internal service accounts, load testing keys issued for performance testing — are security exceptions. Each exception narrows the effective coverage of the rate limiting control: if the partner's IP range is allowlisted and that IP range is compromised, the rate limiting protection does not apply to traffic from those addresses. The exception list grows over time as individual cases are accommodated without a general policy, and the security team discovers the full exception set at audit time by inspecting the code rather than reading a document. The security threat model for the API depends on knowing what the rate limiting control actually covers — the exceptions are part of the control's scope.

Rate limiting algorithms and their structural consequences

The choice of rate limiting algorithm is not a performance micro-optimization — it is a decision about what guarantees the API makes about burst behavior, and those guarantees are implicit in the client integration contract once the API is published.

Fixed window counters are the simplest implementation: a counter keyed by identifier and time window (typically the current Unix timestamp divided by the window duration) increments on each request and rejects requests when the counter exceeds the limit. The implementation fits in a single Redis INCR + EXPIRE command pair. The structural consequence is the window boundary burst: because the counter resets at the window boundary, a client that exhausts its quota at the end of one window and immediately begins sending requests at the start of the next window sends twice the declared rate in a period equal to the window duration. For most API use cases this is acceptable — the burst is bounded and predictable, and the upstream service capacity planning should account for the 2x factor. For APIs that protect expensive upstream resources (LLM inference calls, database-heavy aggregations, third-party API calls with their own rate limits), the 2x burst at window boundaries can cause cascading failures that the declared rate limit was intended to prevent. The fixed window algorithm is the right choice when the implementation simplicity is worth the burst behavior, and that trade-off belongs in the ADR.
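The window boundary burst is easy to reproduce. The sketch below is a single-process, in-memory stand-in for the Redis counter (the class name `FixedWindowLimiter` and the client identifier are hypothetical), using the 60-per-minute limit from the example earlier in this article.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed-window counter (illustrative, single process)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (identifier, window index) -> request count; old windows are
        # never evicted here, which a real store would handle with a TTL.
        self.counts = defaultdict(int)

    def allow(self, identifier: str, now: float) -> bool:
        # The window index is the timestamp divided by the window duration.
        window = int(now // self.window_seconds)
        key = (identifier, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


fixed = FixedWindowLimiter(limit=60, window_seconds=60)

# 60 requests in the last second of window 0, 60 in the first second of window 1.
end_of_window = [fixed.allow("client-a", now=59.0 + i / 100) for i in range(60)]
start_of_next = [fixed.allow("client-a", now=60.0 + i / 100) for i in range(60)]
allowed_in_two_seconds = sum(end_of_window) + sum(start_of_next)
print(allowed_in_two_seconds)  # 120
```

All 120 requests are admitted within two seconds even though the declared limit is 60 per minute, because the two batches fall on opposite sides of the reset.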

Sliding window log counters record the timestamp of each request and count only the requests within the trailing window duration from the current time. The burst at window boundaries is eliminated: the count at any moment reflects exactly the requests in the preceding N seconds. The cost is storage proportional to the request rate — for a limit of 1,000 requests per minute, the counter stores up to 1,000 timestamps per identifier. Redis sorted sets (ZADD, ZREMRANGEBYSCORE, ZCARD in a Lua script) implement this cleanly. The sliding window log is the correct algorithm when the burst elimination is worth the storage cost — typically for expensive upstream calls where a 2x burst is genuinely unsafe — and the storage cost should be documented in the ADR alongside the algorithm choice so the platform team can project storage requirements as the API scales.
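For comparison, here is the same traffic pattern against a sliding window log. This is an in-memory sketch using a deque per identifier in place of a Redis sorted set; `SlidingWindowLog` is a hypothetical name, and the storage cost described above shows up as the length of each deque.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """In-memory sliding-window log (illustrative, single process)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # identifier -> timestamps, oldest first

    def allow(self, identifier: str, now: float) -> bool:
        log = self.logs[identifier]
        # Drop timestamps that have left the trailing window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


sliding = SlidingWindowLog(limit=60, window_seconds=60)

first_batch = [sliding.allow("client-a", now=59.0 + i / 100) for i in range(60)]
second_batch = [sliding.allow("client-a", now=60.0 + i / 100) for i in range(60)]
after_window = sliding.allow("client-a", now=120.0)
print(sum(first_batch), sum(second_batch), after_window)  # 60 0 True
```

The second batch is rejected in full because the first 60 timestamps are still inside the trailing 60 seconds; capacity returns only once they age out.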

Token bucket algorithms model the rate limit as a bucket with a maximum capacity (the burst limit) that fills at a constant rate (the sustained rate) and is consumed by each request. When the bucket is empty, requests are rejected. Token bucket naturally allows controlled bursts: a client that has been idle for an hour accumulates tokens up to the bucket capacity and can spend them in a burst, then must wait for the bucket to refill. This is often the semantically correct model for developer APIs: a client that processes a batch job may legitimately need to send a burst of requests, and the token bucket allows it as long as the sustained rate is within the declared limit. The token bucket algorithm requires storing both the current token count and the last refill timestamp per identifier; Redis implementations typically use a Lua script to atomically read the state, compute the refill, apply the request, and write the new state. The burst capacity (maximum bucket size) and the refill rate (sustained rate) are two distinct parameters, both of which must appear in the rate limiting ADR, because they together determine what "rate limited" means for a client building a batch integration.
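The two parameters, burst capacity and refill rate, are visible directly in a small implementation. The sketch below is in-memory and single-process, with time passed in explicitly so the behavior is deterministic; `TokenBucket` and the specific values (capacity 10, one token per second) are illustrative assumptions.

```python
class TokenBucket:
    """In-memory token bucket (illustrative, single process)."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity                    # burst limit
        self.refill_per_second = refill_per_second  # sustained rate
        self.tokens = capacity                      # start full
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, up to capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=10, refill_per_second=1.0)

burst = sum(bucket.allow(now=0.0) for _ in range(12))   # idle client spends its burst
later = sum(bucket.allow(now=5.0) for _ in range(6))    # 5 seconds of refill
print(burst, later)  # 10 5
```

A distributed version would keep the same two fields (token count and last refill time) per identifier in the shared store and update them atomically.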

Leaky bucket algorithms model the rate limit as a queue that drains at a constant rate, smoothing bursty inbound traffic into a steady output stream. Requests that arrive when the queue is full are rejected. Unlike token bucket, which allows bursts up to the bucket capacity, the leaky bucket enforces a strict constant output rate regardless of input burst pattern. Leaky bucket is appropriate when the goal is upstream protection (shielding a slow upstream service from bursty traffic) rather than fair usage enforcement. It is rarely the right choice for a developer API where the contract is about request counts and the client controls the retry timing. Documenting the algorithm choice in the ADR prevents a future engineer from replacing a leaky bucket with a token bucket for performance reasons without understanding that the change relaxes the burst guarantee.
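The smoothing effect can be seen in a small queue-based simulation. This is a sketch under simplifying assumptions: time is passed in explicitly, the caller polls `release` to learn which requests may be forwarded, and `LeakyBucketQueue` with its parameters is a hypothetical name rather than an established API.

```python
from collections import deque
from typing import List, Tuple


class LeakyBucketQueue:
    """Queue that releases at most one request per drain interval."""

    def __init__(self, queue_size: int, drain_interval: float):
        self.queue_size = queue_size
        self.drain_interval = drain_interval  # seconds between released requests
        self.queue = deque()
        self.next_release = 0.0

    def offer(self, request_id: str, now: float) -> bool:
        """Enqueue a request; reject it when the queue is full."""
        if len(self.queue) >= self.queue_size:
            return False
        if not self.queue:
            # An idle queue does not bank unused drain slots.
            self.next_release = max(self.next_release, now)
        self.queue.append(request_id)
        return True

    def release(self, now: float) -> List[Tuple[str, float]]:
        """Return (request_id, release_time) pairs that are due by `now`."""
        released = []
        while self.queue and self.next_release <= now:
            released.append((self.queue.popleft(), self.next_release))
            self.next_release += self.drain_interval
        return released


leaky = LeakyBucketQueue(queue_size=5, drain_interval=0.5)

# A burst of 8 requests arrives at t=0; only 5 fit in the queue.
accepted = [leaky.offer(f"r{i}", now=0.0) for i in range(8)]
release_times = [t for _, t in leaky.release(now=2.0)]
print(accepted.count(True), release_times)  # 5 [0.0, 0.5, 1.0, 1.5, 2.0]
```

The burst arrives in a single instant but leaves the queue at a steady two requests per second, which is the property that distinguishes this algorithm from the token bucket.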

The identifier choice: who shares a bucket and who gets their own

The rate limit identifier is the most consequential decision in the rate limiting architecture, and it is the decision most likely to be made wrong because the failure mode only appears at scale, behind proxies and gateways that are not part of the development environment.

IP-based rate limiting works correctly in development (one developer, one IP) and fails silently in enterprise production (thousands of employees, one egress IP). The failure is not an error — the rate limiting enforces exactly the rule it was configured with — but the outcome is that a large enterprise customer's entire workforce is throttled as if they were a single user. IP-based limiting is appropriate for unauthenticated public endpoints where API keys are not available: login endpoints, password reset endpoints, public search endpoints. It is not appropriate as the primary rate limiting strategy for authenticated API endpoints used by enterprise customers, because the quota granularity does not match the accountability granularity. The authentication strategy and the rate limiting strategy must be designed together: the rate limiting identifier should be derivable from the authenticated identity, which means the authentication layer must surface the relevant claim (API key, user ID, organization ID) to the rate limiting middleware.

API-key-based rate limiting isolates each key holder's quota regardless of network topology. Whether the 800 engineers in the London office each use their own API key or, more commonly, the firm's integration uses a single API key, the rate limit applies to the key, not the shared IP. In a developer-facing product where the API key is the primary identity of the integration, this is the correct semantic. The rate limit is a property of the integration, not of the network path. The cost is that key-farming abuse (a bot that creates throwaway API keys to multiply its quota) requires a separate control layer: account creation rate limiting, email verification requirements, or anomaly detection on new key usage. The decision to use API-key-based limiting implicitly chooses to rely on account creation controls as the abuse surface defense, and both decisions should appear in the rate limiting ADR.

Organization-based rate limiting is the correct choice for B2B SaaS products where the organization is the billing and accountability unit. An organization may have many users, each with their own API key; the organization has a rate limit that is the aggregate of all its users' activity. This model matches the commercial relationship: the organization purchased a plan with a declared rate limit, the rate limit applies to the organization's total usage, and usage above the limit triggers a conversation about plan upgrade rather than blocking individual users. Implementing organization-based rate limiting requires that the organization ID be available in the rate limiting middleware, which means it must be embedded in the API key or resolved from a key-to-organization mapping. The mapping lookup adds latency to every rate limiting decision; caching the mapping reduces the latency but introduces cache invalidation requirements when keys are revoked or organizations merge.
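The identifier choice ultimately reduces to a function that turns a request into a counter key. The sketch below shows one possible precedence (organization, then API key, then IP); the `RequestContext` fields, the key prefixes, and the precedence order itself are assumptions for illustration, and the right order for a given product is exactly what the ADR should record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestContext:
    """Claims the authentication layer is assumed to surface."""
    client_ip: str
    api_key_id: Optional[str] = None
    organization_id: Optional[str] = None


def rate_limit_key(ctx: RequestContext) -> str:
    """Pick the counter key that matches the accountability unit."""
    if ctx.organization_id:
        return f"org:{ctx.organization_id}"
    if ctx.api_key_id:
        return f"key:{ctx.api_key_id}"
    # Unauthenticated request: the IP is the only identifier available.
    return f"ip:{ctx.client_ip}"


# Two users in the same organization behind the same NAT gateway share one org bucket.
alice = RequestContext("203.0.113.7", api_key_id="k_alice", organization_id="acme")
bob = RequestContext("203.0.113.7", api_key_id="k_bob", organization_id="acme")
anonymous = RequestContext("203.0.113.7")

print(rate_limit_key(alice))      # org:acme
print(rate_limit_key(bob))        # org:acme
print(rate_limit_key(anonymous))  # ip:203.0.113.7
```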

Endpoint-specific rate limits are needed when the cost of different endpoints varies significantly. A search endpoint that runs a full-text query against a large index is more expensive than a metadata endpoint that returns a cached value. A report generation endpoint that triggers a multi-minute background job is categorically different from a status endpoint that returns a counter. A flat global rate limit treats all endpoints as equally expensive and optimizes for the average case — which means expensive endpoints are underprotected (they are expensive, so the global rate allows more calls than the upstream resource can sustain) and cheap endpoints are overprotected (they are cheap, so the global rate is far below what the upstream resource can sustain). Per-endpoint rate limits calibrated to each endpoint's upstream cost are the correct architecture; they are more complex to implement and to document, but they prevent the class of incidents where a client discovers that calling a cheap endpoint 60 times a minute is fine and calling an expensive endpoint 60 times a minute takes the service down. The endpoint-to-limit mapping and the derivation of each limit value (the capacity analysis or load test result that determined the value) belong in the rate limiting ADR.
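An endpoint-to-limit mapping can be as simple as a lookup table with a default. The paths and numbers below are hypothetical placeholders, not recommendations; in practice each value would come from the capacity analysis or load test described above.

```python
# Requests per minute per identifier. All values are illustrative placeholders.
ENDPOINT_LIMITS = {
    "/search": 10,   # full-text query against a large index
    "/reports": 2,   # triggers a multi-minute background job
}
DEFAULT_LIMIT = 60   # cheap endpoints such as cached metadata or status


def limit_for(path: str) -> int:
    """Limit that applies to a request path."""
    return ENDPOINT_LIMITS.get(path, DEFAULT_LIMIT)


def counter_key(identifier: str, path: str) -> str:
    """Expensive endpoints get their own counter; the rest share one."""
    bucket = path if path in ENDPOINT_LIMITS else "*"
    return f"{identifier}:{bucket}"


print(limit_for("/reports"), counter_key("key:abc", "/reports"))  # 2 key:abc:/reports
print(limit_for("/status"), counter_key("key:abc", "/status"))    # 60 key:abc:*
```

Keeping separate counters for expensive endpoints means that exhausting the report quota does not consume the quota for cheap calls, and the reverse.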

Distributed rate limiting: consistency versus performance

A horizontally scaled API has a fundamental rate limiting problem: the counter must be shared across all instances to enforce limits consistently, but a shared counter adds a network round-trip to every request. The decision between consistency and performance is not a configuration choice — it is a trade-off that determines the effective rate limit enforcement under normal load and under partial infrastructure failure.

Redis atomic counters (INCR on a key with an expiry) provide near-exact counting across all instances at the cost of a synchronous Redis call on every request. The call adds latency: a Redis instance in the same availability zone typically responds in under 1ms, but under load or network partition, latency can spike to tens of milliseconds. Whether this latency is acceptable depends on the API's SLO latency percentiles: adding 1ms to a p99 latency of 200ms is negligible; adding 1ms to a p99 of 10ms is a 10% increase on every request, which can consume a meaningful share of a tight latency budget. The atomicity matters: a non-atomic read-increment-write sequence (GET, increment in application code, SET) has a race condition where two concurrent requests can both read the same counter value and both be allowed when the combined count exceeds the limit. A Lua script that executes GET, INCR, and EXPIRE atomically on the Redis server eliminates the race at the cost of slightly higher Redis CPU usage.
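The race is easier to see with the interleaving written out step by step. In the sketch below a plain dict stands in for the shared store and the interleaving is simulated sequentially rather than with real concurrency; `incr` is a hypothetical stand-in for a server-side atomic increment that returns the new value.

```python
LIMIT = 60

# Non-atomic GET / increment / SET, with two requests interleaved.
store = {"key-123": 59}
read_a = store["key-123"]          # request A: GET -> 59
read_b = store["key-123"]          # request B: GET -> 59 (before A writes)
allowed_a = read_a < LIMIT
allowed_b = read_b < LIMIT
store["key-123"] = read_a + 1      # request A: SET 60
store["key-123"] = read_b + 1      # request B: SET 60, A's increment is lost
print(allowed_a, allowed_b, store["key-123"])  # True True 60


def incr(counters: dict, key: str) -> int:
    """Stand-in for an atomic increment that returns the new value."""
    counters[key] = counters.get(key, 0) + 1
    return counters[key]


# Atomic increment-then-compare: each request observes a distinct value.
atomic_store = {"key-123": 59}
atomic_results = [incr(atomic_store, "key-123") <= LIMIT for _ in range(2)]
print(atomic_results, atomic_store["key-123"])  # [True, False] 61
```

In the non-atomic case 61 requests are admitted while the counter reads 60; in the atomic case the second request sees 61 and is rejected.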

Approximate distributed rate limiting maintains a local counter in each application instance and periodically synchronizes with a shared store. A client's request increments the local counter; the local counter is added to the shared counter on a schedule (every second, every 100ms). Between synchronizations, each instance enforces limits against its local count, which means the effective rate limit is the declared limit multiplied by the number of instances — a limit of 100 requests per minute per key, enforced locally across 10 instances, allows 1,000 requests per minute per key during the synchronization interval. This is a deliberate trade-off: the limit is enforced approximately, with a burst headroom proportional to the number of instances and the synchronization interval. This approach is appropriate when the goal is load shedding rather than strict quota enforcement — preventing the service from being overwhelmed by a single client is achievable with approximate counting even if the client can sometimes exceed the declared limit during a synchronization window.
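The 100-per-minute, 10-instance example can be simulated directly. This sketch assumes a simple scheme in which each instance tracks a local count plus the last shared total it saw; `ApproximateLimiter` and `synchronize` are hypothetical names, and real implementations differ in how they merge counts.

```python
from typing import List


class ApproximateLimiter:
    """One instance's view: local count plus the last known shared total."""

    def __init__(self, limit: int):
        self.limit = limit
        self.local = 0          # requests admitted locally since the last sync
        self.known_shared = 0   # shared total as of the last sync

    def allow(self) -> bool:
        if self.known_shared + self.local >= self.limit:
            return False
        self.local += 1
        return True


def synchronize(instances: List[ApproximateLimiter], shared_total: int) -> int:
    """Fold every local count into the shared total and publish it back."""
    shared_total += sum(inst.local for inst in instances)
    for inst in instances:
        inst.local = 0
        inst.known_shared = shared_total
    return shared_total


instances = [ApproximateLimiter(limit=100) for _ in range(10)]

# 2,000 requests for one key, round-robined across instances before any sync.
allowed_before_sync = sum(instances[i % 10].allow() for i in range(2000))
shared = synchronize(instances, shared_total=0)
allowed_after_sync = sum(instances[i % 10].allow() for i in range(100))
print(allowed_before_sync, shared, allowed_after_sync)  # 1000 1000 0
```

Shortening the synchronization interval narrows the period in which the overshoot can occur, at the cost of more traffic to the shared store.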

Edge rate limiting (Cloudflare Workers, API gateway rate limiting, CDN-layer rules) applies rate limits before the request reaches application infrastructure. Edge limiting is the most effective approach for large-scale abuse prevention: requests that hit the rate limit never consume application server resources or database connections. The cost is that edge rate limiting is typically coarser — it operates on IP addresses and request paths, not on API key or organization identity, because the authenticated identity is not available at the edge without a key lookup that reintroduces the latency problem. Service mesh rate limiting (Envoy, Istio, AWS App Mesh with rate limit filters) solves the identity problem for authenticated service-to-service traffic by making the authentication context available at the proxy layer. For developer-facing APIs, a layered approach — edge rate limiting by IP for unauthenticated requests and scraping prevention, application-layer rate limiting by API key for authenticated requests — addresses both abuse prevention and fair usage enforcement with the appropriate identifier at each layer. The layered architecture is an architectural decision that belongs in the rate limiting ADR.

The 429 response: a client contract, not just an error code

The 429 Too Many Requests response is the most client-visible part of the rate limiting implementation. The headers included in the response determine whether clients can implement correct retry behavior; the response body determines whether clients can distinguish which limit was hit and what to do about it; and the documentation of the response format is the commitment the API makes to its consumers about how rate limiting will behave in the future.

The IETF standards (RFC 6585 for 429; RFC 7231, since superseded by RFC 9110, for Retry-After) specify the minimum: a 429 status code and optionally a Retry-After header indicating when to retry. The draft IETF RateLimit header fields (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset) formalize a pattern that many API providers already follow, often under vendor-specific or X-prefixed names (GitHub, for example, returns X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset). Including these headers allows clients to implement proactive backoff (reducing request rate before hitting the limit) rather than reactive backoff (discovering the limit by hitting it and then waiting). Proactive backoff reduces the number of 429 responses a client generates, which reduces the load on the rate limiting store and avoids the retry surge that follows a simultaneous rate limit event for thousands of clients.

The Retry-After header has two forms: a delay-seconds integer (retry after this many seconds) and an HTTP-date (retry after this specific time). The HTTP-date form is appropriate for a fixed-window counter where the exact reset time is known; the delay-seconds form is appropriate for a token bucket where the refill time depends on the current bucket state. Sending the wrong form — an HTTP-date when the counter uses a token bucket — produces incorrect client behavior: the client waits until the HTTP-date and retries, but if the bucket has not fully refilled by that time, the retry is rejected again, producing a different Retry-After in the next 429 response, which the client must handle as a separate case.
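A client has to accept both forms. The sketch below, in Python, normalizes either one to a number of seconds to wait. The function name is an assumption for illustration, and the client's clock is assumed to be reasonably in sync with the server's when the HTTP-date form is used.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After value (delay-seconds or HTTP-date) to seconds.

    Raises an exception if the value is neither a non-negative integer
    nor a parseable date.
    """
    now = now or datetime.now(timezone.utc)
    value = value.strip()
    if value.isdigit():
        # delay-seconds form: "120"
        return float(value)
    # HTTP-date form: "Wed, 21 Oct 2015 07:28:00 GMT"
    target = parsedate_to_datetime(value)
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    # A date in the past means the client may retry immediately.
    return max((target - now).total_seconds(), 0.0)
```

Normalizing to seconds at the edge of the client keeps the rest of the retry logic independent of which form the server chose.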

The response body should distinguish between distinct limit types. A client that hits an IP-based pre-authentication rate limit and a client that hits an API-key-based post-authentication rate limit are in different situations requiring different remediation: the first client should wait, or authenticate so that its requests are counted against its own key rather than a shared IP bucket; the second client should examine its request rate and either space requests further apart or contact the API provider to discuss a limit increase. A 429 body that says "rate limit exceeded" without identifying which limit was hit provides no signal for remediation. A 429 body that says "authenticated API key rate limit exceeded: 1000/minute, resets in 47 seconds" gives the client everything it needs to take the correct action. The response body format is a client contract; once published in API documentation, changing it is a breaking change that requires versioning.
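One way such a response could be assembled is sketched below in Python. The field names (limit_type, window_seconds, and so on), the two limit type labels, and the choice to mirror the delay in both Retry-After and RateLimit-Reset are assumptions for illustration, not a published schema.

```python
import json

# Hypothetical labels for the two limit types discussed above.
LIMIT_TYPES = {"ip_unauthenticated", "api_key"}


def build_429(limit_type: str, limit: int, window_seconds: int, retry_after: int):
    """Return (status, headers, body) for a rate-limited request."""
    if limit_type not in LIMIT_TYPES:
        raise ValueError(f"unknown limit type: {limit_type}")
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),      # delay-seconds form
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(retry_after),  # seconds until reset (assumed)
    }
    body = json.dumps({
        "error": {
            "code": "rate_limit_exceeded",
            "limit_type": limit_type,         # tells the client which limit was hit
            "limit": limit,
            "window_seconds": window_seconds,
            "retry_after_seconds": retry_after,
            "message": (
                f"{limit_type} rate limit exceeded: {limit} per "
                f"{window_seconds}s, resets in {retry_after} seconds"
            ),
        }
    })
    return 429, headers, body
```

A machine-readable limit_type alongside the human-readable message lets client code branch on the limit that was hit without parsing prose.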

The bypass and override policy

Rate limiting exceptions are made for legitimate operational reasons: partner webhooks must not be throttled, internal monitoring endpoints must not fail under load, and load testing must be conducted without hitting production rate limits. Each exception is individually reasonable; collectively, they represent a gap in the rate limiting security control. The exception policy must be documented as a security decision, not just as an operational configuration.

IP allowlisting exempts specific IP ranges from rate limiting. The typical use case is a partner that delivers webhooks from a known IP range: the partner's IP range is added to an allowlist, and webhook delivery is never rate limited regardless of delivery rate. The security decision implicit in this exemption is that traffic from the allowlisted IP range is trusted at the rate limiting layer — if the partner's IP range is compromised, or if the allowlist is too broadly specified (the partner provided a /16 CIDR instead of a /30), the rate limiting control provides no protection against traffic from those addresses. The allowlist entries, the justification for each, and the review cadence (who periodically verifies that the allowlisted IP ranges are still correct and still minimal) belong in the rate limiting ADR.
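The allowlist check and the periodic breadth review can both be expressed with Python's standard ipaddress module. In this sketch the allowlist entries use documentation-reserved address ranges, and the 256-address review threshold is a hypothetical value, not a recommendation.

```python
import ipaddress

# (CIDR, justification) pairs. Hypothetical entries for illustration.
ALLOWLIST = [
    ("203.0.113.8/30", "partner webhook delivery"),
]

# Hypothetical threshold: entries covering more addresses get flagged for review.
MAX_ADDRESSES_PER_ENTRY = 256


def is_allowlisted(client_ip: str, allowlist=ALLOWLIST) -> bool:
    """True if the client IP falls inside any allowlisted range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr, _ in allowlist)


def overly_broad_entries(allowlist, max_addresses=MAX_ADDRESSES_PER_ENTRY):
    """Return the CIDRs that cover more addresses than the review threshold."""
    return [
        cidr
        for cidr, _ in allowlist
        if ipaddress.ip_network(cidr).num_addresses > max_addresses
    ]
```

A /30 covers 4 addresses and a /16 covers 65,536, so a check like overly_broad_entries run during the review cadence would surface the over-broad case described above.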

Tiered rate limits by API key give different key holders different limits based on their plan or relationship. Free tier keys have one limit; paid tier keys have a higher limit; partner keys have a separately negotiated limit. The tier structure is both a commercial architecture and a rate limiting architecture: adding a new tier requires coordinating the rate limiting store configuration, the API documentation, the billing system, and the customer-facing dashboard. The API versioning strategy and the rate limiting tier structure are co-dependent: a rate limit tier is effectively a version of the API's throughput contract, and changes to tier limits have the same client impact as API changes. The tier-to-limit mapping and the process for granting a key a custom limit (a database row, a Redis key override, a feature flag) should be in the rate limiting ADR so that the on-call engineer has a documented procedure for making an emergency limit increase without a deployment.
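A sketch of the tier-to-limit mapping with a per-key override, in Python. The tier names, the limit values, and the idea of passing overrides as a plain dictionary are assumptions; in practice the overrides might live in a database row or a Redis key, as noted above.

```python
from typing import Mapping, Optional

# Requests per minute by tier. Hypothetical values for illustration.
TIER_LIMITS = {
    "unauthenticated": 30,
    "free": 60,
    "paid": 600,
    "partner": 6000,
}


def effective_limit(
    tier: str,
    api_key: Optional[str] = None,
    overrides: Optional[Mapping[str, int]] = None,
) -> int:
    """Resolve the limit for a request: a per-key override wins over the tier default."""
    if api_key is not None and overrides and api_key in overrides:
        return overrides[api_key]
    if tier not in TIER_LIMITS:
        raise KeyError(f"unknown tier: {tier}")
    return TIER_LIMITS[tier]
```

Keeping the resolution order in one function makes it explicit which source of truth takes precedence when an on-call engineer grants a custom limit.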

Emergency runtime overrides are the operational mechanism that prevents the scenario at the top of this post: a legitimate traffic spike that cannot be addressed without a twelve-minute deployment. The override mechanism — a Redis key that the rate limiting middleware checks before applying the default limit, a feature flag that disables rate limiting for a specific identifier, an admin endpoint that temporarily raises the limit for an IP range — must be designed into the rate limiting architecture from the start, because adding it after the fact requires modifying the rate limiting middleware and deploying the change. The existence of an override mechanism is itself a security decision: the override must be access-controlled so that only authorized operators can change limits at runtime, and the override must be audited so that all limit changes (emergency and planned) are logged with the identity of who made the change and the business justification. The override mechanism and its access controls belong in the rate limiting ADR alongside the limits themselves.
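The access control and audit requirements can be sketched as follows, in Python. This is an in-memory stand-in for whatever store the middleware actually consults; the class name, the operator set, and the audit record fields are all hypothetical. Overrides carry an expiry so that an emergency change does not silently become permanent.

```python
import time
from dataclasses import dataclass, field


@dataclass
class OverrideStore:
    authorized: set                                # operators allowed to change limits
    overrides: dict = field(default_factory=dict)  # identifier -> (limit, expires_at)
    audit_log: list = field(default_factory=list)

    def set_override(self, operator, identifier, limit, ttl_seconds, reason, now=None):
        """Apply a temporary limit. Every attempt is logged, including denied ones."""
        now = time.time() if now is None else now
        allowed = operator in self.authorized and bool(reason.strip())
        self.audit_log.append({
            "at": now,
            "operator": operator,
            "identifier": identifier,
            "limit": limit,
            "reason": reason,
            "result": "applied" if allowed else "denied",
        })
        if not allowed:
            raise PermissionError("operator not authorized or justification missing")
        self.overrides[identifier] = (limit, now + ttl_seconds)

    def limit_for(self, identifier, default, now=None):
        """Return the override if one is active, otherwise the default limit."""
        now = time.time() if now is None else now
        entry = self.overrides.get(identifier)
        if entry is not None and entry[1] > now:
            return entry[0]
        return default
```

The rate limiting middleware would call limit_for before applying the default, which is the lookup that has to exist from the start for a runtime override to work without a deployment.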

Compliance, SLA, and the rate limiting commitment

Rate limiting transitions from an internal technical control to a published commitment at two moments: when the limits are documented in developer-facing API documentation, and when they are referenced in a customer contract or SLA. Both moments change the cost of modifying the limits and require the rate limiting ADR to be treated as a living document whose changes have external stakeholder implications.

Published rate limits are contractual obligations. Once rate limits are documented in the developer portal, API clients build integrations calibrated to those limits. A retail platform that builds its product catalog sync against an API with a documented limit of 10,000 requests per hour structures its sync logic around that limit. Reducing the limit to 5,000 requests per hour is a breaking change to the integration: a sync that was designed to complete in under an hour now takes roughly twice as long. Published rate limits should be versioned alongside the API, and changes to rate limits that affect existing integrations should be communicated with the same lead time as other breaking changes. The rate limiting ADR is the document that tracks the version history of rate limit values and the reasoning behind each change, making the commitment history auditable.

SOC 2 and the rate limiting security control. SOC 2 CC7.2 (System Monitoring) and CC6.6 (Logical Access: protection against threats from outside the system boundary) cover the security controls deployed to prevent unauthorized access and to detect and respond to threats. Rate limiting on authentication endpoints (login, password reset, account creation) is a control that prevents brute force credential attacks and credential stuffing at scale, attacks that are among the most common external threats to SaaS applications. The security ADR that documents the threat model depends on the rate limiting ADR to describe the control deployed against credential brute force threats. An auditor reviewing SOC 2 CC6.6 will ask for evidence that the authentication surface is protected against automated attacks; the rate limiting ADR is the documentation of that protection, and its absence is a finding.

GDPR and data extraction via API. A rate limiting design that allows a sophisticated actor to extract large volumes of user data from an API — even data that each individual API response legitimately reveals — may conflict with GDPR data minimization and purpose limitation principles. The data retention and access policy determines what data the API exposes; the rate limiting policy determines how much of it can be extracted in a given time window. If the API exposes personal data (email addresses, user profiles, behavioral data), the rate limiting policy is part of the technical and organizational measures deployed to protect that data under GDPR Article 32. The rate limiting ADR should note which data is accessible through rate-limited endpoints and what the aggregate extraction rate is at the declared limits — not as a justification for any specific limit value, but as a documented artifact that the data protection review can reference.
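The aggregate extraction rate is simple arithmetic, sketched here in Python. The function names and the example figures are illustrative assumptions; the calculation is a ceiling that assumes a single identifier sustaining the declared limit around the clock.

```python
import math


def records_per_day(requests_per_window: int, window_seconds: int,
                    records_per_response: int) -> float:
    """Upper bound on records one identifier can read per day at the declared limit."""
    return requests_per_window * records_per_response * 86_400 / window_seconds


def days_to_extract(total_records: int, requests_per_window: int,
                    window_seconds: int, records_per_response: int) -> int:
    """Whole days needed to read the full dataset at the declared limit."""
    per_day = records_per_day(requests_per_window, window_seconds, records_per_response)
    return math.ceil(total_records / per_day)
```

For example, a limit of 1,000 requests per hour with 50 profiles per response allows up to 1.2 million profiles per day, so a table of 10 million profiles could be read in 9 days by one key. That is the kind of figure a data protection review can reference.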

What a complete API rate limiting ADR looks like

The five sections of a rate limiting ADR and the decisions each section records:

Section 1: Scope and tiering. Which endpoints are rate limited (global, per-endpoint, authenticated-only, unauthenticated-only) and the reasoning for any endpoint-specific limits. The tier structure: what limit applies to each API key type (unauthenticated, free, paid, partner, internal) and the commercial or capacity reason for each tier value. The limit values with their derivation: a reference to the load test, capacity analysis, or upstream service limit that determined each value, not just the value itself. Limits set without derivation become legacy technical debt because no one knows whether they are safe to change.

Section 2: Identifier and algorithm. The rate limit counter key (IP, API key, user ID, organization ID, or composite) with the explicit rejection reasons for each alternative not chosen. The algorithm (fixed window, sliding window log, token bucket, leaky bucket) with the specific burst behavior it implies at the declared limit. The storage backend (in-process, Redis INCR+EXPIRE, Redis Lua script, approximate distributed) with the consistency guarantee and the fail-open vs. fail-closed policy for storage unavailability. The atomicity requirement (whether the read-check-increment must be atomic) and the implementation mechanism that satisfies it.

Section 3: Response specification. The 429 response headers included (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, Retry-After) and the format of each. The response body schema and the distinct content for each limit type (IP limit hit vs. key limit hit). The designed client behavior the response is intended to produce (exponential backoff with jitter, immediate retry at Retry-After, circuit break on persistent 429). The documentation commitment: the rate limit values, response format, and Retry-After semantics published in the developer portal are part of the API contract and are subject to the same versioning policy as endpoint changes.

Section 4: Bypass and override policy. The allowlist entries (IP ranges exempt from rate limiting) with the justification and scope of each, the review cadence, and the access control for modifying the allowlist. The tiered limit override mechanism — how a specific API key is granted a custom limit, who has authority to grant one, and how the grant is audited. The emergency runtime override mechanism — the feature flag, Redis key override, or admin endpoint that allows raising or disabling a limit without a deployment — with the access control and audit logging requirements. The security exception acknowledgment: each bypass entry is a gap in the rate limiting control, accepted for a documented operational reason.

Section 5: Compliance commitments. The rate limiting controls deployed against specific threat model entries (credential brute force on login endpoints, data extraction on personal data endpoints, denial-of-service from a single client). The SOC 2 criteria the rate limiting satisfies (CC7.2 system monitoring, CC6.6 logical access protection) and the specific endpoints and limits that satisfy each criterion. The GDPR technical measure description: what personal data is accessible at rate-limited endpoints, what the aggregate extraction rate is at declared limits, and the review trigger for limit changes that affect personal data endpoints. The documentation of published rate limits and the versioning policy for changes.
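To make the Section 2 requirement concrete ("the specific burst behavior it implies at the declared limit"), here is a sketch of a fixed window counter in Python that shows the boundary burst. It is an in-process, single-threaded illustration with an injected clock; the class name is an assumption. A shared store such as Redis would need the check and the increment to be a single atomic operation, and old window entries would need to expire.

```python
class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per key per window."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (key, window_index) -> request count; never pruned here

    def allow(self, key: str, now: float) -> bool:
        bucket = (key, int(now // self.window))
        count = self.counts.get(bucket, 0)
        if count >= self.limit:
            return False
        # Not atomic across processes; safe only in this single-threaded sketch.
        self.counts[bucket] = count + 1
        return True


# Declared limit: 100 requests per 60 seconds.
limiter = FixedWindowLimiter(limit=100, window_seconds=60)
accepted = sum(limiter.allow("key_abc", 59.9) for _ in range(100))
accepted += sum(limiter.allow("key_abc", 60.1) for _ in range(100))
# accepted is 200: twice the declared limit within 0.2 seconds of clock time
```

Recording this behavior next to the limit value tells the reader whether upstream capacity was sized for the declared rate or for the boundary burst.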

Finding rate limiting decisions in AI chat history

Rate limiting decisions appear in AI chat history in four session types, each capturing a different layer of the rate limiting architecture, at a different point in time and under a different kind of pressure.

The initial security session is where the rate limiting architecture is first defined, typically in response to an immediate threat: "how do I prevent someone from spamming my API?", "should I rate limit by IP or by user?", "what is a good rate limit for a REST API?", "how do I add rate limiting to an Express.js middleware?" These sessions contain the identifier choice (IP vs API key — usually chosen as IP because it requires no authentication and is the fastest to implement), the algorithm choice (usually fixed window, because it is the simplest to explain and the first example in most tutorials), and the initial limit values (often a round number — 60 requests per minute, 1,000 requests per hour — without derivation from upstream capacity analysis). The security framing of these sessions produces defaults optimized for stopping bots: low limits, IP-based, with minimal response headers because the bot does not read them. Those defaults become the production rate limiting architecture when the bot threat is addressed and the ticket is closed. Three months later, the initial session is the only place the reasoning exists.

The traffic spike session documents the incident-driven modification of rate limiting: "my API is being hammered and the server is falling over", "we're getting 50,000 requests per second from a single IP", "how do I add rate limiting without blocking legitimate users?", "we need to raise the rate limit but we don't know what's safe." These sessions take place under pressure, after a service impact, with the primary goal of stopping the immediate problem. The decisions made here (raising the limit to stop one incident, lowering it to stop another, switching from IP to API-key-based limiting because a partner complained) are made without the context of the original rate limiting design. The postmortem review captures what happened during the incident; the rate limiting ADR update captures the decision made in response. Without both, the next engineer who looks at the rate limiting configuration sees a set of values with no explanation of why they are what they are or which incidents they were set in response to.

The partner integration session documents the bypass and override policy being built one exception at a time: "our partner's webhook delivery is getting rate limited and they're dropping events", "we need to whitelist the Stripe IP ranges for webhook validation", "our enterprise customer says they need 50,000 requests per hour and we're set to 10,000", "how do we give one customer a higher rate limit than others without changing the default?" These sessions produce the allowlist entries, the tiered limit overrides, and often the first custom limit granted to a specific API key — all made in the context of a specific partner relationship without a general policy. The partner integration session is where the rate limiting architecture is most likely to diverge silently from the ADR: a Redis key override added to unblock a partner is exactly the kind of configuration change that is not committed to source control and is not reflected in any documentation. Platform teams who manage rate limiting infrastructure for multiple product teams need the allowlist and override policy documented at the platform level so product teams do not make exceptions that conflict with the platform's security model.

The SLA and compliance session documents the transition from internal control to published commitment: "we need to document our API rate limits in the developer portal", "a customer's legal team is asking us to specify rate limits in the contract", "our SOC 2 auditor wants to see our API abuse prevention controls", "we had a credential stuffing attack on the login endpoint and need to show the auditor what we have in place." These sessions are where the rate limiting architecture is first examined by people outside the engineering team — legal, sales, the compliance function, the security auditor. The gaps that surface in these sessions (the limits were set for bots but the compliance team needs to know they protect against credential stuffing; the limits are documented in the code but not in the developer portal; the allowlist was never reviewed for security scope) are the findings that produce the rate limiting ADR as a retroactive documentation exercise. The first year of a startup often produces all four session types for every major infrastructure decision; the rate limiting sessions are particularly valuable to recover because they contain the original reasoning that was never written down, the incident context that explains the current limit values, and the partner exception history that determines the current security scope.

The WhyChose extractor identifies rate limiting sessions in AI chat history by the characteristic language: algorithm names (token bucket, sliding window, leaky bucket, fixed window), status codes and header names (429, Retry-After, RateLimit-Remaining), infrastructure terms (Redis INCR, EXPIRE, Lua script, NGINX limit_req_zone, Express rate-limit middleware), incident framing (being hammered, falling over, IP flooding, credential stuffing), and compliance terms (SOC 2 CC7, GDPR Article 32, brute force protection). These sessions contain decisions that are not in the code (the identifier choice and its rejection reasons, the algorithm choice and the burst behavior it implies, the limit value derivations, the allowlist entries and their justifications) and that are recoverable only from the conversation where they were first worked through.