The caching strategy decision record: why the cache invalidation approach you chose shapes your consistency guarantees and the classes of bugs your users experience in production
Caching adoption is treated as a performance optimization, not an architecture decision. The mechanism is chosen quickly when the first slow endpoint appears and rarely documented. Two years later, the TTL policy determines the staleness window for every cached resource, the invalidation strategy determines which writes propagate immediately versus after a delay, the cache stampede behavior determines what users experience when the cache is flushed, and the failure mode determines whether the application degrades gracefully or breaks entirely when the cache node restarts. None of this was visible when the first endpoint was accelerated. None of it is written down.
A user updates their billing address. The update succeeds — the database write completes, the response returns HTTP 200, the UI shows a success message. Four hours later, the user's invoice arrives with the old address. The support ticket arrives the same day. The on-call engineer traces the billing invoice generation to a user profile lookup, which hits the cache, which returns the profile as it existed at the time of the last cache population. The profile was cached with a 24-hour TTL. The cache was not invalidated on the address write. The billing service read a six-hour-old profile and generated the invoice with the previous address.
The fix is straightforward: add a cache invalidation call to the address update endpoint. The issue is that "add a cache invalidation call" requires knowing which cache keys encode the user's profile — and the profile is cached under three different key patterns depending on which service populated the cache. The billing service caches user:{id}:profile. The API gateway caches the full user object under session:{token}:user. The frontend GraphQL layer caches the profile query result under a key derived from the query hash. Invalidating "the user's profile" requires invalidating three different cache keys across two different cache stores, and the mapping between "user changed their address" and "these keys must be deleted" was never documented.
Like most foundational infrastructure decisions, the caching mechanism is visible as a fact — the application uses Redis, the TTL is 24 hours, the keys follow a naming pattern — but invisible as a decision. The fact answers "what is true now?" The decision record answers "what consistency commitment was made when the TTL was set to 24 hours, what data classes are excluded from caching because their consistency requirements cannot be met, and what the invalidation policy is for data that changes outside the primary write path." Without the record, the stale address bug is a surprise rather than a documented consequence of a known policy gap.
What "we use caching" means across five patterns
The first decision inside "we use Redis" is the caching mechanism — the architectural pattern by which cached data is populated, validated against the authoritative source, and invalidated when the authoritative data changes. The mechanism choice is often made by whoever is blocked on a slow endpoint, driven by the first tutorial that appears in a search, and carries specific consistency, latency, and invalidation model commitments that determine the behavior of every cached data class that follows.
Cache-aside (lazy loading) is the most common pattern and the one most often adopted by default. The application checks the cache before reading from the authoritative source (database, external API). On a cache hit, the cached value is returned without touching the authoritative source. On a cache miss, the application reads from the authoritative source, writes the result to the cache with a TTL, and returns the value. The application owns both the cache read and the cache population logic. Cache and authoritative source are not automatically synchronized — divergence accumulates until the cached value expires. The write path is decoupled from the cache: an update to the database does not automatically invalidate the corresponding cache key unless the application explicitly adds an invalidation call to the write path. Teams that add cache-aside without adding invalidation-on-write accept eventual consistency at the TTL boundary as the default behavior for all cached data, often without naming this as the consistency model they have chosen.
The write path decoupling is where staleness bugs originate. A user's account tier is cached under cache-aside with a 6-hour TTL, and the cache entry was last populated shortly before 11am. The user's subscription expires at 11am. The subscription expiry is processed by a background job that writes to the database at 11am and does not touch the cache. The user attempts to use a Pro-tier feature at 1pm: the cache still shows Pro tier, the feature gate reads from the cache, and the user accesses a feature they should not have. At about 5pm, when the 6-hour TTL expires, the cache is re-populated from the database, and the user loses access. The window of incorrect access, up to 6 hours, is a direct consequence of the TTL policy and the absence of invalidation on write, both decisions that were made (implicitly) at cache adoption and that determine the consistency guarantee for account tier data.
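The account tier scenario can be reproduced in a few lines. This is a minimal sketch, assuming an in-memory TTLCache as a stand-in for Redis, a plain dict as the database, and a hypothetical key pattern user:{id}:tier; none of these names come from a real codebase.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis: each key stores (value, expires_at)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # an expired entry counts as a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


SIX_HOURS = 6 * 3600


def get_account_tier(cache: TTLCache, db: Dict[str, str], user_id: str) -> str:
    """Cache-aside read: check the cache, fall back to the database on a miss."""
    key = f"user:{user_id}:tier"  # hypothetical key pattern
    cached = cache.get(key)
    if cached is not None:
        return cached
    tier = db[user_id]               # authoritative read
    cache.set(key, tier, SIX_HOURS)  # populate for later readers
    return tier


def expire_subscription(cache: TTLCache, db: Dict[str, str], user_id: str,
                        invalidate: bool) -> None:
    """Background-job write path; only touches the cache if asked to."""
    db[user_id] = "free"
    if invalidate:
        cache.delete(f"user:{user_id}:tier")
```

With invalidate=False the feature gate keeps reading "pro" until the TTL runs out; with invalidate=True the next read misses and returns "free". The only difference between the two behaviors is one line in the write path.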
Write-through caching updates the cache synchronously on every write to the authoritative source. When the application writes to the database, it also writes the new value to the cache before returning the response to the caller. As long as both writes succeed, the cache is up to date with every committed write. The two writes are not atomic, though: a cache such as Redis cannot take part in the database transaction, so a failure between the database commit and the cache write leaves the cached value stale until the next write or until a TTL expires. In the normal case the consistency guarantee is strong: a read that follows a write always sees the most recent data, as long as the read goes through the cache. The write latency cost is the cache write added to the write path, typically 1–5ms for a local Redis instance, which is negligible for most write patterns. The cold start problem is more significant: a freshly provisioned cache contains no data, so every read is a cache miss until the cache warms up through writes. Write-through caching without a separate cache-warming mechanism produces a degraded read latency period after a cache flush or node restart, because the cache only populates through writes, not reads.
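A minimal sketch of write-through, assuming two dicts as stand-ins for the database and the cache. The class and its cache_misses counter are hypothetical, included only to make the cold start behavior visible.

```python
from typing import Dict, Optional


class WriteThroughStore:
    """Pure write-through: the cache is populated by writes, never by reads."""

    def __init__(self) -> None:
        self.db: Dict[str, str] = {}
        self.cache: Dict[str, str] = {}
        self.cache_misses = 0

    def write(self, key: str, value: str) -> None:
        self.db[key] = value     # authoritative write first
        self.cache[key] = value  # then the cache, before returning to the caller

    def read(self, key: str) -> Optional[str]:
        if key in self.cache:
            return self.cache[key]
        self.cache_misses += 1   # cold cache: fall back without populating
        return self.db.get(key)

    def flush_cache(self) -> None:
        """Simulates a cache node restart."""
        self.cache.clear()
```

After flush_cache(), every read of a key counts as a miss until that key is written again, which is the degraded period described above. Many real deployments combine write-through with cache-aside reads to shorten it.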
Write-behind (write-back) caching acknowledges writes to the cache immediately and persists to the authoritative source asynchronously. The application writes to the cache, the cache acknowledges the write, and the response is returned to the caller — the database write happens later, in a background process. Write latency is minimized because the slow operation (the database write) is moved off the critical path. The risk is durability: if the cache node fails or is flushed between the cache write acknowledgment and the database write completion, the write is lost. The application told the user "your update succeeded" — and it is silently gone. Write-behind is correct for write-heavy workloads where write latency is the primary constraint and some data loss under cache failure is acceptable (analytics counters, view counts, non-critical preference updates). It is incorrect for any data where the user expectation is that "your update succeeded" means "your update is durable." Most teams that adopted write-behind for a specific high-write use case discover that the pattern has been applied to other data classes that require durability — when the cache node fails, they lose data that users assumed was persisted.
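The durability gap is easiest to see in a toy model. This sketch assumes an in-memory queue of pending writes and a crash() method that simulates a cache node failure; all names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class WriteBehindStore:
    """Acknowledge writes from the cache; persist to the database later."""

    def __init__(self) -> None:
        self.db: Dict[str, int] = {}
        self.cache: Dict[str, int] = {}
        self._pending: Deque[Tuple[str, int]] = deque()

    def write(self, key: str, value: int) -> bool:
        self.cache[key] = value
        self._pending.append((key, value))
        return True  # acknowledged before the database has seen the write

    def flush_to_db(self) -> int:
        """Background persistence step; returns how many writes were applied."""
        applied = 0
        while self._pending:
            key, value = self._pending.popleft()
            self.db[key] = value
            applied += 1
        return applied

    def crash(self) -> None:
        """Cache node failure: cached values and unpersisted writes are gone."""
        self.cache.clear()
        self._pending.clear()
```

Any write acknowledged between the last flush_to_db() and a crash() never reaches the database. That is tolerable for a view counter and not for a billing address.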
Read-through caching moves the cache miss handling from the application into the caching layer itself. When the application reads through a cache miss, the cache layer automatically calls a configured data loader to fetch from the authoritative source, stores the result, and returns it — the application receives the result without needing separate cache population logic. The application always reads from the cache; the cache is responsible for cache miss resolution. This pattern requires a cache provider that supports read-through configuration (some Redis client libraries, Ehcache, Hazelcast) and a data loader implementation that the cache layer can call. The benefit is consistent cache population logic — the application does not contain multiple code paths that each populate the cache differently, which is the source of the multi-key-pattern problem in the billing address scenario above. The constraint is that the cache layer must be able to call the data loader with the same dependencies and context as the application — authorization context, tenant identifier, database connection — which is straightforward for simple data loaders and complex for data that requires application-layer business logic to assemble.
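A minimal sketch of the read-through shape, assuming an in-process cache object that owns a loader callback. ReadThroughCache and load_profile are hypothetical names, not the API of any particular cache provider.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """The cache owns miss handling: callers only ever call get()."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        value = self._loader(key)  # single population path for every caller
        self._data[key] = (value, now + self._ttl)
        return value


def load_profile(key: str) -> str:
    """Hypothetical data loader; a real one would query the database."""
    return f"profile-for-{key}"
```

Because every caller goes through get(), there is exactly one place where a key pattern and a TTL are chosen for this data, which is the property the multi-key-pattern problem lacked.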
CDN and edge caching caches HTTP responses at network edges — CDN nodes geographically distributed close to users — without requiring application code changes, because the caching behavior is controlled by Cache-Control response headers. An origin server that responds with Cache-Control: public, max-age=3600 instructs CDN nodes to cache the response for one hour and serve subsequent requests from the edge without hitting the origin. The latency reduction for static and semi-static content (product catalog pages, pricing pages, blog posts, API responses for data that changes infrequently) is significant: CDN edge latency is typically 5–20ms vs. 100–300ms for an origin request from a distant user. The invalidation model is where CDN caching decisions have the greatest architectural consequence. TTL expiry requires waiting for the cache to expire before updated content is served — a one-hour TTL means users may see stale content for up to one hour after an update. Cache purge APIs (Cloudflare, Fastly, AWS CloudFront) allow the application to explicitly invalidate specific URLs or patterns — but the purge call must be wired into the write path, and the mapping from "this record changed" to "these URLs must be purged" must be maintained. Surrogate key invalidation (Fastly Surrogate-Key, Cloudflare Cache-Tag) allows tagging cached responses with entity identifiers and purging by tag — when a blog post is updated, the application emits a purge for the post's cache tag, which invalidates all edge-cached responses that include that post's content, including listing pages and related-content blocks. The surrogate key model requires that the application track which cache tags are emitted for each response, and the tag design is a cache architecture decision that determines the granularity at which content can be invalidated.
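As an illustration of the header-driven model, here is a small sketch that builds the response headers for a taggable page. The function name and the example tag names are hypothetical; the header formats assume Cloudflare's comma-separated Cache-Tag and Fastly's space-separated Surrogate-Key.

```python
from typing import Dict, List


def cdn_headers(max_age_seconds: int, tags: List[str]) -> Dict[str, str]:
    """Build caching headers for an origin response."""
    for tag in tags:
        # Whitespace and commas would be read as tag separators by the CDN.
        if any(ch.isspace() or ch == "," for ch in tag):
            raise ValueError(f"invalid cache tag: {tag!r}")
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "Cache-Tag": ",".join(tags),      # Cloudflare: comma-separated
        "Surrogate-Key": " ".join(tags),  # Fastly: space-separated
    }
```

A blog listing page that embeds post 42 would be served with cdn_headers(3600, ["post-42", "blog-listing"]), so a later purge of post-42 also evicts the listing page. In practice only the header for the CDN in use would be emitted.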
The invalidation mechanism decision
The invalidation mechanism is the second decision inside caching strategy, and it is the one most frequently left undocumented because it feels like an implementation detail rather than an architecture decision. The mechanism chosen at adoption determines the consistency boundary for every cached data class — how long a write to the authoritative source takes to propagate to cache readers, and which events trigger cache invalidation versus leaving the cache to expire naturally.
TTL-based invalidation is the default for most cache-aside implementations. The cached value is stored with a fixed expiration time; at expiry, the next read produces a cache miss and the cache is re-populated from the authoritative source. The consistency guarantee is bounded staleness: a cached value is at most TTL seconds out of date. The TTL is typically chosen to balance cache hit rate (longer TTL = higher hit rate = less database load) against staleness tolerance (shorter TTL = more current data = more database load). The choice is made at cache adoption under the conditions that exist at that time: a team with a small database and a medium-latency read path might choose a 60-second TTL; the same team three years later with a high-traffic API and a slower database might never revisit the TTL because it was never documented as a policy decision. Like retention policies that determine how long historical data is queryable, TTL values set once at adoption become the operative policy until a staleness bug forces a revisit — and without documentation, each revisit rediscovers the TTL from first principles rather than building on prior reasoning.
Event-driven invalidation supplements or replaces TTL expiry by explicitly deleting or updating cached keys when the authoritative data changes. When a user updates their profile, the write path calls cache.del("user:{id}:profile") before or after the database write. The consistency guarantee improves to near-real-time: a read that follows a write sees updated data as soon as the invalidation completes. The complexity cost is the requirement to maintain the mapping from "write event" to "cache keys to invalidate" — a mapping that must be updated every time a new cache key pattern is introduced and every time a new write path is added. The mapping omission is the root cause of most cache staleness bugs: a new write path is added (background job, API endpoint, admin interface, batch import) that modifies data without calling the corresponding cache invalidation, and the cache serves stale data until the TTL expires. Like error handling decisions that determine which failure modes the application surfaces versus silently absorbs, the invalidation mapping decision determines which writes propagate immediately versus which silently diverge from the cache until TTL expiry.
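One way to keep the event-to-keys mapping from living only in people's heads is to make it a single table that every write path must consult. This is a sketch under assumptions: the event name is hypothetical, the cache is a plain dict, and the two key templates are the ones named earlier in this article.

```python
from typing import Dict, List

# Single source of truth: which cache keys each write event must delete.
INVALIDATION_MAP: Dict[str, List[str]] = {
    "user.profile_updated": [          # hypothetical event name
        "user:{user_id}:profile",
        "session:{token}:user",
    ],
}


def keys_to_invalidate(event: str, **params: str) -> List[str]:
    """Expand the key templates for an event; unmapped events raise KeyError."""
    templates = INVALIDATION_MAP[event]
    return [template.format(**params) for template in templates]


def invalidate(cache: Dict[str, object], event: str, **params: str) -> int:
    """Delete every mapped key; returns how many keys were actually present."""
    removed = 0
    for key in keys_to_invalidate(event, **params):
        if cache.pop(key, None) is not None:
            removed += 1
    return removed
```

Raising on an unmapped event is deliberate: a new write path that has no entry in the table fails in testing instead of silently serving stale data until the TTL expires.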
Tag-based invalidation (available in HTTP caching via Fastly Surrogate-Key or Cloudflare Cache-Tag, and in application-layer caching via libraries like cache-tags for Redis) allows grouping cached values under named tags and invalidating all values associated with a tag in a single operation. A product listing page might be tagged with product-catalog and category:{id}. When any product in the catalog is updated, the application purges the product-catalog tag, invalidating all listing pages simultaneously. When a specific category is updated, the application purges category:{id}, invalidating only the pages for that category. The tag design is the critical decision: coarse tags (purging product-catalog on every product update) produce large invalidation scope and reduce cache efficiency; fine-grained tags (purging product:{id} on update) require tracking which cached responses include which products, which is complex to maintain for responses that include multiple products. The tag design decision is rarely documented because it is made incrementally — each new cache tag is added when needed — and the aggregate policy (which data class maps to which tags, what the invalidation granularity is for each class) is never stated as a coherent decision.
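The mechanics of tag purging fit in a small in-memory model. This sketch is not the API of any tagging library; TaggedCache and its methods are hypothetical, and the tag index is deliberately simplified.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Optional, Set


class TaggedCache:
    """Cache entries carry tags; purging a tag drops every entry that has it."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._keys_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def set(self, key: str, value: Any, tags: Iterable[str]) -> None:
        self._data[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    def purge_tag(self, tag: str) -> int:
        """Remove all entries tagged with `tag`; returns how many were removed."""
        keys = self._keys_by_tag.pop(tag, set())
        purged = 0
        for key in keys:
            # Simplification: other tags may still list this key; popping a
            # key that is already gone is harmless and is not counted.
            if self._data.pop(key, None) is not None:
                purged += 1
        return purged
```

Purging category:1 removes one listing page, while purging product-catalog removes every page that carries it. The difference in blast radius between those two calls is the tag granularity decision.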
The cache stampede and cold start problem
The cache stampede — also called thundering herd — occurs when a high-traffic cached value expires and concurrent requests simultaneously find a cache miss. Each request independently executes the cache miss handler: the database query, the external API call, the complex aggregation. For a value that is cheap to compute (a simple database row lookup), the stampede produces duplicate reads with negligible impact. For a value that is expensive to compute (a JOIN across three tables aggregating 10,000 rows, an external API call with a 200ms latency, a machine learning model inference), the stampede saturates the database or downstream service with duplicate in-flight computations, each of which will produce the same result.
The severity multiplier is traffic volume at the moment of expiry. The stampede size is the number of requests that arrive in the window between the expiry and the first re-population. Assuming re-population takes about one second, a cached value with a 60-second TTL that receives 1,000 requests per minute (about 17 per second) produces a stampede of roughly 17 simultaneous cache miss handlers at expiry. A value receiving 10,000 requests per minute produces a stampede of roughly 170 simultaneous handlers. Teams that cache high-traffic values with fixed TTLs and never documented the stampede behavior encounter it after a cache flush (which expires all keys at once, producing simultaneous stampedes for all hot values), a cache node restart (same effect), or a coordinated TTL expiry for a set of keys written with the same TTL at the same time (a common consequence of cache warming scripts that populate a batch of keys simultaneously).
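The stampede estimate is simple arithmetic: arrival rate multiplied by how long the miss handler takes to re-populate the key. The sketch below assumes a steady arrival rate and a re-population time of about one second, which is the assumption that yields the figures of roughly 17 and 170; the function name is hypothetical.

```python
def stampede_size(requests_per_minute: float, repopulation_seconds: float) -> float:
    """Requests that hit a cache miss before the first re-population finishes."""
    requests_per_second = requests_per_minute / 60.0
    return requests_per_second * repopulation_seconds


small = stampede_size(1_000, 1.0)   # about 16.7 concurrent miss handlers
large = stampede_size(10_000, 1.0)  # about 166.7 concurrent miss handlers
```

The estimate scales linearly with the miss handler's duration, so a 500ms query at 10,000 requests per minute would see roughly half as many duplicate computations as a one-second query.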
The two standard mitigations have different operational tradeoffs. The mutex or single-flight pattern allows only one request to execute the cache miss handler per key; other concurrent requests wait for the first to complete and then read the newly populated cache value. The deduplication guarantee is strong, since only one computation runs, but the waiting requests experience latency equal to the computation time of the miss handler, which is typically the slow operation the cache was added to avoid. For a 500ms database query, all requests that arrive during the stampede window wait up to 500ms for the single-flight computation to complete. The probabilistic early expiration pattern (also called "random early expiration") lets each request decide, with a probability that rises as the expiry approaches, to regenerate the cached value before it expires. For a value with a 60-second TTL, a request at 58 seconds might trigger the regeneration and reset the expiry while every other request continues to be served the existing value, so the expiry never arrives for all requests at once. A related technique, TTL jitter, adds a random offset to each key's TTL so that keys written together do not expire together. Like performance optimization decisions that determine latency behavior under load, the stampede mitigation decision determines user-visible latency at the moment the system is under its highest pressure: after a cache flush, which often happens during or shortly after a deployment.
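Both mitigations are small enough to sketch. The code below is illustrative only: SingleFlight is a hand-rolled, thread-based version of the pattern, not a specific library, and should_refresh_early uses one common formulation of probabilistic early expiration in which the chance of an early refresh grows with the recomputation cost and with proximity to expiry. The beta parameter and all names are assumptions.

```python
import math
import random
import threading
from typing import Any, Callable, Dict


class _Call:
    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Any = None


class SingleFlight:
    """Run fn once per key at a time; concurrent callers share the result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
        if leader:
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()
        else:
            call.done.wait()  # followers pay the leader's latency
        if call.error is not None:
            raise call.error
        return call.result


def should_refresh_early(now: float, expires_at: float, compute_seconds: float,
                         beta: float = 1.0,
                         rand: Callable[[], float] = random.random) -> bool:
    """Return True if this request should regenerate the value before expiry."""
    # 1 - rand() lies in (0, 1], so the logarithm is always defined and <= 0.
    head_start = -compute_seconds * beta * math.log(1.0 - rand())
    return now + head_start >= expires_at
```

Single-flight bounds the load on the database but not the latency seen by waiting requests; early refresh keeps latency flat for most requests but still allows a few duplicate computations. Which of the two a key uses is the policy choice worth writing down.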
The cold start problem is the related constraint that applies after a cache node restart or a full cache flush. A write-through cache is populated through writes — after a cold start, the cache is empty and every read is a cache miss until the cache accumulates data through production writes. For a cache-aside implementation, every read is a cache miss until the hot data set has been read at least once. For high-traffic applications, the transition from a cold cache to a warm cache can last minutes during which the database handles a traffic load it was not provisioned to handle independently. Cache warming procedures — pre-populating the cache with hot data after a restart before routing traffic to the application — are the standard mitigation, but the warming procedure is rarely documented at cache adoption time. It is typically developed reactively after the first cold start incident reveals that the application cannot handle its production traffic load without the cache. Like the feature flag bootstrap behavior that determines what the application returns before the SDK has synced its configuration, the cache cold start behavior determines what the application does in the window between startup and full cache warmth — a constraint that is only visible when the cache is not warm.
The consistency guarantee and the bugs it produces
The most consequential undocumented aspect of caching strategy is the consistency guarantee — the formal statement of what level of data currency the application commits to for each cached data class. Without a documented consistency guarantee, every cached data class implicitly commits to whatever the TTL policy produces: bounded staleness at the TTL boundary. Most teams never named this commitment, never evaluated which data classes can tolerate it, and never excluded data classes whose consistency requirements exceed it.
The stale permission bug is the most common high-severity consequence. Account permissions, subscription tier, and feature entitlements are frequently cached for read performance — permission checks happen on every authenticated request, and reading from the database on every check adds a database query per request at the volume of authenticated traffic. A cache with a 30-minute TTL for permission data means that a permission change — subscription upgrade, subscription expiry, role assignment, role revocation — takes up to 30 minutes to propagate to the application layer. The specific failure modes: a user whose subscription is cancelled should lose access to Pro features immediately; with a 30-minute TTL they retain access for up to 30 minutes, during which they can export data, add team members, or take actions that the subscription cancellation was intended to prevent. A user who is granted admin role by an administrator should have admin access immediately; with a 30-minute TTL they use the application as a regular user for up to 30 minutes while their admin session is pending. Neither failure mode was visible when the 30-minute TTL was chosen to reduce database load — the TTL was evaluated for its cache hit rate impact, not for its consistency implications for permission data.
The stale price display bug produces financial and trust consequences. A product catalog's pricing is cached for read performance: pricing pages receive high traffic and pricing data changes infrequently. A TTL of 24 hours is chosen. A pricing change is made: the Pro plan price changes from $9/month to $12/month at the start of a new pricing tier rollout. The database is updated. For up to 24 hours afterward (12 hours on average, if the remaining TTL of a cache entry at the moment of the change is uniformly distributed), users see the $9/month price on the pricing page. Users who sign up during this window pay $9/month and expect to continue paying $9/month, the price they saw when they signed up. Like API versioning decisions that determine which clients are affected by breaking changes, the TTL policy determines which users see updated pricing and which see stale pricing, and the answer is determined by the time at which their cache entry was populated, not by any intentional policy.
The checkout race condition is a specific version of stale price data where the consistency window creates a financial liability. A user adds items to their cart, sees prices from the cached catalog, and proceeds to checkout. The cart total is computed from cached prices. The actual prices at checkout time are read from the database. If a price increased between the time the cart was populated from the cache and the time the checkout total was computed from the database, the user sees a higher total at checkout than in the cart. If a price decreased, the user is charged less than the cart showed, which costs them nothing but still shows that the cart total cannot be trusted. The checkout race condition is a direct consequence of using cached prices for display and authoritative prices for payment computation, an inconsistency that is built into the architecture by the combination of cache-aside on the display path and direct database reads on the payment path. Whether this inconsistency is acceptable (show stale prices in cart, always use authoritative prices at payment) or not (invalidate product caches on any price change) is an architecture decision that should be documented, not discovered through customer support tickets.
The cache coherence problem across multiple cache stores is the hardest version of the consistency challenge. When the same logical data is cached in multiple stores (the CDN edge cache, the application-layer Redis cache, and the in-process memory cache of a long-running application server), a single write to the authoritative source must invalidate all three caches to achieve consistency. Invalidating only the Redis cache leaves the CDN edge cache and the in-process memory cache serving stale data. The propagation order matters: purging the CDN before the Redis cache lets an edge node refetch from an origin that is still serving the stale Redis value and cache that stale response for another full TTL, while invalidating Redis first leaves only a short window in which edge-served content is stale and origin-served content is fresh. Multi-level cache invalidation is an architecture problem that is invisible when each cache layer is adopted independently and first becomes visible when a write fails to propagate to one of the layers and a user reports seeing different data depending on whether their request hits an edge node or the origin. Like the service mesh observability constraint where trace context propagation requires explicit application-layer participation at every service boundary, multi-level cache invalidation requires explicit application-layer coordination at every write path, a coordination requirement that must be documented as a policy, not discovered through inconsistency incidents.
Writing the caching strategy decision record
The Nygard ADR format adapts to caching decisions with five sections that most cache adoptions leave entirely undocumented.
The caching mechanism and cache provider decision. Name the caching pattern, the cache backend, and the alternatives evaluated with rejection reasons. "We evaluated three approaches in January 2025: Memcached (simpler data model — key-value only, no data structures; horizontal scaling via consistent hashing; no persistence; evaluated for session caching use case only and rejected because we also needed sorted sets for the activity feed ranking), Redis Cluster (distributed Redis with automatic sharding across 6 nodes; supports all Redis data structures; higher operational complexity; evaluated and rejected for the initial deployment — we have one primary application without a write volume that justifies cluster overhead), and Redis Standalone with read replicas (single write primary with 2 read replicas via Redis Sentinel; supports all data structures; automatic failover via Sentinel; our current database cluster already uses a primary-replica pattern we operate confidently). Redis Standalone with Sentinel was selected. Caching pattern: cache-aside for application-layer caching (application checks Redis before reading from Postgres, populates Redis on miss). Write-through is not used — write path is responsible for explicit cache invalidation (see invalidation section). CDN caching: Cloudflare, controlled via Cache-Control response headers, with Cache-Tag headers for surrogate key invalidation on content that requires purge-on-update."
The TTL policy and invalidation mechanism. Name TTL values by data class and the invalidation strategy for each. "Data classes and TTL policies: (1) User profile (name, email, preferences; not account tier or permissions): TTL 300 seconds (5 minutes); invalidation: explicit cache delete on any profile write, including writes by background jobs and admin operations. The 5-minute TTL is the fallback for missed invalidations; the explicit delete is the primary invalidation path. (2) Account tier and permissions: NOT CACHED. Permission data (account tier, feature entitlements, team membership, role assignments) is always read from Postgres. The read cost (one database query per authenticated request on the permission-checked path) is acceptable at current traffic volume. The consistency requirement for permission data (immediate propagation of subscription changes, immediate reflection of role assignments) cannot be met by any TTL policy, because the acceptable staleness window for permissions is zero seconds. If permission data is added to the cache in the future, it requires event-driven invalidation wired to every write path that modifies permissions, and this decision record must be updated before the cache key is added. (3) Product catalog (names, descriptions, feature lists; not prices): TTL 3600 seconds (1 hour); invalidation: cache tag purge via Cloudflare Cache-Tag API on any catalog update. The one-hour TTL is the CDN edge TTL; the application layer reads product catalog from the database. (4) Product prices: NOT CACHED at application layer. Prices are always read from Postgres. Prices shown on the pricing page are subject to CDN TTL (see CDN policy). If the pricing page is updated, the Cloudflare cache tag for pricing content must be explicitly purged. (5) User-generated content (comments, posts, decision records): TTL 60 seconds; invalidation: explicit delete on content write. The 60-second TTL addresses the case where invalidation is missed and ensures content is not stale for more than one minute. Invalidation must be called from the write path before returning the success response to the user. (6) Aggregated counts (total decisions, team member count): TTL 30 seconds; no explicit invalidation (the count changes frequently and the 30-second bounded staleness is acceptable; displaying a count that is 25 seconds out of date does not affect correctness)."
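A policy like this is easier to enforce when it also exists as data that the caching code must consult. The sketch below encodes the six data classes from the example record; the class names, field names, and the CachePolicyError type are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class CachePolicyError(Exception):
    """Raised when code tries to cache a data class the policy does not allow."""


@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: Optional[int]  # None means NOT CACHED
    invalidation: str


POLICIES: Dict[str, CachePolicy] = {
    "user_profile": CachePolicy(300, "explicit delete on write"),
    "permissions": CachePolicy(None, "not cached"),
    "product_catalog": CachePolicy(3600, "cache tag purge (CDN edge TTL)"),
    "product_prices": CachePolicy(None, "not cached at application layer"),
    "user_generated_content": CachePolicy(60, "explicit delete on write"),
    "aggregated_counts": CachePolicy(30, "TTL only"),
}


def ttl_for(data_class: str) -> int:
    """Return the TTL for a data class, refusing unlisted or uncacheable ones."""
    policy = POLICIES.get(data_class)
    if policy is None:
        raise CachePolicyError(f"{data_class!r} has no documented cache policy")
    if policy.ttl_seconds is None:
        raise CachePolicyError(f"{data_class!r} must not be cached")
    return policy.ttl_seconds
```

If every cache write obtains its TTL through ttl_for(), caching a new data class requires editing the table, which is a natural point to require the decision record update as well.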
The consistency guarantee. State explicitly what level of consistency each cached data class provides. "Consistency levels by data class: User profile — near-real-time via explicit invalidation, with 5-minute bounded staleness fallback. Account tier and permissions — strong consistency (always read from database, no caching). Product catalog — eventually consistent with 1-hour TTL, near-real-time for explicit Cloudflare purges. Product prices — strong consistency at application layer (always database); eventually consistent at CDN edge for pricing page HTML (Cache-Control max-age=3600, Cloudflare Cache-Tag purge on price change). User-generated content — near-real-time via explicit invalidation, with 60-second bounded staleness fallback. Data classes not listed: if a data class is not listed, it must not be added to the cache without a documented TTL and invalidation policy and a review of the consistency requirement by the data owner. The consistency guarantee for any cached data class is determined by the TTL and invalidation policy in this section, not by the correctness assumptions in the consuming code. If a service assumes strong consistency for data it reads from the cache, and the cached data class has a non-zero TTL with TTL-only invalidation, the service's assumption is incorrect and will produce incorrect behavior during the staleness window."
The cache key design and namespace policy. Name key conventions, namespace isolation, and composite key encoding. "Key naming convention: {namespace}:{entity_type}:{entity_id}[:{sub_resource}]. Examples: wc:user:8f2a9b3c:profile (user profile for user ID 8f2a9b3c in the 'wc' namespace), wc:decision:{id}:full (full decision record), wc:team:{id}:members (team member list). Namespace: all application keys use the 'wc' prefix. Test and staging environments use 'wc-test' and 'wc-stg'. Do not use numeric IDs as raw key components; encode with entity type prefix to prevent key collisions between different entity types that happen to share ID values. Maximum key length: 512 bytes (a team convention; Redis itself accepts keys up to 512 MB, but long keys cost memory and lookup time). Composite keys for queries that return collections (e.g., 'all decisions for user X in team Y') must include all query parameters that affect the result, including sort order and pagination parameters if the result set is paginated. Collections must be invalidated when any member entity changes: the collection cache key must be deleted when a member is added, removed, or modified. This is the primary source of staleness bugs for collection caches: a new member is added, the individual member's cache key is invalidated, but the collection key that includes that member is not invalidated and serves a stale list."
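Key conventions tend to hold only when keys are built by one function instead of by string formatting at each call site. A minimal sketch, assuming the naming convention from the example record; the function names and the "list" segment used for collection keys are hypothetical.

```python
from typing import Optional

MAX_KEY_BYTES = 512  # limit taken from the example record


def cache_key(entity_type: str, entity_id: str,
              sub_resource: Optional[str] = None, namespace: str = "wc") -> str:
    """Build {namespace}:{entity_type}:{entity_id}[:{sub_resource}]."""
    parts = [namespace, entity_type, str(entity_id)]
    if sub_resource:
        parts.append(sub_resource)
    key = ":".join(parts)
    if len(key.encode("utf-8")) > MAX_KEY_BYTES:
        raise ValueError(f"cache key exceeds {MAX_KEY_BYTES} bytes")
    return key


def collection_key(entity_type: str, namespace: str = "wc", **params: object) -> str:
    """Key for a query result; every parameter that affects the result is encoded."""
    # Sorting makes the key independent of the order the caller passed arguments.
    encoded = "&".join(f"{name}={params[name]}" for name in sorted(params))
    return f"{namespace}:{entity_type}:list:{encoded}"
```

For example, cache_key("user", "8f2a9b3c", "profile") yields wc:user:8f2a9b3c:profile, and two calls to collection_key with the same parameters in a different order yield the same key.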
The failure behavior and cache stampede policy. Name what happens when the cache is unavailable and how stampedes are mitigated. "Cache unavailability: the application treats Redis as a performance optimization, not a required dependency. If Redis is unavailable (connection timeout, connection refused, command timeout), the application falls through to the database for all reads. Cache writes are best-effort: if the Redis write fails, the application logs the failure at WARN level and returns the database result without caching it. The application must not return an error to the user because Redis is unavailable. Circuit breaker: after 10 consecutive Redis failures in a 10-second window, the Redis client opens a circuit breaker and all cache reads return 'cache unavailable' immediately without attempting a Redis connection. The circuit resets after 30 seconds. This prevents Redis connection pool exhaustion from cascading into application thread exhaustion during a Redis outage. Cache stampede: the application uses the single-flight pattern (singleflight library / equivalent) for cache miss handlers that execute database queries costing more than 50ms at p99. Single-flight deduplicates concurrent cache miss executions for the same key: only one database query runs, and other concurrent requests wait and receive the shared result. Single-flight is not applied to cache misses for cheap database queries (p99 under 50ms), because the overhead of the deduplication mechanism exceeds the benefit for fast queries. Cache warm-up: after a Redis node restart or a cache flush, the application does not pre-warm the cache. The application relies on the database to handle the increased read load during the warm-up period (typically 5–15 minutes for most traffic patterns). If the database cannot sustain production read load without a warm cache (confirmed through load testing), a warm-up script that pre-populates hot user profile keys and hot product catalog keys must be added to the deployment runbook and this section updated with the warm-up procedure."
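The fail-open read path and the circuit breaker described in this policy might look roughly like the following. It is a sketch under stated assumptions: the cache object is any client exposing hypothetical `get(key)` and `set(key, value, ttl)` methods, the built-in `ConnectionError` and `TimeoutError` stand in for whatever exceptions the real Redis client raises, and a cached value of `None` is treated as a miss.

```python
import logging
import time

log = logging.getLogger("cache")


class CircuitBreaker:
    """Opens after `threshold` consecutive failures inside `window` seconds."""

    def __init__(self, threshold=10, window=10.0, reset_after=30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.streak_started = 0.0
        self.opened_at = None

    def allow(self):
        """Return True when a cache call may be attempted."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            self.opened_at = None   # reset after the cool-down period
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        now = self.clock()
        # Start a new streak if this is the first failure or the window lapsed.
        if self.failures == 0 or now - self.streak_started > self.window:
            self.failures = 0
            self.streak_started = now
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now


def get_or_load(cache, breaker, key, load_from_db, ttl_seconds):
    """Cache-aside read that never fails the request because of the cache."""
    cache_usable = breaker.allow()
    if cache_usable:
        try:
            cached = cache.get(key)
            breaker.record_success()
            if cached is not None:
                return cached
        except (ConnectionError, TimeoutError) as exc:
            breaker.record_failure()
            cache_usable = False
            log.warning("cache read failed for %s: %s", key, exc)
    value = load_from_db()          # fall through to the database
    if cache_usable:
        try:
            cache.set(key, value, ttl_seconds)   # best-effort write
        except (ConnectionError, TimeoutError) as exc:
            breaker.record_failure()
            log.warning("cache write failed for %s: %s", key, exc)
    return value
```

The injectable `clock` is a design choice for testability: it lets a test advance time past the 30-second reset without sleeping. While the circuit is open, `get_or_load` skips the cache entirely and goes straight to the database.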
Finding caching decisions in AI chat
The WhyChose extractor surfaces caching decisions from four session types that contain the reasoning most teams cannot reconstruct when a new engineer asks why users sometimes see stale data after an update, or when a cache stampede incident prompts someone to ask whether the cache miss handler is protected against thundering herd.
The initial adoption session. "Redis vs. Memcached for a Node.js API — which should we use?", "how to add caching to a Django REST framework API", "best practices for caching database query results in Python", "cache-aside pattern vs. read-through — when to use which?", "how to set Redis TTL for user session data vs. product data", "should we cache database query results or full API responses?" These sessions contain the cache provider choice, the caching pattern, and the initial TTL reasoning. The adoption session is the most important to recover because the mechanism chosen at adoption carries all the downstream consistency commitments — and the rejection reasons for Memcached, for write-through, for database-level query caching are why the chosen mechanism cannot simply be replaced when a consistency bug reveals its limitations. A team that rejected write-through because of write latency concerns has a documented reason to revisit that tradeoff if write latency tolerance changes; without the record, the decision to use cache-aside is just an observed fact, not a reasoned position that can be updated.
The staleness incident session. "User updated their profile but still sees old data after page refresh", "how to invalidate Redis cache after a database write in Python", "Redis cache not reflecting updated database values", "cache invalidation pattern for cache-aside — when to delete vs. update the cached value", "user sees expired subscription features — permissions are cached with wrong data", "how to make cache invalidation happen immediately when a record changes." These sessions reveal when the team first encountered the staleness problem and what invalidation strategy they applied. The session that asks "user sees expired subscription features" is the permission caching bug being diagnosed in real time — it contains the diagnosis (permissions cached without invalidation on subscription change), the fix applied (add explicit cache delete to subscription expiry handler), and the scope of the problem (all other permission-modifying write paths that also lack invalidation). For platform teams, recovering staleness incident sessions from individual service teams identifies which write paths lack cache invalidation — the map of consistency gaps that a platform-level invalidation standard would address.
The stampede session. "Redis cache stampede — how to prevent thundering herd in production", "multiple requests hitting the database simultaneously when cache key expires", "how to use mutex for Redis cache regeneration in Node.js", "probabilistic early cache expiration to prevent thundering herd", "singleflight pattern for cache miss deduplication in Go", "cache warming after Redis restart — application slow after Redis node failure", "how to prevent all cache keys from expiring at the same time." These sessions emerge after a cache flush or coordinated TTL expiry incident. Like performance debugging sessions that reveal system behavior under production load, the stampede incident session contains the actual database saturation metrics, the specific hot keys that produced the stampede, and the fix applied. Recovering this session produces the stampede mitigation section of the decision record without requiring the team to reconstruct the mitigation from first principles after the second stampede incident.
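The mitigation these sessions most often converge on, single-flight deduplication of concurrent cache misses, can be sketched with the standard library alone. This is an illustrative in-process version, assuming a threaded application; the class and method names are made up for the example and are not taken from any particular library.

```python
import threading


class _Call:
    """State shared between the leader and the waiters for one key."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Run fn once per key at a time; concurrent callers share the result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}   # key -> in-flight _Call

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()        # the only database query that runs
            except Exception as exc:
                call.error = exc          # waiters see the same failure
            finally:
                with self._lock:
                    del self._calls[key]  # later misses start a fresh call
                call.done.set()
        else:
            call.done.wait()              # wait for the leader's result
        if call.error is not None:
            raise call.error
        return call.result
```

A cache miss handler would wrap its database query as `flight.do(key, query)`. Note that this only deduplicates within one process; with several application instances, each instance can still send one query per hot key.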
The cache failure session. "What happens to the application when Redis goes down", "how to make my application continue working without the cache", "Redis connection pool exhausted — application throwing errors on cache reads", "how to implement graceful degradation when cache is unavailable", "circuit breaker for Redis in a Python application", "application performance without Redis — do we need to scale the database?", "how to handle Redis timeout without failing the user request." These sessions contain the failure behavior decision — whether the application fails open to the database, implements a circuit breaker, or returns an error when the cache is unavailable. A technical leader who inherits a system where Redis is an undocumented hard dependency — where Redis unavailability produces application failures because the code treats Redis as required infrastructure rather than optional acceleration — cannot assess the failure blast radius without reading the cache client code in every service. The failure session from the first Redis outage contains the decision about whether Redis should be a hard or soft dependency, and the fix applied (fail open vs. circuit breaker vs. error return) is the failure behavior policy that new services should follow.
What the decision record prevents
A documented caching strategy prevents three recurring problems that teams encounter as their cache usage grows and their engineering team turns over.
It prevents the undocumented consistency commitment. A team that caches account permissions with a 30-minute TTL has implicitly committed to 30-minute eventual consistency for permission changes — but without documenting this commitment, no engineer can confirm whether 30-minute staleness is acceptable for permission data, and no product decision that modifies how permissions are enforced can account for the staleness window. The decision record that names "account tier and permissions: NOT CACHED — strong consistency required" converts the implicit policy into an explicit constraint that new services can rely on. A new service that reads permissions from the cache, believing them to be current, builds on a false premise; a new service that reads permissions from the database, knowing that permission data is explicitly excluded from caching, builds on a documented guarantee. Like security decisions that determine which threat models are in scope, the consistency guarantee determines which data integrity assumptions are valid for downstream services — and it must be documented to be reliable.
It prevents the write-path invalidation gap. The most common source of staleness bugs is a new write path added after the cache was deployed that modifies cached data without adding the corresponding cache invalidation. An admin interface added six months after launch that can modify user profiles without calling the profile cache invalidation. A batch import job that updates product descriptions without purging the catalog cache. A background job that processes subscription cancellations without invalidating the permission cache. Each new write path that modifies cached data must be paired with the cache invalidation for the modified data class — and without a documented policy that makes this pairing explicit, each new write path represents a potential staleness bug. The decision record that lists the invalidation requirement for each cached data class converts the pairing requirement from an implicit engineering convention into an explicit policy that code reviewers can verify. Like ADR lifecycle policies that define when a decision requires revisitation, the invalidation policy is only reliable if it is documented as a requirement that applies to all future write paths, not just the ones that exist at the time of adoption.
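One way to make the pairing reviewable is to keep the invalidation requirement for each data class in a single registry that every write path calls. The sketch below is hypothetical: the data class names, the `admin_update_profile` write path, and the use of plain dictionaries for the database and the cache are assumptions for illustration, and the key templates reuse the naming convention from the key design section.

```python
# Data class -> key templates that must be deleted when that data changes.
INVALIDATION_POLICY = {
    "user_profile": ("wc:user:{user_id}:profile",),
    "decision": ("wc:decision:{decision_id}:full",),
    "team_membership": ("wc:team:{team_id}:members",),
}


def keys_to_invalidate(data_class, **ids):
    """Resolve the documented key templates for one data class."""
    try:
        templates = INVALIDATION_POLICY[data_class]
    except KeyError:
        # An unlisted data class means the policy was never written down.
        raise LookupError(
            f"no invalidation policy documented for {data_class!r}"
        ) from None
    return [template.format(**ids) for template in templates]


def invalidate(cache, data_class, **ids):
    """Delete every key the policy lists; return them for logging or review."""
    keys = keys_to_invalidate(data_class, **ids)
    for key in keys:
        cache.pop(key, None)
    return keys


def admin_update_profile(db, cache, user_id, fields):
    """Example write path added after launch: write first, then invalidate."""
    db[user_id].update(fields)
    invalidate(cache, "user_profile", user_id=user_id)
```

With this shape, a reviewer checking a new write path has one question to ask: does it call `invalidate` with the data class it modifies? A write path for a data class missing from the registry fails loudly instead of silently skipping invalidation.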
It prevents the cascade failure after a cache flush. A team that deploys without knowing their application cannot sustain production database load without the cache discovers this during the post-deploy cache warm-up period, when the database experiences a sudden increase in read load that saturates its connection pool and produces cascading timeouts. The decision record that documents the cache cold start behavior — whether the application can sustain production load from a cold cache, and if not, what the warm-up procedure is — converts the cache flush from an unpredictable production event into a documented operational procedure with a known recovery time. Like the logging infrastructure decisions that determine whether the on-call engineer can answer incident questions at 3am, the cache failure behavior documented in the decision record determines whether the on-call engineer has a procedure to follow when Redis restarts or whether they are diagnosing the application's cache dependency from first principles during a production incident.
Further reading
- Decisions that never get written down — the cache invalidation policy is one of the decisions most likely to be undocumented: it feels like an implementation detail rather than an architecture decision, it is applied per-data-class by individual engineers, and the consequences (stale permissions, stale prices, cache-dependent application failures) appear only when a specific write path lacks the corresponding invalidation call; the policy must be explicit to be reliably applied
- The data retention decision record — TTL policies and retention policies share a structural property: both are point-in-time decisions that become the operative policy for all data in scope until a specific incident forces a revisit; a retention policy that deletes data after 30 days cannot support queries that require 60 days of history; a TTL that caches permission data for 30 minutes cannot support a security policy that requires permission changes to propagate immediately
- The performance optimization decision record — caching adoption is the most common performance optimization decision, and the performance-consistency tradeoff it makes is the decision most likely to be undocumented; the performance benefit (reduced database load, lower read latency) is visible and measurable; the consistency cost (bounded staleness, invalidation gaps, stampede risk) is invisible until an incident reveals it; documenting both is what makes the tradeoff explicit rather than accidental
- The feature flag decision record — caching and feature flags share a structural pattern: both are adopted as operational conveniences (performance / dark launches) and both carry implicit consistency commitments (staleness window / assignment model) that determine behavior in scenarios not visible at adoption time; the flag lifecycle policy and the cache invalidation policy are both decisions that are constructed reactively after the first incident reveals the absence of the policy
- The error handling strategy decision record — the cache failure behavior (fail open vs. fail closed vs. circuit breaker) is a specific error handling decision that determines the user-visible consequence of cache unavailability; an application that returns an error when Redis is unavailable has made a different error handling decision than one that fails through to the database; both decisions need to be documented as policies rather than left to per-engineer judgment at each cache read site
- The service mesh decision record — multi-level cache invalidation (CDN edge + application Redis + in-process memory) has the same coordination requirement as service mesh trace context propagation: both require explicit application-layer participation at every boundary, and both produce silent inconsistency bugs when the participation is absent at a specific boundary; the coordination requirement must be documented as a policy, not discovered through inconsistency incidents
- The security ADR: threat model and compliance — caching decisions have direct security implications: caching authentication tokens, session data, or permission information introduces risks if the cache is accessible to multiple tenants without namespace isolation; the cache namespace policy and the list of data classes explicitly excluded from caching (permissions, authentication tokens) are security decisions as much as performance decisions, and they belong in both the caching ADR and the threat model
- The API versioning decision record — CDN edge caching and API versioning interact: a cached response for API version v1 must not be served for a v2 request; the cache key must include the API version, and the invalidation policy must account for version-specific cache entries; teams that add CDN caching to a versioned API without including the version in the cache key serve wrong-version responses to clients until the TTL expires
- ADR lifecycle: superseding and deprecating decisions — caching decisions are frequently revisited as traffic patterns change; the TTL that was correct at 1,000 requests per minute may be incorrect at 100,000 requests per minute; the cache provider selected for a single-service application may be wrong for a multi-service architecture; documenting the revisitation conditions (traffic threshold, consistency incident rate threshold, multi-tenancy requirement) makes cache mechanism migrations deliberate rather than reactive
- Three months of AI chat history, undocumented — caching decisions appear in AI chat in four session types: the adoption session (mechanism choice, TTL reasoning); the staleness incident session (invalidation gap discovery and fix); the stampede session (thundering herd discovery and mitigation); and the cache failure session (failure behavior and circuit breaker decision); the staleness incident session is the most actionable for the decision record because it documents the specific invalidation gap in terms of a real production incident rather than a theoretical consistency property
- The new-CTO onboarding problem — a technical leader who inherits a system with caching cannot determine which data classes are cached, what the TTL policies are, which write paths include cache invalidation, and what happens when the cache is unavailable without reading every service's cache client code; the caching decision record converts these questions from a codebase archaeology exercise into a lookup against documented policies that the new leader can verify against the code
- Nygard ADR template — the standard format adapts for caching decisions with the consistency guarantee and the failure behavior as the most critical additions to the standard Consequences section; both are system-level policies that determine correctness properties for every service that reads cached data, and both need to be documented as explicit commitments rather than left as implicit consequences of the TTL and mechanism choices
- WhyChose extractor — caching decisions appear in AI chat in four session types: the initial adoption session (mechanism choice, provider selection, TTL reasoning); the staleness incident session (invalidation gap discovery); the stampede session (thundering herd discovery and mitigation); and the cache failure session (failure behavior and fallback policy); the staleness incident session is the most valuable for the decision record because it documents the specific consistency gap and the fix applied in terms of a real incident rather than a theoretical constraint