The search relevance decision record: why the ranking model you chose determines your query understanding ceiling and your content-scoring iteration velocity
The ranking model chosen when search is first implemented is almost always Elasticsearch's default BM25 — reached for because it works out of the box and satisfies the immediate requirement of returning results for the demo. BM25's limitations are not visible at launch. They accumulate over months, as users search for things using different words than the content uses, as new content categories create cold-start ranking failures, and as the synonym dictionary required to patch the vocabulary gap becomes a full-time maintenance burden that never achieves complete coverage. Changing the ranking model after launch requires rebuilding the retrieval layer, migrating the index schema to support vector fields, deploying embedding infrastructure, and rewriting the query pipeline — a project that blocks every relevance improvement until it completes. The ranking model choice sets three structural properties that determine whether a team can improve search after launch without rebuilding it from scratch.
A 30-person B2B SaaS company building a customer support documentation platform added search in year one using Elasticsearch with default BM25 ranking. The product was a knowledge base tool where support teams published documentation articles and end users searched for answers to their questions. The initial implementation was straightforward: article content was indexed as a text field with the default English analyzer, and user queries were passed to a match query that returned articles sorted by BM25 score. The search worked well during the onboarding demos — the engineering team could type "password reset" and retrieve the password reset article, type "billing" and retrieve the billing FAQ, type "API key" and retrieve the API documentation. The feature shipped.
Over the next eighteen months, the support team noticed a pattern in the user feedback queue: customers were filing tickets for questions that had documented answers in the knowledge base, but the answers were not surfacing in search. Investigation revealed a vocabulary alignment problem that BM25 exposed systematically. The documentation had been written by support engineers using product-specific terminology: "billing record," "invoice document," "payment method on file." Users searched using natural language queries that expressed the same concepts without using those exact terms: "how do I bill a client," "where is my invoice," "update my credit card." BM25 is a term frequency model — it retrieves documents whose indexed terms overlap with the query terms. A query for "how do I bill a client" against a document titled "Creating a billing record" has low term overlap: "bill" overlaps with "billing" through stemming, but "client" has no match against "record" or "creating." The document does not rank in the top results.
The engineering team's first response was to add synonyms to the Elasticsearch analyzer: "bill" → ["invoice," "billing," "charge"], "client" → ["customer," "user," "account"]. Synonyms expanded the retrieval recall for the specific query patterns they had observed. They added another set of synonyms the following month after another batch of zero-result tickets. And another set the month after that. By the end of year two, the synonym list contained over 400 entries maintained in a configuration file owned by the senior backend engineer who had built the search feature. New support documentation authors were trained to run searches against their own articles before publishing to check whether user-query terminology would retrieve them — a workflow that caught some gaps but not all, because authors were searching for what they knew users would search, which was the same vocabulary gap the synonym list was trying to close. The synonym maintenance became a recurring quarterly task that the engineering team resented and that never fully resolved the coverage problem, because the list could only patch vocabulary gaps that had been observed, not the ones that hadn't yet surfaced in the support queue.
By year three, the product team had scoped a search improvement project: add semantic search using dense retrieval embeddings to handle the vocabulary alignment gap without manual synonym maintenance. The engineering scoping revealed that the existing Elasticsearch index could not accommodate this incrementally. Dense retrieval requires a vector field in the index schema (a kNN or ANN-capable field storing the embedding vector for each document), a vector index type (HNSW in Elasticsearch 8+) separate from the inverted index used for BM25, an embedding generation pipeline that computed the document embedding at index time and stored it alongside the BM25 fields, and a hybrid query path that combined BM25 keyword retrieval with ANN vector retrieval. The Elasticsearch version running in production was 7.10, released before native kNN vector support was stable; upgrading required a full cluster migration and index rebuild. The embedding generation pipeline required an embedding model hosted either in-process or as a sidecar service, adding latency to the indexing path and a new infrastructure dependency that had never been budgeted. The hybrid query that combined BM25 scores with ANN similarity scores required a score fusion approach (Reciprocal Rank Fusion or linear combination) that had to be calibrated against an offline evaluation set — which did not exist, because the team had never built a labeled query-document relevance dataset. The project scope grew to four months. The vocabulary alignment problem that had driven synonym maintenance for two years was a direct consequence of the BM25-only ranking model choice made when search was first added — and the choice had never been documented as a decision with a vocabulary-ceiling consequence.
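As a rough illustration of the score fusion step named above, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The document ids are made up for the example, and the constant k=60 is a commonly used default assumed here, not a value taken from the team's system.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from the keyword path and the vector path.
bm25_ranking = ["password-reset", "billing-faq", "api-keys"]
ann_ranking = ["billing-record", "billing-faq", "invoice-document"]
fused = reciprocal_rank_fusion([bm25_ranking, ann_ranking])
```

RRF uses only rank positions, so it needs no calibration of raw BM25 scores against cosine similarities. A linear combination of scores does need that calibration, which is where the missing offline evaluation set becomes a blocker.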
A B2C marketplace serving 80,000 active sellers implemented search ranking with a learned-to-rank model in year two. After a year of BM25-only search, the data science team had accumulated click-through rate and purchase signals on 2 million (query, item) pairs and trained a LightGBM model using BM25 base score, item click-through rate, item purchase conversion rate, item rating, item age (log-transformed), and seller response rate as input features. The model improved P@5 by 18% on a held-out evaluation set and went live. For the marketplace's established categories — vintage clothing, handmade jewelry, home decor — the LTR model significantly improved relevance. Popular items with strong engagement signals rose in rankings; low-rated items with stale listings fell.
Twelve months after the LTR model launched, the marketplace expanded into a new vertical: antique and vintage furniture. The furniture category launched with 12,000 listings from 800 sellers who had joined specifically for the new vertical. The product team expected the search experience to be strong immediately, because the LTR model was already live and the new category would benefit from the same ranking improvements the other categories had seen. The support team started receiving complaints within two weeks of launch. Users searching for "vintage oak writing desk" were getting results from the vintage clothing category (jackets and accessories with high engagement signals) ranked above furniture items with zero engagement history. The LTR model was doing exactly what it had been trained to do: scoring items with a gradient-boosted tree ensemble in which click-through rate, purchase conversion, and rating dominated the learned splits. All 12,000 new furniture listings had a click-through rate of zero (they had never appeared in search results before), a purchase conversion of zero, a rating of zero (no transactions yet), and log-age near zero (just listed). The only feature that could distinguish them was the BM25 base score, a term match signal, but it was the weakest feature in the trained LightGBM model, contributing 6% of feature importance. A vintage clothing jacket with 3.8 stars, a 12% click-through rate, and 4% purchase conversion outscored a new furniture listing for "vintage oak writing desk" even when the furniture listing's title was an exact term match for the query.
The engineering team's first remediation was to add a category filter that restricted furniture queries to furniture listings. This resolved the cross-category contamination but exposed the within-category cold-start problem: among 12,000 furniture listings with identical zero engagement signals, BM25 was the only differentiator. Users searching "vintage oak desk" found "antique wooden writing table" at position 14 and "mid-century teak writing surface" at position 22, because neither "oak" nor "vintage" appeared in those titles, and BM25 had no semantic signal to bridge the vocabulary gap. For established categories, the LTR model had replaced BM25's already-weak relevance signal with engagement features as the primary ranking factor. For the new category there were no engagement signals to supply those learned corrections, so BM25's structural limitations were fully exposed, and the trained model weighted the BM25 score so lightly that even its term-matching capability was effectively suppressed. The furniture search experience degraded below the BM25-only baseline the marketplace had used before the LTR model. The data science team spent six weeks implementing a cold-start fallback: a hybrid model that switched to BM25-weighted ranking for items below an engagement threshold of 50 impressions, then transitioned to LTR once the threshold was exceeded. The six-week remediation delay, during which furniture search was visibly broken, was a direct consequence of the LTR model's behavioral feature dependency never having been documented as a cold-start risk in a ranking model ADR.
Structural properties set by the ranking model choice
Three structural properties are determined when a team chooses how to rank search results. None appear in the founding session that asks how to add search to the product — that session focuses on returning results at all, not on whether the ranking model can handle queries it has never seen.
Property 1: Query understanding ceiling and vocabulary alignment. The query understanding ceiling is the maximum semantic distance between a query term and a relevant document term that the ranking model can bridge without manual intervention. BM25's query understanding ceiling is the stemming distance: "billing" matches "bill" through stemming, but "invoice" does not match "billing" without an explicit synonym. The ceiling is a hard limit — BM25 cannot retrieve a document whose indexed terms do not overlap with the stemmed query terms. Synonym expansion lifts the ceiling for observed vocabulary gaps, but the synonym dictionary must be maintained manually as vocabulary evolves, and it can only cover gaps that have been observed through monitoring the zero-result rate and the support feedback queue. A synonym dictionary for a product with controlled vocabulary (a vertical SaaS where users and authors share domain terminology) may stabilize at a manageable size. A synonym dictionary for a consumer product with diverse user language typically grows without bound as new user populations bring new terminology.
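The term-overlap ceiling can be made concrete with a small sketch. The stemmer below is a deliberately crude suffix stripper standing in for a real analyzer, and the stopword list and synonym map are hypothetical; the point is only that a query with no stemmed term in common with the document has nothing for BM25 to score.

```python
STOPWORDS = {"how", "do", "i", "a", "is", "my", "where", "the"}
SUFFIXES = ("ing", "s")  # crude stand-in for a real stemmer (assumption)

def stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text, synonyms=None):
    """Lowercase, drop stopwords, stem, and expand with optional synonyms."""
    terms = set()
    for token in text.lower().split():
        if token in STOPWORDS:
            continue
        terms.add(stem(token))
        for alt in (synonyms or {}).get(token, ()):
            terms.add(stem(alt))
    return terms

doc_terms = analyze("Creating a billing record")

# "bill" reaches "billing" through stemming; "client" matches nothing.
overlap_bill = analyze("how do I bill a client") & doc_terms

# No shared stem at all: the document cannot be retrieved by term overlap.
overlap_invoice = analyze("where is my invoice") & doc_terms

# A synonym entry lifts the ceiling, but only for this one observed gap.
overlap_patched = analyze("where is my invoice", {"invoice": ["billing"]}) & doc_terms
```

Each synonym entry fixes exactly one observed gap, which is why the dictionary grows with the number of distinct phrasings users bring rather than with the size of the corpus.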
Learned-to-rank models trained on behavioral features do not change the query understanding ceiling: LTR re-ranks the candidates returned by BM25 retrieval, so it inherits BM25's vocabulary limitation for candidate retrieval. A query that retrieves no relevant candidates from the BM25 step cannot be rescued by LTR re-ranking because LTR operates only on the candidate set. LTR improves the ordering of already-retrieved candidates — it cannot retrieve candidates that BM25 missed. The search architecture decision record documents the indexing pipeline and synchronization; the search relevance ADR must document the retrieval step's vocabulary coverage and the ceiling on what LTR re-ranking can correct.
Dense retrieval (semantic embeddings with ANN search) raises the query understanding ceiling to the semantic distance the embedding model can bridge. A bi-encoder model (query and document encoded independently into a shared embedding space) retrieves documents whose embedding is similar to the query embedding, regardless of term overlap. A query for "how do I bill a client" and a document titled "Creating a billing record" will have similar embeddings if the model was trained on a corpus where these phrasings co-occur or are used interchangeably. The ceiling is now the semantic distance the embedding model can represent — which is higher than BM25's term overlap ceiling for most query types but still has limits: highly domain-specific terminology not present in the embedding model's training data will produce poor embeddings, and the embedding model must be retrained or fine-tuned on the product's corpus and user query language to capture domain-specific semantics. The ML model serving decision record documents the neural model serving infrastructure required for embedding computation at query time and index time; the search relevance ADR must document the embedding model selection, the index-time embedding pipeline latency and cost, and the query-time embedding latency as a component of the end-to-end search latency budget.
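A minimal sketch of the retrieval step, under loud assumptions: the three-dimensional vectors below are hand-written toy values, not the output of any embedding model, and a brute-force scan stands in for an ANN index. It only shows that ranking by cosine similarity does not depend on shared terms.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Brute-force nearest neighbours; an ANN index approximates this scan."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Toy document embeddings (hypothetical values chosen for illustration).
DOC_VECS = {
    "Creating a billing record": [0.9, 0.1, 0.0],
    "Resetting your password": [0.0, 0.1, 0.9],
    "Rotating an API key": [0.1, 0.9, 0.2],
}

# Toy embedding standing in for the query "how do I bill a client".
query_vec = [0.8, 0.2, 0.1]
results = top_k(query_vec, DOC_VECS)
```

In a real system the query vector comes from the same bi-encoder that produced the document vectors, which is why query-time embedding latency belongs in the search latency budget.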
Property 2: Relevance iteration velocity and labeling infrastructure dependency. Relevance iteration velocity is the rate at which the team can improve search relevance after the initial ranking model is deployed. BM25's relevance is tunable immediately after deployment: field weight adjustments (boosting the title field relative to the body text), BM25 parameters (b for length normalization, k1 for term frequency saturation), and synonym dictionary updates can be applied without retraining a model, and their effect can be evaluated against the live system. The iteration cycle is hours to days, gated only by the team's ability to measure the change's effect through A/B testing or offline evaluation against a query sample.
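To show what the two BM25 parameters control, here is a sketch of the term-frequency component of the BM25 formula with the IDF factor left out. The defaults k1=1.2 and b=0.75 match the Elasticsearch defaults; the function name is illustrative.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term in one document.

    k1 controls how quickly repeated occurrences stop adding score
    (saturation); b controls how strongly long documents are penalized.
    The IDF multiplier is omitted to isolate the two tunable parameters.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Saturation: the component rises with tf but never reaches k1 + 1.
one_hit = bm25_tf_component(tf=1, doc_len=100, avg_doc_len=100)
ten_hits = bm25_tf_component(tf=10, doc_len=100, avg_doc_len=100)

# Length normalization: the same tf scores lower in a longer document...
long_doc = bm25_tf_component(tf=1, doc_len=400, avg_doc_len=100)

# ...unless b is set to 0, which disables length normalization entirely.
long_doc_b0 = bm25_tf_component(tf=1, doc_len=400, avg_doc_len=100, b=0.0)
```

Because these are plain numeric settings, a change can be evaluated against a query sample without any training step, which is what keeps the BM25 iteration cycle short.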
Learned-to-rank models improve relevance beyond what BM25 parameter tuning can achieve, but their iteration cycle is longer and their improvement requires a labeled dataset. Adding a new feature to an LTR model requires retraining the model on a dataset that includes the new feature. The dataset requires either a click-through log (implicit labels from user behavior) or explicit relevance judgments (human annotators rating (query, document) pairs). A click-through log requires a minimum traffic volume to produce reliable signal: at 1,000 daily searches with a 20% click-through rate, the team accumulates roughly 200 clicked (query, item) pairs per day, which is adequate for retraining a LightGBM model monthly but insufficient for a deep learning ranking model requiring millions of examples. The logging strategy decision record documents the structured logging infrastructure; click-through logging requires capturing the query, the result set with each result's position, the item clicked (if any), and the session context, none of which appear in the default Elasticsearch access log. Without a click-through logging schema designed at the time search is implemented, the click-through dataset must be backfilled from whatever signals were captured, which typically means position-biased click rates that require debiasing before they can be used as training labels.
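The logging requirement and the volume arithmetic can be sketched together. The record type and its field names are hypothetical, chosen to cover the items listed above; the one-million-pair target is an assumed stand-in for "millions of examples".

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchImpression:
    """One result shown to one user (hypothetical schema)."""
    query: str
    session_id: str
    item_id: str
    position: int    # rank at which the item was displayed, needed for debiasing
    clicked: bool
    timestamp: float

def clicked_pairs_per_day(daily_searches, click_through_rate):
    """Clicked (query, item) pairs per day, assuming one click per clicked search."""
    return round(daily_searches * click_through_rate)

def days_to_collect(target_pairs, daily_searches, click_through_rate):
    """Days of traffic needed to accumulate target_pairs clicked pairs."""
    per_day = clicked_pairs_per_day(daily_searches, click_through_rate)
    return math.ceil(target_pairs / per_day)

per_day = clicked_pairs_per_day(1_000, 0.20)
days_for_million = days_to_collect(1_000_000, 1_000, 0.20)
```

At the traffic level in the example, a million clicked pairs is more than a decade of logging, which is the practical reason small products stay with gradient-boosted models or explicit judgments.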
Neural reranking models (cross-encoders that jointly encode the query and document to produce a relevance score) achieve higher accuracy than bi-encoder dense retrieval and LTR but require the largest labeled datasets and the longest training cycles. Fine-tuning a BERT-based cross-encoder on (query, document, relevance label) triples requires tens of thousands to hundreds of thousands of labeled examples and hours to days of GPU training time. The CI/CD pipeline decision record documents the deployment automation infrastructure; the search relevance ADR must document the model deployment pipeline for ranking model updates — the offline evaluation step that validates a new ranking model against the previous one on the held-out query set, the A/B test framework for deploying a new model to a percentage of traffic before full rollout, and the rollback procedure when a new ranking model degrades online metrics. The feature store decision record documents the behavioral feature computation architecture; LTR features (item-level engagement rates, user-level preference signals, query-item interaction history) must be computed by the feature store and served at query time within the search latency budget.
Property 3: Cold-start behavior and the feature availability contract. Cold-start ranking failures occur when new content, new creators, or new product categories enter the corpus without the engagement signals that the ranking model depends on. The ranking model's cold-start behavior is determined at the time the model is chosen, not at the time the cold-start problem is first observed. BM25 does not have a cold-start problem for content ranking: term frequency and document length come from the document's own content at index time, and inverse document frequency comes from corpus-wide term statistics, so a new document ranks according to how well its terms match the query from the moment it is indexed. New documents that use query-relevant terminology rank well; new documents that do not use it rank poorly. The ranking signal is content-intrinsic.
Learned-to-rank models have a structural cold-start problem for every feature that depends on behavioral engagement data. A new item with click-through rate zero and purchase conversion zero is indistinguishable from any other new item by the features the model was trained on. If the BM25 base score is a weak feature in the trained LTR model (as is typical when the LTR model was trained on a corpus where behavioral features dominate), the new item's LTR score is near zero regardless of its content quality. The remediation is a cold-start fallback that routes new items below an engagement threshold to a content-based ranking (BM25 or a lightweight feature set that uses only content-intrinsic signals), transitioning to the full LTR model once the engagement threshold is exceeded. Implementing this fallback after the LTR model is in production requires instrumenting the engagement threshold, building the routing logic in the query pipeline, and calibrating the threshold against the distribution of item engagement levels in the corpus. This is the remediation the marketplace team implemented in six weeks — it is not complex, but it requires planning and cannot be done at zero cost after the cold-start problem surfaces in production. The search relevance ADR must document the cold-start handling contract at the time the ranking model is chosen, before the first new content category is launched against a behavioral-feature-dependent ranking model.
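A minimal sketch of the fallback routing described above, using the 50-impression threshold from the marketplace example. The min-max normalization that makes BM25 and LTR scores comparable is an assumption of this sketch, as are the class, field, and item names; the source does not describe how the marketplace team merged the two rankings.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    bm25_score: float
    ltr_score: float
    impressions: int

def _min_max(values):
    """Scale values to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rank_with_cold_start_fallback(candidates, threshold=50):
    """Rank cold items by BM25 and warm items by LTR on a shared 0..1 scale."""
    if not candidates:
        return []
    bm25 = _min_max([c.bm25_score for c in candidates])
    ltr = _min_max([c.ltr_score for c in candidates])
    scored = []
    for cand, b, l in zip(candidates, bm25, ltr):
        # An item uses the LTR score only once it has exceeded the threshold.
        is_warm = cand.impressions > threshold
        scored.append((l if is_warm else b, cand.item_id))
    # Highest score first; ties broken by item id for determinism.
    return [item_id for _, item_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

candidates = [
    Candidate("walnut-desk", bm25_score=2.0, ltr_score=0.9, impressions=5000),
    Candidate("pine-armchair", bm25_score=3.0, ltr_score=0.4, impressions=120),
    Candidate("oak-desk", bm25_score=9.0, ltr_score=0.1, impressions=0),
    Candidate("teak-table", bm25_score=4.0, ltr_score=0.1, impressions=3),
]
ranked = rank_with_cold_start_fallback(candidates)
```

Under pure LTR scoring the new "oak-desk" listing would sit at the bottom with the other zero-engagement items; with the fallback its strong term match lets it compete with established listings. The threshold and the normalization both need calibration against the real engagement distribution.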
Dense retrieval with semantic embeddings handles content cold-start similarly to BM25: a new document's embedding is computed at index time, and the document can be retrieved immediately via ANN search based on semantic similarity to queries. The cold-start problem for dense retrieval is the embedding model's coverage — if the embedding model was trained on a corpus that does not include the new content domain's vocabulary and concepts, the new domain's documents will have poor embeddings. Fine-tuning the embedding model on domain-specific data is required when the new domain is semantically distant from the training corpus. The background job infrastructure decision record documents the batch job execution framework; offline embedding computation for new documents — either batch-computing embeddings for documents that don't yet have them, or running nightly re-indexing for documents whose content has changed — requires a scheduled job that runs the embedding model over the unindexed or updated document set and writes the results to the vector index.
What the founding session records and what it omits
The search ranking decision is made in the sprint the search feature is implemented. It is almost always made by asking a practical question: "how do we make search work for our product?" The session answers this question efficiently and correctly — it produces a working search endpoint. What it does not record is what the ranking model prevents from improving, what happens when vocabulary diverges between users and authors, and what the ranking model requires to produce better results after launch.
Four types of AI chat sessions generate the gaps:
The "how do we add search to our product?" session. The team has content to search and needs to return results for user queries. They ask how to implement search. The session explains Elasticsearch or another search engine, how to define an index mapping with text fields, how to write a basic match query, and how to return results ordered by relevance. It recommends BM25 because it works without labeled training data and is the default ranking function for every major search engine. The session produces a working search endpoint. It does not ask: what is the vocabulary alignment gap between how users will describe their intent and how the content is written — and how large will that gap grow as the user base diversifies? What is the zero-result rate on a representative sample of actual user queries against the actual content? What is the relevance iteration mechanism after launch — if ranking quality is not good enough, what is the path to improving it and what does that path require? The session answers "how do we get search results?" without documenting that BM25's term-overlap retrieval sets a vocabulary ceiling that manual synonym maintenance cannot fully close, and that closing it requires embedding infrastructure that must be planned in advance. The search infrastructure decision record documents the search engine selection; the search relevance ADR must document the ranking model choice alongside the engine selection, because the ranking model determines whether the search engine can serve the product's actual user query distribution at launch and at scale.
The "why are users getting zero results?" session. The team has observed a high zero-result rate: users are searching for things the content covers, but the search is not retrieving them. They ask how to fix it. The session recommends adding synonyms to the index analyzer, enabling query expansion using the search engine's built-in synonym token filter, and potentially adding a "did you mean?" suggestion using the search engine's suggest API. These are correct short-term remediations. The session does not ask: is the zero-result problem concentrated in specific query types (long-tail natural language queries, semantic intent queries, cross-domain queries) that indicate a structural vocabulary alignment gap that synonyms cannot close? What is the rate at which new vocabulary gaps emerge: is this a one-time fixable list or an ongoing maintenance burden? Is the team on a trajectory where the synonym dictionary becomes unmaintainable? The session patches the immediate gap without asking whether the patch is a durable fix or a temporary measure, and without documenting the maintenance burden of synonym-based vocabulary alignment so that a future engineering team can make an informed decision about whether to invest in semantic search infrastructure. The caching strategy decision record documents the query result caching approach; zero-result rate monitoring requires distinguishing between queries for which the corpus contains no relevant content and queries for which relevant content exists but the retrieval step failed to return it. The distinction determines whether the fix addresses a content gap (no relevant content exists) or a retrieval gap (relevant content exists but was not retrieved).
The "how do we improve our search relevance?" session. The team has enough click-through data to observe that relevant content is not ranking at the top positions. They ask how to improve ranking quality. The session recommends learned-to-rank using the click signal as implicit labels, training a gradient-boosted model on (query, item, click) triples, and deploying it as a re-ranker over BM25 candidates. These are correct recommendations for the stated goal. The session does not ask: what is the cold-start ranking behavior for new items that have no click history — and does the product's content growth pattern produce frequent cold-start scenarios? What is the offline evaluation framework for measuring whether the new LTR model is better than the current BM25 — is there a labeled query set, an NDCG@k baseline, and a holdout evaluation that ensures the improvement is real and not an artifact of training on the evaluation distribution? What is the minimum click volume required per (query, item) pair before the click signal is statistically reliable as a relevance label — and are there query-item pairs where the click rate is so low that the label is noise? The session produces a working LTR model that improves average precision on head queries where click signal is dense. It does not document the cold-start risk for new items, the minimum traffic requirement for reliable label quality, or the evaluation framework for validating future ranking model updates — all of which become load-bearing when the team next iterates on the ranking model. The real-time architecture decision record documents the event streaming infrastructure; real-time behavioral signal ingestion for LTR features (session-level click and purchase events processed within minutes for recency-sensitive features) requires streaming event infrastructure that is distinct from the daily batch feature computation documented in the feature store decision record.
The "how do we make search understand what users mean?" session. The team has recognized that synonym patching is not scaling and that users are expressing queries semantically rather than lexically. They ask how to implement semantic search. The session recommends dense retrieval with a bi-encoder: embed queries and documents into a shared vector space, build an ANN index (FAISS, HNSW) over document embeddings, retrieve the top-k nearest neighbors by cosine similarity, and optionally re-rank with a cross-encoder for higher accuracy. These are correct recommendations for the stated goal. The session does not ask: does the existing search engine support native vector search, or does this require migrating to a new search engine version or a dedicated vector database? What is the latency impact of embedding computation at query time — does the embedding inference latency fit within the end-to-end search SLA, or does it require a co-located embedding service with GPU inference? How does the semantic search integrate with the existing BM25 retrieval — is this a replacement or a hybrid (RRF, linear combination), and what is the calibration procedure for the hybrid score weights? What is the offline evaluation set for measuring whether semantic search improves over BM25 on the actual query distribution — and does the evaluation set include the long-tail and zero-result queries where semantic search should provide the biggest gains? The session recommends a technically correct solution without surfacing the infrastructure requirements, the migration cost from the existing BM25-only index, or the evaluation framework needed to validate the improvement. 
The API schema design decision record documents the search API response contract; adding semantic reranking to an existing search endpoint changes the ranking model without changing the API contract — but if the semantic ranker changes the result set significantly (retrieving items with no BM25 term overlap), the search API's existing filtering behavior (facets, category filters, price range filters applied as post-filters over BM25 results) may produce unexpected interactions with the new retrieval path that must be explicitly tested against the API contract.
The WhyChose extractor surfaces the founding "how do we add search?" session, the first "why are users getting zero results?" session, the first "how do we improve ranking quality?" session, and the first "how do we add semantic search?" session from AI chat history. The search relevance ADR converts the implicit ranking model choice in those sessions — Elasticsearch default BM25, synonym dictionary, LTR model, dense retrieval — into documented vocabulary ceiling estimates, cold-start contracts, labeling infrastructure requirements, and iteration velocity parameters, so that the next engineer who observes a relevance problem can check whether it is a configuration issue within the current ranking model's capability, or a structural limitation that requires a different ranking architecture, before spending six weeks on a synonym dictionary that cannot close a semantic gap.
The five sections of a search relevance ADR
Section 1: Ranking model selection and query understanding ceiling. Document the ranking model: BM25 with default or tuned parameters, learned-to-rank with the feature set and training data source, dense retrieval with the embedding model and ANN index type, or a hybrid architecture combining two or more of these. For each retrieval stage, document the query understanding ceiling — the types of queries the model can handle and the types it cannot. For BM25, document the synonym and query expansion approach: the synonym dictionary size and ownership, the update cadence, and the observable gap categories that synonyms cannot close (paraphrase gaps, cross-domain gaps, intent queries). For dense retrieval, document the embedding model: the architecture (bi-encoder, cross-encoder), the training corpus and domain coverage, the embedding dimension, the ANN index type (HNSW, IVF, ScaNN), and the recall@k on a held-out query sample from the product's actual query distribution. Document the vocabulary alignment gap estimate: the percentage of representative user queries that retrieve no relevant documents under the current ranking model, stratified by query type (head queries, torso queries, tail queries). The search infrastructure decision record documents the search engine selection and field schema; the search relevance ADR must document the ranking model's requirements from the index schema — the fields used by BM25 scoring, the vector fields required for ANN retrieval, and the stored fields required by LTR feature computation at query time.
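The vocabulary alignment gap estimate can be computed from a judged query sample. The sketch below assumes each sample carries a stratum label, the ids the ranking model retrieved, and the ids a human judged relevant; the tuple layout and function name are hypothetical.

```python
from collections import defaultdict

def vocabulary_gap_by_stratum(samples):
    """Share of queries per stratum that retrieved no relevant document.

    samples: iterable of (stratum, retrieved_ids, relevant_ids).
    Queries with no judged-relevant document are content gaps, not
    retrieval gaps, so they are excluded from the estimate.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for stratum, retrieved, relevant in samples:
        if not relevant:
            continue
        totals[stratum] += 1
        if not set(retrieved) & set(relevant):
            misses[stratum] += 1
    return {stratum: misses[stratum] / totals[stratum] for stratum in totals}

# Made-up sample: two head queries, two tail queries, one content gap.
samples = [
    ("head", ["a", "b"], ["a"]),
    ("head", ["c"], ["c"]),
    ("tail", [], ["d"]),
    ("tail", ["e"], ["e"]),
    ("tail", [], []),
]
gap = vocabulary_gap_by_stratum(samples)
```

Recording this number per stratum at decision time gives later engineers a baseline for telling a configuration regression apart from the ranking model's structural ceiling.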
Section 2: Relevance iteration protocol and labeling infrastructure. Document the relevance signal source: click-through rate from search result interactions (implicit labels), explicit relevance judgments from human annotators (explicit labels), A/B test outcomes measured on downstream business metrics (purchasing, session depth, return visits), or a combination. For click-through labels, document the debiasing approach: position bias (documents ranked higher receive more clicks regardless of relevance) must be corrected before using click rates as relevance labels; the standard approaches are inverse propensity scoring (IPS), pairwise debiasing (using click-skip pairs to extract relative relevance judgments), or a separate position-bias estimation experiment. Document the offline evaluation framework: the query set used for offline evaluation (size, sampling strategy, staleness policy), the relevance metric (NDCG@5, MRR, MAP), the per-query baseline and the minimum detectable improvement that justifies deploying a new ranking model, and the holdout procedure that ensures evaluation queries are not in the LTR training set. Document the deployment pipeline for ranking model updates: the offline evaluation gate (the new model must achieve a minimum NDCG@5 improvement over the current model on the held-out set before it is eligible for A/B testing), the A/B test configuration (traffic percentage, duration, primary metric, guardrail metrics), and the rollback procedure if the A/B test reveals a regression in business metrics after the offline evaluation showed improvement. 
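The offline evaluation gate can be sketched as an NDCG@k comparison between the current and candidate models. This sketch uses linear gain and computes the ideal ordering from the retrieved labels only, both simplifying assumptions; the 0.01 minimum improvement is a placeholder, not a recommended value.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """NDCG@k for one query; relevances are labels in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def passes_offline_gate(current, candidate, k=5, min_improvement=0.01):
    """Compare mean NDCG@k of two models on the same held-out queries.

    current / candidate: dict mapping query -> graded labels in the order
    each model ranked the results. Both must cover the same queries.
    """
    queries = sorted(current)
    mean_current = sum(ndcg_at_k(current[q], k) for q in queries) / len(queries)
    mean_candidate = sum(ndcg_at_k(candidate[q], k) for q in queries) / len(queries)
    return mean_candidate - mean_current >= min_improvement

# Hypothetical held-out judgments for a single query.
current_run = {"vintage oak desk": [0, 1]}
candidate_run = {"vintage oak desk": [1, 0]}
eligible = passes_offline_gate(current_run, candidate_run)
```

A model that passes this gate is only eligible for A/B testing; the online guardrail metrics still decide rollout.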
The database vendor decision record documents the corpus database; the click-through logging table that captures (query, position, item_id, clicked, session_id, timestamp) must be in a database accessible to the offline evaluation pipeline — storing it in the application database is sufficient at low volume but typically requires migration to the data warehouse at the traffic volumes that produce adequate LTR training data.
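To make the logging table and the position-bias correction concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The column names follow the tuple listed above; the table name `click_log`, the `PROPENSITY` values, and the `ips_relevance` function are assumptions for illustration. Real propensities would come from a position-bias estimation experiment, and this is only one simple form of inverse propensity scoring.

```python
import sqlite3

# Assumed examination propensities per rank position (hypothetical values).
PROPENSITY = {1: 1.0, 2: 0.5, 3: 0.25}

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE click_log (
           query TEXT, position INTEGER, item_id TEXT,
           clicked INTEGER, session_id TEXT, timestamp TEXT)"""
)
rows = [
    ("invoice", 1, "a", 1, "s1", "2024-01-01T10:00:00"),
    ("invoice", 2, "b", 1, "s1", "2024-01-01T10:00:00"),
    ("invoice", 1, "a", 0, "s2", "2024-01-01T11:00:00"),
    ("invoice", 2, "b", 0, "s2", "2024-01-01T11:00:00"),
]
conn.executemany("INSERT INTO click_log VALUES (?, ?, ?, ?, ?, ?)", rows)


def ips_relevance(conn, propensity):
    """Estimate per-(query, item) relevance as the mean of clicked / propensity."""
    sums, counts = {}, {}
    cursor = conn.execute("SELECT query, position, item_id, clicked FROM click_log")
    for query, position, item_id, clicked in cursor:
        key = (query, item_id)
        sums[key] = sums.get(key, 0.0) + clicked / propensity[position]
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


estimates = ips_relevance(conn, PROPENSITY)
print(estimates)
```

In this sample both items have the same raw click rate, but item `b` earned its clicks from a lower position, so its debiased estimate is higher.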
Section 3: Cold-start handling and the feature availability contract. Document the cold-start definition: the engagement signal thresholds below which items, creators, or content categories are considered cold-start for the primary ranking model. Document the cold-start fallback ranking model: the ranking function applied to cold-start items (BM25 with content-only features, a content quality score derived from metadata, a recency boost), the feature set for the fallback model and how it avoids behavioral feature dependencies, and the transition mechanism (the engagement threshold at which an item graduates from the fallback model to the primary model). Document the cold-start monitoring: the percentage of total impressions served by the cold-start fallback (too high indicates the primary model is not receiving enough traffic for adequate engagement signal accumulation; too low indicates that new content is not being identified as cold-start and is being ranked by the primary model with zero-value features). Document the new category launch protocol: when a new content category is launched, which ranking model serves the new category's items, what the expected timeline is for the category to accumulate enough engagement signal for the primary LTR model, and what manual interventions (boosting new category items temporarily, curating a seed set of highly relevant items to receive initial traffic) are available during the cold-start period. The observability strategy decision record documents the metrics monitoring infrastructure; cold-start monitoring requires per-item impression counts and per-item engagement rates accessible in near-real-time for the engagement threshold transition logic — this is typically implemented as a materialized view or an aggregation pipeline over the click-through log table, not as part of the search engine's built-in ranking signal.
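The graduation threshold and the fallback impression share described above can be sketched in a few lines. This is an illustrative sketch only: `ItemStats`, `ranker_for`, `fallback_impression_share`, and the `MIN_IMPRESSIONS` value of 100 are hypothetical, and a real system would read the counts from the aggregation pipeline over the click-through log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemStats:
    """Lifetime engagement counters for one item (hypothetical shape)."""
    item_id: str
    impressions: int
    clicks: int


# Assumed graduation threshold; the ADR should record the real value.
MIN_IMPRESSIONS = 100


def ranker_for(item, min_impressions=MIN_IMPRESSIONS):
    """Route an item to the fallback ranker until it has enough engagement signal."""
    return "primary" if item.impressions >= min_impressions else "fallback"


def fallback_impression_share(served):
    """Fraction of served impressions that were ranked by the fallback model.

    `served` is a list of (ItemStats, impressions_served_in_window) pairs.
    """
    total = sum(n for _, n in served)
    fallback = sum(n for item, n in served if ranker_for(item) == "fallback")
    return fallback / total if total else 0.0


# Illustrative window: an established item and a newly published one.
served = [(ItemStats("old", 5000, 400), 90), (ItemStats("new", 12, 1), 10)]
share = fallback_impression_share(served)
print(share)
```

Tracking this share over time gives the monitoring signal the section asks for, with alert bounds on both the high and the low side.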
Section 4: Query latency budget and two-stage retrieval architecture. Document the search latency SLA: the end-to-end response time from the client's query submission to the ranked result set being returned, measured at p50 and p99 under peak query volume. Document the latency budget allocation across the retrieval pipeline stages. For a two-stage retrieval architecture (candidate retrieval → re-ranking), the candidate retrieval stage (BM25 inverted index lookup or ANN vector search, returning top-k candidates) must complete within its allocated share of the latency budget; the re-ranking stage (LTR feature computation and scoring, or neural cross-encoder inference) must complete within the remaining budget. Document the candidate set size k: too small and the re-ranker cannot surface relevant items that ranked outside the top-k in the retrieval stage; too large and the re-ranking latency budget is exceeded. For BM25 retrieval, k=100–500 is typical; for ANN retrieval, k=50–200 is typical; for neural re-ranking (cross-encoder inference per candidate), k=10–50 is the practical ceiling given the per-candidate inference cost. Document the index size growth projection and its latency implications: BM25 retrieval latency typically grows sub-linearly with index size because of the inverted index structure (term lookup is roughly logarithmic in the number of unique terms, and scoring is linear in the number of matching documents, not in the corpus size); ANN retrieval latency is more sensitive to index size and, depending on the index type, may require periodic re-indexing as new documents are added, with the re-indexing job's I/O affecting concurrent query latency if they share the same infrastructure.
The caching strategy decision record documents the query result caching approach; caching search results for frequent head queries reduces the fraction of queries that incur the full retrieval + re-ranking latency, but the cache TTL must be calibrated against the rate of new content indexing — a cache that is valid for 60 minutes during high content ingestion periods will serve stale results that exclude recently indexed documents.
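The relationship between the latency budget and the candidate set size k can be worked through with simple arithmetic. The sketch below is a rough sizing aid, not a performance model: the function name and the millisecond figures are hypothetical, and it assumes re-ranking cost is linear in k with no batching.

```python
def max_rerank_candidates(total_budget_ms, retrieval_ms, overhead_ms, per_candidate_ms):
    """Largest k whose re-ranking cost fits in what remains of the latency budget.

    Assumes re-ranking cost is linear in k with no batching, which is a
    simplification; measure real per-candidate cost at p99 before relying on it.
    """
    remaining = total_budget_ms - retrieval_ms - overhead_ms
    if remaining <= 0 or per_candidate_ms <= 0:
        return 0
    return int(remaining // per_candidate_ms)


# Illustrative numbers: 200 ms p99 budget, 40 ms retrieval, 20 ms network and
# serialization overhead, 4 ms of cross-encoder inference per candidate.
k = max_rerank_candidates(200, 40, 20, 4)
print(k)
```

With these assumed figures the re-ranker can afford 35 candidates, which falls inside the k=10–50 range given above for neural re-ranking.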
Section 5: Search quality observability — zero-result rate, CTR, and NDCG monitoring. Document three monitoring surfaces. The first is retrieval coverage: the zero-result rate (the fraction of queries that return no results), stratified by query length (zero-result rate is typically higher for longer, more specific queries), query category (if the product has categorized content, zero-result rates may differ across categories), and query vocabulary (queries whose terms are not in the index are a BM25-only problem; queries whose terms are in the index but whose semantics are not matched are a semantic search problem). A zero-result rate above 1–2% for a mature product with adequate content coverage typically indicates a retrieval failure, not a content gap, and should trigger investigation of the vocabulary alignment gap. The second monitoring surface is engagement quality by ranking position: click-through rate at position 1, 3, 5, and 10 (position CTR curves provide a proxy for relevance — a position-1 CTR significantly below the expected rate for the query category indicates that the top-ranked result is not the most relevant result). Position CTR should be disaggregated by query type (head queries vs tail queries) and by content category to identify the query strata where ranking quality is weakest. The third monitoring surface is offline NDCG on the held-out evaluation query set, measured before and after each ranking model update. A sustained degradation in NDCG@5 on the evaluation set that is not accompanied by a degradation in the position CTR metric indicates that the evaluation query set has diverged from the actual query distribution and needs to be refreshed — the most common cause of this divergence is that the evaluation set was built from a historical query sample and the product's user query distribution has shifted (new features, new user populations, seasonal patterns). 
The observability strategy decision record documents the metrics infrastructure; search quality metrics require a data pipeline that joins the click-through log (which captures user behavior) with the search result log (which captures what was served and at what position) to compute per-query CTR, per-position CTR, and per-query NDCG — none of which are available from the search engine's built-in logging without this join.
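The join between the search result log and the click-through log can be sketched with plain Python structures. This is a minimal illustration under assumptions: the record shapes, the shared `search_id` key, and the function names are hypothetical, and a production pipeline would run the same join in the data warehouse.

```python
from collections import defaultdict

# Hypothetical log shapes: the search result log records what was served,
# the click log records what the user did. Both share a search_id.
served_log = [
    {"search_id": "s1", "query": "reset password", "results": ["a", "b", "c"]},
    {"search_id": "s2", "query": "bill a client", "results": []},
    {"search_id": "s3", "query": "api key", "results": ["d", "e"]},
    {"search_id": "s4", "query": "invoice", "results": ["f", "a"]},
]
click_log = [
    {"search_id": "s1", "item_id": "a"},
    {"search_id": "s3", "item_id": "e"},
    {"search_id": "s4", "item_id": "f"},
]


def zero_result_rate(served):
    """Fraction of searches that returned no results."""
    return sum(1 for s in served if not s["results"]) / len(served)


def position_ctr(served, clicks):
    """Click-through rate per rank position, joining served results to clicks."""
    clicked = {(c["search_id"], c["item_id"]) for c in clicks}
    impressions = defaultdict(int)
    hits = defaultdict(int)
    for s in served:
        for position, item_id in enumerate(s["results"], start=1):
            impressions[position] += 1
            if (s["search_id"], item_id) in clicked:
                hits[position] += 1
    return {p: hits[p] / impressions[p] for p in sorted(impressions)}


zrr = zero_result_rate(served_log)
ctr = position_ctr(served_log, click_log)
print(zrr, ctr)
```

Neither metric can be computed from the click log alone, because the click log does not record the results that were shown and not clicked.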
None of these five sections appear in the founding "how do we add search?" session. The session records that Elasticsearch was chosen, that the match query returns results for the demo queries, and that BM25 was used as the ranking function. It does not document that BM25's term-overlap ceiling will require synonym maintenance that grows without bound as user vocabulary diversifies, that the LTR model trained on engagement features will rank all new content at the bottom of results when new categories are launched, that improving ranking quality after launch requires a click-through logging schema that must be designed before the first query is served, or that transitioning from BM25 to semantic dense retrieval after the fact requires an index migration that is not incrementally achievable on an existing BM25-only index. The search relevance ADR converts the implicit ranking model choice — Elasticsearch default BM25, synonym dictionary, LTR model — into documented query understanding ceilings, cold-start contracts, labeling infrastructure requirements, and latency budget allocations, so that the next engineer who observes a relevance failure can determine whether it is within the current ranking model's remediable capability or whether it signals an architectural boundary that requires planning a ranking model transition before the remediation can begin. The WhyChose extractor recovers the founding search session, the first synonym-dictionary session, the first LTR model session, and the first semantic search session from AI chat history and surfaces them as the decision record chain that determines whether a team's search is one configuration change away from better relevance or six months away from a ranking infrastructure migration.
FAQs
When should a team use BM25 versus semantic dense retrieval for search ranking?
BM25 is appropriate when the corpus vocabulary is controlled (users and authors share the same terminology), queries are predictable and benefit from exact-term matching, and the team has no labeled training data or click-through signal to train a learned model. BM25 is also the right starting model when the team cannot afford embedding computation, vector indexing, and ANN search infrastructure, and when search is not the core product experience where marginal relevance improvements produce large business value.
Semantic dense retrieval is required when the vocabulary alignment gap is persistent and large: when users regularly use terminology that differs from the content terminology, when queries express intent rather than keywords, or when the corpus covers multiple domains where term overlap is structurally low. The critical distinction is that BM25's vocabulary gap can be patched with synonym dictionaries, but the maintenance burden grows in proportion to vocabulary diversity and never reaches complete coverage. Dense retrieval handles previously unseen vocabulary gaps without per-term maintenance, to the extent that the embedding model's training data covers the domain, but it requires embedding infrastructure that must be planned and budgeted before the search feature is first implemented. Adding it retroactively to an existing BM25-only index requires a full index migration.
What is the cold-start problem in search ranking and how does the ranking model choice affect it?
The cold-start problem in search ranking is the inability to rank new content accurately when the ranking model relies on behavioral engagement signals (click-through rate, purchase rate, dwell time) that don't yet exist for newly indexed items. A learning-to-rank (LTR) model trained on engagement features cannot distinguish between a new high-quality item and a new low-quality item: both have behavioral feature values of zero, and because the trained model typically weights behavioral engagement more heavily than content quality, both score below established items regardless of their content.
BM25 does not have this problem: term frequency and IDF are intrinsic to document content and are available at index time for every new document. Dense retrieval handles cold-start similarly — a new document's embedding is computed at index time, enabling semantic retrieval from first indexing. The cold-start problem is structural to LTR models with behavioral feature dependencies, and the remediation (a fallback model for items below an engagement threshold) must be planned at the time the LTR model is chosen, not discovered when the first new content category launches against a behavioral-feature-dependent ranker.
What should a search relevance ADR document that a general search infrastructure decision does not?
A general search infrastructure decision documents the search engine, the indexing pipeline, and the field schema. A search relevance ADR must document the ranking model's structural properties: the query understanding ceiling (BM25's term-overlap limit, the synonym dictionary maintenance burden, or the embedding model's semantic distance ceiling); the relevance iteration protocol (the click-through logging schema, the offline evaluation query set, the NDCG@k metric, and the A/B test framework for ranking model updates); the cold-start handling contract (the engagement threshold below which items use a fallback model, and the fallback ranking function); the two-stage retrieval latency budget (candidate retrieval step latency, re-ranking step latency, and candidate set size k); and the search quality observability SLA (zero-result rate by query stratum, position CTR curves, and offline NDCG monitoring).
None of these appear in the session that first asked how to add search. All of them determine whether the team can improve relevance after launch without rebuilding the ranking infrastructure, serve new content categories without cold-start penalties, and detect relevance regressions before users report them through the support queue.
Further reading
- The search infrastructure decision record — search engine selection and the indexing pipeline upstream of ranking
- The search architecture decision record — index synchronization and the document pipeline that feeds the ranking layer
- The ML model serving decision record — the inference serving infrastructure required by neural reranking models
- The feature store decision record — behavioral feature computation for LTR ranking models
- The observability strategy decision record — the metrics infrastructure that search quality monitoring relies on
- The logging strategy decision record — click-through logging for relevance labeling and NDCG evaluation
- The WhyChose open-source extractor — recover the founding search session from your AI chat history