Part 3 of 3
🤖 Ghostwritten by Claude Opus 4.5 · Curated by Tom Hundley
Your RAG system works beautifully when someone asks "What's the refund policy for damaged items?" The embedding model captures the semantic intent, retrieves the right policy document, and the LLM generates a helpful response.
Then someone searches for "SKU-7749-BLK" and the system returns nothing useful.
This is not a bug. It is a fundamental limitation of how embedding models work.
Embedding models like text-embedding-3-small or multilingual-e5-large are trained on natural language. They learn relationships between words and concepts by processing millions of sentences where those words appear in meaningful contexts.
Product codes have no such context. "SKU-7749-BLK" appears in training data as a random alphanumeric string surrounded by other words. The model cannot learn that it represents a specific black widget in your inventory. The embedding it produces is essentially noise: a vector that does not meaningfully relate to anything.
The same problem affects other tokens that carry no natural-language context: part numbers, error codes, serial numbers, internal acronyms, and ticket IDs.
Traditional keyword search (TF-IDF, BM25) excels at exact matching but cannot bridge vocabulary gaps. A query for "athletic footwear" will not match documents that only mention "shoes." A question about "cancellation procedures" might miss the document titled "How to end your subscription."
This vocabulary mismatch problem is well-documented in information retrieval research. Users do not always use the same terms as document authors, and without semantic understanding, relevant results go unfound.
The most widely adopted method for combining search results is Reciprocal Rank Fusion (RRF), introduced by Cormack, Clarke, and Büttcher at SIGIR 2009.
The elegance of RRF lies in what it ignores: raw scores. Different retrieval systems produce scores on incompatible scales. BM25 scores are unbounded and depend on corpus statistics. Cosine similarity ranges from -1 to 1. Attempting to normalize and combine these scores is fragile and dataset-dependent.
RRF sidesteps this entirely by operating only on ranks.
For each document, sum its contribution from each retrieval system:
RRF_score(d) = Σ 1 / (k + rank_i(d)), summed over each retrieval system i

Where:

- rank_i(d) is the document's position in the results from system i (1-indexed)
- k is a constant, typically set to 60

A document ranked 1st by vector search contributes 1/(60+1) ≈ 0.0164. The same document ranked 10th by keyword search contributes 1/(60+10) ≈ 0.0143. Its combined RRF score is ≈ 0.0307.
The k value of 60 was determined experimentally in the original paper and has proven robust across diverse datasets. Research indicates that RRF performance is not critically sensitive to the choice of k, making it a reliable default.
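The formula translates to a few lines of Python. Here is a minimal sketch, where each input is a ranked list of document IDs (the IDs below are hypothetical):

def rrf_fuse(result_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "doc-a" is 1st in the vector results and 2nd in the keyword results,
# so it outranks documents that appear in only one list
fused = rrf_fuse([["doc-a", "doc-b"], ["doc-c", "doc-a"]])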
The power of rank-based fusion is that ranks are directly comparable across systems even when their scores are not, so agreement between retrievers surfaces naturally without any normalization step.
RRF has become the default fusion method in major search platforms including Elasticsearch, OpenSearch, and Azure AI Search.
Weaviate implements hybrid search by running BM25 and vector search in parallel, then combining results. The alpha parameter controls the weighting:
- alpha = 0.0: pure keyword (BM25) search
- alpha = 0.5: equal weight to both (default)
- alpha = 1.0: pure vector search

# Weaviate hybrid search example
response = collection.query.hybrid(
    query="customer refund policy ABC-12345",
    alpha=0.5,  # balanced hybrid
    limit=10
)

Weaviate offers two fusion algorithms: Ranked Fusion, which applies the RRF formula, and Relative Score Fusion, which normalizes and sums the raw scores and is the default in recent versions.
The max_vector_distance parameter allows filtering results where the vector similarity is too low, but no equivalent exists for BM25 scores since they are unbounded.
Elasticsearch combines lexical and semantic search through its retriever abstraction:
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": { "content": "refund policy" }
            }
          }
        },
        {
          "knn": {
            "field": "embedding",
            "query_vector": [...],
            "k": 10
          }
        }
      ],
      "rank_constant": 60,
      "rank_window_size": 100
    }
  }
}

Elasticsearch 8.18 and 9.0 introduced weighted RRF, allowing different importance levels for each retriever. This provides more control than simple RRF when you have prior knowledge about which signal matters more for your use case.
Pinecone takes a different architectural approach, combining sparse and dense vectors within the same index. Each record contains both:
# Pinecone upsert with both representations
index.upsert(vectors=[{
"id": "doc-123",
"values": dense_embedding, # semantic
"sparse_values": {
"indices": [102, 5789, 23001],
"values": [0.8, 0.4, 0.6]
}
}])

Pinecone recommends generating sparse vectors using either their hosted pinecone-sparse-english-v0 model or traditional BM25/SPLADE encoding. The alpha weighting is applied by scaling the query's dense and sparse vectors before the search rather than by a separate fusion step.
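As a sketch of that query-time weighting, following the pattern in Pinecone's documentation (the helper name and the query_dense/query_sparse inputs here are our own):

def hybrid_scale(dense, sparse, alpha):
    """Scale query vectors: alpha=1.0 is pure dense, alpha=0.0 is pure sparse."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(query_dense, query_sparse, alpha=0.5)
results = index.query(vector=dense_q, sparse_vector=sparse_q, top_k=10)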
Important caveat: Pinecone's serverless indexes select initial candidates based only on dense vectors. This can affect accuracy when dense and sparse representations are not correlated. For maximum flexibility, Pinecone now recommends separate dense and sparse indexes with reranking to combine results.
For teams already using PostgreSQL, hybrid search can be implemented without additional infrastructure:
-- Hybrid search combining full-text and vector similarity
WITH keyword_results AS (
    SELECT id, ts_rank(to_tsvector('english', content), query) AS kw_score
    FROM documents, plainto_tsquery('english', 'refund policy') query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY kw_score DESC
    LIMIT 50
),
vector_results AS (
    -- query_embedding is a bound parameter: the embedding of the search query
    SELECT id, 1 - (embedding <=> query_embedding) AS vec_score
    FROM documents
    ORDER BY embedding <=> query_embedding
    LIMIT 50
)
SELECT
    COALESCE(k.id, v.id) AS id,
    -- Simple weighted combination (not RRF)
    COALESCE(k.kw_score, 0) * 0.3 + COALESCE(v.vec_score, 0) * 0.7 AS hybrid_score
FROM keyword_results k
FULL OUTER JOIN vector_results v ON k.id = v.id
ORDER BY hybrid_score DESC
LIMIT 10;

This approach trades sophistication for simplicity. For true RRF in PostgreSQL, you would need to compute ranks and apply the fusion formula, which is straightforward but more verbose; the sketch below shows one way.
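As a sketch, extend the WITH clause of the query above with ranked versions of each result set and replace its final SELECT (same k = 60 as before):

-- Rank each result set, then fuse with 1/(k + rank)
, keyword_ranked AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY kw_score DESC) AS rnk
    FROM keyword_results
),
vector_ranked AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY vec_score DESC) AS rnk
    FROM vector_results
)
SELECT
    COALESCE(k.id, v.id) AS id,
    COALESCE(1.0 / (60 + k.rnk), 0) + COALESCE(1.0 / (60 + v.rnk), 0) AS rrf_score
FROM keyword_ranked k
FULL OUTER JOIN vector_ranked v ON k.id = v.id
ORDER BY rrf_score DESC
LIMIT 10;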
The optimal weighting between keyword and vector signals depends entirely on your data and queries. There is no universal best value.
Research consistently shows that simply combining dense and sparse search without tuning can underperform pure approaches. The gains from hybrid search require thoughtful optimization.
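A minimal sketch of that tuning, assuming a search(query, alpha) helper and a labeled test set mapping queries to their known-relevant document IDs (both hypothetical):

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of known-relevant documents found in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def tune_alpha(test_set, search, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search alpha against the labeled test set; return the best value."""
    best_alpha, best_recall = None, -1.0
    for alpha in alphas:
        avg = sum(
            recall_at_k(search(q, alpha), relevant)
            for q, relevant in test_set.items()
        ) / len(test_set)
        if avg > best_recall:
            best_alpha, best_recall = alpha, avg
    return best_alpha, best_recall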
Hybrid search retrieves candidates. A reranker evaluates them more carefully.
The retrieval stage prioritizes speed and recall: get a reasonable set of candidates from millions of documents. The reranking stage prioritizes precision: for a small candidate set, invest more compute to order them optimally.
Cross-encoder rerankers like Cohere Rerank or BGE-Reranker process the query and each candidate together, enabling much richer interaction than independent embeddings allow; late-interaction models like ColBERT offer a middle ground between this and pure bi-encoder retrieval. This second stage can deliver accuracy gains of 15-35% or more beyond what retrieval alone achieves.
The two-stage pattern: retrieve a broad candidate set (say, the top 50-100) with fast hybrid search, then rerank those candidates with a cross-encoder and keep only the best 5-10 for the final response.
This adds latency (typically 50-200ms for the reranking step) but substantially improves result quality for production systems where accuracy matters.
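A sketch of the second stage using Cohere's rerank endpoint (the model name and response shape may vary by SDK version; hybrid_search is an assumed helper returning document texts):

import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

query = "refund policy for SKU-7749-BLK"
candidates = hybrid_search(query, limit=50)  # assumed: returns list of texts

response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidates,
    top_n=10,
)
# Each result carries the index of the original candidate and a relevance score
top_docs = [candidates[r.index] for r in response.results]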
Each search component adds latency: the keyword query, the vector (ANN) query, the fusion step, and especially any reranking stage. For user-facing search with a 200ms latency budget, you may need to skip reranking or limit the candidate set size. For RAG pipelines where an LLM will process the results anyway, the additional latency is often acceptable.
Hybrid search requires maintaining two representations of each document. Updates must propagate to both: the keyword index and the vector index must reflect every insert, update, and delete, or results will silently diverge.
This synchronization is handled automatically by integrated platforms like Weaviate or Elasticsearch. For separate systems (e.g., PostgreSQL full-text + Pinecone vectors), you need explicit coordination to prevent drift.
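For the separate-systems case, the write path might look like the following sketch (the pg and vector_index clients and the embed function are all assumed):

def update_document(doc_id, new_content):
    """Propagate a content change to both representations."""
    embedding = embed(new_content)  # assumed embedding function

    # Keyword side: the PostgreSQL full-text index follows the row update
    pg.execute(
        "UPDATE documents SET content = %s WHERE id = %s",
        (new_content, doc_id),
    )

    # Vector side: upsert the fresh embedding (assumed vector-store client)
    vector_index.upsert(vectors=[{"id": doc_id, "values": embedding}])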
What happens when one search backend fails or times out? If the vector store is unreachable, can you still serve keyword results, and vice versa?
Design your hybrid system to degrade gracefully rather than fail completely.
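One pattern, reusing the rrf_fuse sketch from earlier (the search helpers and exception type here are hypothetical):

def resilient_search(query, limit=10):
    """Fall back to keyword-only results if the vector backend is down."""
    kw_results = keyword_search(query, limit=50)      # assumed helper
    try:
        vec_results = vector_search(query, limit=50)  # assumed helper
    except VectorBackendError:                        # hypothetical exception
        vec_results = []

    if not vec_results:
        return kw_results[:limit]  # graceful degradation to pure keyword
    fused = rrf_fuse([kw_results, vec_results])
    return [doc_id for doc_id, score in fused[:limit]]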
Combine keyword and vector search for better RAG retrieval.
Hybrid search combines vector (semantic) and keyword (lexical) retrieval to get the best of both approaches. Use Reciprocal Rank Fusion (RRF) to merge rankings without normalizing scores. Tune the alpha/weight parameter based on your query patterns: favor keywords for identifiers and codes, favor vectors for conceptual queries. Consider adding a reranker for the final precision boost. Monitor and adjust as query patterns evolve.
To implement production hybrid search:
1. Choose your platform: Weaviate, Elasticsearch, and Pinecone all offer integrated hybrid search. PostgreSQL + pgvector works for simpler needs.
2. Run both searches: Execute BM25 (keyword) and kNN (vector) searches in parallel. Retrieve 2-5x more candidates than you need.
3. Fuse with RRF: Apply the formula 1/(k+rank) with k=60 for each document from each search. Sum scores. Sort by combined score.
4. Tune the weighting: Start at 50/50. If queries contain identifiers, shift toward keywords (alpha 0.3-0.4). If queries are conceptual, shift toward vectors (alpha 0.6-0.7). Use a test set to validate.
5. Add reranking (optional): For the top 20-50 candidates, apply a cross-encoder reranker like Cohere Rerank or BGE-Reranker. Expect 50-200ms of additional latency.
6. Monitor and iterate: Track query patterns, click-through rates, and retrieval metrics. Adjust weights as your corpus and users evolve.
Common pitfall: Deploying untuned hybrid search and expecting automatic improvements. The combination only helps when properly calibrated to your data.
Use this framework to decide your retrieval architecture:
| Query Type | Corpus Type | Recommendation |
|---|---|---|
| Conceptual, natural language | Natural language documents | Vector search may suffice |
| Contains identifiers/codes | Mixed content | Hybrid search essential |
| Precise terminology | Technical documentation | Hybrid, keyword-weighted |
| Multilingual | Any | Hybrid, vector-weighted |
| High accuracy requirements | Any | Hybrid + reranking |
For most enterprise RAG systems, hybrid search is not optional: it is the baseline for reliable retrieval. The question is not whether to implement it, but how to tune it for your specific needs.
Building enterprise search or RAG systems?
Hybrid search is just one piece of production AI architecture. In our Executive AI Enablement Boot Camp ($5,000), we build complete retrieval pipelines tailored to your data and use cases, not just theory.
Schedule now: essrocks.io
Elegant Software Solutions — "That's AI done right."