Reranking & Hybrid Retrieval
Learn why two-stage retrieval and keyword+vector fusion improve relevance in real-world RAG systems.
What Is Reranking & Hybrid Retrieval?
Think of retrieval like hiring from a huge pile of resumes.
- First pass: a recruiter quickly scans thousands of resumes and picks a short list that looks promising.
- Second pass: a hiring manager reads only that short list carefully and picks the true top candidates.
That is the core idea of two-stage retrieval.
In many AI systems, the first pass is fast search over a large corpus. It often gets you “pretty relevant” results. But when your question is nuanced, “pretty relevant” is not enough. You need a second pass that understands the query-document pair at a deeper level. That second pass is reranking.
Now add hybrid retrieval: instead of relying on only keyword search or only vector search, you combine both. Keyword search (like BM25) is great for exact terms, product codes, and rare names. Vector search is great for semantic similarity and paraphrases. Together, they usually outperform either one alone.
Technical definition:
- Hybrid retrieval combines lexical and semantic retrieval signals to produce a better candidate set.
- Reranking applies a stronger relevance model to that candidate set and reorders it so the final top-k is actually useful.
Why Does It Matter?
If retrieval is weak, your assistant fails even when the generation model is strong. RAG systems do not hallucinate only because of “model weakness” - they also hallucinate when retrieval supplies weak context.
Reranking and hybrid retrieval matter because they improve:
- Answer quality: Better top documents mean better-grounded answers.
- Precision at top-k: The first few chunks matter most because context windows are limited.
- Robustness: Hybrid methods handle both exact lookups and fuzzy semantic questions.
- Trust: Users see fewer wrong citations and fewer “almost right” responses.
In production, this often has a measurable effect on business metrics: higher answer acceptance, fewer support escalations, and less manual correction.
How It Works
A practical pipeline is:
1. Build two retrieval channels
   - Lexical index (BM25 or equivalent) over tokenized text.
   - Vector index over embeddings.
2. Run both channels for each query
   - Lexical retrieval returns documents with exact token overlap.
   - Vector retrieval returns semantically similar chunks.
3. Fuse the candidate lists
   A common method is Reciprocal Rank Fusion (RRF). Intuition: a document is valuable if it ranks well in one or more lists. A simple version is:

   RRF(d) = Σ_i 1 / (k + rank_i(d))

   where rank_i(d) is the rank of document d in list i, and k is a smoothing constant (often around 60). A minimal implementation appears after this list.
4. Take a wider candidate set
   For example, keep the top 50 or 100 fused candidates. This keeps recall high before expensive scoring.
5. Rerank with a stronger model
   Use a cross-encoder or instruction-tuned reranker that reads [query, document] jointly and outputs a relevance score (see the end-to-end sketch after this list).
   - Bi-encoders (embedding search) are fast because query and docs are encoded separately.
   - Cross-encoders are slower but more accurate because they compare tokens across query and doc directly.
6. Select final context
   Keep the top N chunks after reranking, then apply context assembly rules (deduplicate, diversify sources, respect metadata filters); one possible rule set is sketched after this list.
7. Pass to generation model
   Now the LLM receives fewer but stronger chunks.
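To make step 3's formula concrete, here is a minimal, dependency-free Python sketch of RRF. The function name rrf_fuse and the toy document IDs are illustrative, not from any particular library:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: RRF(d) = sum_i 1 / (k + rank_i(d)).

    Ranks start at 1; the constant k (often around 60) damps the
    advantage of top positions so no single list dominates.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.__getitem__, reverse=True)

# "d2" is only second in each list, but it appears in both,
# so fusion puts it first.
lexical = ["d1", "d2", "d3"]
semantic = ["d4", "d2", "d5"]
print(rrf_fuse([lexical, semantic]))  # ['d2', 'd1', 'd4', 'd3', 'd5']
```

This is the intuition from step 3 in code: a document only mentioned once needs a very high rank to beat a document that shows up in both channels.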
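Steps 1, 2, 4, and 5 can also be sketched end to end. This is an illustration under assumptions, not a production setup: it relies on the third-party rank_bm25 and sentence-transformers packages, two publicly available models (all-MiniLM-L6-v2 for embeddings, cross-encoder/ms-marco-MiniLM-L-6-v2 for reranking), and a made-up three-document corpus:

```python
from collections import defaultdict

from rank_bm25 import BM25Okapi                        # lexical channel (BM25)
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "Rotate API keys with a grace period so old and new keys overlap.",
    "Credential rollover: issue the new token before revoking the old one.",
    "BM25 handles exact lookups for product codes and SKUs.",
]

# Step 1: build both retrieval channels over the same corpus.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = embedder.encode(corpus, convert_to_tensor=True)

def hybrid_candidates(query: str, n: int = 50) -> list[list[int]]:
    """Step 2: run both channels; each list holds doc indices, best first."""
    lex_scores = bm25.get_scores(query.lower().split())
    lex = sorted(range(len(corpus)), key=lambda i: -lex_scores[i])[:n]
    sims = util.cos_sim(embedder.encode(query, convert_to_tensor=True), doc_embs)[0]
    sem = sims.argsort(descending=True)[:n].tolist()
    return [lex, sem]

def fuse(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Step 3: Reciprocal Rank Fusion over the channel rankings."""
    scores: dict[int, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, i in enumerate(ranked, start=1):
            scores[i] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Steps 4-5: cap the fused candidate set, then let a cross-encoder
# read each (query, document) pair jointly and score true relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "How do I rotate API keys without downtime?"
candidates = fuse(hybrid_candidates(query))[:50]
scores = reranker.predict([(query, corpus[i]) for i in candidates])
order = sorted(range(len(candidates)), key=lambda j: -scores[j])
print([corpus[candidates[j]] for j in order[:2]])      # final top-N context
```

The split between the two stages is the design choice that matters: BM25 and the bi-encoder scale to the whole corpus, while the slower cross-encoder only ever sees the short fused candidate list.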
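Step 6 is mostly bookkeeping. Below is a dependency-free sketch of one plausible rule set; the chunk dict shape and the specific rules are assumptions, not a standard:

```python
def assemble_context(chunks: list[dict], max_chunks: int = 5,
                     max_per_source: int = 2) -> list[dict]:
    """Keep reranked order while deduplicating and diversifying sources.

    Each chunk is assumed to look like {"text": ..., "source": ...};
    chunks arrive best-first from the reranker.
    """
    seen: set[str] = set()
    per_source: dict[str, int] = {}
    kept: list[dict] = []
    for chunk in chunks:
        text, source = chunk["text"], chunk["source"]
        if text in seen:                        # deduplicate
            continue
        if per_source.get(source, 0) >= max_per_source:
            continue                            # diversify sources
        seen.add(text)
        per_source[source] = per_source.get(source, 0) + 1
        kept.append(chunk)
        if len(kept) == max_chunks:             # respect the context budget
            break
    return kept
```

Metadata filters, the third rule in step 6, would normally run before or inside this loop, for example dropping chunks whose source fails an access-control check.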
Simple example
Query: “How do I rotate API keys without downtime?”
- Keyword search finds docs with exact phrase “API key rotation” and “zero downtime”.
- Vector search also finds “credential rollover” and “grace-period token migration” docs.
- Fusion combines both sets.
- Reranker pushes documents that specifically discuss migration sequence and rollback checks to the top.
Result: the final context is not just related to security - it is specifically relevant to the safe rotation procedure.
Key Terminology
- Lexical retrieval (BM25): Search that relies on token overlap and term statistics.
- Vector retrieval: Search in embedding space for semantically similar content.
- Hybrid retrieval: Combining lexical and vector signals in one retrieval stack.
- Reranker: A stronger model that reorders candidates by deeper relevance.
- RRF (Reciprocal Rank Fusion): Rank-based fusion method that combines multiple ranked lists.
Real-World Applications
- Customer support copilots: Blend exact policy lookups with semantic FAQ matching, then rerank for issue-specific passages.
- Enterprise search: Combine strict keyword constraints (legal terms, IDs) with semantic retrieval across wiki pages.
- E-commerce search: Match product codes exactly while still understanding intent like “lightweight trail shoes for wet weather.”
- Code assistants: Retrieve by symbol names and function signatures, then rerank by actual implementation relevance.
Common Misconceptions
- “Vector search makes keyword search obsolete.” Not true. Exact matching remains critical for identifiers, formulas, product SKUs, and compliance language.
- “Reranking is optional polish.” In many systems it is a major quality lever, especially when the top few retrieved chunks determine answer quality.
- “Hybrid retrieval is always too slow.” With practical candidate limits and batched reranking, latency is often acceptable for significant relevance gains.
Further Reading
- Cormack, Clarke, and Buettcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
- Robertson and Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.
- Nogueira and Cho (2019), Passage Re-ranking with BERT.