
Hybrid search and reranking: why vector similarity alone fails

The RagNight team · 11 min read · January 27, 2026

Vector similarity grasps meaning but misses the literal (codes, acronyms). Hybrid dense + BM25 search fused with RRF, then cross-encoder reranking: the combo that makes a RAG genuinely precise.

Vector search has a flaw we prefer not to mention: it is excellent at meaning, mediocre at the literal. Ask it to find a document about "early contract termination" and it shines. Ask it for the product code XR-4420-B or the exact acronym DSP2 and it may miss entirely. That is why serious RAG systems almost never use vector similarity alone. They combine hybrid search and reranking.

Here is why, and how to assemble them — with the numbers, trade-offs and settings that make the difference in production.

The blind spot of vector similarity

An embedding captures the meaning of a passage in a continuous space. Two phrasings of the same idea end up close. That is dense retrieval's strength — and its limit.

Some queries are not about meaning but exact matches:

  • product codes, references, version numbers (v2.14.0);
  • proper nouns, acronyms, identifiers (SIREN, IBAN);
  • rare terms or domain jargon absent from the embedding's training data.

On these, an embedding model may judge as "close" passages that share the topic but not the exact term. The mechanism is easy to grasp: a subword tokenizer splits XR-4420-B into fragments (XR, -, 44, 20…) with no stable meaning, and the resulting vector lands next to any other alphanumeric reference. To the embedding, XR-4420-B and XR-4420-C are nearly indistinguishable; to your user, they are two different parts. Result: the right document — the one literally containing the reference — may not surface, or surfaces at rank 30, outside the context passed to the LLM.

Dense search understands what you mean; lexical search finds what you wrote. A good RAG needs both.

A concrete query where dense alone fails

Take a technical support corpus. The query: "error E-1042 after update v2.14.0". The only useful document is a release note that mentions E-1042 and v2.14.0 verbatim.

  • Dense alone: the model surfaces five articles about "errors after an update", "migration problems", "version regressions" — semantically close, but none contains the exact code. The true note lands 18th.
  • Lexical (BM25): E-1042 and v2.14.0 are rare tokens with very high IDF weight. The document containing both comes out on top immediately.
  • Hybrid: the document is caught by BM25 and judged topically relevant by dense → fusion pushes it to the top, and the reranker confirms. The LLM gets the right release note in the top-5.

This is exactly the kind of query — common in support, legal, finance, R&D — where you lose a customer if the RAG answers off-target.

BM25: why lexical captures the literal

BM25 (Best Matching 25) is the reference algorithm for lexical retrieval, inherited from the TF-IDF family. It ranks a document against a query using three ingredients:

  1. TF (term frequency) — the more a query term appears in the document, the higher the score, but with saturation: the 10th occurrence weighs far less than the 2nd (parameter k1, typically ~1.2–2.0).
  2. IDF (inverse document frequency) — a term rare in the corpus (like E-1042) is worth much more than a common word (the, update). This is what makes BM25 surgical on identifiers and jargon.
  3. Length normalization — prevents a very long document from mechanically scoring more (parameter b, typically ~0.75).

BM25's strength lies in IDF: a token appearing in just one document of the corpus becomes a near-deterministic signal. No semantic generalization, no "approximately" — precisely what dense cannot do. In return, BM25 is blind to synonyms ("termination" ≠ "cancellation") and to rephrasing. Hence the complementarity, not the competition.
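The three ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: it assumes naive whitespace tokenization, picks k1 = 1.5 from the typical range, and uses the common "+1 inside the log" IDF variant so weights stay positive. The function name bm25_scores and the three-document corpus are made up for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # IDF: a term found in few documents weighs much more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating TF: each extra occurrence adds less than the previous one.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus; whitespace tokenization is an assumption.
corpus = [
    "release note v2.14.0 fixes error e-1042 after update",
    "common errors after an update and how to fix them",
    "migration problems after a version update",
]
docs = [text.split() for text in corpus]
scores = bm25_scores("error e-1042 after update v2.14.0".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In this toy corpus the release note wins because e-1042 and v2.14.0 each appear in a single document, so their IDF dwarfs that of "after" and "update", which appear everywhere.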

Dense generalizes, BM25 discriminates. One finds the document about the same topic, the other finds the document containing the right word. Fusing them means you stop choosing.

Hybrid search: dense + lexical

Hybrid search combines these two complementary engines:

  • Dense (vector) — captures semantic similarity.
  • Lexical (BM25 / keywords) — captures exact-term matches.

Run both in parallel, then fuse the rankings. The most robust method is RRF (Reciprocal Rank Fusion), which combines ranks rather than raw scores (scores are hard to compare across engines) into a single unified ranking.

Query
   ├──► dense search (pgvector / HNSW) ──► top-k_dense
   └──► lexical search (BM25)          ──► top-k_lexical
                       │
                   RRF fusion
                       │
              unified candidates (large top-N)

How RRF works, in numbers

RRF ignores raw scores: a cosine similarity (bounded, between -1 and 1) and a BM25 score (unbounded, corpus-dependent) are not comparable. It keeps only the rank of each document in each list and applies:

score_RRF(d) = Σ_i  1 / (k + rank_i(d)),   summed over the ranked lists i

where k is a damping constant (usual value: 60) preventing the very first rank from crushing all others. Take three documents A, B, C and two engines:

Doc   Dense rank   BM25 rank   Calc (k=60)    RRF score
A     1            4           1/61 + 1/64    0.0320
B     3            1           1/63 + 1/61    0.0323
C     2            30          1/62 + 1/90    0.0272

Reading the result: B takes the lead because it ranks well in both engines (first in BM25, third in dense). A, first in dense and fourth in BM25, follows closely; the narrow gap (0.0323 vs 0.0320) comes down to B's slightly better second rank. C, brilliant in dense (rank 2) but nearly absent in BM25 (rank 30), comes last: one engine is not enough to carry it. That is the whole point of RRF: cross-engine consensus beats an isolated spike, with no need to calibrate fragile weights between incompatible score scales.
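To make the arithmetic concrete, here is a minimal sketch of RRF that reproduces the three scores in the table. The function name rrf_fuse and the filler ids (f2, f3, ...) that pad the BM25 list so C lands at rank 30 are illustrative. A document missing from a list simply contributes nothing for that list, which is one common convention rather than the only possible one.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids (best first) with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Ranks from the table: dense has A=1, C=2, B=3; BM25 has B=1, A=4, C=30.
dense = ["A", "C", "B"]
bm25 = ["B", "f2", "f3", "A"] + [f"f{i}" for i in range(5, 30)] + ["C"]

fused = rrf_fuse([dense, bm25])
by_id = dict(fused)
for doc_id in ("A", "B", "C"):
    print(doc_id, f"{by_id[doc_id]:.4f}")
```

The filler documents appear in only one list, so each scores at most 1/62 and stays below A, B and C after fusion.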

RRF rewards agreement between engines. A document both dense and BM25 deem relevant rises strongly; one carried by a single engine stays a candidate without dominating. That is what makes fusion robust without fine tuning.

Reranking: retrieve broad, decide fine

Fusion yields a candidate list, say the top 50. But 50 candidates are too many for the LLM context, and the order is not yet optimal. Enter reranking.

Bi-encoder vs cross-encoder

The distinction is central. A bi-encoder — that is what your embeddings are — encodes the query and each document separately, into two vectors, then compares by dot product or cosine. Upside: you precompute all document vectors at indexing time, so a query costs only one encoding plus an index lookup. Downside: query and document never "see" each other; the whole interaction is collapsed into a distance between two points.

A cross-encoder — that is the reranker — concatenates query and candidate into a single input ([query] [SEP] [document]) and lets the transformer's attention cross every word of the query with every word of the document. It outputs a relevance score directly. This is far finer: it "understands" that the query's v2.14.0 matches the document's v2.14.0 exactly, and that context links them. The cost: nothing can be precomputed, since the score depends on the pair. You need one model pass per (query, candidate) pair at query time.

Hence the two-stage architecture: the bi-encoder (fast, precomputed) filters millions of documents down to a few dozen; the cross-encoder (slow, precise) reorders those few dozen.

The winning pattern:

  1. Retrieve broad (hybrid, top-50) for recall.
  2. Rerank those candidates for precision.
  3. Keep the fine top-k (top-5 to top-8) for the LLM.

Mature 2026 rerankers: Cohere Rerank 3, bge-reranker-v2-m3 (open, multilingual), Voyage rerank, the Jina reranker.

candidates (top-50) ──► cross-encoder reranker ──► top-8 ──► LLM
        (recall)               (precision)
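The rerank stage of this funnel can be sketched as follows. This is a shape-only illustration under stated assumptions: score_pairs stands for any callable that takes a list of (query, document) pairs and returns one relevance score per pair, and toy_overlap_scorer is a deliberately crude stand-in so the example runs without downloading a model. Neither name comes from a real library.

```python
from __future__ import annotations

from typing import Callable, Sequence

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 8,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair in one batch and keep the best top_k."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)  # one model pass per pair, nothing precomputed
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

def toy_overlap_scorer(pairs):
    """Stand-in for a cross-encoder: fraction of query tokens found in the doc."""
    scores = []
    for query, doc in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        scores.append(len(q_tokens & d_tokens) / len(q_tokens))
    return scores

query = "error E-1042 after update v2.14.0"
candidates = [
    "Common errors after an update",
    "Release note v2.14.0 fixes error E-1042 after update",
    "Migration problems between versions",
]
top = rerank(query, candidates, toy_overlap_scorer, top_k=2)
print(top)
```

In a real pipeline, score_pairs would wrap an actual cross-encoder. For instance, the predict method of a sentence-transformers CrossEncoder takes a list of pairs and should slot in here, though that wiring is not exercised by this sketch.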

Retrieve broad then rerank fine: the most cost-effective recall/precision trade-off in RAG. Vector similarity is the coarse filter; the reranker does the sorting that counts.

Latency and cost: the orders of magnitude

Reranking is not free, but it is quite affordable if you respect the funnel. Conservative orders of magnitude to expect in 2026:

  • Hosted reranker (API) — Cohere Rerank 3, Voyage, Jina: reranking 50 candidates typically adds 50 to 200 ms of latency (a single network call, batched server-side). On cost, these APIs bill per "search" (one call = 1 query + N documents), often on the order of a dollar per few thousand searches — negligible next to the downstream LLM cost.
  • Self-hosted open reranker (bge-reranker-v2-m3 on GPU): expect a few milliseconds per pair, i.e. a few tens of ms for 50 batched candidates. No marginal per-request cost, but a GPU instance to provision and maintain.
  • What blows up: reranking 500 candidates instead of 50 multiplies cost/latency by ~10 for almost no marginal gain. Discipline on top-N is the main lever.
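A back-of-the-envelope model makes the top-N discipline tangible. The sketch below assumes rerank time grows linearly with the number of candidates, which is a simplification, and the 1 ms effective cost per pair is a placeholder rather than a measurement. Replace it with a figure from your own benchmarks. Both function names are illustrative.

```python
import math

def rerank_latency_ms(n_candidates, ms_per_pair, fixed_overhead_ms=0.0):
    """Linear estimate: fixed overhead plus one model pass per candidate."""
    return fixed_overhead_ms + n_candidates * ms_per_pair

def max_candidates(budget_ms, ms_per_pair, fixed_overhead_ms=0.0):
    """Largest top-N whose estimated rerank time fits the latency budget."""
    return max(0, math.floor((budget_ms - fixed_overhead_ms) / ms_per_pair))

MS_PER_PAIR = 1.0  # placeholder effective cost per pair, NOT a benchmark

small = rerank_latency_ms(50, MS_PER_PAIR)
large = rerank_latency_ms(500, MS_PER_PAIR)
ratio = large / small

# How many candidates fit in a 100 ms budget with 20 ms of fixed overhead?
budget_n = max_candidates(budget_ms=100, ms_per_pair=MS_PER_PAIR, fixed_overhead_ms=20)
print(small, large, ratio, budget_n)
```

Working backwards from a latency budget to a maximum top-N, rather than the other way round, keeps the reranker from quietly eating the p95.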

Compared to LLM generation (often hundreds of ms to several seconds), the reranking overhead is almost always a good investment: it improves context precision, hence answer quality, hence cuts iterations.

Why not rerank everything?

Because the cross-encoder is expensive: one model pass per (query, candidate) pair. Reranking 10,000 documents per query would be unaffordable in latency and cost — and pointless: beyond a few dozen candidates, you mostly rerank noise. Hence the funnel: hybrid search narrows to a few dozen candidates the reranker can process quickly, concentrating the precise compute where it matters.

Tuning and pitfalls

  • Calibrate retrieval top-N — too small starves the reranker (the right document is never in the list, reranking can't help); too large costs latency for nothing. Top-30 to top-50 is a good start; measure recall@N on your golden set to decide.
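  • Tune ef_search on pgvector/HNSW (hnsw.ef_search in pgvector): this parameter controls how wide the graph search explores at query time. Too low, and the HNSW index misses good candidates (degraded recall); too high, and latency climbs. pgvector defaults to 40, and the value also bounds how many candidates an index scan can return, so keep it at least as large as your dense top-N. Aim for an ef_search that keeps your dense top-N stable, typically a few hundred. It is the index-side counterpart to the application-side top-N.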
  • Keep RRF k around 60 — a proven value. Touching it rarely helps: if fusion disappoints, the problem is almost always upstream (embedding quality, chunking, one engine's recall), not in k.
  • Watch end-to-end latency — reranking adds a network hop and compute. Measure perceived impact, not just p50: it is the p95/p99 that drive users away.
  • Lexical is not optional for multilingual or jargon-heavy corpora: on technical, legal or financial corpora, BM25 often rescues what dense misses. In multilingual settings, check that your BM25 tokenization handles each language well (accents, compounds).
  • Set a reranker score threshold — beyond top-k, also filter on a minimum relevance score. Better to pass 3 truly relevant chunks than 8 where 5 are noise diluting the LLM context.
  • Evaluate with a golden set — the only way to know hybrid + rerank actually helps your case, not just "in theory". Measure context recall and context precision (e.g. via RAGAS) before/after.
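Measuring recall@N on a golden set, as suggested in the first and last points above, needs very little code. The sketch below assumes the golden set maps each query to the set of document ids a correct answer requires, and that each retrieval run is a ranked list of ids. The query and document ids are invented for the example.

```python
def recall_at_n(retrieved, relevant, n):
    """Fraction of the relevant doc ids that appear in the first n retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:n]) & set(relevant)) / len(relevant)

def mean_recall_at_n(runs, golden, n):
    """Average recall@n over every query of the golden set."""
    return sum(recall_at_n(runs[q], golden[q], n) for q in golden) / len(golden)

# Hypothetical golden set: query -> ids of the documents that must be retrieved.
golden = {"q1": {"d7"}, "q2": {"d2", "d9"}}
# Hypothetical retrieval output: query -> ranked doc ids, best first.
runs = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d4", "d5", "d9"]}

for n in (1, 2, 4):
    print(n, mean_recall_at_n(runs, golden, n))
```

Plotting this value for increasing N shows where the curve flattens, which is a reasonable place to set the retrieval top-N before handing candidates to the reranker.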

Conclusion

Vector similarity alone is a good start and a bad finish. Add RRF-fused lexical search to recover what meaning alone misses — codes, references, jargon; add a cross-encoder reranker to turn a decent candidate list into precise context, finally letting query and documents "see" each other. That combination — broad then fine, meaning then literal — separates a demo RAG from one that answers correctly in production. The good news: every building block (BM25, RRF, reranking) is now mature, cheap, and available both as an API and as self-hostable open source.

