RAG impresses in a demo and disappoints in production. Three patterns change everything: hybrid vector + BM25 search with RRF fusion, mandatory cross-encoder reranking, and decomposing complex questions into sub-queries.
Tutorial RAG fits on one line: embed the question, run a top-k search, stuff the chunks into the prompt, and pray. In a demo, on ten hand-picked questions, it impresses. In production, across thousands of real queries from users who don't know your internal vocabulary, it disappoints almost every time. Vector similarity alone misses exact references, surfaces plausible-but-irrelevant passages, and collapses the moment a question contains more than one intent.
The good news: we know why, and we know what to do. Three patterns turn a RAG prototype into a reliable system. None is exotic in 2026 — they've become the expected baseline of any serious retrieval architecture. Here, in the order you should implement them, are hybrid search, mandatory reranking, and query decomposition.
Pattern 1: hybrid search (vector + BM25)
Vector search understands meaning. Ask "how do I cancel my subscription" and it will find a passage titled "ending your contract," even with no shared word. That's its strength, and why it took over. But it has a structural blind spot: it crushes the literal. A reference number INV-2024-08831, a product name RagNight Enterprise, an error code PG::ConnectionBad, a rare proper noun — strings whose embedding captures almost nothing distinctive. Two different invoice references have nearly identical vectors. Dense retrieval is blind to exactly the details that matter most in the enterprise.
BM25, the lexical ranking algorithm behind Elasticsearch, OpenSearch, and pg_search, does the opposite. It scores on exact term matches, weighted by their frequency in the document (TF) and their rarity in the corpus (IDF). A rare, exact term like a serial number spikes the BM25 score, where it's invisible to dense retrieval. BM25 understands nothing about meaning, but it never misses an identifier.
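To make the TF and IDF weighting concrete, here is a minimal BM25 sketch over a tiny in-memory corpus. It is an illustration, not what Elasticsearch or any other engine actually runs: the `tokenize` and `bm25_scores` helpers, the three sample documents, and the `k1`/`b` defaults (1.2 and 0.75, common textbook values) are all assumptions made for the example.

```ruby
# Hypothetical tokenizer: lowercases and keeps identifiers such as
# "inv-2024-08831" or "pg::connectionbad" as single terms.
def tokenize(text)
  text.downcase.scan(/[a-z0-9:_-]+/)
end

# Returns one BM25 score per document, in the same order as `docs`.
def bm25_scores(query, docs, k1: 1.2, b: 0.75)
  tokenized = docs.map { |doc| tokenize(doc) }
  n         = tokenized.size
  avg_len   = tokenized.sum(&:size).to_f / n

  tokenized.map do |terms|
    tf = terms.tally
    tokenize(query).uniq.sum do |term|
      df = tokenized.count { |t| t.include?(term) }
      next 0.0 if df.zero?

      # IDF: the fewer documents contain the term, the higher its weight.
      idf = Math.log(1 + (n - df + 0.5) / (df + 0.5))
      f   = tf.fetch(term, 0)
      # TF with saturation (k1) and document-length normalization (b).
      idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * terms.size / avg_len))
    end
  end
end

docs = [
  "Invoice INV-2024-08831 was refunded on request",
  "Invoice INV-2024-08832 is still pending payment",
  "How to read an invoice and request a refund"
]

scores = bm25_scores("invoice INV-2024-08831", docs)
best   = scores.each_with_index.max_by { |score, _index| score }.last
puts "best match: document #{best}"
```

The word "invoice" appears in every document, so its IDF is close to zero and it barely separates them. The exact reference appears in only one, so that document wins clearly.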
Hybrid search isn't an optional refinement: it's the recognition that no single representation can capture both meaning and the literal. Dense for semantics, lexical for exactness.
Fusing the two: RRF
One concrete problem remains: how do you combine two result lists whose scores aren't comparable? A cosine score of 0.82 and a BM25 score of 14.3 don't live on the same scale, and normalizing such different distributions is brittle. The now-standard answer is Reciprocal Rank Fusion (RRF). RRF ignores raw scores and looks only at each document's rank in each list. The formula, for a document d:
$$\mathrm{score}_{\mathrm{RRF}}(d) = \sum_{i \in \text{lists}} \frac{1}{k + \mathrm{rank}_i(d)}$$
k is a damping constant (typically 60) that prevents the very top positions from drowning out everything else. A worked example, with k = 60:
| Document | Vector rank | BM25 rank | RRF calc | Score |
|---|---|---|---|---|
| A | 1 | 3 | 1/61 + 1/63 | 0.0323 |
| B | 2 | 1 | 1/62 + 1/61 | 0.0325 |
| C | none | 2 | 0 + 1/62 | 0.0161 |
| D | 3 | none | 1/63 + 0 | 0.0159 |
Document B rises to the top: it ranks well in both signals, even without being first anywhere. Document A, first in vector but middling in lexical, lands just behind. Documents found by only one list (C, D) stay present but heavily discounted. That's exactly the intended behavior: RRF rewards agreement between signals and stays robust without any normalization tuning at all.
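```ruby
RRF_K = 60 # damping constant k from the formula

def hybrid_search(query, k: 20)
  dense  = vector_search(query, limit: 50) # list ranked by cosine similarity
  sparse = bm25_search(query, limit: 50)   # list ranked by BM25 score

  rrf    = Hash.new(0.0)
  chunks = {}
  [dense, sparse].each do |results|
    results.each_with_index do |chunk, index|
      chunks[chunk.id] ||= chunk
      rrf[chunk.id] += 1.0 / (RRF_K + index + 1) # index is 0-based, rank is 1-based
    end
  end

  # Return the chunks themselves (not bare ids) so a reranker can read their text
  rrf.sort_by { |_id, score| -score }.first(k).map { |id, _score| chunks[id] }
end
```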
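To check the worked example by hand, the fusion step can be isolated from the search backends. The sketch below is a self-contained variant that works on plain lists of ids; `rrf_fuse` is a name chosen for this example, and the two rankings simply restate the table.

```ruby
# Reciprocal Rank Fusion over ranked lists of ids (best first).
def rrf_fuse(ranked_lists, k: 60)
  scores = Hash.new(0.0)
  ranked_lists.each do |list|
    list.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1) # index is 0-based, rank is 1-based
    end
  end
  scores.sort_by { |_id, score| -score }
end

vector_ranking = %w[A B D] # A is rank 1, B rank 2, D rank 3; C was not retrieved
bm25_ranking   = %w[B C A] # B is rank 1, C rank 2, A rank 3; D was not retrieved

fused = rrf_fuse([vector_ranking, bm25_ranking])
fused.each { |id, score| puts format("%s  %.4f", id, score) }
# B  0.0325
# A  0.0323
# C  0.0161
# D  0.0159
```

A document missing from a list contributes nothing for that list, which is why C and D end up with roughly half the score of A and B.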
In practice, on most enterprise knowledge bases — technical docs, contracts, support tickets riddled with identifiers — hybrid search delivers roughly 15 to 30% more precision than vector alone, with the gain growing the more jargon, codes, and exact references the corpus contains.
Pattern 2: reranking is not optional
Here's the uncomfortable truth: your vector store's top-k is not your real top-k. The top five results of a dense or hybrid search are merely close to the query, not necessarily the passages that best answer it. The cause is architectural.
Bi-encoder vs cross-encoder
Vector search relies on a bi-encoder: the query and each document are encoded separately into vectors, then compared by dot product. That's what makes search fast — all document embeddings are precomputed, and at query time only a vector comparison over an HNSW index remains. But the query and the document never "see" each other: their interaction collapses to a single number, the similarity. Nuance is lost.
A cross-encoder does the opposite: it feeds the (query, document) pair together into a single Transformer model, which can attend to every word of the question against every word of the passage. The result is a far finer relevance score. The cost: nothing can be precomputed; you need one inference pass per pair. Scoring a million documents would be absurd — but scoring 50 is trivial.
Bi-encoder to retrieve broad and fast. Cross-encoder to rerank fine and accurate. That's the funnel: retrieve 50 candidates coarsely, rerank them finely, keep 5.
The retrieve-broad / rerank-fine funnel
```ruby
def retrieve(query)
  candidates = hybrid_search(query, k: 50)        # broad, fast, imprecise
  reranked   = reranker.rerank(query, candidates) # fine, slow, precise
  reranked.first(5)                               # the real top-k
end
```
The reranker corrects the first stage's mistakes. A passage that had the right keywords but answered something else is demoted; a passage ranked 23rd by the bi-encoder that exactly answers the question rises to the top. This single step does more for a RAG system's perceived quality than any other.
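The second stage can be sketched independently of any particular model. In the example below, `rerank` takes an injectable `scorer`, a callable that receives a (query, text) pair and returns a relevance score; that is the slot where a cross-encoder call would go. The names, the hash-based candidates, and especially `toy_scorer` are assumptions for illustration: the toy scorer only counts shared words and is in no way a substitute for a real cross-encoder.

```ruby
# Score every (query, candidate) pair, then keep the best `keep` candidates.
def rerank(query, candidates, scorer:, keep: 5)
  candidates
    .map { |candidate| [candidate, scorer.call(query, candidate[:text])] }
    .sort_by { |_candidate, score| -score }
    .first(keep)
    .map(&:first)
end

# Stand-in scorer so the sketch runs: fraction of query words present
# in the passage. A real system would call a cross-encoder here.
toy_scorer = lambda do |query, text|
  query_words = query.downcase.scan(/\w+/).uniq
  text_words  = text.downcase.scan(/\w+/)
  query_words.count { |word| text_words.include?(word) }.to_f / query_words.size
end

# Candidates in the order a first stage might have returned them.
first_stage = [
  { id: 1, text: "Subscription plans and pricing overview" },
  { id: 2, text: "Press releases about our subscription launch" },
  { id: 3, text: "How to cancel your subscription from the account page" }
]

top = rerank("how do I cancel my subscription", first_stage,
             scorer: toy_scorer, keep: 2)
puts top.first[:id]
```

One scorer call per candidate is what makes the candidate count the main latency knob: the work grows linearly with the number of pairs.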
Concrete models in 2026:
- Cohere Rerank 3 — managed API, multilingual, excellent quality/latency ratio, the sensible default if you accept an external call.
- bge-reranker-v2-m3 (BAAI) — open-weights, multilingual, self-hostable. The sovereignty choice: it runs on your own GPU, no data leaves. Very solid quality for an open model.
- Voyage rerank — another good API option, often competitive on technical corpora.
Reranking adds latency (tens to hundreds of milliseconds depending on model and candidate count), but over 50 candidates it stays manageable and the relevance gain almost always justifies it. Tune the number of retrieved candidates (40–60 is a good starting point) to trade off recall against cost.
Pattern 3: decomposition for complex questions
A single search answers a single question well. The problem: real user questions are often several questions in disguise. Take:
"How does our refund policy compare to AWS's?"
No chunk in your base contains that comparison, because it doesn't exist anywhere as such. Answering it really takes three steps, two lookups and one synthesis:
- What is our refund policy?
- What is AWS's refund policy?
- The comparative synthesis of the two.
A single search over the whole question produces an "average" embedding that resembles none of the relevant documents, and likely surfaces a mediocre grab-bag. Decomposition solves this step by step:
- Plan — an LLM breaks the question into atomic sub-questions.
- Retrieve — each sub-question triggers its own hybrid search + reranking (patterns 1 and 2, reused).
- Compose — the LLM receives the context from each sub-question and writes the final comparative answer.
```ruby
def answer_complex(question)
  sub_questions = planner.decompose(question)
  # => ["Our refund policy?",
  #     "AWS's refund policy?"]

  contexts = sub_questions.map do |sq|
    { question: sq, chunks: retrieve(sq) } # hybrid + rerank per sub-question
  end

  composer.synthesize(question, contexts)
end
```
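The planning step itself can stay small. The sketch below shows one possible shape for it, under stated assumptions: `llm` is any callable that takes a prompt string and returns the model's text, the prompt wording is illustrative, and the model is asked to reply with a JSON array. None of this is a specific provider's API. When the output can't be parsed, the function falls back to the original question so the pipeline degrades to a single search.

```ruby
require "json"

# Ask an LLM for atomic sub-questions; fall back to the original question
# when the reply is not a usable JSON array of strings.
def decompose(question, llm:, max_sub_questions: 4)
  prompt = <<~PROMPT
    Break the question below into at most #{max_sub_questions} atomic,
    self-contained sub-questions. Reply with a JSON array of strings only.

    Question: #{question}
  PROMPT

  parsed = JSON.parse(llm.call(prompt))
  subs   = parsed.is_a?(Array) ? parsed.grep(String).map(&:strip).reject(&:empty?) : []
  subs.empty? ? [question] : subs.first(max_sub_questions)
rescue JSON::ParserError
  [question]
end

# Stubbed model reply so the sketch runs without a network call.
fake_llm = ->(_prompt) { '["What is our refund policy?", "What is AWS\'s refund policy?"]' }

sub_questions = decompose("How does our refund policy compare to AWS's?", llm: fake_llm)
puts sub_questions
```

Capping the number of sub-questions bounds the extra latency, since each one triggers its own retrieval and reranking pass.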
Decomposition is the first step toward agentic RAG: rather than a fixed pipeline, an agent decides what to retrieve, judges whether the results suffice, launches additional searches if needed, and chooses when to stop. Static decomposition into sub-questions is its simplest, most predictable form — an excellent starting point before granting the agent real decision latitude. Reserve it for questions that need it: detect multi-intent queries (comparisons, enumerations, compound conditions) and keep simple questions on the fast path of a single search, so you don't pay needless latency.
Decompose when a question holds multiple intents; don't decompose a simple one. The cost of bad decomposition is real: multiplied latency and the risk of diluting context.
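The routing decision can start as something very simple. Here is a deliberately crude sketch that flags comparisons, enumerations, and chained questions with a few regular expressions and sends everything else down the single-search path. The patterns and the `multi_intent?` and `route` names are assumptions for the example; in practice a small classifier or the planner LLM itself would likely make this call more reliably.

```ruby
# Heuristic multi-intent detector. The patterns are illustrative and English-only.
def multi_intent?(question)
  patterns = [
    /\bcompared?\b|\bcomparison\b|\bversus\b|\bvs\.?(\s|$)/i, # comparisons
    /\bdifference between\b/i,                                # comparisons
    /\bboth\b|\beach of\b/i,                                  # enumerations
    /\b(and|or) (what|how|why|when|where|which|who)\b/i       # chained questions
  ]
  patterns.any? { |pattern| question.match?(pattern) }
end

# Simple questions stay on the fast path; the rest go through decomposition.
def route(question)
  multi_intent?(question) ? :decompose : :single_search
end

puts route("How does our refund policy compare to AWS's?") # decompose
puts route("How do I cancel my subscription?")              # single_search
```

A heuristic like this errs in both directions, so it is worth logging which path each query took and reviewing the misroutes.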
Why we're telling you this
These three patterns aren't advanced configuration options: they're the foundations of a RAG that actually answers, as opposed to a RAG that demos well. Hybrid search so you never again miss an exact reference. Reranking so your top-5 is the real top-5. Decomposition for questions that hide several inside.
RagNight implements all three patterns by default. Hybrid search with RRF fusion, cross-encoder reranking (managed model or self-hosted open-weights for sovereignty constraints), and complex-query decomposition are wired into the retrieval pipeline. You shouldn't have to re-code them on every project — and you shouldn't have to rediscover them in production.