
Production RAG architecture: from chunking to reranking, the complete guide

The RagNight team · 15 min read · December 02, 2025

The complete guide from RAG POC to production: chunking, embeddings, hybrid search + RRF, reranking, pgvector/HNSW, and RAGAS evaluation. 2026 technical choices, code, and a go-live checklist.

There is a chasm between a RAG POC and a production RAG. The POC takes an afternoon: a few chunked documents, an embedding model, a vector store, and the demo impresses. Then you put it in real users' hands, on a real corpus, with real questions — and relevance collapses. Off-target answers, missing sources, confident hallucinations, erratic latency. The POC answered "often"; production must answer "correctly, reliably, and measurably."

This guide walks the full architecture of a production RAG, link by link, with 2026-current technical choices and the pitfalls that sink projects. The goal is not academic completeness but decisions: at each step, what to choose, what orders of magnitude to target, and which trade-offs to accept.

Anatomy of a serious RAG pipeline

A production RAG is not "a vector store and an LLM." It is a chain where every link conditions the final quality:

INGESTION
  parsing ─► cleaning ─► chunking ─► embedding ─► indexing
RETRIEVAL
  query ─► hybrid search ─► fusion ─► reranking ─► top-k
GENERATION
  context + question ─► LLM ─► answer + citations
EVALUATION (continuous loop)
  golden set ─► metrics ─► improvement
OBSERVABILITY (cross-cutting)
  logs ─► cost ─► latency ─► quality drift

Golden rule: quality is decided far upstream of generation. Bad chunking or a bad embedding cannot be rescued by a better prompt. Optimize in pipeline order, and measure at each step. Teams that invert this logic — polishing the system prompt before checking that the right passages even surface — spend weeks treating symptoms instead of the cause.

Chunking: the underestimated foundation

How you split documents determines what the system can retrieve. It is the most sabotaged step, because it looks trivial: "cut every 500 tokens." But a chunk is the atomic unit of retrieval. If it mixes two ideas, the embedding is blurry and search imprecise. If it truncates an idea, the retrieved passage is incomplete and the LLM fills the gaps — with a hallucination.

Strategies, from naive to advanced:

  • Fixed (tokens/chars + overlap) — simple, cuts mid-idea. Reserve for cases where structure is unusable.
  • Recursive respecting structure (paragraphs, headings, separators) — much better, the sensible default. Cut on natural boundaries first, fall back to raw splitting only as a last resort.
  • Semantic — cut where similarity between consecutive sentences drops, signaling a topic change. More expensive (every sentence must be embedded), with real gains on long continuous texts.
  • Structure-aware (Markdown, HTML, code) — never split a code block, table, or cell in two. For code, split by function or class, not by line count.
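To make the recursive strategy concrete, here is a minimal sketch. It assumes lengths measured in characters rather than tokens, and the separator list and the function name `recursive_split` are illustrative choices, not a reference implementation.

```python
from typing import List, Sequence

# Separators tried in order: paragraph breaks first, single spaces last.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text: str, max_len: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters,
    preferring the coarsest natural boundary that works."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: List[str] = []
        current = ""
        for part in text.split(sep):
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= max_len:
                current = candidate  # keep merging small neighbours
                continue
            if current:
                chunks.append(current)
            current = ""
            if len(part) <= max_len:
                current = part
            else:
                # Still too long: retry this part with finer separators.
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # Last resort: no separator left, fall back to a raw fixed-size split.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]


doc = ("Title\n\nFirst paragraph is here.\n\n"
       "Second paragraph is a bit longer than the first one.")
for chunk in recursive_split(doc, max_len=40):
    print(repr(chunk))
```

The short title and first paragraph are merged into one chunk, while the oversized second paragraph is split on spaces only because no coarser boundary fits. A production version would count tokens with the embedding model's tokenizer and add overlap.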

Two 2026 techniques that change the game:

  • Late chunking (Jina) — instead of embedding each chunk in isolation, you pass the whole document (or a large segment) through the model, then aggregate token-level embeddings afterward to produce one vector per chunk. Each chunk inherits global context. Concretely, the chunk "Revenue grew 12%" keeps its link to "Renault" mentioned three paragraphs earlier, whereas embedded in isolation it would lose the subject. Preserves long-distance references and anaphora ("this subsidiary," "said contract").
  • Contextual retrieval (Anthropic) — prefix each chunk with a short LLM-generated context before embedding. A concrete example:
Raw chunk:
  "Gross margin stands at 38.2%, down 1.4 points."

Contextualized chunk (LLM-generated prefix):
  "[Excerpt from Acme's 2025 annual report, 'Financial
   Performance' section, fiscal year ended 12/31/2025.]
   Gross margin stands at 38.2%, down 1.4 points."

Anthropic reports that contextual retrieval combined with hybrid search (contextual embeddings plus contextual BM25) cut its top-20 retrieval-failure rate by roughly half (49% in its published benchmark, 67% once reranking is added). The cost: one LLM call per chunk at ingestion. Amortize it with prompt caching (the full document is cached, only the chunk varies) and reserve it for corpora where retrieval quality is critical.
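Returning to late chunking: the aggregation step described above can be sketched in a few lines. This assumes the long-context model has already produced one vector per token for the whole document, and that mean pooling is the aggregation; `late_chunk_pool` and the token spans are hypothetical names and inputs for illustration.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(token_vectors: Sequence[Sequence[float]],
                    chunk_spans: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Mean-pool document-level token embeddings over each chunk span.

    token_vectors: one vector per token, computed over the WHOLE document,
                   so each token vector already carries global context.
    chunk_spans:   (start, end) token offsets per chunk, end exclusive.
    """
    pooled = []
    for start, end in chunk_spans:
        if not 0 <= start < end <= len(token_vectors):
            raise ValueError(f"invalid span ({start}, {end})")
        span = token_vectors[start:end]
        dim = len(span[0])
        pooled.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return pooled


# Toy input: 4 tokens with 2-dimensional embeddings, split into 2 chunks.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(late_chunk_pool(tokens, [(0, 2), (2, 4)]))
```

The difference from classic chunking is only in the order of operations: the document is encoded once, and chunk boundaries are applied to the token vectors afterward.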

Practical recommendations: aim for 256-512 tokens per chunk depending on the embedding model, 10-20% overlap, and always keep source, section, date, and version in metadata. A too-large chunk dilutes the semantic signal; a too-small one loses context. Test two or three sizes on your golden set rather than copying a default.

Embeddings: choosing your model

The embedding turns text into vectors. The model choice affects relevance, cost, and sovereignty.

Model                            | Profile             | Strengths
text-embedding-3-large (OpenAI)  | API, up to 3072 dim | Strong, adjustable dimensions
voyage-3 (Voyage AI)             | API                 | Excellent quality/price
embed v3 (Cohere)                | API                 | Solid, multilingual
bge-m3 (BAAI)                    | Open, multilingual  | Self-hostable, sovereign, dense+sparse
embeddings v3 (Jina)             | API/Open            | Native late chunking
embed (Mistral)                  | EU API              | European roots

Decision criteria: multilingual (crucial in Europe — bge-m3, Mistral, Qwen 3 shine on corpora mixing French, German, and English), dimensions, cost, and open vs API per your sovereignty needs.

Dimensions, cost, and storage. More dimensions = finer detail, but also more storage and similarity compute. A 3072-dimension float32 vector weighs about 12 KB (3072 × 4 bytes); across 10 million chunks that is roughly 123 GB for embeddings alone, before the index. Hence the next two levers.
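The storage arithmetic is easy to rerun for your own corpus. A small helper, with a hypothetical name, that covers raw vector bytes only (no index, no row overhead):

```python
def embedding_storage_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors: count x dimensions x bytes per component.

    bytes_per_value is 4 for float32 and 2 for half precision.
    """
    return n_vectors * dims * bytes_per_value


one_vector = embedding_storage_bytes(1, 3072)
corpus = embedding_storage_bytes(10_000_000, 3072)
print(f"one vector: {one_vector} bytes, 10M chunks: {corpus / 1e9:.1f} GB")
```

Changing `dims` or `bytes_per_value` shows immediately how much each lever (fewer dimensions, lower precision) saves before you commit to a re-embedding.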

Matryoshka / MRL (Matryoshka Representation Learning). Models trained with Matryoshka (including text-embedding-3-large) order information by importance within the vector: you can truncate a 3072-dim vector to 1024 or 512 dimensions, keeping only the leading components, with a small, gradual quality loss. In practice you might store 1024 instead of 3072 dimensions, cutting storage and search cost to about a third at the price of a few recall points. It is the most cost-effective quality knob in the pipeline.
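Truncating a Matryoshka embedding by hand is a slice followed by a re-normalization, since the shortened vector no longer has unit length. A minimal sketch, with a hypothetical function name, that only makes sense for models actually trained with MRL:

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the leading `dims` components and rescale to unit L2 norm."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Toy 3-dim vector truncated to its first 2 components.
print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Queries and documents must be truncated to the same length, and the comparison between truncated sizes belongs on the golden set, not on intuition.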

For a multilingual corpus with sovereignty requirements, self-hosted bge-m3 is often the best compromise (dense + sparse in one model, data that never leaves your infrastructure). To start fast without data constraints, text-embedding-3-large or voyage-3. Absolute rule: never compare embeddings from two different models — switching models forces a full re-embedding of the corpus.

Hybrid search: dense + lexical

Vector similarity grasps meaning but misses the literal: product codes, acronyms, proper nouns, version numbers, identifiers. Hence hybrid search, combining:

  • Dense (vector) — semantic similarity. Catches "early termination contracts" even if the document says "break before term."
  • Lexical (BM25) — exact term matches. Catches the reference XR-4420-B or the acronym GDPR that dense search drowns in noise.
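To show why the lexical side catches exact identifiers, here is a toy BM25 scorer. It assumes whitespace tokenization and the common defaults k1 = 1.5 and b = 0.75; a real deployment would rely on a search engine or a full-text index rather than this sketch.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplistic on purpose: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "replacement part xr-4420-b for the pump",
    "general pump maintenance guide",
    "contract termination policy",
]
print(bm25_scores("XR-4420-B", docs))
```

Only the first document scores above zero: the identifier either matches or it does not, which is exactly the behavior dense similarity cannot guarantee.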

Fuse the two rankings via RRF (Reciprocal Rank Fusion), which combines ranks rather than incomparable scores. The formula is simple: for each document, sum 1 / (k + rank) across each engine, with k a constant (typically 60).

RRF(d) = Σ  1 / (k + rank_i(d))      with k = 60

Example: a document ranked 1st in dense and 3rd in lexical
  score = 1/(60+1) + 1/(60+3) = 1/61 + 1/63 ≈ 0.03227

A document ranked 2nd in dense only
  score = 1/(60+2) = 0.01613

Query
   ├──► dense search (pgvector / HNSW) ──► top-k_dense
   └──► lexical search (BM25)          ──► top-k_lexical
                       │
                   RRF fusion
                       │
              unified candidates (large top-N, e.g. 50)

RRF's elegance: it does not depend on score scale (a cosine distance and a BM25 score are not comparable), only on ranks. A document found by both engines rises strongly — capturing both meaning and literal in a single unified ranking.
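The fusion itself fits in a dozen lines. A minimal sketch, assuming each engine returns an ordered list of document ids (best first); `rrf_fuse` and the ids are illustrative names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60,
             top_n: int = 50) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking
    in which a document appears, then sort by the fused score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n]


dense = ["a", "b", "c"]    # "a" is 1st in dense
lexical = ["x", "y", "a"]  # "a" is 3rd in lexical
for doc_id, score in rrf_fuse([dense, lexical]):
    print(doc_id, round(score, 5))
```

Document "a" reproduces the worked example (1/61 + 1/63) and comes out first, ahead of "x" and "b", which each appear in a single ranking.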

Reranking: retrieve broad, decide fine

The 50 fused candidates are too many for the LLM and poorly ordered. Why: a query embedding and a document embedding are computed independently (bi-encoder), then compared by dot product. Fast and indexable, but coarse. A reranker (cross-encoder) reads the query and each candidate together, in the same attention pass, and assigns a far finer relevance score.

Criterion   | Bi-encoder (embedding)   | Cross-encoder (reranker)
Computation | Query and doc separately | Query + doc together
Indexable   | Yes (precomputed)        | No (on the fly)
Precision   | Decent                   | High
Latency     | Very low                 | High (scales with candidate count)

The winning funnel pattern:

candidates (top-50) ──► cross-encoder reranker ──► top-8 ──► LLM
        (recall)               (precision)

  1. Retrieve broad (hybrid, top-50) → favor recall.
  2. Rerank → favor precision.
  3. Keep top-5 to top-8 for the LLM context.

Latency. A cross-encoder is expensive: reranking 50 candidates typically adds tens to a few hundred milliseconds depending on model and hardware. That is precisely why you never rerank the whole corpus — hybrid search narrows the field to a few dozen candidates first. The number of reranked candidates (top-N) is a direct latency/quality dial: 50 is a good start, 100 if recall is short, 25 if latency is tight.
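The funnel can be sketched independently of any particular reranker. Below, `score_fn` stands for whatever scores a (query, passage) pair together; the word-overlap scorer is only a toy stand-in so the example runs, not a cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(query: str, candidates: Sequence[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 8) -> List[Tuple[str, float]]:
    """Score every candidate against the query, keep the best top_k."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: share of query words found in the passage.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = [
    "pump maintenance schedule",
    "early termination of the contract",
    "contract law basics",
]
print(rerank("early contract termination", candidates, toy_overlap_score, top_k=2))
```

In production, `score_fn` would wrap a call to the reranker of your choice, ideally batched, and `len(candidates)` is the top-N dial discussed above.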

Mature 2026 rerankers: Cohere Rerank 3, bge-reranker-v2-m3 (open, multilingual, sovereign — self-hostable on GPU), Voyage rerank, Jina reranker.

pgvector in production

PostgreSQL with the pgvector extension is a serious production choice: one system for business data and vectors, under your control (hence sovereign), with the transactions, joins, and SQL filtering you already know.

The HNSW index (Hierarchical Navigable Small World) is the key to approximate-search performance. It is a layered graph navigated hop by hop:

-- HNSW index for cosine similarity
CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- At query time, tune the recall/latency trade-off
SET hnsw.ef_search = 100;

  • m — connections per node. Indicative value 16 (up to 32-64 for very large corpora needing maximal recall). Higher = better recall, more memory, heavier index.
  • ef_construction — candidate-list size at build. Indicative value 64 (128-256 for more quality). Higher = better index, slower build.
  • ef_search — query effort. Indicative value 40-100, tunable dynamically per session. This is your recall/latency dial in production: raise it if recall is short, lower it if latency is tight.

An enriched retrieval query, with multi-tenant filtering and metadata:

SET hnsw.ef_search = 100;

SELECT dc.id, dc.content, dc.section, dc.source_url,
       dc.embedding <=> $1 AS distance
FROM document_chunks dc
JOIN documents d ON d.id = dc.document_id
WHERE d.organization_id = $2            -- tenant isolation
  AND d.status = 'completed'
  AND d.version = d.current_version      -- no stale content
ORDER BY dc.embedding <=> $1             -- <=> = cosine distance
LIMIT 50;

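Important note on filtering: with an approximate index such as HNSW, pgvector scans the index first and applies the WHERE filter to the candidates it returns, not the other way round. On highly selective filters, most of the ef_search candidates are discarded, so the query can return fewer rows than the LIMIT and recall drops. Remedies: raise hnsw.ef_search, enable iterative index scans (hnsw.iterative_scan, available since pgvector 0.8.0), or partition per tenant (or use partial indexes) so each index only contains valid candidates.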
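A back-of-the-envelope estimate helps size ef_search for a selective filter. With an approximate index, the filter is applied to the candidates the index scan returns, so on average only ef_search × selectivity rows survive. The helpers below are hypothetical, assume matching rows are spread uniformly, and cap at 1000, the documented upper bound of hnsw.ef_search.

```python
import math


def expected_filtered_hits(ef_search: int, selectivity: float) -> float:
    """Average number of index candidates that pass the WHERE filter."""
    if not 0.0 < selectivity <= 1.0:
        raise ValueError("selectivity must be in (0, 1]")
    return ef_search * selectivity


def ef_search_for(limit: int, selectivity: float, max_ef: int = 1000) -> int:
    """Smallest ef_search expected to yield `limit` rows after filtering,
    capped at max_ef."""
    if not 0.0 < selectivity <= 1.0:
        raise ValueError("selectivity must be in (0, 1]")
    return min(math.ceil(limit / selectivity), max_ef)


# A tenant owning 25% of the rows, LIMIT 50:
print(expected_filtered_hits(100, 0.25))  # candidates left with ef_search = 100
print(ef_search_for(50, 0.25))            # ef_search needed on average
```

When the required value hits the cap, a larger ef_search is no longer the answer: that is the signal to partition per tenant or to use iterative scans.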

2026 optimizations:

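  • halfvec (16-bit, half precision) — halves vector size with minimal quality impact. You can index an existing vector column through a cast: CREATE INDEX ON document_chunks USING hnsw ((embedding::halfvec(1024)) halfvec_cosine_ops). This halves the index; queries must order by the same expression (embedding::halfvec(1024) <=> $1::halfvec(1024)) for the index to be used, and the table itself only shrinks if the column is stored as halfvec. Note that pgvector's HNSW index accepts up to 2,000 dimensions for vector but up to 4,000 for halfvec, so 3072-dim embeddings need halfvec (or truncation) to be indexed. Valuable at scale.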
  • Binary quantization — pushes compression to the extreme (1 bit per dimension) for massive corpora where storage and memory are the limiting factor; combine with exact re-scoring (full-distance re-ranking) on the top candidates to recover precision.
  • IVFFlat remains an alternative to HNSW: it builds faster and uses less memory, but generally offers a worse speed/recall trade-off at query time, and you must choose the number of lists (the lists parameter) to match corpus size. HNSW is the recommended default for most cases.
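The binary-quantization pattern (coarse search on 1-bit codes, exact re-scoring on a shortlist) can be illustrated in pure Python. This is a conceptual sketch with hypothetical function names and toy 4-dimensional vectors, not how pgvector implements it internally.

```python
import math
from typing import List, Sequence


def binarize(vec: Sequence[float]) -> int:
    """1 bit per dimension: bit i is set when component i is positive."""
    return sum(1 << i for i, x in enumerate(vec) if x > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(query: Sequence[float], vectors: List[Sequence[float]],
           shortlist: int = 3, top_k: int = 1) -> List[int]:
    """Coarse pass on binary codes, then exact re-scoring of the shortlist."""
    q_code = binarize(query)
    codes = [binarize(v) for v in vectors]  # precomputed at ingestion in practice
    coarse = sorted(range(len(vectors)), key=lambda i: hamming(q_code, codes[i]))
    candidates = coarse[:shortlist]
    return sorted(candidates, key=lambda i: cosine_distance(query, vectors[i]))[:top_k]


vectors = [
    [0.9, 0.1, -0.2, 0.4],
    [0.1, 0.9, -0.4, 0.2],
    [-0.5, -0.5, 0.5, -0.5],
    [0.8, 0.2, -0.1, 0.5],
]
print(search([1.0, 0.1, -0.2, 0.5], vectors, shortlist=3, top_k=2))
```

Three of the four vectors share the query's sign pattern, so the binary pass alone cannot order them; the exact re-scoring on full-precision vectors is what restores the ranking.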

Keep metadata (source, section, version, permissions, date) in indexed columns: they enable filtering (per tenant, per version), traceability, and sorting. They also make stale-content expiry possible — a classic production pitfall.

Evaluation: the loop everyone skips

A RAG you don't evaluate is a RAG you don't control. Evaluation turns "by feel" optimization into a steered process. Without it, every chunking or embedding change is a gamble: you don't know if it helps or hurts.

Build a golden set. A set of 50 to 200 representative questions with their expected answers and sources (the chunks or documents that should be retrieved). Build it from real user questions, covering easy cases, ambiguous cases, and out-of-scope cases (where the right answer is "I don't know"). It is your test bench for every change in chunking, embedding, or reranking.

RAGAS metrics in detail:

  • Context recall — of the passages that should have been retrieved, how many were? Measures retrieval quality (chunking + embedding + search). A low value means the information never surfaced: no prompt will save the answer.
  • Context precision — are the relevant passages ranked first? Measures ranking quality (fusion + reranking). A low value means the right passage is buried far down the list.
  • Faithfulness — is the generated answer fully grounded in the retrieved sources (no hallucination)? Measures generation quality.
  • Answer relevancy — does the answer truly address the question, without digression?
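The two retrieval metrics can be approximated without any LLM when the golden set stores the ids of the expected chunks. The functions below are simplified, id-based stand-ins written for illustration; they are not the RAGAS library's implementation, which typically relies on LLM judgments.

```python
from typing import List, Sequence


def context_recall(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """Share of the expected chunks that were actually retrieved."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was expected, nothing was missed
    return len(expected & set(retrieved_ids)) / len(expected)


def context_precision(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """Rank-aware precision: mean of precision@k taken at each position k
    where a relevant chunk appears. Rewards relevant chunks ranked early."""
    expected = set(expected_ids)
    hits = 0
    total = 0.0
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in expected:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


retrieved = ["c1", "c7", "c3", "c9"]  # ranked output of the retriever
expected = ["c1", "c3", "c5"]         # golden-set annotation
print(round(context_recall(retrieved, expected), 3))
print(round(context_precision(retrieved, expected), 3))
```

Here recall is penalized because c5 never surfaced, while precision is penalized because c3 sits behind an irrelevant chunk: two different problems, fixed at two different stages.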

LLM-as-judge: use a strong LLM to automatically score faithfulness and relevance at scale. The pitfall is judge reliability: calibrate it on 20-50 human ratings, check agreement (does the judge score like a human?), and prefer binary or 3-level rubrics over a noisy 1-to-10 continuous score.
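One way to check judge/human agreement is Cohen's kappa, which corrects raw agreement for chance. The choice of kappa is an assumption of this sketch (the text above only asks for an agreement check), and the labels are made-up illustration data.

```python
from typing import Hashable, Sequence


def cohen_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum((list(judge).count(l) / n) * (list(human).count(l) / n)
                   for l in labels)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Binary rubric (1 = faithful, 0 = not faithful) on 10 sampled answers.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
human_labels = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohen_kappa(judge_labels, human_labels), 3))
```

Raw agreement here is 90%, but kappa is lower because half of that agreement would be expected by chance; the gap widens on imbalanced label sets, which is where raw agreement misleads most.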

Effective optimization sequence: measure context recall first (does the info surface?), then context precision (is it ranked well?), then faithfulness (does the LLM stick to the sources?), then answer relevancy. Fix in pipeline order, measuring at each step. No point touching the prompt while recall is poor.

Observability and cost in production

Once live, a RAG must be instrumented like any critical system. Three axes to track continuously:

  • Quality over time. Replay the golden set on every deploy and track RAGAS metrics as regression tests. Watch for corpus drift: new documents can degrade precision (noise) or reveal unanswerable questions. Wire gap detection (queries with low retrieval scores) into corpus enrichment.
  • Latency, broken down by step. Trace each link: query embedding, dense search, lexical search, fusion, reranking, generation. Perceived latency is roughly their sum (less where steps run in parallel, as the dense and lexical searches can); you only optimize what you measure. Reranking and generation are the heaviest items.
  • Cost, broken down by item. Ingestion embedding (one-shot but heavy on a large corpus, especially with contextual retrieval), query embedding (recurring), reranking (per call), and generation (context tokens dominate — hence the funnel limiting to 5-8 chunks). Prompt caching on the generation side sharply cuts cost when the system prompt is stable.
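Per-step latency tracing needs very little machinery. A minimal sketch using only the standard library; `StepTimer` and the step names are hypothetical, and the `time.sleep` calls merely stand in for real pipeline work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StepTimer:
    """Accumulates wall-clock time per named pipeline step, in milliseconds."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the step raises, so failures are still timed.
            elapsed = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed

    def total_ms(self) -> float:
        return sum(self.durations_ms.values())


timer = StepTimer()
with timer.step("hybrid_search"):
    time.sleep(0.005)  # placeholder for dense + lexical search
with timer.step("rerank"):
    time.sleep(0.010)  # placeholder for the cross-encoder call
print({name: round(ms, 1) for name, ms in timer.durations_ms.items()})
```

Attaching `durations_ms` to the same log record as the query, the retrieved chunk ids, and the model versions keeps a single trace per request.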

Always log: the query, retrieved chunks (with scores), cited sources, model versions (embedding, reranker, LLM), and per-step latency. Without this traceability, a relevance incident is impossible to diagnose — and compliance (who saw what, from which source) becomes uncertain.

Production checklist

  • [ ] Structure-aware chunking, metadata kept (source, section, version, date)
  • [ ] Embedding model chosen by multilingual / cost / sovereignty; dimensions tuned (Matryoshka) where relevant
  • [ ] Hybrid search (dense + BM25) with RRF fusion (k = 60)
  • [ ] Reranking funnel (broad → fine), top-N calibrated to the latency target
  • [ ] Tuned HNSW index (m, ef_construction, ef_search), halfvec if scale demands
  • [ ] Permission filtering at retrieval (multi-tenant), ef_search adjusted for selective filters
  • [ ] Sourced answers + anti-hallucination guardrail ("I don't know")
  • [ ] Golden set (50-200 questions) and RAGAS metrics in place, tracked over time
  • [ ] Observability: logging of queries/sources/versions, per-step latency and cost
  • [ ] Gap detection wired into corpus enrichment

Common mistakes

  • Optimizing the prompt before retrieval — if the right passages don't surface (low recall), no prompt will save the answer.
  • All-vector — without lexical, you miss exact matches (codes, acronyms, identifiers).
  • No reranking — you drown the LLM in mediocre passages, and faithfulness drops.
  • No evaluation — no way to know if a change helps or hurts; you optimize blind.
  • A stale corpus — the best architecture serves wrong answers from obsolete sources. Version and expire.
  • Switching embedding model without re-embedding — mixing vectors from two models produces absurd rankings.

Conclusion

A production RAG is not "smarter" than a POC — it is more rigorous. Each link — chunking, embedding, hybrid search, reranking, storage, evaluation, observability — contributes, and their sum separates a system that answers correctly from one that answers "roughly." Build in pipeline order, measure at every step with a golden set, instrument quality, latency, and cost, and treat relevance as a metric to steer, not magic to hope for. That is how you cross the chasm from demo to production.
