Your RAG's relevance is decided at chunking, before the embedding. 2026 strategies from fixed to semantic, late chunking and contextual retrieval, with trade-offs, pitfalls and concrete recommendations.
When a RAG system answers badly, the first instinct is to blame the embedding model, the reranker or the LLM. In the vast majority of cases the culprit sits upstream, in a stage almost no one treats with the seriousness it deserves: chunking. It decides what gets vectorized, and therefore what can — or cannot — surface for a query. No reranker can recover information that the chunking erased.
Chunking is underestimated because it looks trivial: split a text into pieces, where's the difficulty? The difficulty is that every cut is a semantic decision. Set badly, it splits an idea in two, mixes two topics into a single vector, or produces fragments so generic that vector similarity can no longer tell them apart. RAG relevance is largely decided here, before the embedding. This article surveys chunking strategies, from the most naive to the 2026 approaches (late chunking, contextual retrieval), with trade-offs and concrete recommendations.
Why bad chunking breaks RAG
An embedding is a compression. You summarize a passage of text into a vector of a few hundred to a few thousand dimensions. The more coherent and focused the passage, the more discriminative the vector. Bad chunking attacks this property in several ways.
Loss of context. An isolated chunk loses what came before it. The sentence "This limit does not apply to contracts predating 2023" is useless on its own: which limit, which contracts? The demonstrative "This" and the unqualified "contracts" both point to context living in the previous chunk. Vectorized as is, this passage carries an ambiguity the embedding cannot resolve, and it is unlikely to surface for the right query.
Arbitrary boundaries in the middle of an idea. Fixed-size splitting cuts at 512 tokens, full stop. If a paragraph's key argument straddles the boundary, it ends up split across two chunks, neither of which contains the complete idea. The query looking for it weakly matches both halves, and the relevant passage drowns under partial results.
Semantic dilution of large chunks. To avoid fragmentation, some teams go up to 1500-2000 tokens per chunk. The problem: a chunk that large often covers several sub-topics. Its vector becomes a blurry average that strongly resembles no precise query. You retrieve the right document but bury the LLM in noise, which degrades answer faithfulness and inflates generation token costs.
Over-fragmentation of small chunks. Conversely, 50-100 token chunks produce very specific vectors stripped of context. You multiply index entries, raise the risk of matching an off-topic fragment, and force the LLM to mentally stitch scattered pieces back together.
Key takeaway: chunking is a trade-off between specificity (small, focused chunks) and context completeness (large, self-contained chunks). The whole craft is finding the balance point for your corpus and your embedding model.
Strategies, from naive to advanced
Fixed-size splitting (tokens/characters + overlap)
Everyone's starting point: cut every N tokens (or characters) with an overlap of a few dozen tokens between consecutive chunks. The overlap is a safety net: an idea cut at a boundary stands a chance of being complete in the neighboring chunk.
Pros: simple, fast, deterministic, no external call. Cons: it completely ignores document structure. It cuts mid-sentence, mid-table, mid-code-block. An acceptable starting point for prototyping, rarely an endpoint for production.
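A minimal sketch of fixed-size splitting with overlap, to make the sliding window concrete. The function name and the whitespace "tokenizer" are assumptions for illustration; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Cut a token list every `size` tokens, repeating `overlap` tokens between neighbors."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace "tokens" stand in for a real tokenizer (assumption).
tokens = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
# chunks[0] ends with "four" and chunks[1] starts with "four": that is the overlap.
```

Note that nothing here looks at sentences or paragraphs, which is exactly the weakness described above.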
Recursive structure-aware splitting
Recursive splitting tries to cut on hierarchical separators: first section breaks (headings), then paragraphs (\n\n), then sentences, and only as a last resort mid-sentence. You aim for a target size but always prefer a nearby natural boundary over a brutal cut.
This is today's reasonable default for prose. It respects paragraphs, keeps sentences whole, and stays cheap. Most RAG libraries implement it as the recursive character/text splitter. As a rule of thumb, a well-tuned recursive splitter is enough for the large majority of textual corpora.
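The recursion can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular library: it measures size in characters rather than tokens, applies no overlap, and drops the separator wherever it places a boundary.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that works, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: a hard cut, as in fixed-size splitting.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbors
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph. It has two sentences.\n\nSecond paragraph is here.\n\nThird."
chunks = recursive_split(text, max_chars=40)
```

With a 40-character budget, the first paragraph stays whole and the two short ones are merged, so no sentence is cut.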
Semantic splitting
Rather than cutting on characters, you cut when meaning changes. You embed sentences (or sentence groups), measure similarity between neighbors, and place a boundary where similarity drops below a threshold — a sign of topic change. Chunks then follow the text's real thematic units.
Elegant, and often better on heterogeneous documents (reports mixing multiple themes). The cost: you must embed all sentences upfront, the result depends on the threshold (to be calibrated), and chunk size becomes variable, which can complicate the context budget. Reserve it for corpora where the quality gain justifies the effort.
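A minimal sketch of the boundary rule, assuming sentences are already split. The bag-of-words `toy_embed` is a stand-in chosen so the example runs without a model; in practice you would pass a real sentence encoder as `embed`, and the `threshold` value would need calibration on your corpus.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: bag-of-words counts. Swap in a real sentence encoder."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever similarity between neighbors falls below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: topic change
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The refund limit is 5000 euros per trip.",
    "The refund limit applies to travel expenses.",
    "Our office dog is called Pixel.",
]
chunks = semantic_chunks(sentences)
```

The first two sentences share vocabulary and stay together; the third shares none with its neighbor and opens a new chunk.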
Structure-aware (Markdown/HTML/tables/code)
Technical documents are not prose. A Markdown file has headings and lists; HTML has a DOM; a table has rows and columns; code has functions and classes. Structure-aware splitting parses the format and cuts on its natural boundaries: one chunk per section under its heading, a table kept whole (or split row by row with the header repeated), one function per chunk for code.
Key takeaway: there is no universal strategy. The right split depends on the document type. A serious production pipeline routes each format to the appropriate splitter rather than applying one fixed split to everything.
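For Markdown, the idea of "one chunk per section under its heading" could look like the sketch below. It is an illustration under simplifying assumptions (ATX `#` headings only, fenced code detected by a leading triple backtick); the function name and the `section`/`text` keys are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code fence marker


def split_markdown_sections(markdown):
    """Return one chunk per heading-delimited section, tagged with its heading path."""
    sections, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"section": " > ".join(path), "text": text})
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # "# comment" inside code is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections


doc = "\n".join([
    "# Guide", "Intro text.",
    "## Limits", "The limit is 5000.",
    "## Code", FENCE, "# not a heading", FENCE,
])
sections = split_markdown_sections(doc)
```

Each chunk keeps its heading path (for example "Guide > Limits"), which doubles as the section title to store in metadata.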
Late chunking: embed first, split later
Late chunking, popularized by Jina, reverses the usual order. Instead of embedding each chunk independently, you pass the whole document (or a long passage) through a long-context embedding model, retrieve token-level embeddings, and only afterwards aggregate them per chunk (typically by pooling over each chunk's tokens).
The benefit is direct: each chunk embedding has "seen" the entire document thanks to attention. The infamous "This limit does not apply…" benefits from upstream context because the token representations were computed in the presence of the preceding sentences. You preserve long context without inflating chunk size or multiplying calls.
Requirements: a long-context embedding model (Jina embeddings v3, or other encoders supporting large windows). Limit: the model's context window is finite, so very long documents must be processed in blocks. But for medium-sized documents, late chunking offers a near-free context gain, with no generator LLM in the loop.
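The pooling step can be sketched without any model. The token vectors below are made-up stand-ins for the per-token output of a long-context encoder run once over the whole document; `late_chunk` and `chunk_spans` are illustrative names, not part of any library.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def late_chunk(token_embeddings, chunk_spans):
    """Pool document-level token embeddings over each chunk's (start, end) token span."""
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]


# Toy stand-in for the encoder output: one contextualized vector per token.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
chunk_spans = [(0, 2), (2, 5)]  # token offsets of each chunk, end exclusive

chunk_vectors = late_chunk(token_embeddings, chunk_spans)
# chunk_vectors == [[2.0, 0.0], [0.0, 4.0]]
```

The chunk boundaries themselves can come from any of the splitters above; what changes is that each token vector was computed with the rest of the document in view before being averaged.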
Contextual retrieval: prefix each chunk with generated context
Anthropic's contextual retrieval tackles the same problem — context loss — from another angle. Before vectorizing each chunk, you ask an LLM to produce a short explanatory context situating the chunk within the document, then prefix that context to the chunk before embedding (and before the BM25 index).
Concretely, a chunk about "the €5,000 limit" becomes something like: "Excerpt from the 2024 framework contract, Reimbursement Caps section, concerning travel expenses. The €5,000 limit…". The chunk now carries the coordinates it was missing, which markedly improves recall: in its own evaluation, Anthropic reports that contextual embeddings cut the top-20 retrieval failure rate by 35%, by 49% when combined with contextual BM25, and by 67% once reranking is added.
Key takeaway: late chunking and contextual retrieval target the same ailment — the amnesic chunk. The first solves it via the embedding model's attention (cheap, no LLM), the second via context generation (more expensive, more universal and BM25-compatible).
Cost is the real issue. Generating context per chunk at indexing means one LLM call per chunk. At the scale of hundreds of thousands of chunks, that is not negligible. Two mitigations: use a small, cheap model for context generation (Mistral Small 3, Gemma 3, or an open model hosted in the EU for sovereignty), and exploit prompt caching by keeping the full document cached to process all its chunks in a row. It is a one-time indexing cost, amortized over the corpus lifetime.
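A minimal sketch of the prefixing step, assuming a `generate` callable that sends a prompt to whichever small model you chose. The prompt wording is illustrative, and `fake_llm` is a hypothetical stand-in so the example runs offline.

```python
CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write one or two sentences situating this chunk within the document, to improve search retrieval. Answer with the context only."""


def contextualize(document, chunks, generate):
    """Prefix each chunk with LLM-generated context; `generate` maps a prompt to text."""
    contextualized = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call."""
    return "Excerpt from the 2024 framework contract, Reimbursement Caps section."


texts = contextualize("(full contract text)", ["The €5,000 limit applies."], fake_llm)
```

Placing the full document at the start of the prompt keeps the prefix identical across all chunks of that document, which is what lets prompt caching pay off.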
Comparison table
| Strategy | Complexity | Indexing cost | Relevance gain | When to use |
| --- | --- | --- | --- | --- |
| Fixed + overlap | Very low | Near zero | Low | Prototype, simple homogeneous corpus |
| Recursive structural | Low | Near zero | Medium | Reasonable default for prose |
| Semantic | Medium | Sentence embeddings | Medium to high | Heterogeneous, multi-topic documents |
| Structure-aware | Medium | Low (parsing) | High (on structured formats) | Markdown, HTML, code, tables |
| Late chunking | Medium | Low (1 encoder pass) | High | Long-context embedding model available |
| Contextual retrieval | High | High (1 LLM call per chunk) | Very high | High-value corpus, recall critical |
These approaches are not mutually exclusive. In production it is common to combine recursive or structure-aware splitting with contextual retrieval, then add hybrid search and reranking on top.
Concrete pitfalls
- Cut tables. Fixed splitting slices a table right through the middle: orphan rows lose their headers and become uninterpretable. Detect tables and keep them whole, or repeat the header row in each sub-chunk.
- Mutilated code. Cutting a function in the middle produces syntactically absurd chunks. For code, split on function/class boundaries and keep imports or the signature in context.
- Shattered lists. A bulleted list cut apart loses the intro sentence that gives it meaning. Treat the intro + list as one unit.
- Mis-tuned overlap. Too low and it protects nothing; too high (e.g. 50%) and it massively duplicates content, inflates the index, skews similarity scores and raises costs. Aim for 10-20%.
- Lost metadata. The most frequent and most costly mistake. If you don't store the source, section title, page and chunk position, you can neither cite correctly, nor filter, nor debug a bad result. Metadata is not optional.
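For the cut-table pitfall, repeating the header in each sub-chunk can be sketched as follows. This assumes a Markdown pipe table whose first two lines are the header row and the separator row; the function name and `rows_per_chunk` parameter are illustrative.

```python
def split_markdown_table(table, rows_per_chunk=20):
    """Split a pipe table into sub-chunks, repeating the header and separator rows."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, body = lines[:2], lines[2:]  # header row + |---| separator row
    if not body:
        return ["\n".join(header)]
    return [
        "\n".join(header + body[i:i + rows_per_chunk])
        for i in range(0, len(body), rows_per_chunk)
    ]


table = """
| Plan | Limit |
| --- | --- |
| Basic | 1000 |
| Pro | 5000 |
| Enterprise | 20000 |
"""
parts = split_markdown_table(table, rows_per_chunk=2)
```

Every sub-chunk now starts with "| Plan | Limit |", so an isolated row such as "| Enterprise | 20000 |" remains interpretable.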
Practical recommendations
Reasonable starting points, to validate on your golden set with RAGAS (context recall, context precision):
- Size: ~256-512 tokens. Tune it to your embedding model. Many recent models (text-embedding-3-large, voyage-3, bge-m3) are comfortable around 512; go down toward 256 if your queries are very precise.
- Overlap: 10-20%, i.e. ~50-100 tokens for 512-token chunks. No more.
- Respect structure. Recursive at minimum, structure-aware for rich formats.
- Keep source and section in metadata, always.
- Consider late chunking or contextual retrieval when recall becomes critical, weighing the indexing cost.
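To compare two chunking settings on a golden set before reaching for a full evaluation framework, a simple ID-based recall can serve as a first proxy. This is not the RAGAS metric, only a rough sketch: `retrieve` is a hypothetical callable returning ranked identifiers, and the identifiers should refer to something stable across configurations (a source passage or page, say), since chunk indices change when chunk size changes.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Share of relevant items that appear among the top-k retrieved ones."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall(golden_set, retrieve, k=5):
    """Average recall@k over (query, relevant_ids) pairs for one chunking config."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in golden_set]
    return sum(scores) / len(scores) if scores else 0.0


# One relevant item out of two is found in the top 2.
score = recall_at_k(["a", "b", "c"], ["a", "d"], k=2)
```

Running `mean_recall` once per candidate configuration (256 vs 512 tokens, for instance) gives a number to iterate against.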
Pseudo-config for a pipeline:
chunking:
  strategy: recursive            # recursive | semantic | structure_aware
  target_tokens: 512
  overlap_tokens: 80             # ~16%
  separators: ["\n## ", "\n\n", "\n", ". ", " "]
  keep_tables_intact: true
  code_split_on: ["function", "class"]
contextual_retrieval:
  enabled: true
  context_model: mistral-small-3 # small EU model, prompt caching on
metadata:
  - source_uri
  - section_title
  - page
  - chunk_index
And the matching indexing pseudo-pipeline:
for doc in corpus:
    blocks = structure_aware_split(doc, target=512, overlap=80)
    for i, block in enumerate(blocks):
        context = small_llm.summarize_context(doc, block)  # contextual retrieval
        text = f"{context}\n\n{block.text}"
        vector = embed(text)  # or late chunking
        index.upsert(
            id=f"{doc.id}:{i}",
            vector=vector,
            text=block.text,
            metadata={
                "source_uri": doc.uri,
                "section_title": block.section,
                "page": block.page,
                "chunk_index": i,
            },
        )
Chunking is not an implementation detail you set once and forget. It is a first-order quality lever, to be measured and iterated like any other piece of the pipeline. Get it right and half your relevance problems vanish before you even touch the reranker.