5 signs your knowledge base isn't ready for AI
Your AI pilots dazzle in demos and disappoint in production? The culprit is rarely the model. Here are the five signs that reveal a knowledge base not ready for RAG — and how to fix each one.
Every user query reveals what is missing from your base. Detect gaps (vs mere retrieval problems), diagnose the three causes, and close the loop for a living corpus driven by real usage.
Your users constantly tell you what is missing from your knowledge base. Every question put to your assistant is a signal: either the answer exists and it finds it, or it does not exist — and that is exactly where the value hides. Most organizations let that signal go to waste. Used properly, it turns a static corpus into a living organism driven by real usage.
Here is how to detect knowledge-base gaps from user queries, and above all how to close the loop to fill them.
A gap is a question for which your corpus contains no satisfactory answer. That differs from a retrieval problem, where the information exists but is not surfaced correctly. Confusing the two is the most common mistake — and it sends teams to fix the wrong problem.
Take a concrete example. An internal RAG assistant gets the question: "What is the withdrawal period for a contract signed electronically with a German customer?" Three scenarios are possible. (a) The procedure exists, well chunked, and the system surfaces it with a high similarity score: all good. (b) The procedure exists in an 80-page PDF, but the relevant passage is drowned in a 1000-token chunk mixing five jurisdictions: that is a retrieval problem. (c) The procedure was never documented for Germany: that is a true gap. All three give the user a bad answer, but each calls for a radically different action.
Gap ≠ poor retrieval. The first is solved by creating content; the second by fixing the pipeline. Diagnosing correctly is half the work.
Gaps leave measurable traces. No single signal is reliable on its own; it is their combination that makes the diagnosis.
After reranking (Cohere Rerank 3, bge-reranker-v2-m3), no passage clearly stands out from the noise.
The similarity score depends on the embedding model and the metric (cosine, dot product). With a modern model like text-embedding-3-large or voyage-3 and normalized cosine similarity between 0 and 1, you often see, indicatively:
These figures are orders of magnitude to calibrate on your own data, never universal constants. The right method: take 100 to 200 manually annotated queries (answer found / not found), plot the score distribution, and set the threshold where the false-negative rate becomes acceptable. A threshold set too high flags as "gaps" questions whose answers already exist, so you create duplicates. Set it too low, and real gaps slip through.
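The calibration step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `max_false_gap_rate` budget, and the rule "flag a query as a gap when its score is strictly below the threshold" are all hypothetical choices. It fixes one reading of the error budget, namely the share of answered questions wrongly flagged as gaps.

```python
from typing import List, Tuple

# Each annotation is (top_similarity, answer_exists), taken from manual review.
Annotated = List[Tuple[float, bool]]


def calibrate_gap_threshold(annotated: Annotated, max_false_gap_rate: float = 0.05) -> float:
    """Return the highest threshold whose false-gap rate stays within budget.

    A query is flagged as a gap when its score is strictly below the threshold.
    The false-gap rate is the share of answered queries that get flagged.
    """
    answered = [score for score, exists in annotated if exists]
    if not answered:
        raise ValueError("need at least one query whose answer exists")

    candidates = sorted({score for score, _ in annotated})
    best = candidates[0]  # the minimum score flags nothing
    for threshold in candidates:
        false_gaps = sum(1 for score in answered if score < threshold)
        if false_gaps / len(answered) <= max_false_gap_rate:
            best = threshold  # the rate only grows with the threshold
    return best


def missed_gap_rate(annotated: Annotated, threshold: float) -> float:
    """Share of real gaps that the threshold fails to flag."""
    gaps = [score for score, exists in annotated if not exists]
    if not gaps:
        return 0.0
    return sum(1 for score in gaps if score >= threshold) / len(gaps)
```

Checking `missed_gap_rate` at the chosen threshold shows the other side of the trade-off: a stricter budget on false gaps lowers the threshold and lets more real gaps slip through.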
Never drive detection off a single frozen similarity threshold. Always cross it with reranker confidence and real feedback: a mid score plus a thumbs down is a far stronger signal than an isolated score.
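One way to cross the signals is a weighted score. The sketch below is only an illustration: the weights, the `[0, 1]` ranges, and the `"up"` / `"down"` feedback values are assumptions to tune on your own logs, not recommended constants.

```python
from typing import Optional


def gap_signal(
    top_similarity: float,
    rerank_confidence: float,
    feedback: Optional[str],
    w_similarity: float = 0.3,  # hypothetical weight
    w_rerank: float = 0.3,      # hypothetical weight
    w_feedback: float = 0.4,    # hypothetical weight
) -> float:
    """Combine three signals into one score in [0, 1]; higher means more gap-like.

    top_similarity and rerank_confidence are assumed to lie in [0, 1];
    feedback is "up", "down", or None when the user gave none.
    """
    sim = min(max(top_similarity, 0.0), 1.0)
    rerank = min(max(rerank_confidence, 0.0), 1.0)
    thumbs_down = 1.0 if feedback == "down" else 0.0
    return (
        w_similarity * (1.0 - sim)
        + w_rerank * (1.0 - rerank)
        + w_feedback * thumbs_down
    )


# A mid score with a thumbs down versus an isolated low score.
mid_with_thumbs_down = gap_signal(0.62, 0.50, "down")
isolated_low_score = gap_signal(0.40, 0.80, None)
```

With these illustrative weights, the mid score backed by negative feedback ranks above the isolated low score, which matches the intuition above.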
Once a gap is spotted, identify its cause before acting:
Before writing a new article, always check the information is not already sitting somewhere, misfiled. Too many duplicates are created for lack of diagnosis.
A simple test to decide between a gap and a pipeline problem: rephrase the question yourself with the exact keywords of the document you think exists, and rerun the search. If the right passage now surfaces, it is a retrieval problem (vocabulary, chunking). If it still does not surface and you cannot find the information anywhere by searching manually, it is a true gap.
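The rephrasing test can be written as a small decision function. The sketch below is hypothetical throughout: `make_keyword_search` is a toy word-overlap retriever standing in for your real search call, and the return labels are illustrative. The third branch (content found manually but never surfaced) is an interpretation of the misfiled-content case, not something the test above spells out.

```python
from typing import Callable, Iterable, List


def _tokens(text: str) -> List[str]:
    """Lowercase words with surrounding punctuation stripped."""
    stripped = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [word for word in stripped if word]


def make_keyword_search(corpus: Iterable[str]) -> Callable[[str], float]:
    """Toy retriever: share of query words found in the best-matching document."""
    docs = [set(_tokens(doc)) for doc in corpus]

    def search(query: str) -> float:
        words = set(_tokens(query))
        if not words or not docs:
            return 0.0
        return max(len(words & doc) / len(words) for doc in docs)

    return search


def diagnose(
    question: str,
    keyword_rephrasing: str,
    search: Callable[[str], float],
    threshold: float,
    found_manually: bool,
) -> str:
    """Classify a failed query using the rephrase-and-rerun test."""
    if search(question) >= threshold:
        return "answered"
    if search(keyword_rephrasing) >= threshold:
        return "retrieval problem"  # vocabulary or chunking issue
    if found_manually:
        # Assumption: the content exists but never reached the index.
        return "unindexed or misfiled content"
    return "true gap"
```

In practice `search` would wrap your retriever and return its top score, and `found_manually` records the outcome of a human search through the source systems.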
The most neglected link is clustering. Twenty users phrase the same gap in twenty different ways ("paid leave abroad," "expat time off," "days off while working remotely from Spain"…). Without grouping, you see twenty anecdotal incidents instead of one priority gap.
The method is direct and reuses the infrastructure you already have for RAG:
```sql
-- Sketch: surface likely-gap query clusters (PostgreSQL)
SELECT
    q.cluster_label,
    COUNT(*) AS volume,
    -- true median; AVG would return the mean instead
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY q.top_similarity) AS median_sim,
    -- share of queries in the cluster that received a thumbs down
    AVG(CASE WHEN q.feedback = 'down' THEN 1.0 ELSE 0.0 END) AS neg_rate
FROM query_logs q
WHERE q.created_at >= now() - interval '30 days'
  AND q.top_similarity < 0.60   -- calibrated gap threshold
GROUP BY q.cluster_label
HAVING COUNT(*) >= 5            -- ignore anecdotal noise
ORDER BY volume DESC;
```
You do not need an expensive dedicated tool: pgvector already stores your embeddings, and a weekly clustering + labeling job is enough to feed a dashboard. What matters is regularity, not sophistication.
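The grouping step itself can stay very simple. The sketch below uses greedy seed-based clustering on cosine similarity, chosen here only because it fits in a few lines of standard-library Python; it is an illustration, not necessarily the algorithm a production job should use. The `min_similarity` value is an assumption to tune, and the embeddings are presumed to be the ones already stored for retrieval.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(
    embeddings: Sequence[Sequence[float]],
    min_similarity: float = 0.8,  # hypothetical value, tune on your data
) -> List[int]:
    """Assign each query to the first cluster whose seed is similar enough.

    The first query that matches no existing seed becomes the seed of a
    new cluster. Returns one integer cluster id per input embedding.
    """
    seeds: List[Sequence[float]] = []
    labels: List[int] = []
    for vector in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if cosine(vector, seed) >= min_similarity:
                labels.append(cluster_id)
                break
        else:
            seeds.append(vector)
            labels.append(len(seeds) - 1)
    return labels
```

The resulting ids can then be mapped to readable labels and written back to a column such as `cluster_label`, so the reporting query has something to group by.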
One-off detection is useless; gap detection has to run as a continuous process:
```text
Queries ──► logging ──► gap scoring ──► clustering
                                            │
                           prioritization (frequency × value)
                                            │
                             content creation / correction
                                            │
                                verification ──► (loop)
```
Prioritize "on gut feeling," and you fill whatever is most visible. With a frequency × value score, you fill whatever pays off most. Assign each cluster a monthly frequency and a business value (1 = trivial, 5 = critical: contractual impact, paid support, compliance). The product gives a defensible processing order:
| Cluster | Frequency/month | Value (1-5) | Score | Rank |
|---|---|---|---|---|
| Remote-work policy outside France | 180 | 4 | 720 | 1 |
| B2B customer refund procedure | 90 | 5 | 450 | 2 |
| HQ cafeteria opening hours | 220 | 1 | 220 | 3 |
| SSO setup for a new customer | 40 | 5 | 200 | 4 |
The cafeteria, despite 220 queries, ranks below a refund procedure that is less than half as frequent but critical. Without value weighting, you would have documented the cafeteria hours before the contractual topic. That is exactly the trap of sorting by volume alone.
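The scoring is simple enough to automate. The sketch below recomputes the frequency × value product for the four clusters in the table; the `Cluster` type and function names are hypothetical, and the 1-to-5 validation simply mirrors the scale described above.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    label: str
    monthly_frequency: int
    business_value: int  # 1 = trivial, 5 = critical


def priority_score(cluster: Cluster) -> int:
    """Frequency × value, after checking the value is on the 1-5 scale."""
    if not 1 <= cluster.business_value <= 5:
        raise ValueError(f"business value out of range for {cluster.label!r}")
    return cluster.monthly_frequency * cluster.business_value


def rank_clusters(clusters: List[Cluster]) -> List[Cluster]:
    """Sort clusters by descending priority score."""
    return sorted(clusters, key=priority_score, reverse=True)


clusters = [
    Cluster("Remote-work policy outside France", 180, 4),
    Cluster("B2B customer refund procedure", 90, 5),
    Cluster("SSO setup for a new customer", 40, 5),
    Cluster("HQ cafeteria opening hours", 220, 1),
]

by_score = [c.label for c in rank_clusters(clusters)]
by_volume = [
    c.label
    for c in sorted(clusters, key=lambda c: c.monthly_frequency, reverse=True)
]
```

Comparing `by_score` with `by_volume` makes the effect of the weighting visible: the cafeteria leads on volume alone but falls behind both the remote-work policy and the refund procedure once business value is factored in.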
Measure these indicators before launching the effort; otherwise you will have no baseline to prove progress. A monthly dashboard is plenty: real-time tracking sounds appealing but adds little at this stage.
A knowledge base is never "done." Products evolve, procedures change, users ask new questions. By linking real queries to a prioritized enrichment process, you stop guessing what your users need: they tell you, query after query. All that's left is to listen — and close the loop.
Start small: log queries, calibrate a threshold on 100 annotated queries, cluster the last month, and tackle the top three clusters by score. You will get more value from two weeks of disciplined looping than from six months of guesswork enrichment.