
Detecting knowledge-base gaps with user queries

The RagNight team · 10 min read · December 16, 2025

Every user query reveals what is missing from your base. Detect gaps (vs mere retrieval problems), diagnose the three causes, and close the loop for a living corpus driven by real usage.

Your users constantly tell you what is missing from your knowledge base. Every question put to your assistant is a signal: either the answer exists and it finds it, or it does not exist — and that is exactly where the value hides. Most organizations let that signal go to waste. Used properly, it turns a static corpus into a living organism driven by real usage.

Here is how to detect knowledge-base gaps from user queries, and above all how to close the loop to fill them.

What is a knowledge gap?

A gap is a question for which your corpus contains no satisfactory answer. That differs from a retrieval problem, where the information exists but is not surfaced correctly. Confusing the two is the most common mistake — and it sends teams to fix the wrong problem.

Take a concrete example. An internal RAG assistant gets the question: "What is the withdrawal period for a contract signed electronically with a German customer?" Three scenarios are possible. (a) The procedure exists, well chunked, and the system surfaces it with a high similarity score: all good. (b) The procedure exists in an 80-page PDF, but the relevant passage is drowned in a 1000-token chunk mixing five jurisdictions: that is a retrieval problem. (c) The procedure was never documented for Germany: that is a true gap. All three give the user a bad answer, but each calls for a radically different action.

Gap ≠ poor retrieval. The first is solved by creating content; the second by fixing the pipeline. Diagnosing correctly is half the work.

How to detect them

Gaps leave measurable traces. No single signal is reliable on its own; it is their combination that makes the diagnosis.

  • Low similarity scores — the best retrieved match stays below a relevance threshold. The system is "scraping the bottom of the barrel."
  • Unsourced answers or "I don't know" — a well-built assistant refuses to invent; these refusals are gold, since they pinpoint the gaps.
  • Low reranker confidence — even after cross-encoder reranking (Cohere Rerank 3, bge-reranker-v2-m3), no passage clearly stands out from the noise.
  • Explicit negative feedback — thumbs down, cascading rephrasings, escalations to a human.
  • Repeated unresolved questions — the same theme recurs, never satisfied.

Indicative thresholds, not absolute truths

The similarity score depends on the embedding model and the metric (cosine, dot product). With a modern model like text-embedding-3-large or voyage-3 and normalized cosine similarity between 0 and 1, you often see, indicatively:

  • above 0.80: strong match, the content almost certainly exists;
  • between 0.60 and 0.80: gray zone, to be arbitrated with the reranker and feedback;
  • below 0.55-0.60: likely gap signal, especially if the best of the top-k passages stays low.

These figures are orders of magnitude to calibrate on your own data, never universal constants. The right method: take 100 to 200 manually annotated queries (answer found / not found), plot the score distribution, and set the threshold where the false-negative rate becomes acceptable. A threshold set too high classifies as "gap" questions whose answer existed: you create duplicates. Too low, and real gaps slip through.
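The calibration described above can be sketched in a few lines. This is a minimal illustration, not a reference method: it assumes each annotated query is a `(top_similarity, answer_found)` pair, reads "false negative" as a real gap that is not flagged, and uses a hypothetical tolerance parameter `max_missed_gap_rate`. It returns the lowest threshold that keeps missed gaps within tolerance, which also keeps false alarms (and therefore duplicates) as low as possible.

```python
from typing import List, Tuple


def calibrate_threshold(
    annotated: List[Tuple[float, bool]],   # (top similarity, answer_found)
    max_missed_gap_rate: float = 0.10,     # assumed tolerance, tune on your data
) -> Tuple[float, float, float]:
    """Return (threshold, missed_gap_rate, false_alarm_rate).

    A query is flagged as a gap when its score is strictly below the threshold.
    """
    gaps = [s for s, found in annotated if not found]
    answered = [s for s, found in annotated if found]
    if not gaps or not answered:
        raise ValueError("need both found and not-found examples")

    scores = sorted({s for s, _ in annotated})
    # Try every observed score, lowest first, plus one value above the maximum
    # so that a threshold flagging every query always exists.
    for t in scores + [scores[-1] + 1e-6]:
        missed = sum(1 for s in gaps if s >= t) / len(gaps)
        if missed <= max_missed_gap_rate:
            false_alarm = sum(1 for s in answered if s < t) / len(answered)
            return t, missed, false_alarm
    raise AssertionError("unreachable: the last candidate flags every query")


# Toy annotated sample (real calibration needs 100 to 200 queries).
sample = [
    (0.42, False), (0.48, False), (0.55, False), (0.58, True),
    (0.63, False), (0.71, True), (0.82, True), (0.88, True),
]
threshold, missed, false_alarm = calibrate_threshold(sample, max_missed_gap_rate=0.25)
print(threshold, missed, false_alarm)  # 0.58 0.25 0.0
```

Raising the tolerance lowers the threshold and reduces duplicates; lowering it catches more gaps at the cost of more false alarms to review.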

Never drive detection off a single frozen similarity threshold. Always cross-check it against reranker confidence and real feedback: a middling score plus a thumbs down is a far stronger signal than an isolated score.
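One way to combine the signals is a weighted blend. The sketch below is only an illustration: the field names and the weights are assumptions, not values from any library or from measured data, and both scores are assumed to lie between 0 and 1. The weights should be tuned against your annotated queries.

```python
from dataclasses import dataclass


@dataclass
class QuerySignals:
    top_similarity: float   # best retrieval score, assumed in [0, 1]
    reranker_score: float   # best reranker relevance, assumed in [0, 1]
    refused: bool           # unsourced answer or "I don't know"
    thumbs_down: bool       # explicit negative feedback
    rephrasings: int        # follow-up rephrasings in the same session


def _clamp(x: float) -> float:
    return min(1.0, max(0.0, x))


def gap_risk(s: QuerySignals) -> float:
    """Blend the signals into a 0..1 risk score. Weights are illustrative."""
    return (
        0.35 * (1.0 - _clamp(s.top_similarity))
        + 0.25 * (1.0 - _clamp(s.reranker_score))
        + 0.20 * float(s.refused)
        + 0.15 * float(s.thumbs_down)
        + 0.05 * min(max(s.rephrasings, 0), 2) / 2   # capped at two rephrasings
    )


mid_score_with_thumbs_down = QuerySignals(0.65, 0.50, False, True, 0)
low_score_alone = QuerySignals(0.50, 0.50, False, False, 0)

# With these weights, a middling score plus a thumbs down outranks a lower
# score that has no other signal attached.
print(gap_risk(mid_score_with_thumbs_down) > gap_risk(low_score_alone))  # True
```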

Three causes, three remedies

Once a gap is spotted, identify its cause before acting:

  1. The information does not exist → create content. A true gap.
  2. The information exists but is poorly indexed or chunked → a pipeline problem (chunking, embedding, metadata). Fix retrieval, not the corpus. Concrete levers: revisit the chunking strategy (smaller chunks, respect semantic boundaries), test contextual retrieval (prefix each chunk with a short LLM-generated context, a technique formalized by Anthropic), add hybrid search with BM25 + dense fused via RRF to catch exact-vocabulary queries (product references, error codes).
  3. The information exists but is contradictory or stale → a governance problem. The system finds incompatible answers and cannot pick one. Here, writing a new article only deepens the mess: you must designate a source of truth, archive obsolete versions, and date your content.
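For the hybrid-search lever mentioned under cause 2, Reciprocal Rank Fusion is short enough to sketch. This is a minimal version under assumptions: the document IDs are made up, and `k = 60` is the conventional constant rather than a tuned value. Each input is a ranked list of document IDs, for example one from BM25 and one from dense retrieval.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["err-4012", "faq-7", "guide-2"]     # exact-vocabulary matches
dense_hits = ["guide-2", "err-4012", "faq-9"]    # semantic matches
print(rrf_fuse([bm25_hits, dense_hits]))
# ['err-4012', 'guide-2', 'faq-7', 'faq-9']
```

Because RRF only uses ranks, it sidesteps the problem of BM25 and cosine scores living on different scales.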

Before writing a new article, always check the information is not already sitting somewhere, misfiled. Too many duplicates are created for lack of diagnosis.

A simple test to decide between a gap and a pipeline problem: rephrase the question yourself with the exact keywords of the document you think exists, and rerun the search. If the right passage now surfaces, it is a retrieval problem (vocabulary, chunking). If it still does not surface and you cannot find the information anywhere by searching manually, it is a true gap.
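The decision logic of this test, together with the three causes above, fits in one small function. It is a sketch with hypothetical names. The three boolean inputs are things a human reviewer establishes by hand; nothing here is detected automatically.

```python
from enum import Enum


class Diagnosis(Enum):
    TRUE_GAP = "create content"
    RETRIEVAL = "fix the pipeline"
    GOVERNANCE = "designate a source of truth"


def diagnose(
    surfaces_with_exact_keywords: bool,   # right passage appears after rephrasing
    found_manually: bool,                 # reviewer located the info by hand
    sources_conflict: bool = False,       # contradictory or stale versions exist
) -> Diagnosis:
    if sources_conflict:
        return Diagnosis.GOVERNANCE
    if surfaces_with_exact_keywords or found_manually:
        # The information exists somewhere: do not write a duplicate.
        return Diagnosis.RETRIEVAL
    return Diagnosis.TRUE_GAP


print(diagnose(surfaces_with_exact_keywords=False, found_manually=False).value)
# create content
```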

Tooling the query clustering

The most neglected link is clustering. Twenty users phrase the same gap in twenty different ways ("paid leave abroad," "expat time off," "days off while working remotely from Spain"…). Without grouping, you see twenty anecdotal incidents instead of one priority gap.

The method is direct and reuses the infrastructure you already have for RAG:

  1. Embed each query with the same model as your corpus (vector-space consistency).
  2. Group by similarity. For a few hundred to a few thousand queries, density-based clustering (e.g. HDBSCAN) avoids fixing the number of clusters in advance and cleanly isolates one-off queries as "noise." For larger volumes, a simple cosine-similarity threshold (e.g. attach a query to a cluster when its similarity to the centroid exceeds 0.75) often suffices.
  3. Label each cluster automatically by asking an LLM to summarize the 5-10 representative queries in one sentence — this yields a readable label ("Remote-work policy outside France") far more useful than a cluster ID.
  4. Keep, per cluster: monthly volume, median answer similarity score, negative-feedback rate.

-- Sketch (PostgreSQL): surface likely-gap query clusters over the last 30 days
SELECT
  q.cluster_label,
  COUNT(*)                                                      AS volume,
  -- AVG would return the mean; percentile_cont(0.5) gives the actual median
  percentile_cont(0.5) WITHIN GROUP (ORDER BY q.top_similarity) AS median_sim,
  AVG(CASE WHEN q.feedback = 'down' THEN 1.0 ELSE 0.0 END)      AS neg_rate
FROM query_logs q
WHERE q.created_at >= now() - interval '30 days'
  AND q.top_similarity < 0.60          -- calibrated gap threshold
GROUP BY q.cluster_label
HAVING COUNT(*) >= 5                    -- ignore anecdotal noise
ORDER BY volume DESC;

You do not need an expensive dedicated tool: pgvector already stores your embeddings, and a weekly clustering + labeling job is enough to feed a dashboard. What matters is regularity, not sophistication.
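The centroid-threshold variant from step 2 can be sketched without any dependency. This is an illustrative greedy pass, not HDBSCAN: each query joins the most similar existing cluster if its cosine similarity to that centroid reaches the threshold, otherwise it opens a new cluster. The 2-D vectors below are toy stand-ins for real query embeddings.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(vectors: List[Sequence[float]], threshold: float = 0.75) -> List[int]:
    """Return one cluster index per input vector, in input order."""
    centroids: List[List[float]] = []
    sizes: List[int] = []
    labels: List[int] = []
    for v in vectors:
        best, best_sim = -1, threshold
        for i, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            # No centroid is close enough: open a new cluster.
            centroids.append(list(v))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched centroid.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], v)]
            sizes[best] = n + 1
            labels.append(best)
    return labels


toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [0.95, 0.05]]
print(greedy_cluster(toy_embeddings))  # [0, 0, 1, 1, 0]
```

The result depends on input order, which is acceptable for a weekly dashboard but is one reason to prefer density-based clustering at smaller volumes.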

Setting up the detection loop

A one-off detection pass is useless; detection has to run as a continuous process:

  1. Log every query with retrieval score, confidence, answer status (sourced / unsourced / refusal), and any feedback.
  2. Score each query's gap risk (a combination of the signals above).
  3. Cluster unsatisfied questions: many phrasings often point to one gap.
  4. Prioritize by frequency × business value.
  5. Assign content creation or correction to an owner, with a deadline.
  6. Verify the fix actually resolved it (the next question on the topic should now find an answer).
Queries ──► logging ──► gap scoring ──► clustering
                                            │
                              prioritization (frequency × value)
                                            │
                              content creation / correction
                                            │
                                   verification ──► (loop)
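Step 1 of the loop is the foundation for everything else, so here is a sketch of what one log entry might look like. The field names are assumptions chosen to line up with the `query_logs` columns used in the SQL sketch earlier; adapt them to your own schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

ANSWER_STATUSES = {"sourced", "unsourced", "refusal"}


@dataclass
class QueryLog:
    query: str
    top_similarity: float            # best retrieval score
    reranker_score: float            # reranker confidence
    answer_status: str               # one of ANSWER_STATUSES
    feedback: Optional[str] = None   # "up", "down", or None when absent
    created_at: str = ""             # ISO 8601, filled in when empty

    def __post_init__(self) -> None:
        if self.answer_status not in ANSWER_STATUSES:
            raise ValueError(f"unknown answer_status: {self.answer_status!r}")
        if not self.created_at:
            self.created_at = datetime.now(timezone.utc).isoformat()


def to_json_line(entry: QueryLog) -> str:
    """Serialize one entry as a JSON line, ready to append to a log file."""
    return json.dumps(asdict(entry), ensure_ascii=False)


entry = QueryLog(
    query="withdrawal period for a contract signed with a German customer",
    top_similarity=0.52,
    reranker_score=0.31,
    answer_status="refusal",
    feedback="down",
)
print(to_json_line(entry))
```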

A worked prioritization example

Prioritize by gut feeling and you fill whatever is most visible. With a frequency × value score, you fill whatever pays off most. Assign each cluster a monthly frequency and a business value (1 = trivial, 5 = critical: contractual impact, paid support, compliance). The product gives a defensible processing order:

Cluster                              Frequency/month   Value (1-5)   Score   Rank
Remote-work policy outside France    180               4             720     1
B2B customer refund procedure        90                5             450     2
HQ cafeteria opening hours           220               1             220     3
SSO setup for a new customer         40                5             200     4

The cafeteria, despite 220 queries, ranks below a refund procedure that is less than half as frequent but critical. Without value weighting, you would have documented the lunch menu before the contractual topic. That is exactly the trap of sorting by volume alone.
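The scoring itself is a one-liner. The sketch below recomputes the ranking from the frequencies and values in the table; the function name is illustrative.

```python
from typing import List, Tuple

# (cluster label, queries per month, business value 1-5)
clusters = [
    ("Remote-work policy outside France", 180, 4),
    ("B2B customer refund procedure", 90, 5),
    ("SSO setup for a new customer", 40, 5),
    ("HQ cafeteria opening hours", 220, 1),
]


def prioritize(items: List[Tuple[str, int, int]]) -> List[Tuple[str, int]]:
    """Score each cluster as frequency × value and sort highest first."""
    scored = [(label, freq * value) for label, freq, value in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for rank, (label, score) in enumerate(prioritize(clusters), start=1):
    print(rank, score, label)
# 1 720 Remote-work policy outside France
# 2 450 B2B customer refund procedure
# 3 220 HQ cafeteria opening hours
# 4 200 SSO setup for a new customer
```

If a plain product lets a high-volume trivial topic creep too far up, a steeper value scale (for example squaring the value) is one possible adjustment; that is a design choice to validate with the business owners, not something the formula dictates.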

Metrics to track

  • Gap rate — share of queries with no satisfactory answer. The trend should fall; a plateau at 5-10% is often where marginal effort stops paying off.
  • Time-to-fill — time between detecting a priority gap and resolving it. A lengthening delay reveals a bottleneck on content production, not on detection.
  • Sourced-answer rate evolution — the overall corpus-health indicator.
  • Recurrence — a filled gap should not return; its reappearance signals a governance problem (the new content is not found, or it is contradicted by another source).

Measure these indicators before launching the effort, otherwise you will have no baseline to prove progress. A monthly dashboard is plenty; real-time monitoring sounds appealing but adds little at this stage.
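The first three metrics can be computed directly from the query logs and a list of filled gaps. This is a minimal sketch under assumptions: a query counts toward the gap rate when its answer is not sourced or when it received a thumbs down, and the log field names are hypothetical.

```python
from datetime import date
from statistics import median
from typing import Dict, List, Optional, Tuple

Log = Dict[str, Optional[str]]


def gap_rate(logs: List[Log]) -> float:
    """Share of queries with no satisfactory answer (assumed definition)."""
    unsatisfied = [
        entry for entry in logs
        if entry["answer_status"] != "sourced" or entry["feedback"] == "down"
    ]
    return len(unsatisfied) / len(logs)


def sourced_rate(logs: List[Log]) -> float:
    return sum(1 for entry in logs if entry["answer_status"] == "sourced") / len(logs)


def time_to_fill_days(filled: List[Tuple[date, date]]) -> float:
    """Median days between detecting a priority gap and resolving it."""
    return float(median((resolved - detected).days for detected, resolved in filled))


logs = [
    {"answer_status": "sourced", "feedback": None},
    {"answer_status": "sourced", "feedback": "down"},
    {"answer_status": "refusal", "feedback": None},
    {"answer_status": "unsourced", "feedback": None},
]
filled = [
    (date(2025, 11, 3), date(2025, 11, 10)),
    (date(2025, 11, 5), date(2025, 11, 26)),
]
print(gap_rate(logs), sourced_rate(logs), time_to_fill_days(filled))  # 0.75 0.5 14.0
```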

Common pitfalls

  • Confusing gap and retrieval — creating duplicate content when the info existed, misindexed. This is the #1 trap, and it is costly: double maintenance, future contradictions.
  • Not closing the loop — detection without correction is worthless. Detection only has value when tied to an action and a named owner.
  • Prioritizing on gut feeling — without frequency and value, you fill trivia and leave real gaps.
  • Ignoring temporal context — some gaps are seasonal (product launch, fiscal year-end, leave-booking season). Volume alone can mislead: a gap that spikes in November then vanishes is not as urgent as a structural one.
  • Vectorizing without cleaning queries — internal test questions, empty queries, pasted error logs: this noise skews clustering. Filter before you analyze.
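A pre-clustering filter for the last pitfall might look like the sketch below. The heuristics (length limits, the list of test words, the stack-trace patterns) are assumptions to adapt to your own traffic, not a complete noise detector.

```python
import re
from typing import Iterable, List

# Queries made only of typical test words, optionally followed by digits.
TEST_ONLY = re.compile(r"(test|testing|asdf|foo|hello)[\s\d!.]*", re.IGNORECASE)
# Pasted Python tracebacks or Java-style stack frames.
LOG_MARKERS = re.compile(
    r"Traceback \(most recent call last\)|^\s*at \S+\(.*\)\s*$",
    re.MULTILINE,
)


def is_noise(query: str, max_chars: int = 500, max_lines: int = 5) -> bool:
    text = query.strip()
    if len(text) < 3:                       # empty or near-empty
        return True
    if len(text) > max_chars or text.count("\n") >= max_lines:
        return True                         # likely a pasted document or log
    if TEST_ONLY.fullmatch(text):
        return True
    return bool(LOG_MARKERS.search(text))


def clean_queries(queries: Iterable[str]) -> List[str]:
    """Keep only the queries worth embedding and clustering."""
    return [q.strip() for q in queries if not is_noise(q)]


raw = [
    "",
    "test 123",
    "What is the withdrawal period for a German customer?",
    'Traceback (most recent call last):\n  File "app.py", line 3',
]
print(clean_queries(raw))
# ['What is the withdrawal period for a German customer?']
```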

A living corpus

A knowledge base is never "done." Products evolve, procedures change, users ask new questions. By linking real queries to a prioritized enrichment process, you stop guessing what your users need: they tell you, query after query. All that's left is to listen — and close the loop.

Start small: log queries, calibrate a threshold on 100 annotated queries, cluster the last month, and tackle the top three clusters by score. You will get more value from two weeks of disciplined looping than from six months of guesswork enrichment.

