Single-pass RAG fails on multi-hop and comparison questions. Agentic RAG lets an agent decompose the question, retrieve iteratively, and stop at the right time, in exchange for extra cost and the guardrails needed to contain it. When to use it, and when not.
Classic RAG always follows the same ritual: receive a question, retrieve passages, generate an answer. One question, one retrieval, one answer. That works for simple factual questions. But as soon as the question gets harder — several sub-questions, a need to cross sources, ambiguity to resolve — this rigid scheme shows its limits. Agentic RAG breaks the ritual: it lets an agent decide what to retrieve, when, and above all when to stop.
Here is what changes, how it works turn by turn, and when the extra complexity is worth the price.
The limit of single-pass RAG
Single-pass RAG makes a strong assumption: one retrieval is enough to answer. Many real questions violate it: multi-hop questions ("what is the leave policy of the country where our biggest office is?" requires identifying the office first, then retrieving the right country's policy), comparisons (two contracts, two versions), ambiguous questions (clarify before searching), and broad-scope questions (one batch of passages won't cover it).
On these, single-pass RAG retrieves "roughly" and confidently generates a partial answer. The problem isn't generation: it's that the initial query does not contain the information needed to retrieve the right passages. Take the leave example: no chunk talks about both the "biggest office" and a "leave policy." The question's embedding points into a semantic void. The model gets shaky passages and answers anyway — exactly the scenario that produces confident hallucinations.
Classic RAG answers the question it thinks it understood, in one retrieval. Agentic RAG takes time to decompose, search, verify, and retry if needed.
What a RAG agent does
A RAG agent has a reasoning loop and tools (at minimum, search). Each turn it decides: should it retrieve? (some questions don't need the corpus), what to retrieve? (rewrite, decompose into sub-queries, target a source), is the information sufficient? (if a piece is missing, search again), and when to stop? (once context is complete enough — not too early, not looping forever).
Question ─► agent reasons ─► retrieve? what?
                  ▲                 │
                  │             retrieval
            enough info? ◄──────────┘
                  │ yes
           verified answer
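The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `Decision`, `AgentState`, `run_agent`, and the `decide` and `search` callables are hypothetical names. In a real system `decide` would wrap an LLM call and `search` a retriever.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Decision:
    action: str        # "retrieve" or "answer"
    query: str = ""    # sub-query to run when action == "retrieve"
    answer: str = ""   # final text when action == "answer"


@dataclass
class AgentState:
    question: str
    passages: List[str] = field(default_factory=list)
    turns: int = 0


def run_agent(
    question: str,
    decide: Callable[[AgentState], Decision],
    search: Callable[[str], List[str]],
    max_turns: int = 8,
) -> Tuple[str, AgentState]:
    state = AgentState(question)
    while state.turns < max_turns:
        state.turns += 1
        decision = decide(state)          # one LLM call per turn in practice
        if decision.action == "answer":   # context judged sufficient: stop
            return decision.answer, state
        state.passages.extend(search(decision.query))
    # Turn cap reached: return what was gathered and flag the uncertainty.
    return "Uncertain (turn cap reached): " + " | ".join(state.passages), state


# Toy stand-ins so the sketch runs without a model or an index.
def toy_search(query: str) -> List[str]:
    return [f"passage about {query}"]


def toy_decide(state: AgentState) -> Decision:
    if not state.passages:
        return Decision("retrieve", query=state.question)
    return Decision("answer", answer=f"Answer based on {len(state.passages)} passage(s).")


answer, final_state = run_agent("leave policy Portugal", toy_decide, toy_search)
print(answer, "| turns:", final_state.turns)
```

The stop condition lives entirely in `decide`, which is where the "enough info?" judgment sits. The `max_turns` parameter is only a safety net for the case where that judgment never says yes.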
A full example, turn by turn
To make the loop tangible, let's run the multi-hop question "What is the leave policy of the country where our biggest office is, and how many days more than the local legal minimum does it grant?" The corpus holds an HR directory, headcount sheets per site, and leave policies per country.
Turn 1 — decomposition. The agent reads the question and realizes it can't solve it in one shot. It builds a plan: (a) find the office with the most headcount, (b) identify its country, (c) retrieve that country's leave policy, (d) compare to the legal minimum. First search issued: headcount per office.
Turn 2 — first retrieval. Search returns a sheet listing sites: Paris 240, Lisbon 310, Berlin 180. The agent reads, infers the biggest office is Lisbon (310), so the country is Portugal. It records this intermediate fact in its working context. It still can't answer, so it follows up with the sub-query leave policy Portugal.
Turn 3 — targeted retrieval. Search returns the passage "Portugal: 25 working days of paid leave per year." The agent now has the core of the answer. But the question also asked for the gap to the legal minimum. That piece is missing. New sub-query: legal minimum paid leave Portugal.
Turn 4 — closing the gap. Search returns "legal minimum in Portugal: 22 working days." The agent now holds the three facts: Lisbon = biggest office, policy = 25 days, legal minimum = 22 days.
Turn 5 — self-verification and answer. Before delivering, the agent checks each claim against a cited source: 310 headcount (sites sheet), 25 days (PT policy), 22 days (legal reference). The arithmetic 25 − 22 = 3 is done without a tool. Answer: "The biggest office is Lisbon (Portugal). The policy grants 25 days, that is 3 more than the legal minimum of 22." The turn cap (say 8) was not reached; the agent stops because the context covers the whole question.
Note what a single-pass RAG would have produced: one search on the whole question, passages vaguely mixing "office," "leave," and "legal," and a plausible but unverified answer — correct by luck at best, inventing a country or a number at worst.
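The dependency chain in this walkthrough can be replayed on a toy corpus. The sketch below uses only the figures from the example. The dictionaries stand in for real search calls, and the names (`HEADCOUNT`, `OFFICE_COUNTRY`, `answer_leave_question`, and so on) are hypothetical.

```python
# Toy corpus built from the example's figures; dict lookups stand in for search.
HEADCOUNT = {"Paris": 240, "Lisbon": 310, "Berlin": 180}                          # sites sheet
OFFICE_COUNTRY = {"Paris": "France", "Lisbon": "Portugal", "Berlin": "Germany"}  # HR directory
POLICY_DAYS = {"Portugal": 25}       # company leave policy
LEGAL_MIN_DAYS = {"Portugal": 22}    # legal reference


def answer_leave_question():
    trace = []  # (sub-query, result) pairs, in the order they were issued

    office = max(HEADCOUNT, key=HEADCOUNT.get)
    trace.append(("headcount per office", office))

    country = OFFICE_COUNTRY[office]          # only known once the office is found
    policy = POLICY_DAYS[country]
    trace.append((f"leave policy {country}", policy))

    legal = LEGAL_MIN_DAYS[country]
    trace.append((f"legal minimum paid leave {country}", legal))

    extra = policy - legal                    # plain arithmetic, no tool needed
    result = {"office": office, "country": country, "policy": policy,
              "legal_minimum": legal, "extra_days": extra}
    return result, trace


result, trace = answer_leave_question()
for sub_query, found in trace:
    print(f"{sub_query!r} -> {found}")
print(result)
```

The second and third sub-queries cannot even be written until the first one has returned, which is exactly what a single retrieval on the full question cannot express.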
The patterns, in detail
Decomposition. Split a complex question into sequential sub-questions, as in turn 1 above. The pattern shines when sub-questions are dependent: one result conditions the next (you need the country before searching its policy). For independent sub-questions ("compare the termination clauses of contracts A, B, and C"), retrieve them in parallel and merge — fewer turns, less latency.
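For the independent case, one way to retrieve in parallel is a thread pool from the standard library. This is a sketch: `search` is a hypothetical stand-in for a real retriever call, and `retrieve_parallel` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def search(query: str) -> List[str]:
    # Hypothetical stand-in for a real retriever call (vector, hybrid, ...).
    return [f"passage for: {query}"]


def retrieve_parallel(sub_queries: List[str]) -> Dict[str, List[str]]:
    # Independent sub-queries do not wait on each other, so they can share one turn.
    with ThreadPoolExecutor(max_workers=max(1, len(sub_queries))) as pool:
        results = list(pool.map(search, sub_queries))  # map preserves input order
    return dict(zip(sub_queries, results))


contracts = ["A", "B", "C"]
merged = retrieve_parallel([f"termination clause contract {c}" for c in contracts])
for query, passages in merged.items():
    print(query, "->", passages)
```

Dependent sub-questions cannot use this shortcut, because the later query strings do not exist until the earlier results are in.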
Routing. Pick the right source or index for the question's nature. An agent can route a legal question to the contracts index, a product question to technical docs, a numeric question to a structured database queried in SQL. Mini-example of routing logic:
if question ∈ {price, stock, dates} → SQL tool (products table)
else if legal question → "contracts" index (hybrid + rerank)
else → general knowledge index
Routing avoids drowning a precise query in a generalist corpus, and cuts noise upstream of reranking.
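The routing rule above could be written as follows in Python. Keyword matching is an assumption made to keep the sketch self-contained; a production router would more likely use a small classifier or an LLM call. The term sets and the route names (`sql_products`, `contracts_index`, `general_index`) are hypothetical.

```python
import re

# Assumed vocabularies, for illustration only.
NUMERIC_TERMS = {"price", "stock", "date", "dates"}
LEGAL_TERMS = {"contract", "contracts", "clause", "clauses", "termination", "liability"}


def route(question: str) -> str:
    words = set(re.findall(r"[a-z]+", question.lower()))
    if words & NUMERIC_TERMS:
        return "sql_products"      # structured data: query the products table in SQL
    if words & LEGAL_TERMS:
        return "contracts_index"   # legal question: hybrid search + rerank
    return "general_index"         # fallback: general knowledge index


for q in ["What is the price of item 42?",
          "Compare the termination clauses of contracts A and B",
          "How do I reset my password?"]:
    print(route(q), "<-", q)
```

The order of the checks matters: a question that mentions both a price and a contract goes to SQL here, which is a design choice to revisit against real traffic.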
Iterative retrieval. Retrieve, read, identify what's missing, retrieve again — exactly turns 2 to 4. The key is gap detection: the agent must recognize a missing piece rather than answer with what it has. A robust heuristic is to ask the model each turn, "can you answer fully and with citations? yes/no + what's missing."
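A sketch of that heuristic: a prompt template plus a parser for the model's reply. The prompt wording, the reply format, and the names `GAP_PROMPT` and `parse_gap_reply` are assumptions. No real LLM API is called here; the replies are hard-coded strings.

```python
import re
from typing import Tuple

# Hypothetical prompt; the exact wording is an assumption.
GAP_PROMPT = (
    "Question: {question}\n"
    "Passages:\n{passages}\n"
    "Can you answer fully and with citations? "
    "Reply YES, or NO followed by what is missing."
)

_REPLY = re.compile(r"\s*(yes|no)\b[\s:,\-]*(.*)", re.IGNORECASE | re.DOTALL)


def parse_gap_reply(reply: str) -> Tuple[bool, str]:
    """Return (sufficient, missing). Anything unparseable counts as insufficient."""
    match = _REPLY.match(reply)
    if match is None:
        return False, ""
    verdict, rest = match.group(1).lower(), match.group(2).strip()
    if verdict == "yes":
        return True, ""
    return False, rest   # 'rest' can seed the next sub-query


prompt = GAP_PROMPT.format(
    question="How many days above the legal minimum?",
    passages="Portugal: 25 working days of paid leave per year.",
)
print(parse_gap_reply("NO: legal minimum paid leave Portugal"))
print(parse_gap_reply("YES"))
```

Treating an unparseable reply as "not sufficient" errs toward one more retrieval rather than a premature answer; the turn cap bounds the cost of that caution.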
Self-verification. Check the answer against sources before delivering. In practice: require every factual claim to map to a retrieved passage, and trigger an extra retrieval if a claim has no support. This net catches residual hallucinations — at the cost of one more LLM turn.
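A minimal sketch of the claim-to-passage check, using the facts from the worked example. Substring matching is a deliberately crude stand-in (an assumption); real systems typically ask an LLM or an entailment model whether the passage supports the claim. The function name, the source ids, and the last claim are hypothetical; that claim is included on purpose as an unsupported one.

```python
from typing import Dict, List, Tuple

Claim = Tuple[str, str, str]   # (claim text, cited source id, evidence snippet)


def unsupported_claims(claims: List[Claim], passages: Dict[str, str]) -> List[str]:
    missing = []
    for text, source_id, evidence in claims:
        passage = passages.get(source_id, "")
        # A claim with no evidence, or evidence absent from its cited passage, fails.
        if not evidence or evidence.lower() not in passage.lower():
            missing.append(text)
    return missing


passages = {
    "sites_sheet": "Paris 240, Lisbon 310, Berlin 180",
    "pt_policy": "Portugal: 25 working days of paid leave per year.",
    "legal_ref": "legal minimum in Portugal: 22 working days.",
}
claims = [
    ("Lisbon is the biggest office", "sites_sheet", "Lisbon 310"),
    ("The policy grants 25 days", "pt_policy", "25 working days"),
    ("The legal minimum is 22 days", "legal_ref", "22 working days"),
    ("Seniority adds extra days", "pt_policy", "seniority"),   # deliberately unsupported
]
print(unsupported_claims(claims, passages))
```

Each claim returned by the check is a candidate for an extra retrieval, or for removal from the answer if the turn cap is close.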
The price: latency, cost, complexity
Agentic RAG is not free. Each loop turn means an extra LLM call (more latency and cost), a risk of divergence (a poorly scoped agent can loop, over-retrieve, or go off-track), and more debugging surface (a multi-step path is harder to trace than a single pass).
For orders of magnitude: a typical decision turn consumes the reasoning history plus the passages already read. Count roughly 300 to 800 input tokens per turn just for reasoning, plus the retrieved passages (often 1,000 to 4,000 tokens per batch depending on top-k and chunk size). On latency, each turn adds one LLM round trip, typically 0.5 to 2 s depending on the model, plus search latency (tens of ms with a pgvector HNSW index). Our 5-turn example therefore makes five LLM calls where a single pass makes one, and stacks five round trips of latency. Treat 5× as a floor on token cost rather than an estimate: because each turn re-reads the passages accumulated so far, input tokens grow with every turn. A simple question mistakenly routed through the agent pays this overhead for nothing.
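A back-of-the-envelope version of this estimate is sketched below. The model is an assumption: every LLM call pays the per-turn reasoning tokens plus all passage batches retrieved so far, and nothing is summarized or dropped. The values 500 and 2,000 tokens, 1.0 s and 0.03 s are illustrative picks inside the ranges quoted above, not measurements.

```python
from typing import List


def input_tokens(batches_in_context: List[int], reasoning_tokens: int, passage_tokens: int) -> int:
    # One entry per LLM call: how many retrieved batches sit in context for that call.
    return sum(reasoning_tokens + b * passage_tokens for b in batches_in_context)


def latency_s(llm_calls: int, searches: int, llm_s: float, search_s: float) -> float:
    # Sequential turns: round trips add up.
    return llm_calls * llm_s + searches * search_s


REASONING, BATCH = 500, 2000   # illustrative picks, see lead-in

single_pass = input_tokens([1], REASONING, BATCH)
# The 5-turn example: plan (0 batches), then 1, 2, 3 batches, then verification (3).
agent = input_tokens([0, 1, 2, 3, 3], REASONING, BATCH)

print("single pass:", single_pass, "tokens")
print("agent:", agent, "tokens, ratio", round(agent / single_pass, 1))
print("latency:", latency_s(1, 1, 1.0, 0.03), "s vs", latency_s(5, 3, 1.0, 0.03), "s")
```

Under these assumptions the agent reads 20,500 input tokens against 2,500 for the single pass, about 8 times more, because passages are re-read on every later turn. Trimming or summarizing the context between turns would pull that ratio back down.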
Frugality rule: agentify only what needs it. A factual FAQ does not need an agent; a multi-hop question does.
When to use it: concrete thresholds
Yes: multi-hop questions, comparisons, exploratory search, heterogeneous corpora needing routing, cases where self-verification sharply reduces errors.
No: simple factual questions, strong latency constraints (real-time), massive volume where per-query cost must stay minimal. There, a well-built single-pass RAG (hybrid + reranking) remains best.
To decide without dogma, measure on your golden set and apply explicit thresholds:
- if fewer than ~10% of questions are multi-hop or need source crossing, the agent won't pay off: optimize the single pass.
- if single-pass RAG already reaches high context recall (say above 0.85 on the RAGAS context recall metric) for most traffic, keep it as default and route only the hard minority to the agent.
- if your latency budget per answer is under ~2 s, multi-turn agentic is effectively ruled out on the synchronous path; reserve it for async.
- if the measured faithfulness gain (agent vs single pass) is under a few points, it isn't worth it.
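These thresholds can be turned into a small checklist function to run against golden-set measurements. This is a sketch: the function and parameter names are hypothetical, and reading "a few points" as 3 points is an assumption exposed as `min_gain_pts`.

```python
from typing import List


def agent_recommendations(
    multi_hop_share: float,        # fraction of golden-set questions needing several hops
    context_recall: float,         # single-pass context recall, between 0 and 1
    latency_budget_s: float,       # per-answer budget on the synchronous path
    faithfulness_gain_pts: float,  # agent minus single pass, in points
    min_gain_pts: float = 3.0,     # assumption: "a few points" read as 3
) -> List[str]:
    notes = []
    if multi_hop_share < 0.10:
        notes.append("few multi-hop questions: optimize the single pass")
    if context_recall > 0.85:
        notes.append("single pass as default; route only the hard minority to the agent")
    if latency_budget_s < 2.0:
        notes.append("agent on the async path only")
    if faithfulness_gain_pts < min_gain_pts:
        notes.append("faithfulness gain too small to justify the agent")
    return notes


for note in agent_recommendations(0.25, 0.90, 1.5, 6.0):
    print("-", note)
```

An empty list means none of the four thresholds argues against the agent for that workload.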
A pragmatic approach: start simple (single-pass RAG) and introduce the agent only where measurement proves complex questions fail. This avoids paying for complexity everywhere for a localized benefit. A good middle ground is routing at the entrance: a lightweight classifier sends the bulk of traffic (say 90%) to the single pass and wakes the agent only for the remainder that justifies it.
Guardrails
- Cap the number of turns — a hard limit (e.g. 6 to 8) prevents costly loops. Beyond it, the agent should answer with what it has, flagging uncertainty, rather than loop indefinitely.
- Bound the tools — an agent should only reach authorized sources. Permissions apply to the agent too: a user's question must never trigger retrieval in a knowledge base they aren't entitled to. Permission filtering happens inside the search tool, not in post-processing.
- Trace every step — log the plan, each sub-query, the sources returned, and the decision to continue or stop. This trace is essential for debugging ("why did the agent loop three times?") and for audit (showing where each claim comes from), a key point for compliance.
- Measure the overhead — weigh quality gains against added cost/latency. If the benefit isn't clear, simplify. In production, track the average number of turns per question: an upward drift is a warning sign (mis-routed questions, overly timid gap detection).
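Two of these guardrails, permission filtering inside the search tool and step-by-step tracing, can be sketched together. The names (`Doc`, `make_search_tool`), the source labels, and the second document are hypothetical. Keyword overlap stands in for a real retriever.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Doc:
    text: str
    source: str   # knowledge base the document belongs to


def make_search_tool(
    corpus: List[Doc], allowed_sources: Set[str], trace: List[Dict]
) -> Callable[[str], List[Doc]]:
    def search(query: str) -> List[Doc]:
        # Filter BEFORE matching: unauthorized documents never reach the agent.
        visible = [d for d in corpus if d.source in allowed_sources]
        terms = query.lower().split()
        hits = [d for d in visible if any(t in d.text.lower() for t in terms)]
        # Log each call so the path can be debugged and audited later.
        trace.append({"query": query, "sources": [d.source for d in hits]})
        return hits
    return search


corpus = [
    Doc("Portugal: 25 working days of paid leave per year.", "hr_policies"),
    Doc("Paid leave accrual liabilities by quarter.", "finance"),   # hypothetical
]
trace: List[Dict] = []
search = make_search_tool(corpus, allowed_sources={"hr_policies"}, trace=trace)

hits = search("paid leave")
print([d.source for d in hits])
print(trace)
```

Because the filter sits inside the tool, no prompt or agent decision can widen access. The trace records only what the user was entitled to see.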
Conclusion
Agentic RAG is not a fad to apply everywhere, but a tool for a class of problems single-pass RAG cannot handle: questions that require reasoning, decomposing, and verifying. Well-scoped — capped, traced, measured — it moves an assistant from "often right" to "right even on hard questions." Poorly scoped, it adds cost and unpredictability. Maturity is knowing which world your question demands — and, most often, letting the two coexist behind a router that reserves the heavy artillery for the questions that deserve it.