RAG in production: the mistakes we see in almost every enterprise system
Most enterprise RAG systems fail the same way — and it's almost never the model's fault. The recurring mistakes, and how to fix them.
Retrieval-augmented generation is the default way enterprises put large language models to work, and for good reason: it grounds the model in your own documents instead of its training data, and it does so without the cost and fragility of fine-tuning. The architecture is simple enough to prototype in an afternoon. That's precisely why so many production RAG systems are quietly mediocre — the easy version works well enough in a demo to hide the parts that don't survive contact with real users.
When we're called in to fix a RAG system that "kind of works but nobody trusts it," the problem is almost never the model. It's the retrieval. Here's what we keep finding.
The chunking is naïve
Most systems split documents into fixed-size windows — 500 tokens, a bit of overlap, move on. That's fine for uniform prose and quietly disastrous for the documents enterprises actually have: contracts where a clause's meaning depends on a definition three pages earlier, tables that lose all meaning when cut mid-row, technical specs where the heading is the only thing that disambiguates the paragraph under it. Fixed-size chunking treats structure as noise. The fix is to chunk along the document's real boundaries — sections, clauses, table rows kept whole — and to carry enough context (the parent heading, the section title) into each chunk that it can stand on its own when retrieved out of order.
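As a rough illustration of chunking along real boundaries, the sketch below splits a document at its headings and carries the heading trail into every chunk. It assumes the document has already been converted to text with Markdown-style `#` headings, which is an assumption about the ingestion pipeline, not something every corpus will satisfy. The names `Chunk` and `chunk_by_headings` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Contract", "Definitions"]
    body: str

    def text_for_embedding(self) -> str:
        # Prefix the body with its heading trail so the chunk can stand
        # on its own when it is retrieved out of order.
        return " > ".join(self.heading_path) + "\n" + self.body


# Assumption: headings look like "# Title", "## Subsection", and so on.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(document: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: list[str] = []    # current heading trail
    buffer: list[str] = []  # lines of the section being collected

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(list(path), body))
        buffer.clear()

    for line in document.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then push the new one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

A section that is still too long for the embedding model would need a second split inside it (by clause or by table row, for example). The point of the sketch is only that the heading trail travels with each piece.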
There's no reranking
A surprising number of production systems retrieve the top-k chunks by vector similarity and feed them straight to the model. Vector similarity is a coarse instrument; it's good at "roughly about the same topic" and bad at "actually answers this specific question." Without a reranking step (a cross-encoder or a dedicated rerank model that scores each candidate against the actual query), you're asking the LLM to find the needle while you hand it the whole haystack, ranked badly. Adding reranking is often the single highest-impact change available, and it's usually only a few hours of work.
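The shape of a reranking step can be sketched in a few lines: take the candidates the vector search returned, score each one against the query, and keep only the best. The scorer is pluggable. `toy_overlap_score` below is a deliberately crude token-overlap stand-in so the example runs without dependencies; it is not a cross-encoder, and in a real system you would pass in a function backed by a cross-encoder or a hosted rerank model. Both function names are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Score every candidate against the query and keep the best `keep`."""
    scored = [(score(query, passage), passage) for passage in candidates]
    # The sort is stable, so ties keep their original (vector-search) order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the passage.

    Assumption for illustration only; swap in a real rerank model.
    """
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)
```

Keeping the scorer as a parameter means the retrieval pipeline does not change when the rerank model does, which also makes it easy to compare scorers against the same evaluation set.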
Nobody measures relevance
This is the one that matters most and gets done least. Teams measure whether the system responds; they don't measure whether it retrieved the right thing. So when answers are wrong, there's no way to tell whether retrieval missed the relevant chunk, or retrieved it and the model ignored it, or there was no relevant chunk to find. These are three completely different bugs with three different fixes, and without a relevance evaluation set you're guessing. Build a labelled set of real questions with known-correct sources, and measure retrieval precision (how much of what was retrieved is relevant) and recall (how much of what is relevant was retrieved) as first-class metrics. Until you do, every improvement is a vibe.
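A minimal sketch of such an evaluation, assuming the labelled set is a mapping from each question to the IDs of its known-correct sources and that `retrieve` returns ranked chunk IDs. The data shapes and the function names (`precision_recall_at_k`, `evaluate`, `diagnose`) are illustrative assumptions, not a reference to any particular framework.

```python
from statistics import mean
from typing import Callable


def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved IDs for one question."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(
    labelled: dict[str, set[str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average precision@k and recall@k over a non-empty labelled set."""
    scores = [
        precision_recall_at_k(retrieve(question), sources, k)
        for question, sources in labelled.items()
    ]
    return {
        "precision@k": mean(p for p, _ in scores),
        "recall@k": mean(r for _, r in scores),
    }


def diagnose(retrieved: list[str], relevant: set[str], answer_correct: bool) -> str:
    """Separate the three failure modes for a single question."""
    if answer_correct:
        return "ok"
    if not relevant:
        return "no relevant source exists"
    if not relevant & set(retrieved):
        return "retrieval miss"
    return "retrieved but not used"
```

`diagnose` still needs a human or an automated grader to supply `answer_correct`. What it adds is the split between a corpus gap, a retrieval bug, and a generation bug once that judgement exists.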
Retrieval runs even when it shouldn't
Not every query needs the knowledge base. "Summarise the document I just pasted" doesn't; "what's our policy on X" does. Systems that retrieve unconditionally pollute the context with irrelevant chunks on exactly the queries that didn't need them. A cheap routing step — does this query need retrieval at all, and from which corpus — pays for itself.
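One way to picture the routing step is as a small function that runs before retrieval and returns two things: whether to retrieve at all, and from which corpus. The version below uses keyword heuristics purely to keep the sketch runnable; the hint phrases, the corpus names (`hr_policies`, `contracts`, `default`), and the `Route` type are all made up for illustration, and a production router would more plausibly be a small classifier or a cheap model call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    retrieve: bool
    corpus: Optional[str] = None  # None when no retrieval is needed


# Hypothetical phrases suggesting the user already supplied the material.
SELF_CONTAINED_HINTS = ("i just pasted", "the text above", "the document above")

# Hypothetical corpora and the keywords that point to them.
CORPUS_KEYWORDS = {
    "hr_policies": ("policy", "leave", "expenses"),
    "contracts": ("contract", "clause", "supplier"),
}


def route(query: str, has_attachment: bool = False) -> Route:
    q = query.lower()
    # The query carries its own context: skip the knowledge base entirely.
    if has_attachment or any(hint in q for hint in SELF_CONTAINED_HINTS):
        return Route(retrieve=False)
    # Otherwise pick the first corpus whose keywords appear in the query.
    for corpus, keywords in CORPUS_KEYWORDS.items():
        if any(keyword in q for keyword in keywords):
            return Route(retrieve=True, corpus=corpus)
    # No match: fall back to a general corpus instead of skipping retrieval.
    return Route(retrieve=True, corpus="default")
```

The fallback direction is a design choice. Here an unrecognised query still retrieves, on the assumption that a missed retrieval costs more than an unnecessary one; a system with the opposite cost profile would flip it.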
The context window gets treated as free
It isn't. Stuffing twenty chunks into the prompt because they fit doesn't improve answers; past a point it degrades them, as the relevant passage gets buried among the merely-plausible ones. Fewer, better-ranked chunks beat more chunks, every time we've measured it.
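In code, "fewer, better-ranked chunks" usually comes down to three limits applied after reranking: a minimum score, a maximum chunk count, and a token budget. The sketch below assumes each chunk arrives with a relevance score in which higher is better. The default thresholds are placeholders to be tuned against an evaluation set, not recommendations, and the whitespace word count is a crude stand-in for the model's real tokenizer.

```python
def select_context(
    scored_chunks: list[tuple[float, str]],
    max_chunks: int = 4,
    min_score: float = 0.5,
    token_budget: int = 1500,
) -> list[str]:
    """Pick the best-scoring chunks that fit the count and token limits."""
    selected: list[str] = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for score, text in ranked:
        # Everything after this point scores lower, so stop at the first miss.
        if score < min_score or len(selected) >= max_chunks:
            break
        cost = len(text.split())  # rough token estimate (assumption)
        if used + cost > token_budget:
            continue  # too big for what is left; a smaller chunk may still fit
        selected.append(text)
        used += cost
    return selected
```

Called with four scored chunks and `max_chunks=2`, it returns only the two highest-scoring ones, even if all four would fit in the window.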
The thread running through all of these
In a production RAG system, retrieval quality sets the ceiling and the model only determines how close you get to it. Teams spend their energy on prompt engineering and model selection — the visible knobs — while the actual constraint sits upstream, in how documents are split, scored, and selected. Swapping to a better model rarely fixes a retrieval problem. It just makes the wrong chunks sound more confident.
For regulated environments there's a further layer — every retrieval and every answer has to be traceable and auditable, which is its own engineering problem — but the foundation is the same: get retrieval right, measure it honestly, and most of the "the AI is unreliable" complaints disappear.