
RAG isn’t “add a vector DB.” It’s a reliability architecture: define truth boundaries, build a testable retrieval pipeline, and evaluate groundedness like you mean it.
Axel Domingues
RAG is not a feature.
RAG is what happens when you accept a hard truth:
Your model does not “know your business.”
It knows language. It knows patterns. It knows how to sound right.
So the only way to ship an LLM feature that stays correct as your world changes is to wrap it in an explicit knowledge system:
That’s Retrieval-Augmented Generation done right.
Not “vector DB + prompt.”
Contracts, pipelines, and evaluation.
What RAG really is
A system that turns your knowledge into bounded context the model can use safely.
What RAG is not
A magic spell that makes hallucinations disappear.
The failure mode
“Looks confident” is not the same as “is grounded.”
The missing piece
Evaluation that measures grounded correctness, not vibes.
If you remember one sentence from this post, make it this:
RAG is a context assembly pipeline whose output must be testable.
The model is the last step. The reliability comes from everything around it.
So before you pick a vector store, ask a more architectural question: what should the system do when it doesn’t have the evidence?
Make “I don’t know” a valid outcome, and force answers to cite sources.
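As a concrete illustration, here is a minimal sketch of what that contract can look like in code. The `Citation` and `GroundedAnswer` names and fields are my own, not a standard API; the point is that refusal and citations live in the return type, not buried in the prompt.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Citation:
    source_id: str   # stable identifier of the document or chunk
    quote: str       # the span that supports the claim

@dataclass
class GroundedAnswer:
    answer: Optional[str]                      # None means "I don't know"
    citations: list[Citation] = field(default_factory=list)
    refused: bool = False                      # evidence was insufficient

    def is_valid(self) -> bool:
        # A refusal must not carry an answer; an answer must carry citations.
        if self.refused:
            return self.answer is None
        return self.answer is not None and len(self.citations) > 0
```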
From there, you can design the right boundary:
RAG is primarily for the middle case: documents and knowledge bases that are too big for a prompt and too dynamic to bake into weights.
RAG is a tax. It adds ingestion, indexing, query-time retrieval, caching, and evaluation.
So spend that tax only when it buys you something real.
You need RAG when…
Truth lives in documents (policies, manuals, runbooks) and changes often enough that “train it in” won’t keep up.
You don’t need RAG when…
Truth lives in structured systems (DBs/APIs). Use tool calls and strong contracts instead of retrieval.
Here’s the pipeline that tends to survive production contact:

You need a reliable way to pull content from the systems where your truth actually lives.
And you need metadata, not just text: source system, version, effective date, and access scope.
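As a sketch, here is roughly what an ingested record can carry. The field names and the example values are mine, illustrative rather than a schema you must adopt, but every field earns its keep later in the pipeline.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SourceDocument:
    doc_id: str            # stable ID from the source system
    source_system: str     # e.g. "policy-wiki", "runbook-repo"
    title: str
    body: str
    version: str           # needed later for freshness/staleness checks
    effective_date: date   # "is this the policy that applies today?"
    access_groups: tuple[str, ...]  # who may see chunks derived from this
    url: str               # so citations can point somewhere a human can verify

doc = SourceDocument(
    doc_id="policy-042",
    source_system="policy-wiki",
    title="Refund policy",
    body="Refunds are issued within 14 days...",
    version="2024-03",
    effective_date=date(2024, 3, 1),
    access_groups=("support", "finance"),
    url="https://wiki.example.internal/policies/refunds",
)
```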
Chunking determines what your system can recall.
Bad chunking produces chunks that retrieve well but can’t support an answer.
Good chunking is guided by your use cases: it keeps together the units people actually ask about.
Chunking is about preserving meaning boundaries.
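A minimal sketch of that idea, assuming Markdown-ish sources; the heading regex, the paragraph fallback, and the character budget are illustrative choices, not recommendations.

```python
import re

def chunk_by_sections(doc_body: str, max_chars: int = 1500) -> list[str]:
    """Split on headings first, so a chunk never straddles two sections;
    only split further when a single section is too large."""
    sections = re.split(r"\n(?=#{1,6} )", doc_body)  # keep each heading with its body
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: split on paragraph boundaries, never mid-sentence.
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks
```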
In production, “vector similarity” is rarely enough.
A robust pattern is hybrid retrieval: lexical and vector search for recall, then a reranker for precision.
Reranking is where relevance gets real, and where your top-k stops being random.
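A minimal sketch of that pattern. The lexical search, vector search, and reranker are injected as callables because they are whatever your stack provides (BM25, an embedding index, a cross-encoder); reciprocal rank fusion is one common way to merge the candidate lists, not the only one.

```python
from typing import Callable

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked chunk-ID lists from lexical and vector search (RRF)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str,
             lexical_search: Callable[[str], list[str]],
             vector_search: Callable[[str], list[str]],
             rerank_score: Callable[[str, str], float],
             chunks: dict[str, str],
             top_k: int = 8) -> list[str]:
    """Recall with both retrievers, fuse, then spend the reranker on precision."""
    candidates = reciprocal_rank_fusion([lexical_search(query), vector_search(query)])[:50]
    candidates.sort(key=lambda cid: rerank_score(query, chunks[cid]), reverse=True)
    return candidates[:top_k]
```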
You don’t have infinite tokens.
So you need a context assembler that decides what gets in, in what order, and what gets cut.
This should be deterministic and testable.
If context packing is “whatever came back,” you will ship a system that fails in ways you can’t reproduce.
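A minimal sketch of a deterministic packer; the budget and the tie-breaking rule are illustrative. The point is that the same ranked input always yields the same context.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RankedChunk:
    chunk_id: str
    text: str
    score: float

def pack_context(chunks: list[RankedChunk], budget_chars: int) -> list[RankedChunk]:
    """Deterministic packing: stable order, hard budget, no 'whatever came back'."""
    # Sort by score, break ties by chunk_id so repeated runs produce identical context.
    ordered = sorted(chunks, key=lambda c: (-c.score, c.chunk_id))
    packed, used = [], 0
    for chunk in ordered:
        if used + len(chunk.text) > budget_chars:
            continue  # skip rather than truncate a chunk mid-thought
        packed.append(chunk)
        used += len(chunk.text)
    return packed
```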
Most teams use “grounded” to mean:
“The answer mentions the docs.”
That’s not grounded. That’s decorated.
Grounded means: every claim in the answer is supported by a specific retrieved source you can point to.
If you don’t measure citation precision, you’re building a liar with footnotes.
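Measuring it can start as small as this sketch. The `supports` judge is whatever you trust: an NLI model, an LLM-as-judge prompt, or sampled human review. The keyword-overlap judge shown is a deliberately crude floor, not a real verifier.

```python
from typing import Callable

def citation_precision(claims_with_citations: list[tuple[str, str]],
                       supports: Callable[[str, str], bool]) -> float:
    """Fraction of (claim, cited_passage) pairs where the passage actually
    supports the claim."""
    if not claims_with_citations:
        return 0.0
    hits = sum(1 for claim, passage in claims_with_citations if supports(claim, passage))
    return hits / len(claims_with_citations)

def keyword_overlap_judge(claim: str, passage: str, threshold: float = 0.6) -> bool:
    """Cheap, pessimistic baseline: does the passage contain the claim's key terms?"""
    terms = {t.lower() for t in claim.split() if len(t) > 3}
    if not terms:
        return False
    found = sum(1 for t in terms if t in passage.lower())
    return found / len(terms) >= threshold
```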
What happens: the right chunk exists, but retrieval doesn’t fetch it.
Why it happens: the query’s wording doesn’t match the document’s, the answer is split across chunks, or ranking pushes the right chunk below the cutoff.
Fix: add query rewriting, improve chunking, add reranking, and measure retrieval recall.
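Retrieval recall is cheap to measure once you have labeled questions; a minimal sketch (the chunk IDs are invented for the example):

```python
def retrieval_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Did the chunks that contain the answer make it into the top-k at all?
    If this number is low, no amount of prompt tuning will fix your answers."""
    if not relevant_ids:
        return 1.0  # nothing to find; trivially satisfied
    found = len(set(retrieved_ids[:k]) & relevant_ids)
    return found / len(relevant_ids)

# Example: the gold chunk was indexed but ranked 12th, so recall@8 is 0.
print(retrieval_recall_at_k(
    retrieved_ids=[f"chunk-{i}" for i in range(1, 20)],
    relevant_ids={"chunk-12"},
    k=8,
))  # -> 0.0
```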
What happens: the right chunk is present, but surrounded by noise, and the model chooses the wrong anchor.
Fix: tighten context packing rules so you include fewer chunks, with higher precision, better ordering, and trimmed boilerplate.
What happens: the model forms an answer from priors, then hunts for supporting text.
Fix: force a source-first plan (“cite before you claim”), and consider constrained formats.
What happens: the model answers correctly… for last month’s policy.
Fix: ingest version metadata, apply recency/authority ranking, and require “effective date” in context.
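One way to act on that, sketched below. The blend weights and half-life are placeholders to tune against your eval set; surfacing the version and effective date to the model, not just storing them, is the part that matters.

```python
from datetime import date

def freshness_adjusted_score(relevance: float,
                             effective_date: date,
                             authority: float,
                             today: date,
                             half_life_days: int = 180) -> float:
    """Blend relevance with recency decay and source authority.
    The weights and half-life are placeholders, not recommendations."""
    age_days = max((today - effective_date).days, 0)
    recency = 0.5 ** (age_days / half_life_days)  # halves every `half_life_days`
    return 0.7 * relevance + 0.2 * recency + 0.1 * authority

def render_chunk_for_context(text: str, version: str, effective_date: date) -> str:
    """Put the version and effective date in front of the model, not just in metadata."""
    return f"[source version {version}, effective {effective_date.isoformat()}]\n{text}"
```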
RAG is an evaluation problem disguised as a retrieval problem.
Because you can’t ship “the model seems better” to production.
You need to measure outcomes that map to reliability.
Answer correctness
Is the final answer correct for this question, in this context?
Groundedness
Is the answer supported by retrieved sources (not model priors)?
Citation precision
Do the citations truly support the claims they’re attached to?
Retrieval quality
Did we fetch the right evidence (recall/precision), fast enough, within budget?
A common anti-pattern: eyeballing a handful of answers, feeling good about them, and calling that evaluation.
You need a repeatable harness.
Create a dataset of real questions, each labeled with the chunks that should be retrieved and the answer (or refusal) that should be produced.
Measure two distinct things: retrieval quality (did the right evidence come back?) and generation quality (was the answer grounded in that evidence?).
If you don’t split these, you won’t know what to fix.
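A minimal harness sketch of that split. The type names are mine; filling in `retrieval_recall` (recall@k as above) and `grounded_correct` (a judge or human review) is the harness’s real work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    question: str
    relevant_chunk_ids: set[str]       # what retrieval should bring back
    reference_answer: Optional[str]    # None means the correct behavior is to refuse

@dataclass
class EvalResult:
    retrieval_recall: float   # layer 1: did the evidence come back?
    grounded_correct: bool    # layer 2: given that evidence, was the answer right and cited?

def summarize(results: list[EvalResult]) -> dict[str, float]:
    n = max(len(results), 1)
    return {
        "retrieval_recall@k": sum(r.retrieval_recall for r in results) / n,
        "grounded_correctness": sum(r.grounded_correct for r in results) / n,
    }
```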
Include tests where the system should refuse: questions the corpus cannot answer, where “I don’t know” is the correct output.
Changes that can break RAG: new chunking rules, a different retriever or reranker, a reworded prompt, a refreshed or re-embedded corpus.
If these aren’t tested, you’re deploying random.
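In practice that can be as simple as a CI gate over the harness metrics; the baseline numbers below are invented for illustration.

```python
import sys

# Baselines stored per release; the numbers here are invented for illustration.
BASELINE = {"retrieval_recall@k": 0.92, "grounded_correctness": 0.88}
TOLERANCE = 0.02

def check_regression(current: dict[str, float]) -> list[str]:
    """Return the metrics that dropped more than TOLERANCE below baseline."""
    failures = []
    for metric, baseline_value in BASELINE.items():
        value = current.get(metric, 0.0)
        if value < baseline_value - TOLERANCE:
            failures.append(f"{metric}: {value:.3f} < baseline {baseline_value:.3f}")
    return failures

if __name__ == "__main__":
    current = {"retrieval_recall@k": 0.85, "grounded_correctness": 0.90}  # from the eval run
    problems = check_regression(current)
    if problems:
        sys.exit("RAG regression detected:\n" + "\n".join(problems))
```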
Overall score hides pain. Track by slices: per document type, per question type, per domain, and per freshness of the underlying source.
A RAG system is an information-routing machine.
So security is not an add-on. It is part of correctness.
Do not assume “internal docs” are safe. Treat retrieval content as hostile until proven otherwise.
Retrieval-Augmented Generation (RAG) paper (Lewis et al., 2020)
The original framing: retrieval as a way to ground generation in external knowledge.
REALM (Guu et al., 2020)
An influential perspective on retrieval as a learned component of language understanding.
Can’t I just fine-tune the model instead?
Not usually.
Fine-tuning shapes behavior and style (and can teach task formats), but it doesn’t solve “fresh knowledge” reliably. RAG solves “knowledge at query time.” In practice, the best systems combine both: fine-tuning for behavior and format, RAG for the facts.
Why does my RAG system still hallucinate?
Because retrieval doesn’t guarantee grounding.
Common reasons: the right chunk was never retrieved, it was buried in noise, the model answered from its priors and decorated afterwards, or the context was stale.
RAG reduces hallucinations when retrieval + context packing + evaluation are designed as a system.
How do I know whether retrieval or generation is the problem?
Split evaluation: score retrieval (did the right chunks come back?) separately from generation (given those chunks, was the answer grounded and correct?).
If you don’t separate these, you’ll guess—and spend weeks tuning the wrong layer.
How do I get started?
Start with a narrow domain and a hard contract: cited answers or “I don’t know,” nothing in between.
Then expand one axis at a time: sources, domains, tools, and scale.
This month was about grounding: getting the system to answer with evidence, or not at all.
Next month, I’ll move from “answer with evidence” to “act with tools”:
Tool Use and Agents: When the Model Becomes a Workflow Engine
Because once the model can retrieve knowledge… the next temptation is to let it push buttons.
And that’s where reliability either becomes architecture—
or becomes an incident report.