
RAG isn’t “add a vector DB.” It’s a reliability architecture: define truth boundaries, build a testable retrieval pipeline, and evaluate groundedness like you mean it.
Axel Domingues
RAG is not a feature.
RAG is what happens when you accept a hard truth:
Your model does not “know your business.”
It knows language. It knows patterns. It knows how to sound right.
So the only way to ship an LLM feature that stays correct as your world changes is to wrap it in an explicit knowledge system:
That’s Retrieval-Augmented Generation done right.
Not “vector DB + prompt.”
Contracts, pipelines, and evaluation.
What RAG really is
A system that turns your knowledge into bounded context the model can use safely.
What RAG is not
A magic spell that makes hallucinations disappear.
The failure mode
“Looks confident” is not the same as “is grounded.”
The missing piece
Evaluation that measures grounded correctness, not vibes.
If you remember one sentence from this post, make it this:
RAG is a context assembly pipeline whose output must be testable.
The model is the last step. The reliability comes from everything around it.
So before you pick a vector store, ask a more architectural question: what should the system do when it doesn’t have the evidence?
Make “I don’t know” a valid outcome, and force answers to cite sources.
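As a concrete illustration, here is a minimal sketch of what that contract can look like in code. The `Citation` and `GroundedAnswer` names and fields are my own, not a standard API; the point is that refusal and citations live in the return type, not buried in the prompt.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Citation:
    source_id: str   # stable identifier of the document or chunk
    quote: str       # the span that supports the claim

@dataclass
class GroundedAnswer:
    answer: Optional[str]                      # None means "I don't know"
    citations: list[Citation] = field(default_factory=list)
    refused: bool = False                      # evidence was insufficient

    def is_valid(self) -> bool:
        # A refusal must not carry an answer; an answer must carry citations.
        if self.refused:
            return self.answer is None
        return self.answer is not None and len(self.citations) > 0
```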
From there, you can design the right boundary:
RAG is primarily for the middle case: documents and knowledge bases that are too big for a prompt and too dynamic to bake into weights.
RAG is a tax. It adds ingestion, indexing, query-time retrieval, caching, and evaluation.
So spend that tax only when it buys you something real.
You need RAG when…
Truth lives in documents (policies, manuals, runbooks) and changes often enough that “train it in” won’t keep up.
You don’t need RAG when…
Truth lives in structured systems (DBs/APIs). Use tool calls and strong contracts instead of retrieval.
Here’s the pipeline that tends to survive production contact:

You need a reliable way to pull content from the systems where your truth actually lives.
And you need metadata, not just text: source system, version, effective date, and access scope.
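As a sketch, here is roughly what an ingested record can carry. The field names and the example values are mine, illustrative rather than a schema you must adopt, but every field earns its keep later in the pipeline.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SourceDocument:
    doc_id: str            # stable ID from the source system
    source_system: str     # e.g. "policy-wiki", "runbook-repo"
    title: str
    body: str
    version: str           # needed later for freshness/staleness checks
    effective_date: date   # "is this the policy that applies today?"
    access_groups: tuple[str, ...]  # who may see chunks derived from this
    url: str               # so citations can point somewhere a human can verify

doc = SourceDocument(
    doc_id="policy-042",
    source_system="policy-wiki",
    title="Refund policy",
    body="Refunds are issued within 14 days...",
    version="2024-03",
    effective_date=date(2024, 3, 1),
    access_groups=("support", "finance"),
    url="https://wiki.example.internal/policies/refunds",
)
```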
Chunking determines what your system can recall.
Bad chunking produces chunks that retrieve well but can’t support an answer.
Good chunking is guided by your use cases: it keeps together the units people actually ask about.
Chunking is about preserving meaning boundaries.
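A minimal sketch of that idea, assuming Markdown-ish sources; the heading regex, the paragraph fallback, and the character budget are illustrative choices, not recommendations.

```python
import re

def chunk_by_sections(doc_body: str, max_chars: int = 1500) -> list[str]:
    """Split on headings first, so a chunk never straddles two sections;
    only split further when a single section is too large."""
    sections = re.split(r"\n(?=#{1,6} )", doc_body)  # keep each heading with its body
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: split on paragraph boundaries, never mid-sentence.
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks
```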
In production, “vector similarity” is rarely enough.
A robust pattern is hybrid retrieval: lexical and vector search for recall, then a reranker for precision.
Reranking is where relevance gets real, and where your top-k stops being random.
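A minimal sketch of that pattern. The lexical search, vector search, and reranker are injected as callables because they are whatever your stack provides (BM25, an embedding index, a cross-encoder); reciprocal rank fusion is one common way to merge the candidate lists, not the only one.

```python
from typing import Callable

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked chunk-ID lists from lexical and vector search (RRF)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str,
             lexical_search: Callable[[str], list[str]],
             vector_search: Callable[[str], list[str]],
             rerank_score: Callable[[str, str], float],
             chunks: dict[str, str],
             top_k: int = 8) -> list[str]:
    """Recall with both retrievers, fuse, then spend the reranker on precision."""
    candidates = reciprocal_rank_fusion([lexical_search(query), vector_search(query)])[:50]
    candidates.sort(key=lambda cid: rerank_score(query, chunks[cid]), reverse=True)
    return candidates[:top_k]
```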
You don’t have infinite tokens.
So you need a context assembler that decides what gets in, in what order, and what gets cut.
This should be deterministic and testable.
If context packing is “whatever came back,” you will ship a system that fails in ways you can’t reproduce.
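A minimal sketch of a deterministic packer; the budget and the tie-breaking rule are illustrative. The point is that the same ranked input always yields the same context.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RankedChunk:
    chunk_id: str
    text: str
    score: float

def pack_context(chunks: list[RankedChunk], budget_chars: int) -> list[RankedChunk]:
    """Deterministic packing: stable order, hard budget, no 'whatever came back'."""
    # Sort by score, break ties by chunk_id so repeated runs produce identical context.
    ordered = sorted(chunks, key=lambda c: (-c.score, c.chunk_id))
    packed, used = [], 0
    for chunk in ordered:
        if used + len(chunk.text) > budget_chars:
            continue  # skip rather than truncate a chunk mid-thought
        packed.append(chunk)
        used += len(chunk.text)
    return packed
```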
Most teams use “grounded” to mean:
“The answer mentions the docs.”
That’s not grounded. That’s decorated.
Grounded means: every claim in the answer is supported by a specific retrieved source you can point to.
If you don’t measure citation precision, you’re building a liar with footnotes.
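Measuring it can start as small as this sketch. The `supports` judge is whatever you trust: an NLI model, an LLM-as-judge prompt, or sampled human review. The keyword-overlap judge shown is a deliberately crude floor, not a real verifier.

```python
from typing import Callable

def citation_precision(claims_with_citations: list[tuple[str, str]],
                       supports: Callable[[str, str], bool]) -> float:
    """Fraction of (claim, cited_passage) pairs where the passage actually
    supports the claim."""
    if not claims_with_citations:
        return 0.0
    hits = sum(1 for claim, passage in claims_with_citations if supports(claim, passage))
    return hits / len(claims_with_citations)

def keyword_overlap_judge(claim: str, passage: str, threshold: float = 0.6) -> bool:
    """Cheap, pessimistic baseline: does the passage contain the claim's key terms?"""
    terms = {t.lower() for t in claim.split() if len(t) > 3}
    if not terms:
        return False
    found = sum(1 for t in terms if t in passage.lower())
    return found / len(terms) >= threshold
```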
What happens: the right chunk exists, but retrieval doesn’t fetch it.
Why it happens: the query’s wording doesn’t match the document’s, the answer is split across chunks, or ranking pushes the right chunk below the cutoff.
Fix: add query rewriting, improve chunking, add reranking, and measure retrieval recall.
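Retrieval recall is cheap to measure once you have labeled questions; a minimal sketch (the chunk IDs are invented for the example):

```python
def retrieval_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Did the chunks that contain the answer make it into the top-k at all?
    If this number is low, no amount of prompt tuning will fix your answers."""
    if not relevant_ids:
        return 1.0  # nothing to find; trivially satisfied
    found = len(set(retrieved_ids[:k]) & relevant_ids)
    return found / len(relevant_ids)

# Example: the gold chunk was indexed but ranked 12th, so recall@8 is 0.
print(retrieval_recall_at_k(
    retrieved_ids=[f"chunk-{i}" for i in range(1, 20)],
    relevant_ids={"chunk-12"},
    k=8,
))  # -> 0.0
```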
What happens: the right chunk is present, but surrounded by noise, and the model chooses the wrong anchor.
Fix: tighten context packing rules so you include fewer chunks, with higher precision, better ordering, and trimmed boilerplate.
What happens: the model forms an answer from priors, then hunts for supporting text.
Fix: force a source-first plan (“cite before you claim”), and consider constrained formats.
What happens: the model answers correctly… for last month’s policy.
Fix: ingest version metadata, apply recency/authority ranking, and require “effective date” in context.
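One way to act on that, sketched below. The blend weights and half-life are placeholders to tune against your eval set; surfacing the version and effective date to the model, not just storing them, is the part that matters.

```python
from datetime import date

def freshness_adjusted_score(relevance: float,
                             effective_date: date,
                             authority: float,
                             today: date,
                             half_life_days: int = 180) -> float:
    """Blend relevance with recency decay and source authority.
    The weights and half-life are placeholders, not recommendations."""
    age_days = max((today - effective_date).days, 0)
    recency = 0.5 ** (age_days / half_life_days)  # halves every `half_life_days`
    return 0.7 * relevance + 0.2 * recency + 0.1 * authority

def render_chunk_for_context(text: str, version: str, effective_date: date) -> str:
    """Put the version and effective date in front of the model, not just in metadata."""
    return f"[source version {version}, effective {effective_date.isoformat()}]\n{text}"
```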
RAG is an evaluation problem disguised as a retrieval problem.
Because you can’t ship “the model seems better” to production.
You need to measure outcomes that map to reliability.
Answer correctness
Is the final answer correct for this question, in this context?
Groundedness
Is the answer supported by retrieved sources (not model priors)?
Citation precision
Do the citations truly support the claims they’re attached to?
Retrieval quality
Did we fetch the right evidence (recall/precision), fast enough, within budget?
A common anti-pattern: eyeballing a handful of answers, feeling good about them, and calling that evaluation.
You need a repeatable harness.
Create a dataset of real questions, each labeled with the chunks that should be retrieved and the answer (or refusal) that should be produced.
Measure two distinct things: retrieval quality (did the right evidence come back?) and generation quality (was the answer grounded in that evidence?).
If you don’t split these, you won’t know what to fix.
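A minimal harness sketch of that split. The type names are mine; filling in `retrieval_recall` (recall@k as above) and `grounded_correct` (a judge or human review) is the harness’s real work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    question: str
    relevant_chunk_ids: set[str]       # what retrieval should bring back
    reference_answer: Optional[str]    # None means the correct behavior is to refuse

@dataclass
class EvalResult:
    retrieval_recall: float   # layer 1: did the evidence come back?
    grounded_correct: bool    # layer 2: given that evidence, was the answer right and cited?

def summarize(results: list[EvalResult]) -> dict[str, float]:
    n = max(len(results), 1)
    return {
        "retrieval_recall@k": sum(r.retrieval_recall for r in results) / n,
        "grounded_correctness": sum(r.grounded_correct for r in results) / n,
    }
```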
Include tests where the system should refuse: questions the corpus cannot answer, where “I don’t know” is the correct output.
Changes that can break RAG: new chunking rules, a different retriever or reranker, a reworded prompt, a refreshed or re-embedded corpus.
If these aren’t tested, you’re deploying random.
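In practice that can be as simple as a CI gate over the harness metrics; the baseline numbers below are invented for illustration.

```python
import sys

# Baselines stored per release; the numbers here are invented for illustration.
BASELINE = {"retrieval_recall@k": 0.92, "grounded_correctness": 0.88}
TOLERANCE = 0.02

def check_regression(current: dict[str, float]) -> list[str]:
    """Return the metrics that dropped more than TOLERANCE below baseline."""
    failures = []
    for metric, baseline_value in BASELINE.items():
        value = current.get(metric, 0.0)
        if value < baseline_value - TOLERANCE:
            failures.append(f"{metric}: {value:.3f} < baseline {baseline_value:.3f}")
    return failures

if __name__ == "__main__":
    current = {"retrieval_recall@k": 0.85, "grounded_correctness": 0.90}  # from the eval run
    problems = check_regression(current)
    if problems:
        sys.exit("RAG regression detected:\n" + "\n".join(problems))
```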
Overall score hides pain. Track by slices: per document type, per question type, per domain, and per freshness of the underlying source.
A RAG system is an information-routing machine.
So security is not an add-on. It is part of correctness.
Do not assume “internal docs” are safe. Treat retrieval content as hostile until proven otherwise.
Retrieval-Augmented Generation (RAG) paper (Lewis et al., 2020)
The original framing: retrieval as a way to ground generation in external knowledge.
REALM (Guu et al., 2020)
An influential perspective on retrieval as a learned component of language understanding.
Can’t I just fine-tune the model instead?
Not usually.
Fine-tuning shapes behavior and style (and can teach task formats), but it doesn’t solve “fresh knowledge” reliably. RAG solves “knowledge at query time.” In practice, the best systems combine both: fine-tuning for behavior and format, RAG for the facts.
Why does my RAG system still hallucinate?
Because retrieval doesn’t guarantee grounding.
Common reasons: the right chunk was never retrieved, it was buried in noise, the model answered from its priors and decorated afterwards, or the context was stale.
RAG reduces hallucinations when retrieval + context packing + evaluation are designed as a system.
How do I know whether retrieval or generation is the problem?
Split evaluation: score retrieval (did the right chunks come back?) separately from generation (given those chunks, was the answer grounded and correct?).
If you don’t separate these, you’ll guess—and spend weeks tuning the wrong layer.
How do I get started?
Start with a narrow domain and a hard contract: cited answers or “I don’t know,” nothing in between.
Then expand one axis at a time: sources, domains, tools, and scale.
This month was about grounding: getting the system to answer with evidence, or not at all.
Next month, I’ll move from “answer with evidence” to “act with tools”:
Tool Use and Agents: When the Model Becomes a Workflow Engine
Because once the model can retrieve knowledge… the next temptation is to let it push buttons.
And that’s where reliability either becomes architecture—
or becomes an incident report.