Jul 28, 2024 - 18 MIN READ
RAG You Can Evaluate: retrieval pipelines, reranking, citations, and truth boundaries

RAG isn’t “add a vector DB and hope.” It’s a search-and-reasoning subsystem with contracts, metrics, and failure budgets — and you can only operate what you can evaluate.

Axel Domingues

June was about multimodal UX: once the model can see and hear, the interface stops being “a chat box” and becomes a product surface.

July is the less glamorous part.

Because the moment your LLM answers questions about your business, you’re no longer building a prompt.

You’re building a truth system.

And Retrieval-Augmented Generation (RAG) is the first architecture pattern that forces you to say, out loud:

  • What counts as truth?
  • What evidence is allowed?
  • What happens when evidence is missing?
  • How do we know the system is getting worse?

RAG is not “hallucination prevention”.

RAG is a search subsystem that feeds a probabilistic generator — and you need the same discipline you’d bring to:

  • search ranking
  • caching
  • security boundaries
  • and reliability budgets

If you’ve ever shipped RAG and felt like:
  • “it works in dev, but production feels random”
  • “we changed chunking and everything got weird”
  • “citations look convincing but sometimes point to nothing”
…that’s not bad luck.

That’s missing architecture.

The goal

Ship RAG that stays correct under change — new docs, new models, new prompts.

The core idea

Treat RAG as a subsystem with contracts: retrieval → evidence → answer.

The operability rule

If you can’t evaluate it, you can’t safely tune it.

The safety rule

Your documents are untrusted input. RAG inherits injection risk.


RAG Is Four Planes, Not One Feature

When teams say “we built RAG”, they often mean: “we embedded PDFs and stuffed chunks into the prompt.”

That’s a prototype.

A production RAG system has four planes:

Data plane

Ingestion, chunking, embeddings, indexing, freshness, permissions.

Query plane

Query rewriting, retrieval, reranking, dedupe, context assembly.

Answer plane

Truth boundaries, citation contract, refusal behavior, formatting.

Evaluation plane

Offline test sets, regression gates, online telemetry, audits.

If you only build the first three and ignore the fourth, your system will regress quietly — and you’ll only notice when a customer pastes a screenshot.


Define Your Truth Boundary First (Or Your Metrics Will Lie)

Before pipelines and vector stores, define the truth boundary:

What claims must be supported by retrieved evidence, and what claims may be speculative?

A practical way to do this is to separate outputs into two zones:

  • Inside the boundary: must be grounded in retrieved sources (and cite them)
  • Outside the boundary: allowed to be best-effort (and must be labeled as such)

Here’s a version I’ve used in real systems:

Output type | Allowed? | Policy
Exact facts about your internal docs (pricing, policy, procedures) | ✅ | Must cite sources. If retrieval confidence is low → refuse / ask a clarifying question.
Summaries of provided documents | ✅ | Must cite sections; prefer quoting key lines.
Recommendations based on internal docs | ✅ | Must tie each recommendation to cited constraints.
General world knowledge | ⚠️ | Allowed only if labeled as general knowledge and separated from cited claims.
Legal/medical/financial advice | 🚫/⚠️ | Usually disallowed; or gated to “informational only” with strong disclaimers.

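A table like this only matters if the answer service can enforce it mechanically. Here’s a minimal sketch of the boundary as data rather than prose; the names (OutputType, BoundaryPolicy, must_refuse) and the confidence threshold are illustrative assumptions, not part of any framework.

```python
from dataclasses import dataclass
from enum import Enum, auto


class OutputType(Enum):
    INTERNAL_FACT = auto()       # exact facts about internal docs
    DOC_SUMMARY = auto()
    RECOMMENDATION = auto()
    GENERAL_KNOWLEDGE = auto()
    REGULATED_ADVICE = auto()    # legal / medical / financial


@dataclass(frozen=True)
class BoundaryPolicy:
    requires_citation: bool          # every claim must cite retrieved evidence
    answer_on_low_confidence: bool   # may we answer when retrieval confidence is low?
    required_label: str = ""         # label shown to the user, if any


TRUTH_BOUNDARY = {
    OutputType.INTERNAL_FACT: BoundaryPolicy(True, False),
    OutputType.DOC_SUMMARY: BoundaryPolicy(True, False),
    OutputType.RECOMMENDATION: BoundaryPolicy(True, False),
    OutputType.GENERAL_KNOWLEDGE: BoundaryPolicy(False, True, "general knowledge"),
    OutputType.REGULATED_ADVICE: BoundaryPolicy(True, False, "informational only"),
}


def must_refuse(output_type: OutputType, retrieval_confidence: float,
                confidence_floor: float = 0.5) -> bool:
    """Refuse (or ask a clarifying question) when the boundary demands evidence we don't confidently have."""
    policy = TRUTH_BOUNDARY[output_type]
    return (policy.requires_citation
            and not policy.answer_on_low_confidence
            and retrieval_confidence < confidence_floor)
```
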
Truth boundaries are not “prompt rules”.

They are product contracts.

And once you have them, your evaluation becomes meaningful:
  • did we retrieve the right evidence?
  • did the answer stay inside the boundary?
  • were citations honest?

The Retrieval Pipeline That Actually Works

A useful mental model:

Retrieval is a recall problem first, a precision problem second.

If the right information never enters the candidate set, no reranker or LLM can save you.

1) Ingestion that doesn’t corrupt truth

RAG quality is usually lost before embeddings exist.

Common ingestion mistakes:

  • OCR artifacts become “facts”
  • headers/footers repeat and dominate similarity
  • tables are flattened into nonsense
  • duplicate versions are indexed with no versioning
  • permission boundaries are not enforced

If your system can retrieve a chunk the user shouldn’t see, you don’t have a RAG problem.

You have a data security problem.

2) Chunking is a product decision, not a tuning knob

Chunking controls:

  • what the model can cite
  • what the retriever can find
  • how often you exceed token budgets

Rules that survive reality:

  • chunk by semantic boundaries (sections, headings), not fixed size alone
  • keep enough local context: definitions, units, dates, scope
  • store structured metadata: doc_id, section, timestamp, permission_scope

A great retrieval system is often just:

better chunking + better metadata + reranking.
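
As a sketch of what “better metadata” buys you: a chunk record that carries provenance, freshness, and permissions next to the text. The Chunk fields and the chunk_by_sections helper are illustrative, not from a specific library.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str             # provenance: which document this came from
    section: str            # semantic boundary, e.g. "Refund policy > EU customers"
    text: str
    updated_at: datetime    # freshness, for "current state" questions
    permission_scope: str   # enforced at retrieval time, not after generation


def chunk_by_sections(doc_id: str, sections: list[tuple[str, str]],
                      updated_at: datetime, permission_scope: str) -> list[Chunk]:
    """One chunk per (heading, body) pair: semantic boundaries, not fixed size alone."""
    return [
        Chunk(f"{doc_id}#{i}", doc_id, heading, body, updated_at, permission_scope)
        for i, (heading, body) in enumerate(sections)
    ]
```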

3) Retrieval: go hybrid before you go fancy

Dense vectors are great, but lexical signals still matter:

  • product names
  • error codes
  • policy IDs
  • acronyms

A strong baseline is hybrid retrieval:

  • BM25 (lexical) for exact match and rare tokens
  • vector search for semantic similarity
  • merge candidates → rerank
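
A minimal sketch of the merge step, using reciprocal rank fusion (a common way to combine lexical and dense rankings). The bm25_search and vector_search callables are assumed to exist elsewhere in your stack; only the fusion logic is shown.

```python
from collections import defaultdict
from typing import Callable

# Assumed interfaces: each returns chunk IDs for a query, best first.
BM25Search = Callable[[str, int], list[str]]
VectorSearch = Callable[[str, int], list[str]]


def hybrid_retrieve(query: str, bm25_search: BM25Search, vector_search: VectorSearch,
                    k: int = 100, rrf_k: int = 60) -> list[str]:
    """Merge lexical and dense candidate lists with reciprocal rank fusion."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in (bm25_search(query, k), vector_search(query, k)):
        for rank, chunk_id in enumerate(ranking):
            fused[chunk_id] += 1.0 / (rrf_k + rank + 1)
    # Highest fused score first; this candidate set then goes to the reranker.
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

Because fusion is rank-based, you don’t have to calibrate BM25 scores against cosine similarities before merging.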

4) Reranking is the highest ROI “accuracy feature”

Most teams underuse reranking because it feels “extra.”

But reranking is often the cheapest way to improve grounded answers because it:

  • boosts the best evidence
  • removes near-duplicates
  • lowers the chance of citing the wrong section

A simple, reliable pattern:

  • retrieve top K = 50–200 candidates (cheap recall)
  • rerank down to N = 5–12 evidence chunks (precision)
  • assemble context with dedupe + ordering + token budgeting
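
A sketch of that pattern, assuming a rerank_score callable (a cross-encoder or reranking API) that scores a query/chunk pair; higher means better support.

```python
from typing import Callable

# Assumed interface: (query, chunk_text) -> relevance score.
RerankScore = Callable[[str, str], float]


def rerank_and_select(query: str, candidates: list[str], chunk_text: dict[str, str],
                      rerank_score: RerankScore, top_n: int = 8) -> list[str]:
    """Cheap recall gave us the candidates; the reranker buys precision."""
    scored = sorted(candidates,
                    key=lambda cid: rerank_score(query, chunk_text[cid]),
                    reverse=True)

    # Drop near-duplicates so one paragraph indexed five times doesn't eat the context budget.
    seen: set[str] = set()
    evidence: list[str] = []
    for cid in scored:
        normalized = " ".join(chunk_text[cid].split()).lower()
        if normalized in seen:
            continue
        seen.add(normalized)
        evidence.append(cid)
        if len(evidence) == top_n:
            break
    return evidence
```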

Context Assembly: the Hidden System You’re Actually Operating

By March, we already treated context assembly like a subsystem.

RAG makes that subsystem unavoidable.

Context assembly decisions you must make explicitly:

  • How do you order evidence?
    By relevance score? By document structure? By timestamp?
  • How do you dedupe?
    Same paragraph in multiple PDFs will destroy your token budget.
  • How do you handle conflicts?
    Two docs disagree. Which one is authoritative?
  • How do you cap tokens?
    If you overflow context, the model becomes nondeterministic (different truncation → different answer).

A rule that saves you

Token budgets are not an optimization. They are an operability constraint.

A practical assembly policy (that you can explain to stakeholders)

  1. Prefer authoritative sources (versioned policy docs > wiki pages > chat logs)
  2. Prefer newer sources when the topic is “current state”
  3. Include definitions early (so later citations don’t float without meaning)
  4. For each source, include the smallest snippet that still supports the claim
  5. Never cite a source that isn’t present in the context the model saw

That last one sounds obvious.

And yet it’s the #1 citation integrity bug I see.
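
A sketch of that assembly policy as code, assuming a count_tokens helper from whatever tokenizer your model uses; the authority tiers and the budget are illustrative.

```python
from typing import Callable

AUTHORITY = {"policy_doc": 3, "wiki": 2, "chat_log": 1}  # illustrative tiers


def assemble_context(chunks: list[dict], count_tokens: Callable[[str], int],
                     budget: int = 6000) -> list[dict]:
    """Authoritative and newer sources first; stop before the budget overflows."""
    ordered = sorted(
        chunks,
        key=lambda c: (AUTHORITY.get(c["source_type"], 0), c["updated_at"]),
        reverse=True,
    )
    used, selected = 0, []
    for chunk in ordered:
        cost = count_tokens(chunk["text"])
        if used + cost > budget:
            continue  # skip rather than truncating a chunk mid-sentence
        selected.append(chunk)
        used += cost
    return selected
```

Skipping a chunk that doesn’t fit is deliberate: silent truncation is exactly the nondeterminism described above, and every skip should be counted as a truncation event in telemetry.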


Citations Are an API Contract, Not a UI Decoration

A citation system can be honest or cosmetic.

Honest citations require a contract.

The citation contract

For every cited claim, the system must be able to answer:

  • Which document is this from? (doc_id)
  • Which exact region supports it? (start_offset, end_offset OR page, paragraph)
  • What text was actually used as evidence? (snippet)
  • Was that snippet actually present in the model context?

If you don’t track this, you cannot audit.

And if you can’t audit, you will eventually ship confident nonsense with a citation badge.

A strong pattern is: snippet-first citations.

Instead of “Source: Employee Handbook”, store and show:

  • the exact snippet used
  • with a link to the parent doc
  • and highlight the supporting span
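
A minimal sketch of a snippet-first citation record, plus the one audit check that catches that bug: the snippet must match the cited region of the source and appear in the context the model actually saw. Field and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    doc_id: str        # which document
    start_offset: int  # exact region that supports the claim
    end_offset: int
    snippet: str       # the text actually used as evidence


def citation_is_honest(citation: Citation, model_context: str, source_text: str) -> bool:
    """Honest = the snippet exists at the cited offsets AND was present in the model context."""
    in_source = source_text[citation.start_offset:citation.end_offset] == citation.snippet
    in_context = citation.snippet in model_context
    return in_source and in_context
```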

Citation failure modes (the ones that hurt trust)

  • a citation points to a document that was never in the context the model saw
  • the cited snippet doesn’t actually support the claim it’s attached to
  • the citation resolves to an empty or missing source

“RAG You Can Evaluate” Means You Have a Test Harness

If you want to improve RAG safely, you need to evaluate it in layers.

Not because metrics are fun.

Because without them, you’re tuning blind.

Retrieval eval

Did we fetch the right evidence at all?

Answer eval

Given good evidence, did we answer correctly?

Citation eval

Do citations actually support the claims?

Policy eval

Did the system stay inside the truth boundary (refuse when needed)?

Step 1: Build a “question set” that represents reality

Not 10 cherry-picked examples.

You want:

  • frequent questions (customer support, onboarding)
  • hard questions (multi-step, cross-doc)
  • failure-bait questions (ambiguous phrasing, conflicting docs)
  • permission-sensitive questions (role-based access)

For each question, store:

  • expected answer (or acceptance criteria)
  • known supporting sources (gold docs/chunks if possible)
  • what a refusal should look like (when evidence is missing)

If you don’t know the expected answer reliably, that’s fine. Turn those into human-judged items and track them separately.
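
A sketch of what one question-set item can look like as a versionable record; the field names are assumptions, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvalItem:
    question: str
    acceptance_criteria: str                                  # or an exact expected answer
    gold_chunk_ids: list[str] = field(default_factory=list)   # known supporting evidence
    expected_refusal: bool = False                            # missing evidence / no permission
    human_judged: bool = False                                # no reliable expected answer
    tags: list[str] = field(default_factory=list)             # "frequent", "cross-doc", "permission", ...
```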

Step 2: Evaluate retrieval as a recall problem

You can run retrieval evaluation without the LLM.

Common retrieval metrics:

  • Recall@K: did any of the gold evidence appear in top K?
  • MRR: how high did the first correct result rank?
  • nDCG: did we rank the most relevant evidence near the top?
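
Recall@K and MRR are a few lines of code once each question has gold chunk IDs; a sketch:

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Did any gold evidence appear in the top K results?"""
    return 1.0 if gold & set(retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: list[str], gold: set[str]) -> float:
    """1/rank of the first correct result, 0.0 if nothing gold was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(runs: list[tuple[list[str], set[str]]], k: int = 50) -> dict[str, float]:
    """Average over the question set; what matters is the trend across changes, not the absolute number."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }
```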

You don’t need academic perfection here.

You need trend detection:

  • chunking change made Recall@50 drop from 0.82 → 0.61? That’s a red flag.
  • embedding model swap improved MRR but increased permission leakage? That’s a red flag too.

Step 3: Evaluate answer correctness separately from retrieval

A trick that saves time:

Run three modes:

  1. Closed-book mode: no retrieval; does the model hallucinate?
  2. Oracle mode: feed the correct evidence; can the model answer with perfect context?
  3. End-to-end mode: normal RAG

If oracle mode is bad, your problem is:

  • prompt contract
  • truth boundary
  • or model capability

If oracle is good but end-to-end is bad, your problem is:

  • retrieval
  • reranking
  • or assembly
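
A sketch of that triage, assuming a hypothetical answer_question(question, evidence) entry point and a grade callable (human review or model-as-judge) that returns a 0–1 score; the thresholds are illustrative.

```python
from typing import Callable

AnswerFn = Callable[[str, list[str]], str]   # (question, evidence snippets) -> answer text
GradeFn = Callable[[str], float]             # answer -> score in [0, 1]


def triage(question: str, gold_evidence: list[str], retrieved_evidence: list[str],
           answer_question: AnswerFn, grade: GradeFn) -> str:
    """Run closed-book, oracle, and end-to-end modes to localize where quality is lost."""
    closed_book = grade(answer_question(question, []))
    oracle = grade(answer_question(question, gold_evidence))
    end_to_end = grade(answer_question(question, retrieved_evidence))

    if oracle < 0.7:                 # even perfect evidence doesn't help
        return "prompt contract, truth boundary, or model capability"
    if end_to_end < oracle - 0.2:    # good evidence exists but we fail to deliver it
        return "retrieval, reranking, or assembly"
    if closed_book >= oracle:        # answers well with no evidence at all
        return "question does not exercise retrieval; review the eval item"
    return "healthy"
```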

This is where most teams stop.

Don’t.

Step 4: Evaluate whether citations actually support the claims

A simple practical protocol:

  • Extract atomic claims from the answer (sentences or bullet items)
  • For each claim, check:
    • is there a cited snippet?
    • does the snippet semantically support the claim?
    • is the snippet present in the final context?

You can do this with human review, model-as-judge, or both.

But you must do it.

“Has citations” is not a metric.

Supported-by-citation rate is a metric.
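
A sketch of turning that protocol into this number, with claim extraction, citation lookup, and support judgment left as pluggable callables (sentence splitting, human review, or model-as-judge). All names are illustrative.

```python
from typing import Callable

ExtractClaims = Callable[[str], list[str]]   # answer text -> atomic claims
CitationsFor = Callable[[str], list[str]]    # claim -> snippets cited for that claim
SupportsClaim = Callable[[str, str], bool]   # (claim, snippet) -> judged support


def supported_by_citation_rate(answer: str, model_context: str,
                               extract_claims: ExtractClaims,
                               citations_for: CitationsFor,
                               supports: SupportsClaim) -> float:
    """Fraction of claims backed by a snippet that supports them AND was in the model context."""
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims
        if any(snippet in model_context and supports(claim, snippet)
               for snippet in citations_for(claim))
    )
    return supported / len(claims)
```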


Production Telemetry: The Minimum Dashboard That Prevents Blindness

Offline eval prevents regressions.

Online telemetry catches reality.

What I consider “minimum viable observability” for RAG:

  • Retrieval diagnostics
    • top query terms / rewrites
    • top retrieved doc IDs
    • retrieval latency
    • candidate count before/after filters
    • reranker score distribution
  • Answer diagnostics
    • refusal rate (by route / tenant / topic)
    • “no evidence” responses
    • answer length / verbosity drift
  • Citation diagnostics
    • cited sources count
    • unsupported-by-citation rate (sampled + audited)
    • “citation points to empty / missing source” errors
  • Cost + token diagnostics
    • context tokens used
    • truncation events
    • cache hit rate (for repeated questions)

The metric that catches “quiet degradation” fastest is:

retrieval confidence + refusal rate + user re-ask rate

When retrieval gets worse, users ask again in different words. That’s signal.
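
A sketch of the per-request trace that feeds that dashboard; one structured log line per answer is usually enough to start, and every field name here is illustrative.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class RagTrace:
    request_id: str
    query: str
    rewritten_query: str
    retrieved_doc_ids: list[str]
    candidates_before_filters: int
    candidates_after_filters: int
    reranker_scores: list[float]
    retrieval_latency_ms: float
    context_tokens: int
    truncation_event: bool
    refused: bool
    citation_count: int
    ts: float = field(default_factory=time.time)

    def emit(self) -> None:
        """One structured line per request; dashboards and sampled audits read from here."""
        print(json.dumps(asdict(self)))  # stand-in for your real logging pipeline
```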

Security: RAG Inherits Prompt Injection (Because Docs Are Untrusted)

If you index:

  • web pages
  • customer-provided files
  • emails
  • tickets

…you are indexing adversarial text, even if the adversary is accidental.

The rule is simple:

Your RAG corpus is an input channel.

Treat it like user input.

Practical guardrails:

  • never let retrieved text override system instructions
  • isolate tool instructions from content (“content is data, not commands”)
  • strip or downrank instruction-looking segments (“ignore previous”, “system prompt”, etc.)
  • store and enforce permission scopes at retrieval time, not after generation
  • log “injection-shaped” retrievals and audit them
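
A sketch of the “strip or downrank instruction-looking segments” guardrail; the patterns are illustrative, and this is a tripwire for downranking and audit logging, not a complete defense against injection.

```python
import re

# Illustrative patterns only; real corpora need tuning and periodic review.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"system prompt",
    r"you are now",
    r"disregard the above",
]
_INJECTION_RE = re.compile("|".join(INJECTION_PATTERNS), re.IGNORECASE)


def injection_shaped(chunk_text: str) -> bool:
    """Flag retrieved content that looks like instructions rather than data."""
    return bool(_INJECTION_RE.search(chunk_text))


def downrank_suspicious(candidates: list[dict]) -> list[dict]:
    """Push injection-shaped chunks to the back and mark them for audit logging."""
    for chunk in candidates:
        chunk["suspicious"] = injection_shaped(chunk["text"])
    return sorted(candidates, key=lambda c: c["suspicious"])  # clean (False) sorts first
```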

A Concrete Architecture Pattern (That Scales)

Here’s the reference pattern I use because it scales with complexity and keeps things testable:

  1. Indexer pipeline (batch + incremental)
    • canonicalize text
    • chunk + metadata
    • embed + hybrid index
    • store provenance + permissions
  2. Query service
    • rewrite query (optional)
    • retrieve candidates (hybrid)
    • filter by permissions + freshness
    • rerank
    • assemble context (budgeted)
  3. Answer service
    • apply truth boundary policy
    • generate answer with citation contract
    • run lightweight verifiers (optional)
    • return: answer + citations + evidence snippets + traces
  4. Eval harness
    • nightly regression run
    • gating metrics for deploy
    • sampled audits for citations and security

Even if you implement it inside one codebase, these boundaries keep you sane.
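
A sketch of those boundaries as interfaces (Python Protocols), so each plane can be swapped or tested in isolation; every name here is illustrative.

```python
from typing import Protocol


class Indexer(Protocol):
    def ingest(self, raw_docs: list[bytes]) -> None:
        """Canonicalize, chunk with metadata, embed, index, store provenance and permissions."""
        ...


class QueryService(Protocol):
    def build_context(self, query: str, user_scope: str) -> list[dict]:
        """Hybrid retrieve, filter by permissions and freshness, rerank, assemble under a token budget."""
        ...


class AnswerService(Protocol):
    def answer(self, query: str, context: list[dict]) -> dict:
        """Apply the truth boundary, generate with the citation contract, return answer + citations + traces."""
        ...


class EvalHarness(Protocol):
    def nightly_run(self) -> dict[str, float]:
        """Regression metrics that gate deploys, plus sampled citation and security audits."""
        ...
```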


Resources

Retrieval-Augmented Generation (Lewis et al., 2020)

The original RAG framing: retrieval as non-parametric evidence that can be updated, audited, and cited.

BEIR benchmark (Thakur et al., 2021)

A practical retrieval benchmark suite (lexical, dense, rerankers) that’s great for measuring Recall@K / MRR / nDCG and detecting regressions.

RAGAS — Automated RAG evaluation (Es et al., 2023/2024)

A toolkit for evaluating RAG quality (faithfulness, answer relevance, context precision/recall) so tuning becomes measurable, not vibes.

OWASP GenAI — Prompt Injection (truth boundary + untrusted docs)

Threat model + mitigations for the core RAG security reality: retrieved documents are an input channel and can carry hostile instructions.



What’s Next

June expanded the interface: multimodal UX.

July built the truth pipeline:

  • retrieval is a subsystem
  • reranking buys precision
  • citations need contracts
  • evaluation makes tuning safe
  • truth boundaries make the product trustworthy

Next month we move from “the model answers” to “the model acts”:

Tool use with open models — function calling, sandboxes, and capability boundaries.

Because the moment the model can trigger side effects, the question changes again:

It’s no longer “is this answer correct?”

It’s:

“What is this model allowed to do, under what constraints, and how do we prove it stayed inside the boundary?”

Axel Domingues - 2026