
RAG isn’t “add a vector DB and hope.” It’s a search-and-reasoning subsystem with contracts, metrics, and failure budgets — and you can only operate what you can evaluate.
Axel Domingues
June was about multimodal UX: once the model can see and hear, the interface stops being “a chat box” and becomes a product surface.
July is the less glamorous part.
Because the moment your LLM answers questions about your business, you’re no longer building a prompt.
You’re building a truth system.
And Retrieval-Augmented Generation (RAG) is the first architecture pattern that forces you to say, out loud:
RAG is not “hallucination prevention”.
RAG is a search subsystem that feeds a probabilistic generator — and you need the same discipline you’d bring to:
That’s missing architecture.
The goal
Ship RAG that stays correct under change — new docs, new models, new prompts.
The core idea
Treat RAG as a subsystem with contracts: retrieval → evidence → answer.
The operability rule
If you can’t evaluate it, you can’t safely tune it.
The safety rule
Your documents are untrusted input. RAG inherits injection risk.
When teams say “we built RAG”, they often mean: “we embedded PDFs and stuffed chunks into the prompt.”
That’s a prototype.
A production RAG system has three planes:
Data plane
Ingestion, chunking, embeddings, indexing, freshness, permissions.
Query plane
Query rewriting, retrieval, reranking, dedupe, context assembly.
Answer plane
Truth boundaries, citation contract, refusal behavior, formatting.
Evaluation plane
Offline test sets, regression gates, online telemetry, audits.
If you only build the first three and ignore the fourth, your system will regress quietly — and you’ll only notice when a customer pastes a screenshot.
Before pipelines and vector stores, define the truth boundary:
What claims must be supported by retrieved evidence, and what claims may be speculative?
A practical way to do this is to separate outputs into two zones:
Here’s a version I’ve used in real systems:
| Output type | Allowed? | Policy |
|---|---|---|
| Exact facts about your internal docs (pricing, policy, procedures) | ✅ | Must cite sources. If retrieval confidence is low → refuse / ask clarifying Q. |
| Summaries of provided documents | ✅ | Must cite sections; prefer quoting key lines. |
| Recommendations based on internal docs | ✅ | Must tie each recommendation to cited constraints. |
| General world knowledge | ⚠️ | Allowed only if labeled as general knowledge and separated from cited claims. |
| Legal/medical/financial advice | 🚫/⚠️ | Usually disallowed; or gated to “informational only” with strong disclaimers. |
And once you have them, your evaluation becomes meaningful:They are product contracts.
A useful mental model:
Retrieval is a recall problem first, a precision problem second.
If the right information never enters the candidate set, no reranker or LLM can save you.
RAG quality is usually lost before embeddings exist.
Common ingestion mistakes:
You have a data security problem.
Chunking controls:
Rules that survive reality:
doc_id, section, timestamp, permission_scopebetter chunking + better metadata + reranking.
Dense vectors are great, but lexical signals still matter:
A strong baseline is hybrid retrieval:
Most teams underuse reranking because it feels “extra.”
But reranking is often the cheapest way to improve grounded answers because it:
A simple, reliable pattern:
By March, we already treated context assembly like a subsystem.
RAG makes that subsystem unavoidable.
Context assembly decisions you must make explicitly:
A rule that saves you
Token budgets are not an optimization. They are an operability constraint.
That last one sounds obvious.
And yet it’s the #1 citation integrity bug I see.
A citation system can be honest or cosmetic.
Honest citations require a contract.
For every cited claim, the system must be able to answer:
doc_id)start_offset, end_offset OR page, paragraph)If you don’t track this, you cannot audit.
And if you can’t audit, you will eventually ship confident nonsense with a citation badge.
Instead of “Source: Employee Handbook”, store and show:
- the exact snippet used
- with a link to the parent doc
- and highlight the supporting span
This happens when retrieval is ok, but chunking boundaries are messy or reranking is weak.
Fixes:
This happens when the UI renders citations from retrieval logs, not from the final assembled context.
Fixes:
This is the core groundedness problem.
Fixes:
If you want to improve RAG safely, you need to evaluate it in layers.
Not because metrics are fun.
Because without them, you’re tuning blind.
Retrieval eval
Did we fetch the right evidence at all?
Answer eval
Given good evidence, did we answer correctly?
Citation eval
Do citations actually support the claims?
Policy eval
Did the system stay inside the truth boundary (refuse when needed)?
Not 10 cherry-picked examples.
You want:
For each question, store:
You can run retrieval evaluation without the LLM.
Common retrieval metrics:
You don’t need academic perfection here.
You need trend detection:
A trick that saves time:
Run two modes:
If oracle mode is bad, your problem is:
If oracle is good but end-to-end is bad, your problem is:
This is where most teams stop.
Don’t.
A simple practical protocol:
You can do this with human review, model-as-judge, or both.
But you must do it.
Supported-by-citation rate is a metric.
Offline eval prevents regressions.
Online telemetry catches reality.
What I consider “minimum viable observability” for RAG:
When retrieval gets worse, users ask again in different words. That’s signal.retrieval confidence + refusal rate + user re-ask rate
If you index:
…you are indexing adversarial text, even if the adversary is accidental.
The rule is simple:
Treat it like user input.
Practical guardrails:
Here’s the reference pattern I use because it scales with complexity and keeps things testable:

Even if you implement it inside one codebase, these boundaries keep you sane.
Retrieval-Augmented Generation (Lewis et al., 2020)
The original RAG framing: retrieval as non-parametric evidence that can be updated, audited, and cited.
BEIR benchmark (Thakur et al., 2021)
A practical retrieval benchmark suite (lexical, dense, rerankers) that’s great for measuring Recall@K / MRR / nDCG and detecting regressions.
If your system answers questions with high trust requirements, yes.
Embedding similarity is a rough recall tool. Reranking is where you decide which evidence actually matters.
In practice, reranking is often the cheapest way to improve:
Big enough to preserve meaning, small enough to be specific.
In production, I’ve found the right answer is rarely “one size”. Use section-based chunking with size caps, and store metadata that lets you stitch context:
It’s useful as a trend detector, not as a truth oracle.
Use it to:
For high-stakes correctness, keep a human audit loop — even if it’s small and sampled.
Because retrieval failed.
Re-asks are often not “UX problems”. They’re recall problems — the user is trying to search around your search.
June expanded the interface: multimodal UX.
July built the truth pipeline:
Next month we move from “the model answers” to “the model acts”:
Tool use with open models — function calling, sandboxes, and capability boundaries.
Because the moment the model can trigger side effects, the question changes again:
It’s no longer “is this answer correct?”
It’s:
“What is this model allowed to do, under what constraints, and how do we prove it stayed inside the boundary?”
Tool Use with Open Models: function calling, sandboxes, and “capability boundaries”
Tool use is where LLMs stop being “text generators” and start being integration surfaces. With open weights, reliability isn’t a given — you have to engineer it with contracts, sandboxes, and explicit capability boundaries.
Multimodal Changes UX: designing text+vision+audio systems
Multimodal isn’t “a bigger prompt”. It’s a perception + reasoning + UX system with new contracts, new failure modes, and new latency/cost constraints. This month is about designing it so it behaves predictably.