
Bigger context windows tempt teams to paste everything. But long context is just a larger input buffer — not memory, not grounding, and not a plan. This month: how to budget context, decide “stuff vs retrieve,” and build a context assembler that stays fast, cheap, and safe.
Axel Domingues
January was about contracts:
February is about the temptation that breaks those disciplines:
“We have a bigger context window now.
Why not just paste everything?”
Because long context isn’t memory.
It’s not even understanding.
It’s just a larger place to put tokens before you press “run.”
And if you treat it like a free memory upgrade, you’ll ship systems that are:
So this month is a practical guide to a simple question:
When should you stuff context, and when should you retrieve?
The mental model
Context window is RAM.
Retrieval is disk + index.
The trap
Stuffing everything feels safe…
until cost, latency, and confusion explode.
The goal
A repeatable rule: stuff, retrieve, or hybrid — by design.
The deliverable
A Context Assembler with budgets, ranking, safety filters, and telemetry.
Teams melt down here because they use one word (“context”) for three different jobs.
Let’s separate them cleanly:
Context window
What the model can see right now in a single request.
Memory
Durable state across requests: preferences, facts, decisions, user profile.
Retrieval
Selecting a small set of relevant artifacts on demand.
Grounding
Constraining answers to verifiable sources (docs, DB, tools), not vibes.
The key point:
Long context is useful. It’s also a footgun.
Even if the model can “see” everything, it may not use the right part.
Large prompts introduce:
Stuffing is the easiest way to quietly ship a cost bomb.
It also makes latency worse:
If untrusted content enters the context, you’ve effectively handed the attacker a megaphone.
Document injection looks like:
If you’re stuffing raw documents, emails, tickets, PDFs — you are expanding the model’s instruction surface.
You wouldn’t do that. Don’t do the LLM equivalent.
When answers are wrong, you won’t know:
Retrieval gives you observability:
Stuffing gives you: “¯_(ツ)_/¯ it was in there somewhere.”
Here’s the rule I use.
The fastest heuristic
If the answer should change when the docs change, you need retrieval.
A useful mental model:

This model naturally leads to architecture:
Build a context assembler that decides what goes into RAM.
Everything else stays indexed on disk.
If January was “build an LLM boundary layer,” February is the next layer up:
a context assembler.
A context assembler is the subsystem that decides:
Gather possible context sources:
Assign each candidate a score and a cost:
Output a packet that your application can log and evaluate:
This is how you make context operable.
Most teams don’t have a context problem. They have a budgeting problem.
The “paste everything” approach avoids making tradeoffs — until production forces the tradeoffs violently.
A context budget has three parts:
LLM incidents love your tail latency.
Here’s a stable order that works surprisingly well:
That last step is where stuffing belongs:
Retrieval is not “add embeddings and pray.”
Retrieval is a pipeline, and each stage can be engineered.
A practical retrieval stack:
Chunk size and overlap decide what your system can “find.”
Common patterns:
Fix chunking before you blame the model.
Embedding similarity gets you candidates. Reranking gets you relevance.
Even a simple reranker reduces:
Retrieval is where you enforce:
This is why retrieval is an architecture decision: it’s the boundary between “knowledge” and “random text.”
The most useful real-world pattern is hybrid:
Here’s a simple context layout (conceptually):
That layout isn’t magic — it’s just separation of concerns.
If you do retrieval, you will eventually retrieve something hostile.
So design for it.
Rules that work:
Your tool gateway and contracts are the boundary.
If you can’t measure retrieval, you can’t trust it.
A minimal dashboard for this month’s topic:
Subsystems get dashboards.
Retrieval-Augmented Generation (RAG) paper (Lewis et al., 2020)
The canonical framing: retrieval as non-parametric “memory” to ground generation and keep knowledge updatable.
Lost in the Middle (Liu et al., 2023)
Why “just paste everything” fails: models often miss key facts buried in long contexts (the classic U-shaped attention effect).
OWASP GenAI Top 10 — Prompt Injection (LLM01)
A practical threat model for document injection + jailbreaking, with mitigation guidance for LLM apps.
UK NCSC — “Prompt injection is not SQL injection (it may be worse)”
Clear security framing: prompts mix data + instructions, so you must design systems to reduce impact, not assume perfect prevention.
Yes — for the same reason you still need databases when you have RAM.
Retrieval gives you:
No. Summaries are lossy compression.
They are useful for:
They are not a substitute for:
Start with hybrid:
Then expand coverage.
“Stuffing works… until it doesn’t.”
It fails first on:
Retrieval plus budgets fixes all four.
Here’s a drop-in replacement for your February article’s “What’s Next” section (updated to tease the new March topic, without referencing unreleased models):
January built the boundary layer.
February built the context strategy:
Next month we turn this into a real subsystem:
Context Assembly as a Subsystem: summaries, state, and token budgets
Because “stuff vs retrieve” is only half the battle.
The part that makes it operable is having a component that can:
If you can’t answer “what did the model see, and why?”, you don’t have a system.
You have a vibe.
Context Assembly as a Subsystem: Summaries, State, and Token Budgets
“Stuff vs retrieve” is only half the battle. The operable part is a context assembler: a subsystem that selects, budgets, sanitizes, and logs exactly what the model sees—so you can debug, evaluate, and scale LLM features without vibes.
Prompting is Not Programming: Contracts, Schemas, and Failure Budgets
Prompting feels like coding until it fails like statistics. This month I start treating LLMs as probabilistic components: define contracts, enforce schemas, and design failure budgets so your system survives outputs that are “plausible” but wrong.