
“Stuff vs retrieve” is only half the battle. The operable part is a context assembler: a subsystem that selects, budgets, sanitizes, and logs exactly what the model sees—so you can debug, evaluate, and scale LLM features without vibes.
Axel Domingues
January built the LLM boundary layer: contracts, schemas, failure budgets.
February built the context strategy: long context isn’t memory—when to stuff, when to retrieve.
March is where those ideas become real software:
If you can’t answer “what did the model see, and why?”
you don’t have a system. You have a vibe.
This article is about building the missing component in most LLM products:
a context assembler.
Not “prompt engineering.”
Not “RAG integration.”
A subsystem with explicit selection, budgets, sanitization, and logging, so that LLM features behave like something you can operate.
The thesis
Context is produced by a pipeline, not written by hand.
The unit of work
Emit a Context Packet: what went in, why, and how much it cost.
The constraints
Budgets are non-negotiable: token cap, latency cap, cost cap.
The payoff
You get observability: selection quality, failures, drift, and ROI.
When teams say “prompt,” they usually mean a whole bundle of concerns: task instructions, policy, conversation state, retrieved evidence, and output formatting.
That is not a “prompt.”
That’s a compiled artifact.
So let’s name it properly: context assembly is the build process, and the prompt is its compiled output.
If you don’t separate those, you can’t optimize or debug either.
A context assembler is a subsystem that builds the model’s input from multiple sources under hard constraints.

The Context Packet is the most important design decision in this article.
It’s how you replay requests, run evals, track cost, and audit exactly what the model saw.
{
  "packet_version": "2024-03-01",
  "request_id": "req_...",
  "user_id": "u_...",
  "conversation_id": "c_...",
  "task": {
    "name": "support_reply",
    "risk_tier": "medium",
    "output_schema": "SupportReplyV2"
  },
  "budgets": {
    "max_input_tokens": 8000,
    "target_input_tokens": 5000,
    "max_latency_ms": 2000
  },
  "sources": [
    {
      "type": "policy",
      "id": "policy:v3",
      "tokens": 420,
      "trust": "trusted"
    },
    {
      "type": "conversation_summary",
      "id": "summary:c_...:rev17",
      "tokens": 280,
      "trust": "trusted"
    },
    {
      "type": "retrieved_chunk",
      "id": "kb:billing:doc42#p3",
      "tokens": 310,
      "trust": "untrusted",
      "score": 0.83
    }
  ],
  "render": {
    "template_id": "prompt:support:v5",
    "input_tokens_est": 5200
  },
  "safety": {
    "pii_redaction": true,
    "acl_enforced": true,
    "instruction_isolation": true
  }
}
This is boring on purpose.
Boring means you can operate it.
And if you can’t reproduce failures, you can’t improve reliability—only vibes.
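Before a packet reaches the renderer, it should be checkable by a machine. Here is a minimal sketch of such a check in Python, assuming the field names from the example packet above; the validate_packet helper is illustrative, not a library API.

# Minimal Context Packet check before rendering. Field names follow the
# example packet above; the helper itself is a sketch, not a library API.
from typing import Any

REQUIRED_KEYS = {"packet_version", "request_id", "task", "budgets",
                 "sources", "render", "safety"}

def validate_packet(packet: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the packet is usable."""
    errors: list[str] = []
    missing = REQUIRED_KEYS - packet.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    budgets = packet["budgets"]
    total_tokens = sum(s["tokens"] for s in packet["sources"])
    if total_tokens > budgets["max_input_tokens"]:
        errors.append(f"sources exceed max_input_tokens: {total_tokens}")
    if packet["render"]["input_tokens_est"] > budgets["max_input_tokens"]:
        errors.append("rendered estimate exceeds max_input_tokens")

    for source in packet["sources"]:
        if source["trust"] not in {"trusted", "untrusted"}:
            errors.append(f"unknown trust level on {source['id']}")
    return errors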
A context assembler exists to enforce budgets.
Not “nice-to-have” budgets.
Hard budgets.
Token budget
Cap input size and allocate tokens by priority tiers.
Latency budget
Don’t let retrieval, reranking, or tool calls blow up p95.
Cost budget
Retries and long prompts scale cost faster than you expect.
Risk budget
High-risk tasks require stricter sources and more verification.
A clean pattern is to partition context into priority tiers: for example, Tier 0 for task instructions and policy, Tier 1 for conversation state, Tier 2 for retrieved evidence, and Tier 3 for nice-to-have extras.
This makes tradeoffs explicit:
When we exceed budget, we drop Tier 3 first.
Then reduce Tier 2.
We never drop Tier 0.
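Here is a minimal sketch of that policy, assuming each source carries a tier field (lower number means higher priority) alongside its token count:

# Sketch: select sources tier by tier under a hard token cap.
# Assumption: lower tier number = higher priority, and Tier 0 is never dropped.
def select_sources(sources, max_input_tokens):
    """sources: dicts with 'tier' and 'tokens'; returns (kept, dropped)."""
    kept, dropped, used = [], [], 0
    for source in sorted(sources, key=lambda s: s["tier"]):
        if source["tier"] == 0 or used + source["tokens"] <= max_input_tokens:
            kept.append(source)
            used += source["tokens"]
        else:
            dropped.append({**source, "drop_reason": "over_budget"})
    return kept, dropped

Recording the dropped list is what later lets you answer "why didn't the model see X?"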
Conversation history is expensive and noisy.
You don’t want to carry 200 turns forever.
So you compress state into a summary that is small, structured, and versioned.
I recommend splitting the summary into two artifacts: a rolling situation summary and a decision ledger.
Why? Because “what’s going on” changes, but “what we decided” is an invariant.
Treat summaries like data:
- schema them
- validate them
- and update them via a controlled process.
The summary update should be an explicit workflow, not an accidental side effect.
Common triggers: every N turns, after a decision is reached, or when history exceeds its token budget.
The model can draft an update, but your boundary layer accepts only schema-valid updates.
Store each revision, what triggered it, and the diff from the previous one.
If the update contradicts the decision ledger, require user confirmation or refuse the update.
The point: summaries are state. State is production-critical.
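A minimal sketch of the two artifacts and a guarded update follows; the field names and the contradiction check are illustrative assumptions, not a prescribed schema.

# Sketch: summaries as schema'd, versioned state. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DecisionLedger:
    decisions: list[str] = field(default_factory=list)   # append-only invariants

@dataclass
class SituationSummary:
    revision: int
    text: str
    open_questions: list[str] = field(default_factory=list)

def apply_summary_update(summary: SituationSummary, ledger: DecisionLedger,
                         proposed_text: str, contradicts_ledger: bool) -> SituationSummary:
    """Accept a model-drafted update only if it passes validation."""
    if not proposed_text.strip():
        raise ValueError("empty summary update rejected")
    if contradicts_ledger:
        # Per the rule above: contradictions need user confirmation, not auto-accept.
        raise ValueError("update contradicts decision ledger; ask the user")
    return SituationSummary(revision=summary.revision + 1, text=proposed_text,
                            open_questions=list(summary.open_questions))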
In February we talked about retrieval strategy.
Here, the key move is architectural:
retrieval is one stage in the context assembly pipeline.
So you need the same properties as any other pipeline stage: a contract, bounded latency, filters, and evals.
Most teams instead wire a vector store straight into the prompt and ship whatever top-k returns.
That’s not architecture. That’s outsourcing your truth boundary to cosine similarity.
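A sketch of what that stage contract can look like; the names (RetrievalStage, Chunk, deadline_ms) are assumptions, not a specific vendor API.

# Sketch: retrieval as a bounded, contract-bound pipeline stage.
# Names are illustrative; wire this to your own vector store and reranker.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Chunk:
    id: str              # e.g. "kb:billing:doc42#p3"
    text: str
    score: float
    trust: str = "untrusted"   # retrieved content is evidence, never policy

class RetrievalStage(Protocol):
    def retrieve(self, query: str, *, k: int,
                 filters: dict, deadline_ms: int) -> list[Chunk]:
        """Return at most k reranked, filtered chunks within deadline_ms, or raise."""
        ...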
Your context packet contains both trusted instructions and untrusted data.
Treating data as instructions is how prompt injection becomes a product incident.
When you render retrieved content, label it like evidence: source ID, timestamp, and trust level.
Then structure it consistently:
[UNTRUSTED EVIDENCE]
Source: kb:billing:doc42#p3 (timestamp=2024-02-10)
Content:
...
[/UNTRUSTED EVIDENCE]
This doesn’t “solve security.”
But it makes your intent legible and makes injections easier to detect in logs.
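A small sketch of a renderer that emits exactly that format every time (the function name is illustrative):

# Sketch: always render untrusted evidence with the same explicit labels.
def render_evidence(chunk_id: str, timestamp: str, content: str) -> str:
    return (
        "[UNTRUSTED EVIDENCE]\n"
        f"Source: {chunk_id} (timestamp={timestamp})\n"
        "Content:\n"
        f"{content}\n"
        "[/UNTRUSTED EVIDENCE]"
    )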
The model is not your security boundary. Your security boundary is tool scopes, allowlists, argument validation, and audit logs.
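A sketch of enforcing that boundary outside the model; the allowlist and tool names are assumptions.

# Sketch: allowlist + argument validation + audit log, enforced in code,
# not in the prompt. Tool names and the allowlist shape are illustrative.
import logging

ALLOWED_TOOLS = {"lookup_invoice": {"invoice_id"}}   # tool -> allowed argument names

def call_tool(name: str, args: dict, audit: logging.Logger):
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not in allowlist: {name}")
    unexpected = set(args) - ALLOWED_TOOLS[name]
    if unexpected:
        raise ValueError(f"unexpected arguments for {name}: {sorted(unexpected)}")
    audit.info("tool_call %s %s", name, args)
    # ...dispatch to the real implementation here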
The “render” step should be deterministic.
No last-minute cleverness. No dynamic formatting inside the model.
Given a Context Packet, render the exact same prompt every time.
Why? Because reproducibility is everything.
Context compilation rule
If two requests build the same Context Packet, they must render the same prompt.
This is how you turn LLM requests into something you can cache, replay, and test.
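One way to hold yourself to that rule is to derive the prompt and its cache key from a canonical serialization of the packet. A minimal sketch, with an intentionally naive renderer:

# Sketch: same Context Packet -> same prompt -> same cache key.
# Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
import hashlib, json

def packet_cache_key(packet: dict) -> str:
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def render_prompt(packet: dict, templates: dict) -> str:
    """Pure function of the packet: no clock, no randomness, no network."""
    template = templates[packet["render"]["template_id"]]
    # Naive placeholder: a real renderer expands each source's content.
    evidence = "\n\n".join(s["id"] for s in packet["sources"])
    return template.format(evidence=evidence)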
A context assembler is a factory.
Factories need dashboards.
Here’s the design I recommend for most teams.
Context Store
Summaries, ledgers, source metadata, cached packets.
Retrieval Service
Top-k + rerank + filters with bounded latency.
Context Compiler
Budgets + selection + sanitize + deterministic render.
Packet Log
Append-only log for replay, evals, and audits.
Determine the task, its risk tier, and its budgets.
Load the conversation summary, decision ledger, and policy sources.
Run retrieval with filters, top-k, reranking, and a hard latency bound.
Select sources by tier until budget is met. Record what was dropped and why.
Template + stable formatting + clear labels.
Validate output schema, apply retries/fallbacks.
On success, propose a summary update and validate it.
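Putting the steps together, here is a minimal sketch of the compiler's happy path; every injected callable is illustrative and maps to one step above.

# Sketch: the compiler's happy path. The stage functions are injected, so the
# orchestration stays a pure, testable composition of the steps above.
def assemble_context(request: dict, budgets: dict, *, load_state, retrieve,
                     select, build_packet, render, log_packet):
    state_sources = load_state(request["conversation_id"])    # summary, ledger, policy
    chunks = retrieve(request["query"], k=8,
                      filters=request.get("filters", {}),
                      deadline_ms=budgets["max_latency_ms"])
    kept, dropped = select(state_sources + chunks,
                           budgets["max_input_tokens"])        # tiered selection
    packet = build_packet(request, budgets, kept, dropped)     # the Context Packet
    prompt = render(packet)                                    # deterministic render
    log_packet(packet)                                         # replay / evals / audits
    return prompt, packet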
This blueprint is intentionally boring.
Boring is what survives production.
Symptom: identical requests produce different answers.
Likely cause: non-deterministic context build (different sources selected).
Fix: Context Packet + deterministic render + stable ranking + versioned templates.
Symptom: token cost and latency keep creeping up.
Likely cause: unbounded history + too many chunks + retry loops.
Fix: tiered budgets + hard caps + log token breakdown.
Symptom: the model follows instructions it found in retrieved documents.
Likely cause: no instruction isolation, untrusted content treated as policy.
Fix: format evidence explicitly as data + tool scopes + policy precedence.
Symptom: answers cite irrelevant or wrong material.
Likely cause: bad chunking, no reranking, no filtering, no evals.
Fix: retrieval stage contract + offline eval harness + rerank + filters.
Symptom: you can't reproduce a bad output after the fact.
Likely cause: context not logged or not versioned.
Fix: store Context Packets + template IDs + source IDs for replay.
JSON Schema (Context Packets, summaries, contracts)
Use schemas to make Context Packets + summary artifacts validatable, diffable, and boring (in the best way).
OpenTelemetry — Semantic Conventions for Generative AI
A shared vocabulary for tracing LLM calls (token counts, model attrs, tool calls) so your context assembler can have real dashboards.
OpenAI — Structured Outputs (JSON Schema guarantees)
A practical way to enforce “schema-valid or fail” outputs at the boundary layer—pairs perfectly with contracts and packet logging.
OpenAI — Working with evals (operability)
Turn “did it behave?” into a repeatable test suite you can run before model upgrades, prompt/template changes, or retrieval tweaks.
If it’s a toy, you can keep it simple.
If it serves real users, pulls from multiple sources, and has cost, latency, or risk constraints,
then context assembly becomes a production subsystem whether you admit it or not.
A larger window increases capacity, not discipline.
You still need selection, budgets, sanitization, and logging.
Big windows make the need for assembly stronger, not weaker.
Start with logging a Context Packet for every request: template ID, source IDs, and token counts.
That alone unlocks replay and cost visibility.
March made context operable: Context Packets, tiered budgets, summaries as state, and deterministic rendering.
Next month we build on this foundation:
Model Selection Becomes Architecture: routing, budgets, and capability tiers
Because once you can reliably build context, the next question becomes architectural:
Which model should run this contract under this budget and this risk tier—and what’s the fallback when it can’t?