
Long context doesn’t kill RAG — it changes what’s cheap, what’s risky, and what needs architecture. This month is a practical guide to building “context-first” systems without shipping a cost bomb (or a data leak).
Axel Domingues
When long context crossed into “absurd,” a bunch of teams celebrated for the wrong reason.
They treated it like a feature that replaces architecture:
“We can just shove everything into the prompt.”
That works… right up until:
Long context is real leverage — but it doesn’t remove engineering. It moves the boundary of what’s worth retrieving, what’s worth caching, and what’s worth curating.
This month is about that boundary shift.
I’m saying long context changes system design the way “cheap storage” changed databases: you start designing around what’s cheap enough, not what’s theoretically minimal.
**The new reality:** Your context window can fit whole manuals, large codebases, or multi-day threads.
**The new constraint:** Cost + latency scale with “how much you stuff,” and attention doesn’t scale linearly.
**The new risk:** More context means more secrets, injections, and compliance surface area.
**The new opportunity:** Context becomes a tiered memory system: hot, warm, cold — not “prompt vs RAG”.
Before long context, the default plan was:
With long context, you can load an entire knowledge slice and let the model navigate.
So what changes?
If you can fit the full product spec, you can stop playing “top‑k roulette” for many tasks:
But…
Long context is “coverage.” RAG is “precision.”
If the task is needle‑finding (one clause, one config key, one exception case), retrieval + tight context is still cheaper and often more accurate.
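A toy routing heuristic, to make the coverage-vs-precision split concrete (the task names and the 50% threshold are illustrative assumptions, not recommendations):

```python
def choose_strategy(task_type: str, corpus_tokens: int, context_limit: int) -> str:
    """Illustrative routing: needle tasks go to retrieval, coverage tasks go long-context."""
    NEEDLE_TASKS = {"lookup", "clause", "config_key", "error_signature"}
    if task_type in NEEDLE_TASKS:
        return "rag"              # precision: retrieve a tight slice
    if corpus_tokens <= 0.5 * context_limit:
        return "long_context"     # coverage: load the whole knowledge slice
    return "hybrid"               # retrieve to locate, expand to reason
```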
Long context forces you to budget across:
That last one is the trap: you can buy tokens and still get shallow answers.
You gave the model everything… and it treated the most important page like background noise.
I’ve started treating context as a pyramid with explicit tiers.

This is the part you never let be ambiguous:
Long context mostly expands Tier 3 — and that’s where your architecture needs to show up.
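Here’s a minimal sketch of what explicit tiers can look like in code (the `ContextTier` class, the names, and the budgets are illustrative, not a library API):

```python
from dataclasses import dataclass, field

@dataclass
class ContextTier:
    """One tier of the context pyramid, with a hard token budget."""
    name: str                    # "hot", "warm", or "cold"
    budget_tokens: int           # hard cap enforced at packing time
    items: list[str] = field(default_factory=list)

    def used_tokens(self, count_tokens) -> int:
        # count_tokens is your tokenizer's counting function
        return sum(count_tokens(item) for item in self.items)

# Illustrative defaults: hot = task + instructions, warm = the working set,
# cold = the expanded knowledge slice that long context makes affordable.
PYRAMID = [
    ContextTier("hot", budget_tokens=4_000),
    ContextTier("warm", budget_tokens=30_000),
    ContextTier("cold", budget_tokens=200_000),
]
```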
Long-context systems that hold up aren’t just bigger prompts. They’re better packers:
- dedupe
- prioritize
- compress
- and only expand cold memory when the task demands it
If you want long context to be an advantage, you need a pipeline. Not a prompt.
Here’s the minimal architecture I’ve found that holds up in production.

Pick hard limits per request class. Then build enforcement into the system (not “developer discipline”).
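A sketch of what enforced budgets can look like; the request classes and numbers are hypothetical placeholders:

```python
# Hypothetical per-request-class budgets, checked before any model call.
TOKEN_BUDGETS = {
    "chat_turn":       {"max_input": 16_000,  "max_output": 1_000},
    "doc_qa":          {"max_input": 120_000, "max_output": 2_000},
    "incident_review": {"max_input": 400_000, "max_output": 4_000},
}

class BudgetExceeded(Exception):
    pass

def enforce_budget(request_class: str, input_tokens: int) -> None:
    """Reject (or trigger compaction) instead of relying on developer discipline."""
    budget = TOKEN_BUDGETS[request_class]
    if input_tokens > budget["max_input"]:
        raise BudgetExceeded(
            f"{request_class}: {input_tokens} input tokens > cap {budget['max_input']}"
        )
```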
Do redaction and classification before anything else:
Summarization after leakage is still leakage.
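A toy redaction pass, just to show the ordering; the patterns stand in for whatever your real DLP or classification service does:

```python
import re

# Toy patterns only: a real system would call its DLP / classification service here.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),    # API-key-shaped strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped numbers
]

def redact_then_pack(raw_docs: list[str]) -> list[str]:
    """Redact before summarization or packing: summarizing leaked text is still leakage."""
    cleaned = []
    for doc in raw_docs:
        for pattern in SECRET_PATTERNS:
            doc = pattern.sub("[REDACTED]", doc)
        cleaned.append(doc)
    return cleaned
```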
Long context dies from redundancy:
Dedupe aggressively, then compress.
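A minimal dedupe-then-compress sketch; the `summarize` and `count_tokens` hooks stand in for your own compaction step and tokenizer:

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact duplicates; near-duplicate detection (shingling, embeddings) is the next step."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

def compress(chunks: list[str], summarize, budget_tokens: int, count_tokens) -> list[str]:
    """Summarize only once the deduped set still exceeds the budget."""
    if sum(count_tokens(c) for c in chunks) <= budget_tokens:
        return chunks
    return [summarize(c) for c in chunks]
```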
Models have positional biases. You want the material that matters most where attention is strongest: near the start and end of the prompt, not buried in the middle.
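One deliberately crude way to encode that bias into packing order (a sketch, not a tuned policy):

```python
def order_for_position_bias(critical: list[str], background: list[str]) -> list[str]:
    """Keep the most important material near the edges of the prompt and
    push bulky background toward the middle, where attention is weakest."""
    if not critical:
        return list(background)
    head, tail = critical[0], critical[1:]
    return [head] + list(background) + tail
```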
Log a context manifest:
If you can’t see it, you can’t tune it.
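A minimal manifest logger; the field names are illustrative, the point is one structured record per request:

```python
import json
import time

def log_context_manifest(request_id: str, tier_stats: list[dict], sink=print) -> None:
    """Emit one structured record per request: what went in, from where, how much it was squeezed."""
    manifest = {
        "request_id": request_id,
        "timestamp": time.time(),
        # each entry: {"tier": "cold", "tokens": 183_000, "source_ids": [...],
        #              "compression_ratio": 0.4, "dedupe_ratio": 0.7}
        "tiers": tier_stats,
    }
    sink(json.dumps(manifest))
```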
I use this decision matrix in architecture reviews.
| Task type | Long context is great when… | RAG is great when… | Typical hybrid |
|---|---|---|---|
| Summarization | You need coherent coverage of an entire artifact | You only need one section | Retrieve key sections + load the whole doc only if needed |
| Q&A | The question is broad and multi-part | The question is needle-like | RAG to locate needles + long context for surrounding reasoning |
| Code assistance | You need cross-file reasoning in one repo slice | You only need exact symbol usage | Retrieve call graph + load the relevant modules |
| Compliance/policy | You must interpret a full policy document | You need one clause fast | Load policy index + retrieve clause + include policy definitions |
| Support/incident | You need timeline + multi-source coherence | You need one error signature | RAG for signatures + long context for the incident narrative |
The pattern to notice: hybrid wins, but with a new default: retrieve to locate, then expand to reason only when the task needs it.
Long context changes what “debugging” looks like.
In classic RAG debugging, you ask: did retrieval surface the right chunks?
In long-context debugging, you ask: did the model actually use the right parts of everything you gave it?
Here’s what I recommend logging as a baseline.
**Context manifest:** Token counts per tier, source IDs, compression ratio, dedupe ratio.
**Attention proxies:** Citations/quotes mapped to sources, plus “unused context” estimates.
**Outcome quality:** Task success metrics + reviewer feedback + auto-checkers where possible.
**Cost & latency:** p50/p95 latency, tokens in/out, cache hit rate, tool time.
If the task is high-risk (policy, finance, legal, production ops), don’t ask for “citations” as a vibe.
Make grounding a contract:
You’ll be shocked how many “confident” answers collapse when you require traceable sources.
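One way to make that contract executable, assuming you’ve asked the model to return claims with source IDs (the response schema here is an assumption, not a standard):

```python
def enforce_grounding_contract(answer: dict, allowed_source_ids: set[str]) -> dict:
    """Reject answers whose claims aren't traceable to packed sources.
    Assumes the model returns {"claims": [{"text": ..., "source_id": ...}]}."""
    unsupported = [
        claim for claim in answer.get("claims", [])
        if claim.get("source_id") not in allowed_source_ids
    ]
    if unsupported:
        return {"status": "rejected", "unsupported_claims": unsupported}
    return {"status": "accepted", "claims": answer.get("claims", [])}
```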
If your system includes 600k tokens and the output only references one 2k chunk, you probably paid for noise.
You can approximate “unused context” by mapping cited spans back to their sources and measuring how much of what you packed was never referenced.
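A rough proxy, assuming you can map citations in the answer back to source IDs:

```python
def unused_context_ratio(packed_source_ids: list[str],
                         cited_source_ids: set[str],
                         tokens_by_source: dict[str, int]) -> float:
    """Share of packed tokens whose source was never cited in the answer."""
    total = sum(tokens_by_source.get(sid, 0) for sid in packed_source_ids)
    unused = sum(
        tokens_by_source.get(sid, 0)
        for sid in packed_source_ids
        if sid not in cited_source_ids
    )
    return unused / total if total else 0.0
```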
Long context makes two old problems worse: leakage and injection.
If you stuff secrets, PII, or internal-only documents into the cold tier, you increase your chance of exposing them in output, telemetry, or tool calls.
An injection string inside a doc is no longer “some weird chunk.” It can be embedded anywhere in the cold tier.
So you need controls that are built for scale.
Practical controls that actually ship:
The winning pattern I’ve seen is a two-pass flow: locate with cheap retrieval first, then expand to the full context only when the first pass says the task needs it.
Why it works: you only pay long-context prices (tokens, latency, risk surface) when the task has earned them.
Long context is a capability. “Expand-on-demand” is a strategy.
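A sketch of the two-pass flow, with `retrieve`, `pack_full_context`, and `ask_model` standing in for your own components:

```python
def two_pass_answer(question: str, retrieve, pack_full_context, ask_model) -> str:
    """Pass 1 locates with a small, precise context; pass 2 pays for long
    context only when pass 1 says the cheap answer isn't sufficient."""
    # Pass 1: locate with retrieval and a tight context
    candidates = retrieve(question, top_k=8)
    first = ask_model(question, context=candidates, mode="locate")
    if first.get("sufficient"):
        return first["answer"]

    # Pass 2: expand cold memory around whatever pass 1 flagged as relevant
    expanded = pack_full_context(first.get("relevant_source_ids", []))
    second = ask_model(question, context=expanded, mode="reason")
    return second["answer"]
```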
If I’m reviewing a “let’s use long context” proposal, I ask:
Lost in the Middle: How Language Models Use Long Context (2023)
A clear look at why “more tokens” doesn’t automatically mean “better use of information”.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)
The modern baseline for grounding generation with retrieval — still relevant in the long-context era.
May was about context economics.
Next month we zoom out again.
If agents are doing work in the background — with retries, tools, and side effects — they start behaving like distributed systems.
Agents as Distributed Systems: outbox, sagas, and “eventually correct” workflows
Agents don’t “run in a loop” — they run across networks, vendors, and failures. This month is about the three patterns that make agent workflows survivable: durable intent (outbox), long-running transactions (sagas), and reconciliation (“eventually correct”).