May 25, 2025 - 16 MIN READ
The 1M-Token Era: how long context changes retrieval economics and system design

Long context doesn’t kill RAG — it changes what’s cheap, what’s risky, and what needs architecture. This month is a practical guide to building “context-first” systems without shipping a cost bomb (or a data leak).

Axel Domingues

When long context crossed into “absurd,” a bunch of teams celebrated for the wrong reason.

They treated it like a feature that replaces architecture:

“We can just shove everything into the prompt.”

That works… right up until:

  • latency doubles,
  • cost triples,
  • the model ignores the paragraph you care about,
  • and your legal team asks why production logs are in the context window.

Long context is real leverage — but it doesn’t remove engineering. It moves the boundary of what’s worth retrieving, what’s worth caching, and what’s worth curating.

This month is about that boundary shift.

When I say “1M-token era,” I’m not arguing that every request should be 1M tokens.

I’m saying the option changes system design the way “cheap storage” changed databases: you start designing around what’s cheap enough, not what’s theoretically minimal.

The new reality

Your context window can fit whole manuals, large codebases, or multi-day threads.

The new constraint

Cost and latency scale with how much you stuff in, and the model's attention doesn't scale linearly with it.

The new risk

More context means more secrets, injections, and compliance surface area.

The new opportunity

Context becomes a tiered memory system: hot, warm, cold — not “prompt vs RAG”.


Long context didn’t kill RAG — it changed the economics

Before long context, the default plan was:

  • retrieve a handful of chunks,
  • keep prompts tight,
  • optimize for “just enough grounding.”

With long context, you can load an entire knowledge slice and let the model navigate.

So what changes?

1) Retrieval is no longer the only way to get “coverage”

If you can fit the full product spec, you can stop playing “top‑k roulette” for many tasks:

  • summarizing a multi-section doc
  • answering within one manual
  • extracting structured requirements from a repo of ADRs

But…

2) Retrieval still wins on precision and cost

Long context is “coverage.” RAG is “precision.”

If the task is needle‑finding (one clause, one config key, one exception case), retrieval + tight context is still cheaper and often more accurate.
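
To make the cost side of that concrete, here is a back-of-the-envelope comparison. The price is a made-up placeholder (US$3 per million input tokens), not any provider's actual rate; the point is the ratio, not the absolute numbers.

```python
# Rough input-cost comparison: "stuff the whole manual" vs. "retrieve a tight slice".
# The price below is an illustrative placeholder, not real provider pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

full_manual = 800_000      # load the entire knowledge slice into the window
tight_slice = 6_000        # a handful of well-chosen retrieved chunks

print(f"full context:  ${input_cost(full_manual):.4f} per request")
print(f"tight context: ${input_cost(tight_slice):.4f} per request")
# Roughly a 130x difference per request. At tens of thousands of requests a
# day, that is the gap between a rounding error and a budget line item.
```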

3) You now have three costs, not one

Long context forces you to budget across:

  • token cost: you pay to send it
  • latency cost: you wait to process it
  • attention cost: the model may not actually use it

That last one is the trap: you can buy tokens and still get shallow answers.

The most common long-context failure mode is attention dilution:

You gave the model everything… and it treated the most important page like background noise.


The “context pyramid”: think tiers, not a single prompt

I’ve started treating context as a pyramid with explicit tiers.

Tier 0: System constraints (hot, tiny, non-negotiable)

This is the part you never let be ambiguous:

  • “what must never happen” (truth boundaries)
  • tool rules and safety policies
  • output format contracts

Tier 1: Task-specific facts (hot)

  • the user’s request
  • current inputs (files, records, UI state)
  • the smallest grounding you can justify

Tier 2: Working memory (warm)

  • conversation summary
  • last decisions
  • a running plan / checklist
  • a short “known constraints” block

Tier 3: Reference memory (cold, large)

  • manuals
  • runbooks
  • policies
  • codebase slices
  • incident timelines

Long context mostly expands Tier 3 — and that’s where your architecture needs to show up.

The best long-context systems aren’t “bigger prompts.”

They’re better packers:

  • dedupe
  • prioritize
  • compress
  • and only expand cold memory when the task demands it
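
One way to make the tiers (and the packing discipline) concrete is to tag every candidate piece of context with its tier before any packing decision happens. A minimal sketch, with type names that are mine rather than any standard API; the later sketches in this post reuse these types.

```python
from dataclasses import dataclass
from enum import IntEnum

class Tier(IntEnum):
    SYSTEM = 0      # constraints, safety policies, output format contracts
    TASK = 1        # the request plus the smallest grounding you can justify
    WORKING = 2     # conversation summary, decisions, running plan
    REFERENCE = 3   # manuals, runbooks, code slices (cold, large)

@dataclass
class ContextBlock:
    tier: Tier
    source_id: str   # where it came from, for the audit trail
    text: str
    tokens: int      # pre-counted so packing decisions stay cheap
```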

What “context-first” system design looks like

If you want long context to be an advantage, you need a pipeline. Not a prompt.

Here’s the minimal architecture I’ve found that holds up in production.

Define a context budget (explicitly)

Pick hard limits per request class:

  • max cold tokens
  • max tool output tokens
  • max conversation memory tokens

Then build enforcement into the system (not “developer discipline”).
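
What "enforcement in the system" can look like, reusing the Tier/ContextBlock sketch above: per-request-class caps on the warm and cold tiers that the packer cannot exceed. The class names and numbers are illustrative, and a real version would also cap tool output separately.

```python
# Hypothetical per-request-class budgets (in tokens), enforced at packing time.
BUDGETS = {
    "support_qa":  {Tier.WORKING: 4_000, Tier.REFERENCE: 40_000},
    "code_review": {Tier.WORKING: 8_000, Tier.REFERENCE: 120_000},
}

def enforce_budget(blocks: list[ContextBlock], request_class: str) -> list[ContextBlock]:
    """Keep Tier 0/1 intact; cap the warm and cold tiers at the class budget."""
    caps = BUDGETS[request_class]
    spent = {tier: 0 for tier in caps}
    kept: list[ContextBlock] = []
    for block in blocks:            # assumed already ranked, most important first
        cap = caps.get(block.tier)
        if cap is not None:
            if spent[block.tier] + block.tokens > cap:
                continue            # over budget: drop it instead of silently blowing the limit
            spent[block.tier] += block.tokens
        kept.append(block)
    return kept
```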

Sanitize before you summarize

Do redaction and classification before anything else:

  • strip secrets (keys, tokens, credentials)
  • mark PII and regulated data
  • block disallowed sources

Summarization after leakage is still leakage.
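
A minimal redaction pass that runs before any summarization or packing. The patterns below are illustrative only; a production system wants a real secrets scanner plus a data classifier, not three regexes.

```python
import re

# Illustrative patterns only; nowhere near complete coverage.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS-style access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),    # PEM private key headers
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(text: str) -> str:
    """Redact secrets and mark PII before the text reaches summarizers or prompts."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED_SECRET]", text)
    return EMAIL.sub("[PII_EMAIL]", text)
```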

Rank, dedupe, compress

Long context dies from redundancy:

  • repeated headers
  • quoted email chains
  • boilerplate legal text
  • duplicated logs

Dedupe aggressively, then compress:

  • structural summaries (section outlines + key claims)
  • extract only the clauses that matter
  • collapse repeated tables to references
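
A sketch of the dedupe step: exact duplicates go first (hash of normalized text), then near-duplicates caught with simple token-set overlap before any money is spent on summarization. The 0.9 threshold is an arbitrary illustration.

```python
import hashlib

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def _jaccard(a: str, b: str) -> float:
    ta, tb = set(_normalize(a).split()), set(_normalize(b).split())
    return len(ta & tb) / max(len(ta | tb), 1)

def dedupe(chunks: list[str], near_dup_threshold: float = 0.9) -> list[str]:
    """Drop exact duplicates first, then near-duplicates by token overlap."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        digest = hashlib.sha256(_normalize(chunk).encode()).hexdigest()
        if digest in seen:
            continue  # exact repeat: headers, quoted email chains, boilerplate
        if any(_jaccard(chunk, k) >= near_dup_threshold for k in kept):
            continue  # near-duplicate of something already kept
        seen.add(digest)
        kept.append(chunk)
    return kept
```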

Pack with intention (order matters)

Models have positional biases. You want:

  • constraints early
  • task facts early
  • references grouped by topic, with headings
  • the “answer-critical” bits duplicated into Tier 1 as a “golden snippet”
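
A sketch of intentional packing, reusing the Tier/ContextBlock types from earlier: constraints and task facts first, working memory next, cold references grouped under headings (here by source, as a stand-in for topic), and the answer-critical golden snippets duplicated into the hot section on purpose.

```python
from collections import defaultdict

def pack(blocks: list[ContextBlock], golden: list[ContextBlock]) -> str:
    """Assemble the prompt in a deliberate order instead of insertion order."""
    ordered = sorted(blocks, key=lambda b: b.tier)   # SYSTEM, TASK, WORKING, REFERENCE
    parts = [b.text for b in ordered if b.tier is not Tier.REFERENCE]

    # Duplicate the answer-critical bits into the hot section on purpose.
    parts += [f"## Golden snippet ({g.source_id})\n{g.text}" for g in golden]

    cold_by_source: dict[str, list[str]] = defaultdict(list)
    for b in ordered:
        if b.tier is Tier.REFERENCE:
            cold_by_source[b.source_id].append(b.text)
    for source_id, texts in cold_by_source.items():
        parts.append(f"## Reference: {source_id}\n" + "\n".join(texts))

    return "\n\n".join(parts)
```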

Measure what you ship

Log a context manifest:

  • token counts per tier
  • top included sources (IDs)
  • compression ratio
  • retrieval hit rate (if you use RAG)

If you can’t see it, you can’t tune it.
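
A minimal context manifest, emitted alongside every request so you can see what you actually shipped. It reuses the tier types from earlier; the field names are mine.

```python
import json
from collections import Counter

def build_manifest(blocks: list[ContextBlock], raw_tokens: int,
                   retrieval_hits: int | None = None) -> str:
    """Log per-tier token counts, included sources, and the compression ratio."""
    per_tier = Counter()
    for b in blocks:
        per_tier[b.tier.name] += b.tokens
    packed_tokens = sum(per_tier.values())
    manifest = {
        "tokens_per_tier": dict(per_tier),
        "sources": sorted({b.source_id for b in blocks}),
        "compression_ratio": round(packed_tokens / max(raw_tokens, 1), 3),
        "retrieval_hits": retrieval_hits,
    }
    return json.dumps(manifest)
```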


When should I use long context vs RAG?

I use this decision matrix in architecture reviews.

Summarization
  • Long context is great when: you need coherent coverage of an entire artifact
  • RAG is great when: you only need one section
  • Typical hybrid: retrieve key sections + load the whole doc only if needed

Q&A
  • Long context is great when: the question is broad and multi-part
  • RAG is great when: the question is needle-like
  • Typical hybrid: RAG to locate needles + long context for surrounding reasoning

Code assistance
  • Long context is great when: you need cross-file reasoning in one repo slice
  • RAG is great when: you only need exact symbol usage
  • Typical hybrid: retrieve the call graph + load the relevant modules

Compliance/policy
  • Long context is great when: you must interpret a full policy document
  • RAG is great when: you need one clause fast
  • Typical hybrid: load the policy index + retrieve the clause + include policy definitions

Support/incident
  • Long context is great when: you need timeline + multi-source coherence
  • RAG is great when: you need one error signature
  • Typical hybrid: RAG for signatures + long context for the incident narrative

The pattern to notice: hybrid wins, but with a new default:

  • start tight,
  • expand cold memory on demand,
  • and stop expanding when confidence stabilizes.

Observability for long context: what to log (or you will suffer)

Long context changes what “debugging” looks like.

In classic RAG debugging, you ask:

  • did retrieval fetch the right chunks?

In long-context debugging, you ask:

  • did the model use the right chunks?

Here’s what I recommend logging as a baseline.

Context manifest

Token counts per tier, source IDs, compression ratio, dedupe ratio.

Attention proxies

Citations/quotes mapped to sources, plus “unused context” estimates.

Outcome quality

Task success metrics + reviewer feedback + auto-checkers where possible.

Cost & latency

p50/p95 latency, tokens in/out, cache hit rate, tool time.


Security and compliance: long context expands your blast radius

Long context makes two old problems worse:

1) Data exfiltration is easier

If you stuff:

  • logs,
  • internal tickets,
  • customer records,
  • or secrets,

…you've increased the chances of exposing them in output, telemetry, or tool calls.

2) Prompt injection becomes a document property

An injection string inside a doc is no longer “some weird chunk.” It can be embedded anywhere in the cold tier.

So you need controls that are built for scale.

If you use long context without source allowlists, redaction, and tool output constraints, you are effectively letting “whatever is in your documents” participate in production behavior.

Practical controls that actually ship:

  • source allowlists by request class (support vs ops vs compliance)
  • automatic redaction (secrets + PII) before the LLM sees anything
  • tool call gating (the model can propose actions, but execution is policy-checked)
  • output filters for regulated strings (keys, IDs, credential patterns)
  • context hashing for audit (“what exactly did we send?”)
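
Two of these controls are nearly free to add. Below is a sketch of a per-request-class source allowlist plus a hash of the exact packed prompt for the audit trail; the request classes and source prefixes are illustrative.

```python
import hashlib

# Illustrative: which source prefixes each request class may pull from.
SOURCE_ALLOWLIST = {
    "support":    ("kb/", "runbooks/"),
    "compliance": ("policies/",),
}

def check_sources(request_class: str, source_ids: list[str]) -> None:
    """Fail closed if any source falls outside this request class's allowlist."""
    allowed = SOURCE_ALLOWLIST[request_class]
    blocked = [s for s in source_ids if not s.startswith(allowed)]
    if blocked:
        raise PermissionError(f"sources not allowed for {request_class!r}: {blocked}")

def context_fingerprint(packed_prompt: str) -> str:
    """Store this hash with the request ID: 'what exactly did we send?'"""
    return hashlib.sha256(packed_prompt.encode()).hexdigest()
```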

The new design pattern: “expand-on-demand” context

The winning pattern I’ve seen is a two-pass flow:

  1. tight pass: minimal task facts + a small retrieval set
  2. expansion pass (conditional): load large cold memory only if needed

Why it works:

  • you keep default cost low
  • you avoid attention dilution for simple tasks
  • you still have an escape hatch for complex tasks
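
A sketch of that two-pass flow. `answer_with_context`, `retrieve_top_k`, `load_cold_memory`, and `confidence_of` are placeholders for whatever model client, retriever, document store, and quality signal (self-reported confidence, citation coverage, a verifier) you already have; the threshold is illustrative.

```python
def answer(question: str, answer_with_context, retrieve_top_k, load_cold_memory,
           confidence_of, threshold: float = 0.8) -> str:
    """Two-pass flow: start tight, expand cold memory only when the first pass is shaky."""
    # Pass 1: minimal task facts + a small retrieval set.
    tight_context = retrieve_top_k(question, k=6)
    draft = answer_with_context(question, tight_context)
    if confidence_of(draft) >= threshold:
        return draft                  # most requests stop here: cheap and focused

    # Pass 2 (conditional): load the large cold tier and try again.
    expanded = tight_context + load_cold_memory(question)
    return answer_with_context(question, expanded)
```
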
If you want one sentence to remember from this month:

Long context is a capability. “Expand-on-demand” is a strategy.


A concrete checklist you can use in a design review

If I’m reviewing a “let’s use long context” proposal, I ask:

  • What is the context budget per request class, and where is it enforced?
  • What gets redacted or blocked before anything is summarized or packed?
  • How do we dedupe and compress the cold tier before it hits the prompt?
  • What is the packing order, and which golden snippets get promoted to the hot tier?
  • What does the context manifest log, and who actually looks at it?
  • Which sources are allowlisted for this request class, and how are tool calls gated?
  • Is expand-on-demand the default, or are we shipping the maximal prompt on every request?

Resources

Lost in the Middle: How Language Models Use Long Contexts (2023)

A clear look at why “more tokens” doesn’t automatically mean “better use of information”.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

The modern baseline for grounding generation with retrieval — still relevant in the long-context era.

Prompt Injection (OWASP LLM Top 10)

A practical threat-model lens for why “documents are untrusted input,” especially at scale.

LLMLingua: Compressing Prompts for LLMs (2023)

Compression is a first-class long-context primitive — this is a strong starting point.


What’s Next

May was about context economics:

  • what long context makes cheap,
  • what it makes risky,
  • and what needs architecture to keep costs and safety under control.

Next month we zoom out again.

If agents are doing work in the background — with retries, tools, and side effects — they start behaving like distributed systems.

Agents as Distributed Systems: outbox, sagas, and “eventually correct” workflows

Axel Domingues - 2026