May 25, 2025 - 16 MIN READ
The 1M-Token Era: how long context changes retrieval economics and system design

Long context doesn’t kill RAG — it changes what’s cheap, what’s risky, and what needs architecture. This month is a practical guide to building “context-first” systems without shipping a cost bomb (or a data leak).

Axel Domingues

When long context crossed into “absurd,” a bunch of teams celebrated for the wrong reason.

They treated it like a feature that replaces architecture:

“We can just shove everything into the prompt.”

That works… right up until:

  • latency doubles,
  • cost triples,
  • the model ignores the paragraph you care about,
  • and your legal team asks why production logs are in the context window.

Long context is real leverage — but it doesn’t remove engineering. It moves the boundary of what’s worth retrieving, what’s worth caching, and what’s worth curating.

This month is about that boundary shift.

When I say “1M-token era,” I’m not arguing that every request should be 1M tokens.

I’m saying the option changes system design the way “cheap storage” changed databases: you start designing around what’s cheap enough, not what’s theoretically minimal.

The new reality

Your context window can fit whole manuals, large codebases, or multi-day threads.

The new constraint

Cost and latency scale with how much you stuff in, and the model's attention doesn't scale linearly with it.

The new risk

More context means more secrets, injections, and compliance surface area.

The new opportunity

Context becomes a tiered memory system: hot, warm, cold — not “prompt vs RAG”.


Long context didn’t kill RAG — it changed the economics

Before long context, the default plan was:

  • retrieve a handful of chunks,
  • keep prompts tight,
  • optimize for “just enough grounding.”

With long context, you can load an entire knowledge slice and let the model navigate.

So what changes?

1) Retrieval is no longer the only way to get “coverage”

If you can fit the full product spec, you can stop playing “top‑k roulette” for many tasks:

  • summarizing a multi-section doc
  • answering within one manual
  • extracting structured requirements from a repo of ADRs

But…

2) Retrieval still wins on precision and cost

Long context is “coverage.” RAG is “precision.”

If the task is needle‑finding (one clause, one config key, one exception case), retrieval + tight context is still cheaper and often more accurate.
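
To make the cost side of that concrete, here is a back-of-the-envelope comparison. The price is a made-up placeholder (US$3 per million input tokens), not any provider's actual rate; the point is the ratio, not the absolute numbers.

```python
# Rough input-cost comparison: "stuff the whole manual" vs. "retrieve a tight slice".
# The price below is an illustrative placeholder, not real provider pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

full_manual = 800_000      # load the entire knowledge slice into the window
tight_slice = 6_000        # a handful of well-chosen retrieved chunks

print(f"full context:  ${input_cost(full_manual):.4f} per request")
print(f"tight context: ${input_cost(tight_slice):.4f} per request")
# Roughly a 130x difference per request. At tens of thousands of requests a
# day, that is the gap between a rounding error and a budget line item.
```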

3) You now have three costs, not one

Long context forces you to budget across:

  • token cost: you pay to send it
  • latency cost: you wait to process it
  • attention cost: the model may not actually use it

That last one is the trap: you can buy tokens and still get shallow answers.

The most common long-context failure mode is attention dilution:

You gave the model everything… and it treated the most important page like background noise.


The “context pyramid”: think tiers, not a single prompt

I’ve started treating context as a pyramid with explicit tiers.

Tier 0: System constraints (hot, tiny, non-negotiable)

This is the part you never let be ambiguous:

  • “what must never happen” (truth boundaries)
  • tool rules and safety policies
  • output format contracts

Tier 1: Task-specific facts (hot)

  • the user’s request
  • current inputs (files, records, UI state)
  • the smallest grounding you can justify

Tier 2: Working memory (warm)

  • conversation summary
  • last decisions
  • a running plan / checklist
  • a short “known constraints” block

Tier 3: Reference memory (cold, large)

  • manuals
  • runbooks
  • policies
  • codebase slices
  • incident timelines

Long context mostly expands Tier 3 — and that’s where your architecture needs to show up.

The best long-context systems aren’t “bigger prompts.”

They’re better packers:

  • dedupe
  • prioritize
  • compress
  • and only expand cold memory when the task demands it
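
One way to make the tiers (and the packing discipline) concrete is to tag every candidate piece of context with its tier before any packing decision happens. A minimal sketch, with type names that are mine rather than any standard API; the later sketches in this post reuse these types.

```python
from dataclasses import dataclass
from enum import IntEnum

class Tier(IntEnum):
    SYSTEM = 0      # constraints, safety policies, output format contracts
    TASK = 1        # the request plus the smallest grounding you can justify
    WORKING = 2     # conversation summary, decisions, running plan
    REFERENCE = 3   # manuals, runbooks, code slices (cold, large)

@dataclass
class ContextBlock:
    tier: Tier
    source_id: str   # where it came from, for the audit trail
    text: str
    tokens: int      # pre-counted so packing decisions stay cheap
```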

What “context-first” system design looks like

If you want long context to be an advantage, you need a pipeline. Not a prompt.

Here’s the minimal architecture I’ve found that holds up in production.

Define a context budget (explicitly)

Pick hard limits per request class:

  • max cold tokens
  • max tool output tokens
  • max conversation memory tokens

Then build enforcement into the system (not “developer discipline”).
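
What "enforcement in the system" can look like, reusing the Tier/ContextBlock sketch above: per-request-class caps on the warm and cold tiers that the packer cannot exceed. The class names and numbers are illustrative, and a real version would also cap tool output separately.

```python
# Hypothetical per-request-class budgets (in tokens), enforced at packing time.
BUDGETS = {
    "support_qa":  {Tier.WORKING: 4_000, Tier.REFERENCE: 40_000},
    "code_review": {Tier.WORKING: 8_000, Tier.REFERENCE: 120_000},
}

def enforce_budget(blocks: list[ContextBlock], request_class: str) -> list[ContextBlock]:
    """Keep Tier 0/1 intact; cap the warm and cold tiers at the class budget."""
    caps = BUDGETS[request_class]
    spent = {tier: 0 for tier in caps}
    kept: list[ContextBlock] = []
    for block in blocks:            # assumed already ranked, most important first
        cap = caps.get(block.tier)
        if cap is not None:
            if spent[block.tier] + block.tokens > cap:
                continue            # over budget: drop it instead of silently blowing the limit
            spent[block.tier] += block.tokens
        kept.append(block)
    return kept
```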

Sanitize before you summarize

Do redaction and classification before anything else:

  • strip secrets (keys, tokens, credentials)
  • mark PII and regulated data
  • block disallowed sources

Summarization after leakage is still leakage.
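
A minimal redaction pass that runs before any summarization or packing. The patterns below are illustrative only; a production system wants a real secrets scanner plus a data classifier, not three regexes.

```python
import re

# Illustrative patterns only; nowhere near complete coverage.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS-style access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),    # PEM private key headers
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(text: str) -> str:
    """Redact secrets and mark PII before the text reaches summarizers or prompts."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED_SECRET]", text)
    return EMAIL.sub("[PII_EMAIL]", text)
```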

Rank, dedupe, compress

Long context dies from redundancy:

  • repeated headers
  • quoted email chains
  • boilerplate legal text
  • duplicated logs

Dedupe aggressively, then compress:

  • structural summaries (section outlines + key claims)
  • extract only the clauses that matter
  • collapse repeated tables to references
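
A sketch of the dedupe step: exact duplicates go first (hash of normalized text), then near-duplicates caught with simple token-set overlap before any money is spent on summarization. The 0.9 threshold is an arbitrary illustration.

```python
import hashlib

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def _jaccard(a: str, b: str) -> float:
    ta, tb = set(_normalize(a).split()), set(_normalize(b).split())
    return len(ta & tb) / max(len(ta | tb), 1)

def dedupe(chunks: list[str], near_dup_threshold: float = 0.9) -> list[str]:
    """Drop exact duplicates first, then near-duplicates by token overlap."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        digest = hashlib.sha256(_normalize(chunk).encode()).hexdigest()
        if digest in seen:
            continue  # exact repeat: headers, quoted email chains, boilerplate
        if any(_jaccard(chunk, k) >= near_dup_threshold for k in kept):
            continue  # near-duplicate of something already kept
        seen.add(digest)
        kept.append(chunk)
    return kept
```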

Pack with intention (order matters)

Models have positional biases. You want:

  • constraints early
  • task facts early
  • references grouped by topic, with headings
  • the “answer-critical” bits duplicated into Tier 1 as a “golden snippet”
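
A sketch of intentional packing, reusing the Tier/ContextBlock types from earlier: constraints and task facts first, working memory next, cold references grouped under headings (here by source, as a stand-in for topic), and the answer-critical golden snippets duplicated into the hot section on purpose.

```python
from collections import defaultdict

def pack(blocks: list[ContextBlock], golden: list[ContextBlock]) -> str:
    """Assemble the prompt in a deliberate order instead of insertion order."""
    ordered = sorted(blocks, key=lambda b: b.tier)   # SYSTEM, TASK, WORKING, REFERENCE
    parts = [b.text for b in ordered if b.tier is not Tier.REFERENCE]

    # Duplicate the answer-critical bits into the hot section on purpose.
    parts += [f"## Golden snippet ({g.source_id})\n{g.text}" for g in golden]

    cold_by_source: dict[str, list[str]] = defaultdict(list)
    for b in ordered:
        if b.tier is Tier.REFERENCE:
            cold_by_source[b.source_id].append(b.text)
    for source_id, texts in cold_by_source.items():
        parts.append(f"## Reference: {source_id}\n" + "\n".join(texts))

    return "\n\n".join(parts)
```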

Measure what you ship

Log a context manifest:

  • token counts per tier
  • top included sources (IDs)
  • compression ratio
  • retrieval hit rate (if you use RAG)

If you can’t see it, you can’t tune it.
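
A minimal context manifest, emitted alongside every request so you can see what you actually shipped. It reuses the tier types from earlier; the field names are mine.

```python
import json
from collections import Counter

def build_manifest(blocks: list[ContextBlock], raw_tokens: int,
                   retrieval_hits: int | None = None) -> str:
    """Log per-tier token counts, included sources, and the compression ratio."""
    per_tier = Counter()
    for b in blocks:
        per_tier[b.tier.name] += b.tokens
    packed_tokens = sum(per_tier.values())
    manifest = {
        "tokens_per_tier": dict(per_tier),
        "sources": sorted({b.source_id for b in blocks}),
        "compression_ratio": round(packed_tokens / max(raw_tokens, 1), 3),
        "retrieval_hits": retrieval_hits,
    }
    return json.dumps(manifest)
```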


When should I use long context vs RAG?

I use this decision matrix in architecture reviews.

Summarization
  • Long context is great when: you need coherent coverage of an entire artifact
  • RAG is great when: you only need one section
  • Typical hybrid: retrieve key sections + load the whole doc only if needed

Q&A
  • Long context is great when: the question is broad and multi-part
  • RAG is great when: the question is needle-like
  • Typical hybrid: RAG to locate needles + long context for surrounding reasoning

Code assistance
  • Long context is great when: you need cross-file reasoning in one repo slice
  • RAG is great when: you only need exact symbol usage
  • Typical hybrid: retrieve the call graph + load the relevant modules

Compliance/policy
  • Long context is great when: you must interpret a full policy document
  • RAG is great when: you need one clause fast
  • Typical hybrid: load the policy index + retrieve the clause + include policy definitions

Support/incident
  • Long context is great when: you need timeline + multi-source coherence
  • RAG is great when: you need one error signature
  • Typical hybrid: RAG for signatures + long context for the incident narrative

The pattern to notice: hybrid wins, but with a new default:

  • start tight,
  • expand cold memory on demand,
  • and stop expanding when confidence stabilizes.

Observability for long context: what to log (or you will suffer)

Long context changes what “debugging” looks like.

In classic RAG debugging, you ask:

  • did retrieval fetch the right chunks?

In long-context debugging, you ask:

  • did the model use the right chunks?

Here’s what I recommend logging as a baseline.

Context manifest

Token counts per tier, source IDs, compression ratio, dedupe ratio.

Attention proxies

Citations/quotes mapped to sources, plus “unused context” estimates.

Outcome quality

Task success metrics + reviewer feedback + auto-checkers where possible.

Cost & latency

p50/p95 latency, tokens in/out, cache hit rate, tool time.


Security and compliance: long context expands your blast radius

Long context makes two old problems worse:

1) Data exfiltration is easier

If you stuff:

  • logs,
  • internal tickets,
  • customer records,
  • or secrets,

…you've increased the chances of exposing them in output, telemetry, or tool calls.

2) Prompt injection becomes a document property

An injection string inside a doc is no longer “some weird chunk.” It can be embedded anywhere in the cold tier.

So you need controls that are built for scale.

If you use long context without source allowlists, redaction, and tool output constraints, you are effectively letting “whatever is in your documents” participate in production behavior.

Practical controls that actually ship:

  • source allowlists by request class (support vs ops vs compliance)
  • automatic redaction (secrets + PII) before the LLM sees anything
  • tool call gating (the model can propose actions, but execution is policy-checked)
  • output filters for regulated strings (keys, IDs, credential patterns)
  • context hashing for audit (“what exactly did we send?”)
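
Two of these controls are nearly free to add. Below is a sketch of a per-request-class source allowlist plus a hash of the exact packed prompt for the audit trail; the request classes and source prefixes are illustrative.

```python
import hashlib

# Illustrative: which source prefixes each request class may pull from.
SOURCE_ALLOWLIST = {
    "support":    ("kb/", "runbooks/"),
    "compliance": ("policies/",),
}

def check_sources(request_class: str, source_ids: list[str]) -> None:
    """Fail closed if any source falls outside this request class's allowlist."""
    allowed = SOURCE_ALLOWLIST[request_class]
    blocked = [s for s in source_ids if not s.startswith(allowed)]
    if blocked:
        raise PermissionError(f"sources not allowed for {request_class!r}: {blocked}")

def context_fingerprint(packed_prompt: str) -> str:
    """Store this hash with the request ID: 'what exactly did we send?'"""
    return hashlib.sha256(packed_prompt.encode()).hexdigest()
```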

The new design pattern: “expand-on-demand” context

The winning pattern I’ve seen is a two-pass flow:

  1. tight pass: minimal task facts + a small retrieval set
  2. expansion pass (conditional): load large cold memory only if needed

Why it works:

  • you keep default cost low
  • you avoid attention dilution for simple tasks
  • you still have an escape hatch for complex tasks
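
A sketch of that two-pass flow. `answer_with_context`, `retrieve_top_k`, `load_cold_memory`, and `confidence_of` are placeholders for whatever model client, retriever, document store, and quality signal (self-reported confidence, citation coverage, a verifier) you already have; the threshold is illustrative.

```python
def answer(question: str, answer_with_context, retrieve_top_k, load_cold_memory,
           confidence_of, threshold: float = 0.8) -> str:
    """Two-pass flow: start tight, expand cold memory only when the first pass is shaky."""
    # Pass 1: minimal task facts + a small retrieval set.
    tight_context = retrieve_top_k(question, k=6)
    draft = answer_with_context(question, tight_context)
    if confidence_of(draft) >= threshold:
        return draft                  # most requests stop here: cheap and focused

    # Pass 2 (conditional): load the large cold tier and try again.
    expanded = tight_context + load_cold_memory(question)
    return answer_with_context(question, expanded)
```
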
If you want one sentence to remember from this month:

Long context is a capability. “Expand-on-demand” is a strategy.


A concrete checklist you can use in a design review

If I’m reviewing a “let’s use long context” proposal, I ask:

  • What is the context budget per request class, and where is it enforced?
  • What gets redacted or blocked before anything is summarized or packed?
  • How do we dedupe and compress the cold tier before it hits the prompt?
  • What is the packing order, and which golden snippets get promoted to the hot tier?
  • What does the context manifest log, and who actually looks at it?
  • Which sources are allowlisted for this request class, and how are tool calls gated?
  • Is expand-on-demand the default, or are we shipping the maximal prompt on every request?

Resources

Lost in the Middle: How Language Models Use Long Contexts (2023)

A clear look at why “more tokens” doesn’t automatically mean “better use of information”.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

The modern baseline for grounding generation with retrieval — still relevant in the long-context era.

Prompt Injection (OWASP LLM Top 10)

A practical threat-model lens for why “documents are untrusted input,” especially at scale.

LLMLingua: Compressing Prompts for LLMs (2023)

Compression is a first-class long-context primitive — this is a strong starting point.


What’s Next

May was about context economics:

  • what long context makes cheap,
  • what it makes risky,
  • and what needs architecture to keep costs and safety under control.

Next month we zoom out again.

If agents are doing work in the background — with retries, tools, and side effects — they start behaving like distributed systems.

Agents as Distributed Systems: outbox, sagas, and “eventually correct” workflows

Axel Domingues - 2026