
“Think step by step” isn’t an architecture. In production you need budgets, routing, and verifiers so the system knows when to go fast, when to slow down, and when to refuse.
Axel Domingues
“Think longer.”
It’s the most tempting fix in LLM product work — and the most expensive.
Because in production, “thinking longer” isn’t a vibe. It’s a budget decision.
October is where I stop treating “reasoning” like a prompt trick and start treating it like a runtime control plane:
route requests into fast/slow paths, verify outputs, and only spend extra compute when it buys measurable reliability.
If your system can’t:
- say “no”,
- validate tool results,
- and degrade gracefully,
then “think longer” just produces slower mistakes.
- The problem: LLM quality is a function of compute, and compute is a product constraint.
- The solution: fast path by default, slow path on demand, with verifiers in between.
- The key move: spend budget on verification, not just generation.
- The outcome: lower cost and fewer incidents by making “thinking” a controlled resource.
A reasoning budget is the set of limits you impose on a request: how long it can run, how many tokens it can burn, how many tool calls and repair attempts it gets, which tools it may touch, and how strict verification has to be.
This isn’t theoretical. It’s how you turn a stochastic component into a system you can operate.
The moment you allow tools, multi-step plans, or long context, you’re running a workflow engine — and workflows need budgets.

If you only have “generate”, you’re not building a product — you’re gambling.
A production-grade assistant needs a loop: generate, verify, repair within budget, and escalate or refuse when the budget runs out.
Even simple checks (schema, citations, tool output validation) remove a shocking amount of failure.
A good default architecture is a fast path by default and a slow path on demand, with verifiers in between.
This isn’t “two models”. It’s two operating modes, and the slow mode isn’t just “more tokens”: it’s more constraints plus more checking.
The router is a policy. Make it explicit.
In practice, you score a request with a few signals:
- Impact: does an error create real harm (money, safety, reputation)?
- Uncertainty: is the model likely to be wrong without extra grounding or checks?
- Complexity: does it require multi-step reasoning, long context, or tool chaining?
- Attack surface: does it touch tools, credentials, code execution, or user data?
A simple routing rule can look like this:
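A minimal sketch, assuming the signals above are already scored as numbers in [0, 1] upstream; the `Signals` shape and the thresholds are illustrative, not a recommendation:

```ts
type Tier = "T0" | "T1" | "T2";

// Illustrative signal scores in [0, 1]; how you compute them is your policy.
type Signals = {
  impact: number;        // harm if wrong (money, safety, reputation)
  uncertainty: number;   // likelihood of being wrong without grounding or checks
  complexity: number;    // multi-step reasoning, long context, tool chaining
  attackSurface: number; // touches tools, credentials, code execution, user data
};

function routeTier(s: Signals): Tier {
  // Anything that can do real damage gets the strict tier, full stop.
  if (s.impact > 0.7 || s.attackSurface > 0.7) return "T2";
  // Hard or shaky requests get the standard tier with extra checks.
  if (s.complexity > 0.5 || s.uncertainty > 0.5) return "T1";
  // Everything else stays on the fast path.
  return "T0";
}
```

The thresholds are the policy. Keep them in reviewable config, not folklore.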
I like three tiers because teams can actually operationalize them: T0 is the fast path with basic checks, T1 adds standard verification and a limited tool allowlist, and T2 is the slow path with strict verification for high-stakes work.
Then you define budgets per tier.
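For example, a per-tier budget table. The numbers and tool names (`search`, `db_read`) are placeholders; the `Budget` shape matches the runtime sketch further down:

```ts
type Budget = {
  deadlineMs: number;
  maxTokens: number;
  maxToolCalls: number;
  maxRepairs: number;
  allowTools: string[]; // allowlist
  verifierLevel: "basic" | "standard" | "strict";
};

// Placeholder numbers; tune them against your own latency and quality telemetry.
const BUDGETS: Record<"T0" | "T1" | "T2", Budget> = {
  T0: { deadlineMs: 2_000, maxTokens: 800, maxToolCalls: 0, maxRepairs: 0, allowTools: [], verifierLevel: "basic" },
  T1: { deadlineMs: 8_000, maxTokens: 2_000, maxToolCalls: 2, maxRepairs: 1, allowTools: ["search"], verifierLevel: "standard" },
  T2: { deadlineMs: 30_000, maxTokens: 6_000, maxToolCalls: 5, maxRepairs: 2, allowTools: ["search", "db_read"], verifierLevel: "strict" },
};
```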
Executives understand tiers. Engineers can map tiers to controls. Call it a “Tier 2 execution policy”.
Spending budget on “more tokens” is a weak strategy.
Spending budget on checks is a strong strategy.
Here are the verifiers that consistently pay off.
If the output must drive downstream automation, force it into a contract:
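A minimal sketch of that contract with a hand-rolled guard; `RefundDecision` is a hypothetical payload, and a schema library works just as well:

```ts
// Hypothetical contract for a downstream automation step.
type RefundDecision = {
  approve: boolean;
  amountCents: number;
  reason: string;
};

function parseRefundDecision(raw: string): RefundDecision | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not even JSON: repair, don't hope
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Partial<RefundDecision>;
  const valid =
    typeof d.approve === "boolean" &&
    typeof d.amountCents === "number" &&
    Number.isInteger(d.amountCents) &&
    d.amountCents >= 0 &&
    typeof d.reason === "string" &&
    d.reason.length > 0;
  return valid ? (d as RefundDecision) : null;
}
```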
If validation fails, you repair, not “hope”.
Tools are not truth. They are inputs.
Validate them before anything downstream trusts them.
A classic failure mode is the model “summarizing” tool output incorrectly. So: keep tool output as structured data and render from it deterministically.
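A sketch of that separation, with a hypothetical order-status tool: validate the payload shape, then render the user-facing text with a plain template so the model never gets to paraphrase the facts:

```ts
// Hypothetical structured result from an order-status tool.
type OrderStatus = {
  orderId: string;
  state: "pending" | "shipped" | "delivered";
  etaDays?: number;
};

function isOrderStatus(x: unknown): x is OrderStatus {
  if (typeof x !== "object" || x === null) return false;
  const o = x as Partial<OrderStatus>;
  return (
    typeof o.orderId === "string" &&
    (o.state === "pending" || o.state === "shipped" || o.state === "delivered") &&
    (o.etaDays === undefined || typeof o.etaDays === "number")
  );
}

// Deterministic rendering: no model between the tool output and the user-facing facts.
function renderOrderStatus(s: OrderStatus): string {
  const eta = s.etaDays !== undefined ? ` (ETA ${s.etaDays} days)` : "";
  return `Order ${s.orderId} is ${s.state}${eta}.`;
}
```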
If you’re using RAG, check that every citation points at a chunk you actually retrieved and that the cited span supports the claim.
RAG is not grounding unless you verify the grounding happened.
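One cheap check, assuming candidates carry the `citations` shape from the runtime sketch below: an answer with no citations, or with citations that don’t match the retrieved set, fails verification:

```ts
type Citation = { id: string; span: [number, number] };

// Fails closed: an answer that makes claims without pointing at retrieved chunks is not grounded.
function checkGrounding(
  citations: Citation[] | undefined,
  retrievedChunkIds: Set<string>
): { ok: true } | { ok: false; reason: string } {
  if (!citations || citations.length === 0) {
    return { ok: false, reason: "no citations" };
  }
  for (const c of citations) {
    if (!retrievedChunkIds.has(c.id)) {
      return { ok: false, reason: `citation ${c.id} not in retrieved set` };
    }
  }
  return { ok: true };
}
```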
Some correctness is not “model work”.
If a policy can be written as a deterministic rule, enforce it with code, not prompts. This is how you define truth boundaries.
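A sketch with a hypothetical rule (the threshold is made up); the point is that the decision never depends on what the model generated:

```ts
// Hypothetical policy: refunds above a fixed threshold always go to a human.
const MAX_AUTO_REFUND_CENTS = 50_00; // $50

function refundRequiresHuman(amountCents: number): boolean {
  // Enforced in code, regardless of what the model proposed.
  return amountCents > MAX_AUTO_REFUND_CENTS;
}
```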
A second pass can help, but treat it as expensive.
Use it when the stakes justify the cost: a high-impact tier, or a first pass that failed verification for a reason a critique can actually fix.
Design it as: critique → produce a minimal diff → re-verify.
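A sketch of that shape, with the critique, fix, and verify steps injected as placeholders rather than any particular implementation:

```ts
type SecondPassDeps = {
  critique: (draft: string) => Promise<string>;                      // what, specifically, is wrong
  applyMinimalFix: (draft: string, note: string) => Promise<string>; // the smallest change that addresses it
  verify: (draft: string) => Promise<boolean>;                       // re-run the same checks as the first pass
};

// Critique, apply the smallest fix that addresses it, then re-verify.
// Never trust the critique step on its own.
async function secondPass(draft: string, deps: SecondPassDeps): Promise<string | null> {
  const note = await deps.critique(draft);
  const fixed = await deps.applyMinimalFix(draft, note);
  return (await deps.verify(fixed)) ? fixed : null;
}
```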
This is the smallest set of components I’ve seen work repeatedly: a router, per-tier budgets, a small stack of verifiers, a bounded repair loop, an explicit fallback/refusal path, and telemetry across all of it.

If you can’t answer “what percent of slow-path escalations actually improved outcomes?”
then you’re spending money blindly.
Every request needs a deadline. Even internal ones.
So: define a deadline and make the system degrade gracefully when it hits it.
Order verifiers by cost: cheap, deterministic checks (schema, allowlists, policy rules) before expensive ones (LLM critique, resampling).
Stop when a candidate passes, the repair budget is exhausted, or the deadline is hit.
This turns an agent from “infinite wanderer” into a bounded system.
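A sketch of the ordering, assuming each verifier declares a rough relative cost so the cheap, deterministic checks always run first:

```ts
type CheckResult = { ok: boolean; reason?: string };

type Verifier = {
  name: string;
  cost: number; // relative cost: schema checks ~1, LLM critique ~100
  run: (candidateText: string) => Promise<CheckResult>;
};

// Run cheapest checks first; stop at the first failure or when the deadline is gone.
async function runVerifiers(
  candidateText: string,
  verifiers: Verifier[],
  deadlineAt: number
): Promise<CheckResult> {
  const ordered = [...verifiers].sort((a, b) => a.cost - b.cost);
  for (const v of ordered) {
    if (Date.now() >= deadlineAt) return { ok: false, reason: "deadline" };
    const result = await v.run(candidateText);
    if (!result.ok) return { ok: false, reason: `${v.name}: ${result.reason ?? "failed"}` };
  }
  return { ok: true };
}
```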
Caching is safe when you include a fingerprint of everything that shaped the answer: model and prompt versions, tool versions, and the retrieved context.
Cache only when the output is deterministic for that fingerprint and doesn’t depend on per-user or time-sensitive data.
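A sketch of that fingerprint, assuming a Node.js runtime for the hash; the fields are illustrative, and the rule is simply that anything that could change the answer belongs in the key:

```ts
import { createHash } from "node:crypto";

// Everything that shaped the answer goes into the cache key.
type CacheFingerprint = {
  modelVersion: string;
  promptVersion: string;
  toolVersions: Record<string, string>;
  retrievedChunkIds: string[]; // for RAG answers
  normalizedInput: string;
};

function cacheKey(fp: CacheFingerprint): string {
  // Stable serialization, then hash; sort the chunk ids so ordering doesn't bust the cache.
  const stable = JSON.stringify({
    ...fp,
    retrievedChunkIds: [...fp.retrievedChunkIds].sort(),
  });
  return createHash("sha256").update(stable).digest("hex");
}
```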
Below is a deliberately boring sketch. Boring is good.
It shows how to structure the runtime so budgets are enforced, not “suggested”.
type Tier = "T0" | "T1" | "T2";
type Budget = {
deadlineMs: number;
maxTokens: number;
maxToolCalls: number;
maxRepairs: number;
allowTools: string[]; // allowlist
verifierLevel: "basic" | "standard" | "strict";
};
type Candidate = {
text: string;
toolCalls?: Array<{ name: string; args: unknown }>;
citations?: Array<{ id: string; span: [number, number] }>;
};
type Verdict =
| { ok: true; normalized: unknown }
| { ok: false; reason: string; repairHint?: string };
async function handleRequest(input: { userText: string }) {
const { tier, budget } = route(input);
const start = Date.now();
let attempts = 0;
while (attempts <= budget.maxRepairs) {
const timeLeft = budget.deadlineMs - (Date.now() - start);
if (timeLeft <= 0) return fallback("timeout", tier);
const candidate: Candidate = await generate(input, { tier, budget, timeLeft });
const verdict: Verdict = await verify(candidate, { tier, budget, timeLeft });
if (verdict.ok) return render(verdict.normalized);
attempts += 1;
// If we failed for a repeatable reason, repair with guidance.
input = { ...input, userText: input.userText + "\n\nRepair: " + (verdict.repairHint ?? verdict.reason) };
// Optional escalation: only when it's worth it.
if (shouldEscalate(verdict, tier) && tier !== "T2") {
return handleRequestEscalated(input);
}
}
return fallback("verification_failed", tier);
}
The important detail isn’t the code.
It’s the discipline: the deadline is checked on every iteration, repairs are bounded, escalation is an explicit decision, and every exit is either a verified render, a fallback, or a refusal.
If your team can’t see the system, they will debate it endlessly.
Here’s the minimum telemetry that turns this into engineering:
- Budget burn: tokens, time, tool calls, and retries per request and per tier.
- Verification stats: which checks failed, how often, and whether repairs succeed.
- Escalation ROI: did slow-path escalation improve outcomes (quality, fewer incidents)?
- Tail latency: p95/p99 latency per tier, with a breakdown by stage.
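A sketch of the per-request event, assuming it feeds whatever metrics pipeline you already run; the field names are illustrative:

```ts
// One event per request; aggregate per tier for dashboards.
type RequestTelemetry = {
  tier: "T0" | "T1" | "T2";
  latencyMs: number;
  tokensUsed: number;
  toolCalls: number;
  repairs: number;
  escalated: boolean;
  escalationImprovedOutcome?: boolean; // filled in later by offline eval or human review
  failedChecks: string[]; // which verifiers rejected a candidate
  outcome: "rendered" | "fallback" | "refused";
};
```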
If you can’t define what “verified” means, you can’t optimize.
Cause: no stop conditions, no bounded retries.
Fix: deadlines, a repair cap, and explicit stop conditions on every loop.
Cause: downstream automation consumes free-form text.
Fix: schema contracts, with validation and repair before anything downstream runs.
Cause: tools are too powerful and too implicit.
Fix: per-tier tool allowlists and validated, structured tool outputs.
Cause: treating every request as high-stakes.
Fix: route by impact, uncertainty, complexity, and attack surface so the fast path stays fast.
The Tail at Scale (Dean & Barroso, 2013)
The canonical read on tail latency and why p95/p99 dominates UX and SLOs — the “tail budget” intuition your fast/slow paths are built on.
Self-Consistency for Chain-of-Thought (Wang et al., 2022)
A concrete “spend budget on reliability” technique: sample multiple reasoning paths, then pick the most consistent answer (useful framing for resamples / retries).
Better models help, but they don’t remove the architectural problem: you still need budgets, routing, and verification.
Upgrading the model without upgrading the runtime is how teams get surprised in production.
Start with three tiers and pick budgets that match the user’s patience.
Then iterate based on telemetry: which verifiers catch real issues, and which escalations actually improve outcomes.
Refuse when verification keeps failing past the repair budget, when the request needs tools outside the allowlist, or when the deadline arrives without a verified answer.
Refusal is not failure — it’s a safety boundary.
Once you have budgets and verification, you can start doing something harder:
real-time interaction.
In November, the topic is the next constraint that changes everything:
Real-Time Agents: streaming, barge-in, and session state that doesn’t collapse
Because the moment you stream responses and accept interruptions, your “reasoning budget” becomes a session budget — and the runtime has to stay stable while the user is actively steering it.