
The moment you have multiple models with different strengths, “pick the best model” turns into system design. This month is a practical blueprint for routing, fallbacks, and budget-aware capability tiers—without turning your stack into an untestable mess.
Axel Domingues
January gave us the boundary layer: contracts, schemas, failure budgets.
February made context explicit: stuff vs retrieve.
March turned it into software: context assembly as a subsystem, with Context Packets you can replay.
Now the next uncomfortable truth arrives:
In production, “the model” is not a single choice.
It’s a portfolio—and a router.
The moment you have multiple models available (different costs, latencies, modalities, safety properties, and accuracy profiles), you stop asking:
“Which model is best?”
And start asking:
“What model is best for this contract, under this budget, with this risk tier… and what happens when it fails?”
This is the month where model selection becomes architecture.
The shift: You're not choosing a model. You're designing routing + fallbacks.
The abstraction: Every request is a contract: task, schema, risk tier, budget tier.
The control surface: Routing is driven by capability tiers + policy constraints.
The operator view: Reliability is measured in p95, cost, retries, and error modes.
In a lab, you can chase a single score.
In production, you're optimizing a vector: cost, latency, accuracy, safety, and stability, all at once.
The right question is:
What is the minimum capability that reliably satisfies this contract, inside the failure budget?
That sentence is architecture.
Instead of thinking in brand names, think in capability tiers.
A tier is defined by measurable properties: token limits, tool access, latency and cost envelopes, retry budgets, and accuracy on your own evals. For example: Tier S (small and fast), Tier M (balanced), Tier L (most capable).
You can swap the underlying model later without breaking your app, as long as the tier contract holds.
The exact number doesn’t matter.
What matters is that tiers are stable interfaces.
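As a sketch, a tier can be encoded as a small typed structure. The field names below mirror the policy file later in this post, and the S/M/L values are just the running example, not a fixed schema.

from dataclasses import dataclass

# A minimal sketch of a capability tier as a stable interface.
# The concrete model behind a tier can change without breaking callers.
@dataclass(frozen=True)
class CapabilityTier:
    name: str              # "S", "M", "L"
    max_input_tokens: int  # context budget the tier guarantees
    max_tools: int         # how many tool calls the tier may use
    timeout_ms: int        # latency budget per call
    retries: int           # retry budget before falling back

TIERS = {
    "S": CapabilityTier("S", max_input_tokens=4000, max_tools=0, timeout_ms=1500, retries=1),
    "M": CapabilityTier("M", max_input_tokens=8000, max_tools=3, timeout_ms=2500, retries=1),
    "L": CapabilityTier("L", max_input_tokens=16000, max_tools=5, timeout_ms=4000, retries=2),
}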
Model choice should not live inside random call sites.
It belongs in the same place as the rest of your cross-cutting rules: budgets, permissions, feature flags.
In other words: policy.
So we introduce a new actor in the system:
the Model Router.

A model router is a service (or module) that decides which capability tier a request needs, which model (or provider) serves that tier right now, and what happens when the call fails.
Every routing decision should be logged as a structured event.
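A minimal sketch of what that event might contain; the field names and the log_routing_decision helper are placeholders of my own, not a fixed schema.

import json, time, uuid

def log_routing_decision(task: str, chosen_tier: str, reason: str,
                         risk_tier: str, budget_tier: str,
                         upgraded_from: str | None = None) -> None:
    # One structured event per routing decision, so you can replay and audit it later.
    event = {
        "event": "model_routing_decision",
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "task": task,
        "risk_tier": risk_tier,
        "budget_tier": budget_tier,
        "chosen_tier": chosen_tier,
        "upgraded_from": upgraded_from,  # set when a fallback escalated the request
        "reason": reason,                # e.g. "policy_default", "validator_failed", "provider_unhealthy"
    }
    print(json.dumps(event))  # stand-in for your real logger / event bus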
In January we insisted on contracts and schemas.
Here’s the extension:
Model tier becomes part of the contract.
Not as a hard-coded choice, but as a constraint set.
{
  "task": "invoice_support_reply",
  "risk_tier": "high",
  "output_schema": "SupportReplyV2",
  "budget_tier": "silver",
  "min_capability_tier": "M",
  "max_capability_tier": "L",
  "allowed_tools": ["kb_search", "crm_lookup"],
  "disallowed": ["send_email", "refund_payment"]
}
This pushes model selection into the same governance path as everything else.
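One way to make the constraint set executable is a small contract object that clamps whatever the router proposes into the allowed capability window. This is a sketch mirroring the JSON above, not a prescribed API.

from dataclasses import dataclass, field

TIER_ORDER = ["S", "M", "L"]  # lowest to highest capability

@dataclass
class RoutingContract:
    task: str
    risk_tier: str
    output_schema: str
    budget_tier: str
    min_capability_tier: str
    max_capability_tier: str
    allowed_tools: list[str] = field(default_factory=list)
    disallowed: list[str] = field(default_factory=list)

    def clamp(self, proposed_tier: str) -> str:
        # Keep the router's choice inside the contract's capability window.
        lo = TIER_ORDER.index(self.min_capability_tier)
        hi = TIER_ORDER.index(self.max_capability_tier)
        idx = min(max(TIER_ORDER.index(proposed_tier), lo), hi)
        return TIER_ORDER[idx]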
Most teams end up using a small set of patterns.
Here are the ones worth building intentionally.
Static routing by task
Use when: tasks are stable and you want predictability.
Pros: simple, debuggable, stable costs.
Cons: may overpay for easy requests.
Budget-tier routing
Use when: you have paid plans or internal cost controls.
Pros: product-friendly, easy to explain.
Cons: budget tier becomes a product dependency (manage carefully).
Cascade: start cheap, escalate on failure
Use when: you can detect failure reliably (sketched below).
Pros: saves cost while keeping quality.
Cons: requires strong validators and careful latency budgets.
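A minimal sketch of the cascade, assuming placeholder call_model and validate functions and reusing the logging helper from earlier.

def cascade(request, tiers=("S", "M", "L")):
    """Try the cheapest tier first; escalate only when validation fails."""
    last_error = None
    for tier in tiers:
        try:
            output = call_model(tier, request)   # placeholder for your model call
            if validate(output):                 # placeholder for your schema/quality validator
                log_routing_decision(request["task"], tier, "validator_passed",
                                     request["risk_tier"], request["budget_tier"])
                return output
            last_error = "validation_failed"
        except TimeoutError:
            last_error = "timeout"
    # All tiers exhausted: degrade explicitly instead of failing silently.
    raise RuntimeError(f"cascade exhausted: {last_error}")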
Classifier-based routing
Use when: inputs vary wildly and you need smarter selection.
A Tier S model classifies the request, then routes it to the right tier (see the sketch after this pattern).
Pros: flexible, can reduce overuse of Tier L.
Cons: easy to build a flaky router; the classifier must be evaluated like any other model.
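A sketch of classifier-based routing, assuming a placeholder classify_difficulty call backed by a Tier S model and a placeholder contract_for lookup into the task policy.

DIFFICULTY_TO_TIER = {"trivial": "S", "standard": "M", "hard": "L"}

def route_by_classifier(request) -> str:
    # A cheap Tier S model labels the request; the label maps to a capability tier.
    # classify_difficulty is a placeholder and must be evaluated like any other model.
    label = classify_difficulty(request["input"])
    proposed = DIFFICULTY_TO_TIER.get(label, "M")          # default to the middle tier on unknown labels
    return contract_for(request["task"]).clamp(proposed)   # respect the task's contract window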
Shadow / canary routing
Use when: you're migrating or testing providers (sketched below).
Pros: safe evolution.
Cons: needs metrics and kill switches.
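A sketch of shadow routing with asyncio; call_model_async and record_comparison are placeholders, and the 5% fraction is arbitrary.

import asyncio, random

SHADOW_FRACTION = 0.05  # send 5% of traffic to the candidate in the background

async def answer_with_shadow(request):
    # The incumbent tier/provider serves the user; the candidate runs in the shadow.
    primary = await call_model_async("M", request)         # placeholder async model call
    if random.random() < SHADOW_FRACTION:
        # Fire-and-forget: never let the shadow call affect user latency.
        asyncio.create_task(run_shadow(request, primary))
    return primary

async def run_shadow(request, primary_output):
    candidate = await call_model_async("M-candidate", request)
    record_comparison(request, primary_output, candidate)  # placeholder: feeds offline evals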
Fallbacks are not a “retry.”
Fallbacks are a design decision about what failure is acceptable.
Start by naming the failure modes you actually see: timeouts, schema violations, refusals, provider errors. Then map each one to an explicit behavior: retry once, escalate a tier, degrade to a simpler answer, or hand off to a human.
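A minimal sketch of such a mapping, reusing the log_routing_decision helper from earlier; the mode names and actions are illustrative, not a fixed taxonomy.

# Explicit failure-mode -> behavior mapping, kept in one reviewable place.
FALLBACK_POLICY = {
    "timeout":          {"action": "retry_once_then_escalate"},
    "schema_violation": {"action": "escalate_one_tier"},
    "refusal":          {"action": "degrade_to_template_reply"},
    "provider_error":   {"action": "switch_provider_same_tier"},
}

def handle_failure(mode: str, request):
    policy = FALLBACK_POLICY.get(mode, {"action": "degrade_to_template_reply"})
    log_routing_decision(request["task"], chosen_tier="fallback", reason=mode,
                         risk_tier=request["risk_tier"], budget_tier=request["budget_tier"])
    return policy["action"]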
If the system degrades quality, that should be visible to:
- telemetry
- and sometimes the user experience (depending on the product)
A model router without budgets will “work” until it meets real traffic.
This is where March’s Context Packet becomes essential.
Because routing needs visibility into per-request cost, latency, retries, and which tier actually served the request.
This is not “AI.” It’s operations.
From the Context Packet, you already know exactly what went into each request: the tokens, the sources, the task.
Store the routing outcome alongside it: chosen tier, the reason for the choice, cost, latency, retry count, and whether the contract was satisfied.
The router changes outputs as much as the model does.
So you need evaluation at three levels, plus ongoing drift monitoring:
Model eval: per-tier task accuracy, formatting reliability, refusal behavior.
Router eval: does it choose the right tier? Does it avoid over-upgrading?
System eval: end-to-end contract satisfaction across cost, latency, safety, and correctness.
Drift monitoring: quality changes over time due to providers, prompts, and data.
You can label a dataset with the minimum tier that reliably satisfies each contract.
Then measure how often the router over-upgrades (wasted spend), how often it under-routes (failed contracts), and what each choice costs in money and latency.
If you can’t evaluate it, don’t let it decide.
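A sketch of the router-level measurement, assuming each example is labeled with the minimal tier that reliably satisfies its contract, as described above.

def router_eval(labeled_examples, route_fn):
    """labeled_examples: list of (request, minimal_sufficient_tier) pairs.
    route_fn: the router under test; returns a tier name."""
    order = {"S": 0, "M": 1, "L": 2}
    over, under, exact = 0, 0, 0
    for request, minimal_tier in labeled_examples:
        chosen = route_fn(request)
        if order[chosen] > order[minimal_tier]:
            over += 1      # over-upgrade: contract met, money wasted
        elif order[chosen] < order[minimal_tier]:
            under += 1     # under-route: contract likely to fail
        else:
            exact += 1
    n = len(labeled_examples)
    return {"over_rate": over / n, "under_rate": under / n, "exact_rate": exact / n}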
Once you have tiers, product and engineering can collaborate cleanly: product reasons in budget tiers and risk tiers, engineering owns the capability tiers and routing policy behind them.
Everyone gets a knob that maps to their responsibility.
That’s what good architecture does.
Costs explode
Likely cause: silent upgrades to higher tiers, or retries multiplying tokens.
Fix: hard cost ceiling per request + log upgrade reasons + cap upgrade hops (sketched after this list).

Latency blows the budget
Likely cause: cascades without latency budgets, or tool calls in the critical path.
Fix: per-tier timeouts + staged budgets + "no tools" degrade mode.

Outputs feel inconsistent across requests
Likely cause: tier switching changes tone and failure patterns.
Fix: shared contracts + style constraints + deterministic render + tier-specific prompts.

A provider outage takes the feature down
Likely cause: no provider-level health routing or kill switches.
Fix: health-aware routing + circuit breakers + safe degrade mode.

High-risk tasks get risky answers
Likely cause: routing to cheap tiers under pressure, or missing evidence enforcement.
Fix: risk-tier rules: minimum tier + mandatory evidence + stricter validation.
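A sketch of the first fix, a per-request cost ceiling plus a cap on upgrade hops, reusing the earlier logging helper; the constants are illustrative.

MAX_UPGRADE_HOPS = 1          # at most one escalation per request
MAX_COST_PER_REQUEST = 0.05   # hard ceiling in dollars; tune to your product

def guarded_escalate(request, current_tier, spent_so_far, hops_so_far):
    """Return the next tier to try, or None if escalation is not allowed."""
    if hops_so_far >= MAX_UPGRADE_HOPS:
        return None                       # cap upgrade hops
    if spent_so_far >= MAX_COST_PER_REQUEST:
        return None                       # hard cost ceiling per request
    next_tier = {"S": "M", "M": "L"}.get(current_tier)
    if next_tier is None:
        return None                       # already at the top tier
    log_routing_decision(request["task"], next_tier, "validator_failed_upgrade",
                         request["risk_tier"], request["budget_tier"],
                         upgraded_from=current_tier)
    return next_tier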
Don’t bury tier logic in code if you can avoid it.
Make it a config artifact you can review and version.
tiers:
  S:
    max_input_tokens: 4000
    max_tools: 0
    timeout_ms: 1500
    retries: 1
  M:
    max_input_tokens: 8000
    max_tools: 3
    timeout_ms: 2500
    retries: 1
  L:
    max_input_tokens: 16000
    max_tools: 5
    timeout_ms: 4000
    retries: 2

tasks:
  classify_ticket:
    min_tier: S
    max_tier: M
    risk_tier: low
  draft_support_reply:
    min_tier: M
    max_tier: L
    risk_tier: medium
  generate_refund_decision:
    min_tier: L
    max_tier: L
    risk_tier: high
    require_evidence: true
    require_schema: true
Make critical behavior reviewable and testable as configuration, not folklore.
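A sketch of loading and sanity-checking that policy at deploy time, assuming PyYAML and a routing_policy.yaml file name (both assumptions).

import yaml  # PyYAML; assumes the policy above lives in routing_policy.yaml

TIER_ORDER = ["S", "M", "L"]

def load_routing_policy(path="routing_policy.yaml"):
    with open(path) as f:
        policy = yaml.safe_load(f)
    # Fail fast at deploy time instead of at request time.
    for name, task in policy["tasks"].items():
        assert task["min_tier"] in policy["tiers"], f"{name}: unknown min_tier"
        assert task["max_tier"] in policy["tiers"], f"{name}: unknown max_tier"
        assert TIER_ORDER.index(task["min_tier"]) <= TIER_ORDER.index(task["max_tier"]), \
            f"{name}: min_tier above max_tier"
        if task.get("risk_tier") == "high":
            assert task.get("require_schema", False), f"{name}: high-risk tasks must require a schema"
    return policy

Run this in CI so a bad routing policy fails review instead of production.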
If you have one model and a toy use case, you can skip all of this.
If you have multiple tasks, real traffic, cost pressure, and more than one risk tier, you will rebuild this anyway—either intentionally or in panic.
Build the concept first: capability tiers, contracts, and a routing policy.
The implementation can be a small module or a service.
Frameworks help, but they don’t decide your budgets or risk rules for you.
Keep the constants stable across tiers: the output schema, the style constraints, the disallowed actions.
Let tiers differ mainly in capability, not personality.
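A sketch of what stable constants can look like in code: schema, style, and disallowed actions are shared, and only the capability-specific instruction varies per tier. All values here are illustrative.

# Constants shared by every tier, so switching tiers doesn't change the product's voice.
SHARED = {
    "output_schema": "SupportReplyV2",
    "style": "concise, neutral, no promises about refunds",
    "disallowed": ["send_email", "refund_payment"],
}

def build_prompt(tier: str, task_instructions: str) -> str:
    # Only the capability-specific part varies per tier (e.g. reasoning depth, tool hints).
    tier_specific = {"S": "Answer directly.",
                     "M": "Think step by step before answering.",
                     "L": "Think step by step and cite the evidence you used."}[tier]
    return (f"Follow schema {SHARED['output_schema']}. Style: {SHARED['style']}. "
            f"Never do: {', '.join(SHARED['disallowed'])}. {tier_specific}\n\n{task_instructions}")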
April made model choice explicit: capability tiers, contracts, routing policy, fallbacks, and budgets.
Next month we tackle the part everyone rushes into—and then regrets:
Open Weights in Production: evaluation, licensing, and guardrails
Because once you can route between models, you’ll inevitably ask:
When do I host my own—and what does “production-safe” mean when the weights are mine?