Blog
Apr 28, 2024 - 18 MIN READ
Model Selection Becomes Architecture: Routing, Budgets, and Capability Tiers


The moment you have multiple models with different strengths, “pick the best model” turns into system design. This month is a practical blueprint for routing, fallbacks, and budget-aware capability tiers—without turning your stack into an untestable mess.

Axel Domingues


January gave us the boundary layer: contracts, schemas, failure budgets.

February made context explicit: stuff vs retrieve.

March turned it into software: context assembly as a subsystem, with Context Packets you can replay.

Now the next uncomfortable truth arrives:

In production, “the model” is not a single choice.
It’s a portfolio—and a router.

The moment you have multiple models available (different costs, latencies, modalities, safety properties, and accuracy profiles), you stop asking:

“Which model is best?”

And start asking:

“What model is best for this contract, under this budget, with this risk tier… and what happens when it fails?”

This is the month where model selection becomes architecture.

  • The shift: you’re not choosing a model; you’re designing routing + fallbacks.
  • The abstraction: every request is a contract (task, schema, risk tier, budget tier).
  • The control surface: routing is driven by capability tiers + policy constraints.
  • The operator view: reliability is measured in p95, cost, retries, and error modes.


Why “Pick the Best Model” Is the Wrong Question

In a lab, you can chase a single score.

In production, you’re optimizing a vector:

  • correctness (for the contract, not vibes)
  • latency (p95 and tail behavior)
  • cost (including retries and long contexts)
  • safety (tool permissions, refusal compliance, injection robustness)
  • modality needs (text-only vs image/audio/video)
  • availability (outages, rate limits, regional constraints)
  • governance (data residency, auditability, vendor constraints)

The right question is:

What is the minimum capability that reliably satisfies this contract, inside the failure budget?

That sentence is architecture.


The Core Idea: Capability Tiers (Not “Small vs Big”)

Instead of thinking in brand names, think in capability tiers.

A tier is defined by measurable properties:

  • supported modalities
  • max input/output tokens
  • tool-use reliability
  • reasoning quality (for your task suite)
  • safety behavior (for your policies)
  • median + p95 latency
  • cost per million tokens
  • quota and availability profile

A tier is a product-level promise, not a model card.

You can swap the underlying model later without breaking your app, as long as the tier contract holds.

A simple tiering scheme

  • Tier S (fast/cheap): classification, extraction, light rewriting, routing decisions
  • Tier M (balanced): most user-facing drafts, structured outputs, moderate reasoning
  • Tier L (premium): hard reasoning, long context synthesis, high-stakes drafts
  • Tier X (specialized): multimodal, code-heavy, domain-tuned, or policy-hardened

The exact number doesn’t matter.

What matters is that tiers are stable interfaces.
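
A minimal sketch of what a tier looks like as code (the names and numbers are illustrative; the Tier M values mirror the config later in this post):

from dataclasses import dataclass

@dataclass(frozen=True)
class CapabilityTier:
    """A product-level promise: measured properties, not a model card."""
    name: str                     # "S", "M", "L", "X"
    modalities: tuple[str, ...]   # e.g. ("text",) or ("text", "image")
    max_input_tokens: int
    max_output_tokens: int
    supports_tools: bool
    p95_latency_ms: int           # measured on your traffic, not quoted from a datasheet
    cost_per_mtok_usd: float      # including typical retry overhead
    models: tuple[str, ...]       # interchangeable models that satisfy the promise

TIER_M = CapabilityTier(
    name="M",
    modalities=("text",),
    max_input_tokens=8000,
    max_output_tokens=1500,
    supports_tools=True,
    p95_latency_ms=2500,
    cost_per_mtok_usd=3.0,
    models=("provider_a/model-m", "provider_b/model-m"),
)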


Routing Is a Policy Decision, Not a Developer Preference

Model choice should not live inside random call sites.

It belongs in the same place as:

  • auth rules
  • tool allowlists
  • PII policies
  • schema requirements
  • failure budgets

In other words: policy.

So we introduce a new actor in the system:

the Model Router.


The Model Router: What It Does (and What It Must Log)

A model router is a service (or module) that decides:

  1. which tier to use
  2. which provider/model within the tier
  3. what fallbacks are allowed
  4. when to degrade gracefully vs fail fast

Router inputs

  • task name (from your contract)
  • risk tier (low/med/high)
  • budget tier (bronze/silver/gold)
  • context packet stats (tokens, evidence count, tools requested)
  • policy constraints (data type, residency, “no external calls”, etc.)
  • current system state (provider health, quotas, outage flags)

Router outputs

  • selected model (or provider + model id)
  • allowed fallback set
  • timeout and retry policy
  • tool permissions (tightened when needed)
  • decision reason (for logs)

If you can’t explain why the router chose a model, you can’t debug production behavior.

Every routing decision should be logged as a structured event.
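
A minimal sketch of that decision record; the type and reason-code names are hypothetical, and the print call stands in for whatever structured logger you already use:

import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RoutingDecision:
    task: str
    risk_tier: str              # "low" | "med" | "high"
    budget_tier: str            # "bronze" | "silver" | "gold"
    selected_tier: str          # "S" | "M" | "L" | "X"
    selected_model: str         # provider + model id
    fallbacks: list[str]        # ordered fallback set, may be empty
    timeout_ms: int
    max_retries: int
    allowed_tools: list[str]    # tightened from the contract when needed
    reason_codes: list[str]     # why this tier/model was chosen

def log_decision(decision: RoutingDecision) -> None:
    # One structured event per routing decision, stored next to the Context Packet.
    event = {"event": "routing_decision", "ts": time.time(), **asdict(decision)}
    print(json.dumps(event))    # stand-in for your structured logger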


The Contract Drives the Tier

In January we insisted on contracts and schemas.

Here’s the extension:

Model tier becomes part of the contract.

Not as a hard-coded choice, but as a constraint set.

Example: task contract (conceptual)

{
  "task": "invoice_support_reply",
  "risk_tier": "high",
  "output_schema": "SupportReplyV2",
  "budget_tier": "silver",
  "min_capability_tier": "M",
  "max_capability_tier": "L",
  "allowed_tools": ["kb_search", "crm_lookup"],
  "disallowed": ["send_email", "refund_payment"]
}

This pushes model selection into the same governance path as everything else.
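
If you want those constraints enforced rather than just documented, the same contract can live as a typed object next to your schemas and tool allowlists. A sketch using the field names from the JSON above; nothing here is a real framework:

from dataclasses import dataclass

@dataclass(frozen=True)
class TaskContract:
    task: str
    risk_tier: str                    # "low" | "medium" | "high"
    output_schema: str                # validated downstream, as in January
    budget_tier: str                  # "bronze" | "silver" | "gold"
    min_capability_tier: str          # lowest tier allowed to serve this task
    max_capability_tier: str          # highest tier the router may climb to
    allowed_tools: tuple[str, ...] = ()
    disallowed: tuple[str, ...] = ()

INVOICE_SUPPORT_REPLY = TaskContract(
    task="invoice_support_reply",
    risk_tier="high",
    output_schema="SupportReplyV2",
    budget_tier="silver",
    min_capability_tier="M",
    max_capability_tier="L",
    allowed_tools=("kb_search", "crm_lookup"),
    disallowed=("send_email", "refund_payment"),
)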


Routing Patterns That Actually Work

Most teams end up using a small set of patterns.

The sections that follow cover the ones worth building intentionally.


Fallbacks: The Difference Between “Smart” and “Operable”

Fallbacks are not a “retry.”

Fallbacks are a design decision about what failure is acceptable.

Define failure modes explicitly

  • provider timeout
  • rate limit
  • safety refusal
  • schema violation
  • missing citations / missing evidence usage
  • tool call failure
  • low confidence (if you have a confidence proxy)

Then map them to behavior:

  • degrade (cheaper model, shorter context, no tools)
  • upgrade (higher tier, stricter reasoning)
  • fail fast (high-risk task + low trust input)
  • ask a question (interactive resolution)
  • hand off to human (if applicable)

A fallback that silently changes behavior is a trust bug.

If the system degrades quality, that should be visible to:

  • telemetry
  • and sometimes the user experience (depending on the product)
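
One way to keep that failure-mode-to-behavior mapping reviewable (a sketch; the failure-mode and action names are illustrative) is a single table in code, versioned like policy, rather than behavior scattered across try/except blocks:

from enum import Enum

class Failure(Enum):
    PROVIDER_TIMEOUT = "provider_timeout"
    RATE_LIMIT = "rate_limit"
    SAFETY_REFUSAL = "safety_refusal"
    SCHEMA_VIOLATION = "schema_violation"
    MISSING_EVIDENCE = "missing_evidence"
    TOOL_CALL_FAILURE = "tool_call_failure"
    LOW_CONFIDENCE = "low_confidence"

class Action(Enum):
    DEGRADE = "degrade"          # cheaper model, shorter context, no tools
    UPGRADE = "upgrade"          # higher tier, stricter reasoning
    FAIL_FAST = "fail_fast"      # surface the error instead of guessing
    ASK_QUESTION = "ask"         # interactive resolution
    HUMAN_HANDOFF = "human"      # if the product supports it

# Per-task policy, reviewed and versioned like any other policy artifact.
FALLBACK_POLICY: dict[tuple[str, Failure], Action] = {
    ("draft_support_reply", Failure.PROVIDER_TIMEOUT): Action.DEGRADE,
    ("draft_support_reply", Failure.SCHEMA_VIOLATION): Action.UPGRADE,
    ("generate_refund_decision", Failure.MISSING_EVIDENCE): Action.FAIL_FAST,
    ("generate_refund_decision", Failure.LOW_CONFIDENCE): Action.HUMAN_HANDOFF,
}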

Latency and Cost: Routing Without Budgets Is Just Gambling

A model router without budgets will “work” until it meets real traffic.

The budgets that matter in routing

  • p95 end-to-end latency (not just model time)
  • max retries (per request and per stage)
  • max upgrade hops (how many tiers can we climb)
  • max input tokens (context assembly cap)
  • max tool calls (tools are latency multipliers)
  • max cost per request (hard ceiling)

This is where March’s Context Packet becomes essential.

Because routing needs visibility into:

  • token count (actual, not guessed)
  • tool needs (requested + allowed)
  • evidence count (how much retrieval is happening)
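
A sketch of checking those budgets up front, assuming the Context Packet exposes actual token and tool counts; the field names are hypothetical:

from dataclasses import dataclass

@dataclass(frozen=True)
class RequestBudget:
    p95_latency_ms: int      # end-to-end, not just model time
    max_retries: int         # per request
    max_upgrade_hops: int    # how many tiers we may climb
    max_input_tokens: int    # context assembly cap
    max_tool_calls: int      # tools are latency multipliers
    max_cost_usd: float      # hard ceiling per request

def budget_violations(budget: RequestBudget, packet) -> list[str]:
    """Check the Context Packet against the budget before any model call."""
    violations = []
    if packet.input_tokens > budget.max_input_tokens:
        violations.append("input_tokens_over_budget")
    if packet.requested_tool_count > budget.max_tool_calls:
        violations.append("tool_calls_over_budget")
    return violations    # the router decides: trim context, degrade, or fail fast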

A Practical Router Decision Tree (You Can Implement Tomorrow)

This is not “AI.” It’s operations.

Start from the contract

  • task name
  • risk tier
  • output schema
  • allowed tools
  • budget tier

Compute a context profile

From the Context Packet:

  • input tokens
  • evidence tokens
  • tool count requested
  • “untrusted evidence” present? (usually yes)

Choose the minimum tier that fits the contract

  • If high-risk: minimum Tier M (often)
  • If tool-heavy: require a tier with proven tool reliability
  • If long-context: require a tier with stable long-context behavior

Apply budget gating

  • bronze may cap at Tier M
  • gold may allow Tier L

Apply health constraints

  • if provider A is degraded, route to provider B within same tier
  • if both degraded, degrade tier or fail fast depending on risk

Define fallback policy

  • max upgrades: 1 (usually)
  • retries: 1–2 with jitter
  • timeouts: tier-specific
  • schema validation required

Emit and log the routing decision

Store:

  • selected tier/model
  • reason codes
  • budgets
  • fallback set
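
Put together, the tree above is a few dozen lines of ordinary code. A sketch, reusing the hypothetical contract and packet shapes from earlier sections; the 8,000-token threshold is a placeholder you would take from your tier config:

TIER_ORDER = ["S", "M", "L"]

def higher(a: str, b: str) -> str:
    """The more capable of two tiers."""
    return max(a, b, key=TIER_ORDER.index)

def lower(a: str, b: str) -> str:
    """The less capable of two tiers."""
    return min(a, b, key=TIER_ORDER.index)

def route(contract, packet, budget_caps, models_by_tier, provider_health):
    reasons = []

    # 1. Start from the contract's minimum tier.
    tier = contract.min_capability_tier

    # 2. Risk and context profile push the tier up.
    if contract.risk_tier == "high":
        tier = higher(tier, "M")
        reasons.append("high_risk_floor")
    if packet.input_tokens > 8000:
        tier = higher(tier, "L")
        reasons.append("long_context")

    # 3. Budget gating caps how high we can climb.
    tier = lower(tier, budget_caps[contract.budget_tier])    # e.g. {"bronze": "M", "gold": "L"}
    tier = lower(tier, contract.max_capability_tier)
    if TIER_ORDER.index(tier) < TIER_ORDER.index(contract.min_capability_tier):
        reasons.append("budget_below_contract_minimum")      # policy conflict: surface it, don't hide it

    # 4. Health constraints: only healthy models within the tier are candidates.
    candidates = [m for m in models_by_tier[tier]
                  if provider_health.get(m, "ok") == "ok"]
    if not candidates:
        reasons.append("tier_unavailable")                   # degrade or fail fast per the fallback policy

    # 5. The caller wraps this into a RoutingDecision and logs it with the reason codes.
    return tier, candidates, reasons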

Testing: You Must Evaluate the Router, Not Just the Models

The router changes outputs as much as the model does.

So you need evaluations at three levels, plus ongoing drift monitoring:

Model eval

Per-tier task accuracy, formatting reliability, refusal behavior.

Router eval

Does it choose the right tier? Does it avoid over-upgrading?

System eval

End-to-end contract satisfaction: cost, latency, safety, correctness.

Drift monitoring

Quality changes over time due to providers, prompts, and data.

Router evaluation is surprisingly concrete

You can label a dataset with:

  • input complexity class
  • required tier (S/M/L)
  • allowed tools
  • acceptable cost ceiling

Then measure:

  • accuracy of routing
  • upgrade frequency
  • degradation frequency
  • impact on success rate and p95

If you can’t evaluate it, don’t let it decide.
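
A sketch of that measurement loop, assuming each labeled case records the contract, the Context Packet, and the tier a reviewer expected; route_fn is a thin wrapper around the router sketched earlier:

def evaluate_router(cases, route_fn):
    """cases: labeled dicts with 'contract', 'packet', and an 'expected_tier'."""
    stats = {"total": 0, "correct": 0, "over_tier": 0, "under_tier": 0}
    for case in cases:
        tier, _, _ = route_fn(case["contract"], case["packet"])
        stats["total"] += 1
        if tier == case["expected_tier"]:
            stats["correct"] += 1
        elif TIER_ORDER.index(tier) > TIER_ORDER.index(case["expected_tier"]):
            stats["over_tier"] += 1     # burning money and latency
        else:
            stats["under_tier"] += 1    # likely failing the contract
    stats["routing_accuracy"] = stats["correct"] / max(stats["total"], 1)
    return stats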


Governance: Separating Product Choice From Engineering Choice

Once you have tiers, product and engineering can collaborate cleanly:

  • Product defines budget tiers and user entitlements
  • Engineering defines capability tiers and contracts
  • Security defines risk tiers and tool allowlists
  • Ops defines SLOs and kill switches

Everyone gets a knob that maps to their responsibility.

That’s what good architecture does.


Common Failure Modes (Multi-Model Edition)


A Template You Can Copy: “Tier Rules” as Configuration

Don’t bury tier logic in code if you can avoid it.

Make it a config artifact you can review and version.

tiers:
  S:
    max_input_tokens: 4000
    max_tools: 0
    timeout_ms: 1500
    retries: 1
  M:
    max_input_tokens: 8000
    max_tools: 3
    timeout_ms: 2500
    retries: 1
  L:
    max_input_tokens: 16000
    max_tools: 5
    timeout_ms: 4000
    retries: 2

tasks:
  classify_ticket:
    min_tier: S
    max_tier: M
    risk_tier: low
  draft_support_reply:
    min_tier: M
    max_tier: L
    risk_tier: medium
  generate_refund_decision:
    min_tier: L
    max_tier: L
    risk_tier: high
    require_evidence: true
    require_schema: true

This is the same move we made in 2022 with operational architecture:

Make critical behavior reviewable and testable as configuration, not folklore.
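
At runtime, that file can be the single source of truth. A small sketch, assuming PyYAML and the structure shown above, that fails at startup rather than at request time:

import yaml  # PyYAML

def load_tier_rules(path: str) -> dict:
    with open(path) as f:
        rules = yaml.safe_load(f)
    # Fail at deploy/startup time, not at request time, if a task references an undefined tier.
    for task, spec in rules["tasks"].items():
        for key in ("min_tier", "max_tier"):
            if spec[key] not in rules["tiers"]:
                raise ValueError(f"{task}: {key} '{spec[key]}' is not a defined tier")
    return rules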


What’s Next

April made model choice explicit:

  • capability tiers become an interface
  • routing becomes policy
  • fallbacks become design
  • budgets become enforceable
  • the router becomes evaluable and observable

Next month we tackle the part everyone rushes into—and then regrets:

Open Weights in Production: evaluation, licensing, and guardrails

Because once you can route between models, you’ll inevitably ask:

When do I host my own—and what does “production-safe” mean when the weights are mine?

Axel Domingues - 2026