
Open weights shift your risk from vendor to you. This month is the playbook: evaluate like a product, treat licensing as architecture, and ship with guardrails that survive real users.
Axel Domingues
April was about choosing models.
May is about owning them.
Because the moment you run an open-weights model in production, you inherit responsibilities that a vendor used to hide behind an API:
This isn’t an argument for or against open weights.
It’s the missing architecture doc:
how to use open weights without turning your product into an expensive science project.
That does not automatically mean “open source,” and it definitely does not mean “no strings attached.”
The real shift
Open weights move risk and responsibility from the vendor to your system.
The operational payoff
If you do it right, you buy control: privacy, cost shaping, and predictable behavior.
Open weights are appealing for good reasons:
And yet… most teams struggle on the first attempt.
Because they adopt open weights the way they adopted libraries in 2015:
“We’ll just run it.”
But an LLM is not “a library.”
It’s a stochastic subsystem with a large blast radius.
So this month, we treat open weights the same way we treated distributed data in 2022:
as a production system that requires controls.
There are only three sane ways to use open weights. Everything else is a variant.
1) Vendor API
Fastest to ship. Lowest infra burden. Highest dependency and least control.
2) Managed open weights
A provider hosts the open model for you. Good compromise: control + lower ops load.
3) Self-hosted runtime
Max control. Max responsibility. You own latency, scaling, safety, and outages.
The trap
“Self-hosted” without evaluation + guardrails is just DIY outages.
If April taught us that model selection becomes architecture, May adds the uncomfortable corollary:
If you can’t evaluate it, you can’t operate it.
Open weights make this non-negotiable because:
So: you need an evaluation harness that acts like CI.
Your harness needs to answer three questions continuously:
Treat LLM output like an API response that can be wrong in creative ways.
Contract adherence
Schema validity, required fields, refusal behavior, tool call correctness.
Truth risk
Hallucination rate on grounded tasks, citation behavior, “confident wrongness.”
Latency and throughput
p50 / p95 / p99, queue delay, batch efficiency, cold start frequency.
Cost and capacity
GPU seconds, memory headroom, context length usage, cache hit rate.
A golden set is not a benchmark leaderboard.
It’s a curated set of real tasks that represent your actual failure modes:
This is the part engineers love to skip.
But licensing shapes your architecture as much as:
Open weights come with licenses that vary widely. Some are permissive. Some are conditional. Some are commercial with restrictions.
And the key operational fact is:
Your model is now a dependency with legal constraints.
Fine-tuning, adapters, merges, and quantization are often treated differently across licenses. You need to know which artifacts you can ship and store.
In 2022 we treated “compatibility” as a discipline.
Do the same here:
With vendor APIs, you inherit a safety layer you didn’t build.
With open weights, you have to build it.
The way I frame this is simple:
Your model is a powerful worker. Your system is the supervisor.
That supervisor needs three kinds of controls:
In January we treated the model as probabilistic. In February we treated documents as untrusted input. In March we treated context assembly as a subsystem.
Now we apply the untrusted rule everywhere:
This isn’t cynicism. It’s how production systems survive.
Here’s the simplest shape I’ve seen work reliably:

Gateway
Auth, rate limits, customer tiering, request shaping.
Policy + budgets
Which model, max tokens, tool permissions, fallback rules.
Context assembly
Retrieval, summaries, state, memory policy, token accounting.
Validation layer
Schema validation, safety filters, refusal checks, tool call verification.
Under that sits the runtime:
You do not need to start “enterprise.” You do need the boundaries.
If your downstream expects JSON, enforce it.
A validator is a contract.
If an answer can cause harm (money, permissions, irreversible actions), split it:
This is the “humans approve actions” pattern — but automated.
Tools are not “capabilities.” Tools are attack surface.
When open weights fail, they fail oddly. So your product needs a consistent refusal and escalation story:
Self-hosting open weights makes three costs visible immediately:
Context discipline
Summarize aggressively, retrieve narrowly, cap tokens by route.
Caching
Cache embeddings, retrieved chunks, and deterministic tool results.
Batching
Batch short requests, separate queues by latency tier.
Tiered models
Cheap default, expensive escalation for hard cases.
Most product traffic is routine. Design for that.
EleutherAI — lm-evaluation-harness (golden sets as CI)
A practical evaluation harness you can wire into release gates: regressions, variance checks, and “does it still meet the contract?”
Stanford CRFM — HELM (holistic, reproducible evaluation)
A framework for evaluating across scenarios and metrics—useful when “accuracy” isn’t the whole product (safety, robustness, bias, etc.).
No. Many agent systems work well with vendor APIs.
Open weights are a control move — useful when you need predictable costs, data governance, customization, or resilience against provider shifts.
Teams optimize inference first and evaluation later.
They get fast responses… that are wrong, unsafe, or unstable — and they only discover it after users do.
Not in production.
Prompting is part of the safety story, but it’s not enforcement. You still need validators, tool permissions, and monitoring.
April reframed model selection as architecture.
May added the missing constraint:
Open weights are only “cheaper and safer” if you can evaluate and operate them.
Next month the surface area expands again:
multimodality.
Because the moment your system sees images, hears audio, or speaks back…
UX changes.
And your guardrails and budgets change with it.
Multimodal Changes UX: designing text+vision+audio systems
Multimodal Changes UX: designing text+vision+audio systems
Multimodal isn’t “a bigger prompt”. It’s a perception + reasoning + UX system with new contracts, new failure modes, and new latency/cost constraints. This month is about designing it so it behaves predictably.
Model Selection Becomes Architecture: Routing, Budgets, and Capability Tiers
The moment you have multiple models with different strengths, “pick the best model” turns into system design. This month is a practical blueprint for routing, fallbacks, and budget-aware capability tiers—without turning your stack into an untestable mess.