May 26, 2024 - 15 MIN READ

Open Weights in Production: evaluation, licensing, and guardrails

Open weights shift your risk from vendor to you. This month is the playbook: evaluate like a product, treat licensing as architecture, and ship with guardrails that survive real users.

Axel Domingues

April was about choosing models.

May is about owning them.

Because the moment you run an open-weights model in production, you inherit responsibilities that a vendor used to hide behind an API:

evaluation becomes your release gate
licensing becomes your compliance surface
safety becomes your operational discipline
and “performance” becomes your GPU bill, your latency, your incident

This isn’t an argument for or against open weights.

It’s the missing architecture doc:

how to use open weights without turning your product into an expensive science project.

When I say open weights, I mean “you can run the model yourself.”

That does not automatically mean “open source,” and it definitely does not mean “no strings attached.”

The real shift

Open weights move risk and responsibility from the vendor to your system.

The operational payoff

If you do it right, you buy control: privacy, cost shaping, and predictable behavior.

Why teams want open weights (and why they get burned)

Open weights are appealing for good reasons:

Cost shaping: predictable marginal cost when usage scales
Data control: sensitive workflows without sending data to a third party
Customization: domain tuning, adapters, retrieval augmentation, prompt templates
Availability: fewer “provider incident” surprises
Portability: your product is not welded to one vendor’s roadmap

And yet… most teams struggle on the first attempt.

Because they adopt open weights the way they adopted libraries in 2015:

“We’ll just run it.”

But an LLM is not “a library.”

It’s a stochastic subsystem with a large blast radius.

So this month, we treat open weights the same way we treated distributed data in 2022:

as a production system that requires controls.

The three deployment postures

There are only three sane ways to use open weights. Everything else is a variant.

1) Vendor API

Fastest to ship. Lowest infra burden. Highest dependency and least control.

2) Managed open weights

A provider hosts the open model for you. Good compromise: control + lower ops load.

3) Self-hosted runtime

Max control. Max responsibility. You own latency, scaling, safety, and outages.

The trap

“Self-hosted” without evaluation + guardrails is just DIY outages.

Evaluation is the product gate (not a research exercise)

If April taught us that model selection becomes architecture, May adds the uncomfortable corollary:

If you can’t evaluate it, you can’t operate it.

Open weights make this non-negotiable because:

you will upgrade models (or weights)
you will change prompts and tools
you will quantize, batch, cache, and optimize
you will ship “one small change” that shifts behavior unexpectedly

So: you need an evaluation harness that acts like CI.

The minimum viable evaluation harness

Your harness needs to answer three questions continuously:

Does it still meet the contract? (correctness + format + safety)
Is it within budget? (tokens, latency, GPU time, fallbacks)
Is it stable? (variance across runs, regressions, long-tail failures)

The best evaluation mindset is boring:

Treat LLM output like an API response that can be wrong in creative ways.

What you should measure (beyond accuracy)

Contract adherence

Schema validity, required fields, refusal behavior, tool call correctness.

Truth risk

Hallucination rate on grounded tasks, citation behavior, “confident wrongness.”

Latency and throughput

p50 / p95 / p99, queue delay, batch efficiency, cold start frequency.

Cost and capacity

GPU seconds, memory headroom, context length usage, cache hit rate.

Build a “golden set” that represents your product

A golden set is not a benchmark leaderboard.

It’s a curated set of real tasks that represent your actual failure modes:

messy user inputs
edge cases
adversarial prompts
incomplete context
tool failures
ambiguous instructions
“looks correct but wrong” traps

A good golden set is small (hundreds to a few thousand cases), but maintained. It evolves with your product the same way test suites evolve with APIs.

Licensing is architecture (because it constrains your product)

This is the part engineers love to skip.

But licensing shapes your architecture as much as:

a database license
a cloud provider contract
an open-source dependency
a data residency requirement

Open weights come with licenses that vary widely. Some are permissive. Some are conditional. Some are commercial with restrictions.

And the key operational fact is:

Your model is now a dependency with legal constraints.

I’m not a lawyer. Treat this section as an engineering checklist for what to ask and verify — not legal advice.

The practical licensing questions you must answer

Treat the license like a runtime policy

In 2022 we treated “compatibility” as a discipline.

Do the same here:

put license review in your release checklist
version your model artifacts and their license metadata
keep an internal “model bill of materials” (MBOM)
make upgrades a change-management event, not a casual swap

Guardrails: you are now the safety team

With vendor APIs, you inherit a safety layer you didn’t build.

With open weights, you have to build it.

The way I frame this is simple:

Your model is a powerful worker. Your system is the supervisor.

That supervisor needs three kinds of controls:

Input guardrails (stop obvious nonsense early)
Output guardrails (validate contracts, filter unsafe content)
Action guardrails (tool use with permissions, budgets, and sandboxing)

The “untrusted everything” rule

In January we treated the model as probabilistic. In February we treated documents as untrusted input. In March we treated context assembly as a subsystem.

Now we apply the untrusted rule everywhere:

user input is untrusted
retrieved text is untrusted
tool outputs are untrusted
and model output is untrusted

This isn’t cynicism. It’s how production systems survive.

A reference architecture for open-weight LLMs

Here’s the simplest shape I’ve seen work reliably:

Gateway

Auth, rate limits, customer tiering, request shaping.

Policy + budgets

Which model, max tokens, tool permissions, fallback rules.

Context assembly

Retrieval, summaries, state, memory policy, token accounting.

Validation layer

Schema validation, safety filters, refusal checks, tool call verification.

Under that sits the runtime:

inference server(s)
caching and batching
GPU scheduling
observability and tracing
canary and rollback mechanisms

You do not need to start “enterprise.” You do need the boundaries.

Guardrails that actually work in practice

1) Enforce structured outputs (and reject non-compliance)

If your downstream expects JSON, enforce it.

use JSON schema / typed decoding where possible
reject invalid responses and retry with a smaller “repair” prompt
log invalid outputs as evaluation failures

A model that “usually returns JSON” is not a contract.

A validator is a contract.

2) Separate generation from decision

If an answer can cause harm (money, permissions, irreversible actions), split it:

model proposes
system verifies
system executes

This is the “humans approve actions” pattern — but automated.

3) Tool use must be permissioned

Tools are not “capabilities.” Tools are attack surface.

allowlist tools per route/customer tier
enforce argument schemas
sandbox execution (timeouts, network egress controls)
record tool calls as audit events

4) Build a refusal policy you can explain

When open weights fail, they fail oddly. So your product needs a consistent refusal and escalation story:

what you refuse
how you explain it
what users can do next
how you log it for review

The real operational costs (and how to control them)

Self-hosting open weights makes three costs visible immediately:

GPU memory and throughput
long context amplification
variance (p99 latency and outlier behavior)

Cost controls that don’t ruin quality

Context discipline

Summarize aggressively, retrieve narrowly, cap tokens by route.

Caching

Cache embeddings, retrieved chunks, and deterministic tool results.

Batching

Batch short requests, separate queues by latency tier.

Tiered models

Cheap default, expensive escalation for hard cases.

The biggest cost mistake is treating every request like a “high reasoning” request.

Most product traffic is routine. Design for that.

A go-live checklist you can actually use

Define contracts per endpoint

expected output format (schema)
allowed tools and arguments
refusal behavior
max cost/latency budgets

Build the golden set + regression harness

representative user tasks
adversarial cases
tool failure cases
scorecards for contract adherence and safety

Instrument the runtime

request traces end-to-end
token accounting and context composition logs
GPU utilization and queue depth
p95/p99 latency per route

Ship with canaries + rollback

shadow traffic on the new model
small percent rollout
automatic rollback on metric regression

Establish a model change policy

who approves upgrades
how you version artifacts
how you record “which model answered what” for audits

Resources

EleutherAI — lm-evaluation-harness (golden sets as CI)

A practical evaluation harness you can wire into release gates: regressions, variance checks, and “does it still meet the contract?”

Stanford CRFM — HELM (holistic, reproducible evaluation)

A framework for evaluating across scenarios and metrics—useful when “accuracy” isn’t the whole product (safety, robustness, bias, etc.).

OWASP Top 10 for LLM Applications (threat model)

A practical map of failure modes (prompt injection, insecure output handling, DoS, data leakage) that directly informs guardrails and tool gating.

OpenChain (ISO/IEC 5230) — License compliance program basics

A lightweight process checklist for “licensing is architecture”: roles, review gates, records, and upgrade discipline.

What’s Next

April reframed model selection as architecture.

May added the missing constraint:

Open weights are only “cheaper and safer” if you can evaluate and operate them.

Next month the surface area expands again:

multimodality.

Because the moment your system sees images, hears audio, or speaks back…

UX changes.

And your guardrails and budgets change with it.

Multimodal Changes UX

Multimodal Changes UX: designing text+vision+audio systems

Multimodal isn’t “a bigger prompt”. It’s a perception + reasoning + UX system with new contracts, new failure modes, and new latency/cost constraints. This month is about designing it so it behaves predictably.

Model Selection Becomes Architecture: Routing, Budgets, and Capability Tiers

The moment you have multiple models with different strengths, “pick the best model” turns into system design. This month is a practical blueprint for routing, fallbacks, and budget-aware capability tiers—without turning your stack into an untestable mess.