Blog
May 26, 2024 - 15 MIN READ
Open Weights in Production: evaluation, licensing, and guardrails

Open Weights in Production: evaluation, licensing, and guardrails

Open weights shift your risk from vendor to you. This month is the playbook: evaluate like a product, treat licensing as architecture, and ship with guardrails that survive real users.

Axel Domingues

Axel Domingues

April was about choosing models.

May is about owning them.

Because the moment you run an open-weights model in production, you inherit responsibilities that a vendor used to hide behind an API:

  • evaluation becomes your release gate
  • licensing becomes your compliance surface
  • safety becomes your operational discipline
  • and “performance” becomes your GPU bill, your latency, your incident

This isn’t an argument for or against open weights.

It’s the missing architecture doc:

how to use open weights without turning your product into an expensive science project.

When I say open weights, I mean “you can run the model yourself.”

That does not automatically mean “open source,” and it definitely does not mean “no strings attached.”

The real shift

Open weights move risk and responsibility from the vendor to your system.

The operational payoff

If you do it right, you buy control: privacy, cost shaping, and predictable behavior.


Why teams want open weights (and why they get burned)

Open weights are appealing for good reasons:

  • Cost shaping: predictable marginal cost when usage scales
  • Data control: sensitive workflows without sending data to a third party
  • Customization: domain tuning, adapters, retrieval augmentation, prompt templates
  • Availability: fewer “provider incident” surprises
  • Portability: your product is not welded to one vendor’s roadmap

And yet… most teams struggle on the first attempt.

Because they adopt open weights the way they adopted libraries in 2015:

“We’ll just run it.”

But an LLM is not “a library.”

It’s a stochastic subsystem with a large blast radius.

So this month, we treat open weights the same way we treated distributed data in 2022:

as a production system that requires controls.


The three deployment postures

There are only three sane ways to use open weights. Everything else is a variant.

1) Vendor API

Fastest to ship. Lowest infra burden. Highest dependency and least control.

2) Managed open weights

A provider hosts the open model for you. Good compromise: control + lower ops load.

3) Self-hosted runtime

Max control. Max responsibility. You own latency, scaling, safety, and outages.

The trap

“Self-hosted” without evaluation + guardrails is just DIY outages.


Evaluation is the product gate (not a research exercise)

If April taught us that model selection becomes architecture, May adds the uncomfortable corollary:

If you can’t evaluate it, you can’t operate it.

Open weights make this non-negotiable because:

  • you will upgrade models (or weights)
  • you will change prompts and tools
  • you will quantize, batch, cache, and optimize
  • you will ship “one small change” that shifts behavior unexpectedly

So: you need an evaluation harness that acts like CI.

The minimum viable evaluation harness

Your harness needs to answer three questions continuously:

  1. Does it still meet the contract? (correctness + format + safety)
  2. Is it within budget? (tokens, latency, GPU time, fallbacks)
  3. Is it stable? (variance across runs, regressions, long-tail failures)
The best evaluation mindset is boring:

Treat LLM output like an API response that can be wrong in creative ways.

What you should measure (beyond accuracy)

Contract adherence

Schema validity, required fields, refusal behavior, tool call correctness.

Truth risk

Hallucination rate on grounded tasks, citation behavior, “confident wrongness.”

Latency and throughput

p50 / p95 / p99, queue delay, batch efficiency, cold start frequency.

Cost and capacity

GPU seconds, memory headroom, context length usage, cache hit rate.

Build a “golden set” that represents your product

A golden set is not a benchmark leaderboard.

It’s a curated set of real tasks that represent your actual failure modes:

  • messy user inputs
  • edge cases
  • adversarial prompts
  • incomplete context
  • tool failures
  • ambiguous instructions
  • “looks correct but wrong” traps
A good golden set is small (hundreds to a few thousand cases), but maintained. It evolves with your product the same way test suites evolve with APIs.

Licensing is architecture (because it constrains your product)

This is the part engineers love to skip.

But licensing shapes your architecture as much as:

  • a database license
  • a cloud provider contract
  • an open-source dependency
  • a data residency requirement

Open weights come with licenses that vary widely. Some are permissive. Some are conditional. Some are commercial with restrictions.

And the key operational fact is:

Your model is now a dependency with legal constraints.

I’m not a lawyer. Treat this section as an engineering checklist for what to ask and verify — not legal advice.

The practical licensing questions you must answer

Treat the license like a runtime policy

In 2022 we treated “compatibility” as a discipline.

Do the same here:

  • put license review in your release checklist
  • version your model artifacts and their license metadata
  • keep an internal “model bill of materials” (MBOM)
  • make upgrades a change-management event, not a casual swap

Guardrails: you are now the safety team

With vendor APIs, you inherit a safety layer you didn’t build.

With open weights, you have to build it.

The way I frame this is simple:

Your model is a powerful worker. Your system is the supervisor.

That supervisor needs three kinds of controls:

  1. Input guardrails (stop obvious nonsense early)
  2. Output guardrails (validate contracts, filter unsafe content)
  3. Action guardrails (tool use with permissions, budgets, and sandboxing)

The “untrusted everything” rule

In January we treated the model as probabilistic. In February we treated documents as untrusted input. In March we treated context assembly as a subsystem.

Now we apply the untrusted rule everywhere:

  • user input is untrusted
  • retrieved text is untrusted
  • tool outputs are untrusted
  • and model output is untrusted

This isn’t cynicism. It’s how production systems survive.


A reference architecture for open-weight LLMs

Here’s the simplest shape I’ve seen work reliably:

Gateway

Auth, rate limits, customer tiering, request shaping.

Policy + budgets

Which model, max tokens, tool permissions, fallback rules.

Context assembly

Retrieval, summaries, state, memory policy, token accounting.

Validation layer

Schema validation, safety filters, refusal checks, tool call verification.

Under that sits the runtime:

  • inference server(s)
  • caching and batching
  • GPU scheduling
  • observability and tracing
  • canary and rollback mechanisms

You do not need to start “enterprise.” You do need the boundaries.


Guardrails that actually work in practice

1) Enforce structured outputs (and reject non-compliance)

If your downstream expects JSON, enforce it.

  • use JSON schema / typed decoding where possible
  • reject invalid responses and retry with a smaller “repair” prompt
  • log invalid outputs as evaluation failures
A model that “usually returns JSON” is not a contract.

A validator is a contract.

2) Separate generation from decision

If an answer can cause harm (money, permissions, irreversible actions), split it:

  • model proposes
  • system verifies
  • system executes

This is the “humans approve actions” pattern — but automated.

3) Tool use must be permissioned

Tools are not “capabilities.” Tools are attack surface.

  • allowlist tools per route/customer tier
  • enforce argument schemas
  • sandbox execution (timeouts, network egress controls)
  • record tool calls as audit events

4) Build a refusal policy you can explain

When open weights fail, they fail oddly. So your product needs a consistent refusal and escalation story:

  • what you refuse
  • how you explain it
  • what users can do next
  • how you log it for review

The real operational costs (and how to control them)

Self-hosting open weights makes three costs visible immediately:

  • GPU memory and throughput
  • long context amplification
  • variance (p99 latency and outlier behavior)

Cost controls that don’t ruin quality

Context discipline

Summarize aggressively, retrieve narrowly, cap tokens by route.

Caching

Cache embeddings, retrieved chunks, and deterministic tool results.

Batching

Batch short requests, separate queues by latency tier.

Tiered models

Cheap default, expensive escalation for hard cases.

The biggest cost mistake is treating every request like a “high reasoning” request.

Most product traffic is routine. Design for that.


A go-live checklist you can actually use

Define contracts per endpoint

  • expected output format (schema)
  • allowed tools and arguments
  • refusal behavior
  • max cost/latency budgets

Build the golden set + regression harness

  • representative user tasks
  • adversarial cases
  • tool failure cases
  • scorecards for contract adherence and safety

Instrument the runtime

  • request traces end-to-end
  • token accounting and context composition logs
  • GPU utilization and queue depth
  • p95/p99 latency per route

Ship with canaries + rollback

  • shadow traffic on the new model
  • small percent rollout
  • automatic rollback on metric regression

Establish a model change policy

  • who approves upgrades
  • how you version artifacts
  • how you record “which model answered what” for audits

Resources

EleutherAI — lm-evaluation-harness (golden sets as CI)

A practical evaluation harness you can wire into release gates: regressions, variance checks, and “does it still meet the contract?”

Stanford CRFM — HELM (holistic, reproducible evaluation)

A framework for evaluating across scenarios and metrics—useful when “accuracy” isn’t the whole product (safety, robustness, bias, etc.).

OWASP Top 10 for LLM Applications (threat model)

A practical map of failure modes (prompt injection, insecure output handling, DoS, data leakage) that directly informs guardrails and tool gating.

OpenChain (ISO/IEC 5230) — License compliance program basics

A lightweight process checklist for “licensing is architecture”: roles, review gates, records, and upgrade discipline.


FAQ


What’s Next

April reframed model selection as architecture.

May added the missing constraint:

Open weights are only “cheaper and safer” if you can evaluate and operate them.

Next month the surface area expands again:

multimodality.

Because the moment your system sees images, hears audio, or speaks back…

UX changes.

And your guardrails and budgets change with it.

Multimodal Changes UX: designing text+vision+audio systems

Axel Domingues - 2026