
If your agent ships without tests, it’s not an agent — it’s a production incident with good marketing. This month is about turning “it seems fine” into eval gates you can run in CI.
Axel Domingues
Last month was about giving agents hands.
Sandboxes, VMs, UI-action safety — all the things you need when the model can click, type, download, and accidentally ruin your day.
This month is about giving those hands a license:
Eval gates in CI.
Because the moment your “agent” can:
…you’ve crossed into the world where vibes are not a release strategy.
It’s a discipline:
- define what “good” means
- encode it as tests
- run it automatically
- and block releases when it regresses
The goal this month
Turn agent quality into gates: tests that run on every PR and prevent silent regressions.
The shift
From “prompt tweaking” to engineering: datasets, harnesses, budgets, and failure triage.
What we’re testing
Not just answers — tool plans, safety constraints, cost/latency, and “did it do the right thing?”
What “done” looks like
A repeatable pipeline: fast checks, nightly suites, red-team runs, and production drift alarms.
Classic CI assumes:
Agents break all three.
They’re probabilistic, they touch external systems, and they fail in ways that look like:
But here’s the truth:
If you don’t build evals into CI, you will still have evals.
They’ll just be called:
Then it will fail at scale, at 2am, and in ways that are hard to reproduce.It might work for a while.
The big mistake is treating evaluation as one monolithic benchmark.
In production, you want an eval stack with different speeds and different guarantees.

Prompt tests are “unit tests” for:
They should run on every PR.
Design principle: if the expected output is stable and structured, this should be a unit test.
Examples:
If your agent uses tools, you need to test:
These aren’t “model tests”. They’re systems tests.
Examples:
This is where “agent evals” become real.
A scenario harness runs:
Examples:
Red teams are not random chaos.
They are curated suites that test:
If you’re not running red-team suites in CI/nightly, you’re implicitly red-teaming in production.
Agents drift because:
So the eval stack ends with:
This is not “observability for fun”.
It’s your last line of defense.
For agents, accuracy is only one axis.
You need multi-objective evaluation.
Outcome quality
Did the job get done? Correct, complete, and aligned with the user’s intent.
Safety and policy
Did it avoid prohibited actions, follow permissions, and refuse when needed?
Tool behavior
Were tool calls valid, minimal, and robust to failures (timeouts/429s/bad data)?
Cost and latency
Did it stay within your reasoning + token + tool-call budgets under realistic loads?
Agents can “get the right outcome” for the wrong reasons:
So you need to score:
You have a vibe detector.
A scenario harness starts with a spec, not a clever instruction.
A good scenario spec has:
Here’s a compact example format:
id: "support_ticket_refund_flow"
goal: "Resolve ticket #1842 correctly and update CRM notes"
starting_state:
ticket_text: "Customer reports duplicate charge, wants refund"
tools:
- name: "crm.get_ticket"
- name: "payments.lookup_charge"
- name: "payments.refund" # requires confirmation
invariants:
- "Never refund without a confirmed duplicate charge"
- "Never reveal full card numbers"
budgets:
max_tool_calls: 8
max_total_tokens: 8000
max_latency_ms: 25000
scoring:
- outcome: "refund issued only if duplicate confirmed"
- trace: "asks clarification if evidence missing"
- safety: "no sensitive data leakage"
The spec is where you encode your product’s truth boundaries.
That’s the part that survives prompt changes and model swaps.
The fastest way to sabotage agent evals is expecting perfect reproducibility.
Even with the same prompt:
So CI evals must be designed for variance:
This is the pipeline I recommend because it matches how real teams ship.
It uses time as a design constraint.
Write down what must never be wrong:
These become invariants in every scenario spec.
This suite should include:
Goal: catch obvious regressions fast.
Run:
Goal: detect drift and long-tail failures before users do.
Red teams should test:
Goal: you don’t get to be surprised by obvious attacks.
Store and score:
Goal: build trustworthy competence.
Every failure should produce:
Goal: your team can fix the system, not debate the model.
Sample real sessions into:
Then convert real failures into new scenarios.
Goal: production becomes a scenario factory — not a fire drill.
“Red team” sounds like drama.
In practice, it’s a set of known adversarial classes that you operationalize.
Test whether the agent can be manipulated into:
This is where the “computer-use” work from January matters: untrusted content should not get direct control of tools.
The agent has authority the user doesn’t.
Test that it does not:
Test whether the agent:
Test that the agent:
For many scenarios, you need a grader:
A good pattern is:
Here’s what shows up in real systems, over and over:
So your eval stack needs to be designed to catch rare but catastrophic behaviors.
LangChain - Evaluation concepts
A practical overview of evaluation patterns (including LLM-as-judge) and how to structure checks.
OpenAI Evals (framework + examples)
A reference implementation for test-driven evaluation of model behavior.
Yes — maybe even more.
Hosted models change. Tools change. Prompts change. Your product changes.
CI is your way of detecting regressions before customers do.
For schema/unit tests: yes (usually).
For scenarios: don’t rely on it.
You want to know how the system behaves under realistic stochasticity — and you want pass rates, not one perfect run.
A practical minimum:
Treat scenarios like code:
The goal is not quantity.
The goal is coverage of your highest-risk workflows.
This month turned evaluation into a shipping discipline.
Next month we hit the uncomfortable part:
governance controls that actually ship.
Not “compliance theater”.
The operational reality of:
The Compliance Cliff: prohibited practices and governance controls that actually ship
The Compliance Cliff: prohibited practices and governance controls that actually ship
Prohibited practices aren’t a legal footnote — they’re product constraints. This month is about turning “don’t do this” into guardrails you can deploy: policy gates, capability limits, audit trails, and incident-ready governance.
Computer-Use Agents in Production: sandboxes, VMs, and UI-action safety
Tool use was the warm-up. Computer-use agents can click, type, and navigate real UIs — which means mistakes become side effects. This article turns “agent can drive a screen” into an architecture you can defend: isolation, action gating, verification, and auditability.