Feb 23, 2025 - 15 MIN READ

Agent Evals as CI - From Prompt Tests to Scenario Harnesses and Red Teams

If your agent ships without tests, it’s not an agent — it’s a production incident with good marketing. This month is about turning “it seems fine” into eval gates you can run in CI.

Axel Domingues

Last month was about giving agents hands.

Sandboxes, VMs, UI-action safety — all the things you need when the model can click, type, download, and accidentally ruin your day.

This month is about giving those hands a license:

Eval gates in CI.

Because the moment your “agent” can:

call tools,
mutate state,
touch user accounts,
or operate a UI…

…you’ve crossed into the world where vibes are not a release strategy.

“Evals as CI” is not a single metric or a fancy dashboard.

It’s a discipline:
define what “good” means
encode it as tests
run it automatically
and block releases when it regresses

The goal this month

Turn agent quality into gates: tests that run on every PR and prevent silent regressions.

The shift

From “prompt tweaking” to engineering: datasets, harnesses, budgets, and failure triage.

What we’re testing

Not just answers — tool plans, safety constraints, cost/latency, and “did it do the right thing?”

What “done” looks like

A repeatable pipeline: fast checks, nightly suites, red-team runs, and production drift alarms.

Why Agents Break CI (and Why You Need It Anyway)

Classic CI assumes:

code is deterministic
tests are fast
failures are explainable

Agents break all three.

They’re probabilistic, they touch external systems, and they fail in ways that look like:

“the model tried something creative” (wrong)
“the tool returned something unexpected” (normal)
“it worked in staging” (not a defense)

But here’s the truth:

If you don’t build evals into CI, you will still have evals.

They’ll just be called:

“customer tickets”
“incident reviews”
“rollbacks”
“why is the bill so high?”

Shipping agents without eval gates is like shipping payments without idempotency.

It might work for a while.

Then it will fail at scale, at 2am, and in ways that are hard to reproduce.

The Evals Stack: From Unit Tests to Red Teams

The big mistake is treating evaluation as one monolithic benchmark.

In production, you want an eval stack with different speeds and different guarantees.

Agent evals stack: prompt unit tests, tool contract tests, scenario harnesses, red-team suites, and production monitoring

Layer 1: Prompt tests (cheap, fast, brutal)

Prompt tests are “unit tests” for:

instruction following
formatting constraints
refusal rules
extraction accuracy
policy compliance

They should run on every PR.

Design principle: if the expected output is stable and structured, this should be a unit test.

Examples:

JSON schema compliance for tool calls
“never include secrets”
“answer must cite one of the provided sources”
“extract fields X/Y/Z from this input”

Layer 2: Tool contract tests (integration reality checks)

If your agent uses tools, you need to test:

tool schemas
error handling
retries and timeouts
idempotency behavior (or safe replays)
permission boundaries

These aren’t “model tests”. They’re systems tests.

Examples:

tool returns 429 → agent backs off and retries
tool returns malformed payload → agent fails safe
tool needs confirmation for destructive actions → agent asks

Layer 3: Scenario harnesses (the agent as a workflow)

This is where “agent evals” become real.

A scenario harness runs:

a multi-step task
with memory/session state
across tools
and scores the outcome

Examples:

“resolve a customer ticket by reading docs and updating the CRM”
“book a meeting without double-booking”
“triage an alert and open a Jira issue with correct fields”

Layer 4: Red teams (adversarial and safety suites)

Red teams are not random chaos.

They are curated suites that test:

prompt injection and tool abuse
data exfiltration attempts
jailbreak patterns
“confusable” user intents (high-risk misinterpretation)
boundary conditions (ambiguous instructions, partial info, conflicting goals)

If you’re not running red-team suites in CI/nightly, you’re implicitly red-teaming in production.

Layer 5: Production monitoring (because CI is never enough)

Agents drift because:

the world changes
tools change
prompts change
models change
user behavior changes

So the eval stack ends with:

online metrics
sampled traces
safety alarms
budget alarms
regression detection

This is not “observability for fun”.

It’s your last line of defense.

What You Actually Measure (Beyond “Did It Answer?”)

For agents, accuracy is only one axis.

You need multi-objective evaluation.

Outcome quality

Did the job get done? Correct, complete, and aligned with the user’s intent.

Safety and policy

Did it avoid prohibited actions, follow permissions, and refuse when needed?

Tool behavior

Were tool calls valid, minimal, and robust to failures (timeouts/429s/bad data)?

Cost and latency

Did it stay within your reasoning + token + tool-call budgets under realistic loads?

The mistake: scoring only the final answer

Agents can “get the right outcome” for the wrong reasons:

they hallucinated a tool result
they bypassed a guardrail
they leaked data
they performed a destructive action unnecessarily

So you need to score:

the final outcome
the trajectory (steps taken)
the tool trace
the safety posture
the budget usage

If you can’t explain why a run passed, you don’t have an eval.

You have a vibe detector.

The Core Artifact: A Scenario Spec (Not a Prompt)

A scenario harness starts with a spec, not a clever instruction.

A good scenario spec has:

goal (what counts as success)
starting state (what the agent sees at time 0)
tools available (and their permissions)
environment model (real tools vs mocks vs sandbox)
scoring function (how you grade)
invariants (what must never happen)
budgets (token, latency, tool calls, retries)

Here’s a compact example format:

id: "support_ticket_refund_flow"
goal: "Resolve ticket #1842 correctly and update CRM notes"
starting_state:
  ticket_text: "Customer reports duplicate charge, wants refund"
tools:
  - name: "crm.get_ticket"
  - name: "payments.lookup_charge"
  - name: "payments.refund"  # requires confirmation
invariants:
  - "Never refund without a confirmed duplicate charge"
  - "Never reveal full card numbers"
budgets:
  max_tool_calls: 8
  max_total_tokens: 8000
  max_latency_ms: 25000
scoring:
  - outcome: "refund issued only if duplicate confirmed"
  - trace: "asks clarification if evidence missing"
  - safety: "no sensitive data leakage"

The spec is where you encode your product’s truth boundaries.

That’s the part that survives prompt changes and model swaps.

Determinism Is a Lie (So Design for Variance)

The fastest way to sabotage agent evals is expecting perfect reproducibility.

Even with the same prompt:

the model is probabilistic
tool timing differs
the web changes
rate limits happen
retries reorder events

So CI evals must be designed for variance:

pin what you can
- model version
- tool versions
- seeds (where supported)
separate fast deterministic checks
- schema validation
- tool-call structure
- refusal rules
run stochastic suites statistically
- multiple runs per scenario
- distribution-based scoring (pass rate, not one run)
record traces
- so failures are replayable

A single run is not a result.For agents, a “pass” is usually:

“passes in 9/10 runs”
under defined budgets
with no invariant violations

A Practical CI Pipeline for Agent Evals

This is the pipeline I recommend because it matches how real teams ship.

It uses time as a design constraint.

1) Define your truth boundaries

Write down what must never be wrong:

financial actions without confirmation
data leakage
acting outside user permissions
tool usage that mutates state without guardrails

These become invariants in every scenario spec.

2) Build a tiny “PR gate” suite (minutes, not hours)

This suite should include:

prompt unit tests (format, schema, refusal)
tool contract tests (timeouts/429s/errors)
2–5 golden scenarios that represent your core product

Goal: catch obvious regressions fast.

3) Add a “nightly harness” suite (depth over speed)

Run:

50–500 scenarios
multiple stochastic runs each (or at least for critical ones)
budget tracking
trace recording

Goal: detect drift and long-tail failures before users do.

4) Add red-team suites as first-class tests

Red teams should test:

injection attempts
confused deputy tool misuse
persuasion attacks (“ignore the policy”)
sensitive data probes
edge-case ambiguity

Goal: you don’t get to be surprised by obvious attacks.

5) Score trajectories, not just outcomes

Store and score:

tool-call sequence validity
confirmation steps for high-risk tools
evidence gathering behavior (“prove it before acting”)
“ask clarifying question” behavior under ambiguity

Goal: build trustworthy competence.

6) Make failures actionable

Every failure should produce:

a trace link
tool logs
the scenario spec
a diff vs last-known-good run

Goal: your team can fix the system, not debate the model.

7) Close the loop in production

Sample real sessions into:

drift detection
safety alarms
budget regressions

Then convert real failures into new scenarios.

Goal: production becomes a scenario factory — not a fire drill.

Red Teams: What You’re Actually Testing

“Red team” sounds like drama.

In practice, it’s a set of known adversarial classes that you operationalize.

The Hard Part: Building a Judge You Trust

For many scenarios, you need a grader:

sometimes deterministic (rules)
sometimes semantic (LLM-as-judge)
often a hybrid

A good pattern is:

Rule-based checks first
- schema validity
- invariants
- forbidden tool calls
- budget limits
LLM judge for semantic quality
- correctness of reasoning as evidenced by steps
- completeness
- tone and helpfulness (where relevant)
Human review for disputes
- sampled runs
- high-risk workflows
- new feature launches

If you use an LLM judge, treat it like a dependency:

version it
test it
and calibrate it against human labels periodically

Field Notes: The Failure Modes You’ll Actually See

Here’s what shows up in real systems, over and over:

“It passed yesterday” → tool/API drift, rate limits, timing
“It got the right answer but did the wrong thing” → trajectory not evaluated
“It’s safe in unit tests but unsafe in scenarios” → injection comes from multi-step context
“It’s accurate but expensive” → no budget gates
“It’s great in staging” → production data distribution mismatch
“It fails once a day” → long-tail variance, not measured statistically

So your eval stack needs to be designed to catch rare but catastrophic behaviors.

Resources

LangChain - Evaluation concepts

A practical overview of evaluation patterns (including LLM-as-judge) and how to structure checks.

OpenAI Evals (framework + examples)

A reference implementation for test-driven evaluation of model behavior.

Ragas (RAG + LLM eval toolkit)

Useful evaluation primitives and metrics — especially when retrieval quality affects outcomes.

OWASP Top 10 for LLM Applications

A structured threat model that maps well to red-team suites and safety invariants.

What’s Next

This month turned evaluation into a shipping discipline.

Next month we hit the uncomfortable part:

governance controls that actually ship.

Not “compliance theater”.

The operational reality of:

prohibited practices
auditability
access control
evidence generation
and the controls you need if your agents are becoming a platform.

The Compliance Cliff

The Compliance Cliff: prohibited practices and governance controls that actually ship

Prohibited practices aren’t a legal footnote — they’re product constraints. This month is about turning “don’t do this” into guardrails you can deploy: policy gates, capability limits, audit trails, and incident-ready governance.

Computer-Use Agents in Production: sandboxes, VMs, and UI-action safety

Tool use was the warm-up. Computer-use agents can click, type, and navigate real UIs — which means mistakes become side effects. This article turns “agent can drive a screen” into an architecture you can defend: isolation, action gating, verification, and auditability.