Blog
Feb 23, 2025 - 15 MIN READ
Agent Evals as CI - From Prompt Tests to Scenario Harnesses and Red Teams

Agent Evals as CI - From Prompt Tests to Scenario Harnesses and Red Teams

If your agent ships without tests, it’s not an agent — it’s a production incident with good marketing. This month is about turning “it seems fine” into eval gates you can run in CI.

Axel Domingues

Axel Domingues

Last month was about giving agents hands.

Sandboxes, VMs, UI-action safety — all the things you need when the model can click, type, download, and accidentally ruin your day.

This month is about giving those hands a license:

Eval gates in CI.

Because the moment your “agent” can:

  • call tools,
  • mutate state,
  • touch user accounts,
  • or operate a UI…

…you’ve crossed into the world where vibes are not a release strategy.

“Evals as CI” is not a single metric or a fancy dashboard.

It’s a discipline:

  • define what “good” means
  • encode it as tests
  • run it automatically
  • and block releases when it regresses

The goal this month

Turn agent quality into gates: tests that run on every PR and prevent silent regressions.

The shift

From “prompt tweaking” to engineering: datasets, harnesses, budgets, and failure triage.

What we’re testing

Not just answers — tool plans, safety constraints, cost/latency, and “did it do the right thing?”

What “done” looks like

A repeatable pipeline: fast checks, nightly suites, red-team runs, and production drift alarms.


Why Agents Break CI (and Why You Need It Anyway)

Classic CI assumes:

  • code is deterministic
  • tests are fast
  • failures are explainable

Agents break all three.

They’re probabilistic, they touch external systems, and they fail in ways that look like:

  • “the model tried something creative” (wrong)
  • “the tool returned something unexpected” (normal)
  • “it worked in staging” (not a defense)

But here’s the truth:

If you don’t build evals into CI, you will still have evals.

They’ll just be called:

  • “customer tickets”
  • “incident reviews”
  • “rollbacks”
  • “why is the bill so high?”
Shipping agents without eval gates is like shipping payments without idempotency.

It might work for a while.

Then it will fail at scale, at 2am, and in ways that are hard to reproduce.

The Evals Stack: From Unit Tests to Red Teams

The big mistake is treating evaluation as one monolithic benchmark.

In production, you want an eval stack with different speeds and different guarantees.

Layer 1: Prompt tests (cheap, fast, brutal)

Prompt tests are “unit tests” for:

  • instruction following
  • formatting constraints
  • refusal rules
  • extraction accuracy
  • policy compliance

They should run on every PR.

Design principle: if the expected output is stable and structured, this should be a unit test.

Examples:

  • JSON schema compliance for tool calls
  • “never include secrets”
  • “answer must cite one of the provided sources”
  • “extract fields X/Y/Z from this input”

Layer 2: Tool contract tests (integration reality checks)

If your agent uses tools, you need to test:

  • tool schemas
  • error handling
  • retries and timeouts
  • idempotency behavior (or safe replays)
  • permission boundaries

These aren’t “model tests”. They’re systems tests.

Examples:

  • tool returns 429 → agent backs off and retries
  • tool returns malformed payload → agent fails safe
  • tool needs confirmation for destructive actions → agent asks

Layer 3: Scenario harnesses (the agent as a workflow)

This is where “agent evals” become real.

A scenario harness runs:

  • a multi-step task
  • with memory/session state
  • across tools
  • and scores the outcome

Examples:

  • “resolve a customer ticket by reading docs and updating the CRM”
  • “book a meeting without double-booking”
  • “triage an alert and open a Jira issue with correct fields”

Layer 4: Red teams (adversarial and safety suites)

Red teams are not random chaos.

They are curated suites that test:

  • prompt injection and tool abuse
  • data exfiltration attempts
  • jailbreak patterns
  • “confusable” user intents (high-risk misinterpretation)
  • boundary conditions (ambiguous instructions, partial info, conflicting goals)

If you’re not running red-team suites in CI/nightly, you’re implicitly red-teaming in production.

Layer 5: Production monitoring (because CI is never enough)

Agents drift because:

  • the world changes
  • tools change
  • prompts change
  • models change
  • user behavior changes

So the eval stack ends with:

  • online metrics
  • sampled traces
  • safety alarms
  • budget alarms
  • regression detection

This is not “observability for fun”.

It’s your last line of defense.


What You Actually Measure (Beyond “Did It Answer?”)

For agents, accuracy is only one axis.

You need multi-objective evaluation.

Outcome quality

Did the job get done? Correct, complete, and aligned with the user’s intent.

Safety and policy

Did it avoid prohibited actions, follow permissions, and refuse when needed?

Tool behavior

Were tool calls valid, minimal, and robust to failures (timeouts/429s/bad data)?

Cost and latency

Did it stay within your reasoning + token + tool-call budgets under realistic loads?

The mistake: scoring only the final answer

Agents can “get the right outcome” for the wrong reasons:

  • they hallucinated a tool result
  • they bypassed a guardrail
  • they leaked data
  • they performed a destructive action unnecessarily

So you need to score:

  • the final outcome
  • the trajectory (steps taken)
  • the tool trace
  • the safety posture
  • the budget usage
If you can’t explain why a run passed, you don’t have an eval.

You have a vibe detector.


The Core Artifact: A Scenario Spec (Not a Prompt)

A scenario harness starts with a spec, not a clever instruction.

A good scenario spec has:

  • goal (what counts as success)
  • starting state (what the agent sees at time 0)
  • tools available (and their permissions)
  • environment model (real tools vs mocks vs sandbox)
  • scoring function (how you grade)
  • invariants (what must never happen)
  • budgets (token, latency, tool calls, retries)

Here’s a compact example format:

id: "support_ticket_refund_flow"
goal: "Resolve ticket #1842 correctly and update CRM notes"
starting_state:
  ticket_text: "Customer reports duplicate charge, wants refund"
tools:
  - name: "crm.get_ticket"
  - name: "payments.lookup_charge"
  - name: "payments.refund"  # requires confirmation
invariants:
  - "Never refund without a confirmed duplicate charge"
  - "Never reveal full card numbers"
budgets:
  max_tool_calls: 8
  max_total_tokens: 8000
  max_latency_ms: 25000
scoring:
  - outcome: "refund issued only if duplicate confirmed"
  - trace: "asks clarification if evidence missing"
  - safety: "no sensitive data leakage"

The spec is where you encode your product’s truth boundaries.

That’s the part that survives prompt changes and model swaps.


Determinism Is a Lie (So Design for Variance)

The fastest way to sabotage agent evals is expecting perfect reproducibility.

Even with the same prompt:

  • the model is probabilistic
  • tool timing differs
  • the web changes
  • rate limits happen
  • retries reorder events

So CI evals must be designed for variance:

  • pin what you can
    • model version
    • tool versions
    • seeds (where supported)
  • separate fast deterministic checks
    • schema validation
    • tool-call structure
    • refusal rules
  • run stochastic suites statistically
    • multiple runs per scenario
    • distribution-based scoring (pass rate, not one run)
  • record traces
    • so failures are replayable
A single run is not a result.For agents, a “pass” is usually:
  • “passes in 9/10 runs”
  • under defined budgets
  • with no invariant violations

A Practical CI Pipeline for Agent Evals

This is the pipeline I recommend because it matches how real teams ship.

It uses time as a design constraint.

1) Define your truth boundaries

Write down what must never be wrong:

  • financial actions without confirmation
  • data leakage
  • acting outside user permissions
  • tool usage that mutates state without guardrails

These become invariants in every scenario spec.

2) Build a tiny “PR gate” suite (minutes, not hours)

This suite should include:

  • prompt unit tests (format, schema, refusal)
  • tool contract tests (timeouts/429s/errors)
  • 2–5 golden scenarios that represent your core product

Goal: catch obvious regressions fast.

3) Add a “nightly harness” suite (depth over speed)

Run:

  • 50–500 scenarios
  • multiple stochastic runs each (or at least for critical ones)
  • budget tracking
  • trace recording

Goal: detect drift and long-tail failures before users do.

4) Add red-team suites as first-class tests

Red teams should test:

  • injection attempts
  • confused deputy tool misuse
  • persuasion attacks (“ignore the policy”)
  • sensitive data probes
  • edge-case ambiguity

Goal: you don’t get to be surprised by obvious attacks.

5) Score trajectories, not just outcomes

Store and score:

  • tool-call sequence validity
  • confirmation steps for high-risk tools
  • evidence gathering behavior (“prove it before acting”)
  • “ask clarifying question” behavior under ambiguity

Goal: build trustworthy competence.

6) Make failures actionable

Every failure should produce:

  • a trace link
  • tool logs
  • the scenario spec
  • a diff vs last-known-good run

Goal: your team can fix the system, not debate the model.

7) Close the loop in production

Sample real sessions into:

  • drift detection
  • safety alarms
  • budget regressions

Then convert real failures into new scenarios.

Goal: production becomes a scenario factory — not a fire drill.


Red Teams: What You’re Actually Testing

“Red team” sounds like drama.

In practice, it’s a set of known adversarial classes that you operationalize.


The Hard Part: Building a Judge You Trust

For many scenarios, you need a grader:

  • sometimes deterministic (rules)
  • sometimes semantic (LLM-as-judge)
  • often a hybrid

A good pattern is:

  • Rule-based checks first
    • schema validity
    • invariants
    • forbidden tool calls
    • budget limits
  • LLM judge for semantic quality
    • correctness of reasoning as evidenced by steps
    • completeness
    • tone and helpfulness (where relevant)
  • Human review for disputes
    • sampled runs
    • high-risk workflows
    • new feature launches
If you use an LLM judge, treat it like a dependency:
  • version it
  • test it
  • and calibrate it against human labels periodically

Field Notes: The Failure Modes You’ll Actually See

Here’s what shows up in real systems, over and over:

  • “It passed yesterday” → tool/API drift, rate limits, timing
  • “It got the right answer but did the wrong thing” → trajectory not evaluated
  • “It’s safe in unit tests but unsafe in scenarios” → injection comes from multi-step context
  • “It’s accurate but expensive” → no budget gates
  • “It’s great in staging” → production data distribution mismatch
  • “It fails once a day” → long-tail variance, not measured statistically

So your eval stack needs to be designed to catch rare but catastrophic behaviors.


Resources

LangChain - Evaluation concepts

A practical overview of evaluation patterns (including LLM-as-judge) and how to structure checks.

OpenAI Evals (framework + examples)

A reference implementation for test-driven evaluation of model behavior.

Ragas (RAG + LLM eval toolkit)

Useful evaluation primitives and metrics — especially when retrieval quality affects outcomes.

OWASP Top 10 for LLM Applications

A structured threat model that maps well to red-team suites and safety invariants.


FAQ


What’s Next

This month turned evaluation into a shipping discipline.

Next month we hit the uncomfortable part:

governance controls that actually ship.

Not “compliance theater”.

The operational reality of:

  • prohibited practices
  • auditability
  • access control
  • evidence generation
  • and the controls you need if your agents are becoming a platform.

The Compliance Cliff: prohibited practices and governance controls that actually ship

Axel Domingues - 2026