Dec 28, 2025 - 20 MIN READ
Reference Architecture v2: the Operable Agent Platform

This is the 2025 finale: a practical reference architecture for running fleets of agents with governance—connectors you can trust, traces you can debug, evals you can ship, and humans you can hand off to.

Axel Domingues

All year, the story of agents changed.

In January, an “agent” still meant: a model with tools.

By December, an “agentic product” means something else entirely:

a platform that runs many agents, with many connectors, across many teams — without turning into chaos.

This article is the synthesis.

Why “v2”?

I wrote Reference Architecture v1 as the finale of the 2022 “operational architecture” year: a defendable baseline for a modern cloud product (APIs, distributed data, async workflows, CI/CD, observability, security, and governance). v2 is that same spine — plus the agent-specific control plane: orchestration, connector/tool governance, evaluation in CI, and the reliability guardrails needed to run fleets of probabilistic workers.

If you’re jumping in here, treat v1 as the “non-AI platform” foundation, and v2 as the upgrade that makes it agent-native.

Reference Architecture v2 describes an operable agent platform, one that gives you:

  • predictable behavior (as much as probabilistic components allow)
  • safe tool use (least privilege + injection resistance)
  • traceable execution (debuggable end-to-end)
  • measurable quality (evals in CI, not vibes)
  • governable change (versioning, approvals, auditability)
  • and a clean handoff to humans when the machine hits ambiguity

“Reference architecture” doesn’t mean “one true stack.”

It means you can point to a diagram, name the invariants, and defend your decisions under pressure: security reviews, incident postmortems, cost spikes, compliance asks, and product deadlines.

  • The goal: run fleets of agents safely, cheaply, and repeatably.
  • The shift: agents stop being a feature and become a runtime.
  • The constraint: connectors turn into production attack surface.
  • The definition of done: you can operate it (observe, debug, roll back, and audit).


What “Operable” Means (In Practice)

“Operable” is not a buzzword. It’s a list of things you can do at 3AM when something breaks.

An operable agent platform can answer these questions quickly:

  • What happened? (trace + timeline)
  • Why did it happen? (inputs + tool calls + policy decisions)
  • What was the blast radius? (tenant, connector, model, workflow)
  • How do we stop it now? (kill switch, circuit breaker, rollback)
  • How do we prevent recurrence? (policy update, connector fix, eval gate)

If you can’t do that, you don’t have an agent platform.

You have a demo that will eventually become an incident.


The Architecture at 10,000 Feet: Data Plane vs Control Plane

The cleanest mental model I’ve found is the same one we use for infrastructure platforms:

  • Data plane: runs the work (agent runs, tool calls, voice turns, workflows)
  • Control plane: decides what is allowed (policies, versions, approvals, routing, audit requirements)

The mistake teams make is mixing these.

If your data plane “decides” policy at runtime with no versioning, approvals, or audit — you can’t govern change. And if your control plane tries to run the workload, you can’t scale execution cleanly.

So RA v2 keeps them separate.


RA v2 Components

Here’s the platform decomposed into components you can actually build and assign teams to.

1) Experiences (Chat, Voice, Automations, Embedded UI)

This is the surface area your users touch:

  • web chat
  • mobile chat
  • voice calls
  • proactive workflows (“run this plan nightly”)
  • embedded copilots inside products

Rule: Experiences should be thin. They collect context, render outputs, and host a clean human handoff path. They should not contain business logic for agent orchestration.

2) Agent Runtime (Session Engine)

The session engine is the heart of the data plane.

It owns:

  • conversation/session state
  • memory boundaries (what is retained vs not)
  • step execution (LLM call → tool call → evaluation → next step)
  • budgets (cost, tokens, tool call limits)
  • guardrails (policy enforcement points)

This is where multi-agent composition happens safely: supervisors, specialists, delegation, and structured handoffs.
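
A minimal sketch of that step loop, assuming a hypothetical `llm` callable that returns a proposed step as a dict; the names are illustrative, but the bounded loop and the budget charges before execution are the point:

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    """Raised when a run exhausts its step, token, or tool-call budget."""

@dataclass
class Budget:
    max_tokens: int = 50_000
    max_tool_calls: int = 10
    tokens_used: int = 0
    tool_calls_made: int = 0

    def charge_tokens(self, n: int) -> None:
        self.tokens_used += n
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exhausted ({self.tokens_used}/{self.max_tokens})")

    def charge_tool_call(self) -> None:
        self.tool_calls_made += 1
        if self.tool_calls_made > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")

def run_session(goal: str, llm, tools: dict, budget: Budget, max_steps: int = 20):
    """One run = a bounded loop of (model call -> optional tool call -> next step)."""
    transcript = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = llm(transcript)              # hypothetical: returns a proposed next step
        budget.charge_tokens(step["tokens"])
        if step["type"] == "final":
            return step["content"]          # the model says it's done; the loop decides to stop
        budget.charge_tool_call()
        tool = tools[step["tool"]]          # unknown tool name -> KeyError -> fail closed
        result = tool(**step["arguments"])
        transcript.append({"role": "tool", "name": step["tool"], "content": result})
    raise BudgetExceeded("step budget exhausted")
```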

3) Workflow Engine (Jobs + State Machines)

Agent work is not always interactive.

If your platform can’t:

  • run long workflows,
  • retry safely,
  • and survive partial failures,

…then it’s not a platform.

This engine should feel familiar if you’ve built distributed systems:

  • durable state machines
  • idempotency keys
  • retries with backoff
  • dead letter queues
  • explicit timeouts

(If you’ve read my earlier outbox/sagas work, you’ll recognize the patterns.)
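
A minimal sketch of one durable step under those patterns, assuming a hypothetical `store` with `get`/`put`: completed work is skipped via the idempotency key, failures retry with exponential backoff, and exhausted retries land in a dead-letter record:

```python
import time

def run_step(store, run_id: str, step_name: str, action, max_attempts: int = 4):
    """Execute one workflow step with at-most-once side effects per key."""
    idempotency_key = f"{run_id}:{step_name}"
    if (cached := store.get(idempotency_key)) is not None:
        return cached                                # already done on a prior attempt: skip
    for attempt in range(max_attempts):
        try:
            result = action()
            store.put(idempotency_key, result)       # record completion before moving on
            return result
        except Exception as exc:
            if attempt == max_attempts - 1:
                store.put(f"dlq:{idempotency_key}", repr(exc))  # dead-letter for operators
                raise
            time.sleep(2 ** attempt)                 # backoff: 1s, 2s, 4s, ...
```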

4) Connector Layer (Tools, APIs, MCP, RPA)

Tools are “just functions”… until they aren’t.

The connector layer provides:

  • stable tool schemas
  • auth handshakes (OAuth, service accounts, delegation)
  • policy-bound execution (least privilege)
  • rate limiting and circuit breakers
  • safe parsing and response validation
  • versioning and compatibility guarantees

This is also where MCP-style connector ecosystems start to matter: a standard protocol for tool discovery, schemas, and execution boundaries.
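
To make "stable tool schemas" concrete, here's a minimal sketch of a versioned tool wrapper that validates in both directions. The `ToolSpec` shape is illustrative, not a real MCP API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ToolSpec:
    """A versioned tool contract: stable ID, typed arguments, validated response."""
    tool_id: str
    version: str
    arg_types: dict[str, type]
    result_type: type
    fn: Callable[..., Any]

    def __call__(self, **kwargs: Any) -> Any:
        unknown = set(kwargs) - set(self.arg_types)
        if unknown:
            raise TypeError(f"{self.tool_id}@{self.version}: unknown arguments {unknown}")
        for name, expected in self.arg_types.items():
            if not isinstance(kwargs.get(name), expected):
                raise TypeError(f"{self.tool_id}: argument {name!r} must be {expected.__name__}")
        result = self.fn(**kwargs)
        if not isinstance(result, self.result_type):     # never trust connector output blindly
            raise TypeError(f"{self.tool_id}: connector returned {type(result).__name__}")
        return result

# Usage: wrap a raw function into a governed, versioned tool (stub backend for illustration).
lookup_invoice = ToolSpec(
    tool_id="billing.lookup_invoice",
    version="1.2.0",
    arg_types={"invoice_id": str},
    result_type=dict,
    fn=lambda invoice_id: {"invoice_id": invoice_id, "status": "paid"},
)
```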

5) Control Plane (Policy + Governance)

The control plane is where you make safety and compliance real, not aspirational.

It owns:

  • policy definitions and versioning
  • approvals (who can publish a connector? a prompt? a workflow template?)
  • environment promotion (dev → staging → prod)
  • audit logs and retention policy
  • model routing rules (latency/cost/quality constraints)
  • kill switches (tenant, connector, workflow, model)
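
A sketch of policy-as-data with deny-by-default and the kill switch built in; the fields and version string are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """A versioned, approved policy document the data plane reads but never edits."""
    version: str
    approved_by: str
    allowed_tools: frozenset
    requires_confirmation: frozenset      # tools that need a human "yes" before executing
    killed: frozenset = frozenset()       # kill switch: IDs disabled right now

    def check(self, tool_id: str) -> str:
        if tool_id in self.killed:
            return "deny"
        if tool_id not in self.allowed_tools:
            return "deny"                 # deny-by-default for anything unregistered
        if tool_id in self.requires_confirmation:
            return "confirm"
        return "allow"

prod_policy = Policy(
    version="2025-12-28.3",
    approved_by="security-review",
    allowed_tools=frozenset({"crm.read_contact", "billing.lookup_invoice"}),
    requires_confirmation=frozenset({"billing.lookup_invoice"}),
)
assert prod_policy.check("crm.delete_all") == "deny"   # never registered, so never allowed
```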

6) Observability Plane (Tracing + Metrics + Replay)

Agents are probabilistic — so you need observability that treats them as distributed systems.

At minimum:

  • traces spanning LLM calls + tool calls + policy decisions
  • structured logs (no “string soup”)
  • metrics for latency, error rate, tool failure, cost, and token burn
  • run replay (reconstruct what happened from stored artifacts)
  • evaluation results linked to real runs

If your logs don’t include the exact tool arguments and the exact policy version used at runtime, your postmortems will become “we think it did X” — which is another way of saying: you can’t debug it.
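
Here's a minimal sketch of what one tool-call event should carry. Field names are illustrative, and `print` stands in for your real log pipeline or OTel exporter:

```python
import json, time, uuid

def emit_tool_span(run_id: str, tool_id: str, arguments: dict,
                   policy_version: str, model: str, outcome: str,
                   latency_ms: float, cost_usd: float) -> None:
    """One structured event per tool call: enough to replay the decision later."""
    event = {
        "ts": time.time(),
        "span_id": uuid.uuid4().hex,
        "run_id": run_id,                   # joins this span to the whole agent run
        "tool_id": tool_id,
        "arguments": arguments,             # the exact args (redact secrets upstream)
        "policy_version": policy_version,   # which rules were in force at call time
        "model": model,
        "outcome": outcome,                 # "ok" | "denied" | "error" | "timeout"
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(event))
```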

The Non‑Negotiables: The Platform Invariants

Reference architectures are useless if they don’t specify what must never be violated.

Here are the invariants I treat as non-negotiable for an operable agent platform:

  • Identity & tenancy: every action is attributable (user, tenant, agent, connector, run).
  • Least privilege by default: tools run with minimal scope, time‑boxed credentials, and explicit grants.
  • Deterministic envelope: even if model output is probabilistic, the execution contract is deterministic.
  • Auditable decisions: policy version, model selection, tool choice, and overrides are recorded.

Under these invariants, “agents” become something you can safely operate:

  • You can prove who did what.
  • You can bound what they can do.
  • You can replay and explain incidents.
  • You can roll back safely.

The Deterministic Envelope: Where Probabilistic Meets Production

The trick to building reliable agent systems is not pretending the model is deterministic.

It’s building a deterministic envelope around it.

That envelope is a contract.

A good envelope includes:

  • Typed tool schemas (validated both directions)
  • State machine boundaries (what steps exist, what transitions are allowed)
  • Budgets (tokens, cost, time, tool calls)
  • Policy gates (what requires approval, escalation, or denial)
  • Human checkpoints (when ambiguity becomes risk)

Think of it like this:

The model is a creative proposal engine.
The platform is the execution authority.
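
In code, the envelope can be as small as a transition table that the platform enforces no matter what the model proposes (state names are illustrative):

```python
# Which states exist, and which transitions the platform will actually execute.
TRANSITIONS = {
    "collecting":  {"proposing"},
    "proposing":   {"executing", "needs_human"},
    "executing":   {"proposing", "done", "needs_human"},
    "needs_human": {"proposing", "done"},
    "done":        set(),
}

def apply(state: str, proposed_next: str) -> str:
    """The model proposes; the platform disposes. Illegal transitions fail closed."""
    if proposed_next not in TRANSITIONS[state]:
        return "needs_human"        # ambiguity becomes a human checkpoint, not a guess
    return proposed_next

assert apply("collecting", "executing") == "needs_human"   # can't skip the proposal step
```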


Three Reference Flows (Where Systems Usually Break)

Let’s make the architecture concrete with the three flows where production systems usually break: connector lifecycle governance, quality gating with evals, and human handoff.


Connector Ecosystem Governance: How You Avoid “Tool Sprawl”

By late 2025, most teams discover the same painful truth:

Connectors scale faster than trust.

You start with 5 tools. Then someone adds 15 more. Then teams copy/paste wrappers. Then the platform becomes an un-auditable jungle.

So governance must be built in.

The connector lifecycle (what the platform enforces)

Register

A connector is registered with:

  • a unique ID
  • owner team
  • schemas and auth methods
  • declared data classifications (PII? financial? admin actions?)
  • required policies (confirmations, approvals, logging levels)
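
A sketch of what a registry entry might record at registration time; the fields mirror the list above and are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConnectorRegistration:
    """What the registry knows about a connector before it runs anywhere."""
    connector_id: str                    # unique, stable ID
    owner_team: str
    auth_method: str                     # e.g. "oauth", "service_account", "delegated"
    data_classes: tuple                  # e.g. ("pii", "financial")
    required_policies: tuple             # e.g. ("human_confirmation", "full_argument_logging")

crm_reader = ConnectorRegistration(
    connector_id="crm.read_contact",
    owner_team="sales-platform",
    auth_method="oauth",
    data_classes=("pii",),
    required_policies=("full_argument_logging",),
)
```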

Verify

Automated checks run:

  • schema validation
  • injection-resistance tests (prompt + tool boundary)
  • fuzzing for parsing and argument handling
  • permissions tests (least privilege cannot be bypassed)

Publish

Publishing requires:

  • semantic versioning (breaking vs non-breaking changes)
  • approval gates (platform + security review for sensitive scopes)
  • staged rollout (canary tenants / limited traffic)

Operate

Runtime enforcement includes:

  • rate limits + circuit breakers
  • anomaly detection (sudden spike in calls or failures)
  • kill switches by connector version
  • rollback to last known good
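
The circuit breaker is worth showing because it's the stabilizer that sits in front of every connector version. A minimal sketch, with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Fail fast when a connector starts misbehaving; recover after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: connector disabled, use the fallback")
            self.opened_at = None                  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()       # trip: stop hammering a broken connector
            raise
        self.failures = 0                          # any success resets the count
        return result
```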

If you do nothing else: make connector adoption boring.

A boring connector process means:

  • easy to do the right thing
  • hard to do the unsafe thing
  • and impossible to ship changes with zero traceability

Quality as a Release Gate: Evals in CI, Not in Production

In 2023 I treated evaluation as the missing discipline for LLM features.

In 2025, evaluation becomes the missing discipline for agent platforms.

Because now failures aren’t just “wrong text.” They’re:

  • wrong tool usage
  • wrong order of operations
  • wrong policy decisions
  • costly loops
  • and unsafe actions

So RA v2 requires an eval pipeline that looks like software engineering:

  • unit tests for tool schemas, parsers, and routers
  • simulation tests for workflows (state machine transitions)
  • scenario evals for end-to-end runs with mocked connectors
  • regression gates tied to the versions you deploy

If you can’t run your agent in a simulated environment with deterministic connector responses, you don’t have a test strategy — you have hope.
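
Here's what one scenario eval can look like in CI, assuming a hypothetical `simulate_agent` harness that runs the agent against the mocked connectors; the assertions target tool ordering and cost, not vibes:

```python
def test_refund_flow_looks_up_before_refunding():
    """Scenario eval: deterministic mocked connectors, assertions on behavior."""
    mock_tools = {
        "billing.lookup_invoice": lambda invoice_id: {"invoice_id": invoice_id, "status": "paid"},
        "billing.issue_refund":   lambda invoice_id, amount: {"refund_id": "r_1"},
    }
    run = simulate_agent(                  # hypothetical harness entry point
        scenario="customer asks to refund invoice INV-42",
        tools=mock_tools,
        policy_version="2025-12-28.3",
        seed=7,                            # pin whatever randomness you can
    )
    calls = [c.tool_id for c in run.tool_calls]
    assert calls.index("billing.lookup_invoice") < calls.index("billing.issue_refund")
    assert run.total_cost_usd < 0.05       # cost regressions fail the build too
```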

The “Human Handoff” Is a First‑Class Component

The strongest agent platforms don’t try to eliminate humans.

They treat humans as:

  • a safety boundary,
  • a quality backstop,
  • and a customer-experience tool.

Handoff is not a button. It’s a workflow with guarantees:

  • transfer context safely (without leaking secrets or irrelevant internal traces)
  • capture what the agent attempted (so humans don’t start from zero)
  • let the human override or correct (with feedback captured for evals)
  • resume the workflow after human action if appropriate

RA v2 places handoff inside the deterministic envelope: a state transition, not a best-effort UX flourish.
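
A sketch of that state transition, assuming hypothetical `run` methods for redaction-aware summaries and resume tokens; the point is the curated packet and the resumable token:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandoffPacket:
    """What the human receives: curated context, not the raw internal trace."""
    run_id: str
    customer_summary: str          # what the user asked, in plain language
    attempted_steps: tuple         # what the agent tried (sanitized, no credentials)
    blocking_question: str         # the specific ambiguity that triggered escalation
    resume_token: str              # lets the workflow continue after the human acts

def escalate(run, reason: str) -> HandoffPacket:
    """Handoff as a transition inside the envelope, not a best-effort UX flourish."""
    run.transition("needs_human")                       # hypothetical envelope state machine
    return HandoffPacket(
        run_id=run.id,
        customer_summary=run.summarize_for_human(),     # hypothetical, redaction-aware
        attempted_steps=tuple(run.sanitized_steps()),
        blocking_question=reason,
        resume_token=run.issue_resume_token(),
    )
```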


Implementation Checklist: A Practical Adoption Path

You don’t implement this architecture by rewriting everything.

You implement it by turning chaos into contracts — gradually.

Step 1 — Standardize tool contracts

  • typed schemas for arguments + responses
  • strict validation (reject unknown fields)
  • stable IDs and semantic versioning
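
A minimal sketch of the "reject unknown fields" rule, assuming pydantic v2 is available; `extra="forbid"` turns surplus fields into hard errors instead of silent passthrough:

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class IssueRefundArgs(BaseModel):
    model_config = ConfigDict(extra="forbid")   # unknown fields are a contract violation
    invoice_id: str
    amount_cents: int
    reason: str

try:
    IssueRefundArgs.model_validate({"invoice_id": "INV-42", "amount_cents": 1299,
                                    "reason": "dup", "force": True})
except ValidationError:
    pass   # "force" is not in the contract, so the call is rejected before execution
```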

Step 2 — Introduce a session engine

  • explicit run state
  • budgets + timeouts
  • trace IDs everywhere

Step 3 — Add policy enforcement points

  • confirmation gates
  • least privilege enforcement
  • deny-by-default for new tools

Step 4 — Add an eval harness

  • scenario suites for the top 20 workflows
  • regression gates for connector and policy changes

Step 5 — Add governance workflows

  • connector registry + approvals
  • environment promotions
  • kill switches + rollback paths

Step 6 — Make human handoff real

  • consistent handoff triggers
  • safe context packaging
  • post-handoff feedback loop

The 2025 Takeaway

In 2018, RL taught me that unstable training loops need stabilizers.

In 2025, agent platforms taught me the same lesson in a new form:

A tool-using system is a feedback loop.
Reliability comes from the stabilizers you design around it.

Reference Architecture v2 is my attempt to name those stabilizers clearly:

  • separation of control plane and data plane
  • deterministic envelopes around probabilistic components
  • connector governance as a platform capability
  • evaluation as a release gate
  • observability as a first-class feature
  • humans as part of the system, not an afterthought

What’s Next

This closes the 2025 series: Agents become platforms (and platforms need governance).

In 2026, I’m switching modes again.

Back to a personal research journey — but this time inside the LLM/agent frontier: new protocols, emerging architectures, and the experimental edge where today’s “best practices” are still being invented.
