Dec 28, 2025 - 20 MIN READ
Reference Architecture v2: the Operable Agent Platform

This is the 2025 finale: a practical reference architecture for running fleets of agents with governance—connectors you can trust, traces you can debug, evals you can ship, and humans you can hand off to.

Axel Domingues

All year, the story of agents changed.

In January, an “agent” still meant: a model with tools.

By December, an “agentic product” means something else entirely:

a platform that runs many agents, with many connectors, across many teams — without turning into chaos.

This article is the synthesis.

Why “v2”?

I wrote Reference Architecture v1 as the finale of the 2022 “operational architecture” year: a defendable baseline for a modern cloud product (APIs, distributed data, async workflows, CI/CD, observability, security, and governance). v2 is that same spine — plus the agent-specific control plane: orchestration, connector/tool governance, evaluation in CI, and the reliability guardrails needed to run fleets of probabilistic workers.

If you’re jumping in here, treat v1 as the “non-AI platform” foundation, and v2 as the upgrade that makes it agent-native.

Reference Architecture v2 describes an operable agent platform, one that gives you:

  • predictable behavior (as much as probabilistic components allow)
  • safe tool use (least privilege + injection resistance)
  • traceable execution (debuggable end-to-end)
  • measurable quality (evals in CI, not vibes)
  • governable change (versioning, approvals, auditability)
  • and a clean handoff to humans when the machine hits ambiguity

“Reference architecture” doesn’t mean “one true stack.”

It means you can point to a diagram, name the invariants, and defend your decisions under pressure: security reviews, incident postmortems, cost spikes, compliance asks, and product deadlines.

  • The goal: run fleets of agents safely, cheaply, and repeatably.
  • The shift: agents stop being a feature and become a runtime.
  • The constraint: connectors turn into production attack surface.
  • The definition of done: you can operate it (observe, debug, roll back, and audit).


What “Operable” Means (In Practice)

“Operable” is not a buzzword. It’s a list of things you can do at 3AM when something breaks.

An operable agent platform can answer these questions quickly:

  • What happened? (trace + timeline)
  • Why did it happen? (inputs + tool calls + policy decisions)
  • What was the blast radius? (tenant, connector, model, workflow)
  • How do we stop it now? (kill switch, circuit breaker, rollback)
  • How do we prevent recurrence? (policy update, connector fix, eval gate)

If you can’t do that, you don’t have an agent platform.

You have a demo that will eventually become an incident.


The Architecture at 10,000 Feet: Data Plane vs Control Plane

The cleanest mental model I’ve found is the same one we use for infrastructure platforms:

  • Data plane: runs the work (agent runs, tool calls, voice turns, workflows)
  • Control plane: decides what is allowed (policies, versions, approvals, routing, audit requirements)

The mistake teams make is mixing these.

If your data plane “decides” policy at runtime with no versioning, approvals, or audit — you can’t govern change. And if your control plane tries to run the workload, you can’t scale execution cleanly.

So RA v2 keeps them separate.


RA v2 Components

Here’s the platform decomposed into components you can actually build and assign teams to.

1) Experiences (Chat, Voice, Automations, Embedded UI)

This is the surface area your users touch:

  • web chat
  • mobile chat
  • voice calls
  • proactive workflows (“run this plan nightly”)
  • embedded copilots inside products

Rule: Experiences should be thin. They collect context, render outputs, and host a clean human handoff path. They should not contain business logic for agent orchestration.

2) Agent Runtime (Session Engine)

The session engine is the heart of the data plane.

It owns:

  • conversation/session state
  • memory boundaries (what is retained vs not)
  • step execution (LLM call → tool call → evaluation → next step)
  • budgets (cost, tokens, tool call limits)
  • guardrails (policy enforcement points)

This is where multi-agent composition happens safely: supervisors, specialists, delegation, and structured handoffs.
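
A minimal sketch of that step loop, assuming a hypothetical `llm` callable that returns a proposed step as a dict; the names are illustrative, but the bounded loop and the budget charges before execution are the point:

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    """Raised when a run exhausts its step, token, or tool-call budget."""

@dataclass
class Budget:
    max_tokens: int = 50_000
    max_tool_calls: int = 10
    tokens_used: int = 0
    tool_calls_made: int = 0

    def charge_tokens(self, n: int) -> None:
        self.tokens_used += n
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exhausted ({self.tokens_used}/{self.max_tokens})")

    def charge_tool_call(self) -> None:
        self.tool_calls_made += 1
        if self.tool_calls_made > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")

def run_session(goal: str, llm, tools: dict, budget: Budget, max_steps: int = 20):
    """One run = a bounded loop of (model call -> optional tool call -> next step)."""
    transcript = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = llm(transcript)              # hypothetical: returns a proposed next step
        budget.charge_tokens(step["tokens"])
        if step["type"] == "final":
            return step["content"]          # the model says it's done; the loop decides to stop
        budget.charge_tool_call()
        tool = tools[step["tool"]]          # unknown tool name -> KeyError -> fail closed
        result = tool(**step["arguments"])
        transcript.append({"role": "tool", "name": step["tool"], "content": result})
    raise BudgetExceeded("step budget exhausted")
```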

3) Workflow Engine (Jobs + State Machines)

Agent work is not always interactive.

If your platform can’t:

  • run long workflows,
  • retry safely,
  • and survive partial failures,

…then it’s not a platform.

This engine should feel familiar if you’ve built distributed systems:

  • durable state machines
  • idempotency keys
  • retries with backoff
  • dead letter queues
  • explicit timeouts

(If you’ve read my earlier outbox/sagas work, you’ll recognize the patterns.)
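
A minimal sketch of one durable step under those patterns, assuming a hypothetical `store` with `get`/`put`: completed work is skipped via the idempotency key, failures retry with exponential backoff, and exhausted retries land in a dead-letter record:

```python
import time

def run_step(store, run_id: str, step_name: str, action, max_attempts: int = 4):
    """Execute one workflow step with at-most-once side effects per key."""
    idempotency_key = f"{run_id}:{step_name}"
    if (cached := store.get(idempotency_key)) is not None:
        return cached                                # already done on a prior attempt: skip
    for attempt in range(max_attempts):
        try:
            result = action()
            store.put(idempotency_key, result)       # record completion before moving on
            return result
        except Exception as exc:
            if attempt == max_attempts - 1:
                store.put(f"dlq:{idempotency_key}", repr(exc))  # dead-letter for operators
                raise
            time.sleep(2 ** attempt)                 # backoff: 1s, 2s, 4s, ...
```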

4) Connector Layer (Tools, APIs, MCP, RPA)

Tools are “just functions”… until they aren’t.

The connector layer provides:

  • stable tool schemas
  • auth handshakes (OAuth, service accounts, delegation)
  • policy-bound execution (least privilege)
  • rate limiting and circuit breakers
  • safe parsing and response validation
  • versioning and compatibility guarantees

This is also where MCP-style connector ecosystems start to matter: a standard protocol for tool discovery, schemas, and execution boundaries.
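
To make "stable tool schemas" concrete, here's a minimal sketch of a versioned tool wrapper that validates in both directions. The `ToolSpec` shape is illustrative, not a real MCP API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ToolSpec:
    """A versioned tool contract: stable ID, typed arguments, validated response."""
    tool_id: str
    version: str
    arg_types: dict[str, type]
    result_type: type
    fn: Callable[..., Any]

    def __call__(self, **kwargs: Any) -> Any:
        unknown = set(kwargs) - set(self.arg_types)
        if unknown:
            raise TypeError(f"{self.tool_id}@{self.version}: unknown arguments {unknown}")
        for name, expected in self.arg_types.items():
            if not isinstance(kwargs.get(name), expected):
                raise TypeError(f"{self.tool_id}: argument {name!r} must be {expected.__name__}")
        result = self.fn(**kwargs)
        if not isinstance(result, self.result_type):     # never trust connector output blindly
            raise TypeError(f"{self.tool_id}: connector returned {type(result).__name__}")
        return result

# Usage: wrap a raw function into a governed, versioned tool (stub backend for illustration).
lookup_invoice = ToolSpec(
    tool_id="billing.lookup_invoice",
    version="1.2.0",
    arg_types={"invoice_id": str},
    result_type=dict,
    fn=lambda invoice_id: {"invoice_id": invoice_id, "status": "paid"},
)
```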

5) Control Plane (Policy + Governance)

The control plane is where you make safety and compliance real, not aspirational.

It owns:

  • policy definitions and versioning
  • approvals (who can publish a connector? a prompt? a workflow template?)
  • environment promotion (dev → staging → prod)
  • audit logs and retention policy
  • model routing rules (latency/cost/quality constraints)
  • kill switches (tenant, connector, workflow, model)
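
A sketch of policy-as-data with deny-by-default and the kill switch built in; the fields and version string are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """A versioned, approved policy document the data plane reads but never edits."""
    version: str
    approved_by: str
    allowed_tools: frozenset
    requires_confirmation: frozenset      # tools that need a human "yes" before executing
    killed: frozenset = frozenset()       # kill switch: IDs disabled right now

    def check(self, tool_id: str) -> str:
        if tool_id in self.killed:
            return "deny"
        if tool_id not in self.allowed_tools:
            return "deny"                 # deny-by-default for anything unregistered
        if tool_id in self.requires_confirmation:
            return "confirm"
        return "allow"

prod_policy = Policy(
    version="2025-12-28.3",
    approved_by="security-review",
    allowed_tools=frozenset({"crm.read_contact", "billing.lookup_invoice"}),
    requires_confirmation=frozenset({"billing.lookup_invoice"}),
)
assert prod_policy.check("crm.delete_all") == "deny"   # never registered, so never allowed
```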

6) Observability Plane (Tracing + Metrics + Replay)

Agents are probabilistic — so you need observability that treats them as distributed systems.

At minimum:

  • traces spanning LLM calls + tool calls + policy decisions
  • structured logs (no “string soup”)
  • metrics for latency, error rate, tool failure, cost, and token burn
  • run replay (reconstruct what happened from stored artifacts)
  • evaluation results linked to real runs

If your logs don’t include the exact tool arguments and the exact policy version used at runtime, your postmortems will become “we think it did X” — which is another way of saying: you can’t debug it.
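
Here's a minimal sketch of what one tool-call event should carry. Field names are illustrative, and `print` stands in for your real log pipeline or OTel exporter:

```python
import json, time, uuid

def emit_tool_span(run_id: str, tool_id: str, arguments: dict,
                   policy_version: str, model: str, outcome: str,
                   latency_ms: float, cost_usd: float) -> None:
    """One structured event per tool call: enough to replay the decision later."""
    event = {
        "ts": time.time(),
        "span_id": uuid.uuid4().hex,
        "run_id": run_id,                   # joins this span to the whole agent run
        "tool_id": tool_id,
        "arguments": arguments,             # the exact args (redact secrets upstream)
        "policy_version": policy_version,   # which rules were in force at call time
        "model": model,
        "outcome": outcome,                 # "ok" | "denied" | "error" | "timeout"
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(event))
```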

The Non‑Negotiables: The Platform Invariants

Reference architectures are useless if they don’t specify what must never be violated.

Here are the invariants I treat as non-negotiable for an operable agent platform:

  • Identity & tenancy: every action is attributable (user, tenant, agent, connector, run).
  • Least privilege by default: tools run with minimal scope, time‑boxed credentials, and explicit grants.
  • Deterministic envelope: even if model output is probabilistic, the execution contract is deterministic.
  • Auditable decisions: policy version, model selection, tool choice, and overrides are recorded.

Under these invariants, “agents” become something you can safely operate:

  • You can prove who did what.
  • You can bound what they can do.
  • You can replay and explain incidents.
  • You can roll back safely.

The Deterministic Envelope: Where Probabilistic Meets Production

The trick to building reliable agent systems is not pretending the model is deterministic.

It’s building a deterministic envelope around it.

That envelope is a contract.

A good envelope includes:

  • Typed tool schemas (validated both directions)
  • State machine boundaries (what steps exist, what transitions are allowed)
  • Budgets (tokens, cost, time, tool calls)
  • Policy gates (what requires approval, escalation, or denial)
  • Human checkpoints (when ambiguity becomes risk)

Think of it like this:

The model is a creative proposal engine.
The platform is the execution authority.
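
In code, the envelope can be as small as a transition table that the platform enforces no matter what the model proposes (state names are illustrative):

```python
# Which states exist, and which transitions the platform will actually execute.
TRANSITIONS = {
    "collecting":  {"proposing"},
    "proposing":   {"executing", "needs_human"},
    "executing":   {"proposing", "done", "needs_human"},
    "needs_human": {"proposing", "done"},
    "done":        set(),
}

def apply(state: str, proposed_next: str) -> str:
    """The model proposes; the platform disposes. Illegal transitions fail closed."""
    if proposed_next not in TRANSITIONS[state]:
        return "needs_human"        # ambiguity becomes a human checkpoint, not a guess
    return proposed_next

assert apply("collecting", "executing") == "needs_human"   # can't skip the proposal step
```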


Three Reference Flows (Where Systems Usually Break)

Let’s make the architecture concrete with the three flows where production systems usually break: connector lifecycle governance, quality gating with evals, and human handoff.


Connector Ecosystem Governance: How You Avoid “Tool Sprawl”

By late 2025, most teams discover the same painful truth:

Connectors scale faster than trust.

You start with 5 tools. Then someone adds 15 more. Then teams copy/paste wrappers. Then the platform becomes an un-auditable jungle.

So governance must be built in.

The connector lifecycle (what the platform enforces)

Register

A connector is registered with:

  • a unique ID
  • owner team
  • schemas and auth methods
  • declared data classifications (PII? financial? admin actions?)
  • required policies (confirmations, approvals, logging levels)
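
A sketch of what a registry entry might record at registration time; the fields mirror the list above and are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConnectorRegistration:
    """What the registry knows about a connector before it runs anywhere."""
    connector_id: str                    # unique, stable ID
    owner_team: str
    auth_method: str                     # e.g. "oauth", "service_account", "delegated"
    data_classes: tuple                  # e.g. ("pii", "financial")
    required_policies: tuple             # e.g. ("human_confirmation", "full_argument_logging")

crm_reader = ConnectorRegistration(
    connector_id="crm.read_contact",
    owner_team="sales-platform",
    auth_method="oauth",
    data_classes=("pii",),
    required_policies=("full_argument_logging",),
)
```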

Verify

Automated checks run:

  • schema validation
  • injection-resistance tests (prompt + tool boundary)
  • fuzzing for parsing and argument handling
  • permissions tests (least privilege cannot be bypassed)

Publish

Publishing requires:

  • semantic versioning (breaking vs non-breaking changes)
  • approval gates (platform + security review for sensitive scopes)
  • staged rollout (canary tenants / limited traffic)

Operate

Runtime enforcement includes:

  • rate limits + circuit breakers
  • anomaly detection (sudden spike in calls or failures)
  • kill switches by connector version
  • rollback to last known good
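
The circuit breaker is worth showing because it's the stabilizer that sits in front of every connector version. A minimal sketch, with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Fail fast when a connector starts misbehaving; recover after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: connector disabled, use the fallback")
            self.opened_at = None                  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()       # trip: stop hammering a broken connector
            raise
        self.failures = 0                          # any success resets the count
        return result
```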

If you do nothing else: make connector adoption boring.

A boring connector process means:

  • easy to do the right thing
  • hard to do the unsafe thing
  • and impossible to ship changes with zero traceability

Quality as a Release Gate: Evals in CI, Not in Production

In 2023 I treated evaluation as the missing discipline for LLM features.

In 2025, evaluation becomes the missing discipline for agent platforms.

Because now failures aren’t just “wrong text.” They’re:

  • wrong tool usage
  • wrong order of operations
  • wrong policy decisions
  • costly loops
  • and unsafe actions

So RA v2 requires an eval pipeline that looks like software engineering:

  • unit tests for tool schemas, parsers, and routers
  • simulation tests for workflows (state machine transitions)
  • scenario evals for end-to-end runs with mocked connectors
  • regression gates tied to the versions you deploy

If you can’t run your agent in a simulated environment with deterministic connector responses, you don’t have a test strategy — you have hope.
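
Here's what one scenario eval can look like in CI, assuming a hypothetical `simulate_agent` harness that runs the agent against the mocked connectors; the assertions target tool ordering and cost, not vibes:

```python
def test_refund_flow_looks_up_before_refunding():
    """Scenario eval: deterministic mocked connectors, assertions on behavior."""
    mock_tools = {
        "billing.lookup_invoice": lambda invoice_id: {"invoice_id": invoice_id, "status": "paid"},
        "billing.issue_refund":   lambda invoice_id, amount: {"refund_id": "r_1"},
    }
    run = simulate_agent(                  # hypothetical harness entry point
        scenario="customer asks to refund invoice INV-42",
        tools=mock_tools,
        policy_version="2025-12-28.3",
        seed=7,                            # pin whatever randomness you can
    )
    calls = [c.tool_id for c in run.tool_calls]
    assert calls.index("billing.lookup_invoice") < calls.index("billing.issue_refund")
    assert run.total_cost_usd < 0.05       # cost regressions fail the build too
```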

The “Human Handoff” Is a First‑Class Component

The strongest agent platforms don’t try to eliminate humans.

They treat humans as:

  • a safety boundary,
  • a quality backstop,
  • and a customer-experience tool.

Handoff is not a button. It’s a workflow with guarantees:

  • transfer context safely (without leaking secrets or irrelevant internal traces)
  • capture what the agent attempted (so humans don’t start from zero)
  • let the human override or correct (with feedback captured for evals)
  • resume the workflow after human action if appropriate

RA v2 places handoff inside the deterministic envelope: a state transition, not a best-effort UX flourish.
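
A sketch of that state transition, assuming hypothetical `run` methods for redaction-aware summaries and resume tokens; the point is the curated packet and the resumable token:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandoffPacket:
    """What the human receives: curated context, not the raw internal trace."""
    run_id: str
    customer_summary: str          # what the user asked, in plain language
    attempted_steps: tuple         # what the agent tried (sanitized, no credentials)
    blocking_question: str         # the specific ambiguity that triggered escalation
    resume_token: str              # lets the workflow continue after the human acts

def escalate(run, reason: str) -> HandoffPacket:
    """Handoff as a transition inside the envelope, not a best-effort UX flourish."""
    run.transition("needs_human")                       # hypothetical envelope state machine
    return HandoffPacket(
        run_id=run.id,
        customer_summary=run.summarize_for_human(),     # hypothetical, redaction-aware
        attempted_steps=tuple(run.sanitized_steps()),
        blocking_question=reason,
        resume_token=run.issue_resume_token(),
    )
```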


Implementation Checklist: A Practical Adoption Path

You don’t implement this architecture by rewriting everything.

You implement it by turning chaos into contracts — gradually.

Step 1 — Standardize tool contracts

  • typed schemas for arguments + responses
  • strict validation (reject unknown fields)
  • stable IDs and semantic versioning
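
A minimal sketch of the "reject unknown fields" rule, assuming pydantic v2 is available; `extra="forbid"` turns surplus fields into hard errors instead of silent passthrough:

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class IssueRefundArgs(BaseModel):
    model_config = ConfigDict(extra="forbid")   # unknown fields are a contract violation
    invoice_id: str
    amount_cents: int
    reason: str

try:
    IssueRefundArgs.model_validate({"invoice_id": "INV-42", "amount_cents": 1299,
                                    "reason": "dup", "force": True})
except ValidationError:
    pass   # "force" is not in the contract, so the call is rejected before execution
```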

Step 2 — Introduce a session engine

  • explicit run state
  • budgets + timeouts
  • trace IDs everywhere

Step 3 — Add policy enforcement points

  • confirmation gates
  • least privilege enforcement
  • deny-by-default for new tools

Step 4 — Add an eval harness

  • scenario suites for the top 20 workflows
  • regression gates for connector and policy changes

Step 5 — Add governance workflows

  • connector registry + approvals
  • environment promotions
  • kill switches + rollback paths

Step 6 — Make human handoff real

  • consistent handoff triggers
  • safe context packaging
  • post-handoff feedback loop

The 2025 Takeaway

In 2018, RL taught me that unstable training loops need stabilizers.

In 2025, agent platforms taught me the same lesson in a new form:

A tool-using system is a feedback loop.
Reliability comes from the stabilizers you design around it.

Reference Architecture v2 is my attempt to name those stabilizers clearly:

  • separation of control plane and data plane
  • deterministic envelopes around probabilistic components
  • connector governance as a platform capability
  • evaluation as a release gate
  • observability as a first-class feature
  • humans as part of the system, not an afterthought

What’s Next

This closes the 2025 series: Agents become platforms (and platforms need governance).

In 2026, I’m switching modes again.

Back to a personal research journey — but this time inside the LLM/agent frontier: new protocols, emerging architectures, and the experimental edge where today’s “best practices” are still being invented.
