Apr 27, 2025 - 14 MIN READ
Agent Runtimes Emerge: SDKs, orchestration primitives, and observability

In 2025, “agents” stop being demos and start being products. This is the month you realize you don’t need a smarter model — you need a runtime: durable execution, safety gates, and traces you can debug.

Axel Domingues

March was the compliance wake-up call.

April is the engineering one.

Because once you decide an agent is allowed to act (call tools, click buttons, run workflows, touch data), the hard problem stops being “prompting” and becomes:

How do we run this thing reliably, safely, and debuggably?

That’s what agent runtimes are for.

Not a framework. Not a prompt library.

A runtime — the execution substrate that turns probabilistic reasoning into repeatable operations.

In this 2025 series, agents become platforms.

And every platform eventually needs the same three things:

  1. orchestration primitives (state, retries, timeouts)
  2. security boundaries (permissions, isolation, approvals)
  3. observability (traces, not vibes)

What changed in 2025

Agent capabilities improved — but the bigger shift is how we package execution: SDKs now ship with state, policies, and tracing.

The new unit of work

Not “a prompt”.
A run: multi-step, tool-heavy, budgeted, auditable.

The runtime job

Turn plans into operations: manage state, control tool access, handle retries, and surface traces humans can debug.

The failure mode

Without a runtime, “agent reliability” becomes folklore: invisible loops, silent tool errors, and zero accountability.


What Is an Agent Runtime (Really)?

Think of it like the JVM or the browser runtime — not the language model, but the thing that executes safely.

An agent runtime owns these responsibilities:

  • Session & state: what the agent knows right now (and what it must forget)
  • Tool I/O: calling tools with permissions, schemas, and sandboxes
  • Durable execution: retries, timeouts, idempotency, checkpointing
  • Policy enforcement: what actions are allowed (and under which conditions)
  • Budgeting: token + time + tool cost caps per run
  • Human-in-the-loop gates: approvals, escalations, review queues
  • Observability: traces that let you answer “what happened?” in minutes, not days

If your system lacks any of these, you don’t have a runtime — you have a demo.

The fastest way to tell if you need a runtime:

If your agent does more than one external side effect (emails, tickets, payments, deploys, UI clicks), you need durable execution + traceability.
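A few of these responsibilities can be shown in miniature. This is an illustrative in-memory sketch, not any real SDK's API: `Run`, `ALLOWED_TOOLS`, and `call_tool` are hypothetical names, and a real runtime would back each piece with persistent storage and a policy engine.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    run_id: str
    state: dict = field(default_factory=dict)   # session & state
    tokens_left: int = 10_000                   # budgeting
    trace: list = field(default_factory=list)   # observability

# Policy enforcement: an explicit allowlist, not "whatever the model calls".
ALLOWED_TOOLS = {"crm.create_ticket"}

def call_tool(run: Run, tool: str, args: dict, cost_tokens: int) -> dict:
    if tool not in ALLOWED_TOOLS:               # policy gate
        run.trace.append(("policy.deny", tool))
        raise PermissionError(tool)
    if run.tokens_left < cost_tokens:           # budget gate
        raise RuntimeError("budget exhausted")
    run.tokens_left -= cost_tokens
    run.trace.append(("tool.call", tool, args)) # every call leaves a trace
    return {"ok": True}
```

Even this toy version makes the denied call and the spent budget visible after the fact, which is the whole point.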


SDKs Are Becoming Runtimes (Whether They Say So or Not)

In 2023–2024 we talked about tool use.

In 2025, the real product surface is execution.

So “agent SDKs” are evolving into something bigger:

  • from prompt utilities → to run orchestration
  • from one-shot tool calls → to multi-step plans
  • from logging text → to tracing runs
  • from best-effort → to budgeted + gated + replayable

This isn’t hype.

It’s a convergent architecture.

Every team that ships agents ends up reinventing the same primitives — until they standardize them.


The Orchestration Primitives That Actually Matter

Let’s name the primitives explicitly.

Because “agent orchestration” sounds mystical until you reduce it to runtime contracts.

Runs + checkpoints

A run has an ID, a current step, and a resumable checkpoint. If the worker dies, the run continues.

Deterministic state transitions

Treat run state as a state machine: explicit transitions, explicit failure states, no invisible “thinking”.

Tool contracts

Tools are APIs with schemas, auth scopes, and error contracts — not “functions the model can magically call”.

Budgets + timeouts

Cap tokens, wall time, and tool spend per run. Timeouts are not edge cases — they’re the default.

Retries + idempotency

Retries must be safe. Side effects need idempotency keys and “already done” handling.

Human gates

Approvals are first-class transitions. Humans are not “exceptions” — they’re part of the runtime.
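The "explicit transitions, explicit failure states" idea fits in a few lines. The state names below are illustrative, but the shape is the point: every legal transition is written down, approval is a state rather than an exception, and anything else is an error.

```python
# Legal transitions for a run, written down as data.
# State names are illustrative, not a standard.
TRANSITIONS = {
    "planned":           {"executing", "awaiting_approval", "failed"},
    "awaiting_approval": {"executing", "rejected"},      # human gate
    "executing":         {"verifying", "failed", "timed_out"},
    "verifying":         {"committed", "failed"},
}

def transition(current: str, nxt: str) -> str:
    allowed = TRANSITIONS.get(current, set())
    if nxt not in allowed:
        # No invisible "thinking": an unexpected move is a hard error.
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Note that terminal states ("committed", "rejected", "timed_out") have no outgoing edges, so a confused worker cannot resurrect a finished run.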

If you’ve built distributed systems, none of this is new.

The only novelty is that the “business logic” is partly probabilistic — so you must compensate with stronger operational constraints.

A model that “tries again” is not the same thing as a safe retry.

If the agent can click “Confirm purchase” twice, you need idempotency and approvals — not a better prompt.


A Practical Runtime Model: Plan, Execute, Verify

A healthy runtime doesn’t let the model directly perform irreversible actions.

It wraps action in structure:

  1. Plan: decide steps (and produce a machine-readable plan)
  2. Execute: run tool calls with policies, budgets, and retries
  3. Verify: validate outcomes (with checks, not vibes)
  4. Commit: only then perform irreversible side effects

This is how you make “agentic” feel like engineering again.
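The four phases above can be sketched as a single step function. The callables are stand-ins for real model and tool calls; the invariant that matters is that `commit_fn` (the irreversible side effect) only runs after verification passes.

```python
# Plan -> Execute -> Verify -> Commit, with commit gated on verification.
# plan_fn/execute_fn/verify_fn/commit_fn are hypothetical stand-ins.
def run_step(plan_fn, execute_fn, verify_fn, commit_fn) -> dict:
    plan = plan_fn()                     # 1. machine-readable plan
    result = execute_fn(plan)            # 2. policed, budgeted execution
    if not verify_fn(plan, result):      # 3. checks, not vibes
        return {"status": "failed", "result": result}
    commit_fn(result)                    # 4. irreversible side effect, last
    return {"status": "committed", "result": result}
```

If verification fails, nothing irreversible has happened yet, so the runtime is free to retry, re-plan, or escalate to a human.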


Observability: The Runtime’s Superpower

Agents fail in ways that are qualitatively different from normal services:

  • the “logic” is not fully deterministic
  • the agent may try multiple strategies
  • tool errors can be misinterpreted (or ignored)
  • loops can be rational-sounding until you check the trace

So you need tracing, not just logs.

What a good run trace lets you answer

  • Which tools were called, with which parameters (redacted where needed)?
  • Why did the agent choose that action?
  • Where did time go (model vs tools vs waiting vs retries)?
  • Did we violate a policy gate?
  • What changed in state between step N and N+1?
  • If we replay this run, do we get the same operations?

A minimal “span taxonomy” for agent runs

You don’t need perfection. You need consistency.

agent.run
  agent.step (plan | tool | verify | summarize)
    llm.call
    tool.call (tool_name)
    retrieval.query (optional)
    policy.check (optional)
    approval.wait (optional)

And you need consistent attributes:

{
  "run_id": "run_123",
  "step_id": "step_07",
  "model": "gpt-4.1",
  "tool": "crm.create_ticket",
  "budget_tokens_remaining": 1200,
  "timeout_ms": 15000,
  "retry_count": 1,
  "policy_decision": "allow",
  "human_gate_required": false
}
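The taxonomy and attributes above can be mimicked with a toy tracer. In production you would emit OpenTelemetry spans instead; this stdlib sketch just shows the nesting and the consistent attribute names.

```python
import time
from contextlib import contextmanager

# Toy tracer matching the span taxonomy above. Spans are appended as
# they close, so the innermost span lands in SPANS first.
SPANS: list[tuple[str, dict]] = []

@contextmanager
def span(name: str, **attrs):
    start = time.monotonic()
    try:
        yield attrs
    finally:
        attrs["duration_ms"] = round((time.monotonic() - start) * 1000)
        SPANS.append((name, attrs))

with span("agent.run", run_id="run_123"):
    with span("agent.step", step_id="step_07"):
        with span("tool.call", tool="crm.create_ticket", retry_count=1):
            pass  # the actual tool invocation would go here
```

Because every span records its duration, "where did time go?" becomes a query instead of an archaeology project.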

Be extremely careful with what you store in traces.

Logs and traces have a habit of turning into a shadow data lake. Redaction, retention, and access control are runtime features — not compliance paperwork.


A Reference Architecture for Agent Runtimes

Here’s a blueprint that survives contact with production.

The layers

  • Experience layer: chat/UI + human approval surfaces
  • API layer: authentication, rate limits, tenant policy
  • Orchestrator: run state machine, step scheduling, retries, budgets
  • Tool adapters: typed connectors with scoped auth + sandboxing
  • State store: run state, checkpoints, memory pointers, artifacts
  • Trace pipeline: OpenTelemetry-compatible traces + run analytics
  • Eval hooks: post-run scoring + regression gates (from February)

This is also why “runtimes” are emerging as products: every serious team ends up building this stack anyway.


How to Adopt This Without Boiling the Ocean

A runtime is not a rewrite. It’s a set of constraints you add until the system becomes operable.

Start with a single run schema

Define what a “run” is:

  • run_id, status, step pointer, budgets
  • inputs, outputs, artifacts
  • audit metadata (who/what triggered it)
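As a starting point, that schema fits in one dataclass. Field names and defaults here are illustrative, not a standard; the point is that every run carries its budgets and audit metadata from birth.

```python
from dataclasses import dataclass, field

# A minimal run schema sketch; names and defaults are illustrative.
@dataclass
class RunRecord:
    run_id: str
    status: str = "pending"          # pending | running | done | failed
    step: int = 0                    # resumable step pointer
    budget_tokens: int = 50_000      # hard cap, not a suggestion
    budget_ms: int = 120_000
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)
    triggered_by: str = "unknown"    # audit: who/what started this run
```

Once this record exists, budgets, resumability, and audit trails all have an obvious home.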

Add checkpoints before you add cleverness

If a worker dies, can you resume? If the answer is “no”, you’re still in demo territory.
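A checkpoint is nothing exotic: persist the step pointer and state after every step, and let a replacement worker read it back. File storage here is illustrative; a real runtime would use a database or a durable-execution engine.

```python
import json
from pathlib import Path

# Checkpointing sketch: persist after every step so a replacement
# worker can pick up where the dead one left off.
def save_checkpoint(path: Path, run_id: str, step: int, state: dict) -> None:
    path.write_text(json.dumps(
        {"run_id": run_id, "step": step, "state": state}))

def resume(path: Path) -> tuple[int, dict]:
    cp = json.loads(path.read_text())
    return cp["step"], cp["state"]
```

The test of "demo territory" is simple: kill the worker between steps, start a new one, and see whether the run continues or restarts from zero.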

Wrap every side effect with an idempotency key

Email, ticket, payment, deploy, UI click: every action must be safely repeatable.
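One wrapper covers all of them. The key is derived from run, step, and action, so a retried step replays the recorded result instead of re-executing the side effect. The in-memory store is illustrative; in production the key lookup must be durable and atomic.

```python
# Idempotent side-effect wrapper: one key per (run, step, action).
# _DONE stands in for a durable store with atomic check-and-set.
_DONE: dict[str, object] = {}

def idempotent(run_id: str, step_id: str, action: str, effect):
    key = f"{run_id}:{step_id}:{action}"
    if key in _DONE:             # "already done" handling
        return _DONE[key]
    result = effect()            # the real side effect, exactly once
    _DONE[key] = result
    return result
```

With this in place, the agent clicking "Confirm purchase" twice produces one purchase and one replayed receipt.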

Make policy checks explicit steps

“Allowed?” is a state transition, not a comment in code.

Instrument traces early

If you can’t debug it, you can’t ship it.

The sequence matters:

durability → safety → observability → intelligence

Teams that do it backwards end up with a smart agent they’re afraid to run.

Resources

OpenTelemetry Traces

The baseline for distributed tracing — and the lingua franca for agent run observability.

OpenTelemetry Semantic Conventions

How to standardize span names and attributes so traces stay queryable at scale.

Temporal (Durable Execution)

A practical reference for retries, timeouts, and crash-proof workflows — the core of production orchestration.

OpenAI Agents SDK

An example of SDKs moving “up the stack” into orchestration, tool mediation, and run management.

LangGraph

Graph-based agent orchestration: explicit state transitions instead of invisible loops.

Model Context Protocol (MCP)

A standard connector surface for tools and data — useful when your runtime needs a portable tool ecosystem.


What’s Next

April is where “agents” start looking like a platform layer.

Next month, we hit the economic reality shift that changes system design again:

The 1M-Token Era: how long context changes retrieval economics and system design

When context becomes cheap enough (or big enough), the architecture of memory, retrieval, and caching gets rewritten.

Axel Domingues - 2026