
Agents don’t “run in a loop” — they run across networks, vendors, and failures. This month is about the three patterns that make agent workflows survivable: durable intent (outbox), long-running transactions (sagas), and reconciliation (“eventually correct”).
Axel Domingues
Agents are finally useful in production… and immediately annoying.
Not because the model is “wrong”.
Because the agent is a distributed system the moment it touches anything real: other services, other vendors, queues, webhooks, and your own database, all of which fail independently.
So in June, I’m reframing the mental model:
If your agent performs actions, you are operating a distributed workflow engine — whether you intended to or not.
And distributed workflows have one job:
make reality converge.
Not instantly. Not perfectly.
But eventually — and provably.
This is where “eventual consistency” becomes too vague to be useful.
I prefer a more demanding phrase:
eventually correct.
Meaning: the workflow can crash, retry, and drift mid-flight, and still converge to a state you can prove is correct.
This month isn’t about better planning. It’s about the part that hurts after planning: durable execution across failures.
Treat every action as a transaction
Define invariants first, then build workflows that can’t violate them.
Durable intent (Transactional Outbox)
Write “what must happen” to your DB before you try to make it happen.
Long-running workflows (Sagas)
Split work into retryable steps with timeouts, compensation, and checkpoints.
Reality convergence (Reconciliation)
Assume some steps fail silently and build repair loops that detect + fix drift.
In a normal request/response app, failure is annoying but bounded: the request fails, the user sees an error, and nothing half-finished keeps running.
In an agentic app, failure is the baseline state.
Because agent workflows run for minutes or hours, cross vendor boundaries, retry on their own schedule, and sometimes wait on humans.
That combination forces you into a small set of patterns that survive the real world.
They all start with the same question:
What must never be wrong?
“Invariant” sounds academic. In practice it’s simple:
identify the things you can’t undo cheaply.
Some examples from agent systems: publishing the same article twice, announcing a post that never actually went live, deleting something you can’t restore.
Skip that step and you get dashboards that look great while the system violates the only thing that matters.
A useful rule: write the invariants down first, and make the workflow design answer to them, not the other way around.
You can ship a surprising amount of agent capability with just those three patterns: durable intent, sagas, and reconciliation.
Let’s make each one concrete.
The most common agent failure mode in production is painfully mundane:
The agent changes local state… then fails to trigger the next step.
Or worse:
The agent triggers the next step… then fails before persisting local state.
Both lead to ghosts: state that claims work is happening when nothing is, and work that really happened with no record to prove it.
The Transactional Outbox fixes this by making a strict deal:
In a single DB transaction, write:
- the state change, and
- a durable “intent event” describing the side effect to perform.
Then a separate worker publishes/drives those intents.
You can start with a tiny table:
-- Outbox is not an “event stream”. It’s a durable todo list.
create table outbox (
  id uuid primary key,
  type text not null,           -- e.g. "GenerateImageRequested"
  key text not null,            -- idempotency key / correlation id
  payload jsonb not null,
  status text not null,         -- "pending" | "sent" | "failed"
  attempts int not null default 0,
  next_attempt_at timestamptz not null default now(),
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);
create unique index outbox_unique_key on outbox (type, key);
A few notes that matter in real life:
- key is not optional. It’s how you dedupe at-least-once delivery.
- next_attempt_at makes backoff explicit (and queryable).
Example: “create job + emit event”
begin;
insert into jobs (id, state, ...) values (:job_id, 'queued', ...);
insert into outbox (id, type, key, payload, status)
values (:outbox_id, 'JobQueued', :job_id, :payload_json, 'pending');
commit;
Now you’ve made the system robust against the classic failure: crashing after the local state change but before the side effect gets triggered.
Because the intent is durable.
An outbox is only useful if something drains it.
That “something” is usually boring — and that’s a compliment.
Scan + lock
Select a small batch of pending outbox rows and lock them to avoid double dispatch.
Enqueue work
Create queue messages / cloud tasks that trigger workers (at-least-once).
Advance state
Mark “sent” only after enqueue succeeds. Track attempts + next retry time.
Observe everything
Emit metrics for lag, attempts, failure rate, and oldest pending item.
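Here’s a minimal sketch of that loop, assuming Postgres and the outbox table above; the batch size, the linear backoff, and the :sent_ids / :failed_ids parameters are illustrative, not prescriptive.
begin;
-- Scan + lock: grab a small batch of due work without blocking other dispatchers.
select id, type, key, payload
from outbox
where status = 'pending'
  and next_attempt_at <= now()
order by next_attempt_at
limit 50
for update skip locked;
-- (Enqueue each selected row to your queue of choice here; delivery is at-least-once.)
-- Advance state: mark "sent" only for rows whose enqueue succeeded...
update outbox
set status = 'sent', updated_at = now()
where id = any(:sent_ids);
-- ...and push the rest into the future with explicit, queryable backoff.
update outbox
set attempts = attempts + 1,
    next_attempt_at = now() + interval '30 seconds' * (attempts + 1),
    updated_at = now()
where id = any(:failed_ids);
commit;
The for update skip locked part is what lets you run more than one dispatcher without double-dispatching the same row.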
Truth #1: delivery will be at-least-once.
Your queue will retry. Your dispatcher will retry. Your worker will retry.
So you must build idempotency.
Truth #2: ordering is not guaranteed.
If ordering matters, encode it in your workflow state machine — not in the queue.
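One way to build that idempotency is to enforce “already done” at the same database boundary as the outbox; this is only a sketch, and the processed_intents table name is my own placeholder, not something from this post.
-- Dedupe ledger for worker-side idempotency (illustrative name and shape).
create table processed_intents (
  type text not null,
  key text not null,
  processed_at timestamptz not null default now(),
  primary key (type, key)
);
-- In the worker, per delivered message:
-- 1. If (type, key) is already recorded here, exit quietly.
-- 2. Perform the external call, passing the same key as the vendor's idempotency key,
--    so a crash between steps 2 and 3 is still safe to retry.
-- 3. Record completion; "on conflict do nothing" keeps redeliveries harmless.
insert into processed_intents (type, key)
values (:type, :key)
on conflict do nothing;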
Agents don’t do “one call”. They do sequences: draft, image, checks, publish, announce, verify.
That’s a workflow.
And workflows are distributed transactions with time gaps.
A saga is just the honest framing:
a durable state machine where each step is retryable and some steps have compensation.
Orchestration
A central workflow record advances through states, and workers drive each transition.
Agent platforms almost always want orchestration because it’s debuggable.
Choreography
Services react to events and emit new events.
Choreography can work, but agent product iteration usually benefits from a central ledger.
Let’s map a realistic workflow (that crosses vendors): generate a draft with one model provider, generate a hero image with another, run checks, publish to the CMS, announce it, then verify everything actually landed.
Every step can fail.
So your saga record might look like:
Workflow: PublishArticle
State: DRAFT_GENERATED -> IMAGE_GENERATED -> CHECKED -> PUBLISHED -> ANNOUNCED -> VERIFIED
And each transition is driven by a job/worker, not by a single process “waiting”.
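Here’s what one guarded transition can look like, sketched against a workflows table like the one in the checklist further down; state names come from the example above, parameter names are illustrative.
-- Advance PublishArticle from IMAGE_GENERATED to CHECKED.
-- The WHERE clause is the guard: a stale or duplicate worker updates zero rows
-- and stops, instead of clobbering a newer state.
update workflows
set state = 'CHECKED',
    context = context || :check_results,   -- jsonb merge of this step's output
    updated_at = now()
where id = :workflow_id
  and type = 'PublishArticle'
  and state = 'IMAGE_GENERATED';
-- 0 rows updated means something else already moved this workflow: do nothing.
The point is that each transition is atomic and conditional, so retries and duplicate workers stay boring instead of dangerous.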
“Eventually consistent” is too vague to operate.
Eventually correct means you can answer these questions at 3am:
- What state is this workflow actually in?
- What has really happened on the vendor’s side?
- What will retry next, and when?
- Is it safe to re-run this step?
To get there, you need a small set of non-negotiables.
Idempotency keys everywhere
Every external side effect must be repeatable without duplication.
Durable state machine
Workflow state lives in the DB, not in memory.
Explicit timeouts + DLQs
Don’t “retry forever” silently. Escalate stuck work.
Safe replay + backfills
You should be able to re-run steps without fear.
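For the “don’t retry forever silently” part, detecting stuck work can start as one small query; the state names come from examples in this post, and the columns match the workflows sketch in the checklist below.
-- Stuck work: anything past its deadline that isn't terminal or parked on a human.
select id, type, state, owner, deadline_at
from workflows
where deadline_at < now()
  and state not in ('VERIFIED', 'REJECTED', 'AWAITING_APPROVAL')
order by deadline_at
limit 100;
-- Feed these rows to alerting or a DLQ-style review queue instead of blind retries.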
Compensation is the scary part of sagas because it forces honesty.
Some actions are reversible: saving a draft, staging content, soft-deleting a record.
Some actions are not: announcing publicly, sending an email, hard-deleting data.
So compensation is less about “undo” and more about:
if we can’t undo, we must prevent the action until intent is explicit and audited.
This is where governance controls actually become architecture.
So when an action can’t be undone, make it:
- explicit (human approval, or strong policy gate)
- logged (who/what/why)
- reversible where possible (soft deletes, staging)
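A sketch of what “explicit + logged” can look like at the data layer; the table and column names here are illustrative, not a prescription.
-- Irreversible actions require an approval row before the worker will perform them.
create table action_approvals (
  workflow_id uuid not null,
  action text not null,          -- e.g. "AnnounceOnSocial"
  approved_by text not null,     -- human reviewer or policy-engine identifier
  reason text not null,
  approved_at timestamptz not null default now(),
  primary key (workflow_id, action)
);
-- The worker checks for a matching row and refuses the action when none exists.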
Most agent systems ship a fantasy first:
“fully autonomous agent”
Then reality arrives: edge cases nobody modeled, review steps, and actions someone has to explicitly approve.
The good news: sagas make humans easy.
Add states like:
- AWAITING_APPROVAL
- NEEDS_INPUT
- REJECTED
- ESCALATED
And treat “human decisions” as another event that advances the state machine.
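That makes a human approval just one more guarded update; a sketch, with the resume state left as a parameter because it depends on your workflow.
-- Record the decision and advance (or reject) in one conditional write.
update workflows
set state = case when :approved then :resume_state else 'REJECTED' end,
    context = context || jsonb_build_object(
      'approval', jsonb_build_object('by', :reviewer, 'why', :reason, 'at', now())
    ),
    updated_at = now()
where id = :workflow_id
  and state = 'AWAITING_APPROVAL';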
Even with outbox + sagas, drift still happens: a vendor says “success” but nothing actually ran, a webhook never arrives, a worker dies after the external call but before recording it.
Reconciliation is the system admitting that:
success signals are fallible, so we periodically verify reality and repair it.
Reconciliation must be safe: detection should be read-only, repairs should be idempotent, and the loop should be rate-limited so a bug can’t amplify the drift it’s meant to fix.
And you don’t get to skip it. You’ll just do it manually, under pressure, at the worst possible time.
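Detection can start as one boring query on a timer; the 15-minute grace period is arbitrary, and the state names come from the PublishArticle example above.
-- Drift candidates: we claimed ANNOUNCED but never managed to verify it.
select id, type, state, correlation_id, updated_at
from workflows
where type = 'PublishArticle'
  and state = 'ANNOUNCED'
  and updated_at < now() - interval '15 minutes'
order by updated_at
limit 100;
-- A repair worker re-checks external reality for each row, then either marks the
-- workflow VERIFIED or re-emits the original intent through the outbox.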
This is the architecture I keep seeing converge across “agent platforms”: a durable workflow store, an outbox with a boring dispatcher, queue-driven idempotent workers, and a reconciler on a timer.

If you squint, this is “classic distributed systems”… with a model in the loop.
Create a workflows table with:
- type, id, state
- context (json)
- updated_at, deadline_at
- owner (agent, system, human)
- correlation_id
A job does one thing: advance the workflow by exactly one transition, then exit.
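As a concrete sketch of that table (column types and the index are my assumptions; adapt to your store):
create table workflows (
  id uuid primary key,
  type text not null,                          -- e.g. "PublishArticle"
  state text not null,                         -- current step of the state machine
  context jsonb not null default '{}'::jsonb,  -- accumulated step outputs
  owner text not null,                         -- "agent" | "system" | "human"
  correlation_id text not null,                -- ties DB rows, queue messages, and traces together
  deadline_at timestamptz,                     -- when "still running" becomes "stuck"
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);
create index workflows_stuck_idx on workflows (deadline_at) where deadline_at is not null;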
Enforce idempotency at the database boundary:
- a unique key on (type, key)
- attempts and next_attempt_at
Never publish “intent” without persisting the state change that justifies it.
Keep the dispatcher boring: scan + lock, enqueue, mark sent, track attempts and next retry, emit metrics.
Reconciliation should: verify external reality on a schedule, repair drift idempotently, and escalate what it can’t fix.
You will need: lag and attempt metrics, deadlines with DLQ-style escalation, and a small toolkit of replay/backfill jobs.
Duplicate side effects
Cause: missing idempotency keys, or dedupe not enforced at the DB boundary.
Fix: unique keys + idempotent external requests + “already done” handling in workers.
Stuck or zombie workflows
Cause: retries without deadlines, missing reconciliation, no DLQ/alerting.
Fix: explicit deadlines, DLQs, and a reconciler that escalates.
Workflows you can’t debug
Cause: state lives in logs, not in a durable state machine.
Fix: workflow record as source of truth + trace correlation ids.
Silent drift from reality
Cause: assuming success responses are truth.
Fix: reconciliation that verifies external reality and repairs.
Failures you can’t repair
Cause: no safe replay, no backfills, no invariant-first thinking.
Fix: design for replayability and add a “repair job” toolkit.
Agents are not “a loop with tool calls”.
Agents are workflows across boundaries.
So ship them like workflows: durable intent, retryable steps with compensation, and reconciliation loops that keep reality honest.
That’s how you stop building demos and start building platforms.
June takeaway
Eventual consistency is a property you hope for.
Eventually correct is a property you design, audit, and repair toward.
Distributed Data: Transactions, Outbox, Sagas, and ‘Eventually Correct’
My 2022 deep dive on the exact reliability patterns this month builds on.
“Designing Data-Intensive Applications” (Kleppmann)
The clearest mental model for distributed systems tradeoffs and correctness.
Next month: Multi-Agent Systems Without Chaos: supervisors, specialists, and coordination contracts.
Because once you can run a single agent workflow reliably… you’ll want to run many.
And that’s where coordination becomes architecture.