
Real-time agents aren’t “LLMs with voice.” They’re distributed systems with hard latency budgets. This month is about streaming UX, safe interruption (barge-in), and session state you can actually operate.
Axel Domingues
Tool use gave agents hands.
RAG gave them memory (the grounded kind).
But in November, the hard truth shows up:
Real-time agents are not a model problem.
They’re a latency + state + cancellation problem.
The moment you add streaming audio and allow the user to interrupt mid-sentence, you’re no longer “calling an LLM”.
You’re operating a conversation machine whose parts run at different speeds, fail independently, and must remain coherent anyway.
By "coherent" I mean:
- the UX stays responsive while the system is thinking
- the user can change their mind mid-output
- and the session doesn’t desync into nonsense when that happens
The new constraint
Latency becomes a product contract, not an optimization ticket.
The new UI primitive
Streaming is the UI.
“Answer at the end” feels broken.
The new safety mechanism
Barge-in (interrupt) is required.
But it’s also a cancellation minefield.
The new failure mode
Session state can split: transcript, tools, audio, UI.
Your job is to force convergence.
A real-time agent is three loops running concurrently:
- a listening loop: audio in, partial and final transcripts out
- a thinking loop: planning, retrieval, tool calls, model tokens
- a speaking loop: synthesis and playback of whatever is ready
If you treat this as a single “request/response”, you’ll ship a demo.
If you treat it as a set of coordinated loops with shared state and cancellation, you can ship a product.
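Here's a minimal sketch of that shape in TypeScript. Everything in it is illustrative (the loop bodies, the event names, the AbortController wiring); the point is three loops sharing one timeline and one cancellation scope.

type SessionEvent = { type: string; turn_id: string; payload?: unknown };

class Session {
  readonly timeline: SessionEvent[] = [];       // append-only source of truth
  private turnAbort = new AbortController();    // cancellation scope for the current turn

  append(event: SessionEvent) {
    this.timeline.push(event);
  }

  // Barge-in: abort everything scoped to the current turn, then open a fresh scope.
  interrupt(turnId: string) {
    this.append({ type: "cancel.requested", turn_id: turnId });
    this.turnAbort.abort();
    this.turnAbort = new AbortController();
  }

  get signal(): AbortSignal {
    return this.turnAbort.signal;
  }
}

// Loop 1: listening. Audio in, asr.partial / asr.final events out.
async function listenLoop(session: Session) {
  // ...append transcript events as audio arrives...
}

// Loop 2: thinking. Planning, tools, model tokens; checks session.signal between steps.
async function agentLoop(session: Session) {
  // ...stop early whenever session.signal.aborted is true...
}

// Loop 3: speaking. Turns tokens into audio; stops the instant the signal aborts.
async function speakLoop(session: Session) {
  // ...playback must react to session.signal locally, before any network round-trip...
}

const session = new Session();
void Promise.all([listenLoop(session), agentLoop(session), speakLoop(session)]);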

In non-real-time systems, latency is hidden behind a spinner.
In real-time conversation, a spinner is social death.
Streaming buys you time in three ways:
- perceived latency drops, because the user hears or sees something long before the full answer exists
- slow work (retrieval, tools) can happen behind a low-commitment preface instead of behind silence
- the user can redirect you early, before you've spent the whole budget on the wrong answer
Voice agents feel magical… until the user can’t interrupt.
In real conversation, interruption is normal: people correct themselves, change their mind halfway through, and cut in the moment they've heard enough.
If your agent can’t stop speaking and adapt, it feels less intelligent than a phone menu.
Every output must be cancellable, and cancellation must be observable.
That means:
- every stage of the pipeline (model, tools, TTS, playback) accepts a cancel signal
- every cancel produces an acknowledgement (cancel.requested → cancel.ack)
- the timeline records what was actually delivered before the cancel landed
Barge-in forces you to design capability boundaries and commit points. If you cancel the model but a tool call completes and commits side effects (sent email, booked meeting, charged card), you've created a trust-killer.
Most agent demos treat “conversation state” as a blob of messages.
Real-time agents can’t.
Because streaming and barge-in create partial, overlapping reality: ASR partials that get revised after the model has already started, assistant speech cut off mid-sentence, tool calls still in flight for a turn the user has abandoned, UI text that was never actually spoken.
So instead of a blob, you need a session timeline: an append-only sequence of events with strong ordering guarantees.
A session is a timeline of facts.
UI views are projections of that timeline.
That’s the only way to keep audio, text, tools, and UI aligned.
The session timeline pattern
Model everything as events: input chunks, transcript revisions, tool calls, tool results, output tokens, cancellations, commits. Then derive the rest as projections of that timeline: the visible transcript, what was actually spoken, pending and committed actions.
Your event model doesn’t need to be perfect.
It needs to be:
- append-only, with a total order per session
- able to represent revisions (asr.partial → asr.final) without rewriting history
- able to represent cancellation and commits explicitly
Here’s a minimal event envelope:
{
  "session_id": "sess_123",
  "event_id": "evt_456",
  "ts": "2024-11-24T10:12:33.120Z",
  "type": "asr.partial",
  "turn_id": "turn_7",
  "parent_event_id": "evt_455",
  "payload": {}
}
And here are the event types that matter most in real-time agents:
Transcript evolution
audio.chunk, asr.partial, asr.final, turn.end
Agent work
plan.start, retrieve.start, tool.call, tool.result, model.token
Output control
tts.chunk, playback.start, playback.stop, cancel.requested, cancel.ack
Commit points
action.proposed, action.confirmed, action.committed, action.reverted
Notice what’s missing:
There is no “final answer” field.
There is only a timeline that allows you to reconstruct what happened.
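To make "projection" concrete, here is a sketch of one: the visible transcript derived from the timeline. Event and field names follow the taxonomy above; the rule for assistant text (keep only chunks emitted before a cancel) is a simplification, and a stricter version would key on playback events instead.

type TimelineEvent = {
  type: string;
  turn_id: string;
  payload: { text?: string };
};

// The transcript is never stored; it is recomputed from events on demand.
function visibleTranscript(timeline: TimelineEvent[]): string[] {
  const lines: string[] = [];
  const cancelled = new Set<string>();

  for (const e of timeline) {
    if (e.type === "cancel.requested") {
      cancelled.add(e.turn_id);
    }
    // User side: only finalized ASR, so partial revisions never leak into the record.
    if (e.type === "asr.final" && e.payload.text) {
      lines.push(`user: ${e.payload.text}`);
    }
    // Assistant side: chunks emitted after an interruption were never spoken,
    // so they never reach the transcript.
    if (e.type === "tts.chunk" && e.payload.text && !cancelled.has(e.turn_id)) {
      lines.push(`assistant: ${e.payload.text}`);
    }
  }
  return lines;
}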
Turn detection is not an implementation detail.
It decides:
- when the user is considered done, so the agent is allowed to start
- how quickly an interruption is detected while the agent is speaking
- which version of the transcript the response is actually anchored to
For voice agents, you typically mix:
- voice activity detection (how long has the user been silent)
- ASR endpointing (asr.partial vs asr.final)
- lightweight semantic cues (does the partial transcript read like a finished thought)
Turn detection is where perceived intelligence begins. If you rely only on "final ASR", you'll feel sluggish.
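A toy version of that mix; every threshold and signal name here is an assumption to tune against your own ASR, not a recommendation.

// End-of-turn decision mixing three signals. All thresholds are illustrative.
interface TurnSignals {
  silenceMs: number;     // from voice activity detection
  asrFinal: boolean;     // ASR has emitted asr.final for the current utterance
  partialText: string;   // latest asr.partial text
}

function shouldEndTurn(s: TurnSignals): boolean {
  // Hard signal: the ASR's own endpointing already decided.
  if (s.asrFinal) return true;
  // Soft signal: a short pause plus a transcript that reads like a finished thought.
  const looksComplete = /[.?!]\s*$/.test(s.partialText.trim());
  if (s.silenceMs > 700 && looksComplete) return true;
  // Fallback: a long pause ends the turn regardless of content.
  return s.silenceMs > 1500;
}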
Streaming text is easy. Streaming speech is where mistakes become audible.
Because once you speak, you commit socially — even if you later “correct” yourself.
So you need a simple discipline: speak low-commitment framing immediately, and hold high-commitment facts until they're verified.
Example: "Let me pull up your last invoice" can be spoken the moment the turn ends; the invoice amount waits until the tool result is on the timeline.
This is the same “truth boundary” idea from 2023, now applied to realtime.
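As a sketch, here are the two phases of a spoken answer. Both function parameters (speak, lookupInvoice) are hypothetical stand-ins for your TTS pipeline and tool layer.

// Phase 1 streams a low-commitment preface; phase 2 voices facts only after
// the tool result exists.
async function answerInvoiceQuestion(
  speak: (text: string) => Promise<void>,
  lookupInvoice: () => Promise<{ amount: string; dueDate: string }>,
): Promise<void> {
  // Safe to say even if the lookup fails or the user barges in mid-sentence.
  await speak("Let me pull up your last invoice.");

  // The fact crosses the truth boundary only after verification.
  const invoice = await lookupInvoice();
  await speak(`It's ${invoice.amount}, due on ${invoice.dueDate}.`);
}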
When barge-in happens, you need a deterministic order of operations.
Here’s the cancellation ladder I use:
1. Cut audio output now. Don't wait for the network.
2. Stop appending "assistant text" to the visible transcript until you reconcile what was actually spoken.
3. Cancel generation on the server and expect a cancel.ack.
4. For in-flight tools: if the tool is read-only, cancel aggressively; if it has side effects, cancel only if it supports idempotent cancellation.
5. Append an event recording who interrupted, what was playing, and what was in-flight.
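The same ladder as one function. Every interface here is a stand-in for your own audio player, server connection, and timeline store; the ordering is the only part that matters.

interface Player { stopNow(): void }
interface ServerConn {
  cancelGeneration(turnId: string): Promise<void>;  // resolves when cancel.ack arrives
  cancelTool(callId: string): Promise<void>;
}
interface InflightTool { callId: string; readOnly: boolean; idempotentCancel: boolean }
interface Timeline { append(event: Record<string, unknown>): void }

async function handleBargeIn(
  turnId: string,
  player: Player,
  server: ServerConn,
  tools: InflightTool[],
  timeline: Timeline,
): Promise<void> {
  // 1. Cut audio output locally, synchronously. No network round-trip.
  player.stopNow();

  // 2. Mark the turn cancelled; the transcript projection stops showing
  //    unspoken assistant text from this point on.
  timeline.append({ type: "cancel.requested", turn_id: turnId });

  // 3. Cancel generation server-side and wait for the acknowledgement.
  await server.cancelGeneration(turnId);
  timeline.append({ type: "cancel.ack", turn_id: turnId });

  // 4. Tools: read-only calls die aggressively; side-effect calls only if
  //    they support idempotent cancellation.
  for (const tool of tools) {
    if (tool.readOnly || tool.idempotentCancel) {
      await server.cancelTool(tool.callId);
    }
  }

  // 5. Record what was interrupted so the session stays auditable.
  //    ("interruption.recorded" is an illustrative name, not from the taxonomy above.)
  timeline.append({
    type: "interruption.recorded",
    turn_id: turnId,
    payload: { inflight_tools: tools.map((t) => t.callId) },
  });
}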
If tools can do real work (send, buy, delete, book), real-time makes safety harder.
Your agent needs explicit commit points.
A practical pattern: propose the action as an event, require an explicit confirmation, and only then commit, with an idempotency key so retries and races can't double-commit.
Example:
{
  "type": "action.proposed",
  "payload": {
    "action": "send_email",
    "to": "billing@company.com",
    "subject": "Invoice request",
    "body": "..."
  }
}
Then:
{
  "type": "action.confirmed",
  "payload": { "confirmation_id": "conf_901" }
}
Then:
{
  "type": "action.committed",
  "payload": { "idempotency_key": "send_email:sess_123:turn_7:conf_901" }
}
This is not “compliance theater”.
It’s how you avoid the worst failure mode in real-time agents:
The user interrupts…
but the agent still does the thing.
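Here's a sketch of the gate that prevents exactly that. The store and executor interfaces are assumptions about your own stack; the invariant is that nothing runs without a proposal, a confirmation, and a stable idempotency key.

interface TimelineStore {
  has(type: string, match: Record<string, string>): boolean;  // e.g. has("action.confirmed", {...})
  append(event: Record<string, unknown>): void;
}
interface Executor {
  run(action: string, payload: unknown, idempotencyKey: string): Promise<void>;
}

async function commitAction(
  store: TimelineStore,
  exec: Executor,
  sessionId: string,
  turnId: string,
  confirmationId: string,
  action: { name: string; payload: unknown },
): Promise<void> {
  // Refuse to commit anything that was never proposed or never confirmed.
  if (!store.has("action.proposed", { action: action.name })) {
    throw new Error("no proposal on the timeline");
  }
  if (!store.has("action.confirmed", { confirmation_id: confirmationId })) {
    throw new Error("no confirmation on the timeline");
  }

  // One stable key per (action, session, turn, confirmation): retries and
  // races after a barge-in cannot double-commit.
  const idempotencyKey = `${action.name}:${sessionId}:${turnId}:${confirmationId}`;
  await exec.run(action.name, action.payload, idempotencyKey);

  store.append({ type: "action.committed", payload: { idempotency_key: idempotencyKey } });
}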
“State” becomes unmanageable when everything shares the same memory object.
So split it.
I like four layers:
- Live turn state: what is happening right now.
- Session timeline: append-only events with ordering guarantees, the source of truth.
- Projections: views computed from the timeline.
- Long-term memory: only the durable user facts that survive sessions.
This separation is what prevents "session collapse": transcript, tools, audio, and UI drifting into four different versions of the conversation.
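In types, the split looks roughly like this; every field name is illustrative.

// Four layers, four lifetimes.
interface LiveTurnState {        // discarded when the turn ends
  turnId: string;
  agentSpeaking: boolean;
  inflightToolCalls: string[];
}

interface SessionTimeline {      // append-only, the source of truth for the session
  sessionId: string;
  events: { event_id: string; type: string; ts: string; payload: unknown }[];
}

interface Projections {          // always recomputed from the timeline, never written directly
  visibleTranscript: string[];
  pendingActions: string[];
}

interface LongTermMemory {       // only facts worth keeping across sessions
  userId: string;
  preferences: Record<string, string>;
}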
You can build real-time agents with multiple transports, and the choice is less ideological than it looks: WebRTC when you need low-latency, jitter-tolerant audio; WebSockets as a simple full-duplex control plane; many production systems use both.
The important part is not which transport you pick.
It’s that your session timeline and cancellation semantics are consistent across them.
Failure: the agent keeps talking after the user interrupts.
Root cause: cancellation is best-effort, not end-to-end.
Fix: require cancel.ack events, stop playback locally first, and treat "ghost speech" as a P0 bug.
Failure: the agent answers a question the user never finished asking (see the sketch below).
Root cause: ASR revisions weren't reconciled into the timeline.
Fix: treat ASR partials as revisions, and anchor the assistant response to a finalized turn_id.
Failure: client and server end up showing two different conversations.
Root cause: state blob drift across client/server.
Fix: timeline as source of truth + projections; avoid "hidden state" on the client.
Failure: the user interrupts, but the action still goes through.
Root cause: no commit points / no idempotency / cancellation not supported.
Fix: the action.confirmed → action.committed pattern; side-effect tools must be idempotent and auditable.
Failure: the agent confidently speaks facts that turn out to be wrong.
Root cause: you streamed high-commitment claims before verification.
Fix: two-phase response: preface first, facts only after retrieval/tool confirmation.
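The transcript-revision fix from that list is the easiest one to make concrete. A sketch; the event shapes are illustrative.

interface AsrEvent {
  type: "asr.partial" | "asr.final";
  turn_id: string;
  text: string;
}

// Partials overwrite each other; only asr.final may anchor an assistant response.
class TurnTranscript {
  private latest = new Map<string, { text: string; final: boolean }>();

  apply(e: AsrEvent): void {
    const current = this.latest.get(e.turn_id);
    if (current?.final) return;  // a finalized turn never regresses to a partial
    this.latest.set(e.turn_id, { text: e.text, final: e.type === "asr.final" });
  }

  // The agent responds only to finalized turns, so a late revision can't leave
  // it answering a question the user never finished asking.
  finalizedText(turnId: string): string | null {
    const t = this.latest.get(turnId);
    return t && t.final ? t.text : null;
  }
}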
If you can’t break down latency, you can’t improve it.
Log per-turn timing like a trace:
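A per-turn trace can be as simple as millisecond marks relative to the end of user speech. The stage names here are assumptions; use whatever matches your pipeline.

interface TurnTrace {
  turn_id: string;
  marks: Record<string, number>;  // stage -> ms since the turn ended
}

function startTrace(turnId: string) {
  const t0 = Date.now();
  const trace: TurnTrace = { turn_id: turnId, marks: {} };
  return {
    mark: (stage: string) => { trace.marks[stage] = Date.now() - t0; },
    trace,
  };
}

// Over one turn, mark the moments the user actually feels:
const { mark, trace } = startTrace("turn_7");
mark("asr.final");          // transcript settled
mark("model.first_token");  // first model token received
mark("tts.first_chunk");    // first audio synthesized
mark("playback.start");     // the user hears something
console.log(JSON.stringify(trace));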
What I alert on: cancels without a matching cancel.ack, actions committed after a cancel, and turns that blow the latency budget.
What I review weekly: the per-stage latency breakdown, how often users barge in, and where turn detection misfired.
Streaming and barge-in create parallel realities: what the model generated, what was actually spoken, what the user saw, and what the tools did.
Your architecture must force all of that to converge back into a coherent session record.
And the only reliable way I've found is: an append-only session timeline, projections for every view, explicit commit points, and end-to-end cancellation with acknowledgements.
That’s what turns a “voice demo” into an operable system.
WebRTC 1.0 (W3C Recommendation) — low-latency audio/video + data
The foundational spec for real-time media transport in the browser. If you’re doing voice with tight latency/jitter constraints, this is the “adult” option.
RFC 6455 — The WebSocket Protocol (full-duplex control plane)
A practical backbone for real-time agent control messages: cancel/ack, timeline events, state sync, and barge-in signals—without polling.
Real-time agents increase surface area: more transports, more event schemas, more tools, more vendors to wire together.
So the next step is inevitable:
Standards.
Next month: Standards for the Agent Ecosystem: connectors, protocols, and MCP