Nov 24, 2024 - 14 MIN READ
Real-Time Agents: streaming, barge-in, and session state that doesn’t collapse

Real-time agents aren’t “LLMs with voice.” They’re distributed systems with hard latency budgets. This month is about streaming UX, safe interruption (barge-in), and session state you can actually operate.

Axel Domingues

Tool use gave agents hands.

RAG gave them memory (the grounded kind).

But in November, the hard truth shows up:

Real-time agents are not a model problem.
They’re a latency + state + cancellation problem.

The moment you add streaming audio and allow the user to interrupt mid-sentence, you’re no longer “calling an LLM”.

You’re operating a conversation machine whose parts run at different speeds, fail independently, and must remain coherent anyway.

When I say “real-time” here, I don’t mean “low latency on average”.

I mean:

  • the UX stays responsive while the system is thinking
  • the user can change their mind mid-output
  • and the session doesn’t desync into nonsense when that happens

The new constraint

Latency becomes a product contract, not an optimization ticket.

The new UI primitive

Streaming is the UI.
“Answer at the end” feels broken.

The new safety mechanism

Barge-in (interrupt) is required.
But it’s also a cancellation minefield.

The new failure mode

Session state can split: transcript, tools, audio, UI.
Your job is to force convergence.


The real architecture: three concurrent loops

A real-time agent is three loops running concurrently:

  1. Input loop — capture user intent (audio/text), detect turn boundaries, stream partials.
  2. Thinking loop — plan, retrieve, call tools, generate response (often streaming).
  3. Output loop — speak/render tokens, allow interruption, reconcile partial outputs into a final trace.

If you treat this as a single “request/response”, you’ll ship a demo.

If you treat it as a set of coordinated loops with shared state and cancellation, you can ship a product.
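
Here is a minimal sketch of that shape in TypeScript, assuming hypothetical loop bodies (captureInput, think, speak) that coordinate only through a shared timeline and an abort signal:

// A minimal sketch of the three-loop shape. captureInput, think, and speak are
// hypothetical loop bodies; barge-in is signalled by calling control.abort().

interface TimelineEvent { type: string; ts: number; payload: unknown; }

class SessionTimeline {
  private events: TimelineEvent[] = [];
  append(type: string, payload: unknown = {}): void {
    this.events.push({ type, ts: Date.now(), payload });
  }
  get all(): readonly TimelineEvent[] { return this.events; }
}

type Loop = (timeline: SessionTimeline, signal: AbortSignal) => Promise<void>;

async function runSession(captureInput: Loop, think: Loop, speak: Loop): Promise<void> {
  const timeline = new SessionTimeline();
  const control = new AbortController();

  // The loops run concurrently and fail independently; allSettled keeps one
  // crash from silently killing the session without a trace in the timeline.
  const results = await Promise.allSettled([
    captureInput(timeline, control.signal), // input loop
    think(timeline, control.signal),        // thinking loop
    speak(timeline, control.signal),        // output loop
  ]);

  for (const r of results) {
    if (r.status === "rejected") timeline.append("loop.error", { reason: String(r.reason) });
  }
}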


Streaming UX is not “nice to have” — it’s how you buy time

In non-real-time systems, latency is hidden behind a spinner.

In real-time conversation, a spinner is social death.

Streaming buys you time in three ways:

  • Perceived responsiveness: the user sees progress immediately.
  • Progressive commitment: you can emit “I’m doing X…” before you finish doing X.
  • Early error surfacing: tool failures and missing permissions can be shown before the user waits 10 seconds.

A good streaming UX has two channels:

  • confidence channel (“I’m doing something meaningful”)
  • content channel (the actual answer)

If you stream only content, you risk streaming confident nonsense.
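
As a sketch of the two channels, with a hypothetical streaming model call (generateAnswerTokens): distinct event types let the UI render status and answer differently.

// Sketch: separate event types for the confidence channel and the content
// channel, so the UI can render "working on it" differently from the answer.
type StreamEvent =
  | { channel: "confidence"; text: string }  // "I'm checking your last invoice…"
  | { channel: "content"; token: string };   // the actual answer, token by token

// Hypothetical stand-in for a streaming model client.
declare function generateAnswerTokens(question: string): AsyncIterable<string>;

async function* respond(question: string): AsyncGenerator<StreamEvent> {
  // Confidence channel: cheap, immediate, reversible.
  yield { channel: "confidence", text: "Looking into that…" };

  // Content channel: the part you are socially committed to once spoken.
  for await (const token of generateAnswerTokens(question)) {
    yield { channel: "content", token };
  }
}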

Barge-in is a product requirement (and a cancellation contract)

Voice agents feel magical… until the user can’t interrupt.

In real conversation, interruption is normal:

  • “Wait—no, not that…”
  • “Actually, use my work email.”
  • “Stop, I already know.”

If your agent can’t stop speaking and adapt, it feels less intelligent than a phone menu.

The architectural rule

Every output must be cancellable, and cancellation must be observable.

That means:

  • cancel TTS playback immediately
  • cancel any in-flight model generation
  • cancel in-flight tool calls when possible
  • and record what happened in the session timeline

“Cancel generation” is not enough.

If you cancel the model but a tool call completes and commits side effects (sent email, booked meeting, charged card), you’ve created a trust-killer.

Barge-in forces you to design capability boundaries and commit points.
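
A minimal sketch of that contract, with hypothetical names: cancel() resolves only when the work has actually stopped, and the round trip is recorded on the timeline.

// Sketch: cancellation as an observable contract: cancel() resolves only when
// the work has actually stopped, and the request/ack pair is recorded.
interface Cancellable {
  cancel(reason: string): Promise<void>; // resolves on acknowledgement
}

async function bargeIn(
  outputs: Cancellable[],
  timeline: { append(type: string, payload?: unknown): void },
): Promise<void> {
  timeline.append("cancel.requested");
  await Promise.all(outputs.map((o) => o.cancel("user barge-in")));
  timeline.append("cancel.ack", { outputs: outputs.length });
}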

Session state that doesn’t collapse: build a timeline, not a blob

Most agent demos treat “conversation state” as a blob of messages.

Real-time agents can’t.

Because streaming and barge-in create partial, overlapping reality:

  • partial transcript revisions from ASR (Automatic Speech Recognition)
  • partial model output (tokens)
  • partial tool results
  • partial audio playback
  • cancellations and restarts

So instead of a blob, you need a session timeline: an append-only sequence of events with strong ordering guarantees.

The mental model

A session is a timeline of facts.
UI views are projections of that timeline.

That’s the only way to keep audio, text, tools, and UI aligned.

The session timeline pattern

Model everything as events: input chunks, transcript revisions, tool calls, tool results, output tokens, cancellations, commits. Then derive:

  • the visible transcript
  • the “current intent”
  • the last committed actions
  • the playback state

A practical event model (what I actually ship)

Your event model doesn’t need to be perfect.

It needs to be:

  • append-only
  • idempotent
  • causally linked (what caused what)
  • and replayable (for debugging + evaluation)

Here’s a minimal event envelope:

{
  "session_id": "sess_123",
  "event_id": "evt_456",
  "ts": "2024-11-24T10:12:33.120Z",
  "type": "asr.partial",
  "turn_id": "turn_7",
  "parent_event_id": "evt_455",
  "payload": {}
}

And here are the event types that matter most in real-time agents:

Transcript evolution

audio.chunk, asr.partial, asr.final, turn.end

Agent work

plan.start, retrieve.start, tool.call, tool.result, model.token

Output control

tts.chunk, playback.start, playback.stop, cancel.requested, cancel.ack

Commit points

action.proposed, action.confirmed, action.committed, action.reverted

Notice what’s missing:

There is no “final answer” field.

There is only a timeline that allows you to reconstruct what happened.
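
As a sketch of “projections of the timeline”, here is a reducer that derives the visible transcript from events; the payload fields (text, spoken_text) are assumptions, not part of the envelope above.

// Sketch: the visible transcript as a pure projection of the timeline.
// Payload fields (text, spoken_text) are assumptions about the event bodies.
interface Evt {
  type: string;
  turn_id?: string;
  payload: { text?: string; spoken_text?: string };
}

function visibleTranscript(events: Evt[]): string[] {
  const lines: string[] = [];
  let partial: string | undefined;

  for (const e of events) {
    switch (e.type) {
      case "asr.partial":
        partial = e.payload.text;                 // keep only the latest revision
        break;
      case "asr.final":
        if (e.payload.text) lines.push(`user: ${e.payload.text}`);
        partial = undefined;
        break;
      case "playback.stop":
        // Only what was actually spoken lands in the transcript.
        if (e.payload.spoken_text) lines.push(`agent: ${e.payload.spoken_text}`);
        break;
    }
  }
  if (partial) lines.push(`user (speaking): ${partial}`);
  return lines;
}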


Turn detection: your VAD is part of the product

Turn detection is not an implementation detail.

It decides:

  • when to stop listening
  • when to start thinking
  • when to start speaking
  • and when to yield to the user

For voice agents, you typically mix:

  • VAD (voice activity detection) to detect speech/silence boundaries
  • ASR stability heuristics (how often the partial transcript is changing)
  • explicit user controls (tap-to-talk, push-to-talk, “stop” button)

If you rely only on VAD, you’ll cut people off or wait too long.

If you rely only on “final ASR”, you’ll feel sluggish.

Turn detection is where perceived intelligence begins.
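
A sketch of mixing those signals: end the turn when the VAD reports silence and the ASR partial has stopped changing. The thresholds are placeholders, not recommendations.

// Sketch: end the user's turn when the VAD reports silence AND the partial
// transcript has stopped changing. Thresholds are illustrative only.
interface TurnSignals {
  msSinceLastVoice: number;      // from VAD
  msSincePartialChanged: number; // from the ASR partial stream
  userPressedStop: boolean;      // explicit control always wins
}

function isTurnOver(s: TurnSignals): boolean {
  if (s.userPressedStop) return true;
  const silentLongEnough = s.msSinceLastVoice > 600;
  const transcriptStable = s.msSincePartialChanged > 400;
  return silentLongEnough && transcriptStable;
}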

Streaming output: how to avoid talking yourself into a corner

Streaming text is easy. Streaming speech is where mistakes become audible.

Because once you speak, you commit socially — even if you later “correct” yourself.

So you need a simple discipline:

The two-phase response

  1. Low-commitment preface (safe, reversible)
  2. High-commitment content (only after you’re confident)

Example:

  • “Okay — I’m checking your last invoice…” (safe)
  • “Your October invoice is 82.50€” (committed, must be correct)

This is the same “truth boundary” idea from 2023, now applied to real time.
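
A sketch of that discipline, with hypothetical say and fetchInvoice helpers: the preface streams immediately, the committed claim only after the tool result is in hand.

// Sketch: low-commitment preface first, high-commitment content only once the
// fact is in hand. say and fetchInvoice are hypothetical stand-ins.
async function answerInvoiceQuestion(
  say: (text: string, commitment: "low" | "high") => Promise<void>,
  fetchInvoice: (month: string) => Promise<{ total: string }>,
): Promise<void> {
  await say("Okay, I’m checking your last invoice…", "low");     // safe, reversible
  const invoice = await fetchInvoice("october");
  await say(`Your October invoice is ${invoice.total}.`, "high"); // must be correct
}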


The cancellation ladder: what to cancel first (and what to never auto-cancel)

When barge-in happens, you need a deterministic order of operations.

Here’s the cancellation ladder I use:

Stop playback immediately

Cut audio output now. Don’t wait for network.

Freeze UI projections

Stop appending “assistant text” to the visible transcript until you reconcile what was actually spoken.

Signal cancellation to the model stream

Cancel generation on the server and expect a cancel.ack.

Cancel tool calls only if safe

If the tool is read-only, cancel aggressively.
If the tool has side effects, cancel only if it supports idempotent cancellation.

Record the interruption

Append an event: who interrupted, what was playing, what was in-flight.
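
Here is the ladder as one ordered function, as a sketch; every dependency name is hypothetical, but the ordering and the recorded events mirror the steps above.

// Sketch: deterministic barge-in handling. Each dependency is a hypothetical
// stand-in; the point is the order and the recorded events.
interface BargeInDeps {
  stopPlayback(): void;                        // local and synchronous: cut audio now
  freezeTranscriptProjection(): void;          // stop appending unspoken assistant text
  cancelModelStream(): Promise<void>;          // resolves on the server's cancel.ack
  inFlightTools: { name: string; readOnly: boolean; cancel(): Promise<void> }[];
  timeline: { append(type: string, payload?: unknown): void };
}

async function handleBargeIn(deps: BargeInDeps): Promise<void> {
  deps.timeline.append("cancel.requested");

  deps.stopPlayback();                         // 1. don't wait for the network
  deps.freezeTranscriptProjection();           // 2. freeze UI projections

  await deps.cancelModelStream();              // 3. cancel generation, wait for ack
  deps.timeline.append("cancel.ack", { scope: "model" });

  for (const tool of deps.inFlightTools) {
    if (tool.readOnly) {
      await tool.cancel();                     // 4. read-only: cancel aggressively
    }
    // Side-effecting tools are only cancelled if they support idempotent
    // cancellation (not modeled in this sketch).
  }

  deps.timeline.append("turn.interrupted", {   // 5. record what was in flight
    toolsInFlight: deps.inFlightTools.map((t) => t.name),
  });
}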

Treat cancellation like a first-class metric:

  • % of turns interrupted
  • cancellation latency (request → ack)
  • “ghost speech” incidents (audio continues after cancel)
  • “ghost actions” incidents (tool commits after cancel)

Side effects: design commit points like you’re processing payments

If tools can do real work (send, buy, delete, book), real-time makes safety harder.

Your agent needs explicit commit points.

A practical pattern:

  • LLM proposes an action (structured)
  • UI confirms (or policy auto-confirms only for low-risk scopes)
  • only then do you commit side effects
  • and you record the commit with an idempotency key

Example:

{
  "type": "action.proposed",
  "payload": {
    "action": "send_email",
    "to": "billing@company.com",
    "subject": "Invoice request",
    "body": "..."
  }
}

Then:

{
  "type": "action.confirmed",
  "payload": { "confirmation_id": "conf_901" }
}

Then:

{
  "type": "action.committed",
  "payload": { "idempotency_key": "send_email:sess_123:turn_7:conf_901" }
}
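
A sketch of the commit step, assuming a hypothetical store with a uniqueness check on the idempotency key, so a duplicated confirmation can never commit the side effect twice:

// Sketch: commit a confirmed action exactly once, keyed by its idempotency key.
// CommitStore and sendEmail are hypothetical stand-ins.
interface CommitStore {
  // Returns false if the key was already recorded (already committed).
  recordIfAbsent(idempotencyKey: string): Promise<boolean>;
}

async function commitSendEmail(
  store: CommitStore,
  sendEmail: (to: string, subject: string, body: string) => Promise<void>,
  action: { to: string; subject: string; body: string },
  idempotencyKey: string, // e.g. "send_email:sess_123:turn_7:conf_901"
): Promise<void> {
  // Recording first trades "might need a manual retry" for "never send twice",
  // which is the right trade for irreversible side effects.
  const firstTime = await store.recordIfAbsent(idempotencyKey);
  if (!firstTime) return; // duplicate confirmation or retry: do nothing

  await sendEmail(action.to, action.subject, action.body);
}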

This is not “compliance theater”.

It’s how you avoid the worst failure mode in real-time agents:

The user interrupts…
but the agent still does the thing.


Session state: split it on purpose

“State” becomes unmanageable when everything shares the same memory object.

So split it.

I like four layers:

  • Ephemeral state: audio buffers, partial tokens, in-flight requests. Safe to lose.
  • Timeline: the append-only event log. The source of truth.
  • Projections: derived views such as the visible transcript, current intent, and playback state. Always re-derivable.
  • Committed state: confirmed actions and their idempotency records. Never derived, never dropped.

This separation is what prevents “session collapse”:

  • ephemeral state can glitch without corrupting the session
  • projections can be re-derived after reconnect
  • the timeline gives you debuggability and evaluation
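
As a sketch, here is that split as a single type; the layer names (ephemeral, timeline, projections, committed) are one reasonable labeling and the fields are illustrative.

// Sketch: session state split on purpose. Only timeline and committed need to
// survive a crash; ephemeral can be dropped and projections re-derived.
interface TimelineEvent { type: string; ts: string; payload: unknown; }

interface SessionState {
  ephemeral: {                          // safe to lose at any moment
    audioBuffers: ArrayBuffer[];
    partialTokens: string[];
  };
  timeline: TimelineEvent[];            // append-only source of truth
  projections: {                        // always recomputable from the timeline
    visibleTranscript: string[];
    currentIntent?: string;
    playbackState: "idle" | "speaking" | "interrupted";
  };
  committed: {                          // confirmed side effects, keyed by idempotency key
    actions: Record<string, { type: string; committedAt: string }>;
  };
}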

Transport choices: SSE vs WebSockets vs WebRTC (the practical view)

You can build real-time agents with multiple transports. The choice is less ideological than it looks:

  • SSE: simplest for streaming text tokens server → client (one-way).
  • WebSockets: good for full-duplex text + control messages (cancel, ack, state).
  • WebRTC: best for low-latency audio with jitter handling, but operationally heavier.

In practice, many systems end up hybrid:

  • WebRTC for audio
  • WebSocket for control + timeline events
  • HTTP for slow tool APIs

The important part is not which transport you pick.

It’s that your session timeline and cancellation semantics are consistent across them.
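
For example, a sketch of the control plane: whichever transport carries audio or tokens, the cancel/ack handshake travels as small JSON messages over the WebSocket. The message shapes here are illustrative, not a standard.

// Sketch: cancel/ack over the WebSocket control plane. Audio may be on WebRTC
// and tokens on SSE, but cancellation semantics live in one place.
function requestCancel(ws: WebSocket, turnId: string): Promise<void> {
  return new Promise((resolve) => {
    const onMessage = (ev: MessageEvent) => {
      const msg = JSON.parse(ev.data);
      if (msg.type === "cancel.ack" && msg.turn_id === turnId) {
        ws.removeEventListener("message", onMessage);
        resolve();
      }
    };
    ws.addEventListener("message", onMessage);
    ws.send(JSON.stringify({ type: "cancel.requested", turn_id: turnId }));
  });
}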


Failure modes you must design for (real-time edition)

  • Ghost speech: audio keeps playing after the user interrupts because cancel never reached playback.
  • Ghost actions: a tool commits side effects after the interruption. This is the trust-killer.
  • Transcript/output mismatch: the UI shows assistant text that was never actually spoken.
  • Reconnect desync: the client resumes with stale projections instead of re-deriving them from the timeline.
  • Dead air: the user hears nothing while tools run, and drops the session.

Observability: your latency budget needs a breakdown chart

If you can’t break down latency, you can’t improve it.

Log per-turn timing like a trace:

  • input capture → turn end
  • turn end → model first token
  • model first token → tool call
  • tool call → tool result
  • tool result → first committed claim
  • first committed claim → first audio sample
  • cancel.requested → cancel.ack
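
A sketch of computing that breakdown from the timeline itself, using the event types from earlier and the ts timestamp from the event envelope:

// Sketch: a per-turn latency breakdown derived from timeline events.
interface TimedEvent { type: string; turn_id?: string; ts: string; }

function latencyBreakdown(events: TimedEvent[], turnId: string) {
  const at = (type: string): number | undefined => {
    const ev = events.find((e) => e.type === type && e.turn_id === turnId);
    return ev ? Date.parse(ev.ts) : undefined;
  };
  const delta = (from?: number, to?: number): number | undefined =>
    from !== undefined && to !== undefined ? to - from : undefined;

  return {
    turnEndToFirstToken: delta(at("turn.end"), at("model.token")),    // thinking latency
    toolCallToResult: delta(at("tool.call"), at("tool.result")),      // tool latency
    resultToFirstAudio: delta(at("tool.result"), at("playback.start")),
    cancelToAck: delta(at("cancel.requested"), at("cancel.ack")),     // cancel latency
  };
}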

What I alert on

  • 95th percentile time-to-first-token
  • cancel latency
  • “ghost action” rate
  • reconnect/resume failures

What I review weekly

  • interruption rate by feature
  • tool failure taxonomy
  • sessions with transcript/output mismatch
  • user drops during “dead air”

The punchline: real-time agents are a convergence problem

Streaming and barge-in create parallel realities:

  • what the user said
  • what the model thought they said
  • what the agent started to say
  • what was actually spoken
  • what tools actually did

Your architecture must force all of that to converge back into a coherent session record.

And the only reliable way I’ve found is:

  1. timeline as source of truth
  2. projections for UI
  3. explicit commit points
  4. end-to-end cancellation with acknowledgements

That’s what turns a “voice demo” into an operable system.


Resources

WebRTC 1.0 (W3C Recommendation) — low-latency audio/video + data

The foundational spec for real-time media transport in the browser. If you’re doing voice with tight latency/jitter constraints, this is the “adult” option.

RFC 6455 — The WebSocket Protocol (full-duplex control plane)

A practical backbone for real-time agent control messages: cancel/ack, timeline events, state sync, and barge-in signals—without polling.

WHATWG HTML — Server-Sent Events (SSE) / EventSource

The simplest “stream tokens to UI” primitive over HTTP. Great for one-way streaming (server → client) when you don’t need full duplex.

OpenTelemetry — GenAI semantic conventions (tracing/metrics)

A shared vocabulary for instrumenting model streams, tool calls, and latency breakdowns—so “time-to-first-token” and “cancel latency” become measurable.


What’s Next

Real-time agents increase surface area:

  • more sessions
  • more integrations
  • more tools
  • more security boundaries
  • more protocol choices

So the next step is inevitable:

Standards.

Next month: Standards for the Agent Ecosystem: connectors, protocols, and MCP
