
Voice turns LLMs into real-time systems. This month is about building voice agents that meet latency budgets, degrade safely, and hand off to humans without losing context—or trust.
Axel Domingues
Text chat lets you hide a lot of sins.
Voice does not.
The moment you ship a voice agent, your system becomes a real-time pipeline: audio transport → streaming ASR → model + tools → streaming TTS → playback.
So this month I’m treating voice agents like what they actually are:
distributed, stateful, latency-budgeted systems.
Not “a model with a microphone.”
The focus here is operational architecture:
- latency budgets
- failure modes
- caching
- reliability controls
- and human handoff as a first-class workflow
The goal this month
Build voice agents that feel responsive, stay safe under failure, and are operable by on-call humans.
The hard truth
Voice is a latency game.
If you can’t hit the budget, no amount of “smart” will save you.
The reliability lens
Treat every connector and model call as unreliable I/O and design graceful degradation.
The non-negotiable
Handoff is not a failure mode.
It’s a product feature with architecture behind it.
Text agents fail like software. Voice agents fail like phone calls.
When an LLM is “thinking” in text, the user waits. When an agent pauses in voice, the user assumes the line is dead.
This changes everything.

A voice conversation has an intuitive rhythm: quick acknowledgments, natural turn-taking, and almost no tolerance for dead air.
If you don’t design for this rhythm, you will ship something that feels broken even when it is “accurate.”
Your exact budget depends on your domain, but a good starting point is simple: the user should never sit in silence long enough to wonder whether the call dropped.
If you need more time, speak:
- “Got it — let me check that.”
- “One moment while I look that up.”
- “I’m pulling your account now.”
Voice latency is the sum of many parts:
- Transport: WebRTC jitter, mobile radio variance, buffering, and codec decisions can add real delay.
- ASR: batch ASR is slow. Streaming ASR with partial hypotheses is what makes real-time interaction possible.
- Model + connectors: the model is rarely the only call. The slowest connector dominates your user's perception.
- TTS: high-quality voices cost time. Streaming TTS matters as much as streaming tokens.
- Glue: serialization, logging, tracing, prompt assembly, and network hops can be death-by-a-thousand-cuts.
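To keep that budget honest, measure each stage per turn. A minimal sketch in Python, assuming you can timestamp stage boundaries yourself; the budget numbers and names are illustrative:

```python
import time
from contextlib import contextmanager

# Illustrative per-turn budget, in seconds; tune to your domain.
BUDGET = {"asr": 0.3, "model": 0.7, "tools": 0.5, "tts": 0.3}

class TurnTimings:
    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            self.stages[name] = time.monotonic() - start

    def over_budget(self):
        # Stages that blew their slice of the budget on this turn.
        return {n: t for n, t in self.stages.items() if t > BUDGET.get(n, float("inf"))}

# Usage:
# timings = TurnTimings()
# with timings.stage("tools"):
#     result = call_crm()          # hypothetical connector call
# print(timings.over_budget())
```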
If you want voice to feel natural, you need simultaneous input and output: the agent keeps listening while it speaks, detects barge-in, and can cancel itself mid-sentence.
A good mental model is a duplex pipeline with explicit state machines.

Most teams only build one. Operable systems build all four (the first is sketched in code after the list):
Conversation state machine
Listening / Thinking / Speaking / Handoff / Ended, with explicit transitions.
Audio state machine
Capture / VAD / stream / buffer / drop / reconnect, with jitter handling.
Tooling state machine
Plan / call / retry / timeout / fallback / redact / audit, per connector.
Safety state machine
Policy checks, injection filters, action gating, and “safe response” behaviors.
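As a sketch of the first one: a conversation state machine with its transitions made explicit rather than implied. The code is illustrative, not tied to any framework:

```python
from enum import Enum, auto

class ConvState(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    HANDOFF = auto()
    ENDED = auto()

# Allowed transitions; anything outside this table is a bug, not an edge case.
TRANSITIONS = {
    ConvState.LISTENING: {ConvState.THINKING, ConvState.HANDOFF, ConvState.ENDED},
    ConvState.THINKING: {ConvState.SPEAKING, ConvState.HANDOFF, ConvState.ENDED},
    ConvState.SPEAKING: {ConvState.LISTENING, ConvState.HANDOFF, ConvState.ENDED},  # barge-in returns to LISTENING
    ConvState.HANDOFF: {ConvState.ENDED},
    ConvState.ENDED: set(),
}

class Conversation:
    def __init__(self):
        self.state = ConvState.LISTENING

    def transition(self, new_state: ConvState, reason: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise RuntimeError(f"illegal transition {self.state.name} -> {new_state.name} ({reason})")
        # Log every transition with a reason so traces stay explainable.
        print(f"{self.state.name} -> {new_state.name}: {reason}")
        self.state = new_state
```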
Voice makes implicit state visible.
Voice agents must handle two hard realities: users interrupt mid-sentence, and dependencies stall, slow down, or fail mid-turn.
The reliability move is to treat every stage as cancelable and bounded.
When the user starts speaking while the agent is speaking, cancellation has to propagate through the whole chain: UI → audio output → model stream → tool calls.

If cancellation only happens in the UI, you'll still stream tokens nobody reads, run tool calls nobody wants finished, and synthesize audio nobody hears.
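As a concrete sketch, assuming the model stream, tool calls, and TTS each run as asyncio tasks you own; the `Turn` class and helper names are mine, not a framework API:

```python
import asyncio

class Turn:
    """Everything the agent is doing for the current turn, held as cancelable tasks."""

    def __init__(self):
        self.tasks: list[asyncio.Task] = []

    def track(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self.tasks.append(task)
        return task

    async def cancel_all(self) -> None:
        # Barge-in: stop audio out, the model stream, and in-flight tool calls together.
        for task in self.tasks:
            task.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)

# On a VAD "user started speaking" event while the agent is speaking:
# await current_turn.cancel_all()   # then flush the audio output buffer too
```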
For every external call, define a hard timeout, a bounded retry budget, and a fallback behavior: a cached result, a safe phrase, or a handoff.
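For example, a minimal sketch of such a bounded call, assuming asyncio; the connector and fallback in the usage comment are hypothetical:

```python
import asyncio

async def bounded_call(call, *, timeout_s: float, retries: int, fallback):
    """Run an unreliable connector call with a hard timeout and a bounded retry budget."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == retries:
                break
    # Controlled fallback: a cached result, a safe phrase, or a handoff trigger.
    return fallback()

# Usage (hypothetical connector):
# result = await bounded_call(
#     lambda: crm.lookup(account_id),
#     timeout_s=1.5, retries=1,
#     fallback=lambda: {"status": "degraded", "say": "Let me connect you with someone."},
# )
```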
Prefer a controlled fallback over silent correctness.
Caching in voice is not just “performance optimization.” It’s a product requirement.
You cache for three reasons: to hit the latency budget, to keep behaving when dependencies degrade, and to avoid making the same slow call twice. Four caches do most of the work:
Prompt assembly cache
Reuse system prompts, policy blocks, tool schemas, and static context across turns.
Semantic response cache
If the user asks a repeatable question, serve a validated answer template fast.
TTS audio cache
Cache common phrases (“One moment…”, “I can help with that…”) as audio snippets.
Connector result cache
Cache tool results with strict TTLs (and per-tenant scoping) to avoid repeated slow calls.
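As a sketch of that last one: a connector-result cache where the tenant is part of the key and the TTL is enforced on read. It's in-memory here purely for illustration; in production you'd likely back it with something shared:

```python
import time

class ScopedTTLCache:
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._store: dict[tuple, tuple[float, object]] = {}

    def _key(self, tenant_id: str, tool: str, args_key: str) -> tuple:
        # Tenant is part of the key, never a filter applied afterwards.
        return (tenant_id, tool, args_key)

    def get(self, tenant_id: str, tool: str, args_key: str):
        entry = self._store.get(self._key(tenant_id, tool, args_key))
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl_s:
            return None  # expired: strict TTL, no stale reads
        return value

    def put(self, tenant_id: str, tool: str, args_key: str, value) -> None:
        self._store[self._key(tenant_id, tool, args_key)] = (time.monotonic(), value)
```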
What you cache depends on risk: static prompts and canned phrases are easy; tool results tied to a person are not.

If you cache sensitive tool results, you need per-tenant scoping, strict TTLs, redaction before anything is written, and an audit trail for reads.
In voice, people overshare.
Your caches will become your breach surface.
A voice agent that waits for the model to finish is dead on arrival.
You want first audio out fast: tokens streaming into TTS as they arrive, and tools running while the agent is already acknowledging the request.
But you also want correctness.
So you need a strategy for partial truth.
Split output into two layers: low-risk scaffolding you can speak immediately, and verified facts you only speak once a tool or deterministic source confirms them.

Example: "I'm pulling up your order now" can be spoken the moment intent is clear; the delivery date waits until the order system returns it.
This gives you responsiveness without hallucinating facts.
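A minimal sketch of that split, with illustrative names; the point is that nothing moves from pending to spoken without an explicit confirm step:

```python
from dataclasses import dataclass, field

@dataclass
class SpeechPlan:
    scaffolding: list[str] = field(default_factory=list)    # low-risk phrases, safe to speak now
    pending_facts: list[str] = field(default_factory=list)  # claims waiting on verification
    verified_facts: list[str] = field(default_factory=list) # claims a tool has confirmed

    def speak_now(self) -> list[str]:
        # Speak acknowledgments immediately; only verified claims ride along.
        return self.scaffolding + self.verified_facts

    def confirm(self, fact: str) -> None:
        # Call this when a tool result or deterministic source backs the claim.
        self.pending_facts.remove(fact)
        self.verified_facts.append(fact)

# plan = SpeechPlan(scaffolding=["One moment while I look that up."],
#                   pending_facts=["Your order ships Friday."])
# plan.confirm("Your order ships Friday.")  # only after the order API says so
```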
Most teams treat human handoff like a UI button.
Operational teams treat it like a workflow with guarantees.
Handoff has three requirements: the human inherits enough context to act, the agent stops taking actions the moment handoff starts, and the system records why the handoff happened.
A good handoff packet is small, structured, and safe.
It usually includes who the caller is (and what has been verified), what they are trying to do, what the agent has already done, the reason code for the handoff, and anything risky that has been said or promised so far.
If the human can’t understand the situation in 10 seconds, your handoff is not real.
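As a sketch, a handoff packet can be a small structured record; the field names below are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPacket:
    reason_code: str                # e.g. "USER_REQUESTED_HUMAN", "REPEATED_TOOL_FAILURE"
    caller_summary: str             # who they are and what has been verified
    intent: str                     # what they are trying to accomplish
    actions_taken: list[str] = field(default_factory=list)  # tool calls and their outcomes
    open_risks: list[str] = field(default_factory=list)     # anything the human must know first
    trace_ref: str = ""             # pointer to the full conversation trace, not raw audio
```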
You should have explicit policies, not vibes:
- The user asks for a human (or is clearly frustrated): immediate handoff. Don't argue. Voice is not the place for persuasion.
- The request is high-risk or unverifiable: if you can't verify identity or confirm the invariant, fail closed and route to a human.
- The same error repeats: if you hit the same error twice, stop retrying and hand off with a reason code.
- Latency budgets keep being blown: if tool calls repeatedly violate budgets, the system should degrade toward human assistance.
This is the minimum set of design moves I’d require before calling a voice agent “production.”
Track: per-stage latency (transport, ASR, model, tools, TTS) and end-to-end silence time.
Support: barge-in, with cancellation that propagates from the UI through audio output, the model stream, and tool calls.
Implement: hard timeouts, bounded retries, and an explicit fallback for every connector.
Speak early with low-risk phrases. Delay facts until verified by tools or deterministic sources.
Produce a handoff packet. Stop side effects. Expose reason codes. Ensure continuity for the human.
When dependencies degrade: fall back to cached results, safe phrases, or a handoff, never to open-ended silence.
If you can’t replay what happened, you can’t improve it.
A useful trace looks like this:
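The field names below are illustrative, not a schema you must adopt; what matters is that it's one structured record per turn rather than free-form log lines:

```python
turn_trace = {
    "turn_id": "t-0042",
    "timings_ms": {"asr": 240, "model_first_token": 380, "tools": 620, "tts_first_audio": 210},
    "silence_ms": 780,          # user stopped speaking -> first audio out
    "barge_in": False,
    "tool_calls": [
        {"name": "crm.lookup", "status": "ok", "latency_ms": 620, "cache_hit": False},
    ],
    "spoken": {"scaffolding": 1, "verified_facts": 1, "unverified_facts": 0},
    "outcome": "resolved",      # or "handoff", with a reason_code attached
}
```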
The metric that matters
Silence time.
Measure how long the user experiences no audio output after they stop speaking.
This is the “rage quit” predictor.
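Computing it is two timestamps and a tail percentile; a minimal sketch, with event names standing in for whatever your pipeline actually emits:

```python
import statistics

def silence_ms(user_stopped_speaking_ts: float, first_audio_out_ts: float) -> float:
    """Silence the user experienced on one turn, in milliseconds."""
    return max(0.0, (first_audio_out_ts - user_stopped_speaking_ts) * 1000)

def p95(samples: list[float]) -> float:
    # Alert on the tail, not the mean: averages hide the turns that feel dead.
    return statistics.quantiles(samples, n=20)[-1]
```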
- The agent keeps talking over the user. Cause: missing barge-in cancellation, slow VAD, or buffering. Fix: lower the output buffer, cancel TTS immediately, improve VAD tuning.
- The line goes dead mid-call. Cause: soft timeouts without hard timeouts. Fix: enforce hard timeouts and route to fallback/handoff.
- Responses feel sluggish even when they're correct. Cause: waiting for full sentences before speaking. Fix: streaming + a two-layer speech plan + cached scaffolding.
- The agent states facts that turn out to be wrong. Cause: speaking high-risk facts before verification. Fix: tool-backed facts only, or explicitly qualify uncertainty.
- Incidents you can't explain afterwards. Cause: no conversation trace or no reason codes. Fix: structured logs, stage timings, and deterministic handoff packets.
WebRTC 1.0 (W3C) — real-time media in browsers
The canonical spec for browser-based real-time audio/video: the foundation for low-latency voice transport, jitter handling expectations, and integration patterns.
RFC 8825 — Overview of WebRTC
A practical map of the WebRTC protocol suite (ICE/DTLS/SRTP/data channels), useful when you’re debugging “why is latency/cancellation weird?” across layers.
Do I need streaming everywhere?

If you want natural voice UX, yes, at least for the user-facing edges: streaming ASR in, streaming tokens out, streaming TTS back.

Internally, you can still run batch tool calls, but the user should experience continuous feedback.
Is hallucination a bigger problem in voice?

Usually, yes.

Voice gives you less room to display uncertainty, sources, or disclaimers. That means you should delay facts until a tool or deterministic source confirms them, qualify uncertainty out loud, and hand off when you can't verify.
What is safe to cache?

Cache what is static (prompts, schemas, policy blocks), repeatable (common phrases and questions), and low-risk.

Avoid caching personalized or sensitive content unless you have explicit compliance and audit controls.
How do you hand off without losing the user's trust?

By treating handoff as an engineered workflow: a structured packet, stopped side effects, a reason code, and a human who can pick up the thread in seconds.
October was about making voice agents operable: latency budgets, explicit state machines, cancelable stages, scoped caches, two-layer speech, and handoff as a first-class workflow.
Next month I zoom out from voice into the thing that makes all agent products scalable:
The Connector Ecosystem: MCP adoption patterns, versioning, and governance
Because once you can run one agent safely… the next challenge is running a hundred connectors without turning your platform into chaos.