
Tool use was the warm-up. Computer-use agents can click, type, and navigate real UIs — which means mistakes become side effects. This article turns “agent can drive a screen” into an architecture you can defend: isolation, action gating, verification, and auditability.
Axel Domingues
Tool use taught us an important lesson:
LLMs are great at choosing actions — and terrible at guaranteeing outcomes.
In 2024, “tool use” mostly meant calling APIs.
In January 2025, the frontier shifts again:
agents that operate computers.
They don’t call your API.
They click your UI.
They type into forms.
They download files.
They open tabs.
And when they fail… they fail with side effects.
If you can replay it like a screen recording, it’s in scope.
An API tool is structured.
A UI is unstructured, full of untrusted text and hidden traps: injected instructions in page content, look-alike buttons, surprise dialogs.
If you treat UI actions like “just another tool”, you’ll ship a system that is fragile, unsafe, and impossible to audit.
The core risk
UI actions cause irreversible side effects.
The core constraint
The UI is untrusted input (including what the agent reads).
The core design move
Put a sandbox + policy firewall between the model and the screen.
The core success metric
You can replay and explain every action that happened.
Computer-use agents look deceptively simple: take a screenshot, pick an action, execute it, repeat.
The production reality is: you need control points.
Because “act” is not safe unless you can also verify it, interrupt it, and explain it afterwards.

Without those control points, you don’t have an agent.
You have a liability.
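A minimal sketch of that loop with the control points made explicit. Every object here (model, sandbox, gateway, audit log) is an illustrative stand-in, not a specific library’s API:

```python
# Sketch: observe → decide → act, with gate / verify / audit hooks.
# All collaborators are illustrative stand-ins injected by the caller.

def run_task(goal, model, sandbox, gateway, audit_log, max_steps=50):
    for step in range(max_steps):
        screenshot = sandbox.capture_screen()            # observe
        action = model.propose_action(goal, screenshot)  # decide

        verdict = gateway.check(action)                  # gate before acting
        audit_log.record(step, action, verdict, screenshot)
        if verdict == "needs_human":
            raise RuntimeError("escalate: human approval required")
        if verdict == "deny":
            continue                                     # refusal goes back to the model

        result = sandbox.execute(action)                 # act
        if not sandbox.verify(action):                   # verify, don't trust the model
            raise RuntimeError("verification failed: stop, don't improvise")
        if result.task_complete:
            return result
    raise RuntimeError("step budget exhausted")
```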
Here’s the design lens that scales:
The model is not allowed to touch the UI directly.
It must request actions through a UI Action Gateway.
That gateway is where you enforce: typed action schemas, domain allowlists, risk-tiered approvals, rate limits, and audit logging.
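A sketch of the gateway’s decision function, assuming the typed primitives and risk tiers described later in this article; the policy values are illustrative:

```python
from urllib.parse import urlparse

# Sketch of a UI Action Gateway check. ALLOWED_DOMAINS and the action
# fields (kind, url, risk) are illustrative, not a real library's API.

ALLOWED_DOMAINS = {"app.example.com"}   # hypothetical allowlist
BLOCKED_KINDS = {"copy", "paste"}       # often disabled outright

def check(action) -> str:
    """Return 'allow', 'deny', or 'needs_human' for a proposed action."""
    if action.kind in BLOCKED_KINDS:
        return "deny"
    if action.kind == "navigate":
        host = urlparse(action.url).hostname or ""
        if host not in ALLOWED_DOMAINS:
            return "deny"               # untrusted destination
    if action.risk == "irreversible":
        return "needs_human"            # hard to undo => hard to do
    return "allow"
```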

A minimal reference architecture: model → UI Action Gateway → sandboxed browser or VM, with an audit log recording every action and observation.
Not all sandboxes are equal. The right choice depends on what you’re automating.
Best for: web apps, structured navigation, low-risk tasks
How it works: run Chromium in a container (Playwright/Selenium), optionally expose VNC for debugging.
Pros: fast to start, cheap to run, easy to pool and scale.
Cons: the host kernel is shared, so isolation is weaker; risky for hostile content or multi-tenant use.
Controls to add: a domain allowlist, download quarantine, and network egress filtering (see the sketch below).
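One way to approximate those controls with Playwright’s request routing. The domains here are illustrative, and in-process routing is a convenience control, not a security boundary: real egress filtering still belongs at the container’s network layer.

```python
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

ALLOWED = {"app.example.com", "cdn.example.com"}  # illustrative allowlist

def gate_request(route):
    host = urlparse(route.request.url).hostname or ""
    if host in ALLOWED:
        route.continue_()
    else:
        route.abort()  # block everything off the allowlist

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(accept_downloads=False)  # quarantine downloads by default
    context.route("**/*", gate_request)                    # crude egress allowlist
    page = context.new_page()
    page.goto("https://app.example.com/")
```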
Best for: higher-risk web automation, multi-tenant systems
How it works: the agent drives a browser inside a microVM with hardened isolation.
Pros: much stronger isolation than a container, with startup overhead far below a full VM.
Cons: more moving parts to operate, and a smaller tooling ecosystem than plain containers.
Controls to add: the same network controls as Option A, plus per-tenant images so no state survives between sessions.
Best for: desktop apps, legacy workflows, “human UI” systems
How it works: each session gets a VM (Windows/Linux) with a remote desktop surface. Agent actions are mouse/keyboard events.
Pros: runs anything a human can run, including desktop and legacy apps.
Cons: heavy to provision, slow to recycle, and expensive to keep warm.
Controls to add: session recording, clipboard restrictions, and no shared drives or host folders (an in-VM executor sketch follows).
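On the VM side, a small executor can translate gateway-approved primitives into raw input events. A sketch using pyautogui, assuming it runs inside the VM next to the remote desktop surface; the action format is illustrative:

```python
import pyautogui  # runs inside the VM, driving its own display

# Sketch: turn gateway-approved primitives into mouse/keyboard events.
# The dict format mirrors the typed primitives described below.

def execute(action: dict) -> None:
    kind = action["kind"]
    if kind == "click":
        pyautogui.click(x=action["x"], y=action["y"])
    elif kind == "type":
        pyautogui.write(action["text"], interval=0.03)  # human-ish typing pace
    elif kind == "scroll":
        pyautogui.scroll(action["amount"])
    else:
        raise ValueError(f"unsupported action: {kind}")  # fail closed
```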
In API tool use, the “tool” enforces invariants.
In UI automation, the tool is the mouse.
So you need a layer that acts like a firewall:
Don’t accept free-form “do the thing” commands.
Accept typed, inspectable primitives:
- click(selector | coordinates)
- type(text, target)
- scroll(amount)
- navigate(url)
- wait(condition, timeout)
- download(file)
- copy/paste (often disabled)
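As Python dataclasses, a subset of those primitives might look like this; the field names are illustrative, and the point is that every action is a value the gateway can validate, log, and replay:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Click:
    selector: Optional[str] = None  # prefer selectors; coordinates as fallback
    x: Optional[int] = None
    y: Optional[int] = None

@dataclass(frozen=True)
class Type:
    text: str
    target: str

@dataclass(frozen=True)
class Navigate:
    url: str

@dataclass(frozen=True)
class Wait:
    condition: str
    timeout_s: float

Action = Union[Click, Type, Navigate, Wait]  # scroll/download omitted for brevity
```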
UI steps are not equal.
A simple risk model pays for itself: classify each step as read-only, reversible, or irreversible.
Then enforce: automatic execution for read-only steps, outcome verification for reversible ones, and human approval for anything irreversible.
A practical rule
If the step is hard to undo, make it hard to do.
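That rule translates directly into policy. A sketch, with illustrative tier assignments:

```python
from enum import Enum

class Risk(Enum):
    READ_ONLY = "read_only"        # scroll, wait, screenshot
    REVERSIBLE = "reversible"      # typing into a draft, navigation
    IRREVERSIBLE = "irreversible"  # submit, delete, send, pay

POLICY = {
    Risk.READ_ONLY: "auto",
    Risk.REVERSIBLE: "auto_with_verification",
    Risk.IRREVERSIBLE: "human_approval",  # hard to undo => hard to do
}

def gate(risk: Risk) -> str:
    return POLICY[risk]
```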
The common failure mode is trusting the model’s own report that a step worked.
Verification means checking observable outcomes against explicit criteria.
Examples of verification signals: the URL you expected, a DOM element that only appears on success, a confirmation message, the hash of a downloaded file.
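A sketch of an explicit post-action check with Playwright, assuming a hypothetical invoice-submission step; the URL shape and selector are illustrative:

```python
from playwright.sync_api import Page, TimeoutError as PlaywrightTimeout

def verify_invoice_submitted(page: Page, invoice_id: str) -> bool:
    # Signal 1: the app navigated where success navigates.
    if f"/invoices/{invoice_id}" not in page.url:
        return False
    # Signal 2: a success marker is actually present in the DOM.
    try:
        page.wait_for_selector("text=Invoice submitted", timeout=5_000)
    except PlaywrightTimeout:
        return False
    return True
```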
Computer-use agents are stateful by nature: cookies, logged-in sessions, downloaded files, open tabs.
If you’re multi-tenant, state is a data leak vector.
Design rules: one sandbox per task, no state reuse across tenants, and a full scrub (cookies, downloads, clipboard) when the session ends.
The clean pattern: create the session fresh, run the task, destroy the session, as in the sketch below.
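A sketch of that pattern with Playwright, where the browser context (cookies, storage, cache) lives exactly as long as the task:

```python
from contextlib import contextmanager
from playwright.sync_api import sync_playwright

@contextmanager
def ephemeral_session():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()  # fresh cookies, storage, cache
        try:
            yield context.new_page()
        finally:
            context.close()              # state dies with the task
            browser.close()
```

Used as `with ephemeral_session() as page: ...`, so even a crashed task cannot leak state into the next one.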
A solid approach: assume every step can fail, and decide in advance what happens when it does.
Production means designing that failure path up front.
Your goal is not to make failure impossible.
Your goal is to make failure containable and recoverable.
Here’s the checklist I use when teams ask: “can we ship a computer-use agent?”
- Isolation: every session runs in a disposable sandbox.
- Gating: every action passes through a gateway that can deny or escalate it.
- Verification: every risky step has an explicit, observable success check.
- State: nothing survives a session; nothing is shared across tenants.
- Auditability: you can replay and explain every action that happened.
Playwright Documentation
A practical browser automation toolkit (great baseline for “web-only” computer-use agents).
Selenium WebDriver
The long-running standard for UI automation; useful when you need broad compatibility.
Should you just script stable workflows with Playwright or Selenium instead? For stable workflows, you should.
Computer-use agents are most valuable when there is no API, the UI keeps changing, or the workflow is too long-tail to script by hand.
But the moment you add an LLM, you must add the safety architecture.
Is a VM always safer than a container? Not automatically.
A badly configured VM can leak more than a hardened container. The correct question is:
Where is my isolation boundary, and can I enforce it consistently?
MicroVMs often land in the sweet spot: stronger isolation than containers, less overhead than full VMs.
What about CAPTCHAs? Treat them as a boundary between automation and humans.
Common patterns: pause the session and hand the step to a human, or provision authenticated sessions out of band so the agent never hits the challenge.
If you’re trying to “beat” CAPTCHA automatically, you’re probably building the wrong product.
What should you record? At minimum: every gateway decision, every action with its parameters and timestamp, and a screenshot before each step.
Video is optional — but when something goes wrong, it saves hours.
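A sketch of that minimal record: an append-only JSONL file, one line per gateway decision, with a content hash linking each entry to its screenshot. Paths and field names are illustrative.

```python
import hashlib
import json
import time
from pathlib import Path

LOG = Path("audit/actions.jsonl")

def record(step: int, action: dict, verdict: str, screenshot: bytes) -> None:
    LOG.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": time.time(),
        "step": step,
        "action": action,
        "verdict": verdict,
        "screenshot_sha256": hashlib.sha256(screenshot).hexdigest(),
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")  # replayable, diffable, greppable
```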
Computer-use agents make a promise:
“Give me a goal — I’ll operate the software for you.”
In production, that promise only holds if you can answer: what did it do, why did it do it, and can you prove it?
Next month, we move from “agent capability” to “agent governance”:
Agent Evals as CI: from prompt tests to scenario harnesses and red teams
Because if your agents can click the world…
you need a test harness that can break them before users do.