
Tool use is not “prompting better” — it’s turning an LLM into a controlled orchestrator of deterministic systems. This month is about the architecture, safety boundaries, and eval discipline that make agents shippable.
Axel Domingues
September was about grounding.
How to answer with evidence, not vibes.
October is where you let the model touch the world.
That’s the moment a demo turns into a system.
Because the minute an LLM can:
…it stops being “a chat model.”
It becomes a workflow engine.
And that changes everything about how you design it.
I’m talking about a very specific engineering reality:
- A loop where a probabilistic model proposes actions, deterministic tools execute them, and the system repeats until it reaches a goal (or fails safely).
The shift
From “generate text” to choose + call tools.
The risk
From “wrong words” to real side effects.
The job
Design control planes: permissions, sandboxing, telemetry, evals.
The win
Deterministic systems do the work; the model routes and composes.
Let’s define terms the way production systems need them:
Tool use is a capability.
An agent is a program built around that capability.
If you don’t separate those ideas, you’ll build the wrong thing.
The most useful mental model is a three-layer system:

If you only remember one thing:
The model should never be your control plane.
Your orchestrator is the control plane.
LLMs are good at:
LLMs are not good at:
Tool use is how you combine both worlds:
This is why tool use is not “prompting.”
It’s system decomposition.
Not all tools are equal. In production, treat them as different risk classes.
Read tools
Search, RAG retrieval, DB reads. Low risk. Great starter set.
Compute tools
Pure functions: pricing, validation, formatting. Low side-effect risk, high correctness value.
Write tools
Create/update/delete. High risk. Requires policy + idempotency + audit trails.
External tools
Third-party APIs. Adds reliability and security surfaces (timeouts, quotas, data leakage).
Why the taxonomy matters: it should change your default permissions.
Start read-only. Earn write privileges with telemetry + evals.
A minimal agent loop looks like this:
This sounds obvious — until you ship it.
The hard part isn’t the loop.
The hard part is everything around it:
So instead of “agent = the model”, treat agent = the runtime.
The model calls at most N tools, then answers. Great for:
The model alternates between thinking and acting. Great for:
Split the system:
Great for:
Planner/executor will feel familiar.
Agents are just workflow engines with a probabilistic planner.
An “agent” that is just “call the model repeatedly” is not an agent.
It’s a runaway loop.
Here are the non-negotiables your orchestrator must own:
Agents fail in ways that feel… eerily like RL training failures.
Not because the math is the same — but because you built a feedback loop.
Here are the ones you’ll see first.
Symptom: model invents a tool or parameters.
Cause: the model is optimizing for “continue the conversation” and tool names look like plausible tokens.
Design response:
Symptom: a retrieved doc says “ignore previous instructions and do X” and the model obeys.
Cause: the model can’t reliably separate instructions from data.
Design response:
Symptom: model keeps calling tools, re-checking, or “thinking” forever.
Cause: no strong stop condition + model uncertainty produces more steps.
Design response:
Symptom: perfect in happy-path examples, chaotic in production.
Cause: you evaluated on vibes, not distributions.
Design response:
Symptom: agent updates the wrong record, sends the wrong message, or triggers an expensive workflow.
Cause: write tools are too easy to call and too hard to rollback.
Design response:
If you’re building agents that write to systems, you’re back in 2022 territory:
Distributed systems, retries, and invariants.
The agent will:
So write tools must be designed like financial operations:
Only your systems can.
Design write tools so “safe retry” is the default behavior.
Agents are not debugged with “the final answer.”
They’re debugged with the trajectory.
Log these as structured events:
Then build a “flight recorder” view:
You’re just watching it happen.
For RAG we learned: evaluate grounding.
For agents: evaluate action selection and outcome correctness.
A practical eval stack looks like:
And most importantly:
Define success as a measurable outcome, not a nice paragraph.
Here’s the rollout plan that survives contact with production:
October takeaway
Agents don’t become safe because the model is “smart.”
They become safe because the orchestrator is strict, the tools are well-designed, and the system has telemetry + evals.
No.
Prompting is a way to steer outputs.
Agents are a runtime architecture:
If you remove the orchestrator and policy layer, what you have is a demo loop — not a shippable system.
No.
Tool discovery is part of your control plane. Expose tools explicitly via a registry and only include what the current user + context is allowed to access.
The safest system is the one where the model can’t even see forbidden tools.
When you need:
In those cases, the model should propose plans, but the executor should behave like a workflow engine.
They ship write tools too early.
Read-only agents teach you:
Write tools multiply risk because failures become real-world damage.
October was about turning a model into a workflow engine — and accepting the consequences:
Next month I’m switching from “agents that act” to models that create:
DALL·E: How Text Became Images (and Why It Changed Everything)”
Because once you understand tool use, you see the pattern:
The model isn’t replacing deterministic systems.
It’s becoming the universal interface and composer for them — whether the output is a tool call… or an image.
DALL·E: How Text Became Images (and Why It Changed Everything)
DALL·E wasn’t “just a cool generator.” It turned language into a control surface for visual distributions — and forced product teams to treat image creation like a probabilistic runtime with safety, latency, and cost constraints.
RAG Done Right: Knowledge, Grounding, and Evaluation That Isn’t Vibes
RAG isn’t “add a vector DB.” It’s a reliability architecture: define truth boundaries, build a testable retrieval pipeline, and evaluate groundedness like you mean it.