Aug 25, 2024 - 16 MIN READ
Tool Use with Open Models: function calling, sandboxes, and “capability boundaries”

Tool use is where LLMs stop being “text generators” and start being integration surfaces. With open weights, reliability isn’t a given — you have to engineer it with contracts, sandboxes, and explicit capability boundaries.

Axel Domingues

July was about truth.

If you want to ship RAG without lying to users, you need a retrieval pipeline you can evaluate:

  • queries you can inspect
  • rerankers you can A/B
  • citations you can audit
  • and boundaries that define “what must never be wrong”

August is about action.

Because the moment you let the model do anything beyond “answer in text” — browse, call APIs, run code, create tickets, trigger workflows — you’ve crossed a line:

The model is now an integration surface.

And that changes everything.

With closed models, the platform often gives you a lot “for free”:

  • stable function calling
  • strong instruction-following
  • safety layers
  • high-quality tool selection

With open weights, you get something else:

control + portability + cost leverage — but also:

you own the reliability.

This month is about how to design tool use so it’s:

  • operable (you can debug it)
  • safe (it can’t do dangerous things)
  • predictable (it does what the contract says)
  • and scalable (works across model tiers)

When I say “open models” here I mean “you can run them yourself” — on your own infra, cloud GPUs, or a managed host — and you control prompts, adapters, and runtime. That freedom is exactly why you need a real architecture around tool use.

The goal this month

Turn “agentic demos” into tool use you can operate in production.

The mindset shift

Open weights don’t reduce risk — they move it into your system design.

The core lever

Treat tool calls as contracts, not “model behavior.”

The deliverable

A tool runtime with sandboxes + capability boundaries + evidence logs.


Tool Use Is Not “Agents” — It’s Typed Integration

The internet loves the word agent.

Engineers should love a different phrase:

typed integration.

A tool call is not “the model deciding to do something.”

It’s the model producing a structured request that you interpret and execute.

That subtle framing matters because it forces the right architecture:

  • the model proposes
  • the system decides
  • tools execute inside constraints
  • results come back as evidence

If you let the model “execute” directly, you’ve already lost.


What “Function Calling” Really Means

Function calling is a family of techniques for making tool use reliable.

At minimum, you need:

  • a tool registry (name + description + schema)
  • a tool-call format (JSON-like)
  • a dispatcher (select tool → validate args → run)
  • a result envelope (success/error, structured output)

With open models, you usually also need:

  • strict schema validation
  • repair loops (re-ask the model to fix malformed args)
  • constrained decoding (JSON grammar / schema-guided output) where available
  • and fallbacks when the model can’t reliably produce structure

If you want tool use to scale across models, treat your tool boundary like an API boundary:
  • schemas
  • versions
  • error codes
  • backward compatibility
  • observability
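
To make that concrete, here is a minimal sketch of the registry + validation + dispatch + repair-loop pieces in Python. It assumes the jsonschema package for argument validation; the Tool shape and the optional repair callback are illustrative, not a prescribed API.

from dataclasses import dataclass
from typing import Callable, Optional

from jsonschema import ValidationError, validate


@dataclass
class Tool:
    name: str                     # stable, versioned name, e.g. "crm.create_ticket"
    version: str
    description: str              # what the model sees when deciding to propose this tool
    args_schema: dict             # JSON Schema for the arguments
    run: Callable[[dict], dict]   # executed by the system, never by the model


REGISTRY: dict[str, Tool] = {}


def register(tool: Tool) -> None:
    REGISTRY[f"{tool.name}@{tool.version}"] = tool


def dispatch(call: dict, repair: Optional[Callable[[Tool, dict, str], dict]] = None,
             repair_budget: int = 1) -> dict:
    """Take a model-proposed tool call, validate it, execute it, return an envelope.

    The model only produced `call`; everything below this line is system code.
    """
    key = f"{call.get('tool')}@{call.get('version', 'v1')}"
    tool = REGISTRY.get(key)
    if tool is None:
        return {"ok": False, "error": "NOT_FOUND", "detail": key}

    args = call.get("args", {})
    for attempt in range(repair_budget + 1):
        try:
            validate(instance=args, schema=tool.args_schema)
            break
        except ValidationError as exc:
            if repair is None or attempt == repair_budget:
                return {"ok": False, "error": "VALIDATION_ERROR", "detail": exc.message}
            # Bounded repair loop: re-ask the model to fix the args, a limited number of times.
            args = repair(tool, args, exc.message)

    return {"ok": True, "result": tool.run(args)}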

The Three Boundaries That Matter

There are three boundaries in any tool-using system:

  1. Language boundary (prompt ↔ model output)
  2. Capability boundary (what the model is allowed to attempt)
  3. Execution boundary (what the system is allowed to do)

Most failures happen because these boundaries are blurred.

Here’s a useful rule:

The model is untrusted input at every boundary.

It can be wrong. It can be manipulated. It can be confidently unsafe.

So we design as if tool calls are user input — because functionally, they are.


Capability Boundaries: Make Power Explicit

A capability boundary is a policy layer that answers:

  • Which tools can be used in this context?
  • How many times?
  • With what parameter constraints?
  • Under what approval rules?
  • With what audit requirements?

This is where “model selection becomes architecture” becomes real.

Not every model gets the same capabilities.

Not every request gets the same capabilities.

A simple capability matrix

Capability | Example tools | Risk | Default rule
Read-only retrieval | search, fetch doc, embeddings | Low | Allowed broadly
Write internal artifacts | create ticket, draft email, open PR | Medium | Allowed with constraints + logging
Execute compute | run code, transform data | Medium | Allowed in sandbox + budgets
Spend money / irreversible actions | payment, deploy, delete | High | Human approval + idempotency keys
External comms | send email, post message | High | Review, throttles, allowlists

If your tool can do an irreversible action, you must assume the model will eventually do it at the wrong time — unless you gate it with explicit approvals and strong invariants.
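
One way to make the matrix above executable is to express it as data that a capability gate consults on every request. A minimal sketch; the tool names, tiers, and rule fields are assumptions for illustration:

from dataclasses import dataclass, field


@dataclass
class CapabilityRule:
    risk: str                              # "low" | "medium" | "high"
    max_calls_per_request: int = 1
    needs_sandbox: bool = False
    needs_human_approval: bool = False
    param_constraints: dict = field(default_factory=dict)


# Illustrative policy: which tools a request may even attempt, and under what rules.
POLICY: dict[str, CapabilityRule] = {
    "search.query":      CapabilityRule(risk="low", max_calls_per_request=5),
    "crm.create_ticket": CapabilityRule(risk="medium", max_calls_per_request=1),
    "code.run":          CapabilityRule(risk="medium", max_calls_per_request=3, needs_sandbox=True),
    "billing.charge":    CapabilityRule(risk="high", needs_human_approval=True),
    "email.send":        CapabilityRule(risk="high", needs_human_approval=True,
                                        param_constraints={"to_domain": ["example.com"]}),
}


def gate(tool_name: str, calls_so_far: int, model_tier: str) -> tuple[bool, str]:
    """Decide whether this request, on this model tier, may attempt this tool right now."""
    rule = POLICY.get(tool_name)
    if rule is None:
        return False, "tool not in policy"
    if rule.risk == "high" and model_tier == "small":
        return False, "high-risk tools are not exposed to the small tier"
    if calls_so_far >= rule.max_calls_per_request:
        return False, "per-request budget exhausted"
    if rule.needs_human_approval:
        return False, "queued for human approval"
    return True, "ok"

The point is not this particular shape. The point is that “which model gets which tools” becomes a policy you can review, version, and test.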

Sandboxes: The Execution Boundary Is Where Adults Live

A sandbox is the environment where tool execution happens under restrictions.

It’s not optional.

It’s how you turn “the model can run code” into “the model can run code safely.”

A good sandbox gives you:

  • isolation (filesystem, processes, network)
  • resource limits (CPU, memory, timeouts)
  • capability limits (no arbitrary network, allowlisted domains only)
  • deterministic IO (input/output envelopes, audit logs)
  • clean teardown (no state leaks between runs)

Two kinds of sandboxes you’ll actually use

Compute sandbox

Run code / transformations with strict limits, no secrets, no outbound network by default.

Connector sandbox

Call internal/external APIs through a gateway enforcing auth, allowlists, quotas, and logging.

In practice, the “connector sandbox” is your most important one.

Because the most dangerous tool is not “run Python.”

It’s “call production APIs with credentials.”
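
For the compute side, the IO envelope is worth sketching even before you choose an isolation technology. A toy sketch below, assuming Python and a local python3 binary; subprocess limits alone are not a security boundary, so wrap this in a container, microVM, or equivalent.

import shutil
import subprocess
import tempfile


def run_in_compute_sandbox(code: str, timeout_s: int = 10, max_output_bytes: int = 64_000) -> dict:
    """Run model-proposed code under a strict IO envelope: limits, scrubbed env, clean teardown."""
    workdir = tempfile.mkdtemp(prefix="tool-sandbox-")
    try:
        proc = subprocess.run(
            ["python3", "-I", "-c", code],    # -I: isolated mode, ignores env vars and user site-packages
            cwd=workdir,                      # scratch directory, deleted afterwards
            env={"PATH": "/usr/bin:/bin"},    # no secrets, no inherited environment
            capture_output=True,
            timeout=timeout_s,
        )
        return {
            "ok": proc.returncode == 0,
            "returncode": proc.returncode,
            "stdout": proc.stdout[:max_output_bytes].decode(errors="replace"),
            "stderr": proc.stderr[:max_output_bytes].decode(errors="replace"),
        }
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "UPSTREAM_TIMEOUT", "detail": f"exceeded {timeout_s}s"}
    finally:
        shutil.rmtree(workdir, ignore_errors=True)   # clean teardown: no state leaks between runs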


The Tool Gateway Pattern (My Default)

If you do one thing, do this:

Tools don’t call APIs directly. Tools call a gateway.

The gateway is where you enforce:

  • authentication and scoping
  • allowlisted endpoints
  • rate limits and quotas
  • payload size limits
  • PII redaction
  • request/response logging (evidence)
  • retry policies and idempotency

This makes tools boring — and boring is good.

When you own open weights, you can run multiple model tiers behind the same gateway. Your compliance and safety posture stays consistent even as you swap models.
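
Here is a sketch of what that choke point can look like, assuming the requests library and illustrative limits. The specifics (hosts, quotas, sizes) are placeholders; the point is that every enforcement decision lives here, not in the prompt.

import json
import logging
import time
import uuid
from urllib.parse import urlparse

import requests

ALLOWED_HOSTS = {"api.internal.example.com"}   # illustrative allowlist
MAX_PAYLOAD_BYTES = 32_000
QUOTA_PER_MINUTE = 30

log = logging.getLogger("tool_gateway")
_recent_calls: list[float] = []


def gateway_call(method: str, url: str, payload: dict, idempotency_key: str | None = None) -> dict:
    """Single choke point for every tool-initiated API call."""
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        return {"ok": False, "error": "AUTH_ERROR", "detail": "host not allowlisted"}

    body = json.dumps(payload)
    if len(body) > MAX_PAYLOAD_BYTES:
        return {"ok": False, "error": "VALIDATION_ERROR", "detail": "payload too large"}

    now = time.time()
    _recent_calls[:] = [t for t in _recent_calls if now - t < 60]
    if len(_recent_calls) >= QUOTA_PER_MINUTE:
        return {"ok": False, "error": "RATE_LIMITED", "detail": "per-minute quota exhausted"}
    _recent_calls.append(now)

    request_id = str(uuid.uuid4())
    headers = {"Content-Type": "application/json",
               "Idempotency-Key": idempotency_key or request_id}
    log.info("tool_call %s %s %s", request_id, method, url)             # evidence, before execution

    try:
        resp = requests.request(method, url, data=body, headers=headers, timeout=10)
    except requests.Timeout:
        return {"ok": False, "error": "UPSTREAM_TIMEOUT", "requestId": request_id}

    log.info("tool_result %s status=%s", request_id, resp.status_code)  # evidence, after execution
    return {"ok": resp.ok, "status": resp.status_code,
            "requestId": request_id, "body": resp.text[:MAX_PAYLOAD_BYTES]}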

The Contract Surface: Schemas, Versions, and Error Codes

Tool use becomes reliable when you stop thinking “prompt” and start thinking “protocol.”

Tool schema rules I use

  • every tool has a name and a version
  • inputs are strictly typed (JSON schema-like)
  • outputs are structured (never “just text” if you can avoid it)
  • errors are typed:
    • VALIDATION_ERROR
    • AUTH_ERROR
    • RATE_LIMITED
    • UPSTREAM_TIMEOUT
    • NOT_FOUND
    • CONFLICT
    • INTERNAL_ERROR

And every tool call is wrapped in an envelope:

{
  "tool": "crm.create_ticket",
  "version": "v1",
  "args": { "title": "...", "priority": "P2", "customerId": "..." },
  "requestId": "..." 
}

The model is allowed to propose this.

Your system is allowed to reject it.

That separation is the whole game.
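
The tool side of the same contract can be pinned down just as explicitly. A sketch of a registry entry for a hypothetical crm.create_ticket tool, with its argument schema and the typed error codes callers must handle:

CRM_CREATE_TICKET_V1 = {
    "name": "crm.create_ticket",
    "version": "v1",
    "description": "Create a support ticket for an existing customer.",
    "args_schema": {
        "type": "object",
        "required": ["title", "priority", "customerId"],
        "additionalProperties": False,
        "properties": {
            "title":      {"type": "string", "maxLength": 200},
            "priority":   {"type": "string", "enum": ["P1", "P2", "P3"]},
            "customerId": {"type": "string"},
        },
    },
    # Typed errors the dispatcher (and any repair loop) can react to mechanically.
    "errors": ["VALIDATION_ERROR", "AUTH_ERROR", "RATE_LIMITED",
               "UPSTREAM_TIMEOUT", "NOT_FOUND", "CONFLICT", "INTERNAL_ERROR"],
}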


Failure Modes: How Tool Use Breaks in the Real World

Tool-using systems don’t fail like normal software.

They fail like distributed systems with a stochastic planner.

The patterns to assume will happen: malformed or schema-invalid arguments, plausible-looking calls to the wrong tool, injected instructions arriving through retrieved content or tool outputs, retries that repeat irreversible side effects, and long-horizon loops that burn budget without converging. Everything in this post (boundaries, budgets, gateways, evidence) exists to make those failures survivable.


Tool Use with Open Models: What Changes (Practically)

Open models are totally usable for tool use — but you need to compensate for variability:

  • function calling may be less consistent across prompts
  • outputs may be less strictly structured
  • tool selection may be noisier
  • refusal behavior may be weaker or absent
  • long-horizon planning may degrade faster

So you shift more responsibility into system constraints.

Here’s the trade:

Closed model bias

More reliability from the platform.
Less control over the full stack.

Open model bias

More control and portability.
Reliability is your job.

If you want open weights to feel “production-grade,” the trick is simple:

Upgrade the model from “decision maker” to “proposal engine.”


A Minimal Architecture That Actually Works

This is the smallest tool-use architecture I’m comfortable shipping:

Components:

  1. Context assembly
    • system policy + task instructions
    • user intent
    • relevant retrieved facts (from July)
    • tool registry summary (names + when to use)
  2. Model
    • produces either an answer or tool calls
  3. Capability gate
    • decides allowed tools for this request and model tier
    • enforces budgets
  4. Validator
    • schema checks + grounding checks
  5. Tool gateway + sandbox
    • executes safely
    • returns structured results
  6. Evidence log
    • stores tool calls + outputs + traces
  7. Response composer
    • renders an answer with citations, results, and uncertainty

This is not “overengineering.”

It’s the minimum to avoid being surprised in production.
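
Wired together, the request path is one short, explicit loop. A sketch only: every name here (deps.model, deps.gate, deps.dispatch, and so on) stands in for whatever you build or already run, not a specific framework.

def handle_request(user_msg: str, deps) -> str:
    """One request through the minimal architecture: propose -> gate -> execute -> compose."""
    context = deps.assemble_context(user_msg)       # policy + intent + retrieved facts + tool summary
    evidence = []

    for step in range(deps.budgets.max_tool_calls):
        proposal = deps.model.propose(context, evidence)    # returns an answer OR tool calls
        if proposal.answer is not None:
            break

        for call in proposal.tool_calls:
            allowed, reason = deps.gate(call, step)         # capability boundary
            if not allowed:
                evidence.append({"call": call, "denied": reason})
                continue
            result = deps.dispatch(call)                    # validator + gateway / sandbox
            evidence.append({"call": call, "result": result})
    else:
        proposal = deps.model.finalize(context, evidence)   # budget exhausted: answer with what we have

    deps.evidence_log.write(user_msg, evidence, proposal)   # the replayable trace
    return deps.compose(proposal, evidence)                 # answer + citations + uncertainty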


Budgets: Tokens, Tool Calls, and Blast Radius

A tool-using system needs four budget types:

Token budget

How much context and reasoning you can afford per request.

Tool-call budget

How many tool executions you allow before you stop.

Blast-radius budget

What damage a single request is allowed to do.

Time budget

How long the whole workflow can run before fallback / escalation.

The blast-radius budget is the most overlooked.

Examples:

  • max “create ticket” operations per request = 1
  • max emails sent per day per tenant = N
  • max external API calls per request = 5
  • max data written per sandbox session = 20 MB

Budgets turn “AI behavior” into something you can operate.
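
Budgets are easiest to enforce once they're explicit objects rather than tribal knowledge. A minimal sketch, with all the numbers as placeholders:

from dataclasses import dataclass, field


@dataclass
class RequestBudget:
    max_tokens: int = 8_000
    max_tool_calls: int = 5
    max_wall_clock_s: float = 30.0
    # Blast-radius limits cap side effects, not effort.
    max_tickets_created: int = 1
    max_external_calls: int = 5
    spent: dict = field(default_factory=dict)

    def charge(self, kind: str, amount: int = 1) -> bool:
        """Record spend for one budget line; return False when the runtime should stop and degrade."""
        limits = {"tokens": self.max_tokens, "tool_calls": self.max_tool_calls,
                  "tickets_created": self.max_tickets_created,
                  "external_calls": self.max_external_calls}
        self.spent[kind] = self.spent.get(kind, 0) + amount
        return self.spent[kind] <= limits.get(kind, 0)   # unknown budget kinds fail closed

When charge() returns False, the runtime doesn't "try harder": it stops calling tools, answers with what it has, or escalates.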


Implementation Guide: A Practical Checklist

Define tools like public APIs

  • schemas, versions, error codes
  • stable naming
  • structured outputs
  • backwards compatibility plan

Build a capability policy layer

  • per-user, per-tenant, per-feature gates
  • model-tier gates (not every model gets the same tools)
  • per-request budgets

Put every tool behind a gateway

  • allowlists
  • quotas
  • logging
  • PII policies
  • idempotency and retries

Sandbox anything that executes compute

  • no secrets
  • no outbound network by default
  • strict timeouts
  • full logs

Instrument the whole thing

Log:

  • model input (redacted)
  • tool proposals
  • gate decisions (allowed/denied)
  • validation failures
  • tool timings + errors
  • final response + citations
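
Concretely, that instrumentation can be one structured record per step, keyed by a trace id so a whole request can be replayed. The field names below are one possible shape, not a standard:

import json
import time
import uuid


def log_event(trace_id: str, kind: str, **fields) -> None:
    """Append one structured event to the evidence log (stdout here; ship it wherever your logs go)."""
    record = {"traceId": trace_id, "ts": time.time(), "kind": kind, **fields}
    print(json.dumps(record, default=str))


# One request becomes a replayable sequence of events:
trace = str(uuid.uuid4())
log_event(trace, "model_input", redacted=True, tokens=1834)
log_event(trace, "tool_proposal", tool="crm.create_ticket", version="v1")
log_event(trace, "gate_decision", tool="crm.create_ticket", allowed=True)
log_event(trace, "tool_result", tool="crm.create_ticket", ok=True, latency_ms=412)
log_event(trace, "final_response", citations=2, uncertainty="low")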

Build “repair” and “fallback” flows

  • schema repair loops (bounded)
  • “ask clarifying question” paths
  • degrade to read-only mode
  • escalate to human review for high-risk tools

Treat tool-use traces like distributed traces. If you can’t replay and explain a failure, you can’t operate the system.

The Hidden Payoff: Evidence-Ready Systems

Once you build tool use with:

  • capability boundaries
  • sandboxes
  • gateway controls
  • and evidence logs

…you get something incredibly valuable for free:

You can prove what happened.

And that’s not just good engineering.

That’s where this series is heading next.

Because the uncomfortable truth is:

When regulation shows up, teams scramble. When your system already produces evidence, regulation becomes an architecture problem — not a panic.


Resources

ReAct (Yao et al., 2022) — reasoning + tool actions

A clean mental model for “the model proposes, the system acts”: interleave reasoning with tool calls, then fold results back as evidence.

Toolformer (Schick et al., 2023) — models learning to call tools

Why tool use works at all: shows patterns for when to call tools, what args to pass, and how to incorporate results — useful context when open models are noisy.

OWASP Top 10 for LLM Applications (tool risk & injection)

A practical threat model for tool-using systems: prompt injection, insecure output handling, DoS, and supply-chain risks — exactly the failure modes capability boundaries and gateways address.

Function calling / tool calling guide (JSON-schema tools & dispatch)

A concrete reference implementation of the “tool registry + schema + validation + dispatcher” pattern. Even if you run open weights, the protocol discipline transfers.


What’s Next

August was about turning “tool use” into a real subsystem:

  • the model proposes
  • policy gates
  • sandboxes execute
  • evidence logs make it auditable

Next month we make that explicit:

Regulation becomes architecture.

Not as legal theory.

As controls, traceability, and proof.

Regulation as Architecture: turning the EU AI Act into controls and evidence
