Aug 25, 2024 - 16 MIN READ
Tool Use with Open Models: function calling, sandboxes, and “capability boundaries”

Tool use is where LLMs stop being “text generators” and start being integration surfaces. With open weights, reliability isn’t a given — you have to engineer it with contracts, sandboxes, and explicit capability boundaries.

Axel Domingues

July was about truth.

If you want to ship RAG without lying to users, you need a retrieval pipeline you can evaluate:

  • queries you can inspect
  • rerankers you can A/B
  • citations you can audit
  • and boundaries that define “what must never be wrong”

August is about action.

Because the moment you let the model do anything beyond “answer in text” — browse, call APIs, run code, create tickets, trigger workflows — you’ve crossed a line:

The model is now an integration surface.

And that changes everything.

With closed models, the platform often gives you a lot “for free”:

  • stable function calling
  • strong instruction-following
  • safety layers
  • high-quality tool selection

With open weights, you get something else:

control + portability + cost leverage — but also:

you own the reliability.

This month is about how to design tool use so it’s:

  • operable (you can debug it)
  • safe (it can’t do dangerous things)
  • predictable (it does what the contract says)
  • and scalable (works across model tiers)

When I say “open models” here I mean “you can run them yourself” — on your own infra, cloud GPUs, or a managed host — and you control prompts, adapters, and runtime. That freedom is exactly why you need a real architecture around tool use.

The goal this month

Turn “agentic demos” into tool use you can operate in production.

The mindset shift

Open weights don’t reduce risk — they move it into your system design.

The core lever

Treat tool calls as contracts, not “model behavior.”

The deliverable

A tool runtime with sandboxes + capability boundaries + evidence logs.


Tool Use Is Not “Agents” — It’s Typed Integration

The internet loves the word agent.

Engineers should love a different phrase:

typed integration.

A tool call is not “the model deciding to do something.”

It’s the model producing a structured request that you interpret and execute.

That subtle framing matters because it forces the right architecture:

  • the model proposes
  • the system decides
  • tools execute inside constraints
  • results come back as evidence

If you let the model “execute” directly, you’ve already lost.


What “Function Calling” Really Means

Function calling is a family of techniques for making tool use reliable.

At minimum, you need:

  • a tool registry (name + description + schema)
  • a tool-call format (JSON-like)
  • a dispatcher (select tool → validate args → run)
  • a result envelope (success/error, structured output)

With open models, you usually also need:

  • strict schema validation
  • repair loops (re-ask the model to fix malformed args)
  • constrained decoding (JSON grammar / schema-guided output) where available
  • and fallbacks when the model can’t reliably produce structure

If you want tool use to scale across models, treat your tool boundary like an API boundary:
  • schemas
  • versions
  • error codes
  • backward compatibility
  • observability
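
To make that concrete, here is a minimal sketch of the registry + validation + dispatch + repair-loop pieces in Python. It assumes the jsonschema package for argument validation; the Tool shape and the optional repair callback are illustrative, not a prescribed API.

from dataclasses import dataclass
from typing import Callable, Optional

from jsonschema import ValidationError, validate


@dataclass
class Tool:
    name: str                     # stable, versioned name, e.g. "crm.create_ticket"
    version: str
    description: str              # what the model sees when deciding to propose this tool
    args_schema: dict             # JSON Schema for the arguments
    run: Callable[[dict], dict]   # executed by the system, never by the model


REGISTRY: dict[str, Tool] = {}


def register(tool: Tool) -> None:
    REGISTRY[f"{tool.name}@{tool.version}"] = tool


def dispatch(call: dict, repair: Optional[Callable[[Tool, dict, str], dict]] = None,
             repair_budget: int = 1) -> dict:
    """Take a model-proposed tool call, validate it, execute it, return an envelope.

    The model only produced `call`; everything below this line is system code.
    """
    key = f"{call.get('tool')}@{call.get('version', 'v1')}"
    tool = REGISTRY.get(key)
    if tool is None:
        return {"ok": False, "error": "NOT_FOUND", "detail": key}

    args = call.get("args", {})
    for attempt in range(repair_budget + 1):
        try:
            validate(instance=args, schema=tool.args_schema)
            break
        except ValidationError as exc:
            if repair is None or attempt == repair_budget:
                return {"ok": False, "error": "VALIDATION_ERROR", "detail": exc.message}
            # Bounded repair loop: re-ask the model to fix the args, a limited number of times.
            args = repair(tool, args, exc.message)

    return {"ok": True, "result": tool.run(args)}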

The Three Boundaries That Matter

There are three boundaries in any tool-using system:

  1. Language boundary (prompt ↔ model output)
  2. Capability boundary (what the model is allowed to attempt)
  3. Execution boundary (what the system is allowed to do)

Most failures happen because these boundaries are blurred.

Here’s a useful rule:

The model is untrusted input at every boundary.

It can be wrong. It can be manipulated. It can be confidently unsafe.

So we design as if tool calls are user input — because functionally, they are.


Capability Boundaries: Make Power Explicit

A capability boundary is a policy layer that answers:

  • Which tools can be used in this context?
  • How many times?
  • With what parameter constraints?
  • Under what approval rules?
  • With what audit requirements?

This is where “model selection becomes architecture” becomes real.

Not every model gets the same capabilities.

Not every request gets the same capabilities.

A simple capability matrix

Capability | Example tools | Risk | Default rule
Read-only retrieval | search, fetch doc, embeddings | Low | Allowed broadly
Write internal artifacts | create ticket, draft email, open PR | Medium | Allowed with constraints + logging
Execute compute | run code, transform data | Medium | Allowed in sandbox + budgets
Spend money / irreversible actions | payment, deploy, delete | High | Human approval + idempotency keys
External comms | send email, post message | High | Review, throttles, allowlists

If your tool can do an irreversible action, you must assume the model will eventually do it at the wrong time — unless you gate it with explicit approvals and strong invariants.
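
One way to make the matrix above executable is to express it as data that a capability gate consults on every request. A minimal sketch; the tool names, tiers, and rule fields are assumptions for illustration:

from dataclasses import dataclass, field


@dataclass
class CapabilityRule:
    risk: str                              # "low" | "medium" | "high"
    max_calls_per_request: int = 1
    needs_sandbox: bool = False
    needs_human_approval: bool = False
    param_constraints: dict = field(default_factory=dict)


# Illustrative policy: which tools a request may even attempt, and under what rules.
POLICY: dict[str, CapabilityRule] = {
    "search.query":      CapabilityRule(risk="low", max_calls_per_request=5),
    "crm.create_ticket": CapabilityRule(risk="medium", max_calls_per_request=1),
    "code.run":          CapabilityRule(risk="medium", max_calls_per_request=3, needs_sandbox=True),
    "billing.charge":    CapabilityRule(risk="high", needs_human_approval=True),
    "email.send":        CapabilityRule(risk="high", needs_human_approval=True,
                                        param_constraints={"to_domain": ["example.com"]}),
}


def gate(tool_name: str, calls_so_far: int, model_tier: str) -> tuple[bool, str]:
    """Decide whether this request, on this model tier, may attempt this tool right now."""
    rule = POLICY.get(tool_name)
    if rule is None:
        return False, "tool not in policy"
    if rule.risk == "high" and model_tier == "small":
        return False, "high-risk tools are not exposed to the small tier"
    if calls_so_far >= rule.max_calls_per_request:
        return False, "per-request budget exhausted"
    if rule.needs_human_approval:
        return False, "queued for human approval"
    return True, "ok"

The point is not this particular shape. The point is that “which model gets which tools” becomes a policy you can review, version, and test.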

Sandboxes: The Execution Boundary Is Where Adults Live

A sandbox is the environment where tool execution happens under restrictions.

It’s not optional.

It’s how you turn “the model can run code” into “the model can run code safely.”

A good sandbox gives you:

  • isolation (filesystem, processes, network)
  • resource limits (CPU, memory, timeouts)
  • capability limits (no arbitrary network, allowlisted domains only)
  • deterministic IO (input/output envelopes, audit logs)
  • clean teardown (no state leaks between runs)

Two kinds of sandboxes you’ll actually use

Compute sandbox

Run code / transformations with strict limits, no secrets, no outbound network by default.

Connector sandbox

Call internal/external APIs through a gateway enforcing auth, allowlists, quotas, and logging.

In practice, the “connector sandbox” is your most important one.

Because the most dangerous tool is not “run Python.”

It’s “call production APIs with credentials.”
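
For the compute side, the IO envelope is worth sketching even before you choose an isolation technology. A toy sketch below, assuming Python and a local python3 binary; subprocess limits alone are not a security boundary, so wrap this in a container, microVM, or equivalent.

import shutil
import subprocess
import tempfile


def run_in_compute_sandbox(code: str, timeout_s: int = 10, max_output_bytes: int = 64_000) -> dict:
    """Run model-proposed code under a strict IO envelope: limits, scrubbed env, clean teardown."""
    workdir = tempfile.mkdtemp(prefix="tool-sandbox-")
    try:
        proc = subprocess.run(
            ["python3", "-I", "-c", code],    # -I: isolated mode, ignores env vars and user site-packages
            cwd=workdir,                      # scratch directory, deleted afterwards
            env={"PATH": "/usr/bin:/bin"},    # no secrets, no inherited environment
            capture_output=True,
            timeout=timeout_s,
        )
        return {
            "ok": proc.returncode == 0,
            "returncode": proc.returncode,
            "stdout": proc.stdout[:max_output_bytes].decode(errors="replace"),
            "stderr": proc.stderr[:max_output_bytes].decode(errors="replace"),
        }
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "UPSTREAM_TIMEOUT", "detail": f"exceeded {timeout_s}s"}
    finally:
        shutil.rmtree(workdir, ignore_errors=True)   # clean teardown: no state leaks between runs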


The Tool Gateway Pattern (My Default)

If you do one thing, do this:

Tools don’t call APIs directly. Tools call a gateway.

The gateway is where you enforce:

  • authentication and scoping
  • allowlisted endpoints
  • rate limits and quotas
  • payload size limits
  • PII redaction
  • request/response logging (evidence)
  • retry policies and idempotency

This makes tools boring — and boring is good.

When you own open weights, you can run multiple model tiers behind the same gateway. Your compliance and safety posture stays consistent even as you swap models.
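
Here is a sketch of what that choke point can look like, assuming the requests library and illustrative limits. The specifics (hosts, quotas, sizes) are placeholders; the point is that every enforcement decision lives here, not in the prompt.

import json
import logging
import time
import uuid
from urllib.parse import urlparse

import requests

ALLOWED_HOSTS = {"api.internal.example.com"}   # illustrative allowlist
MAX_PAYLOAD_BYTES = 32_000
QUOTA_PER_MINUTE = 30

log = logging.getLogger("tool_gateway")
_recent_calls: list[float] = []


def gateway_call(method: str, url: str, payload: dict, idempotency_key: str | None = None) -> dict:
    """Single choke point for every tool-initiated API call."""
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        return {"ok": False, "error": "AUTH_ERROR", "detail": "host not allowlisted"}

    body = json.dumps(payload)
    if len(body) > MAX_PAYLOAD_BYTES:
        return {"ok": False, "error": "VALIDATION_ERROR", "detail": "payload too large"}

    now = time.time()
    _recent_calls[:] = [t for t in _recent_calls if now - t < 60]
    if len(_recent_calls) >= QUOTA_PER_MINUTE:
        return {"ok": False, "error": "RATE_LIMITED", "detail": "per-minute quota exhausted"}
    _recent_calls.append(now)

    request_id = str(uuid.uuid4())
    headers = {"Content-Type": "application/json",
               "Idempotency-Key": idempotency_key or request_id}
    log.info("tool_call %s %s %s", request_id, method, url)             # evidence, before execution

    try:
        resp = requests.request(method, url, data=body, headers=headers, timeout=10)
    except requests.Timeout:
        return {"ok": False, "error": "UPSTREAM_TIMEOUT", "requestId": request_id}

    log.info("tool_result %s status=%s", request_id, resp.status_code)  # evidence, after execution
    return {"ok": resp.ok, "status": resp.status_code,
            "requestId": request_id, "body": resp.text[:MAX_PAYLOAD_BYTES]}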

The Contract Surface: Schemas, Versions, and Error Codes

Tool use becomes reliable when you stop thinking “prompt” and start thinking “protocol.”

Tool schema rules I use

  • every tool has a name and a version
  • inputs are strictly typed (JSON schema-like)
  • outputs are structured (never “just text” if you can avoid it)
  • errors are typed:
    • VALIDATION_ERROR
    • AUTH_ERROR
    • RATE_LIMITED
    • UPSTREAM_TIMEOUT
    • NOT_FOUND
    • CONFLICT
    • INTERNAL_ERROR

And every tool call is wrapped in an envelope:

{
  "tool": "crm.create_ticket",
  "version": "v1",
  "args": { "title": "...", "priority": "P2", "customerId": "..." },
  "requestId": "..." 
}

The model is allowed to propose this.

Your system is allowed to reject it.

That separation is the whole game.
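
The tool side of the same contract can be pinned down just as explicitly. A sketch of a registry entry for a hypothetical crm.create_ticket tool, with its argument schema and the typed error codes callers must handle:

CRM_CREATE_TICKET_V1 = {
    "name": "crm.create_ticket",
    "version": "v1",
    "description": "Create a support ticket for an existing customer.",
    "args_schema": {
        "type": "object",
        "required": ["title", "priority", "customerId"],
        "additionalProperties": False,
        "properties": {
            "title":      {"type": "string", "maxLength": 200},
            "priority":   {"type": "string", "enum": ["P1", "P2", "P3"]},
            "customerId": {"type": "string"},
        },
    },
    # Typed errors the dispatcher (and any repair loop) can react to mechanically.
    "errors": ["VALIDATION_ERROR", "AUTH_ERROR", "RATE_LIMITED",
               "UPSTREAM_TIMEOUT", "NOT_FOUND", "CONFLICT", "INTERNAL_ERROR"],
}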


Failure Modes: How Tool Use Breaks in the Real World

Tool-using systems don’t fail like normal software.

They fail like distributed systems with a stochastic planner.

The patterns to assume will happen: malformed or schema-invalid arguments, plausible-looking calls to the wrong tool, injected instructions arriving through retrieved content or tool outputs, retries that repeat irreversible side effects, and long-horizon loops that burn budget without converging. Everything in this post (boundaries, budgets, gateways, evidence) exists to make those failures survivable.


Tool Use with Open Models: What Changes (Practically)

Open models are totally usable for tool use — but you need to compensate for variability:

  • function calling may be less consistent across prompts
  • outputs may be less strictly structured
  • tool selection may be noisier
  • refusal behavior may be weaker or absent
  • long-horizon planning may degrade faster

So you shift more responsibility into system constraints.

Here’s the trade:

Closed model bias

More reliability from the platform.
Less control over the full stack.

Open model bias

More control and portability.
Reliability is your job.

If you want open weights to feel “production-grade,” the trick is simple:

Upgrade the model from “decision maker” to “proposal engine.”


A Minimal Architecture That Actually Works

This is the smallest tool-use architecture I’m comfortable shipping:

Components:

  1. Context assembly
    • system policy + task instructions
    • user intent
    • relevant retrieved facts (from July)
    • tool registry summary (names + when to use)
  2. Model
    • produces either an answer or tool calls
  3. Capability gate
    • decides allowed tools for this request and model tier
    • enforces budgets
  4. Validator
    • schema checks + grounding checks
  5. Tool gateway + sandbox
    • executes safely
    • returns structured results
  6. Evidence log
    • stores tool calls + outputs + traces
  7. Response composer
    • renders an answer with citations, results, and uncertainty

This is not “overengineering.”

It’s the minimum to avoid being surprised in production.
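
Wired together, the request path is one short, explicit loop. A sketch only: every name here (deps.model, deps.gate, deps.dispatch, and so on) stands in for whatever you build or already run, not a specific framework.

def handle_request(user_msg: str, deps) -> str:
    """One request through the minimal architecture: propose -> gate -> execute -> compose."""
    context = deps.assemble_context(user_msg)       # policy + intent + retrieved facts + tool summary
    evidence = []

    for step in range(deps.budgets.max_tool_calls):
        proposal = deps.model.propose(context, evidence)    # returns an answer OR tool calls
        if proposal.answer is not None:
            break

        for call in proposal.tool_calls:
            allowed, reason = deps.gate(call, step)         # capability boundary
            if not allowed:
                evidence.append({"call": call, "denied": reason})
                continue
            result = deps.dispatch(call)                    # validator + gateway / sandbox
            evidence.append({"call": call, "result": result})
    else:
        proposal = deps.model.finalize(context, evidence)   # budget exhausted: answer with what we have

    deps.evidence_log.write(user_msg, evidence, proposal)   # the replayable trace
    return deps.compose(proposal, evidence)                 # answer + citations + uncertainty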


Budgets: Tokens, Tool Calls, and Blast Radius

A tool-using system needs four budget types:

Token budget

How much context and reasoning you can afford per request.

Tool-call budget

How many tool executions you allow before you stop.

Blast-radius budget

What damage a single request is allowed to do.

Time budget

How long the whole workflow can run before fallback / escalation.

The blast-radius budget is the most overlooked.

Examples:

  • max “create ticket” operations per request = 1
  • max emails sent per day per tenant = N
  • max external API calls per request = 5
  • max data written per sandbox session = 20 MB

Budgets turn “AI behavior” into something you can operate.
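
Budgets are easiest to enforce once they're explicit objects rather than tribal knowledge. A minimal sketch, with all the numbers as placeholders:

from dataclasses import dataclass, field


@dataclass
class RequestBudget:
    max_tokens: int = 8_000
    max_tool_calls: int = 5
    max_wall_clock_s: float = 30.0
    # Blast-radius limits cap side effects, not effort.
    max_tickets_created: int = 1
    max_external_calls: int = 5
    spent: dict = field(default_factory=dict)

    def charge(self, kind: str, amount: int = 1) -> bool:
        """Record spend for one budget line; return False when the runtime should stop and degrade."""
        limits = {"tokens": self.max_tokens, "tool_calls": self.max_tool_calls,
                  "tickets_created": self.max_tickets_created,
                  "external_calls": self.max_external_calls}
        self.spent[kind] = self.spent.get(kind, 0) + amount
        return self.spent[kind] <= limits.get(kind, 0)   # unknown budget kinds fail closed

When charge() returns False, the runtime doesn't "try harder": it stops calling tools, answers with what it has, or escalates.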


Implementation Guide: A Practical Checklist

Define tools like public APIs

  • schemas, versions, error codes
  • stable naming
  • structured outputs
  • backwards compatibility plan

Build a capability policy layer

  • per-user, per-tenant, per-feature gates
  • model-tier gates (not every model gets the same tools)
  • per-request budgets

Put every tool behind a gateway

  • allowlists
  • quotas
  • logging
  • PII policies
  • idempotency and retries

Sandbox anything that executes compute

  • no secrets
  • no outbound network by default
  • strict timeouts
  • full logs

Instrument the whole thing

Log:

  • model input (redacted)
  • tool proposals
  • gate decisions (allowed/denied)
  • validation failures
  • tool timings + errors
  • final response + citations
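
Concretely, that instrumentation can be one structured record per step, keyed by a trace id so a whole request can be replayed. The field names below are one possible shape, not a standard:

import json
import time
import uuid


def log_event(trace_id: str, kind: str, **fields) -> None:
    """Append one structured event to the evidence log (stdout here; ship it wherever your logs go)."""
    record = {"traceId": trace_id, "ts": time.time(), "kind": kind, **fields}
    print(json.dumps(record, default=str))


# One request becomes a replayable sequence of events:
trace = str(uuid.uuid4())
log_event(trace, "model_input", redacted=True, tokens=1834)
log_event(trace, "tool_proposal", tool="crm.create_ticket", version="v1")
log_event(trace, "gate_decision", tool="crm.create_ticket", allowed=True)
log_event(trace, "tool_result", tool="crm.create_ticket", ok=True, latency_ms=412)
log_event(trace, "final_response", citations=2, uncertainty="low")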

Build “repair” and “fallback” flows

  • schema repair loops (bounded)
  • “ask clarifying question” paths
  • degrade to read-only mode
  • escalate to human review for high-risk tools

Treat tool-use traces like distributed traces. If you can’t replay and explain a failure, you can’t operate the system.

The Hidden Payoff: Evidence-Ready Systems

Once you build tool use with:

  • capability boundaries
  • sandboxes
  • gateway controls
  • and evidence logs

…you get something incredibly valuable for free:

You can prove what happened.

And that’s not just good engineering.

That’s where this series is heading next.

Because the uncomfortable truth is:

When regulation shows up, teams scramble. When your system already produces evidence, regulation becomes an architecture problem — not a panic.


Resources

ReAct (Yao et al., 2022) — reasoning + tool actions

A clean mental model for “the model proposes, the system acts”: interleave reasoning with tool calls, then fold results back as evidence.

Toolformer (Schick et al., 2023) — models learning to call tools

Why tool use works at all: shows patterns for when to call tools, what args to pass, and how to incorporate results — useful context when open models are noisy.

OWASP Top 10 for LLM Applications (tool risk & injection)

A practical threat model for tool-using systems: prompt injection, insecure output handling, DoS, and supply-chain risks — exactly the failure modes capability boundaries and gateways address.

Function calling / tool calling guide (JSON-schema tools & dispatch)

A concrete reference implementation of the “tool registry + schema + validation + dispatcher” pattern. Even if you run open weights, the protocol discipline transfers.


What’s Next

August was about turning “tool use” into a real subsystem:

  • the model proposes
  • policy gates
  • sandboxes execute
  • evidence logs make it auditable

Next month we make that explicit:

Regulation becomes architecture.

Not as legal theory.

As controls, traceability, and proof.

Regulation as Architecture: turning the EU AI Act into controls and evidence
