
Prompting feels like coding until it fails like statistics. This month I start treating LLMs as probabilistic components: define contracts, enforce schemas, and design failure budgets so your system survives outputs that are “plausible” but wrong.
Axel Domingues
In 2023, we learned the hard truth:
LLMs don’t “return answers.” They return samples.
In 2024, the work changes.
Less prompt tweaking. More system design.
Because the moment you ship an LLM feature to real users, you discover a new category of bug:
Nothing crashed.
The output was just… wrong.
That’s not a coding bug.
That’s a contract bug.
And if you treat prompting like programming, you’ll keep trying to fix a probabilistic system with deterministic habits.
So for the first article of the 2024 series (From LLM features to agents you can operate), I’m going to establish the posture that makes LLM work shippable: treat the model as a probabilistic component and design reliability around it.
Because reliability is not a model property.
Reliability is designed.
The mental shift: prompts are not programs. They’re inputs to a probabilistic component.
The engineering shift: move reliability out of the prompt and into a boundary layer.
The practical output: a reusable LLM Boundary Layer you can apply to every feature.
The success metric: the feature degrades safely under failure and improves via measurement.
When you write code, you assume determinism: the same input gives the same output, and failures surface as errors you can catch.
When you prompt a model, none of those assumptions are safe: the same input can produce different outputs, and failures arrive looking like success.
So the right comparison isn’t “prompting is like coding.”
Prompting is closer to calling an untrusted external service that usually behaves.
Which means you need what we always need for untrusted systems:
boundaries.
The prompt is not the system.
The prompt is one part of the system.
If you’re building LLM features professionally, you want a dedicated layer that sits between your product and the model.
Not because it’s fashionable.
Because you need a place where reliability actually lives.
Here’s the boundary I recommend for almost every production integration:

At minimum, the boundary layer owns: prompt assembly, schema validation, bounded retries and repair, fallbacks and escalation, and logging and metrics.
The point is simple:
Your product should talk to a stable interface.
The boundary layer should deal with model weirdness.
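As a sketch of what that stable interface can look like (the names here are illustrative, not an existing library):

from dataclasses import dataclass

# Sketch of a product-facing boundary interface. RoutingDecision and
# route_ticket are illustrative names for this article, not a real API.
@dataclass(frozen=True)
class RoutingDecision:
    intent: str          # always one of a closed set the product knows about
    confidence: float    # 0.0 to 1.0, as reported by the boundary layer
    needs_human: bool    # explicit escalation signal, never inferred from prose
    source: str          # "model", "fallback", or "cache", never raw model text

def route_ticket(ticket_text: str) -> RoutingDecision:
    """Product code calls this and nothing else. Prompting, validation,
    retries, fallbacks, and logging all live behind this function."""
    raise NotImplementedError  # implemented by the boundary layer described below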
A contract is not a prompt. A contract is an agreement between your system and the model about what is acceptable.
Think of it like any other architecture boundary: an API contract, a database schema, a message format.
LLM features need the same adult supervision.
Before you write a single prompt, answer two questions: what must never be wrong, and where is “probably right” acceptable?
Skip that step and every plausible output gets trusted by default.
That’s how hallucinations become incidents.
A good LLM contract is explicit, bounded, and testable.
Here’s a simple contract template I use:
Contract question 1: What must never be wrong?
Contract question 2: Where is “probably right” acceptable?
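Answered for a concrete feature, the contract can live next to the code as plain data that the boundary layer and its tests can point at. A sketch for the intent router used later in this article; the wording is illustrative:

# Illustrative contract-as-data for an intent router; the wording is an
# example, not a prescribed format.
INTENT_ROUTER_CONTRACT = {
    "must_never_be_wrong": [
        "intent is one of the allowed enum values",
        "needs_human is never silently dropped",
    ],
    "probably_right_is_acceptable": [
        "the free-text notes field",
        "confidence, as long as low values trigger the fallback",
    ],
    "on_violation": "reject at the boundary; never pass the output downstream",
}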
Most LLM work falls into a few contract families:
Extraction
Goal: extract fields, not “answer questions.”
Reliability lever: strict schema + validation + repair.
Failure mode: missing fields, wrong types, invented values.
Classification / routing
Goal: pick from a small controlled set.
Reliability lever: constrained outputs + confidence rules + fallback to “unknown.”
Failure mode: overconfident misroutes.
Drafting
Goal: help a human write faster.
Reliability lever: make it obviously a draft + include citations/notes.
Failure mode: users treat it as authoritative.
Actions (agents)
Goal: take actions safely.
Reliability lever: tool scopes, step limits, sandboxing, audit logs.
Failure mode: doing the wrong thing “helpfully.”
The key: don’t use a drafting contract for an extraction problem. That’s how you end up parsing prose like it’s data.
If contracts are “what must be true,” schemas are “what must be parseable.”
Schemas are how you turn “the model said a thing” into “the system accepted a thing.”
Without a schema, you’re doing this: reading prose and hoping the next step of your code can use it.
With a schema, you can do this: validate every output, repair within a bounded number of retries, and reject anything that doesn’t conform.
That’s real engineering.
It makes the system reject nonsense early and reliably.
You don’t need a giant framework to start. You need four steps: define the schema, validate every output against it, repair with bounded retries, and fall back or escalate when repair fails.
Here’s a simple JSON Schema example for an “intent router”:
{
  "type": "object",
  "required": ["intent", "confidence", "needs_human"],
  "properties": {
    "intent": {
      "type": "string",
      "enum": ["billing", "technical", "account", "sales", "other"]
    },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
    "needs_human": { "type": "boolean" },
    "notes": { "type": "string" }
  },
  "additionalProperties": false
}
And here’s the logic you want around it (pseudo-code, language-agnostic):
for attempt in 1..max_attempts:
    raw = llm(prompt, context)
    json = try_parse_json(raw)
    if not json:
        prompt = repair_prompt(raw, "Output was not valid JSON.")
        continue
    if validate(json, schema):
        if json.confidence < 0.6: return fallback("low_confidence")
        if json.needs_human: return escalate(json)
        return accept(json)
    prompt = repair_prompt(raw, "JSON failed schema validation.")
return fallback("schema_failure")
That loop is the difference between a demo and a system.
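For reference, here is the same loop as runnable Python. It assumes the jsonschema package and a call_llm(prompt, context) client you supply; MAX_ATTEMPTS, the confidence floor, and repair_prompt are illustrative choices, not a fixed API.

import json
from dataclasses import dataclass
from typing import Optional

from jsonschema import Draft7Validator

MAX_ATTEMPTS = 3          # bounded retries: never loop forever
CONFIDENCE_FLOOR = 0.6    # below this, fall back instead of accepting

@dataclass
class BoundaryResult:
    status: str                     # "accepted" | "escalated" | "fallback"
    payload: Optional[dict] = None
    reason: Optional[str] = None

def repair_prompt(raw_output: str, problem: str) -> str:
    # Illustrative repair prompt: restate the problem, demand schema-only output.
    return (
        f"Your previous output was rejected: {problem}\n"
        f"Previous output:\n{raw_output}\n"
        "Respond again with JSON that matches the schema exactly, and nothing else."
    )

def run_with_boundary(prompt: str, context: str, schema: dict, call_llm) -> BoundaryResult:
    validator = Draft7Validator(schema)
    for _ in range(MAX_ATTEMPTS):
        raw = call_llm(prompt, context)

        # 1. Parse: reject non-JSON early.
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            prompt = repair_prompt(raw, "Output was not valid JSON.")
            continue

        # 2. Validate against the contract's schema.
        errors = list(validator.iter_errors(data))
        if errors:
            details = "; ".join(e.message for e in errors)
            prompt = repair_prompt(raw, f"JSON failed schema validation: {details}")
            continue

        # 3. Apply the outcome policy.
        if data["confidence"] < CONFIDENCE_FLOOR:
            return BoundaryResult("fallback", reason="low_confidence")
        if data["needs_human"]:
            return BoundaryResult("escalated", payload=data)
        return BoundaryResult("accepted", payload=data)

    # Retries exhausted: degrade safely instead of retrying forever.
    return BoundaryResult("fallback", reason="schema_failure")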
Unbounded retries are how “LLM reliability” becomes “infinite latency and infinite cost.”
The common failure modes map to mechanical fixes:
Extra, unexpected fields. Fix: set additionalProperties: false (or the equivalent) and reject unknown fields.
Missing required fields. Fix: explicit “required” set + repair prompt that lists missing fields.
Invented values for things the model can’t know. Fix: allow null / “unknown” for specific fields, then enforce escalation rules.
Wrong types. Fix: strict validation + repair prompt + (optionally) a small deterministic coercion step you control.
Output wrapped in markdown code fences. Fix: strip fences deterministically, then validate (see the sketch below).
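The fence-stripping step can stay tiny and deterministic; strip_code_fences below is an illustrative helper, not a library function:

import re

def strip_code_fences(raw: str) -> str:
    """Remove a single leading/trailing Markdown fence (``` or ```json) if present."""
    text = raw.strip()
    match = re.match(r"^```[A-Za-z0-9_-]*\s*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text

# Always strip first, then parse and validate; never regex your way to "valid".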
The main rule:
The model can generate.
Only your boundary layer can accept.
Here’s where most teams get stuck:
They keep rewriting prompts to reduce failure.
That helps… until it doesn’t.
Because you can’t prompt your way out of probabilistic behavior. You can only budget it.
A failure budget is an operational idea: decide how much of each failure type the feature can tolerate, measure the real rate, and act when you exceed it.
This is the same thinking that made SRE work: availability isn’t a wish — it’s a budget.
LLM failures are not just “500 errors.” Most are semantic.
Define failure categories that match reality:
Invalid output: not parseable, schema-invalid, or missing required fields.
Wrong-but-plausible: semantically incorrect while sounding confident.
Unsafe output: policy violations, privacy leaks, jailbreak behavior.
Cost/latency blow-up: retries, long prompts, token spikes, tool loops.
Now you can put numbers on them.
Example failure budgets (illustrative):
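As a sketch (the numbers below are placeholders to show the shape, not recommendations):

# Illustrative failure budgets: category -> (max acceptable rate, response).
FAILURE_BUDGETS = {
    "invalid_output":      (0.02, "tighten schema, improve repair prompts"),
    "wrong_but_plausible": (0.05, "add grounding or a verification step"),
    "unsafe_output":       (0.00, "block the feature path, review filters"),
    "cost_latency_blowup": (0.01, "cap retries, trim context, add caching"),
}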
The exact numbers depend on your product. The important part is: you choose them and you track them.
If wrong-but-plausible errors are too high, your options are not “try a better adjective.”
Your options are architectural: add grounding, add a verification step, narrow the task, or route more cases to a human.
This is how the system improves over time without turning into prompt folklore.
A failure budget is an operating model.
You want every LLM feature to have an explicit outcome policy. Here’s the one I use most often:
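Written as code, a policy in that spirit fits in a few lines; the states and the threshold here are illustrative:

def outcome_policy(result) -> str:
    # `result` is whatever your boundary layer produces after validation.
    if not result.schema_valid:
        return "retry_then_fallback"   # bounded repair, then a safe default
    if result.unsafe:
        return "block_and_log"         # never show or act on unsafe output
    if result.needs_human or result.confidence < 0.6:
        return "escalate_to_human"     # explicit, visible handoff
    return "accept"                    # passed contract, schema, and policy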
That policy gives you predictability even when the model is unpredictable.
Escalation is a product feature.
It’s the system saying:
“I am not confident enough to be autonomous.”
That is what “operable agents” will mean in 2024: bounded autonomy with explicit handoffs.
If you’re starting from scratch, don’t boil the ocean. Build the minimal boundary that makes the feature safe.
Write: the contract (the two questions) and the output schema.
Pick one: failure policy for when validation fails (a safe default, “unknown,” or a human handoff).
Track: schema-failure rate, fallback rate, escalation rate, latency, and cost per request.
That’s enough to ship responsibly. Everything else is iteration.
In early LLM projects, teams often use “it seems fine” as the main metric.
That’s a trap.
Your boundary layer should expose a small dashboard of truth: schema-failure rate, repair rate, fallback rate, escalation rate, latency, and cost per request.
It’s the same loop you used in distributed systems: observe, constrain, stabilize.
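One way to feed that dashboard is a structured event per request; the field names here are illustrative:

import json
import time

def log_boundary_event(feature: str, outcome: str, attempts: int,
                       latency_ms: float, total_tokens: int) -> None:
    # Emit one structured record per request; aggregate these into the dashboard.
    print(json.dumps({
        "ts": time.time(),
        "feature": feature,        # which LLM feature this request belonged to
        "outcome": outcome,        # accepted | repaired | fallback | escalated
        "attempts": attempts,      # model calls consumed, retries included
        "latency_ms": latency_ms,
        "total_tokens": total_tokens,
    }))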
Let’s make this concrete with a summarization feature.
The naive version: one prompt, and whatever prose comes back goes straight into the product.
The engineered version defines a contract, a schema, and a failure policy:
Contract
- Must never be wrong: sensitive_flags, because they gate escalation.
- “Probably right” is acceptable: the summary wording, because a human reviews it.
Schema (sketched as code below)
- summary_bullets (array, 3–7)
- action_items (array)
- tone (enum)
- sensitive_flags (array of enums)
Failure policy
- sensitive_flags contains high-risk: escalate (user confirmation)
That’s the difference between “a cool demo” and “a feature I can operate.”
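The schema for that field list, sketched as a Python dict (the enum values other than high-risk are illustrative assumptions):

SUMMARY_SCHEMA = {
    "type": "object",
    "required": ["summary_bullets", "action_items", "tone", "sensitive_flags"],
    "properties": {
        "summary_bullets": {
            "type": "array",
            "minItems": 3,
            "maxItems": 7,
            "items": {"type": "string"},
        },
        "action_items": {"type": "array", "items": {"type": "string"}},
        "tone": {"type": "string", "enum": ["neutral", "positive", "negative"]},
        "sensitive_flags": {
            "type": "array",
            "items": {"type": "string", "enum": ["legal", "medical", "high-risk"]},
        },
    },
    "additionalProperties": False,
}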
If the output doesn’t matter, you can keep it casual.
But the moment the output feeds code, reaches customers, or triggers actions,
you need contracts, schemas, and failure budgets.
Start small — but start with a boundary.
A boundary layer reduces a class of failures: invalid, unparseable, and out-of-contract outputs.
It does not magically make facts correct.
For factuality, you need grounding + verification, which we’ll build throughout 2024.
Stop searching for “the perfect prompt.”
Instead: write the contract, enforce the schema, budget the failures, and measure the outcomes.
This turns LLM work from art into engineering.
The anti-pattern to avoid: shipping unstructured prose into downstream logic.
If code depends on model output, the output must be structured, validated, and versioned.
This month was the foundation: contracts, schemas, failure budgets, and a boundary layer that owns them.
Next month we hit the 2024 shift that breaks a lot of teams’ intuition:
Long context isn’t memory: when to stuff, when to retrieve
Because the moment you get bigger context windows, it becomes tempting to “just paste everything.”
And that’s when cost, latency, and retrieval strategy become architecture again.