
LLMs aren’t “features” — they’re probabilistic runtime dependencies. This post gives the mental model, contracts, failure modes, and ship-ready checklists for building real products on top of them.
Axel Domingues
For seven years this blog has been a long exercise in one idea:
treat complex systems like systems — instrument them, define contracts, and design for failure.
Now, January 2023 is where those two lanes collide.
Because LLMs didn’t just introduce a new API.
They introduced a new kind of component:
A probabilistic component that can speak confidently while being wrong.
That changes software architecture more than most “new frameworks” ever will.
This post is not prompt tips. It’s the architectural mental model I wish every senior engineer had before shipping their first LLM feature.
The thesis
LLMs are probabilistic engines. Treat them like dependencies with uncertainty, not functions with correctness.
The new job
Your architecture must define truth boundaries (what must never be wrong) and build guardrails everywhere else.
The practice
Evals + observability + rollback become first-class. “It worked in a demo” is not evidence.
The payoff
Safer rollouts, fewer incidents, and LLM features that behave predictably under real traffic.
Classic software has a comforting property:
LLM features break that mental model in three ways:
So the architect’s unit of design changes.
You’re no longer designing for correctness of a function.
You’re designing for reliability of a behavior.
In deterministic software, you debug bugs. In probabilistic software, you manage error distributions.
Forget brand names for a moment.
A modern LLM in a product acts like this:
So the true contract is not “answer correctly.”
The true contract is:
Given a context, produce a plausible continuation that matches learned patterns.
That has consequences:
In 2022 I wrote about distributed data and invariants.
That mindset matters even more here.
Because when you introduce an LLM, you introduce a component that can produce confident nonsense. So you must decide where nonsense is acceptable and where it’s catastrophic.
Examples of truth boundaries:
If an LLM is involved in those, it must be constrained to assist, not decide.
You built a production incident generator.
A safe system separates:
This is not “AI safety philosophy.” It’s the same engineering discipline as outbox + sagas:
define invariants, enforce them with single-writer authority, and treat everything else as eventually correct.
Most teams only learn these after an incident.
You can learn them now.
Hallucination
Fabricates facts, citations, or confident details when context is missing.
Instruction drift
Loses constraints across long contexts; follows the wrong part of the prompt.
Tool misuse
Calls the wrong tool, calls it with the wrong arguments, or “pretends” it called it.
Overreach
Answers beyond the available evidence instead of asking for clarification.
Prompt injection
User-provided content manipulates the model into ignoring system instructions.
Cost blow-up
Long contexts, retries, or agent loops silently multiply spend and latency.
These aren’t edge cases. They are the normal behavior of a component optimized for plausibility.
So your architecture should assume they will happen.
Here’s the simplest control intuition I know:
The more freedom the model has, the more ways it can fail.
So reliability engineering becomes: reduce degrees of freedom.
Common guardrails (ordered from “cheap” to “strong”):
Never let the LLM be both the narrator and the source of truth.
The mistake is to bolt an LLM onto your backend like it’s just another function call.
The durable architecture treats it like a subsystem with safety rails.
Here’s the baseline stack I recommend thinking in:

Notice what’s missing:
Prompts matter, but prompts are only one layer of control. Real systems need multiple layers.
When someone says “we want to add ChatGPT to the product,” the right response is:
“Cool. What is the contract?”
Here’s the spec template I use.
### Feature name
### User value (one sentence)
### Allowed behavior
- what the assistant may do
- what it may suggest
- what it must not do
### Truth boundaries (non-negotiables)
- actions that require confirmation
- actions that require human approval
- actions the assistant cannot trigger
### Inputs (context sources)
- allowed sources (DB tables, APIs, docs)
- disallowed sources (PII, secrets, internal-only)
- freshness requirements
### Output format
- free-form / structured / strict JSON schema
- citation requirements
### Failure handling
- when to abstain (“I don’t know” UX)
- fallback behavior (search, escalation, manual flow)
### Evaluation plan
- offline test set
- regression checks
- launch guard metrics
### Rollout plan
- feature flag + cohort
- monitoring thresholds
- rollback triggers
Then it becomes the cheapest document you ever wrote.
This is the part teams skip — and then re-invent during the incident.
| Risk | Typical symptom | Detection | Mitigation |
|---|---|---|---|
| Hallucinated facts | Confident wrong answer | Spot-checks, user reports, eval set | Grounding (RAG), citations, “answer only from sources”, abstention |
| Prompt injection | Model follows malicious text | Red team prompts, tool logs | Strict tool policy, content isolation, instruction hierarchy, allowlist tools |
| Data leakage | Sensitive content appears | DLP scans, audit logs | Context filtering, PII redaction, least-privilege retrieval |
| Over-automation | Wrong action executed | Incident reports | Human approval gates, confirmations, read-only mode by default |
| Latency regressions | Slow UX, timeouts | Tracing, p95/p99 | Context budgets, caching, streaming, fallbacks |
| Cost blow-ups | Spend spikes | Token accounting | Token budgets, rate limits, caching, stop conditions |
| Behavior drift | Quality degrades after changes | Offline evals | Eval harness + canary releases + model/version pinning |
It’s to force the team to answer: “How will we know we’re failing, and what will we do when it happens?”
In classic software, unit tests catch regressions. In distributed systems, telemetry catches incidents. With LLMs, you need both.
You need a repeatable dataset of prompts and expected behavior:
You need traces that let you debug:
You have a ghost story.
Write down what the model cannot be trusted to do. Wire human confirmation for anything irreversible.
Use structured formats where possible. Validate schemas. Reject and retry (or fallback) when invalid.
Typed interfaces. Allowlist tools. Validate arguments. Use idempotency keys on side-effecting actions.
Set context limits per request. Log token usage. Implement timeouts and fallbacks.
A tiny dataset is better than none. Automate it in CI. Pin versions and re-run before shipping changes.
Store the assembled prompt/context (with redaction). Store tool calls. Make incidents reproducible.
Start with internal users. Then a small cohort. Define rollback triggers upfront.
You can do it. You just won’t be able to operate it.
By the end of 2022, the blog arrived at an uncomfortable truth:
operability is part of product correctness.
LLMs make that even more true.
Because your system’s “correctness” is no longer just:
It includes:
So the job of an architect is not to make the model smart.
It’s to make the system safe, observable, and evolvable under uncertainty.
They can be run deterministically (same context + same decoding choices), but in product reality you still get variability because:
So the engineering posture should assume variability even when sampling is “off.”
Rules help, but they don’t create truth.
Hallucination is often the model doing the “most plausible continuation” given missing evidence. The durable fix is to constrain degrees of freedom:
Start with assistive outputs:
Avoid:
Treating the model like a deterministic function and calling it directly from business logic.
The right architecture gives the model a controlled sandbox: policy → context → generation → validation → tools → audit.
This month set the architectural frame:
LLMs are probabilistic components. So we design truth boundaries, guardrails, evals, and observability.
Next month we go deeper into the history of why language was hard:
Why NLP Was Hard: RNN Pain, Vanishing Gradients, and the Limits of ‘Memory’
Because the fastest way to build good intuition about transformers and ChatGPT… is to understand what broke before them.
Why NLP Was Hard: RNN Pain, Vanishing Gradients, and the Limits of “Memory”
Before transformers, language models tried to compress entire histories into a single hidden state. This post explains why that was brittle: depth-in-time, vanishing/exploding gradients, and the engineering limits of “memory” — and why attention was inevitable.
Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)
A production system isn’t “done” when it works — it’s done when it can fail, recover, evolve, and stay correct under pressure. This capstone stitches the 2021–2022 series into a reference architecture and a decision log you can defend.