Jan 29, 2023 - 18 MIN READ

Software in the Age of Probabilistic Components

LLMs aren’t “features” — they’re probabilistic runtime dependencies. This post gives the mental model, contracts, failure modes, and ship-ready checklists for building real products on top of them.

Axel Domingues

For seven years this blog has been a long exercise in one idea:

treat complex systems like systems — instrument them, define contracts, and design for failure.

2016–2020 taught me to respect learning systems: stability isn’t luck, it’s design.
2021–2022 pulled me back into production reality: correctness, operability, and boundaries beat cleverness.

Now, January 2023 is where those two lanes collide.

Because LLMs didn’t just introduce a new API.

They introduced a new kind of component:

A probabilistic component that can speak confidently while being wrong.

That changes software architecture more than most “new frameworks” ever will.

This post is not prompt tips. It’s the architectural mental model I wish every senior engineer had before shipping their first LLM feature.

The thesis

LLMs are probabilistic engines. Treat them like dependencies with uncertainty, not functions with correctness.

The new job

Your architecture must define truth boundaries (what must never be wrong) and build guardrails everywhere else.

The practice

Evals + observability + rollback become first-class. “It worked in a demo” is not evidence.

The payoff

Safer rollouts, fewer incidents, and LLM features that behave predictably under real traffic.

The shift: from deterministic code to probabilistic components

Classic software has a comforting property:

same input
same code
same output (modulo bugs)

LLM features break that mental model in three ways:

Non-determinism (or “pseudo-determinism”)
- sampling and decoding choices change outputs
Underspecified inputs
- prompts are not specs; context is incomplete by default
Plausible failure
- outputs can look correct while being wrong

So the architect’s unit of design changes.

You’re no longer designing for correctness of a function.

You’re designing for reliability of a behavior.

A useful reframe:

In deterministic software, you debug bugs. In probabilistic software, you manage error distributions.

The LLM contract: what the component really does

Forget brand names for a moment.

A modern LLM in a product acts like this:

it consumes tokens (system instructions + user intent + context)
it generates tokens (a continuation)
it does so by choosing the most likely next token repeatedly
under a decoding policy you control (temperature/top-p/etc.)

So the true contract is not “answer correctly.”

The true contract is:

Given a context, produce a plausible continuation that matches learned patterns.

That has consequences:

it can be brilliant at synthesis
it can be consistent in tone
it can generalize in surprising ways
and it can fabricate when context is missing

The first architecture question: “What must never be wrong?”

In 2022 I wrote about distributed data and invariants.

That mindset matters even more here.

Because when you introduce an LLM, you introduce a component that can produce confident nonsense. So you must decide where nonsense is acceptable and where it’s catastrophic.

Examples of truth boundaries:

money movement, billing, refunds
access control, permissions, identity
legal/compliance commitments
irreversible writes (deleting data, publishing, sending)
medical/financial advice in regulated contexts

If an LLM is involved in those, it must be constrained to assist, not decide.

If an LLM can trigger irreversible actions from free-form text, you didn’t build an LLM feature.

You built a production incident generator.

A safe system separates:

assistive outputs (drafts, summaries, suggestions)
from authoritative actions (writes, commits, purchases, deletions)

This is not “AI safety philosophy.” It’s the same engineering discipline as outbox + sagas:

define invariants, enforce them with single-writer authority, and treat everything else as eventually correct.

Failure modes: the taxonomy you need before shipping

Most teams only learn these after an incident.

You can learn them now.

Hallucination

Fabricates facts, citations, or confident details when context is missing.

Instruction drift

Loses constraints across long contexts; follows the wrong part of the prompt.

Tool misuse

Calls the wrong tool, calls it with the wrong arguments, or “pretends” it called it.

Overreach

Answers beyond the available evidence instead of asking for clarification.

Prompt injection

User-provided content manipulates the model into ignoring system instructions.

Cost blow-up

Long contexts, retries, or agent loops silently multiply spend and latency.

These aren’t edge cases. They are the normal behavior of a component optimized for plausibility.

So your architecture should assume they will happen.

A practical mental model: degrees of freedom create risk

Here’s the simplest control intuition I know:

The more freedom the model has, the more ways it can fail.

So reliability engineering becomes: reduce degrees of freedom.

Common guardrails (ordered from “cheap” to “strong”):

Constrain format
- strict JSON output, schemas, structured sections
Constrain content
- grounding snippets, citations required, “answer only from sources” rules
Constrain action
- tool calls with typed interfaces and validation
Constrain authority
- human review for irreversible actions
Constrain distribution
- retrieval, policies, and post-verification

If you remember one rule from this post:

Never let the LLM be both the narrator and the source of truth.

The system architecture that survives reality

The mistake is to bolt an LLM onto your backend like it’s just another function call.

The durable architecture treats it like a subsystem with safety rails.

Here’s the baseline stack I recommend thinking in:

Intent & context assembly
- what is the user asking, and what data is allowed?
Policy layer
- what is permitted? what must be refused? what tools are available?
Model invocation
- prompt template, decoding settings, context window budgeting
Verification & post-processing
- schema validation, safety checks, citation checks, confidence gates
Action execution
- typed tools, idempotency keys, permissions, auditing
Telemetry & evals
- traces, offline eval harness, regressions, rollout flags

Reference architecture for shipping probabilistic components safely

Notice what’s missing:

“prompt engineering” as the main strategy

Prompts matter, but prompts are only one layer of control. Real systems need multiple layers.

The LLM feature spec: a template that forces clarity

When someone says “we want to add ChatGPT to the product,” the right response is:

“Cool. What is the contract?”

Here’s the spec template I use.

### Feature name
### User value (one sentence)
### Allowed behavior
- what the assistant may do
- what it may suggest
- what it must not do

### Truth boundaries (non-negotiables)
- actions that require confirmation
- actions that require human approval
- actions the assistant cannot trigger

### Inputs (context sources)
- allowed sources (DB tables, APIs, docs)
- disallowed sources (PII, secrets, internal-only)
- freshness requirements

### Output format
- free-form / structured / strict JSON schema
- citation requirements

### Failure handling
- when to abstain (“I don’t know” UX)
- fallback behavior (search, escalation, manual flow)

### Evaluation plan
- offline test set
- regression checks
- launch guard metrics

### Rollout plan
- feature flag + cohort
- monitoring thresholds
- rollback triggers

This looks “process heavy” until you ship without it.

Then it becomes the cheapest document you ever wrote.

The probabilistic risk register (what breaks, how you notice, what you do)

This is the part teams skip — and then re-invent during the incident.

Risk	Typical symptom	Detection	Mitigation
Hallucinated facts	Confident wrong answer	Spot-checks, user reports, eval set	Grounding (RAG), citations, “answer only from sources”, abstention
Prompt injection	Model follows malicious text	Red team prompts, tool logs	Strict tool policy, content isolation, instruction hierarchy, allowlist tools
Data leakage	Sensitive content appears	DLP scans, audit logs	Context filtering, PII redaction, least-privilege retrieval
Over-automation	Wrong action executed	Incident reports	Human approval gates, confirmations, read-only mode by default
Latency regressions	Slow UX, timeouts	Tracing, p95/p99	Context budgets, caching, streaming, fallbacks
Cost blow-ups	Spend spikes	Token accounting	Token budgets, rate limits, caching, stop conditions
Behavior drift	Quality degrades after changes	Offline evals	Eval harness + canary releases + model/version pinning

The purpose of this table is not to be perfect.

It’s to force the team to answer: “How will we know we’re failing, and what will we do when it happens?”

Evals and observability: your new minimum bar

In classic software, unit tests catch regressions. In distributed systems, telemetry catches incidents. With LLMs, you need both.

The eval harness (offline)

You need a repeatable dataset of prompts and expected behavior:

correct answers (where truth exists)
refusal cases
adversarial prompts
ambiguous prompts (where the right behavior is to ask questions)
tool use cases (where output must be structured)

The telemetry (online)

You need traces that let you debug:

what context was assembled
what policy was applied
what the model generated
what tools were called (with args and results)
how the user reacted (thumbs down, follow-ups, abandonment)

If you can’t reproduce a bad output with the exact context that produced it, you don’t have an LLM feature.

You have a ghost story.

Ship-ready checklist: the “don’t embarrass yourself” edition

Define truth boundaries

Write down what the model cannot be trusted to do. Wire human confirmation for anything irreversible.

Constrain outputs

Use structured formats where possible. Validate schemas. Reject and retry (or fallback) when invalid.

Constrain tools

Typed interfaces. Allowlist tools. Validate arguments. Use idempotency keys on side-effecting actions.

Budget tokens and time

Set context limits per request. Log token usage. Implement timeouts and fallbacks.

Create an eval harness before launch

A tiny dataset is better than none. Automate it in CI. Pin versions and re-run before shipping changes.

Add tracing and a replay path

Store the assembled prompt/context (with redaction). Store tool calls. Make incidents reproducible.

Roll out with a feature flag

Start with internal users. Then a small cohort. Define rollback triggers upfront.

Shipping without evals and tracing is like shipping a distributed system without logs.

You can do it. You just won’t be able to operate it.

The deeper point: “AI features” are systems features

By the end of 2022, the blog arrived at an uncomfortable truth:

operability is part of product correctness.

LLMs make that even more true.

Because your system’s “correctness” is no longer just:

code paths
database state
API responses

It includes:

context assembly
policy enforcement
model behavior
and user trust

So the job of an architect is not to make the model smart.

It’s to make the system safe, observable, and evolvable under uncertainty.

What’s Next

This month set the architectural frame:

LLMs are probabilistic components. So we design truth boundaries, guardrails, evals, and observability.

Next month we go deeper into the history of why language was hard:

Why NLP Was Hard

Because the fastest way to build good intuition about transformers and ChatGPT… is to understand what broke before them.

Why NLP Was Hard: RNN Pain, Vanishing Gradients, and the Limits of “Memory”

Before transformers, language models tried to compress entire histories into a single hidden state. This post explains why that was brittle: depth-in-time, vanishing/exploding gradients, and the engineering limits of “memory” — and why attention was inevitable.

Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)

A production system isn’t “done” when it works — it’s done when it can fail, recover, evolve, and stay correct under pressure. This capstone stitches the 2021–2022 series into a reference architecture and a decision log you can defend.