Blog
Jan 29, 2023 - 18 MIN READ
Software in the Age of Probabilistic Components

Software in the Age of Probabilistic Components

LLMs aren’t “features” — they’re probabilistic runtime dependencies. This post gives the mental model, contracts, failure modes, and ship-ready checklists for building real products on top of them.

Axel Domingues

Axel Domingues

For seven years this blog has been a long exercise in one idea:

treat complex systems like systems — instrument them, define contracts, and design for failure.

  • 2016–2020 taught me to respect learning systems: stability isn’t luck, it’s design.
  • 2021–2022 pulled me back into production reality: correctness, operability, and boundaries beat cleverness.

Now, January 2023 is where those two lanes collide.

Because LLMs didn’t just introduce a new API.

They introduced a new kind of component:

A probabilistic component that can speak confidently while being wrong.

That changes software architecture more than most “new frameworks” ever will.

This post is not prompt tips. It’s the architectural mental model I wish every senior engineer had before shipping their first LLM feature.

The thesis

LLMs are probabilistic engines. Treat them like dependencies with uncertainty, not functions with correctness.

The new job

Your architecture must define truth boundaries (what must never be wrong) and build guardrails everywhere else.

The practice

Evals + observability + rollback become first-class. “It worked in a demo” is not evidence.

The payoff

Safer rollouts, fewer incidents, and LLM features that behave predictably under real traffic.


The shift: from deterministic code to probabilistic components

Classic software has a comforting property:

  • same input
  • same code
  • same output (modulo bugs)

LLM features break that mental model in three ways:

  1. Non-determinism (or “pseudo-determinism”)
    • sampling and decoding choices change outputs
  2. Underspecified inputs
    • prompts are not specs; context is incomplete by default
  3. Plausible failure
    • outputs can look correct while being wrong

So the architect’s unit of design changes.

You’re no longer designing for correctness of a function.

You’re designing for reliability of a behavior.

A useful reframe:

In deterministic software, you debug bugs. In probabilistic software, you manage error distributions.


The LLM contract: what the component really does

Forget brand names for a moment.

A modern LLM in a product acts like this:

  • it consumes tokens (system instructions + user intent + context)
  • it generates tokens (a continuation)
  • it does so by choosing the most likely next token repeatedly
  • under a decoding policy you control (temperature/top-p/etc.)

So the true contract is not “answer correctly.”

The true contract is:

Given a context, produce a plausible continuation that matches learned patterns.

That has consequences:

  • it can be brilliant at synthesis
  • it can be consistent in tone
  • it can generalize in surprising ways
  • and it can fabricate when context is missing

The first architecture question: “What must never be wrong?”

In 2022 I wrote about distributed data and invariants.

That mindset matters even more here.

Because when you introduce an LLM, you introduce a component that can produce confident nonsense. So you must decide where nonsense is acceptable and where it’s catastrophic.

Examples of truth boundaries:

  • money movement, billing, refunds
  • access control, permissions, identity
  • legal/compliance commitments
  • irreversible writes (deleting data, publishing, sending)
  • medical/financial advice in regulated contexts

If an LLM is involved in those, it must be constrained to assist, not decide.

If an LLM can trigger irreversible actions from free-form text, you didn’t build an LLM feature.

You built a production incident generator.

A safe system separates:

  • assistive outputs (drafts, summaries, suggestions)
  • from authoritative actions (writes, commits, purchases, deletions)

This is not “AI safety philosophy.” It’s the same engineering discipline as outbox + sagas:

define invariants, enforce them with single-writer authority, and treat everything else as eventually correct.


Failure modes: the taxonomy you need before shipping

Most teams only learn these after an incident.

You can learn them now.

Hallucination

Fabricates facts, citations, or confident details when context is missing.

Instruction drift

Loses constraints across long contexts; follows the wrong part of the prompt.

Tool misuse

Calls the wrong tool, calls it with the wrong arguments, or “pretends” it called it.

Overreach

Answers beyond the available evidence instead of asking for clarification.

Prompt injection

User-provided content manipulates the model into ignoring system instructions.

Cost blow-up

Long contexts, retries, or agent loops silently multiply spend and latency.

These aren’t edge cases. They are the normal behavior of a component optimized for plausibility.

So your architecture should assume they will happen.


A practical mental model: degrees of freedom create risk

Here’s the simplest control intuition I know:

The more freedom the model has, the more ways it can fail.

So reliability engineering becomes: reduce degrees of freedom.

Common guardrails (ordered from “cheap” to “strong”):

  1. Constrain format
    • strict JSON output, schemas, structured sections
  2. Constrain content
    • grounding snippets, citations required, “answer only from sources” rules
  3. Constrain action
    • tool calls with typed interfaces and validation
  4. Constrain authority
    • human review for irreversible actions
  5. Constrain distribution
    • retrieval, policies, and post-verification
If you remember one rule from this post:

Never let the LLM be both the narrator and the source of truth.


The system architecture that survives reality

The mistake is to bolt an LLM onto your backend like it’s just another function call.

The durable architecture treats it like a subsystem with safety rails.

Here’s the baseline stack I recommend thinking in:

  1. Intent & context assembly
    • what is the user asking, and what data is allowed?
  2. Policy layer
    • what is permitted? what must be refused? what tools are available?
  3. Model invocation
    • prompt template, decoding settings, context window budgeting
  4. Verification & post-processing
    • schema validation, safety checks, citation checks, confidence gates
  5. Action execution
    • typed tools, idempotency keys, permissions, auditing
  6. Telemetry & evals
    • traces, offline eval harness, regressions, rollout flags

Notice what’s missing:

  • “prompt engineering” as the main strategy

Prompts matter, but prompts are only one layer of control. Real systems need multiple layers.


The LLM feature spec: a template that forces clarity

When someone says “we want to add ChatGPT to the product,” the right response is:

“Cool. What is the contract?”

Here’s the spec template I use.

### Feature name
### User value (one sentence)
### Allowed behavior
- what the assistant may do
- what it may suggest
- what it must not do

### Truth boundaries (non-negotiables)
- actions that require confirmation
- actions that require human approval
- actions the assistant cannot trigger

### Inputs (context sources)
- allowed sources (DB tables, APIs, docs)
- disallowed sources (PII, secrets, internal-only)
- freshness requirements

### Output format
- free-form / structured / strict JSON schema
- citation requirements

### Failure handling
- when to abstain (“I don’t know” UX)
- fallback behavior (search, escalation, manual flow)

### Evaluation plan
- offline test set
- regression checks
- launch guard metrics

### Rollout plan
- feature flag + cohort
- monitoring thresholds
- rollback triggers
This looks “process heavy” until you ship without it.

Then it becomes the cheapest document you ever wrote.


The probabilistic risk register (what breaks, how you notice, what you do)

This is the part teams skip — and then re-invent during the incident.

RiskTypical symptomDetectionMitigation
Hallucinated factsConfident wrong answerSpot-checks, user reports, eval setGrounding (RAG), citations, “answer only from sources”, abstention
Prompt injectionModel follows malicious textRed team prompts, tool logsStrict tool policy, content isolation, instruction hierarchy, allowlist tools
Data leakageSensitive content appearsDLP scans, audit logsContext filtering, PII redaction, least-privilege retrieval
Over-automationWrong action executedIncident reportsHuman approval gates, confirmations, read-only mode by default
Latency regressionsSlow UX, timeoutsTracing, p95/p99Context budgets, caching, streaming, fallbacks
Cost blow-upsSpend spikesToken accountingToken budgets, rate limits, caching, stop conditions
Behavior driftQuality degrades after changesOffline evalsEval harness + canary releases + model/version pinning
The purpose of this table is not to be perfect.

It’s to force the team to answer: “How will we know we’re failing, and what will we do when it happens?”


Evals and observability: your new minimum bar

In classic software, unit tests catch regressions. In distributed systems, telemetry catches incidents. With LLMs, you need both.

The eval harness (offline)

You need a repeatable dataset of prompts and expected behavior:

  • correct answers (where truth exists)
  • refusal cases
  • adversarial prompts
  • ambiguous prompts (where the right behavior is to ask questions)
  • tool use cases (where output must be structured)

The telemetry (online)

You need traces that let you debug:

  • what context was assembled
  • what policy was applied
  • what the model generated
  • what tools were called (with args and results)
  • how the user reacted (thumbs down, follow-ups, abandonment)
If you can’t reproduce a bad output with the exact context that produced it, you don’t have an LLM feature.

You have a ghost story.


Ship-ready checklist: the “don’t embarrass yourself” edition

Define truth boundaries

Write down what the model cannot be trusted to do. Wire human confirmation for anything irreversible.

Constrain outputs

Use structured formats where possible. Validate schemas. Reject and retry (or fallback) when invalid.

Constrain tools

Typed interfaces. Allowlist tools. Validate arguments. Use idempotency keys on side-effecting actions.

Budget tokens and time

Set context limits per request. Log token usage. Implement timeouts and fallbacks.

Create an eval harness before launch

A tiny dataset is better than none. Automate it in CI. Pin versions and re-run before shipping changes.

Add tracing and a replay path

Store the assembled prompt/context (with redaction). Store tool calls. Make incidents reproducible.

Roll out with a feature flag

Start with internal users. Then a small cohort. Define rollback triggers upfront.

Shipping without evals and tracing is like shipping a distributed system without logs.

You can do it. You just won’t be able to operate it.


The deeper point: “AI features” are systems features

By the end of 2022, the blog arrived at an uncomfortable truth:

operability is part of product correctness.

LLMs make that even more true.

Because your system’s “correctness” is no longer just:

  • code paths
  • database state
  • API responses

It includes:

  • context assembly
  • policy enforcement
  • model behavior
  • and user trust

So the job of an architect is not to make the model smart.

It’s to make the system safe, observable, and evolvable under uncertainty.


FAQ


What’s Next

This month set the architectural frame:

LLMs are probabilistic components. So we design truth boundaries, guardrails, evals, and observability.

Next month we go deeper into the history of why language was hard:

Why NLP Was Hard: RNN Pain, Vanishing Gradients, and the Limits of ‘Memory’

Because the fastest way to build good intuition about transformers and ChatGPT… is to understand what broke before them.

Axel Domingues - 2026