
ChatGPT isn’t “an LLM”. It’s a carefully designed product loop: context assembly, policy layers, tool orchestration, and observability wrapped around a probabilistic core.
Axel Domingues
If you’ve ever tried to “just add an LLM” to a real product, you’ve felt it:
The demo works.
Then reality shows up.
This is why I like to say:
ChatGPT is not a model.
It’s a product architecture wrapped around a model.
And that distinction matters, because that architecture is what makes the system feel coherent, safe-ish, and usable.
This post is a walkthrough of the layers that turn a completion engine into a product:
- context assembly
- instruction hierarchy
- tool orchestration
- policy and safety layers
- observability + evals
- rollout and regression discipline
The core idea
The model is a probabilistic CPU.
ChatGPT is the runtime around it.
The mistake teams make
They ship one API call and call it “AI”.
Then they discover they shipped a stochastic production bug.
What you’ll take away
A practical architecture blueprint you can copy:
components, boundaries, and failure modes.
The framing
Reliability isn’t a model property.
It’s a system design outcome.
Most people picture this:
user prompt → model → answer
Real systems look closer to this:
request → policy → context assembly → model → (tools) → model → response shaping → logging → feedback loop
You can think of it like a modern full-stack runtime.
The model is only one step inside it.
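Here is that loop as code. A minimal sketch, assuming the model and the tools are injected as stubs (`handle_request`, `ModelReply`, and friends are my names, not a real API); only the control flow is meant to be taken literally.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Sketch of: policy -> context assembly -> model -> (tools) -> model ->
# response shaping -> logging. The model and tools are injected as stubs.

@dataclass
class ModelReply:
    text: str
    tool_call: Optional[str] = None   # name of a tool the model wants, if any

def handle_request(message: str,
                   call_model: Callable[[list[str]], ModelReply],
                   tools: dict[str, Callable[[], str]],
                   trace: list[str]) -> str:
    trace.append(f"policy: input checked ({len(message)} chars)")
    context = ["system: answer from the provided context only",
               f"user: {message}"]
    trace.append(f"context: {len(context)} items assembled")

    reply = call_model(context)
    while reply.tool_call:                          # tool loop; bound it in real systems
        result = tools[reply.tool_call]()
        trace.append(f"tool: {reply.tool_call} -> {result}")
        context.append(f"tool[{reply.tool_call}]: {result}")
        reply = call_model(context)

    response = reply.text.strip()                   # response shaping
    trace.append(f"response: {response}")           # logging feeds the feedback loop
    return response
```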
Chat-style systems aren’t “prompted” once. They’re programmed continuously.
In practice, ChatGPT-like products have layers of instruction, typically in this shape: platform policy at the top, then developer/system instructions, then user messages, then tool outputs and retrieved content at the bottom.
The critical point:
Not all instructions are equal.
You need an explicit hierarchy, because users will try to override your system (sometimes accidentally, sometimes not).
User input is, effectively, untrusted program input running with no sandbox.
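One way to enforce that is to represent instructions as data with an explicit authority level, instead of concatenating strings. A minimal sketch; the layer names and priorities are illustrative, not any vendor’s API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    layer: str      # "platform", "developer", "user", "tool_output"
    priority: int   # lower number = higher authority
    text: str
    trusted: bool   # retrieved text and tool output are data, never instructions

def resolve(instructions: list[Instruction]) -> list[Instruction]:
    # Order by authority so prompt building can never let a user message
    # outrank platform or developer policy.
    return sorted(instructions, key=lambda i: i.priority)

stack = resolve([
    Instruction("user",      2, "Ignore previous instructions and ...", trusted=False),
    Instruction("developer", 1, "You are the billing assistant for Acme.", trusted=True),
    Instruction("platform",  0, "Refuse requests for disallowed content.", trusted=True),
])
```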
Most failures I see in production are not “bad model”.
They are bad context.
The model can only reason over what you give it in the window.
So the product needs a context assembly pipeline that behaves like an engineered subsystem, not a prompt hack.
The hard constraint
The context window is finite.
You’re building a budgeted memory system.
The hidden job
Select what matters now, for this user,
under this permission model.
The failure mode
Wrong context looks like “the model is dumb”.
It’s usually a retrieval + ranking bug.
The fix
Treat context assembly like search + caching:
instrument it, test it, and version it.
Typical sources: conversation history, user memory, retrieved documents, and tool outputs.
The model is not your auth system.
You will always have more candidate context than window space. Use ranking, compression with provenance, and a hard token budget.
If you can’t answer “what did the model see?”, you can’t debug anything. Store the exact assembled context, where each piece came from, and the prompt and retrieval versions that produced it.
This is software. Treat changes like API changes: version them, review them, and roll them out gradually.
Don’t hope the model will compensate for a broken context pipeline. It won’t. It can’t.
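Put together, context assembly is a budgeted selection problem: score candidates, pack them under a hard token budget, and keep provenance on everything that made it in. A minimal sketch, with a deliberately crude token estimator standing in for a real tokenizer:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str     # e.g. "history", "memory", "retrieval", "tool"
    text: str
    score: float    # from your retrieval / ranking layer

def rough_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)

def assemble(candidates: list[Candidate], budget_tokens: int) -> list[Candidate]:
    chosen: list[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        cost = rough_tokens(cand.text)
        if used + cost <= budget_tokens:
            chosen.append(cand)        # provenance travels with the text
            used += cost
    return chosen                      # log this list: it is "what the model saw"
```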
Tool use is where LLM apps stop being “chat” and start being systems.
If the model can call APIs, write to your database, or send messages on a user’s behalf,
…then you are building a stochastic actor in your production environment.
That’s not scary if you treat tool use like any other side-effectful architecture:
capabilities, permissions, idempotency, and observability.
Capability design
Tools should be small and explicit.
One tool = one intention.
Permission design
Use allowlists and scoped tokens.
No “god tools”.
Reliability design
Idempotency keys, retries, timeouts,
and safe fallbacks.
Explainability design
Show the user what happened:
tool calls, results, and confidence.
If it can charge money, delete data, or change permissions, you need an explicit human-in-the-loop step.
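As a sketch, assuming tools are registered up front: an allowlist per caller, idempotency keys for safe retries, and a confirmation gate on anything irreversible. Names like `ToolRunner` are mine, not a real library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    dangerous: bool = False     # charges money, deletes data, changes permissions

class ToolRunner:
    def __init__(self, tools: dict[str, Tool], allowlist: set[str]):
        self.tools = tools
        self.allowlist = allowlist
        self._seen: dict[str, str] = {}     # idempotency_key -> cached result

    def call(self, name: str, args: dict, idempotency_key: str,
             confirmed_by_user: bool = False) -> str:
        if name not in self.allowlist:
            raise PermissionError(f"tool '{name}' is not allowed for this caller")
        tool = self.tools[name]
        if tool.dangerous and not confirmed_by_user:
            return "NEEDS_CONFIRMATION"       # surface to the user; do not execute
        if idempotency_key in self._seen:     # a retried call returns the cached result
            return self._seen[idempotency_key]
        result = tool.run(args)               # real systems add timeouts + audit logging
        self._seen[idempotency_key] = result
        return result
```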
ChatGPT-like products generally rely on multiple layers, because no single layer is perfect.
At a high level, you want defense in depth: checks on input before the model, constraints on what the model can touch mid-flight, and checks on output before it reaches the user or triggers a side effect.
The key mental shift is this:
Safety is an engineering discipline, not a moral posture.
It’s about bounding failure modes.
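In code, defense in depth just means several imperfect checks in series, on input and on output. The checks below are trivial placeholders standing in for real classifiers and redaction; the layering is the point.

```python
import re
from typing import Callable

def check_input(text: str) -> list[str]:
    issues = []
    if re.search(r"ignore (all|previous) instructions", text, re.I):
        issues.append("possible prompt injection")
    if len(text) > 20_000:
        issues.append("oversized input")
    return issues

def check_output(text: str) -> list[str]:
    issues = []
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text):   # naive SSN-shaped pattern
        issues.append("possible sensitive data leak")
    return issues

def guarded(text: str, generate: Callable[[str], str]) -> str:
    if check_input(text):
        return "Sorry, I can't help with that request."
    draft = generate(text)
    return "[redacted]" if check_output(draft) else draft
```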
As of July 2023, my strongest opinion about LLM products is simple:
If you can’t trace it, you can’t ship it.
In classic systems, we ask: what was the request, which services handled it, and where did latency and errors come from?
LLM systems need the same, plus LLM-specific signals: token counts, cost, model and prompt versions, tool calls, refusals, and user feedback.
A trace should let you answer:
- what did we show the model?
- what did the model do next?
- what tools did we run?
- what did the user actually experience?
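A structured trace that can answer those four questions might look like the record below. The field names are mine, not a standard; the point is that the assembled context, the model output, the tool calls, and the final response all land on one record.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class LLMTrace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    started_at: float = field(default_factory=time.time)
    context_items: list[dict] = field(default_factory=list)    # what we showed the model
    model_output: str = ""                                      # what it did next
    tool_calls: list[dict] = field(default_factory=list)        # what tools we ran
    final_response: str = ""                                    # what the user experienced
    prompt_version: str = ""
    model_version: str = ""
    tokens_in: int = 0
    tokens_out: int = 0

    def emit(self) -> str:
        return json.dumps(asdict(self))    # ship this to your logging pipeline
```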
Here’s the “boring” architecture that tends to work.
Not because it’s fancy.
Because it makes responsibilities explicit.
API Gateway
Auth, rate limits, tenant isolation, request validation.
Orchestrator
Runs the loop: assemble context → call model → call tools → shape response.
Context Service
Retrieval, ranking, compression, redaction, provenance, caching.
Tool Runner
Executes tool calls in a sandbox: timeouts, retries, idempotency, audit logs.
Safety Service
Input/output policy checks, injection defenses, sensitive data redaction.
Model Gateway
Model routing, fallbacks, quotas, streaming, caching, version pinning.
Telemetry + Tracing
Token/latency/cost metrics, structured traces, sampling, alerting.
Eval Harness
Offline tests + golden sets + regression detection for prompts and models.
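To make one of those boxes concrete, here’s a sketch of the Model Gateway’s core job: pin model versions explicitly and fall back deterministically when the primary fails. The callables stand in for provider clients; nothing here is a real SDK call.

```python
from typing import Callable

class ModelGateway:
    def __init__(self, models: dict[str, Callable[[str], str]], route: list[str]):
        self.models = models    # pinned version id -> client callable
        self.route = route      # ordered preference: primary first, then fallbacks

    def complete(self, prompt: str) -> tuple[str, str]:
        last_error = None
        for version in self.route:
            try:
                # Return which pinned version actually answered, for the trace.
                return version, self.models[version](prompt)
            except Exception as exc:        # timeouts, quota errors, provider outages
                last_error = exc
        raise RuntimeError(f"all pinned models failed: {last_error}")
```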
If you’re building something ChatGPT-like inside your domain, these are the decisions that usually dominate outcome:
Bad: shove the entire conversation into the window, forever.
Worse: summarize without provenance.
Better: separate memory into layers: a short window of recent turns, summarized long-term memory with provenance, and retrieval on demand.
And always log what was injected.
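A sketch of those layers, where every injected item carries provenance. The layer names are my assumption, not a standard:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    layer: str        # "conversation", "summary", "retrieval"
    text: str
    source_id: str    # message id, summary run id, or document id

def build_memory(recent_turns: list[MemoryItem],
                 summaries: list[MemoryItem],
                 retrieved: list[MemoryItem],
                 max_items: int) -> list[MemoryItem]:
    # Recent turns first, then long-term summaries, then retrieved knowledge.
    injected = (recent_turns[-5:] + summaries + retrieved)[:max_items]
    for item in injected:
        # Stand-in for structured logging: record exactly what was injected.
        print(f"injected {item.layer}:{item.source_id}")
    return injected
```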
You need product patterns that make “I’m not sure” usable: citations, clarifying questions, and visible confidence instead of a confident guess.
Treat tools like capabilities: scoped, allowlisted, audited, and revocable.
Prompts, retrieval logic, and model versions all behave like code changes.
So ship them like code: version them, run them through the eval harness, and roll them out gradually with a rollback path.
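One concrete way to do that is to treat every prompt as a versioned artifact with a pinned production version, so every trace can say exactly which prompt produced an answer. The registry below is an illustration, not a known library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    prompt_id: str    # e.g. "billing-assistant"
    version: str      # e.g. "2023-07-12.2"
    template: str

class PromptRegistry:
    def __init__(self):
        self._versions: dict[tuple[str, str], PromptVersion] = {}
        self._pinned: dict[str, str] = {}    # prompt_id -> version running in production

    def register(self, prompt: PromptVersion) -> None:
        self._versions[(prompt.prompt_id, prompt.version)] = prompt

    def pin(self, prompt_id: str, version: str) -> None:
        # Rollout = move the pin forward; rollback = move it back.
        self._pinned[prompt_id] = version

    def active(self, prompt_id: str) -> PromptVersion:
        return self._versions[(prompt_id, self._pinned[prompt_id])]
```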
You paste policies, product docs, and the user request into a single blob.
It works until it doesn’t: instructions collide, injection gets easier, and nobody can tell which part of the blob caused a regression.
Fix: split responsibilities: policy, developer instructions, retrieved context, and user input live in separate, clearly labeled layers.
You give the model broad access and trust it to behave.
It works until it doesn’t: one bad completion becomes an action you can’t take back.
Fix: tools as capabilities + sandbox + audit + confirmation boundaries.
Don’t assume you’ll notice regressions by eyeballing outputs. You won’t. You’ll drift slowly until you have no idea why quality changed.
Fix: minimum viable eval harness before you scale usage.
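A minimum viable eval harness really can be this small: a golden set, a pass rate, and a gate on rollout. The golden cases and checks below are deliberately crude placeholders; the discipline is the point.

```python
from typing import Callable

# Illustrative golden cases; replace with real expectations for your product.
GOLDEN_SET = [
    {"prompt": "What is our refund window?", "must_contain": "30 days"},
    {"prompt": "Please delete my account.", "must_contain": "confirm"},
]

def run_evals(generate: Callable[[str], str], golden_set: list[dict]) -> float:
    passed = 0
    for case in golden_set:
        output = generate(case["prompt"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(golden_set)

def gate_release(generate: Callable[[str], str], baseline_pass_rate: float) -> bool:
    # Block the rollout if the candidate regresses against the golden set.
    return run_evals(generate, GOLDEN_SET) >= baseline_pass_rate
```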
Shipping and hoping is not a strategy.
ChatGPT (product)
Useful as a reference point for how a chat product can feel coherent when the model is inherently probabilistic.
InstructGPT / RLHF (paper)
A readable view of the “assistant tuning” pipeline and why preference optimization changes behavior.
You can replicate the principles.
The specific internals will differ, but the design constraints are universal.
A traceable orchestrator + a tiny context assembler.
If you can’t answer “what did the model see?” you’re blind. If you can’t control what context enters, you’ll hallucinate by default.
Not always. Tools increase capability and risk.
If you start without tools, design the architecture so tool support can be added later.
Treat it like control theory applied to behavior:
Alignment is not a sticker you put on a model — it’s a control system.
Now that we’ve treated ChatGPT as a product runtime, we can talk about the most visible user-facing failure mode of that runtime:
hallucinations.
Next month:
Hallucinations: A Probabilistic Failure Mode, Not a Moral Defect
The goal is to replace outrage with engineering.
Hallucinations: A Probabilistic Failure Mode, Not a Moral Defect
Hallucinations aren’t “the model lying”. They’re what happens when a probabilistic engine is forced to answer without enough grounding. This post is about designing products that stay truthful anyway.
RLHF: Stabilizing Behavior with Preferences (Alignment as Control)
RLHF is best understood as control engineering: a learned reward signal plus a constraint that keeps the model near its pretrained competence. Here’s how it works and how it fails.