
Pretraining gives you a powerful text predictor. Instruction tuning turns it into something that behaves like a helpful tool. This post explains what instruction tuning changes, what it can’t change, and how to design products around the new failure modes.
Axel Domingues
In April, the big idea was:
Pretraining is compression.
You pay an enormous up-front cost to distill the internet into a predictive model of text.
But a pretrained model is not an assistant.
It’s not even “trying” to help you.
It’s doing one thing:
Given a prefix, predict a plausible continuation.
That’s a completion engine.
So May is about the next boundary shift:
Instruction tuning — the step where a completion engine starts behaving like something you can put behind an API without immediately embarrassing yourself.
Not because it becomes “truthful.”
Not because it becomes “safe.”
But because it gains a new skill that software actually cares about:
Following a task spec in the presence of context.
What instruction tuning is
Supervised fine-tuning on (instruction → response) data that teaches task-following as a default behavior.
What instruction tuning is not
It doesn’t make the model factual, reliable, or secure. It mostly changes how it answers, not whether it’s right.
You don’t need to know the optimizer to ship LLM features. You need to know what changed in the behavior distribution — and what that implies for architecture.
A raw pretrained model has a problem that becomes obvious the first time you build a product with it:
It will happily continue whatever you started.
Sometimes that continuation is helpful.
Sometimes it’s a debate thread.
Sometimes it’s the wrong persona.
Sometimes it “finishes” a user prompt with an answer that feels confident but is structurally nonsensical.
The model isn’t broken.
It’s doing exactly what you trained it to do.
Next-token prediction doesn’t contain your product requirements.
So instruction tuning adds a training signal that does reflect product requirements:
That’s the bridge from “predictive model” to “assistant-like policy.”
Instruction tuning is usually implemented as supervised fine-tuning (SFT):
The model learns a new default:
“When I see an instruction-like prefix, the best continuation is an answer.”
That might sound small.
It’s not.
Because “answering” is not a natural property of next-token prediction — it’s a learned conversational protocol.
You can treat the modern “ChatGPT-like” behavior as a layered evolution:
Learns language, facts, style, and general world structure by predicting tokens on massive corpora.
Teaches a task-following prior: respond to instructions in a helpful format.
Refines behavior using comparisons and “what humans prefer,” often improving refusal behavior, politeness, and conversational consistency.
May is the middle layer: SFT.
June is the third layer: RLHF.
One of the most useful things you can internalize is:
SFT doesn’t primarily add new capabilities.
It changes which capabilities are reachable by default.
The base model already contains:
But without instruction tuning, you have to “prompt it into” those modes reliably.
Instruction tuning hardcodes a conversational protocol:
That “protocol prior” is what makes the same base model feel dramatically more usable.
Instruction tuning is one of those domains where:
data is the architecture.
Because the dataset defines:
You can get a “better” model simply by changing the dataset — without touching the network.
Which leads to the most important practical implication:
Your dataset is your product’s behavioral spec.
Most real instruction-tuned models are trained on a mixture:
Why mixes win:
It will “work” in your demo prompts and collapse in production prompts.
Instruction tuning makes the model more compliant.
That’s good when the instruction is benign.
It’s bad when the instruction is:
This is why, as an architect, you should treat instruction tuning as:
A capability amplifier and a risk amplifier.
The model becomes better at trying to satisfy the user.
It does not become better at validating reality.
That tension is why “helpful assistants” hallucinate with confidence.
They are trained to answer.
Not trained to say “I don’t know” unless your data made that normal.
Here’s the key shift:
A pretrained model is like a raw runtime.
An instruction-tuned model is like a runtime with a default framework.
It “expects” a certain interaction pattern.
That affects how you design your system boundaries.
You can treat the prompt like a contract:
Instead of shoving everything into one blob prompt, you can build a pipeline:
Instruction tuning increases your need for “truth boundaries,” not decreases it.
That’s the architecture lens from 2022 showing up again.
Instruction tuning shifts probabilities, not guarantees.
These are the predictable ways it fails:
Instruction tuning increases the chance the model will produce a complete, well-structured answer even when it lacks grounding.
Design implication: treat factual claims as untrusted unless grounded (RAG, tools, citations, or post-verification).
If the user prompt is underspecified, the model fills gaps with plausible assumptions.
Design implication: build clarifying-question logic (or UI constraints) for high-stakes workflows.
The assistant is trained to follow instructions. Attackers exploit that.
Design implication: use message role separation (system/developer/user), tool allowlists, and treat retrieved text as untrusted input.
Even with “return JSON,” the model occasionally produces extra text, missing fields, or malformed output.
Design implication: use structured output mechanisms when available, or validate + retry with constrained prompts.
If you’re building real product features, stop thinking “prompting.”
Start thinking “interface design.”
Here’s the pattern I use:
System / developer message
Defines role, safety boundaries, and output contract. This is your policy layer.
User task spec
Clear instruction + constraints. No mixed context. No hidden requirements.
Context packet
Retrieved facts, DB rows, tool outputs — explicitly labeled as context, with provenance.
Output schema
A concrete format: sections, bullets, JSON schema, or markdown contract.
Notice what’s missing:
Those are human-friendly. They are not system-friendly.
If you’re picking a model (or considering fine-tuning), review this list:
This is the same mindset as 2022 distributed systems:
You don’t get reliability from hope.
You get it from design + tests + telemetry.
InstructGPT (OpenAI)
A clear description of the pipeline: pretraining → supervised instruction tuning → preference optimization.
FLAN / instruction mixtures
A practical demonstration that diverse instruction mixtures can unlock strong generalization.
Usually it makes the model more usable, not fundamentally smarter.
It changes which behaviors are likely given instruction-like prompts:
But it doesn’t guarantee factuality or eliminate hallucinations.
Default to a strong instruction-tuned base model first.
Fine-tune when you need:
And only after you have an evaluation harness — otherwise you can’t tell if you improved anything.
Because it increases the probability of producing a complete answer.
If the model lacks grounding, it may still “finish the task” with plausible text. That’s why grounding and post-validation become more important after instruction tuning.
Instruction tuning makes the model behave like an assistant.
But it doesn’t fully answer the hardest question:
Which answers do humans actually prefer — and which behaviors should be discouraged even if they are “helpful”?
Next month is the missing piece:
RLHF: Stabilizing Behavior with Preferences (Alignment as Control).
That’s where “assistant-like” becomes “product-like”:
And more importantly:
a new way to think about alignment as control engineering.
RLHF: Stabilizing Behavior with Preferences (Alignment as Control)
RLHF is best understood as control engineering: a learned reward signal plus a constraint that keeps the model near its pretrained competence. Here’s how it works and how it fails.
Pretraining Is Compression: Tokens, Datasets, and Emergent Skill
Pretraining isn’t “learning facts.” It’s learning to compress a giant slice of the internet into a predictive machine. This post gives senior engineers the mental model: tokens, data mixtures, scaling, and why capabilities seem to ‘emerge’—plus the practical implications for cost, reliability, and product design.