May 28, 2023 - 16 MIN READ

Instruction Tuning: Turning a Completion Engine into an Assistant

Pretraining gives you a powerful text predictor. Instruction tuning turns it into something that behaves like a helpful tool. This post explains what instruction tuning changes, what it can’t change, and how to design products around the new failure modes.

Axel Domingues

In April, the big idea was:

Pretraining is compression.

You pay an enormous up-front cost to distill the internet into a predictive model of text.

But a pretrained model is not an assistant.

It’s not even “trying” to help you.

It’s doing one thing:

Given a prefix, predict a plausible continuation.

That’s a completion engine.

So May is about the next boundary shift:

Instruction tuning — the step where a completion engine starts behaving like something you can put behind an API without immediately embarrassing yourself.

Not because it becomes “truthful.”

Not because it becomes “safe.”

But because it gains a new skill that software actually cares about:

Following a task spec in the presence of context.

What instruction tuning is

Supervised fine-tuning on (instruction → response) data that teaches task-following as a default behavior.

What instruction tuning is not

It doesn’t make the model factual, reliable, or secure. It mostly changes how it answers, not whether it’s right.

This post is intentionally an engineering mental model, not a training cookbook.

You don’t need to know the optimizer to ship LLM features. You need to know what changed in the behavior distribution — and what that implies for architecture.

The gap instruction tuning is trying to close

A raw pretrained model has a problem that becomes obvious the first time you build a product with it:

It will happily continue whatever you started.

Sometimes that continuation is helpful.

Sometimes it’s a debate thread.

Sometimes it’s the wrong persona.

Sometimes it “finishes” a user prompt with an answer that feels confident but is structurally nonsensical.

The model isn’t broken.

It’s doing exactly what you trained it to do.

Next-token prediction doesn’t contain your product requirements.

So instruction tuning adds a training signal that does reflect product requirements:

respond to an instruction
stay in the role
produce the kind of output humans label as “a good answer”

That’s the bridge from “predictive model” to “assistant-like policy.”

A crisp definition: instruction tuning is supervised policy shaping

Instruction tuning is usually implemented as supervised fine-tuning (SFT):

You take a pretrained model.
You train it on pairs like:
prompt / instruction → desired response
You repeat across many task types.

The model learns a new default:

“When I see an instruction-like prefix, the best continuation is an answer.”

That might sound small.

It’s not.

Because “answering” is not a natural property of next-token prediction — it’s a learned conversational protocol.

The three-layer mental model: base → instruction → preference

You can treat the modern “ChatGPT-like” behavior as a layered evolution:

Base model (pretraining)

Learns language, facts, style, and general world structure by predicting tokens on massive corpora.

Instruction tuning (SFT)

Teaches a task-following prior: respond to instructions in a helpful format.

Preference tuning (RLHF / DPO / etc.)

Refines behavior using comparisons and “what humans prefer,” often improving refusal behavior, politeness, and conversational consistency.

May is the middle layer: SFT.

June is the third layer: RLHF.

Why instruction tuning works: it teaches a protocol, not knowledge

One of the most useful things you can internalize is:

SFT doesn’t primarily add new capabilities.
It changes which capabilities are reachable by default.

The base model already contains:

latent translation ability
latent summarization ability
latent QA patterns
latent code patterns

But without instruction tuning, you have to “prompt it into” those modes reliably.

Instruction tuning hardcodes a conversational protocol:

“Instruction” means “you should do something.”
“User” means “their goal matters.”
“Answer” means “produce a useful artifact.”

That “protocol prior” is what makes the same base model feel dramatically more usable.

If pretraining is learning the world, instruction tuning is learning the interface.

Where the data matters more than the algorithm

Instruction tuning is one of those domains where:

data is the architecture.

Because the dataset defines:

what counts as “helpful”
what tone is considered “good”
what structure is expected (bullets, steps, code, caveats)
what the model should refuse
what the model should “hedge” on

You can get a “better” model simply by changing the dataset — without touching the network.

Which leads to the most important practical implication:

Please note:

If you tune on sloppy instructions, you get sloppy behavior.
If you tune on unsafe instructions, you get unsafe behavior.
If you tune on verbose answers, you get verbosity.

Your dataset is your product’s behavioral spec.

Common instruction-tuning datasets (and why mixes win)

Most real instruction-tuned models are trained on a mixture:

curated human-written instruction/response pairs
programmatically generated instructions (self-instruct style)
task datasets turned into instructions (translation, classification, QA, etc.)
multi-turn conversation data

Why mixes win:

you want breadth (many task types)
you want diversity of tone and format
you want robustness to prompt styles
you want coverage over “normal user behavior” (which is messy)

The fastest path to a brittle assistant is a narrow instruction set.

It will “work” in your demo prompts and collapse in production prompts.

The big tradeoff: instruction-following vs truth

Instruction tuning makes the model more compliant.

That’s good when the instruction is benign.

It’s bad when the instruction is:

ambiguous
incorrect
adversarial
asking for private data
asking to ignore your policy layer

This is why, as an architect, you should treat instruction tuning as:

A capability amplifier and a risk amplifier.

The model becomes better at trying to satisfy the user.

It does not become better at validating reality.

That tension is why “helpful assistants” hallucinate with confidence.

They are trained to answer.

Not trained to say “I don’t know” unless your data made that normal.

What instruction tuning changes in product architecture

Here’s the key shift:

A pretrained model is like a raw runtime.

An instruction-tuned model is like a runtime with a default framework.

It “expects” a certain interaction pattern.

That affects how you design your system boundaries.

1) You now have a stable concept of “task specification”

You can treat the prompt like a contract:

instruction (what to do)
context (what to use)
constraints (format, style, safety)
output (artifact)

2) You can separate “context assembly” from “generation”

Instead of shoving everything into one blob prompt, you can build a pipeline:

gather context (RAG, DB fetches, tool results)
assemble it into a structured prompt
generate output under constraints
post-process / validate

3) You need a truth strategy

Instruction tuning increases your need for “truth boundaries,” not decreases it.

what must never be wrong?
where is the source of truth?
how do we detect when the model is guessing?

That’s the architecture lens from 2022 showing up again.

The failure modes you should expect (and design around)

Instruction tuning shifts probabilities, not guarantees.

These are the predictable ways it fails:

A practical “instruction contract” you can ship

If you’re building real product features, stop thinking “prompting.”

Start thinking “interface design.”

Here’s the pattern I use:

System / developer message

Defines role, safety boundaries, and output contract. This is your policy layer.

User task spec

Clear instruction + constraints. No mixed context. No hidden requirements.

Context packet

Retrieved facts, DB rows, tool outputs — explicitly labeled as context, with provenance.

Output schema

A concrete format: sections, bullets, JSON schema, or markdown contract.

Notice what’s missing:

“Just be smart.”
“Figure it out.”
“You know what I mean.”

Those are human-friendly. They are not system-friendly.

The engineering checklist: “Are we actually instruction-tuning ready?”

If you’re picking a model (or considering fine-tuning), review this list:

Define the assistant’s behavioral spec

tone and refusal style
allowed domains / disallowed domains
default verbosity
citation requirements (if any)

Decide what must be grounded

which claims must be verifiable?
which outputs can be “best effort”?
which workflows need human confirmation?

Build your evaluation harness early

holdout prompts that match production
adherence checks (format, constraints)
factuality checks (where applicable)
regression tests for jailbreak-like prompts

Add observability for behavior

prompt + response logging (with privacy controls)
refusal rates
“I don’t know” rates
tool-call rates and tool errors
cost and latency percentiles

This is the same mindset as 2022 distributed systems:

You don’t get reliability from hope.
You get it from design + tests + telemetry.

Resources

InstructGPT (OpenAI)

A clear description of the pipeline: pretraining → supervised instruction tuning → preference optimization.

FLAN / instruction mixtures

A practical demonstration that diverse instruction mixtures can unlock strong generalization.

Self-Instruct

Using a model to generate instruction data — a useful idea, with predictable quality risks.

Prompt Injection (OWASP Top 10 for LLM Apps)

A threat-model lens for why instruction-following is a security boundary, not a feature.

FAQ

What’s Next

Instruction tuning makes the model behave like an assistant.

But it doesn’t fully answer the hardest question:

Which answers do humans actually prefer — and which behaviors should be discouraged even if they are “helpful”?

Next month is the missing piece:

RLHF.

That’s where “assistant-like” becomes “product-like”:

more consistent refusals
fewer obviously harmful completions
better conversational stability under messy prompts

And more importantly:

a new way to think about alignment as control engineering.

RLHF: Stabilizing Behavior with Preferences (Alignment as Control)

RLHF is best understood as control engineering: a learned reward signal plus a constraint that keeps the model near its pretrained competence. Here’s how it works and how it fails.

Pretraining Is Compression: Tokens, Datasets, and Emergent Skill

Pretraining isn’t “learning facts.” It’s learning to compress a giant slice of the internet into a predictive machine. This post gives senior engineers the mental model: tokens, data mixtures, scaling, and why capabilities seem to ‘emerge’—plus the practical implications for cost, reliability, and product design.