
RLHF is best understood as control engineering: a learned reward signal plus a constraint that keeps the model near its pretrained competence. Here’s how it works and how it fails.
Axel Domingues
Instruction tuning gives you an assistant.
RLHF is what makes that assistant feel stable.
Not perfect. Not safe by magic. Not truthful by default.
Stable.
If you’ve done any real reinforcement learning, RLHF should feel familiar: there is a policy (the model), actions (the tokens it emits), and a reward to maximize.
The twist is: in RLHF, the reward is human preference — a learned proxy for “good answers.”
RLHF is the moment we stop saying “the model is smart” and start saying: the model’s behavior is controlled.
The RLHF goal
Shift responses toward what humans prefer: helpful, harmless, and policy-compliant.
The key mechanism
Optimize a preference reward with a leash (stay close to the base model).
The core risk
Preference is a proxy — and proxies can be gamed (reward hacking).
The product takeaway
Alignment is not ideology. It’s control + constraints + observability.
Most explanations of RLHF start with the algorithm and end with “and it’s aligned now.”
That’s the wrong direction.
The right direction is: what problem is RLHF solving?
Two problems, really: you want behavior humans actually prefer, even though “good answer” has no clean loss function you can write down. And you want to change that behavior without destroying the competence pretraining paid for.
RLHF is a control loop that tries to balance those forces.
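In its commonly cited form (a sketch, not any particular lab’s exact recipe), the loop maximizes a learned preference reward r_φ while paying a KL penalty for drifting away from a frozen reference policy π_ref (often the SFT model), with β setting how tight the leash is:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \,\big]
```

Most of what follows is about what happens when r_φ is a bad proxy, and when β is too loose or too tight.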
If you’ve ever tuned a production system, this should ring a bell:
It’s not “maximize a metric.”
It’s “maximize it under constraints, and detect when it starts lying.”
Here’s the standard three-stage mental model:
1. Pretraining: a pretrained model has broad competence but no reliable assistant behavior.
2. Instruction tuning (SFT): you fine-tune on instruction → ideal response pairs (May’s post).
3. Preference optimization (RLHF): humans compare multiple candidate responses and choose which is better. A reward model learns to predict those preferences. Then you optimize the assistant to score higher on that reward — while staying near its original distribution.
If you stop there, it sounds simple.
The engineering reality is in the words “reward model” and “stay near.”
Preferences are not labels like “cat vs dog.” They’re judgments like: “A is clearer,” “B is more helpful but skips a caveat,” “both are fine, A is less risky.”
So the typical setup is: sample several candidate responses per prompt, have humans rank them (or pick the better of a pair), and train a model to predict those judgments as a single score.
That scoring model is the reward model.
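A minimal sketch of how such a reward model is commonly trained, assuming a hypothetical reward_model that maps a batch of tokenized sequences to one scalar score each: the chosen response should simply score higher than the rejected one.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: push score(chosen) above score(rejected).

    `reward_model(input_ids)` is assumed (for illustration) to return one
    scalar score per sequence in the batch, shape (batch,).
    """
    chosen_scores = reward_model(chosen_ids)
    rejected_scores = reward_model(rejected_ids)

    # Bradley-Terry style logistic loss: -log sigmoid(chosen - rejected)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Note what this loss never sees: whether either answer was true. It only sees which one the rater preferred.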
And it inherits every classic risk of proxy metrics: if “truth” isn’t well measured, you get “confident-sounding” behavior instead of correct behavior.
If you aggressively optimize for reward, the model will drift. And drift can destroy competence fast.
So RLHF systems keep the model on a leash — typically by penalizing divergence from a reference model (often the SFT model).
In plain language: chase the preference reward, but don’t wander far from how the reference model would have answered.
This is the control knob.
Too loose: the model chases the reward wherever it leads, drifts from its pretrained distribution, and competence degrades fast.
Too tight: the model barely moves, and you’ve spent the compute only to get your SFT model back.
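Here is a sketch of how that leash often shows up in PPO-style implementations, assuming you already have per-token log-probs from both the tuned policy and the frozen reference (names are illustrative, not a specific library’s API): every token pays for drifting from the reference, and the preference score is added at the end of the sequence.

```python
def shaped_rewards(preference_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for a PPO-style RLHF step (illustrative sketch).

    preference_score: scalar reward-model score for the whole response
    policy_logprobs:  1-D tensor of log-probs of the sampled tokens (tuned policy)
    ref_logprobs:     1-D tensor of log-probs of the same tokens (frozen reference)
    beta:             leash strength; the "too loose / too tight" knob above
    """
    # Sample-based, per-token estimate of divergence from the reference
    kl_per_token = policy_logprobs - ref_logprobs

    # Every token is penalized for drifting; the preference reward lands on the last token
    rewards = -beta * kl_per_token
    rewards[-1] += preference_score
    return rewards
```

Raising beta means the model barely moves; lowering it makes drifting (and gaming the reward) cheaper.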
If you’ve trained RL agents, you’ve seen these patterns: reward hacking, mode collapse, degenerate “safe” strategies, and policies that learn to please the evaluator rather than solve the task.
RLHF has analogs of all of them.
Reward hacking. The model learns strategies that score well but aren’t truly better: padded, confident-sounding answers in whatever style the reward model happens to favor.
Mitigations: don’t trust the reward score alone; track the things the proxy misses (factuality evals, edit distance, user reports) and probe the reward model for obvious biases. A quick probe for the most common bias is sketched below.
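One cheap probe (a sketch; scores and responses are assumed to come from your own eval set): check how strongly the reward model’s score tracks sheer response length. A high correlation doesn’t prove hacking, but it tells you the proxy is partly paying for length rather than quality.

```python
import numpy as np

def length_bias(scores, responses):
    """Correlation between reward-model score and response length (in words)."""
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

# A value near 1.0 suggests "longer" is being rewarded, not "better".
```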
Mode collapse. The model converges on a narrow style that is consistently rewarded: the same tone, the same structure, the same hedges, regardless of what the task needs.
Mitigations: diversify the prompts and raters behind the preference data, keep the leash meaningful, and watch usefulness metrics like edit distance so samey outputs show up as a cost.
Over-refusal. If refusal is heavily rewarded for avoiding risk, the assistant starts refusing safe requests.
Mitigations: include enough “safe completion” examples for borderline prompts, and track false-refusal reports in production.
Sounding aligned. The assistant becomes better at sounding aligned than being correct.
Mitigations: reward what you can verify (citation coverage, factuality evals, tool verification passes), not just what reads well.
Once you see RLHF as control engineering, you stop asking:
“Is the model aligned?”
And start asking:
“What are the control surfaces and how do we operate them?”
In practice, you will have multiple layers of alignment: RLHF-shaped defaults in the model, system prompts, policy and moderation layers, tool and permission constraints, and the observability that tells you when any of them slips.
RLHF is a strong default — but it’s never the whole system.
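A sketch of what those layers look like around a single request (every name here, such as policy.check_input and model.generate, is a hypothetical placeholder, not a real API):

```python
def answer(request, model, policy, log):
    """One request through product-owned guardrails around the RLHF'd model."""
    # Layer 1: input-side policy, before any tokens are spent
    verdict = policy.check_input(request)
    if not verdict.allowed:
        log("blocked_input", reason=verdict.reason)
        return policy.refusal_message(verdict)

    # Layer 2: the probabilistic core; RLHF only shapes its defaults
    draft = model.generate(request)

    # Layer 3: output-side policy; guarantees live here, not in the weights
    verdict = policy.check_output(request, draft)
    if not verdict.allowed:
        log("blocked_output", reason=verdict.reason)
        return policy.refusal_message(verdict)

    # Layer 4: observability, so drift in any layer is measurable
    log("served", output_length=len(draft))
    return draft
```

RLHF decides what the draft tends to look like; the rest of this function is where guarantees come from.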
A clean mental model
RLHF is “defaults.”
Your product is “guarantees.”
Even if you never train a model yourself, you still operate the outcome.
Your system needs to detect drift in behavior along the exact dimensions RLHF tries to control.
Safety + refusal metrics
Refusal rate, policy violation rate, “false refusal” reports.
Usefulness metrics
User satisfaction, task completion, edit distance (how much humans fix outputs).
Truth metrics
Citation coverage, factuality eval scores, hallucination flags, tool verification pass rate.
Cost + latency metrics
Tokens per request, context growth, tool calls per session, tail latency.
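A minimal sketch of what “detect drift” can mean operationally for the categories above (metric names and the 10% tolerance are placeholders, not recommendations): compare a recent window of each metric against a baseline window and flag anything that moves too much.

```python
def drift_report(baseline, current, tolerance=0.10):
    """Flag metrics whose relative change vs. the baseline exceeds the tolerance.

    baseline / current: dicts like {"refusal_rate": 0.08, "task_completion": 0.74}
    """
    flags = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or base_value == 0:
            continue
        relative_change = (new_value - base_value) / base_value
        if abs(relative_change) > tolerance:
            flags[name] = round(relative_change, 3)
    return flags

# Example output: {"refusal_rate": 0.35} means refusals are up 35% vs. baseline.
```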
You can “fix hallucinations” by refusing everything. You can “fix refusals” by answering everything. Both are failures.
Include: prompts that should be refused, prompts that look risky but are safe, and prompts with verifiable answers, so both failure directions stay visible.
Does RLHF make the model safe?
No.
RLHF shapes defaults. Safety in production requires system design: input and output policy checks, tool permissions, context controls, and monitoring.
Why do RLHF-tuned models sound so polite and verbose?
Because preference rewards often correlate with “polite and thorough.” If the reward model overvalues style, the policy optimizes style.
That’s reward hacking — just a socially acceptable kind.
Why does the assistant refuse requests that are clearly fine?
If refusing risky content is strongly rewarded, the model learns refusal as a safe strategy. Without enough “safe completion” examples, it refuses borderline-safe requests too.
By the end of June, we have the full model-side behavior story: pretraining for broad competence, instruction tuning for assistant behavior, and RLHF for stable, preference-shaped defaults.
Now we can finally talk about the thing teams actually ship:
ChatGPT as a product architecture — context assembly, policy layers, tools, and observability around the model.
July: “Dissecting ChatGPT: The Product Architecture Around the Model”