
RLHF is best understood as control engineering: a learned reward signal plus a constraint that keeps the model near its pretrained competence. Here’s how it works and how it fails.
Axel Domingues
Instruction tuning gives you an assistant.
RLHF is what makes that assistant feel stable.
Not perfect. Not safe by magic. Not truthful by default.
Stable.
If you’ve done any real reinforcement learning, RLHF should feel familiar: there is a policy (the model), actions (the tokens it emits), and a reward to maximize.
The twist is: in RLHF, the reward is human preference — a learned proxy for “good answers.”
RLHF is the moment we stop saying “the model is smart” and start saying: the model’s behavior is controlled.
The RLHF goal
Shift responses toward what humans prefer: helpful, harmless, and policy-compliant.
The key mechanism
Optimize a preference reward with a leash (stay close to the base model).
The core risk
Preference is a proxy — and proxies can be gamed (reward hacking).
The product takeaway
Alignment is not ideology. It’s control + constraints + observability.
Most explanations of RLHF start with the algorithm and end with “and it’s aligned now.”
That’s the wrong direction.
The right direction is: what problem is RLHF solving?
Two problems, really: you want behavior humans actually prefer, even though “good answer” has no clean loss function you can write down. And you want to change that behavior without destroying the competence pretraining paid for.
RLHF is a control loop that tries to balance those forces.
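In its commonly cited form (a sketch, not any particular lab’s exact recipe), the loop maximizes a learned preference reward r_φ while paying a KL penalty for drifting away from a frozen reference policy π_ref (often the SFT model), with β setting how tight the leash is:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \,\big]
```

Most of what follows is about what happens when r_φ is a bad proxy, and when β is too loose or too tight.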
If you’ve ever tuned a production system, this should ring a bell:
It’s not “maximize a metric.”
It’s “maximize it under constraints, and detect when it starts lying.”
Here’s the standard three-stage mental model:
1. Pretraining: a pretrained model has broad competence but no reliable assistant behavior.
2. Instruction tuning (SFT): you fine-tune on instruction → ideal response pairs (May’s post).
3. Preference optimization (RLHF): humans compare multiple candidate responses and choose which is better. A reward model learns to predict those preferences. Then you optimize the assistant to score higher on that reward — while staying near its original distribution.
If you stop there, it sounds simple.
The engineering reality is in the words “reward model” and “stay near.”
Preferences are not labels like “cat vs dog.” They’re judgments like: “A is clearer,” “B is more helpful but skips a caveat,” “both are fine, A is less risky.”
So the typical setup is: sample several candidate responses per prompt, have humans rank them (or pick the better of a pair), and train a model to predict those judgments as a single score.
That scoring model is the reward model.
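A minimal sketch of how such a reward model is commonly trained, assuming a hypothetical reward_model that maps a batch of tokenized sequences to one scalar score each: the chosen response should simply score higher than the rejected one.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: push score(chosen) above score(rejected).

    `reward_model(input_ids)` is assumed (for illustration) to return one
    scalar score per sequence in the batch, shape (batch,).
    """
    chosen_scores = reward_model(chosen_ids)
    rejected_scores = reward_model(rejected_ids)

    # Bradley-Terry style logistic loss: -log sigmoid(chosen - rejected)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Note what this loss never sees: whether either answer was true. It only sees which one the rater preferred.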
And it inherits every classic risk of proxy metrics: if “truth” isn’t well measured, you get “confident-sounding” behavior instead of correct behavior.
If you aggressively optimize for reward, the model will drift. And drift can destroy competence fast.
So RLHF systems keep the model on a leash — typically by penalizing divergence from a reference model (often the SFT model).
In plain language: chase the preference reward, but don’t wander far from how the reference model would have answered.
This is the control knob.
Too loose: the model chases the reward wherever it leads, drifts from its pretrained distribution, and competence degrades fast.
Too tight: the model barely moves, and you’ve spent the compute only to get your SFT model back.
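Here is a sketch of how that leash often shows up in PPO-style implementations, assuming you already have per-token log-probs from both the tuned policy and the frozen reference (names are illustrative, not a specific library’s API): every token pays for drifting from the reference, and the preference score is added at the end of the sequence.

```python
def shaped_rewards(preference_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for a PPO-style RLHF step (illustrative sketch).

    preference_score: scalar reward-model score for the whole response
    policy_logprobs:  1-D tensor of log-probs of the sampled tokens (tuned policy)
    ref_logprobs:     1-D tensor of log-probs of the same tokens (frozen reference)
    beta:             leash strength; the "too loose / too tight" knob above
    """
    # Sample-based, per-token estimate of divergence from the reference
    kl_per_token = policy_logprobs - ref_logprobs

    # Every token is penalized for drifting; the preference reward lands on the last token
    rewards = -beta * kl_per_token
    rewards[-1] += preference_score
    return rewards
```

Raising beta means the model barely moves; lowering it makes drifting (and gaming the reward) cheaper.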
If you’ve trained RL agents, you’ve seen these patterns: reward hacking, mode collapse, degenerate “safe” strategies, and policies that learn to please the evaluator rather than solve the task.
RLHF has analogs of all of them.
Reward hacking. The model learns strategies that score well but aren’t truly better: padded, confident-sounding answers in whatever style the reward model happens to favor.
Mitigations: don’t trust the reward score alone; track the things the proxy misses (factuality evals, edit distance, user reports) and probe the reward model for obvious biases. A quick probe for the most common bias is sketched below.
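One cheap probe (a sketch; scores and responses are assumed to come from your own eval set): check how strongly the reward model’s score tracks sheer response length. A high correlation doesn’t prove hacking, but it tells you the proxy is partly paying for length rather than quality.

```python
import numpy as np

def length_bias(scores, responses):
    """Correlation between reward-model score and response length (in words)."""
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

# A value near 1.0 suggests "longer" is being rewarded, not "better".
```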
Mode collapse. The model converges on a narrow style that is consistently rewarded: the same tone, the same structure, the same hedges, regardless of what the task needs.
Mitigations: diversify the prompts and raters behind the preference data, keep the leash meaningful, and watch usefulness metrics like edit distance so samey outputs show up as a cost.
Over-refusal. If refusal is heavily rewarded for avoiding risk, the assistant starts refusing safe requests.
Mitigations: include enough “safe completion” examples for borderline prompts, and track false-refusal reports in production.
Sounding aligned. The assistant becomes better at sounding aligned than being correct.
Mitigations: reward what you can verify (citation coverage, factuality evals, tool verification passes), not just what reads well.
Once you see RLHF as control engineering, you stop asking:
“Is the model aligned?”
And start asking:
“What are the control surfaces and how do we operate them?”
In practice, you will have multiple layers of alignment: RLHF-shaped defaults in the model, system prompts, policy and moderation layers, tool and permission constraints, and the observability that tells you when any of them slips.
RLHF is a strong default — but it’s never the whole system.
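A sketch of what those layers look like around a single request (every name here, such as policy.check_input and model.generate, is a hypothetical placeholder, not a real API):

```python
def answer(request, model, policy, log):
    """One request through product-owned guardrails around the RLHF'd model."""
    # Layer 1: input-side policy, before any tokens are spent
    verdict = policy.check_input(request)
    if not verdict.allowed:
        log("blocked_input", reason=verdict.reason)
        return policy.refusal_message(verdict)

    # Layer 2: the probabilistic core; RLHF only shapes its defaults
    draft = model.generate(request)

    # Layer 3: output-side policy; guarantees live here, not in the weights
    verdict = policy.check_output(request, draft)
    if not verdict.allowed:
        log("blocked_output", reason=verdict.reason)
        return policy.refusal_message(verdict)

    # Layer 4: observability, so drift in any layer is measurable
    log("served", output_length=len(draft))
    return draft
```

RLHF decides what the draft tends to look like; the rest of this function is where guarantees come from.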
A clean mental model
RLHF is “defaults.”
Your product is “guarantees.”
Even if you never train a model yourself, you still operate the outcome.
Your system needs to detect drift in behavior along the exact dimensions RLHF tries to control.
Safety + refusal metrics
Refusal rate, policy violation rate, “false refusal” reports.
Usefulness metrics
User satisfaction, task completion, edit distance (how much humans fix outputs).
Truth metrics
Citation coverage, factuality eval scores, hallucination flags, tool verification pass rate.
Cost + latency metrics
Tokens per request, context growth, tool calls per session, tail latency.
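A minimal sketch of what “detect drift” can mean operationally for the categories above (metric names and the 10% tolerance are placeholders, not recommendations): compare a recent window of each metric against a baseline window and flag anything that moves too much.

```python
def drift_report(baseline, current, tolerance=0.10):
    """Flag metrics whose relative change vs. the baseline exceeds the tolerance.

    baseline / current: dicts like {"refusal_rate": 0.08, "task_completion": 0.74}
    """
    flags = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or base_value == 0:
            continue
        relative_change = (new_value - base_value) / base_value
        if abs(relative_change) > tolerance:
            flags[name] = round(relative_change, 3)
    return flags

# Example output: {"refusal_rate": 0.35} means refusals are up 35% vs. baseline.
```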
You can “fix hallucinations” by refusing everything. You can “fix refusals” by answering everything. Both are failures.
Include: prompts that should be refused, prompts that look risky but are safe, and prompts with verifiable answers, so both failure directions stay visible.
Does RLHF make the model safe?
No.
RLHF shapes defaults. Safety in production requires system design: input and output policy checks, tool permissions, context controls, and monitoring.
Why do RLHF-tuned models sound so polite and verbose?
Because preference rewards often correlate with “polite and thorough.” If the reward model overvalues style, the policy optimizes style.
That’s reward hacking — just a socially acceptable kind.
Why does the assistant refuse requests that are clearly fine?
If refusing risky content is strongly rewarded, the model learns refusal as a safe strategy. Without enough “safe completion” examples, it refuses borderline-safe requests too.
By the end of June, we have the full model-side behavior story: pretraining for broad competence, instruction tuning for assistant behavior, and RLHF for stable, preference-shaped defaults.
Now we can finally talk about the thing teams actually ship:
ChatGPT as a product architecture — context assembly, policy layers, tools, and observability around the model.
July: “Dissecting ChatGPT: The Product Architecture Around the Model”