Blog
Jun 25, 2023 - 15 MIN READ
RLHF: Stabilizing Behavior with Preferences (Alignment as Control)

RLHF: Stabilizing Behavior with Preferences (Alignment as Control)

RLHF is best understood as control engineering: a learned reward signal plus a constraint that keeps the model near its pretrained competence. Here’s how it works and how it fails.

Axel Domingues

Axel Domingues

Instruction tuning gives you an assistant.

RLHF is what makes that assistant feel stable.

Not perfect. Not safe by magic. Not truthful by default.

Stable.

If you’ve done any real reinforcement learning, RLHF should feel familiar:

  • you define a reward signal
  • you optimize behavior under that reward
  • and you fight reward hacking for the rest of your life

The twist is: in RLHF, the reward is human preference — a learned proxy for “good answers.”

In this series, I treat generative AI as an architectural boundary shift.

RLHF is the moment we stop saying “the model is smart” and start saying: the model’s behavior is controlled.

The RLHF goal

Shift responses toward what humans prefer: helpful, harmless, and policy-compliant.

The key mechanism

Optimize a preference reward with a leash (stay close to the base model).

The core risk

Preference is a proxy — and proxies can be gamed (reward hacking).

The product takeaway

Alignment is not ideology. It’s control + constraints + observability.


RLHF, Explained Like a Systems Engineer

Most explanations of RLHF start with the algorithm and end with “and it’s aligned now.”

That’s the wrong direction.

The right direction is: what problem is RLHF solving?

Two problems, really:

  1. The assistant problem: make the model reliably follow instructions and social norms.
  2. The stability problem: do that without destroying the model’s pretrained competence.

RLHF is a control loop that tries to balance those forces:

  • push behavior toward “preferred answers”
  • don’t drift so far that you lose general capability
  • prevent the model from learning weird hacks that please the reward model

If you’ve ever tuned a production system, this should ring a bell:

It’s not “maximize a metric.”
It’s “maximize it under constraints, and detect when it starts lying.”


The RLHF Pipeline (Conceptually)

Here’s the standard three-stage mental model:

Stage 1: Start with a base model

A pretrained model has broad competence but no reliable assistant behavior.

Stage 2: Teach “assistantness” with supervised fine-tuning

You fine-tune on instruction → ideal response pairs (May’s post).

Stage 3: Stabilize and shape behavior with preferences (RLHF)

Humans compare multiple candidate responses and choose which is better. A reward model learns to predict those preferences. Then you optimize the assistant to score higher on that reward — while staying near its original distribution.

If you stop there, it sounds simple.

The engineering reality is in the words “reward model” and “stay near.”


The Reward Model: Turning Taste into a Signal

Preferences are not labels like “cat vs dog.” They’re judgments like:

  • “This answer is more helpful.”
  • “This answer is less toxic.”
  • “This answer is clearer.”
  • “This answer follows policy.”

So the typical setup is:

  • generate multiple candidate answers
  • ask humans to rank or pick the best
  • train a model to score answers higher when they match those choices

That scoring model is the reward model.

And it inherits every classic risk of proxy metrics:

  • it can be fooled by style
  • it can overvalue verbosity
  • it can reward “confident tone” over truth
  • it can learn shallow shortcuts
RLHF optimizes for what the reward model can measure.

If “truth” isn’t well-measured, you get “confident-sounding” behavior instead of correct behavior.


The Leash: Why KL Penalties Matter More Than People Admit

If you aggressively optimize for reward, the model will drift. And drift can destroy competence fast.

So RLHF systems keep the model on a leash — typically by penalizing divergence from a reference model (often the SFT model).

In plain language:

  • “Be more like the preferred answers…”
  • “…but don’t become a different model entirely.”

This is the control knob.

Too loose:

  • you get reward hacking and weird behaviors

Too tight:

  • you get little improvement and lots of “safe but generic” responses
This is why “alignment” is an engineering tradeoff, not a moral switch:
  • tight constraint → cautious, generic, refusal-heavy
  • loose constraint → expressive, but more failure risk

RLHF Failure Modes (And Why They Look Like RL)

If you’ve trained RL agents, you’ve seen these patterns:

  • learn to exploit the reward function
  • overfit to the training distribution
  • become brittle to small environment changes
  • optimize the wrong thing very efficiently

RLHF has analogs of all of them.


Alignment as Control: The Product Architecture Implication

Once you see RLHF as control engineering, you stop asking:

“Is the model aligned?”

And start asking:

“What are the control surfaces and how do we operate them?”

In practice, you will have multiple layers of alignment:

  1. Model-level shaping (RLHF)
  2. System policy (hard rules)
  3. Tool authorization (what the model is allowed to do)
  4. Retrieval constraints (what context it can see)
  5. Output enforcement (validators, filters, redaction)
  6. Human escalation (approval for high-stakes actions)

RLHF is a strong default — but it’s never the whole system.

A clean mental model

RLHF is “defaults.”
Your product is “guarantees.”


Operating RLHF Models in Production

Even if you never train a model yourself, you still operate the outcome.

Your system needs to detect drift in behavior along the exact dimensions RLHF tries to control.

What to monitor (practically)

Safety + refusal metrics

Refusal rate, policy violation rate, “false refusal” reports.

Usefulness metrics

User satisfaction, task completion, edit distance (how much humans fix outputs).

Truth metrics

Citation coverage, factuality eval scores, hallucination flags, tool verification pass rate.

Cost + latency metrics

Tokens per request, context growth, tool calls per session, tail latency.

If you don’t track refusal rate and hallucination rate together, you will fool yourself.

You can “fix hallucinations” by refusing everything. You can “fix refusals” by answering everything.

Both are failure.

Ship-Ready Checklist: RLHF-Aware LLM Features

Define your “non-negotiables” outside the model

  • auth checks
  • permission boundaries
  • irreversible actions
  • data access controls

Use the model for language, not authority

  • the model suggests
  • your system decides
  • your system executes

Calibrate refusal and helpfulness in your UX

  • let the user rephrase
  • provide safe alternatives
  • show “why I can’t” only when useful

Build an eval harness

Include:

  • preference-style tests (helpfulness)
  • factuality tests (grounded answers)
  • injection tests (prompt attacks)
  • tool-use tests (correct calls, correct parameters)
  • regression tests by prompt version

Roll out changes like you roll out backend releases

  • canary traffic
  • prompt version tags
  • telemetry by version
  • rollback plan

Resources

InstructGPT (SFT + RLHF)

A clear reference for the “SFT then RLHF” pipeline that became the modern baseline.

RLHF in practice (PPO-style alignment)

A practical overview of the standard components: reward modeling, KL constraints, and policy optimization.


FAQ


What’s Next

By the end of June, we have the full model-side behavior story:

  • Transformers made long-context language modeling scalable.
  • Pretraining built broad capability.
  • Instruction tuning created an assistant interface.
  • RLHF stabilized behavior under preferences and constraints.

Now we can finally talk about the thing teams actually ship:

ChatGPT as a product architecture — context assembly, policy layers, tools, and observability around the model.

July: “Dissecting ChatGPT: The Product Architecture Around the Model”

Axel Domingues - 2026