Aug 26, 2018 - 15 MIN READ
Why RL Training Is Unstable (A Catalog of Breakage)

After actor-critic finally felt “trainable,” I hit the next wall: RL doesn’t just fail—it fails in loops. This month is my map of the most common ways it breaks.

Axel Domingues

In July, actor-critic gave me my first taste of something I could call “training.”

Not just experimenting. Not just hoping.

Training: tweak a knob, observe a predictable change, repeat.

And then, inevitably, August happened.

Because the moment RL starts feeling trainable, you do the natural thing:

You try to scale it.

More steps. Harder tasks. Slightly bigger networks. Slightly different hyperparameters.

And suddenly the agent isn’t learning.
Or it is… and then it collapses.
Or it’s “learning” but only because it found a weird loophole in the environment.
Or it learns in one run and fails in the next with the same settings.

So this post is not an algorithm.

It’s a catalog.

A breakdown of the failure modes I keep hitting—so I can stop treating instability as “mystery” and start treating it like an engineering reality.

Why instability happens

You’re training a data generator while training the model.

What this post is

A catalog of breakage patterns I keep hitting — so I can diagnose, not restart.

The core lens

Policy → data → update → new policy
A feedback loop that amplifies mistakes.

The goal

Stability first, performance second.
Make reward mean something.

This is the post I wish I had in February.
RL isn’t unstable because it’s “hard.”
It’s unstable because the learning loop is self-generated:
  • the policy creates data
  • the data updates the policy
  • the updated policy creates different data
That feedback loop is powerful—and fragile.

RL Training Instability: What’s Actually Different From Supervised Learning?

In supervised learning, the dataset is mostly fixed.

Even if you shuffle, augment, or resample, the world doesn’t change because your model took a gradient step.

In RL, the model changes the world it learns from.

That means every update risks shifting the distribution of experience.

And once I saw it that way, the instability stopped feeling like a flaw and started feeling like a design constraint:

RL isn’t “train model on data.”
RL is “train a data generator while training the model.”


The Taxonomy: Where Breakage Comes From

To make this useful (and not just complaining), I’ve been forcing myself to classify failures into buckets.

Here are the buckets I keep coming back to:

Environment failure

The environment lies, is exploitable, or behaves unexpectedly.

Data failure

Your policy produced biased experience, so you train on a distorted world.

Signal failure

The learning signal is noisy; variance is the default.

Update failure

One step can ruin everything; updates are dangerous.

August is where I started treating each training run like a system with subsystems.

And the goal is no longer “maximize reward.”

The goal is:

Make the learning loop stable enough that reward means something.


A Catalog of Breakage

This is my “wall of shame” list—each item is something I’ve seen in practice.

Not theoretical. Not abstract.

The reason I’m writing it down is simple:

When a run fails, I don’t want to restart and hope for a different outcome.
I want to diagnose.

Exploration collapse

The agent stops looking.

Reward hacking

The agent learns the wrong lesson.

High-variance updates

Learning looks like noise.

Catastrophic updates

One bad step destroys competence.

Critic lies

Baseline becomes misinformation.

Bootstrapping errors

The future you predict is wrong.

Correlated data

You train on echoes of yourself.

Non-stationarity

The target moves because you moved it.


1) Exploration Collapse: “The Agent Stops Looking”

Symptom

  • reward curve plateaus early
  • entropy collapses quickly
  • action distribution becomes narrow and stays there

What it feels like

The agent made a few early guesses, got lucky (or unlucky), and then committed.

Why it happens

Exploration pressure is almost always under-tuned by default.
And many algorithms “optimize themselves into certainty” if you let them.

How I catch it

  • entropy over time
  • action histogram snapshots
  • episode length distribution (often collapses too)

If entropy collapses fast, I assume the run is already dead—even if reward is rising.
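
For concreteness, this is roughly what that check looks like in my logging code. A minimal sketch assuming a discrete-action policy in PyTorch; the `policy_logits` tensor and the returned dict keys are my own conventions, not any library’s API.

```python
import torch
from torch.distributions import Categorical

def exploration_stats(policy_logits: torch.Tensor) -> dict:
    """Entropy and action histogram for one batch of states.

    policy_logits: (batch, n_actions) raw logits from the policy network.
    """
    dist = Categorical(logits=policy_logits)
    entropy = dist.entropy().mean().item()                        # how undecided the policy still is
    action_hist = dist.probs.mean(dim=0).detach().cpu().numpy()   # average action distribution
    return {"policy_entropy": entropy, "action_hist": action_hist}
```

Plotting `policy_entropy` against training steps is the single cheapest early-warning signal I have.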

“Early success” in RL is often a trap.


2) Reward Hacking: “The Agent Learns the Wrong Lesson”

Symptom

  • reward improves but behavior looks nonsense
  • the agent repeats a weird loop
  • performance doesn’t transfer even slightly

What it feels like

The system passed the metric and failed the task.

Why it happens

Rewards are interfaces. Interfaces get exploited.

This is where my January line comes back:

learning is an interface problem.

How I catch it

  • record short rollouts
  • sanity-check simple invariants (“does the behavior match the goal?”)
  • watch for reward spikes that don’t correlate with episode length or stability
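
To make the invariant check concrete, here is a rough sketch of how I record a handful of rollouts and flag the ones that score well but fail a task-level sanity check. It assumes a Gymnasium-style `env` and a user-supplied `check_invariant` function; both are placeholders for whatever your setup looks like.

```python
def record_and_audit(env, policy_fn, check_invariant, n_episodes=5):
    """Collect a few full episodes and flag "good reward, wrong behavior" cases.

    policy_fn:       observation -> action (greedy or sampled, your choice)
    check_invariant: finished trajectory -> bool ("does this look like the task?")
    """
    flagged = []
    for ep in range(n_episodes):
        obs, _ = env.reset()
        trajectory, done = [], False
        while not done:
            action = policy_fn(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((obs, action, reward))
            done = terminated or truncated
        total_reward = sum(r for _, _, r in trajectory)
        if not check_invariant(trajectory):
            flagged.append((ep, total_reward))   # high score here is the reward-hacking smell
    return flagged
```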

3) High Variance Updates: “Learning Looks Like Noise”

Symptom

  • reward curve oscillates wildly
  • policy loss spikes and drops
  • advantages have huge variance
  • one run works, next run fails with same settings

What it feels like

I’m training, but the gradient is basically a random walk.

Why it happens

Returns are noisy. Trajectories are correlated.
And your “dataset” is whatever your current policy happened to experience.

How I catch it

  • advantage mean/variance
  • reward per episode distribution (not just mean)
  • multiple seeds (even if it hurts)
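
The numbers behind those bullets fit in one small function. A sketch using NumPy; the advantage array and per-episode returns are whatever your training loop already produces.

```python
import numpy as np

def variance_report(advantages, episode_returns) -> dict:
    """Distribution-level view of the last update, not just the mean reward."""
    adv = np.asarray(advantages, dtype=np.float64)
    rets = np.asarray(episode_returns, dtype=np.float64)
    return {
        "adv_mean": adv.mean(),
        "adv_std": adv.std(),
        "return_median": np.median(rets),
        # a fat lower tail means the "average" reward is propped up by a few lucky episodes
        "return_p10": np.percentile(rets, 10),
        "return_p90": np.percentile(rets, 90),
    }
```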

4) Catastrophic Policy Updates: “One Bad Step Destroys Competence”

Symptom

  • reward rises… then crashes to near-zero
  • KL between policies spikes (if available)
  • entropy drops or explodes
  • value loss goes unstable after the crash

What it feels like

The agent learned how to walk, then took one step and forgot its legs.

Why it happens

Policy optimization is sensitive.
If a step is too big, you can jump from “pretty good policy” to “garbage policy” instantly.

And because the policy generates data, once you jump into garbage, you generate garbage experience and train on it.

Now you’re digging.

How I catch it

  • watch for sudden KL spikes or sudden reward collapse
  • watch value predictions drift after the collapse
  • keep an eye on update magnitudes

This is the failure mode that made me finally respect “trust region” ideas.

Not as math. As safety constraints.
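
The crude version of that safety constraint is cheap to add long before reaching for full trust-region machinery: estimate how far the policy has moved on the current batch and stop updating once it has moved too far. A sketch assuming PyTorch-style log-probabilities; the `kl_limit` value is something I tune per environment, not a recommendation.

```python
import torch

def approx_kl(old_log_probs: torch.Tensor, new_log_probs: torch.Tensor) -> float:
    """Cheap estimate of KL(old || new) using log-probs of the sampled actions."""
    return (old_log_probs - new_log_probs).mean().item()

# Inside the update loop, after recomputing log-probs under the current policy:
#
#   if approx_kl(batch_old_log_probs, new_log_probs) > kl_limit:   # e.g. kl_limit ~ 0.05
#       break   # stop optimizing on this batch; the policy has already moved far enough
```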


5) Critic Misleading the Actor: “The Baseline Becomes a Liar”

Symptom

  • value loss looks “fine” but learning stalls
  • advantages become mostly same sign
  • explained variance is near zero
  • policy changes but reward doesn’t

What it feels like

The actor is confidently optimizing a bad signal.

Why it happens

The critic isn’t just a helper. It’s a teacher.

If it’s wrong, the actor is being trained on misinformation.

How I catch it

  • explained variance (does the critic predict returns at all?)
  • value scale vs reward scale
  • advantage distribution shape
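
Explained variance is the number from that list I check first, and it is a three-line function. A sketch in NumPy, following the usual definition: one minus the variance of the residuals over the variance of the returns.

```python
import numpy as np

def explained_variance(values, returns) -> float:
    """~1.0: critic tracks returns well.  ~0.0: no better than predicting the mean.
    < 0:  the critic is actively misleading the actor."""
    values, returns = np.asarray(values), np.asarray(returns)
    var_returns = np.var(returns)
    if var_returns == 0:
        return float("nan")   # degenerate batch: every return identical
    return 1.0 - np.var(returns - values) / var_returns
```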

6) Bootstrapping Errors: “The Future You Predict Is Wrong”

Symptom

  • value estimates drift upward or downward
  • learning becomes unstable even when rewards look normal
  • long-horizon tasks are especially fragile

What it feels like

The model’s internal beliefs detach from reality.

Why it happens

Many RL methods bootstrap: they estimate future returns using their own value predictions.

If those predictions are biased, the bias can compound.

How I catch it

  • value prediction histograms over time
  • compare predicted value vs observed episodic returns (rough sanity check)
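
The “rough sanity check” in the last bullet is just this: compare the critic’s prediction at the start of each episode against the discounted return that actually happened. A sketch; the inputs are whatever your rollout code already records.

```python
import numpy as np

def value_bias(start_values, episode_rewards, gamma=0.99) -> float:
    """Average gap between V(s_0) and the observed discounted return.

    start_values:    critic prediction at each episode's first state
    episode_rewards: one list of rewards per finished episode
    """
    observed = []
    for rewards in episode_rewards:
        discounts = gamma ** np.arange(len(rewards))
        observed.append(float(np.dot(discounts, rewards)))
    bias = float(np.mean(np.asarray(start_values) - np.asarray(observed)))
    return bias   # persistently positive: the critic is living in a fantasy
```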

7) Correlated Data: “You’re Training on Echoes”

Symptom

  • training metrics improve but generalization is weak
  • the agent overfits to recent experience
  • instability when environment randomness changes slightly

What it feels like

The agent becomes good at the last few minutes of its own life.

Why it happens

Trajectories are correlated, especially on-policy. And even off-policy methods can become “recent experience addicts.”

How I catch it

  • measure performance over multiple evaluation episodes
  • separate training and evaluation (even if minimal)
  • watch for quick shifts in behavior
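
Separating evaluation from training doesn’t need to be elaborate. A sketch assuming a Gymnasium-style environment factory; the point is a fresh env instance, seeds the training loop never sees, and a distribution of returns rather than a single number.

```python
import numpy as np

def evaluate(policy_fn, make_env, n_episodes=20, base_seed=10_000) -> dict:
    """Evaluate on a separate env instance, never on the training stream."""
    env = make_env()
    returns = []
    for i in range(n_episodes):
        obs, _ = env.reset(seed=base_seed + i)   # randomness the agent hasn't trained on
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy_fn(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    returns = np.asarray(returns)
    return {"eval_mean": returns.mean(), "eval_std": returns.std(), "eval_min": returns.min()}
```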

8) Non-Stationarity: “The Target Moves Because You Moved It”

This is the meta-failure that sits behind everything else.

The policy changes. The state visitation changes. The experience distribution changes. The critic target changes. The next batch is “from a different world.”

That means even if every update is “correct,” the system can still be unstable.

Because you’re not optimizing a static objective on a fixed dataset.

You’re steering a dynamic loop.


My August Debugging Upgrade: “Stability First, Performance Second”

I used to treat reward as the main metric.

Now reward is the last thing I trust.

In August, my priority became:

  1. Is the run stable?
  2. Are the internal signals sane?
  3. Does behavior improve in a way that matches the task?
  4. Then look at reward.

So I started treating these as “must be true” signals:

  • entropy doesn’t collapse instantly
  • advantage distribution looks reasonable
  • value estimates don’t drift into fantasy
  • updates don’t explode or vanish
  • evaluation isn’t a one-seed miracle

A single successful run is not evidence.

In RL, it’s often just a lucky seed.

If I can’t reproduce improvement across multiple runs, I treat the result as “unstable until proven otherwise.”
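
In practice that means the experiment script runs the same config under several seeds and reports the spread, not the best run. A tiny sketch; `train_fn` is a stand-in for whatever wrapper launches one training run and returns its final evaluation score.

```python
import numpy as np

def reproducibility_check(train_fn, seeds=(0, 1, 2)) -> np.ndarray:
    """Same config, several seeds; judge the median and the worst run, not the best."""
    finals = np.asarray([train_fn(seed=s) for s in seeds], dtype=np.float64)
    print(f"final eval return per seed: {finals}")
    print(f"median: {np.median(finals):.1f}   worst: {finals.min():.1f}")
    return finals
```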


The Minimal Instrument Panel I Always Want Now

This is the instrumentation list I’m trying to standardize across experiments:

Outcome signals

Train + eval reward, episode length distribution.

Exploration signals

Entropy over time + action distribution snapshots.

Value / critic signals

Value loss + explained variance + value scale sanity.

Update safety

Advantage mean/variance + gradient norms when suspicious.
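
Written out as a flat record, the panel is small enough to fit on one screen. The keys below are my own naming convention, not any logger’s schema; I fill this dict once per update and plot every field over time.

```python
# One row of the instrument panel, written once per update.
panel = {
    # outcome signals
    "train/episode_return": None, "eval/episode_return": None, "train/episode_length": None,
    # exploration signals
    "policy/entropy": None, "policy/action_hist": None,
    # value / critic signals
    "value/loss": None, "value/explained_variance": None, "value/mean_prediction": None,
    # update safety
    "update/adv_mean": None, "update/adv_std": None,
    "update/approx_kl": None, "update/grad_norm": None,
}
```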

I’m basically rebuilding the deep learning mentality from 2017:

training is debugging with graphs.

RL just adds more graphs, and more ways for them to lie.


Field Notes (What I’d Tell My Past Self)

1) Instability isn’t an edge case—it’s the default

If the run is stable, that’s the surprising thing.

2) “It learned once” is not a milestone

It’s a hypothesis.

3) Most RL progress is “removing ways to fail”

Big wins are rare. Most learning comes from systematically eliminating breakage.

4) You don’t need more cleverness yet—you need safety rails

This month shifted my attention toward constrained updates and reproducibility.

Because without those, “improvement” is mostly storytelling.

August takeaway

Most RL progress isn’t cleverness.

It’s removing failure modes until the learning loop becomes stable enough to trust.


What’s Next

This month was about naming the demons.

Next month I want to start building defenses.

If catastrophic updates are real, then I need methods that explicitly care about step size in policy space—not just in parameter space.

So September is where I lean into stability mechanisms:

  • safer updates
  • constrained policy shifts
  • and the intuition behind why “trust region” ideas exist at all

If August was the catalog of breakage…

September is where I start designing guardrails.

Axel Domingues - 2026