Aug 26, 2018 - 15 MIN READ
Why RL Training Is Unstable (A Catalog of Breakage)

After actor-critic finally felt “trainable,” I hit the next wall: RL doesn’t just fail—it fails in loops. This month is my map of the most common ways it breaks.

Axel Domingues

In July, actor-critic gave me my first taste of something I could call “training.”

Not just experimenting. Not just hoping.

Training: tweak a knob, observe a predictable change, repeat.

And then, inevitably, August happened.

Because the moment RL starts feeling trainable, you do the natural thing:

You try to scale it.

More steps. Harder tasks. Slightly bigger networks. Slightly different hyperparameters.

And suddenly the agent isn’t learning.
Or it is… and then it collapses.
Or it’s “learning” but only because it found a weird loophole in the environment.
Or it learns in one run and fails in the next with the same settings.

So this post is not an algorithm.

It’s a catalog.

A breakdown of the failure modes I keep hitting—so I can stop treating instability as “mystery” and start treating it like an engineering reality.

Why instability happens

You’re training a data generator while training the model.

What this post is

A catalog of breakage patterns I keep hitting — so I can diagnose, not restart.

The core lens

Policy → data → update → new policy
A feedback loop that amplifies mistakes.

The goal

Stability first, performance second.
Make reward mean something.

This is the post I wish I had in February.
RL isn’t unstable because it’s “hard.”
It’s unstable because the learning loop is self-generated:
  • the policy creates data
  • the data updates the policy
  • the updated policy creates different data
That feedback loop is powerful—and fragile.

RL Training Instability: What’s Actually Different From Supervised Learning?

In supervised learning, the dataset is mostly fixed.

Even if you shuffle, augment, or resample, the world doesn’t change because your model took a gradient step.

In RL, the model changes the world it learns from.

That means every update risks shifting the distribution of experience.

And once I saw it that way, the instability stopped feeling like a flaw and started feeling like a design constraint:

RL isn’t “train model on data.”
RL is “train a data generator while training the model.”


The Taxonomy: Where Breakage Comes From

To make this useful (and not just complaining), I’ve been forcing myself to classify failures into buckets.

Here are the buckets I keep coming back to:

Environment failure

The environment lies, is exploitable, or behaves unexpectedly.

Data failure

Your policy produced biased experience, so you train on a distorted world.

Signal failure

The learning signal is noisy; variance is the default.

Update failure

One step can ruin everything; updates are dangerous.

August is where I started treating each training run like a system with subsystems.

And the goal is no longer “maximize reward.”

The goal is:

Make the learning loop stable enough that reward means something.


A Catalog of Breakage

This is my “wall of shame” list—each item is something I’ve seen in practice.

Not theoretical. Not abstract.

The reason I’m writing it down is simple:

When a run fails, I don’t want to restart and hope for a different outcome.
I want to diagnose.

Exploration collapse

The agent stops looking.

Reward hacking

The agent learns the wrong lesson.

High-variance updates

Learning looks like noise.

Catastrophic updates

One bad step destroys competence.

Critic lies

Baseline becomes misinformation.

Bootstrapping errors

The future you predict is wrong.

Correlated data

You train on echoes of yourself.

Non-stationarity

The target moves because you moved it.


1) Exploration Collapse: “The Agent Stops Looking”

Symptom

  • reward curve plateaus early
  • entropy collapses quickly
  • action distribution becomes narrow and stays there

What it feels like

The agent made a few early guesses, got lucky (or unlucky), and then committed.

Why it happens

Exploration pressure is almost always under-tuned by default.
And many algorithms “optimize themselves into certainty” if you let them.

How I catch it

  • entropy over time
  • action histogram snapshots
  • episode length distribution (often collapses too)

If entropy collapses fast, I assume the run is already dead—even if reward is rising.
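
For concreteness, this is roughly what that check looks like in my logging code. A minimal sketch assuming a discrete-action policy in PyTorch; the `policy_logits` tensor and the returned dict keys are my own conventions, not any library’s API.

```python
import torch
from torch.distributions import Categorical

def exploration_stats(policy_logits: torch.Tensor) -> dict:
    """Entropy and action histogram for one batch of states.

    policy_logits: (batch, n_actions) raw logits from the policy network.
    """
    dist = Categorical(logits=policy_logits)
    entropy = dist.entropy().mean().item()                        # how undecided the policy still is
    action_hist = dist.probs.mean(dim=0).detach().cpu().numpy()   # average action distribution
    return {"policy_entropy": entropy, "action_hist": action_hist}
```

Plotting `policy_entropy` against training steps is the single cheapest early-warning signal I have.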

“Early success” in RL is often a trap.


2) Reward Hacking: “The Agent Learns the Wrong Lesson”

Symptom

  • reward improves but behavior looks nonsense
  • the agent repeats a weird loop
  • performance doesn’t transfer even slightly

What it feels like

The system passed the metric and failed the task.

Why it happens

Rewards are interfaces. Interfaces get exploited.

This is where my January line comes back:

learning is an interface problem.

How I catch it

  • record short rollouts
  • sanity-check simple invariants (“does the behavior match the goal?”)
  • watch for reward spikes that don’t correlate with episode length or stability
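
To make the invariant check concrete, here is a rough sketch of how I record a handful of rollouts and flag the ones that score well but fail a task-level sanity check. It assumes a Gymnasium-style `env` and a user-supplied `check_invariant` function; both are placeholders for whatever your setup looks like.

```python
def record_and_audit(env, policy_fn, check_invariant, n_episodes=5):
    """Collect a few full episodes and flag "good reward, wrong behavior" cases.

    policy_fn:       observation -> action (greedy or sampled, your choice)
    check_invariant: finished trajectory -> bool ("does this look like the task?")
    """
    flagged = []
    for ep in range(n_episodes):
        obs, _ = env.reset()
        trajectory, done = [], False
        while not done:
            action = policy_fn(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((obs, action, reward))
            done = terminated or truncated
        total_reward = sum(r for _, _, r in trajectory)
        if not check_invariant(trajectory):
            flagged.append((ep, total_reward))   # high score here is the reward-hacking smell
    return flagged
```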

3) High Variance Updates: “Learning Looks Like Noise”

Symptom

  • reward curve oscillates wildly
  • policy loss spikes and drops
  • advantages have huge variance
  • one run works, next run fails with same settings

What it feels like

I’m training, but the gradient is basically a random walk.

Why it happens

Returns are noisy. Trajectories are correlated.
And your “dataset” is whatever your current policy happened to experience.

How I catch it

  • advantage mean/variance
  • reward per episode distribution (not just mean)
  • multiple seeds (even if it hurts)
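
The numbers behind those bullets fit in one small function. A sketch using NumPy; the advantage array and per-episode returns are whatever your training loop already produces.

```python
import numpy as np

def variance_report(advantages, episode_returns) -> dict:
    """Distribution-level view of the last update, not just the mean reward."""
    adv = np.asarray(advantages, dtype=np.float64)
    rets = np.asarray(episode_returns, dtype=np.float64)
    return {
        "adv_mean": adv.mean(),
        "adv_std": adv.std(),
        "return_median": np.median(rets),
        # a fat lower tail means the "average" reward is propped up by a few lucky episodes
        "return_p10": np.percentile(rets, 10),
        "return_p90": np.percentile(rets, 90),
    }
```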

4) Catastrophic Policy Updates: “One Bad Step Destroys Competence”

Symptom

  • reward rises… then crashes to near-zero
  • KL between policies spikes (if available)
  • entropy drops or explodes
  • value loss goes unstable after the crash

What it feels like

The agent learned how to walk, then took one step and forgot its legs.

Why it happens

Policy optimization is sensitive.
If a step is too big, you can jump from “pretty good policy” to “garbage policy” instantly.

And because the policy generates data, once you jump into garbage, you generate garbage experience and train on it.

Now you’re digging.

How I catch it

  • watch for sudden KL spikes or sudden reward collapse
  • watch value predictions drift after the collapse
  • keep an eye on update magnitudes

This is the failure mode that made me finally respect “trust region” ideas.

Not as math. As safety constraints.
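
The crude version of that safety constraint is cheap to add long before reaching for full trust-region machinery: estimate how far the policy has moved on the current batch and stop updating once it has moved too far. A sketch assuming PyTorch-style log-probabilities; the `kl_limit` value is something I tune per environment, not a recommendation.

```python
import torch

def approx_kl(old_log_probs: torch.Tensor, new_log_probs: torch.Tensor) -> float:
    """Cheap estimate of KL(old || new) using log-probs of the sampled actions."""
    return (old_log_probs - new_log_probs).mean().item()

# Inside the update loop, after recomputing log-probs under the current policy:
#
#   if approx_kl(batch_old_log_probs, new_log_probs) > kl_limit:   # e.g. kl_limit ~ 0.05
#       break   # stop optimizing on this batch; the policy has already moved far enough
```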


5) Critic Misleading the Actor: “The Baseline Becomes a Liar”

Symptom

  • value loss looks “fine” but learning stalls
  • advantages become mostly same sign
  • explained variance is near zero
  • policy changes but reward doesn’t

What it feels like

The actor is confidently optimizing a bad signal.

Why it happens

The critic isn’t just a helper. It’s a teacher.

If it’s wrong, the actor is being trained on misinformation.

How I catch it

  • explained variance (does the critic predict returns at all?)
  • value scale vs reward scale
  • advantage distribution shape
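
Explained variance is the number from that list I check first, and it is a three-line function. A sketch in NumPy, following the usual definition: one minus the variance of the residuals over the variance of the returns.

```python
import numpy as np

def explained_variance(values, returns) -> float:
    """~1.0: critic tracks returns well.  ~0.0: no better than predicting the mean.
    < 0:  the critic is actively misleading the actor."""
    values, returns = np.asarray(values), np.asarray(returns)
    var_returns = np.var(returns)
    if var_returns == 0:
        return float("nan")   # degenerate batch: every return identical
    return 1.0 - np.var(returns - values) / var_returns
```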

6) Bootstrapping Errors: “The Future You Predict Is Wrong”

Symptom

  • value estimates drift upward or downward
  • learning becomes unstable even when rewards look normal
  • long-horizon tasks are especially fragile

What it feels like

The model’s internal beliefs detach from reality.

Why it happens

Many RL methods bootstrap: they estimate future returns using their own value predictions.

If those predictions are biased, the bias can compound.

How I catch it

  • value prediction histograms over time
  • compare predicted value vs observed episodic returns (rough sanity check)
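
The “rough sanity check” in the last bullet is just this: compare the critic’s prediction at the start of each episode against the discounted return that actually happened. A sketch; the inputs are whatever your rollout code already records.

```python
import numpy as np

def value_bias(start_values, episode_rewards, gamma=0.99) -> float:
    """Average gap between V(s_0) and the observed discounted return.

    start_values:    critic prediction at each episode's first state
    episode_rewards: one list of rewards per finished episode
    """
    observed = []
    for rewards in episode_rewards:
        discounts = gamma ** np.arange(len(rewards))
        observed.append(float(np.dot(discounts, rewards)))
    bias = float(np.mean(np.asarray(start_values) - np.asarray(observed)))
    return bias   # persistently positive: the critic is living in a fantasy
```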

7) Correlated Data: “You’re Training on Echoes”

Symptom

  • training metrics improve but generalization is weak
  • the agent overfits to recent experience
  • instability when environment randomness changes slightly

What it feels like

The agent becomes good at the last few minutes of its own life.

Why it happens

Trajectories are correlated, especially on-policy. And even off-policy methods can become “recent experience addicts.”

How I catch it

  • measure performance over multiple evaluation episodes
  • separate training and evaluation (even if minimal)
  • watch for quick shifts in behavior
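
Separating evaluation from training doesn’t need to be elaborate. A sketch assuming a Gymnasium-style environment factory; the point is a fresh env instance, seeds the training loop never sees, and a distribution of returns rather than a single number.

```python
import numpy as np

def evaluate(policy_fn, make_env, n_episodes=20, base_seed=10_000) -> dict:
    """Evaluate on a separate env instance, never on the training stream."""
    env = make_env()
    returns = []
    for i in range(n_episodes):
        obs, _ = env.reset(seed=base_seed + i)   # randomness the agent hasn't trained on
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy_fn(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    returns = np.asarray(returns)
    return {"eval_mean": returns.mean(), "eval_std": returns.std(), "eval_min": returns.min()}
```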

8) Non-Stationarity: “The Target Moves Because You Moved It”

This is the meta-failure that sits behind everything else.

The policy changes. The state visitation changes. The experience distribution changes. The critic target changes. The next batch is “from a different world.”

That means even if every update is “correct,” the system can still be unstable.

Because you’re not optimizing a static objective on a fixed dataset.

You’re steering a dynamic loop.


My August Debugging Upgrade: “Stability First, Performance Second”

I used to treat reward as the main metric.

Now reward is the last thing I trust.

In August, my priority became:

  1. Is the run stable?
  2. Are the internal signals sane?
  3. Does behavior improve in a way that matches the task?
  4. Then look at reward.

So I started treating these as “must be true” signals:

  • entropy doesn’t collapse instantly
  • advantage distribution looks reasonable
  • value estimates don’t drift into fantasy
  • updates don’t explode or vanish
  • evaluation isn’t a one-seed miracle

A single successful run is not evidence.

In RL, it’s often just a lucky seed.

If I can’t reproduce improvement across multiple runs, I treat the result as “unstable until proven otherwise.”
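
In practice that means the experiment script runs the same config under several seeds and reports the spread, not the best run. A tiny sketch; `train_fn` is a stand-in for whatever wrapper launches one training run and returns its final evaluation score.

```python
import numpy as np

def reproducibility_check(train_fn, seeds=(0, 1, 2)) -> np.ndarray:
    """Same config, several seeds; judge the median and the worst run, not the best."""
    finals = np.asarray([train_fn(seed=s) for s in seeds], dtype=np.float64)
    print(f"final eval return per seed: {finals}")
    print(f"median: {np.median(finals):.1f}   worst: {finals.min():.1f}")
    return finals
```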


The Minimal Instrument Panel I Always Want Now

This is the instrumentation list I’m trying to standardize across experiments:

Outcome signals

Train + eval reward, episode length distribution.

Exploration signals

Entropy over time + action distribution snapshots.

Value / critic signals

Value loss + explained variance + value scale sanity.

Update safety

Advantage mean/variance + gradient norms when suspicious.
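
Written out as a flat record, the panel is small enough to fit on one screen. The keys below are my own naming convention, not any logger’s schema; I fill this dict once per update and plot every field over time.

```python
# One row of the instrument panel, written once per update.
panel = {
    # outcome signals
    "train/episode_return": None, "eval/episode_return": None, "train/episode_length": None,
    # exploration signals
    "policy/entropy": None, "policy/action_hist": None,
    # value / critic signals
    "value/loss": None, "value/explained_variance": None, "value/mean_prediction": None,
    # update safety
    "update/adv_mean": None, "update/adv_std": None,
    "update/approx_kl": None, "update/grad_norm": None,
}
```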

I’m basically rebuilding the deep learning mentality from 2017:

training is debugging with graphs.

RL just adds more graphs, and more ways for them to lie.


Field Notes (What I’d Tell My Past Self)

1) Instability isn’t an edge case—it’s the default

If the run is stable, that’s the surprising thing.

2) “It learned once” is not a milestone

It’s a hypothesis.

3) Most RL progress is “removing ways to fail”

Big wins are rare. Most learning comes from systematically eliminating breakage.

4) You don’t need more cleverness yet—you need safety rails

This month shifted my attention toward constrained updates and reproducibility.

Because without those, “improvement” is mostly storytelling.

August takeaway

Most RL progress isn’t cleverness.

It’s removing failure modes until the learning loop becomes stable enough to trust.


What’s Next

This month was about naming the demons.

Next month I want to start building defenses.

If catastrophic updates are real, then I need methods that explicitly care about step size in policy space—not just in parameter space.

So September is where I lean into stability mechanisms:

  • safer updates
  • constrained policy shifts
  • and the intuition behind why “trust region” ideas exist at all

If August was the catalog of breakage…

September is where I start designing guardrails.

Axel Domingues - 2026