
After actor-critic finally felt “trainable,” I hit the next wall - RL doesn’t just fail—it fails in loops. This month is my map of the most common ways it breaks.
Axel Domingues
In July, actor-critic gave me my first taste of something I could call “training.”
Not just experimenting. Not just hoping.
Training: tweak a knob, observe a predictable change, repeat.
And then, inevitably, August happened.
Because the moment RL starts feeling trainable, you do the natural thing:
You try to scale it.
More steps. Harder tasks. Slightly bigger networks. Slightly different hyperparameters.
And suddenly the agent isn’t learning.
Or it is… and then it collapses.
Or it’s “learning” but only because it found a weird loophole in the environment.
Or it learns in one run and fails in the next with the same settings.
So this post is not an algorithm.
It’s a catalog.
A breakdown of the failure modes I keep hitting—so I can stop treating instability as “mystery” and start treating it like an engineering reality.
Why instability happens
You’re training a data generator while training the model.
What this post is
A catalog of breakage patterns I keep hitting — so I can diagnose, not restart.
The core lens
Policy → data → update → new policy
A feedback loop that amplifies mistakes.
The goal
Stability first, performance second.
Make reward mean something.
In supervised learning, the dataset is mostly fixed.
Even if you shuffle, augment, or resample, the world doesn’t change because your model took a gradient step.
In RL, the model changes the world it learns from.
That means every update risks shifting the distribution of experience.
And once I saw it that way, the instability stopped feeling like a flaw and started feeling like a design constraint:
RL isn’t “train model on data.”
RL is “train a data generator while training the model.”
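To make that loop concrete, here is a toy sketch: a two-armed bandit with a REINFORCE-style update. Everything in it is illustrative rather than any particular library's API. The point is structural: the batch only exists because the current policy generated it, and the update changes what the next batch will look like.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the loop: a 2-armed bandit where the policy is just
# the probability of pulling arm 1. The arm reward means are made up.
true_means = np.array([0.3, 0.7])
theta = 0.0  # policy parameter (logit of picking arm 1)

for iteration in range(200):
    p_arm1 = 1.0 / (1.0 + np.exp(-theta))

    # Policy -> data: which experience gets collected depends on the policy.
    arms = (rng.random(64) < p_arm1).astype(int)
    rewards = rng.normal(true_means[arms], 0.1)

    # Data -> update: a crude REINFORCE-style step on self-generated data.
    baseline = rewards.mean()
    grad = np.mean((rewards - baseline) * (arms - p_arm1))
    theta += 1.0 * grad

    # Update -> new policy: the next iteration samples from a different
    # distribution, so every mistake gets baked into the next dataset.
```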

To make this useful (and not just complaining), I’ve been forcing myself to classify failures into buckets.
Here are the buckets I keep coming back to:
Environment failure
The environment lies, is exploitable, or behaves unexpectedly.
Data failure
Your policy produced biased experience, so you train on a distorted world.
Signal failure
The learning signal is noisy; variance is the default.
Update failure
One step can ruin everything; updates are dangerous.
If evaluation is unreliable, every other bucket becomes invisible.
You can’t tell if you improved, regressed, or just got a lucky seed.
August is where I started treating each training run like a system with subsystems.
And the goal is no longer “maximize reward.”
The goal is:
Make the learning loop stable enough that reward means something.
This is my “wall of shame” list—each item is something I’ve seen in practice.
Not theoretical. Not abstract.
The reason I’m writing it down is simple:
When a run fails, I don’t want to restart and hope for a different outcome.
I want to diagnose.
Exploration collapse
The agent stops looking.
Reward hacking
The agent learns the wrong lesson.
High-variance updates
Learning looks like noise.
Catastrophic updates
One bad step destroys competence.
Critic lies
Baseline becomes misinformation.
Bootstrapping errors
The future you predict is wrong.
Correlated data
You train on echoes of yourself.
Non-stationarity
The target moves because you moved it.
Exploration collapse
Symptom
What it feels like: The agent made a few early guesses, got lucky (or unlucky), and then committed.
Why it happens
Exploration pressure is almost always under-tuned by default.
And many algorithms “optimize themselves into certainty” if you let them.
How I catch it
“Early success” in RL is often a trap.
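Concretely, the signal I watch is policy entropy over time (plus the action-distribution snapshots from the instrumentation list below). A minimal NumPy sketch; the 10%-of-maximum threshold is my own arbitrary assumption, not a rule:

```python
import numpy as np

def mean_policy_entropy(action_probs: np.ndarray) -> float:
    """Mean entropy (nats) over a batch of discrete action distributions.

    action_probs: shape (batch, num_actions), rows summing to 1.
    """
    eps = 1e-8
    return float(-(action_probs * np.log(action_probs + eps)).sum(axis=1).mean())

# Hypothetical usage: log this every update and compare against the
# uniform-policy ceiling, log(num_actions).
probs = np.array([[0.997, 0.001, 0.001, 0.001],
                  [0.995, 0.003, 0.001, 0.001]])
entropy = mean_policy_entropy(probs)
max_entropy = np.log(probs.shape[1])
if entropy < 0.1 * max_entropy:  # threshold is an assumption
    print(f"entropy {entropy:.3f} << max {max_entropy:.3f}: possible exploration collapse")
```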
Reward hacking
Symptom
What it feels like: The system passed the metric and failed the task.
Why it happens
Rewards are interfaces. Interfaces get exploited.
This is where my January line comes back:
learning is an interface problem.
How I catch it
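I don't have a fully automatic detector for this; watching the agent behave is still the real check. But one cheap sketch, assuming I can log an independent "did it actually do the task" measure next to the optimized reward (both names here are hypothetical):

```python
import numpy as np

def looks_like_reward_hacking(proxy_reward: np.ndarray,
                              task_success: np.ndarray,
                              window: int = 20) -> bool:
    """Heuristic on per-episode logs: the reward being optimized keeps
    climbing while an independent task measure stalls or degrades."""
    def recent_change(x: np.ndarray) -> float:
        return float(x[-window:].mean() - x[-2 * window:-window].mean())

    return recent_change(proxy_reward) > 0 and recent_change(task_success) <= 0
```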
High-variance updates
Symptom
What it feels like: I’m training, but the gradient is basically a random walk.
Why it happens
Returns are noisy. Trajectories are correlated.
And your “dataset” is whatever your current policy happened to experience.
How I catch it
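The two numbers I keep coming back to are the update-safety items from the instrumentation list below: advantage statistics per batch, and a global gradient norm when something looks off. A NumPy sketch, assuming the advantages and gradients have already been pulled out of whatever framework is in use:

```python
import numpy as np

def advantage_report(advantages: np.ndarray) -> dict:
    """If the std dwarfs the mean, one batch says almost nothing about the
    true gradient direction: the update is mostly noise."""
    return {
        "mean": float(advantages.mean()),
        "std": float(advantages.std()),
        "signal_to_noise": float(abs(advantages.mean()) / (advantages.std() + 1e-8)),
        "max_abs": float(np.abs(advantages).max()),
    }

def global_grad_norm(grads: list[np.ndarray]) -> float:
    """L2 norm over all parameter gradients, however they were extracted."""
    return float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
```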
Catastrophic updates
Symptom
What it feels like: The agent learned how to walk, then took one step and forgot its legs.
Why it happens
Policy optimization is sensitive.
If a step is too big, you can jump from “pretty good policy” to “garbage policy” instantly.
And because the policy generates data, once you jump into garbage, you generate garbage experience and train on it.
Now you’re digging.
How I catch it
I think about update limits not as math, but as safety constraints.
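The number I trust most here is how far a single update moved the policy in distribution space, not parameter space. A minimal sketch, assuming the old and new policies can be evaluated on the same batch of states (the 0.05 threshold is an arbitrary assumption):

```python
import numpy as np

def mean_kl(old_probs: np.ndarray, new_probs: np.ndarray) -> float:
    """Average KL(old || new) over a batch of discrete action distributions,
    computed on the same states before and after a single update."""
    eps = 1e-8
    kl = (old_probs * (np.log(old_probs + eps) - np.log(new_probs + eps))).sum(axis=1)
    return float(kl.mean())

# Hypothetical guardrail inside the training loop:
# if mean_kl(old_probs, new_probs) > 0.05:
#     flag_or_rollback_this_update()  # placeholder for whatever the response is
```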
Critic lies
Symptom
What it feels like: The actor is confidently optimizing a bad signal.
Why it happens
The critic isn’t just a helper. It’s a teacher.
If it’s wrong, the actor is being trained on misinformation.
How I catch it
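The critic check from the instrumentation list below that has earned the most trust: explained variance of the value predictions against the observed returns. A minimal NumPy version:

```python
import numpy as np

def explained_variance(values: np.ndarray, returns: np.ndarray) -> float:
    """How much of the variance in observed returns the critic accounts for.

    ~1.0 : the critic tracks returns well.
    ~0.0 : no better than predicting the mean return.
    < 0  : actively worse than the mean; its advantages are misinformation.
    """
    var_returns = float(returns.var())
    if var_returns < 1e-8:
        return float("nan")
    return 1.0 - float((returns - values).var()) / var_returns
```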
Bootstrapping errors
Symptom
What it feels like: The model’s internal beliefs detach from reality.
Why it happens
Many RL methods bootstrap:
they estimate future returns using their own value predictions.
If those predictions are biased, the bias can compound.
How I catch it
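What I check is the "value scale sanity" item from the instrumentation list below: do the bootstrapped predictions stay within the range of returns the agent has actually seen? A sketch:

```python
import numpy as np

def value_scale_report(values: np.ndarray, returns: np.ndarray) -> dict:
    """Compare the scale of bootstrapped value predictions against the scale
    of empirically observed returns. Predictions drifting far outside the
    observed range suggest the bootstrap is feeding on its own bias."""
    return {
        "value_mean": float(values.mean()),
        "return_mean": float(returns.mean()),
        "value_min": float(values.min()),
        "value_max": float(values.max()),
        "return_min": float(returns.min()),
        "return_max": float(returns.max()),
        "mean_bias": float((values - returns).mean()),  # systematic over/under-estimation
    }
```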
Correlated data
Symptom
What it feels like: The agent becomes good at the last few minutes of its own life.
Why it happens
Trajectories are correlated, especially on-policy. And even off-policy methods can become “recent experience addicts.”
How I catch it
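One way to see the "recent experience addict" pattern, assuming each stored transition remembers the environment step it was collected at (that bookkeeping is my own assumption, not something every replay buffer gives you):

```python
import numpy as np

def sample_age_report(sampled_steps: np.ndarray, current_step: int) -> dict:
    """Age, in environment steps, of the transitions in one sampled batch.
    A tiny median age means the updates are mostly echoes of the recent policy."""
    ages = current_step - sampled_steps
    return {
        "median_age": float(np.median(ages)),
        "p90_age": float(np.percentile(ages, 90)),
        "frac_recent": float((ages < 0.1 * current_step).mean()),  # "recent" cutoff is arbitrary
    }
```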
Non-stationarity
This is the meta-failure that sits behind everything else.
The policy changes. The state visitation changes. The experience distribution changes. The critic target changes. The next batch is “from a different world.”
That means even if every update is “correct,” the system can still be unstable.
Because you’re not optimizing a static objective on a fixed dataset.
You’re steering a dynamic loop.
I used to treat reward as the main metric.
Now reward is the last thing I trust.
In August, my priority became: stability first, performance second.
So I started treating a handful of checks as “must be true” signals before I believe a run.
The biggest one is reproducibility. In supervised learning, a better number usually means a better model. In RL, it’s often just a lucky seed.
If I can’t reproduce improvement across multiple runs, I treat the result as “unstable until proven otherwise.”
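The reproducibility check is boring on purpose: run the same configuration on several seeds and look at the spread before believing the mean. A sketch:

```python
import numpy as np

def seed_spread_report(final_eval_rewards: list[float]) -> dict:
    """Summary over N runs with identical settings but different seeds.
    If the spread is comparable to the claimed improvement, the improvement
    is indistinguishable from seed luck."""
    r = np.asarray(final_eval_rewards, dtype=float)
    return {
        "n_seeds": int(r.size),
        "mean": float(r.mean()),
        "std": float(r.std(ddof=1)) if r.size > 1 else float("nan"),
        "min": float(r.min()),
        "max": float(r.max()),
    }
```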
This is the instrumentation list I’m trying to standardize across experiments:
Outcome signals
Train + eval reward, episode length distribution.
Exploration signals
Entropy over time + action distribution snapshots.
Value / critic signals
Value loss + explained variance + value scale sanity.
Update safety
Advantage mean/variance + gradient norms when suspicious.
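Taken together, that list is basically a schema. A sketch of the per-run record I'm converging on (the field names are mine, not any library's):

```python
from dataclasses import dataclass, field

@dataclass
class RunLog:
    """One list per signal, appended to every update/eval cycle."""
    # Outcome signals
    train_reward: list[float] = field(default_factory=list)
    eval_reward: list[float] = field(default_factory=list)
    episode_lengths: list[int] = field(default_factory=list)
    # Exploration signals
    policy_entropy: list[float] = field(default_factory=list)
    action_histogram: list = field(default_factory=list)  # periodic snapshots
    # Value / critic signals
    value_loss: list[float] = field(default_factory=list)
    explained_variance: list[float] = field(default_factory=list)
    value_bias: list[float] = field(default_factory=list)
    # Update safety
    advantage_mean: list[float] = field(default_factory=list)
    advantage_std: list[float] = field(default_factory=list)
    grad_norm: list[float] = field(default_factory=list)
```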
I’m basically rebuilding the deep learning mentality from 2017:
training is debugging with graphs.
RL just adds more graphs, and more ways for them to lie.
If the run is stable, that’s the surprising thing.
And an improvement isn’t a result; it’s a hypothesis until it reproduces.
Big wins are rare. Most learning comes from systematically eliminating breakage.
This month shifted my attention toward constrained updates and reproducibility.
Because without those, “improvement” is mostly storytelling.
August takeaway
Most RL progress isn’t cleverness.
It’s removing failure modes until the learning loop becomes stable enough to trust.
This month was about naming the demons.
Next month I want to start building defenses.
If catastrophic updates are real, then I need methods that explicitly care about step size in policy space—not just in parameter space.
So September is where I lean into stability mechanisms.
If August was the catalog of breakage…
September is where I start designing guardrails.