
After DDPG, I stopped thinking of RL instability as a surprise and started treating it like a design constraint. This month I learned why TRPO exists — and why PPO/PPO2 became the practical answer.
Axel Domingues
In September I learned something uncomfortable:
RL doesn’t “mostly work” and occasionally fail.
RL mostly fails unless you actively build guardrails.
DDPG was my first real experience of off-policy seduction: replay buffers, sample efficiency, and then… sudden collapse. It wasn’t just that it broke. It’s that it broke in ways that were hard to diagnose.
So October became a different kind of month.
Design-for-stability month.
And it introduced a concept I didn’t fully appreciate before:
In RL, the policy itself is a moving system component.
If you let it change too fast, you don’t get learning — you get chaos.
That’s the intuition behind TRPO.
And that’s why PPO / PPO2 feel like the first time RL becomes something you can run repeatedly without fear.
In deep learning, stability comes from familiar levers: sensible learning rates, normalization, and reasonably well-behaved loss surfaces.
In reinforcement learning, those still matter…
…but there’s an extra destabilizer:
the data distribution depends on your current policy.
When you update the policy, you change what data you collect.
When you change the data, you change what “learning” even means.
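One way to make that concrete (standard policy gradient notation, nothing specific to my setup): the gradient is an expectation over trajectories sampled from the current policy, so the thing you are averaging over moves every time the parameters move.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t
    \right]
```

The expectation is taken under the policy itself: update the parameters and the distribution of trajectories shifts with them.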
So this month was about learning to ask a different question: not just “did the reward go up?” but “how far did the policy just move, and can I trust it?”
TRPO felt like someone looked at RL and said:
“Okay. We need to treat optimization like surgery.”
Not because the gradients are fragile…
…but because the policy is the data generator.
TRPO is essentially a promise: every update improves the policy without moving it further than a trusted distance from the policy that collected the data.
I’m not going to pretend I “feel” all the math yet.
But I do feel the engineering motivation:
If policy updates can jump too far, training becomes a coin flip.
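For reference, the standard TRPO statement (the textbook formulation, not something I re-derived): maximize the surrogate improvement, but only inside a KL ball around the current policy.

```latex
\max_\theta \;
\mathbb{E}_t\!\left[
  \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t
\right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[
  D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)
\right] \le \delta
```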
And then PPO shows up and feels like TRPO’s practical cousin: the same promise, enforced with a much simpler mechanism.

In my head, the relationship became:
TRPO
Stability as a principle: protect learning by limiting policy change.
PPO / PPO2
Stability as a tool: a practical guardrail you can run daily.
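In symbols, PPO-clip keeps TRPO’s probability ratio but drops the hard constraint, clipping the ratio instead (the standard formulation, with ε typically around 0.1 to 0.2):

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```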
Update size
Why big policy updates destroy learning (even when reward rises).
Trust region intuition
TRPO as “change management”: improve, but conservatively.
PPO clipping
What clipping prevents: one update that pushes the policy too far (sketched in code just after this list).
Stability checklist
Signals that tell me training is stable, not just lucky.
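Below is a minimal sketch of that clipped objective in PyTorch. The tensor names (new_log_probs, old_log_probs, advantages) and the clip_eps default are my own placeholders, not from any particular library:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss: one bad batch can't push the policy ratio far."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped vs. clipped surrogate objectives.
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Take the pessimistic minimum; negate because optimizers minimize.
    policy_loss = -torch.min(surr_unclipped, surr_clipped).mean()

    # Diagnostic: fraction of samples where the clip was actually active.
    clip_fraction = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return policy_loss, clip_fraction
```

The clip_fraction it returns is one of the signals I ended up watching below.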
Here’s the core that I’m keeping:
Policy change is dangerous
If the policy moves too fast, the data distribution shifts under your feet.
On-policy is honest
Fresh data from the current policy trades efficiency for stability.
Clipping is a guardrail
PPO tries to prevent “one bad update” from wrecking everything.
Diagnostics are required
Entropy + KL + value health are stability signals, not optional extras.
This month was mostly: run, watch, adjust, run again — but with more discipline.
I kept this mostly in the classic continuous control sandbox, not because those environments are the final destination, but because they produce exactly the kinds of failures PPO and TRPO are designed to prevent.
This month I started treating stability as something measurable.
Episode return (mean + variance)
Mean return is the headline. Variance is the truth. If the curve is volatile across seeds, stability is missing.
Policy entropy
Entropy collapsing early is “premature certainty.” Entropy staying too high is “permanent confusion.”
Approx KL (how far the policy moved)
If the policy shifts too much in one update, you get instability masked as “learning.”
Value loss + explained variance
If the value function is weak, the learning signal gets noisy. If it dominates, the policy stops improving.
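Below is a minimal sketch of how these can be computed from one update’s batch, assuming NumPy arrays of per-sample log-probabilities, empirical returns, and value predictions (every name here is my own placeholder):

```python
import numpy as np

def stability_diagnostics(old_log_probs, new_log_probs, returns, values):
    """Rough per-update stability signals: approx KL, entropy estimate, explained variance."""
    # Approximate KL(old || new) on the sampled actions, using the common
    # low-variance estimator mean(ratio - 1 - log_ratio).
    log_ratio = new_log_probs - old_log_probs
    approx_kl = float(np.mean(np.exp(log_ratio) - 1.0 - log_ratio))

    # Crude sample-based entropy estimate: average negative log-prob
    # of the actions actually taken, under the new policy.
    entropy_estimate = float(-np.mean(new_log_probs))

    # Explained variance of the value function: 1.0 is perfect,
    # <= 0 means the critic is no better than predicting the mean return.
    var_returns = np.var(returns)
    explained_var = float("nan") if var_returns == 0 else float(1.0 - np.var(returns - values) / var_returns)

    return {"approx_kl": approx_kl, "entropy": entropy_estimate, "explained_variance": explained_var}
```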
I also keep an eye on a few secondary signals beyond these four, like the clipping fraction.
This month I tried to name “stability failures” the same way I name bugs.
Update thrash
What it looks like: returns jump wildly; entropy is chaotic; KL spikes.
First check: approximate KL and update frequency (are updates too aggressive?).
Entropy collapse
What it looks like: entropy collapses early; learning plateaus.
First check: entropy curve + action distribution (premature certainty).
Weak critic
What it looks like: value loss “fine” but policy stops improving.
First check: explained variance + advantage stats (signal quality).
Over-constrained updates
What it looks like: training feels like pushing with the handbrake on.
First check: clipping fraction / KL staying tiny (clip too tight).
Brittle configuration
What it looks like: one seed great, others fail; brittle behavior.
First check: multi-seed eval + rollout length sensitivity.
Reward without the behavior
What it looks like: reward improves but behavior is nonsense or non-transferable.
First check: short rollouts and invariant checks (“is it doing the intended thing?”).
Lucky curves
What it looks like: one seed looks “boring and good,” the next collapses.
First check: evaluate across seeds before you trust a curve.
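The “evaluate across seeds” check is simple enough to script. A sketch, assuming some train_and_evaluate(seed) function that runs a full training job and returns its mean evaluation return (that function name is a placeholder for whatever your training entry point is):

```python
import numpy as np

def multi_seed_check(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run the same config across several seeds and report the spread, not just the best run."""
    returns = np.array([train_and_evaluate(seed) for seed in seeds])
    print(f"mean return: {returns.mean():.1f} +/- {returns.std():.1f} over {len(seeds)} seeds")
    print(f"worst seed:  {returns.min():.1f}  (a single great seed proves nothing)")
    return returns
```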
This month wasn’t about “discovering a new trick.”
It was about accepting something deeper:
Stability is not a property you stumble into.
It’s a feature you design for.
That’s why PPO feels so important to learn before I get ambitious.
If my future goal is to apply RL in messy, real settings, this stability discipline is the part I have to learn first.
I used to think TRPO was exotic.
Now it feels like a familiar engineering idea: change management. Don’t change a live system faster than you can verify what it’s doing.
RL just forces that idea into the core algorithm.
The first time I saw a “boring” PPO curve, I realized:
This is what trainable looks like.
This month I learned to value diagnostics like entropy and KL the same way I value logs, tests, and monitoring in ordinary software.
RL needs its own instrumentation culture.
October takeaway
Stability is not luck.
It’s controlled policy change + diagnostics that catch drift early.
What PPO gave me
A repeatable workflow: boring curves, fewer collapses, and progress I can trust.
This month taught me how to keep policy updates from blowing up the learning process.
But there’s another kind of failure waiting for me:
what if the agent almost never sees reward at all?
Sparse reward problems don’t fail by instability.
They fail by silence.
Next month is HER: learning from what didn’t happen.
And I’m already anticipating the weirdest feeling in RL so far:
Training on failures… and watching them become progress.