
After DDPG, I stopped thinking of RL instability as a surprise and started treating it like a design constraint. This month I learned why TRPO exists — and why PPO/PPO2 became the practical answer.
Axel Domingues
In September I learned something uncomfortable:
RL doesn’t “mostly work” and occasionally fail.
RL mostly fails unless you actively build guardrails.
DDPG was my first real experience of off-policy seduction: replay buffers, sample efficiency, and then… sudden collapse. It wasn’t just that it broke. It’s that it broke in ways that were hard to diagnose.
So October became a different kind of month.
Design-for-stability month.
And it introduced a concept I didn’t fully appreciate before:
In RL, the policy itself is a moving system component.
If you let it change too fast, you don’t get learning — you get chaos.
That’s the intuition behind TRPO.
And that’s why PPO / PPO2 feel like the first time RL becomes something you can run repeatedly without fear.
In deep learning, stability comes from familiar levers: sensible learning rates, normalization, and reasonably well-behaved loss surfaces.
In reinforcement learning, those still matter…
…but there’s an extra destabilizer:
the data distribution depends on your current policy.
When you update the policy, you change what data you collect.
When you change the data, you change what “learning” even means.
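One way to make that concrete (standard policy gradient notation, nothing specific to my setup): the gradient is an expectation over trajectories sampled from the current policy, so the thing you are averaging over moves every time the parameters move.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t
    \right]
```

The expectation is taken under the policy itself: update the parameters and the distribution of trajectories shifts with them.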
So this month was about learning to ask a different question: not just “did the reward go up?” but “how far did the policy just move, and can I trust it?”
TRPO felt like someone looked at RL and said:
“Okay. We need to treat optimization like surgery.”
Not because the gradients are fragile…
…but because the policy is the data generator.
TRPO is essentially a promise: every update improves the policy without moving it further than a trusted distance from the policy that collected the data.
I’m not going to pretend I “feel” all the math yet.
But I do feel the engineering motivation:
If policy updates can jump too far, training becomes a coin flip.
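For reference, the standard TRPO statement (the textbook formulation, not something I re-derived): maximize the surrogate improvement, but only inside a KL ball around the current policy.

```latex
\max_\theta \;
\mathbb{E}_t\!\left[
  \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t
\right]
\quad \text{s.t.} \quad
\mathbb{E}_t\!\left[
  D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)
\right] \le \delta
```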
And then PPO shows up and feels like TRPO’s practical cousin: the same promise, enforced with a much simpler mechanism.

In my head, the relationship became:
TRPO
Stability as a principle: protect learning by limiting policy change.
PPO / PPO2
Stability as a tool: a practical guardrail you can run daily.
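In symbols, PPO-clip keeps TRPO’s probability ratio but drops the hard constraint, clipping the ratio instead (the standard formulation, with ε typically around 0.1 to 0.2):

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```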
Update size
Why big policy updates destroy learning (even when reward rises).
Trust region intuition
TRPO as “change management”: improve, but conservatively.
PPO clipping
What clipping prevents: one update that pushes the policy too far (sketched in code just after this list).
Stability checklist
Signals that tell me training is stable, not just lucky.
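Below is a minimal sketch of that clipped objective in PyTorch. The tensor names (new_log_probs, old_log_probs, advantages) and the clip_eps default are my own placeholders, not from any particular library:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss: one bad batch can't push the policy ratio far."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped vs. clipped surrogate objectives.
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Take the pessimistic minimum; negate because optimizers minimize.
    policy_loss = -torch.min(surr_unclipped, surr_clipped).mean()

    # Diagnostic: fraction of samples where the clip was actually active.
    clip_fraction = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return policy_loss, clip_fraction
```

The clip_fraction it returns is one of the signals I ended up watching below.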
Here’s the core that I’m keeping:
Policy change is dangerous
If the policy moves too fast, the data distribution shifts under your feet.
On-policy is honest
Fresh data from the current policy trades efficiency for stability.
Clipping is a guardrail
PPO tries to prevent “one bad update” from wrecking everything.
Diagnostics are required
Entropy + KL + value health are stability signals, not optional extras.
This month was mostly: run, watch, adjust, run again — but with more discipline.
I kept this mostly in the classic continuous control sandbox, not because those environments are the final destination, but because they produce exactly the kinds of failures PPO and TRPO are designed to prevent.
This month I started treating stability as something measurable.
Episode return (mean + variance)
Mean return is the headline. Variance is the truth. If the curve is volatile across seeds, stability is missing.
Policy entropy
Entropy collapsing early is “premature certainty.” Entropy staying too high is “permanent confusion.”
Approx KL (how far the policy moved)
If the policy shifts too much in one update, you get instability masked as “learning.”
Value loss + explained variance
If the value function is weak, the learning signal gets noisy. If it dominates, the policy stops improving.
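Below is a minimal sketch of how these can be computed from one update’s batch, assuming NumPy arrays of per-sample log-probabilities, empirical returns, and value predictions (every name here is my own placeholder):

```python
import numpy as np

def stability_diagnostics(old_log_probs, new_log_probs, returns, values):
    """Rough per-update stability signals: approx KL, entropy estimate, explained variance."""
    # Approximate KL(old || new) on the sampled actions, using the common
    # low-variance estimator mean(ratio - 1 - log_ratio).
    log_ratio = new_log_probs - old_log_probs
    approx_kl = float(np.mean(np.exp(log_ratio) - 1.0 - log_ratio))

    # Crude sample-based entropy estimate: average negative log-prob
    # of the actions actually taken, under the new policy.
    entropy_estimate = float(-np.mean(new_log_probs))

    # Explained variance of the value function: 1.0 is perfect,
    # <= 0 means the critic is no better than predicting the mean return.
    var_returns = np.var(returns)
    explained_var = float("nan") if var_returns == 0 else float(1.0 - np.var(returns - values) / var_returns)

    return {"approx_kl": approx_kl, "entropy": entropy_estimate, "explained_variance": explained_var}
```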
I also keep an eye on a few secondary signals beyond these four, like the clipping fraction.
This month I tried to name “stability failures” the same way I name bugs.
Update thrash
What it looks like: returns jump wildly; entropy is chaotic; KL spikes.
First check: approximate KL and update frequency (are updates too aggressive?).
Entropy collapse
What it looks like: entropy collapses early; learning plateaus.
First check: entropy curve + action distribution (premature certainty).
Weak critic
What it looks like: value loss “fine” but policy stops improving.
First check: explained variance + advantage stats (signal quality).
Over-constrained updates
What it looks like: training feels like pushing with the handbrake on.
First check: clipping fraction / KL staying tiny (clip too tight).
Brittle configuration
What it looks like: one seed great, others fail; brittle behavior.
First check: multi-seed eval + rollout length sensitivity.
Reward without the behavior
What it looks like: reward improves but behavior is nonsense or non-transferable.
First check: short rollouts and invariant checks (“is it doing the intended thing?”).
Lucky curves
What it looks like: one seed looks “boring and good,” the next collapses.
First check: evaluate across seeds before you trust a curve.
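The “evaluate across seeds” check is simple enough to script. A sketch, assuming some train_and_evaluate(seed) function that runs a full training job and returns its mean evaluation return (that function name is a placeholder for whatever your training entry point is):

```python
import numpy as np

def multi_seed_check(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run the same config across several seeds and report the spread, not just the best run."""
    returns = np.array([train_and_evaluate(seed) for seed in seeds])
    print(f"mean return: {returns.mean():.1f} +/- {returns.std():.1f} over {len(seeds)} seeds")
    print(f"worst seed:  {returns.min():.1f}  (a single great seed proves nothing)")
    return returns
```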
This month wasn’t about “discovering a new trick.”
It was about accepting something deeper:
Stability is not a property you stumble into.
It’s a feature you design for.
That’s why PPO feels so important to learn before I get ambitious.
If my future goal is to apply RL in messy, real settings, this stability discipline is the part I have to learn first.
I used to think TRPO was exotic.
Now it feels like a familiar engineering idea: change management. Don’t change a live system faster than you can verify what it’s doing.
RL just forces that idea into the core algorithm.
The first time I saw a “boring” PPO curve, I realized:
This is what trainable looks like.
This month I learned to value diagnostics like entropy and KL the same way I value logs, tests, and monitoring in ordinary software.
RL needs its own instrumentation culture.
October takeaway
Stability is not luck.
It’s controlled policy change + diagnostics that catch drift early.
What PPO gave me
A repeatable workflow: boring curves, fewer collapses, and progress I can trust.
This month taught me how to keep policy updates from blowing up the learning process.
But there’s another kind of failure waiting for me:
what if the agent almost never sees reward at all?
Sparse reward problems don’t fail by instability.
They fail by silence.
Next month is HER: learning from what didn’t happen.
And I’m already anticipating the weirdest feeling in RL so far:
Training on failures… and watching them become progress.