
This month I left “toy” discrete actions and stepped into continuous control. DDPG looked like the perfect deal—until I learned what off-policy really costs.
Axel Domingues
In deep learning, instability usually feels like a bug you can corner: gradients explode, activations saturate, initialization is off.
In reinforcement learning, instability is often the default state of the system.
This month I crossed a threshold: continuous control.
No more “left / right.” Now the agent outputs real-valued actions—torques, forces, steering angles—things that feel closer to robotics than videogames.
And that’s where I met the seduction:
Off-policy learning looks like free efficiency.
You can reuse old experience. You can learn from a replay buffer. You can pretend data is “just data.”
DDPG (Deep Deterministic Policy Gradient) promises exactly that.
In plain English: an actor-critic method for continuous actions, where the actor outputs a real-valued action directly (deterministic), and a critic learns to judge it.
And then it reminds you: in RL, your data is you.
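To pin down that plain-English version, here is roughly what the two networks look like. This is a minimal PyTorch sketch under my own assumptions (hidden sizes of 256, a tanh output rescaled by an act_limit), not a reference implementation; the real algorithm still needs exploration noise, a replay buffer, and target networks on top.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to one concrete real-valued action."""

    def __init__(self, obs_dim: int, act_dim: int, act_limit: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.act_limit = act_limit  # rescale to the environment's action bounds

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.act_limit * self.net(obs)


class Critic(nn.Module):
    """Q(s, a): scores a state-action pair with a single scalar."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```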
With discrete environments, I could sometimes forgive myself for hand-waving exploration.
With continuous actions, exploration becomes an engineering question in its own right:
This is where RL stops feeling like a chapter of ML and starts feeling like control systems wearing a neural network mask.
I’m not chasing “solves MuJoCo” this early.
I’m trying to build a working mental model for why continuous control is different, and why off-policy is both powerful and fragile.
Continuous actions
What changes when actions are real-valued (and exploration can crash you).
Deterministic actor-critic
Actor outputs actions. Critic judges them.
How the two co-evolve (and mislead each other).
Off-policy costs
Replay buffers feel like “free efficiency” until distribution mismatch shows up.
Debugging truth
Diagnostics that reveal drift: action stats, Q-scale, buffer health, silent failure.

Actor
Outputs the action directly (continuous control).
Critic
Scores how good an action is in a state (and can hallucinate).
Replay buffer
Stores old transitions so you can train many times.
Target networks
Slow-moving copies that reduce the “moving target” problem.
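To make the last two parts concrete, here is a minimal sketch of a replay buffer and the soft target update, assuming PyTorch; the capacity, batch size, and tau value are typical choices of mine, not anything canonical.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.storage = deque(maxlen=capacity)  # old experience falls off the back

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, float(done)))

    def sample(self, batch_size: int = 128):
        batch = random.sample(self.storage, batch_size)
        # Stack each field across the batch, then hand it to PyTorch.
        fields = [np.asarray(field, dtype=np.float32) for field in zip(*batch)]
        return [torch.from_numpy(f) for f in fields]


@torch.no_grad()
def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005):
    """Nudge the slow target copy a small step toward the online network."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```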
Here’s the mental picture that finally made DDPG click: the actor drives, the critic grades the driving, and the target networks keep that grading from shifting under the actor’s feet.
And the replay buffer is the “memory” that feeds training.
Which sounds stable… until you remember that every one of those parts is still being learned, all at the same time.
This month felt like building a machine out of parts that are still melting.
Off-policy has a very tempting story: keep every transition you’ve ever collected, replay it many times, and squeeze more learning out of less interaction.
But off-policy isn’t a free lunch. It’s a loan.
Your replay buffer contains experience from older policies.
So the critic is trained on data that doesn’t match the actor’s current behavior. If the mismatch grows, the critic can become confident about situations the current actor never actually visits.
You borrow stability now… and repay it later with complexity: target networks, tuned update rates, carefully shaped noise, and constant buffer sanity checks.
In continuous control, random actions can be catastrophically bad. So exploration becomes noise design: what kind of noise, how much of it, and how it decays over training.
And the annoying part: “more exploration” can make training look worse for a long time.
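The simplest version of that noise design is additive Gaussian noise on the deterministic action (the original DDPG paper used an Ornstein-Uhlenbeck process instead). A sketch; the sigma values and the annealing schedule below are purely illustrative.

```python
import numpy as np

def noisy_action(actor_action: np.ndarray,
                 sigma: float,
                 act_low: float,
                 act_high: float,
                 rng: np.random.Generator) -> np.ndarray:
    """Add Gaussian exploration noise to a deterministic action, then clip to bounds.

    sigma is the whole game: too small and the agent never discovers anything,
    too large and every rollout is a crash.
    """
    noise = rng.normal(0.0, sigma, size=actor_action.shape)
    return np.clip(actor_action + noise, act_low, act_high)

# Annealing sigma is one common compromise: broad exploration early,
# rollouts that stay close to the learned policy later.
rng = np.random.default_rng(0)
greedy = np.array([0.3, -0.7])
for step, sigma in enumerate(np.linspace(0.4, 0.05, num=5)):
    print(step, noisy_action(greedy, sigma, act_low=-1.0, act_high=1.0, rng=rng))
```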
The critic is supposed to estimate “how good an action is.”
But since it bootstraps its own predictions, it can start believing a fantasy world where actions the actor never actually takes look wonderful, and where values are checked against its own guesses instead of against the environment.
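The self-reference is easy to see in the critic’s training target: the value it regresses toward is built from a slow copy of itself, not from the environment. A sketch of that update step, assuming networks shaped like the earlier Actor/Critic snippet and a batch in the replay buffer’s order.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, critic_opt,
                  batch, gamma: float = 0.99):
    """One DDPG-style critic step: regress Q(s, a) toward a bootstrapped target."""
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # The "ground truth" is not ground truth at all: it is the target critic's
        # own guess about the action the target actor would take next.
        next_action = target_actor(next_state)
        target_q = reward + gamma * (1.0 - done) * target_critic(next_state, next_action)

    q = critic(state, action)       # what the online critic currently believes
    loss = F.mse_loss(q, target_q)  # pull belief toward the bootstrapped guess

    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```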
Continuous actions introduce a subtle new class of error: scale. Action ranges, observation magnitudes, reward units.
A bad scale doesn’t look like a crash.
It looks like “training is unstable.”
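The scale bug that scares me most is the silent one between the network’s tanh output and the environment’s action range, so I now make that mapping explicit with a tiny helper (my own convention, nothing standard):

```python
import numpy as np

def scale_action(tanh_action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map an action in [-1, 1] (the tanh output) to the environment's [low, high] range."""
    return low + 0.5 * (tanh_action + 1.0) * (high - low)

def unscale_action(env_action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Inverse map, e.g. before comparing stored actions with fresh actor outputs."""
    return 2.0 * (env_action - low) / (high - low) - 1.0

# If the env expects torques in [-2, 2] and you silently send [-1, 1], nothing
# crashes -- the agent just controls a weaker robot than the one you think you have.
low, high = np.array([-2.0]), np.array([2.0])
print(scale_action(np.array([0.5]), low, high))    # [1.0]
print(unscale_action(np.array([1.0]), low, high))  # [0.5]
```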
This month I started treating RL runs like production systems with a monitoring suite.
Here’s what I care about.
Reward curve (but with suspicion)
I watch average episode return, but I don’t trust it alone. Spikes can be luck; plateaus can hide improvement.
Critic loss + Q magnitude
If the critic loss explodes or Q-values drift to absurd magnitudes, the agent is learning a story, not a policy.
Action stats
Mean, std, clipping rate. If actions saturate at bounds early, the actor is stuck in “panic mode.”
Replay buffer sanity
What’s the distribution of rewards in the buffer? Is it dominated by failure? Is it too stale?
And because I’m coming from deep learning, I also keep an eye on the usual suspects: gradient norms, learning-rate sensitivity, and whether observations are actually normalized.
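Here is a sketch of what that monitoring boils down to in practice: a plain dictionary of numbers computed from the latest batch and the buffer. The field names and the 0.99 clipping threshold are my choices, not anything standard, and I log them alongside the reward curve so drift shows up next to the number I would otherwise be tempted to trust alone.

```python
import numpy as np
import torch

def training_diagnostics(actions: torch.Tensor,
                         q_values: torch.Tensor,
                         critic_loss: float,
                         buffer_rewards: np.ndarray,
                         act_limit: float = 1.0) -> dict:
    """Numbers I log every few hundred updates instead of trusting the reward curve alone."""
    acts = actions.detach().cpu().numpy()
    q = q_values.detach().cpu().numpy()
    return {
        # Action health: saturation at the bounds is the early "panic mode" signal.
        "action_mean": float(acts.mean()),
        "action_std": float(acts.std()),
        "action_clip_rate": float((np.abs(acts) > 0.99 * act_limit).mean()),
        # Critic health: drifting magnitudes mean the critic is telling itself a story.
        "q_mean": float(q.mean()),
        "q_max": float(q.max()),
        "critic_loss": float(critic_loss),
        # Buffer health: is the memory dominated by failure?
        "buffer_reward_mean": float(buffer_rewards.mean()),
        "buffer_reward_p90": float(np.percentile(buffer_rewards, 90)),
    }
```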
This month had fewer “clean lessons” and more recognizable failure shapes.
Here are the ones that stood out:
What it looks like: Q-values drift upward until nothing is grounded.
Likely cause: critic bootstrapping + distribution mismatch.
First check: Q min/mean/max + critic loss spikes.
What it looks like: nearly constant actions, saturating at bounds.
Likely cause: critic gives misleading gradients or action scaling is off.
First check: action mean/std + clipping rate + bounds.
What it looks like: reward climbs, then collapses and never recovers.
Likely cause: one destabilizing update or critic drift.
First check: value scale shift + action saturation + buffer staleness.
What it looks like: training becomes “learning to fail consistently.”
Likely cause: early experience dominates; not enough recovery data.
First check: reward distribution in buffer + age of samples.
What it looks like: too slow → stalls; too fast → instability.
Likely cause: target updates not tuned to environment dynamics.
First check: correlate performance changes with target update settings.
What it looks like: small scaling change flips “works” ↔ “dead.”
Likely cause: gradients become too large/small; critic becomes brittle.
First check: normalize observations; verify action ranges; sanity-check reward scale.
What it looks like: reward improves but behavior isn’t real control.
Likely cause: reward design exploit.
First check: short rollouts; verify the agent is doing the intended behavior.
I expected replay buffers to make training smoother.
Instead, they made training more sensitive: to how stale the buffer is, to how often I update, to how far the current policy has drifted from the data it’s learning from.
Off-policy feels like: “I can train more.”
In practice it feels like: “I can train more, and also be wrong more.”
This ties back to January’s theme: learning is an interface problem.
In continuous control, the interface is everywhere: action bounds, observation scaling, reward scale, the noise you inject.
Small interface mistakes don’t throw exceptions.
They produce policies that “kind of move” and never improve.
DDPG’s critic is powerful, but it’s also the component most likely to drift into fiction.
When DDPG breaks, it often breaks there first.
September takeaway
Off-policy is not “more efficient.”
It’s more conditional — it works only when your data and your policy stay aligned.
The fragile heart
When DDPG breaks, it usually breaks in the critic first.
So I treat critic sanity as my earliest warning light.
Next month I want to tackle a different kind of instability.
DDPG showed me how fragile learning becomes when updates are unconstrained and the data in the buffer drifts away from the policy that’s learning from it.
Now I want to study the opposite design philosophy:
What if we keep updates “trustworthy” by limiting how far the policy can move?
That’s where TRPO enters as the idea, and PPO/PPO2 enter as the engineering version I can actually run.
If September was “continuous control is seductive,”
October will be: “stability is a feature you have to design.”