
This month RL didn’t fail loudly. It failed quietly. Sparse rewards taught me the most brutal lesson yet - if nothing “good” happens, nothing gets learned — unless you rewrite what counts as experience.
October was about stability.
I learned to keep the policy from changing too fast. I learned to watch entropy and KL the way I used to watch gradients and loss curves. I learned that “boring training curves” are sometimes the best compliment you can give an RL run.
Then November arrived and stability didn’t matter.
Because nothing happened.
Not in the philosophical sense. In the literal sense: the reward stayed at zero, episode after episode, and the training curves never moved.
Sparse rewards are not an algorithmic challenge first.
They’re a feedback problem.
If the environment never tells you “that was good,” the agent has no gradient-shaped hint of what to repeat.
This month I hit the most honest wall in RL so far:
You can’t optimize a signal you never receive.
And that’s why HER (Hindsight Experience Replay) felt like a breakthrough — not because it makes things easier, but because it changes what “experience” even means.
The problem: sparse rewards mean the environment stays silent, so learning never starts.
The breakthrough: HER relabels goals so that the same trajectory produces a learning signal.
What to measure: success rate and distance-to-goal matter more than reward early on.
The engineer lesson: if the teacher won't talk, redesign the classroom; engineer the learning signal.
There are two kinds of RL pain:
Pain type #1, training instability: it learns… then collapses.
Pain type #2, training silence: learning never begins because reward never arrives.
Sparse reward tasks are pain type #2.
They don't explode.
They just sit there… flatline reward curves… forever.

HER is a shockingly simple idea:
If you didn’t achieve the goal you wanted, pretend the goal was what you actually achieved — and learn from that anyway.
It’s not lying. It’s reframing.
States, actions, next states: nothing changes about what happened.
But during replay you change the label of the goal, pretending the goal was the one you actually achieved, so the agent can learn: "if that is the goal, these actions work."
And slowly, this teaches competence and control, even before the "real" goal becomes frequent enough to learn from.
It works because it tackles the core problem:
rare reward means rare learning.
HER manufactures denser learning signals from the same data.
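To make that concrete, here is a minimal sketch of the "final" relabeling strategy: relabel every transition in an episode with the goal the episode actually reached at its last step. The transition format, the `sparse_reward` function, and the 0/-1 reward convention are my own assumptions for illustration, not any particular library's API.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.05):
    # 0 if we ended up within eps of the goal, -1 otherwise (a common sparse convention)
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) <= eps else -1.0

def her_relabel_final(episode):
    """Hindsight relabeling, 'final' strategy.

    `episode` is a list of dicts with keys:
        'obs', 'action', 'next_obs', 'achieved_goal', 'desired_goal', 'reward'
    Returns extra transitions in which the desired goal has been replaced
    by the goal the episode actually achieved at its final step.
    """
    final_achieved = episode[-1]["achieved_goal"]   # what the agent really did
    relabeled = []
    for t in episode:
        new = dict(t)                               # states and actions are untouched
        new["desired_goal"] = final_achieved        # only the goal label changes
        # Recompute the reward against the substituted goal. The last step of
        # the episode now "succeeds", so a learning signal finally exists.
        new["reward"] = sparse_reward(t["achieved_goal"], final_achieved)
        relabeled.append(new)
    return relabeled
```

Both the original and the relabeled transitions go into the same replay buffer; an off-policy learner (HER is usually paired with something like DDPG) can't tell the difference, which is exactly the point.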
In supervised learning, labels are sacred.
In RL with HER, labels become flexible.
It was the first time I felt RL stepping away from “learn from truth” and into something more engineering-like:
learn from whatever produces a useful learning signal.
That sentence is dangerously easy to misuse.
But in sparse reward settings, it’s survival.
Silence vs difficulty: hard tasks still give feedback; silent tasks don't.
Goal-conditioned thinking: treat the goal as an input, not a fixed objective (see the sketch after this list).
Why HER works: relabeling makes learning signals dense enough for competence to emerge.
Debug habits: track success + distance + replay composition, not just return.
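What "goal as an input" means mechanically: the goal vector is concatenated to the observation (and the action, for a critic) before it enters the network. A minimal PyTorch sketch with made-up layer sizes; the structure, not the sizes, is the point.

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, g, a): the goal is just another input, not a fixed objective."""

    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, goal, action):
        # Swapping the goal at replay time is now just swapping one input tensor.
        return self.net(torch.cat([obs, goal, action], dim=-1))
```

Once the goal is part of the input, relabeling stops feeling like a hack: you are just generating more (input, target) pairs for a function the network was already supposed to learn.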
This month I spent time in environments where the structure is: reach a specific goal state, with zero reward on every step unless you actually get there.
The difficulty isn’t the dynamics.
The difficulty is that random behavior almost never produces a “win.”
So my experiments were about contrasts: the same agent on the same task, with and without hindsight relabeling.
For sparse rewards, reward curves are almost useless early on.
So I started tracking:
Success rate: the real progress signal in sparse tasks (logging sketch after this list).
Distance-to-goal: shows shaping of behavior before success is common.
Replay composition: how many "successful" transitions exist after relabeling?
Learning health: entropy and value/Q stability tell me whether learning is real.
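Here is roughly how I compute the first three from evaluation episodes. The episode format, the `success_threshold`, and the 0/-1 reward convention are assumptions carried over from the earlier sketches, not a library convention.

```python
import numpy as np

def eval_metrics(episodes, success_threshold=0.05):
    """Success rate and final distance-to-goal over a batch of evaluation episodes.

    Each episode is a list of dicts with 'achieved_goal' and 'desired_goal' arrays.
    """
    successes, final_dists = [], []
    for ep in episodes:
        last = ep[-1]
        dist = float(np.linalg.norm(last["achieved_goal"] - last["desired_goal"]))
        final_dists.append(dist)
        successes.append(float(dist <= success_threshold))
    return {
        "success_rate": float(np.mean(successes)),           # the real progress signal
        "mean_final_distance": float(np.mean(final_dists)),  # shaping before success is common
    }

def replay_positive_fraction(buffer_rewards):
    """Share of stored transitions whose (possibly relabeled) reward is non-negative.

    With a 0/-1 sparse reward this is the fraction of 'successes' that
    relabeling has manufactured inside the buffer.
    """
    return float(np.mean(np.asarray(buffer_rewards) >= 0.0))
```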
Here’s my running catalog from this month.
Symptom: success rate stays at 0 forever.
Likely cause: exploration never produces a reachable achieved goal.
First check: is the goal reachable by random behavior at all? Log the achieved-goal distribution (automated in the sketch after this catalog).
Symptom: smoother motion / stability improves, but success doesn’t.
Likely cause: relabeling teaches “do something coherent” but doesn’t connect to the real goal.
First check: does distance-to-goal trend down even when success is rare?
Symptom: learning looks noisy and directionless.
Likely cause: goals don’t match env logic / reward can’t evaluate arbitrary goal-state pairs.
First check: verify goal representation + reward(goal, achieved) is consistent.
Symptom: competence for easy goals, no transfer to real goal.
Likely cause: replay overrepresents “easy positives.”
First check: replay goal distribution; measure transfer success, not just relabeled success.
Symptom: learning stalls as behavior drifts; old data stops helping.
Likely cause: buffer is stale; exploration collapses.
First check: entropy trend + sample age in buffer.
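The first check in that catalog, "is the goal reachable by random behavior at all?", is cheap to automate. A sketch, assuming dict observations with 'achieved_goal' and 'desired_goal' keys and the classic 4-tuple gym step API; if your environment differs, only the plumbing changes.

```python
import numpy as np

def achieved_goal_spread(env, episodes=100, horizon=50):
    """Roll out a purely random policy and log where it actually ends up.

    If the desired goals never land anywhere near the cloud of achieved goals,
    no amount of algorithm tuning will produce a first success.
    """
    achieved, desired = [], []
    for _ in range(episodes):
        obs = env.reset()
        desired.append(obs["desired_goal"])
        for _ in range(horizon):
            obs, _, done, _ = env.step(env.action_space.sample())
            if done:
                break
        achieved.append(obs["achieved_goal"])
    gaps = np.linalg.norm(np.array(achieved) - np.array(desired), axis=-1)
    # Median gap: how far random behavior typically ends from the requested goal.
    # Min gap: whether even a lucky rollout ever gets close at all.
    return {"median_gap": float(np.median(gaps)), "min_gap": float(np.min(gaps))}
```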
This month made me confront something fundamental:
In RL, the environment is your teacher.
If the teacher never speaks, you have to redesign the classroom.
HER is one of the first tools I’ve seen that treats experience itself as something you can engineer.
And this connects directly back to my 2017 deep learning lesson:
Training stability isn’t luck. It’s design.
In November, the equivalent is:
Learning signal isn’t guaranteed. It’s design too.
November takeaway: sparse rewards don't break training; they prevent training from ever starting.
What HER taught me: experience isn't just collected. In RL, experience can be engineered.
It’s not “a better optimizer.”
It’s an intervention at the interface between what the environment gives you and what the learner actually trains on.
Which is exactly the kind of next-level engineering concept I’m here for.
In dense reward settings, you can mistake “improvement” for understanding.
In sparse reward settings, you can’t.
Either you solve it, or you don’t.
This month I learned to celebrate tiny things: the first non-zero success rate, a distance-to-goal curve that finally trends down, a replay buffer that contains something other than failure.
That’s what sparse reward progress looks like.
November taught me how to learn from failure.
Which sets up a weirdly natural next question:
what happens when the “experience” isn’t random at all… but expert?
December is about imitation learning — specifically GAIL — and the strangely uncomfortable feeling of watching an agent learn by copying.
Not optimizing reward.
Not exploring.
Just… absorbing behavior.
And I suspect it’s going to challenge my intuition in a whole new way.