
This month RL didn’t fail loudly. It failed quietly. Sparse rewards taught me the most brutal lesson yet - if nothing “good” happens, nothing gets learned — unless you rewrite what counts as experience.
October was about stability.
I learned to keep the policy from changing too fast. I learned to watch entropy and KL the way I used to watch gradients and loss curves. I learned that “boring training curves” are sometimes the best compliment you can give an RL run.
Then November arrived and stability didn’t matter.
Because nothing happened.
Not in the philosophical sense. In the literal sense: the reward stayed at zero, episode after episode, and the training curves never moved.
Sparse rewards are not an algorithmic challenge first.
They’re a feedback problem.
If the environment never tells you “that was good,” the agent has no gradient-shaped hint of what to repeat.
This month I hit the most honest wall in RL so far:
You can’t optimize a signal you never receive.
And that’s why HER (Hindsight Experience Replay) felt like a breakthrough — not because it makes things easier, but because it changes what “experience” even means.
The problem: sparse rewards mean the environment stays silent, so learning never starts.
The breakthrough: HER relabels goals so that the same trajectory produces a learning signal.
What to measure: success rate and distance-to-goal matter more than reward early on.
The engineer lesson: if the teacher won't talk, redesign the classroom; engineer the learning signal.
There are two kinds of RL pain:
Pain type #1, training instability: it learns… then collapses.
Pain type #2, training silence: learning never begins because reward never arrives.
Sparse reward tasks are pain type #2.
They don't explode.
They just sit there… flatline reward curves… forever.

HER is a shockingly simple idea:
If you didn’t achieve the goal you wanted, pretend the goal was what you actually achieved — and learn from that anyway.
It’s not lying. It’s reframing.
States, actions, next states: nothing changes about what happened.
But during replay you change the label of the goal, pretending the goal was the one you actually achieved, so the agent can learn: "if that is the goal, these actions work."
And slowly, this teaches competence and control, even before the "real" goal becomes frequent enough to learn from.
It works because it tackles the core problem:
rare reward means rare learning.
HER manufactures denser learning signals from the same data.
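To make that concrete, here is a minimal sketch of the "final" relabeling strategy: relabel every transition in an episode with the goal the episode actually reached at its last step. The transition format, the `sparse_reward` function, and the 0/-1 reward convention are my own assumptions for illustration, not any particular library's API.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.05):
    # 0 if we ended up within eps of the goal, -1 otherwise (a common sparse convention)
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) <= eps else -1.0

def her_relabel_final(episode):
    """Hindsight relabeling, 'final' strategy.

    `episode` is a list of dicts with keys:
        'obs', 'action', 'next_obs', 'achieved_goal', 'desired_goal', 'reward'
    Returns extra transitions in which the desired goal has been replaced
    by the goal the episode actually achieved at its final step.
    """
    final_achieved = episode[-1]["achieved_goal"]   # what the agent really did
    relabeled = []
    for t in episode:
        new = dict(t)                               # states and actions are untouched
        new["desired_goal"] = final_achieved        # only the goal label changes
        # Recompute the reward against the substituted goal. The last step of
        # the episode now "succeeds", so a learning signal finally exists.
        new["reward"] = sparse_reward(t["achieved_goal"], final_achieved)
        relabeled.append(new)
    return relabeled
```

Both the original and the relabeled transitions go into the same replay buffer; an off-policy learner (HER is usually paired with something like DDPG) can't tell the difference, which is exactly the point.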
In supervised learning, labels are sacred.
In RL with HER, labels become flexible.
It was the first time I felt RL stepping away from “learn from truth” and into something more engineering-like:
learn from whatever produces a useful learning signal.
That sentence is dangerously easy to misuse.
But in sparse reward settings, it’s survival.
Silence vs difficulty: hard tasks still give feedback; silent tasks don't.
Goal-conditioned thinking: treat the goal as an input, not a fixed objective (see the sketch after this list).
Why HER works: relabeling makes learning signals dense enough for competence to emerge.
Debug habits: track success + distance + replay composition, not just return.
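What "goal as an input" means mechanically: the goal vector is concatenated to the observation (and the action, for a critic) before it enters the network. A minimal PyTorch sketch with made-up layer sizes; the structure, not the sizes, is the point.

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, g, a): the goal is just another input, not a fixed objective."""

    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, goal, action):
        # Swapping the goal at replay time is now just swapping one input tensor.
        return self.net(torch.cat([obs, goal, action], dim=-1))
```

Once the goal is part of the input, relabeling stops feeling like a hack: you are just generating more (input, target) pairs for a function the network was already supposed to learn.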
This month I spent time in environments where the structure is: reach a specific goal state, with zero reward on every step unless you actually get there.
The difficulty isn’t the dynamics.
The difficulty is that random behavior almost never produces a “win.”
So my experiments were about contrasts: the same agent on the same task, with and without hindsight relabeling.
For sparse rewards, reward curves are almost useless early on.
So I started tracking:
Success rate: the real progress signal in sparse tasks (logging sketch after this list).
Distance-to-goal: shows shaping of behavior before success is common.
Replay composition: how many "successful" transitions exist after relabeling?
Learning health: entropy and value/Q stability tell me whether learning is real.
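Here is roughly how I compute the first three from evaluation episodes. The episode format, the `success_threshold`, and the 0/-1 reward convention are assumptions carried over from the earlier sketches, not a library convention.

```python
import numpy as np

def eval_metrics(episodes, success_threshold=0.05):
    """Success rate and final distance-to-goal over a batch of evaluation episodes.

    Each episode is a list of dicts with 'achieved_goal' and 'desired_goal' arrays.
    """
    successes, final_dists = [], []
    for ep in episodes:
        last = ep[-1]
        dist = float(np.linalg.norm(last["achieved_goal"] - last["desired_goal"]))
        final_dists.append(dist)
        successes.append(float(dist <= success_threshold))
    return {
        "success_rate": float(np.mean(successes)),           # the real progress signal
        "mean_final_distance": float(np.mean(final_dists)),  # shaping before success is common
    }

def replay_positive_fraction(buffer_rewards):
    """Share of stored transitions whose (possibly relabeled) reward is non-negative.

    With a 0/-1 sparse reward this is the fraction of 'successes' that
    relabeling has manufactured inside the buffer.
    """
    return float(np.mean(np.asarray(buffer_rewards) >= 0.0))
```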
Here’s my running catalog from this month.
Symptom: success rate stays at 0 forever.
Likely cause: exploration never produces a reachable achieved goal.
First check: is the goal reachable by random behavior at all? Log the achieved-goal distribution (automated in the sketch after this catalog).
Symptom: smoother motion / stability improves, but success doesn’t.
Likely cause: relabeling teaches “do something coherent” but doesn’t connect to the real goal.
First check: does distance-to-goal trend down even when success is rare?
Symptom: learning looks noisy and directionless.
Likely cause: goals don’t match env logic / reward can’t evaluate arbitrary goal-state pairs.
First check: verify goal representation + reward(goal, achieved) is consistent.
Symptom: competence for easy goals, no transfer to real goal.
Likely cause: replay overrepresents “easy positives.”
First check: replay goal distribution; measure transfer success, not just relabeled success.
Symptom: learning stalls as behavior drifts; old data stops helping.
Likely cause: buffer is stale; exploration collapses.
First check: entropy trend + sample age in buffer.
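The first check in that catalog, "is the goal reachable by random behavior at all?", is cheap to automate. A sketch, assuming dict observations with 'achieved_goal' and 'desired_goal' keys and the classic 4-tuple gym step API; if your environment differs, only the plumbing changes.

```python
import numpy as np

def achieved_goal_spread(env, episodes=100, horizon=50):
    """Roll out a purely random policy and log where it actually ends up.

    If the desired goals never land anywhere near the cloud of achieved goals,
    no amount of algorithm tuning will produce a first success.
    """
    achieved, desired = [], []
    for _ in range(episodes):
        obs = env.reset()
        desired.append(obs["desired_goal"])
        for _ in range(horizon):
            obs, _, done, _ = env.step(env.action_space.sample())
            if done:
                break
        achieved.append(obs["achieved_goal"])
    gaps = np.linalg.norm(np.array(achieved) - np.array(desired), axis=-1)
    # Median gap: how far random behavior typically ends from the requested goal.
    # Min gap: whether even a lucky rollout ever gets close at all.
    return {"median_gap": float(np.median(gaps)), "min_gap": float(np.min(gaps))}
```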
This month made me confront something fundamental:
In RL, the environment is your teacher.
If the teacher never speaks, you have to redesign the classroom.
HER is one of the first tools I’ve seen that treats experience itself as something you can engineer.
And this connects directly back to my 2017 deep learning lesson:
Training stability isn’t luck. It’s design.
In November, the equivalent is:
Learning signal isn’t guaranteed. It’s design too.
November takeaway: sparse rewards don't break training; they prevent training from ever starting.
What HER taught me: experience isn't just collected. In RL, experience can be engineered.
It’s not “a better optimizer.”
It’s an intervention at the interface between what the environment gives you and what the learner actually trains on.
Which is exactly the kind of next-level engineering concept I’m here for.
In dense reward settings, you can mistake “improvement” for understanding.
In sparse reward settings, you can’t.
Either you solve it, or you don’t.
This month I learned to celebrate tiny things: the first non-zero success rate, a distance-to-goal curve that finally trends down, a replay buffer that contains something other than failure.
That’s what sparse reward progress looks like.
November taught me how to learn from failure.
Which sets up a weirdly natural next question:
what happens when the “experience” isn’t random at all… but expert?
December is about imitation learning — specifically GAIL — and the strangely uncomfortable feeling of watching an agent learn by copying.
Not optimizing reward.
Not exploring.
Just… absorbing behavior.
And I suspect it’s going to challenge my intuition in a whole new way.