Apr 29, 2018 - 11 MIN READ
Function Approximation - The Day RL Stopped Being Stable


Tabular RL felt clean because you could see the truth in a table. The moment I replaced the table with a model, RL stopped being a neat algorithm and became a fragile system.

Axel Domingues


March was the last month where RL felt… polite.

Values were tables.
Policies were arrows on a grid.
Bugs were visible.

It was the last time I could point at a value estimate and say:

“Right. That’s wrong. I know why.”

April is the month that ends that comfort.

Because April is where I replace the table with a model.

And the day I did that, RL stopped being stable.

Not “a bit noisy.”
Not “takes longer to tune.”

Unstable.

Like: one small change can flip learning from progress to collapse.

If 2017 taught me that deep learning stability is engineering…

April taught me RL stability is engineering on hard mode.

The new contract

Replacing a table with a model turns each update into a global bet, not a local fact.

The triangle of pain

Bootstrapping + off-policy data + approximation
→ the recipe behind most instability.

What you lose (and must rebuild)

Tabular RL was inspectable.
Approximation kills debuggability unless you add instrumentation.

Practical takeaways

Failure modes to recognize early + a minimal checklist to make progress without self-deception.

Function approximation is the bridge between “toy RL” and “real RL.” It’s also the moment you inherit the full mess of RL:
  • bootstrapping
  • non-stationary data
  • feedback loops
  • and now, approximation error
This is where the reputation comes from.

The Big Shift: From “Exact” to “Approximate”

Tabular RL works because the world is small enough to store beliefs exactly:

  • one state → one cell in a table
  • one action → one column
  • learning updates a literal entry you can inspect

Function approximation changes the contract:

  • a state is no longer an address in a table
  • a state becomes an input
  • a model produces an estimate
  • updating one state can change estimates for many other states

This is the “generalization” benefit.

It’s also the instability source.

Because you stop learning facts and start learning a guess function.

The new contract (the part I had to internalize)
  • Tabular: “update here” means only this entry changes
  • Approximation: “update here” means my entire guess function changes a little
So every update is now a global bet, not a local fact.
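
Here’s the smallest sketch I could write of that contract change (NumPy, with a made-up random featurizer standing in for whatever represents state). The tabular update touches exactly one entry; the approximate update moves shared weights that every state reads from.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma = 16, 0.1, 0.99

# Tabular contract: "update here" changes exactly one entry.
V_table = np.zeros(n_states)

def tabular_td_update(s, r, s_next):
    target = r + gamma * V_table[s_next]
    V_table[s] += alpha * (target - V_table[s])   # only V_table[s] moves

# Approximation contract: "update here" nudges weights shared by ALL states.
n_features = 4
phi = rng.normal(size=(n_states, n_features))     # stand-in featurizer (made up)
w = np.zeros(n_features)                          # one weight vector, every state reads it

def v_hat(s):
    return phi[s] @ w

def approx_td_update(s, r, s_next):
    target = r + gamma * v_hat(s_next)            # target built from our own guess
    w += alpha * (target - v_hat(s)) * phi[s]     # any state whose features overlap
                                                  # with phi[s] just shifted too
```

Call `approx_td_update(3, 1.0, 4)` and then check `v_hat(12)`: it will generally have moved, even though state 12 was never visited. That’s the global bet.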

Why It Became Unstable So Fast

I kept bumping into the same triangle of pain:

  • Bootstrapping: learning from your own current estimates
  • Off-policy data: learning from behavior that isn’t the policy you’re evaluating
  • Function approximation: using a model to represent value instead of a table

Each of these can be fine alone.

But together?

They feel like a feedback loop that amplifies your mistakes.

I didn’t need to understand every theorem to feel it in my bones:

The system can start believing its own errors and then train itself harder on them.

That’s not a normal “bug.”
That’s a system failure mode.
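
To see the triangle in one place: all three ingredients fit inside one tiny update rule. Here’s a sketch of semi-gradient Q-learning with linear features (the featurizer and sizes are placeholders I made up), with each corner of the triangle labeled.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_features = 16, 4, 8
alpha, gamma = 0.1, 0.99

phi = rng.normal(size=(n_states, n_features))   # stand-in state features
w = np.zeros((n_features, n_actions))           # corner 3: approximation (shared weights)

def q_hat(s):
    return phi[s] @ w                            # value guesses for every action in s

def q_learning_update(s, a, r, s_next):
    # Corner 1 (bootstrapping): the target is built from our own current guesses.
    # Corner 2 (off-policy): we take the max over actions, regardless of what the
    # behavior policy actually did in s_next.
    target = r + gamma * q_hat(s_next).max()
    td_error = target - q_hat(s)[a]
    # Corner 3 (approximation): this step moves shared weights, so the guesses
    # that the NEXT target will be built from have just changed everywhere.
    w[:, a] += alpha * td_error * phi[s]
```

Each ingredient has settings where it’s known to behave; it’s the combination that drops the usual guarantees. That’s the feedback loop in code: the thing being chased and the thing doing the chasing are the same set of weights.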

This month I learned a painful RL truth:

Sometimes your agent isn’t failing because it’s “not smart enough.”

Sometimes it’s failing because your learning dynamics are literally unstable.


The Comfort I Lost: Values Were Debuggable

The hardest emotional shift wasn’t the complexity.

It was losing transparency.

In tabular RL:

  • if the agent avoids a state, I can check its value
  • I can see if it believes that state is dangerous
  • I can see how its belief changes over time

With a function approximator:

  • the “value table” is implicit
  • I can’t easily inspect all states
  • and when something goes wrong, the model can silently poison everything

This felt exactly like the shift from simple linear models to deep nets:

Once you stop being able to “see” the learned representation, you need instrumentation.

So I doubled down on logs and sanity checks again.


The Mental Picture That Helped

Tabular feels stable because updates are isolated

One state, one cell. If it’s wrong, it’s wrong there.

Approximation feels unstable because updates generalize

One update changes many predictions. If it’s wrong, it can be wrong everywhere.

Here’s the picture I kept using:

Tabular RL

Each state is a bucket.
You pour experience into the bucket.
Only that bucket changes.

Function approximation

All buckets are connected by rubber bands.
You pull one bucket up, and some other bucket shifts too.

That’s generalization.

It’s also why small updates can have unintended global effects.
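
The rubber-band picture is easy to reproduce with two states and two numbers (a toy I made up, nothing from a real environment):

```python
import numpy as np

# Two "buckets" connected by a shared feature (the rubber band).
phi = np.array([[1.0, 0.5],    # state A
                [0.0, 1.0]])   # state B: shares the second feature with A
w = np.zeros(2)
alpha = 0.5

def v(s):
    return phi[s] @ w

print(v(1))                          # 0.0  -> bucket B before the pull
w += alpha * (1.0 - v(0)) * phi[0]   # pull bucket A up toward a target of 1.0
print(v(1))                          # 0.25 -> bucket B moved, untouched and unvisited
```

Whether that pull is useful generalization or quiet contamination depends entirely on whether the shared feature reflects real shared structure.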


The Engineering Lesson: RL Has More Places to Lie

In supervised learning, if training loss goes down, it usually means something.

In RL, even before function approximation, reward can lie.

With function approximation, your diagnostics can lie too if you don’t control evaluation carefully.

Because performance can change due to:

  • a better policy
  • a worse policy that got lucky
  • a value function that became overconfident
  • exploration turning off (giving the illusion of improvement)
  • or the model collapsing into a degenerate behavior that still harvests some reward

So I started treating my plots as suspects instead of evidence.

A plot can improve for the wrong reason. Example illusions I now assume by default:
  • reward went up because exploration decreased
  • one seed got lucky and I stared at the hero curve
  • values became overconfident and pushed risky actions that sometimes pay
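
The habit that kills the first of those illusions: never read progress off the exploratory rollouts. Below is a minimal greedy-evaluation sketch; `env` and `q_hat` are hypothetical placeholders (I’m assuming a reset()/step() style interface that returns `(state, reward, done)`), not a specific library’s API.

```python
import numpy as np

def evaluate_greedy(env, q_hat, n_episodes=20):
    """Exploration OFF: act greedily w.r.t. q_hat and average the returns.
    Keeps the evaluation signal separate from whatever epsilon is doing
    during training. env / q_hat are assumed interfaces, not a real library."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = int(np.argmax(q_hat(state)))     # no epsilon, no sampling
            state, reward, done = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))
```

If the training-return curve goes up but this number doesn’t, I’m probably looking at the exploration schedule, not the policy.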

What I Focused On (Without Going Full Deep RL Yet)

This month wasn’t “deep RL month” yet.
It was “the day I learned why deep RL needs so many tricks.”

So I focused on three practical questions:

1) What does it mean to approximate value?

Not just “fit a model,” but:

  • what inputs represent state?
  • what does the model output represent?
  • what does an update actually do to the space of states?
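
For me the first bullet turned out to decide the rest: the featurizer is what determines how an update spreads. With one-hot features, linear approximation is literally the table again; anything richer makes states share parameters. A toy sketch (both featurizers are made up):

```python
import numpy as np

n_states, alpha, s, target = 6, 0.5, 2, 1.0

def one_hot(state):
    # Each state gets its own weight: updates stay local (the table in disguise).
    x = np.zeros(n_states)
    x[state] = 1.0
    return x

def coarse(state):
    # Neighboring states share a feature: updates bleed into the neighbor.
    x = np.zeros(n_states)
    x[state] = 1.0
    x[(state + 1) % n_states] = 0.5
    return x

w = np.zeros(n_states)
w_one_hot = w + alpha * (target - w @ one_hot(s)) * one_hot(s)
w_coarse  = w + alpha * (target - w @ coarse(s))  * coarse(s)

print(w_one_hot)   # only index 2 moved: the table, rediscovered
print(w_coarse)    # indices 2 and 3 moved: the update generalized
```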

2) When does approximation help?

It helps when:

  • similar states genuinely share structure
  • the environment has smoothness you can exploit
  • you can’t visit every state often enough to fill a table

3) When does approximation hurt?

It hurts when:

  • rare states matter a lot (catastrophic failures)
  • the model generalizes in the wrong direction
  • training signal is noisy and bootstrapped
  • your behavior distribution shifts during learning

This month didn’t feel like learning one algorithm.

It felt like learning the fragility budget of RL.


Common Failure Modes I Hit (Or Could Reproduce Mentally)

Here are the failure patterns that kept showing up:

  • Chasing moving targets: the value you’re trying to learn changes as the policy changes
  • Catastrophic forgetting: learning a new region breaks estimates elsewhere
  • Overconfident values: the model predicts strong value for states it hasn’t earned the right to believe in
  • Evaluation confusion: performance looks better because exploration changed, not because policy improved
  • Divergence: estimates blow up or oscillate instead of settling
  • Silent collapse: policy becomes repetitive or “stuck,” reward plateaus, and nothing obvious explains why

The scary part:

Many of these failures don’t look like “errors.”

They look like “training.”


The Debugging Discipline I Started Building

I couldn’t inspect tables anymore, so I leaned harder into system-level instrumentation.

This is the checklist I began writing like a ritual:

Separate train vs eval

If evaluation uses exploration, I’m not evaluating—I'm sampling noise.

Track value statistics

Min/mean/max of predicted values. Sudden growth is a smell.

Watch action entropy

If the policy suddenly becomes deterministic early, it might be collapsing, not converging.

Multi-run sanity

If only one seed works, I assume the system is unstable, not “solved.”
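
In practice, “track value statistics” and “watch action entropy” became one small logging helper I call every few thousand steps. A sketch (the probe-set idea, the names, and the threshold are my own conventions, not a standard API):

```python
import numpy as np

def health_check(probe_q_values, action_counts, step, value_limit=1e3):
    """Cheap diagnostics for an approximate value learner.

    probe_q_values: predicted values on a FIXED set of probe states, so the
                    numbers are comparable from one check to the next.
    action_counts:  how often each action was taken since the last check.
    """
    q = np.asarray(probe_q_values, dtype=float)
    stats = {
        "step": step,
        "q_min": float(q.min()),
        "q_mean": float(q.mean()),
        "q_max": float(q.max()),
    }

    # Action entropy: near zero means the policy has gone (almost) deterministic.
    p = np.asarray(action_counts, dtype=float)
    p = p / max(p.sum(), 1e-12)
    stats["action_entropy"] = float(-(p * np.log(p + 1e-12)).sum())

    # A smell, not a verdict: sudden value growth often precedes divergence.
    if stats["q_max"] > value_limit:
        stats["warning"] = "value estimates growing suspiciously large"
    return stats
```

Multi-run sanity lives outside the helper: same config, several seeds, and I only trust a curve most of them reproduce.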

And the big cultural rule I’m adopting:

If RL “works” once, it doesn’t count yet. It counts when it works reliably.

Field Notes (What I’d Tell My Past Self)

  1. Function approximation isn’t an “upgrade.” It’s a new regime.
    Tabular methods are not “the same thing but smaller.” They’re fundamentally more stable because they isolate updates.
  2. Generalization is a double-edged sword.
    It can speed up learning dramatically… or spread the wrong belief everywhere.
  3. Instability isn’t failure. It’s information.
    If a method diverges, it’s telling you something about your learning dynamics.
  4. This is where RL stops being “an algorithm” and becomes “a system.”
    Which means my 2017 mindset applies: instrument, isolate, debug.

What’s Next

April was the bridge.

And now I’m standing at the entrance of deep RL.

Next month I’m doing the first deep RL algorithm that feels like it has a clear story:

Deep Q-Learning.

Because it’s the cleanest continuation from tabular Q-learning:

  • value-based learning
  • but now the Q-table is a neural net

The continuity into DQN

DQN is the same story as tabular Q-learning:

  • learn action values
  • pick actions that look best

But now the Q-function is a model. So every instability from April becomes relevant immediately.
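
As a preview of that continuity, here’s the smallest version of “the Q-table is now a neural net” I could write (a PyTorch sketch; the layer sizes are arbitrary, and it deliberately omits the pieces, replay buffer and target network, that DQN adds precisely because of April’s instabilities):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Replaces Q[s, a]: takes a state vector, returns one value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q = QNetwork(state_dim=4, n_actions=2)
optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)

def td_step(state, action, reward, next_state, done, gamma=0.99):
    """One semi-gradient Q-learning step: the same update as April,
    except the shared parameters are now a neural net's weights."""
    with torch.no_grad():                                 # bootstrap from the current net
        target = reward + gamma * (1.0 - done) * q(next_state).max()
    prediction = q(state)[action]
    loss = (prediction - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every failure mode from this post applies to that loop on day one, which is exactly why next month is about the tricks, not just the idea.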

If March was “RL at human scale,”
and April was “RL gets fragile,”

May is where I finally put my 2017 deep learning skills back on the table.


