
Tabular RL felt clean because you could see the truth in a table. The moment I replaced the table with a model, RL stopped being a neat algorithm and became a fragile system.
Axel Domingues
March was the last month where RL felt… polite.
Values were tables.
Policies were arrows on a grid.
Bugs were visible.
It was the last time I could point at a value estimate and say:
“Right. That’s wrong. I know why.”
April is the month that ends that comfort.
Because April is where I replace the table with a model.
And the day I did that, RL stopped being stable.
Not “a bit noisy.”
Not “takes longer to tune.”
Unstable.
Like: one small change can flip learning from progress to collapse.
If 2017 taught me that deep learning stability is engineering…
April taught me RL stability is engineering on hard mode.
The new contract
Replacing a table with a model turns each update into a global bet, not a local fact.
The triangle of pain
Bootstrapping + off-policy data + approximation → the recipe behind most instability.
What you lose (and must rebuild)
Tabular RL was inspectable.
Approximation kills debuggability unless you add instrumentation.
Practical takeaways
Failure modes to recognize early + a minimal checklist to make progress without self-deception.

Tabular RL works because the world is small enough to store beliefs exactly: one state, one entry, and an update touches only that entry.
Function approximation changes the contract: states share parameters, so one update moves predictions for many states at once.
This is the “generalization” benefit.
It’s also the instability source.
Because you stop learning facts and start learning a guess function.
I kept bumping into the same triangle of pain:
Bootstrapping. You learn from your own current estimates. If those estimates drift, your targets drift too, and errors can become self-reinforcing.
Off-policy data. Your data comes from one behavior, but your updates assume another. That mismatch is survivable in tabular settings, and explosive with approximation.
Function approximation. A single update changes predictions for many states. If the update direction is wrong, it doesn’t stay local; it contaminates regions you didn’t even visit.
Each of these can be fine alone.
But together?
They feel like a feedback loop that amplifies your mistakes.
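Here’s a tiny sketch of that loop, mine and not tied to any library: the classic two-state setup where two states share one weight, the TD target bootstraps from the model’s own prediction, and updates are sampled off-policy from a single transition. Watch the weight run away.

```python
# Minimal sketch: two states share one weight w, with linear values
# v(s1) = w and v(s2) = 2w, reward 0 everywhere, and semi-gradient TD(0)
# updates applied only on the s1 -> s2 transition. That skewed sampling is
# the off-policy ingredient: we never update from s2's point of view.

gamma = 0.99   # discount factor
alpha = 0.1    # step size
w = 1.0        # any nonzero starting weight

for step in range(200):
    v_s1 = 1.0 * w                          # prediction at s1 (feature = 1)
    v_s2 = 2.0 * w                          # prediction at s2 (feature = 2)
    td_error = 0.0 + gamma * v_s2 - v_s1    # bootstrapped target minus prediction
    w += alpha * td_error * 1.0             # semi-gradient update at s1
    if step % 40 == 0:
        print(f"step {step:3d}   w = {w:12.3f}")

# Each step multiplies w by (1 + alpha * (2 * gamma - 1)) > 1, so w grows
# without bound. Bootstrapping, off-policy sampling, and shared parameters:
# each survivable alone, divergent together.
```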
I didn’t need to understand every theorem to feel it in my bones:
The system can start believing its own errors and then train itself harder on them.
That’s not a normal “bug.”
That’s a system failure mode.
Sometimes your agent isn’t failing because it’s “not smart enough.”
Sometimes it’s failing because your learning dynamics are literally unstable.
The hardest emotional shift wasn’t the complexity.
It was losing transparency.
In tabular RL: I could open the table, point at a number, and argue with it.
With a function approximator: the same knowledge is smeared across weights I can’t read directly.
This felt exactly like the shift from simple linear models to deep nets:
Once you stop being able to “see” the learned representation, you need instrumentation.
So I doubled down on logs and sanity checks again.
Tabular feels stable because updates are isolated
One state, one cell. If it’s wrong, it’s wrong there.
Approximation feels unstable because updates generalize
One update changes many predictions. If it’s wrong, it can be wrong everywhere.
Here’s the picture I kept using:
In tabular RL, each state is a bucket.
You pour experience into the bucket.
Only that bucket changes.
With approximation, all the buckets are connected by rubber bands.
You pull one bucket up, and other buckets shift too.
That’s generalization.
It’s also why small updates can have unintended global effects.
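Here’s the same picture as a toy script, with made-up numbers and nothing from a real environment: one update applied to a table, then the same update applied through overlapping features.

```python
import numpy as np

n_states = 10
target_state, target_value, step_size = 4, 1.0, 0.5

# --- Buckets: a table. The update touches exactly one entry. ---
V_table = np.zeros(n_states)
V_table[target_state] += step_size * (target_value - V_table[target_state])
print("tabular deltas:", V_table)                  # only index 4 moved

# --- Rubber bands: overlapping features. The update pulls neighbours along. ---
features = np.array([[np.exp(-0.5 * (s - c) ** 2) for c in range(n_states)]
                     for s in range(n_states)])    # each state overlaps its neighbours
w = np.zeros(n_states)
before = features @ w
x = features[target_state]
w += step_size * (target_value - x @ w) * x        # one gradient step at state 4
after = features @ w
print("approx deltas:", np.round(after - before, 3))
# Every nearby state's prediction shifted: that is generalization, and the
# reason a wrong update doesn't stay local.
```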
In supervised learning, if training loss goes down, it usually means something.
In RL, even before function approximation, reward can lie.
With function approximation, your diagnostics can lie too if you don’t control evaluation carefully.
Because performance can change due to exploration noise, a lucky seed, or a shifting data distribution rather than real learning.
So I started treating my plots as suspects instead of evidence.
This month wasn’t “deep RL month” yet.
It was “the day I learned why deep RL needs so many tricks.”
So I focused on three practical questions:
When does approximation help, and when does it hurt?
What do the failure modes actually look like?
What instrumentation do I need to trust my own results?
Not just “fit a model,” but understand what the model does to the learning dynamics.
It helps when similar states genuinely deserve similar values, so experience in one place informs another.
It hurts when a wrong update in one place bleeds into states that needed a different answer.
This month didn’t feel like learning one algorithm.
It felt like learning the fragility budget of RL.
Here are the failure patterns that kept showing up:
Value estimates that grow steadily, then explode.
Policies that go deterministic early and stop exploring.
Runs that only work on one lucky seed.
Curves that look like learning but are really evaluation noise.
Many of these failures don’t look like “errors.”
They look like “training.”
I couldn’t inspect tables anymore, so I leaned harder into system-level instrumentation.
This is the checklist I began writing like a ritual:
Separate train vs eval
If evaluation uses exploration, I’m not evaluating—I'm sampling noise.
Track value statistics
Min/mean/max of predicted values. Sudden growth is a smell.
Watch action entropy
If the policy suddenly becomes deterministic early, it might be collapsing, not converging.
Multi-run sanity
If only one seed works, I assume the system is unstable, not “solved.”
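Here’s roughly what that ritual looks like as code. It’s a sketch only: the agent.q_values(state) method and the env.reset() / env.step(action) interface are placeholder assumptions, not any particular library’s API.

```python
import numpy as np

def value_stats(agent, probe_states):
    """Min/mean/max of predicted Q-values over a fixed probe set of states.
    Sudden growth across checkpoints is a smell."""
    q = np.array([agent.q_values(s) for s in probe_states])
    return float(q.min()), float(q.mean()), float(q.max())

def action_entropy(agent, probe_states):
    """Average entropy of a softmax over Q-values, a rough stand-in for the
    policy distribution. A drop toward 0 early in training suggests collapse."""
    ents = []
    for s in probe_states:
        q = agent.q_values(s)
        p = np.exp(q - q.max()); p /= p.sum()
        ents.append(-(p * np.log(p + 1e-12)).sum())
    return float(np.mean(ents))

def greedy_eval(agent, env, episodes=5):
    """Evaluate with exploration OFF; otherwise you are sampling noise.
    Assumes a reset()/step() environment returning (state, reward, done, info)."""
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = int(np.argmax(agent.q_values(s)))
            s, r, done, _ = env.step(a)
            total += r
        returns.append(total)
    return float(np.mean(returns))
```

Run the same loop over several seeds and log all three numbers per checkpoint; a result that only survives one seed goes back in the “unstable” pile.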
And the big cultural rule I’m adopting: a result doesn’t count until it survives multiple seeds, a greedy evaluation, and an explanation I actually believe.
April was the bridge.
And now I’m standing at the entrance of deep RL.
Next month I’m doing the first deep RL algorithm that feels like it has a clear story:
Deep Q-Learning.
Because it’s the cleanest continuation from tabular Q-learning.
The continuity into DQN
DQN is the same story as tabular Q-learning: act, observe, and nudge Q(s, a) toward a bootstrapped target built from your own next-step estimates.
But now the Q-function is a model. So every instability from April becomes relevant immediately.
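A toy sketch of that continuity (made-up shapes, a linear stand-in for the network): the same bootstrapped target, first moving one table cell, then moving shared weights.

```python
import numpy as np

gamma, alpha = 0.99, 0.1

# --- Tabular Q-learning: the update touches exactly one cell ---
Q = np.zeros((5, 2))                        # 5 states, 2 actions
s, a, r, s2 = 0, 1, 1.0, 3                  # one illustrative transition
Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

# --- Approximated: the same target, but shared weights ---
def q_values(W, features):
    return features @ W                     # predictions for both actions

W = np.zeros((5, 2))                        # linear stand-in for a Q-network
phi = lambda state: np.eye(5)[state] + 0.1  # overlapping features, not one-hot

target = r + gamma * q_values(W, phi(s2)).max()   # bootstrapped from W itself
td_error = target - q_values(W, phi(s))[a]
W[:, a] += alpha * td_error * phi(s)              # one gradient step on shared weights

# Every state whose features overlap phi(s) now predicts differently for action a.
# Same update rule as tabular Q-learning, but it no longer stays local.
```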
If March was “RL at human scale,”
and April was “RL gets fragile,”
May is where I finally put my 2017 deep learning skills back on the table.
Why does approximation make RL unstable?
Because updating one state no longer affects only that state.
A model generalizes: a single update can change predictions in many places. That can spread mistakes, especially when the learning target is bootstrapped from the model’s own estimates.
Isn’t this the same instability I saw in deep learning in 2017?
It rhymes, but it’s worse.
Deep learning instability often comes from optimization dynamics on a fixed dataset. RL adds a moving target: the policy changes the data distribution while you train, and your targets can be based on your own current predictions.
Tabular RL is stable because it isolates updates.
The moment you approximate, you trade stability for generalization — and you need instrumentation to keep yourself honest.