
Tabular RL felt clean because you could see the truth in a table. The moment I replaced the table with a model, RL stopped being a neat algorithm and became a fragile system.
Axel Domingues
March was the last month where RL felt… polite.
Values were tables.
Policies were arrows on a grid.
Bugs were visible.
It was the last time I could point at a value estimate and say:
“Right. That’s wrong. I know why.”
April is the month that ends that comfort.
Because April is where I replace the table with a model.
And the day I did that, RL stopped being stable.
Not “a bit noisy.”
Not “takes longer to tune.”
Unstable.
Like: one small change can flip learning from progress to collapse.
If 2017 taught me that deep learning stability is engineering…
April taught me RL stability is engineering on hard mode.
The new contract
Replacing a table with a model turns each update into a global bet, not a local fact.
The triangle of pain
Bootstrapping + off-policy data + approximation → the recipe behind most instability.
What you lose (and must rebuild)
Tabular RL was inspectable.
Approximation kills debuggability unless you add instrumentation.
Practical takeaways
Failure modes to recognize early + a minimal checklist to make progress without self-deception.

Tabular RL works because the world is small enough to store beliefs exactly: one state, one entry, and an update touches only that entry.
Function approximation changes the contract: states share parameters, so one update moves predictions for many states at once.
This is the “generalization” benefit.
It’s also the instability source.
Because you stop learning facts and start learning a guess function.
I kept bumping into the same triangle of pain:
Bootstrapping. You learn from your own current estimates. If those estimates drift, your targets drift too, and errors can become self-reinforcing.
Off-policy data. Your data comes from one behavior, but your updates assume another. That mismatch is survivable in tabular settings, and explosive with approximation.
Function approximation. A single update changes predictions for many states. If the update direction is wrong, it doesn’t stay local; it contaminates regions you didn’t even visit.
Each of these can be fine alone.
But together?
They feel like a feedback loop that amplifies your mistakes.
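Here’s a tiny sketch of that loop, mine and not tied to any library: the classic two-state setup where two states share one weight, the TD target bootstraps from the model’s own prediction, and updates are sampled off-policy from a single transition. Watch the weight run away.

```python
# Minimal sketch: two states share one weight w, with linear values
# v(s1) = w and v(s2) = 2w, reward 0 everywhere, and semi-gradient TD(0)
# updates applied only on the s1 -> s2 transition. That skewed sampling is
# the off-policy ingredient: we never update from s2's point of view.

gamma = 0.99   # discount factor
alpha = 0.1    # step size
w = 1.0        # any nonzero starting weight

for step in range(200):
    v_s1 = 1.0 * w                          # prediction at s1 (feature = 1)
    v_s2 = 2.0 * w                          # prediction at s2 (feature = 2)
    td_error = 0.0 + gamma * v_s2 - v_s1    # bootstrapped target minus prediction
    w += alpha * td_error * 1.0             # semi-gradient update at s1
    if step % 40 == 0:
        print(f"step {step:3d}   w = {w:12.3f}")

# Each step multiplies w by (1 + alpha * (2 * gamma - 1)) > 1, so w grows
# without bound. Bootstrapping, off-policy sampling, and shared parameters:
# each survivable alone, divergent together.
```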
I didn’t need to understand every theorem to feel it in my bones:
The system can start believing its own errors and then train itself harder on them.
That’s not a normal “bug.”
That’s a system failure mode.
Sometimes your agent isn’t failing because it’s “not smart enough.”
Sometimes it’s failing because your learning dynamics are literally unstable.
The hardest emotional shift wasn’t the complexity.
It was losing transparency.
In tabular RL: I could open the table, point at a number, and argue with it.
With a function approximator: the same knowledge is smeared across weights I can’t read directly.
This felt exactly like the shift from simple linear models to deep nets:
Once you stop being able to “see” the learned representation, you need instrumentation.
So I doubled down on logs and sanity checks again.
Tabular feels stable because updates are isolated
One state, one cell. If it’s wrong, it’s wrong there.
Approximation feels unstable because updates generalize
One update changes many predictions. If it’s wrong, it can be wrong everywhere.
Here’s the picture I kept using:
In tabular RL, each state is a bucket.
You pour experience into the bucket.
Only that bucket changes.
With approximation, all the buckets are connected by rubber bands.
You pull one bucket up, and other buckets shift too.
That’s generalization.
It’s also why small updates can have unintended global effects.
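Here’s the same picture as a toy script, with made-up numbers and nothing from a real environment: one update applied to a table, then the same update applied through overlapping features.

```python
import numpy as np

n_states = 10
target_state, target_value, step_size = 4, 1.0, 0.5

# --- Buckets: a table. The update touches exactly one entry. ---
V_table = np.zeros(n_states)
V_table[target_state] += step_size * (target_value - V_table[target_state])
print("tabular deltas:", V_table)                  # only index 4 moved

# --- Rubber bands: overlapping features. The update pulls neighbours along. ---
features = np.array([[np.exp(-0.5 * (s - c) ** 2) for c in range(n_states)]
                     for s in range(n_states)])    # each state overlaps its neighbours
w = np.zeros(n_states)
before = features @ w
x = features[target_state]
w += step_size * (target_value - x @ w) * x        # one gradient step at state 4
after = features @ w
print("approx deltas:", np.round(after - before, 3))
# Every nearby state's prediction shifted: that is generalization, and the
# reason a wrong update doesn't stay local.
```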
In supervised learning, if training loss goes down, it usually means something.
In RL, even before function approximation, reward can lie.
With function approximation, your diagnostics can lie too if you don’t control evaluation carefully.
Because performance can change due to exploration noise, a lucky seed, or a shifting data distribution rather than real learning.
So I started treating my plots as suspects instead of evidence.
This month wasn’t “deep RL month” yet.
It was “the day I learned why deep RL needs so many tricks.”
So I focused on three practical questions:
When does approximation help, and when does it hurt?
What do the failure modes actually look like?
What instrumentation do I need to trust my own results?
Not just “fit a model,” but understand what the model does to the learning dynamics.
It helps when similar states genuinely deserve similar values, so experience in one place informs another.
It hurts when a wrong update in one place bleeds into states that needed a different answer.
This month didn’t feel like learning one algorithm.
It felt like learning the fragility budget of RL.
Here are the failure patterns that kept showing up:
Value estimates that grow steadily, then explode.
Policies that go deterministic early and stop exploring.
Runs that only work on one lucky seed.
Curves that look like learning but are really evaluation noise.
Many of these failures don’t look like “errors.”
They look like “training.”
I couldn’t inspect tables anymore, so I leaned harder into system-level instrumentation.
This is the checklist I began writing like a ritual:
Separate train vs eval
If evaluation uses exploration, I’m not evaluating—I'm sampling noise.
Track value statistics
Min/mean/max of predicted values. Sudden growth is a smell.
Watch action entropy
If the policy suddenly becomes deterministic early, it might be collapsing, not converging.
Multi-run sanity
If only one seed works, I assume the system is unstable, not “solved.”
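Here’s roughly what that ritual looks like as code. It’s a sketch only: the agent.q_values(state) method and the env.reset() / env.step(action) interface are placeholder assumptions, not any particular library’s API.

```python
import numpy as np

def value_stats(agent, probe_states):
    """Min/mean/max of predicted Q-values over a fixed probe set of states.
    Sudden growth across checkpoints is a smell."""
    q = np.array([agent.q_values(s) for s in probe_states])
    return float(q.min()), float(q.mean()), float(q.max())

def action_entropy(agent, probe_states):
    """Average entropy of a softmax over Q-values, a rough stand-in for the
    policy distribution. A drop toward 0 early in training suggests collapse."""
    ents = []
    for s in probe_states:
        q = agent.q_values(s)
        p = np.exp(q - q.max()); p /= p.sum()
        ents.append(-(p * np.log(p + 1e-12)).sum())
    return float(np.mean(ents))

def greedy_eval(agent, env, episodes=5):
    """Evaluate with exploration OFF; otherwise you are sampling noise.
    Assumes a reset()/step() environment returning (state, reward, done, info)."""
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = int(np.argmax(agent.q_values(s)))
            s, r, done, _ = env.step(a)
            total += r
        returns.append(total)
    return float(np.mean(returns))
```

Run the same loop over several seeds and log all three numbers per checkpoint; a result that only survives one seed goes back in the “unstable” pile.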
And the big cultural rule I’m adopting: a result doesn’t count until it survives multiple seeds, a greedy evaluation, and an explanation I actually believe.
April was the bridge.
And now I’m standing at the entrance of deep RL.
Next month I’m doing the first deep RL algorithm that feels like it has a clear story:
Deep Q-Learning.
Because it’s the cleanest continuation from tabular Q-learning.
The continuity into DQN
DQN is the same story as tabular Q-learning: act, observe, and nudge Q(s, a) toward a bootstrapped target built from your own next-step estimates.
But now the Q-function is a model. So every instability from April becomes relevant immediately.
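A toy sketch of that continuity (made-up shapes, a linear stand-in for the network): the same bootstrapped target, first moving one table cell, then moving shared weights.

```python
import numpy as np

gamma, alpha = 0.99, 0.1

# --- Tabular Q-learning: the update touches exactly one cell ---
Q = np.zeros((5, 2))                        # 5 states, 2 actions
s, a, r, s2 = 0, 1, 1.0, 3                  # one illustrative transition
Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

# --- Approximated: the same target, but shared weights ---
def q_values(W, features):
    return features @ W                     # predictions for both actions

W = np.zeros((5, 2))                        # linear stand-in for a Q-network
phi = lambda state: np.eye(5)[state] + 0.1  # overlapping features, not one-hot

target = r + gamma * q_values(W, phi(s2)).max()   # bootstrapped from W itself
td_error = target - q_values(W, phi(s))[a]
W[:, a] += alpha * td_error * phi(s)              # one gradient step on shared weights

# Every state whose features overlap phi(s) now predicts differently for action a.
# Same update rule as tabular Q-learning, but it no longer stays local.
```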
If March was “RL at human scale,”
and April was “RL gets fragile,”
May is where I finally put my 2017 deep learning skills back on the table.
Why does approximation make RL unstable?
Because updating one state no longer affects only that state.
A model generalizes: a single update can change predictions in many places. That can spread mistakes, especially when the learning target is bootstrapped from the model’s own estimates.
Isn’t this the same instability I saw in deep learning in 2017?
It rhymes, but it’s worse.
Deep learning instability often comes from optimization dynamics on a fixed dataset. RL adds a moving target: the policy changes the data distribution while you train, and your targets can be based on your own current predictions.
Tabular RL is stable because it isolates updates.
The moment you approximate, you trade stability for generalization — and you need instrumentation to keep yourself honest.