
Tabular RL is the last time reinforcement learning feels clean. Values become literal tables, planning becomes explicit, and the “aha” moments arrive fast.
Axel Domingues
February was brutally honest.
Bandits forced me to face the simplest RL truth:
you can’t learn without paying an exploration tax.
But bandits also have an escape hatch:
There’s no state. No dynamics. No long-term consequences.
So March is where RL starts to feel like reinforcement learning in the way people mean it: states, dynamics, and long-term consequences.
This is the month where value functions stop being abstract and become something I can literally print out and inspect.
And honestly?
The first time I ran value iteration it felt like cheating.
The problem shape
What an MDP is, in plain English — and why structure makes planning possible.
Why it feels like cheating
Value iteration isn’t trial-and-error.
It’s solving a world when the rules are known.
What “value” really means
State value, action value, and the intuition behind advantage-like thinking.
The first real forks
Monte Carlo vs TD.
SARSA vs Q-learning.
Where learning philosophy starts to matter.
In January, the RL loop felt like an interface.
In February, it felt like uncertainty.
In March, it finally felt like a structured problem.
The structure is simple to describe: a set of states, a set of actions, transition probabilities that say where each action tends to take you, and rewards along the way. That tuple is the MDP, the Markov Decision Process.
This structure is what makes “planning” possible.
And it’s also what makes the next year of Deep RL make sense:
Deep RL is mostly “tabular RL, but the table is too big to store.”
Values are tables. Policies are tables. Transitions are tables.
It’s RL at human scale.
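To make “everything is a table” literal, here’s a minimal sketch. The three-state MDP below is made up for illustration, not taken from any benchmark; the point is only that dynamics, rewards, values, and a policy are all plain arrays.

```python
# A tiny, made-up MDP stored as plain NumPy arrays.
import numpy as np

n_states, n_actions = 3, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]   # action 0 in state 0 mostly stays put
P[0, 1] = [0.0, 0.9, 0.1]
P[1, 0] = [0.1, 0.8, 0.1]
P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]   # state 2 is absorbing
P[2, 1] = [0.0, 0.0, 1.0]

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Values and policies are just more tables over the same states
V = np.zeros(n_states)               # state-value table
pi = np.zeros(n_states, dtype=int)   # one action per state
```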
I expected learning to feel like training: noisy, incremental, driven by experience.
But value iteration feels like this instead: sweep the table, apply the Bellman update, repeat until nothing changes.
It feels less like learning and more like compiling.
And that’s exactly the point:
Value iteration is not “learning from experience.”
It’s solving the environment when you already know its dynamics.
That helped me build a clean mental separation:
Planning
You know the environment’s rules.
You compute the best behavior directly.
Learning
You don’t know the rules.
You must discover good behavior through interaction.
Most of the time in real RL, we’re in the second bucket.
But starting with the first bucket made everything else less mysterious.
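And here’s the first bucket in code: a small value iteration sketch that assumes the P and R tables from the sketch above and sweeps the Bellman optimality backup until the value table stops changing.

```python
# Value iteration under known dynamics: no experience, just computation.
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)   # values and the greedy policy

# Using the P, R tables defined in the earlier sketch
V, policy = value_iteration(P, R)
```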
The word “value” is overloaded.
March is where I finally stopped treating it as a vague concept.
Here’s how it showed up clearly: the state value V(s) is a score for how good it is to be in a state, and the action value Q(s, a) is a score for how good it is to take a specific action from that state and behave well afterwards.
Once I saw values as simple scores attached to states and actions, RL became much easier to reason about.
This was the mental loop that clicked again and again:
A state isn’t “good” or “bad” by itself. It’s good or bad because of what it tends to lead to.
Value is your compressed belief about downstream consequences.
Given your current beliefs (values), choose the action that leads to the best future on average.
If values are wrong, the policy will be wrong — even if the agent is “trying.”
That’s it.
Everything else this month felt like different ways of making that loop work under constraints.
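The loop is short enough to write down directly. A sketch, assuming the beliefs live in a Q-table of shape (n_states, n_actions), such as the one value iteration produced above:

```python
# Beliefs are a table of scores; the policy is just an argmax over them.
import numpy as np

def greedy_policy(Q):
    return Q.argmax(axis=1)      # one action per state

def state_values(Q):
    return Q.max(axis=1)         # V(s) under that greedy choice

# If the scores in Q are wrong, the argmax is wrong too,
# no matter how hard the agent is "trying".
```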

FrozenLake is deceptively simple: a tiny grid where you walk from a start tile to a goal tile without falling into holes.
But it also adds the first real “welcome to RL” detail:
slipperiness.
You try to move up, and the world says, “maybe.”
That one detail was enough to teach me two things: good behavior has to hold up on average, not just along the path you intended, and even an optimal policy will sometimes fail through no fault of its own.
FrozenLake is small enough to reason about…
…but not so clean that it becomes fake.
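Here’s a tiny experiment to feel that “maybe”, using Gymnasium’s FrozenLake-v1 (actions: 0=left, 1=down, 2=right, 3=up); the trial loop and seeds are just my way of making the randomness visible.

```python
# Ask for the same move from the same start state and watch where you land.
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)

for trial in range(5):
    obs, info = env.reset(seed=trial)
    obs, reward, terminated, truncated, info = env.step(2)  # try to go right
    print(f"trial {trial}: asked to go right, landed in state {obs}")
```

Helpfully, FrozenLake also exposes its transition table as env.unwrapped.P, so the same tiny world works for both buckets: planning against known dynamics, and learning from sampled steps.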
This was the big conceptual fork:
Monte Carlo: “You learn by seeing full outcomes.”
This feels intuitive… and slow.
Temporal Difference (TD): “You learn while things are still unfolding.”
This feels less intuitive… but it feels like the beginning of modern RL.
This month I started to understand TD as a powerful engineering tradeoff: update sooner by bootstrapping from your own current estimates, and accept a little mess in exchange.
And the “mess” is exactly what later becomes instability in Deep RL.
MC feels “honest” because it waits for reality to finish speaking.
TD is an engineering trade: earlier updates, but now your estimates can chase their own errors.
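The two updates side by side make the trade concrete. This is a minimal sketch; alpha, gamma, and the episode format (a list of (state, reward) pairs) are illustrative choices, not anything canonical.

```python
# Two ways to nudge a value table toward "truth". V can be a NumPy array
# or a dict keyed by state; alpha is the step size, gamma the discount.

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait until the episode ends, then update each visited
    state toward the return that actually followed it."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(state, reward), ...]
        G = reward + gamma * G                # the full observed return from here
        V[state] += alpha * (G - V[state])

def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): update right away, toward a target built from the current
    estimate of the next state (bootstrapping). Use V[next_state] = 0
    when next_state is terminal."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```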
I’d heard “on-policy” and “off-policy” in January.
But it was still abstract.
March made it concrete in the simplest way: SARSA (on-policy) updates toward the action the agent actually takes next, exploration and all, while Q-learning (off-policy) updates toward the best action it could take.
That distinction matters because it changes what you consider “truth” during learning.
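Written as updates, the fork is a single term in the target. A minimal sketch assuming a NumPy Q-table of shape (n_states, n_actions); alpha and gamma are illustrative.

```python
# On-policy vs off-policy, reduced to the update target.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA (on-policy): the target uses the action the agent will
    actually take next, exploration noise included."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning (off-policy): the target assumes the best available
    action in the next state, whatever the agent actually does."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```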
I’m not going to pretend I mastered this in March.
But I did feel the difference, and suddenly a lot of later Deep RL design choices make more sense.
Even in tabular RL, I still hit real breakage.
Just… in simpler forms.
The failure mode I ran into most often was a mental one: you think you’re evaluating the learned policy, but you’re still injecting randomness through exploration.
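One way I guard against that trap, sketched here with a hypothetical helper (run_episode and its arguments are my own, not from any library): evaluation follows the Q-table greedily, with exploration switched off.

```python
# Keep learning-time behavior and evaluation-time behavior separate.
import numpy as np

def run_episode(env, Q, epsilon=0.0, rng=np.random.default_rng(0)):
    obs, info = env.reset()
    total_reward, done = 0.0, False
    while not done:
        if rng.random() < epsilon:             # exploration (learning time only)
            action = env.action_space.sample()
        else:                                   # pure greedy (evaluation time)
            action = int(np.argmax(Q[obs]))
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward

# Evaluate with epsilon=0.0; anything else and you're measuring
# "policy plus leftover exploration noise", not the policy itself.
```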
This month reinforced a theme from January:
If you don’t control the interface, you don’t control the story.
This is the part that made March feel like a gift.
In deep learning, if training is failing, you often stare at curves and guess.
In tabular RL, you can inspect the world: print the value table, read off the greedy policy state by state, and check whether the numbers match your intuition about the map.
It’s the last time RL feels this transparent.
And I’m grateful for it.
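Concretely, the whole “mind” of a tabular agent fits on one screen. A small sketch, assuming 4x4 FrozenLake; the V and Q tables here are placeholders to swap for whatever you actually learned or planned.

```python
# Print the agent's beliefs back onto the map.
import numpy as np

n_states, n_actions = 16, 4            # 4x4 FrozenLake
V = np.zeros(n_states)                 # state values (placeholder)
Q = np.zeros((n_states, n_actions))    # action values (placeholder)

print(np.round(V.reshape(4, 4), 2))    # the lake, as the agent scores it

# Greedy policy as arrows (FrozenLake actions: 0=left, 1=down, 2=right, 3=up)
arrows = np.array(list("<v>^"))[Q.argmax(axis=1)].reshape(4, 4)
print(arrows)
```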
This month gave me a map I want to carry forward:
The “table in my head” mental model
Deep RL is still trying to learn values and policies.
The only reason it looks different is that the “table” doesn’t fit in memory — so we approximate it with a model.
If I can’t explain what a deep agent is doing in tabular terms, I probably don’t understand it yet.
March made RL feel solvable.
But it also revealed the cliff I’m walking toward:
Tabular RL works because the state space is small enough to store beliefs as a table.
Real problems don’t fit in tables.
So next month I’m stepping into the uncomfortable middle ground:
function approximation.
Not deep networks yet — but the moment we replace a table with a model.
And if the pattern from 2017 holds…
that’s the moment stability starts to wobble.
Why did value iteration feel like cheating?
Because it’s closer to “solving” than “learning.”
If you know the environment dynamics, value iteration computes what good behavior looks like everywhere, without needing trial-and-error experience.
What’s the difference between Monte Carlo and TD?
Monte Carlo updates from full outcomes after an episode finishes.
Temporal Difference updates during the episode by bootstrapping from current estimates. TD feels less clean, but it’s the foundation for most modern RL methods.
What’s the takeaway from tabular RL?
That values and policies are not mysterious.
Deep RL is still learning the same objects — it’s just forced to approximate them because the “table” is too big.