Mar 25, 2018 - 10 MIN READ
Tabular RL - When Value Iteration Feels Like Cheating


Tabular RL is the last time reinforcement learning feels clean. Values become literal tables, planning becomes explicit, and the “aha” moments arrive fast.

Axel Domingues


February was brutally honest.

Bandits forced me to face the simplest RL truth:

you can’t learn without paying an exploration tax.

But bandits also have an escape hatch:

There’s no state. No dynamics. No long-term consequences.

So March is where RL starts to feel like reinforcement learning in the way people mean it:

  • actions change the state you end up in
  • rewards can be delayed
  • good behavior requires planning across multiple steps

This is the month where value functions stop being abstract, and become something I can literally print out and inspect.

And honestly?

The first time I ran value iteration, it felt like cheating.

The problem shape

What an MDP is, in plain English — and why structure makes planning possible.

Why it feels like cheating

Value iteration isn’t trial-and-error.
It’s solving a world when the rules are known.

What “value” really means

State value, action value, and the intuition behind advantage-like thinking.

The first real forks

Monte Carlo vs TD.
SARSA vs Q-learning.
Where learning philosophy starts to matter.


The “MDP” Moment: When the Problem Finally Has Shape

In January, the RL loop felt like an interface.
In February, it felt like uncertainty.

In March, it finally felt like a structured problem.

The structure is simple to describe:

  • there are states you can be in
  • there are actions you can take
  • actions move you to new states (often with randomness)
  • rewards show up along the way
  • the goal is to learn behavior that produces good long-term outcomes

This structure is what makes “planning” possible.

And it’s also what makes the next year of Deep RL make sense:

Deep RL is mostly “tabular RL, but the table is too big to store.”

This month is important because it’s the last time I can hold the whole learning system in my head.

Values are tables. Policies are tables. Transitions are tables.

It’s RL at human scale.
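
To make that literal, here is a minimal sketch of what the tables look like in code. The sizes are illustrative (a 4x4 grid with 4 actions), and the transition format simply mirrors how Gym's toy-text environments expose their dynamics.

```python
import numpy as np

# Illustrative sizes: a 4x4 grid world with 4 actions
n_states, n_actions = 16, 4

V = np.zeros(n_states)                   # state values: "how good is it to be here?"
Q = np.zeros((n_states, n_actions))      # action values: "how good is it to do this here?"
policy = np.zeros(n_states, dtype=int)   # one chosen action per state

# Transitions as a table too: P[s][a] = list of (prob, next_state, reward, done).
# (This mirrors how Gym's toy-text environments expose their dynamics.)
P = {s: {a: [(1.0, s, 0.0, False)] for a in range(n_actions)}
     for s in range(n_states)}
```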


Why Value Iteration Felt Like Cheating

I expected learning to feel like training:

  • trial and error
  • gradual improvement
  • noisy curves

But value iteration feels like this instead:

  • define the rules of the world
  • compute what “good” means everywhere
  • instantly extract the right behavior

It feels less like learning and more like compiling.

And that’s exactly the point:

Value iteration is not “learning from experience.”

It’s solving the environment when you already know its dynamics.

That helped me build a clean mental separation:

Planning

You know the environment’s rules.
You compute the best behavior directly.

Learning

You don’t know the rules.
You must discover good behavior through interaction.

Most of the time in real RL, we’re in the second bucket.

But starting with the first bucket made everything else less mysterious.
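
To make the planning bucket concrete, here is a minimal value iteration sketch. It assumes a transition table in the format above (P[s][a] is a list of (prob, next_state, reward, done) tuples); the discount and stopping tolerance are arbitrary choices of mine.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Planning, not learning: the rules (P) are known, so compute values directly."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One-step lookahead: the value of the best action from s
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # "Instantly extract the right behavior": act greedily on the converged values
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2] * (not done))
                           for p, s2, r, done in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return V, policy
```

No interaction, no exploration: the whole thing is a fixed-point computation, which is exactly why it feels like compiling rather than training.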


The Three Ways “Value” Showed Up This Month

The word “value” is overloaded.
March is where I finally stopped treating it as a vague concept.

Here’s how it showed up clearly:

  • State value: “How good is it to be here?”
  • Action value: “How good is it to do this here?”
  • Advantage-like thinking (even if I didn’t call it that yet): “Is this action better than my baseline expectation?”

Once I saw values as simple scores attached to states and actions, RL became much easier to reason about.
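
In code, those scores are nothing more than table lookups (the shapes below are just illustrative):

```python
import numpy as np

V = np.zeros(16)        # state value:  V[s]    -> "how good is it to be in s?"
Q = np.zeros((16, 4))   # action value: Q[s, a] -> "how good is it to do a in s?"

s, a = 5, 2
advantage = Q[s, a] - V[s]   # advantage-like score: is a better than my baseline for s?
```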


The Core Loop I Kept Repeating Mentally

This was the mental loop that clicked again and again:

A state is only meaningful through its futures

A state isn’t “good” or “bad” by itself. It’s good or bad because of what it tends to lead to.

Value is how you bring the future into the present

Value is your compressed belief about downstream consequences.

A policy is just a choice rule over actions

Given your current beliefs (values), choose the action that leads to the best future on average.

Learning is improving beliefs so the choice rule improves

If values are wrong, the policy will be wrong — even if the agent is “trying.”

That’s it.

Everything else this month felt like different ways of making that loop work under constraints.
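
The choice-rule part of that loop really is this small. A sketch of greedy action selection over a Q table, with the caveat from above baked in: if the beliefs are wrong, this rule is confidently wrong too.

```python
import numpy as np

def greedy_policy(Q):
    """A policy as a choice rule: given current beliefs (Q), pick the action
    whose estimated future is best in each state."""
    return np.argmax(Q, axis=1)
```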


The Environment That Made It Real: FrozenLake

FrozenLake is deceptively simple:

  • it looks like a tiny grid world
  • there’s a goal state
  • there are holes you fall into
  • you want to reach the goal

But it also adds the first real “welcome to RL” detail:

slipperiness.

You try to move up, and the world says, “maybe.”

That one detail was enough to teach me two things:

  • the best action is not always the one that looks best locally
  • learning needs to handle uncertainty in transitions, not just reward noise

FrozenLake is small enough to reason about…

…but not so clean that it becomes fake.

What I inspect in FrozenLake (tabular superpower)
  • a table of state values: do “near-hole” states look worse?
  • a policy-as-arrows view: do arrows point away from holes?
  • where randomness matters: do “safe-looking” moves still have risk?
  • termination handling: do hole states and the goal state end episodes correctly?
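
Here is roughly what that inspection looks like, assuming a 4x4 FrozenLake layout and a V/policy pair computed elsewhere (for example by the value iteration sketch above); the action order follows Gym's FrozenLake convention.

```python
import numpy as np

def show_frozenlake(V, policy, shape=(4, 4)):
    """Print state values as a grid and the policy as arrows
    (FrozenLake action order: 0=Left, 1=Down, 2=Right, 3=Up)."""
    arrows = np.array(['<', 'v', '>', '^'])
    print(np.round(V.reshape(shape), 3))   # do "near-hole" states look worse?
    print(arrows[policy].reshape(shape))   # do arrows point away from holes?

# With Gym installed, the known dynamics for planning are exposed as a table:
#   env = gym.make('FrozenLake-v0');  P = env.unwrapped.P
```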

Monte Carlo vs Temporal Difference: The First Real Fork in the Road

This was the big conceptual fork:

Monte Carlo (MC) thinking

“You learn by seeing full outcomes.”

  • you run an episode
  • you observe what happened
  • you update your estimates based on the total outcome

This feels intuitive… and slow.

Temporal Difference (TD) thinking

“You learn while things are still unfolding.”

  • you update based on partial progress
  • you bootstrap from your current value estimates

This feels less intuitive… but it feels like the beginning of modern RL.

This month I started to understand TD as a powerful engineering tradeoff:

Monte Carlo is clean but late.
Temporal Difference is messy but fast.

And the “mess” is exactly what later becomes instability in Deep RL.
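
Written as state-value updates, the fork is just two functions. This is a sketch, not a full agent: the episode format (a list of (state, reward) pairs where each reward follows its state) and the step size are assumptions of mine.

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait for the whole episode, then pull each visited state
    toward the return that actually followed it (clean, but late)."""
    G = 0.0
    for s, r in reversed(episode):      # episode = [(state, reward), ...]
        G = r + gamma * G
        V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s2, done, alpha=0.1, gamma=0.99):
    """TD(0): update right away, bootstrapping from the current estimate
    of the next state (messy, but fast)."""
    target = r + gamma * V[s2] * (not done)
    V[s] += alpha * (target - V[s])
```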


SARSA vs Q-learning: Where On-Policy Became Concrete

I’d heard “on-policy” and “off-policy” in January.

But it was still abstract.

March made it concrete in the simplest way:

  • SARSA learns about the behavior you actually follow
  • Q-learning learns about a best-case future behavior, even if you’re not behaving that way yet

That distinction matters because it changes what you consider “truth” during learning.

What each method treats as “truth”
  • SARSA: “I want to evaluate the world under the behavior I’m actually executing.”
  • Q-learning: “I want to evaluate the world under the best behavior I wish I were executing.”

Same environment. Different definition of what you’re estimating.
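
In code, the difference comes down to a single line: which value of the next state the target bootstraps from. A sketch, with names and termination handling of my own choosing:

```python
import numpy as np

def sarsa_target(Q, r, s2, a2, done, gamma=0.99):
    """SARSA: evaluate the behavior I actually follow, so the target uses
    the action a2 I really take next (exploration and all)."""
    return r + gamma * Q[s2, a2] * (not done)

def q_learning_target(Q, r, s2, done, gamma=0.99):
    """Q-learning: evaluate the best behavior I wish I were following, so the
    target uses the greedy action in s2, whatever I actually do next."""
    return r + gamma * np.max(Q[s2]) * (not done)

# Both then apply the same kind of update:
#   Q[s, a] += alpha * (target - Q[s, a])
```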

I’m not going to pretend I mastered this in March.

But I did feel the difference:

  • SARSA feels cautious and honest
  • Q-learning feels ambitious and sometimes overconfident

And suddenly a lot of later Deep RL design choices make more sense.


What Actually Broke (My Debugging Notes)

Even in tabular RL, I still hit real breakage.
Just… in simpler forms.

Here are the most common failure modes I ran into, mostly in my own thinking:

  • confusing “more updates” with “better learning”
  • forgetting that exploration changes what data you see
  • mixing up episode termination handling (again!)
  • assuming deterministic transitions when the env is stochastic
  • evaluating with exploration still on (and blaming the algorithm)

The biggest silent bug pattern in tabular RL:

You think you’re evaluating the learned policy, but you’re still injecting randomness through exploration.
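
My guard against that bug is boring but effective: a separate evaluation loop with exploration explicitly off. A sketch, assuming the classic Gym API where reset returns the state and step returns (obs, reward, done, info):

```python
import numpy as np

def evaluate_greedy(env, Q, n_episodes=100):
    """Evaluate the learned policy with exploration OFF: always act greedily on Q."""
    total = 0.0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s]))      # no epsilon here, on purpose
            s, r, done, _ = env.step(a)
            total += r
    return total / n_episodes
```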

This month reinforced a theme from January:

If you don’t control the interface, you don’t control the story.


The Most Important “Aha”: Values Are Debuggable

This is the part that made March feel like a gift.

In deep learning, if training is failing, you often stare at curves and guess.

In tabular RL, you can inspect the world:

  • you can print a value table and see if it makes sense
  • you can visualize a policy as arrows on a grid
  • you can see where the agent thinks it’s safe vs dangerous
  • you can see whether the agent’s beliefs match the environment

It’s the last time RL feels this transparent.

And I’m grateful for it.


The Mental Model I’m Taking Into Deep RL

This month gave me a map I want to carry forward:

The “table in my head” mental model

Deep RL is still trying to learn values and policies.

The only reason it looks different is that the “table” doesn’t fit in memory — so we approximate it with a model.

If I can’t explain what a deep agent is doing in tabular terms, I probably don’t understand it yet.


What’s Next

March made RL feel solvable.

But it also revealed the cliff I’m walking toward:

Tabular RL works because the state space is small enough to store beliefs as a table.

Real problems don’t fit in tables.

So next month I’m stepping into the uncomfortable middle ground:

function approximation.

Not deep networks yet — but the moment we replace a table with a model.

And if the pattern from 2017 holds…

that’s the moment stability starts to wobble.

