
Bandits strip RL down to one tension—explore vs exploit—so I can stop confusing luck with learning and start building real intuition.
Axel Domingues
January taught me something uncomfortable:
In reinforcement learning, the “dataset” is alive.
The policy changes what you see, which changes what you learn.
That’s powerful… and also a perfect recipe for self-deception.
So for February I wanted an RL problem that’s brutally honest.
Not elegant. Honest.
Something where I can’t hide behind “it’s complicated” when results look random.
That’s what bandits are.
What bandits remove: state, dynamics, long-term consequences.
What still remains: choosing actions under uncertainty to maximize reward.
Which means you’re forced to face the central RL tension directly:
exploration vs exploitation.
In supervised learning, if my model performs badly, I can usually reason my way toward the cause.
With RL I’m still learning what “bad performance” even means.
Bandits give me a clean, controlled setting where the only thing being tested is how a strategy handles uncertainty.
It’s the first RL problem where I feel I can develop trustworthy intuition.
This month I stopped seeing them as "toy" problems and started seeing them as the purest interface between decision-making and uncertainty.

A bandit is basically a row of slot machines: a handful of levers, each with its own hidden payout odds, and every pull hands you one reward and nothing else.
The catch is subtle but huge:
To learn which lever is best, you must spend pulls on levers that are probably worse.
That’s the “price” of learning.
And that price shows up as a practical engineering question:
How much uncertainty am I willing to tolerate while learning?
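To make the setup concrete, here is a minimal sketch of the kind of simulated bandit I used this month. The class name and the arm probabilities are placeholders of my own, not from any library; the only point is that the learner sees rewards, never the true probabilities.

```python
import random


class BernoulliBandit:
    """Toy sketch: arm i pays 1 with hidden probability probs[i], else 0."""

    def __init__(self, probs, seed=None):
        self.probs = list(probs)
        self.rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self.probs)

    def pull(self, arm):
        # The learner only ever sees this 0/1 reward, never the true probability.
        return 1 if self.rng.random() < self.probs[arm] else 0


# Example: three arms; the learner has no idea the last one is best.
bandit = BernoulliBandit([0.2, 0.5, 0.6], seed=0)
print([bandit.pull(2) for _ in range(10)])
```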
These are not "mistakes in code." They're predictable behaviors that show up when a strategy handles uncertainty badly, and if you can recognize them early, you can debug RL systems faster later.
This month I kept coming back to two failure patterns:
Failure 1: committing too early
You get a few lucky rewards early on from a mediocre arm, and you commit too soon.
This looks like "learning" at first.
Then later you realize the arm you locked onto was never the best one.
Failure 2: exploring forever
You keep sampling everything, never settling.
This looks like "being thorough."
But the cost is real: every pull spent re-checking an arm you already suspect is worse is reward you never get back.
The bandit problem is essentially the art of finding a good tradeoff between these.
If you don’t explicitly control exploration, you don’t control what you’re learning from.
I approached this like an engineering experiment: a handful of strategies, same conditions, compare behavior across runs.
Strategy 1: Pure random
A baseline to remind me what "no learning" looks like.
Strategy 2: Greedy
Always pick the current best-looking arm (and fail hilariously when early luck lies).
Strategy 3: Epsilon-greedy
Mostly exploit, sometimes explore. Simple, but surprisingly effective.
Strategy 4: Optimism / hopeful start
Start by assuming arms are good until proven otherwise, forcing early exploration.
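For reference, here is roughly what those four strategies looked like in my toy code. This is a sketch of my own implementation (the class name and defaults are mine, not from any library); one agent covers all four cases depending on how you set epsilon and the initial estimates.

```python
import random


class EpsilonGreedyAgent:
    """Toy sketch covering all four strategies, depending on its settings:

    epsilon=1.0                   -> pure random baseline
    epsilon=0.0                   -> greedy
    epsilon=0.1                   -> epsilon-greedy
    epsilon=0.0, initial_value=5  -> optimism / hopeful start
    """

    def __init__(self, n_arms, epsilon=0.1, initial_value=0.0, seed=None):
        self.epsilon = epsilon
        self.values = [initial_value] * n_arms   # current reward estimate per arm
        self.counts = [0] * n_arms               # how many times each arm was pulled
        self.rng = random.Random(seed)

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))   # explore
        best = max(self.values)
        ties = [i for i, v in enumerate(self.values) if v == best]
        return self.rng.choice(ties)                      # exploit, breaking ties randomly

    def update(self, arm, reward):
        # Incremental mean: estimate += (reward - estimate) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The run loop is always the same (select an arm, pull it, update the estimate); everything interesting lives in select_arm.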
I’m intentionally not going deep into algorithms yet.
The purpose of this month was to feel the behavior of strategies, not memorize names.
In supervised learning, “average loss over a batch” is usually a stable thing to monitor.
In bandits, early averages are dangerously noisy.
A single lucky reward can dominate early estimates.
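One way to feel this is to estimate a single arm from only five pulls, repeat that estimate many times, and look at how often a mediocre arm looks great. The 0.3 probability and the 0.6 threshold below are arbitrary numbers for illustration, nothing more.

```python
import random

rng = random.Random(42)
TRUE_P = 0.3     # the arm's real success probability
N_PULLS = 5      # how few pulls the early estimate is based on

estimates = []
for _ in range(10_000):
    rewards = [1 if rng.random() < TRUE_P else 0 for _ in range(N_PULLS)]
    estimates.append(sum(rewards) / N_PULLS)

# After only 5 pulls, a 0.3 arm looks like a 0.6+ arm a noticeable fraction of the time.
lucky = sum(e >= 0.6 for e in estimates) / len(estimates)
print(f"estimate >= 0.6 in {lucky:.0%} of trials")
```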
Don’t over-trust early reward
Early reward curves can look amazing for the wrong reason: luck.
Watch behavior change instead
This is what forced me to change how I think about progress.
Instead of asking "is the reward curve going up?", I started asking "is the action-selection behavior changing the way I'd expect?"
Even though bandits are small, they taught me to log like a grown-up.
Here's what I found most useful: logging which arm gets chosen (not just the reward it returns), and repeating every experiment across several seeds.
One run can look like a breakthrough. Five runs can reveal it was just luck.
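Concretely, the habit is a tiny harness: same bandit, same strategy, several seeds, and a per-run summary of behavior instead of one reward curve. This sketch assumes the BernoulliBandit and EpsilonGreedyAgent classes from the earlier snippets; the horizon and probabilities are just illustrative.

```python
def run_once(probs, epsilon, horizon, seed):
    """One complete experiment: returns total reward and how often the best arm was pulled."""
    bandit = BernoulliBandit(probs, seed=seed)
    agent = EpsilonGreedyAgent(bandit.n_arms, epsilon=epsilon, seed=seed)
    total_reward = 0
    for _ in range(horizon):
        arm = agent.select_arm()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        total_reward += reward
    best_arm = probs.index(max(probs))
    return total_reward, agent.counts[best_arm] / horizon


probs = [0.2, 0.5, 0.6]
for seed in range(5):  # five runs is my floor before I believe anything
    total, best_share = run_once(probs, epsilon=0.1, horizon=1000, seed=seed)
    print(f"seed={seed}  total_reward={total}  best_arm_share={best_share:.2f}")
```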
I knew greedy would struggle, but the failure was more dramatic than I expected.
It taught me that:
A learning system without exploration is not a learning system.
Epsilon-style exploration feels almost too simple to matter.
But it immediately improved behavior.
It made me less obsessed with fancy algorithms and more obsessed with how exploration is actually controlled.
The problem is earning the right to believe you’ve found the best arm.
That’s an epistemology problem, not just a math problem.
Bandits were my first RL problem where I could build a “discipline loop.”
This is the checklist I want to reuse as RL gets messier:
Control exploration explicitly instead of letting it happen by accident.
Don't trust early reward averages.
Watch behavior (which actions get picked), not just the reward curve.
Repeat every experiment across several seeds before believing anything.
So I’m making “anti-self-deception” part of the learning process.
March is where “real RL” begins.
Bandits have no state, no dynamics, no long-term consequences.
MDPs will add the next layer: my action changes what I see next, so now exploration affects the future dataset, not just today’s reward.
So next month I'm moving into tabular environments where my actions have consequences beyond the immediate reward.
In other words:
MDPs, value functions, and the first time RL starts to feel like learning over time.
Are bandits really RL?
Yes, just the simplest form.
They still involve choosing actions under uncertainty to maximize reward. What they remove is state and long-term planning, which makes the core exploration problem impossible to ignore.
Why not jump straight to deep RL?
Because deep RL adds extra failure modes (function approximation, non-stationary data, instability). Bandits let me develop the discipline of evaluation and exploration first, before the complexity explodes.
What surprised me most?
That early reward averages are incredibly misleading. I started trusting action-selection behavior and multi-run consistency more than a single reward curve.
How many runs are enough?
For tiny problems, I treat 5 runs as the minimum to detect "it was luck." If results are still unstable after five, I run more before drawing any conclusion.