Feb 25, 2018 - 8 MIN READ
Bandits - The First Honest RL Problem


Bandits strip RL down to one tension—explore vs exploit—so I can stop confusing luck with learning and start building real intuition.

Axel Domingues


January taught me something uncomfortable:

In reinforcement learning, the “dataset” is alive.
The policy changes what you see, which changes what you learn.

That’s powerful… and also a perfect recipe for self-deception.

So for February I wanted an RL problem that’s brutally honest.

Not elegant. Honest.

Something where I can’t hide behind “it’s complicated” when results look random.

That’s what bandits are.

What you’ll learn in this post
  • what a bandit problem is (in one mental picture)
  • what “exploration vs exploitation” means in practice
  • the 4 strategies I used and what each one is for
  • what to log so you don’t confuse luck with learning
  • the checklist I’m carrying into “real RL”

What bandits remove

  • no state (nothing to “navigate”)
  • no dynamics (your action doesn’t change the world)
  • no delayed reward (no long-term credit assignment)

What still remains

  • decisions under uncertainty
  • feedback via rewards
  • the need to explore to learn
  • the risk of locking onto a lucky illusion

Which means you’re forced to face the central RL tension directly:

exploration vs exploitation.


Why Bandits Felt Like the Right Second Step

In supervised learning, if my model performs badly, I can usually reason about:

  • data quality
  • model capacity
  • loss function
  • optimization settings

With RL I’m still learning what “bad performance” even means.

Bandits give me a clean controlled setting where:

  • “learning” means you choose the good option more often over time
  • “failure” means you either never discover the good option, or you discover it too late
  • randomness is present, but the problem is small enough to study carefully

It’s the first RL problem where I feel I can develop trustworthy intuition.

Bandits are often introduced as a toy problem.

But this month I stopped seeing them as “toy” and started seeing them as the purest interface between decision-making and uncertainty.


The Mental Model That Finally Clicked

A bandit is basically a slot machine row:

  • you have multiple levers (arms)
  • each lever has its own reward behavior
  • you don’t know which one is best
  • you only learn by pulling levers and seeing outcomes

The catch is subtle but huge:

To learn which lever is best, you must spend pulls on levers that are probably worse.

That’s the “price” of learning.

And that price shows up as a practical engineering question:

How much uncertainty am I willing to tolerate while learning?

Every exploratory pull is a pull you suspect is suboptimal. You pay this tax up front so that later you can exploit with confidence.
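
If you want the slot-machine picture in code, here is a minimal sketch of a Bernoulli bandit. This is my own illustration for the post, not the exact setup from my experiments; the class name and the arm probabilities are placeholders.

```python
import numpy as np

class BernoulliBandit:
    """A k-armed bandit: each arm pays 1 with a fixed hidden probability, else 0."""

    def __init__(self, success_probs, seed=None):
        self.success_probs = np.asarray(success_probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # The only feedback you ever get: one noisy reward for the arm you chose.
        return float(self.rng.random() < self.success_probs[arm])

# Three levers, one clearly better. The agent never gets to see these numbers.
bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
print([bandit.pull(2) for _ in range(5)])
```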

Exploration vs Exploitation (The Core Tension)

How to interpret the failure patterns

These are not “mistakes in code.” They’re predictable behaviors that show up when a strategy handles uncertainty badly.

If you can recognize these patterns early, you can debug RL systems faster later.

This month I kept coming back to two failure patterns:

Failure Pattern #1: Exploit too early

You get a few lucky rewards early on from a mediocre arm, and you commit too soon.

This looks like “learning” at first.

Then later you realize:

  • you never sampled the best arm enough
  • you locked in a local illusion

Failure Pattern #2: Explore forever

You keep sampling everything, never settling.

This looks like “being thorough.”

But the cost is real:

  • you keep paying the exploration tax
  • performance never stabilizes

The bandit problem is essentially the art of finding a good tradeoff between these.

Bandits taught me a rule I want to carry into all RL:

If you don’t explicitly control exploration, you don’t control what you’re learning from.


What I Tried (Conceptually)

I approached this like an engineering experiment: a handful of strategies, same conditions, compare behavior across runs.

Strategy 1: Pure random

A baseline to remind me what "no learning" looks like.

Strategy 2: Greedy

Always pick the current best-looking arm (and fail hilariously when early luck lies).

Strategy 3: Epsilon-greedy

Mostly exploit, sometimes explore. Simple, but surprisingly effective.

Strategy 4: Optimism / hopeful start

Start by assuming arms are good until proven otherwise, forcing early exploration.
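
To make the four strategies concrete, here is a rough sketch of them as action-selection rules over running value estimates. It's an illustration, not the exact experiment I ran; the arm probabilities, epsilon value, and optimistic constant are all placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_random(q_values, counts):
    # Strategy 1: ignore the estimates entirely.
    return int(rng.integers(len(q_values)))

def select_greedy(q_values, counts):
    # Strategy 2: always take the current best-looking arm.
    return int(np.argmax(q_values))

def select_epsilon_greedy(q_values, counts, epsilon=0.1):
    # Strategy 3: explore with probability epsilon, otherwise exploit.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Strategy 4 (optimistic start) is an initialization, not a selection rule:
# start every estimate high so greedy selection is forced to try each arm.
def optimistic_q(n_arms, optimistic_value=5.0):
    return np.full(n_arms, optimistic_value)

def run(select, true_means, steps=1000, optimistic=False):
    n_arms = len(true_means)
    q = optimistic_q(n_arms) if optimistic else np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)
    for _ in range(steps):
        arm = select(q, counts)
        reward = float(rng.random() < true_means[arm])  # Bernoulli reward
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]        # incremental average
    return counts

print(run(select_greedy, [0.2, 0.5, 0.8]))                   # often locks onto a lucky arm
print(run(select_epsilon_greedy, [0.2, 0.5, 0.8]))           # usually shifts pulls to the best arm
print(run(select_greedy, [0.2, 0.5, 0.8], optimistic=True))  # optimism forces each arm to be tried
```

The thing to look at in the output isn't the reward, it's the pull counts: which strategies move their pulls toward the best arm, and how quickly.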

I’m intentionally not going deep into algorithms yet.

The purpose of this month was to feel the behavior of strategies, not memorize names.


The Most Important Lesson: Averages Lie Early

In supervised learning, “average loss over a batch” is usually a stable thing to monitor.

In bandits, early averages are dangerously noisy.

A single lucky reward can dominate early estimates.
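
Here is a tiny, made-up example of what that looks like with a running sample average (the reward sequence is invented to make the point):

```python
# A hypothetical run: an arm whose true payout rate is about 0.3 happens to
# pay on its first two pulls. The running average starts at 1.00 and only
# drifts back toward the truth, plenty of time for a greedy rule to lock on.
rewards = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
estimate, count = 0.0, 0
for r in rewards:
    count += 1
    estimate += (r - estimate) / count  # incremental sample average
    print(f"pull {count:2d}: estimate = {estimate:.2f}")
```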

Don’t over-trust early reward

Early reward curves can look amazing for the wrong reason: luck.

Watch behavior change instead

  • do action counts shift toward the best arm?
  • do bad arms get sampled less over time?
  • does this pattern repeat across multiple runs?

This is what forced me to change how I think about progress.

Instead of asking:

  • “did the reward go up?”

I started asking:

  • “did the action selection behavior change in a reasonable way?”
  • “how fast did it stop wasting pulls on bad arms?”
  • “does the result persist across multiple runs?”

What I Started Logging (Even in Simple Problems)

Even though bandits are small, they taught me to log like a grown-up.

Here’s what I found useful:

  • reward over time (but with skepticism)
  • action counts per arm
  • estimated value per arm (and how quickly it stabilizes)
  • run-to-run variation (different randomness, different outcomes)

Bandits are where I first got burned by “a good-looking curve.”

One run can look like a breakthrough. Five runs can reveal it was just luck.
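
Here is a small, self-contained sketch of that logging discipline, assuming a Bernoulli bandit and an epsilon-greedy rule: run the same setup under several seeds and compare per-arm pull counts instead of staring at one reward curve. All names and numbers are illustrative.

```python
import numpy as np

def epsilon_greedy_run(true_means, steps=2000, epsilon=0.1, seed=0):
    """One epsilon-greedy run on a Bernoulli bandit; returns total reward and per-arm pull counts."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    q = np.zeros(n_arms)                  # estimated value per arm
    counts = np.zeros(n_arms, dtype=int)  # action counts per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))  # explore
        else:
            arm = int(np.argmax(q))          # exploit
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]
        total_reward += reward
    return total_reward, counts

true_means = [0.2, 0.5, 0.8]
best_arm = int(np.argmax(true_means))
for seed in range(5):
    total, counts = epsilon_greedy_run(true_means, seed=seed)
    share = counts[best_arm] / counts.sum()
    print(f"seed {seed}: reward={total:.0f}  pulls per arm={counts.tolist()}  best-arm share={share:.2f}")
```

If the best-arm share is high across all seeds, that's learning. If it's high in one seed and mediocre in the others, that's luck.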


Field Notes (What Surprised Me)

1) “Greedy” is worse than I expected

I knew it would struggle, but it was dramatic.

It taught me that:

A learning system without exploration is not a learning system.

2) Simple exploration beats cleverness early

Epsilon-style exploration feels almost too simple to matter.

But it immediately improved behavior.

It made me less obsessed with fancy algorithms and more obsessed with:

  • instrumentation
  • repeatability
  • understanding why something fails

3) The problem isn’t picking the best arm

The problem is earning the right to believe you’ve found the best arm.

That’s an epistemology problem, not just a math problem.


The Engineer’s Checklist I’m Carrying Forward

Bandits were my first RL problem where I could build a “discipline loop.”

This is the checklist I want to reuse as RL gets messier:

  • baseline first (random policy)
  • log behavior, not just reward
  • run multiple seeds
  • separate “looks good” from “is reliable”
  • treat exploration as a controllable system knob

Bandits convinced me that RL progress is easy to fake.

So I’m making “anti-self-deception” part of the learning process.


What’s Next

March is where “real RL” begins.

Bandits have no state, no dynamics, no long-term consequences.

Bridge to next month

Bandits taught me how to evaluate learning under uncertainty.

MDPs will add the next layer: my action changes what I see next, so now exploration affects the future dataset, not just today’s reward.

So next month I’m moving into tabular environments where:

  • actions change what state you see next
  • rewards can be delayed
  • “good behavior” means planning across steps, not just picking the best lever

In other words:

MDPs, value functions, and the first time RL starts to feel like learning over time.


