
I ended 2018 in a weird place - less “reinforcement,” more “copying.” GAIL taught me that sometimes the fastest path to competence is to borrow behavior first — and ask questions later.
Axel Domingues
2018 started with rewards and returns.
It ends with something that feels… almost like cheating:
learning without rewards.
After a year of fighting exploration, instability, sparse feedback, and “why is nothing happening,” I arrived at a technique that basically says:
What if we skip the reward design and learn from someone who already knows what they’re doing?
That thought is both comforting and uncomfortable.
Comforting because it promises competence faster.
Uncomfortable because it changes what RL is.
I didn’t get into reinforcement learning to become a copy machine.
I got into it because I wanted to understand how behavior emerges from optimization.
GAIL can give you behavior that works, without giving you the same feeling of why it works.
And yet… after meeting GAIL, I can’t deny the feeling:
Sometimes the cleanest “reward function” is a dataset of expert trajectories.
The shift
From “learn from reward” → to “learn by matching expert behavior.”
The mechanism
Train a judge (discriminator), then train the policy to fool it.
The new risks
Bad demos, weak coverage, or a too-strong judge can kill learning.
The takeaway
You don’t remove cost — you move it from reward design to data quality + evaluation.
If November was:
Then December is:
It’s still optimization.
It’s still policies and trajectories.
But the teacher changes.
Instead of reward, the signal becomes:
That tiny shift changes the emotional texture of training.
It goes from:
to:

GAIL stands for Generative Adversarial Imitation Learning.
But the name is way scarier than the idea.
The expert data
A dataset of episodes: states, actions, and what the expert did over time.
The judge (discriminator)
It learns to classify “expert” vs “agent” trajectory snippets.
The policy
It updates to produce trajectories the judge can’t distinguish from expert ones.
The story version is:
First you train a judge to tell expert behavior apart from the agent’s.
Then you train the agent to fool the judge.
If the judge can’t tell them apart, the agent is behaving like the expert.
So the agent “learns” without an explicit reward signal from the environment — it learns a reward-like signal from the discriminator’s feedback.
That’s the part that felt deeply weird:
the reward becomes something you learn.
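To make that concrete, here is a minimal sketch of the loop in PyTorch. Everything in it is an illustrative assumption rather than my actual setup: a toy discrete action space, random placeholder states standing in for a real environment, and a plain REINFORCE-style policy update where serious GAIL implementations typically use TRPO or PPO. The shape is the point: the judge trains to separate expert (state, action) pairs from the agent's, and the policy trains on log D(s, a) as its stand-in reward.

```python
# Minimal GAIL-shaped loop (a sketch, not a faithful implementation).
# Assumptions: discrete actions, random placeholder states instead of a real
# environment, and a REINFORCE-style policy update where TRPO/PPO is typical.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, BATCH = 4, 3, 256

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
judge  = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.Tanh(), nn.Linear(64, 1))

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt  = torch.optim.Adam(judge.parameters(), lr=3e-4)
bce    = nn.BCEWithLogitsLoss()

def pairs(states, actions):
    """The judge sees (state, one-hot action) pairs."""
    return torch.cat([states, torch.eye(N_ACTIONS)[actions]], dim=1)

# Placeholder "expert" demonstrations; in practice these are recorded (s, a) pairs.
expert_s = torch.randn(1024, STATE_DIM)
expert_a = torch.randint(0, N_ACTIONS, (1024,))

for step in range(2000):
    # 1) Roll out the current policy (random states stand in for an environment).
    agent_s = torch.randn(BATCH, STATE_DIM)
    dist    = torch.distributions.Categorical(logits=policy(agent_s))
    agent_a = dist.sample()

    # 2) Train the judge: expert pairs -> 1, agent pairs -> 0.
    idx    = torch.randint(0, expert_s.shape[0], (BATCH,))
    d_loss = bce(judge(pairs(expert_s[idx], expert_a[idx])), torch.ones(BATCH, 1)) \
           + bce(judge(pairs(agent_s, agent_a)), torch.zeros(BATCH, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 3) Train the policy to fool the judge: the stand-in reward is log D(s, a),
    #    i.e. "how expert-like did my own behavior look to the judge?"
    with torch.no_grad():
        reward = torch.log(torch.sigmoid(judge(pairs(agent_s, agent_a))) + 1e-8).squeeze(1)
    pi_loss = -(dist.log_prob(agent_a) * reward).mean()   # REINFORCE surrogate
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Note the asymmetry: the judge gets an ordinary supervised loss, while the policy only ever sees the judge's opinion of its own behavior.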
For most of 2018, I treated RL as:
GAIL flips that.
It says:
It’s not replacing RL — it’s showing a different path through the same space.
And it connects to something I’ve been circling all year:
RL is an interface problem.
In GAIL, the interface isn’t reward design.
It’s data quality.
Category clarity
Imitation learning as its own thing — not “just RL with a shortcut.”
BC vs GAIL
Supervised copying vs adversarial distribution matching (a minimal BC sketch follows this list, for contrast with the GAIL loop above).
The adversarial loop
Internalize the two-player dynamic: judge vs policy.
What breaks first
Coverage, covariate shift, and discriminator overpowering the policy.
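For contrast, here is what the "supervised copying" half of that comparison looks like. Same hypothetical toy dimensions and placeholder expert data as the GAIL sketch above; behavioral cloning is just a classifier from states to the expert's actions.

```python
# Behavioral cloning sketch: supervised copying, no judge, no environment interaction.
# Same illustrative toy dimensions and placeholder expert data as the GAIL sketch.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
ce  = nn.CrossEntropyLoss()

# Recorded expert states and the actions the expert took in them (placeholders here).
expert_s = torch.randn(1024, STATE_DIM)
expert_a = torch.randint(0, N_ACTIONS, (1024,))

for epoch in range(50):
    # Plain classification: predict the expert's action from the state.
    loss = ce(policy(expert_s), expert_a)
    opt.zero_grad(); loss.backward(); opt.step()
```

Nothing in that loop ever visits the states the policy's own mistakes lead to, which is the classic route to covariate shift.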
With reward-based RL, there’s a clean story:
With imitation learning, the story becomes:
That raises questions I didn’t expect to face this early:
And the most important one for my engineering mindset:
How do I debug learning when the reward is not a number from the environment, but a moving target from another network?
In GAIL, the traditional RL scoreboard (episode return) is not the main event, so I started paying attention to a different set of signals (there's a small logging sketch after the list):
Judge strength
Discriminator accuracy (too high = sparse gradient again).
Policy collapse risk
Entropy and action diversity (avoid narrow imitation).
Behavior truth
Qualitative rollouts (this month, eyes beat plots).
Failure signature
Termination reasons + episode length patterns.
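Here is a hypothetical logging helper for those signals, reusing the judge and policy shapes from the GAIL sketch above. The function name, inputs, and thresholds in the comments are assumptions for illustration, not anything standard.

```python
# Sketch of the signals worth logging each iteration instead of raw episode return.
# Function name, inputs, and thresholds in the comments are illustrative assumptions.
import torch

def gail_health(judge, policy, expert_pairs, agent_pairs, agent_states, episode_lengths):
    """Return a dict of diagnostic signals for one training iteration."""
    with torch.no_grad():
        # Judge strength: accuracy near 1.0 means the policy is starved of gradient;
        # accuracy near 0.5 very early can mean the judge has underfit or collapsed.
        expert_correct = torch.sigmoid(judge(expert_pairs)) > 0.5
        agent_correct  = torch.sigmoid(judge(agent_pairs)) <= 0.5
        judge_acc = torch.cat([expert_correct, agent_correct]).float().mean().item()

        # Policy collapse risk: entropy of the action distribution (low = narrow imitation).
        dist = torch.distributions.Categorical(logits=policy(agent_states))
        entropy = dist.entropy().mean().item()

    # Failure signature: episode-length patterns (e.g. everything terminating early).
    lengths = torch.tensor(episode_lengths, dtype=torch.float32)
    return {
        "judge_accuracy": judge_acc,
        "policy_entropy": entropy,
        "mean_episode_length": lengths.mean().item(),
        "min_episode_length": lengths.min().item(),
    }
```

Called once per iteration, this gives you a scoreboard that tracks the adversarial game instead of a return the environment never provides.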
With those signals in hand, the failure modes start to have recognizable signatures (a perturbed-start check is sketched after the list):
Symptom: looks good on expert-like states, then spirals when it drifts.
Likely cause: covariate shift (small errors compound).
First check: evaluate on perturbed starts; watch recovery behavior.

Symptom: judge always wins; the policy gets no usable signal.
Likely cause: discriminator too strong / policy too weak / imbalance.
First check: discriminator accuracy; if it saturates near 100%, learning is starved.

Symptom: policy “fools” the judge without real competence.
Likely cause: judge underfits or collapses.
First check: judge accuracy near chance too early + behavior not improving.

Symptom: one narrow behavior that looks smooth but fails under variation.
Likely cause: easiest-to-fool slice of the expert distribution.
First check: diversity tests (different starts, disturbances, minor env changes).

Symptom: works in familiar situations, fails elsewhere.
Likely cause: expert coverage is narrow; the policy never learns outside it.
First check: state coverage metrics; test scenarios outside the demonstration manifold.
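Several of those "first checks" are just small evaluation harnesses. Here is a sketch of the perturbed-start one, with a stand-in ToyEnv and a random policy so the snippet runs on its own; the environment, the noise scales, and the function name are all assumptions, not anything from my setup.

```python
# Perturbed-start / diversity check sketch. The ToyEnv, noise scales, and function
# name are stand-ins so the snippet runs on its own; the evaluation shape is the point.
import numpy as np

class ToyEnv:
    """Stand-in for a gym-style env: reset(start) -> state, step(action) -> (state, done)."""
    def __init__(self, horizon=200):
        self.horizon = horizon

    def reset(self, start):
        self.state, self.t = np.asarray(start, dtype=np.float32), 0
        return self.state

    def step(self, action):
        self.t += 1
        self.state = self.state + 0.01 * np.random.randn(*self.state.shape).astype(np.float32)
        done = self.t >= self.horizon or float(np.abs(self.state).max()) > 3.0
        return self.state, done

def perturbed_start_check(env, act, nominal_start, noise_scales=(0.0, 0.1, 0.3), n_runs=20):
    """How quickly do episodes degrade as the start state drifts off the demo manifold?"""
    report = {}
    for scale in noise_scales:
        lengths = []
        for _ in range(n_runs):
            start = np.asarray(nominal_start) + scale * np.random.randn(len(nominal_start))
            state, done, steps = env.reset(start), False, 0
            while not done:
                state, done = env.step(act(state))
                steps += 1
            lengths.append(steps)
        report[scale] = {"mean_len": float(np.mean(lengths)), "min_len": float(np.min(lengths))}
    return report

# Usage with a dummy random policy; swap in the real environment and trained policy.
print(perturbed_start_check(ToyEnv(), act=lambda s: np.random.randint(3),
                            nominal_start=[0.0, 0.0, 0.0, 0.0]))
```

Swap in the real environment and trained policy, and the number to watch is how fast the mean and minimum episode lengths fall off as the perturbation scale grows.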
I started thinking of GAIL as:
reward learning + policy learning running in a loop.
This is powerful, but it also means you have:
And suddenly my 2017 deep learning instincts came back:
Instead of “maximize reward,” the agent is trying to “be indistinguishable.”
It’s optimization as identity.
That’s… psychologically different.
GAIL can reduce reward engineering.
But it replaces it with:
You don’t remove the cost — you move it.
I started the year thinking RL was about:
I’m ending it with a more engineering-shaped belief:
Bandits taught me honesty.
Tabular RL taught me clarity.
Deep RL taught me humility.
Sparse rewards taught me patience.
And GAIL taught me something new:
Sometimes the shortest path to competence is to stand on someone else’s shoulders — but you have to measure what you’re inheriting.
December takeaway
Imitation isn’t a shortcut.
It’s a different interface: behavior data becomes the teacher.
The new discipline
When reward is learned, evaluation must get stricter —
or you’ll confuse “looks right” with “is robust.”
2018 was exploration.
I learned the primitives:
In 2019, I want to do something harder than learning algorithms:
apply them.
Specifically: take the Deep RL toolbox and point it at a real system where:
The direction I’m most excited about is building toward a fully automated trading system — learning to trade autonomously on BitMEX.
Not because it’s easy.
Because it’s the kind of environment where RL stops being a demo and starts being engineering.
And I want to find out what breaks when I leave Gym.