
I ended 2018 in a weird place - less “reinforcement,” more “copying.” GAIL taught me that sometimes the fastest path to competence is to borrow behavior first — and ask questions later.
Axel Domingues
2018 started with rewards and returns.
It ends with something that feels… almost like cheating:
learning without rewards.
After a year of fighting exploration, instability, sparse feedback, and “why is nothing happening,” I arrived at a technique that basically says:
What if we skip the reward design and learn from someone who already knows what they’re doing?
That thought is both comforting and uncomfortable.
Comforting because it promises competence faster.
Uncomfortable because it changes what RL is.
I didn’t get into reinforcement learning to become a copy machine.
I got into it because I wanted to understand how behavior emerges from optimization.
GAIL can give you behavior that works, without giving you the same feeling of why it works.
And yet… after meeting GAIL, I can’t deny the feeling:
Sometimes the cleanest “reward function” is a dataset of expert trajectories.
The shift
From “learn from reward” → to “learn by matching expert behavior.”
The mechanism
Train a judge (discriminator), then train the policy to fool it.
The new risks
Bad demos, weak coverage, or a too-strong judge can kill learning.
The takeaway
You don’t remove cost — you move it from reward design to data quality + evaluation.
If November was:
Then December is:
It’s still optimization.
It’s still policies and trajectories.
But the teacher changes.
Instead of reward, the signal becomes:
That tiny shift changes the emotional texture of training.
It goes from:
to:

GAIL stands for Generative Adversarial Imitation Learning.
But the name is way scarier than the idea.
The expert data
A dataset of episodes: states, actions, and what the expert did over time.
The judge (discriminator)
It learns to classify “expert” vs “agent” trajectory snippets.
The policy
It updates to produce trajectories the judge can’t distinguish from expert ones.
The story version is:
First you train a judge to tell expert behavior apart from the agent’s.
Then you train the agent to fool the judge.
If the judge can’t tell them apart, the agent is behaving like the expert.
So the agent “learns” without an explicit reward signal from the environment — it learns a reward-like signal from the discriminator’s feedback.
That’s the part that felt deeply weird:
the reward becomes something you learn.
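To make that concrete, here is a minimal sketch of the loop in PyTorch. Everything in it is an illustrative assumption rather than my actual setup: a toy discrete action space, random placeholder states standing in for a real environment, and a plain REINFORCE-style policy update where serious GAIL implementations typically use TRPO or PPO. The shape is the point: the judge trains to separate expert (state, action) pairs from the agent's, and the policy trains on log D(s, a) as its stand-in reward.

```python
# Minimal GAIL-shaped loop (a sketch, not a faithful implementation).
# Assumptions: discrete actions, random placeholder states instead of a real
# environment, and a REINFORCE-style policy update where TRPO/PPO is typical.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, BATCH = 4, 3, 256

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
judge  = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.Tanh(), nn.Linear(64, 1))

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt  = torch.optim.Adam(judge.parameters(), lr=3e-4)
bce    = nn.BCEWithLogitsLoss()

def pairs(states, actions):
    """The judge sees (state, one-hot action) pairs."""
    return torch.cat([states, torch.eye(N_ACTIONS)[actions]], dim=1)

# Placeholder "expert" demonstrations; in practice these are recorded (s, a) pairs.
expert_s = torch.randn(1024, STATE_DIM)
expert_a = torch.randint(0, N_ACTIONS, (1024,))

for step in range(2000):
    # 1) Roll out the current policy (random states stand in for an environment).
    agent_s = torch.randn(BATCH, STATE_DIM)
    dist    = torch.distributions.Categorical(logits=policy(agent_s))
    agent_a = dist.sample()

    # 2) Train the judge: expert pairs -> 1, agent pairs -> 0.
    idx    = torch.randint(0, expert_s.shape[0], (BATCH,))
    d_loss = bce(judge(pairs(expert_s[idx], expert_a[idx])), torch.ones(BATCH, 1)) \
           + bce(judge(pairs(agent_s, agent_a)), torch.zeros(BATCH, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 3) Train the policy to fool the judge: the stand-in reward is log D(s, a),
    #    i.e. "how expert-like did my own behavior look to the judge?"
    with torch.no_grad():
        reward = torch.log(torch.sigmoid(judge(pairs(agent_s, agent_a))) + 1e-8).squeeze(1)
    pi_loss = -(dist.log_prob(agent_a) * reward).mean()   # REINFORCE surrogate
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Note the asymmetry: the judge gets an ordinary supervised loss, while the policy only ever sees the judge's opinion of its own behavior.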
For most of 2018, I treated RL as:
GAIL flips that.
It says:
It’s not replacing RL — it’s showing a different path through the same space.
And it connects to something I’ve been circling all year:
RL is an interface problem.
In GAIL, the interface isn’t reward design.
It’s data quality.
Category clarity
Imitation learning as its own thing — not “just RL with a shortcut.”
BC vs GAIL
Supervised copying vs adversarial distribution matching (a minimal BC sketch follows this list, for contrast with the GAIL loop above).
The adversarial loop
Internalize the two-player dynamic: judge vs policy.
What breaks first
Coverage, covariate shift, and discriminator overpowering the policy.
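For contrast, here is what the "supervised copying" half of that comparison looks like. Same hypothetical toy dimensions and placeholder expert data as the GAIL sketch above; behavioral cloning is just a classifier from states to the expert's actions.

```python
# Behavioral cloning sketch: supervised copying, no judge, no environment interaction.
# Same illustrative toy dimensions and placeholder expert data as the GAIL sketch.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
ce  = nn.CrossEntropyLoss()

# Recorded expert states and the actions the expert took in them (placeholders here).
expert_s = torch.randn(1024, STATE_DIM)
expert_a = torch.randint(0, N_ACTIONS, (1024,))

for epoch in range(50):
    # Plain classification: predict the expert's action from the state.
    loss = ce(policy(expert_s), expert_a)
    opt.zero_grad(); loss.backward(); opt.step()
```

Nothing in that loop ever visits the states the policy's own mistakes lead to, which is the classic route to covariate shift.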
With reward-based RL, there’s a clean story:
With imitation learning, the story becomes:
That raises questions I didn’t expect to face this early:
And the most important one for my engineering mindset:
How do I debug learning when the reward is not a number from the environment, but a moving target from another network?
In GAIL, the traditional RL scoreboard (episode return) is not the main event, so I started paying attention to a different set of signals (there's a small logging sketch after the list):
Judge strength
Discriminator accuracy (too high = sparse gradient again).
Policy collapse risk
Entropy and action diversity (avoid narrow imitation).
Behavior truth
Qualitative rollouts (this month, eyes beat plots).
Failure signature
Termination reasons + episode length patterns.
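Here is a hypothetical logging helper for those signals, reusing the judge and policy shapes from the GAIL sketch above. The function name, inputs, and thresholds in the comments are assumptions for illustration, not anything standard.

```python
# Sketch of the signals worth logging each iteration instead of raw episode return.
# Function name, inputs, and thresholds in the comments are illustrative assumptions.
import torch

def gail_health(judge, policy, expert_pairs, agent_pairs, agent_states, episode_lengths):
    """Return a dict of diagnostic signals for one training iteration."""
    with torch.no_grad():
        # Judge strength: accuracy near 1.0 means the policy is starved of gradient;
        # accuracy near 0.5 very early can mean the judge has underfit or collapsed.
        expert_correct = torch.sigmoid(judge(expert_pairs)) > 0.5
        agent_correct  = torch.sigmoid(judge(agent_pairs)) <= 0.5
        judge_acc = torch.cat([expert_correct, agent_correct]).float().mean().item()

        # Policy collapse risk: entropy of the action distribution (low = narrow imitation).
        dist = torch.distributions.Categorical(logits=policy(agent_states))
        entropy = dist.entropy().mean().item()

    # Failure signature: episode-length patterns (e.g. everything terminating early).
    lengths = torch.tensor(episode_lengths, dtype=torch.float32)
    return {
        "judge_accuracy": judge_acc,
        "policy_entropy": entropy,
        "mean_episode_length": lengths.mean().item(),
        "min_episode_length": lengths.min().item(),
    }
```

Called once per iteration, this gives you a scoreboard that tracks the adversarial game instead of a return the environment never provides.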
With those signals in hand, the failure modes start to have recognizable signatures (a perturbed-start check is sketched after the list):
Symptom: looks good on expert-like states, then spirals when it drifts.
Likely cause: covariate shift (small errors compound).
First check: evaluate on perturbed starts; watch recovery behavior.

Symptom: judge always wins; the policy gets no usable signal.
Likely cause: discriminator too strong / policy too weak / imbalance.
First check: discriminator accuracy; if it saturates near 100%, learning is starved.

Symptom: policy “fools” the judge without real competence.
Likely cause: judge underfits or collapses.
First check: judge accuracy near chance too early + behavior not improving.

Symptom: one narrow behavior that looks smooth but fails under variation.
Likely cause: easiest-to-fool slice of the expert distribution.
First check: diversity tests (different starts, disturbances, minor env changes).

Symptom: works in familiar situations, fails elsewhere.
Likely cause: expert coverage is narrow; the policy never learns outside it.
First check: state coverage metrics; test scenarios outside the demonstration manifold.
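Several of those "first checks" are just small evaluation harnesses. Here is a sketch of the perturbed-start one, with a stand-in ToyEnv and a random policy so the snippet runs on its own; the environment, the noise scales, and the function name are all assumptions, not anything from my setup.

```python
# Perturbed-start / diversity check sketch. The ToyEnv, noise scales, and function
# name are stand-ins so the snippet runs on its own; the evaluation shape is the point.
import numpy as np

class ToyEnv:
    """Stand-in for a gym-style env: reset(start) -> state, step(action) -> (state, done)."""
    def __init__(self, horizon=200):
        self.horizon = horizon

    def reset(self, start):
        self.state, self.t = np.asarray(start, dtype=np.float32), 0
        return self.state

    def step(self, action):
        self.t += 1
        self.state = self.state + 0.01 * np.random.randn(*self.state.shape).astype(np.float32)
        done = self.t >= self.horizon or float(np.abs(self.state).max()) > 3.0
        return self.state, done

def perturbed_start_check(env, act, nominal_start, noise_scales=(0.0, 0.1, 0.3), n_runs=20):
    """How quickly do episodes degrade as the start state drifts off the demo manifold?"""
    report = {}
    for scale in noise_scales:
        lengths = []
        for _ in range(n_runs):
            start = np.asarray(nominal_start) + scale * np.random.randn(len(nominal_start))
            state, done, steps = env.reset(start), False, 0
            while not done:
                state, done = env.step(act(state))
                steps += 1
            lengths.append(steps)
        report[scale] = {"mean_len": float(np.mean(lengths)), "min_len": float(np.min(lengths))}
    return report

# Usage with a dummy random policy; swap in the real environment and trained policy.
print(perturbed_start_check(ToyEnv(), act=lambda s: np.random.randint(3),
                            nominal_start=[0.0, 0.0, 0.0, 0.0]))
```

Swap in the real environment and trained policy, and the number to watch is how fast the mean and minimum episode lengths fall off as the perturbation scale grows.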
I started thinking of GAIL as:
reward learning + policy learning running in a loop.
This is powerful, but it also means you have:
And suddenly my 2017 deep learning instincts came back:
Instead of “maximize reward,” the agent is trying to “be indistinguishable.”
It’s optimization as identity.
That’s… psychologically different.
GAIL can reduce reward engineering.
But it replaces it with:
You don’t remove the cost — you move it.
I started the year thinking RL was about:
I’m ending it with a more engineering-shaped belief:
Bandits taught me honesty.
Tabular RL taught me clarity.
Deep RL taught me humility.
Sparse rewards taught me patience.
And GAIL taught me something new:
Sometimes the shortest path to competence is to stand on someone else’s shoulders — but you have to measure what you’re inheriting.
December takeaway
Imitation isn’t a shortcut.
It’s a different interface: behavior data becomes the teacher.
The new discipline
When reward is learned, evaluation must get stricter —
or you’ll confuse “looks right” with “is robust.”
2018 was exploration.
I learned the primitives:
In 2019, I want to do something harder than learning algorithms:
apply them.
Specifically: take the Deep RL toolbox and point it at a real system where:
The direction I’m most excited about is building toward a fully automated trading system — learning to trade autonomously on BitMEX.
Not because it’s easy.
Because it’s the kind of environment where RL stops being a demo and starts being engineering.
And I want to find out what breaks when I leave Gym.