
In March 2020, I stopped treating reward like a number and started treating it like a contract—pay real costs, punish real risk, and don’t teach the agent to win a video game.
Axel Domingues
In 2018 I learned RL as algorithms: value functions, policies, advantage, exploration.
In 2019 I learned trading as plumbing: order books, features, labels, baselines, and the first humbling curves.
In January and February 2020, I finally put the two together: a Gym environment plus evaluation discipline.
March is where I hit the wall:
If the reward is wrong, everything else is theatre.
Early on, I thought reward shaping was a kind of “extra credit.”
Like: first get it working, then add a few penalties.
Reality: reward shaping is the behavior spec.
If you pay for the wrong proxy, the agent will happily maximise that proxy instead of learning to trade.
Not because it’s “smart”.
Because that’s literally the job.
My first version was basically the base movement reward and nothing else: pay the agent when price moves in its favour.
That sounds reasonable until you remember what a real exchange charges you just for existing.
So the agent did what you’d expect from a badly-specified game: it traded constantly, because in its world acting was free.
Training wasn’t “unstable because RL is hard.”
Training was unstable because I was paying for incoherent behavior.
My rule became: every reward term has to correspond to a real cost or a real risk.
So I focused on three categories that are real and unavoidable: fees, the pain of staying wrong over time, and the risk of staying exposed at all.
All of this landed in bitmex_env.py and, importantly, it stayed simple enough to audit.
If your environment doesn’t price fees, you teach the agent that “more actions” is always an option.
So I made fees explicit.
In the environment, this shows up as adding fee terms into the reward path.
Not as a lecture.
As a bill.
Effect: the agent stopped treating extra actions as free.
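To make the bill concrete, here’s a minimal sketch of a fee term in the reward path. The fee rate, the function names, and where exactly this sits inside step() are illustrative assumptions, not the exact contents of bitmex_env.py.

```python
# Illustrative only: the fee rate and structure are assumptions, not bitmex_env.py.
ASSUMED_TAKER_FEE = 0.00075  # fraction of notional charged per taker execution


def trade_fee(notional: float, fee_rate: float = ASSUMED_TAKER_FEE) -> float:
    """Cost of crossing the spread on a trade of the given notional size."""
    return abs(notional) * fee_rate


def step_reward(pnl_change: float, traded_notional: float) -> float:
    """One step of reward: the movement reward minus the bill for acting.

    pnl_change: change in portfolio value this step (same currency units
    as traded_notional). traded_notional is non-zero only on steps where
    the agent opened or closed a position.
    """
    reward = pnl_change
    if traded_notional != 0.0:
        reward -= trade_fee(traded_notional)  # acting is never free
    return reward
```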
Markets don’t care about your neural network.
Neural networks care a lot about scale.
So I introduced a simple multiplier to keep reward magnitudes inside a learnable range.
This is not finance.
This is interface design.
If reward spikes, training becomes noise.
If reward is too tiny, the agent never updates meaningfully.
So: scale it, then keep it consistent.
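A minimal sketch of that interface point, assuming one global constant and an optional clip (both values are illustrative, not the ones in the environment):

```python
# Assumed values for illustration; the point is one multiplier, applied everywhere.
REWARD_SCALE = 100.0   # e.g. lift per-step returns of ~1e-4 into the ~1e-2 range
REWARD_CLIP = 10.0     # optional guard against spikes; not necessarily in the env


def scale_reward(raw_reward: float) -> float:
    """Apply the same scale in training, evaluation, and logging, and cap
    pathological spikes so one step cannot dominate a gradient update."""
    scaled = raw_reward * REWARD_SCALE
    return max(-REWARD_CLIP, min(REWARD_CLIP, scaled))
```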
A big failure mode was letting unrealised losses be soft.
If the agent can keep a position open forever with no increasing pain, it will.
Because “maybe it comes back” is free.
So I made two asymmetric rules:
This is me admitting a truth about the system:
In trading, the worst thing is not being wrong — it’s staying wrong with leverage and time.
So the reward started nudging the agent toward closing positions that weren’t working instead of waiting to be proven right.
Not because “it’s morally good.”
Because my environment needs a policy that can survive the messy reality of regimes.
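A hedged sketch of the general shape of that asymmetry, assuming rules like “open losses are charged in full and the charge grows with time in the position” while “open gains are only partially credited”; every constant and the linear ramp below are illustrative, not the exact rules in bitmex_env.py.

```python
# All constants and the linear time ramp are illustrative assumptions.
LOSS_WEIGHT = 1.0      # unrealised losses count in full
GAIN_WEIGHT = 0.5      # unrealised gains only partially credited until realised
TIME_COST = 1e-4       # small per-step cost of simply staying exposed
RAMP = 0.01            # how fast the pain of an open loss grows per step


def shaped_unrealised(unrealised_pnl: float, steps_in_position: int) -> float:
    """Shaping term added to the reward while a position is open:
    losses hurt, they hurt more the longer they stay open, and exposure
    itself has a small running cost."""
    if unrealised_pnl < 0:
        term = LOSS_WEIGHT * unrealised_pnl * (1.0 + RAMP * steps_in_position)
    else:
        term = GAIN_WEIGHT * unrealised_pnl
    return term - TIME_COST * steps_in_position
```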
Then I added the simplest forms of risk constraint.
These are deliberately crude.
They’re not a full risk model.
They’re a behavioral bias against staying exposed indefinitely.
This created a tempo.
Not a strategy.
But tempo matters, because it defines what kind of policy can exist.
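As a hypothetical example of how crude “crude” can be, here are two hard limits of this kind; the names and numbers are invented, and the shape of the rule (a hard edge the policy keeps bumping into) is the point.

```python
# Hypothetical limits: names and numbers are invented for illustration.
MAX_STEPS_IN_POSITION = 288        # e.g. one day of 5-minute candles
MAX_UNREALISED_DRAWDOWN = -0.05    # as a fraction of the portfolio


def must_force_close(unrealised_return: float, steps_in_position: int) -> bool:
    """The environment flattens the position for the agent when either limit
    is hit, and still charges the usual exit fee. Crude, but it sets a tempo."""
    too_long = steps_in_position >= MAX_STEPS_IN_POSITION
    too_deep = unrealised_return <= MAX_UNREALISED_DRAWDOWN
    return too_long or too_deep
```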
In the baseline Gym, opening a position commits the whole portfolio.
So I didn’t need a “position size” state variable yet.
But I did need the agent to know if it was currently exposed.
So the observation isn’t just microstructure features.
It’s features plus a tiny state header that tells the agent whether it’s currently exposed.
This is the minimum amount of “self-awareness” required for the reward constraints to make sense.
A reward constraint without the corresponding state is just a random shock.
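A minimal sketch of “features + a tiny state header”, assuming an exposure flag plus a few extra fields (direction, time in position, unrealised PnL). Only the exposure flag is essential here; the rest are illustrative additions.

```python
import numpy as np


def build_observation(features: np.ndarray,
                      in_position: bool,
                      direction: int,            # +1 long, -1 short, 0 flat (assumed)
                      steps_in_position: int,    # assumed extra field
                      unrealised_pnl: float      # assumed extra field
                      ) -> np.ndarray:
    """Prepend a small fixed-size state header to the microstructure features,
    so the reward constraints have state the agent can actually see."""
    header = np.array([
        1.0 if in_position else 0.0,
        float(direction),
        float(steps_in_position),
        float(unrealised_pnl),
    ], dtype=np.float32)
    return np.concatenate([header, features.astype(np.float32)])
```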
This is where I had to get honest about what I was building.
I wasn’t trying to build a pure academic RL benchmark.
I was trying to build a policy that trades in a system that fails.
So constraints are not cheating.
They’re the only way to teach “behavior under risk” without waiting 6 months for the agent to discover it by accident.
The only real rule is that every shaping term has to point at something real: a real cost, a real risk, a behavior you’d actually want live.
If you do that, shaping is not a lie.
It’s a curriculum.
Reward shaping sounds philosophical until you instrument it.
Then it becomes engineering.
Here’s the checklist I kept near the code:
Log every reward component separately: fees, unrealised shaping, time penalties, base movement reward. If you can’t log it, you can’t trust it. (There’s a sketch of this right after the checklist.)
Compare the magnitudes: if one term dominates, you’re training on that term, not on “trading”.
Check every sign: a single flipped sign turns “avoid fees” into “farm fees”.
Replay tiny hand-made scenarios: one candle up, one candle down, one open/close sequence. If the reward surprises you there, it will destroy you at scale.
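And a sketch of the first and last items together: the reward computed as named components so each can be logged, replayed on a tiny hand-made scenario. The component names come from the checklist; every number below is made up.

```python
from collections import defaultdict


def reward_components(pnl_change, fee, unrealised_shaping, time_penalty):
    """Return the reward as named terms so every term can be logged and plotted."""
    return {
        "base_movement": pnl_change,
        "fees": -fee,
        "unrealised_shaping": unrealised_shaping,
        "time_penalty": -time_penalty,
    }


# Tiny synthetic episode: one candle up while long, then a close that pays its fee.
episode = [
    reward_components(pnl_change=0.002, fee=0.0, unrealised_shaping=0.001, time_penalty=0.0001),
    reward_components(pnl_change=0.0, fee=0.00075, unrealised_shaping=0.0, time_penalty=0.0),
]

totals = defaultdict(float)
for step in episode:
    for name, value in step.items():
        totals[name] += value
    print(step, "-> step reward:", round(sum(step.values()), 6))

print("episode totals per term:", dict(totals))
# Eyeball checks: fees negative, no term dwarfing the others,
# and no surprises on a scenario this small.
```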
Once these fixes went in, training stopped feeling like superstition.
The agent became predictable enough to reason about.
And that mattered because it let me focus on the next problem:
how to wire this into a real process and keep it alive.
That’s April.
Next post: Chappie Wiring.
This is the month the project stops being “training” and becomes a loop that can fail.