
I stopped pretending “a good predictor” was the same thing as “a tradable strategy” and designed a Gym-style environment contract that makes cheating obvious and failure modes measurable.
Axel Domingues
In 2018 I fell in love with Reinforcement Learning.
Not in a “publish a paper” way — in a “this finally feels like the missing piece” way.
Policy gradients taught me how to optimize behavior directly. Actor-critic made it feel trainable. Deep Q learning made it feel practical.
And then BitMEX showed me the part I hadn’t earned yet:
In trading, the algorithm is never the hard part. The contract is.
Because a predictor can be brilliant and still lose money.
A policy can be “optimal” and still be impossible to execute.
A backtest can be “profitable” and still be lying.
So December 2019 is where I draw a line in the sand:
If I can’t define the environment contract precisely, I don’t actually have an RL trading problem — I have a storytelling problem.
This post is about that contract: what the agent sees, what it’s allowed to do, how it gets rewarded, where episodes start/end, and where “cheating” begins.
By this point in the series, the pipeline already runs collector → features → supervised models.
So yes: I can predict “up” vs “down” ahead of time.
But the market doesn’t pay for predictions. It pays for decisions.
And a decision has hidden structure that a prediction alone doesn’t capture.
That’s where the environment comes in.
I’m using “Gym” in the practical sense — the classic interface:
- `reset()` returns an observation (initial state)
- `step(action)` returns `(observation, reward, done, info)`

It’s boring. It’s standardized. And that’s the point.
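To keep the contract concrete, here is the entire interface in one loop. This is a generic sketch using classic Gym with CartPole as a stand-in; the trading environment has to answer the same four questions every step: what do I see, what can I do, what was it worth, and is the episode over.

```python
import gym

# Classic Gym contract: reset() -> obs, step(action) -> (obs, reward, done, info).
# CartPole is only a stand-in to show the interface shape the trading env must honor.
env = gym.make("CartPole-v1")

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()          # placeholder for a real policy
    obs, reward, done, info = env.step(action)  # one step forward under the contract
    total_reward += reward

print("episode return:", total_reward)
```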
The contract forces you to be explicit about:

- what the agent sees
- what it is allowed to do
- how it gets rewarded
- where episodes start and end
In the repo, this “contract-first” approach shows up as two tracks:
- `bitmex-gym/gym_bitmex/envs/bitmex_env.py`
- `dummy-gym/gym_dummy/envs/dummy_env.py`

The dummy env exists for one reason:
I don’t trust my environment until I can run the whole RL pipeline end-to-end on something trivial.
When you say “I’m building an RL environment for trading”, you’re really saying:

- here is the observation the agent sees
- here is the set of actions it can take
- here is the reward that judges them
- here is where an episode starts and ends
Let’s go through them the way I implemented them.
In my world, the observation starts as the same thing I trained supervised models on: the normalized feature vector built from BitMEX snapshots.
You can see this pattern directly in the early BitMEX environment code:
- `__get_features()` normalizes `(x - mean) / sigma` and cleans NaNs/infs
- `__load_data()` pulls HDF5 chunks with `pd.read_hdf(...)` and stacks features over files
- `observation_space` is defined as a bounded `Box(...)` so agents can’t “break the world” with weird numeric assumptions

In the repo, the environment’s observation space is explicitly bounded (the idea is “normalized-ish data lives in a small numeric range”):
- `low = -high`, `high = 3 * ones(...)`
- `spaces.Box(low, high, dtype=np.float32)`

That is not “financial truth.” It’s a training discipline: keep the numeric interface stable.
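A minimal sketch of that discipline, with illustrative sizes (76 features plus 5 extra variables is my reading of the layout; the repo’s exact constants may differ): normalize, clean NaNs/infs, then declare a bounded `Box` so the numeric interface can’t drift.

```python
import numpy as np
from gym import spaces

# Illustrative sizes: 76 feature columns (as in the dummy env) plus 5 extra
# variables, mirroring the len(FEATURES_COLS) + 5 layout mentioned below.
NUM_FEATURES = 76
NUM_EXTRA_VARS = 5
OBS_DIM = NUM_FEATURES + NUM_EXTRA_VARS

def normalize_features(x: np.ndarray, mean: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """(x - mean) / sigma, with NaNs/infs cleaned so the agent never sees them."""
    z = (x - mean) / sigma
    return np.nan_to_num(z, nan=0.0, posinf=3.0, neginf=-3.0).astype(np.float32)

# "Normalized-ish data lives in a small numeric range": bound the observation space.
high = 3 * np.ones(OBS_DIM, dtype=np.float32)
low = -high
observation_space = spaces.Box(low, high, dtype=np.float32)
```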
The observation length is `len(FEATURES_COLS) + 5`: the feature columns plus five additional variables appended to them.

By July 2019 I already learned how easy it is to leak with labels.
With an RL environment it’s even easier, because you can leak by accident:
- building features that touch rows the agent hasn’t “lived” through yet
- stashing diagnostics in `info` and then quietly feeding them into the network

The contract is my guardrail: observations must be computable strictly from the current snapshot and the past.
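One way to make that guardrail mechanical (a sketch with hypothetical names, not the repo’s code): the observation builder only ever sees data up to and including the current index, so peeking would require an explicit, visible change.

```python
import numpy as np

def build_observation(features: np.ndarray, t: int, extra_vars: np.ndarray) -> np.ndarray:
    """Observation at step t, computable strictly from the current snapshot and the past.

    `features` is the full (T, NUM_FEATURES) array loaded from HDF5;
    the `[: t + 1]` slice is the guardrail: nothing after index t can leak in.
    """
    visible = features[: t + 1]   # past + current snapshot only
    current = visible[-1]         # the snapshot the agent acts on
    return np.concatenate([current, extra_vars]).astype(np.float32)
```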
The first version of an action space should not be “everything a trader could do”.
It should be “the smallest set of actions that can be executed reliably”.
Because every extra action multiplies the assumptions you have to simulate correctly and the execution paths you have to support live.
So I went for a deliberately boring action space:
- `NOOP`
- open/close a maker long
- open/close a maker short

In `bitmex_env.py` that shows up as a simple discrete action space:
`self.action_space = spaces.Discrete(NUM_ACTIONS)`

And the comments make the intention explicit:
“Noop, Open/Close maker Long, Open/Close maker Short”
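A sketch of that action set (the enum names are mine; the repo just uses integer actions with the comment above):

```python
from enum import IntEnum

from gym import spaces

class Action(IntEnum):
    """Deliberately boring: the smallest set of actions the live system can execute."""
    NOOP = 0
    OPEN_CLOSE_MAKER_LONG = 1    # open a maker long, or close it if one is already open
    OPEN_CLOSE_MAKER_SHORT = 2   # open a maker short, or close it if one is already open

NUM_ACTIONS = len(Action)
action_space = spaces.Discrete(NUM_ACTIONS)
```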
This is a design decision that will echo through 2020.
So even in December 2019, the environment contract is already steering the project:
The agent can only do what the system can do.
Reward is where most trading environments become fantasy novels.
Because it’s tempting to write:
reward = next_price - current_price
And then celebrate when the curve goes up.
But that reward is not trading.
It ignores:

- fees
- whether the order would actually have been filled
- the inventory risk of sitting in a position
And worst of all:
It rewards agents for predicting, not executing.
So the reward design principle I adopted is:
Reward must be computable from what would have happened if a real order was placed under the contract assumptions.
That means reward is closer to a “trade loop” than a label.
In early versions of bitmex_env.py, you can see the reward logic is already oriented around the trade loop itself: whether the order fills, what it costs in fees, and what the position does afterwards (a simplified sketch follows the list below).
Even if the exact reward function evolves later, the shape is set:
A good reward is one that makes bad strategies feel bad for the same reasons they would fail live.

Get the reward wrong and the agent will happily exploit the gap. That’s reward hacking, and in trading it looks like:
- infinite turnover because fees are missing
- perfect timing because fills are assumed
- unrealistic position flipping because inventory risk is ignored
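Here is roughly what “reward as a trade loop” means, as a simplified sketch. The fee constant and fill flags are placeholders; the repo’s actual reward function is richer and changed over time, but the principle is the same: no fill, no reward, and fees always show up.

```python
MAKER_FEE = -0.00025  # placeholder: BitMEX-style maker rebate; negative means you earn it

def maker_trade_reward(entry_price: float,
                       exit_price: float,
                       side: int,
                       entry_filled: bool,
                       exit_filled: bool) -> float:
    """Reward from what WOULD have happened if real maker orders were placed.

    side: +1 for long, -1 for short.
    If either resting order never fills, there is no trade, and no reward
    for "perfect timing" that never actually executed.
    """
    if not (entry_filled and exit_filled):
        return 0.0
    gross = side * (exit_price - entry_price) / entry_price
    fees = 2 * MAKER_FEE   # fee on both legs; a rebate makes this negative
    return gross - fees
```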
Episodes are the most under-discussed form of cheating.
If you let an agent reset whenever it wants, or if reset always starts at a “nice” place in the data, the agent will learn:
In bitmex_env.py, you can see the start of a very specific defense:
- `min_valid_idx` and `max_valid_idx` mark where the data is valid
- `max_valid_idx` is then pulled backwards by a fixed amount:

“We move the max 2 hours before the last ss to discourage reset exploit”
That’s a tiny line with a big philosophy:
Your environment must prevent the agent from learning “reset strategy” as alpha.
In other words: the environment contract includes how episodes are sampled.
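A sketch of what that sampling defense can look like. The constants and names are illustrative; the real anchor is the repo comment about moving the max two hours before the last snapshot.

```python
import numpy as np

STEPS_PER_HOUR = 60                        # illustrative: one snapshot per minute
RESET_GUARD_STEPS = 2 * STEPS_PER_HOUR     # keep resets away from the end of the data
EPISODE_STEPS = 4 * STEPS_PER_HOUR         # illustrative episode length

def sample_episode_start(rng: np.random.Generator,
                         min_valid_idx: int,
                         last_valid_idx: int) -> int:
    """Uniformly sample an episode start the agent cannot exploit.

    Pulling the maximum start index backwards means no episode can begin so close
    to the edge of the data that "reset into the easy part" becomes alpha.
    """
    max_valid_idx = last_valid_idx - RESET_GUARD_STEPS - EPISODE_STEPS
    return int(rng.integers(min_valid_idx, max_valid_idx + 1))
```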
Gym’s info dictionary is both a gift and a loaded gun.
It’s a gift because trading is impossible to debug without diagnostics: fill prices, positions, PnL, and the reasons a trade did or didn’t happen.
It’s a loaded gun because any of that can become leakage if you feed it back into the model.
So my rule is:
- `obs` is what the agent can learn from
- `info` is what I can inspect as an engineer

If a metric is useful for learning, it must be explicitly part of the observation.
If it is useful only for debugging, it lives in info and stays out of the network.
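A toy illustration of that split, with hypothetical names: anything the policy should learn from is concatenated into the observation, everything else is returned as diagnostics for logging only.

```python
from typing import Optional

import numpy as np

def split_outputs(snapshot_features: np.ndarray,
                  position: float,
                  unrealized_pnl: float,
                  fill_price: Optional[float]):
    """Learnable state goes into `obs`; engineer-facing diagnostics go into `info`."""
    # If position or PnL should influence the policy, it must be IN the observation...
    obs = np.concatenate([snapshot_features,
                          [position, unrealized_pnl]]).astype(np.float32)
    # ...while debugging-only details stay in `info` and never reach the network.
    info = {"fill_price": fill_price, "unrealized_pnl": unrealized_pnl}
    return obs, info

# Usage: obs feeds the agent, info feeds the logs.
obs, info = split_outputs(np.zeros(76, dtype=np.float32),
                          position=1.0, unrealized_pnl=0.002, fill_price=7150.5)
```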
Before letting an agent touch market data, I built a toy environment in:
`dummy-gym/gym_dummy/envs/dummy_env.py`

It defines:
- `NUM_FEATURES = 76` (matching the feature vector shape I was working with at the time)
- a discrete action space (`spaces.Discrete(NUM_ACTIONS)`)
- `step()` and `reset()` with predictable shapes

That environment is not “finance”.
It’s a basic empty env for smoke testing.
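A sketch of what such a smoke-test environment looks like; this mirrors the idea, not the exact contents of `dummy_env.py`:

```python
import gym
import numpy as np
from gym import spaces

NUM_FEATURES = 76     # same feature vector shape as the real pipeline
NUM_ACTIONS = 3
EPISODE_STEPS = 100

class DummyEnv(gym.Env):
    """Intentionally empty environment: right shapes, trivial dynamics.

    If the RL pipeline can't even run (or learn) against this,
    the problem is the pipeline, not the market.
    """

    def __init__(self):
        high = 3 * np.ones(NUM_FEATURES, dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.action_space = spaces.Discrete(NUM_ACTIONS)
        self._t = 0

    def reset(self):
        self._t = 0
        return self.observation_space.sample()

    def step(self, action):
        assert self.action_space.contains(action)
        self._t += 1
        obs = self.observation_space.sample()
        reward = 1.0 if action == 1 else 0.0   # trivially learnable signal for sanity checks
        done = self._t >= EPISODE_STEPS
        return obs, reward, done, {}
```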
This is the mental checklist I used before writing more code:
- observations are computable strictly from the current snapshot and the past
- actions are the smallest set the live system can execute reliably
- reward reflects what an executable order would have earned, fees included
- episode starts are sampled so a “reset strategy” can’t become alpha
- `info` is for debugging, not learning
- outages and delayed responses are part of the contract, not noise

That last bullet is the bridge from November’s “503 lesson” into 2020:
In a clean simulator, outages don’t exist.
In BitMEX, outages are a teacher.
This whole series started as “learn ML”.
But by late 2019, the most valuable skill I was building wasn’t “writing a model”. It was defining contracts that make hidden assumptions explicit and failure modes measurable.
That’s the kind of work that survives outside of trading too.
In other words: this is RL as engineering, not RL as demos.
Repo — bitmex-deeprl-research
The full research log and codebase this series documents (collector → features → models → environments).
OpenAI Gym Interface
The classic reset() / step() contract that forces you to make assumptions explicit.
Why maker-style actions instead of instant taker scalping? Because the contract has to match what the system can execute.
With BitMEX, outages and delayed responses make “instant taker scalping” a simulation fantasy unless you model those failure modes explicitly. The first stable step is maker-style actions with conservative assumptions — then we expand.
Supervised learning gives you signals. Trading needs decisions.
A direction predictor doesn’t tell you:

- how much to risk
- when to exit
- what to do when the fill never comes
The environment is where you turn “signal” into “policy”.
“Cheating” here means the agent benefits from an assumption that wouldn’t hold live.
Common examples:

- fills that are assumed instead of earned
- fees that never make it into the reward
- resets that always start at a “nice” place in the data
- outages that simply don’t exist in the simulator
A good environment contract is one where you can point to every assumption and say: “yes, the live system can actually do that.”
In January 2020, I stop describing the contract and start making it real:
bitmex-gym: The Baseline Trading Environment
That’s where I’ll show:

- features turning into actions
- actions turning into fills (or not)
- fills turning into rewards, fees included
- every hidden assumption made painfully visible
Next: bitmex-gym - The Baseline Trading Environment (Where Cheating Starts)
In January 2020 I stop “predicting” and build a Gym environment that turns BitMEX microstructure features into actions, fills, and rewards — and makes every hidden assumption painfully visible.

Previous: The 503 Lesson - Outages as a Signal, Not Just a Bug
My first live alpha monitor was “working”… until BitMEX started replying 503 right when the model got excited. That’s when I learned availability is part of market microstructure.