
In January 2020 I stopped “predicting” and built a Gym environment that turns BitMEX microstructure features into actions, fills, and rewards — and makes every hidden assumption painfully visible.
Axel Domingues
2018 was my “RL is real” year.
I spent months reading papers, re-implementing algorithms, and getting that first addictive feeling: an agent can learn behavior.
Then 2019 happened.
BitMEX forced me to admit something uncomfortable: in trading, the algorithm is rarely the bottleneck. The bottleneck is the system.
So December 2019 was about the contract: observation → action → reward → done.
January 2020 is where that contract becomes code.
This is the month I wrote my first “real” trading environment: bitmex-gym.
Not because it was perfect.
But because it was the first time I could point at a file and say:
“Here. This is where the lies start.”
The environment lives here:
bitmex-gym/gym_bitmex/envs/bitmex_env.py

It consumes the 2019 pipeline outputs (features + prices in HDF5) and exposes a Gym-compatible API: reset(), step(action), observation_space, action_space.
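To make the shape of that API concrete, here is a minimal sketch of what such an environment looks like. This is not the repo code: the file path, feature column names, and constructor arguments are placeholders, and the real bitmex_env.py does much more inside step().

import gym
import numpy as np
import pandas as pd
from gym import spaces

class BitmexEnvSketch(gym.Env):
    # Placeholder feature names -- the real FEATURES_COLS come from the 2019 pipeline
    FEATURES_COLS = ["bid_ask_imbalance", "spread", "trade_flow"]

    def __init__(self, h5_path="features_2019.h5"):
        frame = pd.read_hdf(h5_path, key="features")             # 2019 pipeline output
        self.features = frame[self.FEATURES_COLS].to_numpy()
        self.prices = frame[["bestBid", "bestAsk"]].to_numpy()   # illustrative price columns
        dims = len(self.FEATURES_COLS) + 5                       # features + position context
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(dims,), dtype=np.float64)
        self.action_space = spaces.Discrete(3)                   # noop / maker long / maker short
        self.t = 0

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        reward = 0.0                                             # the real env computes trade PnL here
        done = self.t >= len(self.features) - 1
        return self._obs(), reward, done, {}

    def _obs(self):
        position_context = np.zeros(5)                           # long/short flags, PnL, holding times
        return np.concatenate((self.features[self.t], position_context))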
In this environment I had to commit to five things: what the agent observes, what it can do, how fills are simulated, how reward is computed, and when an episode ends.
Everything else is implementation detail.
And every implementation detail is an opportunity to cheat.
The core input is a feature vector produced earlier in 2019.
In code, the observation dimension is:
self.dimentions = len(bitmexEnv.FEATURES_COLS) + 5
The “+5” is the part that matters.
The observation isn’t just “order book features.” It’s features plus position context.
At every step, the environment concatenates:
- long_open (0/1)
- short_open (0/1)
- current_unrealised_pct (scaled and clipped)
- hours_open_position (scaled and clipped)
- hours_closed_position (scaled and clipped)

Those are built right in the step() loop:
open_position_state = np.array([
    long_state,              # long_open: 1.0 if a long is open, else 0.0
    short_state,             # short_open: 1.0 if a short is open, else 0.0
    current_unrealised_pct,  # scaled and clipped
    hours_open_position,     # scaled and clipped
    hours_closed_position    # scaled and clipped
], dtype=np.float64)

outcome_state_features = np.concatenate((features, open_position_state), axis=0)
Early on, I wanted the agent’s inputs to have a bounded-ish range so training didn’t explode the moment a feature distribution shifted.
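A minimal sketch of that scale-and-clip idea, with made-up constants (the actual scaling values in bitmex_env.py aren’t shown here):

import numpy as np

UNREALISED_SCALE = 5.0   # assumption: +/-5% unrealised PnL maps to the edges of [-1, 1]
HOURS_SCALE = 24.0       # assumption: one day of holding maps to 1.0

def bounded(raw, scale):
    # Scale, then clip, so a shifted distribution can't blow up the input range
    return float(np.clip(raw / scale, -1.0, 1.0))

print(bounded(2.5, UNREALISED_SCALE))    # 0.5
print(bounded(-40.0, UNREALISED_SCALE))  # clipped to -1.0
print(bounded(6.0, HOURS_SCALE))         # 0.25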
This baseline environment is basically all-in / all-out: when you open, you deploy the whole portfolio; when you close, you go flat. So there’s no “size” to track — just whether you’re long, short, or flat. That changes later in 2020, when I move to agents that build positions in increments (2%, 5%, …) and “size” becomes part of both state and action space.
I started with the simplest thing that could possibly work:
NUM_ACTIONS = 3
# 0: noop
# 1: try to open/close via a maker-style long
# 2: try to open/close via a maker-style short
The environment keeps internal flags:
- self.open_long
- self.open_short

Actions are “intent” signals, not direct fills.
You place a limit order idea (maker), and the environment simulates whether that order would have been filled as the market moves.
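A toy sketch of that “intent, not fill” split, using hypothetical helper names (the real step() interleaves this with the fill simulation shown below):

NOOP, MAKER_LONG, MAKER_SHORT = 0, 1, 2

def apply_action(action, state):
    # Record an intent; whether it actually fills is resolved later against bid/ask movement
    if action == MAKER_LONG:
        state["pending_long"] = True     # a maker buy is now resting
    elif action == MAKER_SHORT:
        state["pending_short"] = True    # a maker sell is now resting
    # NOOP leaves any resting intent untouched
    return state

state = apply_action(MAKER_LONG, {"pending_long": False, "pending_short": False})
print(state)  # {'pending_long': True, 'pending_short': False}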
There are also dead branches for action == 3 and action == 4 (taker actions). I was already experimenting with taker behavior, but the baseline action space stayed at 3 for training simplicity — meaning those branches are effectively unreachable unless you manually force actions outside the declared space. That’s a scar from iteration.
Here’s the baseline maker approximation I used. The environment sets the current maker prices like this:
current_long_price = avgPrice - MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
current_short_price = avgPrice + MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
Then it checks fills against best bid/ask:
- long fill: bestAsk <= current_long_price
- short fill: bestBid >= current_short_price

This is not how a matching engine works.
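Pulling the approximation together as a runnable sketch: the variable names come from the snippets above, while the constant value and the example numbers are made up.

MAKE_POS_ORDER_WAIT_MINUTES = 3  # illustrative value only

def maker_prices(avgPrice, avgDiffPrice):
    # Quote the buy below and the sell above the recent average price,
    # offset by how much the price typically moves over the wait window
    current_long_price = avgPrice - MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
    current_short_price = avgPrice + MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
    return current_long_price, current_short_price

def check_fills(bestBid, bestAsk, current_long_price, current_short_price):
    long_filled = bestAsk <= current_long_price    # market traded down through our resting buy
    short_filled = bestBid >= current_short_price  # market traded up through our resting sell
    return long_filled, short_filled

long_px, short_px = maker_prices(avgPrice=9000.0, avgDiffPrice=1.5)
print(long_px, short_px)                           # 8995.5 9004.5
print(check_fills(bestBid=9005.0, bestAsk=9006.0,
                  current_long_price=long_px,
                  current_short_price=short_px))   # (False, True)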
But it was enough to test the first question I cared about:
“Can an agent learn anything useful from microstructure features if we give it a simplified execution layer?”
The reward signal is a blend of realized trade PnL and a guardrail against positions that never close (more on that below).
This environment tracks the last filled long price and the last filled short price.
When it has both, it considers the trade “closed” and computes:
trade_reward = ((short_price / long_price) - 1.0) * 100
That formulation works for both long-first and short-first round trips, because the environment always treats the short-side fill as the “sell price” and the long-side fill as the “buy price.”
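A quick numeric check with illustrative prices shows the sign comes out right in both directions:

def trade_reward(long_price, short_price):
    return ((short_price / long_price) - 1.0) * 100

# Long first: buy (long fill) at 9000, later sell (short fill) at 9090 -> +1%
print(trade_reward(long_price=9000.0, short_price=9090.0))   # 1.0

# Short first: sell (short fill) at 9000, later buy back (long fill) at 9090 -> about -0.99%
print(trade_reward(long_price=9090.0, short_price=9000.0))   # -0.990...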
A classic RL failure mode in trading environments is the agent that never closes: it sits on an open position indefinitely, and nothing ever forces it to realize the outcome.
So I added a crude but effective guardrail (a sketch of the idea follows below).
It’s not elegant.
But it forces the agent to learn a behavior that survives reality:
“Trades end.”
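The post above doesn’t pin down the exact guardrail, so treat this as one plausible shape rather than the real thing: a maximum holding time after which the environment force-closes the position at the current price. Every name and number here is hypothetical.

MAX_HOURS_OPEN = 8.0  # hypothetical cutoff

def maybe_force_close(hours_open, long_price, current_price):
    # Force-close a stale long so the agent can't "win" by never closing
    if hours_open >= MAX_HOURS_OPEN:
        forced_reward = ((current_price / long_price) - 1.0) * 100  # realize whatever PnL is there
        return True, forced_reward
    return False, 0.0

print(maybe_force_close(hours_open=9.0, long_price=9000.0, current_price=8950.0))
# (True, -0.555...)  -> the loss gets realized instead of hidden forever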
I ran training as short episodes, not one infinite time series.
Two reasons: coverage (see many different microstructure regimes instead of memorizing one stretch of data) and credit assignment (episodes need to be long enough for an action to realize PnL).
So the environment uses:
- randomized initial actions, so the agent can be spawned with a position already open (random_init_action)
- episode-based time limits (use_episode_based_time) — sampled around MEAN_STEPS_PER_EPISODE (≈1 hour on average) and clamped to MIN_STEPS_PER_EPISODE (≈20 min)
- a step-skip factor (step_skip) that controls how many raw ticks you fast-forward per agent decision

The result is a training distribution that is less deterministic than “always start at the beginning of the file.”
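A sketch of how those knobs might fit together. The constant names come from the environment; the values and the sampling distribution are assumptions for illustration.

import numpy as np

MEAN_STEPS_PER_EPISODE = 60   # roughly 1 hour of decisions on average
MIN_STEPS_PER_EPISODE = 20    # roughly 20 minutes

rng = np.random.default_rng(0)

def sample_episode(n_rows, step_skip=4, random_init_action=True):
    # Episode length: sampled around the mean, clamped from below
    episode_len = max(MIN_STEPS_PER_EPISODE,
                      int(rng.exponential(MEAN_STEPS_PER_EPISODE)))
    # Random starting row, leaving room for the whole episode at this step_skip
    start = int(rng.integers(0, n_rows - episode_len * step_skip))
    # Optionally spawn with a position already open (good or bad)
    init_position = rng.choice(["flat", "long", "short"]) if random_init_action else "flat"
    return start, episode_len, init_position

print(sample_episode(n_rows=500_000))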
In this project the goal was coverage and robustness:
- The agent sees many different microstructure regimes instead of memorizing one start date.
- Episodes are long enough for an action to realize PnL (otherwise training collapses into noise).
- The environment can intentionally spawn the agent with a position already open (good or bad), forcing it to learn position management, not just entry timing.
Writing a Gym environment is where you find out how easy it is to lie to yourself.
Here are the big cheats baked into this baseline: maker orders fill cleanly whenever price crosses them (no queue priority, no partial fills), the exchange never has outages, and position sizing is all-in / all-out.
And yet… I still built it.
Because the only way to fix these lies is to name them, and the only way to name them is to ship an environment and watch it break.
This is how I validated the environment wasn’t completely broken before training:
- Confirm the observation length matches len(FEATURES_COLS) + 5.
- Verify rewards are finite and the env doesn’t crash on invalid rows.
- Make sure open_long / open_short transitions make sense.
- Check the reward distribution: if rewards are always zero or always extreme, something is wrong.
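That checklist is easy to automate. A sketch of the smoke test, assuming the env is registered under a hypothetical id and that FEATURES_COLS is reachable on the unwrapped env (the real constructor and id may differ):

import numpy as np
import gym
import gym_bitmex  # noqa: F401 -- assumed to register the env on import

env = gym.make("bitmex-v0")  # hypothetical env id
obs = env.reset()
assert obs.shape[0] == len(env.unwrapped.FEATURES_COLS) + 5, "observation width drifted"

for _ in range(500):
    obs, reward, done, info = env.step(env.action_space.sample())
    assert np.isfinite(reward), "non-finite reward -- check invalid rows"
    assert np.isfinite(obs).all(), "non-finite observation"
    if done:
        obs = env.reset()

print("smoke test passed")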
If your environment lies, your agent learns to exploit the lie.
Why such a tiny action space? Because I was debugging world mechanics, not strategy.
A big action space makes training harder and makes bugs harder to see. In this baseline I wanted to prove that the environment itself can be trusted end to end: observations, fills, rewards, and episode termination.
Everything else can be layered later.
Is this environment realistic? No — and that’s the point.
This is the “controlled lab rig.” It’s where I identify what I’m missing: queue priority, partial fills, outages, regime shifts, and all the ways markets punish naive assumptions.
What’s the biggest lie? Fill certainty.
If you let the agent assume that limit orders fill cleanly whenever price crosses, you give it a superpower. In later 2020 work, outages and microstructure friction forced me to pay that debt.
Now that the environment exists, the next problem is evaluation discipline.
A trading agent that “learns” in-sample is easy.
A trading agent that survives walk-forward evaluation is rare.
Next: Evaluation Discipline - Walk-Forward Backtesting Inside the Gym
Training reward was lying to me. So I turned evaluation into a first-class system - chronological splits, deterministic runs, and walk-forward backtests that survive the next dataset.

Previous: From Prediction to Decision - Designing the Trading Environment Contract
I stopped pretending “a good predictor” was the same thing as “a tradable strategy” and designed a Gym-style environment contract that makes cheating obvious and failure modes measurable.