
In January 2020 I stopped “predicting” and built a Gym environment that turns BitMEX microstructure features into actions, fills, and rewards — and makes every hidden assumption painfully visible.
Axel Domingues
2018 was my “RL is real” year.
I spent months reading papers, re-implementing algorithms, and getting that first addictive feeling: an agent can learn behavior.
Then 2019 happened.
BitMEX forced me to admit something uncomfortable: in trading, the algorithm is rarely the bottleneck. The bottleneck is the system.
So December 2019 was about the contract: observation → action → reward → done.
January 2020 is where that contract becomes code.
This is the month I wrote my first “real” trading environment: bitmex-gym.
Not because it was perfect.
But because it was the first time I could point at a file and say:
“Here. This is where the lies start.”
The environment lives here:
bitmex-gym/gym_bitmex/envs/bitmex_env.py

It consumes the 2019 pipeline outputs (features + prices in HDF5) and exposes a Gym-compatible API: reset(), step(action), observation_space, action_space.
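To make the shape of that API concrete, here is a minimal sketch of what such an environment looks like. This is not the repo code: the file path, feature column names, and constructor arguments are placeholders, and the real bitmex_env.py does much more inside step().

import gym
import numpy as np
import pandas as pd
from gym import spaces

class BitmexEnvSketch(gym.Env):
    # Placeholder feature names -- the real FEATURES_COLS come from the 2019 pipeline
    FEATURES_COLS = ["bid_ask_imbalance", "spread", "trade_flow"]

    def __init__(self, h5_path="features_2019.h5"):
        frame = pd.read_hdf(h5_path, key="features")             # 2019 pipeline output
        self.features = frame[self.FEATURES_COLS].to_numpy()
        self.prices = frame[["bestBid", "bestAsk"]].to_numpy()   # illustrative price columns
        dims = len(self.FEATURES_COLS) + 5                       # features + position context
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(dims,), dtype=np.float64)
        self.action_space = spaces.Discrete(3)                   # noop / maker long / maker short
        self.t = 0

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        reward = 0.0                                             # the real env computes trade PnL here
        done = self.t >= len(self.features) - 1
        return self._obs(), reward, done, {}

    def _obs(self):
        position_context = np.zeros(5)                           # long/short flags, PnL, holding times
        return np.concatenate((self.features[self.t], position_context))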
In this environment I had to commit to five things: what the agent observes, what it can do, how fills are simulated, how reward is computed, and when an episode ends.
Everything else is implementation detail.
And every implementation detail is an opportunity to cheat.
The core input is a feature vector produced earlier in 2019.
In code, the observation dimension is:
self.dimentions = len(bitmexEnv.FEATURES_COLS) + 5
The “+5” is the part that matters.
The observation isn’t just “order book features.” It’s features plus position context.
At every step, the environment concatenates:
- long_open (0/1)
- short_open (0/1)
- current_unrealised_pct (scaled and clipped)
- hours_open_position (scaled and clipped)
- hours_closed_position (scaled and clipped)

Those are built right in the step() loop:
open_position_state = np.array([
    long_state,              # long_open: 1.0 if a long is open, else 0.0
    short_state,             # short_open: 1.0 if a short is open, else 0.0
    current_unrealised_pct,  # scaled and clipped
    hours_open_position,     # scaled and clipped
    hours_closed_position    # scaled and clipped
], dtype=np.float64)

outcome_state_features = np.concatenate((features, open_position_state), axis=0)
Early on, I wanted the agent’s inputs to have a bounded-ish range so training didn’t explode the moment a feature distribution shifted.
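A minimal sketch of that scale-and-clip idea, with made-up constants (the actual scaling values in bitmex_env.py aren’t shown here):

import numpy as np

UNREALISED_SCALE = 5.0   # assumption: +/-5% unrealised PnL maps to the edges of [-1, 1]
HOURS_SCALE = 24.0       # assumption: one day of holding maps to 1.0

def bounded(raw, scale):
    # Scale, then clip, so a shifted distribution can't blow up the input range
    return float(np.clip(raw / scale, -1.0, 1.0))

print(bounded(2.5, UNREALISED_SCALE))    # 0.5
print(bounded(-40.0, UNREALISED_SCALE))  # clipped to -1.0
print(bounded(6.0, HOURS_SCALE))         # 0.25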
This baseline environment is basically all-in / all-out: when you open, you deploy the whole portfolio; when you close, you go flat. So there’s no “size” to track — just whether you’re long, short, or flat. That changes later in 2020, when I move to agents that build positions in increments (2%, 5%, …) and “size” becomes part of both state and action space.
I started with the simplest thing that could possibly work:
NUM_ACTIONS = 3
# 0: noop
# 1: try to open/close via a maker-style long
# 2: try to open/close via a maker-style short
The environment keeps internal flags:
- self.open_long
- self.open_short

Actions are “intent” signals, not direct fills.
You place a limit order idea (maker), and the environment simulates whether that order would have been filled as the market moves.
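A toy sketch of that “intent, not fill” split, using hypothetical helper names (the real step() interleaves this with the fill simulation shown below):

NOOP, MAKER_LONG, MAKER_SHORT = 0, 1, 2

def apply_action(action, state):
    # Record an intent; whether it actually fills is resolved later against bid/ask movement
    if action == MAKER_LONG:
        state["pending_long"] = True     # a maker buy is now resting
    elif action == MAKER_SHORT:
        state["pending_short"] = True    # a maker sell is now resting
    # NOOP leaves any resting intent untouched
    return state

state = apply_action(MAKER_LONG, {"pending_long": False, "pending_short": False})
print(state)  # {'pending_long': True, 'pending_short': False}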
There are also dead branches for action == 3 and action == 4 (taker actions). I was already experimenting with taker behavior, but the baseline action space stayed at 3 for training simplicity — meaning those branches are effectively unreachable unless you manually force actions outside the declared space. That’s a scar from iteration.
Here’s the baseline maker approximation I used. The environment sets the current maker prices like this:
current_long_price = avgPrice - MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
current_short_price = avgPrice + MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
Then it checks fills against best bid/ask:
- long fill: bestAsk <= current_long_price
- short fill: bestBid >= current_short_price

This is not how a matching engine works.
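Pulling the approximation together as a runnable sketch: the variable names come from the snippets above, while the constant value and the example numbers are made up.

MAKE_POS_ORDER_WAIT_MINUTES = 3  # illustrative value only

def maker_prices(avgPrice, avgDiffPrice):
    # Quote the buy below and the sell above the recent average price,
    # offset by how much the price typically moves over the wait window
    current_long_price = avgPrice - MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
    current_short_price = avgPrice + MAKE_POS_ORDER_WAIT_MINUTES * avgDiffPrice
    return current_long_price, current_short_price

def check_fills(bestBid, bestAsk, current_long_price, current_short_price):
    long_filled = bestAsk <= current_long_price    # market traded down through our resting buy
    short_filled = bestBid >= current_short_price  # market traded up through our resting sell
    return long_filled, short_filled

long_px, short_px = maker_prices(avgPrice=9000.0, avgDiffPrice=1.5)
print(long_px, short_px)                           # 8995.5 9004.5
print(check_fills(bestBid=9005.0, bestAsk=9006.0,
                  current_long_price=long_px,
                  current_short_price=short_px))   # (False, True)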
But it was enough to test the first question I cared about:
“Can an agent learn anything useful from microstructure features if we give it a simplified execution layer?”
The reward signal is a blend of realized trade PnL and a guardrail against positions that never close (more on that below).
This environment tracks the last filled long price and the last filled short price.
When it has both, it considers the trade “closed” and computes:
trade_reward = ((short_price / long_price) - 1.0) * 100
That formulation works for both long-first and short-first round trips, because the environment always treats the short-side fill as the “sell price” and the long-side fill as the “buy price.”
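A quick numeric check with illustrative prices shows the sign comes out right in both directions:

def trade_reward(long_price, short_price):
    return ((short_price / long_price) - 1.0) * 100

# Long first: buy (long fill) at 9000, later sell (short fill) at 9090 -> +1%
print(trade_reward(long_price=9000.0, short_price=9090.0))   # 1.0

# Short first: sell (short fill) at 9000, later buy back (long fill) at 9090 -> about -0.99%
print(trade_reward(long_price=9090.0, short_price=9000.0))   # -0.990...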
A classic RL failure mode in trading environments is the agent that never closes: it sits on an open position indefinitely, and nothing ever forces it to realize the outcome.
So I added a crude but effective guardrail (a sketch of the idea follows below).
It’s not elegant.
But it forces the agent to learn a behavior that survives reality:
“Trades end.”
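The post above doesn’t pin down the exact guardrail, so treat this as one plausible shape rather than the real thing: a maximum holding time after which the environment force-closes the position at the current price. Every name and number here is hypothetical.

MAX_HOURS_OPEN = 8.0  # hypothetical cutoff

def maybe_force_close(hours_open, long_price, current_price):
    # Force-close a stale long so the agent can't "win" by never closing
    if hours_open >= MAX_HOURS_OPEN:
        forced_reward = ((current_price / long_price) - 1.0) * 100  # realize whatever PnL is there
        return True, forced_reward
    return False, 0.0

print(maybe_force_close(hours_open=9.0, long_price=9000.0, current_price=8950.0))
# (True, -0.555...)  -> the loss gets realized instead of hidden forever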
I ran training as short episodes, not one infinite time series.
Two reasons: coverage (see many different microstructure regimes instead of memorizing one stretch of data) and credit assignment (episodes need to be long enough for an action to realize PnL).
So the environment uses:
- randomized initial actions, so the agent can be spawned with a position already open (random_init_action)
- episode-based time limits (use_episode_based_time) — sampled around MEAN_STEPS_PER_EPISODE (≈1 hour on average) and clamped to MIN_STEPS_PER_EPISODE (≈20 min)
- a step-skip factor (step_skip) that controls how many raw ticks you fast-forward per agent decision

The result is a training distribution that is less deterministic than “always start at the beginning of the file.”
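A sketch of how those knobs might fit together. The constant names come from the environment; the values and the sampling distribution are assumptions for illustration.

import numpy as np

MEAN_STEPS_PER_EPISODE = 60   # roughly 1 hour of decisions on average
MIN_STEPS_PER_EPISODE = 20    # roughly 20 minutes

rng = np.random.default_rng(0)

def sample_episode(n_rows, step_skip=4, random_init_action=True):
    # Episode length: sampled around the mean, clamped from below
    episode_len = max(MIN_STEPS_PER_EPISODE,
                      int(rng.exponential(MEAN_STEPS_PER_EPISODE)))
    # Random starting row, leaving room for the whole episode at this step_skip
    start = int(rng.integers(0, n_rows - episode_len * step_skip))
    # Optionally spawn with a position already open (good or bad)
    init_position = rng.choice(["flat", "long", "short"]) if random_init_action else "flat"
    return start, episode_len, init_position

print(sample_episode(n_rows=500_000))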
In this project the goal was coverage and robustness:
- The agent sees many different microstructure regimes instead of memorizing one start date.
- Episodes are long enough for an action to realize PnL (otherwise training collapses into noise).
- The environment can intentionally spawn the agent with a position already open (good or bad), forcing it to learn position management, not just entry timing.
Writing a Gym environment is where you find out how easy it is to lie to yourself.
Here are the big cheats baked into this baseline: maker orders fill cleanly whenever price crosses them (no queue priority, no partial fills), the exchange never has outages, and position sizing is all-in / all-out.
And yet… I still built it.
Because the only way to fix these lies is to name them, and the only way to name them is to ship an environment and watch it break.
This is how I validated the environment wasn’t completely broken before training:
- Confirm the observation length matches len(FEATURES_COLS) + 5.
- Verify rewards are finite and the env doesn’t crash on invalid rows.
- Make sure open_long / open_short transitions make sense.
- Check the reward distribution: if rewards are always zero or always extreme, something is wrong.
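That checklist is easy to automate. A sketch of the smoke test, assuming the env is registered under a hypothetical id and that FEATURES_COLS is reachable on the unwrapped env (the real constructor and id may differ):

import numpy as np
import gym
import gym_bitmex  # noqa: F401 -- assumed to register the env on import

env = gym.make("bitmex-v0")  # hypothetical env id
obs = env.reset()
assert obs.shape[0] == len(env.unwrapped.FEATURES_COLS) + 5, "observation width drifted"

for _ in range(500):
    obs, reward, done, info = env.step(env.action_space.sample())
    assert np.isfinite(reward), "non-finite reward -- check invalid rows"
    assert np.isfinite(obs).all(), "non-finite observation"
    if done:
        obs = env.reset()

print("smoke test passed")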
If your environment lies, your agent learns to exploit the lie.
Why such a tiny action space? Because I was debugging world mechanics, not strategy.
A big action space makes training harder and makes bugs harder to see. In this baseline I wanted to prove that the environment itself can be trusted end to end: observations, fills, rewards, and episode termination.
Everything else can be layered later.
Is this environment realistic? No — and that’s the point.
This is the “controlled lab rig.” It’s where I identify what I’m missing: queue priority, partial fills, outages, regime shifts, and all the ways markets punish naive assumptions.
What’s the biggest lie? Fill certainty.
If you let the agent assume that limit orders fill cleanly whenever price crosses, you give it a superpower. In later 2020 work, outages and microstructure friction forced me to pay that debt.
Now that the environment exists, the next problem is evaluation discipline.
A trading agent that “learns” in-sample is easy.
A trading agent that survives walk-forward evaluation is rare.
Next: Evaluation Discipline - Walk-Forward Backtesting Inside the Gym
Training reward was lying to me. So I turned evaluation into a first-class system - chronological splits, deterministic runs, and walk-forward backtests that survive the next dataset.

Previous: From Prediction to Decision - Designing the Trading Environment Contract
I stopped pretending “a good predictor” was the same thing as “a tradable strategy” and designed a Gym-style environment contract that makes cheating obvious and failure modes measurable.