Dec 29, 2019 - 14 MIN READ
From Prediction to Decision - Designing the Trading Environment Contract

I stopped pretending “a good predictor” was the same thing as “a tradable strategy” and designed a Gym-style environment contract that makes cheating obvious and failure modes measurable.

Axel Domingues

In 2018 I fell in love with Reinforcement Learning.

Not in a “publish a paper” way — in a “this finally feels like the missing piece” way.

Policy gradients taught me how to optimize behavior directly. Actor-critic made it feel trainable. Deep Q learning made it feel practical.

And then BitMEX showed me the part I hadn’t earned yet:

In trading, the algorithm is never the hard part. The contract is.

Because a predictor can be brilliant and still lose money.

A policy can be “optimal” and still be impossible to execute.

A backtest can be “profitable” and still be lying.

So December 2019 is where I draw a line in the sand:

If I can’t define the environment contract precisely, I don’t actually have an RL trading problem — I have a storytelling problem.

This post is about that contract: what the agent sees, what it’s allowed to do, how it gets rewarded, where episodes start/end, and where “cheating” begins.


The moment prediction stopped being enough

By this point in the series:

  • I’ve built a collector and a dataset pipeline (websockets → snapshots → HDF5).
  • I’ve engineered microstructure features and labels.
  • I’ve trained supervised baselines and deep silo models.
  • I’ve run live monitoring and watched the market “talk back.”
  • I’ve learned the 503 lesson: outages aren’t rare; they are a regime.

So yes — I can predict “up” vs “down” ahead of time.

But the market doesn’t pay for predictions. It pays for decisions.

And a decision has hidden structure:

  • When do I trade?
  • How do I enter (maker vs taker)?
  • What do I do when I can’t get filled?
  • What does “do nothing” mean?
  • How do I avoid learning to exploit my simulator?

That’s where the environment comes in.


A Gym contract is a lie detector

I’m using “Gym” in the practical sense — the classic interface:

  • reset() returns an observation (initial state)
  • step(action) returns (observation, reward, done, info)

It’s boring. It’s standardized. And that’s the point.
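
In practice, the whole contract fits in a handful of lines. Here is a generic smoke-test loop that works against any environment implementing that interface (the random policy is a stand-in, not a strategy):

def run_episode(env):
    # the entire contract: reset, act, observe consequences, repeat
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()           # stand-in policy: random actions
        obs, reward, done, info = env.step(action)   # consequences arrive on the next step
        total_reward += reward
    return total_reward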

The contract forces you to be explicit about:

  • what information is available now
  • what consequences happen later
  • what the agent is actually optimizing

In the repo, this “contract-first” approach shows up as two tracks:

  • A real environment scaffold: bitmex-gym/gym_bitmex/envs/bitmex_env.py
  • A minimal test harness: dummy-gym/gym_dummy/envs/dummy_env.py

The dummy env exists for one reason:

I don’t trust my environment until I can run the whole RL pipeline end-to-end on something trivial.

The dummy environment is where you debug interfaces (shapes, action encoding, logging).
The BitMEX environment is where you debug reality (fills, fees, outages, slippage, regime shifts).

The five questions every trading environment must answer

When you say “I’m building an RL environment for trading”, you’re really saying:

  1. What is state? (what the agent sees)
  2. What is action? (what the agent can do)
  3. What is reward? (what the agent optimizes)
  4. What ends an episode? (where learning resets)
  5. What is info? (what I log without leaking)

Let’s go through them the way I implemented them.


1) Observation: what the agent sees (and what it doesn’t)

In my world, the observation starts as the same thing I trained supervised models on:

  • a feature vector computed from order book snapshots
  • already normalized using mean/sigma derived offline
  • with NaNs/infs guarded and zeroed

You can see this pattern directly in the early BitMEX environment code:

  • __get_features() normalizes (x - mean) / sigma and cleans NaNs/infs
  • __load_data() pulls HDF5 chunks with pd.read_hdf(...) and stacks features over files
  • observation_space is defined as a bounded Box(...) so agents can’t “break the world” with weird numeric assumptions

In the repo, the environment’s observation space is explicitly bounded (the idea is “normalized-ish data lives in a small numeric range”):

  • low = -high, high = 3 * ones(...)
  • spaces.Box(low, high, dtype=np.float32)

That is not “financial truth.” It’s a training discipline: keep the numeric interface stable.
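
As a rough sketch of how those pieces fit together (names simplified; mean and sigma come from the offline pipeline, not from inside the environment):

import numpy as np
from gym import spaces

NUM_FEATURES = 76   # placeholder for len(FEATURES_COLS); the real list comes from the feature pipeline

def get_features(raw, mean, sigma):
    # offline-derived normalization; NaNs and infs are guarded and zeroed
    x = (raw - mean) / sigma
    return np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0).astype(np.float32)

# a bounded Box keeps the numeric interface stable
# (in the repo, the Box also spans the extra position-state slots described next)
high = 3 * np.ones(NUM_FEATURES, dtype=np.float32)
observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)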

In the code, this observation does include minimal position state — it’s literally len(FEATURES_COLS) + 5. Those 5 extra variables are:

  1. long_open (binary)
  2. short_open (binary)
  3. current_estimated_reward (a running mark-to-market-ish estimate)
  4. open_position_time_span
  5. close_position_time_span

So the agent is not “blind” to being in a trade — it knows whether it’s in one, and it knows how long it’s been open/closed.

What I intentionally kept out of this baseline is position sizing (and richer inventory details), because an “open” here is basically all-in and a “close” is flat. There’s no 2%, 5%, 10% ladder to reason about yet — so a position-size feature would just be noise.

This changes in 2020, when I move from “entry/exit decisions” to risk-aware agents that can scale in/out incrementally and need explicit sizing/inventory state.
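
A minimal sketch of how that observation gets assembled, using my own field names rather than the repo’s exact variables:

import numpy as np

def build_observation(features, long_open, short_open,
                      current_estimated_reward,
                      open_position_time_span, close_position_time_span):
    # features: the normalized vector from the previous section
    position_state = np.array([
        float(long_open),             # 1.0 if a long is currently open
        float(short_open),            # 1.0 if a short is currently open
        current_estimated_reward,     # running mark-to-market-ish estimate
        open_position_time_span,      # how long the current position has been open
        close_position_time_span,     # how long since the last position was closed
    ], dtype=np.float32)
    return np.concatenate([features, position_state])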

The first hard rule: no future data, ever

By July 2019 I already learned how easy it is to leak with labels.

With an RL environment it’s even easier, because you can leak by accident:

  • using “future mid-price” inside the current state
  • smoothing over missing timestamps with look-ahead
  • letting the agent know “how much data is left”
  • returning diagnostic values in info and then quietly feeding them into the network

The contract is my guardrail: observations must be computable strictly from the current snapshot and the past.
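
One cheap way to enforce that guardrail structurally is to make sure the observation builder can only ever see a slice that ends at the current index. A hypothetical helper, not the repo’s code:

def visible_slice(data, t, window):
    # the only data the observation builder ever receives: rows up to and including index t
    start = max(0, t - window + 1)
    return data[start : t + 1]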


2) Action: start small, make it executable

The first version of an action space should not be “everything a trader could do”.

It should be “the smallest set of actions that can be executed reliably”.

Because every extra action multiplies:

  • the number of states your agent must explore
  • the number of ways your simulator can lie

So I went for a deliberately boring action space:

  • NOOP
  • open/close a maker long
  • open/close a maker short

In bitmex_env.py that shows up as a simple discrete action space:

  • self.action_space = spaces.Discrete(NUM_ACTIONS)

And the comments make the intention explicit:

“Noop, Open/Close maker Long, Open/Close maker Short”
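
Read literally, that is five discrete actions (it could also be three, if open/close were a single toggle). One way to spell out the encoding, as my assumption rather than the repo’s exact mapping:

from enum import IntEnum
from gym import spaces

class Action(IntEnum):
    NOOP = 0
    OPEN_MAKER_LONG = 1
    CLOSE_MAKER_LONG = 2
    OPEN_MAKER_SHORT = 3
    CLOSE_MAKER_SHORT = 4

NUM_ACTIONS = len(Action)
action_space = spaces.Discrete(NUM_ACTIONS)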

This is a design decision that will echo through 2020:

  • taker/HFT strategies break under outages
  • maker-style behavior survives longer because it’s more tolerant to delays and missing steps

So even in December 2019, the environment contract is already steering the project:

The agent can only do what the system can do.


3) Reward: “PnL” is not enough (and can be a trap)

Reward is where most trading environments become fantasy novels.

Because it’s tempting to write:

reward = next_price - current_price

And then celebrate when the curve goes up.

But that reward is not trading.

It ignores:

  • fees (maker/taker)
  • spread and queue priority
  • partial fills
  • inventory risk
  • the cost of being wrong repeatedly

And worst of all:

It rewards agents for predicting, not executing.

So the reward design principle I adopted is:

Reward must be computable from what would have happened if a real order was placed under the contract assumptions.

That means reward is closer to a “trade loop” than a label.

In early versions of bitmex_env.py, you can see the reward logic is already oriented around:

  • open trade timestamps
  • open/close state flags (long/short)
  • arrays that track open prices
  • the difference between realized and unrealized outcomes

Even if the exact reward function evolves later, the shape is set:

  • reward is a consequence of actions across time
  • not a “free label” granted by the dataset

If the reward is too “clean”, the agent will learn to exploit the cleanliness.

This is reward hacking, but in trading it looks like:

  • infinite turnover because fees are missing
  • perfect timing because fills are assumed
  • unrealistic position flipping because inventory risk is ignored

A good reward is one that makes bad strategies feel bad for the same reasons they would fail live.
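
To make the “trade loop, not label” idea concrete, here is a toy fee-aware reward under those contract assumptions. The fee constant is illustrative; the real maker/taker schedule is not hard-coded like this:

FEE_RATE = 0.00025   # illustrative per-leg fee; the real schedule (maker rebate vs taker fee) lives in config

def realized_reward(entry_price, exit_price, side, qty=1.0, fee_rate=FEE_RATE):
    # reward only materializes when a position is closed, and fees hit both legs
    direction = 1.0 if side == "long" else -1.0
    gross = direction * (exit_price - entry_price) * qty
    fees = fee_rate * (entry_price + exit_price) * qty
    return gross - fees

def unrealized_estimate(entry_price, mark_price, side, qty=1.0):
    # the running mark-to-market-ish value exposed in the observation, not paid out as reward
    direction = 1.0 if side == "long" else -1.0
    return direction * (mark_price - entry_price) * qty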

4) Episodes: where cheating starts

Episodes are the most under-discussed form of cheating.

If you let an agent reset whenever it wants, or if reset always starts at a “nice” place in the data, the agent will learn:

  • not trading
  • “waiting for the good parts”
  • farming the easiest regime
  • exploiting knowledge of dataset boundaries

In bitmex_env.py, you can see the start of a very specific defense:

  • the environment scans for valid indices (data sanity mask)
  • it finds min_valid_idx and max_valid_idx where data is valid
  • and then it moves the max_valid_idx backwards by a fixed amount:

“We move the max 2 hours before the last ss to discourage reset exploit”

That’s a tiny line with a big philosophy:

Your environment must prevent the agent from learning “reset strategy” as alpha.

In other words: the environment contract includes how episodes are sampled.
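
A sketch of the sampling rule that comment implies (the snapshot cadence is a placeholder; only the two-hour margin comes from the repo):

import numpy as np

SNAPSHOT_SECONDS = 5                                # placeholder: the collector's snapshot cadence
MARGIN_STEPS = 2 * 60 * 60 // SNAPSHOT_SECONDS      # "2 hours before the last snapshot"

def sample_episode_start(valid_mask, rng):
    # valid_mask: boolean array marking snapshots that passed the data sanity checks
    valid_idx = np.flatnonzero(valid_mask)
    min_valid_idx = valid_idx.min()
    max_valid_idx = valid_idx.max() - MARGIN_STEPS  # keep resets away from the dataset boundary
    return int(rng.integers(min_valid_idx, max_valid_idx))

# usage: sample_episode_start(mask, np.random.default_rng(seed))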


5) Info: log everything, leak nothing

Gym’s info dictionary is both a gift and a loaded gun.

It’s a gift because trading is impossible to debug without diagnostics:

  • prices
  • fills
  • position status
  • fees
  • slippage estimates
  • episode timestamps
  • drawdown, etc.

It’s a loaded gun because any of that can become leakage if you feed it back into the model.

So my rule is:

  • obs is what the agent can learn from
  • info is what I can inspect as an engineer

If a metric is useful for learning, it must be explicitly part of the observation.
If it is useful only for debugging, it lives in info and stays out of the network.
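
In code, this rule is just a discipline about what step() returns. A trivial sketch of the split, with an example set of diagnostics rather than the repo’s exact keys:

import numpy as np

def package_step(features, position_state, reward, done, diagnostics):
    # obs: the only channel the agent is allowed to learn from
    obs = np.concatenate([features, position_state]).astype(np.float32)
    # info: diagnostics for the engineer (prices, fills, fees, drawdown, timestamps, ...)
    info = dict(diagnostics)
    return obs, reward, done, info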


A quick look at the “dummy env” scaffold

Before letting an agent touch market data, I built a toy environment in:

  • dummy-gym/gym_dummy/envs/dummy_env.py

It defines:

  • NUM_FEATURES = 76 (matching the feature vector shape I was working with at the time)
  • a discrete action space (spaces.Discrete(NUM_ACTIONS))
  • a Box observation space with bounded values
  • step() and reset() with predictable shapes

That environment is not “finance”.

It’s a basic empty env for smoke testing.
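
A sketch of what that scaffold amounts to. The trivial dynamics and the episode length are mine; the shapes match the description above:

import gym
import numpy as np
from gym import spaces

NUM_FEATURES = 76
NUM_ACTIONS = 5   # assumption: matches the NOOP + maker long/short encoding sketched earlier

class DummyEnv(gym.Env):
    """Empty environment for smoke-testing the RL pipeline: shapes, action encoding, logging."""

    def __init__(self):
        high = 3 * np.ones(NUM_FEATURES, dtype=np.float32)
        self.observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)
        self.action_space = spaces.Discrete(NUM_ACTIONS)
        self._t = 0

    def reset(self):
        self._t = 0
        return np.zeros(NUM_FEATURES, dtype=np.float32)

    def step(self, action):
        assert self.action_space.contains(action)
        self._t += 1
        obs = np.zeros(NUM_FEATURES, dtype=np.float32)
        reward = 0.0
        done = self._t >= 100          # short, predictable episodes
        return obs, reward, done, {}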


The engineering checklist I kept while designing the contract

This is the mental checklist I used before writing more code:

  • No look-ahead: observations derived from current snapshot + past only
  • Executable actions only: if the live loop can’t do it, the env can’t allow it
  • Reward matches execution: no free labels pretending to be PnL
  • Episode sampling defended: no “reset exploit” farming good regimes
  • Diagnostics separated: info is for debugging, not learning
  • Missing data is a state: if the market goes silent or the API breaks, the environment must represent it (not hide it)

That last bullet is the bridge from November’s “503 lesson” into 2020:

In a clean simulator, outages don’t exist.
In BitMEX, outages are a teacher.


Why this is a portfolio artifact

This whole series started as “learn ML”.

But by late 2019, the most valuable skill I was building wasn’t “writing a model” — it was:

  • defining contracts
  • designing interfaces
  • preventing leakage
  • building systems that fail safely
  • making results reproducible

That’s the kind of work that survives outside of trading too.

In other words: this is RL as engineering, not RL as demos.


Resources

Repo — bitmex-deeprl-research

The full research log and codebase this series documents (collector → features → models → environments).

OpenAI Gym Interface

The classic reset() / step() contract that forces you to make assumptions explicit.

Spinning Up in Deep RL

A practical reference for RL concepts (and a reminder that environments are half the battle).

Stable-Baselines3

Useful tooling for training loops, but only after your environment contract is solid.


What’s next

In January 2020, I stop describing the contract and start making it real:

bitmex-gym: The Baseline Trading Environment

That’s where I’ll show:

  • the first complete BitMEX environment loop
  • the exact assumptions I made (and later regretted)
  • how “cheating” shows up in practice, even when you think you’re being careful