
Training reward was lying to me. So I turned evaluation into a first-class system - chronological splits, deterministic runs, and walk-forward backtests that survive the next dataset.
Axel Domingues
In 2018 I spent the year inside reinforcement learning as algorithms: policy gradients, actor-critic, the whole zoo.
2019 was the year I finally accepted the uncomfortable truth: trading is not an algorithm. It’s a system. And the system fails in ways models don’t warn you about.
So in January 2020 I built bitmex-gym: a minimal trading environment that could train end-to-end on BitMEX data.
This February post is about the next step that saved me from weeks of self-deception:
Evaluation discipline.
Not “look at the training curve and feel good.” Not “run one backtest and screenshot the best.”
Walk-forward evaluation — inside the same Gym contract — with explicit rules for what counts as evidence.
RL gives you an extremely convincing illusion: the training reward climbs, so it feels like the agent has learned something real.
And then you run the same policy on a different day of data and… nothing. Or worse: it behaves confidently in exactly the wrong regime.
By early 2020 I had enough moving parts that any of them could create fake wins: the data slicing, the reset randomization, the exploration noise, the reward design.
If I didn’t lock evaluation down, I wasn’t doing research. I was doing curve worship.
I’m using BitMEX historical files (HDF5) as the underlying time series. The key rule is simple: train on earlier data, evaluate only on data that comes strictly after it.
Then you repeat this process by rolling forward.
That’s it.
No random shuffling. No mixing days. No “test set” that you touch 30 times while tuning hyperparameters.
In an RL pipeline, peeking can happen without you noticing:
- normalization computed over the full dataset
- episode slices sampled from the entire timeline
- “validation” episodes drawn from the same pool as training
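The first item on that list is the one that bites quietest: normalization statistics fitted over the whole dataset leak the future into training. The fix is to fit them on the training files only and reuse them, frozen, for evaluation. A minimal sketch, assuming pandas-readable HDF5 files and an illustrative feature_columns list (neither detail is taken from the repo):

import pandas as pd

TRAIN_FILES = ['XBTUSD-data-25-12-2018.h5', 'XBTUSD-data-26-12-2018.h5']  # earlier files
EVAL_FILES = ['XBTUSD-data-18-06-2019.h5', 'XBTUSD-data-19-06-2019.h5']   # later files

def fit_normalization(files, columns):
    # Statistics come from the training window ONLY.
    train = pd.concat(pd.read_hdf(path) for path in files)
    return train[columns].mean(), train[columns].std()

def apply_normalization(df, columns, mean, std):
    # The same frozen statistics are applied to any window, train or eval.
    out = df.copy()
    out[columns] = (out[columns] - mean) / (std + 1e-8)
    return out

feature_columns = ['bid', 'ask', 'volume']  # illustrative feature names
mean, std = fit_normalization(TRAIN_FILES, feature_columns)
eval_frame = apply_normalization(pd.read_hdf(EVAL_FILES[0]), feature_columns, mean, std)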
Everything I needed was already present in the environment — I just had to treat it as two modes: training and evaluation.
Inside bitmex-gym/gym_bitmex/envs/bitmex_env.py, the environment loads a fixed list of HDF5 files:
# bitmex-gym/gym_bitmex/envs/bitmex_env.py
FILE_NAMES = ['XBTUSD-data-25-12-2018.h5', 'XBTUSD-data-26-12-2018.h5', ...]
# Real validation files:
# FILE_NAMES = ['XBTUSD-data-18-06-2019.h5', 'XBTUSD-data-19-06-2019.h5', ...]
At this stage of the project, my “split mechanism” was blunt but effective:
- Training run: FILE_NAMES points at earlier files.
- Evaluation run: FILE_NAMES points at later files.

Not elegant, but it forced the right habit: evaluation is a separate run on separate data.
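A slightly less manual version of the same habit is to derive the split from the dates embedded in the filenames, so the two lists can never interleave. A sketch built around the DD-MM-YYYY naming shown above; the helper is mine, not part of bitmex-gym:

from datetime import datetime

ALL_FILES = [
    'XBTUSD-data-25-12-2018.h5',
    'XBTUSD-data-26-12-2018.h5',
    'XBTUSD-data-18-06-2019.h5',
    'XBTUSD-data-19-06-2019.h5',
]

def file_date(name):
    # 'XBTUSD-data-25-12-2018.h5' -> datetime(2018, 12, 25)
    day, month, year = name[:-len('.h5')].split('-')[-3:]
    return datetime(int(year), int(month), int(day))

def chronological_split(files, cutoff):
    # Everything strictly before the cutoff trains; everything at or after it evaluates.
    ordered = sorted(files, key=file_date)
    train = [f for f in ordered if file_date(f) < cutoff]
    evaluation = [f for f in ordered if file_date(f) >= cutoff]
    return train, evaluation

TRAIN_FILES, EVAL_FILES = chronological_split(ALL_FILES, cutoff=datetime(2019, 6, 1))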
Training needs randomness. Evaluation needs repeatability.
In the same file there’s a switch:
# bitmex-gym/gym_bitmex/envs/bitmex_env.py
TAKE_RANDOM_INIT_ACTION = True
And in reset() the environment does a lot of randomized setup:
- a random initial action (0, 1, 2 → flat / long / short)
- a random starting point in the data (the “spawn”)
- a random step_skip hyperparameter (how many timesteps to jump per action)

That random spawn isn’t a bug — it’s one of the best ideas in this project, because it broadens coverage and reduces “luck at the starting timestamp.”
But for evaluation I needed to turn randomness into a controlled variable. So my evaluation rule became: fixed files, fixed seeds, and no random initial action, so every evaluation run starts from the same conditions.
That lets me compare runs and know if improvements are real or just different dice rolls.
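In practice that meant flipping the environment’s own switches before an evaluation run: later files, no random initial action, and a fixed seed for everything the env samples through numpy. A sketch under two assumptions the snippets here don’t confirm: that the module-level constants are read at reset time, and that the env class is called BitmexEnv:

import numpy as np
import gym_bitmex.envs.bitmex_env as bitmex_env

EVAL_SEED = 1234
EVAL_FILES = ['XBTUSD-data-18-06-2019.h5', 'XBTUSD-data-19-06-2019.h5']

def make_eval_env():
    bitmex_env.FILE_NAMES = EVAL_FILES            # held-out, later files only
    bitmex_env.TAKE_RANDOM_INIT_ACTION = False    # no random starting position
    np.random.seed(EVAL_SEED)                     # the env draws its randomness via np.random
    return bitmex_env.BitmexEnv()                 # class name is an assumption

Seeding the global np.random is blunt, but it matches how the environment draws its episode lengths in the snippet below.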
The environment was already built around short, measurable episodes (with a bit of variance so training doesn’t lock onto one fixed length):
# Episode length is expressed in *steps* (dataset-step units), not minutes.
MEAN_STEPS_PER_EPISODE = 216000 # ~1 hour on average
MIN_STEPS_PER_EPISODE = 72000 # ~20 minutes minimum
if self.use_episode_based_time:
    self.current_limit_steps_per_episode = int(
        np.random.normal(MEAN_STEPS_PER_EPISODE, STD_STEPS_PER_EPISODE)
    )
    self.current_limit_steps_per_episode = max(
        self.current_limit_steps_per_episode, MIN_STEPS_PER_EPISODE
    )
STEP_SKIP = 100 # hyperparameter: how many raw ticks to jump per action
Short episodes were not “unrealistic.” In this setup they’re roughly ~1 hour on average, sometimes shorter (down to ~20 minutes), and that variability is part of the training curriculum. They were the practical way to make learning possible: shorter horizons mean faster feedback, easier credit assignment, and many more independent samples from the same dataset.
The evaluation trick was: keep the same episode structure, but run it walk-forward on held-out files.
So instead of “one long heroic backtest,” I got many short, deterministic evaluation episodes rolled forward across the held-out files.
That yields an evaluation distribution that you can actually reason about.
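“Reason about” here just means looking at the spread, not a single headline number. Something like this is enough (the percentile choices are mine):

import numpy as np

def summarize(scores):
    # scores: one number per deterministic evaluation episode (e.g. net PnL)
    s = np.asarray(scores, dtype=float)
    return {
        'episodes': len(s),
        'mean': float(s.mean()),
        'median': float(np.median(s)),
        'p5': float(np.percentile(s, 5)),     # the bad tail matters more than the mean
        'p95': float(np.percentile(s, 95)),
        'fraction_positive': float((s > 0).mean()),
    }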
This is what I treated as the evaluation contract:
1. Train on a block of early files (e.g., late 2018 / early 2019).
2. Train for a fixed budget: same number of timesteps, same hyperparameters. No “train until it looks good.”
3. Switch FILE_NAMES to later files and run deterministic evaluation episodes.
4. Move the window forward and repeat the process. If the policy only works on one window, it’s not a policy.
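Stitched together, the contract is a short loop. run_episode leans only on the standard Gym reset/step interface; train_policy, make_env, and the window list are placeholders for whatever trainer and split you actually use:

def run_episode(env, policy):
    # One deterministic evaluation episode: greedy actions, no exploration noise.
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = policy.act(obs, deterministic=True)   # placeholder policy API
        obs, reward, done, info = env.step(action)     # classic 4-tuple gym step
        total_reward += reward
    return total_reward

def walk_forward(windows, n_eval_episodes=50):
    # windows: list of (train_files, eval_files) pairs, strictly chronological.
    results = []
    for train_files, eval_files in windows:
        policy = train_policy(make_env(train_files))   # fixed budget, fixed hyperparameters
        eval_env = make_env(eval_files)                # later files only, deterministic resets
        scores = [run_episode(eval_env, policy) for _ in range(n_eval_episodes)]
        results.append(scores)                         # a distribution per window, not one number
    return results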
Reward can be a proxy. It is not a metric.
So I started tracking a small set of sanity metrics during evaluation.
And I always compared against two baselines.
If the agent didn’t beat those, I didn’t care how pretty the training curve looked.
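To make that concrete, here is the kind of per-episode bookkeeping I mean. The specific metrics (net PnL after fees, max drawdown, trade count) and the two baselines (stay flat, buy and hold) are my illustration of the idea, not a list quoted from the repo:

import numpy as np

def max_drawdown(equity_curve):
    # Largest peak-to-trough drop of the per-step equity series.
    equity = np.asarray(equity_curve, dtype=float)
    peaks = np.maximum.accumulate(equity)
    return float(np.max(peaks - equity))

def summarize_episode(equity_curve, fills, fees_paid):
    # Example sanity metrics for one evaluation episode.
    return {
        'net_pnl': float(equity_curve[-1] - equity_curve[0] - fees_paid),
        'max_drawdown': max_drawdown(equity_curve),
        'num_trades': len(fills),
    }

# Two trivial baselines (my choice): do nothing, and buy-and-hold the evaluation window.
def flat_pnl(prices):
    return 0.0

def buy_and_hold_pnl(prices):
    return float(prices[-1] - prices[0])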
It’s tempting to bolt evaluation on as a separate script.
But I learned quickly that if evaluation isn’t part of the same environment contract, you end up comparing apples to a different simulator.
Keeping evaluation inside Gym forced consistency:
The only thing allowed to change is the data window and the randomness policy.
That’s how you catch cheating early: by refusing to change the rules when it’s time to judge the agent.
This month hurt.
Most models that looked promising in training didn’t survive walk-forward evaluation.
And that wasn’t a tragedy. It was information.
Because once evaluation became strict, improvements became meaningful.
If you want to follow the exact environment behavior discussed here:
- bitmex-gym/gym_bitmex/envs/bitmex_env.py — reset() is the heart of episode randomization and the switches that separate training from evaluation.

And if you’re browsing the larger repo:
Why can’t I just trust training reward?
Because training reward is the result of your training distribution, your exploration noise, your reset policy, and your reward design.
It’s useful for debugging, but it’s not evidence that the policy generalizes. Walk-forward evaluation is the first time the policy has to face the future.
Doesn’t all that random spawning make the environment unrealistic?
Not by itself.
Random spawning is a training curriculum: it forces the agent to see many different market states and reduces “luck” tied to a single starting timestamp.
The realism problems usually come from other places (execution assumptions, slippage, fees, inventory/risk simplifications).
The key discipline is: keep training randomization, but evaluate with controlled randomness and chronological splits.
So what counts as a win?
A win is boring: the policy clears the baselines across several held-out windows, not just the one you happened to screenshot.
The goal isn’t to find the best equity curve. It’s to build an evaluation loop that can tell you when you’re fooling yourself.
Now that I could trust evaluation, the next obvious question was:
How do I shape reward and add constraints without teaching the agent to lie?
That’s March 2020:
Reward Shaping Without Lying - Penalties, Constraints, and the First Real Fixes
In March 2020, I stopped treating reward like a number and started treating it like a contract—pay real costs, punish real risk, and don’t teach the agent to win a video game.
bitmex-gym - The Baseline Trading Environment (Where Cheating Starts)
In January 2020 I stop “predicting” and build a Gym environment that turns BitMEX microstructure features into actions, fills, and rewards — and makes every hidden assumption painfully visible.