
In late 2020 I stopped trusting hero backtests. I built a batch runner + a walk-forward evaluation harness, added eval gates, and discovered an uncomfortable truth: shorter training often wins.
Axel Domingues
In 2018 I treated RL like a box of tricks: pick an algorithm, hit train, stare at curves.
By late 2020 BitMEX had trained a different muscle: system discipline. If the pipeline isn’t reproducible, the result isn’t real.
October’s milestone was capability: bitmex-management-gym gave the agent new tools (position sizing, management-style decisions) instead of only “directional bets”.
November’s milestone was credibility: I rebuilt training and evaluation as a batch process so that “good results” had to survive multiple seeds, multiple training lengths, and a fixed set of evaluation slices.
When you only run one experiment, you can always find a story that explains why the result looks good.
And because my environments contain intentional variance (random init behavior and variable episode lengths), single-run evaluation is basically inviting self-deception.
Instead of “train a model”, the unit of work became:
Train a family of configs across seeds and steps, then evaluate them across a fixed set of slices.
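In code, a “family” is nothing fancy: the cross product of seeds and training lengths, with every knob baked into the model name. A minimal sketch (an illustrative helper, not code from the repo):

from itertools import product

# Illustrative sketch, not code from the repo: a "family of configs" is the cross
# product of seeds and training lengths, with every knob encoded in the model name.
BASE_NAME = "auto_bitmex_108len_099gamma_2000step_100stack_00sup_norm_reward"
SEEDS = range(1, 101, 5)                       # mirrors the batch loop below: 1, 6, 11, ...
TIMESTEPS = [1_000_000, 1_500_000, 2_000_000]  # training lengths being swept

def run_specs():
    """Yield one run spec per (seed, training length) combination."""
    for seed, steps in product(SEEDS, TIMESTEPS):
        label = f"{steps / 1e6:g}M".replace(".", "_")   # 1_5M, 2M, ...
        yield {
            "model_name": f"{BASE_NAME}_{label}_{seed}",
            "random_seed": seed,
            "timesteps": steps,
        }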
That’s the mindset shift. The implementation was… gloriously Windows.
The key file in the repo is a Windows batch script that turns “training” into a repeatable experiment:
OpenAI/baselines/train-test-models.bat
Here’s the vibe (trimmed):
rem The model name encodes the knobs being varied (reward spec, step cadence, gamma, stack size, timesteps).
set MODEL_NAME=auto_bitmex_108len_099gamma_2000step_100stack_00sup_norm_reward_1_5M_selu_truncated
set TIMESTEPS=1500000

rem Sweep seeds 1, 6, 11, ... 96: train, backtest the validation slices, then backtest the test slate.
for /l %%x in (1, 5, 100) do (
    set /a RANDOM_SEED=%%x
    python ppo2_mgt_train.py --model_name %MODEL_NAME%_%%x --random_seed %%x --timesteps %TIMESTEPS%
    python ppo2_mgt_back_test.py --model_name %MODEL_NAME%_%%x
    python ppo2_mgt_back_test.py --model_name %MODEL_NAME%_%%x test
)
Three details mattered more than I expected: the loop over seeds, the model name that encodes every knob, and the fact that the training length itself (1.5M, 1M, 2M, etc.) is part of the sweep.

Training is where you can accidentally overfit.
Evaluation is where you catch yourself.
The harness I leaned on this month lives here:
OpenAI/baselines/ppo2_mgt_back_test.py
Instead of “evaluate on one big test file”, the script evaluates across multiple curated slices. In the code, it’s literally a list of windows:
# Default evaluation: curated windows, grouped into slices that get reported separately.
exec_files_lists = [
    ["2018-11-10-2018-12-04_10k_30m", "2018-12-05-2019-01-09_10k_30m"],
    ["2018-12-22-2019-02-06_10k_30m"],
    ["2019-01-01-2019-02-15_10k_30m", "2019-02-16-2019-03-31_10k_30m"],
]

# Passing "test" swaps in the test slate instead.
if test_name:
    exec_files_lists = [[
        "2019-01-01-2019-02-15_10k_30m",
        "2019-03-01-2019-04-15_10k_30m",
        "2019-05-01-2019-06-15_10k_30m",
        "2019-06-16-2019-07-30_10k_30m",
    ]]
That structure let me do two crucial things: compare a model’s behavior across distinct market windows instead of one blended average, and keep a separate test slate that only comes out once a model has already survived the default slices.
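The shape of that loop is simple; run_backtest() below is a stand-in for whatever replays a trained model over one window, not the script’s actual API:

# Sketch of the evaluation loop; run_backtest() is a stand-in for whatever replays
# a trained model over one data window, not the real script's API.
def evaluate_model(model_name, slice_groups, run_backtest):
    """Return one summary row per slice group, so regimes stay individually visible."""
    report = {}
    for group in slice_groups:
        trades = []
        for window in group:
            trades.extend(run_backtest(model_name, window))
        report[" + ".join(group)] = {
            "trades": len(trades),
            "pnl_pct": sum(t["pnl_pct"] for t in trades),
            "fees_pct": sum(t["fees_pct"] for t in trades),
        }
    return report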
And the output is intentionally blunt: the script prints and logs the stuff that actually matters.
You can see the emphasis in the summary block it writes out (again: trimmed):
Total trades: ...
Max drawdown: ...
Total summed profit percentage: ...
Total fees percentage: ...
Total fees & profit percentage: ...
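None of those numbers require anything clever. A hedged sketch of how they fall out of a list of per-trade results; the field names are assumptions, not the script’s internals:

def summarize(trades):
    """Rebuild the gist of the summary block from per-trade results.

    Assumes each trade is a dict with 'pnl_pct' and 'fees_pct' (both in percent);
    that mirrors the printed fields, not the script's internal data structures.
    """
    equity, peak, max_dd = 0.0, 0.0, 0.0
    for t in trades:
        equity += t["pnl_pct"] - t["fees_pct"]   # net equity curve, in percent
        peak = max(peak, equity)
        max_dd = max(max_dd, peak - equity)      # deepest fall from a prior peak

    profit = sum(t["pnl_pct"] for t in trades)
    fees = sum(t["fees_pct"] for t in trades)
    return {
        "total_trades": len(trades),
        "max_drawdown_pct": max_dd,
        "total_profit_pct": profit,
        "total_fees_pct": fees,
        "total_fees_and_profit_pct": profit - fees,
    }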
Once I had batch training across seeds and training lengths, slice-by-slice evaluation, and blunt per-run summaries, I stopped asking “does it work?” and started asking where it works, where it breaks, and whether that holds up across seeds and slices.
That shift is the difference between research vibes and engineering progress.
By the end of the month, I had a simple rule I could actually follow: only promote a model that clears the gates on most slices, across more than one seed, test slate included.
It’s not a perfect rule.
But it’s the first rule that reliably prevented me from spending weeks chasing a backtest hallucination.
This month’s surprising result wasn’t a new architecture.
It was a training habit change.
After running batch sweeps, it became obvious that the “train forever” mindset from earlier months wasn’t just wasteful; it was actively harmful. The best models consistently showed up in the early-stop range, around the 1M–2M step marks of the sweep.
Meanwhile, the old habit (training for tens of millions of steps) produced models that looked amazing on biased validation slices and then collapsed the moment I asked them to generalize.
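The practical fix is to treat training length as just another hyperparameter and let the validation slices pick it. A sketch of that selection step, with train() and eval_on_slices() standing in for the real training and backtest calls:

# Sketch: treat training length as a hyperparameter and let validation slices pick it.
# train() and eval_on_slices() are stand-ins; eval_on_slices() should return a single
# score that already aggregates across the validation slices (e.g. worst-slice net pnl).
def pick_training_length(train, eval_on_slices,
                         lengths=(1_000_000, 1_500_000, 2_000_000)):
    scored = []
    for steps in lengths:
        model = train(timesteps=steps)
        scored.append((eval_on_slices(model), steps, model))
    best_score, best_steps, best_model = max(scored, key=lambda s: s[0])
    return best_steps, best_model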
Here’s the report skeleton I started using so every run produces comparable artifacts.
Model:
  name:
  seed:
  timesteps:
  env:
  reward spec:

Train slices:
  slice A:
    trades:
    win/loss:
    max drawdown:
    pnl (no fees):
    fees:
    total (pnl+fees):
  slice B:
    ...

Test slices:
  slice X:
    ...

Gates:
  pass_majority_slices (Y/N):
  drawdown_under_cap (Y/N):
  no_single_slice_dependency (Y/N):
  seed_stability (Y/N):

Decision:
  promote / reject / retest

Notes:
This looks boring.
That’s the point.
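The gates at the bottom are deliberately mechanical, which also makes them easy to turn into code. A rough sketch of the decision logic; the field names and the drawdown cap are illustrative, not values from the repo:

# Rough sketch of the gate logic. slice_results is a list of per-slice summaries
# (train + test), seed_scores is one net result per seed for the same config;
# the field names and the drawdown cap are illustrative.
def decide(slice_results, seed_scores, drawdown_cap_pct=20.0):
    net = [s["total_fees_and_profit_pct"] for s in slice_results]

    gates = {
        "pass_majority_slices": sum(x > 0 for x in net) > len(net) / 2,
        "drawdown_under_cap": all(s["max_drawdown_pct"] <= drawdown_cap_pct
                                  for s in slice_results),
        # The result must stay positive even without the single best slice.
        "no_single_slice_dependency": sum(net) - max(net) > 0,
        # Every seed of the same config should land on the same side of zero.
        "seed_stability": all(x > 0 for x in seed_scores) or all(x <= 0 for x in seed_scores),
    }

    if all(gates.values()):
        decision = "promote"
    elif gates["pass_majority_slices"]:
        decision = "retest"
    else:
        decision = "reject"
    return decision, gates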
bitmex-deeprl-research (repo)
The full research rig: environments, baselines, evaluation scripts, and the “Chappie” execution code.
Batch runner (train-test-models.bat)
The script that turned training into a repeatable experiment across seeds and training lengths.
Why a batch script instead of running experiments by hand? Because manual runs invite cherry-picking without you noticing. The batch runner forces the same protocol every time: same seed strategy, same training lengths, same evaluation harness.
Why did shorter training win? In my setup, longer training often meant deeper memorization of the training distribution (especially under a biased regime mix). The harness made it obvious when that “skill” didn’t transfer to other slices.
Is encoding the config in the model name really enough? For early research, yes, as long as the name encodes the knobs you change (reward spec, step cadence, gamma, stack size, timesteps, seed). Later it becomes worth adding explicit metadata files, but the naming discipline still pays off.
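If the name is the metadata, it also helps to be able to read it back. A quick illustrative parser for names shaped like the one in the batch script; the token meanings are inferred from the convention, so treat it as a sketch:

import re

# Illustrative parser for names like the one in the batch script. The meaning of each
# token is inferred from the naming convention, so double-check before relying on it.
def parse_model_name(name):
    knobs = {key: int(value) for value, key in
             re.findall(r"(\d+)(len|gamma|step|stack|sup)", name)}
    knobs["gamma"] /= 10 ** len(str(knobs["gamma"]))   # "099gamma" -> 0.99
    knobs["seed"] = int(name.rsplit("_", 1)[-1])       # trailing seed suffix
    return knobs

parse_model_name("auto_bitmex_108len_099gamma_2000step_100stack_00sup"
                 "_norm_reward_1_5M_selu_truncated_7")
# -> {'len': 108, 'gamma': 0.99, 'step': 2000, 'stack': 100, 'sup': 0, 'seed': 7}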
How much evaluation is enough? More than one seed, more than one slice, and at least one slice you expect to fail. If the model only looks good on the slices you like, you don’t have a model; you have a story.
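The cheapest version of that check is a small cross-tab: for each evaluation window, do the seeds agree? A sketch, assuming each run reduces to one net number per window:

from statistics import mean, pstdev

# Sketch: collapse a seeds x windows grid of net results into per-window agreement.
# results[seed][window] is net pnl after fees; the structure is an assumption,
# not the repo's actual output format.
def cross_check(results):
    windows = next(iter(results.values())).keys()
    summary = {}
    for w in windows:
        vals = [results[seed][w] for seed in results]   # same window, different seeds
        summary[w] = {
            "mean": mean(vals),
            "stdev": pstdev(vals),
            "all_positive": all(v > 0 for v in vals),
        }
    return summary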
The next post is where this mindset meets the real world again:
From Research Rig to System
Because once you can batch-train and batch-evaluate, you can finally tell the difference between a backtest that flatters you and a system that might survive contact with reality.
Next: From Research Rig to System: 2020 Postmortem and the Real Amazing Result
2020 is when I stopped training agents and started building a trading system: environments, evaluation discipline, safety, and a live loop that survives outages. This is the postmortem, and the first result that actually held up in reality.

Previous: bitmex-management-gym: Position Sizing and the First Risk-Aware Agent
After months of "all-in" agents with bull personalities, I rebuilt the environment to teach risk: stackable positions, time-awareness, and penalties that prevent reward-hacking.