Nov 29, 2020 - 18 MIN READ

Batch Training & Evaluation Again: Promising Results That Survive Scrutiny

In late 2020 I stopped trusting hero backtests. I built a batch runner + a walk-forward evaluation harness, added eval gates, and discovered an uncomfortable truth: shorter training often wins.

Axel Domingues

In 2018 I treated RL like a box of tricks: pick an algorithm, hit train, stare at curves.

By late 2020 BitMEX had trained a different muscle: system discipline. If the pipeline isn’t reproducible, the result isn’t real.

October’s milestone was capability: bitmex-management-gym gave the agent new tools (position sizing, management-style decisions) instead of only “directional bets”.

November’s milestone was credibility: I rebuilt training and evaluation as a batch process so “good results” would survive:

different random seeds
different walk-forward windows
different regimes (bull-heavy, mixed, declining)

Since June 2020 I became painfully aware of regime bias. My training data was bull-heavy, my validation data was also bull-heavy, and the agents developed a very real bull personality. The fix wasn’t another clever network — it was a better evaluation contract.

The problem: one backtest can make anything look smart

When you only run one experiment, you can always find a story:

“it’s profitable after fees” (but only because it churns trades)
“it holds through drawdowns” (but only in a bull-tilted window)
“it exploits microstructure” (but only in clean periods without 503s)

And because my environments contain intentional variance (random init behavior and variable episode lengths), single-run evaluation is basically inviting self-deception.

Variance in the environment isn’t a bug — it’s a feature. But once you accept that, you also have to accept that single-seed results are not evidence.

The fix: batch experiments become the unit of progress

Instead of “train a model”, the unit of work became:

Train a family of configs across seeds and steps, then evaluate them across a fixed set of slices.

That’s the mindset shift. The implementation was… gloriously Windows.

The key file in the repo is a Windows batch script that turns “training” into a repeatable experiment:

OpenAI/baselines/train-test-models.bat

Here’s the vibe (trimmed):

set MODEL_NAME=auto_bitmex_108len_099gamma_2000step_100stack_00sup_norm_reward_1_5M_selu_truncated

for /l %%x in (1, 5, 100) do (
  set /a RANDOM_SEED=%%x

  set TIMESTEPS=1500000
  python ppo2_mgt_train.py --model_name %MODEL_NAME%_%%x --random_seed %%x --timesteps %TIMESTEPS%

  python ppo2_mgt_back_test.py --model_name %MODEL_NAME%_%%x
  python ppo2_mgt_back_test.py --model_name %MODEL_NAME%_%%x test
)

Three details mattered more than I expected:

Multiple seeds are not “nice to have” The environment injects variance by design (random spawn, episode length variance), so I needed to treat any single seed as anecdotal.
Training length is a hyperparameter The batch file doesn’t just sweep seeds — it also sweeps how long I train (1.5M, 1M, 2M, etc.).
The model name is the registry At this stage the “model registry” was brutally simple: encode the important knobs in the filename so I can diff runs without lying.

When you’re iterating fast, a filename-based registry is surprisingly effective — as long as you’re disciplined about what goes into the name.

The evaluation harness: walk-forward slices as a default

Training is where you can accidentally overfit.

Evaluation is where you catch yourself.

The harness I leaned on this month lives here:

OpenAI/baselines/ppo2_mgt_back_test.py

Instead of “evaluate on one big test file”, the script evaluates across multiple curated slices. In the code, it’s literally a list of windows:

exec_files_lists = [
  ["2018-11-10-2018-12-04_10k_30m", "2018-12-05-2019-01-09_10k_30m"],
  ["2018-12-22-2019-02-06_10k_30m"],
  ["2019-01-01-2019-02-15_10k_30m", "2019-02-16-2019-03-31_10k_30m"],
]

if test_name:
  exec_files_lists = [[
    "2019-01-01-2019-02-15_10k_30m",
    "2019-03-01-2019-04-15_10k_30m",
    "2019-05-01-2019-06-15_10k_30m",
    "2019-06-16-2019-07-30_10k_30m",
  ]]

That structure let me do two crucial things:

Walk forward without pretending markets are stationary.
Grade models by regime slices, not by one flattering average.

And the output is intentionally blunt. The script prints and logs the stuff that actually matters:

trade count
win/loss counts
max drawdown
PnL vs fees (and the combined total)

You can see the emphasis in the summary block it writes out (again: trimmed):

Total trades: ...
Max drawdown: ...
Total summed profit precentage: ...
Total fees precentage: ...
Total fees & profit precentage: ...

This was a direct response to the “bull personality” problem: I needed to force myself to see how models behave when the market stops rewarding my favorite patterns.

Batch experiments became my unit of progress

Once I had:

a batch runner (many runs, same protocol)
a harness (many slices, same report)

…I stopped asking “does it work?” and started asking:

Does it work across seeds?
Does it work across regimes?
Does it keep working after fees?
Does it survive the slices that usually break me?

That shift is the difference between research vibes and engineering progress.

The selection rule: ship only if it survives X slices

By the end of the month, I had a simple rule I could actually follow:

Ship (promote) a model only if it survives the evaluation gates.A “pass” means:

positive total after fees on the majority of slices
max drawdown below a hard ceiling
not dependent on a single slice carrying the whole result
not a one-seed miracle

It’s not a perfect rule.

But it’s the first rule that reliably prevented me from spending weeks chasing a backtest hallucination.

The uncomfortable discovery: shorter training was better

This month’s surprising result wasn’t a new architecture.

It was a training habit change.

After running batch sweeps, it became obvious that the “train forever” mindset from earlier months wasn’t just wasteful — it was actively harmful. The best models consistently showed up in the early-stop range:

roughly 1.5M to 2M steps for my data windows

Meanwhile, the old habit (training for tens of millions of steps) produced models that looked amazing on biased validation slices and then collapsed the moment I asked them to generalize.

Long training runs can create a very convincing failure mode: the model becomes an expert at the quirks of your training distribution and you mistake that for “skill”.Batch evaluation is what makes this visible.

Deliverable: a report template I can compare month-to-month

Here’s the report skeleton I started using so every run produces comparable artifacts.

Model:
  name:
  seed:
  timesteps:
  env:
  reward spec:

Train slices:
  slice A:
    trades:
    win/loss:
    max drawdown:
    pnl (no fees):
    fees:
    total (pnl+fees):
  slice B:
    ...

Test slices:
  slice X:
    ...

Gates:
  pass_majority_slices (Y/N):
  drawdown_under_cap (Y/N):
  no_single_slice_dependency (Y/N):
  seed_stability (Y/N):

Decision:
  promote / reject / retest
Notes:

This looks boring.

That’s the point.

Resources

bitmex-deeprl-research (repo)

The full research rig: environments, baselines, evaluation scripts, and the “Chappie” execution code.

Batch runner (train-test-models.bat)

The script that turned training into a repeatable experiment across seeds and training lengths.

Walk-forward harness (ppo2_mgt_back_test.py)

The evaluation harness: regime slices, fee-aware summaries, and comparable reports.

Management environment (bitmex_management_env.py)

The risk-aware environment that made “management-style agents” possible.

FAQ

What’s next

The next post is where this mindset meets the real world again:

From Research Rig to System

Because once you can batch-train and batch-evaluate, you can finally tell the difference between:

a cool chart
and a result that survives reality.

From Research Rig to System: 2020 Postmortem and the Real Amazing Result

2020 is when I stopped training agents and started building a trading system: environments, evaluation discipline, safety, and a live loop that survives outages. This is the postmortem — and the first result that actually held up in reality.

bitmex-management-gym: Position Sizing and the First Risk-Aware Agent

After months of "all-in" agents with bull personalities, I rebuilt the environment to teach risk: stackable positions, time-awareness, and penalties that prevent reward-hacking.