Nov 29, 2020 - 18 MIN READ
Batch Training & Evaluation Again: Promising Results That Survive Scrutiny

In late 2020 I stopped trusting hero backtests. I built a batch runner + a walk-forward evaluation harness, added eval gates, and discovered an uncomfortable truth: shorter training often wins.

Axel Domingues

In 2018 I treated RL like a box of tricks: pick an algorithm, hit train, stare at curves.

By late 2020 BitMEX had trained a different muscle: system discipline. If the pipeline isn’t reproducible, the result isn’t real.

October’s milestone was capability: bitmex-management-gym gave the agent new tools (position sizing, management-style decisions) instead of only “directional bets”.

November’s milestone was credibility: I rebuilt training and evaluation as a batch process so “good results” would survive:

  • different random seeds
  • different walk-forward windows
  • different regimes (bull-heavy, mixed, declining)

Since June 2020 I’d been painfully aware of regime bias. My training data was bull-heavy, my validation data was also bull-heavy, and the agents developed a very real bull personality. The fix wasn’t another clever network — it was a better evaluation contract.

The problem: one backtest can make anything look smart

When you only run one experiment, you can always find a story:

  • “it’s profitable after fees” (but only because it churns trades)
  • “it holds through drawdowns” (but only in a bull-tilted window)
  • “it exploits microstructure” (but only in clean periods without 503s)

And because my environments contain intentional variance (random init behavior and variable episode lengths), single-run evaluation is basically inviting self-deception.

Variance in the environment isn’t a bug — it’s a feature. But once you accept that, you also have to accept that single-seed results are not evidence.

The fix: batch experiments become the unit of progress

Instead of “train a model”, the unit of work became:

Train a family of configs across seeds and steps, then evaluate them across a fixed set of slices.
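
Concretely, the grid looks something like this (a sketch with illustrative names and values, not code from the repo):

# Sketch: the unit of work is a grid of (seed, training budget) cells, all
# graded against the same fixed evaluation slices. Values are illustrative.
from itertools import product

SEEDS = range(1, 100, 5)                               # sweep seeds, never trust one
TIMESTEPS = [1_000_000, 1_500_000, 2_000_000]          # training length is a knob too
EVAL_SLICES = ["2019-01-01-2019-02-15_10k_30m",
               "2019-03-01-2019-04-15_10k_30m"]        # same slices for every cell

def experiment_grid(base_name: str):
    for seed, steps in product(SEEDS, TIMESTEPS):
        model_name = f"{base_name}_{steps}_{seed}"     # the filename is the registry
        yield model_name, seed, steps, EVAL_SLICES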

That’s the mindset shift. The implementation was… gloriously Windows.

The key file in the repo is a Windows batch script that turns “training” into a repeatable experiment:

  • OpenAI/baselines/train-test-models.bat

Here’s the vibe (trimmed):

rem Delayed expansion so variables set inside the loop expand per-iteration.
setlocal enabledelayedexpansion

set MODEL_NAME=auto_bitmex_108len_099gamma_2000step_100stack_00sup_norm_reward_1_5M_selu_truncated

rem Sweep seeds 1, 6, 11, ... 96 (start 1, step 5, end 100).
for /l %%x in (1, 5, 100) do (
  set /a RANDOM_SEED=%%x

  rem Training length is a hyperparameter too; other runs sweep this value.
  set TIMESTEPS=1500000
  python ppo2_mgt_train.py --model_name %MODEL_NAME%_%%x --random_seed %%x --timesteps !TIMESTEPS!

  rem Evaluate the same model on the default slices and on the held-out "test" slices.
  python ppo2_mgt_back_test.py --model_name %MODEL_NAME%_%%x
  python ppo2_mgt_back_test.py --model_name %MODEL_NAME%_%%x test
)

Three details mattered more than I expected:

  1. Multiple seeds are not “nice to have”. The environment injects variance by design (random spawn, episode length variance), so I needed to treat any single seed as anecdotal.
  2. Training length is a hyperparameter. The batch file doesn’t just sweep seeds — it also sweeps how long I train (1.5M, 1M, 2M, etc.).
  3. The model name is the registry. At this stage the “model registry” was brutally simple: encode the important knobs in the filename so I can diff runs without lying.

When you’re iterating fast, a filename-based registry is surprisingly effective — as long as you’re disciplined about what goes into the name.
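
A tiny example of why that works: if the knobs live in the name, a plain string split recovers them. (Illustrative pattern only; the seed suffix matches how the batch script names models, the rest is a sketch rather than a parser from the repo.)

# The filename *is* the registry: recover the knobs from the run name.
# The trailing number is the seed the batch script appends (e.g. "..._16").
name = "auto_bitmex_108len_099gamma_2000step_100stack_00sup_norm_reward_1_5M_selu_truncated_16"

def parse_run_name(name: str) -> dict:
    parts = name.split("_")
    return {
        "seed": int(parts[-1]),     # appended by the batch runner
        "episode_len": parts[2],    # "108len"
        "gamma": parts[3],          # "099gamma"
    }

print(parse_run_name(name))
# {'seed': 16, 'episode_len': '108len', 'gamma': '099gamma'}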

The evaluation harness: walk-forward slices as a default

Training is where you can accidentally overfit.

Evaluation is where you catch yourself.

The harness I leaned on this month lives here:

  • OpenAI/baselines/ppo2_mgt_back_test.py

Instead of “evaluate on one big test file”, the script evaluates across multiple curated slices. In the code, it’s literally a list of windows:

exec_files_lists = [
  ["2018-11-10-2018-12-04_10k_30m", "2018-12-05-2019-01-09_10k_30m"],
  ["2018-12-22-2019-02-06_10k_30m"],
  ["2019-01-01-2019-02-15_10k_30m", "2019-02-16-2019-03-31_10k_30m"],
]

if test_name:
  exec_files_lists = [[
    "2019-01-01-2019-02-15_10k_30m",
    "2019-03-01-2019-04-15_10k_30m",
    "2019-05-01-2019-06-15_10k_30m",
    "2019-06-16-2019-07-30_10k_30m",
  ]]

That structure let me do two crucial things:

  • Walk forward without pretending markets are stationary.
  • Grade models by regime slices, not by one flattering average.
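
In spirit, the per-slice loop is small. Here is a sketch with hypothetical helpers (the real logic lives in ppo2_mgt_back_test.py); the assumption that fees are tracked as a negative percentage is mine:

# Sketch of the walk-forward grading loop. `run_backtest` is a hypothetical
# stand-in for loading a model and replaying it over one data slice.
def run_backtest(model_name: str, slice_file: str) -> dict:
    # Placeholder: the real script replays the model over the slice and
    # collects trades, wins/losses, max drawdown, pnl and fees.
    raise NotImplementedError

def grade_model(model_name: str, exec_files_lists: list[list[str]]) -> list[dict]:
    report = []
    for window in exec_files_lists:          # each inner list is one walk-forward window
        for slice_file in window:
            stats = run_backtest(model_name, slice_file)
            # Assumption: fees are a negative percentage, so the blunt number
            # that matters is pnl + fees, reported per slice.
            stats["total_after_fees"] = stats["pnl"] + stats["fees"]
            report.append({"slice": slice_file, **stats})
    return report                            # per-slice results; no single flattering average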

And the output is intentionally blunt. The script prints and logs the stuff that actually matters:

  • trade count
  • win/loss counts
  • max drawdown
  • PnL vs fees (and the combined total)

You can see the emphasis in the summary block it writes out (again: trimmed):

Total trades: ...
Max drawdown: ...
Total summed profit percentage: ...
Total fees percentage: ...
Total fees & profit percentage: ...

This was a direct response to the “bull personality” problem: I needed to force myself to see how models behave when the market stops rewarding my favorite patterns.

Batch experiments became my unit of progress

Once I had:

  • a batch runner (many runs, same protocol)
  • a harness (many slices, same report)

…I stopped asking “does it work?” and started asking:

  • Does it work across seeds?
  • Does it work across regimes?
  • Does it keep working after fees?
  • Does it survive the slices that usually break me?

That shift is the difference between research vibes and engineering progress.


The selection rule: ship only if it survives X slices

By the end of the month, I had a simple rule I could actually follow:

Ship (promote) a model only if it survives the evaluation gates. A “pass” (sketched as code below) means:

  • positive total after fees on the majority of slices
  • max drawdown below a hard ceiling
  • not dependent on a single slice carrying the whole result
  • not a one-seed miracle
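
As code, those gates are nothing fancy. A sketch, with placeholder thresholds rather than the exact numbers I used, and with seed stability checked separately across the seed sweep:

# Sketch of the promotion gate over per-slice summaries. Thresholds are
# placeholders, not the exact values from the repo.
def passes_gates(slices: list[dict], drawdown_cap: float = 0.25) -> bool:
    totals = [s["total_after_fees"] for s in slices]

    majority_positive = sum(t > 0 for t in totals) > len(totals) / 2
    drawdown_ok = all(s["max_drawdown"] <= drawdown_cap for s in slices)
    # "No single slice carrying the result": drop the best slice and the
    # remaining total must still be positive.
    not_carried = len(totals) > 1 and (sum(totals) - max(totals)) > 0

    # Seed stability ("not a one-seed miracle") is checked across the seed
    # sweep, not inside a single model's report.
    return majority_positive and drawdown_ok and not_carried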

It’s not a perfect rule.

But it’s the first rule that reliably prevented me from spending weeks chasing a backtest hallucination.


The uncomfortable discovery: shorter training was better

This month’s surprising result wasn’t a new architecture.

It was a training habit change.

After running batch sweeps, it became obvious that the “train forever” mindset from earlier months wasn’t just wasteful — it was actively harmful. The best models consistently showed up in the early-stop range:

  • roughly 1.5M to 2M steps for my data windows

Meanwhile, the old habit (training for tens of millions of steps) produced models that looked amazing on biased validation slices and then collapsed the moment I asked them to generalize.

Long training runs can create a very convincing failure mode: the model becomes an expert at the quirks of your training distribution and you mistake that for “skill”. Batch evaluation is what makes this visible.
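
One way to make that visible is to compare training budgets by their worst slice instead of their best one. A sketch (the numbers below are made up for illustration, not results):

# Sketch: rank training budgets by their worst evaluation slice, so a long
# run that shines on one slice and collapses elsewhere can't win.
def pick_training_budget(results: dict[int, list[float]]) -> int:
    # `results` maps timesteps -> total-after-fees (%) per evaluation slice.
    return max(results, key=lambda steps: min(results[steps]))

# Illustrative numbers only (not measurements):
results = {
    1_500_000: [2.1, 0.4, -0.3],     # short run: modest but consistent
    20_000_000: [9.8, -6.5, -7.2],   # long run: great on one slice, bad elsewhere
}
print(pick_training_budget(results))  # -> 1500000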

Deliverable: a report template I can compare month-to-month

Here’s the report skeleton I started using so every run produces comparable artifacts.

Model:
  name:
  seed:
  timesteps:
  env:
  reward spec:

Train slices:
  slice A:
    trades:
    win/loss:
    max drawdown:
    pnl (no fees):
    fees:
    total (pnl+fees):
  slice B:
    ...

Test slices:
  slice X:
    ...

Gates:
  pass_majority_slices (Y/N):
  drawdown_under_cap (Y/N):
  no_single_slice_dependency (Y/N):
  seed_stability (Y/N):

Decision:
  promote / reject / retest
Notes:

This looks boring.

That’s the point.


Resources

bitmex-deeprl-research (repo)

The full research rig: environments, baselines, evaluation scripts, and the “Chappie” execution code.

Batch runner (train-test-models.bat)

The script that turned training into a repeatable experiment across seeds and training lengths.

Walk-forward harness (ppo2_mgt_back_test.py)

The evaluation harness: regime slices, fee-aware summaries, and comparable reports.

Management environment (bitmex_management_env.py)

The risk-aware environment that made “management-style agents” possible.


What’s next

The next post is where this mindset meets the real world again:

From Research Rig to System

Because once you can batch-train and batch-evaluate, you can finally tell the difference between:

  • a cool chart
  • and a result that survives reality.