Feb 23, 2020 - 11 MIN READ
Evaluation Discipline - Walk-Forward Backtesting Inside the Gym

Training reward was lying to me. So I turned evaluation into a first-class system - chronological splits, deterministic runs, and walk-forward backtests that survive the next dataset.

Axel Domingues

In 2018 I spent the year treating reinforcement learning as algorithms: policy gradients, actor-critic, the whole zoo.

2019 was the year I finally accepted the uncomfortable truth: trading is not an algorithm. It’s a system. And the system fails in ways models don’t warn you about.

So in January 2020 I built bitmex-gym: a minimal trading environment that could train end-to-end on BitMEX data.

This February post is about the next step that saved me from weeks of self-deception:

Evaluation discipline.

Not “look at the training curve and feel good.” Not “run one backtest and screenshot the best.”

Walk-forward evaluation — inside the same Gym contract — with explicit rules for what counts as evidence.


The problem: training reward is not evidence

RL gives you an extremely convincing illusion:

  • you see reward climb
  • your agent starts taking actions that look purposeful
  • your logs feel alive

And then you run the same policy on a different day of data and… nothing. Or worse: it behaves confidently in exactly the wrong regime.

By early 2020 I had enough moving parts that any of them could create fake wins:

  • data slicing
  • feature normalization
  • episode resets
  • trade execution assumptions
  • reward shaping
  • random initialization

If I didn’t lock evaluation down, I wasn’t doing research. I was doing curve worship.


What “walk-forward” means in my setup

I’m using BitMEX historical files (HDF5) as the underlying time series. The key rule is simple:

  • training only sees earlier data
  • evaluation runs on later data

Then you repeat this process by rolling forward.

That’s it.

No random shuffling. No mixing days. No “test set” that you touch 30 times while tuning hyperparameters.

The easiest way to accidentally cheat is to evaluate on data you already “peeked at” during training.

In an RL pipeline, peeking can happen without you noticing:

  • normalization computed over the full dataset
  • episode slices sampled from the entire timeline
  • “validation” episodes drawn from the same pool as training
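
To make the first failure mode concrete, here is a minimal guard, sketched with a hypothetical load_features() loader standing in for however you read the HDF5 files (this is not bitmex-gym code). The only point is where the statistics come from.

# Sketch only: load_features() and the file lists are placeholders, not repo code.
import numpy as np

def fit_normalizer(train_files, load_features):
    # Statistics come from *training* files only - never from the full dataset.
    feats = np.concatenate([load_features(f) for f in train_files], axis=0)
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8   # avoid division by zero on constant features
    return mean, std

def normalize(features, mean, std):
    # The same frozen training-window statistics get applied to evaluation data.
    return (features - mean) / std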

The three knobs that made evaluation real

Everything I needed was already present in the environment — I just had to treat it as two modes: training and evaluation.

1) Explicit file splits (chronology first)

Inside bitmex-gym/gym_bitmex/envs/bitmex_env.py, the environment loads a fixed list of HDF5 files:

# bitmex-gym/gym_bitmex/envs/bitmex_env.py
FILE_NAMES = ['XBTUSD-data-25-12-2018.h5', 'XBTUSD-data-26-12-2018.h5', ...]
# Real validation files:
# FILE_NAMES = ['XBTUSD-data-18-06-2019.h5', 'XBTUSD-data-19-06-2019.h5', ...]

At this stage of the project, my “split mechanism” was blunt but effective:

  • for training: point FILE_NAMES at earlier files
  • for evaluation: point FILE_NAMES at later files

Not elegant, but it forced the right habit: evaluation is a separate run on separate data.
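
If you want the same blunt mechanism without editing the constant by hand, a small helper can derive both lists from a cutoff date. The DD-MM-YYYY pattern comes from the real filenames above; split_by_date() itself is a sketch, not something that exists in the repo.

# Hypothetical helper, not in bitmex-gym: build the two FILE_NAMES lists from a cutoff date.
from datetime import datetime

def split_by_date(file_names, cutoff):
    def file_date(name):
        # 'XBTUSD-data-25-12-2018.h5' -> datetime(2018, 12, 25)
        stamp = name.replace('XBTUSD-data-', '').replace('.h5', '')
        return datetime.strptime(stamp, '%d-%m-%Y')

    ordered = sorted(file_names, key=file_date)
    train_files = [f for f in ordered if file_date(f) < cutoff]
    eval_files = [f for f in ordered if file_date(f) >= cutoff]
    return train_files, eval_files

# Everything before 2019-06-18 trains; the rest is held out for evaluation.
# train_files, eval_files = split_by_date(all_files, datetime(2019, 6, 18))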

2) Freeze randomness for evaluation

Training needs randomness. Evaluation needs repeatability.

In the same file there’s a switch:

# bitmex-gym/gym_bitmex/envs/bitmex_env.py
TAKE_RANDOM_INIT_ACTION = True

And in reset() the environment does a lot of randomized setup:

  • random initial action (0, 1, 2 → flat / long / short)
  • random start index (spawn inside the dataset)
  • tunable step_skip hyperparameter (how many timesteps to jump per action)
  • random “trade details” shift (hour and price offset)

That random spawn isn’t a bug — it’s one of the best ideas in this project, because it broadens coverage and reduces “luck at the starting timestamp.”
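
In sketch form, the training-time spawn rule amounts to something like this (illustrative names, not the actual reset() code):

# Each training episode starts at a random index, so no single starting
# timestamp dominates what the agent sees.
import numpy as np

def random_start_index(n_rows, episode_steps):
    # Any start that still leaves room for a full episode inside the dataset.
    return np.random.randint(0, n_rows - episode_steps)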

But for evaluation I needed to turn randomness into a controlled variable. So my evaluation rule became:

  • disable random init action
  • run from a known start state
  • keep the episode schedule deterministic

That lets me compare runs and know if improvements are real or just different dice rolls.
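
As a loop, a deterministic evaluation pass looks roughly like the sketch below. make_eval_env and policy are placeholders; the actual switches (FILE_NAMES, TAKE_RANDOM_INIT_ACTION) live in bitmex_env.py and get set before the run rather than passed in like this.

# Sketch of a deterministic evaluation pass over held-out files.
import numpy as np

def evaluate(make_eval_env, policy, n_episodes=50, seed=0):
    np.random.seed(seed)              # pin whatever randomness is left
    env = make_eval_env()             # eval files loaded, random init action disabled
    episode_rewards = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(obs)      # deterministic (e.g. greedy) action selection
            obs, reward, done, info = env.step(action)
            total += reward
        episode_rewards.append(total)
    return np.array(episode_rewards)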

3) Make the episode structure support measurement

The environment was already built around short, measurable episodes (with a bit of variance so training doesn’t lock onto one fixed length):

import numpy as np

# Episode length is expressed in *steps* (dataset-step units), not minutes.
MEAN_STEPS_PER_EPISODE = 216000  # ~1 hour on average
MIN_STEPS_PER_EPISODE  = 72000   # ~20 minutes minimum
# STD_STEPS_PER_EPISODE (the episode-length spread) is defined alongside these constants.

if self.use_episode_based_time:
    # Draw a per-episode length, then clamp it to the minimum.
    self.current_limit_steps_per_episode = int(
        np.random.normal(MEAN_STEPS_PER_EPISODE, STD_STEPS_PER_EPISODE)
    )
    self.current_limit_steps_per_episode = max(
        self.current_limit_steps_per_episode, MIN_STEPS_PER_EPISODE
    )

STEP_SKIP = 100  # hyperparameter: how many raw ticks to jump per action

Short episodes weren’t “unrealistic.” In this setup they averaged roughly an hour, sometimes dropping to ~20 minutes, and that variability was part of the training curriculum. They were the practical way to make learning possible:

  • reward signal flows faster
  • debugging is easier
  • the agent sees more market situations per hour of training

The evaluation trick was: keep the same episode structure, but run it walk-forward on held-out files.

So instead of “one long heroic backtest,” I got:

  • many episodes
  • across multiple future days
  • with fixed rules

That yields an evaluation distribution that you can actually reason about.
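
Given a per-episode result array like the one the sketch above returns, the summary I look at is a distribution, not a headline number - something along these lines (again a sketch, not repo code):

# Summarise the evaluation *distribution*, not a single headline number.
import numpy as np

def summarise(episode_rewards):
    r = np.asarray(episode_rewards, dtype=float)
    return {
        'episodes':       len(r),
        'mean':           float(r.mean()),
        'median':         float(np.median(r)),
        'p05':            float(np.percentile(r, 5)),   # how bad is a bad hour?
        'worst':          float(r.min()),
        'share_positive': float((r > 0).mean()),        # fraction of profitable episodes
    }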


The walk-forward loop I actually ran

This is what I treated as the evaluation contract:

Pick a training window

Example: train on a block of early files (e.g., late 2018 / early 2019).

Train for a fixed budget

Same number of timesteps, same hyperparameters. No “train until it looks good.”

Swap to the next window for evaluation

Switch FILE_NAMES to later files and run deterministic evaluation episodes.

Roll forward and repeat

Move the window forward and repeat the process. If the policy only works on one window, it’s not a policy.
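
Written as a loop, the contract is roughly the skeleton below. Every name in it is a placeholder for what I actually did by editing constants and launching runs:

# Walk-forward skeleton: fixed training budget, then evaluate strictly on the
# next chronological window. All function names here are placeholders.
def walk_forward(windows, train, evaluate, budget_timesteps):
    results = []
    for train_files, eval_files in windows:   # chronologically ordered (train, eval) pairs
        policy = train(train_files, timesteps=budget_timesteps)   # same budget every window
        results.append(evaluate(policy, eval_files))              # deterministic evaluation
    return results   # one evaluation distribution per window, not one number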

The mental shift was huge: I stopped asking “does it work?” and started asking “does it keep working when the date changes?”

What I measured (before I trusted anything)

Reward can be a proxy. It is not a metric.

So I started tracking a small set of sanity metrics during evaluation:

  • PnL curve (even if the reward function isn’t PnL)
  • time in position (am I always-in, never-in, or oscillating?)
  • trade count (is it overtrading?)
  • win/loss balance (not as a goal, as a symptom)
  • max drawdown (when it fails, how badly?)

And I always compared against two baselines:

  1. Do-nothing (stay flat)
  2. Naive heuristic (e.g., always follow the sign of a simple microstructure signal)

If the agent didn’t beat those, I didn’t care how pretty the training curve looked.
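
A couple of those metrics are easy to compute subtly wrong, so here is how I think about them in code - a sketch over per-step equity and position arrays, not something lifted from the repo:

# Sanity metrics from per-step arrays: equity is cumulative PnL per step,
# positions is -1/0/+1 per step. Both are assumed inputs, not literal env outputs.
import numpy as np

def max_drawdown(equity):
    # Largest peak-to-trough drop of the equity curve.
    running_peak = np.maximum.accumulate(equity)
    return float(np.max(running_peak - equity))

def time_in_position(positions):
    # Fraction of steps spent non-flat: always-in, never-in, or somewhere sane?
    return float(np.mean(np.asarray(positions) != 0))

def trade_count(positions):
    # Number of position changes - a quick overtrading check.
    p = np.asarray(positions)
    return int(np.sum(p[1:] != p[:-1]))

# Baseline 1 (do-nothing) is trivial: flat equity, zero drawdown, zero trades.
# If the agent can't beat that after costs, the training curve is irrelevant.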


Why this belongs “inside the Gym”

It’s tempting to bolt evaluation on as a separate script.

But I learned quickly that if evaluation isn’t part of the same environment contract, you end up comparing apples to a different simulator.

Keeping evaluation inside Gym forced consistency:

  • same observation space
  • same action space
  • same execution model
  • same reward definition

The only thing allowed to change is the data window and the randomness policy.

That’s how you catch cheating early: by refusing to change the rules when it’s time to judge the agent.
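
One way to make that rule hard to break is to write it down as two run configurations that differ in exactly two fields. A sketch only - bitmex-gym uses module-level constants (FILE_NAMES, TAKE_RANDOM_INIT_ACTION), not a config object:

# Sketch: the two runs share everything except the data window and the
# randomness policy. Field names are illustrative.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class RunConfig:
    file_names: Tuple[str, ...]      # the data window
    take_random_init_action: bool    # the randomness policy
    seed: Optional[int]              # pinned for evaluation, free for training

train_cfg = RunConfig(('XBTUSD-data-25-12-2018.h5', 'XBTUSD-data-26-12-2018.h5'),
                      take_random_init_action=True, seed=None)
eval_cfg = RunConfig(('XBTUSD-data-18-06-2019.h5', 'XBTUSD-data-19-06-2019.h5'),
                     take_random_init_action=False, seed=0)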


The punchline: most “good agents” died here

This month hurt.

Most models that looked promising in training didn’t survive walk-forward evaluation.

And that wasn’t a tragedy. It was information.

Because once evaluation became strict, improvements became meaningful:

  • changing reward shaping actually showed up on future days
  • feature changes either generalized… or immediately died
  • execution assumptions stopped being “details” and became first-order effects

If you can’t reproduce your evaluation result tomorrow (same data, same seed, same code), it’s not a result. It’s a mood.

Repo pointers

If you want to follow the exact environment behavior discussed here:

  • bitmex-gym/gym_bitmex/envs/bitmex_env.py - reset() is the heart of episode randomization and the switches that separate training from evaluation.

And if you’re browsing the larger repo:

bitmex-deeprl-research (repo)

All code + experiments for this series (BitMEX data, alpha detection, and the early Gym environments).

bitmex_env.py (baseline environment)

The baseline environment contract that made evaluation rules possible in the first place.


What’s next

Now that I could trust evaluation, the next obvious question was:

How do I shape reward and add constraints without teaching the agent to lie?

That’s March 2020:

Reward Shaping
