Blog
Jun 30, 2019 - 12 MIN READ
Normalization Is a Deployment Problem - Mean/Sigma and Index Diff

In June 2019 I stop treating feature scaling as “preprocessing” and start treating it as part of the production contract - same transforms, same stats, same order — or the live system lies.

Axel Domingues

In 2018, Reinforcement Learning (RL) trained me to distrust “it seems to work.”
In 2019, trading is doing the same thing — but with money-shaped consequences.

This month is where I learn a painful lesson: normalization isn’t a data science detail.
It’s a deployment problem.

If the model sees one distribution in training and a different distribution in production — even if only because I recomputed mean and sigma (standard deviation) the “easy way” — I’m not shipping intelligence. I’m shipping drift.


The problem I thought I had (and the one I actually had)

I thought the problem was:

“My features have different scales — I should normalize.”

The real problem was:

“My training and live pipelines must apply the same transform, with the same parameters, in the same feature order, forever.”

Because in live trading I can’t “just fit a scaler.”

  • The market shifts.
  • My collector misses packets.
  • One feature goes constant for hours.
  • Another feature gets NaNs for a while.
  • The BTC price level itself drifts a lot over months.

In that world, a scaler is a contract, not a convenience.

If you compute normalization stats using any data that includes “future” relative to the period you’re evaluating, you’ve built a leakage machine.

It won’t explode immediately — it will just look brilliant in backtests.


What the model sees in my pipeline (June reality check)

By now (after March + April), my collector is producing snapshots and writing them through the output pipeline. The model-facing table includes things like:

  • top-of-book sizes (ask/bid size at levels 1–5)
  • depth summaries (like depth at L10/L25)
  • created liquidity (what gets added/removed near the best prices)
  • volume windows (1.5s, 5.5s, … up to 3600.5s)
  • simple distribution stats over price changes (CDF-like signals, std windows)

If you stare at those raw columns for 5 minutes you notice:

  • some are tiny ratios near zero
  • some are huge counts (volume + sizes)
  • some are rare spikes (liquidity bursts)
  • and some are “almost constant until they suddenly aren’t”
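
A quick way to make that mismatch concrete, assuming the snapshot table loads into pandas (the file name and HDF5 key here are placeholders, not the repo's actual layout):

```python
import pandas as pd

# Load one snapshot file and eyeball the raw scales per column.
# File name and HDF5 key are illustrative, not the real schema.
df = pd.read_hdf("snapshots_2019_06.h5", key="data")

summary = df.describe().T[["mean", "std", "min", "max"]]
print(summary.sort_values("std"))
# Ratios sit near zero with tiny std; volume and size counts are orders
# of magnitude larger. That spread is what normalization has to absorb.
```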

So I decide to treat normalization as first-class infrastructure: generated offline, versioned with the model, loaded in the live bot.


Mean/Sigma as an artifact (not a side effect)

The core idea

  1. Build a training dataset snapshot (as frozen as possible)
  2. Compute mean and sigma (standard deviation) per feature
  3. Save them as artifacts
  4. Ship them alongside the model
  5. In production: apply the same transform, and handle NaNs/inf deterministically

That’s what Util/produceMeanSigma.py does.

A few details that matter a lot:

  • It reads multiple HDF5 files, stacks feature matrices, and computes stats globally.
  • It replaces missing values in a consistent way before computing stats.
  • It outputs mean.npy and sigma.npy — and those become part of the model version.

Here’s the “shape” of the pipeline in plain English:

  • Input: one or more HDF5 files with a data table
  • Select: the agreed list of x_features_cols
  • Transform: consistent missing-value handling
  • Compute: mean + sigma per column
  • Output: Numpy files you can commit, copy, and load later
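
A minimal sketch of that offline step, not the actual Util/produceMeanSigma.py: the directory layout, HDF5 key, and feature names are placeholders, and the zero-fill missing-value policy is the same deterministic one the rest of this post assumes.

```python
import glob
import numpy as np
import pandas as pd

# Placeholder feature list; in the repo this is the agreed x_features_cols.
X_FEATURES_COLS = ["ask_size_1", "bid_size_1", "depth_l10", "vol_60s", "idx_diff"]

frames = []
for path in sorted(glob.glob("training_data/*.h5")):
    df = pd.read_hdf(path, key="data")
    frames.append(df[X_FEATURES_COLS])

# Stack all files, then apply the same deterministic missing-value policy
# used everywhere else before computing the stats.
x = pd.concat(frames, ignore_index=True).to_numpy(dtype=np.float64)
x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)

mean = x.mean(axis=0)
sigma = x.std(axis=0)

# These two files become part of the model version.
np.save("mean.npy", mean)
np.save("sigma.npy", sigma)
```
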
The first thing I do after generating mean.npy and sigma.npy:
  • verify the arrays have the same length as the feature list
  • verify sigma has no zeros
  • verify I can normalize one sample and get “reasonable numbers” (not all zeros, not all huge)
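
The same checks as assertions, reusing x and X_FEATURES_COLS from the sketch above (the “reasonable numbers” threshold is arbitrary):

```python
import numpy as np

mean = np.load("mean.npy")
sigma = np.load("sigma.npy")

# 1) stats and feature list were generated together
assert mean.shape == sigma.shape == (len(X_FEATURES_COLS),), "stats/feature-list length mismatch"

# 2) no constant feature slipped through
assert not np.any(sigma == 0), "sigma contains zeros: dead or broken feature"

# 3) one sample normalizes to something human-scale
z = (x[0] - mean) / sigma
assert np.all(np.isfinite(z)) and np.max(np.abs(z)) < 50, "suspicious normalized sample"
```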

“Index diff”: making price features survive time

A second issue shows up quickly: even if you normalize features, the price level itself moves over months.
Absolute price-based signals are fragile, because “10 dollars” means different things at different BTC levels.

So I introduce a relative price feature: mid price divided by the BitMEX index (the “indicative settle” reference).

That’s what Util/produceBitmexIndexDiff.py produces:

  • compute a mid price from last / mark prices
  • compute a ratio against the reference index
  • store it as idx_diff
  • mark the row as valid = 1 (the schema starts to care about validity explicitly)
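
A sketch of that transform, assuming the snapshot table exposes last/mark prices and the index reference as columns; the column names, the simple average used for the mid, and tying valid to finiteness are my assumptions, not the script's exact logic.

```python
import numpy as np
import pandas as pd

df = pd.read_hdf("snapshots_2019_06.h5", key="data")

# Mid price from last/mark prices (simple average here), then the ratio
# against the BitMEX index reference ("indicative settle").
mid = (df["last_price"] + df["mark_price"]) / 2.0
df["idx_diff"] = mid / df["indicative_settle"]

# Explicit validity flag: the schema starts to care about validity,
# so only rows with a finite ratio get valid = 1.
df["valid"] = np.isfinite(df["idx_diff"]).astype(int)

df.to_hdf("snapshots_2019_06_with_idx.h5", key="data", mode="w")
```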

In other words:

Don’t let the model learn “BTC at 3k behaves like this and BTC at 10k behaves like that”
when what I really want is “price relative to fair reference is stretched or compressed.”

This also makes later walk-forward evaluation less brittle, because the feature space is closer to “stationary” (still not stationary — just less ridiculous).

This isn’t “financial alpha.” It’s just a way to remove a moving baseline from the model’s view.

It’s the same engineering instinct as normalizing sensor readings in robotics:
you want the model to learn relationships, not raw units.


Live inference: normalize the same way or don’t pretend

The most important place normalization appears is not training.
It’s the bot.

In BitmexPythonChappie/OrderBookMovePredictor.py, I load the saved arrays and apply the transform exactly once per prediction:

  • load mean-truncated.npy and sigma-truncated.npy
  • extract FEATURES_COLS from the current snapshot
  • apply (x - mean) / sigma
  • replace inf, -inf, and NaN with 0

That last part is not “nice to have.” It’s how you stop live inference from crashing — or worse, outputting garbage while looking healthy.
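
A minimal sketch of that load-and-normalize step; the array file names are the ones above, while the feature list contents and the snapshot format are placeholders rather than the bot's actual code.

```python
import numpy as np

# Loaded once at startup; these files ship with the deployed model.
MEAN = np.load("mean-truncated.npy")
SIGMA = np.load("sigma-truncated.npy")
FEATURES_COLS = ["ask_size_1", "bid_size_1", "depth_l10", "vol_60s", "idx_diff"]

def normalize_snapshot(snapshot: dict) -> np.ndarray:
    """Apply the training-time transform to one live snapshot, exactly once."""
    x = np.array([snapshot.get(col, np.nan) for col in FEATURES_COLS], dtype=np.float64)
    z = (x - MEAN) / SIGMA
    # Deterministic missing-value policy: inf, -inf, and NaN all become 0.
    return np.nan_to_num(z, nan=0.0, posinf=0.0, neginf=0.0)
```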

The contract becomes:

  • same features
  • same order
  • same normalization stats
  • same missing-value policy

Anything else is a silent fork.

There’s a subtle failure mode here: if you change the feature list but forget to regenerate mean/sigma, you still get numbers — just the wrong ones.

That’s why I treat FEATURES_COLS, mean, and sigma as one versioned unit.
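
One cheap way to enforce that, sketched under the assumption that the feature order is stored next to the arrays (features.json is an illustration, not something the repo ships yet):

```python
import json
import numpy as np

def load_normalization_bundle(model_dir: str):
    """Load feature list + stats as one unit; refuse to run on a mismatch."""
    with open(f"{model_dir}/features.json") as f:
        features = json.load(f)
    mean = np.load(f"{model_dir}/mean-truncated.npy")
    sigma = np.load(f"{model_dir}/sigma-truncated.npy")
    if not (len(features) == mean.shape[0] == sigma.shape[0]):
        raise RuntimeError("feature list and mean/sigma were not generated together")
    return features, mean, sigma
```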


My “don’t lie to yourself” rules for normalization

This is the checklist I keep in the repo now — because I don’t trust my future self:

1) Version normalization with the model

  • saved_networks/<model>/
    • mean-*.npy
    • sigma-*.npy
    • (and later: metadata about feature order)

2) Treat sigma=0 as a bug

If a feature becomes constant in the dataset:

  • it’s either dead
  • or your collector/feature builder broke
  • or the market regime changed and your feature is useless

First fix: drop the feature or clamp sigma with a small floor — but only intentionally.
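
If I do clamp, it happens out loud, something like this (reusing sigma from the generation step; the floor value is arbitrary):

```python
import numpy as np

SIGMA_FLOOR = 1e-8  # arbitrary; pick per feature scale, on purpose

dead = np.flatnonzero(sigma < SIGMA_FLOOR)
if dead.size:
    # Log loudly: a clamped feature is a decision, not a default.
    print(f"WARNING: clamping sigma for feature indices {dead.tolist()}")
sigma = np.maximum(sigma, SIGMA_FLOOR)
```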

3) Normalize with training stats, not “current” stats

  • training stats must be computed from a fixed historical span
  • evaluation must use stats computed only from the past
  • live trading must use the stats packaged with the deployed model

4) NaNs and inf must become deterministic

If I allow “whatever NumPy does” to leak into decisions, I deserve the bug.

5) Validate distributions after normalization

At minimum:

  • mean near 0 (roughly) on a held-out set
  • typical values around a human-scale range
  • no feature dominates just because it’s unscaled
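
A rough held-out check along those lines, reusing X_FEATURES_COLS, mean, and sigma from earlier; the held-out file name and the thresholds are mine, not a repo convention.

```python
import numpy as np
import pandas as pd

holdout = pd.read_hdf("holdout_2019_06.h5", key="data")[X_FEATURES_COLS]
h = np.nan_to_num(holdout.to_numpy(dtype=np.float64), nan=0.0, posinf=0.0, neginf=0.0)

z = (h - mean) / sigma
col_mean = z.mean(axis=0)
col_absmax = np.abs(z).max(axis=0)

for i, name in enumerate(X_FEATURES_COLS):
    ok = abs(col_mean[i]) < 0.5 and col_absmax[i] < 50
    print(f"{'OK   ' if ok else 'CHECK'} {name:20s} mean={col_mean[i]:+.3f} |max|={col_absmax[i]:.1f}")
```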

Implementation steps (what I actually run)

Generate idx_diff in the dataset

Run Util/produceBitmexIndexDiff.py on the HDF5 file(s) you intend to treat as “truth” for training.

What I verify:

  • the new idx_diff column exists
  • it stays near 1 in normal periods (no insane jumps)
  • it doesn’t become NaN during missing-data stretches

Compute mean.npy and sigma.npy

Run Util/produceMeanSigma.py over the training dataset directory.

What I verify:

  • output arrays match the feature list length
  • sigma has no zeros
  • a sample row normalizes without NaNs/inf

Package stats with the model

Copy the produced arrays into the model directory (later under saved_networks/).

What I verify:

  • the bot loads the same stats the trainer used
  • feature order is identical

Confirm live normalization behavior

In BitmexPythonChappie/OrderBookMovePredictor.py, validate:

  • normalization is applied exactly once per step
  • NaNs/inf are zeroed deterministically
  • the model still produces actions under missing-data conditions

Instrumentation: the normalization dashboard I wish I had earlier

If normalization is a deployment problem, I need deployment-grade telemetry.

This month I start logging/plotting:

  • histogram per feature (raw vs normalized)
  • per-feature mean, sigma summary table
  • count of NaNs per feature per hour
  • count of inf values per feature per hour
  • min/max per feature per day (raw + normalized)
  • “sigma too small” alert list
  • “feature went constant” alert list
  • percent of rows marked invalid
  • correlations that suddenly flip sign (feature drift hint)
  • model input vector norm over time (detect blow-ups)
  • fraction of zeros in the normalized vector (detect NaN-clamping overload)
  • time alignment sanity: snapshot timestamp vs local clock
  • regime markers: volume regime (quiet vs bursty)
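
A small slice of that telemetry as a sketch: NaN counts per feature per hour over the snapshot table, assuming a pandas-readable table with an epoch-seconds timestamp column and the X_FEATURES_COLS list from earlier.

```python
import pandas as pd

df = pd.read_hdf("snapshots_2019_06.h5", key="data")

# Bucket rows by hour (timestamp assumed to be epoch seconds).
hour = pd.to_datetime(df["timestamp"], unit="s").dt.strftime("%Y-%m-%d %H:00")

# NaN count per feature per hour: a spike here usually means the collector
# or a feature builder broke, before the model ever sees bad numbers.
nan_per_hour = df[X_FEATURES_COLS].isna().groupby(hour).sum()
print(nan_per_hour.tail(24))
```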

What changed in my thinking

In 2016 I learned “feature scaling improves optimization.”
In 2017 I learned “training stability is engineering.”
In 2018 I learned “RL breaks if your signal is inconsistent.”

In 2019 I add a sharper rule:

If preprocessing can’t be reproduced byte-for-byte in production, it isn’t preprocessing — it’s a bug factory.


Resources and repo pointers

My research repo

The full research repo where these normalization artifacts are generated and then loaded by the live bot.

Mean/Sigma generator

Offline stats generation: compute and save mean.npy + sigma.npy as versioned artifacts.

Index diff helper

Adds a relative price feature (idx_diff) so price signals don’t die across regimes and price levels.

Live inference normalization

Where the contract becomes real: load stats, normalize, clamp NaNs/inf, then predict.


What’s next

Next month I stop hiding behind “features” and define alpha — the label that claims “this is the move that matters.”

And I’m already nervous, because the easiest way to cheat in trading is not in the model.
It’s in the target.
