Blog
Jun 30, 2019 - 12 MIN READ
Normalization Is a Deployment Problem - Mean/Sigma and Index Diff

In June 2019 I stop treating feature scaling as “preprocessing” and start treating it as part of the production contract - same transforms, same stats, same order — or the live system lies.

Axel Domingues

In 2018, Reinforcement Learning (RL) trained me to distrust “it seems to work.”
In 2019, trading is doing the same thing — but with money-shaped consequences.

This month is where I learn a painful lesson: normalization isn’t a data science detail.
It’s a deployment problem.

If the model sees one distribution in training and a different distribution in production — even if only because I recomputed mean and sigma (standard deviation) the “easy way” — I’m not shipping intelligence. I’m shipping drift.


The problem I thought I had (and the one I actually had)

I thought the problem was:

“My features have different scales — I should normalize.”

The real problem was:

“My training and live pipelines must apply the same transform, with the same parameters, in the same feature order, forever.”

Because in live trading I can’t “just fit a scaler.”

  • The market shifts.
  • My collector misses packets.
  • One feature goes constant for hours.
  • Another feature gets NaNs for a while.
  • The BTC price level itself drifts a lot over months.

In that world, a scaler is a contract, not a convenience.

If you compute normalization stats using any data that includes “future” relative to the period you’re evaluating, you’ve built a leakage machine.

It won’t explode immediately — it will just look brilliant in backtests.


What the model sees in my pipeline (June reality check)

By now (after March + April), my collector is producing snapshots and writing them through the output pipeline. The model-facing table includes things like:

  • top-of-book sizes (ask/bid size at levels 1–5)
  • depth summaries (like depth at L10/L25)
  • created liquidity (what gets added/removed near the best prices)
  • volume windows (1.5s, 5.5s, … up to 3600.5s)
  • simple distribution stats over price changes (CDF-like signals, std windows)

If you stare at those raw columns for 5 minutes you notice:

  • some are tiny ratios near zero
  • some are huge counts (volume + sizes)
  • some are rare spikes (liquidity bursts)
  • and some are “almost constant until they suddenly aren’t”
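
A quick way to make that mismatch concrete, assuming the snapshot table loads into pandas (the file name and HDF5 key here are placeholders, not the repo's actual layout):

```python
import pandas as pd

# Load one snapshot file and eyeball the raw scales per column.
# File name and HDF5 key are illustrative, not the real schema.
df = pd.read_hdf("snapshots_2019_06.h5", key="data")

summary = df.describe().T[["mean", "std", "min", "max"]]
print(summary.sort_values("std"))
# Ratios sit near zero with tiny std; volume and size counts are orders
# of magnitude larger. That spread is what normalization has to absorb.
```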

So I decide to treat normalization as first-class infrastructure: generated offline, versioned with the model, loaded in the live bot.


Mean/Sigma as an artifact (not a side effect)

The core idea

  1. Build a training dataset snapshot (as frozen as possible)
  2. Compute mean and sigma (standard deviation) per feature
  3. Save them as artifacts
  4. Ship them alongside the model
  5. In production: apply the same transform, and handle NaNs/inf deterministically

That’s what Util/produceMeanSigma.py does.

A few details that matter a lot:

  • It reads multiple HDF5 files, stacks feature matrices, and computes stats globally.
  • It replaces missing values in a consistent way before computing stats.
  • It outputs mean.npy and sigma.npy — and those become part of the model version.

Here’s the “shape” of the pipeline in plain English:

  • Input: one or more HDF5 files with a data table
  • Select: the agreed list of x_features_cols
  • Transform: consistent missing-value handling
  • Compute: mean + sigma per column
  • Output: Numpy files you can commit, copy, and load later
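
A minimal sketch of that offline step, not the actual Util/produceMeanSigma.py: the directory layout, HDF5 key, and feature names are placeholders, and the zero-fill missing-value policy is the same deterministic one the rest of this post assumes.

```python
import glob
import numpy as np
import pandas as pd

# Placeholder feature list; in the repo this is the agreed x_features_cols.
X_FEATURES_COLS = ["ask_size_1", "bid_size_1", "depth_l10", "vol_60s", "idx_diff"]

frames = []
for path in sorted(glob.glob("training_data/*.h5")):
    df = pd.read_hdf(path, key="data")
    frames.append(df[X_FEATURES_COLS])

# Stack all files, then apply the same deterministic missing-value policy
# used everywhere else before computing the stats.
x = pd.concat(frames, ignore_index=True).to_numpy(dtype=np.float64)
x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)

mean = x.mean(axis=0)
sigma = x.std(axis=0)

# These two files become part of the model version.
np.save("mean.npy", mean)
np.save("sigma.npy", sigma)
```
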
The first thing I do after generating mean.npy and sigma.npy:
  • verify the arrays have the same length as the feature list
  • verify sigma has no zeros
  • verify I can normalize one sample and get “reasonable numbers” (not all zeros, not all huge)
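
The same checks as assertions, reusing x and X_FEATURES_COLS from the sketch above (the “reasonable numbers” threshold is arbitrary):

```python
import numpy as np

mean = np.load("mean.npy")
sigma = np.load("sigma.npy")

# 1) stats and feature list were generated together
assert mean.shape == sigma.shape == (len(X_FEATURES_COLS),), "stats/feature-list length mismatch"

# 2) no constant feature slipped through
assert not np.any(sigma == 0), "sigma contains zeros: dead or broken feature"

# 3) one sample normalizes to something human-scale
z = (x[0] - mean) / sigma
assert np.all(np.isfinite(z)) and np.max(np.abs(z)) < 50, "suspicious normalized sample"
```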

“Index diff”: making price features survive time

A second issue shows up quickly: even if you normalize features, the price level itself moves over months.
Absolute price-based signals are fragile, because “10 dollars” means different things at different BTC levels.

So I introduce a relative price feature: mid price divided by the BitMEX index (the “indicative settle” reference).

That’s what Util/produceBitmexIndexDiff.py produces:

  • compute a mid price from last / mark prices
  • compute a ratio against the reference index
  • store it as idx_diff
  • mark the row as valid = 1 (the schema starts to care about validity explicitly)
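
A sketch of that transform, assuming the snapshot table exposes last/mark prices and the index reference as columns; the column names, the simple average used for the mid, and tying valid to finiteness are my assumptions, not the script's exact logic.

```python
import numpy as np
import pandas as pd

df = pd.read_hdf("snapshots_2019_06.h5", key="data")

# Mid price from last/mark prices (simple average here), then the ratio
# against the BitMEX index reference ("indicative settle").
mid = (df["last_price"] + df["mark_price"]) / 2.0
df["idx_diff"] = mid / df["indicative_settle"]

# Explicit validity flag: the schema starts to care about validity,
# so only rows with a finite ratio get valid = 1.
df["valid"] = np.isfinite(df["idx_diff"]).astype(int)

df.to_hdf("snapshots_2019_06_with_idx.h5", key="data", mode="w")
```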

In other words:

Don’t let the model learn “BTC at 3k behaves like this and BTC at 10k behaves like that”
when what I really want is “price relative to fair reference is stretched or compressed.”

This also makes later walk-forward evaluation less brittle, because the feature space is closer to “stationary” (still not stationary — just less ridiculous).

This isn’t “financial alpha.” It’s just a way to remove a moving baseline from the model’s view.

It’s the same engineering instinct as normalizing sensor readings in robotics:
you want the model to learn relationships, not raw units.


Live inference: normalize the same way or don’t pretend

The most important place normalization appears is not training.
It’s the bot.

In BitmexPythonChappie/OrderBookMovePredictor.py, I load the saved arrays and apply the transform exactly once per prediction:

  • load mean-truncated.npy and sigma-truncated.npy
  • extract FEATURES_COLS from the current snapshot
  • apply (x - mean) / sigma
  • replace inf, -inf, and NaN with 0

That last part is not “nice to have.” It’s how you stop live inference from crashing — or worse, outputting garbage while looking healthy.
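
A minimal sketch of that load-and-normalize step; the array file names are the ones above, while the feature list contents and the snapshot format are placeholders rather than the bot's actual code.

```python
import numpy as np

# Loaded once at startup; these files ship with the deployed model.
MEAN = np.load("mean-truncated.npy")
SIGMA = np.load("sigma-truncated.npy")
FEATURES_COLS = ["ask_size_1", "bid_size_1", "depth_l10", "vol_60s", "idx_diff"]

def normalize_snapshot(snapshot: dict) -> np.ndarray:
    """Apply the training-time transform to one live snapshot, exactly once."""
    x = np.array([snapshot.get(col, np.nan) for col in FEATURES_COLS], dtype=np.float64)
    z = (x - MEAN) / SIGMA
    # Deterministic missing-value policy: inf, -inf, and NaN all become 0.
    return np.nan_to_num(z, nan=0.0, posinf=0.0, neginf=0.0)
```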

The contract becomes:

  • same features
  • same order
  • same normalization stats
  • same missing-value policy

Anything else is a silent fork.

There’s a subtle failure mode here: if you change the feature list but forget to regenerate mean/sigma, you still get numbers — just the wrong ones.

That’s why I treat FEATURES_COLS, mean, and sigma as one versioned unit.
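
One cheap way to enforce that, sketched under the assumption that the feature order is stored next to the arrays (features.json is an illustration, not something the repo ships yet):

```python
import json
import numpy as np

def load_normalization_bundle(model_dir: str):
    """Load feature list + stats as one unit; refuse to run on a mismatch."""
    with open(f"{model_dir}/features.json") as f:
        features = json.load(f)
    mean = np.load(f"{model_dir}/mean-truncated.npy")
    sigma = np.load(f"{model_dir}/sigma-truncated.npy")
    if not (len(features) == mean.shape[0] == sigma.shape[0]):
        raise RuntimeError("feature list and mean/sigma were not generated together")
    return features, mean, sigma
```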


My “don’t lie to yourself” rules for normalization

This is the checklist I keep in the repo now — because I don’t trust my future self:

1) Version normalization with the model

  • saved_networks/<model>/
    • mean-*.npy
    • sigma-*.npy
    • (and later: metadata about feature order)

2) Treat sigma=0 as a bug

If a feature becomes constant in the dataset:

  • it’s either dead
  • or your collector/feature builder broke
  • or the market regime changed and your feature is useless

First fix: drop the feature or clamp sigma with a small floor — but only intentionally.
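
If I do clamp, it happens out loud, something like this (reusing sigma from the generation step; the floor value is arbitrary):

```python
import numpy as np

SIGMA_FLOOR = 1e-8  # arbitrary; pick per feature scale, on purpose

dead = np.flatnonzero(sigma < SIGMA_FLOOR)
if dead.size:
    # Log loudly: a clamped feature is a decision, not a default.
    print(f"WARNING: clamping sigma for feature indices {dead.tolist()}")
sigma = np.maximum(sigma, SIGMA_FLOOR)
```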

3) Normalize with training stats, not “current” stats

  • training stats must be computed from a fixed historical span
  • evaluation must use stats computed only from the past
  • live trading must use the stats packaged with the deployed model

4) NaNs and inf must become deterministic

If I allow “whatever NumPy does” to leak into decisions, I deserve the bug.

5) Validate distributions after normalization

At minimum:

  • mean near 0 (roughly) on a held-out set
  • typical values around a human-scale range
  • no feature dominates just because it’s unscaled
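
A rough held-out check along those lines, reusing X_FEATURES_COLS, mean, and sigma from earlier; the held-out file name and the thresholds are mine, not a repo convention.

```python
import numpy as np
import pandas as pd

holdout = pd.read_hdf("holdout_2019_06.h5", key="data")[X_FEATURES_COLS]
h = np.nan_to_num(holdout.to_numpy(dtype=np.float64), nan=0.0, posinf=0.0, neginf=0.0)

z = (h - mean) / sigma
col_mean = z.mean(axis=0)
col_absmax = np.abs(z).max(axis=0)

for i, name in enumerate(X_FEATURES_COLS):
    ok = abs(col_mean[i]) < 0.5 and col_absmax[i] < 50
    print(f"{'OK   ' if ok else 'CHECK'} {name:20s} mean={col_mean[i]:+.3f} |max|={col_absmax[i]:.1f}")
```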

Implementation steps (what I actually run)

Generate idx_diff in the dataset

Run Util/produceBitmexIndexDiff.py on the HDF5 file(s) you intend to treat as “truth” for training.

What I verify:

  • the new idx_diff column exists
  • it stays near 1 in normal periods (no insane jumps)
  • it doesn’t become NaN during missing-data stretches

Compute mean.npy and sigma.npy

Run Util/produceMeanSigma.py over the training dataset directory.

What I verify:

  • output arrays match the feature list length
  • sigma has no zeros
  • a sample row normalizes without NaNs/inf

Package stats with the model

Copy the produced arrays into the model directory (later under saved_networks/).

What I verify:

  • the bot loads the same stats the trainer used
  • feature order is identical

Confirm live normalization behavior

In BitmexPythonChappie/OrderBookMovePredictor.py, validate:

  • normalization is applied exactly once per step
  • NaNs/inf are zeroed deterministically
  • the model still produces actions under missing-data conditions

Instrumentation: the normalization dashboard I wish I had earlier

If normalization is a deployment problem, I need deployment-grade telemetry.

This month I start logging/plotting:

  • histogram per feature (raw vs normalized)
  • per-feature mean, sigma summary table
  • count of NaNs per feature per hour
  • count of inf values per feature per hour
  • min/max per feature per day (raw + normalized)
  • “sigma too small” alert list
  • “feature went constant” alert list
  • percent of rows marked invalid
  • correlations that suddenly flip sign (feature drift hint)
  • model input vector norm over time (detect blow-ups)
  • fraction of zeros in the normalized vector (detect NaN-clamping overload)
  • time alignment sanity: snapshot timestamp vs local clock
  • regime markers: volume regime (quiet vs bursty)
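
A small slice of that telemetry as a sketch: NaN counts per feature per hour over the snapshot table, assuming a pandas-readable table with an epoch-seconds timestamp column and the X_FEATURES_COLS list from earlier.

```python
import pandas as pd

df = pd.read_hdf("snapshots_2019_06.h5", key="data")

# Bucket rows by hour (timestamp assumed to be epoch seconds).
hour = pd.to_datetime(df["timestamp"], unit="s").dt.strftime("%Y-%m-%d %H:00")

# NaN count per feature per hour: a spike here usually means the collector
# or a feature builder broke, before the model ever sees bad numbers.
nan_per_hour = df[X_FEATURES_COLS].isna().groupby(hour).sum()
print(nan_per_hour.tail(24))
```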

What changed in my thinking

In 2016 I learned “feature scaling improves optimization.”
In 2017 I learned “training stability is engineering.”
In 2018 I learned “RL breaks if your signal is inconsistent.”

In 2019 I add a sharper rule:

If preprocessing can’t be reproduced byte-for-byte in production, it isn’t preprocessing — it’s a bug factory.


Resources and repo pointers

My research repo

The full research repo where these normalization artifacts are generated and then loaded by the live bot.

Mean/Sigma generator

Offline stats generation: compute and save mean.npy + sigma.npy as versioned artifacts.

Index diff helper

Adds a relative price feature (idx_diff) so price signals don’t die across regimes and price levels.

Live inference normalization

Where the contract becomes real: load stats, normalize, clamp NaNs/inf, then predict.


What’s next

Next month I stop hiding behind “features” and define alpha — the label that claims “this is the move that matters.”

And I’m already nervous, because the easiest way to cheat in trading is not in the model.
It’s in the target.
