Aug 25, 2019
Supervised Baselines - First Alpha Models, First Humbling Curves

I train my first alpha predictor on BitMEX order-book features, and learn why ‘it trains’ is not the same as ‘it works’.

Axel Domingues

In 2018 I spent a full year learning Reinforcement Learning (RL) and Deep RL with an instrumentation-first mindset. By December, I had a mental model I trusted:

  • always start with a dumb baseline
  • log everything that can lie
  • don’t celebrate learning curves until you can reproduce them

So when I started supervised alpha detection in mid-2019, I thought it would feel… easier. No credit assignment over long horizons. No reward hacking. No policy collapse.

What I got instead was a different kind of pain:

My model trained fine, my loss went down, and my results were still useless.

This post is about that first month where I trained “real” alpha models on BitMEX order-book features — and got humbled by the gap between:

  • a model that optimizes loss
  • and a model that is actionable under costs, noise, and regime shifts

This is research engineering, not financial advice. I’m documenting the process of building and evaluating models on historical data, and the ways those evaluations can be misleading.

Where we are in the 2019 pipeline

By August, the system has a shape. It’s not “production,” but it’s a real pipeline with real failure modes.

  • March: the collector (websockets, clock drift, first clean snapshots)
  • April: dataset reality (HDF5 schema, missing data, rules for not lying)
  • May: feature engineering in microstructure terms (liquidity created/removed)
  • June: normalization is a deployment problem (mean/sigma + index diff)
  • July: defining alpha without cheating (look-ahead labels + leakage traps)

Now the question is brutally simple:

Given my features at time t, can a model predict a useful “alpha outcome ahead”?

In the repo, this month centers on:

  • alpha_detection/train_alpha_model.py (training + evaluation loop)

…and it depends on the label and feature pipelines created in earlier months.


The baseline mindset (why “dumb” is sacred)

In Deep RL, baselines protect you from storytelling. In trading, they protect you from a worse sin:

building complexity to compensate for a target that isn’t learnable (or isn’t worth learning).

So I keep a strict rule:

Before I tune an architecture, I need at least two baselines:
  • a dumb predictor that sets a floor
  • a simple learnable model that proves the pipeline is coherent

In August, my baselines were:

  1. Constant predictor: always predict the mean of the training labels
  2. Persistence-ish predictor: “next move looks like the recent move” (usually collapses)
  3. Shallow linear model: sanity-check that some signal exists
  4. Small neural net (MLP): the first “real” model
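
Sketched minimally, the constant and linear baselines amount to a few lines of NumPy. This is illustrative code, not the repo’s: the function names are mine, and it assumes features and labels are already loaded as arrays from the HDF5 pipeline.

```python
import numpy as np

def constant_baseline_mse(y_train, y_valid):
    """Baseline 1: always predict the training-set mean."""
    pred = np.full_like(y_valid, y_train.mean())
    return np.mean((y_valid - pred) ** 2)

def linear_baseline_mse(X_train, y_train, X_valid, y_valid):
    """Baseline 3: ordinary least squares on the raw features."""
    Xtr = np.column_stack([X_train, np.ones(len(X_train))])   # add a bias column
    Xva = np.column_stack([X_valid, np.ones(len(X_valid))])
    w, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    return np.mean((y_valid - Xva @ w) ** 2)
```

Anything that can’t clearly beat the constant baseline on a time-split validation set does not deserve tuning.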

And then… I accidentally built something that looked like a baseline but wasn’t:

  • an early version of Deep Silos (more on that next month)

That tension (baseline vs architecture) becomes important later.


The target: what we’re predicting (and what we are not)

I’m using a look-ahead label (from July) that tries to answer:

  • “Given the current book state, how much price movement is about to happen over a short horizon?”

In the repo this is produced by:

  • alpha_detection/produce_alpha_outcome_ahead.py

And the big constraints I keep repeating to myself:

  • no future features (obvious)
  • no future normalization (less obvious)
  • no random shuffling across time (almost everyone does this by accident)
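
To make the look-ahead shape concrete, here is roughly what such a label computation looks like. The real definition lives in produce_alpha_outcome_ahead.py and is not necessarily this exact formula; the sketch only illustrates the one property that matters: the label at time t reads prices strictly after t, and nothing after t feeds back into the features.

```python
import numpy as np

def alpha_outcome_ahead(mid_price, horizon):
    """Illustrative look-ahead label: the largest absolute relative move of
    the mid-price over the next `horizon` snapshots.

    The label at index t only uses prices from t+1 .. t+horizon; the last
    `horizon` rows have no valid label and stay NaN (drop them later)."""
    n = len(mid_price)
    label = np.full(n, np.nan)
    for t in range(n - horizon):
        future = mid_price[t + 1 : t + 1 + horizon]
        label[t] = np.max(np.abs(future / mid_price[t] - 1.0))
    return label
```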

This month’s uncomfortable discovery is that even with those rules…

The label distribution itself can quietly kill you.

Because the market is mostly “nothing happens,” with occasional violent bursts.

So the first real lesson wasn’t about neural nets. It was about rare events.


What I actually trained (repo-grounded)

The training loop lives in alpha_detection/train_alpha_model.py. The structure is intentionally boring:

  • read HDF5 datasets from disk
  • define a TensorFlow model
  • train for a few epochs
  • write results to disk
  • print a small “threshold sanity check” summary
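
Compressed into a sketch, that shape is roughly the following. Paths, HDF5 keys, column names, and hyperparameters are placeholders, not the repo’s actual values.

```python
import os
import pandas as pd
import tensorflow as tf

# Placeholder paths and keys: the real ones come from the dataset pipeline.
train = pd.read_hdf("data/train.h5", key="snapshots")
valid = pd.read_hdf("data/valid.h5", key="snapshots")

feature_cols = [c for c in train.columns if c != "alpha_outcome_ahead"]
X_train, y_train = train[feature_cols].values, train["alpha_outcome_ahead"].values
X_valid, y_valid = valid[feature_cols].values, valid["alpha_outcome_ahead"].values
# (normalization with training-set mean/sigma goes here; see the next section)

# Small MLP: the point is a floor, not a leaderboard.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
history = model.fit(X_train, y_train,
                    validation_data=(X_valid, y_valid),
                    batch_size=256, epochs=5)

# Write artifacts, then run the threshold sanity check (sketched further down).
os.makedirs("artifacts", exist_ok=True)
model.save("artifacts/alpha_mlp.h5")
pd.DataFrame(history.history).to_csv("artifacts/curves.csv", index=False)
```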

The reason to keep it boring is simple:

In trading ML, every extra feature is a new place to accidentally cheat.

The dataset contract

The script expects data files that already include:

  • engineered features (from the collector/feature pipeline)
  • the alpha label column (from the look-ahead label script)

And it follows an explicit train/valid/test directory split.

If you randomly shuffle snapshots and split train/valid/test by row, you are almost certainly leaking regime information across time.

Even if you don’t leak features, you leak distribution.

Normalization in training

The script recomputes mean and sigma from the training set only and then reuses them:

  • apply the same mean/sigma to validation and test
  • treat inf and NaN as zero after normalization
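
In code, the whole contract fits in two small functions. This is a sketch that assumes the features arrive as NumPy arrays; the function names are mine, not the script’s.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute mean/sigma on the training set only."""
    mean = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant columns
    return mean, sigma

def apply_normalizer(X, mean, sigma):
    """Apply the training statistics to any split, then clean up."""
    Z = (X - mean) / sigma
    Z[~np.isfinite(Z)] = 0.0         # inf and NaN become zero, as in the training script
    return Z
```

The detail that matters: fit_normalizer runs exactly once, on the training split, and its output travels with the model from then on.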

This matches June’s theme:

Normalization is not “preprocessing.” It’s part of your deployment contract.

If you normalize differently between training and inference, you don’t have a model. You have two unrelated functions.


The first humbling curves

Here’s what my first training runs looked like:

  • training loss goes down
  • validation loss sometimes goes down a little
  • test metrics are unstable
  • predictions collapse toward a narrow band

The most painful part:

It looked like progress.

I could have written a victory post. Instead I did what 2018 taught me:

  • plot distributions
  • look at edge cases
  • inspect the top predictions manually

…and the story changed.

The “predict the mean” trap

When the label distribution is heavy-tailed, the safest move (for MSE-style training) is often:

  • predict something close to the global mean

That gives you a nice stable loss. And it gives you nothing actionable.

In train_alpha_model.py, I even added a small “threshold check” that prints counts for cases like:

  • predicted alpha > 0.25
  • predicted alpha > 0.50
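
Reconstructed roughly (this is my paraphrase, not the script verbatim), the check amounts to counting how often the model dares to predict a big move, and how often it was right when it did:

```python
import numpy as np

def threshold_report(preds, labels, cutoffs=(0.25, 0.50)):
    """For each cutoff: how often the model predicts a big move,
    and how often the label actually was big when it did."""
    for c in cutoffs:
        fired = preds > c
        n_fired = int(fired.sum())
        hit_rate = float((labels[fired] > c).mean()) if n_fired else float("nan")
        print(f"pred > {c:.2f}: fired {n_fired} times, "
              f"label also > {c:.2f} in {hit_rate:.1%} of those")
```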

The humbling part:

  • those events were rare
  • and when they happened, they were often wrong

So the model was not learning “signal.” It was learning “how not to be embarrassed by the average.”


My instrumentation checklist (the supervised version)

In Deep RL, I log rewards, values, entropy, KL, advantage stats.

In supervised alpha, I log different lies.

Here’s what I consider the minimum dashboard for this month:

  • label histogram (train vs valid vs test)
  • feature histograms (a few key features, split by time)
  • mean and sigma snapshots (saved to disk and versioned)
  • training loss curve
  • validation loss curve
  • prediction histogram (is it collapsing?)
  • scatter plot: predicted vs actual label
  • correlation: predictions vs label (overall and per-day)
  • top-N predicted events: inspect timestamps + market context
  • threshold table: precision/recall at a few cutoffs
  • stability under re-train: does the story survive a different seed?
  • stability under time shift: does it survive the next month?

A model that only looks good “on average” is usually a model that is learning the market’s boredom. To learn spikes, you need to look at spikes.
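
Two of those checks (prediction collapse and per-day correlation) catch most of the lies, and they are cheap. A minimal sketch, assuming a DataFrame with timestamp (unix seconds), pred, and label columns; the column names are mine:

```python
import pandas as pd

def collapse_and_stability_report(df):
    """Cheap diagnostics on a frame with 'timestamp', 'pred', 'label' columns."""
    # Collapse check: a ratio near zero means the model is hugging the mean.
    spread_ratio = df["pred"].std() / (df["label"].std() + 1e-12)
    print(f"pred std / label std: {spread_ratio:.3f}")

    # Stability check: a decent overall correlation that hides ugly
    # per-day spread is a red flag.
    day = pd.to_datetime(df["timestamp"], unit="s").dt.date
    per_day = df.groupby(day).apply(lambda g: g["pred"].corr(g["label"]))
    print(per_day.describe())
```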

A practical runbook (how to reproduce this month)

Produce labels (look-ahead alpha) without leakage

Run the label builder on top of your already-generated feature dataset.

The goal is: one HDF5 file that contains both features and a column for the look-ahead outcome.

Split by time, not by row

Create explicit train/valid/test ranges. My default is:

  • train: earlier contiguous chunk
  • valid: later contiguous chunk
  • test: final contiguous chunk

No shuffling across time boundaries.
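
A minimal version of that split, with illustrative boundary dates (pick yours from the actual data range) and a timestamp column assumed to be unix seconds:

```python
import pandas as pd

TRAIN_END = pd.Timestamp("2019-06-30")   # illustrative boundaries
VALID_END = pd.Timestamp("2019-07-31")

def time_split(df, ts_col="timestamp"):
    """Contiguous, non-overlapping train/valid/test split by time."""
    ts = pd.to_datetime(df[ts_col], unit="s")
    train = df[ts <= TRAIN_END]
    valid = df[(ts > TRAIN_END) & (ts <= VALID_END)]
    test = df[ts > VALID_END]
    return train, valid, test
```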

Train the baseline model

Run the training script:

  • alpha_detection/train_alpha_model.py

Keep the config minimal:

  • small batch size
  • few epochs
  • conservative learning rate

The point is not performance. The point is: does anything learnable exist?

Inspect results like an engineer, not like a gambler

Don’t just read the final loss. Open the output artifacts:

  • training + validation curves
  • prediction distribution
  • threshold summary

If the model collapses to predicting the mean, treat it as a diagnosis, not a failure.


What broke (symptoms → likely cause → first check)

  • Loss improves but predictions are useless → label distribution dominates → plot label histogram + prediction histogram
  • Validation improves, then degrades immediately → overfitting or regime mismatch → split by later months and repeat
  • Great results when shuffled, terrible when time-split → leakage or regime contamination → enforce strict time splits
  • Model predicts extreme values, then explodes → normalization mismatch or bad NaNs → log mean/sigma + NaN counts
  • Metrics vary wildly run to run → low signal-to-noise → fix seed, reduce model capacity, increase evaluation windows
  • Top predictions cluster around a single day → one regime event dominates → evaluate on a different month

If your evaluation is not time-aware, you are not evaluating a trading model. You are evaluating a compression model on a shuffled dataset.

The most important realization

This month is where I stop thinking of supervised learning as “easier than RL.”

In RL, the challenge is credit assignment.

In supervised trading, the challenge is:

  • is the target learnable?
  • is the target stable?
  • is the target worth acting on after costs and slippage?
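
The third question has a crude but useful filter: the predicted move has to clear the round-trip cost before it even deserves a backtest. A sketch with placeholder numbers (the fee and slippage values are illustrative, not a statement about any venue):

```python
def clears_cost_hurdle(predicted_move, fee_rate=0.00075, slippage=0.0003):
    """True if the predicted relative move exceeds a rough round-trip cost.

    fee_rate and slippage are placeholders; plug in the venue's real fee
    schedule and your own measured slippage."""
    round_trip_cost = 2 * fee_rate + 2 * slippage
    return predicted_move > round_trip_cost
```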

A model can be “correct” and still be untradeable.

And worse:

A model can be “correct” in backtests because the backtest is lying.

So August becomes a month of humility and discipline. Not because I failed to train. Because I learned what training success is allowed to hide.


Repo artifacts for this month

Alpha training loop

The training + evaluation script: reads HDF5, trains a model, and writes results.

Alpha label generation

Builds the look-ahead outcome label. This is where leakage traps start if you’re careless.

Early Deep Silos module (preview)

An architectural idea that groups features by “family” so the model can’t cheat by mixing everything too early.

Dataset pipeline foundation

The HDF5 schema and output conventions established in April become the backbone for every experiment after this.


Field notes (what I’d tell my past self)

  • “It trains” is not a result. It’s a precondition.
  • If your model predicts the mean, it might be telling you the truth: your target is mostly noise.
  • Never evaluate trading ML without a time split.
  • Always inspect top predictions manually. The model’s mistakes teach faster than the loss curve.
  • Treat normalization as a versioned artifact, not a preprocessing step.
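
For that last point, a sketch of what “versioned artifact” means in practice (the file layout and naming are mine, not the repo’s):

```python
import hashlib
import json

def save_normalizer(mean, sigma, path="artifacts/normalizer_v1.json"):
    """Persist the training-set statistics so inference loads exactly the
    numbers training used; the checksum makes silent drift visible."""
    payload = {"mean": list(map(float, mean)), "sigma": list(map(float, sigma))}
    payload["checksum"] = hashlib.md5(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```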


What’s next

This month I proved the pipeline can train models, but I also learned that “baseline MLP” is not enough.

The next step is architectural.

I want a model that:

  • doesn’t mix unrelated feature families too early
  • generalizes better across regimes
  • fails in ways I can diagnose

Because in trading, debuggability is not optional. It’s survival.
