Oct 27, 2019 - 14 MIN READ
Live Alpha Monitoring - When the Market Talks Back

I stop treating my alpha model like a notebook artifact and make it sit in the real BitMEX stream. The goal is not trading yet. It is seeing whether my features, normalization, and inference loop survive reality without quietly cheating.

Axel Domingues

In September I finally got a Deep Silos model to behave. Not "win the leaderboard" behave - behave in a way an engineer would trust: stable training curves, controlled capacity, clear feature families, and fewer ways to accidentally leak.

So naturally, I did the most dangerous thing.

I took the model out of the notebook and let it stare at a real BitMEX (Bitcoin Mercantile Exchange) order book.

Not to trade.

Just to watch.

Because the first time your model sees the live stream, you discover what your dataset politely hid.


What I mean by "live alpha monitoring"

There is a version of this project where I jump straight from "good validation curve" to "place orders".

I am not doing that.

This month is about building a live inference loop that:

  • pulls the same market data my collector pulls,
  • runs the same feature engineering pipeline,
  • applies the same normalization (mean/sigma + index diff),
  • produces predictions continuously,
  • and logs everything hard enough that I can reproduce and audit it later.

If the loop cannot do that, trading would just be gambling with extra steps.

This is not financial advice. This is research engineering: collecting data, building models, and learning how reality breaks assumptions.

The "Chappie" idea (v0): a bot that only listens

I am calling the running process Chappie.

In October 2019, Chappie's job is simple:

  1. Subscribe to BitMEX market data
  2. Build periodic snapshots
  3. Run the alpha model on each snapshot
  4. Report predictions (and the health of the pipeline)
  5. Save snapshots to disk so I can replay the exact moments later

The key constraint: no secret feature code path.

If training and live inference do not share the same feature logic, then any success is probably a bug.


Repo focus (where the live monitor lives)

This month is mostly about turning three files into a cohesive process:

  • BitmexPythonChappie/main.py
  • BitmexPythonChappie/SnapshotManager.py
  • BitmexPythonChappie/OrderBookMovePredictor.py

The wiring: BitmexPythonChappie/main.py

main.py does the boring but critical orchestration:

  • connects to BitMEX (REST + websocket),
  • estimates a clock difference between my machine and the server,
  • starts a Twisted reactor loop for periodic tasks,
  • watches for connection faults and triggers reconnect logic,
  • passes a configured client into the snapshot manager.

Boring is good. Boring means reproducible.
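
Here is a minimal, runnable sketch of the LoopingCall pattern that drives the whole thing - process_snapshot is a stand-in for the real snapshot work, and the real main.py wraps this with reconnect logic and clock estimation:

from twisted.internet import reactor, task

def process_snapshot():
    # In the real loop: read websocket book state, build features, predict, stage, write HDF5.
    print("tick")

loop = task.LoopingCall(process_snapshot)
deferred = loop.start(1.0, now=True)  # run every second; the Deferred fires if the loop ever stops
deferred.addErrback(lambda failure: failure.printTraceback())  # surface faults instead of dying silently

reactor.run()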

If you see API keys in any local scripts: do not commit them. Treat this as a personal reminder as much as a warning to the reader.

The heart: BitmexPythonChappie/SnapshotManager.py

This is where monitoring becomes real.

A loop (LoopingCall) runs __process_snapshot() on a fixed schedule. Inside that loop:

  • it asks the websocket for current order book state,
  • builds a feature dictionary from bids/asks + timestamps,
  • stages snapshots until the label can be determined (the "moved_up" outcome),
  • writes daily HDF5 files as a durable audit trail,
  • calls the model for a fresh prediction on every snapshot.

In code, it looks roughly like this:

prediction = self.predictor.predict(pd.DataFrame([features_dict]))
self.BitMEX_bot_client.inform_new_prediction(prediction)

That line is basically the whole month: "can I produce live predictions without lying?"
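
The part that is easiest to get wrong is the stage-until-labeled step, so here is a simplified sketch of the idea - field names like snapshot_time, mid_price, and moved_up follow the post, but the class itself is illustrative, not the real SnapshotManager:

import pandas as pd

class SnapshotManagerSketch:
    """Illustrative only - not the real SnapshotManager."""

    def __init__(self, predictor, bot_client, label_horizon_seconds=60):
        self.predictor = predictor
        self.bot_client = bot_client
        self.label_horizon_seconds = label_horizon_seconds
        self.pending = []  # snapshots staged until their future label is known

    def process_snapshot(self, features):
        # Predict immediately - the model never sees the future.
        prediction = self.predictor.predict(pd.DataFrame([features]))
        self.bot_client.inform_new_prediction(prediction)

        # Stage the snapshot; the "moved_up" label needs a later mid price.
        self.pending.append(features)

        # Anything old enough gets its label and goes to the daily HDF5 file.
        while self.pending and (features["snapshot_time"]
                                - self.pending[0]["snapshot_time"]) >= self.label_horizon_seconds:
            labeled = self.pending.pop(0)
            labeled["moved_up"] = int(features["mid_price"] > labeled["mid_price"])
            self.append_to_daily_hdf5(labeled)

    def append_to_daily_hdf5(self, labeled_row):
        # Real code appends to a per-day HDF5 file; a print keeps the sketch self-contained.
        print(labeled_row)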

The model wrapper: BitmexPythonChappie/OrderBookMovePredictor.py

This file is the bridge between research artifacts and a running process.

It loads the latest checkpoint and applies the same normalization artifacts I produced earlier:

  • mean-*.npy
  • sigma-*.npy
  • bitmex_index_diff-*.npy (if used)

Then it turns "feature dict" into "model input", produces a prediction, and returns something the bot client can act on (right now: log/alert, not trade).
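
Roughly, a wrapper like that looks as follows - the feature columns and the model object are placeholders, but the normalization step mirrors the artifacts above:

import numpy as np
import pandas as pd

class OrderBookMovePredictorSketch:
    """Illustrative wrapper - the real class loads a specific Deep Silos checkpoint."""

    # Feature order must match training exactly - these column names are placeholders.
    FEATURE_COLUMNS = ["spread", "imbalance", "bid_size_delta", "ask_size_delta"]

    def __init__(self, model, mean_path, sigma_path):
        self.model = model                # trained model object, loaded elsewhere
        self.mean = np.load(mean_path)    # mean-*.npy produced at training time
        self.sigma = np.load(sigma_path)  # sigma-*.npy produced at training time

    def predict(self, features: pd.DataFrame):
        x = features[self.FEATURE_COLUMNS].to_numpy(dtype=np.float64)
        x = (x - self.mean) / self.sigma  # the same normalization used offline
        # Refuse to predict on garbage instead of silently filling it.
        if not np.isfinite(x).all():
            raise ValueError("non-finite feature after normalization")
        return self.model.predict(x)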


Clock drift: the first silent killer

The live stream quickly forces one issue to the surface:

timestamps are not optional.

If your snapshot time is wrong by even a little, every look-ahead label and every alignment step becomes suspect.

My basic approach this month is:

  • fetch an approximate server time reference,
  • compute clock_diff = local_time - server_time,
  • treat server time as now - clock_diff inside the snapshot manager.

It is not perfect, but it is consistent - and consistency is what lets me debug.
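
As a sketch, assuming a hypothetical fetch_server_time helper that parses the exchange's REST response:

import time

def estimate_clock_diff(fetch_server_time):
    """clock_diff = local_time - server_time, both in epoch seconds."""
    local_time = time.time()
    server_time = fetch_server_time()  # hypothetical helper, e.g. parsed from a REST response
    return local_time - server_time

def server_now(clock_diff):
    """Approximate server time from the local clock: now - clock_diff."""
    return time.time() - clock_diff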

Rule: if I cannot explain how a timestamp was produced, I do not trust the downstream label or evaluation that depends on it.

The "model sees what the market says" moment

A good offline validation curve makes you think the model is smart.

Then live monitoring happens and you realize something sharper:

the market is adversarial input.

A few patterns I see immediately in the live prediction logs:

  • predictions become confident during spread expansions (not when price moves),
  • confidence spikes correlate with liquidity reshuffling (created liquidity, size jumps),
  • the model panics around sudden bursts of small trades (microstructure noise),
  • predictions have time-of-day texture (activity regimes, not just price levels).

And at least once per session:

  • the whole stream behaves normally,
  • then the prediction distribution shifts,
  • and the first instinct is model drift,
  • but the actual cause is: pipeline drift (missing book levels, stale websocket state, clock alignment).

That is why this month is monitoring, not trading.


Instrumentation: my live alpha dashboard checklist

I am not building a fancy UI yet. I am building logs and plots that let me answer: "is this real?"

Here is what I log/plot/check in October:

  • Snapshot cadence: actual interval between snapshots vs expected interval
  • Processing latency: time to compute features + time to run inference
  • Queue depth: length of any internal buffers (if I fall behind, I want to know)
  • Websocket health: disconnects, reconnect attempts, stale state detection
  • Clock diff over time: does it drift during a long session?
  • Best bid/ask: current values and changes per snapshot
  • Spread: instantaneous spread and rolling statistics
  • Top-of-book imbalance: bid vs ask size ratio (and its change)
  • Liquidity created: change in size at fixed depth levels (bid/ask)
  • Level coverage: how many book levels were present vs expected
  • Missing feature rate: how often any feature is NaN/inf/out-of-range
  • Normalization sanity: mean/sigma application, clipping counts, min/max per feature
  • Prediction distribution: counts per class (or mean/std if continuous)
  • Confidence proxy: how confident the model is (even if approximated)
  • Alert triggers: thresholds for "high confidence + high microstructure activity"
  • Replay hooks: file name + row index so I can reproduce a moment offline
  • Daily output size: number of rows saved to HDF5 per day

If any of these go bad, I treat the predictions as untrusted. The model is not wrong - my pipeline is wrong - until proven otherwise.
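
In practice this checklist becomes one structured log record per snapshot, roughly like the sketch below - the field names mirror the checklist, not the real schema:

import json
import logging
import math

log = logging.getLogger("chappie.health")

def log_snapshot_health(features, prediction, started_at, finished_at, expected_interval):
    """One structured line per snapshot - a replayable audit trail for the checklist above."""
    record = {
        "snapshot_time": features.get("snapshot_time"),
        "latency_s": finished_at - started_at,
        "expected_interval_s": expected_interval,
        "best_bid": features.get("best_bid"),
        "best_ask": features.get("best_ask"),
        "spread": features.get("spread"),
        "imbalance": features.get("imbalance"),
        "nan_features": sum(1 for v in features.values()
                            if isinstance(v, float) and not math.isfinite(v)),
        "prediction": prediction,
    }
    log.info(json.dumps(record, default=str))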

The do-not-lie-to-yourself rules (live edition)

April was about dataset honesty. October is about live honesty.

I keep a few hard rules:

  1. Train/serve skew is a bug until proven otherwise. If the live feature vector is not built the same way as training, stop.
  2. Every prediction must be replayable. I need to be able to load the saved snapshot row and produce the same prediction offline.
  3. No silent fallbacks. If the websocket state is stale or missing depth, I would rather log an error than fill with zeros.
  4. Distribution shifts must be explained. If the prediction distribution changes, I investigate the pipeline first, not the model.
  5. No trading until monitoring is boring. "Boring" means: stable cadence, stable inference time, stable input ranges, predictable failure modes.
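
Rule 2 is cheap to enforce with a few lines - the HDF5 path, row index, and scalar-prediction assumption here are illustrative:

import pandas as pd

def replay_check(hdf5_path, row_index, predictor, logged_prediction):
    """Reload a saved snapshot row and confirm the offline prediction matches the live one."""
    day = pd.read_hdf(hdf5_path)  # the daily file; a key argument may be needed
    row = day.iloc[[row_index]]   # keep it as a one-row DataFrame, like the live call
    offline_prediction = predictor.predict(row)
    # Assumes a scalar class prediction; compare element-wise for array outputs.
    assert offline_prediction == logged_prediction, (
        f"replay mismatch at row {row_index}: {offline_prediction} != {logged_prediction}"
    )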

Common failure modes I hit this month

These show up fast in a live loop. I am writing them down in "symptom -> likely cause -> first check" format:

  • Predictions become all one class -> normalization mismatch -> verify mean/sigma files used in inference
  • Sudden spike in NaNs -> missing book levels or bad division -> log feature-wise NaN counts
  • Inconsistent snapshot spacing -> reactor loop blocked -> measure feature computation time and GC pauses
  • High latency during bursts -> too much per-tick work -> reduce feature set or batch inference
  • Prediction changes when replayed -> non-determinism or version mismatch -> confirm model checkpoint + preprocessing version
  • Spread looks wrong (negative/zero) -> bid/ask swapped or stale state -> log best bid/ask source and timestamps
  • Clock diff drifts -> server time estimation too naive -> recompute diff periodically and compare
  • HDF5 output gaps -> file rotation logic or exception path -> log every write and exception context
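
Two of the cheapest first checks from that table, written out as a sketch (the column names are assumptions):

import pandas as pd

def first_checks(features: pd.Series):
    """Cheap sanity checks for two of the most common symptoms above."""
    problems = []

    # Spread looks wrong -> bid/ask swapped or stale state.
    if features["best_bid"] >= features["best_ask"]:
        problems.append("non-positive spread: best_bid >= best_ask (swapped or stale book?)")

    # Sudden spike in NaNs -> missing book levels or bad division.
    nan_columns = [name for name, value in features.items() if pd.isna(value)]
    if nan_columns:
        problems.append(f"NaN features: {nan_columns}")

    return problems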

Field notes (things I did not expect)

  • The live stream is the best unit test I have ever had. It is brutal and it does not care about my feelings.
  • Most "model problems" are pipeline problems first. It is embarrassing how often this is true.
  • Confidence is not correctness. Confidence often tracks volatility and spread, not my desired "alpha".
  • Monitoring is a product. If I cannot observe it, I cannot improve it.

Deliverable for October 2019

This month's artifact is not a new model.

It is a running process:

  • a Chappie entry point (BitmexPythonChappie/main.py) that can run for hours,
  • a snapshot loop (SnapshotManager.py) that produces both HDF5 output and predictions,
  • a model wrapper (OrderBookMovePredictor.py) that is strict about preprocessing,
  • and log output that supports replay and debugging.

Expected result: I can run Chappie in "listen-only mode" and collect a daily HDF5 file plus a prediction log that I can replay offline.

That sounds small - but it is the step that makes everything else honest.


Resources (what I kept open while building this)

bitmex-deeprl-research (repo)

The codebase this series is grounded in. October lives in BitmexPythonChappie/.

Twisted LoopingCall

The periodic snapshot loop pattern I use to turn "stream" into "ticks".


What is next

This month taught me that the market does not just provide data.

It pushes back.

Next month: The 503 Lesson - the moment I realize that some strategies fail not because the model is wrong, but because the exchange is unreachable exactly when the model becomes most confident.
