
If the order book is the battlefield, features are the sensors. This month I stop hand-waving and teach my pipeline to measure liquidity being added and removed - in a way I can deploy live.
Axel Domingues
In January I convinced myself that order books are the battlefield.
In February I started translating microstructure into the question that actually matters for ML:
What will the model see?
In March I built a collector that could survive websockets, reconnects, and clock drift.
In April I learned the uncomfortable truth: datasets are not neutral. The schema, missingness rules, and integrity checks decide what your model is allowed to learn.
Now it is May, and it is time to do the thing everyone hand-waves:
feature engineering - but with microstructure discipline.
Not "RSI" and "MACD". Not vibes.
I want signals that match the mechanics: liquidity being created and removed, because that is what every participant is doing, every second.
The core feature this month
Liquidity created/removed: net depth change at best bid/ask + near levels, aggregated over time windows.
The key mindset shift
Features are sensors, not indicators. If I cannot explain what market action produces a feature, I do not trust it.
The anti-leakage rule
Every feature must be computable from past + present messages only. No future bars, no "oops I used the close".
Repo focus
Feature computation lives in SnapshotManager (collector + live bot):
Collector/SnapshotManager.py
Chappie/SnapshotManager.py
At the order-book level, signal is not mystical.
It is mostly three things: limit orders being posted (liquidity added), orders being cancelled (liquidity pulled), and resting size being consumed by trades (liquidity taken).
A lot of ML trading features ignore this and operate on price series as if price is the primary object.
But on BitMEX, price is an output of the matching engine.
The book is the input.
So I want features that track how the book changes.
The simplest story: instead of asking "did price go up?", I can ask:
"Is the market currently adding liquidity near the top... or taking it away?"
That question is closer to the mechanics of the matching engine.
It also gives me a handle on regime.
And importantly, it is a feature I can compute purely from websocket order book updates, which means it can exist in both the offline dataset pipeline (collector) and the live bot.
That symmetry becomes a theme later.
I am not trying to perfectly classify each book update as cancel vs add vs fill.
I am doing something simpler and more robust:
If the net change is positive, liquidity was created (more size got posted than removed). If it is negative, liquidity was removed (more size got consumed/cancelled than added).
This matters because it is a feature that survives a messy world.
The feature logic lives in two parallel implementations:
BitmexPythonDataCollector/SnapshotManager.py (offline dataset pipeline)
BitmexPythonChappie/SnapshotManager.py (live bot feature stream)
They share the same idea: build a snapshot, then compute deltas.
In both SnapshotManagers, I intentionally limit the depth I care about.
In the current repo snapshot, the constant is:
RELEVANT_NUM_QUOTES = 5
Meaning: compute near-top-of-book signals using the best 5 levels per side.
That is an engineering tradeoff.
This is not a forever decision - it is a starting contract.
Inside SnapshotManager.__determine_liquidity_creation(...), I compare the new snapshot sizes to the previous snapshot sizes and compute:
Conceptually:
delta_best_bid = new_best_bid_size - prev_best_bid_size
delta_rest_bids = sum(new_bid_sizes[1:K]) - sum(prev_bid_sizes[1:K])
delta_best_ask = new_best_ask_size - prev_best_ask_size
delta_rest_asks = sum(new_ask_sizes[1:K]) - sum(prev_ask_sizes[1:K])
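To make that concrete, here is a minimal sketch of the delta step, assuming each snapshot exposes per-side size lists ordered best-first; the function name and arguments are illustrative, not the repo's exact API:

RELEVANT_NUM_QUOTES = 5  # best K levels per side, as above

def liquidity_deltas(prev_bid_sizes, new_bid_sizes,
                     prev_ask_sizes, new_ask_sizes,
                     k=RELEVANT_NUM_QUOTES):
    """Net depth change at the best level and at the remaining near levels, per side."""
    delta_best_bid = new_bid_sizes[0] - prev_bid_sizes[0]
    delta_rest_bids = sum(new_bid_sizes[1:k]) - sum(prev_bid_sizes[1:k])
    delta_best_ask = new_ask_sizes[0] - prev_ask_sizes[0]
    delta_rest_asks = sum(new_ask_sizes[1:k]) - sum(prev_ask_sizes[1:k])
    return delta_best_bid, delta_rest_bids, delta_best_ask, delta_rest_asks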
Then those deltas get appended to rolling arrays alongside their timestamps:
best_bid_liquidity_array, best_bid_liquidity_ts_array
bids_liquidity_array, bids_liquidity_ts_array
best_ask_liquidity_array, best_ask_liquidity_ts_array
asks_liquidity_array, asks_liquidity_ts_array
And the arrays are trimmed to a max horizon:
MAX_LIQUIDITY_TIME = 10 minutes
That trimming is not a nice-to-have. It prevents memory blowups during long runs.
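Roughly, the append-and-trim step looks like this; the class below is a sketch of mine, not the repo's structure, and it assumes the horizon is expressed in seconds:

from collections import deque

MAX_LIQUIDITY_TIME = 10 * 60  # 10 minutes, expressed in seconds (an assumption about units)

class RollingDeltas:
    """Holds (timestamp, delta) pairs and drops anything older than the horizon."""

    def __init__(self, horizon=MAX_LIQUIDITY_TIME):
        self.horizon = horizon
        self.deltas = deque()
        self.timestamps = deque()

    def append(self, ts, delta):
        self.deltas.append(delta)
        self.timestamps.append(ts)
        # trim from the left so long runs never grow memory without bound
        while self.timestamps and self.timestamps[0] < ts - self.horizon:
            self.timestamps.popleft()
            self.deltas.popleft()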
A single delta is noisy. A bursty market produces jitter.
So I do not feed the last delta to a model. I aggregate deltas over windows.
In the repo snapshot, the window list is explicit:
RELEVANT_CREATED_LIQUIDITY_WINDOW = [1.5, 3, 6, 12, 24, 48, 96, 240, 480, 960] seconds
That gives me multi-scale vision: short windows catch bursts, long windows catch slow drift.
In SnapshotManager.__create_current_snapshot(...), each window produces four features:
best_bid_created_liq_{window}
bids_created_liq_{window}
best_ask_created_liq_{window}
asks_created_liq_{window}
Each one is simply:
the sum of deltas since current_ts - window, which is a clean engineering contract.
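A minimal sketch of that window aggregation, assuming the rolling delta and timestamp arrays from above; the function name is illustrative (the repo builds these features inside __create_current_snapshot):

RELEVANT_CREATED_LIQUIDITY_WINDOW = [1.5, 3, 6, 12, 24, 48, 96, 240, 480, 960]  # seconds

def created_liquidity_features(deltas, timestamps, current_ts, prefix):
    """Sum of deltas inside each lookback window, keyed like best_bid_created_liq_{window}."""
    features = {}
    for window in RELEVANT_CREATED_LIQUIDITY_WINDOW:
        cutoff = current_ts - window
        # only past and present deltas: ts <= current_ts enforces the anti-leakage rule
        features[f"{prefix}_created_liq_{window}"] = sum(
            d for d, ts in zip(deltas, timestamps) if cutoff <= ts <= current_ts
        )
    return features

Calling it once per side bucket (best_bid, bids, best_ask, asks) reproduces the four feature families per window.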
Created liquidity
Positive net delta: more size got posted than removed in that window.
Removed liquidity
Negative net delta: more size got consumed/cancelled than posted in that window.
Liquidity deltas are about book depth.
Executed volume is a different object.
That is why SnapshotManager also tracks traded volume arrays and exposes features like:
buy_volume_{window}
sell_volume_{window}
(Those are derived from trade messages, not book deltas.)
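A rough sketch of that per-window volume aggregation, assuming each trade message carries a timestamp, a side, and a size; the field names and the function are illustrative, not the repo's exact code:

def traded_volume_features(trades, current_ts,
                           windows=(1.5, 3, 6, 12, 24, 48, 96, 240, 480, 960)):
    """Executed volume per side per window, from trade messages (not book deltas).
    Each trade is assumed to look like {"timestamp": ts, "side": "Buy", "size": 100}."""
    features = {}
    for window in windows:
        cutoff = current_ts - window
        recent = [t for t in trades if cutoff <= t["timestamp"] <= current_ts]
        features[f"buy_volume_{window}"] = sum(t["size"] for t in recent if t["side"] == "Buy")
        features[f"sell_volume_{window}"] = sum(t["size"] for t in recent if t["side"] == "Sell")
    return features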
If I mix these concepts, I will misread the market.
A market can churn its depth without printing a single trade, or trade heavily while the book refills behind every fill.
So I keep them separate on purpose.
April taught me that datasets lie unless you force them to tell the truth.
May's version is:
features lie unless you force them to stay causal, stable, and consistent.
Here is the checklist I use while building and validating these liquidity features.
If timestamp goes backward, the time-window sums become meaningless.
A reconnect can cause sudden jumps. I want those flagged, not silently fed to the model.
When price spikes up, I expect ask liquidity to be removed more often than created (negative ask deltas).
If the 1.5s window has the same distribution as the 960s window, something is wrong in the indexing logic.
I am not looking for predictive proof yet - just basic alignment with mechanics.
If the live bot computes a different feature set than the collector, I am training on fiction.
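A couple of these checks are cheap enough to sketch; the function names and the row format (dicts of feature name to value) are mine, not the repo's:

def check_timestamps_monotonic(ts_list):
    """Window sums assume time only moves forward; flag any backward step."""
    return all(later >= earlier for earlier, later in zip(ts_list, ts_list[1:]))

def check_feature_parity(collector_row, live_row, tol=1e-9):
    """Collector and live bot must emit the same feature names and values
    for the same message stream; anything else means training on fiction."""
    if set(collector_row) != set(live_row):
        return False
    return all(abs(collector_row[k] - live_row[k]) <= tol for k in collector_row)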
What is the difference between liquidity and volume? Liquidity is what's available to trade at various prices (depth in the book). Volume is what actually traded (executions). They interact, but they are not the same thing - and conflating them leads to nonsense conclusions.
Why so many windows? Because markets are multi-timescale. A burst that matters at 3 seconds might be noise at 15 minutes, and a slow drift matters even when the last 1.5 seconds look calm. Multiple windows let the model see both without me hardcoding a single horizon.
Does removed liquidity mean someone is selling? No. Net depth can drop because orders were cancelled (nothing traded) or because orders were filled (trades ate the book).
That is why I track both depth deltas and executed volume as separate feature families.
In classic ML, features felt like inputs.
In microstructure, features feel like instrumentation.
I am not extracting alpha yet - I am building a sensor suite that measures the actual forces inside the matching engine: liquidity being posted, liquidity being pulled, and volume actually trading.
It is the same mindset I learned in RL in 2018:
if you cannot instrument it, you cannot debug it - and if you cannot debug it, you cannot trust it.
My research repo
This series is tied to code. Collector → datasets → features → alpha models → Gym envs → live loop (“Chappie”).
Once I have a feature pipeline, a new problem shows up immediately:
Normalization is not a training detail
If I compute means/standard deviations on one dataset and trade on another regime, the feature distribution shifts and the model can behave like it lost its senses.
Next month is about:
Normalization Is a Deployment Problem - Mean/Sigma and Index Diff
In June 2019 I stop treating feature scaling as “preprocessing” and start treating it as part of the production contract - same transforms, same stats, same order — or the live system lies.