
In March 2019 I stopped “talking about microstructure” and started collecting it. Websockets drop messages, clocks drift, and the only thing that matters is producing a snapshot I can trust.
Axel Domingues
January taught me what an order book is.
February taught me what the model should see.
March is where the fantasy dies:
if the collector lies, every backtest is theatre.
This is the month I built the thing that sits underneath everything else in this repo: a real-time BitMEX data collector that turns a firehose (websocket streams) into something I can actually train on.
And yes — the hardest part wasn’t “websockets”.
It was time.
The collector contract
What “a snapshot” means, and what has to be true before I write it to disk.
Websocket reality
Dropped messages, stale books, partial updates, and why “connected” doesn’t mean “correct”.
Clock drift is a data bug
How I estimate machine↔exchange time difference using trade timestamps (and why it matters later).
The first clean snapshot
The moment the stream becomes stable enough to trust — and the rules I added to keep it that way.
I need a process that can:
- stay connected to the BitMEX websocket streams (and reconnect when they die)
- notice when the stream is stale or inconsistent instead of pretending it's fine
- turn incremental updates into periodic snapshots aligned to a clock I understand
- persist those snapshots so I can train on them later
That’s it.
If I can’t do that, “feature engineering” is just writing fanfic.
This month is basically these three modules:
- BitmexPythonDataCollector/main.py — orchestration loop + reconnection strategy
- BitmexPythonDataCollector/BitMEXWebsocketClient.py — websocket client + liveness checks
- BitmexPythonDataCollector/SnapshotManager.py — periodic snapshot creation + rolling windows + persistence

I like that this mirrors how I think:
transport → truth checks → stateful snapshot → storage
At a high level, the collector is “one websocket client + one snapshot manager per symbol”.
In the repo, main.py loops through a list of root symbols and builds a pair for each.
The part that made it feel “real” is that the design assumes failure: connections drop, streams stall, and the process has to recover without me watching it.
So the main loop is basically a supervisor: build a websocket client and a snapshot manager per symbol, watch for faulted clients, and recreate whatever dies.
Start with a single contract (I used XBTUSD first), and only add more once the loop stays alive.
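Here's a minimal sketch of that supervisor shape. The module paths come from the repo layout above, but the class names, constructor arguments, and the `faulted` / `close()` details are my assumptions, not the repo's exact API:

```python
import time

# Repo modules (per the layout above); exact class names and signatures are assumed here.
from BitmexPythonDataCollector.BitMEXWebsocketClient import BitMEXWebsocketClient
from BitmexPythonDataCollector.SnapshotManager import SnapshotManager

SYMBOLS = ["XBTUSD"]        # start with one contract; add more once the loop stays alive
CHECK_INTERVAL_S = 5        # how often the supervisor inspects each client


def build_pair(symbol):
    """Fresh websocket client + snapshot manager for one symbol (assumed constructors)."""
    client = BitMEXWebsocketClient(symbol)
    manager = SnapshotManager(symbol, client)
    return client, manager


def supervise():
    pairs = {symbol: build_pair(symbol) for symbol in SYMBOLS}
    while True:
        for symbol, (client, manager) in list(pairs.items()):
            # "Connected" is not enough: the client marks itself faulted when the
            # stream goes quiet, and the supervisor's job is simply to rebuild it.
            if getattr(client, "faulted", False):
                client.close()              # teardown method name assumed
                pairs[symbol] = build_pair(symbol)
        time.sleep(CHECK_INTERVAL_S)
```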
For my “first trustworthy snapshot”, I needed:
- orderBookL2_25 (depth)
- quote (top of book)
- trade (last executions)
- instrument (metadata like mark price / funding when needed later)

The websocket client tracks the last received message timestamp. If no messages arrive for a few seconds, I treat the client as faulted and recycle it.
I estimate machine↔server time difference from the trade timestamps. This becomes the “alignment glue” for windows and labels later.
If book and quote are out of sync, or fields are missing, I skip and log. Silence is worse than missingness.
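As a sketch of what “book and quote out of sync” can mean in practice, here's a gate built on the public API's field names (`side`/`price` on orderBookL2 rows, `bidPrice`/`askPrice` on quotes); the exact checks in the repo may differ:

```python
import logging

log = logging.getLogger("collector")


def book_quote_consistent(book_rows, quote):
    """Return True only if top-of-book and the quote stream agree.

    book_rows: list of orderBookL2_25 rows, e.g. {"side": "Buy", "price": ..., "size": ...}
    quote:     latest quote row, e.g. {"bidPrice": ..., "askPrice": ...}
    """
    required = ("bidPrice", "askPrice")
    if quote is None or any(quote.get(k) is None for k in required):
        log.warning("quote missing fields: %s", quote)
        return False

    bids = [r["price"] for r in book_rows if r.get("side") == "Buy"]
    asks = [r["price"] for r in book_rows if r.get("side") == "Sell"]
    if not bids or not asks:
        log.warning("book has an empty side")
        return False

    best_bid, best_ask = max(bids), min(asks)
    if best_bid >= best_ask:
        log.warning("crossed book: bid %s >= ask %s", best_bid, best_ask)
        return False

    # Book and quote should describe the same top of book (exact match here;
    # a tick-size tolerance is a reasonable variation).
    if best_bid != quote["bidPrice"] or best_ask != quote["askPrice"]:
        log.warning("book/quote out of sync: %s/%s vs %s/%s",
                    best_bid, best_ask, quote["bidPrice"], quote["askPrice"])
        return False
    return True
```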
In BitMEXWebsocketClient.py, I added a periodic status check that treats “no messages recently” as failure.
The pattern is simple:
- lastMessageReceivedTS is updated on every message
- now - lastMessageReceivedTS > threshold → set faulted state

In my code, the threshold is only a few seconds — because in active markets, silence is suspicious.
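A sketch of that pattern, with the threshold value and the `faulted` flag as illustrative names rather than the repo's exact attributes:

```python
from datetime import datetime, timedelta

LIVENESS_THRESHOLD = timedelta(seconds=5)   # "a few seconds"; tune to market activity


class LivenessMixin:
    def __init__(self):
        self.lastMessageReceivedTS = datetime.utcnow()
        self.faulted = False

    def on_message(self, message):
        # Every message, of any stream type, counts as proof of life.
        self.lastMessageReceivedTS = datetime.utcnow()

    def check_liveness(self):
        # Called periodically by the status check / supervisor.
        if datetime.utcnow() - self.lastMessageReceivedTS > LIVENESS_THRESHOLD:
            self.faulted = True             # the supervisor recycles faulted clients
        return not self.faulted
```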
BitMEX streams are incremental. You get partial updates (and sometimes bursts). So the only safe model is: keep a local table per stream, apply every partial / insert / update / delete as it arrives, and never assume the streams stay neatly in step with each other.
They don't.
In fact, my own comment in the websocket client says it bluntly.
This becomes a theme of the whole project:
the market is asynchronous; your dataset must not pretend it isn’t.
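Concretely, BitMEX table messages carry an `action` field (`partial`, `insert`, `update`, `delete`) plus the key columns announced in the `partial`. “Keep local state and apply every action” looks roughly like this simplified sketch (the repo's client handles more edge cases):

```python
class StreamTable:
    """Local mirror of one BitMEX table (e.g. orderBookL2_25), rebuilt from the stream."""

    def __init__(self):
        self.keys = []    # key columns announced by the 'partial' message
        self.rows = []    # current state of the table

    def _match(self, row, item):
        return all(row[k] == item[k] for k in self.keys)

    def apply(self, message):
        action, data = message["action"], message["data"]
        if action == "partial":
            # Full image of the table: anything held before this is untrustworthy.
            self.keys = message.get("keys", [])
            self.rows = list(data)
        elif action == "insert":
            self.rows.extend(data)
        elif action == "update":
            for item in data:
                for row in self.rows:
                    if self._match(row, item):
                        row.update(item)
                        break
        elif action == "delete":
            for item in data:
                self.rows = [r for r in self.rows if not self._match(r, item)]
```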
This is the part that surprised me.
I expected websocket handling to be hard.
But the thing that created the most silent corruption was time.
Even if you think your system clock is “fine”, drift and scheduling jitter show up as misaligned windows, labels that are shifted by fractions of a second, and features computed against a book state that never actually coexisted with the trade that triggered them.
So in main.py, I compute an approximate drift by waiting for a fresh trade, then comparing:
- lastTradeTS (from server payload)
- currentMachineTS (my local UTC timestamp)

In the repo it's literally:

clockDiff = currentMachineTS - lastTradeTS

That “diff” becomes a crude correction term I pass into SnapshotManager.
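As a sketch (the variable names mirror the post; the timestamp parsing and the sign convention are my assumptions), assuming the trade payload's ISO-8601 `timestamp` field:

```python
from datetime import datetime


def estimate_clock_diff(last_trade):
    """Rough machine<->exchange offset from one fresh trade message."""
    # BitMEX timestamps look like "2019-03-07T12:00:00.123Z" (UTC).
    lastTradeTS = datetime.strptime(last_trade["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    currentMachineTS = datetime.utcnow()

    # Positive means the local clock reads later than the exchange timestamp;
    # network and processing latency are folded into this number.
    clockDiff = currentMachineTS - lastTradeTS
    return clockDiff
```

Because latency is folded into the measurement, this is a bias estimate rather than a precise sync, which is exactly why it stays a “crude correction term” and not a substitute for the integrity rules coming in April.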
This is why April is a full post about dataset integrity rules: missing data + timestamp alignment is where research quietly dies.
By March, I forced myself to define “snapshot” as a contract, not a vibe.
A snapshot is only valid if:
- the client isn't faulted and the stream has been alive within the liveness threshold
- the book and the quote agree with each other, and no required fields are missing
- its timestamp is aligned using the measured clock drift, so it means the same thing as every other snapshot
In SnapshotManager.py, a periodic task runs on a fixed cadence (the repo uses a 1-second looping call) and attempts to assemble the current book, quote, and trade state into a snapshot, check it against the contract above, and either persist it or skip and log.
A detail I didn’t appreciate yet (but it’s already present in the code): some snapshots get staged until an event happens (like a “move up” condition).
That design ends up shaping how labeling works later.
At this stage, the goal was simpler:
produce a stream that doesn’t contradict itself.
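The cadence can be as simple as a Twisted-style LoopingCall (the “reactor thread” mentioned below suggests something in that family). In this sketch, `build_snapshot`, `is_valid`, and `persist` are hypothetical stand-ins for the SnapshotManager internals:

```python
from twisted.internet import task, reactor
import logging

log = logging.getLogger("snapshots")
SNAPSHOT_PERIOD_S = 1.0


def snapshot_tick(manager):
    """One attempt per tick: assemble, validate, then persist or skip-and-log."""
    snapshot = manager.build_snapshot()      # hypothetical: merge current book/quote/trade state
    if snapshot is None or not manager.is_valid(snapshot):
        log.warning("snapshot skipped (failed the validity contract)")
        return                               # silence is worse than missingness, so we log it
    manager.persist(snapshot)                # hypothetical: rolling windows + write to disk


def start_snapshot_loop(manager):
    loop = task.LoopingCall(snapshot_tick, manager)
    loop.start(SNAPSHOT_PERIOD_S)            # fires immediately, then once per second
    return loop

# start_snapshot_loop(manager) for each symbol, then reactor.run() drives all the loops.
```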
Here’s what I logged obsessively this month: reconnects and recycled clients, faulted-state triggers, the measured clock drift, and every snapshot I skipped (and why).
If this sounds like overkill, it’s because I learned this in RL already:
if you can’t inspect it, you can’t trust it.
Likely cause: websocket connection is alive but stream is stalled.
First check: how stale lastMessageReceivedTS is; if it is older than the liveness threshold, the client should already be flagged as faulted and recycled.
Likely cause: local state is inconsistent (partial updates applied out of order or mixed symbol state).
First check: whether the local tables were rebuilt from a fresh partial after the last reconnect, rather than mixing old rows with new updates.
Likely cause: machine clock drift / inconsistent timestamp alignment.
First check: the logged clockDiff; compare lastTradeTS from a fresh trade against the machine's UTC timestamp and confirm the correction is actually reaching SnapshotManager.
Likely cause: message bursts + Python processing + IO (writes) create lag.
First check: whether snapshot writes (or other IO) are blocking the message loop while a burst is being processed.
main.py — Orchestration + reconnect loop
The supervisor loop: multi-symbol clients, reactor thread, fault detection, recreation.
BitMEXWebsocketClient.py — Liveness + stream state
Local tables for streams + “no messages means fault” logic.
In RL, I learned that “the algorithm” is not the system.
In trading research, I learned something sharper:
the collector is the system at the beginning.
If the data is unstable, everything downstream becomes a sophisticated way to overfit noise and timestamp bugs.
Now that I can produce a stream, I have to face the uncomfortable truth:
a running collector does not automatically produce a usable dataset.
Next month is about turning these snapshots into something trainable without lying.
Dataset Reality — HDF5 Schema, Missing Data, and “Don’t Lie to Yourself” Rules
In April 2019 I learned that the hardest part of trading ML isn’t the model — it’s the dataset contract. This month is about HDF5, integrity checks, and building rules that stop “good backtests” from lying.
From Microstructure to Features - What the Model Will See
If RL taught me “the state is the contract,” then trading is where that contract becomes painful. This month I map order book microstructure into concrete feature families my models can actually learn from.