Mar 31, 2019 - 14 MIN READ

The Collector - Websockets, Clock Drift, and the First Clean Snapshot

In March 2019 I stop “talking about microstructure” and start collecting it. Websockets drop messages, clocks drift, and the only thing that matters is producing a snapshot I can trust.

Axel Domingues

January taught me what an order book is.

February taught me what the model should see.

March is where the fantasy dies:

if the collector lies, every backtest is theatre.

This is the month I built the thing that sits underneath everything else in this repo: a real-time BitMEX data collector that turns a firehose (websocket streams) into something I can actually train on.

And yes — the hardest part wasn’t “websockets”.

It was time.

The collector contract

What “a snapshot” means, and what has to be true before I write it to disk.

Websocket reality

Dropped messages, stale books, partial updates, and why “connected” doesn’t mean “correct”.

Clock drift is a data bug

How I estimate machine↔exchange time difference using trade timestamps (and why it matters later).

The first clean snapshot

The moment the stream becomes stable enough to trust — and the rules I added to keep it that way.

The Problem Statement (in one sentence)

I need a process that can:

stay connected for hours
recover from failure without manual babysitting
and produce time-aligned snapshots that are internally consistent enough to become a dataset

That’s it.

If I can’t do that, “feature engineering” is just writing fanfic.

Repo Focus: the three files that became my world

This month is basically these three modules:

BitmexPythonDataCollector/main.py — orchestration loop + reconnection strategy
BitmexPythonDataCollector/BitMEXWebsocketClient.py — websocket client + liveness checks
BitmexPythonDataCollector/SnapshotManager.py — periodic snapshot creation + rolling windows + persistence

I like that this mirrors how I think:

transport → truth checks → stateful snapshot → storage

The Collector Architecture (what runs, and why)

At a high level, the collector is “one websocket client + one snapshot manager per symbol”.

In the repo, main.py loops through a list of root symbols and builds a pair for each.

The part that made it feel “real” is that the design assumes failure:

websocket will disconnect
connection will hang without cleanly closing
messages will stop arriving while the TCP connection still looks alive

So the main loop is basically a supervisor:

if a client is faulted → kill it → recreate it → re-attach the snapshot manager → continue

Step-by-step: how I got to the first stable run

Connect one symbol first (don’t scale while blind)

Start with a single contract (I used XBTUSD first), and only add more once the loop stays alive.

For my “first trustworthy snapshot”, I needed:

orderBookL2_25 (depth)
quote (top of book)
trade (last executions)
instrument (metadata like mark price / funding when needed later)

Add a liveness rule (connected is not the same as healthy)

The websocket client tracks the last received message timestamp. If no messages arrive for a few seconds, I treat the client as faulted and recycle it.

Compute clock drift once (then use it everywhere)

I estimate machine↔server time difference from the trade timestamps. This becomes the “alignment glue” for windows and labels later.

Only write snapshots that pass basic consistency checks

If book and quote are out of sync, or fields are missing, I skip and log. Silence is worse than missingness.

Websockets: the three illusions I had to kill

Illusion #1: “If I’m connected, I’m receiving data”

In BitMEXWebsocketClient.py, I added a periodic status check that treats “no messages recently” as failure.

The pattern is simple:

track lastMessageReceivedTS
if now - lastMessageReceivedTS > threshold → set faulted state

In my code, the threshold is only a few seconds — because in active markets, silence is suspicious.

This is the collector equivalent of RL instrumentation:If you don’t measure liveness explicitly, you’ll believe your own uptime charts.

Illusion #2: “A message is a snapshot”

BitMEX streams are incremental. You get partial updates (and sometimes bursts). So the only safe model is:

keep a local state table per stream
apply updates
build snapshots from the assembled state, not from individual events

Illusion #3: “The book and the quote will arrive together”

They don’t.

In fact, my own comment in the websocket client says it bluntly:

order book updates can be faster than quotes
so I added sync checks before letting snapshot logic proceed

This becomes a theme of the whole project:

the market is asynchronous; your dataset must not pretend it isn’t.

Clock Drift: the bug that doesn’t look like a bug

This is the part that surprised me.

I expected websocket handling to be hard.

But the thing that created the most silent corruption was:

server timestamps vs machine time

Even if you think your system clock is “fine”, drift and scheduling jitter show up as:

misaligned feature windows
look-ahead leakage you don’t realize you introduced
label boundaries that slide by seconds (which matters when your horizon is short)

So in main.py, I compute an approximate drift by waiting for a fresh trade, then comparing:

lastTradeTS (from server payload)
currentMachineTS (my local UTC timestamp)

In the repo it’s literally:

clockDiff = currentMachineTS - lastTradeTS

That “diff” becomes a crude correction term I pass into SnapshotManager.

This is not perfect time synchronization.It’s a pragmatic engineering trick: “good enough alignment to not lie to myself while building v1”.

Later (when labels matter), “good enough” gets stricter.

This is why April is a full post about dataset integrity rules: missing data + timestamp alignment is where research quietly dies.

What a “Snapshot” Means (my first real contract)

By March, I forced myself to define “snapshot” as a contract, not a vibe.

A snapshot is only valid if:

I have a coherent L2 book state (at least the top levels)
I can compute best bid / best ask reliably
I have a recent trade reference (for drift sanity + later features)
the snapshot timestamp is stable enough to be placed into rolling windows

In SnapshotManager.py, a periodic task runs on a fixed cadence (the repo uses a 1-second looping call) and attempts to:

build the current snapshot
update time windows (rolling feature buffers)
persist output (in March: “write something stable”; in April: “make it a dataset”)

A detail I didn’t appreciate yet (but it’s already present in the code): some snapshots get staged until an event happens (like a “move up” condition).

That design ends up shaping how labeling works later.

At this stage, the goal was simpler:

produce a stream that doesn’t contradict itself.

My “Don’t Lie To Yourself” Debug Dashboard (collector edition)

Here’s what I logged obsessively this month:

websocket connected/disconnected events
reconnect counts (per symbol)
time since last message received
last trade timestamp (server time)
current machine timestamp (local time)
estimated clock diff over the session (and whether it drifts)
best bid / best ask values
spread (and whether it ever goes negative — that’s a bug alarm)
book depth sanity (are there 25 levels or did state reset?)
message burst rate (helps detect stalls vs floods)
snapshot cadence (am I actually producing 1/sec?)
skipped snapshot counts (and reason)
write-to-disk latency
CPU load spikes (because bursts + Python can lag)
per-symbol “faulted” state flips

If this sounds like overkill, it’s because I learned this in RL already:

if you can’t inspect it, you can’t trust it.

Common failure modes (what broke first)

Resources (the exact repo touchpoints)

main.py — Orchestration + reconnect loop

The supervisor loop: multi-symbol clients, reactor thread, fault detection, recreation.

BitMEXWebsocketClient.py — Liveness + stream state

Local tables for streams + “no messages means fault” logic.

SnapshotManager.py — Periodic snapshots

The snapshot contract: rolling windows, validity checks, and persistence hooks.

Repo — bitmex-deeprl-research

The full pipeline as it evolves: microstructure → dataset → alpha → environments → live loop.

What changed in my thinking (March takeaway)

In RL, I learned that “the algorithm” is not the system.

In trading research, I learned something sharper:

the collector is the system at the beginning.

If the data is unstable, everything downstream becomes a sophisticated way to overfit noise and timestamp bugs.

What’s Next

Now that I can produce a stream, I have to face the uncomfortable truth:

a running collector does not automatically produce a usable dataset.

Next month is about turning these snapshots into something trainable without lying.

Dataset Reality — HDF5 Schema, Missing Data, and “Don’t Lie to Yourself” Rules

In April 2019 I learned that the hardest part of trading ML isn’t the model — it’s the dataset contract. This month is about HDF5, integrity checks, and building rules that stop “good backtests” from lying.

From Microstructure to Features - What the Model Will See

If RL taught me “the state is the contract,” then trading is where that contract becomes painful. This month I map order book microstructure into concrete feature families my models can actually learn from.