
My first live alpha monitor was “working”… until BitMEX started replying 503 right when the model got excited. That’s when I learned availability is part of market microstructure.
Axel Domingues
By November 2019, I’ve got something that looks like “a system”: a collector, a dataset, a supervised model, and a small live monitor that watches the BitMEX (Bitcoin Mercantile Exchange) order book and emits a prediction.
And then the exchange does what real systems always do: it fails… at exactly the worst time.
Not “fails” like my code throws a stack trace. Fails like: the model starts screaming “move!”, I try to place or manage orders, and BitMEX replies with HTTP 503 (Service Unavailable). Not once. Not occasionally. Repeatedly — clustered around the moments that mattered.
That was the day “outages” stopped being an infrastructure annoyance and became a market feature I had to model.
In textbooks, you treat the exchange like an oracle:
In practice, BitMEX is an API running under load, and during bursts of activity you see symptoms like:
That last one is the killer. You can’t even reason about risk if you don’t know your own state.
October’s post was about listening to the market. November is about what happens when you try to act.
At this point in the repo, the live pieces are basically:
At this point in the repo, the live pieces are basically:

- BitmexPythonChappie/BitMEXWebsocketClient.py — keeps a local view of trades/order book and exposes helpers like market_depth() and recent_trades()
- BitmexPythonChappie/SnapshotManager.py — produces the feature snapshot used by the predictor (same “snapshot contract” I used for training)
- BitmexPythonChappie/OrderBookMovePredictor.py — loads the trained model + applies mean/sigma + index-diff normalization (from June)
- BitmexPythonChappie/BitMEXBotClient.py — REST client for placing/cancelling orders and reconciling what “should” exist with what actually exists
- BitmexPythonChappie/main.py — wires it together, restarts the websocket client when faulted, keeps the process alive

The key detail is that this is not a full trading strategy yet. It’s a monitor + “poke the exchange” loop that was meant to validate one thing:
When my model predicts large moves, do those moments line up with real market pressure?
The answer was: yes — and the exchange itself confirmed it by falling over.
I expected outages to be random, like intermittent Wi‑Fi. Instead, they were structured:
This changes the framing completely:
If 503 events correlate with “everyone is doing something right now”, then 503 is part of the market microstructure story.
It’s a signal about crowding and capacity — and it hits taker/HFT-style behaviors the hardest because they rely on tight timing.
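To make “clustered, not random” concrete: once you log every REST response with a timestamp and status code, burst detection is a few lines. This is a hypothetical helper (find_503_bursts is not in the repo) sketching how I'd flag windows where 503s pile up:

```python
from collections import deque

def find_503_bursts(events, window_s=10.0, min_errors=3):
    """Flag time windows where HTTP 503s cluster.

    events: list of (timestamp_seconds, http_status), sorted by time.
    Returns start timestamps of sliding windows that hold >= min_errors 503s.
    Hypothetical helper -- illustrates the idea, not the repo's code.
    """
    bursts = []
    recent = deque()  # timestamps of 503s inside the current window
    for ts, status in events:
        if status != 503:
            continue
        recent.append(ts)
        # drop 503s that fell out of the window
        while recent and ts - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= min_errors:
            bursts.append(recent[0])
    return bursts

# Three 503s inside 10 seconds -> one burst; an isolated 503 later -> nothing
events = [(0.0, 200), (1.0, 503), (2.0, 200), (3.5, 503),
          (4.0, 503), (30.0, 200), (31.0, 503)]
print(find_503_bursts(events))  # → [1.0]
```

If the burst start times line up with your model's "large move" predictions, that's exactly the crowding signal described above.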
Even before I named the problem, the code started growing “scar tissue” around it.
In BitmexPythonChappie/BitMEXWebsocketClient.py, the websocket error handler marks the client faulted and exits:
def _on_error(self, ws, error):
    self.logger.error("ERROR: %s" % error)
    self.isFaulted = True
    self._on_close(ws)
Then in BitmexPythonChappie/main.py, the process watches for isFaulted and rebuilds the websocket client:
if wsBitMEXClient.exited or wsBitMEXClient.isFaulted:
    wsBitMEXClient.exit()
    wsBitMEXClient = setup_ws_BitMEXClient(BitMEXBotClientInstance)
    snapshotManagerInstance.set_ws_bitMEX_client(wsBitMEXClient)
That’s not “elegant architecture”, but it’s real. Fault detection + auto-reconnect is the minimum for a long-running live process.
In BitmexPythonChappie/BitMEXBotClient.py, I ended up writing a reconciliation helper:
You can see this mindset in check_valid_response() — it’s literally coded as: if the order isn’t there, wait and check again.
That is not a theoretical risk-control system. It’s the first time I’m admitting the exchange can lie to my bot through omission, delay, or partial failure.
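A minimal sketch of that "wait and check again" pattern, paraphrasing the check_valid_response() idea rather than copying the repo's code (confirm_order and fetch_order are hypothetical names):

```python
import time

def confirm_order(fetch_order, order_id, attempts=5, delay_s=0.5):
    """Re-check until an order shows up, or give up.

    fetch_order(order_id) -> dict or None. The point: the exchange may
    acknowledge late, so absence on the first read proves nothing.
    """
    for attempt in range(attempts):
        order = fetch_order(order_id)
        if order is not None:
            return order
        time.sleep(delay_s * (attempt + 1))  # back off a little each retry
    return None  # still unknown: treat as a risk event, not as "no order"

# Stub exchange that only answers on the third poll
calls = {"n": 0}
def slow_exchange(order_id):
    calls["n"] += 1
    return {"orderID": order_id} if calls["n"] >= 3 else None

print(confirm_order(slow_exchange, "abc", delay_s=0.01))  # → {'orderID': 'abc'}
```

The important design choice is the final return: "I couldn't confirm" is a distinct state from "the order doesn't exist", and conflating the two is how bots double-place.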
After a week of watching 503 clusters, I wrote a rule on paper and taped it above my monitor:
If the exchange is unavailable, the market is not “paused” — the market is screaming.
Treat outages like a volatility regime, not like downtime.
That rule led to four concrete changes in how I log and how I design the system.
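One way to encode the rule is to stamp every snapshot with an availability regime instead of silently dropping the samples where the exchange was unreachable. This is a hypothetical sketch (AvailabilityRegime and the 30-second window are my assumptions, not the repo's design):

```python
DEGRADED_WINDOW_S = 30.0  # assumption: how long one 503 "poisons" the regime

class AvailabilityRegime:
    """Track whether the exchange is currently 'healthy' or 'degraded'.

    Sketch of the 'outage = volatility regime' rule: downstream analysis
    can then condition on the regime label instead of pretending those
    moments never happened.
    """
    def __init__(self):
        self.last_503_ts = None

    def record_status(self, http_status, ts):
        if http_status == 503:
            self.last_503_ts = ts

    def regime(self, ts):
        if self.last_503_ts is not None and ts - self.last_503_ts <= DEGRADED_WINDOW_S:
            return "degraded"
        return "healthy"

tracker = AvailabilityRegime()
tracker.record_status(503, ts=100.0)
print(tracker.regime(110.0))  # → degraded (inside the 30s window)
print(tracker.regime(200.0))  # → healthy again
```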
This month’s goal wasn’t profitability. It was observability: can I explain what happened after the fact?
Here’s the checklist I started tracking (some in logs, some as quick scripts/plots):
Here’s what I saw repeatedly, and how I learned to debug it fast.
- SnapshotManager timings + missing fields

I didn’t “fix BitMEX”. I changed my system so it doesn’t pretend the exchange is deterministic.
Concretely, I focused on three things:
At the end of 2018 I was deep into RL algorithms: DQN (Deep Q‑Network), PPO (Proximal Policy Optimization), actor‑critic, stability tricks.
This month forced a more uncomfortable question:
What does an “optimal policy” mean if the environment sometimes refuses actions?
If I don’t model outages and delayed acknowledgements, an RL agent will learn the wrong thing — it will optimize for a fantasy world where every action executes instantly and unconditionally.
That’s why December’s post is about the environment contract. Before I write “reward” or “step()”, I need to define what reality can do to me:
The 503 lesson is the first big constraint in that contract.
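To make the constraint concrete, here is a toy Gym-style environment where the exchange can refuse actions. This is my illustrative sketch, not December's actual design: step() reports an accepted flag in info, and a rejected action leaves the position unchanged, so an agent trained here cannot assume execution is guaranteed.

```python
import random

class FlakyExchangeEnv:
    """Toy environment where the exchange sometimes rejects actions."""

    def __init__(self, p_reject=0.2, seed=0):
        self.p_reject = p_reject        # probability an action is refused (a 503, in spirit)
        self.rng = random.Random(seed)  # seeded for reproducibility
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):  # action in {-1, 0, +1}
        accepted = self.rng.random() >= self.p_reject
        if accepted:
            self.position += action  # only accepted actions move state
        reward = 0.0                 # placeholder: real reward is PnL, later
        done = False
        return self.position, reward, done, {"accepted": accepted}

env = FlakyExchangeEnv(p_reject=0.5, seed=42)
env.reset()
obs, _, _, info = env.step(+1)
print(obs, info["accepted"])  # position only moves if the action was accepted
```

The interesting part is what this does to learning: if rejections cluster in exactly the states where acting matters most (as the 503 data suggests), the agent has to learn a policy that is robust to being refused at the worst time.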
In December I’m going to do the most unsexy thing in the world: define an interface.
Not a library interface — a truth interface:
From Prediction to Decision - Designing the Trading Environment Contract
I stopped pretending “a good predictor” was the same thing as “a tradable strategy” and designed a Gym-style environment contract that makes cheating obvious and failure modes measurable.
Live Alpha Monitoring - When the Market Talks Back
I stop treating my alpha model like a notebook artifact and make it sit in the real BitMEX stream. The goal is not trading yet. It is seeing whether my features, normalization, and inference loop survive reality without quietly cheating.