
In May 2020, I stop hoping the bot is “fine” and start giving it explicit failure states — stale websockets, missing fills, rate-limits, and the kill switches that keep a live loop honest.
Axel Domingues
In April, I finally wired Chappie: a trained policy -> a running process -> real BitMEX sockets -> real orders.
In May, I learned the part nobody puts in the “cool RL trading bot” demos:
A live trading loop is mostly failure handling.
Not “exceptions” either.
The dangerous failures are the quiet ones:
This post is my safety engineering month: kill switches, reconciliation, and recovery.
I didn’t want “make money.” I wanted something more basic first:
That turned into three layers:
All of this lives in the repo:
BitmexPythonChappie/BitMEXWebsocketClient.py
BitmexPythonChappie/BitMEXBotClient.py
I’m not interested in being clever when the feed is stale.
I want the bot to notice and fail loudly.
In BitMEXWebsocketClient.py, I added a periodic status check that watches for “last updated” drift and progressively escalates:
A simplified slice of the logic:
# BitmexPythonChappie/BitMEXWebsocketClient.py
diff_from_last_updated = datetime.utcnow() - self.last_updated
if diff_from_last_updated.total_seconds() > 5:
    self.ws.stop()  # soft stop
if diff_from_last_updated.total_seconds() > 20:
    self.__fault_main_thread()  # hard kill
That second threshold (faulting the main thread) is important.
A “soft stop” can leave you in limbo: the process still runs, but market data is gone.
A hard fault forces recovery to happen outside the trading logic.
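To make the two-threshold escalation easier to test in isolation, here is a minimal standalone sketch. The function name `classify_staleness` and the returned labels are hypothetical and not taken from the repo; only the 5 and 20 second thresholds come from the slice above. Unlike that slice, which runs both checks in sequence, this version reports only the most severe level.

```python
from datetime import datetime, timedelta, timezone

SOFT_STOP_AFTER_S = 5    # soft-stop threshold from the slice above
HARD_FAULT_AFTER_S = 20  # hard-kill threshold from the slice above


def classify_staleness(last_updated, now):
    """Return 'ok', 'soft_stop', or 'hard_fault' for a feed timestamp.

    Hypothetical helper: it only classifies, the caller decides what to do.
    """
    age_s = (now - last_updated).total_seconds()
    if age_s > HARD_FAULT_AFTER_S:
        return "hard_fault"
    if age_s > SOFT_STOP_AFTER_S:
        return "soft_stop"
    return "ok"


now = datetime(2020, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print(classify_staleness(now - timedelta(seconds=2), now))
print(classify_staleness(now - timedelta(seconds=8), now))
print(classify_staleness(now - timedelta(seconds=30), now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic under test.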
A websocket can be alive but still be wrong.
So I tracked two separate “sync” flags:
quotes_synced
orderBook_synced
And I let timeouts turn them into kill conditions.
In the same status check, I made “not synced for too long” a fault:
# BitmexPythonChappie/BitMEXWebsocketClient.py
if not self.quotes_synced and diff_from_last_updated.total_seconds() > 150:
    self.__fault_main_thread()
if not self.orderBook_synced and diff_from_last_updated.total_seconds() > 150:
    self.__fault_main_thread()
That is the “market talks back” idea made operational:
If the market feed is incomplete, it’s telling you to stop.
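One way to experiment with the "not synced for too long" rule outside the websocket client is a small tracker per stream. This is a sketch under assumptions: the `SyncMonitor` class and its method names are hypothetical, and it measures time since sync was lost, which is a variant of the check above rather than a copy of it. Only the 150 second limit is taken from the post.

```python
SYNC_TIMEOUT_S = 150  # same limit as the check above


class SyncMonitor:
    """Tracks how long each named stream has been out of sync (hypothetical)."""

    def __init__(self, streams):
        # None means "currently synced"; a number is the time sync was lost.
        self._unsynced_since = {name: None for name in streams}

    def mark_unsynced(self, name, now_s):
        # Keep the earliest timestamp so repeated calls do not reset the clock.
        if self._unsynced_since[name] is None:
            self._unsynced_since[name] = now_s

    def mark_synced(self, name):
        self._unsynced_since[name] = None

    def timed_out(self, now_s):
        """Return the sorted names of streams unsynced for longer than the limit."""
        return sorted(
            name
            for name, since in self._unsynced_since.items()
            if since is not None and now_s - since > SYNC_TIMEOUT_S
        )


monitor = SyncMonitor(["quotes", "orderBook"])
monitor.mark_unsynced("orderBook", now_s=0.0)
print(monitor.timed_out(now_s=100.0))  # still within the limit
print(monitor.timed_out(now_s=151.0))  # past the limit
```

Any non-empty result from `timed_out` would be the trigger for a fault in this variant.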
The simplest safety feature I kept reusing was also the most boring:
a config toggle.
In BitMEXBotClient.py, I read allow_trade from config.ini and treated it as the highest authority.
If it’s off, the bot can still run, log, and monitor… but it won’t send orders.
This became my “dry-run in prod” mode.
; BitmexPythonChappie/config.ini
[Bot]
allow_trade = false
And in the bot:
# BitmexPythonChappie/BitMEXBotClient.py
trade_enabled = self.config.getboolean('Bot', 'allow_trade')
if not trade_enabled:
    return  # observe only
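For anyone who wants to try the toggle without the rest of the bot, a self-contained sketch follows. The helper names `load_allow_trade` and `maybe_send_order` are hypothetical, and defaulting to "do not trade" when the key is missing is an assumption added here, not something the repo is confirmed to do.

```python
import configparser

CONFIG_TEXT = """
[Bot]
allow_trade = false
"""


def load_allow_trade(config_text):
    """Parse the [Bot] allow_trade flag from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Assumption: a missing section or key means "do not trade".
    return parser.getboolean("Bot", "allow_trade", fallback=False)


def maybe_send_order(allow_trade, order, sent_orders):
    """Record the order only when trading is enabled; return whether it was sent."""
    if not allow_trade:
        return False  # observe only
    sent_orders.append(order)
    return True


sent = []
allowed = load_allow_trade(CONFIG_TEXT)
print(maybe_send_order(allowed, {"side": "Buy", "qty": 1}, sent), sent)
```

Failing closed on a missing key means a broken or partial config file cannot accidentally enable live orders.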
This sounds trivial, but it mattered because it changed my workflow:
allow_trade = false
When BitMEX starts rate-limiting you, the next failure mode is subtle:
So I added a lock when the remaining request budget is too low.
In BitMEXBotClient.py, if the remaining budget is below a threshold, I set self.locked = True.
That lock propagates into behavior decisions: no new trades while locked.
# BitmexPythonChappie/BitMEXBotClient.py
remaining = self.bitmex.getLockRemainingRequests()
if remaining < self.LOCK_REQ_THRESHOLD:
    self.locked = True
This is less about “saving requests” and more about refusing to operate in degraded IO.
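The snippet above only shows the lock being set. A possible standalone version, with an explicit release rule, might look like the sketch below. The `RequestBudgetLock` class is hypothetical, and the separate, higher unlock threshold (hysteresis) is an assumption added for illustration so the lock does not flap around a single value.

```python
class RequestBudgetLock:
    """Locks trading while the remaining request budget is low (hypothetical)."""

    def __init__(self, lock_below, unlock_at):
        if unlock_at < lock_below:
            raise ValueError("unlock_at must be >= lock_below")
        self.lock_below = lock_below
        self.unlock_at = unlock_at
        self.locked = False

    def update(self, remaining):
        """Feed in the latest remaining-request count; return the lock state."""
        if remaining < self.lock_below:
            self.locked = True
        elif remaining >= self.unlock_at:
            self.locked = False
        # Between the two thresholds the previous state is kept.
        return self.locked


lock = RequestBudgetLock(lock_below=10, unlock_at=30)
for remaining in (50, 5, 20, 30):
    print(remaining, lock.update(remaining))
```

The caller would check `lock.locked` before placing any new order and skip the trade while it is set.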
Even with a healthy websocket, you can miss a message.
This is what drift looks like:
So I built reconciliation in the state machine.
In manage_opening_state(), if the websocket doesn’t have the open order, I fall back to a REST query.
# BitmexPythonChappie/BitMEXBotClient.py
if len(open_order_ws) == 0:
    open_order_rest = self.bitmex.getOpenOrders()  # REST fallback
    if len(open_order_rest) > 0:
        self.open_order = open_order_rest[0]
This is the entire point of reconciliation:
Websocket is your fast path, REST is your truth oracle.
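The fast-path/fallback pattern can be sketched as a pure function that is easy to unit test. The name `reconcile_open_order` and the `(order, source)` return shape are hypothetical; the REST call is passed in as a plain callable so no real exchange client is needed.

```python
def reconcile_open_order(ws_orders, fetch_rest_orders):
    """Return (order, source) where source is 'ws', 'rest', or 'none'.

    ws_orders: open orders as seen on the websocket (fast path).
    fetch_rest_orders: zero-argument callable returning open orders via REST.
    """
    if ws_orders:
        return ws_orders[0], "ws"
    # Only pay for the REST round trip when the websocket shows nothing.
    rest_orders = fetch_rest_orders()
    if rest_orders:
        return rest_orders[0], "rest"
    return None, "none"


# The websocket missed the order, REST still reports it.
order, source = reconcile_open_order([], lambda: [{"orderID": "abc"}])
print(order, source)
```

Returning the source alongside the order makes it possible to log how often the fallback fires, which is a rough measure of how many messages the websocket is dropping.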
The reverse drift is just as dangerous.
In manage_closed_state(), if there are no open orders, I clear internal state.
That is a safety reset, not a “strategy decision.”
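A minimal sketch of that reset, assuming a small state container: `LocalOrderState`, its fields, and `reset_if_no_open_orders` are hypothetical names for illustration and are not taken from the repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalOrderState:
    """What the bot believes about its own orders (hypothetical fields)."""
    open_order: Optional[dict] = None
    pending_order_id: Optional[str] = None


def reset_if_no_open_orders(state, exchange_open_orders):
    """Clear local order state when the exchange reports no open orders.

    Returns True if stale state was actually cleared.
    """
    if exchange_open_orders:
        return False  # the exchange still has orders, keep local state
    had_stale_state = (
        state.open_order is not None or state.pending_order_id is not None
    )
    state.open_order = None
    state.pending_order_id = None
    return had_stale_state


state = LocalOrderState(open_order={"orderID": "abc"}, pending_order_id="abc")
print(reset_if_no_open_orders(state, exchange_open_orders=[]), state)
```

A `True` return is worth logging, since it means the bot was carrying an order the exchange no longer knew about.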
The early version of me wanted to handle everything inside the trading loop:
In May 2020 I changed my mind.
For certain fault conditions (stale websocket, out-of-sync feed), I wanted a crash.
Not because crashing is fun — because restarting is cleaner than half-recovering.
So the websocket client can set a “faulted” state, and the bot loop checks it:
# BitmexPythonChappie/BitMEXBotClient.py
if not self.ws.isActive() or self.ws.isFaulted:
    return  # or exit in the supervisor-driven version
And the websocket client can fault the main thread when health checks fail.
That’s not graceful.
That’s intentional.
Inside the bot: be conservative and fail-fast.
Outside the bot: supervisor restarts, alerting, and “did we leave orders behind?” checks.
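The "outside the bot" half can be as small as a loop that restarts the process when it exits with an error. The sketch below is one possible shape, not the supervisor used for this bot: `supervise`, `max_restarts`, and `backoff_s` are hypothetical names, and a real setup would more likely rely on an existing process manager.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts, backoff_s=1.0):
    """Run command, restarting it after a non-zero exit.

    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean shutdown: do not restart
        if attempt < max_restarts:
            time.sleep(backoff_s)  # pause before the next start
    return exit_codes


if __name__ == "__main__":
    # Stand-in for the bot: a child process that always crashes.
    crashing_child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing_child, max_restarts=2, backoff_s=0.0))
```

Capping the number of restarts matters: a bot that crashes on every start should end up stopped and noisy, not looping forever.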
Before BitMEX, I thought “outages” were bad luck.
After BitMEX, I treated outages as a normal market regime.
Safety engineering is just admitting:
So you build a bot that can say:
“I don’t know. I’m stopping.”
That sentence is where real trading automation starts.
Because the bot needs to react to its own observed reality, not a webpage.
The only thing that matters is whether your websocket stream is timely and internally consistent.
Because infinite reconnect loops hide the problem.
A crash is loud. A crash gets your attention. And when paired with a supervisor restart + reconciliation checks, it’s often the cleanest way to return to a known state.
Websockets are fast, not sacred.
Reconciliation is what makes the system robust to dropped messages, brief stalls, and partial outages — all the stuff that happens in the real world.
In the next post, I finally do what I’d been postponing:
First live runs — small size, big lessons.
That’s June 2020.