May 31, 2020 - 15 MIN READ

Safety Engineering - Kill Switches, Reconciliation, and Failure Recovery

In May 2020, I stop hoping the bot is “fine” and start giving it explicit failure states — stale websockets, missing fills, rate-limits, and the kill switches that keep a live loop honest.

Axel Domingues

In April, I finally wired Chappie: a trained policy -> a running process -> real BitMEX sockets -> real orders.

In May, I learned the part nobody puts in the “cool RL trading bot” demos:

A live trading loop is mostly failure handling.

Not “exceptions” either.

The dangerous failures are the quiet ones:

websocket stalls but your code keeps trading
you miss an order update and your internal state drifts
rate limits creep up and you start dropping requests
the exchange is “up”… except it isn’t (and you’re about to be the last one to notice)

This post is my safety engineering month: kill switches, reconciliation, and recovery.

The safety stack (what I wanted the bot to guarantee)

I didn’t want “make money.” I wanted something more basic first:

Don’t trade on broken market data.
Don’t keep trading if our view of reality is stale.
If we’re unsure, stop.
If we stop, stop in a way that leaves the account in a knowable state.

That turned into three layers:

Liveness: detect websocket stalls, missing subscriptions, and stale quotes.
Reconciliation: periodically confirm orders/positions via REST when websocket messages are missing.
Fail-fast recovery: treat certain states as terminal and let a supervisor restart cleanly.

All of this lives in the repo:

BitmexPythonChappie/BitMEXWebsocketClient.py
BitmexPythonChappie/BitMEXBotClient.py

Kill switch #1: “If the websocket is stale, we’re done”

I’m not interested in being clever when the feed is stale.

I want the bot to notice and fail loudly.

In BitMEXWebsocketClient.py, I added a periodic status check that watches for “last updated” drift and progressively escalates:

if the socket hasn’t updated for more than 5 seconds: stop the websocket loop (soft stop)
if the drift passes 20 seconds: fault the main thread (hard kill)

A simplified slice of the logic:

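# BitmexPythonChappie/BitMEXWebsocketClient.py
from datetime import datetime

# Time elapsed since the websocket last delivered an update.
diff_from_last_updated = datetime.utcnow() - self.last_updated

if diff_from_last_updated.total_seconds() > 5:
    self.ws.stop()                 # soft stop: close the websocket loop

if diff_from_last_updated.total_seconds() > 20:
    self.__fault_main_thread()     # hard kill: fault the main thread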

That second threshold (faulting the main thread) is important.

A “soft stop” can leave you in limbo: the process still runs, but market data is gone.

A hard fault forces recovery to happen outside the trading logic.
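
The escalation can also be read as a small pure function. This is a minimal sketch, not the code from the repo: `classify_staleness`, the `Action` enum, and the constant names are hypothetical, and only the 5-second and 20-second thresholds come from the slice above.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    OK = "ok"
    SOFT_STOP = "soft_stop"   # stop the websocket loop
    HARD_KILL = "hard_kill"   # fault the main thread


# Thresholds taken from the post; the constant names are hypothetical.
SOFT_STOP_AFTER_S = 5
HARD_KILL_AFTER_S = 20


def classify_staleness(last_updated: datetime, now: datetime) -> Action:
    """Map the age of the last websocket update to an escalation level."""
    age_s = (now - last_updated).total_seconds()
    if age_s > HARD_KILL_AFTER_S:
        return Action.HARD_KILL
    if age_s > SOFT_STOP_AFTER_S:
        return Action.SOFT_STOP
    return Action.OK


now = datetime(2020, 5, 31, 12, 0, 0)
print(classify_staleness(now - timedelta(seconds=2), now))   # Action.OK
print(classify_staleness(now - timedelta(seconds=8), now))   # Action.SOFT_STOP
print(classify_staleness(now - timedelta(seconds=30), now))  # Action.HARD_KILL
```

Passing `now` in as an argument keeps the check testable without waiting on a real clock, and testing the hard threshold first means one call always returns the most severe action.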

A trading bot that can’t stop itself is not a trading bot — it’s a runaway process.

Kill switch #2: “If quotes aren’t synced, don’t pretend they are”

A websocket can be alive but still be wrong.

So I tracked two separate “sync” flags:

quotes_synced
orderBook_synced

And I let timeouts turn them into kill conditions.

In the same status check, I made “not synced for too long” a fault:

# BitmexPythonChappie/BitMEXWebsocketClient.py

if not self.quotes_synced and diff_from_last_updated.total_seconds() > 150:
    self.__fault_main_thread()

if not self.orderBook_synced and diff_from_last_updated.total_seconds() > 150:
    self.__fault_main_thread()

That is the “market talks back” idea made operational:

If the market feed is incomplete, it’s telling you to stop.
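
How the flags get flipped is not shown in this post. One plausible reading, sketched below, relies on the BitMEX websocket convention that each subscription starts with a `partial` snapshot message and then sends deltas. The `SyncTracker` class and the table names `quote` and `orderBookL2` are assumptions for illustration, not the repo's code.

```python
class SyncTracker:
    """Track whether each subscribed table has received its initial snapshot."""

    def __init__(self, tables):
        self.synced = {table: False for table in tables}

    def on_message(self, message: dict) -> bool:
        """Return True if the message is safe to apply to local state."""
        table = message.get("table")
        if table not in self.synced:
            return False                      # not a table we track
        if message.get("action") == "partial":
            self.synced[table] = True         # snapshot received: table is synced
            return True
        # Deltas that arrive before the snapshot cannot be applied safely.
        return self.synced[table]

    def all_synced(self) -> bool:
        return all(self.synced.values())


tracker = SyncTracker(["quote", "orderBookL2"])
print(tracker.on_message({"table": "quote", "action": "insert"}))   # False
print(tracker.on_message({"table": "quote", "action": "partial"}))  # True
print(tracker.all_synced())                                         # False
```

The last call stays `False` because the order book has not delivered its snapshot yet, which is exactly the state the timeout is meant to catch.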

Kill switch #3: “Trading enabled is a config flag, not a code comment”

The simplest safety feature I kept reusing was also the most boring:

a config toggle.

In BitMEXBotClient.py, I read allow_trade from config.ini and treated it as the highest authority.

If it’s off, the bot can still run, log, and monitor… but it won’t send orders.

This became my “dry-run in prod” mode.

; BitmexPythonChappie/config.ini
[Bot]
allow_trade = false

And in the bot:

# BitmexPythonChappie/BitMEXBotClient.py

trade_enabled = self.config.getboolean('Bot', 'allow_trade')
if not trade_enabled:
    return  # observe only

This sounds trivial, but it mattered because it changed my workflow:

deploy with allow_trade = false
verify data, subscriptions, time drift, and logs
flip the flag only when everything looks sane
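
A self-contained way to try the toggle, using the standard library `configparser`. This is a sketch under one added assumption: the helper `trading_allowed` is hypothetical, and it defaults to observe-only when the section or key is missing, which the post does not specify.

```python
import configparser

CONFIG_TEXT = """
[Bot]
allow_trade = false
"""


def trading_allowed(config: configparser.ConfigParser) -> bool:
    """Return True only when allow_trade is explicitly enabled."""
    # fallback=False: a missing section or key means "observe only".
    return config.getboolean("Bot", "allow_trade", fallback=False)


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
print(trading_allowed(config))                       # False

print(trading_allowed(configparser.ConfigParser()))  # False: section missing
```

Defaulting to "off" means a broken or incomplete config file cannot accidentally enable trading.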

Rate-limit safety: “If the request budget is low, slow down”

When BitMEX starts rate-limiting you, the next failure mode is subtle:

some requests fail
you miss an update
you drift
you trade on a hallucinated state

So I added a lock when the remaining request budget is too low.

In BitMEXBotClient.py, if the remaining budget is below a threshold, I set self.locked = True.

That lock propagates into behavior decisions: no new trades while locked.

# BitmexPythonChappie/BitMEXBotClient.py

remaining = self.bitmex.getLockRemainingRequests()
if remaining < self.LOCK_REQ_THRESHOLD:
    self.locked = True

This is less about “saving requests” and more about refusing to operate in degraded IO.
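
The post only shows the lock being set. A minimal sketch of one way to also release it, assuming a second, higher threshold so the lock does not flap when the budget hovers around the limit. `RequestBudgetLock` and both threshold names are hypothetical.

```python
class RequestBudgetLock:
    """Lock trading when the remaining request budget runs low."""

    def __init__(self, lock_below: int, unlock_at: int):
        if unlock_at <= lock_below:
            raise ValueError("unlock_at must be greater than lock_below")
        self.lock_below = lock_below
        self.unlock_at = unlock_at
        self.locked = False

    def update(self, remaining: int) -> bool:
        """Feed the latest remaining-request count; return the lock state."""
        if remaining < self.lock_below:
            self.locked = True
        elif remaining >= self.unlock_at:
            self.locked = False
        # Between the two thresholds the previous state is kept (hysteresis).
        return self.locked


lock = RequestBudgetLock(lock_below=10, unlock_at=30)
for remaining in (50, 9, 20, 30):
    print(remaining, lock.update(remaining))
# 50 False
# 9 True
# 20 True
# 30 False
```

At `remaining = 20` the bot stays locked even though it is above the lock threshold, because it has not yet recovered to the unlock threshold.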

Reconciliation: websockets lie by omission

Even with a healthy websocket, you can miss a message.

This is what drift looks like:

exchange has an open order
your websocket missed the update
your bot thinks you’re flat
your policy makes a “safe” decision
you accidentally stack exposure

So I built reconciliation in the state machine.

Case: we opened a position, but the websocket doesn’t show the order

In manage_opening_state(), if the websocket doesn’t have the open order, I fall back to a REST query.

# BitmexPythonChappie/BitMEXBotClient.py

if len(open_order_ws) == 0:
    open_order_rest = self.bitmex.getOpenOrders()  # REST fallback
    if len(open_order_rest) > 0:
        self.open_order = open_order_rest[0]

This is the entire point of reconciliation:

The websocket is your fast path; REST is your truth oracle.
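
The same fallback, pulled out into a function that can run without an exchange connection. This is an illustrative sketch: `resolve_open_order` and `fake_rest` are hypothetical names, and the REST call is injected as a callable so canned data can stand in for it.

```python
from typing import Callable, Optional


def resolve_open_order(
    ws_orders: list,
    fetch_rest_orders: Callable[[], list],
) -> Optional[dict]:
    """Return the open order, preferring websocket state and falling back to REST."""
    if ws_orders:
        return ws_orders[0]                # fast path: websocket already has it
    rest_orders = fetch_rest_orders()      # slow path: ask the exchange directly
    if rest_orders:
        return rest_orders[0]              # the websocket missed the update
    return None                            # both sources agree: nothing is open


def fake_rest():
    # Stand-in for a REST call such as getOpenOrders(); returns canned data.
    return [{"orderID": "abc", "ordStatus": "New"}]


print(resolve_open_order([], fake_rest))    # {'orderID': 'abc', 'ordStatus': 'New'}
print(resolve_open_order([], lambda: []))   # None
```

REST is only queried when the websocket view is empty, so the fallback spends request budget only in the case where drift is possible.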

Case: we think we’re in a position, but there’s no order on the exchange

The reverse drift is just as dangerous.

In manage_closed_state(), if there are no open orders, I clear internal state.

That is a safety reset, not a “strategy decision.”
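
A minimal sketch of that reset, assuming the bot keeps its view of the world in a small state object. `BotState`, its fields, and `reset_if_orphaned` are hypothetical names, not the repo's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotState:
    in_position: bool = False
    open_order: Optional[dict] = None


def reset_if_orphaned(state: BotState, exchange_open_orders: list) -> bool:
    """Clear local state when the exchange reports no open orders.

    Returns True if a reset happened, so the caller can log it.
    """
    if state.in_position and not exchange_open_orders:
        state.in_position = False
        state.open_order = None
        return True
    return False


state = BotState(in_position=True, open_order={"orderID": "abc"})
print(reset_if_orphaned(state, exchange_open_orders=[]))  # True
print(state)  # BotState(in_position=False, open_order=None)
```

Returning a flag instead of resetting silently keeps the event visible in logs, which matters because every reset is evidence that the websocket view drifted.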

Failure recovery: stop trying to be resilient inside the loop

The early version of me wanted to handle everything inside the trading loop:

catch exceptions
keep going
log harder

In May 2020 I changed my mind.

For certain fault conditions (stale websocket, out-of-sync feed), I wanted a crash.

Not because crashing is fun — because restarting is cleaner than half-recovering.

So the websocket client can set a “faulted” state, and the bot loop checks it:

# BitmexPythonChappie/BitMEXBotClient.py

if not self.ws.isActive() or self.ws.isFaulted:
    return  # or exit in the supervisor-driven version

And the websocket client can fault the main thread when health checks fail.

That’s not graceful.

That’s intentional.

Resilience belongs at the process boundary.

Inside the bot: be conservative and fail-fast. Outside the bot: supervisor restarts, alerting, and “did we leave orders behind?” checks.
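
The post does not show the supervisor itself. As a rough sketch of what "resilience at the process boundary" could look like, the loop below restarts the bot after a non-zero exit with exponential backoff and gives up after a fixed number of restarts. The function `supervise` and all of its parameters are assumptions; the bot launcher and the sleep function are injected so the loop can be exercised without a real process.

```python
import time
from typing import Callable


def supervise(
    run_bot: Callable[[], int],
    max_restarts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the bot, restarting after non-zero exits with exponential backoff.

    Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        exit_code = run_bot()
        if exit_code == 0:
            return restarts                     # clean shutdown: do not restart
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts")
        sleep(base_delay_s * 2 ** restarts)     # 1s, 2s, 4s, ...
        restarts += 1


codes = iter([1, 1, 0])          # two faults, then a clean exit
delays = []
print(supervise(lambda: next(codes), sleep=delays.append))  # 2
print(delays)                                               # [1.0, 2.0]
```

In a real deployment `run_bot` could wrap `subprocess.call` on the bot script, or the whole loop could be replaced by a process manager. Either way, the backoff delay is a natural place to run the "did we leave orders behind?" check before the next start.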

The mindset shift: safety isn’t “extra”, it’s the product

Before BitMEX, I thought “outages” were bad luck.

After BitMEX, I treated outages as a normal market regime.

Safety engineering is just admitting:

you don’t control the venue
you don’t control the network
you don’t control your own bugs

So you build a bot that can say:

“I don’t know. I’m stopping.”

That sentence is where real trading automation starts.

Repo pointers

bitmex-deeprl-research (repo)

All code and experiments referenced in this 2019–2020 series.

BitMEXWebsocketClient.py

Heartbeat checks, staleness detection, and faulting behavior.

BitMEXBotClient.py

State machine + reconciliation logic + “observe-only” mode.

config.ini

The simplest kill switch: allow_trade.

What’s next

In the next post, I finally do what I’d been postponing:

First live runs

That’s June 2020.

First Live Runs - Small Size, Big Lessons

Backtests looked amazing. Live PnL didn't. In June 2020 I ran the first real BitMEX live loop at tiny size and learned the most important lesson in trading ML: regime is the boss.
