May 31, 2020
Safety Engineering - Kill Switches, Reconciliation, and Failure Recovery

In May 2020, I stop hoping the bot is “fine” and start giving it explicit failure states — stale websockets, missing fills, rate-limits, and the kill switches that keep a live loop honest.

Axel Domingues

In April, I finally wired Chappie: a trained policy -> a running process -> real BitMEX sockets -> real orders.

In May, I learned the part nobody puts in the “cool RL trading bot” demos:

A live trading loop is mostly failure handling.

Not “exceptions” either.

The dangerous failures are the quiet ones:

  • websocket stalls but your code keeps trading
  • you miss an order update and your internal state drifts
  • rate limits creep up and you start dropping requests
  • the exchange is “up”… except it isn’t (and you’re about to be the last one to notice)

This post is my safety engineering month: kill switches, reconciliation, and recovery.


The safety stack (what I wanted the bot to guarantee)

I didn’t want “make money.” I wanted something more basic first:

  1. Don’t trade on broken market data.
  2. Don’t keep trading if our view of reality is stale.
  3. If we’re unsure, stop.
  4. If we stop, stop in a way that leaves the account in a knowable state.

That turned into three layers:

  • Liveness: detect websocket stalls, missing subscriptions, and stale quotes.
  • Reconciliation: periodically confirm orders/positions via REST when websocket messages are missing.
  • Fail-fast recovery: treat certain states as terminal and let a supervisor restart cleanly.

All of this lives in the repo:

  • BitmexPythonChappie/BitMEXWebsocketClient.py
  • BitmexPythonChappie/BitMEXBotClient.py

Kill switch #1: “If the websocket is stale, we’re done”

I’m not interested in being clever when the feed is stale.

I want the bot to notice and fail loudly.

In BitMEXWebsocketClient.py, I added a periodic status check that watches for “last updated” drift and progressively escalates:

  • if the socket hasn’t updated in a few seconds: stop the websocket loop
  • if it keeps drifting: fault the main thread (hard kill)

A simplified slice of the logic:

# BitmexPythonChappie/BitMEXWebsocketClient.py

diff_from_last_updated = datetime.utcnow() - self.last_updated

if diff_from_last_updated.total_seconds() > 5:
    self.ws.stop()                 # soft stop

if diff_from_last_updated.total_seconds() > 20:
    self.__fault_main_thread()     # hard kill

That second threshold (faulting the main thread) is important.

A “soft stop” can leave you in limbo: the process still runs, but market data is gone.

A hard fault forces recovery to happen outside the trading logic.

A trading bot that can’t stop itself is not a trading bot — it’s a runaway process.
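
The two-threshold escalation can be isolated as a pure function, which makes the thresholds testable without a live socket. This is a minimal sketch, not the repo's code: `FeedAction`, `classify_staleness`, and the two constants are hypothetical names, and only the 5 and 20 second values come from the slice above.

```python
from datetime import datetime, timedelta
from enum import Enum


class FeedAction(Enum):
    OK = "ok"
    SOFT_STOP = "soft_stop"    # stop the websocket loop
    HARD_FAULT = "hard_fault"  # fault the main thread


# Thresholds mirror the slice above; the constant names are hypothetical.
SOFT_STOP_AFTER = timedelta(seconds=5)
HARD_FAULT_AFTER = timedelta(seconds=20)


def classify_staleness(last_updated: datetime, now: datetime) -> FeedAction:
    """Map the age of the last websocket update to an escalation level."""
    age = now - last_updated
    if age > HARD_FAULT_AFTER:
        return FeedAction.HARD_FAULT
    if age > SOFT_STOP_AFTER:
        return FeedAction.SOFT_STOP
    return FeedAction.OK


# Example: a feed that last updated at t0, checked 1, 6, and 21 seconds later.
t0 = datetime(2020, 5, 31, 12, 0, 0)
for seconds in (1, 6, 21):
    action = classify_staleness(t0, t0 + timedelta(seconds=seconds))
    print(seconds, action.value)
```

Passing `now` in explicitly, instead of reading the clock inside the function, is what lets the escalation be checked with fixed timestamps.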

Kill switch #2: “If quotes aren’t synced, don’t pretend they are”

A websocket can be alive but still be wrong.

So I tracked two separate “sync” flags:

  • quotes_synced
  • orderBook_synced

And I let timeouts turn them into kill conditions.

In the same status check, I made “not synced for too long” a fault:

# BitmexPythonChappie/BitMEXWebsocketClient.py

if not self.quotes_synced and diff_from_last_updated.total_seconds() > 150:
    self.__fault_main_thread()

if not self.orderBook_synced and diff_from_last_updated.total_seconds() > 150:
    self.__fault_main_thread()

That is the “market talks back” idea made operational:

If the market feed is incomplete, it’s telling you to stop.
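
One way to express "not synced for too long" is to time each flag separately, starting the clock when the flag first goes false. The sketch below is an illustration under that assumption, not the repo's implementation: `SyncTracker` and its methods are hypothetical, and only the 150 second limit comes from the slice above.

```python
from typing import Dict, Optional


class SyncTracker:
    """Tracks how long each feed has been out of sync (hypothetical helper)."""

    def __init__(self, max_unsynced_seconds: float = 150.0) -> None:
        self.max_unsynced_seconds = max_unsynced_seconds
        self._unsynced_since: Dict[str, Optional[float]] = {}

    def update(self, feed: str, synced: bool, now: float) -> None:
        """Record the latest sync flag for a feed at time `now` (seconds)."""
        if synced:
            self._unsynced_since[feed] = None
        elif self._unsynced_since.get(feed) is None:
            # Remember only the first moment the feed went out of sync.
            self._unsynced_since[feed] = now

    def should_fault(self, now: float) -> bool:
        """True if any feed has been unsynced for longer than the limit."""
        return any(
            since is not None and now - since > self.max_unsynced_seconds
            for since in self._unsynced_since.values()
        )


tracker = SyncTracker()
tracker.update("quotes", synced=False, now=0.0)
tracker.update("orderBook", synced=True, now=0.0)
print(tracker.should_fault(now=100.0))  # False: still inside the limit
print(tracker.should_fault(now=151.0))  # True: quotes unsynced for too long
```

With this shape, a feed that keeps receiving messages but never finishes syncing still trips the kill condition.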


Kill switch #3: “Trading enabled is a config flag, not a code comment”

The simplest safety feature I kept reusing was also the most boring:

a config toggle.

In BitMEXBotClient.py, I read allow_trade from config.ini and treated it as the highest authority.

If it’s off, the bot can still run, log, and monitor… but it won’t send orders.

This became my “dry-run in prod” mode.

; BitmexPythonChappie/config.ini
[Bot]
allow_trade = false

And in the bot:

# BitmexPythonChappie/BitMEXBotClient.py

trade_enabled = self.config.getboolean('Bot', 'allow_trade')
if not trade_enabled:
    return  # observe only

This sounds trivial, but it mattered because it changed my workflow:

  • deploy with allow_trade = false
  • verify data, subscriptions, time drift, and logs
  • flip the flag only when everything looks sane
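
The observe-only gate can be sketched end to end with the standard library's `configparser`. `OrderGate`, `submit`, and the `sent` list are hypothetical stand-ins for the bot's order path. The `fallback=False` default is an assumption added here so that a missing flag leaves trading disabled.

```python
import configparser

CONFIG_TEXT = """
[Bot]
allow_trade = false
"""


class OrderGate:
    """Wraps order submission behind the allow_trade flag (hypothetical class)."""

    def __init__(self, config: configparser.ConfigParser) -> None:
        # fallback=False means a missing section or option disables trading.
        self.trade_enabled = config.getboolean("Bot", "allow_trade", fallback=False)
        self.sent = []

    def submit(self, side: str, quantity: int) -> bool:
        """Send the order only when trading is enabled; return whether it was sent."""
        if not self.trade_enabled:
            print(f"[observe-only] would send {side} {quantity}")
            return False
        self.sent.append((side, quantity))  # stand-in for the real exchange call
        return True


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
gate = OrderGate(config)
print(gate.submit("Buy", 10))
```

Logging the order that would have been sent keeps the dry run useful: the decision path is exercised even though nothing reaches the exchange.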

Rate-limit safety: “If the request budget is low, slow down”

When BitMEX starts rate-limiting you, the next failure mode is subtle:

  • some requests fail
  • you miss an update
  • you drift
  • you trade on a hallucinated state

So I added a lock that engages when the remaining request budget is too low.

In BitMEXBotClient.py, if the remaining budget is below a threshold, I set self.locked = True.

That lock propagates into behavior decisions: no new trades while locked.

# BitmexPythonChappie/BitMEXBotClient.py

remaining = self.bitmex.getLockRemainingRequests()
if remaining < self.LOCK_REQ_THRESHOLD:
    self.locked = True

This is less about “saving requests” and more about refusing to operate in degraded IO.
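
The slice above shows only how the lock is set. A self-contained sketch of the same idea follows, with one addition that is an assumption rather than something from the repo: a second, higher threshold for releasing the lock, so it does not flap when the budget hovers near the limit. `RequestBudgetLock` and both threshold values are illustrative.

```python
class RequestBudgetLock:
    """Locks trading when the remaining request budget runs low.

    Both thresholds are illustrative assumptions, not values from the repo.
    """

    def __init__(self, lock_below: int = 10, unlock_above: int = 30) -> None:
        if unlock_above <= lock_below:
            raise ValueError("unlock_above must be greater than lock_below")
        self.lock_below = lock_below
        self.unlock_above = unlock_above
        self.locked = False

    def update(self, remaining: int) -> bool:
        """Update the lock from the latest remaining-request count."""
        if remaining < self.lock_below:
            self.locked = True
        elif remaining > self.unlock_above:
            self.locked = False
        # Between the two thresholds the previous state is kept.
        return self.locked


lock = RequestBudgetLock()
for remaining in (50, 8, 20, 35):
    print(remaining, lock.update(remaining))
```

In this run the lock engages at 8, stays engaged at 20, and releases only at 35.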


Reconciliation: websockets lie by omission

Even with a healthy websocket, you can miss a message.

This is what drift looks like:

  • exchange has an open order
  • your websocket missed the update
  • your bot thinks you’re flat
  • your policy makes a “safe” decision
  • you accidentally stack exposure

So I built reconciliation into the state machine.

Case: we opened a position, but the websocket doesn’t show the order

In manage_opening_state(), if the websocket doesn’t have the open order, I fall back to a REST query.

# BitmexPythonChappie/BitMEXBotClient.py

if len(open_order_ws) == 0:
    open_order_rest = self.bitmex.getOpenOrders()  # REST fallback
    if len(open_order_rest) > 0:
        self.open_order = open_order_rest[0]

This is the entire point of reconciliation:

Websocket is your fast path, REST is your truth oracle.
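
The fallback can be written as a small function that takes the REST call as a parameter, which makes the "websocket first, REST second" order easy to exercise with fake data. This is a sketch only: `reconcile_open_order`, `fetch_rest_orders`, and the dict-shaped orders are hypothetical, not the repo's types.

```python
from typing import Callable, List, Optional

Order = dict  # minimal stand-in for an exchange order payload


def reconcile_open_order(
    ws_orders: List[Order],
    fetch_rest_orders: Callable[[], List[Order]],
) -> Optional[Order]:
    """Return the open order, preferring the websocket view and falling back to REST."""
    if ws_orders:
        return ws_orders[0]
    rest_orders = fetch_rest_orders()  # only called when the websocket shows nothing
    return rest_orders[0] if rest_orders else None


# Example: the websocket missed the update, but REST still reports the order.
recovered = reconcile_open_order([], lambda: [{"orderID": "abc", "side": "Buy"}])
print(recovered)
```

Because the REST call runs only when the websocket view is empty, the fast path costs no request budget.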

Case: we think we’re in a position, but there’s no order on the exchange

The reverse drift is just as dangerous.

In manage_closed_state(), if there are no open orders, I clear internal state.

That is a safety reset, not a “strategy decision.”
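
A minimal sketch of that reset, assuming the internal state is just an order reference and a position flag. `BotState`, its fields, and `reset_if_no_exchange_orders` are hypothetical names. The return value reports whether the reset actually corrected drift, so the caller can log it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BotState:
    """Internal view of the bot (field names are hypothetical)."""
    open_order: Optional[dict] = None
    in_position: bool = False


def reset_if_no_exchange_orders(state: BotState, exchange_orders: List[dict]) -> bool:
    """Clear internal state when the exchange reports no open orders.

    Returns True if the internal state had drifted and was cleared.
    """
    if exchange_orders:
        return False
    drifted = state.open_order is not None or state.in_position
    state.open_order = None
    state.in_position = False
    return drifted


state = BotState(open_order={"orderID": "abc"}, in_position=True)
print(reset_if_no_exchange_orders(state, exchange_orders=[]))
print(state)
```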


Failure recovery: stop trying to be resilient inside the loop

The early version of me wanted to handle everything inside the trading loop:

  • catch exceptions
  • keep going
  • log harder

In May 2020 I changed my mind.

For certain fault conditions (stale websocket, out-of-sync feed), I wanted a crash.

Not because crashing is fun — because restarting is cleaner than half-recovering.

So the websocket client can set a “faulted” state, and the bot loop checks it:

# BitmexPythonChappie/BitMEXBotClient.py

if not self.ws.isActive() or self.ws.isFaulted:
    return  # or exit in the supervisor-driven version

And the websocket client can fault the main thread when health checks fail.

That’s not graceful.

That’s intentional.

Resilience belongs at the process boundary.

Inside the bot: be conservative and fail-fast. Outside the bot: supervisor restarts, alerting, and “did we leave orders behind?” checks.
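
The process-boundary side can be as small as a loop that reruns the bot after a non-zero exit. The sketch below is one possible shape, not the setup used for Chappie: `supervise`, the restart cap, and the linear backoff are all assumptions. A real deployment would more likely hand this job to a service manager.

```python
import subprocess
import sys
import time
from typing import List


def supervise(command: List[str], max_restarts: int = 3, backoff_seconds: float = 1.0) -> int:
    """Run `command`, restarting it after each non-zero exit.

    Returns the number of restarts performed before a clean exit.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return restarts  # clean shutdown: do not restart
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts (exit {exit_code})")
        restarts += 1
        # A real supervisor would alert and check for leftover orders here.
        time.sleep(backoff_seconds * restarts)


# Example with a stand-in child process that exits cleanly on the first run.
print(supervise([sys.executable, "-c", "print('bot ran')"]))
```

Capping the restarts matters as much as restarting: a bot that crashes on every start should end up stopped, not looping.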


The mindset shift: safety isn’t “extra”, it’s the product

Before BitMEX, I thought “outages” were bad luck.

After BitMEX, I treated outages as a normal market regime.

Safety engineering is just admitting:

  • you don’t control the venue
  • you don’t control the network
  • you don’t control your own bugs

So you build a bot that can say:

“I don’t know. I’m stopping.”

That sentence is where real trading automation starts.


Repo pointers

  • bitmex-deeprl-research (repo): All code and experiments referenced in this 2019–2020 series.
  • BitMEXWebsocketClient.py: Heartbeat checks, staleness detection, and faulting behavior.
  • BitMEXBotClient.py: State machine + reconciliation logic + “observe-only” mode.
  • config.ini: The simplest kill switch: allow_trade.


What’s next

In the next post, I finally do what I’d been postponing:

First live runs — small size, big lessons.

That’s June 2020.

Axel Domingues - 2026