Blog
Nov 24, 2019 - 12 MIN READ
The 503 Lesson - Outages as a Signal, Not Just a Bug

My first live alpha monitor was “working”… until BitMEX started replying 503 right when the model got excited. That’s when I learned availability is part of market microstructure.

Axel Domingues

By November 2019, I’ve got something that looks like “a system”: a collector, a dataset, a supervised model, and a small live monitor that watches the BitMEX (Bitcoin Mercantile Exchange) order book and emits a prediction.

And then the exchange does what real systems always do: it fails… at exactly the worst time.

Not “fails” like my code throws a stack trace. Fails like: the model starts screaming “move!”, I try to place or manage orders, and BitMEX replies with HTTP 503 (Service Unavailable). Not once. Not occasionally. Repeatedly — clustered around the moments that mattered.

That was the day “outages” stopped being an infrastructure annoyance and became a market feature I had to model.

This series is research engineering, not trading advice. I’m documenting what I built, what broke, and what I learned about evaluation and system design.

What “503” really means in practice

In textbooks, you treat the exchange like an oracle:

  • you see the latest book
  • you send an order
  • you get an acknowledgement
  • you get fills
  • done

In practice, BitMEX is an API running under load, and during bursts of activity you see symptoms like:

  • REST calls returning 503
  • order placement timing out or retrying
  • websocket streams lagging or reconnecting
  • state divergence: “my bot thinks I have X open orders, but the exchange disagrees”

That last one is the killer. You can’t even reason about risk if you don’t know your own state.


What I was running (the smallest possible live loop)

October’s post was about listening to the market. November is about what happens when you try to act.

At this point in the repo, the live pieces are basically:

  • BitmexPythonChappie/BitMEXWebsocketClient.py — keeps a local view of trades/order book and exposes helpers like market_depth() and recent_trades()
  • BitmexPythonChappie/SnapshotManager.py — produces the feature snapshot used by the predictor (same “snapshot contract” I used for training)
  • BitmexPythonChappie/OrderBookMovePredictor.py — loads the trained model + applies mean/sigma + index-diff normalization (from June)
  • BitmexPythonChappie/BitMEXBotClient.py — REST client for placing/cancelling orders and reconciling what “should” exist with what actually exists
  • BitmexPythonChappie/main.py — wires it together, restarts websocket client when faulted, keeps the process alive

The key detail is that this is not a full trading strategy yet. It’s a monitor + “poke the exchange” loop that was meant to validate one thing:

When my model predicts large moves, do those moments line up with real market pressure?

The answer was: yes — and the exchange itself confirmed it by falling over.


The moment it clicked: outages cluster around market bursts

I expected outages to be random, like intermittent Wi‑Fi. Instead, they were structured:

  • quiet periods: few errors, smooth order updates
  • burst periods: prediction spikes, volume spikes, then… 503s
  • aftershock: delayed acknowledgements, “phantom orders”, and messy reconciliation

This changes the framing completely:

If 503 events correlate with “everyone is doing something right now”, then 503 is part of the market microstructure story.

It’s a signal about crowding and capacity — and it hits taker/HFT-style behaviors the hardest because they rely on tight timing.


Where the repo already hints at the problem

Even before I named the problem, the code started growing “scar tissue” around it.

Websocket faults must be treated as first-class

In BitmexPythonChappie/BitMEXWebsocketClient.py, the websocket error handler marks the client faulted and exits:

def _on_error(self, ws, error):
    self.logger.error("ERROR: %s" % error)
    self.isFaulted = True   # flag the fault so the main loop knows to rebuild the client
    self._on_close(ws)      # tear down the connection cleanly

Then in BitmexPythonChappie/main.py, the process watches for isFaulted and rebuilds the websocket client:

if wsBitMEXClient.exited or wsBitMEXClient.isFaulted:
    wsBitMEXClient.exit()  # make sure the old connection is fully torn down
    wsBitMEXClient = setup_ws_BitMEXClient(BitMEXBotClientInstance)
    snapshotManagerInstance.set_ws_bitMEX_client(wsBitMEXClient)  # rewire the snapshot feed

That’s not “elegant architecture”, but it’s real. Fault detection + auto-reconnect is the minimum for a long-running live process.

REST responses can’t be trusted to be complete

In BitmexPythonChappie/BitMEXBotClient.py, I ended up writing a reconciliation helper:

  • it fetches the exchange’s open orders
  • it filters out cancelled orders
  • and it treats “missing” as a temporary state rather than an invariant violation

You can see this mindset in check_valid_response() — it’s literally coded as: if the order isn’t there, wait and check again.
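
To make that pattern concrete, here's a minimal sketch of the "wait and check again" idea. This is not the repo's actual helper; fetch_open_orders, the field names, and the retry counts are all assumptions:

import time

def order_is_really_missing(fetch_open_orders, order_id, retries=3, wait_seconds=1.0):
    """Treat a missing order as eventually consistent: re-check a few times
    before concluding the exchange truly doesn't know about it."""
    for _ in range(retries):
        # field names are illustrative, not a guaranteed API contract
        open_ids = {o["orderID"] for o in fetch_open_orders()
                    if o.get("ordStatus") != "Canceled"}
        if order_id in open_ids:
            return False          # it showed up; local state was just early
        time.sleep(wait_seconds)  # give the exchange time to catch up
    return True                   # still missing after several checks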

That is not a theoretical risk-control system. It’s the first time I’m admitting the exchange can lie to my bot through omission, delay, or partial failure.

If you’re building anything live: “reconciliation” is not a nice-to-have. It’s the difference between a bot you can stop safely and a bot that can’t even tell what it holds.

The engineering rule I extracted

After a week of watching 503 clusters, I wrote a rule on paper and taped it above my monitor:

If the exchange is unavailable, the market is not “paused” — the market is screaming.
Treat outages like a volatility regime, not like downtime.

That rule led to four concrete changes in how I log and how I design the system.


The outage dashboard (what I started logging)

This month’s goal wasn’t profitability. It was observability: can I explain what happened after the fact?

Here’s the checklist I started tracking (some in logs, some as quick scripts/plots):

API health / reliability

  • REST request latency distribution (p50/p95/p99)
  • count of HTTP 503 responses per minute (see the sketch after this list)
  • count of other error codes (429 rate limit, 401 auth, 5xx)
  • retry counts and backoff duration
  • websocket disconnect count and reconnect time
  • websocket lag estimate (server timestamp vs machine timestamp)
  • “time since last trade” seen on websocket
  • proportion of missing snapshots (feature vector not produced)
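
To make "count of HTTP 503 responses per minute" concrete, here is a minimal sketch of the bucketing I mean. It's a toy counter under my own naming, not the repo's logging code:

import time
from collections import Counter

error_buckets = Counter()  # (minute, status_code) -> count

def record_http_error(status_code):
    """Bucket failed REST calls by minute so bursts are visible at a glance."""
    minute = int(time.time() // 60)
    error_buckets[(minute, status_code)] += 1

def errors_in_last_minute(status_code=503):
    """How many errors with this status code landed in the current minute bucket."""
    return error_buckets[(int(time.time() // 60), status_code)]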

Order-state integrity

  • order placement attempts vs acknowledgements
  • cancel attempts vs cancel confirmations
  • open orders count: websocket view vs REST view
  • “phantom order” detection (local state thinks open; REST says none, or vice versa; see the sketch after this list)
  • average time-to-fill when filled
  • partial fill frequency (filledQty < orderQty)
  • stale order duration (order lives too long without meaningful progress)
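
The "phantom order" check is really just a set comparison between the two views. A sketch, assuming both sides are lists of order dicts keyed by orderID:

def find_phantom_orders(local_open_orders, rest_open_orders):
    """Compare the bot's local view with the REST view and report divergence."""
    local_ids = {o["orderID"] for o in local_open_orders}
    rest_ids = {o["orderID"] for o in rest_open_orders}
    return {
        "local_only": local_ids - rest_ids,  # bot thinks open, exchange disagrees
        "rest_only": rest_ids - local_ids,   # exchange says open, bot forgot it
    }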

Microstructure context

  • spread during outage windows vs normal windows
  • top-of-book churn rate (how often best bid/ask changes; see the sketch after this list)
  • imbalance spike rate (my computed features peaking)
  • trade arrival rate (trades/sec)
  • volume per second
  • “gap” events (book levels vanish, then reappear)
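
The churn-rate metric is the simplest of these. A sketch over a list of (best_bid, best_ask) samples taken at a fixed rate (the sampling loop itself is assumed):

def top_of_book_churn(samples):
    """Fraction of consecutive samples where the best bid or ask changed."""
    if len(samples) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return changes / (len(samples) - 1)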

Risk context (even though this is mostly a monitor)

  • current position size
  • inventory exposure duration (time in position)
  • max adverse excursion (how bad it went before it improved; see the sketch after this list)
  • max favorable excursion
  • realized vs unrealized PnL (profit and loss)
  • estimated fees if the bot had crossed the spread
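
The excursion metrics are also cheap to compute. A sketch for a long position, given the entry price and the mid prices observed while the position was open:

def excursions(entry_price, mid_prices):
    """Max adverse / favorable excursion for a long position, in price terms."""
    worst = min(mid_prices, default=entry_price)
    best = max(mid_prices, default=entry_price)
    return {
        "max_adverse": entry_price - worst,   # how far price moved against me
        "max_favorable": best - entry_price,  # how far price moved in my favor
    }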

Failure modes: symptom → likely cause → first check

Here’s what I saw repeatedly, and how I learned to debug it fast.

  • Orders “place” but never show up → REST call accepted but delayed / exchange overloaded → check REST latency + open order list after 1–2 seconds
  • Cancel succeeds locally but order still fills → cancel confirmation delayed / race with matching engine → check order events in websocket + final REST state
  • Websocket looks frozen → connection dropped silently / server lag spike → check time since last trade + reconnect counter
  • Model spikes but snapshots stop updating → snapshot builder blocked (waiting for book) → check SnapshotManager timings + missing fields
  • Sudden jump in spread coincides with 503s → liquidity pulled; load spike → plot spread + 503 count aligned to timestamps
  • Position changes without my bot “doing anything” → stale state + late fills → reconcile position via REST + cross-check execution history
  • Panic cascade of retries → naive retry loop amplifies load / hits rate limits → check retry policy + backoff (see the sketch after this list)
  • Backtest looked amazing; live is chaos → environment assumed perfect fills/availability → write down the missing assumptions explicitly (next month)
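
The retry-cascade row deserves its own sketch: exponential backoff with jitter is the standard way to stop a burst of failures from becoming a burst of retries. request_fn is a placeholder for any REST call:

import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5):
    """Retry a flaky REST call, waiting longer (and a bit randomly) each time
    so the bot doesn't hammer an already-overloaded API."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error after the last attempt
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)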

What I changed in code this month

I didn’t “fix BitMEX”. I changed my system so it doesn’t pretend the exchange is deterministic.

Concretely, I focused on three things:

  1. Make outage detection explicit
    • log every failed REST call with status code and endpoint
    • count and bucket errors per minute (not just raw logs)
  2. Upgrade state reconciliation
    • treat order existence as eventually consistent
    • validate open orders repeatedly before assuming the bot is wrong or the exchange is wrong
  3. Add safe modes
    • when 503 frequency crosses a threshold, stop trying to be clever (see the sketch below)
    • cancel pending orders (if possible), reduce activity, and prioritize “know my state”
A subtle trap: “safe mode” can be worse than doing nothing if it spams the API with cancels and status checks. The safest behavior during outages is often less activity, not more.
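
A sketch of that threshold logic, reusing the per-minute counter idea from the dashboard section. The threshold and the helper names in the comments are made up:

SAFE_MODE_THRESHOLD = 10  # 503s per minute before the bot backs off (arbitrary)

def should_enter_safe_mode(errors_last_minute):
    """Above the threshold, stop being clever: fewer requests, not more."""
    return errors_last_minute >= SAFE_MODE_THRESHOLD

# In the main loop (illustrative only):
# if should_enter_safe_mode(errors_in_last_minute(503)):
#     cancel_pending_orders_once()  # one attempt, no retry spam
#     pause_new_orders()            # then just watch and reconcile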

Why this matters for RL (Reinforcement Learning) later

At the end of 2018 I was deep into RL algorithms: DQN (Deep Q‑Network), PPO (Proximal Policy Optimization), actor‑critic, stability tricks.

This month forced a more uncomfortable question:

What does an “optimal policy” mean if the environment sometimes refuses actions?

If I don’t model outages and delayed acknowledgements, an RL agent will learn the wrong thing — it will learn in a fantasy world where action always happens instantly.

That’s why December’s post is about the environment contract. Before I write “reward” or “step()”, I need to define what reality can do to me:

  • rejected orders
  • delayed orders
  • missing data
  • reconnects
  • partially observable state
  • variable transaction costs

The 503 lesson is the first big constraint in that contract.
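
To give a feel for what "represent that failure in data" could look like, here's one possible shape for an action's outcome. The field names are mine, not a finalized contract:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionResult:
    """What the environment reports back for an attempted order action."""
    accepted: bool                    # False on rejection or 503
    fill_qty: float = 0.0             # partial fills are allowed
    latency_seconds: float = 0.0      # delayed acknowledgement
    error_code: Optional[int] = None  # e.g. 503, 429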


Resources (what I kept open while debugging)

bitmex-deeprl-research (repo)

The live monitor lives under BitmexPythonChappie/ and this month’s focus is on websocket faults + bot reconciliation logic.

HTTP status codes (503)

A boring page that became surprisingly relevant once “the exchange is an API” stopped being a metaphor.


Field notes (what surprised me)

  • I expected my model to be the unstable piece. It wasn’t. The market plumbing was.
  • The exchange failing was evidence that my signal was aligning with real intensity.
  • “Make it robust” starts with logging, not architecture diagrams.
  • I underestimated how quickly state diverges when acknowledgements are delayed.
  • If you can’t reconcile, you can’t stop safely.

What’s next

In December I’m going to do the most unsexy thing in the world: define an interface.

Not a library interface — a truth interface:

  • What counts as state?
  • What counts as an action?
  • What counts as a fill?
  • What can fail, and how do we represent that failure in data?
  • What does evaluation mean if the environment is adversarial?