
The moment RL stops being a notebook artifact: load a PPO policy, rebuild the live observation stream, and turn BitMEX into a runtime you can monitor and control.
Axel Domingues
In 2018 I treated Reinforcement Learning (RL) like a set of algorithms I could run. By early 2020, BitMEX had forced me to treat it like a system. Training a policy was the easy part. Getting it to run in real time, against real order books, with real failure modes, was the actual work.
This is the month I wired the first "Chappie": a Python process that connects to BitMEX, builds the same observation vector my Gym agent saw during training, calls the PPO policy for an action, and (optionally) sends orders.
It was not pretty. It was not robust. But it was the first time the model stopped being a file and became a behavior.
When I say "wire the policy", I do not mean "call model.predict()".
I mean: build a small, boring program where every part has a single job, and every job can fail without taking the whole idea down.
In this repo, that program lives under BitmexPythonChappie/ and it is basically four pieces: the main.py orchestration loop, a SnapshotManager that turns the live feed into clean samples, an OrderBookMovePredictor that rebuilds the observation vector and calls the PPO policy, and a bot client that turns action IDs into orders.
If you squint, it looks like this:

BitMEX feed → SnapshotManager → OrderBookMovePredictor → bot client → BitMEX orders
The important part is the arrows: data only flows forward.
No part of this knows how the other part is implemented. They just agree on contracts.
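A sketch of those contracts, reusing the method names from the main loop further below (the repo's actual signatures may differ):

from typing import Protocol, Sequence

class SnapshotContract(Protocol):
    """What the predictor needs from the SnapshotManager."""
    def ready(self) -> bool: ...              # a fresh, sane sample is available
    def latest(self) -> Sequence[float]: ...  # current market features

class PredictorContract(Protocol):
    """What the main loop needs from the OrderBookMovePredictor."""
    def build_observation(self, snapshots: SnapshotContract) -> Sequence[float]: ...
    def predict(self, obs: Sequence[float]) -> int: ...  # returns an action ID

class BotContract(Protocol):
    """What the main loop needs from the bot client."""
    def position(self) -> float: ...
    def apply(self, action: int) -> None: ...  # turn an action ID into orders (if allowed)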
The first version of Chappie had one goal: start up the same way every time.
- Everything dangerous is behind a config flag.
- No model inference until the book is populated and timestamps look sane.
- The SnapshotManager decides when a new sample is "ready".
- If the model fails to load, the process should still run in observe-only mode.
That ordering sounds obvious, but early on I did it wrong (load the model first, then scramble to feed it). The result was inference on garbage.
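The right ordering, sketched with hypothetical helper names like wait_until_populated() (the repo's actual methods differ):

import logging

logger = logging.getLogger("chappie")

def start(config):
    # Feed first: connect and wait until the book is populated and timestamps look sane.
    snapshots = SnapshotManager(symbol=config.symbol)  # hypothetical constructor
    snapshots.connect()
    snapshots.wait_until_populated()                   # hypothetical helper

    # Model second: if it cannot load, degrade to observe-only instead of dying.
    try:
        predictor = OrderBookMovePredictor(
            model_path=config.model_path,
            mean_path=config.mean_path,
            sigma_path=config.sigma_path,
        )
    except Exception as exc:
        logger.warning("Model load failed, running observe-only: %s", exc)
        predictor = None

    return snapshots, predictor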
In training, I could change feature ordering, normalization, or window shapes and "fix it later".
In live, that is not a "bug".
That is a silent model swap.
So I started treating the observation vector as a versioned API: the feature order, the normalization artifacts, and the window shape are all part of the contract, and changing any of them is a new version, not a tweak.
In this repo, the thing that enforces that is BitmexPythonChappie/OrderBookMovePredictor.py.
The predictor keeps a rolling window of market features and then appends a tiny amount of state about the current position.
A detail that matters: even in this early baseline, the observation is not "just features". It is the windowed FEATURES_COLS plus a small block of position state. That is explicit in the dimension calculation:
# windowed market features + state vars
self.dimentions = len(bitmexEnv.FEATURES_COLS) + 5
Those extra vars were intentionally minimal because the early live bot opened a position in a coarse way (close to "all-in" behavior), not in neat 2% / 5% portfolio increments. So it did not need a rich "position sizing" state the way later environments would.
What it did need was awareness of inventory mode: am I flat, or am I already in a position I have to manage?
That sounds tiny, but it changes behavior dramatically: the policy stops being a pure entry signal and starts being a "manage the situation I'm already in" controller.
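For illustration only (the repo's exact five state variables are not reproduced here), an inventory-mode block might look like this:

import numpy as np

def position_state(qty: float, entry_price: float, last_price: float) -> np.ndarray:
    """Illustrative inventory-mode block: flat/long/short flags plus a little context."""
    is_flat = 1.0 if qty == 0 else 0.0
    is_long = 1.0 if qty > 0 else 0.0
    is_short = 1.0 if qty < 0 else 0.0
    # Unrealized move relative to entry, signed by position direction; zero when flat.
    rel_move = 0.0 if qty == 0 else (last_price - entry_price) / entry_price * np.sign(qty)
    return np.array([is_flat, is_long, is_short, rel_move, abs(qty)], dtype=np.float32)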
The predictor loads the same mean/sigma computed during training (mean.npy and sigma.npy) and applies them in live inference.
This feels like a small detail. It is not.
If mean/sigma do not match training, the live bot is effectively running a different model than the one you evaluated.
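The mechanics are deliberately boring; roughly this, assuming the artifacts are plain .npy arrays saved by the training pipeline:

import numpy as np

# The exact artifacts produced by the training pipeline (paths from config.ini).
mean = np.load("./models/mean.npy")
sigma = np.load("./models/sigma.npy")

def normalize(raw_features: np.ndarray) -> np.ndarray:
    # Same transform as training; if mean/sigma differ, this is a silent model swap.
    return (raw_features - mean) / sigma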
In the Gym environment, step_skip exists to make credit assignment possible: if you evaluate every micro-step, the reward signal can be too delayed and too noisy.
In Chappie, STEP_SKIP shows up as a hyperparameter-like constant in the predictor. It is not random.
Think of it as: "how often do I let the policy speak?"
This is one of the first places where I started to feel the difference between how often the market hands you data and how often the policy should be allowed to act. They are not the same thing.
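The gate itself is nothing more than a counter. A sketch (the class and its names are mine, not the repo's):

class ActionGate:
    """Only lets the policy speak every step_skip samples; otherwise repeats the last action."""

    def __init__(self, predictor, step_skip: int):
        self.predictor = predictor
        self.step_skip = step_skip
        self.samples_seen = 0
        self.last_action = 0  # assume 0 means "do nothing"

    def on_sample(self, obs):
        self.samples_seen += 1
        if self.samples_seen % self.step_skip == 0:
            self.last_action = self.predictor.predict(obs)
        return self.last_action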
The predictor returns an action ID from the PPO policy. The bot client then interprets that as one of a small set of intents: roughly, do nothing, open a coarse position long or short, or close what is open.
I kept it intentionally coarse.
Not because it is optimal, but because it is testable.
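Something in the spirit of this mapping, where the intent names are mine and only the idea of a tiny, closed action set comes from the repo:

from enum import Enum

class Intent(Enum):
    # Names are illustrative; the repo switches on raw action IDs.
    HOLD = 0
    OPEN_LONG = 1
    OPEN_SHORT = 2
    CLOSE = 3

def interpret(action_id: int) -> Intent:
    # Anything unexpected collapses to "do nothing".
    try:
        return Intent(action_id)
    except ValueError:
        return Intent.HOLD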
Everything that can lose money should be disabled by default.
In this repo, that is exactly what BitmexPythonChappie/config.ini does. A simplified, sanitized sketch looks like this:
[Config]
allow_trade = false
symbol = XBTUSD
testnet = true
log_level = INFO
# model + preprocessing artifacts
model_path = ./models/ppo_policy.zip
mean_path = ./models/mean.npy
sigma_path = ./models/sigma.npy
allow_trade = false is the key. It turns the bot into "observe + predict + log".
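Reading the flag defensively matters as much as writing it. A sketch with configparser, assuming nothing beyond the keys shown above:

import configparser

parser = configparser.ConfigParser()
parser.read("config.ini")

# Anything missing or unreadable falls back to "do not trade", never the other way around.
allow_trade = parser.getboolean("Config", "allow_trade", fallback=False)
testnet = parser.getboolean("Config", "testnet", fallback=True)
symbol = parser.get("Config", "symbol", fallback="XBTUSD")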
Only after I was confident that:
- the live observations matched what the model saw in training
- the logged actions looked sane over long observe-only runs
- the loop survived disconnects and restarts without lying about its state
...did I consider flipping it.
In practice I kept config.ini as an operator file (ignored by git) and/or injected secrets via environment variables.
The orchestration in BitmexPythonChappie/main.py is deliberately simple:
- wait until the SnapshotManager says a new sample is ready
- build the observation and ask the predictor for an action
- log the action, the timestamp, and the current position
- if allow_trade, send the action to the bot client
Pseudo-code (not the exact repo code):
while True:
    if snapshot_manager.ready():
        obs = predictor.build_observation(snapshot_manager)
        action = predictor.predict(obs)
        logger.info({"action": action, "ts": now(), "pos": bot.position()})
        if config.allow_trade:
            bot.apply(action)
The point is not elegance.
The point is that you can read it at 2am.
Before the famous 503 lesson, the market already had ways of telling me "your assumptions are cute":
- websockets that dropped and came back with gaps
- snapshots with missing or late data
- timestamps that drifted between my clock and the exchange's
- events arriving asynchronously, in an order I had not assumed
Most of these were not "ML" problems.
They were plumbing problems.
And they were exactly the problems that decide whether an ML system survives.
When I ran Chappie, I kept a little discipline that saved me from a lot of self-deception: log every prediction with its timestamp and the current position, even when allow_trade is off, so the observe-only runs could be judged honestly later.
The two files that matter in this wiring pass:
- BitmexPythonChappie/main.py: the Chappie entrypoint (main loop). Orchestration: connect, warm up, snapshot, predict, and (optionally) trade.
- BitmexPythonChappie/OrderBookMovePredictor.py: the policy plus preprocessing. Loads the PPO model, rebuilds the observation vector, applies mean/sigma, outputs action IDs.
Why is a separate live process needed at all if the Gym environment already exists?
Because Gym is a contract, not a market feed. The live system has to deal with missing data, reconnects, time drift, and asynchronous events. Chappie is the bridge between the contract and the messy world.
Is Chappie profitable at this stage?
No. This is the first wiring pass. The goal is to make the policy executable and observable, not profitable. The next steps are safety engineering, reconciliation, and failure recovery.
What is the biggest mistake to avoid here?
Underestimating how many ways the observation contract can drift. Most early "live" failures are not about neural nets. They are about preprocessing, feature parity, and the assumptions baked into sampling.
The next post is Safety Engineering.
Now that the policy can run as a process, the real question becomes: how do you keep it from harming you when reality behaves like reality?
Safety Engineering - Kill Switches, Reconciliation, and Failure Recovery
In May 2020, I stop hoping the bot is “fine” and start giving it explicit failure states — stale websockets, missing fills, rate-limits, and the kill switches that keep a live loop honest.