
The moment RL stops being a notebook artifact: load a PPO policy, rebuild the live observation stream, and turn BitMEX into a runtime you can monitor and control.
Axel Domingues
In 2018 I treated Reinforcement Learning (RL) like a set of algorithms I could run. By early 2020, BitMEX had forced me to treat it like a system. Training a policy was the easy part. Getting it to run in real time, against real order books, with real failure modes, was the actual work.
This is the month I wired the first "Chappie": a Python process that connects to BitMEX, builds the same observation vector my Gym agent saw during training, calls the PPO policy for an action, and (optionally) sends orders.
It was not pretty. It was not robust. But it was the first time the model stopped being a file and became a behavior.
When I say "wire the policy", I do not mean "call model.predict()".
I mean: build a small, boring program where every part has a single job, and every job can fail without taking the whole idea down.
In this repo, that program lives under BitmexPythonChappie/ and it is basically four pieces: the main.py orchestration loop, a SnapshotManager that turns the live feed into clean samples, an OrderBookMovePredictor that rebuilds the observation vector and calls the PPO policy, and a bot client that turns action IDs into orders.
If you squint, it looks like this:

BitMEX feed → SnapshotManager → OrderBookMovePredictor → bot client → BitMEX orders
The important part is the arrows: data only flows forward.
No part of this knows how the other part is implemented. They just agree on contracts.
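A sketch of those contracts, reusing the method names from the main loop further below (the repo's actual signatures may differ):

from typing import Protocol, Sequence

class SnapshotContract(Protocol):
    """What the predictor needs from the SnapshotManager."""
    def ready(self) -> bool: ...              # a fresh, sane sample is available
    def latest(self) -> Sequence[float]: ...  # current market features

class PredictorContract(Protocol):
    """What the main loop needs from the OrderBookMovePredictor."""
    def build_observation(self, snapshots: SnapshotContract) -> Sequence[float]: ...
    def predict(self, obs: Sequence[float]) -> int: ...  # returns an action ID

class BotContract(Protocol):
    """What the main loop needs from the bot client."""
    def position(self) -> float: ...
    def apply(self, action: int) -> None: ...  # turn an action ID into orders (if allowed)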
The first version of Chappie had one goal: start up the same way every time.
- Everything dangerous is behind a config flag.
- No model inference until the book is populated and timestamps look sane.
- The SnapshotManager decides when a new sample is "ready".
- If the model fails to load, the process should still run in observe-only mode.
That ordering sounds obvious, but early on I did it wrong (load the model first, then scramble to feed it). The result was inference on garbage.
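The right ordering, sketched with hypothetical helper names like wait_until_populated() (the repo's actual methods differ):

import logging

logger = logging.getLogger("chappie")

def start(config):
    # Feed first: connect and wait until the book is populated and timestamps look sane.
    snapshots = SnapshotManager(symbol=config.symbol)  # hypothetical constructor
    snapshots.connect()
    snapshots.wait_until_populated()                   # hypothetical helper

    # Model second: if it cannot load, degrade to observe-only instead of dying.
    try:
        predictor = OrderBookMovePredictor(
            model_path=config.model_path,
            mean_path=config.mean_path,
            sigma_path=config.sigma_path,
        )
    except Exception as exc:
        logger.warning("Model load failed, running observe-only: %s", exc)
        predictor = None

    return snapshots, predictor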
In training, I could change feature ordering, normalization, or window shapes and "fix it later".
In live, that is not a "bug".
That is a silent model swap.
So I started treating the observation vector as a versioned API: the feature order, the normalization artifacts, and the window shape are all part of the contract, and changing any of them is a new version, not a tweak.
In this repo, the thing that enforces that is BitmexPythonChappie/OrderBookMovePredictor.py.
The predictor keeps a rolling window of market features and then appends a tiny amount of state about the current position.
A detail that matters: even in this early baseline, the observation is not "just features". It is the windowed FEATURES_COLS plus a small block of position state. That is explicit in the dimension calculation:
# windowed market features + state vars
self.dimentions = len(bitmexEnv.FEATURES_COLS) + 5
Those extra vars were intentionally minimal because the early live bot opened a position in a coarse way (close to "all-in" behavior), not in neat 2% / 5% portfolio increments. So it did not need a rich "position sizing" state the way later environments would.
What it did need was awareness of inventory mode: am I flat, or am I already in a position I have to manage?
That sounds tiny, but it changes behavior dramatically: the policy stops being a pure entry signal and starts being a "manage the situation I'm already in" controller.
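For illustration only (the repo's exact five state variables are not reproduced here), an inventory-mode block might look like this:

import numpy as np

def position_state(qty: float, entry_price: float, last_price: float) -> np.ndarray:
    """Illustrative inventory-mode block: flat/long/short flags plus a little context."""
    is_flat = 1.0 if qty == 0 else 0.0
    is_long = 1.0 if qty > 0 else 0.0
    is_short = 1.0 if qty < 0 else 0.0
    # Unrealized move relative to entry, signed by position direction; zero when flat.
    rel_move = 0.0 if qty == 0 else (last_price - entry_price) / entry_price * np.sign(qty)
    return np.array([is_flat, is_long, is_short, rel_move, abs(qty)], dtype=np.float32)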
The predictor loads the same mean/sigma computed during training (mean.npy and sigma.npy) and applies them in live inference.
This feels like a small detail. It is not.
If mean/sigma do not match training, the live bot is effectively running a different model than the one you evaluated.
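The mechanics are deliberately boring; roughly this, assuming the artifacts are plain .npy arrays saved by the training pipeline:

import numpy as np

# The exact artifacts produced by the training pipeline (paths from config.ini).
mean = np.load("./models/mean.npy")
sigma = np.load("./models/sigma.npy")

def normalize(raw_features: np.ndarray) -> np.ndarray:
    # Same transform as training; if mean/sigma differ, this is a silent model swap.
    return (raw_features - mean) / sigma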
In the Gym environment, step_skip exists to make credit assignment possible: if you evaluate every micro-step, the reward signal can be too delayed and too noisy.
In Chappie, STEP_SKIP shows up as a hyperparameter-like constant in the predictor. It is not random.
Think of it as: "how often do I let the policy speak?"
This is one of the first places where I started to feel the difference between how often the market hands you data and how often the policy should be allowed to act. They are not the same thing.
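The gate itself is nothing more than a counter. A sketch (the class and its names are mine, not the repo's):

class ActionGate:
    """Only lets the policy speak every step_skip samples; otherwise repeats the last action."""

    def __init__(self, predictor, step_skip: int):
        self.predictor = predictor
        self.step_skip = step_skip
        self.samples_seen = 0
        self.last_action = 0  # assume 0 means "do nothing"

    def on_sample(self, obs):
        self.samples_seen += 1
        if self.samples_seen % self.step_skip == 0:
            self.last_action = self.predictor.predict(obs)
        return self.last_action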
The predictor returns an action ID from the PPO policy. The bot client then interprets that as one of a small set of intents: roughly, do nothing, open a coarse position long or short, or close what is open.
I kept it intentionally coarse.
Not because it is optimal, but because it is testable.
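Something in the spirit of this mapping, where the intent names are mine and only the idea of a tiny, closed action set comes from the repo:

from enum import Enum

class Intent(Enum):
    # Names are illustrative; the repo switches on raw action IDs.
    HOLD = 0
    OPEN_LONG = 1
    OPEN_SHORT = 2
    CLOSE = 3

def interpret(action_id: int) -> Intent:
    # Anything unexpected collapses to "do nothing".
    try:
        return Intent(action_id)
    except ValueError:
        return Intent.HOLD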
Everything that can lose money should be disabled by default.
In this repo, that is exactly what BitmexPythonChappie/config.ini does. A simplified, sanitized sketch looks like this:
[Config]
allow_trade = false
symbol = XBTUSD
testnet = true
log_level = INFO
# model + preprocessing artifacts
model_path = ./models/ppo_policy.zip
mean_path = ./models/mean.npy
sigma_path = ./models/sigma.npy
allow_trade = false is the key. It turns the bot into "observe + predict + log".
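Reading the flag defensively matters as much as writing it. A sketch with configparser, assuming nothing beyond the keys shown above:

import configparser

parser = configparser.ConfigParser()
parser.read("config.ini")

# Anything missing or unreadable falls back to "do not trade", never the other way around.
allow_trade = parser.getboolean("Config", "allow_trade", fallback=False)
testnet = parser.getboolean("Config", "testnet", fallback=True)
symbol = parser.get("Config", "symbol", fallback="XBTUSD")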
Only after I was confident that:
- the live observations matched what the model saw in training
- the logged actions looked sane over long observe-only runs
- the loop survived disconnects and restarts without lying about its state
...did I consider flipping it.
In practice I kept config.ini as an operator file (ignored by git) and/or injected secrets via environment variables.
The orchestration in BitmexPythonChappie/main.py is deliberately simple:
- wait until the SnapshotManager says a new sample is ready
- build the observation and ask the predictor for an action
- log the action, the timestamp, and the current position
- if allow_trade, send the action to the bot client
Pseudo-code (not the exact repo code):
while True:
    if snapshot_manager.ready():
        obs = predictor.build_observation(snapshot_manager)
        action = predictor.predict(obs)
        logger.info({"action": action, "ts": now(), "pos": bot.position()})
        if config.allow_trade:
            bot.apply(action)
The point is not elegance.
The point is that you can read it at 2am.
Before the famous 503 lesson, the market already had ways of telling me "your assumptions are cute":
- websockets that dropped and came back with gaps
- snapshots with missing or late data
- timestamps that drifted between my clock and the exchange's
- events arriving asynchronously, in an order I had not assumed
Most of these were not "ML" problems.
They were plumbing problems.
And they were exactly the problems that decide whether an ML system survives.
When I ran Chappie, I kept a little discipline that saved me from a lot of self-deception: log every prediction with its timestamp and the current position, even when allow_trade is off, so the observe-only runs could be judged honestly later.
The two files that matter in this wiring pass:
- BitmexPythonChappie/main.py: the Chappie entrypoint (main loop). Orchestration: connect, warm up, snapshot, predict, and (optionally) trade.
- BitmexPythonChappie/OrderBookMovePredictor.py: the policy plus preprocessing. Loads the PPO model, rebuilds the observation vector, applies mean/sigma, outputs action IDs.
Why is a separate live process needed at all if the Gym environment already exists?
Because Gym is a contract, not a market feed. The live system has to deal with missing data, reconnects, time drift, and asynchronous events. Chappie is the bridge between the contract and the messy world.
Is Chappie profitable at this stage?
No. This is the first wiring pass. The goal is to make the policy executable and observable, not profitable. The next steps are safety engineering, reconciliation, and failure recovery.
What is the biggest mistake to avoid here?
Underestimating how many ways the observation contract can drift. Most early "live" failures are not about neural nets. They are about preprocessing, feature parity, and the assumptions baked into sampling.
The next post is Safety Engineering.
Now that the policy can run as a process, the real question becomes: how do you keep it from harming you when reality behaves like reality?
Safety Engineering - Kill Switches, Reconciliation, and Failure Recovery
In May 2020, I stop hoping the bot is “fine” and start giving it explicit failure states — stale websockets, missing fills, rate-limits, and the kill switches that keep a live loop honest.