Oct 25, 2020 - 14 MIN READ
bitmex-management-gym: Position Sizing and the First Risk-Aware Agent


After months of "all-in" agents with bull personalities, I rebuilt the environment to teach risk: stackable positions, time-awareness, and penalties that prevent reward-hacking.

Axel Domingues


By October 2020 I’d already learned the painful lesson: an RL trading agent without position management is basically a coin flip with strong opinions.

The early Gym baselines could open a position, ride it, close it — but they couldn’t size, they couldn’t de-risk, and they couldn’t recover from being wrong without paying the full price.

And after June 2020, I was hyper-aware of a second trap: regime bias.

My training data (and even my “validation” slices) were still overweight bull conditions, so the agents I trained had a bull personality. They looked great… right up until the market stopped behaving like the training distribution.

So I built the next environment as a deliberate upgrade path:

  • keep the microstructure feature pipeline
  • keep walk-forward evaluation discipline
  • add a management layer: position sizing, inventory awareness, and time-in-position
  • shape rewards so the agent learns behavior that survives stress (not just backtests)

This post documents the first version: bitmex-management-gym.


The idea: “entry decisions” aren’t enough

In earlier months, most of the agent’s intelligence was forced into a single question:

should I flip long, flip short, or do nothing?

But live execution doesn’t work like that — especially on BitMEX:

  • you can be partially right and still want smaller exposure
  • you can be right early and still need to reduce risk
  • you can be wrong and need a controlled exit

So the environment’s job changed.

Not “predict the next move”.

Manage risk while the future is uncertain.

This was also a psychological shift. In 2018 I treated RL as “learning an optimal policy”. In 2020 I started treating RL as “teaching behavior under constraints”.

The new environment: bitmex_management_env.py

The core artifact for this month lives here:

  • bitmex-management-gym/gym_bitmex_management/envs/bitmex_management_env.py

It keeps the same foundation (precomputed microstructure features + sliced episodes), but adds two missing capabilities:

  1. stacked position sizing (multiple partial entries)
  2. risk-aware observation + reward shaping (time, exposure, and penalties)

What changed technically

1) Position becomes a stack, not a boolean

Instead of “one position open or not”, the environment tracks arrays of trades:

  • open_longs_price_array + timestamps + stake percentages
  • open_shorts_price_array + timestamps + stake percentages

That immediately unlocks management behaviors:

  • build exposure gradually (instead of all-in)
  • reduce exposure gradually
  • represent inventory as something continuous (not binary)

In the first iteration, the code defaults to a fixed stake size per fill (10% in the step logic), but the design clearly points to a tunable “stack size” approach.
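
As a rough sketch (the names here are mine, not the exact attributes in bitmex_management_env.py), the per-side bookkeeping looks something like this:

```python
import numpy as np

STAKE_PCT_PER_FILL = 0.10  # fixed 10% stake per fill, as in the first iteration


class PositionStack:
    """Tracks multiple partial entries on one side (longs or shorts)."""

    def __init__(self):
        self.prices = []       # fill prices
        self.timestamps = []   # fill times
        self.stakes = []       # stake per fill, as a fraction of equity

    def add_fill(self, price, timestamp, stake=STAKE_PCT_PER_FILL):
        self.prices.append(price)
        self.timestamps.append(timestamp)
        self.stakes.append(stake)

    @property
    def size(self):
        # Exposure is the sum of partial stakes, not a 0/1 flag.
        return sum(self.stakes)

    def unrealised_pct(self, mark_price, is_long=True):
        # Stake-weighted unrealised return across all partial fills.
        if not self.prices:
            return 0.0
        rets = [(mark_price - p) / p if is_long else (p - mark_price) / p
                for p in self.prices]
        return float(np.average(rets, weights=self.stakes))
```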

You can even see the evaluation harness experimenting with the “management mode” idea (e.g. smaller stacks + higher open limits) in OpenAI/baselines/ppo2_mgt_back_test.py.

In hindsight, this was the exact bridge I needed between research and Chappie:
  • in research, you want actions you can learn
  • in production, you want actions you can execute safely
Stacking gives you both.

2) The observation includes inventory + time

The observation vector becomes:

  • microstructure features (the existing feature pipeline)
  • plus auxiliary management state appended at the end

In the env, that auxiliary state is explicitly modeled via keys like:

  • LONG_STATE_AUX_KEY / SHORT_STATE_AUX_KEY
  • POSITION_SIZE_AUX_KEY
  • UNREALISED_PCT_AUX_KEY
  • AVG_OPEN_POSITION_HOURS_AUX_KEY
  • LAST_OPEN_POSITION_HOURS_AUX_KEY
  • HOURS_CLOSED_POSITION_AUX_KEY
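
As a sketch of what that means in practice (the argument names below are mine; the env defines the real layout via those *_AUX_KEY constants), the observation is just the feature row with the aux values concatenated at the end:

```python
import numpy as np

def build_observation(feature_row,
                      long_state, short_state,   # flags: does that side have open fills?
                      position_size,             # total stake currently deployed
                      unrealised_pct,            # stake-weighted unrealised return
                      avg_open_hours,            # mean age of open fills
                      last_open_hours,           # age of the most recent fill
                      hours_closed):             # time spent flat since the last close
    # Auxiliary management state appended after the microstructure features.
    aux = np.array([long_state, short_state,
                    position_size, unrealised_pct,
                    avg_open_hours, last_open_hours, hours_closed],
                   dtype=np.float32)
    return np.concatenate([np.asarray(feature_row, dtype=np.float32), aux])
```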

This is the difference between:

  • “react to a signal”
  • and “react to a signal while carrying risk”

That one change reduces a huge class of reward hacks.

If the agent can’t see its inventory state, it often learns stupid things:

  • churn trades to farm noise
  • hold losers forever because the reward is delayed
  • open positions without ever learning what it means to be exposed

3) The action space becomes “manage”, not “bet”

The main loop uses a small discrete action set:

  • 0 — hold
  • 1 — toggle/open/close a long maker order
  • 2 — toggle/open/close a short maker order

And the key constraint is executability:

  • a long opens by placing at bid (maker-style)
  • a short opens by placing at ask (maker-style)

So the agent is not learning “hit market now”. It’s learning “place good orders and survive.”
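
One possible reading of that action branch, sketched with hypothetical helpers (place_maker_order, close_oldest, max_open_positions) standing in for the env's actual stack and order bookkeeping:

```python
HOLD, LONG_ORDER, SHORT_ORDER = 0, 1, 2

def apply_action(env, action, best_bid, best_ask):
    if action == HOLD:
        return
    if action == LONG_ORDER:
        if env.shorts:                                      # carrying shorts: reduce first
            env.close_oldest("short")
        elif len(env.longs) < env.max_open_positions:
            env.place_maker_order("long", price=best_bid)   # open by resting at the bid
    elif action == SHORT_ORDER:
        if env.longs:                                       # carrying longs: reduce first
            env.close_oldest("long")
        elif len(env.shorts) < env.max_open_positions:
            env.place_maker_order("short", price=best_ask)  # open by resting at the ask
```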


Reward shaping: penalties, constraints, and the first honest fixes

October is where reward shaping stopped being a hack and started being engineering.

The reward is not “profit, done.” It’s a bundle of terms designed to teach a specific behavior:

Base reward: unrealized PnL scaled by exposure

At each step, the env computes current unrealized percentage and scales it by the current position size.

That is the first “risk-aware” decision:

  • small position, small reward
  • big position, big reward

So the agent can’t pretend size doesn’t exist.
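
In rough pseudocode, with position_size as the summed stake fraction across open fills:

```python
def base_reward(unrealised_pct, position_size):
    # The same move is worth more (or costs more) when more stake is deployed.
    return unrealised_pct * position_size

# Example: +1.5% unrealised on a 0.30 stake vs a 0.10 stake
base_reward(0.015, 0.30)   # 0.0045
base_reward(0.015, 0.10)   # 0.0015
```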

Fees: maker/taker costs become first-class

This month continues the lesson from the maker-strategy work:

  • opening a position adds fee impact
  • the reward includes maker/taker fee percentages

So the agent can learn a surprising truth:

A strategy can have negative price PnL and still be net profitable once fees and rebates are included.

This wasn’t theoretical — the fee breakdown logs made it impossible to ignore.
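
A tiny worked example, using roughly the BitMEX maker rebate and taker fee of the time (illustrative numbers, not values read from the env config):

```python
MAKER_FEE = -0.00025   # negative = rebate per fill
TAKER_FEE = 0.00075    # taker fee per fill

def net_return(gross_pct, entry_fee, exit_fee):
    return gross_pct - entry_fee - exit_fee

# A trade that loses 0.03% gross but enters and exits as a maker nets +0.02%...
net_return(-0.0003, MAKER_FEE, MAKER_FEE)   # +0.0002

# ...while the same trade done as a taker on both sides loses 0.18%.
net_return(-0.0003, TAKER_FEE, TAKER_FEE)   # -0.0018
```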

Penalties: time-based discipline

The environment introduces time-based punishments:

  • a penalty for holding positions longer than a threshold (OPEN_POSITION_HOURS)
  • a penalty for being flat too long (CLOSE_POSITION_HOURS)
  • a terminal penalty when the environment hits a final state

The goal isn’t “punish the agent”.

The goal is to prevent classic failure modes:

  • infinite hold (never close losers)
  • do nothing (avoid risk to avoid loss)
  • stalling (stop taking actions because the reward is sparse)

The subtle trick: punishment suppression

A really practical detail is that penalties are not applied at full strength immediately.

The code introduces:

  • PUNISHMENT_SUPPRESSION_RATE
  • PUNISHMENT_SUPPRESSION_CYCLES
  • PUNISHMENT_SCALING

Translation:

  • early training: let the policy explore without drowning in negatives
  • later training: gradually turn on “realistic discipline”

That’s not cheating.

That’s curriculum learning for risk constraints.
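
Here's a sketch of how such a ramp can work. The env exposes those three constants; the values and the exact schedule below are my illustration, not the env's actual formula:

```python
PUNISHMENT_SUPPRESSION_RATE = 0.9      # fraction of the penalty suppressed at the start
PUNISHMENT_SUPPRESSION_CYCLES = 50     # episodes over which suppression fades out
PUNISHMENT_SCALING = 1.0               # full-strength penalty multiplier

def penalty_weight(episode_idx):
    if episode_idx >= PUNISHMENT_SUPPRESSION_CYCLES:
        return PUNISHMENT_SCALING
    # Early episodes apply only a fraction of the penalty, so exploration
    # isn't drowned in negatives before the policy has learned anything.
    progress = episode_idx / PUNISHMENT_SUPPRESSION_CYCLES
    return PUNISHMENT_SCALING * (1.0 - PUNISHMENT_SUPPRESSION_RATE * (1.0 - progress))

def shaped_penalty(raw_penalty, episode_idx):
    return penalty_weight(episode_idx) * raw_penalty
```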


Episode design: teaching across regimes

The environment keeps the “spawn in the middle of history” idea:

  • random starting point
  • optional random initialization action

And it upgrades episode control:

  • episodes can be step-count based, with variance
  • if use_episode_based_time is enabled, the env draws a step budget from a normal distribution around MEAN_STEPS_PER_EPISODE
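
In rough code, with placeholder numbers for the mean and an assumed standard deviation:

```python
import numpy as np

MEAN_STEPS_PER_EPISODE = 2000   # placeholder value
STEPS_STD = 400                 # assumed spread, not the env's actual constant

def draw_episode_steps(rng=np.random):
    # Step budget sampled around the mean, with a guard against degenerate draws.
    steps = int(rng.normal(MEAN_STEPS_PER_EPISODE, STEPS_STD))
    return max(1, steps)
```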

The practical motivation is simple:

  • short episodes make debugging possible
  • varied episodes reduce overfitting to a single horizon

But the real reason is regime exposure:

If you only ever train on one contiguous run, you often train on one regime.

Randomized slices aren’t a hack here — they’re a way to force the agent to see more worlds.

One of the best design choices in this whole project was the ability to start the agent already holding risk.

If the env forces a random initial position, the agent can’t learn “perfect entry only”. It must learn “manage what you already have”.


Walk-forward discipline, now with risk-aware validation

The biggest practical change wasn’t a new neural network.

It was evaluation discipline:

  • train on walk-forward slices
  • validate on declining regime slices
  • inspect behavior, not just totals

In code, the “reporting harness” style lives in scripts like:

  • OpenAI/baselines/ppo2_mgt_back_test.py

That file is a mess in the best way: it shows how the system was actually used.

It builds an execution-like loop:

  • reconstructs position stacks
  • rolls maker orders forward as the book moves
  • logs trade rollovers and summaries

This is where “my agent made money” turns into:

  • how many trades?
  • how much drawdown?
  • how much fee impact?
  • what did it do under a regime change?

Architecture note: why this was the first “risk-aware agent”

I’m calling it “risk-aware” for a very specific reason:

  • the agent can see exposure
  • the agent can control exposure
  • the reward makes exposure matter

Without all three, you’re still training a gambler.

With all three, you can start training something closer to a trader.


Resources

bitmex-deeprl-research (GitHub)

The repo that contains the environments, training scripts, and the “Chappie” live loop.

bitmex-management-gym environment file

Main artifact for this post: bitmex_management_env.py.


What’s next

The next post is where the evaluation loop becomes stricter again:

Batch Training & Evaluation Again

Because once you can manage risk, you can finally ask the adult question:

do these results survive scrutiny, or are they just another backtest lie?
