
After months of "all-in" agents with bull personalities, I rebuilt the environment to teach risk: stackable positions, time-awareness, and penalties that prevent reward-hacking.
Axel Domingues
By October 2020 I’d already learned the painful lesson: an RL trading agent without position management is basically a coin flip with strong opinions.
The early Gym baselines could open a position, ride it, close it — but they couldn’t size, they couldn’t de-risk, and they couldn’t recover from being wrong without paying the full price.
And after June 2020, I was hyper-aware of a second trap: regime bias.
My training data (and even my “validation” slices) were still overweight bull conditions, so the agents I trained had a bull personality. They looked great… right up until the market stopped behaving like the training distribution.
So I built the next environment as a deliberate upgrade path:
This post documents the first version: bitmex-management-gym.
In earlier months, most of the agent’s intelligence was forced into a single question:
should I flip long, flip short, or do nothing?
But live execution doesn’t work like that — especially on BitMEX: maker orders only fill if price comes to you, fees and rebates decide whether an edge survives, and position size decides how much being wrong costs.
So the environment’s job changed.
Not “predict the next move”.
Manage risk while the future is uncertain.
bitmex_management_env.py
The core artifact for this month lives here:
bitmex-management-gym/gym_bitmex_management/envs/bitmex_management_env.py
It keeps the same foundation (precomputed microstructure features + sliced episodes), but adds two missing capabilities:
Instead of “one position open or not”, the environment tracks arrays of trades:
open_longs_price_array + timestamps + stake percentages
open_shorts_price_array + timestamps + stake percentages
That immediately unlocks management behaviors: stacking into a position, de-risking part of it, and sizing exposure instead of going all-in or nothing.
In the first iteration, the code defaults to a fixed stake size per fill (10% in the step logic), but the design clearly points to a tunable “stack size” approach.
You can even see the evaluation harness experimenting with the “management mode” idea (e.g. smaller stacks + higher open limits) in OpenAI/baselines/ppo2_mgt_back_test.py.
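For intuition, here's a minimal sketch of that bookkeeping (the class and field names are illustrative, not the repo's actual structures): each fill appends a price, a timestamp, and a stake percentage, and total exposure and average entry fall straight out of the arrays.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PositionStack:
    """Tracks stacked fills on one side (longs or shorts) as parallel arrays."""
    prices: List[float] = field(default_factory=list)
    opened_at: List[float] = field(default_factory=list)   # epoch seconds
    stakes: List[float] = field(default_factory=list)      # fraction of equity per fill

    def add(self, price: float, timestamp: float, stake: float = 0.10) -> None:
        # Default 10% stake per fill, mirroring the fixed stack size in the first iteration.
        self.prices.append(price)
        self.opened_at.append(timestamp)
        self.stakes.append(stake)

    @property
    def size(self) -> float:
        """Total exposure as a fraction of equity."""
        return sum(self.stakes)

    @property
    def avg_entry(self) -> float:
        """Stake-weighted average entry price (0 if flat)."""
        if not self.stakes:
            return 0.0
        return sum(p * s for p, s in zip(self.prices, self.stakes)) / self.size

# Example: two stacked long fills -> averaged-down entry and 20% total exposure.
longs = PositionStack()
longs.add(price=10_500.0, timestamp=1_600_000_000)
longs.add(price=10_300.0, timestamp=1_600_003_600)
print(longs.size, longs.avg_entry)   # 0.2, 10400.0
```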
The observation vector becomes: the market features plus an auxiliary description of the agent's own position state.
In the env, that auxiliary state is explicitly modeled via keys like:
LONG_STATE_AUX_KEY / SHORT_STATE_AUX_KEY
POSITION_SIZE_AUX_KEY
UNREALISED_PCT_AUX_KEY
AVG_OPEN_POSITION_HOURS_AUX_KEY
LAST_OPEN_POSITION_HOURS_AUX_KEY
HOURS_CLOSED_POSITION_AUX_KEY
This is the difference between an agent that only sees the market and one that also sees its own inventory.
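A rough sketch of the idea, with a hypothetical helper and a 64-feature market window standing in for the real feature pipeline: the observation handed to the policy is the market features concatenated with the agent's own inventory state.

```python
import numpy as np

# Hypothetical aux-state layout mirroring the keys above; the real env builds
# this inside its own observation logic.
def build_observation(market_features: np.ndarray,
                      long_open: bool, short_open: bool,
                      position_size: float, unrealised_pct: float,
                      avg_open_hours: float, last_open_hours: float,
                      hours_flat: float) -> np.ndarray:
    aux = np.array([
        float(long_open),      # LONG_STATE_AUX_KEY
        float(short_open),     # SHORT_STATE_AUX_KEY
        position_size,         # POSITION_SIZE_AUX_KEY
        unrealised_pct,        # UNREALISED_PCT_AUX_KEY
        avg_open_hours,        # AVG_OPEN_POSITION_HOURS_AUX_KEY
        last_open_hours,       # LAST_OPEN_POSITION_HOURS_AUX_KEY
        hours_flat,            # HOURS_CLOSED_POSITION_AUX_KEY
    ], dtype=np.float32)
    # Market microstructure features + the agent's own inventory state.
    return np.concatenate([market_features.astype(np.float32), aux])

obs = build_observation(np.random.randn(64), True, False, 0.2, -0.013, 5.5, 2.0, 0.0)
print(obs.shape)   # (71,)
```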
That one change reduces a huge class of reward hacks.
If the agent can’t see its inventory state, it often learns stupid things:
The main loop uses a small discrete action set:
0 — hold
1 — toggle/open/close a long maker order
2 — toggle/open/close a short maker order
And the key constraint is executability: these are maker orders, so they only count if the market actually comes to them.
So the agent is not learning “hit market now”. It’s learning “place good orders and survive.”
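To make "executability" concrete, here's a small sketch under my own simplifying assumptions (a bar's low/high standing in for the real tick-level fill check): a resting maker order only fills if price actually trades through it.

```python
import gym

HOLD, TOGGLE_LONG, TOGGLE_SHORT = 0, 1, 2
action_space = gym.spaces.Discrete(3)

def maker_order_fills(order_price: float, side: str, bar_low: float, bar_high: float) -> bool:
    """A resting maker order only counts if price actually trades through it."""
    if side == "long":                    # buy limit below the market fills when price dips to it
        return bar_low <= order_price
    return bar_high >= order_price        # sell limit above the market fills when price rises to it

# A long resting at 10_450 fills on a bar that traded down to 10_440; not on one that held 10_470.
print(maker_order_fills(10_450.0, "long", bar_low=10_440.0, bar_high=10_520.0))   # True
print(maker_order_fills(10_450.0, "long", bar_low=10_470.0, bar_high=10_520.0))   # False
```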
October is where reward shaping stopped being a hack and started being engineering.
The reward is not “profit, done.” It’s a bundle of terms designed to teach a specific behavior:
At each step, the env computes current unrealized percentage and scales it by the current position size.
That is the first “risk-aware” decision: being wrong in size hurts more than being wrong small.
So the agent can’t pretend size doesn’t exist.
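The exact shaping constants live in the env; the core term, though, reduces to something like this sketch: the unrealized move in percent, multiplied by exposure, so the same move matters more when the position is bigger.

```python
def unrealised_reward(avg_entry: float, mark_price: float,
                      position_size: float, is_long: bool) -> float:
    """Unrealized move in percent, scaled by exposure (fraction of equity)."""
    if position_size == 0.0 or avg_entry == 0.0:
        return 0.0
    move_pct = (mark_price - avg_entry) / avg_entry
    if not is_long:
        move_pct = -move_pct
    return move_pct * position_size

# The same 2% adverse move costs twice as much at 20% exposure as at 10%.
print(unrealised_reward(10_000, 9_800, 0.10, True))   # -0.002
print(unrealised_reward(10_000, 9_800, 0.20, True))   # -0.004
```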
This month continues the lesson from the maker-strategy work: maker rebates and taker fees are part of the reward itself, not a correction applied after the fact.
So the agent can learn a surprising truth:
A strategy can have negative price PnL and still be profitable once fees/rebates are included.
This wasn’t theoretical — the fee breakdown logs made it impossible to ignore.
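As a sketch of why that's true, using roughly BitMEX's perpetual fee schedule from that era (a 0.025% maker rebate and a 0.075% taker fee; treat the exact numbers as an assumption):

```python
MAKER_FEE = -0.00025   # negative = rebate; roughly BitMEX's perpetual schedule at the time
TAKER_FEE = 0.00075

def net_pnl_pct(price_pnl_pct: float, position_size: float,
                entry_fee: float, exit_fee: float) -> float:
    """Price PnL plus fees/rebates, all expressed as fractions of equity."""
    fee_cost = (entry_fee + exit_fee) * position_size
    return price_pnl_pct * position_size - fee_cost

# A slightly losing price move still nets positive when both fills earn the maker rebate...
print(net_pnl_pct(-0.0002, 1.0, MAKER_FEE, MAKER_FEE))   #  0.0003
# ...while the same trade done with market orders loses on both counts.
print(net_pnl_pct(-0.0002, 1.0, TAKER_FEE, TAKER_FEE))   # -0.0017
```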
The environment introduces time-based punishments:
a penalty tied to how long a position has been held open (OPEN_POSITION_HOURS)
a penalty tied to how long the agent has sat flat without trading (CLOSE_POSITION_HOURS)
The goal isn’t “punish the agent”.
The goal is to prevent classic failure modes: holding a loser forever and hoping, or sitting flat forever and never taking a trade.
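Here's what an hours-based penalty can look like, as a sketch: the constant names are borrowed from the env, but the grace periods, rates, and linear shape are my assumptions.

```python
# Constant names come from the env; the grace periods, rates, and linear shape
# below are illustrative assumptions, not the repo's actual values.
OPEN_POSITION_HOURS = 24.0     # hours a position may stay open before the penalty kicks in
CLOSE_POSITION_HOURS = 12.0    # hours the agent may sit flat before the penalty kicks in
STALE_PENALTY_PER_HOUR = 0.0005
IDLE_PENALTY_PER_HOUR = 0.0002

def time_penalty(hours_open: float, hours_flat: float) -> float:
    """Punish both camping in a stale position and refusing to trade at all."""
    penalty = 0.0
    if hours_open > OPEN_POSITION_HOURS:
        penalty += (hours_open - OPEN_POSITION_HOURS) * STALE_PENALTY_PER_HOUR
    if hours_flat > CLOSE_POSITION_HOURS:
        penalty += (hours_flat - CLOSE_POSITION_HOURS) * IDLE_PENALTY_PER_HOUR
    return -penalty   # added to the step reward, i.e. subtracted from it
```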
A really practical detail is that penalties are not applied at full strength immediately.
The code introduces:
PUNISHMENT_SUPPRESSION_RATE
PUNISHMENT_SUPPRESSION_CYCLES
PUNISHMENT_SCALING
Translation: penalties start suppressed and ramp up to full strength over the first training cycles.
That’s not cheating.
That’s curriculum learning for risk constraints.
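One way to read those knobs (the exact mechanics are in the env; the linear ramp shape and the numbers below are assumptions): penalties get multiplied by a factor that starts near zero and grows to full strength over the first training cycles.

```python
# PUNISHMENT_SUPPRESSION_RATE / PUNISHMENT_SUPPRESSION_CYCLES / PUNISHMENT_SCALING
# are the env's knobs; this linear ramp and the values here are my assumption.
PUNISHMENT_SUPPRESSION_CYCLES = 50   # cycles over which penalties ramp up
PUNISHMENT_SCALING = 1.0             # full-strength penalty multiplier

def punishment_scale(training_cycle: int) -> float:
    """Curriculum for risk constraints: soft penalties early, full penalties later."""
    ramp = min(training_cycle / PUNISHMENT_SUPPRESSION_CYCLES, 1.0)
    return PUNISHMENT_SCALING * ramp

print(punishment_scale(5))    # 0.1 -> early training barely feels the constraints
print(punishment_scale(80))   # 1.0 -> later training pays full price
```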
The environment keeps the “spawn in the middle of history” idea: episodes start at a random offset inside the dataset instead of always at the beginning.
And it upgrades episode control:
When use_episode_based_time is enabled, the env draws a step budget from a normal distribution around MEAN_STEPS_PER_EPISODE.
The practical motivation is simple: you get many short, varied episodes instead of one long crawl through the dataset.
But the real reason is regime exposure:
If you only ever train on one contiguous run, you often train on one regime.
Randomized slices aren’t a hack here — they’re a way to force the agent to see more worlds.
If the env forces a random initial position, the agent can’t learn “perfect entry only”. It must learn “manage what you already have”.
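Putting the reset logic together as a sketch (the standard deviation, the clipping bounds, and the random-initial-position choice are illustrative assumptions):

```python
import numpy as np

MEAN_STEPS_PER_EPISODE = 2_000   # the env's constant; this value is illustrative
STD_STEPS_PER_EPISODE = 400      # spread is an assumption

def reset_episode(n_rows: int, rng: np.random.Generator,
                  use_episode_based_time: bool = True,
                  force_random_position: bool = True):
    """Pick a random slice of history and (optionally) a position the agent must manage."""
    if use_episode_based_time:
        steps = int(np.clip(rng.normal(MEAN_STEPS_PER_EPISODE, STD_STEPS_PER_EPISODE),
                            100, n_rows - 1))
    else:
        steps = n_rows - 1
    start = int(rng.integers(0, n_rows - steps))     # spawn in the middle of history
    side = rng.choice(["flat", "long", "short"]) if force_random_position else "flat"
    return start, steps, side

rng = np.random.default_rng(7)
print(reset_episode(500_000, rng))   # (start_index, step_budget, initial_side)
```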
The biggest practical change wasn’t a new neural network.
It was evaluation discipline:
In code, the “reporting harness” style lives in scripts like:
OpenAI/baselines/ppo2_mgt_back_test.py
That file is a mess in the best way: it shows how the system was actually used.
It builds an execution-like loop: load a trained model, step it through a held-out slice under the same fill rules, and log every fill, fee, and equity change along the way.
This is where “my agent made money” turns into “here is exactly how, when, and at what cost it made it”.
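The harness itself is tangled up with the baselines codebase, so here is only the shape of the loop; this sketch assumes a generic gym-style step() and hypothetical info keys like "equity", "filled", and "fee".

```python
# Hypothetical info keys ("equity", "filled", "fill_price", "fee") stand in for
# whatever the real env reports; the loop shape is the point, not the field names.
def back_test(env, policy, log):
    obs, done = env.reset(), False
    equity_curve, fills = [], []
    while not done:
        action = policy(obs)                     # a trained model's action for this step
        obs, reward, done, info = env.step(action)
        equity_curve.append(info.get("equity", 0.0))
        if info.get("filled"):                   # record every fill with its fee/rebate
            fills.append((info["fill_price"], info["fee"]))
        log(reward, info)                        # per-step reporting, not just a final number
    return equity_curve, fills
```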
I’m calling it “risk-aware” for a very specific reason: the agent can size its exposure (stacked positions), it can see its own inventory (the auxiliary state), and it pays for time and risk (the shaped penalties).
Without all three, you’re still training a gambler.
With all three, you can start training something closer to a trader.
Did this fix the bull-personality problem? Not by itself. It gave the agent the tools to behave better under regime shifts, but the real fix required discipline: balanced walk-forward slices and validation that deliberately included downtrends.
Reward shaping is only “honest” if it teaches a behavior you can explain and defend. If you can’t describe what the penalty is preventing, it’s probably just a tuning hack.
The next post is where the evaluation loop becomes stricter again:
Batch Training & Evaluation Again
Because once you can manage risk, you can finally ask the adult question:
do these results survive scrutiny, or are they just another backtest lie?