
After my first disappointing live runs, I stopped asking my agent to be clever and started forcing it to be safe: risk caps, timeouts, and “market-health” gates that kept the loop alive when the regime wasn’t.
Axel Domingues
June 2020 hurt (in a good way).
Backtests were amazing. Validation looked clean.
Then I switched on the live loop with small size… and watched my “confident” agent behave like it had a personality: bullish, eager to be long, and confused when the market stopped rewarding that attitude.
So July became the month I stopped treating risk controls as “production hardening” and started treating them as part of the learning problem.
If the environment and the live loop don’t enforce reality, the agent won’t learn it.
In the 2018 RL posts, constraints felt like something you bolt on after the algorithm works.
Trading flipped that on me.
In trading, constraints are how you define the task.
So this month I started writing constraints in two places: in the Gym environment (rewards, timeouts, and episode mechanics) and in the live trading loop (position caps, data-health gates, and fault handling).
The theme: constraints that teach, not constraints that merely block.
One big reason the live results felt bad is that the baseline setup was too “binary”: the agent was either fully in a position or fully out, with no way to scale in, scale out, or express low confidence through size.
That creates a brittle agent. When regime shifts, it doesn’t degrade gracefully — it just keeps expressing the same bias.
So I started moving toward management-style behavior: build positions in small increments, cap total exposure, and let sizing carry part of the decision instead of direction alone.
In the live client, you can see this as “stacked” increments:
# BitmexPythonChappie/BitMEXBotClient.py
stack_size = 0.02
self.stack_sizes = [stack_size, stack_size, stack_size, stack_size, stack_size, stack_size, stack_size]
self.trade_amt_multiplier = 0.5 # size control
And then, during execution:
# BitmexPythonChappie/BitMEXBotClient.py
if self.current_position_size < 1.0:  # only add while exposure is below the cap
    stack_size = self.stack_sizes[len(self.trades) - 1]
    amt = amt * stack_size
    amt = amt * self.trade_amt_multiplier
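To make the sizing concrete: with stack_size = 0.02 and trade_amt_multiplier = 0.5, each stacked entry comes out to amt * 0.01, i.e. roughly 1% of the base trade amount per increment, so even a full stack of seven entries stays far below what a single all-in trade would have been.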
That logic does two important things: it scales each new entry down by a small stack size (on top of the global multiplier), so the agent builds positions in increments instead of all at once; and it caps total exposure, because nothing new gets stacked once the check on current_position_size < 1.0 stops passing.
The simplest form of regime fragility is this:
the agent enters a position that used to work, then gets stuck holding it while the market slowly bleeds.
So I started encoding “time is not free” directly into the Gym.
In bitmex_env.py, the environment defines explicit time-based punishments and triggers:
# bitmex-gym/gym_bitmex/envs/bitmex_env.py
LENGHTY_POSITION_HOURS = 0.3                   # punish holding too long
TRIGGER_CLOSE_POSITION_HOURS = 2               # force close eventually
IGNORE_UNREALISED_POSITIVE_REWARD_HOURS = 1.5  # stop counting unrealised gains as reward
USE_UNREALISED_REWARD_MULTIPLIER_HOURS = 0.75  # start discounting unrealised reward
This is what I mean by constraints that teach: holding gets penalised once it drags past LENGHTY_POSITION_HOURS, unrealised gains stop propping up the reward as the position ages, and past TRIGGER_CLOSE_POSITION_HOURS the environment simply closes the position for you.
The end goal wasn’t to “punish risk”.
It was to teach the agent that time-in-position is an input, not an accident.
So I treated these thresholds as tunable, not sacred. If you force-close too aggressively, you train “panic exits” and destroy any chance of trend-following.
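To show how thresholds like these can enter the reward path, here is a minimal sketch, assuming hypothetical names like hours_in_position and unrealised_pnl rather than the exact logic in bitmex_env.py:

# Sketch only - not the repo's exact code.
# `reward` is assumed to already include this step's unrealised PnL.
def shape_reward(self, reward, unrealised_pnl, hours_in_position):
    if hours_in_position > self.IGNORE_UNREALISED_POSITIVE_REWARD_HOURS:
        # old enough: unrealised gains stop counting as reward at all
        reward -= max(unrealised_pnl, 0.0)
    elif hours_in_position > self.USE_UNREALISED_REWARD_MULTIPLIER_HOURS:
        # getting old: unrealised gains are discounted rather than ignored
        reward -= 0.5 * max(unrealised_pnl, 0.0)
    if hours_in_position > self.LENGHTY_POSITION_HOURS:
        # holding too long costs a small penalty every step
        reward -= 0.01
    # and eventually the environment closes the position for you
    force_close = hours_in_position > self.TRIGGER_CLOSE_POSITION_HOURS
    return reward, force_close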
I kept episodes short on purpose — not because it was a cheat, but because it makes the agent see more situations.
Two key pieces here: randomised episode lengths, so the agent can't latch onto a single fixed horizon, and a step skip that controls how often it actually gets to act.
In the environment:
# bitmex-gym/gym_bitmex/envs/bitmex_env.py
STEP_SKIP = 2000
MEAN_STEPS_PER_EPISODE = 216000 / STEP_SKIP
STD_STEPS_PER_EPISODE = 57600 / STEP_SKIP
MIN_STEPS_PER_EPISODE = 72000 / STEP_SKIP

if self.use_episode_based_time:
    self.current_limit_steps_per_episode = int(np.random.normal(
        bitmexEnv.MEAN_STEPS_PER_EPISODE,
        bitmexEnv.STD_STEPS_PER_EPISODE
    ))
Conceptually:
STEP_SKIP / step_skip is simply a knob.
It controls how often the agent can act, and how quickly reward can propagate.
I treated it like any other tuning variable: stability vs realism vs reactivity.
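With the constants above, that works out to episodes of about 108 agent decisions on average (216000 / 2000), with a standard deviation of roughly 29 and a floor of 36, which is exactly the “see more situations” effect I was after.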
At reset, the env can open a random long/short (or none), so the agent must learn to manage already being in trouble:
# bitmex-gym/gym_bitmex/envs/bitmex_env.py
init_action = random.randint(0, 2) # none, long, short
new_obs_state, _, _, _ = self.step(init_action)
That idea aged well.
It forced diversity in the state distribution and prevented the agent from only learning “clean entries”.
Random spawn isn’t about exploiting resets — it’s about forcing the agent to practice recovering from imperfect starts.
A major lesson from BitMEX wasn’t “fees” or “slippage”.
It was: sometimes your market feed is lying or dead.
So the live loop had to learn a new rule:
if the market data is unhealthy, the correct action is do nothing.
In the Websocket client, I had explicit logic for “quote sanity” and “connection sanity”:
# BitmexPythonChappie/BitMEXWebsocketClient.py
PONG_TIMEOUT = 5          # max time without a pong before the connection is suspect
MAX_TIME_NO_QUOTES = 20   # max time without a fresh quote before the feed is suspect

def _on_message(...):
    # update quotes, update last quote time
    ...

def _on_pong(...):
    self.last_pong = datetime.utcnow()

# later: periodic check
# if (now - last_quote) > MAX_TIME_NO_QUOTES: fault
# if (now - last_pong) > PONG_TIMEOUT: fault
And once you have “faulted”, you can enforce the hard constraint: no new orders while the data is unhealthy, cancel what's resting, and wait until quotes and pongs come back before the agent gets another vote.
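As a minimal sketch of that gate in the live loop, assuming hypothetical helpers like is_faulted() and cancel_open_orders() rather than the client's exact methods:

# Sketch only - illustrative names, not the exact BitMEXBotClient.py API.
def live_step(self):
    if self.ws_client.is_faulted():
        # hard constraint: stale or dead data means cancel + wait
        # (going flat is also on the table, depending on the fault)
        self.cancel_open_orders()
        return  # no new decisions until the feed is healthy again
    action = self.agent.act(self.build_observation())
    self.execute(action)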
This is how you survive bad regimes and bad infrastructure.
By July 2020, my default assumption changed: when the live agent behaves badly, it's usually not because the algorithm is broken.
It’s because I trained a personality on an unbalanced world.
In my case: the training data was dominated by bullish regimes, so the agent learned to be long, stay long, and treat every dip as an opportunity.
Constraints were my way of saying: keep your opinion if you must, but you don't get to express it without size caps, time limits, and a healthy data feed.
And that set up the next design move: architecture as stability.
A few rules I wrote down for myself that month:
Don't trust reward curves. Watch what the agent actually does under the new rule.
If it must always hold in production (risk cap), enforce it in the live loop too.
If you can’t explain why a timeout is 2 hours and not 20 minutes, it’s probably wrong.
If data is stale, the best policy is often: flat + cancel + wait.
If you want to see where most of this lives, it's bitmex_env.py (the Gym environment): the place where timeouts, reward multipliers, and episode mechanics become “the contract”.
Isn't this just reward shaping?
It can be.
My rule became: shape with constraints, not with fantasies.
Constraints are shaping, but they are shaping toward reality.
Wouldn't better data have fixed this instead?
Yes, and that becomes a big theme later.
But in July, the urgent problem was: even with better data, a live trading system still needs caps, timeouts, and fault handling. Those aren’t optional.
The constraints didn't magically create alpha.
They did something more important first: they made the system survivable.
Once survivable, I could iterate without blowing up, and I could start designing architectures that don’t collapse under instability.
Next month is where I stop thinking “algorithm tuning” and start thinking representation + stability:
Deep Silos in RL - Architecture as Stability
Because by this point, I’d learned that if your agent collapses under regime shift, you don’t just need better rewards — you need a better shape of model.