Jul 26, 2020 - 12 MIN READ

Constraints That Teach: Risk Caps, Timeouts, and Surviving Bad Regimes

After my first disappointing live runs, I stopped asking my agent to be clever and started forcing it to be safe: risk caps, timeouts, and “market-health” gates that kept the loop alive when the regime wasn’t.

Axel Domingues

June 2020 hurt (in a good way).

Backtests were amazing. Validation looked clean.

Then I switched on the live loop with small size… and watched my “confident” agent behave like it had a personality: bullish, eager to be long, and confused when the market stopped rewarding that attitude.

So July became the month I stopped treating risk controls as “production hardening” and started treating them as part of the learning problem.

If the environment and the live loop don’t enforce reality, the agent won’t learn it.


The mental shift: constraints aren’t a band-aid

In the 2018 RL posts, constraints felt like something you bolt on after the algorithm works.

Trading flipped that on me.

In trading, constraints are how you define the task.

  • “Make money” is not a task.
  • “Make money without blowing up” is closer.
  • “Make money while respecting exposure, time, outages, and regime shifts” is the actual job.

So this month I started writing constraints in two places:

  1. Inside the Gym (what gets rewarded/punished, and when an episode ends)
  2. Inside the live loop (what is allowed to happen in production, even if the policy tries)

The theme: constraints that teach, not constraints that merely block.


Constraint #1: risk caps (make “position size” a controlled ramp)

One big reason the live results felt bad is that the baseline setup was too “binary”:

  • enter
  • all-in
  • hope

That creates a brittle agent. When the regime shifts, it doesn’t degrade gracefully — it just keeps expressing the same bias.

So I started moving toward management-style behavior:

  • position size becomes a ramp
  • exposure grows in increments
  • and there is always a cap

In the live client, you can see this as “stacked” increments:

# BitmexPythonChappie/BitMEXBotClient.py
stack_size = 0.02
self.stack_sizes = [stack_size] * 7   # seven equal 2% increments
self.trade_amt_multiplier = 0.5       # size control: halves every increment

And then, during execution:

# BitmexPythonChappie/BitMEXBotClient.py
if self.current_position_size < 1.0:                     # hard exposure cap
    stack_size = self.stack_sizes[len(self.trades) - 1]  # next increment in the ramp
    amt = amt * stack_size                               # scale order down to a 2% chunk
    amt = amt * self.trade_amt_multiplier                # then apply the global size control

That logic does two important things:

  • Caps exposure (current_position_size < 1.0)
  • Makes exposure incremental (2% chunks + multiplier), which is how you survive uncertainty

Risk caps are not only about safety. They change the learning dynamics because they make “being wrong” less terminal.
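
To make the arithmetic concrete, here is a minimal self-contained sketch of the same ramp (the helper and its names are mine, not the repo’s): each increment is base_amt × 0.02 × 0.5, i.e. 1% of base_amt, and the ramp refuses to add once the cap is reached.

# Hypothetical helper illustrating the ramp; not the repo's code.
def next_order_size(base_amt, stack_sizes, multiplier, n_open_trades,
                    position_size, cap=1.0):
    """Return the next exposure increment, or 0.0 once capped out."""
    if position_size >= cap or n_open_trades >= len(stack_sizes):
        return 0.0  # the cap always wins, no matter what the policy wants
    return base_amt * stack_sizes[n_open_trades] * multiplier

sizes = [0.02] * 7
print(next_order_size(100.0, sizes, 0.5, n_open_trades=0, position_size=0.0))
# -> 1.0 (each chunk is 1% of base_amt; seven chunks max)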

Constraint #2: timeouts (teach the agent that time is a cost)

The simplest form of regime fragility is this:

the agent enters a position that used to work, then gets stuck holding it while the market slowly bleeds.

So I started encoding “time is not free” directly into the Gym.

In bitmex_env.py, the environment defines explicit time-based punishments and triggers:

# bitmex-gym/gym_bitmex/envs/bitmex_env.py
LENGHTY_POSITION_HOURS = 0.3                 # punish holding too long
TRIGGER_CLOSE_POSITION_HOURS = 2             # force close eventually
IGNORE_UNREALISED_POSITIVE_REWARD_HOURS = 1.5
USE_UNREALISED_REWARD_MULTIPLIER_HOURS = 0.75

This is what I mean by constraints that teach:

  • you can hold, but it becomes less attractive over time
  • you can ride a winner, but unrealised profit stops being “free dopamine”
  • if you refuse to close, the environment stops negotiating

The end goal wasn’t to “punish risk”.

It was to teach the agent that time-in-position is an input, not an accident.
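
Put together, the shaping might look something like this sketch (my reconstruction, reusing the thresholds above; UNREALISED_MULTIPLIER is an assumed value and the repo’s exact logic differs):

# Illustrative reconstruction, not the repo's exact shaping.
UNREALISED_MULTIPLIER = 0.5  # assumed value for this sketch

def time_shaped_reward(realised_pnl, unrealised_pnl, hours_held):
    reward = realised_pnl
    if hours_held < USE_UNREALISED_REWARD_MULTIPLIER_HOURS:
        reward += unrealised_pnl                          # early: counts fully
    elif hours_held < IGNORE_UNREALISED_POSITIVE_REWARD_HOURS:
        reward += unrealised_pnl * UNREALISED_MULTIPLIER  # then: discounted
    elif unrealised_pnl < 0:
        reward += unrealised_pnl                          # later: only losses still count
    if hours_held > LENGHTY_POSITION_HOURS:
        reward -= 0.01 * (hours_held - LENGHTY_POSITION_HOURS)  # holding tax (assumed scale)
    return reward

def must_force_close(hours_held):
    return hours_held >= TRIGGER_CLOSE_POSITION_HOURS     # the env stops negotiating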

Timeouts are dangerous if they’re arbitrary.

If you force-close too aggressively, you train “panic exits” and destroy any chance of trend-following.

So I treated these thresholds as tunable, not sacred.

Constraint #3: episode design (randomize the situation, not the rules)

I kept episodes short on purpose — not because it was a cheat, but because it makes the agent see more situations.

Two key pieces here:

1) Episode length is step-based (with variance)

In the environment:

# bitmex-gym/gym_bitmex/envs/bitmex_env.py
STEP_SKIP = 2000                              # raw rows consumed per agent step
MEAN_STEPS_PER_EPISODE = 216000 / STEP_SKIP   # ≈ 1 hour of market time
STD_STEPS_PER_EPISODE = 57600 / STEP_SKIP
MIN_STEPS_PER_EPISODE = 72000 / STEP_SKIP     # ≈ 20 minutes

if self.use_episode_based_time:
    self.current_limit_steps_per_episode = int(np.random.normal(
        bitmexEnv.MEAN_STEPS_PER_EPISODE,
        bitmexEnv.STD_STEPS_PER_EPISODE
    ))

Conceptually:

  • mean episode length ≈ 1 hour of market time
  • minimum episode length ≈ 20 minutes
  • but every episode is slightly different
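
One caveat worth making explicit: a raw normal draw can land below the floor, so in my own sketch of the draw I clamp to the minimum (the max() clamp is my addition; treat it as an assumption about how the env should behave):

# Episode-length draw with the floor enforced (the clamp is mine).
import numpy as np

STEP_SKIP = 2000
MEAN_STEPS = 216000 / STEP_SKIP   # ≈ 1 hour of market time
STD_STEPS = 57600 / STEP_SKIP
MIN_STEPS = 72000 / STEP_SKIP     # ≈ 20 minutes

def sample_episode_limit(rng=np.random):
    steps = int(rng.normal(MEAN_STEPS, STD_STEPS))
    return max(steps, int(MIN_STEPS))  # never shorter than the floor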

2) Step skipping is a hyperparameter (not randomness)

STEP_SKIP / step_skip is simply a knob.

It controls how often the agent can act, and how quickly reward can propagate.

I treated it like any other tuning variable: stability vs realism vs reactivity.
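
In shape, the loop looks something like this (illustrative only; data_iterator and build_observation are stand-ins, not repo names):

# Illustrative only: the shape of a step-skip loop.
def skip_forward(data_iterator, step_skip, build_observation):
    row = None
    for _ in range(step_skip):
        row = next(data_iterator)     # consume step_skip raw rows per agent action
    return build_observation(row)     # the agent only observes (and acts) at this cadence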

3) The environment can spawn you into a position

At reset, the env can open a random long/short (or none), so the agent must learn to manage already being in trouble:

# bitmex-gym/gym_bitmex/envs/bitmex_env.py
init_action = random.randint(0, 2)               # 0 = none, 1 = long, 2 = short
new_obs_state, _, _, _ = self.step(init_action)  # spawn the position before the agent acts

That idea aged well.

It forced diversity in the state distribution and prevented the agent from only learning “clean entries”.

This is one of the best ideas in the whole project.

Random spawn isn’t about exploiting resets — it’s about forcing the agent to practice recovering from imperfect starts.


Constraint #4: market-health gates (don’t trade blind)

A major lesson from BitMEX wasn’t “fees” or “slippage”.

It was: sometimes your market feed is lying or dead.

So the live loop had to learn a new rule:

if the market data is unhealthy, the correct action is do nothing.

In the Websocket client, I had explicit logic for “quote sanity” and “connection sanity” (condensed below; _fault() stands in for the client’s fault handling):

  • if quotes haven’t updated for too long → treat as fault
  • if pings don’t get pongs → treat as fault

# BitmexPythonChappie/BitMEXWebsocketClient.py (condensed; handler bodies sketched)
PONG_TIMEOUT = 5           # seconds without a pong before the link is suspect
MAX_TIME_NO_QUOTES = 20    # seconds without a quote update before the feed is suspect

def _on_message(self, message):
    # ...update quote book...
    self.last_quote = datetime.utcnow()    # remember when the feed last spoke

def _on_pong(self, *args):
    self.last_pong = datetime.utcnow()

# periodic health check, run alongside the trading loop
def _check_health(self):
    now = datetime.utcnow()
    if (now - self.last_quote).total_seconds() > MAX_TIME_NO_QUOTES:
        self._fault("stale quotes")
    if (now - self.last_pong).total_seconds() > PONG_TIMEOUT:
        self._fault("no pong")

And once you have “faulted”, you can enforce the hard constraint:

  • cancel open orders
  • stop sending new ones
  • reconcile state before resuming
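
As a sketch, the fault path ties those three steps together (method names are stand-ins for the real client’s):

# Sketch of the fault path; cancel_all_orders() and the reconcile step are
# stand-in names, not the repo's.
def _fault(self, reason):
    self.faulted = True
    self.logger.warning("market-health fault: %s", reason)
    self.cancel_all_orders()    # 1) cancel everything resting on the book
    # 2) order submission checks self.faulted and refuses while it is set
    # 3) on recovery: reconcile positions/orders against the exchange, then clear the flag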

This is how you survive bad regimes and bad infrastructure.


The rule I wrote on my wall

By July 2020, my default assumption changed:

If the agent looks amazing in backtest and dumb in live, it’s usually not “because RL is hard”.

It’s because I trained a personality on an unbalanced world.

In my case:

  • training + validation were heavy on the bull regime (Dec 2018 → Jun 2019)
  • the market shifted into a slow decline (Jun 2019 → Apr 2020)
  • my agent kept expressing “bull instincts”

Constraints were my way of saying:

  • you don’t get to be all-in by default
  • you don’t get to hold forever
  • you don’t get to trade when the feed is unhealthy

And that set up the next design move: architecture as stability.


What I changed in my debugging checklist

Add a constraint, then run a “behavior audit”

Don’t trust reward curves. Watch what the agent actually does under the new rule.

Separate “learning constraints” from “execution constraints”

If it must always hold in production (risk cap), enforce it in the live loop too.

Treat timeouts as tunable

If you can’t explain why a timeout is 2 hours and not 20 minutes, it’s probably wrong.

Build a “market-health” state

If data is stale, the best policy is often: flat + cancel + wait.


Resources

bitmex_env.py (Gym environment)

The place where timeouts, reward multipliers, and episode mechanics become “the contract”.

BitMEXWebsocketClient.py (feed health)

Where stale quotes and missing pongs become first-class failure signals.

BitMEXBotClient.py (risk caps)

Where “stack sizes” and exposure ramps turn into a real risk model.

Repository

The full research rig: data → models → gym → live loop.


What’s next

Next month is where I stop thinking “algorithm tuning” and start thinking representation + stability:

Deep Silos in RL - Architecture as Stability

Because by this point, I’d learned that if your agent collapses under regime shift, you don’t just need better rewards — you need a better shape of model.

Axel Domingues - 2026