
August 2020 - After the first live pain and the bull-personality problem, I stopped tuning "algorithms" and started tuning the network contract. Deep Silos beat flat MLPs, and the LSTM variant overfit fast.
Axel Domingues
In 2018 I was still in "RL exploration mode": run an algorithm, tune a few hyperparameters, celebrate when the curve goes up.
In 2019 BitMEX forced me to grow up: microstructure, fees, queue priority, partial fills, outages - the plumbing.
By August 2020 I finally had the right kind of problem: agents that traded confidently in the bull regimes they were raised on and fell apart outside them.
The root cause was not mysterious.
My data (and my evaluation) was bull-biased.
So the agents did what agents do: they developed a personality that fit the regime they saw most.
This post is about the lever that helped more than I expected:
Architecture as stability.
Not because architecture is magic, but because wiring is a contract: it controls what shortcuts the model can learn.
The code behind this post:
- baselines/common/models.py (Deep Silos + Deep Silos LSTM networks)
- BitmexPythonChappie/OrderBookMovePredictor.py (LSTM inference args)
- bitmex-gym/gym_bitmex/envs/bitmex_env.py (baseline env mechanics)
- bitmex-hft-gym/.../bitmex_hft_env.py (HFT detour + failure story)

The annoying part about regime overfitting is that it can look very scientific.
But if the validation distribution shares the same regime bias, you are effectively grading the agent on the same mood it trained on.
My big realization around this time: a held-out set that shares the training set's regime bias is not really a test of generalization.
By June/July 2020 I was already hyper-aware that my dataset contained more bull runs than anything else. So August became the month where I tried to make generalization harder to avoid.
If you feed a flat MLP a giant feature vector, you are implicitly telling it:
Any feature can talk to any other feature immediately.
In trading, that can be dangerous.
Feature families often have very different semantics - order-book state and the agent's own position, for example, are not the same kind of signal.
A flat MLP can invent "cross-feature hacks" early - shortcuts that exist only because your training dataset has a repeating structure.
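For contrast, the flat baseline here is essentially Baselines' stock mlp network - reproduced from memory and trimmed, so treat it as a sketch rather than a quote from the repo:

# baselines/common/models.py - the stock flat baseline (from memory, trimmed)
@register("mlp")
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
    def network_fn(X):
        h = tf.layers.flatten(X)
        for i in range(num_layers):
            # every input feature feeds every hidden unit from layer one onward
            h = fc(h, "mlp_fc{}".format(i), nh=num_hidden, init_scale=np.sqrt(2))
            h = activation(h)
        return h
    return network_fn

Every feature reaches every hidden unit from the first layer onward, which is exactly the freedom the silos take away.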
So I brought back an old supervised trick from 2019:
Deep Silos.
The Deep Silos idea is simple: split the input into feature families, give each family its own small MLP (a silo), and only let the silo outputs interact afterwards in a shared trunk.
In the repo this is implemented as a Baselines network in baselines/common/models.py under @register("deep_silos").
Here is the core pattern (trimmed, but faithful):
# baselines/common/models.py (trimmed)
@register("deep_silos")
def deep_silos_net(**net_kwargs):
    def network_fn(X, nenv=1):
        # silos_list groups input indices by feature family; it, input_total_features and
        # activation (tf.nn.selu in this project) are defined in the untrimmed file.
        silos_outputs = []
        silo_start_idx = 0
        for silo_number, silo_list in enumerate(silos_list):
            # one small MLP per family: each silo is squeezed down to a 2-unit embedding
            silo_input = tf.slice(X, [0, silo_start_idx], [-1, len(silo_list)])
            h = fc(silo_input, f"mlp_silo_fc{silo_number}", nh=32)
            h = activation(h)
            h = fc(h, f"mlp_silo_output{silo_number}", nh=2)
            h = activation(h)
            silos_outputs.append(h)
            silo_start_idx += len(silo_list)
        # features not assigned to any silo pass through unchanged
        remaining_input = tf.slice(X, [0, silo_start_idx], [-1, input_total_features - silo_start_idx])
        silos_outputs.append(remaining_input)
        # only here do the families get to interact: a shared trunk over the silo embeddings
        h = fc(tf.concat(silos_outputs, 1), "mlp_fc1", nh=64)
        h = activation(h)
        h = fc(h, "mlp_fc2", nh=64)
        return activation(h)
    return network_fn
The structural effect is the point: features from different families cannot interact until each family has been squeezed into a tiny embedding, so the cheap early cross-feature shortcuts simply are not available.
It is regularization you do not have to tune.
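Mechanically, registration is also what makes these variants easy to swap. get_network_builder is the standard Baselines lookup; the snippet below is only an illustration of how the name resolves, not code from the repo:

# sketch: how a registered name becomes a network builder (standard Baselines API)
from baselines.common.models import get_network_builder

network_fn = get_network_builder("deep_silos")()  # calling the registered builder returns network_fn
# network_fn(X) is what ppo2/a2c wire into the policy when you pass --network deep_silos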
After the June/July pain, an LSTM felt like the obvious upgrade: if the problem is regime, surely the fix is memory.
So I tried an LSTM after the silos embedding.
In the repo: baselines/common/models.py under @register("deep_silos_lstm").
The core idea (trimmed):
# baselines/common/models.py (trimmed)
@register("deep_silos_lstm")
def deep_silos_lstm(nlstm=32, **net_kwargs):
    def network_fn(X, nenv=1):
        nbatch = X.shape[0]
        nsteps = nbatch // nenv
        M = tf.placeholder(tf.float32, [nbatch])           # "done" mask from the previous step
        S = tf.placeholder(tf.float32, [nenv, 2 * nlstm])  # recurrent cell + hidden state
        # silo embedding first (the untrimmed file exposes the silo trunk as a helper), then the LSTM
        h = deep_silos_net(X, tf.nn.selu)
        xs = batch_to_seq(h, nenv, nsteps)
        ms = batch_to_seq(M, nenv, nsteps)
        h5, snew = utils.lstm(xs, ms, S, scope="lstm", nh=nlstm)
        initial_state = np.zeros(S.shape.as_list(), dtype=float)
        return seq_to_batch(h5), {"S": S, "M": M, "state": snew, "initial_state": initial_state}
    return network_fn
I also left breadcrumbs in the live inference layer so I could test it in Chappie:
BitmexPythonChappie/OrderBookMovePredictor.py defines ARGS_LSTM (including -network deep_silos_lstm and -nlstm 32).
So what happened?
The LSTM did not learn "market memory". It mostly learned dataset memory.
I compared the architectures under the same discipline: same data, same training budget, same evaluation - only the network wiring changed.
What kept repeating: the silo networks generalized better than the flat MLP, while the LSTM variant looked best on training data and degraded fastest out of regime.
The most useful rule I wrote down at the time:
Capacity makes overfitting more powerful. Structure makes overfitting more expensive. Add structure.
The next lever was the environment itself, and this is where it is easy to label things as "cheats" when they are actually good engineering.
The baseline environment (bitmex-gym/gym_bitmex/envs/bitmex_env.py) starts episodes at random places in the dataset.
It can also force a random initial action (so the agent sometimes begins already holding a position).
That second part turned out to be one of the best ideas in the whole project.
In code (trimmed):
# bitmex_env.py (trimmed)
if bitmexEnv.TAKE_RANDOM_INIT_ACTION:
    init_action = random.randint(0, 2)  # hold / open long / open short
    # the forced action's step consumes a random slice of time, so the position
    # does not always start at the same micro-moment
    init_steps = random.randint(0, bitmexEnv.RANDOM_INIT_TIME_IN_SECONDS * 4)
    self.current_step_skip = init_steps
    ob, reward, done, info = self.step(init_action)
    self.current_step_skip = bitmexEnv.STEP_SKIP  # restore the fixed decision tempo
It forces the policy to learn "manage" and not only "enter".
A correction that matters:
STEP_SKIP is not random.
It is a hyperparameter that controls decision tempo and credit assignment stability.
The reset logic can use a random initial skip to land in a different micro-moment, but the episode runs with a fixed STEP_SKIP.
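To make "decision tempo" concrete, here is a minimal sketch of a step under a fixed skip. Only STEP_SKIP / current_step_skip are names from the repo; the helpers are hypothetical:

# Hypothetical sketch: one agent decision consumes current_step_skip rows of market data.
def step(self, action):
    self._apply_action(action)                # hold / open long / open short, per the snippet above
    for _ in range(self.current_step_skip):   # advance the data at the decision tempo
        self._advance_one_data_row()
    return self._get_observation(), self._compute_reward(), self._is_done(), {}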
The environment can define episode length in steps, sampled with variance:
# bitmex_env.py (trimmed)
if self.use_episode_based_time:
    self.current_limit_steps_per_episode = int(
        np.random.normal(bitmexEnv.MEAN_STEPS_PER_EPISODE, bitmexEnv.STD_STEPS_PER_EPISODE)
    )
In the repo, the constants are anchored to an underlying notion of time (about 1 hour on average, minimum around 20 minutes), scaled by STEP_SKIP.
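As a back-of-the-envelope illustration (my numbers, not constants from the repo): the reset snippet multiplies seconds by 4, which suggests roughly four data rows per second, and from there the episode constants follow from the wall-clock target and the decision tempo.

# Back-of-the-envelope only: ROWS_PER_SECOND is inferred from RANDOM_INIT_TIME_IN_SECONDS * 4,
# and this STEP_SKIP value is a placeholder, not the repo's setting.
ROWS_PER_SECOND = 4
STEP_SKIP = 20                # example tempo: one decision per ~5 seconds of data
TARGET_MEAN_MINUTES = 60      # "about 1 hour on average"

MEAN_STEPS_PER_EPISODE = TARGET_MEAN_MINUTES * 60 * ROWS_PER_SECOND // STEP_SKIP  # = 720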
This choice was about practicality: episodes long enough for a position to play out, short enough that a training run sees many different market moments.
The HFT environment (bitmex-hft-gym/.../bitmex_hft_env.py) was a learning artifact.
It tried to do more:
- its own decision tempo (STEP_SKIP = 25 in that file)
- position awareness in the observation (a current_position_size style signal)

But two forces crushed it: fees and exchange outages - the same plumbing realities from 2019, now hitting on every decision.
The main takeaway is not "HFT is impossible".
It is:
If the exchange can stall, your agent must be trained inside that reality, or it will learn a policy that only works in a simulator.
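One way to bake that reality into training - sketched here with assumed names, since the post does not show how (or whether) the repo did it - is to make the simulator itself unreliable:

# Hypothetical sketch: inject exchange stalls into the simulated step so the policy
# has to live with them. OUTAGE_PROB and the helper names are illustrative.
if random.random() < OUTAGE_PROB:
    # exchange "overloaded": the order is dropped, but market time still advances
    info["order_rejected"] = True
else:
    self._submit_order(action)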
This failure story pushed me toward maker-style thinking later in 2020.
This is what I used before blaming "RL instability":
- check whether the data and the evaluation share the same regime bias
- check what shortcuts the network wiring makes available
- check the environment mechanics: random starts, forced initial positions, episode length
- tune STEP_SKIP like you would tune latency in a real system

Architectural structure behaves like regularization, but it is stronger: it changes what shortcuts are available.
Instead of asking the optimizer to be nice, it wires the model so memorization is harder.
The LSTM had more capacity to memorize sequence-level quirks in the training windows.
With regime imbalance, that advantage turns into an overfitting accelerator.
Did that extra capacity ever pay for itself? Not in this project.
Random starts increased state coverage and reduced the chance of learning a single timeline. Inside an episode the environment still steps forward in order.
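A minimal sketch of that reset pattern - current_limit_steps_per_episode and STEP_SKIP are repo names, everything else here is illustrative:

# Hypothetical reset sketch: pick a random starting row, then only ever step forward.
def reset(self):
    episode_rows = self.current_limit_steps_per_episode * bitmexEnv.STEP_SKIP
    last_valid_start = len(self.dataset) - episode_rows
    self.current_row = random.randint(0, last_valid_start)  # random place in history
    # ...then the forced random initial action shown earlier, stepping forward in order
    return self._get_observation()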
Silos were a practical answer to bull personality.
They forced the model to learn reusable structure inside each feature family, instead of learning cross-feature hacks that only exist in one regime.
Next month is where this turns into a strategy decision:
Maker Trades as a Strategy
Because once you accept outages as reality, always-taker stops being a sensible policy.