Aug 30, 2020 - 18 MIN READ
Deep Silos in RL: Architecture as Stability (and the First LSTM Variant)

August 2020 - After the first live pain and the bull-personality problem, I stopped tuning "algorithms" and started tuning the network contract. Deep Silos beat flat MLPs, and the LSTM variant overfit fast.

Axel Domingues

In 2018 I was still in "RL exploration mode": run an algorithm, tune a few hyperparameters, celebrate when the curve goes up.

In 2019 BitMEX forced me to grow up: microstructure, fees, queue priority, partial fills, outages - the plumbing.

By August 2020 I finally had the right kind of problem:

  • my backtests looked great
  • my walk-forward slices looked "fine"
  • and the policies still felt fragile

The root cause was not mysterious.

My data (and my evaluation) was bull-biased.

So the agents did what agents do: they developed a personality that fit the regime they saw most.

This post is about the lever that helped more than I expected:

Architecture as stability.

Not because architecture is magic, but because wiring is a contract: it controls what shortcuts the model can learn.


Repo anchors used in this post
  • baselines/common/models.py (Deep Silos + Deep Silos LSTM networks)
  • BitmexPythonChappie/OrderBookMovePredictor.py (LSTM inference args)
  • bitmex-gym/gym_bitmex/envs/bitmex_env.py (baseline env mechanics)
  • bitmex-hft-gym/.../bitmex_hft_env.py (HFT detour + failure story)

The problem: "bull personality" is still overfitting

The annoying part about regime overfitting is that it can look very scientific.

  • Training reward improves.
  • Validation reward improves.
  • Walk-forward curves look stable.

But if the validation distribution shares the same regime bias, you are effectively grading the agent on the same mood it trained on.

My big realization around this time:

  1. A good policy is not just "profitable" in a slice.
  2. A good policy is consistent across slices that are meaningfully different.

By June/July 2020 I was already hyper-aware that my dataset contained more bull runs than anything else. So August became the month when I tried to make generalization harder to avoid.


The hypothesis: stop mixing feature families too early

If you feed a flat MLP a giant feature vector, you are implicitly telling it:

Any feature can talk to any other feature immediately.

In trading, that can be dangerous.

Feature families often have very different semantics:

  • datetime/context features
  • microstructure features
  • derived imbalance/pressure features
  • longer-horizon context

A flat MLP can invent "cross-feature hacks" early - shortcuts that exist only because your training dataset has a repeating structure.

So I brought back an old supervised trick from 2019:

Deep Silos.


Deep Silos: wiring as a regularizer

The Deep Silos idea is simple:

  • slice the input into feature families
  • give each family a small MLP (a silo)
  • concatenate the silo embeddings
  • only then allow mixing

In the repo this is implemented as a Baselines network in:

  • baselines/common/models.py under @register("deep_silos")

Here is the core pattern (trimmed, but faithful):

# baselines/common/models.py (trimmed)
@register("deep_silos")
def deep_silos_net(**net_kwargs):
    def network_fn(X, nenv=1):
        # silos_list, input_total_features and activation come from the
        # surrounding (trimmed) scope: one index list per feature family.
        silos_outputs = []
        silo_start_idx = 0

        for silo_number, silo_list in enumerate(silos_list):
            # each family sees only its own slice of the flat input
            silo_input = tf.slice(X, [0, silo_start_idx], [-1, len(silo_list)])
            h = fc(silo_input, f"mlp_silo_fc{silo_number}", nh=32)
            h = activation(h)
            # compress each family to a tiny embedding before any mixing
            h = fc(h, f"mlp_silo_output{silo_number}", nh=2)
            h = activation(h)
            silos_outputs.append(h)
            silo_start_idx += len(silo_list)

        # features not assigned to a silo pass through untouched
        remaining_input = tf.slice(X, [0, silo_start_idx], [-1, input_total_features - silo_start_idx])
        silos_outputs.append(remaining_input)

        # mixing across families only happens from here on
        h = fc(tf.concat(silos_outputs, 1), "mlp_fc1", nh=64)
        h = activation(h)
        h = fc(h, "mlp_fc2", nh=64)
        return activation(h)

    return network_fn
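
Once registered, the network is selectable by name anywhere Baselines accepts a network argument. A minimal usage sketch - the environment variable and hyperparameters here are placeholders, and the repo may launch training through baselines.run instead:

# Usage sketch only - entry point and hyperparameters are illustrative.
from baselines.ppo2 import ppo2

model = ppo2.learn(
    network="deep_silos",   # resolved through the @register("deep_silos") decorator
    env=vec_env,            # a vectorized gym_bitmex environment, assumed already built
    total_timesteps=1_000_000,
)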

The structural effect is the point:

  • if the model wants to cheat, it must cheat inside a silo first
  • cross-family shortcuts become more expensive

It is regularization you do not have to tune.
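
To make "feature families" concrete, here is an illustrative sketch of how the silo boundaries can be declared. The family names and widths below are hypothetical; the real split lives next to the network in the repo:

# Illustrative only: hypothetical feature families and widths - the real silo
# definitions live alongside the network in baselines/common/models.py.
SILOS = {
    "datetime_context": 6,     # hour-of-day / day-of-week style features
    "microstructure": 12,      # spreads, top-of-book sizes, trade flow
    "imbalance_pressure": 8,   # derived imbalance / pressure features
    "long_horizon": 10,        # slower context features
}

def silo_slices(silos):
    """Map each family to its (start, width) slice in the flat input vector."""
    slices, start = {}, 0
    for name, width in silos.items():
        slices[name] = (start, width)
        start += width
    return slices, start  # start is now the total input width

slices, input_total_features = silo_slices(SILOS)
# e.g. slices["microstructure"] == (6, 12): columns 6..17 feed the microstructure silo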


The first LSTM variant: it worked fast, then overfit fast

After the June/July pain, an LSTM felt like the obvious upgrade:

  • markets are sequences
  • microstructure is temporal
  • an MLP is memoryless

So I tried an LSTM after the silos embedding.

In the repo:

  • baselines/common/models.py under @register("deep_silos_lstm")

The core idea (trimmed):

# baselines/common/models.py (trimmed)
@register("deep_silos_lstm")
def deep_silos_lstm(nlstm=32, **net_kwargs):
    def network_fn(X, nenv=1):
        # reuse the silo trunk as a feature extractor, then feed its embedding to the LSTM
        h = deep_silos_net(X, tf.nn.selu)
        # M (episode-done masks), S (carried LSTM state), nsteps and initial_state
        # follow the standard Baselines recurrent-network pattern (trimmed here)
        xs = batch_to_seq(h, nenv, nsteps)
        ms = batch_to_seq(M, nenv, nsteps)
        h5, snew = utils.lstm(xs, ms, S, scope="lstm", nh=nlstm)
        return seq_to_batch(h5), {"state": snew, "initial_state": initial_state}
    return network_fn

I also left breadcrumbs in the live inference layer so I could test it in Chappie:

  • BitmexPythonChappie/OrderBookMovePredictor.py defines ARGS_LSTM (including -network deep_silos_lstm and -nlstm 32).
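
The exact contents of that config live in the repo; as a sketch, the intent is to pin inference to the same network choice and LSTM size used in training (only the two flags mentioned above are taken from the repo, the rest is omitted):

# Sketch of the intent behind ARGS_LSTM - exact contents live in OrderBookMovePredictor.py.
ARGS_LSTM = [
    "-network", "deep_silos_lstm",  # the silos + LSTM trunk registered above
    "-nlstm", "32",                 # must match the hidden size used at training time
]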

So what happened?

  • training improved quickly
  • the policy looked smarter in-sample
  • but walk-forward performance degraded faster than the silos-only model

The LSTM did not learn "market memory". It mostly learned dataset memory.

The LSTM was not "bad". It was simply higher capacity in an already biased regime. Under those conditions it tends to memorize:

  • the micro-sequence quirks of the training windows
  • slice artifacts
  • regime-specific patterns

Deep Silos without the LSTM was the clear winner because the structure forced generalization.

Flat vs Silos vs Silos+LSTM: the summary

I compared the architectures under the same discipline:

  • same walk-forward slicing logic
  • same episode mechanics
  • same reward baseline

Only the network wiring changed.
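
For concreteness, this is the shape of the walk-forward discipline I mean - a minimal sketch with illustrative window sizes, not the repo's slicing code:

from datetime import timedelta

def walk_forward_slices(start, end, train_days=60, test_days=14, step_days=14):
    """Yield (train_start, train_end, test_end) windows that only move forward in time."""
    t = start
    while t + timedelta(days=train_days + test_days) <= end:
        train_end = t + timedelta(days=train_days)
        test_end = train_end + timedelta(days=test_days)
        yield t, train_end, test_end
        t += timedelta(days=step_days)

# Each architecture (flat MLP, deep_silos, deep_silos_lstm) is trained on
# [train_start, train_end) and evaluated on [train_end, test_end) - never the reverse.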

What kept repeating:

  • Flat MLP: learns fast, looks great in-sample, degrades out-of-sample
  • Deep Silos: learns slower, but holds up better in walk-forward slices
  • Deep Silos + LSTM: learns fast again, and overfits fast again

The most useful rule I wrote down at the time:

If evaluation is fragile, do not add capacity.

Add structure.

Capacity makes overfitting more powerful. Structure makes overfitting more expensive.

Environment mechanics that mattered (and stayed good later)

This is important because it is easy to label things as "cheats" when they are actually good engineering.

Random spawn is not a cheat - it is state coverage

The baseline environment (bitmex-gym/gym_bitmex/envs/bitmex_env.py) starts episodes at random places in the dataset.

It can also force a random initial action (so the agent sometimes begins already holding a position).

That second part turned out to be one of the best ideas in the whole project.

In code (trimmed):

# bitmex_env.py (trimmed)
if bitmexEnv.TAKE_RANDOM_INIT_ACTION:
    init_action = random.randint(0, 2)  # hold / open long / open short
    init_steps = random.randint(0, bitmexEnv.RANDOM_INIT_TIME_IN_SECONDS * 4)
    self.current_step_skip = init_steps           # one-off random skip for the forced first step
    ob, reward, done, info = self.step(init_action)
    self.current_step_skip = bitmexEnv.STEP_SKIP  # restore the fixed decision tempo

It forces the policy to learn to "manage" a position, not only to "enter" one.

STEP_SKIP is a tunable hyperparameter

A correction that matters:

STEP_SKIP is not random.

It is a hyperparameter that controls decision tempo and credit assignment stability.

The reset logic can use a random initial skip to land in a different micro-moment, but the episode runs with a fixed STEP_SKIP.
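
A minimal sketch of that contract - class and constants are illustrative, not the repo's:

import random

# Sketch only: the random skip happens once at reset, the tempo stays fixed afterwards.
class DecisionTempoSketch:
    STEP_SKIP = 100                    # fixed decision tempo (a tuned hyperparameter)
    RANDOM_INIT_TIME_IN_SECONDS = 600  # illustrative value

    def __init__(self, n_ticks):
        self.n_ticks = n_ticks
        self.tick = 0

    def reset(self):
        # one random skip at reset time only: land in a different micro-moment
        self.tick = random.randint(0, self.RANDOM_INIT_TIME_IN_SECONDS * 4) % self.n_ticks
        return self.tick

    def step(self):
        # every decision afterwards advances by the same fixed STEP_SKIP
        self.tick = (self.tick + self.STEP_SKIP) % self.n_ticks
        return self.tick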

Episodes were time-like via steps (with variance)

The environment can define episode length in steps, sampled with variance:

# bitmex_env.py (trimmed)
if self.use_episode_based_time:
    self.current_limit_steps_per_episode = int(
        np.random.normal(bitmexEnv.MEAN_STEPS_PER_EPISODE, bitmexEnv.STD_STEPS_PER_EPISODE)
    )

In the repo, the constants are anchored to an underlying notion of time (about 1 hour on average, minimum around 20 minutes), scaled by STEP_SKIP.
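
The arithmetic behind those constants is roughly the following. The tick resolution is my assumption (the "* 4" in the reset snippet above hints at 4 ticks per second), and the numbers are illustrative:

# Illustrative arithmetic only - the actual constants live in bitmex_env.py.
TICKS_PER_SECOND = 4   # assumption, suggested by the "* 4" in the reset snippet above
STEP_SKIP = 100        # hypothetical decision tempo

def steps_for(seconds):
    """How many agent decisions fit in a given wall-clock span at this tempo."""
    return int(seconds * TICKS_PER_SECOND / STEP_SKIP)

MEAN_STEPS_PER_EPISODE = steps_for(60 * 60)  # ~1 hour on average
MIN_STEPS_PER_EPISODE = steps_for(20 * 60)   # ~20 minute floor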

This choice was about practicality:

  • long episodes made learning unstable
  • shorter episodes made debugging and credit assignment possible

Why bitmex-hft-gym failed (and why outages made it worse)

The HFT environment (bitmex-hft-gym/.../bitmex_hft_env.py) was a learning artifact.

It tried to do more:

  • higher decision tempo (STEP_SKIP = 25 in that file)
  • more complex mechanics
  • extra state variables (it even tracks a current_position_size style signal)

But two forces crushed it:

  1. Complexity explosion
    • more actions mean harder testing
    • harder evaluation discipline
    • and more ways to accidentally assume "perfect fills"
  2. Outage reality
    • when the agent wants to react to a sudden movement, BitMEX can return the infamous 503
    • taker/HFT logic is fragile under availability failures

The main takeaway is not "HFT is impossible".

It is:

Your environment contract must match the exchange availability contract.

If the exchange can stall, your agent must be trained inside that reality, or it will learn a policy that only works in a simulator.
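
One way to encode that contract - a hypothetical sketch, not something the repo does as written - is to make simulated outages part of the environment itself, so the policy learns to act under them:

import random

# Hypothetical wrapper: makes "the exchange returned 503" part of the training
# distribution instead of a surprise at live time.
class OutageWrapper:
    def __init__(self, env, p_outage=0.01, max_stall_steps=20):
        self.env = env
        self.p_outage = p_outage              # chance an outage starts on any given step
        self.max_stall_steps = max_stall_steps
        self.stall = 0

    def reset(self):
        self.stall = 0
        return self.env.reset()

    def step(self, action):
        if self.stall == 0 and random.random() < self.p_outage:
            self.stall = random.randint(1, self.max_stall_steps)
        if self.stall > 0:
            self.stall -= 1
            action = 0  # forced "hold": the order never reached the book (action 0 = hold above)
        return self.env.step(action)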

This failure story pushed me toward maker-style thinking later in 2020.


The August 2020 checklist

This is what I used before blaming "RL instability":

  1. Regime balance
    • if training and validation share the same market mood, you are not validating
  2. Walk-forward slices
    • evaluation must respect time, not random split
  3. State coverage
    • random starts and forced initial positions are coverage, not noise
  4. Decision tempo
    • tune STEP_SKIP like you would tune latency in a real system
  5. Architecture as guardrail
    • if feature families are real, silo them
  6. LSTM skepticism
    • if LSTM looks too good too fast, it is usually memorizing

Resources and repo anchors

Repo - bitmex-deeprl-research

The full research log and code for this series.

Deep Silos networks (Baselines)

See baselines/common/models.py for deep_silos and deep_silos_lstm.

LSTM inference breadcrumbs (Chappie)

See OrderBookMovePredictor.py for the ARGS_LSTM config.

Environments - baseline vs HFT

Compare bitmex-gym/.../bitmex_env.py with bitmex-hft-gym/.../bitmex_hft_env.py.


What's next

Next month is where this turns into a strategy decision:

Maker Trades as a Strategy

Because once you accept outages as reality, always-taker stops being a sensible policy.

Axel Domingues - 2026