Sep 29, 2019 - 12 MIN READ
Deep Silos - Representation Learning That Respects Feature Families

My first serious attempt at making the model “see” microstructure features the way I designed them — grouped, compressed, and only then fused.

Axel Domingues

In August, I got my first real supervised “alpha” curves. They were… humbling.

Not because the model was useless — but because it was too easy to get a model that looked smart on the training set and weirdly fragile on validation.

This month I stop treating “add more layers” as the answer and start treating architecture as a way to encode how I believe the features should behave.

I’m calling the approach Deep Silos:

  • group features by microstructure meaning (“feature families”)
  • learn a compact representation per family
  • only then fuse the representations into a shared network

This is the same mindset shift I had in deep RL: stability isn’t luck — it’s engineering.

Not financial advice. This is research engineering work in 2019. I’m documenting what I built, what broke, and what I learned — not telling anyone how to trade.

The problem: my features are not one blob

By now my dataset is a pile of signals with very different “physics”:

  • book shape (depth levels, bid/ask volumes)
  • book flow (liquidity created/removed per side)
  • trade pressure (buy/sell volume in windows)
  • reference signals (index diff, funding-like context)
  • “plumbing” features that exist mainly to sanity-check reality

If I feed all of that into one big MLP, the network is free to do this:

  • grab whichever feature is temporarily predictive in the training slice
  • memorize quirky interactions between unrelated families
  • overfit on regime-specific artifacts (bullish months, low-vol periods)
  • treat missingness and imputation patterns as signal

In other words: it can learn shortcuts.

And in microstructure, shortcuts are everywhere.

A model can look “accurate” while actually learning a data collection artifact:
  • a clock drift signature
  • a missing-data pattern
  • a schema default value
  • a volume feed glitch
  • a dataset split bug

The market doesn’t care that my pipeline is messy. It will happily reward my mistakes in backtests and punish them live.

The idea: feature families deserve their own representation

A “feature family” is a set of features that:

  • come from the same measurement process
  • share scale / statistical behavior
  • should be interpreted together

So instead of a single network learning the representation from scratch, I force the structure:

  1. Silo networks: one small network per family
  2. Embeddings: each silo outputs a tiny vector (a learned summary)
  3. Fusion network: concatenated embeddings → shared layers → prediction

That’s it.

It’s not a fancy paper trick. It’s an inductive bias.


Repo anchors (what I actually built)

deep_silos.py

Defines feature families (silos) and the TensorFlow network function that builds per-silo embeddings + a fusion MLP.

train_alpha_model.py

Training script: loads HDF5 datasets, applies mean/sigma normalization, trains the model, and prints threshold-based diagnostics.


Step 1: define the silos (the “feature families” contract)

The core design choice is the silo list — what gets grouped together.

In alpha_detection/deep_silos.py I encode this directly as lists of feature names.

The file builds a SILOS_LIST and a FEATURES_NAMES_LIST. The silos are patterns like:

  • order book “level i” groups
  • bid and ask volume per depth level
  • created/removed liquidity features (from May’s work)
  • multi-window volume features
  • other microstructure “families” that should be processed together

A simplified excerpt (structure only, not the full list):

# alpha_detection/deep_silos.py (excerpt)
SILOS_LIST = [
  ['level0', 'level1', 'level2', ...],
  ['ask_volume_level0', 'ask_volume_level1', ...],
  ['bid_volume_level0', 'bid_volume_level1', ...],
  ['ask_added_liquidity_level0', 'ask_removed_liquidity_level0', ...],
  ['buy_volume_1.500000', 'buy_volume_2.500000', ...],
  ...
]

The big win here is not that grouping is “correct”.

The win is: it’s explicit and it’s versionable.

If I change features, I have to change the silo contract too — and that forces me to think.

If you can’t explain why two features are in the same family, they probably shouldn’t be fused early.
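
To keep that contract honest, I want the name-to-column mapping to fail loudly when a silo references a feature that doesn’t exist. A minimal sketch of that check (the helper is mine, written for this post; SILOS_LIST and FEATURES_NAMES_LIST are the lists from the repo):

# hypothetical contract check - a sketch for this post, not code from the repo
def silo_column_indices(silos_list, feature_names):
    """Map each silo's feature names to column indices; fail loudly on typos."""
    name_to_col = {name: i for i, name in enumerate(feature_names)}
    silo_indices = []
    for silo in silos_list:
        missing = [name for name in silo if name not in name_to_col]
        if missing:
            raise ValueError("silo references unknown features: %s" % missing)
        silo_indices.append([name_to_col[name] for name in silo])
    return silo_indices

# e.g. SILO_INDICES = silo_column_indices(SILOS_LIST, FEATURES_NAMES_LIST)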

Step 2: learn a compact embedding per silo

Inside the deep_silos() network, each silo does this:

  • select the silo’s features out of the full feature vector
  • apply a small MLP
  • output a tiny embedding (in my code: 2 units per silo)
  • apply dropout during training

From the repo:

# alpha_detection/deep_silos.py (excerpt)
silos = []
for feature_list in SILOS_LIST:
    # map the silo's feature names to column indices in the full feature vector
    cols = [FEATURES_NAMES_LIST.index(name) for name in feature_list]
    silo_x = tf.gather(x, indices=cols, axis=1)
    out = mlp(silo_x, hidden_sizes=(10, 10), output_size=2, keep_prob=keep_prob)
    silos.append(out)
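
The mlp() helper itself isn’t in the excerpt. Here’s a minimal sketch of what a per-silo MLP with dropout can look like in the same TF 1.x style (the internals are my assumption for this post, not the repo’s exact implementation):

# hypothetical sketch of the mlp() helper - not the repo's exact implementation
import tensorflow as tf

def mlp(x, hidden_sizes=(10, 10), output_size=2, keep_prob=1.0):
    out = x
    for size in hidden_sizes:
        # hidden layers: fully connected with the default ReLU activation
        out = tf.contrib.layers.fully_connected(out, num_outputs=size)
        out = tf.nn.dropout(out, keep_prob)
    # linear head: the silo embedding itself, no activation
    return tf.contrib.layers.fully_connected(out, num_outputs=output_size,
                                             activation_fn=None)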

Two things matter here:

  1. Compression is a regularizer
    If each silo gets only a tiny output, it can’t memorize everything.
  2. Family-level invariances can form
    The silo can learn patterns like “imbalance across levels” or “liquidity draining” inside the family.

Even though I’m not writing those formulas down, the architecture nudges the network to discover them.


Step 3: fuse embeddings + “other features” into a shared network

After the per-silo embeddings:

  • concatenate embeddings
  • optionally concatenate any remaining features not covered by silos
  • apply shared dense layers
  • output the prediction

Again, from the repo:

# alpha_detection/deep_silos.py (excerpt)
# non_silo_features: columns not assigned to any silo, gathered from x separately
silos_concat = tf.concat(silos, axis=1)
x_concat = tf.concat([silos_concat, non_silo_features], axis=1)

out = tf.contrib.layers.fully_connected(x_concat, num_outputs=10)
out = tf.nn.dropout(out, keep_prob)
out = tf.contrib.layers.fully_connected(out, num_outputs=1, activation_fn=None)

This is the “fusion MLP”. It’s where cross-family interactions are allowed — but only after each family is summarized.


Training loop: what I log when I don’t trust myself

The training script is alpha_detection/train_alpha_model.py.

It’s classic TensorFlow-era code (placeholders + sessions), but the interesting part is the instrumentation discipline:

  • mean/sigma computed from training only
  • inf and NaN handled explicitly
  • dropout controlled by keep_prob
  • validation cost checked repeatedly
  • predictions sanity-checked with threshold counters
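
The loop itself is nothing exotic. A skeleton of the placeholder/session pattern, with dropout on for training and off for validation (names like train_op, cost and keep_prob are illustrative, not necessarily the repo’s exact variables):

# hypothetical TF 1.x training-loop skeleton - variable names are illustrative
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        # dropout active during training (keep_prob < 1.0)
        _, train_cost = sess.run([train_op, cost],
                                 feed_dict={x: x_train_n, y: y_train, keep_prob: 0.8})
        # dropout disabled when measuring validation cost
        valid_cost = sess.run(cost,
                              feed_dict={x: x_valid_n, y: y_valid, keep_prob: 1.0})
        print("epoch %d  train=%.6f  valid=%.6f" % (epoch, train_cost, valid_cost))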

The script contains a big display_results() function that prints:

  • output mean/std/min/max
  • label mean/std/min/max
  • how many predictions exceed certain thresholds (0.25, 0.5, 1.0, etc.)
  • what the true labels were when the model was confident
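
A rough sketch of what those threshold counters amount to (a minimal numpy version for this post; the real display_results() in the repo is bigger):

# hypothetical threshold diagnostics - a sketch, not the repo's display_results()
import numpy as np

def threshold_report(predictions, labels, thresholds=(0.25, 0.5, 1.0)):
    print("pred  mean=%.4f std=%.4f min=%.4f max=%.4f"
          % (predictions.mean(), predictions.std(), predictions.min(), predictions.max()))
    print("label mean=%.4f std=%.4f min=%.4f max=%.4f"
          % (labels.mean(), labels.std(), labels.min(), labels.max()))
    for t in thresholds:
        mask = np.abs(predictions) > t
        n = int(mask.sum())
        print("|pred| > %.2f: %d samples" % (t, n))
        if n > 0:
            # what the labels actually did when the model was "confident"
            print("  labels there: mean=%.4f std=%.4f"
                  % (labels[mask].mean(), labels[mask].std()))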

This looks crude… but it is exactly the kind of “manual dashboard” I need early:

  • it catches exploding outputs
  • it catches collapsed outputs (everything near 0)
  • it reveals if “confidence” correlates with actual bigger moves
  • it tells me if I’m just learning average behavior

Yes, later I’ll want proper metrics: calibration curves, hit-rate by quantile, and eventually PnL-aware evaluation. But right now in 2019, the first battle is: is the model learning anything real or just noise?

What changed in my thinking

I used to think “feature engineering” ends when you build the feature columns.

Now I think:

Feature engineering continues into the model architecture.

Deep Silos is feature engineering with weights.

It’s me telling the network:

  • “these features belong together”
  • “summarize them first”
  • “don’t mix everything too early”
  • “earn the right to learn cross-interactions”

The deep-silos checklist (what I keep breaking)

Here’s the stuff I keep getting wrong — and the checks that save me:

Verify feature ordering is identical everywhere

If FEATURES_NAMES_LIST changes but the dataset column order doesn’t, the model silently trains on garbage.

  • Compare dataset column names to FEATURES_NAMES_LIST
  • Assert shapes and indices match

Confirm silos cover what you think they cover

A typo in a feature name makes a silo smaller than expected.

  • Log the number of columns per silo
  • Fail fast if a silo is missing columns

Normalize on training only

If I compute mean/sigma across train+valid I leak information.

  • mean/sigma from training slice only
  • reuse for valid/test
  • store them as artifacts
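
A minimal sketch of that discipline (array names and artifact paths are illustrative, not the repo’s exact code):

# hypothetical normalization step - a sketch, not the repo's exact code
import numpy as np

# treat inf as missing before computing stats, so it is handled explicitly
x_train = np.where(np.isfinite(x_train), x_train, np.nan)

mean = np.nanmean(x_train, axis=0)           # training slice only
sigma = np.nanstd(x_train, axis=0) + 1e-8    # avoid division by zero

x_train_n = (x_train - mean) / sigma
x_valid_n = (x_valid - mean) / sigma         # reuse training stats, never refit

np.save("artifacts/mean.npy", mean)          # store as artifacts for later runs
np.save("artifacts/sigma.npy", sigma)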

Treat NaN/inf as pipeline bugs first

Replacing NaNs with 0 is a pragmatic band-aid, not a solution.

  • count NaNs per feature
  • track NaN rate over time
  • inspect where they originate
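
A tiny sketch of the per-feature accounting (x_raw is the un-normalized feature matrix; names are illustrative):

# hypothetical NaN audit - x_raw and the name list are illustrative
import numpy as np

nan_counts = np.isnan(x_raw).sum(axis=0)
for name, count in zip(FEATURES_NAMES_LIST, nan_counts):
    if count > 0:
        print("%-40s NaN rate: %.2f%%" % (name, 100.0 * count / len(x_raw)))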

Watch validation early, not at the end

With microstructure, overfitting can happen fast.

  • plot train vs valid cost per epoch
  • stop when valid cost worsens consistently

Common failure modes (symptom → likely cause → first check)

  • Validation improves then collapses → overfitting to regime/microstructure artifact → compare by time slice, not random shuffle (sketched after this list)
  • Outputs explode (very large values) → normalization bug or NaN/inf propagation → print feature stats after normalization
  • Outputs collapse near 0 → learning rate too low or label scale mismatch → print label distribution and baseline predictor
  • Model “predicts” mostly during missing-data moments → imputation pattern leakage → plot predictions vs missingness counters
  • Training is unstable across runs → small dataset + non-stationary market → fix seeds and log dataset hashes
  • Silo embeddings don’t help at all → silo grouping is wrong or too small → try bigger embedding size (2 → 8) and re-evaluate
  • Good MSE, bad “useful” predictions → optimizing wrong objective → inspect threshold hit rates and later map to trading decisions
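
On that first row: “compare by time slice” just means the split respects time. A minimal sketch, assuming rows are already sorted by timestamp (the helper is mine, written for this post):

# hypothetical time-ordered split - assumes rows are sorted by timestamp
def time_split(x, y, train_frac=0.70, valid_frac=0.15):
    n = len(x)
    i_train = int(n * train_frac)
    i_valid = int(n * (train_frac + valid_frac))
    # no shuffling: validation and test always come after training in time
    return ((x[:i_train], y[:i_train]),
            (x[i_train:i_valid], y[i_train:i_valid]),
            (x[i_valid:], y[i_valid:]))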

Field notes (what surprised me)

  • “More structure” can outperform “more capacity” even with the same data.
  • Some feature families are simply unreliable (volume can lie more than I expected).
  • The most dangerous bugs are the silent ones: wrong column ordering, wrong normalization, wrong split.
  • Deep Silos doesn’t magically make the model smart — but it makes failures easier to interpret.

Deliverables (what I can reproduce right now)

This month’s output isn’t a “production model”. It’s a reproducible experiment:

  • alpha_detection/deep_silos.py — the architecture definition
  • alpha_detection/train_alpha_model.py — training loop + diagnostics
  • Saved artifacts (local, for now):
    • mean/sigma vectors used for normalization
    • plots of training vs validation cost
    • text dumps of “high-confidence” predictions with timestamps

Expected result (qualitative, no fake precision):

  • training curve still improves quickly (it always does)
  • validation curve is less jumpy than the naive MLP baseline
  • “confident predictions” are rarer, but less random
  • debugging becomes easier because I can isolate which silo contributes to instability

Even if validation looks better, this is still not “real-world alpha”. Until I can monitor the model live and observe predictions in context (spread, liquidity, outages), I’m still in the lab.


What’s next

Next month I take this model out of the quiet offline world and into something closer to a conversation with the market.

If the model is real, it should behave sensibly in real time:

  • predictions should cluster around microstructure events that make sense
  • confidence shouldn’t spike on data glitches
  • and I should finally see whether the market is “speaking” in the features I engineered

I’m not expecting magic.

I’m expecting feedback — and probably pain.
