Dec 27, 2020 - 14 MIN READ
From Research Rig to System: 2020 Postmortem and the Real Amazing Result


2020 is when I stopped training agents and started building a trading system: environments, evaluation discipline, safety, and a live loop that survives outages. This is the postmortem — and the first result that actually held up in reality.

Axel Domingues

In 2018, I was obsessed with RL itself: algorithms, stability tricks, papers, and “can I get it to learn anything at all?”

In 2019, BitMEX forced me to grow up: microstructure, fees, partial fills, queue priority, and those infamous exchange failures that make a backtest look smart… right until it meets reality.

And in 2020, something finally clicked:

I stopped treating my work as a model and started treating it as a system.

This post is the “what changed” write-up. It’s also the moment this project stopped being a research toy and started producing a result I could run live — and trust.


The “amazing” result (and what it really means)

By the end of 2020, I had a model I’d been running live since June 2020.

Over roughly the last 6 months, the portfolio was up. Yes, that includes the collateral appreciation (BTC moving from ~10k to ~30k), but it also includes a lot of consistent, fee-aware behavior that didn’t collapse under normal operational stress.

The headline number is not the point.

The point is: the agent’s behavior survived reality long enough to compound.
That only happened once the environment, evaluation, and execution loop started behaving like an engineered system.

2021 Update: The Perps Catch I Didn’t Model (Funding / Open Interest Costs)

In mid 2021 I had to stop the live trading process — not because the agent “broke”, but because I missed a crucial BitMEX perpetuals mechanic in my environment + live accounting:

the daily carry cost of holding open positions (funding).

Everything still looked amazing while the market kept grinding upward and longs were “easy mode”. But once BitMEX got crowded on the long side and the market stopped moving at the same pace, the system entered a new regime:

  • the agent could be “right” directionally,
  • could trade cleanly,
  • could even look stable in terms of drawdown,
  • and still bleed because holding exposure had an ongoing cost.

Over time, the funding payments became higher and higher relative to the edge the agent was producing in that slower market. That’s when I realized: with perpetual contracts, there’s a catch you can ignore only until the market teaches you otherwise.

This is the new “don’t lie to yourself” rule I learned in 2021:

If your environment doesn’t model funding / carry, your backtest is missing a real force that can dominate returns — especially when the market goes sideways and positioning gets crowded.

What I’d change (if I rewired the system again)

  • Treat funding as a first-class term in PnL accounting (both in sim + live); see the sketch after this list.
  • Make “hold exposure” a decision with an explicit price: the agent shouldn’t be rewarded just for being long; it should be rewarded for being long only when the expected edge beats carry.
  • Add evaluation slices where:
    • price is slow / mean-reverting,
    • and funding is consistently expensive for the side the agent wants to sit on.
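
To make that first point concrete, here’s a minimal sketch of funding as a first-class PnL term. It’s a linear-contract approximation with illustrative field names and rates, not the actual environment code:

```python
# Hypothetical sketch: funding folded into per-step PnL accounting.
# Linear-contract approximation (the real XBTUSD perp is inverse); field
# names and rates are illustrative. Convention: longs pay shorts when the
# funding rate is positive.

FUNDING_INTERVAL_HOURS = 8

def step_pnl(entry_price, exit_price, position_btc, funding_rate, hours_held):
    """PnL for one step in quote currency, net of funding carry."""
    directional = position_btc * (exit_price - entry_price)

    # Funding accrues per 8h interval on the position's notional value.
    intervals = hours_held / FUNDING_INTERVAL_HOURS
    funding_paid = position_btc * exit_price * funding_rate * intervals

    return directional - funding_paid

# "Right direction, still bleeding": 1 BTC long, price drifts up 30 over a
# day, but the crowded long side pays 0.05% every 8 hours.
print(step_pnl(30_000, 30_030, 1.0, 0.0005, 24))   # ~ 30 - 45 = about -15
```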

The humbling part: this wasn’t a bug in PPO, or even in my feature set.

It was a missing piece of market reality — and perps are full of those.


The real story of 2020: a sequence of “stops lying to myself”

If I had to compress the year into a single sentence:

Every “cheat” I removed made training harder — and made the result more real.

These were the big inflection points (the ones that mattered):

  1. Formal Gym contract
    I stopped letting training code and environment assumptions blur together. If it isn’t in the contract, it doesn’t exist.
  2. Walk-forward evaluation discipline
    I stopped trusting one backtest. I started requiring the model to survive multiple windows and regimes.
  3. Maker-style reward shaping
    Fees stopped being an afterthought and became a first-class reward term. That changed behavior.
  4. Management environment
    Entry-only agents are fragile. Position sizing and risk state turned behavior from “spiky” to “survivable”.
  5. Batch training + eval gates
    Training runs stopped being “one lucky seed”. Progress became “a batch that survives scrutiny”.
  6. Safety engineering and reconciliation
    Without kill-switches and recovery logic, your best model becomes a liability.

The end-to-end system (what exists by December)

At the end of 2020, the project is not “PPO on BitMEX data”.

It’s a pipeline:

  1. Market data → feature pipeline
  2. Gym environments (baseline + management)
  3. Training (batch runs, seeds, configs)
  4. Evaluation harness (walk-forward + regime slices)
  5. Model registry (naming + artifacts)
  6. Execution loop (“Chappie”)
  7. Safety layer (kill switches + reconciliation)
  8. Reporting (fees, PnL, drawdowns, churn, survival)

Holding that pipeline in your head as one picture was the biggest mindset upgrade of the project:

the unit of progress is not “a model checkpoint”
it’s a system version (env + reward + eval + execution) that survives.
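
One hypothetical way to make that concrete: treat the env, reward, eval protocol, execution profile, and checkpoint as a single versioned thing. The fields and naming here are my own illustration, not the actual registry code:

```python
# Hypothetical sketch: the "system version" as the unit of progress.
# A checkpoint only means something next to the env, reward, eval protocol,
# and execution profile it was trained and judged with.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class SystemVersion:
    env_id: str               # e.g. "bitmex-management-gym"
    reward_weights: dict      # pnl / fees / churn / risk term weights
    eval_protocol: str        # e.g. "walk-forward + regime slices"
    execution_profile: str    # e.g. "maker-biased, low-frequency cadence"
    model_artifact: str       # path or hash of the trained checkpoint

    def tag(self) -> str:
        """Stable short tag so reports, artifacts, and live runs point at the same system."""
        blob = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha1(blob.encode("utf-8")).hexdigest()[:10]

sv = SystemVersion(
    env_id="bitmex-management-gym",
    reward_weights={"pnl": 1.0, "fees": 1.0, "churn": 0.05},
    eval_protocol="walk-forward + regime slices",
    execution_profile="maker-biased, low-frequency",
    model_artifact="models/example-checkpoint.zip",
)
print(sv.tag())
```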


What I got wrong early 2020

1) “Good backtests” that were still biased

The biggest trap wasn’t overfitting in the classic ML sense.

It was regime bias.

By mid-2020 I had become hyper-aware that I had more bull data than bear/sideways regimes, and worse: the “validation” slices often had the same bull personality.

So the agent learned a personality:

  • confident
  • trend-friendly
  • drawdown-intolerant in slow drifts
  • and suspiciously “always ready” to be long

The fix wasn’t a single parameter.

The fix was discipline: evaluate against slices that disagree with your training distribution.

2) Training forever (and mistaking persistence for progress)

For a while I believed:

“If I train long enough, it’ll converge.”

In practice, I learned the opposite for this domain:

  • long training runs amplified dataset quirks
  • the agent got better at the environment I built
  • not at the market I wanted

By late 2020, I consistently saw the strongest models come from early stops around 1.5M–2M steps (sometimes “2M + 2M” style staged runs), not the 35M–70M monster runs I was doing months earlier.

If your eval harness is weak, long training just gives you more opportunities to accidentally overfit.

Long training is not a substitute for honest evaluation.
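
If I had to reduce that lesson to a selection rule, it would look something like this. The numbers are made up to show the shape of the decision, not real results:

```python
# Illustrative numbers only: (training steps, eval slices survived out of 6).
checkpoints = [
    (500_000, 3),
    (1_500_000, 6),
    (2_000_000, 6),
    (5_000_000, 4),
    (35_000_000, 2),   # the monster run gets better at the sim, worse at the gates
]

# Select by evaluation gates, not training length; ties go to the earlier stop.
steps, survived = min(checkpoints, key=lambda c: (-c[1], c[0]))
print(steps, survived)   # 1500000 6
```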


What I changed (and why it worked)

Reward shaping without lying

The core idea I learned: reward shaping is allowed…

…but only if it teaches a behavior that exists in live trading.

In 2020, the reward system evolved into a set of aligned terms:

  • PnL signal — but not as a naive instant reward
  • Fees/rebates — to encourage maker-like behavior and realistic costs
  • Churn penalties — to discourage spammy flip-flopping
  • Risk penalties — to avoid “martingale personality”
  • Time-in-position / exposure shaping — to make behavior smoother and survivable

Instead of thinking “what reward gets the best backtest?”, I started thinking:

What reward makes the agent behave like a trader who wants to stay alive?
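
For concreteness, here’s a minimal sketch of those terms composed into one scalar. The weights and field names are illustrative assumptions, not the environment’s actual shaping:

```python
# Hypothetical sketch of a shaped reward combining the terms above.
def shaped_reward(step):
    """`step` is a dict of per-step quantities produced by the environment."""
    pnl = step["realized_pnl"] + step["unrealized_pnl_change"]
    fees = step["taker_fees_paid"] - step["maker_rebates_earned"]
    churn = step["position_flips"]                        # flip-flopping count
    risk = max(0.0, step["exposure"] - step["exposure_limit"])
    carry = step["funding_paid"]                          # the 2021 lesson, added with hindsight

    return (
        1.00 * pnl
        - 1.00 * fees
        - 0.05 * churn      # discourage spammy flip-flopping
        - 0.50 * risk       # punish martingale-style exposure creep
        - 1.00 * carry      # holding exposure has an explicit price
    )

# Directionally "right" step that still scores negative once fees and carry count.
example = {
    "realized_pnl": 0.0, "unrealized_pnl_change": 12.0,
    "taker_fees_paid": 3.0, "maker_rebates_earned": 0.5,
    "position_flips": 2, "exposure": 1.2, "exposure_limit": 1.0,
    "funding_paid": 11.0,
}
print(shaped_reward(example))   # ≈ -1.7, negative despite positive raw PnL
```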


The evaluation discipline that stopped the project from lying

The best thing I built in 2020 wasn’t an LSTM or a fancy architecture.

It was evaluation gates.

Walk-forward slices as a default

Every model needed to face multiple windows:

  • train on one slice
  • validate on the next
  • test on the next
  • repeat
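
A minimal sketch of what “repeat” means mechanically, assuming the data is just indexable bars; the real harness slices by date ranges and the window lengths here are illustrative:

```python
# Hypothetical sketch: rolling train / validate / test windows over time.
def walk_forward_windows(n_samples, train_len, val_len, test_len, stride=None):
    """Yield (train, validate, test) index ranges that only ever roll forward."""
    stride = stride or test_len
    start = 0
    while start + train_len + val_len + test_len <= n_samples:
        train = range(start, start + train_len)
        val = range(train.stop, train.stop + val_len)
        test = range(val.stop, val.stop + test_len)
        yield train, val, test
        start += stride

# Example: ~2 years of daily bars, 9-month train, 1.5-month validate, 1.5-month test.
for train, val, test in walk_forward_windows(730, 270, 45, 45):
    print(train.start, val.start, test.start)
```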

Regime slices as a truth serum

I grouped evaluation windows by “market personality”:

  • bull-ish expansions
  • slow downward drifts
  • choppy sideways
  • stress/outage-heavy periods
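
A hypothetical way to assign those labels automatically, using drift and volatility as a coarse proxy for “personality”; the thresholds are illustrative:

```python
# Hypothetical sketch: coarse regime label for an evaluation window.
import statistics

def regime_label(prices):
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    drift = sum(returns) / len(returns)
    vol = statistics.pstdev(returns)
    if vol > 0.04:
        return "stress / outage-heavy"
    if drift > 0.002:
        return "bull-ish expansion"
    if drift < -0.001:
        return "slow downward drift"
    return "choppy sideways"

print(regime_label([100, 103, 107, 112, 118]))   # "bull-ish expansion"
```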

A model didn’t “win” because it crushed one window.

A model “won” if it was not embarrassed by several windows that disagreed with each other.

The rule that emerged:

Ship only if it survives X slices.
Not “ship if one run looks great”.
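
As a function, the rule is almost embarrassingly small; the per-slice pass/fail criteria live elsewhere, and the threshold here is a placeholder:

```python
# Minimal sketch of "ship only if it survives X slices".
def ship_decision(slice_reports, min_survived=4):
    survived = [name for name, ok in slice_reports.items() if ok]
    return len(survived) >= min_survived, survived

ok, survived = ship_decision({
    "bull expansion window": True,
    "choppy sideways window": True,
    "slow drift window": True,
    "march-2020 stress window": False,
    "outage-heavy window": True,
})
print(ok, survived)   # True, 4 of 5 slices survived
```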


Batch training: the new unit of progress

This is where the project became engineering.

Instead of “train one model”, the workflow became:

  • train N configs
  • across multiple seeds
  • across multiple windows
  • and let evaluation pick the survivors

The batch runner + evaluation harness turned results into something comparable.

No more “that run felt good”.

Now I had:

  • the config
  • the seed
  • the windows
  • the eval report
  • the artifact naming
  • and the selection rule

Define a batch

A set of configs (architecture, reward weights, env knobs) + seeds.

Train with fixed protocol

Same step budgets, same logging, same artifact paths.

Run evaluation gates

Same walk-forward windows + regime slices for every candidate.

Select by rule, not vibes

Pick the model that survives, not the one that looks impressive.
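
Here’s a sketch of the “define a batch” step. The config names and knobs are illustrative, not the actual batch-runner format:

```python
# Hypothetical sketch: expand configs x seeds into named, comparable runs.
from itertools import product

configs = {
    "mgt-small":  {"hidden_units": 64,  "churn_penalty": 0.05},
    "mgt-medium": {"hidden_units": 128, "churn_penalty": 0.05},
    "mgt-lowfee": {"hidden_units": 128, "churn_penalty": 0.10},
}
seeds = [1, 7, 42]

batch = [
    {"run_id": f"{name}-seed{seed}", "seed": seed, **knobs}
    for (name, knobs), seed in product(configs.items(), seeds)
]
print(len(batch))           # 9 runs, all trained and gated under the same protocol
print(batch[0]["run_id"])   # mgt-small-seed1
```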


The live loop: why “Chappie” mattered

Training was never the finish line.

BitMEX reality forced a final constraint:

If the system can’t execute under stress, it doesn’t matter what it learned.

The live loop made the hidden requirements concrete:

  • websockets drop
  • REST calls fail
  • positions drift
  • orders get weird statuses
  • and “503” is a normal day

The reason the late-2020 approach worked is that the agent’s behavior became compatible with:

  • low-frequency decision cadence
  • maker-biased intent
  • and survival under outages

This is why you’ll see 2020 move steadily toward “management-style” agents and execution constraints that match the exchange.
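
The survival posture is easier to see in code than in prose. This is a simulated sketch (the exchange call is a stand-in; retry counts and sleeps are illustrative), not the actual Chappie client:

```python
# Hypothetical sketch: retry-with-backoff for routine failures, and a kill
# switch when local state and the exchange stop agreeing (or can't be compared).
import random
import time

MAX_RETRIES = 5

def fetch_position():
    # Stand-in for a REST call; fails the way a normal BitMEX day sometimes does.
    if random.random() < 0.3:
        raise ConnectionError("503 Service Unavailable")
    return {"currentQty": 100}

def reconcile(local_qty):
    for attempt in range(MAX_RETRIES):
        try:
            remote_qty = fetch_position()["currentQty"]
            if remote_qty != local_qty:
                return "halt_and_flatten"      # positions drifted: kill switch
            return "ok"
        except ConnectionError:
            time.sleep(min(2 ** attempt, 30) * 0.01)   # backoff, scaled down for the demo
    return "halt_no_data"                      # can't see reality: stop trading

print(reconcile(local_qty=100))
```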


Final eval summary (how I think about it)

I don’t trust a single metric. I trust a set of survival signals:

  • max drawdown
  • fee-adjusted profit
  • trade churn
  • outage slice behavior
  • regime slice stability
  • variance across seeds

A model is good if it passes enough of those gates consistently.

If you’re building your own system:

Pick your “can’t ship if…” list early.
For me it became: large drawdowns, fragile behavior under outages, and regime collapse.
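
One hypothetical way to keep that list honest is to write the blockers down as checkable thresholds before the results come in; the numbers here are placeholders:

```python
# Illustrative "can't ship if..." blockers as explicit checks.
CANT_SHIP_IF = {
    "large_drawdown":        lambda r: r["max_drawdown_pct"] > 25.0,
    "fragile_under_outages": lambda r: r["outage_slice_return_pct"] < -5.0,
    "regime_collapse":       lambda r: min(r["per_regime_return_pct"].values()) < -10.0,
}

def blockers(report):
    return [name for name, violated in CANT_SHIP_IF.items() if violated(report)]

report = {
    "max_drawdown_pct": 18.0,
    "outage_slice_return_pct": -1.2,
    "per_regime_return_pct": {"bull": 14.0, "chop": 2.5, "drift": -12.0},
}
print(blockers(report))   # ['regime_collapse']
```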


Repo pointers (where these pieces live)

bitmex-deeprl-research (repo)

The full research rig: environments, training scripts, eval harnesses, and the execution client.

Environments (baseline → management)

The core contract where most “cheating” starts (and where most realism was added back in).

Key files referenced across the 2020 arc:

  • bitmex-gym/gym_bitmex/envs/bitmex_env.py
  • bitmex-management-gym/gym_bitmex_management/envs/bitmex_management_env.py
  • BitmexPythonChappie/main.py and the websocket/client wrappers
  • train-test-models.bat (batch runner)
  • ppo2_mgt_back_test.py (evaluation harness)

What I’m taking into 2021

This series started as “learn RL”.

It ended as “build a system that can survive”.

And that’s exactly why the 2021 series changes direction.

I’m switching into my professional lane of software architecture and full-stack systems:

  • React frontends
  • APIs and service boundaries
  • queues, retries, idempotency
  • databases, Redis, caching
  • microservices and deployment discipline

Because after 2020, I’m convinced the real differentiator isn’t “which algorithm”.

It’s whether you can build systems that don’t lie.


What’s next

The next chapter is a new series (2021): software architecture and full-stack system design — the stuff that makes complex systems stable, maintainable, and shippable.
