
2020 is when I stopped training agents and started building a trading system: environments, evaluation discipline, safety, and a live loop that survives outages. This is the postmortem — and the first result that actually held up in reality.
Axel Domingues
In 2018, I was obsessed with RL itself: algorithms, stability tricks, papers, and “can I get it to learn anything at all?”
In 2019, BitMEX forced me to grow up: microstructure, fees, partial fills, queue priority, and those infamous exchange failures that make a backtest look smart… right until it meets reality.
And in 2020, something finally clicked:
I stopped treating my work as a model and started treating it as a system.
This post is the “what changed” write-up. It’s also the moment this project stopped being a research toy and started producing a result I could run live — and trust.
By the end of 2020, I had a model I’d been running live since June 2020.
Over roughly the last 6 months, the portfolio was up ~5× — and yes, that includes the collateral appreciation (BTC moving from ~10k to ~30k), but it also includes a lot of consistent, fee-aware behavior that didn’t collapse under normal operational stress.
The point is: the agent’s behavior survived reality long enough to compound.
That only happened once the environment, evaluation, and execution loop started behaving like an engineered system.
In mid-2021 I had to stop the live trading process, not because the agent “broke”, but because I missed a crucial BitMEX perpetuals mechanic in my environment + live accounting:
the recurring carry cost of holding open positions (funding, which BitMEX perps exchange every 8 hours).
Everything still looked amazing while the market kept grinding upward and longs were “easy mode”. But once BitMEX got crowded on the long side and the market stopped moving at the same pace, the system entered a new regime:
Over time, the funding payments became higher and higher relative to the edge the agent was producing in that slower market. That’s when I realized: with perpetual contracts, there’s a catch you can ignore only until the market teaches you otherwise.
This is the new “don’t lie to yourself” rule I learned in 2021:
If your environment doesn’t model funding / carry, your backtest is missing a real force that can dominate returns — especially when the market goes sideways and positioning gets crowded.
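To make that concrete, here’s a minimal sketch of the missing term, assuming an inverse XBTUSD-style perp (1 contract = 1 USD, position value in BTC = contracts / price) where funding is exchanged every 8 hours. The function names are illustrative, not the actual env code:

```python
# Minimal sketch of the missing carry term, assuming an inverse perp
# (XBTUSD-style). On BitMEX, funding is exchanged every 8 hours;
# longs pay shorts when the rate is positive. Names are illustrative.

def funding_payment_btc(contracts: float, price: float, funding_rate: float) -> float:
    """Funding cost in BTC for one 8-hour funding event.

    `contracts` is signed: positive = long, negative = short.
    Positive return value = the position pays; negative = it receives.
    """
    position_value_btc = contracts / price  # inverse-contract notional
    return position_value_btc * funding_rate

# Inside the env's (and the live accounting's) step: whenever a funding
# timestamp is crossed while holding a position, the balance takes the hit.
def apply_funding(balance_btc: float, contracts: float, price: float,
                  funding_rate: float, crossed_funding_ts: bool) -> float:
    if crossed_funding_ts and contracts != 0:
        balance_btc -= funding_payment_btc(contracts, price, funding_rate)
    return balance_btc
```

Back-of-the-envelope: a 100,000-contract long at $10,000 with a 0.05% funding rate pays (100,000 / 10,000) × 0.0005 = 0.005 BTC per event, roughly 0.015 BTC per day. In a slow, crowded market, that carry can quietly outgrow the edge.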
The humbling part: this wasn’t a bug in PPO, or even in my feature set.
It was a missing piece of market reality — and perps are full of those.
If I had to compress the year into a single sentence:
Every “cheat” I removed made training harder — and made the result more real.
These were the big inflection points (the ones that mattered), and they are what the rest of this post walks through: regime-aware evaluation, earlier stopping, aligned reward terms, evaluation gates, batch training, and a live loop that survives outages.
At the end of 2020, the project is not “PPO on BitMEX data”.
It’s a pipeline: environment → reward → batch training → evaluation gates → live execution.
Here’s the mental picture: the unit of progress is not “a model checkpoint”; it’s a system version (env + reward + eval + execution) that survives.
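A hypothetical way to pin that down in code (none of these names come from the actual repos):

```python
from dataclasses import dataclass

# Hypothetical illustration: the unit of progress is the pinned combination
# of everything that produced a result, not the checkpoint alone.
@dataclass(frozen=True)
class SystemVersion:
    env_spec: str        # which fees / fills / outage behavior the env models
    reward_spec: str     # which set of reward terms and weights
    eval_spec: str       # which walk-forward windows and regime slices
    execution_spec: str  # which live-loop and order-placement rules

# A checkpoint only "survives" relative to one specific SystemVersion.
```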
The biggest trap wasn’t overfitting in the classic ML sense.
It was regime bias.
By mid-2020 I had become hyper-aware that I had far more bull data than bear/sideways data, and worse: the “validation” slices often shared the same bull personality.
So the agent learned a personality: a bull-market one that looked incredible on bull-heavy slices and fell apart in slow drifts and different regimes.
The fix wasn’t a single parameter.
The fix was discipline: evaluate against slices that disagree with your training distribution.
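A minimal sketch of what that discipline can look like, assuming simple return-based regime labels; the 10% threshold is illustrative, not a number from the actual harness:

```python
import numpy as np

# Label each evaluation window's "personality" so validation can be forced
# to include regimes the training data lacks. Threshold is illustrative.

def label_regime(closes: np.ndarray, trend_thresh: float = 0.10) -> str:
    total_return = closes[-1] / closes[0] - 1.0
    if total_return > trend_thresh:
        return "bull"
    if total_return < -trend_thresh:
        return "bear"
    return "sideways"

def disagreeing_slices(windows: list, train_regime: str) -> list:
    """Keep only evaluation windows whose personality differs from training."""
    return [w for w in windows if label_regime(w) != train_regime]
```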
For a while I believed:
“If I train long enough, it’ll converge.”
In practice, I learned the opposite for this domain:
By late 2020, I consistently saw the strongest models come from early stops around 1.5M–2M steps (sometimes “2M + 2M” style staged runs), not the 35M–70M monster runs I was doing months earlier.
Long training is not a substitute for honest evaluation.
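Here’s a hedged sketch of what a staged “2M + 2M” run with gate-based early stopping could look like. It assumes the original stable-baselines library (the repo’s ppo2_* naming points that way), and make_env() / evaluate_gates() are stand-ins for the real env constructor and evaluation harness:

```python
from stable_baselines import PPO2

# Sketch of a "2M + 2M" staged run with gate-based early stopping.
# make_env() and evaluate_gates() are hypothetical stand-ins for the
# real env constructor and the walk-forward evaluation harness.

STAGE_STEPS = 2_000_000
MAX_STAGES = 4

model = PPO2("MlpPolicy", make_env(), verbose=0)
best_score, best_path = float("-inf"), None

for stage in range(MAX_STAGES):
    model.learn(total_timesteps=STAGE_STEPS)
    path = f"checkpoints/ppo2_stage_{stage}"
    model.save(path)
    score = evaluate_gates(path)   # survival across regime slices, not raw PnL
    if score <= best_score:
        break                      # more steps stopped helping: keep the earlier model
    best_score, best_path = score, path
```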
The core idea I learned: reward shaping is allowed…
…but only if it teaches a behavior that exists in live trading.
In 2020, the reward system evolved into a set of aligned terms, each one earning its place by teaching a behavior that exists in live trading.
Instead of thinking “what reward gets the best backtest?”, I started thinking:
What reward makes the agent behave like a trader who wants to stay alive?
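The exact terms aren’t listed in this post, so treat the sketch below as an illustration of the shape, not the real 2020 reward function; the constraint that matters is that every term maps to something a live trader actually feels:

```python
# Illustrative only: the exact 2020 terms and weights are not these.
# The rule: each term must teach a behavior that exists in live trading.

def shaped_reward(realized_pnl: float, fees_paid: float,
                  drawdown: float, w_fee: float = 1.0, w_dd: float = 0.5) -> float:
    reward = realized_pnl        # only PnL the exchange would actually credit
    reward -= w_fee * fees_paid  # fees are real; churn has to pay for itself
    reward -= w_dd * drawdown    # staying alive beats a lucky spike
    return reward
```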
The best thing I built in 2020 wasn’t an LSTM or a fancy architecture.
It was evaluation gates.
Every model needed to face multiple evaluation windows, and I grouped those windows by “market personality”: bull, bear, and sideways/chop.
A model didn’t “win” because it crushed one window.
A model “won” if it was not embarrassed by several windows that disagreed with each other.
Ship only if it survives X slices.
Not “ship if one run looks great”.
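A minimal sketch of such a gate; the thresholds here are assumptions, not the harness’s real numbers:

```python
# Sketch of the "ship only if it survives X slices" rule.
# Thresholds are illustrative assumptions.

def passes_gates(slice_results: list, min_survived: int = 4,
                 max_drawdown: float = 0.25) -> bool:
    """slice_results: dicts like {"regime": "bear", "return": 0.02, "drawdown": 0.11}."""
    survived = [r for r in slice_results if r["drawdown"] <= max_drawdown]
    regimes = {r["regime"] for r in survived}
    # Must survive enough slices AND more than one market personality.
    return len(survived) >= min_survived and len(regimes) >= 2
```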
This is where the project became engineering.
Instead of “train one model”, the workflow became: train a batch, evaluate every candidate the same way, and keep only the survivors.
The batch runner + evaluation harness turned results into something comparable.
No more “that run felt good”.
Now I had:
- A set of configs (architecture, reward weights, env knobs) + seeds.
- Same step budgets, same logging, same artifact paths.
- Same walk-forward windows + regime slices for every candidate.
- A selection rule: pick the model that survives, not the one that looks impressive.
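A sketch of the batch-runner idea behind train-test-models.bat, where every candidate gets the same step budget and artifact layout; train_and_eval() is a hypothetical stand-in for the real train + evaluation entry points:

```python
import itertools
import json
import os

# Every candidate gets the same step budget, the same logging,
# and the same artifact layout, so results are comparable.
# train_and_eval() is a hypothetical stand-in.

CONFIGS = [
    {"arch": "mlp",  "w_dd": 0.5},
    {"arch": "lstm", "w_dd": 1.0},
]
SEEDS = [0, 1, 2]

for cfg, seed in itertools.product(CONFIGS, SEEDS):
    run_id = f"{cfg['arch']}_dd{cfg['w_dd']}_seed{seed}"
    out_dir = os.path.join("runs", run_id)
    os.makedirs(out_dir, exist_ok=True)
    metrics = train_and_eval(cfg, seed, steps=2_000_000, out_dir=out_dir)
    with open(os.path.join(out_dir, "metrics.json"), "w") as f:
        json.dump(metrics, f)  # comparable artifacts, not "that run felt good"
```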
Training was never the finish line.
BitMEX reality forced a final constraint:
If the system can’t execute under stress, it doesn’t matter what it learned.
The live loop made the hidden requirements concrete: reconnect after websocket drops, re-sync position state from the exchange, live with partial fills and queue priority, and pay real fees on every decision.
The reason the late-2020 approach worked is that the agent’s behavior became compatible with those constraints instead of assuming them away.
This is why you’ll see 2020 move steadily toward “management-style” agents and execution constraints that match the exchange.
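The sketch below shows only the outage posture, with a hypothetical client standing in for the actual websocket/client wrappers in BitmexPythonChappie. The behavior that mattered: on disconnect, back off, reconnect, and re-sync state from the exchange before placing another order.

```python
import time

# Outage posture only; `client` is a hypothetical stand-in for the real
# websocket/client wrappers. On disconnect: back off, reconnect, and
# re-sync state from the exchange before acting again.

def run_live_loop(client, max_backoff: int = 60) -> None:
    backoff = 1
    while True:
        try:
            client.connect()
            client.sync_state()       # positions/orders from the exchange, not from memory
            backoff = 1
            while client.is_connected():
                client.step()         # observe -> decide -> place/amend orders
        except ConnectionError:
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)  # capped exponential backoff
```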
I don’t trust a single metric. I trust a set of survival signals: bounded drawdowns, behavior that holds up across regime slices, and a live loop that stays sane under operational stress.
A model is good if it passes enough of those gates consistently.
Pick your “can’t ship if…” list early.
For me it became: large drawdowns, fragile behavior under outages, and regime collapse.
Key files referenced across the 2020 arc:
- `bitmex-gym/gym_bitmex/envs/bitmex_env.py`
- `bitmex-management-gym/gym_bitmex_management/envs/bitmex_management_env.py`
- `BitmexPythonChappie/main.py` and the websocket/client wrappers
- `train-test-models.bat` (batch runner)
- `ppo2_mgt_back_test.py` (evaluation harness)

This series started as “learn RL”.
It ended as “build a system that can survive”.
And that’s exactly why the 2021 series changes direction.
I’m switching into my professional lane: software architecture and full-stack systems.
Because after 2020, I’m convinced the real differentiator isn’t “which algorithm”.
It’s whether you can build systems that don’t lie.
Was the ~5× result all the agent? No: BTC’s collateral appreciation played a major role. But the point is that the execution + behavior didn’t self-destruct, and fees/rebates and consistent trading behavior contributed materially over time.
What was the most valuable thing I built? Evaluation gates. Once you require a model to survive multiple regimes and slices, most “amazing” candidates disappear, and the survivors tend to be the ones you can actually run.
What failure mode kept showing up? Models with a “bull personality”: they look incredible in bull-heavy training/validation, then collapse in slow drifts or different regimes. The fix is not more training; it’s better slicing and honest evaluation.
What’s next? A new series (2021) on software architecture and full-stack system design: the stuff that makes complex systems stable, maintainable, and shippable.