Mar 29, 2020
Reward Shaping Without Lying - Penalties, Constraints, and the First Real Fixes

In March 2020, I stopped treating reward like a number and started treating it like a contract—pay real costs, punish real risk, and don’t teach the agent to win a video game.

Axel Domingues

In 2018 I learned RL as algorithms: value functions, policies, advantage, exploration.

In 2019 I learned trading as plumbing: order books, features, labels, baselines, and the first humbling curves.

In January and February 2020, I finally put the two together: a Gym environment plus evaluation discipline.

March is where I hit the wall:

If the reward is wrong, everything else is theatre.

What you’ll learn in this post
  • why “PnL reward” is not a complete spec (and how it teaches bad habits)
  • the first penalties/constraints that made training behave
  • how I encoded costs that exist in reality (fees, time, and the pain of being wrong)
  • a practical checklist for reward debugging in trading environments

The uncomfortable truth: the agent will do exactly what you pay it to do

Early on, I thought reward shaping was a kind of “extra credit.”

Like: first get it working, then add a few penalties.

Reality: reward shaping is the behavior spec.

If you pay for the wrong proxy, the agent will happily:

  • churn trades because you didn’t price transaction costs
  • hold losers forever because you didn’t price time and risk
  • learn fragile timing tricks because your reward leaks information through the episode design

Not because it’s “smart”.

Because that’s literally the job.


What I started with (and why it failed)

My first version was basically:

  • reward up when the position moves in the right direction
  • reward down when it moves against me

That sounds reasonable until you remember what a real exchange charges you for existing:

  • fees (taker costs hurt, maker rebates matter)
  • time (being right late is not the same as being right now)
  • inventory (being in a position is a risk state, not just a score multiplier)

So the agent did what you’d expect from a badly-specified game:

  • take actions too often
  • keep positions open too long when “hope” is free
  • behave differently depending on the episode start state

Training wasn’t “unstable because RL is hard.”

Training was unstable because I was paying for incoherent behavior.


The goal: shape reward using only things that exist in reality

My rule became:

If I can’t write it as a line item in a trading ledger, it doesn’t belong in reward.

So I focused on 3 categories that are real and unavoidable:

  1. Costs (fees)
  2. Time risk (staying wrong / staying exposed)
  3. A bias toward “fixing mistakes” (don’t let unrealised losses become unpriced debt)

All of this landed in bitmex_env.py and, importantly, it stayed simple enough to audit.


Fix #1 — Fees as part of the reward contract

If your environment doesn’t price fees, you teach the agent that “more actions” is always an option.

So I made fees explicit:

  • when opening a position, apply maker/taker fee behavior
  • when closing a position, apply fees again

In the environment, this shows up as adding fee terms into the reward path.

Not as a lecture.

As a bill.
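
Roughly, the sketch below is the shape of it. Fee rates and names here are illustrative placeholders, not the exact code in bitmex_env.py:

```python
# Minimal sketch, not the actual bitmex_env.py code: fee rates and names
# are assumed placeholders chosen for illustration.
TAKER_FEE = 0.00075   # taker fee as a fraction of notional (assumed value)
MAKER_FEE = -0.00025  # maker rebate shows up as a negative cost (assumed value)

def fee_cost(notional: float, is_taker: bool) -> float:
    """Cost of one fill: positive for taker fees, negative for maker rebates."""
    return notional * (TAKER_FEE if is_taker else MAKER_FEE)

def movement_reward_with_fees(pnl_delta: float, notional: float,
                              opened: bool, closed: bool,
                              is_taker: bool = True) -> float:
    """Base movement reward with the fee bill subtracted on open and close."""
    reward = pnl_delta
    if opened:
        reward -= fee_cost(notional, is_taker)
    if closed:
        reward -= fee_cost(notional, is_taker)
    return reward
```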

Effect:

  • reduced pointless flipping
  • forced the agent to earn more than the fee before it could “feel good” about acting

Fix #2 — Reward scaling, because gradients are not impressed by your PnL

Markets don’t care about your neural network.

Neural networks care a lot about scale.

So I introduced a simple multiplier to keep reward magnitudes inside a learnable range.

This is not finance.

This is interface design.

If reward spikes, training becomes noise.

If reward is too tiny, the agent never updates meaningfully.

So: scale it, then keep it consistent.
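
A minimal sketch of that interface decision, with an assumed placeholder constant rather than the value actually used:

```python
# Minimal sketch: one consistent multiplier applied wherever reward is emitted.
# REWARD_SCALE is an assumed placeholder, not the value in bitmex_env.py.
REWARD_SCALE = 100.0

def scale_reward(raw_reward: float) -> float:
    """Keep per-step rewards in a range the optimiser can actually learn from."""
    return raw_reward * REWARD_SCALE
```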


Fix #3 — Unrealised PnL shaping (the first “stop lying to yourself” rule)

A big failure mode was letting unrealised losses be soft.

If the agent can keep a position open forever with no increasing pain, it will.

Because “maybe it comes back” is free.

So I made two asymmetric rules:

  • losing unrealised PnL hurts more (amplified punishment)
  • winning unrealised PnL doesn’t pay forever (the positive signal is capped after a while)
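
A minimal sketch of those two rules, with made-up coefficients standing in for the real ones:

```python
# Minimal sketch of the asymmetric unrealised-PnL shaping.
# LOSS_AMPLIFIER and PROFIT_CAP_HOURS are assumed placeholders.
LOSS_AMPLIFIER = 2.0     # unrealised losses hurt more than symmetric PnL would
PROFIT_CAP_HOURS = 12.0  # unrealised gains stop adding reward after this long

def unrealised_shaping(unrealised_pnl: float, hours_in_position: float) -> float:
    """Amplify the pain of sitting in a loser; cap the reward for camping in a winner."""
    if unrealised_pnl < 0:
        return unrealised_pnl * LOSS_AMPLIFIER
    if hours_in_position > PROFIT_CAP_HOURS:
        return 0.0
    return unrealised_pnl
```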

This is me admitting a truth about the system:

In trading, the worst thing is not being wrong — it’s staying wrong with leverage and time.

So the reward started nudging the agent toward:

  • cut losers earlier
  • don’t camp in positions forever waiting for drift

Not because “it’s morally good.”

Because my environment needs a policy that can survive the messy reality of regimes.


Fix #4 — Time-in-position penalties (risk is a state, not a score)

Then I added the simplest form of risk constraint:

  • punish being in a position too long
  • punish staying flat too long after closing
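
In code this is little more than two thresholds and a per-hour weight. The numbers below are placeholders, not tuned values:

```python
# Minimal sketch of the time-in-position and time-flat penalties.
# Thresholds and the per-hour weight are assumed placeholders.
MAX_HOLD_HOURS = 24.0
MAX_FLAT_HOURS = 48.0
HOURLY_PENALTY = 0.01

def time_penalty(in_position: bool, hours_in_position: float,
                 hours_since_close: float) -> float:
    """Negative reward for staying exposed too long, or staying frozen too long."""
    if in_position and hours_in_position > MAX_HOLD_HOURS:
        return -HOURLY_PENALTY * (hours_in_position - MAX_HOLD_HOURS)
    if not in_position and hours_since_close > MAX_FLAT_HOURS:
        return -HOURLY_PENALTY * (hours_since_close - MAX_FLAT_HOURS)
    return 0.0
```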

These are deliberately crude.

They’re not a full risk model.

They’re a behavioral bias:

  • don’t turn every trade into a long-term bet
  • don’t freeze forever either

This created a tempo.

Not a strategy.

But tempo matters, because it defines what kind of policy can exist.


The hidden part: reward shaping also changes what the agent needs to observe

In the baseline Gym, opening a position commits the whole portfolio.

So I didn’t need a “position size” state variable yet.

But I did need the agent to know if it was currently exposed.

So the observation isn’t just microstructure features.

It’s features + a tiny state header:

  • long open?
  • short open?
  • current unrealised PnL percent
  • hours since open
  • hours since close

This is the minimum amount of “self-awareness” required for the reward constraints to make sense.
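
Concretely, the observation ends up looking something like this sketch (field names and ordering are illustrative assumptions, not the actual env attributes):

```python
import numpy as np

# Minimal sketch of the observation layout: a small state header prepended to
# the market features. Names and ordering are assumed for illustration.
def build_observation(features: np.ndarray, long_open: bool, short_open: bool,
                      unrealised_pnl_pct: float, hours_since_open: float,
                      hours_since_close: float) -> np.ndarray:
    state_header = np.array([
        float(long_open),    # is a long currently open?
        float(short_open),   # is a short currently open?
        unrealised_pnl_pct,  # current unrealised PnL, in percent
        hours_since_open,    # how long the position has been on
        hours_since_close,   # how long since the last close
    ], dtype=np.float32)
    return np.concatenate([state_header, features.astype(np.float32)])
```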

This becomes a bigger theme later in 2020:

a reward constraint without the corresponding state is just a random shock.


“But aren’t penalties cheating?”

This is where I had to get honest about what I was building.

I wasn’t trying to build a pure academic RL benchmark.

I was trying to build a policy that trades in a system that fails.

So constraints are not cheating.

They’re the only way to teach “behavior under risk” without waiting 6 months for the agent to discover it by accident.

The only real rule is:

  • the penalty must correspond to something real
  • the evaluation must still be strict (walk-forward, out-of-sample)

If you do that, shaping is not a lie.

It’s a curriculum.


The practical debugging loop I used

Reward shaping sounds philosophical until you instrument it.

Then it becomes engineering.

Here’s the checklist I kept near the code:

  • Decompose reward into components: fees, unrealised shaping, time penalties, base movement reward. If you can’t log it, you can’t trust it.
  • Plot per-component distributions: if one term dominates, you’re training on that term, not on “trading”.
  • Look for sign bugs: a single flipped sign turns “avoid fees” into “farm fees”.
  • Validate invariants with tiny scenarios: one candle up, one candle down, one open/close sequence. If the reward surprises you there, it will destroy you at scale.
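
One cheap way to make the first two items automatic is to return the decomposition through the Gym info dict. A sketch, with assumed names:

```python
# Minimal sketch: expose the reward decomposition through the step() info dict
# so every term can be logged and plotted separately. Names are assumptions.
def compose_reward(base: float, fees: float, unrealised: float, time_pen: float):
    components = {
        "base": base,
        "fees": fees,
        "unrealised_shaping": unrealised,
        "time_penalty": time_pen,
    }
    return sum(components.values()), components

# Inside the environment's step():
#   reward, components = compose_reward(base, fees, unrealised, time_pen)
#   info = {"reward_components": components}
#   return obs, reward, done, info
```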


What changed in behavior

Once these fixes went in, training stopped feeling like superstition.

The agent became:

  • less frantic (fees made acting expensive)
  • less stubborn (unrealised losses stopped being free)
  • more consistent across episode starts (time penalties created a stable tempo)

And that mattered because it let me focus on the next problem:

how to wire this into a real process and keep it alive.

That’s April.


Resources

Repo — bitmex-deeprl-research

All the environment work lives here. Start with bitmex-gym/gym_bitmex/envs/bitmex_env.py.

Sutton & Barto (free book site)

The RL fundamentals I leaned on when I needed to reason about reward, value, and behavior.


What’s next

Next post: Chappie Wiring.

This is the month the project stops being “training” and becomes a loop that can fail.
