
April 2017 — I used to treat activation functions like a minor math choice. Then I saw how one change (ReLU) could decide whether a deep network learns at all.
Axel Domingues
March was the first month where deep learning stopped feeling like “a bigger version of ML” and started feeling like its own discipline.
I learned a hard truth:
A deep model can be correct, expressive, and theoretically powerful — and still fail to learn anything useful.
Most of that pain showed up as gradients that either vanished quietly or exploded noisily.
So April was about looking for the first practical fix.
And I kept seeing the same thing mentioned everywhere:
ReLU.
At first I was skeptical. An activation function? Really?
In 2016, activation functions felt like a detail. Logistic regression had a sigmoid. Neural nets had sigmoid or tanh. End of story.
April taught me that activation functions are not a detail.
They’re infrastructure.
What you’ll learn
Why activation functions are not “just a curve”, and how one choice can decide whether a deep model trains at all.
The key mental model
Activations don’t only shape outputs — they shape the learning signal (gradients) flowing backward from the loss.
Practical takeaway
You’ll leave with a simple debug lens: “Is my activation helping gradients survive depth?”
Coming from classical ML, I had a very stable mental model:
Activations looked like “just a curve” inside the model.
But deep networks exposed a new dependency:
The activation also shapes the learning signal that must travel backward through the network.
Activations shape what the network can represent.
Gradients flow from the loss back toward earlier layers.
So an activation can be “expressive enough” and still make training fail by weakening the learning signal.
If the activation crushes gradients, training collapses.
So the activation function is not only about expressiveness.
It’s also about trainability.
Sigmoid is intuitive: it squashes any input into (0, 1) and reads naturally as a probability.
But when I tried to imagine a deep stack of sigmoid activations, I kept running into the same behavior: each layer multiplies the backward signal by the sigmoid’s derivative, which is at most 0.25 and close to zero wherever the unit saturates.
So even if backprop is correct, the early layers get almost no learning signal.
One slightly-shrunk learning signal isn’t a disaster.
But deep nets repeat that shrinkage many times, until early layers effectively hear silence.
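To convince myself, I wrote a tiny sketch (plain NumPy, toy numbers of my own, and it ignores the weight matrices entirely). Even in the best case, where every sigmoid unit sits at its steepest point, each layer multiplies the backward signal by 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at exactly 0.25, at x = 0

# Best case: every unit sits at x = 0, the steepest point of the curve.
# Even then, the factor reaching layer 1 shrinks like 0.25 ** depth.
for depth in [1, 5, 10, 20]:
    print(depth, sigmoid_grad(0.0) ** depth)
# 1  -> 0.25
# 5  -> ~0.00098
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```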
March explained the problem. April made me understand why ReLU was a turning point.

ReLU (Rectified Linear Unit) is almost embarrassingly simple: keep positive inputs exactly as they are, and replace negative inputs with zero. In other words, ReLU(x) = max(0, x).
That’s it.
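In code it’s basically one line. A minimal sketch of my own (nothing framework-specific), showing both the function and its slope:

```python
import numpy as np

def relu(x):
    # Pass positive inputs through unchanged, zero out everything else.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Slope is exactly 1 where the unit is active, 0 where it isn't.
    # (At x == 0 it isn't differentiable; treating the slope as 0 there is a common convention.)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
```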
Why it helped
For positive inputs, ReLU avoids constant squashing — the learning signal can stay usable through depth.
Why it surprised me
It’s not “fancier math.” It’s a training-dynamics fix.
What changed in my framing
I stopped asking “is it expressive?” first — and started asking “does it train reliably?”
No curve fitting. No probability interpretation. No smooth “S-shape”.
At first it felt too simple to be meaningful.
Then I realized the key property: wherever the input is positive, ReLU’s slope is exactly 1, so the gradient passes through that unit unchanged.
This matters enormously in deep networks.
Because the biggest problem I had in March wasn’t expressiveness.
It was signal preservation.
ReLU is a signal-preserving activation for the region where the neuron is “active”.
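Here’s the “debug lens” from the top of the post as a toy experiment I put together. Everything in it is an arbitrary choice of mine (40 layers, width 64, a weight scale picked so the linear part alone roughly preserves signal size): push a gradient backward through a stack of random layers and see how much of it reaches layer 1 with each activation.

```python
import numpy as np

rng = np.random.default_rng(0)
DEPTH, WIDTH = 40, 64   # arbitrary illustrative sizes

def grad_norm_at_layer_1(act, act_grad):
    """Forward through DEPTH random linear+activation layers, then push a
    gradient of ones backward and return its norm when it reaches layer 1."""
    h, Ws, zs = rng.standard_normal(WIDTH), [], []
    for _ in range(DEPTH):                              # forward pass
        W = rng.standard_normal((WIDTH, WIDTH)) * np.sqrt(2.0 / WIDTH)
        z = W @ h
        Ws.append(W)
        zs.append(z)
        h = act(z)
    g = np.ones(WIDTH)                                  # pretend dLoss/d(last hidden) = 1
    for W, z in zip(reversed(Ws), reversed(zs)):        # backward pass
        g = W.T @ (g * act_grad(z))
    return np.linalg.norm(g)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print("sigmoid:", grad_norm_at_layer_1(sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))))
print("relu:   ", grad_norm_at_layer_1(lambda z: np.maximum(0.0, z),
                                       lambda z: (z > 0).astype(float)))
# The sigmoid gradient arrives vanishingly small; the ReLU gradient arrives at a usable scale.
```

One honest caveat: that weight scale is itself a choice, and it matters a lot in its own right (initialization shows up again at the end of this post).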
This month I started seeing deep learning differently:
Classical ML mindset
Choose the right objective + model family, then optimize and regularize.
Deep learning mindset
Keep training dynamics healthy long enough for learning to happen (signal flow, stability, initialization).
That shift explains why deep learning breakthroughs so often look like small engineering fixes to training dynamics.
Not because the objective is new.
But because training deep networks is fragile.
ReLU isn’t free.
There’s a failure mode I ran into quickly in my reading and experiments:
If a neuron’s pre-activation stays negative, its output is always 0. And when the output is always 0, the gradient flowing through it is exactly 0 as well, so its weights stop updating and nothing ever pushes it back into the active region.
People call this a “dead ReLU”.
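A toy illustration of what that looks like, with hypothetical numbers and a single unit whose bias I deliberately pushed far negative: the output, the gradient, and therefore the update are all exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single ReLU unit whose bias has been pushed far negative:
# its pre-activation is below zero for every input in this batch.
X = rng.standard_normal((100, 3))           # a batch of 100 inputs, 3 features
w, b = np.array([0.1, -0.2, 0.3]), -50.0

z = X @ w + b                  # pre-activations: all well below zero here
out = np.maximum(0.0, z)       # outputs: all exactly 0 -> the unit never fires
grad_out = np.ones_like(z)     # pretend the upstream gradient is 1 everywhere
grad_z = grad_out * (z > 0)    # ReLU gate: zero wherever z <= 0
grad_w = X.T @ grad_z          # gradient w.r.t. the unit's weights
grad_b = grad_z.sum()          # gradient w.r.t. the bias

print(out.max())               # 0.0
print(grad_w, grad_b)          # all zeros -> no update, the unit stays dead
```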
So ReLU trades one kind of problem (saturation everywhere) for another (units can die).
But the trade is worth it, because a few dead units cost you some capacity, while saturation everywhere starves the whole network of learning signal.
I started appreciating ReLU not as a mathematical choice, but as an engineering choice:
It’s a practical fix to a real system constraint: gradients must remain usable.
That mindset felt similar to practical lessons I learned in 2016.
But this was deeper — literally.
This month made me revisit a familiar idea from 2016:
the optimization landscape matters.
In the ML course, I learned that the loss function and the model family shape the optimization landscape.
In deep learning, activation functions also shape the landscape — not only the function you can represent, but how gradients behave across layers.
So the continuity was there: optimization still lives or dies by the shape of the landscape.
But now the model’s internal design choices actively affect whether optimization works at all.
Even without a full deep learning framework, my thinking shifted: I stopped judging a model only by what it can represent and started asking whether the learning signal can survive it.
That is a huge mental shift.
Before April, I thought activation functions were mostly about shaping outputs.
After April, I understood the deeper truth:
Activation functions are about whether learning is possible at all.
ReLU didn’t just improve performance.
It made depth practical.
ReLU helped gradients survive, but it doesn’t solve everything.
Next month I’m focusing on another silent factor that decides whether deep training behaves:
initialization.
In classical ML, starting weights rarely mattered much.
In deep networks, the starting point can decide whether the whole system learns or collapses.
May is about that fragility — and how people started engineering around it.