
In 2016, gradient descent felt like “the algorithm.” In deep learning, it’s just the beginning. This month I learned why momentum and careful learning rates are what make training “actually move”.
Axel Domingues
In 2016, gradient descent was the hero of the story.
I built intuition for it in the Andrew Ng course.
It felt like a solid foundation.
So when I moved into deep learning, my default was still:
define a loss
compute gradients
run gradient descent
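To make that default concrete, here’s a minimal sketch of the 2016-style loop. The tiny linear-regression setup below is a placeholder problem of my own so there’s something to descend on; it isn’t from the course.

```python
import numpy as np

# Placeholder problem: fit y = 3x + noise with mean squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

w = np.zeros(1)   # parameters
alpha = 0.1       # the one knob I used to think about

for step in range(200):
    pred = X @ w                          # forward pass
    grad = 2 * X.T @ (pred - y) / len(y)  # compute gradients of the MSE loss
    w -= alpha * grad                     # run gradient descent

# On a convex bowl like this, the loss drops smoothly at every step.
```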
This month was when that mindset broke.
Because deep networks don’t just “optimize slowly”. They optimize in ways that feel like fighting the terrain.
June was about learning why plain gradient descent wasn’t enough — and why momentum and learning rate behavior are basically training infrastructure.
What you’ll learn
Why training can look “stuck” even when gradients exist — and how momentum + learning-rate control fixes that.
The 3 knobs in this post
Mini-batch (stochastic) gradients and the noise they add
Momentum
The learning rate, and how it changes over training
How to use this post
Read it once, then reuse the debug checklist at the end whenever training feels weird.
May taught me that initialization can block learning before it starts.
June taught me that even when initialization is reasonable and gradients exist…
training can still feel stuck.
The loss decreases a bit. Then oscillates. Then plateaus. Then randomly improves again.
It didn’t feel like the clean, smooth convergence from 2016 exercises.
And that’s when I started hearing the same advice over and over: add momentum, and check your learning rate.
This is where optimization got real.
In Andrew Ng’s course, I mostly lived in “batch gradient descent land”: compute the gradient over the entire training set, take one step, watch the loss drop smoothly, repeat.
Deep learning training doesn’t work like that in practice.
It’s dominated by stochastic gradient descent (SGD) or mini-batch SGD: every update is computed on a small random batch, so both the gradient and the loss curve are noisy from one step to the next.
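The change to the loop itself is small; what changes is the character of the updates. A sketch on the same kind of placeholder data, with each step computed on a random slice of it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

w = np.zeros(1)
alpha, batch_size = 0.1, 32

for epoch in range(10):
    order = rng.permutation(len(y))                # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                    # a small random batch
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # noisy estimate of the true gradient
        w -= alpha * grad                          # the loss can go *up* on any single step
```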
I knew this academically, but June is when I felt the consequences:
I stopped expecting the loss to drop every step.
I started asking: “Is the trend improving over a few hundred updates?”
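One habit that made this framing concrete: smooth the loss history before judging it. A small sketch of that bookkeeping (the 0.98 smoothing factor is an arbitrary choice of mine):

```python
def smooth(losses, beta=0.98):
    """Exponentially weighted average of a loss history: the curve worth watching."""
    avg, out = None, []
    for loss in losses:
        avg = loss if avg is None else beta * avg + (1 - beta) * loss
        out.append(avg)
    return out

# The raw per-step losses bounce around; the smoothed curve answers
# "is the trend improving over a few hundred updates?"
```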
That framing made the next concept click.

Momentum is one of those ideas that sounds like a tweak.
Then you use it and it feels like the training process suddenly has inertia.
The way I internalized it: each update moves along a running, exponentially weighted average of recent gradients, not just the current one.
So if the gradient keeps pointing roughly the same way for many steps, momentum accelerates progress.
If the gradient keeps flipping direction (noise / oscillation), momentum damps it.
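In code, that intuition is only a couple of lines. A sketch of the classic momentum update; beta = 0.9 is the usual default, but the exact values here are my own choices:

```python
import numpy as np

def momentum_step(w, grad, velocity, alpha=0.01, beta=0.9):
    """Accumulate gradients into a velocity, then step along the velocity."""
    velocity = beta * velocity + grad   # agreeing gradients pile up, flipping ones cancel
    w = w - alpha * velocity
    return w, velocity

# Usage inside a training loop: keep `velocity` around between steps,
# initialized to zeros of the same shape as the parameters.
w, velocity = np.zeros(3), np.zeros(3)
w, velocity = momentum_step(w, np.array([0.1, -0.2, 0.05]), velocity)
```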
This answered a practical question I kept running into:
Why does training bounce around so much even when the model is improving?
Momentum is basically a noise filter plus an accelerator.
In a shallow cost surface, gradients tend to be well-behaved.
In a deep network, I started imagining the loss surface as:
narrow ravines that are steep in some directions and nearly flat in others
wide plateaus where the gradient is tiny
saddle points everywhere, in millions of dimensions
That kind of terrain punishes plain gradient descent.
You get:
oscillation back and forth across the steep directions
a painfully slow crawl along the flat ones
long stretches where the loss barely moves at all
Momentum helps because it:
averages out the back-and-forth oscillation
builds up speed along the directions that stay consistent across steps
So momentum isn’t an “optional improvement” in deep learning.
It’s survival gear.
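A toy way to feel this: an elongated quadratic bowl, steep in one axis and shallow in the other, stands in for a ravine. This is my own illustrative setup, not anything from a course, but it shows plain gradient descent flipping sign across the steep axis while crawling along the shallow one, and momentum getting much closer to the minimum in the same number of steps.

```python
import numpy as np

# Toy "ravine": loss = 0.5 * (100 * p[0]**2 + p[1]**2), steep in x, shallow in y.
def grad(p):
    return np.array([100.0, 1.0]) * p

def run(use_momentum, alpha=0.015, beta=0.9, steps=100):
    p = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        g = grad(p)
        if use_momentum:
            v = beta * v + g      # flipping gradients cancel, consistent ones accumulate
            p = p - alpha * v
        else:
            p = p - alpha * g     # steep axis overshoots and flips sign, shallow axis crawls
    return np.linalg.norm(p)      # distance left to the minimum at the origin

print("plain GD distance to minimum:", run(use_momentum=False))
print("momentum distance to minimum:", run(use_momentum=True))
```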
I already learned in 2016 that learning rate matters.
But in deep learning, I learned a more frustrating truth:
With plain gradient descent, you can often pick a reasonable alpha and be okay.
With deep training, alpha feels like a control knob you need to monitor constantly:
Too low: Loss decreases, but so slowly you start doubting the whole setup.
Too high: Loss spikes, oscillates hard, or becomes NaN/Inf. Training feels unstable.
“Almost” right: Loss improves, then bounces/plateaus. Often needs momentum or a schedule to finish the job.
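The cheapest diagnostic I found was to run a short burst of training at a few candidate rates and compare where the loss ends up. A self-contained sketch on the same toy regression problem (the specific alphas are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

def short_run(alpha, steps=100):
    """Brief full-batch training at one learning rate; returns the final MSE."""
    w = np.zeros(1)
    for _ in range(steps):
        w -= alpha * 2 * X.T @ (X @ w - y) / len(y)
    return float(np.mean((X @ w - y) ** 2))

for alpha in (3.0, 0.3, 0.001):
    print(f"alpha={alpha:<6} final loss {short_run(alpha):.3g}")
# Too high (3.0) blows up, a reasonable rate (0.3) converges, too low (0.001) crawls.
```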
This month felt like learning rates became a first-class engineering concern.
This was a new mental leap for me.
In 2016, alpha was basically a constant I tuned.
Now I started absorbing the idea that the learning rate doesn’t have to stay constant: it can follow a schedule and change over the course of training.
Even before I had a perfect toolchain, that idea felt obvious in hindsight:
If you’re far from the solution, you take bigger steps.
If you’re close, you take smaller steps.
Deep learning made that practical.
Early training should show clear progress within a reasonable number of updates.
When improvement slows and the curve starts “hovering”, that’s a hint you may need smaller steps.
Lowering the rate later helps the model “land” instead of orbiting the solution.
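A minimal sketch of the simplest version of that idea, step decay: drop the rate by a constant factor every so many epochs. The numbers here (start at 0.1, halve every 30 epochs) are placeholders, not recommendations; fancier versions tie the drop to the loss actually plateauing.

```python
def stepped_learning_rate(epoch, base_alpha=0.1, drop_factor=0.5, epochs_per_drop=30):
    """Step decay: big steps early, smaller steps later so the model can land."""
    return base_alpha * (drop_factor ** (epoch // epochs_per_drop))

for epoch in (0, 29, 30, 60, 90):
    print(epoch, stepped_learning_rate(epoch))   # 0.1, 0.1, 0.05, 0.025, 0.0125
```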
This month made me appreciate how much of 2016 ML was secretly about optimization hygiene: scaling features, picking alpha, and watching the cost curve to confirm convergence.
Deep learning just pushes those concerns to the front.
So the connection is: 2016 taught me the math of optimization.
2017 is teaching me the operations of optimization.
2016 optimization mindset: Gradient descent felt like the algorithm. You tuned alpha, scaled features, and it converged.
2017 optimization mindset: Optimization feels like a system: noise, inertia, schedules, and stability all interact.
Same core idea: gradients and descent.
Different reality: training is noisy, fragile, and sensitive to control knobs.
This month I started writing down a more disciplined loop.
Loss barely moves at all.
Likely causes: learning rate too low, or an initialization/scale problem (May’s lesson).
Loss spikes, oscillates wildly, or goes NaN/Inf.
Likely causes: learning rate too high.
Loss jumps around from step to step even though the trend is fine.
Likely causes: mini-batch noise; judge the trend over a few hundred updates, and consider momentum.
Loss improves, then bounces around the same value without settling.
Likely causes: steps too large near the solution; add momentum or lower the learning rate.
Loss plateaus after early progress.
Likely causes: time to decay the learning rate so the model can land instead of orbiting.
It’s not perfect yet, but it’s already a better mental model than “maybe add layers”.
Before June, I thought optimization was a solved piece: “compute gradients and descend.”
After June, I see it like this:
Optimization is not a single algorithm. It’s a set of dynamics you have to engineer.
Momentum and learning rates aren’t “extras”. They’re what makes deep learning training feel possible in practice.
By June, optimization finally felt under control — or at least less mysterious.
But something else was becoming obvious:
even with better optimizers, fully connected networks still struggled to make sense of images.
Next month, I turn to Convolutional Neural Networks and try to understand why they worked so much better for vision tasks — not because of better training tricks, but because of architectural bias.
July is about convolutions, locality, and why encoding domain structure into the model itself changed everything.