
Backprop stopped feeling like magic when I treated it like engineering - track shapes, follow the chain rule, and test gradients like you’d test any critical system.
Axel Domingues
In January, the perceptron gave me a comforting discovery: neural networks begin in familiar territory. A neuron is basically a weighted sum + an activation. Nothing supernatural.
Then February arrived and immediately removed that comfort.
Because the thing that makes neural networks actually learn is backpropagation.
And backpropagation has a reputation.
People talk about it like it’s a ritual: something you memorize and hope works, not something you actually understand.
That’s not how I wanted to learn it.
So I set myself a rule: don’t move on until backprop stops feeling like magic and starts feeling boring.
This post is the result of trying to make backprop boring.
What you’ll walk away with
A non-mystical mental model of backprop: forward caches, backward signals, per-layer gradients.
What to pay attention to
Backprop fails less from “math” and more from bookkeeping: shapes, element-wise vs matrix ops, bias placement.
How this post is structured
The mental model first, then the two implementation contracts, then the bookkeeping bugs that actually bit me, and finally how it all hooks back into gradient descent.
Backpropagation is not a separate algorithm.
It’s the same gradient descent loop I already understood in 2016, but with one difference:
the model is a composition of functions.
So instead of one derivative, I’m computing a chain of derivatives.
That’s it.
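Written out for a three-layer stack (my notation, nothing official), the model is one composed function, and the gradient of an early parameter is a product of per-layer factors:

\[
L(\theta) \;=\; \ell\big(f_3(f_2(f_1(x;\theta_1);\theta_2);\theta_3),\, y\big)
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial \theta_1}
\;=\;
\frac{\partial \ell}{\partial f_3}\,
\frac{\partial f_3}{\partial f_2}\,
\frac{\partial f_2}{\partial f_1}\,
\frac{\partial f_1}{\partial \theta_1}
\]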
Backprop isn’t hard because the math is advanced.
It’s hard because you’re doing the same simple operation many times, and one tiny mistake breaks everything.
The pain comes from bookkeeping: keeping shapes consistent, choosing element-wise vs matrix operations, and putting biases in the right place.
So I approached it like a software problem: define contracts for each layer, print shapes constantly, and test the gradients.
The cleanest way I found to think about backprop is credit assignment:
“The output was wrong. Who is responsible?”
Forward pass: every node computes its output and remembers what it saw.
Backward pass: the “you were wrong” signal flows back through the same nodes.
Gradients: each weight’s share of the responsibility.
That framing made it intuitive enough to survive implementation.
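Here’s roughly what that framing looks like in code. This is a minimal sketch with made-up sizes, sigmoid activations, and a squared-error loss:

```octave
% One hidden layer, sigmoid activations, a single training example.
sigmoid = @(z) 1 ./ (1 + exp(-z));
x  = rand(3, 1);   y  = rand(2, 1);          % made-up input / target
W1 = randn(4, 3);  b1 = zeros(4, 1);         % hidden layer (4 units)
W2 = randn(2, 4);  b2 = zeros(2, 1);         % output layer (2 units)

% Forward pass: compute outputs and remember every intermediate value.
z1 = W1 * x  + b1;   a1 = sigmoid(z1);
z2 = W2 * a1 + b2;   a2 = sigmoid(z2);
loss = 0.5 * sum((a2 - y) .^ 2);

% Backward pass: push the "who is responsible" signal back node by node.
delta2 = (a2 - y) .* a2 .* (1 - a2);         % blame at the output layer
delta1 = (W2' * delta2) .* a1 .* (1 - a1);   % blame handed to the hidden layer

% Gradients: each parameter's share of the blame.
dW2 = delta2 * a1';   db2 = delta2;
dW1 = delta1 * x';    db1 = delta1;
```

The forward half keeps a1 and a2 around precisely because the backward half needs them; nothing gets recomputed.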
The biggest shift for me this month was seeing neural networks as graphs of computations, not equations.
When you draw it as a graph: every node is a small operation that knows its own inputs, its own output, and nothing else.
That means you can reason locally: a node takes the gradient arriving at its output and turns it into gradients for its inputs and its own parameters.
Backprop is just repeating that, node by node.

Here’s the non-mathy version that helped me:
The loss depends on the final prediction.
Predictions depend on the last layer activations.
Those depend on earlier activations.
Those depend on weights.
For each weight, ask: “If I nudge this weight slightly, how does the loss react once the change flows through everything downstream?”
So every weight influences the loss indirectly, through everything that happens after it.
Backprop is just computing that nudge-response (the gradient of the loss with respect to every weight),
but doing it efficiently by reusing intermediate results instead of recomputing everything from scratch.
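In symbols (again, my notation), the chain for a single weight in layer l is:

\[
\frac{\partial L}{\partial w^{(l)}}
\;=\;
\underbrace{\frac{\partial L}{\partial a^{(L)}}
\cdot \frac{\partial a^{(L)}}{\partial a^{(L-1)}}
\cdots \frac{\partial a^{(l)}}{\partial z^{(l)}}}_{\text{computed once per layer, shared by every weight in it}}
\cdot\; \frac{\partial z^{(l)}}{\partial w^{(l)}}
\]

Everything under the brace is the backward signal for layer l; only the final factor changes from weight to weight. That sharing is the reuse.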
When implementing backprop, I kept two contracts in mind.
Forward pass gives me: the layer’s output, plus a cache of every intermediate value the backward pass will need.
Backward pass gives me: gradients for the layer’s own parameters, plus a gradient signal to hand to the previous layer.
If a layer couldn’t produce those two things, it wasn’t implemented correctly.
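As a sketch, the two contracts look like this (the function and field names are mine, and I’m assuming a dense sigmoid layer with a single example as a column vector):

```octave
% Forward contract: return the layer's output plus everything the
% backward pass will need (the cache).
function [a, cache] = dense_forward(W, b, x)
  z = W * x + b;
  a = 1 ./ (1 + exp(-z));                 % sigmoid activation (assumed)
  cache = struct('W', W, 'x', x, 'a', a);
end

% Backward contract: take the gradient arriving from above, return the
% layer's own parameter gradients plus the signal for the layer below.
function [dx, dW, db] = dense_backward(da, cache)
  dz = da .* cache.a .* (1 - cache.a);    % gate through the sigmoid
  dW = dz * cache.x';
  db = dz;
  dx = cache.W' * dz;                     % gradient handed to the previous layer
end
```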
Backprop errors are dangerous because they can be subtle.
A wrong gradient can still make the loss go down and make training look like it’s working.
So I leaned heavily on numerical gradient checking — not as a theory trick, but as a debugging tool.
The mindset: nudge each parameter a tiny amount, measure how the loss actually changes, and compare that slope to the gradient backprop computed.
If they match closely, you can trust your implementation.
This felt like unit tests for ML math.
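A minimal sketch of the check, using a stand-in cost function instead of a real network so it stays self-contained:

```octave
% Numerical gradient check: nudge each parameter a tiny epsilon in both
% directions and compare the measured slope to what backprop reported.
epsilon = 1e-4;
theta   = randn(6, 1);                   % made-up parameter vector
J       = @(t) 0.5 * sum(t .^ 2);        % stand-in cost
grad_bp = theta;                         % its analytic ("backprop") gradient

num_grad = zeros(size(theta));
for i = 1:numel(theta)
  e = zeros(size(theta));  e(i) = epsilon;
  num_grad(i) = (J(theta + e) - J(theta - e)) / (2 * epsilon);
end

% Relative difference; a value around 1e-9 or smaller means they agree.
rel_diff = norm(num_grad - grad_bp) / norm(num_grad + grad_bp);
printf('relative difference: %g\n', rel_diff);
```

Swap J and grad_bp for the real cost function and the real backprop output; the loop stays the same.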
Bias units appear in different places depending on your representation. Forgetting to add a bias column at the right time leads to shape mismatches at best, and silently wrong gradients at worst.
This is the Octave version of a classic production bug: two components that disagree about the shape of the data they hand to each other.
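For example, one representation folds the bias weights into the first column of the weight matrix and prepends a column of ones to the activations (the opposite of keeping a separate bias vector, as in the earlier sketch). A hypothetical fragment with made-up shapes:

```octave
% 5 examples, 3 features, 4 hidden units.
% Theta1 carries the bias weights in its first column.
X      = rand(5, 3);
Theta1 = rand(4, 3 + 1);

a1 = [ones(size(X, 1), 1), X];   % add the bias column exactly once, here
z2 = a1 * Theta1';               % 5x4: only lines up if both sides agree
                                 % on where the bias lives
```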
If I accidentally used matrix multiplication where I meant element-wise multiplication (or vice versa), the gradients went off the rails.
I started printing sizes constantly, not as throwaway debugging noise but as a normal part of the workflow.
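A hypothetical fragment (made-up sizes) showing both habits at once:

```octave
% In Octave, * is a matrix product and .* is element-wise.
% The backward gate through a sigmoid must be element-wise.
m  = 5;                                  % examples
a2 = rand(4, m);                         % hidden activations (4 units)
W2 = rand(3, 4);                         % weights into a 3-unit output layer
delta3 = rand(3, m);                     % gradient signal arriving from above

gprime2 = a2 .* (1 - a2);                % sigmoid'(z2), element-wise
delta2  = (W2' * delta3) .* gprime2;     % element-wise gate: correct
% (W2' * delta3) * gprime2 would be a matrix product with a different
% meaning (and, here, incompatible shapes).

% Printing sizes as routine practice, not just when something breaks:
printf('W2: %dx%d  delta3: %dx%d  delta2: %dx%d\n', ...
       size(W2), size(delta3), size(delta2));
```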
In 2016, I spent a lot of time making gradient descent feel intuitive.
That foundation mattered because it made one thing obvious:
backprop is not the learning algorithm. Gradient descent is.
Backprop is just the machinery that produces gradients for gradient descent.
Once I understood that separation, it stopped feeling like a dark art.
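The separation in code, with a stand-in gradient function where the real backprop would plug in:

```octave
% Gradient descent is the learning loop; "backprop" is whatever lives
% inside grad_fn and hands back dLoss/dtheta.
alpha   = 0.1;
theta   = randn(10, 1);                   % made-up parameter vector
grad_fn = @(t) 2 * t;                     % stand-in gradient (of ||t||^2)

for iter = 1:100
  theta = theta - alpha * grad_fn(theta); % the same 2016 update rule
end
```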
Before this month, “backprop” sounded like something you either memorized or avoided.
After this month, it became something much simpler:
Backprop is the chain rule turned into an efficient engineering procedure.
Not glamorous. Not mystical. Just extremely sensitive to implementation details.
And that realization gave me confidence — because now I can debug it like software.
Next month I want to understand why deep networks are hard to train even when backprop is correct.
I keep hearing the same phrase:
vanishing gradients
I now know what gradients are and how they flow.
March is about understanding what happens when they don’t.