
Implement backpropagation for a 2-layer neural network, verify gradients numerically, train on handwritten digits, and hit ~95% accuracy.
Axel Domingues
Exercise 3 was exciting because it used a neural network.
Exercise 4 is exciting because it makes you train one.
This is the assignment where “neural networks” stop being a mysterious box and start feeling like a system you can debug.
If you’ve ever heard that backprop is scary: I get it.
But after implementing it once in Octave, my takeaway was:
Backpropagation is mostly bookkeeping, as long as you respect the shapes.
What you implement
A full 2-layer neural network: cost function, backpropagation, regularization, and prediction.
What you verify
Gradients via numerical gradient checking — the safety net that builds confidence.
What you achieve
Train on handwritten digits and reach ~95% accuracy with your own learned weights.
- ex4.m — guided script for the exercise
- ex4data1.mat — 5000 handwritten digits
- ex4weights.mat — fixed weights for debugging cost
- displayData.m — visualize digits and learned features
- fmincg.m — optimizer for large parameter vectors
- computeNumericalGradient.m — numerical gradient checker
- checkNNGradients.m — validates backprop implementation

And the files you write yourself:

- sigmoidGradient.m
- randInitializeWeights.m
- nnCostFunction.m (cost + backprop gradients)

The dataset contains 5000 handwritten digits, each a 20x20 grayscale image unrolled into a 400-element row.
So you should expect:
- X is 5000 by 400
- y is 5000 by 1

And the classic label quirk:
If your model never predicts “0”, check your label convention. In this dataset, “0” maps to label 10.
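A quick sanity check in Octave makes both points visible (assuming the standard variable names X and y inside ex4data1.mat):

load('ex4data1.mat');   % loads X (5000 x 400) and y (5000 x 1)
size(X)                 % 5000 400
unique(y)'              % 1 2 3 4 5 6 7 8 9 10  (10 stands in for the digit 0)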
This exercise uses a neural network with an input layer of 400 units (one per pixel), a single hidden layer, and an output layer of 10 units (one per digit class).
You’ll work with two parameter matrices:
- Theta1 connects input -> hidden
- Theta2 connects hidden -> output

Most bugs here are shape bugs.
Print sizes aggressively while building this:
size(X)
size(Theta1)
size(Theta2)
If shapes aren’t what you think, fix that first.
Before backprop, the assignment has you implement the cost function for a neural network.
This is a good design choice:
When the script loads ex4weights.mat and calls your cost function, you should see:
0.287629 (unregularized)

Then when you add regularization (lambda = 1):

0.383770

These two numbers are huge confidence boosters when you hit them.
Those cost checkpoints are not “random facts.” They’re early unit tests for your implementation.
Regularization here follows the same pattern you learned in logistic regression: penalize the squared weights in Theta1 and Theta2, but never the bias terms.
In practice this means excluding the first column of Theta1 and Theta2 from the penalty.
A very common bug is regularizing the bias column. It won’t crash; it’ll just learn worse.
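As a rough sketch of that cost plus penalty (assuming m = size(X, 1), h is the matrix of output-layer activations from forward propagation, and Y is the one-hot label matrix described later), it could look like:

% cross-entropy cost over all m examples and all output classes
J = (1 / m) * sum(sum(-Y .* log(h) - (1 - Y) .* log(1 - h)));

% penalty: square every weight EXCEPT the bias columns (column 1)
reg = (lambda / (2 * m)) * ...
      (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));

J = J + reg;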
Backprop gives you gradients for every parameter.
Here’s the mental model that made it click for me:
For each training example:

- run forward propagation to get the activations at every layer
- compute the error at the output layer
- push that error back through the network to get the hidden-layer error
- accumulate the gradient contributions for Theta1 and Theta2

At the end, divide the accumulated gradients by m and add the regularization term.
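Here is a minimal per-example sketch of that loop in Octave. The names a1/a2/a3, delta2/delta3, and the accumulators Delta1/Delta2 (initialized as zero matrices the same size as Theta1/Theta2) are my own; a vectorized version looks different but does the same bookkeeping:

for t = 1:m
  % forward pass for example t (bias units added explicitly)
  a1 = [1; X(t, :)'];
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);

  % output-layer error against the one-hot label row
  delta3 = a3 - Y(t, :)';

  % hidden-layer error (drop the bias weights before applying sigmoidGradient)
  delta2 = (Theta2(:, 2:end)' * delta3) .* sigmoidGradient(z2);

  % accumulate gradient contributions
  Delta1 = Delta1 + delta2 * a1';
  Delta2 = Delta2 + delta3 * a2';
end

% average over the training set (regularization of the gradients not shown)
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;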
This function returns the derivative of sigmoid.
A clean implementation pattern is:
function g = sigmoidGradient(z)
s = sigmoid(z);
g = s .* (1 - s);
end
This function gets used in the hidden-layer error calculation.
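A quick sanity check: sigmoid(0) is 0.5, so the gradient there should be 0.25.

sigmoidGradient(0)   % expect 0.25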
Neural networks should not start with all-zero weights.
If all weights start equal, all hidden units learn the same thing.
So we initialize randomly in a small range:
function W = randInitializeWeights(L_in, L_out)
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
The exact epsilon isn’t the point. The point is to break symmetry: start from small random values so every hidden unit learns something different.
This is where everything comes together.
1. Turn the unrolled parameter vector into Theta1 and Theta2.
2. Compute activations layer by layer (add intercept units explicitly).
3. Include regularization, excluding bias columns.
4. Compute errors layer by layer and accumulate gradients.
5. Flatten gradients back into a vector for fmincg.
A typical skeleton:
function [J, grad] = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda)
% 1) reshape params into Theta1 and Theta2
% 2) forward propagation
% 3) cost (with regularization)
% 4) backprop gradients
% 5) unroll gradients
end
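For steps 1 and 5 specifically, here is how the reshape and unroll lines typically look (the index arithmetic is easy to get wrong, so print the sizes after reshaping):

% 1) rebuild the two weight matrices from the unrolled vector
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, (hidden_layer_size + 1));

% 5) unroll the gradients into one column vector for fmincg
grad = [Theta1_grad(:); Theta2_grad(:)];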
Neural nets typically use a “one-hot” representation for y: label 3 becomes a vector of ten zeros with a 1 in the third position.
This exercise expects you to build that Y matrix.
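One compact way to build it in Octave, assuming y holds labels 1 through num_labels:

% each row of Y is the one-hot encoding of the corresponding label in y
I = eye(num_labels);
Y = I(y, :);   % 5000 x num_labels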
The assignment gives you tooling to validate your backprop gradients:
- computeNumericalGradient.m
- checkNNGradients.m

This is one of the best learning tools in the whole course.
What it does: nudge each parameter by a tiny epsilon in both directions, recompute the cost, and compare the resulting numerical slope against the gradient your backprop code produced.
If they match closely, you’re probably correct.
Gradient checking is slow. Use it only for debugging. Once gradients look good, disable it and train normally.
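The core idea inside computeNumericalGradient.m is a central difference. Roughly, for each parameter index i (with theta the unrolled parameters and J a handle that returns just the cost):

e = 1e-4;
perturb = zeros(size(theta));
perturb(i) = e;
% numerical slope along parameter i; it should match grad(i) to many decimals
numgrad(i) = (J(theta + perturb) - J(theta - perturb)) / (2 * e);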
Once nnCostFunction is correct, training becomes simple:
- randomly initialize Theta1 and Theta2
- wrap nnCostFunction in an anonymous function handle
- hand everything to fmincg to minimize the cost

You’ll typically see something like:
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];
options = optimset('MaxIter', 50);
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
After training:

- reshape nn_params back into Theta1 and Theta2
- run predict on the training set

Once trained, the assignment suggests you should see training accuracy around 95%.
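Concretely, after rebuilding Theta1 and Theta2 with the same reshape calls as in nnCostFunction, the check is just this (assuming a predict helper, like the one from Exercise 3, that returns the most likely label for each row of X):

pred = predict(Theta1, Theta2, X);
fprintf('Training accuracy: %.2f%%\n', mean(double(pred == y)) * 100);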
That’s a huge jump from one-vs-all logistic regression.
And it’s a great sanity checkpoint that your backprop + optimization are working.
If accuracy is far below ~95%:

- re-run gradient checking; a subtle backprop bug is the most likely cause
- make sure you aren’t regularizing the bias columns
- give fmincg more iterations than the MaxIter of 50 used above
After training, the assignment has you visualize Theta1 weights.
Since each hidden unit connects to the 20x20 pixel input, you can reshape each row of weights into an “image.”
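In Octave that is typically a single displayData call, dropping the bias column first:

% each row of Theta1 (minus its bias weight) reshapes to a 20x20 patch
displayData(Theta1(:, 2:end));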
When you plot them, you’ll see hidden units that look like fuzzy stroke and edge detectors rather than recognizable digits.
It’s an early glimpse of representation learning — even though this is a very small network.
This is one reason I like doing ML from scratch: you don’t just get accuracy, you get insight into what the system learned.
Confirm the dataset looks like real digits, not noise.
Hit the cost checkpoint with provided weights (~0.287629).
Hit the regularized cost checkpoint (~0.383770).
Compute gradients and pass gradient checking.
Randomly initialize weights and optimize the parameters with fmincg.
Target ~95% training accuracy and inspect learned features.
When Exercise 4 fails, it usually fails for a few reasons: shape mismatches, a regularized bias column, or a wrong delta in backprop.
My fastest checks: print sizes, hit the two cost checkpoints, and run checkNNGradients.
Backprop bugs are painful when you debug by guessing. Gradient checking is your safety rope; use it.
The next post is the most practical assignment so far.
I’ll learn how to diagnose bias vs variance using learning curves, tune lambda with validation curves, and build a repeatable “what should I try next?” playbook instead of guessing when a model underperforms.
Exercise 5 - Debugging ML (Bias/Variance, Learning Curves, and What to Try Next)
The most practical assignment so far - diagnose bias vs variance using learning curves, tune lambda with validation curves, and build a repeatable “next action” playbook.
Exercise 3 - One-vs-All + Intro to Neural Networks (Handwritten Digits!)
Build a multi-class digit classifier with one-vs-all logistic regression, then run a small neural network forward pass on the same dataset.