
Build a multi-class digit classifier with one-vs-all logistic regression, then run a small neural network forward pass on the same dataset.
Axel Domingues
Exercise 3 is where ML stopped being “toy problems” for me.
We’re classifying handwritten digits (0–9) from pixel data — the kind of task you can actually imagine in a product.
The assignment has two parts:
- Part 1 goal: train 10 binary classifiers (one per digit) and choose the digit with the highest probability.
- Part 2 goal: run a pre-trained neural network forward pass (no training yet) and see it hit high accuracy.
Key “gotchas”
- Most bugs are shape bugs, so check size(...) constantly.
- The digit “0” is stored as label 10 (more on that below).

The files in this exercise:
- ex3.m — Part 1 (one-vs-all)
- ex3_nn.m — Part 2 (neural network forward pass)
- ex3data1.mat — 5000 examples of handwritten digits
- ex3weights.mat — provided neural network weights
- displayData.m — shows digits in a grid (sanity check)
- fmincg.m — optimizer (like fminunc, better for many params)
- sigmoid.m — sigmoid function
- lrCostFunction.m — regularized logistic regression cost + gradient
- oneVsAll.m — trains K classifiers and stores all_theta
- predictOneVsAll.m — predicts all examples in one vectorized shot
- predict.m — forward propagation prediction for the neural net

The scripts (ex3.m, ex3_nn.m) are designed like a guided debugger: run them top-to-bottom and validate each checkpoint before moving on.
The dataset has:
- 5000 training examples of handwritten digits
- each example stored as a 20×20 grayscale image, unrolled into a 400-dimensional feature vector

So X is a 5000 by 400 matrix.
There’s one detail that will bite you if you don’t read it:
If your model seems to never predict 0, check your label mapping. In this dataset, “0” is label 10.
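A quick way to make that visible before writing any code that depends on it (a minimal sketch; the digit variable is my own, not something the exercise asks for):

load('ex3data1.mat');    % gives X (5000 x 400) and y (5000 x 1)
unique(y)'               % prints 1 2 ... 10; note there is no 0
digit = mod(y, 10);      % optional remap: label 10 becomes digit 0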
The first thing the script does is show a grid of random digits using displayData.m.
This is not optional for me.
When you’re debugging a model, being able to see the input makes it much easier to sanity-check what you’re actually feeding it.
If the digits look like noise, don’t even start training.
Logistic regression is naturally a binary classifier.
One-vs-all turns it into a multi-class classifier by training K separate binary classifiers, one per class: classifier c learns to separate “is digit c” from “is any other digit”.
At prediction time, you run all K classifiers on an example and pick the class with the highest probability.
This is the heart of Part 1.
lrCostFunction computes:
- the cost J
- the gradient grad

And it includes regularization.

Key implementation rule: do not regularize the bias term (theta(1)).

Vectorized implementation pattern:
function [J, grad] = lrCostFunction(theta, X, y, lambda)
  m = length(y);                       % number of training examples
  h = sigmoid(X * theta);              % predictions for all examples at once

  theta_reg = theta;
  theta_reg(1) = 0;                    % zero out the bias term so it isn't regularized

  J = (1/m) * sum( -y .* log(h) - (1 - y) .* log(1 - h) ) ...
      + (lambda/(2*m)) * sum(theta_reg .^ 2);

  grad = (1/m) * (X' * (h - y)) + (lambda/m) * theta_reg;
end
If fmincg behaves strangely later, it’s usually because your gradient is wrong or you accidentally regularized theta(1).
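A cheap way to catch a wrong gradient early is to compare it against finite differences on a tiny made-up problem. This is just a sketch of that idea; the test data and variable names (X_t, y_t, theta_t, epsilon) are mine, not part of the exercise files:

% Tiny numerical gradient check for lrCostFunction (not part of the official scripts).
X_t = [ones(8, 1) rand(8, 3)];   % 8 fake examples, 3 features plus the intercept column
y_t = rand(8, 1) > 0.5;          % fake binary labels
theta_t = randn(4, 1);
lambda_t = 3;

[J, grad] = lrCostFunction(theta_t, X_t, y_t, lambda_t);

epsilon = 1e-4;
num_grad = zeros(size(theta_t));
for i = 1:numel(theta_t)
  e = zeros(size(theta_t));
  e(i) = epsilon;
  num_grad(i) = (lrCostFunction(theta_t + e, X_t, y_t, lambda_t) - ...
                 lrCostFunction(theta_t - e, X_t, y_t, lambda_t)) / (2 * epsilon);
end
disp(max(abs(grad - num_grad)));  % should be tiny, on the order of 1e-9

Exercise 4 formalizes this as gradient checking; here it is just a five-minute insurance policy.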
Now we train K classifiers.

Practical flow:
- Add the column of ones to X (the intercept term).
- Loop over the classes c = 1..K and build binary labels y_c = (y == c).
- Train one classifier per class and store each theta as a row of all_theta.

A typical structure:
% Inside oneVsAll(X, y, num_labels, lambda):
m = size(X, 1);
n = size(X, 2);
all_theta = zeros(num_labels, n + 1);   % one row of parameters per class
X = [ones(m, 1) X];                     % add the intercept column

for c = 1:num_labels
  initial_theta = zeros(n + 1, 1);
  y_c = (y == c);                       % 1 for "this digit", 0 for everything else
  options = optimset('MaxIter', 50);
  [theta] = fmincg(@(t)(lrCostFunction(t, X, y_c, lambda)), initial_theta, options);
  all_theta(c, :) = theta';             % store the trained parameters as row c
end
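The main script then just calls it. If I remember right, ex3.m uses lambda = 0.1, but treat that value as an assumption rather than gospel:

num_labels = 10;                                 % digits 1..9, plus "0" stored as 10
lambda = 0.1;                                    % regularization strength (from memory)
all_theta = oneVsAll(X, y, num_labels, lambda);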
Two important engineering observations here: fmincg copes with this many parameters far better than fminunc would (that's why the exercise ships it), and the exact same lrCostFunction is reused for all 10 classifiers; only the binary labels y_c change.
This is where the matrix mindset becomes worth it.
Instead of predicting digit-by-digit in a loop, we predict for all examples and all classes in one shot.
Typical approach:
% X still carries the intercept column, so it is m x (n+1)
probs = sigmoid(X * all_theta');   % m x num_labels: one probability per class, per example
[~, p] = max(probs, [], 2);        % pick the most probable class for every row

p is the vector of predicted labels, one per example.
Checkpoint from the assignment: around 94.9% accuracy.
This accuracy is on the training set (not a true generalization measure), but it’s a great “did I implement it correctly?” checkpoint.
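The number itself is just the fraction of predictions that match the labels, scaled to a percentage; this is the standard one-liner the course scripts use:

fprintf('Training Set Accuracy: %f\n', mean(double(p == y)) * 100);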
After one-vs-all, the assignment introduces a small neural network.
Important: in Exercise 3, you do not train the neural network.
You’re given pre-trained weights (Theta1 and Theta2) and you implement prediction via forward propagation.
This is basically: “use a neural network as a function.”
The network has:
- an input layer with 400 units (one per pixel), plus a bias unit
- a hidden layer with 25 units
- an output layer with 10 units, one per digit class

The weights are stored as matrices:
- Theta1 maps the input layer to the hidden layer
- Theta2 maps the hidden layer to the output layer

Most bugs here are shape/indexing bugs. Print sizes early and often.
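Loading the provided weights and checking their shapes takes ten seconds and catches most transpose mistakes before they happen. The exact dimensions in the comments are what I remember from ex3weights.mat:

load('ex3weights.mat');   % provides Theta1 and Theta2
size(Theta1)              % 25 x 401: hidden units x (400 pixels + bias)
size(Theta2)              % 10 x 26:  output classes x (25 hidden units + bias)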
The forward pass is a sequence of matrix multiplies + sigmoid activations.
Implementation pattern:
function p = predict(Theta1, Theta2, X)
  m = size(X, 1);

  a1 = [ones(m, 1) X];        % input layer activations, plus the bias unit
  z2 = a1 * Theta1';
  a2 = sigmoid(z2);           % hidden layer activations
  a2 = [ones(m, 1) a2];       % add the bias unit to the hidden layer
  z3 = a2 * Theta2';
  a3 = sigmoid(z3);           % output layer: one score per class

  [~, p] = max(a3, [], 2);    % predicted class = index of the largest output
end
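Used end-to-end it is a single call; this mirrors what ex3_nn.m does once the data and weights are loaded:

pred = predict(Theta1, Theta2, X);     % one predicted label per row of X
mean(double(pred == y)) * 100          % overall training accuracy, same one-liner as before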
Checkpoint from the assignment: around 97.5% training accuracy, a bit better than the one-vs-all classifier.
After that, ex3_nn.m shows an interactive sequence: it displays one random digit at a time, prints the network’s prediction for it, and waits for a keypress before moving on.
That interactive viewer was one of the most motivating things in the early course.
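From memory, that loop looks roughly like this; treat the prompt text and variable names as approximations, not a quote of the script:

m = size(X, 1);
rp = randperm(m);                                  % visit the examples in random order
for i = 1:m
  displayData(X(rp(i), :));                        % show one digit
  pred = predict(Theta1, Theta2, X(rp(i), :));
  fprintf('Neural Network Prediction: %d (digit %d)\n', pred, mod(pred, 10));
  s = input('Paused - press enter to continue, q to exit: ', 's');
  if strcmpi(s, 'q')
    break;
  end
end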
At this point, the dataset is big enough that slow code becomes painful: 5000 examples, 400 features each, and 10 classifiers to train.
If you try to compute gradients with nested loops over examples and features, Octave will punish you.
Vectorization gave me two wins: speed, and code that actually reads like the math it implements.
Vectorization isn’t just a performance trick. It’s how you keep the implementation aligned with the model you think you’re building.
When Exercise 3 breaks, it usually breaks in predictable ways:
- the missing column of ones in X
- regularizing theta(1) by mistake
- multiplying by Theta1 / Theta2 the wrong way (forgot transpose)
- digit 0 stored as label 10 (and you didn’t account for it)

My first-line debugging tools:
size(X)
size(theta)
size(all_theta)
size(Theta1)
size(Theta2)
And I always re-run the scripts from the top.
Next up is the big leap: training a neural network.
I’ll implement backpropagation for a 2-layer network, verify gradients with numerical gradient checking, then train on handwritten digits and push toward ~95% accuracy with my own learned weights (not pre-trained ones).