Aug 28, 2016 - 18 MIN READ

Exercise 4 - Neural Networks Learning (Backpropagation Without Tears)

Implement backpropagation for a 2-layer neural network, verify gradients numerically, train on handwritten digits, and hit ~95% accuracy.

Axel Domingues

Exercise 3 was exciting because it used a neural network.

Exercise 4 is exciting because it makes you train one.

This is the assignment where “neural networks” stop being a mysterious box and start feeling like a system you can debug:

  • define a cost function
  • compute gradients with backpropagation
  • verify those gradients numerically (gradient checking)
  • optimize parameters
  • validate accuracy

If you’ve ever heard that backprop is scary: I get it.

But after implementing it once in Octave, my takeaway was:

Backpropagation is mostly bookkeeping, as long as you respect the shapes.

What you implement

A full 2-layer neural network: cost function, backpropagation, regularization, and prediction.

What you verify

Gradients via numerical gradient checking — the safety net that builds confidence.

What you achieve

Train on handwritten digits and reach ~95% accuracy with your own learned weights.


What’s inside the Exercise 4 bundle


Dataset: 5000 digits, 20x20 pixels, flattened to 400 features

The dataset contains:

  • 5000 training examples
  • each example is a 20 by 20 grayscale image
  • each image is “unrolled” into 400 features (one row)

So you should expect:

  • X is 5000 by 400

And the classic label quirk:

  • digit 0 is labeled as 10 (because Octave indexing starts at 1)

If your model never predicts “0”, check your label convention. In this dataset, “0” maps to label 10.
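
A quick sanity check right after loading the data (assuming y is the 5000 x 1 label vector) makes the convention obvious:

unique(y)'      % should print 1 through 10 (10 stands in for the digit 0)
sum(y == 10)    % number of "0" digits in the training set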


The model: a 2-layer neural network

This exercise uses a neural network with:

  • input layer: 400 features (+ intercept)
  • hidden layer: 25 units (+ intercept)
  • output layer: 10 units (digits 1–10, where 10 represents digit 0)

You’ll work with two parameter matrices:

  • Theta1 connects input -> hidden
  • Theta2 connects hidden -> output

Most bugs here are shape bugs.

Print sizes aggressively while building this:

size(X)
size(Theta1)
size(Theta2)

If shapes aren’t what you think, fix that first.


Part 1 — Feedforward and cost (warm-up)

Before backprop, the assignment has you implement the cost function for a neural network.

This is a good design choice:

  • forward pass is simpler
  • you can validate your cost using provided weights
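
The unregularized cost itself is one vectorized line once the forward pass is done. A minimal sketch, assuming a3 holds the 5000 x 10 output activations and Y is the one-hot label matrix built later in the post:

% cross-entropy cost summed over all m examples and all 10 output units
J = (1 / m) * sum(sum(-Y .* log(a3) - (1 - Y) .* log(1 - a3)));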

The provided checkpoint

When the script loads ex4weights.mat and calls your cost function, you should see:

  • cost about 0.287629 (unregularized)

Then when you add regularization:

  • cost about 0.383770

These two numbers are huge confidence boosters when you hit them.

Those cost checkpoints are not “random facts.” They’re early unit tests for your implementation.
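
A minimal way to reproduce those checkpoints yourself, assuming the data file is ex4data1.mat and nnCostFunction follows the assignment's signature (the regularized checkpoint corresponds to lambda = 1):

load('ex4data1.mat');       % X (5000 x 400), y (5000 x 1)
load('ex4weights.mat');     % pre-trained Theta1 and Theta2

nn_params = [Theta1(:); Theta2(:)];

% unregularized cost: expect roughly 0.287629
J = nnCostFunction(nn_params, 400, 25, 10, X, y, 0)

% regularized cost with lambda = 1: expect roughly 0.383770
J_reg = nnCostFunction(nn_params, 400, 25, 10, X, y, 1)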


Regularization (same habit as Exercise 2)

Regularizing the bias/intercept columns is a silent performance killer.

  • Nothing crashes.
  • Gradients look reasonable.
  • Accuracy just… underperforms.

Always exclude the first column of Theta1 and Theta2.

Regularization here follows the same pattern you learned in logistic regression:

  • penalize large weights
  • do not regularize bias/intercept columns

In practice this means:

  • when adding the penalty, ignore the first column of Theta1 and Theta2
  • when adding regularization to gradients, also ignore the first column

A very common bug is regularizing the bias column. It won’t crash — it’ll just learn worse.
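
In Octave both rules come down to slicing off column 1. A sketch, assuming m examples and that Theta1_grad / Theta2_grad already hold the averaged backprop gradients from the next section:

% cost penalty: every weight except the bias columns
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                            sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + reg;

% gradient penalty: same rule, leave column 1 untouched
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + (lambda / m) * Theta2(:, 2:end);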


Part 2 — Backpropagation (the main event)

A useful way to think about backprop:

  • forward pass → what did we predict?
  • backward pass → who is responsible for the error?
  • gradients → how should each weight change to reduce it?

Backprop is not magic — it’s structured bookkeeping.

Backprop gives you gradients for every parameter.

Here’s the mental model that made it click for me:

  • forward pass computes predictions
  • backward pass computes “blame” for each layer
  • gradients are just “input * blame” accumulated over the dataset

The backprop flow (in words)

For each training example:

  1. run forward propagation and store activations
  2. compute output error (difference between prediction and true label)
  3. propagate that error back to hidden layer
  4. accumulate gradients for Theta1 and Theta2

At the end:

  • average over all examples
  • add regularization (excluding bias columns)
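
Here is a minimal per-example sketch of that flow, assuming a one-hot label matrix Y and the 25 x 401 / 10 x 26 shapes for Theta1 / Theta2 (a fully vectorized version works just as well):

Delta1 = zeros(size(Theta1));          % 25 x 401 accumulator
Delta2 = zeros(size(Theta2));          % 10 x 26 accumulator

for t = 1:m
  % 1) forward pass, storing activations
  a1 = [1; X(t, :)'];                  % 401 x 1, bias unit added
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];               % 26 x 1, bias unit added
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);                    % 10 x 1 output

  % 2) output error against the one-hot label
  delta3 = a3 - Y(t, :)';

  % 3) propagate the error to the hidden layer (drop the bias weights)
  delta2 = (Theta2(:, 2:end)' * delta3) .* sigmoidGradient(z2);

  % 4) accumulate gradients
  Delta1 = Delta1 + delta2 * a1';
  Delta2 = Delta2 + delta3 * a2';
end

% average over all examples; regularization is added afterwards
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;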

Implement sigmoidGradient.m

This function returns the derivative of sigmoid.

A clean implementation pattern is:

  • compute sigmoid(z)
  • return sigmoid(z) .* (1 - sigmoid(z))

function g = sigmoidGradient(z)
  s = sigmoid(z);
  g = s .* (1 - s);
end

This function gets used in the hidden-layer error calculation.


Implement randInitializeWeights.m

Neural networks should not start with all-zero weights.

If all weights start equal, all hidden units learn the same thing.

So we initialize randomly in a small range:

function W = randInitializeWeights(L_in, L_out)
  epsilon_init = 0.12;
  W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end

The exact epsilon isn’t the point — the point is:

  • random
  • small

Implement nnCostFunction.m (cost + gradients)

This is where everything comes together.

The engineering structure I followed

Reshape parameters

Turn the unrolled parameter vector into Theta1 and Theta2.

Forward propagation

Compute activations layer by layer (add intercept units explicitly).

Compute cost

Include regularization, excluding bias columns.

Backpropagation

Compute errors layer by layer and accumulate gradients.

Unroll gradients

Flatten gradients back into a vector for fmincg.

A typical skeleton:

function [J, grad] = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lambda)

  % 1) reshape params into Theta1 and Theta2

  % 2) forward propagation

  % 3) cost (with regularization)

  % 4) backprop gradients

  % 5) unroll gradients

end
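
For steps 1 and 5, the reshape/unroll pattern looks roughly like this, assuming parameters are unrolled Theta1 first, then Theta2:

% 1) rebuild Theta1 (25 x 401) and Theta2 (10 x 26) from the unrolled vector
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, hidden_layer_size + 1);

% 5) unroll the gradients into one vector for fmincg
grad = [Theta1_grad(:); Theta2_grad(:)];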

One-hot encode labels

Neural nets typically use a “one-hot” representation for y:

  • if the digit is 3, the label vector has a 1 at position 3 and 0 elsewhere

This exercise expects you to build that Y matrix.
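
One compact way to build it, assuming y holds labels 1 through 10:

% 5000 x 10 one-hot matrix: row t has a 1 in column y(t), zeros elsewhere
I = eye(num_labels);
Y = I(y, :);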


Gradient checking (how I stopped being afraid)

The assignment gives you tooling to validate your backprop gradients:

  • computeNumericalGradient.m
  • checkNNGradients.m

This is one of the best learning tools in the whole course.

What it does:

  • slightly perturbs parameters
  • estimates gradients numerically
  • compares them to your backprop gradients

If they match closely, you’re probably correct.
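
Under the hood it is a two-sided finite difference. A sketch of the idea for a single parameter i (checkNNGradients loops this over all parameters of a small test network); costFunction and grad are the same handle and vector you already compute:

epsilon = 1e-4;
perturb = zeros(size(nn_params));
perturb(i) = epsilon;                      % nudge only parameter i

numgrad_i = (costFunction(nn_params + perturb) - ...
             costFunction(nn_params - perturb)) / (2 * epsilon);

abs(numgrad_i - grad(i))                   % should be very small if backprop is right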

Gradient checking is slow. Use it only for debugging. Once gradients look good, disable it and train normally.

Use gradient checking when:

  • implementing backprop for the first time
  • refactoring cost / gradient code

Disable it when:

  • training for real
  • performance matters

It’s a debugger, not a training tool.

Train with fmincg

Once nnCostFunction is correct, training becomes simple:

  • initialize weights
  • call fmincg to minimize the cost

You’ll typically see something like:

initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);

initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];

options = optimset('MaxIter', 50);

costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);

[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

After training:

  • reshape nn_params back into Theta1 and Theta2
  • compute predictions with predict
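
Assuming you reshape nn_params the same way as inside nnCostFunction, the accuracy check is short:

% reshape nn_params back into Theta1 / Theta2 (same pattern as in nnCostFunction),
% then measure training accuracy with the provided predict function
pred = predict(Theta1, Theta2, X);
fprintf('Training accuracy: %.1f%%\n', mean(double(pred == y)) * 100);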

Expected training accuracy (checkpoint)

Once trained, the assignment suggests you should see:

  • training accuracy about 95.3%
  • it may vary by about 1% due to random initialization

That’s a huge jump from one-vs-all logistic regression.

And it’s a great sanity checkpoint that your backprop + optimization are working.

If accuracy is far below ~95%:

  • check you didn’t regularize bias columns
  • check your label mapping (0 -> 10)
  • rerun gradient checking
  • verify your one-hot encoding

Visualizing the hidden layer (my favorite part)

After training, the assignment has you visualize Theta1 weights.

Since each hidden unit connects to the 20x20 pixel input, you can reshape each row of weights into an “image.”
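
With the displayData helper from the exercise this is a one-liner (the first column of Theta1 is the bias weight, so drop it):

% each row of Theta1(:, 2:end) has 400 weights -> one 20x20 "image" per hidden unit
displayData(Theta1(:, 2:end));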

When you plot them, you’ll see hidden units that look like:

  • stroke detectors
  • blob detectors
  • diagonal / edge-ish patterns

It’s an early glimpse of representation learning — even though this is a very small network.

This is one reason I like doing ML from scratch: you don’t just get accuracy, you get insight into what the system learned.


Steps I followed (the workflow that kept me sane)

Visualize digits with displayData

Confirm the dataset looks like real digits, not noise.

Implement nnCostFunction cost only

Hit the cost checkpoint with provided weights (~0.287629).

Add regularization to the cost

Hit the regularized cost checkpoint (~0.383770).

Implement sigmoidGradient and backprop

Compute gradients and pass gradient checking.

Train with fmincg

Random initialize weights, optimize parameters.

Validate accuracy and visualize hidden units

Target ~95% training accuracy and inspect learned features.


Debugging checklist (things that actually matter)

When Exercise 4 fails, it usually fails for a few reasons:

  • wrong reshape/unroll logic
  • missing intercept columns in activations
  • regularizing the bias column
  • wrong one-hot encoding
  • forgetting to average gradients over m
  • mixing matrix ops and element-wise ops

My fastest checks:

  • print sizes
  • run checkNNGradients
  • confirm cost checkpoints

Backprop bugs are painful when you debug by guessing. Gradient checking is your safety rope — use it.


What I’m keeping from Exercise 4

  • Backprop is not magic; it’s structured error propagation.
  • Gradient checking is a practical technique that builds trust in your implementation.
  • Random initialization is required for symmetry breaking.
  • Regularization applies to weights, not bias.
  • Once cost + gradients are correct, training is mostly an optimizer call.

What’s Next

The next post is the most practical assignment so far.

I’ll learn how to diagnose bias vs variance using learning curves, tune lambda with validation curves, and build a repeatable “what should I try next?” playbook instead of guessing when a model underperforms.

Axel Domingues - 2026