Nov 27, 2016 - 12 MIN READ
Exercise 7 - Unsupervised Learning (K-means) + PCA (Compression & Visualization)


Implement K-means clustering, compress an image to 16 colors, then use PCA to reduce dimensions and build eigenfaces.

Axel Domingues



In the first half of Andrew Ng’s course, everything is supervised: you get labeled data, you minimize a cost, and you measure accuracy.

Exercise 7 is where that comfort disappears (in a good way).

This exercise marks a real shift: there are no labels, no “correct answers”, and no accuracy metric. You’re learning to reason about structure, geometry, and information loss instead.
  • K-means: “discover structure” by grouping similar points.
  • PCA: “compress structure” by keeping only the directions that matter.

What this exercise builds

  • K-means clustering from scratch (assignment step + update step)
  • Image compression: a 128x128 color image reduced to a 16-color palette
  • PCA via SVD: 2D data projected to 1D and approximately recovered
  • Eigenfaces: 1024-pixel face images compressed to 100 principal components

Part 1 — K-means clustering

K-means is surprisingly simple once you stop thinking of it as “AI” and start thinking of it as a loop:

  1. Assign each point to the closest centroid
  2. Update each centroid to be the mean of the points assigned to it
  3. Repeat (in the exercise, for a fixed number of iterations)

The implementation in the exercise is split across these functions:

  • findClosestCentroids.m — assignment step
  • computeCentroids.m — update step
  • runkMeans.m — orchestration (already provided; a stripped-down sketch follows)
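
Here is roughly what that orchestration loop does, assuming the two functions above keep the exercise’s signatures (the provided runkMeans.m also handles plotting, omitted here):

% Alternate between the two steps for a fixed number of iterations
centroids = initial_centroids;
for iter = 1:max_iters
  idx = findClosestCentroids(X, centroids);   % assignment step
  centroids = computeCentroids(X, idx, K);    % update step
end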

The core idea (no magic)

K-means is refreshing because there is:
  • no cost function to differentiate (the distortion it minimizes falls out of the two steps themselves)
  • no gradients
  • no learning rate
Just distance, averaging, and repetition.

“Closest centroid” is just distance

For each example x(i), compute its distance to each centroid mu(k), and take the centroid index with the smallest distance.

In this exercise it’s the standard “squared distance” idea:

  • subtract the centroid from the point
  • square the differences
  • sum them

No labels. No gradients. Just geometry.

A great way to sanity check your assignment step: print the idx values for a few points and see if they make sense visually. In the provided script, after implementing findClosestCentroids.m, the expected output for the first three training examples is: [1 3 2]


Implementation notes (Octave)

findClosestCentroids.m

The “beginner” way is a double loop: examples x centroids. It’s fine for learning.

The "lesson” for me was: the course constantly nudges you toward vectorization.

Even if you keep an outer loop over examples, you can compute the distances to all centroids in one shot by subtracting a centroid matrix from the example.

% For each example X(i, :) of shape (1 x n), centroids is (K x n):
% broadcasting the subtraction gives (K x n), summing across columns gives (K x 1)

idx = zeros(size(X, 1), 1);
for i = 1:size(X, 1)
  diffs = centroids - X(i, :);      % (K x n)
  dists = sum(diffs .^ 2, 2);       % (K x 1) squared distances to every centroid
  [~, idx(i)] = min(dists);         % index of the closest centroid
end

computeCentroids.m

The centroid update is “mean of assigned points”. In plain language:

  • for each centroid k
  • find all points where idx == k
  • set centroids(k,:) to the average of those points

The simplest implementation uses a loop over k:

for k = 1:K
  points = X(idx == k, :);
  centroids(k, :) = mean(points, 1);
end

You also need to think about the edge case where a centroid gets zero points assigned. For the course datasets it usually doesn’t happen, but in real data it can, and production systems must handle it:

  • reinitialize the centroid (a sketch of this option follows the list)
  • or drop it
  • or merge it with a nearby cluster
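
A minimal sketch of the reinitialization option, folded into the same update loop (reseeding from a random training example is my own default, not something the exercise requires):

for k = 1:K
  members = X(idx == k, :);
  if isempty(members)
    % Empty cluster: reseed the centroid from a random training example
    centroids(k, :) = X(randi(size(X, 1)), :);
  else
    centroids(k, :) = mean(members, 1);
  end
end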

K-means initialization

K-means is sensitive to initialization. The exercise recommends a practical default:

  • randomly reorder the training examples
  • take the first K as the starting centroids

% Initialize the centroids to be random examples
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :);

This avoids selecting the same point twice and usually gives a decent start.


Image compression with K-means

This isn’t about images. The exact same idea appears in:
  • vector quantization
  • embedding compression
  • clustering user or item representations
Once you see pixels as points, you see this pattern everywhere.

This is the “aha” moment of the exercise.

A 24-bit color image stores each pixel as three values (R, G, B). If you treat every pixel as a point in 3D RGB space, you can cluster those points with K-means.

Then you replace every pixel with its centroid color.

What I implemented

  • load the image (128x128)
  • reshape it into a list of pixels (m x 3)
  • run K-means with K = 16 colors
  • rebuild the image using the 16 centroids

In the course files you’ll see this image referenced as:

% Load 128x128 color image (bird_small.png)
A = imread('bird_small.png');
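
Putting the whole pipeline together looks roughly like this; I’m reusing the random-example initialization from above and assuming the exercise’s runkMeans.m and findClosestCentroids.m are on the path:

A = double(imread('bird_small.png')) / 255;     % scale RGB values to [0, 1]
img_size = size(A);
X = reshape(A, img_size(1) * img_size(2), 3);   % one row per pixel (m x 3)

K = 16;                                         % target palette size
randidx = randperm(size(X, 1));
initial_centroids = X(randidx(1:K), :);
[centroids, idx] = runkMeans(X, initial_centroids, 10);

% Map every pixel to its centroid color and rebuild the image
idx = findClosestCentroids(X, centroids);
X_recovered = centroids(idx, :);
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);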

Why this matters as an engineer

This is the same “quantization” idea used in lots of real systems:

  • compressing color palettes
  • building compact “visual words”
  • clustering embeddings

The big lesson: unsupervised learning is often a preprocessing tool, not just a model.

The script also includes a 3D visualization of pixel clusters in RGB space (using scatter3). It’s a nice reminder that K-means is just geometry.
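
Reproducing that plot outside the script is roughly one call (a hypothetical snippet; I sample pixels so the plot stays light, and color each point by its cluster index):

sel = randperm(size(X, 1), 1000);                          % random subset of pixels
scatter3(X(sel, 1), X(sel, 2), X(sel, 3), 10, idx(sel));   % color = assigned cluster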


Part 2 — PCA (compression + visualization)

A helpful way to think about PCA: you are rotating the coordinate system so that:
  • axis 1 captures “most variation”
  • axis 2 captures the next most
  • and so on
Then you throw away the axes that matter least.

PCA answers a different question than K-means.

  • K-means: “which group is this point closest to?”
  • PCA: “which directions contain most of the variation in the data?”

In this exercise, PCA is used in two ways:

  1. reduce 2D data to 1D (to make the idea obvious)
  2. reduce face images to 100 dimensions (to show the practical value)

Functions involved:

  • featureNormalize.m — mean-normalize (and scale)
  • pca.m — compute principal components using SVD
  • projectData.m — project into lower dimension
  • recoverData.m — approximate reconstruction

Step 0: normalize features

PCA is very sensitive to feature scale.

If one feature ranges 0..1000 and another ranges 0..1, the large-scale feature will dominate the directions.

So the exercise starts with:

  • subtract mean (so each feature is centered)
  • scale by standard deviation

This is one of those boring steps that is also the difference between “works” and “doesn’t work”.
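
The normalization itself is only a couple of lines (a minimal sketch of featureNormalize.m, relying on Octave’s broadcasting):

mu = mean(X);                 % per-feature mean (1 x n)
sigma = std(X);               % per-feature standard deviation (1 x n)
X_norm = (X - mu) ./ sigma;   % center each feature, then scale it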

Skipping normalization is one of the fastest ways to get convincing but wrong PCA results.
  • The math still runs.
  • The outputs still look numeric.
  • The directions are just meaningless.

Implementing PCA (the pca.m function)

The PCA implementation here is:

  1. build the covariance matrix from the normalized data
  2. use SVD to get the principal directions

In words:

  • covariance matrix captures how features vary together
  • SVD gives you orthogonal directions sorted by “importance”

% X is normalized (m x n)
[m, n] = size(X);
Sigma = (1 / m) * (X' * X);        % covariance matrix (n x n)
[U, S, V] = svd(Sigma);
% U contains the principal directions (eigenvectors), sorted by variance

Quick self-check from the exercise

On the small 2D dataset provided in ex7_pca.m, after implementing pca.m, the top principal component should be approximately:

  • [-0.707 -0.707]

That number stops looking weird once you notice that 0.707 is 1/√2: the component is a unit vector along the diagonal in 2D (the overall sign of a principal direction is arbitrary).


Projecting data down (the projectData.m function)

Projection means:

  • take the first K columns of U (the top K directions)
  • map each example onto those directions

In 2D-to-1D, each point becomes a single value (its coordinate on that line).
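
In code, the projection is a single matrix multiply (a sketch of projectData.m, with U coming from pca.m above):

U_reduce = U(:, 1:K);   % top K principal directions (n x K)
Z = X * U_reduce;       % projected data (m x K)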

The exercise includes a useful numeric check. With K = 1, the first projected value should be approximately:

  • 1.481

That’s a great debugging anchor.


Recovering an approximation (the recoverData.m function)

Recovery does the reverse mapping:

  • take the low-dimensional representation Z
  • map it back into the original feature space using the same directions

It won’t be identical — you threw information away — but it should be close.
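
The recovery step is the mirror image: multiply by the transpose of the same directions (a sketch of recoverData.m):

X_rec = Z * U(:, 1:K)';   % approximate reconstruction in the original space (m x n)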

In the exercise’s 2D dataset, the first recovered point is approximately:

  • [-1.047 -1.047]

That’s “close enough” to show the idea works.

This project/recover pair is exactly how I think about dimensionality reduction now: compress → compute faster → accept some loss.


PCA on faces (eigenfaces)

This is the most memorable part of Exercise 7.

The dataset is a set of 32x32 grayscale face images. Flattened, each image is a vector of 1024 pixel values.

PCA finds the directions that capture the main variations across faces. These principal directions are often called eigenfaces.

What the exercise does

  • compute PCA on the normalized face dataset
  • keep only the top 100 principal components
  • represent each face as a 100-dimensional vector
  • reconstruct an approximation and compare side-by-side (a sketch follows this list)
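
Chained together, that is just the Part 2 functions applied to a bigger matrix (a sketch; I’m assuming the face data ships as ex7faces.mat with one flattened 1024-pixel face per row of X):

load('ex7faces.mat');                     % loads X: one face per row (m x 1024)
[X_norm, mu, sigma] = featureNormalize(X);
[U, S] = pca(X_norm);

K = 100;
Z = projectData(X_norm, U, K);            % each face becomes a 100-dimensional vector
X_rec = recoverData(Z, U, K);             % blurry but recognizable reconstruction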

The reconstruction is slightly blurry (fine details are lost), but the face is clearly still a face.

That’s the point: you get a 10x+ reduction in input size while preserving the structure.

Why this matters

If you were training a classifier (say, “who is this person?”), reducing inputs from 1024 to 100 can:

  • speed up training dramatically
  • reduce memory
  • sometimes help generalization

PCA is not a free win. If you reduce too aggressively, you can throw away discriminative details. I treat PCA as a tool to try, measure, and keep only if it improves the system.


Engineer’s takeaways

These tools rarely live alone. In real systems, they usually sit before a supervised model to make everything else feasible.

Here’s what stuck with me after finishing this exercise:

  • K-means is a building block, not a final product. It’s great for compression, grouping, and initializing other models.
  • PCA is a performance tool. It reduces dimensionality and can make downstream training feasible.
  • Vectorization matters early. Octave makes it obvious: the difference between a loop and a matrix operation is the difference between “instant” and “painful”.
  • Debug with known numbers. I loved that the exercise gave concrete expected outputs like [1 3 2], [-0.707 -0.707], and 1.481.

What’s Next

This is the next-to-last exercise in Andrew Ng’s Machine Learning course.

To close the loop, I’ll build anomaly detection and a simple movie recommender — two patterns that quietly power fraud detection, monitoring systems, and personalization engines everywhere.

Axel Domingues - 2026