
Implement K-means clustering, compress an image to 16 colors, then use PCA to reduce dimensions and build eigenfaces.
Axel Domingues
In the first half of Andrew Ng’s course, everything is supervised: you get labeled data, you minimize a cost, and you measure accuracy.
Exercise 7 is where that comfort disappears (in a good way).
What you build and learn along the way:
Assignment step, centroid updates, and convergence intuition.
Reduce a full-color image to a small, fixed color palette.
Compute principal components and understand variance directions.
Compress data, then reconstruct an approximation and inspect loss.
Know when clustering and dimensionality reduction are actually useful.
K-means is surprisingly simple once you stop thinking of it as “AI” and start thinking of it as a loop: assign every point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the assignments stop changing.
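Concretely, the loop is just a few lines. Here’s a sketch of what the provided runkMeans.m does (variable names like max_iters are mine):
for iter = 1:max_iters
  idx = findClosestCentroids(X, centroids);   % assignment step: nearest centroid per example
  centroids = computeCentroids(X, idx, K);    % update step: mean of the assigned points
end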
The implementation in the exercise is split across these functions:
findClosestCentroids.m — assignment step
computeCentroids.m — update step
runkMeans.m — orchestration (already provided)
For each example x(i), compute its distance to each centroid mu(k), and take the centroid index with the smallest distance.
In this exercise it’s the standard “squared distance” idea: idx(i) is the k that minimizes ||x(i) - mu(k)||^2.
No labels. No gradients. Just geometry.
A great way to sanity check your assignment step: print the idx values for a few points and see if they make sense visually.
In the provided script, after implementing findClosestCentroids.m, the expected output for the first three training examples is:
[1 3 2]
findClosestCentroids.m
The “beginner” way is a double loop: examples x centroids. It’s fine for learning.
The "lesson” for me was: the course constantly nudges you toward vectorization.
Even if you keep an outer loop over examples, you can compute the distances to all centroids in one shot by subtracting a centroid matrix from the example.
% Pseudocode style (shapes matter more than syntax)
% For a single example x (1 x n), centroids is (K x n)
% diffs becomes (K x n), then sum across columns => (K x 1)
diffs = centroids - x; % (K x n)
dists = sum(diffs .^ 2, 2); % (K x 1)
[~, idx(i)] = min(dists);
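If you want to drop the outer loop entirely, the classic trick is to expand the squared distance: ||x - mu||^2 = ||x||^2 + ||mu||^2 - 2*x*mu'. Here’s a sketch of that fully vectorized variant (my own addition, not required by the exercise):
% All pairwise squared distances at once: (m x K)
dists = sum(X .^ 2, 2) + sum(centroids .^ 2, 2)' - 2 * X * centroids';
[~, idx] = min(dists, [], 2);   % idx(i) = index of the closest centroid to x(i)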
computeCentroids.m
The centroid update is “mean of assigned points”. In plain language:
For each centroid k, take the points where idx == k and set centroids(k,:) to the average of those points.
The simplest implementation uses a loop over k:
for k = 1:K
  points = X(idx == k, :);            % all examples currently assigned to centroid k
  centroids(k, :) = mean(points, 1);  % new centroid = mean of those examples
end
You need to think about the edge case where a centroid gets zero points assigned. For the course datasets it usually doesn’t happen, but in real data it can. A common strategy is to reinitialize that centroid to a random point.
Production systems must handle this:
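For example, a minimal guard after the update loop (my sketch, not part of the exercise; randi just picks a random example index):
for k = 1:K
  if ~any(idx == k)                              % centroid k ended up with no assigned points
    centroids(k, :) = X(randi(size(X, 1)), :);   % reinitialize it to a random example
  end
end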
K-means is sensitive to initialization. The exercise recommends a practical default:
Randomly permute the training examples and take the first K as the starting centroids:
% Initialize the centroids to be random examples
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :);
This avoids selecting the same point twice and usually gives a decent start.
This is the “aha” moment of the exercise.
A 24-bit color image stores each pixel as three values (R, G, B). If you treat every pixel as a point in 3D RGB space, you can cluster those points with K-means.
Then you replace every pixel with its centroid color.
Reshape the pixels into an (m x 3) matrix and run K-means with K = 16 colors. In the course files you’ll see this image referenced as:
% Load 128x128 color image (bird_small.png)
A = imread('bird_small.png');
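Putting the pieces together, here’s a sketch of the whole compression pipeline. The function names are the exercise files from above; max_iters = 10 and the exact return values of runkMeans are assumptions on my part:
A = double(imread('bird_small.png')) / 255;    % scale pixel values to [0, 1]
img_size = size(A);
X = reshape(A, img_size(1) * img_size(2), 3);  % every pixel becomes a row: (m x 3)

K = 16; max_iters = 10;
randidx = randperm(size(X, 1));                % random-example initialization from above
initial_centroids = X(randidx(1:K), :);
[centroids, ~] = runkMeans(X, initial_centroids, max_iters);   % learn the 16-color palette

idx = findClosestCentroids(X, centroids);      % assign every pixel to its palette color
X_recovered = centroids(idx, :);               % replace each pixel with its centroid color
A_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);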
This is the same “quantization” idea used in lots of real systems.
The big lesson: unsupervised learning is often a preprocessing tool, not just a model.
The script also includes a 3D visualization of pixel clusters in RGB space (using scatter3). It’s a nice reminder that K-means is just geometry.
PCA answers a different question than K-means.
In this exercise, PCA is used in two ways: first on a small 2D dataset to build intuition, then on face images to build eigenfaces.
Functions involved:
featureNormalize.m — mean-normalize (and scale)
pca.m — compute principal components using SVD
projectData.m — project into lower dimension
recoverData.m — approximate reconstruction
PCA is very sensitive to feature scale.
If one feature ranges 0..1000 and another ranges 0..1, the large-scale feature will dominate the directions.
So the exercise starts with featureNormalize.m: subtract each feature’s mean and divide by its standard deviation.
This is one of those boring steps that is also the difference between “works” and “doesn’t work”.
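For reference, here’s roughly what that normalization step does (a sketch assuming the usual zero-mean, unit-standard-deviation version of featureNormalize.m):
mu = mean(X);                % per-feature means (1 x n)
sigma = std(X);              % per-feature standard deviations (1 x n)
X_norm = (X - mu) ./ sigma;  % center and scale each column (implicit broadcasting)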
The pca.m function
The PCA implementation here is short: compute the covariance matrix of the normalized data, run SVD on it, and keep U.
In words: the columns of U are the principal directions, ordered by how much variance they capture.
% X is normalized (m x n)
Sigma = (1 / m) * (X' * X);
[U, S, V] = svd(Sigma);
% U contains the principal directions (eigenvectors)
On the small 2D dataset provided in ex7_pca.m, after implementing pca.m, the top principal component should be approximately:
[-0.707 -0.707]
That number looks weird until you realize it’s basically “a diagonal direction” in 2D: 0.707 is about 1/sqrt(2), so this is the unit vector along the line y = x (up to sign).
Two classic mistakes to watch for:
Skipping normalization, so large-scale features dominate everything.
Rows vs columns swapped — very common mistake. Each row must be an example, each column a feature.
The projectData.m function
Projection means: multiply the normalized data by the first K columns of U (the top K directions).
In 2D-to-1D, each point becomes a single value (its coordinate on that line).
The exercise includes a useful numeric check. With K = 1, the first projected value should be approximately:
1.481
That’s a great debugging anchor.
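The projection itself is one matrix multiply. A sketch, assuming X_norm and U from the steps above:
U_reduce = U(:, 1:K);    % keep only the top K directions (n x K)
Z = X_norm * U_reduce;   % projected data (m x K)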
The recoverData.m function
Recovery does the reverse mapping: multiply Z by the transpose of the top K columns of U to get back to the original space.
It won’t be identical — you threw information away — but it should be close.
In the exercise’s 2D dataset, the first recovered point is approximately:
[-1.047 -1.047]
That’s “close enough” to show the idea works.
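The reverse mapping is just as short (same assumptions as the projection sketch above):
X_rec = Z * U(:, 1:K)';   % back to the original n dimensions (m x n)
% X_norm - X_rec is the residual: the information PCA threw away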
This project/recover pair is exactly how I think about dimensionality reduction now: compress → compute faster → accept some loss.
This is the most memorable part of Exercise 7.
The dataset is a set of 32x32 grayscale face images. Flattened, each image is a vector of 1024 pixel values.
PCA finds the directions that capture the main variations across faces. These principal directions are often called eigenfaces.
The reconstruction is slightly blurry (fine details are lost), but the face is clearly still a face.
That’s the point: you get a 10x+ reduction in input size while preserving the structure.
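Here’s a sketch of the face pipeline end to end, using the exercise’s functions. K = 100 and showing 36 eigenfaces are my choices; displayData is the course’s image-grid helper from earlier exercises:
[X_norm, mu, sigma] = featureNormalize(X);   % X is (m x 1024), one flattened face per row
[U, S] = pca(X_norm);
displayData(U(:, 1:36)');                    % the first 36 eigenfaces, drawn as 32x32 images

K = 100;
Z = projectData(X_norm, U, K);               % each face compressed to 100 numbers
X_rec = recoverData(Z, U, K);                % blurry but recognizable reconstruction
displayData(X_rec(1:100, :));                % compare with displayData(X_norm(1:100, :))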
If you were training a classifier (say, “who is this person?”), reducing inputs from 1024 to 100 can make training noticeably faster and cheaper, at the cost of some fine detail.
PCA is not a free win. If you reduce too aggressively, you can throw away discriminative details. I treat PCA as a tool to try, measure, and keep only if it improves the system.
Here’s what stuck with me after finishing this exercise:
The built-in numeric checks make great debugging anchors: [1 3 2], [-0.707 -0.707], and 1.481.
Next up is Exercise 8, which wraps up Andrew Ng’s Machine Learning course.
To close the loop, I’ll build anomaly detection and a simple movie recommender — two patterns that quietly power fraud detection, monitoring systems, and personalization engines everywhere.