Dec 29, 2016 - 12 MIN READ
Exercise 8 + Course Wrap - Anomaly Detection & Recommenders (and My Next Steps)


I wrapped Andrew Ng’s ML course by building anomaly detection and a simple movie recommender—two patterns that show up everywhere in real systems.

Axel Domingues


This is the final stop in my 2016 run through Andrew Ng’s Machine Learning course.

This exercise doesn’t introduce a brand-new mathematical trick.

It shows how many of the ideas from the entire course reappear together: probability modeling, optimization, regularization, metrics, and system thinking.

Exercise 8 is a fun combination because it covers two problems that feel very “real”:

  • Anomaly detection — “This point looks weird; should we investigate?”
  • Recommender systems — “Given what you liked before, here’s what you’ll like next.”

Both problems forced me to stop thinking like “I’m fitting a line” and start thinking like “I’m building a system that makes decisions.”


What I Implemented


Part 1: Anomaly Detection (Detecting “Weird” Without Labels)

Anomaly detection answers a very specific question:

“Does this look like the data I usually see?”

Not:
  • “Is this good or bad?”
  • “Which class does this belong to?”
That framing matters when labels are rare or delayed.

Anomaly detection is what you use when you don’t have labels for “bad cases”, or when bad cases are rare.

Think:

  • fraudulent transactions
  • failing machines
  • unusual network traffic
  • strange sensor readings

The approach in the course is intentionally simple:

  1. Assume each feature is roughly “bell-shaped” (Gaussian-ish)
  2. Learn what “normal” looks like from the training set
  3. Flag points with very low probability

Step 1 — Fit a Gaussian model

The first dataset is 2D, so you can visualize it.

In code, the learning step is almost embarrassingly simple:

  • for each feature: compute the mean
  • for each feature: compute the variance

Then for each example, compute its probability under the model.
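
Here's that step as a minimal Python/NumPy sketch (the official exercise is in Octave/MATLAB, and the function names below are mine, not the course's):

  import numpy as np

  def fit_gaussian(X):
      # X has shape (m, n): m "normal" training examples, n features
      mu = X.mean(axis=0)
      var = X.var(axis=0)   # the course divides by m, which is NumPy's default
      return mu, var

  def gaussian_prob(X, mu, var):
      # Product of independent per-feature Gaussian densities, one value per example
      coeff = 1.0 / np.sqrt(2 * np.pi * var)
      expo = np.exp(-((X - mu) ** 2) / (2 * var))
      return np.prod(coeff * expo, axis=1)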

Even if you later use fancier models, this exercise is great because it builds the habit of asking: “What does normal look like, statistically?”

Step 2 — Don’t guess the threshold; choose it

The hard part isn’t computing probabilities.

The hard part is deciding: how small is “too small”?

The exercise uses a cross-validation set with labels (normal vs anomaly) only for tuning the threshold. The method:

  • try many candidate thresholds (epsilon values)
  • for each one: predict anomalies where p < epsilon
  • compute precision, recall, and F1 score
  • pick the epsilon that maximizes F1
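
Continuing the sketch above (so numpy is already imported as np), and assuming pval holds the cross-validation probabilities and yval the 0/1 anomaly labels, the search is one loop:

  def select_threshold(pval, yval, n_candidates=1000):
      # Try evenly spaced epsilons between the smallest and largest CV probability
      best_eps, best_f1 = 0.0, 0.0
      for eps in np.linspace(pval.min(), pval.max(), n_candidates):
          preds = pval < eps                  # flag low-probability points as anomalies
          tp = np.sum(preds & (yval == 1))
          fp = np.sum(preds & (yval == 0))
          fn = np.sum(~preds & (yval == 1))
          if tp == 0:                         # no true positives: F1 is 0, skip
              continue
          precision = tp / (tp + fp)
          recall = tp / (tp + fn)
          f1 = 2 * precision * recall / (precision + recall)
          if f1 > best_f1:
              best_eps, best_f1 = eps, f1
      return best_eps, best_f1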

In the provided exercise, the selected epsilon ends up being extremely small (on the order of 1e-18), and the model identifies 117 anomalies in the higher-dimensional dataset.

Accuracy is actively misleading when anomalies are rare.

A model that labels everything as normal can look "99% accurate" and still be useless.

That's why the exercise scores each candidate threshold with F1, which balances precision and recall instead of rewarding the majority class.

Engineering notes (things I had to watch)

  • Feature scale matters. If one feature ranges from 0–1 and another from 0–10,000, the probability calculation can behave strangely.
  • Numerical underflow is real. Multiplying many tiny probabilities can underflow to zero in floating point.
  • Independence assumption. The “per-feature Gaussian” approach assumes features are independent. That’s often false, but it’s a decent baseline.
In production systems, that underflow is usually avoided by working in log-probabilities: you add log-densities instead of multiplying probabilities.
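
A sketch of that variant, reusing the per-feature model from above:

  def gaussian_log_prob(X, mu, var):
      # Sum of per-feature log-densities: log(prod p_k) == sum(log p_k)
      log_p = -0.5 * np.log(2 * np.pi * var) - ((X - mu) ** 2) / (2 * var)
      return log_p.sum(axis=1)

  # Then compare against log(epsilon) instead of epsilon:
  # anomalies = gaussian_log_prob(X, mu, var) < np.log(epsilon)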

Part 2: Recommenders (Collaborative Filtering)

Recommenders are one of the first places where ML feels product-shaped:
  • users arrive over time
  • data is sparse by default
  • predictions must degrade gracefully
This exercise captures all three.

This is the part that made me grin because the output looks like something a product could ship.

The dataset is a big matrix of movie ratings:

  • Y(i, j) is the rating that user j gave to movie i
  • R(i, j) is 1 if user j rated movie i, else 0

That R matrix is the key. It tells the algorithm which errors to care about.

Forgetting to mask with R is equivalent to training on hallucinated data.

The model gets punished for "not predicting" ratings that were never provided, and it will confidently optimize errors that don't exist.
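
A toy NumPy example makes the masking concrete (made-up numbers, not the exercise's dataset):

  import numpy as np

  # 3 movies x 2 users; 0 in Y just means "no rating recorded"
  Y = np.array([[5., 0.], [0., 3.], [4., 0.]])
  R = np.array([[1, 0], [0, 1], [1, 0]])
  predictions = np.full_like(Y, 3.0)      # pretend the model predicts 3 stars everywhere
  masked_errors = (predictions - Y) * R   # unrated entries contribute exactly zero error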

The model idea

Instead of hand-crafting movie features (genre, cast, etc.), we learn them.

  • each movie i gets a feature vector x(i)
  • each user j gets a preference vector theta(j)
  • predicted rating is the dot product: user preferences “match” movie features

That’s it.

The cost function (in plain language)

The collaborative filtering objective is:

  • for every (movie, user) pair where R(i, j) = 1:
    • compute prediction error
    • square it and sum it
  • add regularization to keep parameters from growing too large

Then we compute gradients for both:

  • movie features X
  • user preferences Theta
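
Written out (this is the standard collaborative filtering objective; notation follows the course's X, Theta, Y, R, with regularization strength lambda):

  J(X, \Theta) = \frac{1}{2} \sum_{(i,j):\, R(i,j)=1} \big( (\theta^{(j)})^\top x^{(i)} - Y(i,j) \big)^2
                 + \frac{\lambda}{2} \sum_i \lVert x^{(i)} \rVert^2
                 + \frac{\lambda}{2} \sum_j \lVert \theta^{(j)} \rVert^2

  \frac{\partial J}{\partial x^{(i)}} = \sum_{j:\, R(i,j)=1} \big( (\theta^{(j)})^\top x^{(i)} - Y(i,j) \big)\, \theta^{(j)} + \lambda\, x^{(i)}

  \frac{\partial J}{\partial \theta^{(j)}} = \sum_{i:\, R(i,j)=1} \big( (\theta^{(j)})^\top x^{(i)} - Y(i,j) \big)\, x^{(i)} + \lambda\, \theta^{(j)}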

Why vectorization matters (again)

This is the second time in the course where vectorization stops being “nice to have” and becomes “you can’t finish otherwise.”

A nested loop over movies x users is slow and messy. The exercise pushes you into expressing the logic as matrix operations.

Here’s the mental model I kept:

  • compute all predictions at once: X * Theta'
  • subtract actual ratings: (X * Theta' - Y)
  • zero-out unrated entries using R

Once you do that, the cost and gradients fall out cleanly.
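
In NumPy terms, the whole thing is a few lines (a sketch mirroring the Octave version; shapes follow the exercise):

  import numpy as np

  def cofi_cost_and_grads(X, Theta, Y, R, lam):
      # X: (num_movies, num_features), Theta: (num_users, num_features)
      # Y, R: (num_movies, num_users)
      errors = (X @ Theta.T - Y) * R        # predict, subtract actuals, mask unrated
      J = 0.5 * np.sum(errors ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))
      X_grad = errors @ Theta + lam * X           # gradient w.r.t. movie features
      Theta_grad = errors.T @ X + lam * Theta     # gradient w.r.t. user preferences
      return J, X_grad, Theta_grad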

This pattern shows up everywhere:
  • matrix factorization
  • embeddings
  • recommendation, search, and ranking systems
Once you see “predict → mask → optimize”, you’ll recognize it again and again.

My output: recommendations (and why it feels like magic)

The exercise includes a script where you add your own ratings (a “new user”), train the model, and then print top recommendations.

The PDF includes an example output like:

  • Titanic (1997)
  • Star Wars (1977)
  • The Shawshank Redemption (1994)
  • As Good As It Gets (1997)
  • Good Will Hunting (1997)

And it also shows the kind of “seed ratings” the new user provided (a handful of 2–5 star ratings).

The important note is that your results may differ because the model starts from random initialization.

The first time you see a model recommend movies you didn’t rate, it’s a good reminder that “learning” is often just optimization + data.


The real lesson of Exercise 8

It’s tempting to treat anomaly detection and recommenders as special topics.

But as an engineer, what I’m really taking away is:

  • Define the objective clearly (what are we optimizing?)
  • Choose metrics that match reality (F1 over accuracy when positives are rare)
  • Use the right masking/controls (R is basically “don’t hallucinate data”)
  • Regularize early (models love to overfit if you let them)

My next steps (after finishing the course)

In 2016, it’s impossible to ignore what’s happening with deep learning.

My plan after this course:

  1. Learn the popular deep learning approaches (and how to train them properly)
  2. Get comfortable with backprop end-to-end, not just in toy networks
  3. Start building small projects where the data pipeline matters as much as the model

I’m happy I started with the fundamentals first. Deep learning is powerful, but without the basics (optimization, regularization, diagnostics), it’s too easy to treat it like magic.


What’s Next

This closes my 2016 journey through Andrew Ng’s Machine Learning course.

In 2017, I shift focus to Neural Networks and Deep Learning — the wave that is reshaping the field at this time.

My next phase is about understanding:

  • the Perceptron as the foundation
  • CNNs for spatial structure
  • RNNs and LSTMs for sequence modeling
  • why better activation functions, weight initialization, and gradient descent improvements finally made deep networks trainable at scale

These ideas powered breakthroughs like AlexNet and kicked off the modern deep learning revolution.

The fundamentals from this course — optimization, regularization, diagnostics, and system thinking — turned out to be the difference between using deep learning and actually understanding why it works.
