
I built my first spam classifier with SVMs—learning how C and sigma shape decision boundaries, and why linear SVMs scale surprisingly well.
Up to this point in the course, I’ve been living in gradient descent land: pick a cost function, compute gradients, iterate, hope for a different outcome.
Then Support Vector Machines show up… and it feels like switching from a DIY engine rebuild to a tool that just wins.
Not because SVMs are magic (they aren’t), but because:

Why SVMs feel different: they optimize for margin, not just error — which often produces strong boundaries with less tuning.
The two knobs that matter: C and the kernel bandwidth sigma.
The surprise: a linear SVM can dominate on text problems when the feature space is large and sparse.

This exercise is split in two halves: the visual half (ex6.m, small 2D datasets for building intuition about margins, C, and kernels) and the applied half (ex6spam.m, a spam classifier). The ideas that stuck with me:

SVMs don’t just separate classes — they choose the boundary with the widest safety margin.
C is the tradeoff between clean separation and tolerance for mistakes.
The kernel creates similarity-based features that allow curved decision boundaries.
High-dimensional sparse features make linear separators surprisingly powerful.
The starter SVM implementation included in the assignment is designed for compatibility, not performance. For real-world training, the PDF explicitly recommends using a toolbox such as LIBSVM.
The first script is ex6.m. It walks through 3 datasets:
ex6data1.mat (linear boundary)
ex6data2.mat (needs a non-linear boundary)
ex6data3.mat (non-linear + parameter search)
In ex6data1, there’s an obvious gap between the two classes… except for one positive example sitting in a bad place.
This is the first time the C parameter felt real.
In the assignment we compare:
C = 1 → the boundary stays in the “natural gap” but misclassifies the outlier
C = 100 → everything is classified correctly, but the boundary looks less natural
Engineering interpretation: C acts like inverse regularization. A large C says every training point matters, even the outlier; a small C accepts a few mistakes in exchange for a more natural boundary that usually generalizes better.
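To make the comparison concrete, here’s roughly what that part of ex6.m boils down to (a sketch: linearKernel and visualizeBoundaryLinear are starter-code helpers as far as I recall, not something you write yourself):
% Sketch of the C comparison on ex6data1 (helper names as I remember them).
load('ex6data1.mat');                             % provides X, y

model_1   = svmTrain(X, y, 1, @linearKernel);     % C = 1
model_100 = svmTrain(X, y, 100, @linearKernel);   % C = 100

visualizeBoundaryLinear(X, y, model_1);     % keeps the natural gap, misses the outlier
visualizeBoundaryLinear(X, y, model_100);   % fits every point, boundary looks strained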

Implementation detail that bit me once:
Most SVM packages (including the provided svmTrain.m) automatically handle the intercept term. Don’t manually add the x0 = 1 feature like we did for linear/logistic regression.
If you manually add an x0 = 1 column, you’ll quietly break the model.
The Gaussian kernel is a similarity function: sim(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2)).
A good mental model:
The kernel turns “distance to many landmarks” into new features.
The bandwidth parameter sigma controls how quickly similarity drops: a small sigma makes similarity fall off sharply (allowing wiggly, very local boundaries), while a large sigma makes it decay slowly (producing smoother, more global boundaries).
gaussianKernel.m
The assignment asks you to implement the kernel and then run a quick test.
The expected value from the provided test case is:
0.324652
Here’s a clean Octave-style implementation (no fancy tricks, just vectorization):
function sim = gaussianKernel(x1, x2, sigma)
  x1 = x1(:); x2 = x2(:);
  diff = x1 - x2;
  sim = exp(-(diff' * diff) / (2 * sigma^2));
end
If you can read that code line, you understand the Gaussian kernel: “exponentially down-weight squared distance.”
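The quick test in the script amounts to a single call. The test vectors below are the ones I remember from the assignment; what I’m sure of is that they reproduce the 0.324652 above, and bumping sigma shows how the bandwidth controls the decay:
x1 = [1 2 1]; x2 = [0 4 -1];
sim = gaussianKernel(x1, x2, 2)    % 0.324652  (squared distance 9, exp(-9/8))
sim = gaussianKernel(x1, x2, 4)    % ~0.755    (same distance, slower decay: exp(-9/32))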
ex6data2 has a layout where no straight line separates the classes well.
Once the Gaussian kernel is used, the learned boundary follows the contours of the dataset and separates most points correctly.
This is the first time in the course I felt a model “bend” the boundary in a controlled way.
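The wiring looks like this (a sketch; I believe the script uses roughly C = 1 and sigma = 0.1, but treat those values as my assumption, and visualizeBoundary is the provided contour-plotting helper):
load('ex6data2.mat');             % provides X, y
C = 1; sigma = 0.1;               % assumed values, see note above
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);   % boundary hugs the shape of the data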
This part is a practical lesson in not guessing hyperparameters.
You’re given:
X, y
Xval, yval
Your job: implement dataset3Params.m to search combinations of C and sigma.
The PDF suggests trying values in multiplicative steps:
0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30
That’s 8 values for C and 8 for sigma → 64 models.
The error metric is classification error on the validation set:
pred = svmPredict(model, Xval);
err = mean(double(pred ~= yval));
Here is the structure I used:
function [C, sigma] = dataset3Params(X, y, Xval, yval)
  values = [0.01 0.03 0.1 0.3 1 3 10 30];
  bestErr = inf;
  for Ctest = values
    for sigTest = values
      model = svmTrain(X, y, Ctest, @(x1, x2) gaussianKernel(x1, x2, sigTest));
      pred = svmPredict(model, Xval);
      err = mean(double(pred ~= yval));
      if err < bestErr
        bestErr = err;
        C = Ctest;
        sigma = sigTest;
      end
    end
  end
end
This is the first “real” hyperparameter tuning pattern in the course.
It’s not fancy, but it’s reliable: define a search space, measure on validation, pick the best.
The second script, ex6spam.m, takes SVMs from toy plots to something useful: spam vs not spam.
This part made SVMs click for me because it combines text preprocessing, feature engineering, and a plain linear classifier into one pipeline.

Preprocessing emails (processEmail.m)
The goal: convert raw email text into a list of word indices.
The preprocessing steps are very “classic ML”:
Lower-casing and stripping HTML tags
URLs → httpaddr
Email addresses → emailaddr
Numbers → number
Dollar amounts → dollar
Word stemming and removal of punctuation / non-words
This is the difference between a model that learns and a model that drowns in noise.
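In Octave this normalization is mostly lower() plus a handful of regexprep() calls. A sketch with illustrative patterns (not necessarily the assignment’s exact regexes):
email = lower(email);                                            % case-fold
email = regexprep(email, '<[^<>]+>', ' ');                       % strip HTML tags
email = regexprep(email, '[0-9]+', 'number');                    % normalize numbers
email = regexprep(email, '(http|https)://[^\s]*', 'httpaddr');   % normalize URLs
email = regexprep(email, '[^\s]+@[^\s]+', 'emailaddr');          % normalize email addresses
email = regexprep(email, '[$]+', 'dollar');                      % normalize dollar signs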
The exercise includes vocab.txt and getVocabList.m which load a fixed dictionary.
This makes the whole pipeline reproducible.
Feature extraction (emailFeatures.m)
The dictionary contains 1899 words.
Each email becomes a binary vector x of length 1899:
x(i) = 1 if vocabulary word i appears in the email
x(i) = 0 otherwise
After implementing emailFeatures.m, the script checks a sample email and you should see a feature vector of length 1899 with 45 non-zero entries.
That “45” is a nice reminder that these vectors are sparse.
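The mapping itself is tiny. A minimal sketch of what emailFeatures.m has to do:
function x = emailFeatures(word_indices)
  n = 1899;              % size of the fixed vocabulary
  x = zeros(n, 1);       % start from an all-zeros feature vector
  x(word_indices) = 1;   % flag every vocabulary word that appears in the email
end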
The datasets provided are:
spamTrain.mat → 4000 emails
spamTest.mat → 1000 emails
Each email is already converted to the 1899-dimensional feature vector using your preprocessing.
The expected performance checkpoint from the assignment is about 99.8% accuracy on the training set and about 98.5% on the test set.
That’s absurdly good for such a simple representation.
This is one of the best demonstrations (in the whole course) of why linear models can be incredibly strong when the feature representation is high-dimensional and well-designed.
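For reference, training and evaluation reduce to a few lines. This is a sketch: C = 0.1 and the variable names inside the .mat files are what I remember, so treat them as assumptions:
load('spamTrain.mat');                        % X, y (training emails)
model = svmTrain(X, y, 0.1, @linearKernel);   % linear kernel, C = 0.1 (assumed)
p = svmPredict(model, X);
fprintf('Training accuracy: %.1f%%\n', mean(double(p == y)) * 100);

load('spamTest.mat');                         % Xtest, ytest (test emails)
p = svmPredict(model, Xtest);
fprintf('Test accuracy: %.1f%%\n', mean(double(p == ytest)) * 100);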
The script prints the words with the largest positive weights.
From the PDF’s example output, the top predictors include (stemmed forms):
our, click, remov, guarante, visit, basenumb, dollar, will, price, pleas, nbsp, most, lo, ga, dollarnumb
Seeing these terms is a great sanity check: the model is learning patterns that match intuition.
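Getting that list is just a sort over the learned weights. A sketch, assuming the linear-kernel model from svmTrain exposes them as model.w (which is how I remember the starter code):
[weights, idx] = sort(model.w, 'descend');   % largest positive weights first
vocabList = getVocabList();                  % cell array with the 1899 vocabulary words
for i = 1:15
  fprintf('%-15s (%f)\n', vocabList{idx(i)}, weights(i));
end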
This was a turning point for me because it’s the first time the course feels like a deployable pipeline: raw text in, preprocessing, feature extraction, a trained model, and a prediction out.
And the main engineering lesson:
A “simple” model + strong representation often beats a complex model + messy representation.
False positives and false negatives have different real-world costs. Tuning C encodes that tradeoff.
The course SVM is for learning.
Real systems should use optimized solvers.
Sparse, high-dimensional vectors + linear classifiers = strong baselines.
Next up, I leave supervised learning behind and step into unsupervised learning.
I’ll implement K-means clustering, compress an image down to 16 colors, then use PCA to reduce dimensions and build intuition for ideas like eigenfaces and variance preservation.