
My first real classifier - predict admissions from exam scores with logistic regression, then learn why regularization matters on a non-linear dataset.
Axel Domingues
Exercise 1 taught me how to predict a number.
Exercise 2 is where things start to feel like “real ML”: classification.
Instead of predicting profit or price, we predict a label:
- 1 = yes (admit / accept)
- 0 = no (reject)
This exercise is split into two parts:
- Part 1: plain logistic regression on the admissions data
- Part 2: regularized logistic regression on the microchip data
What you’ll build
A working logistic regression classifier: probability → threshold → 0/1 prediction.
The core contract
Implement cost + gradient correctly, then let fminunc do the optimizing.
The real lesson
More expressive features can overfit fast — regularization (lambda) keeps the model honest.
The files in play:
- ex2.m — run logistic regression on admissions data
- sigmoid.m — probability function (must work for vectors/matrices)
- costFunction.m — cost + gradient (no regularization)
- predict.m — probability → 0/1 prediction
- ex2_reg.m — run regularized logistic regression
- mapFeature.m — expands features (polynomial mapping)
- costFunctionReg.m — cost + gradient with regularization
A quick debugging checklist that pays off throughout:
- check size(X), size(theta), size(y)
- use .*, ./ where needed
- set theta_reg(1) = 0 before adding the regularization term
The admissions dataset contains two exam scores per student and a label (admitted / not admitted).
Plotting answers the first honest question: does a straight boundary look plausible?
If you’re using a helper like plotData.m, the pattern is usually:
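A minimal sketch of that pattern, assuming the course's ex2data1.txt layout (two score columns plus a 0/1 label) and the provided plotData helper:
data = load('ex2data1.txt');      % exam 1 score, exam 2 score, admitted (0/1)
X = data(:, 1:2);                 % raw scores (the intercept column gets added later)
y = data(:, 3);
plotData(X, y);                   % provided helper: marks positives vs negatives
xlabel('Exam 1 score');
ylabel('Exam 2 score');
legend('Admitted', 'Not admitted');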
This plot becomes the background for the decision boundary later.
If your plot looks strange (all points stacked, weird scales), stop. Don’t debug ML with broken data visualization.
Logistic regression works by producing a probability between 0 and 1.
The sigmoid function is the basic building block:
function g = sigmoid(z)
g = 1 ./ (1 + exp(-z));
end
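A few quick sanity checks (my own, not course checkpoints) confirm it behaves element-wise:
sigmoid(0)              % 0.5
sigmoid([-10 0 10])     % approx [0.0000  0.5000  1.0000], element-wise on a vector
sigmoid(zeros(2, 2))    % a 2x2 matrix of 0.5, so matrices work too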
Two things I learned immediately: the division and exponent have to be element-wise (./ and exp(-z)) so sigmoid works on vectors and matrices, not just scalars, and sigmoid(0) = 0.5 gives you a free sanity check.
The key implementation in Part 1 is costFunction.m. It should return:
- J: a single number (how wrong you are)
- grad: a vector (direction to improve theta)
A clean vectorized pattern:
function [J, grad] = costFunction(theta, X, y)
m = length(y);
h = sigmoid(X * theta);
J = (1/m) * sum( -y .* log(h) - (1 - y) .* log(1 - h) );
grad = (1/m) * (X' * (h - y));
end
When theta starts at zeros, the cost should be around 0.693.
This number is incredibly useful as a checkpoint: with all-zero theta every prediction is 0.5, and -log(0.5) ≈ 0.693, regardless of the data.
If your cost is not close to 0.693 at zero theta, do not continue. Fix costFunction.m first.
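To make the checkpoint concrete, here's a minimal check, assuming X already has its column of ones:
initial_theta = zeros(size(X, 2), 1);              % all-zero parameters
[cost, grad] = costFunction(initial_theta, X, y);
fprintf('Cost at zero theta: %.4f (expect ~0.693)\n', cost);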
Checkpoint: with theta = zeros(...), cost should be around 0.693. Verify costFunction.m before running fminunc.
In Exercise 2, the course introduces a practical workflow:
- you implement the cost and the gradient
- you let a built-in optimizer (fminunc) handle the optimization steps
The call pattern looks like:
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
What I liked about this approach: no hand-tuned learning rate, no hand-written descent loop. My job shrinks to two well-defined pieces, a correct cost and a correct gradient.
If fminunc behaves strangely, it's usually one of:
- a bug in your gradient
- a shape mismatch in X, theta, or y
- sigmoid not handling vectors/matrices correctly
Once you have theta, you can predict:
compute sigmoid(X * theta) to get a probability, then threshold it at 0.5.
A simple predict.m:
function p = predict(theta, X)
p = sigmoid(X * theta) >= 0.5;
end
Then compute training accuracy: compare p with y and average the matches (see the one-liner below).
This is the first moment where it felt like a real classifier: it makes decisions from data.
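The usual Octave one-liner for that comparison (the print format is my own):
accuracy = mean(double(p == y)) * 100;   % fraction of predictions matching the labels
fprintf('Train accuracy: %.1f%%\n', accuracy);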

The second dataset has two test results for microchips and a label:
- 1 = accepted
- 0 = rejected
The twist: a straight line can't separate the classes well.
So the course does two things:
- makes the model more expressive through polynomial feature mapping
- adds regularization to keep that extra flexibility in check
mapFeature.m is provided and expands the two input features into a set of polynomial features.
This changes the shape of the data the model sees: the two raw columns become a 28-dimensional vector of polynomial terms (all products of powers of the two features up to degree 6).
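For intuition, the provided helper does roughly this: an intercept column first, then every term x1^(i-j) * x2^j up to degree 6:
function out = mapFeature(X1, X2)
  % expand two feature columns into all polynomial terms up to degree 6
  degree = 6;
  out = ones(size(X1(:, 1)));              % intercept column
  for i = 1:degree
    for j = 0:i
      out(:, end + 1) = (X1 .^ (i - j)) .* (X2 .^ j);
    end
  end
end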
This is powerful… and dangerous.
More features means a more flexible boundary, which means the model can start “memorizing” the training set.
You implement costFunctionReg.m which adds a penalty for large parameter values.
Important rule from the course:
do not penalize the intercept term (theta(1) in Octave).
Implementation pattern:
function [J, grad] = costFunctionReg(theta, X, y, lambda)
m = length(y);
h = sigmoid(X * theta);
theta_reg = theta;
theta_reg(1) = 0; % do not penalize intercept
J = (1/m) * sum( -y .* log(h) - (1 - y) .* log(1 - h) ) ...
+ (lambda/(2*m)) * sum(theta_reg .^ 2);
grad = (1/m) * (X' * (h - y)) + (lambda/m) * theta_reg;
end
When you play with lambda, you get three regimes:
- lambda too small (or 0): the boundary twists itself around individual training points, classic overfitting
- lambda about right (around 1 here): a smooth boundary that follows the overall shape of the data
- lambda too large (say 100): the boundary becomes too simple and misses obvious structure, i.e. underfitting
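A quick way to see all three regimes yourself, reusing the fminunc setup from Part 1 (here X is the mapped feature matrix, and 0, 1, 100 are just values that make the regimes obvious):
initial_theta = zeros(size(X, 2), 1);    % X is the output of mapFeature (28 columns)
options = optimset('GradObj', 'on', 'MaxIter', 400);
for lambda = [0 1 100]
  [theta, J] = fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);
  fprintf('lambda = %3g -> final cost %.3f\n', lambda, J);
end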
This is the first exercise where “overfitting” becomes visible.
If you only remember one thing from this section: feature mapping increases power, regularization keeps it honest.
When this exercise breaks, it usually breaks in predictable ways:
- check size(X), size(theta), size(y)
- use .* and ./ where appropriate
- if h becomes exactly 0 or 1, log(h) can be problematic
- remember theta_reg(1) = 0 when adding the regularization term
The most dangerous bug is a wrong gradient: the script runs, the solver returns a theta, and your boundary looks "kind of random." Use checkpoints and plot the boundary.
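If you're using the course's plotDecisionBoundary.m helper, plotting that boundary is a single call (X is the full mapped feature matrix, intercept column included):
plotDecisionBoundary(theta, X, y);       % overlays the fitted boundary on the scatter plot
title(sprintf('lambda = %g', lambda));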
Implementing the cost and gradient yourself and handing the minimization to a library optimizer (fminunc) is a practical pattern.
Next up is a topic that shows up everywhere once you see it: overfitting.
Overfitting is what happens when your model memorizes training data.
Regularization is the practical tool that keeps it honest — letting you keep model power without turning “perfect training accuracy” into a trap.