
Notes from Andrew Ng’s ML course — extend linear regression to multiple features, learn feature scaling/mean normalization, and stop writing slow loops by vectorizing everything.
Axel Domingues
After Exercise 1, linear regression feels almost too easy: one feature, one line, gradient descent converges.
Then the course does the most realistic thing possible:
It adds more features.
In the real world you rarely have “profit depends on one number.” You have multiple inputs (size, number of rooms, location signals, time effects, etc.). This part of the course is where I learned a few practical lessons that keep showing up in every ML system I’ve touched since:
What changes vs Exercise 1: X becomes a matrix (many columns), theta becomes longer, and bugs multiply unless you watch shapes.
Non-negotiable skill: feature scaling + mean normalization makes gradient descent stable and fast enough to be usable.
Biggest upgrade: vectorization isn’t just a speed trick. It keeps the code aligned with the idea: “predict everything, update once.”
Single-variable linear regression is basically: “learn a line.”
Multi-variable linear regression is: “learn a weighted sum of features.”
In code terms:
X becomes an m x (n+1) matrix (m examples, n features plus an intercept column)
theta becomes an (n+1) x 1 vector
predictions = X * theta;
Same idea, bigger surface area for bugs.
ex1_multi.m — orchestration: loads data, normalizes, runs training
ex1data2.txt — dataset (e.g., housing size + bedrooms)
featureNormalize.m — makes features comparable in scale
computeCostMulti.m — the “truth meter” for how wrong theta is
gradientDescentMulti.m — iterative optimization loop
normalEqn.m — one-shot solution (no iterations)
Two self-checks worth running throughout: did you use element-wise operations (./) where needed, and do the shapes (size(X), size(theta), size(y)) match what you expect?
The course’s approach is straightforward: for each feature, subtract its mean, then divide by its standard deviation.
In Octave this usually looks like:
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);           % per-feature mean (1 x n row vector)
  X = X - mu;             % broadcasting subtracts mu from every row
  sigma = std(X);         % per-feature standard deviation
  X_norm = X ./ sigma;    % element-wise divide: zero mean, unit variance
end
Two notes I wrote down:
Return mu and sigma, because you need them to normalize new inputs later.
Watch element-wise operations (./) vs matrix operations.
If you normalize training data but forget to normalize your prediction input, your predictions will be nonsense.
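A minimal sketch of what that looks like at prediction time (the input values here are made up; mu, sigma, and theta come from training):
x_new = [1650, 3];               % hypothetical house: size in sqft, bedrooms
x_norm = (x_new - mu) ./ sigma;  % reuse the training-time mu and sigma
price = [1, x_norm] * theta;     % prepend the intercept term, then predict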
Once features are normalized, you still build X the same way:
X = [ones(m, 1), X_norm];   % prepend the intercept column
theta = zeros(n + 1, 1);    % one weight per feature, plus the intercept
This consistency is what makes the course great: the model doesn’t change — just the number of columns.
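End to end, the setup phase stays short. A sketch, assuming the course’s ex1data2.txt layout (size, bedrooms, price):
data = load('ex1data2.txt');               % columns: size, bedrooms, price
X = data(:, 1:2);
y = data(:, 3);
m = length(y);
[X_norm, mu, sigma] = featureNormalize(X);
X = [ones(m, 1), X_norm];                  % intercept column + normalized features
theta = zeros(size(X, 2), 1);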
The multi-variable cost computation is the same logic as Exercise 1:
Vectorized implementation:
function J = computeCostMulti(X, y, theta)
  m = length(y);                        % number of training examples
  errors = (X * theta) - y;             % m x 1 vector of prediction errors
  J = (1/(2*m)) * (errors' * errors);   % sum of squared errors via a dot product
end
The cost function is your contract. If cost behaves weirdly, fix cost first.
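A smoke test I got into the habit of running (my own addition, not part of the exercise): the cost at theta = 0 should be finite and positive, and it should drop once training starts.
J0 = computeCostMulti(X, y, zeros(size(X, 2), 1));  % cost before any training
fprintf('Initial cost: %f\n', J0);                  % expect a finite value > 0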
The update is still the same pattern. The big improvement is that the vectorized gradient scales naturally with the number of features.
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    errors = (X * theta) - y;                        % predict all examples at once
    theta = theta - (alpha/m) * (X' * errors);       % update every parameter together
    J_history(iter) = computeCostMulti(X, y, theta); % log cost for diagnostics
  end
end
If you compare this to a version with explicit loops over every example and every parameter, you immediately see why vectorization matters.
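For contrast, here is roughly what the loop version looks like (my reconstruction, not course code). Note the extra buffer needed just to keep the parameter updates simultaneous:
m = length(y);
theta_new = theta;                    % buffer so all parameters update together
for j = 1:length(theta)
  grad = 0;
  for i = 1:m
    h = X(i, :) * theta;              % prediction for example i, using old theta
    grad = grad + (h - y(i)) * X(i, j);
  end
  theta_new(j) = theta(j) - (alpha / m) * grad;
end
theta = theta_new;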
I initially thought vectorization was just about performance.
It’s not. It’s about correctness and clarity.
When you write loops, you introduce extra moving parts: indices, accumulators, and update-order details that can each silently go wrong.
The vectorized form makes the update look like a single coherent operation.
Even if you never show the formulas, the idea is:
X * theta means “predict all examples”
X' * errors means “aggregate how each feature contributed to the error”
That mapping makes it easier to reason about what’s happening.
My most useful debugging tool in Octave became:
size(X)       % expect m x (n+1)
size(theta)   % expect (n+1) x 1
size(y)       % expect m x 1
If shapes match expectations, everything downstream tends to behave.
The worst ML bugs are silent: the code runs and produces numbers, but they’re wrong. Shape checks catch a surprising amount of that.
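You can also make the check fail fast instead of eyeballing it; a small guard I would add (my own habit):
assert(size(X, 2) == size(theta, 1), 'X columns must match theta rows');
assert(size(X, 1) == size(y, 1), 'X and y must have the same number of rows');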
With multiple variables, alpha sensitivity becomes obvious.
What I did:
Start with a small alpha (e.g., 0.01) and adjust from there.
If the cost blows up (Inf / NaN), alpha is too large.
A simple diagnostic snippet:
if mod(iter, 50) == 0   % inside the descent loop: log cost every 50 iterations
  fprintf('Iter %d | Cost: %f\n', iter, J_history(iter));
end
Treat cost history like logs. If your system has no telemetry, you’re guessing.
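Plotting the cost histories makes the alpha comparison immediate. A sketch, assuming the normalized X and gradientDescentMulti from above are in scope:
alphas = [0.01, 0.03, 0.1, 0.3];   % roughly 3x apart, as the course suggests
hold on;
for k = 1:numel(alphas)
  [~, J_hist] = gradientDescentMulti(X, y, zeros(size(X, 2), 1), alphas(k), 400);
  plot(1:400, J_hist);
end
xlabel('Iterations');
ylabel('Cost J');
legend('alpha = 0.01', 'alpha = 0.03', 'alpha = 0.1', 'alpha = 0.3');
hold off;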
The course introduces an alternative approach: compute theta in one shot (no gradient descent).
In Octave, the canonical implementation is:
theta = pinv(X' * X) * X' * y;   % pinv stays stable even if X' * X is singular
When I first saw this, I loved it — it feels like cheating.
But the tradeoff is practical: no alpha to tune and no iterations, yet solving X' * X costs roughly O(n^3), so it gets slow once the feature count is large.
For learning purposes, it’s a great comparison tool: train with gradient descent on normalized features, solve the normal equation on the raw features, and check that both predict the same price. Just remember to keep the normalization parameters (mu, sigma) for inference time.
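A sketch of that cross-check (theta_gd and theta_ne are hypothetical names for the two solutions; mu and sigma come from featureNormalize):
x_raw = [1650, 3];                            % illustrative house: size, bedrooms
p_gd = [1, (x_raw - mu) ./ sigma] * theta_gd; % gradient descent, normalized space
p_ne = [1, x_raw] * theta_ne;                 % normal equation, raw features
fprintf('GD: %f | NE: %f\n', p_gd, p_ne);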
Next up is a key comparison that upgrades your intuition fast: two ways to fit linear regression.
Same goal — find good parameters — different tradeoffs:
Gradient descent is iterative, scales to many features, and needs a well-chosen alpha; the normal equation is one-shot, with no alpha at all.
Once you see both side by side, linear regression stops being “a model” and becomes a toolbox.