
Notes — two ways to fit linear regression: iterative gradient descent vs. the one-shot normal equation. Same goal, different tradeoffs.
Axel Domingues
After implementing linear regression with gradient descent (single variable, then multiple variables), I hit the most “engineer” question in the course:
If two methods solve the same problem, how do I pick the right one?
In this part of Andrew Ng’s ML course, linear regression can be trained using either:

- **Gradient Descent** (iterative parameter updates), or
- **the Normal Equation** (a one-shot, closed-form solution).
Both produce theta (the parameter vector). Both can predict y from X. But the developer experience and operational tradeoffs are totally different.
- **If you want the simplest path:** use the Normal Equation for small linear regression problems. It’s one-shot and deterministic.
- **If you want the reusable skill:** use Gradient Descent. It generalizes to logistic regression, neural nets, and most of the course.
- **The “gotcha” to remember:** if gradient descent feels broken, it’s often feature scale (or alpha), not your math.
No matter which training method you choose, the prediction step is the same:
predictions = X * theta;
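For a concrete (hypothetical) example: if theta came from gradient descent on normalized features, new inputs have to go through the same mu / sigma first, and the row needs its leading intercept term. The variable names below are my own, not the course’s.

```octave
% Sketch: predict one new example with two assumed features
% [size_in_sqft, num_bedrooms]. mu, sigma, theta come from training.
x_new = [1650, 3];
x_scaled = (x_new - mu) ./ sigma;   % only needed if theta was trained on normalized features
price = [1, x_scaled] * theta;      % prepend the intercept term, then multiply
```

With a normal-equation theta trained on raw features, you’d skip the scaling line and use `[1, x_new]` directly.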
Either way, you end up with a theta you can use for prediction. What changes is how you get theta:
Gradient descent starts from an initial theta (often zeros) and updates it step-by-step.
In Octave, a typical vectorized update loop looks like this:
for iter = 1:num_iters
  errors = (X * theta) - y;                         % prediction error on every training example
  theta = theta - (alpha / m) * (X' * errors);      % simultaneous, vectorized update of all parameters
  J_history(iter) = computeCostMulti(X, y, theta);  % record the cost so convergence can be checked
end
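The loop calls computeCostMulti, which isn’t shown above. A minimal vectorized version of the usual squared-error cost, as a sketch:

```octave
function J = computeCostMulti(X, y, theta)
  % Vectorized squared-error cost over m training examples.
  m = length(y);
  errors = X * theta - y;
  J = (errors' * errors) / (2 * m);
end
```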
This loop is also where you debug alpha. Treat J_history like logs: if you can’t see whether the cost is decreasing, you’re guessing.
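A quick way to “read those logs” is a convergence plot (the labels and line style here are my own choices, not the course’s):

```octave
% Cost versus iteration: it should drop steadily and then flatten out.
figure;
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Iteration');
ylabel('Cost J');
```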
The normal equation computes theta directly (no iterations).
In Octave, the common implementation uses the pseudo-inverse:
theta = pinv(X' * X) * X' * y;
The expensive part is forming X' * X and inverting/solving it, which becomes costly as the feature count grows. The course typically recommends pinv instead of directly computing an inverse: pinv is more numerically stable and works better when matrices are not nicely invertible.
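End to end, the normal-equation path is short. As a sketch, assuming a hypothetical `housing_data.txt` with raw feature columns followed by a target column:

```octave
data = load('housing_data.txt');                 % assumed file: feature columns, then target
X = [ones(size(data, 1), 1), data(:, 1:end-1)];  % prepend the intercept column, keep features raw
y = data(:, end);

theta = pinv(X' * X) * X' * y;                   % one shot: no alpha, no loop
```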
This is one of the most useful rules of thumb I took from the course:

> The normal equation does not need feature scaling.
That said, scaling can still be helpful for interpretability and numerical conditioning, but it’s not a hard requirement the way it is with gradient descent.
If gradient descent “doesn’t learn,” check feature scales first. The most common failure mode is mixing features with very different ranges.
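For reference, a minimal featureNormalize in the spirit of the exercise (zero mean, unit standard deviation per column, returning mu and sigma for later use); it assumes no constant columns, otherwise sigma would be zero:

```octave
function [X_norm, mu, sigma] = featureNormalize(X)
  % Scale every feature column to zero mean and unit standard deviation.
  mu = mean(X);
  sigma = std(X);
  X_norm = (X - mu) ./ sigma;   % Octave broadcasts the row vectors across all rows
end
```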
Here’s how I frame the choice in practice.
| Question | Gradient Descent | Normal Equation |
|---|---|---|
| Do I want a general optimization tool I can reuse later? | ✅ Yes | ❌ Not really |
| Do I want the simplest path for small linear regression problems? | ⚠️ Maybe | ✅ Yes |
| Do I want to avoid tuning alpha? | ❌ No | ✅ Yes |
| Do I have lots of features? | ✅ Better | ⚠️ Can get expensive |
| Do I want training telemetry and control? | ✅ Yes | ⚠️ Less relevant |
1. Run featureNormalize and keep mu / sigma so you can normalize future inputs the same way.
2. Pick alpha + num_iters, and watch J_history. If it’s not decreasing, stop and fix scaling/alpha.
3. Compute theta using pinv. No tuning, no iterations.
4. Use identical feature ordering and compare predictions from both paths (a sketch follows this list). If the predictions are close, your pipeline is probably correct.
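Here’s how that cross-check might look. It assumes the two-feature housing data, plus featureNormalize and a gradientDescentMulti with the exercise’s signature; the learning rate and iteration count are placeholder values.

```octave
% Train both ways on the same data, then compare one prediction.
m = size(X_raw, 1);
alpha = 0.1;        % placeholder learning rate
num_iters = 400;    % placeholder iteration budget

[X_norm, mu, sigma] = featureNormalize(X_raw);
X_gd = [ones(m, 1), X_norm];    % scaled design matrix for gradient descent
X_ne = [ones(m, 1), X_raw];     % raw design matrix for the normal equation

[theta_gd, J_history] = gradientDescentMulti(X_gd, y, zeros(size(X_gd, 2), 1), alpha, num_iters);
theta_ne = pinv(X_ne' * X_ne) * X_ne' * y;

% Same raw example through both pipelines; the predictions should be close.
x_example = [1650, 3];
price_gd = [1, (x_example - mu) ./ sigma] * theta_gd;
price_ne = [1, x_example] * theta_ne;
fprintf('gradient descent: %.2f | normal equation: %.2f\n', price_gd, price_ne);
```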
This is where I started thinking beyond the math.
- Gradient descent puts alpha, iterations, and initialization in your hands.
- The normal equation only needs X and y.
- The normal equation is cheap for small n, but becomes heavy as n grows.

The method that’s simplest for a homework dataset might not be simplest in production. Production constraints usually push you toward iterative methods and good monitoring.
Next up is my first real classifier: Logistic Regression.
I’ll predict admissions from exam scores, then push into a non-linear dataset and learn why regularization is the difference between a model that generalizes and one that memorizes.