
After months of wrestling with training stability, I finally hit the next wall - fully-connected nets don’t “get” images. Convolutions felt like the first time architecture itself encoded domain knowledge.
Axel Domingues
In June, I was finally starting to feel less helpless about optimization.
Momentum gave training inertia. The learning rate stopped being a magic number and became a knob I could reason about.
And then I tried to apply the same mental model to images… and it fell apart.
A fully-connected network looking at pixels is like a model staring at a spreadsheet where the columns have been randomly permuted.
It can still learn something, but it’s fighting the structure of the data.
July was the month I learned why CNNs weren’t “just another neural network”.
They were a different way of seeing.
What this post teaches
Why fully-connected nets “fight” image structure, and how convolutions bake that structure into the model.
The 3 ideas to remember
Locality, parameter sharing, and receptive fields (hierarchy through depth).
The practical skill
How to think about shapes: filter size, stride/padding, output dimensions, parameter count.
A picture is not a bag of numbers.
Pixels have neighbors, and those neighborhoods carry most of the meaning: edges, textures, parts of objects.
A fully-connected network treats every pixel-to-hidden-unit connection as independent and equally likely to matter.
Which creates two immediate issues: an explosion of parameters, and no built-in notion that a pattern learned in one part of the image should apply everywhere else.
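To make the “permuted spreadsheet” point concrete, here’s a tiny numpy sketch (the layer size and random weights are purely illustrative assumptions): permute the pixels and permute the weight columns the same way, and a fully-connected layer computes exactly the same outputs. Pixel adjacency means nothing to it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3072)           # a flattened 32x32x3 "image"
W = rng.normal(size=(256, 3072))    # weights of one fully-connected layer
perm = rng.permutation(3072)        # a fixed scrambling of pixel order

# Scramble the pixels and the matching weight columns:
# the layer's output is identical, so the layer has no built-in
# notion of which pixels are neighbors.
print(np.allclose(W @ x, W[:, perm] @ x[perm]))  # True
```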
In 2016, feature engineering was my way to inject structure. In 2017, CNNs made me realize:
the model can carry the structure.

I avoided CNNs for a while because “convolution” sounded like pure signal processing.
But the simplest way I got it into my head was:
A convolution layer applies the same small pattern detector across the entire image.
That’s it.
Think “3×3-ish pattern detector” (not because it’s magical, but because images are local).
Same detector, everywhere.
That’s the key: you reuse the same weights across positions.
Each position outputs one value (how strongly the detector matches there).
Collect those values into a grid: that grid is the feature map.
Instead of one weight per pixel connection, you have a small filter (kernel) that slides across the input and produces an output map.
In practice, it’s like saying: “learn one small pattern detector, then ask the same question at every position of the image.”
That idea has a name:
parameter sharing.
And it’s probably the reason CNNs are even possible.
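To make “same detector, everywhere” concrete, here’s a minimal numpy sketch of a single-filter convolution (stride 1, no padding). The function name and the toy vertical-edge kernel are my own illustrative choices, and strictly speaking this is cross-correlation, which is what deep learning libraries call convolution anyway.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Slide one small detector across a grayscale image (stride 1, no padding)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]           # local patch
            feature_map[i, j] = np.sum(patch * kernel)  # same weights everywhere
    return feature_map

# A toy vertical-edge detector applied to a random 8x8 "image".
image = np.random.rand(8, 8)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d_single_filter(image, kernel).shape)  # (6, 6) feature map
```

Nine weights, reused at every one of the 36 output positions: that’s parameter sharing in one loop.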
Local connectivity
Each unit looks at a small nearby patch, not the whole image.
Parameter sharing
One detector reused everywhere → fewer parameters, more data efficiency.
Receptive fields
Stacking layers grows what a unit can “see” → hierarchy from edges to parts.
In a fully-connected net, every hidden unit can depend on every pixel.
In a CNN, a unit depends on a small patch.
That matches how images work: nearby pixels form edges, textures, and parts of objects; pixels far apart are only weakly related at that level.
That assumption is an inductive bias.
And it’s exactly the kind of bias that makes learning easier.
Instead of learning separate edge detectors for every region, CNNs learn one edge detector and reuse it.
That reduces parameters drastically.
I started seeing this as an engineering tradeoff: you give up position-specific weights, and in exchange you get far fewer parameters and far better data efficiency.
It’s not just “cute architecture”. It’s what makes the problem tractable.
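To put rough numbers on it, here’s a back-of-the-envelope comparison for a 32×32×3 input; the layer sizes (256 hidden units, 64 filters of 3×3) are arbitrary examples I picked for illustration.

```python
# Rough parameter counts: fully-connected vs convolutional, for a 32x32x3 input.
in_h, in_w, in_c = 32, 32, 3

# Fully-connected: every pixel connects to every hidden unit.
hidden_units = 256
fc_params = (in_h * in_w * in_c) * hidden_units + hidden_units  # weights + biases

# Convolution: one 3x3x3 detector per output channel, reused at every position.
filters, kh, kw = 64, 3, 3
conv_params = filters * (kh * kw * in_c) + filters  # weights + biases

print(f"fully-connected: {fc_params:,} parameters")    # 786,688
print(f"convolutional:   {conv_params:,} parameters")  # 1,792
```

Same input, wildly different budgets - and the conv layer’s count doesn’t grow with image size at all.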
This one took me a while.
Each convolution layer sees a small patch.
But stacking layers increases what a unit can “see”: a 3×3 filter on top of a 3×3 filter effectively covers a 5×5 patch of the input, and a third layer pushes that to 7×7.
The key idea is receptive field expansion through depth.
Depth isn’t just “more compute”.
It’s a way to build hierarchical understanding.
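Here’s a small sketch of that growth, using the standard receptive-field recurrence (each layer adds (kernel - 1) × the product of the earlier strides); the example layer stacks are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of the top unit w.r.t. the input.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for k, stride in layers:
        rf += (k - 1) * jump  # grow by (kernel - 1) scaled by accumulated stride
        jump *= stride
    return rf

# Three stacked 3x3 convs, stride 1: each layer only sees 3x3,
# but the top unit effectively sees a 7x7 patch of the original image.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))          # 7
# Add a stride-2 layer and the receptive field grows much faster.
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 1)]))  # 13
```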
In classical ML, I learned to obsess over features.
If the model wasn’t performing, the usual answer was: engineer better features.
CNNs flip that: the early layers learn the features themselves, directly from pixels.
It’s not “no feature engineering”.
It’s feature engineering moved into the architecture: filter sizes, strides, depth, and how the layers are composed.
So this month’s “aha” wasn’t that CNNs are powerful.
It was that CNNs feel like a design decision, not just a training outcome.
In 2017, I started engineering the model to produce its own features.
I didn’t try to build a full AlexNet this month. That would have been self-sabotage.
Instead, I forced myself to implement a minimal pipeline that would prove the concept.
My checklist: implement a convolution forward pass by hand, predict every output shape on paper before running anything, and count the parameters of each layer.
The goal wasn’t SOTA performance.
The goal was shape sanity and intuition.
Quick mental check: output size = (input - filter + 2×padding) / stride + 1, rounded down, applied per spatial dimension.
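A tiny helper makes that check mechanical; this is just a sketch of the formula above, not any library’s API, and the toy layer stack is made up for illustration.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Spatial output size of a conv (or pooling) layer along one dimension."""
    return (in_size - filter_size + 2 * padding) // stride + 1

# Walk a 28x28 input through a toy stack and sanity-check every shape.
size = 28
size = conv_output_size(size, 5)             # 5x5 conv, no padding  -> 24
size = conv_output_size(size, 2, stride=2)   # 2x2 max-pool          -> 12
size = conv_output_size(size, 3, padding=1)  # 3x3 conv, "same" pad  -> 12
print(size)  # 12
```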
This felt weirdly familiar.
In Octave ML exercises, I debugged matrix shapes constantly.
CNNs are basically “shape debugging with more dimensions”.
So my 2016 foundation helped more than expected.
My 2016 instinct was:
if the model is weak, add features
if it overfits, add regularization
if it doesn’t learn, fix optimization
CNNs introduced a new lever:
if the model is blind, change the architecture.
This was the first time I really felt that architecture is a form of prior knowledge.
Not a minor detail.
A choice that can be stronger than better features, more regularization, or a better optimizer.
I used to think learning was mostly about flexibility: give the model enough capacity, enough data, and a good optimizer.
Now I’m starting to see:
Good inductive bias can outperform brute-force flexibility.
CNNs don’t win because they can represent anything. They win because they represent the right kinds of things for images.
That’s different.
And honestly, it’s kind of relieving.
Because it means deep learning isn’t just “scale and hope for a different outcome”.
It’s design.
Convolutions explained why CNNs work structurally.
But I still didn’t fully understand what they learn internally — and why they generalize so well compared to the older “hand-crafted features + classifier” pipeline.
Next month, I’m going deeper into the CNN stack: pooling, hierarchies, and what the feature maps are really learning.
August is about understanding CNNs as representation learners — not just convolution operators.
Isn’t convolution just edge detection?
Edge detection is a good mental starter, but convolution is broader: it’s “apply the same pattern detector everywhere.” Early layers often learn edges, but deeper layers learn compositions of those patterns.
Can’t a fully-connected net learn images anyway?
You can, but you waste parameters and ignore locality. CNNs build the right assumptions into the model, which makes learning more data-efficient and easier to optimize.
What was the hardest part?
Shapes. Output dimensions, filter depth, padding/stride interactions. The math is fine — the bookkeeping is where you bleed.
If your output shape surprises you, stride/padding are usually the first suspects.