
Convolution made CNNs "possible". Pooling and depth made them "useful" - invariance, hierarchies, and feature maps that start to look like learned vision primitives.
Axel Domingues
Last month I finally stopped treating CNNs as “a weird math trick” and started seeing them as an engineering idea:
encode the structure of images into the model so learning becomes easier.
But July still left a big question hanging:
If convolution is “the same detector everywhere”… what are the detectors actually detecting?
And why do CNNs seem to generalize better than older pipelines that used hand-crafted features?
August was about looking inside the box.
Not with hype. With debugging.
What you’ll learn
Why pooling exists, how depth creates hierarchies, and how feature maps become a real debugging tool.
The tradeoffs you’ll keep seeing
Invariance vs precision, compression vs detail, interpretability vs abstraction.
The practical loop
Train → check shapes → visualize feature maps → compare variants → learn the behavior.
I wanted to understand three things, concretely: why pooling exists, what depth actually buys, and whether feature maps can work as a real debugging tool.
Because I’ve learned the hard way: if you can’t inspect a model, you can’t trust it.
Pooling sounded like a detail.
It’s not.
The simplest mental model that worked for me:
Pooling throws away exact position information to keep “what” over “where”.
If convolution says: “this pattern is here, at exactly this position,”
pooling says: “this pattern is somewhere in this neighborhood - the exact spot doesn’t matter.”
This is not free. It’s a trade.
When pooling helps: when small shifts in position shouldn’t change the answer, and when you want to cut compute by shrinking the spatial grid.
When pooling hurts: when precise spatial information is the point - pooling too early removes small details that later layers can never get back.
I kept asking myself: why not just increase stride on convolution and skip pooling?
The answer I arrived at was practical:
Pooling acts like a stable, non-learned compression step.
That stability matters when training is already fragile.
So my mental model became:
Convolution extracts, pooling summarizes.
Not a perfect description, but it keeps me honest.
That “stability” can make training easier to reason about.
If you want downsampling that also learns which signals matter, stride is attractive.
If you want a predictable summary step to reduce spatial sensitivity, pooling is attractive.
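To make that tradeoff concrete, here is a minimal sketch of the two options (the framework choice - PyTorch - and the layer sizes are mine, purely for illustration): both paths halve the spatial resolution, but one summarizes with a fixed rule while the other learns its own downsampling.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)  # (batch, channels, height, width)

# Path A: convolution extracts, pooling summarizes with a fixed, non-learned rule
conv_then_pool = nn.Sequential(
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # keeps 32x32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),       # 32x32 -> 16x16, keeps the max of each 2x2 window
)

# Path B: a strided convolution downsamples and learns which signals to keep
strided_conv = nn.Sequential(
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16 in one learned step
    nn.ReLU(),
)

print(conv_then_pool(x).shape)  # torch.Size([1, 16, 16, 16])
print(strided_conv(x).shape)    # torch.Size([1, 16, 16, 16])
```

Same output shape either way; the difference is whether the downsampling step is a predictable summary or another thing the optimizer has to get right.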
Before CNNs, “deep” mostly sounded like “more of the same”: extra layers, extra parameters, extra capacity.
CNNs made me see depth differently:
depth builds a hierarchy of representations.
I started grouping what CNNs learn into “levels”: early layers pick up edges and simple gradients, middle layers combine them into textures and motifs, and deeper layers respond to larger, part-like structures.
That’s the theory.
This month I wanted evidence.
So I forced myself to inspect feature maps.

Feature maps are the output of filters.
If a filter is a detector, the feature map is the detector’s “activation heat”.
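One way to see the “activation heat” idea without any training at all: hand-write a single detector and convolve it over a toy image. A minimal sketch (the image, kernel, and sizes are made up for illustration):

```python
import torch
import torch.nn.functional as F

# A tiny synthetic "image": left half dark, right half bright -> one vertical edge
img = torch.zeros(1, 1, 8, 8)          # (batch, channels, height, width)
img[:, :, :, 4:] = 1.0

# A hand-written vertical-edge detector (Sobel-style 3x3 kernel)
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)

# The feature map is the detector's "activation heat":
# it is non-zero only in the columns where the dark -> bright transition sits.
feature_map = F.conv2d(img, kernel)    # no padding -> 6x6 output
print(feature_map[0, 0])
```

A learned filter behaves the same way; the network just picks the kernel values itself.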
When I started visualizing them, I had a strong reaction:
CNNs aren’t learning “objects”.
They’re learning a vocabulary of useful visual signals.
A few observations that hit me:
Pooling created invariance locally.
But stacking conv + pooling repeatedly creates a ladder: each block sees a summarized version of the one below it, so the same small filters end up responding to larger and more abstract patterns.
This is where the whole “representation learning” idea became real to me.
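Here is roughly what that inspection harness looks like - a minimal PyTorch sketch using forward hooks on a toy conv + pool ladder (the architecture and sizes are illustrative, not the exact model from my runs):

```python
import torch
import torch.nn as nn

# A toy conv -> pool ladder: each block halves resolution and widens channels
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64x64 -> 32x32
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
)

# Forward hooks capture each conv layer's output (its feature maps) during a forward pass
feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        module.register_forward_hook(save_output(name))

x = torch.randn(1, 3, 64, 64)   # same fixed input every run, so inspections are comparable
model(x)

for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))
# '0' (1, 16, 64, 64) -> low-level maps at full resolution
# '3' (1, 32, 32, 32) -> mid-level maps at half resolution
# '6' (1, 64, 16, 16) -> higher-level maps at quarter resolution
# Each fmap[0, k] is one filter's "activation heat"; plot it to see where filter k fires.
```

The shape ladder alone already tells the story: resolution shrinks, channels grow, and each level summarizes the one below it.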
In 2016, PCA was “compress the data while keeping variance”.
In CNNs, the compression is learned and selective: keep whatever helps the task, throw away whatever doesn’t.
That’s a different objective.
And a different mindset.
I got asked about dropout recently, and it’s been nagging me.
Dropout is not a CNN-only thing, but it becomes practical right around here, because the models are now big enough to memorize instead of learn - and regularization has to come from somewhere.
I’d keep it focused: during training, dropout randomly zeroes a fraction of activations so the network can’t lean on any single unit; at inference it’s switched off.
I’m not going deep into “modern best practices” here—just building the intuition that regularization didn’t disappear just because the architecture got smarter.
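For the record, the mechanical part is tiny. A sketch of what dropout on a small dense head looks like (the layer sizes, placement, and rate are my illustration, not a recommendation from these experiments):

```python
import torch.nn as nn

# A small classifier head with dropout: in train() mode, Dropout randomly zeroes
# half of the activations; in eval() mode it does nothing.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # regularization: active during training, disabled at evaluation
    nn.Linear(128, 10),
)
```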
I didn’t want “I read about CNNs” notes.
So I built a small workflow that I can reuse:
Train a small model. Not trying to win benchmarks - just enough to learn patterns.
Check the shapes by hand (sketched just after this list). If I can’t compute the output dimensions by hand, I don’t trust my code.
Visualize feature maps on a fixed set of inputs. Same few images every time, so changes are comparable.
Compare variants - with and without pooling, shallower vs deeper. I wanted to feel the difference, not tune to perfection: just prove the effect exists and is understandable.
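The shape check is just the standard output-size formula, applied by hand and then verified against the framework. A small helper I keep around (the function name is mine):

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer: floor((size - kernel + 2*padding) / stride) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# By hand: 32x32 input, 3x3 conv with padding 1 -> 32x32; then 2x2 max pool -> 16x16
after_conv = conv_output_size(32, kernel=3, padding=1)          # 32
after_pool = conv_output_size(after_conv, kernel=2, stride=2)   # 16
print(after_conv, after_pool)

# Verify against the framework before trusting the rest of the pipeline
block = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.MaxPool2d(2))
print(block(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 8, 16, 16])
```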
The failure mode I was most afraid of was thinking a model was learning when it was actually learning noise. Feature map inspection helped me catch that earlier than the metrics did.
In 2016, I learned: design the features by hand, then let the model fit them.
August rewired that: design the architecture, and let the model learn the features.
The system-thinking mindset stayed the same.
The objects changed.
Representation learning replaced manual feature design.
Not because feature engineering is “dead” — but because the model architecture became the feature engineering.
Pooling and hierarchies are the bridge: they explain how CNNs go from pixels to meaning without me hand-crafting anything.
CNNs taught me something important:
good architecture is a form of domain knowledge.
But images are still “static”. The input doesn’t have order.
Next month I’m moving to the kind of data that breaks the assumptions I relied on in 2016: sequences.
September is where “time” enters the network: Recurrent Neural Networks, unrolling, and shared weights across steps.
Should I always use pooling?
No. Pooling is useful when you want invariance and lower compute, but it can hurt when you need precise spatial information. I’m treating it as a deliberate tradeoff, not a default.
What are feature maps actually showing me?
They’re showing where a filter “fires” on an input. For early layers, that often maps to interpretable patterns like edges. For deeper layers, it becomes less obvious - but still useful for debugging whether the network is learning coherent structure.
Should I pool after every convolution?
Not necessarily. Pooling too often can remove detail too early. A common intuition is: pool when you want to trade spatial precision for invariance and compute savings - but keep enough resolution early so the network can still “see” small patterns.