
After vanilla RNNs taught me why gradients collapse through time, LSTMs finally felt like an engineered solution - keep the memory path stable, and control it with gates.
Axel Domingues
Last month’s conclusion was painfully clear:
Vanilla RNNs don’t just “struggle” with long-term dependencies — they’re structurally bad at training them.
I could get an RNN to learn local patterns (next character, short context), but the moment I needed it to remember something far back in the sequence, learning simply didn’t reach that far.
This month was the first time in 2017 I felt the “deep learning revolution” in my hands:
LSTMs are not a small tweak. They’re an architectural response to an optimization problem.
What you’ll learn
Why LSTMs are an engineered response to RNN training failure: a stable memory path + gates that control access.
The 3 ideas to keep
1. Keep a stable memory path (the cell state) so the learning signal isn't squashed at every step.
2. Let gates learn when to write, keep, and read that memory instead of hard-coding it.
3. Treat architecture as a legitimate fix: sometimes you change the model so optimization becomes possible.
How to use this post
Read once for intuition, then reuse the “Engineering Notes” checklist when implementing/debugging.
In 2016, when something didn’t learn, I had a pretty reliable playbook: check the data, check the loss, inspect the gradients, tune the learning rate, adjust regularization.
With RNNs, I did all of that… and still hit a wall.
That’s when it clicked:
Sometimes the right solution isn’t “a better optimizer”.
It’s “a model design that makes optimization possible”.
Data, loss, gradients, learning rate, regularization — the usual fixes.
If the model cannot train long dependencies even when everything is “reasonable”, the problem isn’t just tuning.
Sometimes the right move is: change the structure so optimization becomes possible.
LSTMs are exactly that.
It’s not “the network magically learns to remember.”
It’s “we give it a stable memory pathway and let training learn when to write/read/forget.”
Long-term dependency: the prediction at step t depends on something far earlier in the sequence.
When I first read about LSTMs, the gates sounded like ceremony: a forget gate, an input gate, an output gate, each with its own weights and its own sigmoid.
But after October, the gating idea became obvious:
If gradients die through time, build a path where they don’t.
In my head, I now picture LSTMs like this: a long-lived memory lane (the cell state) running through time, with gates controlling what gets written to it, what stays in it, and what gets read out of it.
So instead of forcing the hidden state to be both the working output at each step and the long-term memory, the LSTM separates those roles.
That separation is the entire point.
A long-term dependency isn’t abstract to me anymore.
It’s when the prediction at step t depends on something from step t - 50 (or t - 200).
A vanilla RNN has to keep that information alive by repeatedly transforming it through the same recurrent weights.
That’s exactly where things go wrong: the same repeated multiplication that carries information forward also shrinks (or blows up) the gradient flowing backward through time.
LSTM says:
Don’t rely on repeated transformation to preserve memory. Preserve memory directly, and learn controlled updates.
Train on a short length where the model improves, then increase length. If quality collapses sharply, you’ve learned something real: the model struggles to carry learning signal through time.
I’m intentionally avoiding equations here — because what mattered to me wasn’t the derivation.
It was the control logic.
The forget gate is the one that made me stop thinking of memory as passive.
The network can actively decide when to keep what it has stored and when to let it go.
That’s huge for sequences where context changes.
The input gate prevents the cell from being overwritten all the time.
In practice, it means new information only gets written into the cell when the gate decides it is worth writing.
The output gate captures a subtle separation I didn’t appreciate before:
The LSTM can store something for later without constantly outputting it.
That helps training too, because it avoids forcing every internal state to be immediately “useful” for prediction.
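To make that control logic concrete without turning it into a derivation, here is a minimal NumPy sketch of one LSTM step. The variable names, shapes, and the dropped bias terms are my own simplifications, not any particular library’s API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step, written to expose the control logic, not for speed."""
    Wf, Wi, Wo, Wc = params           # each maps [h_prev, x] -> n_hidden
    z = np.concatenate([h_prev, x])

    f = sigmoid(Wf @ z)               # forget gate: how much of the old cell to keep
    i = sigmoid(Wi @ z)               # input gate: how much new content to write
    o = sigmoid(Wo @ z)               # output gate: how much of the cell to expose
    c_tilde = np.tanh(Wc @ z)         # candidate content

    c = f * c_prev + i * c_tilde      # the memory lane: gated, mostly additive update
    h = o * np.tanh(c)                # working output for this step
    return h, c

# Tiny smoke test with random weights (biases omitted, sizes illustrative).
n_in, n_hidden = 4, 8
rng = np.random.RandomState(0)
params = [rng.randn(n_hidden, n_hidden + n_in) * 0.1 for _ in range(4)]
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(20):
    h, c = lstm_step(rng.randn(n_in), h, c, params)
print(h.shape, c.shape)  # (8,) (8,)
```

The line worth staring at is the cell update: the old cell is scaled by the forget gate and new content is added, instead of being pushed through the recurrent weights yet again.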
October’s failure mode was a learning signal that died on its way back through time: by the time gradients reached the early steps that mattered, there was almost nothing left.
LSTMs don’t eliminate difficulty, but they change the game: the cell state gives that signal a path that isn’t squashed at every step, so distant events can still influence learning.
In other words:
LSTMs don’t make sequence learning “easy”.
They make it possible.
I did two things in parallel: kept building intuition for the gate logic, and ran small experiments to test it.
For the experiments, I used small synthetic problems where success requires remembering something far back.
The goal was not accuracy.
The goal was to see whether learning could reach that far back at all, and what happened as I stretched the gap.
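As a hypothetical stand-in for those experiments (my own toy setup, not the exact problems from this month), here is a Keras-style sketch where the label depends only on the very first time step, so the model has to carry that one fact across the whole sequence. It also runs the “train short, then stretch the length” check from the note above.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def make_batch(n_examples, seq_len, n_features=8, seed=0):
    """Toy task: the label is decided entirely by time step 0; the rest is noise."""
    rng = np.random.RandomState(seed)
    x = rng.randn(n_examples, seq_len, n_features)
    y = (x[:, 0, 0] > 0).astype('float32')   # depends only on the first step
    return x, y

def build_model(seq_len, n_features=8):
    model = Sequential([
        LSTM(32, input_shape=(seq_len, n_features)),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Stretch the gap and watch whether learning still reaches back to step 0.
for seq_len in (10, 50, 200):
    x_tr, y_tr = make_batch(2000, seq_len, seed=1)
    x_va, y_va = make_batch(500, seq_len, seed=2)
    model = build_model(seq_len)
    model.fit(x_tr, y_tr, epochs=15, batch_size=64, verbose=0)
    _, acc = model.evaluate(x_va, y_va, verbose=0)
    print('seq_len=%3d  val_acc=%.3f' % (seq_len, acc))
```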
This is where LSTMs started to feel like a real tool.
In 2016, regularization was mostly something bolted onto a fixed model: penalty terms, early stopping, a bit of dropout.
In 2017, I’m seeing another form:
Architectural constraints.
CNNs encode locality and translation bias.
LSTMs encode stable memory and controlled flow.
Those are not just representational choices.
They’re choices that shape what learning is likely to succeed at.
Engineering Notes
LSTMs aren’t automatic.
I still hit real issues:
The LSTM has more moving pieces: gates, a cell state, and a hidden state, each with its own shape.
One swapped dimension can silently break learning.
This month reinforced my “shape debugging” habits from Octave days.
Write down the expected shapes for every tensor/state before coding. It’s boring, and it prevents hours of confusion.
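A minimal sketch of that habit, with made-up dimensions and one common stacked-gate weight layout (your framework’s layout may differ): write the shapes down as comments first, then assert them.

```python
import numpy as np

batch, time_steps, n_features, n_hidden = 32, 50, 10, 64

# Expected shapes, written down before any real code:
# x:    (batch, time_steps, n_features)
# h, c: (batch, n_hidden)
# W_x:  (n_features, 4 * n_hidden)   # stacked gate weights (one common layout)
x = np.random.randn(batch, time_steps, n_features)
h = np.zeros((batch, n_hidden))
W_x = np.random.randn(n_features, 4 * n_hidden)

assert x.shape == (batch, time_steps, n_features)
assert h.shape == (batch, n_hidden)
assert (x[:, 0, :] @ W_x).shape == (batch, 4 * n_hidden)  # one step's gate pre-activations
```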
LSTMs can memorize tiny datasets fast.
This is where dropout started to become relevant again — but not in the abstract.
It was the difference between a model that memorized my tiny training set and a model that actually generalized to held-out sequences.
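As a sketch of the kind of fix that helps here, assuming a Keras 2.x-style setup (sizes and rates are placeholders): dropout on the LSTM’s inputs and on its recurrent connections.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Regularized LSTM: dropout on inputs and on the recurrent connections,
# as a guard against memorizing a tiny dataset.
model = Sequential([
    LSTM(64, input_shape=(50, 10), dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```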
Everything here still feels like the 2016 foundation — just under a new kind of stress.
The thing that changed is what I consider “a valid fix”.
In 2016, fixes were mostly about tuning what was already there: data, loss, gradients, learning rate, regularization.
In 2017, a fix can be a different architecture.
That’s a different level of thinking.
I used to assume training difficulty was mostly about hyperparameters.
Now I’m convinced:
Training difficulty is often an architectural property.
LSTMs taught me that the right model design can turn an impossible optimization problem into a solvable one.
I’ve now hit the major “building blocks” of 2017: convolutional networks for spatial structure, and recurrent networks (now with LSTMs) for sequences.
December 2017 is the synthesis.
What actually changed between classical ML and deep learning?
What stayed the same?
And what does my “next step” look like after a year of building intuition the hard way?
Did LSTMs solve the vanishing gradient problem?
Not completely, but they changed the dynamics enough that learning long-range dependencies became realistic. In practice, training was still sensitive to learning rates and still benefited from gradient clipping, but it stopped feeling “structurally impossible.”
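For reference, clipping is usually a one-liner; a sketch assuming Keras 2.x (the threshold is just a common starting point, not a tuned value):

```python
from keras.optimizers import Adam

# Clip the gradient norm so a single exploding step can't derail training.
# Pass this optimizer to model.compile(optimizer=optimizer, ...).
optimizer = Adam(lr=1e-3, clipnorm=1.0)
```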
What is the difference between the hidden state and the cell state?
I think of hidden state as the “working output” at each time step, and cell state as the “long-lived memory lane” that is updated in a controlled way. The separation makes optimization more stable.
Do I need to reset the LSTM state between sequences?
Usually yes, between independent training examples.
If you carry state across unrelated sequences, you can leak context and training becomes confusing. If your data is a continuous stream where sequences truly connect, you may intentionally carry state — but then you must be explicit about boundaries.
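A sketch of the explicitly stateful case, assuming Keras 2.x (sizes are placeholders): with stateful=True the layer carries state across batches, so you reset it yourself at real sequence boundaries.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Stateful LSTM: state persists across batches; batch_input_shape is required.
model = Sequential([
    LSTM(32, batch_input_shape=(16, 50, 8), stateful=True),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# ...train on one continuous stream, then, at a true boundary between
# unrelated sequences, clear the carried state explicitly:
model.reset_states()
```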
Next: From Classical ML to Deep Learning - What Actually Changed (and What Didn’t) (and My Next Steps). A year after finishing Andrew Ng’s classical ML course, I’m trying to separate enduring principles from deep learning-specific techniques and decide where to go next.
Previously: Vanishing Gradients Strike Back - The Pain of Training RNNs. RNNs looked elegant on paper. Training them exposed the same old enemy, vanishing/exploding gradients, just with “depth in time”.