
RNNs looked elegant on paper. Training them exposed the same old enemy—vanishing/exploding gradients—just with “depth in time”.
Axel Domingues
Last month, I finally understood what an RNN is.
This month, I learned what an RNN does to you:
It turns training into a fight with gradients.
And not in a theoretical way. In a “my loss is NaN” way.
What this post explains
Why vanilla RNNs reintroduce vanishing/exploding gradients via depth in time.
The 3 failure patterns: weak long-range learning, sudden training collapse, and extreme sensitivity to hyperparameters.
The practical toolkit: instrumentation + gradient clipping + sequence-length curriculum + cautious learning rate.

In March I wrote about why deeper networks were harder to train than I expected.
October felt like that lesson returning… with a twist.
Because with RNNs, you can have a network that is shallow in layers, but extremely deep in time:
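Here is a minimal sketch of what "deep in time" means in code (NumPy, with placeholder shapes; this is the standard vanilla-RNN update, not my exact model):

```python
import numpy as np

T, H, X = 100, 64, 32            # time steps, hidden size, input size
W = np.random.randn(H, H) * 0.1  # recurrent weights, reused at every step
U = np.random.randn(H, X) * 0.1  # input-to-hidden weights
h = np.zeros(H)

for t in range(T):
    x_t = np.random.randn(X)      # stand-in input at step t
    h = np.tanh(W @ h + U @ x_t)  # ONE layer... applied T times in a row

# A loss computed at step T has to send its gradient back through all T
# applications of W and tanh to reach step 0. That chain is the "depth".
```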
So all the trouble we saw in deep feedforward nets shows up again.
Just… now it’s happening through time.
Once I accepted that, RNN training stopped feeling “mysterious” and started feeling like a predictable failure mode with a checklist.
I ran a simple character-level RNN again, but pushed it slightly harder:
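Something in the spirit of the sketch below (PyTorch; the corpus, hidden size, and sequence length here are placeholders I'm making up for illustration, not my actual run):

```python
import torch
import torch.nn as nn

corpus = "hello world, hello gradients. " * 200   # stand-in text
chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in corpus])

SEQ_LEN = 100    # "pushed slightly harder": a longer unroll than a toy demo
HIDDEN = 128

class CharRNN(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)  # vanilla tanh RNN
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x, h=None):
        out, h = self.rnn(self.embed(x), h)
        return self.head(out), h

model = CharRNN(len(chars), HIDDEN)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    i = torch.randint(0, len(data) - SEQ_LEN - 1, (1,)).item()
    x = data[i:i + SEQ_LEN].unsqueeze(0)          # (1, T) input characters
    y = data[i + 1:i + SEQ_LEN + 1].unsqueeze(0)  # next-character targets
    logits, _ = model(x)
    loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```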
And I saw three recurring behaviors:
Symptom 1: Learns local structure, but long context doesn't improve.
Symptom 2: Training looks fine… then suddenly spikes and collapses.
Symptom 3: Small changes in hyperparameters produce totally different outcomes.
At first this looked like randomness.
Then I realized it was the same underlying issue:
Gradient flow across long chains is fragile.
Here’s how it felt in practice:
The key symptom:
Improvements in prediction happen mostly for nearby dependencies.
It’s like the network can only “hear” the last few time steps.
Everything earlier becomes muffled.
Train on a short sequence length where it clearly improves, then increase length. If learning quality collapses as length grows (with the same code), you’ve learned something real: the backward signal from the loss isn’t reaching early time steps reliably.
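The check can be as blunt as a loop over lengths, training a fresh model at each one. This reuses the hypothetical CharRNN sketch from above; `train_chars` is a made-up helper that runs the training loop and returns the final loss:

```python
for seq_len in [10, 25, 50, 100, 200]:
    model = CharRNN(len(chars), HIDDEN)             # fresh model per length
    final_loss = train_chars(model, data, seq_len)  # hypothetical helper
    print(f"seq_len={seq_len:4d}  final_loss={final_loss:.3f}")

# Healthy: longer context keeps helping.
# Vanishing gradients: improvement stalls (or reverses) as seq_len grows.
```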
Every time step applies another transformation to the signal (forward) and to the gradient (backward).
If that transformation tends to shrink values, then after enough steps almost nothing survives: the gradient that reaches the earliest time steps is effectively zero.
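You can see the arithmetic without any network at all; this is just repeated multiplication, standing in for the backward pass:

```python
signal = 1.0
for _ in range(100):       # 100 "time steps"
    signal *= 0.9          # each step shrinks the signal a little
print(signal)              # ~2.7e-05: effectively zero
```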
So the network may technically “have memory” — but training never teaches it to use it.
When I trained on short sequences, learning looked fine. When I trained on long sequences, the “same code” suddenly stopped learning. That length sensitivity is a huge clue.
The other failure was more dramatic.
Everything looks fine… then the loss suddenly spikes, and soon it's NaN.
This felt like training “falls off a cliff”.
And the mechanism is the same story in reverse:
If a transformation tends to amplify values, repeated over many steps it can turn small numbers into huge ones.
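And the toy version of the explosion (repeated multiplication again; in real runs the values overflow to inf and then the loss turns into NaN):

```python
signal = 1.0
for _ in range(100):
    signal *= 1.5          # each step amplifies the signal a little
print(signal)              # ~4.1e+17 after only 100 steps
```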
In classical ML I almost never saw “catastrophic training collapse.”
Here it was… normal.
You’ll think your implementation is wrong.
My 2016 “treat it like a system” mindset saved me here.
Instead of guessing, I added instrumentation.
If I can’t see the signals, I can’t reason about the failure.
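Concretely, "instrumentation" for me just means logging gradient norms every few steps. A minimal PyTorch sketch (call it after `loss.backward()` and before the optimizer step; `model` is whatever nn.Module you're training):

```python
def log_grad_norms(model, step):
    # Print per-parameter gradient norms plus the overall norm.
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.norm().item()
        total_sq += g ** 2
        print(f"step {step:5d}  {name:30s}  grad_norm={g:.3e}")
    print(f"step {step:5d}  total grad_norm={total_sq ** 0.5:.3e}")
```

A total norm that suddenly jumps by orders of magnitude is the explosion arriving; per-layer norms that keep shrinking as sequences get longer are the vanishing story.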
I didn’t want to jump straight to LSTMs yet.
I wanted to understand what you can do with a “vanilla” RNN first.
Here’s the short list that actually made my runs survivable.
When gradients explode, clip them to a maximum norm.
This doesn’t fix learning long-term dependencies, but it prevents training from collapsing.
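In PyTorch this is a single call between `loss.backward()` and `optimizer.step()` (the max-norm value here is a placeholder, not a tuned recommendation):

```python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if the total norm exceeds 1.0
optimizer.step()
```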
I started training on shorter sequences just to validate learning, then gradually increased the length.
This made debugging possible.
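As a training schedule rather than a one-off check, the curriculum is just an outer loop over lengths while keeping the same model (the lengths and step counts here are made up; `train_steps` is a hypothetical helper):

```python
for seq_len in [25, 50, 100, 200]:        # grow the unrolled depth gradually
    print(f"--- training at sequence length {seq_len} ---")
    train_steps(model, optimizer, seq_len, n_steps=2000)  # hypothetical helper
    # If progress dies the moment seq_len grows, suspect gradient flow
    # through time rather than the data pipeline or the code.
```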
Exploding gradients get worse with aggressive steps.
A smaller learning rate made training slower, but less chaotic.
Momentum can accelerate learning, but it can also amplify instability in fragile settings.
For debugging, I often started with plain SGD and only added momentum later.
That toolkit turned RNN training from "randomly catastrophic" into "predictably difficult."
I kept trying to get the vanilla RNN to remember a detail from far back in a sequence.
And it clicked that the network isn’t failing because it’s dumb.
It’s failing because the training signal has to travel through too many transformations.
Even if the architecture could represent the dependency, the optimizer can’t reliably deliver learning to the earlier steps.
This was the crucial realization:
Architecture and optimization are inseparable.
You can’t talk about “sequence modeling” without talking about gradient flow.
This month didn’t feel like “new ML”.
It felt like the same fundamentals, just under more stress.
It also echoed CNNs:
CNNs succeed partly because their structure improves learnability.
RNNs struggle partly because their structure makes learnability harder.
Same principle, opposite outcome.
I used to think:
“If the model is expressive enough, training will figure it out.”
After this month I think:
Expressiveness doesn’t matter if gradients can’t deliver learning to where it’s needed.
Depth in time is not “just another dimension.”
It’s a multiplier on optimization pain.
Vanilla RNNs taught me the problem.
Now I want the engineering solution that made sequence modeling practical:
LSTMs: Engineering Memory into the Network
Gates, memory cells, and a design that’s basically built to keep gradients alive.
Is this gradient problem unique to RNNs?
No. I saw it earlier with deep feedforward nets too. RNNs just make it unavoidable because unrolling creates deep chains through time, even with only one recurrent layer.
What single technique helped the most?
Gradient clipping. It doesn't solve long-term memory, but it prevents exploding gradients from destroying training runs.
Are vanilla RNNs still worth learning?
They're still great for building intuition. They show exactly what breaks, and that makes LSTMs feel like a purposeful engineering response instead of a magical upgrade.