
Images were hard, but at least they were static. Sequences add “time”, shared weights, and state — and suddenly the assumptions I relied on in 2016 stop holding.
Axel Domingues
Convolutions taught me something comforting: if you bake the right structure into the model, learning becomes easier.
Then I switched to sequences.
And the comfort disappeared.
Text, sensor streams, audio, click trails… sequences introduce a brutal reality:
order matters, and the past can change the meaning of the present.
In 2016 I could pretend each training example was self-contained.
In September 2017, that illusion broke.
This is the month I met Recurrent Neural Networks — and understood why people say “sequence data breaks everything”.
What this post gives you
A clean mental model of RNNs: state, unrolling, and shared weights over time.
The 3 ideas to remember: hidden state, unrolling through time, and weight sharing across time steps.
The engineering mindset: RNNs are learnable and fragile, so you'll need instrumentation (state norms, gradients, sanity runs).
I tried to approach sequence problems with my "classic ML brain": hand-crafted features computed over a fixed window of the last k steps.
And it worked… until it didn't.
The failures were consistent: long-range context got cut off, and changing k changed the whole problem.
It felt like vision before CNNs: a messy feature pipeline.
So I asked the obvious question:
Can the model learn the features it needs from raw sequences, the way CNNs learn features from pixels?
That question leads directly to RNNs.
h(t): the model’s running summary of what it has seen so far.
An RNN is just a neural network that keeps a running internal summary:
At each step t, the cell combines the new input x(t) with the previous state h(t-1) to produce the new state h(t).
That hidden state is the "memory".
Not memory like a database.
Memory like: “what I’ve seen so far.”
This shifted my mental model from:
I design features that summarize the past
to:
the model builds its own summary, step by step.
So instead of me designing features, the network builds them step-by-step.
The hidden state is a feature vector that the model learns to build over time.
Hidden state should usually be reset at sequence boundaries (between independent examples), or you’ll leak context from one example into the next and training will behave strangely.
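To make that concrete, here is a minimal sketch of a single recurrent update in NumPy. The weight names (W_xh, W_hh, b_h), the tanh nonlinearity, and the toy sizes are my own choices for illustration, not anything prescribed by the model itself:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: fold the new input into the running summary."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 16))   # toy sizes: 8-dim input, 16-dim state
W_hh = rng.normal(scale=0.1, size=(16, 16))
b_h = np.zeros(16)

h = np.zeros(16)                             # reset the state at a sequence boundary
for x_t in rng.normal(size=(5, 8)):          # 5 steps of a toy sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # same weights every step, new state each step
```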
I kept getting stuck on the word “recurrent” as if it was mystical.
The breakthrough was unrolling.
If you “unroll” an RNN through time, you don’t see a loop anymore.
You see a stack of repeated cells:
Once I saw it unrolled, it stopped being exotic.
Draw the same cell repeated for t = 1..T.
The parameters don’t change across time steps — the state does.
Unrolling reveals a long dependency chain: later errors can depend on much earlier steps.
It became:
a deep network where depth = time.
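In code, unrolling is nothing more than a loop that applies the same cell with the same parameters at every step; only the state changes. A sketch, reusing the same hypothetical cell as above:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def unroll(xs, h0, W_xh, W_hh, b_h):
    """One 'layer' per time step, but a single shared set of weights."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 16))
W_hh = rng.normal(scale=0.1, size=(16, 16))
states = unroll(rng.normal(size=(20, 8)), np.zeros(16), W_xh, W_hh, np.zeros(16))
print(len(states))   # 20 hidden states, one per step; depth = time
```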
That’s also when I got nervous, because I already know what deep networks do to gradients.
(That’s next month.)
CNNs use parameter sharing in space.
RNNs use parameter sharing in time.
That parallel helped me a lot.
CNN parameter sharing
Same detector reused across space
→ “works anywhere in the image”
RNN parameter sharing
Same transformation reused across time
→ “works at any position in the sequence”
The cost of sharing in time
Errors at late steps can depend on computations from early steps
→ long chains in backprop through time
That means: one set of weights can handle a pattern wherever it appears in the sequence.
It also means: every gradient has to flow back through that same set of weights, step after step.
The good and the painful come from the same source.
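Written out, this is the chain that backpropagation through time has to traverse. A standard expression, assuming a loss L_T measured at the last step and states h_t = f(h_{t-1}, x_t): the gradient with respect to an early state is a product of per-step Jacobians, and long products of Jacobians are exactly where next month's trouble lives.

$$
\frac{\partial L_T}{\partial h_k} \;=\; \frac{\partial L_T}{\partial h_T}\,\prod_{t=k+1}^{T}\frac{\partial h_t}{\partial h_{t-1}}
$$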
In 2016, a training example felt like a standalone row in a table.
In sequences, each element is correlated with neighbors by definition.
Treating time steps as independent is like shuffling a sentence’s words and calling it the same sentence.
With most classical ML: one input row maps to one output.
With sequences, you can have different patterns: one input producing a whole sequence, a whole sequence producing one label, or sequence to sequence (like translation).
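A small sketch of what those shapes look like with the same recurrent cell (the readout matrix W_hy and the toy sizes are hypothetical): many-to-one keeps only the final state, many-to-many reads out at every step.

```python
import numpy as np

def run_rnn(xs, h, W_xh, W_hh):
    states = []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)                    # (T, hidden): one state per step

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 16))
W_hh = rng.normal(scale=0.1, size=(16, 16))
W_hy = rng.normal(scale=0.1, size=(16, 4))     # hypothetical readout to 4 classes

states = run_rnn(rng.normal(size=(30, 8)), np.zeros(16), W_xh, W_hh)

many_to_one  = states[-1] @ W_hy               # whole sequence -> one label
many_to_many = states @ W_hy                   # one prediction per step
print(many_to_one.shape, many_to_many.shape)   # (4,) and (30, 4)
```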
Even if I’m not going near translation yet, just realizing these “shapes of problems” mattered was huge.
This was the biggest shift:
Instead of feature engineering for context, the RNN carries context forward as state.
Data order is not noise.
Data order is information.
I kept it deliberately simple:
The goal wasn’t state-of-the-art.
The goal was to build a system I could debug.
Dataset choice
Next-character prediction is great because you can see progress in samples, not just numbers.
Debug goal
Make it learn local structure fast on a tiny dataset before scaling anything.
My “sanity run” rule
If it can’t overfit a tiny slice a bit, something is wrong (data, shapes, loop, or gradients).
Turn text into integer ids, then create (input_seq, target_seq) pairs offset by one step.
I began with a sequence length of around 10–20 so I could reason about what "context" even means.
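As a sketch, with made-up variable names, the preprocessing is just two steps: build a character-to-id table, then slide a window that pairs each input with the next character:

```python
text = "hello world, hello sequences"              # stand-in for the real corpus

# Turn text into integer ids.
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
ids = [char_to_id[c] for c in text]

# Create (input_seq, target_seq) pairs offset by one step.
seq_len = 15                                        # somewhere in the 10-20 range
pairs = [(ids[i:i + seq_len], ids[i + 1:i + seq_len + 1])
         for i in range(len(ids) - seq_len)]

x0, y0 = pairs[0]
print(len(pairs), x0[:5], y0[:5])                   # target = input shifted by one character
```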
No magic loops at first — I wrote it so I could print shapes at every step.
If loss doesn’t decrease fast on a tiny dataset, something is wrong.
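Here is roughly what "no magic loops, print shapes at every step" looked like, as a sketch of the forward pass and loss for one (input, target) pair; the parameter update itself (however you compute the gradients) is left out:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, seq_len = 30, 16, 15                 # toy sizes
W_xh = rng.normal(scale=0.1, size=(vocab, hidden))
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
W_hy = rng.normal(scale=0.1, size=(hidden, vocab))

inputs  = rng.integers(0, vocab, size=seq_len)      # stand-in for one (input_seq, target_seq) pair
targets = rng.integers(0, vocab, size=seq_len)

h, loss = np.zeros(hidden), 0.0
for t, (i, tgt) in enumerate(zip(inputs, targets)):
    x = np.eye(vocab)[i]                            # one-hot character
    h = np.tanh(x @ W_xh + h @ W_hh)                # recurrent update
    logits = h @ W_hy                               # next-character scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss += -np.log(probs[tgt])
    print(t, x.shape, h.shape, logits.shape)        # print shapes at every step

print("loss per char:", loss / seq_len)             # should drop fast on a tiny dataset
```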
Look at the predicted characters and see what kinds of "mistakes" it makes.
The humbling part: the samples looked locally plausible, then drifted into nonsense after a short span.
Which was the perfect setup for October.
This was the most common behavior.
The model starts producing reasonable local sequences (spaces, punctuation patterns), but loses coherence quickly.
That was my first taste of the long-term dependency problem.
I had runs where loss dropped, then exploded.
That was new.
In classic ML, optimization was frustrating but not chaotic.
Here it could be chaotic.
With CNNs, feature maps gave me something to inspect.
With RNNs, the hidden state is harder to interpret.
So I had to add my own debugging tricks:
Hidden state norms: are they exploding, collapsing toward zero, or staying in a sane range?
Gradient checks: track gradient norms over time; spikes explain the runs where loss suddenly exploded.
Sanity runs: tiny dataset, tiny model; confirm it can overfit a small slice before scaling anything.
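None of this is a library feature; these are the kind of helpers I bolted on myself (names are mine). The point is to turn "the hidden state is hard to interpret" into a few numbers you can watch every run:

```python
import numpy as np

def state_norms(states):
    """Per-step hidden state norms: exploding, collapsing toward zero, or stable?"""
    return [float(np.linalg.norm(h)) for h in states]

def grad_norms(grads):
    """Given a dict of parameter gradients (however you obtain them),
    report the norms that explain 'loss dropped, then exploded' runs."""
    return {name: float(np.linalg.norm(g)) for name, g in grads.items()}

# Usage sketch with stand-in values: norms that grow every step are a warning sign.
rng = np.random.default_rng(0)
states = [rng.normal(size=16) * (1.5 ** t) for t in range(10)]
print([round(n, 1) for n in state_norms(states)])
```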
Last year I learned that the right structure makes learning easier, and that debugging discipline is what keeps training honest.
RNNs made me apply those principles under harsher conditions.
This was still supervised learning.
Still gradients.
Still cost functions.
But the structure changed the battlefield:
Data order is information, not noise.
In 2016 I thought of learning as:
mapping inputs to outputs
In 2017, with sequences, I started thinking of learning as:
building a state that carries context forward
That’s a different mental model.
And it makes it obvious why sequence learning deserved its own class of architectures.
I can now describe RNNs confidently.
But I can also see the storm coming.
If unrolling makes an RNN “deep in time”, then training it means gradients have to travel through a long chain of steps.
Next month is about the pain point everyone warned me about:
Vanishing Gradients Strike Back: why naive RNNs fail on long sequences, and why training can explode or stall.
Isn't an RNN just a regular deep network, then?
In a sense, yes — but the key is shared weights and state. Unrolled, an RNN becomes a deep network across time steps, which changes the optimization behavior dramatically.
Why not just feed a fixed window of past steps into a classical model?
You can, and sometimes it works. But it forces you to pick a context length upfront and hand-design what matters. RNNs let the model learn what to keep, and handle variable-length sequences naturally.
What was the hardest part compared to CNNs?
Debuggability. With CNNs I can inspect feature maps. With RNNs the hidden state is harder to interpret, so I had to build my own instrumentation (state norms, gradient checks, sanity runs on tiny datasets).