
Before transformers, language models tried to compress entire histories into a single hidden state. This post explains why that was brittle: depth-in-time, vanishing/exploding gradients, and the engineering limits of “memory” — and why attention was inevitable.
Axel Domingues
January was the “architect’s reset”:
LLMs are not deterministic functions.
They’re probabilistic components — and reliability has to be designed.
But if you want good instincts for modern LLM behavior, you need one piece of history:
Why language modeling was hard before transformers.
Not in a “fun trivia” way — in a failure-modes way.
Because a lot of what still goes wrong in LLM products (context loss, hallucination under missing evidence, sensitivity to phrasing) makes more sense when you understand what earlier NLP systems were fighting.
This month is the prequel:
- **The core problem:** NLP needed long-range dependencies. RNNs tried to carry them through a single hidden state.
- **The core failure:** Unrolling makes RNNs deep in time → vanishing/exploding gradients → brittle training.
- **The deeper limitation:** “Memory” is a bottleneck: compressing a long history into one vector loses information and induces interference.
- **The setup for March:** Attention breaks the bottleneck: direct access to past tokens + parallelism + better gradient paths.
Language is a sequence problem: the meaning of a token depends on what came before it, sometimes many sentences before.
So when RNNs became practical, the pitch was incredibly appealing:
Feed tokens one by one, keep a hidden state, and learn “memory.”

That framing is not wrong.
But the engineering reality was much messier.
To understand why, we need the two ideas that shaped everything: unrolling through time, and the single hidden state that has to act as memory.
Here’s the subtle trap:
An RNN can be “one layer”… but when you unroll it for 200 steps, it behaves like a 200-layer deep computation graph.
Same weights, repeated many times.

And the backprop story is brutal: the gradient has to travel back through every one of those steps, getting multiplied by roughly the same Jacobian each time. Factors a bit below 1 shrink it toward zero (vanishing); factors a bit above 1 blow it up (exploding).
That’s the familiar deep learning lesson — just relocated into time.
Depth in time is still depth.
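If you want to see it rather than take my word for it, here is a minimal NumPy sketch (toy sizes, and it ignores the nonlinearity’s derivative, which only makes vanishing worse): repeat the same backward multiplication 200 times and watch the gradient norm collapse or blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 64, 200   # 200 unrolled steps ≈ a 200-layer computation graph

def grad_norm_after_bptt(scale):
    # Recurrent weight matrix; `scale` roughly sets its spectral radius.
    W = scale * rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
    grad = rng.standard_normal(hidden)
    for _ in range(steps):
        grad = W.T @ grad          # one backward step through time
    return np.linalg.norm(grad)

print(grad_norm_after_bptt(0.9))   # shrinks toward 0: vanishing
print(grad_norm_after_bptt(1.1))   # grows by orders of magnitude: exploding
```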
The failure patterns were so consistent they became muscle memory:
- **Learns short-term only.** Local patterns improve, but long-range dependencies never get reliably better.
- **Sudden instability.** Training looks fine… then spikes, diverges, or produces NaNs.
- **Hyperparameter cliff edges.** Tiny changes in LR, init, or sequence length produce totally different outcomes.
- **“Seems to work” but is brittle.** One run looks good; the next seed collapses. Results don’t survive repetition.
If this sounds like reinforcement learning instability… that’s not a coincidence.
It’s the same underlying theme:
learning signal is fragile when it has to pass through long chains.
Even if you solved gradients perfectly, RNNs still have a structural limit:
They compress the entire past into a single fixed-size hidden state.
That’s a bottleneck.
It forces a constant tradeoff: keep fine-grained detail about the recent past, or keep a coarse summary of the distant past. There isn’t room in one fixed-size vector for both.
So models tend to learn a recency bias: recent tokens dominate the hidden state while older ones fade.
Not because they are stupid, but because they are forced to compress.
As the sequence progresses, new content overwrites older content in the hidden state.
The model can’t keep “everything,” so it learns what to forget.
That often means it forgets what humans consider “important,” because the objective is next-token likelihood.
Even if a piece of information is still “in” the hidden state, it can be encoded in a tangled way.
The model might not be able to access it cleanly when needed, because retrieval becomes a learned decoding problem, not an explicit read.
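A toy sketch of that recurrence in plain NumPy (sizes are arbitrary): no matter how long the sequence gets, everything the model “remembers” has to fit in the same fixed-size vector, and every step rewrites it.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, input_dim, seq_len = 128, 32, 1_000

W = rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
U = rng.standard_normal((hidden, input_dim)) / np.sqrt(input_dim)

h = np.zeros(hidden)                 # the entire "memory": 128 numbers
for x_t in rng.standard_normal((seq_len, input_dim)):
    h = np.tanh(W @ h + U @ x_t)     # every step overwrites the same vector

# After 1,000 tokens, the history is whatever survived 1,000 rewrites of h.
```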
This matters for product intuition:
Many modern “LLM weirdness” behaviors are still about what context is accessible versus merely “present somewhere in the prompt.”
We’ll come back to that.
LSTMs were an engineering response to the exact pain above: learned gates that decide what to forget, what to write, and what to expose, plus an additive cell-state update that gives gradients a cleaner path.
They made sequence learning practical.
But they didn’t remove the core constraints: the state is still a single fixed-size vector, and the computation is still strictly sequential.
That approach hits a ceiling.
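For concreteness, here is roughly what that engineering response looks like as a single step (NumPy sketch, biases omitted): gates decide what to forget, write, and expose, and the additive cell update is friendlier to gradients. Note what hasn’t changed: the state is still a fixed-size vector, updated one step at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp = 8, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, Wf, Wi, Wo, Wg):
    """One LSTM step; each W maps the concatenated [h, x] to hidden size."""
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z)        # forget gate: what to erase from the cell state
    i = sigmoid(Wi @ z)        # input gate: what new content to write
    o = sigmoid(Wo @ z)        # output gate: what to expose as the hidden state
    g = np.tanh(Wg @ z)        # candidate content
    c = f * c + i * g          # additive update: a friendlier gradient path
    h = o * np.tanh(c)         # but h and c are still fixed-size vectors
    return h, c

Wf, Wi, Wo, Wg = (rng.standard_normal((hidden, hidden + inp)) for _ in range(4))
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.standard_normal((5, inp)):    # still strictly one step at a time
    h, c = lstm_step(x_t, h, c, Wf, Wi, Wo, Wg)
```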
This is the part that matters for senior engineers:
RNN-based NLP systems weren’t just “less accurate.” They were operationally fragile.
RNN inference is fundamentally sequential: step t cannot start until step t−1 has finished.
That means latency grows with sequence length, you can’t parallelize across time steps within a sequence, and expensive accelerators spend most of their cycles waiting.
If you’ve ever tried to scale a system that can’t parallelize its hottest loop, you know how that ends: you buy hardware and still lose on p99.
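A deliberately apples-to-oranges timing sketch (NumPy on CPU; the two computations are not equivalent, the point is the dependency structure): the recurrent loop has to run its T steps one after another, while a position-parallel formulation collapses into a single large matmul that hardware actually likes.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
d, T = 512, 2_048
W = rng.standard_normal((d, d)) / np.sqrt(d)
X = rng.standard_normal((T, d))

start = time.perf_counter()
h = np.zeros(d)
for x_t in X:                        # RNN-style: T dependent steps
    h = np.tanh(W @ h + x_t)         # step t can't start before step t-1
sequential_s = time.perf_counter() - start

start = time.perf_counter()
H = np.tanh(X @ W.T)                 # position-parallel: one big matmul
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.3f}s  parallel-friendly: {parallel_s:.3f}s")
```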
Many datasets reward local heuristics, so models can “look good” while failing the real problem: long-range understanding.
This is the same theme as the 2019–2020 trading work:
If your evaluation distribution is biased toward easy shortcuts, you won’t notice brittleness until production.
You might be thinking:
“Cool history lesson. But we use transformers now.”
Yes — but the software lesson persists:
So here’s the translation table I keep in my head when designing LLM features:
| Old problem (RNN era) | Modern manifestation (LLM products) | What you do about it |
|---|---|---|
| Memory bottleneck | Context window limits, recency effects | Retrieval + context budgeting + summarization discipline |
| Weak long-range credit | Model ignores earlier constraints | Strong system prompts + structure + tool constraints + verification |
| Training instability | Unstable behavior across versions | Evals + canary rollouts + regression suites |
| “Looks good on dataset” | “Worked in a demo” but fails live | Realistic eval sets + adversarial prompts + telemetry |
| Sequential cost wall | Token cost + latency budgets | Streaming, caching, smaller models, routing policies |
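As one concrete instance of the “context budgeting” row, here is a hypothetical helper (the names and the crude token counter are mine, not a real library API): decide what goes into the window deliberately instead of letting truncation decide for you.

```python
def budget_context(chunks, max_tokens, count_tokens):
    """Pack the highest-scoring retrieved chunks into a fixed token budget.

    `chunks` is an iterable of (relevance_score, text) pairs and
    `count_tokens` is whatever tokenizer-based counter your stack provides.
    """
    packed, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue                  # drop explicitly rather than truncate mid-chunk
        packed.append(text)
        used += cost
    return packed


# Example with a crude whitespace counter standing in for a real tokenizer:
docs = [(0.9, "refund policy: 30 days"),
        (0.4, "company history..."),
        (0.8, "refund exceptions...")]
print(budget_context(docs, max_tokens=6, count_tokens=lambda t: len(t.split())))
```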
You don’t ship “model intelligence.” You ship a behavior under constraints.
Even though we’re not building RNNs anymore, the debug mindset is identical.
When your LLM “forgets” something, don’t argue with it.
Instrument the system.
Take the failing conversation and turn it into:
- a minimal reproducible prompt,
- an eval case that joins your regression suite,
- a trace of what evidence was actually present (and accessible) at the failing turn.

If the model can’t answer reliably:
- give it the evidence explicitly (retrieval),
- restructure and budget the context so the evidence is easy to use,
- or let it decline to answer.
That last bullet is the key architectural move:
abstention is a feature.
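Here is one way that contract can look at the interface level; the types and the `llm_call` parameter are illustrative, not a prescribed API.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    evidence: list[str]              # the context snippets the answer relies on


@dataclass
class Abstention:
    reason: str                      # surfaced to the caller / UI, not hidden


def answer_or_abstain(question, evidence_chunks, llm_call):
    """Answer only when grounded; otherwise return an explicit Abstention."""
    if not evidence_chunks:
        return Abstention(reason="no supporting evidence in context")
    text = llm_call(question=question, evidence=evidence_chunks)
    return Answer(text=text, evidence=list(evidence_chunks))
```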
RNNs failed in ways that made engineers cynical.
Transformers didn’t succeed because someone invented a prettier equation.
They succeeded because they attacked the two structural walls: the fragile gradient path through time, and the fixed-size memory bottleneck.
And the tool they used was conceptually simple:
Let the model look back directly.
That’s attention.
Which is why March is inevitable.
RNNs still matter because they made sequence learning concrete.
They gave us a workable mental model for hidden state as memory, for information flowing one step at a time, and for why long-range dependencies are genuinely hard.
That intuition is still valuable when reasoning about modern LLM behavior (especially context and failure modes).
LSTMs made long-term dependencies more learnable and training more stable.
But they didn’t remove the fixed-size state or the strictly sequential computation.
So they raised the ceiling, but they didn’t remove the ceiling.
The lesson that carries forward: “the model saw the input” is not the same as “the model can reliably use the input.”
Architecturally, this forces you to design for evidence access, verification, and abstention instead of assuming comprehension.
Now we’ve earned the intuition.
Next month we look at the breakthrough that changed everything:
Transformers: Attention as an Engineering Breakthrough (Not a Math Flex)
Because attention isn’t just a trick.
It’s a new system contract for memory, scale, and reliability.