
RNNs made sequence learning feel like fighting gradients. Transformers made it feel like building systems: parallelism, short gradient paths, and a memory mechanism you can scale. This post explains attention as an engineering unlock—and what it implies for real software.
Axel Domingues
February was the “why NLP was hard” month.
RNNs could represent sequence behavior… but training them was a constant negotiation with physics: vanishing gradients, strictly sequential time steps, and fragile long-range credit assignment.
March is where that story flips.
Transformers didn’t win because of a clever equation.
They won because they turned sequence modeling into something hardware can chew and optimizers can survive.
The breakthrough is attention — not as “math,” but as a new kind of system component:
A learnable, content-addressable routing layer that can connect any token to any other token in one step.
That one-step connectivity changes everything:
The goal this month: build a mental model of transformers that helps you ship LLM features without mysticism.
The engineering unlock: transformers replace sequential “memory updates” with parallel attention + short paths.
The real payoff: trainability + throughput → scale → capability. “Better models” is an operational outcome.
The production lens: attention implies a new cost model built on tokens, context windows, KV caches, latency, and failure modes.
Here’s the simplest operational definition I’ve found useful:
A transformer layer is “compute features locally” (MLP) + “route information globally” (attention), repeated many times, with stability scaffolding (residuals + normalization).
The attention part is the star: each token scores every other token it is allowed to see for relevance, then takes a weighted sum of their representations.
That’s it.
It’s not a memory cell. It’s not a loop.
It’s a routing decision that happens in parallel.
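To make the routing framing concrete, here is a minimal single-head self-attention sketch in NumPy. It is illustrative only: real models use many heads, per-layer learned weights, masking, and batching, and none of the names or shapes below come from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One routing step: every token attends to every token, in parallel.

    X: (seq_len, d_model) token representations.
    Wq, Wk, Wv: learned projections (d_model, d_head).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # what each token asks for / offers / carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every token scores every other token
    weights = softmax(scores, axis=-1)         # relevance weights, one row per token
    return weights @ V                         # each output is a weighted mix of values

# toy usage: 5 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8): all tokens updated in one pass
```

The point is the connectivity: row t of `weights` is a learned, content-dependent decision about which other tokens to pull from, made in a single step.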

RNNs have two brutal constraints that show up as engineering pain:
Sequential compute: you can’t compute step t+1 until step t is done. That kills throughput and makes scaling expensive.
Long gradient chains: even if you “unroll” for training, the learning signal has to travel through a long chain, one step per token.
In practice that means vanishing or exploding gradients, slow wall-clock training, and brittle long-range dependencies.
I wrote about that pain explicitly in 2017 (The Pain of Training RNNs).
The transformer’s real innovation is that it makes those constraints optional.
The transformer changes the connectivity of the computation graph.
With an RNN, information from token i can only reach token j by passing through every step in between, so the path grows with distance.
With attention, any token can read from any other token in a single step, so the path length stays constant.
That means gradients and learning signals no longer have to survive a long chain of sequential updates.
Transformers aren’t “better at memory” because they store more. They’re better because they create shorter paths for information and learning signals.

RNNs force a time loop.
Transformers let you compute token representations for the whole sequence in parallel (during training).
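Here is a toy contrast, assuming nothing beyond NumPy: the RNN update is an irreducible Python-level loop over time, while the attention pass handles every position with a few batched matrix multiplies.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
X = rng.normal(size=(seq_len, d))                        # toy token inputs
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# RNN: step t needs h from step t-1, so the time loop cannot be parallelized.
h, rnn_states = np.zeros(d), []
for t in range(seq_len):
    h = np.tanh(h @ Wh + X[t] @ Wx)
    rnn_states.append(h)

# Attention: the whole sequence in a few matrix multiplies, no step-by-step dependency.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V                                        # all positions updated at once
```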
That parallelism is not a convenience.
It is what made “scale” economically viable.
And scale is what turns a model from a curiosity into a capability.
The transformer turned sequence learning into batchable compute.
A production mental model that holds up: one layer = attention to route information + an MLP to transform it + residuals and normalization to stay trainable.
Repeat that N times.
Most LLMs today are “decoder-only transformers,” meaning each token can attend only to earlier tokens and the whole model is trained to predict the next token. Inside a layer, each component has a clear job:
Attention: “Which previous tokens matter right now?” (routing)
MLP: “Given that info, compute a better representation.” (feature transformation)
Residual + norm: “Don’t let depth destroy trainability.” (stability)
Stack depth: “Repeat until the model can express complex behaviors.” (capacity)
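Putting those roles together, here is a deliberately tiny, untrained sketch of a decoder-only stack (single head, NumPy, random weights). It only exists to show where attention, the MLP, residuals, normalization, and the causal mask sit; nothing about it matches a real model’s sizes or details.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 16, 4, 10

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(x.shape[-1])
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -1e9   # no looking at future tokens
    return softmax(scores) @ V

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2            # compute better features, locally per token

x = rng.normal(size=(seq_len, d_model))          # stand-in for token embeddings
for _ in range(n_layers):                        # stack depth: capacity comes from repetition
    Wq, Wk, Wv = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
    x = x + causal_attention(layer_norm(x), Wq, Wk, Wv)   # route globally + residual
    W1 = 0.1 * rng.normal(size=(d_model, 4 * d_model))
    W2 = 0.1 * rng.normal(size=(4 * d_model, d_model))
    x = x + mlp(layer_norm(x), W1, W2)                    # transform locally + residual
```

A real model adds token and position embeddings at the bottom and a projection to vocabulary logits at the top, but the layer skeleton is roughly this loop.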
Transformers don’t store memory; there is no explicit memory cell.
They reconstruct what they need by routing across the context.
When you build on transformers, two modes matter more than the paper diagrams: training and inference.
Training is where transformers win versus RNNs: parallelism + stable optimization at depth.
At inference, you generate one token at a time.
So you might ask:
“If generation is sequential anyway, why is the transformer an improvement?”
Because most of the compute happens at training time, which is fully parallel; the prompt is still processed in one parallel prefill pass; and per-token decode work stays bounded thanks to caching.
When generating token t, attention needs keys/values from tokens 1..t-1.
Naively, you’d recompute everything each step.
Instead, you cache keys/values per layer for previous tokens.
So each new token mostly computes attention for its single query against the cached keys/values, plus its own MLP, instead of reprocessing the whole prompt.
This is why LLM inference has a distinct cost profile: prefill cost scales with prompt length, decode cost scales with the number of generated tokens, and the KV cache trades GPU memory for latency.
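A minimal sketch of what the cache buys you, assuming a single layer and a single head in NumPy (real inference stacks keep a cache per layer and per head, on the GPU, and batch requests):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

k_cache, v_cache = [], []                        # grows by one entry per token seen

def decode_step(x_new):
    """Attend the single newest token against everything cached so far."""
    q = x_new @ Wq                               # one query, not one per prompt token
    k_cache.append(x_new @ Wk)                   # this token's key/value joins the cache
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = q @ K.T / np.sqrt(d)                # work grows with tokens seen so far
    return softmax(scores) @ V

# prefill would fill the cache from the prompt in parallel; then decode is incremental
for _ in range(5):
    out = decode_step(rng.normal(size=d))
```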
Attention is powerful, but not free.
The naive cost grows quickly with context length because every token can attend to many others.
What that means operationally: context length is a budget, not a free resource, and long prompts cost real latency and money.
If you stuff the context with noise, the model will confidently route to noise.
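A back-of-the-envelope illustration, not a vendor cost model: with naive attention the score matrix is n-by-n, so doubling the context roughly quadruples the attention work.

```python
# Rough proportionality only: attention scores form an (n x n) matrix per layer per head,
# so that part of the compute scales with n^2 (the MLP part is roughly linear in n).
def relative_attention_cost(n_tokens, baseline=2_000):
    return (n_tokens / baseline) ** 2

for n in (2_000, 4_000, 8_000, 16_000):
    print(f"{n:>6} tokens: ~{relative_attention_cost(n):.0f}x the 2k attention cost")
# 2k -> 1x, 4k -> 4x, 8k -> 16x, 16k -> 64x
```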
The famous paper title, “Attention Is All You Need,” is easy to misread.
It doesn’t mean attention is magic, or that the rest of the network stopped mattering.
It means something more practical:
If you can route information flexibly with short paths, the rest of the network can be boring and scalable.
Which is a very 2020s engineering story.
This is also why the “transformer era” became a platform era:
The architecture and the infrastructure co-evolved.
If you build with transformers, you’re building with a new kind of component: a learnable, probabilistic router over whatever you put in its context window.
So the “transformer understanding” that matters is not academic.
It’s architectural.
Here are the real design consequences that start now:
Context is an input surface: your system must assemble context intentionally, not “just dump everything” (see the sketch after this list).
Cost is a product constraint: tokens are your new currency, so budget, cache, and throttle like an adult.
Reliability needs boundaries: decide what must never be wrong, and keep that in deterministic code paths.
Evaluation becomes a discipline: you can’t reason about correctness without a harness (goldens, regressions, telemetry).
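As a sketch of the first two points (all names here are hypothetical, and the whitespace-based token estimate is a crude stand-in for your model’s real tokenizer): rank candidate snippets by a relevance score you already have, and pack only what fits a fixed token budget.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str
    relevance: float                  # e.g. a retrieval score your system already computes

def estimate_tokens(text: str) -> int:
    # crude heuristic: ~4 tokens per 3 words; swap in the real tokenizer in production
    return max(1, len(text.split()) * 4 // 3)

def assemble_context(snippets: list[Snippet], budget_tokens: int) -> str:
    """Greedily pack the most relevant snippets that fit the budget; drop the rest."""
    chosen, used = [], 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(s.text)
        if used + cost <= budget_tokens:
            chosen.append(s.text)
            used += cost
    return "\n\n".join(chosen)        # everything left out can't compete for attention

context = assemble_context(
    [Snippet("refund policy excerpt ...", 0.92), Snippet("unrelated changelog ...", 0.11)],
    budget_tokens=1_000,
)
```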
This is my “architect’s minimum bar.”
Can you explain prefill vs decode and why long prompts are expensive?
Can you explain why multi-turn chat can be fast or slow depending on caching and prompt growth?
Can you explain why decoder-only models generate left-to-right and why that affects controllability?
Can you explain why a modest architecture improvement can unlock large capability jumps when combined with data + compute?
Can you articulate hallucinations as “token-likelihood optimization,” not “the model lying”?
If you can’t answer those, you’ll still ship something… but you’ll be debugging with superstition.
Transformers aren’t “better memory” in the sense of a stronger memory cell.
They’re better at using context because attention gives short paths between any two positions.
That makes long-range dependencies learnable at scale.
Parallelism matters primarily for training (where most compute happens).
At inference, you still generate token-by-token, but prefill processes the prompt in parallel and the KV cache keeps each decode step cheap.
No amount of attention guarantees correct outputs: attention is a routing mechanism, not a truth mechanism.
If your context is noisy or adversarial, the model can route confidently to the wrong signals. That’s why context assembly and evaluation become product disciplines.
Encoder-decoder models are great for “transform input → output” tasks (translation-style).
Decoder-only models are trained to predict the next token and are the backbone of most LLM chat systems. They’re simpler to scale and align well with “generate text” as a universal interface.
Now we have the architecture primitive.
Next month is the other half of the story:
Pretraining Is Compression: Tokens, Datasets, and Emergent Skill
Because transformers didn’t become powerful just because of attention.
They became powerful because pretraining turned the internet into training signal: a giant slice of text compressed into a predictive machine.
Pretraining Is Compression: Tokens, Datasets, and Emergent Skill
Pretraining isn’t “learning facts.” It’s learning to compress a giant slice of the internet into a predictive machine. This post gives senior engineers the mental model: tokens, data mixtures, scaling, and why capabilities seem to ‘emerge’—plus the practical implications for cost, reliability, and product design.
Why NLP Was Hard: RNN Pain, Vanishing Gradients, and the Limits of “Memory”
Before transformers, language models tried to compress entire histories into a single hidden state. This post explains why that was brittle: depth-in-time, vanishing/exploding gradients, and the engineering limits of “memory” — and why attention was inevitable.