
Pretraining isn’t “learning facts.” It’s learning to compress a giant slice of the internet into a predictive machine. This post gives senior engineers the mental model: tokens, data mixtures, scaling, and why capabilities seem to ‘emerge’—plus the practical implications for cost, reliability, and product design.
Axel Domingues
In January we drew the new boundary line: software now contains probabilistic components — and correctness is no longer “always” but “with a confidence profile.”
In February we looked at why language was hard for a long time: RNNs wanted memory, but training dynamics punished them.
In March we met the turning point: attention — not as math flair, but as an engineering unlock (parallelism, long-range access, stable optimization at scale).
Now, April is the part most teams skip.
Everyone wants to talk about “fine-tuning” and “agents.”
But the superpower comes earlier.
Pretraining is where the model becomes competent.
Everything after that mostly teaches it how to behave.
This post is the mental model for what pretraining actually does, why tokens matter more than you think, why the dataset is a product decision, and why capabilities seem to “appear” when you cross certain scale thresholds.
The goal this month
Understand pretraining as compression + prediction, not “knowledge injection.”
The core idea
A foundation model is shaped more by its data mixture than by its PR story.
The sharp edge
Tokens are the unit of compute, latency, and cost — not words.
The payoff
Better architecture decisions: model choice, evaluation, guardrails, and expectations.
Pretraining is optimizing a model to do one job:
Given a sequence of tokens, predict the next token.
That’s it.
No “facts database.”
No structured ontology.
No symbolic reasoning engine.
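To see how small that training signal really is, here is a toy sketch in plain Python (a counting bigram "model", nothing like a real training stack): the loss being minimized is just the average negative log-probability assigned to each token that actually came next.

```python
from collections import Counter, defaultdict
import math

corpus = ["the", "cat", "sat", "on", "the", "mat"]

# Estimate P(next | previous) by counting adjacent token pairs.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_prob(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Pretraining loss: average negative log-probability of each actual next token.
pairs = list(zip(corpus, corpus[1:]))
loss = -sum(math.log(next_token_prob(p, n) or 1e-9) for p, n in pairs) / len(pairs)
print(f"toy next-token loss: {loss:.3f}")  # lower loss = better compression of the corpus
```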
And yet… a huge amount of useful behavior falls out of that objective when you combine it with web-scale data, a large enough model, and enough compute.
Pretraining is building a lossy compression of human text and code that is optimized for predictive usefulness — not truth.
The rest of this article is unpacking what “lossy compression” means in practice.
When people hear “next-token prediction,” they picture autocomplete.
But pretraining at scale becomes something else: compression.
Compression is what happens when you represent a large dataset with fewer bits while still reconstructing it well.
Foundation models do the same thing — but with a twist: they compress the dataset into parameters and then reconstruct the continuation of text.
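As a reminder of what ordinary compression does with redundancy, here is a toy example using only the standard zlib module; the analogy is loose, since a model compresses into parameters rather than a bitstream, but the intuition of "fewer bits, same structure" carries over.

```python
import zlib

# Highly redundant text compresses dramatically because the compressor
# captures and reuses its repeated structure.
report = ("The deployment failed because the health check timed out. " * 200).encode()
print(len(report), "->", len(zlib.compress(report)))  # thousands of bytes -> a few hundred
```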
This immediately explains two facts that surprise new adopters: the model can handle inputs it has never seen verbatim, and it can confidently reconstruct details that were never there.
Everything in LLM systems becomes clearer when you stop thinking in words and start thinking in tokens.
A token is a chunk of text (often subword pieces) produced by a tokenizer.
Examples (conceptually):
"architecture" might be one token"unbelievable" might split into multiple tokensBecause language is messy:
Tokenizers (BPE / unigram variants) give you:
Tokens are the budget. They control:
Practical rule
If you don’t measure token counts, you don’t know your cost or latency.
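A concrete habit: count tokens before you ship a prompt. A minimal sketch, assuming the open-source tiktoken tokenizer; the per-token price is a placeholder, not any provider's real rate.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the incident report below in three bullet points:\n..."
n_tokens = len(enc.encode(prompt))

PRICE_PER_1K_INPUT_TOKENS = 0.001  # placeholder rate for illustration only
print(f"{n_tokens} input tokens ≈ ${n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.5f} per call")
```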
Practical implication
Context windows create “attention scarcity.” You must design what to drop.
When an answer is missing context that scrolled out of the window, the model didn’t forget.
You cut the evidence out of the prompt.
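One simple dropping policy, sketched below: keep the newest messages that fit a fixed token budget. The count_tokens callable and the budget number are assumptions; real systems usually also pin the system prompt and summarize older turns rather than discarding them.

```python
def trim_to_budget(messages, count_tokens, budget=4_000):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):        # walk newest -> oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                         # everything older than this is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order
```

Whatever policy you choose, dropping is a decision you design, not something the model does to you.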
If tokens are the unit of cost, the dataset is the unit of capability.
A pretrained model is shaped by its training mixture:
If your mixture overweights code, you get stronger code completion.
If it overweights conversational text, you get better chat-style continuations.
If it overweights low-quality text, you get… a model that is excellent at reproducing low-quality patterns.
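Mechanically, a mixture is just a set of sampling weights over sources, and those weights are the knob. The source names and numbers below are made up for illustration.

```python
import random
from collections import Counter

mixture = {
    "web_text":     0.55,
    "code":         0.25,
    "reference":    0.15,
    "conversation": 0.05,
}

# Training batches draw documents roughly in proportion to these weights.
draws = random.choices(list(mixture), weights=list(mixture.values()), k=10_000)
print(Counter(draws))  # shift the weights and you shift what the model gets good at
```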
This is why “model X is better than model Y” is not a universal statement.
It’s a statement about the training mixture behind each model and the tasks you actually measured.
Sound like a product decision? Because it is.
If your evaluation set appears in the training data, your benchmark becomes a memorization test.
This is not a rare edge case on web-scale corpora. It is the default risk.
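A naive sketch of what a contamination check looks like: flag eval items whose word n-grams also appear in a training document. Real decontamination pipelines are fuzzier and run at web scale; this only shows the idea.

```python
def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(eval_item, train_ngrams, n=8):
    # Any shared n-gram is a red flag that the eval item leaked into training data.
    return bool(ngrams(eval_item, n) & train_ngrams)

train_doc = "the quick brown fox jumps over the lazy dog near the river bank today"
eval_item = "quick brown fox jumps over the lazy dog near the river"

print(looks_contaminated(eval_item, ngrams(train_doc)))  # True: this eval item is compromised
```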
For product teams, contamination shows up as benchmark numbers that look great in a bake-off but do not transfer to your real tasks.
The web is not a neutral sample of human knowledge.
It has duplication, spam, popularity skew, and large gaps in which topics and voices are represented.
A pretrained model learns these priors because they reduce loss.
Two corpora with the same raw sources can produce different models depending on filtering, deduplication, and how heavily each source is weighted.
These aren’t just “ethics.” They are capability and style controls.
Engineers hate the word emergence because it sounds like mysticism.
Here’s the grounded version:
Some tasks require a minimum amount of representational capacity and training signal before the behavior becomes detectable.
Think of it like this: if a task only counts as solved when several sub-steps are all right, a model below the threshold scores near zero, and a model just above it suddenly scores well.
So behavior looks step-like even if training improvements are smooth.
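A worked illustration of that step-like appearance (a deliberately simplified model, not a claim about any particular LLM): if a task only succeeds when every sub-step is right, smooth gains in per-step accuracy show up as a cliff in end-to-end success.

```python
STEPS = 10  # imagine a task that requires 10 correct sub-steps in a row

for per_step in [0.70, 0.80, 0.90, 0.95, 0.99]:
    end_to_end = per_step ** STEPS
    print(f"per-step accuracy {per_step:.2f} -> task success {end_to_end:6.1%}")

# 0.70 -> ~2.8%, 0.90 -> ~34.9%, 0.99 -> ~90.4%: the underlying improvement
# is smooth, but the measured capability looks like a sudden jump.
```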
Instead of “it suddenly learned reasoning,” say: “this behavior became reliably measurable once the model had enough capacity and training signal.”
This language keeps you honest and helps you design systems around the actual phenomenon: capability cliffs.
Model upgrades are not linear improvements. They can introduce new failure modes and new behaviors abruptly.
Pretraining produces a powerful engine — but it’s not the engine people intuitively imagine.
If you ask for an answer that “sounds like an answer,” you often get one — even when the correct response is “I don’t know.”
This is why production systems need verification steps, deterministic fallbacks, and explicit handling of uncertainty.
We’ll get to those — but first we need one more critical concept.
A database is optimized for storing facts exactly and retrieving them reliably.
A pretrained model is optimized for predicting plausible continuations.
So when you ask: “Does the model contain this fact?”
The correct mental answer is:
It contains a distributed representation that sometimes enables the fact to be reconstructed — and sometimes doesn’t.
This explains why models can recall a fact flawlessly under one phrasing, miss it under another, and blend two similar facts together.
Because “facts” aren’t stored as rows. They are stored as many small correlated weights spread across the network.
Compression buys generalization.
Generalization buys capability.
But you don’t get crisp truth for free.
If you’re building on top of pretrained models, here are the architectural consequences you can’t ignore.
Choose a truth boundary
What must never be wrong? Route that through deterministic systems or verification.
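A sketch of what a truth boundary can look like in code. Every name here is hypothetical: lookup_refund_amount stands in for your deterministic system of record, and draft_with_llm for whatever model client you use.

```python
def answer_refund_question(order_id, draft_with_llm, lookup_refund_amount):
    amount = lookup_refund_amount(order_id)   # the fact comes from a deterministic source
    draft = draft_with_llm(
        f"Explain the refund politely. The exact refund amount is {amount} EUR. "
        "Do not change the number."
    )
    if str(amount) not in draft:              # verify the critical fact survived generation
        raise ValueError("Critical fact missing from draft; do not send.")
    return draft
```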
Treat tokens as a budget
Your prompt is a cost model. Your context window is a product constraint.
Design for distribution shift
Your users are a new dataset. Monitor drift and regressions like any other dependency.
Evaluate like you mean it
Demos are not evidence. Build an eval harness that matches your real tasks.
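A minimal eval-harness sketch. Everything here is an assumption: call_model stands in for your client, the cases should come from your real tasks, and the substring check is a crude grader you would replace.

```python
def run_eval(cases, call_model):
    failures = []
    for case in cases:
        output = call_model(case["prompt"])
        if case["expected"].lower() not in output.lower():  # crude grader; swap in a real one
            failures.append({"case": case, "output": output})
    score = 1 - len(failures) / len(cases)
    return score, failures

cases = [
    {"prompt": "Which city hosts our EU data center?", "expected": "Frankfurt"},  # made-up case
]
# score, failures = run_eval(cases, call_model)  # re-run on every model or prompt change
```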
If you want “completion,” pretraining is most of the story.
If you want “assistant,” you’re entering instruction tuning and RLHF territory (next month).
If you want “tool user,” you need contracts, instrumentation, and safety constraints.
Measure: token counts per request, latency, and cost per completed task.
For your domain, test: the prompts, documents, and edge cases your users actually produce, not generic benchmarks.
Common guardrails: deterministic checks for anything that must never be wrong, verification steps, and explicit refusal or “I don’t know” paths.
Treat the model as a dependency: pin versions, re-run your evals on every upgrade, and watch for newly introduced failure modes.
Imagine you trained on a trillion tokens of text and code.
Now ask:
What is the cheapest shortcut to reduce loss?
Often, it’s not “be correct.”
It’s “sound correct.”
That’s why the model can produce confident, fluent answers that are wrong, and citations that look real but are not.
This is not “the model lying.” It’s the model doing its job: generating high-probability continuations.
So your job, as an engineer, is to build systems where “high probability text” is useful and safe, not automatically trusted.
April takeaway
Pretraining builds a powerful compression engine of human text.
It gives you fluency and broad competence — but not truth. Reliability is something you design on top.
“Language Models are Few-Shot Learners” (GPT-3, 2020)
A clear snapshot of what scale + pretraining can do (and what it still can’t).
Why does the model seem to reason?
Because a lot of “reasoning-like” text is present in the corpus — explanations, proofs, code walkthroughs, step-by-step answers.
To compress that data well, the model learns internal representations that support generating those patterns.
But it’s still not a proof engine. It can generate reasoning-shaped text that is wrong.
Why doesn’t it just say “I don’t know”?
“I don’t know” is just another continuation pattern.
Pretraining does not explicitly reward epistemic humility. It rewards predicting the next token correctly.
Unless later training stages teach refusal/uncertainty behavior, the model will often produce a plausible continuation rather than a blank.
Does the model understand the world?
It learns a text-shaped slice of the world.
That can be extremely useful, but it’s not a grounded simulator. The model has no direct access to physical reality, and its “knowledge” is filtered through what people wrote and what your dataset mixture included.
What should you take away?
Three things: tokens are your budget, the data mixture shapes capability, and reliability is something you design on top of the model.
Pretraining builds capability.
But capability alone doesn’t give you a usable assistant.
Next month is about the bridge from “completion engine” to “helpful tool”:
Instruction Tuning: Turning a Completion Engine into an Assistant
Because once the model can speak, the next question is:
What should it say… and when should it stay quiet?