Blog
Apr 30, 2023 - 18 MIN READ
Pretraining Is Compression: Tokens, Datasets, and Emergent Skill


Pretraining isn’t “learning facts.” It’s learning to compress a giant slice of the internet into a predictive machine. This post gives senior engineers the mental model: tokens, data mixtures, scaling, and why capabilities seem to ‘emerge’—plus the practical implications for cost, reliability, and product design.

Axel Domingues


In January we drew the new boundary line: software now contains probabilistic components — and correctness is no longer “always” but “with a confidence profile.”

In February we looked at why language was hard for a long time: RNNs wanted memory, but training dynamics punished them.

In March we met the turning point: attention — not as math flair, but as an engineering unlock (parallelism, long-range access, stable optimization at scale).

Now, April is the part most teams skip.

Everyone wants to talk about “fine-tuning” and “agents.”
But the superpower comes earlier.

Pretraining is where the model becomes competent.
Everything after that mostly teaches it how to behave.

This post is the mental model for what pretraining actually does, why tokens matter more than you think, why the dataset is a product decision, and why capabilities seem to “appear” when you cross certain scale thresholds.

The goal this month

Understand pretraining as compression + prediction, not “knowledge injection.”

The core idea

A foundation model is shaped more by its data mixture than by its PR story.

The sharp edge

Tokens are the unit of compute, latency, and cost — not words.

The payoff

Better architecture decisions: model choice, evaluation, guardrails, and expectations.


Pretraining in one sentence

Pretraining is optimizing a model to do one job:

Given a sequence of tokens, predict the next token.

That’s it.

No “facts database.”
No structured ontology.
No symbolic reasoning engine.

And yet… a huge amount of useful behavior falls out of that objective when you combine:

  • a large model (capacity)
  • a large dataset (coverage)
  • stable optimization (trainability)
  • and enough compute (time spent compressing)
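Concretely, the objective is ordinary cross-entropy on the shifted sequence. Here is a minimal sketch in PyTorch, assuming a generic `model` that maps a batch of token IDs to next-token logits (the model itself is a placeholder, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, token_ids: torch.Tensor) -> torch.Tensor:
    """One next-token-prediction step on a batch of token IDs.

    token_ids: LongTensor of shape (batch, seq_len).
    `model` is assumed to return logits of shape (batch, seq_len - 1, vocab_size).
    """
    inputs = token_ids[:, :-1]    # the model sees tokens 0 .. n-2
    targets = token_ids[:, 1:]    # and must predict tokens 1 .. n-1
    logits = model(inputs)

    # Cross-entropy between the predicted distribution and the actual next token.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    return loss  # lower loss means better compression of the corpus
```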

The mental model that scales for architects:

Pretraining is building a lossy compression of human text and code that is optimized for predictive usefulness — not truth.

The rest of this article is unpacking what “lossy compression” means in practice.


Why “compression” is the right metaphor

When people hear “next-token prediction,” they picture autocomplete.

But pretraining at scale becomes something else:

  • The model learns representations that reuse structure across millions of contexts.
  • It learns that certain patterns co-occur: syntax, semantics, style, argument structure.
  • It learns latent variables: topic, intent, domain, persona, tone.
  • It learns a probabilistic simulator for “what humans tend to write next.”

Compression is what happens when you represent a large dataset with fewer bits while still reconstructing it well.

Foundation models do the same thing, with a twist: they compress the dataset into parameters, and instead of reconstructing the original text they use that compressed structure to generate plausible continuations of new text.
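The link between loss and compression is literal: the average next-token cross-entropy, measured in nats, divided by ln 2 is the number of bits the model needs per token. A quick back-of-the-envelope sketch with made-up numbers:

```python
import math

# Hypothetical numbers, for illustration only.
loss_nats_per_token = 3.0                             # average training cross-entropy
bits_per_token = loss_nats_per_token / math.log(2)    # ~4.3 bits per token

# Compare with a naive fixed-length code over a 50k-entry vocabulary.
naive_bits_per_token = math.log2(50_000)              # ~15.6 bits per token

print(f"model: {bits_per_token:.1f} bits/token vs naive: {naive_bits_per_token:.1f}")
# A lower loss is, quite literally, a tighter compression of the corpus.
```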

If you want to reason about what the model “knows,” ask:
  • What structure would be useful to compress this corpus?
  • What shortcuts would minimize loss cheaply?
  • What patterns are overrepresented in the data mixture?

This immediately explains two facts that surprise new adopters:

  1. Models can be brilliant in one domain and dumb in another.
  2. Models can sound confident about nonsense — because “confidence” is not a trained concept. Probabilities are.

Tokens: the unit of reality (and the unit of pain)

Everything in LLM systems becomes clearer when you stop thinking in words and start thinking in tokens.

A token is a chunk of text, often a subword piece, produced by a tokenizer.

Examples (conceptually):

  • "architecture" might be one token
  • "unbelievable" might split into multiple tokens
  • whitespace and punctuation matter
  • code has its own token quirks
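You can see this directly with an open-source tokenizer. A small sketch using the `tiktoken` package as one example (exact splits and counts vary by tokenizer, so treat the output as illustrative):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["architecture", "unbelievable", "def fib(n):", "   indented_code"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")

# Whitespace, casing, and punctuation all change the split,
# and code tokenizes differently from prose.
```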

Why tokenization exists

Because language is messy:

  • words are unbounded (new names, typos, slang)
  • languages mix
  • code and natural language mix
  • you need a finite vocabulary to train efficiently

Tokenizers (BPE / unigram variants) give you:

  • a manageable vocabulary size
  • the ability to represent any string
  • a spectrum between “character-level” and “word-level”

Why you should care (as a system designer)

Tokens are the budget. They control:

  • cost (billing is usually per token)
  • latency (more tokens → more compute)
  • context window pressure (truncation decisions)
  • failure modes (weird splits change behavior)
  • prompt injection surface (hidden tokens, delimiter tricks)
  • evaluation realism (your “prompt length” is not what you think)

Practical rule

If you don’t measure token counts, you don’t know your cost or latency.
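What "measure" means in practice is a few lines of code per representative request. A minimal sketch; the per-token prices below are placeholders, not any provider's actual rates:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

PRICE_PER_1K_INPUT = 0.0015    # placeholder $ per 1k input tokens
PRICE_PER_1K_OUTPUT = 0.0020   # placeholder $ per 1k output tokens

def estimate_cost(prompt: str, expected_output_tokens: int) -> float:
    input_tokens = len(enc.encode(prompt))
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (expected_output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Multiply by requests per day and the "it's just a prompt" illusion disappears.
print(f"${estimate_cost('Summarize this incident report ...', 400):.4f} per request")
```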

Practical implication

Context windows create “attention scarcity.” You must design what to drop.

A lot of “LLM unpredictability” is really context truncation.

The model didn’t forget.
You cut the evidence out of the prompt.
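The fix is to make the dropping decision explicit instead of accidental. A sketch of one deliberate policy: always keep the system message, then keep the most recent turns that fit the budget (assumes a `count_tokens` helper like the tokenizer snippets above):

```python
def fit_to_budget(system_msg, turns, budget_tokens, count_tokens):
    """Keep the system message plus as many *recent* turns as fit.

    `turns` is ordered oldest-first; we drop from the oldest end so the
    instructions and the latest user message are never the evidence we cut.
    """
    kept = []
    used = count_tokens(system_msg)
    for turn in reversed(turns):                # walk newest-first
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return [system_msg] + list(reversed(kept))  # restore chronological order
```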


The dataset is the product spec (whether you admit it or not)

If tokens are the unit of cost, the dataset is the unit of capability.

A pretrained model is shaped by its training mixture:

  • web text
  • books
  • forums
  • code
  • academic papers
  • instructional content
  • synthetic data (in some pipelines)
  • filtered subsets for safety/quality
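Because the mixture is a design decision, it usually lives in explicit configuration. A toy sketch of what mixture weights mean operationally; the sources and numbers are invented for illustration:

```python
import random

# Invented weights: the probability that the next training document
# is drawn from each source. This is the capability envelope in numbers.
MIXTURE = {
    "web_text": 0.45,
    "code":     0.25,
    "books":    0.15,
    "papers":   0.10,
    "forums":   0.05,
}

def sample_source(rng: random.Random) -> str:
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Double the weight on "code" and you have, in effect, specified a different product.
```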

Data mixture determines your capability envelope

If your mixture overweights code, you get stronger code completion.
If it overweights conversational text, you get better chat-style continuations.
If it overweights low-quality text, you get… a model that is excellent at reproducing low-quality patterns.

This is why “model X is better than model Y” is not a universal statement.

It’s a statement about:

  • the data mixture
  • the compute budget
  • the training recipe
  • and the evaluation tasks you care about

Treat the dataset like a design decision, not an implementation detail.

Because it is.



Why skills “emerge” (and why it’s not magic)

Engineers hate the word emergence because it sounds like mysticism.

Here’s the grounded version:

Some tasks require a minimum amount of representational capacity and training signal before the behavior becomes detectable.

Think of it like this:

  • The model is learning many overlapping subskills.
  • Some subskills only become usable when enough pieces are present.
  • Benchmarks have thresholds: you don’t notice progress until you cross “good enough to pass.”

So behavior looks step-like even if training improvements are smooth.
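A toy model makes the cliff concrete: if a task chains k steps that must all succeed, task accuracy behaves roughly like per-step reliability to the k-th power, so smooth per-step gains show up as a sudden jump on a pass/fail benchmark. The numbers below are illustrative:

```python
K = 20  # number of steps the task chains together

# Per-step reliability improves smoothly with scale ...
for per_step in [0.80, 0.90, 0.95, 0.98, 0.99]:
    task_accuracy = per_step ** K        # ... but the task needs all K steps
    print(f"per-step {per_step:.2f} -> task {task_accuracy:.1%}")

# per-step 0.80 -> task 1.2%
# per-step 0.90 -> task 12.2%
# per-step 0.95 -> task 35.8%
# per-step 0.98 -> task 66.8%
# per-step 0.99 -> task 81.8%
# Smooth input, cliff-like output: that is most of what "emergence" looks like.
```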

A useful way to talk about emergence

Instead of “it suddenly learned reasoning,” say:

  • “it crossed a threshold where multi-step patterns became reliably representable”
  • “the model can now maintain longer dependencies without collapsing”
  • “it learned reusable ‘circuits’ for certain tasks”

This language keeps you honest and helps you design systems around the actual phenomenon: capability cliffs.

Architectural implication:

Model upgrades are not linear improvements. They can introduce new failure modes and new behaviors abruptly.


What pretraining gives you (and what it doesn’t)

Pretraining produces a powerful engine — but it’s not the engine people intuitively imagine.

It gives you

  • a general language prior: fluent generation, summarization, rewriting
  • pattern completion: “continue the thing in the style of the thing”
  • latent structure: syntax, semantics, code patterns, common formats
  • zero-shot/few-shot behavior: the ability to follow examples without gradient updates (within limits)

It does not guarantee

  • truthfulness
  • calibrated confidence
  • stable reasoning across long chains
  • robustness to adversarial prompts
  • consistency across sessions
  • correct tool use
  • safe handling of private data

Pretraining optimizes for plausible continuation.

If you ask for an answer that “sounds like an answer,” you often get one — even when the correct response is “I don’t know.”

This is why production systems need:

  • truth boundaries
  • retrieval and citation
  • tool-based verification
  • guardrails and evals
  • rollback strategies

We’ll get to those — but first we need one more critical concept.


The model is not a database — it’s a probability field

A database is optimized for:

  • exact retrieval
  • stable updates
  • explicit consistency

A pretrained model is optimized for:

  • predictive compression
  • smooth generalization
  • “good enough” continuations across many domains

So when you ask: “Does the model contain this fact?”

The correct mental answer is:

It contains a distributed representation that sometimes enables the fact to be reconstructed — and sometimes doesn’t.

This explains why models can:

  • recall obscure details
  • and forget common ones
  • and contradict themselves within the same conversation

Because “facts” aren’t stored as rows. They are stored as many small correlated weights spread across the network.

This is not a flaw. It’s the trade.

Compression buys generalization.
Generalization buys capability.
But you don’t get crisp truth for free.


Production implications (for people shipping LLM features)

If you’re building on top of pretrained models, here are the architectural consequences you can’t ignore.

Choose a truth boundary

What must never be wrong? Route that through deterministic systems or verification.

Treat tokens as a budget

Your prompt is a cost model. Your context window is a product constraint.

Design for distribution shift

Your users are a new dataset. Monitor drift and regressions like any other dependency.

Evaluate like you mean it

Demos are not evidence. Build an eval harness that matches your real tasks.

A practical checklist: “Are we using the pretrained model correctly?”

Step 1 — Define the job (completion vs assistant vs tool user)

If you want “completion,” pretraining is most of the story.
If you want “assistant,” you’re entering instruction tuning and RLHF territory (next month).
If you want “tool user,” you need contracts, instrumentation, and safety constraints.

Step 2 — Measure token costs on representative prompts

Measure:

  • input tokens
  • output tokens
  • average vs p95 prompt size
  • truncation frequency and what gets dropped
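A sketch of that measurement over a corpus of representative prompts; the file name and context limit are placeholders for your own:

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 8_000   # placeholder: your window minus the output budget

counts, truncated = [], 0
with open("representative_prompts.jsonl") as f:    # hypothetical sample file
    for line in f:
        n = len(enc.encode(json.loads(line)["prompt"]))
        counts.append(n)
        truncated += n > CONTEXT_LIMIT

counts.sort()
p95 = counts[int(0.95 * (len(counts) - 1))]
print(f"avg={sum(counts) / len(counts):.0f}  p95={p95}  "
      f"truncation rate={truncated / len(counts):.1%}")
```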

Step 3 — Identify your failure modes

For your domain, test:

  • plausible nonsense (hallucination)
  • refusal / safety overreach
  • prompt injection
  • instruction conflict (“system vs user”)
  • long-context degradation

Step 4 — Install guardrails

Common guardrails:

  • retrieval for factual grounding
  • schema-constrained outputs
  • tool-based verification for critical invariants
  • deterministic post-processing for safety/correctness
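For schema-constrained outputs, the cheapest guardrail is validate-before-trust. A sketch using the `jsonschema` package; the schema is a made-up example for a hypothetical ticket-triage feature:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"enum": ["low", "medium", "high"]},
        "summary": {"type": "string", "maxLength": 200},
    },
    "required": ["severity", "summary"],
    "additionalProperties": False,   # invented fields get rejected, not shipped
}

def parse_model_output(raw: str) -> dict:
    data = json.loads(raw)                         # malformed JSON raises here
    validate(instance=data, schema=TICKET_SCHEMA)  # schema violations raise here
    return data

try:
    parse_model_output('{"severity": "urgent", "summary": "DB is down"}')
except (json.JSONDecodeError, ValidationError) as err:
    # Reject, retry, or fall back to a deterministic path; never pass it through.
    print("model output rejected:", err)
```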

Step 5 — Track behavior over time

Treat the model as a dependency:

  • record prompts + responses (with privacy controls)
  • monitor success rates
  • add regression tests
  • roll out changes gradually
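Treating the model as a dependency means it gets regression tests like any other dependency. A minimal sketch of the shape, assuming a `call_model(prompt) -> str` function you already have; the cases and threshold are illustrative:

```python
# Frozen prompts, explicit expectations, and a pass-rate gate.
CASES = [
    {"prompt": "Summarize: the deploy failed because ...",
     "must_contain": ["deploy", "failed"]},
    {"prompt": "Extract the invoice total from: ...",
     "must_contain": ["total"]},
]

def run_regression(call_model, min_pass_rate: float = 0.9) -> bool:
    passed = 0
    for case in CASES:
        output = call_model(case["prompt"]).lower()
        if all(term in output for term in case["must_contain"]):
            passed += 1
    pass_rate = passed / len(CASES)
    print(f"pass rate: {pass_rate:.0%}")
    return pass_rate >= min_pass_rate   # gate prompt changes and model upgrades on this

# Run it on every prompt change, model upgrade, or provider switch.
```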

A small thought experiment that keeps teams honest

Imagine you trained on a trillion tokens of text and code.

Now ask:

What is the cheapest shortcut to reduce loss?

Often, it’s not “be correct.”
It’s “sound correct.”

That’s why the model can produce:

  • a perfectly formatted RFC-style response that is wrong
  • a confident explanation with fake citations
  • a cleanly structured JSON object with invented fields

This is not “the model lying.” It’s the model doing its job: generating high-probability continuations.

So your job, as an engineer, is to build systems where “high probability text” is useful and safe, not automatically trusted.

April takeaway

Pretraining builds a powerful compression engine of human text.

It gives you fluency and broad competence — but not truth. Reliability is something you design on top.


Resources

“Attention Is All You Need” (2017)

The architecture that made scaling language models practical.

“Language Models are Few-Shot Learners” (GPT-3, 2020)

A clear snapshot of what scale + pretraining can do (and what it still can’t).

“Scaling Laws for Neural Language Models” (2020)

Why increasing model/data/compute tends to produce predictable improvements — until thresholds show up.

“The Pile” (2020)

A practical example of how dataset mixture is an explicit design choice.




What’s Next

Pretraining builds capability.

But capability alone doesn’t give you a usable assistant.

Next month is about the bridge from “completion engine” to “helpful tool”:

Instruction Tuning: Turning a Completion Engine into an Assistant

Because once the model can speak, the next question is:

What should it say… and when should it stay quiet?

Axel Domingues - 2026