
2026 is a build year: we’ll construct a language model and a reasoning system from scratch—month by month—without splitting the work into silos that can’t compound.
Axel Domingues
2025 ended with a clean conclusion:
Agents became platforms — and platforms need governance.
In 2026, I’m switching modes again.
Not back to a research diary.
Not forward into “prompt tricks.”
This year is a build-and-teach series:
Foundations to Reasoning: Building LLMs (and Reasoners) From Scratch
The mission is simple to state and hard to do well: build an LLM from scratch, then a reasoning pipeline on top of it, with every capability reproducible, measurable, and debuggable.
And we’ll do it while crediting — and learning from — two reference texts I’ll be following closely:
Raschka, Sebastian. *Build a Large Language Model (From Scratch)*. Manning, 2024. ISBN 978-1633437166.
Raschka, Sebastian. *Build a Reasoning Model (From Scratch)*. Manning, 2025. ISBN 978-1633434677.
- **What this year is:** a step-by-step construction of an LLM, then a reasoning pipeline that makes correctness measurable.
- **What this year is not:** a "prompt cookbook", a vibe-driven benchmark chase, or twelve isolated experiments.
- **The organizing rule:** one spine of shared tokenization, model core, generation, training loops, evaluation, and traces.
- **What counts as progress:** a capability only "exists" if we can reproduce it, measure it, and debug it.
A lot of people talk about “reasoning models” like they’re fundamentally different from language models.
In practice, most of the leverage comes from three things: honest evaluation, inference-time compute, and training-time scaling, all layered on the same base model.
So when I say "two models", I don't mean two unrelated codebases.
I mean one spine: shared interfaces and shared primitives, as a deliberate architectural choice.
Without it, the year fractures into silos: a tokenizer in one place, a training loop in another, and a `misc/` that quietly becomes your production system.

That structure feels productive until you try to answer a simple question:
“Did this change improve the model, or just change the measurement?”
A single spine makes that question answerable.
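What "shared interfaces" means in practice can be sketched in a few lines. The names here (`Tokenizer`, `Generator`, `ByteTokenizer`) are my illustrative assumptions, not the books' code; the point is that every layer talks through the same small contracts, so swapping a component cannot silently change the measurement.

```python
# A sketch of "one spine": small shared interfaces that every layer
# depends on. Illustrative names, not the reference implementation.
from typing import Protocol


class Tokenizer(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


class Generator(Protocol):
    def generate(self, prompt_ids: list[int], max_new: int) -> list[int]: ...


class ByteTokenizer:
    """Trivial stand-in: UTF-8 bytes as token IDs (vocab size 256)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


def round_trip_ok(tok: Tokenizer, text: str) -> bool:
    # The property every tokenizer on the spine must satisfy.
    return tok.decode(tok.encode(text)) == text
```

Any tokenizer that satisfies the protocol can be dropped in, and `round_trip_ok` is the same test for all of them; that is what keeps "did the model improve?" separable from "did the measurement change?".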
Here’s the year in one sentence:
We start by building an LLM you can train and trust mechanically,
then we build a reasoning system you can trust epistemically.
That shift matters.
Language modeling mostly rewards fluency.
Reasoning punishes you for being wrong while sounding right.
So in the second half of the year, we'll stop asking "does this sound right?"
…and start asking "is this verifiably correct?"
This is the “table of contents” for the year. Each month builds on the last.
| Month | Theme | What you’ll learn to build |
|---|---|---|
| Jan | Two Models, One Spine | The overall architecture and the operating rules for the year |
| Feb | Tokenization | BPE, IDs, chunking, and data loaders that don’t lie |
| Mar | Attention | Causal masking, multi-head shapes, and why bugs hide in broadcasting |
| Apr | Transformer blocks | Residual paths, normalization choices, and stability knobs |
| May | Pretraining | Loss, perplexity, checkpoints, and an honest training loop |
| Jun | Baselines | Importing pretrained weights + parity checks (platform validation) |
| Jul | Fine-tuning I | Classification adapters and what “last token” really means |
| Aug | Fine-tuning II | Instruction formatting/masking and the difference between “chat” and “completion” |
| Sep | Reasoning eval | Bench harnesses: prompting → generation → extraction → grading → traces |
| Oct | Inference scaling I | Self-consistency, sampling, and compute/quality tradeoffs |
| Nov | Inference scaling II | Candidate scoring + refinement loops that converge (not spiral) |
| Dec | Training scaling | RL for reasoning + distillation for efficiency (and where to go next) |
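To make one of the later rows concrete: the Oct theme, self-consistency, is just "sample several completions, extract an answer from each, keep the majority vote". A toy sketch, assuming the answers have already been extracted by the shared evaluation layer:

```python
# Self-consistency in miniature: majority vote over extracted answers.
# In the real system, `samples` comes from the shared generation layer.
from collections import Counter


def self_consistent_answer(samples: list[str]) -> str:
    # Majority vote; ties resolve in first-seen order (CPython Counter).
    votes = Counter(samples)
    return votes.most_common(1)[0][0]
```

Three sampled answers with one outlier, e.g. `["56", "56", "54"]`, recover the consensus `"56"`; the cost is extra inference compute, which is exactly the tradeoff the Oct/Nov months examine.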
“From scratch” does not mean rebuilding every dependency down to the matrix multiply.
It means owning every layer well enough to reproduce, measure, and debug it.
Otherwise you don’t have a result; you have a run that happened once.
This series is written for senior engineers and builders who want to understand the full stack — without drowning in mysticism.
So each month will follow a consistent teaching pattern:
1. **Concept:** What problem exists? What breaks? What’s the boundary?
2. **Minimal build:** Not “feature complete”, but complete enough to be testable.
3. **Failure modes:** How it fails, how you detect it, and what to do next.
4. **Extensions:** Only after we can reproduce and measure improvements.
That rhythm is the difference between “content” and “craft”.
Why a monthly cadence? Because building these systems is layered work.
Tokenization mistakes corrupt training.
Training mistakes corrupt conclusions.
Evaluation mistakes corrupt your ability to improve.
A monthly cadence forces the right discipline: one stable layer at a time.
Will this chase state-of-the-art leaderboards? No.
The goal is not leaderboard heroics. The goal is architectural authority: a system you can understand, reproduce, and extend.
We’ll validate against baselines and measure improvements honestly, but the purpose is engineering literacy, not hype.
The most common failure mode is building “a model” and forgetting evaluation.
So we will treat evaluation and traceability as first-class work — especially once we cross into reasoning.
Why insist on modularity? Because reasoning work explodes in complexity if your system is not modular.
Sampling, scoring, selection, refinement, and verification all want the same things: stable tokenization, stable generation, and stable evaluation.
If those foundations are inconsistent, your conclusions won’t survive contact with reality.
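"Stable generation" is testable, not aspirational: seed everything that touches sampling and assert that two runs of the same config produce identical token streams. A minimal sketch with a stub sampler standing in for the real generation layer (an assumption on my part):

```python
# Determinism check: the same seed must yield the same token stream.
# The sampler is a stub for the shared generation layer.
import random


def sample_ids(seed: int, vocab: int = 256, n: int = 16) -> list[int]:
    rng = random.Random(seed)  # local RNG: no hidden global state
    return [rng.randrange(vocab) for _ in range(n)]


def runs_are_stable(seed: int) -> bool:
    return sample_ids(seed) == sample_ids(seed)
```

The design choice worth copying is the local `random.Random(seed)` instance: generation that reads from global RNG state is exactly the kind of foundation whose conclusions won't survive contact with reality.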
This year is anchored in two texts — one for the base LLM build, Raschka’s *Build a Large Language Model (From Scratch)*, and one for the reasoning stack, his *Build a Reasoning Model (From Scratch)*.
Next month we start where most “LLM projects” quietly go off the rails:
Tokenization as the First Model
Tokenization is the first place you can accidentally train on garbage while everything “looks fine”: mismatched IDs, bad chunk boundaries, and data loaders that quietly lie.
So we’ll build it as a deterministic subsystem with tests — because if this layer lies, the entire year lies.
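As a preview of what “deterministic subsystem” means, here is one greedy BPE merge step: count adjacent token pairs, merge the most frequent pair into a new ID, repeat. This is my toy sketch of the idea, not the book’s exact implementation:

```python
# One greedy BPE merge step: find the most frequent adjacent pair,
# then replace every occurrence with a new token ID. Deterministic
# by construction, which is the property we'll write tests for.
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)  # collapse the pair into one new token
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out
```

For example, in `[1, 2, 1, 2, 3]` the pair `(1, 2)` occurs twice, so merging it with a fresh ID `256` yields `[256, 256, 3]` — the kind of small, checkable fact next month’s tests will be built from.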