Jan 25, 2026 - 13 MIN READ
Two Models, One Spine: the 2026 build plan (LLM → Reasoning)

2026 is a build year: we’ll construct a language model and a reasoning system from scratch—month by month—without splitting the work into silos that can’t compound.

Axel Domingues

2025 ended with a clean conclusion:

Agents became platforms — and platforms need governance.

In 2026, I’m switching modes again.

Not back to a research diary.
Not forward into “prompt tricks.”

This year is a build-and-teach series:

Foundations to Reasoning: Building LLMs (and Reasoners) From Scratch

The mission is simple to state and hard to do well:

  • build a language model you can pretrain, fine-tune, and evaluate
  • turn it into a reasoning system you can measure, scale, and improve
  • do it on one shared spine, so every improvement benefits the whole stack

And we’ll do it while crediting — and learning from — two reference texts I’ll be following closely:

  • Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.

  • Raschka, Sebastian. Build A Reasoning Model (From Scratch). Manning, 2025. ISBN: 9781633434677.

The goal is not to duplicate the books’ repos. The goal is to use them as a strong guideline and build a cohesive system with clearer boundaries, better testing discipline, and a more “teachably engineered” structure.

What this year is

A step-by-step construction of an LLM, then a reasoning pipeline that makes correctness measurable.

What this year is not

Not a “prompt cookbook”, not a vibe-driven benchmark chase, and not twelve isolated experiments.

The organizing rule

One spine: shared tokenization, model core, generation, training loops, evaluation, and traces.

What counts as progress

A capability only “exists” if we can reproduce it, measure it, and debug it.


The core claim: reasoning isn’t a separate species

A lot of people talk about “reasoning models” like they’re fundamentally different from language models.

In practice, most of the leverage comes from three things:

  1. A capable base model (the substrate)
  2. A reasoning pipeline (the system around the model)
  3. Training-time pressure (methods that move the model toward correctness)

So when I say “two models”, I don’t mean two unrelated codebases.

I mean:

  • an LLM we understand deeply enough to operate
  • a reasoning layer we can evolve without rewriting everything

What “one spine” means (in engineering terms)

“One spine” is an architectural choice: shared interfaces and shared primitives.

Because without it, the year fractures into silos:

  • one repo for tokenization
  • one repo for training loops
  • one repo for evaluation scripts
  • one repo for reasoning experiments
  • and a folder called misc/ that quietly becomes your production system

That structure feels productive until you try to answer a simple question:

“Did this change improve the model, or just change the measurement?”

A single spine makes that question answerable.

The shared primitives we will keep stable

  • Tokenization + batching: the rules that define what a “token stream” is
  • Model core: transformer blocks, attention, normalization, checkpoint format
  • Generation: sampling, deterministic modes, and scoring (logprobs)
  • Training loops: pretraining and fine-tuning as reusable engines (not scripts)
  • Evaluation + traces: consistent runners that produce auditable outputs
  • Reasoning pipelines: generation → selection → refinement → verification (as a first-class system)

If you want improvements to compound, you need stable interfaces. That’s as true for ML systems as it is for distributed systems.
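
As a concrete sketch of what “stable interfaces” can look like, here is a minimal Python version of the spine’s contracts. The names (`Tokenizer`, `Generator`, `run_eval`) are illustrative, not the series’ final API:

```python
from typing import Protocol, Sequence

class Tokenizer(Protocol):
    """The spine's contract for turning text into token IDs and back."""
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: Sequence[int]) -> str: ...

class Generator(Protocol):
    """The spine's contract for producing tokens and scoring sequences."""
    def generate(self, prompt_ids: list[int], max_new_tokens: int) -> list[int]: ...
    def logprob(self, ids: Sequence[int]) -> float: ...

def run_eval(tok: Tokenizer, gen: Generator, prompt: str) -> str:
    # Everything above this layer (eval harnesses, reasoning pipelines)
    # only sees these two interfaces, so swapping the model core never
    # forces a rewrite of the harness.
    ids = tok.encode(prompt)
    out = gen.generate(ids, max_new_tokens=32)
    return tok.decode(out)
```

The design choice is the point: the eval runner depends on the protocols, not on any concrete model, which is what lets improvements compound instead of forking.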

The 2026 arc: from “text prediction” to “measurable correctness”

Here’s the year in one sentence:

We start by building an LLM you can train and trust mechanically,
then we build a reasoning system you can trust epistemically.

That shift matters.

Language modeling mostly rewards fluency.
Reasoning punishes you for being wrong while sounding right.

So in the second half of the year, we’ll stop asking:

  • “Does it sound good?”

…and start asking:

  • “Can we grade it?”
  • “Can we extract the final answer reliably?”
  • “Can we improve accuracy without hiding the cost?”

The roadmap (January → December)

This is the “table of contents” for the year. Each month builds on the last.

Month (theme): what you’ll learn to build

  • Jan (Two Models, One Spine): The overall architecture and the operating rules for the year
  • Feb (Tokenization): BPE, IDs, chunking, and data loaders that don’t lie
  • Mar (Attention): Causal masking, multihead shapes, and why bugs hide in broadcasting
  • Apr (Transformer blocks): Residual paths, normalization choices, and stability knobs
  • May (Pretraining): Loss, perplexity, checkpoints, and an honest training loop
  • Jun (Baselines): Importing pretrained weights + parity checks (platform validation)
  • Jul (Fine-tuning I): Classification adapters and what “last token” really means
  • Aug (Fine-tuning II): Instruction formatting/masking and the difference between “chat” and “completion”
  • Sep (Reasoning eval): Bench harnesses (prompting → generation → extraction → grading → traces)
  • Oct (Inference scaling I): Self-consistency, sampling, and compute/quality tradeoffs
  • Nov (Inference scaling II): Candidate scoring + refinement loops that converge (not spiral)
  • Dec (Training scaling): RL for reasoning + distillation for efficiency (and where to go next)

The reasoning book is in MEAP, and later chapters (including distillation and pipeline improvements) may land during the year. The plan assumes those chapters will be available by the time we reach the late-year topics.

What “from scratch” means in this series

“From scratch” does not mean:

  • re-implementing everything the ecosystem already solved
  • refusing to compare against pretrained baselines
  • spending 6 months on build tools instead of learning

It means:

  • implementing the core mechanisms ourselves (so we can debug them)
  • building a training/eval stack that is reproducible and inspectable
  • validating our platform by importing pretrained weights mid-year
  • treating “reasoning” as a system (not just a prompt)

If you can’t explain your data pipeline, masking rules, and checkpoint integrity, you don’t “have a model.”

You have a run that happened once.


How I’m going to teach this (so it stays useful)

This series is written for senior engineers and builders who want to understand the full stack — without drowning in mysticism.

So each month will follow a consistent teaching pattern:

Build the mental model

What problem exists? What breaks? What’s the boundary?

Implement the smallest honest version

Not “feature complete”. Complete enough to be testable.

Show failure modes and diagnostics

How it fails, how you detect it, and what to do next.

Scale carefully

Only after we can reproduce and measure improvements.

That rhythm is the difference between “content” and “craft”.


Source material and how I’m using it

This year is anchored in two texts — one for the base LLM build, one for the reasoning stack:

LLM foundation (reference)

Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.

Reasoning stack (reference)

Raschka, Sebastian. Build A Reasoning Model (From Scratch). Manning, 2025. ISBN: 9781633434677.


What’s Next

Next month we start where most “LLM projects” quietly go off the rails:

Tokenization as the First Model

Tokenization is the first place you can accidentally train on garbage while everything “looks fine”:

  • off-by-one target shifts
  • broken chunk boundaries
  • inconsistent special tokens
  • masking that leaks future context

So we’ll build it as a deterministic subsystem with tests — because if this layer lies, the entire year lies.
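
A minimal sketch of the target-shift invariant that a deterministic chunker has to preserve (names are illustrative; the real subsystem arrives next month). The targets are the inputs shifted by exactly one position, which is where off-by-one bugs classically hide:

```python
def make_chunks(ids: list[int], block_size: int) -> list[tuple[list[int], list[int]]]:
    # Each input x is ids[start : start + block_size]; its target y is the
    # same window shifted by one token. Dropping the ragged tail keeps every
    # (x, y) pair the same length, so the invariant is trivially checkable.
    chunks = []
    for start in range(0, len(ids) - block_size, block_size):
        x = ids[start : start + block_size]
        y = ids[start + 1 : start + 1 + block_size]
        chunks.append((x, y))
    return chunks
```

The value of writing it this plainly is that the invariant becomes a one-line test: for every pair, `y[:-1] == x[1:]`. If that assertion ever fails, the model is being trained on garbage while the loss curve “looks fine.”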

Axel Domingues - 2026