Jan 25, 2026 - 13 MIN READ
Two Models, One Spine: the 2026 build plan (LLM → Reasoning)

2026 is a build year: we’ll construct a language model and a reasoning system from scratch—month by month—without splitting the work into silos that can’t compound.

Axel Domingues

2025 ended with a clean conclusion:

Agents became platforms — and platforms need governance.

In 2026, I’m switching modes again.

Not back to a research diary.
Not forward into “prompt tricks.”

This year is a build-and-teach series:

Foundations to Reasoning: Building LLMs (and Reasoners) From Scratch

The mission is simple to state and hard to do well:

  • build a language model you can pretrain, fine-tune, and evaluate
  • turn it into a reasoning system you can measure, scale, and improve
  • do it on one shared spine, so every improvement benefits the whole stack

And we’ll do it while crediting — and learning from — two reference texts I’ll be following closely:

  • Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.

  • Raschka, Sebastian. Build A Reasoning Model (From Scratch). Manning, 2025. ISBN: 9781633434677.

The goal is not to duplicate the books’ repos. The goal is to use them as a strong guideline and build a cohesive system with clearer boundaries, better testing discipline, and a more “teachably engineered” structure.

What this year is

A step-by-step construction of an LLM, then a reasoning pipeline that makes correctness measurable.

What this year is not

Not a “prompt cookbook”, not a vibe-driven benchmark chase, and not twelve isolated experiments.

The organizing rule

One spine: shared tokenization, model core, generation, training loops, evaluation, and traces.

What counts as progress

A capability only “exists” if we can reproduce it, measure it, and debug it.


The core claim: reasoning isn’t a separate species

A lot of people talk about “reasoning models” like they’re fundamentally different from language models.

In practice, most of the leverage comes from three things:

  1. A capable base model (the substrate)
  2. A reasoning pipeline (the system around the model)
  3. Training-time pressure (methods that move the model toward correctness)

So when I say “two models”, I don’t mean two unrelated codebases.

I mean:

  • an LLM we understand deeply enough to operate
  • a reasoning layer we can evolve without rewriting everything

What “one spine” means (in engineering terms)

“One spine” is an architectural choice: shared interfaces and shared primitives.

Because without it, the year fractures into silos:

  • one repo for tokenization
  • one repo for training loops
  • one repo for evaluation scripts
  • one repo for reasoning experiments
  • and a folder called misc/ that quietly becomes your production system

That structure feels productive until you try to answer a simple question:

“Did this change improve the model, or just change the measurement?”

A single spine makes that question answerable.

The shared primitives we will keep stable

  • Tokenization + batching: the rules that define what a “token stream” is
  • Model core: transformer blocks, attention, normalization, checkpoint format
  • Generation: sampling, deterministic modes, and scoring (logprobs)
  • Training loops: pretraining and fine-tuning as reusable engines (not scripts)
  • Evaluation + traces: consistent runners that produce auditable outputs
  • Reasoning pipelines: generation → selection → refinement → verification (as a first-class system)

If you want improvements to compound, you need stable interfaces. That’s as true for ML systems as it is for distributed systems.
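
As a concrete sketch of what “stable interfaces” can look like, here is a minimal Python version of the spine’s contracts. The names (`Tokenizer`, `Generator`, `run_eval`) are illustrative, not the series’ final API:

```python
from typing import Protocol, Sequence

class Tokenizer(Protocol):
    """The spine's contract for turning text into token IDs and back."""
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: Sequence[int]) -> str: ...

class Generator(Protocol):
    """The spine's contract for producing tokens and scoring sequences."""
    def generate(self, prompt_ids: list[int], max_new_tokens: int) -> list[int]: ...
    def logprob(self, ids: Sequence[int]) -> float: ...

def run_eval(tok: Tokenizer, gen: Generator, prompt: str) -> str:
    # Everything above this layer (eval harnesses, reasoning pipelines)
    # only sees these two interfaces, so swapping the model core never
    # forces a rewrite of the harness.
    ids = tok.encode(prompt)
    out = gen.generate(ids, max_new_tokens=32)
    return tok.decode(out)
```

The design choice is the point: the eval runner depends on the protocols, not on any concrete model, which is what lets improvements compound instead of forking.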

The 2026 arc: from “text prediction” to “measurable correctness”

Here’s the year in one sentence:

We start by building an LLM you can train and trust mechanically,
then we build a reasoning system you can trust epistemically.

That shift matters.

Language modeling mostly rewards fluency.
Reasoning punishes you for being wrong while sounding right.

So in the second half of the year, we’ll stop asking:

  • “Does it sound good?”

…and start asking:

  • “Can we grade it?”
  • “Can we extract the final answer reliably?”
  • “Can we improve accuracy without hiding the cost?”

The roadmap (January → December)

This is the “table of contents” for the year. Each month builds on the last.

Month (theme): what you’ll learn to build

  • Jan (Two Models, One Spine): The overall architecture and the operating rules for the year
  • Feb (Tokenization): BPE, IDs, chunking, and data loaders that don’t lie
  • Mar (Attention): Causal masking, multihead shapes, and why bugs hide in broadcasting
  • Apr (Transformer blocks): Residual paths, normalization choices, and stability knobs
  • May (Pretraining): Loss, perplexity, checkpoints, and an honest training loop
  • Jun (Baselines): Importing pretrained weights + parity checks (platform validation)
  • Jul (Fine-tuning I): Classification adapters and what “last token” really means
  • Aug (Fine-tuning II): Instruction formatting/masking and the difference between “chat” and “completion”
  • Sep (Reasoning eval): Bench harnesses (prompting → generation → extraction → grading → traces)
  • Oct (Inference scaling I): Self-consistency, sampling, and compute/quality tradeoffs
  • Nov (Inference scaling II): Candidate scoring + refinement loops that converge (not spiral)
  • Dec (Training scaling): RL for reasoning + distillation for efficiency (and where to go next)

The reasoning book is in MEAP, and later chapters (including distillation and pipeline improvements) may land during the year. The plan assumes those chapters will be available by the time we reach the late-year topics.

What “from scratch” means in this series

“From scratch” does not mean:

  • re-implementing everything the ecosystem already solved
  • refusing to compare against pretrained baselines
  • spending 6 months on build tools instead of learning

It means:

  • implementing the core mechanisms ourselves (so we can debug them)
  • building a training/eval stack that is reproducible and inspectable
  • validating our platform by importing pretrained weights mid-year
  • treating “reasoning” as a system (not just a prompt)

If you can’t explain your data pipeline, masking rules, and checkpoint integrity, you don’t “have a model.”

You have a run that happened once.


How I’m going to teach this (so it stays useful)

This series is written for senior engineers and builders who want to understand the full stack — without drowning in mysticism.

So each month will follow a consistent teaching pattern:

Build the mental model

What problem exists? What breaks? What’s the boundary?

Implement the smallest honest version

Not “feature complete”. Complete enough to be testable.

Show failure modes and diagnostics

How it fails, how you detect it, and what to do next.

Scale carefully

Only after we can reproduce and measure improvements.

That rhythm is the difference between “content” and “craft”.


Source material and how I’m using it

This year is anchored in two texts — one for the base LLM build, one for the reasoning stack:

LLM foundation (reference)

Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.

Reasoning stack (reference)

Raschka, Sebastian. Build A Reasoning Model (From Scratch). Manning, 2025. ISBN: 9781633434677.


What’s Next

Next month we start where most “LLM projects” quietly go off the rails:

Tokenization as the First Model

Tokenization is the first place you can accidentally train on garbage while everything “looks fine”:

  • off-by-one target shifts
  • broken chunk boundaries
  • inconsistent special tokens
  • masking that leaks future context

So we’ll build it as a deterministic subsystem with tests — because if this layer lies, the entire year lies.
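
A minimal sketch of the target-shift invariant that a deterministic chunker has to preserve (names are illustrative; the real subsystem arrives next month). The targets are the inputs shifted by exactly one position, which is where off-by-one bugs classically hide:

```python
def make_chunks(ids: list[int], block_size: int) -> list[tuple[list[int], list[int]]]:
    # Each input x is ids[start : start + block_size]; its target y is the
    # same window shifted by one token. Dropping the ragged tail keeps every
    # (x, y) pair the same length, so the invariant is trivially checkable.
    chunks = []
    for start in range(0, len(ids) - block_size, block_size):
        x = ids[start : start + block_size]
        y = ids[start + 1 : start + 1 + block_size]
        chunks.append((x, y))
    return chunks
```

The value of writing it this plainly is that the invariant becomes a one-line test: for every pair, `y[:-1] == x[1:]`. If that assertion ever fails, the model is being trained on garbage while the loss curve “looks fine.”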

Axel Domingues - 2026