
A production system isn’t “done” when it works — it’s done when it can fail, recover, evolve, and stay correct under pressure. This capstone stitches the 2021–2022 series into a reference architecture and a decision log you can defend.
Axel Domingues
A system doesn’t “die” when a server crashes.
It dies when:
In 2021 we rebuilt the full-stack mental model: browser reality, HTTP semantics, APIs, caching, async, and the real cost of “distributed”.
In 2022 we got strict: repeatability, reliability, governance, observability, security, distributed data, performance, cloud choices, analytics truth, and cost as a constraint.
This capstone is the stitching.
Not “the one true stack”.
A reference architecture + decision log that helps you build something you can operate and evolve under pressure.
What “survive” means
Correctness, reliability, performance, security, and cost — at the same time.
What you get here
A pragmatic reference architecture + a decision log template you can reuse.
How to use this post
Treat it as a review checklist when designing, reviewing, or rescuing a system.
What’s next (2023)
Platform and runtime choices that encode operability (Nuxt 3 + UnJS as the case study).
If you remember one idea from this series, keep this one:
The first job of architecture is to make production behavior legible and reversible.
Everything else — performance, microservices, “modern stacks”, even developer experience — becomes a trap if you cannot:
It’s boring in the best way:
- predictable deploys
- bounded blast radius
- measurable goals
- known failure modes
Below is a reference architecture I like because it is defensible.
It makes tradeoffs explicit and gives each concern a home:

Data plane
Handles user traffic: UI, APIs, async work, data stores. Needs low latency + high availability.
Control plane
Changes and governance: CI/CD, config, secrets, feature flags, policies, rollbacks, SLOs.
If your control plane is weak, your system becomes unchangeable — and you eventually lose.
Your edge isn’t “just CDN”. It’s your first reliability layer.
Typical responsibilities:
Do not “accidentally cache” personalized or permissioned content.
Cache what is:
- public, versioned, and immutable (assets)
- safely cacheable with clear keys (some read endpoints)
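The caching rule above can be written as a default-deny header policy. This is a minimal sketch, not a framework API: the `Response` fields and the `cache_control` function are hypothetical names chosen for illustration, and the one-year `max-age` is an assumed convention for hashed assets. The directives themselves (`no-store`, `private`, `public`, `max-age`, `immutable`) are standard `Cache-Control` values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """Hypothetical description of a response the edge is about to serve."""
    path: str
    is_versioned_asset: bool = False  # e.g. a content-hashed filename
    is_personalized: bool = False     # body depends on who is asking
    requires_auth: bool = False       # permissioned content
    public_read_ttl_s: int = 0        # > 0 only for reads explicitly marked cacheable


def cache_control(resp: Response) -> str:
    """Return a Cache-Control value; anything unclassified is not stored."""
    # Personalized or permissioned content must never land in a shared cache.
    if resp.is_personalized or resp.requires_auth:
        return "private, no-store"
    # Versioned assets never change under the same URL, so cache them hard.
    if resp.is_versioned_asset:
        return "public, max-age=31536000, immutable"
    # Read endpoints are cached only when someone opted in with a TTL.
    if resp.public_read_ttl_s > 0:
        return f"public, max-age={resp.public_read_ttl_s}"
    # Default: do not cache by accident.
    return "no-store"
```

The design choice worth copying is the order of the checks: the "never cache" test runs first, so a response that is both versioned and permissioned is still kept out of shared caches.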
One survival pattern that keeps showing up:
Treat the frontend as architecture.
Microservices don’t solve complexity. They move it into:
Start with a modular monolith when you can:
Then split by pain and ownership, not by ideology.
Split when
Deploy coupling slows teams, scaling needs diverge, or security boundaries must harden.
Don’t split when
You’re bored, you saw a conference talk, or “microservices” is the company identity.
This is where a lot of product teams suffer unnecessarily:
Never run analytics workloads on the same database that serves your critical transactional path.
Your transactional database should be boringly stable.
Your analytics world should be:
That means:
- event schemas with versioning
- idempotent ingestion
- replay capability
- and strict definitions of “truth”
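Those four properties fit in a small in-memory sketch. Everything here is illustrative: `Event`, `EventLog`, `upgrade`, and `revenue_cents` are hypothetical names, and the v1 to v2 change (a float `amount` becoming integer `amount_cents`) is an assumed example of a schema bump. A real pipeline would persist the log and the seen-ID set durably.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Event:
    event_id: str        # unique per event; doubles as the idempotency key
    name: str            # e.g. "order_placed"
    schema_version: int  # bumped on every breaking payload change
    payload: dict[str, Any]


def upgrade(event: Event) -> Event:
    """Bring an old event up to the current schema version."""
    if event.schema_version == 1:
        # Assumed change: v1 stored "amount" as a float, v2 stores integer cents.
        cents = round(event.payload["amount"] * 100)
        return Event(event.event_id, event.name, 2, {"amount_cents": cents})
    return event


@dataclass
class EventLog:
    events: list[Event] = field(default_factory=list)
    seen_ids: set[str] = field(default_factory=set)

    def ingest(self, event: Event) -> bool:
        """Append once per event_id; a redelivered event is a no-op."""
        if event.event_id in self.seen_ids:
            return False
        self.seen_ids.add(event.event_id)
        self.events.append(event)  # raw events are stored exactly as received
        return True

    def replay(self, fold: Callable[[int, Event], int], initial: int = 0) -> int:
        """Rebuild a derived number from the raw log, upgrading as we go."""
        state = initial
        for event in self.events:
            state = fold(state, upgrade(event))
        return state


def revenue_cents(total: int, event: Event) -> int:
    """One strict definition of "revenue": the sum of order_placed amounts."""
    if event.name == "order_placed":
        return total + event.payload["amount_cents"]
    return total
```

Because the log keeps raw events and the metric is a pure fold over them, changing the definition of "revenue" means writing a new fold and replaying, not patching numbers in place.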
Async is where correctness dies quietly.
So we design it like a financial system even when it’s “just emails”.
The minimum survivable pattern:
Dashboards are pictures. Survivability needs feedback loops.
At minimum, you want:
What you measure
User-facing outcomes: latency, errors, saturation, availability, correctness signals.
What you do with it
You make tradeoffs: ship faster vs stabilize — guided by error budgets.
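An error budget is just arithmetic on the SLO. A 99.9% availability target over 30 days allows 0.1% of 43,200 minutes, which is about 43.2 minutes of unavailability. The sketch below turns that into a ship-or-stabilize signal. The function names and the 25% freeze threshold are assumptions for illustration; pick a threshold that matches your own policy.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget left; a negative value means the SLO is blown."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


def should_stabilize(
    slo: float,
    bad_minutes: float,
    window_days: int = 30,
    threshold: float = 0.25,  # assumed policy: stabilize below 25% remaining
) -> bool:
    """True when the remaining budget says to stop shipping and stabilize."""
    return budget_remaining(slo, bad_minutes, window_days) < threshold
```

With a 99.9% SLO, 10 bad minutes in the window leaves most of the budget and shipping continues, while 40 bad minutes leaves under a tenth of it and the same function says to stabilize.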
Security that survives is mostly boring defaults:
Threat models beat compliance checklists — because they force you to answer:
“What could go wrong here, and what is our design response?”
If your system can’t do these, it’s not production-ready — it’s production-exposed.
Teams don’t fail because they picked “the wrong database”.
They fail because:
A decision log is adult supervision for architecture.
Below is a reusable template and a set of canonical decisions that show up in most systems.
ID: ADR-###
Status: Proposed | Accepted | Deprecated
Date: YYYY-MM-DD
Context
Options
Decision
Why
Consequences
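If you prefer to keep decisions next to the code, the same template can be a typed record that renders to text. This is a sketch, assuming the field names mirror the template above; `DecisionRecord`, `Status`, and `render` are hypothetical names, and the example date is a placeholder. The example content is taken from the modular monolith guidance earlier in this post.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    PROPOSED = "Proposed"
    ACCEPTED = "Accepted"
    DEPRECATED = "Deprecated"


@dataclass
class DecisionRecord:
    number: int
    status: Status
    decided_on: date
    context: str
    options: list[str]
    decision: str
    why: str
    consequences: list[str]

    def render(self) -> str:
        """Render the record in the same order as the template."""
        lines = [
            f"ID: ADR-{self.number:03d}",
            f"Status: {self.status.value}",
            f"Date: {self.decided_on.isoformat()}",
            "Context",
            self.context,
            "Options",
            *[f"- {option}" for option in self.options],
            "Decision",
            self.decision,
            "Why",
            self.why,
            "Consequences",
            *[f"- {item}" for item in self.consequences],
        ]
        return "\n".join(lines)


example = DecisionRecord(
    number=1,
    status=Status.ACCEPTED,
    decided_on=date(2022, 12, 1),  # placeholder date
    context="We need to choose a service topology for a new product.",
    options=["Modular monolith", "Microservices from day one"],
    decision="Start with a modular monolith.",
    why="Microservices move complexity rather than remove it.",
    consequences=["Split later by pain and ownership, not by ideology."],
)
```

Calling `example.render()` produces a block you can commit alongside the change it justifies.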
The worst day is not when the incident happens.
It’s when:
A survivable system designs the incident loop explicitly.
Detect
Alerts are SLO-driven and actionable, not “CPU is 72%” noise.
Triage
You have a first-15-min playbook and known runbooks for common failures.
Mitigate
Flags, rate limits, graceful degradation, and rollback are fast and safe.
Learn
Postmortems improve the system, not blame humans.
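The Mitigate step is the easiest one to rehearse in code. Below is a minimal sketch of a kill switch combined with graceful degradation: a non-critical feature is skipped when its flag is off, and falls back when it throws. `Flags` and `with_degradation` are hypothetical names, and the in-memory flag store stands in for whatever your control plane provides.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class Flags:
    """Hypothetical in-memory flag store; a real one lives in the control plane."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, name: str) -> None:
        self._disabled.add(name)

    def enable(self, name: str) -> None:
        self._disabled.discard(name)

    def is_enabled(self, name: str) -> bool:
        return name not in self._disabled


def with_degradation(
    flags: Flags,
    name: str,
    primary: Callable[[], T],
    fallback: Callable[[], T],
) -> T:
    """Run primary unless the flag is off or it fails; otherwise degrade."""
    if not flags.is_enabled(name):
        # Kill switch: an operator turned the feature off during an incident.
        return fallback()
    try:
        return primary()
    except Exception:
        # Degrade instead of failing the whole request.
        return fallback()
```

A typical use is wrapping a recommendations call so that the page still renders with an empty list: `with_degradation(flags, "recs", fetch_recs, lambda: [])`, where `fetch_recs` is your own function. Flipping the flag is then a mitigation that needs no deploy.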
This is the language I like to use with product leaders:
That is what mature architecture sounds like.
This finishes the series. And now it’s time to close a loop.
This blog started as my machine learning + deep learning journey.
At the end of this year (2022), the world shifted — not because ML got “new”… but because a product (ChatGPT) made these ideas operationally real for everyone.
So next, we start a new series:
Dissecting ChatGPT: what it is, why it works, and what it changes for software.
Not hype. Not prompts.
Architecture, interfaces, failure modes, and the reality of building systems on top of probabilistic engines.