Dec 25, 2022 - 22 MIN READ

Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)

A production system isn’t “done” when it works — it’s done when it can fail, recover, evolve, and stay correct under pressure. This capstone stitches the 2021–2022 series into a reference architecture and a decision log you can defend.

Axel Domingues

A system doesn’t “die” when a server crashes.

It dies when:

a deploy can’t be rolled back,
an incident can’t be explained,
a queue silently duplicates money,
a dashboard looks green while users are furious,
or the cost curve bends upward and nobody knows why.

In 2021 we rebuilt the full-stack mental model: browser reality, HTTP semantics, APIs, caching, async, and the real cost of “distributed”.

In 2022 we got strict: repeatability, reliability, governance, observability, security, distributed data, performance, cloud choices, analytics truth, and cost as a constraint.

This capstone is the stitching.

Not “the one true stack”.
A reference architecture + decision log that helps you build something you can operate and evolve under pressure.

What “survive” means

Correctness, reliability, performance, security, and cost — at the same time.

What you get here

A pragmatic reference architecture + a decision log template you can reuse.

How to use this post

Treat it as a review checklist when designing, reviewing, or rescuing a system.

What’s next (2023)

Platform and runtime choices that encode operability (Nuxt 3 + UnJS as the case study).

The Prime Directive: Operate First, Optimize Later

If you remember one idea from this series, keep this one:

The first job of architecture is to make production behavior legible and reversible.

Everything else — performance, microservices, “modern stacks”, even developer experience — becomes a trap if you cannot:

observe what’s happening
change behavior safely
recover when you’re wrong

A “survivable” system isn’t necessarily fancy.

It’s boring in the best way:
predictable deploys
bounded blast radius
measurable goals
known failure modes

Reference Architecture: The Parts That Keep You Alive

Below is a reference architecture I like because it is defensible.

It makes tradeoffs explicit and gives each concern a home:

correctness
latency
reliability
security
analytics truth
and cost control

Reference architecture for a survivable system (edge, app, data, async, observability, security, analytics)

The “Two Planes” mental model

Data plane

Handles user traffic: UI, APIs, async work, data stores. Needs low latency + high availability.

Control plane

Changes and governance: CI/CD, config, secrets, feature flags, policies, rollbacks, SLOs.

If your control plane is weak, your system becomes unchangeable — and you eventually lose.

The Core Building Blocks

1) Edge + Delivery

Your edge isn’t “just CDN”. It’s your first reliability layer.

Typical responsibilities:

TLS termination and HTTP correctness
caching and compression (when it’s honest)
WAF / rate limiting / bot protection
static asset delivery
request routing to regions (when applicable)

Edge caching should be selective.

Cache what is:
public, versioned, and immutable (assets)
safely cacheable with clear keys (some read endpoints)

Do not “accidentally cache” personalized or permissioned content.

2) UI + Backend-for-Frontend (BFF)

One survival pattern that keeps showing up:

Treat frontend as architecture.

UI owns user experience and failure behavior (fallbacks, spinners, offline-ish modes)
A BFF adapts backend complexity to user-facing needs (aggregation, shaping, auth context)
BFF becomes the performance and correctness boundary for the UI

When teams skip the BFF boundary, the frontend slowly becomes “distributed systems by surprise”:

N+1 calls
duplicated auth logic
inconsistent caching
and partial failure chaos

3) Core Business: Modular Monolith until it hurts

Microservices don’t solve complexity. They move it into:

network calls
consistency guarantees
tracing and debugging
deployment coordination
and cost

Start with a modular monolith when you can:

clear bounded contexts
strong internal contracts
a migration path to split services later

Then split by pain and ownership, not by ideology.

Split when

Deploy coupling slows teams, scaling needs diverge, or security boundaries must harden.

Don’t split when

You’re bored, you saw a conference talk, or “microservices” is the company identity.

4) Data: Separate “Truth” from “Reporting”

This is where a lot of product teams suffer unnecessarily:

OLTP is for correctness and transactions
OLAP is for analysis and decision-making
trying to use one as the other corrupts both

The rule:

Never run analytics workloads on the same database that serves your critical transactional path.

Your transactional database should be boringly stable.

Your analytics world should be:

append-friendly
history-aware
optimized for aggregation
resilient to schema evolution

The best analytics pipeline is the one that doesn’t lie.

That means:
event schemas with versioning
idempotent ingestion
replay capability
and strict definitions of “truth”

5) Async Work: Outbox + Queue + Workers

Async is where correctness dies quietly.

So we design it like a financial system even when it’s “just emails”.

The minimum survivable pattern:

Write state + an outbox record in one transaction
A dispatcher publishes to a queue/event bus
Workers process with idempotency + retries + dead-letter strategy
Workflows track state transitions explicitly

If you cannot answer “what happens when this runs twice?”, you don’t have a background job system — you have a lottery.

6) Observability: SLOs, Not Screenshots

Dashboards are pictures. Survivability needs feedback loops.

At minimum, you want:

metrics (RED for services, USE for infra)
logs with structured context (request id / trace id / user id safe fields)
distributed traces for cross-service causality
SLOs + error budgets (a contract for tradeoffs)

What you measure

User-facing outcomes: latency, errors, saturation, availability, correctness signals.

What you do with it

You make tradeoffs: ship faster vs stabilize — guided by error budgets.

7) Security: Defaults, Not Heroics

Security that survives is mostly boring defaults:

least privilege IAM
secret management (no secrets in env files committed “temporarily”)
dependency hygiene and SBOM-ish awareness
boundary protection (WAF, rate limits, auth middleware)
audit logs for sensitive actions

Threat models beat compliance checklists — because they force you to answer:

“What could go wrong here, and what is our design response?”

The Survival Checklist: What You Must Be Able to Do

If your system can’t do these, it’s not production-ready — it’s production-exposed.

Roll back safely

every deploy has a rollback plan
schema changes are backward compatible (or staged)
feature flags exist for high-risk changes

Debug from the outside in

start with user symptom → trace → service → dependency
logs and traces correlate reliably
you can reproduce a failure path in staging (or via replay)

Degrade gracefully

timeouts and retries are bounded
partial failures don’t cascade endlessly
fallback behavior is explicit (UI and API)

Prove correctness under retries

idempotency keys exist where money/side-effects happen
deduplication is designed, not assumed
dead-letter queues are monitored and actionable

Know your performance envelope

you have SLOs for key endpoints
you can explain p50/p95/p99 behavior
load tests model realistic traffic, not fantasy

Know your cost drivers

you can attribute cost to workloads, not just accounts
you can identify “top spenders” by service and by feature
you can forecast changes (not perfectly — but sanely)

Decision Log: Architecture You Can Defend

Teams don’t fail because they picked “the wrong database”.

They fail because:

decisions were implicit
tradeoffs were forgotten
and constraints changed while the system stayed the same

A decision log is adult supervision for architecture.

Below is a reusable template and a set of canonical decisions that show up in most systems.

Incident Response: Designing for the Worst Day

The worst day is not when the incident happens.

It’s when:

the alarm is ambiguous,
the on-call is blind,
and everyone argues while customers burn.

A survivable system designs the incident loop explicitly.

Detect

Alerts are SLO-driven and actionable, not “CPU is 72%” noise.

Triage

You have a first-15-min playbook and known runbooks for common failures.

Mitigate

Flags, rate limits, graceful degradation, and rollback are fast and safe.

Learn

Postmortems improve the system, not blame humans.

The best incident response improvement is usually one of:

better timeouts and bulkheads
better dependency isolation
better runbooks and dashboards for one critical journey
safer deployment/rollback controls

This is the language I like to use with product leaders:

We can ship fast until we consume our error budget.
We prioritize reliability work when user outcomes demand it.
We invest in instrumentation and rollback safety so we can take bigger bets.
We do not trade correctness for speed in irreversible workflows.

That is what mature architecture sounds like.

What’s Next

This finishes the series. And now it’s time to close a loop.

This blog started as my machine learning + deep learning journey.

At the end of this year (2022), the world shifted — not because ML got “new”… but because a product (ChatGPT) made these ideas operationally real for everyone.

So next, we start a new series:

Dissecting ChatGPT: what it is, why it works, and what it changes for software.

Not hype. Not prompts.

Architecture, interfaces, failure modes, and the reality of building systems on top of probabilistic engines.

Software in the Age of Probabilistic Components

LLMs aren’t “features” — they’re probabilistic runtime dependencies. This post gives the mental model, contracts, failure modes, and ship-ready checklists for building real products on top of them.

Incident Response and Resilience: Designing for Failure, Not Hope

Most teams “have on-call”. Fewer teams have resilience. This is a practical blueprint for designing systems, teams, and workflows that respond fast, recover safely, and learn without blame.