Blog
Dec 25, 2022 - 22 MIN READ
Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)

Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)

A production system isn’t “done” when it works — it’s done when it can fail, recover, evolve, and stay correct under pressure. This capstone stitches the 2021–2022 series into a reference architecture and a decision log you can defend.

Axel Domingues

Axel Domingues

A system doesn’t “die” when a server crashes.

It dies when:

  • a deploy can’t be rolled back,
  • an incident can’t be explained,
  • a queue silently duplicates money,
  • a dashboard looks green while users are furious,
  • or the cost curve bends upward and nobody knows why.

In 2021 we rebuilt the full-stack mental model: browser reality, HTTP semantics, APIs, caching, async, and the real cost of “distributed”.

In 2022 we got strict: repeatability, reliability, governance, observability, security, distributed data, performance, cloud choices, analytics truth, and cost as a constraint.

This capstone is the stitching.

Not “the one true stack”.
A reference architecture + decision log that helps you build something you can operate and evolve under pressure.

What “survive” means

Correctness, reliability, performance, security, and cost — at the same time.

What you get here

A pragmatic reference architecture + a decision log template you can reuse.

How to use this post

Treat it as a review checklist when designing, reviewing, or rescuing a system.

What’s next (2023)

Platform and runtime choices that encode operability (Nuxt 3 + UnJS as the case study).


The Prime Directive: Operate First, Optimize Later

If you remember one idea from this series, keep this one:

The first job of architecture is to make production behavior legible and reversible.

Everything else — performance, microservices, “modern stacks”, even developer experience — becomes a trap if you cannot:

  • observe what’s happening
  • change behavior safely
  • recover when you’re wrong
A “survivable” system isn’t necessarily fancy.

It’s boring in the best way:

  • predictable deploys
  • bounded blast radius
  • measurable goals
  • known failure modes

Reference Architecture: The Parts That Keep You Alive

Below is a reference architecture I like because it is defensible.

It makes tradeoffs explicit and gives each concern a home:

  • correctness
  • latency
  • reliability
  • security
  • analytics truth
  • and cost control

The “Two Planes” mental model

Data plane

Handles user traffic: UI, APIs, async work, data stores. Needs low latency + high availability.

Control plane

Changes and governance: CI/CD, config, secrets, feature flags, policies, rollbacks, SLOs.

If your control plane is weak, your system becomes unchangeable — and you eventually lose.


The Core Building Blocks

1) Edge + Delivery

Your edge isn’t “just CDN”. It’s your first reliability layer.

Typical responsibilities:

  • TLS termination and HTTP correctness
  • caching and compression (when it’s honest)
  • WAF / rate limiting / bot protection
  • static asset delivery
  • request routing to regions (when applicable)
Edge caching should be selective.

Cache what is:

  • public, versioned, and immutable (assets)
  • safely cacheable with clear keys (some read endpoints)
Do not “accidentally cache” personalized or permissioned content.

2) UI + Backend-for-Frontend (BFF)

One survival pattern that keeps showing up:

Treat frontend as architecture.

  • UI owns user experience and failure behavior (fallbacks, spinners, offline-ish modes)
  • A BFF adapts backend complexity to user-facing needs (aggregation, shaping, auth context)
  • BFF becomes the performance and correctness boundary for the UI
When teams skip the BFF boundary, the frontend slowly becomes “distributed systems by surprise”:
  • N+1 calls
  • duplicated auth logic
  • inconsistent caching
  • and partial failure chaos

3) Core Business: Modular Monolith until it hurts

Microservices don’t solve complexity. They move it into:

  • network calls
  • consistency guarantees
  • tracing and debugging
  • deployment coordination
  • and cost

Start with a modular monolith when you can:

  • clear bounded contexts
  • strong internal contracts
  • a migration path to split services later

Then split by pain and ownership, not by ideology.

Split when

Deploy coupling slows teams, scaling needs diverge, or security boundaries must harden.

Don’t split when

You’re bored, you saw a conference talk, or “microservices” is the company identity.


4) Data: Separate “Truth” from “Reporting”

This is where a lot of product teams suffer unnecessarily:

  • OLTP is for correctness and transactions
  • OLAP is for analysis and decision-making
  • trying to use one as the other corrupts both

The rule:

Never run analytics workloads on the same database that serves your critical transactional path.

Your transactional database should be boringly stable.

Your analytics world should be:

  • append-friendly
  • history-aware
  • optimized for aggregation
  • resilient to schema evolution
The best analytics pipeline is the one that doesn’t lie.

That means:

  • event schemas with versioning
  • idempotent ingestion
  • replay capability
  • and strict definitions of “truth”

5) Async Work: Outbox + Queue + Workers

Async is where correctness dies quietly.

So we design it like a financial system even when it’s “just emails”.

The minimum survivable pattern:

  • Write state + an outbox record in one transaction
  • A dispatcher publishes to a queue/event bus
  • Workers process with idempotency + retries + dead-letter strategy
  • Workflows track state transitions explicitly
If you cannot answer “what happens when this runs twice?”, you don’t have a background job system — you have a lottery.

6) Observability: SLOs, Not Screenshots

Dashboards are pictures. Survivability needs feedback loops.

At minimum, you want:

  • metrics (RED for services, USE for infra)
  • logs with structured context (request id / trace id / user id safe fields)
  • distributed traces for cross-service causality
  • SLOs + error budgets (a contract for tradeoffs)

What you measure

User-facing outcomes: latency, errors, saturation, availability, correctness signals.

What you do with it

You make tradeoffs: ship faster vs stabilize — guided by error budgets.


7) Security: Defaults, Not Heroics

Security that survives is mostly boring defaults:

  • least privilege IAM
  • secret management (no secrets in env files committed “temporarily”)
  • dependency hygiene and SBOM-ish awareness
  • boundary protection (WAF, rate limits, auth middleware)
  • audit logs for sensitive actions

Threat models beat compliance checklists — because they force you to answer:

“What could go wrong here, and what is our design response?”


The Survival Checklist: What You Must Be Able to Do

If your system can’t do these, it’s not production-ready — it’s production-exposed.

Roll back safely

  • every deploy has a rollback plan
  • schema changes are backward compatible (or staged)
  • feature flags exist for high-risk changes

Debug from the outside in

  • start with user symptom → trace → service → dependency
  • logs and traces correlate reliably
  • you can reproduce a failure path in staging (or via replay)

Degrade gracefully

  • timeouts and retries are bounded
  • partial failures don’t cascade endlessly
  • fallback behavior is explicit (UI and API)

Prove correctness under retries

  • idempotency keys exist where money/side-effects happen
  • deduplication is designed, not assumed
  • dead-letter queues are monitored and actionable

Know your performance envelope

  • you have SLOs for key endpoints
  • you can explain p50/p95/p99 behavior
  • load tests model realistic traffic, not fantasy

Know your cost drivers

  • you can attribute cost to workloads, not just accounts
  • you can identify “top spenders” by service and by feature
  • you can forecast changes (not perfectly — but sanely)

Decision Log: Architecture You Can Defend

Teams don’t fail because they picked “the wrong database”.

They fail because:

  • decisions were implicit
  • tradeoffs were forgotten
  • and constraints changed while the system stayed the same

A decision log is adult supervision for architecture.

Below is a reusable template and a set of canonical decisions that show up in most systems.


Incident Response: Designing for the Worst Day

The worst day is not when the incident happens.

It’s when:

  • the alarm is ambiguous,
  • the on-call is blind,
  • and everyone argues while customers burn.

A survivable system designs the incident loop explicitly.

Detect

Alerts are SLO-driven and actionable, not “CPU is 72%” noise.

Triage

You have a first-15-min playbook and known runbooks for common failures.

Mitigate

Flags, rate limits, graceful degradation, and rollback are fast and safe.

Learn

Postmortems improve the system, not blame humans.

The best incident response improvement is usually one of:
  • better timeouts and bulkheads
  • better dependency isolation
  • better runbooks and dashboards for one critical journey
  • safer deployment/rollback controls

The “Survivability Contract” You Can Share with Stakeholders

This is the language I like to use with product leaders:

  • We can ship fast until we consume our error budget.
  • We prioritize reliability work when user outcomes demand it.
  • We invest in instrumentation and rollback safety so we can take bigger bets.
  • We do not trade correctness for speed in irreversible workflows.

That is what mature architecture sounds like.


What’s Next

This finishes the series. And now it’s time to close a loop.

This blog started as my machine learning + deep learning journey.

At the end of this year (2022), the world shifted — not because ML got “new”… but because a product (ChatGPT) made these ideas operationally real for everyone.

So next, we start a new series:

Dissecting ChatGPT: what it is, why it works, and what it changes for software.

Not hype. Not prompts.

Architecture, interfaces, failure modes, and the reality of building systems on top of probabilistic engines.

Axel Domingues - 2026