
Most teams “have on-call”. Fewer teams have resilience. This is a practical blueprint for designing systems, teams, and workflows that respond fast, recover safely, and learn without blame.
Axel Domingues
The first time you run a real incident, you realize something uncomfortable:
Your architecture doesn’t fail when it’s wrong. It fails when it’s surprised.
And production is very good at surprise:
This post is not about heroics.
It’s about designing the system (and the team) so that when failure happens:
Incident response is where every design decision shows up—timeouts, retries, deploy strategy, observability, ownership, and the human workflow around all of it.
The real goal
Reduce time-to-mitigate and avoid repeat failures—not “avoid all incidents”.
The mindset shift
Resilience isn’t a feature you add later.
It’s a property you design into the system.
What we’ll build
A practical incident model: roles, playbooks, alerting, and safe mitigation patterns.
What “good” looks like
Fast detection, small blast radius, reversible changes, and blameless learning.
An incident is any unplanned event that:
Notice what’s missing: the word “outage”.
Many incidents are not full downtime. Most real pain is partial:
Degradation is where you can still make safe decisions. Full outages are where you panic.
A simple severity rubric (adapt to your org):
The point isn’t perfect classification.
It’s aligning the team on how big the response should be.
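For illustration only (these levels and expectations are placeholders to adapt, not a prescribed rubric), the whole thing can be as small as a mapping from severity to the size of the response:

```python
# Placeholder rubric: tune the levels, impact wording, and response
# expectations to your own org. The point is shared expectations, not precision.
SEVERITY_RUBRIC = {
    "SEV1": {"impact": "critical user journeys broken for most users",
             "response": "dedicated IC + comms immediately, frequent exec updates"},
    "SEV2": {"impact": "major degradation or a large subset of users affected",
             "response": "dedicated IC, regular stakeholder updates"},
    "SEV3": {"impact": "minor degradation, workaround exists",
             "response": "owning team handles during business hours"},
}
```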
When incidents go badly, it’s usually not one thing. It’s the interaction of:
So treat IR as a designed system with an explicit lifecycle:
The incident lifecycle
Prepare → Detect → Triage → Mitigate → Recover → Learn
The mistake most teams make: everyone becomes a “doer” at once.
That feels productive… and it destroys coordination.
A good incident is boring because it has roles.
You can combine roles when the team is small, but you can’t skip the responsibilities.
In a live incident, “root cause” is often a trap.
You are operating under:
Your job is to restore acceptable service first.
Root cause comes later, when the system is stable enough to observe.
Mitigate first. Understand second.
Resilience is not “never failing.”
It’s failing in a way that’s bounded and recoverable.
Here are the patterns that matter most in incidents.
Timeouts everywhere
A call that never times out is an outage that never ends.
Retries with backoff + jitter
Retries are load generators. Without jitter, they synchronize into retry storms.
Circuit breakers + bulkheads
Stop calling the dying dependency. Isolate pools so one failure doesn’t starve everything.
Rate limiting + load shedding
Under overload, reject some work intentionally to protect critical paths.
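Retries are the easiest of these to get subtly wrong. A minimal sketch of exponential backoff with full jitter (the helper and the `TransientError` type are illustrative, not from a specific library):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for errors that are safe to retry (timeouts, 503s) on idempotent calls."""

def retry_with_backoff(call, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Retry an idempotent call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: sleep a random amount up to the exponential cap, so
            # many clients retrying at once don't synchronize into a storm.
            cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, cap))
```

Note what's bounded: the number of attempts, the delay, and the class of errors that get retried at all.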
Common mistake: timeout at the edge, none inside.
Result: the customer request times out at 30s… but backend work keeps running for minutes, piling up and consuming capacity.
A healthier pattern:
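Give the whole request one deadline at the edge and derive every inner call’s timeout from the time remaining, so backend work can’t outlive the customer. A minimal sketch, assuming the `requests` library; the endpoints and budget numbers are illustrative:

```python
import time
import requests  # illustrative: any HTTP client with a per-call timeout works

EDGE_BUDGET_SECONDS = 5.0  # illustrative total budget for the whole request

def call_with_deadline(url, deadline, per_hop_cap=1.0):
    """Give each dependency only the time remaining in the overall budget."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        # Fail fast: don't start work the customer has already given up on.
        raise TimeoutError("request budget exhausted")
    return requests.get(url, timeout=min(remaining, per_hop_cap))

def handle_request():
    deadline = time.monotonic() + EDGE_BUDGET_SECONDS
    profile = call_with_deadline("https://internal.example/profile", deadline)
    orders = call_with_deadline("https://internal.example/orders", deadline)
    return profile, orders
```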
Retries are good when:
Retries are catastrophic when:
If you suspect a retry storm, check:
Resilience is mostly resource isolation.
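Much of that isolation can be expressed as a bounded pool per dependency. A minimal bulkhead sketch (the pool size and dependency name are illustrative); rejecting when the pool is full doubles as simple load shedding:

```python
import threading

# One bounded compartment per dependency: if payments hangs, at most 10
# concurrent calls get stuck on it; other work keeps its threads and sockets.
PAYMENTS_BULKHEAD = threading.BoundedSemaphore(10)  # pool size is illustrative

def call_payments(do_call):
    # Waiting only briefly means a full pool rejects work instead of queuing it.
    if not PAYMENTS_BULKHEAD.acquire(timeout=0.05):
        raise RuntimeError("payments bulkhead full; shedding this call")
    try:
        return do_call()
    finally:
        PAYMENTS_BULKHEAD.release()
```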
Most mitigations are deploy operations:
If your deploys are slow, risky, or manual… your incident response will be slow and risky too.
If changes are easy to undo, you can move faster with less fear.
A slow, risky change made under pressure isn’t a mitigation. It’s an additional incident.
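A feature flag used as a kill switch is often the cheapest form of reversibility. A sketch, assuming a trivial env-var flag store (a real system would use a flag service; the names below are illustrative):

```python
import os

def flag_enabled(name, default=False):
    """Illustrative flag lookup; a real system would use a flag service."""
    return os.environ.get(f"FLAG_{name.upper()}", str(default)).lower() == "true"

def rank_with_stable_model(user_id):
    return ["stable", "ranking"]   # stand-in for the proven path

def rank_with_new_model(user_id):
    return ["new", "ranking"]      # stand-in for the risky new path

def recommendations(user_id):
    # The new path ships dark; turning the flag off *is* the rollback.
    if flag_enabled("new_ranking_model"):
        return rank_with_new_model(user_id)
    return rank_with_stable_model(user_id)
```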
Dashboards don’t save you when:
In incidents, you want fast answers to four questions:
SLIs, not vibes
Measure what users feel: error rate, latency, availability, freshness.
Percentiles, not averages
p95/p99 latency is where incidents hide.
Correlation IDs
Traces connect “the slow page” to the specific slow dependency.
Alerts as triggers
Alerting is for action. Dashboards are for exploration.
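To make “percentiles, not averages” concrete, here is a tiny sketch that turns raw request records into an error-rate SLI and a p95 (the records and field layout are illustrative):

```python
from statistics import quantiles

# Illustrative raw request records: (latency_ms, http_status)
records = [(120, 200), (95, 200), (2400, 500), (110, 200), (3100, 200)]

latencies = sorted(ms for ms, _ in records)
errors = sum(1 for _, status in records if status >= 500)

error_rate = errors / len(records)        # an SLI users actually feel
p95 = quantiles(latencies, n=100)[94]     # the tail, where incidents hide
print(f"error_rate={error_rate:.1%}  p95={p95:.0f}ms")
```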
Put these at the top, always:
This is the minimal workflow that keeps teams sane.
If there’s real customer impact or likely impact, declare it.
It’s cheaper to downgrade a false alarm than to catch up after 40 minutes of confusion.
Pick an IC. Pick a Comms Lead. Pick a Scribe.
Then let the responders debug without being dragged into coordination chaos.
Mitigation options are usually:
Measure before/after on:
If metrics didn’t move, the mitigation didn’t work—undo it and try another.
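That before/after check can be mechanical. A crude sketch (the window size and the 50% improvement threshold are arbitrary illustrations):

```python
def did_mitigation_work(error_rates_before, error_rates_after, min_drop=0.5):
    """Crude check: did the error rate drop meaningfully after the change?

    Both arguments are samples of the same SLI (e.g. one per minute) from
    equal windows before and after the mitigation timestamp.
    """
    before = sum(error_rates_before) / len(error_rates_before)
    after = sum(error_rates_after) / len(error_rates_after)
    return after <= before * (1 - min_drop)

# Example: five minutes before vs five minutes after a rollback
print(did_mitigation_work([0.12, 0.15, 0.14, 0.13, 0.16],
                          [0.02, 0.01, 0.02, 0.01, 0.02]))  # True
```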
Stakeholders don’t need constant spam.
They need updates on a schedule:
When the system is stable:
Don’t end incidents with silence.
Make the end explicit:
A timeline turns “we think it happened around then” into real learning.
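The scribe’s tooling can be trivial; the timestamps are what matter. A sketch (the events shown are invented examples):

```python
from datetime import datetime, timezone

timeline = []

def note(event):
    """Scribe helper: timestamp every observation, decision, and action."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    timeline.append((stamp, event))

note("p99 checkout latency alert fired")
note("IC declared SEV2; rollback of the latest deploy started")
note("error rate back under 0.1%; monitoring for 15 more minutes")
```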
“Blameless” does not mean “nobody is responsible.”
It means:
Good postmortems answer:
Prefer actions that change the system, not actions that “remind people”.
“Five whys” often pushes teams into a single root cause and stops early.
A better framing is contributing factors:
The most dangerous time to learn incident response is during a real incident.
Practice reduces panic and exposes design gaps.
A good “GameDay” is not chaos for chaos’ sake. It’s a controlled exercise that validates:
The goal is confidence, not destruction.
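Fault injection for a GameDay can start as a wrapper around a single dependency call (a sketch; the failure rate and added latency are illustrative knobs):

```python
import random
import time

def with_chaos(call, error_rate=0.1, added_latency_s=0.5):
    """Wrap one dependency call so a GameDay can inject controlled failures."""
    def chaotic(*args, **kwargs):
        time.sleep(added_latency_s)          # simulate a slow dependency
        if random.random() < error_rate:     # simulate intermittent errors
            raise TimeoutError("injected fault (GameDay)")
        return call(*args, **kwargs)
    return chaotic

# Example: exercise the timeout/retry path against a "flaky" dependency
flaky_get_profile = with_chaos(lambda user_id: {"id": user_id}, error_rate=0.3)
```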
If you want fewer 2AM disasters, this is the list to pressure-test.
Failure containment
Timeouts, retries (with jitter), circuit breakers, bulkheads, rate limits, load shedding.
Safe change
Fast rollback, canary deploys, feature flags, backward-compatible schema changes.
Observability
Golden signals, percentiles, traceability, deploy markers, meaningful alerts with runbooks.
Human system
Clear ownership, on-call rotation, escalation paths, practiced incident roles, comms rhythm.
And people burn out faster than systems.
Not necessarily.
The IC role is about coordination and decision-making under uncertainty.
The best IC is often someone who:
Reversibility.
Fast rollback + safe deploy patterns reduce both blast radius and recovery time. If your mitigation is risky, you’ll hesitate—and MTTR grows.
No. But practice is.
Start with GameDays, runbook drills, and small fault injection. The point is to validate that your alerting, mitigation paths, and comms work under pressure.
Treat alerting as a product with ownership.
This month was about surviving the bad day:
Next month is the capstone for 2022:
Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)
We’ll assemble the whole year into an operational reference architecture you can defend—plus the decisions (and tradeoffs) that made it coherent.