Nov 27, 2022 - 16 MIN READ

Incident Response and Resilience: Designing for Failure, Not Hope

Most teams “have on-call”. Fewer teams have resilience. This is a practical blueprint for designing systems, teams, and workflows that respond fast, recover safely, and learn without blame.

Axel Domingues

The first time you run a real incident, you realize something uncomfortable:

Your architecture doesn’t fail when it’s wrong. It fails when it’s surprised.

And production is very good at surprise:

  • the slow query that “never happens” suddenly happens 10,000 times
  • the retry you forgot about turns a small blip into a thundering herd
  • a deploy goes out on Friday because the pipeline was green
  • a dependency degrades in a way nobody thought to alert on

This post is not about heroics.

It’s about designing the system (and the team) so that when failure happens:

  • detection is fast
  • response is calm
  • mitigation is safe
  • recovery is predictable
  • learning improves the system instead of burning people out

Production discipline as architecture.

Incident response is where every design decision shows up—timeouts, retries, deploy strategy, observability, ownership, and the human workflow around all of it.

The real goal

Reduce time-to-mitigate and avoid repeat failures—not “avoid all incidents”.

The mindset shift

Resilience isn’t a feature you add later.
It’s a property you design into the system.

What we’ll build

A practical incident model: roles, playbooks, alerting, and safe mitigation patterns.

What “good” looks like

Fast detection, small blast radius, reversible changes, and blameless learning.


Incidents Are Product Events, Not Engineering Shame

An incident is any unplanned event that:

  • harms customers
  • threatens data/security
  • violates an SLO (or will soon)
  • forces the team into emergency mode

Notice what’s missing: the word “outage”.

Many incidents are not full downtime. Most real pain is partial:

  • elevated latency (especially tail latency)
  • increased error rate for specific endpoints
  • degraded dependencies
  • stuck queues and backlogs
  • data drift and silent corruption

Treat “degraded” as a first-class state.

Degradation is where you can still make safe decisions. Full outages are where you panic.

Severity is about impact, not drama

A simple severity rubric (adapt to your org):

  • SEV0: active security/data integrity threat, or critical business functions down globally
  • SEV1: major customer impact, widespread outage/degradation
  • SEV2: partial customer impact, high error rate/latency on key paths
  • SEV3: limited impact (internal tooling, a single customer, or a single region)

The point isn’t perfect classification.

It’s aligning the team on how big the response should be.


Incident Response Is a System: People + Process + Architecture

When incidents go badly, it’s usually not one thing. It’s the interaction of:

  • ambiguous ownership (“who decides?”)
  • incomplete instrumentation (“we don’t know what’s happening”)
  • unsafe mitigation (“we can’t roll back cleanly”)
  • fragile dependencies (“one failure becomes many”)
  • missing practice (“we’ve never done this under pressure”)

So treat IR as a designed system with an explicit lifecycle:

The incident lifecycle

Prepare → Detect → Triage → Mitigate → Recover → Learn


The Fastest Way to Lower MTTR Is Role Clarity

The mistake most teams make: everyone becomes a “doer” at once.

That feels productive… and it destroys coordination.

A good incident is boring because it has roles:

  • Incident Commander (IC): owns decisions and keeps the response moving
  • Comms Lead: handles stakeholder updates so responders don’t have to
  • Scribe: keeps a timestamped timeline of observations, decisions, and mitigations
  • Responders: investigate, diagnose, and apply mitigations

Small team? You can combine roles.

But you can’t skip the responsibilities.


The Golden Rule: Stop the Bleeding Before You Find the Root Cause

In a live incident, “root cause” is often a trap.

You are operating under:

  • incomplete information
  • shifting conditions
  • user behavior changes
  • cascading effects you don’t fully understand yet

Your job is to restore acceptable service first.

Root cause comes later, when the system is stable enough to observe.

If you keep debugging while customers are still hurting, you’re optimizing for curiosity instead of impact.

Mitigate first. Understand second.


Designing for Failure Starts with Containment

Resilience is not “never failing.”

It’s failing in a way that’s bounded and recoverable.

Here are the patterns that matter most in incidents.

Timeouts everywhere

A call that never times out is an outage that never ends.

Retries with backoff + jitter

Retries are load generators. Without jitter, they synchronize into retry storms.

Circuit breakers + bulkheads

Stop calling the dying dependency. Isolate pools so one failure doesn’t starve everything.

Rate limiting + load shedding

Under overload, reject some work intentionally to protect critical paths.
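
What does “reject some work intentionally” look like in code? A minimal sketch, assuming an async handler; `MAX_IN_FLIGHT` and `CRITICAL_PATHS` are illustrative names, not from any particular framework:

```python
# Minimal load-shedding sketch: cap concurrent work, protect critical paths.
MAX_IN_FLIGHT = 200                       # tune from load tests, not guesses
CRITICAL_PATHS = {"/checkout", "/login"}  # paths we refuse to shed

in_flight = 0

async def handle(path: str, do_work):
    """Reject non-critical work early when the service is saturated."""
    global in_flight
    if in_flight >= MAX_IN_FLIGHT and path not in CRITICAL_PATHS:
        # A fast 503 is kinder than a slow timeout: the client can back off.
        return 503, "shedding load, retry later"
    in_flight += 1
    try:
        return 200, await do_work()
    finally:
        in_flight -= 1
```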

Timeouts: the hidden SLO killer

Common mistake: timeout at the edge, none inside.

Result: the customer request times out at 30s… but backend work keeps running for minutes, piling up and consuming capacity.

A healthier pattern:

  • short timeouts between internal calls (few seconds)
  • enforce a “request budget” (the total time you’re willing to spend)
  • cancel propagation where possible (don’t keep doing work after the client gave up)
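
A minimal sketch of that pattern, assuming asyncio; the budget values, the dependency names, and `call_dependency` are invented for illustration. `asyncio.wait_for` cancels the awaited call on timeout, which gives you the cancel-propagation piece:

```python
import asyncio
import time

# Illustrative numbers: total time we are willing to spend on one request,
# and a per-call ceiling for each internal dependency.
REQUEST_BUDGET_S = 2.0
PER_CALL_TIMEOUT_S = 0.5

async def call_dependency(name: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real network call
    return f"{name}: ok"

async def handle_request() -> list[str]:
    deadline = time.monotonic() + REQUEST_BUDGET_S
    results = []
    for dep in ("auth", "catalog", "pricing"):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("request budget exhausted")
        # Each call gets the smaller of its own timeout and whatever is left
        # of the overall budget; on timeout, the awaited call is cancelled.
        results.append(await asyncio.wait_for(
            call_dependency(dep), timeout=min(PER_CALL_TIMEOUT_S, remaining)))
    return results

# asyncio.run(handle_request())
```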

Retries: success and amplification

Retries are good when:

  • the failure is transient
  • the operation is idempotent
  • you have backoff + jitter
  • you stop retrying when the system is clearly unhealthy

Retries are catastrophic when:

  • the failure is persistent (dependency is degraded)
  • the operation is not idempotent (you duplicate side effects)
  • all clients retry at the same interval (synchronized flood)
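
Put together, a safe retry loop looks roughly like this. It’s a sketch: `TransientError` stands in for whatever your client raises on timeouts and 5xx responses, and it assumes the wrapped operation is idempotent:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for retryable failures: timeouts, 503s, connection resets."""

def retry_with_jitter(operation, max_attempts: int = 4,
                      base_delay_s: float = 0.2, max_delay_s: float = 5.0):
    """Retry an *idempotent* operation with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up; let the caller degrade or fail fast
            # Full jitter: sleep a random amount up to the exponential cap,
            # so thousands of clients don't retry in lockstep.
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```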

Circuit breakers + bulkheads: reduce cascade

  • Circuit breaker: after N failures, stop calling the dependency for a cool-down window.
  • Bulkhead: isolate resource pools (threads, connections, queues) so one failing dependency cannot starve unrelated work.

Resilience is mostly resource isolation.
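
Both patterns fit in a few dozen lines. A sketch, with assumed thresholds and a made-up `payments` dependency; a production version would also want half-open probe limits and metrics:

```python
import threading
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cool-down."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Bulkhead: a bounded pool per dependency, so a slow dependency can only
# tie up its own slots, never the whole service.
payments_pool = threading.BoundedSemaphore(10)
payments_breaker = CircuitBreaker()

def call_payments(operation):
    if not payments_pool.acquire(blocking=False):
        raise RuntimeError("payments pool exhausted: shedding call")
    try:
        return payments_breaker.call(operation)
    finally:
        payments_pool.release()
```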


Safe Mitigation Is a Deployment Capability, Not a Runtime Trick

Most mitigations are deploy operations:

  • roll back
  • roll forward with a hotfix
  • flip a feature flag
  • disable a bad code path
  • route traffic away from a region or dependency

If your deploys are slow, risky, or manual… your incident response will be slow and risky too.

Your best incident tool is reversibility.

If changes are easy to undo, you can move faster with less fear.

The “reversible changes” toolbox

  • feature flags / kill switches for risky paths
  • gradual rollouts (canary, percentage-based)
  • blue/green where rollback is a traffic switch
  • config changes that are validated and versioned
  • database safety: backwards-compatible migrations and dual-write strategies

A rollback that breaks because the database schema already changed is not a rollback.

It’s an additional incident.
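
A kill switch can be as small as a flag lookup against versioned config. This sketch invents the file path, flag names, and `render_homepage` purely for illustration; the property that matters is that flipping the flag is the mitigation, with no deploy in the critical path:

```python
import json
import pathlib

# Hypothetical flags file, versioned and deployed like any other config.
FLAGS_PATH = pathlib.Path("/etc/myapp/flags.json")
DEFAULTS = {"recommendations_enabled": True, "new_checkout_flow": False}

def flag(name: str) -> bool:
    """Read a kill switch; fall back to a safe default if the file is missing or bad."""
    try:
        flags = json.loads(FLAGS_PATH.read_text())
        return bool(flags.get(name, DEFAULTS.get(name, False)))
    except (OSError, ValueError):
        return DEFAULTS.get(name, False)

def render_homepage(user_id: str) -> dict:
    page = {"user": user_id, "items": ["..."]}
    # The risky path sits behind a kill switch: during an incident,
    # flipping the flag is the mitigation, not an emergency deploy.
    if flag("recommendations_enabled"):
        page["recommendations"] = ["..."]
    return page
```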


Observability That Helps During Incidents

Dashboards don’t save you when:

  • they show averages
  • they are not tied to user impact
  • they don’t tell you what to do next

In incidents, you want fast answers to four questions:

  1. Is the customer impacted? How?
  2. Where is the bottleneck / failure domain?
  3. Is it getting better or worse?
  4. What mitigation moved the needle?

SLIs, not vibes

Measure what users feel: error rate, latency, availability, freshness.

Percentiles, not averages

p95/p99 latency is where incidents hide.

Correlation IDs

Traces connect “the slow page” to the specific slow dependency.

Alerts as triggers

Alerting is for action. Dashboards are for exploration.
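
To make those four ideas concrete, here’s a small sketch that turns a window of raw requests into the numbers you stare at during an incident: error rate plus p95/p99. The `RequestRecord` shape is an assumption, and the percentile math is simple nearest-rank, which is good enough for triage:

```python
import math
from dataclasses import dataclass

@dataclass
class RequestRecord:   # assumed shape: one entry per request in the window
    latency_ms: float
    status: int

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def window_slis(records: list[RequestRecord]) -> dict:
    """Summarize a window of requests into the numbers users actually feel."""
    latencies = [r.latency_ms for r in records]
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "error_rate": errors / len(records),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }

# During mitigation, compare window_slis(before) to window_slis(after):
# if the numbers didn't move, the mitigation didn't work.
```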


How to Run the Incident (A Practical Playbook)

This is the minimal workflow that keeps teams sane.

Declare the incident early

If there’s real customer impact or likely impact, declare it.

It’s cheaper to downgrade a false alarm than to catch up after 40 minutes of confusion.

Assign roles immediately

Pick an IC. Pick a Comms Lead. Pick a Scribe.

Then let the responders debug without being dragged into coordination chaos.

Stabilize the system

Mitigation options are usually:

  • rollback or disable the recent change
  • shed load / rate limit
  • isolate the failing dependency
  • reduce blast radius (regional failover, partial feature disable)

Verify improvement with SLIs

Measure before/after on:

  • error rate
  • p95/p99 latency
  • backlog depth
  • customer-facing symptoms

If metrics didn’t move, the mitigation didn’t work—undo it and try another.

Communicate in a predictable rhythm

Stakeholders don’t need constant spam.

They need updates on a schedule:

  • current impact
  • what’s being tried
  • next update time

Transition to recovery

When the system is stable:

  • remove temporary mitigations carefully
  • drain backlogs safely
  • validate data integrity
  • keep monitoring for regressions
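
Draining a backlog is itself a change, so pace it. A sketch, assuming a queue-like `backlog` and a `healthy()` callback that checks your SLIs; the batch size and pauses are illustrative:

```python
import time

def drain_backlog(backlog, process_item, healthy,
                  batch_size: int = 50, pause_s: float = 1.0):
    """Drain in small batches, and back off whenever the system looks unhealthy."""
    while not backlog.empty():
        if not healthy():
            # Recovery is not a race: pause while SLIs are still degraded.
            time.sleep(5 * pause_s)
            continue
        for _ in range(batch_size):
            if backlog.empty():
                break
            process_item(backlog.get())
        # Throttle so the drain itself can't become a second incident.
        time.sleep(pause_s)
```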

Close with a clear “all clear”

Don’t end incidents with silence.

Make the end explicit:

  • what improved
  • what remains risky
  • what follow-ups are coming

The most valuable habit in incident response is timestamping decisions.

A timeline turns “we think it happened around then” into real learning.
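
A scribe doesn’t need tooling to start. Even a helper this small beats reconstructing the timeline from memory afterwards; where the entries end up (chat channel, shared doc) is up to you:

```python
from datetime import datetime, timezone

timeline: list[str] = []

def log_event(note: str) -> None:
    """Scribe helper: timestamp every observation, decision, and mitigation."""
    timeline.append(f"{datetime.now(timezone.utc).isoformat()}  {note}")

# log_event("error rate on /checkout jumped to 8%; suspect the 13:55 deploy")
# log_event("decision: roll back the 13:55 deploy rather than hotfix")
```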


Postmortems: Blameless, Not Aimless

“Blameless” does not mean “nobody is responsible.”

It means:

  • we assume people acted reasonably with the information they had
  • we focus on system conditions that made the failure possible
  • we create corrective actions that reduce the chance of recurrence

Good postmortems answer:

  • what was the impact?
  • how did we detect it?
  • what was the timeline?
  • what were the contributing factors?
  • what mitigations worked and didn’t?
  • what will we change (with owners and dates)?

Resilience Is Practice: GameDays and Controlled Failure

The most dangerous time to learn incident response is during a real incident.

Practice reduces panic and exposes design gaps.

A good “GameDay” is not chaos for chaos’ sake. It’s a controlled exercise that validates:

  • alerting (did the right people get paged?)
  • diagnosis speed (did we know where to look?)
  • mitigation safety (could we roll back/disable?)
  • communication (were stakeholders informed?)
  • recovery (did we drain backlogs safely?)

Start small: fail a non-critical dependency in staging, then in production with a tiny percentage of traffic.

The goal is confidence, not destruction.


The Architect’s Incident Readiness Checklist

If you want fewer 2AM disasters, this is the list to pressure-test.

Failure containment

Timeouts, retries (with jitter), circuit breakers, bulkheads, rate limits, load shedding.

Safe change

Fast rollback, canary deploys, feature flags, backward-compatible schema changes.

Observability

Golden signals, percentiles, traceability, deploy markers, meaningful alerts with runbooks.

Human system

Clear ownership, on-call rotation, escalation paths, practiced incident roles, comms rhythm.

If any one of these is missing, incidents become people problems.

And people burn out faster than systems.


Resources

Google SRE Book (free online)

A foundational treatment of SLOs, alerting philosophy, toil reduction, and production practice.

Incident Command System (ICS) Basics

A clear mental model for roles and coordination under pressure (adapt it to engineering).


What’s Next

This month was about surviving the bad day:

  • detecting fast
  • coordinating calmly
  • mitigating safely
  • and learning without blame

Next month is the capstone for 2022:

Capstone: Build a System That Can Survive (Reference Architecture + Decision Log)

We’ll assemble the whole year into an operational reference architecture you can defend—plus the decisions (and tradeoffs) that made it coherent.
