
Observability isn’t “add dashboards.” It’s designing feedback loops you can trust: signals that answer real questions, alerts tied to user pain, and tooling that helps you debug under pressure.
Axel Domingues
When teams say “we need observability,” they often mean:
And the most common “solution” is… buying tools and creating dashboards.
That’s not observability.
That’s new screens to be confused by.
Real observability is a discipline:
Design the signals that let you detect, explain, and fix production behavior fast — without guessing.
In 2022, this matters because systems aren’t simple anymore:
So this month is about a practical, non-folklore mental model:
Logs + Metrics + Traces are not “three tools.”
They’re three signal types that answer different questions — and SLOs decide which questions matter.
What you’re building
A feedback system: signals → decisions → fixes → learning.
The failure mode to avoid
Dashboard theater: lots of graphs, no answers under stress.
The architect’s question
What must we know to operate this system safely?
The outcome metric
Mean time to detect + explain + mitigate (and fewer false pages).
Observability becomes much simpler if you treat it like a pipeline:

The trick is to stop starting with “what can we measure?”
Start with:
A good signal is:
Cardinality is how many unique label combinations a metric can produce.
High-cardinality labels (user_id, request_id, full URL, email) can turn metrics into an unbounded data firehose.
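As a rough illustration (using Python's prometheus_client here — any metrics client behaves the same way), compare a bounded label set with an unbounded one; the names are illustrative:

```python
from prometheus_client import Counter

# Bounded cardinality: a handful of endpoints x a few status classes.
http_requests = Counter(
    "http_requests_total",
    "HTTP requests served",
    ["endpoint", "status_class"],
)
http_requests.labels(endpoint="/checkout", status_class="5xx").inc()

# Unbounded cardinality: every user and request mints a new time series.
# This is the firehose described above — keep ids in traces/logs, not metric labels.
# bad_requests = Counter(
#     "http_requests_by_user_total",
#     "HTTP requests per user",
#     ["user_id", "request_id", "url"],
# )
```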
Most teams misuse observability because they treat these as interchangeable.
They’re not.
They’re specialized tools for specialized questions.
Metrics answer “how much?”
Rates, ratios, percentiles, saturation. Great for alerting and trends.
Traces answer “where did the time go?”
A request’s path through services. Great for distributed bottlenecks.
Logs answer “what exactly happened?”
Events and context. Great for forensics, debugging, and audits.
SLOs answer “should we care?”
They define user-impact thresholds and keep you from paging on trivia.
If you don’t define reliability as experienced by users, you end up alerting on internal noise:
SLOs are how you separate user pain from system gossip.
A practical starting set:
Example: checkout / purchase / quote bind / claim submit. Everything else can be “nice-to-have” until that journey is operable.
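To make that concrete, here is a toy calculation for a hypothetical checkout journey (the target, traffic, and failure numbers are made up):

```python
# Availability SLI and error budget for a hypothetical "checkout" journey
# over a rolling window. All numbers below are assumptions.
slo_target = 0.999          # 99.9% of checkout requests should succeed
total_requests = 1_200_000  # hypothetical traffic over the window
failed_requests = 850       # hypothetical failures over the same window

sli = (total_requests - failed_requests) / total_requests  # what users experienced
error_budget = 1 - slo_target                              # allowed failure ratio
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"SLI: {sli:.5f}")                            # 0.99929
print(f"Error budget used: {budget_consumed:.0%}")  # ~71%
```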
You want a small set of universal lenses.
Google popularized the “Four Golden Signals” framing: latency, traffic, errors, saturation.
In practice, two variants are especially useful:
Metrics are the best default for:
If your stack gives you histograms, use them. They’re how you get percentiles you can trust.
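A rough sketch with prometheus_client (any histogram-capable client works similarly); the bucket edges and the handler are illustrative:

```python
import time
from prometheus_client import Histogram

request_latency = Histogram(
    "checkout_request_duration_seconds",
    "Checkout request latency",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),  # illustrative bucket edges
)

def handle_checkout():
    """Hypothetical handler standing in for real work."""
    time.sleep(0.05)

# Observe every request; the backend derives p95/p99 from the buckets.
with request_latency.labels(endpoint="/checkout").time():
    handle_checkout()
```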
Most outages are “tail problems”: the median looks fine while p99 melts. The users who complain are living in the tail.
For each service + endpoint:
Traces are how you answer:
Traces shine when you have:
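A minimal tracing sketch using the OpenTelemetry Python API (span names and attributes are illustrative; an SDK and exporter still have to be configured for spans to go anywhere):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def checkout(order_id: str) -> None:
    # One span per request, child spans per downstream dependency.
    with tracer.start_as_current_span("checkout") as span:
        # High-cardinality context belongs here, not in metric labels.
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment provider here
        with tracer.start_as_current_span("reserve-stock"):
            pass  # call the inventory service here

checkout("ord_123")  # hypothetical order id
```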
Logs are most useful when they’re structured and intentional.
A good log is an event:
A bad log is “printf debugging in prod.”
Correlation is the difference between “fast answer” and “guessing.”
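A rough sketch of a structured event that carries the current trace id (stdlib logging plus the OpenTelemetry API; the field names are illustrative):

```python
import json
import logging
from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def log_event(event: str, **fields):
    # Attach the active trace id so this log line can be joined to its trace.
    ctx = trace.get_current_span().get_span_context()
    record = {"event": event, "trace_id": format(ctx.trace_id, "032x"), **fields}
    log.info(json.dumps(record))

log_event("payment.declined", order_id="ord_123", reason="insufficient_funds", amount=129.90)
```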
If you page humans for non-actionable noise, they will learn to ignore pages.
That’s not a people problem. That’s an architecture problem.
A high-quality alert:
Bad alert
“CPU at 85%”
(no user impact, no action, no context)
Good alert
“SLO burn rate indicates 30% of error budget will be consumed in 1 hour for checkout.”
(actionable, user-impact, time-bounded)
Threshold alerts are fragile.
Burn rate alerts tell you:
That is the difference between “panic” and “control.”
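A toy burn-rate calculation, assuming a 30-day window and a 99.9% SLO (the observed error rate is made up):

```python
slo_target = 0.999
window_hours = 30 * 24

observed_error_rate = 0.02     # hypothetical: 2% of checkout requests failing right now
error_budget = 1 - slo_target  # 0.1% allowed over the window

burn_rate = observed_error_rate / error_budget  # 20x the sustainable rate
hours_to_exhaustion = window_hours / burn_rate  # 36 hours at this pace

print(f"Burn rate: {burn_rate:.0f}x")
print(f"Budget gone in ~{hours_to_exhaustion:.0f} hours if nothing changes")
```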
When the page fires, you need a deterministic flow.
This is where most teams lose months.
Symptoms:
Fix:
Symptoms:
Fix:
Symptoms:
Fix:
Symptoms:
Fix:
If you want a crisp definition of “operable,” here it is:
The point of observability
Not prettier dashboards.
Faster truth under pressure.
Google SRE — Service Level Objectives (SLIs/SLOs)
A practical entry point for SLO thinking and how to make alerting reflect user impact.
OpenTelemetry (spec + docs)
The modern standard for traces, metrics, logs, and context propagation across services.
You need all three eventually, but not all at once.
A practical sequencing:
The priority is: detect user impact fast, then debug fast.
For user experience, p99 often reflects reality better (tail pain).
For stability, p95 can be useful as an early signal.
Best practice: define the percentile in the SLO (what users feel), and alert on burn rate, not raw latency thresholds.
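A small synthetic example of why the median hides tail pain (the latency distribution is made up):

```python
import random

random.seed(7)
# 98% of requests are fast; 2% hit a slow dependency (hypothetical distribution).
latencies_ms = [
    random.gauss(120, 20) if random.random() < 0.98 else random.gauss(2500, 400)
    for _ in range(10_000)
]

ranked = sorted(latencies_ms)
p50 = ranked[int(0.50 * len(ranked))]
p95 = ranked[int(0.95 * len(ranked))]
p99 = ranked[int(0.99 * len(ranked))]

print(f"p50 ≈ {p50:.0f} ms, p95 ≈ {p95:.0f} ms, p99 ≈ {p99:.0f} ms")
# p50 sits near 120 ms, p95 is still healthy, p99 is dominated by the slow 2%.
```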
Treating tracing as “free.”
Tracing is a production feature with:
If you don’t manage it, you’ll either drown in data or turn it off during the first bill shock.
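One way to manage the volume is head-based sampling; a sketch with the OpenTelemetry Python SDK (the 1% ratio is an arbitrary example, not a recommendation):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1% of root traces; child spans follow their parent's decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)
```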
March was about designing signals that produce truth you can act on.
But observability alone doesn’t keep you safe.
The next month is the other half of operational architecture:
Security for Builders: Threat Modeling and Secure-by-Default Systems
Because the best incident is the one you prevented — and the second best is the one you can detect and contain quickly.