
Reliability is non-negotiable, but “cost” is where architecture meets physics. This month is a practical playbook: how to model cost, allocate it, and design guardrails so your system scales without surprising invoices.
Axel Domingues
A system that “works” but bleeds money is still broken.
Not because finance is grumpy — but because cost shapes what you can safely ship.
In practice, cost is a constraint just like latency or correctness.
So October is about making cost architectural instead of accidental.
This is the architect’s perspective:
- model the cost curve
- make it observable
- and put guardrails where systems tend to explode
- The goal this month: make cost predictable by design (unit costs, ownership, and guardrails).
- The mindset shift: cost is not an afterthought — it’s a runtime property of your architecture.
- What I’m measuring: unit cost ($/request, $/order, $/active user), plus the top 5 cost drivers.
- What “good” looks like: you can answer “If traffic doubles, what happens to the bill — and why?”
Cost spikes rarely come from “the cloud being expensive.”
They come from unbounded behavior: retries, scans, cardinality, egress, retention.
Cost is a tail risk. Like latency, the mean is boring — the spikes are what hurt.
And the spikes tend to land exactly when you’re busy and scared, which is the worst moment to touch your architecture.
- Unit cost: a cost expressed per business unit ($/order, $/1k requests, $/active user-month, $/GB processed). If you don’t have a unit cost, you don’t have a cost model — you have a receipt.
- Showback: attribute spend to teams/services for visibility.
- Chargeback: actually bill teams (or budgets) for their spend. Start with showback. Chargeback is an organizational choice, not a technical prerequisite.
- Fixed: baseline you pay even at zero traffic (always-on services).
- Variable: scales with usage (requests, GB, messages).
- Accidental: spend caused by inefficiency or mistakes (debug logs, runaway queries, retries).
- Cardinality: how many unique values a metric label can take. High-cardinality labels (userId, requestId, URL with parameters) can blow up metrics/log costs.
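To see why cardinality matters for the bill, multiply the distinct values of each label: the worst case is the product. A tiny sketch, with purely illustrative numbers:

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric:
    the product of each label's distinct-value count."""
    return prod(label_cardinalities.values())

# Bounded labels keep the metric cheap...
bounded = {"service": 12, "endpoint": 40, "status": 5}
print(series_count(bounded))            # 2,400 series

# ...one unbounded label (user_id) multiplies everything that was there before.
with_user_id = {**bounded, "user_id": 100_000}
print(series_count(with_user_id))       # 240,000,000 series
```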
You don’t need perfect accounting.
You need a model that lets you reason about change.
I use a simple decomposition:
Here’s the key idea:
If you can express cost as “base + (units × unit_cost) + risk,” you can design.
- Baseline: always-on resources (minimum instances, databases, NAT gateways, control planes).
- Unit cost: per request/order/user; this is the number product teams can reason about.
- Growth curve: linear? Step function? Superlinear? Depends on architecture.
- Blast radius: unbounded behavior (retries, scans, cardinality, egress, retention).
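Here is a minimal sketch of that decomposition in code, so “what happens if traffic doubles?” becomes a one-liner. All numbers are illustrative; the point is the shape of the curve, not the values.

```python
from dataclasses import dataclass

@dataclass
class CostModel:
    baseline: float    # $/month at zero traffic (always-on resources)
    unit_cost: float   # $ per business unit, e.g. per 1k requests
    risk: float        # $/month allowance for unbounded behavior (blast radius)

    def monthly(self, units: float) -> float:
        # base + (units x unit_cost) + risk
        return self.baseline + units * self.unit_cost + self.risk

# Illustrative only: $2k always-on, $0.40 per 1k requests, $1k risk allowance.
model = CostModel(baseline=2_000, unit_cost=0.40, risk=1_000)
today = model.monthly(units=500_000)
doubled = model.monthly(units=1_000_000)
print(f"today ${today:,.0f}, doubled ${doubled:,.0f}, ratio {doubled / today:.2f}x")
```

Notice that doubling units does not quite double the bill: the baseline absorbs part of the growth, while the blast radius can add far more than the model shows.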
Most cloud bills collapse into four big buckets, and you can’t optimize what you can’t name.
So your first FinOps move is boring but powerful: build a “top spenders” map by service and by environment.
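A minimal sketch of that map, assuming you can export billing data with service and environment tags to CSV (column names here are illustrative and vary by provider and tagging scheme):

```python
import pandas as pd

# Assumed export format; real column names depend on your cloud provider and tags.
billing = pd.read_csv("cost_export.csv")   # columns: service, environment, cost_usd

top_spenders = (
    billing
    .groupby(["service", "environment"], as_index=False)["cost_usd"]
    .sum()
    .sort_values("cost_usd", ascending=False)
)

print(top_spenders.head(10))               # the "top spenders" map: who to talk to first
```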
Here are the decisions that move your cost curve structurally (not cosmetically).
Every synchronous hop adds cost.
The cost curve becomes more sensitive to traffic because one user action triggers many internal actions.
Mitigation patterns:
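One common pattern, sketched here as an illustration rather than a prescription, is collapsing repeated internal reads behind a short-lived cache, so a burst of user actions stops re-triggering the same downstream calls:

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Cache results for a short window so repeated internal calls
    hit memory instead of a downstream service."""
    def decorator(fn):
        store: dict = {}

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and now - hit[1] < ttl_seconds:
                return hit[0]
            value = fn(*args)
            store[args] = (value, now)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=30)
def get_product(product_id: str) -> dict:
    # Placeholder for a downstream call that costs compute, network, and retries.
    return {"id": product_id}
```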
Retries are a cost multiplier.
A “small” error rate can become a large cost multiplier: three retries per hop across a deep call chain turns one failing dependency into several times the traffic, and several times the bill.
Mitigation patterns:
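One common guardrail, again as an illustrative sketch, is a retry budget: cap attempts, back off with jitter, and share a budget so a failing dependency can’t multiply traffic across the whole service.

```python
import random
import time

def call_with_retry_budget(fn, max_attempts=3, base_delay=0.2, budget=None):
    """Retry with capped attempts and jittered exponential backoff.
    `budget` is a shared dict ({"remaining": N}) so a burst of failures
    can't amplify traffic across the whole service."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if budget is not None:
                if budget["remaining"] <= 0:
                    break               # budget spent: fail fast instead of amplifying
                budget["remaining"] -= 1
            if attempt == max_attempts - 1:
                break
            # Jittered exponential backoff: up to 0.2s, 0.4s, 0.8s, ...
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise last_exc

# Example: a budget refilled elsewhere (e.g. per minute), shared by all callers.
shared_budget = {"remaining": 100}
```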
Cost spikes love unbounded data access: full scans, runaway queries, and analytics pointed straight at production.
Mitigation patterns:
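One guardrail that pairs well with the “query budgets” fix later in this post is a hard statement timeout plus an explicit row limit on anything that reads from production. A sketch for PostgreSQL via psycopg2 (connection string, table, and columns are illustrative):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=readonly")

with conn.cursor() as cur:
    # Hard cap on how long any ad-hoc read may run against production.
    cur.execute("SET statement_timeout = '5s'")
    # Always bound the result set; unbounded scans are a cost (and latency) multiplier.
    cur.execute("SELECT id, total FROM orders ORDER BY created_at DESC LIMIT 1000")
    rows = cur.fetchall()
```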
Logging and metrics are not free — they are managed services with pricing curves.
Common traps: debug logs left on in production, high-cardinality labels, and keeping everything forever.
Mitigation patterns: sampling, retention tiers, and enforced log levels.
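As a sketch of the sampling pattern, here is a logging filter that keeps every warning and error but only a fraction of INFO/DEBUG records, so log volume stops scaling one-to-one with traffic (the rate and logger names are illustrative):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep every WARNING and above; keep only a fraction of lower-level records.
    The sampling rate becomes a deliberate knob instead of an accident of traffic."""
    def __init__(self, sample_rate: float = 0.05):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.05))   # ship ~5% of INFO logs
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```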
If you accidentally make your architecture “egress-heavy,” you’ve created a tax that scales with success.
Common traps: chatty cross-region calls, replicating data between regions by default, and serving large payloads straight from origin.
Mitigation patterns: keep chatty paths inside a region, cache or offload large responses close to users, and make replication a deliberate choice.
FinOps is a practice, not a one-off project.
What works is an operating loop:
Pick 1–3 business units your product cares about: $/order, $/1k requests, $/active user-month.
Write them down and stop debating. You can refine later.
You need two ingredients: spend attributed to the service (tags, labels, or account boundaries) and a count of business units over the same period.
Then compute unit cost: attributed spend divided by units.
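The arithmetic is deliberately boring (numbers are illustrative):

```python
# Unit cost = spend attributed to the service / business units served, same period.
monthly_spend_usd = 46_800          # from your showback / cost export
orders_in_month = 1_250_000         # from product analytics

unit_cost = monthly_spend_usd / orders_in_month
print(f"${unit_cost:.4f} per order")   # ~$0.0374 per order
```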
Every major cost center needs an owner: the team that runs a service owns its compute and storage, and platform teams own shared spend like observability and networking.
No owner = no optimization.
Don’t “save 5%” by turning knobs.
Focus on cost multipliers: retries, scans, cardinality, egress, retention.
A healthy system has cost guardrails built in.
The point isn’t a one-off cleanup; it’s preventing regressions.
You can embed cost safety into the system the same way you embed reliability safety.
Absolute thresholds are brittle.
What you want is alerting relative to unit cost and rate of change, so that growing traffic doesn’t page you but an efficiency regression does.
If your SLO is healthy, you can sample aggressively. If it’s degraded, sample less and increase detail temporarily.
That keeps observability useful and affordable.
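A minimal sketch of that idea: derive the sampling rate from how much error budget is left. The thresholds and rates below are illustrative, not a recommendation.

```python
def sampling_rate(error_budget_remaining: float,
                  healthy_rate: float = 0.01,
                  degraded_rate: float = 0.25) -> float:
    """Pick a trace/log sampling rate from SLO health.

    error_budget_remaining: fraction of the error budget left (1.0 = untouched).
    Healthy SLO  -> keep only a small sample (cheap, low detail).
    Degraded SLO -> keep more detail temporarily, because you're debugging anyway.
    """
    if error_budget_remaining > 0.5:
        return healthy_rate
    if error_budget_remaining > 0.2:
        return (healthy_rate + degraded_rate) / 2
    return degraded_rate
```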
Non-prod is where cost discipline goes to die.
Rules that pay for themselves:
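One rule many teams adopt, shown here as an illustration (the tag scheme and region are assumptions), is stopping non-prod compute outside working hours. A sketch with boto3:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")   # region illustrative

# Find running instances tagged as non-prod (the tagging scheme is an assumption).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:env", "Values": ["dev", "staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
]

if instance_ids:
    # Run this on a schedule every evening; start the instances again each morning.
    ec2.stop_instances(InstanceIds=instance_ids)
```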
When I’m reviewing an architecture proposal, these are the claims I push back on, and the fixes I ask for.
“We’ll optimize later”
Later is when you’re busy and scared.
Fix: design the cost model now (baseline + unit cost + blast radius).
“It’s just logging”
Logs scale with traffic, retries, and payload size.
Fix: sampling, retention tiers, and enforced log levels.
“The database can handle it”
Databases handle it… until they don’t, and scaling is a step function.
Fix: query budgets, read models, caching, and OLAP separation.
“We need multi-region now”
Multi-region multiplies complexity and often cost (especially traffic replication).
Fix: make the reliability goal explicit, then choose the cheapest topology that meets it.
- FinOps Foundation / FinOps Framework: a shared language for the practice (inform → optimize → operate), plus tooling and organizational patterns.
- Google Cloud Pricing Calculator: sanity-check “always-on” vs “serverless” and data/egress-heavy designs.
Is FinOps just about cutting costs?
No. Cutting cost is sometimes a result, but the real win is predictability.
FinOps is about aligning engineering and finance around a shared cost model: unit costs, ownership, and guardrails.
The goal is to scale without financial surprises.
Do we need chargeback to start?
Not at first.
Start with showback: attribute spend to teams and services so everyone can see the numbers.
Chargeback is an organizational lever. Visibility is the architectural prerequisite.
Where do the biggest wins come from?
Almost always from multipliers: retries, unbounded scans, metric cardinality, egress, and retention.
These wins are both cost and reliability wins — because they remove runaway behavior.
Is serverless always cheaper?
No. Serverless is great when traffic is spiky, low-volume, or unpredictable, because you pay little or nothing at idle.
But always-on workloads with steady volume often favor provisioned capacity and committed-use or reserved pricing.
The right answer is: pick the model that matches your traffic curve and operational needs.
October made cost explicit: unit costs, owners, and guardrails.
Next month is the inevitable companion topic:
Incident Response and Resilience: Designing for Failure, Not Hope
Because the same systems that create latency spikes and outages… also create cost spikes.
The adult supervision move is the same:
Design for failure paths — and design for the bill you’ll get when those paths trigger.