Oct 30, 2022 - 15 MIN READ
Cost as a First-Class Constraint: FinOps for Architects

Reliability is non-negotiable, but “cost” is where architecture meets physics. This month is a practical playbook: how to model cost, allocate it, and design guardrails so your system scales without surprising invoices.

Axel Domingues

A system that “works” but bleeds money is still broken.

Not because finance is grumpy — but because cost shapes what you can safely ship:

  • the features you can afford to run,
  • the reliability budget you can afford to buy,
  • and the scale you can survive without panic.

In practice, cost is a constraint just like latency or correctness.

So October is about making cost architectural instead of accidental.

This is not a “cloud billing tips” post.

This is the architect’s perspective:

  • model the cost curve
  • make it observable
  • and put guardrails where systems tend to explode

The goal this month

Make cost predictable by design: unit costs, ownership, and guardrails.

The mindset shift

Cost is not an afterthought — it’s a runtime property of your architecture.

What I’m measuring

Unit cost ($/request, $/order, $/active user), plus the top 5 cost drivers.

What “good” looks like

You can answer: “If traffic doubles, what happens to the bill — and why?”


The uncomfortable truth: most cost incidents are architecture incidents

Cost spikes rarely come from “the cloud being expensive.”

They come from unbounded behavior:

  • a retry storm that multiplies downstream traffic
  • high-cardinality metrics that explode your observability bill
  • a streaming pipeline without retention limits
  • a data model that forces full scans
  • chatty microservices with N× calls per request
  • egress-heavy patterns (cross-region, cross-provider, public internet)
  • “temporary” debug logs that became a firehose
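
To make the first item concrete, here is a minimal sketch of worst-case retry amplification. It assumes a simple call chain where every layer independently makes up to `attempts_per_layer` attempts (the original call plus retries) and the deepest dependency keeps failing. The function name and parameters are illustrative, not taken from any library.

```python
def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Calls hitting the deepest dependency for ONE user request.

    Assumption: each layer retries independently and every attempt fails,
    so attempts multiply at each hop instead of adding up.
    """
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("attempts_per_layer must be >= 1 and layers >= 0")
    return attempts_per_layer ** layers


# Three layers deep, comparing "no retries" with 2 and 3 attempts per layer.
for attempts in (1, 2, 3):
    factor = worst_case_amplification(attempts, layers=3)
    print(f"{attempts} attempt(s) per layer -> {factor}x downstream calls")
```

Under these assumptions, three attempts at each of three layers means one failing request can become 27 downstream calls, and every one of them is billed.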

Cost is a tail risk. Like latency, the mean is boring — the spikes are what hurt.

If you don’t design for cost, you’ll eventually “optimize cost” under pressure.

That’s the worst moment to touch your architecture.


Mini-glossary (as used in this post)


A cost model architects can actually use

You don’t need perfect accounting.

You need a model that lets you reason about change.

I use a simple decomposition:

  1. Baseline spend (what you pay to exist)
  2. Unit cost (what you pay per business outcome)
  3. Growth curve (what happens as volume increases)
  4. Blast radius (where spend can become unbounded)

Here’s the key idea:

If you can express cost as “base + (units × unit_cost) + risk,” you can design.

Baseline

Always-on resources: minimum instances, databases, NAT gateways, control planes.

Unit cost

Per request/order/user. This is the number product teams can reason about.

Growth curve

Linear? Step function? Superlinear? Depends on architecture.

Blast radius

Unbounded behavior: retries, scans, cardinality, egress, retention.
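
The four parts above can be sketched as a tiny model. This is an illustration under stated assumptions, not real pricing: all dollar figures are made up, `units_per_step` and `step_cost` are hypothetical fields that stand in for a step-function growth curve, and the blast radius is reduced to a flat `risk_allowance`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Monthly cost as base + (units x unit_cost) + risk, plus capacity steps."""
    baseline: float        # always-on spend per month, in dollars
    unit_cost: float       # dollars per business unit (for example, per order)
    units_per_step: int    # units one capacity step can absorb (hypothetical)
    step_cost: float       # dollars added for each extra capacity step
    risk_allowance: float  # dollars reserved for unbounded behavior

    def monthly_cost(self, units: int) -> float:
        # Capacity is bought in whole increments, so this term is a step function.
        extra_steps = max(0, math.ceil(units / self.units_per_step) - 1)
        return (self.baseline
                + units * self.unit_cost
                + extra_steps * self.step_cost
                + self.risk_allowance)


model = CostModel(baseline=2000.0, unit_cost=0.004,
                  units_per_step=1_000_000, step_cost=500.0,
                  risk_allowance=300.0)

today = model.monthly_cost(900_000)
doubled = model.monthly_cost(1_800_000)
print(f"today: ${today:,.0f}  doubled traffic: ${doubled:,.0f}  "
      f"ratio: {doubled / today:.2f}x")
```

With these made-up numbers, doubling traffic raises the bill by roughly 1.7x rather than 2x, because the baseline does not move while the unit cost and one extra capacity step do. That is the kind of answer the model exists to give.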


Where cloud costs actually come from (and why architects should care)

Most cloud bills collapse into four buckets:

  • Compute (CPU, memory, GPUs, “always-on” vs “per-invocation”)
  • Storage (hot vs cold, backups, snapshots, retention)
  • Network (especially egress and cross-zone/region traffic)
  • Managed services (databases, queues, analytics, observability)

You can’t optimize what you can’t name.

So your first FinOps move is boring but powerful:

Build a “top spenders” map by service and by environment.

If you do one thing this month, create a weekly view that shows:

  • spend by service
  • spend by environment (prod vs non-prod)
  • spend by tenant/customer (if B2B)
  • top 10 cost deltas week-over-week
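
The last item in that view, the week-over-week deltas, could be computed with something like the sketch below. It assumes you already export spend per service as plain dictionaries; the function name, the input shape, and the sample figures are all hypothetical.

```python
def top_cost_deltas(last_week: dict[str, float],
                    this_week: dict[str, float],
                    limit: int = 10) -> list[tuple[str, float, float]]:
    """Return (service, absolute_delta, percent_delta), biggest change first."""
    rows = []
    for service in set(last_week) | set(this_week):
        before = last_week.get(service, 0.0)
        after = this_week.get(service, 0.0)
        delta = after - before
        if before > 0:
            pct = delta / before * 100.0
        else:
            # New spend with no baseline: report as infinite growth.
            pct = float("inf") if delta > 0 else 0.0
        rows.append((service, delta, pct))
    # Sort by size of the change; the name breaks ties deterministically.
    rows.sort(key=lambda row: (-abs(row[1]), row[0]))
    return rows[:limit]


last_week = {"api": 100.0, "logs": 50.0}
this_week = {"api": 110.0, "logs": 120.0, "etl": 30.0}

for service, delta, pct in top_cost_deltas(last_week, this_week):
    print(f"{service:<6} {delta:+8.2f} USD  ({pct:+.0f}%)")
```

Sorting by absolute change rather than percentage keeps small services with noisy percentages from crowding out the lines that actually move the bill.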

Architecture choices that change your cost curve

Here are the decisions that move your cost curve structurally (not cosmetically).


The FinOps loop for architects (a playbook)

FinOps is a practice, not a one-off project.

What works is an operating loop:

Define your unit of value

Pick 1–3 business units your product cares about:

  • order, quote, claim, document, message, active user, tenant

Write them down and stop debating. You can refine later.

Build “cost per unit” metrics

You need two ingredients:

  • usage counters (orders/day, requests/day, messages/day)
  • cost attribution (service spend)

Then compute:

  • $/1k requests
  • $/order
  • $/active user-month
  • $ per GB processed
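
As a rough sketch, combining the two ingredients is a single division. The spend and usage figures below are invented, and dividing total spend by every unit is a deliberate simplification; proper attribution would split spend by service first.

```python
from typing import Optional


def cost_per_unit(spend_usd: float, units: int, per: int = 1) -> Optional[float]:
    """Spend divided by usage, scaled to `per` units (e.g. per=1000 for $/1k)."""
    if units <= 0:
        return None  # no usage recorded: the unit cost is undefined, not zero
    return spend_usd / units * per


# Hypothetical monthly figures.
monthly_spend = 12_000.0
usage = {
    "requests": 60_000_000,
    "orders": 400_000,
    "active_users": 25_000,
    "gb_processed": 8_000,
}

report = {
    "$/1k requests": cost_per_unit(monthly_spend, usage["requests"], per=1000),
    "$/order": cost_per_unit(monthly_spend, usage["orders"]),
    "$/active user-month": cost_per_unit(monthly_spend, usage["active_users"]),
    "$/GB processed": cost_per_unit(monthly_spend, usage["gb_processed"]),
}

for name, value in report.items():
    print(f"{name:<22} {value:.4f}")
```

Returning `None` for zero usage is a design choice: a service with spend but no recorded units is a measurement gap worth surfacing, not a free service.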

Attribute ownership (showback)

Every major cost center needs an owner:

  • a team, a platform group, a product area

No owner = no optimization.

Create guardrails (budgets, quotas, limits)

Examples:

  • max log ingestion per service per day
  • max message backlog
  • max query runtime or scanned bytes
  • max concurrency for expensive endpoints
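
One possible shape for the first guardrail is an in-process daily budget. This is a sketch only: the class, its fields, and the byte limit are hypothetical, and a real deployment would likely enforce the limit in the logging pipeline or a shared store rather than in one process.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DailyBudget:
    """Caps how much of a resource one service may consume per calendar day."""
    limit: int                                   # e.g. bytes of logs per day
    used: int = 0
    day: date = field(default_factory=date.today)

    def try_spend(self, amount: int, today: Optional[date] = None) -> bool:
        today = today or date.today()
        if today != self.day:
            # New day: the budget resets.
            self.day, self.used = today, 0
        if self.used + amount > self.limit:
            return False  # over budget: caller should drop, sample, or shed
        self.used += amount
        return True


log_budget = DailyBudget(limit=100, day=date(2022, 10, 30))
print(log_budget.try_spend(60, today=date(2022, 10, 30)))  # fits in the budget
print(log_budget.try_spend(50, today=date(2022, 10, 30)))  # would exceed it
print(log_budget.try_spend(50, today=date(2022, 10, 31)))  # next day, reset
```

The same pattern applies to the other examples: swap bytes for backlog size, scanned bytes, or concurrent requests, and decide up front what the system does when `try_spend` says no.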

Optimize the big levers first

Don’t “save 5%” by turning knobs.

Focus on cost multipliers:

  • retries
  • data scans
  • cardinality
  • egress
  • always-on capacity

Operationalize: review, regressions, and alarms

A healthy system has:

  • weekly cost review (30 minutes, not 3 hours)
  • cost regressions flagged like performance regressions
  • alerts on deltas, not just absolute thresholds

The best cost practice is not “constant optimization.”

It’s preventing regressions.


Cost guardrails you can bake into architecture (without becoming a finance team)

You can embed cost safety into the system the same way you embed reliability safety.

1) Budget-based alerting (deltas beat thresholds)

Absolute thresholds are brittle.

What you want is:

  • “spend is up 40% week-over-week”
  • “log ingestion doubled after deploy X”
  • “egress grew faster than traffic”
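
A delta-based check like these can be very small. The sketch below assumes you have last week's and this week's numbers for both a cost line and a traffic measure; the function names, the 40% threshold, and the 10-point tolerance are illustrative choices, not recommendations.

```python
def growth(before: float, after: float) -> float:
    """Relative change, e.g. 0.40 for +40%."""
    if before <= 0:
        raise ValueError("growth needs a positive baseline")
    return (after - before) / before


def outpaces_traffic(cost_before: float, cost_after: float,
                     traffic_before: float, traffic_after: float,
                     tolerance: float = 0.10) -> bool:
    """True when a cost line grew faster than traffic by more than `tolerance`."""
    return (growth(cost_before, cost_after)
            > growth(traffic_before, traffic_after) + tolerance)


# Hypothetical week: traffic +20%, egress +80%.
if outpaces_traffic(cost_before=100.0, cost_after=180.0,
                    traffic_before=1_000, traffic_after=1_200):
    print("ALERT: egress grew faster than traffic")

# Plain week-over-week check against a 40% threshold.
if growth(1_000.0, 1_450.0) >= 0.40:
    print("ALERT: spend is up 40% or more week-over-week")
```

Normalizing by traffic is what separates "we grew" from "something regressed": spend that rises in step with usage is expected, spend that outruns it is worth a look.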

2) SLO-aware sampling for logs/traces

If your SLO is healthy, you can sample aggressively and keep only a small fraction of logs and traces. If it’s degraded, keep a larger fraction and increase detail temporarily.

That keeps observability useful and affordable.
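
Here is a minimal sketch of that idea. It assumes some other component reports the fraction of error budget remaining (0.0 to 1.0); the function names, the two keep rates, and the 25% threshold are hypothetical and would need tuning per service.

```python
import random
from typing import Callable


def keep_rate(error_budget_remaining: float,
              healthy_rate: float = 0.01,
              degraded_rate: float = 0.50,
              threshold: float = 0.25) -> float:
    """Fraction of logs/traces to keep, based on SLO health.

    Healthy SLO: keep very little. Degraded SLO: keep much more detail.
    """
    if error_budget_remaining >= threshold:
        return healthy_rate
    return degraded_rate


def should_keep(rate: float, rng: Callable[[], float] = random.random) -> bool:
    """Probabilistic sampling decision for a single log line or trace."""
    return rng() < rate


rate = keep_rate(error_budget_remaining=0.10)  # budget nearly spent
print(f"keeping about {rate:.0%} of traces")
print(should_keep(rate))
```

Passing the random source in as `rng` keeps the decision testable. A production version would probably also keep every error-level event regardless of rate, which this sketch leaves out.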

3) Hard limits on unbounded dimensions

  • cardinality budgets (metrics labels)
  • retention limits (logs, events, snapshots)
  • query limits (runtime, scanned rows/bytes)
  • queue limits (max backlog before shedding load)
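
As an example of the first limit, a cardinality budget can be enforced on the application side before metrics are emitted. This is a sketch under assumptions: the class name and the `overflow` label are made up, and real metrics backends often offer their own limits that may be a better place to enforce this.

```python
class CardinalityBudget:
    """Admits at most `max_series` distinct label sets for one metric."""

    def __init__(self, max_series: int) -> None:
        self.max_series = max_series
        self._seen: set[tuple] = set()

    def admit(self, labels: dict[str, str]) -> dict[str, str]:
        key = tuple(sorted(labels.items()))
        if key in self._seen:
            return labels  # already counted against the budget
        if len(self._seen) < self.max_series:
            self._seen.add(key)
            return labels
        # Budget exhausted: collapse new label sets into one overflow series.
        return {"overflow": "true"}


budget = CardinalityBudget(max_series=2)
for user in ("a", "b", "c", "a"):
    print(user, budget.admit({"user": user}))
```

Collapsing into an overflow series instead of dropping the data point keeps totals correct while capping the number of billable series.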

4) Environment hygiene

Non-prod is where cost discipline goes to die.

Rules that pay for themselves:

  • nightly shutdown for dev/staging where possible
  • TTL tags for “temporary” resources
  • separate budgets by environment
  • automated cleanup of orphaned resources

If your org doesn’t have tagging discipline, start small:

  • service
  • environment
  • owner
  • cost-center

That’s enough to make cost visible and actionable.
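
Both the four-tag minimum and the TTL rule can be checked mechanically. The sketch below assumes resources are available as a dictionary of tags plus a creation timestamp; the `ttl-hours` tag name is hypothetical, and fetching the inventory from a cloud provider is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("service", "environment", "owner", "cost-center")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Required tags that are absent or blank on a resource."""
    return [name for name in REQUIRED_TAGS if not tags.get(name, "").strip()]


def is_expired(tags: dict[str, str], created_at: datetime, now: datetime) -> bool:
    """True when a resource has a `ttl-hours` tag and has outlived it."""
    ttl = tags.get("ttl-hours")
    if ttl is None:
        return False  # no TTL tag: not a "temporary" resource
    return now >= created_at + timedelta(hours=float(ttl))


tags = {"service": "checkout", "environment": "dev", "owner": "", "ttl-hours": "24"}
created = datetime(2022, 10, 1, tzinfo=timezone.utc)
now = datetime(2022, 10, 3, tzinfo=timezone.utc)

print("missing tags:", missing_tags(tags))
print("expired:", is_expired(tags, created, now))
```

Run against a resource inventory on a schedule, the first check feeds a "fix your tags" report and the second feeds the cleanup job.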

A practical “Costed Architecture” worksheet

When I’m reviewing an architecture proposal, I ask these questions.


Common cost anti-patterns (and the “adult supervision” fix)

“We’ll optimize later”

Later is when you’re busy and scared.
Fix: design the cost model now (baseline + unit cost + blast radius).

“It’s just logging”

Logs scale with traffic, retries, and payload size.
Fix: sampling, retention tiers, and enforced log levels.

“The database can handle it”

Databases handle it… until they don’t, and scaling is a step function.
Fix: query budgets, read models, caching, and OLAP separation.

“We need multi-region now”

Multi-region multiplies complexity and often cost (especially replication traffic).
Fix: make the reliability goal explicit, then choose the cheapest topology that meets it.


Resources

FinOps Foundation / FinOps Framework

A shared language for the practice: inform → optimize → operate, plus tooling and organizational patterns.

Google Cloud Pricing Calculator

Sanity-check “always-on” vs “serverless” and data- and egress-heavy designs.

AWS Pricing Calculator

Estimate baseline footprints and compare architecture options before you build them.

Azure Pricing Calculator

Model step-function services (databases, analytics) where scaling is not linear.


FAQ


What’s Next

October made cost explicit:

  • unit economics instead of receipts
  • ownership instead of blame
  • guardrails instead of panic

Next month is the inevitable companion topic:

Incident Response and Resilience: Designing for Failure, Not Hope

Because the same systems that create latency spikes and outages… also create cost spikes.

The adult supervision move is the same:

Design for failure paths — and design for the bill you’ll get when those paths trigger.
