
CI/CD isn’t a DevOps checkbox — it’s the architecture that makes change safe. This month is about test economics, pipeline design, and rollout strategies that turn “deploy” into a reversible decision.
Axel Domingues
Most teams treat CI/CD as plumbing.
Something you wire once, complain about forever, and blame when production burns.
But in mature systems, CI/CD is something else entirely:
It’s the architecture of change.
It determines how fast you can ship, how safely a change lands, and how quickly you can recover when it doesn’t.
This month is about turning CI/CD into a safety system — with crisp mental models you can teach, review, and enforce.
Not “how to use a tool.”
How to design a system that remains safe while it changes.
The goal this month
Make change safe by design: tests that pay rent, pipelines that produce trust, and deployments that can be reversed.
The mindset shift
CI/CD is not automation.
It’s the contract between code and production.
The real output
Not “passing builds.”
A system where every deploy is boring.
The failure you’re avoiding
“Green pipeline, broken prod” — and no safe way back.
If you design your runtime architecture carefully but ignore your delivery architecture, you end up with a paradox: a system that is robust while it runs, but fragile every time it changes.
The delivery architecture answers questions like: how does a change reach production, how do you know it’s safe, and how do you take it back?
If these aren’t explicit, they still exist — they’re just encoded as tribal knowledge and late-night heroics.
CI/CD is the mechanism that converts change into a controlled experiment.
If you can’t control the experiment, you don’t have CI/CD — you have a build server.
People quote the testing pyramid like a rule.
It’s better understood as a budget.
Every test has a cost profile: what it costs to write, how long it takes to run, how often it flakes, and what kind of failures it can actually catch.
When you treat tests as economics, the pyramid becomes obvious:
Unit tests
Cheap, fast, stable. Great at logic + edge cases. Poor at integration truth.
Integration tests
Medium cost. Catch contract issues and wiring bugs. Must be curated to avoid slow creep.
End-to-end tests
Expensive and fragile. Validate critical paths only. Treat as “smoke alarms,” not a net.
Production validation
The only environment with real traffic. Requires safe rollout + observability to be useful.
E2E as the primary safety net: it produces slow pipelines, flaky builds, and low trust.
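To make the budget framing concrete, here is a minimal sketch of enforcing it in CI: it reads a JUnit-style XML test report and fails the run when a layer overspends. The directory-based layer names and the budget numbers are illustrative assumptions, not a standard.

```python
# check_test_budget.py - a sketch of "tests as a budget".
# Assumes tests live under tests/unit, tests/integration, tests/e2e and that
# the test runner wrote a JUnit-style XML report; both are illustrative.
import sys
import xml.etree.ElementTree as ET

BUDGET_SECONDS = {"unit": 60, "integration": 300, "e2e": 900}  # per pipeline run

def layer_for(classname: str) -> str:
    # Map a test's module path to a layer; default to "unit".
    for layer in ("e2e", "integration", "unit"):
        if f".{layer}." in classname or classname.startswith(f"tests.{layer}"):
            return layer
    return "unit"

def main(report_path: str) -> int:
    spend = {layer: 0.0 for layer in BUDGET_SECONDS}
    root = ET.parse(report_path).getroot()
    for case in root.iter("testcase"):
        spend[layer_for(case.get("classname", ""))] += float(case.get("time", "0"))

    over_budget = False
    for layer, seconds in spend.items():
        budget = BUDGET_SECONDS[layer]
        status = "OK" if seconds <= budget else "OVER BUDGET"
        print(f"{layer:>12}: {seconds:7.1f}s / {budget}s  {status}")
        over_budget |= seconds > budget
    return 1 if over_budget else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "junit.xml"))
```

The exact numbers matter less than the fact that the layer split becomes an enforced budget rather than a diagram on a wiki.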
A good pipeline is not “steps.”
It’s a state machine with two outputs: an artifact you could ship, and a credibility score for that artifact.
That credibility score comes from layers of evidence.
A pipeline should fail fast on problems that won’t fix themselves.
And it should retry automatically on problems that might.
Flaky tests are not “just annoying.” They blur exactly that line, and they erode the trust the pipeline exists to produce.
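A minimal sketch of that split, assuming failures can be classified as deterministic or transient; the exception names and the retry policy are illustrative, not any specific CI tool’s API.

```python
# A sketch of "fail fast on deterministic problems, retry transient ones".
# The failure taxonomy and retry policy are illustrative assumptions.
import time

class DeterministicFailure(Exception):
    """Compile errors, failing assertions, lint violations: retrying won't help."""

class TransientFailure(Exception):
    """Network blips, registry timeouts, runner pre-emption: retrying might."""

def run_step(step, max_retries: int = 3, backoff_seconds: float = 2.0):
    for attempt in range(1, max_retries + 1):
        try:
            return step()
        except DeterministicFailure:
            raise                      # fail fast: a human has to act
        except TransientFailure:
            if attempt == max_retries:
                raise                  # transient in theory, persistent in practice
            time.sleep(backoff_seconds * attempt)  # back off before the next attempt
```

Retries belong only on the transient branch: retrying deterministic failures hides them, and retrying flaky tests turns a defect into background noise.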
When a pipeline is green but prod fails, it’s rarely mysterious.
It’s typically one of these:
Your staging isn’t production-like (data, traffic patterns, config, dependencies).
Fix by:
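One narrow, automatable slice of the problem is configuration parity. A minimal sketch, assuming each environment can export its config as flat JSON (the file names are illustrative); it catches shape drift, not data or traffic drift:

```python
# A sketch of one narrow parity check: compare the shape of staging's config
# against production's. File names and flat-JSON format are illustrative
# assumptions; values are expected to differ between environments, shapes aren't.
import json
import sys

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def config_drift(staging: dict, prod: dict) -> list[str]:
    drift = []
    for key in sorted(set(staging) | set(prod)):
        if key not in staging:
            drift.append(f"missing in staging: {key}")
        elif key not in prod:
            drift.append(f"only in staging: {key}")
        elif type(staging[key]) is not type(prod[key]):
            drift.append(f"type mismatch for {key}")
    return drift

if __name__ == "__main__":
    drift = config_drift(load("staging.json"), load("prod.json"))
    print("\n".join(drift) or "no config drift detected")
    sys.exit(1 if drift else 0)
```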
One service changed a contract and nothing enforced it.
Fix by:
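A common enforcement mechanism is a consumer-driven contract check (tools such as Pact formalize the idea). A minimal hand-rolled sketch, with illustrative service and field names:

```python
# A minimal, hand-rolled consumer-driven contract check (illustrative names).
# The consumer publishes the fields it reads; the provider's pipeline fails
# if a response stops satisfying that expectation.

# What the "orders" consumer says it needs from GET /users/{id}
CONSUMER_CONTRACT = {
    "id": int,
    "email": str,
    "is_active": bool,
}

def verify_contract(response_body: dict, contract: dict) -> list[str]:
    violations = []
    for field, expected_type in contract.items():
        if field not in response_body:
            violations.append(f"missing field: {field}")
        elif not isinstance(response_body[field], expected_type):
            violations.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return violations

def test_user_response_honours_orders_contract():
    # In a real pipeline this would call the provider built from this commit.
    response_body = {"id": 42, "email": "a@example.com", "is_active": True}
    assert verify_contract(response_body, CONSUMER_CONTRACT) == []
```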
Code deploy and schema deploy aren’t coordinated safely.
Fix by:
Your validation answers “does it run,” not “does it behave under load and failure.”
Fix by:
A deployment strategy is the thing that decides whether prod is a cliff.
When you say “rollout safety,” you’re really asking: how much traffic can a bad change reach, and how fast can you take it back?
Blue/Green
Two environments. Flip traffic. Fast rollback. Requires careful handling of data migrations.
Canary / progressive delivery
Route a small % of traffic to the new version. Roll forward or back based on signals.
Feature flags
Separate deploy from release. Ship code dark, enable gradually, kill quickly when needed (a minimal sketch follows this list).
Ring deployments
Promote by cohort: internal users → beta → small region → full fleet. Great for large orgs.
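To make the feature-flag mechanics concrete, here is a minimal sketch of a percentage rollout with a kill switch; the in-memory flag store and the flag name are illustrative assumptions.

```python
# A minimal feature-flag sketch: deploy dark, enable by percentage, kill fast.
# The in-memory FLAGS dict stands in for a real flag service or config store.
import hashlib

FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 5},  # kill switch + ramp
}

def bucket(user_id: str, flag_name: str) -> int:
    """Deterministically map a user to 0-99 so ramp-ups are sticky per user."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag_name: str, user_id: str) -> bool:
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:          # kill switch: one field to flip
        return False
    return bucket(user_id, flag_name) < flag["rollout_percent"]

# The call site ships dark: the old path stays the default until the ramp is done.
def checkout(user_id: str):
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

Rolling back here means flipping the enabled field or dropping the percentage to zero, with no deploy involved.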
Teams say “we can roll back” and then discover the trap: the code rolls back in seconds, but the schema change and the data already written don’t.
So rollout safety depends on migration discipline and compatibility discipline.
A deploy must be safe in both directions for at least one release window.
That single constraint forces architecture maturity.
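The usual way to honour that constraint is an expand/contract (parallel change) migration spread across release windows. The column rename below is an illustrative sketch, not a specific migration tool’s syntax:

```python
# An illustrative expand/contract plan for renaming users.fullname -> users.display_name.
# No single release both changes the schema and depends on that change, so each
# release can roll back to its predecessor.
EXPAND_CONTRACT_PLAN = [
    # Release N (expand): additive only; old code ignores the new column.
    "add nullable column users.display_name",
    "deploy code that writes BOTH fullname and display_name, still reads fullname",
    # Release N+1: backfill, then move reads; the old column is still maintained.
    "backfill display_name from fullname",
    "deploy code that reads display_name, still writes both",
    # Release N+2 (contract): only once no running version needs the old column.
    "deploy code that stops writing fullname",
    "drop column users.fullname",
]
```

At every step, the release before and the release after can run against the same schema, which is exactly the two-way safety window.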
These are the checklists I want teams to print and argue about.
If a human has to remember it, it will be forgotten at 2am.
Replace manual steps with automation the pipeline runs and enforces.
Pipelines are products. If nobody owns reliability, everyone suffers.
Fix by:
Staging often becomes a museum: stable, unlike prod, and misleading.
Fix by:
If your only rollout is “deploy to everyone,” you’re relying on luck.
Fix by:
Google SRE Book — Release Engineering
A classic explanation of why release processes are reliability mechanisms — and why automation is a means, not the goal.
DORA / Accelerate — Metrics That Matter
The most useful vocabulary for delivery performance: lead time, deployment frequency, change fail rate, and time to restore.
You need deployment safety before you split systems.
Microservices increase change frequency and surface area. Without solid CI/CD, you’ll ship slower and fail more.
Fast enough that engineers don’t work around it.
As a rough heuristic:
Limit blast radius and make rollback real.
If a bad deploy affects 1% of traffic and you can undo it in minutes, you’ve turned incidents into small, survivable events.
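A sketch of the decision loop behind that, assuming the platform exposes operations to shift traffic and roll back, and that error rate is the guarding signal; all names and thresholds are illustrative.

```python
# A sketch of a canary controller: widen traffic only while the signal is healthy,
# roll back the moment it isn't. `shift_traffic`, `rollback`, and `error_rate`
# stand in for your platform's and metrics backend's real APIs (assumptions).
import time

TRAFFIC_STEPS = [1, 5, 25, 50, 100]      # percent of traffic on the new version
ERROR_RATE_THRESHOLD = 0.01              # abort if more than 1% of requests fail
BAKE_TIME_SECONDS = 300                  # observe each step before widening

def canary_rollout(shift_traffic, rollback, error_rate) -> bool:
    for percent in TRAFFIC_STEPS:
        shift_traffic(percent)
        time.sleep(BAKE_TIME_SECONDS)    # let real traffic produce a signal
        if error_rate(window_seconds=BAKE_TIME_SECONDS) > ERROR_RATE_THRESHOLD:
            rollback()                   # blast radius capped at `percent`
            return False
    return True                          # 100% of traffic, with evidence at each step
```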
Treat flake like a defect with an owner.
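One lightweight way to give flake an owner and a deadline, sketched as a local pytest convention (the decorator is not a pytest built-in; names and dates are illustrative):

```python
# A sketch of flake-with-an-owner: quarantine skips the test until a deadline,
# after which it runs (and fails) again so it cannot be ignored forever.
from datetime import date
import pytest

def quarantined(owner: str, expires: str, reason: str):
    def decorator(test_fn):
        if date.today() <= date.fromisoformat(expires):
            return pytest.mark.skip(reason=f"quarantined by {owner}: {reason}")(test_fn)
        return test_fn  # deadline passed: the flake is visible again
    return decorator

@quarantined(owner="payments-team", expires="2030-01-15", reason="intermittent timeout")
def test_refund_is_idempotent():
    ...
```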
If CI/CD is the architecture of change, then observability is the architecture of truth.
Next month:
Observability that Works: Logs, Metrics, Traces, and SLO Thinking
Because safe rollouts only work if you can see reality quickly — and decide based on signals, not hope.