
Async work is where production gets honest. This month is a practical playbook for queues, retries, idempotency keys, and the patterns that keep “background jobs” from duplicating money or burning trust.
Axel Domingues
Async work is where architecture stops being a diagram and starts being a fight with physics.
Requests time out. Networks partition. Workers crash mid-flight. Vendors throttle.
And the thing that surprises most teams is not that it fails… it’s that it fails by repeating.
This post is the boring, reliable playbook:
Queues + retries + idempotency — and how to design them so production stays boring.
It’s about the invariants you must enforce regardless of whether you’re on Kafka, SQS, RabbitMQ, Cloud Tasks, Sidekiq, or a homegrown DB-backed queue.
The real goal
Make async work safe under retries, duplication, and partial failure.
The mental model
Assume at-least-once delivery, then treat idempotency and dedupe as first-class design concerns.
The operational bar
You can answer: “Where is it stuck?”, “How many are failing?”, “What happens next?”
The non-goal
“Exactly once” across a distributed system is not a plan. It’s a wish.
In distributed systems, delivery is usually one of these:
If you build assuming “exactly once” and you’re wrong, your system fails catastrophically:
So take the default stance:
A queue isn’t “background jobs.” It’s a contract boundary.
It buys you:
It does not buy you correctness for free.
If you push a “charge card” message onto a queue with retries enabled, the queue is now allowed to cause duplicates. It’s doing its job.
Correctness is your job.
Retries are powerful. Retries are also dangerous.
The first design step is categorizing failures:
Examples:
Goal:
Examples:
Goal:
Idempotency means:
Running the same operation multiple times yields the same final state as running it once.
In async systems, idempotency is how you turn “at-least-once delivery” into “effectively once” outcomes.
1) Natural idempotency
The operation is inherently safe to repeat (e.g., “set status = SHIPPED” if already shipped).
2) Database constraints
Unique constraints enforce “only one” (e.g., unique(order_id, event_type) or unique(idempotency_key)).
3) Idempotency keys + dedupe store
A first-class key maps to the result so duplicates can return the same outcome.
Where teams get hurt
Side effects (payments, emails, webhooks) without a dedupe boundary.
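As a minimal sketch of the database-constraint approach, the snippet below uses an in-memory SQLite table with the unique(order_id, event_type) constraint mentioned above. The table name `order_events` and the helper `record_event_once` are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_events (
        order_id   TEXT NOT NULL,
        event_type TEXT NOT NULL,
        UNIQUE (order_id, event_type)
    )
    """
)


def record_event_once(conn, order_id, event_type):
    """Return True if this call recorded the event, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO order_events (order_id, event_type) VALUES (?, ?)",
                (order_id, event_type),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected a second copy of the same event.
        return False


first = record_event_once(conn, "order-1", "SHIPPED")
second = record_event_once(conn, "order-1", "SHIPPED")  # simulated redelivery
row_count = conn.execute("SELECT COUNT(*) FROM order_events").fetchone()[0]
```

The database, not the application code, decides which copy wins, so the guarantee holds even when two workers race on the same message.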
A useful rule:
Idempotency belongs closest to the irreversible side effect.
If you’re calling:
…then that boundary must be protected by an idempotency mechanism.
Because retries will happen:
If you don’t protect that boundary, you’ll replay the side effect.
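One way to protect that boundary is a small dedupe store that maps an idempotency key to the stored result. The sketch below is in-memory only and assumes a hypothetical `charge_card` function standing in for the real side effect. A production version would need a durable store, and would still want the provider's own idempotency key to cover a crash between the side effect and the write.

```python
import threading


class IdempotencyStore:
    """In-memory dedupe store: idempotency key -> result of the first run."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def run_once(self, key, operation):
        # The lock is held while the operation runs, which serializes all
        # callers. That is acceptable for a sketch, not for a busy service.
        with self._lock:
            if key in self._results:
                return self._results[key]  # duplicate: replay stored outcome
            result = operation()  # if this raises, nothing is stored
            self._results[key] = result
            return result


calls = []


def charge_card():
    # Hypothetical side effect; records each real execution.
    calls.append("charged")
    return {"charge_id": "ch_1", "status": "succeeded"}


store = IdempotencyStore()
r1 = store.run_once("charge:order-1:1", charge_card)
r2 = store.run_once("charge:order-1:1", charge_card)  # retried delivery
```

Because a failed operation stores nothing, a retry after an exception runs the operation again, while a retry after success only returns the saved result.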
Here’s the smallest architecture that reliably survives retries and duplication:
The producer creates a durable record representing what must happen:
job_id (or command_id)
idempotency_key (optional, but often essential)
Publish a message with:
job_id
attempt
In the consumer:
Retry transient failures:
After max attempts:
You have a duplication engine.
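To make the consumer side of this flow concrete, here is a minimal sketch. The `TransientError` / `PermanentError` split, the `MAX_ATTEMPTS` value, and the dead-letter list are assumptions for illustration. The message carries only `job_id` and `attempt`, as described above, and the job record is looked up from a dict that stands in for the durable store.

```python
class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (timeout, throttling)."""


class PermanentError(Exception):
    """Hypothetical: a failure that will not succeed on retry (bad input)."""


MAX_ATTEMPTS = 3  # assumed limit for this sketch


def handle_message(message, jobs, processed, dead_letters, do_work):
    """Process one message; return 'done', 'duplicate', 'retry', or 'dead'."""
    job_id = message["job_id"]
    if job_id in processed:
        return "duplicate"  # already handled: acknowledge without side effects
    try:
        do_work(jobs[job_id])
    except TransientError:
        if message["attempt"] >= MAX_ATTEMPTS:
            dead_letters.append(message)
            return "dead"
        return "retry"  # the caller re-enqueues with attempt + 1
    except PermanentError:
        dead_letters.append(message)  # retrying cannot help
        return "dead"
    processed.add(job_id)
    return "done"


jobs = {"job-1": {"order_id": "order-1"}}
processed, dead_letters, effects = set(), [], []
failures_left = [2]  # the first two attempts time out


def flaky_work(job):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise TransientError("timeout")
    effects.append(job["order_id"])


outcomes = []
message = {"job_id": "job-1", "attempt": 1}
while True:
    outcome = handle_message(message, jobs, processed, dead_letters, flaky_work)
    outcomes.append(outcome)
    if outcome != "retry":
        break
    message = {"job_id": message["job_id"], "attempt": message["attempt"] + 1}

# A redelivered copy of the original message arrives after success.
outcomes.append(
    handle_message({"job_id": "job-1", "attempt": 1}, jobs, processed, dead_letters, flaky_work)
)
```

The side effect runs once even though the message was handled four times, and nothing reaches the dead-letter list because the third attempt succeeded.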
An idempotency key is a stable identifier for “this effect.”
Good keys:
Bad keys:
Idempotency-Key: <uuid> (client-generated)
charge:<orderId>:<paymentAttemptNumber>
invoiceIssued:<invoiceId>:v1
emailReceipt:<orderId>
When you don’t, deterministic server-side keys can still work, but be intentional about scope.
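A small sketch of how such keys might be built. The helper `side_effect_key` is a hypothetical name. It joins the side-effect name with business identifiers so that the same effect always yields the same key, while a client-generated key is simply a random UUID created once and resent on every retry.

```python
import uuid


def side_effect_key(effect, *business_ids, version=None):
    """Build a deterministic key scoped to one side effect."""
    parts = [effect, *[str(b) for b in business_ids]]
    if version is not None:
        parts.append(f"v{version}")
    return ":".join(parts)


# Server-side keys: stable across retries of the same effect.
charge_key = side_effect_key("charge", "order-42", 1)
charge_retry_key = side_effect_key("charge", "order-42", 1)
second_attempt_key = side_effect_key("charge", "order-42", 2)
invoice_key = side_effect_key("invoiceIssued", "inv-7", version=1)
receipt_key = side_effect_key("emailReceipt", "order-42")

# Client-side key: generated once by the caller, then reused on every retry.
client_key = str(uuid.uuid4())
```

Scope is the design choice here: a new payment attempt gets a new key on purpose, while a retry of the same attempt must not.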
Most async bugs are ordering bugs between your DB write and your queue publish.
Common failure:
The classic fix is the Outbox pattern:
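A minimal sketch of the idea, using in-memory SQLite: the business row and the outbox row are written in one transaction, and a separate relay publishes unpublished rows afterwards. The table names, the `place_order` and `relay_outbox` helpers, and the `publish` callback are assumptions for illustration, not a specific broker's API.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        payload   TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(conn, order_id):
    # Both inserts commit together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "OrderPlaced", "order_id": order_id}),),
        )


def relay_outbox(conn, publish):
    """Publish pending outbox rows in order; return how many were pending."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # A crash after publish but before the UPDATE republishes this row
        # on the next pass, so delivery is at-least-once by design.
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
place_order(conn, "order-1")
first_pass = relay_outbox(conn, published.append)
second_pass = relay_outbox(conn, published.append)
```

The relay can still publish a row twice if it crashes between publishing and marking the row, which is why consumers need their own dedupe.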
Similarly, on the consumer side, the Inbox pattern (or “dedupe log”) ensures you can safely process duplicates:
message_id / job_id as processed
Retries need discipline, not enthusiasm.
Here’s a practical retry policy that scales:
Exponential backoff + jitter
Avoid synchronized retry storms. Always add randomness.
Retry budget
Cap total retry volume per service / per dependency to avoid self-DDoS.
Circuit breakers
If a dependency is down, stop hammering it. Fail fast and recover later.
Timeouts are mandatory
No timeout = infinite hanging = worker pool collapse.
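For the backoff item above, a minimal sketch of exponential backoff with jitter. It uses the common variant often called "full jitter", where the delay is drawn uniformly between zero and an exponentially growing ceiling. The `base` and `cap` values are illustrative assumptions, not recommendations.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Return a full-jitter delay in seconds for a 1-based attempt number."""
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng.uniform(0, ceiling)


# A seeded generator keeps the demo repeatable; real workers would not seed.
rng = random.Random(7)
delays = [backoff_delay(attempt, rng=rng) for attempt in range(1, 9)]
ceilings = [min(30.0, 0.5 * (2 ** (attempt - 1))) for attempt in range(1, 9)]
```

The randomness is what spreads out workers that all failed at the same moment. The cap keeps late attempts from waiting unboundedly long.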
A simple rule for architects:
Retries shift load from now to later — they don’t eliminate load.
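Since retries only defer load, a circuit breaker is one way to actually shed it while a dependency is down. The sketch below is a deliberately small breaker. The threshold, the cool-down, and the class names are assumptions, and the injected `clock` exists only so the example can run without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])
attempts = []


def failing():
    attempts.append("call")
    raise ConnectionError("vendor down")


for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass

try:
    breaker.call(failing)
    failed_fast = False
except CircuitOpenError:
    failed_fast = True  # the dependency was not called a third time

now[0] = 11.0  # cool-down has passed
recovered = breaker.call(lambda: "ok")
```

While the breaker is open, jobs fail immediately and can be rescheduled with backoff instead of holding a worker until a timeout fires.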
Your system must have:
You must dedupe at the provider boundary.
Users don’t forgive duplicates.
unique(notification_type, recipient, business_id).
Duplicates cause missing stock or double shipments.
Async systems fail silently unless you build a control panel.
You need to measure:
job_id / command_id?
Different queues have different semantics:
But the architecture rules stay constant:
November takeaway
A queue is not reliability.
Idempotency is reliability. Queues just deliver the opportunity for failure — and for recovery.
The Outbox Pattern (concept)
A transaction-safe way to ensure “DB write + message publish” doesn’t create ghost jobs or lost events.
Idempotency Keys (practice)
The simplest, most scalable strategy for safe retries around external side effects (payments, emails, webhooks).
Yes — if the side effect is low-stakes (analytics, best-effort notifications) or the caller can safely retry at a higher layer.
But for money, orders, identity, and core product state: prefer at-least-once + idempotency.
If the operation begins from a client request, let the client generate it and send it.
If the operation is internal, generate a deterministic key from business identifiers (orderId, invoiceId, etc.) and scope it to the side effect.
Not always.
But if losing a message means losing money, losing orders, or violating a user promise, you want transactional guarantees — and outbox is the most common way to get them without distributed transactions.
A queue is typically about work (do this once), while a stream is about facts (this happened; many consumers may react).
In practice, systems blur the line — which is why idempotency and replay-safe processing matter in both.
This month was about making async systems honest:
Next month we scale the conversation from “one system” to “many systems”:
Microservices vs Modular Monolith
Because async boundaries are often the first crack in a monolith…
and the first regret in a microservice migration.