Nov 28, 2021 - 16 MIN READ
Queues, Retries, and Idempotency: Engineering Reality in Async Systems

Async work is where production gets honest. This month is a practical playbook for queues, retries, idempotency keys, and the patterns that keep “background jobs” from duplicating money or burning trust.

Axel Domingues

Async work is where architecture stops being a diagram and starts being a fight with physics.

Requests time out. Networks partition. Workers crash mid-flight. Vendors throttle.
And the thing that surprises most teams is not that it fails… it’s that it fails by repeating.

  • The same job runs twice.
  • The same webhook arrives 10 times.
  • The same “send email” happens again after a retry.
  • The same “charge card” becomes your worst incident of the year.

This post is the boring, reliable playbook:

Queues + retries + idempotency — and how to design them so production stays boring.

This is not a “which queue is best” post.

It’s about the invariants you must enforce regardless of whether you’re on Kafka, SQS, RabbitMQ, Cloud Tasks, Sidekiq, or a homegrown DB-backed queue.

The real goal

Make async work safe under retries, duplication, and partial failure.

The mental model

Assume at-least-once delivery, then design idempotency + dedupe as first-class.

The operational bar

You can answer: “Where is it stuck?”, “How many are failing?”, “What happens next?”

The non-goal

“Exactly once” across a distributed system is not a plan. It’s a wish.


The First Truth: “Exactly Once” Is Rarely Real

In distributed systems, delivery is usually one of these:

  • At most once: might drop messages (fast, risky for critical work).
  • At least once: might deliver duplicates (safe if you handle duplicates).
  • Exactly once: only achievable under strict constraints, and often only “exactly once processing” in a narrow scope.

If you build assuming “exactly once” and you’re wrong, your system fails catastrophically:

  • duplicate billing
  • duplicated orders
  • double refunds (yes, that happens)
  • duplicated side effects (email, SMS, push, ledger entries)

So take the default stance:

Design async systems as at-least-once and make processing idempotent. It is the only robust baseline that survives reality.

What a Queue Actually Buys You (and What It Doesn’t)

A queue isn’t “background jobs.” It’s a contract boundary.

It buys you:

  • Decoupling: producers don’t need workers online right now.
  • Smoothing: bursts get buffered; workers drain at a steady rate.
  • Retries: transient failures can be retried automatically.
  • Isolation: slow work doesn’t hold open user requests.
  • Observability surface: backlog and latency become measurable.

It does not buy you correctness for free.

If you push a “charge card” message onto a queue with retries enabled, the queue is now allowed to cause duplicates. It’s doing its job.

Correctness is your job.


Two Kinds of Failures: Transient vs Permanent

Retries are powerful. Retries are also dangerous.

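The first design step is categorizing failures. Transient failures (timeouts, throttling, a dependency that is briefly unavailable) may succeed on a later attempt. Permanent failures (invalid payloads, missing records, rejected requests) will fail the same way on every attempt.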

If you don’t classify failure types, your retry system will happily turn permanent failures into endless churn — and “endless churn” becomes your queue backlog incident.
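
A minimal sketch of that classification in Python. The exception names and the `classify` helper are hypothetical, and which errors count as transient depends on your own dependencies.

```python
class TransientError(Exception):
    """Failure that may succeed on a later attempt (timeout, throttling)."""


class PermanentError(Exception):
    """Failure that will repeat on every attempt (bad payload, missing record)."""


def classify(exc: Exception) -> str:
    """Map an exception to a retry decision: 'retry' or 'dead_letter'."""
    if isinstance(exc, (TransientError, TimeoutError, ConnectionError)):
        return "retry"
    # Unknown errors are treated as permanent so they surface to a human
    # instead of churning in the queue. That default is a judgment call.
    return "dead_letter"
```

Treating unknown errors as permanent trades a few unnecessary dead letters for a queue that cannot churn forever on a bug.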

The Core Pattern: Idempotent Consumers

Idempotency means:

Running the same operation multiple times yields the same final state as running it once.

In async systems, idempotency is how you turn “at-least-once delivery” into “effectively once” outcomes.

The three idempotency strategies (in order of robustness)

1) Natural idempotency

The operation is inherently safe to repeat (e.g., “set status = SHIPPED” leaves an already-shipped order unchanged).

2) Database constraints

Unique constraints enforce “only one” (e.g., unique(order_id, event_type) or unique(idempotency_key)).

3) Idempotency keys + dedupe store

A first-class key maps to the result so duplicates can return the same outcome.

Where teams get hurt

Side effects (payments, emails, webhooks) without a dedupe boundary.
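
Strategy 2 (database constraints) might look like the following sketch, which uses Python's built-in `sqlite3` module. The `order_events` table and the `record_event_once` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_events ("
    " order_id TEXT NOT NULL,"
    " event_type TEXT NOT NULL,"
    " UNIQUE (order_id, event_type))"
)


def record_event_once(order_id: str, event_type: str) -> bool:
    """Return True if this call inserted the row, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO order_events (order_id, event_type) VALUES (?, ?)",
                (order_id, event_type),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected a second row for the same effect.
        return False
```

The database, not application code, decides who was first, so two workers racing on the same message cannot both win.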


Where Idempotency Belongs: At the Boundary of Side Effects

A useful rule:

Idempotency belongs closest to the irreversible side effect.

If you’re calling:

  • a payment provider
  • an email provider
  • a shipping label API
  • a message publish that triggers downstream charges

…then that boundary must be protected by an idempotency mechanism.

Because retries will happen:

  • HTTP timeouts happen even when the provider completed the operation
  • your worker may crash after the provider succeeded but before you persisted the result

If you don’t protect that boundary, you’ll replay the side effect.


The Minimal “Safe Async” Architecture

Here’s the smallest architecture that reliably survives retries and duplication:

  1. Persist intent (write the business command to your database)
  2. Emit work (enqueue a message/task)
  3. Process idempotently (dedupe + apply the side effect)
  4. Record outcome (store result, status, and correlation IDs)
  5. Retry safely (only transient failures; backoff + jitter; DLQ)

Persist intent with a stable identifier

The producer creates a durable record representing what must happen:

  • job_id (or command_id)
  • idempotency_key (optional, but often essential)
  • payload + version
  • current state (PENDING, IN_PROGRESS, DONE, FAILED)

Enqueue work (separately from user latency)

Publish a message with:

  • job_id
  • attempt
  • trace/correlation IDs
  • a “visibility timeout”/lease concept if your queue supports it

Process idempotently (assume duplicates)

In the consumer:

  • start by checking the dedupe record
  • if already processed: return immediately
  • otherwise: perform the operation
  • commit the result in the DB as your source of truth
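
The consumer steps above can be sketched as follows. The in-memory `processed` dict stands in for a database table and is an assumption of this example, as is the `handle` function name.

```python
from typing import Callable, Dict

# Hypothetical in-memory dedupe store: job_id -> stored result.
# In production this would be a database table with a unique key on job_id.
processed: Dict[str, str] = {}


def handle(job_id: str, operation: Callable[[], str]) -> str:
    """Run `operation` once per recorded job_id and return the stored result."""
    if job_id in processed:          # duplicate delivery: return immediately
        return processed[job_id]
    result = operation()             # perform the operation
    processed[job_id] = result       # commit the result as the source of truth
    return result
```

Note the gap this sketch leaves open: a crash after `operation()` but before the result is stored means the next delivery runs the operation again. That is why the side effect itself also needs an idempotency key.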

Retry with backoff and jitter

Retry transient failures:

  • exponential backoff
  • random jitter
  • max attempts
  • retry budgets (per job type)
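
A minimal sketch of the delay calculation, assuming a 1-second base, a 60-second cap, and a limit of 5 attempts (all illustrative defaults). It draws the delay uniformly between zero and the exponential ceiling, a variant sometimes called full jitter.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based), with jitter."""
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0.0, ceiling)


def should_retry(attempt: int, max_attempts: int = 5) -> bool:
    """True while the job still has attempts left."""
    return attempt < max_attempts
```

The cap keeps late attempts from waiting for hours, and the randomness spreads out workers that all failed at the same moment.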

Dead-letter on poison work

After max attempts:

  • send to DLQ
  • alert humans
  • capture enough context to reproduce

If your queue retries but your consumer isn’t idempotent, you don’t have a reliability feature.

You have a duplication engine.
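
The dead-letter step can be sketched with an in-memory queue. `MAX_ATTEMPTS`, the message shape, and the `drain` function are assumptions of this example; a real worker would also delay each requeue instead of retrying immediately.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_ATTEMPTS = 3  # assumed limit for this sketch


def drain(queue: Deque[Dict], handler: Callable[[Dict], None]) -> List[Dict]:
    """Process every message; return the dead-lettered ones with failure context."""
    dead_letters: List[Dict] = []
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception as exc:
            msg["attempt"] += 1
            msg["last_error"] = repr(exc)    # context for whoever investigates
            if msg["attempt"] >= MAX_ATTEMPTS:
                dead_letters.append(msg)     # terminal: stop retrying
            else:
                queue.append(msg)            # requeue for another attempt
    return dead_letters
```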


Designing Idempotency Keys That Don’t Lie

An idempotency key is a stable identifier for “this effect.”

Good keys:

  • are deterministic from the business command (or provided by the caller)
  • are unique for the business effect you care about
  • can be enforced with a unique constraint

Bad keys:

  • include timestamps or random numbers (destroying dedupe)
  • are derived from mutable data
  • are scoped too broadly (accidentally deduping distinct work)

Practical key shapes

  • Per external request: Idempotency-Key: <uuid> (client-generated)
  • Per business command: charge:<orderId>:<paymentAttemptNumber>
  • Per event: invoiceIssued:<invoiceId>:v1
  • Per side effect: emailReceipt:<orderId>

When you control the API, client-provided idempotency keys are a superpower.

When you don’t, deterministic server-side keys can still work — but be intentional about scope.
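
Two ways to build a deterministic server-side key, as a sketch. The helper names are hypothetical; the second one hashes a canonical JSON form of the command, which only works if the payload contains stable business fields and no timestamps or random values.

```python
import hashlib
import json


def charge_key(order_id: str, payment_attempt: int) -> str:
    """Deterministic key for one charge attempt on one order."""
    return f"charge:{order_id}:{payment_attempt}"


def payload_key(prefix: str, payload: dict) -> str:
    """Key derived from a canonical JSON form of the command payload."""
    # sort_keys makes the key independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```

The explicit form is easier to read in logs and makes the scope obvious; the hashed form is useful when no single business ID identifies the effect.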


The “Inbox” and “Outbox” Reality (Why DB Transactions Matter)

Many async bugs are atomicity bugs: your DB write and your queue publish are two separate operations, and either one can succeed without the other.

Common failure:

  • You write the DB record
  • You fail to publish the message
  • The job is now “stuck forever” unless you have reconciliation

The classic fix is the Outbox pattern:

  • In the same DB transaction as your business change, write an outbox row.
  • A dispatcher reads outbox rows and publishes them to the queue.
  • Publishing becomes retryable and observable.
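
A minimal outbox sketch using Python's built-in `sqlite3`. The table layout and the `place_order` and `dispatch` functions are assumptions for illustration; `publish` stands in for whatever client your queue provides.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(order_id: str) -> None:
    """Write the business row and its outbox row in one transaction."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def dispatch(publish: Callable[[str], None]) -> int:
    """Publish pending rows in order; mark each one only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # may raise; the row stays pending and is retried later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the dispatcher crashes between publishing and marking the row, the message is published again on the next run. The outbox guarantees at-least-once publishing, so consumers still need to dedupe.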

Similarly, on the consumer side, the Inbox pattern (or “dedupe log”) ensures you can safely process duplicates:

  • record message_id / job_id as processed
  • enforce uniqueness
  • apply business logic once

When you need strong reliability, treat the database as the source of truth for “what must happen”, and treat queues as the delivery mechanism, not the truth.

Retries Without Melting Your Dependencies

Retries need discipline, not enthusiasm.

Here’s a practical retry policy that scales:

Exponential backoff + jitter

Avoid synchronized retry storms. Always add randomness.

Retry budget

Cap total retry volume per service / per dependency to avoid self-DDoS.

Circuit breakers

If a dependency is down, stop hammering it. Fail fast and recover later.

Timeouts are mandatory

No timeout = infinite hanging = worker pool collapse.
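
The circuit breaker idea can be sketched in a few lines. This is a simplified version under stated assumptions: the class names, the threshold of 3 consecutive failures, and the 30-second cool-down are all illustrative, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is considered down."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked down; failing fast")
            self.opened_at = None  # cool-down elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

While the circuit is open, workers fail immediately without touching the dependency, which leaves it room to recover.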

A simple rule for architects:

Retries shift load from now to later — they don’t eliminate load.
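
One way to cap that shifted load is a retry budget. The sketch below models it as a token bucket; the `RetryBudget` class and its parameters are hypothetical, and a shared budget across many workers would need a central store instead of process memory.

```python
import time


class RetryBudget:
    """Token bucket: each retry spends one token; tokens refill over time."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_spend(self) -> bool:
        """Return True if a retry is allowed right now."""
        now = time.monotonic()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When `try_spend()` returns False, the job should go to its terminal failure path instead of adding another call to a dependency that is already struggling.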

Your system must have:

  • a maximum attempts strategy
  • a DLQ or terminal failure state
  • a replay mechanism that is intentional (not accidental)

Idempotency in the Real World: Three Scenarios

1) Payments

You must dedupe at the provider boundary.

  • Use provider idempotency keys if available.
  • Store provider charge ID keyed by your idempotency key.
  • If a retry happens, return the stored charge ID/result.
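
A sketch of that flow. `FakeProvider` is a stand-in written for this example and does not model any real payment API; it only assumes the provider returns the same charge for a repeated idempotency key. The `charge_ids` dict stands in for a database table.

```python
from typing import Dict


class FakeProvider:
    """Stand-in for a payment API that honors idempotency keys (hypothetical)."""

    def __init__(self) -> None:
        self.charges: Dict[str, str] = {}   # idempotency key -> charge ID

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = f"ch_{len(self.charges) + 1}"
        return self.charges[idempotency_key]


charge_ids: Dict[str, str] = {}  # our own record: idempotency key -> charge ID


def charge_order(provider: FakeProvider, order_id: str, attempt: int, amount_cents: int) -> str:
    key = f"charge:{order_id}:{attempt}"
    if key in charge_ids:                 # retry after we stored the result
        return charge_ids[key]
    charge_id = provider.charge(key, amount_cents)  # safe to repeat: same key
    charge_ids[key] = charge_id
    return charge_id
```

The local lookup handles the common retry. The provider-side key handles the dangerous one, where the worker crashed after the charge succeeded but before the result was stored.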

2) Emails / Notifications

Users don’t forgive duplicates.

  • Store a “sent” record with a unique constraint: unique(notification_type, recipient, business_id).
  • Don’t rely on “the email provider will dedupe.” It won’t.

3) Inventory / Fulfillment

Duplicate processing causes missing stock (the same reservation applied twice) or double shipments.

  • Make reservation operations idempotent (“reserve X for order Y”).
  • Use unique constraints on reservation IDs.
  • Separate “reserve” from “ship” and track state transitions carefully.

When duplicates hurt, you need idempotency at the side-effect boundary and state transitions that are safe to replay.
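
The inventory scenario can be sketched with in-memory state. The `stock` and `reservations` dicts stand in for database tables, and the sketch assumes one reservation per order; all names are illustrative.

```python
from typing import Dict

stock: Dict[str, int] = {"sku-1": 5}
reservations: Dict[str, dict] = {}  # order_id -> {"sku", "qty", "state"}


def reserve(order_id: str, sku: str, qty: int) -> bool:
    """Reserve stock for an order; replaying a successful reservation is a no-op."""
    if order_id in reservations:
        return True                      # already reserved: no second decrement
    if stock.get(sku, 0) < qty:
        return False
    stock[sku] -= qty
    reservations[order_id] = {"sku": sku, "qty": qty, "state": "RESERVED"}
    return True


def ship(order_id: str) -> bool:
    """Move RESERVED -> SHIPPED; return True only if this call did the transition."""
    res = reservations.get(order_id)
    if res is None or res["state"] != "RESERVED":
        return False                     # unknown or already shipped: do nothing
    res["state"] = "SHIPPED"
    return True
```

Because `ship` only acts on the RESERVED state, a replayed “ship” message cannot trigger a second shipment.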

Observability: If You Can’t See the Queue, You Don’t Own It

Async systems fail silently unless you build a control panel.

You need to measure:

  • Queue depth (how many messages are waiting)
  • Age of oldest message (latency of the backlog)
  • Processing rate (messages/sec)
  • Failure rate (by reason)
  • Retry counts (and retry distribution)
  • DLQ volume (and reason codes)
  • Idempotency hits (how many duplicates you prevented)

The best signal isn’t “queue size.” It’s usually age of oldest message. A queue can be small and still “stuck” if nothing is draining.
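
A small sketch of the two backlog metrics, computed from the enqueue timestamps of waiting messages. The function names and the 300-second alert threshold are assumptions; most queue services expose an equivalent metric directly.

```python
import time
from typing import Iterable, Optional


def queue_health(enqueued_at: Iterable[float], now: Optional[float] = None) -> dict:
    """Summarize a backlog from enqueue timestamps (epoch seconds) of waiting messages."""
    now = time.time() if now is None else now
    stamps = list(enqueued_at)
    oldest_age = now - min(stamps) if stamps else 0.0
    return {"depth": len(stamps), "oldest_age_seconds": oldest_age}


def is_stuck(health: dict, max_age_seconds: float = 300.0) -> bool:
    """Alert on age, not size: a small queue that never drains still trips this."""
    return health["oldest_age_seconds"] > max_age_seconds
```

A queue with two messages where the oldest has waited over 16 minutes is flagged, while a deep queue that drains quickly is not.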

Tooling Notes (Frameworks Don’t Save You Here)

Different queues have different semantics:

  • task queues (Cloud Tasks, Sidekiq, Celery) feel “job-like”
  • brokers (RabbitMQ) give routing primitives
  • logs (Kafka) give ordering and replay

But the architecture rules stay constant:

  1. assume duplicates
  2. design idempotency
  3. make retries explicit and bounded
  4. make outcomes durable and observable

November takeaway

A queue is not reliability.

Idempotency is reliability. Queues just deliver the opportunity for failure — and for recovery.


Resources

The Outbox Pattern (concept)

A transaction-safe way to ensure “DB write + message publish” doesn’t create ghost jobs or lost events.

Idempotency Keys (practice)

The simplest, most scalable strategy for safe retries around external side effects (payments, emails, webhooks).

Dead-letter Queues (practice)

Your “poison pill” containment unit: a place where work goes when it cannot be safely auto-retried.

Backoff + Jitter (practice)

The difference between “retries help” and “retries create an outage.”


What’s Next

This month was about making async systems honest:

  • retries are inevitable
  • duplicates are normal
  • correctness is something you design for

Next month we scale the conversation from “one system” to “many systems”:

Microservices vs Modular Monolith

Because async boundaries are often the first crack in a monolith…
and the first regret in a microservice migration.
