The background job infrastructure decision record: why the job queue you chose determines your retry semantics and your dead-letter visibility
Background jobs look like a simple pattern — enqueue work, run workers in parallel, emit a success or failure status. This framing conceals the architectural decisions embedded in every job system: the retry policy that determines how long the system keeps trying before admitting defeat; the dead-letter strategy that determines whether a failed job is isolatable and diagnosable or simply disappears; the deduplication approach that determines whether a billing charge executes once or twice; the concurrency model that determines whether an interactive user-facing job waits behind hundreds of batch jobs. Most teams discover these decisions not during the design phase, but when a critical job fails silently under production load and the revenue or compliance consequence surfaces three days later.
A SaaS subscription platform processes billing for Pro and Team customers on the first of each month. The billing system enqueues one job per customer, a pool of workers processes them, and the payment processor is called within each job. On a Sunday afternoon in March, one customer's account has a malformed payment method record: the currency field is null rather than "usd". The null originated from a payment method update endpoint that accepted partial JSON objects without validating required fields, a validation gap that had never surfaced because this was the first customer to exercise that specific code path.
The billing job picks up the malformed record, passes it to the payment processor SDK, and receives a serialization error before the network request is made. The job fails. It retries after 5 seconds. Fails again. Retries after 25 seconds. Fails again. After the third retry fails the same way, the default retry count is exhausted. The job is marked failed. It does not route to a dead-letter queue, because the team had not configured one. The failure is logged to the application error log. No alert fires. The error log is reviewed only when something is obviously wrong, and nothing else is obviously wrong.
The customer continues using the product. They receive no invoice. No charge is attempted again until the next billing cycle in April. No one notices in March because the March billing run completed successfully for every other customer. The accounts receivable summary for March looks normal — the single missing charge is 0.2% of monthly revenue, well below any variance threshold anyone is watching.
In April, the same failure recurs. The same malformed record. The same retry exhaustion. The same silence. A junior accountant reconciling Q1 bookings notices in mid-April that two months of billing are missing for this customer. Investigation reveals the malformed payment method record, the retry exhaustion in the job logs, and the absence of any alert or dead-letter queue that would have surfaced the failure when it first occurred. The customer had been using the product for two billing months without being charged.
The fix takes 20 minutes: add the null check, correct the payment method record, re-trigger the billing job for both months. The cost was not the fix — it was the six weeks of invisible failure, the manual reconciliation work to discover it, and the investigation to determine whether other customers had similar issues (three did, with similar validation gaps in different input paths). The decisions that created this situation were never written down: no DLQ, no DLQ alert, no retry count policy, no monitoring for job failure rates by job class.
Why background job infrastructure is an architectural decision, not just a worker pattern
Every background job system embeds at least five architectural decisions that are invisible in the "enqueue work, run workers" description. Each becomes visible only when the system behaves incorrectly under a failure condition that the team did not anticipate during the initial implementation.
The retry policy is a statement about the expected failure modes. A retry count of 25 under Sidekiq's default backoff says: "we expect this job to encounter transient failures that will resolve within 21 days." A retry count of 3 says: "we expect transient failures to resolve within seconds." Neither is correct for every job class. The retry behavior embedded in the default configuration of every job framework (Sidekiq's 25 retries over roughly 21 days, BullMQ's default of zero retries, SQS's redelivery of an undeleted message until the retention period expires when no redrive policy is configured) is not a policy derived from analysis of the jobs running in the system. It is an arbitrary default that becomes the policy for the entire system unless explicitly overridden. The retry count determines how long the system attempts a failing job before routing it to dead-letter; if the retry count is wrong, either legitimate transient failures route to dead-letter after seconds, or permanent data errors spend days retrying before anyone is alerted.
The dead-letter queue is only valuable if it is monitored and actionable. A DLQ that receives failed jobs but triggers no alert is a jobs graveyard — it prevents infinite retry loops but provides no operational leverage. A DLQ that triggers an alert but lacks per-job metadata (error message, stack trace, retry count at failure, job arguments, enqueue timestamp, worker that processed it) requires log archaeology to diagnose the failure. A DLQ alert that pages on-call but has no defined replay procedure results in an engineer who knows a job failed but cannot fix and reprocess it without custom scripting. Each of these gaps converts the DLQ from an operational resource into a bookkeeping mechanism.
The deduplication policy determines correctness for jobs that represent business operations. A billing charge job enqueued twice because two processes both observed that a charge was due produces two charges to the customer's payment method. Idempotency in the job's processing logic is the correct defense — but only if it is implemented explicitly. The framework does not provide idempotency for application-level operations; it provides at-most-once or at-least-once delivery for the job envelope, which is a different guarantee. At-least-once delivery means the framework may deliver a job more than once (on worker restart, on network partition, on visibility timeout expiration in SQS). The job's processing logic must be idempotent independently of the delivery semantic. That contract must be documented per job class, because the idempotency implementation depends on the specific operation: a billing charge is idempotent via an idempotency key sent to the payment processor; a report generation job is idempotent via a check against a completed_reports table; a webhook delivery is idempotent via a deduplicated log of delivered event IDs.
The concurrency model determines the latency SLA for interactive jobs. A single shared queue with a single worker pool processes all jobs in arrival order. A user-triggered report generation job queued behind 8,000 notification dispatch jobs waits 40 minutes if each notification job takes 3 seconds and there are 10 workers (8,000 × 3 s ÷ 10 workers = 2,400 s). The latency is not a bug in the report generation code; it is the direct consequence of the concurrency model. Separate queues with dedicated worker pools per SLA tier (interactive, batch, maintenance) are the correct model for systems where different job classes have different latency requirements, but this requires a deliberate decision about queue topology and worker allocation, not the default of "one queue, one worker pool."
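The shared-queue wait can be estimated with simple arithmetic. The sketch below assumes uniform job durations and fully busy workers; the function name and parameters are illustrative, not part of any framework.

```python
def estimated_wait_seconds(jobs_ahead: int, seconds_per_job: float, workers: int) -> float:
    """Approximate wait for a newly enqueued job on a shared FIFO queue.

    Assumes every job ahead takes the same time and all workers stay busy.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return jobs_ahead * seconds_per_job / workers


# 8,000 three-second jobs ahead of the report job, drained by 10 workers.
wait = estimated_wait_seconds(jobs_ahead=8_000, seconds_per_job=3, workers=10)
wait_minutes = wait / 60  # 2400 seconds, i.e. 40 minutes
```

Doubling the worker count only halves the wait; moving the interactive job to its own queue removes the backlog from the calculation entirely.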
The PII handling policy determines compliance coverage for the job store. Job arguments are serialized to the job store — Redis, a PostgreSQL jobs table, SQS message bodies — and retained for the job's lifetime and for the DLQ's retention window. If job arguments contain personal data (user email addresses, payment card details, health information), that data is stored in systems that may not be covered by the primary data retention and erasure procedures. The data retention decision record specifies how long customer data is retained in the primary database. The background job infrastructure ADR must specify whether the job store is covered by the same retention policy, or whether a separate policy applies to job argument data.
Job queue technology selection
The job queue technology determines the operational characteristics available to the system: the throughput ceiling, the retry expressiveness, the DLQ built-in support, the persistence model, and the failure behavior when the broker node is unavailable. The technology selection is often made by copying a quickstart tutorial or by choosing the technology used in a previous project, without evaluating the structural properties of each option against the job volume and failure-handling requirements of the specific system.
PostgreSQL-backed job queues (Oban for Elixir, GoodJob and Que for Ruby, Hatchet for Node.js) store jobs as rows in a database table and use SELECT ... FOR UPDATE SKIP LOCKED to distribute work across multiple workers without a separate broker. The structural properties of this approach: transactional enqueue (a job can be enqueued in the same database transaction as the business state change that triggers it, eliminating the race condition where the business state is committed but the job is not), no additional broker infrastructure (the job queue shares the existing database connection pool and backups), and persistence durability equal to the primary database. The throughput ceiling is approximately 50–200 jobs per second per database connection depending on job duration and query complexity, which is high enough for most application-tier background work; most SaaS products never reach it. The limitation compared to Redis-backed queues is the polling overhead (workers poll the database for new jobs, adding latency unless the framework uses PostgreSQL's LISTEN/NOTIFY mechanism for fast delivery) and the database connection pool pressure under high worker concurrency. Oban, GoodJob, and Que all use LISTEN/NOTIFY to reduce polling latency. The database migration strategy ADR must account for the jobs table schema: the jobs table is a high-write table that benefits from partition pruning (Oban Pro offers partition-based job table management) and careful index design on the (state, scheduled_at, queue, id) columns that worker queries use.
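Transactional enqueue is easiest to see in a small example. This is a minimal sketch that uses SQLite from the Python standard library as a stand-in for PostgreSQL; the table and function names are hypothetical, and the simulated crash exists only to show the rollback.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL,
                           status TEXT NOT NULL);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, queue TEXT NOT NULL, args TEXT NOT NULL,
                       state TEXT NOT NULL DEFAULT 'available');
    """
)


def create_invoice_and_enqueue(customer_id: int, fail_after_insert: bool = False) -> None:
    """Insert the invoice row and its charge job in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute(
            "INSERT INTO invoices (customer_id, status) VALUES (?, 'pending')",
            (customer_id,),
        )
        if fail_after_insert:
            raise RuntimeError("simulated crash between state change and enqueue")
        conn.execute(
            "INSERT INTO jobs (queue, args) VALUES ('billing', ?)",
            (json.dumps({"invoice_id": cur.lastrowid}),),
        )


create_invoice_and_enqueue(42)
try:
    create_invoice_and_enqueue(43, fail_after_insert=True)
except RuntimeError:
    pass  # the invoice insert for customer 43 was rolled back with the job

invoice_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
job_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
```

Both counts end at 1: there is never an invoice without its job. With a separate broker the two writes cannot share a transaction, so the same guarantee needs an outbox table or similar mechanism.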
Redis-backed job queues (Sidekiq for Ruby, BullMQ and Bull for Node.js, Celery with Redis broker for Python, RQ for Python) store jobs in Redis data structures and deliver them via blocking pop operations or Redis Streams. The throughput ceiling is significantly higher than PostgreSQL-backed queues (Sidekiq can process tens of thousands of jobs per second per Redis instance) and the delivery latency is sub-millisecond in the common case. The structural risk of Redis-backed queues is data volatility: Redis without persistence configured (AOF or RDB) loses all enqueued jobs on Redis restart or crash. The correct configuration for any production job queue is Redis with AOF persistence at appendfsync everysec, which bounds data loss on crash to roughly one second of writes. Queues that track in-flight work provide stronger delivery guarantees than queues that simply pop a job off a Redis list: a Redis Streams consumer group keeps an entry pending until it is acknowledged, and BullMQ holds a running job in its active list under a lock and returns it to the queue if the worker stalls, whereas a job removed by a plain blocking pop (the basic fetch in open-source Sidekiq) is lost if the worker crashes mid-job. None of this changes the fundamental volatility of Redis memory. Sidekiq's default retry configuration, 25 retries over approximately 21 days, is the most important unexamined structural property of a Sidekiq deployment. Most teams inherit this default without evaluating whether 21 days is the correct retry window for any job class in their system. BullMQ's default of zero retries (jobs are not retried after failure) requires explicit configuration for any job where transient failures are expected. The caching strategy ADR intersects here: if the job queue and the application cache share the same Redis instance, a cache stampede that saturates Redis memory can cause job delivery failures by exhausting the memory available for queued jobs.
Managed cloud job queues (SQS for AWS, Cloud Tasks for GCP, Azure Service Bus) eliminate broker operational overhead at the cost of vendor lock-in and API-level expressiveness constraints. SQS's structural properties: 14-day maximum retention (jobs that have not been processed after 14 days are permanently deleted, with no notification), visibility timeout as the lock mechanism (a job becomes invisible to other workers for the visibility timeout duration after being received; if the worker fails to delete the job within the visibility timeout, it becomes visible again and is redelivered — at-least-once delivery), and the SQS dead-letter queue configured via a redrive policy (after N receives without deletion, the job routes to the DLQ). The visibility timeout must be longer than the longest expected job execution time plus processing overhead; a billing job that takes 30 seconds to process on a slow payment processor call should have a visibility timeout of at least 90 seconds with buffer. SQS has no native job priority or ordered processing for standard queues (SQS FIFO queues provide ordered delivery within a message group but not across the full queue). For systems already deployed on AWS that need background job processing without operational overhead, SQS is a legitimate choice; for systems that need retry expressiveness (per-job retry count, exponential backoff with jitter, DLQ per job class, job prioritization), a framework-level job queue provides the required expressiveness with more configuration control.
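The interaction between the visibility timeout and the redrive policy can be modeled in a few lines. This is a toy in-memory model, not the SQS API; the class and field names are invented for illustration, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0


@dataclass
class VisibilityQueue:
    visibility_timeout: float
    max_receive_count: int
    messages: List[Message] = field(default_factory=list)
    dead_letters: List[Message] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self, now: float) -> Optional[Message]:
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # still locked by an earlier receive
            if msg.receive_count >= self.max_receive_count:
                # Redrive: too many receives without a delete.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg: Message) -> None:
        self.messages.remove(msg)


queue = VisibilityQueue(visibility_timeout=90, max_receive_count=3)
queue.send("charge customer 42")
first = queue.receive(now=0)            # worker takes the job, then crashes
hidden = queue.receive(now=30)          # None: the message is still invisible
second = queue.receive(now=90)          # redelivered once the timeout lapses
third = queue.receive(now=180)
after_redrive = queue.receive(now=270)  # None: moved to the dead-letter list
```

A worker that finishes calls `delete` before the timeout lapses. A visibility timeout shorter than the job's real duration produces the same redelivery as a crash, which is why the timeout needs headroom over the slowest expected run.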
Workflow orchestration (Temporal, Conductor, Prefect) is not a job queue — it is a persistent workflow engine where each workflow is a durable execution that can be suspended, resumed, retried, queried, and signaled. Temporal workflows are deterministic executions whose state is replayed from an event history log; a workflow can pause at an await point, the worker can restart, and the workflow resumes exactly where it left off without re-executing completed steps. This is the correct model for multi-step business processes where intermediate state must be preserved across failures and where compensation (rollback of partially completed steps) is required. It is not the correct model for simple job processing where a job is an atomic unit of work that either succeeds or fails. The operational overhead of Temporal (cluster deployment, history storage, workflow visibility indices) is appropriate for systems where the correctness guarantees of durable execution are necessary; it is disproportionate for systems that need a reliable billing job queue.
Retry policy and backoff strategy
The retry policy answers one question: how long should the system attempt a failing job before concluding that the job cannot succeed with its current arguments in the current system state? The answer depends on the failure type, and failure types are not uniform across job classes or even across failure modes within a single job class.
Two failure types require different retry handling. Transient failures are failures caused by temporary conditions that are expected to resolve without changes to the job arguments or the system: a network timeout to a payment processor, a database connection pool exhaustion during a traffic spike, an external API rate limit that expires after 60 seconds. Transient failures should be retried after a delay, because the condition that caused the failure is likely to have resolved by the time the retry executes. Permanent failures are failures caused by conditions that will not resolve without intervention: a malformed job argument that causes a serialization error, a business rule violation where the job arguments describe an impossible operation (charging a deleted customer, generating a report for a non-existent dataset), a schema mismatch where the job was enqueued by an old worker version with a different argument structure than the current consumer expects. Permanent failures should route to dead-letter immediately, without exhausting the retry budget on attempts that will fail identically.
Most job frameworks provide a single retry count that applies to all failures, without distinguishing between failure types. The correct implementation requires the job to distinguish failure types in its error handling and signal them differently: a PermanentJobError (or its framework equivalent: returning :kill or :discard from a sidekiq_retry_in block in Sidekiq, returning {:cancel, reason} from an Oban worker, throwing UnrecoverableError in BullMQ) that tells the framework to stop retrying and route the job to dead-letter immediately; and standard exceptions that are treated as transient and retried.
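A framework-neutral sketch of that distinction follows. The `PermanentJobError` class, the `run_job` loop, and the `charge_customer` handler are hypothetical; a real worker would delay each retry with backoff instead of looping immediately.

```python
from typing import Callable, Tuple


class PermanentJobError(Exception):
    """The job cannot succeed without a change to its arguments or the system."""


def charge_customer(args: dict) -> str:
    # Validate before calling out: a null currency will never fix itself.
    if args.get("currency") is None:
        raise PermanentJobError("payment method has no currency")
    return f"charged {args['amount_cents']} {args['currency']}"


def run_job(handler: Callable[[dict], object], args: dict, max_retries: int) -> Tuple[str, int]:
    """Return (outcome, attempts). Permanent errors skip the retry budget."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handler(args)
            return "succeeded", attempts
        except PermanentJobError:
            return "dead_letter", attempts  # no retries for permanent failures
        except Exception:
            if attempts > max_retries:
                return "dead_letter", attempts  # transient budget exhausted
            # A real worker would schedule the next attempt after a backoff delay.


malformed = run_job(charge_customer, {"amount_cents": 4900, "currency": None}, max_retries=5)
```

Here `malformed` is `("dead_letter", 1)`: the malformed record reaches the dead-letter path on the first attempt instead of consuming five retries.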
The retry count for transient failures should be derived from the expected resolution time. A payment processor that goes down during a maintenance window typically recovers within 15–30 minutes. Five retries with exponential backoff (5s, 25s, 125s, 625s, 3125s; about 65 minutes cumulative, of which the final delay alone is about 52 minutes) covers the window and routes the job to dead-letter if the condition has not resolved after roughly an hour, at which point the failure is likely not transient. Ten retries on the same schedule, with each delay capped at one hour, extend coverage to approximately 6 hours, appropriate for external APIs with looser SLAs. Twenty-five retries over 21 days is appropriate for almost no business-critical job: if a job cannot succeed after a day of retrying, the condition is not transient and a human needs to investigate. A 21-day retry window produces a queue of silently failing jobs that are still retrying a week after the initial failure, with no alert, no DLQ visibility, and no human intervention.
Exponential backoff with jitter prevents thundering herd on retry. When 1,000 jobs fail simultaneously during a brief outage and all retry after exactly 5 seconds, the retry spike at second 5 may reproduce the outage that caused the initial failures. Jitter adds randomness to the retry delay — ±10% to ±20% of the backoff duration — so the retry load is spread across a window rather than spiking at a precise instant. The backoff cap prevents retries from growing to absurd intervals: capping at 3600 seconds (one hour) means the fifteenth retry is still within a 24-hour window, not delayed by days. The correct backoff configuration for most business-critical jobs: initial delay 5 seconds, multiplier 5, cap 3600 seconds, jitter ±15%.
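The recommended configuration can be expressed as a short function, which also makes the cumulative retry window easy to check. This is a sketch under the parameters named above (initial 5 seconds, multiplier 5, cap 3600 seconds, jitter ±15%); the function name is illustrative, and applying jitter after the cap is a design choice of this sketch.

```python
import random
from typing import List, Optional


def backoff_delays(
    retries: int,
    initial: float = 5.0,
    multiplier: float = 5.0,
    cap: float = 3600.0,
    jitter: float = 0.15,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delay in seconds before each retry: capped exponential, then jittered."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        base = min(initial * multiplier ** attempt, cap)
        # Jitter is applied after the cap, so a delay may exceed the cap by up to 15%.
        delays.append(base * (1 + rng.uniform(-jitter, jitter)))
    return delays


five = backoff_delays(5, jitter=0.0)        # [5.0, 25.0, 125.0, 625.0, 3125.0]
five_total = sum(five)                      # 3905 seconds, about 65 minutes
eight_total = sum(backoff_delays(8, jitter=0.0))  # 14705 seconds, about 4 hours
```

Summing the delays is the quickest way to confirm that a retry count actually covers the recovery window it is meant to cover.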
The retry count must be documented per job class, not inherited from the framework default. A billing charge job and a search index update job have different failure tolerance and different expected recovery windows. The billing charge job should fail fast on permanent errors (malformed arguments, deleted customer) and retry aggressively on transient errors (payment processor timeout) — 8 retries over 4 hours. The search index update job can tolerate extended delays and is idempotent — 15 retries over 24 hours. The framework default applies uniformly to all jobs unless overridden; documenting the retry policy per job class in the ADR makes the policy visible to the engineer who adds a new job class and needs to decide which policy applies.
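One way to keep the per-class policy visible is to make it data that a worker must look up, so that a job class without a documented policy fails loudly instead of inheriting a default. The registry below is a hypothetical sketch; the class names and rationale strings are illustrative, and the numbers are the two examples given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int
    retry_window_hours: float
    rationale: str


RETRY_POLICIES = {
    "BillingChargeJob": RetryPolicy(
        max_retries=8,
        retry_window_hours=4,
        rationale="payment processor outages usually clear within the hour",
    ),
    "SearchIndexUpdateJob": RetryPolicy(
        max_retries=15,
        retry_window_hours=24,
        rationale="idempotent and tolerant of delay",
    ),
}


def policy_for(job_class: str) -> RetryPolicy:
    """Look up the documented policy; refuse to fall back to a framework default."""
    try:
        return RETRY_POLICIES[job_class]
    except KeyError:
        raise LookupError(
            f"{job_class} has no documented retry policy; add one before deploying"
        ) from None
```

Calling `policy_for("WebhookDeliveryJob")` raises until someone decides, and records, what the policy for that class should be.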
Dead-letter visibility and the operational procedure
The dead-letter queue is the destination for jobs that have exhausted their retry budget or have been explicitly routed as non-retryable. Its value as an operational resource depends on three properties: whether it is monitored with alerting that fires when jobs arrive, whether each failed job carries sufficient metadata for diagnosis without access to the application logs, and whether the team has a defined and tested replay procedure.
DLQ depth as a metric must be separated by queue and by criticality. A DLQ depth alert threshold of "greater than zero" is correct for business-critical queues where any failed job represents a billing failure, a compliance event, or a data integrity problem. A threshold of "greater than 10" is correct for high-volume queues where a small number of failed jobs is expected noise. An aggregate DLQ depth across all queues is not actionable: it cannot distinguish a single failed billing charge from 1,000 failed low-priority notifications. The monitoring configuration must specify the alert threshold per queue, the on-call routing (does a failed billing job page the on-call engineer immediately, or is it batched into a morning report?), and the SLA for first human response.
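Per-queue thresholds and routing fit naturally into a small rule table. The sketch below is illustrative only: the queue names, the routes, and the choice to page on any unknown queue with dead letters are assumptions, not a feature of a particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DlqAlertRule:
    threshold: int  # alert when depth is strictly greater than this
    route: str      # "page" or "morning_report"


ALERT_RULES = {
    "billing": DlqAlertRule(threshold=0, route="page"),
    "notifications": DlqAlertRule(threshold=10, route="morning_report"),
}

# Assumption: a queue nobody classified is treated as critical until someone does.
DEFAULT_RULE = DlqAlertRule(threshold=0, route="page")


def evaluate(dlq_depths: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (queue, route) for every queue whose DLQ depth breaches its rule."""
    alerts = []
    for queue_name, depth in sorted(dlq_depths.items()):
        rule = ALERT_RULES.get(queue_name, DEFAULT_RULE)
        if depth > rule.threshold:
            alerts.append((queue_name, rule.route))
    return alerts


alerts = evaluate({"billing": 1, "notifications": 7})
```

With these depths only the billing queue alerts, and it pages; an aggregate depth of 8 would have hidden that distinction.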
Per-job metadata is what makes the DLQ diagnosable. A failed job that carries only the exception class and the first line of the stack trace requires the diagnosing engineer to find the full stack trace in the application log, correlate it to the specific job ID, and then find the job's arguments in a separate log entry or in the dead-letter queue record. This process works at low volume when the engineer has context; it fails at 3am when the on-call engineer has no prior context and the application log is 50GB of mixed-class output. The minimum metadata for a dead-letter job: the unique job ID (for correlation with application logs), the job class and queue name, the job arguments (with PII fields redacted per the argument serialization policy), the full exception class, message, and stack trace, the retry count at the time of routing to dead-letter, the timestamp of the first enqueue, the timestamp of each retry attempt, and the worker instance ID that processed the final attempt. Frameworks provide most of this natively: Sidekiq's job['error_message'], job['error_class'], job['error_backtrace'] (when backtrace capture is enabled), job['retried_at'], job['failed_at']; BullMQ's job.failedReason, job.stacktrace, job.attemptsMade. The missing field in most default configurations is argument redaction, which must be configured explicitly if job arguments contain PII.
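The minimum metadata list maps directly onto a record builder. This is a sketch with hypothetical field names and a hypothetical set of redacted keys; it redacts only top-level argument keys, so nested payloads would need a recursive version.

```python
import traceback
from datetime import datetime, timezone

# Hypothetical list of argument keys that hold personal data.
REDACTED_KEYS = {"email", "card_number", "ip_address"}


def redact(args: dict) -> dict:
    """Replace the values of known PII keys (top level only)."""
    return {k: "[REDACTED]" if k in REDACTED_KEYS else v for k, v in args.items()}


def dead_letter_record(job: dict, exc: BaseException, worker_id: str) -> dict:
    """Build a self-contained DLQ entry from a failed job and its final exception."""
    return {
        "job_id": job["id"],
        "job_class": job["class"],
        "queue": job["queue"],
        "args": redact(job["args"]),
        "error_class": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "retry_count": job["retry_count"],
        "enqueued_at": job["enqueued_at"],
        "attempted_at": list(job["attempted_at"]),
        "worker_id": worker_id,
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),
    }


job = {
    "id": "job-123", "class": "BillingChargeJob", "queue": "billing",
    "args": {"customer_id": 42, "email": "user@example.com"},
    "retry_count": 3, "enqueued_at": "2024-03-01T00:00:05Z",
    "attempted_at": ["2024-03-01T00:00:06Z", "2024-03-01T00:00:11Z"],
}
try:
    raise ValueError("currency must not be null")
except ValueError as err:
    record = dead_letter_record(job, err, worker_id="worker-7")
```

The resulting record carries the full trace and the redacted arguments together, so the on-call engineer does not need the application log to begin diagnosis.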
The replay procedure must be defined before the first DLQ failure, not discovered under incident pressure. Replaying a dead-letter job requires answers to three questions: is the job safe to replay (is it idempotent, or could a replay produce a duplicate business operation)?, what is the correct mechanism to re-enqueue it (framework command, manual database update, web UI action)?, and should the replay worker be isolated from the live worker pool so that a batch replay does not compete with live job processing? For Sidekiq, the web UI provides bulk retry from the dead-letter queue; for BullMQ, job.retry() re-enqueues the job; for Oban, Oban.retry_job/1 re-enqueues with a new attempt count. The idempotency requirement for replay is non-negotiable for business-critical jobs: a billing charge job that processes successfully on replay after a previous partial processing attempt must not charge the customer twice. The idempotency implementation — an idempotency key sent to the payment processor, a processed_jobs table with a unique constraint, a state machine precondition — must exist before the job is written, not retrofitted after the first DLQ replay produces a duplicate charge.
DLQ retention policy must balance investigation window and compliance requirements. A 7-day DLQ retention window is long enough to diagnose most failures but may not be long enough for batch billing failures discovered during monthly reconciliation (up to 31 days after the original failure). A 90-day window provides adequate investigation window for all business cycles. DLQ retention intersects with PII compliance — if job arguments contain personal data, the DLQ is a secondary PII store with its own retention schedule. The DLQ retention window must be no longer than the primary data retention policy for the same data category, and the DLQ purge procedure must be included in the GDPR erasure flow.
Job deduplication and the idempotency contract
Deduplication and idempotency are independent properties of a job system, each with independent implementation requirements. The common mistake is to implement one and assume the other: implementing deduplication and assuming that deduplicated enqueue guarantees single execution; or implementing idempotent processing and assuming that idempotency eliminates the need for deduplication.
Content-addressed deduplication derives a unique job key from the job class and arguments. Two enqueue calls with the same class and arguments produce the same key; the second enqueue is rejected if the first job is still in the queue or still processing. This approach is simple and correct for jobs where the arguments fully describe the operation: a search index update for document ID 12345 is the same work regardless of which process enqueued it. The limitation is that the deduplication window must be specified: the key typically exists in the queue store only for the duration of the job's lifetime. After the job completes, the key is no longer present, and a subsequent enqueue of the same job with the same arguments is accepted. For billing jobs, this means that a charge for invoice ID 7890 in March is deduplicated within the March billing run, but not deduplicated against a re-run of the March billing job in April, by which time the original job has completed and its key has expired. Content-addressed deduplication addresses the "enqueueing the same work twice simultaneously" failure mode; it does not address the "re-running billing for a prior period" failure mode, which requires business-key idempotency.
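Both the mechanism and its limitation show up in a minimal in-memory sketch. The `content_key` function and `DedupingQueue` class are hypothetical; the key is a SHA-256 hash of the canonical JSON of the job class and arguments.

```python
import hashlib
import json


def content_key(job_class: str, args: dict) -> str:
    """Stable key for (class, args); key order in args does not matter."""
    canonical = json.dumps(
        {"class": job_class, "args": args}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupingQueue:
    """Rejects an enqueue while an identical job is queued or running."""

    def __init__(self) -> None:
        self._in_flight = {}

    def enqueue(self, job_class: str, args: dict) -> bool:
        key = content_key(job_class, args)
        if key in self._in_flight:
            return False
        self._in_flight[key] = (job_class, args)
        return True

    def complete(self, job_class: str, args: dict) -> None:
        # The key disappears with the job, and so does the protection.
        self._in_flight.pop(content_key(job_class, args), None)


q = DedupingQueue()
first = q.enqueue("ChargeInvoiceJob", {"invoice_id": 7890})
duplicate = q.enqueue("ChargeInvoiceJob", {"invoice_id": 7890})  # rejected
q.complete("ChargeInvoiceJob", {"invoice_id": 7890})
rerun = q.enqueue("ChargeInvoiceJob", {"invoice_id": 7890})      # accepted again
```

The final enqueue succeeds because nothing remembers that invoice 7890 was already charged, which is exactly the gap that business-key idempotency closes.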
Business-key deduplication uses an explicit idempotency key tied to the business operation — report_request_id, payment_attempt_id, invoice_charge_month. The key is embedded in the job arguments and checked against a processed-operations store. Oban's unique option and BullMQ's jobId option provide framework-level business-key deduplication. The deduplication period is configurable: Oban's unique: [period: 86_400] deduplicates for 24 hours; a custom implementation using a database unique constraint on (job_class, idempotency_key) deduplicates for the retention window of the constraint table. Business-key deduplication survives key expiration, survives argument serialization changes, and can be queried to determine whether a specific business operation has already been processed.
Idempotency at the processing level is required independently of deduplication because at-least-once delivery means the framework may redeliver a job even if it was previously processed. The implementation depends on the operation: for external API calls, use the provider's idempotency key parameter where one is offered (Stripe's Idempotency-Key header is the familiar example); for database writes, use a unique constraint on the operation's result record (a payment_charges table with a unique constraint on idempotency_key that raises a uniqueness error on the second insert, which the job catches and treats as a successful prior execution); for internal state machine transitions, use a precondition check (before sending the "invoice generated" email, check that the invoice exists and is in the pending_email state; if it is already in email_sent, the job is a no-op). The idempotency implementation must be specified per job class in the ADR, because the correct implementation depends on the specific operation and the external systems it touches.
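The unique-constraint variant looks like this in a minimal sketch. SQLite stands in for the application database, and the key format and function name are hypothetical; only the payment_charges table and its idempotency_key constraint come from the description above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE payment_charges (
        idempotency_key TEXT NOT NULL UNIQUE,
        customer_id     INTEGER NOT NULL,
        amount_cents    INTEGER NOT NULL
    )
    """
)


def record_charge(idempotency_key: str, customer_id: int, amount_cents: int) -> bool:
    """Return True if this call recorded the charge, False if it already existed."""
    try:
        with db:  # commit on success, roll back on error
            db.execute(
                "INSERT INTO payment_charges VALUES (?, ?, ?)",
                (idempotency_key, customer_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        # A prior delivery already recorded this operation: treat as success.
        return False
    return True


# The key names the business operation (customer and billing month), not the job run.
key = "invoice-charge:customer-42:2024-03"
first_delivery = record_charge(key, 42, 4900)
redelivery = record_charge(key, 42, 4900)
row_count = db.execute("SELECT COUNT(*) FROM payment_charges").fetchone()[0]
```

The redelivered job returns without writing a second row. A real billing job would pass the same key to the payment processor so that the external call is protected by the same identity.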
Concurrency, prioritization, and job latency SLAs
The concurrency model is the most common source of job latency problems that users experience as product degradation. A job that takes 3 minutes to complete is not slow if the SLA is "within 10 minutes." The same job is broken if the SLA is "within 60 seconds" and 300 other jobs are ahead of it in the same queue with the same worker pool.
Queue topology by SLA tier is the correct model for systems with heterogeneous job latency requirements. Most background job systems contain at least three distinct SLA tiers: interactive jobs (triggered by user action, latency SLA in seconds — report generation on user request, PDF export, data import initiated by a user), batch processing jobs (scheduled or event-triggered, latency SLA in minutes to hours — billing runs, bulk notification dispatch, scheduled report generation), and maintenance jobs (scheduled cleanup, data retention enforcement, stale session expiration, SLA in hours). Placing all three tiers on a single shared queue with a shared worker pool guarantees that batch jobs starve interactive jobs under load. The Saturday morning when 10,000 scheduled weekly-digest emails are dispatched via the batch tier is the day when a user's interactive report generation request waits 40 minutes.
Worker pool partitioning allocates dedicated workers to each queue tier. In Sidekiq, the queues configuration specifies which queues a worker process serves: --queue interactive --queue default processes interactive and default queues, ignoring batch and maintenance. A separate worker process configured with --queue batch --queue maintenance handles the batch tier without consuming worker capacity from the interactive tier. In BullMQ, separate Worker instances are constructed for separate Queue instances. Worker pool partitioning means that a batch job backlog does not delay interactive jobs — the interactive tier's worker pool is not competing with batch work. The cost of partitioning is resource overhead: dedicated worker processes consume memory even when their queue is empty. The allocation must be sized for the peak load of each tier, not the average, and the sizing must be documented in the ADR alongside the concurrency setting and its rationale.
Sidekiq offers both strict queue priority and weight-based scheduling. When a Sidekiq worker is configured with multiple queues listed in order without weights (--queue interactive --queue batch), Sidekiq drains queues in strict priority: the interactive queue must be empty before the batch queue is processed. Under sustained interactive load where the interactive queue is never empty, batch jobs starve indefinitely. Weight-based scheduling (the numeric suffix in --queue interactive,5 --queue batch,1) checks queues in a weighted random order, so the interactive queue is checked first roughly five times as often as the batch queue, which prevents starvation at the cost of interactive latency guarantees during batch spikes. The correct choice depends on whether the system has interactive jobs with hard latency SLAs (strict priority) or whether all job classes are acceptable at some throughput without hard guarantees (weight-based).
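The difference between the two modes comes down to the order in which a worker checks queues before each fetch. The functions below sketch that idea and are not Sidekiq's implementation; the queue names and weights mirror the example above.

```python
import random
from typing import Dict, List, Optional


def strict_order(queues: List[str]) -> List[str]:
    """Always check queues in the configured order."""
    return list(queues)


def weighted_order(weights: Dict[str, int], rng: random.Random) -> List[str]:
    """Weighted shuffle: heavier queues tend to be checked earlier, none is excluded."""
    remaining = dict(weights)
    order = []
    while remaining:
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        order.append(pick)
        del remaining[pick]
    return order


def next_job(order: List[str], backlog: Dict[str, List[str]]) -> Optional[str]:
    """Take the first available job, checking queues in the given order."""
    for queue_name in order:
        if backlog[queue_name]:
            return backlog[queue_name].pop(0)
    return None


backlog = {"interactive": ["report-1"], "batch": ["digest-1", "digest-2"]}
took_first = next_job(strict_order(["interactive", "batch"]), backlog)   # "report-1"
took_second = next_job(strict_order(["interactive", "batch"]), backlog)  # "digest-1"

rng = random.Random(7)
trials = 6000
batch_checked_first = sum(
    weighted_order({"interactive": 5, "batch": 1}, rng)[0] == "batch"
    for _ in range(trials)
)
```

Under strict ordering the batch queue is reached only when the interactive queue is empty. Under 5:1 weights the batch queue is checked first in about one fetch in six, even while interactive work is waiting.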
Job timeout policy prevents stuck workers from consuming capacity indefinitely. A job that hangs (waiting on an external API that has stopped responding, blocked on a database lock, or looping on a malformed input) holds a worker slot that cannot process other jobs. Sidekiq does not enforce a job timeout by default; a stuck job holds its worker slot until the Sidekiq process is restarted. Bull's timeout job option fails a job after the specified number of milliseconds; BullMQ does not provide a built-in job timeout, so the processor function has to enforce its own deadline. The correct policy: every job class has a defined maximum execution duration, and jobs that exceed it are terminated and routed to dead-letter (as a timeout failure, which may or may not be retried depending on the job class's retry policy). The maximum execution duration must be derived from the expected execution time with a generous multiplier (a job that typically takes 5 seconds might have a timeout of 120 seconds to allow for slow external API calls), not from a framework default. Tracking job execution duration as a metric (job start timestamp to job complete timestamp) surfaces jobs that are approaching their timeout before they breach it.
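Where the framework does not enforce a deadline, the job wrapper can. The sketch below uses asyncio.wait_for, which cancels cooperative async code; it will not interrupt CPU-bound or blocking calls, which need process-level termination instead. The `JobTimeout` class and helper names are hypothetical, and the 0.05-second limit is only there to keep the example fast.

```python
import asyncio
from typing import Awaitable, Callable


class JobTimeout(Exception):
    """Raised when a job exceeds its documented maximum execution duration."""


async def run_with_timeout(job: Callable[[], Awaitable[object]], timeout_seconds: float):
    try:
        return await asyncio.wait_for(job(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise JobTimeout(f"job exceeded {timeout_seconds}s") from None


async def hung_api_call() -> None:
    await asyncio.sleep(60)  # stands in for an external API that never answers


async def main() -> str:
    try:
        await run_with_timeout(hung_api_call, timeout_seconds=0.05)
    except JobTimeout:
        return "dead_letter"  # or retry, per the job class's retry policy
    return "succeeded"


outcome = asyncio.run(main())
```

The worker slot is released after the deadline rather than when the process is next restarted, and the timeout becomes an explicit failure type that the retry policy can treat on its own terms.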
Scheduled jobs and the missed-firing problem
Scheduled jobs — cron-style recurring jobs that fire on a fixed schedule — introduce a distinct set of failure modes that are separate from the failures of on-demand jobs. The two most common are missed firing and double firing.
Missed firing occurs when the scheduler is unavailable at the scheduled fire time. If the worker process that owns the cron scheduler is restarted at 3:00 UTC and an hourly job is scheduled to fire at 3:00 UTC while the process is down, the job does not fire. It runs next at 4:00 UTC, leaving a one-hour gap. Whether this is acceptable depends on the job's SLA: for a backup with a daily recovery-point objective, a one-hour delay is irrelevant; for an hourly audit log aggregation, a missed firing is a compliance gap. The correct handling depends on the business criticality: for business-critical scheduled jobs, a catch-up procedure (run the missed job immediately when the scheduler recovers, keyed on the scheduled time that was missed) is required; for best-effort scheduled work, skipping missed firings is acceptable and should be documented. Catch-up support differs by scheduler and by version, so confirm in the documentation whether the scheduler in use (Oban's cron plugin, Sidekiq-Cron) backfills missed firings or silently skips them, and record the answer in the ADR.
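A catch-up procedure needs to know which scheduled times were skipped. The helper below is a sketch for fixed-interval schedules; it assumes the scheduler persists the time of its last successful firing, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def missed_firings(last_fired: datetime, now: datetime, interval: timedelta) -> List[datetime]:
    """Scheduled times after last_fired, up to and including now, that never ran."""
    missed = []
    due = last_fired + interval
    while due <= now:
        missed.append(due)
        due += interval
    return missed


utc = timezone.utc
last_fired = datetime(2024, 3, 3, 2, 0, tzinfo=utc)   # last run before the outage
recovered = datetime(2024, 3, 3, 4, 20, tzinfo=utc)   # scheduler comes back
to_backfill = missed_firings(last_fired, recovered, timedelta(hours=1))
```

Here `to_backfill` holds the 03:00 and 04:00 firings. Each catch-up job should carry its scheduled time as an argument, so the aggregation covers the right hour and a duplicate backfill can be detected.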
Double firing occurs when two worker instances both believe they are responsible for a scheduled job at the same instant. In a horizontally scaled deployment with multiple worker processes, each process may have a cron scheduler that independently computes the next fire time and enqueues the job. Without a distributed lock or a database-backed unique scheduling mechanism, the same scheduled job fires once per worker process at each scheduled interval. Oban prevents this by using a database-backed exclusive lock per scheduled job; Sidekiq-Cron prevents this by using a Redis distributed lock. Any custom cron implementation that does not use a distributed lock will double-fire at scale. The deduplication policy for scheduled jobs must account for double firing: if the scheduled billing job fires twice because two worker processes both enqueued it, the business-key idempotency in the billing job's processing logic prevents a double charge — but only if the idempotency key is the scheduled month, not the job enqueue timestamp.
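The difference between keying on the scheduled month and keying on the enqueue timestamp can be shown with a small sketch. The key format and the in-memory ChargeLedger are illustrative assumptions; in production the ledger would be a durable store with a uniqueness guarantee.

```python
from datetime import datetime, timezone


def billing_idempotency_key(customer_id, scheduled_for):
    """Derive the key from the scheduled billing month, not the enqueue time."""
    return f"billing:{customer_id}:{scheduled_for:%Y-%m}"


class ChargeLedger:
    """In-memory stand-in for a durable record of keys already charged."""

    def __init__(self):
        self._seen = set()

    def charge_once(self, key):
        """Return True if this call performed the charge, False if it was a duplicate."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# Two scheduler processes both enqueue the March billing job for the same customer.
scheduled_for = datetime(2024, 3, 1, tzinfo=timezone.utc)
ledger = ChargeLedger()
results = [
    ledger.charge_once(billing_idempotency_key("cust_123", scheduled_for))
    for _ in range(2)
]
```

Had the key included the enqueue timestamp, the two enqueues would have produced two different keys and both charges would have gone through.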
Overlapping execution of scheduled jobs that must not run concurrently occurs when a scheduled job's execution time exceeds its fire interval. A database cleanup job scheduled to run every 60 minutes that takes 70 minutes to complete will be enqueued again at the 60-minute mark while the first instance is still running. Two concurrent instances of the cleanup job may process the same records simultaneously, producing duplicate deletions, lock contention, or inconsistent intermediate states. The correct handling is a concurrency limit of 1 for such jobs, for example through Oban's unique: [states: [:scheduled, :executing, :available]] configuration or Sidekiq-Cron's per-job mutex option. The ADR must specify the concurrency limit for each scheduled job class, along with both its expected execution duration and its observed maximum.
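For a custom scheduler, a concurrency limit of 1 reduces to "skip this firing if the previous run has not finished". The sketch below shows that logic with an in-process guard; the SingleFlight and run_exclusive names are hypothetical, and a multi-process deployment would need the same check backed by a shared store such as a database row or a Redis key.

```python
import threading


class SingleFlight:
    """Tracks which job names are currently executing (in-process only)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._running = set()

    def try_start(self, job_name):
        """Claim the job; return False if an instance is already running."""
        with self._guard:
            if job_name in self._running:
                return False
            self._running.add(job_name)
            return True

    def finish(self, job_name):
        with self._guard:
            self._running.discard(job_name)


def run_exclusive(flight, job_name, work):
    """Run work() unless an instance is already running; return (ran, result)."""
    if not flight.try_start(job_name):
        return (False, None)
    try:
        return (True, work())
    finally:
        # Release the claim even if work() raises, so the next firing can run.
        flight.finish(job_name)
```

A skipped firing is worth counting as a metric: a cleanup job that is skipped at every interval is a job whose execution time has outgrown its schedule.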
PII in job payloads and GDPR erasure
Job arguments are persisted in the job store for the job's lifetime, in the DLQ for the dead-letter retention window, and in application logs for the log retention window. If job arguments contain personal data (user email addresses, IP addresses, payment card data, health information), the job infrastructure becomes a secondary store of personal data that must be covered by the GDPR erasure procedure.
The principle for PII-safe job arguments is: embed identifiers, not values. A job that sends a password reset email should carry the user ID as its argument, not the email address. The job fetches the email address from the database at processing time using the user ID. This design means that: the job store contains user IDs, not email addresses; the DLQ contains user IDs in failed job arguments, not email addresses; the application log contains the job class and user ID, not the email address. If the user exercises GDPR erasure and their email address is deleted from the primary database, the previously-enqueued job fails at processing time (because the email address lookup returns null), routes to dead-letter as a permanent failure (the job's error handling classifies null-user-result as a PermanentJobError), and the DLQ entry contains only the user ID — which is no longer linkable to the user after erasure. The alternative — embedding the email address in the job argument — means that the email address survives in the DLQ for the DLQ retention window even after erasure from the primary database, violating the erasure guarantee.
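The embed-identifiers pattern can be sketched as a job that receives only a user ID and resolves the email address at processing time. The users dictionary stands in for the primary database and the deliver callback for the mailer; both are assumptions made for the example.

```python
class PermanentJobError(Exception):
    """Raised for failures that retrying cannot fix."""


def send_password_reset(user_id, users, deliver):
    """Look up the address at processing time; the job argument is only the ID."""
    user = users.get(user_id)
    if user is None:
        # The error carries the identifier only, never the email address.
        raise PermanentJobError(f"user {user_id} not found")
    deliver(user["email"])


users = {42: {"email": "person@example.com"}}  # stand-in for the primary database
sent = []
send_password_reset(42, users, sent.append)

del users[42]  # the user exercises erasure before a second job runs
try:
    send_password_reset(42, users, sent.append)
    outcome = "sent"
except PermanentJobError as error:
    outcome = str(error)
```

After erasure, the only trace the failed job leaves in a DLQ entry or log line is the user ID and the error message.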
Log redaction for job arguments is required when the embed-identifier pattern cannot be applied: for example, when a job processes a payment processor webhook that includes raw card data in its payload, or when a job argument was designed before the PII-safe convention was established. Framework-level argument filtering differs by framework. In Sidekiq, a handler appended to config.error_handlers inside Sidekiq.configure_server, or a custom server middleware, can strip PII fields from the job context before it is logged. In BullMQ, redaction generally has to live in the application's own logging around the processor. Rails ActiveJob can suppress argument logging for a job class by setting log_arguments = false on that class. The redaction configuration must be explicit in the ADR per job class, because a general "filter all PII" policy requires knowing which argument fields contain PII in each job class.
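A per-job-class redaction table, applied before arguments reach the logger, is one way to make that configuration explicit. The job class names and field names below are hypothetical. The sketch fails closed: a job class missing from the table has every argument value redacted.

```python
REDACTED = "[REDACTED]"

# Hypothetical inventory: which argument fields contain PII, per job class.
PII_FIELDS_BY_JOB_CLASS = {
    "PaymentWebhookJob": {"card_number", "billing_email"},
    "ReportExportJob": set(),
}


def redact_args(job_class, args):
    """Return a copy of args that is safe to log for the given job class."""
    pii_fields = PII_FIELDS_BY_JOB_CLASS.get(job_class)
    if pii_fields is None:
        # Unknown job class: no inventory entry, so redact every value.
        return {name: REDACTED for name in args}
    return {
        name: (REDACTED if name in pii_fields else value)
        for name, value in args.items()
    }
```

Failing closed means a newly added job class logs nothing sensitive until someone adds it to the inventory, which also forces the inventory to stay current.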
The GDPR erasure procedure must enumerate the job infrastructure systems. When a user requests erasure, the standard procedure deletes their record from the primary database and from application-level stores. Background job infrastructure is rarely included in the standard erasure checklist. The enumeration must cover: the active job queue (are there enqueued jobs with this user's arguments that will fail when their data is deleted? In Oban, Oban.cancel_all_jobs accepts an Ecto query over Oban.Job, so pending jobs can be cancelled by filtering on the queue and on the user_id in args), the dead-letter queue (are there failed jobs with this user's arguments in the DLQ?), the application log archive (are the user's email addresses present in archived job logs?), and any scheduled jobs that are parameterized per user (recurring report generation for specific users). The erasure procedure must specify how each system is handled: cancellation for pending jobs, purge for DLQ entries, and a scrubbing policy for log archives (usually out of scope for log archives, documented as an exception with the compliance team). The observability strategy ADR intersects here because application logs that capture job arguments for debugging are a PII store with their own retention and erasure implications.
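The enumeration can be expressed as a single erasure step that walks each job store and reports what it did. In this sketch the pending queue and the DLQ are plain lists of dictionaries and the report keys are invented for illustration; a real procedure would call the framework's cancel and purge operations instead.

```python
def erase_user_from_job_stores(user_id, pending_jobs, dead_letter_jobs):
    """Cancel pending jobs and purge DLQ entries whose args reference user_id."""

    def belongs_to_user(job):
        return job["args"].get("user_id") == user_id

    report = {
        "cancelled_pending": [job["id"] for job in pending_jobs if belongs_to_user(job)],
        "purged_dead_letter": [job["id"] for job in dead_letter_jobs if belongs_to_user(job)],
        # Log archives are handled by policy, not by this step.
        "log_archives": "documented exception",
    }
    remaining_pending = [job for job in pending_jobs if not belongs_to_user(job)]
    remaining_dead_letter = [job for job in dead_letter_jobs if not belongs_to_user(job)]
    return remaining_pending, remaining_dead_letter, report


pending = [{"id": "j1", "args": {"user_id": 42}}, {"id": "j2", "args": {"user_id": 7}}]
dead_letter = [{"id": "d1", "args": {"user_id": 42}}]
pending, dead_letter, report = erase_user_from_job_stores(42, pending, dead_letter)
```

Returning a report makes the erasure auditable: the compliance record can show which job entries were removed without retaining their arguments.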
Background job decisions in AI chat history
Background job infrastructure surfaces in AI chat history across four distinct session types, each capturing decisions at a different point in the system's lifecycle.
Initial infrastructure sessions capture the technology selection and the initial configuration: "how do I run background jobs in a Rails app?", "should I use Sidekiq or a PostgreSQL-backed job queue?", "how do I set up a dead-letter queue in BullMQ?", "what is the retry configuration for Sidekiq?". These sessions contain the rejection rationale for alternatives that were not chosen — why Temporal was ruled out ("too much operational overhead for where we are"), why SQS was not used ("we're already paying for Redis, adding a separate broker felt redundant"), why a PostgreSQL-backed queue was chosen over Redis ("we don't want to manage another stateful service"). The initial sessions also contain the first discussion of job failure handling: whether DLQ was configured at the start or treated as a future improvement, whether the default retry count was accepted or overridden, whether job idempotency was discussed or assumed unnecessary. These early conversations capture the assumptions about failure rates and volumes that were used to justify the initial configuration — assumptions that may have been wrong but that have never been revisited.
Job failure incidents are the highest-value recovery target for background job decisions in AI chat history: "our billing job is in the dead-letter queue and I don't know why", "a job has been retrying for 3 days — how do I check what's going wrong?", "we have 10,000 jobs in the dead-letter queue and I need to replay them — is there a bulk retry command?", "a billing job ran twice and charged a customer twice — how do I prevent this?". These sessions capture the actual failure modes encountered in production — the failure types that the initial retry policy was not calibrated for, the DLQ diagnosis workflow that was constructed under pressure, the idempotency gap that produced a duplicate charge. Postmortem sessions in AI chat contain the decisions made under incident pressure: whether to implement idempotency at the job level or at the payment processor level, whether to use a per-job DLQ or a shared DLQ, whether to change the retry count globally or per job class. These are the decisions that should update the background job infrastructure ADR, but they rarely do — they remain as decisions made in a closed chat session, invisible to the engineer who inherits the job system and encounters the same failure six months later.
Scaling sessions capture the concurrency model changes driven by performance degradation: "our background job queue depth is growing and jobs are taking too long to process", "how do I add more Sidekiq workers to handle the load?", "should I use Sidekiq Pro's priority queues or is there a way to do this without the Pro plan?", "we have one queue for everything and the bulk email jobs are blocking our report generation jobs: how do I separate them?", "what is the difference between Sidekiq's -q priority and weighted queues?". These sessions contain the queue topology decisions: why the original single-queue model was chosen and why it was being changed, what alternative concurrency models were evaluated, and what the capacity constraints were at the time. The scaling sessions capture the SLA tiers that were not formalized during initial design: the engineering team's description of "jobs that users are waiting for" and "jobs that run in the background" is the precursor of the interactive/batch/maintenance tier model, constructed reactively under scaling pressure rather than proactively in the initial architecture. Three months of AI chat history contains the scaling discussion that reveals what the team actually understood about their job latency requirements, which is often more nuanced than what made it into the initial design documents.
Compliance sessions capture the GDPR intersections discovered at the first data erasure request or the first security audit: "a user is asking us to delete all their data — what do we need to include beyond the database?", "are job arguments stored in Sidekiq's Redis and how long do they persist?", "we have user email addresses in our Sidekiq job arguments — is that a problem?", "how do we make sure that when we delete a user's data, their pending jobs are cancelled and their queued arguments are cleared?". These sessions reveal the PII exposure that was not anticipated during the initial job design and the remediation decisions made to address it: whether to refactor existing jobs to embed user IDs instead of email addresses, whether to implement argument redaction in job logging, whether to add a DLQ purge step to the erasure procedure. The compliance sessions contain the decisions that should be captured in the background job ADR's PII section, but they are typically made as one-off responses to specific requests rather than as coherent policy — the open-source extractor surfaces these conversations from the AI chat history, where the compliance reasoning is preserved in full but has never been consolidated into a single document that future engineers can find.
Writing the background job infrastructure ADR
The background job infrastructure ADR needs six sections. Each section addresses a distinct set of questions that will be asked by different stakeholders: engineers debugging a job failure at 3am, security auditors reviewing data retention, product engineers adding a new job class, and on-call engineers managing a queue backlog during a traffic spike.
Section 1: Infrastructure selection. The job queue technology chosen (framework name, version, broker if applicable). The evaluation of alternatives with rejection rationale by structural property: throughput ceiling, persistence durability, retry expressiveness, DLQ built-in support, PII persistence behavior, operational overhead, and cost at projected volume. For Redis-backed queues: the Redis persistence configuration (AOF enabled, appendfsync policy, RDB snapshot cadence). For PostgreSQL-backed queues: the jobs table index configuration and partition strategy. The connection pool allocation for the job queue (how many database or Redis connections are reserved for job workers, and how that allocation interacts with the primary application connection pool documented in the database migration strategy ADR).
Section 2: Retry policy and backoff strategy. The exception classification scheme that distinguishes transient from permanent failures, with the exception classes or codes that belong to each category. The retry count for transient failures per job class (not the framework default). The backoff algorithm and parameters: initial delay, multiplier, cap, jitter percentage. The behavior on retry exhaustion: automatic DLQ routing, alert trigger, on-call paging policy. The framework configuration values that implement the policy, so the documented policy can be verified against the running configuration. The documented rationale for the retry count: what failure scenario the count was calibrated for, and what the expected recovery window is.
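The backoff parameters listed in this section (initial delay, multiplier, cap, jitter percentage) combine as in the sketch below. The defaults of 5 seconds and a multiplier of 5 reproduce the 5-second and 25-second delays from the incident described earlier; the one-hour cap, the 10% jitter, and the function and exception names are assumptions.

```python
import random


class TransientJobError(Exception):
    """Failure that may succeed on retry (timeouts, rate limits)."""


class PermanentJobError(Exception):
    """Failure that retrying cannot fix (malformed input, missing record)."""


def backoff_delay(attempt, initial=5.0, multiplier=5.0, cap=3600.0,
                  jitter=0.1, rng=random.random):
    """Delay in seconds before retry number `attempt` (1-based)."""
    base = min(cap, initial * multiplier ** (attempt - 1))
    # Spread retries uniformly within +/- jitter of the base delay.
    offset = (2 * rng() - 1) * jitter * base
    return base + offset


def next_action(error, retries_done, max_retries=3):
    """Decide between another retry and dead-letter routing."""
    if isinstance(error, PermanentJobError) or retries_done >= max_retries:
        return "dead_letter"
    return "retry"
```

Classifying the serialization error from the incident as a PermanentJobError would have sent the job to dead-letter on the first failure instead of spending three retries on an input that could never succeed.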
Section 3: Dead-letter specification. The DLQ structure: per-queue DLQ or shared DLQ, with the rationale. The metadata each dead-letter job must carry for diagnosis without application log access (enumerated per required field). The monitoring alert configuration per queue tier: the depth threshold that triggers an alert, the on-call routing, and the expected time-to-response SLA. The replay procedure: the mechanism for re-enqueueing dead-letter jobs (web UI, framework command, or script), the idempotency guarantee that makes replay safe, and the bulkhead configuration that isolates DLQ replay workers from live job processing. The retention policy per job class: the minimum retention window, the maximum retention window for PII-containing job classes, and the purge mechanism.
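One way to make the required dead-letter metadata checkable is to define it as a record type and validate entries on the way into the DLQ. The field list below is an assumption about what diagnosis without log access needs; the ADR's own enumeration takes precedence.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeadLetterEntry:
    """Hypothetical minimum metadata for diagnosing a dead-letter job."""
    job_id: str
    job_class: str
    queue: str
    args: dict               # identifiers only, per the PII convention
    error_class: str
    error_message: str
    attempts: int
    first_enqueued_at: str   # ISO 8601 timestamp
    failed_at: str           # ISO 8601 timestamp


def missing_fields(entry):
    """Return the names of fields that are empty strings or None."""
    return [name for name, value in asdict(entry).items() if value in ("", None)]


entry = DeadLetterEntry(
    job_id="job_981", job_class="BillingJob", queue="billing",
    args={"customer_id": "cust_123"}, error_class="SerializationError",
    error_message="", attempts=3,
    first_enqueued_at="2024-03-03T14:00:00+00:00",
    failed_at="2024-03-03T14:00:30+00:00",
)
```

Rejecting or flagging entries with missing fields at write time is cheaper than discovering during an incident that the DLQ entry has no error message.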
Section 4: Deduplication and idempotency contract. The deduplication mechanism per job class: content-addressed deduplication (mechanism, key derivation, deduplication window), business-key deduplication (key field name, storage mechanism, deduplication window), or no deduplication (with documented rationale for why duplicate enqueue is acceptable). The idempotency implementation per job class: the specific mechanism (external API idempotency key, database unique constraint, state machine precondition, Redis SET NX) and the evidence that the mechanism has been tested against concurrent execution. The delivery semantic guaranteed by the framework (at-least-once or at-most-once) and the consequence for the idempotency implementation.
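Of the idempotency mechanisms listed, the database unique constraint is the simplest to demonstrate end to end. The sketch uses an in-memory SQLite database; the charges table, its columns, and the charge_once function are hypothetical.

```python
import sqlite3


def open_ledger():
    """Create an in-memory ledger whose primary key enforces one charge per key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE charges (idempotency_key TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    return conn


def charge_once(conn, idempotency_key, amount_cents):
    """Record the charge; return False if the key was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO charges (idempotency_key, amount_cents) VALUES (?, ?)",
                (idempotency_key, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The constraint rejected a duplicate: the job has already run.
        return False


conn = open_ledger()
first = charge_once(conn, "billing:cust_123:2024-03", 4900)
second = charge_once(conn, "billing:cust_123:2024-03", 4900)
row_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Under at-least-once delivery the second execution is expected rather than exceptional, so the duplicate path returns normally instead of raising.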
Section 5: Concurrency model and job latency SLAs. The queue topology: list of queues by SLA tier, with the SLA for each tier (maximum time from enqueue to completion under normal load). The worker pool allocation per tier: number of worker processes, concurrency setting per process, and the rationale for each setting (job duration, memory footprint per concurrent job, database connection pool allocation, external API rate limit). The queue scheduling model (strict priority ordering or weight-based selection, as in Sidekiq's queue configuration) with the rationale. The timeout policy per job class: the maximum execution duration, the termination behavior on timeout (DLQ routing, retry, discard), and the metric used to monitor execution duration against the timeout threshold. The capacity planning assumptions: at what queue depth or at what worker utilization should capacity be added, and what signals the monitoring must provide to make that decision.
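Weight-based selection across SLA tiers can be sketched as a weighted random choice among the queues that currently have work. The tier names come from the interactive/batch/maintenance model discussed above; the 6:3:1 weights and the pick_queue function are assumptions for illustration, not any framework's implementation.

```python
import random

# Hypothetical weights: interactive jobs are picked most often, but batch
# and maintenance queues are never starved completely.
QUEUE_WEIGHTS = {"interactive": 6, "batch": 3, "maintenance": 1}


def pick_queue(weights, depths, rng):
    """Choose the next queue to pull from, among queues with pending jobs."""
    candidates = [name for name in weights if depths.get(name, 0) > 0]
    if not candidates:
        return None
    candidate_weights = [weights[name] for name in candidates]
    return rng.choices(candidates, weights=candidate_weights, k=1)[0]
```

Strict priority ordering would instead always drain the interactive queue first, which gives the best interactive latency but can starve lower tiers during sustained load; the ADR should record which trade-off was chosen.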
Section 6: PII and compliance. The argument serialization convention: which job classes embed user IDs for deferred lookup and which embed values (with documented rationale for exceptions). The log redaction configuration: which argument fields are filtered from application logs per job class. The DLQ retention policy for job classes with PII-containing arguments: the maximum retention window and the purge mechanism. The GDPR erasure procedure for the job infrastructure: the enumeration of systems (active queue, DLQ, log archives, scheduled job parameters) and the step-by-step erasure procedure for each. The scheduled review cadence for the PII inventory — as new job classes are added, the PII inventory must be updated, and the ADR must specify who is responsible for the update and when it occurs.
Background job infrastructure is the part of the system that handles work the user never sees directly — the billing, the indexing, the notifications, the cleanup. It is also the part of the system where failures are most likely to be silent: a job that fails at 3am with no DLQ alert produces no visible error until a human notices the downstream consequence. The retry policy, the dead-letter procedure, the deduplication contract, the concurrency model, and the PII handling — each is a decision that was made at some point, usually in response to a specific failure or scaling constraint, usually in an AI chat session that was closed and never consolidated. The decisions that determine how your system behaves under failure are exactly the decisions that are hardest to reconstruct from code, because the code shows what the policy is, not why it was chosen or what alternatives were rejected. The ADR preserves the reasoning; the open-source extractor recovers the reasoning from the conversations where it was first worked out, before it was lost in the noise of 300 other closed chat tabs.