The task scheduling decision record: why the cron and job queue implementation you chose determines your duplicate execution surface and your missed job detection latency

Cron versus job queue versus event-driven trigger is decided in the first sprint where an engineer asks "how do I run this on a schedule?" — and never documented as a deliberate architecture choice with execution guarantee, distributed lock policy, or missed job detection evaluated. The execution guarantee model determines what happens when the job's host restarts mid-execution or a deployment overlaps with a scheduled firing. The distributed lock policy determines whether adding a second server for availability turns every monthly billing job into a duplicate charge. The missed job detection gap determines whether a nightly report that stops completing is discovered by an alert or by a customer complaint seventy-two hours later. Each structural property is established in the founding session, invisible in the job's source code, and impossible to reconstruct from git history once the engineer who made the choice has moved on.

A 26-person B2B SaaS platform ran their monthly subscription billing cycle as a system cron job on a single application server. The job ran at 1am on the first of each month, queried the database for all active subscriptions due for renewal, called the Stripe Charges API once per subscription, updated the subscription status in the database, and sent a receipt email to the customer. The setup was built in a Friday-evening AI session six months after launch — the founding engineer asked "how do I run a billing job once a month on a Linux server?" and implemented the suggested crontab entry with a small Node.js script. It ran without incident on the single production server for fourteen months.

The incident was triggered by a capacity upgrade. In preparation for an enterprise onboarding that would triple their customer count, the team added a second application server behind the load balancer. The new server was provisioned from the same machine image that produced the first server. The machine image included the crontab entry for the billing job. At 1am on the first of the following month, both servers executed the billing job simultaneously. The Stripe Charges API received two charge requests for each subscription — one from each server — within a 200-millisecond window. Stripe processed both charges for 318 of 340 active subscriptions; 22 subscriptions received only one charge because the second API call arrived while the first was still processing, and Stripe returned an error on the second request for those subscriptions.

The team discovered the duplicate charges at 7:43am when the first customer support ticket arrived. By that point, 318 customers had been charged twice and receipt emails had been sent twice. The billing job implementation had no distributed lock, no deduplication check, and no idempotency key on the Stripe charge creation call. Reversing 318 duplicate charges required a Stripe refund API call per subscription plus a support email to each affected customer explaining the error. The refunds processed over three business days because of Stripe's standard refund processing timeline. Three customers — all enterprise prospects who had not yet signed — did not renew after the incident.

The root cause was not the second server. The root cause was that the cron job's execution model — at-most-once, no distributed coordination, no idempotency — was chosen without documentation. The crontab entry appeared in the machine image because the machine image was the deployment artifact, and nobody had written a policy distinguishing "scripts that must run on every server" from "jobs that must run on exactly one server in the fleet." When the fleet grew from one to two, the implicit constraint became a duplicate billing incident.


A 19-person data analytics SaaS ran a nightly aggregation job that computed the prior day's usage metrics for each customer and populated the analytics dashboard. The job was scheduled for 2:15am, ran for approximately 35 minutes on a normal night, and had been operating correctly for eleven months. The monitoring setup showed CPU and memory utilization for the application servers and database; the nightly job was not registered in any monitoring system — the team considered it a background process, not a monitored service.

On the night of November 14th, the aggregation job started at 2:15am as scheduled. The query volume was higher than usual: a recently onboarded customer with a 90-day data backlog had accumulated a large dataset that the aggregation query had not previously touched. The job's database query, which typically completed in eight minutes, ran for forty-nine minutes on the backlog dataset. At 3:04am, the job's Node.js process had consumed enough of the server's 4 GB of RAM that the kernel's OOM killer sent it SIGKILL. The cron daemon recorded no output, because the crontab entry had redirected stdout and stderr to /dev/null after the team found the nightly output emails from cron annoying. The job left no log entry, no error file, and no database record indicating it had run.

The next morning, customers opened their dashboards and saw usage metrics that stopped updating at November 13th. The first support ticket arrived at 9:22am. The on-call engineer checked the application server logs and found no errors from the previous night because the job's output was redirected to /dev/null. The engineer checked the database for the aggregation job's completion record and found no such record existed — the job had never written one. The only evidence that something had gone wrong was the absence of the expected data.

The investigation took four hours because every step required ruling out a hypothesis from a system that had no visibility into the job's execution. The engineer checked whether the cron daemon was running (it was). The engineer checked whether the crontab entry was present (it was). The engineer ran the job manually and watched it succeed (the backlog dataset had been small enough to complete in 40 minutes when run the following morning because customer usage volume was lower at 10am than at 2am). The engineer eventually found the OOM kill in the system journal (journalctl -k | grep oom) — a log source not checked in the standard investigation flow because application incidents had never produced kernel-level log entries before. The total time from job failure to resolution was 72 hours: the failure at 3:04am on November 14th, the first customer ticket at 9:22am, the diagnosis completed at 1:47pm, the fix deployed and backlog data manually reprocessed at 5:30pm, and a second manual run required the following night after discovering that the first night's reprocessing had not covered the oldest backlog entries.

The root cause was not the OOM kill. The root cause was that the job had no missed-job detection mechanism. An OOM kill is a normal operating system event — jobs will be killed by the OS when memory consumption exceeds system limits. The correct response is a monitored, alerting system that detects when the job did not complete by its expected deadline. The team had monitoring for events that produced observable signals (errors logged, CPU spikes, request failures). The missed aggregation job produced no observable signal. The monitoring gap was established in the founding session when the team added the cron entry and never registered the job in any external system.

The four structural properties that are decided in the founding session

Both incidents — the duplicate billing charges and the invisible missed aggregation — were caused by structural properties established when the first scheduled job was implemented. These properties are not visible in the job's source code. They are visible only in the system context: how many servers the job runs on, what happens when the job is killed mid-execution, what monitoring exists for the job's completion, and whether the job's side effects are idempotent. The founding AI session that answers "how do I run this on a schedule?" establishes all four properties — and closes before any of them are documented.

1. Execution guarantee model: what the scheduler promises when the job starts and when it fails

System cron provides at-most-once execution semantics: the scheduler fires the command at the scheduled time and does not retry if the command exits with a non-zero status code or is killed by the OS mid-execution. A cron job that exits with status 1 at 2:15am does not fire again until the next scheduled time, which may be 24 hours later, or 30 days later for monthly jobs. The at-most-once model is correct for idempotent, non-critical, rerunnable jobs where a missed execution is acceptable until the next scheduled firing.

Message queue-backed job systems provide at-least-once delivery semantics: the job remains in the queue until the worker acknowledges successful completion, and unacknowledged jobs are retried according to the configured retry policy. A job that fails due to a transient database timeout is automatically retried after a backoff delay, without requiring the operator to detect the failure and manually requeue the job. The at-least-once model requires idempotent job implementations because retry produces a second execution of the same job payload. A job that creates a database row without a conflict guard, sends an email without a deduplication check, or calls Stripe's charge API without checking for an existing charge for the same billing period will produce duplicate side effects on retry — identical to the duplicate billing incident caused by the distributed execution failure, but triggered by retry rather than simultaneous execution.

Effectively-once execution — the practical goal for business-critical jobs — requires combining at-least-once delivery with idempotent job logic. Every idempotency key must be designed at the job level: the billing job needs a unique identifier for each (subscription_id, billing_period) pair that can be passed as the Stripe idempotency key and used as a unique constraint in the billing_charges database table. If a job executes twice with the same idempotency key, the second execution produces the same outcome as the first: the Stripe API returns the existing charge, and the database insert returns a conflict on the unique constraint. The idempotency key schema is determined in the founding session — adding it retroactively requires a database migration and a rewrite of every call site in the job.

The execution guarantee and idempotency model must be decided together and documented together, because the retry behavior of the delivery mechanism determines the idempotency requirements for the job implementation. A team that documents "we use Sidekiq for scheduled jobs" without documenting "all jobs must be idempotent and here is the idempotency pattern per side effect type" has documented the delivery mechanism while leaving the idempotency contract implicit — and will discover the gap at the first retry-triggered duplicate.

2. Distributed execution surface: how many servers the job runs on and what prevents simultaneous execution

System cron fires the job on every server where the crontab entry exists. This is not a configuration option — it is the design of the system cron daemon. The cron daemon runs independently on each server, reads its local crontab at the scheduled time, and executes the listed command. There is no cross-server coordination built into the cron daemon. A job that must execute exactly once in a multi-server environment requires an external coordination mechanism.

The standard mechanism is a distributed lock. Before the job logic executes, the process acquires an exclusive lock stored in a shared resource. PostgreSQL advisory locks provide atomic acquisition with automatic release on connection close — a server crash releases the advisory lock when the database connection is dropped, preventing indefinite lock starvation. The acquisition call is SELECT pg_try_advisory_lock(hashtext('job_name')), which returns true if the lock was acquired or false if another process holds it; a false return exits the job immediately. Redis provides atomic lock acquisition with expiry via SET lock_key value NX PX expiry_ms, which atomically sets the key only if it does not exist and applies an expiry, preventing indefinite lock retention if the holder crashes before releasing it.
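The PostgreSQL variant can be sketched as a small wrapper. This assumes conn is an open DB-API connection to PostgreSQL using the %s parameter style (as psycopg does); run_exclusive and its arguments are illustrative names, not part of any library.

```python
from typing import Callable


def run_exclusive(conn, job_name: str, job: Callable[[], None]) -> bool:
    """Run job() only if this process wins the advisory lock for job_name.

    conn is assumed to be an open DB-API connection to PostgreSQL. It must
    stay open while the job runs, because a session-level advisory lock is
    released when the session ends.
    """
    cur = conn.cursor()
    cur.execute("SELECT pg_try_advisory_lock(hashtext(%s))", (job_name,))
    (acquired,) = cur.fetchone()
    if not acquired:
        return False  # another server holds the lock: exit without running
    try:
        job()
    finally:
        # Explicit release on the normal path; a crash releases the lock
        # implicitly when the connection drops.
        cur.execute("SELECT pg_advisory_unlock(hashtext(%s))", (job_name,))
    return True
```

Because the lock belongs to the database session, this sketch is only safe when the job holds a dedicated connection for its whole runtime; a pooler that hands the session to another client between statements would break the guarantee.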

The lock expiry duration is the decision that most teams do not document: it must exceed the maximum job runtime under worst-case conditions (slow queries, high load, large data volume) but must be short enough that a crashed server releases the lock before the next scheduled firing. A billing job that typically runs in 12 minutes might be given a 4-hour lock expiry on a monthly schedule, because the next firing is 30 days away and a long lock duration does not matter. A nightly aggregation job that typically runs in 35 minutes but could run for 3 hours on a large backlog needs a lock expiry longer than 3 hours. If that expiry is 4 hours and the cron schedule is every 6 hours, a server crash just after the 2:15am firing leaves the lock held until 6:15am: any rerun attempted before then exits immediately, and the first scheduled firing that can acquire the lock is 8:15am, producing a data gap visible to customers. Shorten the schedule to every 3 hours and the 5:15am firing is skipped as well, because the lock has not yet expired. The lock expiry is not a configuration detail; it is a documented tradeoff between crash recovery latency and the acceptable data freshness gap.
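The tradeoff is easier to review when the arithmetic is written down. The sketch below assumes an expiring lock (Redis-style) whose holder crashes immediately after acquiring it, and firings that repeat at a fixed interval; the function name and the calendar year are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def first_runnable_firing(acquired_at, lock_expiry, schedule_interval):
    """Return (lock_released_at, first scheduled firing that can take the lock)."""
    released_at = acquired_at + lock_expiry
    firing = acquired_at + schedule_interval
    while firing < released_at:  # these firings find the lock held and exit
        firing += schedule_interval
    return released_at, firing


crash = datetime(2024, 11, 14, 2, 15)

# 4-hour expiry, firing every 6 hours
released, nxt = first_runnable_firing(crash, timedelta(hours=4), timedelta(hours=6))
print(released.strftime("%H:%M"), nxt.strftime("%H:%M"))  # 06:15 08:15

# 4-hour expiry, firing every 3 hours: the 05:15 firing is skipped
released, nxt = first_runnable_firing(crash, timedelta(hours=4), timedelta(hours=3))
print(released.strftime("%H:%M"), nxt.strftime("%H:%M"))  # 06:15 08:15
```

Recording the output of a calculation like this next to the chosen expiry gives the ADR a concrete recovery latency instead of an unexplained constant.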

Database-backed job queues (Oban for Elixir/Phoenix, GoodJob for Rails) address the distributed execution surface differently: scheduled jobs are stored in a database table, and a single worker per job is assigned by a locking mechanism built into the queue library. The queue library's scheduler process fires the job by inserting a job row into the jobs table; the next available worker that successfully locks that row executes the job. Simultaneous execution is prevented by the database row lock, not by a distributed lock the application must implement. The tradeoff is that the queue library's built-in scheduling replaces system cron, requiring the team to commit to the queue library for both job execution and job scheduling, while system cron with a distributed lock lets the team use cron for scheduling while keeping execution in their own process.

Cloud-managed schedulers (AWS EventBridge Scheduler, GCP Cloud Scheduler) fire a target (typically an SQS message, an HTTP endpoint, or a Lambda invocation) at the scheduled time from an infrastructure-managed process that runs outside the application server fleet. The scheduler fires once per schedule rather than once per application server, because it is a managed service and not a process running on every server. Managed schedulers generally document at-least-once delivery, so a rare duplicate firing is still possible and the consumer should remain idempotent. A cloud scheduler is the simplest elimination of the distributed execution surface: the scheduler sends a message or invokes a function once, and the application server that receives that message or invocation handles the execution without competing with other servers. The tradeoff is that the execution is now decoupled from the scheduler: if the SQS consumer fails to process the triggered message, the cloud scheduler has already fired and will not retry; retry must be handled by the SQS queue configuration, not the scheduler. The execution guarantee of the cloud scheduler layer and the execution guarantee of the downstream consumer layer must be documented together.

3. Missed job detection: how the system knows when a job did not run on schedule

Standard application monitoring detects positive events: error log entries, CPU spikes above thresholds, request latency above SLOs, queue depth above alert thresholds. A missed job produces no positive event — it produces an absence. The cron daemon does not emit a log entry when a job is not fired (a missed firing happens when the system was powered off or the cron daemon was not running during the scheduled window). A job killed by the OS produces a kernel journal entry but no application log entry. A job that runs for less than a millisecond because the distributed lock acquisition returned false produces no output by default. An application monitoring system that watches for positive events will not detect any of these scenarios.

Heartbeat monitoring turns a missed job from an absence that nothing observes into a missed check-in that a monitor can alert on. The job sends an HTTP request to a unique URL at the end of each successful execution: curl -s "https://hc-ping.com/job-uuid" > /dev/null. The monitoring service (Cronitor, Healthchecks.io, or a self-hosted equivalent) expects to receive the HTTP request within the configured interval plus grace period. If the grace period expires without receiving the request, the monitoring service sends an alert. The job's completion is now monitored as a positive event (the HTTP request was received) rather than an absence (no log entry was found). The heartbeat URL is a monitoring dependency that must be provisioned, rotated, and documented as part of the job's operational requirements: a heartbeat endpoint that is deprovisioned, or whose URL is rotated without updating the job implementation, produces a false-positive missed job alert for every subsequent firing.
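The same check-in can be made from inside the job process instead of from the crontab line. A minimal sketch, using only the standard library; run_with_heartbeat and the send parameter are illustrative names, and the URL passed in is whatever the monitoring service issued for this job.

```python
import urllib.request
from typing import Callable


def ping(url: str, timeout: float = 10.0) -> None:
    # A plain GET is enough for the monitor to record the check-in.
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(
    job: Callable[[], None],
    heartbeat_url: str,
    send: Callable[[str], None] = ping,
) -> None:
    job()  # if this raises or the process is killed, no ping is sent
    try:
        send(heartbeat_url)
    except OSError:
        # The job itself succeeded. A failed ping will surface as a missed
        # heartbeat alert rather than being reported as a job failure.
        pass
```

Placing the ping after the job body, not before it, is the design choice that matters: a run that is killed at minute 49 never checks in, so the monitor's grace period expires and the alert fires.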

Deadline alerting is the alternative for queue-backed systems: the job scheduler records when each job was due to run, and an alert fires when the job has not started or completed by a deadline relative to its scheduled time. Queue libraries differ in how much of this they provide out of the box. Oban, GoodJob, and the Sidekiq ecosystem each expose some visibility into scheduled and stale jobs, in some cases only in paid tiers, so confirm the specific capability in the library's documentation before relying on it for missed job detection. Self-hosted deadline alerting requires a query against the jobs table: SELECT name, scheduled_at FROM scheduled_jobs WHERE scheduled_at < NOW() - INTERVAL '2 hours' AND status = 'pending', executed by a monitoring cron that fires hourly. The threshold of two hours past scheduled time is a documented SLA: jobs that have not started within two hours of their scheduled time generate an alert.
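The deadline query above can be exercised locally before it is wired to an alert. This sketch assumes a hypothetical scheduled_jobs table with ISO-8601 text timestamps and uses SQLite so that it runs without a server; the sample rows are invented for the demonstration.

```python
import sqlite3
from datetime import datetime, timedelta


def overdue_jobs(conn, now: datetime, threshold: timedelta):
    """Return (name, scheduled_at) for jobs still pending past the threshold."""
    cutoff = (now - threshold).isoformat()
    return conn.execute(
        "SELECT name, scheduled_at FROM scheduled_jobs "
        "WHERE scheduled_at < ? AND status = 'pending' ORDER BY scheduled_at",
        (cutoff,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scheduled_jobs (name TEXT, scheduled_at TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO scheduled_jobs VALUES (?, ?, ?)",
    [
        ("nightly_aggregation", "2024-11-14T02:15:00", "pending"),
        ("cache_warm", "2024-11-14T04:00:00", "pending"),
        ("monthly_billing", "2024-11-01T01:00:00", "completed"),
    ],
)
now = datetime(2024, 11, 14, 5, 0)
print(overdue_jobs(conn, now, timedelta(hours=2)))
# [('nightly_aggregation', '2024-11-14T02:15:00')]
```

Passing now in as a parameter, rather than calling the clock inside the function, is what makes the threshold testable.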

The missed job detection gap is different from the job failure detection gap. A job that runs and fails produces an error log entry and can be monitored via log-based alerting. A job that never runs produces nothing. Teams that believe their log-based alerting covers job failures have typically only covered job execution failures — they have not covered missed firings, lock acquisition skips, or OOM kills that produce no application log entry. The monitoring architecture document that claims "all job failures are alerted" must include a test of the missed-job scenario: block the distributed lock, wait for the scheduled time, verify that an alert fires.

4. Job metadata and diagnosability: what is recorded when the job runs and when it doesn't

A system cron job's execution record is the cron daemon's log, if one exists, plus any stdout/stderr the job produces if the crontab entry does not redirect to /dev/null. The cron daemon on a standard Linux system writes to syslog or journald; the entries typically include the job's start time and command line but, by default, not its exit code, its duration, the number of records processed, or any business-level metric. The absence of a structured execution record is the reason the aggregation incident required four hours of investigation: there was no single source of truth that recorded "this job ran at 2:15am, processed 47 customer accounts, and was killed at 3:04am after 49 minutes."
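A structured execution record does not require a queue library. The sketch below assumes a hypothetical job_runs table and writes the start row before the job body runs, so that even a killed process leaves evidence; SQLite stands in for the application database.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@contextmanager
def recorded_run(conn: sqlite3.Connection, job_name: str):
    """Write a 'started' row before the job body and update it afterwards.

    A run that is SIGKILLed never reaches the update, so its row stays
    'started' with no finished_at, which is itself queryable evidence.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_runs (id INTEGER PRIMARY KEY, "
        "job_name TEXT, started_at TEXT, finished_at TEXT, status TEXT, "
        "records_processed INTEGER)"
    )
    cur = conn.execute(
        "INSERT INTO job_runs (job_name, started_at, status) VALUES (?, ?, 'started')",
        (job_name, _now()),
    )
    conn.commit()  # make the start visible even if the process dies next
    run = {"records_processed": 0}
    status = "failed"
    try:
        yield run
        status = "completed"
    finally:
        conn.execute(
            "UPDATE job_runs SET finished_at = ?, status = ?, "
            "records_processed = ? WHERE id = ?",
            (_now(), status, run["records_processed"], cur.lastrowid),
        )
        conn.commit()


conn = sqlite3.connect(":memory:")
with recorded_run(conn, "nightly_aggregation") as run:
    run["records_processed"] = 47  # updated by the job body as it works
```

With a table like this, "did the job run last night, and how far did it get?" becomes a single query instead of a search through kernel logs.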

Queue-backed job systems record job metadata in the jobs database table: enqueued time, started time, completed time, worker ID, retry count, last error, and any metadata the job implementation writes to the record. An investigation that requires answering "did the billing job run for customer_id=4891 this month?" requires a single database query rather than a log archaeology project. The job record is also the deduplication surface for idempotent jobs: a unique index on (job_class, billing_period, subscription_id) in the jobs table prevents the same job from being enqueued twice for the same work unit, stopping duplicate charges before they reach the payment processor.

PII in job payloads is the missing section in most job queue implementations. The background job infrastructure decision record covers the general pattern, but scheduled jobs have a specific failure mode: scheduled jobs often process data for all customers in a single job, where the payload contains a billing period and account count rather than individual customer identifiers. The job log for a billing job that processes 340 subscriptions should not contain 340 customer email addresses — but the error log for a per-subscription job that failed for customer_id=4891 should contain enough context to diagnose the failure without requiring a cross-reference to the database. The payload design — what identifiers are logged, what values are redacted, what context is available at failure time — is a documentation decision that determines whether an on-call engineer at 2am can diagnose a failing job from the error log or must query the database to understand what the job was trying to do.

Scheduling options and their structural properties

System cron (/etc/crontab, crontab -e, /etc/cron.d/): Fires the command on every server where the entry exists, with no cross-server coordination, at-most-once execution, no built-in retry, and no missed job detection. Correct for stateless maintenance tasks (log rotation, tmp file cleanup, cache warming) that are safe to run on every server and acceptable to skip if a server is down. Incorrect for business-critical jobs that must run exactly once, produce side effects, or require detection when they do not run. The distributed lock must be added by the application, not the scheduler.

Database-backed job queues (Oban for Elixir/Phoenix, GoodJob for Rails): Scheduled job entries stored in the database, executed by workers that acquire a database row lock to prevent simultaneous execution. Provides at-least-once delivery with configurable retry, job metadata persistence in the database, and deduplication via unique constraints. The scheduler and the executor run in the same process (or the same application cluster), reducing operational complexity. The database is the coordination layer, which means job scheduling performance scales with the database's write throughput: a high-volume scheduling scenario (thousands of jobs per minute) can produce lock contention on the jobs table. The queue depth and processing latency for scheduled jobs must be added to the monitoring stack alongside the standard web request SLOs.

Redis-backed job queues (Sidekiq for Ruby, BullMQ for Node.js, Celery with Redis broker for Python): Job delivery backed by Redis sorted sets and lists, with configurable retry, at-least-once delivery, and a worker process separate from the web server. The scheduling primitive stores jobs in a Redis sorted set scored by execution timestamp; the worker polls the sorted set and moves due jobs to the execution queue. Redis-backed queues provide higher throughput than database-backed queues for high-volume scheduling but lose job history when the Redis instance is flushed or replaced — job records that exist only in Redis are not available for audit or diagnosis after a Redis failure. The persistence model (Redis AOF, RDB snapshots, or Redis Cluster replication) must be documented as part of the scheduling architecture because it determines the job record retention window.

Cloud-managed schedulers (AWS EventBridge Scheduler, GCP Cloud Scheduler, Azure Logic Apps scheduler): Fires a target — SQS message, HTTP endpoint, Lambda invocation — at the scheduled time from a managed infrastructure process. Eliminates the distributed execution surface (the scheduler fires once from infrastructure, not from every application server). Does not provide at-least-once delivery for the downstream consumer — the consumer's retry behavior must be configured separately on the SQS queue, Lambda concurrency model, or HTTP endpoint's retry logic. The cloud scheduler and the downstream consumer are decoupled, which means the execution guarantee spans two systems with different configuration surfaces. A missed firing in a cloud scheduler is detectable via CloudWatch metrics (EventBridge Scheduler provides invocation count and failure count metrics per schedule), which is a monitoring advantage over system cron but still requires configuring the alert threshold per schedule rather than relying on default monitoring.

Workflow orchestration (Temporal, Apache Airflow, Prefect): Schedules complex multi-step workflows where each step has its own retry policy, execution guarantee, and timeout. Temporal provides durable execution — a workflow that is killed mid-execution resumes from the last completed activity rather than restarting from the beginning, making it suitable for long-running jobs that process data in stages. Airflow provides a directed acyclic graph (DAG) model for jobs with dependencies between steps, with a built-in scheduler UI, execution history, and manual retry capability. The operational complexity of workflow orchestration (Temporal requires a cluster, Airflow requires a scheduler and worker fleet) is justified for jobs that require durable mid-execution resume, step-level retry policies, or dependency graphs between steps. For a simple monthly billing job that runs as a single atomic unit, workflow orchestration adds operational cost without proportionate benefit.

AI chat sessions where scheduling decisions are made

Task scheduling decisions are made in four types of AI chat sessions, each of which establishes structural properties that are not visible in the job's source code:

The founding implementation session. "How do I run this Node.js script every night at 2am?" The session covers the crontab syntax, the script invocation, and the stdout redirection. It does not cover what happens when the script runs on two servers, what happens when the script is killed mid-execution, or how to detect when the script does not complete. The at-most-once guarantee, the absent distributed lock, and the absent missed job detection are all established here. This is the highest-value session to recover from AI chat history: it contains the rejection reasons (implicit or explicit) for job queues, the deployment context that established the single-server assumption, and the original constraints that justified the cron approach.

The horizontal scaling session. "We're adding a second server for redundancy. What do we need to change?" This session may cover load balancer configuration, session storage, and deployment procedures — but often does not cover scheduled jobs, because scheduled jobs are not explicitly part of the "stateless web server" component that horizontal scaling is typically applied to. The assumption that the second server is identical to the first (same machine image, same crontab) is often not questioned because it is never raised. The duplicate billing incident is the outcome of this session's scope gap. Recovering this session from AI chat history surfaces whether the distributed lock was discussed and deferred, or was never raised.

The incident investigation session. "Our billing job ran twice last night and charged customers twice. How do we prevent this?" This session produces a distributed lock implementation under time pressure, often copied from a pattern that works for the immediate incident but may not document the lock expiry rationale, the idempotency key schema, or the monitoring additions needed to detect future occurrences. The fix is in the code; the reasoning for the lock expiry duration and the idempotency key choice is in the session history. The WhyChose extractor recovers this session — the distributed lock configuration embedded in a 40-minute incident response session that closed when the fix was deployed.

The capacity investigation session. "Our nightly job is getting slower as our customer base grows. How do we speed it up?" This session often produces query optimizations, index additions, and batch size tuning — but may not produce a monitoring addition that detects when the job exceeds its expected runtime and is at risk of timeout or OOM kill. The performance optimization is documented in the code (the new index, the batch size parameter). The missing heartbeat monitor is not documented anywhere.

Writing the task scheduling ADR

The task scheduling ADR has five sections. Each section addresses one structural property that is established at the time the first scheduled job is implemented and difficult to change retroactively.

Section 1: Scheduling mechanism and execution guarantee. Document the scheduling approach (system cron, database-backed queue, Redis-backed queue, cloud scheduler, workflow orchestration) with the alternatives considered and the reason each alternative was rejected. Include the execution guarantee the mechanism provides (at-most-once, at-least-once) and the idempotency requirement this places on job implementations. Include the retry policy: the default retry count, the backoff algorithm, the maximum retry delay, and the distinction between transient failures eligible for retry and permanent failures that should be sent to the dead letter queue without retry. The background job infrastructure ADR covers the retry and dead letter policy in detail; the task scheduling ADR references it for retry behavior and focuses on the scheduling-specific properties: when the job fires, how frequently it fires, and what happens when a firing is missed.
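The retry fields listed for Section 1 can be captured as a small, reviewable object. This is a sketch under assumed defaults: the class names, the numeric values, and the PermanentFailure marker are hypothetical, and jitter is omitted for brevity.

```python
from dataclasses import dataclass


class PermanentFailure(Exception):
    """Hypothetical marker for errors that retrying cannot fix."""


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int = 5
    base_delay_s: float = 30.0
    max_delay_s: float = 3600.0

    def delay_for(self, attempt: int) -> float:
        # attempt is 1 for the first retry; the delay doubles each time, capped.
        return min(self.base_delay_s * 2 ** (attempt - 1), self.max_delay_s)

    def next_action(self, attempt: int, error: Exception) -> str:
        if isinstance(error, PermanentFailure) or attempt > self.max_retries:
            return "dead_letter"
        return f"retry_in_{self.delay_for(attempt):.0f}s"


policy = RetryPolicy()
print([policy.delay_for(n) for n in range(1, 6)])
# [30.0, 60.0, 120.0, 240.0, 480.0]
```

Writing the policy as data makes the ADR's numbers and the running configuration the same artifact, so a change to either shows up in review.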

Section 2: Distributed lock policy. For each scheduled job that must execute on exactly one server in a multi-server fleet, document the lock mechanism (PostgreSQL advisory lock, Redis SET NX, database-backed queue's built-in coordination), the lock key (per-job name, per-job-and-period), the lock acquisition timeout (how long to wait for a lock before giving up and exiting), and the lock expiry (how long the lock holder can hold the lock before the next server is permitted to acquire it). Include the failure mode: what happens if the lock holder crashes mid-job before releasing the lock, and how long the next firing must wait before the lock expires. Include the test: how to verify that simultaneous execution on two servers produces exactly one execution outcome rather than two. The deployment pipeline and disaster recovery procedure both interact with the distributed lock policy when servers are restarted during deployments or failover events — document what happens to a job that holds a lock when its server is replaced.

Section 3: Missed job detection policy. For each scheduled job that has a business-critical completion requirement, document the detection mechanism (heartbeat URL, deadline alerting query, cloud scheduler invocation metric), the expected completion deadline (the job scheduled at 2:15am must complete by 4:30am), the grace period before alerting (15 minutes after the deadline before the alert fires), and the alert destination (PagerDuty on-call rotation, Slack channel, email). Include the monitoring test: how to verify that a job that does not complete by its deadline produces an alert within the grace period. Include the distinction between jobs that require missed job detection and jobs that do not — not every cron job is business-critical, and the monitoring overhead of heartbeat registration for log rotation or cache warming is not justified. The observability strategy ADR documents the alerting infrastructure; the task scheduling ADR documents the per-job thresholds and tests. For cloud schedulers, reference the IaC configuration for the CloudWatch alarm that monitors the EventBridge Scheduler's invocation failure count.

Section 4: Job payload design and idempotency key schema. Document the structure of job payloads: what identifiers are included, what values are excluded (PII, mutable state that can become stale during queue wait), and the idempotency key for each job class. The idempotency key schema must be specific enough to prevent duplicate execution of the same work unit: the billing job's idempotency key is billing:subscription_id:{id}:period:{YYYY-MM}, not a UUID generated at enqueue time. A UUID generated at enqueue time is unique per enqueue event, which means a job enqueued twice has two different idempotency keys and the second enqueue produces a second execution. The idempotency key must be derived from the work unit's identity, not from the enqueue event. Document the deduplication keys for external calls as well: the Stripe charge idempotency key, the email deduplication key, the Slack message deduplication key. Where the external API accepts no idempotency key of its own, the key has to be recorded and checked on the application side before the call is made. These keys must be derived from the same work unit identity as the job's database idempotency key, or the job-level deduplication does not prevent the downstream duplicate. The API schema decision and authentication token handling interact with job payloads when jobs make outbound API calls using short-lived credentials that may expire during queue wait time.
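
The difference between a key derived from the work unit and a key generated at enqueue time is easiest to see side by side. A minimal sketch follows; billing_idempotency_key and external_call_key are hypothetical helper names, and the suffix convention for downstream keys is an assumption, not something any external API prescribes.

```python
import uuid
from datetime import date


def billing_idempotency_key(subscription_id: str, period_start: date) -> str:
    """Key derived from the work unit: one subscription, one billing period."""
    return f"billing:subscription_id:{subscription_id}:period:{period_start:%Y-%m}"


def external_call_key(work_unit_key: str, target: str) -> str:
    # Assumed convention: downstream keys reuse the work unit identity plus a suffix.
    return f"{work_unit_key}:{target}"


if __name__ == "__main__":
    period = date(2024, 6, 1)
    first_enqueue = billing_idempotency_key("sub_123", period)
    second_enqueue = billing_idempotency_key("sub_123", period)
    # Same work unit, same key, so the second enqueue can be recognised as a duplicate.
    print(first_enqueue == second_enqueue)
    # Enqueue-time UUIDs differ on every call, so they cannot deduplicate anything.
    print(uuid.uuid4() == uuid.uuid4())
    print(external_call_key(first_enqueue, "stripe-charge"))
```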

Section 5: Dead letter and failure escalation policy. Document the per-job retry budget (how many retries before the job moves to the dead letter queue), the dead letter queue retention period, the alert threshold for dead letter depth (how many dead-lettered jobs before an alert fires), the on-call procedure for dead-lettered jobs (inspect payload, diagnose failure, manually retry or discard), and the jobs that are exempt from retry because their failure mode is non-retryable (a job that fails because its source data was deleted should not retry indefinitely against data that no longer exists). Include the idempotent replay procedure: if a billing job is dead-lettered, the on-call engineer's procedure to manually retry it must include a verification step that confirms the subscription was not already charged for that billing period. Without that step, manual replay produces the duplicate charge that the idempotency key was designed to prevent. The data retention policy covers how long job records and dead letter entries are retained; the task scheduling ADR documents the operational procedure for acting on them while they exist.
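
The retry budget, the non-retryable exemption, and the replay verification step can be written down as three small decisions. This is a sketch, not a queue library's API: FailurePolicy, after_failure, dead_letter_alert, and safe_to_replay are hypothetical names, and sending non-retryable failures straight to the dead letter queue (rather than discarding them) is an assumption.

```python
from dataclasses import dataclass


class NonRetryableError(Exception):
    """A failure that retrying cannot fix, e.g. the source data was deleted."""


@dataclass(frozen=True)
class FailurePolicy:
    max_attempts: int   # retry budget before the job is dead-lettered
    alert_depth: int    # dead-lettered jobs that trigger an alert


def after_failure(policy, attempt, error):
    """Return 'retry' or 'dead_letter' for a failed attempt (attempt is 1-based)."""
    if isinstance(error, NonRetryableError):
        return "dead_letter"  # assumption: skip the budget and let a human decide
    return "retry" if attempt < policy.max_attempts else "dead_letter"


def dead_letter_alert(policy, dead_letter_depth):
    # Alert once the number of dead-lettered jobs reaches the documented threshold.
    return dead_letter_depth >= policy.alert_depth


def safe_to_replay(idempotency_key, charged_keys):
    # Replay verification: refuse if this work unit was already charged.
    return idempotency_key not in charged_keys


if __name__ == "__main__":
    policy = FailurePolicy(max_attempts=3, alert_depth=1)
    print(after_failure(policy, 1, TimeoutError("upstream timeout")))
    print(after_failure(policy, 1, NonRetryableError("subscription deleted")))
    print(safe_to_replay("billing:subscription_id:sub_123:period:2024-06", set()))
```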

None of these five sections appear in the "how do I run this on a schedule?" founding session that adds the first crontab entry. The founding session answers the immediate question — how to run a script at a scheduled time — and closes when the cron entry is working. The execution guarantee, the distributed lock policy, the missed job detection, the idempotency key schema, and the dead letter procedure are the operational requirements of a scheduling system that runs correctly at scale and detects its own failures. They are not advanced optimization concerns. They are the properties that determine whether adding a second server causes duplicate billing, whether a nightly OOM kill is discovered by an alert or by a customer, and whether an incident response engineer at 2am can find the information they need in the job record or must spend four hours in log archaeology. The WhyChose extractor surfaces the founding session, the incident response session, the horizontal scaling session, and the capacity investigation session from AI chat history; the task scheduling ADR takes the scheduling choices buried in those sessions and converts them into a documented execution guarantee, a distributed lock policy with documented expiry rationale, a missed job detection mechanism with tested alert thresholds, and a dead letter procedure that the on-call engineer can execute at 2am without waking the founding engineer.

FAQs

Why does running cron jobs on multiple servers produce duplicate executions, and what is the correct mechanism to prevent it?

System cron fires the job on every server where the crontab entry exists. The cron daemon has no cross-server awareness — it fires the listed command independently on each host at the scheduled time. For jobs that write to a database, send emails, or charge payment cards, simultaneous execution on two servers produces duplicate side effects.

The standard prevention mechanism is a distributed lock. Before the job logic executes, the process attempts to acquire an exclusive lock stored in a shared resource. Session-level PostgreSQL advisory locks provide atomic acquisition with automatic release when the session ends (a server crash releases the lock once the database notices that the connection has dropped, which can lag the crash itself). Redis SET with the NX and PX options provides atomic acquisition with expiry, preventing indefinite lock retention if the holder crashes. The lock expiry duration must exceed the maximum expected job runtime but must be short enough that a crashed server releases the lock before the next scheduled firing. This tension is the documented tradeoff that most teams omit.
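
The expiry tension reduces to a two-sided bound that can be checked mechanically. A small sketch, assuming the three durations are known for the job; check_lock_expiry is a hypothetical helper, not part of any lock library.

```python
from datetime import timedelta


def check_lock_expiry(expiry: timedelta,
                      max_runtime: timedelta,
                      firing_interval: timedelta) -> list:
    """Return the problems with a proposed lock expiry; an empty list means it fits."""
    problems = []
    if expiry <= max_runtime:
        problems.append("expiry can lapse mid-run, letting a second server start")
    if expiry >= firing_interval:
        problems.append("a crashed holder would still block the next scheduled firing")
    return problems


if __name__ == "__main__":
    hourly = timedelta(hours=1)
    # A 20-minute job on an hourly schedule leaves room for a 30-minute expiry.
    print(check_lock_expiry(timedelta(minutes=30), timedelta(minutes=20), hourly))
    # A 70-minute job on an hourly schedule has no valid expiry at all.
    print(check_lock_expiry(timedelta(minutes=75), timedelta(minutes=70), hourly))
```

When the maximum runtime approaches or exceeds the firing interval, no expiry satisfies both bounds, and the record should say which side the team chose to give up.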

Queue-backed schedulers (Oban and GoodJob, which store jobs in PostgreSQL, and Sidekiq Cron, which stores them in Redis) address the distributed execution surface at the framework level: scheduled jobs live in a shared datastore, and the library's coordination ensures that only one worker claims each job before executing it. That coordination replaces the application-level distributed lock, at the cost of committing to the queue library for both scheduling and execution.

What is the difference between at-most-once, at-least-once, and effectively-once execution semantics for scheduled jobs, and which does system cron provide?

System cron provides at-most-once semantics on each host: it fires the command at the scheduled time and does not retry if the command exits with a non-zero status code or is killed by the OS. A job that fails at 2:15am does not run again until the next scheduled firing, which may be 24 hours or 30 days later. The guarantee is per host, not per fleet, which is why the same crontab entry on two servers yields two executions.

Message queue-backed systems provide at-least-once delivery: the job remains in the queue until the worker acknowledges successful completion, and unacknowledged jobs are retried. At-least-once delivery requires idempotent job implementations — executing the same job payload twice must produce the same outcome as executing it once. Jobs that create rows without conflict guards, send emails without deduplication checks, or charge payment cards without idempotency keys will produce duplicate side effects on retry.

Effectively-once (the business-level goal) requires combining at-least-once delivery with idempotent job logic. Every idempotency key must be derived from the work unit's identity — billing:subscription_id:{id}:period:{YYYY-MM} — not from a UUID generated at enqueue time. A UUID generated at enqueue time is unique per enqueue event, which means a job enqueued twice produces two different idempotency keys and two executions.

What is the missed job detection problem, and why does standard server monitoring not solve it?

Standard monitoring detects positive events: error logs, CPU spikes, request latency above SLO. A missed job produces no positive event — it produces an absence. The cron daemon does not emit a log entry when a job is not fired. A job killed by the OS produces a kernel journal entry but no application log entry. A job that skips execution because the distributed lock was held produces no output by default. Standard monitoring will not detect any of these scenarios.

Heartbeat monitoring turns job completion into an expected signal whose absence can be detected: the job sends an HTTP request to a unique monitoring URL at the end of each successful execution. The monitoring service (Cronitor, Healthchecks.io) expects the request within the configured interval plus grace period and alerts when it is overdue. Queue-backed systems enable deadline alerting: a query against the jobs table detects jobs that were scheduled but not completed by a deadline, generating an alert from the queue's own records. Neither approach is automatic: heartbeat monitoring requires adding the HTTP call to every job implementation, and deadline alerting requires configuring per-job thresholds and running the monitoring query.
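
Adding the heartbeat call can be as small as a wrapper around the job's entry point. A minimal sketch, assuming the monitoring service accepts a plain HTTP GET on a per-job URL; run_with_heartbeat and ping_url are hypothetical names, and the URL shown is a placeholder rather than a real endpoint.

```python
import urllib.request


def ping_url(url, timeout=10):
    # Plain HTTP GET; the real URL is issued by whichever monitoring service is used.
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(job, heartbeat_url, ping=ping_url):
    """Run the job and ping the monitor only after it finishes without raising."""
    result = job()  # an exception here propagates and skips the ping entirely
    try:
        ping(heartbeat_url)
    except OSError:
        # A failed ping shows up as a missed heartbeat, which errs toward alerting.
        pass
    return result


if __name__ == "__main__":
    sent = []
    # Inject a recording ping so the example runs without network access.
    print(run_with_heartbeat(lambda: "report written",
                             "https://monitor.invalid/ping/nightly-report",
                             ping=sent.append))
    print(sent)
```

Placing the ping after the job body is the point of the design: a run that is killed, crashes, or never starts sends nothing, and the monitor's overdue timer does the detecting.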