
System Design HLD Example: Distributed Job Scheduler

A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.

Abstract Algorithms · 31 min read

TLDR: Design a distributed job scheduler for one-time and recurring (cron) jobs at scale. This article covers the full solution: job definition storage with next_fire_time indexing, Redis sorted set for sub-millisecond trigger lookup, optimistic locking to prevent double-firing across scheduler instances, exponential backoff retry with a dead-letter queue, and the Quartz Scheduler clustering model in Java.

Netflix sends 150 million push notifications every day at specific scheduled times. If their job scheduler misfires, 150 million users get a notification at 3am instead of 9am. Worse, a double-fire bug where the same notification fires twice causes the kind of user complaint storm that wakes up an on-call team at midnight. A correct distributed scheduler has two invariants that must hold simultaneously: never miss a scheduled job and never fire the same job twice. These two goals pull against each other: the safest way to avoid missing a job is to fire it eagerly (possibly twice), and the safest way to avoid double-firing is to fire conservatively (possibly missing the window). The design challenge is threading that needle at millions of jobs per day.

Designing a job scheduler teaches you a precise lesson about distributed coordination: the problem of "which scheduler instance fires this job?" is structurally identical to the problem of "which payment service instance processes this transaction?", and the solution (optimistic locking with an idempotency key) is the same.

By the end of this walkthrough you'll know why a Redis sorted set with score = next_fire_time is the correct trigger index (O(log N) lookup vs O(N) table scan), why optimistic locking on the jobs row beats a distributed Redis lock for most workloads (one fewer network hop, no lock expiry edge case), and why at-least-once delivery with an idempotency key is almost always the right trade-off over exactly-once (which requires distributed transactions across the scheduler, queue, and worker).

📖 What a Job Scheduler Must Coordinate: Actors and Use Cases

Actors

| Actor | Role |
|---|---|
| Client Application | Calls POST /jobs to schedule a job; polls GET /jobs/{id} for status |
| Job API Service | Validates requests; writes job definitions; computes first next_fire_time |
| Job Store (PostgreSQL) | Source of truth for all job definitions and lifecycle state |
| Trigger Engine (Scheduler Service) | Polls for due jobs; acquires execution lease; dispatches to queue |
| Job Queue (Kafka / SQS) | Decouples trigger from execution; buffers jobs under worker lag |
| Worker Pool | Consumes queue; executes job payload; writes execution history |
| Execution History Service | Records each execution attempt: start time, end time, status, error |

Use Cases

  • Schedule a one-time job – run a task at a specific future timestamp (e.g., send a welcome email 5 minutes after signup)
  • Schedule a recurring job – run on a cron expression (e.g., 0 9 * * MON-FRI = 9am on weekdays)
  • Cancel a scheduled job – mark a pending job as cancelled before it fires; no execution should occur
  • Pause and resume a recurring job – temporarily halt recurrence without deleting the job definition
  • Check job status – GET /jobs/{id} returns current status and last execution result
  • View execution history – GET /jobs/{id}/executions returns all historical runs with outcome
  • Retry a failed job – automatic exponential backoff on execution failure; manual retry via API

Read and write paths are covered separately below so that bottlenecks and consistency boundaries stay explicit.

This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.

🔍 Functional Requirements: Scope, Non-Goals, and API Contracts

In Scope

  • One-time job scheduling – POST /jobs with run_at timestamp and payload; returns jobId
  • Recurring job scheduling – POST /jobs with cron_expr (Quartz-format cron string); scheduler computes next_fire_time
  • Job lifecycle management – cancel, pause, resume via PATCH /jobs/{id}/status
  • Execution history – every run attempt recorded with status, duration, and error; retained 90 days
  • Retry on failure – automatic exponential backoff; configurable max retries; dead-letter queue after exhaustion
  • Idempotent execution – idempotency_key on each job; workers skip execution if the key was already processed

Out of Scope (v1 Boundary)

  • DAG workflow orchestration – dependency graphs between jobs (Job B runs only after Job A succeeds) are the domain of Apache Airflow, not a cron-as-a-service scheduler. Adding inter-job dependencies to v1 bloats the design without serving the core use case.
  • Real-time event-driven triggers – jobs triggered by a Kafka event or webhook are a distinct system (an event processor). Our scheduler is time-driven, not event-driven.
  • Sub-second precision scheduling – our target is ±1 second, not sub-millisecond. High-frequency scheduling at millisecond precision requires a different architecture (e.g., an in-memory timing wheel).
  • Interactive workflow UI – Airflow-style DAG visualisation is out of scope; a status API is sufficient.

Functional Breakdown

| Feature | API Contract | Key Decision |
|---|---|---|
| Create job | POST /jobs → { jobId, status: 'scheduled', nextFireAt } | Compute next_fire_time from cron at write time, not at query time |
| Cancel job | PATCH /jobs/{id}/status with { status: 'cancelled' } → 200 | Atomic CAS update; prevent cancelling an already-fired job |
| Get status | GET /jobs/{id} → full job record with last_fired_at, status | Read from PostgreSQL; no Redis needed on the read path |
| Trigger due jobs | Internal: scheduler polls the Redis sorted set every 1s | ZRANGEBYSCORE returns all jobs with score ≤ now_ms |
| Dispatch to queue | Scheduler publishes { jobId, payload, idempotencyKey } to Kafka topic jobs.execute | Worker validates the idempotency key before execution |
| Record execution | Worker inserts into the executions table after each run | Async; never blocks the worker's next job pick-up |

Initial building blocks:

  • Job API Service – validates inputs, computes next_fire_time, writes to PostgreSQL, adds to the Redis sorted set
  • Trigger Engine – polls the Redis sorted set; acquires a row-level optimistic lock; publishes to Kafka
  • Worker Pool – Kafka consumer; validates the idempotency key; executes the payload; writes execution history
  • Job Store (PostgreSQL) – jobs table + executions table; source of truth for all state
  • Redis – sorted set as a secondary index for fast trigger lookup; optional distributed lock for scheduler coordination

A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.

⚙️ Non-Functional Requirements: Timing, Throughput, and Durability Targets

| Dimension | Target | Why it matters |
|---|---|---|
| Timing accuracy | Fire within ±1 second of next_fire_time at 1M jobs/hour peak | SLA for time-sensitive tasks (push notifications, payment retries) |
| Throughput | 100M scheduled jobs/day; 1.2M jobs/hour burst at midnight | Daily batch jobs cluster at midnight; the scheduler must survive the thundering herd |
| Availability | 99.9% for the trigger engine (44 min/month downtime budget) | A dead scheduler means no scheduled tasks run until recovery |
| Durability | Zero job definitions lost on scheduler restart or DB failover | Jobs must survive full restarts; Redis is a secondary index, not the source of truth |
| Exactly-once execution | At-least-once delivery + idempotency key on the worker side | True exactly-once across scheduler + queue + worker requires distributed transactions; an idempotency key achieves the same result without the cost |
| Operability | Job fire latency p99, DLQ depth, retry rate, scheduler lag | Scheduler lag (time between next_fire_time and actual dispatch) is the primary SLO signal |

Non-functional requirements are where many designs fail in practice. "Fire within ±1 second" is a concrete, testable SLO. "Fire reliably" is not.

🧠 Deep Dive: Capacity Estimation, Data Model, and Trigger Engine Performance

Internals: Data Model

Two tables carry the entire system's state: jobs (definitions and lifecycle) and executions (run history).

-- Source of truth for all scheduled jobs
CREATE TABLE jobs (
    job_id           UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    owner_id         UUID         NOT NULL,
    cron_expr        TEXT,                          -- NULL for one-time jobs
    run_at           TIMESTAMPTZ,                   -- NULL for recurring jobs
    payload          JSONB        NOT NULL,
    status           TEXT         NOT NULL DEFAULT 'scheduled'
                                  CHECK (status IN ('scheduled','running','completed',
                                                    'failed','cancelled','paused')),
    next_fire_time   TIMESTAMPTZ  NOT NULL,         -- precomputed; indexed for polling
    last_fired_at    TIMESTAMPTZ,
    idempotency_key  TEXT         NOT NULL UNIQUE,  -- prevents double-execution
    retry_count      INT          NOT NULL DEFAULT 0,
    max_retries      INT          NOT NULL DEFAULT 5,
    created_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    updated_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

-- Index enabling fast "what jobs are due right now?" query
CREATE INDEX idx_jobs_next_fire_time
    ON jobs (next_fire_time)
    WHERE status = 'scheduled';

-- Execution history: one row per attempt
CREATE TABLE executions (
    execution_id     UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id           UUID         NOT NULL REFERENCES jobs(job_id),
    started_at       TIMESTAMPTZ  NOT NULL,
    finished_at      TIMESTAMPTZ,
    status           TEXT         NOT NULL CHECK (status IN ('running','succeeded','failed')),
    error_message    TEXT,
    attempt_number   INT          NOT NULL DEFAULT 1
);

CREATE INDEX idx_executions_job_id ON executions (job_id, started_at DESC);

Why next_fire_time is precomputed and indexed: A naive scheduler would SELECT * FROM jobs WHERE status='scheduled' and compute cron_expr → next fire time at query time. With 10M scheduled jobs, that's 10M cron expression evaluations per poll cycle – catastrophic. Instead, next_fire_time is computed once at job creation (and again after each successful fire for recurring jobs) and stored as an indexed column. The scheduler's hot query becomes a single range scan:

-- Scheduler poll query: sub-millisecond at any table size with the index
SELECT job_id, payload, idempotency_key, next_fire_time
FROM jobs
WHERE status = 'scheduled'
  AND next_fire_time <= NOW()
LIMIT 500;  -- batch size; prevents thundering herd from locking the whole table

The WHERE status = 'scheduled' partial index ensures this query never scans completed or cancelled rows, keeping the index small even as the executions table grows into the billions.

Data model for Redis (secondary index):

Key: jobs_schedule        (sorted set)
Score: next_fire_time_ms  (Unix timestamp in milliseconds)
Member: job_id            (UUID string)

The Redis sorted set is the trigger engine's primary lookup structure. ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 500 returns the next 500 due job IDs in O(log N + K) time (N = total jobs in set, K = results returned). The PostgreSQL index is the fallback if Redis is unavailable.

Performance Analysis: Trigger Engine at Peak Load

Capacity estimation:

| Signal | Value | Derivation |
|---|---|---|
| Jobs/day | 100M | Given requirement |
| Peak fire rate (midnight burst) | ~1.2M jobs/hour | Stated burst requirement (see NFR table) |
| Scheduler poll interval | 1 second | Target ±1s accuracy |
| Jobs returned per poll | 500 (batch limit) | Prevents a single poll from locking the table |
| Peak dispatch rate | 1,200,000 / 3,600 ≈ 333 jobs/sec | Well within single-scheduler dispatch capacity of 500/sec |
| Job store size (90-day window) | ~9B rows in executions | 100M/day × 90 days; partition by month |
| Redis sorted set size | ~10M entries | Only scheduled jobs; evicted on fire |
| Redis memory for sorted set | ~10M × 40 bytes ≈ 400 MB | UUID (36 bytes) + score (8 bytes) + overhead |

Scheduler throughput ceiling: A single scheduler instance polling every 1 second with a batch of 500 can dispatch ~500 jobs/second. At the 333 jobs/sec midnight burst, one scheduler instance is sufficient. Note, however, that the 100M/day average works out to ~1,157 jobs/sec, so sustaining the full daily volume requires a larger batch size or roughly three dispatching instances. The design supports multiple scheduler instances for availability in any case – but coordination (double-fire prevention) is then required (see the Feature Deep Dive section below).

Thundering herd analysis: At midnight, millions of daily recurring jobs all have next_fire_time ≤ midnight. A naive "fire all due jobs immediately" loop would spike the worker queue and overwhelm external systems. The batch limit of 500 per poll cycle bounds the dispatch rate to 500 jobs/second per scheduler instance, giving the worker pool time to drain the queue at its natural processing rate. The sketch below shows the complementary jitter mitigation applied at job-creation time.
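
To make the jitter idea concrete, here is a minimal Java sketch of how the Job API could spread non-time-critical recurring jobs across a ~2-minute window at creation time. The class name, the applyJitter helper, and the timeCritical flag are illustrative choices, not part of any library or of the article's codebase:

import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

public final class FireTimeJitter {

    private static final long MAX_JITTER_MS = 60_000; // +/- 60 seconds

    // Spread non-time-critical recurring jobs across a ~2-minute window so
    // millions of "0 0 * * *" jobs do not all land in the same second.
    public static Instant applyJitter(Instant nextFireTime, boolean timeCritical) {
        if (timeCritical) {
            return nextFireTime; // time-critical jobs keep their exact fire time
        }
        long offsetMs = ThreadLocalRandom.current().nextLong(-MAX_JITTER_MS, MAX_JITTER_MS + 1);
        return nextFireTime.plusMillis(offsetMs);
    }
}

Applied once at job creation (and at each recurrence computation), this converts one vertical spike into a flat 2-minute plateau without touching the scheduler's hot path.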

📊 High-Level Architecture: From Job Creation to Worker Execution

flowchart TD
    A[Client App] -->|POST /jobs| B[Job API Service]
    B -->|INSERT next_fire_time| C[(Job Store\nPostgreSQL)]
    B -->|ZADD score=next_fire_time_ms| D[(Redis\nSorted Set)]
    E[Scheduler Service] -->|ZRANGEBYSCORE 0 now_ms LIMIT 500\nevery 1s| D
    E -->|UPDATE status=running\nWHERE status=scheduled| C
    E -->|publish jobId + payload\n+ idempotency_key| F[Job Queue\nKafka topic: jobs.execute]
    F --> G[Worker Pool\nauto-scaled]
    G -->|execute payload| H[External Systems\nAPIs, DBs, Services]
    G -->|INSERT execution record| I[(Execution History\nPostgreSQL)]
    G -->|UPDATE status=completed\nor failed + next_fire_time| C
    G -->|ZADD updated next_fire_time\nfor recurring jobs| D
    J[Dead Letter Queue\nKafka: jobs.dlq] -->|after max retries| K[Ops Alert / Manual Replay]
    G -->|after max_retries exhausted| J

This diagram shows the two main paths: the write path (Client → Job API → PostgreSQL + Redis) and the trigger-dispatch-execute loop (Scheduler → Redis → PostgreSQL lock → Kafka → Worker → Execution History). The Scheduler Service does not execute jobs directly – it only dispatches them to Kafka. This separation keeps the scheduler stateless and horizontally scalable: the scheduler's only job is to find due tasks and hand them off reliably.

The Redis sorted set is the trigger engine's fast-path index. PostgreSQL is always the source of truth: if Redis is empty due to a restart, the scheduler falls back to the partial index query on the jobs table. After every successful fire of a recurring job, the worker (or scheduler) computes the new next_fire_time from the cron expression, updates the jobs row, and re-adds the job to the Redis sorted set, closing the recurrence loop.

🔑 Feature Deep Dive: Time-Based Trigger Engine – Polling vs Redis Sorted Set

Two approaches compete for the "find due jobs" operation. The choice has significant performance implications at scale.

Option A – Direct Database Polling

-- Poll every second: fetch all jobs due in this second
SELECT job_id, payload, idempotency_key
FROM jobs
WHERE status = 'scheduled'
  AND next_fire_time <= NOW()
LIMIT 500;

Pros: No additional infrastructure; PostgreSQL is the single source of truth; the partial index makes this fast.

Cons: Every scheduler poll is a database query. At 1 poll/second across 3 scheduler instances, that is 3 queries/second on the jobs table. Acceptable for moderate scale, but the database becomes the hot path. Under heavy write load (new jobs being created while existing ones are being polled), PostgreSQL index contention increases.

Option B – Redis Sorted Set as Secondary Trigger Index

# Job creation: register in the sorted set
ZADD jobs_schedule <next_fire_time_ms> <job_id>

# Scheduler poll every 1 second: fetch due jobs, then remove them
ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 500
ZREMRANGEBYSCORE jobs_schedule 0 <now_ms>

Two separate commands leave a race window between the read and the remove. Worse, ZREMRANGEBYSCORE removes every due member – including any beyond the 500 that were fetched – silently dropping jobs. A Lua script fixes both problems by atomically removing exactly the members it returns:

-- Atomic Lua script: fetch AND remove due jobs in one Redis operation
local now = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local jobs = redis.call('ZRANGEBYSCORE', KEYS[1], 0, now, 'LIMIT', 0, limit)
if #jobs > 0 then
  redis.call('ZREM', KEYS[1], unpack(jobs))
end
return jobs

Pros: O(log N + K) lookup; no database query on the hot poll path; Redis handles 100K+ ops/second trivially.

Cons: Redis is volatile memory – a Redis restart without persistence loses the sorted set. This is acceptable only if PostgreSQL is the source of truth and the sorted set is reconstructed on startup from SELECT job_id, next_fire_time FROM jobs WHERE status = 'scheduled'. A rebuild sketch follows.
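
The cold-start rebuild can be a straightforward scan-and-reinsert pass over the source of truth. A hedged sketch, assuming Spring's JdbcTemplate and StringRedisTemplate beans; the method name and wiring are illustrative:

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.jdbc.core.JdbcTemplate;

// Rebuild the Redis sorted set from PostgreSQL after a Redis restart.
// Runs once on scheduler startup, before polling begins.
public void rebuildScheduleIndex(JdbcTemplate jdbc, StringRedisTemplate redis) {
    jdbc.query(
        "SELECT job_id, next_fire_time FROM jobs WHERE status = 'scheduled'",
        rs -> {
            String jobId = rs.getString("job_id");
            long scoreMs = rs.getTimestamp("next_fire_time").getTime();
            redis.opsForZSet().add("jobs_schedule", jobId, scoreMs);
        });
}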

Comparison

| Property | DB Polling | Redis Sorted Set |
|---|---|---|
| Hot path latency | 1–5 ms (index scan) | < 0.5 ms (ZRANGEBYSCORE) |
| Infrastructure complexity | Low (PostgreSQL only) | Medium (PostgreSQL + Redis) |
| Accuracy at scale | Degrades with DB load | Stable; Redis is single-threaded |
| Durability | ✅ Durable (DB is source of truth) | ⚠️ Volatile; must rebuild on restart |
| Thundering herd control | LIMIT clause on the query | LIMIT in the Lua script |

Recommendation: Use Redis sorted set for the hot trigger path at > 10K jobs/day. Use PostgreSQL polling as the fallback (and for cold start rebuild). Both approaches must be backed by an optimistic lock on the jobs row to prevent double-firing.

🔑 Feature Deep Dive: Preventing Double-Firing Across Scheduler Instances

Running multiple scheduler instances for high availability introduces the most dangerous failure mode in a job scheduler: two scheduler instances both observe the same due job and both dispatch it to the queue.

Why Double-Firing Happens

Time 0: Scheduler A polls Redis → returns [job_123, job_456, job_789]
Time 0: Scheduler B polls Redis → same sorted set → returns [job_123, job_456, job_789]
Time 1: Scheduler A dispatches job_123 to Kafka ✓
Time 1: Scheduler B dispatches job_123 to Kafka ✓  ← double-fire!

Without the Lua script, a race between the ZRANGEBYSCORE read and the ZREM remove lets both instances fetch the same members. The script closes that window, but atomicity at the Redis level is still not enough: the sorted set is rebuilt from PostgreSQL after a Redis restart, the DB-polling fallback can surface the same due job to more than one instance, and the dispatch step itself happens outside Redis. The guard therefore has to live in the source of truth.

Solution 1 – Optimistic Locking on the jobs Row (Recommended)

The scheduler atomically transitions the job status from 'scheduled' to 'running' using a conditional update. Only one instance wins this race; the other sees zero rows updated and discards the job.

-- Optimistic lock: only succeeds for exactly one scheduler instance
UPDATE jobs
SET    status = 'running',
       last_fired_at = NOW(),
       updated_at = NOW()
WHERE  job_id = :job_id
  AND  status = 'scheduled';   -- CAS: this is the guard

-- rowsAffected == 1 → this instance won the lock → dispatch to Kafka
-- rowsAffected == 0 → another instance already claimed this job → discard

Why this works: PostgreSQL row-level locking ensures only one UPDATE wins the race per job_id. The WHERE status = 'scheduled' is the compare-and-swap guard. This is one database round-trip (the cheapest distributed lock you can build).

Solution 2 – Redis SETNX Distributed Lock (For Cross-Service Coordination)

When the scheduler and dispatcher are different services (or when the job must be locked for longer than a single DB round-trip), use a Redis distributed lock:

# Acquire lock for job_123 with 30-second TTL (prevents deadlock if scheduler crashes)
SET scheduler_lock:job_123 <scheduler_instance_id> NX EX 30
# NX = set only if not exists (atomic compare-and-swap)
# EX 30 = auto-expire after 30 seconds (prevents permanent lock on crash)

# Returns "OK" โ†’ this instance holds the lock โ†’ dispatch
# Returns nil  โ†’ another instance holds the lock โ†’ skip this job

# Release lock after dispatch (only release our own lock)
# Lua script: atomic check-then-delete
local current = redis.call('GET', KEYS[1])
if current == ARGV[1] then
  redis.call('DEL', KEYS[1])
  return 1
end
return 0

When to use SETNX vs optimistic lock:

| Criterion | Optimistic Lock (DB UPDATE) | Redis SETNX |
|---|---|---|
| Infrastructure needed | PostgreSQL only | PostgreSQL + Redis |
| Round-trips to lock | 1 (DB UPDATE) | 1 (Redis SET NX) + 1 (DB UPDATE) |
| Lock TTL on crash | Status stays 'running' until manual reset | Auto-expires after the TTL |
| Correctness under high contention | Excellent (DB serialises) | Good (Redis is single-threaded) |
| Preferred for | Single-cluster scheduler (recommended) | Multi-region or cross-service dispatch |

For most deployments, the optimistic DB lock is sufficient and simpler. Reserve the Redis SETNX approach for scenarios where multiple geographically separated scheduler clusters must coordinate.

🔑 Feature Deep Dive: Write Path and Read Path – Job Lifecycle

Write Path: Creating a New Scheduled Job

Client → POST /jobs → Job API Service → PostgreSQL INSERT → Redis ZADD → 202 Accepted

Step-by-step:

  1. Client sends POST /jobs with { cron_expr: "0 9 * * MON-FRI", payload: {...}, owner_id: "..." }
  2. Job API validates the cron expression (reject invalid syntax immediately; avoid storing a job that will never fire)
  3. Compute next_fire_time: parse the cron expression and compute the next fire timestamp from NOW(), using the Quartz or Cron4j library (see the sketch after this list)
  4. Generate idempotency_key: SHA-256(owner_id + cron_expr + payload_hash) – ensures identical jobs submitted twice get the same key, so the second insert is rejected by the UNIQUE constraint
  5. INSERT into jobs with status 'scheduled', the computed next_fire_time, and the idempotency_key
  6. ZADD to the Redis sorted set: ZADD jobs_schedule <next_fire_time_ms> <job_id> – registers the job with the trigger engine
  7. Return 202 with { jobId, nextFireAt }
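
Steps 3 and 4 can lean entirely on existing libraries. A minimal sketch using Quartz's CronExpression for the fire-time computation and java.security.MessageDigest for the key; the class and method names are illustrative:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.text.ParseException;
import java.util.Date;
import java.util.HexFormat;
import org.quartz.CronExpression;

public final class JobCreationHelpers {

    // Step 3: compute the first next_fire_time from a Quartz cron expression
    public static Date firstFireTime(String cronExpr) throws ParseException {
        CronExpression cron = new CronExpression(cronExpr); // throws on invalid syntax (step 2)
        return cron.getNextValidTimeAfter(new Date());
    }

    // Step 4: deterministic idempotency key; an identical re-submission collides
    // on the UNIQUE constraint instead of creating a duplicate job
    public static String idempotencyKey(String ownerId, String cronExpr, String payloadJson)
            throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha256.digest(
            (ownerId + "|" + cronExpr + "|" + payloadJson).getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest); // HexFormat requires Java 17+
    }
}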

Read Path: Scheduler Poll and Worker Execution

Scheduler polls Redis → acquires DB lock → publishes to Kafka
Worker consumes → validates idempotency → executes → updates DB + Redis

Step-by-step (a consolidated dispatch-loop sketch follows the list):

  1. Scheduler polls Redis every 1 second: ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 500
  2. For each due job ID, scheduler attempts the optimistic lock: UPDATE jobs SET status='running' WHERE job_id=? AND status='scheduled'
  3. On lock success (rowsAffected = 1): scheduler publishes { jobId, payload, idempotencyKey } to Kafka topic jobs.execute
  4. Worker consumes from jobs.execute; checks Redis or DB for idempotency key: if already processed, ack and skip
  5. Worker executes the job payload (HTTP call, DB write, email send, etc.)
  6. On success: worker updates jobs.status = 'completed' (one-time job) or computes and updates next_fire_time (recurring job); inserts executions record; ZADD jobs_schedule <next_fire_time_ms> <job_id> for recurring jobs
  7. On failure: worker increments retry_count, schedules retry with exponential backoff, and re-adds to Redis sorted set with score = now_ms + backoff_ms
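
Putting steps 1–3 together, here is a hedged sketch of the scheduler's poll-lock-dispatch loop. It assumes Spring beans for StringRedisTemplate, JdbcTemplate, and KafkaTemplate, plus the Option B Lua script preloaded as a RedisScript; JobRow and loadJob are illustrative stand-ins for the payload lookup:

// Assumed fields: redis (StringRedisTemplate), jdbc (JdbcTemplate),
// kafka (KafkaTemplate<String, JobMessage>), fetchDueJobsScript (RedisScript<List>)
@Scheduled(fixedRate = 1000) // poll every second
public void pollAndDispatch() {
    long nowMs = System.currentTimeMillis();
    // Step 1: atomic fetch-and-remove of up to 500 due job IDs (Lua script from Option B)
    List<String> dueJobIds = redis.execute(fetchDueJobsScript,
            List.of("jobs_schedule"), String.valueOf(nowMs), "500");
    if (dueJobIds == null) return;

    for (String jobId : dueJobIds) {
        // Step 2: optimistic lock - exactly one scheduler instance wins this CAS
        int rows = jdbc.update(
            "UPDATE jobs SET status = 'running', last_fired_at = NOW(), updated_at = NOW() " +
            "WHERE job_id = ?::uuid AND status = 'scheduled'", jobId);
        if (rows == 0) continue; // another instance claimed the job - discard

        // Step 3: hand off to Kafka; the scheduler never executes the job itself
        JobRow job = loadJob(jobId); // illustrative: fetch payload + idempotency_key
        kafka.send("jobs.execute", jobId, new JobMessage(jobId, job.payload(), job.idempotencyKey()));
    }
}

Keying the Kafka message by jobId gives per-job ordering, so a retry of job_123 cannot overtake its original dispatch within a partition.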

🔑 Feature Deep Dive: Retry, Failure Recovery, and the Dead Letter Queue

A job fails when the worker's execution attempt returns an error or times out. The scheduler must retry the job without losing it, without retrying forever, and without retrying so aggressively that it amplifies load on an already-struggling downstream system.

Exponential backoff schedule:

| Attempt | Backoff delay | Cumulative wait |
|---|---|---|
| 1 (original) | – | 0s |
| 2 (retry 1) | 1 second | 1s |
| 3 (retry 2) | 5 seconds | 6s |
| 4 (retry 3) | 30 seconds | 36s |
| 5 (retry 4) | 5 minutes | 5m 36s |
| 6 (retry 5 = max) | 1 hour | 1h 5m 36s |

After max_retries (default 5) is exhausted, the job moves to the dead-letter queue. DLQ depth is monitored and pages the on-call when it exceeds a threshold (e.g., 1,000 entries), indicating a systemic downstream failure.

// Worker: retry-on-failure with exponential backoff and DLQ routing
@KafkaListener(topics = "jobs.execute", groupId = "job-workers")
public void consume(JobMessage msg, Acknowledgment ack) {
    int retryCount = msg.getRetryCount();
    try {
        jobExecutor.execute(msg.getJobId(), msg.getPayload());
        executionRepo.record(msg.getJobId(), "succeeded", retryCount + 1);
        jobRepo.markCompleted(msg.getJobId());
        ack.acknowledge();
    } catch (JobExecutionException e) {
        executionRepo.record(msg.getJobId(), "failed", retryCount + 1, e.getMessage());
        if (retryCount >= msg.getMaxRetries()) {
            // Exhausted retries – route to dead-letter queue
            dlqProducer.send("jobs.dlq", msg);
            jobRepo.markFailed(msg.getJobId());
            ack.acknowledge();
        } else {
            // Schedule retry with exponential backoff
            long backoffMs = backoffDelay(retryCount); // 1s, 5s, 30s, 300s, 3600s
            long retryAt = System.currentTimeMillis() + backoffMs;
            redisTemplate.opsForZSet().add("jobs_schedule", msg.getJobId(), retryAt);
            jobRepo.scheduleRetry(msg.getJobId(), retryCount + 1, Instant.ofEpochMilli(retryAt));
            ack.acknowledge(); // ack original; retry entry is a new sorted set member
        }
    }
}

private long backoffDelay(int attempt) {
    long[] delays = {1_000, 5_000, 30_000, 300_000, 3_600_000};
    return delays[Math.min(attempt, delays.length - 1)];
}

Why DLQ over infinite retries? Infinite retries amplify load on a degraded downstream system. A job stuck retrying every second is effectively running a denial-of-service attack against a recovering service. The DLQ gives ops visibility into permanently failed jobs and enables controlled replay after the downstream system recovers.

DLQ replay after recovery: Ops runs a replay script that reads from jobs.dlq, resets retry_count = 0 on each job record, and re-adds each job_id to the Redis sorted set with score = NOW(). The job gets a fresh retry budget against the now-healthy downstream system.
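
The replay pass can be expressed as a small batch routine. A hedged sketch, assuming a Kafka Consumer subscribed to jobs.dlq and the same jdbc/redis beans as earlier; replayDlq and its wiring are illustrative, not the article's actual tooling:

import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// Ops-triggered DLQ replay: reset the retry budget, re-register in the trigger index
public void replayDlq(Consumer<String, JobMessage> dlqConsumer) {
    ConsumerRecords<String, JobMessage> records = dlqConsumer.poll(Duration.ofSeconds(5));
    for (ConsumerRecord<String, JobMessage> record : records) {
        String jobId = record.value().getJobId();
        // Fresh retry budget against the now-healthy downstream system
        jdbc.update("UPDATE jobs SET retry_count = 0, status = 'scheduled', updated_at = NOW() " +
                    "WHERE job_id = ?::uuid", jobId);
        // Re-register for immediate firing
        redis.opsForZSet().add("jobs_schedule", jobId, (double) System.currentTimeMillis());
    }
    dlqConsumer.commitSync();
}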

🛠️ Quartz Scheduler: How It Handles Distributed Job Scheduling in Java

Quartz Scheduler is the canonical Java library for distributed job scheduling. Its JDBC-backed job store (JobStoreTX) keeps all job definitions and trigger state in a relational database – the same architecture described in this post, with a job table, a NEXT_FIRE_TIME column, and row-level locking for cluster coordination.

Clustering mode: Multiple Quartz nodes share a single database. When a trigger fires, all nodes race to acquire a row-level lock on the QRTZ_LOCKS table. Only one node wins; the others skip the trigger. This is exactly the optimistic locking strategy described above, implemented with an explicit SELECT FOR UPDATE on the locks table rather than a conditional UPDATE.

// Quartz job definition: @DisallowConcurrentExecution prevents the same job
// from running simultaneously on two cluster nodes
@DisallowConcurrentExecution
@PersistJobDataAfterExecution
public class ReportGenerationJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        JobDataMap data = context.getMergedJobDataMap();
        String reportId = data.getString("reportId");
        try {
            reportService.generate(reportId);
        } catch (Exception e) {
            // Quartz will auto-retry based on the trigger's misfire policy
            throw new JobExecutionException(e, /* refireImmediately= */ false);
        }
    }
}

// Spring Boot Quartz configuration: schedule the job on a cron expression
@Bean
public JobDetail reportJobDetail() {
    return JobBuilder.newJob(ReportGenerationJob.class)
        .withIdentity("reportJob", "reporting")
        .usingJobData("reportId", "weekly-summary")
        .storeDurably()
        .build();
}

@Bean
public Trigger reportJobTrigger(JobDetail reportJobDetail) {
    return TriggerBuilder.newTrigger()
        .forJob(reportJobDetail)
        .withIdentity("reportTrigger", "reporting")
        .withSchedule(CronScheduleBuilder.cronSchedule("0 0 9 ? * MON-FRI")
            .withMisfireHandlingInstructionDoNothing()) // skip if missed; don't fire catch-up
        .build();
}

Key Quartz design decisions mapped to our HLD:

Quartz conceptEquivalent in this HLD
QRTZ_JOB_DETAILS tablejobs table
QRTZ_TRIGGERS table with NEXT_FIRE_TIME columnjobs.next_fire_time
QRTZ_LOCKS row-level lockOptimistic UPDATE ... WHERE status='scheduled'
@DisallowConcurrentExecutionidempotency_key prevents double-execution
Misfire handling policyRetry backoff policy + DLQ
JobStoreTX clustering modeMultiple scheduler instances sharing PostgreSQL

For a full deep dive on Quartz configuration, misfire policies, and Spring Boot integration, a dedicated follow-up post is planned.

🌍 Real-World Applications: How AWS, Airflow, and Netflix Solve This in Production

This architecture powers multiple production scheduling systems at scale:

| System | Scale | Key Design Choice |
|---|---|---|
| AWS EventBridge Scheduler | 1 trillion scheduled events/year | Fully managed; serverless trigger engine; per-schedule Lambda invocation |
| Apache Airflow | DAG-based; thousands of tasks/DAG | Extends the scheduler with task dependencies; uses the Celery or Kubernetes executor |
| Quartz Scheduler (Java) | Millions of jobs/day (enterprise) | JDBC-backed; row-level locking for cluster coordination |
| Netflix | 150M notifications/day on schedule | Internal cron scheduler feeding SQS; idempotency key in the notification payload |
| GitHub Actions scheduled workflows | ~10M scheduled workflow runs/day | Cron-syntax scheduler; best-effort ±5 min precision at peak |

AWS EventBridge Scheduler – the production-scale reference: EventBridge Scheduler stores schedules as durable records in its own managed store (not exposed to users). Its trigger engine scans due schedules and invokes targets (Lambda, SQS, SNS, Step Functions) via the target resource's own invocation API. The key insight is that EventBridge Scheduler separates trigger storage (durable, replicated) from target invocation (at-least-once delivery to the target's queue). This is exactly the architecture described here: PostgreSQL (durable schedule store) + Kafka (at-least-once dispatch buffer) + workers (target execution).

The difference between a cron scheduler and Apache Airflow: Airflow solves a different problem – job dependencies (DAG: Job B runs only after Jobs A and C both succeed). The cron scheduler in this HLD has no concept of upstream dependencies. If your use case requires "run the billing job only after the revenue reconciliation job completes", you need Airflow (or a workflow orchestration layer on top of a basic scheduler). For simple time-based firing with retries, the architecture in this post is sufficient and much simpler to operate.

⚖️ Trade-offs & Failure Modes: Where Distributed Job Schedulers Break

Scaling Strategy

| Bottleneck | Symptom | Fix |
|---|---|---|
| Thundering herd at midnight | Scheduler lag spikes; worker queue depth grows to millions | Batch dispatch limit (500/poll); stagger recurring jobs with random ±60s jitter on next_fire_time |
| PostgreSQL scheduler poll contention | High lock waits on the jobs partial index under concurrent updates | Switch to the Redis sorted set as the primary trigger index; the DB handles lock acquisition only |
| Worker pool saturation | Job execution latency grows; retry backlog accumulates | Auto-scale worker pod count on Kafka consumer lag (Kubernetes HPA + KEDA) |
| Redis sorted set size | Memory pressure at 100M+ scheduled jobs | Evict fired entries immediately after fetch; only 'scheduled' jobs belong in the set |
| DLQ accumulation | Jobs piling up in jobs.dlq without replay | Automate DLQ replay for transient errors; alert on depth; circuit-break calls to failing external systems |

Availability and Resilience

  • Scheduler is stateless – it holds no job state; killing and restarting a scheduler pod loses no data, and it resumes polling from Redis/PostgreSQL immediately
  • Redis cold start – if Redis restarts and the sorted set is empty, the scheduler falls back to the PostgreSQL partial index query and rebuilds the sorted set from SELECT job_id, next_fire_time FROM jobs WHERE status = 'scheduled'
  • PostgreSQL failure – if PostgreSQL is unreachable, the scheduler stops dispatching (correct: no dispatch without a durable lock acquisition). Kafka retains unprocessed messages for workers.
  • Worker crash mid-execution – the Kafka offset is not committed until execution completes, so the message is re-delivered to another worker (at-least-once). The idempotency key prevents double-execution, as sketched below.
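
The idempotency check that makes at-least-once safe can be as small as one atomic Redis write. A sketch assuming StringRedisTemplate; the key prefix and TTL are illustrative choices:

import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

public final class IdempotencyGuard {

    private final StringRedisTemplate redis;

    public IdempotencyGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Returns true exactly once per idempotency key: SET NX is atomic, so a
    // re-delivered message after a worker crash sees 'false' and is skipped.
    public boolean firstExecution(String idempotencyKey) {
        Boolean acquired = redis.opsForValue()
            .setIfAbsent("executed:" + idempotencyKey, "1", Duration.ofDays(7));
        return Boolean.TRUE.equals(acquired);
    }
}

A production variant would typically record the key only after successful execution (or persist it transactionally in the executions table); the SET NX form shown here trades a small risk of a skipped re-execution for simplicity.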

Storage and Caching

| Layer | What it stores | Retention |
|---|---|---|
| PostgreSQL jobs | Job definitions, status, next_fire_time | Indefinite for active jobs; archive after 1 year |
| PostgreSQL executions | Every execution attempt | 90-day hot retention; archive to S3 afterwards |
| Redis sorted set | (job_id, next_fire_time_ms) for 'scheduled' jobs only | No TTL; entries removed on fire; rebuilt on cold start |
| Kafka jobs.execute | In-flight job dispatch messages | 7-day retention; ensures no message loss during a worker outage |
| Kafka jobs.dlq | Permanently failed jobs | 30-day retention; ops replay or archive |

Consistency, Security, and Monitoring

Consistency model by operation:

  • Job creation – strong; the API returns jobId only after a successful PostgreSQL INSERT and Redis ZADD
  • Job firing – at-least-once with an idempotency key; workers may receive the same job twice after a worker crash but skip the duplicate execution via the idempotency check
  • Status reads – strong; GET /jobs/{id} reads directly from PostgreSQL

Key SLO signals to monitor:

  • scheduler.trigger_lag_p99_ms – time between next_fire_time and actual Kafka dispatch; alert above 2,000 ms
  • scheduler.dlq_depth – alert above 1,000 entries
  • scheduler.retry_rate – alert above 5% of daily job volume
  • kafka.jobs_execute.consumer_lag – alert above 10,000 messages
  • redis.sorted_set_size – alert above 20M entries (possible cold-start rebuild failure)

🧭 Decision Guide: Choosing the Right Scheduler Architecture Component

| Situation | Recommendation |
|---|---|
| Trigger index: DB poll vs Redis sorted set | Redis sorted set for > 10K jobs/day; DB polling for < 10K or when Redis adds unwanted operational complexity |
| Double-fire prevention: optimistic lock vs Redis SETNX | Optimistic DB lock is simpler and sufficient for single-region; Redis SETNX for cross-region or cross-service coordination |
| At-least-once vs exactly-once | At-least-once + idempotency key; exactly-once requires distributed transactions across DB + queue + worker – prohibitive for most use cases |
| Cron scheduler vs Airflow | Cron scheduler for independent time-triggered tasks; Airflow when tasks have dependencies (DAG) |
| Single vs multiple scheduler instances | Always run ≥ 2 scheduler instances for availability; design in double-fire prevention from day one |
| Thundering herd mitigation | Add ±60s random jitter to next_fire_time at job creation for non-time-critical recurring jobs; batch dispatch limit of 500/poll |
| DLQ replay strategy | Automate replay for transient errors (connection timeouts); require manual approval for permanent errors (invalid payload) |

🧪 Practical Example for Interview Delivery

A repeatable way to deliver this design in 45 minutes:

  1. Open with the double-fire problem: "A job scheduler has two invariants – never miss a job, never double-fire. These pull in opposite directions. My design resolves this with optimistic locking."
  2. State assumptions and scope: 100M jobs/day, ±1s accuracy SLO, at-least-once with an idempotency key, no DAG dependencies.
  3. Estimation: midnight burst = ~1.2M jobs/hour; a single scheduler at 500/poll handles it; Redis sorted set at 10M entries = 400 MB.
  4. Draw the HLD: Job API → PostgreSQL + Redis → Scheduler Service → Kafka → Worker Pool → Execution History.
  5. Walk the write path: POST /jobs → validate cron → compute next_fire_time → INSERT → ZADD.
  6. Walk the trigger path: ZRANGEBYSCORE → optimistic UPDATE → Kafka publish.
  7. Walk the worker path: consume → idempotency check → execute → update DB + Redis.
  8. Explain the double-fire fix: UPDATE jobs SET status='running' WHERE job_id=? AND status='scheduled' – only one scheduler wins the race; rowsAffected = 0 means discard.
  9. Describe retries: exponential backoff, max 5, DLQ after exhaustion.
  9. Describe retries: exponential backoff, max 5, DLQ after exhaustion.

Closing sentence that lands well in interviews:

"I'd launch with a single scheduler backed by PostgreSQL optimistic locking, add Redis sorted set as the trigger index at 10M+ jobs/day, and introduce a second scheduler instance with jitter-spread scheduling to handle the midnight thundering herd."

๐Ÿ—๏ธ Advanced Concepts for Production Evolution

When interviewers ask follow-up scaling questions, frame the answer as a phased evolution:

  1. Phase 1 – MVP: single scheduler, PostgreSQL polling, one Kafka partition per job priority level; target 1M jobs/day
  2. Phase 2 – Scale the trigger engine: add the Redis sorted set index; add a second scheduler instance with optimistic locking; target 100M jobs/day
  3. Phase 3 – Worker auto-scaling: KEDA (Kubernetes Event-Driven Autoscaling) scales worker pods on jobs.execute Kafka consumer lag; target 1M jobs/hour burst
  4. Phase 4 – Multi-region: active-active regions, each with its own scheduler + job store; global jobs replicated via CDC (Change Data Capture); per-region idempotency keys prevent cross-region double-fire

The cron expression parsing problem: Computing next_fire_time from a Quartz cron expression is a non-trivial parsing and calendar-arithmetic problem. In production, use a battle-tested library (Quartz CronExpression, cron-utils for Java, or croniter for Python) rather than writing a parser from scratch. The edge cases (leap years, DST transitions, February 29th) will bite you.

Time zone handling: Store all next_fire_time values as UTC. Accept cron expressions with an optional timezone parameter (e.g., TZID=America/New_York:0 9 * * MON-FRI). When computing next_fire_time, evaluate the cron expression in the job's timezone and store the resulting instant as UTC. This keeps a 9am America/New_York job firing at 9am local time across Daylight Saving Time transitions instead of drifting by an hour; the sketch below shows the conversion.
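
With Quartz, the timezone-aware computation is one extra call: CronExpression evaluates the expression in a configured zone and returns an absolute instant, which can be stored as UTC. A minimal sketch; the TZID parsing upstream of this call and the class name are assumed:

import java.util.Date;
import java.util.TimeZone;
import org.quartz.CronExpression;

public final class ZonedFireTime {

    // Compute the next fire time of a cron expression in the job's own
    // timezone; the returned Date is an absolute instant, stored as UTC.
    public static Date nextFireTimeUtc(String cronExpr, String tzId) throws Exception {
        CronExpression cron = new CronExpression(cronExpr);
        cron.setTimeZone(TimeZone.getTimeZone(tzId)); // e.g. "America/New_York"
        return cron.getNextValidTimeAfter(new Date());
    }
}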

📚 Lessons Learned from Building and Operating Distributed Job Schedulers

  • The double-fire guard must live in the DB, not just Redis. A Lua-scripted fetch-and-remove is atomic, but atomicity at the sorted-set level does not stop two scheduler instances from racing on the Kafka dispatch step. The optimistic UPDATE ... WHERE status='scheduled' is the correct guard – it is the cheapest distributed lock in the system.
  • Always persist next_fire_time precomputed. Recomputing the next cron fire time at every scheduler poll for millions of jobs is catastrophically slow. Compute once at write time, store it indexed, and recompute only when the job fires.
  • At-least-once delivery is cheaper and safer than exactly-once. Exactly-once delivery across scheduler + queue + worker requires distributed transactions (2PC or Saga). At-least-once with a database-backed idempotency key achieves the same user-visible outcome (no duplicate execution) at a fraction of the complexity.
  • DLQ depth is your most important health signal. A growing DLQ means jobs are permanently failing: downstream systems may be broken, payloads malformed, or permissions changed. Alert on DLQ depth before users notice missed scheduled tasks.
  • Stagger recurring jobs with jitter to avoid thundering herds. Every job with cron_expr = "0 0 * * *" (daily at midnight) will fire simultaneously. Add ±30–60 seconds of random jitter to next_fire_time for non-time-critical jobs. This spreads the midnight spike across a 2-minute window, reducing peak load by up to ~120×.
  • Don't put job execution logic in the scheduler. The scheduler's only job is to dispatch to a queue. If the scheduler also executes jobs, one slow job blocks the polling loop and every other due job misses its window. Keep the scheduler stateless and fast.

📌 TLDR: Summary & Key Takeaways

  • A distributed job scheduler has two invariants: never miss a job and never double-fire. The optimistic UPDATE ... WHERE status='scheduled' CAS enforces the second, while durable storage plus at-least-once dispatch protects the first
  • Store next_fire_time precomputed and indexed in PostgreSQL; use a Redis sorted set as the fast secondary trigger index at scale
  • At-least-once delivery with an idempotency key is the correct trade-off over exactly-once; distributed transactions are too expensive for this use case
  • Multiple scheduler instances require double-fire prevention from day one; row-level optimistic locking suffices for single-region, Redis SETNX for cross-region
  • The thundering herd at midnight is real: a batch dispatch limit plus jitter on next_fire_time are the two mitigations
  • The DLQ is not an afterthought: DLQ depth is the primary operational health signal for any job scheduler
  • The key architectural boundary: the scheduler dispatches to a queue; workers execute; these two concerns must stay separated

📝 Practice Quiz

  1. Why is an idempotency_key required on each scheduled job, and what exactly does it prevent?

    • A) It prevents the scheduler from polling the same job twice in one cycle
    • B) It prevents a worker from executing the same job payload twice after an at-least-once re-delivery following a worker crash or Kafka offset rewind
    • C) It prevents two different jobs from having the same cron expression

    Correct Answer: B

  2. A Redis sorted set is used as the trigger index with score = next_fire_time_ms. What Redis command retrieves all jobs due for firing right now, and why is a Lua script preferred over two separate commands?

    • A) LRANGE on a list structure; Lua is faster than native Redis commands
    • B) ZRANGEBYSCORE jobs_schedule 0 <now_ms> to read and ZREM to remove; a Lua script makes the fetch-and-remove atomic, preventing two scheduler instances from both seeing the same job in the window between the range query and the remove
    • C) HGETALL on a hash map; Lua prevents connection pool exhaustion

    Correct Answer: B

  3. You run 3 scheduler instances for availability. Without any coordination mechanism, all 3 instances poll Redis and see job_abc is due. What is the minimum-overhead mechanism to ensure only one instance dispatches job_abc to Kafka?

    • A) Use a Redis SETNX lock with a 30-second TTL; whichever instance sets the key wins
    • B) Assign each job a shard ID and route each scheduler instance to a fixed shard range
    • C) Execute UPDATE jobs SET status='running' WHERE job_id='job_abc' AND status='scheduled' – only one instance gets rowsAffected = 1; the others see rowsAffected = 0 and discard

    Correct Answer: C

  4. Open-ended challenge: A new requirement arrives: "jobs must fire within ±5 seconds of their scheduled time, but the number of jobs peaking at midnight grew from 1M/hour to 50M/hour overnight." Walk through: (a) whether the existing single-scheduler Redis sorted set design can handle this, (b) what breaks first, and (c) at least two architectural changes that restore the ±5 second SLO without rewriting the entire system.

Written by Abstract Algorithms (@abstractalgorithms)