System Design HLD Example: Distributed Job Scheduler
A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.
TLDR: Design a distributed job scheduler for one-time and recurring (cron) jobs at scale. This article covers the full solution: job definition storage with next_fire_time indexing, a Redis sorted set for sub-millisecond trigger lookup, optimistic locking to prevent double-firing across scheduler instances, exponential backoff retry with a dead-letter queue, and the Quartz Scheduler clustering model in Java.
Netflix sends 150 million push notifications every day at specific scheduled times. If their job scheduler misfires, 150 million users get a notification at 3am instead of 9am. Worse, a double-fire bug where the same notification fires twice causes the kind of user complaint storm that wakes up an on-call team at midnight. A correct distributed scheduler has two invariants that must hold simultaneously: never miss a scheduled job and never fire the same job twice. These two goals pull against each other: the safest way to avoid missing a job is to fire it eagerly (possibly twice), and the safest way to avoid double-firing is to fire conservatively (possibly missing the window). The design challenge is threading that needle at millions of jobs per day.
Designing a job scheduler teaches you a precise lesson about distributed coordination: the problem of "which scheduler instance fires this job?" is structurally identical to the problem of "which payment service instance processes this transaction?", and the solution (optimistic locking with an idempotency key) is the same.
By the end of this walkthrough you'll know why a Redis sorted set with score = next_fire_time is the correct trigger index (O(log N) lookup vs O(N) table scan), why optimistic locking on the jobs row beats a distributed Redis lock for most workloads (one fewer network hop, no lock expiry edge case), and why at-least-once delivery with an idempotency key is almost always the right trade-off over exactly-once (which requires distributed transactions across the scheduler, queue, and worker).
What a Job Scheduler Must Coordinate: Actors and Use Cases
Actors
| Actor | Role |
| --- | --- |
| Client Application | Calls POST /jobs to schedule a job; polls GET /jobs/{id} for status |
| Job API Service | Validates requests; writes job definitions; computes first next_fire_time |
| Job Store (PostgreSQL) | Source of truth for all job definitions and lifecycle state |
| Trigger Engine (Scheduler Service) | Polls for due jobs; acquires execution lease; dispatches to queue |
| Job Queue (Kafka / SQS) | Decouples trigger from execution; buffers jobs under worker lag |
| Worker Pool | Consumes queue; executes job payload; writes execution history |
| Execution History Service | Records each execution attempt: start time, end time, status, error |
Use Cases
- Schedule a one-time job → run a task at a specific future timestamp (e.g., send welcome email 5 minutes after signup)
- Schedule a recurring job → run on a cron expression (e.g., 0 9 * * MON-FRI = 9am weekdays)
- Cancel a scheduled job → mark a pending job as cancelled before it fires; no execution should occur
- Pause and resume a recurring job → temporarily halt recurrence without deleting the job definition
- Check job status → GET /jobs/{id} returns current status and last execution result
- View execution history → GET /jobs/{id}/executions returns all historical runs with outcome
- Retry a failed job → automatic exponential backoff on execution failure; manual retry via API
- Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.
This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.
Functional Requirements: Scope, Non-Goals, and API Contracts
In Scope
- One-time job scheduling → POST /jobs with run_at timestamp and payload; returns jobId
- Recurring job scheduling → POST /jobs with cron_expr (Quartz-format cron string); scheduler computes next_fire_time
- Job lifecycle management → cancel, pause, resume via PATCH /jobs/{id}/status
- Execution history → every run attempt recorded with status, duration, and error; retained 90 days
- Retry on failure → automatic exponential backoff; configurable max retries; dead-letter queue after exhaustion
- Idempotent execution → idempotency_key on each job; workers skip execution if the key was already processed
Out of Scope (v1 Boundary)
- DAG workflow orchestration → dependency graphs between jobs (Job B runs only after Job A succeeds) are the domain of Apache Airflow, not a cron-as-a-service scheduler. Adding inter-job dependencies to v1 bloats the design without serving the core use case.
- Real-time event-driven triggers → jobs triggered by a Kafka event or webhook are a distinct system (an event processor). Our scheduler is time-driven, not event-driven.
- Sub-second precision scheduling → our target is ±1 second, not sub-millisecond. High-frequency scheduling at millisecond precision requires a different architecture (e.g., an in-memory timing wheel).
- Interactive workflow UI → Airflow-style DAG visualisation is out of scope; a status API is sufficient.
Functional Breakdown
| Feature | API Contract | Key Decision |
| --- | --- | --- |
| Create job | POST /jobs → { jobId, status: 'scheduled', nextFireAt } | Compute next_fire_time from cron at write time, not at query time |
| Cancel job | PATCH /jobs/{id}/status with { status: 'cancelled' } → 200 | Atomic CAS update; prevent cancel of an already-fired job |
| Get status | GET /jobs/{id} → full job record with last_fired_at, status | Read from PostgreSQL; no Redis needed on the read path |
| Trigger due jobs | Internal: scheduler polls Redis sorted set every 1s | ZRANGEBYSCORE returns all jobs with score ≤ now_ms |
| Dispatch to queue | Scheduler publishes { jobId, payload, idempotencyKey } to Kafka topic jobs.execute | Worker validates the idempotency key before execution |
| Record execution | Worker inserts into the executions table after each run | Async; never blocks the worker's next job pick-up |
Initial building blocks:
- Job API Service → validates inputs, computes next_fire_time, writes to PostgreSQL, adds to the Redis sorted set
- Trigger Engine → polls the Redis sorted set; acquires a row-level optimistic lock; publishes to Kafka
- Worker Pool → Kafka consumer; validates the idempotency key; executes the payload; writes execution history
- Job Store (PostgreSQL) → jobs table + executions table; source of truth for all state
- Redis → sorted set as a secondary index for fast trigger lookup; optional distributed lock for scheduler coordination
A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.
Non-Functional Requirements: Timing, Throughput, and Durability Targets
| Dimension | Target | Why it matters |
| --- | --- | --- |
| Timing accuracy | Fire within ±1 second of next_fire_time at the 1.2M jobs/hour peak | SLA for time-sensitive tasks (push notifications, payment retries) |
| Throughput | 100M scheduled jobs/day; 1.2M jobs/hour burst at midnight | Daily batch jobs cluster at midnight; the scheduler must survive the thundering herd |
| Availability | 99.9% for trigger engine (44 min downtime/month budget) | A dead scheduler means no scheduled tasks run until recovery |
| Durability | Zero job definitions lost on scheduler restart or DB failover | Jobs must survive full restarts; Redis is a secondary index, not the source of truth |
| Exactly-once execution | At-least-once delivery + idempotency key on worker side | True exactly-once across scheduler + queue + worker requires distributed transactions; idempotency key achieves the same result without the cost |
| Operability | Job fire latency p99, DLQ depth, retry rate, scheduler lag | Scheduler lag (time between next_fire_time and actual dispatch) is the primary SLO signal |
Non-functional requirements are where many designs fail in practice. "Fire within ยฑ1 second" is a concrete, testable SLO. "Fire reliably" is not.
Deep Dive: Capacity Estimation, Data Model, and Trigger Engine Performance
Internals: Data Model
Two tables carry the entire system's state: jobs (definitions and lifecycle) and executions (run history).
-- Source of truth for all scheduled jobs
CREATE TABLE jobs (
job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
owner_id UUID NOT NULL,
cron_expr TEXT, -- NULL for one-time jobs
run_at TIMESTAMPTZ, -- NULL for recurring jobs
payload JSONB NOT NULL,
status TEXT NOT NULL DEFAULT 'scheduled'
CHECK (status IN ('scheduled','running','completed',
'failed','cancelled','paused')),
next_fire_time TIMESTAMPTZ NOT NULL, -- precomputed; indexed for polling
last_fired_at TIMESTAMPTZ,
idempotency_key TEXT NOT NULL UNIQUE, -- prevents double-execution
retry_count INT NOT NULL DEFAULT 0,
max_retries INT NOT NULL DEFAULT 5,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Index enabling fast "what jobs are due right now?" query
CREATE INDEX idx_jobs_next_fire_time
ON jobs (next_fire_time)
WHERE status = 'scheduled';
-- Execution history: one row per attempt
CREATE TABLE executions (
execution_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
job_id UUID NOT NULL REFERENCES jobs(job_id),
started_at TIMESTAMPTZ NOT NULL,
finished_at TIMESTAMPTZ,
status TEXT NOT NULL CHECK (status IN ('running','succeeded','failed')),
error_message TEXT,
attempt_number INT NOT NULL DEFAULT 1
);
CREATE INDEX idx_executions_job_id ON executions (job_id, started_at DESC);
Why next_fire_time is precomputed and indexed: A naive scheduler would SELECT * FROM jobs WHERE status='scheduled' and compute cron_expr → next fire time at query time. With 10M scheduled jobs, that's 10M cron expression evaluations per poll cycle, which is catastrophic. Instead, next_fire_time is computed once at job creation (and again after each successful fire for recurring jobs) and stored as an indexed column. The scheduler's hot query becomes a single range scan:
-- Scheduler poll query: sub-millisecond at any table size with the index
SELECT job_id, payload, idempotency_key, next_fire_time
FROM jobs
WHERE status = 'scheduled'
AND next_fire_time <= NOW()
LIMIT 500; -- batch size; prevents thundering herd from locking the whole table
The WHERE status = 'scheduled' partial index ensures this query never scans completed or cancelled rows, keeping the index small even as finished jobs accumulate into the millions.
Data model for Redis (secondary index):
Key: jobs_schedule (sorted set)
Score: next_fire_time_ms (Unix timestamp in milliseconds)
Member: job_id (UUID string)
The Redis sorted set is the trigger engine's primary lookup structure. ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 500 returns the next 500 due job IDs in O(log N + K) time (N = total jobs in set, K = results returned). The PostgreSQL index is the fallback if Redis is unavailable.
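To make the sorted-set semantics concrete, here is a minimal in-memory model (an illustrative Java sketch, not Redis client code; TriggerIndex and its method names are invented for this example): scores stay ordered, and a due-job fetch is a bounded head-range scan plus removal, mirroring the ZRANGEBYSCORE + ZREM pair.

```java
import java.util.*;

// Illustrative in-memory model of the Redis sorted-set trigger index.
// Score = next_fire_time in ms, member = job ID; fetchDue mirrors
// ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 <limit> followed by ZREM.
public class TriggerIndex {
    // TreeMap keeps scores sorted, so the "due" range is a cheap head view
    private final TreeMap<Long, List<String>> byScore = new TreeMap<>();

    public void zadd(long fireTimeMs, String jobId) {
        byScore.computeIfAbsent(fireTimeMs, k -> new ArrayList<>()).add(jobId);
    }

    // Fetch-and-remove up to `limit` jobs whose score <= nowMs
    public List<String> fetchDue(long nowMs, int limit) {
        List<String> due = new ArrayList<>();
        Iterator<Map.Entry<Long, List<String>>> it =
            byScore.headMap(nowMs, true).entrySet().iterator();
        while (it.hasNext() && due.size() < limit) {
            List<String> bucket = it.next().getValue();
            while (!bucket.isEmpty() && due.size() < limit) {
                due.add(bucket.remove(0));
            }
            if (bucket.isEmpty()) it.remove(); // fully drained score bucket
        }
        return due;
    }

    public static void main(String[] args) {
        TriggerIndex idx = new TriggerIndex();
        idx.zadd(1_000L, "job_a");   // due
        idx.zadd(2_000L, "job_b");   // due
        idx.zadd(9_999L, "job_c");   // not yet due
        System.out.println(idx.fetchDue(5_000L, 500)); // prints [job_a, job_b]
    }
}
```

The batch `limit` in `fetchDue` plays the same role as the LIMIT clause in the real Redis call: it bounds how many jobs a single poll can dispatch.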
Performance Analysis: Trigger Engine at Peak Load
Capacity estimation:
| Signal | Value | Derivation |
| --- | --- | --- |
| Jobs/day | 100M | Given requirement |
| Peak fire rate (midnight batch) | ~1.2M jobs/hour | 30% of daily jobs cluster in the midnight hour |
| Scheduler poll interval | 1 second | Target ±1s accuracy |
| Jobs returned per poll | 500 (batch limit) | Prevents a single poll from locking the table |
| Dispatch rate at peak | 1,200,000 / 3,600 ≈ 333 jobs/sec | Well within single scheduler capacity of 500/sec |
| Job store size (90-day window) | ~9B rows in executions | 100M/day × 90 days; partition by month |
| Redis sorted set size | ~10M entries | Only scheduled jobs; evicted on fire |
| Redis memory for sorted set | ~10M × 40 bytes ≈ 400 MB | UUID (36 bytes) + score (8 bytes) + overhead |
Scheduler throughput ceiling: A single scheduler instance polling every 1 second with a batch of 500 can dispatch ~500 jobs/second. At a 333 jobs/sec peak, one scheduler instance is sufficient. The design supports multiple scheduler instances for availability, but coordination (double-fire prevention) is then required (see the Feature Deep Dive section below).
Thundering herd analysis: At midnight, millions of daily recurring jobs all have next_fire_time ≤ midnight. A naive "fire all due jobs immediately" loop would spike the worker queue and overwhelm external systems. The batch limit of 500 per poll cycle bounds the dispatch rate to 500 jobs/second per scheduler instance, giving the worker pool time to drain the queue at its natural processing rate.
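A complementary mitigation (used again in the trade-offs section) is to spread the midnight spike at write time. Here is a minimal sketch of applying random ±60s jitter to next_fire_time at job creation for jobs that tolerate it; the class and constant names are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: flatten the midnight batch by adding random
// +/-60s jitter to next_fire_time when a non-time-critical recurring
// job is created. Names are invented for this example.
public class FireTimeJitter {
    public static final long JITTER_RANGE_MS = 60_000L; // +/- 60 seconds

    public static long withJitter(long nextFireTimeMs) {
        // nextLong(origin, bound): bound is exclusive, so +1 to include +60s
        long offset = ThreadLocalRandom.current()
            .nextLong(-JITTER_RANGE_MS, JITTER_RANGE_MS + 1);
        return nextFireTimeMs + offset;
    }

    public static void main(String[] args) {
        long midnight = 1_700_000_000_000L; // example epoch ms
        long jittered = withJitter(midnight);
        System.out.println("jitter offset ms: " + (jittered - midnight));
    }
}
```

With 1M jobs nominally due at midnight, a uniform ±60s spread caps the instantaneous rate at roughly 1M / 120s ≈ 8,300 fires/second instead of 1M in one tick.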
High-Level Architecture: From Job Creation to Worker Execution
flowchart TD
A[Client App] -->|POST /jobs| B[Job API Service]
B -->|INSERT next_fire_time| C[(Job Store\nPostgreSQL)]
B -->|ZADD score=next_fire_time_ms| D[(Redis\nSorted Set)]
E[Scheduler Service] -->|ZRANGEBYSCORE 0 now_ms LIMIT 500\nevery 1s| D
E -->|UPDATE status=running\nWHERE status=scheduled| C
E -->|publish jobId + payload\n+ idempotency_key| F[Job Queue\nKafka topic: jobs.execute]
F --> G[Worker Pool\nauto-scaled]
G -->|execute payload| H[External Systems\nAPIs, DBs, Services]
G -->|INSERT execution record| I[(Execution History\nPostgreSQL)]
G -->|UPDATE status=completed\nor failed + next_fire_time| C
G -->|ZADD updated next_fire_time\nfor recurring jobs| D
J[Dead Letter Queue\nKafka: jobs.dlq] -->|after max retries| K[Ops Alert / Manual Replay]
G -->|after max_retries exhausted| J
This diagram shows the two main paths: the write path (Client → Job API → PostgreSQL + Redis) and the trigger-dispatch-execute loop (Scheduler → Redis → PostgreSQL lock → Kafka → Worker → Execution History). The Scheduler Service does not execute jobs directly; it only dispatches them to Kafka. This separation keeps the scheduler stateless and horizontally scalable: the scheduler's only job is to find due tasks and hand them off reliably.
The Redis sorted set is the trigger engine's fast-path index. PostgreSQL is always the source of truth: if Redis is empty due to a restart, the scheduler falls back to the partial index query on the jobs table. After every successful fire of a recurring job, the worker (or scheduler) computes the next next_fire_time from the cron expression, updates the jobs row, and re-adds the job to the Redis sorted set โ closing the recurrence loop.
Feature Deep Dive: Time-Based Trigger Engine (Polling vs Redis Sorted Set)
Two approaches compete for the "find due jobs" operation. The choice has significant performance implications at scale.
Option A โ Direct Database Polling
-- Poll every second: fetch all jobs due in this second
SELECT job_id, payload, idempotency_key
FROM jobs
WHERE status = 'scheduled'
AND next_fire_time <= NOW()
LIMIT 500;
Pros: No additional infrastructure; PostgreSQL is the single source of truth; the partial index makes this fast.
Cons: Every scheduler poll is a database query. At 1 poll/second across 3 scheduler instances, that is 3 queries/second on the jobs table. Acceptable for moderate scale, but the database becomes the hot path. Under heavy write load (new jobs being created while existing ones are being polled), PostgreSQL index contention increases.
Option B โ Redis Sorted Set as Secondary Trigger Index
# Job creation: register in the sorted set
ZADD jobs_schedule <next_fire_time_ms> <job_id>
# Scheduler poll every 1 second: atomically fetch and remove due jobs
ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 500
# Then remove exactly the fetched members (a plain ZREMRANGEBYSCORE 0 now_ms
# would also delete due jobs beyond the 500-job batch limit before they were fetched):
# ZREM jobs_schedule <job_id> [<job_id> ...]
# Better: fetch and remove atomically with a Lua script to close the race between ZRANGE and ZREM
-- Atomic Lua script: fetch AND remove due jobs in one Redis operation
local now = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local jobs = redis.call('ZRANGEBYSCORE', KEYS[1], 0, now, 'LIMIT', 0, limit)
if #jobs > 0 then
redis.call('ZREM', KEYS[1], unpack(jobs))
end
return jobs
Pros: O(log N + K) lookup; no database query on the hot poll path; Redis handles 100K+ ops/second trivially.
Cons: Redis is volatile memory โ a Redis restart without persistence loses the sorted set. This is acceptable only if PostgreSQL is the source of truth and the sorted set is reconstructed on startup from SELECT job_id, next_fire_time FROM jobs WHERE status = 'scheduled'.
Comparison
| Property | DB Polling | Redis Sorted Set |
| --- | --- | --- |
| Hot path latency | 1-5ms (index scan) | < 0.5ms (Redis ZRANGEBYSCORE) |
| Infrastructure complexity | Low (PostgreSQL only) | Medium (PostgreSQL + Redis) |
| Accuracy at scale | Degrades with DB load | Stable; Redis single-thread |
| Durability | ✅ Durable (DB is source of truth) | ⚠️ Volatile; must rebuild on restart |
| Thundering herd control | LIMIT clause on query | LIMIT in Lua script |
Recommendation: Use Redis sorted set for the hot trigger path at > 10K jobs/day. Use PostgreSQL polling as the fallback (and for cold start rebuild). Both approaches must be backed by an optimistic lock on the jobs row to prevent double-firing.
Feature Deep Dive: Preventing Double-Firing Across Scheduler Instances
Running multiple scheduler instances for high availability introduces the most dangerous failure mode in a job scheduler: two scheduler instances both observe the same due job and both dispatch it to the queue.
Why Double-Firing Happens
Time 0: Scheduler A polls Redis → returns [job_123, job_456, job_789]
Time 0: Scheduler B polls Redis → same sorted set → returns [job_123, job_456, job_789]
Time 1: Scheduler A dispatches job_123 to Kafka
Time 1: Scheduler B dispatches job_123 to Kafka ← double-fire!
Without the Lua script, a race between the ZRANGEBYSCORE read and the ZREM remove lets both instances see the same job. And even with the Lua script, atomicity at the Redis level is not enough: the dispatch step happens outside Redis, so the job must still be claimed in the source of truth before dispatch.
Solution 1 โ Optimistic Locking on the Job Row (Recommended)
The scheduler atomically transitions the job status from 'scheduled' to 'running' using a conditional update. Only one instance wins this race; the other sees zero rows updated and discards the job.
-- Optimistic lock: only succeeds for exactly one scheduler instance
UPDATE jobs
SET status = 'running',
last_fired_at = NOW(),
updated_at = NOW()
WHERE job_id = :job_id
AND status = 'scheduled'; -- CAS: this is the guard
-- rowsAffected == 1 → this instance won the lock → dispatch to Kafka
-- rowsAffected == 0 → another instance already claimed this job → discard
Why this works: PostgreSQL row-level locking ensures only one UPDATE wins the race per job_id. The WHERE status = 'scheduled' is the compare-and-swap guard. This is one database round-trip (the cheapest distributed lock you can build).
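The one-winner property of that conditional UPDATE can be modeled in memory with a compare-and-set (this is a sketch of the semantics only; in production the guard is the database's row lock, and the class and method names here are invented):

```java
import java.util.concurrent.atomic.AtomicReference;

// Models the CAS guard of:
//   UPDATE jobs SET status='running' WHERE job_id=? AND status='scheduled'
// Exactly one caller transitions 'scheduled' -> 'running'; every other
// caller observes rowsAffected == 0 and discards the job.
public class JobClaim {
    private final AtomicReference<String> status = new AtomicReference<>("scheduled");

    // Returns true iff this caller won the claim (rowsAffected == 1 analogue)
    public boolean tryClaim() {
        return status.compareAndSet("scheduled", "running");
    }

    public static void main(String[] args) throws InterruptedException {
        JobClaim job = new JobClaim();
        final boolean[] won = new boolean[2];
        Thread a = new Thread(() -> won[0] = job.tryClaim()); // scheduler A
        Thread b = new Thread(() -> won[1] = job.tryClaim()); // scheduler B
        a.start(); b.start(); a.join(); b.join();
        // Exactly one scheduler instance wins, regardless of interleaving
        System.out.println(won[0] ^ won[1]); // prints true
    }
}
```

The database version has the same shape: issue the UPDATE, inspect the affected-row count, and dispatch only on 1.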
Solution 2 โ Redis SETNX Distributed Lock (For Cross-Service Coordination)
When the scheduler and dispatcher are different services (or when the job must be locked for longer than a single DB round-trip), use a Redis distributed lock:
# Acquire lock for job_123 with 30-second TTL (prevents deadlock if scheduler crashes)
SET scheduler_lock:job_123 <scheduler_instance_id> NX EX 30
# NX = set only if not exists (atomic compare-and-swap)
# EX 30 = auto-expire after 30 seconds (prevents permanent lock on crash)
# Returns "OK" → this instance holds the lock → dispatch
# Returns nil → another instance holds the lock → skip this job
# Release lock after dispatch (only release our own lock)
# Lua script: atomic check-then-delete
local current = redis.call('GET', KEYS[1])
if current == ARGV[1] then
redis.call('DEL', KEYS[1])
return 1
end
return 0
When to use SETNX vs optimistic lock:
| Criterion | Optimistic Lock (DB UPDATE) | Redis SETNX |
| --- | --- | --- |
| Infrastructure needed | PostgreSQL only | PostgreSQL + Redis |
| Round-trips to lock | 1 (DB UPDATE) | 1 (Redis SET NX) + 1 (DB UPDATE) |
| Lock TTL on crash | Status stays running until manual reset | Auto-expires after TTL |
| Correctness under high contention | Excellent (DB serialises) | Good (Redis single-thread) |
| Preferred for | Single-cluster scheduler (recommended) | Multi-region or cross-service dispatch |
For most deployments, the optimistic DB lock is sufficient and simpler. Reserve the Redis SETNX approach for scenarios where multiple geographically separated scheduler clusters must coordinate.
Feature Deep Dive: Write Path and Read Path (Job Lifecycle)
Write Path: Creating a New Scheduled Job
Client → POST /jobs → Job API Service → PostgreSQL INSERT → Redis ZADD → 202 Accepted
Step-by-step:
- Client sends POST /jobs with { cron_expr: "0 9 * * MON-FRI", payload: {...}, owner_id: "..." }
- Job API validates the cron expression (reject invalid syntax immediately; avoid storing a job that will never fire)
- Compute next_fire_time: parse the cron expression and compute the next fire timestamp from NOW() (using a library such as Quartz or Cron4j)
- Generate idempotency_key: SHA-256(owner_id + cron_expr + payload_hash) → ensures identical jobs submitted twice get the same key and the second insert is rejected by the UNIQUE constraint
- INSERT into jobs with status 'scheduled', the computed next_fire_time, and the idempotency_key
- ZADD to the Redis sorted set: ZADD jobs_schedule <next_fire_time_ms> <job_id> → registers the job for the trigger engine
- Return 202 with { jobId, nextFireAt }
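The key-derivation step above can be sketched as follows. This is a minimal illustration, not the service's actual code: the class name, the "|" field separator, and hashing the raw payload JSON (rather than a separate payload hash) are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of idempotency-key derivation: SHA-256 over the fields that define
// "the same job", so a duplicate submission produces the same key and the
// second INSERT is rejected by the UNIQUE constraint on jobs.idempotency_key.
public class IdempotencyKey {
    public static String derive(String ownerId, String cronExpr, String payloadJson) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            // "|" separator prevents ambiguous concatenations like a+bc vs ab+c
            String material = ownerId + "|" + cronExpr + "|" + payloadJson;
            return HexFormat.of().formatHex(
                sha.digest(material.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String k1 = derive("owner-1", "0 9 * * MON-FRI", "{\"type\":\"email\"}");
        String k2 = derive("owner-1", "0 9 * * MON-FRI", "{\"type\":\"email\"}");
        System.out.println(k1.equals(k2)); // same inputs -> same key
    }
}
```

Because the key is deterministic, retried client requests are deduplicated by the database with no extra coordination.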
Read Path: Scheduler Poll and Worker Execution
Scheduler polls Redis → acquires DB lock → publishes to Kafka
Worker consumes → validates idempotency → executes → updates DB + Redis
Step-by-step:
- Scheduler polls Redis every 1 second: ZRANGEBYSCORE jobs_schedule 0 <now_ms> LIMIT 0 500
- For each due job ID, the scheduler attempts the optimistic lock: UPDATE jobs SET status='running' WHERE job_id=? AND status='scheduled'
- On lock success (rowsAffected = 1): the scheduler publishes { jobId, payload, idempotencyKey } to the Kafka topic jobs.execute
- Worker consumes from jobs.execute; checks Redis or the DB for the idempotency key: if already processed, ack and skip
- Worker executes the job payload (HTTP call, DB write, email send, etc.)
- On success: the worker updates jobs.status = 'completed' (one-time job) or computes and updates next_fire_time (recurring job); inserts an executions record; and runs ZADD jobs_schedule <next_fire_time_ms> <job_id> for recurring jobs
- On failure: the worker increments retry_count, schedules a retry with exponential backoff, and re-adds the job to the Redis sorted set with score = now_ms + backoff_ms
Feature Deep Dive: Retry, Failure Recovery, and the Dead Letter Queue
A job fails when the worker's execution attempt returns an error or times out. The scheduler must retry the job without losing it, without retrying forever, and without retrying so aggressively that it amplifies load on an already-struggling downstream system.
Exponential backoff schedule:
| Attempt | Backoff delay | Cumulative wait |
| --- | --- | --- |
| 1 (original) | none | 0s |
| 2 (retry 1) | 1 second | 1s |
| 3 (retry 2) | 5 seconds | 6s |
| 4 (retry 3) | 30 seconds | 36s |
| 5 (retry 4) | 5 minutes | 5m 36s |
| 6 (retry 5 = max) | 1 hour | 1h 5m 36s |
After max_retries (default 5) exhaustion, the job moves to the dead-letter queue. DLQ depth is monitored and pages on-call when it exceeds a threshold (e.g., 1,000 entries), indicating a systemic downstream failure.
// Worker: retry-on-failure with exponential backoff and DLQ routing
@KafkaListener(topics = "jobs.execute", groupId = "job-workers")
public void consume(JobMessage msg, Acknowledgment ack) {
int retryCount = msg.getRetryCount();
try {
jobExecutor.execute(msg.getJobId(), msg.getPayload());
executionRepo.record(msg.getJobId(), "succeeded", retryCount + 1);
jobRepo.markCompleted(msg.getJobId());
ack.acknowledge();
} catch (JobExecutionException e) {
executionRepo.record(msg.getJobId(), "failed", retryCount + 1, e.getMessage());
if (retryCount >= msg.getMaxRetries()) {
// Exhausted retries โ route to dead-letter queue
dlqProducer.send("jobs.dlq", msg);
jobRepo.markFailed(msg.getJobId());
ack.acknowledge();
} else {
// Schedule retry with exponential backoff
long backoffMs = backoffDelay(retryCount); // 1s, 5s, 30s, 300s, 3600s
long retryAt = System.currentTimeMillis() + backoffMs;
redisTemplate.opsForZSet().add("jobs_schedule", msg.getJobId(), retryAt);
jobRepo.scheduleRetry(msg.getJobId(), retryCount + 1, Instant.ofEpochMilli(retryAt));
ack.acknowledge(); // ack original; retry entry is a new sorted set member
}
}
}
private long backoffDelay(int attempt) {
long[] delays = {1_000, 5_000, 30_000, 300_000, 3_600_000};
return delays[Math.min(attempt, delays.length - 1)];
}
Why DLQ over infinite retries? Infinite retries amplify load on a degraded downstream system. A job stuck retrying every second is effectively running a denial-of-service attack against a recovering service. The DLQ gives ops visibility into permanently failed jobs and enables controlled replay after the downstream system recovers.
DLQ replay after recovery: Ops runs a replay script that reads from jobs.dlq, resets retry_count = 0 on each job record, and re-adds each job_id to the Redis sorted set with score = NOW(). The job gets a fresh retry budget against the now-healthy downstream system.
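The replay flow can be sketched as a small transformation: drain the dead-lettered jobs, reset their retry budget, and re-register each job ID with score = now. Everything here is an illustrative stand-in (the DeadJob type and in-memory collections model the jobs.dlq topic, the jobs table, and the Redis sorted set):

```java
import java.util.*;

// Sketch of DLQ replay: each drained job gets retry_count = 0 and is
// re-added to the schedule index as immediately due. Names and types
// are invented stand-ins for the real Kafka/DB/Redis interactions.
public class DlqReplay {
    public static final class DeadJob {
        final String jobId;
        int retryCount;
        public DeadJob(String jobId, int retryCount) {
            this.jobId = jobId;
            this.retryCount = retryCount;
        }
    }

    // Returns the (jobId -> score) entries that would be ZADDed back
    public static Map<String, Long> replay(List<DeadJob> dlq, long nowMs) {
        Map<String, Long> reScheduled = new LinkedHashMap<>();
        for (DeadJob job : dlq) {
            job.retryCount = 0;                // fresh retry budget
            reScheduled.put(job.jobId, nowMs); // due immediately
        }
        dlq.clear();                           // DLQ is drained after replay
        return reScheduled;
    }

    public static void main(String[] args) {
        List<DeadJob> dlq = new ArrayList<>(List.of(new DeadJob("job_42", 5)));
        System.out.println(replay(dlq, 1_700_000_000_000L));
    }
}
```

In production this would be gated: automatic replay for transient failure classes, manual approval for permanent ones, matching the decision guide at the end of the post.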
Quartz Scheduler: How It Handles Distributed Job Scheduling in Java
Quartz Scheduler is the canonical Java library for distributed job scheduling. Its JDBC-backed job store (JobStoreTX) keeps all job definitions and trigger state in a relational database, which is the same architecture described in this post: a job table, a NEXT_FIRE_TIME column, and row-level locking.
Clustering mode: Multiple Quartz nodes share a single database. When a trigger fires, all nodes race to acquire a row-level lock on the QRTZ_LOCKS table. Only one node wins; the others skip the trigger. This is exactly the optimistic locking strategy described above, implemented with an explicit SELECT FOR UPDATE on the locks table rather than a conditional UPDATE.
// Quartz job definition: @DisallowConcurrentExecution prevents the same job
// from running simultaneously on two cluster nodes
@DisallowConcurrentExecution
@PersistJobDataAfterExecution
public class ReportGenerationJob implements Job {
@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
JobDataMap data = context.getMergedJobDataMap();
String reportId = data.getString("reportId");
try {
reportService.generate(reportId);
} catch (Exception e) {
            // Don't refire immediately; recovery is handled by the next
            // scheduled fire and the trigger's misfire policy
            throw new JobExecutionException(e, /* refireImmediately= */ false);
}
}
}
// Spring Boot Quartz configuration: schedule the job on a cron expression
@Bean
public JobDetail reportJobDetail() {
return JobBuilder.newJob(ReportGenerationJob.class)
.withIdentity("reportJob", "reporting")
.usingJobData("reportId", "weekly-summary")
.storeDurably()
.build();
}
@Bean
public Trigger reportJobTrigger(JobDetail reportJobDetail) {
return TriggerBuilder.newTrigger()
.forJob(reportJobDetail)
.withIdentity("reportTrigger", "reporting")
.withSchedule(CronScheduleBuilder.cronSchedule("0 0 9 ? * MON-FRI")
.withMisfireHandlingInstructionDoNothing()) // skip if missed; don't fire catch-up
.build();
}
Key Quartz design decisions mapped to our HLD:
| Quartz concept | Equivalent in this HLD |
| --- | --- |
| QRTZ_JOB_DETAILS table | jobs table |
| QRTZ_TRIGGERS table with NEXT_FIRE_TIME column | jobs.next_fire_time |
| QRTZ_LOCKS row-level lock | Optimistic UPDATE ... WHERE status='scheduled' |
| @DisallowConcurrentExecution | idempotency_key prevents double-execution |
| Misfire handling policy | Retry backoff policy + DLQ |
| JobStoreTX clustering mode | Multiple scheduler instances sharing PostgreSQL |
For a full deep dive on Quartz configuration, misfire policies, and Spring Boot integration, a dedicated follow-up post is planned.
Real-World Applications: How AWS, Airflow, and Netflix Solve This in Production
This architecture powers multiple production scheduling systems at scale:
| System | Scale | Key Design Choice |
| --- | --- | --- |
| AWS EventBridge Scheduler | 1 trillion scheduled events/year | Fully managed; serverless trigger engine; per-schedule Lambda invocation |
| Apache Airflow | DAG-based; thousands of tasks/DAG | Extends the scheduler with task dependencies; uses the Celery or Kubernetes executor |
| Quartz Scheduler (Java) | Millions of jobs/day (enterprise) | JDBC-backed; row-level locking for cluster coordination |
| Netflix | 150M notifications/day on schedule | Internal cron scheduler feeding SQS; idempotency key in the notification payload |
| GitHub Actions scheduled workflows | ~10M scheduled workflow runs/day | Cron syntax scheduler; best-effort ±5 min precision at peak |
AWS EventBridge Scheduler โ the production-scale reference: EventBridge Scheduler stores schedules as durable records in its own managed store (not exposed to users). Its trigger engine scans due schedules and invokes targets (Lambda, SQS, SNS, Step Functions) via the target resource's own invocation API. The key insight is that EventBridge Scheduler separates the trigger storage (durable, replicated) from the target invocation (at-least-once delivery to the target's queue). This is exactly the architecture described here: PostgreSQL (durable schedule store) + Kafka (at-least-once dispatch buffer) + Workers (target execution).
The difference between a cron scheduler and Apache Airflow: Airflow solves a different problem โ job dependencies (DAG: Job B runs only after Jobs A and C both succeed). The cron scheduler in this HLD has no concept of upstream dependencies. If your use case requires "run the billing job only after the revenue reconciliation job completes", you need Airflow (or a workflow orchestration layer on top of a basic scheduler). For simple time-based firing with retries, the architecture in this post is sufficient and much simpler to operate.
Trade-offs & Failure Modes: Where Distributed Job Schedulers Break
Scaling Strategy
| Bottleneck | Symptom | Fix |
| --- | --- | --- |
| Thundering herd at midnight | Scheduler lag spikes; worker queue depth grows to millions | Batch dispatch limit (500/poll); stagger recurring job creation with random ±60s jitter on next_fire_time |
| PostgreSQL scheduler poll contention | High lock wait on jobs partial index under concurrent updates | Switch to Redis sorted set as primary trigger index; DB becomes lock-acquisition-only hot path |
| Worker pool saturation | Job execution latency grows; retry backlog accumulates | Auto-scale worker pod count based on Kafka consumer lag (Kubernetes HPA + KEDA) |
| Redis sorted set size | Memory pressure at 100M+ scheduled jobs | Evict fired entries immediately after fetch; only 'scheduled' jobs should be in the set |
| DLQ accumulation | Jobs piling in jobs.dlq without replay | Automate DLQ replay for transient errors; alert on depth; implement circuit breaker on external systems |
Availability and Resilience
- Scheduler is stateless → it holds no job state; killing and restarting a scheduler pod loses no data, and it resumes polling from Redis/PostgreSQL immediately
- Redis cold start → if Redis restarts and the sorted set is empty, the scheduler falls back to the PostgreSQL partial index query and rebuilds the sorted set from SELECT job_id, next_fire_time FROM jobs WHERE status = 'scheduled'
- PostgreSQL failure → if PostgreSQL is unreachable, the scheduler stops dispatching (correct: no dispatch without a durable lock acquisition). Kafka retains unprocessed messages for workers.
- Worker crash mid-execution → the Kafka offset is not committed until execution completes; the message is re-delivered to another worker (at-least-once). The idempotency key prevents double-execution.
Storage and Caching
| Layer | What it stores | Retention |
| --- | --- | --- |
| PostgreSQL jobs | Job definitions, status, next_fire_time | Indefinite for active jobs; archive after 1 year |
| PostgreSQL executions | Every execution attempt | 90-day hot retention; archive to S3 afterwards |
| Redis sorted set | (job_id, next_fire_time_ms) for 'scheduled' jobs only | No TTL; entries removed on fire; rebuilt on cold start |
| Kafka jobs.execute | In-flight job dispatch messages | 7-day retention; ensures no message loss during a worker outage |
| Kafka jobs.dlq | Permanently failed jobs | 30-day retention; ops replay or archive |
Consistency, Security, and Monitoring
Consistency model by operation:
- Job creation – strong; API returns `jobId` only after successful PostgreSQL INSERT and Redis ZADD
- Job firing – at-least-once with idempotency key; workers may receive the same job twice after a worker crash but skip duplicate execution via idempotency check
- Status reads – strong; `GET /jobs/{id}` reads directly from PostgreSQL
Key SLO signals to monitor:
- `scheduler.trigger_lag_p99_ms` – time between `next_fire_time` and actual Kafka dispatch; alert above 2,000 ms
- `scheduler.dlq_depth` – alert above 1,000 entries
- `scheduler.retry_rate` – alert above 5% of daily job volume
- `kafka.jobs_execute.consumer_lag` – alert above 10,000 messages
- `redis.sorted_set_size` – alert above 20M entries (possible cold-start rebuild failure)
Decision Guide: Choosing the Right Scheduler Architecture Component
| Situation | Recommendation |
| --- | --- |
| Trigger index: DB poll vs Redis sorted set | Use Redis sorted set for > 10K jobs/day; DB polling for < 10K or when Redis adds operational complexity |
| Double-fire prevention: optimistic lock vs Redis SETNX | Optimistic DB lock is simpler and sufficient for single-region; Redis SETNX for cross-region or cross-service coordination |
| At-least-once vs exactly-once | At-least-once + idempotency key; exactly-once requires distributed transactions across DB + queue + worker โ prohibitive cost for most use cases |
| Cron scheduler vs Airflow | Cron scheduler for independent time-triggered tasks; Airflow when tasks have dependencies (DAG) |
| Single vs multiple scheduler instances | Always run โฅ 2 scheduler instances for availability; design with double-fire prevention from day one |
| Thundering herd mitigation | Add ±60s random jitter to next_fire_time during job creation for non-time-critical recurring jobs; batch dispatch limit of 500/poll |
| DLQ replay strategy | Automate replay for transient errors (connection timeout); require manual approval for permanent errors (invalid payload) |
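The jitter mitigation from the table can be sketched in a few lines (function and parameter names are illustrative assumptions). The important design point is that jitter is applied once, at job-creation time, so the stored `next_fire_time` already carries the offset and the scheduler's trigger path stays unchanged.

```python
import random
from datetime import datetime, timedelta

JITTER_S = 60  # ±60s spread for non-time-critical recurring jobs

def jittered_fire_time(next_fire_time: datetime, time_critical: bool) -> datetime:
    """Shift the stored fire time by a random offset so midnight jobs don't stampede."""
    if time_critical:
        return next_fire_time  # SLO-bound jobs keep their exact schedule
    return next_fire_time + timedelta(seconds=random.uniform(-JITTER_S, JITTER_S))

midnight = datetime(2024, 6, 1, 0, 0, 0)
spread = jittered_fire_time(midnight, time_critical=False)
# spread lands somewhere in the two-minute window around midnight
```

A daily cron population that would otherwise fire in the same instant is spread across a 2-minute window, which is where the ~120× peak-load reduction cited later in this article comes from.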
Practical Example for Interview Delivery
A repeatable way to deliver this design in 45 minutes:
- Open with the double-fire problem: "A job scheduler has two invariants – never miss a job, never double-fire. These pull in opposite directions. My design resolves this with optimistic locking."
- State assumptions and scope: 100M jobs/day, ±1s accuracy SLO, at-least-once with idempotency key, no DAG dependencies.
- Estimation: midnight burst = ~1.2M jobs/hour; single scheduler at 500/poll handles it; Redis sorted set at 10M entries = 400MB.
- Draw the HLD: Job API → PostgreSQL + Redis → Scheduler Service → Kafka → Worker Pool → Execution History.
- Walk the write path: POST /jobs → validate cron → compute `next_fire_time` → INSERT → ZADD.
- Walk the trigger path: ZRANGEBYSCORE → optimistic UPDATE → Kafka publish.
- Walk the worker path: consume → idempotency check → execute → update DB + Redis.
- Explain the double-fire fix: `UPDATE jobs SET status='running' WHERE job_id=? AND status='scheduled'` – only one scheduler wins the race; rowsAffected = 0 means discard.
- Describe retries: exponential backoff, max 5, DLQ after exhaustion.
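The double-fire fix in the walkthrough above can be demonstrated with SQLite standing in for PostgreSQL (the schema is a sketch; the conditional UPDATE is the article's). Two simulated scheduler instances race on the same row, and the cursor's `rowcount` plays the role of rowsAffected.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO jobs VALUES ('job_abc', 'scheduled')")

def try_acquire(job_id: str) -> bool:
    """Optimistic lock: the UPDATE only matches while status is still 'scheduled'."""
    cur = db.execute(
        "UPDATE jobs SET status='running' WHERE job_id=? AND status='scheduled'",
        (job_id,),
    )
    return cur.rowcount == 1  # 1 = we won the race; 0 = another scheduler already did

scheduler_a = try_acquire("job_abc")  # wins: flips scheduled -> running
scheduler_b = try_acquire("job_abc")  # loses: rowcount 0, discards the trigger
```

The loser does nothing beyond discarding the trigger: no lock to release, no TTL to tune, which is exactly the operational advantage over a Redis SETNX lock for single-region deployments.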
Closing sentence that lands well in interviews:
"I'd launch with a single scheduler backed by PostgreSQL optimistic locking, add Redis sorted set as the trigger index at 10M+ jobs/day, and introduce a second scheduler instance with jitter-spread scheduling to handle the midnight thundering herd."
Advanced Concepts for Production Evolution
When interviewers ask follow-up scaling questions, frame the answer as a phased evolution:
- Phase 1 – MVP: Single scheduler, PostgreSQL polling, one Kafka partition per job priority level; target 1M jobs/day
- Phase 2 – Scale trigger engine: Add Redis sorted set index; add second scheduler instance with optimistic locking; target 100M jobs/day
- Phase 3 – Worker auto-scaling: KEDA (Kubernetes Event-Driven Autoscaling) scales worker pods based on `jobs.execute` Kafka consumer lag; target 1M jobs/hour burst
- Phase 4 – Multi-region: Active-active regions each with their own scheduler + job store; global jobs replicated via CDC (Change Data Capture); per-region idempotency keys to prevent cross-region double-fire
The cron expression parsing problem: Computing next_fire_time from a Quartz cron expression is a non-trivial parsing and calendar-arithmetic problem. In production, use a battle-tested library (Quartz CronExpression, cron-utils for Java, or croniter for Python) rather than writing a parser from scratch. The edge cases (leap years, DST transitions, February 29th) will bite you.
Time zone handling: Store all next_fire_time values as UTC. Accept cron expressions with an optional timezone parameter (e.g., TZID=America/New_York:0 9 * * MON-FRI). When computing next_fire_time, convert from the job's timezone to UTC. This prevents a 9am EST job from firing at 9am UTC during Daylight Saving Time transitions.
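The timezone rule above can be sketched with Python's stdlib `zoneinfo` (a sketch assuming Python 3.9+; the function name is illustrative). The same "9am America/New_York" wall-clock time maps to different UTC instants in winter and summer, which is exactly what storing a fixed UTC hour would get wrong.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

def fire_time_utc(local_wall_clock: datetime) -> datetime:
    """Interpret a wall-clock fire time in the job's timezone; store it as UTC."""
    return local_wall_clock.replace(tzinfo=NY).astimezone(timezone.utc)

winter = fire_time_utc(datetime(2024, 1, 15, 9, 0))  # EST is UTC-5 -> 14:00 UTC
summer = fire_time_utc(datetime(2024, 7, 15, 9, 0))  # EDT is UTC-4 -> 13:00 UTC
# A job stored as a fixed "14:00 UTC" would drift an hour off 9am local in July.
```

This is why `next_fire_time` must be recomputed through the job's timezone at every fire, rather than advanced by a fixed UTC interval.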
Lessons Learned from Building and Operating Distributed Job Schedulers
- The double-fire invariant must be in the DB, not just Redis. A Redis sorted set ZREM is atomic, but atomicity at the sorted set level does not prevent two scheduler instances from racing on the Kafka dispatch step. The optimistic `UPDATE ... WHERE status='scheduled'` is the correct guard – it is the cheapest distributed lock in the system.
- Always persist `next_fire_time` precomputed. Recomputing the next cron fire time at every scheduler poll for millions of jobs is catastrophically slow. Compute once at write time, store it indexed, recompute only when the job fires.
- At-least-once delivery is cheaper and safer than exactly-once. Exactly-once delivery across scheduler + queue + worker requires distributed transactions (2PC or Saga). At-least-once with a database-backed idempotency key achieves the same user-visible outcome (no duplicate execution) at a fraction of the complexity.
- DLQ depth is your most important health signal. A growing DLQ means jobs are permanently failing – downstream systems may be broken, payloads may be malformed, or permissions may have changed. Alert on DLQ depth before users notice missed scheduled tasks.
- Stagger recurring jobs with jitter to avoid thundering herds. Every job with `cron_expr = "0 0 * * *"` (daily at midnight) will fire simultaneously. Add ±30–60 seconds of random jitter to `next_fire_time` for non-time-critical jobs. This spreads the midnight spike across a 2-minute window, reducing peak load by ~120×.
- Don't put job execution logic in the scheduler. The scheduler's only job is to dispatch to a queue. If the scheduler also executes jobs, a slow job blocks the scheduler's polling loop, causing all other due jobs to miss their window. Keep the scheduler stateless and fast.
TLDR: Summary & Key Takeaways
- A distributed job scheduler has two invariants: never miss a job and never double-fire; the optimistic `UPDATE ... WHERE status='scheduled'` CAS is the mechanism that enforces both
- Store `next_fire_time` precomputed and indexed in PostgreSQL; use a Redis sorted set as the fast secondary trigger index at scale
- At-least-once delivery with an idempotency key is the correct trade-off over exactly-once; distributed transactions are too expensive for this use case
- Multiple scheduler instances require double-fire prevention from day one; row-level optimistic locking is sufficient for single-region; Redis SETNX for cross-region
- Thundering herd at midnight is real: batch dispatch limit + jitter on `next_fire_time` are the two mitigations
- The DLQ is not an afterthought – DLQ depth is the primary operational health signal for any job scheduler
- The key architectural boundary: scheduler dispatches to a queue; workers execute; these two concerns must stay separated
Practice Quiz
Why is an `idempotency_key` required on each scheduled job, and what exactly does it prevent?
- A) It prevents the scheduler from polling the same job twice in one cycle
- B) It prevents a worker from executing the same job payload twice after an at-least-once re-delivery following a worker crash or Kafka offset rewind
- C) It prevents two different jobs from having the same cron expression
Correct Answer: B
A Redis sorted set is used as the trigger index with `score = next_fire_time_ms`. What Redis command retrieves all jobs due for firing right now, and why is a Lua script preferred over two separate commands?
- A) `LRANGE` on a list structure; Lua is faster than native Redis commands
- B) `ZRANGEBYSCORE jobs_schedule 0 <now_ms>` to read and `ZREM` to remove; a Lua script makes the fetch-and-remove atomic, preventing two scheduler instances from both seeing the same job in the window between the range query and the remove
- C) `HGETALL` on a hash map; Lua prevents connection pool exhaustion

Correct Answer: B
You run 3 scheduler instances for availability. Without any coordination mechanism, all 3 instances poll Redis and see `job_abc` is due. What is the minimum-overhead mechanism to ensure only one instance dispatches `job_abc` to Kafka?
job_abcis due. What is the minimum-overhead mechanism to ensure only one instance dispatchesjob_abcto Kafka?- A) Use a Redis SETNX lock with a 30-second TTL; whichever instance sets the key wins
- B) Assign each job a shard ID and route each scheduler instance to a fixed shard range
- C) Execute `UPDATE jobs SET status='running' WHERE job_id='job_abc' AND status='scheduled'` – only one instance gets `rowsAffected = 1`; the others see `rowsAffected = 0` and discard
Correct Answer: C
Open-ended challenge: A new requirement arrives – "jobs must fire within ±5 seconds of their scheduled time, but the total number of jobs reaching peak at midnight grew from 1M/hour to 50M/hour overnight." Walk through: (a) whether the existing single-scheduler Redis sorted set design can handle this, (b) what breaks first, and (c) at least two architectural changes that restore the ±5 second SLO without rewriting the entire system.
Written by
Abstract Algorithms
@abstractalgorithms