Abstract Algorithms
Advanced17 min readSystem DesignHldJob Scheduler

System Design HLD Example: Distributed Job Scheduler

A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.

Abstract AlgorithmsAbstract Algorithms
··17 min read
More actions
Practice InterviewMock Discussion

1. Overview

A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.

Why it matters

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a index.

Show high-level concept flow
1

👻 The "Ghost Task" Problem

Starting point

2

📖 Distributed Job Scheduler: Use Cases & Requirements

Next concept

3

🔍 Basics: How Time-Based Triggers Work

Next concept

4

⚙️ Mechanics: The Lifecycle of a Distributed Task

Next concept

5

📐 Estimations & Design Goals

Outcome

Committed

At a glance

DifficultyAdvanced
Concepts22
Estimated time17 min
PrerequisitesSystem Design, Hld

System lens

See System Design HLD Example: Distributed Job Scheduler as a living topology.

A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.

1

👻 The "Ghost Task" Problem

Ingress and assumptions

2

📖 Distributed Job Scheduler: Use Cases & Requirements

State transition

3

🔍 Basics: How Time-Based Triggers Work

State transition

4

⚙️ Mechanics: The Lifecycle of a Distributed Task

State transition

5

📐 Estimations & Design Goals

Outcome and guarantees

The article becomes easier when every section maps to a state change, a guarantee, or a failure boundary.

Narrative transition

Move from explanation to operating judgment.

Use these checkpoints as the conceptual pacing layer before continuing into the full article.

!Why this matters

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a index.

#Key section to watch

Pay attention to "📖 Distributed Job Scheduler: Use Cases & Requirements"; it usually contains the main mechanism or tradeoff.

?Interview angle

Be ready to explain 👻 The "Ghost Task" Problem and 📖 Distributed Job Scheduler: Use Cases & Requirements with one concrete example and one tradeoff.

Tradeoff path 1

👻 The "Ghost Task" Problem: speed-first

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a index.

Tradeoff path 2

📖 Distributed Job Scheduler: Use Cases & Requirements: reliability-first

To handle multiple scheduler instances without double firing, we use optimistic row level locking ( ).

Failure rehearsal

Pressure-test the mental model.

👻 The "Ghost Task" Problem misunderstood

High model quality can still produce incorrect outputs without grounding and verification.

Mitigation: Revisit 👻 The "Ghost Task" Problem and validate the first principles.

Risk 68%

📖 Distributed Job Scheduler: Use Cases & Requirements tradeoff missed

Low latency does not automatically mean high throughput under contention.

Mitigation: Compare against 📖 Distributed Job Scheduler: Use Cases & Requirements and document the tradeoff.

Risk 58%

Back to the article

Continue into the authored sections with the topology in mind: each heading should now answer what changes, what can fail, and what guarantee the system is trying to preserve.

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row-level locking (UPDATE WHERE status='SCHEDULED'). By decoupling the Trigger Engine from the Worker Pool via a message queue (Kafka/SQS), the system achieves independent scaling and guaranteed recovery from node failures.

👻 The "Ghost Task" Problem

Imagine you’re building a billing system for a global SaaS platform. Every month, on the 1st at midnight, your system must trigger 10 million subscription renewals. If your scheduler is a simple in-memory cron job (like a basic Timer in Java or a cron tab on a single server), and the server restarts at 11:59 PM, all 10 million jobs vanish. No bills are sent, and the company loses millions in revenue.

Conversely, if you run two copies of the scheduler for "high availability" without proper coordination, both might see that the clock hit midnight and both trigger the same billing job. Now, 10 million customers are double-charged. Your support lines are flooded, and your brand reputation is destroyed.

This is the Scheduling Paradox: the safer you make the system against missing a job (durability), the more likely you are to double-fire it (concurrency). A production-grade scheduler must solve two invariants simultaneously: never miss a job and never fire the same job twice. At the scale of 100 million tasks per day, solving this requires a robust distributed locking and state management strategy.

📖 Distributed Job Scheduler: Use Cases & Requirements

Actors & Journeys

  • Client Application: Submits tasks to be executed later (e.g., "Send this email in 2 hours").
  • Trigger Engine: The "clock" of the system that identifies which tasks are due for execution.
  • Worker Pool: The fleet of stateless execution nodes that perform the actual work (e.g., calling an external API).
  • Admin Console: Provides visibility into the job pipeline, allowing for manual retries and status monitoring.

Functional Requirements

  • One-time Scheduling: Support for "Run Job A at Timestamp T."
  • Recurring Schedules (Cron): Support for "Run Job B every Monday at 9:00 AM."
  • Job Lifecycle Management: Ability to cancel, pause, or update the payload of a pending job.
  • Task Isolation: Failure in one job should not impact the scheduling or execution of others.

Non-Functional Requirements (NFRs)

  • High Reliability (Durability): 99.99%. Once a job is accepted, it must be stored in a durable database to survive system crashes.
  • High Precision: Jobs should fire within $\pm 1$ second of their target time.
  • Scalability: Handle a backlog of 100M+ jobs and a throughput of 10k+ task dispatches per second.
  • At-Least-Once Delivery: In the event of a network failure, the system prefers retrying a job over skipping it.

🔍 Basics: How Time-Based Triggers Work

At its simplest level, a scheduler is a Priority Queue where the "Priority" is the next_fire_time.

  1. The Polling Loop: A background thread wakes up every second, queries for all jobs where next_fire_time <= now, and marks them for execution.
  2. The Precision Problem: If your polling loop takes 2 seconds but you poll every 1 second, the polls start to overlap. If your database query takes 5 seconds, your "real-time" scheduler is now lagging significantly.
  3. The Solution: We must use a Durable Priority Store. Instead of a simple table scan, we use a B-tree index on the next_fire_time column. This allows the database to find the "next 100 jobs" in logarithmic time, ensuring the trigger engine stays snappy even with a 100M job backlog.

⚙️ Mechanics: The Lifecycle of a Distributed Task

The life of a scheduled job follows a precise state machine to ensure reliability:

  1. SCHEDULED: The job is persisted in the DB and indexed in Redis.
  2. ACQUIRED/RUNNING: A scheduler instance has claimed the job and is currently pushing it to the message queue.
  3. DISPATCHED: The job is on the queue, waiting for a worker.
  4. SUCCESS/FAILED: Final terminal state. If failed, the job may be rescheduled for a "Retry" with a new next_fire_time (Exponential Backoff).

The critical "ACQUIRED" step is protected by Optimistic Locking. We use a version number or a status check in the UPDATE statement to ensure that only one node can move a job from SCHEDULED to RUNNING.

The state machine below traces every status a job can occupy, including the retry loop that prevents transient failures from permanently discarding work.

flowchart TD
    A["Submitted"] --> B["Scheduled"]
    B --> C["Acquired"]
    C --> D["Dispatched"]
    D --> E["Running"]
    E --> F{"Outcome"}
    F -->|Success| G["Completed"]
    F -->|Error| H{"Retry limit?"}
    H -->|Under limit| I["Retrying"]
    H -->|Exceeded| J["Dead Letter"]
    I --> B
    B --> K["Paused"]
    K --> B

📐 Estimations & Design Goals

The Math of "Midnight Bursts"

  • Daily Volume: 100 Million jobs/day.
  • The Hot Minute: 20% of all daily jobs cluster at 00:00 (Midnight) for daily reports and billing.
  • Burst Throughput: 20M jobs / 60 seconds $\approx \mathbf{333,000 \text{ jobs/sec}}$ to be triggered.
  • Storage Requirement: 100M jobs $\times$ 500 bytes (JSON payload + metadata) $\approx \mathbf{50 \text{ GB/day}}$.
  • Job Retention: If we keep 30 days of history, we need $\approx \mathbf{1.5 \text{ TB}}$ of storage.

Design Goals

  • Stateless Trigger Nodes: Schedulers should not "know" about jobs in memory. They must poll a shared durable store.
  • Decoupled Ingest: Use a REST API to accept jobs and immediately return a 202 Accepted to the client.

📊 High-Level Design: The Poll-Dispatch-Execute Pattern

The architecture separates the time-keeping (Trigger) from the work-execution (Worker).

graph TD
    Client -->|POST /jobs| API[Job API]
    API -->|Insert| DB[(Job Store: Postgres)]
    API -->|ZADD| Redis[(Redis Trigger Index)]

    subgraph Trigger_Engine
        Sched[Scheduler Service] -->|Poll| Redis
        Sched -->|CAS Lock| DB
        Sched -->|Dispatch| MQ[Message Queue: Kafka]
    end

    MQ --> Workers[Worker Pool]
    Workers -->|Heartbeat| DB
    Workers -->|Update Status| DB

The diagram above separates the three core roles: the Job API accepts submissions and returns 202 Accepted immediately; the Scheduler Service identifies due jobs and claims them; and the Worker Pool executes them independently. The Redis Sorted Set acts as a lightweight time index for O(log N) polling, while Postgres is the durable source of truth. This architecture means each component can fail and recover independently without losing a single job.

🧠 Deep Dive: How the Scheduler Achieves Exactly-Once Dispatch Without a Global Lock

The most subtle engineering in a distributed scheduler is the claim mechanism. If multiple Scheduler instances are running for high availability, they must race to claim each job without either missing it (never fires) or both claiming it (double fires). The solution is Optimistic Locking via Compare-and-Swap (CAS) UPDATE.

Internals: The Poll-and-Lock Cycle

Every second (configurable), each Scheduler instance queries the Redis Sorted Set for all jobs with a score (next_fire_time as Unix epoch) less than or equal to the current time. This ZRANGEBYSCORE call returns a list of due job IDs in O(log N + K) time, where K is the number of jobs returned.

For each returned job ID, the Scheduler executes a CAS UPDATE against Postgres:

UPDATE jobs SET status = 'ACQUIRED', version = version + 1 WHERE job_id = ? AND status = 'SCHEDULED' AND version = ?

If this UPDATE affects 1 row, the claim succeeded exclusively — this Scheduler is now responsible for the job and publishes it to Kafka. If the UPDATE affects 0 rows, another Scheduler instance already claimed the job (its status changed from SCHEDULED to ACQUIRED first). The losing Scheduler simply skips this job and moves to the next one. No waiting, no deadlock, no double-dispatch.

FieldTypeDescription
job_idUUIDPrimary key
job_nameVARCHAR(255)Human-readable identifier
cron_expressionVARCHAR(100)Standard cron syntax (null for one-time jobs)
next_fire_timeTIMESTAMPIndexed; determines polling priority
statusENUMSCHEDULED, ACQUIRED, RUNNING, SUCCESS, FAILED, PAUSED
payloadJSONBTask-specific data passed to the worker
versionINTEGEROptimistic lock counter — the CAS key
retry_countSMALLINTNumber of times this job has been retried
max_retriesSMALLINTMaximum retry attempts before dead-letter
created_atTIMESTAMPWhen the job was first submitted

The Redis Sorted Set stores each job as a member (job_id string) with a score (next_fire_time as Unix epoch milliseconds). This enables three critical operations:

Redis KeyTypeValuePurpose
trigger:zsetSorted Setmember=job_id, score=epoch_msO(log N) retrieval of due jobs via ZRANGEBYSCORE
job:lock:{job_id}String with TTLscheduler_instance_idDistributed presence lock for ACQUIRED state
worker:heartbeat:{job_id}String with TTLtimestampDetects stale workers (60-second TTL)

Performance Analysis: Surviving the Midnight Burst

The "midnight burst" is the canonical stress test for any scheduler. On the 1st of the month at 00:00 UTC, 20 million billing jobs become simultaneously due. The system must dispatch all of them within minutes, not hours.

At 10,000 CAS UPDATEs per second per Scheduler instance, a single node processes 600,000 claims per minute — far too slow for 20 million jobs in 2 minutes. The solution is three-part:

StrategyMechanismThroughput Gain
Horizontal Scheduler scaling20 Scheduler instances race to claim via CAS UPDATELinear: 20× throughput
Time-bucketed pollingEach poll fetches only the next 30 seconds of due jobs (not all 20M at once)Prevents memory-exhausting batch reads
Kafka partition scaling100 Kafka partitions × N Worker consumers = N-way parallel executionLinear: scales worker capacity independently

With 20 Scheduler instances each claiming 10,000 jobs/sec, the midnight burst of 20 million jobs is claimed and dispatched to Kafka in approximately 100 seconds — well within operational tolerance.

The sequence below traces a single job from client submission through the CAS claim handoff to final execution, showing each handoff that enforces exactly-once dispatch.

sequenceDiagram
    participant C as Client
    participant API as Job API
    participant DB as Job Store
    participant R as Redis ZSET
    participant S as Scheduler
    participant MQ as Kafka
    participant W as Worker
    C->>API: POST /jobs with payload and schedule
    API->>DB: INSERT job with status=SCHEDULED
    API->>R: ZADD trigger score=next_fire_time
    API-->>C: 202 Accepted
    S->>R: ZRANGEBYSCORE for due jobs
    S->>DB: CAS UPDATE status=ACQUIRED
    S->>MQ: Publish job message to topic
    MQ->>W: Deliver job to consumer
    W->>DB: UPDATE status=SUCCESS

🌍 Real-World Job Schedulers: GitHub Actions, Airbnb Chronos, and Shopify Sidekiq

GitHub Actions Scheduled Workflows uses a distributed scheduler to trigger CI/CD pipelines for millions of repositories. The challenge is "hot time slots" — thousands of repositories configured with cron: "0 3 * * *" (3 AM daily) all fire simultaneously. GitHub shards scheduled workflows by repository owner, ensuring that a single organization with 10,000 repositories doesn't starve other users during the 3 AM burst.

Apache Airflow, which replaced Airbnb's Chronos, introduced dependency-based scheduling: Job B only runs after Job A succeeds. This transforms the scheduler from a simple time-based trigger engine into a DAG execution engine. Airflow's Scheduler uses the same Poll-and-Lock pattern as described here, extended with an upstream task status check before a task is eligible for claiming. The complexity cost is real: Airflow's Scheduler is notoriously difficult to scale past 1,000 concurrent tasks.

Shopify's Sidekiq handles millions of background jobs per day using a Redis Sorted Set as the scheduling index — an identical mechanism to the architecture described in this guide. Sidekiq adds a "dead queue" for jobs that have exceeded their maximum retry count, providing operators a UI to inspect, retry, or discard permanently failed jobs. This dead-letter queue pattern is essential for long-running production systems.

⚖️ Reliability vs. Throughput: Trade-offs in Distributed Scheduling

Design DecisionAdvantageRisk
CAS UPDATE for claimingLock-free; scales with Scheduler countHigh retry rate if many Schedulers compete on same job
Redis ZSET as trigger indexO(log N) poll; sub-millisecondRedis restart without persistence requires ZSET rebuild
Kafka for worker dispatchDurable; at-least-once deliveryKafka lag can delay job execution past target fire time
Stateless WorkersEasy horizontal scaling; add nodes anytimeWorker crash leaves job in ACQUIRED; needs heartbeat detection
Exponential backoff for retriesProtects downstream services from retry stormsVery long delays accumulate for transient failures

The component diagram below shows how the Scheduler Cluster, Coordination Layer, and Worker Pool connect, making clear which tier scales independently under burst load.

flowchart LR
    subgraph Schedulers["Scheduler Cluster"]
        S1["Scheduler Node 1"]
        S2["Scheduler Node 2"]
    end
    R["Redis ZSET (time index)"]
    DB["Postgres (job store)"]
    MQ["Kafka (work queue)"]
    subgraph Workers["Worker Pool"]
        W1["Worker A"]
        W2["Worker B"]
    end
    S1 --> R
    S2 --> R
    S1 --> DB
    S2 --> DB
    S1 --> MQ
    S2 --> MQ
    MQ --> W1
    MQ --> W2

Critical Failure Mode — The Stuck Job: A Worker claims a job (status = ACQUIRED), begins execution, and then the Worker process crashes mid-run. The job is now permanently stuck in ACQUIRED state — it will never be retried and never marked as FAILED. Mitigation: Workers write a heartbeat to a Redis key (worker:heartbeat:{job_id}) with a 60-second TTL. A Watchdog thread in the Scheduler periodically checks for jobs in ACQUIRED state with no active heartbeat key in Redis. Any such job is transitioned back to SCHEDULED for re-dispatch by the next available Worker.

🧭 Choosing the Right Job Scheduling Architecture for Your Scale

Use the Poll-Dispatch-Execute architecture (Postgres + Redis ZSET + Kafka + Worker Pool) when:

  • Job volume exceeds 100,000 per day with a mix of cron and one-time tasks.
  • Execution must be exactly-once (billing, financial reports, compliance notifications).
  • Job payloads are heterogeneous — different task types processed by different Worker classes consuming different Kafka topics.
  • You need independent horizontal scaling for the Scheduler tier and the Worker tier.

A simpler single-node scheduler (e.g., Quartz in a single JVM) is sufficient when:

  • Job volume is under 10,000 per day.
  • All Scheduler and Worker logic runs in a single process with no multi-instance coordination needed.
  • Downtime of a few minutes per deployment cycle is commercially acceptable.

When to add Apache Airflow on top:

  • Jobs have inter-job dependencies (Job B must succeed before Job C can start).
  • You need a visual DAG editor, retry management UI, and operator alerting out of the box.
  • Scheduling granularity is daily or hourly rather than per-second or per-minute precision.

🧪 Delivering This Design in a System Design Interview

Act 1 — The Scheduling Paradox (2 minutes): Open with the billing scenario from the introduction. Draw the single-node cron job crashing at 11:59 PM (10 million lost jobs). Then draw two redundant schedulers without coordination both firing the same billing job (10 million double-charges). This immediately establishes why distributed locking is the core problem — not the scheduling algorithm.

Act 2 — The Poll-Dispatch-Execute Pattern (5 minutes): Draw the architecture diagram above. Highlight four components and their specific roles: Redis ZSET as the "time index," Postgres CAS UPDATE as the "claim mechanism," Kafka as the "durable work queue," and the Worker Pool as the "execution layer." Explain why each is separate and what failure scenario each separation protects against.

Act 3 — The Midnight Burst (3 minutes): The interviewer will almost certainly ask: "How do you handle 20 million jobs all due at midnight?" Walk through time-bucketed polling, parallel Scheduler instances competing via CAS UPDATE, and Kafka partition-based Worker scaling. This demonstrates you understand the peak throughput scenario that defines the system's capacity requirements.

Interviewer QuestionStrong Answer
How do you guarantee a job fires at most once?CAS UPDATE: only one Scheduler can change status from SCHEDULED to ACQUIRED because version must match
What if a Kafka consumer dies mid-processing?Kafka offset not committed until job marked SUCCESS; Kafka redelivers the message to another consumer
How do you handle jobs that run longer than expected?Heartbeat mechanism: if Redis heartbeat key expires, Watchdog resets job to SCHEDULED for re-dispatch

🛠️ Open Source Scheduling Infrastructure

Quartz Scheduler is the mature Java scheduling library that popularized the Poll-and-Lock pattern with a JDBC-backed job store. Quartz's QRTZ_TRIGGERS table uses a similar status-based CAS UPDATE to achieve multi-node cluster safety. Its primary limitation is that the Scheduler and Worker roles are tightly coupled inside the same JVM — the architecture described in this guide decouples them via Kafka for independent scaling.

Apache Airflow extends the Poll-Dispatch-Execute pattern with a DAG-based dependency graph, allowing complex workflows where task B is only triggered after task A succeeds. Airflow's Celery or Kubernetes executor backends mirror the Worker Pool concept with pluggable execution environments.

Amazon SQS with delay queues provides a managed alternative to the Kafka + Worker pattern for simple one-time scheduling. A message with a DelaySeconds attribute is invisible until the delay expires, after which a Worker consumer picks it up. The trade-off: SQS delay is capped at 15 minutes, making it unsuitable for long-horizon scheduling like "run this job in 6 months."

📚 Lessons Learned From Operating Distributed Job Schedulers in Production

Lesson 1 — Idempotency in Workers is not optional. The system provides at-least-once delivery by design — Kafka may redeliver, the Watchdog may re-dispatch a stuck job. Every Worker must be idempotent: billing a customer is not idempotent unless you check for an existing successful charge for the same billing period before processing the payment.

Lesson 2 — The midnight burst is your capacity benchmark. Design the system to handle the peak, not the average. If 20% of all daily jobs cluster at midnight, your load test scenario should fire 20% of daily volume in the first 60 seconds.

Lesson 3 — Monitor ACQUIRED queue depth as a leading indicator. A growing count of jobs in ACQUIRED state with no heartbeat is the earliest warning of Worker failures or stuck heartbeats. Set an alert at 1,000 jobs in ACQUIRED state for longer than 5 minutes.

Lesson 4 — Build job management APIs before you have an incident. Operators need to cancel stuck jobs, manually re-trigger failed jobs, and inspect job payloads during on-call incidents. Building these APIs as an afterthought means dangerous direct database surgery in production at 3 AM.

📌 TLDR & Key Takeaways for Distributed Job Scheduler Design

  • Core problem: The Scheduling Paradox — a single node loses jobs on crash; two nodes without coordination double-fire jobs.
  • Solution: Optimistic locking via CAS UPDATE — UPDATE WHERE status='SCHEDULED' AND version=? ensures exactly one Scheduler instance can claim each job.
  • Architecture: Job API → Postgres (durable store) + Redis ZSET (time index) → Scheduler Service (poll + claim) → Kafka (work queue) → Worker Pool (execution).
  • Key trade-off: At-least-once delivery means Workers must be idempotent — design every task for safe re-execution.
  • Critical failure mode: Stuck ACQUIRED jobs — mitigated by Worker heartbeat with 60-second Redis TTL and a Watchdog thread in the Scheduler.
  • Peak design target: 333,000 jobs/sec during the midnight burst — handled via time-bucketed polling and 20+ horizontal Scheduler instances.

Expandable deep dives

👻 The "Ghost Task" Problem

Dive deeper into this section and cross-reference concepts before moving to the next heading.Jump to section

📖 Distributed Job Scheduler: Use Cases &amp; Requirements

Dive deeper into this section and cross-reference concepts before moving to the next heading.Jump to section

Actors &amp; Journeys

Dive deeper into this section and cross-reference concepts before moving to the next heading.Jump to section

Functional Requirements

Dive deeper into this section and cross-reference concepts before moving to the next heading.Jump to section

Key takeaways

  • TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a index.
  • To handle multiple scheduler instances without double firing, we use optimistic row level locking ( ).
  • By decoupling the Trigger Engine from the Worker Pool via a message queue (Kafka/SQS), the system achieves independent scaling and guaranteed recovery from node failures.
  • 👻 The "Ghost Task" Problem Imagine you’re building a billing system for a global SaaS platform.

Test Your Knowledge

🧠

Ready to test what you just learned?

AI will generate 4 questions based on this article's content.

Reader feedback

Was this article useful?

Rate it before you leave, then follow or subscribe for the next deep dive.

Continue learning

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms

Related deep dives

Abstract Algorithms · © 2026 · Engineering learning lab