Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring
How Spark runs on Kubernetes using the Spark Operator, Dynamic Resource Allocation, pod templates, and Prometheus metrics for production observability
TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically-scaled execution environment. The kubeflow Spark Operator manages SparkApplication CRDs through a reconciliation loop that creates driver and executor pods, while Dynamic Resource Allocation and pod templates give platform teams granular control over where pods land and how resources are consumed. Prometheus ServiceMonitors expose executor-level metrics that YARN never surfaced, and proper memoryOverhead tuning is the single most common gap between a demo and a stable production deployment.
When Four Hundred Jobs Are Waiting: Why YARN's Queue Model Was Never Built for Elasticity
It is a Tuesday morning at a mid-sized data platform team. The on-call engineer opens Ambari and finds the YARN resource manager showing 20 Spark applications in RUNNING state and 400 in the ACCEPTED queue. The SLA dashboard is red. Downstream reports that should have refreshed by 7 AM are still stale at 9 AM.
The cause is well understood by anyone who has operated a shared YARN cluster at scale: queue starvation. The platform has a root.etl queue capped at 40% of cluster capacity and a root.analytics queue at 30%. On quarter-close days, the analytics queue fills entirely with long-running ad-hoc queries, and the ETL queue (though technically available) cannot borrow resources from a queue whose jobs are all classified as RUNNING even when 60% of their allocated containers are blocked on shuffle reads or sitting idle waiting for broadcast variables to propagate.
YARN's resource model was designed for static workloads on dedicated hardware. A queue is a reservation: capacity is set at cluster launch and does not shrink or grow based on actual demand. When every team has their own queue and every team submits more jobs on high-traffic days, the queues compete for fixed shares of a fixed cluster. The resources you need most urgently are the ones locked inside another team's idle reservation.
Kubernetes does not have queues. It has a scheduler that observes the gap between requested resources and available capacity on nodes, and it fills that gap greedily, subject only to namespace quotas and node selectors you choose to enforce. When an executor pod finishes its work and the pod terminates, those CPU and memory units are immediately available for the next pod in any namespace. The elasticity is real rather than theoretical.
The migration from YARN to Kubernetes for Spark is not just a cluster manager swap. It changes the fundamental contract: instead of a ResourceManager granting ApplicationMasters a fixed container allocation up front, you get a declarative Kubernetes API that creates and destroys pods in response to demand. The Spark Operator bridges that API gap by translating a SparkApplication custom resource into the sequence of Kubernetes objects (services, config maps, RBAC bindings, driver pods, executor pods) that a Spark application needs to run from submission to completion.
This post explains how that bridge works, what happens inside the pod lifecycle when Dynamic Resource Allocation triggers scale-up and scale-down events, how to design a multi-tenant Spark platform on Kubernetes without turning it into a noisy-neighbor disaster, and how to wire Prometheus and Grafana to the metrics that actually predict job failures before they happen.
How the Spark Operator Turns a Kubernetes Cluster into a Managed Spark Platform
The kubeflow Spark Operator (hosted at github.com/kubeflow/spark-operator) implements the Kubernetes Operator pattern for Apache Spark. The pattern itself is straightforward: extend the Kubernetes API with a Custom Resource Definition, then run a controller that watches for changes to that resource and reconciles actual cluster state toward the desired state declared in the resource. The Spark Operator makes the custom resource a SparkApplication, and the controller knows how to translate that spec into driver pods, executor pods, Kubernetes services, and the RBAC bindings that let the driver pod itself call the Kubernetes API to request executor pods.
The SparkApplication CRD
The SparkApplication is the unit of work. Every Spark job submitted through the operator is represented as a single Kubernetes resource of kind SparkApplication under the API group sparkoperator.k8s.io/v1beta2. The CRD carries everything the operator needs: the container image containing your job's JAR or Python file, the driver and executor resource specifications, the Spark configuration properties, the restart policy, and references to pod template files that let you inject Kubernetes-level customizations that Spark's native properties do not expose.
When you run kubectl apply -f my-etl-job.yaml with a SparkApplication manifest, the operator's admission webhook validates the resource and the controller loop picks it up within seconds. The controller creates a Kubernetes service for the driver (so executors can reach it by a stable hostname), a ConfigMap containing the resolved Spark configuration, and a driver pod. Once the driver pod transitions from Pending to Running, the driver JVM initializes a SparkContext, which in turn creates a Kubernetes-backed SchedulerBackend. From that point on, the driver communicates directly with the Kubernetes API server (using the ServiceAccount credentials mounted into the driver pod) to request executor pods as the job demands them.
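As a concrete illustration, a minimal SparkApplication manifest might look like the sketch below. The image name, namespace, main class, and service account are hypothetical placeholders, not values from this post:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: daily-etl                # illustrative job name
  namespace: team-alpha          # illustrative team namespace
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/spark-jobs:3.5.1     # hypothetical image
  mainClass: com.example.etl.DailyJob              # hypothetical class
  mainApplicationFile: local:///opt/spark/jars/daily-etl.jar
  sparkVersion: "3.5.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30
  driver:
    cores: 1
    memory: 4g
    serviceAccount: spark-driver-sa   # must be allowed to create executor pods
  executor:
    instances: 2
    cores: 4
    memory: 8g
```

Applying this with kubectl is all a data engineer needs to do; the operator handles the driver service, ConfigMap, and pod creation described above.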
CRD Lifecycle States
The operator tracks job progress through a lifecycle state machine. A freshly applied SparkApplication enters the NEW state. After the driver service and ConfigMap are created, it advances to SUBMITTED. When the driver pod reports Running, the state becomes RUNNING. If all stages complete without error, the application transitions through SUCCEEDING into COMPLETED. A driver OOMKill or uncaught exception drives it through FAILING into FAILED. The restart policy field (OnFailure, Always, or Never) determines whether the operator re-runs the application from NEW after a failure, with configurable retry counts and backoff intervals.
Pod Templates: Kubernetes-Level Customization
Spark's native SparkConf properties cover JVM heap, core counts, and serialization formats. They do not cover Kubernetes-specific requirements like node affinity, pod topology spread constraints, custom environment variables for credential injection, sidecar containers for log forwarding, or volume mounts for local SSD scratch space. Pod templates fill that gap.
The SparkApplication spec references a pod template file path for both the driver and executor. These are standard Kubernetes Pod manifests. The operator merges the template spec with the values derived from SparkApplication fields: resource requests are overwritten by the CRD's driver/executor specs, while affinity, tolerations, nodeSelector, volumes, and additional containers are taken from the template as-is. This means a platform team can maintain a central executor template that pins executors to spot nodes, enforces anti-affinity to spread pods across availability zones, and mounts a local NVMe volume as a Spark scratch directory, without requiring individual data engineers to understand Kubernetes scheduling semantics.
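A sketch of what such an executor template might contain (the node label and NVMe path are illustrative assumptions; Spark matches the executor container by the name configured via spark.kubernetes.executor.podTemplateContainerName, which defaults to spark-kubernetes-executor):

```yaml
# executor-pod-template.yaml (illustrative sketch)
# Resource requests still come from the SparkApplication CRD; the fields
# below are merged in from the template as-is.
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    node-pool: spark-spot            # hypothetical node-pool label
  volumes:
    - name: spark-scratch
      hostPath:
        path: /mnt/nvme/spark        # hypothetical local NVMe scratch path
  containers:
    - name: spark-kubernetes-executor
      volumeMounts:
        - name: spark-scratch
          mountPath: /tmp/spark-local
```

The template is referenced from the SparkApplication (or via spark.kubernetes.executor.podTemplateFile), so individual job authors never touch it.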
Deep Dive: The Two Schedulers, the Memory Safety Model, and DRA Scaling Mathematics
The Internals
When a Spark application runs on Kubernetes, two completely independent schedulers operate simultaneously, and their interaction determines both performance and correctness. Understanding this two-scheduler model is prerequisite knowledge for diagnosing any Spark-on-K8s production issue.
The Kubernetes scheduler runs inside the control plane. Its job is pod placement: given a pod with declared resource requests, find a node where the requested CPU and memory fit within the node's allocatable capacity after accounting for already-running pods. The K8s scheduler has no concept of Spark stages, tasks, or shuffle partitions. It sees a pod spec and a cluster topology.
The Spark task scheduler runs inside the driver JVM. Its job is task dispatch: given a set of tasks in a stage, assign each task to an executor slot, preferring the executor whose node holds the HDFS or object-store block that the task needs to read (locality preference). The Spark scheduler has no concept of Kubernetes nodes, pod resource requests, or container image layers.
The interface between these two schedulers is simple: executor pod availability. When the Spark scheduler needs more executor slots than currently registered executors provide, the Dynamic Resource Allocation subsystem (running in the driver) requests new executor pods via the Kubernetes API. The Kubernetes scheduler places those pods. Once the pod reaches Running state and the executor JVM starts, it registers with the driver, and the Spark scheduler begins assigning tasks to it. The handoff point is that registration RPC: until it fires, the Kubernetes scheduler and the Spark scheduler have no shared state.
Pod lifecycle state machine. An executor pod transitions through well-defined Kubernetes states. It starts as Pending, meaning the scheduler has accepted it but no node has been found yet, or the chosen node is still pulling the container image. It transitions to ContainerCreating when the node has been selected and image pull is in progress. It becomes Running when the executor JVM starts and the readiness probe (if configured) passes. It enters Succeeded if the executor exits cleanly (graceful shutdown), or Failed if the container exits with a non-zero code, typically due to OOMKill, an uncaught exception, or spot instance preemption.
Executor decommissioning protocol. When Dynamic Resource Allocation decides to scale down, it does not simply delete executor pods. It follows a graceful decommissioning sequence to avoid losing in-flight shuffle data. The driver sends a DecommissionExecutor message via RPC to the target executor. The executor finishes any task currently assigned to it, then begins the decommission procedure: it marks itself as decommissioning (preventing new task assignments), and either pushes its shuffle blocks to the External Shuffle Service (if one is deployed) or relies on shuffle tracking metadata (if spark.dynamicAllocation.shuffleTracking.enabled=true) to signal that it holds live shuffle data that cannot be discarded until the downstream stages that consume it have completed. Once the executor confirms that all tracked shuffle blocks are either transferred or no longer needed, it sends a completion acknowledgment to the driver, and the operator deletes the pod. The decommission sequence is bounded by spark.storage.decommission.maxReplicationFailuresPerBlock and related timeouts.
Resource requests vs limits. On Kubernetes, every container has a resource request (what the scheduler uses for placement) and a resource limit (what the Linux cgroup enforces at runtime). When a JVM container exceeds its memory limit, the kernel sends SIGKILL immediately: there is no grace period, no GC attempt, no heap dump. This is categorically different from YARN, where the NodeManager measures container memory usage every few seconds and gives the JVM time to GC before killing it. The spark.executor.memoryOverhead property is the most critical configuration for avoiding OOMKill on Kubernetes. It controls the non-heap portion of the pod's memory request. The default is max(executor_memory × 0.1, 384 MiB), but workloads that use Python UDFs, large broadcast variables deserialized into off-heap buffers, or Arrow-based columnar processing routinely need 20-30% overhead.
Mathematical Model for Resource Safety and DRA Scaling
The memory model for a Spark executor pod on Kubernetes can be stated precisely. Let M_heap be spark.executor.memory (the JVM heap allocation) and M_oh be spark.executor.memoryOverhead. The total memory request submitted to Kubernetes is:
pod_request_memory = M_heap + M_oh
The default overhead formula is:
M_oh = max(M_heap × 0.10, 384 MiB)
Inside the JVM heap, Spark carves out usable memory after reserving a 300 MiB system buffer:
usable_heap = M_heap - 300 MiB
execution_and_storage_pool = spark.memory.fraction × usable_heap (default fraction = 0.6)
user_memory = (1 - spark.memory.fraction) × usable_heap
The execution+storage pool is unified: execution can evict cached storage blocks down to the floor set by spark.memory.storageFraction, but cached storage cannot evict active execution memory. When the total process footprint (JVM heap plus native allocations, shuffle buffers, and deserialization arenas) exceeds the pod's memory limit, the Linux kernel OOMKills the container. The safe operating rule is:
pod_limit = pod_request_memory ≥ peak_heap_usage + GC_overhead + off_heap_usage
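The memory arithmetic above can be checked with a small script. This is a sketch of the accounting described in this section, not Spark's actual internal code:

```python
def executor_memory_model(heap_mib, overhead_mib=None, memory_fraction=0.6):
    """Reproduce the executor memory arithmetic described above.

    heap_mib     -- spark.executor.memory in MiB
    overhead_mib -- spark.executor.memoryOverhead in MiB; None applies the
                    default max(0.10 * heap, 384 MiB)
    """
    if overhead_mib is None:
        overhead_mib = max(int(heap_mib * 0.10), 384)
    pod_request = heap_mib + overhead_mib          # what Kubernetes schedules against
    usable_heap = heap_mib - 300                   # 300 MiB reserved system buffer
    unified_pool = memory_fraction * usable_heap   # execution + storage pool
    user_memory = (1 - memory_fraction) * usable_heap
    return {
        "pod_request_mib": pod_request,
        "unified_pool_mib": round(unified_pool),
        "user_memory_mib": round(user_memory),
    }

# An 8 GiB heap with the default overhead:
print(executor_memory_model(8192))
```

For an 8 GiB heap this yields a pod request of 9011 MiB, which is the number Kubernetes uses for placement and cgroup enforcement.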
For Dynamic Resource Allocation, the ExecutorAllocationManager in the driver evaluates scale-up need on a polling interval (spark.dynamicAllocation.schedulerBacklogTimeout, default 1 second). The target executor count after observing a backlog is:
target_executors = ceil(pending_tasks / spark.executor.cores)
requested = clamp(target_executors, minExecutors, maxExecutors)
Scale-down is simpler: any executor that has been idle for spark.dynamicAllocation.executorIdleTimeout (default 60 seconds) with no running tasks and no tracked shuffle data is a decommission candidate. Setting this value too low creates a destructive cycle: pods decommission before downstream stages read their shuffle output, shuffle-fetch failures force task retries, which re-triggers scale-up, which incurs pod startup latency, a pattern sometimes called executor churn.
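The scale-up formula is easy to sanity-check in isolation. This is a sketch of the arithmetic only, not the actual ExecutorAllocationManager implementation:

```python
import math

def dra_target_executors(pending_tasks, executor_cores, min_execs, max_execs):
    """Scale-up target: one task slot per core, clamped to configured bounds."""
    target = math.ceil(pending_tasks / executor_cores)
    return max(min_execs, min(target, max_execs))

# 500 pending tasks on 4-core executors, bounded to [2, 100]:
print(dra_target_executors(500, 4, 2, 100))   # clamped to maxExecutors -> 100
```

With only 10 pending tasks the same call returns 3 executors, and with an empty backlog the minExecutors floor of 2 holds.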
For K8s node schedulability, a pod with request vector (cpu_req, mem_req) can be placed on node n only when:
allocatable_cpu(n) - sum_cpu(running_pods_on_n) ≥ cpu_req
allocatable_mem(n) - sum_mem(running_pods_on_n) ≥ mem_req
Targeting a node utilization ratio of 70-85% leaves sufficient headroom for DRA bursts without triggering the Cluster Autoscaler on every job submission.
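The placement inequalities translate directly into code. This sketch checks whether a pod's request vector fits on a node given its already-running pods:

```python
def fits(node_alloc, running_pods, pod_req):
    """Schedulability check for one node.

    node_alloc, pod_req -- (cpu, mem) tuples
    running_pods        -- list of (cpu, mem) request tuples already placed
    """
    used_cpu = sum(p[0] for p in running_pods)
    used_mem = sum(p[1] for p in running_pods)
    return (node_alloc[0] - used_cpu >= pod_req[0]
            and node_alloc[1] - used_mem >= pod_req[1])

# A 16-core / 64 GiB node with two 4-core / 16 GiB pods still fits a third:
print(fits((16, 64), [(4, 16), (4, 16)], (4, 16)))
```

The real scheduler layers filters (taints, affinity, topology spread) on top of this capacity check, but the capacity check is the part DRA bursts exercise hardest.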
Performance Analysis
Pod startup latency is the most significant operational difference between Kubernetes and YARN. On YARN, a container typically starts in 2-5 seconds because the Spark distribution is already localized on the NodeManager and there is no container image to pull. On Kubernetes, a cold executor pod start breaks down as: image pull from registry (0-30 seconds; 0 when cached locally, 10-30 seconds for a multi-gigabyte JVM image on first pull on a new node) + container startup + JVM initialization (3-5 seconds) + Spark executor registration (1-2 seconds). In the worst case (a freshly added Cluster Autoscaler node pulling a cold image) pod startup can take 45 seconds. This means that for jobs with very short stages (under 2 minutes), DRA scale-up latency can be longer than the stage itself, and spark.dynamicAllocation.minExecutors must be tuned upward to maintain a warm pool.
Shuffle service topology on Kubernetes. On YARN, the External Shuffle Service runs as a persistent daemon on every NodeManager. Executor container churn does not affect shuffle data availability because shuffle blocks are stored on the NodeManager filesystem, not inside the executor container. On Kubernetes, the closest equivalent is an external shuffle service deployed as a DaemonSet; upstream Spark does not ship one for Kubernetes, so this means a third-party or remote shuffle service. Without it, spark.dynamicAllocation.shuffleTracking.enabled=true provides a soft substitute: Spark tracks which executors hold live shuffle blocks and refuses to decommission them until downstream stages have consumed their output. This works well for bounded batch jobs but can stall scale-down when long-running streaming jobs accumulate tracked shuffle metadata indefinitely.
GC tuning under K8s cgroup enforcement. G1GC with InitiatingHeapOccupancyPercent=35 (start concurrent marking at 35% heap occupancy rather than the default 45%) reduces the chance of evacuation failures that force a full STW GC. On Kubernetes, a full GC that spikes memory usage to 100% of the heap will not trigger a YARN-style "memory exceeded" warning: it will trigger an OOMKill. Adding MALLOC_ARENA_MAX=4 to executor containers prevents glibc from creating excessive memory arenas for native allocations, which can accumulate outside the JVM heap and push the process over its cgroup limit even when the JVM heap looks healthy.
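Assuming the kubeflow operator's CRD fields, these two settings might be expressed in a SparkApplication fragment like the sketch below; verify the field names against your operator version:

```yaml
# Illustrative SparkApplication fragment: G1GC tuning plus glibc arena cap.
spec:
  sparkConf:
    "spark.executor.extraJavaOptions": >-
      -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35
  executor:
    env:
      - name: MALLOC_ARENA_MAX
        value: "4"
```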
Executor pod churn rate is a metric not available in YARN. Prometheus exposes per-pod lifecycle transitions. If the average executor pod lives for less than 5 minutes, DRA idle timeout is likely set too aggressively for the workload pattern, and raising spark.dynamicAllocation.executorIdleTimeout from 60s to 300s will reduce total pod startup overhead significantly.
Building a Production Multi-Tenant Spark Platform on Kubernetes
A shared Kubernetes cluster running Spark for multiple teams requires deliberate architectural choices to prevent one team's runaway jobs from evicting another team's drivers and to give the platform team visibility into per-team resource consumption.
Namespace Isolation and RBAC
The standard model is one Kubernetes namespace per team, with a single spark-operator namespace hosting the operator controller. The operator is granted a ClusterRole that allows it to watch SparkApplication CRDs across all namespaces and manage pods, services, and config maps within each team namespace. Each team namespace gets its own ServiceAccount used by the driver pod when it calls the Kubernetes API to request executor pods. Fine-grained audit trails require per-application service accounts rather than per-team, but per-team is the practical starting point.
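A per-team RBAC setup along these lines might look like the following sketch; the names and namespace are illustrative:

```yaml
# The driver ServiceAccount needs pod/service/configmap rights in its namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-driver-sa
  namespace: team-alpha
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver-role
  namespace: team-alpha
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-driver-binding
  namespace: team-alpha
subjects:
  - kind: ServiceAccount
    name: spark-driver-sa
    namespace: team-alpha
roleRef:
  kind: Role
  name: spark-driver-role
  apiGroup: rbac.authorization.k8s.io
```

The SparkApplication's driver.serviceAccount field then references spark-driver-sa, giving the driver exactly the API access it needs and nothing more.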
ResourceQuota and LimitRange
Every team namespace should carry a ResourceQuota that caps total CPU, memory, and pod count. Without a quota, a single runaway job that loops on DRA scale-up can consume every node in the cluster. A LimitRange within the namespace sets default resource requests and limits for containers that do not specify them; this prevents pod templates that omit resource fields from scheduling as BestEffort pods, which Kubernetes evicts first under node memory pressure.
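An illustrative quota and default-limits pair for a team namespace (the numeric caps are placeholders to size against your own cluster):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "200"        # illustrative cap
    requests.memory: 800Gi
    pods: "300"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container omits requests
        cpu: "1"
        memory: 2Gi
      default:                 # applied when a container omits limits
        cpu: "2"
        memory: 4Gi
```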
Node Pools and Topology Assignment
Production Spark platforms on Kubernetes typically use three node pools with distinct labels and taints.
Standard compute nodes carry a spark-workload=true label and no taints. Driver pods always target this pool: drivers must not be preempted mid-job, and running them on standard on-demand instances is the cheapest way to guarantee availability. Pod templates should set nodeSelector: {spark-workload: "true"} for driver pods.
Spot or preemptible nodes carry a node-pool=spark-spot label and a spot-instance=true:NoSchedule taint. Executor pods that tolerate this taint can be scheduled here at 60-80% cost reduction compared to on-demand. Because spot nodes can be reclaimed with 2-minute notice, executor pods must be treated as disposable: the graceful decommissioning protocol described above is the safety mechanism, and the External Shuffle Service is what makes shuffle data survive spot preemptions.
GPU nodes carry nvidia.com/gpu taints and are targeted only by ML Spark workloads using GPU-accelerated UDFs or RAPIDS accelerator configurations.
The diagram below shows how the operator controller relates to team namespaces and the node pool topology:
```mermaid
flowchart LR
    subgraph OP[spark-operator namespace]
        CTL[Operator Controller]
    end
    subgraph TA[team-alpha namespace]
        QA[ResourceQuota]
        SA_A[ServiceAccount]
        CRD_A[SparkApplication CRDs]
    end
    subgraph TMB[team-beta namespace]
        QB[ResourceQuota]
        SA_B[ServiceAccount]
        CRD_B[SparkApplication CRDs]
    end
    subgraph NP[Node Pools]
        STD[Standard compute nodes]
        SPOT[Spot nodes]
        GPU[GPU nodes]
    end
    CTL -->|watches and reconciles| CRD_A
    CTL -->|watches and reconciles| CRD_B
    CRD_A -->|nodeSelector| STD
    CRD_A -->|toleration| SPOT
    CRD_B -->|toleration| SPOT
    CRD_B -->|toleration| GPU
```
The operator controller watches SparkApplication resources in both team namespaces and reconciles actual pod state toward the desired spec. Driver pods follow their nodeSelector to standard nodes for stability; executor pods with spot tolerations land on the cheaper pool. Team-beta's ML workload tolerates both spot and GPU taints, letting the Kubernetes scheduler choose the most available option.
PodDisruptionBudget for Driver Pods
When a cluster operator drains a node for maintenance, Kubernetes evicts pods from it. Without a PodDisruptionBudget covering the driver pod, eviction kills the driver, which terminates the entire Spark application: all executor pods included, all intermediate shuffle data lost. Adding a PodDisruptionBudget with minAvailable: 1 for driver pods prevents kubectl drain from evicting the driver until the job completes. This is especially important for multi-hour batch jobs that run overnight.
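A sketch of such a budget, assuming the spark-role=driver label that Spark on Kubernetes applies to driver pods (namespace is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spark-driver-pdb
  namespace: team-alpha
spec:
  minAvailable: 1
  selector:
    matchLabels:
      spark-role: driver   # matches every driver pod in the namespace
```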
Spark History Server
The Spark History Server provides the post-mortem UI for completed jobs. On Kubernetes, it runs as a standard Deployment with a Service and Ingress. Event logs are written by the driver to S3 or GCS (spark.eventLog.dir) and read by the History Server directly from object storage. No NFS share or hostPath volume is needed. A single History Server Deployment can serve event logs for all team namespaces.
๐ Visualizing the Spark Operator Lifecycle and Dynamic Allocation Flow
The two diagrams in this section complement each other: the first shows the coarse-grained state machine for a SparkApplication from kubectl apply to pod cleanup; the second shows the fine-grained DRA loop that runs continuously inside a running application.
The lifecycle diagram below traces every state transition the operator controller manages. The branches at Driver Pod State and Job Outcome capture the two most operationally important failure points: driver pod scheduling failures (usually caused by ResourceQuota exhaustion or missing node capacity) and driver-level failures (OOMKill or application exception) that trigger the restart policy logic.
```mermaid
flowchart TD
    A[kubectl apply SparkApplication CRD] --> B[Operator detects new resource]
    B --> C[Create Driver Service and ConfigMap]
    C --> D[Create Driver Pod]
    D --> E{Driver Pod State}
    E -->|Pending| F[K8s Scheduler places Driver on node]
    F --> G[Driver Pod Running]
    E -->|Running| G
    G --> H[Driver calls K8s API to request executor pods]
    H --> I[Operator creates Executor Pods]
    I --> J{Executor Pod State}
    J -->|Pending| K[K8s Scheduler places Executors on nodes]
    K --> L[Executors register with Driver]
    J -->|Running| L
    L --> M[Spark tasks execute across Executors]
    M --> N{Job completes?}
    N -->|All stages succeed| O[SparkApplication COMPLETED]
    N -->|Driver OOMKill or exception| P[SparkApplication FAILED]
    O --> Q[Operator cleans up all pods and resources]
    P --> R{Restart policy allows retry?}
    R -->|Yes| B
    R -->|No| Q
```
The DRA scale loop runs independently of the coarse-grained lifecycle. The diagram below focuses on what happens at the executor layer as task pressure grows and recedes. Notice the Cluster Autoscaler path at the center: when no existing node has capacity for a new executor pod, the autoscaler provisions one, but that warm-up introduces the pod startup latency discussed in the Performance Analysis section.
```mermaid
flowchart TD
    A[Spark task queue has pending tasks] --> B[ExecutorAllocationManager evaluates backlog]
    B --> C[Request new executors via K8s API]
    C --> D[Operator creates new Executor Pods]
    D --> E{Node capacity available?}
    E -->|Yes| F[Pod scheduled and transitions to Running]
    E -->|No| G[Cluster Autoscaler provisions new node]
    G --> F
    F --> H[Executor JVM starts and registers with Driver]
    H --> I[Tasks assigned to new executor]
    I --> J{Executor idle beyond timeout?}
    J -->|No| I
    J -->|Yes| K[DRA sends DecommissionExecutor RPC]
    K --> L[Executor completes current tasks]
    L --> M[Shuffle blocks transferred to External Shuffle Service]
    M --> N[Executor Pod terminates gracefully]
    N --> O[Cluster Autoscaler may scale down empty node]
```
Together these two diagrams show why spark.dynamicAllocation.executorIdleTimeout is the most operationally consequential DRA parameter. Set it too low, and the transition from N back to A happens before downstream stages read shuffle output from the decommissioned executor, forcing retries. Set it too high, and idle executors hold node capacity that could serve other jobs in the cluster.
How Spotify, Apple, and Lyft Run Spark on Kubernetes in Production
The migration from YARN to Kubernetes for Spark is not an experiment: it is the direction of the industry. Several large-scale data platforms have published their architectures and lessons publicly.
Spotify runs one of the largest data platforms on Kubernetes, processing petabytes of event streams daily. Their migration was driven by the same isolation problem: different teams sharing YARN clusters led to noisy-neighbor effects where one team's poorly-tuned job could cause GC storms across shared NodeManagers. On Kubernetes, Spotify uses namespace isolation with ResourceQuotas to hard-limit what any single team can consume, and they rely on K8s pod preemption to enforce priority tiers: high-priority ETL jobs can trigger eviction of lower-priority ad-hoc exploration jobs rather than waiting in a queue.
Apple operates a Kubernetes-based Spark platform at a scale that required custom Cluster Autoscaler configuration to handle bursty workloads that request thousands of executor pods within minutes. Their published approach uses pod priority classes with preemption to separate batch SLA workloads (high priority) from development/test workloads (low priority), allowing K8s to automatically reclaim capacity from dev jobs when production jobs need it.
Lyft built their data platform (Flyte) on Kubernetes from the start and integrated Spark as a task type within their DAG execution engine. Their insight was that treating a SparkApplication CRD as a task artifact (submitted, monitored, and cleaned up by an orchestration layer) allows Spark to participate in complex multi-stage pipelines that mix Spark stages with Python transformation tasks, dbt model runs, and ML inference jobs, all within a single DAG.
The common pattern across all three is that Kubernetes does not replace Spark's execution model: it replaces the resource manager. Spark still runs as a driver-executor distributed computation. What changes is that the cluster fabric is K8s-native, meaning the same toolchain (kubectl, Helm, ArgoCD, Prometheus, Grafana) that manages the rest of the platform also manages Spark jobs.
The Honest Trade-offs: YARN Versus Kubernetes and the Failure Modes Nobody Documents
Every architecture decision involves trade-offs, and the YARN-to-Kubernetes migration is no exception. The table below captures the most operationally significant differences:
| Dimension | YARN | Kubernetes |
|---|---|---|
| Resource model | Static queues, pre-allocated capacity shares | Dynamic pod scheduling against node capacity |
| Executor startup time | 2-5 seconds (warm NodeManager, no image pull) | 15-45 seconds (cold image pull + JVM init) |
| Shuffle service | Built-in ESS on every NodeManager | DaemonSet ESS or shuffle tracking (manual setup) |
| Multi-tenancy | Queue hierarchy with capacity guarantees | Namespace + ResourceQuota + LimitRange |
| Spot instance support | Complex, varies by cloud provider | First-class via tolerations and node labels |
| Observability | YARN ResourceManager UI + Spark UI | Prometheus + Grafana + Spark UI via ingress |
| Fault isolation | NodeManager process shared across jobs | Pod-level isolation; failure affects only that pod |
| Cluster autoscaling | Manual or limited auto-scaling | Kubernetes Cluster Autoscaler, Karpenter (AWS) |
| Learning curve | Familiar to Hadoop engineers | Requires Kubernetes literacy |
Failure Modes
Driver OOMKill. Because the driver hosts the SparkContext, the DAGScheduler, the metadata for all active broadcast variables, and the result collection buffer for collect() calls, it is prone to memory pressure on large-scale jobs. Unlike executor OOMKill, which loses a single partition and triggers a task retry, driver OOMKill terminates the entire application. On Kubernetes, the driver has no YARN NodeManager to give it a graduated warning; the cgroup enforcer kills it immediately when the container exceeds its limit. Always set spark.driver.memoryOverhead explicitly (512 MiB to 1 GiB is common for complex jobs) and avoid collect() on large datasets in the driver.
Shuffle fetch failure after executor pod eviction. When a spot executor pod is preempted before downstream stages read its shuffle output, and no External Shuffle Service is deployed, Spark raises FetchFailedException. Spark retries the failed stage up to spark.stage.maxConsecutiveAttempts times (default 4). If spot evictions happen faster than stages complete, jobs enter a retry loop and eventually fail. The fix is either to deploy an External Shuffle Service as a DaemonSet, or to set spark.dynamicAllocation.shuffleTracking.enabled=true and keep shuffle-heavy executors off spot nodes via node affinity to the standard pool.
Image pull backoff. If the executor pod image is not cached on a node and the registry is under load, all executor pods on that node enter ImagePullBackOff state. Spark has no retry mechanism for pod scheduling failures: the driver waits up to the executor allocation timeout (spark.kubernetes.allocation.executor.timeout) and then gives up on those executor requests. Proactively warming images on new nodes via a DaemonSet, or using a pull-through registry cache in-cluster, eliminates this class of failure.
ResourceQuota exhaustion. When a namespace's quota is fully consumed, new pods pend indefinitely. Unlike YARN queues that show queue utilization in the ResourceManager UI, Kubernetes quota exhaustion only surfaces as pod events in kubectl describe. Teams should configure Prometheus alerting on kube_resourcequota metrics to fire when namespace quota utilization exceeds 85%.
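One way to express that alert, sketched as a PrometheusRule for the Prometheus Operator (rule names and the 10-minute window are illustrative; kube_resourcequota comes from kube-state-metrics):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-quota-alerts
spec:
  groups:
    - name: spark-quota
      rules:
        - alert: NamespaceQuotaNearExhaustion
          expr: |
            kube_resourcequota{type="used"}
              / ignoring(type) kube_resourcequota{type="hard"} > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} is above 85% of its {{ $labels.resource }} quota"
```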
Choosing Between YARN, Kubernetes, and Managed Spark for Your Platform
The decision between YARN, Kubernetes, and managed services (Databricks, EMR Serverless, GCP Dataproc Serverless) depends on organizational maturity, workload patterns, and infrastructure ownership model.
| Situation | Recommendation |
|---|---|
| Use K8s Spark when | You already operate Kubernetes for other workloads, need namespace-level isolation between teams, want spot-instance cost optimization, or require fine-grained pod-level security policies |
| Avoid K8s Spark when | Your workloads are shuffle-heavy and you cannot deploy an External Shuffle Service, your jobs complete in under 2 minutes (startup overhead dominates), or your platform team has no Kubernetes operational experience |
| Stick with YARN when | You have an existing Hadoop ecosystem with HBase, HDFS data locality requirements, or large investments in YARN queue governance tooling |
| Prefer managed Spark when | You want zero infrastructure management overhead and are willing to accept higher per-job cost and less configurability |
| Edge case: GPU workloads | Kubernetes wins decisively: GPU node taints, device plugins, and NVIDIA container runtime are native K8s features; YARN GPU support is significantly more complex |
| Edge case: sub-minute jobs | Pre-warm executor pods using spark.dynamicAllocation.minExecutors ≥ 2 and consider Spark Structured Streaming for near-real-time use cases rather than repeated batch micro-jobs |
The most common mistake in this decision is assuming that "switching to Kubernetes" is equivalent to adopting a managed service. It is not. Running the Spark Operator on Kubernetes still requires platform engineering expertise for cluster sizing, autoscaler configuration, image management, RBAC governance, and observability setup. The payoff is complete control over the execution environment and the ability to co-locate Spark with other workloads on shared infrastructure.
Configuring a Production SparkApplication with DRA, Pod Templates, and Observability
This section demonstrates the configuration decisions a platform engineer makes when preparing a production Spark job for Kubernetes. The goal is not a minimal example that runs in a sandbox — it is a template that survives spot preemptions, respects namespace quotas, and produces Prometheus-scrapable metrics.
Example 1: Executor Anti-Affinity and Spot Tolerations via Pod Template
A data platform team running a daily ETL job that produces 2 TB of output wants executors on spot nodes to reduce cost, but needs executors spread across nodes so a single spot preemption event does not kill more than 25% of running executors. They achieve this with a pod template that combines spot node tolerations with pod anti-affinity weighted toward node-level spreading.
The driver template pins the driver to standard on-demand nodes using a nodeSelector, ensuring that spot preemption cannot kill the driver mid-job. The executor template tolerates the spot taint and sets a preferred pod anti-affinity against other executor pods using the node hostname as the topology key. A weight of 80 strongly discourages, but does not forbid, placing two executor pods on the same node: co-location remains possible as a last resort when no other node fits.
When this configuration runs against a cluster that has 4 spot nodes with 8 executor slots each, a single spot preemption event costs at worst 8 out of 32 executors — 25% — rather than a correlated failure where all executors were packed onto a single large node.
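The blast-radius arithmetic generalizes to other cluster shapes. The sketch below (an illustrative helper, not part of Spark or the operator) computes worst-case executor loss when anti-affinity has spread executors evenly across spot nodes:

```python
import math

def worst_case_loss(total_executors: int, spot_nodes: int,
                    nodes_preempted: int = 1) -> float:
    """Fraction of executors lost if `nodes_preempted` spot nodes are
    reclaimed, assuming anti-affinity spread executors evenly across nodes."""
    per_node = math.ceil(total_executors / spot_nodes)  # worst-case packing per node
    lost = min(total_executors, per_node * nodes_preempted)
    return lost / total_executors

# 32 executors over 4 spot nodes: one preemption loses 8/32 of them
print(worst_case_loss(32, 4))  # 0.25
```

Without the anti-affinity term, the same arithmetic degrades to `worst_case_loss(32, 1)` for whatever subset of executors the scheduler happened to pack onto the preempted node.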
Example 2: Correlating Executor Memory Overhead with Job Complexity
A team migrating a PySpark job from YARN discovers that the job OOMKills on Kubernetes even though it ran fine on YARN with the same spark.executor.memory=8g setting. The diagnosis is the overhead gap. On YARN, the NodeManager polls the container's process tree (JVM heap + native allocations + Python worker subprocesses) periodically, so brief spikes above the container size can slip through. On Kubernetes, the cgroup limit of spark.executor.memory + spark.executor.memoryOverhead is enforced instantly by the kernel. The default overhead is max(10% of executor memory, 384 MiB) — roughly 819 MiB for an 8 GiB heap — making the pod limit about 8.8 GiB. A PySpark job using Arrow-based pandas UDFs allocates 500–800 MB per Python worker process in off-heap memory, pushing the container over the 8.8 GiB limit.
The fix is spark.executor.memoryOverhead=2048m — 25% of heap — plus spark.executor.pyspark.memory=1024m explicitly allocated for the Python workers. Because the Python worker memory is added to the container request on Kubernetes, the total pod request becomes 8 + 2 + 1 = 11 GiB, which now comfortably contains the JVM heap, native allocations, and the Python Arrow workers.
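The arithmetic is worth encoding when reviewing configs. The sketch below is an illustrative helper, not Spark code; the 10% / 384 MiB rule mirrors Spark's documented default for memoryOverhead:

```python
MIB = 1 << 20
GIB = 1 << 30

def executor_pod_memory(heap_bytes, overhead_bytes=None, pyspark_bytes=0):
    """Container memory limit requested for an executor pod.
    Default overhead mirrors Spark's rule: max(10% of heap, 384 MiB)."""
    if overhead_bytes is None:
        overhead_bytes = max(int(0.10 * heap_bytes), 384 * MIB)
    return heap_bytes + overhead_bytes + pyspark_bytes

# Defaults for spark.executor.memory=8g: overhead ~819 MiB, limit ~8.8 GiB
default_limit = executor_pod_memory(8 * GIB)
# Tuned: 8g heap + 2g overhead + 1g pyspark = 11 GiB
tuned_limit = executor_pod_memory(8 * GIB, 2048 * MIB, 1024 * MIB)
print(round(default_limit / GIB, 2), tuned_limit / GIB)  # 8.8 11.0
```

The 2.2 GiB gap between the default and tuned limits is exactly the headroom the Arrow workers were stealing from the cgroup.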
Spark Operator, DRA Configuration, and Prometheus Monitoring: The Complete OSS Reference
Installing the Spark Operator with Helm
The operator is distributed as a Helm chart. The minimal production installation enables webhook validation (which catches malformed SparkApplication CRDs before they create partially-configured pods) and leaves the operator's job namespace empty, which makes a single controller deployment watch SparkApplication resources across all team namespaces.
```bash
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \
  --set sparkJobNamespace="" \
  --set metrics.enable=true \
  --set metrics.port=10254
```
SparkApplication CRD with DRA and Pod Templates
The SparkApplication below represents a production-grade ETL job configuration. It references pod template files for driver and executor customization, enables Dynamic Resource Allocation with shuffle tracking, and configures the Prometheus metrics servlet on every executor.
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-daily-pipeline
  namespace: team-alpha
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "my-registry.io/spark-etl:3.5.1-jvm17"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "local:///app/daily_etl.py"
  sparkVersion: "3.5.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "4g"
    memoryOverhead: "768m"
    serviceAccount: spark-team-alpha
    podTemplateFile: /etc/spark/templates/driver-template.yaml
    labels:
      team: alpha
      spark-role: driver
  executor:
    cores: 4
    coreLimit: "4000m"
    memory: "8g"
    memoryOverhead: "2048m"
    instances: 4
    podTemplateFile: /etc/spark/templates/executor-template.yaml
    labels:
      team: alpha
      spark-role: executor
  sparkConf:
    "spark.dynamicAllocation.enabled": "true"
    "spark.dynamicAllocation.shuffleTracking.enabled": "true"
    "spark.dynamicAllocation.minExecutors": "2"
    "spark.dynamicAllocation.maxExecutors": "24"
    "spark.dynamicAllocation.initialExecutors": "4"
    "spark.dynamicAllocation.executorIdleTimeout": "120s"
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s"
    "spark.decommission.enabled": "true"
    "spark.storage.decommission.shuffleBlocks.enabled": "true"
    "spark.ui.prometheus.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus"
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
      port: 8090
```
Executor Pod Template with Spot Tolerations and Anti-Affinity
The executor pod template below pins executors to spot nodes and distributes them across hosts to limit blast radius from a single spot preemption event.
```yaml
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    node-pool: spark-spot
  tolerations:
    - key: spot-instance
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: spark-role
                  operator: In
                  values:
                    - executor
            topologyKey: kubernetes.io/hostname
  containers:
    - name: executor
      env:
        - name: MALLOC_ARENA_MAX
          value: "4"
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:+HeapDumpOnOutOfMemoryError"
      volumeMounts:
        - name: spark-local-dir
          mountPath: /tmp/spark-local
  volumes:
    - name: spark-local-dir
      emptyDir: {}
```
Dynamic Resource Allocation Configuration Reference
The table below maps the most important DRA properties to their operational effect:
| Property | Default | Recommended Production Value | Effect |
| --- | --- | --- | --- |
| spark.dynamicAllocation.enabled | false | true | Enables DRA |
| spark.dynamicAllocation.shuffleTracking.enabled | false | true | Avoids ESS requirement (Spark 3.2+) |
| spark.dynamicAllocation.minExecutors | 0 | 2–4 | Warm pool prevents cold-start spikes |
| spark.dynamicAllocation.maxExecutors | infinity | Team quota limit | Hard cap on scaling |
| spark.dynamicAllocation.initialExecutors | minExecutors | 4 | Initial executor count at job start |
| spark.dynamicAllocation.executorIdleTimeout | 60s | 120–300s | Time before an idle executor is decommissioned |
| spark.decommission.enabled | false | true | Graceful shutdown vs. abrupt kill |
Prometheus ServiceMonitor for Spark Executor Metrics
The ServiceMonitor below instructs kube-prometheus-stack to scrape the Prometheus metrics endpoint behind every executor. One caveat: a ServiceMonitor selects Services, not pods, so the matchLabels below assume a Service carrying the spark-app-selector: "true" label fronts the executor pods and exposes the spark-ui port. If no such Service exists, a PodMonitor with the same label selector and endpoint settings is the pod-level equivalent.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-executor-metrics
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      spark-app-selector: "true"
  namespaceSelector:
    any: true
  endpoints:
    - port: spark-ui
      path: /metrics/prometheus
      interval: 15s
      honorLabels: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_spark_role]
          targetLabel: spark_role
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
```
Key Prometheus Metrics to Dashboard
Beyond the standard Spark metrics exposed via PrometheusServlet, the Kubernetes layer exposes pod-level metrics through kube-state-metrics that are critical for Spark-on-K8s observability. kube_pod_status_phase{phase="Pending"} filtered by the spark-role=executor label gives you the pending executor count — the primary leading indicator of DRA scale-up events or cluster capacity exhaustion. kube_pod_container_status_restarts_total filtered by executor pods reveals OOMKill patterns before they cascade into job failures. On the Spark side, spark_executor_cpuTime_total and spark_executor_shuffleWriteTime_total expose the CPU and I/O balance of executor work — a high shuffle write time relative to CPU time signals that your bottleneck is network bandwidth or disk I/O, not CPU, and more executors will not help.
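Sketches of the two kube-state-metrics queries in PromQL follow; they assume kube-state-metrics is configured with --metric-labels-allowlist so that kube_pod_labels carries the spark-role pod label (without that allowlist, the joins below return nothing):

```promql
# Pending executor pods per namespace: leading indicator of
# quota or cluster-capacity exhaustion.
sum by (namespace) (
  kube_pod_status_phase{phase="Pending"}
    * on (namespace, pod) group_left()
  kube_pod_labels{label_spark_role="executor"}
)

# Executor restart rate: OOMKill patterns show up here before
# they surface as Spark job failures.
sum by (namespace) (
  rate(kube_pod_container_status_restarts_total[15m])
    * on (namespace, pod) group_left()
  kube_pod_labels{label_spark_role="executor"}
)
```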
For a comprehensive Grafana dashboard, the Spark community maintains a dashboard at Grafana.com (dashboard ID 7890) that integrates these Kubernetes pod metrics with Spark application-level counters.
Lessons Learned from Running Spark on Kubernetes at Scale
Always set memoryOverhead explicitly. The default 10% rule was calibrated for YARN, where memory enforcement is gradual. On Kubernetes, cgroup limits are absolute. PySpark jobs with Arrow UDFs, jobs that broadcast large lookup tables into executor memory, and jobs using off-heap storage all need 20–30% overhead. The first symptom of insufficient overhead is not an OOM error in the logs — it is a pod showing OOMKilled in kubectl describe, and the Spark driver reports the executor as "lost without exit code."
Never run drivers on spot nodes. A driver OOMKill or spot preemption restarts the entire job, discarding hours of executor work. The cost of one on-demand driver instance is negligible compared to a multi-hour job restart. Encode this rule in your driver pod template with a strict nodeSelector for standard nodes, not a preferred affinity that the scheduler can override.
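A minimal driver-side template encoding that rule might look like the sketch below; the node-pool label and its spark-on-demand value are assumptions, so substitute the label your on-demand pool actually carries:

```yaml
apiVersion: v1
kind: Pod
spec:
  # nodeSelector is a hard requirement, not a preference: the scheduler
  # can never place the driver on a node missing this label.
  nodeSelector:
    node-pool: spark-on-demand
```

Because this template also carries no toleration for the spot taint, a mislabeled node pool would still keep the driver off tainted spot nodes.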
Set minExecutors to maintain a warm pool. A minExecutors of 0 means every job starts cold. Even two or three pre-warmed executors eliminate the startup latency spike at the beginning of each stage and smooth out the DRA control loop. For streaming jobs, minExecutors should equal the base throughput requirement at off-peak load.
Treat executorIdleTimeout as a cost vs. retry risk dial. An aggressive idle timeout (30–60s) reduces idle executor costs but increases shuffle-fetch retry risk when spot preemptions coincide with idle cycles. A conservative timeout (180–300s) keeps more executors warm but spends more on idle compute. Monitor executor churn rate in Prometheus and tune from data, not intuition.
Resource quota exhaustion is silent without alerting. When a namespace quota is saturated, new executor pods pend without error. The Spark driver waits for executors that will never arrive, and the job appears to be making slow progress rather than stuck. Alert on kube_resourcequota_used / kube_resourcequota_hard > 0.85 to catch this before jobs time out.
Decommission protocol requires Spark 3.2+ for shuffleTracking. On Spark 3.1 and earlier, Dynamic Resource Allocation on Kubernetes without an External Shuffle Service DaemonSet will cause shuffle data loss. If deploying ESS as a DaemonSet is not feasible, you must either upgrade to Spark 3.2+ or disable DRA. This is the single most common incompatibility in YARN-to-K8s migrations using legacy Spark versions.
Version your executor images with your JAR. Executor pods created by DRA scale-up must use the same image as the driver. Deploying a new application version while a job is running — if the image tag is mutable — can create a split-brain situation where old executors and new executors have different class files. Use immutable SHA-based image tags in CI/CD to guarantee image consistency for the lifetime of a job.
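One way to enforce this in CI is a pre-submission check that rejects any image reference that is not digest-pinned; the helper below is hypothetical, not part of the operator:

```python
import re

# An image reference is immutable only when pinned by digest:
# registry/repo@sha256:<64 lowercase hex chars>
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image: str) -> bool:
    """Return True if the image reference ends in a sha256 digest."""
    return bool(DIGEST_RE.search(image))

print(is_digest_pinned("my-registry.io/spark-etl:3.5.1-jvm17"))          # False (mutable tag)
print(is_digest_pinned("my-registry.io/spark-etl@sha256:" + "a" * 64))   # True
```

Wiring this into the pipeline step that renders the SparkApplication manifest fails the build before a mutable tag ever reaches the cluster.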
Summary and Key Takeaways
- The Spark Operator replaces YARN's ResourceManager with a Kubernetes controller loop that reconciles SparkApplication CRDs into driver pods, executor pods, services, and config maps. Jobs are declared as Kubernetes resources and managed by the same kubectl toolchain as the rest of your platform.
- Dynamic Resource Allocation on Kubernetes requires explicit configuration to be production-safe. The most important setting is spark.dynamicAllocation.shuffleTracking.enabled=true (Spark 3.2+), which prevents shuffle data loss during executor scale-down without requiring an External Shuffle Service. For older Spark versions or spot-heavy workloads with aggressive preemption, the ESS DaemonSet is mandatory.
- Pod startup latency is 10–15× higher on Kubernetes than YARN for cold executor starts. Use spark.dynamicAllocation.minExecutors to maintain a warm pool, cache executor images on nodes using DaemonSet warm-up jobs, and avoid architectures where sub-60-second stages are dominant.
- memoryOverhead is the most under-configured property in Kubernetes Spark deployments. Set it to 20–25% of heap as a baseline, add spark.executor.pyspark.memory for PySpark workloads, and monitor for OOMKill events in Prometheus rather than relying on Spark logs to surface them.
- Multi-tenancy requires namespace + ResourceQuota + LimitRange together. Namespace alone provides API isolation but not resource isolation. ResourceQuota without LimitRange allows pods without resource requests to schedule as BestEffort and consume unlimited resources. All three layers are required for reliable shared-platform behavior.
- The two-scheduler model is the key mental model. The K8s scheduler places pods on nodes; the Spark scheduler places tasks on executor slots. They interact only through pod availability. Troubleshooting starts with determining which scheduler is the bottleneck: pending pods mean K8s scheduling issues (quota, capacity, image pull); slow task progress with running executors means Spark scheduling issues (data skew, shuffle partitions, GC).
- Prometheus + Grafana is the production observability stack. The Spark metrics servlet + kube-state-metrics + a ServiceMonitor gives you executor pod lifecycle, GC pause duration, shuffle write throughput, and pending pod count in a single Grafana dashboard. These metrics predict job failures significantly earlier than the Spark event log.
Related Posts
- Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained — foundational mental model for how Spark orchestrates execution across the driver-executor boundary, prerequisite context for understanding why K8s pod placement and Spark task placement are independent concerns
- Spark Shuffles and GroupBy Performance — deep dive into shuffle mechanics, partition sizing, and the External Shuffle Service architecture that makes Dynamic Resource Allocation safe on both YARN and Kubernetes
- Spark Kafka Structured Streaming Pipeline — practical guide to running Spark Structured Streaming with Kafka on Kubernetes, including checkpoint storage on object stores and stateful stream processing within K8s pod lifecycle constraints