RDDs vs DataFrames vs Datasets: Understanding the Evolution of Spark APIs
Understand the core differences, compilation paths, and performance characteristics of Spark APIs.

Abstract Algorithms
Helping engineers master software engineering topics.
TLDR: Apache Spark has evolved from low-level RDDs (Resilient Distributed Datasets) to highly optimized DataFrames and Datasets. RDDs offer functional control but miss engine-level optimization. DataFrames use Spark's Catalyst Optimizer to compile declarative queries into high-performance JVM bytecode. This guide traces their differences and compares their execution paths with runnable PySpark examples.
📖 Concept: The Evolution of Apache Spark APIs
To truly understand why Apache Spark has evolved through several iterations of APIs, we must first look back at the limitations of the classic Hadoop MapReduce programming model. In the early days of big data engineering, MapReduce was the primary tool for distributed computations. It split work into rigid map and reduce tasks.
Every stage of a MapReduce pipeline read data from and wrote data back to disk (HDFS) to ensure fault tolerance. This disk-bound layout created massive bottlenecks for iterative algorithms, such as machine learning training loops or multi-stage SQL joins, because reading and writing to disk repeatedly added heavy network and disk I/O latency.
Apache Spark revolutionized this space by introducing an in-memory computation model. The foundation of this system was the Resilient Distributed Dataset (RDD). An RDD represents an immutable, lazily evaluated, partitioned collection of elements that can be operated on in parallel across a cluster.
Instead of writing temporary results to disk, Spark kept intermediate RDD data partitions in executor memory. If a node failed, the RDD could reconstruct itself using its lineage graph—the logical sequence of transformations that produced the dataset—rather than relying on physical disk replication.
However, as RDDs were deployed to run production pipelines at scale, a new bottleneck appeared. Because RDDs are arbitrary collections of objects, Spark did not understand the internals of the data structures or functions that developers wrote. To Spark, an RDD transformation was a black-box lambda function.
If a developer wrote a filter operation in Python, Spark had to launch a Python process, serialize data from Java JVM memory to Python memory via a Py4J socket, run the user's custom filter code, and serialize the result back to Java. The engine could not run query optimizations, prune unused fields, or reorder operations to reduce data scans.
To solve this semantic gap, Spark introduced schema-aware, structured APIs: DataFrames and Datasets. These structured APIs allow the engine to inspect the data types, column names, and expressions. This schema awareness enables the Catalyst Optimizer and the Tungsten Execution Engine to optimize queries and manage memory layouts under the hood.
🔍 Basics: RDDs, DataFrames, and Datasets
Choosing the correct API requires understanding their compile-time characteristics and memory representations:
1. Resilient Distributed Dataset (RDD)
The low-level API of Spark. RDDs represent collections of JVM objects. They are strongly typed (e.g., RDD[User] in Scala), meaning they catch type errors at compile time. However, this type safety comes at a cost. Storing data as raw JVM objects adds significant garbage collection overhead and memory fragmentation. Since Spark cannot inspect RDD transformations, there is no automatic optimization, and the performance of your pipeline relies entirely on the developer's execution strategy.
2. DataFrame
A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database. Under the hood, DataFrames do not store raw JVM objects. Instead, they use Spark's binary format, Tungsten, which packs data into off-heap raw memory bytes.
This layout eliminates garbage collection pauses and reduces serialization size. DataFrames are untyped in Python and dynamically resolved at runtime in Java/Scala. When you execute a DataFrame query, the Catalyst Optimizer evaluates the execution path and optimizes it before running it.
3. Dataset
An extension of the DataFrame API that provides type safety. It combines the benefits of RDDs (strong typing, object-oriented transformations like .map()) with the performance optimizations of DataFrames. Datasets use Encoders to map domain-specific Java objects to Spark's internal Tungsten binary format, allowing fast serialization and off-heap execution. Because Python does not have a compile-time static type system, Datasets are not available in PySpark; Python developers use the DataFrame API, which delivers the same performance characteristics.
The table below contrasts the main differences:
| Aspect | RDD | DataFrame | Dataset |
| API Type | Low-level functional | High-level declarative | High-level strongly typed |
| Optimization | None (user-defined logic) | High (Catalyst Optimizer) | High (Catalyst Optimizer) |
| Serialization | Java Serialization / Pickle | Tungsten Binary format | Tungsten Encoders |
| Type Safety | Compile-time safe | Runtime safe | Compile-time safe |
| Language Availability | Java, Scala, Python, R | Java, Scala, Python, R | Java, Scala |
⚙️ Mechanics: Under the Hood of the Catalyst Optimizer
The primary reason to choose DataFrames and Datasets over raw RDDs is the query optimization pipeline managed by the Catalyst Optimizer. The Catalyst Optimizer is an extensible query planning framework that automatically optimizes relational execution paths.
When you write a declarative DataFrame expression, Spark does not execute it immediately. Instead, the query passes through the following optimization pipeline:
1. Analysis and Unresolved Logical Plan
When you submit a query, Spark constructs an Abstract Syntax Tree (AST) representing your commands. At this point, the relations and column types are unresolved. For example, if you filter by a column named age, Spark does not yet know if that column exists or if its type is an integer. This AST is called the Unresolved Logical Plan.
2. Logical Analysis and Resolved Plan
Spark's analyzer checks the unresolved plan against the Catalog (the database schema metadata table, temp views, and files) to verify that column and table names exist. Once verified, it maps data types to the columns, outputting a Resolved Logical Plan.
3. Logical Optimization (Rules-Based)
The Catalyst Optimizer applies a series of standard rules-based optimizations to the resolved plan. These include:
- Constant Folding: Evaluating expressions containing constants at plan time (e.g., replacing
1 + 1with2before execution). - Predicate Pushdown: Moving filter operations as close to the data source as possible. If you filter by
country == 'US', Spark pushes this filter to the file reader, reading only files or partitions matching that criteria, rather than loading the whole dataset into memory first. - Projection Pruning: Selecting only the columns referenced in the query. If a table has 100 columns and you only select
name, the optimizer discards the other 99 columns before reading data blocks.
The output of this stage is the Optimized Logical Plan.
4. Physical Planning (Cost-Based Model)
The optimized logical plan is mapped to multiple potential physical execution plans. Spark evaluates the cost of each physical plan (calculating network overhead, join types, disk I/O, etc.) and selects the most efficient plan. For instance, if you are joining a large table with a small table, Spark's cost-based optimizer will choose a Broadcast Hash Join (broadcasting the small table to all worker nodes) instead of a Sort-Merge Join (which requires shuffling both tables across the network).
5. Code Generation (Tungsten)
Once the physical plan is chosen, Spark uses the Janino compiler to compile the execution steps into optimized Java bytecode at runtime. This avoids the overhead of traversing a complex node tree for every row and generates flat machine code instructions that run at hardware speed.
📊 Flow: Spark Query Optimization Pipeline
The flowchart below visualizes the transition steps of a query through the Catalyst Optimizer and Tungsten code generation:
graph TD
Query[User DataFrame/SQL Query] -->|Parse| ULP[Unresolved Logical Plan]
ULP -->|Analysis & Catalog Lookup| RLP[Resolved Logical Plan]
RLP -->|Catalyst Optimization Rules| OLP[Optimized Logical Plan]
OLP -->|Physical Planning Cost Model| Phys[Physical Execution Plan]
Phys -->|Code Generation Janino| Byte[Optimized JVM Bytecode]
To illustrate, suppose we execute the following query:
df.filter(df("age") > 21).select("name")
The trace table below demonstrates how the optimizer processes this statement compared to a naive execution strategy:
| Step | Naive Execution | Catalyst Optimized Execution | Benefit |
| 1 | Load all columns from storage. | Prunes columns, reading only name and age. | Reduces disk read and network I/O. |
| 2 | Filter rows in executor memory. | Pushes age > 21 filter to Parquet file reader. | Skips reading non-matching rows entirely. |
| 3 | Execute Python lambda wrapper checks. | Executes compiled Java bytecode directly. | Eliminates process serialization latency. |
🌍 Applications: When to Use RDDs vs. DataFrames
- Unstructured Data Processing: Use RDDs when processing raw, unstructured binary data (like raw video files, image streams, or raw network socket text) where schemas are impossible to define.
- Relational Data Warehouses: Use DataFrames when building large-scale ETL pipelines on structured formats like Parquet, ORC, Delta Lake, or Avro. DataFrames run at maximum speed using memory optimization.
- Machine Learning Feature Pipelines: Use DataFrames to join, aggregate, and normalize feature tables before feeding them into ML estimators.
🧪 Practical Implementation: Side-by-Side Code Comparison
Here is a complete, runnable PySpark script showing a side-by-side processing flow: performing a group-by aggregation using RDD functional APIs vs. DataFrame DSL queries.
import os
from pyspark.sql import SparkSession
def init_spark():
# Initialize a local Spark Session
return SparkSession.builder \
.appName("SparkAPIEvolution") \
.master("local[*]") \
.getOrCreate()
def run_rdd_pipeline(sc, raw_data):
# Parallelize list into an RDD of tuples: (employee, department, salary)
rdd = sc.parallelize(raw_data)
# 1. Map to key-value pair: (department, (salary, 1))
# 2. ReduceByKey to aggregate: (department, (sum_salary, count))
# 3. Map to calculate average: (department, avg_salary)
aggregated_rdd = rdd \
.map(lambda row: (row[1], (row[2], 1))) \
.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
.map(lambda pair: (pair[0], pair[1][0] / pair[1][1]))
return aggregated_rdd.collect()
def run_dataframe_pipeline(spark, raw_data):
# Convert list of tuples into a DataFrame with schema
columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(raw_data, schema=columns)
# Declarative aggregation using DataFrame DSL
avg_salary_df = df.groupBy("department") \
.avg("salary") \
.withColumnRenamed("avg(salary)", "average_salary")
# Print the physical execution plan compiled by the Catalyst Optimizer
print("\n--- Physical Execution Plan ---")
avg_salary_df.explain()
return avg_salary_df.collect()
if __name__ == "__main__":
# Sample worked dataset: Employees list (name, department, salary)
employees = [
("Alice", "Engineering", 120000),
("Bob", "Sales", 85000),
("Charlie", "Engineering", 140000),
("David", "Sales", 95000),
("Eve", "HR", 70000)
]
spark = init_spark()
sc = spark.sparkContext
# Run low-level RDD pipeline
print("Executing RDD Pipeline:")
rdd_results = run_rdd_pipeline(sc, employees)
for dept, avg_sal in rdd_results:
print(f"Department: {dept} -> Average Salary: ${avg_sal:,.2f}")
# Run high-level DataFrame pipeline
print("\nExecuting DataFrame Pipeline:")
df_results = run_dataframe_pipeline(spark, employees)
for row in df_results:
print(f"Department: {row['department']} -> Average Salary: ${row['average_salary']:,.2f}")
spark.stop()
📚 Lessons Learned: Common Beginner Mistakes in Spark
- Incorrect Object Serialization in Python RDDs: Python RDD transformations require serializing Python objects (using Pickle) and passing them back and forth between the Python process and the JVM executor process via a Py4J gateway. This serialization bridge is extremely slow. Using DataFrames avoids this bottleneck because processing runs natively inside JVM memory using Tungsten binary layouts.
- Calling
.collect()on Production Datasets: Calling.collect()pulls the entire distributed dataset from executor nodes and aggregates it onto the driver node's memory. If the dataset exceeds the driver memory, the JVM will throw anOutOfMemoryErrorand crash. Use.take(n)or.show(n)for inspection instead. - Ignoring Partition Skew: If your data is skewed (e.g. 90% of rows contain the same key), joining or aggregating by that key will overload a single executor node while other nodes remain idle, leading to straggling tasks and pipeline stalls.
📌 Summary: The Spark API Cheatsheet
- RDDs (Resilient Distributed Datasets): Immutable, partitioned, low-level functional API. Bypasses engine optimization.
- DataFrames: Schema-aware, tabular abstraction. Uses the Catalyst Optimizer and Tungsten execution engine for maximum speed.
- Datasets: Typed extension of DataFrames for Scala/Java providing compile-time type safety.
- Catalyst Optimizer: Compiles declarative queries into optimized JVM bytecode, applying predicate pushdowns and projection pruning automatically.
- PySpark Optimization Rule: Always use DataFrames in Python to avoid Py4J process serialization overhead.
AI-generated article quiz
Test your understanding
Ready to test what you just learned?
Generate four focused questions from this article. Answers include immediate explanations.
Guided series path
Apache Spark Engineering
Reader feedback
Was this article useful?
Rate it if it helped, then continue with the next deep dive when you are ready.
Sign in to save your rating.
Article metadata