
Create Your First DataFrame in PySpark

PySpark allows users to handle large datasets efficiently through distributed computing. Whether you’re new to Spark or looking to enhance your skills, this article walks through how to create DataFrames and manipulate data effectively, unlocking the power of big data analytics with PySpark.

1. Introduction

1.1 What is PySpark?

PySpark is the Python API for Apache Spark, a powerful open-source distributed computing system. It enables developers to perform large-scale data processing and analytics in a parallel and fault-tolerant manner. PySpark integrates seamlessly with Python, allowing data engineers and data scientists to leverage the power of Spark using a familiar programming language.

Apache Spark is renowned for its speed and scalability, making it a go-to framework for big data processing. It supports diverse workloads, such as batch processing, real-time streaming, machine learning, and graph processing. With PySpark, Python users can tap into Spark’s distributed computing capabilities without delving into the complexities of Java or Scala.

PySpark’s ecosystem is rich, offering tools for handling structured, semi-structured, and unstructured data. This makes it ideal for modern data workflows, including ETL processes, data warehousing, and advanced analytics. One of its standout features is the DataFrame, which brings a tabular data abstraction similar to pandas but optimized for distributed environments.

1.2 What is a DataFrame?

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in pandas but designed for large-scale data processing. DataFrames provide an easy-to-use API for performing common data manipulation tasks, such as filtering, grouping, and aggregations, with built-in optimizations for distributed computing.

In PySpark, DataFrames are immutable and distributed across a cluster. This allows operations on the data to be executed in parallel, resulting in high performance even for massive datasets. DataFrames can handle data from various sources, such as structured files (e.g., CSV, JSON), databases, or distributed storage systems like Hadoop.
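
To make this concrete, below is a minimal sketch of the column-based operations described above (filtering, grouping, and aggregating). It assumes an active SparkSession named spark; the sales data and column names are purely illustrative.

from pyspark.sql import functions as F

# Illustrative data: a list of tuples plus column names (types are inferred).
sales = spark.createDataFrame(
    [("north", 100), ("south", 250), ("north", 300)],
    ["region", "amount"]
)

# Keep only the larger sales, then aggregate per region.
summary = (
    sales.filter(F.col("amount") > 150)
         .groupBy("region")
         .agg(F.sum("amount").alias("total_amount"))
)
summary.show()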

1.2.1 Features

Key features of PySpark DataFrames include:

  • Schema: DataFrames have a defined schema that specifies the structure of data, including column names and data types.
  • Lazy Evaluation: Operations on DataFrames are executed only when an action (e.g., show(), collect()) is invoked, optimizing the processing pipeline (see the short sketch after this list).
  • Interoperability: DataFrames can interoperate with other Spark components, such as SQL and machine learning libraries.
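
Below is a minimal, illustrative sketch of the last two points. It assumes an active SparkSession named spark and a hypothetical DataFrame df with an integer column a and a string column c.

# Lazy evaluation: filter() only builds a logical plan; no job runs yet.
filtered = df.filter(df.a > 1)
filtered.show()   # show() is an action, so Spark executes the plan here

# Interoperability with Spark SQL: expose the DataFrame as a temporary view.
df.createOrReplaceTempView("my_table")
spark.sql("SELECT a, c FROM my_table WHERE a > 1").show()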

2. Creating Your First PySpark DataFrame

We assume that you have Apache Spark and PySpark installed on your system. If not, you can install PySpark using pip:

pip install pyspark

Alternatively, you can leverage the Databricks Community Edition to practice and enhance your PySpark skills.

2.1 Code Example

Let’s walk through the process of creating a DataFrame in PySpark.

# Databricks notebook source
# MAGIC %md
# MAGIC # **Initializing pyspark**
 
# COMMAND ----------
 
from pyspark.sql import SparkSession
# Create a spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
 
# COMMAND ----------
 
spark
 
# COMMAND ----------
 
help(spark.createDataFrame)
 
# COMMAND ----------
 
# MAGIC %md
# MAGIC # **Create dataframe from list of rows**
 
# COMMAND ----------
 
from pyspark.sql import Row
from datetime import datetime,date
 
# Create a PySpark DataFrame from a list of rows;
# Spark will infer the schema from the input data
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2025,1,1), e=datetime(2025,1,1,12,0)),
    Row(a=2, b=3., c='string2', d=date(2025,2,1), e=datetime(2025,2,1,12,0)),
    Row(a=3, b=4., c='string3', d=date(2025,3,1), e=datetime(2025,3,1,12,0))
])
 
# COMMAND ----------
 
from pyspark.sql import Row
from datetime import datetime,date
 
# Create a PySpark DataFrame from a list of rows, specifying the schema manually
df1 = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2025,1,1), e=datetime(2025,1,1,12,0)),
    Row(a=2, b=3., c='string2', d=date(2025,2,1), e=datetime(2025,2,1,12,0)),
    Row(a=3, b=4., c='string3', d=date(2025,3,1), e=datetime(2025,3,1,12,0))
],
                           schema='a long, b double, c string, d date, e timestamp'
                           )
 
 
# COMMAND ----------
 
# show(): triggers the Spark job(s) and displays the contents of the DataFrame
df1.show()
 
# COMMAND ----------
 
# MAGIC %md
# MAGIC # **Create dataframe from rdd**
 
# COMMAND ----------
 
columns = ["website", "language", "score"]
data = [("in28minutes", "java", "200"), ("trendytech", "bigdata", "200"), ("youtube", "sql", "300")]
rdd = spark.sparkContext.parallelize(data)
 
# COMMAND ----------
 
dfFromRdd = rdd.toDF()
dfFromRdd.printSchema()
dfFromRdd.show()
 
# Note: no column names are passed to toDF(), so Spark assigns generic column names (_1, _2, ...) in the output.
 
# COMMAND ----------
 
dfFromRdd1 = rdd.toDF(columns)
dfFromRdd1.printSchema()
dfFromRdd1.show()
 
# COMMAND ----------
 
# MAGIC %md
# MAGIC # **Create dataframe from file**
 
# COMMAND ----------
 
help(spark.read)
 
# COMMAND ----------
 
df2 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog.csv")
df2.printSchema()
df2.show()
 
# COMMAND ----------
 
df3 = spark.read.format("csv").option("header", "false").option("inferSchema", "true").load("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog_NoHeader.csv")
df3.printSchema()
df3.show()
 
# COMMAND ----------
 
# MAGIC %md
# MAGIC # Create dataframe from list of dictionaries
 
# COMMAND ----------
 
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
 
data = [{"ID": 1, "Name": "Alice", "Age": 28},
        {"ID": 2, "Name": "Bob", "Age": 35},
        {"ID": 3, "Name": "Cathy", "Age": 29}]
 
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])
 
df = spark.createDataFrame(data, schema=schema)
df.show()

2.1.1 Code Explanation

The code above demonstrates several methods for creating DataFrames in PySpark, a key component of Apache Spark. First, it shows how to initialize a Spark session using the SparkSession class, which is essential for working with PySpark. Once the session is established, DataFrames can be created from a variety of data sources. One common method is to use a list of rows, where PySpark automatically infers the schema from the provided data. Alternatively, you can explicitly define the schema by specifying the data types of each column.
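
To verify which schema was actually applied, you can inspect the two DataFrames created in the notebook above; a quick sketch:

# Inspect the schemas of the DataFrames created above.
df.printSchema()    # schema inferred from the Row objects
df1.printSchema()   # schema supplied explicitly as a DDL string
print(df1.dtypes)   # list of (column name, data type) pairs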

Another approach is to create DataFrames from Resilient Distributed Datasets (RDDs), the fundamental data structure in Spark. By converting an RDD to a DataFrame, you can specify column names or rely on Spark’s default naming convention if none are provided.
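
As a related option, the same RDD can also be passed directly to createDataFrame together with the column names, avoiding the intermediate toDF() call. The sketch below reuses the rdd and columns variables defined in the notebook above.

# Build the DataFrame directly from the RDD and the column names.
dfFromRdd2 = spark.createDataFrame(rdd, schema=columns)
dfFromRdd2.printSchema()
dfFromRdd2.show()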

Furthermore, PySpark enables you to load DataFrames directly from external files, such as CSV files, by using various options like setting headers or inferring schemas automatically from the file content.
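
For reference, the same CSV read can also be written with the csv() shorthand. The sketch below reuses the illustrative DBFS path from the notebook above.

# Shorthand CSV read with a header row and schema inference.
df_csv = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog.csv")
)
df_csv.printSchema()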

Lastly, DataFrames can be constructed from a list of dictionaries, where you can also define the schema using StructType and StructField to precisely control the data types and structure. These methods offer flexibility when working with data in Spark, whether the source is in memory, from an external file, or a distributed RDD.
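
A closely related in-memory option, not shown in the notebook, is to hand createDataFrame a pandas DataFrame directly. The sketch below assumes pandas is installed and reuses the same illustrative records.

import pandas as pd

# Build the records as a pandas DataFrame, then convert it to a PySpark DataFrame.
pdf = pd.DataFrame({"ID": [1, 2, 3],
                    "Name": ["Alice", "Bob", "Cathy"],
                    "Age": [28, 35, 29]})
df_from_pandas = spark.createDataFrame(pdf)
df_from_pandas.show()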

Since the example is a Jupyter notebook, the output isn’t captured inline here. However, you can download the source code from the downloads section, open the notebook in Visual Studio Code, and run it to view the output there.

3. Conclusion

PySpark is an essential tool for big data processing, offering both scalability and simplicity. Creating your first DataFrame is a great starting point for exploring its capabilities. As you become more familiar with PySpark, you can experiment with transformations, actions, and integrations with other big data tools.

4. Download the Code

You can download the Jupyter Notebook using the link provided: Create Your First DataFrame in PySpark

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, and K8s).