Create Your First DataFrame In PySpark
PySpark allows users to handle large datasets efficiently through distributed computing. Whether you are new to Spark or looking to enhance your skills, this article walks through how to create DataFrames and manipulate data effectively, unlocking the power of big data analytics with PySpark.
1. Introduction
1.1 What is PySpark?
PySpark is the Python API for Apache Spark, a powerful open-source distributed computing system. It enables developers to perform large-scale data processing and analytics in a parallel and fault-tolerant manner. PySpark integrates seamlessly with Python, allowing data engineers and data scientists to leverage the power of Spark using a familiar programming language.
Apache Spark is renowned for its speed and scalability, making it a go-to framework for big data processing. It supports diverse workloads, such as batch processing, real-time streaming, machine learning, and graph processing. With PySpark, Python users can tap into Spark’s distributed computing capabilities without delving into the complexities of Java or Scala.
PySpark’s ecosystem is rich, offering tools for handling structured, semi-structured, and unstructured data. This makes it ideal for modern data workflows, including ETL processes, data warehousing, and advanced analytics. One of its standout features is the DataFrame, which brings a tabular data abstraction similar to pandas but optimized for distributed environments.
1.2 What is a DataFrame?
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in pandas but designed for large-scale data processing. DataFrames provide an easy-to-use API for performing common data manipulation tasks, such as filtering, grouping, and aggregations, with built-in optimizations for distributed computing.
In PySpark, DataFrames are immutable and distributed across a cluster. This allows operations on the data to be executed in parallel, resulting in high performance even for massive datasets. DataFrames can handle data from various sources, such as structured files (e.g., CSV, JSON), databases, or distributed storage systems like Hadoop.
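To give a feel for that API, here is a minimal sketch (not part of the notebook in section 2, using made-up sample data and assuming an active SparkSession named spark) showing a filter followed by a grouped aggregation:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameBasics").getOrCreate()

# Hypothetical sample data: (name, department, salary)
people = spark.createDataFrame(
    [("Alice", "Engineering", 85000),
     ("Bob", "Engineering", 92000),
     ("Cathy", "Sales", 70000)],
    ["name", "department", "salary"]
)

# Filter, group, and aggregate -- the same style of operations you would use in pandas or SQL
(people
    .filter(F.col("salary") > 75000)           # keep rows above a salary threshold
    .groupBy("department")                     # group the remaining rows by department
    .agg(F.avg("salary").alias("avg_salary"))  # compute the average salary per group
    .show())

Because the DataFrame is partitioned across the cluster, the same pipeline runs unchanged on a laptop or on a large cluster.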
1.2.1 Features
Key features of PySpark DataFrames include:
- Schema: DataFrames have a defined schema that specifies the structure of data, including column names and data types.
- Lazy Evaluation: Operations on DataFrames are executed only when an action (e.g., show(), collect()) is invoked, optimizing the processing pipeline; see the short sketch after this list.
- Interoperability: DataFrames can interoperate with other Spark components, such as Spark SQL and the machine learning libraries.
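To make the lazy-evaluation point concrete, the sketch below (an illustration, assuming an active SparkSession named spark) builds a transformation that Spark only records in its query plan; no job runs until the show() action is invoked:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("LazyEvaluationDemo").getOrCreate()

df = spark.range(1_000_000)               # single-column DataFrame of ids (0 ... 999,999)

evens = df.filter(F.col("id") % 2 == 0)   # transformation: only added to the plan, not executed
evens.explain()                           # inspect the logical/physical plan Spark has built so far

evens.show(5)                             # action: this is what actually triggers a Spark job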
2. Creating Your First PySpark DataFrame
We assume that you have Apache Spark and PySpark installed on your system. If not, you can install PySpark using pip:
pip install pyspark
Alternatively, you can leverage the Databricks Community Edition to practice and enhance your PySpark skills.
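If you installed PySpark locally with pip, the following quick check (a minimal sketch, not part of the notebook used later) starts a local session and prints the Spark version to confirm the setup works:

from pyspark.sql import SparkSession

# Start a local SparkSession using all available cores
spark = SparkSession.builder.master("local[*]").appName("InstallCheck").getOrCreate()

print(spark.version)   # prints the installed Spark version string

spark.stop()           # release local resources when done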
2.1 Code Example
Let’s walk through the process of creating a DataFrame in PySpark.
# Databricks notebook source
# MAGIC %md
# MAGIC # **Initializing pyspark**

# COMMAND ----------

from pyspark.sql import SparkSession

# Create a spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

# COMMAND ----------

# Display the SparkSession object
spark

# COMMAND ----------

# Inspect the documentation of createDataFrame
help(spark.createDataFrame)

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create dataframe from list of rows**

# COMMAND ----------

from pyspark.sql import Row
from datetime import datetime, date

# create a pyspark df from a list of rows
# here spark will infer the schema based on the input
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2025, 1, 1), e=datetime(2025, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2025, 2, 1), e=datetime(2025, 2, 1, 12, 0)),
    Row(a=3, b=4., c='string3', d=date(2025, 3, 1), e=datetime(2025, 3, 1, 12, 0))
])

# COMMAND ----------

from pyspark.sql import Row
from datetime import datetime, date

# create a pyspark df from a list of rows, manually specifying the schema
df1 = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2025, 1, 1), e=datetime(2025, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2025, 2, 1), e=datetime(2025, 2, 1, 12, 0)),
    Row(a=3, b=4., c='string3', d=date(2025, 3, 1), e=datetime(2025, 3, 1, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')

# COMMAND ----------

# show(): creates the spark job(s) and displays the output of the df
df1.show()

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create dataframe from rdd**

# COMMAND ----------

columns = ["website", "language", "score"]
data = [("in28minutes", "java", "200"),
        ("trendytech", "bigdata", "200"),
        ("youtube", "sql", "300")]
rdd = spark.sparkContext.parallelize(data)

# COMMAND ----------

dfFromRdd = rdd.toDF()
dfFromRdd.printSchema()
dfFromRdd.show()
# note: we are not passing the column names to the `toDF()` method, hence generic column names are returned in the output

# COMMAND ----------

dfFromRdd1 = rdd.toDF(columns)
dfFromRdd1.printSchema()
dfFromRdd1.show()

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create dataframe from file**

# COMMAND ----------

# Inspect the documentation of the DataFrameReader
help(spark.read)

# COMMAND ----------

# CSV with a header row: the first line provides the column names
df2 = spark.read.format("csv") \
    .option("header", "true") \
    .load("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog.csv")
df2.printSchema()
df2.show()

# COMMAND ----------

# CSV without a header row: let spark infer the column types
df3 = spark.read.format("csv") \
    .option("header", "false") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog_NoHeader.csv")
df3.printSchema()
df3.show()

# COMMAND ----------

# MAGIC %md
# MAGIC # Create dataframe from list of dictionaries

# COMMAND ----------

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [{"ID": 1, "Name": "Alice", "Age": 28},
        {"ID": 2, "Name": "Bob", "Age": 35},
        {"ID": 3, "Name": "Cathy", "Age": 29}]

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema=schema)
df.show()
2.1.1 Code Explanation
The notebook above provides an in-depth look at several methods for creating DataFrames in PySpark. First, the code demonstrates how to initialize a Spark session using the SparkSession class, which is the entry point for working with PySpark. Once the session is established, DataFrames can be created from a variety of data sources. One common method is to use a list of Row objects, where PySpark automatically infers the schema from the provided data. Alternatively, you can define the schema explicitly by specifying the data type of each column, here via a DDL-style string such as 'a long, b double, c string, d date, e timestamp'.
Another approach is to create DataFrames from Resilient Distributed Datasets (RDDs), the fundamental data structure in Spark. By converting an RDD to a DataFrame, you can specify column names or rely on Spark’s default naming convention if none are provided.
Furthermore, PySpark enables you to load DataFrames directly from external files, such as CSV files, by using various options like setting headers or inferring schemas automatically from the file content.
Lastly, DataFrames can be constructed from a list of dictionaries, where you can also define the schema using StructType and StructField to precisely control the data types and structure. These methods offer flexibility when working with data in Spark, whether the source is in memory, an external file, or a distributed RDD.
The notebook output is not captured in this article; however, you can download the source code from the downloads section, open the file in Visual Studio Code, and run it to view the output there.
3. Conclusion
PySpark is an essential tool for big data processing, offering both scalability and simplicity. Creating your first DataFrame is a great starting point for exploring its capabilities. As you become more familiar with PySpark, you can experiment with transformations, actions, and integrations with other big data tools.
4. Download the Code
You can download the Jupyter Notebook using the link provided: Create Your First DataFrame In PySpark