Using PySpark in Google Colab, read a 25 MB JSON file
#pyspark #json #googlecolab #pysparkdataframe #bigdata
Here, I will describe a step-by-step approach to load, read, and process a large JSON file using PySpark in Google Colaboratory.
At the bottom of this article, you can find the complete code in one place.
1. Open a Google Colab notebook
a) Go to your Google Drive and click on New
b) Create a new Google Colab notebook
New > More > Google Colaboratory
c) A new untitled Colab notebook opens in a new tab; rename it.
I renamed mine “PySpark-read-json.ipynb”
2. Install and import PySpark in the Colab notebook
a) Install the PySpark package
!pip install pyspark -v
You should see the “Successfully installed” messages at the end of the output.
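As a quick optional check (a sketch, not a required step of this tutorial), you can confirm the package imports and print the installed version:
# Optional sanity check: confirm PySpark imports and print its version
import pyspark
print(pyspark.__version__)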
b) Now, import the package
from pyspark.sql import SparkSession
3. Create a Spark session
spark = SparkSession.builder.appName('Load-json-to-pyspark-dataframe').getOrCreate()
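To confirm the session is up, you can print the Spark version it runs on (an optional check):
# Optional: the session object exposes the Spark version
print(spark.version)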
4. Download a large JSON file from the internet (link below) and upload it to the Colab session storage.
a) Download the 25 MB JSON file from https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json and save it on your machine.
b) Now, upload the file to the Colab session storage
Click on Upload to session storage
Click OK on the warning message
The upload will take a while; watch the progress indicator at the bottom-right corner.
Once the upload is complete, copy the path of the file and keep it as a comment in the notebook (here it is /content/large-file.json).
You can click the folder icon to show or hide the file explorer panel on the left.
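If you would rather skip the manual upload, an alternative sketch is to fetch the file directly into session storage from the notebook itself, using the same URL as in step 4a:
# Alternative (optional): download the file straight into session storage
# -q silences wget's progress output; -P sets the destination folder
!wget -q https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json -P /content/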
5. Write PySpark code to load data from the JSON file into a PySpark dataframe
# Read the JSON file using the spark.read method; it creates a Spark dataframe
# The multiline option tells Spark the file is one JSON document spanning
# multiple lines, rather than one JSON record per line
df = spark.read.option("multiline", "true").json("/content/large-file.json")
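Before querying, it can help to inspect the schema Spark inferred from the JSON (an optional step):
# Optional: inspect the inferred schema before querying the dataframe
df.printSchema()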
6. Process the data loaded into the dataframe
a) Show the top 10 rows of the dataframe
df.show(10)
b) Filter the rows of the dataframe where type is PushEvent
df.where("type = 'PushEvent'").show(truncate=False)
c) Show the total count of rows where type is PushEvent
df.where("type = 'PushEvent'").count()
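Going a step further (an optional extension, not part of the walkthrough above), you could count the rows for every event type at once with a groupBy:
# Optional extension: count rows per event type, most frequent first
df.groupBy("type").count().orderBy("count", ascending=False).show(truncate=False)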
Complete code
# Install PySpark library
!pip install pyspark -v
# Import libraries
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName('Load-json-to-pyspark-dataframe').getOrCreate()
# Read the JSON file using the spark.read method; it creates a Spark dataframe
df = spark.read.option("multiline", "true").json("/content/large-file.json")
# Show the top 10 rows (show() defaults to 20)
df.show(10)
# Show only the rows of the dataframe where type is PushEvent (type is a column name here)
df.where("type = 'PushEvent'").show(truncate=False)
# Show the total count of rows where type is PushEvent
df.where("type = 'PushEvent'").count()
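When you are done, it is good practice to stop the Spark session and free its resources (an optional cleanup step):
# Optional cleanup: stop the Spark session when finished
spark.stop()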
THANK YOU