Using PySpark in Google Colab, read a 25 MB JSON file
#pyspark #json #googlecolab #pysparkdataframe #bigdata
Here, I will describe a step-by-step approach to load, read, and process a large JSON file using PySpark in Google Colaboratory.
At the bottom of this article, you can find the complete code in one place.
1. Open a Google Colab notebook
a) Go to your Google Drive and click on New
b) Create a new Google Colab notebook
New > More > Google Colaboratory
c) A new untitled Colab notebook opens in a new tab; rename it.
I renamed mine “PySpark-read-json.ipynb”
2. Install and import PySpark in the Colab notebook
a) Install the PySpark package
!pip install pyspark -v
You should see the “Successfully installed” messages at the end of the output.
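As a quick optional check (a sketch, not a required step of this tutorial), you can confirm the package imports and print the installed version:
# Optional sanity check: confirm PySpark imports and print its version
import pyspark
print(pyspark.__version__)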
b) Now, import the package
from pyspark.sql import SparkSession
3. Create a Spark session
spark = SparkSession.builder.appName('Load-json-to-pyspark-dataframe').getOrCreate()
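To confirm the session is up, you can print the Spark version it runs on (an optional check):
# Optional: the session object exposes the Spark version
print(spark.version)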
4. Download a large JSON file from the internet (link below) and upload it to the Colab session storage.
a) Download the 25 MB JSON file from https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json and save it on your machine.
b) Now, upload the file to the Colab session storage
Click on Upload to session storage
Click OK on the warning message
The upload will take a while; watch the progress indicator at the bottom-right corner.
Once the upload is complete, copy the path of the file and keep it as a comment in the notebook (here it is /content/large-file.json).
You can click the folder icon to show or hide the file explorer panel on the left.
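If you would rather skip the manual upload, an alternative sketch is to fetch the file directly into session storage from the notebook itself, using the same URL as in step 4a:
# Alternative (optional): download the file straight into session storage
# -q silences wget's progress output; -P sets the destination folder
!wget -q https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json -P /content/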
5. Write PySpark code to load data from the JSON file into a PySpark dataframe
# Read the JSON file using the spark.read method; it creates a Spark dataframe
# The multiline option tells Spark the file is one JSON document spanning
# multiple lines, rather than one JSON record per line
df = spark.read.option("multiline", "true").json("/content/large-file.json")
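Before querying, it can help to inspect the schema Spark inferred from the JSON (an optional step):
# Optional: inspect the inferred schema before querying the dataframe
df.printSchema()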
6. Process the data loaded into the dataframe
a) Show the top 10 rows of the dataframe
df.show(10)
b) Filter the rows of the dataframe where type is PushEvent
df.where("type = 'PushEvent'").show(truncate=False)
c) Show the total count of rows where type is PushEvent
df.where("type = 'PushEvent'").count()
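Going a step further (an optional extension, not part of the walkthrough above), you could count the rows for every event type at once with a groupBy:
# Optional extension: count rows per event type, most frequent first
df.groupBy("type").count().orderBy("count", ascending=False).show(truncate=False)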
Complete code
# Install PySpark library
!pip install pyspark -v
# Import libraries
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName('Load-json-to-pyspark-dataframe').getOrCreate()
# Read the JSON file using the spark.read method; it creates a Spark dataframe
df = spark.read.option("multiline", "true").json("/content/large-file.json")
# Show the top 10 rows (show() defaults to 20)
df.show(10)
# Show only the rows of the dataframe where type is PushEvent (type is a column name here)
df.where("type = 'PushEvent'").show(truncate=False)
# Show the total count of rows where type is PushEvent
df.where("type = 'PushEvent'").count()
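When you are done, it is good practice to stop the Spark session and free its resources (an optional cleanup step):
# Optional cleanup: stop the Spark session when finished
spark.stop()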
THANK YOU