Using PySpark in Google Colab, read a 25 MB JSON file

Lipsa Biswas
4 min read · Mar 14, 2023


#pyspark #json #googlecolab #pysparkdataframe #bigdata

Here, I will describe a step-by-step approach to loading, reading, and processing a large JSON file using PySpark in Google Colaboratory.

At the bottom of this article, you can find all the code in one place.

1. Open a Google Colab notebook

a) Go to your Google Drive and click on New

b) Create a new Google Colab notebook

New > More > Google Colaboratory

c) A new untitled Colab notebook opens in a new tab; rename it.

I renamed mine to “PySpark-read-json.ipynb”

2. Install and import PySpark in the Colab notebook

a) Install the PySpark package

!pip install pyspark -v

You should see a “Successfully installed” message in the output
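If you want a quick sanity check that the install worked, you can also print the installed version:

# Optional sanity check: print the installed PySpark version
import pyspark
print(pyspark.__version__)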

b) Now, import the package

from pyspark.sql import SparkSession, Row

3. Create a Spark session

spark = SparkSession.builder.appName('Load-json-to-pyspark-dataframe').getOrCreate()
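If the default settings run out of memory on bigger files, you can optionally pass configuration while building the session. This is a minimal sketch, assuming you want to raise the driver memory; the 4g value is just an illustration, and it only takes effect if a session has not already been created:

# Optional: build the session with extra driver memory (4g is an assumed value)
spark = (SparkSession.builder
         .appName('Load-json-to-pyspark-dataframe')
         .config('spark.driver.memory', '4g')
         .getOrCreate())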

4. Download a large JSON file from the internet (link below) and upload it to the Colab session storage

a) Download the 25 MB JSON file from https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json and save it on your machine.

b) Now, upload the file to the Colab session storage

Click on Upload to session storage

Click OK on the warning message

It will take a while to upload the file; note the progress at the bottom right corner

Once the upload is complete, copy the file’s path and keep it as a comment in the notebook.

You can click on the folder icon to show/hide the file explorer panel on the left
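Alternatively, you can skip the manual download/upload round trip and fetch the file straight into session storage from inside the notebook, using the wget tool that Colab ships with:

# Alternative: download the file directly into the Colab session storage
!wget -q https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json -O /content/large-file.json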

5. Write PySpark code to load data from the JSON file into a PySpark dataframe

# Read the JSON file using spark.read; this creates a Spark dataframe

df = spark.read.option("multiline", "true").json("/content/large-file.json")
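Since spark.read.json infers the schema automatically, it is worth checking what Spark came up with before querying the data:

# Inspect the schema Spark inferred from the JSON
df.printSchema()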

6. Process the data loaded into the dataframe

a) Show the top 10 rows of the dataframe

df.show(10)
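The rows in this file are wide and nested, so the default tabular output gets truncated heavily; if you prefer, you can show a single row vertically, which is easier to read:

# Optional: show one row vertically, easier to read for wide/nested records
df.show(1, truncate=False, vertical=True)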

b) Filter rows of the dataframe where type is PushEvent

df.where("type = 'PushEvent'").show(truncate=False)

c) Show the total count of rows where type is PushEvent

df.where("type = 'PushEvent'").count()
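The same filter can also be written with the column API instead of a SQL string; both forms are equivalent, so use whichever reads better to you:

# Equivalent filter using the column API instead of a SQL string
from pyspark.sql.functions import col
df.where(col('type') == 'PushEvent').count()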

Complete code

# Install the PySpark library

!pip install pyspark -v

# Import libraries

from pyspark.sql import SparkSession, Row

# Create a Spark session

spark = SparkSession.builder.appName('Load-json-to-pyspark-dataframe').getOrCreate()

# Read the JSON file using spark.read; this creates a Spark dataframe

df = spark.read.option("multiline", "true").json("/content/large-file.json")

# Show the top 10 rows (show() defaults to 20)

df.show(10)

# Show only those rows of the dataframe where type is PushEvent (type is a column name here)

df.where("type = 'PushEvent'").show(truncate=False)

# Show the count of rows where type is PushEvent

df.where("type = 'PushEvent'").count()
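When you are done, it is good practice to release the session’s resources:

# Release Spark resources when finished
spark.stop()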

THANK YOU
