Create a complex JSON structure using multiple data frames in PySpark

Nachiket Rajput

1 min readFeb 7, 2021

Use Case

Create an complex JSON structure by joining multiple data frames.

Lets, directly move on to coding part.

We have following data frames,

df1 — contain mobile:string, amount:string

df2 — contain mobile:string, status:int

df3 — contain mobile:string, dueDate:string

Step 1 : Convert each data frame into one-level JSON array
Dataframe df1 :

from pyspark.sql.functions import *

embeddedElementName = ‘label1’
joinColumnName = ‘mobile’
labelOneEmbeddedDF = (df1.withColumn(embeddedElementName (struct(col(‘amount’)))).select(joinColumnName,embeddedElementName)
.alias(embeddedElementName))

It will create following structure.

mobile:string

label1:struct amount:string

Dataframe df2 :

embeddedElementName = ‘label2’
joinColumnName = ‘mobile’
labelTwoEmbeddedDF = (df2.withColumn(embeddedElementName
,(struct(col(‘status’)))).select(joinColumnName,embeddedElementName)
.alias(embeddedElementName))

It will create following structure.

mobile:string

label2:struct status:int

Dataframe df3 :

embeddedElementName = ‘label3’
joinColumnName = ‘mobile’
labelThreeEmbeddedDF = (df3.withColumn(embeddedElementName,(struct(col(‘dueDate’)))).select(joinColumnName,embeddedElementName)
.alias(embeddedElementName))

It will create following structure.

mobile:string

label3:struct dueDate:string

Now, we are going to join these dataframes to create a new dataframe that contain following structure.

Here, we have applied left join.

finalDF = labelOneEmbeddedDF.join(labelTwoEmbeddedDF, on=[‘mobile’], how=’left’).join(labelThreeEmbeddedDF, on=[‘mobile’],how=’left’)

It will create following structure.

mobile:string

label1:struct amount:string

label2:struct status:integer

label3:struct dueDate:string

That’s what we want to achieve.

Hope this will help you to start your journey with Data Bricks and Cosmos connectivity along with Data Lake. I’ll discuss more strategies in a future blog post!

Create a complex JSON structure using multiple data frames in PySpark

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Nachiket Rajput

No responses yet