Create a complex JSON structure using multiple data frames in PySpark
Use Case
Create an complex JSON structure by joining multiple data frames.
Lets, directly move on to coding part.
We have following data frames,
df1 — contain mobile:string, amount:string
df2 — contain mobile:string, status:int
df3 — contain mobile:string, dueDate:string
Step 1 : Convert each data frame into one-level JSON array
Dataframe df1 :
from pyspark.sql.functions import *
embeddedElementName = ‘label1’
joinColumnName = ‘mobile’
labelOneEmbeddedDF = (df1.withColumn(embeddedElementName (struct(col(‘amount’)))).select(joinColumnName,embeddedElementName)
.alias(embeddedElementName))
It will create following structure.
mobile:string
label1:struct amount:string
Dataframe df2 :
embeddedElementName = ‘label2’
joinColumnName = ‘mobile’
labelTwoEmbeddedDF = (df2.withColumn(embeddedElementName
,(struct(col(‘status’)))).select(joinColumnName,embeddedElementName)
.alias(embeddedElementName))
It will create following structure.
mobile:string
label2:struct status:int
Dataframe df3 :
embeddedElementName = ‘label3’
joinColumnName = ‘mobile’
labelThreeEmbeddedDF = (df3.withColumn(embeddedElementName,(struct(col(‘dueDate’)))).select(joinColumnName,embeddedElementName)
.alias(embeddedElementName))
It will create following structure.
mobile:string
label3:struct dueDate:string
Now, we are going to join these dataframes to create a new dataframe that contain following structure.
Here, we have applied left join.
finalDF = labelOneEmbeddedDF.join(labelTwoEmbeddedDF, on=[‘mobile’], how=’left’).join(labelThreeEmbeddedDF, on=[‘mobile’],how=’left’)
It will create following structure.
mobile:string
label1:struct amount:string
label2:struct status:integer
label3:struct dueDate:string
That’s what we want to achieve.
Hope this will help you to start your journey with Data Bricks and Cosmos connectivity along with Data Lake. I’ll discuss more strategies in a future blog post!