Scala - Create Map from Spark DataFrame

I have a Spark DataFrame and I want to create a Map from it, storing the values as Map[String, Map[String, String]].
I can't figure out how to do this; any help would be appreciated.
Below are the input and output formats:
Input:
+-----------------+------------+---+--------------------------------+
|relation |obj_instance|obj|map_value |
+-----------------+------------+---+--------------------------------+
|Start~>HInfo~>Mnt|Mnt |Mnt|[Model -> 2000, Version -> 1.0] |
|Start~>HInfo~>Cbl|Cbl-3 |Cbl|[VSData -> XYZVN, Name -> Smart]|
+-----------------+------------+---+--------------------------------+
Output:
Map(relation -> Start~>HInfo~>Mnt, obj_instance -> Mnt, obj -> Mnt, Mnt -> Map(Model -> 2000, Version -> 1.0))
Map(relation -> Start~>HInfo~>Cbl, obj_instance -> Cbl-3, obj -> Cbl, Cbl -> Map(VSData -> XYZVN, Name -> Smart))
Here is the code I'm trying, without success:
var resultMap: Map[Any, Any] = Map()
groupedDataSet.foreach { r =>
  val key1 = "relation"
  val value1 = r(0).toString
  val key2 = "obj_instance"
  val value2 = r(1).toString
  val key3 = "obj"
  val value3 = r(2).toString
  val key4 = r(2).toString
  val value4 = r(3)
  resultMap += (key1 -> value1, key2 -> value2, key3 -> value3, key4 -> value4)
}
resultMap.foreach(println)
Please help.
Below is the code to create the test DataFrame and the map column:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object DFToMap extends App {
  // Creating SparkSession
  lazy val conf = new SparkConf().setAppName("df-to-map").set("spark.default.parallelism", "2")
    .setIfMissing("spark.master", "local[*]")
  lazy val sparkSession = SparkSession.builder().config(conf).getOrCreate()
  import sparkSession.implicits._

  // Creating raw DataFrame
  val rawTestDF = Seq(
    ("Start~>HInfo~>Cbl", "Cbl-3", "Cbl", "VSData", "XYZVN"),
    ("Start~>HInfo~>Cbl", "Cbl-3", "Cbl", "Name", "Smart"),
    ("Start~>HInfo~>Mnt", "Mnt", "Mnt", "Model", "2000"),
    ("Start~>HInfo~>Mnt", "Mnt", "Mnt", "Version", "1.0"))
    .toDF("relation", "obj_instance", "obj", "key", "value")
  rawTestDF.show(false)

  // Merge the single-entry maps collected per group into one map per row
  val joinTheMap = udf { json_value: Seq[Map[String, String]] => json_value.flatten.toMap }
  val groupedDataSet = rawTestDF
    .groupBy("relation", "obj_instance", "obj")
    .agg(collect_list(map(col("key"), col("value"))) as "map_value_temp")
    .withColumn("map_value", joinTheMap(col("map_value_temp")))
    .drop("map_value_temp")
  groupedDataSet.show(false) // This is the input DataFrame.
}
Final output JSON built from the Map:
[{"relation":"Start~>HInfo~>Mnt","obj_instance":"Mnt","obj":"Mnt","Mnt":{"Model":"2000","Version":"1.0"}},
{"relation":"Start~>HInfo~>Cbl","obj_instance":"Cbl-3","obj":"Cbl","Cbl":{"VSData":"XYZVN","Name":"Smart"}}]
Note: I don't want to use any Spark groupBy, pivot, or agg, as Spark Streaming doesn't support multiple aggregations. Hence I want to get this with pure Scala code. Kindly help.
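For reference, the foreach attempt above comes back empty because the closure runs on the executors, so the driver-side resultMap var is never updated. A minimal driver-side sketch (assuming the grouped DataFrame built above is small enough to collect; in a streaming job you would do the equivalent per micro-batch, e.g. inside foreachBatch) that builds one Map per row in the requested shape:

// Collect the grouped rows to the driver and build plain Scala maps from them.
val rowMaps: Array[Map[String, Any]] = groupedDataSet.collect().map { row =>
  val relation    = row.getAs[String]("relation")
  val objInstance = row.getAs[String]("obj_instance")
  val obj         = row.getAs[String]("obj")
  val mapValue    = row.getMap[String, String](row.fieldIndex("map_value")).toMap

  Map(
    "relation"     -> relation,
    "obj_instance" -> objInstance,
    "obj"          -> obj,
    obj            -> mapValue // dynamic key, e.g. "Mnt" or "Cbl"
  )
}
rowMaps.foreach(println)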

I created a custom UDF to parse the data and generate it in JSON format.
import org.json4s.native.JsonMethods._
import org.json4s._
import org.json4s.JsonDSL._

def toJson(relation: String, obj_instance: String, obj: String, map_value: Map[String, String]) = {
  compact(render(
    JObject(
      "relation"     -> JString(relation),
      "obj_instance" -> JString(obj_instance),
      "obj"          -> JString(obj),
      obj            -> map_value)))
}

import org.apache.spark.sql.functions._
val createJson = udf(toJson _)

val df = Seq(
  ("Start~>HInfo~>Mnt", "Mnt", "Mnt", Map("Model" -> "2000", "Version" -> "1.0")),
  ("Start~>HInfo~>Cbl", "Cbl-3", "Cbl", Map("VSData" -> "XYZVN", "Name" -> "Smart"))
).toDF("relation", "obj_instance", "obj", "map_value")

df.select(createJson($"relation", $"obj_instance", $"obj", $"map_value").as("json_map")).show(false)
+-----------------------------------------------------------------------------------------------------------+
|json_map |
+-----------------------------------------------------------------------------------------------------------+
|{"relation":"Start~>HInfo~>Mnt","obj_instance":"Mnt","obj":"Mnt","Mnt":{"Model":"2000","Version":"1.0"}} |
|{"relation":"Start~>HInfo~>Cbl","obj_instance":"Cbl-3","obj":"Cbl","Cbl":{"VSData":"XYZVN","Name":"Smart"}}|
+-----------------------------------------------------------------------------------------------------------+
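If you then need the single JSON array shown in the question rather than one JSON string per row, one option is to collect the json_map column and wrap it; a sketch, assuming the result is small enough for the driver and that sparkSession.implicits._ is in scope for .as[String]:

val jsonMaps = df
  .select(createJson($"relation", $"obj_instance", $"obj", $"map_value").as("json_map"))
  .as[String]
  .collect()

// Wrap the per-row JSON objects into one JSON array string.
val jsonArray = jsonMaps.mkString("[", ",\n", "]")
println(jsonArray)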

Related

How do I transform a List[String] to a List[Map[String,String]], given that the list of strings represents the keys to the map in Scala?

I have a list of references:
val references: List[String] = List("S", "R")
I also have a map of variables:
val variables: Map[String, List[String]] = Map("S" -> List("a", "b"), "R" -> List("west", "east"))
references is a list of keys of the variables map.
I want to write a function with this signature:
def expandReplacements(references: List[String], variables: Map[String, List[String]]): List[Map[String, String]]
and this function should return the following combinations:
List(Map("S" -> "a", "R" -> "west"), Map("S" -> "a", "R" -> "east"), Map("S" -> "b", "R" -> "west"), Map("S" -> "b", "R" -> "east"))
I tried doing this:
val variables: Map[String, List[String]] = Map("S" -> List("a", "b"), "R" -> List("east", "central"))
val references: List[String] = List("S", "R")

def expandReplacements(references: List[String]): List[Map[String, String]] =
  references match {
    case ref :: refs =>
      val variableValues = variables(ref)
      val x = variableValues.flatMap { variableValue =>
        val remaining = expandReplacements(refs)
        remaining.map(rem => rem + (ref -> variableValue))
      }
      x
    case Nil => List.empty
  }
Your version always returns an empty list because the Nil base case yields List.empty, so the flatMap has nothing to map over; returning List(Map.empty) in the base case fixes that. If you have more than 2 references, you can do:
def expandReplacements(references: List[String], variables: Map[String, List[String]]): List[Map[String, String]] = {
  references match {
    case Nil => List(Map.empty[String, String])
    case x :: xs =>
      variables.get(x).fold {
        expandReplacements(xs, variables)
      } { variableList =>
        for {
          variable <- variableList.map(x -> _)
          otherReplacements <- expandReplacements(xs, variables)
        } yield otherReplacements + variable
      }
  }
}
Code run at Scastie.
So I have figured it out:
def expandSubstitutions(references: List[String]): List[Map[String, String]] =
  references match {
    case r :: Nil => variables(r).map(v => Map(r -> v))
    case r :: rs  => variables(r).flatMap(v => expandSubstitutions(rs).map(expanded => expanded + (r -> v)))
    case Nil      => Nil
  }
This returns:
List(Map(R -> west, S -> a), Map(R -> east, S -> a), Map(R -> west, S -> b), Map(R -> east, S -> b))
Your references representation is suboptimal, but if you want to use it...
val variables: Map[String, List[String]] = Map("S" -> List("a", "b"), "R" -> List("east", "central"))
val references: List[String] = List("S", "R")

def expandReplacements(references: List[String]): List[Map[String, String]] =
  references match {
    case List(aKey, bKey) =>
      val as = variables.get(aKey).toList.flatten
      val bs = variables.get(bKey).toList.flatten
      as.zip(bs).map { case (a, b) =>
        Map(aKey -> a, bKey -> b)
      }
    case _ => Nil
  }

Kotlin: How to merge items with the same value in a list of a data class

I want to merge entries with the same startTime into one (combining step, distance and calorie) in the list. How can I do this?
var listNewStepData = arrayListOf<NewStepData>()
The data class:
data class NewStepData(
    val startTime: String?,
    val endTime: String?,
    val step: Int? = 0,
    val distance: Int? = 0,
    val calorie: Int? = 0
)
This is a sample:
NewStepData(startTime=2020-04-14T00:00:00.000Z, endTime=2020-04-14T00:00:00.000Z, step=4433, distance=0, calorie=0)
NewStepData(startTime=2020-04-14T00:00:00.000Z, endTime=2020-04-15T00:00:00.000Z, step=0, distance=0, calorie=1697)
NewStepData(startTime=2020-04-14T00:00:00.000Z, endTime=2020-04-14T00:00:00.000Z, step=0, distance=2436, calorie=0)
NewStepData(startTime=2020-04-15T00:00:00.000Z, endTime=2020-04-15T00:00:00.000Z, step=5423, distance=0, calorie=0)
NewStepData(startTime=2020-04-15T00:00:00.000Z, endTime=2020-04-16T00:00:00.000Z, step=0, distance=0, calorie=1715)
NewStepData(startTime=2020-04-15T00:00:00.000Z, endTime=2020-04-15T00:00:00.000Z, step=0, distance=3196, calorie=0)
I want to get this
NewStepData(startTime=2020-04-14T00:00:00.000Z, endTime=2020-04-15T00:00:00.000Z, step=4433, distance=2436, calorie=1697)
NewStepData(startTime=2020-04-15T00:00:00.000Z, endTime=2020-04-16T00:00:00.000Z, step=5423, distance=3196, calorie=1715)
thanks
You can use groupBy { } on your list. It will return a map from your grouping key to lists of your original type. Then map each group to a single aggregated NewStepData.
I assume you want the maximum end date per group (the max of the endTime values) and the sums of steps, distances and calories; since those fields are nullable Ints, sumBy with a ?: 0 default covers all three.
Here's the sample code:
val grouped = listNewStepData
    .groupBy { it.startTime }
    .map { entry ->
        NewStepData(
            startTime = entry.key,
            // Latest endTime in the group; ISO-8601 strings sort chronologically.
            endTime = entry.value.mapNotNull { it.endTime }.maxOrNull(),
            step = entry.value.sumBy { it.step ?: 0 },
            distance = entry.value.sumBy { it.distance ?: 0 },
            calorie = entry.value.sumBy { it.calorie ?: 0 }
        )
    }

Select columns whose name contains a specific string from spark scala DataFrame

I have a DataFrame like this.
Name City Name_index City_index
Ali lhr 2.0 0.0
abc swl 0.0 2.0
xyz khi 1.0 1.0
I want to drop the columns whose names don't contain the string "index".
Expected Output should be like:
Name_index City_index
2.0 0.0
0.0 2.0
1.0 1.0
I have tried this.
val cols = newDF.columns
val regex = """^((?!_indexed).)*$""".r
val selection = cols.filter(s => regex.findFirstIn(s).isDefined)
cols.diff(selection)
val res =newDF.select(selection.head, selection.tail : _*)
res.show()
But I am getting this:
Name City
Ali lhr
abc swl
xyz khi
There is a typo in your regex; I fixed it in the code below. You also need to select cols.diff(selection) (the columns your regex filters out) rather than selection itself:
import org.apache.spark.sql.SparkSession

object FilterColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val newDF = List(PersonCity("Ali", "lhr", 2.0, 0.0)).toDF()
    newDF.show()

    val cols = newDF.columns
    val regex = """^((?!_index).)*$""".r
    val selection = cols.filter(s => regex.findFirstIn(s).isDefined)
    val finalCols = cols.diff(selection)
    val res = newDF.select(finalCols.head, finalCols.tail: _*)
    res.show()
  }
}

case class PersonCity(Name: String, City: String, Name_index: Double, City_index: Double)
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val regex = """^((?!_indexed).)*$""".r
val schema = StructType(Seq(
  StructField("Name", StringType, false),
  StructField("City", StringType, false),
  StructField("Name_indexed", IntegerType, false),
  StructField("City_indexed", LongType, false)))

val empty: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema = schema)
val columns = schema.map(_.name).filter(el => regex.pattern.matcher(el).matches())
empty.select(columns.map(col): _*).show()
It gives
+----+----+
|Name|City|
+----+----+
+----+----+
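For completeness, if the goal is simply to keep the columns whose names contain "index", the negative-lookahead regex can be avoided entirely. A sketch against the question's newDF (with columns Name, City, Name_index, City_index):

import org.apache.spark.sql.functions.col

// Keep only the columns whose name contains "index".
val indexCols = newDF.columns.filter(_.contains("index")).map(col)
newDF.select(indexCols: _*).show()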

Convert Map<String,List<String>> to List<Map<String,String>> in Kotlin

Input: {param1=[x1,y1], param2=[p1,q1],param3=[m1,n1]....}
Output: [{param1=x1, param2=p1,param3=m1....},{param1=y1, param2=q1,param3=n1....}]
I need to convert this input Map<String,List<String>> to List<Map<String,String>>
Any help is appreciated.
Your help can save my day. Thank you.
val source = mapOf(
    "param1" to listOf("x1", "y1"),
    "param2" to listOf("p1", "q1"),
    "param3" to listOf("m1", "n1")
)

// For each index position, build one map that takes the element at that index
// from every list (this assumes all the lists have the same length).
val result = source.values.first().indices.map { index ->
    source.entries.associate { (param, list) -> param to list[index] }
}

How to apply filters while reading all the JSON files from a folder using Scala?

I have a folder which has multiple JSON files (first.json, second.json). Using Scala, I am loading all the JSON files' data into a Spark RDD/Dataset and then applying a filter on the data.
The problem here is that if we have 600 files of data, we need to load all of them into the RDD/Dataset before we can apply the filter.
I am looking for a solution where I can filter the records while reading from the folder itself, rather than loading everything into Spark memory.
Filtering is done based on the BlockHeight property.
JSON structure in each file:
first.json:
[{"IsFee":false,"BlockDateTime":"2015-10-14T09:02:46","Address":"0xe8fdc802e721426e0422d18d371ab59a41ddaeac","BlockHeight":381859,"Type":"IN","Value":0.61609232637203584,"TransactionHash":"0xe6fc01ff633b4170e0c8f2df7db717e0608f8aaf62e6fbf65232a7009b53da4e","UserName":null,"ProjectName":null,"CreatedUser":null,"Id":0,"CreatedUserId":0,"CreatedTime":"2019-08-26T22:32:45.2686137+05:30","UpdatedUserId":0,"UpdatedTime":"2019-08-26T22:32:45.2696126+05:30"},{"IsFee":false,"BlockDateTime":"2015-10-14T09:02:46","Address":"0x52bc44d5378309ee2abf1539bf71de1b7d7be3b5","BlockHeight":381859,"Type":"OUT","Value":-0.61609232637203584,"TransactionHash":"0xe6fc01ff633b4170e0c8f2df7db717e0608f8aaf62e6fbf65232a7009b53da4e","UserName":null,"ProjectName":null,"CreatedUser":null,"Id":0,"CreatedUserId":0,"CreatedTime":"2019-08-26T22:32:45.3141203+05:30","UpdatedUserId":0,"UpdatedTime":"2019-08-26T22:32:45.3141203+05:30"}]
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object BalanceAndTransactionDownload {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("xxx").getOrCreate()

    val currencyDataSchema = StructType(Array(
      StructField("Type", StringType, true),
      StructField("TransactionHash", StringType, true),
      StructField("BlockHeight", LongType, true),
      StructField("BlockDateTime", TimestampType, true),
      StructField("Value", DecimalType(38, 18), true),
      StructField("Address", StringType, true),
      StructField("IsFee", BooleanType, true)
    ))

    val projectAddressesFile = args(0)
    val blockJSONFilesContainer = args(1)
    val balanceFolderName = args(2)
    val downloadFolderName = args(3)
    val blockHeight = args(4)

    val projectAddresses = spark.read.option("multiline", "true").json(projectAddressesFile)
    val currencyDataFile = spark.read.option("multiline", "true").schema(currencyDataSchema).json(blockJSONFilesContainer) // This is where I want to filter out the data
    val filteredcurrencyData = currencyDataFile.filter(currencyDataFile("BlockHeight") <= blockHeight)

    filteredcurrencyData.join(projectAddresses, filteredcurrencyData("Address") === projectAddresses("address"))
      .groupBy(projectAddresses("address"))
      .agg(sum("Value").alias("Value"))
      .repartition(1)
      .write.option("header", "true").format("com.databricks.spark.csv").csv(balanceFolderName)

    filteredcurrencyData.join(projectAddresses, filteredcurrencyData("Address") === projectAddresses("address"))
      .drop(projectAddresses("address"))
      .drop(projectAddresses("CurrencyId"))
      .drop(projectAddresses("Id"))
      .repartition(1)
      .write.option("header", "true").format("com.databricks.spark.csv").csv(downloadFolderName)
  }
}
The files sitting in the datastore should be partitioned. You seem to filter by block height, so you can have multiple folders like blockheight=1, blockheight=2, etc., with the JSON files inside those folders. In that case Spark will not read all the JSON files; it will instead scan only the required folders.
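A sketch of that layout, reusing the names from the code above and assuming you control how the data lands in the datastore: rewrite the data once partitioned by BlockHeight, and later reads that filter on BlockHeight only scan the matching folders (partition pruning) instead of every JSON file. The output path is hypothetical.

import org.apache.spark.sql.functions.col

// One-time conversion: rewrite the raw JSON partitioned by BlockHeight.
// (Parquet is used here for illustration; a .json(...) writer works with partitionBy as well.)
spark.read.option("multiline", "true").schema(currencyDataSchema).json(blockJSONFilesContainer)
  .write
  .partitionBy("BlockHeight")
  .parquet("/data/currency_partitioned") // hypothetical output path

// Subsequent reads with a BlockHeight filter only touch the relevant partition folders.
val filteredcurrencyData = spark.read.parquet("/data/currency_partitioned")
  .filter(col("BlockHeight") <= blockHeight.toLong)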