Select columns whose name contains a specific string from spark scala DataFrame - regex

I have a DataFrame like this:
Name  City  Name_index  City_index
Ali   lhr   2.0         0.0
abc   swl   0.0         2.0
xyz   khi   1.0         1.0
I want to drop the columns whose name does not contain a string like "index".
The expected output should be like:
Name_index  City_index
2.0         0.0
0.0         2.0
1.0         1.0
I have tried this:
val cols = newDF.columns
val regex = """^((?!_indexed).)*$""".r
val selection = cols.filter(s => regex.findFirstIn(s).isDefined)
cols.diff(selection)
val res = newDF.select(selection.head, selection.tail: _*)
res.show()
But I am getting this:
Name City
Ali lhr
abc swl
xyz khi

There is a typo in your regex; I fixed it in the code below.
import org.apache.spark.sql.SparkSession

object FilterColumn {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val newDF = List(PersonCity("Ali", "lhr", 2.0, 0.0)).toDF()
    newDF.show()

    val cols = newDF.columns
    val regex = """^((?!_index).)*$""".r
    val selection = cols.filter(s => regex.findFirstIn(s).isDefined)
    val finalCols = cols.diff(selection)
    val res = newDF.select(finalCols.head, finalCols.tail: _*)
    res.show()
  }
}

case class PersonCity(Name: String, City: String, Name_index: Double, City_index: Double)
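The regex is not strictly necessary here. A minimal sketch of a simpler alternative, assuming the same newDF as above, keeps only the columns whose name contains "_index":
// keep columns whose name contains "_index" (no regex needed)
val indexCols = newDF.columns.filter(_.contains("_index"))
val res = newDF.select(indexCols.head, indexCols.tail: _*)
res.show()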

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val regex = """^((?!_indexed).)*$""".r
val schema = StructType(
  Seq(StructField("Name", StringType, false),
    StructField("City", StringType, false),
    StructField("Name_indexed", IntegerType, false),
    StructField("City_indexed", LongType, false)))

val empty: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema = schema)
val columns = schema.map(_.name).filter(el => regex.pattern.matcher(el).matches())
empty.select(columns.map(col): _*).show()
It gives
+----+----+
|Name|City|
+----+----+
+----+----+

Related

Putting children in one list

Please tell me how, in a data structure like this (simplified for better understanding), to bring all the children of the entities into one list:
fun main() {
    val listOfEntities = listOf(
        Entity(
            name = "John",
            entities = listOf(
                Entity(
                    name = "Adam",
                    entities = listOf()
                ),
                Entity(
                    name = "Ivan",
                    entities = listOf(
                        Entity(
                            name = "Henry",
                            entities = listOf(
                                Entity(
                                    name = "Kate",
                                    entities = listOf(
                                        Entity(
                                            name = "Bob",
                                            entities = listOf()
                                        )
                                    )
                                )
                            )
                        )
                    )
                )
            )
        )
    )
    val result = listOfEntities.flatMap { it.entities }.map { it.name }
    println(result)
}

data class Entity(
    val name: String,
    val entities: List<Entity>
)
I expect to see following result:
[John, Adam, Ivan, Henry, Kate, Bob]
I tried to use flatMap, but it did not lead to the expected result.
Thank you in advance!
You can traverse the tree of Entities recursively like this:
fun List<Entity>.flattenEntities(): List<Entity> =
    this + flatMap { it.entities.flattenEntities() }
Then you can call
val result = listOfEntities.flattenEntities().map { it.name }
to obtain the desired result.
You could do it like this:
fun List<Entity>.flatten(): List<String> {
    return flatMap { listOf(it.name) + it.entities.flatten() }
}
and then
val result = listOfEntities.flatten()

Regex RDD using Apache Spark Scala

I have the following RDD:
x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR
I just want to get the first part of every element of this RDD, as you can see in the next example:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
To do it, I am trying it this way:
// Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
// Try to use a regex on it; this regex is meant to capture everything up to the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))
This is what I am trying, but it is not working for me because I just get an Array of Boolean. What am I doing wrong?
I am new to Apache Spark and Scala. If you need anything more, just tell me. Thanks in advance!
Try this:
val x: Array[String] =
  Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

val rdd = sc.parallelize(x)
val result = rdd.map(lines => {
  lines.split(",")(0)
})
result.collect().foreach(println)
output:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
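If you prefer to keep a regex rather than split, note that matches only returns a Boolean; to extract text you need something like findFirstMatchIn. A minimal sketch, assuming the same rdd as in the answer above:
// take everything before the first comma; fall back to the whole line if there is no comma
val regex1 = """^(.+?),""".r
val firstParts = rdd.map(line => regex1.findFirstMatchIn(line).map(_.group(1)).getOrElse(line))
firstParts.collect().foreach(println)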
Try with this regex :
^\s*([^,]+)(_\w+)?
To implement this regex in your example, you can try :
import org.apache.spark.sql.Row

val arr = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
  "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
  "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

val rd_var = spark.sparkContext.parallelize(arr.map(Row(_)))
// triple quotes keep the backslashes, and .unanchored lets the pattern match a prefix of the line
val pattern = """^\s*([^,]+)(_\w+)?""".r.unanchored

rd_var.map {
  case Row(str: String) => str match {
    case pattern(gr1, _) => gr1
  }
}.foreach(println(_))
With an RDD:
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r
val el = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
  "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
  "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

def main(args: Array[String]): Unit = {
  val rdd = spark.sparkContext.parallelize(el.map(Row(_)))
  rdd.map {
    case Row(str: String) => str match {
      case pattern(gr1, _) => gr1
    }
  }.foreach(println(_))
}
It gives:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

Scala -Create Map from Spark DataFrame

I have a Spark DataFrame and I want to create a Map and store the values as Map[String, Map[String, String]].
I have no idea how to do it; any help would be appreciated.
Below are the input and output formats:
Input :
+-----------------+------------+---+--------------------------------+
|relation |obj_instance|obj|map_value |
+-----------------+------------+---+--------------------------------+
|Start~>HInfo~>Mnt|Mnt |Mnt|[Model -> 2000, Version -> 1.0] |
|Start~>HInfo~>Cbl|Cbl-3 |Cbl|[VSData -> XYZVN, Name -> Smart]|
+-----------------+------------+---+--------------------------------+
Output :
Map(relation -> Start~>HInfo~>Mnt, obj_instance -> Mnt, obj -> Mnt, Mnt -> Map(Model -> 2000, Version -> 1.0))
Map(relation -> Start~>HInfo~>Cbl, obj_instance -> Cbl-3, obj -> Cbl, Cbl -> Map(VSData -> XYZVN, Name -> Smart))
This is the code I'm trying, but without success:
var resultMap: Map[Any, Any] = Map()
groupedDataSet.foreach( r => {
  val key1 = "relation".toString
  val value1 = r(0).toString
  val key2 = "obj_instance".toString
  val value2 = r(1).toString
  val key3 = "obj".toString
  val value3 = r(2).toString
  val key4 = r(2).toString
  val value4 = r(3)
  resultMap += (key1 -> value1, key2 -> value2, key3 -> value3, key4 -> value4)
})
resultMap.foreach(println)
Please help.
Below is the code to create the test DataFrame and the map column:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object DFToMap extends App {
  // Creating SparkSession
  lazy val conf = new SparkConf().setAppName("df-to-map").set("spark.default.parallelism", "2")
    .setIfMissing("spark.master", "local[*]")
  lazy val sparkSession = SparkSession.builder().config(conf).getOrCreate()
  import sparkSession.implicits._

  // Creating raw DataFrame
  val rawTestDF = Seq(("Start~>HInfo~>Cbl", "Cbl-3", "Cbl", "VSData", "XYZVN"), ("Start~>HInfo~>Cbl", "Cbl-3", "Cbl", "Name", "Smart"),
    ("Start~>HInfo~>Mnt", "Mnt", "Mnt", "Model", "2000"), ("Start~>HInfo~>Mnt", "Mnt", "Mnt", "Version", "1.0"))
    .toDF("relation", "obj_instance", "obj", "key", "value")
  rawTestDF.show(false)

  val joinTheMap = udf { json_value: Seq[Map[String, String]] => json_value.flatten.toMap }
  val groupedDataSet = rawTestDF.groupBy("relation", "obj_instance", "obj")
    .agg(collect_list(map(col("key"), col("value"))) as "map_value_temp")
    .withColumn("map_value", joinTheMap(col("map_value_temp")))
    .drop("map_value_temp")
  groupedDataSet.show(false) // This is the input DataFrame.
}
Final output JSON from the Map:
[{"relation":"Start~>HInfo~>Mnt","obj_instance":"Mnt","obj":"Mnt","Mnt":{"Model":"2000","Version":"1.0"}},
{"relation":"Start~>HInfo~>Cbl","obj_instance":"Cbl-3","obj":"Cbl","Cbl":{"VSData":"XYZVN","Name":"Smart"}}]
Note: I don't want to use Spark groupBy, pivot, or agg, as Spark Streaming doesn't support multiple aggregations. Hence I want to get it with pure Scala code. Kindly help.
I created a custom UDF to parse the data and generate it in JSON format.
import org.json4s.native.JsonMethods._
import org.json4s._
import org.json4s.JsonDSL._

def toJson(relation: String, obj_instance: String, obj: String, map_value: Map[String, String]) = {
  compact(render(
    JObject("relation" -> JString(relation),
      "obj_instance" -> JString(obj_instance),
      "obj" -> JString(obj),
      obj -> map_value)))
}

import org.apache.spark.sql.functions._
val createJson = udf(toJson _)
val df = Seq(
  ("Start~>HInfo~>Mnt", "Mnt", "Mnt", Map("Model" -> "2000", "Version" -> "1.0")),
  ("Start~>HInfo~>Cbl", "Cbl-3", "Cbl", Map("VSData" -> "XYZVN", "Name" -> "Smart"))
).toDF("relation", "obj_instance", "obj", "map_value")
df.select(createJson($"relation", $"obj_instance", $"obj", $"map_value").as("json_map")).show(false)
+-----------------------------------------------------------------------------------------------------------+
|json_map |
+-----------------------------------------------------------------------------------------------------------+
|{"relation":"Start~>HInfo~>Mnt","obj_instance":"Mnt","obj":"Mnt","Mnt":{"Model":"2000","Version":"1.0"}} |
|{"relation":"Start~>HInfo~>Cbl","obj_instance":"Cbl-3","obj":"Cbl","Cbl":{"VSData":"XYZVN","Name":"Smart"}}|
+-----------------------------------------------------------------------------------------------------------+
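If the goal is really Scala Map objects rather than JSON strings, one option is to collect the grouped rows to the driver and build the maps there. This is only a minimal sketch, assuming the groupedDataSet built in the question and a result small enough to collect:
// one Map per row; the map column is assumed to come back as a Scala Map when collected
val resultMaps: Array[Map[String, Any]] = groupedDataSet.collect().map { r =>
  Map(
    "relation" -> r.getAs[String]("relation"),
    "obj_instance" -> r.getAs[String]("obj_instance"),
    "obj" -> r.getAs[String]("obj"),
    r.getAs[String]("obj") -> r.getAs[Map[String, String]]("map_value")
  )
}
resultMaps.foreach(println)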

Scala Escape Character Regex

How can I write an expression to filter inputs so that only those in the format (AAA), where A is a digit from 0-9, are kept?
E.g.: (123), (592), (999)
Usually you want to do more than filter.
scala> val r = raw"\(\d{3}\)".r
r: scala.util.matching.Regex = \(\d{3}\)
scala> List("(123)", "xyz", "(456)").filter { case r() => true case _ => false }
res0: List[String] = List((123), (456))
scala> import PartialFunction.{cond => when}
import PartialFunction.{cond=>when}
scala> List("(123)", "xyz", "(456)").filter(when(_) { case r() => true })
res1: List[String] = List((123), (456))
Keeping all matches from each input:
scala> List("a(123)b", "xyz", "c(456)d").flatMap(s =>
| r.findAllMatchIn(s).map(_.matched).toList)
res2: List[String] = List((123), (456))
scala> List("a(123)b", "xyz", "c(456)d(789)e").flatMap(s =>
| r.findAllMatchIn(s).map(_.matched).toList)
res3: List[String] = List((123), (456), (789))
Keeping just the first:
scala> val r = raw"(\(\d{3}\))".r.unanchored
r: scala.util.matching.UnanchoredRegex = (\(\d{3}\))
scala> List("a(123)b", "xyz", "c(456)d(789)e").flatMap(r.unapplySeq(_: String)).flatten
res4: List[String] = List((123), (456))
scala> List("a(123)b", "xyz", "c(456)d(789)e").collect { case r(x) => x }
res5: List[String] = List((123), (456))
Keeping entire lines that match:
scala> List("a(123)b", "xyz", "c(456)d(789)e").collect { case s # r(_*) => s }
res6: List[String] = List(a(123)b, c(456)d(789)e)
Java API:
scala> import java.util.regex._
import java.util.regex._
scala> val p = Pattern.compile(raw"(\(\d{3}\))")
p: java.util.regex.Pattern = (\(\d{3}\))
scala> val q = p.asPredicate
q: java.util.function.Predicate[String] = java.util.regex.Pattern$$Lambda$1107/824691524@3234474
scala> List("(123)", "xyz", "(456)").filter(q.test)
res0: List[String] = List((123), (456))
Typically you create regexes by using the .r method available on strings, such as "[0-9]".r. However, as you have noticed, backslashes inside an ordinary string literal are treated as string escape sequences by the parser, not as part of the regex.
For this, you can use Scala's triple-quoted strings, which create strings of the exact character sequence, including backslashes and newlines.
To create a regex like you describe, you could write """\(\d\d\d\)""".r. Here's an example of it in use:
val regex = """\(\d\d\d\)""".r.pattern
Seq("(123)", "(---)", "456").filter(str => regex.matcher(str).matches)

How to apply filters while reading all the json files from a folder using scala?

I have a folder which has multiple JSON files (first.json, second.json). Using Scala, I am loading all of the JSON files' data into a Spark RDD/Dataset and then applying a filter on the data.
The problem here is that if we have 600 of them, we need to load all of them into the RDD/Dataset before the filter is applied.
I am looking for a solution where I can filter the records while reading from the folder itself, rather than loading everything into Spark memory.
Filtering is done based on the BlockHeight property.
JSON structure in each file:
first.json:
[{"IsFee":false,"BlockDateTime":"2015-10-14T09:02:46","Address":"0xe8fdc802e721426e0422d18d371ab59a41ddaeac","BlockHeight":381859,"Type":"IN","Value":0.61609232637203584,"TransactionHash":"0xe6fc01ff633b4170e0c8f2df7db717e0608f8aaf62e6fbf65232a7009b53da4e","UserName":null,"ProjectName":null,"CreatedUser":null,"Id":0,"CreatedUserId":0,"CreatedTime":"2019-08-26T22:32:45.2686137+05:30","UpdatedUserId":0,"UpdatedTime":"2019-08-26T22:32:45.2696126+05:30"},{"IsFee":false,"BlockDateTime":"2015-10-14T09:02:46","Address":"0x52bc44d5378309ee2abf1539bf71de1b7d7be3b5","BlockHeight":381859,"Type":"OUT","Value":-0.61609232637203584,"TransactionHash":"0xe6fc01ff633b4170e0c8f2df7db717e0608f8aaf62e6fbf65232a7009b53da4e","UserName":null,"ProjectName":null,"CreatedUser":null,"Id":0,"CreatedUserId":0,"CreatedTime":"2019-08-26T22:32:45.3141203+05:30","UpdatedUserId":0,"UpdatedTime":"2019-08-26T22:32:45.3141203+05:30"}]
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object BalanceAndTransactionDownload {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("xxx").getOrCreate()
    val currencyDataSchema = StructType(Array(
      StructField("Type", StringType, true),
      StructField("TransactionHash", StringType, true),
      StructField("BlockHeight", LongType, true),
      StructField("BlockDateTime", TimestampType, true),
      StructField("Value", DecimalType(38, 18), true),
      StructField("Address", StringType, true),
      StructField("IsFee", BooleanType, true)
    ))
    val projectAddressesFile = args(0)
    val blockJSONFilesContainer = args(1)
    val balanceFolderName = args(2)
    val downloadFolderName = args(3)
    val blockHeight = args(4)
    val projectAddresses = spark.read.option("multiline", "true").json(projectAddressesFile)
    val currencyDataFile = spark.read.option("multiline", "true").schema(currencyDataSchema).json(blockJSONFilesContainer) // This is where I want to filter out the data
    val filteredcurrencyData = currencyDataFile.filter(currencyDataFile("BlockHeight") <= blockHeight)
    filteredcurrencyData.join(projectAddresses, filteredcurrencyData("Address") === projectAddresses("address"))
      .groupBy(projectAddresses("address"))
      .agg(sum("Value").alias("Value"))
      .repartition(1)
      .write.option("header", "true").format("com.databricks.spark.csv").csv(balanceFolderName)
    filteredcurrencyData.join(projectAddresses, filteredcurrencyData("Address") === projectAddresses("address"))
      .drop(projectAddresses("address"))
      .drop(projectAddresses("CurrencyId"))
      .drop(projectAddresses("Id"))
      .repartition(1)
      .write.option("header", "true").format("com.databricks.spark.csv").csv(downloadFolderName)
  }
}
The files sitting on the datastore should be partitioned. You seem to filter by block height, so you can have multiple folders like blockheight=1, blockheight=2, etc., with the JSON files inside those folders. In this case Spark will not read all the JSON files; it will instead scan only the folders required.
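For example, if the files were rewritten under a partition-style layout (the path and the partition column name below are hypothetical), a filter on the partition column lets Spark prune whole folders instead of reading every file. A minimal sketch, reusing currencyDataSchema and blockHeight from the question:
// hypothetical layout: /data/blocks/blockheight=381859/first.json, /data/blocks/blockheight=381860/second.json, ...
val currencyData = spark.read
  .option("multiline", "true")
  .schema(currencyDataSchema) // schema of the JSON fields; the blockheight partition column is discovered from the folder names
  .json("/data/blocks")
// filtering on the partition column prunes the non-matching folders
val filteredcurrencyData = currencyData.filter(col("blockheight") <= blockHeight.toLong)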