I have a DataFrame with a column containing an array:
+-----+----------------+
|User |Color           |
+-----+----------------+
|User1|[Green,Blue,Red]|
|User2|[Blue,Red]      |
+-----+----------------+
I am trying to filter for User1 and get the list of colors into a Scala List:
val colorsList: List[String] = List("Green","Blue","Red")
Here's what I have tried so far (output is added as comments):
Attempt 1:
val dfTest1 = myDataframe.where("User=='User1'").select("Color").rdd.map(r => r(0)).collect()
println(dfTest1) // [Ljava.lang.Object;@44022255
for (EachColor <- dfTest1) {
  println(EachColor) // WrappedArray(Green, Blue, Red)
}
Attempt 2:
val dfTest2 = myDataframe.where("User=='User1'").select("Color").collectAsList.get(0).getList(0)
println(dfTest2) //[Green, Blue, Red] but type is util.List[Nothing]
Attempt 3:
val dfTest32 = myDataframe.where("User=='User1'").select("Color").rdd.map(r => r(0)).collect.toList
println(dfTest32) //List(WrappedArray(Green, Blue, Red))
for (EachColor <- dfTest32) {
  println(EachColor) // WrappedArray(Green, Blue, Red)
}
Attempt 4:
val dfTest31 = myDataframe.where("User=='User1'").select("Color").map(r => r.getString(0)).collect.toList
//Exception : scala.collection.mutable.WrappedArray$ofRef cannot be cast to java.lang.String
You can try getting the value as a Seq[String] and converting it with toList:
val colorsList = df.where("User=='User1'")
.select("Color")
.rdd.map(r => r.getAs[Seq[String]](0))
.collect()(0)
.toList
Or, equivalently:
val colorsList = df.where("User=='User1'")
.select("Color")
.collect()(0)
.getAs[Seq[String]](0)
.toList
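As a further sketch, assuming a SparkSession named spark with spark.implicits._ imported, you could also stay in the Dataset API and encode the single array column directly:

import spark.implicits._

val colorsList: List[String] = df
  .where("User == 'User1'")
  .select("Color")
  .as[Seq[String]]  // encode the single array<string> column as Seq[String]
  .first()
  .toList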
I have a Spark DataFrame with two columns:
colA colB
1 3
1 2
2 4
2 5
2 1
I want to groupBy colA and iterate over the colB list for each group such that:
res = 0
for i in collect_list(col("colB")):
res += i * (3+res)
The returned value shall be res,
so I get:
colA colB
1 24
2 78
How can I do this in Scala?
You can achieve the result you want with the following:
import org.apache.spark.sql.functions.{aggregate, collect_list, lit}

val df = Seq((1, 3), (1, 2), (2, 4), (2, 5), (2, 1)).toDF("colA", "colB")

val retDf = df
  .groupBy("colA")
  .agg(
    aggregate(
      collect_list("colB"), lit(0), (acc, nxt) => nxt * (acc + 3)
    ) as "colB")
You need to be very careful with this, however, as data in Spark is distributed. If the data has been shuffled since being read into Spark, there is no guarantee that it will be collected in the same order. In the toy example, collect_list("colB") will return Seq(3, 2) where colA is 1. If there had been any shuffles at an earlier phase, however, collect_list could just as well return Seq(2, 3), which would give you 27 instead of the desired 24. You need to attach some metadata to your data that lets you process it in the order you expect, for example with the monotonically_increasing_id function.
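A minimal sketch of that idea, assuming Spark 3.0+ (where aggregate and transform are available in the Scala functions API) and spark.implicits._ in scope: tag every row before any shuffle, collect (rowId, colB) structs, sort them by rowId, and only then fold.

import org.apache.spark.sql.functions._

val orderedDf = df
  .withColumn("rowId", monotonically_increasing_id())  // assign ids before any shuffle
  .groupBy("colA")
  .agg(
    aggregate(
      // sort the collected (rowId, colB) structs by rowId, then keep only colB
      transform(
        array_sort(collect_list(struct($"rowId", $"colB"))),
        s => s.getField("colB")
      ),
      lit(0),
      (acc, nxt) => nxt * (acc + 3)
    ) as "colB")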
RDD approach with no loss of ordering.
%scala
// Zip with index to remember the original row order.
val rdd1 = spark.sparkContext.parallelize(Seq((1,3), (1,2), (2,4), (2,5), (2,1)))
  .zipWithIndex()
  .map(x => (x._1._1, (x._1._2, x._2)))
val rdd2 = rdd1.groupByKey
// Convert to Array.
val rdd3 = rdd2.map(x => (x._1, x._2.toArray))
val rdd4 = rdd3.map(x => (x._1, x._2.sortBy(_._2)))
val rdd5 = rdd4.mapValues(v => v.map(_._1))
rdd5.collect()
val res = rdd5.map(x => (x._1, x._2.fold(0)((acc, nxt) => nxt * (acc + 3) )))
res.collect()
returns:
res201: Array[(Int, Int)] = Array((1,24), (2,78))
Convert from and to a DataFrame as required.
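For example, a minimal sketch of the conversion back to a DataFrame (assuming spark.implicits._ is in scope):

import spark.implicits._

val resDf = res.toDF("colA", "colB")
resDf.show()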
I'm trying to interact with this List[Option[Map[String, DataFrame]]] but I'm having a bit of trouble.
Inside it has something like this:
customer1 -> dataframeX
customer2 -> dataframeY
customer3 -> dataframeZ
Where the customer is an identifier that will become a new column.
I need to do a union of dataframeX, dataframeY and dataframeZ (all DataFrames have the same columns). Before, I had this:
map(_.get).reduce(_ union _).select(columns:_*)
And it was working fine because I only had a List[Option[DataFrame]] and didn't need the identifier, but I'm having trouble with the new list. My idea is to modify my old mapping. I know I can do things like "(0).get", which would give me "Map(customer1 -> dataframeX)", but I'm not quite sure how to do that iteration in the mapping and get the final DataFrame that is the union of all three plus the identifier. My idea:
map(/*get identifier here along with dataframe*/).reduce(_ union _).select(identifier +: columns:_*)
The final result would be something like:
-------------------------------
|identifier | product |State |
-------------------------------
| customer1| prod1 | VA |
| customer1| prod132 | VA |
| customer2| prod32 | CA |
| customer2| prod51 | CA |
| customer2| prod21 | AL |
| customer2| prod52 | AL |
-------------------------------
You could use collect to unnest Option[Map[String, DataFrame]] to Map[String, DataFrame]. To put the identifier into a column you should use withColumn. So your code could look like:
import org.apache.spark.sql.functions.lit
val result: DataFrame = frames.collect {
  case Some(m) =>
    m.map {
      case (identifier, dataframe) => dataframe.withColumn("identifier", lit(identifier))
    }.reduce(_ union _)
}.reduce(_ union _)
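If you then want the identifier column to come first, as in the expected output, a short sketch (assuming columns: Seq[String] holds the remaining column names used in your original select):

import org.apache.spark.sql.functions.col

val finalDf = result.select(("identifier" +: columns).map(col): _*)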
Something like this perhaps?
import org.apache.spark.sql.functions.lit

list
  .flatten
  .flatMap {
    _.map { case (id, df) =>
      df.withColumn("identifier", lit(id))
    }
  }.reduce(_ union _)
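For reference, a hypothetical toy input this would run against (dfX, dfY and dfZ are placeholder DataFrames sharing the same schema):

import org.apache.spark.sql.DataFrame

val list: List[Option[Map[String, DataFrame]]] = List(
  Some(Map("customer1" -> dfX)),  // dfX, dfY, dfZ are placeholders
  Some(Map("customer2" -> dfY)),
  None,                           // None entries are dropped by flatten
  Some(Map("customer3" -> dfZ))
)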
We have a .txt log file; I used Scala Spark to read it. The file contains sets of data row by row. I read the data like below:
val sc = spark.sparkContext
val dataframe = sc.textFile("/path/to/log/*.txt")
val get_set_element = sc.textFile("filepath.txt")
val pattern = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r

val test = get_set_element.map { line =>
  (for {
    m <- pattern.findAllIn(line).matchData
    g <- m.subgroups
  } yield g).toList
}.map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
I want to create a DataFrame so that I can save it into a CSV file.
It can be created from an RDD[Row], with a schema assigned:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// instead of: map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
.map(Row.fromSeq)

val fields = (0 to 5).map(idx => StructField(name = "l" + idx, dataType = StringType, nullable = true))
val df = spark.createDataFrame(test, StructType(fields))
Output:
+---+---+---+---+---+---+
|l0 |l1 |l2 |l3 |l4 |l5 |
+---+---+---+---+---+---+
|a |b |c |d |e |f |
+---+---+---+---+---+---+
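To then save it as a CSV file, a minimal sketch (the output path is hypothetical):

df.write
  .option("header", "true")
  .csv("/path/to/output")  // hypothetical output directory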
I need to write a regex for a condition check in Spark while doing a join.
My regex should match the strings below:
n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2
df1.select("location1").distinct.show()
+---------------+
|      location1|
+---------------+
|  n3_testindia1|
|n2_stagamerica2|
| n1_prodeurope2|
+---------------+
df2.select("loc1").distinct.show()
+--------------+
|          loc1|
+--------------+
|  test-india-1|
|stag-america-2|
| prod-europe-2|
+--------------+
I want to join based on the location columns, like below:
val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))
Based on the information above, you can do that in Spark 2.4.0 using:
import org.apache.spark.sql.functions.{length, lit, regexp_extract, translate}

val joindf = df1.join(df2,
  regexp_extract(df1("location1"), """[^_]+_(.*)""", 1)
    === translate(df2("loc1"), "-", ""))
Or in prior versions something like
val joindf = df1.join(df2,
  df1("location1").substr(lit(4), length(df1("location1")))
    === translate(df2("loc1"), "-", ""))
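As a quick sanity check on the toy data (a sketch assuming df1 and df2 hold the sample values shown in the question, using the same imports as above), both sides should render the same stripped form, e.g. testindia1:

df1.select(regexp_extract(df1("location1"), """[^_]+_(.*)""", 1).as("normalized")).show()
df2.select(translate(df2("loc1"), "-", "").as("normalized")).show()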
You can split location1 by "_" and take the 2nd element, then match it against loc1 with the "-" characters removed. Check this out:
scala> val df1 = Seq(("n3_testindia1"),("n2_stagamerica2"),("n1_prodeurope2")).toDF("location1")
df1: org.apache.spark.sql.DataFrame = [location1: string]
scala> val df2 = Seq(("test-india-1"),("stag-america-2"),("prod-europe-2")).toDF("loc1")
df2: org.apache.spark.sql.DataFrame = [loc1: string]
scala> df1.join(df2,split('location1,"_")(1) === regexp_replace('loc1,"-",""),"inner").show
+---------------+--------------+
| location1| loc1|
+---------------+--------------+
| n3_testindia1| test-india-1|
|n2_stagamerica2|stag-america-2|
| n1_prodeurope2| prod-europe-2|
+---------------+--------------+
scala>
I have a DataFrame with columns (A, B), where column B is free text which I am converting to a type (NOT_FOUND, TOO_LOW_PURCHASE_COUNT, etc.) to aggregate better. I created a switch case of all possible patterns and their respective types, but it is not working.
def getType(x: String): String = x match {
  case "Item % not found %" => "NOT_FOUND"
  case "%purchase count % is too low %" => "TOO_LOW_PURCHASE_COUNT"
  case _ => "Unknown"
}

getType("Item 75gb not found")

val newdf = df.withColumn("updatedType", getType(col("raw_type")))
This gives me "Unknown". Can someone tell me how to do a switch case with the like operator?
Use when and like
import org.apache.spark.sql.functions.when
import spark.implicits._  // for $ and toDF

val df = Seq(
  "Item foo not found", "Foo purchase count 1 is too low ", "#!#"
).toDF("raw_type")

val newdf = df.withColumn(
  "updatedType",
  when($"raw_type" like "Item % not found%", "NOT_FOUND")
    .when($"raw_type" like "%purchase count % is too low%", "TOO_LOW_PURCHASE_COUNT")
    .otherwise("Unknown")
)
Result:
newdf.show
// +--------------------+--------------------+
// | raw_type| updatedType|
// +--------------------+--------------------+
// | Item foo not found| NOT_FOUND|
// |Foo purchase coun...|TOO_LOW_PURCHASE_...|
// | #!#| Unknown|
// +--------------------+--------------------+
Reference:
Spark Equivalent of IF Then ELSE
Filter spark DataFrame on string contains
The SQL wildcard "%" can be replaced with ".*" in the regex world. A UDF can be created to match the value against the patterns:
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for $ and toDF

val originalSqlLikePatternMap = Map(
  "Item % not found%" -> "NOT_FOUND",
  // 20 other patterns here
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Item foo not found ", "Foo purchase count 1 is too low ", "#!#"
).toDF("raw_type")

val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)

val result = df.withColumn("updatedType", converterUDF($"raw_type"))
result.show(false)
Output:
+--------------------------------+----------------------+
|raw_type |updatedType |
+--------------------------------+----------------------+
|Item foo not found |NOT_FOUND |
|Foo purchase count 1 is too low |TOO_LOW_PURCHASE_COUNT|
|#!# |Unknown |
+--------------------------------+----------------------+
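A hedged alternative sketch that avoids the UDF entirely: fold the same pattern map into a chain of when/like expressions (with overlapping patterns, entries folded later are checked first; this reuses the df and originalSqlLikePatternMap defined above and assumes spark.implicits._ is in scope):

import org.apache.spark.sql.functions.{lit, when}

val updatedType = originalSqlLikePatternMap.foldLeft(lit("Unknown")) {
  case (fallback, (pattern, label)) =>
    // each step wraps the previous expression as the fallback branch
    when($"raw_type" like pattern, label).otherwise(fallback)
}

val result2 = df.withColumn("updatedType", updatedType)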