How to add a switch case for the like operator in Scala?

I have a data frame with columns (A, B), where column B is free text which I am converting to a type (NOT_FOUND, TOO_LOW_PURCHASE_COUNT, etc.) to aggregate better. I created a switch case of all possible patterns and their respective types, but it is not working.
def getType(x: String): String = x match {
  case "Item % not found %" => "NOT_FOUND"
  case "%purchase count % is too low %" => "TOO_LOW_PURCHASE_COUNT"
  case _ => "Unknown"
}
getType("Item 75gb not found")
val newdf = df.withColumn("updatedType", getType(col("raw_type")))
This gives me "Unknown". Can someone tell me how to do a switch case with the like operator?

Use when and like. (A Scala match on String literals tests exact equality, so % has no wildcard meaning there, and a plain Scala function cannot be applied to a Column in the first place.)
import org.apache.spark.sql.functions.when
import spark.implicits._ // for $ and toDF

val df = Seq(
  "Item foo not found", "Foo purchase count 1 is too low ", "#!#"
).toDF("raw_type")

val newdf = df.withColumn(
  "updatedType",
  when($"raw_type" like "Item % not found%", "NOT_FOUND")
    .when($"raw_type" like "%purchase count % is too low%", "TOO_LOW_PURCHASE_COUNT")
    .otherwise("Unknown")
)
Result:
newdf.show
// +--------------------+--------------------+
// | raw_type| updatedType|
// +--------------------+--------------------+
// | Item foo not found| NOT_FOUND|
// |Foo purchase coun...|TOO_LOW_PURCHASE_...|
// | #!#| Unknown|
// +--------------------+--------------------+
Reference:
Spark Equivalent of IF Then ELSE
Filter spark DataFrame on string contains
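As a side note (my own sketch, not part of the linked answers): with a long list of patterns, the when chain can be built programmatically by folding a Map of SQL LIKE patterns into a single Column, assuming the same df and imports as above:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when}

// hypothetical pattern map; extend with your real patterns
val likePatterns = Map(
  "Item % not found%" -> "NOT_FOUND",
  "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT"
)

// fold into nested when(...).otherwise(...) with "Unknown" as the final fallback
val updatedType: Column = likePatterns.foldLeft(lit("Unknown")) {
  case (fallback, (pattern, label)) =>
    when($"raw_type" like pattern, label).otherwise(fallback)
}

val newdf = df.withColumn("updatedType", updatedType)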

The SQL symbol "%" corresponds to ".*" in the regex world. A UDF can be created to match a value against the patterns:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val originalSqlLikePatternMap = Map(
  "Item % not found%" -> "NOT_FOUND",
  // 20 other patterns here
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Item foo not found ", "Foo purchase count 1 is too low ", "#!#"
).toDF("raw_type")

val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
result.show(false)
Output:
+--------------------------------+----------------------+
|raw_type |updatedType |
+--------------------------------+----------------------+
|Item foo not found |NOT_FOUND |
|Foo purchase count 1 is too low |TOO_LOW_PURCHASE_COUNT|
|#!# |Unknown |
+--------------------------------+----------------------+
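One caveat worth adding to this approach (my note, not from the original answer): replaceAll("%", ".*") leaves any regex metacharacters from the LIKE pattern, such as parentheses or dots, with their regex meaning. If your patterns contain such characters, a safer sketch quotes the literal segments first:
import java.util.regex.Pattern

// quote each literal segment, then rejoin the segments with ".*"
val safePatternMap = originalSqlLikePatternMap.map { case (like, label) =>
  like.split("%", -1).map(Pattern.quote).mkString(".*") -> label
}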

Related

Spark dataframe value to Scala List

I have a dataframe with a column containing an Array:
+-----+----------------+
|User |Color           |
+-----+----------------+
|User1|[Green,Blue,Red]|
|User2|[Blue,Red]      |
+-----+----------------+
I am trying to filter for User1 and get the list of colors into a Scala List:
val colorsList: List[String] = List("Green","Blue","Red")
Here's what I have tried so far (output is added as comments):
Attempt 1:
val dfTest1 = myDataframe.where("User=='User1'").select("Color").rdd.map(r => r(0)).collect()
println(dfTest1) // [Ljava.lang.Object;@44022255
for (EachColor <- dfTest1) {
  println(EachColor) // WrappedArray(Green, Blue, Red)
}
Attempt 2:
val dfTest2 = myDataframe.where("User=='User1'").select("Color").collectAsList.get(0).getList(0)
println(dfTest2) //[Green, Blue, Red] but type is util.List[Nothing]
Attempt 3:
val dfTest32 = myDataframe.where("User=='User1'").select("Color").rdd.map(r => r(0)).collect.toList
println(dfTest32) // List(WrappedArray(Green, Blue, Red))
for (EachColor <- dfTest32) {
  println(EachColor) // WrappedArray(Green, Blue, Red)
}
Attempt 4:
val dfTest31 = myDataframe.where("User=='User1'").select("Color").map(r => r.getString(0)).collect.toList
//Exception : scala.collection.mutable.WrappedArray$ofRef cannot be cast to java.lang.String
You can try getting it as a Seq[String] and converting it with toList:
val colorsList = df.where("User=='User1'")
  .select("Color")
  .rdd.map(r => r.getAs[Seq[String]](0))
  .collect()(0)
  .toList
Or equivalently:
val colorsList = df.where("User=='User1'")
  .select("Color")
  .collect()(0)
  .getAs[Seq[String]](0)
  .toList
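If you prefer the typed API, an equivalent sketch (assuming spark.implicits._ is in scope to provide the Seq[String] encoder):
val colorsList = df.where("User=='User1'")
  .select("Color")
  .as[Seq[String]]
  .first()
  .toList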

Power BI, Power Query - Go through list and add text with each step (for each)

I have a list for example:
A
B
C
D
Now I want to create a query in power query that will create a text like this:
"The letters are: " A B C D
I already have this query:
let
    xx = KEYListfunction(),
    list = Text.Split(xx, ","),
    text = "",
    loop = List.Accumulate(list, 0, (state, current) => text = text & " " & current)
in
    loop
But the result only says "FALSE"
Are you looking for this? (Your query returns FALSE because in M, = inside an expression is a comparison rather than an assignment, so every step of your List.Accumulate evaluates text = text & " " & current to false.)
let
    list = {"A".."D"},
    text = "The letters are: " & Text.Combine(list, ", ")
in
    text

How do I match multiple regex patterns against a column value in Spark?

I have the following:
val originalSqlLikePatternMap = Map(
  "item (%) is blacklisted%" -> "BLACK_LIST",
  "%Testing%" -> "TESTING",
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "Foo purchase count (12, 4) is too low ", "#!#", "item (mejwnw) is blacklisted",
  "item (1) is blacklisted, #!#"
).toDF("raw_type")

val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
but it gives:
+---------------------------------------------------------+----------------------+
|raw_type |updatedType |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT|
|#!# |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!# |BLACK_LIST |
+---------------------------------------------------------+----------------------+
But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to give two values, "TESTING, TOO_LOW_PURCHASE_COUNT", like this:
+---------------------------------------------------------+--------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!# |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!# |BLACK_LIST, Unknown |
+---------------------------------------------------------+--------------------------------+
Can someone tell me what I am doing wrong here?
OK, so, a couple of things here.
Regarding find: you need to check each row against each regex for your desired output, so find is not the right choice, because it returns only
the first value produced by the iterator satisfying a predicate, if
any.
Also take care with the regex: you've left a space after "low", and that's why that pattern is not matching. Maybe you should also reconsider just blindly replacing % with .*:
%purchase count % is too low %
So, with the changes, your code will be something like:
val originalSqlLikePatternMap = Map(
  "item (%) is blacklisted%" -> "BLACK_LIST",
  "%Testing%" -> "TESTING",
  "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)

val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
  "Foo purchase count (12, 4) is too low ", "#!#", "item (mejwnw) is blacklisted",
  "item (1) is blacklisted, #!#"
).toDF("raw_type")

val converter = (value: String) => {
  val res = javaPatternMap.map(v => {
    v._1.findFirstIn(value) match {
      case Some(_) => v._2
      case None => ""
    }
  }).filter(_.nonEmpty).mkString(", ")
  if (res.isEmpty) "Unknown" else res
}

val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
result.show(false)
Output:
+---------------------------------------------------------+-------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+-------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!# |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!# |BLACK_LIST |
+---------------------------------------------------------+-------------------------------+
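As a variation (my own sketch, reusing the javaPatternMap of compiled Regex values from above): the UDF can return the matching labels as an array column instead of one comma-separated string, which is easier to explode or filter downstream:
// collect the label of every pattern that matches somewhere in the value
val labelsUDF = udf { value: String =>
  val labels = javaPatternMap.collect {
    case (regex, label) if regex.findFirstIn(value).isDefined => label
  }.toSeq
  if (labels.isEmpty) Seq("Unknown") else labels
}

val arrayResult = df.withColumn("updatedTypes", labelsUDF($"raw_type"))
arrayResult.show(false)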
Hope this helps!

spark scala pattern matching on a dataframe column

I am coming from an R background. I was able to implement the pattern search on a DataFrame column in R, but I am now struggling to do it in Spark Scala. Any help would be appreciated.
The problem statement is broken down in detail just to describe it appropriately.
DF:
Case                               Freq
135322                             265
183201,135322                      36
135322,135322                      18
135322,121200                      11
121200,135322                      8
112107,112107                      7
183201,135322,135322               4
112107,135322,183201,121200,80000  2
I am looking for a pattern-search UDF which gives me back all the matches of the pattern and then the corresponding Freq values from the second column.
Example: for pattern 135322, I would like to find all the matches in the first column Case. It should return the corresponding Freq numbers from the Freq column,
like 265, 36, 18, 11, 8, 4, 2.
For pattern 112107,112107 it should return just 7, because there is one matching pattern.
This is how the end result should look:
Case                               Freq  results
135322                             265   265+36+18+11+8+4+2
183201,135322                      36    36+4+2
135322,135322                      18    18+4
135322,121200                      11    11+2
121200,135322                      8     8+2
112107,112107                      7     7
183201,135322,135322               4     4
112107,135322,183201,121200,80000  2     2
What I have tried so far:
val text = DF.select("case").collect().map(_.getString(0)).mkString("|")

// search function for pattern search
val valsum = udf((txt: String, pattern: String) => {
  txt.split("\\|").count(_.contains(pattern))
})

// apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum(lit(text), DF("case")))
This one works:
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._

object playground extends App {
  import org.apache.spark.sql.functions._

  val pattern = "135322,121200" // pattern you want to search for

  // udf declaration
  val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) => {
    var result = true
    val splitPattern = pattern.split(",")
    val splitCaseCol = caseCol.split(",")
    var foundAtIndex = -1
    for (i <- 0 to splitPattern.length - 1) {
      breakable {
        for (j <- 0 to splitCaseCol.length - 1) {
          if (j > foundAtIndex) {
            println(splitCaseCol(j))
            if (splitCaseCol(j) == splitPattern(i)) {
              result = true
              foundAtIndex = j
              break
            } else result = false
          } else result = false
        }
      }
    }
    println(caseCol, result)
    result
  }

  // registering the udf
  val udfFilter = udf(coder)

  // reading the input file
  val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")

  // calling the function and aggregating
  df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern", "sum").show
}
If the input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
If the input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+
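For what it's worth, the nested loops amount to an in-order subsequence check; here is a more compact sketch of the same idea (mine, not from the answer), using a shared iterator that only moves forward:
// true iff every element of pattern occurs in caseCol in the same order
val isSubsequence: (String, String) => Boolean = (caseCol, pattern) => {
  val it = caseCol.split(",").iterator
  pattern.split(",").forall(p => it.exists(_ == p)) // exists advances the iterator
}
// drop-in replacement for coder above
val udfFilter = udf(isSubsequence)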

Regex / subString to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a string now. I need to retrieve values as concisely as possible, such that I get a map of values of this format: Map(Query -> T1 to T2 seconds). Basically this is the status of all the slow queries running on the MySQL slave server. I am building an alert system over it. So from this entire paragraph, in the form of a String, I need to separate out the queries and save the corresponding time range with them.
1.001303 to 31.856310 is a time range, and against the time range the corresponding query is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
This information I was hoping to save in a Map in Scala, a Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;" -> "4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach { m => println("group0" + m.group(1) + "next group" + m.group(2) + m.group(3)) }
I am using the above statement to extract the repeating cells so I can do my manipulations on them later, but it doesn't seem to be working.
Thanks in advance! I know there are several ways to do this, but all the ones striking me are inefficient and tedious. I need Scala to do the same! Maybe I can extract recursively using the subString method?
If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (acc, el) =>
      val (taking, map) = acc // taking contains the range
      taking match {
        case Some(range) if el.trim.nonEmpty => // Some contains a range
          (None, map + (el -> range)) // add to map
        case None =>
          regex.findFirstIn(el) match { // extract the range
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map) // probably an empty line
      }
  }
  map
}
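Usage would look something like this (a sketch; the keys of the returned map are the query lines and the values are the extracted time ranges):
logToMap(txt).foreach { case (query, range) =>
  println(s"$range -> $query")
}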
Modified ajozwik's answer to work for SQL commands spanning multiple lines:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range

def logToMap(txt: String) = {
  val (_, map) = txt.lines.foldLeft[(Option[String], Map[String, String])]((None, Map.empty)) {
    (accumulator, element) =>
      val (taking, map) = accumulator
      taking match {
        case Some(range) if element.trim.nonEmpty =>
          if (element.contains("Queries"))
            (None, map)
          else
            (Some(range), map + (range -> (map.getOrElse(range, "") + element)))
        case None =>
          regex.findFirstIn(element) match {
            case Some(range) => (Some(range), map)
            case _ => (None, map)
          }
        case _ => (taking, map)
      }
  }
  println(map)
  map
}