How to extract where clause as array in spark sql? - regex

I am trying to extract the where clause from a SQL query.
Multiple conditions in the where clause should be returned as an array. Please help me.
Sample Input String:
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
Output Expected:
Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")
Thanks in advance.
EDIT:
My question here is that I would like to split all the conditions into separate items. Let's say my query is
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
The output I'm expecting is like
List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")
The thing is, the query may have multiple levels of nested conditions, like
select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
In the output, each condition should be a separate item:
List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")

I wouldn't use regex for this. Here's an alternative way to extract your conditions, based on Catalyst's logical plan:
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Predicate}
import org.apache.spark.sql.catalyst.plans.logical.Filter

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect { case f: Filter =>
  f.condition.productIterator.flatMap {
    case And(l, r)    => Seq(l, r)
    case o: Predicate => Seq(o)
  }
}.toList.flatten
println(predicates)
Output:
List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))
Here the predicates are still Expressions and hold information (tree representation).
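For reference, here is one way the df used above could be built from the sample query. This is only a sketch: it registers a dummy temp view so that spark.sql can parse and analyse the statement; the view name table comes straight from the question (back-tick or rename it if your parser complains), and the dummy rows are irrelevant.
import spark.implicits._

// Dummy data, only so that "table" resolves; its contents don't matter here.
Seq((1, 5), (2, 300)).toDF("col1", "col2").createOrReplaceTempView("table")

val df = spark.sql(
  """select * from table
    |where col1=1
    |  and (col2 between 1 and 10 or col2 between 190 and 200)
    |  and col2 is not null""".stripMargin)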
EDIT:
As asked in the comments, here's a String (user-friendly, I hope) representation of the predicates :)
import org.apache.spark.sql.catalyst.expressions.{EqualTo, GreaterThanOrEqual, IsNotNull, LessThanOrEqual, Or}

val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect { case f: Filter =>
  f.condition.productIterator.flatMap {
    case o: Predicate => Seq(o)
  }
}.toList.flatten

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match {
    case And(l, r) => (l, r) match {
      case (gte: GreaterThanOrEqual, lte: LessThanOrEqual) =>
        Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_, _) => Seq(l, r).flatMap(stringifyExpressions)
    }
    case Or(l, r)       => Seq(Seq(l, r).flatMap(stringifyExpressions).mkString("(", ") OR (", ")"))
    case eq: EqualTo    => Seq(s"${eq.left.toString} = ${eq.right.toString}")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate   => Seq(p.toString)
  }
}

val stringRepresentation = predicates.flatMap(stringifyExpressions)
println(stringRepresentation)
New output:
List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)
You can keep playing with the recursive stringifyExpressions method if you want to customize the output.
EDIT 2: In response to your own edit:
You can change the Or / EqualTo cases to the following:
import org.apache.spark.sql.types.StringType

def stringifyExpressions(expression: Expression): Seq[String] = {
  expression match {
    case And(l, r) => (l, r) match {
      case (gte: GreaterThanOrEqual, lte: LessThanOrEqual) =>
        Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
      case (_, _) => Seq(l, r).flatMap(stringifyExpressions)
    }
    case Or(l, r) => Seq(l, r).flatMap(stringifyExpressions)
    case EqualTo(l, r) =>
      val prettyLeft  = if (l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
      val prettyRight = if (r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
      Seq(s"$prettyLeft=$prettyRight")
    case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
    case p: Predicate   => Seq(p.toString)
  }
}
This gives the 4-element List:
List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)
For the second example:
select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
You'd get this output (List[String] with 5 elements):
List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')
Additional note: if you want to print the attribute names without the leading quote, you can handle it by printing this instead of toString:
node.asInstanceOf[UnresolvedAttribute].name
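For example, a minimal sketch of a small helper you could call from stringifyExpressions (prettyName is an illustrative name, not from the original code):
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Expression

// Renders "col1" instead of "'col1" for not-yet-resolved attributes.
def prettyName(e: Expression): String = e match {
  case ua: UnresolvedAttribute => ua.name
  case other                   => other.toString
}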

Related

scala spark reduce list in groupby

I have spark DataFrame with two columns
colA colB
1 3
1 2
2 4
2 5
2 1
I want to groupBy colA and iterate over colB list for each group such that:
res = 0
for i in collect_list(col("colB")):
res += i * (3+res)
returned value shall be res
so I get:
colA colB
1 24
2 78
How can I do this in Scala?
You can achieve the result you want with the following:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 3), (1, 2), (2, 4), (2, 5), (2, 1)).toDF("colA", "colB")

val retDf = df
  .groupBy("colA")
  .agg(
    aggregate(
      collect_list("colB"), lit(0), (acc, nxt) => nxt * (acc + 3)
    ) as "colB")
You need to be very careful with this, however, as data in Spark is distributed. If the data has been shuffled since being read into Spark, there is no guarantee that it will be collected in the same order. In the toy example collect_list("colB") will return Seq(3, 2) where colA is 1. If there had been a shuffle at an earlier stage, however, collect_list could just as well return Seq(2, 3), which would give you 27 instead of the desired 24. You need to attach some metadata to your rows that lets you ensure you are processing them in the order you expect, for example with the monotonically_increasing_id function.
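Here's a minimal sketch of that idea. It assumes Spark 3.x (for the aggregate, transform and sort_array higher-order functions) and that the row order at the moment the id is assigned is the order you want:
import org.apache.spark.sql.functions._

// Tag each row with an increasing id before any shuffle happens.
val withId = df.withColumn("rowId", monotonically_increasing_id())

val retDfOrdered = withId
  .groupBy("colA")
  .agg(
    aggregate(
      // collect (rowId, colB) structs, sort them by rowId, keep only colB
      transform(
        sort_array(collect_list(struct(col("rowId"), col("colB")))),
        s => s.getField("colB")
      ),
      lit(0),
      (acc, nxt) => nxt * (acc + 3)
    ) as "colB")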
RDD approach with no loss of ordering.
%scala
val rdd1 = spark.sparkContext.parallelize(Seq((1,3), (1,2), (2,4), (2,5), (2,1)))
  .zipWithIndex()
  .map(x => (x._1._1, (x._1._2, x._2)))
val rdd2 = rdd1.groupByKey
// Convert to Array.
val rdd3 = rdd2.map(x => (x._1, x._2.toArray))
val rdd4 = rdd3.map(x => (x._1, x._2.sortBy(_._2)))
val rdd5 = rdd4.mapValues(v => v.map(_._1))
rdd5.collect()
val res = rdd5.map(x => (x._1, x._2.fold(0)((acc, nxt) => nxt * (acc + 3) )))
res.collect()
returns:
res201: Array[(Int, Int)] = Array((1,24), (2,78))
Convert from and to a DataFrame as required.
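For example, a sketch of going back to a DataFrame (assumes a SparkSession named spark is in scope for the implicits):
import spark.implicits._

// RDD[(Int, Int)] -> DataFrame
val resDf = res.toDF("colA", "colB")
resDf.show()

// DataFrame -> RDD[(Int, Int)], if you need to go the other way
val backToRdd = resDf.rdd.map(r => (r.getInt(0), r.getInt(1)))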

spark scala pattern matching on a dataframe column

I am coming from an R background. I was able to implement the pattern search on a DataFrame column in R, but I am now struggling to do it in Spark Scala. Any help would be appreciated.
The problem statement is broken down into details just to describe it appropriately.
DF:
Case Freq
135322 265
183201,135322 36
135322,135322 18
135322,121200 11
121200,135322 8
112107,112107 7
183201,135322,135322 4
112107,135322,183201,121200,80000 2
I am looking for a pattern-search UDF which gives me back all the matches of the pattern and the corresponding Freq values from the second column.
Example: for pattern 135322, I would like to find all the matches in the first column Case. It should return the corresponding Freq numbers from the Freq column,
like 265, 36, 18, 11, 8, 4, 2.
For pattern 112107,112107 it should return just 7 because there is only one matching row.
This is how the end result should look
Case Freq results
135322 265 265+36+18+11+8+4+2
183201,135322 36 36+4+2
135322,135322 18 18+4
135322,121200 11 11+2
121200,135322 8 8+2
112107,112107 7 7
183201,135322,135322 4 4
112107,135322,183201,121200,80000 2 2
What I tried so far:
val text = DF.select("case").collect().map(_.getString(0)).mkString("|")

// search function for pattern search
val valsum = udf((txt: String, pattern: String) => {
  txt.split("\\|").count(_.contains(pattern))
})

// apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum(lit(text), DF("case")))
This one works
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._

object playground extends App {

  import org.apache.spark.sql.functions._

  val pattern = "135322,121200" // Pattern you want to search for

  // udf declaration
  val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) => {
    var result = true
    val splitPattern = pattern.split(",")
    val splitCaseCol = caseCol.split(",")
    var foundAtIndex = -1
    for (i <- 0 to splitPattern.length - 1) {
      breakable {
        for (j <- 0 to splitCaseCol.length - 1) {
          if (j > foundAtIndex) {
            println(splitCaseCol(j))
            if (splitCaseCol(j) == splitPattern(i)) {
              result = true
              foundAtIndex = j
              break
            } else result = false
          } else result = false
        }
      }
    }
    println(caseCol, result)
    result
  }

  // registering the udf
  val udfFilter = udf(coder)

  // reading the input file
  val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")

  // calling the function and aggregating
  df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern", "sum").show
}
if input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
if input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+
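If you prefer to avoid the mutable flags and breakable loops, the same ordered-match check can be written functionally. This is just a sketch (containsInOrder and udfFilterFn are illustrative names, not part of the answer above):
import org.apache.spark.sql.functions.udf

// True if every token of `pattern` occurs in `caseCol` in the same order.
def containsInOrder(caseCol: String, pattern: String): Boolean = {
  val caseTokens = caseCol.split(",")
  pattern.split(",").foldLeft(0) { (from, token) =>
    if (from < 0) -1                             // already failed
    else {
      val idx = caseTokens.indexOf(token, from)  // search after the last match
      if (idx < 0) -1 else idx + 1
    }
  } >= 0
}

val udfFilterFn = udf(containsInOrder _)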

doctrine2 get rows from mapping table

I am trying to get the associated ids from the mapping table.
mapping_table
id service_receiver_id service_provider_id
1 1 2
2 4 1
How can I write the Doctrine query for retrieving what id 1 is mapped to?
I need the results like:
associated_with
2
4
In my case I am using this query:
$qb->select('om', 'o', 'ot')
   ->from('Organization\Entity\OrgMapping', 'om')
   ->leftJoin('om.organization', 'o')
   ->where('om.organization = :hspId')
   ->setParameter('hspId', $hspId);
The above query returns only one side of the mapping.
//List of Associates that is Mapped Already
$organizations = $this->orgMappingRepository->listAssociatesByHSPId($hspId, $mapType, $filterBy, $searchBy, $pageNo, $paginationArr);
$mappedAssociates = array();
foreach ($organizations as $org) {
$mappedAssociates[$org->getServiceProvider()->getId()] = array(
'id' => $org->getServiceProvider()->getId(),
'name' => $org->getServiceProvider()->getName(),
'orgType' => $org->getServiceProvider()->getOrgType()->getName(),
'logo' => $org->getServiceProvider()->getLogo(),
'cityName' => $org->getServiceProvider()->getCity() ? $org->getServiceProvider()->getCity()->getName() : null,
'areaName' => $org->getServiceProvider()->getArea() ? $org->getServiceProvider()->getArea()->getName() : null,
'zipcode' => $org->getServiceProvider()->getZipcode(),
);
}
The result I am getting is:
id service_receiver_id service_provider_id
1 1 2

Scala creating a list with some data at specific indices and 0 at all the rest indices

I have a list named positiveDays which contains the values (2,4,6), and I want to create a list DaysDetails that has a value of 1 at all the positiveDays indices and 0 at all the other indices.
Example -
positiveDays(2,4,6)
O/p List -> DaysDetails(0,0,1,0,1,0,1)
Can anyone suggest a way to do that without using var?
You can put your "special" day numbers into a list and then map over all week days, checking whether each one is a "special" day.
val positiveDays = List(2, 4, 6)

(0 to 6) map { i =>
  if (positiveDays.contains(i)) 1
  else 0
}
res1: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 0, 1, 0, 1, 0, 1)
Of course, if you are only interested in even days, then you can make it like that:
(0 to 6) map { i =>
  if (i % 2 == 0) 1
  else 0
}
And if you want to start your week with Monday, not Sunday, then use 1 to 7 instead of 0 to 6.
This should work (only when positiveDays List is not empty):
val positiveDays = List(2, 4, 6)

List.tabulate(1 + positiveDays.last) {
  pos => if (positiveDays.contains(pos)) 1 else 0
}
To correctly handle the case when positiveDays is empty you could use:
List.tabulate(positiveDays.lastOption.fold(0)(1 + _)) {
  pos => if (positiveDays.contains(pos)) 1 else 0
}
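If you always want a full 7-day week regardless of where the last positive day falls, a fixed-length tabulate is a small variation on the same idea (a sketch):
val positiveDays = List(2, 4, 6)

val daysDetails = List.tabulate(7)(day => if (positiveDays.contains(day)) 1 else 0)
// daysDetails: List(0, 0, 1, 0, 1, 0, 1)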
Lots of ways to skin this particular cat, but I think this one is succinct and quite clear:
val positiveDays = Set(2, 4, 6)

for (i <- 0 to 6)
  yield if (positiveDays(i)) 1 else 0
I believe that using Ints to represent days of the week is wrong. There are 2^32 possible Int values (from Int.MinValue to Int.MaxValue), but only 7 days of the week.
What day of the week does 666 represent? How about -42?
scala> sealed trait DayOfWeek
defined trait DayOfWeek
scala> case object Monday extends DayOfWeek
defined object Monday
scala> case object Tuesday extends DayOfWeek
defined object Tuesday
// I excluded the rest for conciseness
scala> val positiveDays: List[DayOfWeek] = List(Tuesday)
positiveDays: List[DayOfWeek] = List(Tuesday)
scala> val AllDays: List[DayOfWeek] = List(Monday, Tuesday)
AllDays: List[DayOfWeek] = List(Monday, Tuesday)
scala> AllDays.map(d => if (positiveDays.contains(d)) (d, 1) else (d,0) )
res0: List[(DayOfWeek, Int)] = List((Monday,0), (Tuesday,1))
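If you only want the 0/1 list (in AllDays order) from this typed version, you can drop the day from each tuple, e.g.:
val daysDetails: List[Int] = AllDays.map(d => if (positiveDays.contains(d)) 1 else 0)
// daysDetails: List(0, 1)   (Monday -> 0, Tuesday -> 1 for the truncated AllDays above)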

Computing all values or stopping and returning just the best value if found

I have a list of items, and for each item I am computing a value. Computing this value is fairly computationally intensive, so I want to minimise the number of computations as much as possible.
The algorithm I need to implement is this:
1. I have a value X.
2. For each item:
   a. compute the value for it; if it is < 0, ignore it completely
   b. if (value > 0) && (value < X), return the pair (item, value)
3. Return all (item, value) pairs (that have value > 0) in a List, ideally sorted by value.
To make it a bit clearer, step 3 only happens if none of the items have a value less than X. In step 2, when we encounter the first item that is less than X we should not compute the rest and just return that item (we can obviously return it in a Set() by itself to match the return type).
The code I have at the moment is as follows:
val itemValMap = items.foldLeft(Map[Item, Int]()) {
  (map: Map[Item, Int], key: Item) =>
    val value = computeValue(key)
    if (value >= 0) // we filter out negative ones
      map + (key -> value)
    else
      map
}
val bestItem = itemValMap.minBy(_._2)
if (bestItem._2 < bestX) {
  List(bestItem)
} else {
  itemValMap.toList.sortBy(_._2)
}
However, what this code is doing is computing all the values in the list and choosing the best one, rather than stopping as a 'better' one is found. I suspect I have to use Streams in some way to achieve this?
OK, I'm not sure what your whole setup looks like, but I tried to prepare a minimal example that mirrors your situation.
Here it is then:
object StreamTest {

  case class Item(value: Int)

  def createItems() = List(Item(0), Item(3), Item(30), Item(8), Item(8), Item(4), Item(54), Item(-1), Item(23), Item(131))

  def computeValue(i: Item) = { Thread.sleep(3000); i.value * 2 - 2 }

  def process(minValue: Int)(items: Seq[Item]) = {
    val stream = Stream(items: _*).map(item => item -> computeValue(item)).filter(tuple => tuple._2 >= 0)
    stream.find(tuple => tuple._2 < minValue).map(List(_)).getOrElse(stream.sortBy(_._2).toList)
  }
}
Each calculation takes 3 seconds. Now let's see how it works:
val items = StreamTest.createItems()
val result = StreamTest.process(2)(items)
result.foreach(r => println("Original: " + r._1 + " , calculated: " + r._2))
Gives:
[info] Running Main
Original: Item(3) , calculated: 4
Original: Item(4) , calculated: 6
Original: Item(8) , calculated: 14
Original: Item(8) , calculated: 14
Original: Item(23) , calculated: 44
Original: Item(30) , calculated: 58
Original: Item(54) , calculated: 106
Original: Item(131) , calculated: 260
[success] Total time: 31 s, completed 2013-11-21 15:57:54
Since there's no value smaller than 2, we got a list ordered by the calculated value. Notice that two pairs are missing, because calculated values are smaller than 0 and got filtered out.
OK, now let's try with a different minimum cut-off point:
val result = StreamTest.process(5)(items)
Which gives:
[info] Running Main
Original: Item(3) , calculated: 4
[success] Total time: 7 s, completed 2013-11-21 15:55:20
Good, it returned a list with only one item: the first value (the second item in the original list) that was smaller than the 'minimal' value and not smaller than 0.
I hope that the example above is easily adaptable to your needs...
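Side note: Stream is deprecated since Scala 2.13; the same approach works with LazyList, which also memoizes already-computed values. A minimal sketch, assuming the Item and computeValue definitions above (processLazy is just an illustrative name):
def processLazy(minValue: Int)(items: Seq[Item]): List[(Item, Int)] = {
  val pairs = items.to(LazyList)
    .map(item => item -> computeValue(item))
    .filter(_._2 >= 0)
  pairs.find(_._2 < minValue)
    .map(List(_))
    .getOrElse(pairs.sortBy(_._2).toList)
}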
A simple way to avoid the computation of unneeded values is to make your collection lazy by using the view method:
val weigthedItems = items.view.map{ i => i -> computeValue(i) }.filter(_._2 >= 0 )
weigthedItems.find(_._2 < X).map(List(_)).getOrElse(weigthedItems.sortBy(_._2))
For example, here is a test in the REPL:
scala> :paste
// Entering paste mode (ctrl-D to finish)
type Item = String
def computeValue( item: Item ): Int = {
println("Computing " + item)
item.toInt
}
val items = List[Item]("13", "1", "5", "-7", "12", "3", "-1", "15")
val X = 10
val weigthedItems = items.view.map{ i => i -> computeValue(i) }.filter(_._2 >= 0 )
weigthedItems.find(_._2 < X).map(List(_)).getOrElse(weigthedItems.sortBy(_._2))
// Exiting paste mode, now interpreting.
Computing 13
Computing 1
defined type alias Item
computeValue: (item: Item)Int
items: List[String] = List(13, 1, 5, -7, 12, 3, -1, 15)
X: Int = 10
weigthedItems: scala.collection.SeqView[(String, Int),Seq[_]] = SeqViewM(...)
res27: Seq[(String, Int)] = List((1,1))
As you can see, computeValue was only called up to the first value < X (that is, up to 1).