spark scala pattern matching on a dataframe column

spark scala pattern matching on a dataframe column - regex

I am coming from R background. I could able to implement the pattern search on a Dataframe col in R. But now struggling to do it in spark scala. Any help would be appreciated
problem statement is broken down into details just to describe it appropriately
DF :
Case Freq
135322 265
183201,135322 36
135322,135322 18
135322,121200 11
121200,135322 8
112107,112107 7
183201,135322,135322 4
112107,135322,183201,121200,80000 2
I am looking for a pattern search UDF, which gives me back all the matches of the pattern and then corresponding Freq value from the second col.
example : for pattern 135322 , i would like to find out all the matches in first col Case.It should return corresponding Freq number from Freq col.
Like 265,36,18,11,8,4,2
for pattern 112107,112107 it should return just 7 because there is one matching pattern.
This is how the end result should look
Case Freq results
135322 265 256+36+18+11+8+4+2
183201,135322 36 36+4+2
135322,135322 18 18+4
135322,121200 11 11+2
121200,135322 8 8+2
112107,112107 7 7
183201,135322,135322 4 4
112107,135322,183201,121200,80000 2 2
what i tried so far:
val text= DF.select("case").collect().map(_.getString(0)).mkString("|")
//search function for pattern search
val valsum = udf((txt: String, pattern : String)=> {
txt.split("\\|").count(_.contains(pattern))
} )
//apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum( lit(text),DF("case")))

This one works
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._
object playground extends App {
import org.apache.spark.sql.functions._
val pattern = "135322,121200" // Pattern you want to search for
// udf declaration
val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) =>
{
var result = true
val splitPattern = pattern.split(",")
val splitCaseCol = caseCol.split(",")
var foundAtIndex = -1
for (i <- 0 to splitPattern.length - 1) {
breakable {
for (j <- 0 to splitCaseCol.length - 1) {
if (j > foundAtIndex) {
println(splitCaseCol(j))
if (splitCaseCol(j) == splitPattern(i)) {
result = true
foundAtIndex = j
break
} else result = false
} else result = false
}
}
}
println(caseCol, result)
(result)
}
// registering the udf
val udfFilter = udf(coder)
//reading the input file
val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")
//calling the function and aggregating
df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern","sum").show
}
if input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
if input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+

Related

How to extract where clause as array in spark sql?

I am trying to extract where clause from SQL query.
Multiple conditions in where clause should be in form array. Please help me.
Sample Input String:
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
Output Expected:
Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")
Thanks in advance.
EDIT:
My question here is like... I would like to split all the conditions as separate items... let's say my query is like
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
The output I'm expecting is like
List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")
The thing is the query may have multiple levels of conditions like
select * from table where col1=1 and (col2 =2 or(col3 between 1 and 10 or col3 is between 190 and 200)) and col4='xyz'
in output each condition should be a separate item
List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")

I wouldn't use Regex for this. Here's an alternative way to extract your conditions based on Catalyst's Logical Plan :
val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
f.condition.productIterator.flatMap{
case And(l,r) => Seq(l,r)
case o:Predicate => Seq(o)
}
}.toList.flatten
println(predicates)
Output :
List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))
Here the predicates are still Expressions and hold information (tree representation).
EDIT :
As asked in comment, here's a String (user friendly I hope) representation of the predicates :)
val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
f.condition.productIterator.flatMap{
case o:Predicate => Seq(o)
}
}.toList.flatten
def stringifyExpressions(expression: Expression): Seq[String] = {
expression match{
case And(l,r) => (l,r) match {
case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
}
case Or(l,r) => Seq(Seq(l,r).flatMap(stringifyExpressions).mkString("(",") OR (", ")"))
case eq: EqualTo => Seq(s"${eq.left.toString} = ${eq.right.toString}")
case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
case p: Predicate => Seq(p.toString)
}
}
val stringRepresentation = predicates.flatMap{stringifyExpressions}
println(stringRepresentation)
New Output :
List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)
You can keep playing with the recursive stringifyExpressions method if you want to customize the output.
EDIT 2 : In response to your own edit :
You can change the Or / EqualTo cases to the following
def stringifyExpressions(expression: Expression): Seq[String] = {
expression match{
case And(l,r) => (l,r) match {
case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
}
case Or(l,r) => Seq(l,r).flatMap(stringifyExpressions)
case EqualTo(l,r) =>
val prettyLeft = if(l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
val prettyRight = if(r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
Seq(s"$prettyLeft=$prettyRight")
case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
case p: Predicate => Seq(p.toString)
}
}
This gives the 4 elements List :
List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)
For the second example :
select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
You'd get this output (List[String] with 5 elements) :
List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')
Additional note: If you want to print the attribute names without the starting quote, you can handle it by printing this instead of toString :
node.asInstanceOf[UnresolvedAttribute].name

Loading csv files with wildcards to match a pattern

I have 24 csv files that contain 0 to 23 in there name example hyper01.csv , hyper02.csv,....,hyper23.csv . But i just want to load files from 08 to 15 using wildcards
currently i am using folder_name/*{08-15} but its not working i am using spark

if you are using scala you could try this:
def getPaths(dir: String): List[String] = {
val file = new File(dir)
file.listFiles.filter(_.isFile)
.filter(_.getName.startsWith("hyper"))
.filter(s=>{
val index = s.getName.replace("hyper", "")
.replaceFirst("0", "")
.replaceAll("\\.(.*)","")
.toInt
index >= 8 && index <= 15
})
.map(_.getPath).toList
}
val filesDirectory = "C:\\Users\\user\\hyper"
getPaths(filesDirectory).foreach(println(_))
val df = spark.read.csv(getPaths(filesDirectory):_*)
output:
C:\Users\user\hyper\hyper08.csv
C:\Users\user\hyper\hyper15.csv

Scala small function can be used for get required files:
def getDirs(root: String, start: Int, end: Int): Seq[String] = {
(start to end).map(v => f"$root/hyper$v%02d.csv")
}
// usage
spark.read.parquet(getDirs("folder_name", 8, 15): _*)

Find starting and ending index of each unique charcters in a string in python

I have a string with characters repeated. My Job is to find starting Index and ending index of each unique characters in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
mo = re.search(item,x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end index of the characters are not correct. I understand why it's happening but how can I pass the character to be matched dynamically to the regex search function. For instance if I hardcode the character in the search function it provides the desired output
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
output:
b 2 5
The above function is providing correct result but here I can't pass the characters to be matched dynamically.
It will be really a help if someone can let me know how to achieve this any hint will also do. Thanks in advance

String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
# for patterns better use raw strings - and format the letter into it
mo = re.search(fr"{item}+",x) # fr and rf work both :) its a raw formatted literal
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n) # fix upper limit by n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
Without regex1:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
if last_ch == ch:
continue
else:
print(last_ch,start_idx, idx-1)
last_ch = ch
start_idx = idx
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
1RegEx: And now you have 2 problems...

Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
end = start + len(item[0])
output += (f"{item[1]} {start} {end}\n")
start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it'll be in the Order of N, you can likely benchmark it though, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
x = "aabbbbccc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
end = start + len(item[0])
output += (f"{item[1]} {start} {end}\n")
start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)

Golang SplitAfter for Regexp type

In golang strings.SplitAfter method split text after an special character into an slice, but I didn't find a way for Regexp type to split text after matches. Is there a way to do that?
Example :
var text string = "1.2.3.4.5.6.7.8.9"
res := strings.Split(text, ".")
fmt.Println(res) // print [1 2 3 4 5 6 7 8 9]
res = strings.SplitAfter(text, ".")
fmt.Println(res) // print [1. 2. 3. 4. 5. 6. 7. 8. 9]

first at all, your regex "." is wrong for splitAfter function. You want number followed by value "." so the regex is: "[1-9]".
The function you are looking might look like this:
func splitAfter(s string, re *regexp.Regexp) (r []string) {
re.ReplaceAllStringFunc(s, func(x string) string {
s = strings.Replace(s,x,"::"+x,-1)
return s
})
for _, x := range strings.Split(s,"::") {
if x != "" {
r = append(r, x)
}
}
return
}
Than:
fmt.Println(splitAfter("healthyRecordsMetric",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("healthyrecordsMETetric",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("HealthyHecord Hetrics",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("healthy records metric",regexp.MustCompile("[A-Z]")))
fmt.Println(splitAfter("1.2.3.4.5.6.7.8.9",regexp.MustCompile("[1-9]")))
[Healthy Records Metric]
[healthy Records Metric]
[healthyrecords M E Tetric]
[Healthy Hecord Hetrics]
[healthy records metric]
[1. 2. 3. 4. 5. 6. 7. 8. 9]
Good luck!

Regexp type itself does not have a method to do that exactly that but it's quite simple to write a function that implements what your asking based on Regexp functionality:
func SplitAfter(s string, re *regexp.Regexp) []string {
var (
r []string
p int
)
is := re.FindAllStringIndex(s, -1)
if is == nil {
return append(r, s)
}
for _, i := range is {
r = append(r, s[p:i[1]])
p = i[1]
}
return append(r, s[p:])
}
Here I left a program to play with it.

Why can't I extract text between patterns with varying text length

I have a column in a data frame with characters that need to be split into columns. My code seems to break when the string in the column has a length of 12 but it works fine when the string has a length of 11.
S99.ABCD{T}
S99.ABCD{V}
S99.ABCD{W}
S99.ABCD{Y}
Q100.ABCD{A}
Q100.ABCD{C}
Q100.ABCD{D}
Q100.ABCD{E}
An example of the ideal format is on the left, what I'm getting is on the right:
ID WILD RES MUT | ID WILD RES MUT
ABCD S 99 T | ABCD S 99 T
... | ...
ABCD Q 100 A | .ABC Q 100 {
... | ...
My current solution is the following:
data <- data.frame(ID = substr(mdata$substitution,
gregexpr(pattern = "\\.",
mdata$substitution)[[1]] + 1,
gregexpr(pattern = "\\{",
mdata$substitution)[[1]] - 1),
WILD = substr(mdata$substitution, 0, 1),
RES = gsub("[^0-9]","", mdata$substitution),
MUT = substr(mdata$substitution,
gregexpr(pattern = "\\{",
mdata$substitution)[[1]] + 1,
gregexpr(pattern = "\\}",
mdata$substitution)[[1]] - 1))
I'm not sure why my code isn't working, I thought using gregexpr I would be able to find where the pattern was in the string to find out the position of characters I want to extract but it doesn't work when the length of the string changes.

Using this example you can do what you want
test=c("S99.ABCD{T}",
"S99.ABCD{V}",
"S99.ABCD{W}",
"S99.ABCD{Y}",
"Q100.ABCD{A}",
"Q100.ABCD{C}",
"Q100.ABCD{D}",
"Q100.ABCD{E}")
library(stringr)
test=str_remove(test,pattern = "\\}")
testdf=as.data.frame(str_split(test,pattern = "\\.|\\{",simplify = T))
testdf$V4substring(testdf$V1, 1, 1)
testdf$V5=substring(testdf$V1, 2)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

spark scala pattern matching on a dataframe column - regex

Related

How to extract where clause as array in spark sql?

Loading csv files with wildcards to match a pattern

Find starting and ending index of each unique charcters in a string in python

Golang SplitAfter for Regexp type

Why can't I extract text between patterns with varying text length

Categories

Resources