Spark: Add Regex column into Row

I am writing a Spark job that iterates through a dataset and finds matches; here's what the pseudocode looks like:
def map(data: Dataset[Row], queries: Array[Row]): Dataset[Row] = {
  import spark.implicits._
  val val1 = data
    .flatMap(r => {
      val text = r.getAs[String]("text")
      queries.filter(t => t.getAs[String]("query").r.findFirstIn(text).isDefined)
        .map(.. /* mapping */)
    })
    .toDF(.. /* columns */)
  val1
}
So, it iterates through the data and performs regex matching. The issue is that it converts the string into a regex (t.getAs[String]("query").r) on every row, and I am trying to hoist that conversion out of the loop, since it only needs to happen once.
So, I tried this (where the queries array is generated):
val convertToRegex = udf[Regex, String]((arg:String) => if(arg != null) arg.r else null)
queries.withColumn("queryR", convertToRegex(col("query"))) //queries is DataFrame here
However, as expected, it threw an error saying (Schema for type scala.util.matching.Regex is not supported).
Is there any way I can add a Regex column to the array, or create a temp column, before starting the iteration?
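Since Spark SQL has no schema type for Regex (hence the UDF error), one workaround is to compile each pattern once on the driver and close over the resulting plain Scala array; Regex is serializable, so it ships to the executors with the closure. A minimal sketch against the pseudocode above, with a (query, text) tuple as a stand-in for the elided mapping:
import scala.util.matching.Regex

// Compile every query string exactly once, up front, instead of once per row.
val compiledQueries: Array[(String, Regex)] =
  queries.map(q => (q.getAs[String]("query"), q.getAs[String]("query").r))

val val1 = data.flatMap { r =>
  val text = r.getAs[String]("text")
  compiledQueries
    .filter { case (_, regex) => regex.findFirstIn(text).isDefined }
    .map { case (query, _) => (query, text) } // stand-in for the elided mapping
}.toDF("query", "text") // stand-in column names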

Linq get element from string list and a position of a char in this list

I want to get an element from a list of strings and the position of a char in that element using LINQ.
Example:
List<string> lines = new List<string> { "TOTO=1", "TATA=2", "TUTU=3" };
I want to extract the value 1 from TOTO in the list.
Here is the beginning of my code:
var value = lines.Single(x => x.Contains("TOTO=")).Trim();
How do I continue this code to extract 1?
Add this:
value = value[(value.LastIndexOf('=') + 1)..];
Using LINQ you can do this:
List<string> lines = new List<string> { "TOTO=1", "TATA=2", "TUTU=3" };
int value = lines
    .Select(line => line.Split('='))
    .Where(parts => parts[0] == "TOTO")
    .Select(parts => int.Parse(parts[1]))
    .Single();
If you always expect each item in that list to be in the proper format then this should work; otherwise you'd need to add some validation.
Similar to what @jtate proposed, some minor enhancements can help:
int value = lines
    .Select(line => line.Split(new[] { '=' }, StringSplitOptions.RemoveEmptyEntries))
    .Where(parts => string.Equals(parts[0], "TOTO", StringComparison.InvariantCultureIgnoreCase))
    .Select(parts => int.Parse(parts[1]))
    .SingleOrDefault();
SingleOrDefault - if no element matches your constraints, Single() would throw an exception; SingleOrDefault instead returns 0 (the default for int).
string.Equals - takes care of upper/lower case and culture-related comparison problems.
StringSplitOptions.RemoveEmptyEntries - avoids some unnecessary iterations and improves performance.
Also consider whether you need int.TryParse instead of int.Parse. All these checks help cover edge cases in production.

matching new line in Scala regex, when reading from file

For processing a file with SQL statements such as:
ALTER TABLE ONLY the_schema.the_big_table
ADD CONSTRAINT the_schema_the_big_table_pkey PRIMARY KEY (the_id);
I am using the regex:
val primaryKeyConstraintNameCatchingRegex: Regex = "([a-z]|_)+\\.([a-z]|_)+\n\\s*(ADD CONSTRAINT)\\s*([a-z]|_)+\\s*PRIMARY KEY".r
Now the problem is that this regex does not return any results, despite the fact that both the regex
val alterTableRegex = "ALTER TABLE ONLY\\s+([a-z]|_)+\\.([a-z]|_)+".r
and
val addConstraintRegex = "ADD CONSTRAINT\\s*([a-z]|_)+\\s*PRIMARY KEY".r
match the intended sequences.
I thought the problem could be with the new line, and, so far, I have tried writing \\s+, \\W+, \\s*, \\W*, \\n*, \n*, \n+, \r+, \r*, \r\\s*, \n*\\s*, \\s*\n*\\s*, and other combinations to match the white space between the table name and add constraint to no avail.
I would appreciate any help with this.
Edit
This is the code I am using:
import scala.util.matching.Regex
import java.io.File
import scala.io.Source
object Hello extends Greeting with App {
  val primaryKeyConstraintNameCatchingRegex: Regex =
    "([a-z]|_)+\\.([a-z]|_)+\r\\s*(ADD CONSTRAINT)\\s*([a-z]|_)+\\s*PRIMARY KEY".r

  readFile

  def readFile: Unit = {
    val fname = "dump.sql"
    val fSource = Source.fromFile(fname)
    for (line <- fSource.getLines) {
      val matchExp = primaryKeyConstraintNameCatchingRegex.findAllIn(line).foreach(
        segment => println(segment)
      )
    }
    fSource.close()
  }
}
Edit 2
Another strange behavior is that when matching with
"""[a-z_]+(\.[a-z_]+)\s*A""".r
the matches happen and they include A, but when I use
"""[a-z_]+(\.[a-z_]+)\s*ADD""".r
which differs only in the DD, no sequence is matched.
Your problem is that you read the file line by line (see the for (line <- fSource.getLines) part of your code).
You need to grab the contents as a single string to be able to match across line breaks.
val fSource = Source.fromFile(fname).mkString
val matchExps = primaryKeyConstraintNameCatchingRegex.findAllIn(fSource)
Now, fSource will contain the whole text file contents as one string and matchExps will contain all found matches.
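For reference, a minimal sketch of the corrected program (dropping the template's Greeting trait, and simplifying the pattern, since \\s+ already matches the line break, so no literal \n or \r is needed):
import scala.io.Source
import scala.util.matching.Regex

object Hello extends App {
  // \s+ matches spaces, tabs, and line breaks alike.
  val primaryKeyConstraintNameCatchingRegex: Regex =
    "([a-z]|_)+\\.([a-z]|_)+\\s+ADD CONSTRAINT\\s+([a-z]|_)+\\s+PRIMARY KEY".r

  val fSource = Source.fromFile("dump.sql")
  try {
    val contents = fSource.mkString // the whole file as one string
    primaryKeyConstraintNameCatchingRegex.findAllIn(contents).foreach(println)
  } finally {
    fSource.close()
  }
}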

Casting regex match to String

I wrote some simple code in Scala that checks whether an input is correctly formatted as HH:mm. I expect the code to result in an Array of valid strings. However, what I'm getting is of type Any = Array(), which is problematic: when I try to print that result I get something like
[Ljava.lang.Object;@32a59591.
I guess it's a simple problem but being a Scala newbie I didn't manage to solve it even after a good few hours of googling and trial & error.
val scheduleHours = if (inputScheduleHours == "") {
  dbutils.notebook.exit(s"ERROR: Missing param value for schedule hours.")
}
else {
  val timePattern = """^((?:[0-30]?[0-9]|2[0-3]):[0-5][0-9])$""".r
  val inputScheduleHoursParsed = inputScheduleHours.split(";").map(_.trim)
  for (e <- inputScheduleHoursParsed) yield e match {
    case timePattern(e) => e.toString
    case _ => dbutils.notebook.exit(s"ERROR: Wrong param value for schedule hours: '${inputScheduleHours}'")
  }
}
The problem is that some branches return the result you want and others return dbutils.notebook.exit which (I think) returns Unit. Scala must pick a type for the result that is compatible with both Unit and Array[String], and Any is the only one that fits.
One solution is to add a compatible value after the calls to dbutils.notebook.exit, e.g.
val scheduleHours = if (inputScheduleHours == "") {
  dbutils.notebook.exit(s"ERROR: Missing param value for schedule hours.")
  Array.empty[String]
}
Then all the branches return Array[String] so that will be the type of the result.
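The wildcard case inside the match has the same problem, since it also ends in a call to dbutils.notebook.exit. A sketch of the whole expression with a type-compatible value after each exit call (the empty string is just a placeholder so every branch yields a String):
val scheduleHours: Array[String] =
  if (inputScheduleHours == "") {
    dbutils.notebook.exit(s"ERROR: Missing param value for schedule hours.")
    Array.empty[String] // unreachable in practice if exit aborts the run, but fixes the type
  } else {
    val timePattern = """^((?:[0-30]?[0-9]|2[0-3]):[0-5][0-9])$""".r
    for (e <- inputScheduleHours.split(";").map(_.trim)) yield e match {
      case timePattern(t) => t
      case _ =>
        dbutils.notebook.exit(s"ERROR: Wrong param value for schedule hours: '${inputScheduleHours}'")
        "" // placeholder so this branch also yields a String
    }
  }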

Scala Spark count regex matches in a file

I am learning Spark+Scala and I am stuck with this problem. I have one file that contains many sentences, and another file with a large number of regular expressions. Both files have one element per line.
What I want is to count how many times each regex matches across the whole sentences file. For example, if the sentences file (after becoming an array or list) were represented by ["hello world and hello life", "hello i m fine", "what is your name"], and the regex file by ["hello \\w+", "what \\w+ your", ...], then I would like the output to be something like: [("hello \\w+", 3), ("what \\w+ your", 1), ...]
My code is like this:
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object PatternCount_v2 {
  def main(args: Array[String]) {
    // The text where we will find the patterns
    val inputFile = args(0)
    // The list of patterns
    val inputPatterns = args(1)
    val outputPath = args(2)

    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Load the text file
    val textFile = sc.textFile(inputFile).cache()
    // Load the patterns
    val patterns = Source.fromFile(inputPatterns).getLines.map(line => line.r).toList

    val patternCounts = textFile.flatMap(line => {
      println(line)
      patterns.foreach(
        pattern => {
          println(pattern)
          (pattern, pattern.findAllIn(line).length)
        }
      )
    })

    patternCounts.saveAsTextFile(outputPath)
  }
}
But the compiler complains with a type mismatch: the function passed to flatMap must return a collection, while the foreach inside returns Unit.
If I change the flatMap to just map, the code runs but returns a bunch of empty tuples: () () () ()
Please help! This is driving me crazy.
Thanks,
As far as I can see, there are two issues here:
You should use map instead of foreach: foreach returns Unit; it performs an action with a potential side effect on each element of a collection but doesn't return a new collection. map, on the other hand, transforms a collection into a new one by applying the supplied function to each element.
You're missing the part where you aggregate the results of flatMap to get the actual count per "key" (pattern). This can be done easily with reduceByKey
Altogether - this does what you need:
val patternCounts = textFile
  .flatMap(line => patterns.map(pattern => (pattern, pattern.findAllIn(line).length)))
  .reduceByKey(_ + _)
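One caveat worth adding (our note, not part of the answer above): scala.util.matching.Regex does not override equals or hashCode, so after the shuffle that reduceByKey performs, deserialized copies of the same pattern object may not compare equal as keys. Keying by the pattern's string form avoids relying on object identity:
val patternCounts = textFile
  .flatMap(line => patterns.map(pattern => (pattern.toString, pattern.findAllIn(line).length)))
  .reduceByKey(_ + _)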

Extract substring based on regex to use in RDD.filter

I am trying to filter out rows of a text file whose second column value begins with words from a list.
I have the list such as:
val mylist = List("Inter", "Intra")
If I have a row like:
Cricket Inter-house
Inter is in the list, so that row should get filtered out by the RDD.filter operation. I am using the following regex:
`[A-Za-z0-9]+`
I tried using """[A-Za-z0-9]+""".r to extract the substring, but the result is a non-empty iterator.
My question is how to access the above result in the filter operation?
You need to construct a regular expression like ".* Inter.*".r, since """[A-Za-z0-9]+""" matches any word. Here is a working example; hope it helps:
val mylist = List("Inter", "Intra")
val textRdd = sc.parallelize(List("Cricket Inter-house", "Cricket Int-house",
  "AAA BBB", "Cricket Intra-house"))
// Map over mylist to construct a regular expression per word and test it against the
// text; reduce with && to make sure none of the patterns matches. Call collect() to
// see the result, or take(5) if you just want the first five results.
textRdd.filter(text => mylist.map(word => !(".* " + word + ".*").r
  .pattern.matcher(text).matches).reduce(_ && _)).collect()
// res1: Array[String] = Array(Cricket Int-house, AAA BBB)
filter keeps everything for which the function passed to it returns true, so to drop matching rows we need a predicate that returns false for them; a bare Regex isn't exactly what you want. Instead, let's develop a function that takes a row, compares it against a candidate string, and returns true if the second column in that row starts with the candidate:
val filterFunction: (String, String) => Boolean =
(row, candidate) => row.split(" ").tail.head.startsWith(candidate)
We can convince ourselves that this works pretty easily using a worksheet:
// Test data
val mylist = List("Inter", "Intra")
val file = List("Cricket Inter-house", "Boom Shakalaka")
filterFunction("Cricket Inter-house", "Inter") // true
filterFunction("Cricket Inter-house", "Intra") // false
filterFunction("Boom Shakalaka", "Inter") // false
filterFunction("Boom Shakalaka", "Intra") // false
Now all that remains is to utilize this function in the filter. Essentially, for every row, we want to test the filter against every line in our candidate list. That means taking the candidate list and 'folding left' to check every item on it against the function. If any candidate reports true, then we know that row should be filtered out of the final result:
val result = file.filter((row: String) => {
  !mylist.foldLeft(false)((x: Boolean, candidate: String) => {
    x || filterFunction(row, candidate)
  })
})
// result: List[String] = List(Boom Shakalaka)
The above can be a little dense to unpack. We are passing to the filter method a function that takes in a row and produces a boolean value. We want that value to be true if and only if the row does not match our criteria. We've already embedded our criteria in the filterFunction: we just need to run it against every combination of item in mylist.
To do this we use foldLeft, which takes a starting value (in this case false) and iteratively moves through the list, updating that starting value and returning the final result.
To 'update' that value, we write a function that logically ORs the accumulated value with the result of running our filter function against the row and the current item in mylist.
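To run this against an RDD as in the original question rather than a local List, the same predicate plugs straight into RDD.filter. A minimal sketch, assuming a live SparkContext sc; note that exists expresses the foldLeft-with-OR more compactly:
val textRdd = sc.parallelize(List("Cricket Inter-house", "Boom Shakalaka"))

// Keep a row only when no candidate matches its second column.
val kept = textRdd.filter(row => !mylist.exists(candidate => filterFunction(row, candidate)))

kept.collect() // Array(Boom Shakalaka)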