Scala Spark count regex matches in a file - regex

I am learning Spark+Scala and I am stuck with this problem. I have one file that contains many sentences, and another file with a large number of regular expressions. Both files have one element per line.
What I want is to count how many times each regex matches across the whole sentences file. For example, if the sentences file (after becoming an array or list) were ["hello world and hello life", "hello i m fine", "what is your name"], and the regex file were ["hello \\w+", "what \\w+ your", ...], then I would like the output to be something like: [("hello \\w+", 3), ("what \\w+ your", 1), ...]
My code is like this:
object PatternCount_v2 {
def main(args: Array[String]) {
// The text where we will find the patterns
val inputFile = args(0);
// The list of patterns
val inputPatterns = args(1)
val outputPath = args(2);
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
// Load the text file
val textFile = sc.textFile(inputFile).cache()
// Load the patterns
val patterns = Source.fromFile(inputPatterns).getLines.map(line => line.r).toList
val patternCounts = textFile.flatMap(line => {
println(line)
patterns.foreach(
pattern => {
println(pattern)
(pattern,pattern.findAllIn(line).length )
}
)
}
)
patternCounts.saveAsTextFile(outputPath)
}}
But the compiler complains:
If I change the flatMap to just map the code runs but returns a bunch of empty tuples () () () ()
Please help! This is driving me crazy.
Thanks,

As far as I can see, there are two issues here:
You should use map instead of foreach: foreach returns Unit. It performs an action with a potential side effect on each element of a collection and does not return a new collection. map, on the other hand, transforms a collection into a new one by applying the supplied function to each element.
You're missing the part where you aggregate the results of flatMap to get the actual count per "key" (pattern). This can be done easily with reduceByKey
Altogether - this does what you need:
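// Key by the pattern's source string (pattern.regex): Regex does not override equals/hashCode,
// so Regex instances used as keys may not be merged by reduceByKey
val patternCounts = textFile
.flatMap(line => patterns.map(pattern => (pattern.regex, pattern.findAllIn(line).length)))
.reduceByKey(_ + _)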
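To check the flatMap-then-sum logic without a cluster, the same computation can be sketched on plain Scala collections. This is a minimal sketch, not the Spark job itself: PatternCountSketch and countMatches are hypothetical names, groupBy plus sum stands in for reduceByKey, and the results are keyed by each pattern's source string.

```scala
import scala.util.matching.Regex

object PatternCountSketch {
  // For every (line, pattern) pair emit (pattern source, number of matches in that line),
  // then sum the counts per pattern. groupBy + sum plays the role of reduceByKey here.
  def countMatches(sentences: Seq[String], patterns: Seq[Regex]): Map[String, Int] =
    sentences
      .flatMap(line => patterns.map(p => (p.regex, p.findAllIn(line).size)))
      .groupBy(_._1)
      .map { case (regex, pairs) => regex -> pairs.map(_._2).sum }

  // Sample data taken from the question.
  val sentences = Seq("hello world and hello life", "hello i m fine", "what is your name")
  val patterns = Seq("hello \\w+".r, "what \\w+ your".r)
}
```

Calling PatternCountSketch.countMatches(PatternCountSketch.sentences, PatternCountSketch.patterns) yields one entry per pattern, which is the shape of output the question asks for.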

Related

Linq get element from string list and a position of a char in this list

I want to get an element from a list of strings, and the position of a character in that element, using LINQ.
Example:
List<string> lines = new List<string> { "TOTO=1", "TATA=2", "TUTU=3" };
I want to extract the value 1 from the TOTO entry in the list.
Here is the beginning of my code:
var value= lines.ToList().Single(x =>x.Contains("TOTO=")).ToString().Trim();
How do I continue this code to extract 1?
Add this (the range syntax requires C# 8.0 or later):
value = value[(value.LastIndexOf('=') + 1)..];
Using LINQ you can do this:
List<string> lines = new List<string> { "TOTO=1", "TATA=2", "TUTU=3" };
int value = lines
.Select(line => line.Split('='))
.Where(parts => parts[0] == "TOTO")
.Select(parts => int.Parse(parts[1]))
.Single();
If you always expect each item in that list to be in the proper format then this should work, otherwise you'd need to add some validation.
Similar to what @jtate proposed; some minor enhancements can help.
int value = lines
.Select(line => line.Split(new []{ '=' }, StringSplitOptions.RemoveEmptyEntries))
.Where(parts => string.Equals(parts[0], "TOTO", StringComparison.InvariantCultureIgnoreCase))
.Select(parts => int.Parse(parts[1]))
.SingleOrDefault();
SingleOrDefault - if no element matches your constraints, Single() would throw an exception. Here, SingleOrDefault would return 0 instead.
String.Equals - with StringComparison.InvariantCultureIgnoreCase, the comparison ignores case and does not depend on the current culture.
StringSplitOptions.RemoveEmptyEntries - drops empty substrings from the split result, so they are not carried into the later steps.
Also consider whether you need int.TryParse instead of int.Parse. All these checks help cover edge cases in production.

matching new line in Scala regex, when reading from file

For processing a file with SQL statements such as:
ALTER TABLE ONLY the_schema.the_big_table
ADD CONSTRAINT the_schema_the_big_table_pkey PRIMARY KEY (the_id);
I am using the regex:
val primaryKeyConstraintNameCatchingRegex: Regex = "([a-z]|_)+\\.([a-z]|_)+\n\\s*(ADD CONSTRAINT)\\s*([a-z]|_)+\\s*PRIMARY KEY".r
Now the problem is that this regex does not return any results, despite the fact that both of the regexes
val alterTableRegex = "ALTER TABLE ONLY\\s+([a-z]|_)+\\.([a-z]|_)+".r
and
val addConstraintRegex = "ADD CONSTRAINT\\s*([a-z]|_)+\\s*PRIMARY KEY".r
match the intended sequences.
I thought the problem could be with the new line, and, so far, I have tried writing \\s+, \\W+, \\s*, \\W*, \\n*, \n*, \n+, \r+, \r*, \r\\s*, \n*\\s*, \\s*\n*\\s*, and other combinations to match the white space between the table name and add constraint to no avail.
I would appreciate any help with this.
Edit
This is the code I am using:
import scala.util.matching.Regex
import java.io.File
import scala.io.Source
object Hello extends Greeting with App {
val primaryKeyConstraintNameCatchingRegex: Regex = "([a-z]|_)+\\.([a-z]|_)+\r\\s*(ADD CONSTRAINT)\\s*([a-z]|_)+\\s*PRIMARY KEY".r
readFile
def readFile: Unit = {
val fname = "dump.sql"
val fSource = Source.fromFile(fname)
for (line <- fSource.getLines) {
val matchExp = primaryKeyConstraintNameCatchingRegex.findAllIn(line).foreach(
segment => println(segment)
)
}
fSource.close()
}
}
Edit 2
Another strange behavior is that when matching with
"""[a-z_]+(\.[a-z_]+)\s*A""".r
the matches happen and they include A, but when I use
"""[a-z_]+(\.[a-z_]+)\s*ADD""".r
which differs only in the trailing DD, no sequence is matched.
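Your problem is that you read the file line by line (see the for (line <- fSource.getLines) loop). getLines yields each line without its line terminator, so a pattern that spans a line break can never match.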
You need to grab the contents as a single string to be able to match across line breaks.
val fSource = Source.fromFile(fname).mkString
val matchExps = primaryKeyConstraintNameCatchingRegex.findAllIn(fSource)
Now, fSource will contain the whole text file contents as one string and matchExps will contain all found matches.
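The difference between the two approaches can be reproduced without a file. The following is a minimal sketch under stated assumptions: the sql string stands in for the contents of dump.sql, the regex is a simplified variant of the one in the question (character classes instead of alternations, \s+ for the whitespace), and MultiLineMatchSketch is a hypothetical name.

```scala
object MultiLineMatchSketch {
  // \s+ also matches a line break, but only if the break is still present in the input.
  val pkRegex = """[a-z_]+\.[a-z_]+\s+ADD CONSTRAINT\s+[a-z_]+\s+PRIMARY KEY""".r

  // Stand-in for the contents of dump.sql.
  val sql: String =
    "ALTER TABLE ONLY the_schema.the_big_table\n" +
    "    ADD CONSTRAINT the_schema_the_big_table_pkey PRIMARY KEY (the_id);\n"

  // Matching against the whole text at once: the statement is found.
  val wholeTextMatches: List[String] = pkRegex.findAllIn(sql).toList

  // Matching one line at a time (the effect of iterating over getLines): nothing is found.
  val perLineMatches: List[String] =
    sql.split("\n").toList.flatMap(line => pkRegex.findAllIn(line).toList)
}
```

Here wholeTextMatches holds the single statement that spans both lines, while perLineMatches stays empty because neither line contains the full pattern on its own.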

How to use match with regular expressions in Scala

I am starting to learn Scala and want to use regular expressions to match a character from a string so I can populate a mutable map of characters and their value (String values, numbers etc) and then print the result.
I have looked at several answers on SO and gone over the Scala Docs but can't seem to get this right. I have a short Lexer class that currently looks like this:
class Lexer {
private val tokens: mutable.Map[String, Any] = collection.mutable.Map()
private def checkCharacter(char: Character): Unit = {
val Operator = "[-+*/^%=()]".r
val Digit = "[\\d]".r
val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => tokens(c) = "Operator"
case Digit(c) => tokens(c) = Integer.parseInt(c)
case Other(c) => tokens(c) = "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
for (s <- inputArray)
checkCharacter(s)
for((key, value) <- tokens)
println(key + ": " + value)
}
}
I'm pretty confused by the sort of strange method syntax, Operator(c), that I have seen being used to handle the value to match and am also unsure if this is the correct way to use regex in Scala. I think what I want this code to do is clear, I'd really appreciate some help understanding this. If more info is needed I will supply what I can
This official doc has lots of examples: https://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html. What might be confusing is the type of the regular expression and its use in pattern matching...
You can construct a regex from any string by using .r:
scala> val regex = "(something)".r
regex: scala.util.matching.Regex = (something)
Your regex becomes an object that has a few useful methods to be able to find matching groups like findAllIn.
In Scala it's idiomatic to use pattern matching for safe extraction of values, thus Regex class also has unapplySeq method to support pattern matching. This makes it an extractor object. You can use it directly (not common):
scala> regex.unapplySeq("something")
res1: Option[List[String]] = Some(List(something))
or you can let Scala compiler call it for you when you do pattern matching:
scala> "something" match {
| case regex(x) => x
| case _ => ???
| }
res2: String = something
You might ask why exactly this return type on unapply/unapplySeq. The doc explains it very well:
The return type of an unapply should be chosen as follows:
If it is just a test, return a Boolean. For instance case even().
If it returns a single sub-value of type T, return an Option[T].
If you want to return several sub-values T1,...,Tn, group them in an optional tuple Option[(T1,...,Tn)].
Sometimes, the number of values to extract isn’t fixed and we would like to return an arbitrary number of values, depending on the input. For this use case, you can define extractors with an unapplySeq method which returns an Option[Seq[T]]. Common examples of these patterns include deconstructing a List using case List(x, y, z) => and decomposing a String using a regular expression Regex, such as case r(name, remainingFields @ _*) =>
In short your regex might match one or more groups, thus you need to return a list/seq. It has to be wrapped in an Option to comply with extractor contract.
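How the number of capturing groups relates to the number of variables in the pattern can be sketched as follows. This is an illustrative example only: ExtractorSketch, the three regexes, and the helper names are hypothetical and not part of the question's Lexer.

```scala
object ExtractorSketch {
  val NoGroup = "[-+*/]".r                      // no capturing group
  val OneGroup = "([-+*/])".r                   // one capturing group
  val Date = """(\d{4})-(\d{2})-(\d{2})""".r    // three capturing groups

  // unapplySeq returns one list element per capturing group.
  val noGroupResult: Option[List[String]] = NoGroup.unapplySeq("+")
  val dateResult: Option[List[String]] = Date.unapplySeq("2024-01-15")

  // The number of variables in the pattern must equal the number of groups.
  def describe(s: String): String = s match {
    case Date(year, month, day) => s"date: $year/$month/$day"
    case OneGroup(op)           => s"operator: $op"
    case _                      => "other"
  }

  // With zero groups the pattern binds zero variables: NoGroup() matches, NoGroup(c) would not.
  def isOperator(s: String): Boolean = s match {
    case NoGroup() => true
    case _         => false
  }
}
```

A regex without groups still works as a pure test, as in isOperator, but any value you want to bind needs its own pair of parentheses in the pattern.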
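The way you are using regex is correct in principle, but note that a pattern such as case Operator(c) binds c to a capturing group, so each regex needs one group in parentheses; without it the case does not match. I would also map your function over the input array to avoid creating mutable maps. Perhaps something like this: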
class Lexer {
private def getCharacterType(char: Character): Any = {
val Operator = "([-+*/^%=()])".r
val Digit = "([\\d])".r
//val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => "Operator"
case Digit(c) => Integer.parseInt(c)
case _ => "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
val tokens = inputArray.map(x => x -> getCharacterType(x))
for((key, value) <- tokens)
println(key + ": " + value)
}
}
scala> val l = new Lexer()
l: Lexer = Lexer@60f662bd
scala> l.lex("a-1")
a: Other
-: Operator
1: 1

Spark: Add Regex column into Row

I am writing a spark job which iterates through dataset and finds matches, here's what the pseudo code looks like:
def map(data: Dataset[Row], queries: Array[Row]): Dataset[Row] = {
import spark.implicits._
val val1 = data
.flatMap(r => {
val text = r.getAs[String]("text");
queries.filter(t => t.getAs[String]("query").r.findFirstIn(message).text)
.map(..//mapping)
}).toDF(..columns);
}
So, it iterates through the data and performs regex matching. The issue is that it converts the string into a regex (t.getAs[String]("query").r) every time, and I am trying to move that conversion outside the loop, as repeating it is not really needed.
So, I tried this (where queries array is generated):
val convertToRegex = udf[Regex, String]((arg:String) => if(arg != null) arg.r else null)
queries.withColumn("queryR", convertToRegex(col("query"))) //queries is DataFrame here
However, as expected, it threw an error saying (Schema for type scala.util.matching.Regex is not supported).
Is there any way I can add a Regex column into an array or create a temp column before starting the iteration?
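One possible direction, sketched here on plain Scala collections rather than on a Dataset, is to compile the patterns once into an ordinary local collection before the iteration instead of storing them in a column. Everything named below (PrecompiledQueriesSketch, Query, compile, matches) is hypothetical. scala.util.matching.Regex extends Serializable, so a precompiled list like this could in principle be captured by a Spark closure, but that part is an assumption not tested here.

```scala
import scala.util.matching.Regex

object PrecompiledQueriesSketch {
  // Hypothetical stand-in for a query row: an id plus the pattern text.
  final case class Query(id: Int, pattern: String)

  // Compile every pattern exactly once, before iterating over the data.
  def compile(queries: Seq[Query]): Seq[(Int, Regex)] =
    queries.map(q => (q.id, q.pattern.r))

  // For each text, emit (text, query id) for every precompiled regex that matches somewhere in it.
  def matches(texts: Seq[String], compiled: Seq[(Int, Regex)]): Seq[(String, Int)] =
    texts.flatMap { text =>
      compiled.collect { case (id, regex) if regex.findFirstIn(text).isDefined => (text, id) }
    }
}
```

The design choice is simply to keep the compiled Regex values outside the schema: only the pattern strings need to live in rows, which sidesteps the unsupported-type error.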

How does regex capturing work in scala?

Here is an example:
object RegexTest {
def main (args: Array[String]): Unit = {
val input = "Enjoy this apple 3.14 times"
val pattern = """.* apple ([\d.]+) times""".r
val pattern(amountText) = input
val amount = amountText.toDouble
println(amount)
}
}
I understand what this does, but how does val pattern(amountText) = input actually work? It looks very weird to me.
What that line is doing is calling Regex.unapplySeq (which is also called an extractor) to deconstruct input into a list of captured groups, and then bind each group to a new variable. In this particular scenario, only one group is expected to be captured and bound to the value amountText.
Validation aside, this is kinda what's going on behind the scenes:
// unapplySeq returns an Option[List[String]]; .get is used only to keep this sketch short
val capturedGroups = pattern.unapplySeq(input).get
val amountText = capturedGroups(0)
// And this:
val pattern(a, b, c) = input
// Would be roughly equivalent to this:
val groups = pattern.unapplySeq(input).get
val a = groups(0)
val b = groups(1)
val c = groups(2)
It is very similar in essence to extracting tuples:
val (a, b) = (2, 3)
Or even pattern matching:
(2,3) match {
case (a, b) =>
}
In both of these cases, Tuple2.unapply is conceptually being called.
I suggest you have a look at this page: http://docs.scala-lang.org/tutorials/tour/extractor-objects.html. It is the official tutorial on extractors, which is the pattern you are looking for.
I find that looking at the source makes it clear how it works: https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/matching/Regex.scala#L243
Then, note that your code val pattern(amountText) = input works, but only if you are sure that the input matches the regex; if it does not, a scala.MatchError is thrown at runtime.
Otherwise, I recommend you write it this way:
input match {
case pattern(amountText) => ...
case _ => ...
}
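Filling in the two branches, a safe version of the extraction might look like the sketch below. SafeCapture and amount are hypothetical names; returning an Option[Double] is one possible choice for the "no match" case, not the only one.

```scala
import scala.util.Try

object SafeCapture {
  private val Pattern = """.* apple ([\d.]+) times""".r

  // Returns the captured amount, or None when the input does not match
  // or the captured text is not a valid number (e.g. "1.2.3").
  def amount(input: String): Option[Double] = input match {
    case Pattern(amountText) => Try(amountText.toDouble).toOption
    case _                   => None
  }
}
```

With this shape, SafeCapture.amount("Enjoy this apple 3.14 times") gives a Some holding the parsed number, and an input that does not match gives None instead of a MatchError.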