Parsing simple query syntax - regex

Let's say I have a query string like this:
#some terms! "phrase query" in:"my container" in:group_3
or
#some terms!
or
in:"my container" in:group_3 terms! "phrase query"
or
in:"my container" test in:group_3 terms!
What is the best way to parse this correctly?
I've looked at Lucene's SimpleQueryParser, but it seems quite complicated for my use case. I'm also trying to parse the query with regexes, but so far without much success, mostly because whitespace is allowed inside quotes.
Any simple ideas?
I just need a list of elements as output; after that it's pretty easy for me to solve the rest of the problem:
[
"#some",
"terms!",
"phrase query",
"in:\"my container\"",
"in:group_3"
]

The following regex matches the text of your output:
(?:\S*"(?:[^"]+)"|\S+)
See the demo
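As a minimal sketch of how that regex could be applied from Scala, the snippet below runs it with findAllIn over the sample queries. The names TokenizeQuery and tokenize are hypothetical and only used for illustration. Note that a quoted phrase keeps its surrounding quotes in the result, so they would still need to be stripped if the bare phrase is wanted.

```scala
object TokenizeQuery {
  // The regex from the answer above: either an optional non-space prefix
  // followed by a double-quoted section, or a plain run of non-space characters.
  private val Token = """(?:\S*"(?:[^"]+)"|\S+)""".r

  // Hypothetical helper: returns every token in query order.
  def tokenize(query: String): List[String] =
    Token.findAllIn(query).toList
}

// Example call:
// TokenizeQuery.tokenize("#some terms! \"phrase query\" in:\"my container\" in:group_3")
// yields: #some | terms! | "phrase query" | in:"my container" | in:group_3
```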

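Just for those interested, here's the final Scala/Java parser I used to solve my problem, inspired by the answers to this question:
// QueryTerm was not shown in the original post; this shape is assumed from how it is constructed below.
case class QueryTerm(prefix: Option[String], text: String)

def testMatcher(query: String): Unit = {
  // Optional alphabetic prefix followed by a colon, e.g. "in:"
  def optionalPrefix(groupName: String) = s"(?:(?:(?<$groupName>[a-zA-Z]+)[:])?)"
  // A double-quoted section (may contain whitespace), with an optional prefix
  val quoted = optionalPrefix("prefixQuoted") + "\"(?<textQuoted>[^\"]*)\""
  // A run of characters without whitespace or quotes, with an optional prefix
  val unquoted = optionalPrefix("prefixUnquoted") + "(?<textUnquoted>[^\\s\"]+)"
  val regex = quoted + "|" + unquoted
  val matcher = regex.r.pattern.matcher(query)
  var results: List[QueryTerm] = Nil
  while (matcher.find()) {
    // Exactly one of the two alternatives matched; the other one's groups are null
    val quotedResult = Option(matcher.group("textQuoted")).map(textQuoted =>
      (Option(matcher.group("prefixQuoted")), textQuoted)
    )
    val unquotedResult = Option(matcher.group("textUnquoted")).map(textUnquoted =>
      (Option(matcher.group("prefixUnquoted")), textUnquoted)
    )
    val anyResult = quotedResult.orElse(unquotedResult).get
    results = QueryTerm(anyResult._1, anyResult._2) :: results
  }
  // Terms were prepended, so reverse to print them in query order
  println(s"results=${results.reverse.mkString("\n")}")
}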

Related

transform string scala in an elegant way

I have the following input string: val s = "19860803 000000"
I want to convert it to 1986/08/03.
I tried s.split(" ").head, but that only returns 19860803.
Is there an elegant way in Scala, using a regex, to get the expected result?
You can use a date-like pattern with three capture groups, followed by a match for the space and the six digits.
In the replacement, join the three groups with forward slashes.
val s = "19860803 000000"
val result = s.replaceAll("^(\\d{4})(\\d{2})(\\d{2})\\h\\d{6}$", "$1/$2/$3")
Output
result: String = 1986/08/03
I haven't tested this, but I think the following will work:
val expr = raw"(\d{4})(\d{2})(\d{2}) (.*)".r
val formatted = "19860803 000000" match {
  case expr(year, month, day, _) => s"$year/$month/$day"
}
The Scala docs have a lot of good info:
https://www.scala-lang.org/api/2.13.6/scala/util/matching/Regex.html
An alternative, without a regular expression, by using slice and take.
val s = "19860803 000000"
val year = s.take(4)
val month = s.slice(4,6)
val day = s.slice(6,8)
val result = s"$year/$month/$day"
Or as a one liner
val result = Seq(s.take(4), s.slice(4,6), s.slice(6,8)).mkString("/")

How to use match with regular expressions in Scala

I am starting to learn Scala and want to use regular expressions to match a character from a string so I can populate a mutable map of characters and their value (String values, numbers etc) and then print the result.
I have looked at several answers on SO and gone over the Scala Docs but can't seem to get this right. I have a short Lexer class that currently looks like this:
class Lexer {
private val tokens: mutable.Map[String, Any] = collection.mutable.Map()
private def checkCharacter(char: Character): Unit = {
val Operator = "[-+*/^%=()]".r
val Digit = "[\\d]".r
val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => tokens(c) = "Operator"
case Digit(c) => tokens(c) = Integer.parseInt(c)
case Other(c) => tokens(c) = "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
for (s <- inputArray)
checkCharacter(s)
for((key, value) <- tokens)
println(key + ": " + value)
}
}
I'm pretty confused by the sort of strange method syntax, Operator(c), that I have seen being used to handle the value to match and am also unsure if this is the correct way to use regex in Scala. I think what I want this code to do is clear, I'd really appreciate some help understanding this. If more info is needed I will supply what I can
This official doc has lots of examples: https://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html. What might be confusing is the type of the regular expression and its use in pattern matching...
You can construct a regex from any string by using .r:
scala> val regex = "(something)".r
regex: scala.util.matching.Regex = (something)
Your regex becomes an object that has a few useful methods to be able to find matching groups like findAllIn.
In Scala it's idiomatic to use pattern matching for safe extraction of values, thus Regex class also has unapplySeq method to support pattern matching. This makes it an extractor object. You can use it directly (not common):
scala> regex.unapplySeq("something")
res1: Option[List[String]] = Some(List(something))
or you can let Scala compiler call it for you when you do pattern matching:
scala> "something" match {
| case regex(x) => x
| case _ => ???
| }
res2: String = something
You might ask why exactly this return type on unapply/unapplySeq. The doc explains it very well:
The return type of an unapply should be chosen as follows:
If it is just a test, return a Boolean. For instance case even().
If it returns a single sub-value of type T, return an Option[T].
If you want to return several sub-values T1,...,Tn, group them in an optional tuple Option[(T1,...,Tn)].
Sometimes, the number of values to extract isn’t fixed and we would
like to return an arbitrary number of values, depending on the input.
For this use case, you can define extractors with an unapplySeq method
which returns an Option[Seq[T]]. Common examples of these patterns
include deconstructing a List using case List(x, y, z) => and
decomposing a String using a regular expression Regex, such as case
r(name, remainingFields @ _*) =>
In short your regex might match one or more groups, thus you need to return a list/seq. It has to be wrapped in an Option to comply with extractor contract.
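To make the extractor contract concrete, here is a small sketch. The names ExtractorDemo, Date, Words, describe and firstAndRest are all hypothetical, chosen only for this illustration. It shows a Regex used as an extractor, where the number of bound variables must equal the number of capturing groups, and a hand-written unapplySeq that returns a variable number of values.

```scala
object ExtractorDemo {
  // Three capturing groups, so the pattern below must bind three names.
  val Date = raw"(\d{4})-(\d{2})-(\d{2})".r

  def describe(s: String): String = s match {
    // The compiler calls Date.unapplySeq(s); the whole string has to match.
    case Date(y, m, d) => s"year=$y month=$m day=$d"
    case _             => "no match"
  }

  // A hand-written extractor: the number of extracted values depends on the input.
  object Words {
    def unapplySeq(s: String): Option[Seq[String]] = {
      val trimmed = s.trim
      if (trimmed.isEmpty) None else Some(trimmed.split("\\s+").toSeq)
    }
  }

  def firstAndRest(s: String): String = s match {
    // rest @ _* binds every remaining element as a sequence.
    case Words(first, rest @ _*) => s"$first + ${rest.length} more"
    case _                       => "empty"
  }
}
```

For example, ExtractorDemo.describe("2021-08-03") takes the first case, while a string with extra leading text falls through to "no match" because the regex extractor requires the whole input to match.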
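The way you are using regex is essentially correct, but each pattern needs a capturing group (the parentheses added below) so that case Operator(c) has something to bind to c. I would also map your function over the input array to avoid creating mutable maps. Perhaps something like this: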
class Lexer {
private def getCharacterType(char: Character): Any = {
val Operator = "([-+*/^%=()])".r
val Digit = "([\\d])".r
//val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => "Operator"
case Digit(c) => Integer.parseInt(c)
case _ => "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
val tokens = inputArray.map(x => x -> getCharacterType(x))
for((key, value) <- tokens)
println(key + ": " + value)
}
}
scala> val l = new Lexer()
l: Lexer = Lexer@60f662bd
scala> l.lex("a-1")
a: Other
-: Operator
1: 1
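If the Any return type becomes awkward later on, one possible variation is to return a small sealed token type instead. This is only a sketch under that assumption; TypedLexer, Token, Operator, Digit, Other and classify are hypothetical names that do not appear in the question.

```scala
object TypedLexer {
  // Hypothetical token hierarchy replacing the Any result.
  sealed trait Token
  final case class Operator(symbol: String) extends Token
  final case class Digit(value: Int) extends Token
  final case class Other(text: String) extends Token

  // Same character classes as in the answer above, each with one capturing group.
  private val OperatorRe = "([-+*/^%=()])".r
  private val DigitRe = "(\\d)".r

  def classify(char: Char): Token = char.toString match {
    case OperatorRe(c) => Operator(c)
    case DigitRe(c)    => Digit(c.toInt)
    case other         => Other(other)
  }

  // Pairs each input character with its token, preserving input order.
  def lex(input: String): List[(Char, Token)] =
    input.toList.map(c => c -> classify(c))
}
```

With this shape, a later stage can pattern match on the token type and the compiler can warn about unhandled cases, which is not possible with Any.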

Search and replace special character in a string in scala

In Scala, I have a string in which I need to replace %23 with #, as below:
From https://bucket_name.s3.amazonaws.com/scripts/%23%23%23ENVIRONMENT_NAME%23%23%23/abc/template_abc_windows_%23%23%23ENVIRONMENT_NAME%23%23%23.zip?X-Amz-Security-Token=FQoGZXIvYXdzEOghsfgdghgkjkjjklj
to https://bucket_name.s3.amazonaws.com/scripts/###ENVIRONMENT_NAME###/abc/template_abc_windows_###ENVIRONMENT_NAME###.zip?X-Amz-Security-Token=FQoGZXIvYXdzEOghsfgdghgkjkjjklj
I used the regex and logic below for the substitution, but I get this error:
java.lang.IllegalStateException: No match found
Code:
val originalURL = "https://bucket_name.s3.amazonaws.com/scripts/%23%23%23ENVIRONMENT_NAME%23%23%23/abc/template_abc_windows_%23%23%23ENVIRONMENT_NAME%23%23%23.zip?X-Amz-Security-Token=FQoGZXIvYXdzEOghsfgdghgkjkjjklj"
val pattern = Pattern.compile("(https://bucket_name.s3.amazonaws.com/scripts/)((%23){3})([a-zA-Z]+_[a-zA-Z]+)((%23){3})(/abc/template_abc_windows_)((%23){3})([a-zA-Z]+_[a-zA-Z]+)((%23){3})(..*)")
val matcher = pattern.matcher(originalURL)
val replacedURL = matcher.group(1)+"###"+ matcher.group(4)+"###"+ matcher.group(7)+"###"+ matcher.group(10)+"###"+matcher.group(13)
println("*******replacedURL******* => "+ replacedURL)
Any help is much appreciated. Thank you.
Maybe you can just use String.replaceAll?
val url = "https://bucket_name.s3.amazonaws.com/scripts/%23%23%23ENVIRONMENT_NAME%23%23%23/abc/template_abc_windows_%23%23%23ENVIRONMENT_NAME%23%23%23.zip?X-Amz-Security-Token=FQoGZXIvYXdzEOghsfgdghgkjkjjklj"
url.replaceAll("%23", "#")
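As for the exception itself: java.util.regex.Matcher.group throws IllegalStateException when no match has been attempted yet, and the code in the question calls group without first calling matches() or find(). The sketch below illustrates this, and also uses String.replace, which treats its first argument as literal text rather than as a regex. GroupBeforeMatch, firstGroup and unescapeHash are hypothetical names used only here.

```scala
import java.util.regex.Pattern
import scala.util.Try

object GroupBeforeMatch {
  // Calling group() on a fresh matcher fails: no match has been attempted yet.
  def groupWithoutMatching(regex: String, input: String): Try[String] =
    Try(Pattern.compile(regex).matcher(input).group(1))

  // Attempt the match first; only read groups when it succeeded.
  def firstGroup(regex: String, input: String): Option[String] = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.matches()) Some(m.group(1)) else None
  }

  // Literal replacement, no regex involved.
  def unescapeHash(url: String): String = url.replace("%23", "#")
}
```

For a fixed token such as %23 the literal replacement is usually enough; the matcher-based approach only becomes useful when the surrounding structure of the URL has to be validated as well.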

Scala Spark count regex matches in a file

I am learning Spark+Scala and I am stuck with this problem. I have one file that contains many sentences, and another file with a large number of regular expressions. Both files have one element per line.
What I want is to count how many times each regex has a match in the whole sentences file. For example if the sentences file (after becoming an array or list) was represented by ["hello world and hello life", "hello i m fine", "what is your name"], and the regex files by ["hello \\w+", "what \\w+ your", ...] then I would like the output to be something like: [("hello \\w+", 3),("what \\w+ your",1), ...]
My code is like this:
object PatternCount_v2 {
def main(args: Array[String]) {
// The text where we will find the patterns
val inputFile = args(0);
// The list of patterns
val inputPatterns = args(1)
val outputPath = args(2);
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
// Load the text file
val textFile = sc.textFile(inputFile).cache()
// Load the patterns
val patterns = Source.fromFile(inputPatterns).getLines.map(line => line.r).toList
val patternCounts = textFile.flatMap(line => {
println(line)
patterns.foreach(
pattern => {
println(pattern)
(pattern,pattern.findAllIn(line).length )
}
)
}
)
patternCounts.saveAsTextFile(outputPath)
}}
But the compiler complains:
If I change the flatMap to just map the code runs but returns a bunch of empty tuples () () () ()
Please help! This is driving me crazy.
Thanks,
As far as I can see, there are two issues here:
You should use map instead of foreach: foreach returns Unit; it performs an action with a potential side effect on each element of a collection and doesn't return a new collection. map, on the other hand, transforms a collection into a new one by applying the supplied function to each element.
You're missing the part where you aggregate the results of flatMap to get the actual count per "key" (pattern). This can be done easily with reduceByKey.
Altogether - this does what you need:
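val patternCounts = textFile
  // Key by the pattern's source string: scala.util.matching.Regex does not define equals/hashCode,
  // so Regex instances would not be grouped reliably as keys
  .flatMap(line => patterns.map(pattern => (pattern.regex, pattern.findAllIn(line).length)))
  .reduceByKey(_ + _)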
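The same flatMap-then-aggregate shape can be tried without a cluster. The following is a local sketch on plain Scala collections, not Spark code: groupBy plus sum stands in for reduceByKey, and PatternCountLocal and countMatches are hypothetical names. Counts are keyed by the pattern's source string so that equal patterns compare equal.

```scala
object PatternCountLocal {
  def countMatches(lines: Seq[String], patterns: Seq[String]): Map[String, Int] = {
    // Compile each pattern once and keep its source string as the key.
    val compiled = patterns.map(p => p -> p.r)
    lines
      // One (pattern, count) pair per line and pattern, like the flatMap step.
      .flatMap(line => compiled.map { case (src, re) => (src, re.findAllIn(line).size) })
      // Local stand-in for reduceByKey(_ + _).
      .groupBy(_._1)
      .map { case (src, pairs) => src -> pairs.map(_._2).sum }
  }
}
```

Run on the three sentences from the question with the patterns "hello \\w+" and "what \\w+ your", this gives 3 and 1 respectively.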

Selectively uppercasing a string

I have a string with some XML tags in it, like:
"hello <b>world</b> and <i>everyone</i>"
Is there a good Scala/functional way of uppercasing the words, but not the tags, so that it looks like:
"HELLO <b>WORLD</b> AND <i>EVERYONE</i>"
We can use dustmouse's regex to replace all the text in/outside XML tags with Regex.replaceAllIn. We can get the matched text with Regex.Match.matched which then can easily be uppercased using toUpperCase.
val xmlText = """(?<!<|<\/)\b\w+(?!>)""".r
val string = "hello <b>world</b> and <i>everyone</i>"
xmlText.replaceAllIn(string, _.matched.toUpperCase)
// String = HELLO <b>WORLD</b> AND <i>EVERYONE</i>
val string2 = "<h1>>hello</h1> <span>world</span> and <span><i>everyone</i>"
xmlText.replaceAllIn(string2, _.matched.toUpperCase)
// String = <h1>>HELLO</h1> <span>WORLD</span> AND <span><i>EVERYONE</i>
Using dustmouse's updated regex:
val xmlText = """(?:<[^<>]+>\s*)(\w+)""".r
val string3 = """<h1>>hello</h1> <span id="test">world</span>"""
xmlText.replaceAllIn(string3, m =>
m.group(0).dropRight(m.group(1).length) + m.group(1).toUpperCase)
// String = <h1>>hello</h1> <span id="test">WORLD</span>
Okay, how about this. It just prints the results, and takes into consideration some of the scenarios brought up by others. Not sure how to capitalize the output without mercilessly poaching from Peter's answer:
val string = "<h1 id=\"test\">hello</h1> <span>world</span> and <span><i>everyone</i></span>"
val pattern = """(?:<[^<>]+>\s*)(\w+)""".r
pattern.findAllIn(string).matchData foreach {
m => println(m.group(1))
}
The main thing here is that it is extracting the correct capture group.
Working example: http://ideone.com/2qlwoP
Also need to give credit to the answer here for getting capture groups in scala: Scala capture group using regex