Pattern Matching Scala Regex evaluation

Imagine you have a String that contains the ampersand symbol &.
My goal is to add a space between the & and any adjacent character if there isn't one already,
e.g.
Case 1: Body&Soul --> Body & Soul (working)
Case 2: Body &Soul --> Body & Soul (working)
Case 3: Body& Soul --> Body & Soul (working)
Case 4: Body&Soul&Mind --> Body & Soul & Mind (working)
Case 5: Body &Soul& Mind --> Body & Soul & Mind (not working)
Case 6: Body& Soul &Mind --> Body & Soul & Mind (not working)
def replaceEmployerNameContainingAmpersand(emplName: String): String = {
val r = "(?<! )&(?! )".r.unanchored
val r2 = "&(?! )".r.unanchored
val r3 = "(?<! )&".r.unanchored
emplName match {
case r() => emplName.replaceAll("(?<! )&(?! )", " & ")
case r2() => emplName.replaceAll("&(?! )", "& ")
case r3() => emplName.replaceAll("(?<! )&", " &")
}
}
The goal is to fix Cases 5 and 6: Body &Soul& Mind or Body& Soul &Mind --> Body & Soul & Mind
But it's not working: once the case for r2 or r3 matches, the match expression exits, so the second & symbol is never fixed.
Can anyone help me handle Cases 5 and 6?

You may capture a single optional whitespace char on both ends of a & and check if they matched, and replace accordingly using replaceAllIn:
def replaceAllIn(target: CharSequence, replacer: (Match) => String): String
Replaces all matches using a replacer function.
See the Scala demo:
val s = "Body&Soul, Body &Soul, Body& Soul, Body&Soul&Mind, Body &Soul& Mind, Body& Soul &Mind"
val pattern = """(\s)?&(\s)?""".r
val res = pattern.replaceAllIn(s, m => (if (m.group(1) != null) m.group(1) else " ") + "&" + (if (m.group(2) != null) m.group(2) else " ") )
println(res)
// => Body & Soul, Body & Soul, Body & Soul, Body & Soul & Mind, Body & Soul & Mind, Body & Soul & Mind
The (\s)?&(\s)? pattern captures an optional single whitespace char into Group 1, then matches &, and then captures an optional single whitespace char into Group 2.
If Group 1 is not null, there is a whitespace char before the & and we keep it; otherwise we insert a space. The same logic is used for the trailing space.
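The null checks above can also be written with Option. The following is a minimal sketch of the same idea, not the answerer's code: the object and method names are made up for illustration, and Regex.quoteReplacement is added only as a precaution so that the replacement text is always taken literally.

```scala
import scala.util.matching.Regex

// Hypothetical helper names, chosen for this sketch only.
object AmpersandSpacing {
  // An optional whitespace char on each side of the ampersand.
  private val Amp: Regex = """(\s)?&(\s)?""".r

  def spaceAroundAmpersand(s: String): String =
    Amp.replaceAllIn(s, m => {
      // group(n) is null when the optional group did not participate in the match.
      val before = Option(m.group(1)).getOrElse(" ")
      val after  = Option(m.group(2)).getOrElse(" ")
      Regex.quoteReplacement(before + "&" + after)
    })
}
```

Because every & is handled independently by the replacer function, mixed inputs such as Cases 5 and 6 need no special treatment.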

Related

Extract the repetitive parts of a String by Regex pattern matching in Scala

I have this code for extracting the repetitive : separated sections of a regex, which does not give me the right output.
val pattern = """([a-zA-Z]+)(:([a-zA-Z]+))*""".r
for (p <- pattern findAllIn "it:is:very:great just:because:it is") p match {
case pattern("it", pattern(is, pattern(very, great))) => println("it: "+ is + very+ great)
case pattern(it, _,rest) => println( it+" : "+ rest)
case pattern(it, is, very, great) => println(it +" : "+ is +" : "+ very +" : " + great)
case _ => println("match failure")
}
What am I doing wrong?
How can I write a case expression which allows me to extract each : separated part of the pattern regex?
What is the right syntax with which to solve this?
How can I match against a regex when the number of parts to be extracted is unknown?
In this case print:
it : is : very : great
just : because : it
is
You can't use a repeated capturing group like that: it only keeps the last captured value as the group's value.
You can still get the matches you need with a \b[a-zA-Z]+(?::[a-zA-Z]+)*\b regex and then split each match on the colon:
val text = "it:is:very:great just:because:it is"
val regex = """\b[a-zA-Z]+(?::[a-zA-Z]+)*\b""".r
val results = regex.findAllIn(text).map(_ split ':').toList
results.foreach { x => println(x.mkString(", ")) }
// => it, is, very, great
// just, because, it
// is
See the Scala demo. Regex details:
\b - word boundary
[a-zA-Z]+ - one or more ASCII letters
(?::[a-zA-Z]+)* - zero or more repetitions of
  : - a colon
  [a-zA-Z]+ - one or more ASCII letters
\b - word boundary
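To see why the repeated capturing group cannot return every part, here is a small sketch. The object and method names are hypothetical, and the pattern is a simplified variant of the one in the question with the outer group made non-capturing.

```scala
// Hypothetical names, for illustration only.
object RepeatedGroupDemo {
  // Group 1 is the first word; group 2 sits inside a repeated non-capturing group.
  private val Repeated = """([a-zA-Z]+)(?::([a-zA-Z]+))*""".r

  // Returns the first word and whatever group 2 holds after the whole match.
  def firstAndLast(s: String): Option[(String, Option[String])] = s match {
    case Repeated(first, last) => Some((first, Option(last))) // last is null if group 2 never matched
    case _                     => None
  }
}
```

For "it:is:very:great" this yields the first word together with great only; is and very are overwritten on each repetition, which is why splitting the whole match is the more practical route.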

Regex to replace word except in comments

How can I modify my regex so that it will ignore the comments in the pattern in a language that doesn't support lookbehind?
My regex pattern is:
\b{Word}\b(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
\b{Word}\b : Whole word, {Word} is replaced iteratively for the vocab list
(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$) : Don't replace anything inside of quotes
My goal is to lint variables and words so that they always have the same case. However I do not want to lint any words in a comment. (The IDE sucks and there is no other option)
Comments in this language are prefixed by an apostrophe. Sample code follows
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.value = 1234 ' Set value
value = 123
Basically I want the linter to take the above code and say for the word "value" update it to:
' This is a comment
This = "Is not" ' but this is
' This is a comment, what is it's value?
Object.Value = 1234 ' Set value
Value = 123
That is, every occurrence of "Value" in code is updated, while anything in double quotes, in a comment, or inside another word such as valueadded is left untouched.
I've tried several solutions but haven't been able to get it to work.
['.*] : Not preceded by an apostrophe
(?<!\s*') : Lookbehind, not preceded by an apostrophe with optional spaces
(?<!\s*') : The second example seemed incorrect, and in any case it won't work because the language doesn't support lookbehind
Does anybody have any ideas how I can alter my pattern so that I don't edit commented variables?
VBA
Sub TestSO()
Dim Code As String
Dim Expected As String
Dim Actual As String
Dim Words As Variant
Code = "item = object.value ' Put item in value" & vbNewLine & _
"some.item <> some.otheritem" & vbNewLine & _
"' This is a comment, what is it's value?" & vbNewLine & _
"Object.value = 1234 ' Set value" & vbNewLine & _
"value = 123" & vbNewLine
Expected = "Item = object.Value ' Put item in value" & vbNewLine & _
"some.Item <> some.otheritem" & vbNewLine & _
"' This is a comment, what is it's value?" & vbNewLine & _
"Object.Value = 1234 ' Set value" & vbNewLine & _
"Value = 123" & vbNewLine
Words = Array("Item", "Value")
Actual = SOLint(Words, Code)
Debug.Print Actual = Expected
Debug.Print "CODE: " & vbNewLine & Code
Debug.Print "Actual: " & vbNewLine & Actual
Debug.Print "Expected: " & vbNewLine & Expected
End Sub
Public Function SOLint(ByVal Words As Variant, ByVal FileContents As String) As String
Const NotInQuotes As String = "(?=([^""\\]*(\\.|""([^""\\]*\\.)*[^""\\]*""))*[^""]*$)"
Dim RegExp As Object
Dim Regex As String
Dim Index As Variant
Set RegExp = CreateObject("VBScript.RegExp")
With RegExp
.Global = True
.IgnoreCase = True
End With
For Each Index In Words
Regex = "[('*)]\b" & Index & "\b" & NotInQuotes
RegExp.Pattern = Regex
FileContents = RegExp.Replace(FileContents, Index)
Next Index
SOLint = FileContents
End Function
As discussed in the comments above:
((?:\".*\")|(?:'.*))|\b(v)(alue)\b
The regex has 3 parts, combined with alternation.
A non-capturing group for text within double quotes, which we don't need to change.
A non-capturing group for text starting with a single quote, i.e. a comment.
Finally, the string "value" is split into two parts, (v) and (alue), because while replacing we can use \U($2) to convert v to V and \E$3 to keep the rest as is, where \U converts to upper case and \E turns the case conversion off.
\b \b - word boundaries ensure that only the whole word is matched, so a longer word such as valueadded is left alone.
https://regex101.com/r/mD9JeR/8
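The underlying technique is "match what you want to skip first, then match the word", and it does not depend on lookbehind. Below is a minimal Scala sketch of that idea rather than a VBA solution. It is not the answer's exact regex: the quoted-string part is tightened to "[^"\n]*" (an assumption that strings contain no escaped quotes), and a replacer function stands in for the \U and \E replacement syntax. All names are hypothetical.

```scala
import scala.util.matching.Regex

// Hypothetical names, for illustration only.
object CommentAwareLint {
  // Rewrites every whole-word, case-insensitive occurrence of `word` to its canonical
  // spelling, leaving double-quoted strings and apostrophe comments untouched.
  def lint(word: String, code: String): String = {
    // Group 1: something to skip (a quoted string, or a comment up to the end of the line).
    // Otherwise: the word itself, delimited by word boundaries.
    val pattern = ("(?i)(\"[^\"\\n]*\"|'.*)|\\b" + Regex.quote(word) + "\\b").r
    pattern.replaceAllIn(code, m =>
      // If group 1 matched, put the skipped text back unchanged; else emit the canonical word.
      Regex.quoteReplacement(if (m.group(1) != null) m.group(1) else word))
  }
}
```

Because the skip alternative is tried first at every position, a word inside a string or comment is consumed as part of group 1 and never reaches the word alternative.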

In Scala how can I split a string on whitespaces accounting for an embedded quoted string?

I know Scala can split strings on regexes, like this simple split on whitespace:
myString.split("\\s+").foreach(println)
What if I want to split on whitespace, accounting for the possibility that there may be a quoted string in the input (which I wish to be treated as 1 thing)?
"""This is a "very complex" test"""
In this example I want the resulting substrings to be:
This
is
a
very complex
test
While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match all non-whitespace character sequences with ([^\\s]+) and all quoted character sequences with \"(.*?)\" (toList added in order to avoid reiteration):
import scala.util.matching._
val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }
words.foreach(println)
/*
This
is
a
very complex
test
*/
Note that this solution also treats the quote itself as a word boundary. If you want to keep quoted strings inline with the surrounding text, you'll need to add [^\\s]* on both sides of the quoted alternative and adjust the group boundaries accordingly:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/
You can also omit quote symbols when inlining a string by splitting a regex group:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/
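For reference, the last variant can be put together as one self-contained snippet. This is a sketch assembled from the pieces above; the object and method names are hypothetical, and mkString over the non-null groups replaces the fold from the first example.

```scala
import scala.util.matching.Regex

// Hypothetical names, for illustration only.
object InlineQuotedSplit {
  // Groups 1-3: optional prefix, quoted body, optional suffix; group 4: a plain word.
  private val Token: Regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")

  def split(text: String): List[String] =
    Token.findAllMatchIn(text)
      // Groups that did not participate are null; join the ones that did.
      .map(m => m.subgroups.filter(_ != null).mkString)
      .toList
}
```

Since the quote characters sit outside every capturing group, they are dropped when the groups are joined, while the surrounding brackets are kept.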
In more complex scenarios, when you have to deal with CSV strings, you'd better use a CSV parser (e.g. scala-csv).
For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):
val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))
See the online Scala demo, output:
This
is
a
very complex
test
The regex is rather basic as it contains 2 alternatives, a single capturing group and a negated character class. Here are its details:
\"([^\"]*)\" - ", followed with 0+ chars other than " (captured into Group 1) and then a "
| - or
[^\"\\s]+ - 1+ chars other than " and whitespace.
You only grab .group(1) if Group 1 participated in the match, else, grab the whole match value (.group(0)).
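This should work for the example input (note that it assumes a quoted phrase consists of exactly two whitespace-separated words):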
val xx = """This is a "very complex" test"""
var x = xx.split("\\s+")
for(i <-0 until x.length) {
if(x(i) contains "\"") {
x(i) = x(i) + " " + x(i + 1)
x(i + 1 ) = ""
}
}
val newX= x.filter(_ != "")
for(i<-newX) {
println(i.replace("\"",""))
}
Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.
def fancySplit(s: String): List[String] = {
def recurse(s: List[Char]): List[String] = s match {
case Nil => Nil
case '"' :: tail =>
val (quoted, theRest) = tail.span(_ != '"')
quoted.mkString :: recurse(theRest drop 1)
case c :: tail if c.isWhitespace => recurse(tail)
case chars =>
val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
word.mkString :: recurse(theRest)
}
recurse(s.toList)
}
If the list is empty, you've finished recursion
If the first character is a ", grab everything up to the next quote, and recurse with what's left (after throwing out that second quote).
If the first character is whitespace, throw it out and recurse from the next character
In any other case, grab everything up to the next split character, then recurse with what's left
Results:
scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test

Scala Regex Extractor with OR operator

I have this verbose code that does short-circuit Regex extraction / matching in Scala. It attempts to match a string with the first Regex and, if that doesn't match, attempts to match the string with the second Regex.
val regex1 : scala.util.matching.Regex = "^a(b)".r
val regex2 : scala.util.matching.Regex = "^c(d)".r
val s = ?
val extractedGroup1 : Option[String] = s match { case regex1(v) => Some(v) case _ => None }
val extractedGroup2 : Option[String] = s match { case regex2(v) => Some(v) case _ => None}
val extractedValue = extractedGroup1.orElse(extractedGroup2)
Here are the results:
s == "ab" then extractedValue == "b"
s == "cd" then extractedValue == "c"
s == "gg" then extractedValue == None.
My question is how can we combine the two regex into a single regex with the regex or operator, and still use Scala extractors. I tried this, but it's always the None case.
val regex : scala.util.matching.Regex = "^a(b)$ | ^c(d)$".r
val extractedValue = s match { case regex(v) => Some(v) case _ => None }
Don't put readability spaces inside the regex. Although they feel very Scala-esque, they are taken literally, so the pattern expects a space after the end of the string or before the start of the string, which can never match. Try ^(?:a(b)|c(d))$, which is the same thing you wrote without repeating ^ and $.
Your own ^a(b)$|^c(d)$ can work too (if you remove the spaces).
Also, do you really get c out of cd? Judging by your regex, you should be getting d, if we're talking about capture groups.
Also, note that you're extracting capture groups. If you combine the regexes, an extracted d will be $2, while b would be $1.
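Putting these points together, a combined regex has two capture groups, so the extractor pattern needs two binders, and the group that did not participate comes back as null. A minimal sketch, with hypothetical object and method names:

```scala
// Hypothetical names, for illustration only.
object CombinedExtractor {
  // No spaces around the alternation; anchors written once.
  private val Combined = "^(?:a(b)|c(d))$".r

  def extract(s: String): Option[String] = s match {
    // Exactly one of b and d is non-null when the regex matches.
    case Combined(b, d) => Option(b).orElse(Option(d))
    case _              => None
  }
}
```

With this, "ab" gives Some("b"), "cd" gives Some("d") and "gg" gives None.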

Scala Regular Expressions (string delimited by double quotes)

I am new to scala. I am trying to match a string delimited by double quotes, and I am a bit puzzled by the following behavior:
If I do the following:
val stringRegex = """"([^"]*)"(.*$)"""
val regex = stringRegex.r
val tidyTokens = Array[String]("1", "\"test\"", "'c'", "-23.3")
tidyTokens.foreach {
token => if (token.matches (stringRegex)) println (token + " matches!")
}
I get
"test" matches!
otherwise, if I do the following:
tidyTokens.foreach {
token => token match {
case regex(token) => println (token + " matches!")
case _ => println ("No match for token " + token)
}
}
I get
No match for token 1
No match for token "test"
No match for token 'c'
No match for token -23.3
Why doesn't "test" match in the second case?
Take your regular expression:
"([^"]*)"(.*$)
When compiled with .r, this string yields a regex object which, if it matches its input string, must yield 2 captured strings: one for the ([^"]*) and the other for the (.*$). Your code
case regex(token) => ...
Ought to reflect this, so maybe you want
case regex(token, otherStuff) => ...
Or just
case regex(token, _) => ...
Why? Because the case regex(matchedCaptures...) syntax works because regex is an object with an unapplySeq method. case regex(token) => ... translates (roughly) to:
case List(token) => ...
Where List(token) is what regex.unapplySeq( inputString ) returns:
regex.unapplySeq("\"test\"") // Returns Some(List("test", ""))
Your regex does match the string "test" but in the case statement the regex extractor's unapplySeq method returns a list of 2 strings because that is what the regex says it captures. That's unfortunate, but the compiler can't help you here because regular expressions are compiled from strings at runtime.
One alternative would be to use a non-capturing group:
val stringRegex = """"([^"]*)"(?:.*$)"""
//                             ^^
Then your code would work, because regex will now be an extractor object whose unapplySeq method returns only a single captured group:
tidyTokens foreach {
case regex(token) => println (token + " matches!")
case t => println ("No match for token " + t)
}
Have a look at the tutorial on Extractor Objects for a better understanding of how apply / unapply / unapplySeq work.
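The effect of the group count can be checked with a short sketch. The object and method names below are hypothetical; the two regexes are the ones discussed above.

```scala
// Hypothetical names, for illustration only.
object QuotedTokenDemo {
  private val TwoGroups = """"([^"]*)"(.*$)""".r   // two capturing groups
  private val OneGroup  = """"([^"]*)"(?:.*$)""".r // second group made non-capturing

  // One binder against two groups: compiles, but the arities differ, so it never matches.
  def mismatched(token: String): Option[String] = token match {
    case TwoGroups(inner) => Some(inner)
    case _                => None
  }

  // Two binders against two groups.
  def withTwoGroups(token: String): Option[String] = token match {
    case TwoGroups(inner, _) => Some(inner)
    case _                   => None
  }

  // One binder against one group.
  def withOneGroup(token: String): Option[String] = token match {
    case OneGroup(inner) => Some(inner)
    case _               => None
  }
}
```

For the token "test" in double quotes, mismatched returns None while the other two return Some("test"), which mirrors the behavior seen in the question.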