Return the first instance of unmatched text with Scala regex

Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"

Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, the collection method diff can be used to find differences:
object Main extends App {
val a = "some text abc123 some more text 321abc"
val b = "some text xyz some more text zyx"
val firstdiff = (a.split(" ") diff b.split(" "))(0)
println(firstdiff)
}
This prints abc123.
If regex is desired after all, the splits can be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)
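Note that indexing with (0) throws when the two strings contain the same words, and that diff is a multiset difference that ignores word positions. A minimal sketch of the firstUnmatched helper from the question that returns an Option instead (the object and method names are illustrative, not part of any library):

```scala
object FirstUnmatched {
  // Assumption: "words" are runs of \w characters, as in the answer above.
  private val Word = "\\w+".r

  /** First word of `a` that has no counterpart in `b`, if any.
    * `diff` is a multiset difference, so word order in `b` does not matter. */
  def firstUnmatched(a: String, b: String): Option[String] =
    (Word.findAllIn(a).toList diff Word.findAllIn(b).toList).headOption
}

object FirstUnmatchedDemo extends App {
  val a = "some text abc123 some more text 321abc"
  val b = "some text xyz some more text zyx"
  println(FirstUnmatched.firstUnmatched(a, b)) // Some(abc123)
  println(FirstUnmatched.firstUnmatched(a, a)) // None
}
```

If position matters (for example, the same word moved to a different place), a positional comparison such as zip would be needed instead of diff.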

Scala regex get string before the first hyphen and the entire string

Given a string like abab/docId/example-doc1-2019-01-01, I want to use Regex to extract these values:
firstPart = example
fullString = example-doc1-2019-01-01
I have this:
import scala.util.matching.Regex
case class Read(theString: String) {
val stringFormat: Regex = """.*\/docId\/([A-Za-z0-9]+)-([A-Za-z0-9-]+)$""".r
val stringFormat(firstPart, fullString) = theString
}
But this separates it like this:
firstPart = example
fullString = doc1-2019-01-01
Is there a way to retain the fullString and do a regex on that to get the part before the first hyphen? I know I can do this using the String split method, but is there a way to do it using regex?
You may use
val stringFormat: Regex = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)$".r
                                    ||_ Group 2 __|              |
                                    |                            |
                                    |_________ Group 1 __________|
See the regex demo.
Note how the capturing parentheses are rearranged: the outer group now captures the full string and the inner group captures the part before the first hyphen. You also need to swap the variables in the regex match call (fullString should come before firstPart).
See Scala demo:
val theString = "abab/docId/example-doc1-2019-01-01"
val stringFormat = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)".r
val stringFormat(fullString, firstPart) = theString
println(s"firstPart: '$firstPart'\nfullString: '$fullString'")
Output:
firstPart: 'example'
fullString: 'example-doc1-2019-01-01'
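The pattern-binding val in the demo throws a MatchError when the input does not match the regex. A hedged sketch of a non-throwing variant that uses the same nested groups (DocId and parse are hypothetical names chosen for illustration):

```scala
object DocId {
  // Outer group (1) = full string, inner group (2) = part before the first hyphen.
  private val Format = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)".r

  /** Returns (firstPart, fullString), or None when the input does not match. */
  def parse(s: String): Option[(String, String)] = s match {
    case Format(fullString, firstPart) => Some((firstPart, fullString))
    case _                             => None
  }
}

object DocIdDemo extends App {
  println(DocId.parse("abab/docId/example-doc1-2019-01-01"))
  // Some((example,example-doc1-2019-01-01))
  println(DocId.parse("abab/docId/nohyphen")) // None
}
```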

Replacing the 1st regex-match group instead of the 0th

I was expecting this
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
println(string.replace(regex, ""))
to result in this:
hello, world
Instead, it prints this:
hello world
I see that the replace function cares about the whole match. Is there a way to replace only the 1st group instead of the 0th one?
Add the comma in the replacement:
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
println(string.replace(regex, ","))
Or use a lookahead, which Kotlin's Regex supports:
val string = "hello , world"
val regex = Regex("""\s+(?=,)""")
println(string.replace(regex, ""))
You can retrieve the range of the first group through the groups property of the MatchResult (a MatchGroupCollection) and pass that range to the String.removeRange method:
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
val result = string.removeRange(regex.find(string)!!.groups[1]!!.range)
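For comparison with the Scala threads on this page, the same "replace only group 1" idea can be sketched in Scala using the start and end offsets of a group on Regex.Match. This is an analogue of the Kotlin answer, not a translation of any Kotlin API; ReplaceGroup and replaceFirstGroup are made-up names:

```scala
import scala.util.matching.Regex

object ReplaceGroup {
  /** Replaces only capture group `group` of the first match and keeps the
    * rest of the match (here, the comma) untouched. */
  def replaceFirstGroup(input: String, regex: Regex, group: Int, replacement: String): String =
    regex.findFirstMatchIn(input) match {
      // start(group) is -1 when the group did not take part in the match
      case Some(m) if m.start(group) >= 0 =>
        input.substring(0, m.start(group)) + replacement + input.substring(m.end(group))
      case _ => input
    }
}

object ReplaceGroupDemo extends App {
  val regex = """(\s+)[,]""".r
  println(ReplaceGroup.replaceFirstGroup("hello , world", regex, 1, "")) // hello, world
}
```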

Extracting inner group with Scala regex

My Scala app is being given a string that may or may not contain the token "flimFlam(*)" inside of it, where the asterisk represents any kind of text, chars, punctuation, etc. There will always only be 0 or 1 instances of "flimFlam(*)" in this string, never more.
I need to detect if the given input string contains a "flimFlam(*)" instance, and if it does, extract out whatever is inside the two parentheses. Hence, if my string contains "flimFlam(Joe)", then the result would be a string with a value of "Joe", etc.
My best attempt so far:
val inputStr : String = "blah blah flimFlam(Joe) blah blah"
// Regex must be case-sensitive for "flimFlam" (not "FLIMFLAM", "flimflam", etc.)
val flimFlamRegex = ".*flimFlam\\(.*?\\)".r
val insideTheParens = flimFlamRegex.findFirstIn(inputStr)
Can anyone spot where I'm going awry?
Use pattern matching with the regex as an extractor:
val regex = ".*flimFlam\\((.*)\\).*".r
inputStr match {
case regex(x) => println(x)
case _ => println("no match")
}
Scala REPL
scala> val inputStr : String = "blah blah flimFlam(Joe) blah blah"
inputStr: String = blah blah flimFlam(Joe) blah blah
scala> val regex = ".*flimFlam\\((.*)\\).*"
regex: String = .*flimFlam\((.*)\).*
scala> val regex = ".*flimFlam\\((.*)\\).*".r
regex: scala.util.matching.Regex = .*flimFlam\((.*)\).*
scala> inputStr match { case regex(x) => println(x); case _ => println("no match")}
Joe
You may use a capturing group around .*? and make the regex unanchored within a match block so that the pattern stays short and "pretty" (no need for .* around the value you are looking for):
val str = "blah blah flimFlam(Joe) blah blah"
val pattern = """flimFlam\((.*?)\)""".r.unanchored
str match {
  case pattern(inner) => println(inner)
  case _ => println("No match")
}
See the online demo
Also, note that you do not need to double the backslashes inside """-quoted string literals, which helps avoid excessive backslashes.
And a hint: if the flimFlam is a whole word, add \b in front - """\bflimFlam\((.*?)\)""".
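The original attempt returns the whole match because findFirstIn yields the matched text, not a capture group. When an Option result is more convenient than a match block, findFirstMatchIn gives access to the group. A minimal sketch (FlimFlam and extract are illustrative names):

```scala
object FlimFlam {
  // Case-sensitive by default; \b requires flimFlam to start at a word boundary.
  private val Pattern = """\bflimFlam\((.*?)\)""".r

  /** Text between the parentheses of the first flimFlam(...) token, if present. */
  def extract(s: String): Option[String] =
    Pattern.findFirstMatchIn(s).map(_.group(1))
}

object FlimFlamDemo extends App {
  println(FlimFlam.extract("blah blah flimFlam(Joe) blah blah")) // Some(Joe)
  println(FlimFlam.extract("blah blah FLIMFLAM(Joe) blah blah")) // None
}
```

Because .*? is lazy, the capture stops at the first closing parenthesis, so content that itself contains ")" would be cut short.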

Selectively uppercasing a string

I have a string with some XML tags in it, like:
"hello <b>world</b> and <i>everyone</i>"
Is there a good Scala/functional way of uppercasing the words, but not the tags, so that it looks like:
"HELLO <b>WORLD</b> AND <i>EVERYONE</i>"
We can use dustmouse's regex to replace all the text in/outside XML tags with Regex.replaceAllIn. We can get the matched text with Regex.Match.matched which then can easily be uppercased using toUpperCase.
val xmlText = """(?<!<|<\/)\b\w+(?!>)""".r
val string = "hello <b>world</b> and <i>everyone</i>"
xmlText.replaceAllIn(string, _.matched.toUpperCase)
// String = HELLO <b>WORLD</b> AND <i>EVERYONE</i>
val string2 = "<h1>>hello</h1> <span>world</span> and <span><i>everyone</i>"
xmlText.replaceAllIn(string2, _.matched.toUpperCase)
// String = <h1>>HELLO</h1> <span>WORLD</span> AND <span><i>EVERYONE</i>
Using dustmouse's updated regex:
val xmlText = """(?:<[^<>]+>\s*)(\w+)""".r
val string3 = """<h1>>hello</h1> <span id="test">world</span>"""
xmlText.replaceAllIn(string3, m =>
m.group(0).dropRight(m.group(1).length) + m.group(1).toUpperCase)
// String = <h1>>hello</h1> <span id="test">WORLD</span>
Okay, how about this. It just prints the results, and takes into consideration some of the scenarios brought up by others. Not sure how to capitalize the output without mercilessly poaching from Peter's answer:
val string = "<h1 id=\"test\">hello</h1> <span>world</span> and <span><i>everyone</i></span>"
val pattern = """(?:<[^<>]+>\s*)(\w+)""".r
pattern.findAllIn(string).matchData foreach {
m => println(m.group(1))
}
The main thing here is that it is extracting the correct capture group.
Working example: http://ideone.com/2qlwoP
Also need to give credit to the answer here for getting capture groups in scala: Scala capture group using regex
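A different technique from the regexes above, shown only as a sketch: match either a tag or a run of text with one alternation, and uppercase only the text branch. This handles tags with attributes and text that is not directly preceded by a tag. The names are hypothetical, and the pattern assumes simple markup (no comments, CDATA, or ">" inside attribute values); real XML is better handled by an XML parser.

```scala
import scala.util.matching.Regex

object UppercaseOutsideTags {
  // Group 1: something that looks like a tag; group 2: a run of text up to the next '<'.
  private val TagOrText: Regex = """(<[^<>]*>)|([^<]+)""".r

  def apply(s: String): String =
    TagOrText.replaceAllIn(s, m =>
      // quoteReplacement keeps '$' and '\' in the input from being read as group references.
      Regex.quoteReplacement(
        if (m.group(1) != null) m.group(1) else m.group(2).toUpperCase))
}

object UppercaseOutsideTagsDemo extends App {
  println(UppercaseOutsideTags("hello <b>world</b> and <i>everyone</i>"))
  // HELLO <b>WORLD</b> AND <i>EVERYONE</i>
  println(UppercaseOutsideTags("""<span id="test">world</span>"""))
  // <span id="test">WORLD</span>
}
```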

Regular expression for removing white spaces but not those inside ""

I have the following input string:
key1 = "test string1" ; key2 = "test string 2"
I need to convert it to the following without tokenizing:
key1="test string1";key2="test string 2"
You'd be far better off NOT using a regular expression.
What you should be doing is parsing the string. The problem you've described is a mini-language, since each point in that string has a state (eg "in a quoted string", "in the key part", "assignment").
For example, what happens when you decide you want to escape characters?
key1="this is a \"quoted\" string"
Move along the string character by character, maintaining and changing state as you go. Depending on the state, you can either emit or omit the character you've just read.
As a bonus, you'll get the ability to detect syntax errors.
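The character-by-character approach described above could look roughly like this in Scala. It is a sketch under stated assumptions, not a full parser: a backslash escapes the next character inside quotes, an unterminated quote counts as a syntax error, and StripOutsideQuotes is an illustrative name.

```scala
object StripOutsideQuotes {
  /** Removes whitespace outside double-quoted strings.
    * Assumptions (not stated in the question): a backslash escapes the next
    * character inside quotes, and an unterminated quote is a syntax error. */
  def strip(input: String): Either[String, String] = {
    val out = new StringBuilder
    var inQuotes = false // state: currently inside "..."
    var escaped = false  // state: previous character was an unconsumed backslash
    for (c <- input) {
      if (inQuotes) {
        out.append(c) // everything inside quotes is emitted verbatim
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inQuotes = false
      } else if (c == '"') {
        inQuotes = true
        out.append(c)
      } else if (!c.isWhitespace) {
        out.append(c) // outside quotes, whitespace is omitted
      }
    }
    if (inQuotes) Left("unterminated quoted string") else Right(out.toString)
  }
}

object StripOutsideQuotesDemo extends App {
  println(StripOutsideQuotes.strip("key1 = \"test string1\" ; key2 = \"test string 2\""))
  // Right(key1="test string1";key2="test string 2")
}
```

Returning Either keeps the syntax-error case explicit instead of silently emitting a partial result.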
Using ERE, i.e. extended regular expressions (which are clearer than basic RE in such cases), assuming no quote escaping and with the global flag set (to replace all occurrences), you can do it this way:
s/ *([^ "]*) *("[^"]*")?/\1\2/g
sed:
$ echo 'key1 = "test string1" ; key2 = "test string 2"' | sed -r 's/ *([^ "]*) *("[^"]*")?/\1\2/g'
C# code:
using System;
using System.Text.RegularExpressions;
Regex regex = new Regex(" *([^ \"]*) *(\"[^\"]*\")?");
String input = "key1 = \"test string1\" ; key2 = \"test string 2\"";
String output = regex.Replace(input, "$1$2");
Console.WriteLine(output);
Output:
key1="test string1";key2="test string 2"
Escape-aware version
On second thought, leaving out an escape-aware version of the regexp could be misleading, so here it is:
s/ *([^ "]*) *("([^\\"]|\\.)*")?/\1\2/g
which in C# looks like:
Regex regex = new Regex(" *([^ \"]*) *(\"(?:[^\\\\\"]|\\\\.)*\")?");
String output = regex.Replace(input, "$1$2");
Please do not go blind from those backslashes!
Example
Input: key1 = "test \\ " " string1" ; key2 = "test \" string 2"
Output: key1="test \\ "" string1";key2="test \" string 2"
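For readers following the Scala threads on this page, the escape-aware pattern can be ported as below. This is a sketch that reuses the regex from this answer unchanged; StripWithRegex is a made-up name, and it relies on java.util.regex substituting an empty string for $2 when the optional group does not take part in a match.

```scala
object StripWithRegex {
  // Same escape-aware pattern as above; the triple-quoted literal avoids doubled backslashes.
  private val Pattern = """ *([^ "]*) *("(?:[^\\"]|\\.)*")?""".r

  /** Drops spaces around each token while keeping quoted strings intact. */
  def strip(input: String): String = Pattern.replaceAllIn(input, "$1$2")
}

object StripWithRegexDemo extends App {
  println(StripWithRegex.strip("key1 = \"test string1\" ; key2 = \"test string 2\""))
  // key1="test string1";key2="test string 2"
}
```

Like the sed and C# versions, this only removes space characters, not tabs or newlines.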