Scala: how to replace strings using the original matched values - regex

Is there a way to replace some string in a text using the original matched values?
For instance, I would like to replace all the integers by decimals, as in the following example:
"hello 45 hello 4 bye" --> "hello 45.0 hello 4.0 bye"
I could match all the numbers with findAllIn and after replace them but I would like to know if there is a better solution.

Using RegularExpressions, you can use $1 to get the result of the first capturing group (in parenthesis):
val regex = "(\\d+)".r
val text = "hello 45 hello 4 bye"
val result = regex.replaceAllIn(text, "$1.0")
// result: String = hello 45.0 hello 4.0 bye

Use the overload of replaceAllIn that takes a replacer function:
http://www.scala-lang.org/api/current/index.html#scala.util.matching.Regex#replaceAllIn(target:CharSequence,replacer:scala.util.matching.Regex.Match=>String):String

Related

Scala regex on a whole column

I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
There are a few notes about the code:
There is an unmatched parenthesis for a group, but that one should be escaped
The backslashes should be double escaped
In the println you don't have to use all the parenthesis and the dot
findAllIn returns a MatchIterator, and looping those will expose a matched string. Joining those matched strings with a comma, will in this case give back the same string again.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print out the group 1 and group 2 values, you can use findAllMatchIn that returns a collection of Regex Matches:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
println(m.group(1))
println(m.group(2))
}
)
Output
str:ann smith
14
str:frank chase
15
In Python, Series.str.extractall only returns captured substrings. In Scala, findAllIn returns the matched values if you do not query its matchData property that in its turn contains a subgroups property.
So, to get the captures only in Scala, you need to use
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
See the Scala online demo.
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches
\( - a ( char
(str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
, - a comma
(\d+) - Group 2: one or more digits
\) - a ) char.
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
See the online demo.

Replacing the 1st regex-match group instead of the 0th

I was expecting this
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
println(string.replace(regex, ""))
to result in this:
hello, world
Instead, it prints this:
hello world
I see that the replace function cares about the whole match. Is there a way to replace only the 1st group instead of the 0th one?
Add the comma in the replacement:
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
println(string.replace(regex, ","))
Or, if kotlin supports lookahead:
val string = "hello , world"
val regex = Regex("""\s+(?=,)""")
println(string.replace(regex, ""))
You can retrieve the match range of the regular expression by using the groups property of MatchGroupCollection and then using the range as a parameter for String.removeRange method:
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
val result = string.removeRange(regex.find(string)!!.groups[1]!!.range)

return first instance of unmatched regex scala

Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"
Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, diff can be used to find differences.
object Main extends App {
val a = "some text abc123 some more text 321abc"
val b = "some text xyz some more text zyx"
val firstdiff = (a.split(" ") diff b.split(" "))(0)
println(firstdiff)
}
prints "abc123"
Is regex desired after all? Then realize that the splits could be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)

Selectively uppercasing a string

I have a string with some XML tags in it, like:
"hello <b>world</b> and <i>everyone</i>"
Is there a good Scala/functional way of uppercasing the words, but not the tags, so that it looks like:
"HELLO <b>WORLD<b> AND <i>EVERYONE</i>"
We can use dustmouse's regex to replace all the text in/outside XML tags with Regex.replaceAllIn. We can get the matched text with Regex.Match.matched which then can easily be uppercased using toUpperCase.
val xmlText = """(?<!<|<\/)\b\w+(?!>)""".r
val string = "hello <b>world</b> and <i>everyone</i>"
xmlText.replaceAllIn(string, _.matched.toUpperCase)
// String = HELLO <b>WORLD</b> AND <i>EVERYONE</i>
val string2 = "<h1>>hello</h1> <span>world</span> and <span><i>everyone</i>"
xmlText.replaceAllIn(string2, _.matched.toUpperCase)
// String = <h1>>HELLO</h1> <span>WORLD</span> AND <span><i>EVERYONE</i>
Using dustmouse's updated regex :
val xmlText = """(?:<[^<>]+>\s*)(\w+)""".r
val string3 = """<h1>>hello</h1> <span id="test">world</span>"""
xmlText.replaceAllIn(string3, m =>
m.group(0).dropRight(m.group(1).length) + m.group(1).toUpperCase)
// String = <h1>>hello</h1> <span id="test">WORLD</span>
Okay, how about this. It just prints the results, and takes into consideration some of the scenarios brought up by others. Not sure how to capitalize the output without mercilessly poaching from Peter's answer:
val string = "<h1 id=\"test\">hello</h1> <span>world</span> and <span><i>everyone</i></span>"
val pattern = """(?:<[^<>]+>\s*)(\w+)""".r
pattern.findAllIn(string).matchData foreach {
m => println(m.group(1))
}
The main thing here is that it is extracting the correct capture group.
Working example: http://ideone.com/2qlwoP
Also need to give credit to the answer here for getting capture groups in scala: Scala capture group using regex

How to create a regex that matches alpha chars up to, but not including, whitespace?

With this input:
"hello the3re world"
I'm trying to create a regex that will match words containing just alpha characters and not digits.
I'm using std::regex_search with flag std::regex_constants::match_continuous.
With this regex [[:alpha:]]+ the first call to regex_search will give me back "hello". If I then advance over "hello" and any white space and try again with "the3re world", I get back: "the".
But, what I really want at this point is a failure to match as the "3" shouldn't be valid in a word.
(Adding my comment as an answer) You should use word boundaries \b for this purpose..
\b[[:alpha:]]+\b //or "\\b[[:alpha:]]+\\b" as per the syntax..
You can use the following code:
string line1 = "hello the3re world";
string regexStr1 = "\\b[a-zA-Z]+\\b";
regex rg1(regexStr1);
smatch sm1;
while (regex_search(line1, sm1, rg1)) {
std::cout << sm1[0] << std::endl;
line1 = sm1.suffix().str();
}
Output: