Scala regex pattern giving scala.MatchError

I have a string
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
the pattern
.*version_partition=(\d+)(.*)
is working as expected on regex101.com.
The requirement is to extract two strings: one is "8" (the digits immediately after version_partition=) and the other is "/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz".
In the Scala REPL the same pattern gives scala.MatchError. I am new to regular expressions and not sure what I am doing wrong. Please help.
The Scala code is
val P = """.*version_partition=(\d+)(.*)""".r
val P(ver,fileName) = path;
I have also tried the /g and /m flags. That didn't work either.

Your code works: https://scalafiddle.io/sf/Xz1Y0Ze/0
You don't need the /g and /m flags.
/g ==> perform a global match (find all matches rather than stopping after the first match)
/m ==> perform multiline matching
Code:
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
val P = """.*version_partition=(\d+)(.*)""".r
val P(ver,fileName) = path;
println(ver)
println(fileName)
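Scala's .r takes the bare pattern, so the /g and /m suffixes described above have no direct syntax there. A minimal sketch of the usual equivalents follows; the object and method names are made up for illustration. findAllIn plays roughly the role of /g, and the inline (?m) flag the role of /m.

```scala
object FlagEquivalents {
  // Rough analogue of /g: findAllIn iterates over every match, not just the first.
  def allNumbers(s: String): List[String] =
    """\d+""".r.findAllIn(s).toList

  // Rough analogue of /m: the inline (?m) flag lets ^ match at the start of each line.
  def firstWordOfEachLine(s: String): List[String] =
    """(?m)^\w+""".r.findAllIn(s).toList
}
```

For example, FlagEquivalents.allNumbers("year=2017/month=05") collects both numbers rather than stopping at 2017.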

Try it using a match like this:
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
val P = """.*version_partition=(\d+)(.*)""".r
path match {
  case P(a, b) =>
    println(a)
    println(b)
}
Test

You accidentally added a white space at the end.
https://regex101.com/r/FLkZEu/2
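The .* at the beginning of the regex is only needed because a Scala pattern match anchors the regex to the whole string; with an unanchored search it is unnecessary.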
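A sketch of how the extraction could be written so that a non-matching path yields None instead of throwing scala.MatchError. It assumes the same pattern as in the question, minus the leading .*, and uses Regex.unanchored so the pattern may match anywhere in the string. The names VersionPartition and parse are hypothetical.

```scala
object VersionPartition {
  // Same groups as the question's pattern; unanchored, so no leading ".*" is needed.
  private val Partition = """version_partition=(\d+)(.*)""".r.unanchored

  // Returns (version, rest of path) or None when the path has no version_partition=<digits>.
  def parse(path: String): Option[(String, String)] = path match {
    case Partition(version, rest) => Some((version, rest))
    case _                        => None
  }
}
```

The fallback case is the design point: val P(ver, fileName) = path has no fallback, so any mismatch (for example a stray space in the pattern) surfaces as a MatchError.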

Related

Matcher of Regex expression is false while the expression, pattern and string are all valid

I am using a regular expression like so:
@Test
fun timePatternFromInstantIsValid() {
    val instantOfSometimeEarlier = Instant.now().minus(Duration.ofMinutes((1..3).random().toLong()))
    val timeOfEvent = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss").withZone(ZoneId.of("UTC")).format(instantOfSometimeEarlier)
    val regex = "(\\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01]))T(?:(?:([01]?\\d|2[0-3]):)?([0-5]?\\d):)?([0-5]?\\d)"
    val acceptedDatePattern: Pattern = Pattern.compile(regex)
    val matcher: Matcher = acceptedDatePattern.matcher(timeOfEvent)
    val isMatchToAcceptedDatePattern: Boolean = matcher.matches()
    print(isMatchToAcceptedDatePattern)
}
For some reason isMatchToAcceptedDatePattern is false, which probably indicates something is wrong with my regex. But when I check it on multiple regex websites I get a valid match. Any ideas why?
Try it yourself:
https://www.regextester.com/ or here:
https://regex101.com/
My regex, raw (as entered on the websites):
(\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))T(?:(?:([01]?\d|2[0-3]):)?([0-5]?\d):)?([0-5]?\d)
Example of the formatted value (it is returned without the ' quotes around the T):
2021-04-01T11:12:51 (this is what I see when I debug)
Date pattern:
yyyy-MM-ddTHH:mm:ss
Could someone enlighten me, please?
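You use matcher.matches(), which requires the entire input to match, as if the regex were wrapped in ^ and $. Your regex starts with \d{2}, but the formatted year has four digits, so the whole string cannot match; the regex websites only look for a match somewhere in the input.
Instead you should either:
- use matcher.find(), which returns true if a match can be found anywhere in the input, or
- prepend \d{2} to your regex and still use matcher.matches(): Demo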
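The difference between the two calls can be checked directly. This is a small sketch using java.util.regex from Scala (the question's code is Kotlin; Scala is used here only for consistency with the other examples), with the regex and sample value taken from the question. The object and method names are illustrative.

```scala
import java.util.regex.Pattern

object MatchesVsFind {
  // The regex from the question, written as a raw (triple-quoted) string.
  val TimeRegex: String =
    """(\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))T(?:(?:([01]?\d|2[0-3]):)?([0-5]?\d):)?([0-5]?\d)"""

  // True only if the regex consumes the entire input.
  def matchesWhole(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).matches()

  // True if the regex matches any substring of the input.
  def findsSomewhere(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()
}
```

With the sample value 2021-04-01T11:12:51, matchesWhole is false for the original regex, findsSomewhere is true (the match starts at the third character), and matchesWhole becomes true once \d{2} is prepended.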

Scala regex can't match "\r\n" in a giving string which contains multiple "\r\n" [duplicate]

I want to split abcd\r\nabc\r\nppp into (abcd\r\nabc, ppp) with the regex "(.*)\r\n(.*)".r,
but the match fails:
object Regex extends App {
  val r = "(.*)\r\n(.*)".r
  val str = "abcd\r\nabc\r\nppp"
  str match {
    case r(a, b) =>
      println((a, b))
    case _ =>
      println("fail - ")
  }
}
The console prints fail - .
It works fine if the regex is matched against abcd\r\nppp instead:
object Regex extends App {
  val r = "(.*)\r\n(.*)".r
  val str = "abcd\r\nppp"
  str match {
    case r(a, b) =>
      println((a, b))
    case _ =>
      println("fail - ")
  }
}
Besides, I don't want to replace \r\n with other characters. That would waste computing resources, and this code runs in a performance-sensitive stage.
Thanks
Dot does not match line terminators such as \r and \n by default (don't ask why - there is no reason, it just doesn't), so after the first \r\n the second .* stops at the next \r\n and the whole-string match fails.
You can change that by specifying a DOTALL flag to your regex. That's done by adding (?s) to the beginning of the pattern (don't ask how ?s came to stand for DOTALL ... there is a lot of mystery like this in regex world):
val r = "(?s)(.*)\r\n(.*)".r
val str = "abcd\r\nabc\r\nppp"
str match {
  case r(a, b) => println(a -> b)
}
This prints
(abcd
abc,ppp)
If you want to split at the first \r\n rather than the last one, add ? to the first group:
val r = "(?s)(.*?)\r\n(.*)".r
This makes the wildcard non-greedy, so that it matches the shortest possible string rather than the longest, which is the default.
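The greedy and non-greedy variants can be compared side by side. The sketch below wraps both in functions that return an Option. The lastIndexOf variant is an addition that is not part of the answer above, included only because the question mentions performance. All names are illustrative.

```scala
object CrLfSplit {
  private val AtLastBreak  = "(?s)(.*)\r\n(.*)".r   // greedy first group
  private val AtFirstBreak = "(?s)(.*?)\r\n(.*)".r  // non-greedy first group

  def atLast(s: String): Option[(String, String)] = s match {
    case AtLastBreak(a, b) => Some((a, b))
    case _                 => None
  }

  def atFirst(s: String): Option[(String, String)] = s match {
    case AtFirstBreak(a, b) => Some((a, b))
    case _                  => None
  }

  // Regex-free alternative (an assumption about what "performance sensitive" calls for).
  def atLastPlain(s: String): Option[(String, String)] = {
    val i = s.lastIndexOf("\r\n")
    if (i < 0) None else Some((s.substring(0, i), s.substring(i + 2)))
  }
}
```

For abcd\r\nabc\r\nppp, atLast and atLastPlain split before ppp, while atFirst splits right after abcd.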

Regular Expression Split

I have the string shown below. I have been trying to split it with a regular expression, and going through the forums I found ([^|]+), which matches everything except the pipe character. However, I want to break the string in two using regular expressions and have not been able to do so. One expression (xyz) would extract everything from GA up to the pipe character; the second (abc) would extract everything after the first pipe.
GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298
The first is ^[^|]+ and the second is [^|]+$.
The idea is to use your negated character class with anchors. ^ will match the string start and $ will match the string end.
These two patterns have no lookarounds and will work with almost any regex flavor.
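As a sketch of how the two anchored patterns behave, here they are in Scala (chosen only as an example flavor; the object and method names are illustrative):

```scala
object PipeParts {
  private val BeforePipe = "^[^|]+".r  // run of non-pipe characters at the start
  private val AfterPipe  = "[^|]+$".r  // run of non-pipe characters at the end

  def first(s: String): Option[String] = BeforePipe.findFirstIn(s)
  def last(s: String): Option[String]  = AfterPipe.findFirstIn(s)
}
```

Note that with more than one pipe in the input, the second pattern returns only the part after the last pipe.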
Guessing at popular languages. :-)
Python:
'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'.split('|')
JavaScript:
'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'.split('|')
PHP:
explode('|', 'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298')
Go:
strings.Split("GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298", "|")
Ruby:
'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'.split('|')
EDIT
After clarification, I get what you're asking. Fiddling with regex101.com, I found that those two expressions should give you what you want:
^.*(?=\|) gets the first part, and
(?<=\|).* gets the second.
When you click on the link, you can see it in action.
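For readers without the link, a small sketch of the two lookaround expressions in Scala, whose java.util.regex engine supports both lookahead and lookbehind. The object and method names are illustrative.

```scala
object PipeLookaround {
  private val BeforePipe = """^.*(?=\|)""".r  // everything up to a position followed by "|"
  private val AfterPipe  = """(?<=\|).*""".r  // everything from a position preceded by "|"

  def before(s: String): Option[String] = BeforePipe.findFirstIn(s)
  def after(s: String): Option[String]  = AfterPipe.findFirstIn(s)
}
```

Because .* is greedy, an input with several pipes such as a|b|c gives a|b for the first expression and b|c for the second.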
PREVIOUS ANSWER
There are many alternatives to regular expressions, as @smarx's answer shows.
But something along these lines should do it:
R
myString <- 'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'
part1 <- sub(pattern = "(.*)\\|(.*)", x = myString, replacement = "\\1")
part2 <- sub(pattern = "(.*)\\|(.*)", x = myString, replacement = "\\2")
R requires doubling all backslashes; some other languages don't.
Python
import re
myString = 'GA1.2.1127630839.1468526914|3847EFF358ABEC90-01A39B0290BAC298'
# Raw strings keep the backslashes intact for the regex engine.
part1 = re.sub(pattern=r"(.*)\|(.*)", repl=r"\1", string=myString)
part2 = re.sub(pattern=r"(.*)\|(.*)", repl=r"\2", string=myString)

Regex is not matching in Scala

I want to split up a camelCase string with spaces.
"ClassicalMusicArtist" -> "Classical Music Artist"
I should be able to do this by replacing "/([a-z](?=[A-Z]))/g" with "$1 " (regex101).
But my regex is not getting any matches:
val regex = "/([a-z](?=[A-Z]))/g".r
val s = "ClassicalMusicArtist"
regex.replaceAllIn(s, "$1 ") // -> Returns "ClassicalMusicArtist"
regex.findFirstIn(s) // -> Returns None
What am I doing wrong? I used the regex in another language with success and can't figure out why I am not getting any matches.
OK, I figured it out.
In Scala the regex has to be val regex = "([a-z](?=[A-Z]))".r, without the surrounding / delimiters and the g modifier. Inside a Scala string those are treated as literal characters to match.
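A minimal sketch of the corrected version, wrapped in a hypothetical helper so it can be reused. replaceAllIn already replaces every match, which is why no g modifier is needed.

```scala
object CamelCase {
  // A lowercase letter that is immediately followed by an uppercase letter.
  private val LowerBeforeUpper = "([a-z](?=[A-Z]))".r

  // Re-inserts the captured letter ($1) followed by a space at every boundary.
  def withSpaces(s: String): String =
    LowerBeforeUpper.replaceAllIn(s, "$1 ")
}
```

CamelCase.withSpaces("ClassicalMusicArtist") gives "Classical Music Artist", whereas the slash-delimited pattern from the question finds nothing because the input contains no literal /.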

Unable to identify comma "," with regex using scala 2.10.3

I am currently writing a function which uses the UNIX ls -m command to list a bunch of files and then transforms them into a list using a regex.
My function is as follows:
def genFileList(path : String = "~") : Iterator[String] = {
  val fileSeparatorRegex: Regex = "(.*),".r
  val fullCommand : String = s"ls -m $path"
  val rawFileList: String = fullCommand.!!
  val files: Iterator[String] = fileSeparatorRegex.findAllIn(rawFileList).matchData.map(_.group(1))
  var debug : List[String] = files.toList
  debug
  files
}
For example: let's assume I have a folder called test with 3 files: test.txt test1.txt test2.txt. The resulting list is:
Very strange...
Let's change the function to:
def genFileList(path : String = "~") : Iterator[String] = {
  val fileSeparatorRegex: Regex = "(.*)\\n".r // Changed to match newline
  val fullCommand : String = s"ls -1 $path" // Changed to give file names separated by newlines
  val rawFileList: String = fullCommand.!!
  val files: Iterator[String] = fileSeparatorRegex.findAllIn(rawFileList).matchData.map(_.group(1))
  var debug : List[String] = files.toList
  debug
  files
}
Tadaaaa:
Can anybody help me make sense of the first case failing?
Why do the commas generated by ls -m not get matched?
(.*) is a greedy pattern: it tries to match as much as it can, including the commas.
test1.txt, test2.txt, test3.txt
^------------------^^
   all of this is   |
   matched by .*    this is matched by ,
The last chunk is not matched, because it's not followed by a comma.
You can use non-greedy matching with .*?
Alternatively, you can just do rawFileList.stripSuffix("\n").split(", ").toList
Also, "ls -m ~".!! doesn't work (no shell is involved, so ~ is never expanded), splitting the output on commas breaks if filenames contain commas, s"ls -m $path".!! is asking for command injection, and new File(path).list() is better in every respect.
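The three approaches can be compared on a fixed string instead of real ls output. This is a sketch under the assumption that the listing looks like the example above with a trailing newline. The names are illustrative, and the split variant uses ",\\s*" rather than ", " on the further assumption that ls -m may wrap long listings onto several lines.

```scala
object CommaListing {
  // Greedy: a single match that swallows everything up to the last comma.
  def greedy(raw: String): List[String] =
    "(.*),".r.findAllIn(raw).matchData.map(_.group(1)).toList

  // Non-greedy: one match per comma, but the final entry has no comma and is lost.
  def reluctant(raw: String): List[String] =
    "(.*?),".r.findAllIn(raw).matchData.map(_.group(1)).toList

  // Splitting keeps every entry, including the last one.
  def bySplit(raw: String): List[String] =
    raw.trim.split(",\\s*").toList
}
```

Only bySplit returns all three names; reluctant also keeps the space that follows each comma, so its second entry is " test2.txt".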
I can see two problems with your initial approach. The first is that the * in your regex is greedy, which means it's sucking up as much as possible before reaching a comma, including other commas. If you change it to non-greedy by adding a ? (i.e. "(.*?),".r) it will only match up to the first comma.
The second problem is that there's no comma following the last file (naturally), so it won't be found by the regex. In your second approach you're getting all three files because there's a newline after each of them. If you want to stick with commas you'd be better off using split (e.g. rawFileList.split(",").map(_.trim), where the trim removes the space after each comma and the trailing newline).
You might also consider using the list or listFiles methods on java.io.File:
scala> val dir = new java.io.File(".")
dir: java.io.File = .
scala> dir.list
res0: Array[String] = Array(test, test1.txt, test2.txt)
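One detail worth guarding against: File.list() returns null when the path is not a directory or cannot be read. A minimal sketch of a wrapper follows; the names are hypothetical and the sorting is added only to make the result deterministic.

```scala
import java.io.File

object DirListing {
  // Option(...) turns the null that File.list() returns for a missing or
  // non-directory path into None, which then becomes an empty list.
  def fileNames(path: String): List[String] =
    Option(new File(path).list()).map(_.toList.sorted).getOrElse(Nil)
}
```

Unlike the ls-based version, this involves no external process and no output parsing. A literal ~ is not expanded here either, so the home directory would have to come from something like System.getProperty("user.home").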