Multiline regex capture in Scala - regex

I'm trying to capture the content from a multiline regex. It doesn't match.
val text = """<p>line1
line2</p>"""
val regex = """(?m)<p>(.*?)</p>""".r
var result = regex.findFirstIn(text).getOrElse("")
Returns empty.
I added the m flag for multiline mode, but it doesn't seem to help in this case.
If I remove the line break the regex works.
I also found this but couldn't get it working.
How do I match the content between the <p> elements? I want everything between, also the line breaks.
Thanks in advance!

If you want to activate dotall mode in Scala, you must use (?s) instead of (?m).
(?s) means the dot can match newlines.
(?m) means ^ and $ match at the beginning and end of each line.
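To make the difference between the two flags concrete, here is a minimal sketch. The object and value names (FlagDemo, lines, html) are made up for illustration; only the flags come from the answer above.

```scala
object FlagDemo {
  val lines = "line1\nline2"
  val html  = "<p>line1\nline2</p>"

  // Without flags, ^ and $ anchor only at the ends of the whole input.
  val noFlag = """^line\d$""".r.findAllIn(lines).toList

  // (?m) lets ^ and $ match at every line boundary.
  val multiline = """(?m)^line\d$""".r.findAllIn(lines).toList

  // (?m) does not change the dot, so the match cannot cross the newline.
  val withM = """(?m)<p>(.*?)</p>""".r.findFirstIn(html)

  // (?s) lets the dot match line terminators as well.
  val withS = """(?s)<p>(.*?)</p>""".r.findFirstIn(html)

  def main(args: Array[String]): Unit = {
    println(noFlag)          // List()
    println(multiline)       // List(line1, line2)
    println(withM.isDefined) // false
    println(withS.isDefined) // true
  }
}
```

The two flags are independent and can be combined as (?sm) when both behaviours are wanted.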

In case it's not obvious at this point, "How do I match the content":
scala> val regex = """(?s)<p>(.*?)</p>""".r
scala> (regex findFirstMatchIn text).get group 1
res52: String =
line1
line2
More idiomatically,
scala> text match { case regex(content) => content }
res0: String =
line1
line2
scala> val embedded = s"stuff${text}morestuff"
embedded: String =
stuff<p>line1
line2</p>morestuff
scala> val regex = """(?s)<p>(.*?)</p>""".r.unanchored
regex: scala.util.matching.UnanchoredRegex = (?s)<p>(.*?)</p>
scala> embedded match { case regex(content) => content }
res1: String =
line1
line2

Related

scala regex pattern giving matcherror

I have a string
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
the pattern
.*version_partition=(\d+)(.*)
is working as expected in regex101.com.
The requirement is to extract two strings: one is "8" (immediately after version_partition=) and the other is "/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz".
In the Scala REPL the same pattern gives scala.MatchError. I am new to regular expressions and not sure what I am doing wrong here. Please help.
The Scala code is
val P = """.*version_partition=(\d+)(.*)""".r
val P(ver,fileName) = path;
I have tried with /g and /m flag also. It didn't work.
Your code works: https://scalafiddle.io/sf/Xz1Y0Ze/0
You don't need the /g and /m flags.
/g ==> Perform a global match (find all matches rather than stopping after the first match)
/m ==> Perform multiline matching
Code:
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
val P = """.*version_partition=(\d+)(.*)""".r
val P(ver,fileName) = path;
println(ver)
println(fileName)
Try it using a match like this:
val path = "/bigdatahdfs/datalake/raw/lum/creditriskreporting/iffcollateral/year=2017/month=05/approach=firb/basel=3/version_partition=8/vFirbtestCollateralBaselIIIData_201705_8_20170620.txt.gz"
val P = """.*version_partition=(\d+)(.*)""".r
path match {
  case P(a, b) =>
    println(a)
    println(b)
}
Test
You accidentally added a white space at the end.
https://regex101.com/r/FLkZEu/2
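The .* at the beginning of the regex is unnecessary when you search for the pattern, as regex101 does. Note, however, that a Scala pattern-match extractor such as val P(ver, fileName) = path must match the whole string, so there you either keep the leading .* or use .r.unanchored.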
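Whether the leading .* matters depends on how the pattern is applied. The sketch below compares the three cases with a Scala extractor. The path is a shortened, hypothetical stand-in for the one in the question, and extract is a helper invented here so that a failed match shows up as None rather than a MatchError.

```scala
import scala.util.matching.Regex

object AnchoringDemo {
  // Shortened, hypothetical path standing in for the one in the question.
  val path = "/raw/version_partition=8/data_201705_8.txt.gz"

  // Runs the extractor and reports failure as None instead of a MatchError.
  def extract(p: Regex): Option[(String, String)] = path match {
    case p(ver, fileName) => Some((ver, fileName))
    case _                => None
  }

  // Anchored (the default): the pattern must cover the whole string.
  val withPrefix    = extract(""".*version_partition=(\d+)(.*)""".r)
  val withoutPrefix = extract("""version_partition=(\d+)(.*)""".r)

  // Unanchored: the pattern may match anywhere inside the string.
  val unanchored    = extract("""version_partition=(\d+)(.*)""".r.unanchored)

  def main(args: Array[String]): Unit = {
    println(withPrefix)    // Some((8,/data_201705_8.txt.gz))
    println(withoutPrefix) // None
    println(unanchored)    // Some((8,/data_201705_8.txt.gz))
  }
}
```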

Scala regex can't match "\r\n" in a giving string which contains multiple "\r\n" [duplicate]

I want to split abcd\r\nabc\r\nppp into (abcd\r\nabc, ppp) with the regex "(.*)\r\n(.*)".r,
but the match fails, as shown here:
object Regex extends App {
  val r = "(.*)\r\n(.*)".r
  val str = "abcd\r\nabc\r\nppp"
  str match {
    case r(a, b) =>
      println((a, b))
    case _ =>
      println("fail - ")
  }
}
The console prints fail - .
It works fine if I use the regex to match abcd\r\nppp. The code again:
object Regex extends App {
  val r = "(.*)\r\n(.*)".r
  val str = "abcd\r\nppp"
  str match {
    case r(a, b) =>
      println((a, b))
    case _ =>
      println("fail - ")
  }
}
Besides, I don't want to replace \r\n with other characters. That would waste computing resources, because this code runs in a performance-sensitive stage.
Thanks
Dot does not match \n by default (don't ask why - there is no reason, it just doesn't), so .* fails on the second \n.
You can change that by specifying a DOTALL flag to your regex. That's done by adding (?s) to the beginning of the pattern (don't ask how ?s came to stand for DOTALL ... there is a lot of mystery like this in regex world):
val r = "(?s)(.*)\r\n(.*)".r
val str = "abcd\r\nabc\r\nppp"
str match {
case r(a,b) => println(a -> b)
}
This prints
(abcd
abc,ppp)
If you want to split at the first \r\n rather than the last one, add ? to the first group:
val r = "(?s)(.*?)\r\n(.*)".r
This makes the wildcard non-greedy, so that it matches the shortest possible string rather than the longest, which is the default.
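The effect of (?s) and of the non-greedy quantifier can be checked side by side. This is a minimal sketch using the string from the question; the split helper and the object name are invented for illustration.

```scala
import scala.util.matching.Regex

object GreedyDemo {
  val str = "abcd\r\nabc\r\nppp"

  // Applies the extractor and returns None when the whole string does not match.
  def split(r: Regex): Option[(String, String)] = str match {
    case r(a, b) => Some((a, b))
    case _       => None
  }

  // No DOTALL: neither group can span a line break, so the match fails.
  val noDotall = split("(.*)\r\n(.*)".r)            // None

  // Greedy first group: splits at the last \r\n.
  val greedy = split("(?s)(.*)\r\n(.*)".r)          // Some(("abcd\r\nabc", "ppp"))

  // Non-greedy first group: splits at the first \r\n.
  val reluctant = split("(?s)(.*?)\r\n(.*)".r)      // Some(("abcd", "abc\r\nppp"))

  def main(args: Array[String]): Unit = {
    println(noDotall.isDefined)
    println(greedy.map(_._2))
    println(reluctant.map(_._1))
  }
}
```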

Matching but not capture a string in Swift Regex

I'm trying to search for a single plain quote mark (') in a given String and replace it with a single curved quote mark (’). I have tested several patterns, but every time the match also includes the adjacent text. For example, in the string "I'm", it matches the "I" and the "m" along with the ' mark.
(?:\\S)'(?:\\S)
Is there a way to achieve this, or does the Swift implementation of regex not support non-capturing groups?
EDIT:
Example
let startingString = "I'm"
let myPattern = "(?:\\S)(')(?:\\S)"
let mySubstitutionText = "’"
let result = (applyReg(startingString, pattern: myPattern, substitutionText: mySubstitutionText))
func applyReg(startingString: String, pattern: String, substitutionText: String) -> String {
    var newStr = startingString
    if let regex = try? NSRegularExpression(pattern: pattern, options: .CaseInsensitive) {
        // NSRange counts UTF-16 code units; the template is the substitution text, not the input.
        let regStr = regex.stringByReplacingMatchesInString(startingString, options: .WithoutAnchoringBounds, range: NSMakeRange(0, startingString.utf16.count), withTemplate: substitutionText)
        newStr = regStr
    }
    return newStr
}
In regex, you can use lookarounds to achieve this behavior:
let myPattern = "(?<=\\S)'(?=\\S)"
See the regex demo
Lookarounds do not consume the text they match; they just return true or false so that the regex engine can decide what to do with the currently matched text. If the condition is met, the regex pattern is evaluated further, and if not, the match fails.
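The difference between consuming and non-consuming groups is easy to see in a replacement. The sketch below uses Scala (java.util.regex) rather than NSRegularExpression, on the assumption that the lookaround syntax of this particular pattern behaves the same in both engines; the object and value names are invented for illustration.

```scala
object QuoteDemo {
  val curly = "\u2019" // the curved quote mark

  // Non-capturing groups still consume the neighbours, so they get replaced too.
  val consuming = """(?:\S)'(?:\S)""".r.replaceAllIn("I'm", curly)

  // Lookarounds only assert the neighbours, so just the quote is replaced.
  val withLookaround = """(?<=\S)'(?=\S)""".r.replaceAllIn("I'm", curly)

  // Capturing alternative: consume the neighbours, then put them back.
  val withGroups = """(\S)'(\S)""".r.replaceAllIn("I'm", "$1" + curly + "$2")

  def main(args: Array[String]): Unit = {
    println(consuming)      // ’
    println(withLookaround) // I’m
    println(withGroups)     // I’m
  }
}
```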
However, using capturing seems quite valid here, do not discard that approach.
Put your quote in a capture group by itself:
(?:\\S)(')(?:\\S)
For example, when matching against "I'm", the overall match is still "I'm", but capture group 1 holds only "'".

Regex is not matching in Scala

I want to split up a camelCase string with spaces.
"ClassicalMusicArtist" -> "Classical Music Artist"
I should be able to do this by replacing "/([a-z](?=[A-Z]))/g" with "$1 " (regex101).
But my regex is not getting any matches:
val regex = "/([a-z](?=[A-Z]))/g".r
val s = "ClassicalMusicArtist"
regex.replaceAllIn(s, "$1 ") // -> Returns "ClassicalMusicArtist"
regex.findFirstIn(s) // -> Returns None
What am I doing wrong? I used the regex in another language with success and can't figure out why I am not getting any matches.
OK, I figured it out.
In Scala the regex has to be val regex = "([a-z](?=[A-Z]))".r, without the surrounding / delimiters and without the g modifier.
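For completeness, a minimal sketch of the working version. In Scala the "global" behaviour of /g comes from calling replaceAllIn (or findAllIn) rather than from a flag. The object and method names are invented for illustration, and the second variant is an alternative that was not part of the original answer.

```scala
object CamelCaseDemo {
  // The pattern from the answer: a lowercase letter followed by an uppercase one.
  val boundary = "([a-z](?=[A-Z]))".r

  // replaceAllIn replaces every match, which is what /g does elsewhere.
  def spaced(s: String): String = boundary.replaceAllIn(s, "$1 ")

  // Alternative sketch: match only the empty gap, so no back-reference is needed.
  val gap = "(?<=[a-z])(?=[A-Z])".r
  def spacedByGap(s: String): String = gap.replaceAllIn(s, " ")

  def main(args: Array[String]): Unit = {
    println(spaced("ClassicalMusicArtist"))      // Classical Music Artist
    println(spacedByGap("ClassicalMusicArtist")) // Classical Music Artist
  }
}
```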

Scala regex multiline match with negative lookahead

I'm writing a DSL using Scala's parser combinators. I have recently changed my base class from StandardTokenParsers to JavaTokenParsers to take advantage of the regex features I think I need for one last piece of the puzzle. (see Parsing a delimited multiline string using scala StandardTokenParser)
What I am trying to do is to extract a block of text delimited by some characters ({{ and }} in this example). This block of text can span multiple lines. What I have so far is:
def docBlockRE = regex("""(?s)(?!}}).*""".r)
def docBlock: Parser[DocString] =
  "{{" ~> docBlockRE <~ "}}" ^^ { case str => new DocString(str) }
where DocString is a case class in my DSL. However, this doesn't work. It fails if I feed it the following:
{{
abc
}}
{{
abc
}}
I'm not sure why this fails. If I put a debug wrapper around the parser (http://jim-mcbeath.blogspot.com/2011/07/debugging-scala-parser-combinators.html) I get the following:
docBlock.apply for token
at position 10.2 offset 165 returns [19.1] failure: `}}' expected but end of source found
If I try a single delimited block with multiple lines:
{{
abc
def
}}
then it also fails to parse with:
docBlock.apply for token
at position 10.2 offset 165 returns [16.1] failure: `}}' expected but end of source found
If I remove the DOTALL directive (?s) then I can parse multiple single-line blocks (which doesn't really help me much).
Is there any way of combining multi-line regex with negative lookahead?
One other issue I have with this approach is that, no matter what I do, the closing delimiter must be on a separate line from the text. Otherwise I get the same kind of error message I see above. It is almost like the negative lookahead isn't really working as I expect it to.
In context:
scala> val rr = """(?s).*?(?=}})""".r
rr: scala.util.matching.Regex = (?s).*?(?=}})
scala> object X extends JavaTokenParsers {val r: Parser[String] = rr; val b: Parser[String] = "{{" ~>r<~"}}" ^^ { case s => s } }
defined object X
scala> X.parseAll(X.b, """{{ abc
| def
| }}""")
res15: X.ParseResult[String] =
[3.3] parsed: abc
def
More to show difference in greed:
scala> val rr = """(?s)(.*?)(?=}})""".r.unanchored
rr: scala.util.matching.UnanchoredRegex = (?s)(.*?)(?=}})
scala> def f(s: String) = s match { case rr(x) => x case _ => "(none)" }
f: (s: String)String
scala> f("something }} }}")
res3: String = "something "
scala> val rr = """(?s)(.*)(?=}})""".r.unanchored
rr: scala.util.matching.UnanchoredRegex = (?s)(.*)(?=}})
scala> def f(s: String) = s match { case rr(x) => x case _ => "(none)" }
f: (s: String)String
scala> f("something }} }}")
res4: String = "something }} "
The lookahead just means "make sure this follows me, but don't consume it."
Negative lookahead just means "make sure this doesn't follow me."
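This also explains why the pattern in the question fails: in (?s)(?!}}).* the negative lookahead is tested only once, at the starting position, and then .* consumes everything up to the end of the input, leaving no }} for the parser. The sketch below shows this with findPrefixOf, on the assumption that a regex parser matches its pattern as a prefix of the remaining input. The value rest (roughly what remains after the first {{) and the other names are invented for illustration.

```scala
object LookaheadDemo {
  // Roughly the text that remains once the first "{{" has been consumed.
  val rest = "\nabc\n}}\n{{\nabc\n}}"

  // Lookahead checked once at the start, then .* runs to the end of the input.
  val original = """(?s)(?!\}\}).*""".r.findPrefixOf(rest)

  // Lookahead re-checked before every character: stops right before "}}".
  val tempered = """(?s)(?:(?!\}\}).)*""".r.findPrefixOf(rest)

  // Lazy dot with a positive lookahead: the same result as the tempered form here.
  val lazyDot = """(?s).*?(?=\}\})""".r.findPrefixOf(rest)

  def main(args: Array[String]): Unit = {
    println(original.contains(rest))     // true: the closing }} was swallowed
    println(tempered.contains("\nabc\n")) // true
    println(lazyDot.contains("\nabc\n"))  // true
  }
}
```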
To match {{the entire bracket}}, use this regex:
(?s)\{\{.*?\}\}
See the matches in the demo.
To match {{inside the brackets}}, use this:
(?s)(?<=\{\{).*?(?=\}\})
See the matches in the demo.
Explanation
(?s) activates DOTALL mode, allowing the dot to match across lines
The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as much as necessary. Without the ?, the .* will grab the longest match, first matching the whole string then backtracking only as far as needed to allow the next token to match.
(?<=\{\{) is a lookbehind that asserts that what precedes is {{
(?=\}\}) is a lookahead that asserts that what follows is }}
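The two patterns above can be tried in Scala with findAllIn. This is a minimal sketch; the input string and the names are invented for illustration, and the greedy variant is included only to show what the lazy ? prevents.

```scala
object BlockDemo {
  val input = "{{\nabc\n}}\n{{\ndef\n}}"

  // Whole blocks, delimiters included.
  val blocks = """(?s)\{\{.*?\}\}""".r.findAllIn(input).toList

  // Only the text between the delimiters, via lookbehind and lookahead.
  val contents = """(?s)(?<=\{\{).*?(?=\}\})""".r.findAllIn(input).toList

  // Greedy variant: runs from the first {{ to the last }}.
  val greedy = """(?s)\{\{.*\}\}""".r.findAllIn(input).toList

  def main(args: Array[String]): Unit = {
    println(blocks.size)   // 2
    println(contents.size) // 2
    println(greedy.size)   // 1
  }
}
```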
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind