Scala regex multiline match with negative lookahead - regex

I'm writing a DSL using Scala's parser combinators. I have recently changed my base class from StandardTokenParsers to JavaTokenParsers to take advantage of the regex features I think I need for one last piece of the puzzle. (see Parsing a delimited multiline string using scala StandardTokenParser)
What I am trying to do is to extract a block of text delimited by some characters ({{ and }} in this example). This block of text can span multiple lines. What I have so far is:
def docBlockRE = regex("""(?s)(?!}}).*""".r)
def docBlock: Parser[DocString] =
"{{" ~> docBlockRE <~ "}}" ^^ { case str => new DocString(str) }
where DocString is a case class in my DSL. However, this doesn't work. It fails if I feed it the following:
{{
abc
}}
{{
abc
}}
I'm not sure why this fails. If I put a debug wrapper around the parser (http://jim-mcbeath.blogspot.com/2011/07/debugging-scala-parser-combinators.html) I get the following:
docBlock.apply for token
at position 10.2 offset 165 returns [19.1] failure: `}}' expected but end of source found
If I try a single delimited block with multiple lines:
{{
abc
def
}}
then it also fails to parse with:
docBlock.apply for token
at position 10.2 offset 165 returns [16.1] failure: `}}' expected but end of source found
If I remove the DOTALL directive (?s) then I can parse multiple single-line blocks (which doesn't really help me much).
Is there any way of combining multi-line regex with negative lookahead?
One other issue I have with this approach is that, no matter what I do, the closing delimiter must be on a separate line from the text. Otherwise I get the same kind of error message I see above. It is almost like the negative lookahead isn't really working as I expect it to.

In context:
scala> val rr = """(?s).*?(?=}})""".r
rr: scala.util.matching.Regex = (?s).*?(?=}})
scala> object X extends JavaTokenParsers {val r: Parser[String] = rr; val b: Parser[String] = "{{" ~>r<~"}}" ^^ { case s => s } }
defined object X
scala> X.parseAll(X.b, """{{ abc
| def
| }}""")
res15: X.ParseResult[String] =
[3.3] parsed: abc
def
More to show difference in greed:
scala> val rr = """(?s)(.*?)(?=}})""".r.unanchored
rr: scala.util.matching.UnanchoredRegex = (?s)(.*?)(?=}})
scala> def f(s: String) = s match { case rr(x) => x case _ => "(none)" }
f: (s: String)String
scala> f("something }} }}")
res3: String = "something "
scala> val rr = """(?s)(.*)(?=}})""".r.unanchored
rr: scala.util.matching.UnanchoredRegex = (?s)(.*)(?=}})
scala> def f(s: String) = s match { case rr(x) => x case _ => "(none)" }
f: (s: String)String
scala> f("something }} }}")
res4: String = "something }} "
The lookahead just means "make sure this follows me, but don't consume it."
Negative lookahead just means make sure it doesn't follow me.
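To make this concrete for the pattern in the question: in (?s)(?!}}).* the negative lookahead is evaluated only once, at the position where the match starts. After that the greedy .* is free to run past every }} to the end of the input, which would explain the "`}}' expected but end of source found" failure. Below is a minimal sketch using plain scala.util.matching.Regex outside the parser. The alternative pattern (?:(?!}}).)* and the names LookaheadDemo, checkedOnce and checkedEachChar are illustrative assumptions, not something from the question.

```scala
import scala.util.matching.Regex

object LookaheadDemo {
  // Lookahead is checked once at the start; the greedy .* then consumes everything.
  val checkedOnce: Regex = """(?s)(?!}}).*""".r
  // Lookahead is checked before every single character, so matching stops before }}.
  val checkedEachChar: Regex = """(?s)(?:(?!}}).)*""".r

  // What is left of the input after the opening {{ has been consumed.
  val body: String = "\nabc\ndef\n}}\nrest"

  val greedy: Option[String] = checkedOnce.findPrefixOf(body)
  val tempered: Option[String] = checkedEachChar.findPrefixOf(body)

  def main(args: Array[String]): Unit = {
    println(greedy == Some(body))   // true: the closing }} was swallowed
    println(tempered)               // the text up to, but not including, }}
  }
}
```

The lazy .*?(?=}}) shown above gives the same result as the per-character lookahead and is usually easier to read.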

To match {{the entire bracket}}, use this regex:
(?s)\{\{.*?\}\}
See the matches in the demo.
To match {{inside the brackets}}, use this:
(?s)(?<=\{\{).*?(?=\}\})
See the matches in the demo.
Explanation
(?s) activates DOTALL mode, allowing the dot to match across lines
The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as much as necessary. Without the ?, the .* will grab the longest match, first matching the whole string then backtracking only as far as needed to allow the next token to match.
(?<=\{\{) is a lookbehind that asserts that what precedes is {{
(?=\}\}) is a lookahead that asserts that what follows is }}
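The two patterns above can be tried on the input from the question with findAllIn. This is only a sketch; the object and value names (BracketDemo, whole, inner) are made up for the illustration.

```scala
object BracketDemo {
  // Two delimited blocks, each spanning several lines.
  val text: String = "{{\nabc\n}}\n{{\ndef\n}}"

  // Matches each block including its {{ and }} delimiters.
  val whole = """(?s)\{\{.*?\}\}""".r
  // Matches only the text between the delimiters.
  val inner = """(?s)(?<=\{\{).*?(?=\}\})""".r

  val wholeMatches: List[String] = whole.findAllIn(text).toList
  val innerMatches: List[String] = inner.findAllIn(text).toList

  def main(args: Array[String]): Unit = {
    println(wholeMatches.length) // 2
    println(innerMatches.map(_.trim)) // List(abc, def)
  }
}
```

Note that the inner matches keep the newlines that sit directly after {{ and before }}, so a trim may be wanted afterwards.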
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

Related

Scala Anchored Regex acts as unanchored

So for some reason in Scala 2.11, my anchored regex patterns act as unanchored regex patterns.
scala> """something\.com""".r.anchored findFirstIn "app.something.com"
res66: Option[String] = Some(something.com)
scala> """^.something\.com$""".r.anchored findFirstIn "app.something.com"
res65: Option[String] = None
I thought the first expression would evaluate as None like the second (manually entered anchors) but it does not.
Any help would be appreciated.
The findFirstIn method un-anchors the regex automatically.
You can see that the example code also matches A only:
Example:
"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"
BTW, once you create a regex like "pattern".r, it is anchored by default, but that only matters when you use the regex in a match block. Inside findAllIn or findFirstIn, this type of anchoring is simply ignored.
So, to make sure the regex matches the whole string, always add ^ and $ (or \A and \z) anchors if you are not sure where you are going to use the regexes.
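A small sketch of the three cases, using the string from the question. The names AnchorDemo, re, found and so on are just for the illustration.

```scala
object AnchorDemo {
  val re = """something\.com""".r

  // findFirstIn searches anywhere in the input, anchored or not.
  val found: Option[String] = re.findFirstIn("app.something.com")

  // Explicit ^ and $ force a whole-string match even with findFirstIn.
  val foundWithAnchors: Option[String] =
    """^something\.com$""".r.findFirstIn("app.something.com")

  // In a match block the default anchoring applies, so this does not match.
  val viaMatch: Boolean = "app.something.com" match {
    case re() => true
    case _    => false
  }

  def main(args: Array[String]): Unit = {
    println(found)            // Some(something.com)
    println(foundWithAnchors) // None
    println(viaMatch)         // false
  }
}
```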
I think, it is only supposed to work with match:
val reg = "o".r.anchored
"foo" match {
case reg() => "Yes!"
case _ => "No!"
}
... returns "No!".
This doesn't seem very useful, because just "o".r is anchored by default anyway. The only use of this I can imagine is if you made some regex unanchored (by accident? :)) and then want to undo it, or if you want to match both cases, but separately:
val reg = "o".r.unanchored
val anchoredReg = reg.anchored // extractor patterns need a stable identifier
"foo" match {
case anchoredReg() => "Anchored!"
case reg() => "Unanchored"
case _ => "I dunno"
}

Scala regex match failing with scala.MatchError for \w and \d (word or digit) match

I was trying some basic regex pattern matching. Although my syntax seems to be correct, it's failing when I use \w or \d for word and digit matching.
import scala.util.matching.Regex
object ex {
def main(args:Array[String]):Unit = {
val pattern = new Regex("(\\w)\\s(\\d)");
val pattern(words,num) = "asas1 11"
print(words+" "+num)
}
}
This is the error I get:
Exception in thread "main" scala.MatchError: asas1 11 (of class java.lang.String)
at com.cccu.semantic.ex$.main(ex.scala:8)
at com.cccu.semantic.ex.main(ex.scala)
Note: I am using the Scala IDE build of the Eclipse SDK, Build ID 4.4.1 with Scala 2.11.8 on a Windows machine.
\w and \d each match a single character, so you need to add the + quantifier to match one or more. The exception is thrown because the input cannot be matched against your regular expression.
scala> val pattern = new Regex("(\\w+)\\s(\\d+)"); val pattern(words,num) = "asas1 11"
pattern: scala.util.matching.Regex = (\w+)\s(\d+)
words: String = asas1
num: String = 11
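If the input might not match, a match block with a fallback case avoids the MatchError that a val pattern definition throws. A minimal sketch; the helper name parse and the Option return type are choices made for this example, not part of the original code.

```scala
object WordDigitDemo {
  val pattern = """(\w+)\s(\d+)""".r

  // Returns the two captured groups, or None when the input does not match.
  def parse(s: String): Option[(String, String)] = s match {
    case pattern(words, num) => Some((words, num))
    case _                   => None
  }

  def main(args: Array[String]): Unit = {
    println(parse("asas1 11")) // Some((asas1,11))
    println(parse("nope"))     // None
  }
}
```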

scala matching optional set of characters

I am using a Scala regex to extract a token from a URL.
My URL is http://www.google.com?x=10&id=x10_23&y=2
Here I want to extract the value x10 that follows id=. Note that _23 is optional and may or may not appear, but if it appears it must be removed.
The regex which I have written is
val regex = "^.*id=(.*)(\\_\\d+)?.*$".r
x match {
case regex(id) => print(id)
case _ => print("none")
}
this should work because (\\_\\d+)? should make the _23 optional as a whole.
So I don't understand why it prints none.
Note that your pattern ^.*id=(.*)(\\_\\d+)?.*$ actually puts x10_23&y=2 into Group 1 because of the 1st greedy dot matching subpattern. Since (_\d+)? is optional, the first greedy subpattern does not have to yield any characters to that capture group.
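The capture can be checked directly by binding both groups. This is a sketch with illustrative names (GreedyGroupDemo, original, groups). It also suggests a second reason for the "none" output: the original pattern has two capture groups, so an extractor with a single binder such as regex(id) cannot match it.

```scala
object GreedyGroupDemo {
  val original = "^.*id=(.*)(\\_\\d+)?.*$".r
  val url = "http://www.google.com?x=10&id=x10_23&y=2"

  // Bind both groups; an optional group that did not take part is null.
  val groups: Option[(String, Option[String])] = url match {
    case original(id, suffix) => Some((id, Option(suffix)))
    case _                    => None
  }

  def main(args: Array[String]): Unit = {
    println(groups) // Some((x10_23&y=2,None))
  }
}
```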
You can use
val regex = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
val x = "http://www.google.com?x=10&id=x10_23&y=2"
x match {
case regex(id) => print(id)
case _ => print("none")
}
See the IDEONE demo (regex demo)
Note that there is no need defining ^ and $ - that pattern is anchored in Scala by default. (?s) ensures we match the full input string even if it contains newline symbols.
Another idea, instead of using a regular expression to extract tokens, would be to use the built-in Java URI class with its getQuery() method. You can then split the query on & and check whether one of the pairs starts with id=, and extract the value from it.
For instance (just as an example):
import java.net.URI
val x = "http://www.google.com?x=10&id=x10_23&y=2"
val uri = new URI(x)
uri.getQuery.split('&').find(_.startsWith("id=")) match {
// drop the optional trailing _<digits> suffix from the value
case Some(param) => println(param.split('=')(1).replaceAll("_\\d+$", ""))
case None => println("None")
}
I find it simpler to maintain than the regular expression you have, but that's just my thought!

Multiline regex capture in Scala

I'm trying to capture the content from a multiline regex. It doesn't match.
val text = """<p>line1
line2</p>"""
val regex = """(?m)<p>(.*?)</p>""".r
var result = regex.findFirstIn(text).getOrElse("")
Returns empty.
I put the m - flag for multiline but it doesn't seem to help in this case.
If I remove the line break the regex works.
I also found this but couldn't get it working.
How do I match the content between the <p> elements? I want everything between, also the line breaks.
Thanks in advance!
If you want to activate the dotall mode in scala, you must use (?s) instead of (?m)
(?s) means the dot can match newlines
(?m) means ^ and $ stand for beginning and end of lines
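A quick side-by-side of the two flags, using the text from the question. The names FlagDemo, withM, withS and lineStarts are only for this sketch.

```scala
object FlagDemo {
  val text: String = "<p>line1\nline2</p>"

  // (?m) does not change the dot, so it still stops at the newline.
  val withM: Option[String] = """(?m)<p>(.*?)</p>""".r.findFirstIn(text)
  // (?s) lets the dot match the newline as well.
  val withS: Option[String] = """(?s)<p>(.*?)</p>""".r.findFirstIn(text)
  // What (?m) is actually for: ^ matches at the start of every line.
  val lineStarts: List[String] = """(?m)^line\d""".r.findAllIn("line1\nline2").toList

  def main(args: Array[String]): Unit = {
    println(withM)      // None
    println(withS.isDefined) // true
    println(lineStarts) // List(line1, line2)
  }
}
```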
In case it's not obvious at this point, "How do I match the content":
scala> val regex = """(?s)<p>(.*?)</p>""".r
scala> (regex findFirstMatchIn text).get group 1
res52: String =
line1
line2
More idiomatically,
scala> text match { case regex(content) => content }
res0: String =
line1
line2
scala> val embedded = s"stuff${text}morestuff"
embedded: String =
stuff<p>line1
line2</p>morestuff
scala> val regex = """(?s)<p>(.*?)</p>""".r.unanchored
regex: scala.util.matching.UnanchoredRegex = (?s)<p>(.*?)</p>
scala> embedded match { case regex(content) => content }
res1: String =
line1
line2

Scala Regex Multiple Block Capturing

I'm trying to capture parts of a multi-lined string with a regex in Scala.
The input is of the form:
val input = """some text
|begin {
| content to extract
| content to extract
|}
|some text
|begin {
| other content to extract
|}
|some text""".stripMargin
I've tried several possibilities that should get me the text out of the begin { } blocks. One of them:
val Block = """(?s).*begin \{(.*)\}""".r
input match {
case Block(content) => println(content)
case _ => println("NO MATCH")
}
I get a NO MATCH. If I drop the \} the regex looks like (?s).*begin \{(.*) and it matches the last block including the unwanted } and "some text". I checked my regex at rubular.com as with /.*begin \{(.*)\}/m and it matches at least one block. I thought when my Scala regex would match the same I could start using findAllIn to match all blocks. What am I doing wrong?
I had a look at Scala Regex enable Multiline option but I could not manage to capture all the occurrences of the text blocks in, for example, a Seq[String].
Any help is appreciated.
As Alex has said, when using pattern matching to extract fields from regular expressions, the pattern acts as if it was bounded (ie, using ^ and $). The usual way to avoid this problem is to use findAllIn first. This way:
val input = """some text
|begin {
| content to extract
| content to extract
|}
|some text
|begin {
| other content to extract
|}
|some text""".stripMargin
val Block = """(?s)begin \{(.*)\}""".r
Block findAllIn input foreach (_ match {
case Block(content) => println(content)
case _ => println("NO MATCH")
})
Otherwise, you can use .* at the beginning and end to get around that restriction:
val Block = """(?s).*begin \{(.*)\}.*""".r
input match {
case Block(content) => println(content)
case _ => println("NO MATCH")
}
By the way, you probably want a non-eager matcher:
val Block = """(?s)begin \{(.*?)\}""".r
Block findAllIn input foreach (_ match {
case Block(content) => println(content)
case _ => println("NO MATCH")
})
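Since the question asks for the blocks as a Seq[String], one possible sketch is to use findAllMatchIn and pull out group 1 of each match. The trim call and the names BlockDemo and blocks are additions for this example.

```scala
object BlockDemo {
  val input: String =
    """some text
      |begin {
      |  content to extract
      |  content to extract
      |}
      |some text
      |begin {
      |  other content to extract
      |}
      |some text""".stripMargin

  // Lazy .*? so each match ends at the first closing brace.
  val Block = """(?s)begin \{(.*?)\}""".r

  // One entry per begin { ... } block, with surrounding whitespace removed.
  val blocks: Seq[String] = Block.findAllMatchIn(input).map(_.group(1).trim).toList

  def main(args: Array[String]): Unit = {
    println(blocks.length) // 2
    blocks.foreach(println)
  }
}
```

This stops at the first } after each begin {, so it would not handle nested braces inside a block.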
When doing a match, I believe a full match is implicitly required. Your match is equivalent to:
val Block = """^(?s).*begin \{(.*)\}$""".r
It works if you add .* to the end:
val Block = """(?s).*begin \{(.*)\}.*""".r
I haven't been able to find any documentation on this, but I have encountered this same issue.
As a complement to the other answers, I wanted to point out the existence of kantan.regex, which lets you write the following:
import kantan.regex.ops._
// The type parameter is the type as which to decode results,
// the value parameters are the regular expression to apply and the group to
// extract data from.
input.evalRegex[String]("""(?s)begin \{(.*?)\}""", 1).toList
This yields:
List(Success(
content to extract
content to extract
), Success(
other content to extract
))