Scala Regex Multiple Block Capturing - regex

I'm trying to capture parts of a multi-lined string with a regex in Scala.
The input is of the form:
val input = """some text
|begin {
| content to extract
| content to extract
|}
|some text
|begin {
| other content to extract
|}
|some text""".stripMargin
I've tried several possibilities that should get me the text out of the begin { } blocks. One of them:
val Block = """(?s).*begin \{(.*)\}""".r
input match {
case Block(content) => println(content)
case _ => println("NO MATCH")
}
I get a NO MATCH. If I drop the \} the regex looks like (?s).*begin \{(.*) and it matches the last block including the unwanted } and "some text". I checked my regex at rubular.com as with /.*begin \{(.*)\}/m and it matches at least one block. I thought when my Scala regex would match the same I could start using findAllIn to match all blocks. What am I doing wrong?
I had a look at Scala Regex enable Multiline option but I could not manage to capture all the occurrences of the text blocks in, for example, a Seq[String].
Any help is appreciated.

As Alex has said, when using pattern matching to extract fields from regular expressions, the pattern acts as if it was bounded (ie, using ^ and $). The usual way to avoid this problem is to use findAllIn first. This way:
val input = """some text
|begin {
| content to extract
| content to extract
|}
|some text
|begin {
| other content to extract
|}
|some text""".stripMargin
val Block = """(?s)begin \{(.*)\}""".r
Block findAllIn input foreach (_ match {
case Block(content) => println(content)
case _ => println("NO MATCH")
})
Otherwise, you can use .* at the beginning and end to get around that restriction:
val Block = """(?s).*begin \{(.*)\}.*""".r
input match {
case Block(content) => println(content)
case _ => println("NO MATCH")
}
By the way, you probably want a non-eager matcher:
val Block = """(?s)begin \{(.*?)\}""".r
Block findAllIn input foreach (_ match {
case Block(content) => println(content)
case _ => println("NO MATCH")
})

When doing a match, I believe there is a full match implicity required. Your match is equivalent to:
val Block = """^(?s).*begin \{(.*)\}$""".r
It works if you add .* to the end:
val Block = """(?s).*begin \{(.*)\}.*""".r
I haven't been able to find any documentation on this, but I have encountered this same issue.

As a complement to the other answers, I wanted to point out the existence of kantan.regex, which lets you write the following:
import kantan.regex.ops._
// The type parameter is the type as which to decode results,
// the value parameters are the regular expression to apply and the group to
// extract data from.
input.evalRegex[String]("""(?s)begin \{(.*?)\}""", 1).toList
This yields:
List(Success(
content to extract
content to extract
), Success(
other content to extract
))

Related

Scala pattern match regex

In my scala program, I want to use a pattern match to test whether there is a valid .csv file in the input path.
path ="\DAP\TestData\test01.csv"
val regex=""".csv$""".r.unanchored
I tried to use the previous regex to match the string, it worked, but when it went to match pattern, it cannot work.
path ="\DAP\TestData\test01.csv"
val regex="""\.csv$""".r.unanchored
path match {
case regex(type) =>println(s"$type matched")
case _ =>println("something else happeded")
}
I need to successfully print information like ".csv matched".
Could anyone help me with this issue? I m really confused by this issue.
Thanks
It's not clear which part of the path you want to capture and report. But in any case you'll probably want a capture group in the regex pattern.
val path = raw"\DAP\TestData\test01.csv"
val re = """(.*\.csv)$""".r.unanchored
path match {
case re(typ) => println(s"$typ matched") //"\DAP\TestData\test01.csv matched"
case _ => println("something else happened")
}
You can also use the capture group to capture any one of many different target patterns.
val re = ".*\\.((?i:json|xml|csv))$".r
raw"\root\test31.XML" match {
case re(ext) => println(s"$ext matched") //"XML matched"
case _ => println("something else happeded")
}
You can try it like this:
val regex = """(\.csv)$""".r.unanchored
path match {
case regex(fileType) => println(s"$fileType matched")
case _ => println("something else happeded")
}

scala regex : extract file extension

val file_name="D:/folder1/folder2/filename.ext" //filename
val reg_ex = """(.*?).(\\\\w*$)""".r //regex pattern
file_name match {
case reg_ex(one , two) =>s"$two is extension"
case _ => println(" file_reg_ex none")
}
I want to extract file extension i.e."ext" from the above using scala regex , using match & case.
I am using above regex and it is going into not match case.
Any pointers to regex tutorials will be helpful.
A few minor adjustments.
val reg_ex = """.*\.(\w+)""".r
file_name match {
case reg_ex(ext) =>s"$ext is extension"
case _ => println("file_reg_ex none"); ""
}
Only one capture group needed. Ignore everything before the final dot, \. (escaped so it's a dot and not an "any char") and capture the rest.
The default, case _, should do more than println. It should return the same type as the match.

Scala Anchored Regex acts as unachored

So for some reason in Scala 2.11, my anchored regex patterns act as unanchored regex patterns.
scala> """something\.com""".r.anchored findFirstIn "app.something.com"
res66: Option[String] = Some(something.com)
scala> """^.something\.com$""".r.anchored findFirstIn "app.something.com"
res65: Option[String] = None
I thought the first expression would evaluate as None like the second (manually entered anchors) but it does not.
Any help would be appreciated.
The findFirstIn method un-anchors the regex automatically.
You can see that the example code also matches A only:
Example:
"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"
BTW, once you create a regex like "pattern".r, it is anchored by default, but that only matters when you use the regex in a match block. Inside the FindAllIn or FindFirstIn, this type of anchoring is just ignored.
So, to make sure the regex matches the whole string, always add ^ and $ (or \A and \z) anchors if you are not sure where you are going to use the regexes.
I think, it is only supposed to work with match:
val reg = "o".r.anchored
"foo" match {
case reg() => "Yes!"
case _ => "No!"
}
... returns "No!".
This doesn't seem very useful, because just "o".r is anchored by default anyway. The only use of this I can imagine is if you made some unanchored (by accident? :)), and then want to undo it, or if you just want to match both cases, but s
eparately:
val reg = "o".r.unanchored
"foo" match {
case reg.anchored() => "Anchored!
case reg() => "Unanchored"
case _ => "I dunno"
}

scala matching optional set of characters

I am using scala regex to extract a token from a URL
my url is http://www.google.com?x=10&id=x10_23&y=2
here I want to extract the value of x10 in front of id. note that _23 is optional and may or may not appear but if it appears it must be removed.
The regex which I have written is
val regex = "^.*id=(.*)(\\_\\d+)?.*$".r
x match {
case regex(id) => print(id)
case _ => print("none")
}
this should work because (\\_\\d+)? should make the _23 optional as a whole.
So I don't understand why it prints none.
Note that your pattern ^.*id=(.*)(\\_\\d+)?.*$ actually puts x10_23&y=2 into Group 1 because of the 1st greedy dot matching subpattern. Since (_\d+)? is optional, the first greedy subpattern does not have to yield any characters to that capture group.
You can use
val regex = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
val x = "http://www.google.com?x=10&id=x10_23&y=2"
x match {
case regex(id) => print(id)
case _ => print("none")
}
See the IDEONE demo (regex demo)
Note that there is no need defining ^ and $ - that pattern is anchored in Scala by default. (?s) ensures we match the full input string even if it contains newline symbols.
Another idea instead of using a regular expression to extract tokens would be to use the built-in URI Java class with its getQuery() method. There you can split the query by = and then check if one of the pair starts with id= and extract the value.
For instance (just as an example):
val x = "http://www.google.com?x=10&id=x10_23&y=2"
val uri = new URI(x)
uri.getQuery.split('&').find(_.startsWith("id=")) match {
case Some(param) => println(param.split('=')(1).replace("_23", ""))
case None => println("None")
}
I find it simpler to maintain that the regular expression you have, but that's just my thought!

Scala regex multiline match with negative lookahead

I'm writing a DSL using Scala's parser combinators. I have recently changed my base class from StandardTokenParsers to JavaTokenParsers to take advantage of the regex features I think I need for one last piece of the puzzle. (see Parsing a delimited multiline string using scala StandardTokenParser)
What I am trying to do is to extract a block of text delimited by some characters ({{ and }} in this example). This block of text can span multiple lines. What I have so far is:
def docBlockRE = regex("""(?s)(?!}}).*""".r)
def docBlock: Parser[DocString] =
"{{" ~> docBlockRE <~ "}}" ^^ { case str => new DocString(str) }}
where DocString is a case class in my DSL. However, this doesn't work. It fails if I feed it the following:
{{
abc
}}
{{
abc
}}
I'm not sure why this fails. If I put a Deubg wrapper around have a debug wrapper around the parser (http://jim-mcbeath.blogspot.com/2011/07/debugging-scala-parser-combinators.html) I get the following:
docBlock.apply for token
at position 10.2 offset 165 returns [19.1] failure: `}}' expected but end of source found
If I try a single delimited block with multiple lines:
{{
abc
def
}}
then it also fails to parse with:
docBlock.apply for token
at position 10.2 offset 165 returns [16.1] failure: `}}' expected but end of source found
If I remove the DOTALL directive (?s) then I can parse multiple single-line blocks (which doesn't really help me much).
Is there any way of combining multi-line regex with negative lookahead?
One other issue I have with this approach is that, no matter what I do, the closing delimiter must be on a separate line from the text. Otherwise I get the same kind of error message I see above. It is almost like the negative lookahead isn't really working as I expect it to.
In context:
scala> val rr = """(?s).*?(?=}})""".r
rr: scala.util.matching.Regex = (?s).*?(?=}})
scala> object X extends JavaTokenParsers {val r: Parser[String] = rr; val b: Parser[String] = "{{" ~>r<~"}}" ^^ { case s => s } }
defined object X
scala> X.parseAll(X.b, """{{ abc
| def
| }}""")
res15: X.ParseResult[String] =
[3.3] parsed: abc
def
More to show difference in greed:
scala> val rr = """(?s)(.*?)(?=}})""".r.unanchored
rr: scala.util.matching.UnanchoredRegex = (?s)(.*?)(?=}})
scala> def f(s: String) = s match { case rr(x) => x case _ => "(none)" }
f: (s: String)String
scala> f("something }} }}")
res3: String = "something "
scala> val rr = """(?s)(.*)(?=}})""".r.unanchored
rr: scala.util.matching.UnanchoredRegex = (?s)(.*)(?=}})
scala> def f(s: String) = s match { case rr(x) => x case _ => "(none)" }
f: (s: String)String
scala> f("something }} }}")
res4: String = "something }} "
The lookahead just means "make sure this follows me, but don't consume it."
Negative lookahead just means make sure it doesn't follow me.
To match {{the entire bracket}}, use this regex:
(?s)\{\{.*?\}\}
See the matches in the demo.
To match {{inside the brackets}}, use this:
(?s)(?<=\{\{).*?(?=\}\})
See the matches in the demo.
Explanation
(?s) activates DOTALL mode, allowing the dot to match across lines
The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as much as necessary. Without the ?, the .* will grab the longest match, first matching the whole string then backtracking only as far as needed to allow the next token to match.
(?<=\{\{) is a lookbehind that asserts that what precedes is {{
(?=\}\}) is a lookahead that asserts that what follows is }}
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind