non-greedy matching in Scala RegexParsers - regex

Suppose I'm writing a rudimentary SQL parser in Scala. I have the following:
class Arith extends RegexParsers {
def selectstatement: Parser[Any] = selectclause ~ fromclause
def selectclause: Parser[Any] = "(?i)SELECT".r ~ tokens
def fromclause: Parser[Any] = "(?i)FROM".r ~ tokens
def tokens: Parser[Any] = rep(token) //how to make this non-greedy?
def token: Parser[Any] = "(\\s*)\\w+(\\s*)".r
}
When trying to match selectstatement against SELECT foo FROM bar, how do I prevent the selectclause from gobbling up the entire phrase due to the rep(token) in ~ tokens?
In other words, how do I specify non-greedy matching in Scala?
To clarify, I'm fully aware that I can use standard non-greedy syntax (*?) or (+?) within the String pattern itself, but I wondered if there's a way to specify it at a higher level inside def tokens. For example, if I had defined token like this:
def token: Parser[Any] = stringliteral | numericliteral | columnname
Then how can I specify non-greedy matching for the rep(token) inside def tokens?

Not easily, because a successful match is not retried. Consider, for example:
object X extends RegexParsers {
def p = ("a" | "aa" | "aaa" | "aaaa") ~ "ab"
}
scala> X.parseAll(X.p, "aaaab")
res1: X.ParseResult[X.~[String,String]] =
[1.2] failure: `ab' expected but `a' found
aaaab
^
The first alternative inside the parentheses matched successfully, so the parser proceeded to the next element, "ab". That one failed, so p failed. If p had been one of several alternatives, the next alternative would have been tried, so the trick is to produce something that can handle that sort of thing.
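To make that last point concrete, here is a minimal sketch, assuming the scala-parser-combinators module is on the classpath. The names BacktrackSketch, committed and distributed are made up for illustration. It contrasts the parser above with a variant in which every alternative carries its own continuation, so that a failed branch falls through to the next one.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical names, for illustration only.
object BacktrackSketch extends RegexParsers {
  // Commits to "a": once the alternation succeeds it is never revisited.
  def committed: Parser[String ~ String] =
    ("a" | "aa" | "aaa" | "aaaa") ~ "ab"

  // Each branch carries its own "ab", so a failed branch falls through
  // to the next alternative.
  def distributed: Parser[String ~ String] =
    ("a" ~ "ab") | ("aa" ~ "ab") | ("aaa" ~ "ab") | ("aaaa" ~ "ab")
}
```

With the input "aaaab", committed fails after matching the first "a", while distributed succeeds on its third branch.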
Let's say we have this:
def nonGreedy[T](rep: => Parser[T], terminal: => Parser[T]) = Parser { in =>
def recurse(in: Input, elems: List[T]): ParseResult[List[T] ~ T] =
terminal(in) match {
case Success(x, rest) => Success(new ~(elems.reverse, x), rest)
case _ =>
rep(in) match {
case Success(x, rest) => recurse(rest, x :: elems)
case ns: NoSuccess => ns
}
}
recurse(in, Nil)
}
You can then use it like this:
def p = nonGreedy("a", "ab")
By the way, I have always found that looking at how other things are defined helps when trying to come up with something like nonGreedy above. In particular, look at how rep1 is defined, and how it was changed to avoid re-evaluating its repetition parameter; the same change would probably be useful in nonGreedy.
Here's a full solution, with a little change to avoid consuming the "terminal".
trait NonGreedy extends Parsers {
def nonGreedy[T, U](rep: => Parser[T], terminal: => Parser[U]) = Parser { in =>
def recurse(in: Input, elems: List[T]): ParseResult[List[T]] =
terminal(in) match {
case _: Success[_] => Success(elems.reverse, in)
case _ =>
rep(in) match {
case Success(x, rest) => recurse(rest, x :: elems)
case ns: NoSuccess => ns
}
}
recurse(in, Nil)
}
}
class Arith extends RegexParsers with NonGreedy {
// Just to avoid recompiling the pattern each time
val select: Parser[String] = "(?i)SELECT".r
val from: Parser[String] = "(?i)FROM".r
val token: Parser[String] = "(\\s*)\\w+(\\s*)".r
val eof: Parser[String] = """\z""".r
def selectstatement: Parser[Any] = selectclause(from) ~ fromclause(eof)
def selectclause(terminal: Parser[Any]): Parser[Any] =
select ~ tokens(terminal)
def fromclause(terminal: Parser[Any]): Parser[Any] =
from ~ tokens(terminal)
def tokens(terminal: Parser[Any]): Parser[Any] =
nonGreedy(token, terminal)
}
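For the specific SELECT/FROM case, a different technique may be enough: the library's not combinator succeeds without consuming input when its argument fails, so each token can simply refuse to start at the terminator. This is a sketch of that alternative, not the nonGreedy approach above. The names SelectSketch, word and statement are hypothetical, and the \b word boundary on FROM is an added assumption so that identifiers such as "fromage" are not mistaken for the keyword.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical names, for illustration only.
object SelectSketch extends RegexParsers {
  val select: Parser[String] = "(?i)SELECT\\b".r
  val from: Parser[String]   = "(?i)FROM\\b".r
  val word: Parser[String]   = """\w+""".r

  // A token is any word that is not the FROM keyword.
  // not(from) consumes nothing; it only vetoes the match.
  def token: Parser[String] = not(from) ~> word

  def statement: Parser[List[String] ~ List[String]] =
    (select ~> rep(token)) ~ (from ~> rep(token))
}
```

Parsing "SELECT foo bar FROM baz" with parseAll(statement, ...) should yield the two token lists on either side of FROM.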

Related

Scala Regex Parser throws weird error

I have a simple RegexParser that matches {key}={value} repeating for several times:
object CommandOptionsParser extends RegexParsers {
private val key: Parser[String] = "[^= ]+".r
private val value: Parser[String] = "[^ ]*".r
val pair: Parser[Option[(String, Option[String])]] =
(key ~ ("=".r ~> value).?).? ^^ {
case None => None
case Some(k ~ v) => Some(k.trim -> v.map(_.trim))
}
val pairs: Parser[Map[String, Option[String]]] = phrase(repsep(pair, whiteSpace)) ^^ {
case v =>
Map(v.flatten: _*)
}
def apply(input: String): Map[String, Option[String]] = parseAll(pairs, input) match {
case Success(plan, _) => plan
case x => sys.error(x.toString)
}
}
However, the matching seems to fail on more than one key=value pair (even though the regex doesn't limit it). When I try to match against "token=abc again=abc", I get the following error:
[1.11] failure: string matching regex `\z' expected but `a' found
token=abc again=abc'
^
Why does RegexParsers behave so strangely?
The fix for your unexpected behavior is quite easy: just change the value of skipWhitespace:
object CommandOptionsParser extends RegexParsers {
override val skipWhitespace = false
From the description of RegexParsers:
The parsing methods call the method skipWhitespace (defaults to
true) and, if true, skip any whitespace before each parser is
called.
So here is what happened: your first pair was matched, then the whitespace was skipped, and then, as repsep couldn't find another whitespace separator, it assumed that parsing was over, hence the "\z expected" failure.
Also, I can't help but note that a full parser seems overcomplicated for such a simple task; simple regexps would suffice.
UPD: Also your parsers can be a bit simpler:
val pair: Parser[Option[(String, Option[String])]] =
(key ~ ("=" ~> value).?).? ^^ (_.map {case (k ~ v) => k.trim -> v.map(_.trim)})
val pairs: Parser[Map[String, Option[String]]] = phrase(repsep(pair, whiteSpace)) ^^
{ l => Map(l.flatten: _*)}
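The skipWhitespace fix can be tried in isolation with a reduced version of the parser. This is a minimal sketch, assuming scala-parser-combinators is on the classpath. KeyValueSketch is a hypothetical name, and the pair parser is simplified (no outer Option) compared with the original.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical, simplified variant of CommandOptionsParser.
object KeyValueSketch extends RegexParsers {
  // Whitespace is the separator here, so it must not be skipped implicitly.
  override val skipWhitespace = false

  private val key: Parser[String]   = "[^= ]+".r
  private val value: Parser[String] = "[^ ]*".r

  private val pair: Parser[(String, Option[String])] =
    key ~ opt("=" ~> value) ^^ { case k ~ v => k -> v }

  val pairs: Parser[Map[String, Option[String]]] =
    repsep(pair, whiteSpace) ^^ (_.toMap)

  def apply(input: String): Map[String, Option[String]] =
    parseAll(pairs, input) match {
      case Success(result, _) => result
      case other              => sys.error(other.toString)
    }
}
```

Calling KeyValueSketch("token=abc again=abc") should now return both pairs; removing the override brings back the failure from the question.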

Scala CSV parser with comments

First of all: credits. This code is based on the solution from here: Use Scala parser combinator to parse CSV files
The CSV files I want to parse can have comments, which are lines starting with #. And to avoid confusion: the CSV files are tab-separated. There are more constraints which would make the parser a lot easier, but since I am completely new to Scala I thought it would be best to stay as close to the (working) original as possible.
The problem I have is that I get a type mismatch. Obviously the regex for a comment does not yield a list. I was hoping that Scala would interpret a comment as a 1-element-list, but this is not the case.
So how would I need to modify my code so that it can handle these comment lines? And closely related: is there an elegant way to query the parser result so I can write something like this in myfunc:
if (isComment(a)) continue
So here is the actual code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.combinator._
object MyParser extends RegexParsers {
override val skipWhitespace = false // meaningful spaces in CSV
def COMMA = ","
def TAB = "\t"
def DQUOTE = "\""
def HASHTAG = "#"
def DQUOTE2 = "\"\"" ^^ { case _ => "\"" } // combine 2 dquotes into 1
def CRLF = "\r\n" | "\n"
def TXT = "[^\",\r\n]".r
def SPACES = "[ ]+".r
def file: Parser[List[List[String]]] = repsep((comment|record), CRLF) <~ (CRLF?)
def comment: Parser[List[String]] = HASHTAG<~TXT
def record: Parser[List[String]] = "[^#]".r<~repsep(field, TAB)
def field: Parser[String] = escaped|nonescaped
def escaped: Parser[String] = {
((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ {
case ls => ls.mkString("")
}
}
def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }
def applyParser(s: String) = parseAll(file, s) match {
case Success(res, _) => res
case e => throw new Exception(e.toString)
}
def myfunc( a: (String, String)) = {
val parserResult = applyParser(a._2)
println("APPLY PARSER FOR " + a._1)
for( a <- parserResult ){
a.foreach { println }
}
}
def main(args: Array[String]) {
val filesPath = "/home/user/test/*.txt"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.wholeTextFiles(filesPath).cache()
logData.foreach( x => myfunc(x))
}
}
Since the parser for comment and the parser for record are "or-ed" together they must be of the same type.
You need to make the following changes:
def comment: Parser[List[String]] = HASHTAG<~TXT ^^^ {List()}
By using ^^^ we are converting the result of the parser (which is the result returned by HASHTAG parser) to an empty List.
Also change:
def record: Parser[List[String]] = repsep(field, TAB)
Note that because the comment and record parsers are or-ed and comment comes first, a row that begins with a "#" will be parsed by the comment parser.
Edit:
In order to keep the comments text as an output of the parser (say if you want to print them later), and because you are using | you can do the following:
Define the following classes:
trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line
Now define comment, record & file parsers as follows:
val comment: Parser[Comment] = "#" ~> TXT ^^ Comment
val record :Parser[Line]= repsep(field, TAB) ^^ Record
val file: Parser[List[Line]] = repsep(comment | record, CRLF) <~ (CRLF?)
Now you can define the printing function myFunc:
def myfunc( a: (String, String)) = {
parseAll(file, a._2).map { lines =>
lines.foreach{
case Comment(t) => println(s"This is a comment: $t")
case Record(elems) => println(s"This is a record: ${elems.mkString(",")}")
}
}
}
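The pieces above can be combined into one compact, runnable sketch, assuming scala-parser-combinators is on the classpath. TsvSketch and eol are hypothetical names, and two things differ from the code in the question: the comment parser uses its own regex that consumes the rest of the line (TXT in the question matches a single character), and quoted fields are left out for brevity.

```scala
import scala.util.parsing.combinator.RegexParsers

sealed trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line

// Hypothetical, reduced parser: tab-separated fields, no quoting.
object TsvSketch extends RegexParsers {
  override val skipWhitespace = false // tabs and spaces are meaningful

  def eol: Parser[String] = "\r\n" | "\n"

  // Everything after '#' up to the end of the line is the comment text.
  def comment: Parser[Line] = "#" ~> """[^\r\n]*""".r ^^ (t => Comment(t))

  def field: Parser[String] = """[^\t\r\n]*""".r
  def record: Parser[Line]  = repsep(field, "\t") ^^ (fs => Record(fs))

  // comment comes first, so lines starting with '#' never reach record.
  def file: Parser[List[Line]] = repsep(comment | record, eol) <~ opt(eol)
}
```

Because both alternatives produce a Line, the caller can pattern match on Comment and Record instead of asking whether a row is a comment.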

How to create a parser from Regex in Scala to parse a path?

I am writing a parser in which I am trying to parse a path and do arithmetic calculations. Since I cannot use RegexParsers with StandardTokenParsers, I am trying to make my own. I am using the following code, part of which I picked up from another discussion:
lexical.delimiters ++= List("+","-","*","/", "^","(",")",",")
import lexical.StringLit
def regexStringLit(r: Regex): Parser[String] =
acceptMatch( "string literal matching regex " + r,{ case StringLit(s) if r.unapplySeq(s).isDefined => s })
def pathIdent: Parser[String] =regexStringLit ("/hdfs://([\\d.]+):(\\d+)/([\\w/]+/(\\w+\\.w+))".r)
def value :Parser[Expr] = numericLit ^^ { s => Number(s) }
def variable:Parser[Expr] = pathIdent ^^ { s => Variable(s) }
def parens:Parser[Expr] = "(" ~> expr <~ ")"
def argument:Parser[Expr] = expr <~ (","?)
def func:Parser[Expr] = ( pathIdent ~ "(" ~ (argument+) ~ ")" ^^ { case f ~ _ ~ e ~ _ => Function(f, e) })
//some other code
def parse(s:String) = {
val tokens = new lexical.Scanner(s)
phrase(expr)(tokens)
}
Then I use args(0) to send my input to the program, which is:
"/hdfs://111.33.55.2:8888/folder1/p.a3d+1"
and this is the error I get:
[1.1] failure: string literal matching regex /hdfs://([\d\.]+):(\d+)/([\w/]+/(\w+\.\w+)) expected
/hdfs://111.33.55.2:8888/folder1/p.a3d
^
I tried a simple path, and I also commented out the rest of the code and left only the path part, but it seems that regexStringLit is not working for me. I think I have the syntax wrong, but I don't know where.
There are a couple of mistakes in your regex:
/hdfs://([\d.]+):(\d+)/([\w/]+/(\w+\.w+))
1) There are unnecessary parentheses (or you forgot a +). This is not a real mistake, but it makes it harder to read your regex and fix bugs.
/hdfs://[\d.]+:\d+/[\w/]+/\w+\.w+
2) The last w+ is not escaped:
/hdfs://[\d.]+:\d+/[\w/]+/\w+\.\w+
3) You only allow . but not + for the last part:
/hdfs://[\d.]+:\d+/[\w/]+/\w+([.+]\w+)+
The above expression matches your test case; however, I suspect you actually want this expression:
/hdfs://\d+(\.\d+){3}:\d+(/(\w+([-+.*/]\w+)*))+
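The three steps can be checked against the test input using only the standard library. The sketch below copies the regexes from the question and the steps above; PathRegexSketch and fullMatch are hypothetical names introduced for illustration.

```scala
import scala.util.matching.Regex

// Hypothetical helper object for checking the regexes above.
object PathRegexSketch {
  val path = "/hdfs://111.33.55.2:8888/folder1/p.a3d"

  // Original regex from the question: the final "w+" is a literal 'w'.
  val original: Regex = """/hdfs://([\d.]+):(\d+)/([\w/]+/(\w+\.w+))""".r
  // Step 2: the last \w is escaped.
  val escaped: Regex  = """/hdfs://[\d.]+:\d+/[\w/]+/\w+\.\w+""".r
  // Step 3: '+' is also allowed in the last part.
  val withPlus: Regex = """/hdfs://[\d.]+:\d+/[\w/]+/\w+([.+]\w+)+""".r

  // True only if the whole string matches the regex.
  def fullMatch(r: Regex, s: String): Boolean = r.pattern.matcher(s).matches()
}
```

Only the step-3 regex accepts the full input including the trailing "+1".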
I solved it by writing a trait and using JavaTokenParsers rather than StandardTokenParsers.
trait pathIdentifier extends RegexParsers{
def pathIdent: Parser[String] ={
"""hdfs://([\d\.]+):(\d+)/([\w/]+/(\w+\.\w+))""".r
}
}
@Tilo Thanks for your help. Your solution works as well, but changing the extended class to JavaTokenParsers is what solved the problem for me.

Using parser combinators to collate lines of text

I'm trying to parse a text file using parser combinators. I want to capture the index and text in a class called Example. Here's a test showing the form of an input file:
object Test extends ParsComb with App {
val input = """
0)
blah1
blah2
blah3
1)
blah4
blah5
END
"""
println(parseAll(examples, input))
}
And here's my attempt that doesn't work:
import scala.util.parsing.combinator.RegexParsers
case class Example(index: Int, text: String)
class ParsComb extends RegexParsers {
def examples: Parser[List[Example]] = rep(divider~example) ^^
{_ map {case d ~ e => Example(d,e)}}
def divider: Parser[Int] = "[0-9]+".r <~ ")" ^^ (_.toInt)
def example: Parser[String] = ".*".r <~ (divider | "END")
}
It fails with:
[4.1] failure: `END' expected but `b' found
blah2
^
I'm just starting out with these so I don't have much clue what I'm doing. I think the problem could be with the ".*".r regex not doing multi-line. How can I change this so that it parses correctly?
What does the error message mean?
According to your grammar definition, ".*".r <~ (divider | "END"), you told the parser that an example must be followed by either a divider or an END. After parsing blah1, the parser tried to find a divider and failed, then tried END and failed again. There were no other options available, and END was the last alternative tried, so from the parser's perspective it expected END but found blah2 on the 4th line instead.
How to fix it?
Staying close to your implementation, the grammar in your case should be:
examples ::= {divider example}
divider ::= Integer")"
example ::= {literal ["END"]}
I think parsing "example" into List[String] makes more sense, but that's up to you.
The problem is your example parser: it should be a repeatable literal.
So:
class ParsComb extends RegexParsers {
def examples: Parser[List[Example]] = rep(divider ~ example) ^^ { _ map { case d ~ e => Example(d, e) } }
def divider: Parser[Int] = "[0-9]+".r <~ ")" ^^ (_.toInt)
def example: Parser[List[String]] = rep("[\\w]*(?=[\\r\\n])".r <~ opt("END"))
}
The (?=[\\r\\n]) part of the regex is a positive lookahead: the preceding characters match only if they are followed by \r or \n, and the line break itself is not consumed.
The parse result is:
[10.1] parsed: List(Example(0,List(blah1, blah2, blah3)),
Example(1,List(blah4, blah5)))
If you want to parse it into a String (instead of List[String]), just add a transform function to example: ^^ {_ mkString "\n"}
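The effect of the lookahead can be seen without any parser, using only scala.util.matching.Regex. This is a small sketch; LookaheadSketch and its members are hypothetical names.

```scala
// Hypothetical names, for illustration only.
object LookaheadSketch {
  // Word characters, but only when a line break follows.
  private val line = "\\w*(?=[\\r\\n])".r

  // The lookahead checks for the newline without consuming it.
  val beforeNewline: Option[String] = line.findPrefixOf("blah1\nblah2")

  // No line break follows, so there is no match at all.
  val noNewline: Option[String] = line.findPrefixOf("blah")
}
```

Here beforeNewline holds "blah1" without the trailing newline, and noNewline is empty.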
Your parser can't process newline characters, your example parser consumes the next divider, and your example regex matches both the divider and the "END" string.
Try this:
object ParsComb extends RegexParsers {
def examples: Parser[List[Example]] = rep(divider~example) <~ """END\n?""".r ^^ {_ map {case d ~ e => Example(d,e)}}
def divider: Parser[Int] = "[0-9]+".r <~ ")\n" ^^ (_.toInt)
def example: Parser[String] = rep(str) ^^ {_.mkString}
def str: Parser[String] = """.*\n""".r ^? { case s if simpleLine(s) => s}
val div = """[0-9]+\)\n""".r
def simpleLine(s: String) = s match {
case div() => false
case "END\n" => false
case _ => true
}
def apply(s: String) = parseAll(examples, s)
}
Result:
scala> ParsComb(input)
res3: ParsComb.ParseResult[List[Example]] =
[10.1] parsed: List(Example(0,blah1
blah2
blah3
), Example(1,blah4
blah5
))
"I think the problem could be with the ".*".r regex not doing multi-line."
Exactly. Use the dotall modifier (strangely called "s"):
def example: Parser[String] = "(?s).*".r <~ (divider | "END")
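The difference the modifier makes can be checked with the standard library alone. In this sketch (DotAllSketch is a hypothetical name), note that a greedy (?s).* runs to the end of the input, so inside a parser it may also swallow the divider or END that is supposed to follow.

```scala
// Hypothetical names, for illustration only.
object DotAllSketch {
  private val text = "blah1\nblah2"

  // By default '.' stops at a line break.
  val plain: Option[String] = ".*".r.findPrefixOf(text)

  // With (?s), '.' also matches line breaks, so the greedy match
  // continues to the end of the input.
  val dotAll: Option[String] = "(?s).*".r.findPrefixOf(text)
}
```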

How to pattern match using regular expression in Scala?

I would like to be able to find a match between the first letter of a word, and one of the letters in a group such as "ABC". In pseudocode, this might look something like:
case Process(word) =>
word.firstLetter match {
case([a-c][A-C]) =>
case _ =>
}
}
But how do I grab the first letter in Scala instead of Java? How do I express the regular expression properly? Is it possible to do this within a case class?
You can do this because regular expressions define extractors but you need to define the regex pattern first. I don't have access to a Scala REPL to test this but something like this should work.
val Pattern = "([a-cA-C])".r
word.firstLetter match {
case Pattern(c) => // c is bound to the capture group here
case _ =>
}
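A runnable variant of the same idea might look like the sketch below. FirstLetterSketch and startsWithAbc are hypothetical names, and word.take(1) stands in for the pseudocode's word.firstLetter: it returns the first character as a String, or an empty String for empty input.

```scala
// Hypothetical names, for illustration only.
object FirstLetterSketch {
  // A Regex used as an extractor must match the whole input,
  // which here is a single-character String.
  private val Abc = "([a-cA-C])".r

  def startsWithAbc(word: String): Boolean = word.take(1) match {
    case Abc(_) => true
    case _      => false
  }
}
```

For example, startsWithAbc("Cat") is true, while "dog" and the empty string give false.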
Since version 2.10, one can use Scala's string interpolation feature:
implicit class RegexOps(sc: StringContext) {
def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
}
scala> "123" match { case r"\d+" => true case _ => false }
res34: Boolean = true
Even better one can bind regular expression groups:
scala> "123" match { case r"(\d+)$d" => d.toInt case _ => 0 }
res36: Int = 123
scala> "10+15" match { case r"(\d\d)${first}\+(\d\d)${second}" => first.toInt+second.toInt case _ => 0 }
res38: Int = 25
It is also possible to set more detailed binding mechanisms:
scala> object Doubler { def unapply(s: String) = Some(s.toInt*2) }
defined module Doubler
scala> "10" match { case r"(\d\d)${Doubler(d)}" => d case _ => 0 }
res40: Int = 20
scala> object isPositive { def unapply(s: String) = s.toInt >= 0 }
defined module isPositive
scala> "10" match { case r"(\d\d)${d @ isPositive()}" => d.toInt case _ => 0 }
res56: Int = 10
An impressive example of what's possible with Dynamic is shown in the blog post Introduction to Type Dynamic:
object T {
class RegexpExtractor(params: List[String]) {
def unapplySeq(str: String) =
params.headOption flatMap (_.r unapplySeq str)
}
class StartsWithExtractor(params: List[String]) {
def unapply(str: String) =
params.headOption filter (str startsWith _) map (_ => str)
}
class MapExtractor(keys: List[String]) {
def unapplySeq[T](map: Map[String, T]) =
Some(keys.map(map get _))
}
import scala.language.dynamics
class ExtractorParams(params: List[String]) extends Dynamic {
val Map = new MapExtractor(params)
val StartsWith = new StartsWithExtractor(params)
val Regexp = new RegexpExtractor(params)
def selectDynamic(name: String) =
new ExtractorParams(params :+ name)
}
object p extends ExtractorParams(Nil)
Map("firstName" -> "John", "lastName" -> "Doe") match {
case p.firstName.lastName.Map(
Some(p.Jo.StartsWith(fn)),
Some(p.`.*(\\w)$`.Regexp(lastChar))) =>
println(s"Match! $fn ...$lastChar")
case _ => println("nope")
}
}
As delnan pointed out, the match keyword in Scala has nothing to do with regexes. To find out whether a string matches a regex, you can use the String.matches method. To find out whether a string starts with an a, b or c in lower or upper case, the regex would look like this:
word.matches("[a-cA-C].*")
You can read this regex as "one of the characters a, b, c, A, B or C followed by anything" (. means "any character" and * means "zero or more times", so ".*" is any string).
To expand a little on Andrew's answer: The fact that regular expressions define extractors can be used to decompose the substrings matched by the regex very nicely using Scala's pattern matching, e.g.:
val Process = """([a-cA-C])([^\s]+)""".r // define first, rest is non-space
for (p <- Process findAllIn "aha bah Cah dah") p match {
case Process("b", _) => println("first: 'b', some rest")
case Process(_, rest) => println("some first, rest: " + rest)
// etc.
}
String.matches is the way to do pattern matching in the regex sense.
But as a handy aside, word.firstLetter in real Scala code looks like:
word(0)
Scala treats Strings as sequences of Chars, so if for some reason you wanted to explicitly get the first character of the String and match it, you could use something like this:
"Cat"(0).toString.matches("[a-cA-C]")
res10: Boolean = true
I'm not proposing this as the general way to do regex pattern matching, but it's in line with your proposed approach to first find the first character of a String and then match it against a regex.
EDIT:
To be clear, the way I would do this is, as others have said:
"Cat".matches("^[a-cA-C].*")
res14: Boolean = true
Just wanted to show an example as close as possible to your initial pseudocode. Cheers!
First, note that regular expressions can be used on their own. Here is an example:
import scala.util.matching.Regex
val pattern = "Scala".r // <=> val pattern = new Regex("Scala")
val str = "Scala is very cool"
val result = pattern findFirstIn str
result match {
case Some(v) => println(v)
case _ =>
} // output: Scala
Second, combining regular expressions with pattern matching is very powerful. Here is a simple example:
val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
"2014-11-20" match {
case date(year, month, day) => "hello"
} // output: hello
In fact, regular expressions are already very powerful on their own; Scala's pattern matching just makes them more convenient to use. There are more examples in the Scala documentation: http://www.scala-lang.org/files/archive/api/current/index.html#scala.util.matching.Regex
Note that the approach from @AndrewMyers's answer matches the entire string against the regular expression, which has the effect of anchoring the regular expression at both ends of the string, as if with ^ and $. Example:
scala> val MY_RE = "(foo|bar).*".r
MY_RE: scala.util.matching.Regex = (foo|bar).*
scala> val result = "foo123" match { case MY_RE(m) => m; case _ => "No match" }
result: String = foo
scala> val result = "baz123" match { case MY_RE(m) => m; case _ => "No match" }
result: String = No match
scala> val result = "abcfoo123" match { case MY_RE(m) => m; case _ => "No match" }
result: String = No match
And with no .* at the end:
scala> val MY_RE2 = "(foo|bar)".r
MY_RE2: scala.util.matching.Regex = (foo|bar)
scala> val result = "foo123" match { case MY_RE2(m) => m; case _ => "No match" }
result: String = No match
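If whole-string anchoring is not wanted, scala.util.matching.Regex offers an unanchored view (available since Scala 2.10) whose extractor finds the first match anywhere in the input. This is a minimal sketch; UnanchoredSketch and firstFooBar are hypothetical names.

```scala
// Hypothetical names, for illustration only.
object UnanchoredSketch {
  // unanchored makes the extractor search the input instead of
  // requiring the whole string to match.
  private val FooBar = "(foo|bar)".r.unanchored

  def firstFooBar(s: String): String = s match {
    case FooBar(m) => m
    case _         => "No match"
  }
}
```

With this, both "foo123" and "abcfoo123" yield foo, while "baz123" still yields No match.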