I have a simple RegexParser that matches {key}={value} repeating for several times:
object CommandOptionsParser extends RegexParsers {
private val key: Parser[String] = "[^= ]+".r
private val value: Parser[String] = "[^ ]*".r
val pair: Parser[Option[(String, Option[String])]] =
(key ~ ("=".r ~> value).?).? ^^ {
case None => None
case Some(k ~ v) => Some(k.trim -> v.map(_.trim))
}
val pairs: Parser[Map[String, Option[String]]] = phrase(repsep(pair, whiteSpace)) ^^ {
case v =>
Map(v.flatten: _*)
}
def apply(input: String): Map[String, Option[String]] = parseAll(pairs, input) match {
case Success(plan, _) => plan
case x => sys.error(x.toString)
}
}
However the matching of value seems to fail on more than 1 capturing groups (despite that the regex doesn't limit it). when I try to match against "token=abc again=abc", I have the following error:
[1.11] failure: string matching regex `\z' expected but `a' found
token=abc again=abc'
^
Why RegexParser has such strange behaviour?
The fix for your unexpected behavior is quite easy, just change the value of skipWhitespace:
object CommandOptionsParser extends RegexParsers {
override val skipWhitespace = false
From description of RegexParsers:
The parsing methods call the method skipWhitespace (defaults to
true) and, if true, skip any whitespace before each parser is
called.
So, what happened, your first pair was matched, then whiteSpace was skipped and then, as repsep couldn't find another whitespace separator, it just assumed that parsing is over, hence that "\z" expected.
Also, I can't help but note that the whole Parser approach for such simple task seems overcomplicated, simple regexps would suffice.
UPD: Also your parsers can be a bit simpler:
val pair: Parser[Option[(String, Option[String])]] =
(key ~ ("=" ~> value).?).? ^^ (_.map {case (k ~ v) => k.trim -> v.map(_.trim)})
val pairs: Parser[Map[String, Option[String]]] = phrase(repsep(pair, whiteSpace)) ^^
{ l => Map(l.flatten: _*)}
Related
I am starting to learn Scala and want to use regular expressions to match a character from a string so I can populate a mutable map of characters and their value (String values, numbers etc) and then print the result.
I have looked at several answers on SO and gone over the Scala Docs but can't seem to get this right. I have a short Lexer class that currently looks like this:
class Lexer {
private val tokens: mutable.Map[String, Any] = collection.mutable.Map()
private def checkCharacter(char: Character): Unit = {
val Operator = "[-+*/^%=()]".r
val Digit = "[\\d]".r
val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => tokens(c) = "Operator"
case Digit(c) => tokens(c) = Integer.parseInt(c)
case Other(c) => tokens(c) = "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
for (s <- inputArray)
checkCharacter(s)
for((key, value) <- tokens)
println(key + ": " + value)
}
}
I'm pretty confused by the sort of strange method syntax, Operator(c), that I have seen being used to handle the value to match and am also unsure if this is the correct way to use regex in Scala. I think what I want this code to do is clear, I'd really appreciate some help understanding this. If more info is needed I will supply what I can
This official doc has lot's of examples: https://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html. What might be confusing is the type of the regular expression and its use in pattern matching...
You can construct a regex from any string by using .r:
scala> val regex = "(something)".r
regex: scala.util.matching.Regex = (something)
Your regex becomes an object that has a few useful methods to be able to find matching groups like findAllIn.
In Scala it's idiomatic to use pattern matching for safe extraction of values, thus Regex class also has unapplySeq method to support pattern matching. This makes it an extractor object. You can use it directly (not common):
scala> regex.unapplySeq("something")
res1: Option[List[String]] = Some(List(something))
or you can let Scala compiler call it for you when you do pattern matching:
scala> "something" match {
| case regex(x) => x
| case _ => ???
| }
res2: String = something
You might ask why exactly this return type on unapply/unapplySeq. The doc explains it very well:
The return type of an unapply should be chosen as follows:
If it is just a test, return a Boolean. For instance case even().
If it returns a single sub-value of type T, return an Option[T].
If you want to return several sub-values T1,...,Tn, group them in an optional tuple Option[(T1,...,Tn)].
Sometimes, the number of values to extract isn’t fixed and we would
like to return an arbitrary number of values, depending on the input.
For this use case, you can define extractors with an unapplySeq method
which returns an Option[Seq[T]]. Common examples of these patterns
include deconstructing a List using case List(x, y, z) => and
decomposing a String using a regular expression Regex, such as case
r(name, remainingFields # _*) =>
In short your regex might match one or more groups, thus you need to return a list/seq. It has to be wrapped in an Option to comply with extractor contract.
The way you are using regex is correct, I would just map your function over the input array to avoid creating mutable maps. Perhaps something like this:
class Lexer {
private def getCharacterType(char: Character): Any = {
val Operator = "([-+*/^%=()])".r
val Digit = "([\\d])".r
//val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => "Operator"
case Digit(c) => Integer.parseInt(c)
case _ => "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
val tokens = inputArray.map(x => x -> getCharacterType(x))
for((key, value) <- tokens)
println(key + ": " + value)
}
}
scala> val l = new Lexer()
l: Lexer = Lexer#60f662bd
scala> l.lex("a-1")
a: Other
-: Operator
1: 1
Suppose I'm writing a rudimentary SQL parser in Scala. I have the following:
class Arith extends RegexParsers {
def selectstatement: Parser[Any] = selectclause ~ fromclause
def selectclause: Parser[Any] = "(?i)SELECT".r ~ tokens
def fromclause: Parser[Any] = "(?i)FROM".r ~ tokens
def tokens: Parser[Any] = rep(token) //how to make this non-greedy?
def token: Parser[Any] = "(\\s*)\\w+(\\s*)".r
}
When trying to match selectstatement against SELECT foo FROM bar, how do I prevent the selectclause from gobbling up the entire phrase due to the rep(token) in ~ tokens?
In other words, how do I specify non-greedy matching in Scala?
To clarify, I'm fully aware that I can use standard non-greedy syntax (*?) or (+?) within the String pattern itself, but I wondered if there's a way to specify it at a higher level inside def tokens. For example, if I had defined token like this:
def token: Parser[Any] = stringliteral | numericliteral | columnname
Then how can I specify non-greedy matching for the rep(token) inside def tokens?
Not easily, because a successful match is not retried. Consider, for example:
object X extends RegexParsers {
def p = ("a" | "aa" | "aaa" | "aaaa") ~ "ab"
}
scala> X.parseAll(X.p, "aaaab")
res1: X.ParseResult[X.~[String,String]] =
[1.2] failure: `ab' expected but `a' found
aaaab
^
The first match was successful, in parser inside parenthesis, so it proceeded to the next one. That one failed, so p failed. If p was part of alternative matches, the alternative would be tried, so the trick is to produce something that can handle that sort of thing.
Let's say we have this:
def nonGreedy[T](rep: => Parser[T], terminal: => Parser[T]) = Parser { in =>
def recurse(in: Input, elems: List[T]): ParseResult[List[T] ~ T] =
terminal(in) match {
case Success(x, rest) => Success(new ~(elems.reverse, x), rest)
case _ =>
rep(in) match {
case Success(x, rest) => recurse(rest, x :: elems)
case ns: NoSuccess => ns
}
}
recurse(in, Nil)
}
You can then use it like this:
def p = nonGreedy("a", "ab")
By the way,I always found that looking at how other things are defined is helpful in trying to come up with stuff like nonGreedy above. In particular, look at how rep1 is defined, and how it was changed to avoid re-evaluating its repetition parameter -- the same thing would probably be useful on nonGreedy.
Here's a full solution, with a little change to avoid consuming the "terminal".
trait NonGreedy extends Parsers {
def nonGreedy[T, U](rep: => Parser[T], terminal: => Parser[U]) = Parser { in =>
def recurse(in: Input, elems: List[T]): ParseResult[List[T]] =
terminal(in) match {
case _: Success[_] => Success(elems.reverse, in)
case _ =>
rep(in) match {
case Success(x, rest) => recurse(rest, x :: elems)
case ns: NoSuccess => ns
}
}
recurse(in, Nil)
}
}
class Arith extends RegexParsers with NonGreedy {
// Just to avoid recompiling the pattern each time
val select: Parser[String] = "(?i)SELECT".r
val from: Parser[String] = "(?i)FROM".r
val token: Parser[String] = "(\\s*)\\w+(\\s*)".r
val eof: Parser[String] = """\z""".r
def selectstatement: Parser[Any] = selectclause(from) ~ fromclause(eof)
def selectclause(terminal: Parser[Any]): Parser[Any] =
select ~ tokens(terminal)
def fromclause(terminal: Parser[Any]): Parser[Any] =
from ~ tokens(terminal)
def tokens(terminal: Parser[Any]): Parser[Any] =
nonGreedy(token, terminal)
}
As I frequently observe and how I often implement a name attribute, is to simply model it as String.
What now, if the name has to follow a certain syntax, i.e. format? In Java I probably would define a constructor with a check on its arguments, something like:
public Name(str: String) {
if (str == null) throw new IllegalArgumentException("Str must not be null.");
if (!str.matches("name format expressed as regex")) throw new IllegalArgumentException("Str must match 'regex' but was " + str);
this.str = str;
}
In Scala I came up with the following solution:
import StdDef.Str
import StdDef.Bol
import StdDef.?
import scala.util.parsing.combinator.RegexParsers
final case class Name private (pfx: ?[Str] = None, sfx: Str) {
override def toString = pfx.mkString + sfx
}
object Name extends RegexParsers {
implicit def apply(str: Str): Name = parseAll(syntax, str) match {
case Success(res, _) => Name(res._1, res._2)
case rej: NoSuccess => error(rej.toString)
}
lazy val syntax = (prefix ?) ~! suffix
lazy val prefix = (("x" | "X") ~! hyph) ^^ { case a ~ b => a + b }
lazy val suffix = alpha ~! (alpha | digit | hyph *) ^^ { case a ~ b => a + b.mkString }
lazy val alpha: Parser[Str] = """\p{Alpha}""".r
lazy val digit: Parser[Str] = """\p{Digit}""".r
lazy val hyph: Parser[Str] = "-"
override lazy val skipWhitespace = false
}
My intents here are:
Compose a Name from its natural representation, i.e. a String value
Check whether its natural representation forms a valid Name at construction time.
Disallow any other construction than through the factory method apply:(str:Str)Str.
Make the construction from its natural representation implicit, e.g. val a: Name = "ISBN 978-0-9815316-4-9".
Decompose a Name into its parts according to its syntactical elements.
Have errors being thrown with messages, such as:
===
--
^
[1.3] error: string matching regex `\p{Alpha}' expected but end of source found
I would like to know what solutions you come up with.
After giving the topic some more thoughts, I am currently taking the following approach.
Token.scala:
abstract class Token {
val value: Str
}
object Token {
def apply[A <: Token](ctor: Str => A, syntax: Regex) = (value: Str) => value match {
case syntax() => ctor(value)
case _ => error("Value must match '" + syntax + "' but was '" + value + "'.")
}
}
Tokens.scala:
final case class Group private (val value: Str) extends Token
final case class Name private (val value: Str) extends Token
trait Tokens {
import foo.{ bar => outer }
val Group = Token(outer.Group, """(?i)[a-z0-9-]++""".r)
val Name = Token(outer.Name, """(?i)(?:x-)?+[a-z0-9-]++""".r)
}
Given that you'd be comfortable using a regex in Java, it seems like overkill to then try and solve the same problem with a parser in Scala.
Stick with what you know here, but add a Scala twist to clean up the solution a bit. Regexes in Scala also define extractors, allowing them to be used in a pattern match:
//triple-quote to make escaping easier, the .r makes it a regex
//Note how the value breaks normal naming conventions and starts in uppercase
//This is to avoid backticks when pattern matching
val TestRegex = """xxyyzz""".r
class Name(str: String) {
str match {
case Null => throw new IllegalArgumentException("Str must not be null")
case TestRegex => //do nothing
case _ => throw new IllegalArgumentException(
"Str must match 'regex' but was " + str)
}
}
disclaimer: I didn't actually test this code, it may contain typos
Is it possible to match regular expression pattern which is returned from a function? Can I do something like this?
def pattern(prefix: String) = (prefix + "_(\\w+)").r
val x = something match {
case pattern("a")(key) => "AAAA" + key
case pattern("b")(key) => "BBBB" + key
}
I cannot compile the above code. The following console snapshot shows an error I get. What am I doing wrong?
scala> def pattern(prefix: String) = (prefix + "_(\\w+)").r
pattern: (prefix: String)scala.util.matching.Regex
scala> def f(s:String) = s match {
| case pattern("a")(x) => s+x+"AAAAA"
<console>:2: error: '=>' expected but '(' found.
case pattern("a")(x) => s+x+"AAAAA"
^
This syntax is not supported by scala, you have to declare the extractor before you use it. See my earlier question on this topic.
I would like to be able to find a match between the first letter of a word, and one of the letters in a group such as "ABC". In pseudocode, this might look something like:
case Process(word) =>
word.firstLetter match {
case([a-c][A-C]) =>
case _ =>
}
}
But how do I grab the first letter in Scala instead of Java? How do I express the regular expression properly? Is it possible to do this within a case class?
You can do this because regular expressions define extractors but you need to define the regex pattern first. I don't have access to a Scala REPL to test this but something like this should work.
val Pattern = "([a-cA-C])".r
word.firstLetter match {
case Pattern(c) => c bound to capture group here
case _ =>
}
Since version 2.10, one can use Scala's string interpolation feature:
implicit class RegexOps(sc: StringContext) {
def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
}
scala> "123" match { case r"\d+" => true case _ => false }
res34: Boolean = true
Even better one can bind regular expression groups:
scala> "123" match { case r"(\d+)$d" => d.toInt case _ => 0 }
res36: Int = 123
scala> "10+15" match { case r"(\d\d)${first}\+(\d\d)${second}" => first.toInt+second.toInt case _ => 0 }
res38: Int = 25
It is also possible to set more detailed binding mechanisms:
scala> object Doubler { def unapply(s: String) = Some(s.toInt*2) }
defined module Doubler
scala> "10" match { case r"(\d\d)${Doubler(d)}" => d case _ => 0 }
res40: Int = 20
scala> object isPositive { def unapply(s: String) = s.toInt >= 0 }
defined module isPositive
scala> "10" match { case r"(\d\d)${d # isPositive()}" => d.toInt case _ => 0 }
res56: Int = 10
An impressive example on what's possible with Dynamic is shown in the blog post Introduction to Type Dynamic:
object T {
class RegexpExtractor(params: List[String]) {
def unapplySeq(str: String) =
params.headOption flatMap (_.r unapplySeq str)
}
class StartsWithExtractor(params: List[String]) {
def unapply(str: String) =
params.headOption filter (str startsWith _) map (_ => str)
}
class MapExtractor(keys: List[String]) {
def unapplySeq[T](map: Map[String, T]) =
Some(keys.map(map get _))
}
import scala.language.dynamics
class ExtractorParams(params: List[String]) extends Dynamic {
val Map = new MapExtractor(params)
val StartsWith = new StartsWithExtractor(params)
val Regexp = new RegexpExtractor(params)
def selectDynamic(name: String) =
new ExtractorParams(params :+ name)
}
object p extends ExtractorParams(Nil)
Map("firstName" -> "John", "lastName" -> "Doe") match {
case p.firstName.lastName.Map(
Some(p.Jo.StartsWith(fn)),
Some(p.`.*(\\w)$`.Regexp(lastChar))) =>
println(s"Match! $fn ...$lastChar")
case _ => println("nope")
}
}
As delnan pointed out, the match keyword in Scala has nothing to do with regexes. To find out whether a string matches a regex, you can use the String.matches method. To find out whether a string starts with an a, b or c in lower or upper case, the regex would look like this:
word.matches("[a-cA-C].*")
You can read this regex as "one of the characters a, b, c, A, B or C followed by anything" (. means "any character" and * means "zero or more times", so ".*" is any string).
To expand a little on Andrew's answer: The fact that regular expressions define extractors can be used to decompose the substrings matched by the regex very nicely using Scala's pattern matching, e.g.:
val Process = """([a-cA-C])([^\s]+)""".r // define first, rest is non-space
for (p <- Process findAllIn "aha bah Cah dah") p match {
case Process("b", _) => println("first: 'a', some rest")
case Process(_, rest) => println("some first, rest: " + rest)
// etc.
}
String.matches is the way to do pattern matching in the regex sense.
But as a handy aside, word.firstLetter in real Scala code looks like:
word(0)
Scala treats Strings as a sequence of Char's, so if for some reason you wanted to explicitly get the first character of the String and match it, you could use something like this:
"Cat"(0).toString.matches("[a-cA-C]")
res10: Boolean = true
I'm not proposing this as the general way to do regex pattern matching, but it's in line with your proposed approach to first find the first character of a String and then match it against a regex.
EDIT:
To be clear, the way I would do this is, as others have said:
"Cat".matches("^[a-cA-C].*")
res14: Boolean = true
Just wanted to show an example as close as possible to your initial pseudocode. Cheers!
First we should know that regular expression can separately be used. Here is an example:
import scala.util.matching.Regex
val pattern = "Scala".r // <=> val pattern = new Regex("Scala")
val str = "Scala is very cool"
val result = pattern findFirstIn str
result match {
case Some(v) => println(v)
case _ =>
} // output: Scala
Second we should notice that combining regular expression with pattern matching would be very powerful. Here is a simple example.
val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
"2014-11-20" match {
case date(year, month, day) => "hello"
} // output: hello
In fact, regular expression itself is already very powerful; the only thing we need to do is to make it more powerful by Scala. Here are more examples in Scala Document: http://www.scala-lang.org/files/archive/api/current/index.html#scala.util.matching.Regex
Note that the approach from #AndrewMyers's answer matches the entire string to the regular expression, with the effect of anchoring the regular expression at both ends of the string using ^ and $. Example:
scala> val MY_RE = "(foo|bar).*".r
MY_RE: scala.util.matching.Regex = (foo|bar).*
scala> val result = "foo123" match { case MY_RE(m) => m; case _ => "No match" }
result: String = foo
scala> val result = "baz123" match { case MY_RE(m) => m; case _ => "No match" }
result: String = No match
scala> val result = "abcfoo123" match { case MY_RE(m) => m; case _ => "No match" }
result: String = No match
And with no .* at the end:
scala> val MY_RE2 = "(foo|bar)".r
MY_RE2: scala.util.matching.Regex = (foo|bar)
scala> val result = "foo123" match { case MY_RE2(m) => m; case _ => "No match" }
result: String = No match