Scala CSV parser with comments

Scala CSV parser with comments - regex

First of all: credits. This code is based on the solution from here: Use Scala parser combinator to parse CSV files
The CSV files I want to parse can have comments, lines starting with #. And to avoid confusion: The CSV files are tabulator-separated. There are more constraints which would make the parser a lot easier, but since I am completly new to Scala I thought it would be best to stay as close to the (working) original as possible.
The problem I have is that I get a type mismatch. Obviously the regex for a comment does not yield a list. I was hoping that Scala would interpret a comment as a 1-element-list, but this is not the case.
So how would I need to modify my code that I can handle this comment lines? And closly related: Is there an elegant way to query the parser result so I can write in myfunc something like
if (isComment(a)) continue
So here is the actual code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.combinator._
object MyParser extends RegexParsers {
override val skipWhitespace = false // meaningful spaces in CSV
def COMMA = ","
def TAB = "\t"
def DQUOTE = "\""
def HASHTAG = "#"
def DQUOTE2 = "\"\"" ^^ { case _ => "\"" } // combine 2 dquotes into 1
def CRLF = "\r\n" | "\n"
def TXT = "[^\",\r\n]".r
def SPACES = "[ ]+".r
def file: Parser[List[List[String]]] = repsep((comment|record), CRLF) <~ (CRLF?)
def comment: Parser[List[String]] = HASHTAG<~TXT
def record: Parser[List[String]] = "[^#]".r<~repsep(field, TAB)
def field: Parser[String] = escaped|nonescaped
def escaped: Parser[String] = {
((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ {
case ls => ls.mkString("")
}
}
def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }
def applyParser(s: String) = parseAll(file, s) match {
case Success(res, _) => res
case e => throw new Exception(e.toString)
}
def myfunc( a: (String, String)) = {
val parserResult = applyParser(a._2)
println("APPLY PARSER FOR " + a._1)
for( a <- parserResult ){
a.foreach { println }
}
}
def main(args: Array[String]) {
val filesPath = "/home/user/test/*.txt"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.wholeTextFiles(filesPath).cache()
logData.foreach( x => myfunc(x))
}
}

Since the parser for comment and the parser for record are "or-ed" together they must be of the same type.
You need to make the following changes:
def comment: Parser[List[String]] = HASHTAG<~TXT ^^^ {List()}
By using ^^^ we are converting the result of the parser (which is the result returned by HASHTAG parser) to an empty List.
Also change:
def record: Parser[List[String]] = repsep(field, TAB)
Note that because comment and record parser are or-ed and because comment comes first, if the row begins with a "#" it will be parsed by the comment parser.
Edit:
In order to keep the comments text as an output of the parser (say if you want to print them later), and because you are using | you can do the following:
Define the following classes:
trait Line
case class Comment(text: String) extends Line
case class Record(elements: List[String]) extends Line
Now define comment, record & file parsers as follows:
val comment: Parser[Comment] = "#" ~> TXT ^^ Comment
val record :Parser[Line]= repsep(field, TAB) ^^ Record
val file: Parser[List[Line]] = repsep(comment | record, CRLF) <~ (CRLF?)
Now you can define the printing function myFunc:
def myfunc( a: (String, String)) = {
parseAll(file, a._2).map { lines =>
lines.foreach{
case Comment(t) => println(s"This is a comment: $t")
case Record(elems) => println(s"This is a record: ${elems.mkString(",")}")
}
}
}

Related

scala repsep with new non empty line separator

I've got 3 lines for parsing
John Smith
Nick Jackson
Liza Ashwood
i want to use repsep scala function with new line separator like this:
def parseLines: Parser[Map[String, String]] = repsep(parse, lineSeparator)
where lineSeparator is just "\n"
but sometimes there can be an empty lines without any names, just empty lines. in this case, i want to ignore this function
i was trying to use opt(repsep) which seems logical but this doesn't work
what can i do here? thx

By empty lines, I'm assuming you can have data like this:
John Smith
<-- empty line
Nick Jackson
Liza Ashwood
In this case, it means that there's two times the lineSeparator. So you can use the rep1 combinator to tell that there should be at least 1 but there can be more line separators between each line of data:
def parseLines: Parser[Map[String, String]] = repsep(line, rep1("\n")) ^^ (ListMap() ++_ )
For example:
object MyParser extends RegexParsers {
override def skipWhitespace = false
def parse(input: java.io.Reader) = parse(parseLines, input) match {
case Success(res, _) => res
case e => throw new RuntimeException(e.toString)
}
def parseLines: Parser[Map[String, String]] = repsep(line, rep1("\n")) ^^ (ListMap() ++_ )
def oneOrMoreletters = "[a-zA-Z]+".r
def line:Parser[(String, String)] = (oneOrMoreletters <~ " ") ~ oneOrMoreletters ^^ (t => (t._1, t._2))
}

Scala Regex Pattern Matching

I need to use a regex to match a pattern in Scala and I currently have a Regex that is
InputPattern: scala.util.matching.Regex = put (.*) in (.*)
When I run the follwing this happens:
scala> val InputPattern(verb, item, prep, obj) = "put a in b";
scala.MatchError: put a in b (of class java.lang.String)
... 33 elided
I would like it to end up with verb("put"), item("a"), prep("in"), and obj("b") for input "put a in b" and also verb("put"), item(""), prep("in"), and obj("") for input "put in".
Thanks

This works for your special cases :
scala> val InputPattern = "(put) (.*?) ?(in) ?(.*?)".r
InputPattern: scala.util.matching.Regex = (put) (.*) ?(in) ?(.*)
scala> val InputPattern(verb, item, prep, obj) = "put a in b"
verb: String = put
item: String = a
prep: String = in
obj: String = b
scala> val InputPattern(verb, item, prep, obj) = "put in"
verb: String = put
item: String = ""
prep: String = in
obj: String = ""
put and in here are also captured in groups to participate in pattern matching. I also used lazy regexps (.*?) to capture as less as possible, you may replace it with (\S*). ? gives you optional space to match
"put in" (with one space between put and in and no space at the end).
But be aware of this:
scala> val InputPattern(verb, item, prep, obj) = "put ainb"
verb: String = put
item: String = a
prep: String = in
obj: String = b
scala> val InputPattern(verb, item, prep, obj) = "put aininb"
verb: String = put
item: String = a
prep: String = in
obj: String = inb
scala> val InputPattern(verb, item, prep, obj) = "put ain"
verb: String = put
item: String = a
prep: String = in
obj: String = ""
If you have simple command interpreter it may be even good, otherwise you should match your special cases separately.
To process a simple (not natural) language, you may also consider StandardTokenParsers, as they are context-free (Chomsky type 2):
import scala.util.parsing.combinator.syntactical._
val p = new StandardTokenParsers {
lexical.reserved ++= List("put", "in")
def p = "put" ~ opt(ident) ~ "in" ~ opt(ident)
}
scala> p.p(new p.lexical.Scanner("put a in b"))
warning: there was one feature warning; re-run with -feature for details
res13 = [1.11] parsed: (((put~Some(a))~in)~Some(b))
scala> p.p(new p.lexical.Scanner("put in"))
warning: there was one feature warning; re-run with -feature for details
res14 = [1.7] parsed: (((put~None)~in)~None)

You can write one regexp for all cases, but I'm not sure it would be readable and maintainable. I prefer simple approach:
val pattern1 = "(put) (.*) (in) (.*)".r
val pattern2 = "(put) (in)".r
def parse(text: String) = text match {
case pattern1(verb, item, prep, obj) => (verb, item, prep, obj);
case pattern2(verb, prep) => (verb, "", prep, "")
}
scala> parse("put a in b")
res6: (String, String, String, String) = (put,a,in,b)
scala> parse("put in")
res7: (String, String, String, String) = (put,"",in,"")
And one extra notion: I hope you know what you are doing! RegEx is a Chomsky Type 3 grammar and natural language is much more complex. If you need natural language parser, you can use already available solution such as Stanford NLP parser.

Using parser combinators to collate lines of text

I'm trying to parse a text file using parser combinators. I want to capture the index and text in a class called Example. Here's a test showing the form on an input file:
object Test extends ParsComb with App {
val input = """
0)
blah1
blah2
blah3
1)
blah4
blah5
END
"""
println(parseAll(examples, input))
}
And here's my attempt that doesn't work:
import scala.util.parsing.combinator.RegexParsers
case class Example(index: Int, text: String)
class ParsComb extends RegexParsers {
def examples: Parser[List[Example]] = rep(divider~example) ^^
{_ map {case d ~ e => Example(d,e)}}
def divider: Parser[Int] = "[0-9]+".r <~ ")" ^^ (_.toInt)
def example: Parser[String] = ".*".r <~ (divider | "END")
}
It fails with:
[4.1] failure: `END' expected but `b' found
blah2
^
I'm just starting out with these so I don't have much clue what I'm doing. I think the problem could be with the ".*".r regex not doing multi-line. How can I change this so that it parses correctly?

What does the error message mean?
According to your grammar definition, ".*".r <~ (divider | "END"), you told to the parser that, an example should followed either by a divider or a END. After parsing blah1, the parser tried to find divider and failed, then tried END, failed again, there're no other options available, so the END here was the last alternative of the production value, so from the parser's perspective, it expected END, but it soon found, the next input was blah2 from the 4th line.
How to fix it?
Try to be close to your implementation, the grammar in your case should be:
examples ::= {divider example}
divider ::= Integer")"
example ::= {literal ["END"]}
and I think parsing "example" into List[String] makes more sense, anyway, it's up to you.
The problem is your example parser, it should be a repeatable literal.
So ,
class ParsComb extends RegexParsers {
def examples: Parser[List[Example]] = rep(divider ~ example) ^^ { _ map { case d ~ e => Example(d, e) } }
def divider: Parser[Int] = "[0-9]+".r <~ ")" ^^ (_.toInt)
def example: Parser[List[String]] = rep("[\\w]*(?=[\\r\\n])".r <~ opt("END"))
}
the regex (?=[\\r\\n]) means it's a positive lookahead and would match characters that followed by \r or \n.
the parse result is:
[10.1] parsed: List(Example(0,List(blah1, blah2, blah3)),
Example(1,List(blah4, blah5)))
If you want to parse it into a String(instead of List[String]), just add a transform function for example: ^^ {_ mkString "\n"}

Your parser can't process newline character, your example parser eliminates next divider and your example regex matches divider and "END" string.
Try this:
object ParsComb extends RegexParsers {
def examples: Parser[List[Example]] = rep(divider~example) <~ """END\n?""".r ^^ {_ map {case d ~ e => Example(d,e)}}
def divider: Parser[Int] = "[0-9]+".r <~ ")\n" ^^ (_.toInt)
def example: Parser[String] = rep(str) ^^ {_.mkString}
def str: Parser[String] = """.*\n""".r ^? { case s if simpleLine(s) => s}
val div = """[0-9]+\)\n""".r
def simpleLine(s: String) = s match {
case div() => false
case "END\n" => false
case _ => true
}
def apply(s: String) = parseAll(examples, s)
}
Result:
scala> ParsComb(input)
res3: ParsComb.ParseResult[List[Example]] =
[10.1] parsed: List(Example(0,blah1
blah2
blah3
), Example(1,blah4
blah5
))

I think the problem could be with the ".*".r regex not doing
multi-line.
Exactly. Use the dotall modifier (strangely called "s"):
def example: Parser[String] = "(?s).*".r <~ (divider | "END")

non-greedy matching in Scala RegexParsers

Suppose I'm writing a rudimentary SQL parser in Scala. I have the following:
class Arith extends RegexParsers {
def selectstatement: Parser[Any] = selectclause ~ fromclause
def selectclause: Parser[Any] = "(?i)SELECT".r ~ tokens
def fromclause: Parser[Any] = "(?i)FROM".r ~ tokens
def tokens: Parser[Any] = rep(token) //how to make this non-greedy?
def token: Parser[Any] = "(\\s*)\\w+(\\s*)".r
}
When trying to match selectstatement against SELECT foo FROM bar, how do I prevent the selectclause from gobbling up the entire phrase due to the rep(token) in ~ tokens?
In other words, how do I specify non-greedy matching in Scala?
To clarify, I'm fully aware that I can use standard non-greedy syntax (*?) or (+?) within the String pattern itself, but I wondered if there's a way to specify it at a higher level inside def tokens. For example, if I had defined token like this:
def token: Parser[Any] = stringliteral | numericliteral | columnname
Then how can I specify non-greedy matching for the rep(token) inside def tokens?

Not easily, because a successful match is not retried. Consider, for example:
object X extends RegexParsers {
def p = ("a" | "aa" | "aaa" | "aaaa") ~ "ab"
}
scala> X.parseAll(X.p, "aaaab")
res1: X.ParseResult[X.~[String,String]] =
[1.2] failure: `ab' expected but `a' found
aaaab
^
The first match was successful, in parser inside parenthesis, so it proceeded to the next one. That one failed, so p failed. If p was part of alternative matches, the alternative would be tried, so the trick is to produce something that can handle that sort of thing.
Let's say we have this:
def nonGreedy[T](rep: => Parser[T], terminal: => Parser[T]) = Parser { in =>
def recurse(in: Input, elems: List[T]): ParseResult[List[T] ~ T] =
terminal(in) match {
case Success(x, rest) => Success(new ~(elems.reverse, x), rest)
case _ =>
rep(in) match {
case Success(x, rest) => recurse(rest, x :: elems)
case ns: NoSuccess => ns
}
}
recurse(in, Nil)
}
You can then use it like this:
def p = nonGreedy("a", "ab")
By the way,I always found that looking at how other things are defined is helpful in trying to come up with stuff like nonGreedy above. In particular, look at how rep1 is defined, and how it was changed to avoid re-evaluating its repetition parameter -- the same thing would probably be useful on nonGreedy.
Here's a full solution, with a little change to avoid consuming the "terminal".
trait NonGreedy extends Parsers {
def nonGreedy[T, U](rep: => Parser[T], terminal: => Parser[U]) = Parser { in =>
def recurse(in: Input, elems: List[T]): ParseResult[List[T]] =
terminal(in) match {
case _: Success[_] => Success(elems.reverse, in)
case _ =>
rep(in) match {
case Success(x, rest) => recurse(rest, x :: elems)
case ns: NoSuccess => ns
}
}
recurse(in, Nil)
}
}
class Arith extends RegexParsers with NonGreedy {
// Just to avoid recompiling the pattern each time
val select: Parser[String] = "(?i)SELECT".r
val from: Parser[String] = "(?i)FROM".r
val token: Parser[String] = "(\\s*)\\w+(\\s*)".r
val eof: Parser[String] = """\z""".r
def selectstatement: Parser[Any] = selectclause(from) ~ fromclause(eof)
def selectclause(terminal: Parser[Any]): Parser[Any] =
select ~ tokens(terminal)
def fromclause(terminal: Parser[Any]): Parser[Any] =
from ~ tokens(terminal)
def tokens(terminal: Parser[Any]): Parser[Any] =
nonGreedy(token, terminal)
}

scala dsl parser: rep, opt and regexps

Learning how to use scala DSL:s and quite a few examples work nice.
However I get stuck on a very simple thing:
I'm parsing a language, which has '--' as comment until end of line.
A single line works fine using:
def comment: Parser[Comment] = """--.*$""".r ^^ { case c => Comment(c) }
But when connecting multiple lines I get an error.
I've tried several varaints, but the following feel simple:
def commentblock: Parser[List[Comment]] = opt(rep(comment)) ^^ {
case Some(x) => { x }
case None => { List() }
}
When running a test with two consecutive commentlines I get an error.
Testcase:
--Test Comment
--Test Line 2
Error:
java.lang.AssertionError: Parse error: [1.1] failure: string matching regex `--.*$' expected but `-' found
Any ideas on how I should fix this?
Complete code below:
import scala.util.parsing.combinator._
abstract class A
case class Comment(comment:String) extends A
object TstParser extends JavaTokenParsers {
override def skipWhitespace = true;
def comment: Parser[Comment] = """--.*$""".r ^^ { case c => Comment(c) }
def commentblock: Parser[List[Comment]] = opt(rep(comment)) ^^ {
case Some(x) => { x }
case None => { List() }
}
def parse(text : String) = {
parseAll(commentblock, text)
}
}
class TestParser {
import org.junit._, Assert._
#Test def testComment() = {
val y = Asn1Parser.parseAll(Asn1Parser.comment, "--Test Comment")
assertTrue("Parse error: " + y, y.successful)
val y2 = Asn1Parser.parseAll(Asn1Parser.commentblock,
"""--Test Comment
--Test Line 2
""")
assertTrue("Parse error: " + y2, y2.successful)
}
}

Not familiar with Scala, but in Java, the regex --.*$ matches:
-- two hyphens;
.* followed by zero or more characters other than line breaks;
$ followed by the end of the input (not necessarily the end of the line!).
So you could try:
def comment: Parser[Comment] = """--.*""".r ^^ { case c => Comment(c) }
or even:
def comment: Parser[Comment] = """--[^\r\n]*""".r ^^ { case c => Comment(c) }
Note that in both cases, the line break is left in place and not "consumed" by your comment "rule".

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Scala CSV parser with comments - regex

Related

scala repsep with new non empty line separator

Scala Regex Pattern Matching

Using parser combinators to collate lines of text

non-greedy matching in Scala RegexParsers

scala dsl parser: rep, opt and regexps

Categories

Resources