Does order really matter with <|> here in this parser?

The Monday Morning Haskell post Parsing Part 2: Applicative Parsing says this about alternation with regex-applicative:
Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.
Referring to this function from their Git repository:
numberParser :: RE Char Value
numberParser = (ValueNumber . read) <$>
  (negativeParser <|> decimalParser <|> integerParser)
  where
    integerParser = some (psym isNumber)
    decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber)
    negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser)
    combineDecimal :: String -> Char -> String -> String
    combineDecimal base point decimal = base ++ (point : decimal)
However, I can't figure out why that would be so. When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed). The decimal parser must have a decimal point, and the integer parser can't have one, so it will stop parsing there, resulting in the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser. It seems like the only case this wouldn't occur in would be if the next part of the overall parser after this one could accept the decimal point (making it an ambiguous grammar), but you still wouldn't "lose all the information after the decimal, or worse, have a parse failure". Is my reasoning correct and this a mistake in that article, or is there a case I'm not seeing where one of their outcomes could happen?

There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.
When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed).
The tests pass because the tests don't cover this part of the parser (the closest ones only exercise stringParser).
Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in test/Spec.hs and add it to the do block under main):
badex :: Spec
badex = describe "Bad example" $ do
  it "Should fail" $
    shouldMatch
      exampleLineParser
      "| 3.4 |\n"
      [ ValueNumber 3.4 ]
If you swap the parsers, you get ValueNumber 3.0 as the result: integerParser (which is now first) succeeds after parsing 3, and then the rest of the input gets discarded.
To give more context, we have to see where numberParser is used:
numberParser is one of the alternatives of valueParser...
which is used in exampleLineParser, where valueParser is followed by readThroughBar (the relevant piece of code is literally valueParser <* readThroughBar);
readThroughBar discards all characters until the next vertical bar (using many (psym (\c -> c /= '|' && c /= '\n'))).
So if valueParser succeeds parsing just 3, the subsequent readThroughBar will happily consume and discard the rest, .4 |.
The explanation from the blogpost you quote is only partially correct:
Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.
You will only lose information if your parser actively discards it, which readThroughBar does here.
As you already suggested, the backtracking behavior of RE means that the noncommutativity of <|> really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general), which would not be a problem here if readThroughBar were less lenient, e.g., by consuming only whitespace before |.
I think that shows that using psym with (/=) is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.
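For concreteness, here is a sketch of that stricter version (assuming, per the quoted fragment, that readThroughBar is essentially many (psym (\c -> c /= '|' && c /= '\n')) followed by the bar itself):

import Control.Applicative (many)
import Data.Char (isSpace)
import Text.Regex.Applicative

-- A stricter readThroughBar (my sketch, not the blog's code): only
-- whitespace may appear before the bar, so unconsumed input like ".4"
-- now causes a parse failure instead of being silently discarded.
readThroughBar' :: RE Char ()
readThroughBar' = () <$ many (psym isSpace) <* sym '|'

With this version, the swapped integerParser <|> decimalParser order would fail loudly on "| 3.4 |\n" instead of producing ValueNumber 3.0.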

The issue comes with parsing something like:
12.34
If you try the integer parser first then it will find the 12, conclude that it has found an integer, and then try to parse the .34 as the next token.
[...] the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser.
Yes, as long as your parser is doing backtracking. For performance reasons most production parsers (e.g. megaparsec) don't do that unless they are specifically told to (usually using the try combinator). The RE parser used in the blog post may well be an exception, but I can't find its interpreter to check.
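To make that concrete, here is a minimal megaparsec sketch (the parser names are mine, not from the blog post):

import Control.Applicative ((<|>), some)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, try)
import Text.Megaparsec.Char (char, digitChar)

type Parser = Parsec Void String

integerP, decimalP :: Parser String
integerP = some digitChar
decimalP = (\whole dot frac -> whole ++ dot : frac)
  <$> some digitChar <*> char '.' <*> some digitChar

-- Plain decimalP <|> integerP fails on "12": decimalP consumes the
-- digits, fails on the missing '.', and megaparsec commits instead of
-- falling back to integerP. Wrapping the first branch in try restores
-- the backtracking that regex-applicative performs implicitly.
numberP :: Parser String
numberP = try decimalP <|> integerP

Here parseMaybe numberP "12" returns Just "12" only because of try.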

Related

Haskell: check if the regular expression r made up of the single-symbol alphabet Σ = {a} defines the language L(r) = a*

I have got to write an algorithm programmatically in Haskell. The program takes a regular expression r made up of the unary alphabet Σ = {a} and checks whether the regular expression r defines the language L(r) = a^* (Kleene star). I am looking for any kind of tip. I know that I can translate any regular expression to the corresponding NFA, then to the DFA, and at the very end minimize the DFA and compare, but is there any other way to achieve my goal? I am asking because it is clearly said that this is a unary alphabet, so I suppose that I have to use this information somehow to make this exercise much easier.
This is what my regular expression data type looks like:
data Reg = Epsilon        -- epsilon regex
         | Literal Char   -- a
         | Or Reg Reg     -- (a|a)
         | Then Reg Reg   -- (aa)
         | Star Reg       -- (a)*
  deriving Eq
Yes, there is another way. Every DFA for a regular language over a single-letter alphabet is a "lollipop"1: an initial chain of nodes, each pointing to the next (some of which are marked as final and some not), followed by a loop of nodes (again, some marked final and some not).
So instead of doing a full compilation pass, you can go directly to a DFA, where you simply store two [Bool] saying which nodes in the lead-in and in the loop are marked final (or perhaps two [Integer] giving the indices and two Integer giving the lengths may be easier, depending on your implementation plans). You don't need to ensure the compiled version is minimal; it's easy enough to check that all the Bools are True.
The base cases for Epsilon and Literal are pretty straightforward, and with a bit of work and thought you should be able to work out how to implement the combining functions for "or", "then", and "star" (hint: think about gcd's and stuff).
1 You should try to prove this before you begin implementing, so you can be sure you believe me.
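To make the hint concrete, here is a minimal sketch of the representation and the final check (the names are mine, not the answer's):

-- Acceptance flags of a unary-alphabet DFA: a chain of states followed
-- by a cycle (the "lollipop"). Reading n copies of 'a' lands on the
-- n-th flag, continuing around the loop forever.
data Lollipop = Lollipop
  { leadIn :: [Bool]  -- final/non-final along the initial chain
  , loop   :: [Bool]  -- final/non-final around the cycle
  }

-- Every count of 'a's lands on some state, so L(r) = a* holds exactly
-- when every state is accepting.
isAStar :: Lollipop -> Bool
isAStar (Lollipop li lp) = and li && and lp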
Edit 1: Hm, while on my afternoon walk today, I realized the idea I had in mind for "then" (and therefore "star") doesn't work. I'm not giving up on this idea (and deleting this answer) yet, but those operations may be trickier than I gave them credit for at first. This approach definitely isn't for the faint of heart!
Edit 2: Okay, I believe now that I have access to pencil and paper I've worked out how to do concatenation and iteration. Iteration is actually easier than concatenation. I'll give a hint for each -- though I have no idea whether the hint is a good one or not!
Suppose your two lollipops have a length m lead-in and a length n loop for the first one, and m'/n' for the second one. Then:
For iteration of the first lollipop, there's a fairly mechanical/simple way to produce a lollipop with a 2*m + 2*n-long lead-in and n-long loop.
For concatenation, you can produce a lollipop with m + n + m' + lcm(n, n')-long lead-in and n-long loop (yes, that short!).

a fully backtracking star operator in parsec

I am trying to build a real, fully backtracking + combinator in Parsec. That is, one that receives a parser and tries to find one or more instances of it.
That would mean that parse_one_or_more foolish_a would be able to match nine a characters in a row, for example (see the code below for context).
As far as I understand it, the reason why it does not currently do so is that, after foolish_a finds a match (the first two as), the many1 (try p1) never gives up on that match.
Is this possible in parsec? I'm pretty sure it would be very slow (this simple example is already exponential!), but I wonder if it can be done. It is for a programming challenge that runs without time limit -- I would not want to use it in the wild.
import Text.Parsec
import Text.Parsec.String (Parser)

parse_one_or_more :: Parser String -> Parser String
parse_one_or_more p1 = many1 (try p1) >> eof >> return "bababa"

foolish_a = parse_one_or_more (try (string "aa") <|> string "aaa")
good_a = parse_one_or_more (string "aaa")
-- |
-- >>> parse foolish_a "unused triplet" "aaaaaaaaa"
-- Left...
-- ...
-- >>> parse good_a "unused" "aaaaaaaaa"
-- Right...
You are correct -- Parsec-like libraries can't do this in a way that works for any input. Parsec's implementation of (<|>) is left-biased and commits to the left parser if it matches, regardless of anything that may happen later in the grammar. When the two arguments of (<|>) overlap, such as in (try (string "aa") <|> string "aaa"), there is no way to make parsec backtrack into the alternation and try the right-side match once the left side has succeeded.
If you want to do this, you will need a different library, one whose (<|>) operator is not left-biased and committing.
Yes. Since Parsec produces a recursive-descent parser, you would rather make an unambiguous guess first to minimize the need for backtracking. So if your first guess is "aa" and it happens to overlap with a later guess "aaa", backtracking is necessary. Sometimes a grammar is LL(k) for some k > 1, and you end up using backtracking out of pure necessity.
The only time I use try is when I know that the backtracking is quite limited (k is low). For example, I might have an operator ? that overlaps with another operator ?//; I want to parse ? first because of precedence rules, but I want the parser to fail in case it's followed by // so that it can eventually reach the correct parse. Here k = 2, so the impact is quite low, but also I don't need an operator here that lets me backtrack arbitrarily.
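Here is a sketch of that k = 2 situation in Parsec (the operator and parser names are mine):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Accept the "?" operator only when it is not immediately followed by
-- "//"; try limits the backtracking to at most two characters.
qOp :: Parser String
qOp = try (string "?" <* notFollowedBy (string "//"))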
If you want a parser combinator library that lets you fully backtrack all the time, this may come at a severe cost to performance. You could look into Text.ParserCombinators.ReadP's +++ symmetric choice operator, which tries both alternatives. This is an example of what Carl suggested: a <|> that is not left-biased and does not commit.
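For example, here is the foolish_a case from the question sketched with ReadP's symmetric choice:

import Text.ParserCombinators.ReadP

one_or_more :: ReadP String -> ReadP String
one_or_more p = many1 p >> eof >> return "bababa"

-- +++ keeps both alternatives alive instead of committing to the left
-- one, so nine 'a's can be split as 3+3+3 (or 2+2+2+3, and so on).
foolish_a :: ReadP String
foolish_a = one_or_more (string "aa" +++ string "aaa")

Running readP_to_S foolish_a "aaaaaaaaa" now yields successful parses, at the exponential cost the question anticipates.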

Is it because Clojure is limited by the JVM that this code can't evaluate?

Code that includes something like '(1+2) in Clojure causes a java.lang.RuntimeException with the error message "Unmatched delimiter: )".
But in any other Lisp dialect I've used, such as Emacs Lisp or Racket, '(1+2) just returns a list, which is how it should act: with the special form quote, nothing in the list should be evaluated.
So I wonder: is it because of a JVM limitation that this code can't act the way it does in other dialects? Or is it a bug in Clojure? Or is the definition of quote in Clojure different from that in other Lisp dialects?
These are artifacts of the way tokenizers are set up in different languages. In Clojure, if a token starts with a digit, it is consumed until the next reader macro character (which includes parentheses, among other things), whitespace, or end of file (whitespace includes the comma). And what's consumed must be a valid number: integer, float, or rational. So when you feed '(1+2) to the reader, it consumes 1+2 as one token, which then fails to match against the integer, float, and rational number patterns. After that, the reader tries to recover, which resets its state. In this state, a ) is unmatched.
Try entering '(1 + 2) instead (mind the spaces around +), and you will see exactly what you expect.

Faithfully handle white-spacing in a pretty-printer

I am writing a front-end for a language (by ocamllex and ocamlyacc).
So the front-end can build an abstract syntax tree (AST) from a program. Then we often write a pretty printer, which takes an AST and prints a program. If we later just want to compile or analyse the AST, most of the time we don't need the printed program to be exactly the same as the original program in terms of white-spacing. However, this time I want to write a pretty printer that prints exactly the same program as the original one, in terms of white-spacing.
Therefore, my question is: what are the best practices for handling white-spacing while trying not to modify the AST types too much? I really don't want to add a number (of white-spaces) to each type in the AST.
For example, this is how I currently deal with (i.e., skip) white-spacing in lexer.mll:
rule token = parse
  ...
  | [' ' '\t']  { token lexbuf }  (* skip blanks *)
  | eof         { EOF }
Does anyone know how to change this as well as other parts of the front-end to correctly taking white-spacing into account for a later printing?
It's quite common to keep source-file location information for each token. This information allows for more accurate errors, for example.
The most general way to do this is to keep the beginning and ending line number and column position for each token, which is a total of four numbers. If it were easy to compute the end position of a token from its value and the start position, that could be reduced to two numbers, but at the price of extra code complexity.
Bison has some features which simplify the bookkeeping work of remembering location objects; it's possible that ocamlyacc includes similar features, but I didn't see anything in the documentation. In any case, it is straightforward to maintain a location object associated with each input token.
With that information, it is easy to recreate the whitespace between two adjacent tokens, as long as what separated the tokens was whitespace. Comments are another issue.
It's a judgement call whether or not that is simpler than just attaching preceding whitespace (and even comments) to each token as it is lexed.
You can have match statements that print a different number of spaces depending on the token you're dealing with. I would usually print one space before the token if it is an id, a num, a define statement, or an assign (=).
If the token is an arithmetic expression, I would print one space before and one space after it.
If you are dealing with an if or while statement, I would indent the body by four spaces.
I think the best bet would be to write a pretty_print function such as:
let rec pretty_print pos ast =
  match ast with
  | Some_token ->
      print_string (String.make pos ' ');  (* print 'pos' spaces; pos starts off as zero *)
      print_string "Some_token"
  | Other_token -> ...
In sum I would handle white spaces by matching each token individually in a recursive function, and printing out the appropriate number of spaces.

Is it possible to efficiently look ahead more than one Char in Attoparsec?

I'm trying to augment Haskell's Attoparsec parser library with a function
takeRegex :: Regex -> Parser ByteString
using one of the regexp implementations.
(Motivation: Good regex libraries can provide performance that is linear in the length of the input, while attoparsec needs to backtrack. A portion of my input is particularly amenable to parsing using regexps, and even the backtracking Text.Regex.PCRE library gets me a 4x speedup over attoparsec code for that piece.)
Attoparsec used to have a getInput :: Parser ByteString function to get (without consuming) the remaining input; that would probably have been quite perfect for my purposes, as my input is non-incremental, strict and reasonably small -- I run the parser on a line of a log file at a time. With it, it seems I could have done something like
takeRegex re = do
  input <- getInput
  m <- matchM re input
  take $ length m
Unfortunately recent versions of attoparsec lack this function. Is there some way to achieve the same? Why has the function been removed?
Now there is the takeByteString :: Parser ByteString function, which takes and consumes the rest of the input. If there was a function to attempt a parse and get the result without actually consuming anything, this could be used in conjunction with that, but I cannot seem to find (or figure out how to implement) such a function either.
Is there some way to achieve this with the current version of attoparsec?
There are a few solutions to this, but none are great....
Method 1- Fast to implement, but not so fast to run
Well (according to http://hackage.haskell.org/package/attoparsec-0.10.1.1/docs/Data-Attoparsec-ByteString.html), attoparsec always backtracks on failure, so you can always do something like this:
parseLine1 = do
  line <- takeTill (== '\n')
  char '\n'
  case <some sort of test on line, i.e. a regex> of
    Just _  -> return <some sort of data type>
    Nothing -> fail "Parse Error"
then later, many of these chained together will work as expected:
parseLine = parseLine1 <|> parseLine2
The problem with this solution is, as you can see, you are still doing a bunch of backtracking, which can really slow things down.
Method 2- The traditional method
The usual way to handle this type of thing is to rewrite the grammar, or in the case of a parser combinator, move stuff around, to make the full algorithm need only one character of lookahead. This can almost always be done in practice, although it sometimes makes the logic much harder to follow....
For example, suppose you have a grammar production rule like this:
pet = "dog" | "dolphin"
Both alternatives start with "do", so a single character of lookahead cannot resolve them; the parser would need to see the third character before either path could be chosen. Instead you can left factor the whole thing like this:
pet => "do" halfpet
halfpet => "g" | "lphin"
No backtracking is needed, but the grammar is much uglier. (Although I wrote this as a production rule, there is a one-to-one mapping to a similar parser combinator, sketched below.)
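Here is that mapping as a short sketch with attoparsec (the names are mine; the parsers return only the tail of the word, which is enough to show the point):

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8 (Parser, string)
import Data.ByteString (ByteString)

pet :: Parser ByteString
pet = string "do" *> halfpet

-- the two branches now differ in their first character, so one
-- character of lookahead always suffices
halfpet :: Parser ByteString
halfpet = string "g" <|> string "lphin"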
Method 3- The correct way, but involved to write.
The true way that you want to do this is to directly compile a regex to a parser combinator.... Once you compile any regular language, the resulting algorithm only ever needs one character of lookahead, so the resulting attoparsec code should be pretty simple (like the routine in method 1, for a single-character read), but the work will be in compiling the regex.
Compiling a regex is covered in many textbooks, so I won't go into detail here, but it basically amounts to replacing all the ambiguous paths in the regex state machine with new states. Or to put it differently, it automatically "left factors" all the cases that would need backtracking.
(I wrote a library that automatically "left factors" many cases in context-free grammars, turning almost any context-free grammar into a linear parser, but I haven't yet made it available.... some day, when I have cleaned it up, I will.)
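To give a rough flavor of the idea (this sketch uses Brzozowski derivatives rather than the textbook state-machine compilation, but it has the same key property): matching advances one character at a time and never backtracks.

data Re = Empty | Eps | Chr Char | Alt Re Re | Seq Re Re | Star Re

-- can the expression match the empty string?
nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Alt a b) = nullable a || nullable b
nullable (Seq a b) = nullable a && nullable b
nullable (Star _)  = True

-- the regex matching exactly the words of r that start with c,
-- with that leading c removed
deriv :: Char -> Re -> Re
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Chr d)   = if c == d then Eps else Empty
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Seq a b)
  | nullable a    = Alt (Seq (deriv c a) b) (deriv c b)
  | otherwise     = Seq (deriv c a) b
deriv c (Star a)  = Seq (deriv c a) (Star a)

-- consume the input strictly left to right, one character at a time
matches :: Re -> String -> Bool
matches r = nullable . foldl (flip deriv) r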