I am trying to build a real, fully backtracking + (one-or-more) combinator on top of Parsec.
That is, one that receives a parser and tries to find one or more instances of it.
That would mean that foolish_a below would be able to match nine a characters in a row, for example (see the code below for context).
As far as I understand it, the reason it does not currently do so is that, after the inner parser finds a match (the first two as), the many1 (try p1) never gives up on that match.
Is this possible in Parsec? I am pretty sure it would be very slow (this simple example is already exponential!), but I wonder if it can be done. It is for a programming challenge with no time limit -- I would not want to use it in the wild.
import Text.Parsec
import Text.Parsec.String (Parser)
parse_one_or_more :: Parser String -> Parser String
parse_one_or_more p1 = (many1 (try p1)) >> eof >> return "bababa"
foolish_a = parse_one_or_more (try (string "aa") <|> string "aaa")
good_a = parse_one_or_more (string "aaa")
-- |
-- >>> parse foolish_a "unused triplet" "aaaaaaaaa"
-- Left...
-- ...
-- >>> parse good_a "unused" "aaaaaaaaa"
-- Right...
You are correct: Parsec-like libraries can't do this in a way that works for any input. Parsec's implementation of (<|>) is left-biased and commits to the left parser if it matches, regardless of anything that may happen later in the grammar. When the two arguments of (<|>) overlap, as in (try (string "aa") <|> string "aaa"), there is no way to make Parsec backtrack into that choice and try the right-hand side after the left-hand side has succeeded.
If you want to do this, you will need a different library, one that doesn't have a (<|>) operator that's left-biased and commits.
Yes. Since Parsec produces a recursive-descent parser, you generally want to make an unambiguous guess first, to minimize the need for backtracking. So if your first guess is "aa" and it happens to overlap with a later guess "aaa", backtracking is necessary. Sometimes a grammar is only LL(k) for some k > 1, and then you use backtracking out of pure necessity.
The only time I use try is when I know that the backtracking is quite limited (k is low). For example, I might have an operator ? that overlaps with another operator ?//; I want to parse ? first because of precedence rules, but I want the parser to fail in case it's followed by // so that it can eventually reach the correct parse. Here k = 2, so the impact is quite low, but also I don't need an operator here that lets me backtrack arbitrarily.
If you want a parser combinator library that lets you fully backtrack all the time, this may come at a severe cost to performance. You could look into Text.ParserCombinators.ReadP's symmetric choice operator +++, which tries both sides. This is an example of what Carl suggested: a choice operator that is not left-biased and does not commit.
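For instance, here is a minimal sketch (the names are mine) of the question's combinator written against ReadP, where +++ keeps both branches alive instead of committing:
import Text.ParserCombinators.ReadP
-- One-or-more with full backtracking: every way of splitting the input
-- between the aa and aaa branches is explored, and eof keeps only the
-- splits that consume the whole input.
one_or_more_rp :: ReadP String -> ReadP String
one_or_more_rp p = many1 p >> eof >> return "bababa"
foolish_a_rp :: ReadP String
foolish_a_rp = one_or_more_rp (string "aa" +++ string "aaa")
-- readP_to_S foolish_a_rp "aaaaaaaaa" yields successful results of the form
-- ("bababa", ""), e.g. via the split 3+3+3, so the "foolish" parser now succeeds.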
Related
I have read some of this post Meaning of Alternative (it's long)
What led me to that post was learning about Alternative in general. The post gives a good answer as to why it is implemented the way it is for List.
My question is:
Why is Alternative implemented for List at all?
Is there perhaps an algorithm that uses Alternative, to which a list might be passed, so the instance is defined to keep things general?
I thought that because Alternative defines some and many by default, that might be part of it, but the post What are some and many useful for contains the comment:
To clarify, the definitions of some and many for the most basic types such as [] and Maybe just loop. So although the definition of some and many for them is valid, it has no meaning.
In the "What are some and many useful for" link above, Will gives an answer to the OP that may contain the answer to my question, but at this point in my Haskelling, the forest is a bit thick to see the trees.
Thanks
There's something of a convention in the Haskell library ecology that if a thing can be an instance of a class, then it should be an instance of the class. I suspect the honest answer to "why is [] an Alternative?" is "because it can be".
...okay, but why does that convention exist? The short answer there is that instances are sort of the one part of Haskell that succumbs only to whole-program analysis. They are global, and if there are two parts of the program that both try to make a particular class/type pairing, that conflict prevents the program from working right. To deal with that, there's a rule of thumb that any instance you write should live in the same module either as the class it's associated with or as the type it's associated with.
Since instances are expected to live in specific modules, it's polite to define those instances whenever you can -- since it's not really reasonable for another library to try to fix up the fact that you haven't provided the instance.
Alternative is useful when viewing [] as the nondeterminism-monad. In that case, <|> represents a choice between two programs and empty represents "no valid choice". This is the same interpretation as for e.g. parsers.
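Concretely, for lists <|> is concatenation and empty is the empty list (a quick GHCi illustration):
Prelude Control.Applicative> [1,2] <|> [3,4]
[1,2,3,4]
Prelude Control.Applicative> empty :: [Int]
[]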
some and many indeed don't make sense for lists, since they try to iterate through all possible lists of elements from the given options greedily, starting from the infinite list of just the first option. The list monad isn't lazy enough to even do that, since it might always need to abort if it were given an empty list. There is, however, one case in which both terminate: when given an empty list.
Prelude Control.Applicative> many []
[[]]
Prelude Control.Applicative> some []
[]
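For comparison, the library definitions are essentially the following (rewritten here under fresh names):
import Control.Applicative (Alternative, (<|>))
-- Essentially the Control.Applicative defaults: the recursion goes through
-- `some` first, i.e. it insists on one more element before ever offering the
-- empty list, which is why both loop on any non-empty list.
some_def, many_def :: Alternative f => f a -> f [a]
some_def v = (:) <$> v <*> many_def v
many_def v = some_def v <|> pure []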
If some and many were defined as lazy (in the regex sense), meaning they prefer short lists, you would get results out, but not very useful ones, since it starts by generating the infinitely many lists built from just the first option:
Prelude Control.Applicative> some' v = liftA2 (:) v (many' v); many' v = pure [] <|> some' v
Prelude Control.Applicative> take 100 . show $ (some' [1,2])
"[[1],[1,1],[1,1,1],[1,1,1,1],[1,1,1,1,1],[1,1,1,1,1,1],[1,1,1,1,1,1,1],[1,1,1,1,1,1,1,1],[1,1,1,1,1,"
Edit: I believe the some and many functions correspond to a star-semiring, while <|> and empty correspond to plus and zero in a semiring. So mathematically (I think), it would make sense to split those operations out into a separate typeclass, but it would also be kind of silly, since they can be implemented in terms of the other operators in Alternative.
Consider a function like this:
fallback :: Alternative f => a -> (a -> f b) -> (a -> f e) -> f (Either e b)
fallback x f g = (Right <$> f x) <|> (Left <$> g x)
Not spectacularly meaningful, but you can imagine it being used in, say, a parser: try one thing, falling back to another if that doesn't work.
Does this function have a meaning when f ~ []? Sure, why not. If you think of a list's "effects" as being a search through some space, this function seems to represent some kind of biased choice, where you prefer the first option to the second, and while you're willing to try either, you also tag which way you went.
Could a function like this be part of some algorithm which is polymorphic in the Alternative it computes in? Again I don't see why not. It doesn't seem unreasonable for [] to have an Alternative instance, since there is an implementation that satisfies the Alternative laws.
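For instance, with lists (an illustrative GHCi session, assuming fallback is in scope):
Prelude Control.Applicative> fallback 3 (\x -> [x+1, x+2]) (\x -> [x*10])
[Right 4,Right 5,Left 30]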
As to the answer linked to by Will Ness that you pointed out: it covers that some and many don't "just loop" for lists. They loop for non-empty lists. For empty lists, they immediately return a value. How useful is this? Probably not very, I must admit. But that functionality comes along with (<|>) and empty, which can be useful.
So I'm using Rust to build a parser for my programming language and I need it to, well ... parse. I'm implementing a Stream struct that is responsible for some basic operations on the Vec<char> that it possesses. One of those functions is as follows:
fn consume_while_regex_matches(&self, regex: Regex, mut acc: String) -> (Self, String)
It's a recursive function that takes a regex and an accumulator which it's supposed to be filling up until the next char in the stream makes it no longer match the Regex. I'd use this function to match simple tokens like integers, strings, etc.
I tried implementing it in the following kind of way (this is not the exact version, but it explains the issue):
acc.push(self.peek());
// If adding the next char breaks the match, undo it and stop consuming.
if !regex.is_match(&acc) {
    acc.pop();
    return (self.clone(), acc);
}
return self.consume_while_regex_matches(regex, acc);
The problem with this code is that it is testing whether acc matches the whole regex. Imagine if we want to consume a string -42 with a regex like ^-[0-9]+$. The algorithm would read the very first char -, the match would fail and the accumulator is going to be empty.
Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match?
Like - is not a match on its own, but -42 is a match and - is a valid prefix.
And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.
Update: I'm not using the described function for parsing. I use it for lexing. I am aware that regex is not enough to parse a complex language.
What I'm asking is whether I can use some regex lib to match tokens gradually, as opposed to having to provide the whole string to match against. I'm looking for a way to check whether the underlying DFA is or isn't in the error state by the end of matching a string, without having to write my own regex parser and DFA implementation. If this were possible, I'd feed the - to the integer regex, check that matching it didn't end up in the ERROR state, and if so, conclude that - is a valid prefix.
Taking the question at face value: the crate regex-automata (maintained by one of the regex authors) provides some access to the lower-level details of building and parsing regexes. In particular, you can access and drive the internal DFA (deterministic finite automaton) yourself.
use regex_automata::{Regex, DFA}; // 0.1.9

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
enum PotentialMatch {
    Match,
    CouldMatch,
    DoesntMatch,
}

fn potentially_matches(pattern: &Regex, input: &str) -> PotentialMatch {
    let input = input.as_bytes();
    let dfa = pattern.forward();
    // Feed the input to the DFA one byte at a time; a dead state means no
    // extension of this input can ever match.
    let mut state = dfa.start_state();
    for byte in input {
        state = dfa.next_state(state, *byte);
        if dfa.is_dead_state(state) {
            return PotentialMatch::DoesntMatch;
        }
    }
    // Not dead: either we are in a match state right now, or we could still
    // reach one given more input.
    if dfa.is_match_state(state) {
        PotentialMatch::Match
    } else {
        PotentialMatch::CouldMatch
    }
}

fn main() {
    let pattern = Regex::new("-[0-9]+").unwrap();
    assert_eq!(potentially_matches(&pattern, ""), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-"), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-1"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12a"), PotentialMatch::DoesntMatch);
}
You could probably integrate this state tracking into your implementation to be more performant than calling potentially_matches() over and over.
Regular expressions are basically a succinct format for state machines that work on inputs of characters/strings, whether finite-state machines or pushdown automata, depending on the complexity of the language you need to accept/reject.
Your approach is to build up the string until it matches against the full regex, but regexes are usually compiled into state machines for efficiency. These state machines break the complex regex down into individual states and then transition between those states token-by-token (or character-by-character). A single regex may compile to many different states. As a (somewhat improper) FSM example, [+-][1-9][0-9]* becomes a machine whose start state consumes the sign, whose next state consumes the first nonzero digit to reach the accept state s2, and where s2 loops on any further digit.
Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match? Like - is not a match on its own, but -42 is a match and - is a valid prefix. And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.
You could use pure regex for simpler parsing problems, but if you're building a programming language, you're eventually going to be looking for a parsing library to handle parsing to arbitrary nested depth (which requires a system stronger than regexes). In Rust, one of the most popular crates for this is nom. Nom is a parsing combinator-style library, which means it relies on putting smaller parsers together in different ways with "operators" called "combinators." Regular expressions are often a weaker form of parser than these due to the limited number and limited properties of their operators/combinators.
I will also note that the full job of validating a proper program written in a programming language won't be satisfied by either, which is why we bother going beyond the lexing/parsing process. All traditional and proper programming languages are somewhere between context-sensitive grammars and Turing-complete languages, which means parsing alone won't satisfy your needs because there is additional context and semantic information that needs to be checked by type checking, reference checking, etc.
I have to write an algorithm in Haskell. The program takes a regular expression r over the unary alphabet Σ = {a} and checks whether the regular expression r defines the language L(r) = a^* (Kleene star). I am looking for any kind of tip. I know that I can translate any regular expression to the corresponding NFA, then to the DFA, minimize the DFA at the very end, and then compare, but is there any other way to achieve my goal? I am asking because it is clearly stated that this is a unary alphabet, so I suppose I have to use this information somehow to make the exercise much easier.
This is what my regular expression data type looks like:
data Reg = Epsilon |      -- epsilon regex
           Literal Char | -- a
           Or Reg Reg |   -- (a|a)
           Then Reg Reg | -- (aa)
           Star Reg       -- (a)*
           deriving Eq
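For instance (my example, using the constructors above), the regular expression (a|aa)* would be represented as:
example :: Reg
example = Star (Or (Literal 'a') (Then (Literal 'a') (Literal 'a')))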
Yes, there is another way. Every DFA for a regular language over a single-letter alphabet is a "lollipop"1: an initial chain of nodes, each pointing to the next (some of which are marked as final and some not), followed by a loop of nodes (again, some of which are marked as final and some not). So instead of doing a full compilation pass, you can go directly to a DFA, where you simply store two [Bool] saying which nodes in the lead-in and in the loop are marked final (or perhaps two [Integer] giving the indices and two Integer giving the lengths may be easier, depending on your implementation plans). You don't need to ensure the compiled version is minimal; it's easy enough to check that all the Bools are True. The base cases for Epsilon and Literal are pretty straightforward, and with a bit of work and thought you should be able to work out how to implement the combining functions for "or", "then", and "star" (hint: think about gcds and stuff).
1 You should try to prove this before you begin implementing, so you can be sure you believe me.
Edit 1: Hm, while on my afternoon walk today, I realized the idea I had in mind for "then" (and therefore "star") doesn't work. I'm not giving up on this idea (and deleting this answer) yet, but those operations may be trickier than I gave them credit for at first. This approach definitely isn't for the faint of heart!
Edit 2: Okay, I believe now that I have access to pencil and paper I've worked out how to do concatenation and iteration. Iteration is actually easier than concatenation. I'll give a hint for each -- though I have no idea whether the hint is a good one or not!
Suppose your two lollipops have a length m lead-in and a length n loop for the first one, and m'/n' for the second one. Then:
For iteration of the first lollipop, there's a fairly mechanical/simple way to produce a lollipop with a 2*m + 2*n-long lead-in and n-long loop.
For concatenation, you can produce a lollipop with m + n + m' + lcm(n, n')-long lead-in and n-long loop (yes, that short!).
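To make the representation concrete, here is a minimal sketch (the type and names are mine) of the lollipop encoding and of the final "all Bools are True" check described above; the combining functions for "or", "then", and "star" are deliberately left out:
-- Acceptance flags along the lead-in (starting at the initial state) and
-- around the loop the automaton eventually settles into.
data Lollipop = Lollipop { leadIn :: [Bool], loopPart :: [Bool] }
-- Over the alphabet {a}, every state of a lollipop is reachable by some a^n,
-- so the automaton accepts exactly a* iff every state is marked final.
defines_a_star :: Lollipop -> Bool
defines_a_star (Lollipop li lp) = and li && and lp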
The Monday Morning Haskell post Parsing Part 2: Applicative Parsing says this about alternation with regex-applicative:
Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.
Referring to this function from their Git repository:
numberParser :: RE Char Value
numberParser = (ValueNumber . read) <$>
  (negativeParser <|> decimalParser <|> integerParser)
  where
    integerParser = some (psym isNumber)
    decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber)
    negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser)
    combineDecimal :: String -> Char -> String -> String
    combineDecimal base point decimal = base ++ (point : decimal)
However, I can't figure out why that would be so. When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed). The decimal parser must have a decimal point, and the integer parser can't have one, so it will stop parsing there, resulting in the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser. It seems like the only case this wouldn't occur in would be if the next part of the overall parser after this one could accept the decimal point (making it an ambiguous grammar), but you still wouldn't "lose all the information after the decimal, or worse, have a parse failure". Is my reasoning correct and this a mistake in that article, or is there a case I'm not seeing where one of their outcomes could happen?
There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.
When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed).
The tests pass because the tests don't cover this part of the parser (the closest ones only exercise stringParser).
Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in test/Spec.hs and add it to the do block under main):
badex :: Spec
badex = describe "Bad example" $ do
  it "Should fail" $
    shouldMatch
      exampleLineParser
      "| 3.4 |\n"
      [ ValueNumber 3.4 ]
If you swap the parsers, you get as a result ValueNumber 3.0: the integerParser (which is now first) succeeds parsing 3, but then the rest of the input gets discarded.
To give more context, we have to see where numberParser is used:
numberParser is one of the alternatives of valueParser...
which is used in exampleLineParser, where valueParser is followed by readThroughBar (and I mean the relevant piece of code is literally valueParser <* readThroughBar);
readThroughBar discards all characters until the next vertical bar (using many (psym (\c -> c /= '|' && c /= '\n'))).
So if valueParser succeeds parsing just 3, then the subsequent readThroughBar will happily consume and discard the rest .4 |.
The explanation from the blogpost you quote is only partially correct:
Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.
(emphasis mine) You will only lose information if your parser actively discards it, which readThroughBar does here.
As you already suggested, the backtracking behavior of RE means that the noncommutativity of <|> really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general), which would not be a problem here if readThroughBar were less lenient, e.g., by consuming only whitespace before |.
I think that shows that using psym with (/=) is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.
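For instance, a stricter delimiter parser might look like this (a sketch only; readThroughBar' and the imports are mine, not the repository's):
import Control.Applicative (many)
import Text.Regex.Applicative (RE, psym, sym)
-- Only spaces may appear before the bar, so leftover characters such as ".4"
-- make the whole line fail to match instead of being silently discarded.
readThroughBar' :: RE Char Char
readThroughBar' = many (psym (== ' ')) *> sym '|'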
The issue comes with parsing something like:
12.34
If you try the integer parser first then it will find the 12, conclude that it has found an integer, and then try to parse the .34 as the next token.
[...] the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser.
Yes, as long as your parser is doing backtracking. For performance reasons most production parsers (e.g. megaparsec) don't do that unless they are specifically told to (usually using the try combinator). The RE parser used in the blog post may well be an exception, but I can't find its interpreter to check.
I'm trying to augment Haskell's Attoparsec parser library with a function
takeRegex :: Regex -> Parser ByteString
using one of the regexp implementations.
(Motivation: Good regex libraries can provide performance that is linear in the length of the input, while attoparsec needs to backtrack. A portion of my input is particularly amenable to parsing using regexps, and even the backtracking Text.Regex.PCRE library gets me a 4x speedup over attoparsec code for that piece.)
Attoparsec used to have a getInput :: Parser ByteString function to get (without consuming) the remaining input; that would probably have been quite perfect for my purposes, as my input is non-incremental, strict and reasonably small – I run the parser for a line of a log file at a time. With it, it seems I could have done something like
takeRegex re = do
  input <- getInput
  m <- matchM re input
  take $ length m
Unfortunately recent versions of attoparsec lack this function. Is there some way to achieve the same? Why has the function been removed?
Now there is the takeByteString :: Parser ByteString function, which takes and consumes the rest of the input. If there was a function to attempt a parse and get the result without actually consuming anything, this could be used in conjunction with that, but I cannot seem to find (or figure out how to implement) such a function either.
Is there some way to achieve this with the current version of attoparsec?
There are a few solutions to this, but none are great....
Method 1- Fast to implement, but not so fast to run
Well, (according to http://hackage.haskell.org/package/attoparsec-0.10.1.1/docs/Data-Attoparsec-ByteString.html), attoparsec always backtracks on failure, so you can always do something like this-
parseLine1 = do
  line <- takeTill (== '\n')
  char '\n'
  case <some sort of test on line, ie- a regex> of
    Just -> return <some sort of data type>
    Nothing -> fail "Parse Error"
then later many of these chained together will work as expected
parseLine = parseLine1 <|> parseLine2
The problem with this solution is, as you can see, you are still doing a bunch of backtracking, which can really slow things down.
Method 2- The traditional method
The usual way to handle this type of thing is to rewrite the grammar, or in the case of a parser combinator, move stuff around, to make the full algorithm need only one character of lookahead. This can almost always be done in practice, although it sometimes makes the logic much harder to follow....
For example, suppose you have a grammar production rule like this-
pet = "dog" | "dolphin"
This would need three characters of lookahead before either path could be resolved, because the two alternatives only diverge at the third character. Instead you can left factor the whole thing like this
pet => "do" halfpet
halfpet => "g" | "lphin"
No backtracking is needed, but the grammar is much uglier. (Although I wrote this as a production rule, there is a one-to-one mapping to a similar parser combinator; see the sketch below.)
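Here is a sketch (my names, not from the original answer) of that one-to-one mapping in attoparsec: the shared "do" prefix is consumed once, so the remaining choice is resolved by a single character with no backtracking.
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8 (Parser, string)
import Data.ByteString (ByteString)
-- pet => "do" halfpet ; halfpet => "g" | "lphin", written as combinators.
pet :: Parser ByteString
pet = do
  _      <- string "do"
  suffix <- string "g" <|> string "lphin"
  return (if suffix == "g" then "dog" else "dolphin")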
Method 3- The correct way, but involved to write.
The true way that you want to do this is to directly compile a regex to a parser combinator.... Once you compile any regular language, the resulting algorithm only ever needs one character of lookahead, so the resulting attoparsec code should be pretty simple (like the routine in method 1 for a single character read), but the work will be in compiling the regex.
Compiling a regex is covered in many textbooks, so I won't go into detail here, but it basically amounts to replacing all the ambiguous paths in the regex state machine with new states. Or to put it differently, it automatically "left factors" all the cases that would need backtracking.
(I wrote a library that automatically "left factors" many cases in context-free grammars, turning almost any context-free grammar into a linear parser, but I haven't yet made it available.... some day, when I have cleaned it up, I will.)