Parser for recursive expressions hangs in ghci - regex

I am trying to make a parser for the following recursive datatype:
data Expr = Val Int
          | Var Char
          | App Op Expr Expr
          deriving Show
data Op = Add | Sub | Mul | Div
        deriving Show
It should, for example, parse "(1 + (a / -2))" as App Add (Val 1) (App Div (Var 'a') (Val (-2))). I've managed to write parsers for the Val and Var constructors as well as for Op's constructors like so:
import Text.Regex.Applicative
import Data.Char
rNonnegativeIntegral :: (Read a, Integral a) => RE Char a
rNonnegativeIntegral = read <$> some (psym isDigit)
rNegativeIntegral :: (Read a, Integral a) => RE Char a
rNegativeIntegral = negate <$> (sym '-' *> rNonnegativeIntegral)
rIntegral :: (Read a, Integral a) => RE Char a
rIntegral = rNonnegativeIntegral <|> rNegativeIntegral
rVal :: RE Char Expr
rVal = Val <$> rIntegral
rVar :: RE Char Expr
rVar = Var <$> psym isAlpha
rOp = aux <$> (foldr1 (<|>) $ map sym "+-*/")
  where
    aux '+' = Add
    aux '-' = Sub
    aux '*' = Mul
    aux '/' = Div
When this is loaded into ghci it can produce the following output:
ghci> findLongestPrefix rVal "-271"
Just (Val (-271), "")
ghci> findLongestPrefix rVar "a"
Just (Var 'a', "")
ghci> findLongestPrefix rOp "-"
Just (Sub, "")
The trouble comes when I introduce this recursive definition for the App constructor:
whiteSpace :: RE Char String
whiteSpace = many $ psym isSpace
strictWhiteSpace :: RE Char String
strictWhiteSpace = some $ psym isSpace
rApp :: RE Char Expr
-- flip App :: Expr -> Op -> Expr -> Expr
-- strictWhiteSpace after rOp to avoid conflict with rNegativeIntegral
rApp = flip App <$> (sym '(' *> whiteSpace *> rExpr)
                <*> (whiteSpace *> rOp <* strictWhiteSpace)
                <*> (rExpr <* whiteSpace <* sym ')')
rExpr :: RE Char Expr
rExpr = rVal <|> rVar <|> rApp
This loads into ghci just fine, and all previous constructors still work. But findLongestPrefix rApp "(1 + a)" and many similar expressions cause ghci to hang and produce no output.
Through experimentation I've found that the issue happens in general when rExpr is passed in as the first argument to <*. For example, findLongestPrefix (rExpr <* whiteSpace) "a)" also causes ghci to hang.
Also, when the definition for rExpr is replaced by
rExpr = rVal <|> rVar
all of these hanging issues go away. Simple expressions like "(1 + a)" are able to be parsed, but support for recursive expressions is not available.
How can I implement a recursive parser here without hanging issues?

The language of expressions that you describe isn't regular, and regex-applicative can only match regular languages, so you'll have to use a different library.
Luckily, essentially the same parser structure should work fine with most other parser combinator libraries. It should be as simple as swapping a few basic regex-applicative parsers for their analogs in the new library.
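As an illustration, here is a minimal sketch of the same grammar written with Text.ParserCombinators.ReadP, which ships with base. The choice of ReadP and the names pInt, pOp, pApp, pExpr and parseExpr are assumptions made for this example, not something from the question; other combinator libraries should look much the same.

```haskell
import Control.Applicative ((<|>))
import Data.Char (isAlpha, isDigit, isSpace)
import Text.ParserCombinators.ReadP

data Expr = Val Int
          | Var Char
          | App Op Expr Expr
          deriving (Show, Eq)

data Op = Add | Sub | Mul | Div
        deriving (Show, Eq)

-- An integer with an optional leading minus sign.
pInt :: ReadP Int
pInt = do
  sign <- option id (negate <$ char '-')
  sign . read <$> munch1 isDigit

pOp :: ReadP Op
pOp = choice [Add <$ char '+', Sub <$ char '-', Mul <$ char '*', Div <$ char '/']

-- "(" expr op expr ")", with mandatory whitespace after the operator
-- so that "-" followed by a digit is read as a negative literal.
pApp :: ReadP Expr
pApp = do
  _  <- char '(' *> skipSpaces
  l  <- pExpr
  op <- skipSpaces *> pOp
  _  <- munch1 isSpace
  r  <- pExpr
  _  <- skipSpaces *> char ')'
  pure (App op l r)

-- The recursion through pApp is fine here: ReadP parsers are functions,
-- so pExpr is only unfolded as far as the input requires.
pExpr :: ReadP Expr
pExpr = (Val <$> pInt) <|> (Var <$> satisfy isAlpha) <|> pApp

-- Hypothetical entry point: succeed only if the whole input is consumed.
parseExpr :: String -> Maybe Expr
parseExpr s = case [e | (e, "") <- readP_to_S (pExpr <* eof) s] of
  (e:_) -> Just e
  []    -> Nothing
```

With these definitions, parseExpr "(1 + (a / -2))" should give Just (App Add (Val 1) (App Div (Var 'a') (Val (-2)))).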

Related

Haskell Text Parser Combinators to parse a Range Greedily like Regex range notation

In regex you can acquire a range of a parse by doing something like \d{1,5}, which parses a digit 1 to 5 times greedily. Or you do \d{1,5}? to make it lazy.
How would you do this in Haskell's Text.ParserCombinators.ReadP?
My attempt gave this:
rangeParse :: Read a => ReadP a -> [Int] -> ReadP [a]
rangeParse parse ranges = foldr1 (<++) $ fmap (\n -> count n $ parse) ranges
Calling it as rangeParse (satisfy isDigit) [5,4..1] performs a greedy parse of 1 to 5 digits, while swapping the number sequence to [1..5] gives a lazy parse.
Is there a better or more idiomatic way to do this with parser combinators?
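To make the greedy/lazy difference concrete, here is a small sketch that runs the definition above on sample input. The helper names greedyDigits and lazyDigits are made up for this example; rangeParse itself is copied unchanged.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, count, readP_to_S, satisfy, (<++))

-- The definition from the question, copied unchanged.
rangeParse :: Read a => ReadP a -> [Int] -> ReadP [a]
rangeParse parse ranges = foldr1 (<++) $ fmap (\n -> count n $ parse) ranges

-- Hypothetical helpers: the same parser with the counts tried in
-- descending (greedy) or ascending (lazy) order.
greedyDigits, lazyDigits :: String -> [(String, String)]
greedyDigits = readP_to_S (rangeParse (satisfy isDigit) [5,4..1])
lazyDigits   = readP_to_S (rangeParse (satisfy isDigit) [1..5])
```

On "1234567", greedyDigits yields [("12345","67")] and lazyDigits yields [("1","234567")]. Because (<++) commits to the first alternative that succeeds on its own, this sketch says nothing about how either ordering interacts with a parser that follows it.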
update: the below is wrong - for example
rangeGreedy 2 4 a <* string "aab", the equivalent of regexp a{2,4}aab, doesn't match. The questioner's solution gets this right. I won't delete the answer just yet in case it keeps someone else from making the same mistake.
=========
This isn't a complete answer, just a possible way to write the greedy
version. I haven't found a nice way to do the lazy version.
Define a left-biased version of option that returns Maybes:
greedyOption :: ReadP a -> ReadP (Maybe a)
greedyOption p = (Just <$> p) <++ pure Nothing
Then we can do up to n of something with a replicateM of them:
upToGreedy :: Int -> ReadP a -> ReadP [a]
upToGreedy n p = catMaybes <$> replicateM n (greedyOption p)
To allow a minimum count, do the mandatory part separately and append
it:
rangeGreedy :: Int -> Int -> ReadP a -> ReadP [a]
rangeGreedy lo hi p = (++) <$> count lo p <*> upToGreedy (hi - lo) p
The rest of my test code in case it's useful for anyone:
module Main where
import Control.Monad (replicateM)
import Data.Maybe (catMaybes)
import Text.ParserCombinators.ReadP
main :: IO ()
main = mapM_ go ["aaaaa", "aaaab", "aaabb", "aabbb", "abbbb", "bbbbb"]
  where
    go = print . map fst . readP_to_S test
test :: ReadP [String]
test = ((++) <$> rangeGreedy 2 4 a <*> many aOrB) <* eof
  where
    a = char 'a' *> pure "ay"
    aOrB = (char 'a' +++ char 'b') *> pure "ayorbee"

Haskell split string on last occurrence

Is there any way I can split String in Haskell on the last occurrence of given character into 2 lists?
For example I want to split list "a b c d e" on space into ("a b c d", "e").
Thank you for answers.
I'm not sure why the solutions suggested are so complicated. Only one traversal is needed:
splitLast :: Eq a => a -> [a] -> Either [a] ([a],[a])
splitLast c' = foldr go (Left [])
  where
    go c (Right (f,b)) = Right (c:f,b)
    go c (Left s) | c' == c = Right ([],s)
                  | otherwise = Left (c:s)
Note this is total and clearly signifies its failure. When a split is not possible (because the character specified wasn't in the string) it returns a Left with the original list. Otherwise, it returns a Right with the two components.
ghci> splitLast ' ' "hello beautiful world"
Right ("hello beautiful","world")
ghci> splitLast ' ' "nospaceshere!"
Left "nospaceshere!"
It's not beautiful, but it works:
import Data.List
f :: Char -> String -> (String, String)
f char str = let n = findIndex (==char) (reverse str) in
  case n of
    Nothing -> (str, [])
    Just i -> splitAt (length str - i - 1) str
I mean f 'e' "a b c d e" = ("a b c d ", "e"), but I myself wouldn't crop that trailing space.
I would go with more pattern matching.
import Data.List
splitLast = contract . words
  where contract [] = ("", "")
        contract [x] = (x, "")
        contract [x,y] = (x, y)
        contract (x:y:rest) = contract $ intercalate " " [x,y] : rest
For long lists, we just join the first two strings with a space and try the shorter list again. Once the length is reduced to 2, we just return the pair of strings.
(x, "") seemed like a reasonable choice for strings with no whitespace, but I suppose you could return ("", x) instead.
It's not clear that ("", "") is the best choice for empty strings, but it seems like a reasonable alternative to raising an error or changing the return type to something like Maybe (String, String).
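For comparison, here is a minimal sketch of the Maybe-returning alternative mentioned above. The name splitLastMaybe is made up for this example, and unlike the words-based version it splits on an arbitrary element and drops the delimiter.

```haskell
import Data.List (elemIndices)

-- Hypothetical variant: Nothing when the delimiter does not occur,
-- otherwise the parts before and after its last occurrence.
splitLastMaybe :: Eq a => a -> [a] -> Maybe ([a], [a])
splitLastMaybe c xs = case elemIndices c xs of
  [] -> Nothing
  is -> let (front, back) = splitAt (last is) xs
        in Just (front, drop 1 back)  -- drop 1 removes the delimiter itself
```

Here splitLastMaybe ' ' "a b c d e" gives Just ("a b c d","e"), while an input without a space gives Nothing, so the caller decides what a missing delimiter means.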
I can propose the following solution:
splitLast list elem = (reverse $ snd reversedSplit, reverse $ fst reversedSplit)
  where
    reversedSplit = span (/= elem) $ reverse list
Probably not the fastest one (two needless reverses), but I like its simplicity.
If you insist on not leaving the space we're splitting on at the end of the first part (it then leads the second part instead), you can go for:
import qualified Data.List as List
splitLast list elem = splitAt (last $ List.elemIndices elem list) list
however, this version assumes that there will be at least one element matching the pattern. If you don't like this assumption, the code gets slightly longer (but no double-reversals here):
import qualified Data.List as List
splitLast list elem = splitAt index list where
  index = if null indices then 0 else last indices
  indices = List.elemIndices elem list
Of course, choice of splitting at the beginning is arbitrary and probably splitting at the end would be more intuitive for you - then you can simply replace 0 with length list
My idea is to split at every occurrence and then separate the initial parts from the last part.
Pointed:
import Control.Arrow -- (&&&)
import Data.List -- intercalate
import Data.List.Split -- splitOn
breakOnLast :: Eq a => [a] -> [a] -> ([a], [a])
breakOnLast x = (intercalate x . init &&& last) . splitOn x
Point-free:
liftA2 (.) ((&&& last) . (. init) . intercalate) splitOn
(.) <$> ((&&&) <$> ((.) <$> pure init <*> intercalate) <*> pure last) <*> splitOn

Is there a way to code a chainl function in idris lightyear library?

I'm trying to formalise a regular expression based string search tool in Idris
(current status here). But I'm fighting with the problem of parsing regular expressions. I tried to build a small parsing library but gave up on it in favor of Lightyear, a parser combinator library for Idris.
Since I'm used to Haskell, I tried a strategy similar to what I would do with Parsec. My main problem is how to handle left recursion in Lightyear parsers. I've tried several encodings, but pretty much all of the parsers end up looping and causing segmentation faults in the generated code.
I don't know Lightyear, but I had some success porting Parsec to Idris:
module Parser
data Parser : Type -> Type where
  P : (String -> List (a, String)) -> Parser a
unP : Parser a -> String -> List (a, String)
unP (P f) = f
total stripPrefix : (Eq a) => List a -> List a -> Maybe (List a)
stripPrefix [] ys = Just ys
stripPrefix (x::xs) (y::ys) = if (x == y) then stripPrefix xs ys else Nothing
stripPrefix _ _ = Nothing
total token : String -> Parser ()
token tk = P $ \s => case stripPrefix (unpack tk) (unpack s) of
    Just s' => [((), pack s')]
    Nothing => []
total skip : Parser ()
skip = P $ \s => case unpack s of
    [] => []
    (_::s') => [((), pack s')]
instance Functor Parser where
  map f p = P $ \s => map (\(x, s') => (f x, s')) (unP p s)
instance Applicative Parser where
  pure x = P $ \s => [(x, s)]
  (P pf) <*> (P px) = P $ \s => concat (map (\(f, s') => map (\(x, s'') => (f x, s'')) (px s')) (pf s))
instance Alternative Parser where
  empty = P $ \s => []
  (P p1) <|> (P p2) = P $ \s => case p1 s of
      [] => p2 s
      results => results
instance Monad Parser where
  px >>= f = P $ \s => concat (map (\(x, s') => unP (f x) s') (unP px s))
total runParser : Parser a -> String -> Maybe a
runParser (P p) s = case p s of
    [(x, "")] => Just x
    _ => Nothing
This allows a straight copy-paste implementation of chainl:
chainl1 : Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p >>= rest
  where
    rest x = do { f <- op; y <- p; rest $ f x y } <|> return x
chainl : Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op x = chainl1 p op <|> return x
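For readers coming from Haskell, the following sketch shows why this shape avoids left recursion: the combinator parses one operand first and then loops over "operator, operand" pairs, folding to the left as it goes. It uses ReadP from base purely for illustration; chainLeft1, number, subtraction and evalAll are hypothetical names (ReadP already exports its own chainl1).

```haskell
import Control.Applicative ((<|>))
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, char, eof, munch1, readP_to_S)

-- Same structure as chainl1: an operand, then zero or more (op, operand)
-- pairs, combined left to right. No parser calls itself before consuming input.
chainLeft1 :: ReadP a -> ReadP (a -> a -> a) -> ReadP a
chainLeft1 p op = p >>= rest
  where
    rest x = (do f <- op; y <- p; rest (f x y)) <|> pure x

number :: ReadP Int
number = read <$> munch1 isDigit

-- Subtraction is a handy test case because it is not associative.
subtraction :: ReadP Int
subtraction = number `chainLeft1` ((-) <$ char '-')

-- All values obtained from parses that consume the whole input.
evalAll :: String -> [Int]
evalAll s = [v | (v, "") <- readP_to_S (subtraction <* eof) s]
```

evalAll "10-3-2" gives [5], i.e. (10-3)-2 rather than 10-(3-2).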
We can then take a straight transliteration of the expression parser from the chainl docs (I'm too lazy to implement a proper integer parser so we'll just use unary):
parens : Parser a -> Parser a
parens p = token "(" *> p <* token ")"
symbol : String -> Parser ()
symbol = token
integer : Parser Nat
integer = P $ \s => case unpack s of
    ('Z'::s') => [(Z, pack s')]
    ('S'::s') => map (\(n, s'') => (S n, s'')) $ unP integer (pack s')
    _ => []
mutual
  expr : Parser Nat
  expr = term `chainl1` addop
  term : Parser Nat
  term = factor `chainl1` mulop
  factor : Parser Nat
  factor = parens expr <|> integer
  mulop : Parser (Nat -> Nat -> Nat)
  mulop = (symbol "*" *> pure (*)) <|>
          (symbol "/" *> pure div)
  addop : Parser (Nat -> Nat -> Nat)
  addop = (symbol "+" *> pure (+)) <|>
          (symbol "-" *> pure (-))
Now, if you try this:
main : IO ()
main = do
  s <- getLine
  printLn $ runParser expr s
then it will have the same divergent behaviour that you've observed. However, we can make two small changes:
Introduce a lazy alternative combinator:
orElse : Parser a -> Lazy (Parser a) -> Parser a
orElse p1 p2 = P $ \s => case unP p1 s of
    [] => unP p2 s
    results => results
Make sure the recursive part of factor, i.e. the parens expr part, is in this lazy position, by flipping the two alternatives:
factor = integer `orElse` parens expr
This then works as expected:
13:06:07 [cactus@galaxy brainfuck]$ idris Expr.idr -o Expr
13:06:27 [cactus@galaxy brainfuck]$ echo "SZ+(SSZ*SSSZ)" | ./Expr
Just 7
The chainl and chainl1 combinators can be used with the Lightyear package. However, they are not provided by default, so I've added them to my own modules where I've needed them:
chainl1 : Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p >>= rest
  where rest a1 = (do f <- op
                      a2 <- p
                      rest (f a1 a2)) <|> pure a1
chainl : Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op a = (p `chainl1` op) <|> pure a
Seems to work fine. Hope that helps.

Haskell getArgs changing data type

I am trying to build a main function for a Haskell module that takes a regular expression from the user and passes it to the simplifyRegExp function, but this wants its input to be of type RegExp:
data RegExp sy = Empty
               | Epsilon
               | Literal sy
               | Or (RegExp sy) (RegExp sy)
               | Then (RegExp sy) (RegExp sy)
               | Star (RegExp sy)
               deriving (Read, Eq)
How would I be able to turn a string to type RegExp?
If I load the program onto GHCi then I can call the method straight like the following:
*Language.HaLex.RegExp> simplifyRegExp(Star (Star a))
'a'*
But I would like to pass the program just one argument at the command prompt and have it print the result, something like the following (which of course doesn't work):
main = do
  n <- getArgs $ head
  print (simplifyRegExp(n))
You can define a Read instance for your type and use that
data RegExp sy = ...
  deriving Read
And then use readMaybe
import Text.Read
...
main = do
  regexp <- (readMaybe . head) `fmap` getArgs
  case regexp of
    Just r -> ...
    Nothing -> putStrLn "Parse error!"
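A self-contained sketch of this Read-based approach might look as follows. The helper handleArgs and the derived Show instance are assumptions made for the example (HaLex prints regular expressions differently), and the call to simplifyRegExp is left as a comment because that function is not defined here.

```haskell
import Text.Read (readMaybe)

data RegExp sy = Empty
               | Epsilon
               | Literal sy
               | Or (RegExp sy) (RegExp sy)
               | Then (RegExp sy) (RegExp sy)
               | Star (RegExp sy)
               deriving (Read, Show, Eq)

-- Hypothetical helper: map the argument list to the text to print.
-- Matching on the list avoids the partial function head.
handleArgs :: [String] -> String
handleArgs [arg] = case readMaybe arg :: Maybe (RegExp Char) of
  Just r  -> show r            -- simplifyRegExp r would go here
  Nothing -> "Parse error!"
handleArgs _ = "Expected exactly one argument"
```

A main could then be getArgs >>= putStrLn . handleArgs, with getArgs imported from System.Environment. The argument has to be written in constructor syntax, for example "Star (Star (Literal 'a'))".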
But this is a little brittle in two ways. First, plain read is a partial function: if the regexp is ill-formed your program will blow up, which is why the code above goes through a Maybe-returning reader (head is still partial and fails when no argument is given). Second, using the default Read instance forces your internal representation of regexes onto your users! You'd be better off doing some actual parsing if this is a serious project.
Luckily, Haskell has some really awesome parsing libraries. Some of the most famous include parsec and attoparsec.
An example of a parsec parser might be
import Text.Parsec
import Text.Parsec.String
import Control.Applicative
parseStar :: Parser (RegExp Char)
parseStar = Star <$> (parseRe <* char '*')
parseLiteral :: Parser (RegExp Char)
parseLiteral = Literal <$> noneOf "*()"
parseOr :: Parser (RegExp Char)
parseOr = Or <$> parseRe <*> (char '|' *> parseRe)
parseThen :: Parser (RegExp Char)
parseThen = Then <$> parseRe <*> parseRe
....
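One caveat with parsers of this shape: if parseRe is defined as a choice among them, then parseStar, parseOr and parseThen all call parseRe before consuming any input, which is left recursion and would likely loop. The sketch below shows one way to layer the grammar so that this does not happen. It deliberately uses ReadP from base rather than parsec so that it is self-contained; the precedence (alternation lowest, then sequencing, then postfix star) and all parser names are assumptions for this example.

```haskell
import Control.Applicative ((<|>))
import Text.ParserCombinators.ReadP

data RegExp sy = Empty
               | Epsilon
               | Literal sy
               | Or (RegExp sy) (RegExp sy)
               | Then (RegExp sy) (RegExp sy)
               | Star (RegExp sy)
               deriving (Show, Eq)

-- Lowest precedence: sequences separated by '|'.
pOr :: ReadP (RegExp Char)
pOr = chainl1 pThen (Or <$ char '|')

-- A sequence is one or more starred items, combined left to right.
pThen :: ReadP (RegExp Char)
pThen = foldl1 Then <$> many1 pStar

-- An atom followed by any number of '*'.
pStar :: ReadP (RegExp Char)
pStar = do
  atom  <- pAtom
  stars <- munch (== '*')
  pure (foldl (\r _ -> Star r) atom stars)

-- A literal character or a parenthesised expression.
pAtom :: ReadP (RegExp Char)
pAtom = (Literal <$> satisfy (`notElem` "*()|"))
    <|> between (char '(') (char ')') pOr

-- Hypothetical entry point: succeed only if the whole input is consumed.
parseRegExp :: String -> Maybe (RegExp Char)
parseRegExp s = case [r | (r, "") <- readP_to_S (pOr <* eof) s] of
  (r:_) -> Just r
  []    -> Nothing
```

For example, parseRegExp "(a|b)*c" gives Just (Then (Star (Or (Literal 'a') (Literal 'b'))) (Literal 'c')).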

Explanation of OCaml code: explode a string, split a list

I am an absolute OCaml beginner and have an assignment involving the code below. I have got the following code, but I don't know how it works. If someone can help me out, I would appreciate it.
# let explode str = (* defines a function that explodes the argument str, which is of type
                       string, into a list of chars *)
  let rec exp = function (* defines a recursive function exp *)
    | a, b when a < 0 -> b (* this part I don't understand. Is this pattern
                              matching? Is it a function with arguments a and b
                              that go into the expression? "when" is a guard, and
                              then we have: if a is smaller than 0 then b *)
                           (* if a is not smaller than 0, then this function? *)
    | a, b -> exp (a-1, str.[a]::b) (* this I don't understand: a and b are arguments
                                       that go into the recursive function in such a way
                                       that a is decreased by one and b goes into
                                       string a?? *)
  in
  exp ((String.length str)-1, []);; (* calls exp on the string length of
                                       str decreased by one (why?); do the [ ]
                                       brackets denote some kind of type? *)
# let split lst ch =
  let rec split = function (* defines a recursive function split *)
    | [], ch, cacc', aacc' -> cacc'::aacc' (* if empty ... this is about what I got
                                              so far :) *)
    | c::lst, ch, cacc', aacc' when c = ch -> split (lst, ch, [], cacc'::aacc')
    | c::lst, ch, cacc', aacc' -> split (lst, ch, c::cacc', aacc')
  in
  split (lst, ch, [], []);;
val split : 'a list -> 'a -> 'a list list = <fun>
This code is ugly. Whoever gave it to you is doing you a disservice. If a student of mine wrote that, I would ask them to rewrite it without using when conditionals, because they tend to be confusing and encourage writing pattern-matching-heavy code in places where it is not warranted.
As a rule of thumb, beginners should never use when. A simple if..then..else test provides an increase in readability.
Here are equivalent versions of those two functions, rewritten for readability:
let explode str =
  let rec exp a b =
    if a < 0 then b
    else exp (a - 1) (str.[a] :: b)
  in
  exp (String.length str - 1) []
let split input delim_char =
  let rec split input curr_word past_words =
    match input with
    | [] -> curr_word :: past_words
    | c :: rest ->
      if c = delim_char
      then split rest [] (curr_word :: past_words)
      else split rest (c :: curr_word) past_words
  in
  split input [] []
My advice for understanding them is to run them yourself, on a given example, on paper. Just write down the function call (e.g. explode "foo" and split ['a';'b';'c';'d'] 'b'), expand the definition, evaluate the code to get another expression, etc., until you get to the result. Here is an example:
explode "fo"
=>
exp (String.length "fo" - 1) []
=>
exp 1 []
=>
if 1 < 0 then [] else exp 0 ("fo".[1] :: [])
=>
exp 0 ("fo".[1] :: [])
=>
exp 0 ('o' :: [])
=>
exp 0 ['o']
=>
if 0 < 0 then ['o'] else exp (-1) ("fo".[0] :: ['o'])
=>
exp (-1) ("fo".[0] :: ['o'])
=>
exp (-1) ('f' :: ['o'])
=>
exp (-1) ['f'; 'o']
=>
if -1 < 0 then ['f'; 'o'] else exp (-2) ("fo".[-1] :: ['f'; 'o'])
=>
['f'; 'o']
Take care to do that on a small example for each function, and for any function you have trouble understanding. That's the best way to get a global view of what's going on.
(Later, when you grow more used to recursion, you'll find that you don't actually need to do that. You can reason inductively about the function: make an assumption about what it does and, assuming that the recursive calls actually do that, check that the function as a whole indeed does it. In more advanced cases, trying to hold the whole execution in one's head is just too hard, and this induction technique works better, but it is more high-level and requires more practice. Begin by simply running the code.)
If you're using the Core library you can just use
String.to_list "BKMGTPEZY"
This returns a list of chars; if you want strings, just map over it:
String.to_list "BKMGTPEZY" |> List.map ~f:Char.to_string
Outputs:
- : bytes list = ["B"; "K"; "M"; "G"; "T"; "P"; "E"; "Z"; "Y"]
As a function
let explode s = String.to_list s |> List.map ~f:Char.to_string
You can also implement it this way:
open String (* for length and sub *)
let rec strexp s =
  if length(s) = 0 then
    []
  else
    (strexp (sub s 0 (length(s)-1))) @ (s.[length(s)-1]::[])
;;