Haskell regular expression simplifier guards - regex

I am trying to further develop some Haskell code that was developed for simplifying regular expressions and I've run into a small problem. When I run the following command:
*Language.HaLex.RegExp> simplifyRegExp(Star (Or a Epsilon))
I get the output 'a'* and if I replace the a's with b's I get 'b'* like I should. The problem arises when I use Then. The following command:
*Language.HaLex.RegExp> simplifyRegExp(Then (Star a) (Star a))
works fine and produces 'a'* as expected but replacing a's with b's produced the following output:
*Language.HaLex.RegExp> simplifyRegExp(Then (Star b) (Star b))
'b'*'b'*
Although it is supposed to produce just 'b'*. Now if I change the variable name to b in the line
simplifyRegExp (Then x y) | x' == Star a && y' == x' = y'
it works fine for b but not for any other letter. So my question is why it works fine in the Star part but not in the Then part?
I've added some of the important parts of the code below, but feel free to ask for more if it's not enough.
data RegExp sy = Empty                        -- ^ Empty Language
               | Epsilon                      -- ^ Empty String
               | Literal sy                   -- ^ Literals
               | Or (RegExp sy) (RegExp sy)   -- ^ Disjunction
               | Then (RegExp sy) (RegExp sy) -- ^ Sequence
               | Star (RegExp sy)             -- ^ Repetition, possibly zero times
               deriving (Read, Eq)
a = Literal 'a'
b = Literal 'b'
c = Literal 'c'
simplifyRegExp Empty = Empty
simplifyRegExp Epsilon = Epsilon
simplifyRegExp (Literal x) = Literal x
simplifyRegExp (Star x) = case x' of
    Or a Epsilon -> Star (simplifyRegExp a)
  where x' = simplifyRegExp x
simplifyRegExp (Then x y) | x' == Star a && y' == x' = y'
  where x' = simplifyRegExp x
        y' = simplifyRegExp y

You're having variable scoping issues.
In the case alternative Or a Epsilon -> ..., the a is a fresh pattern variable, bound locally and visible only on the right-hand side of that alternative (i.e. in Star (simplifyRegExp a)). It matches any subexpression, which is why the Star rule works for every letter. In the equation for simplifyRegExp (Then x y), however, Star a appears in a guard, where a is not locally bound and instead refers to the top-level definition
a = Literal 'a'
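That's almost certainly not the behavior you intend: the guard x' == Star a compares against Star (Literal 'a'), so the rule fires only for the literal 'a' (or only for 'b' once you rename it). Simplification of a regular expression should go over its structure and ignore the actual choice of literals.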
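A minimal sketch of one possible fix, assuming the intent is to collapse r*r* into r* for any r: bind the starred subexpression with a fresh pattern variable instead of comparing against the top-level a. The fallback cases and the derived Show instance are assumptions added so the snippet stands alone; they are not taken from HaLex, whose own Show instance prints differently.

```haskell
data RegExp sy
  = Empty
  | Epsilon
  | Literal sy
  | Or (RegExp sy) (RegExp sy)
  | Then (RegExp sy) (RegExp sy)
  | Star (RegExp sy)
  deriving (Show, Read, Eq)  -- Show is derived here only for this sketch

a, b :: RegExp Char
a = Literal 'a'
b = Literal 'b'

simplifyRegExp :: Eq sy => RegExp sy -> RegExp sy
simplifyRegExp (Star x) =
  case simplifyRegExp x of
    Or r Epsilon -> Star r   -- r is a fresh local variable, not the top-level a
    x'           -> Star x'  -- assumed fallback: keep the star
simplifyRegExp (Then x y) =
  case (simplifyRegExp x, simplifyRegExp y) of
    (Star r, y') | y' == Star r -> y'          -- r*r* becomes r*, for any r
    (x', y')                    -> Then x' y'  -- assumed fallback
simplifyRegExp (Or x y) = Or (simplifyRegExp x) (simplifyRegExp y)
simplifyRegExp r = r  -- Empty, Epsilon and Literal are left unchanged
```

With this version, simplifyRegExp (Then (Star b) (Star b)) evaluates to Star b, because the pattern variable r matches whatever sits under the first Star.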

Related

What is "let rec (-) x y = y - x in 1 - 2 - 3" doing?

I saw these two examples in a book about OCaml: let (-) x y = y - x in 1 - 2 - 3 and let rec (-) x y = y - x in 1 - 2 - 3. When I saw the former, I thought I understood the trick behind it, until I saw the latter. It seems that the latter runs into some kind of stack overflow problem, but why is that the case? How does OCaml evaluate each of these two expressions?
It might be helpful to rename the let bound functions.
let (-) x y = y - x
in 1 - 2 - 3
(* The above expression is equivalent to the following. *)
let f x y = y - x
in f (f 1 2) 3
(* Which reduces to the following. *)
let f x y = y - x
in 3 - (2 - 1)
Note that the function we defined, let (-) x y, is different from the function which we use in the definition, y - x. This is because let without rec doesn't bind the function name within the definition. Hence, the (-) operator in the definition is the native minus operator. Therefore, the result is 2.
Now, consider the second example.
let rec (-) x y = y - x
in 1 - 2 - 3
(* The above expression is equivalent to the following. *)
let rec f x y = f y x
in f (f 1 2) 3
Now, what does f 1 2 reduce to? It reduces to f 2 1, which reduces to f 1 2, which reduces to f 2 1, and so on ad infinitum. Now, in a language without tail call optimization this would result in a stack overflow error. However, in OCaml it should just run forever without returning.
The let expression has the syntax let <name> = <expr1> in <expr2>; it binds <name> to <expr1> within <expr2>. The <name> itself is not visible in the scope of <expr1>; in other words, let is not recursive by default. This feature can be (and often is) used to give an existing name a new meaning, e.g., this is canonical OCaml code:
let example () =
let name = "Alice" in
let name = "Professor " ^ name in
print_endline name
The same technique is used in let (-) x y = y - x in 1 - 2 - 3, where we redefine (-) in terms of the original (-) operator, which is still visible, untouched, in the scope of the y - x expression.
However, when we add the rec keyword to the let definition then the name is immediately visible in the scope of the currently defined expression, e.g.,
let rec (-) x y = y - x
(*       ^          |  *)
(*       |          |  *)
(*       +----------+  *)
Here the - in y - x refers to the function currently being defined, so we have a recursive definition that says that x minus y is y minus x, which is a bogus definition.

Ocaml- syntax error during a pattern-matching

Here is my code :
type mass = Inf | P of int
let som = fun
|Inf _ | _ Inf -> Inf
| (P a) (P b) -> P (a+b)
I get the following error :
line 5, characters 0-1:
Error: Syntax error
I don't understand at all how I can get a syntax error here. I tried replacing the fun with match a b with, yet I still get the same syntax error.
I also tried adding some ";" but it still doesn't work.
These patterns:
Inf _
_ Inf
don't make sense in OCaml. Both of them consist of one pattern followed directly by another. (The Inf pattern matches the Inf constructor, and _ is a wildcard that matches anything.)
But there is no pattern in OCaml that consists of one pattern followed by another.
The same is true of this pattern:
(P a) (P b)
If these patterns did have a meaning, they would seem to match function applications. But a pattern can't pull apart a function application, it can only pull apart data constructors (lists, tuples, etc.).
What is an example OCaml value that you would expect this pattern to match?
Update
You seem to be saying that the value P 2, P 3 should match this second pattern. The value P 2, P 3 in OCaml is a tuple. It will match this pattern:
(P a), (P b)
Note that the comma is required. The comma is the constructor that creates a tuple.
Update 2
Well, the other mistake is that the fun keyword allows only a single pattern. For multiple patterns you need to use the function keyword. Here is a correct version of your function (assuming that you want it to handle pairs of values of type mass).
type mass = Inf | P of int
let som = function
  | Inf, _ | _, Inf -> Inf
  | (P a), (P b) -> P (a+b)
Update 3
It's more idiomatic in OCaml to have curried functions. It strikes me that this could be the reason you wanted to have adjacent patterns. To get a curried version of som you need to use an explicit match. Neither fun nor function is quite flexible enough.
It would look like this:
let som x y =
  match x, y with
  | Inf, _ | _, Inf -> Inf
  | P a, P b -> P (a + b)

Simplifying regex in Haskell with trees

I have this data structure for regular expressions (RE), and so far I do not have any functions modifying REs:
data Regex a = Letter a | Emptyword | Concat (Regex a) (Regex a) | Emptyset | Or (Regex a) (Regex a) | Star (Regex a)
deriving (Show, Eq)
I would like to implement a simplification algorithm for my REs. For this I thought I should first represent the RE as a tree, update the tree according to some equivalences, and then convert it back to an RE. My reasoning was that with trees I would have functions to find, extract and attach subtrees, update values, etc.
However, I have difficulty finding a tree module that provides this functionality and is simple enough for a beginner to learn.
I found this avl-tree package; however, it seems very large.
I'd like to have alternative suggestions to my approach with trees and suggestions on easy tree modules supporting mentioned functions.
Note that I'm a beginner in Haskell and I do not understand monads yet and that I'm not interested in an implementation to simplify REs.
Edit 1: We know that the following two REs are equivalent, where L b stands for Letter b and C for Concat:
    Or                    Or
   /  \                  /  \
L b    C        =     L b    L a
      / \
   L a   Emptyword
So given the left RE I'd like to replace the subtree with its root labeled by C with a node labeled by L a. As was pointed out my data structure is a tree structure. However, currently I do not have functions to, e.g. replace a subtree with a node, or find a subtree of a structure that I can replace.
As noted in the comments, you already have a tree. You can simplify right away:
simplify :: Eq a => Regex a -> Regex a
simplify (Star Emptyset) = Emptyword
simplify (Star (Star x)) = Star (simplify x)
simplify (Concat x Emptyword) = simplify x
simplify (Concat Emptyword y) = simplify y
simplify (Or x y) | x == y = x
-- or rather simplify (Or x y) | simplify x == simplify y = simplify x
-- more sophisticated rules here
-- ...
-- otherwise just push down
-- (calling simplify again on the rebuilt Or would loop forever when no rule fires)
simplify (Or x y) = Or (simplify x) (simplify y)
-- ...
simplify x@(Letter _) = x
This is just superficial, e.g. the first rule should be simplify (Star x) | simplify x == Emptyset = Emptyword.
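One way to avoid writing every rule twice (once for raw and once for simplified children) is to simplify bottom-up: rebuild each node from simplified children, then apply the local rules once at the root. The following is only a sketch under that assumption; the helper name rewrite is hypothetical, and the rule set is just the handful of equivalences listed above, not a complete simplifier.

```haskell
data Regex a
  = Letter a
  | Emptyword
  | Concat (Regex a) (Regex a)
  | Emptyset
  | Or (Regex a) (Regex a)
  | Star (Regex a)
  deriving (Show, Eq)

-- Simplify the children first, then rewrite the rebuilt node.
simplify :: Eq a => Regex a -> Regex a
simplify (Concat x y) = rewrite (Concat (simplify x) (simplify y))
simplify (Or x y)     = rewrite (Or (simplify x) (simplify y))
simplify (Star x)     = rewrite (Star (simplify x))
simplify r            = r  -- Letter, Emptyword, Emptyset have no children

-- One rewriting step at the root (hypothetical helper).
-- Assumes the children have already been simplified.
rewrite :: Eq a => Regex a -> Regex a
rewrite (Star Emptyset)      = Emptyword
rewrite (Star (Star x))      = Star x
rewrite (Concat x Emptyword) = x
rewrite (Concat Emptyword y) = y
rewrite (Or x y) | x == y    = x
rewrite r                    = r
```

Applied to the tree from Edit 1, simplify (Or (Letter 'b') (Concat (Letter 'a') Emptyword)) gives Or (Letter 'b') (Letter 'a'): the Concat subtree is replaced by its left child, with no separate tree library involved.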
AVL Trees
AVL trees are for balance, not really applicable here. The only place where balance makes sense is for the associative operations
Or x (Or y z) == Or (Or x y) z
I suggest using lists for those operations:
data Regex' a = Letter' a | Concat' [Regex' a] | Or' [Regex' a] | Star' (Regex' a)
  deriving (Show, Eq)
(No Emptyword' because it is Concat' []; same with Emptyset' and Or'.)
Converting between Regex and Regex' is the usual exercise for the reader.
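For readers who want a starting point, here is a rough sketch of one direction of that conversion. It assumes the list-based type uses primed constructor names throughout (Or', Star') and that nested Concat and Or nodes should be flattened into a single list; the function names flatten, concatParts and orParts are made up for this example.

```haskell
data Regex a
  = Letter a | Emptyword | Concat (Regex a) (Regex a)
  | Emptyset | Or (Regex a) (Regex a) | Star (Regex a)
  deriving (Show, Eq)

-- List-based variant; primed constructor names are an assumption.
data Regex' a
  = Letter' a
  | Concat' [Regex' a]
  | Or' [Regex' a]
  | Star' (Regex' a)
  deriving (Show, Eq)

-- Convert the binary tree into the list-based form.
flatten :: Regex a -> Regex' a
flatten (Letter c)   = Letter' c
flatten Emptyword    = Concat' []  -- the empty concatenation
flatten Emptyset     = Or' []      -- the empty alternation
flatten (Star x)     = Star' (flatten x)
flatten (Concat x y) = Concat' (concatParts x ++ concatParts y)
flatten (Or x y)     = Or' (orParts x ++ orParts y)

-- Operands of a (possibly nested) concatenation, in order.
concatParts :: Regex a -> [Regex' a]
concatParts r = case flatten r of
  Concat' xs -> xs
  r'         -> [r']

-- Operands of a (possibly nested) alternation, in order.
orParts :: Regex a -> [Regex' a]
orParts r = case flatten r of
  Or' xs -> xs
  r'     -> [r']
```

Because both nestings of an associative operator flatten to the same list, Or x (Or y z) and Or (Or x y) z become equal values of Regex', so the derived Eq already identifies them.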
General Hardness
Note that Regex equivalence is not easy:
(a|b)* = (a*b)*a*
Optimizing Or "(a|b)*" "(a*b)*a*" is hard...

finding the first occurrence in a list

I want to find the first occurrence of a digit in a list:
let pos_list = function (list , x) ->
let rec pos = function
|([] , x , i) -> i
|([y] , x , i) -> if y == x then i
|(s::t , x , i) -> if s == x then i else pos(t , x , i + 1) in pos(list , x , 0) ;;
but the compiler complains that the expression has a "uint" type but was used with an "int" type.
Remove the second case from the pattern matching. This case is already matched by the last one with s = y, t = []. So the function can be simplified to
let pos_list (list, x) =
  let rec pos = function
    | ([], x, i) -> i
    | (s::t, x, i) -> if s == x then i else pos(t, x, i + 1) in pos(list, x, 0) ;;
Why are you using == (physical equality) instead of = (structural equality)? I realize that this might not make a difference if you only have integers, but it might yield unexpected behaviour down the road.
See the Comparisons section in the doc: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Pervasives.html
Pavel Zaichenkov's answer is of course the best one, but you might be interested in knowing the exact cause of the error. Namely, when you have
if y == x then i
without a corresponding else expression, the whole expression is treated as
if y == x then i else ()
where () is the only value of the type unit (and not uint), which is the type of expressions that are evaluated for their side effects only. Since both branches of the if must have the same type, i is deemed to have type unit also. Then, when type-checking the third branch of the pattern-matching, you try to add i and 1, which means that i should have type int, hence the type error.

Telling if regular expression contains a single invariable segment

I am writing a search tool which is optimized to first look for fixed phrases of characters in sentences - consider it a simple "does the sentence contain a particular sequence of characters". The result will be a set of found sentences which can be searched further in a second stage.
For this second stage I would like to apply a regex search for convenience. But I need to pre-select the items in the first stage, and I cannot simply get all sentences - the API I need to use for the first stage requires me to search for a phrase of at least one matching char. So there's no way around this.
Now, the user will only enter one regex, and my software needs to first determine if it can perform the first stage search on this. If the user enters something ambiguous, I will then tell the user to change his regex.
I need the algorithm that determines all substrings I can use for the first stage search.
Here are some examples of expected results:
a.b – Yes (searches for "a" or "b" first)
a|b – No (there'd be two distinct first level searches necessary)
[ab] – No (same problem: Not a clear target for the first stage search)
[ab]c – Yes (searches for "c" first)
These are simple examples. But since regex can get quite complicated I wonder if I can construct a regex or other test that will tell me if I have a usable outcome.
I could also live with limiting the regex syntax to the more common cases if that makes the test simpler, e.g. no recursion or whatever could help.
Here is an example, using Haskell. The algorithm should be easily transferred to another language.
Just some boilerplate imports;
import Data.Maybe
import Data.List
import Data.Function
Here is a datatype to represent a Regex. You'll have to parse it yourself or use a library:
data Regex
  = Concat Regex Regex -- e.g. /ab/
  | Alt Regex Regex -- e.g. /a|b/
  | Single Char -- e.g. /a/
  | Star Regex -- e.g. /a*/
  | CharClass [(Char,Char)] -- a list of ranges. for non-range (e.g. [a]) just use the same char twice
Here is the algorithm:
regexMustMatch (Single x) = [Just x] -- has to match the character
regexMustMatch (Alt _ _) = [Nothing] -- doesn't need to match one thing (you could actually check for equality here, so something like /a|a/ would work)
regexMustMatch (Star _) = [Nothing] -- doesn't need to match one thing
regexMustMatch (CharClass ((a,b):[])) | a == b = [Just a] -- char class must match if it only has one character
regexMustMatch (CharClass _) = [Nothing] -- otherwise doesn't need to match one thing
regexMustMatch (Concat x y) = (regexMustMatch x) ++ (regexMustMatch y) -- must match both parts in sequence
Some methods to make the results usable:
selectAll = map (concatMap (return . fromJust)) .
            filter (isJust . head) .
            groupBy ((==) `on` isJust)
selectLongest x = case selectAll x of
  [] -> ""
  xs -> maximumBy (compare `on` length) xs
And some examples:
main = do
  -- your tests
  -- /ab/
  print . selectAll . regexMustMatch $ (Single 'a' `Concat` Single 'b')
  -- /a|b/
  print . selectAll . regexMustMatch $ (Single 'a' `Alt` Single 'b')
  -- /[ab]/
  print . selectAll . regexMustMatch $ (CharClass [('a','a'),('b','b')])
  -- /[ab]c/
  print . selectAll . regexMustMatch $ ((Single 'a' `Alt` Single 'b') `Concat` Single 'c')
  -- a few more
  -- /[a]/
  print . selectAll . regexMustMatch $ (CharClass [('a','a')])
  -- /ab*c/
  print . selectAll . regexMustMatch $ (Single 'a' `Concat` Star (Single 'b') `Concat` Single 'c')
  -- /s(ab*)(cd)/ - these aren't capturing parens, just grouping to test associativity
  print . selectAll . regexMustMatch $ (Single 's' `Concat` (Single 'a' `Concat` Star (Single 'b')) `Concat` (Single 'c' `Concat` Single 'd'))
Output:
["ab"] -- /ab/
[] -- /a|b/
[] -- /[ab]/
["c"] -- /[ab]c/
["a"] -- /[a]/
["a","c"] -- /ab*c/
["sa","cd"] -- /s(ab*)(cd)/
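To turn this into the yes/no decision asked for in the question, one could wrap the two functions in a small check. The sketch below repeats the simple (non-improved) algorithm so that it compiles on its own; the name firstStagePhrases and the wording of the rejection message are assumptions, not part of the code above.

```haskell
import Data.Function (on)
import Data.List (groupBy)
import Data.Maybe (catMaybes, isJust)

data Regex
  = Concat Regex Regex
  | Alt Regex Regex
  | Single Char
  | Star Regex
  | CharClass [(Char, Char)]

-- Just c where the character c is required, Nothing where anything may appear.
regexMustMatch :: Regex -> [Maybe Char]
regexMustMatch (Single x) = [Just x]
regexMustMatch (Alt _ _) = [Nothing]
regexMustMatch (Star _) = [Nothing]
regexMustMatch (CharClass [(lo, hi)]) | lo == hi = [Just lo]
regexMustMatch (CharClass _) = [Nothing]
regexMustMatch (Concat x y) = regexMustMatch x ++ regexMustMatch y

-- Maximal runs of required characters.
selectAll :: [Maybe Char] -> [String]
selectAll = map catMaybes . filter (all isJust) . groupBy ((==) `on` isJust)

-- Hypothetical wrapper: Right with the usable phrases, or Left with a
-- message asking the user to change the regex.
firstStagePhrases :: Regex -> Either String [String]
firstStagePhrases re = case selectAll (regexMustMatch re) of
  [] -> Left "no fixed phrase found; please change the regex"
  ps -> Right ps
```

For /ab/ this yields Right ["ab"], while /a|b/ and /[ab]/ yield a Left value, which corresponds to the "No" cases in the question.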
The main area where this could be improved is in the algorithm for alternation.
If we have the regex /a*bc*|d*be*/ then b needs to be matched, but this won't pick that up.
Edit: here's an improved algorithm for alternation:
regexMustMatch (Alt x y)
  | x' == y' = x'
  | otherwise = start ++ [Nothing] ++ common ++ [Nothing] ++ end
  where
    x' = regexMustMatch x
    y' = regexMustMatch y
    start = map fst $ takeWhile (uncurry (==)) (zip x' y')
    end = map fst $ reverse $ takeWhile (uncurry (==)) (zip (reverse (drop (length start) x')) (reverse (drop (length start) y')))
    dropEnds = drop (length start) . reverse . drop (length end) . reverse
    common = intercalate [Nothing] $ map (map Just) (selectAll (dropEnds x') `intersect` selectAll (dropEnds y'))
Some more tests with the improved alternation:
/a*bc*|d*be*/ == b
/s(abc*|abe*)/ == sab
/s(a*bc*|d*be*)/ == s, b
/sa*b|b*/ == s
/(abc*|abe*)s/ == ab, s
/(a*bc*|d*be*)s/ == b, s
/(a*b|b*)s/ == s
/s(ab|b)e/ == s, be
/s(ba|b)e/ == sb, e
/s(b|b)e/ == sbe
/s(ac*b|ac*b)e/ == sa, be