Need help implementing a Haskell regex manipulation function [closed]

The definition
firsts :: RE sym -> [sym]
firsts = undefined
The RE data
data RE sym                  -- sym is type of alphabet symbols
  = RSym sym                 -- match single symbol
  | REps                     -- match empty string
  | RZero                    -- match nothing
  | RStar (RE sym)           -- 0+ repetition
  | RPlus (RE sym)           -- 1+ repetition
  | RAlt (RE sym) (RE sym)   -- choice
  | RSeq (RE sym) (RE sym)   -- concatenation
  deriving (Show)
The Alphabet used in regex
data Alphabet = A | B | C deriving (Show, Eq)
firsts re returns a list containing every symbol that occurs first in some string in the language for re.
For example, if re represents "A(C|B)|BC", then the strings in its language are AB, AC, and BC. In this case, firsts re might return [A,B].
Note that the type signature does not include Eq sym or Ord sym. This means that your code will be unable to sort or remove duplicates from the list of symbols it returns.
The requirements your code must satisfy are:
the list returned must be finite (even if the language is infinite!)
every symbol in the list must be the first symbol in some string in the language
for every string in the language, its first symbol must occur in the list
Individual symbols may occur in any order, and may be duplicated any finite number of
times.

The idea is to analyze the regular expression, not to produce all possible strings for it. For example, RSym sym clearly has sym as its first (and only) character, whereas REps has no first characters.
You should therefore define a function that finds the initial characters by case analysis on the constructors, along these lines:
firsts :: RE sym -> [sym]
firsts (RSym sym) = [sym]
firsts REps = []
firsts RZero = …
firsts (RStar sub) = …
firsts (RPlus sub) = …
firsts (RAlt sub1 sub2) = …
firsts (RSeq sub1 sub2) = …
where sub, sub1, and sub2 are sub-regexes. For some of these cases you will have to make recursive calls to find the first characters of the sub-regex(es).
For (RSeq sub1 sub2) you will need a helper function matchEmpty :: RE sym -> Bool that checks whether a regular expression matches the empty string. If sub1 matches the empty string, then the first characters of sub2 can also be first characters of the whole regex; if sub1 does not match the empty string, that is impossible.
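One way the outline above could be filled in is sketched below. It is not the answerer's own solution. The matchEmpty helper follows the description in the answer; matchNothing is an extra, hypothetical helper added on the assumption that a sequence containing RZero (an empty language) should contribute no symbols, since the requirements say every returned symbol must start some string in the language.

```haskell
-- Types repeated from the question so the sketch is self-contained.
data RE sym
  = RSym sym
  | REps
  | RZero
  | RStar (RE sym)
  | RPlus (RE sym)
  | RAlt (RE sym) (RE sym)
  | RSeq (RE sym) (RE sym)
  deriving (Show)

data Alphabet = A | B | C deriving (Show, Eq)

-- Does the language contain the empty string?
matchEmpty :: RE sym -> Bool
matchEmpty (RSym _)   = False
matchEmpty REps       = True
matchEmpty RZero      = False
matchEmpty (RStar _)  = True
matchEmpty (RPlus r)  = matchEmpty r
matchEmpty (RAlt a b) = matchEmpty a || matchEmpty b
matchEmpty (RSeq a b) = matchEmpty a && matchEmpty b

-- Hypothetical helper (not in the answer): is the language empty?
matchNothing :: RE sym -> Bool
matchNothing (RSym _)   = False
matchNothing REps       = False
matchNothing RZero      = True
matchNothing (RStar _)  = False  -- always contains the empty string
matchNothing (RPlus r)  = matchNothing r
matchNothing (RAlt a b) = matchNothing a && matchNothing b
matchNothing (RSeq a b) = matchNothing a || matchNothing b

firsts :: RE sym -> [sym]
firsts (RSym s)   = [s]
firsts REps       = []
firsts RZero      = []
firsts (RStar r)  = firsts r
firsts (RPlus r)  = firsts r
firsts (RAlt a b) = firsts a ++ firsts b
firsts (RSeq a b)
  | matchNothing a || matchNothing b = []           -- no strings at all
  | matchEmpty a = firsts a ++ firsts b             -- a may be skipped
  | otherwise    = firsts a
```

Every case recurses only on strictly smaller sub-expressions, so the result is finite even for RStar and RPlus. No Eq sym constraint is needed because the symbols are only collected, never compared.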

Related

Simplify parsed regex

I have to simplify custom regex expressions parsed to a certain data type. With "simplify" I mean the following (emphasis mine):
Given the rules:
lowercase letters match themselves, eg.:
a matches a and nothing else
parens enclosing only letters match their full sequence, eg.:
(abc) matches abc and nothing else
square brackets enclosing only letters match every letter inside, eg.:
[abc] matches a and b and c and nothing else
The following are all valid:
(a[bc]) matches ab and ac and nothing else
[a(bc)] matches a and bc and nothing else
(a(bc)) is the same as (abc) and matches abc and nothing else
[a[bc]] is the same as [abc] and matches a and b and c and nothing else
Regexes can be simplified. For example [a[[bb]b[[b]]](c)(d)] is
really just the same as [abcd] which matches a, b, c and d.
I have implemented a simple parser combinator in Haskell using attoparsec and the following destination data type:
data Regex
= Symbol Char
| Concat [Regex] -- ()
| Union [Regex] -- []
deriving (Eq)
However, I'm really struggling with the simplification part. I have tried to reduce the Concats and Unions by a combination of unwrapping them, nubbing and concatMapping, to no avail. I think that the data type I have defined might not be the best fit, but I have run out of ideas (late at night here). Could you point me in the right direction? Thanks!
simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs
You are missing a couple simple improvements, for starters. simplify (Concat [x]) = x and likewise for Union: there's no need for the wrapped regex to be specifically a symbol.
Then you need to start looking at Concats containing other Concats, and likewise for Union. Sure, you start by simplifying the elements of the wrapped list, but before jamming the result back into a wrapper, you lift up any elements using the same wrapper. Something like:
simplify (Concat xs) =
  case concatMap liftConcats (map simplify xs) of
    [x] -> x
    xs -> Concat xs
  where liftConcats :: Regex -> [Regex]
        liftConcats r = _exerciseForTheReader
Then you can do something similar for Union, with a nub thrown in as well.
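For illustration, here is one possible way to complete the approach described above. The bodies of liftConcats and liftUnions are an assumption about what the answer left as an exercise, and the singleton handling for Union mirrors the one shown for Concat.

```haskell
import Data.List (nub)

-- Data type repeated from the question so the sketch is self-contained.
data Regex
  = Symbol Char
  | Concat [Regex]  -- ()
  | Union [Regex]   -- []
  deriving (Eq)

simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat xs) =
  case concatMap liftConcats (map simplify xs) of
    [x] -> x
    ys  -> Concat ys
  where
    -- Splice the children of a nested Concat into the parent list.
    liftConcats :: Regex -> [Regex]
    liftConcats (Concat ys) = ys
    liftConcats r           = [r]
simplify (Union xs) =
  case nub (concatMap liftUnions (map simplify xs)) of
    [x] -> x
    ys  -> Union ys
  where
    -- Splice the children of a nested Union into the parent list.
    liftUnions :: Regex -> [Regex]
    liftUnions (Union ys) = ys
    liftUnions r          = [r]
```

Because the children are simplified before lifting, a nested wrapper of the same kind is already flat by the time it is spliced in, so lifting one level is enough. With this sketch, the parsed form of [a[[bb]b[[b]]](c)(d)] should reduce to Union [Symbol 'a', Symbol 'b', Symbol 'c', Symbol 'd'].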

OCaml Str.full_split returns the original string instead of the expected substring

I am trying to write a program that will read diff files and return the filenames, just the filenames. So I wrote the following code
open Printf
open Str
let syname: string = "diff --git a/drivers/usc/filex.c b/drivers/usc/filex"
let fileb =
let pat_filename = Str.regexp "a\/(.+)b" in
let s = Str.full_split pat_filename syname in
s
let print_split_res (elem: Str.split_result) =
match elem with
| Text t -> print_string t
| Delim d -> print_string d
let rec print_list (l: Str.split_result list) =
match l with
| [] -> ()
| hd :: tl -> print_split_res hd ; print_string "\n" ; print_list tl
;;
() = print_list fileb
Upon running this I get the original string diff --git a/drivers/usc/filex.c b/drivers/usc/filex back as the output.
Whereas if I use the same regex pattern with the python standard library I get the desired result
import re
p=re.compile('a\/(.+)b')
p.findall("diff --git a/drivers/usc/filex.c b/drivers/usc/filex")
Output: ['drivers/usc/filex.c ']
What am I doing wrong?
Not to be snide, but the way to understand OCaml regular expressions is to read the documentation, not compare to things in another language :-) Sadly, there is no real standard for regular expressions across languages.
The main problem appears to be that parentheses in OCaml regular expressions match themselves. To get grouping behavior they need to be escaped with '\\'. In other words, your pattern is looking for actual parentheses in the filename. Your code works for me if you change your regular expression to this:
Str.regexp "a/\\(.+\\)b"
Note that the backslashes must themselves be escaped so that Str.regexp sees them.
You also have the problem that your pattern doesn't match the slash after b. So the resulting text will start with a slash.
As a side comment, I also removed the backslash before /, which is technically not allowed in an OCaml string.

(Ocaml) Using 'match' to extract list of chars from a list of chars

I have just started to learn OCaml and I find it difficult to extract a small list of chars from a bigger list of chars.
Let's say I have:
let list_of_chars = ['#' ; 'a' ; 'b' ; 'c'; ... ; '!' ; '3' ; '4' ; '5' ];;
I have the following knowledge - I know that in the
list above I have '#' followed by a '!' in some location further in the list .
I want to extract the lists ['a' ;'b' ;'c' ; ...] and ['3' ; '4' ; '5'] and do something with them,
so I do the following thing:
let variable = match list_of_chars with
  | '#'::l1@['!']@l2 -> (*[code to do something with l1 and l2]*)
  | _ -> raise Exception ;;
This code doesn't work for me, it's throwing errors. Is there a simple way of doing this?
(specifically for using match)
As another answer points out, you can’t use pattern matching for this because pattern matching only lets you use constructors and @ is not a constructor.
Here is how you might solve your problem
let split ~equal ~on list =
  let rec go acc = function
    | [] -> None
    | x::xs -> if equal x on then Some (List.rev acc, xs) else go (x::acc) xs
  in
  go [] list
let variable = match list_of_chars with
  | '#'::rest ->
    (* parenthesised so the inner match does not swallow the outer cases *)
    (match split rest ~on:'!' ~equal:Char.equal with
     | None -> raise Exception
     | Some (left,right) ->
       ... (* your code here *))
  | _ -> raise Exception
I’m now going to hypothesise that you are trying to do some kind of parsing or lexing. I recommend that you do not do it with a list of chars. Indeed, I think there is almost never a reason to have a list of chars in OCaml: a string is better for a string (a char list has an overhead of 23x in memory usage), and while one might use chars as a kind of mnemonic enum in C, OCaml has actual enums (aka variant types or sum types), so those should usually be used instead. I guess you might end up with a char list if you are doing something with a trie.
If you are interested in parsing or lexing, you may want to look into:
Ocamllex and ocamlyacc
Sedlex
Angstrom or another parser combinator library like it
One of the regular expression libraries (eg Re, Re2, Pcre; note Re and Re2 are mostly unrelated)
Using strings and functions like lsplit2
@ is an operator, not a valid pattern. Patterns need to be static and can't match a varying number of elements in the middle of a list. But since you know the position of ! it doesn't need to be dynamic. You can accomplish it using just :: patterns:
let variable = match list_of_chars with
  | '#'::a::b::c::'!'::l2 -> let l1 = [a;b;c] in ...
  | _ -> raise Exception ;;

The '@' sign in Haskell [duplicate]

I am a beginner in Haskell. I was doing a simple exercise, which is to write a compress function. Since my code for this function was pretty long and not really what I wanted, I checked the solution and found this one:
compress (x:ys@(y:_))
  | x == y = compress ys
  | otherwise = x : compress ys
compress ys = ys
The problem for me is the '@', which I don't really understand. Is there anyone out there willing to explain to me how this works?
@ is used to bind a name to the value of the whole pattern match. Think of it like this
foo fullList@(x:xs) = ...
Is like saying
foo (x:xs) = ...
  where fullList = x:xs
or, if you like
foo fullList = case fullList of
  (x:xs) -> ...
So in your case
ys is equal to the tail of the original list, and the head of ys is y.
It's worth reading a good haskell tutorial to pick up some of this syntax.
@ is used to pattern match a value while still keeping a reference to the whole value. An example is
data Blah = Blah Int Int
f :: Blah -> String
f val@(Blah x y) = -- some expression
f (Blah 1 2)
In the last call, val would be Blah 1 2, x would be 1 and y would be 2.
I recommend you read the relevant section of Learn You a Haskell for Great Good!
From the link:
There's also a thing called as patterns. Those are a handy way of
breaking something up according to a pattern and binding it to names
whilst still keeping a reference to the whole thing. You do that by
putting a name and an @ in front of a pattern. For instance, the
pattern xs@(x:y:ys). This pattern will match exactly the same thing as
x:y:ys but you can easily get the whole list via xs instead of
repeating yourself by typing out x:y:ys in the function body again.
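To make the explanations above concrete, here is a small sketch comparing the compress function from the question (where the as-pattern operator is written @ in Haskell source) with an equivalent version that avoids the as-pattern. The name compress' and the type signatures are added here for illustration and are not part of the original solution.

```haskell
-- Drop consecutive duplicates. The as-pattern names the whole tail `ys`
-- while also exposing its head `y` for the comparison.
compress :: Eq a => [a] -> [a]
compress (x:ys@(y:_))
  | x == y    = compress ys
  | otherwise = x : compress ys
compress ys = ys

-- Same behaviour without the as-pattern: the tail has to be rebuilt
-- as (y:rest) every time it is needed.
compress' :: Eq a => [a] -> [a]
compress' (x:y:rest)
  | x == y    = compress' (y:rest)
  | otherwise = x : compress' (y:rest)
compress' ys = ys
```

For example, compress "aaabccdd" evaluates to "abcd". The second clause of each function handles the empty list and the single-element list, which the first clause cannot match.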

haskell regex substitution

Despite the ridiculously large number of regex matching engines for Haskell, the only one I can find that will substitute is Text.Regex, which, while decent, is missing a few things I like from pcre. Are there any pcre-based packages which will do substitution, or am I stuck with this?
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this. Including Scheme. So here's some actual resources; my code is from a project where I was trying to replace "qql foo bar baz qq" with text based on calling a function on the stuff inside the qq "brackets", because reasons.
Best option: pcre-heavy:
let newBody = gsub [re|\s(qq[a-z]+)\s(.*?)\sqq\s|] (unWikiReplacer2 titles) body in do
[snip]
unWikiReplacer2 :: [String] -> String -> [String] -> String
unWikiReplacer2 titles match subList = case length subList > 0 of
  True -> " --" ++ subList!!1 ++ "-- "
  False -> match
Note that pcre-heavy directly supports function-based replacement, with any
string type. So nice.
Another option: pcre-light with a small function that works but isn't exactly
performant:
let newBody = replaceAllPCRE "\\s(qq[a-z]+)\\s(.*?)\\sqq\\s" (unWikiReplacer titles) body in do
[snip]
unWikiReplacer :: [String] -> (PCRE.MatchResult String) -> String
unWikiReplacer titles mr = case length subList > 0 of
    True -> " --" ++ subList!!1 ++ "-- "
    False -> PCRE.mrMatch mr
  where
    subList = PCRE.mrSubList mr
-- A very simple, very dumb "replace all instances of this regex
-- with the results of this function" function. Relies on the
-- MatchResult return type.
--
-- https://github.com/erantapaa/haskell-regexp-examples/blob/master/RegexExamples.hs
-- was very helpful to me in constructing this
--
-- I also used
-- https://github.com/jaspervdj/hakyll/blob/ea7d97498275a23fbda06e168904ee261f29594e/src/Hakyll/Core/Util/String.hs
replaceAllPCRE :: String -- ^ Pattern
               -> ((PCRE.MatchResult String) -> String) -- ^ Replacement (called on capture)
               -> String -- ^ Source string
               -> String -- ^ Result
replaceAllPCRE pattern f source =
    if (source PCRE.=~ pattern) == True then
      replaceAllPCRE pattern f newStr
    else
      source
  where
    mr = (source PCRE.=~ pattern)
    newStr = (PCRE.mrBefore mr) ++ (f mr) ++ (PCRE.mrAfter mr)
Someone else's fix: http://0xfe.blogspot.com/2010/09/regex-substitution-in-haskell.html
Another one, this time embedded in a major library: https://github.com/jaspervdj/hakyll/blob/master/src/Hakyll/Core/Util/String.hs
Another package for this purpose: https://hackage.haskell.org/package/pcre-utils
Update 2020
I totally agree with @rlpowell that
I don't think "just roll your own" is a reasonable answer to people trying to get actual work done, in an area where every other modern language has a trivial way to do this.
At the time of this writing, there is also Regex.Applicative.replace for regex substitution, though it's not Perl-compatible.
For pattern-matching and substitution with parsers instead of regex, there is Replace.Megaparsec.streamEdit
The regular expression API in regex-base is generic over the container of characters to match. Doing some kind of generic splicing to implement substitution would be very hard to make efficient. I did not want to provide a crappy generic routine.
Writing a small function to do the substitution exactly how you want is just a better idea, and it can be written to match your container.