Simplify parsed regex - regex

I have to simplify custom regexes that have been parsed into a certain data type. By "simplify" I mean the following (emphasis mine):
Given the rules:
lowercase letters match themselves, eg.:
a matches a and nothing else
parens enclosing only letters match their full sequence, eg.:
(abc) matches abc and nothing else
square brackets enclosing only letters match every letter inside, eg.:
[abc] matches a and b and c and nothing else
The following are all valid:
(a[bc]) matches ab and ac and nothing else
[a(bc)] matches a and bc and nothing else
(a(bc)) is the same as (abc) and matches abc and nothing else
[a[bc]] is the same as [abc] and matches a and b and c and nothing else
Regexes can be simplified. For example [a[[bb]b[[b]]](c)(d)] is
really just the same as [abcd] which matches a, b, c and d.
I have implemented a simple parser combinator in Haskell using attoparsec and the following destination data type:
data Regex
  = Symbol Char
  | Concat [Regex] -- ()
  | Union [Regex] -- []
  deriving (Eq)
However, I'm really struggling with the simplification part. I have tried to reduce the Concats and Unions by a combination of unwrapping them, nubbing and concatMapping, to no avail. I think that the data type I have defined might not be the best fit, but I have run out of ideas (late at night here). Could you point me in the right direction? Thanks!
simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs

You are missing a couple simple improvements, for starters. simplify (Concat [x]) = x and likewise for Union: there's no need for the wrapped regex to be specifically a symbol.
Then you need to start looking at Concats containing other Concats, and likewise for Union. Sure, you start by simplifying the elements of the wrapped list, but before jamming the result back into a wrapper, you lift up any elements using the same wrapper. Something like:
simplify (Concat xs) =
  case concatMap liftConcats (map simplify xs) of
    [x] -> x
    xs -> Concat xs
  where liftConcats :: Regex -> [Regex]
        liftConcats r = _exerciseForTheReader
Then you can do something similar for Union, with a nub thrown in as well.
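For reference, here is one way the exercise could be filled in. This is a sketch of what the answer seems to intend, not the answerer's own code. The helper liftUnions and the extra Show instance are assumptions added for illustration. Unwrapping a single level is enough because the children are simplified before they are lifted.

```haskell
import Data.List (nub)

-- Same shape as the question's type; Show is added here only so that
-- results can be printed (an assumption, the original derives Eq alone).
data Regex
  = Symbol Char
  | Concat [Regex] -- ()
  | Union [Regex]  -- []
  deriving (Eq, Show)

simplify :: Regex -> Regex
simplify (Symbol c) = Symbol c
simplify (Concat xs) =
  case concatMap liftConcats (map simplify xs) of
    [x] -> x          -- a one-element sequence is just that element
    ys  -> Concat ys
  where
    -- Splice the children of a nested Concat into the parent.
    liftConcats (Concat ys) = ys
    liftConcats r           = [r]
simplify (Union xs) =
  case nub (concatMap liftUnions (map simplify xs)) of
    [x] -> x          -- a one-element choice is just that element
    ys  -> Union ys
  where
    -- Splice the children of a nested Union into the parent.
    liftUnions (Union ys) = ys
    liftUnions r          = [r]
```

With this sketch, the question's example [a[[bb]b[[b]]](c)(d)] simplifies to Union [Symbol 'a',Symbol 'b',Symbol 'c',Symbol 'd'], i.e. [abcd].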

Related

Need help to solve the function haskell regex manipulation function [closed]

The definition
firsts :: RE sym -> [sym]
firsts = undefined
The RE data
data RE sym                  -- sym is type of alphabet symbols
  = RSym sym                 -- match single symbol
  | REps                     -- match empty string
  | RZero                    -- match nothing
  | RStar (RE sym)           -- 0+ repetition
  | RPlus (RE sym)           -- 1+ repetition
  | RAlt (RE sym) (RE sym)   -- choice
  | RSeq (RE sym) (RE sym)   -- concatenation
  deriving (Show)
The Alphabet used in regex
data Alphabet = A | B | C deriving (Show, Eq)
firsts re returns a list containing every symbol that occurs first in some string in the language for re.
For example, if re represents "A(C|B)|BC", then the strings in its language are AB, AC, and BC. In this case, firsts re might return [A,B].
Note that the type signature does not include Eq sym or Ord sym. This means that your code will be unable to sort or remove duplicates from the list of symbols it returns.
The requirements your code must satisfy are:
the list returned must be finite (even if the language is infinite!)
every symbol in the list must be the first symbol in some string in the language
for every string in the language, its first symbol must occur in the list
Individual symbols may occur in any order, and may be duplicated any finite number of times.
The idea is to analyze the regular expression, not to produce all possible strings for that regular expression. For example, RSym sym clearly has sym as its first (and only) character, whereas REps has no start characters.
This means that you should define a function that finds the initial characters by case analysis on the constructors. You can implement such a function along these lines:
firsts :: RE sym -> [sym]
firsts (RSym sym) = [sym]
firsts REps = []
firsts RZero = …
firsts (RStar sub) = …
firsts (RPlus sub) = …
firsts (RAlt sub1 sub2) = …
firsts (RSeq sub1 sub2) = …
where sub, sub1 and sub2 are sub-regexes. For some of these cases you will have to make recursive calls to find the first characters of the sub-regex(es).
For (RSeq sub1 sub2) you will need a helper function matchEmpty :: RE sym -> Bool that checks whether a regular expression matches the empty string. If sub1 does, the first characters of sub2 can also be first characters of the whole regex; if sub1 does not match the empty string, that is impossible.
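One possible completion is sketched below, under stated assumptions. matchEmpty is the helper named in the answer. matchNothing is an extra helper that the answer does not mention (hypothetical, added here): without it, a sequence such as RSeq (RSym A) RZero would report A even though its language is empty, which breaks the requirement that every returned symbol starts some string.

```haskell
data RE sym
  = RSym sym
  | REps
  | RZero
  | RStar (RE sym)
  | RPlus (RE sym)
  | RAlt (RE sym) (RE sym)
  | RSeq (RE sym) (RE sym)
  deriving (Show)

data Alphabet = A | B | C deriving (Show, Eq)

-- Does the language contain the empty string?
matchEmpty :: RE sym -> Bool
matchEmpty (RSym _)   = False
matchEmpty REps       = True
matchEmpty RZero      = False
matchEmpty (RStar _)  = True
matchEmpty (RPlus r)  = matchEmpty r
matchEmpty (RAlt a b) = matchEmpty a || matchEmpty b
matchEmpty (RSeq a b) = matchEmpty a && matchEmpty b

-- Is the language empty (no strings at all)? Hypothetical extra helper.
matchNothing :: RE sym -> Bool
matchNothing (RSym _)   = False
matchNothing REps       = False
matchNothing RZero      = True
matchNothing (RStar _)  = False
matchNothing (RPlus r)  = matchNothing r
matchNothing (RAlt a b) = matchNothing a && matchNothing b
matchNothing (RSeq a b) = matchNothing a || matchNothing b

firsts :: RE sym -> [sym]
firsts (RSym s)   = [s]
firsts REps       = []
firsts RZero      = []
firsts (RStar r)  = firsts r
firsts (RPlus r)  = firsts r
firsts (RAlt a b) = firsts a ++ firsts b
firsts (RSeq a b)
  | matchNothing a || matchNothing b = []          -- no strings, so no first symbols
  | matchEmpty a                     = firsts a ++ firsts b
  | otherwise                        = firsts a
```

The recursion follows the structure of the regex, so the result is always a finite list, even for RStar and RPlus. No Eq or Ord constraint on sym is needed, at the cost of possible duplicates.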

(Ocaml) Using 'match' to extract list of chars from a list of chars

I have just started to learn ocaml and I find it difficult to extract small list of chars from a bigger list of chars.
lets say I have:
let list_of_chars = ['#' ; 'a' ; 'b' ; 'c'; ... ; '!' ; '3' ; '4' ; '5' ];;
I know that the list above starts with '#', which is followed by a '!' at some later position in the list.
I want to extract the lists ['a' ;'b' ;'c' ; ...] and ['3' ; '4' ; '5'] and do something with them,
so I do the following:
let variable = match list_of_chars with
  | '#'::l1@['!']@l2 -> (*[code to do something with l1 and l2]*)
  | _ -> raise Exception ;;
This code doesn't work for me; it throws errors. Is there a simple way of doing this (specifically using match)?
As another answer points out, you can’t use pattern matching for this because pattern matching only lets you use constructors, and the append operator @ is not a constructor.
Here is how you might solve your problem
(* Split [list] at the first element equal to [on]; None if there is none. *)
let split ~equal ~on list =
  let rec go acc = function
    | [] -> None
    | x::xs -> if equal x on then Some (List.rev acc, xs) else go (x::acc) xs
  in
  go [] list

let variable = match list_of_chars with
  | '#'::rest ->
    match split rest ~on:'!' ~equal:(Char.equal) with
    | None -> raise Exception
    | Some (left,right) ->
      ... (* your code here *)
I’m now going to hypothesise that you are trying to do some kind of parsing or lexing. I recommend that you do not do it with a list of chars. Indeed I think there is almost never a reason to have a list of chars in OCaml: a string is better for a string (a char list has an overhead of 23x in memory usage), and while one might use chars as a kind of mnemonic enum in C, OCaml has actual enums (aka variant types or sum types), so those should usually be used instead. I guess you might end up with a char list if you are doing something with a trie.
If you are interested in parsing or lexing, you may want to look into:
Ocamllex and ocamlyacc
Sedlex
Angstrom or another parser combinator library like it
One of the regular expression libraries (eg Re, Re2, Pcre; note Re and Re2 are mostly unrelated)
Using strings and functions like lsplit2
@ is an operator, not a constructor, so it is not a valid pattern. Patterns need to be static and can't match a varying number of elements in the middle of a list. But since you know the position of ! it doesn't need to be dynamic. You can accomplish it using just :: patterns:
let variable = match list_of_chars with
  | '#'::a::b::c::'!'::l2 -> let l1 = [a;b;c] in ...
  | _ -> raise Exception ;;

Removing commas from inside quoted parts of a string in Elm 0.16

Right now I am trying to remove any commas that are contained within quotation marks and replace them with spaces in this string:
(,(,data,"quoted,data",123,4.5,),(,data,(,!##,(,4.5,),"(,more","data,)",),),)
I am currently using this function that uses Javascript style regex:
removeNeedlessCommmas sExpression =
sExpression
|> (\_ -> replaceSpacesWithCommas sExpression)
|> Regex.replace Regex.All (Regex.regex ",") (\_ -> ",(?!(?:[^"]*"[^"]*")*[^"]*$)g")
This regex is displayed as working correctly in sites such as regex101.com.
However, I have tried many ways of escaping the regex so that it works in Elm 0.16, but the rest of my code in my file is always still highlighted like the rest of the file is enclosed in a string. This is the error that I am getting with my current code:
(line 1, column 64): unexpected "_" expecting space, "&" or escape code
39│ printToBrowser "((data \"quoted data\" 123 4.5) (data (!##(4.5) \"(more\" \"data)\")))"
Maybe <http://elm-lang.org/docs/syntax> can help you figure it out.
I will post the main function that the error is referring to so that it makes more sense:
main : Html.Html
main =
printToBrowser "((data \"quoted data\" 123 4.5) (data (!## (4.5) \"(more\" \"data)\")))"
Any assistance would be greatly appreciated. Thanks in advance.
I think you need 3 things:
Add a closing ) to the last anonymous function in removeNeedlessCommmas (this could have just been a copy-paste error)
Escape all the inner " in your regex, and drop the trailing g (it is a JavaScript flag rather than part of the pattern, and Regex.All already replaces every match), like so: ",(?!(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
Use the regex for matching, and replace with a space like so: Regex.replace Regex.All (Regex.regex ",(?!(?:[^\"]*\"[^\"]*\")*[^\"]*$)") (\_ -> " ")
If you'd consider a cowardly workaround alternative to a death-defying super-regex, I can offer this:
removeNeedlessCommas sExpr =
    replace All (regex "\"[^\"]*?\"")
        (\{match} -> String.map (\c -> if c == ',' then ' ' else c) match)
        sExpr
It lets regex find the quoted strings but does the comma substitution to those strings in a separate step. If preferred, that could be done by regex as well.
Here's my test harness, which ran fine in http://elm-lang.org/try :
import Html exposing (..)
import Regex exposing (..)
import String
str = """(,(,data,"quoted,data",123,4.5,),(,data,(,!##,(,4.5,),"(,more","data,)",),),)"""
main = div []
    [ (text str)
    , br [] []
    , (text (removeNeedlessCommas str))]
Output:
(,(,data,"quoted,data",123,4.5,),(,data,(,!##,(,4.5,),"(,more","data,)",),),)
(,(,data,"quoted data",123,4.5,),(,data,(,!##,(,4.5,),"( more","data )",),),)
Just for good measure, here's an algorithmic solution that does completely without regex:
removeNeedlessCommas str =
    reverse
        <| snd
        <| foldl (\c (inQ, acc) ->
            case c of
                '"' -> (not inQ, cons c acc)
                ',' -> (inQ, cons (if inQ then ' ' else c) acc)
                _ -> (inQ, cons c acc))
            (False, "")
            str

Haskell - Capitalize all letters in a list [String] with toUpper

I have a list [String]. The task is to remove those elements of the list that contain "q" or "p", and then capitalize all letters in the list with toUpper.
What I have tried so far is as follows:
delAndUpper :: [String] -> [String]
delAndUpper myList = filter (\x -> not('p' `elem` x || 'q' `elem` x)) myList
It removes the unwanted elements from the list properly, however I can't apply toUpper on this list since the type of toUpper is Char.
I tried it with map and it does not work.
delAndUpper myList = map toUpper (filter (\x -> not('p' `elem` x || 'q' `elem` x)) myList)
I know that toUpper in this line of code gets a list as its value and therefore it can't work, but I don't know how to go a level down into the list and then apply map toUpper.
Could you please help me.
Thanks in advance!
Greetings
Mapping one level deeper
You need to use map (map toUpper).
This is because you have [String] instead of String.
toUpper :: Char -> Char
map toUpper :: [Char] -> [Char]
i.e.
map toUpper :: String -> String
map (map toUpper) :: [String] -> [String]
map toUpper capitalises a String, by making each letter uppercase, so map (map toUpper) capitalises each String in a list of Strings.
Your function becomes
delAndUpper myList = map (map toUpper) (filter (\x -> not('p' `elem` x || 'q' `elem` x)) myList)
dave4420 made a good suggestion that (map.map) toUpper is a neat way of writing map (map toUpper) that helps you think two list levels in quite simply and naturally - have a look at his answer too.
Can we un-hardwire the p and q?
You asked if there was a shorter way to write the condition, and didn't like hard coding the `q` and `p`. I agree those multiple `elem` bits aren't pretty. Let's pass in the list of disallowed letters and tidy up a bit:
delAndUpper omit strings = map (map toUpper) (filter ok strings) where
  ok xs = not (any (`elem` omit) xs)
Here (`elem` omit) checks whether a character is in the list of ones that would cause us to omit the word, so (any (`elem` omit) xs) checks if any of the characters of xs are forbidden. Of course if none are forbidden, it's ok.
Your original delAndUpper would now be delAndUpper "pq", or if you also want to disallow capital P and Q, delAndUpper "pqPQ". (More on this later.)
Can we make it more concise?
Let's see if we can't write ok a little shorter. My first thought was to use pointfree on it (see my answer to another question for details of how to get it running in ghci), but it seemed to hang, so using some standard transformation tricks, we can compose not with a function that takes two arguments before giving us a Bool by doing (not.).f instead of not.f as we would with a function which just gave us a Bool after the first input. any is taking (`elem` omit) as its first argument. This gives us
ok xs = ((not.).any) (`elem` omit) xs
from which we can remove the trailing xs:
ok = ((not.).any) (`elem` omit)
and inline:
delAndUpper omit strings = map (map toUpper) (filter (((not.).any) (`elem` omit)) strings)
I'm not keen on the trailing strings either:
delAndUpper omit = map (map toUpper).filter (((not.).any) (`elem` omit))
(We could get rid of the omit argument as well and go completely point free, but that would go quite a bit too far down the hard-to-read road for my taste.)
Whither Q?
> delAndUpper "pq" $ words "The Queen has probably never had a parking ticket."
["THE","QUEEN","HAS","NEVER","HAD","A","TICKET."]
Is this the required behaviour? It seems strange to carefully exclude the lowercase variants and then make everything uppercase. We could do it the other way round:
upperAndDel omit = filter (((not.).any) (`elem` omit)).map (map toUpper)
giving
> upperAndDel "PQ" $ words "The Queen has probably never had a parking ticket."
["THE","HAS","NEVER","HAD","A","TICKET."]
I know that toUpper in this line of code gets a list as its value and therefore it can't work, but I don't know how to go a level down into the list and then apply map toUpper.
Use (map . map) instead of map.
n.b. (map . map) toUpper is the same as map (map toUpper) as suggested by the other answers. I mention it because, personally, I find it clearer: it looks more like it is going down two levels to apply toUpper. (You may not be familiar with the function composition operator (.), look it up or ask about it if you need to.)
Other functions with a similar type ((a -> b) -> something a -> something b), such as fmap, Data.Map.map, first and second, can be combined with each other and with map in a similar way.
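A few such combinations are sketched below to make the pattern concrete. The function names are made up for illustration, and first and second are taken from Control.Arrow here (an assumption; the answer does not say which module it has in mind).

```haskell
import Control.Arrow (first, second)
import Data.Char (toUpper)

-- Two list levels: every Char of every String.
shout :: [String] -> [String]
shout = (map . map) toUpper

-- A Maybe around a list: fmap goes under the Maybe, map under the list.
shoutMaybe :: Maybe String -> Maybe String
shoutMaybe = (fmap . map) toUpper

-- A list of pairs: change only the first component of each pair.
shoutKeys :: [(Char, Int)] -> [(Char, Int)]
shoutKeys = (map . first) toUpper

-- Three levels: the String in the second component of each pair.
shoutLabels :: [(Int, String)] -> [(Int, String)]
shoutLabels = (map . second . map) toUpper
```

Each composed "mapper" peels off one layer of structure, so reading the composition from left to right tells you the path down to the values being changed.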
Perhaps you don't find (map . map) toUpper clearer than map (map toUpper); fair enough.
You're almost there, you just need a second map.
map (map toUpper) (filter (\x -> not('p' `elem` x || 'q' `elem` x)) myList)
This is because String is completely synonymous with [Char] in vanilla Haskell. Since the type of map is
(a -> b) -> [a] -> [b]
and we have
toUpper :: Char -> Char
type String = [Char]
map toUpper gives us back another String, except capitalized.
By the way, that ugly-ish filter can be made prettier by using arrows :) (Think of these like more structured functions)
map (map toUpper) . filter (elem 'p' &&& elem 'q' >>> arr (not . uncurry (||)))
Gratuitousness? Maybe, but kinda cool.

F# Mapping Regular Expression Matches with Active Patterns

I found this useful article on using Active Patterns with Regular Expressions:
http://www.markhneedham.com/blog/2009/05/10/f-regular-expressionsactive-patterns/
The original code snippet used in the article was this:
open System.Text.RegularExpressions
let (|Match|_|) pattern input =
    let m = Regex.Match(input, pattern) in
    if m.Success then Some (List.tl [ for g in m.Groups -> g.Value ]) else None

let ContainsUrl value =
    match value with
    | Match "(http:\/\/\S+)" result -> Some(result.Head)
    | _ -> None
Which would let you know if at least one url was found and what that url was (if I understood the snippet correctly)
Then in the comment section Joel suggested this modification:
Alternative, since a given group may or may not be a successful match:
List.tail [ for g in m.Groups -> if g.Success then Some g.Value else None ]
Or maybe you give labels to your groups and you want to access them by name:
(re.GetGroupNames()
 |> Seq.map (fun n -> (n, m.Groups.[n]))
 |> Seq.filter (fun (n, g) -> g.Success)
 |> Seq.map (fun (n, g) -> (n, g.Value))
 |> Map.ofSeq)
After trying to combine all of this I came up with the following code:
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let (|Match|_|) pattern input =
    let re = new Regex(pattern)
    let m = re.Match(input) in
    if m.Success then Some ((re.GetGroupNames()
                                |> Seq.map (fun n -> (n, m.Groups.[n]))
                                |> Seq.filter (fun (n, g) -> g.Success)
                                |> Seq.map (fun (n, g) -> (n, g.Value))
                                |> Map.ofSeq)) else None

let GroupMatches stringToSearch =
    match stringToSearch with
    | Match "(http:\/\/\S+)" result -> printfn "%A" result
    | _ -> ()

GroupMatches testString;;
When I run my code in an interactive session this is what is output:
map [("0", "http://www.bob.com"); ("1", "http://www.bob.com")]
The result I am trying to achieve would look something like this:
map [("http://www.bob.com", 2); ("http://www.b.com", 1); ("http://www.bill.com", 1);]
Basically a mapping of each unique match found followed by the count of the number of times that specific matching string was found in the text.
If you think I'm going down the wrong path here please feel free to suggest a completely different approach. I'm somewhat new to both Active Patterns and Regular Expressions so I have no idea where to even begin in trying to fix this.
I also came up with this which is basically what I would do in C# translated to F#.
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let matches =
    let matchDictionary = new Dictionary<string,int>()
    for mtch in (Regex.Matches(testString, "(http:\/\/\S+)")) do
        for m in mtch.Captures do
            if(matchDictionary.ContainsKey(m.Value)) then
                matchDictionary.Item(m.Value) <- matchDictionary.Item(m.Value) + 1
            else
                matchDictionary.Add(m.Value, 1)
    matchDictionary
Which returns this when run:
val matches : Dictionary = dict [("http://www.bob.com", 2); ("http://www.b.com", 1); ("http://www.bill.com", 1)]
This is basically the result I am looking for, but I'm trying to learn the functional way to do this, and I think that should include active patterns. Feel free to try to "functionalize" this if it makes more sense than my first attempt.
Thanks in advance,
Bob
Interesting stuff, I think everything you are exploring here is valid. (Partial) active patterns for regular expression matching work very well indeed. Especially when you have a string which you want to match against multiple alternative cases. The only thing I'd suggest with the more complex regex active patterns is that you give them more descriptive names, possibly building up a collection of different regex active patterns with differing purposes.
As for your C# to F# example, you can have a functional solution just fine without active patterns, e.g.
let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"
let matches input =
    Regex.Matches(input, "(http:\/\/\S+)")
    |> Seq.cast<Match>
    |> Seq.groupBy (fun m -> m.Value)
    |> Seq.map (fun (value, groups) -> value, (groups |> Seq.length))

//FSI output:
> matches testString;;
val it : seq<string * int> =
  seq
    [("http://www.bob.com", 2); ("http://www.b.com", 1);
     ("http://www.bill.com", 1)]
Update
The reason why this particular example works fine without active patterns is because 1) you are only testing one pattern, 2) you are dynamically processing the matches.
For a real world example of active patterns, let's consider a case where 1) we are testing multiple regexes, 2) we are testing for one regex match with multiple groups. For these scenarios, I use the following two active patterns, which are a bit more general than the first Match active pattern you showed (I do not discard first group in the match, and I return a list of the Group objects, not just their values -- one uses the compiled regex option for static regex patterns, one uses the interpreted regex option for dynamic regex patterns). Because the .NET regex API is so feature filled, what you return from your active pattern is really up to what you find useful. But returning a list of something is good, because then you can pattern match on that list.
let (|InterpretedMatch|_|) pattern input =
    if input = null then None
    else
        let m = Regex.Match(input, pattern)
        if m.Success then Some [for x in m.Groups -> x]
        else None

///Match the pattern using a cached compiled Regex
let (|CompiledMatch|_|) pattern input =
    if input = null then None
    else
        let m = Regex.Match(input, pattern, RegexOptions.Compiled)
        if m.Success then Some [for x in m.Groups -> x]
        else None
Notice also how these active patterns consider null a non-match, instead of throwing an exception.
OK, so let's say we want to parse names. We have the following requirements:
Must have first and last name
May have middle name
First, optional middle, and last name are separated by a single blank space in that order
Each part of the name may consist of any combination of one or more letters or numbers
Input may be malformed
First we'll define the following record:
type Name = {First:string; Middle:option<string>; Last:string}
Then we can use our regex active pattern quite effectively in a function for parsing a name:
let parseName name =
    match name with
    | CompiledMatch @"^(\w+) (\w+) (\w+)$" [_; first; middle; last] ->
        Some({First=first.Value; Middle=Some(middle.Value); Last=last.Value})
    | CompiledMatch @"^(\w+) (\w+)$" [_; first; last] ->
        Some({First=first.Value; Middle=None; Last=last.Value})
    | _ ->
        None
Notice one of the key advantages we gain here, which is the case with pattern matching in general, is that we are able to simultaneously test that an input matches the regex pattern, and decompose the returned list of groups if it does.