Haskell, regex, TDFA: match (and remove) quoted substrings

There is a regular expression matching quoted substrings: "/\"(?:[^\"\\]|\\.)*\"/" (originally /"(?:[^"\\]|\\.)*"/, see Here). Tested on regex101, it works.
With TDFA, it fails with a syntax error:
*** Exception: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died:
parseRegex for Text.Regex.TDFA.String failed:"/"(?:[^"\]|\.)*"/" (line 1, column 4):
unexpected "?"
expecting empty () or anchor ^ or $ or an atom
Is there a way to correct it?
Test string: Is big "problem", no?
Expected result: "problem"
UPD:
This is full context:
removeQuotedSubstrings :: String -> [String]
removeQuotedSubstrings str =
  let quoteds = concat (str =~ ("/\"(?:[^\"\\]|\\.)*\"/" :: String) :: [[String]])
  in quoteds

No improvement, just an acceptable solution, albeit lacking in elegance:
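import qualified Data.Text as T
import Text.Regex.TDFA

-- | Removes all double quoted substrings, if any, from a string.
--
-- Examples:
--
-- >>> removeQuotedSubstrings "alfa"
-- "alfa"
-- >>> removeQuotedSubstrings "ngoro\"dup\"lai \"ming\""
-- "ngoro lai  "
removeQuotedSubstrings :: String -> String
removeQuotedSubstrings str =
  -- Keep only the whole matches: they start with a quote, captured groups do not.
  -- take 1 avoids calling head on an empty captured group (e.g. for "").
  -- The regex needs \\\\. in the Haskell literal so that TDFA sees \\. (backslash
  -- followed by any character); \\. would reach TDFA as \. (a literal dot).
  let quoteds = filter (("\"" ==) . take 1)
              $ concat (str =~ ("\"(\\\\.|[^\"\\])*\"" :: String) :: [[String]])
  in T.unpack $ foldr (\quoted acc -> T.replace (T.pack quoted) " " acc)
                      (T.pack str) quoteds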
Yes, the final purpose has always been to remove the quoted substrings.
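For comparison, here is a minimal sketch that removes quoted substrings without any regex library, using a hand-written scanner over the String. This is a swapped-in technique, not what the answer above does. The names stripQuoted and skipQuoted are hypothetical, and the handling of an unterminated quote (leave it untouched) is an assumption.

```haskell
-- | Replace every double-quoted substring with a single space.
-- Inside quotes, a backslash escapes the next character.
-- A quote with no closing partner is left untouched (an assumption).
stripQuoted :: String -> String
stripQuoted [] = []
stripQuoted ('"' : rest) =
  case skipQuoted rest of
    Just after -> ' ' : stripQuoted after   -- drop the quoted part
    Nothing    -> '"' : stripQuoted rest    -- unterminated: keep the quote
stripQuoted (c : rest) = c : stripQuoted rest

-- | Consume input up to and including the closing quote.
-- Returns the remaining input, or Nothing if no closing quote exists.
skipQuoted :: String -> Maybe String
skipQuoted ('\\' : _ : rest) = skipQuoted rest  -- escaped character
skipQuoted ('"' : rest)      = Just rest        -- closing quote
skipQuoted (_ : rest)        = skipQuoted rest
skipQuoted []                = Nothing
```

Because it walks the string once, this version does not need the intermediate list of matches or the conversion to Text.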

Related

Error while passing strings in functions in Haskell

I tried the following code in Haskell, where I try to detect whether the user has entered "no" or "No" in the string. I also tried replacing [[Char]] with String, but that gives compilation errors.
wantGifts :: [[Char]] -> [[Char]]
wantGifts st = [if (x == "No" || x== "no") then "No gifts given" else "But why" | x <- st, x == head st]
The above code compiles but when I pass a string to it, it returns an error message:
*Main> wantGifts "no I dont"
<interactive>:8:11:
Couldn't match type ‘Char’ with ‘[Char]’
Expected type: [[Char]]
Actual type: [Char]
In the first argument of ‘wantGifts’, namely ‘"no I dont"’
In the expression: wantGifts "no I dont"
In an equation for ‘it’: it = wantGifts "no I dont"
Look closely at the type of wantGifts: it requires a list of lists of Chars. But "no I dont" is of type String, which is just [Char]. With your current construction, you have to use:
wantGifts ["no I dont"]
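There are several ways to improve this; one good option is to use Text. Note that the string literals below are only accepted as Text with the OverloadedStrings extension enabled.
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
wantGifts :: Text -> Text
wantGifts txt = if (T.isInfixOf "no" . T.toLower) txt then "No gifts given" else "But why"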
You have defined wantGifts as taking a list of strings. [[Char]] is equivalent to [String]. In the REPL, you are passing it a single string.
If you instead did this, it would compile:
wantGifts ["no I dont"]
However, I have a hunch this isn't what you want.
If you were trying to detect whether the word "no" was anywhere in the string, you could use the words function:
containsNo :: String -> Bool
containsNo = any (\w -> w == "no" || w == "No") . words
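As a small variation on the words-based idea, the sketch below ignores case and punctuation attached to a word, so "No," and "NO" also count. It uses only base. The names saysNo and giftReply are hypothetical, and treating only leading and trailing non-letters as punctuation is an assumption.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (dropWhileEnd)

-- | True when the standalone word "no" occurs, ignoring case and any
-- non-letter characters attached to either end of a word.
saysNo :: String -> Bool
saysNo = any ((== "no") . normalize) . words
  where
    normalize = map toLower
              . dropWhileEnd (not . isAlpha)
              . dropWhile (not . isAlpha)

-- | Reply in the style of the original wantGifts, for a single String.
giftReply :: String -> String
giftReply s
  | saysNo s  = "No gifts given"
  | otherwise = "But why"
```

Unlike a substring test such as isInfixOf, this does not fire on words that merely contain "no", for example "know" or "nothing".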

OCaml regexp "any" matching, where "]" is one of the characters

I'd like to match a string containing any of the characters "a" through "z", or "[" or "]", but nothing else. The regexp should match
"b"
"]abc["
"ab[c"
but not these
"2"
"(abc)"
I tried this:
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\]]*$") s 0;;
content_check "]abc[";;
and got warned that the "escape" before the "]" was illegal, although I'm pretty certain that the equivalent in, say, sed or awk would work fine.
Anyhow, I tried un-escaping the bracket, but
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[]]*$") s 0;;
doesn't work at all, since it should match any of a-z or "[", then the first "]" closes the "any" selection, after which there must be any number of "]"s. So it should match
[abc]]]]
but not
]]]abc[
In practice, that's not what happens at all; I get the following:
# let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-zA-Z[]]*$") s 0;;
content_check "]abc[";;
content_check "[abc]]]";;
content_check "]abc[";;
val content_check : string -> bool = <fun>
# - : bool = false
# - : bool = false
# - : bool = false
Can anyone explain/suggest an alternative?
@Tim Pietzker's suggestion sounded really good, but appears not to work:
# #load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\\]]*$") s 0;;
content_check "]abc[";;
# val content_check : string -> bool = <fun>
# - : bool = false
#
nor does it work when I double-escape the "[" in the pattern, just in case. :(
Indeed, here's a MWE:
#load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "[\\]]") s 0;;
content_check "]";; (* should be true *)
This is not going to really answer your question, but it will solve your problem. With the re library:
let re_set = Re.(rep (* "rep" is the star *) @@ alt [
  rg 'a' 'z' ; (* the range from a to z *)
  set "[]" ; (* the set composed of [ and ] *)
])
(* version that matches the whole text *)
let re = Re.(compile @@
  seq [ start ; re_set ; stop ])
let content_check s =
  Printf.printf "%s : %b\n" s (Re.execp re s)
let () =
  List.iter content_check [
    "]abc[" ;
    "[abc]]]" ;
    "]abc[" ;
    "]abc[" ;
    "abc##"
  ]
As you noticed, str from the stdlib is awkward, to put it mildly. re is a very good alternative, and it comes with various regexp syntaxes and combinators (which I tend to use, because I find them easier than regexp syntax).
I'm an idiot. (But perhaps the designers of Str weren't so clever either.)
From the "Str" documentation: "To include a ] character in a set, make it the first character in the set."
With this, it's not so clear how to search for "anything except a ]", since you'd have to place the "^" in front of it. Sigh.
:(

How to merge Regex?

Background
Let's say I have several Regex values here.
import Text.Regex
openTag = mkRegex "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = mkRegex "</\\1>"
any = mkRegex "(.*?)"
Problem
openTag ++ any ++ closeTag <-- Just for illustration purpose
How can I merge them? To be specific, a Regex -> Regex -> Regex function. Alternatively, convert a Regex back to String would be good.
openTag ++ "hello" ++ closeTag <-- Just for illustration purpose
Thus, I can create my own Regex -> String -> Regex function ultimately.
Workaround
Manipulate the string literals.
import Text.Regex
openTag = "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = "</\\1>"
any = "(.*?)"
tagWithAny = mkRegex $ openTag ++ any ++ closeTag
tagWith :: String -> Regex
tagWith s = mkRegex $ openTag ++ s ++ closeTag
Regex type in the Text.Regex is essentially a C pointer:
data Regex = Regex (ForeignPtr CRegex) CompOption ExecOption
AFAIK there is no way to recover the string representation of a POSIX regex after it has been compiled; see the regcomp(3) man page.
If you'd like to operate on regular expressions algebraically, wrap them in your own type to postpone compilation, or use, for example, regex-applicative.
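One way to read the "wrap them in your own type" suggestion is sketched below: keep the pattern source in a newtype, combine sources with <>, and compile only at the very end. The names Pattern, lit, render, anyText and tagWith are hypothetical. The pattern strings are copied from the question as-is, and the set of characters escaped by lit is an assumption about which characters are special in the target regex syntax.

```haskell
-- | Uncompiled pattern source. Combining two patterns is string concatenation.
newtype Pattern = Pattern String
  deriving (Eq, Show)

instance Semigroup Pattern where
  Pattern a <> Pattern b = Pattern (a ++ b)

instance Monoid Pattern where
  mempty = Pattern ""

-- | Embed literal text, backslash-escaping characters assumed to be special.
lit :: String -> Pattern
lit = Pattern . concatMap escape
  where
    escape c
      | c `elem` "\\.[()*+?{|^$" = ['\\', c]
      | otherwise                = [c]

-- | Recover the source string; this is the step a compiled Regex cannot offer.
render :: Pattern -> String
render (Pattern s) = s

openTag, closeTag, anyText :: Pattern
openTag  = Pattern "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = Pattern "</\\1>"
anyText  = Pattern "(.*?)"

-- | Wrap a body pattern in the open and close tag patterns.
tagWith :: Pattern -> Pattern
tagWith body = openTag <> body <> closeTag

-- Compilation is postponed to the last moment, e.g. with Text.Regex:
--   mkRegex (render (tagWith anyText))
```

With this wrapper, tagWith (lit "hello") plays the role of the question's Regex -> String -> Regex function, and the result can still be inspected or combined further before compiling.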

syntax error in ocaml because of String.concat

Let's say I have a list of type integer [blah;blah;blah;...] whose size I don't know, and I want to pattern match on it and print everything except the first element. Is there any way to do this without using an if/else or getting a syntax error?
All I'm trying to do is parse a file that looks like a/path/to/blah/blah/../file.c
and only print the path/to/blah/blah
for example, can it be done like this?
let out x = Printf.printf " %s \n" x
let _ = try
while true do
let line = input_line stdin in
...
let rec f (xpath: string list) : ( string list ) =
begin match Str.split (Str.regexp "/") xpath with
| _::rest -> out (String.concat "/" _::xpath);
| _ -> ()
end
but if i do this i have a syntax error at the line of String.concat!!
String.concat "/" _::xpath doesn't mean anything because _ is a pattern, not a value. _ can be used on the left-hand side of a pattern match but not on the right-hand side.
What you want to do is String.concat "/" rest.
Even if _::xpath were correct, String.concat "/" _::xpath would be interpreted as (String.concat "/" _)::xpath whereas you want it to be interpreted as String.concat "/" (_::xpath).

Find all capturing groups of a regular expression

I am looking for a Haskell function that returns the capturing groups of all matches of a given regex.
I have been looking at Text.Regex, but couldn't find anything there.
Now I am using this workaround which seems to work:
import Text.Regex
findNext :: String -> Maybe (String, String, String, [String] ) -> [ [String] ]
findNext pattern Nothing = []
findNext pattern (Just (_, _, rest, matches) ) =
  case matches of
    [] -> (findNext pattern res)
    _ -> [matches] ++ (findNext pattern res)
  where res = matchRegexAll (mkRegex pattern) rest
findAll :: String -> String -> [ [String] ]
findAll pattern str = findNext pattern (Just ("", "", str, [] ) )
Result:
findAll "x(.)x(.)" "aaaxAxaaaxBxaaaxCx"
[["A","a"],["B","a"]]
Question:
Did I miss something in Text.Regex?
Is there a Haskell regex library that implements a findAll function?
You can use the =~ operator from Text.Regex.Posix:
Prelude> :mod + Text.Regex.Posix
Prelude Text.Regex.Posix> "aaaxAxaaaxBxaaaxCx" =~ "x(.)x(.)" :: [[String]]
[["xAxa","A","a"],["xBxa","B","a"]]
Note the explicit [[String]] type. Try replacing it with Bool, Int, String and see what happens. All types that you can use in this context are listed here. Also see this tutorial.
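If you prefer to stay with Text.Regex, the recursive workaround from the question can also be written as an unfold over any function shaped like matchRegexAll. The sketch below is base-only so that it runs on its own. Matcher, findAllWith and toyMatcher are hypothetical names, and toyMatcher is a hand-written stand-in for the pattern x(.)x(.), not a regex engine. Two assumptions differ from the question's code: matches without groups are kept, and an empty match stops the search to avoid looping forever.

```haskell
import Data.List (unfoldr)

-- | Same result shape as matchRegexAll applied to a compiled regex:
-- (text before, matched text, text after, captured groups).
type Matcher = String -> Maybe (String, String, String, [String])

-- | Collect the captured groups of every successive match.
findAllWith :: Matcher -> String -> [[String]]
findAllWith match = unfoldr step
  where
    step input =
      case match input of
        Just (_, matched, rest, groups)
          | not (null matched) -> Just (groups, rest)
        _ -> Nothing

-- | Toy stand-in for the pattern x(.)x(.), used only to exercise findAllWith.
toyMatcher :: Matcher
toyMatcher = go ""
  where
    go _ [] = Nothing
    go seen s@('x' : a : 'x' : b : rest) =
      Just (reverse seen, take 4 s, rest, [[a], [b]])
    go seen (c : rest) = go (c : seen) rest
```

With the real library, the matcher would be matchRegexAll (mkRegex pattern), so the regex is compiled once instead of on every recursive step.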