Find all capturing groups of a regular expression - regex

I am looking for a Haskell function that returns the capturing groups of all matches of a given regex.
I have been looking at Text.Regex, but couldn't find anything there.
Now I am using this workaround which seems to work:
import Text.Regex
findNext :: String -> Maybe (String, String, String, [String]) -> [[String]]
findNext pattern Nothing = []
findNext pattern (Just (_, _, rest, matches)) =
  case matches of
    [] -> findNext pattern res
    _  -> [matches] ++ findNext pattern res
  where res = matchRegexAll (mkRegex pattern) rest
findAll :: String -> String -> [[String]]
findAll pattern str = findNext pattern (Just ("", "", str, []))
Result:
findAll "x(.)x(.)" "aaaxAxaaaxBxaaaxCx"
[["A","a"],["B","a"]]
Question:
Did I miss something in Text.Regex?
Is there a Haskell regex library that implements a findAll function?

You can use the =~ operator from Text.Regex.Posix:
Prelude> :mod + Text.Regex.Posix
Prelude Text.Regex.Posix> "aaaxAxaaaxBxaaaxCx" =~ "x(.)x(.)" :: [[String]]
[["xAxa","A","a"],["xBxa","B","a"]]
Note the explicit [[String]] type. Try replacing it with Bool, Int, String and see what happens. All types that you can use in this context are listed here. Also see this tutorial.
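
To make the point about result types concrete, here is a minimal sketch, assuming the regex-posix package is installed. The names input, regex, anyMatch, howMany, firstMatch, allMatches and groupsOnly are made up for this illustration; only =~ comes from the library. Dropping the first element of each inner list should give the shape the question's findAll returns.

```haskell
import Text.Regex.Posix ((=~))

-- Illustrative names; only (=~) comes from Text.Regex.Posix.
input, regex :: String
input = "aaaxAxaaaxBxaaaxCx"
regex = "x(.)x(.)"

-- The requested result type selects what (=~) computes.
anyMatch :: Bool
anyMatch = input =~ regex      -- is there at least one match?

howMany :: Int
howMany = input =~ regex       -- number of non-overlapping matches

firstMatch :: String
firstMatch = input =~ regex    -- text of the first match

allMatches :: [[String]]
allMatches = input =~ regex    -- per match: whole match, then its groups

-- Keep only the capturing groups by dropping the whole-match entry.
groupsOnly :: [[String]]
groupsOnly = map (drop 1) allMatches

main :: IO ()
main = do
  print anyMatch
  print howMany
  print firstMatch
  print allMatches
  print groupsOnly
```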

Related

Haskell, regex, TDFA: match (and remove) quoted substrings

There is a regular expression matching quoted substrings: "/\"(?:[^\"\\]|\\.)*\"/" (originally /"(?:[^"\\]|\\.)*"/, see Here). Tested on regex101, it works.
With TDFA, it's a syntax error:
*** Exception: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died:
parseRegex for Text.Regex.TDFA.String failed:"/"(?:[^"\]|\.)*"/" (line 1, column 4):
unexpected "?"
expecting empty () or anchor ^ or $ or an atom
Is there a way to correct it?
Test string: Is big "problem", no?
Expected result: "problem"
UPD:
This is full context:
removeQuotedSubstrings :: String -> [String]
removeQuotedSubstrings str =
  let quoteds = concat (str =~ ("/\"(?:[^\"\\]|\\.)*\"/" :: String) :: [[String]])
  in quoteds
No improvement, just an acceptable solution, albeit lacking in elegance:
import qualified Data.Text as T
import Text.Regex.TDFA
-- | Replaces every double-quoted substring, if any, with a single space.
--
-- Examples:
--
-- >>> removeQuotedSubstrings "alfa"
-- "alfa"
-- >>> removeQuotedSubstrings "ngoro\"dup\"lai \"ming\""
-- "ngoro lai  "
removeQuotedSubstrings :: String -> String
removeQuotedSubstrings str =
  -- Keep only the whole matches (they start with a quote); 'take 1' avoids
  -- calling the partial 'head' on an empty submatch.
  let quoteds = filter ((== "\"") . take 1)
              $ concat (str =~ ("\"(\\.|[^\"\\])*\"" :: String) :: [[String]])
  in T.unpack $ foldr (\quoted acc -> T.replace (T.pack quoted) " " acc)
                      (T.pack str) quoteds
Yes, the final purpose has always been to remove the quoted substrings.
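
For what it's worth, the parse error comes from the pattern syntax: regex-tdfa accepts POSIX extended regular expressions, so the Perl-style non-capturing group (?:...) is rejected, and the surrounding slashes are delimiters from other languages, not part of the pattern. Below is a minimal sketch of an alternative that avoids Data.Text by splitting on the first match repeatedly. The name stripQuoted and the simplified pattern are assumptions made for this illustration; the pattern deliberately ignores backslash-escaped quotes.

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: replaces each double-quoted substring with one space.
-- Simplifying assumption: backslash-escaped quotes are not treated specially.
stripQuoted :: String -> String
stripQuoted s =
  case s =~ quotedPat :: (String, String, String) of
    -- An empty middle component means nothing matched; 'before' is all of s.
    (before, "", _)    -> before
    -- Otherwise keep the prefix, emit a space, and continue after the match.
    (before, _, after) -> before ++ " " ++ stripQuoted after
  where
    quotedPat :: String
    quotedPat = "\"[^\"]*\""

main :: IO ()
main = do
  putStrLn (stripQuoted "alfa")
  putStrLn (stripQuoted "Is big \"problem\", no?")
```

The (before, match, after) result type is one of the standard contexts for =~, so no offset arithmetic is needed.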

Distinguish empty regexp matches from no matches in Haskell

I'm trying to use regex-pcre, but regex-base contains so many overloads for RegexContext that I don't know which one I should use for the task at hand.
I want to match a string against (foo)-(bar)|(quux)-(quux)(q*u*u*x*) regular expression the following way:
myMatch :: String -> Maybe (String, String, Maybe String)
Sample output:
myMatch "dfjdjk" should be Nothing as there is no match
myMatch "foo-bar" should be Just ("foo", "bar", Nothing) as there's no third capture group in the first alternative
myMatch "quux-quuxqu" should be Just ("quux", "quux", Just "qu")
myMatch "quux-quux" should be Just ("quux", "quux", Just "") as the third capture group is present but empty
It's not an assignment, I'm just baffled that https://github.com/erantapaa/haskell-regexp-examples/blob/master/RegexExamples.hs doesn't contain code paths for situations where there are no matches or no capture groups.
A way of achieving it is using getAllTextSubmatches:
import Text.Regex.PCRE
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = case getAllTextSubmatches $ str =~ "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)" :: [String] of
  [] -> Nothing
  [_, g1, g2, "", "", ""] -> Just (g1, g2, Nothing)
  [_, "", "", g3, g4, g5] -> Just (g3, g4, Just g5)
When getAllTextSubmatches has [String] as return type, it returns an empty list if there is no match, or a list with all capturing groups (where index 0 is the whole match) of the first match.
Alternatively, if a matched group may be empty and you cannot pattern match on the empty string, you can use [(String, (MatchOffset, MatchLength))] as return type of getAllTextSubmatches and pattern match MatchOffset with -1 to identify unmatched groups:
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = case getAllTextSubmatches $ str =~ "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)" :: [(String, (MatchOffset, MatchLength))] of
  [] -> Nothing
  [_, (g1, _), (g2, _), (_, (-1, _)), (_, (-1, _)), (_, (-1, _))] -> Just (g1, g2, Nothing)
  [_, (_, (-1, _)), (_, (-1, _)), (g3, _), (g4, _), (g5, _)] -> Just (g3, g4, Just g5)
Now, if that looks too verbose:
{-# LANGUAGE PatternSynonyms #-}
pattern NoMatch = ("", (-1, 0))
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = case getAllTextSubmatches $ str =~ "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)" :: [(String, (MatchOffset, MatchLength))] of
  [] -> Nothing
  [_, (g1, _), (g2, _), NoMatch, NoMatch, NoMatch] -> Just (g1, g2, Nothing)
  [_, NoMatch, NoMatch, (g3, _), (g4, _), (g5, _)] -> Just (g3, g4, Just g5)
To distinguish when there is no match, use =~~ so that it will place the result in a Maybe monad. It will use fail to return Nothing if there are no matches.
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = do
  let regex = "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)"
  groups <- getAllTextSubmatches <$> str =~~ regex :: Maybe [String]
  case groups of
    [_, g1, g2, "", "", ""] -> Just (g1, g2, Nothing)
    [_, "", "", g3, g4, g5] -> Just (g3, g4, Just g5)
Use regex-applicative
myMatch = match re
re = foobar <|> quuces where
  foobar = (,,) <$> "foo" <* "-" <*> "bar" <*> pure Nothing
  quuces = (,,)
    <$> "quux" <* "-"
    <*> "quux"
    <*> (fmap (Just . mconcat) . sequenceA)
          [many $ sym 'q', many $ sym 'u', many $ sym 'u', many $ sym 'x']
or, with ApplicativeDo,
re = foobar <|> quuces where
  foobar = do
    foo <- "foo"
    _ <- "-"
    bar <- "bar"
    pure (foo, bar, Nothing)
  quuces = do
    quux1 <- "quux"
    _ <- "-"
    quux2 <- "quux"
    quux3 <- fmap snd . withMatched $
      traverse (many . sym) ("quux" :: [Char])
      -- [many $ sym 'q', many $ sym 'u', many $ sym 'u', many $ sym 'x']
    pure (quux1, quux2, Just quux3)

How to merge Regex?

Background
Let's say I have several Regex values here.
import Text.Regex
openTag = mkRegex "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = mkRegex "</\\1>"
any = mkRegex "(.*?)"
Problem
openTag ++ any ++ closeTag <-- Just for illustration purpose
How can I merge them? To be specific, I want a Regex -> Regex -> Regex function. Alternatively, converting a Regex back to a String would be good.
openTag ++ "hello" ++ closeTag <-- Just for illustration purpose
Thus, I can create my own Regex -> String -> Regex function ultimately.
Workaround
Manipulate the string literals.
import Text.Regex
openTag = "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = "</\\1>"
any = "(.*?)"
tagWithAny = mkRegex $ openTag ++ any ++ closeTag
tagWith :: String -> Regex
tagWith s = mkRegex $ openTag ++ s ++ closeTag
The Regex type in Text.Regex is essentially a wrapper around a C pointer:
data Regex = Regex (ForeignPtr CRegex) CompOption ExecOption
AFAIK there is no way to recover the string representation of a POSIX regex after it has been compiled (see the regcomp(3) man page).
If you'd like to operate on regular expressions algebraically, wrap them in your own type to postpone compilation, or use, for example, regex-applicative.
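
A minimal sketch of the "wrap them in your own type" idea, assuming the regex-compat package (Text.Regex) is available. The Pattern type, source and compile are hypothetical names introduced here, not part of any library. The pattern pieces are simplified to plain POSIX extended syntax, so they leave out the \b, \1 and .*? constructs from the question.

```haskell
import Text.Regex (Regex, mkRegex, matchRegex)

-- Hypothetical wrapper: keeps the pattern source as a String so that pieces
-- can still be combined; compilation is postponed until 'compile' is called.
newtype Pattern = Pattern String

instance Semigroup Pattern where
  Pattern a <> Pattern b = Pattern (a ++ b)

instance Monoid Pattern where
  mempty = Pattern ""

-- Recover the source text, which a compiled Regex cannot provide.
source :: Pattern -> String
source (Pattern src) = src

-- Compile only once the full pattern has been assembled.
compile :: Pattern -> Regex
compile = mkRegex . source

-- Simplified pieces (assumption: POSIX extended syntax only).
openTag, closeTag, anyText :: Pattern
openTag  = Pattern "<([A-Z][A-Z0-9]*)[^>]*>"
closeTag = Pattern "</[A-Z][A-Z0-9]*>"
anyText  = Pattern "(.*)"

tagWithAny :: Regex
tagWithAny = compile (openTag <> anyText <> closeTag)

tagWith :: String -> Regex
tagWith s = compile (openTag <> Pattern s <> closeTag)

main :: IO ()
main = do
  putStrLn (source (openTag <> anyText <> closeTag))
  print (matchRegex (tagWith "hello") "<B>hello</B>")
```

Because Pattern is a Monoid, mconcat can join any number of pieces, and the Regex -> String -> Regex function the question asks for becomes an ordinary function on Pattern values.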

syntax error in ocaml because of String.concat

Let's say I have a list of integers [blah;blah;blah;...] and I don't know the size of the list, and I want to pattern match on it and not print its first element. Is there any way to do this without using an if/else or getting a syntax error?
All I'm trying to do is parse a file that looks like a/path/to/blah/blah/../file.c
and only print the path/to/blah/blah
for example, can it be done like this?
let out x = Printf.printf " %s \n" x
let _ = try
  while true do
    let line = input_line stdin in
    ...
    let rec f (xpath: string list) : ( string list ) =
      begin match Str.split (Str.regexp "/") xpath with
      | _::rest -> out (String.concat "/" _::xpath);
      | _ -> ()
      end
But if I do this, I get a syntax error on the String.concat line!
String.concat "/" _::xpath doesn't mean anything because _ is a pattern, not a value. _ can be used on the left-hand side of a pattern-matching arm but not on the right-hand side.
What you want to do is String.concat "/" rest.
Even if _::xpath were correct, String.concat "/" _::xpath would be interpreted as (String.concat "/" _)::xpath whereas you want it to be interpreted as String.concat "/" (_::xpath).

String regex matching in Erlang

How would I do regex matching in Erlang?
All I know is this:
f("AAPL" ++ Inputstring) -> true.
The lines that I need to match
"AAPL,07-May-2010 15:58,21.34,21.36,21.34,21.35,525064\n"
In Perl regex: ^AAPL,* (or something similar)
In Erlang?
Use the re module, e.g.:
...
String = "AAPL,07-May-2010 15:58,21.34,21.36,21.34,21.35,525064\n",
RegExp = "^AAPL,*",
case re:run(String, RegExp) of
    {match, Captured} -> ... ;
    nomatch -> ...
end,
...