How to merge Regex?

Background
Let's say I have several Regex values here.
import Text.Regex
openTag = mkRegex "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = mkRegex "</\\1>"
any = mkRegex "(.*?)"
Problem
openTag ++ any ++ closeTag <-- Just for illustration purpose
How can I merge them? To be specific, I am looking for a Regex -> Regex -> Regex function. Alternatively, converting a Regex back to a String would be good enough:
openTag ++ "hello" ++ closeTag <-- Just for illustration purpose
That way I could ultimately write my own Regex -> String -> Regex function.
Workaround
Manipulate the string literals.
import Text.Regex
openTag = "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = "</\\1>"
any = "(.*?)"
tagWithAny = mkRegex $ openTag ++ any ++ closeTag
tagWith :: String -> Regex
tagWith s = mkRegex $ openTag ++ s ++ closeTag

The Regex type in Text.Regex is essentially a C pointer:
data Regex = Regex (ForeignPtr CRegex) CompOption ExecOption
AFAIK there is no way to recover the string representation of a POSIX regex after it has been compiled (see the regcomp(3) man page).
If you'd like to operate on regular expressions algebraically, wrap them in your own type to postpone compilation, or use a library such as regex-applicative.
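The "wrap them in your own type" suggestion could look roughly like the sketch below. It only uses base; Pattern, lit, source and compileWith are hypothetical names invented for this illustration, and the compiler is passed in as a parameter (for example mkRegex from Text.Regex) rather than assumed.

```haskell
-- A regex kept as plain source text; nothing is compiled until 'compileWith'.
-- 'Pattern', 'lit', 'source' and 'compileWith' are hypothetical names.
newtype Pattern = Pattern String
  deriving (Eq, Show)

-- Concatenating two patterns concatenates their source text.
instance Semigroup Pattern where
  Pattern a <> Pattern b = Pattern (a ++ b)

instance Monoid Pattern where
  mempty = Pattern ""

-- | Embed a literal string, backslash-escaping regex metacharacters.
lit :: String -> Pattern
lit = Pattern . concatMap escape
  where
    escape c
      | c `elem` "\\.[]()*+?{}|^$" = ['\\', c]
      | otherwise                   = [c]

-- | Recover the source text, which a compiled Regex cannot give back.
source :: Pattern -> String
source (Pattern s) = s

-- | Compile at the last moment with whatever compiler is in use,
-- e.g. @compileWith mkRegex@ (an assumption: any String -> r function works).
compileWith :: (String -> r) -> Pattern -> r
compileWith mk = mk . source

openTag, closeTag :: Pattern
openTag  = Pattern "<([A-Z][A-Z0-9]*)\\b[^>]*>"
closeTag = Pattern "</\\1>"

-- | The Regex -> String -> Regex idea from the question, on uncompiled patterns.
tagWith :: String -> Pattern
tagWith s = openTag <> lit s <> closeTag
```

With this, compileWith mkRegex (tagWith "hello") would build the compiled regex only at the point of use, while source still gives the pattern text back for further merging.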

Related

Haskell, regex, TDFA: match (and remove) quoted substrings

There is a regular expression matching quoted substrings: "/\"(?:[^\"\\]|\\.)*\"/" (originally /"(?:[^"\\]|\\.)*"/, see Here). Tested on regex101, it works.
With TDFA, it is a syntax error:
*** Exception: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died:
parseRegex for Text.Regex.TDFA.String failed:"/"(?:[^"\]|\.)*"/" (line 1, column 4):
unexpected "?"
expecting empty () or anchor ^ or $ or an atom
Is there a way to correct it?
Test string: Is big "problem", no?
Expected result: "problem"
UPD:
This is full context:
removeQuotedSubstrings :: String -> [String]
removeQuotedSubstrings str =
  let quoteds = concat (str =~ ("/\"(?:[^\"\\]|\\.)*\"/" :: String) :: [[String]])
  in quoteds
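TDFA implements POSIX extended regular expressions, which have no (?:...) non-capturing groups, and the surrounding / delimiters are not regex syntax there, so both have to go. What follows is no improvement, just an acceptable solution, albeit lacking in elegance: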
import qualified Data.Text as T
import Text.Regex.TDFA
-- | Removes all double quoted substrings, if any, from a string.
--
-- Examples:
--
-- >>> removeQuotedSubstrings "alfa"
-- "alfa"
-- >>> removeQuotedSubstrings "ngoro\"dup\"lai \"ming\""
-- "ngoro lai "
removeQuotedSubstrings :: String -> String
removeQuotedSubstrings str =
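  -- Each match is [whole match, last group]; keep only the whole matches.
  -- 'take 1' instead of 'head' avoids a crash when the group is empty ("").
  let quoteds = filter ((== "\"") . take 1)
              $ concat (str =~ ("\"(\\.|[^\"\\])*\"" :: String) :: [[String]])
  in T.unpack $ foldr (\quoted acc -> T.replace (T.pack quoted) " " acc)
                      (T.pack str) quoteds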
Yes, the final purpose has always been to remove the quoted substrings.
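Since the goal is removal rather than matching, a regex-free alternative is a small hand-written scanner. This is a different technique from the TDFA solution above, sketched with base only; stripQuoted and skipQuoted are hypothetical names, and the choices to honour backslash escapes and to leave an unterminated quote untouched are assumptions about the desired behaviour.

```haskell
-- | Replace every double-quoted substring with a single space.
-- Backslash escapes inside quotes are honoured; if a quote is never
-- closed, the remainder of the input is returned unchanged.
stripQuoted :: String -> String
stripQuoted [] = []
stripQuoted ('"' : rest) =
  case skipQuoted rest of
    Just after -> ' ' : stripQuoted after
    Nothing    -> '"' : rest            -- no closing quote: keep as-is
stripQuoted (c : rest) = c : stripQuoted rest

-- | Drop characters up to and including the closing quote.
-- Returns Nothing when the input ends before a closing quote is seen.
skipQuoted :: String -> Maybe String
skipQuoted ('\\' : _ : rest) = skipQuoted rest   -- skip an escaped character
skipQuoted ('"' : rest)      = Just rest         -- closing quote found
skipQuoted (_ : rest)        = skipQuoted rest
skipQuoted []                = Nothing
```

On the test string from the question, stripQuoted "Is big \"problem\", no?" replaces the quoted word with one space and keeps everything else.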

Finding permutations using regular expressions

I need to create a regular expression (for a program in Haskell) that matches strings made up of "X" and ".", with exactly four "X" and exactly one ".". It must not match strings with any other combination of Xs and dots.
I have thought about something like
[X\.]{5}
But it also matches "XXXXX" or ".....", so it isn't what I need.
That's called permutation parsing, and while "pure" regular expressions can't parse permutations it's possible if your regex engine supports lookahead. (See this answer for an example.)
However I find the regex in the linked answer difficult to understand. It's cleaner in my opinion to use a library designed for permutation parsing, such as megaparsec.
You use the Text.Megaparsec.Perm module by building a PermParser in a quasi-Applicative style using the <||> operator, then converting it into a regular MonadParsec action using makePermParser.
So here's a parser which recognises any combination of four Xs and one .:
import Control.Applicative
import Data.Ord
import Data.List
import Text.Megaparsec
import Text.Megaparsec.Perm
fourXoneDot :: Parsec Dec String String
fourXoneDot = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
  where mkFive a b c d e = [a, b, c, d, e]
        x = char 'X'
        dot = char '.'
I'm applying the mkFive function, which just stuffs its arguments into a five-element list, to four instances of the x parser and one dot, combined with <||>.
ghci> parse fourXoneDot "" "XXXX."
Right "XXXX."
ghci> parse fourXoneDot "" "XX.XX"
Right "XXXX."
ghci> parse fourXoneDot "" "XX.X"
Left {- ... -}
This parser always returns "XXXX." because that's the order I combined the parsers in: I'm mapping mkFive over the five parsers and it doesn't reorder its arguments. If you want the permutation parser to return its input string exactly, the trick is to track the current position within the component parsers, and then sort the output.
fourXoneDotSorted :: Parsec Dec String String
fourXoneDotSorted = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
  where mkFive a b c d e = map snd $ sortBy (comparing fst) [a, b, c, d, e]
        x = withPos (char 'X')
        dot = withPos (char '.')
        withPos = liftA2 (,) getPosition
ghci> parse fourXoneDotSorted "" "XX.XX"
Right "XX.XX"
As the megaparsec docs note, the implementation of the Text.Megaparsec.Perm module is based on Parsing Permutation Phrases; the idea is described in detail in the paper and the accompanying slides.
The other answers look quite complicated to me, given that there are only five strings in this language. Here's a perfectly fine and very readable regex for this:
\.XXXX|X\.XXX|XX\.XX|XXX\.X|XXXX\.
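The five alternatives in that regex are just the distinct orderings of "XXXX.", so they can be generated or checked without any regex engine. A minimal sketch using only Data.List follows; fourXOneDot, isFourXOneDot and asRegex are hypothetical names for this illustration.

```haskell
import Data.List (intercalate, nub, permutations)

-- | All distinct orderings of four Xs and one dot (five strings).
fourXOneDot :: [String]
fourXOneDot = nub (permutations "XXXX.")

-- | Direct check without a regex: length five, exactly one dot,
-- and no characters other than 'X' and '.'.
isFourXOneDot :: String -> Bool
isFourXOneDot s =
  length s == 5
    && length (filter (== '.') s) == 1
    && all (`elem` "X.") s

-- | Rebuild an alternation like the one above, escaping each dot.
-- The order of the alternatives follows 'permutations' and may differ.
asRegex :: String
asRegex = intercalate "|" (map (concatMap esc) fourXOneDot)
  where
    esc '.' = "\\."
    esc c   = [c]
```

For a language this small, isFourXOneDot is probably the simplest option when a plain predicate is enough; asRegex is only useful if the pattern must be handed to a regex engine.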
Are you attached to regex, or did you just end up at regex because this was a question you didn't want to try answering with applicative parsers?
Here's the simplest possible attoparsec implementation I can think of:
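import Data.Attoparsec.Text (Parser, count, inClass, satisfy)  -- assumes Text input
import Data.List (sort)

parseDotXs :: Parser ()
parseDotXs = do
  dotXs <- count 5 (satisfy (inClass ".X"))
  -- '.' sorts before 'X', so after sorting the dots come first
  let (dots, xS) = span (== '.') . sort $ dotXs
  if length dots == 1 && length xS == 4
    then return ()
    else fail "Mismatch between dots and Xs"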
You may need to adjust slightly depending on your input type.
There are tons of fancy ways to do stuff in applicative parsing land, but there is no rule saying you can't just do things the rock-stupid simple way.
Try the following regex:
(?<=^| )(?=[^. ]*\.)(?=(?:[^X ]*X){4}).{5}(?=$| )
Demo here
If you have one word per string, you can simplify the regex to this one:
^(?=[^. \n]*\.)(?=(?:[^X \n]*X){4}).{5}$
Demo here

matching exact string in Ocaml using regex

How do I find an exact match using a regular expression in OCaml? For example, I have code like this:
let contains s1 s2 =
  let re = Str.regexp_string s2
  in
  try ignore (Str.search_forward re s1 0); true
  with Not_found -> false
where s2 is "_X_1" and s1 takes values like "A_1_X_1", "A_1_X_2", and so on. The aim is to match only when s1 is "A_1_X_1", but the current code also finds a match when s1 is "A_1_X_10", "A_1_X_11", "A_1_X_100", etc.
I tried "[_x_1]" and "[_X_1]$" as s2 instead of "_X_1", but neither seems to work. Can somebody suggest what might be wrong?
You can use the $ metacharacter to match the end of the line (which, assuming the string doesn't contain multiple lines, is the end of the string). But you can't put that through Str.regexp_string; that just escapes the metacharacters. You should first quote the actual substring part, then append the $, and then make a regexp from that:
let endswith s1 s2 =
  let re = Str.regexp (Str.quote s2 ^ "$")
  in
  try ignore (Str.search_forward re s1 0); true
  with Not_found -> false
Str.match_end is what you need:
let ends_with patt str =
  let open Str in
  let re = regexp_string patt in
  try
    let len = String.length str in
    ignore (search_backward re str len);
    match_end () = len
  with Not_found -> false
With this definition, the function works as you require:
# ends_with "_X_1" "A_1_X_10";;
- : bool = false
# ends_with "_X_1" "A_1_X_1";;
- : bool = true
# ends_with "_X_1" "_X_1";;
- : bool = true
# ends_with "_X_1" "";;
- : bool = false
A regex will match anywhere in the input, so the behaviour you see is normal.
You need to anchor your regex: ^_X_1$.
Also, [_x_1] will not help: [...] is a character class, here you ask the regex engine to match a character which is x, 1 or _.

Find all capturing groups of a regular expression

I am looking for a Haskell function that returns the capturing groups of all matches of a given regex.
I have been looking at Text.Regex, but couldn't find anything there.
Now I am using this workaround which seems to work:
import Text.Regex
findNext :: String -> Maybe (String, String, String, [String]) -> [[String]]
findNext pattern Nothing = []
findNext pattern (Just (_, _, rest, matches)) =
  case matches of
    [] -> findNext pattern res
    _  -> [matches] ++ findNext pattern res
  where res = matchRegexAll (mkRegex pattern) rest

findAll :: String -> String -> [[String]]
findAll pattern str = findNext pattern (Just ("", "", str, []))
Result:
findAll "x(.)x(.)" "aaaxAxaaaxBxaaaxCx"
[["A","a"],["B","a"]]
Question:
Did I miss something in Text.Regex?
Is there a Haskell regex library that implements a findAll function?
You can use the =~ operator from Text.Regex.Posix:
Prelude> :mod + Text.Regex.Posix
Prelude Text.Regex.Posix> "aaaxAxaaaxBxaaaxCx" =~ "x(.)x(.)" :: [[String]]
[["xAxa","A","a"],["xBxa","B","a"]]
Note the explicit [[String]] type. Try replacing it with Bool, Int, String and see what happens. All types that you can use in this context are listed here. Also see this tutorial.

String regex matching in Erlang

How would I do regex matching in Erlang?
All I know is this:
f("AAPL" ++ Inputstring) -> true.
The lines that I need to match
"AAPL,07-May-2010 15:58,21.34,21.36,21.34,21.35,525064\n"
In Perl regex: ^AAPL,* (or something similar)
In Erlang?
Use the re module, e.g.:
...
String = "AAPL,07-May-2010 15:58,21.34,21.36,21.34,21.35,525064\n",
RegExp = "^AAPL,*",
case re:run(String, RegExp) of
    {match, Captured} -> ... ;
    nomatch -> ...
end,
...