Distinguish empty regexp matches from no matches in Haskell - regex

I'm trying to use regex-pcre, but regex-base contains so many overloads for RegexContext that I don't know which one I should use for the task at hand.
I want to match a string against the regular expression (foo)-(bar)|(quux)-(quux)(q*u*u*x*) in the following way:
myMatch :: String -> Maybe (String, String, Maybe String)
Sample output:
myMatch "dfjdjk" should be Nothing as there is no match
myMatch "foo-bar" should be Just ("foo", "bar", Nothing) as there's no third capture group in the first alternative
myMatch "quux-quuxqu" should be Just ("quux", "quux", Just "qu")
myMatch "quux-quux" should be Just ("quux", "quux", Just "") as the third capture group is present but empty
It's not an assignment; I'm just baffled that the examples at https://github.com/erantapaa/haskell-regexp-examples/blob/master/RegexExamples.hs don't contain code paths for situations where there are no matches or no capture groups.

A way of achieving it is using getAllTextSubmatches:
import Text.Regex.PCRE
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = case getAllTextSubmatches $ str =~ "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)" :: [String] of
  [] -> Nothing
  [_, g1, g2, "", "", ""] -> Just (g1, g2, Nothing)
  [_, "", "", g3, g4, g5] -> Just (g3, g4, Just g5)
When getAllTextSubmatches has [String] as return type, it returns an empty list if there is no match, or a list with all capturing groups (where index 0 is the whole match) of the first match.
Alternatively, if a matched group may be empty and you cannot pattern match on the empty string, you can use [(String, (MatchOffset, MatchLength))] as return type of getAllTextSubmatches and pattern match MatchOffset with -1 to identify unmatched groups:
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = case getAllTextSubmatches $ str =~ "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)" :: [(String, (MatchOffset, MatchLength))] of
  [] -> Nothing
  [_, (g1, _), (g2, _), (_, (-1, _)), (_, (-1, _)), (_, (-1, _))] -> Just (g1, g2, Nothing)
  [_, (_, (-1, _)), (_, (-1, _)), (g3, _), (g4, _), (g5, _)] -> Just (g3, g4, Just g5)
Now, if that looks too verbose:
{-# LANGUAGE PatternSynonyms #-}
pattern NoMatch = ("", (-1, 0))
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = case getAllTextSubmatches $ str =~ "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)" :: [(String, (MatchOffset, MatchLength))] of
  [] -> Nothing
  [_, (g1, _), (g2, _), NoMatch, NoMatch, NoMatch] -> Just (g1, g2, Nothing)
  [_, NoMatch, NoMatch, (g3, _), (g4, _), (g5, _)] -> Just (g3, g4, Just g5)
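The offset check above can also be folded into a small helper that turns every submatch into a Maybe, so the case expression no longer spells out -1 by hand. This is only a sketch: Submatch, captured and toResult are hypothetical names that are not part of regex-base, and it assumes (as stated above) that an unmatched group is reported with offset -1 and that MatchOffset and MatchLength are synonyms for Int. It uses nothing outside base; the submatch list would come from the getAllTextSubmatches call shown earlier.

```haskell
-- Hypothetical helpers; none of these names come from regex-base.
-- Assumes MatchOffset and MatchLength are type synonyms for Int.
type Submatch = (String, (Int, Int))

-- An unmatched group is assumed to carry offset -1; a matched but
-- empty group has a real (non-negative) offset and length 0.
captured :: Submatch -> Maybe String
captured (_, (-1, _)) = Nothing
captured (text, _)    = Just text

-- Interpret the six submatches (whole match plus five groups) of
-- (foo)-(bar)|(quux)-(quux)(q*u*u*x*); an empty list means "no match".
toResult :: [Submatch] -> Maybe (String, String, Maybe String)
toResult subs = case map captured subs of
  [_, Just g1, Just g2, Nothing, Nothing, Nothing] -> Just (g1, g2, Nothing)
  [_, Nothing, Nothing, Just g3, Just g4, g5]      -> Just (g3, g4, g5)
  _                                                -> Nothing
```

Under these assumptions, myMatch would reduce to toResult applied to the [(String, (MatchOffset, MatchLength))] result, and the catch-all case makes the function total.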

To distinguish when there is no match, use =~~ so that it will place the result in a Maybe monad. It will use fail to return Nothing if there are no matches.
myMatch :: String -> Maybe (String, String, Maybe String)
myMatch str = do
  let regex = "(foo)-(bar)|(quux)-(quux)(q*u*u*x*)"
  groups <- getAllTextSubmatches <$> str =~~ regex :: Maybe [String]
  case groups of
    [_, g1, g2, "", "", ""] -> Just (g1, g2, Nothing)
    [_, "", "", g3, g4, g5] -> Just (g3, g4, Just g5)

Use regex-applicative:
{-# LANGUAGE OverloadedStrings #-}
import Text.Regex.Applicative
myMatch = match re
re = foobar <|> quuces where
  foobar = (,,) <$> "foo" <* "-" <*> "bar" <*> pure Nothing
  quuces = (,,)
    <$> "quux" <* "-"
    <*> "quux"
    <*> (fmap (Just . mconcat) . sequenceA)
      [many $ sym 'q', many $ sym 'u', many $ sym 'u', many $ sym 'x']
or, with ApplicativeDo,
re = foobar <|> quuces where
  foobar = do
    foo <- "foo"
    _ <- "-"
    bar <- "bar"
    pure (foo, bar, Nothing)
  quuces = do
    quux1 <- "quux"
    _ <- "-"
    quux2 <- "quux"
    quux3 <- fmap snd . withMatched $
      traverse (many . sym) ("quux" :: [Char])
      -- [many $ sym 'q', many $ sym 'u', many $ sym 'u', many $ sym 'x']
    pure (quux1, quux2, Just quux3)

Related

Haskell, regex, TDFA: match (and remove) quoted substrings

There is a regular expression matching quoted substrings: "/\"(?:[^\"\\]|\\.)*\"/" (originally /"(?:[^"\\]|\\.)*"/, see Here). Tested on regex101, it works.
With TDFA, it fails with a syntax error:
*** Exception: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died:
parseRegex for Text.Regex.TDFA.String failed:"/"(?:[^"\]|\.)*"/" (line 1, column 4):
unexpected "?"
expecting empty () or anchor ^ or $ or an atom
Is there a way to correct it?
Test string: Is big "problem", no?
Expected result: "problem"
UPD:
This is full context:
removeQuotedSubstrings :: String -> [String]
removeQuotedSubstrings str =
  let quoteds = concat (str =~ ("/\"(?:[^\"\\]|\\.)*\"/" :: String) :: [[String]])
  in quoteds
No improvement, just an acceptable solution, albeit lacking in elegance:
import qualified Data.Text as T
import Text.Regex.TDFA
-- | Removes all double quoted substrings, if any, from a string.
--
-- Examples:
--
-- >>> removeQuotedSubstrings "alfa"
-- "alfa"
-- >>> removeQuotedSubstrings "ngoro\"dup\"lai \"ming\""
-- "ngoro lai "
removeQuotedSubstrings :: String -> String
removeQuotedSubstrings str =
  let quoteds = filter (('"' ==) . head)
                  $ concat (str =~ ("\"(\\.|[^\"\\])*\"" :: String) :: [[String]])
  in T.unpack $ foldr (\quoted acc -> T.replace (T.pack quoted) " " acc)
                      (T.pack str) quoteds
Yes, the final purpose has always been to remove the quoted substrings.
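For comparison, the same removal can be written without a regex at all. The following is a hand-written scanner, not the TDFA approach used above, and removeQuoted and skipQuoted are hypothetical names. It assumes the same rules as the pattern in the working version: inside quotes a backslash escapes the next character, each quoted substring is replaced by a single space, and an unterminated quote is left untouched. It needs only base.

```haskell
-- Hypothetical regex-free alternative; not the TDFA solution above.
-- Replaces every double-quoted substring with a single space.
removeQuoted :: String -> String
removeQuoted [] = []
removeQuoted ('"' : rest) =
  case skipQuoted rest of
    Just after -> ' ' : removeQuoted after  -- drop the quoted part
    Nothing    -> '"' : removeQuoted rest   -- no closing quote: keep it
removeQuoted (c : rest) = c : removeQuoted rest

-- Skip to just past the closing quote, treating a backslash as an
-- escape for the next character. Nothing means no closing quote.
skipQuoted :: String -> Maybe String
skipQuoted ('\\' : _ : rest) = skipQuoted rest
skipQuoted ('"' : rest)      = Just rest
skipQuoted (_ : rest)        = skipQuoted rest
skipQuoted []                = Nothing
```

A single pass like this avoids building the intermediate list of quoted substrings and the repeated Text replacements, at the cost of hard-coding the quoting rules.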

Scala: string pattern matching and splitting

I am new to Scala and want to create a function to split Hello123 or Hello 123 into two strings as follows:
val string1 = 123
val string2 = Hello
What is the best way to do it? I have attempted to use regex matching with \\d and \\D, but I am not sure how to write the full function.
Regards
You may split on 0+ whitespaces (\s*+) that are preceded by letters and followed by digits:
var str = "Hello123"
val res = str.split("(?<=[a-zA-Z])\\s*+(?=\\d)")
println(res.deep.mkString(", ")) // => Hello, 123
Pattern details:
(?<=[a-zA-Z]) - a positive lookbehind that only checks (but does not consume the matched text) if there is an ASCII letter before the current position in the string
\\s*+ - matches (consumes) zero or more whitespace characters possessively, i.e. without giving any of them back through backtracking
(?=\\d) - a positive lookahead; this check is performed only once, after the whitespaces (if any) were matched, and it requires a digit to appear right after the current position in the string.
Based on the given strings, I assume you have to match letters and a number with any number of spaces in between.
Here is the regex for that:
([a-zA-Z]+)\\s*(\\d+)
Now create a regex object using .r
"([a-zA-Z]+)\\s*(\\d+)".r
Scala REPL
scala> val regex = "([a-zA-Z]+)\\s*(\\d+)".r
scala> val regex(a, b) = "hello 123"
a: String = "hello"
b: String = "123"
scala> val regex(a, b) = "hello123"
a: String = "hello"
b: String = "123"
Function to handle pattern matching safely
pattern match with extractors
str match {
case regex(a, b) => Some(a -> b.toInt)
case _ => None
}
Here is the function which does Regex with Pattern matching
def matchStr(str: String): Option[(String, Int)] = {
val regex = "([a-zA-Z]+)\\s*(\\d+)".r
str match {
case regex(a, b) => Some(a -> b.toInt)
case _ => None
}
}
Scala REPL
scala> def matchStr(str: String): Option[(String, Int)] = {
val regex = "([a-zA-Z]+)\\s*(\\d+)".r
str match {
case regex(a, b) => Some(a -> b.toInt)
case _ => None
}
}
defined function matchStr
scala> matchStr("Hello123")
res41: Option[(String, Int)] = Some(("Hello", 123))
scala> matchStr("Hello 123")
res42: Option[(String, Int)] = Some(("Hello", 123))

OCaml regexp "any" matching, where "]" is one of the characters

I'd like to match a string containing any of the characters "a" through "z", or "[" or "]", but nothing else. The regexp should match
"b"
"]abc["
"ab[c"
but not these
"2"
"(abc)"
I tried this:
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\]]*$") s 0;;
content_check "]abc[";;
and got warned that the "escape" before the "]" was illegal, although I'm pretty certain that the equivalent in, say, sed or awk would work fine.
Anyhow, I tried un-escaping the bracket, but
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[]]*$") s 0;;
doesn't work at all. As I read it, it should match any of a-z or "[", then the first "]" closes the "any" selection, after which there may be any number of "]"s. So it should match
[abc]]]]
but not
]]]abc[
In practice, that's not what happens at all; I get the following:
# let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-zA-Z[]]*$") s 0;;
content_check "]abc[";;
content_check "[abc]]]";;
content_check "]abc[";;
val content_check : string -> bool = <fun>
# - : bool = false
# - : bool = false
# - : bool = false
Can anyone explain/suggest an alternative?
@Tim Pietzker's suggestion sounded really good, but appears not to work:
# #load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\\]]*$") s 0;;
content_check "]abc[";;
# val content_check : string -> bool = <fun>
# - : bool = false
#
nor does it work when I double-escape the "[" in the pattern, just in case. :(
Indeed, here's a MWE:
#load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "[\\]]") s 0;;
content_check "]";; (* should be true *)
This is not going to really answer your question, but it will solve your problem. With the re library:
let re_set = Re.(rep (* "rep" is the star *) @@ alt [
  rg 'a' 'z' ; (* the range from a to z *)
  set "[]" ; (* the set composed of [ and ] *)
])
(* version that matches the whole text *)
let re = Re.(compile @@
  seq [ start ; re_set ; stop ])
let content_check s =
Printf.printf "%s : %b\n" s (Re.execp re s)
let () =
List.iter content_check [
"]abc[" ;
"[abc]]]" ;
"]abc[" ;
"]abc[" ;
"abc##"
]
As you noticed, str from the stdlib is awkward, to put it mildly. re is a very good alternative, and it comes with various regexp syntaxes and combinators (which I tend to use, because I think they are easier to use than regexp syntax).
I'm an idiot. (But perhaps the designers of Str weren't so clever either.)
From the "Str" documentation: "To include a ] character in a set, make it the first character in the set."
With this, it's not so clear how to search for "anything except a ]", since you'd have to place the "^" in front of it. Sigh.
:(

Replace a few substrings in Scala

Suppose I need to replace a few patterns in a string:
val rules = Map("abc" -> "123", "d" -> "4", "efg" -> "5") // string to string
def replace(input: String, rules: Map[String, String]): String = {...}
replace("xyz", rules) // returns xyz
replace("abc123", rules) // returns 123123
replace("dddxyzefg", rules) // returns 444xyz5
How would you implement replace in Scala? How would you generalize the solution for rules: Map[Regex, String]?
It's probably easier just to go straight to the general case:
val replacements = Map("abc".r -> "123", "d".r -> "4", "efg".r -> "5")
val original = "I know my abc's AND my d's AND my efg's!"
val replaced = replacements.foldLeft(original) { (s, r) => r._1.replaceAllIn(s, r._2) }
replaced: String = I know my 123's AND my 4's AND my 5's!

Find all capturing groups of a regular expression

I am looking for a Haskell function that returns the capturing groups of all matches of a given regex.
I have been looking at Text.Regex, but couldn't find anything there.
Now I am using this workaround which seems to work:
import Text.Regex
findNext :: String -> Maybe (String, String, String, [String] ) -> [ [String] ]
findNext pattern Nothing = []
findNext pattern (Just (_, _, rest, matches) ) =
  case matches of
    [] -> (findNext pattern res)
    _ -> [matches] ++ (findNext pattern res)
  where res = matchRegexAll (mkRegex pattern) rest
findAll :: String -> String -> [ [String] ]
findAll pattern str = findNext pattern (Just ("", "", str, [] ) )
Result:
findAll "x(.)x(.)" "aaaxAxaaaxBxaaaxCx"
[["A","a"],["B","a"]]
Question:
Did I miss something in Text.Regex?
Is there a Haskell regex library that implements a findAll function?
You can use the =~ operator from Text.Regex.Posix:
Prelude> :mod + Text.Regex.Posix
Prelude Text.Regex.Posix> "aaaxAxaaaxBxaaaxCx" =~ "x(.)x(.)" :: [[String]]
[["xAxa","A","a"],["xBxa","B","a"]]
Note the explicit [[String]] type. Try replacing it with Bool, Int, String and see what happens. All types that you can use in this context are listed here. Also see this tutorial.
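The [[String]] result puts the whole match first in each inner list. If only the capture groups are wanted, as in the findAll output from the question, dropping the first element of each inner list should be enough. A minimal sketch using only base; captureGroups is a hypothetical name, not a function from any regex package.

```haskell
-- Hypothetical helper: keep only the capture groups of each match.
-- Each inner list is assumed to be [wholeMatch, group1, group2, ...].
captureGroups :: [[String]] -> [[String]]
captureGroups = map (drop 1)
```

Applied to the list printed above, this gives [["A","a"],["B","a"]], the same shape as the findAll workaround.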