Can't capture a group in a string - regex

I want to capture a group in a string:
import Text.Regex.Posix
"somestring; somestring2=\"(.*?)\"" =~ "somestring; somestring2=\"my_super_string123\"" :: String
It returns an empty string "", as opposed to my_super_string123 which I expect. I've tried ::[String] and ::[[String]] and, obviously, they were empty. Your suggestions?

The problem is that you have your string and your pattern swapped. You also will want to have the return type be [[String]]:
> "somestring; somestring2=\"my_super_string123\"" =~ "somestring; somestring2=\"(.*)\"" :: [[String]]
[["somestring; somestring2=\"my_super_string123\"", "my_super_string123"]]
Note that I had to remove the ? from the .*? part of the pattern. This is because POSIX doesn't support the lazy quantifier *?. You'll have to select both of the POSIX flavors from the drop-downs to see, but it says neither supports lazy quantifiers. It's also recommended to use a negated character class instead of laziness, since that avoids the backtracking a lazy quantifier needs and so improves performance. To do this, you'd change your pattern to
"somestring; somestring2=\"([^\"]*)\""
To clarify, here's the output from my GHCi:
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: [[String]]
[["s1; s2=\"my_super_string123\"","my_super_string123"]]
it :: [[String]]
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: String
"s1; s2=\"my_super_string123\""
it :: String
As you can see, with the return type as String, it returns whatever text matches the entire pattern, not just the capturing groups. Use [[String]] when you want to get the contents of the individual capturing groups.
I edited the contents of the string so that it would fit without having to scroll horizontally, just for illustrative purposes.
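If only the first capture group is needed, one option is to ask =~ for the four-tuple result type (text before the match, the match itself, text after it, and the list of capture groups). The following is a minimal sketch, assuming the regex-posix package is installed; firstGroup is a hypothetical helper name, not something the library provides.

```haskell
import Text.Regex.Posix ((=~))

-- | Hypothetical helper: contents of the first capture group, or Nothing
-- when the pattern does not match (or contains no capture group).
firstGroup :: String -> String -> Maybe String
firstGroup input pat =
  case (input =~ pat :: (String, String, String, [String])) of
    (_, _, _, g : _) -> Just g   -- at least one group was captured
    _                -> Nothing  -- no match: the group list is empty
```

With the strings from the GHCi session above, firstGroup "s1; s2=\"my_super_string123\"" "s1; s2=\"([^\"]*)\"" should evaluate to Just "my_super_string123".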

Related

Replace regex match by arbitrary function of match itself

I'm trying to write a function of type Text -> (Text -> Text) -> Text that replaces occurrences of a regular expression in a piece of text by something else that is a function of what the regular expression has matched. There is subRegex from Text.Regex, but this only allows replacing a match with some fixed replacement string, whereas I would like the replacement to be an arbitrary function of the match. Is there a package that already implements something like that?
You can use matchRegexAll
matchRegexAll
    :: Regex     -- ^ The regular expression
    -> String    -- ^ The string to match against
    -> Maybe ( String, String, String, [String] )
                 -- ^ Returns: 'Nothing' if the match failed, or:
                 --
                 -- > Just ( everything before match,
                 -- >        portion matched,
                 -- >        everything after the match,
                 -- >        subexpression matches )
For example:
subFirst :: Regex -> String -> (String -> String) -> String
subFirst rx input f = case matchRegexAll rx input of
    Nothing -> input
    Just (pre, match, post, _) -> pre <> f match <> post
If you want to do this for all matches rather than just the first, you can call this function recursively on the remainder post (left as an exercise).
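One way the recursive version could look is sketched below. It assumes the regex-compat package (which provides Text.Regex) and a base library recent enough to export <> from the Prelude; subAll is a hypothetical name. The extra case for empty matches is an addition of this sketch: without it, a pattern such as x* that can match the empty string would never consume input and the recursion would not terminate.

```haskell
import Text.Regex (Regex, matchRegexAll)

-- | Hypothetical helper: apply f to every non-overlapping match of rx.
subAll :: Regex -> String -> (String -> String) -> String
subAll rx input f = case matchRegexAll rx input of
  Nothing -> input
  Just (pre, match, post, _)
    | null match -> case post of
        -- Empty match: copy one character so the recursion makes progress.
        c : rest -> pre <> f match <> [c] <> subAll rx rest f
        []       -> pre <> f match
    | otherwise -> pre <> f match <> subAll rx post f
```

For example, subAll (mkRegex "[0-9]+") "a1b22c" (\m -> "<" ++ m ++ ">") gives "a<1>b<22>c" (mkRegex also comes from Text.Regex). Note that each recursive call treats the remainder as a fresh string, so an anchor such as ^ can match again at the start of every remainder.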
For a different approach, it looks like the text-regex-replace package might be of use to you. It works directly on Text rather than String, and it appears to have the capability of arbitrary replacement functions (however the usage seems a bit obtuse).
If you’re willing to write your pattern matching function as a parser instead of a regular expression, then the function Replace.Megaparsec.streamEdit with the match combinator has the signature you’re looking for.
Here’s a usage example in the README

Multiple regex matches not a [String] in Haskell

I'm looking at the RWH tutorial here, which suggests using [String] to return multiple results, but that usage produces an error. These expressions:
"I'd like to group by word breaks" =~ "\\S+" :: Bool
"I'd like to group by word breaks" =~ "\\S+" :: String
"I'd like to group by word breaks" =~ "\\S+" :: [[String]]
produce
True
"I'd"
[["I'd"],["like"],["to"],["group"],["by"],["word"],["breaks"]]
respectively.
But the recommended [String] does not work, and instead produces an error:
"I'd like to group by word breaks" =~ "\\S+" :: [String]
<interactive>:1:1: error:
• No instance for (RegexContext Regex String [String]) arising from a use of ‘=~’
• In the expression: "I'd like to group by word breaks" =~ "\\S+" :: [String]
In an equation for ‘it’: it = "I'd like to group by word breaks" =~ "\\S+" :: [String]
How can I request a [String]-like result type that gives me what I'm looking for, namely:
["I'd","like","to","group","by","word","breaks"]
without having to post-process? Also, it's interesting that this seems natural in the context of the other successful type conversions, and it even worked this way at one point when the book was written, and no longer does. What is the explanation for the change?
Judging from the comments on that page, the recommendations are either:
Prelude Text.Regex.PCRE> getAllTextMatches ("I'd like to group by word breaks" =~ "\\S+") :: [String]
["I'd","like","to","group","by","word","breaks"]
or
Prelude Text.Regex.PCRE> concat $ "I'd like to group by word breaks" =~ "\\S+" :: [String]
["I'd","like","to","group","by","word","breaks"]
Neither of these is as clean as how this used to work.
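If the wrapper is needed in several places, it can be hidden behind a small function. This is only a sketch, assuming regex-pcre is installed; allMatches is a hypothetical name. It also makes the result type explicit, which the one-liners above leave to type inference.

```haskell
import Text.Regex.PCRE ((=~), AllTextMatches (getAllTextMatches))

-- | Hypothetical helper: every whole match of pat in input, as a flat list.
allMatches :: String -> String -> [String]
allMatches input pat =
  getAllTextMatches (input =~ pat :: AllTextMatches [] String)
```

allMatches "I'd like to group by word breaks" "\\S+" then yields the flat list of words. One difference between the two recommendations: getAllTextMatches returns only the whole matches, while concat over [[String]] would also include the text of any capture groups if the pattern contained them.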

extract string between two substrings in Haskell

I wanted to adapt the Python regex (PCRE) technique in this SO question, Find string between two substrings, so that I can do the same in Haskell.
But I can't figure out how to make it work in GHC (8.2.1). I've installed the library with cabal install regex-pcre, and came up with the following test code after some searching:
import Text.Regex.PCRE
s = "+++asdf=5;iwantthis123jasd---"
result = (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
I was hoping to get the first and last instance of the middle string
iwantthis
But I can't get the result right:
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
I haven't used regex or pcre in Haskell before.
Can someone help with the right usage (to extract the first and last occurrence)?
Also, I don't quite understand the ::[[String]] usage here. What does it do and why is it necessary?
I searched the documentation but found no mention of the usage with type conversion to :: [[String]].
The result you obtain is the following:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
This is correct: the first element is the implicit capture group 0 (the entire match), and the second element is capture group 1 (the one that matches (.*)). The match covers this input:
+++asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd---
So it still matches between an asdf=5; and a 123jasd part, only it runs from the first asdf=5; to the last 123jasd.
This is due to the fact that the Kleene star * matches greedily: it aims to capture as much as possible. You can, however, use (.*?) to get a non-greedy quantifier:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd","iwantthis"],["asdf=5;iwantthis123jasd","iwantthis"]]
And now we obtain two matches. Each match has "iwantthis" as capture group 1.
You can use map (head . tail) or map (!!1) on it to obtain a list of captures of the (.*?) part:
Prelude Text.Regex.PCRE> map (!!1) ((s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]])
["iwantthis","iwantthis"]
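Since the question asks for the first and last occurrence, the list of captures can be reduced one step further. Below is a minimal sketch, assuming regex-pcre; captures and firstAndLast are hypothetical helper names. The list comprehension plays the role of map (!!1) but skips any match without a first group instead of failing on it.

```haskell
import Text.Regex.PCRE ((=~))

-- | Hypothetical helper: the text of capture group 1 for every match.
captures :: String -> String -> [String]
captures input pat = [g | (_ : g : _) <- (input =~ pat :: [[String]])]

-- | Hypothetical helper: first and last element of a non-empty list.
firstAndLast :: [a] -> Maybe (a, a)
firstAndLast []         = Nothing
firstAndLast xs@(x : _) = Just (x, last xs)
```

With s defined as in the question, firstAndLast (captures (s ++ s) "asdf=5;(.*?)123jasd") should give Just ("iwantthis","iwantthis").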

Haskell and Regex with Intersections

I am using regex with Haskell along with Text.Regex.PCRE and in my case I have:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" :: AllTextMatches [] String)
[]
I am expecting some values to be returned, but the list is empty. However, this returns what is expected:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X])([A-Z])([A-Z])(\\d{2,10})" :: AllTextMatches [] String)
["32UMU1078"]
So if I remove the intersections like &&[^IO] there are no problems.
As I just discovered, PCRE doesn't support intersections. Is there an alternative Haskell library that supports them?
PCRE does not support character class intersection/subtraction.
However, you may work around it with negative lookaheads and other methods.
Here, replace "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" with
"(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^
That is, replace the subtractions with lookaheads, [C-X&&[^IO]] -> (?![IO])[C-X].
Another way, that is more verbose, is to spell out the character classes:
"(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"
So, [C-X] that does not match I and O must be written as [C-HJ-NP-X].
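The two workarounds can be compared side by side. This is a sketch assuming regex-pcre; the names lookaheadPat, explicitPat and groupsOf are made up for illustration.

```haskell
import Text.Regex.PCRE ((=~))

-- The two patterns suggested above: negative lookaheads vs. spelled-out ranges.
lookaheadPat, explicitPat :: String
lookaheadPat = "(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
explicitPat  = "(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"

-- | Hypothetical helper: capture groups of the first match, or [] if none.
groupsOf :: String -> String -> [String]
groupsOf pat input = case (input =~ pat :: [[String]]) of
  (_ : groups) : _ -> groups  -- drop group 0, the whole match
  _                -> []
```

Both groupsOf lookaheadPat "32UMU1078" and groupsOf explicitPat "32UMU1078" should give ["32","U","M","U","1078"], while an input containing an excluded letter, such as "32UIU1078", gives [] for either pattern.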

Grouping in haskell regular expressions

How can I extract a string using regular expressions in Haskell?
let x = "xyz abc" =~ "(\\w+) \\w+" :: String
That doesn't even get a match
let x = "xyz abc" =~ "(.*) .*" :: String
That does, but x ends up as "xyz abc". How do I extract only the first regex group so that x is "xyz"?
I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.
In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is
RegexContext a b (AllTextMatches (Array Int) (MatchText b))
type MatchText source = Array Int (source, (MatchOffset, MatchLength))
which can be used to get a MatchText for everything:
let x :: Array Int (MatchText String)
    x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"
At which point x is an Array Int of matches of an Array Int of group-matches.
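To make the nested arrays easier to inspect, they can be flattened into plain lists. The following is a sketch only, assuming regex-pcre and the array package; matchDetails is a hypothetical name, and the offsets and lengths are written as Int on the assumption that MatchOffset and MatchLength are synonyms for it.

```haskell
import Data.Array (Array, elems)
import Text.Regex.PCRE ((=~), AllTextMatches (getAllTextMatches), MatchText)

-- | Hypothetical helper: for every match, each group's text together with
-- its (offset, length) in the input. Group 0 is the whole match.
matchDetails :: String -> String -> [[(String, (Int, Int))]]
matchDetails input pat = map elems (elems ms)
  where
    ms :: Array Int (MatchText String)
    ms = getAllTextMatches (input =~ pat)
```

For "xyz abc" and "(\\w+) \\w+" this should produce one match holding two entries: the whole match "xyz abc" at offset 0 with length 7, and the group "xyz" at offset 0 with length 3.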
Note that "\w" is Perl syntax so you need regex-pcre to access it. If you want Unix/Posix extended regular expressions you should use regex-tdfa which is cross-platform and avoid using regex-posix that hits each platform's bugs in implementing the regex.h library.
Note that Perl vs Posix is not just a matter of syntax like "\w". They use very different algorithms and often return different results. Also, the time and space complexity are very different. For matching against a string of length 'n' Perl style (regex-pcre) can be O(exp(n)) in time while Posix style using regex-posix is always O(n) in time.
Cast the result as [[String]]. Then you'll get a list of matches, each being the list of matched text and the captured subgroups.
Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]
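The same extraction can be written without Perl syntax, along the lines of the regex-tdfa recommendation above. This sketch assumes the regex-tdfa package; it swaps \w for the POSIX bracket expression [[:alnum:]_], and wordPair and firstGroups are hypothetical names.

```haskell
import Text.Regex.TDFA ((=~))

-- POSIX spelling of "(\\w+) \\w+": a word, a space, another word.
wordPair :: String
wordPair = "([[:alnum:]_]+) [[:alnum:]_]+"

-- | Hypothetical helper: the first capture group of every match.
firstGroups :: String -> [String]
firstGroups input = [g | (_ : g : _) <- (input =~ wordPair :: [[String]])]
```

firstGroups "xyz abc more text" should then give ["xyz","more"], matching the first groups in the PCRE output above.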