Grouping in Haskell regular expressions

How can I extract a string using regular expressions in Haskell?
let x = "xyz abc" =~ "(\\w+) \\w+" :: String
That doesn't even get a match.
let x = "xyz abc" =~ "(.*) .*" :: String
That does, but x ends up as "xyz abc". How do I extract only the first regex group so that x is "xyz"?

I wrote/maintain such packages as regex-base, regex-pcre, and regex-tdfa.
In regex-base the Text.Regex.Base.Context module documents the large number of instances of RegexContext that =~ uses. These are implemented on top of RegexLike which provides the underlying way to call matchText and matchAllText.
The [[String]] that KennyTM mentions is another instance of RegexContext, and may or may not be one that works best for you. A comprehensive instance is
RegexContext a b (AllTextMatches (Array Int) (MatchText b))
type MatchText source = Array Int (source, (MatchOffset, MatchLength))
which can be used to get a MatchText for everything:
let x :: Array Int (MatchText String)
x = getAllTextMatches $ "xyz abc" =~ "(\\w+) \\w+"
At that point x is an Array Int of matches, each of which is an Array Int of group matches.
Note that "\w" is Perl syntax, so you need regex-pcre to access it. If you want Unix/Posix extended regular expressions, you should use regex-tdfa, which is cross-platform, and avoid regex-posix, which hits each platform's bugs in implementing the regex.h library.
Note that Perl vs Posix is not just a matter of syntax like "\w". They use very different algorithms and often return different results. Also, the time and space complexity are very different. For matching against a string of length 'n' Perl style (regex-pcre) can be O(exp(n)) in time while Posix style using regex-posix is always O(n) in time.

Annotate the result type as [[String]]. You then get a list of matches, each being a list of the matched text followed by the captured subgroups.
Prelude Text.Regex.PCRE> "xyz abc more text" =~ "(\\w+) \\w+" :: [[String]]
[["xyz abc","xyz"],["more text","more"]]
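For comparison, here is a minimal sketch of the same list-of-lists shape in Python, whose re module uses Perl-style syntax such as \w. The helper name all_matches_with_groups is made up for illustration and is not part of any library.

```python
import re

def all_matches_with_groups(pattern, text):
    # One row per match: the whole matched text first, then each capture group.
    return [[m.group(0), *m.groups()] for m in re.finditer(pattern, text)]

rows = all_matches_with_groups(r"(\w+) \w+", "xyz abc more text")

# Keep only capture group 1 of every match.
first_groups = [row[1] for row in rows]
```

Here rows has the same structure as the [[String]] result above, and first_groups picks out the first group of each match.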

Related

extract string between two substrings in Haskell

I wanted to adapt the python regex (PCRE) technique in this SO question Find string between two substrings to Haskell so that I can do the same in Haskell.
But I can't figure out how to make it work in GHC (8.2.1). I've installed the package with cabal install regex-pcre, and came up with the following test code after some searching:
import Text.Regex.PCRE
s = "+++asdf=5;iwantthis123jasd---"
result = (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
I was hoping to get the first and last instance of the middle string
iwantthis
But I can't get the result right:
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
I haven't used regex or pcre in Haskell before.
Can someone help with the right usage (to extract the first and last occurrence) ?
Also, I don't quite understand the ::[[String]] usage here. What does it do and why is it necessary?
I searched the documentation but found no mention of the usage with type conversion to :: [[String]].
The result you obtain is the following:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]
This is correct: the first element is the implicit capture group 0 (the entire match), and the second element is capture group 1 (the one that matches (.*)). It matches like this:
+++asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd---
So it still matches between the asdf=5; and 123jasd part.
This is because the Kleene star * is greedy: it captures as much as possible. You can use (.*?) instead to get a non-greedy quantifier:
Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd","iwantthis"],["asdf=5;iwantthis123jasd","iwantthis"]]
And now we obtain two matches. Each match has "iwantthis" as capture group 1.
You can use map (head . tail) or map (!!1) on it to obtain a list of captures of the (.*?) part:
Prelude Text.Regex.PCRE> map (!!1) ((s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]])
["iwantthis","iwantthis"]
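The greedy versus non-greedy difference is not specific to Haskell. As a rough illustration, the same two patterns behave the same way in Python's re module (with a single capture group, re.findall returns just that group for each match):

```python
import re

s = "+++asdf=5;iwantthis123jasd---"
doubled = s + s

# Greedy: (.*) runs to the LAST "123jasd", so there is one long match.
greedy = re.findall(r"asdf=5;(.*)123jasd", doubled)

# Non-greedy: (.*?) stops at the FIRST "123jasd", so there are two short matches.
lazy = re.findall(r"asdf=5;(.*?)123jasd", doubled)
```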

Haskell and Regex with Intersections

I am using regex with Haskell along with Text.Regex.PCRE and in my case I have:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" :: AllTextMatches [] String)
[]
I am expecting some values to be returned, but the list is empty. However, this returns what is expected:
Prelude Text.Regex.PCRE> getAllTextMatches ("32UMU1078" =~ "(\\d{1,2})([C-X])([A-Z])([A-Z])(\\d{2,10})" :: AllTextMatches [] String)
["32UMU1078"]
So if I remove the intersections like &&[^IO] there are no problems.
As I just discovered, PCRE doesn't support intersections. Is there an alternative Haskell library that supports them?
PCRE does not support character class intersection/subtraction.
However, you may work around it with negative lookaheads and other methods.
Here, replace "(\\d{1,2})([C-X&&[^IO]])([A-Z&&[^IO]])([A-Z&&[^IO]])(\\d{2,10})" with
"(\\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\\d{2,10})"
            ^^^^^^^^^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^
That is, replace the subtractions with lookaheads, [C-X&&[^IO]] -> (?![IO])[C-X].
Another way, that is more verbose, is to spell out the character classes:
"(\\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\\d{2,10})"
So, [C-X] that does not match I and O must be written as [C-HJ-NP-X].
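As a sanity check that the two workarounds are equivalent, here is a small sketch in Python, whose re module also supports negative lookaheads. The variable names and the second test string are made up for illustration.

```python
import re

# Workaround 1: a negative lookahead in front of each letter class.
with_lookahead = re.compile(
    r"(\d{1,2})((?![IO])[C-X])((?![IO])[A-Z])((?![IO])[A-Z])(\d{2,10})"
)

# Workaround 2: the same classes spelled out without I and O.
spelled_out = re.compile(
    r"(\d{1,2})([C-HJ-NP-X])([A-HJ-NP-Z])([A-HJ-NP-Z])(\d{2,10})"
)

good = "32UMU1078"
bad = "32UIU1078"  # contains the excluded letter I

groups_lookahead = with_lookahead.fullmatch(good).groups()
groups_spelled = spelled_out.fullmatch(good).groups()
rejected = with_lookahead.fullmatch(bad) is None and spelled_out.fullmatch(bad) is None
```

Both patterns split the good input into the same five groups and both reject the input containing I.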

Can't capture a group in a string

I want to capture a group in a string:
import Text.Regex.Posix
"somestring; somestring2=\"(.*?)\"" =~ "somestring; somestring2=\"my_super_string123\"" :: String
It returns an empty string "", as opposed to my_super_string123 which I expect. I've tried ::[String] and ::[[String]] and, obviously, they were empty. Your suggestions?
The problem is that you have your string and your pattern swapped. You also will want to have the return type be [[String]]:
> "somestring; somestring2=\"my_super_string123\"" =~ "somestring; somestring2=\"(.*)\"" :: [[String]]
[["somestring; somestring2=\"my_super_string123\"", "my_super_string123"]]
Note that I had to remove the ? from the .*? part of the pattern. This is because POSIX doesn't support the lazy quantifier *?; neither POSIX flavor (basic or extended) does. It's also generally recommended to use a negated character class instead of laziness, since it avoids unnecessary backtracking. To do this, you'd change your pattern to
"somestring; somestring2=\"([^\"]*)\""
To clarify, here's the output from my GHCi:
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: [[String]]
[["s1; s2=\"my_super_string123\"","my_super_string123"]]
it :: [[String]]
> "s1; s2=\"my_super_string123\"" =~ "s1; s2=\"([^\"]*)\"" :: String
"s1; s2=\"my_super_string123\""
it :: String
As you can see, with the return type as String, it returns whatever text matches the entire pattern, not just the capturing groups. Use [[String]] when you want to get the contents of the individual capturing groups.
I edited the contents of the string so that it would fit without having to scroll horizontally, just for illustrative purposes.
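The negated-class pattern is portable across regex flavors. As a hedged illustration, the same pattern in Python distinguishes the whole match from the capture group in the same way (note that Python's re.search takes the pattern first and the subject string second):

```python
import re

text = 'somestring; somestring2="my_super_string123"'

# [^"]* matches everything up to the next double quote, with no lazy quantifier.
pattern = r'somestring; somestring2="([^"]*)"'

m = re.search(pattern, text)
whole = m.group(0)     # text matched by the entire pattern
captured = m.group(1)  # contents of the first capture group only
```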

Is there an R function to escape a string for regex characters

I want to build a regular expression by substituting in some strings to search for, so these strings need to be escaped before I put them in the regex. That way the regex still works if a search string contains regex metacharacters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
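To show the same substitution outside R, here is a sketch of the quotemeta idea in Python. The function name quotemeta is borrowed from Perl and is not a Python built-in; Python's own re.escape is used only as a cross-check.

```python
import re

def quotemeta(text):
    # Put a backslash in front of every non-word character.
    # In the replacement template, \\ is a literal backslash and \1 is the
    # captured character.
    return re.sub(r"(\W)", r"\\\1", text)

x = "foo[bar]"
escaped = quotemeta(x)

# The escaped string, used as a pattern, matches the original text literally.
matches_literally = re.fullmatch(escaped, x) is not None
```

For this input the result is the same as what re.escape produces.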
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
"\\{", "\\}", "\\^", "\\$","\\*",
"\\+", "\\?", "\\.", "\\|")
replace.vals <- paste0("\\\\", vals)
for(i in seq_along(vals)){
strings <- gsub(vals[i], replace.vals[i], strings)
}
strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
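For readers without rex, the same anchored pattern can be sketched in Python with the standard re.escape function. The variable names are only illustrative.

```python
import re

x = "foo[bar]"

# Escape the literal text, then anchor it at both ends.
anchored = "^" + re.escape(x) + "$"

exact = re.search(anchored, "foo[bar]") is not None
with_prefix = re.search(anchored, "xfoo[bar]") is not None
```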
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect the occurrences of non-word characters and escape them with the \\1 syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, replacing "([^[:alnum:]_])" for "(\\W)".

perl regular expression take out text enclosed in parentheses

how do I use Perl to get rid of text within parentheses? For example:
$str = "This is a (extra stuff) string."
to
$str = "This is a string."
I am currently using this, but it's not working:
$str =~ s/( ( [^)]+ ) )//;
Thanks!
You need to escape the parentheses, like:
s/\([^)]*\)//g
Update by popular demand:
To remove the space you can simply remove spaces before the parenthesis. This will work in most cases:
s/\s*\([^)]*\)//g
To handle nested parenthesis you can use a recursive pattern, like so:
s/\s*\((?:[^()]+|(?R))*\)//g
You can read about (?R) and the like in perlre.
The last expression will work for string like aaa (foo(b,a,2*(3+4)) b) (c (c) c) ddd (x)., giving aaa ddd..
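Not every regex engine has recursive patterns like (?R); Python's built-in re module, for one, does not. A different technique that gives the same result on the example above is a plain depth counter. This is only a sketch: the function name drop_parenthesized is made up, and unlike the regex it silently drops everything after an unclosed opening parenthesis.

```python
def drop_parenthesized(text):
    out = []
    depth = 0  # how many parentheses are currently open
    for ch in text:
        if ch == "(":
            if depth == 0:
                # Mimic the leading \s* of the regex: trim whitespace
                # that sits directly before a top-level group.
                while out and out[-1].isspace():
                    out.pop()
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            # Keep characters outside all parentheses (stray ")" included).
            out.append(ch)
    return "".join(out)

result = drop_parenthesized("aaa (foo(b,a,2*(3+4)) b) (c (c) c) ddd (x).")
```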
The ( are special and must be escaped
s/\([^)]+\)//g
None of the solutions so far do what the OP asked.
The expression $str =~ s/\([^)]*\)//g;
Converts "This is a (extra stuff) string" to "This is a  string", leaving two spaces between the "a" and "string".
Converts "This is a (doubly (nested)) string" to "This is a ) string".
Converts "This is a (doubly (no, (triply!) nested) expression) string" to "This is a  nested) expression) string".
Similar problems exist with $str =~ s/[ ]?\(.*?\)[ ]?//g; And why use those square brackets? Aren't regular expressions hairy enough without unneeded stuff?
We're going to need something a bit hairier so we can eat multiply-nested parenthetical remarks and properly deal with keeping spacing where needed but discarding it otherwise. This does the trick:
1 while $str =~ s/(\w?)(\s*)\([^()]*\)(\s*)(\w?)
/($1&&$4)?($1.($2?$2:$3).$4):($1?$1:$4)/ex;
Edit
Test results:
'This string is OK as is.' -> 'This string is OK as is.'
'This is a (extra stuff) string.' -> 'This is a string.'
'(Preliminary remark) string' -> 'string'
'String (with end remark)' -> 'String'
'A string (remark before punctuation)!' -> 'A string!'
'A (doubly (nested)) string' -> 'A string'
'A (doubly (no, (triply!) nested)) string' -> 'A string'
Edit2
The exg qualification results in incorrect handling of "This (delete) (delete) is a string". All that is needed is ex.
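The loop above can be hard to read, so here is a sketch of the same idea ported to Python: remove one innermost parenthesized group per pass and decide what spacing to keep from the characters around it. The names strip_parens and _keep_spacing are invented for this port. One known difference: Perl treats the string "0" as false, Python does not, so a digit 0 next to a group could be handled differently.

```python
import re

# word char, spaces, an innermost "(...)" group, spaces, word char
_GROUP = re.compile(r"(\w?)(\s*)\([^()]*\)(\s*)(\w?)")

def _keep_spacing(m):
    before, ws_before, ws_after, after = m.groups()
    if before and after:
        # Words on both sides: keep one run of whitespace between them.
        return before + (ws_before or ws_after) + after
    # Group at the start, at the end, or next to punctuation: drop the spacing.
    return before or after

def strip_parens(text):
    # Like Perl's "1 while s/.../.../ex": one replacement per pass
    # until nothing is left to replace.
    while True:
        text, count = _GROUP.subn(_keep_spacing, text, count=1)
        if count == 0:
            return text
```

Calling strip_parens on the test strings listed above should reproduce the listed results.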
This line should do what you need:
$str =~ s/[ ]?\(.*?\)[ ]?//g;
Do note that it won't work with nested brackets (like (this)), since the regex would have to be a lot more complicated for that type of functionality.
I convert special characters to hex escapes so they are easier to use in my regexes:
/\x28([^\x29]+)\x29/
Hmm I had expected the "greedy" principle to apply, eating all the way to the close parenthesis even when nested. Perhaps a little brute force, using index and rindex functions, would be better.
But I still wonder, why doesn't
$str =~ s/[ ]?\(.*?\)[ ]?//g;
slurp it all the way to the last ')'?
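The short answer is that .*? is lazy: it stops at the first ) that lets the match succeed, while the greedy .* runs to the last one. A small sketch in Python, where these quantifiers behave the same way as in Perl, shows both on a nested input:

```python
import re

text = "This is a (doubly (nested)) string"

# Lazy: stops at the first ")", leaving the outer ")" behind.
lazy = re.sub(r"[ ]?\(.*?\)[ ]?", "", text)

# Greedy: runs to the last ")", and also eats the space on both sides.
greedy = re.sub(r"[ ]?\(.*\)[ ]?", "", text)
```

The greedy form does reach the last parenthesis here, but on a line with two separate groups it would also delete everything between them.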
A split version. I kind of like split for this, because it is non-invasive, preserving the original format, and also, regexes tend to become... complicated. Though you need regex to trim it, of course.
You'd still need to work out the spacing. It is not a simple thing to predict whether extra space will appear in the front or end, and removing all double spaces will not preserve original format. This solution removes a single space in front of opening parens, and nothing else. Works in most cases, assuming the input has correct punctuation to begin with.
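use warnings;
use strict;
while (<DATA>) {
    # Split each line at every opening parenthesis.
    my @parts = split /\(/;
    print de_paren(@parts);
}
sub de_paren {
    my $return = shift;
    my @parts = @_;
    while (my $word = shift @parts) {
        # Pieces without a closing parenthesis lie entirely inside a remark.
        next unless $word =~ /\)/;
        # Drop everything up to and including each closing parenthesis.
        $word =~ s/^.*?\)// while ($word =~ /\)/);
        # Remove the single space in front of the opening parenthesis.
        $return =~ s/ $//;
        $return .= $word;
    }
    return $return;
}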
__DATA__
A (doubly (no, (triply!) nested)) string
This is a (extra stuff) string.
(Preliminary remark) string
String (with end remark) String (with end remark)
A string (remark before punctuation)!
A (doubly (nested)) string
Output is:
A string
This is a string.
string
String String
A string!
A string