R Regex: Parenthesis Not Acting as Metacharacter

I am trying to split a string by the group "%in%" and the character "#". All the documentation I can find says that parentheses are metacharacters used for grouping in R regex. So the code
> strsplit('example%in%aa(bbb)aa#cdef', '[(%in%)#]', perl=TRUE)
SHOULD give me
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
That is, it should leave the parentheses in "aa(bbb)aa" alone, because the parentheses in the matching expression are not escaped. But instead it ACTUALLY gives me
[[1]]
[1] "example" "" "" "" "aa" "bbb" "aa" "cdef"
as if the parentheses were not metacharacters! What is up with this and how can I fix it? Thanks!
This is true with and without the argument perl=TRUE in strsplit.

Not sure what documentation you're reading, but the Extended Regular Expressions section in ?regex says:
Most metacharacters lose their special meaning inside a character class. ...
(Only '^ - \ ]' are special inside character classes.)
You don't need to create a character class. Just use "or" | (you likely don't need to group "%in%" either, but it shouldn't hurt anything):
> strsplit('example%in%aa(bbb)aa#cdef', '(%in%)|#', perl=TRUE)
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
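To make the difference concrete, here is a minimal sketch comparing the two patterns on the string from the question (the variable names by_class and by_alt are only for illustration). The character class treats each of ( % i n ) # as its own one-character delimiter, while the alternation treats %in% as a unit:

```r
x <- "example%in%aa(bbb)aa#cdef"

# Character class: any ONE of the characters ( % i n ) # is a delimiter,
# so the parentheses in the input are split on as well
by_class <- strsplit(x, "[(%in%)#]")[[1]]

# Alternation: the literal sequence %in% OR the single character #
by_alt <- strsplit(x, "%in%|#")[[1]]

print(by_class)  # "example" "" "" "" "aa" "bbb" "aa" "cdef"
print(by_alt)    # "example" "aa(bbb)aa" "cdef"
```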

No need to use [ or ( here; just this:
strsplit('example%in%aa(bbb)aa#cdef', '%in%|#')
[[1]]
[1] "example" "aa(bbb)aa" "cdef"

Inside a character class [], most characters lose their special meaning, including ( and ).
You might want this regex instead:
'%in%|#'

Related

R regex remove unicode apostrophe

Let's say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesn't seem to work and results in the same line as the original. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?
You can use a [^[:ascii:]] construct with a Perl-like regex to remove the non-ASCII codes from your input, and you can add an alternative [][] to also match square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=T)
replaced
## => [1] "Peanut M&Ms"
If you only plan to remove the \u0092 symbol and the square brackets, you do not need a Perl-like regex:
replaced <- gsub("[][\u0092]", "", text)
Note that [...] is a character class that matches 1 symbol, here, either a ] or [, or \u0092. If you place ] at the beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).
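On the "one go or separate lines" part of the question, here is a small sketch (assuming a UTF-8 session; one_pass and two_pass are illustrative names) showing that a single character class and two chained calls should give the same result, so the single call is mostly a matter of brevity:

```r
text <- "[Peanut M&M\u0092s]"

# One pass: a single class containing ], [ and U+0092
one_pass <- gsub("[][\u0092]", "", text)

# Two passes: strip the brackets first, then the U+0092 character
two_pass <- gsub("\u0092", "", gsub("[][]", "", text))

identical(one_pass, two_pass)

# The "] goes first" rule on a plain ASCII string
gsub("[][]", "", "[abc]")  # "abc"
```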

How do I replace brackets using regular expressions in R?

I'm sure this is a really easy question. I'm quite familiar with RegEx in R in the meantime, but I just can't get my head around this one.
Suppose, we have this string:
a <- c("a b . ) ] \"")
Now, all I want to do is to delete the quotes, the dot, the closing parenthesis and the closing brackets.
So, I want: "a b".
I tried:
gsub("[.\\)\"\\]]", "", a)
It doesn't work. It returns: "a b . ) ]" So nothing gets removed.
As soon as I exclude the \\] from the search pattern, it works...
gsub("[.\\)\"]", "", a)
But, of course, it doesn't remove the closing brackets!
What have I done wrong?!?
Thanks for your help!
a <- c('a b . ) ] "')
gsub('\\s*[].)"]\\s*', '', a)
## [1] "a b"
When you want to include the close bracket character in a bracket expression you should always include it first within the brackets; that causes it to be taken as a character within the bracket expression, rather than as the closing delimiter of the bracket expression.
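A short sketch of that rule, using the string from the question. The second call is an assumption on my part rather than something from the answers above: with perl = TRUE, an escaped \] inside the class is accepted, which is the form the original attempt was reaching for. Neither pattern touches the spaces, so trimws() is used to tidy the result:

```r
a <- 'a b . ) ] "'

# ] placed first is taken literally, so the class is: ] . ) "
posix_style <- gsub('[].)"]', "", a)

# Assumption: with perl = TRUE, \] inside the class is a literal ]
perl_style <- gsub('[.)"\\]]', "", a, perl = TRUE)

identical(posix_style, perl_style)
trimws(posix_style)  # "a b"
```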
Building on @akrun's comment
library(stringr)
str_trim(gsub('[.]|[[:punct:]]', '\\1', a))
Replace the period in the first set of brackets with whichever punctuation marks you want to keep.
You may try this.
> gsub("\\b\\W\\b(*SKIP)(*F)|\\W", "", a, perl=T)
[1] "a b"
> gsub("\\b(\\W)\\b|\\W", "\\1", a, perl=T)
[1] "a b"

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
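To see the two changes separately, here is a minimal sketch that applies them one at a time to the vector from the question (the names step0 to step2 are only illustrative):

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Original attempt: exactly one letter between the parentheses, so nothing matches
step0 <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)

# Add +: the parentheses are dropped, but the leading number stays
step1 <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)

# Add .* in front: everything before the parentheses is replaced too
step2 <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)

step0  # "85" "86 (TBA)" "87 (LAST)"
step1  # "85" "86 TBA" "87 LAST"
step2  # "85" "TBA" "LAST"
```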
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See demo. Replace by \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain, as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

Extract substrings starting with specific character until next space

I want to extract the tags (twitter handles) from tweets.
tweet <- "#me bla bla bla bla #2_him some text #me_"
The following only extracts part of some substrings due to the punctuation in some tags
regmatches(tweet, gregexpr("#[[:alnum:]]*", tweet))[[1]]
[1] "#me" "#2" "#me"
I don't know what regular expression would return the entire string (#tag).
Thanks!
If you want to match all non-spaces, just use the corresponding regular expression
regmatches(tweet, gregexpr("#[^ ]*", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"
You can use the following. \S will match any non-white space character. As well, you want to use the + quantifier instead of * otherwise you will end up matching the # character by itself if one did exist in the string.
> regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"
Instead of [[:alnum:]]*, use \w*, because _ is not an alphanumeric character ([[:alnum:]] matches [A-Za-z0-9]) but it is a word character ([A-Za-z0-9_]).
> regmatches(tweet, gregexpr("#\\w*", tweet))[[1]]
[1] "#me" "#2_him" "#me_"
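The patterns suggested so far differ in a few edge cases. Here is a small base R sketch; extract_tags is a hypothetical helper, and the punct and lone strings are made-up inputs for illustration, not from the question:

```r
# Hypothetical helper wrapping the regmatches/gregexpr idiom used above
extract_tags <- function(pattern, x) regmatches(x, gregexpr(pattern, x))[[1]]

tweet <- "#me bla bla bla bla #2_him some text #me_"
extract_tags("#[[:alnum:]]*", tweet)  # "#me" "#2" "#me"  (stops at _)
extract_tags("#\\w+", tweet)          # "#me" "#2_him" "#me_"

# \S keeps trailing punctuation, \w does not
punct <- "#me, hi #you!"
extract_tags("#\\S+", punct)  # "#me," "#you!"
extract_tags("#\\w+", punct)  # "#me" "#you"

# * also matches a bare "#", + requires at least one character after it
lone <- "a # b #x"
extract_tags("#\\w*", lone)  # "#" "#x"
extract_tags("#\\w+", lone)  # "#x"
```

Which one fits depends on what counts as part of a tag in your data: \S+ follows the "until next space" wording literally, while \w+ is stricter.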
The qdapRegex package has a function specifically designed for this task rm_tag:
library(qdapRegex)
rm_tag(tweet, extract=TRUE)
## [[1]]
## [1] "#me" "#2_him" "#me_"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
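For comparison, here is a minimal sketch of why the pattern in the question grabs too much and what the non-greedy quantifier changes. It uses base sub() with perl = TRUE and adds a $ anchor, which is my addition so that the lazy group is forced to extend to the end of the string:

```r
x <- c("abc", "ab", "abd")

# Greedy: [a-z]+ swallows the whole string and the empty alternative matches
greedy <- sub("^([a-z]+)(?:c|)$", "\\1", x, perl = TRUE)

# Lazy: [a-z]+? takes as little as possible, so a trailing c goes to (?:c|)
lazy <- sub("^([a-z]+?)(?:c|)$", "\\1", x, perl = TRUE)

greedy  # "abc" "ab" "abd"
lazy    # "ab" "ab" "abd"
```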
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
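In R, one way to reference that first captured element is a "\\1" backreference in sub(). This is a sketch under a couple of assumptions: perl = TRUE is used for the lazy quantifier, a leading ^ is added, and the symbols vector (including "ABF") is made up for illustration:

```r
symbols <- c("ZNH3", "ZN", "CLZ4", "ABF")

# Replace the whole match with the first capture group (the root)
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)

roots  # "ZN" "ZN" "CL" "ABF"
```

Note that a root which itself ends in one of the expiration letters followed by digits would be stripped as well, so this only works if real roots never look like that.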