Extract info inside parenthesis in R - regex

I have some rows; some contain parentheses and some don't, like ABC(DEF) and ABC. I want to extract the text inside the parentheses:
ABC(DEF) -> DEF
ABC -> NA
I wrote
gsub(".*\\((.*)\\).*", "\\1",X).
It works well for ABC(DEF), but it outputs "ABC" when there are no parentheses.

If you do not want to get ABC back when using sub with your regex, add an alternative that matches the whole string when there are no parentheses, so that it is replaced with an empty string.
X <- c("ABC(DEF)", "ABC")
sub(".*(?:\\((.*)\\)).*|.*", "\\1", X)
# the trailing "|.*" is the added alternative
See the IDEONE demo.
Note that you do not have to use gsub: only one replacement needs to be performed, so sub will do.
A stringr str_match would also be handy for this task (the captured text is in the second column of the returned matrix, and rows without a match are NA):
str_match(X, "\\((.*)\\)")
or
str_match(X, "\\(([^()]*)\\)")
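If you would rather stay in base R and still get NA for rows without parentheses, something along these lines should work. This is only a sketch: extract_parens is a hypothetical helper name, and it assumes you want the first non-nested (...) pair.

```r
# Hypothetical helper: text inside the first (...) pair, or NA if there is none.
extract_parens <- function(x) {
  # Lookbehind/lookahead keep the parentheses themselves out of the match.
  m <- regexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.integer(m) != -1L      # regexpr() returns -1 where nothing matched
  out[hit] <- regmatches(x, m)     # regmatches() returns only the matched pieces
  out
}

X <- c("ABC(DEF)", "ABC")
extract_parens(X)
# [1] "DEF" NA
```

Unlike the sub() approach, which leaves an empty string for non-matching rows, this keeps "no parentheses" (NA) distinct from "empty parentheses" ("").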

Using str_extract() from stringr will work.
library(stringr)
df$`new column` <- str_extract(df$`existing column`, "(?<=\\().+?(?=\\))")
This creates a new column containing the text found inside parentheses in the existing column. Where a value has no parentheses, the new column is filled with NA.
The inspiration for my answer comes from this answer on the original question about this topic

Related

",(?!.*\\))" returning "Invalid Regex" error in R

I've got a string that I'm working with and I'm trying to select only the commas that are outside of the parentheses so that I can split the string based on that. Here's the string I'm working with:
"LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
I'm trying to use the regex mentioned in the question title and it's telling me that it's not valid. Presumably this is because the closing parenthesis that is supposed to be escaped is being recognized by R as the parenthesis closing the match group and so the second parenthesis is throwing everything off. I'm just curious about how to work around this. Here is the syntax I'm using:
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
tmp <- strsplit(counties, ',(?!.*\\))')
I can obviously just do the inverse of what I'm doing now and instead of splitting the text on the commas outside of the parentheses, simply replace the commas inside of the parentheses and then split the string on commas, but I'd like to know why this isn't working.
I believe the reason your regex isn't working is because it's very Perl-ish, which requires the perl=T flag. I think it is also slightly malformed in that you should check for opening and closing parentheses to be complete... I think this is a general solution matching not just your specific case:
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
tmp <- strsplit(counties, ",(?![^(]*\\))", perl=T)
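To see what that lookahead does, here is a small sketch on a shorter, made-up string (x is hypothetical, not the county data). It assumes the parentheses are not nested.

```r
# Hypothetical input, shorter than the county string above.
x <- "A (x, y), B, C"

# Split on a comma only when it is NOT followed by a ")" that arrives
# before any "(", i.e. only when the comma sits outside a (...) group.
parts <- strsplit(x, ",(?![^(]*\\))", perl = TRUE)[[1]]

# The split leaves the space after each comma in place; trim it.
parts <- trimws(parts)
parts
# [1] "A (x, y)" "B"        "C"
```

Without perl = TRUE the default regex engine does not understand (?! ... ), which is what produces the "Invalid Regex" error.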
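Note that the parentheses in your pattern are actually balanced; it is the lookahead that needs perl = TRUE:
https://regex101.com/r/jE0lI9/1
With that flag, the original pattern works for this string:
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
tmp <- strsplit(counties, ',(?!.*\\))', perl = TRUE)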
If I have understood the question correctly, try this:
strsplit(gsub("\\(.*\\)", "", counties), ",")[[1]]

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
\G is an anchor that can match at one of two positions: the start of the string, or the position where the previous match ended. \K resets the starting point of the reported match, so any characters consumed before it are no longer included in the match, which is similar to a lookbehind.
Regular Expression Explanation
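To isolate what \K contributes, here is a smaller sketch that removes a single vowel between two k's, once with \K and once with a lookbehind. The test vector is made up for illustration; neither pattern handles runs of vowels, which is what the \G branch in the pattern above is for.

```r
data <- c("kek", "koki", "ukak", "ikka", "kaaaak")

# \K: the leading "k" must match, but it is dropped from the reported match,
# so only the vowel is replaced. (?=k) checks for the closing "k" without consuming it.
res_k <- gsub("k\\K[aeiou](?=k)", "", data, perl = TRUE)

# The same idea written with a lookbehind instead of \K.
res_lb <- gsub("(?<=k)[aeiou](?=k)", "", data, perl = TRUE)

res_k
# [1] "kk"     "kki"    "ukk"    "ikka"   "kaaaak"
identical(res_k, res_lb)
# [1] TRUE
```

"kaaaak" is left alone because no single vowel in it is both preceded and followed by a k.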
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.

Replacing the first vowel-consonant occurrence with consonant-vowel using sub in R

I know that it should be something like this but definitely I am missing something in the syntax:
yy=sub(r'\b[aeiou][^aeiou]*',r'\b[^aeiou][aeiou]*',"abmmmm")
I expect to have "bammmm" as output
Error: unexpected string constant in "yy=sub(r'\b[aeiou][^aeiou]*'"
I am not sure how is the exact syntax.
Please run your code in RStudio or any R compiler. I am new to regex and you giving me Python code wouldn't help me to understand the situation. Thanks!
This is what you want
yy=sub("\\b([aeiou])([^aeiou])","\\2\\1","abmm")
I'll explain how it works:
Asking to substitute "any vowel-consonant" with "any consonant-vowel" does not make much sense on its own. Should ab become ba, ce, or da? It could be any of them, because nothing ties the letters in the replacement to the letters that were matched. That is why the second argument of sub is not a regular expression: it is a replacement string.
To get what you asked for, add parentheses to the regular expression in the first argument. The first ( opens group 1, the second ( opens group 2, and so on. You can then use \1, \2, ... in the second argument to insert the text each group matched (written "\\1", "\\2" inside an R string).
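A short sketch of the group/backreference idea on a few made-up words (the words vector is hypothetical; perl = TRUE is used here only to pin down the regex flavor):

```r
words <- c("abmmmm", "ohm", "banana")

# Group 1 captures a vowel at the start of a word, group 2 the character after it.
# "\\2\\1" writes them back in swapped order; everything else is untouched.
swapped <- sub("\\b([aeiou])([^aeiou])", "\\2\\1", words, perl = TRUE)
swapped
# [1] "bammmm" "hom"    "banana"
```

"banana" is unchanged because it does not start with a vowel, so the pattern never matches.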
As an alternative to using a regular expression for this, there's a nice string reversal function in example(strsplit)
> strReverse <- function(x)
sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
> dd <- "abmmmm"
> paste(strReverse(substr(dd, 1, 2)), substr(dd, 3, nchar(dd)), sep = "")
[1] "bammmm"

r gsub and regex, obtaining y*_x* from y*_x*_xxxx.csv

General situation: I am currently trying to name dataframes inside a list in accordance to the csv files they have been retrieved from, I found that using gsub and regex is the way to go. Unfortunately, I can’t produce exactly what I need, just sort of.
I would be very grateful for some hints from someone more experienced. Maybe there is a reasonable R regex cheat sheet?
Files are named like r2_m1_enzyme.csv; the script should use the first 5 characters to name the corresponding dataframe r2_m1, and so on…
# generates a list of dataframes, to mimic a lapply(f,read.csv) output:
data <- list(data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)))
# this mimics file names obtained by list.files() function
f <-c("r1_m1_enzyme.csv","r2_m1_enzyme.csv","r1_m2_enzyme.csv","r2_m2_enzyme.csv")
# this should name the data frames according to the csv file they have been derived from
names(data) <- gsub("r*_m*_.*","\\1", f)
but it doesn't work as expected... they are named r2_m1_enzyme.csv instead of the desired r2_m1, although .* should stop it?
If I do:
names(data) <- gsub("r*_.*","\\1", f)
I do get r1, r2, r3 ... but I am missing my second index.
The question: what regular expression would allow me to obtain the strings "r1_m1", "r2_m1", "r1_m2", ... from files that are named r*_m*_xyz.csv?
Search history: R regex use * for only one character, Gsub regex replacement, R using parts of filename to name dataframe, R regex cheat sheet,...
If your names are always five characters long you could use substr:
substr(f, 1, 5)
If you want to use gsub you have to group your expression (via ( and )), because \\1 refers to the first group and inserts its content, e.g.:
gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
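It may help to see why the original pattern never matched: in a regex, * is a quantifier ("zero or more of the preceding item"), not a shell-style wildcard. A sketch using the file names from the question (dfs is a hypothetical stand-in for the list of data frames):

```r
f <- c("r1_m1_enzyme.csv", "r2_m1_enzyme.csv", "r1_m2_enzyme.csv", "r2_m2_enzyme.csv")

# "r*_m*_" means: zero or more "r", "_", zero or more "m", "_".
# The digits after r and m are never allowed for, so nothing matches
# and gsub() returns the input unchanged.
grepl("r*_m*_", f)
# [1] FALSE FALSE FALSE FALSE

# Stand-in for lapply(f, read.csv).
dfs <- list(data.frame(x = 1:2), data.frame(x = 1:2),
            data.frame(x = 1:2), data.frame(x = 1:2))

# Capture "r<digits>_m<digits>" in group 1 and replace the whole name with it.
names(dfs) <- sub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
names(dfs)
# [1] "r1_m1" "r2_m1" "r1_m2" "r2_m2"
```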

R! remove element from list which start from specific letters

I create a list of files:
folder_GLDAS=dir(foldery[numeryfolderow],pattern="_OBC.asc",recursive=F,full.names=T)
Unfortunately there is one additional object which I would like to remove (its file name begins with "NOWY": NOWYevirainf_OBC.asc).
How can I find the index of this element in the list, so that I can remove it by typing:
folder_GLDAS<-folder_GLDAS[-to_remove] ??
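Filter by using a regular expression. Because full.names=T puts the directory in front of each file name, match against basename():
folder_GLDAS <- folder_GLDAS[!grepl("^NOWY", basename(folder_GLDAS))]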
(You can also swap grepl for str_detect in stringr.)
Assuming that your list is one-dimensional, something like this should work (basename() strips the directory that full.names=T adds):
folder_GLDAS <- folder_GLDAS[substr(basename(folder_GLDAS), 1, 4) != 'NOWY']
You can actually make a (rather complex) Perl-style regex pattern that matches all names that end in "_OBC.asc" but DO NOT start with "NOWY": "^(?!NOWY).*_OBC\\.asc$"
Unfortunately the Perl syntax is not recognized by dir. But you could do it with grepl on the file names (without the directory part) like this:
folder_GLDAS <- dir(foldery[numeryfolderow], recursive = FALSE, full.names = TRUE)
folder_GLDAS <- folder_GLDAS[grepl("^(?!NOWY).*_OBC\\.asc$", basename(folder_GLDAS), perl = TRUE)]
Also note that the "." in "_OBC.asc" needs to be escaped, otherwise you'll match, for example, "_OBCXasc" as well.
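Here is a sketch of that filter on a few made-up paths, so it can be tried without the real folder. The paths are hypothetical and stand in for what dir(..., full.names = TRUE) would return.

```r
# Hypothetical paths, as dir(..., full.names = TRUE) would return them.
files <- c("gldas/evap_OBC.asc",
           "gldas/NOWYevirainf_OBC.asc",
           "gldas/rain_OBCXasc")

# Anchoring on the full path never fires, because every path starts with the directory.
grepl("^NOWY", files)
# [1] FALSE FALSE FALSE

# Match on the file name only: must not start with "NOWY", must end in "_OBC.asc".
keep <- grepl("^(?!NOWY).*_OBC\\.asc$", basename(files), perl = TRUE)
files[keep]
# [1] "gldas/evap_OBC.asc"
```

The third path is dropped because the escaped dot only matches a literal ".".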