Unable to replace string with back reference using gsub in R - regex

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?

I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
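To make the two changes concrete, here is a small sketch comparing the original pattern, the pattern with only the + added, and the full fix. The variable names (one_letter, no_prefix, with_prefix) are just illustrative.

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Original attempt: exactly one letter between the parentheses, so nothing matches.
one_letter <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)

# One-or-more letters, but no leading ".*": the digits and the space survive.
no_prefix <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)

# One-or-more letters plus ".*": the whole string is replaced by the group.
with_prefix <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)

print(one_letter)
print(no_prefix)
print(with_prefix)
```

Elements without parentheses (such as "85") never match, so gsub returns them unchanged in all three cases.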

gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See the demo; replace with \1.
https://regex101.com/r/sH8aR8/38

The following would work (the result is in B). Note that whitespace within the brackets may be problematic, because the string is split on spaces:
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)

I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain, as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"
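For comparison, the same extract-then-fall-back idea can be sketched in base R. This is not how qdapRegex implements it, just an assumption-free illustration using regexpr and regmatches; the names hit, inside and out are made up for the example.

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Position of the first "(...)" in each element; -1 means no parentheses.
hit <- regexpr("\\([^)]*\\)", tst)

# regmatches() returns only the elements that matched, e.g. "(TBA)" "(LAST)".
inside <- regmatches(tst, hit)

# Start from the original values and overwrite only the matched elements,
# dropping the surrounding parentheses.
out <- tst
out[hit > 0] <- substr(inside, 2, nchar(inside) - 1)

print(out)
```

Because the pattern accepts anything except a closing parenthesis, this also copes with whitespace inside the brackets, and the result is a plain character vector rather than a list.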

Related

stringr package str_extract() with inversion of the regex

I have a string like the following:
14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0
The following regex extracts the last part, which ends in a dot and a digit. I want to extract everything but that part and can't seem to find a way to invert the regex (using ^ is not helping):
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
> str_extract(s, '(\\.[0-9]{1})$')
[1] ".0"
I instead want the output to be:
[1] 14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27
To clarify further, I want it to return the string as is, if it does not end in a dot and one single digit.
Following example:
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.1'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.4'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
Try this regex:
^.*(?=\.\d+$)|^.*
Regex live here.
One option would be substituting for the last bit,
sub("\\.\\d$", '', s)
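As a quick check of the substitution approach against all three cases from the question, here is a sketch with the example strings collected into one vector (the name trimmed is illustrative):

```r
s <- c("14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0",
       "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27",
       "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.4")

# Remove a trailing dot followed by a single digit; strings that do not
# end that way are returned unchanged.
trimmed <- sub("\\.\\d$", "", s)

print(trimmed)
```

sub is vectorized, so the string without the trailing ".digit" passes through as is, which is the fallback behaviour the question asks for.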
str_extract(s, '([\\w ]+(?:\\.|-)){7}')
Then take the returned string up to its length - 1 (dropping the trailing dot), and it will give you the required output!
PS: In R, the regex must be quoted and its backslashes doubled, as above.
You could use stringr::str_remove() for example:
s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
stringr::str_remove(s, '(\\.[0-9]{1})$')
#> [1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"

Splitting strings with unescaped separator in R

I have to read a file with R, where a variable number of columns is separated by the | character. However, if it is preceded by a \ it should not be considered a separator.
I first thought something like strsplit(x, "[^\\][|]") would work, but the problem here is that the character before each pipe is "consumed":
> strsplit("word1|word2|word3\\|aha!|word4", "[^\\][|]")
[[1]]
[1] "word" "word" "word3\\|aha" "word4"
Can anyone suggest a way to do this? Ideally it should be vectorized since the files in question are very large.
I believe this works, using the regex from Anirudh's downvoted answer (not sure why the downvote; the answer as posted doesn't work, but the regex was correct):
strsplit(x, "(?<!\\\\)[|]", perl=TRUE)
## > strsplit(x, "(?<!\\\\)[|]", perl=TRUE)
## [[1]]
## [1] "word1" "word2" "word3\\|aha!" "word4"
You need to use a zero-width assertion (a negative lookbehind):
(?<!\\\\)[|]
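Since the question asks for something vectorized, here is a minimal sketch applying the lookbehind split to several lines at once. The second input string and the optional unescaping step are assumptions added for illustration; they are not part of the original question.

```r
x <- c("word1|word2|word3\\|aha!|word4", "a\\|b|c")

# Split on "|" only when it is not immediately preceded by a backslash.
parts <- strsplit(x, "(?<!\\\\)[|]", perl = TRUE)

# Optional: turn the escaped "\|" back into a plain "|" inside each field.
clean <- lapply(parts, function(p) gsub("\\|", "|", p, fixed = TRUE))

print(parts)
print(clean)
```

One limitation of this sketch: it does not treat an escaped backslash followed by a separator (a literal backslash at the end of a field) specially, so such input would need a more careful parser.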

Grab from beginning to first occurrence of character with gsub

I have the following regex, with which I'd like to grab everything from the beginning of the sentence until the first ##. I could use strsplit, as I demonstrate, to do this task, but I would prefer a gsub solution. If gsub is not the correct tool (I think it is, though), I'd prefer a base solution because I want to learn the base regex tools.
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
strsplit(x, "##")[[c(1, 1)]] #works
gsub("(.*)(##.*)", "\\1", x) #I want to work
Just add one character, putting a ? after the first quantifier to make it "non-greedy":
gsub("(.*?)(##.*)", "\\1", x)
# [1] "gfd gdr tsvfvetrv erv tevgergre "
Here's the relevant documentation, from ?regex
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to 'minimal' by appending
'?' to the quantifier.
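A side-by-side sketch of the two behaviours, using the string from the question (the names greedy and lazy are illustrative):

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Greedy: (.*) grabs as much as it can, so the group runs up to the LAST "##".
greedy <- gsub("(.*)(##.*)", "\\1", x)

# Non-greedy: (.*?) grabs as little as it can, so the group stops at the FIRST "##".
lazy <- gsub("(.*?)(##.*)", "\\1", x)

print(greedy)
print(lazy)
```

The greedy version still contains a "##" in its result, which is why the original attempt appeared not to work.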
I'd say:
sub("##.*", "", x)
Removes everything including and after the first occurrence of ##.
In this case, I'd do the inverse, i.e. replace everything from the first # onward with an empty string:
gsub("#.*$", "", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
But you can also use the non-greedy modifier ? to make your regex work in the way you suggested:
gsub("(.*?)#.*$", "\\1", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
Here's another approach that uses more string tools instead of a more complicated regular expression. It first finds the location of the first ## and then extracts the substring up to that point:
library(stringr)
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
loc <- str_locate(x, "##")
str_sub(x, 1, loc[, "start"] - 1)
Generally, I think this sort of step-by-step approach is more maintainable than complex regular expressions.
Try this as your regex
^[^#]+
It starts at the beginning of the string and matches anything that is not a #, up to the first #.
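In base R, a pattern that matches the part to keep (rather than the part to drop) is typically used with regexpr and regmatches. A minimal sketch, with first_part as an illustrative name:

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# regexpr() finds the first match; regmatches() extracts the matched text.
first_part <- regmatches(x, regexpr("^[^#]+", x))

print(first_part)
```

Note that this stops at any single #, not only at ##, and regmatches returns character(0) for a string that begins with #.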
There are several simpler answers already here, but since you indicated in your question that you'd like to learn about regex support in base R, here's another way, using positive lookahead assertion (?=#) and non-greedy option (?U).
regmatches(x, regexpr('(?U)^.+(?=#)', x, perl=TRUE))
[1] "gfd gdr tsvfvetrv erv tevgergre "

Regex matches processing in R

I would like to extract the 2 matching groups using R.
Right now I've got this, but it is not working well:
Code:
str = '123abc'
vector <- gregexpr('(?<first>\\d+)(?<second>\\w+)', str, perl=TRUE)
regmatches(str, vector)
Result:
[[1]]
[1] "123abc"
I want the result to be something like this:
[1] "123"
[2] "abc"
I'm not sure if you have a specific reason for using regmatches, unless you are e.g. importing the expressions in that format. If well-defined groups are common to all your entries, you can match them in this way:
x <- "123abc"
sub("([[:digit:]]+)[[:alpha:]]+","\\1",x)
sub("[[:digit:]]+([[:alpha:]]+)","\\1",x)
Result
[1] "123"
[1] "abc"
I.e., match the entire structure of the string, then replace it with the part you want to retain by enclosing it in round brackets and referring to it with a backreference ("\\1").
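If you would rather get both groups from a single call, base R's regexec returns the positions of the full match and of each capture group, and regmatches turns them into strings. A minimal sketch (m and groups are illustrative names):

```r
s <- "123abc"

# regexec() records the full match and every parenthesised group.
m <- regmatches(s, regexec("([[:digit:]]+)([[:alpha:]]+)", s))

# The first element is the whole match; drop it to keep only the groups.
groups <- m[[1]][-1]

print(groups)
```

The result is one list element per input string, so the same code extends to a character vector by indexing or lapply over m.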
I've renamed your string s to avoid clobbering str. Here is one approach:
library(stringr)
s <- '123abc'
reg <- '([[:digit:]]+)([[:alpha:]]+)'
complete <- unlist(str_extract_all(s, reg))
partials <- unlist(str_match_all(s, reg))
partials <- partials[!(partials %in% complete)]
partials
[1] "123" "abc"
Depending on how well structured your inputs are, you may want to use strsplit to split the string.
Documentation here.
Try this:
> library(gsubfn)
> strapplyc("123abc", '(\\d+)(\\w+)')[[1]]
[1] "123" "abc"

R- regexp question

I need to re-shape my data frame using regexp and, in particular, this kind of line
X21_GS04.A.mzdata
must become:
GS04.A
I tried
pluto <- sub('^X[0-90_]+','', my.data.frame$File.Name, perl=TRUE)
and it works; then I tried
pluto <- sub('.mzdata$','', my.data.frame$File.Name, perl=TRUE)
and it works too.
The problem is that I have no idea how to combine the two calls into one. I tried a script such as this
pluto <- sub('^X[0-90_]+ | .mzdata$','', my.data.frame$File.Name, perl=TRUE)
but nothing happens.
Can someone tell me where I went wrong?
Best
Riccardo
The regular expression you’re after is this:
^X\d+_(.*)\.mzdata$
This will match your whole expression and capture the part that you want to retain in a group. You can now replace this by \1 (a reference to the capture group).
In R, this would be:
result <- sub('^X\\d+_(.*)\\.mzdata$', '\\1', my.data.frame$File.Name, perl=TRUE)
Remove the spaces in your regex and escape the . character as \., i.e.:
^X[0-9]+_|\.mzdata$
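One detail worth spelling out when using the alternation: sub replaces only the first match, so it would strip the prefix and leave the suffix. A sketch with gsub, where file_names is a hypothetical stand-in for my.data.frame$File.Name and the second file name is invented for illustration:

```r
# Hypothetical stand-in for my.data.frame$File.Name.
file_names <- c("X21_GS04.A.mzdata", "X7_GS11.B.mzdata")

# sub() stops after the first match, so only the "X21_" prefix is removed.
only_first <- sub("^X[0-9]+_|\\.mzdata$", "", file_names)

# gsub() replaces every match, removing both the prefix and the suffix.
pluto <- gsub("^X[0-9]+_|\\.mzdata$", "", file_names)

print(only_first)
print(pluto)
```

The capture-group version with sub shown earlier avoids this issue because it matches the whole string in one go.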