Extract the last word between | | - regex

I have the following dataset
> head(names$SAMPLE_ID)
[1] "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
[2] "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
[3] "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|"
[4] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[5] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[6] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
I want to extract the last word between || as a new variable i.e.
Acinetobacter
Bacillus
Haemophilus
I have tried using
library(stringr)
names$sample2 <- str_match(names$SAMPLE_ID, "|.*?|")
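(Aside: this attempt returns empty matches because an unescaped | is the regex alternation operator, so "|.*?|" can match the empty string. A hedged fix in the same stringr style, escaping the pipes and capturing the final field, might look like the line below; it is a sketch, not one of the posted answers.)
str_match(names$SAMPLE_ID, "\\|(\\w+)\\|$")[, 2]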

We can use
library(stringi)
stri_extract_last_regex(v1, '\\w+')
#[1] "Acinetobacter"
data
v1 <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"

Using just base R:
myvar <- gsub("^..*\\|(\\w+)\\|$", "\\1", names$SAMPLE_ID)

^.*\\|\\K.*?(?=\\|)
Use \K to drop everything matched before it from the final match; see the demo. Also use perl = TRUE.
https://regex101.com/r/fM9lY3/45
x <- c("Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|",
"Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|" )
unlist(regmatches(x, gregexpr('^.*\\|\\K.*?(?=\\|)', x, perl = TRUE)))
# [1] "Streptococcus" "Streptococcus"

The ending is all you need: [^|]+(?=\|$)
Per @RichardScriven:
Which in R would be regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE))
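A quick check against the sample IDs from the question (here x stands in for names$SAMPLE_ID; this is just a sketch of the expected result):
x <- names$SAMPLE_ID
regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE))
# "Acinetobacter" "Bacillus" "Haemophilus" "Streptococcus" "Streptococcus" "Streptococcus"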

You can use package "stringr" as well in this case. Here is the code:
v<- "Bacteria|
Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
v1<- str_replace_all(v, "\\|", " ")
word(v1,-2)
Here I used v as the string. The basic idea is to replace all the | with spaces and then take the last non-empty word with word(); index -2 is used because the trailing | leaves an empty final token, so -1 would return an empty string.
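Applied to the whole column rather than a single string, the same idea would be (a sketch, assuming the column shown in the question):
names$sample2 <- word(str_replace_all(names$SAMPLE_ID, "\\|", " "), -2)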

Related

How to split a string before the delimiter?

I have a character string like the below.
a <- "T,2016,07,T,2016,07,22,T,2016,07"
I would like to split it to get this,
b <- c("T,2016,07", "T,2016,07", "T,2016,07")
Could you tell me the way? Many thanks.
Or use regular expression to split:
strsplit(a, ",(?=T)", perl = T)
# [[1]]
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"
You can do
x <- gsub("T", "%T", a)
y <- unlist(strsplit(x, "%"))[-1]
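With that tweak the pieces come out without trailing commas, matching the other answers' output:
y
# [1] "T,2016,07"    "T,2016,07,22" "T,2016,07"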
a <- "T,2016,07,T,2016,07,22,T,2016,07"
paste0("T", Filter(nzchar, strsplit(a, ",?T")[[1]]))
# [1] "T,2016,07" "T,2016,07,22" "T,2016,07"

String split with conditions in R

I have this mystring with the delimiter _. The condition is: if there are two or more delimiters, I want to split at the second delimiter; if there is only one delimiter, I want to split at ".ReCal". The result I am after is shown below.
mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")
result
"MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
You can do this using gsubfn
library(gsubfn)
f <- function(x,y,z) if (z=="_") y else strsplit(x, ".ReCal", fixed=T)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
This allows for cases when you have more than two "_", and you want to split on the second one, for example,
mystring<-c("MODY_60.2.ReCal.sort.bam",
"MODY_116.21_C4U.ReCal.sort.bam",
"MODY_116.3_C2RX-1-10.ReCal.sort.bam",
"MODY_116.4.ReCal.sort.bam",
"MODY_116.4_asdfsadf_1212_asfsdf",
"MODY_116.5.ReCal_asdfsadf_1212_asfsdf", # split by second "_", leaving ".ReCal"
"MODY")
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
In the function f, x is the full match, while y and z are the first and second capture groups. So if z is not a "_", it falls back to splitting the match on ".ReCal" instead.
With the stringr package:
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
It also works with more than two delimiters.
Perl/PCRE has the branch-reset feature, (?|...), which lets capturing groups in different alternatives reuse the same group number, so the alternatives behave as a single capturing group.
IMO, this feature is elegant when you want to supply different alternatives.
x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')
sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=T)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$
You can do this with a single gsub call, replacing the whole match with \\1; see the demo.
https://regex101.com/r/wL4aB6/1
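A minimal sketch of that call, using the four-element mystring from the question (perl = TRUE is used here since the pattern comes from a PCRE demo):
gsub("^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$", "\\1", mystring, perl = TRUE)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"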
A little longer, but needs less regular expression knowledge:
library(stringr)
indx <- str_locate_all(mystring, "_")
for (i in seq_along(indx)) {
if (nrow(indx[[i]]) == 1) {
mystring[i] <- strsplit(mystring[i], ".ReCal")[[1]][1]
} else {
mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2] - 1)
}
}
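After the loop, mystring has been overwritten in place; with the four-element input from the question it should hold:
mystring
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"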
gregexpr can search for a pattern in strings and give the location.
First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).
After that, we can either extract the relevant part using substr based on the extracted index or, if there is NA, we can split the string at .ReCal and keep only the first part.
inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
substr(mystring, 1, inds - 1),
sapply(strsplit(mystring, ".ReCal"), '[', 1))
#[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

R regular expression: isolate a string between quotes

I have a string myFunction(arg1=\"hop\",arg2=TRUE). I want to isolate what is in between quotes (\"hop\" in this example)
I have tried so far with no success:
gsub(pattern="(myFunction)(\\({1}))(.*)(\\\"{1}.*\\\"{1})(.*)(\\){1})",replacement="//4",x="myFunction(arg1=\"hop\",arg2=TRUE)")
Any help by a regex guru would be welcome!
Try
sub('[^\"]+\"([^\"]+).*', '\\1', x)
#[1] "hop"
Or
sub('[^\"]+(\"[^\"]+.).*', '\\1', x)
#[1] "\"hop\""
The \" is not needed as " would work too
sub('[^"]*("[^"]*.).*', '\\1', x)
#[1] "\"hop\""
If there are multiple matches, as #AvinashRaj mentioned in his post, sub may not be that useful. An option using stringi would be
library(stringi)
stri_extract_all_regex(x1, '"[^"]*"')[[1]]
#[1] "\"hop\"" "\"hop2\""
data
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
x1 <- "myFunction(arg1=\"hop\",arg2=TRUE arg3=\"hop2\", arg4=TRUE)"
You could use the regmatches function as well. sub or gsub only work for a particular input; in the general case you should capture (grab) the match rather than remove everything around it.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> regmatches(x, gregexpr('"[^"]*"', x))[[1]]
[1] "\"hop\""
To get only the text inside the quotes, pass the result of the above to gsub to strip the quote characters.
> x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop"
> x <- "myFunction(arg1=\"hop\",arg2=\"TRUE\")"
> gsub('"', '', regmatches(x, gregexpr('"([^"]*)"', x))[[1]])
[1] "hop" "TRUE"
You can try:
str='myFunction(arg1=\"hop\",arg2=TRUE)'
gsub('.*(\\".*\\").*','\\1',str)
#[1] "\"hop\""
x <- "myFunction(arg1=\"hop\",arg2=TRUE)"
unlist(strsplit(x,'"'))[2]
# [1] "hop"

Extract string between parenthesis in R

I have to extract values between some rather peculiar delimiters in R. For example:
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
This is my example string and I wish to extract text between {[0-9]: and } such that my output for the above string looks like
## output should be
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098", "{112:123123214321}" "20:asdasd3214213"
This is a horrible hack and probably breaks on your real data. Ideally you could just use a parser, but if you're stuck with regex... well... it's not pretty:
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
# split based on }{ allowing for newlines and spaces
out <- strsplit(a, "\\}[[:space:]]*\\{")
# Make a single vector
out <- unlist(out)
# Have an excess open bracket in first
out[1] <- substring(out[1], 2)
# Have an excess closing bracket in last
n <- length(out)
out[length(out)] <- substring(out[n], 1, nchar(out[n])-1)
# Remove the number colon at the beginning of the string
answer <- gsub("^[0-9]*\\:", "", out)
which gives
> answer
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
You could wrap something like that in a function if you need to do this for multiple strings.
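A hedged sketch of such a wrapper (extract_fields is just an illustrative name; it bundles the steps above unchanged):
extract_fields <- function(s) {
  out <- unlist(strsplit(s, "\\}[[:space:]]*\\{"))   # split on }{, allowing whitespace/newlines
  out[1] <- substring(out[1], 2)                     # drop the excess opening bracket
  n <- length(out)
  out[n] <- substring(out[n], 1, nchar(out[n]) - 1)  # drop the excess closing bracket
  gsub("^[0-9]*\\:", "", out)                        # strip the leading "number:" tag
}
extract_fields(a)   # same result as shown above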
Using PCRE (perl = TRUE). This way is a bit more robust.
a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"
foohacky = function(str){
#remove opening bracket
pt1 = gsub('\\{+[0-9]:', '##',str)
#remove a closing bracket that is preceded by any alphanumeric character
pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE)
#split up and hack together the result
pt3 = strsplit(pt2, "##")[[1]][-1]
pt3
}
For example
> foohacky(a)
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
It also works with nesting
> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"
> foohacky(a)
[1] "0987617820" "{112:123123214321}" "{20:asdasd3214213}"
Here's a more general way, which returns any pattern between {[0-9]: and } allowing for a single nest of {} inside the match.
regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE)
a_parse <- regmatches(a, regPattern)
a <- unlist(a_parse)
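Printing the result should give the four fields requested in the question:
a
# "0987617820"  "q312132498s7yd09f8sydf987s6df8797yds9f87098"
# "{112:123123214321}"  "20:asdasd3214213"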

dynamic regex in R

The code below works only so long as the before and after strings contain no characters that are special in a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)'  # the parentheses cause problems in the regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
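An end-to-end sketch with a made-up x (the actual x in the question isn't shown, so this string is purely illustrative):
x <- 'Name of your Manager (note "self" if you are the Manager) Jane Doe CURRENT FOCUS'
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'), collapse = '')
regmatches(x, gregexpr(pattern, x, perl = TRUE))
# [[1]]
# [1] " Jane Doe "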
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
dnagirl, such a function exists and is glob2rx
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"