String split with conditions in R - regex

I have this mystring with the delimiter _. The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at ".Recal" and get the result as shown below.
mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")
result
"MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

You can do this using gsubfn
library(gsubfn)
f <- function(x,y,z) if (z=="_") y else strsplit(x, ".ReCal", fixed=T)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
This allows for cases when you have more than two "_", and you want to split on the second one, for example,
mystring<-c("MODY_60.2.ReCal.sort.bam",
"MODY_116.21_C4U.ReCal.sort.bam",
"MODY_116.3_C2RX-1-10.ReCal.sort.bam",
"MODY_116.4.ReCal.sort.bam",
"MODY_116.4_asdfsadf_1212_asfsdf",
"MODY_116.5.ReCal_asdfsadf_1212_asfsdf", # split by second "_", leaving ".ReCal"
"MODY")
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
In the function, f, x is the original string, y and z are the next matches. So, if z is not a "_", then it proceeds with the splitting by the alternative string.

With the stringr package:
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
It also works with more than two delimiters.

Perl/PCRE has the branch reset feature that lets you reuse a group number when you have capturing groups in different alternatives, and is considered as one capturing group.
IMO, this feature is elegant when you want to supply different alternatives.
x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')
sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=T)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"

gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$
You can simply do using gsub without using any complex regex.Just replace by \\1.See demo.
https://regex101.com/r/wL4aB6/1

A little longer, but needs less regular expression knowledge:
library(stringr)
indx <- str_locate_all(mystring, "_")
for (i in seq_along(indx)) {
if (nrow(indx[[i]]) == 1) {
mystring[i] <- strsplit(mystring[i], ".ReCal")[[1]][1]
} else {
mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2] - 1)
}
}

gregexpr can search for a pattern in strings and give the location.
First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).
After that, we can either extract the relevant part using substr based on the extracted index or, if there is NA, we can split the string at .ReCal and keep only the first part.
inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
substr(mystring, 1, inds - 1),
sapply(strsplit(mystring, ".ReCal"), '[', 1))
#[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

Related

Regexp in R to match everything in between first and last occurene of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = T)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, bla, bla_bla)
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

Extract the last word between | |

I have the following dataset
> head(names$SAMPLE_ID)
[1] "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
[2] "Bacteria|Firmicutes|Bacilli|Bacillales|Bacillaceae|Bacillus|"
[3] "Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|"
[4] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[5] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
[6] "Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|"
I want to extract the last word between || as a new variable i.e.
Acinetobacter
Bacillus
Haemophilus
I have tried using
library(stringr)
names$sample2 <- str_match(names$SAMPLE_ID, "|.*?|")
We can use
library(stringi)
stri_extract_last_regex(v1, '\\w+')
#[1] "Acinetobacter"
data
v1 <- "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
Using just base R:
myvar <- gsub("^..*\\|(\\w+)\\|$", "\\1", names$SAMPLE_ID)
^.*\\|\\K.*?(?=\\|)
Use \K to remove rest from the final matche.See demo.Also use perl=T
https://regex101.com/r/fM9lY3/45
x <- c("Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|",
"Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|" )
unlist(regmatches(x, gregexpr('^.*\\|\\K.*?(?=\\|)', x, perl = TRUE)))
# [1] "Streptococcus" "Streptococcus"
The ending is all you need [^|]+(?=\|$)
Per #RichardScriven :
Which in R would be regmatches(x, regexpr("[^|]+(?=\\|$)", x, perl = TRUE)
You can use package "stringr" as well in this case. Here is the code:
v<- "Bacteria|
Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Acinetobacter|"
v1<- str_replace_all(v, "\\|", " ")
word(v1,-2)
Here I used v as the string. The basic theory is to replace all the | with spaces, and then get the last word in the string by using function word().

String split in R skipping first delimiter if multiple delimiters are present

I have "elephant_giraffe_lion" and "monkey_tiger" strings.
The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at that delimiter. So the results I want to get in this example are "elephant_giraffe" and "monkey".
mystring<-c("elephant_giraffe_lion", "monkey_tiger")
result
"elephant_giraffe" "monkey"
You can anchor your split to the end of the string using $,
unlist(strsplit(mystring, "_[a-z]+$"))
# [1] "elephant_giraffe" "monkey"
Edit
The above only matches the last "_", not accounting for cases where there are more than two "_". For the more general case, you could try
mystring<-c("elephant_giraffe_lion", "monkey_tiger", "dogs", "foo_bar_baz_bap")
tmp <- gsub("([^_]+_[^_]+).*", "\\1", mystring)
tmp[tmp==mystring] <- sapply(strsplit(tmp[tmp==mystring], "_"), `[[`, 1)
tmp
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
You could also use gsubfn, to process the match with a function
library(gsubfn)
f <- function(x,y) if (y==x) strsplit(y, "_")[[1]][[1]] else y
gsubfn("([^_]+_[^_]+).*", f, mystring, backref=1)
# [1] "elephant_giraffe" "monkey" "dogs" "foo_bar"
As I posted an answer on your other related question, a base R solution:
x <- c('elephant_giraffe_lion', 'monkey_tiger', 'foo_bar_baz_bap')
sub('^(?|([^_]*_[^_]*)_.*|([^_]*)_[^_]*)$', '\\1', x, perl=TRUE)
# [1] "elephant_giraffe" "monkey" "foo_bar"

Extract string between parenthesis in R

I have to extract values between a very peculiar feature in R. For eg.
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
This is my example string and I wish to extract text between {[0-9]: and } such that my output for the above string looks like
## output should be
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098", "{112:123123214321}" "20:asdasd3214213"
This is a horrible hack and probably breaks on your real data. Ideally you could just use a parser but if you're stuck with regex... well... it's not pretty
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
# split based on }{ allowing for newlines and spaces
out <- strsplit(a, "\\}[[:space:]]*\\{")
# Make a single vector
out <- unlist(out)
# Have an excess open bracket in first
out[1] <- substring(out[1], 2)
# Have an excess closing bracket in last
n <- length(out)
out[length(out)] <- substring(out[n], 1, nchar(out[n])-1)
# Remove the number colon at the beginning of the string
answer <- gsub("^[0-9]*\\:", "", out)
which gives
> answer
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
You could wrap something like that in a function if you need to do this for multiple strings.
Using PERL. This way is a bit more robust.
a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"
foohacky = function(str){
#remove opening bracket
pt1 = gsub('\\{+[0-9]:', '##',str)
#remove a closing bracket that is preceded by any alphanumeric character
pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE)
#split up and hack together the result
pt3 = strsplit(pt2, "##")[[1]][-1]
pt3
}
For example
> foohacky(a)
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
It also works with nesting
> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"
> foohacky(a)
[1] "0987617820" "{112:123123214321}" "{20:asdasd3214213}"
Here's a more general way, which returns any pattern between {[0-9]: and } allowing for a single nest of {} inside the match.
regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE)
a_parse <- regmatches(a, regPattern)
a <- unlist(a_parse)

regex match within parenthesis

I'm attempting to use some regular expressions that I made for Python also work with R.
Here is what I have in Python (using the excellent re module), with my expected 3 matches:
import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']
Now with R, here is my best attempt:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\"" "\"Second [L]\"" "\"Third [1/T]\""
Why did R match the whole pattern, rather than just within the parenthesis? I was expecting:
[1] "First [T]" "Second [L]" "Third [1/T]"
Furthermore, perl=TRUE didn't make any difference. Is it safe to assume that R's regex does not consider matching only the parenthesis, or is there some trick that I'm missing?
Summary of solution: thanks #flodel, it appears to work well with other patterns too, so it appears to be a good general solution. A compact form of the solution using an input string line and regular expression pattern pat is:
pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])
Furthermore, perl=TRUE should be added to gregexpr if using PCRE features in pat.
If you print m, you'll see gregexpr(..., perl = TRUE) gives you the positions and lengths of matches for a) your full pattern including the leading and closing quotes and b) the captured (.*).
Unfortunately for you, when m is used by regmatches, it use the positions and lengths of the former.
There are two solutions I can think of.
Pass your final output through sub:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)
Or use substring using the positions and lengths of the captured expressions:
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)
To further your understanding, see what happens if your pattern is trying to capture more than one thing. Also see that you can give names to your captures groups (what the doc refers to as Python-style named captures), here "capture1" and "capture2":
m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos[, "capture1"],
end.pos[, "capture1"])
# [1] "First" "Second" "Third"
substring(line, start.pos[, "capture2"],
end.pos[, "capture2"])
# [1] "T" "L" "1/T"
1) strapplyc in the gsubfn package acts in the way you were expecting:
> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
2) Although it involves delving into m's attributes, its possible to make regmatches work by reconstructing m to refer to the captures rather than the whole match:
at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )
regmatches( line, m2 )[[1]]
3) If we knew that the strings always ended in ] and were willing to modify the regular expression then this would work:
> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"