conditional strsplit - regex

I have a dataframe where one of the columns contains a set of names. I would like to stringsplit a portion of the column names and have done so as follows:
DF$newname <- sapply(strsplit(as.character(DF$oldname), "_"), '[', 5)
in this example the fifth part of the split contains the name part of the character string. The problem is that this dataset contains $oldname names that are in different formats. In the first format the name is as follows where XXX are numbers:
xxx_xxx_xxx_xxx_name_xx (name is in fifth position)
and the second format the $oldname looks like this
xxx_xxx_xxx_xxx_xxx_name_xx (name is in sixth position)
I was thinking that I could use an ifelse command from within a function but am running into a little bit of trouble with the following code:
namesplit = function(df){
x <- strsplit(as.character(df$oldname), "_"), '[', 5)
y <- strsplit(as.character(df$oldname), "_"), '[', 6)
ifelse(is.character(x),x,y) }
DF$newname <- sapply(DF,namesplit)
this code doesn't work as I know I can's use the [ in this way but I am not sure of the best way. while I think I could get this working within a for loop, I would prefer to find a way to extract the names in a way that would allow me to use an apply.
thanks.

You can easily do this using gsub
names <- c('xxx_xxx_xxx_xxx_xxx_name1_xx', 'xxx_xxx_xxx_xxx_name2_xx')
gsub("^.*_([[:alnum:]]+)_.*$", "\\1", names)
[1] "name1" "name2"

If the name is the penultimate portion how about this:
x <- c("xxx_xxx_xxx_xxx_name_xx", "xxx_xxx_xxx_xxx_xxx_name_xx")
namesplit = function(x){
x <- strsplit(as.character(x), "_")
sapply(x, function(x) x[length(x)-1])
}
HTH

Related

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
how should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
# value Policy-cost Reed
#1 1 cost
#2 2 cost
#3 3 reed cost
#4 4 reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for characters followed by word boundary. The third and fourth elements does not match, so it was not replaced.
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
# value Policy-cost Reed
# 1 1 cost
# 2 2 cost
# 3 3 reed cost
# 4 4 reed cost

Replace character string elements by indices efficiently in R

I would like to efficiently replace elements in my character object with other particular elements in particular places (these places are indices which I know as they are results of the gregexpr function).
I would like some foo function that works like:
foo("qwerty", c(1,3,5), c("z", "x", "y"))
giving me:
[1] "zwxryy"
I searched the stringr package cran pdf but nothing hit my mind. Thank you in advance for any suggestions.
For example:
xx <- unlist(strsplit("qwerty",""))
xx[c(1,3,5)] <- c("z", "x", "y")
paste0(xx,collapse='')
[1] "zwxryy"
You could also try the one below, if you don't have that many characters to replace
st1 <- "qwerty"
gsub("^.(.).(.).","z\\1x\\2y", st1)
#[1] "zwxryy"
In stringi package there is stri_sub function that works like this:
a <- "12345"
stri_sub(a, from=c(1,3,5),len=1) <- letters[c(1,3,5)]
a
## [1] "a2345" "12c45" "1234e"
it's almost what you want. Just use this in loop:
a <- "12345"
for(i in c(1,3,5)){
stri_sub(a, from=i,len=1) <- letters[i]
}
a
## [1] "a2c4e"
Be aware that this kind of function is on our TODO list, check:
https://github.com/Rexamine/stringi/issues?state=open

Subsetting data using regular expressions in R

I want to extract specific information from within a column in a data frame and add it on to a new column in the same data frame. The complication lies in the fact that some rows do not have the information I want to extract (the 6 characters after "UniProt:") at all, while others have multiple occurrences - I want these to be displayed accordingly as this column contains the identifiers in my data frame.
Here's an example; I've copied a few rows of the column Fasta.headers from my data frame:
Row 1:
H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1
Row 2:
C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1
Row 3:
ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1
Row 4:
T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1
I want the output to be:
H2L0A8;Q9TXU2
O44447
O61907;Q76NP4;P34454
Here strapplyc from the gsubfn package extracts the desired strings from x and sapply collapses multiple strings into a single string separated by semicolons:
library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")
giving:
[1] "H2L0A8;Q9TXU2" "O44447" ""
[4] "O61907;Q76NP4;P34454"
where x is:
x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
"C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
"ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
"T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")
ADDED some explanation.
An alternative using the infrequently used: regmatches<-
regmatches(x,gregexpr("UniProt:.{7}",x),invert=TRUE) <- ""
gsub("UniProt:","",x)
#[1] "H2L0A8;Q9TXU2;"
#[2] "O44447;"
#[3] ""
#[4] "O61907;Q76NP4;P34454;"
You can also get there with lookaheads and lookbehinds specifying perl=TRUE to the regex:
sapply(regmatches(x,gregexpr("(?<=UniProt:).+?(?=;)",x,perl=TRUE)),
paste,collapse=";")
#[1] "H2L0A8;Q9TXU2" "O44447"
#[3] "" "O61907;Q76NP4;P34454"

extract partial string based on pattern in r

I would like to extract partial string from a list. I don't know how to define the pattern of the strings. Thank you for your helps.
library(stringr)
names = c("GAPIT..flowerdate.GWAS.Results.csv","GAPIT..flwrcolor.GWAS.Results.csv",
"GAPIT..height.GWAS.Results.csv","GAPIT..matdate.GWAS.Results.csv")
# I want to extract out "flowerdate", "flwrcolor", "height" and "matdate"
traits <- str_extract_all(string = files, pattern = "..*.")
# the result is not what I want.
You can also use regmatches
> regmatches(c, regexpr("[[:lower:]]+", c))
[1] "flowerdate" "flwrcolor" "height" "matdate"
I encourage you not to use c as a variable name, because you're overwriting c function.
I borrow the answer from Roman Luštrik for my previous question “How to extract out a partial name as new column name in a data frame”
traits <- unlist(lapply(strsplit(names, "\\."), "[[", 3))
Use sub:
sub(".*\\.{2}(.+?)\\..*", "\\1", names)
# [1] "flowerdate" "flwrcolor" "height" "matdate"
Here are a few solutions. The first two do not use regular expressions at all. The lsat one uses a single gsub:
1) read.table. This assumes the desired string is always the 3rd field:
read.table(text = names, sep = ".", as.is = TRUE)[[3]]
2) strsplit This assumes the desired string has more than 3 characters and is lower case:
sapply(strsplit(names, "[.]"), Filter, f = function(x) nchar(x) > 3 & tolower(x) == x)
3) gsub This assumes that two dots preceed the string and one dot plus junk not containing two successive dots comes afterwards:
gsub(".*[.]{2}|[.].*", "", names)
REVISED Added additional solutions.

Split on first comma in string

How can I efficiently split the following string on the first comma using base?
x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)
Desired outcome (2 strings):
[[1]]
[1] "I want to split here" "though I don't want to split elsewhere, even here."
Thank you in advance.
EDIT: Didn't think to mention this. This needs to be able to generalize to a column, vector of strings like this, as in:
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
The outcome can be two columns or one long vector (that I can take every other element of) or a list of stings with each index ([[n]]) having two strings.
Apologies for the lack of clarity.
Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.
XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
From the stringr package:
str_split_fixed(x, pattern = ', ', n = 2)
# [,1]
# [1,] "I want to split here"
# [,2]
# [1,] "though I don't want to split elsewhere, even here."
(That's a matrix with one row and two columns.)
Here is yet another solution, with a regular expression to capture what is before and after the first comma.
x <- "I want to split here, though I don't want to split elsewhere, even here."
library(stringr)
str_match(x, "^(.*?),\\s*(.*)")[,-1]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
library(stringr)
str_sub(x,end = min(str_locate(string=x, ',')-1))
This will get the first bit you want. Change the start= and end= in str_sub to get what ever else you want.
Such as:
str_sub(x,start = min(str_locate(string=x, ',')+1 ))
and wrap in str_trim to get rid of the leading space:
str_trim(str_sub(x,start = min(str_locate(string=x, ',')+1 )))
This works but I like Josh Obrien's better:
y <- strsplit(x, ",")
sapply(y, function(x) data.frame(x= x[1],
z=paste(x[-1], collapse=",")), simplify=F))
Inspired by chase's response.
A number of people gave non base approaches so I figure I'd add the one I usually use (though in this case I needed a base response):
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
library(reshape2)
colsplit(y, ",", c("x","z"))