conditional string splitting in R (using tidyr) - regex

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
how should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...

Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")

You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
# value Policy-cost Reed
#1 1 cost
#2 2 cost
#3 3 reed cost
#4 4 reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for characters followed by word boundary. The third and fourth elements does not match, so it was not replaced.

Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
# value Policy-cost Reed
# 1 1 cost
# 2 2 cost
# 3 3 reed cost
# 4 4 reed cost

Related

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions; the start of the string position or the position at the end of the last match. \K resets the starting point of the reported match and any previously consumed characters are no longer included which is similar to a lookbehind.
Regular Expression Explanation
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.

R use gsub as substr

I'm using H2O for some distributed computing work (via the h2o package in R). Many of the base R functions are present but I'm unable to find a suitable substitute for the substr function. I do have access to the sub and gsub functions and was hoping to possibly use some form of regex as a workaround.
I'm using the following code but not having any luck:
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1$var2 <- substr(df1$var1, 1,6)
df1$var3 <- gsub('\\d{1,8}','\\d{1,6}', df1$var1)
df1
The output in df1$var2 is what I'm looking for. Any suggestions?
EDIT:
Running this code:
library(h2o)
localH2O = h2o.init(nthreads = 2)
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1.hex <- as.h2o(localH2O , df1)
df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Gets this message:
> df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Error in as.character.default(x) :
no method for coercing this S4 class to a vector
Use capture groups:
gsub('(.+)..','\\1', df1$var1)
This regex matches (.+).. with df1$var1, and replace it with the substring that matches the first capture group (.+). Since there is .. at the end of the regex, the last two characters are not matched with the .+, thus they are not in the result.
Capture the first 6 value like so using a pattern that matches the whole sting
gsub('^(.{6}).*$','\\1', df1$var1)
A slightly more general replacement for substr(x,start,stop) is
if(start > 1)
gsub('^(.{*start-1*})(.{*stop-start+1*})).*$','\\1', 'asdfhjkl')
else
gsub('^(.{*stop*})).*$','\\1', 'asdfhjkl')
where the values between the * characters are the actual integer values of the expression. (although you'll have to make sure that nchar(x)is less than stop, otherwise the patterns won't match b/c the string is too short.)
The regex (?<=^.{6}).*$ matches al characters after the first 6 ones. If you want to replace substr(df1$var1, 1, 6) with sub, you can use this command:
sub('(?<=^.{6}).*$', '', df1$var1, perl = TRUE)
# [1] "141022" "141023" "141024" "141025" "141026" "141027" "141028" "141029"
# [9] "141030" "141031"
This command replaces all digits after the first 6 ones with the empty string.

Subsetting data using regular expressions in R

I want to extract specific information from within a column in a data frame and add it on to a new column in the same data frame. The complication lies in the fact that some rows do not have the information I want to extract (the 6 characters after "UniProt:") at all, while others have multiple occurrences - I want these to be displayed accordingly as this column contains the identifiers in my data frame.
Here's an example; I've copied a few rows of the column Fasta.headers from my data frame:
Row 1:
H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1
Row 2:
C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1
Row 3:
ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1
Row 4:
T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1
I want the output to be:
H2L0A8;Q9TXU2
O44447
O61907;Q76NP4;P34454
Here strapplyc from the gsubfn package extracts the desired strings from x and sapply collapses multiple strings into a single string separated by semicolons:
library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")
giving:
[1] "H2L0A8;Q9TXU2" "O44447" ""
[4] "O61907;Q76NP4;P34454"
where x is:
x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
"C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
"ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
"T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")
ADDED some explanation.
An alternative using the infrequently used: regmatches<-
regmatches(x,gregexpr("UniProt:.{7}",x),invert=TRUE) <- ""
gsub("UniProt:","",x)
#[1] "H2L0A8;Q9TXU2;"
#[2] "O44447;"
#[3] ""
#[4] "O61907;Q76NP4;P34454;"
You can also get there with lookaheads and lookbehinds specifying perl=TRUE to the regex:
sapply(regmatches(x,gregexpr("(?<=UniProt:).+?(?=;)",x,perl=TRUE)),
paste,collapse=";")
#[1] "H2L0A8;Q9TXU2" "O44447"
#[3] "" "O61907;Q76NP4;P34454"

extract partial string based on pattern in r

I would like to extract partial string from a list. I don't know how to define the pattern of the strings. Thank you for your helps.
library(stringr)
names = c("GAPIT..flowerdate.GWAS.Results.csv","GAPIT..flwrcolor.GWAS.Results.csv",
"GAPIT..height.GWAS.Results.csv","GAPIT..matdate.GWAS.Results.csv")
# I want to extract out "flowerdate", "flwrcolor", "height" and "matdate"
traits <- str_extract_all(string = files, pattern = "..*.")
# the result is not what I want.
You can also use regmatches
> regmatches(c, regexpr("[[:lower:]]+", c))
[1] "flowerdate" "flwrcolor" "height" "matdate"
I encourage you not to use c as a variable name, because you're overwriting c function.
I borrow the answer from Roman Luštrik for my previous question “How to extract out a partial name as new column name in a data frame”
traits <- unlist(lapply(strsplit(names, "\\."), "[[", 3))
Use sub:
sub(".*\\.{2}(.+?)\\..*", "\\1", names)
# [1] "flowerdate" "flwrcolor" "height" "matdate"
Here are a few solutions. The first two do not use regular expressions at all. The lsat one uses a single gsub:
1) read.table. This assumes the desired string is always the 3rd field:
read.table(text = names, sep = ".", as.is = TRUE)[[3]]
2) strsplit This assumes the desired string has more than 3 characters and is lower case:
sapply(strsplit(names, "[.]"), Filter, f = function(x) nchar(x) > 3 & tolower(x) == x)
3) gsub This assumes that two dots preceed the string and one dot plus junk not containing two successive dots comes afterwards:
gsub(".*[.]{2}|[.].*", "", names)
REVISED Added additional solutions.

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.