extract partial string based on pattern in r - regex

I would like to extract part of each string in a character vector, but I don't know how to define the pattern. Thank you for your help.
library(stringr)
names = c("GAPIT..flowerdate.GWAS.Results.csv","GAPIT..flwrcolor.GWAS.Results.csv",
"GAPIT..height.GWAS.Results.csv","GAPIT..matdate.GWAS.Results.csv")
# I want to extract out "flowerdate", "flwrcolor", "height" and "matdate"
traits <- str_extract_all(string = names, pattern = "..*.")
# the result is not what I want.

You can also use regmatches
> regmatches(c, regexpr("[[:lower:]]+", c))
[1] "flowerdate" "flwrcolor" "height" "matdate"
I encourage you not to use c as a variable name, because it masks the base c function.

I borrow the answer from Roman Luštrik for my previous question “How to extract out a partial name as new column name in a data frame”
traits <- unlist(lapply(strsplit(names, "\\."), "[[", 3))

Use sub:
sub(".*\\.{2}(.+?)\\..*", "\\1", names)
# [1] "flowerdate" "flwrcolor" "height" "matdate"

Here are a few solutions. The first two do not use regular expressions at all. The last one uses a single gsub:
1) read.table. This assumes the desired string is always the 3rd field:
read.table(text = names, sep = ".", as.is = TRUE)[[3]]
2) strsplit This assumes the desired string has more than 3 characters and is lower case:
sapply(strsplit(names, "[.]"), Filter, f = function(x) nchar(x) > 3 & tolower(x) == x)
3) gsub This assumes that two dots precede the string and that one dot plus junk not containing two successive dots comes afterwards:
gsub(".*[.]{2}|[.].*", "", names)
REVISED Added additional solutions.
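For comparison, the attempt in the question fails because an unescaped . matches any character, so "..*." matches the whole string. Below is a minimal base R sketch, assuming the trait name always follows the first pair of literal dots and contains no dot itself. The variable name file_names is only illustrative.

```r
# Illustrative input vector, same values as in the question
file_names <- c("GAPIT..flowerdate.GWAS.Results.csv", "GAPIT..flwrcolor.GWAS.Results.csv",
                "GAPIT..height.GWAS.Results.csv", "GAPIT..matdate.GWAS.Results.csv")

# (?<=\.\.) is a lookbehind for two literal dots;
# [^.]+ then takes every character up to the next dot
traits <- regmatches(file_names, regexpr("(?<=\\.\\.)[^.]+", file_names, perl = TRUE))
traits
```

Because regexpr returns only the first match per string, later dot-separated fields such as GWAS or Results are ignored.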

Related

R use gsub as substr

I'm using H2O for some distributed computing work (via the h2o package in R). Many of the base R functions are present but I'm unable to find a suitable substitute for the substr function. I do have access to the sub and gsub functions and was hoping to possibly use some form of regex as a workaround.
I'm using the following code but not having any luck:
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1$var2 <- substr(df1$var1, 1,6)
df1$var3 <- gsub('\\d{1,8}','\\d{1,6}', df1$var1)
df1
The output in df1$var2 is what I'm looking for. Any suggestions?
EDIT:
Running this code:
library(h2o)
localH2O = h2o.init(nthreads = 2)
df1 <- data.frame(id = 1:10, var1 = seq(14102201,14103200, 100))
df1.hex <- as.h2o(localH2O , df1)
df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Gets this message:
> df1.hex$var2 <- substr(df1.hex$var1, 1, 6)
Error in as.character.default(x) :
no method for coercing this S4 class to a vector
Use capture groups:
gsub('(.+)..','\\1', df1$var1)
This regex matches (.+).. against df1$var1 and replaces the match with the substring captured by the first group (.+). Because of the .. at the end of the regex, the last two characters are not matched by .+, so they are not in the result.
Capture the first 6 characters using a pattern that matches the whole string:
gsub('^(.{6}).*$','\\1', df1$var1)
A slightly more general replacement for substr(x,start,stop) is
if(start > 1)
gsub('^(.{*start-1*})(.{*stop-start+1*}).*$','\\2', 'asdfhjkl')
else
gsub('^(.{*stop*}).*$','\\1', 'asdfhjkl')
where the values between the * characters are the actual integer values of the expression. (You'll have to make sure that nchar(x) is at least stop; otherwise the patterns won't match because the string is too short.)
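The placeholder patterns above can be built with sprintf instead of being typed by hand. This is a sketch only: regex_substr is a hypothetical helper name, and whether sub on an H2O frame accepts such a pattern has not been verified here. It uses {0,n} for the captured part so that strings shorter than stop still match.

```r
# regex_substr is a hypothetical helper, not part of base R or h2o
regex_substr <- function(x, start, stop) {
  # Skip start - 1 characters, then capture up to stop - start + 1 characters
  pattern <- sprintf("^.{%d}(.{0,%d}).*$",
                     as.integer(start - 1), as.integer(stop - start + 1))
  sub(pattern, "\\1", as.character(x), perl = TRUE)
}

regex_substr("asdfhjkl", 3, 5)
regex_substr(14102201, 1, 6)
```

One difference from substr: if a string has fewer than start - 1 characters, the pattern does not match and the string is returned unchanged rather than as "".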
The regex (?<=^.{6}).*$ matches all characters after the first 6. If you want to replace substr(df1$var1, 1, 6) with sub, you can use this command:
sub('(?<=^.{6}).*$', '', df1$var1, perl = TRUE)
# [1] "141022" "141023" "141024" "141025" "141026" "141027" "141028" "141029"
# [9] "141030" "141031"
This command replaces all characters after the first 6 with the empty string.

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
How should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#  value Policy-cost Reed
#1     1             cost
#2     2             cost
#3     3        reed cost
#4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=TRUE)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
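\\b(?=[A-Za-z]+\\b): matches a word boundary \\b and looks ahead for a run of letters that ends at another word boundary. The third and fourth elements do not match (the underscore counts as a word character, so "reed" does not end at a word boundary), so they are left unchanged.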
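A quick way to check which elements the lookahead pattern touches is to run grepl and sub on the bare character vector. This is a small sketch using the same values as the question. sub is enough here because the pattern can match at most once per string.

```r
variable <- c("cost", "cost", "reed_cost", "reed_cost")
pattern <- "\\b(?=[A-Za-z]+\\b)"

# TRUE only where the string is a single run of letters
matched <- grepl(pattern, variable, perl = TRUE)

# The zero-width match sits at the start of the string, so "_" is prepended there
prefixed <- sub(pattern, "_", variable, perl = TRUE)

matched
prefixed
```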
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
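The "grep for the presence of _ and build the columns manually" route mentioned in the question can be sketched in base R as follows. The column names prefix and label are illustrative, and the sketch assumes at most the first underscore matters.

```r
X <- data.frame(value = c(1, 2, 3, 4),
                variable = c("cost", "cost", "reed_cost", "reed_cost"),
                stringsAsFactors = FALSE)

has_sep <- grepl("_", X$variable, fixed = TRUE)

# Text before the first underscore, or "" when there is no underscore
X$prefix <- ifelse(has_sep, sub("_.*$", "", X$variable), "")

# Text after the first underscore, or the whole label when there is none
X$label <- sub("^[^_]*_", "", X$variable)

X[c("value", "prefix", "label")]
```

This avoids modifying the original variable column and does not depend on how separate treats a leading delimiter.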

regular expression in R: how to extract characters from a string

I have a vector of strings, each containing the last and first names of one or more authors. I would like to extract the last name of each author in each string. What I know is that the name that comes first is always the last name of an author (the first author), and the last names of the other authors are everything that is between a ; and a ,. For example, in the following string:
tutu <- "goulenok, tiphaine miquel; meune, christophe; gossec, laure; dougados, maxime; kahan, andre; allanore, yannick"
I would like to extract:
"goulenok" "meune" "gossec" "dougados" "kahan" "allanore"
The last name may include punctuation characters such as ' or - but always be between a ; and a ,
Any idea?
> sub(",.*$", "", strsplit(tutu, ";[ ]+")[[1]])
[1] "goulenok" "meune" "gossec" "dougados" "kahan" "allanore"
Here is an approach that uses the gsubfn package:
library(gsubfn)
unlist(strapplyc(tutu, "(?:^|;) *([^,]+)"))
This is a bit more blunt but also works:
sapply(strsplit(strsplit(tutu, "; *")[[1]], ","), "[", 1)
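Another way to express the rule "everything up to a comma, starting at the beginning or after a semicolon" is a single gregexpr call with a lookahead. This is a sketch assuming every author entry has the form "last, first" and that last names contain no comma or semicolon.

```r
tutu <- "goulenok, tiphaine miquel; meune, christophe; gossec, laure; dougados, maxime; kahan, andre; allanore, yannick"

# [^,;]+ is a run without separators; (?=,) keeps only runs directly followed by a comma
m <- gregexpr("[^,;]+(?=,)", tutu, perl = TRUE)

# Runs that start after "; " carry a leading space, so trim it
last_names <- trimws(regmatches(tutu, m)[[1]])
last_names
```

Punctuation such as ' or - inside a last name is kept, since only , and ; are excluded from the run.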

conditional strsplit

I have a dataframe where one of the columns contains a set of names. I would like to stringsplit a portion of the column names and have done so as follows:
DF$newname <- sapply(strsplit(as.character(DF$oldname), "_"), '[', 5)
in this example the fifth part of the split contains the name part of the character string. The problem is that this dataset contains $oldname names that are in different formats. In the first format the name is as follows where XXX are numbers:
xxx_xxx_xxx_xxx_name_xx (name is in fifth position)
and the second format the $oldname looks like this
xxx_xxx_xxx_xxx_xxx_name_xx (name is in sixth position)
I was thinking that I could use an ifelse command from within a function but am running into a little bit of trouble with the following code:
namesplit = function(df){
x <- strsplit(as.character(df$oldname), "_"), '[', 5)
y <- strsplit(as.character(df$oldname), "_"), '[', 6)
ifelse(is.character(x),x,y) }
DF$newname <- sapply(DF,namesplit)
This code doesn't work, as I know I can't use the [ in this way, but I am not sure of the best way. While I think I could get this working within a for loop, I would prefer to find a way to extract the names that would allow me to use an apply.
thanks.
You can easily do this using gsub
names <- c('xxx_xxx_xxx_xxx_xxx_name1_xx', 'xxx_xxx_xxx_xxx_name2_xx')
gsub("^.*_([[:alnum:]]+)_.*$", "\\1", names)
[1] "name1" "name2"
If the name is the penultimate portion how about this:
x <- c("xxx_xxx_xxx_xxx_name_xx", "xxx_xxx_xxx_xxx_xxx_name_xx")
namesplit = function(x){
x <- strsplit(as.character(x), "_")
sapply(x, function(x) x[length(x)-1])
}
HTH
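The same penultimate-field idea can be sketched with vapply, which guarantees a character vector of the same length as the input. The name newname mirrors the column from the question. The sketch assumes every string has at least two underscore-separated fields.

```r
x <- c("xxx_xxx_xxx_xxx_name_xx", "xxx_xxx_xxx_xxx_xxx_name_xx")

# Split on the literal underscore; no regex needed
parts <- strsplit(x, "_", fixed = TRUE)

# Pick the second-to-last field regardless of how many fields precede it
newname <- vapply(parts, function(p) p[length(p) - 1L], character(1))
newname
```

Applied to the data frame in the question, x would be as.character(DF$oldname) and the result would be assigned to DF$newname.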

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
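For readers coming from the Excel formula in the question, a fairly direct base R counterpart of MID plus FIND is substr plus regexpr. This is a sketch assuming every element contains at least one space.

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Position of the first space in each string (like FIND(" ", ...))
first_space <- as.integer(regexpr(" ", x, fixed = TRUE))

# Characters 1 .. first_space - 1 (like MID(..., 1, FIND(...) - 1))
ids <- substr(x, 1, first_space - 1)
ids
```

If an element has no space, regexpr returns -1 and substr yields an empty string for it, so such inputs would need separate handling.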
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
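A rough way to time two of the approaches above is system.time on a larger vector. This is only a sketch: the repetition count is arbitrary and the timings depend on the machine, so no expected numbers are shown.

```r
# Repeat the sample data so the timings are measurable
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), times = 10000)

t_sub <- system.time(r_sub <- sub(" .*", "", x))
t_split <- system.time(
  r_split <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 1L)
)

# Both approaches should give the same result before comparing speed
stopifnot(identical(r_sub, r_split))
c(sub = t_sub[["elapsed"]], strsplit = t_split[["elapsed"]])
```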