Subsetting data using regular expressions in R - regex

I want to extract specific information from within a column in a data frame and add it to a new column in the same data frame. The complication is that some rows do not contain the information I want to extract (the 6 characters after "UniProt:") at all, while others have multiple occurrences; I want all occurrences kept, since this column will hold the identifiers in my data frame.
Here's an example; I've copied a few rows of the column Fasta.headers from my data frame:
Row 1:
H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1
Row 2:
C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1
Row 3:
ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1
Row 4:
T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1
I want the output to be:
H2L0A8;Q9TXU2
O44447
O61907;Q76NP4;P34454

Here strapplyc from the gsubfn package extracts the desired strings from x and sapply collapses multiple strings into a single string separated by semicolons:
library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")
giving:
[1] "H2L0A8;Q9TXU2" "O44447" ""
[4] "O61907;Q76NP4;P34454"
where x is:
x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
"C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
"ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
"T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")

An alternative uses the infrequently used replacement function regmatches<-:
# keep only the matches of "UniProt:" plus the next 7 characters (the 6-character ID and its trailing ";")
# by blanking out everything that does not match, then remove the "UniProt:" prefix
regmatches(x, gregexpr("UniProt:.{7}", x), invert = TRUE) <- ""
gsub("UniProt:", "", x)
#[1] "H2L0A8;Q9TXU2;"
#[2] "O44447;"
#[3] ""
#[4] "O61907;Q76NP4;P34454;"
You can also get there with lookaheads and lookbehinds by passing perl=TRUE to the regex functions:
sapply(regmatches(x, gregexpr("(?<=UniProt:).+?(?=;)", x, perl = TRUE)),
       paste, collapse = ";")
#[1] "H2L0A8;Q9TXU2" "O44447"
#[3] "" "O61907;Q76NP4;P34454"

Related

Using gsub in R to remove values in Zip Code field

I have a data frame that contains columns of values, one of which is United States Postal Zip codes.
Row_num  Restaurant  Address              City      State  Zip
26698    m           1460 Memorial Drive  Chicopee  MA     01020-3964
For this entry, I want to keep only the 5-digit zip code 01020 and remove the "-3964" after it, and do this for every entry in my data frame. Right now the Zip column is being treated as a chr by R.
I have tried the following gsub code:
df$Zip <- gsub(df$Zip, pattern="-[0,9]{0,4}", replacement = "")
However, all that does is replace the "-" with nothing. Not only is that not what I want, it is also not what I expected, so any help as to how gsub behaves and how to get the desired result would be appreciated.
Thank you!
Edit: I have found out through trial and error that this block of code works as well
df$Zip <- gsub(df$Zip, pattern="-.*", replacement = "")
The character class you defined contains only three elements: 0, ",", and 9. Inside character class brackets you need to use a dash as the range operator, so try:
df$Zip <- gsub(df$Zip, pattern="-[0-9]{0,4}", replacement = "")

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. My code, however, returns a list of numbers instead of the dataframe columns (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone as I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default in grep is value=FALSE, so it gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
Here is a solution in dplyr:
library(dplyr)
your_df %>%
  select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
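A sketch of what that could look like here (the regex is an assumption based on the column names shown above):
library(dplyr)
# keep columns whose names start with the prefix and contain "stone" somewhere after it
ingame %>%
  select(matches("^stat\\.mineBlock\\.minecraft\\..*stone"))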
My answer is based on this SO post. As for the regex, you were very close.
The [...] brackets create a character class that matches a single character from the set defined inside them, and that is the main reason your pattern was not working. Also, perl=T is generally safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)

Correct wrongly formatted dates

I have some incorrect dates between good formatted dates, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the leading dashes, but the trailing -01 or -1 also needs to be removed, so that the corrected values are:
desired <- c("1.1.11","1.11.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05")
What I'm struggling with is the -01 part, since removing these would also remove part of the correctly formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters at the start, or a -01 or -1 at the end that is not preceded by a dash and two digits.
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of data.frame.
The solution itself just matches one date pattern or the other and drops the rest by replacing the whole string with the captured subpattern.
I observe that an illegal suffix (-01 or -1) only occurs when the date is prefixed with -1 or --1, i.e. when it has leading dashes.
You could first take all the values into a vector, so you will have "--1.1.11-01", "--1.11.12-1", "--1.1.13-01", "--1.1.14-01", "--1.10.10-01" and "-1.10.11-01".
Now you can check the prefix: if it is -1 or --1, you can mark that entry to have its -01 suffix removed as well, as shown in the sketch below.
Given the input pattern above, I feel this strategy would work. Please let me know if it does.
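A minimal sketch of that strategy in base R, assuming the vector x defined in the answer above:
# flag the entries with leading dashes; only those carry the bogus -01 / -1 suffix
bad <- grepl("^-+", x)
x[bad] <- sub("^-+", "", x[bad])    # strip the leading dashes
x[bad] <- sub("-0?1$", "", x[bad])  # strip the trailing -01 or -1
x
# [1] "1.1.11"     "1.11.12"    "1.1.13"     "1.1.14"     "1.10.10"
# [6] "1.10.11"    "1.10.12"    "2010-03-31" "2010-04-01" "2010-04-05"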

conditional string splitting in R (using tidyr)

I have a data frame like this:
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
I'd like to split the variable column into two; one column to indicate if the variable is a 'cost' and another column to indicate whether or not the variable is "reed". I cannot seem to figure out the right regex for the split (e.g. using tidyr)
If my data were something nicer, say:
Y <- data.frame(value = c(1,2,3,4),
variable = c("adjusted_cost", "adjusted_cost", "reed_cost", "reed_cost"))
Then this is trivial with tidyr:
separate(Y, variable, c("Type", "Model"), "_")
and bingo. Instead, it looks like I need some kind of conditional statement to split on "_" if it is present, and otherwise split on the start of the pattern ("^").
I tried:
separate(X, variable, c("Policy-cost", "Reed"), "(?(_)_|^)", perl=TRUE)
but no luck. I realize I cannot even split to an empty string successfully:
separate(X, variable, c("Policy-cost", "Reed"), "^", perl=TRUE)
How should I do this?
Edit Note that this is a minimal example of a larger problem, in which there are many possible variables (not just cost and reed_cost) so I do not want to string match each one.
I am looking for a solution that splits arbitrary variables by the _ pattern if present and otherwise splits them into a blank string and the original label.
I also realize I could just grep for the presence of _ and then construct the columns manually. That's fine if rather less elegant; it seems there should be a way to split on a string using a conditional that can return an empty string...
Assuming you may or may not have a separator and that cost and reed aren't necessarily mutually exclusive, why not search for the specific string instead of the separator?
Example:
library(stringr)
X <- data.frame(value = c(1,2,3,4),
variable = c("cost", "cost", "reed_cost", "reed_cost"))
X$cost <- str_detect(X$variable,"cost")
X$reed <- str_detect(X$variable,"reed")
You could try:
# prepend "_" to values that have no underscore, so every value splits into two pieces
X$variable <- ifelse(!grepl("_", X$variable), paste0("_", X$variable), as.character(X$variable))
separate(X, variable, c("Policy-cost", "Reed"), "_")
#  value Policy-cost Reed
#1     1             cost
#2     2             cost
#3     3        reed cost
#4     4        reed cost
Or
X$variable <- gsub("\\b(?=[A-Za-z]+\\b)", "_", X$variable, perl=T)
X$variable
#[1] "_cost" "_cost" "reed_cost" "reed_cost"
separate(X, variable, c("Policy-cost", "Reed"), "_")
Explanation
\\b(?=[A-Za-z]+\\b) : matches a word boundary \\b and looks ahead for letters followed by another word boundary. The third and fourth elements do not match (the underscore is a word character, so the lookahead's trailing \\b never succeeds there), so they are left unchanged.
Another approach with base R:
cbind(X["value"],
setNames(as.data.frame(t(sapply(strsplit(as.character(X$variable), "_"),
function(x)
if (length(x) == 1) c("", x)
else x))),
c("Policy-cost", "Reed")))
#   value Policy-cost Reed
# 1     1             cost
# 2     2             cost
# 3     3        reed cost
# 4     4        reed cost
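Note that newer versions of tidyr can also handle this directly through separate()'s fill argument; a sketch, assuming the original X from the question (missing pieces are filled with NA rather than an empty string):
library(tidyr)
# fill = "left" pads the missing left-hand piece with NA when there is no "_"
separate(X, variable, c("Policy-cost", "Reed"), sep = "_", fill = "left")
#   value Policy-cost Reed
# 1     1        <NA> cost
# 2     2        <NA> cost
# 3     3        reed cost
# 4     4        reed cost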

extract partial string based on pattern in r

I would like to extract a partial string from a list of file names. I don't know how to define the pattern of the strings. Thank you for your help.
library(stringr)
names = c("GAPIT..flowerdate.GWAS.Results.csv","GAPIT..flwrcolor.GWAS.Results.csv",
"GAPIT..height.GWAS.Results.csv","GAPIT..matdate.GWAS.Results.csv")
# I want to extract out "flowerdate", "flwrcolor", "height" and "matdate"
traits <- str_extract_all(string = files, pattern = "..*.")
# the result is not what I want.
You can also use regmatches
> regmatches(c, regexpr("[[:lower:]]+", c))
[1] "flowerdate" "flwrcolor" "height" "matdate"
I encourage you not to use c as a variable name, because you're overwriting the c function.
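For reference, the same idea applied to the names vector from the question (sketched here):
# regexpr() finds the first run of lowercase letters in each file name
regmatches(names, regexpr("[[:lower:]]+", names))
# [1] "flowerdate" "flwrcolor"  "height"     "matdate"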
I borrow the answer from Roman Luštrik for my previous question “How to extract out a partial name as new column name in a data frame”
traits <- unlist(lapply(strsplit(names, "\\."), "[[", 3))
Use sub:
sub(".*\\.{2}(.+?)\\..*", "\\1", names)
# [1] "flowerdate" "flwrcolor" "height" "matdate"
Here are a few solutions. The first two do not use regular expressions at all. The last one uses a single gsub:
1) read.table. This assumes the desired string is always the 3rd field:
read.table(text = names, sep = ".", as.is = TRUE)[[3]]
2) strsplit. This assumes the desired string has more than 3 characters and is lower case:
sapply(strsplit(names, "[.]"), Filter, f = function(x) nchar(x) > 3 & tolower(x) == x)
3) gsub. This assumes that two dots precede the string and that one dot plus junk not containing two successive dots comes afterwards:
gsub(".*[.]{2}|[.].*", "", names)