R Subset Dataset Using Regular Expression - regex

Is there a way to make the R code below run quicker (i.e. vectorized to avoid use of for loops)?
My example contains two data frames. The first has dimension n1*p, and one of its p columns contains names. The second is a single column (n2*1) that also contains names. I want to keep every row of the first data frame whose name contains any name from the second data frame as a substring. Sorry for the clumsy explanation.
Example (Data frame 1):
x y
Doggy 1
Hello 2
Hi Dog 3
Zebra 4
Example (Data frame 2)
z
Hello
Dog
So in the above example I want to keep rows 1, 2 and 3 but NOT 4, since "Dog" appears in "Doggy" and "Hi Dog", and "Hello" appears in "Hello". Row 4 is excluded because neither "Hello" nor "Dog" appears in "Zebra".
Below is my R code to do this. It runs fine, but in my real task data frame 1 has 1 million rows and data frame 2 has 50 items to match on, so it runs pretty slowly. Any suggestions on how to speed this up are appreciated.
x <- c("Doggy", "Hello", "Hi Dog", "Zebra")
y <- 1:4
dat <- as.data.frame(cbind(x,y))
names(dat) <- c("x","y")
z <- as.data.frame(c("Hello", "Dog"))
names(z) <- c("z")
dat$flag <- NA
for(j in 1:length(z$z)){
  for(i in 1:dim(dat)[1]){
    if ( is.na(dat$flag[i])==TRUE ) {
      dat$flag[i] <- length(grep(paste(z[j,1]), dat[i,1], perl=TRUE, value=TRUE))
    } else {
      if (dat$flag[i]==0) {
        dat$flag[i] <- length(grep(paste(z[j,1]), dat[i,1], perl=TRUE, value=TRUE))
      } else {
        if (dat$flag[i]==1) {
          dat$flag[i] <- 1  # already matched an earlier pattern; keep the flag
        }
      }
    }
  }
}
dat1 <- subset(dat, flag==1)
dat1

Try this:
dat[grep(paste(z$z, collapse = "|"), dat$x), ]
or
subset(dat, grepl(paste(z$z, collapse = "|"), x))
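For reference, here is a small self-contained run of the idea above on the question's toy data. One caveat: paste(..., collapse = "|") builds a regular expression, so names containing metacharacters such as "." or "(" would be interpreted as regex syntax. The second variant below is only a sketch of one way around that, using fixed = TRUE on each name and combining the results.

```r
# Toy data from the question
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4,
                  stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# One alternation pattern, "Hello|Dog", applied to the whole column at once
pattern <- paste(z$z, collapse = "|")
keep <- grepl(pattern, dat$x)
dat[keep, ]

# Variant that treats every name as a literal string:
# one vectorised grepl() per name, combined with a logical OR
keep_fixed <- Reduce(`|`, lapply(z$z, function(p) grepl(p, dat$x, fixed = TRUE)))
dat[keep_fixed, ]
```

Both versions loop over at most the 50 patterns, never over the million rows, which is where the original double loop spends its time.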

This question inspired a boolean text search function (%bs%) in the qdap package and thus I thought I'd share the approach to this question:
library(qdap)
dat[dat$x %bs% paste(z$z, collapse = "OR"), ]
In this case no less typing but if multiple or/and statements are involved this may be a useful approach.

Related

Save results for each file of a list of files looping through a factor variable in R. Vector does not update

I am using a list of files, and I am trying to create a data frame that contains: for each sample, the percentage of two particular "GT" types by the levels of another factor variable called "chr" (with 1 to 24 levels).
It would have to look like this:
The problem I keep getting is that the vector never gets updated for the ith sample, it only keeps the first vector created. And then I am not sure how to save that updated vector on my data frame (df).
vector_chr <- vector();
for (i in seq_along(list_files)) {
GT <- list_files[[i]][,9]
chr <- list_files[[i]][,3]
GT$chr <- chr$chr # creating one df with both GT and chr
for (j in unique(GT$chr)){
dat_list = split(GT, GT$chr) # split data frames by chr (1 to 24)
table <- table(dat_list[[j]][,1]) # take GT and make a table
sum <- sum(table[3:4]) # sum GTs 3 and 4
perc <- sum/nrow(GT)
vector_chr <- c(vector_chr,perc) # assign the 24 percentages to a vector
}
df <- data.frame(matrix(ncol = 25, nrow = length(files)))
x <- c("Sample", "chr1", "chr2", "chr3",
"chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12","chr13", "chr14", "chr15", "chr16",
"chr17", "chr18", "chr19", "chr20", "chr21", "chr22",
"chrX", "chrXY")
colnames(df) <- x
df$Sample <- names(list_files)
df[i,2:25] <- vector_chr # assign the 24 percentages for EACH sample
}
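Going only by the code as posted, two things stand out: vector_chr is created once before the outer loop and keeps growing across samples, and df is recreated inside the loop, which discards the rows filled in earlier. Below is a minimal sketch of one way to restructure it. The toy input, the column names GT and chr, the three chromosome labels and the two target GT codes are all hypothetical stand-ins, and the share is computed within each chromosome, which may differ from the denominator intended in the question (the original divides by nrow(GT)).

```r
# Hypothetical toy input: a named list of data frames, each with columns
# GT (genotype code) and chr (chromosome label). All names are assumptions.
set.seed(1)
make_sample <- function(n) {
  data.frame(GT = sample(c("AA", "AB", "BB", "NC"), n, replace = TRUE),
             chr = sample(c("1", "2", "X"), n, replace = TRUE),
             stringsAsFactors = FALSE)
}
list_files <- list(s1 = make_sample(50), s2 = make_sample(80))

chr_levels <- c("1", "2", "X")
target_gt <- c("BB", "NC")   # hypothetical GT types of interest

# Allocate the result once, before the loop
df <- data.frame(Sample = names(list_files),
                 matrix(NA_real_, nrow = length(list_files),
                        ncol = length(chr_levels)))
names(df)[-1] <- paste0("chr", chr_levels)

for (i in seq_along(list_files)) {
  d <- list_files[[i]]
  # A fresh vector for every sample: share of target GTs within each chr
  perc <- vapply(chr_levels, function(ch) {
    mean(d$GT[d$chr == ch] %in% target_gt)
  }, numeric(1))
  df[i, -1] <- perc
}
df
```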

Need to extract 4 spaces of text before the occurrence of a word that appears in a column in a df, and may occur several times per row

I need to extract text (4 characters) before the occurrence of the word "exception" per row in a column of my dataframe. For example, see two lines of my data below:
MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)
MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014);
As you can see, "exception" occurs more than once per line, is sometimes capitalized and sometimes not, and is preceded by different text. I need to extract the "FMV", "MM", and "ial" that come before it in each case. The goal is to extract something like the following (comma separation would be fine but is not needed):
"FMVMM"
"FMVial"
I am planning on making all text lower case for simplicity, but I cannot find a regex to extract the 4 characters of text I need after that. Any recommendations?
You basically need strsplit, substr and nchar:
t1 <- "1.MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)"
t2 <- "2.MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014); "
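f <- function(x){
  # split at every "Exception"/"exception"; each piece but the last ends right before one
  tmp <- strsplit(x, "[Ee]xception")[[1]]
  ret <- character(max(length(tmp) - 1, 0))
  for(i in seq_along(ret)){
    # last 4 characters of the piece preceding the i-th occurrence
    ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
  }
  return(paste(ret, collapse = ","))
}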
f(t1) #gives "FMV , MM "
f(t2) #gives "FMV ,ial "
Avoiding the loop would be better but for now, this should work.
Edit by Qaswed: Improved the function (shorter and does not need tolower any more).
Edit by TigeronFire:
@Qaswed, thank you for your guidance. The answer, however, poses another problem: t1 and t2 are only two rows of a dataframe 10000 rows long. I tried to add the column logic to the function you built in a few different ways, but I always received the error message:
"Error in strsplit(BOSSMWF_practice$Documents, "[Ee]xception") : non-character argument"
I tried the following with reference to dataframe column BOSSMWF_practice$Documents:
f <- function(x){
tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
ret <- array(dim = length(tmp) - 1)
for(i in 1:length(ret)){
ret[i] <- substr(tmp[i], start = nchar(tmp[i]) - 3, stop = nchar(tmp[i]))
}
return(paste(ret, collapse = ","))
}
AND:
f <- function(x){
BOSSMWF_practice$tmp <- strsplit(BOSSMWF_practice$Documents, "[Ee]xception")[[1]]
BOSSMWF_practice$ret <- array(dim = length(BOSSMWF_practice$tmp) - 1)
for(i in 1:length(BOSSMWF_practice$ret)){
BOSSMWF_practice$ret[i] <- substr(BOSSMWF_practice$tmp[i], start = nchar(BOSSMWF_practice$tmp[i]) - 3, stop = nchar(BOSSMWF_practice$tmp[i]))
}
return(paste(ret, collapse = ","))
}
I attempted to run the function on my applicable column using both function setups
BOSSMWF_practice$Funct <- f(BOSSMWF_practice$Documents)
But I always received the above error message. Can you take your advice one step further and indicate how to apply this to a dataframe and place the results in a new column?
Edit by Qaswed:
@TigeronFire you should have added a comment to my answer or edited your question rather than editing my answer. To your comment:
#if your dataset looks something like this:
df <- data.frame(variable_name = c(t1, t2))
#...use
apply(df, 1, FUN = f)
#note: there was an error in f. You need strsplit(x, ...) and not strsplit(t1, ...).
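As a loop-free alternative, here is a sketch using gregexpr with a lookahead, so that the n characters directly before each "exception" are matched in one vectorised call. The function name extract_before and its arguments are made up for illustration. The as.character() call is there because strsplit and the regex functions need character input; a factor column is one common cause of the "non-character argument" error, although whether that applies to BOSSMWF_practice$Documents is an assumption.

```r
t1 <- "MPSA: Original Version (01/16/2015); FMV Exception: Original Version (04/11/2014); MM Exception: 08.19.15 (08/19/2015)"
t2 <- "MPSA: Original Version (02/10/2015); FMV Exception: Original Version (12/18/2014); MEI FMV: V3 (12/18/2014); MEI FMV: updated (11/18/2014); Meeting Material exception: Original Version (04/21/2014); "

# Stand-in for the real data frame; the factor column mimics one possible
# source of the "non-character argument" error
docs <- data.frame(Documents = c(t1, t2), stringsAsFactors = TRUE)

# Hypothetical helper: the n characters before every occurrence of `word`
extract_before <- function(x, word = "exception", n = 4) {
  x <- as.character(x)                         # factors -> character
  pattern <- sprintf(".{%d}(?=%s)", n, word)   # n chars followed by the word
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  # trim the surrounding spaces and glue the pieces together, one string per row
  vapply(hits, function(h) paste(trimws(h), collapse = ""), character(1))
}

docs$Funct <- extract_before(docs$Documents)
docs$Funct
#[1] "FMVMM"  "FMVial"
```

Because ignore.case = TRUE handles both "Exception" and "exception", no tolower step is needed, and rows without a match come back as an empty string.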

R: replacing values in string all at once

I have a data frame that looks like this:
USequence
# 1 GATCAGATC
# 2 ATCAGAC
I'm trying to create a function that would replace all the G's with C's, A's with T's, C's with G's, and T's with A's:
USequence
# 1 CTAGTCTAG
# 2 TAGTCTG
This is what I have right now, the function accepts k, a data frame with a column named USequence.
conjugator <- function(k) {
k$USequence <- str_replace_all(k$USequence,"A","T")
k$USequence <- str_replace_all(k$USequence,"T","A")
k$USequence <- str_replace_all(k$USequence,"G","C")
k$USequence <- str_replace_all(k$USequence,"C","G")
}
However, the obvious problem is that this doesn't replace the characters all at once but in steps, which does not return the desired result. Any suggestions? Thanks
You could use chartr
df1$USequence <- chartr('GATC', 'CTAG', df1$USequence)
df1$USequence
#[1] "CTAGTCTAG" "TAGTCTG"
Or
library(gsubfn)
gsubfn('[GATC]', list(G='C', A='T', T='A', C='G'), df1$USequence)
#[1] "CTAGTCTAG" "TAGTCTG"
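Since the question asked for a function that takes the data frame, here is a minimal sketch of the chartr approach wrapped in the original conjugator signature. Note that it also returns k explicitly; as posted in the question, the last expression is an assignment, so the function only returns the modified column invisibly rather than the updated data frame.

```r
conjugator <- function(k) {
  # chartr() maps each character of the first string to the character at the
  # same position in the second, in a single pass, so the swaps A<->T and
  # G<->C cannot overwrite each other
  k$USequence <- chartr("GATC", "CTAG", k$USequence)
  k   # return the modified data frame
}

df1 <- data.frame(USequence = c("GATCAGATC", "ATCAGAC"),
                  stringsAsFactors = FALSE)
conjugator(df1)$USequence
#[1] "CTAGTCTAG" "TAGTCTG"
```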

How to add column to data.table with values from list based on regex

I have the following data.table:
id fShort
1 432-12 1245
2 3242-12 453543
3 324-32 45543
4 322-34 45343
5 2324-34 13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that includes the complete filename from the list. For this the values of column "id" need to be matched with the filename-list. If the filename starts with the "id" string, the complete filename should be returned. I use the following regex
t <- grep("432-12","432-124343.png",value=T)
which returns the correct filename.
This is how the final table should look:
id fShort fComplete
1 432-12 1245 432-124343.png
2 3242-12 453543 3242-124342345.png
3 324-32 45543 NA
4 322-34 45343 NA
5 2324-34 13543 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
m <- grep(x, filenames, value = TRUE)
if (!length(m)) NA else m})]
id fShort fComplete
1: 432-12 1245 432-124343.png
2: 3242-12 453543 3242-124342345.png
3: 324-32 45543 NA
4: 322-34 45343 NA
5: 2324-34 13543 NA
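Two details of the solution above may matter on real data: lapply makes fComplete a list column, and grep matches the id anywhere in the filename and treats it as a regular expression. A sketch of a variant that follows the "filename starts with id" wording literally is shown below. It uses a plain data.frame so that it runs without extra packages, and the helper name first_match is made up for illustration.

```r
DT <- data.frame(
  id = c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
  fShort = c("1245", "453543", "45543", "45343", "13543"),
  stringsAsFactors = FALSE)
filenames <- c("3242-124342345.png", "432-124343.png", "135-13434.jpeg")

# Hypothetical helper: first filename that starts with `id`, else NA
first_match <- function(id, files) {
  m <- files[startsWith(files, id)]   # literal prefix test, no regex involved
  if (length(m)) m[1] else NA_character_
}

# vapply() guarantees a plain character vector with one entry per id
DT$fComplete <- vapply(DT$id, first_match, character(1),
                       files = filenames, USE.NAMES = FALSE)
DT
```

With a data.table, the same vapply call should work on the right-hand side of `:=`, for example DT[, fComplete := vapply(id, first_match, character(1), files = filenames)].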
In my experience with similar functions, the regex functions sometimes return a list, so you have to account for that in the apply call. I usually work through one example manually first.
Also, in my experience apply on its own will not always return something that fits straight into a data.frame; sometimes I had to use lapply and/or unlist and data.frame to reshape the result.
Here is an answer - I am not familiar with data.tables and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by seeing what apply was outputting and adding the [1] to get the piece I needed
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$filename <- filenames1[x]

Function to subset dataframe using pattern list argument

I have a pattern list
patternlist <- list('one' = paste(c('a','b','c'),collapse="|"), 'two' = paste(1:5,collapse="|"), 'three' = paste(c('k','l','m'),collapse="|"))
that I want to select from to extract rows from a data frame
dataframez <- data.frame('letters' = c('a','b','c'), 'numbers' = 1:3, 'otherletters' = c('k','l','m'))
with this function
pattern.record <- function(x, column="letters", value="one")
{
if (column %in% names(x))
{
result <- x[grep(patternlist$value, x$column, ignore.case=T),]
}
else
{
result <- NA
}
return(result)
}
oddly enough, I get an error when I run it:
> pattern.record(dataframez)
Error in grep(patternlist$value, x$column, ignore.case = T) :
invalid 'pattern' argument
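The problem is your use of the `$` operator.
In your function, x$column looks for a column (a named element) literally called column, and patternlist$value looks for an element literally called value. Neither exists, so both return NULL.
It is far simpler here to use `[[`.
Then x[[column]] uses what column is defined as, not column as a name.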
The relevant lines in ?`$` are
Both [[ and $ select a single element of the list. The main difference is that $ does not allow computed indices, whereas [[ does. x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of [[ can be controlled using the exact argument.
You are trying to use value and column as computed indices (i.e. computing what value and column are defined as), thus you need `[[`.
The function becomes
pattern.record <- function(x, column="letters", value="one", pattern_list)
{
if (column %in% names(x))
{
result <- x[grep(pattern_list[[value]], x[[column]], ignore.case=T),]
}
else
{
result <- NA
}
return(result)
}
pattern.record(dataframez, pattern_list = patternlist)
## letters numbers otherletters
## 1 a 1 k
## 2 b 2 l
## 3 c 3 m
Note that I've also added an argument pattern_list so the function does not depend on an object named patternlist existing somewhere in the parent environments (in your case, the global environment).
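The difference between `$` and `[[` with a computed index can be seen in isolation with a small sketch (the list lst and the variable value below are made up for illustration):

```r
lst <- list(one = "a|b|c", two = "1|2|3")
value <- "one"

by_dollar  <- lst$value     # NULL: looks for an element literally named "value"
by_bracket <- lst[[value]]  # "a|b|c": uses the contents of the variable `value`

# Passing the NULL on to grep() is what triggers the error in the question
res <- try(grep(by_dollar, c("a", "x")), silent = TRUE)
inherits(res, "try-error")
#[1] TRUE

grep(by_bracket, c("a", "x"))
#[1] 1
```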