R: replacing values in string all at once - regex

I have a data frame that looks like this:
USequence
# 1 GATCAGATC
# 2 ATCAGAC
I'm trying to create a function that would replace all the G's with C's, A's with T's, C's with G's, and T's with A's:
USequence
# 1 CTAGTCTAG
# 2 TAGTCTG
This is what I have right now. The function accepts k, a data frame with a column named USequence.
conjugator <- function(k) {
k$USequence <- str_replace_all(k$USequence,"A","T")
k$USequence <- str_replace_all(k$USequence,"T","A")
k$USequence <- str_replace_all(k$USequence,"G","C")
k$USequence <- str_replace_all(k$USequence,"C","G")
}
However, the obvious problem is that this doesn't replace the characters all at once but in steps, so it does not return the desired result. Any suggestions? Thanks

You could use chartr
df1$USequence <- chartr('GATC', 'CTAG', df1$USequence)
df1$USequence
#[1] "CTAGTCTAG" "TAGTCTG"
Or
library(gsubfn)
gsubfn('[GATC]', list(G='C', A='T', T='A', C='G'), df1$USequence)
#[1] "CTAGTCTAG" "TAGTCTG"
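To tie this back to the question, here is a minimal sketch of the conjugator function rewritten around chartr. The data frame df1 is rebuilt from the example in the question, and the function returns the modified data frame explicitly, which the original version did not do.

```r
# Sketch: complement every base in one pass with chartr().
# chartr() translates character by character, so "A" -> "T" and
# "T" -> "A" cannot overwrite each other as chained replacements do.
conjugator <- function(k) {
  k$USequence <- chartr("GATC", "CTAG", k$USequence)
  k  # return the modified data frame
}

# Example data taken from the question
df1 <- data.frame(USequence = c("GATCAGATC", "ATCAGAC"),
                  stringsAsFactors = FALSE)
res <- conjugator(df1)
res$USequence
```

Because the translation happens in a single pass, the order of the letters inside the two chartr arguments does not matter as long as the positions line up.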

R: Populate a new column in a dataframe based on matching one or several possible strings

Hypothetical dataframe:
strings new column
mesh 1
foo 0
bar 0
tack 1
suture 1
I would like the new column to contain "1" if df$strings contains the strings "mesh", "tack", or "sutur". Otherwise it should display zero in the same row. I tried the following:
df$new_column <- ifelse(grepl("mesh" | "tack" | "sutur",
df$strings, ignore.case = T), "1", "0")
but got this error:
Error in "mesh" | "tack" :
operations are possible only for numeric, logical or complex types
Thanks in advance.
You want to pass a single pattern string to grepl:
df$new_column <- ifelse(grepl("mesh|tack|sutur", df$strings, ignore.case = T),
"1", "0")
will work, but the following will be faster:
df$new_column <- +(grepl("mesh|tack|sutur", df$strings, ignore.case = T))
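This returns an integer vector of 0s and 1s.
We can also use %in%, but it only matches whole strings, so the full value "suture" has to be listed instead of the fragment "sutur":
df$new_column <- as.integer(df$strings %in% c("mesh", "tack", "suture"))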
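The two approaches are not interchangeable: grepl looks for the pattern anywhere inside each string, while %in% only matches whole strings. A small sketch using the hypothetical data frame from the question makes the difference visible (the names by_regex and by_exact are only for illustration).

```r
# Data rebuilt from the question
df <- data.frame(strings = c("mesh", "foo", "bar", "tack", "suture"),
                 stringsAsFactors = FALSE)

# Substring match: "sutur" is found inside "suture"
by_regex <- +grepl("mesh|tack|sutur", df$strings, ignore.case = TRUE)

# Whole-string match: "sutur" is not equal to "suture", so row 5 gets 0
by_exact <- as.integer(df$strings %in% c("mesh", "tack", "sutur"))

data.frame(df, by_regex, by_exact)
```

If the column can contain longer text such as "surgical mesh", the grepl version is probably the one that matches the intent of the question.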

How to add column to data.table with values from list based on regex

I have the following data.table:
id fShort
1 432-12 1245
2 3242-12 453543
3 324-32 45543
4 322-34 45343
5 2324-34 13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that includes the complete filename from the list. For this the values of column "id" need to be matched with the filename-list. If the filename starts with the "id" string, the complete filename should be returned. I use the following regex
t <- grep("432-12","432-124343.png",value=T)
which returns the correct filename.
This is how the final table should look:
id fShort fComplete
1 432-12 1245 432-124343.png
2 3242-12 453543 3242-124342345.png
3 324-32 45543 NA
4 322-34 45343 NA
5 2324-34 13543 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
m <- grep(x, filenames, value = TRUE)
if (!length(m)) NA else m})]
id fShort fComplete
1: 432-12 1245 432-124343.png
2: 3242-12 453543 3242-124342345.png
3: 324-32 45543 NA
4: 322-34 45343 NA
5: 2324-34 13543 NA
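Note that grep matches the id anywhere in the filename, while the question asks for filenames that start with the id. Below is a minimal base R sketch of a prefix-only variant. The helper first_prefix_match is a hypothetical name, and it assumes that taking the first hit is acceptable when several files share a prefix. It returns a plain character vector rather than a list column.

```r
ids <- c("432-12", "3242-12", "324-32", "322-34", "2324-34")
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
files <- unlist(filenames)  # work with a character vector, not a list

# Hypothetical helper: first filename that starts with `id`, else NA
first_prefix_match <- function(id, files) {
  hit <- files[startsWith(files, id)]
  if (length(hit) == 0L) NA_character_ else hit[1L]
}

fComplete <- vapply(ids, first_prefix_match, character(1),
                    files = files, USE.NAMES = FALSE)
fComplete
```

With data.table the same helper could be used as DT[, fComplete := vapply(id, first_prefix_match, character(1), files = files)]. Using startsWith also avoids treating characters in the id as regex syntax.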
In my experience with similar functions, the regex functions sometimes return a list, so you have to account for that inside the apply call. I usually work through one example manually first.
Also, in my experience apply on its own does not always return something that fits into a data.frame; sometimes I had to use lapply and/or unlist and data.frame to reshape it.
Here is an answer. I am not familiar with data.tables and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by looking at what apply was outputting and adding the [1] to get the piece I needed.
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$fComplete <- filenames1[x]

R Subset Dataset Using Regular Expression

Is there a way to make the R code below run quicker (i.e. vectorized to avoid use of for loops)?
My example contains two data frames. The first has dimension n1*p, and one of its p columns contains names. The second is a column vector (n2*1) that also contains names. I want to keep every row of the first data frame whose name contains any of the names in the second data frame. Sorry for the brutal explanation.
Example (Data frame 1):
x y
Doggy 1
Hello 2
Hi Dog 3
Zebra 4
Example (Data frame 2)
z
Hello
Dog
So in the above example I want to keep rows 1,2,3 but NOT 4. Since "Dog" appears in "Doggy" and "Hi Dog". And "Hello" appears in "Hello". Exclude row four since no part of "Hello" or "Dog" appears in "Zebra".
Below is my R code to do this. It runs fine, but in my real task data frame 1 has 1 million rows and data frame 2 has 50 items to match on, so it runs pretty slowly. Any suggestions on how to speed this up are appreciated.
x <- c("Doggy", "Hello", "Hi Dog", "Zebra")
y <- 1:4
dat <- as.data.frame(cbind(x,y))
names(dat) <- c("x","y")
z <- as.data.frame(c("Hello", "Dog"))
names(z) <- c("z")
dat$flag <- NA
for(j in 1:length(z$z)){
for(i in 1:dim(dat)[1]){
if ( is.na(dat$flag[i])==TRUE ) {
dat$flag[i] <- length(grep(paste(z[j,1]), dat[i,1], perl=TRUE, value=TRUE))
} else {
if (dat$flag[i]==0) {
dat$flag[i] <- length(grep(paste(z[j,1]), dat[i,1], perl=TRUE, value=TRUE))
} else {
if (dat$flag[i]==1) {
dat$flag[i]==1
}
}
}
}
}
dat1 <- subset(dat, flag==1)
dat1
Try this:
dat[grep(paste(z$z, collapse = "|"), dat$x), ]
or
subset(dat, grepl(paste(z$z, collapse = "|"), x))
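As a self-contained sketch of the idea, the example below rebuilds the two data frames from the question and filters with a single vectorized grepl call. It also shows a variant with fixed = TRUE, which is my own addition for the case where the search terms might contain regex metacharacters such as "." or "(".

```r
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"),
                  y = 1:4, stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# One alternation pattern, "Hello|Dog", applied to the whole column at once
pattern <- paste(z$z, collapse = "|")
dat1 <- dat[grepl(pattern, dat$x), ]

# Variant: treat each term as a literal string and OR the results together
keep <- Reduce(`|`, lapply(z$z, grepl, x = dat$x, fixed = TRUE))
dat2 <- dat[keep, ]
dat1
```

Both versions make one pass over the column per call instead of one grep call per row, which is where the nested loops spend their time.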
This question inspired a boolean text search function (%bs%) in the qdap package and thus I thought I'd share the approach to this question:
library(qdap)
dat[dat$x %bs% paste(z$z, collapse = "OR"), ]
In this case it is no less typing, but if multiple or/and statements are involved this may be a useful approach.

Function to subset dataframe using pattern list argument

I have a pattern list
patternlist <- list('one' = paste(c('a','b','c'),collapse="|"), 'two' = paste(1:5,collapse="|"), 'three' = paste(c('k','l','m'),collapse="|"))
that I want to select from to extract rows from a data frame
dataframez <- data.frame('letters' = c('a','b','c'), 'numbers' = 1:3, 'otherletters' = c('k','l','m'))
with this function
pattern.record <- function(x, column="letters", value="one")
{
if (column %in% names(x))
{
result <- x[grep(patternlist$value, x$column, ignore.case=T),]
}
else
{
result <- NA
}
return(result)
}
oddly enough, I get an error when I run it:
> pattern.record(dataframez)
Error in grep(patternlist$value, x$column, ignore.case = T) :
invalid 'pattern' argument
The problem is your use of the `$` operator.
In your function, x$column looks for a column (a named element) literally called column.
It is far simpler here to use `[[`
Then x[[column]] uses what column is defined as, not column as a name.
The relevant lines in ?`$` are
Both [[ and $ select a single element of the list. The main difference is that $ does not allow computed indices, whereas [[ does. x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of [[ can be controlled using the exact argument.
You are trying to use value and column as computed indices (i.e. computing what value and column are defined as), thus you need `[[`.
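A short sketch of the difference, using a cut-down version of the data frame from the question (the variable names by_dollar and by_brackets are only for illustration):

```r
dataframez <- data.frame(letters = c("a", "b", "c"), numbers = 1:3,
                         stringsAsFactors = FALSE)
column <- "letters"

# `$` takes the name literally: there is no column called "column"
by_dollar <- dataframez$column

# `[[` evaluates `column` first and then looks up "letters"
by_brackets <- dataframez[[column]]

is.null(by_dollar)
by_brackets
```

Since patternlist$value is NULL for the same reason, grep receives NULL as its pattern, which is what triggers the "invalid 'pattern' argument" error.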
The function becomes
pattern.record <- function(x, column="letters", value="one", pattern_list)
{
if (column %in% names(x))
{
result <- x[grep(pattern_list[[value]], x[[column]], ignore.case=T),]
}
else
{
result <- NA
}
return(result)
}
pattern.record(dataframez, pattern_list = patternlist)
## letters numbers otherletters
## 1 a 1 k
## 2 b 2 l
## 3 c 3 m
Note that I've also added an argument pattern_list so the function does not depend on an object named patternlist existing somewhere in the parent environments (in your case, the global environment).

Can one add a data.frame to itself?

I want to append or add a data.frame to itself...
Much in the same way that one adds:
n <- n + t
I have a function that creates a data.frame.
I have been using:
g <- function(compareA,compareB) {
for (i in 1:1000) {
ttr <- t.test(compareA, compareA, var.equal = TRUE)
tt_pvalues[i] <- ttr$p.value
}
name_tag <- paste(nameA, nameB, sep = "_Vs_")
tt_titles <- data.frame(name_tag, tt_titles)
# character vector which I want to add to a list
ALL_pvalues <- data.frame(tt_pvalues, ALL_pvalues)
# adding a numeric vector of values to a larger data.frame
}
Would cbind be better here?
There are two methods that "add or append" data to a data.frame by columns and one that appends by rows. Assuming tag is the data.frame and tt_titles is a vector with as many elements as tag has rows, either of these would work:
tag <- cbind(tag, tt_titles)
# tt_titles could also be a data.frame with same number of rows
Or:
tag[["tt_titles"]] <- tt_titles
Now let's assume instead that we have two data.frames with the same column names:
bigger.df <- rbind(tag, tag2)
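Here is a small runnable sketch of these operations. The contents of tag, tag2 and tt_titles are made up for illustration, since the question does not show the real data. The last step is my own suggestion for the loop case: collect the pieces in a list and bind them once at the end.

```r
# Hypothetical example data
tag <- data.frame(name_tag = c("A_Vs_B", "A_Vs_C"), stringsAsFactors = FALSE)
tt_titles <- c("first", "second")  # one element per row of `tag`

# Append by column: two equivalent ways
tag_cols <- cbind(tag, tt_titles)
tag_cols2 <- tag
tag_cols2[["tt_titles"]] <- tt_titles

# Append by row: both data frames need the same column names
tag2 <- data.frame(name_tag = "B_Vs_C", stringsAsFactors = FALSE)
bigger.df <- rbind(tag, tag2)

# Inside a loop, collecting pieces in a list and binding once is
# usually simpler than growing a data.frame on every iteration
pieces <- list(tag, tag2)
all_rows <- do.call(rbind, pieces)
```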