grepl() and lapply to fill missing values - regex

I have the following data as an example:
fruit.region <- data.frame(full = c("US red apple", "bombay Asia mango", "gold kiwi New Zealand"),
                           name = c("apple", "mango", "kiwi"),
                           country = c("US", "Asia", "New Zealand"),
                           type = c("red", "bombay", "gold"))
I would like R to look at items in the "full" column that don't have values for "name", "country" and "type" and see whether they match existing items. For instance, if full had a 4th row with "bombay US mango", it should be able to identify that the country should read US, that bombay belongs under type, and that mango belongs under name.
This is what I have so far, which merely identifies (logically) where the items match:
new.entry <- c("bombay US mango")
split.new.entry <- strsplit(new.entry, " ")
lapply(split.new.entry, function(x){
  check <- grepl(x, fruit.region, ignore.case = TRUE)
  print(check)
})
I'm at a bit of a standstill. I've read through a number of regex posts and the R help pages on grepl, but I haven't been able to find a good solution. What I have doesn't fully identify a logical "match" vector, so I'm unable to subset and use an if statement to concatenate the different elements. Ideally, I'd like to be able to replace these elements in data.table form, as my fruit.region will actually be a data.table. Does anyone have any suggestions on the best approach?

Using the str_detect function from the stringr package, this gives a list ready to rbind:
library(stringr)
addnewrow <- function(newfruit){
  # match the new entry against each of the name, country and type columns
  z <- lapply(fruit.region[, 2:4], function(x) x[str_detect(newfruit, x)])
  z$full <- newfruit
  z
}
addnewrow(new.entry)
$name
[1] "mango"
$country
[1] "US"
$type
[1] "bombay"
$full
[1] "bombay US mango"
The next step would depend on your desired outcome - if you only want to add one, try:
rbind(fruit.region, addnewrow(new.entry))
If you have a lot:
z <- do.call(rbind, lapply(c(new.entry, new.entry), addnewrow))
rbind(fruit.region, z)
NB make sure your columns are character first:
fruit.region[] <- lapply(fruit.region, as.character)
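Since the question mentions that the real data will live in a data.table, here is a minimal sketch of the same idea adapted to data.table (the helper name addnewrow_dt and the use of rbindlist are my own choices, not part of the original answer; it assumes the columns have already been converted to character as per the note above):
library(data.table)
library(stringr)

fruit.dt <- as.data.table(fruit.region)

# build a one-row data.table for a new "full" string by matching it against
# the known name/country/type values, then append it with rbindlist()
addnewrow_dt <- function(newfruit) {
  z <- lapply(fruit.dt[, .(name, country, type)],
              function(x) unique(x[str_detect(newfruit, fixed(as.character(x)))]))
  z$full <- newfruit
  as.data.table(z)
}

rbindlist(list(fruit.dt, addnewrow_dt("bombay US mango")), use.names = TRUE)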


lapply with a list of lists

I believe there must be some related questions in the community, but I failed to find one that is informative for my case.
Basically, I am trying to produce three plots with the lapply function. Below is my code.
p_grid <- seq(0, 1, length.out = 20)
prior_uni <- rep(1, 20)
prior_bi <- ifelse(p_grid < 0.5, 0, 1)
prior_exp <- exp(-5 * abs(p_grid - 0.5))
prior_list <- list(prior_uni, prior_bi, prior_exp)
ggs <- lapply(prior_list, function(x){
  likelihood <- dbinom(6, 9, prob = p_grid)
  unstd.post <- likelihood * x
  std.post <- unstd.post / sum(unstd.post)
  plot_post <- plot(p_grid, std.post, type = "b", ylim = c(0, max(x)))
  mtext(paste0(x))
})
By doing so, I get the plots, but the mtext function does not work well. Instead of showing the titles prior_uni, prior_bi and prior_exp respectively, it prints every single value of the list element (e.g., of prior_uni), with the values overlapping each other.
It is a bit confusing to me. According to the plot results, the function within lapply seems to take the three elements of prior_list, not every single value. In other words, x is each of the three elements of prior_list, not the sixty (3*20) individual values, yet mtext seems to behave the opposite way.
I hope I have expressed this clearly. Looking forward to your responses.
Best regards,
Jilong
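
For what it's worth, a minimal sketch of one way to get a single title per plot, assuming the prior names are the desired titles (the named list and the use of Map are my own suggestion, not part of the question). Inside lapply, x is the whole length-20 vector, so paste0(x) yields 20 strings and mtext() overprints them all:
# name the list so each prior carries the title to print
prior_list <- list(prior_uni = prior_uni, prior_bi = prior_bi, prior_exp = prior_exp)

# iterate over the priors and their names together; nm is a single string,
# so mtext(nm) writes one title per plot
invisible(Map(function(x, nm) {
  likelihood <- dbinom(6, 9, prob = p_grid)
  unstd.post <- likelihood * x
  std.post <- unstd.post / sum(unstd.post)
  plot(p_grid, std.post, type = "b", ylim = c(0, max(x)))
  mtext(nm)
}, prior_list, names(prior_list)))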

Extract URL parameters and values in R

We want to extract parameters and values from a given URL like
http://www.exemple.com/?a=1&b=2&c=3#def
Using xml2::url_parse we were able to parse a URL into its component pieces. However, we still want to divide the query into its elements, matching the regular expression:
([^?&=#]+)=([^&#]*)
Desired output
a=1
b=2
c=3
Use the urltools package to parse URLs.
> u <- "http://www.exemple.com/?a=1&b=2&c=3#def"
> strsplit(urltools::parameters(u), "&")[[1L]]
[1] "a=1" "b=2" "c=3"
> urltools::param_get(u, "b")
b
1 2
We can try
library(stringr)
matrix(str_extract_all(str1, "[a-z](?=\\=)|(?<=\\=)\\d+")[[1]], ncol=2, byrow=TRUE)
Or if we need the = also
str_extract_all(str1, "[a-z]=\\d+")[[1]]
#[1] "a=1" "b=2" "c=3"
data
str1 <- "http://www.exemple.com/?a=1&b=2&c=3#def"
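For completeness, the key=value pairs can also be pulled out in base R with the exact pattern from the question; a minimal sketch using regmatches() and gregexpr():
u <- "http://www.exemple.com/?a=1&b=2&c=3#def"
# extract every "key=value" pair matched by the pattern from the question
regmatches(u, gregexpr("([^?&=#]+)=([^&#]*)", u))[[1]]
# [1] "a=1" "b=2" "c=3"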

R regex using a vector and two column dataframe

Suppose I have a vector and a two column data.frame.
motif <- c("DAGTACTHV","AGT","WSAT")
motif_ref <- data.frame("sym"=c("W","S","M","K","R","Y","B","D","H","V","N"),
"bases"=c("(A|T)","(C|G)","(A|C)","(G|T)","(A|G)","(C|T)","(C|G|T)","(A|G|T)","(A|C|T)","(A|C|G)","(A|C|G|T)"))
I'm trying to use stri_replace_all to replace, in motif, every symbol in motif_ref$sym with the corresponding element of motif_ref$bases.
m <- stri_replace_all_regex(motif, motif_ref$sym, motif_ref$bases)
However this gives me:
> m
[1] "DAGTACTHV" "DAGTACTHV" "DAGTACTHV" "DAGTACTHV" "DAGTACTHV" "DAGTACTHV" "DAGTACTHV"
[8] "(A|G|T)AGTACTHV" "DAGTACT(A|C|T)V" "DAGTACTH(A|C|G)" "DAGTACTHV"
when I actually want something like:
> m
[1] "(A|G|T)AGTACT(A|C|T)(A|C|G)" "AGT" "(A|T)(C|G)AT"
I was thinking about using chartr; however, I don't know if it'll work for replacing single characters with longer strings.
Thanks everyone
This is a perfect use case for stringi's vectorize_all argument.
library(stringi)
stri_replace_all_fixed(motif, motif_ref$sym, motif_ref$bases, vectorize_all = FALSE)
# [1] "(A|G|T)AGTACT(A|C|T)(A|C|G)" "AGT" "(A|T)(C|G)AT"
Or a bit more clearly written -
with(motif_ref, {
  stri_replace_all_fixed(motif, sym, bases, vectorize_all = FALSE)
})
Note that using stri_replace_all_fixed will be more efficient since we are searching for exact matches.
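If stringi is not available, the same result can be obtained in base R by applying the replacements one at a time; a sketch that works here only because none of the replacement strings contain any of the symbols being replaced:
m <- motif
for (i in seq_len(nrow(motif_ref))) {
  # fixed = TRUE: the symbols are literal characters, not regular expressions
  m <- gsub(as.character(motif_ref$sym[i]), as.character(motif_ref$bases[i]),
            m, fixed = TRUE)
}
m
# [1] "(A|G|T)AGTACT(A|C|T)(A|C|G)" "AGT" "(A|T)(C|G)AT"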

regex for matching column names on a matrix

I'll have two strings, one of each of these forms:
"Initestimate" or "L#estimate", with # being a 1- or 2-digit number
"Nameestimate", with Name being the name of the actual symbol. In the example below, the name of our symbol is "6JU4".
And I have a matrix containing, among other things, columns named "InitSymb" and "L#Symb". I want to return the name of the column whose first row holds the substring that comes before "estimate".
I'm using stringr. Right now I have it coded with a bunch of calls to str_sub, but it's really sloppy and I want to clean it up and do it right.
example code:
> examplemat <- matrix(c("RYU4","6JU4","6EU4",1,2,3),ncol=6)
> colnames(examplemat) <- c("InitSymb","L1Symb","L2Symb","RYU4estimate","6JU4estimate","6EU4estimate")
> examplemat
InitSymb L1Symb L2Symb RYU4estimate 6JU4estimate 6EU4estimate
[1,] "RYU4" "6JU4" "6EU4" "1" "2" "3"
> searchStr <- "L1estimate"
So, with answer being the answer I'm looking for, I want to be able to write examplemat[, answer] so that I can extract the data column (in this case, "2").
I don't really know how to do regex, but I think the answer looks something like
examplemat[,paste0(**some regex function**("[(Init)|(L[:digit:]+)]",searchStr),"estimate")]
What function goes there, and is my regex right?
Maybe you can try:
library(stringr)
Extr <- str_extract(searchStr, '^[A-Za-z]\\d+')
Extr
# [1] "L1"
# If the searchStr is `Initestimate`:
# Extr <- str_extract(searchStr, '^[A-Za-z]{4}')
pat1 <- paste0("(?<=", Extr, ").*")
indx1 <- examplemat[, str_detect(colnames(examplemat), perl(pat1))]
pat2 <- paste0("(?<=", indx1, ").*")
examplemat[, str_detect(colnames(examplemat), perl(pat2))]
# 6JU4estimate
#          "2"
# For searchStr using Initestimate:
# examplemat[, str_detect(colnames(examplemat), perl(pat2))]
# RYU4estimate
#          "1"
The question is a bit confusing, so I am not quite sure whether my interpretation is correct.
First, extract the part of the string "coolSymb" that comes before "Symb".
Second, detect which column name contains "cool" and return its location (column index)
with which().
Finally, extract the value using simple matrix indexing.
library(stringr)
a = str_split("coolSymb", "Symb")[[1]][1]
b = which(str_detect(colnames(examplemat), a))
examplemat[1, b]
Hope this helps,
won782's use of str_split inspired me to find an answer that works, although I still want to know how to do this by matching the prefix instead of excluding the suffix, so I'll accept an answer that does that.
Here's the step-by-step
> str_split("L1estimate","estimate")[[1]][1]
[1] "L1"
(For bonus points, replace the above step with one that matches "L1" directly instead of taking whatever precedes "estimate".)
> paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")
[1] "L1Symb"
> examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")]
L1Symb
[1,] "6JU4"
> paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")
[1] "6JU4estimate"
> examplemat[,paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")]
6JU4estimate
[1,] "2"

Regular Expressions in R - compare one column to another

I currently have a dataset which has two columns that I'd like to compare. In one column, I have a string that I'd like to search for (let's call it column A). In a second column (let's call it column B) are some more strings.
The problem is that both columns have varying contents, so the pattern being searched for in the regular expression is likely to change from one row to another. Normally, when I'm searching a column for a particular string, I use something like this:
df$output <- NA
df$output[grep("TARGET_STRING", df$column_B)] <- "STRING_FOUND"
However, now that I'm trying to do this:
df$output[grep(df$column_A, df$column_B)] <- "STRING_FOUND"
Unfortunately, this gives a warning:
argument 'pattern' has length > 1 and
only the first element will be used
I've tried various methods to fix this and can't seem to find a simple solution, though I'm sure there must be one. I can see why it complains (I think), but I'm not sure how to solve it. What do I need to do to get the regular expression working?
Edit: Here's the testing data.frame I've been using to explore it:
column_A <- c("A", "A", "B", "B")
column_B <- c("A", "zzz", "B", "zzz")
greptest <- data.frame(column_A, column_B)
greptest$output<-NA
greptest$output[grep(greptest$column_A, greptest$column_B)] <- "STRING_FOUND"
You can write a function that wraps grepl and then use apply:
grepFun <- function(rw){
  grepl(rw[1], rw[2], fixed = TRUE)
}
xx <- apply(greptest, 1, grepFun)
greptest$output[xx] <- "STRING_FOUND"
You've already accepted my answer, but I thought I'd provide another, somewhat more efficient version using ddply:
library(plyr)
grepFun1 <- function(x){
  ind <- grepl(x$column_A[1], x$column_B, fixed = TRUE)
  x$output <- NA
  x$output[ind] <- "STRING_FOUND"
  x
}
ddply(greptest, .(column_A), .fun = grepFun1)
This version will likely be faster if you have lots of repetition in the values for column_A.
I'm not sure what your expected result is, but here's my code:
> grep(greptest[1,"column_A"], greptest$column_B)
[1] 1 2
> grep(greptest[2,"column_A"], greptest$column_B)
integer(0)
> grep(greptest[3,"column_A"], greptest$column_B)
[1] 3 4
> grep(greptest[4,"column_A"], greptest$column_B)
integer(0)
> cbind(column_A,column_B,column_A==column_B)
column_A column_B
[1,] "A" "A" "TRUE"
[2,] "A" "zzz" "FALSE"
[3,] "B" "B" "TRUE"
[4,] "B" "zzz" "FALSE"
I switched A and B in the grep code because otherwise you only get one hit per grep. You have to loop through the elements if you'd like to search for all of them (or use a loop equivalent).
If you'd like just to compare row by row, then a simple == suffices.
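For reference, the row-wise check can also be written without an explicit apply over rows; a minimal sketch, assuming column_A holds literal strings rather than regex patterns:
# mapply() pairs each pattern in column_A with the corresponding string in
# column_B; fixed = TRUE treats the patterns as literal text
hits <- mapply(grepl,
               as.character(greptest$column_A),
               as.character(greptest$column_B),
               MoreArgs = list(fixed = TRUE))
greptest$output <- ifelse(hits, "STRING_FOUND", NA)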