Regular Expressions in R - compare one column to another

Regular Expressions in R - compare one column to another - regex

I currently have a dataset which has two columns that I'd like to compare. In one column, I have a string that I'd like to search for (let's call it column A). In a second column (let's call it column B) are some more strings.
The problem is that both columns have varying contents, so the pattern being searched for in the regular expression is likely to change from one row to another. Normally, when I'm searching a column for a particular string, I use something like this:
df$output <- NA
df$output[grep("TARGET_STRING", df$column_B)] <- "STRING_FOUND"
However, now that I'm trying to do this:
df$output[grep(df$column_A, df$column_B)] <- "STRING_FOUND"
Unfortunately, this gives an error:
argument 'pattern' has length > 1 and
only the first element will be used
I've tried various methods to fix this, and can't seem to find a simple solution, and I'm sure there must be one. I can see why it's throwing an error (I think), but I'm not sure how to solve it. What do I need to do to get the regular expression working?
Edit: Here's the testing data.frame I've been using to explore it:
column_A <- c("A", "A", "B", "B")
column_B <- c("A", "zzz", "B", "zzz")
greptest <- data.frame(column_A, column_B)
greptest$output<-NA
greptest$output[grep(greptest$column_A, greptest$column_B)] <- "STRING_FOUND"

You can write a function that wraps grepl and then use apply:
grepFun <- function(rw){
grepl(rw[1],rw[2],fixed=TRUE)
}
xx <- apply(greptest,1,grepFun)
greptest$output[xx] <- "STRING_FOUND"
You've already excepted my answer, but I thought I'd provide another, somewhat more efficient version using ddply:
grepFun1 <- function(x){
ind <- grepl(x$column_A[1],x$column_B,fixed=TRUE)
x$output <- NA
x$output[ind] <- "STRING_FOUND"
x
}
ddply(greptest,.(column_A),.fun=grepFun1)
This version will likely be faster if you have lots of repetition in the values for column_A.

I'm not sure what your expected result is, but here's my code:
> grep(greptest[1,"column_A"], greptest$column_B)
[1] 1 2
> grep(greptest[2,"column_A"], greptest$column_B)
integer(0)
> grep(greptest[3,"column_A"], greptest$column_B)
[1] 3 4
> grep(greptest[4,"column_A"], greptest$column_B)
integer(0)
> cbind(column_A,column_B,column_A==column_B)
column_A column_B
[1,] "A" "A" "TRUE"
[2,] "A" "zzz" "FALSE"
[3,] "B" "B" "TRUE"
[4,] "B" "zzz" "FALSE"
I switched A and B in the grep code, because otherwise you only get one hit per grep. You have to loop through elements, if you'd like to search for all of them (or use a loop equivalent).
If you'd like just to compare row by row, then a simple == suffices.

Related

Extract URL parameters and values in R

We want to extract parameters and values from a given URL like
http://www.exemple.com/?a=1&b=2&c=3#def
Using xml2::url_parse we were able to Parse a url into its component pieces. However we still want to devide the query into elements using gsub matching regular expression:
([^?&=#]+)=([^&#]*)
Desired output
a=1
b=2
c=3

Use urltools package to parse URLs.
> u <- "http://www.exemple.com/?a=1&b=2&c=3#def"
> strsplit(urltools::parameters(u), "&")[[1L]]
[1] "a=1" "b=2" "c=3"
> urltools::param_get(u, "b")
b
1 2

We can try
library(stringr)
matrix(str_extract_all(str1, "[a-z](?=\\=)|(?<=\\=)\\d+")[[1]], ncol=2, byrow=TRUE)
Or if we need the = also
str_extract_all(str1, "[a-z]=\\d+")[[1]]
#[1] "a=1" "b=2" "c=3"
data
str1 <- "http://www.exemple.com/?a=1&b=2&c=3#def"

grepl() and lapply to fill missing values

I have the following data as an example:
fruit.region <- data.frame(full =c("US red apple","bombay Asia mango","gold kiwi New Zealand"), name = c("apple", "mango", "kiwi"), country = c("US","Asia","New Zealand"), type = c("red","bombay","gold"))
I would like R to be able to look at other items in the "full" (name) column that don't have values for "name", "country" and "type" and see if they match other items. For instance, if full had a 4th row with "bombay US mango" it would be able to identify that the country should read US, bombay should be under type and mango should be under name.
This is what I have so far, which merely identifies (logically) where the items match:
new.entry <- c("bombay US mango")
split.new.entry <- strsplit(new.entry, " ")
lapply(split.new.entry, function(x){
check = grepl(x, fruit.region, ignore.case=TRUE)
print(check)
})
I'm at a bit of a standstill..I've read through a number of regex posts and the r help guides on grepl but am not able to find a great solution. What I have doesn't fully identify a logical "match" vector so I'm unable to subset and use an if statement to concatenate on different elements. Ideally, I'd like to be able to replace these elements in data.table form as my fruit.region will actually be in a data table. Does anyone have any suggestions on the best approach?

Using the str_detect function from the stringr library. This gives a list, ready to rbind:
library(stringr)
addnewrow <- function(newfruit){
z<-lapply(fruit.region[,2:4], function(x) x[str_detect(new.entry, x)])
z$full <- newfruit
z
}
addnewrow(new.entry)
$name
[1] "mango"
$country
[1] "US"
$type
[1] "bombay"
$full
[1] "bombay US mango"
The next step would depend on your desired outcome - if you only want to add one, try:
rbind(fruit.region, addnewrow(new.entry))
If you have a lot:
z <- do.call(rbind, lapply(c(new.entry, new.entry), addnewrow))
rbind(fruit.region, z)
NB make sure your columns are character first:
fruit.region[] <- lapply(fruit.region, as.character)

Count number of words in one list that appear in a string

I have a unique set of words in a character vector (that have been 'stemmed') and I want to know how many of them appear in a string.
Here's what I have so far:
library(RTextTools)
string <- "Players Information donation link controller support years fame glory addition champion Steer leader gang ghosts life Power Pellets tables gobble ghost"
wordstofind <- c("player","fame","field","donat")
# I created a stemmed list of the string
string.stem <- colnames(create_matrix(string, stemWords = T, removeStopwords = F))
I know the next step probably involves grepl("\\bword\\b,value") or some usage of regex, but I'm not sure what the fastest option is in this case.
Here are my criteria:
I have to do this many times, so it being as fast as possible is a concern.
It should match the entire word ("es" shouldn't match "test").
Any push in the right direction would be great.

Well, I never work with huge datasets, so time is never of the essence, but given the data you've provided this will give you a count of how many words exactly match something in the string. Might be a good starting point.
sum(wordstofind %in% unlist(strsplit(string, " ")))
> sum(wordstofind %in% unlist(strsplit(string, " ")))
[1] 1
Edit Using the stems to get the proper 3 matches, thanks to #Anthony Bissel:
sum(wordstofind %in% unlist(string.stem))
> sum(wordstofind %in% unlist(string.stem))
[1] 3

Take a look at stringr by Hadley Wickham. You are probably looking for the function str_count.

There certainly might be a faster option, but this works:
length(wordstofind) - length(setdiff(wordstofind, string.stem)) # 3
But it looks like Andrew Taylor's answer is faster:
`microbenchmark(sum(wordstofind %in% unlist(string.stem)), length(wordstofind) - length(setdiff(wordstofind, string.stem)))
Unit: microseconds
expr min lq mean median uq max neval
sum(wordstofind %in% unlist(string.stem)) 4.016 4.909 6.55562 5.355 5.801 37.485 100
length(wordstofind) - length(setdiff(wordstofind, string.stem)) 16.511 16.958 21.85303 17.404 18.296 81.218 100`

regex for matching column names on a matrix

I'll have two strings of the form
"Initestimate" or "L#estimate" with # being a 1 or 2 digit number
" Nameestimate" with Name being the name of the actual symbol. In the example below, the name of our symbol is "6JU4"
And I have a matrix containing, among other things, columns containing "InitSymbol" and "L#Symbol". I want to return the column name of the column where the first row holds the substring before "estimate".
I'm using stringr. Right now I have it coded with a bunch of calls to str_sub but its really sloppy and I wanted to clean it up and do it right.
example code:
> examplemat <- matrix(c("RYU4","6JU4","6EU4",1,2,3),ncol=6)
> colnames(examplemat) <- c("InitSymb","L1Symb","L2Symb","RYU4estimate","6JU4estimate","6EU4estimate")
> examplemat
InitSymb L1Symb L2Symb RYU4estimate 6JU4estimate 6EU4estimate
[1,] "RYU4" "6JU4" "6EU4" "1" "2" "3"
> searchStr <- "L1estimate"
So with answer being the answer I'm looking for, I want to be able to input examplemat[,answer] so I can extract the data column (in this case, "2")
I don't really know how to do regex, but I think the answer looks something like
examplemat[,paste0(**some regex function**("[(Init)|(L[:digit:]+)]",searchStr),"estimate")]
what function goes there, and is my regex code right?

May be you can try:
library(stringr)
Extr <- str_extract(searchStr, '^[A-Za-z]\\d+')
Extr
[1] "L1"
#If the searchStr is `Initestimate`
#Extr <- str_extract(searchStr, '^[A-Za-z]{4}')
pat1 <- paste0("(?<=",Extr,").*")
indx1 <-examplemat[,str_detect(colnames(examplemat),perl(pat1))]
pat2 <- paste0("(?<=",indx1,").*")
examplemat[,str_detect(colnames(examplemat), perl(pat2))]
#6JU4estimate
# "2"
#For searchStr using Initestimate;
#examplemat[,str_detect(colnames(examplemat), perl(pat2))]
#RYU4estimate
# "1"

The question is bit confusing so I am quite not sure if my interpretation is correct.
First, you would extract the values in the string "coolSymb" without "Symb"
Second, you can detect if column name contains "cool" and return the location (column index)
by which() statement.
Finally, you can extract the value using simple matrix indexing.
library(stringr)
a = str_split("coolSymb", "Symb")[[1]][1]
b = which(str_detect(colnames(examplemat), a))
examplemat[1, b]
Hope this helps,

won782's use of str_split inspired me to find an answer that works, although I still want to know how to do this by matching the prefix instead of excluding the suffix, so I'll accept an answer that does that.
Here's the step-by-step
> str_split("L1estimate","estimate")[[1]][1]
[1] "L1"
replace the above step with one that gets {L1} instead of getting {not estimate} for bonus points
> paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")
[1] "L1Symb"
> examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")]
L1Symb
[1,] "6JU4"
> paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")
[1] "6JU4estimate"
> examplemat[,paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")]
6JU4estimate
[1,] "2"

grep on any cell in a data.frame

A simple "is there a better way" question. I want to find if any cell in a data.frame contains the sub-string I'm looking for:
d=data.frame(V1=c("xxx","yyy","zzz"), V2=c(NA,"ewruinwe",NA))
grepl("ruin",d[2,2]) #TRUE
grepl("ruin",d) #FALSE FALSE
any(grepl("ruin",as.character(as.matrix(d)))) #TRUE
The last line does what I want, but it looks so ugly I'm wondering if I'm missing something simpler.
Background: d is one of the elements in t=readHTMLTable(url) (XML package). I was doing the d[2,2] approach, to check for an error message, and just discovered the website sometimes add another row to the HTML table, pushing the error message I was looking for to another cell.
UPDATE: so, it seems the two choices (thanks to mathematical.coffee and Roman Luštrik) are:
any(grepl("ruin",as.matrix(d)))
any(apply(d, 2, function(x) grepl("ruin", x)))

What about this?
d=data.frame(V1=c("xxx","yyy","zzz"), V2=c(NA,"ewruinwe",NA))
apply(d, c(1,2), function(x) grepl("ruin", x))
V1 V2
[1,] FALSE FALSE
[2,] FALSE TRUE
[3,] FALSE FALSE
As noted in the comments "2" does the same as "c(1,2)". Then to give a single boolean value:
any(apply(d, 2, function(x) grepl("ruin", x)))
[1] TRUE

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regular Expressions in R - compare one column to another - regex

Related

Extract URL parameters and values in R

grepl() and lapply to fill missing values

Count number of words in one list that appear in a string

regex for matching column names on a matrix

grep on any cell in a data.frame

Categories

Resources