How to make str_detect or str_extract work?

Below is my code:
example <- c("aaaa","aaab","abab","abba","baaa","baba","bbba","bbbb")
example <- as.data.frame(example)
example1 <- c("zzzz","zzzy","zyzy","zyyz","yzzz","yzyz","yyyz","yyyy")
example1 <- as.data.frame(example1)
df <- cbind(example, example1)
library(stringr)
detect<- str_detect(df,"aaaa")
And yet this does not manage to detect the "aaaa" in one cell.
Instead, it shows FALSE for every row.

str_detect() expects a character vector. A data frame is a list of columns, so str_detect(df,"aaaa") does not search cell by cell: df is coerced to a character vector with one element per column, which is why you get two values back (both FALSE here) rather than one per row.
You are probably looking for a match in the example column:
library(stringr)
str_detect(df$example,"aaaa")
# => [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
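If the goal is to search every column rather than only example, one option is to apply the test column by column. The following is a minimal sketch, not part of the original answer. It uses base R's grepl so that it runs without extra packages (for this literal pattern, str_detect(col, "aaaa") should behave the same way), and the names hits and row_has_match are made up for illustration.

```r
example  <- c("aaaa", "aaab", "abab", "abba", "baaa", "baba", "bbba", "bbbb")
example1 <- c("zzzz", "zzzy", "zyzy", "zyyz", "yzzz", "yzyz", "yyyz", "yyyy")
df <- data.frame(example, example1, stringsAsFactors = FALSE)

# Test each column separately: the result is a logical matrix
# with one row per data frame row and one column per data frame column.
hits <- sapply(df, function(col) grepl("aaaa", col, fixed = TRUE))

# TRUE for every row in which at least one cell contains the pattern.
row_has_match <- rowSums(hits) > 0
row_has_match
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

df[row_has_match, ] then keeps only the matching rows.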

Related

Any way to require two matches instead of just one for TRUE with grepl?

I'm trying to detect terms using grepl, and I'm getting too many false positives. I was hoping there might be a way to require two successful matches of any term off the list (I have manual coding for a segment of my data and am trying to get the automation to at least roughly correspond to this, but I have about 5 times as many positives as I did with manual coding). I didn't see grepl as taking any argument requiring more than one match to trigger TRUE. Is there any way of requiring two matches to trigger a TRUE finding? Or is there some other function I should be using?
GenericColumn <- cbind(grepl(Genericpattern, Statement$Statement.Text, ignore.case = TRUE))
EDIT:
Here is a more concrete example:
Examplepattern <- 'apple|orange'
ExampleColumn <- cbind(grepl(Examplepattern, Rexample$Statement.Text, ignore.case = TRUE))
As is now, all of these will trigger true with grepl. I would only like the items with two references to trigger true.
Example data:
Rexample <- structure(list(Statement.Text = structure(c(2L, 1L, 3L, 5L, 4L
), .Label = c("This apple is a test about an apple.", "This is a test about apples.",
"This orange is a test about apples.", "This orange is a test about oranges.",
"This orange is a test."), class = "factor")), .Names = "Statement.Text", row.names = c(NA,
5L), class = "data.frame")
Desired output (in the row order of the data above): FALSE, TRUE, TRUE, FALSE, TRUE

You can specify how many times you want something repeated in regex with curly braces, like {2} (exactly twice whatever is before it), {2,5} (2-5 times), or {2,} (2 or more times). However, you need to allow for words between the ones you want to match, so you need a wildcard . quantified with * (0 or more times).
Thus, if you want either apple or orange matched twice (including apple and orange and vice versa), you can use
grepl('(apple.*|orange.*){2}', Rexample$Statement.Text, ignore.case = TRUE)
# [1] FALSE TRUE TRUE FALSE TRUE
If you want apple repeated twice or orange repeated twice (but not apple once and orange once), quantify separately:
grepl('(apple.*){2,}|(orange.*){2}', Rexample$Statement.Text, ignore.case = TRUE)
# [1] FALSE TRUE FALSE FALSE TRUE

You can try a regex that explicitly looks for the pattern again, like (?:apple|orange).*(?:apple|orange)
(pattern <- paste0("(?:", Examplepattern, ")", ".*", "(?:", Examplepattern, ")"))
#[1] "(?:apple|orange).*(?:apple|orange)"
grepl(pattern, Rexample$Statement.Text, ignore.case = TRUE, perl = TRUE)
#[1] FALSE TRUE TRUE FALSE TRUE
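Another way to think about "at least two matches" is to count the matches and compare the count with a threshold. This is only a sketch under that assumption, not something either answer proposed. The helper name count_matches is hypothetical, and the example texts are typed in directly in the same row order as Rexample.

```r
texts <- c("This is a test about apples.",
           "This apple is a test about an apple.",
           "This orange is a test about apples.",
           "This orange is a test.",
           "This orange is a test about oranges.")
pattern <- "apple|orange"

# gregexpr() returns the start positions of all matches in each string,
# or -1 when there is no match, so counting positive positions counts matches.
count_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x, ignore.case = TRUE)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

counts <- count_matches(pattern, texts)
counts
# [1] 1 2 2 1 2
counts >= 2
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```

Counting makes the threshold a plain number, so requiring three or more matches only means changing the comparison.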

How to apply a and not b pattern match using regex in R

I would like to filter a list by only keeping items that contain dimension, or that contain metric and not penetration.
I can filter to those that contain dimension OR metric and penetration, but I can't see how to switch the logic of the second case to metric and not penetration.
Example below:
> library(stringr)
> var_list <- c("other", "dimension_1", "dimension_2", "metric_1", "metric_2", "metric_3_penetration")
> str_detect(var_list, "dimension|(?=.*metric)(?=.*penetration)")
[1] FALSE TRUE TRUE FALSE FALSE TRUE
Result that I would like to return from the str_detect is below:
[1] FALSE TRUE TRUE TRUE TRUE FALSE

You can use a combination of a positive and a negative lookahead for the second case:
> library(stringr)
> var_list <- c("other", "dimension_1", "dimension_2", "metric_1", "metric_2", "metric_3_penetration")
> str_detect(var_list, "dimension|^(?=.*metric)(?!.*penetration)")
[1] FALSE TRUE TRUE TRUE TRUE FALSE
The ^(?=.*metric)(?!.*penetration) regex matches when a string has metric and does not have penetration.
To only check for whole words, add (?:\b|_) boundaries:
str_detect(var_list, "dimension|^(?=.*(?:\\b|_)metric(?:\\b|_))(?!.*(?:\\b|_)penetration(?:\\b|_))")
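To see what the whole-word boundaries change, here is a small sketch using base R's grepl with perl = TRUE, which also supports lookaheads. The test strings in words are invented for illustration and are not from the question.

```r
# Hypothetical inputs: "metrics_1" contains "metric" only as part of a longer word.
words <- c("metric_1", "metrics_1", "my_metric", "metric_penetration")

# Same structure as the whole-word pattern above, without the "dimension" branch.
whole_word <- "^(?=.*(?:\\b|_)metric(?:\\b|_))(?!.*(?:\\b|_)penetration(?:\\b|_))"

res <- grepl(whole_word, words, perl = TRUE)
res
# [1]  TRUE FALSE  TRUE FALSE
```

The underscore counts as a word character in these regex flavours, so \b alone would not separate metric from _1; that is why the pattern allows either \b or a literal _.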

A logical combination of grepl calls is simple and involves no packages:
grepl("dimension",var_list) | (grepl("metric",var_list) & !grepl("penetration",var_list))
## [1] FALSE TRUE TRUE TRUE TRUE FALSE
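The two approaches can be checked against each other. The sketch below runs the lookahead pattern through base R (perl = TRUE is needed for lookaheads in grepl) and compares it with the logical combination. The variable names by_lookahead and by_logic are made up for this example.

```r
var_list <- c("other", "dimension_1", "dimension_2",
              "metric_1", "metric_2", "metric_3_penetration")

# Single-regex version: "dimension", or "metric" without "penetration".
by_lookahead <- grepl("dimension|^(?=.*metric)(?!.*penetration)",
                      var_list, perl = TRUE)

# Same condition spelled out with three simple grepl() calls.
by_logic <- grepl("dimension", var_list) |
  (grepl("metric", var_list) & !grepl("penetration", var_list))

identical(by_lookahead, by_logic)
# [1] TRUE
var_list[by_lookahead]
# [1] "dimension_1" "dimension_2" "metric_1"    "metric_2"
```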

grepl() and lapply to fill missing values

I have the following data as an example:
fruit.region <- data.frame(full =c("US red apple","bombay Asia mango","gold kiwi New Zealand"), name = c("apple", "mango", "kiwi"), country = c("US","Asia","New Zealand"), type = c("red","bombay","gold"))
I would like R to be able to look at other items in the "full" (name) column that don't have values for "name", "country" and "type" and see if they match other items. For instance, if full had a 4th row with "bombay US mango" it would be able to identify that the country should read US, bombay should be under type and mango should be under name.
This is what I have so far, which merely identifies (logically) where the items match:
new.entry <- c("bombay US mango")
split.new.entry <- strsplit(new.entry, " ")
lapply(split.new.entry, function(x){
check = grepl(x, fruit.region, ignore.case=TRUE)
print(check)
})
I'm at a bit of a standstill. I've read through a number of regex posts and the R help pages on grepl but am not able to find a good solution. What I have doesn't fully identify a logical "match" vector, so I'm unable to subset and use an if statement to concatenate on different elements. Ideally, I'd like to be able to replace these elements in data.table form, as my fruit.region will actually be a data table. Does anyone have any suggestions on the best approach?

You can use the str_detect function from the stringr library. The function below returns a list, ready to rbind:
library(stringr)
addnewrow <- function(newfruit){
# use the function argument here, not the global new.entry
z<-lapply(fruit.region[,2:4], function(x) x[str_detect(newfruit, x)])
z$full <- newfruit
z
}
addnewrow(new.entry)
$name
[1] "mango"
$country
[1] "US"
$type
[1] "bombay"
$full
[1] "bombay US mango"
The next step would depend on your desired outcome - if you only want to add one, try:
rbind(fruit.region, addnewrow(new.entry))
If you have a lot:
z <- do.call(rbind, lapply(c(new.entry, new.entry), addnewrow))
rbind(fruit.region, z)
NB make sure your columns are character first:
fruit.region[] <- lapply(fruit.region, as.character)
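For several new entries at once, a variant that returns a one-row data frame per entry can be easier to rbind. This is a sketch in base R under a few assumptions: known values are matched as literal substrings, the first hit wins, and a field with no hit becomes NA. The names parse_entry and pick, and the second test entry, are hypothetical.

```r
fruit.region <- data.frame(
  full    = c("US red apple", "bombay Asia mango", "gold kiwi New Zealand"),
  name    = c("apple", "mango", "kiwi"),
  country = c("US", "Asia", "New Zealand"),
  type    = c("red", "bombay", "gold"),
  stringsAsFactors = FALSE
)

# Build one data frame row for a new "full" string by looking up
# which known name/country/type values occur in it.
parse_entry <- function(entry, known) {
  pick <- function(values) {
    hit <- values[vapply(values, grepl, logical(1), x = entry, fixed = TRUE)]
    if (length(hit) == 0) NA_character_ else hit[1]
  }
  data.frame(full = entry,
             name = pick(known$name),
             country = pick(known$country),
             type = pick(known$type),
             stringsAsFactors = FALSE)
}

new.entries <- c("bombay US mango", "gold Asia kiwi")
rows <- do.call(rbind, lapply(new.entries, parse_entry, known = fruit.region))
result <- rbind(fruit.region, rows)
result[4:5, c("name", "country", "type")]
```

Because the matching is a plain substring test, a short value such as "US" would also match inside a longer word, so real data may need word boundaries.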

regex for matching column names on a matrix

I'll have two strings of the form
"Initestimate" or "L#estimate" with # being a 1 or 2 digit number
" Nameestimate" with Name being the name of the actual symbol. In the example below, the name of our symbol is "6JU4"
And I have a matrix containing, among other things, columns containing "InitSymbol" and "L#Symbol". I want to return the column name of the column where the first row holds the substring before "estimate".
I'm using stringr. Right now I have it coded with a bunch of calls to str_sub, but it's really sloppy and I want to clean it up and do it right.
example code:
> examplemat <- matrix(c("RYU4","6JU4","6EU4",1,2,3),ncol=6)
> colnames(examplemat) <- c("InitSymb","L1Symb","L2Symb","RYU4estimate","6JU4estimate","6EU4estimate")
> examplemat
InitSymb L1Symb L2Symb RYU4estimate 6JU4estimate 6EU4estimate
[1,] "RYU4" "6JU4" "6EU4" "1" "2" "3"
> searchStr <- "L1estimate"
So with answer being the answer I'm looking for, I want to be able to input examplemat[,answer] so I can extract the data column (in this case, "2")
I don't really know how to do regex, but I think the answer looks something like
examplemat[,paste0(**some regex function**("[(Init)|(L[:digit:]+)]",searchStr),"estimate")]
what function goes there, and is my regex code right?

Maybe you can try:
library(stringr)
Extr <- str_extract(searchStr, '^[A-Za-z]\\d+')
Extr
[1] "L1"
#If the searchStr is `Initestimate`
#Extr <- str_extract(searchStr, '^[A-Za-z]{4}')
# perl() is no longer available in current stringr; regex() accepts the same lookbehind patterns
pat1 <- paste0("(?<=",Extr,").*")
indx1 <- examplemat[,str_detect(colnames(examplemat), regex(pat1))]
pat2 <- paste0("(?<=",indx1,").*")
examplemat[,str_detect(colnames(examplemat), regex(pat2))]
#6JU4estimate
# "2"
#For searchStr using Initestimate;
#examplemat[,str_detect(colnames(examplemat), regex(pat2))]
#RYU4estimate
# "1"

The question is a bit confusing, so I am not quite sure my interpretation is correct.
First, extract the part of the string "coolSymb" that comes before "Symb".
Second, detect which column names contain "cool" and return their positions (column indices) with which().
Finally, extract the value using simple matrix indexing.
library(stringr)
a = str_split("coolSymb", "Symb")[[1]][1]
b = which(str_detect(colnames(examplemat), a))
examplemat[1, b]
Hope this helps.

won782's use of str_split inspired me to find an answer that works, although I still want to know how to do this by matching the prefix instead of excluding the suffix, so I'll accept an answer that does that.
Here's the step-by-step:
> str_split("L1estimate","estimate")[[1]][1]
[1] "L1"
For bonus points, replace the above step with one that gets {L1} instead of getting {not estimate}.
> paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")
[1] "L1Symb"
> examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")]
L1Symb
[1,] "6JU4"
> paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")
[1] "6JU4estimate"
> examplemat[,paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")]
6JU4estimate
[1,] "2"
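For the prefix-matching variant asked about above, one possibility is to capture the prefix with a group and pull it out with sub(). This is a sketch, assuming the search string is always Init or L followed by one or two digits, then estimate. The function name lookup_estimate is hypothetical.

```r
examplemat <- matrix(c("RYU4", "6JU4", "6EU4", "1", "2", "3"), ncol = 6)
colnames(examplemat) <- c("InitSymb", "L1Symb", "L2Symb",
                          "RYU4estimate", "6JU4estimate", "6EU4estimate")

lookup_estimate <- function(mat, searchStr) {
  # Capture "Init" or "L" plus 1-2 digits; "\\1" keeps only that captured prefix.
  prefix <- sub("^(Init|L[0-9]{1,2})estimate$", "\\1", searchStr)
  # First row of the "<prefix>Symb" column holds the symbol name.
  symbol <- mat[1, paste0(prefix, "Symb")]
  # The value sits in the "<symbol>estimate" column.
  unname(mat[1, paste0(symbol, "estimate")])
}

lookup_estimate(examplemat, "L1estimate")
# [1] "2"
lookup_estimate(examplemat, "Initestimate")
# [1] "1"
```

If searchStr does not have the assumed shape, sub() returns it unchanged and the column lookup fails with a subscript error, so a check on the input may be worth adding.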

Regular Expressions in R - compare one column to another

I currently have a dataset which has two columns that I'd like to compare. In one column, I have a string that I'd like to search for (let's call it column A). In a second column (let's call it column B) are some more strings.
The problem is that both columns have varying contents, so the pattern being searched for in the regular expression is likely to change from one row to another. Normally, when I'm searching a column for a particular string, I use something like this:
df$output <- NA
df$output[grep("TARGET_STRING", df$column_B)] <- "STRING_FOUND"
However, now I'm trying to do this:
df$output[grep(df$column_A, df$column_B)] <- "STRING_FOUND"
Unfortunately, this gives a warning:
argument 'pattern' has length > 1 and
only the first element will be used
I've tried various methods to fix this, and can't seem to find a simple solution, and I'm sure there must be one. I can see why it's throwing an error (I think), but I'm not sure how to solve it. What do I need to do to get the regular expression working?
Edit: Here's the testing data.frame I've been using to explore it:
column_A <- c("A", "A", "B", "B")
column_B <- c("A", "zzz", "B", "zzz")
greptest <- data.frame(column_A, column_B)
greptest$output<-NA
greptest$output[grep(greptest$column_A, greptest$column_B)] <- "STRING_FOUND"

You can write a function that wraps grepl and then use apply:
grepFun <- function(rw){
grepl(rw[1],rw[2],fixed=TRUE)
}
xx <- apply(greptest,1,grepFun)
greptest$output[xx] <- "STRING_FOUND"
You've already accepted my answer, but I thought I'd provide another, somewhat more efficient version using ddply from the plyr package:
library(plyr)
grepFun1 <- function(x){
ind <- grepl(x$column_A[1],x$column_B,fixed=TRUE)
x$output <- NA
x$output[ind] <- "STRING_FOUND"
x
}
ddply(greptest,.(column_A),.fun=grepFun1)
This version will likely be faster if you have lots of repetition in the values for column_A.
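A row-by-row comparison can also be written with mapply, which pairs the i-th pattern with the i-th string. This is a sketch of that alternative, not the approach from the answer above. It assumes the columns are character vectors and uses fixed = TRUE, so the values in column_A are treated as literal text rather than regular expressions.

```r
column_A <- c("A", "A", "B", "B")
column_B <- c("A", "zzz", "B", "zzz")
greptest <- data.frame(column_A, column_B, stringsAsFactors = FALSE)

# Call grepl(pattern, x) once per row: column_A supplies the pattern,
# column_B supplies the string to search in.
found <- unname(mapply(grepl, greptest$column_A, greptest$column_B,
                       MoreArgs = list(fixed = TRUE)))

greptest$output <- ifelse(found, "STRING_FOUND", NA_character_)
greptest$output
# [1] "STRING_FOUND" NA             "STRING_FOUND" NA
```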

I'm not sure what your expected result is, but here's my code:
> grep(greptest[1,"column_B"], greptest$column_A)
[1] 1 2
> grep(greptest[2,"column_B"], greptest$column_A)
integer(0)
> grep(greptest[3,"column_B"], greptest$column_A)
[1] 3 4
> grep(greptest[4,"column_B"], greptest$column_A)
integer(0)
> cbind(column_A,column_B,column_A==column_B)
column_A column_B
[1,] "A" "A" "TRUE"
[2,] "A" "zzz" "FALSE"
[3,] "B" "B" "TRUE"
[4,] "B" "zzz" "FALSE"
I switched A and B in the grep code, because otherwise you only get one hit per grep. You have to loop through elements, if you'd like to search for all of them (or use a loop equivalent).
If you'd like just to compare row by row, then a simple == suffices.
