Filter/grep functions behaving oddly - regex

Take the following code to select only alphanumeric strings from a list of strings:
isValid = function(string){
return(grep("^[A-z0-9]+$", string))
}
strings = c("aaa", "test#test.com", "", "valid")
print(Filter(isValid, strings))
The output is [1] "aaa" "test#test.com".
Why is "valid" not outputted, and why is "test#test.com" outputted?

Filter expects its predicate to return a logical value for each element, but your function returns numeric indices. Use grepl:
isValid = function(string){
return(grepl("^[A-z0-9]+$", string))
}
strings = c("aaa", "test#test.com", "", "valid")
print(Filter(isValid, strings))
[1] "aaa" "valid"
Why didn't grep work? It comes down to how Filter calls the predicate and how R coerces numeric values to logical.
Here's what happened. Called on the whole vector, grep("^[A-z0-9]+$", strings) correctly returns 1 4, the indices of the matches (the first and fourth elements).
But that is not how Filter works. It applies the predicate to each element separately, using as.logical(unlist(lapply(x, f))).
So it ran isValid(strings[1]), then isValid(strings[2]), and so on, which produced this list:
[[1]]
[1] 1
[[2]]
integer(0)
[[3]]
integer(0)
[[4]]
[1] 1
It then called unlist on that list, which drops the two empty integer(0) results and leaves 1 1, and coerced that to the logical vector TRUE TRUE. So in the end you got:
strings[which(c(TRUE, TRUE))]
which turned into
strings[c(1,2)]
[1] "aaa" "test#test.com"
Moral of the story, don't use Filter :)
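The steps above can be reproduced by hand. The following is a minimal sketch that mirrors the as.logical(unlist(lapply(x, f))) expression quoted in the answer; the intermediate names (per_element, flat, ind, by_hand, direct) are made up for illustration. It also shows plain logical subsetting with grepl as an alternative, with [[:alnum:]] swapped in for the A-z range, since in ASCII order A-z also spans a few punctuation characters that sit between Z and a.

```r
strings <- c("aaa", "test#test.com", "", "valid")
isValid <- function(string) grep("^[A-z0-9]+$", string)

# Step 1: the predicate is applied to each element on its own
per_element <- lapply(strings, isValid)   # 1, integer(0), integer(0), 1

# Step 2: unlist() drops the empty results, so only two values survive
flat <- unlist(per_element)               # 1 1

# Step 3: coercion to logical, then positional subsetting
ind <- as.logical(flat)                   # TRUE TRUE
by_hand <- strings[which(ind)]
by_hand
# [1] "aaa"           "test#test.com"

# Alternative: subset directly with a logical vector from grepl
direct <- strings[grepl("^[[:alnum:]]+$", strings)]
direct
# [1] "aaa"   "valid"
```

The positions of the surviving values no longer line up with the input, which is why the second element is returned instead of the fourth.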

You could go the opposite direction with this and exclude any strings with punctuation, i.e.
isValid <- function(string){
v1 <- string[!string %in% grep('[[:punct:]]', string, value = TRUE)]
return(v1[v1 != ''])
}
isValid(strings)
#[1] "aaa" "valid"

Related

R: how to find the first digit in a string

string = "ABC3JFD456"
Suppose I have the above string, and I wish to find the first digit in the string and store its value. In this case, I would want to store the value 3 (since it's the first-occurring digit in the string). grepl("\\d", string) only returns a logical value; it does not tell me where or what the first digit is. Which regular expression should I use to find the value of the first digit?
Base R
regmatches(string, regexpr("\\d", string))
## [1] "3"
Or using stringi
library(stringi)
stri_extract_first(string, regex = "\\d")
## [1] "3"
Or using stringr
library(stringr)
str_extract(string, "\\d")
## [1] "3"
1) sub. Try sub with the regular expression below, which matches the shortest string up to the first digit, the digit itself, and everything following, and replaces the whole match with the digit:
sub(".*?(\\d).*", "\\1", string)
giving:
[1] "3"
This also works if string is a vector of strings.
2) strapplyc. It would also be possible to use strapplyc from the gsubfn package, in which case an even simpler regular expression could be used:
library(gsubfn)
strapplyc(string, "\\d", simplify = TRUE)[1]
giving the same result. Alternatively, use this, which gives the same answer but also works if string is a vector of strings:
sapply(strapplyc(string, "\\d"), "[[", 1)
Get the locations of the digits
tmp <- gregexpr("[0-9]", string)
iloc <- unlist(tmp)[1]
Extract the first digit
as.numeric(substr(string,iloc,iloc))
Using regexpr is simpler
tmp<-regexpr("[0-9]",string)
if(tmp[[1]]>=0) {
iloc <- tmp[1]
num <- as.numeric(substr(string,iloc,iloc))
}
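The regexpr approach also extends to a vector of strings, including strings with no digit at all. A minimal sketch, assuming NA is the desired result for such strings; the input vector and the names pos and first_digit are made up for illustration:

```r
string <- c("ABC3JFD456", "ARST4DS324", "nodigits")

# regexpr gives the position of the first match per element, or -1 if none
pos <- regexpr("[0-9]", string)

# Start with NA everywhere, then fill in the elements that had a match;
# regmatches returns only the matched substrings, in input order
first_digit <- rep(NA_integer_, length(string))
first_digit[pos > 0] <- as.integer(regmatches(string, pos))

first_digit
# [1]  3  4 NA
```

Pre-filling with NA avoids the if branch and keeps the result the same length as the input.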
Using the rex package may make this type of task a little simpler.
library(rex)
string = c("ABC3JFD456", "ARST4DS324")
re_matches(string,
rex(
capture(name = "first_number", digit)
)
)
#> first_number
#> 1 3
#> 2 4
> which( sapply( strsplit(string, ""), grepl, patt="[[:digit:]]"))[1]
[1] 4
Or
> gregexpr("[[:digit:]]", string)[[1]][1]
[1] 4
So:
> splstr <- strsplit(string, "")
> splstr[[1]][ which( sapply( splstr, grepl, patt="[[:digit:]]"))[1] ]
[1] "3"
Note that a full result from a gregexpr call is a list, hence the need to extract its first element with "[[":
> gregexpr("[[:digit:]]", string)
[[1]]
[1] 4 8 9 10
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
A gsub solution that is based on replacing the substrings preceding and following the first digit with the empty string:
gsub("^\\D*(?=\\d)|(?<=\\d).*", "", string, perl = TRUE)
# [1] "3"

Why does strsplit return a list

Consider
text <- "who let the dogs out"
fooo <- strsplit(text, " ")
fooo
[[1]]
[1] "who" "let" "the" "dogs" "out"
The output of strsplit is a list. The list's first element is a vector that contains the words above.
Why does the function behave that way? Is there any case in which it would return a list with more than one element?
And I can access the words using
fooo[[1]][1]
[1] "who"
But is there no simpler way?
To your first question, one reason that comes to mind is so that it can keep different length result vectors in the same object, since it is vectorized over x:
text <- "who let the dogs out"
vtext <- c(text, "who let the")
##
> strsplit(text, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
> strsplit(vtext, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
[[2]]
[1] "who" "let" "the"
If this were returned as a data.frame, matrix, etc. instead of a list, the shorter results would have to be padded with additional elements.
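As for a simpler way to reach the words: when the input is a single string, the list wrapper can be peeled off once and the result indexed as an ordinary vector. A short sketch (the names words and first_words are illustrative):

```r
text <- "who let the dogs out"
vtext <- c(text, "who let the")

# Single input string: take the only list element once
words <- strsplit(text, " ")[[1]]
words[1]
# [1] "who"

# Several input strings: the pieces have different lengths,
# so they cannot form a rectangular structure without padding
lengths(strsplit(vtext, " "))
# [1] 5 3

# First word of every input string
first_words <- sapply(strsplit(vtext, " "), `[`, 1)
first_words
# [1] "who" "who"
```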

Modifying elements in a vector that do not match a pattern

I have a vector with entries like these:
[5484] "QUERY___05nirs_WM_WATMOP079"
[5485] "QUERY___05nirs_WM_WATMAP075"
[5486] "QUERY___05nirs_WM_WATMAP037"
[5487] "QUERY___05nirs_WM_WATMOP071"
[5488] "QUERY___03nirs_WM_WATMAP168"
[5489] "2022819637_Scalindua_MAnamSca741_C384"
[5490] "237637177_clone_PeruG11"
[5491] "237637158_clone_PeruD2"
[5492] "237637172_clone_PeruD12"
[5493] "237637168_clone_PeruE11"
I would like to prepend "QUERY___" to those elements that do not contain it already. I figured out how to get a logical vector with grepl that tells me which elements do not have "QUERY", but I have no idea how to use that vector to change the original vector.
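Try the following. Note that [^QUERY] is a character class matching any single character other than Q, U, E, R, or Y, not a negation of the word QUERY, so this works here only because the unprefixed entries start with digits: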
gsub("(^[^QUERY].*)","QUERY___\\1",string)
#[1] "QUERY___05nirs_WM_WATMOP079"
#[2] "QUERY___05nirs_WM_WATMAP075"
#[3] "QUERY___05nirs_WM_WATMAP037"
#[4] "QUERY___05nirs_WM_WATMOP071"
#[5] "QUERY___03nirs_WM_WATMAP168"
#[6] "QUERY___2022819637_Scalindua_MAnamSca741_C384"
#[7] "QUERY___237637177_clone_PeruG11"
#[8] "QUERY___237637158_clone_PeruD2"
#[9] "QUERY___237637172_clone_PeruD12"
#[10] "QUERY___237637168_clone_PeruE11"
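For a prefix that must be tested as a whole rather than character by character, a negative lookahead is one option. A minimal sketch, assuming PCRE via perl = TRUE; the short vector x is made up and includes a hypothetical entry starting with "E", which a character-class pattern such as ^[^QUERY] would skip:

```r
x <- c("QUERY___05nirs_WM_WATMOP079",
       "237637177_clone_PeruG11",
       "Ecoli_clone_7")          # hypothetical entry starting with "E"

# Character-class version: "E" is excluded by [^QUERY],
# so the last entry is left without a prefix
naive <- gsub("(^[^QUERY].*)", "QUERY___\\1", x)

# Lookahead version: insert the prefix at the start of the string
# unless "QUERY___" is already there
fixed <- sub("^(?!QUERY___)", "QUERY___", x, perl = TRUE)
fixed
# [1] "QUERY___05nirs_WM_WATMOP079"     "QUERY___237637177_clone_PeruG11"
# [3] "QUERY___Ecoli_clone_7"
```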
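Here's a simple approach. Note that it puts the already-prefixed elements first, so the original order is kept only when those elements already come first, as they do here: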
string <- c("QUERY___05nirs_WM_WATMOP079",
"QUERY___05nirs_WM_WATMAP075",
"QUERY___05nirs_WM_WATMAP037",
"QUERY___05nirs_WM_WATMOP071",
"QUERY___03nirs_WM_WATMAP168",
"2022819637_Scalindua_MAnamSca741_C384",
"237637177_clone_PeruG11",
"237637158_clone_PeruD2",
"237637172_clone_PeruD12",
"237637168_clone_PeruE11")
> ind <- grepl("^QUERY___", string)
> ( string2 <- c(string[ind], paste0("QUERY___", string[!ind])) )
#[1] "QUERY___05nirs_WM_WATMOP079" "QUERY___05nirs_WM_WATMAP075"
#[3] "QUERY___05nirs_WM_WATMAP037" "QUERY___05nirs_WM_WATMOP071"
#[5] "QUERY___03nirs_WM_WATMAP168" "QUERY___2022819637_Scalindua_MAnamSca741_C384"
#[7] "QUERY___237637177_clone_PeruG11" "QUERY___237637158_clone_PeruD2"
#[9] "QUERY___237637172_clone_PeruD12" "QUERY___237637168_clone_PeruE11"
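To change the original vector in place and keep every element at its position, the logical vector can be used on the left-hand side of an assignment. A small sketch with a shortened, reordered input (made up for illustration):

```r
string <- c("237637177_clone_PeruG11",
            "QUERY___05nirs_WM_WATMOP079",
            "237637158_clone_PeruD2")

# TRUE where the prefix is already present
ind <- grepl("^QUERY___", string)

# Overwrite only the elements that lack the prefix
string[!ind] <- paste0("QUERY___", string[!ind])
string
# [1] "QUERY___237637177_clone_PeruG11" "QUERY___05nirs_WM_WATMOP079"
# [3] "QUERY___237637158_clone_PeruD2"
```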

strsplit inconsistent with gregexpr

A regex suggested in a comment on my answer to this question should give the desired result with strsplit, but it does not, even though it seems to correctly match the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation for strsplit:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
In other words, at each iteration ^ matches the beginning of a new, shorter string (the original string without the preceding items).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
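The documented loop can be written out in R to watch it reproduce the result. This is a rough sketch of the algorithm as quoted above, not the actual implementation of strsplit; the function name split_like_doc is made up, and zero-length matches are deliberately not handled. Because the remaining string is matched from scratch at each step, ^ anchors to the start of whatever is left. If the goal is to split only at the positions gregexpr reports, regmatches with invert = TRUE returns the pieces between those matches.

```r
split_like_doc <- function(s, pattern) {
  out <- character(0)
  while (nchar(s) > 0) {
    m <- regexpr(pattern, s, perl = TRUE)
    if (m[1] == -1) {
      # No match left: keep the remainder and stop
      out <- c(out, s)
      break
    }
    len <- attr(m, "match.length")
    if (len == 0) stop("zero-length matches are not handled in this sketch")
    # Keep what is left of the match, then drop the match and everything before it
    out <- c(out, substr(s, 1, m[1] - 1))
    s <- substr(s, m[1] + len, nchar(s))
  }
  out
}

x <- "123,34,56,78,90"
pat <- '^\\w+\\K,|,(?=\\w+$)'

looped <- split_like_doc(x, pat)
looped
# [1] "123" "34"  "56"  "78"  "90"

# Split at the two positions found on the full string instead
pieces <- regmatches(x, gregexpr(pat, x, perl = TRUE), invert = TRUE)[[1]]
pieces
# [1] "123"      "34,56,78" "90"
```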

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
Use regexec instead, since it returns a list, which allows you to catch the character(0) elements before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
As of R 3.3.0, it is possible to pull out both the matched and the non-matched substrings using the invert=NA argument. The help file says:
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In most cases of interest (matching a single pattern with regexpr), each element has length 1 or 3: length 1 when no match is found, and length 3 when there is a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
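The type difference is easy to check, and it matters more with vapply, which is stricter than sapply about the type each call returns. A minimal sketch (swapping in vapply is a substitution for illustration, not part of the answer; the names parts and res are made up, and invert = NA needs R 3.3.0 or later):

```r
typeof(NA)              # "logical"
typeof(NA_character_)   # "character"

x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)
parts <- regmatches(x, m, invert = NA)

# vapply checks that every result is a length-one character vector,
# so the typed NA_character_ is required here
res <- vapply(parts,
              function(p) if (length(p) == 1) NA_character_ else p[2],
              character(1))
res
# [1] "a"  NA   "a"  "aa"
```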
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match, and it has a simpler interface than regmatches().