Applying Regex Across a Vector - regex

I'm at a loss why the following code doesn't work. The intention is to input a vector of strings, some of which can be converted to a number, some can't. The following 'sapply' function should use a regex to match numbers and then return the number or (if not) return the original.
sapply(c("test","6","-99.99","test2"), function(v){
if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$",v)){as.numeric(v)} else {v}
})
Which returns the following result:
"test" "6" "-99.99" "test2"
Edit: What I expect the code to return:
"test" 6 -99.99 "test2"
I can run the if statement on each element successfully.
> if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$","test")){as.numeric("test")} else {"test"}
[1] "test"
> if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$","6")){as.numeric("6")} else {"6"}
[1] 6
And so on.
I don't understand why this is happening. I guess I have two questions. One: Why is this happening? And two: Usually I'm pretty good at troubleshooting, but I have no idea where to even look for this. If you know the problem, how did you find/know the solution? Should I open up the internal lapply function code?

That happens because sapply simplifies its result to an atomic vector, and an atomic vector can only hold one type, so the numbers are coerced back to character. If you use lapply instead, you get a list, which can hold mixed types. The same code with lapply in place of sapply works the way you want.

@Jeremy points in the right direction: you can use lapply, which returns a list. Or you can tell sapply not to simplify the result with simplify = FALSE. The documentation (?sapply) says:
If simplification occurs, the output type is determined from the
highest type of the return values in the hierarchy NULL < raw <
logical < integer < double < complex < character < list < expression,
after coercion of pairlists to lists.
out <- sapply(c("test","6","-99.99","test2"), function(v){
if(grepl("^[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?$",v)){
as.numeric(v)
} else {
v
}
}, simplify = FALSE)
> out
$test
[1] "test"
$`6`
[1] 6
$`-99.99`
[1] -99.99
$test2
[1] "test2"
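To see the coercion directly, here is a minimal sketch comparing the two calls. The names num_pattern and convert are mine, and the pattern escapes the dot (\\.), which is a small change from the regex in the question.

```r
x <- c("test", "6", "-99.99", "test2")

# Same idea as the regex in the question, but with the dot escaped
# (an assumption on my part about what was intended).
num_pattern <- "^[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?$"
convert <- function(v) if (grepl(num_pattern, v)) as.numeric(v) else v

simplified <- sapply(x, convert)  # simplified to one atomic vector
as_list <- lapply(x, convert)     # list, each element keeps its own type

class(simplified)                 # a single type for the whole vector
sapply(as_list, class)            # one type per list element
```

The list keeps the numeric elements as numeric, while the simplified vector falls back to character, the highest type in the hierarchy quoted above.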

Related

How do I match Regex Pattern on List to filter out decimal elements in Scala

I am wondering how, without creating a function, I can filter out the numbers from a list that contains both numbers and strings:
val a = sc.parallelize(List("cat","horse",4.0,3.5,2,"dog"))
I believe what I am really asking is how to use regex in Scala to find elements that match a pattern.
----Updated on 20180302 11pm EST:
Thanks to @Nyavro, whose answer is the closest; I slightly modified it as below:
val doubles = a.collect {
case v: Double => v
case v: Int => v
}
Now I get:
res10: Array[Double] = Array(4.0, 3.5, 2.0)
Just curious: can types be mixed in a collect result in Scala?
Thanks to all replies.
Use collect:
val doubles = a.collect {
case v: Double => v
}
To filter for elements of type Int and Double, and to retain their respective types, you might try this.
a.flatMap {
case v: Int => Some(v)
case v: Double => Some(v)
case _ => None
}
//res0: List[AnyVal] = List(4.0, 3.5, 2)
To help understand why this is a really bad idea, read this question, and its answers.
You can use isInstanceOf to check whether an element of your list is a string.
val l = List("cat","horse",4.0,3.5,2,"dog")
l.filter(_.isInstanceOf[String])
>> List[Any] = List(cat, horse, dog)
Regex is (largely) irrelevant here because you do not have strings, you have a List[Any] that you're turning into an RDD[Any]. (The RDD is largely irrelevant here, too, except RDDs have no filterNot and Lists do--I can't tell if you want to keep the strings or drop the strings.)
Note also that filter takes a function as an argument--having some function here is inescapable, even if it's anonymous, as it is in my example.
I have an inkling, though, that I've given an answer to the opposite of what you're asking, and you have an RDD[String] that you want to convert to RDD[Double], throwing away the strings that don't convert. In that case, I would try to convert the strings to doubles, wrapping that in a Try and check for success, using the result to filter:
import scala.util.Try
def isDouble(s: String) = Try(s.toDouble).isSuccess

R - looping across 2 objects

I'm trying to do something fairly simple (I think) but I can't get my head round it. I'm trying to write a loop that checks if a character variable in a data frame contains any of a certain list of substrings, and to assign a corresponding value to a dummy variable.
So, imagine a data.frame, n=2000, with a variable data.frame$text. Furthermore, I have a character vector containing all the substrings I want to test data.frame$text for. Let's call it hillary_exists:
hillary_exists <- c("Hilary Clinton", "hilary clinton","hilaryclinton", "hillaryclinton", "HilaryClinton",
"HillaryClinton","Hillary Clinton", "Hillary Rodham Clinton", "Hillary", "Hilary", "#Hillary2016", "#ImWithHer",
"Hillary2016", "hillary", "hilary", "Clinton 2016", "Clinton", "Secretary of State Clinton",
"Senator Clinton", "Hilary Rodham", "Hilary Rodham Clinton", "Hilary Rodham-Clinton", "Hillary Rodham-Clinton")
Now, I want my loop to test every row of data.frame$text for the existence of every element of hillary_exists, and if any of them is TRUE, to generate a new value of 1 for the variable data.frame$hillary_mention . This is what I tried:
for(i in hillary_exists){
if(grepl(hillary_exists[i], data.frame$text)){
data.frame$hillary_mention <- 1
} else {
data.frame$hillary_mention <- 0 }
}
But obviously I'm missing the i component for the data.frame$text element, but I don't know how to address it.
Any help would be greatly appreciated! Thanks
One approach we can use to get this to work is to turn hillary_exists into a regex: hillary_regex <- paste(hillary_exists, collapse = "|"). Essentially, this just takes all of your terms and turns it into a big OR statement. This takes care of one of the loops for us automatically. Next, we just loop over our text column, data.frame$text, using sapply.
data.frame$hillary_mention <- sapply(data.frame$text, function(s) grepl(hillary_regex, s, ignore.case = TRUE))
It's good to use ignore.case = TRUE here because there may be mentions in the text that aren't accounted for in hillary_exists, such as "hIllary cLinTon".
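Since grepl is already vectorised over its text argument, the sapply wrapper is optional. Here is a small sketch on made-up data (the data frame df and its three rows are hypothetical, and hillary_exists is shortened), which also converts the result to the 1/0 dummy asked for in the question:

```r
# Hypothetical toy data standing in for the real data frame.
df <- data.frame(
  text = c("I saw Hillary Clinton speak",
           "Nothing relevant here",
           "#ImWithHer all the way"),
  stringsAsFactors = FALSE
)

# Shortened list of search terms, for illustration only.
hillary_exists <- c("Hillary Clinton", "Hilary", "#ImWithHer", "Clinton")

# Join the terms into one alternation: term1|term2|...
hillary_regex <- paste(hillary_exists, collapse = "|")

# grepl tests every row at once; as.integer turns TRUE/FALSE into 1/0.
df$hillary_mention <- as.integer(grepl(hillary_regex, df$text, ignore.case = TRUE))
df$hillary_mention
```

If any search term contained regex metacharacters such as . or (, it would need escaping before being pasted into the pattern.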

R: indexing result from a regular expression

I am trying to use the indexes that were returned from searching through a string for every instance of a character. When I use gregexpr(pattern, text),
lookfor<-"n"
string<-"ATTnGGCnATTn"
gregexpr(pattern=lookfor,text=string)
I get the following:
[[1]]
[1] 4 8 12
attr(,"match.length")
[1] 1 1 1
attr(,"useBytes")
[1] TRUE
How do I index through the first line to be able to use those locations? Thank you in advance for your help!
Addition (2) : After thinking about this for a while, I came to the conclusion that you could have simply used unlist on your original gregexpr call
> unlist(gregexpr("n", string))
# [1] 4 8 12
From your comment
I am looking for the position of each letter n
it follows that you could do any of these:
> which(strsplit(string, "")[[1]] == "n")
# [1] 4 8 12
> cumsum(nchar(strsplit(string, "n")[[1]])+1)
# [1] 4 8 12
> nc <- 1:nchar(string)
> which(substring(string, nc, nc) == "n")
# [1] 4 8 12
Addition (1) in regards to the similar strings (comment in another answer) : You could use strsplit again, and locate those values with one of the methods above
> string2 <- "ATTTGGCCATTG"
> w <- which(strsplit(string, "")[[1]] == "n")
> strsplit(string2, "")[[1]][w]
[1] "T" "C" "G"
If you want to extract all the matches, you can use the builtin function regmatches()
m <- gregexpr(lookfor,string)
regmatches(string,m)
This will return a list of character vectors because string can be greater than length 1. If you're only passing one string in, you can get at the vector of matches bypassing the list with
regmatches(string,m)[[1]]
Here is a step-by-step method to find the indices. I suspect there are more efficient ways to achieve the same result. The argument fixed = TRUE tells R to look for the literal lower case "n" rather than treat it as a regular expression.
The [[1]] at the end keeps only the first element of the list that gregexpr returns, which holds the match positions for the first (here, only) string.
To get all the indices, subset from 1 to the length of that vector; subsetting also drops the extra attributes.
string <- "ATTnGGCnATTn"
index <- gregexpr(pattern = "n", text = string, fixed = TRUE)[[1]]
all.indices <- index[1:length(index)]
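One thing to watch for: when there is no match, gregexpr returns -1 rather than an empty vector. A rough sketch of a helper that handles this case (find_positions is a hypothetical name, and it assumes a single input string):

```r
# Hypothetical helper: positions of a literal pattern in ONE string.
find_positions <- function(pattern, text) {
  pos <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  pos <- as.integer(pos)  # as.integer drops the match.length attributes
  pos[pos > 0]            # gregexpr reports "no match" as -1
}

find_positions("n", "ATTnGGCnATTn")
find_positions("x", "ATTnGGCnATTn")
```

The first call gives the three positions of "n"; the second gives an empty integer vector instead of -1, which is safer to use as an index.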
I think I figured out my own answer!
Using the Biostrings package in Bioconductor:
library(Biostrings)
string<-"ATTnGGCnATTn"
matches<-matchPattern(pattern="n", subject=string)
m<-as.vector(ranges(matches))
Now I can call up each position of "n" in all similar strings. Thank you to those who took the time to answer.

Getting distance between two words in R

Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating if the word know is within 6 words of help.
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
x <- strsplit(string, " ")[[1]]
abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
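One possible way to deal with those cases is to compare whole words instead of using grep, and to check every pair of positions. This is only a sketch: within_range_exact is a hypothetical name, and punctuation is not stripped, so "along." and "along" count as different words.

```r
within_range_exact <- function(string, term1, term2, threshold = 6) {
  words <- strsplit(string, "\\s+")[[1]]
  pos1 <- which(words == term1)  # exact matches only, so "you" != "your"
  pos2 <- which(words == term2)
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
  # outer() gives the distance for every pair of occurrences
  any(abs(outer(pos1, pos2, "-")) <= threshold)
}

string <- "thanks so much for your help all along. i'll let you know when...."
within_range_exact(string, "help", "know")
within_range_exact(string, "thanks", "know")
```

Returning TRUE when any pair of occurrences is close enough is one choice; you might prefer the minimum distance itself, depending on what you need.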
You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to find the positions of your two terms and taking the difference of the indexes (array positions).
edit: Okay, I thought about it a second and perhaps this will work for you as a regex pattern:
\bhelp(\s+[^\s]+){0,5}\s+know\b
This uses the same "space is the delimiter" concept. It first matches help, then greedily up to 5 " word" groups, then looks for " know" (since "know" would be at most the 6th word after help). Note that it only finds know when it comes after help.
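In R, a pattern of this kind could be tried with grepl and perl = TRUE. The sketch below uses {0,5} for the words in between, so adjacent words also count, and \\S as shorthand for [^\\s]; near_pattern and too_far are names made up for the example.

```r
# "help", then at most 5 other words, then "know" (one direction only).
near_pattern <- "\\bhelp(\\s+\\S+){0,5}\\s+know\\b"

string <- "thanks so much for your help all along. i'll let you know when...."
grepl(near_pattern, string, perl = TRUE)

# Six words sit between the two terms here, so this should not match.
too_far <- "help one two three four five six know"
grepl(near_pattern, too_far, perl = TRUE)
```

To cover both orders, you would need a second pattern with the terms swapped, or the split-and-index approach.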
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an indices vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT In a function
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
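The loop in the question fails because a fixed "#1" also matches the start of "#10" and "#11". If you want the pattern to follow the values in vec rather than hard-coding [1-2], one way is to build it with paste0. This is a sketch; pattern and result are names I made up.

```r
original <- c("#1", "#2", "#10", "#11")
vec <- c(1, 2)

# Build "^#(1|2)$" from vec: anchored at both ends, so "#10" cannot match.
pattern <- paste0("^#(", paste(vec, collapse = "|"), ")$")

# \\1 puts back the captured number without the hash.
result <- sub(pattern, "\\1", original)
result
```

sub is enough here, since an anchored pattern can match at most once per string.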
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to explicitly use the values of your vector vec:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...), which matches # at the start of a string as long as it is not followed by whatever is in the ... part; in this case, two digits (or equivalently more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"