Unexpected result of negative lookahead on word (R regex)

I'm trying to create rules for a sentence that contains "dog" but not "cat". I would like the function to return FALSE since the string contains both "dog" and "cat".
Using negation:
grepl("cat.*[^dog]", "asdfasdfasdf cat adsfafds dog", perl=T)
Using negative lookahead:
grepl("cat.*(?!dog)", "asdfasdfasdf cat adsfafds dog", perl=T)
Using str_detect function in the stringr package
require(stringr)
str_detect("asdfasdfasdf cat adsfafds dog", "cat.*(?!dog|$)")
All three of these methods return TRUE.

You can use this regex to find strings that contain cat but not dog:
^((cat((?!dog).)*)|(((?!dog).)*?cat((?!dog).)*)+)$
It's based on the answer here. It takes into account that dog can come before or after cat.
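The problem with all of your attempts is that cat.* finds cat and then .* consumes everything after it, including dog. The trailing (?!dog) is then tested at the end of the string, where nothing follows, so it succeeds.
Also, you forgot to handle the case where dog comes before cat.
As Druzion points out, character classes are not the way to go: [^dog] matches a single character that is not d, o or g; it does not negate the word dog.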
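A minimal sketch of why the lookahead attempt returns TRUE, followed by one anchored alternative. It assumes the goal stated in the question (the string contains dog but not cat); the helper name has_dog_not_cat is hypothetical.

```r
s <- "asdfasdfasdf cat adsfafds dog"

# .* runs to the end of the string and nothing follows the end,
# so (?!dog) is satisfied there and the overall match succeeds.
grepl("cat.*(?!dog)", s, perl = TRUE)

# Anchoring at ^ makes each lookahead scan the whole string:
# (?!.*cat) forbids "cat" anywhere, (?=.*dog) requires "dog" somewhere.
has_dog_not_cat <- function(x) grepl("^(?!.*cat)(?=.*dog)", x, perl = TRUE)

has_dog_not_cat(s)             # contains both words
has_dog_not_cat("a dog only")  # dog without cat
has_dog_not_cat("a cat only")  # cat without dog
```

Under these assumptions the first call reproduces the TRUE from the question, and the three helper calls give FALSE, TRUE and FALSE.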

A simple solution is to write a function that checks:
i) If the string contains both cat and dog, then return FALSE
ii) otherwise, return TRUE
R Code
cat_dog <- function(x) { if (length(grep("(?=.*cat)(?=.*dog)", x, perl = TRUE)) != 0) {return(FALSE)} else {return(TRUE)} }
Updated Code
cat_dog <- function(x) { if (length(grep("(?=.*dog)", x, perl = TRUE)) != 0) {if (length(grep("(?=.*cat)", x, perl = TRUE)) != 0) {return(FALSE)} else {return(TRUE)}} else {return(FALSE)}}
Ideone Demo
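The same check can be sketched without lookaheads and made to work on a whole character vector. This is only an illustration, assuming the "dog but not cat" reading of the question; dog_without_cat is a hypothetical name.

```r
# Hypothetical helper: TRUE where "dog" occurs and "cat" does not.
# fixed = TRUE treats both patterns as literal substrings.
dog_without_cat <- function(x) {
  grepl("dog", x, fixed = TRUE) & !grepl("cat", x, fixed = TRUE)
}

dog_without_cat(c("asdfasdfasdf cat adsfafds dog",
                  "just a dog",
                  "just a cat",
                  "neither"))
```

Because grepl is vectorised, the result is one logical value per input string; here that is FALSE, TRUE, FALSE, FALSE.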

Related

Regex - Match Words which are not Strings

I am trying to distinguish between words and strings. I managed to get strings working, but I can't quite figure out how to only match words which are not surrounded by double quotes:
So I want this to match:
test
But this shouldn't match:
"test"
This is what I have so far:
[^\"][a-zA-Z]*[^\"]
It still gets the test although it is surrounded by double quotes.
Input: "\"this is a string\" word"
Expected Output: word
Any suggestions?
How about it?
assert("\"<quoted>\" word".words == listOf("word"))
assert("head \"<quoted>\" word".words == listOf("head", "word"))
assert("head\"<quoted>\"word".words == listOf("head", "word"))
assert("\"<escaped\\\"quoted>\"".words == emptyList())
assert("; punctuations , ".words == listOf("punctuations"))
inline val String.words get() = dropStrings().split("[^\\p{Alpha}]+".toRegex())
.filter { it.isNotBlank() }
@Suppress("NOTHING_TO_INLINE")
inline fun String.dropStrings() = replace("\"(\\[\"]|.*)?\"".toRegex(), " ")
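The answer above is Kotlin. The same strip-then-extract idea can be sketched in R, the language used elsewhere on this page. This is a rough illustration only: words_outside_quotes is a hypothetical name, and unlike the Kotlin tests it makes no attempt to handle escaped quotes.

```r
# Hypothetical helper: words that sit outside double-quoted segments.
words_outside_quotes <- function(x) {
  # Replace each "..." segment with a space (escaped quotes are not handled)
  stripped <- gsub('"[^"]*"', " ", x)
  # Collect the remaining runs of letters
  regmatches(stripped, gregexpr("[[:alpha:]]+", stripped))[[1]]
}

words_outside_quotes('"this is a string" word')
words_outside_quotes('head"<quoted>"word')
```

For the input from the question this yields the single word word; the second call yields head and word.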

Keep string up to first occurrence of pattern in R

I would like to keep the string up to the first occurrence of the following pattern: lower case letter followed by upper case, followed by lower case again.
For example
"This is My testString, how to keepUntil test"
I would like to return This is My test
This is what I have tried unsuccessfully so far:
library("magrittr")
"This is My testString, how to keepUntil test" %>% gsub("(.*[a-z])[A-Z][a-z]?.*", "\\1", .)
We can use strsplit
strsplit(str1, "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]][1]
#[1] "This is My test"
or with sub
sub("([A-Za-z ]+[a-z])[A-Z].*", "\\1", str1)
#[1] "This is My test"
data
str1 <- "This is My testString, how to keepUntil test"
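You can use a recursive function with a capturing group to always extract the first (leftmost) instance of the pattern, however many such sections the text has. Because .* is greedy, each call cuts the text at the last lower-case/upper-case boundary; the function then calls itself on what remains until no boundary is left.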
regex <- "^(.*[a-z])[A-Z].*$"
text <- "This is My testString, how to keepUntil test"
library(stringr)
ExtractFirstPart <- function(Text, Regex) {
  firstpart <- str_match(Text, Regex)[2]
  if (is.na(firstpart)) {
    return(Text)
  } else {
    firstpart <- ExtractFirstPart(firstpart, Regex)
    return(firstpart)
  }
}
Using this function, you will get:
> ExtractFirstPart(text,regex)
[1] "This is My test"
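A lazy quantifier gives the same result without recursion. The sketch below also checks the trailing lower-case letter that the question asks for; keep_until_camel is a hypothetical name, not part of any answer above.

```r
str1 <- "This is My testString, how to keepUntil test"

# Hypothetical helper: .*? grows one character at a time, so the first
# lower-UPPER-lower triple ends the captured prefix.
keep_until_camel <- function(x) {
  sub("^(.*?[a-z])[A-Z][a-z].*$", "\\1", x, perl = TRUE)
}

keep_until_camel(str1)
keep_until_camel("no camel case here")  # no match: returned unchanged
```

When the pattern does not match, sub leaves the input as it is, so strings without such a boundary pass through untouched.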

Combine regex 'or' with stop at first occurrence

Conceptually, I want to search for (a|b) and get only the first occurrence. I know this is a lazy/non-greedy application, but can't seem to combine it properly with the or.
Moving beyond the conceptual level, which might change things a lot, a and b are actually longer patterns, but they have been tested separately and work fine. And I'm using this in strapply from package gsubfn which intrinsically finds all matches.
I suspect the answer is here in SO somewhere, but it's hard to search on such things.
Details: I'm trying to find function expressions var functionName = function(...) and function declarations function functionName(...) and extract the name of the function in javascript (parsing the lines with R). a is \\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i] and b is \\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]. They work fine individually. A single function definition will take one form or the other, so I need to stop searching when I find one.
EDIT: In the string "Here is a string of blah blah blah" I'd like to find only the first 'a' using (a|b), or only the first 'b' using (b|a), plus of course whatever regex goodies I am missing.
EDIT 2: A big thanks to all who have looked at this. The details turn out to be important, so I'm going to post more info. Here are the test lines I am searching:
dput(lines)
c("var activateBrush = function() {", " function brushed() { // Handles the response to brushing",
" var followMouse = function(mX, mY) { // This draws the guides, nothing else",
".x(function(d) { return xContour(d.x); })", ".x(function(i) { return xContour(d.x); })"
)
Here are the two patterns I want to use, and how I use them individually.
fnPat1 <- "\\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat1, replacement = paste0, X = lines))
fnPat2 <- "\\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat2, replacement = paste0, X = lines))
They return, in order:
[1] "brushed" "brushed"
[1] "activateBrush" "followMouse" "activateBrush" "followMouse"
What I want to do is use both of these patterns at the same time. What I tried was
fnPat3 <- paste("((", fnPat1, ")|(", fnPat2, "))") # which is (a|b) of the orig. question
But that returns
[1] " activateBrush = function() " " function brushed() "
What I want is a vector of all the function names, namely c("brushed", "activateBrush", "followMouse") Duplicates are fine, I can call unique.
Maybe this is clearer now, maybe someone sees an entirely different approach. Thanks everyone!
To match the first a or b,
> x <- "Here is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "a"
> x <- "Here b is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "b"
You can check with sub whether the pattern hits the first a or b. Below, sub replaces the first a or b with ***. Unlike gsub, sub does not replace globally: it replaces only the first occurrence that matches the pattern.
> x <- "Here is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here *** is a string of blah blah blah"
We could also use gregexpr or gsub, provided the pattern itself is restricted to the first occurrence.
> x <- "Here is a string of blah blah blah"
> m <- gregexpr("^[^ab]*\\K[ab]", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here *** is a string of blah blah blah"
Explanation:
^ asserts that we are at the start of the string.
[^ab]* is a negated character class that matches any character other than a or b, zero or more times. We use * rather than + because a or b may be the very first character.
\K discards everything matched so far, so the characters consumed by [^ab]* are not part of the reported match.
[ab] then matches the a or b that follows.
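For the conceptual (a|b) case, a short sketch with regexpr and plain alternation. It relies only on the fact that regexpr reports the leftmost match; the second pattern is an illustrative assumption, not one of the asker's patterns.

```r
x <- "Here is a string of blah blah blah"

# regexpr() reports only the leftmost match, so alternation needs no
# lazy quantifier to stop at the first occurrence.
m <- regexpr("a|b", x)
as.integer(m)     # start position of the first match
regmatches(x, m)  # the matched text

# The same holds when the alternatives are longer patterns
regmatches(x, regexpr("blah|string", x))
```

Here the first match is the a at position 9, and the longer alternation returns string because it occurs before the first blah.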
It seems to me this would be a lot easier if you combined the expressions ...
strapply(lines, '(?:var|function)\\s*([[:alnum:]]+)', simplify = c)
# [1] "activateBrush" "brushed" "followMouse"
(?: ... ) is a non-capturing group: placing ?: at the start groups the alternatives without capturing them. The pattern reads: group, but do not capture, "var" or "function"; then, after optional whitespace, capture the alphanumeric characters that follow.
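A base-R sketch of the same combined pattern, for cases where gsubfn is not available. It is an illustration under assumptions: it uses regexec, which reports only the first match per line, a capturing group for the keyword, and at least one whitespace character after it.

```r
lines <- c("var activateBrush = function() {",
           " function brushed() { // Handles the response to brushing",
           " var followMouse = function(mX, mY) { // This draws the guides, nothing else",
           ".x(function(d) { return xContour(d.x); })",
           ".x(function(i) { return xContour(d.x); })")

# Group 1 is the keyword, group 2 the name that follows it.
pat <- "(var|function)[[:space:]]+([[:alnum:]]+)"
m <- regmatches(lines, regexec(pat, lines))

# A matching line yields c(full match, keyword, name);
# a non-matching line yields character(0) and is dropped by unlist().
fn_names <- unlist(lapply(m, function(g) if (length(g) == 3) g[3]))
fn_names
```

With the test lines from the question this gives activateBrush, brushed and followMouse; the two anonymous .x(function(...)) lines contribute nothing.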
Try str_extract() from stringr package.
str_extract("b a", "a|b")
[1] "b"
str_extract("a b", "a|b")
[1] "a"
str_extract(c("a b", "b a"), "a|b")
[1] "a" "b"

Removing repeating substrings from within a string in R

Is there any way (using regular expressions such as gsub or other means) to remove repetitions from a string?
Essentially:
a = c("abc, def, def, abc")
f(a)
#[1] "abc, def"
One obvious way is to strsplit the string, get unique strings and stitch them together.
paste0(unique(strsplit(a, ",[ ]*")[[1]]), collapse=", ")
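The one-liner above handles a single string. A sketch of a vectorised wrapper, assuming items are separated by a comma and optional whitespace; dedupe_items and its sep argument are hypothetical names.

```r
# Hypothetical helper: drop repeated comma-separated items in each string.
dedupe_items <- function(x, sep = ", ") {
  parts <- strsplit(x, ",[[:space:]]*")  # one vector of items per input string
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

dedupe_items(c("abc, def, def, abc", "x, x, y"))
```

unique keeps the first occurrence of each item, so the original order is preserved; the call above gives "abc, def" and "x, y".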
You can also use stringr::str_extract_all
require(stringr)
unique(unlist(str_extract_all(a, '\\w+')))
You can also use this function based on gsub. I was not able to do it directly with a single regular expression.
f <- function(x) {
  pattern <- "(.+)(.+)?\\1"
  x <- gsub(pattern, "\\1\\2", x, perl = TRUE)
  # Recurse until no repeated substring remains
  if (grepl(pattern, x, perl = TRUE)) f(x) else x
}
b <- f(a)
b
[1] "abc, def"

R: Capitalizing everything after a certain character

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
Should come out like this:
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.
You were very close:
gsub("(_.*)","\\U\\1",x,perl=TRUE)
seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...
To take this apart a bit more:
_.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
\\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
\\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))
For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
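A short sketch of the replacement escapes described above. The first two calls come from the answer; the third, using \\E to end the conversion, is an added illustration with a made-up input string.

```r
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")

# Upper-case the first underscore and everything after it
gsub("(_.*)", "\\U\\1", x, perl = TRUE)

# Keep a literal g and upper-case only what follows it
gsub("g(.*)", "g\\U\\1", "mgh_3_3_f", perl = TRUE)

# \\E ends the conversion: capitalize just the first letter after each underscore
gsub("_([a-z])([a-z]*)", "_\\U\\1\\E\\2", "nyc_abc_def", perl = TRUE)
```

The second call gives "mgH_3_3_F" and the third gives "nyc_Abc_Def". All three need perl = TRUE; without it the case-conversion escapes are not interpreted.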
gsubfn in the gsubfn package is like gsub except that the replacement can be a function. Here we match _ and everything after it, and feed the match through toupper:
library(gsubfn)
gsubfn("_.*", toupper, x)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
Note that this approach involves a particularly simple regular expression.
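The same match-then-transform idea can be sketched in base R with regexpr and the replacement form of regmatches, in case an extra package is not wanted. The extra element "a" is added here only to show what happens when there is no underscore.

```r
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")

# Locate the first underscore and everything after it in each element
m <- regexpr("_.*", x)

# Replace each matched substring with its upper-case version;
# elements without a match ("a") are left untouched.
regmatches(x, m) <- toupper(regmatches(x, m))
x
```

regmatches(x, m) returns only the matched substrings, and the assignment form expects exactly that many replacement values, which is why the two sides line up.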
Simple example using base::strsplit
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
myCap <- function(x) {
  out <- sapply(x, function(y) {
    temp <- unlist(strsplit(y, "_"))
    out <- temp[1]
    if (length(temp[-1])) {
      out <- paste(temp[1], paste(toupper(temp[-1]), collapse="_"), sep="_")
    }
    return(out)
  })
  out
}
> myCap(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
Example using the stringr package
pkg <- "stringr"
if (!require(pkg, character.only=TRUE)) {
install.packages(pkg)
require(pkg, character.only=TRUE)
}
myCap.2 <- function(x) {
  out <- sapply(x, function(y) {
    idx <- str_locate(y, "_")
    if (!all(is.na(idx[1,]))) {
      str_sub(y, idx[,1], nchar(y)) <- toupper(str_sub(y, idx[,1], nchar(y)))
    }
    return(y)
  })
  out
}
> myCap.2(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"