Can the pattern argument in ls() be inverted? - regex

I'm trying to get a vector of all the function names in the base package that contain only a . as punctuation, or no punctuation at all. I'd like to do it using only the ls() function.
ls() takes a pattern argument that is defined as
an optional regular expression. Only names matching pattern are returned. glob2rx can be used to convert wildcard patterns to regular expressions.
I'm trying to invert my regular expression. But I also want to keep the functions that contain .. Here's an example of some of the ones I don't want.
lsBase1 <- ls("package:base", pattern = "[[:punct:]]")
head(lsBase1)
# [1] "^" "~" "<" "<<-" "<=" "<-"
I want the inverted version of this, as if I was using invert = TRUE in grep, or by doing the following. But I also want the functions that contain only . if they contain punctuation.
lsBase2 <- ls("package:base")
lsBase2 <- lsBase2[!grepl("[[:punct:]]", lsBase2)]
head(lsBase2)
# [1] "abbreviate" "abs" "acos" "acosh"
# [5] "addNA" "addTaskCallback"
Is there a way to invert the pattern argument in ls()? Or, more generally can I invert the regular expression [[:punct:]] so it returns the opposite, but includes those matches that contain only . as punctuation?
Note: More than one . is fine.
Another example of what I want is: Yes I want is.vector but no I don't want [.data.frame.

I believe this is what you are looking for:
m <- ls("package:base", pattern="^(\\.|[^[:punct:]])*$")
The | is regex for "OR", so in words, it says something like "match a sequence of characters, running from the start of the string to its end, each of which is either a ., OR not a punctuation character".
To confirm that this works:
## Dissolve the matched strings and check for any verboten characters.
sort(unique(unlist(strsplit(m, ""))))
# [1] "." "0" "1" "2" "3" "4" "8" "a" "A" "b" "B" "c" "C" "d" "D" "e"
# [17] "E" "f" "F" "g" "G" "h" "H" "i" "I" "j" "J" "k" "K" "l" "L" "m"
# [33] "M" "n" "N" "o" "O" "p" "P" "q" "Q" "r" "R" "s" "S" "t" "T" "u"
# [49] "U" "v" "V" "w" "W" "x" "X" "y" "Y" "z"
## Have a look at (at least a few of) the names _excluded_ by the regex:
n <- setdiff(ls("package:base"), m)
sample(n, 10)
# [1] "names<-.POSIXlt" "[[<-.data.frame" "!.hexmode" "$<-"
# [5] "<-" "&&" "%*%" "package_version"
# [9] "$" "regmatches<-"

The following will work for what you are asking.
> lsBase2[grepl('^([^\\pP\\pS]|\\.)+$', lsBase2, perl=T)]
Edit: Or you could simply use the following, which returns 1029 results on R version 3.1.1:
> ls("package:base", pattern="^[a-zA-Z0-9.]+$")

This is easy if you think about it in steps. First remove the . characters, then scan for additional punctuation:
lsBase2[!grepl('[[:punct:]]', gsub('[.]', '', lsBase2))]
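As a rough cross-check, the approaches suggested in the answers can be compared on a small vector instead of the whole of package:base. This is only a sketch: nms is a made-up sample, and the variable names are hypothetical. On input like this the three agree, although the explicit [a-zA-Z0-9.] class is ASCII-only, so it could differ from the [[:punct:]]-based versions on names that contain other characters.

```r
# Hypothetical sample of names (an assumption for illustration only).
nms <- c("is.vector", "[.data.frame", "abs", "$<-", "Sys.time", "seq_len")

# 1. Alternation: every character is "." or a non-punctuation character.
by_alternation <- nms[grepl("^(\\.|[^[:punct:]])*$", nms)]

# 2. Explicit ASCII class of letters, digits and ".".
by_ascii_class <- nms[grepl("^[a-zA-Z0-9.]+$", nms)]

# 3. Two steps: drop the dots, then reject anything with punctuation left.
by_two_steps <- nms[!grepl("[[:punct:]]", gsub("[.]", "", nms))]

by_alternation
# [1] "is.vector" "abs"       "Sys.time"
identical(by_alternation, by_ascii_class) &&
  identical(by_alternation, by_two_steps)
# [1] TRUE
```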

Related

jasper replace String code

I have a string variable which contains only one letter, like "D" or "A", etc.
I want to replace this letter with an explicit text, but when I do the following (in this case "$F{Action}" contains "D"):
$F{Action}.replace('D','Apple').replace('A','text')
my result is "textpple", because "Apple" begins with an "A" and my second replace acts on the letter "A".
How can I replace the letter using only the first replace statement, without the other replace statements being applied?
There is no need for a scriptlet for such a small task. You can try the expression below:
$F{Action}.contains( "D" ) ? "Apple" : $F{Action}.contains( "A" ) ? "text" : ""
Hope this is your requirement.

Using regexes in grep function in R

Does anyone know how to extract x and y from the character string "x and y" using the grep function (not the stringi package), if x and y are arbitrary characters?
I am so not skilled in regular expressions.
Thanks for any response.
The regex here matches any chars "and" chars and then extracts them with regmatches:
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)
regmatches(txt, matches)[[1]][2:3]
## [1] "x" "y"
regmatches(txt, matches)[[2]][2:3]
## [1] "a" "b"
regmatches(txt, matches)[[3]][2:3]
## [1] "C" "d"
regmatches(txt, matches)[[4]][2:3]
## [1] "qq" "rr"
([[:alpha:]]+) matches one or more alpha characters and places it in a match group. [[:blank:]]+ matches one or more "whitespace" characters. There are less verbose ways to write these regexes but the expanded ones (to me) help make it easier to grok if there will be folks reading the code that aren't familiar with regexes.
I also didn't need to call regmatches 4x, but it was faster to cut/paste for a toy example.
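The four regmatches calls can be folded into a single pass. A possible sketch, reusing txt and the pattern from above (the name pairs is only for illustration):

```r
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)

# Each element of regmatches() is c(full match, group 1, group 2);
# keep the two groups and stack them as rows of a character matrix.
pairs <- t(sapply(regmatches(txt, matches), function(m) m[2:3]))
pairs
#      [,1] [,2]
# [1,] "x"  "y"
# [2,] "a"  "b"
# [3,] "C"  "d"
# [4,] "qq" "rr"
```

Note that this assumes every element of txt matches; an element with no match would yield NA entries in its row.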
As @MrFlick commented, grep is not the right function to extract these substrings.
You can use regmatches and do something like this:
> x <- c('x and y', 'abc and def', 'foo and bar')
> regmatches(x, gregexpr('and(*SKIP)(*F)|\\w+', x, perl=T))
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"
Or if " and " is always constant, then use strsplit as suggested in the comments.
> x <- c('x and y', 'abc and def', 'foo and bar')
> strsplit(x, ' and ', fixed=T)
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"

Split string recursively

Say I have text like this:
pattern = "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
The challenge is how to split it into words, using word separators from the
c(" ","-","/","\\","_",":","(",")",".",",")
family.
Desired result:
"This" "is" "some" "word" "expression" "I'd" "like" "to" "parse" "intelligently" "using" "special" "symbols" "like"
Methods:
I could do sapply or for loop using:
keywords = unlist(strsplit(pattern," "))
keywords = unlist(strsplit(keywords,"-"))
# etc.
Question:
But what's the solution using Reduce(f, x, init, accumulate=TRUE)?
You shouldn't need Reduce here. You should be able to do something like the following:
splitters <- c(" ","/","\\","_",":","(",")",".",",","-") # dash should come last
pattern <- paste0("[", paste(splitters, collapse = ""), "]")
string <- "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
strsplit(string, pattern)[[1]]
# [1] "This" "is" "some" "word"
# [5] "expression" "I'd" "like" "to"
# [9] "parse" "intelligently" "using" "special"
# [13] "symbols" "like" "'" "'"
Note that a - in a regex character class should come first or last, so I've edited your vector of "splitters" accordingly. Also, you may want to add a + at the end of your "pattern" in case you want to collapse, say, multiple spaces into one.
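To illustrate the effect of the trailing +, here is a small sketch on a made-up string with doubled separators. The backslash splitter is left out to keep the character class simple, and the names pattern_one, pattern_run and s are hypothetical.

```r
splitters <- c(" ", "/", "_", ":", "(", ")", ".", ",", "-")  # dash last

# Character class matching exactly one separator ...
pattern_one <- paste0("[", paste(splitters, collapse = ""), "]")
# ... and the same class with "+" so a run of separators counts as one.
pattern_run <- paste0(pattern_one, "+")

s <- "one  two--three__four"  # hypothetical input with doubled separators

without_plus <- strsplit(s, pattern_one)[[1]]
with_plus    <- strsplit(s, pattern_run)[[1]]

without_plus
# [1] "one"   ""      "two"   ""      "three" ""      "four"
with_plus
# [1] "one"   "two"   "three" "four"
```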
You can use option perl = TRUE and then split on punctuation or space
> strsplit(pattern, '[[:punct:]]|[[:space:]]', perl = TRUE)
[[1]]
[1] "This" "is" "some" "word" "expression"
[6] "I" "d" "like" "to" "parse"
[11] "intelligently" "using" "special" "symbols" "like"
[16] ""
I'd go with the following (it keeps "I'd" together):
strsplit(pattern, "[^[:alnum:][:digit:]']")
## [[1]]
## [1] "This" "is" "some" "word" "expression" "I'd" "like" "to" "parse"
## [10] "intelligently" "using" "special" "symbols" "like" "'" "'"
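For completeness, the Reduce-based approach the question asks about could look roughly like the sketch below. It is not taken from any of the answers; split_all is a hypothetical helper, and fixed = TRUE is used so that separators such as "." and "(" are treated literally rather than as regex metacharacters.

```r
splitters <- c(" ", "-", "/", "_", ":", "(", ")", ".", ",")

# Hypothetical helper: split every current piece on one literal separator.
split_all <- function(words, sep) unlist(strsplit(words, sep, fixed = TRUE))

# Reduce feeds the result of each split into the next separator in turn.
tokens <- Reduce(split_all, splitters, init = "This_is some word/expression")

# Drop empty strings that adjacent separators could leave behind.
tokens <- tokens[nzchar(tokens)]
tokens
# [1] "This"       "is"         "some"       "word"       "expression"
```

The single character-class regex is shorter, but this form makes the "apply each splitter in turn" idea explicit.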

R Regex: Parenthesis Not Acting as Metacharacter

I am trying to split a string by the group "%in%" and the character "#". All documentation and everything I can find says that parentheses are metacharacters used for grouping in R regex. So the code
> strsplit('example%in%aa(bbb)aa#cdef', '[(%in%)#]', perl=TRUE)
SHOULD give me
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
That is, it should leave the parentheses in "aa(bbb)aa" alone, because the parentheses in the matching expression are not escaped. But instead it ACTUALLY gives me
[[1]]
[1] "example" "" "" "" "aa" "bbb" "aa" "cdef"
as if the parentheses were not metacharacters! What is up with this and how can I fix it? Thanks!
This is true with and without the argument perl=TRUE in strsplit.
Not sure what documentation you're reading, but the Extended Regular Expressions section in ?regex says:
Most metacharacters lose their special meaning inside a character class. ...
(Only '^ - \ ]' are special inside character classes.)
You don't need to create a character class. Just use "or" | (you likely don't need to group "%in%" either, but it shouldn't hurt anything):
> strsplit('example%in%aa(bbb)aa#cdef', '(%in%)|#', perl=TRUE)
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
No need to use [ or ( here, just this:
strsplit('example%in%aa(bbb)aa#cdef', '%in%|#')
[[1]]
[1] "example" "aa(bbb)aa" "cdef"
Inside character class [], most of the characters lose their special meaning, including ().
You might want this regex instead:
'%in%|#'
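The difference between the two patterns can be seen side by side in this short sketch (the variable names are only for illustration). Inside [...] each of (, %, i, n, ) and # is a separate single-character delimiter, whereas the alternation treats %in% as one unit.

```r
x <- "example%in%aa(bbb)aa#cdef"

# Character class: splits at every single "(", "%", "i", "n", ")" or "#".
by_class <- strsplit(x, "[(%in%)#]")[[1]]

# Alternation: splits at the whole sequence "%in%" or at "#".
by_alternation <- strsplit(x, "%in%|#")[[1]]

by_class
# [1] "example" ""        ""        ""        "aa"      "bbb"     "aa"      "cdef"
by_alternation
# [1] "example"   "aa(bbb)aa" "cdef"
```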

Why does strsplit use positive lookahead and lookbehind assertion matches differently?

Common sense and a sanity-check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in testString:
testString <- "text XX text"
BB <- "(?<= XX )"
FF <- "(?= XX )"
as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5
strsplit(), however, uses those match locations differently, splitting testString at one location when using the lookbehind assertion, but at two locations -- the second of which seems incorrect -- when using the lookahead assertion.
strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"
strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text" " " "XX text"
I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:
The algorithm applied to each input string is
repeat {
if the string is empty
break.
if there is a match
add the string to the left of the match to the output.
remove the match and all to the left of it.
else
add the string to the output.
break.
}
Note that this means that if there is a match at the beginning of
a (non-empty) string, the first element of the output is ‘""’, but
if there is a match at the end of the string, the output is the
same as with the match removed.
The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:
FF <- "(?=funky)"
testString <- "take me to funky town"
gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split includes the string up to position 11 (left of the match), and it is removed from the string together with the match, which, however, has zero length.
Now the remaining string is funky town, and the lookahead matches at position 1. However, there's nothing to remove, because there's nothing to the left of the match, and the match itself has zero length. So the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when strsplitting with an empty regex (when argument split=""). After this the remaining string is unky town, which is returned as the last split since there's no match.
Lookbehinds are no problem, because each match is split and removed from the remaining string, so the algorithm is never stuck.
Admittedly this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption of zero length for lookaheads. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.
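The walk-through above can be turned into a small emulation. This is only a sketch of the documented algorithm, not R's actual implementation: split_sketch is a hypothetical function, and the rule "if the match ends before the first character, peel off one character" is an assumption chosen to be consistent with the outputs shown above.

```r
# Hypothetical emulation of the documented strsplit loop (perl = TRUE).
split_sketch <- function(s, pattern) {
  out <- character(0)
  while (nzchar(s)) {
    m <- regexpr(pattern, s, perl = TRUE)
    pos <- as.integer(m)
    if (pos == -1L) {              # no match: emit the rest and stop
      out <- c(out, s)
      break
    }
    end <- pos + attr(m, "match.length") - 1L
    if (end > 0L) {
      # Emit what is left of the match, then drop it together with the match.
      out <- c(out, substr(s, 1L, pos - 1L))
      s <- substr(s, end + 1L, nchar(s))
    } else {
      # Assumed special case: zero-length match at the very start,
      # so split off a single character to guarantee progress.
      out <- c(out, substr(s, 1L, 1L))
      s <- substr(s, 2L, nchar(s))
    }
  }
  out
}

res_ahead  <- split_sketch("take me to funky town", "(?=funky)")
res_behind <- split_sketch("text XX text", "(?<= XX )")
res_ahead
# [1] "take me to " "f"           "unky town"
res_behind
# [1] "text XX " "text"
```

The lookbehind case never reaches the special branch, because after the first split the remaining string no longer contains the text the lookbehind needs.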
Based on Theodore Lytras' careful explication of strsplit()'s behavior, a reasonably clean workaround is to prefix the to-be-matched lookahead assertion with a positive lookbehind assertion that matches any single character:
testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"
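The same prefix should carry over to the lookahead from the question. A short sketch, assuming the original testString and with FF2 redefined for that case:

```r
testString <- "text XX text"
FF2 <- "(?<=.)(?= XX )"  # lookbehind for any one character, then the lookahead

parts <- strsplit(testString, FF2, perl = TRUE)[[1]]
parts
# [1] "text"     " XX text"
```

The split now happens once, and the leading space stays attached to the second piece because neither assertion consumes any characters.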
Looks like a bug to me. This doesn't appear to just be related to spaces, specifically, but rather any lonely lookahead (positive or negative):
FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f" "unky take me to " "f" "unky "
# [5] "f" "unky town"
FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx" "y" "xxxxxxx"
Seems to work fine if the pattern is given something to consume along with the zero-width assertion, such as:
FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text" "XX text"
Perhaps something like that might function as a workaround.