R grepl - matching two strings - regex

I am facing problem with using grep/grepl function in R. When I run
grepl("[Aa][Bb][Cc]x", c("Abcx", "abCy"))
I got:
[1] TRUE FALSE
And it's OK. Similarly, for:
grepl("[Aa][Bb][Cc]y", c("Abcx", "abCy"))
I got:
[1] FALSE TRUE
And it's also allrighty. But when I write:
grepl("[Aa][Bb][Cc]x | [Aa][Bb][Cc]y", c("Abcx", "abCy"))
it gives me counter-intuitive
[1] FALSE FALSE
What's the problem?

You need to remove spaces around |:
grepl("[Aa][Bb][Cc]x|[Aa][Bb][Cc]y", c("Abcx", "abCy"))
These spaces matter. You might use a PCRE regex though with a (?x) modifier (see demo) that makes it possible to introduce some formatting whitespace in between subpatterns for better readability:
grepl("(?x)[Aa][Bb][Cc]x | [Aa][Bb][Cc]y", c("Abcx", "abCy"), perl=TRUE)
Or better use this shorter version:
grepl("[Aa][Bb][Cc][xy]", c("Abcx", "abCy"))
where the pattern is first shrunk to [Aa][Bb][Cc](x|y) and since these are single characters, I recommend using a character class ((x|y) -> [xy]).

Related

Matching last and first bracket in gsub/r and leaving the remaining content intact

I'm working with a character vector of the following format:
[-0.2122,-0.1213)
[-0.2750,-0.2122)
[-0.1213,-0.0222)
[-0.1213,-0.0222)
I would like to remove [ and ) so I can get the desired result resembling:
-0.2122,-0.1213
-0.2750,-0.2122
-0.1213,-0.0222
-0.1213,-0.0222
Attempts
1 - Groups,
I was thinking of capturing first and second group, on the lines of the syntax:
[[^\[{1}(?![[:digit:]])\){1}
but it doesn't seem to work, (regex101).
2 - Punctuation
The code: [[:punct:]] will capture all punctuation regex101.
3 - Groups again
Then I tried to match the two groups: (\[)(\)), but, again, no lack regex101.
The problem can be easily solved by applying gsub twice or making use of the multigsub available in the qdap package but I'm interested in solving this via one expression, is possible.
You could try using lookaheads and lookbehinds in Perl-style regular expressions.
x <- scan(what = character(),
text = "[-0.2122,-0.1213)
[-0.2750,-0.2122)
[-0.1213,-0.0222)
[-0.1213,-0.0222)")
regmatches(x, regexpr("(?<=\\[).+(?=\\))", x, perl = TRUE))
# [1] "-0.2122,-0.1213" "-0.2750,-0.2122" "-0.1213,-0.0222" "-0.1213,-0.0222"

R Wildcard matching for certain number of terms

Suppose I have a string and am searching for particular wildcard terms. For example:
x <- "AJSDKLAFJASFJABJKADL"
z <- stri_locate_all_regex(x, 'A*****AF')
I want to search for all terms that have any 5 characters in between A and AF, like ABJDKAAF or AJSDKLAF... However the above code does not work. Is there a simple way to do this that I am overlooking? Thank you!
In regular expressions (as opposed to standard wildcards that you might be used to), * means "0 or more of the preceding character", so "A*" means "0 or more A". You can't stack them like '****', for that you want '.' which means "one character".
z <- stri_locate_all_regex(x, 'A.....AF')
TL,DR: regex problem, not R problem.
For a simple way to do this, and by this I assume you mean that you want to use your wildcard characters as in the question, you can turn these into proper regular expressions using glob2rx(). A "wildcard" expression, also known as a "glob", is a sort of poor man's regular expression (?regex). For your expression, you can specify five ? characters, because in a glob, ? means any single character.
x <- c("ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF")
# the (simpler?) "wildcard" way
stringi::stri_detect_regex(x, glob2rx("A?????AF"))
## [1] TRUE FALSE FALSE TRUE FALSE
# the regular expression way (probably WRONG)
stringi::stri_detect_regex(x, "A.{5}AF")
## [1] TRUE TRUE FALSE TRUE FALSE
# the regular expression way (CORRECT)
stringi::stri_detect_regex(x, "^A.{5}AF$")
## [1] TRUE FALSE FALSE TRUE FALSE
This returns a logical vector if the wildcard matches.
By contrast, stri_locate_all_regex() returns a list of matrixes of dimensions 1, 2 where the columns are the starting and ending character positions of the matches within the string, or a pair of NA values if the pattern is not found.
Note that one of the differences in your wildcard/glob expression is that to get A + any five characters + AF without any preceding or trailing characters, you would need to specify the regular expression characters for the start and end of the string, as per above. Otherwise the match picks up "XABCDEFAFX" too. For a wildcard/glob, this is not a problem since the start and end of the expression match the beginning and end of the string:
> glob2rx("A?????AF")
[1] "^A.....AF$"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)

R-regex: match strings not beginning with a pattern

I'd like to use regex to see if a string does not begin with a certain pattern. While I can use: [^ to blacklist certain characters, I can't figure out how to blacklist a pattern.
> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE
I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. to match if the string does not start with abc sequence.
Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:
> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE
Any ideas?
Yeah. Put the zero width lookahead /outside/ the other parens. That should give you this:
> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE
which I think is what you want.
Alternately if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are both equivalent and acceptable.
There is now (years later) another possibility with the stringr package.
library(stringr)
str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE
str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE
Created on 2020-01-13 by the reprex package (v0.3.0)
I got stuck on the following special case, so I thought I would share...
What if there are multiple instances of the regular expression, but you still only want the first segment?
Apparently you can turn off the implicit greediness of the search
with specific perl wildcard modifiers
Suppose the string I wanted to process was
myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
LETTERS[1:13], "_", LETTERS[14:26], "__",
"laksjdl", "_", "lakdjlfalsjdf"),
collapse = "")
myExampleString
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjd"
and that I wanted only the first segment before the first "__".
I cannot simply search on "_", because single-underscore is
an allowable non-delimiter in this example string.
The following doesn't work. It instead gives me the first and second segments because of the default greediness (but not third, because of the forward-look).
gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"
But this does work
gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz"
The difference is the greedy-modifier "?" after the wildcard ".+"
in the (perl) regular expression.

Complete word matching using grepl in R

Consider the following example:
> testLines <- c("I don't want to match this","This is what I want to match")
> grepl('is',testLines)
> [1] TRUE TRUE
What I want, though, is to only match 'is' when it stands alone as a single word. From reading a bit of perl documentation, it seemed that the way to do this is with \b, an anchor that can be used to identify what comes before and after the patter, i.e. \bword\b matches 'word' but not 'sword'. So I tried the following example, with use of Perl syntax set to 'TRUE':
> grepl('\bis\b',testLines,perl=TRUE)
> [1] FALSE FALSE
The output I'm looking for is FALSE TRUE.
"\<" is another escape sequence for the beginning of a word, and "\>" is the end.
In R strings you need to double the backslashes, so:
> grepl("\\<is\\>", c("this", "who is it?", "is it?", "it is!", "iso"))
[1] FALSE TRUE TRUE TRUE FALSE
Note that this matches "is!" but not "iso".
you need double-escaping to pass escape to regex:
> grepl("\\bis\\b",testLines)
[1] FALSE TRUE
Very simplistically, match on a leading space:
testLines <- c("I don't want to match this","This is what I want to match")
grepl(' is',testLines)
[1] FALSE TRUE
There's a whole lot more than this to regular expressions, but essentially the pattern needs to be more specific. What you will need in more general cases is a huge topic. See ?regex
Other possibilities that will work for this example:
grepl(' is ',testLines)
[1] FALSE TRUE
grepl('\\sis',testLines)
[1] FALSE TRUE
grepl('\\sis\\s',testLines)
[1] FALSE TRUE