R Grep help: match exact substring. RStudio on Mac OSX - regex

I'm trying to match an exact substring using grep. I'm using the following expression:
grep("^.*apple().*$",inputString)
Expected output:
1) input string is "apple()" - expected to match
2) input string is "appleSomethingElse()" - expected not to match
Case 1 works and I get a match. However case two also matches. I'm trying to write a regular expression that only matches when "apple" and "()" are next to each other in the string. Is my expression wrong?

When the pattern contains metacharacters that you want to match literally, you can pass fixed = TRUE to grep and keep the pattern simple.
x <- c('apple()', 'appleSomethingElse()', 'adadaapple()aaa')
grep('apple()', x, fixed = TRUE)
## [1] 1 3

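In a regular expression, an unescaped () is an empty group, so the OP's pattern matches any string that contains "apple". To keep the same syntax as the OP's code, we need to escape (\\) the parentheses (()).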
grep("^.*apple\\(\\).*$", x)
#[1] 1 3
As @DavidArenburg mentioned in the comments, if the goal is to match the whole string rather than a substring, == would be more useful.
x=='apple()'
#[1] TRUE FALSE FALSE
data
x <- c('apple()', 'appleSomethingElse()', 'adadaapple()aaa')
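To see the three behaviours side by side, here is a small sketch using the same vector as above. The variable names (unescaped, escaped, literal) are only illustrative.

```r
x <- c("apple()", "appleSomethingElse()", "adadaapple()aaa")

# Unescaped: () is an empty group, so this only requires "apple" somewhere
unescaped <- grepl("^.*apple().*$", x)

# Escaped: \\( and \\) match literal parentheses
escaped <- grepl("^.*apple\\(\\).*$", x)

# fixed = TRUE: the pattern is treated as a plain string, no escaping needed
literal <- grepl("apple()", x, fixed = TRUE)

print(rbind(unescaped, escaped, literal))
```

Only the unescaped pattern flags "appleSomethingElse()"; the other two agree with each other.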

Related

Matching last and first bracket in gsub/r and leaving the remaining content intact

I'm working with a character vector of the following format:
[-0.2122,-0.1213)
[-0.2750,-0.2122)
[-0.1213,-0.0222)
[-0.1213,-0.0222)
I would like to remove [ and ) so I can get the desired result resembling:
-0.2122,-0.1213
-0.2750,-0.2122
-0.1213,-0.0222
-0.1213,-0.0222
Attempts
1 - Groups,
I was thinking of capturing first and second group, on the lines of the syntax:
[[^\[{1}(?![[:digit:]])\){1}
but it doesn't seem to work (regex101).
2 - Punctuation
The code: [[:punct:]] will capture all punctuation (regex101).
3 - Groups again
Then I tried to match the two groups: (\[)(\)), but, again, no luck (regex101).
The problem can be easily solved by applying gsub twice or by using multigsub from the qdap package, but I'm interested in solving this with one expression, if possible.
You could try using lookaheads and lookbehinds in Perl-style regular expressions.
x <- scan(what = character(),
text = "[-0.2122,-0.1213)
[-0.2750,-0.2122)
[-0.1213,-0.0222)
[-0.1213,-0.0222)")
regmatches(x, regexpr("(?<=\\[).+(?=\\))", x, perl = TRUE))
# [1] "-0.2122,-0.1213" "-0.2750,-0.2122" "-0.1213,-0.0222" "-0.1213,-0.0222"
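If the goal is only to strip the outer brackets rather than extract what lies between them, a single gsub with alternation may also be enough. This is a sketch that assumes every element starts with [ and ends with ); the name stripped is illustrative.

```r
x <- c("[-0.2122,-0.1213)", "[-0.2750,-0.2122)",
       "[-0.1213,-0.0222)", "[-0.1213,-0.0222)")

# Remove a leading "[" or a trailing ")" in one pass
stripped <- gsub("^\\[|\\)$", "", x)
print(stripped)
```

Because the pattern is anchored at both ends, brackets elsewhere in the string would be left alone.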

Reconciling regex behaviors

I am trying a regex ((?:I\d-?)*I3(?:-?I\d)*) here:
Out of the string A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2 I get the following matches I1-I3, I1-I1-I3-I1-I1-I3-I2, and I3 - this is the desired behavior. However, in R:
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2"
strsplit(x, "(?:I\d-?)*I3(?:-?I\d)*")
this returns an error:
Error: '\d' is an unrecognized escape in character string starting ""(?:I\d"
I have tried perl=TRUE, but it doesn't make a difference.
I have also tried to modify the regex to read: (?:I\\d-?)*I3(?:-?I\\d)*, however this does not give the correct result, rather it matches A-B-C-I1-I2-D-E-F-, -D-D-D-D-, -L-K-, and -P-F-I2-I2.
How can I replicate the desired behavior in R?
If we need to split the string and keep the substrings that match the pattern shown, we can mark that pattern as one to be skipped ((*SKIP)(*F)) and split the string on every remaining character.
v1 <- strsplit(x, '(?:I\\d-?)*I3(?:-?I\\d)*(*SKIP)(*F)|.', perl=TRUE)[[1]]
The blank/empty elements can be removed using nzchar, which returns a logical vector that is TRUE where the string is non-empty and FALSE where it is empty.
v1[nzchar(v1)]
#[1] "I1-I3" "I1-I1-I3-I1-I1-I3-I2" "I3"
Or as we are interested more in extracting the pattern, str_extract would be useful.
library(stringr)
str_extract_all(x, '(?:I\\d-?)*I3(?:-?I\\d)*')[[1]]
#[1] "I1-I3" "I1-I1-I3-I1-I1-I3-I2" "I3"
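The same extraction can probably be done in base R with gregexpr and regmatches, without loading stringr. A minimal sketch, assuming perl = TRUE so that \\d and (?: ) are handled by PCRE; the name hits is illustrative. The doubled backslash is needed because R processes string escapes before the regex engine sees the pattern.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2"

# gregexpr locates every match; regmatches pulls out the matched text
hits <- regmatches(x, gregexpr("(?:I\\d-?)*I3(?:-?I\\d)*", x, perl = TRUE))[[1]]
print(hits)
```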

R Wildcard matching for certain number of terms

Suppose I have a string and am searching for particular wildcard terms. For example:
x <- "AJSDKLAFJASFJABJKADL"
z <- stri_locate_all_regex(x, 'A*****AF')
I want to search for all terms that have any 5 characters in between A and AF, like ABJDKAAF or AJSDKLAF... However the above code does not work. Is there a simple way to do this that I am overlooking? Thank you!
In regular expressions (as opposed to the standard wildcards you might be used to), * means "0 or more of the preceding character", so "A*" means "0 or more A". Stacking them like '*****' does not give you five arbitrary characters; for that you want '.', which means "any single character".
z <- stri_locate_all_regex(x, 'A.....AF')
TL;DR: regex problem, not R problem.
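For reference, a base R sketch of the same search on the string from the question, using the quantifier .{5} in place of five dots. gregexpr is used here instead of stringi, and the names m, starts and lens are illustrative.

```r
x <- "AJSDKLAFJASFJABJKADL"

# "A", then any five characters, then "AF"
m <- gregexpr("A.{5}AF", x)[[1]]

starts <- as.integer(m)             # start position of each match
lens <- attr(m, "match.length")     # length of each match
print(substring(x, starts, starts + lens - 1))
```

On this input there is a single match, "AJSDKLAF", starting at position 1.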
For a simple way to do this, and by this I assume you mean that you want to use your wildcard characters as in the question, you can turn these into proper regular expressions using glob2rx(). A "wildcard" expression, also known as a "glob", is a sort of poor man's regular expression (?regex). For your expression, you can specify five ? characters, because in a glob, ? means any single character.
x <- c("ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF")
# the (simpler?) "wildcard" way
stringi::stri_detect_regex(x, glob2rx("A?????AF"))
## [1] TRUE FALSE FALSE TRUE FALSE
# the regular expression way (probably WRONG)
stringi::stri_detect_regex(x, "A.{5}AF")
## [1] TRUE TRUE FALSE TRUE FALSE
# the regular expression way (CORRECT)
stringi::stri_detect_regex(x, "^A.{5}AF$")
## [1] TRUE FALSE FALSE TRUE FALSE
This returns a logical vector indicating, for each element, whether the pattern matches.
By contrast, stri_locate_all_regex() returns a list of two-column matrices, one row per match, where the columns are the starting and ending character positions of the matches within the string, or a pair of NA values if the pattern is not found.
Note that one of the differences in your wildcard/glob expression is that to get A + any five characters + AF without any preceding or trailing characters, you would need to specify the regular expression characters for the start and end of the string, as per above. Otherwise the match picks up "XABCDEFAFX" too. For a wildcard/glob, this is not a problem since the start and end of the expression match the beginning and end of the string:
> glob2rx("A?????AF")
[1] "^A.....AF$"
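The anchoring difference can be checked in base R as well. In this sketch, grepl stands in for stri_detect_regex, and the glob with leading and trailing * is an assumed way of asking for a match anywhere in the string; the names anchored and anywhere are illustrative.

```r
x <- c("ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF")

# Plain glob: must match the whole string
anchored <- grepl(glob2rx("A?????AF"), x)

# Glob with * on both sides: may match anywhere in the string
anywhere <- grepl(glob2rx("*A?????AF*"), x)

print(rbind(anchored, anywhere))
```

Only "XABCDEFAFX" changes between the two, since it is the one element with extra characters around an otherwise valid match.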

Retrieve digits after specific string in R

I have a bunch of strings that contain the word "radius" followed by one or two digits. They also contain a lot of other letters, digits, and underscores. For example, one is "inflow100_radius6_distance12". I want a regex that will just return the one or two digits following "radius." If R recognized \K, then I would just use this:
radius\K[0-9]{1,2}
and be done. But R doesn't allow \K, so I ended up with this instead (which selects radius and the following numbers, and then cuts off "radius"):
result <- regmatches(input_string, gregexpr("radius[0-9]{1,2}", input_string))
result <- substr(unlist(result), 7, 8)
I'm pretty new to regex, so I'm sure there's a better way. Any ideas?
\K is recognized, but only by the Perl-compatible engine. You can solve the problem by setting perl = TRUE.
result <- regmatches(x, gregexpr('radius\\K\\d+', x, perl = TRUE))
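A self-contained version of the idea, as a sketch: the second input string is made up for illustration, and regexpr is used instead of gregexpr on the assumption that each string contains "radius" once. The name digits is illustrative.

```r
x <- c("inflow100_radius6_distance12",
       "inflow50_radius12_distance3")   # second string is a hypothetical example

# \\K discards "radius" from the reported match, leaving only the digits
digits <- regmatches(x, regexpr("radius\\K[0-9]{1,2}", x, perl = TRUE))
print(digits)
```

The result is a character vector; wrap it in as.numeric() if numbers are needed.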
1) Match the entire string replacing it with the digits after radius:
sub(".*radius(\\d+).*", "\\1", "inflow100_radius6_distance12")
## [1] "6"
2) This also works, involves a simpler regular expression and converts it to numeric at the same time:
library(gsubfn)
strapply("inflow100_radius6_distance12", "radius(\\d+)", as.numeric, simplify = TRUE)
## [1] 6

R: Find the last dot in a string

In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
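A quick check of what the call returns, as a sketch. Inside a bracket expression the dot is already literal, so [^.] is used here in place of [^\\.]; the names last_dot and suffix are illustrative.

```r
x <- "hello.world.123.456"

# A dot followed only by non-dot characters up to the end of the string
g <- regexpr("\\.[^.]*$", x)

last_dot <- as.integer(g)     # position of the last dot
suffix <- regmatches(x, g)    # the matched text, from the last dot onward
print(last_dot)
print(suffix)
```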
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
and another tidier option to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
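A small sketch of the vectorized form on several inputs. The extra strings are made up for illustration, and the name last_pos is hypothetical. Note that gregexpr reports -1 when there is no match, so a string without a dot yields -1 rather than NA.

```r
x <- c("hello.world.123.456", "a.b", "nodots")   # last two are hypothetical inputs

# For each string, take the final element of the vector of dot positions
last_pos <- sapply(gregexpr("\\.", x), tail, 1)
print(last_pos)
```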
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package, that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note the . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot char.
A short R demo:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]