R-regex: match strings not beginning with a pattern

I'd like to use a regex to test whether a string does not begin with a certain pattern. I can use [^...] to exclude certain characters, but I can't figure out how to exclude a whole pattern.
> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE
I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. match whenever the string does not start with the sequence abc.
Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:
> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE
Any ideas?

Yeah. Put the zero width lookahead /outside/ the other parens. That should give you this:
> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE
which I think is what you want.
Alternately if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are both equivalent and acceptable.
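If the prefix is a fixed string, the same test can also be written without a lookahead. A minimal sketch comparing three approaches in base R (the vector x is made up for illustration, and startsWith() assumes R 3.3.0 or later):

```r
# Hypothetical input, not from the question
x <- c("abcdef", "afoo", "foo", "xabc")

# 1. Negative lookahead; needs the PCRE engine, hence perl = TRUE
via_lookahead <- grepl("^(?!abc)", x, perl = TRUE)

# 2. Negate an ordinary anchored match
via_negation <- !grepl("^abc", x)

# 3. Literal prefix test, no regex involved
via_prefix <- !startsWith(x, "abc")

via_lookahead
# [1] FALSE  TRUE  TRUE  TRUE
```

All three give the same answer here. The lookahead form is mainly useful when the test has to be embedded in a larger pattern.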

There is now (years later) another possibility with the stringr package.
library(stringr)
str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE
str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE
Created on 2020-01-13 by the reprex package (v0.3.0)

I got stuck on the following special case, so I thought I would share...
What if there are multiple instances of the regular expression, but you still only want the first segment?
Apparently you can turn off the implicit greediness of the search
with specific perl wildcard modifiers
Suppose the string I wanted to process was
myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
LETTERS[1:13], "_", LETTERS[14:26], "__",
"laksjdl", "_", "lakdjlfalsjdf"),
collapse = "")
myExampleString
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjd"
and that I wanted only the first segment before the first "__".
I cannot simply search on "_", because single-underscore is
an allowable non-delimiter in this example string.
The following doesn't work. It gives me the first and second segments instead, because the quantifier is greedy by default (but not the third, because of the lookahead).
gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"
But this does work
gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz"
The difference is the non-greedy (lazy) modifier "?" after the wildcard ".+"
in the (perl) regular expression.
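For this particular task the lookahead is optional. A sketch of three ways to keep only the text before the first "__", using a shortened made-up string s in place of myExampleString:

```r
# Shortened stand-in for myExampleString (an assumption for brevity)
s <- "ab_cd__EF_GH__ij_kl"

# Lazy quantifier: .+? stops at the first "__" it can reach
lazy <- sub("^(.+?)__.*$", "\\1", s, perl = TRUE)

# Delete everything from the first "__" onward; the regex engine
# finds the leftmost match, so no lazy quantifier is needed
cut_tail <- sub("__.*$", "", s)

# Split on the literal delimiter and take the first piece
first_piece <- strsplit(s, "__", fixed = TRUE)[[1]][1]

lazy
# [1] "ab_cd"
```

The single underscore inside "ab_cd" is left alone in all three variants, because each one only reacts to the double underscore.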

Related

R grep and exact matches

It seems grep is "greedy" in the way it returns matches. Assuming I've the following data:
Sources <- c(
"Coal burning plant",
"General plant",
"coalescent plantation",
"Charcoal burning plant"
)
Registry <- seq(from = 1100, to = 1103, by = 1)
df <- data.frame(Registry, Sources)
If I perform grep("(?=.*[Pp]lant)(?=.*[Cc]oal)", df$Sources, perl = TRUE, value = TRUE), it returns
"Coal burning plant"
"coalescent plantation"
"Charcoal burning plant"
However, I only want exact matches, i.e. only entries where "coal" and "plant" occur as whole words. I don't want "coalescent", "plantation" and so on. So here I only want to see "Coal burning plant".
You want to use word boundaries \b around your word patterns. A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not. You may also want to consider using the inline (?i) modifier for case-insensitive matching.
grep('(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)', df$Sources, perl = TRUE, value = TRUE)
If you always want the order "coal" then "plant", then this should work
grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", Sources, perl = TRUE, value = TRUE)
Here we add \b, which stands for a word boundary. You can add the word boundaries to your original attempt as well:
grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", Sources,
perl = TRUE, value = TRUE)
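The lookahead-per-word idea generalises to any number of words. Below is a rough sketch of a helper (has_all_words is a hypothetical name, not part of any package) that builds one lookahead per word; it assumes the words contain no regex metacharacters:

```r
Sources <- c(
  "Coal burning plant",
  "General plant",
  "coalescent plantation",
  "Charcoal burning plant"
)

# Hypothetical helper: TRUE where every word occurs as a whole word,
# in any order, ignoring case. Assumes the words contain no regex
# metacharacters.
has_all_words <- function(x, words) {
  lookaheads <- paste0("(?=.*\\b", words, "\\b)", collapse = "")
  grepl(paste0("(?i)", lookaheads), x, perl = TRUE)
}

Sources[has_all_words(Sources, c("coal", "plant"))]
# [1] "Coal burning plant"
```

Because each lookahead starts from the same position, the order of the words in the input does not matter.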

Convert character to lowerCamelCase in R

I have character vector which looks like this:
x <- c("cult", "brother sister relationship", "word title")
And I want to convert it to the lowerCamelCase style looking like this:
c("cult", "brotherSisterRelationship", "wordTitle")
I played around with gsub, gregexpr, strsplit, regmatches and many other functions, but couldn't get anywhere.
Strings containing two spaces seem especially difficult to handle.
Maybe someone here has an idea how to do this.
> x <- c("cult", "brother sister relationship", "word title")
> gsub(" ([^ ])", "\\U\\1", x, perl=TRUE)
[1] "cult" "brotherSisterRelationship"
[3] "wordTitle"
Quoting from pattern matching and replacement:
For perl = TRUE only, it can also contain "\U" or "\L" to convert the
rest of the replacement to upper or lower case and "\E" to end case
conversion.
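To make the quoted passage concrete, here is a small sketch of the three case-conversion escapes in a replacement string. The input strings are made up for illustration:

```r
# Capitalise each word: upper-case the first letter, lower-case the rest
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "a tEST of capitalizing",
                   perl = TRUE)
title_case
# [1] "A Test Of Capitalizing"

# \E ends the conversion, so only the first group is upper-cased
partial <- gsub("(\\w+)@(\\w+)", "\\U\\1\\E@\\2", "user@host", perl = TRUE)
partial
# [1] "USER@host"
```

Without perl = TRUE these escapes are not interpreted, which is why the camelCase answer above sets it.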
A non-base alternative:
library(R.utils)
toCamelCase(x, capitalize = FALSE)
# [1] "cult" "brotherSisterRelationship" "wordTitle"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just the root symbols (second element of each match)
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
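In R, the first captured element can be referenced as "\\1" in the replacement of sub(). A minimal sketch of the regex described above, with a leading ^ added and a made-up vector of symbols; perl = TRUE is an assumption chosen so that the lazy quantifier is handled by PCRE:

```r
symbols <- c("ZNH3", "ZN", "CLZ4", "ZZZ33")

# Lazy root, optional expiration code, anchored at both ends
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)
roots
# [1] "ZN" "ZN" "CL" "ZZ"
```

Note that "ZN" survives intact even though N is a valid expiration letter, because the suffix also requires at least one digit.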

Complete word matching using grepl in R

Consider the following example:
> testLines <- c("I don't want to match this","This is what I want to match")
> grepl('is',testLines)
> [1] TRUE TRUE
What I want, though, is to only match 'is' when it stands alone as a single word. From reading a bit of perl documentation, it seemed that the way to do this is with \b, an anchor that can be used to identify what comes before and after the pattern, i.e. \bword\b matches 'word' but not 'sword'. So I tried the following example, with perl set to TRUE:
> grepl('\bis\b',testLines,perl=TRUE)
> [1] FALSE FALSE
The output I'm looking for is FALSE TRUE.
"\<" is another escape sequence for the beginning of a word, and "\>" is the end.
In R strings you need to double the backslashes, so:
> grepl("\\<is\\>", c("this", "who is it?", "is it?", "it is!", "iso"))
[1] FALSE TRUE TRUE TRUE FALSE
Note that this matches "is!" but not "iso".
You need double-escaping to pass the escape through to the regex engine:
> grepl("\\bis\\b",testLines)
[1] FALSE TRUE
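The single-backslash version fails because R itself interprets "\b" as a backspace character before the regex engine ever sees it. A short sketch showing the difference; the raw-string form assumes R 4.0.0 or later:

```r
testLines <- c("I don't want to match this", "This is what I want to match")

nchar("\b")    # 1: a single backspace character
nchar("\\b")   # 2: a backslash followed by "b"

escaped <- grepl("\\bis\\b", testLines)

# Raw strings (R >= 4.0.0) avoid the doubling
raw_form <- grepl(r"(\bis\b)", testLines)

escaped
# [1] FALSE  TRUE
```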
Very simplistically, match on a leading space:
testLines <- c("I don't want to match this","This is what I want to match")
grepl(' is',testLines)
[1] FALSE TRUE
There's a whole lot more than this to regular expressions, but essentially the pattern needs to be more specific. What you will need in more general cases is a huge topic. See ?regex
Other possibilities that will work for this example:
grepl(' is ',testLines)
[1] FALSE TRUE
grepl('\\sis',testLines)
[1] FALSE TRUE
grepl('\\sis\\s',testLines)
[1] FALSE TRUE
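The space-based patterns work for this example but depend on "is" being surrounded by spaces. A quick comparison on a few made-up strings shows where they diverge from a word-boundary pattern:

```r
# Made-up test strings covering start, end, and punctuation cases
x <- c("is it?", "it is!", "this", "who is it?")

with_spaces <- grepl(" is ", x)
with_boundaries <- grepl("\\bis\\b", x)

with_spaces
# [1] FALSE FALSE FALSE  TRUE
with_boundaries
# [1]  TRUE  TRUE FALSE  TRUE
```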

R: Extract data from string using POSIX regular expression

How to extract only DATABASE_NAME from this string using POSIX-style regular expressions?
st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."
First of all, this generates an error
Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"
I was thinking something like
sub(".*\\.", st, "")
The first problem is that you need to escape the \ in your string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
As for the main problem, this will return the bit you want from the string you gave:
> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"
But a simpler solution would be to split on the \\. and select the last chunk:
> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
or slightly more automated
> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"
Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.
However, if you really want to use a regex and gsub, this solution substitutes the first two occurrences of a (string followed by a period) with an empty string.
Note the use of the ? modifier to tell the regex not to be greedy, as well as the {2} modifier to tell it to repeat the expression in brackets two times.
gsub("\\.", "", gsub("(.+?\\.){2}", "", st))
[1] "DATABASE_NAME"
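A single sub() call can also do the job with the default (POSIX-style) engine. This is only a sketch, and it assumes the wanted piece is always the last non-empty dot-separated field:

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Greedy .* runs to the last "." that is still followed by a non-empty
# field; the capture group keeps that field and \\.? absorbs a trailing dot
db_name <- sub("^.*\\.([^.]+)\\.?$", "\\1", st)
db_name
# [1] "DATABASE_NAME"
```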
An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
library(stringr)
str_split(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"