R: Extract data from string using POSIX regular expression

How to extract only DATABASE_NAME from this string using POSIX-style regular expressions?
st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."
First of all, this generates an error
Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"
I was thinking something like
sub(".*\\.", st, "")

The first problem is that you need to escape the \ in your string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
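To see why the doubled backslash is needed, here is a small sketch. The names st_escaped and st_raw are illustrative, and the raw-string syntax assumes R 4.0.0 or later:

```r
# In an R string literal, "\\" denotes ONE backslash character.
st_escaped <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Raw strings (assumes R >= 4.0.0) take backslashes literally, so no doubling is needed.
st_raw <- r"(MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME.)"

identical(st_escaped, st_raw)  # TRUE: both hold the same characters
nchar("\\")                    # 1: the escape sequence is a single character
writeLines(st_escaped)         # prints the string with a single backslash
```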
As for the main problem, this will return the bit you want from the string you gave:
> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"
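But a simpler solution would be to split on the literal period (pattern "\\.") and select the last chunk, which here is the third because strsplit drops the trailing empty string: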
> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
or, slightly more generally, take the last element whatever its position:
> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"
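If a single substitution is preferred, one possible sketch uses a capture group with the default (POSIX-style) engine. The pattern and the name db_name are my own illustration, not part of the answer above, and it assumes the wanted field is the last non-empty one:

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# ^.*\\.    greedy: consume everything up to a period, as far right as possible
# ([^.]+)   capture a run of non-period characters (the last field)
# \\.?$     allow one optional trailing period before the end of the string
db_name <- sub("^.*\\.([^.]+)\\.?$", "\\1", st)
db_name
# [1] "DATABASE_NAME"
```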

Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.
However, if you really want to use a regex and gsub, this solution removes the first two occurrences of a (string followed by a period), and the outer gsub then strips the remaining period.
Note the use of the ? modifier to make the + quantifier non-greedy, as well as the {2} quantifier to repeat the parenthesized group exactly two times.
gsub("\\.", "", gsub("(.+?\\.){2}", "", st))
[1] "DATABASE_NAME"
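The effect of the ? modifier can be seen by comparing the greedy and non-greedy forms side by side. This is a sketch only; it uses perl = TRUE so that the behaviour follows Perl-style backtracking rules, and the names greedy and lazy are illustrative:

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# Greedy: the two repetitions stretch across all three periods,
# so the whole string is matched and removed.
greedy <- sub("(.+\\.){2}", "", st, perl = TRUE)

# Non-greedy: each repetition stops at the first period it can,
# so only the first two fields are removed.
lazy <- sub("(.+?\\.){2}", "", st, perl = TRUE)

greedy  # ""
lazy    # "DATABASE_NAME."
```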

An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
library(stringr)
str_split(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"

Related

How to find pattern next to a given string using regex in R

I have a string formatted like "segmentation_level1_id_10" and would like to extract the level number associated with it (i.e. the number directly after the word level).
I have a solution that does this in two steps: it first finds the pattern level\\d+ and then replaces level with an empty string. I would like to know if it's possible to do this in one step with just str_extract.
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is to use the str_extract function from the stringr library with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can also be done in base R with gsub, using the same capturing mechanism as str_match together with a backreference (\\1) that keeps only the captured text in the replacement result:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"
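The lookbehind can also be used without stringr. A minimal sketch with base R's regexpr and regmatches, assuming perl = TRUE (lookbehinds are not available in the default engine); the name level is illustrative:

```r
s <- "segmentation_level1_id_10"

# regexpr() locates the first match; regmatches() extracts the matched text.
# (?<=level) asserts that "level" precedes the digits without consuming it.
level <- regmatches(s, regexpr("(?<=level)\\d+", s, perl = TRUE))
level
# [1] "1"
```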

Reconciling regex behaviors

I am trying the regex ((?:I\d-?)*I3(?:-?I\d)*).
From the string A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2 I get the matches I1-I3, I1-I1-I3-I1-I1-I3-I2, and I3, which is the desired behavior. However, in R:
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2"
strsplit(x, "(?:I\d-?)*I3(?:-?I\d)*")
this returns an error:
Error: '\d' is an unrecognized escape in character string starting ""(?:I\d"
I have tried perl=TRUE, but it doesn't make a difference.
I have also tried modifying the regex to read (?:I\\d-?)*I3(?:-?I\\d)*; however, this does not give the result I want. Instead it returns A-B-C-I1-I2-D-E-F-, -D-D-D-D-, -L-K-, and -P-F-I2-I2.
How can I replicate the desired behavior in R?
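If we need to split the string and keep the substrings that match the pattern shown, we can mark that pattern as something to be skipped with (*SKIP)(*F) and split on every other character. The doubled backslash (\\d) is required because R string literals treat a single backslash as an escape, and perl=TRUE is needed for the (*SKIP)(*F) verbs.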
v1 <- strsplit(x, '(?:I\\d-?)*I3(?:-?I\\d)*(*SKIP)(*F)|.', perl=TRUE)[[1]]
The empty elements can be removed with nzchar, which returns a logical vector that is TRUE for non-empty strings and FALSE for empty ones.
v1[nzchar(v1)]
#[1] "I1-I3" "I1-I1-I3-I1-I1-I3-I2" "I3"
Or, since we are more interested in extracting the pattern than in splitting, str_extract_all from stringr is more direct.
library(stringr)
str_extract_all(x, '(?:I\\d-?)*I3(?:-?I\\d)*')[[1]]
#[1] "I1-I3" "I1-I1-I3-I1-I1-I3-I2" "I3"
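The same extraction can be sketched in base R with gregexpr and regmatches, which return the matches themselves rather than the pieces between them (strsplit returns the latter, which explains the result seen in the question). The names pat and hits are illustrative:

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2"

# Backslashes are doubled because this is an R string literal;
# perl = TRUE enables the (?: ... ) non-capturing groups.
pat <- "(?:I\\d-?)*I3(?:-?I\\d)*"

# gregexpr() finds every match; regmatches() pulls out the matched text.
hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
hits
# [1] "I1-I3"                "I1-I1-I3-I1-I1-I3-I2" "I3"
```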

R regular expression

I have a string like below.
testSampe <- "Old:windows\r\nNew:linux\r\n"
I want to erase the text between ":" and "\".
Like this "Old\r\nNew\r\n".
How can I construct the regex for this?
I tried the gsub function with the regex ":.*\\\\", but it doesn't work.
gsub(":.*\\\\", "\\\\r", testSampe)
The "\r" in the string is a single carriage-return character, not a backslash followed by r, so there is no backslash for the pattern to match. Instead, remove the colon and everything after it up to the line break:
> testSampe <- "Old:windows\r\nNew:linux\r\n"
> gsub(":[^\r\n]*", "", testSampe)
[1] "Old\r\nNew\r\n"
You have a choice of a few different regular expressions that will match. See falsetru's answer or use:
rx <- ":[[:alnum:]]*(?=\\r)"
As a more readable alternative to gsub, use str_replace_all in the stringr package.
library(stringr)
str_replace_all(testSampe, rx, "")  # current stringr handles lookaheads directly; the old perl() wrapper is deprecated
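For comparison, the lookahead pattern can also be used with base gsub, provided perl = TRUE is set (the default engine does not support lookaheads). A minimal sketch; the name cleaned is illustrative:

```r
testSampe <- "Old:windows\r\nNew:linux\r\n"

# Match a colon plus any alphanumerics, but only when a carriage return follows.
# The (?=\\r) lookahead checks for the "\r" without removing it.
rx <- ":[[:alnum:]]*(?=\\r)"

cleaned <- gsub(rx, "", testSampe, perl = TRUE)
identical(cleaned, "Old\r\nNew\r\n")
# [1] TRUE
```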

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
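The non-greedy idea behind the strapplyc call can be sketched with base sub as well, assuming perl = TRUE. The names inputs and stems are illustrative. Because [a-z]+? takes as few characters as possible, the optional c is left for the second part of the pattern whenever it is present at the end:

```r
inputs <- c("abc", "ab", "abd")

# ^([a-z]+?)  capture as few letters as possible
# c?$         then an optional trailing "c" and the end of the string
stems <- sub("^([a-z]+?)c?$", "\\1", inputs, perl = TRUE)
stems
# [1] "ab"  "ab"  "abd"
```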
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
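In R, the first captured group can be referenced as \\1 in the replacement of sub. The following is a sketch of that idea, not the answerer's own code: it adds a leading ^ anchor and perl = TRUE as assumptions, and the sample symbols and the name roots are illustrative:

```r
symbols <- c("ZNH3", "ZN", "CLZ14")

# ^(.+?)                        the root, as short as possible
# ([FGHJKMNQUVXZ][0-9]{1,2})?   an optional expiration code
# $                             end of string
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)
roots
# [1] "ZN" "ZN" "CL"
```

Because the whole string is matched and replaced by group 1, symbols without an expiration code come back unchanged.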

Gsub to get part matched strings in R regular expression?

gsub('[a-zA-Z]+([0-9]{5})','\\1','htf84756.iuy')
[1] "84756.iuy"
I want to get 84756. How can I do that?
Using gregexpr() with regmatches() has the advantage of only requiring that your pattern match the bit that you actually want to extract:
string <- 'htf84756.iuy'
pat <- "(\\d){5}"
regmatches(string, gregexpr(pat, string))[[1]]
# [1] "84756"
(In practice, these functions are more useful when a supplied string might contain more than one substring matching pat.)
Try this:
R> gsub('[a-zA-Z]+([0-9]{5}).*','\\1','htf84756.iuy')
[1] "84756"
You need the added .* at the end of the pattern so that the match also consumes everything after the 5 digits; gsub then replaces the whole string with the captured group.
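The difference is easy to check by running both patterns on the same input. A short sketch; the names partial and full are illustrative:

```r
s <- "htf84756.iuy"

# Without .* only "htf84756" is matched and replaced, so ".iuy" survives.
partial <- sub("[a-zA-Z]+([0-9]{5})", "\\1", s)

# With .* the match covers the whole string, leaving only the captured digits.
full <- sub("[a-zA-Z]+([0-9]{5}).*", "\\1", s)

partial  # "84756.iuy"
full     # "84756"
```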
This could work as well (though I like Dirk's answer better); it shows what to add to your own pattern:
gsub('[a-zA-Z]+([0-9]{5})\\.([a-zA-Z])+','\\1','htf84756.iuy')
If you just want the numeric string this may be helpful as well:
gsub('[^0-9]','','htf84756.iuy')
With stringr, you can use str_extract:
library(stringr)
str_extract("htf84756.iuy", "[0-9]+")