Reconciling regex behaviors - regex

I am trying a regex ((?:I\d-?)*I3(?:-?I\d)*) here:
From the string A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2 I get the following matches: I1-I3, I1-I1-I3-I1-I1-I3-I2, and I3. This is the desired behavior. However, in R:
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2"
strsplit(x, "(?:I\d-?)*I3(?:-?I\d)*")
this returns an error:
Error: '\d' is an unrecognized escape in character string starting ""(?:I\d"
I have tried perl=TRUE, but it doesn't make a difference.
I have also tried to modify the regex to read: (?:I\\d-?)*I3(?:-?I\\d)*. However, this does not give the correct result; rather, it returns A-B-C-I1-I2-D-E-F-, -D-D-D-D-, -L-K-, and -P-F-I2-I2.
How can I replicate the desired behavior in R?

If we need to split the string and keep the substrings that match the pattern shown, we can mark that pattern to be skipped with (*SKIP)(*F) and split the string on every remaining character.
v1 <- strsplit(x, '(?:I\\d-?)*I3(?:-?I\\d)*(*SKIP)(*F)|.', perl=TRUE)[[1]]
The blank/empty elements can be removed using nzchar, which returns a logical vector that is TRUE where the string is non-empty and FALSE where it is empty.
v1[nzchar(v1)]
#[1] "I1-I3" "I1-I1-I3-I1-I1-I3-I2" "I3"
Or as we are interested more in extracting the pattern, str_extract would be useful.
library(stringr)
str_extract_all(x, '(?:I\\d-?)*I3(?:-?I\\d)*')[[1]]
#[1] "I1-I3" "I1-I1-I3-I1-I1-I3-I2" "I3"
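For reference, the error in the question comes from R's string parser rather than from the regex engine: inside an R string literal a backslash has to be escaped itself, so the regex \d is written "\\d". Below is a minimal base R sketch, assuming only the x from the question, that extracts the matches with gregexpr and regmatches instead of stringr. Passing perl = TRUE is a conservative choice made here, not something the question requires.

```r
x <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I3-I1-I1-I3-I2-L-K-I3-P-F-I2-I2"

# "\\d" in R source becomes the two-character regex \d once the string is parsed.
n_chars <- nchar("\\d")

pattern <- "(?:I\\d-?)*I3(?:-?I\\d)*"

# gregexpr() locates every match; regmatches() pulls out the matched text.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(matches)
```

Note that strsplit returns the pieces between the matches, which is why the plain strsplit call in the question produced the complement of the desired result.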

Related

R Grep help: match exact substring. RStudio on Mac OSX

I'm trying to match an exact substring using grep. I'm using the following expression:
grep("^.*apple().*$",inputString)
Expected output:
1) input string is "apple()" - expected to match
2) input string is "appleSomethingElse()" - expected not to match
Case 1 works and I get a match. However case two also matches. I'm trying to write a regular expression that only matches when "apple" and "()" are next to each other in the string. Is my expression wrong?
When you have metacharacters in your expression that you want to match literally, you can simply use the fixed = TRUE argument within grep and thus keep your expression simple.
x <- c('apple()', 'appleSomethingElse()', 'adadaapple()aaa')
grep('apple()', x, fixed = TRUE)
## [1] 1 3
We need to escape (\\) the parentheses (()) to make this work using the same syntax as in the OP's code.
grep("^.*apple\\(\\).*$", x)
#[1] 1 3
As @DavidArenburg mentioned in the comments, if this is for matching the whole string rather than a substring, == would be more useful.
x=='apple()'
#[1] TRUE FALSE FALSE
data
x <- c('apple()', 'appleSomethingElse()', 'adadaapple()aaa')
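To see why the original expression matched case 2, it may help to compare the three variants side by side. This is a small sketch using the same x as above; the variable names unescaped, escaped and fixed are only illustrative. In the unescaped pattern, () is an empty group that matches nothing, so the regex effectively just looks for "apple".

```r
x <- c('apple()', 'appleSomethingElse()', 'adadaapple()aaa')

# () is an empty group here, so any string containing "apple" matches.
unescaped <- grepl("apple()", x)

# Escaped parentheses are matched as literal characters.
escaped <- grepl("apple\\(\\)", x)

# fixed = TRUE treats the whole pattern as a literal string.
fixed <- grepl("apple()", x, fixed = TRUE)

print(rbind(unescaped, escaped, fixed))
```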

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...),... and so) would do it. Cannot get it to work though!
(Sorry if this is similar to other SO questions.)
This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) characters starting with _sample and any characters afterwards .*
From those you want to keep the second "set" \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class-, and the string _sample followed by any number of characters, and substitute them with the empty string "".
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that are between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that are not a _ (based on the example), preceded by class- and followed by _sample.
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"
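The lookaround approach also works in base R, as long as the PCRE engine is selected with perl = TRUE. The sketch below assumes the x from the question; between and between2 are illustrative names. The second variant uses a capture group instead of lookarounds.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# Lookbehind/lookahead need the PCRE engine, hence perl = TRUE.
m <- regexpr("(?<=class-)[^_]+(?=_sample)", x, perl = TRUE)
between <- regmatches(x, m)

# Capture-group alternative: match the whole string, keep only group 1.
between2 <- sub(".*class-(.*?)_sample.*", "\\1", x, perl = TRUE)

print(between)
```

One thing to keep in mind: regmatches with regexpr silently drops elements that have no match, while the sub variant returns such elements unchanged.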

Extracting clock time from string

I have a dataframe that consists of web-scraped data. One of the fields scraped was a time in clock time, but the scraping process wasn't perfect. Most of the 'good' data look something like '4:33, or '103:20 (so a leading single quote, and two fields, minutes and seconds). Also, there is some bad data, the most common one being '],, but also some containing text. I'd like a new string that is something like 4:33, and for bad data, just blank.
So my plan of attack is to match my good data form, and then replace everything else with a blank. Something like time <- gsub('[0-9]+:[0-9]+', '', time). I know this would replace my pattern with a blank, and I want the opposite, but I'm unsure how to negate this whole pattern. A simple caret doesn't seem to work, nor does applying it to a group. I tried something like gsub("(.)+([0-9]+)(:)([0-9]+)", "\\2\\3\\4", time) but that isn't working either.
Sample:
dput(sample)
c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33",
"'43:33")
Expected output:
c("", "", "8:18", "13:33", "43:33")
We can use grepl to set the elements that do not follow the pattern to '' and then remove the quote (') with sub. Here, the pattern describes strings that start (^) with ', followed by digits, a :, and digits, in that order, up to the end ($) of the string. All other elements (selected by negating with !) are assigned '' using the logical index from grepl, and sub then removes the '.
sample[!grepl("^'\\d+:\\d+$", sample)] <- ''
sub("'", '', sample)
#[1] "" "" "8:18" "13:33" "43:33"
Or we can also do this in one step using gsub by replacing all those characters (.) that do not follow the pattern \\d+:\\d+ with ''.
gsub("(\\d+:\\d+)(*SKIP)(*F)|.", '', sample, perl=TRUE)
#[1] "" "" "8:18" "13:33" "43:33"
Or another option is str_extract from library(stringr). It is not clear whether there are other patterns such as "some text '08:20 value" in the OP's original dataset or not. The str_extract will also extract those time values, if present.
library(stringr)
str_extract(sample, '\\d+:\\d+')
#[1] NA NA "8:18" "13:33" "43:33"
It will give NA instead of '' for those that don't follow the pattern.
You can use sub:
sub('.+?(?=[0-9]+:[0-9]+)|.+', '', sample, perl = TRUE)
[1] "" "" "8:18" "13:33" "43:33"
The regex consists of two parts that are combined with a logical or (|).
.+?(?=[0-9]+:[0-9]+)
This regex matches one or more characters (as few as possible) followed by the target pattern.
.+
This regex matches one or more characters.
The logic: Replace everything preceding the target pattern with an empty string (''). If there is no target pattern, replace everything with the empty string.
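Another way to get blanks instead of NA, without the (*SKIP)(*F) verbs, is to start from an all-blank vector and fill in only the positions where a time was found. This is a rough base R sketch; the names scraped and clock are hypothetical (scraped holds the sample data from the question, renamed so that it does not mask base::sample).

```r
scraped <- c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33", "'43:33")

# regexpr() returns -1 for elements with no match.
m <- regexpr("[0-9]+:[0-9]+", scraped)

# Start from all-blank output and fill in only the matched positions.
clock <- rep("", length(scraped))
clock[m > 0] <- regmatches(scraped, m)

print(clock)
```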

Gsub to get part matched strings in R regular expression?

gsub('[a-zA-Z]+([0-9]{5})','\\1','htf84756.iuy')
[1] "84756.iuy"
I want to get 84756. How can I do that?
Using gregexpr() with regmatches() has the advantage of only requiring that your pattern match the bit that you actually want to extract:
string <- 'htf84756.iuy'
pat <- "(\\d){5}"
regmatches(string, gregexpr(pat, string))[[1]]
# [1] "84756"
(In practice, these functions are more useful when a supplied string might contain more than one substring matching pat.)
Try this:
R> gsub('[a-zA-Z]+([0-9]{5}).*','\\1','htf84756.iuy')
[1] "84756"
You need the added .* at the end of the regexp so that it also consumes everything after the 5 digits; the whole string is then matched and replaced by the captured group.
This could work as well (I like Dirk's answer better), based on what to add to yours:
gsub('[a-zA-Z]+([0-9]{5})\\.([a-zA-Z])+','\\1','htf84756.iuy')
If you just want the numeric string this may be helpful as well:
gsub('[^0-9]','','htf84756.iuy')
With stringr, you can use str_extract:
library(stringr)
str_extract("htf84756.iuy", "[0-9]+")
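To make the difference concrete: sub and gsub only replace the part of the string that the pattern matched, so anything outside the match survives. The sketch below, using the string from the question, shows the leftover tail and then reads the capture group directly with regexec, which is one more base R option. The names partial, groups and digits are illustrative.

```r
string <- "htf84756.iuy"
pattern <- "[a-zA-Z]+([0-9]{5})"

# Without a trailing .*, only "htf84756" is matched and replaced,
# so the unmatched tail ".iuy" is left in place.
partial <- sub(pattern, "\\1", string)

# regexec() reports the full match followed by each capture group.
groups <- regmatches(string, regexec(pattern, string))[[1]]
digits <- groups[2]

print(digits)
```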

R: Extract data from string using POSIX regular expression

How to extract only DATABASE_NAME from this string using POSIX-style regular expressions?
st <- "MICROSOFT_SQL_SERVER.DATABASE\INSTANCE.DATABASE_NAME."
First of all, this generates an error
Error: '\I' is an unrecognized escape in character string starting "MICROSOFT_SQL_SERVER.DATABASE\I"
I was thinking something like
sub(".*\\.", st, "")
The first problem is that you need to escape the \ in your string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
As for the main problem, this will return the bit you want from the string you gave:
> sub("\\.$", "", sub("[A-Za-z0-9\\._]*\\\\[A-Za-z]*\\.", "", st))
[1] "DATABASE_NAME"
But a simpler solution would be to split on the \\. and select the last chunk:
> strsplit(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
or slightly more automated
> sst <- strsplit(st, "\\.")[[1]]
> tail(sst, 1)
[1] "DATABASE_NAME"
Other answers provided some really good alternative ways of cracking the problem using strsplit or str_split.
However, if you really want to use a regex and gsub, this solution substitutes the first two occurrences of a (string followed by a period) with an empty string.
Note the use of the ? modifier to tell the regex not to be greedy, as well as the {2} quantifier to tell it to repeat the expression in parentheses two times.
gsub("\\.", "", gsub("(.+?\\.){2}", "", st))
[1] "DATABASE_NAME"
An alternative approach is to use str_split in package stringr. The idea is to split st into strings at each period, and then to isolate the third string:
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."
library(stringr)
str_split(st, "\\.")[[1]][3]
[1] "DATABASE_NAME"
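If the number of dot-separated fields is not known in advance, a variant that does not hard-code the index 3 may be more robust. The following is a sketch under that assumption, using the escaped st from the answers above; has_backslash, db_name, parts and db_name2 are illustrative names.

```r
st <- "MICROSOFT_SQL_SERVER.DATABASE\\INSTANCE.DATABASE_NAME."

# The doubled backslash in the source is a single character in the string.
has_backslash <- grepl("\\", st, fixed = TRUE)

# Keep the last run of non-dot characters, ignoring one optional trailing dot.
db_name <- sub("^.*\\.([^.]+)\\.?$", "\\1", st)

# Splitting variant: split on literal dots, drop empty pieces, take the last.
parts <- strsplit(st, ".", fixed = TRUE)[[1]]
db_name2 <- tail(parts[nzchar(parts)], 1)

print(db_name)
```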