R: regular expression lookaround(s) to grab what's between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between the substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...), and so on) would do it. Cannot get it to work though!
Sorry if this is similar to other SO questions (e.g. here or here).

This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) any characters (.*) ending in class-
2) a single character (.)
3) the string _sample followed by any characters afterwards (.*)
From those you want to keep only the second "set", \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Match any run of characters ending in class-, as well as the string _sample followed by any number of characters, and replace both with the empty string "".

We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters between the two lookarounds, (?<=class-) and (?=_sample). We extract one or more characters that are not a _ (based on the example), preceded by class- and followed by _sample.
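For a base R equivalent without stringr, the same lookarounds should work with regmatches() and regexpr(), provided perl = TRUE is set, since the default regex engine does not support lookarounds. A minimal sketch, assuming each string contains the pattern at most once (the name `between` is only illustrative):

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# perl = TRUE switches to the PCRE engine, which understands (?<=...) and (?=...)
m <- regexpr("(?<=class-)[^_]+(?=_sample)", x, perl = TRUE)

# regmatches() returns only the elements that actually matched
between <- regmatches(x, m)
between
#[1] "X1(z)20" "Z3(z)29"
```

For several matches per string, gregexpr() can be swapped in for regexpr(); regmatches() then returns a list, much like str_extract_all.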

gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Related

Select substring bracketed by two whitespaces

I am using regular expressions in R.
My question is somewhat similar to this one, but I need a more specific solution. I have a character vector. Each string is formatted like this:
"text    text1    text2    text3"
with lots of whitespace between the text chunks. I want to extract text1 from every string. Text1 always has at least two whitespaces on either side, but so does every other text chunk. Text1 will be a name like "Monty Python": may contain a space, but never two spaces.
I'm using stringr, and the str_extract function extracts only a pattern's first occurrence. But I am not sure how to specify my pattern. I tried str_extract(z, "\\s{2,}[a-z]*\\s{2,}"), indicating that I wanted at least one letter between the whitespaces. That resulted in NAs. Is there a way to isolate text1?
You would need to acknowledge the letter case since your substring could have lower/upper case letters and include an optional group construct to match the second word instance of the substring.
Character vector (based on your description of the input, with at least two spaces between chunks):
x <- c('foo  Monty Python  baz  quz',
       'foo  Monty  baz  quz')
Using the stringr package:
library(stringr)
str_trim(str_extract(x, "\\s{2,}[a-zA-Z]+( [a-zA-Z]+)?\\s{2,}"))
# [1] "Monty Python" "Monty"
Using the regular expression in base R:
trimws(regmatches(x, regexpr('\\s{2,}[a-zA-Z]+( [a-zA-Z]+)?\\s{2,}', x)))
# [1] "Monty Python" "Monty"
Although, I would simply just utilize strsplit here:
sapply(strsplit(x, '\\s{2,}'), '[', 2)
# [1] "Monty Python" "Monty"

Removing character from regexp class in R

Edit: Changing the whole question to make it clearer.
Can I remove a single character from one of the regular expression classes in R (such as [:alnum:])?
For example, match all punctuation ([:punct:]) except the _ character.
I am trying to replace underscores used in markdown for italicizing, but the italicized substring may contain a single underscore which I would want to keep.
Edit: As another example, I want to capture everything between pairs of underscores (note one pair contains a single underscore that I want to keep between 1 and 10)
This is _a random_ string with _underscores: rate 1_10 please_
You won't believe it, but lazy matching achieved with a mere ? works as expected here:
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
str <- 'This is a _random string with_ a scale of 1_10.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
Result:
[1] "This is a string with some random underscores in it."
[1] "This is a random string with a scale of 1_10."
However, if you want to modify the [[:print:]] class, mind it is basically a [\x20-\x7E] range. The underscore being \x5F, you can easily exclude it from the range, and use [\x20-\x5E\x60-\x7E].
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([\x20-\x5E\x60-\x7E]+)_+", "\\1", str)
Returns
[1] "This is a string with some random underscores in it."
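Another way to take a single character out of a POSIX class, without spelling out hex ranges, is to put a negative lookahead in front of the class. A minimal sketch, assuming perl = TRUE is acceptable (the sample string `s` is made up for illustration):

```r
s <- "rate 1_10, please! (keep _this_)."

# (?!_) refuses to match at an underscore, so [[:punct:]] effectively
# becomes "any punctuation except _". Lookaheads need perl = TRUE.
no_punct <- gsub("(?!_)[[:punct:]]", "", s, perl = TRUE)
no_punct
# [1] "rate 1_10 please keep _this_"
```

The same trick should carry over to other classes, e.g. (?!x)[[:alnum:]] for "alphanumeric except x".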
Similar to @stribizhev:
x <- "This is _a random_ string with _underscores: rate 1_10 please_"
gsub("\\b_(.*?)_\\b", "\\1", x, perl=T)
produces:
[1] "This is a random string with underscores: rate 1_10 please"
Here we use word boundaries and lazy matching. Note that the default regexp engine has issues with lazy repetition and capture groups, so you may want to use perl=T
gsub('(?<=\\D)\\_(?=\\D|$)','',str,perl=T)

Retrieve digits after specific string in R

I have a bunch of strings that contain the word "radius" followed by one or two digits. They also contain a lot of other letters, digits, and underscores. For example, one is "inflow100_radius6_distance12". I want a regex that will just return the one or two digits following "radius." If R recognized \K, then I would just use this:
radius\K[0-9]{1,2}
and be done. But R doesn't allow \K, so I ended up with this instead (which selects radius and the following numbers, and then cuts off "radius"):
result <- regmatches(input_string, gregexpr("radius[0-9]{1,2}", input_string))
result <- substr(unlist(result), 7, 8)
I'm pretty new to regex, so I'm sure there's a better way. Any ideas?
\K is recognized. You can solve the problem by turning on the perl = TRUE parameter.
result <- regmatches(x, gregexpr('radius\\K\\d+', x, perl=T))
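As a small runnable sketch of this, with a second made-up input string added for illustration: \K and a fixed-width lookbehind should give the same digits, and both need perl = TRUE.

```r
x <- c("inflow100_radius6_distance12",
       "inflow50_radius12_distance3")   # second string is a made-up example

# \\K drops "radius" from the reported match; it requires perl = TRUE
digits_k <- regmatches(x, regexpr("radius\\K[0-9]{1,2}", x, perl = TRUE))

# A lookbehind is another way to express the same thing
digits_lb <- regmatches(x, regexpr("(?<=radius)[0-9]{1,2}", x, perl = TRUE))

as.integer(digits_lb)
# [1]  6 12
```

regexpr() is used instead of gregexpr() here on the assumption that "radius" occurs once per string, which keeps the result a plain character vector.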
1) Match the entire string replacing it with the digits after radius:
sub(".*radius(\\d+).*", "\\1", "inflow100_radius6_distance12")
## [1] "6"
2) This also works, involves a simpler regular expression and converts it to numeric at the same time:
library(gsubfn)
strapply("inflow100_radius6_distance12", "radius(\\d+)", as.numeric, simplify = TRUE)
## [1] 6

I would like to use gsub in R to match all items which are not alphanumeric

I am searching raw twitter snippets using R but keep getting issues where there are non standard Alphanumeric chars such as the following "🏄".
I would like to take out all non [abcdefghijklmnopqrstuvwxyz0123456789] characters using gsub.
Can you use gsub to specify a replace for those items NOT in [abcdefghijklmnopqrstuvwxyz0123456789]?
You could simply negate your pattern with [^ ...]:
x <- "abcde🏄fgh"
gsub("[^A-Za-z0-9]", "", x)
# [1] "abcdefgh"
Please note that what the class [:alnum:] matches depends on the locale, and it can include non-ASCII letters and digits. That is why gsub("[^[:alnum:]]", "", x) may leave some of your special characters in place.
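If only the lowercase set from the question should survive, a variant with perl = TRUE keeps the ranges tied to plain code points rather than the locale. A minimal sketch, assuming the input is already lowercase; the emoji is written as a \U escape only so the snippet does not depend on file encoding, and `ascii_only` is an illustrative name:

```r
x <- "abcde\U0001F3C4fgh"   # "abcde", a surfer emoji, then "fgh"

# With perl = TRUE, a-z and 0-9 are code point ranges,
# so everything outside them is removed.
ascii_only <- gsub("[^a-z0-9]", "", x, perl = TRUE)
ascii_only
# [1] "abcdefgh"
```

To keep spaces between words as well, a space can be added inside the brackets: "[^a-z0-9 ]".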

R: Find the last dot in a string

In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
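To illustrate how that position might be used, here is a minimal sketch that also covers strings without any dot; the extra sample strings and the names `last_dot` and `suffix` are made up for the example:

```r
x <- c("hello.world.123.456", "archive.tar.gz", "no_dots_here")

# Position of the last dot; regexpr() returns -1 where there is no match
last_dot <- as.integer(regexpr("\\.[^.]*$", x))
last_dot
# [1] 16 12 -1

# Use the position to keep everything after the last dot (NA if no dot)
suffix <- ifelse(last_dot > 0, substring(x, last_dot + 1), NA_character_)
suffix
# [1] "456" "gz"  NA
```

Inside a bracket expression the dot is already literal, so [^.] behaves the same as [^\\.].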
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
and another tidier option to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package, that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note the . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot char.
See an R demo online:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]