Select substring bracketed by two whitespaces - regex

I am using regular expressions in R.
My question is somewhat similar to this one, but I need a more specific solution. I have a character vector. Each string is formatted like this:
"text   text1   text2   text3"
with lots of whitespace between the text chunks. I want to extract text1 from every string. Text1 always has at least two whitespaces on either side, but so does every other text chunk. Text1 will be a name like "Monty Python": may contain a space, but never two spaces.
I'm using stringr, and the str_extract function extracts only a pattern's first occurrence. But I am not sure how to specify my pattern. I tried str_extract(z, "\\s{2,}[a-z]*\\s{2,}"), indicating that I wanted at least one letter between the whitespaces. That resulted in NAs. Is there a way to isolate text1?

You need to account for letter case, since the substring can contain both lower-case and upper-case letters, and include an optional group to match a second word in the substring.
Character vector (based on your description of the input, with multiple spaces between chunks):
x <- c('foo    Monty Python    baz    quz',
       'foo    Monty    baz    quz')
Using the stringr package:
library(stringr)
str_trim(str_extract(x, "\\s{2,}[a-zA-Z]+( [a-zA-Z]+)?\\s{2,}"))
# [1] "Monty Python" "Monty"
Using the regular expression in base R:
trimws(regmatches(x, regexpr('\\s{2,}[a-zA-Z]+( [a-zA-Z]+)?\\s{2,}', x)))
# [1] "Monty Python" "Monty"
That said, I would simply use strsplit here:
sapply(strsplit(x, '\\s{2,}'), '[', 2)
# [1] "Monty Python" "Monty"
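As a variation on the approaches above, lookarounds can require the two-space gaps without including them in the match, so no trimming is needed. This is only a sketch: it assumes the chunks are separated by runs of whitespace and that a name contains single spaces only, as described in the question. The names pattern and first_chunk are made up for illustration.

```r
# Sample input mirroring the description: chunks separated by 2+ spaces
x <- c('foo    Monty Python    baz    quz',
       'foo    Monty    baz    quz')

# (?<=\\s{2})   the match must be preceded by two whitespace characters
# \\S+(?: \\S+)* one word, optionally followed by single-space-separated words
# (?=\\s{2})    the match must be followed by two whitespace characters
pattern <- "(?<=\\s{2})\\S+(?: \\S+)*(?=\\s{2})"

# regexpr() finds only the first match per string; lookarounds need perl = TRUE
first_chunk <- regmatches(x, regexpr(pattern, x, perl = TRUE))
first_chunk
# [1] "Monty Python" "Monty"
```

Note that regmatches() with regexpr() silently drops strings that have no match, so the result can be shorter than the input.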

Related

Using Gsub to get matched strings in R - regular expression

I am trying to extract words after the first space using
species<-gsub(".* ([A-Za-z]+)", "\\1", x=genus)
This works fine for the other rows that have two words. However, row [9] "Eulamprus tympanum marnieae" has three words, and my code returns only the last word in the string, "marnieae". How can I extract the words after the first space so that I get "tympanum marnieae" instead of "marnieae", with the answers stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to capture one or more words after the first:
"[A-Za-z]+ ([A-Za-z ]+)"
https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple but imperfect solution. It will also match trailing spaces, or one or more spaces after the first word even when no second word exists. "Eulamprus      " (six trailing spaces), for example, matches the pattern and captures five spaces. Use this pattern only if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern captures one word (following the first), followed by any number of additional words (including zero), separated by single spaces.
However, from what I remember from biology class, the part after the genus is only ever composed of one or two names, and is never capitalized. The following pattern reflects this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
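To tie this back to the question, here is a minimal sketch of how the more reliable pattern might be used with sub() in R. The ^ and $ anchors and perl = TRUE (for the non-capturing group) are my additions, and the second element of genus is an illustrative value, not data from the question.

```r
# Illustrative input: one three-word name and one two-word name
genus <- c("Eulamprus tympanum marnieae", "Eulamprus quoyii")

# Capture everything after the first word: one word, then any number of
# further words separated by single spaces. Anchoring with ^ and $ makes
# sub() replace the whole string with the captured group.
species <- sub("^[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)$", "\\1",
               genus, perl = TRUE)
species
# [1] "tympanum marnieae" "quoyii"
```

Strings that do not fit the pattern are returned unchanged by sub(), so it may be worth checking for those separately.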

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...),... and so) would do it. Cannot get it to work though!
Sorry if this is similar to other SO questions (e.g. here or here).
This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) any characters (.*) ending in class-
2) a single character (.)
3) characters starting with _sample, and any characters afterwards (.*)
From those you want to keep the second "set" \\2.
Or another version that may be easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class-, and the string _sample followed by any number of characters, and replace both with the empty string "".
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that lie between the two lookarounds, (?<=class-) and (?=_sample). We extract characters that are not a _ (based on the example), preceded by class- and followed by _sample.
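For readers who prefer to avoid a package dependency, the same lookaround idea can be sketched in base R. This is an assumption-laden variant rather than the answer's own code: it swaps [^_]+ for a lazy .+? so the extracted part may itself contain underscores, and it returns only the first match per string. The name between is made up.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# (?<=class-) and (?=_sample) assert the surrounding text without consuming it;
# .+? takes as few characters as possible between them.
# Lookarounds require the PCRE engine, hence perl = TRUE.
between <- regmatches(x, regexpr("(?<=class-).+?(?=_sample)", x, perl = TRUE))
between
# [1] "X1(z)20" "Z3(z)29"
```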
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Extracting clock time from string

I have a dataframe that consists of web-scraped data. One of the fields scraped was a time in clock time, but the scraping process wasn't perfect. Most of the 'good' data look something like '4:33, or '103:20 (so a leading single quote, and two fields, minutes and seconds). Also, there is some bad data, the most common one being '],, but also some containing text. I'd like a new string that is something like 4:33, and for bad data, just blank.
So my plan of attack is to match my good data form, and then replace everything else with a blank. Something like time <- gsub('[0-9]+:[0-9]+', '', time). I know this would replace my pattern with a blank, and I want the opposite, but I'm unsure how to negate this whole pattern. A simple caret doesn't seem to work, nor does applying it to a group. I tried something like gsub("(.)+([0-9]+)(:)([0-9]+)", "\\2\\3\\4", time) but that isn't working either.
Sample:
dput(sample)
c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33",
"'43:33")
Expected output:
c("", "", "8:18", "13:33", "43:33")
We can use grepl to set the elements that do not follow the pattern to '' and then remove the quotes (') from the rest. Here, the pattern describes strings that start (^) with ' followed by numbers, :, and numbers, in that order, up to the end ($) of the string. All other elements (selected by negating the logical index from grepl with !) are assigned '', and we use sub to remove the '.
sample[!grepl("^'\\d+:\\d+$", sample)] <- ''
sub("'", '', sample)
#[1] "" "" "8:18" "13:33" "43:33"
Or we can also do this in one step using gsub by replacing all those characters (.) that do not follow the pattern \\d+:\\d+ with ''.
gsub("(\\d+:\\d+)(*SKIP)(*F)|.", '', sample, perl=TRUE)
#[1] "" "" "8:18" "13:33" "43:33"
Or another option is str_extract from library(stringr). It is not clear whether there are other patterns such as "some text '08:20 value" in the OP's original dataset or not. The str_extract will also extract those time values, if present.
library(stringr)
str_extract(sample, '\\d+:\\d+')
#[1] NA NA "8:18" "13:33" "43:33"
It gives NA instead of '' for the elements that don't follow the pattern.
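If empty strings are wanted rather than NA, one possible base R sketch is to pre-fill the result with "" and write the matches into the positions where regexpr() found something. The names times, m and clock are made up here (times is used instead of sample to avoid masking the base function).

```r
times <- c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33", "'43:33")

# regexpr() returns -1 for elements with no match
m <- regexpr("[0-9]+:[0-9]+", times)

# Start with all-empty strings, then fill in the matched positions only;
# regmatches() returns one value per matching element, in order
clock <- character(length(times))
clock[m != -1] <- regmatches(times, m)
clock
# [1] ""      ""      "8:18"  "13:33" "43:33"
```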
You can use sub:
sub('.+?(?=[0-9]+:[0-9]+)|.+', '', sample, perl = TRUE)
[1] "" "" "8:18" "13:33" "43:33"
The regex consists of two parts combined with alternation (|):
.+?(?=[0-9]+:[0-9]+) matches one or more characters, as few as possible, that are followed by the target pattern.
.+ matches one or more characters.
The logic: replace everything preceding the target pattern with an empty string (''). If there is no target pattern, replace everything with the empty string.

Removing character from regexp class in R

Edit: Changing the whole question to make it clearer.
Can I remove a single character from one of the regular expression classes in R (such as [:alnum:])?
For example, match all punctuation ([:punct:]) except the _ character.
I am trying to replace underscores used in markdown for italicizing, but the italicized substring may contain a single underscore which I would want to keep.
Edit: As another example, I want to capture everything between pairs of underscores (note one pair contains a single underscore that I want to keep between 1 and 10)
This is _a random_ string with _underscores: rate 1_10 please_
You won't believe it, but lazy matching achieved with a mere ? works as expected here:
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
str <- 'This is a _random string with_ a scale of 1_10.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
Result:
[1] "This is a string with some random underscores in it."
[1] "This is a random string with a scale of 1_10."
However, if you want to modify the [[:print:]] class, note that it is basically the [\x20-\x7E] range. Since the underscore is \x5F, you can exclude it from the range and use [\x20-\x5E\x60-\x7E].
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([\x20-\x5E\x60-\x7E]+)_+", "\\1", str)
Returns
[1] "This is a string with some random underscores in it."
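For the literal question of removing one character from a POSIX class, another option (not taken from the answers here, so treat it as a sketch) is a negative lookahead in front of the class. It needs perl = TRUE, and the sample string s is made up for illustration.

```r
s <- "rate 1_10, please: a_b. (done!)"

# (?!_) rejects the position if the next character is an underscore,
# so [[:punct:]] effectively becomes "any punctuation except _"
no_punct <- gsub("(?!_)[[:punct:]]", "", s, perl = TRUE)
no_punct
# [1] "rate 1_10 please a_b done"
```

The same lookahead trick should work for other classes such as [[:alnum:]], at the cost of requiring the PCRE engine.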
Similar to @stribizhev:
x <- "This is _a random_ string with _underscores: rate 1_10 please_"
gsub("\\b_(.*?)_\\b", "\\1", x, perl=T)
produces:
[1] "This is a random string with underscores: rate 1_10 please"
Here we use word boundaries and lazy matching. Note that the default regexp engine has issues with lazy repetition and capture groups, so you may want to use perl=T
gsub('(?<=\\D)\\_(?=\\D|$)','',str,perl=T)

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # keep only the first capture group (the root)
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
The regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches:
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of the string
The required result is in the first capture group of the match (however that is referenced in R).
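In R, the first capture group can be referenced as "\\1" in sub(). The following is a minimal sketch of that idea; the ^ anchor, perl = TRUE, and the sample symbols (including "CLZ4", borrowed from another answer) are my assumptions.

```r
symbols <- c("ZNH3", "ZN", "CLZ4")

# ^(.+?)   the root: as few characters as possible
# (...)?$  an optional expiration code, which must end the string
# Replacing the whole match with "\\1" keeps only the root.
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)
roots
# [1] "ZN" "ZN" "CL"
```

Because the root is lazy and the suffix is anchored to the end, a symbol with no expiration code ("ZN") falls through to the empty option and is returned whole.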