R - Manipulate string based on pattern - regex

This is the name of a file that I have on R:
> lst.files[1]
[1] "clt_Amon_CanESM2_rcp45_185001-230012.nc"
What I need to do is capture just the part up to and including the 4th underscore, so it would be something like this:
clt_Amon_CanESM2_rcp45_
How can I get this in R?

If you know you always have exactly four underscores, then you could do something like this:
regmatches(lst.files[1], regexec(".*_.*_.*_.*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
If potentially not always four, but no underscores in the second part, you could do something like this:
regmatches(lst.files[1], regexec(".*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
This doesn't require any extra package, just base R.
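One caveat with the first pattern: .* is greedy, so if a name has more than four underscores the match runs to the last one. A minimal sketch of a stricter variant, using a hypothetical file name with an extra underscore (not from the question):

```r
# Hypothetical file name with a fifth underscore in the date part
f <- "clt_Amon_CanESM2_rcp45_185001_230012.nc"

# Greedy .* overshoots: the match extends to the last underscore
greedy <- regmatches(f, regexpr(".*_.*_.*_.*_", f))

# [^_]*_ cannot cross an underscore, so {4} stops at the fourth one
exact <- regmatches(f, regexpr("^([^_]*_){4}", f))

print(greedy)  # "clt_Amon_CanESM2_rcp45_185001_"
print(exact)   # "clt_Amon_CanESM2_rcp45_"
```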

Using the qdap package, you can do the following.
x <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
library(qdap)
beg2char(x, "_", 4, include = TRUE)
# [1] "clt_Amon_CanESM2_rcp45_"

We can also capture the repeating pattern as a group using sub. Starting from the beginning (^) of the string, we match one or more characters that are not an underscore ([^_]+) followed by an underscore (\\_), repeat that 4 times ({4}), and capture the whole run as a group by wrapping it in parentheses. The rest of the string is matched by zero or more characters (.*). We replace the match with the capture group (\\1) to get the expected output.
sub('^(([^_]+\\_){4}).*', '\\1', str1)
#[1] "clt_Amon_CanESM2_rcp45_"
data
str1 <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
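Because sub is vectorised, the same pattern can be applied to a whole vector of file names. Note that sub returns its input unchanged when the pattern does not match, so a name with fewer than four underscores would pass through silently. A sketch of one way to flag those cases (the second file name and the NA convention are assumptions for illustration):

```r
# First name is from the question; the second is a hypothetical short name
files <- c("clt_Amon_CanESM2_rcp45_185001-230012.nc", "pr_Amon_short.nc")

# Keep everything up to and including the 4th underscore
prefix <- sub("^(([^_]+_){4}).*", "\\1", files)

# sub() leaves non-matching strings untouched, so mark them explicitly
has_prefix <- grepl("^([^_]+_){4}", files)
prefix[!has_prefix] <- NA_character_

print(prefix)  # "clt_Amon_CanESM2_rcp45_" NA
```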

Related

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between the substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...), and so on) would do it. I cannot get it to work, though!
(Sorry if this is similar to other SO questions, e.g. here or here.)
This is a bit different from what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) characters starting with _sample and the characters afterwards .*
From those you want to keep the second "set" \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class-, and the string _sample followed by any number of characters, and replace both with the empty string "".
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we match the characters that sit between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that are not a _ (based on the example), preceded by class- and followed by _sample.
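The same lookaround pattern also works without stringr. A sketch in base R, assuming perl = TRUE so that the lookbehind and lookahead are supported:

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')
pattern <- "(?<=class-)[^_]+(?=_sample)"

# First match per string (strings with no match are dropped by regmatches)
first <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# All matches per string, returned as a list with one element per input
x1 <- paste(x, x)
all_matches <- regmatches(x1, gregexpr(pattern, x1, perl = TRUE))

print(first)        # "X1(z)20" "Z3(z)29"
print(all_matches)
```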
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Removing character from regexp class in R

Edit: Changing the whole question to make it clearer.
Can I remove a single character from one of the regular expression classes in R (such as [:alnum:])?
For example, match all punctuation ([:punct:]) except the _ character.
I am trying to replace underscores used in markdown for italicizing, but the italicized substring may contain a single underscore which I would want to keep.
Edit: As another example, I want to capture everything between pairs of underscores (note one pair contains a single underscore that I want to keep between 1 and 10)
This is _a random_ string with _underscores: rate 1_10 please_
You won't believe it, but lazy matching achieved with a mere ? works as expected here:
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
str <- 'This is a _random string with_ a scale of 1_10.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
Result:
[1] "This is a string with some random underscores in it."
[1] "This is a random string with a scale of 1_10."
However, if you want to modify the [[:print:]] class, mind it is basically a [\x20-\x7E] range. The underscore being \x5F, you can easily exclude it from the range, and use [\x20-\x5E\x60-\x7E].
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([\x20-\x5E\x60-\x7E]+)_+", "\\1", str)
Returns
[1] "This is a string with some random underscores in it."
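For the more general question of dropping one character from a POSIX class such as [:punct:], two approaches that are commonly used are a negative lookahead in front of the class (needs perl = TRUE) or a negated bracket expression that lists what to keep. A sketch with a made-up test string:

```r
# Made-up example string; the goal is to strip punctuation but keep "_"
s <- "Keep snake_case; drop: commas, dots. And (parens)!"

# (?!_) rejects the underscore before [[:punct:]] gets to consume it
via_lookahead <- gsub("(?!_)[[:punct:]]", "", s, perl = TRUE)

# Alternative: remove anything that is not alphanumeric, whitespace, or "_"
via_negation <- gsub("[^[:alnum:][:space:]_]", "", s)

print(via_lookahead)  # "Keep snake_case drop commas dots And parens"
```

For plain ASCII input like this the two calls give the same result; with non-ASCII text the classes are locale dependent, so the outputs may differ.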
Similar to @stribizhev's answer:
x <- "This is _a random_ string with _underscores: rate 1_10 please_"
gsub("\\b_(.*?)_\\b", "\\1", x, perl=T)
produces:
[1] "This is a random string with underscores: rate 1_10 please"
Here we use word boundaries and lazy matching. Note that the default regexp engine has issues with lazy repetition and capture groups, so you may want to use perl=T.
gsub('(?<=\\D)\\_(?=\\D|$)','',str,perl=T)

Extract part of string between two different patterns

I try to use stringr package to extract part of a string, which is between two particular patterns.
For example, I have:
my.string <- "nanaqwertybaba"
left.border <- "nana"
right.border <- "baba"
and by the use of str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression) I would like to receive:
"qwerty"
Solutions from Google did not work.
In base R you can use gsub. The parentheses in the pattern create numbered capturing groups. Here we select the second group in the replacement, i.e. the group between the borders. The . matches any character. The * means zero or more of the preceding element.
gsub(pattern = "(.*nana)(.*)(baba.*)",
replacement = "\\2",
x = "xxxnanaRisnicebabayyy")
# "Risnice"
I do not know whether and how this is possible with functions provided by stringr but you can also use base regexpr and substring:
pattern <- paste0("(?<=", left.border, ")[a-z]+(?=", right.border, ")")
# "(?<=nana)[a-z]+(?=baba)"
rx <- regexpr(pattern, text=my.string, perl=TRUE)
# [1] 5
# attr(,"match.length")
# [1] 6
substring(my.string, rx, rx+attr(rx, "match.length")-1)
# [1] "qwerty"
I would use str_match from stringr: "str_match extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group."
str_match(my.string, paste(left.border, '(.+)', right.border, sep=''))[,2]
The code above creates a regular expression with paste concatenating the capture group (.+) that captures 1 or more characters, with left and right borders (no spaces between strings).
A single match is assumed. So, [,2] selects the second column from the matrix returned by str_match.
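If the borders can contain regex metacharacters (parentheses, brackets, dots), pasting them straight into the pattern will misbehave. A base R sketch that quotes the borders with \Q...\E under perl = TRUE; the helper name between and the second test string are hypothetical, and the quoting would break if a border itself contained \E:

```r
# Hypothetical helper: text between two literal borders, NA when absent
between <- function(s, left, right) {
  # \Q...\E makes PCRE treat the borders as literal text
  pattern <- paste0("\\Q", left, "\\E(.*?)\\Q", right, "\\E")
  m <- regmatches(s, regexec(pattern, s, perl = TRUE))
  # Element 1 is the full match, element 2 is the capture group
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

r1 <- between("nanaqwertybaba", "nana", "baba")
r2 <- between("a(1)xyz[2]", "(1)", "[2]")
r3 <- between("no borders here", "nana", "baba")
print(c(r1, r2, r3))  # "qwerty" "xyz" NA
```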
You can use the package unglue:
library(unglue)
my.string <- "nanaqwertybaba"
unglue_vec(my.string, "nana{res}baba")
#> [1] "qwerty"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just the root symbols
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
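In R, that pattern could be applied with sub, keeping only the first group. A sketch, assuming perl = TRUE for the lazy quantifier and adding a leading ^ anchor; "CLZ14" is a hypothetical symbol with a two-digit year:

```r
symbols <- c("ZNH3", "ZN", "CLZ14")

# Lazy root, optional expiration code, anchored at both ends
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols,
             perl = TRUE)

print(roots)  # "ZN" "ZN" "CL"
```

Because the suffix is optional, a symbol without an expiration code comes back unchanged rather than failing to match.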

I would like to use gsub in R to match all items which are not alphanumeric

I am searching raw twitter snippets using R but keep getting issues where there are non standard Alphanumeric chars such as the following "🏄".
I would like to take out all non [abcdefghijklmnopqrstuvwxyz0123456789] characters using gsub.
Can you use gsub to specify a replace for those items NOT in [abcdefghijklmnopqrstuvwxyz0123456789]?
You could simply negate your pattern with [^ ...]:
x <- "abcde🏄fgh"
gsub("[^A-Za-z0-9]", "", x)
# [1] "abcdefgh"
Please note that the class [:alnum:] matches all your given special characters. That's why gsub("[^[:alnum:]]", "", x) doesn't work.
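For whole tweets you may also want to keep spaces, otherwise the words run together. A sketch with a made-up tweet (the emoji is written as a \U escape so the script does not depend on the file encoding):

```r
# Made-up tweet containing an emoji, an apostrophe, a hashtag and a "!"
tweet <- "Surf's up \U0001F3C4 at #Bondi 2024!"

# Keep ASCII letters, digits and spaces; drop everything else
kept <- gsub("[^A-Za-z0-9 ]", "", tweet)

# Removing a character between two spaces leaves a double space behind
clean <- gsub(" +", " ", kept)

print(clean)  # "Surfs up at Bondi 2024"
```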