regular expression -- greedy matching? - regex

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)

Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
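To see why the attempt in the question fails and the non-greedy version works, the two can be compared side by side in base R. This is only a minimal sketch: the test vector and the use of perl = TRUE are choices made here for illustration, not part of the answer above.

```r
x <- c("abc", "ab", "abd")

# Greedy [a-z]+ swallows the trailing "c"; the empty alternative then matches.
greedy <- sub("^([a-z]+)(?:c|)$", "\\1", x, perl = TRUE)

# Lazy [a-z]+? takes as little as possible, so "c?$" gets first claim on a final "c".
lazy <- sub("^([a-z]+?)c?$", "\\1", x, perl = TRUE)

print(greedy)  # "abc" "ab" "abd"
print(lazy)    # "ab" "ab" "abd"
```

The end-of-string anchor matters: without it, the lazy quantifier would stop after a single character.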

Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"

A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just the root symbols (capture group 1)
# [1] "ZN" "CL"

Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first capture group of the match (however that may be referenced in R) :-)
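In R, one way to pull out that first capture group is sub() with a backreference. The following is a rough sketch rather than a definitive implementation: the leading ^ anchor, perl = TRUE, and the sample symbols are assumptions added here.

```r
symbols <- c("ZNH3", "ZN", "CLZ4", "ZZZ33", "ABF")

# Lazy root, optional expiration code, anchored at both ends.
pattern <- "^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$"

# Replace the whole symbol with capture group 1 (the root).
roots <- sub(pattern, "\\1", symbols, perl = TRUE)
print(roots)  # "ZN" "ZN" "CL" "ZZ" "ABF"
```

Because the suffix group is optional, a bare root such as "ZN" or "ABF" comes back unchanged.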

Related

How to find pattern next to a given string using regex in R

I have a string formatted for example like "segmentation_level1_id_10" and would like to extract the level number associated to it (i.e. the number directly after the word level).
I have a solution that does this in two steps: it first finds the pattern level\\d+ and then replaces "level" with an empty string. I would like to know if it's possible to do this in one step with just str_extract.
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is by using a stringr library str_extract function with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can also be done in base R with gsub, using the same capturing mechanism as str_match plus a backreference that restores the captured text in the replacement:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"
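The lookbehind approach also works without stringr. A small sketch using base R's regexpr and regmatches; perl = TRUE is needed because lookbehind is a PCRE feature:

```r
s <- "segmentation_level1_id_10"

# Locate the digits that directly follow "level" without consuming "level" itself.
m <- regexpr("(?<=level)\\d+", s, perl = TRUE)

# regmatches() returns the matched substring.
level <- regmatches(s, m)
print(level)  # "1"
```

Note that for a vector input regmatches() silently drops elements with no match, so the result can be shorter than the input.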

R - Manipulate string based on pattern

This is the name of a file that I have on R:
> lst.files[1]
[1] "clt_Amon_CanESM2_rcp45_185001-230012.nc"
What I need to do is capture just the part until the 4th underscore (including), so it would be something like this:
clt_Amon_CanESM2_rcp45_
How can I get this in R?
If you know you always have four underscores (with more than four, the greedy .* would match through the last one), then you could do something like this:
regmatches(lst.files[1], regexec(".*_.*_.*_.*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
If potentially not always four, but no underscores in the second part, you could do something like this:
regmatches(lst.files[1], regexec(".*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
This doesn't require any extra package, just base R.
Using the qdap package, you can do the following.
x <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
library(qdap)
beg2char(x, "_", 4, include = TRUE)
# [1] "clt_Amon_CanESM2_rcp45_"
We can also capture the repeating patterns as a group using sub. We match one or more characters from the beginning (^) of the string that are not an underscore ([^_]+), followed by an underscore (\\_), with that unit repeated 4 times ({4}). We capture this as a group by wrapping it in parentheses, followed by zero or more characters (.*). We replace the whole match with the capture group (\\1) to get the expected output.
sub('^(([^_]+\\_){4}).*', '\\1', str1)
#[1] "clt_Amon_CanESM2_rcp45_"
data
str1 <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
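The same idea can be generalised to the n-th underscore. This is a sketch only: upto_nth_underscore is a hypothetical helper name introduced here, and it uses [^_]* rather than [^_]+ so that consecutive underscores are tolerated.

```r
# Hypothetical helper: keep everything up to and including the n-th underscore.
upto_nth_underscore <- function(x, n) {
  # "(?:[^_]*_){n}" = n repetitions of "non-underscores followed by an underscore"
  pattern <- sprintf("^((?:[^_]*_){%d}).*$", n)
  sub(pattern, "\\1", x, perl = TRUE)
}

str1 <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
four <- upto_nth_underscore(str1, 4)
two <- upto_nth_underscore(str1, 2)
print(four)  # "clt_Amon_CanESM2_rcp45_"
print(two)   # "clt_Amon_"
```

Strings with fewer than n underscores do not match the pattern, so sub() returns them unchanged.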

Regex for known start and end characters in Perl and R-lang

I'm looking to match mentions of foo in a username. I need to be able to match text strings that start with '#' and contain the word 'foo' at any location within that username, ending by either a space or grammar.
I need to be able to match:
example1: #anycharacterhere_foo, anything else here
example2: #foo_anymorecharacters here
I'm looking to use the stringr library like so:
str_extract_all(x, perl("?<=#"))
What I don't understand is the match all function
Assuming that your usernames won't have special characters:
x <- "#anycharacterhere_foo, anything else here"
username <- str_extract_all(x, "\\w*(foo)\\w*")
which yields a list containing your username. This will pick up additional foos in the remaining string, but you could fix that by using str_extract rather than str_extract_all. I am not certain whether you really need every foo from the string or simply the username, which in your example data is at the beginning. You could also restrict the str_extract_all match by including the #, thus:
username <- str_extract_all(x, "\\#\\w*(foo)\\w*")
You need to look for "zero or more" word characters that precede or follow:
x <- '#anycharacterhere_foo #foo_anymorecharacters here anything else here'
str_extract_all(x, '#\\w*foo\\w*')[[1]]
# [1] "#anycharacterhere_foo" "#foo_anymorecharacters"
If you don't want to include the marker:
str_extract_all(x, '(?<=#)\\w*foo\\w*')[[1]]
# [1] "anycharacterhere_foo" "foo_anymorecharacters"
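For comparison, roughly the same extraction can be done in base R with gregexpr and regmatches. This is a sketch, assuming usernames consist only of word characters (\w) as above:

```r
x <- "#anycharacterhere_foo #foo_anymorecharacters here anything else here"

# All matches, including the leading "#".
with_marker <- regmatches(x, gregexpr("#\\w*foo\\w*", x, perl = TRUE))[[1]]

# Same matches, but the lookbehind keeps "#" out of the result.
without_marker <- regmatches(x, gregexpr("(?<=#)\\w*foo\\w*", x, perl = TRUE))[[1]]

print(with_marker)     # "#anycharacterhere_foo" "#foo_anymorecharacters"
print(without_marker)  # "anycharacterhere_foo" "foo_anymorecharacters"
```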
You could also use rm_tag from the qdapRegex package for this:
library(qdapRegex)
rm_tag(x, extract=TRUE)[[1]]
# [1] "#anycharacterhere_foo" "#foo_anymorecharacters"

Regular expression in R: gsub pattern

I'm learning R's regular expression and I am having trouble understanding this
gsub example:
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
So far I think I get:
if x is alphanumeric it doesn't match, so nothing is modified
if x contains a . or | or ( or { or } or + or $ or ? it adds \\ in front of it
I can't explain:
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10\1')
[1] "10\001"
or
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10/1')
[1] "10/1"
I am also confused why the replacement "\\\\\\1" adds only two backslashes.
I'm supposed to figure out what this function does, and I think it's supposed to escape certain special characters?
The entire pattern is wrapped in parentheses which allows back-references. This part:
[.|()\\^{}+$*?]
... is a "character class" so it matches any one of the characters inside the square-brackets, and as you say it is changing the way the pattern syntax will interpret what would otherwise be meta-characters within the pattern definition.
The next part is a "pipe" character which is the regex-OR followed by an escaped open-square-bracket, another "OR"-pipe, and then an escaped close-square-bracket. Since both R and regex use backslashes as escapes, you need to double them to get an R+regex-escape in patterns ... but not in replacement strings. The close-square-bracket can only be entered in a character class if it is placed first in the string, so that entire pattern could have been more compactly formed with:
"[][.|()\\^{}+$*?]" # without the "|\\[|\\])"
In replacement strings the form "\\n" refers to whatever matched the n-th parenthesized portion of the pattern; in this case "\\1" is the last portion of the replacement. It is preceded by "\\\\", where the first "\\" forms an escape and the second "\\" forms the literal backslash. Now get ready for the even weirder part ... how many characters are in that result?
> nchar( gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\1", '10\1') )
[1] 3
And then of course none of the items in the match is equal to '\1'. Whoever wrote the tutorial you have before you (which I do not think is the gsub help page) has a weird sense of humor. Here are a couple of functions that may be useful if you need to create characters that would otherwise be intercepted by the system readline function:
> intToUtf8(1)
[1] "\001"
> ?intToUtf8
> 0x0
[1] 0
> intToUtf8(0)
[1] ""
> utf8ToInt("")
integer(0)
And do look at ?Quotes where a lot of useful information can be found (under what I would consider a rather unlikely title) about how R handles octal, hexadecimal and other numbers and special characters.
The first regex broken down is this
( # (1 start)
[.|()\^{}+$*?]
| \[
| \]
) # (1 end)
It captures whatever is in the 'class', or '[' or ']', and then replaces it with \\\1, which is an escaped backslash plus whatever was in capture 1.
So, basically, it just puts a backslash in front of each occurrence of one of those chars.
The regex could be better written as ([.|()^{}\[\]+$*?]) or within a
string as "([.|()^{}\\[\\]+$*?])"
Edit (promoting a comment) -
The regex won't match the string 10\1, so there should be no replacement. There must be an interpolation (by the language) on the printout. It looks like it's converting it to octal \001. Since it can't show binary 1, it shows its octal equivalent.
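Both points can be checked directly. In this sketch, escape_regex is a hypothetical wrapper name introduced here around the gsub call from the question, and the sample strings are made up for illustration:

```r
# Hypothetical wrapper around the gsub call from the question.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Each metacharacter gets one literal backslash in front of it.
escaped <- escape_regex("a.b(c)[d]")
cat(escaped, "\n")  # a\.b\(c\)\[d\]

# The escaped string now matches only the literal text.
hit <- grepl(escaped, "a.b(c)[d]")   # TRUE
miss <- grepl(escaped, "aXb(c)[d]")  # FALSE

# '10\1' is three characters: "1", "0" and the control character with code 1.
codes <- utf8ToInt("10\1")                   # 49 48 1
unchanged <- escape_regex("10\1") == "10\1"  # TRUE: nothing in it matches
```

cat() prints the single backslashes as they are, whereas print() would show each of them doubled.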

R regex to remove all except letters, apostrophes and specified multi-character strings

Is there an R regex to remove all except letters, apostrophes and specified multi-character strings? The "specified multi-character strings" are arbitrary and of arbitrary length. Let's say "~~" and "&&" in this case (so a single "~" or "&" should be removed, but not "~~" or "&&").
Here I have:
gsub("[^ a-zA-Z']", "", "I like~~cake~too&&much&now.")
Which gives:
## [1] "I likecaketoomuchnow"
And...
gsub("[^ a-zA-Z'~&]", "", "I like~~cake~too&&much&now.")
gives...
## "I like~~cake~too&&much&now"
How can I write an R regex to give:
"I like~~caketoo&&muchnow"
EDIT Corner cases from Casimir and BrodieG...
I'd expect this behavior:
x <- c("I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a")
## [1] "I like~~caketoo&&muchnow" "a~~b"
## [3] "a~~~~b" "a~~~~b"
## [5] "aa"
Neither of the current approaches gives this.
One way, match/capture the "specified multi-character strings" while replacing the others.
gsub("(~~|&&)|[^a-zA-Z' ]", "\\1", x)
# [1] "I like~~caketoo&&muchnow" "a~~b"
# [3] "a~~~~b" "a~~~~b"
# [5] "aa"
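The capture-and-restore trick extends to any list of protected tokens. This is a sketch only: the keep vector is an assumption introduced here, and tokens containing regex metacharacters would need escaping first ("~" and "&" do not).

```r
x <- c("I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a")

# Tokens to protect; assumed to contain no regex metacharacters.
keep <- c("~~", "&&")

# Alternative 1 captures a protected token; alternative 2 matches any other unwanted character.
pattern <- paste0("(", paste(keep, collapse = "|"), ")|[^a-zA-Z' ]")

# "\\1" puts the token back when alternative 1 matched and is empty otherwise.
result <- gsub(pattern, "\\1", x)
print(result)
# "I like~~caketoo&&muchnow" "a~~b" "a~~~~b" "a~~~~b" "aa"
```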
(?<![&~])[^ a-zA-Z'](?![&~])
Try this. See the demo. Use it with the perl=TRUE option.
https://regex101.com/r/wU7sQ0/25
You can use this pattern:
gsub("[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z ']|\\z)", "", x, perl=TRUE)
The idea is to build an always true pattern that is the translation of this sentence:
substrings I want to keep are always followed by a character I want to remove or the end of the string
So, all you need to do is to describe the substring you want to keep:
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*
Note that, since this subpattern is optional (it matches the empty string) and greedy, the whole pattern will never fail, whatever the position in the string, so all matches are consecutive (no need to add a \G anchor) from the beginning to the end.
For the same reason there is no need to add possessive quantifiers or atomic groups to prevent catastrophic backtracking, because (?:[^A-Za-z ']|\\z) can't fail.
This pattern already replaces the string in few steps, but you can improve it further:
if you avoid the last match (which is useless, since it matches only characters you want to keep or the empty string before the end) with the backtracking control verb (*COMMIT).
It forces the regex engine to stop the search once the end of the string is reached:
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z ']|\\z(*COMMIT).)
if you make the pattern able to match several special characters in one match (except if they are ~ or &):
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z '][^A-Za-z '~&]*|\\z(*COMMIT).)