Creating a regular expression in R statistics


I am trying to create a regular expression in "R" to capture two groups of characters for me and I seem not to be able to figure out why it does not work.
Here is what I am trying to achieve ...
From this string:
"air.BattleofZombies 0.0008 0.0006 -0.0027"
I would like to return:
"air.BattleofZombies=0.0008 0.0006 -0.0027"
Instead, here is what I get:
"air.BattleofZombie= 0.0008 0.0006 -0.0027="
My regular expression query is:
gsub("([^\\s]*)[\\s]*([-?\\d*\\.?\\d*\\s*]*)","\\1=\\2", "air.BattleofZombies 0.0008 0.0006 -0.0027")
Any help is welcome.

I find character classes easier to use. (I think @Simon is wrong about what "\s" will match.)
> tst <- "air.BattleofZombies 0.0008 0.0006 -0.0027"
> sub("[ ]{2,}", "=", tst)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
See the ?regex page and notice this sentence: "Symbols \d, \s, \D and \S denote the digit and space classes and their negations." Nonetheless, I have found that a literal space, " ", often works even without the character-class mechanism. (I'm unable to comment on a deleted post, but I see now that this is the same answer posted earlier by @KaraWoo, and the only reason it didn't deliver the desired result was that gsub was used.)

Another short solution:
vec <- "air.BattleofZombies 0.0008 0.0006 -0.0027"
sub("\\s+", "=", vec)
# [1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
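The choice between sub() and gsub() matters for this input. A minimal sketch (the variable names first_only and every_run are illustrative, not from the answers) showing that sub() stops after the first run of whitespace while gsub() replaces every run:

```r
vec <- "air.BattleofZombies 0.0008 0.0006 -0.0027"

# sub() replaces only the first run of whitespace
first_only <- sub("\\s+", "=", vec)

# gsub() replaces every run, so the numbers end up joined by "=" as well
every_run <- gsub("\\s+", "=", vec)

print(first_only)  # "air.BattleofZombies=0.0008 0.0006 -0.0027"
print(every_run)   # "air.BattleofZombies=0.0008=0.0006=-0.0027"
```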

Just change the leading ([^\\s]*) to ([^\\s]+), because the regex you used can also match the empty string, and remove all the *'s inside the character class, because * inside a character class loses its special meaning and matches only a literal *. So turn [\\d*\\s*\\.] into [\\d\\s.]
> gsub("([^\\s]+)\\s*([-\\d.\\d\\s]*)", "\\1=\\2", x, perl=T)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
OR
> gsub("(\\S+)\\s*((-?\\d+(?:\\.\\d+)?)(?:\\s+(?3))*)", "\\1=\\2", x, perl=T)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
(?3) recurses the pattern inside the third capturing group. An easier-to-read form of this regex is given below.
OR
> gsub("(\\S+)\\s+(-?\\d+(?:\\.\\d+)?(?:\\s+-?\\d+(?:\\.\\d+)?)*)", "\\1=\\2", x, perl=T)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"

There are a couple of problems to solve, I think. First, \\s in a character class (i.e. inside []) matches an s rather than a space unless one uses perl=T (so I've replaced it with just a space). Second, gsub() replaces multiple times so I've replaced it with sub(). Also, the character class in the second set of parentheses would be better as parentheses instead. The following regexp solves the problem:
sub("([^ ]*) +((-?\\d*\\.?\\d* *)*)","\\1=\\2", "air.BattleofZombies 0.0008 0.0006 -0.0027",1)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
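The point about \\s inside square brackets can be checked directly. This is a small sketch, assuming R's default (TRE) engine versus perl = TRUE. The variable names are illustrative, and [[:space:]] is offered as one portable alternative rather than as the only fix:

```r
x <- "air.BattleofZombies 0.0008 0.0006 -0.0027"

# Default engine: inside [], the backslash is literal, so [\\s] means "\" or "s"
tre_matches_space <- grepl("[\\s]", " ")
tre_matches_s     <- grepl("[\\s]", "s")

# PCRE: \s keeps its whitespace meaning inside a character class
pcre_matches_space <- grepl("[\\s]", " ", perl = TRUE)

# The POSIX class [[:space:]] works in a bracket expression under both engines
fixed <- sub("[[:space:]]+", "=", x)

print(c(tre_matches_space, tre_matches_s, pcre_matches_space))  # FALSE TRUE TRUE
print(fixed)
```

This also explains the "air.BattleofZombie=" in the question's output: ([^\\s]*) stopped at the final "s" of the name.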

Related

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between the substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...), and so on) would do it. I cannot get it to work though!
Sorry if this is similar to other SO questions.

This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) characters starting with _sample and any characters afterwards .*
From those you want to keep the second "set" \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class- and the string _sample followed by any number of characters, and substitute them with the empty string ""

We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that are between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that are not a _ (based on the example), preceded by class- and succeeded by _sample.
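The same lookarounds can also be used without stringr. A minimal base R sketch, assuming perl = TRUE (lookarounds are a PCRE feature) and using a lazy .*? instead of [^_]+, which is a small variation on the answer above:

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# regexpr() finds the first match per string; regmatches() extracts it
between <- regmatches(x, regexpr("(?<=class-).*?(?=_sample)", x, perl = TRUE))

print(between)  # "X1(z)20" "Z3(z)29"
```

Note that regmatches() with regexpr() silently drops elements that do not match, so the result can be shorter than x.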

gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Regular expression in R: gsub pattern

I'm learning R's regular expression and I am having trouble understanding this
gsub example:
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
So far I think I get:
if x is alphanumeric it doesn't match, so nothing is modified
if x contains a . or | or ( or { or } or + or $ or ? it adds \\ in front of it
I can't explain:
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10\1')
[1] "10\001"
or
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10/1')
[1] "10/1"
I am also confused why the replacement "\\\\\\1" add only two brackets.
I'm supposed to figure out what this function does, and I think it's supposed to escape certain special characters?

The entire pattern is wrapped in parentheses which allows back-references. This part:
[.|()\\^{}+$*?]
... is a "character class" so it matches any one of the characters inside the square-brackets, and as you say it is changing the way the pattern syntax will interpret what would otherwise be meta-characters within the pattern definition.
The next part is a "pipe" character which is the regex-OR followed by an escaped open-square-bracket, another "OR"-pipe, and then an escaped close-square-bracket. Since both R and regex use backslashes as escapes, you need to double them to get an R+regex-escape in patterns ... but not in replacement strings. The close-square-bracket can only be entered in a character class if it is placed first in the string, so that entire pattern could have been more compactly formed with:
"[][.|()\\^{}+$*?]" # without the "|\\[|\\])"
In replacement strings the form "\\n" refers to whatever matched the n-th parenthetical portion of the 'pattern'; in this case '\1' is the second portion of the replacement. The first position is "\" which forms an escape and the second "\" forms the backslash. Now get ready for the even weirder part ... how many characters are in that result?
> nchar( gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\1", '10\1') )
[1] 3
And then of course none of the items in the match is equal to '\1'. Somebody writing whatever tutorial you have before you (which I do not think is the gsub help page) has a weird sense of humor. Here are a couple of functions that may be useful if you need to create characters that would otherwise be intercepted by the system readline function:
> intToUtf8(1)
[1] "\001"
> ?intToUtf8
> 0x0
[1] 0
> intToUtf8(0)
[1] ""
> utf8ToInt("")
integer(0)
And do look at ?Quotes where a lot of useful information can be found (under what I would consider a rather unlikely title) about how R handles octal, hexadecimal and other numbers and special characters.
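To make the '10\1' case concrete, here is a small sketch using only the pattern from the question. The variable names are illustrative. It shows that '10\1' is a three-character string whose last character is the control character with code 1, that the pattern leaves it untouched, and what the replacement does to a string that does contain a metacharacter:

```r
pattern <- "([.|()\\^{}+$*?]|\\[|\\])"

ctrl <- '10\1'               # \1 is an octal escape, same string as "10\001"
n_chars <- nchar(ctrl)       # 3
codes <- utf8ToInt(ctrl)     # 49 48 1

# No metacharacter in ctrl, so gsub() returns it unchanged
unchanged <- gsub(pattern, "\\\\\\1", ctrl)

# With a metacharacter present, one backslash is put in front of it
escaped <- gsub(pattern, "\\\\\\1", "a.b")
cat(escaped, "\n")           # prints a\.b
print(nchar(escaped))        # 4
```

print() shows the result as "a\\.b" because it re-escapes the backslash for display, while cat() shows the actual four characters.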

The first regex broken down is this
( # (1 start)
[.|()\^{}+$*?]
| \[
| \]
) # (1 end)
It captures whatever is in the 'class', or '[' or ']', then it looks like it replaces it with \\\1 which is an escape plus whatever was in capture 1.
So, basically it just escapes a single occurrence of one of those chars.
The regex could be better written as ([.|()^{}\[\]+$*?]) or within a
string as "([.|()^{}\\[\\]+$*?])"
Edit (promoting a comment) -
The regex won't match string 10\1 so there should be no replacement. There must be an interpolation (language) on the print out. Looks like it's converting it to octal \001. Since it can't show binary 1 it shows its octal equivalent.
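As a rough illustration of what such an escaping step is for, the single-class form above can be wrapped in a helper. This is a sketch under assumptions: escape_meta is a hypothetical name, and perl = TRUE is used because backslash escapes inside a character class are a PCRE convention:

```r
# Hypothetical helper: put a backslash in front of each regex metacharacter
escape_meta <- function(s) {
  gsub("([.|()^{}\\[\\]+$*?])", "\\\\\\1", s, perl = TRUE)
}

literal <- "price[1] = $5.00"
escaped <- escape_meta(literal)
cat(escaped, "\n")  # prints price\[1\] = \$5\.00

# The escaped string now matches itself literally
self_match <- grepl(escaped, literal, perl = TRUE)

# Without escaping, "." would match any character; with escaping it does not
dot_is_literal <- !grepl(escape_meta("a.b"), "axb", perl = TRUE)
```

When the goal is only to match a string literally, passing fixed = TRUE to grepl() or gsub() avoids the need for escaping altogether.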

Retrieve digits after specific string in R

I have a bunch of strings that contain the word "radius" followed by one or two digits. They also contain a lot of other letters, digits, and underscores. For example, one is "inflow100_radius6_distance12". I want a regex that will just return the one or two digits following "radius." If R recognized \K, then I would just use this:
radius\K[0-9]{1,2}
and be done. But R doesn't allow \K, so I ended up with this instead (which selects radius and the following numbers, and then cuts off "radius"):
result <- regmatches(input_string, gregexpr("radius[0-9]{1,2}", input_string))
result <- unlist(substr(result, 7, 8))
I'm pretty new to regex, so I'm sure there's a better way. Any ideas?

\K is recognized. You can solve the problem by turning on the perl = TRUE parameter.
result <- regmatches(x, gregexpr('radius\\K\\d+', x, perl=T))
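A short worked example of the \K approach, with a made-up second input string for illustration. It uses regexpr() rather than gregexpr() on the assumption that each string contains "radius" at most once:

```r
x <- c("inflow100_radius6_distance12", "inflow50_radius12_distance3")

# \K discards everything matched so far, so only the digits are returned
radius_digits <- regmatches(x, regexpr("radius\\K[0-9]{1,2}", x, perl = TRUE))

print(radius_digits)              # "6" "12"
print(as.integer(radius_digits))  # 6 12
```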

1) Match the entire string replacing it with the digits after radius:
sub(".*radius(\\d+).*", "\\1", "inflow100_radius6_distance12")
## [1] "6"
Written without R's doubled backslashes, the regular expression is:
.*radius(\d+).*
2) This also works, involves a simpler regular expression and converts it to numeric at the same time:
library(gsubfn)
strapply("inflow100_radius6_distance12", "radius(\\d+)", as.numeric, simplify = TRUE)
## [1] 6
Written without R's doubled backslashes, the regular expression is:
radius(\d+)
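One caveat with the sub() approach is that a string without "radius" is returned unchanged. A base R sketch that returns NA instead, using regexec() capture groups. The second input and the variable names are made up for illustration:

```r
x <- c("inflow100_radius6_distance12", "no_match_here")

# sub() hands back the whole input when the pattern does not match
sub_result <- sub(".*radius(\\d+).*", "\\1", x)

# regexec() keeps capture groups; element 2 is the first group, if any
parts <- regmatches(x, regexec("radius([0-9]{1,2})", x))
radius <- vapply(parts,
                 function(p) if (length(p) >= 2) as.numeric(p[2]) else NA_real_,
                 numeric(1))

print(sub_result)  # "6" "no_match_here"
print(radius)      # 6 NA
```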

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!

Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Here is the regular expression in pat as the regex engine sees it (note that we need to double the backslash when it is put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Note that .*? means match the shortest string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.
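The difference between .* and .*? is easy to see on a tiny input. A minimal sketch with a made-up string (perl = TRUE is an assumption here, chosen so the lazy quantifier follows PCRE rules):

```r
s <- "[a][b]"

# Greedy: .* runs to the last "]", so the group captures "a][b"
greedy <- sub("\\[(.*)\\]", "\\1", s, perl = TRUE)

# Lazy: .*? stops at the first "]", so only "[a]" is matched and replaced
lazy <- sub("\\[(.*?)\\]", "\\1", s, perl = TRUE)

print(greedy)  # "a][b"
print(lazy)    # "a[b]"
```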

How about something like this. Here I'll use the regcapturedmatches helper function (not part of base R, so it has to be defined or sourced first) to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.

Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"
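If a character vector of words is wanted rather than one comma-separated string, the matches can be split instead of pasted. A sketch on a shortened, made-up version of the input (the real string is longer), using the same lookbehind pattern:

```r
# Shortened stand-in for the original string, for illustration only
m <- paste0("mod([tag(noun),tokens([new,york,state])]),",
            "head([tag(noun),tokens([department])]),",
            "det([tag(det),tokens([the])])")

hits <- regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl = TRUE))

# Split each match on commas and flatten into one vector
words <- unlist(strsplit(unlist(hits), ",", fixed = TRUE))

print(words)  # "new" "york" "state" "department"
```

The word "the" is left out because its tokens(...) follows tag(det), not tag(noun).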

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)

Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
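To see why the greedy version in the question keeps the trailing c, compare it with the non-greedy one on the question's own inputs. This is a sketch using sub() with perl = TRUE rather than strapplyc:

```r
x <- c("abc", "ab")

# Greedy: [a-z]+ swallows the "c" too, and the optional c? matches nothing
greedy <- sub("^([a-z]+)c?$", "\\1", x, perl = TRUE)

# Lazy: [a-z]+? takes as little as possible, leaving the "c" for c?
lazy <- sub("^([a-z]+?)c?$", "\\1", x, perl = TRUE)

print(greedy)  # "abc" "ab"
print(lazy)    # "ab" "ab"
```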

Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"

A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"

Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
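In R, the first captured group can be referenced as "\\1" in the replacement argument of sub(). A minimal sketch of the regex above, assuming perl = TRUE and an added leading ^ anchor (the anchor and the sample symbols are not from the answer):

```r
symbols <- c("ZNH3", "ZN", "CLZ14", "ABF")

# Replace the whole symbol with group 1, i.e. the root without the expiration code
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)

print(roots)  # "ZN" "ZN" "CL" "ABF"
```

Because the first group is lazy, a root that itself ends in something shaped like an expiration code would lose it, so this is only appropriate when that cannot happen in the data.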
