Extract subset of a string following specific text in R - regex

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!

Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Note that .*? means match the shortest string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.
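To make the difference between lazy and greedy matching concrete, here is a small base R sketch. The string s is a made-up example, not the string from the question:

```r
# Hypothetical input with two bracketed groups
s <- "tokens([new,york]) tokens([state])"

# Lazy: stops at the first closing bracket that lets the pattern match
lazy <- regmatches(s, regexpr("\\[(.*?)\\]", s, perl = TRUE))

# Greedy: runs on to the last closing bracket in the string
greedy <- regmatches(s, regexpr("\\[(.*)\\]", s, perl = TRUE))

lazy
# [1] "[new,york]"
greedy
# [1] "[new,york]) tokens([state]"
```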

How about something like this? Here I'll use the regcapturedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
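Note that regcapturedmatches is not part of base R, so it has to be defined or sourced before the code above will run. If you would rather stay with base functions only, one possible sketch is to match the whole escaped pattern first and then strip everything around the bracket contents. The shortened m below is a hypothetical stand-in with the same structure as the string in the question:

```r
# Shortened, hypothetical input with the same structure as the original string
m <- "det([tag(det),tokens([the])]),mod([tag(noun),tokens([new,york,state])]),head([tag(noun),tokens([department])])"

# Step 1: pull out every "tag(noun),tokens([...]" substring
full <- unlist(regmatches(m, gregexpr("tag\\(noun\\),tokens\\(\\[[^]]+\\]", m)))

# Step 2: keep only the text between the last "[" and the closing "]"
inner <- sub(".*\\[([^]]+)\\]$", "\\1", full)

# Step 3: split on commas to get one word per element
nouns <- unlist(strsplit(inner, ",", fixed = TRUE))
nouns
# [1] "new"        "york"       "state"      "department"
```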

Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"
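The question asked for a vector of words, and the string as typed in the question contains line breaks, one of which falls between tag(noun), and tokens. A possible variation, assuming that is the case for your input, removes the newlines first and then splits the matches on commas. The string m here is a shortened, made-up version of the original:

```r
# Shortened, hypothetical input with a line break after "tag(noun),"
m <- "mod([tag(noun),tokens([new,york,state])]),head([tag(noun),\ntokens([department])])"

# Remove the line breaks so the lookbehind sees "tag(noun),tokens([" as one run
flat <- gsub("\n", "", m, fixed = TRUE)

# Same lookbehind pattern as above
pat <- "(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*"
matches <- unlist(regmatches(flat, gregexpr(pat, flat, perl = TRUE)))

# Split each match on commas to get a character vector of words
tokens <- unlist(strsplit(matches, ",", fixed = TRUE))
tokens
# [1] "new"        "york"       "state"      "department"
```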

Related

Extracting variable name from bspline-coefficient in R with regex

In a linear model, I have some splines, using the bs function from the splines package (like lm(y ~ bs(x, 3))).
In the model summary and model data frame (colnames(fit$model)) these terms appear as, e.g., bs(efc17age, 3).
Now I would like to extract the variable name using regular expressions. However, I just don't understand regex syntax.
This is how far I came:
x <- "bs(e17age, 3)1"
sub("bs\\((*?)", "", x)
> [1] "e17age, 3)1"
I just want to have "e17age"... It must be so easy, if you understand regex...
You can use the following snippet:
x <- "bs(e17age, 3)1"
sub("^bs\\(([^,]*).*", "\\1", x)
Regex ^bs\\(([^,]*).* matches bs( at the start of the string, then captures any number of characters other than , with ([^,]*), and then matches any character up to the end. With the replacement string \\1, we get our captured text back.
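Since sub is vectorised, the same call can be applied to a whole vector of term names. The vector below is invented for illustration; note that an element that does not match the pattern is returned unchanged:

```r
# Hypothetical term names, including one that is not a bs() term
terms <- c("bs(efc17age, 3)1", "bs(e17age, 3)2", "c12hour")

# Capture everything between "bs(" and the first comma
vars <- sub("^bs\\(([^,]*).*", "\\1", terms)
vars
# [1] "efc17age" "e17age"   "c12hour"
```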

Removing character from regexp class in R

Edit: Changing the whole question to make it clearer.
Can I remove a single character from one of the regular expression classes in R (such as [:alnum:])?
For example, match all punctuation ([:punct:]) except the _ character.
I am trying to replace underscores used in markdown for italicizing, but the italicized substring may contain a single underscore which I would want to keep.
Edit: As another example, I want to capture everything between pairs of underscores (note one pair contains a single underscore that I want to keep between 1 and 10)
This is _a random_ string with _underscores: rate 1_10 please_
You won't believe it, but lazy matching achieved with a mere ? works as expected here:
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
str <- 'This is a _random string with_ a scale of 1_10.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
Result:
[1] "This is a string with some random underscores in it."
[1] "This is a random string with a scale of 1_10."
However, if you want to modify the [[:print:]] class, note that it is basically the range [\x20-\x7E]. The underscore is \x5F, so you can easily exclude it from the range by using [\x20-\x5E\x60-\x7E].
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([\x20-\x5E\x60-\x7E]+)_+", "\\1", str)
Returns
[1] "This is a string with some random underscores in it."
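For the original wording of the question (match all punctuation except the underscore), another option with perl=TRUE is to put a negative lookahead in front of the class. This is only a sketch on an invented string, and it relies on the PCRE engine treating [[:punct:]] as ASCII punctuation:

```r
# Hypothetical input containing an underscore we want to keep
x <- "rate 1_10, please! (now)."

# (?!_) rejects the underscore before [[:punct:]] gets a chance to match it
cleaned <- gsub("(?!_)[[:punct:]]", "", x, perl = TRUE)
cleaned
# [1] "rate 1_10 please now"
```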
Similar to @stribizhev:
x <- "This is _a random_ string with _underscores: rate 1_10 please_"
gsub("\\b_(.*?)_\\b", "\\1", x, perl=T)
produces:
[1] "This is a random string with underscores: rate 1_10 please"
Here we use word boundaries and lazy matching. Note that the default regexp engine has issues with lazy repetition and capture groups, so you may want to use perl=T.
gsub('(?<=\\D)\\_(?=\\D|$)','',str,perl=T)

Retrieve digits after specific string in R

I have a bunch of strings that contain the word "radius" followed by one or two digits. They also contain a lot of other letters, digits, and underscores. For example, one is "inflow100_radius6_distance12". I want a regex that will just return the one or two digits following "radius." If R recognized \K, then I would just use this:
radius\K[0-9]{1,2}
and be done. But R doesn't allow \K, so I ended up with this instead (which selects radius and the following numbers, and then cuts off "radius"):
result <- regmatches(input_string, gregexpr("radius[0-9]{1,2}", input_string))
result <- substr(unlist(result), 7, 8)
I'm pretty new to regex, so I'm sure there's a better way. Any ideas?
\K is recognized. You can solve the problem by turning on the perl = TRUE parameter.
result <- regmatches(x, gregexpr('radius\\K\\d+', x, perl=T))
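As a self-contained illustration of the \K approach, here is a sketch on two example strings (the second one is made up). It uses regexpr rather than gregexpr because each string is assumed to contain "radius" only once:

```r
# Example inputs; the second string is hypothetical
x <- c("inflow100_radius6_distance12", "inflow50_radius12_distance3")

# \K discards "radius" from the reported match, leaving only the digits
digits <- regmatches(x, regexpr("radius\\K[0-9]{1,2}", x, perl = TRUE))
digits
# [1] "6"  "12"

# Convert to numbers if needed
as.numeric(digits)
# [1]  6 12
```

Be aware that regmatches with regexpr silently drops elements that do not match, so the result can be shorter than x.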
1) Match the entire string replacing it with the digits after radius:
sub(".*radius(\\d+).*", "\\1", "inflow100_radius6_distance12")
## [1] "6"
The regular expression can be visualized as follows:
.*radius(\d+).*
2) This also works, involves a simpler regular expression and converts it to numeric at the same time:
library(gsubfn)
strapply("inflow100_radius6_distance12", "radius(\\d+)", as.numeric, simplify = TRUE)
## [1] 6
Here is a visualization of the regular expression:
radius(\d+)

Extract part of string between two different patterns

I try to use stringr package to extract part of a string, which is between two particular patterns.
For example, I have:
my.string <- "nanaqwertybaba"
left.border <- "nana"
right.border <- "baba"
and by the use of str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression) I would like to receive:
"qwerty"
Solutions from Google did not work.
In base R you can use gsub. The parentheses in the pattern create numbered capturing groups. Here we select the second group in the replacement, i.e. the group between the borders. The . matches any character. The * means that there is zero or more of the preceding element.
gsub(pattern = "(.*nana)(.*)(baba.*)",
replacement = "\\2",
x = "xxxnanaRisnicebabayyy")
# "Risnice"
I do not know whether and how this is possible with functions provided by stringr but you can also use base regexpr and substring:
pattern <- paste0("(?<=", left.border, ")[a-z]+(?=", right.border, ")")
# "(?<=nana)[a-z]+(?=baba)"
rx <- regexpr(pattern, text=my.string, perl=TRUE)
# [1] 5
# attr(,"match.length")
# [1] 6
substring(my.string, rx, rx+attr(rx, "match.length")-1)
# [1] "qwerty"
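The substring arithmetic can also be left to regmatches, which takes the match data from regexpr directly. A minimal sketch, assuming the borders contain no regex metacharacters and using .*? instead of [a-z]+ so that any characters between the borders are accepted:

```r
my.string <- "nanaqwertybaba"
left.border <- "nana"
right.border <- "baba"

# Lookbehind and lookahead assert the borders without including them in the match
pattern <- paste0("(?<=", left.border, ").*?(?=", right.border, ")")
between <- regmatches(my.string, regexpr(pattern, my.string, perl = TRUE))
between
# [1] "qwerty"
```

PCRE requires the lookbehind to have a fixed length, which holds here because the left border is a literal string.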
I would use str_match from stringr: "str_match extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group."
str_match(my.string, paste(left.border, '(.+)', right.border, sep=''))[,2]
The code above creates a regular expression with paste concatenating the capture group (.+) that captures 1 or more characters, with left and right borders (no spaces between strings).
A single match is assumed. So, [,2] selects the second column from the matrix returned by str_match.
You can use the package unglue:
library(unglue)
my.string <- "nanaqwertybaba"
unglue_vec(my.string, "nana{res}baba")
#> [1] "qwerty"

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed to the function to dynamically create the pattern to search for. How can this be done in R? The text file has a number and text on every line.
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
In order to build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
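If the text you are searching for is entirely literal, as it appears to be here, you can also sidestep escaping by passing fixed = TRUE to grep. The lines vector below is invented to stand in for the contents of the text file:

```r
# Hypothetical file contents
lines <- c("{3} number of cases", "{12} number of cases", "3 number")
n <- 3

# With fixed = TRUE the pattern is matched as a plain string, so { needs no escaping
hits <- grep(paste0("{", n, "} number"), lines, fixed = TRUE, value = TRUE)
hits
# [1] "{3} number of cases"
```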
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums..
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) and ICU (stringr/stringi regex functions) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana.
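The snippet above uses an examples variable that is not defined in it. A self-contained version, with an example sentence made up for illustration, could look like this:

```r
words <- c("bananas", "mangoes", "plums")

# Hypothetical input; the final "banana" should not be matched
examples <- "Ben likes bananas, mangoes and plums, but not one banana."

# Wrap the alternation in \b so only whole words match
regex <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
found <- unlist(regmatches(examples, gregexpr(regex, examples, perl = TRUE)))
found
# [1] "bananas" "mangoes" "plums"
```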
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. put a \ before each metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
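With perl=TRUE there is also an alternative to escaping each metacharacter: PCRE treats everything between \Q and \E as literal text. The sketch below reuses the same words and example text and assumes that none of the words contains the sequence \E itself, which would end the quoting early:

```r
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')

# Quote each word with \Q...\E instead of escaping individual characters
quoted <- paste0("\\Q", words, "\\E", collapse = "|")
regex <- paste0("\\b(", quoted, ")\\b")

found <- unlist(regmatches(examples, gregexpr(regex, examples, perl = TRUE)))
cat(found)
## => cm+km uname\vname
```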
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use one of the following instead:
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
A custom boundary built from a lookbehind/lookahead combination and your own character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."