Extract part of string between two different patterns - regex

I try to use stringr package to extract part of a string, which is between two particular patterns.
For example, I have:
my.string <- "nanaqwertybaba"
left.border <- "nana"
right.border <- "baba"
and by the use of str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression) I would like to receive:
"qwerty"
Solutions from Google did not work.

In base R you can use gsub. The parentheses in the pattern create numbered capturing groups. Here we select the second group in the replacement, i.e. the group between the borders. The . matches any character. The * means that there is zero or more of the preceeding element
gsub(pattern = "(.*nana)(.*)(baba.*)",
replacement = "\\2",
x = "xxxnanaRisnicebabayyy")
# "Risnice"

I do not know whether and how this is possible with functions provided by stringr but you can also use base regexpr and substring:
pattern <- paste0("(?<=", left.border, ")[a-z]+(?=", right.border, ")")
# "(?<=nana)[a-z]+(?=baba)"
rx <- regexpr(pattern, text=my.string, perl=TRUE)
# [1] 5
# attr(,"match.length")
# [1] 6
substring(my.string, rx, rx+attr(rx, "match.length")-1)
# [1] "qwerty"

I would use str_match from stringr: "str_match extracts capture groups formed by
() from the first match. It returns a character matrix with one column for the complete match and one column for each group." ref
str_match(my.string, paste(left.border, '(.+)', right.border, sep=''))[,2]
The code above creates a regular expression with paste concatenating the capture group (.+) that captures 1 or more characters, with left and right borders (no spaces between strings).
A single match is assumed. So, [,2] selects the second column from the matrix returned by str_match.

You can use the package unglue:
library(unglue)
my.string <- "nanaqwertybaba"
unglue_vec(my.string, "nana{res}baba")
#> [1] "qwerty"

Related

How to find pattern next to a given string using regex in R

I have a string formatted for example like "segmentation_level1_id_10" and would like to extract the level number associated to it (i.e. the number directly after the word level).
I have a solution that does this in two steps, first finds the pattern level\\d+ then replaces the level with missing after, but I would like to know if it's possible to do this in one step just with str_extract
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is by using a stringr library str_extract function with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can be done with base R using the gsub and making use of the same capturing mechanism used in str_match, but also using a backreference to restore the captured text in the replacement result:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"

R - Manipulate string based on pattern

This is the name of a file that I have on R:
> lst.files[1]
[1] "clt_Amon_CanESM2_rcp45_185001-230012.nc"
What I need to do is capture just the part until the 4th underscore (including), so it would be something like this:
clt_Amon_CanESM2_rcp45_
How can I get this in R?
If you know you always have (at least) four underscores, then you could do something like this:
regmatches(lst, regexec(".*_.*_.*_.*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
If potentially not always four, but no underscores in the second part, you could do something like this:
regmatches(lst, regexec(".*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
This doesn't require any extra package, just base R.
Using the qdap package, you can do the following.
x <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
library(qdap)
beg2char(x, "_", 4, include = TRUE)
# [1] "clt_Amon_CanESM2_rcp45_"
We can also capture the repeating patterns as a group using sub. We match one more more characters from the beginning (^) of the string that is not an underscore ([^_]+) followed by an underscore (\\_) which is repeated 4 times ({4}), capture that as a group by wrapping with parentheses followed by zero or more characters (.*). We replace it with the capture group (\\1) to get the expected output.
sub('^(([^_]+\\_){4}).*', '\\1', str1)
#[1] "clt_Amon_CanESM2_rcp45_"
data
str1 <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"

Return only matching portion of regular expression

I have:
> pattern
[1] "(/[[:digit:]]{4}/)"
so I want to extract only the matching portions...the digits plus the /.../. Here's what I tried:
> gsub(pattern, '\\1', grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value=TRUE))
[1] "t3tg3wgw/5764/" "grsgs/gwgew/5656/bfsbs"
However this still returns letters attached to the actual match that do not themselves match the regex. How can I extract only /5764/ and /5656/?
We could extract the pattern / followed by one or more numbers ([0-9]+) followed by / using str_extract_all from library(stringr) to output a list, which can be unlisted to convert to vector
library(stringr)
unlist(str_extract_all(v1, '/[0-9]+/'))
#[1] "/5764/" "/5656/"
Or we use the same pattern and using regmatches/gregexpr from base R
unlist(regmatches(v1, gregexpr('/[0-9]+/',v1)))
#[1] "/5764/" "/5656/"
data
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
Try changing the pattern to .*(/[[:digit:]]{4}/).*

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Debuggex Demo
Note that .*? means match the shortetst string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.
How about something like this. Here i'll use the regcatputedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurence of a matching for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"