R capture everything from pattern until pattern - regex

I am trying to extract a substring in between the two patterns BB and </p>:
require("stringr")
str = "<notes>\n <p>AA:</p>\n <p>BB: word, otherword</p>\n <p>Number:</p>\n <p>Level: 1</p>\n"
str_extract(str, "BB.*?:</p>")
The extracted substring should be "word, otherword", but I capture too much:
[1] "BB: word, otherword</p>\n <p>Number:</p>"

Maybe something like this?
> gsub(".*BB: (.*?)</p>.*$", "\\1", str)
# [1] "word, otherword"

This is a job for Perl-style lookahead and lookbehind assertions. Older versions of stringr needed the pattern wrapped in perl(); current versions use the ICU engine, which supports lookarounds directly, so the plain pattern is enough:
str_extract(str, "(?<=BB: ).*?(?=</p>)")
[1] "word, otherword"
You can also do this in base R by passing perl=TRUE (no perl() wrapper around the pattern):
regmatches(str, regexpr("(?<=BB: ).*?(?=</p>)", str, perl=TRUE))
[1] "word, otherword"
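The lookaround idea can be wrapped in a small helper for other delimiter pairs. This is only a sketch: extract_between is a hypothetical name (not a base R or stringr function), it assumes both delimiters are literal text that does not itself contain \E, and it returns only the first match.

```r
# Hypothetical helper, not part of base R or stringr.
# \Q...\E quotes the delimiters so they are treated as literal text.
extract_between <- function(text, start, end) {
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", end, "\\E)")
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

notes <- "<notes>\n <p>AA:</p>\n <p>BB: word, otherword</p>\n <p>Number:</p>\n <p>Level: 1</p>\n"

extract_between(notes, "BB: ", "</p>")
# "word, otherword"
extract_between(notes, "Level: ", "</p>")
# "1"
```

Because the delimiters sit inside lookarounds, they are checked but not included in the returned text.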

Related

extracting multiple overlapping substrings

I have strings of amino acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and I would like to extract all non-overlapping substrings starting with M and ending with *. So, for the above example I would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. Predictably, regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
I have also tried things suggested here, as this is the question that most resembles my problem (though not quite), but to no avail.
Any help would be appreciated.
I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
M[^*]+\\*
Use a negated character class. See the demo. Also use the perl=TRUE option.
https://regex101.com/r/tD0dU9/6
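The three patterns suggested above agree on the example string, but they are not equivalent. A quick way to see the difference is a string where a second M appears before the first *. The test string y below is made up for illustration, and perl = TRUE is passed only so that all three patterns run under the same engine.

```r
# Made-up input: a second M occurs before the first *
y <- "MAAMBB*CC*"

patterns <- c(
  lazy          = "M.+?\\*",     # shortest run from the first M to the next *
  no_inner_M    = "M[^M]+?\\*",  # refuses to cross another M
  no_inner_star = "M[^*]+\\*"    # refuses to cross a *
)

res <- lapply(patterns, function(p) regmatches(y, gregexpr(p, y, perl = TRUE))[[1]])
res$lazy           # "MAAMBB*"
res$no_inner_M     # "MBB*"
res$no_inner_star  # "MAAMBB*"
```

Which behaviour is wanted depends on whether a match should restart at the most recent M or keep the earliest one.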

Return only matching portion of regular expression

I have:
> pattern
[1] "(/[[:digit:]]{4}/)"
so I want to extract only the matching portions...the digits plus the /.../. Here's what I tried:
> gsub(pattern, '\\1', grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value=TRUE))
[1] "t3tg3wgw/5764/" "grsgs/gwgew/5656/bfsbs"
However this still returns letters attached to the actual match that do not themselves match the regex. How can I extract only /5764/ and /5656/?
We can extract the pattern, / followed by one or more digits ([0-9]+) followed by /, using str_extract_all from library(stringr). It returns a list, which unlist converts to a vector:
library(stringr)
unlist(str_extract_all(v1, '/[0-9]+/'))
#[1] "/5764/" "/5656/"
Or use the same pattern with regmatches/gregexpr from base R:
unlist(regmatches(v1, gregexpr('/[0-9]+/',v1)))
#[1] "/5764/" "/5656/"
data
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
Try changing the pattern to .*(/[[:digit:]]{4}/).* so that it matches the whole string; the replacement '\\1' then leaves only the captured group.
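As a sketch of that last suggestion, reusing the question's data and pattern (the variable names hits and extracted are just for illustration):

```r
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
pattern <- "(/[[:digit:]]{4}/)"

# Keep only the elements that contain a match, as in the question
hits <- grep(pattern, v1, value = TRUE)

# Surround the group with .* so the whole string is matched and
# replaced by the captured part
extracted <- sub(paste0(".*", pattern, ".*"), "\\1", hits)
extracted
# "/5764/" "/5656/"
```

The grep step matters here: for an element with no match, sub would return the input unchanged rather than dropping it.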

R regular expression

I have a string like below.
testSampe <- "Old:windows\r\nNew:linux\r\n"
I want to erase the string between ":" and "\".
Like this "Old\r\nNew\r\n".
How can I construct the regex for this?
I tried the gsub function with the regex ":.*\\\\", but it doesn't work.
gsub(":.*\\\\", "\\\\r", testSampe)
> testSampe <- "Old:windows\r\nNew:linux\r\n"
> gsub(":[^\r\n]*", "", testSampe)
[1] "Old\r\nNew\r\n"
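It may help to see why the original attempt could not match. In the string literal, \r and \n are escape sequences for single control characters (carriage return and line feed), so the string contains no backslash for ":.*\\\\" to find. A minimal check, assuming the same test string:

```r
testSampe <- "Old:windows\r\nNew:linux\r\n"

# There is no literal backslash in the string
has_backslash <- grepl("\\", testSampe, fixed = TRUE)
has_backslash
# FALSE

# "\r\n" is two characters, not four
nchar("\r\n")
# 2

# So the match has to stop at the control characters themselves
cleaned <- gsub(":[^\r\n]*", "", testSampe)
```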
You have a choice of a few different regular expressions that will match. See falsetru's answer or use:
rx <- ":[[:alnum:]]*(?=\\r)"
As a more readable alternative to gsub, use str_replace_all in the stringr package.
library(stringr)
str_replace_all(testSampe, rx, "")  # no perl() wrapper needed in current stringr
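For comparison, here is a sketch of the same lookahead pattern in base R. Lookahead is a Perl-style feature, so this assumes perl = TRUE is passed to gsub:

```r
testSampe <- "Old:windows\r\nNew:linux\r\n"
rx <- ":[[:alnum:]]*(?=\\r)"

# The lookahead (?=\r) checks for the carriage return without consuming it,
# so only ":windows" and ":linux" are removed
result <- gsub(rx, "", testSampe, perl = TRUE)
identical(result, "Old\r\nNew\r\n")
# TRUE
```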

Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
#     [,1]     [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567"  "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
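A variant of the same base R idea that returns a matrix shaped like the str_match_all output is sketched below. It relies on regexec being vectorised over its text argument; the names full and m are only illustrative.

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# Step 1: all full matches of the pattern
full <- regmatches(s, gregexpr(r, s))[[1]]

# Step 2: re-match each full match to split out its capture groups,
# then stack the per-match vectors into one matrix
m <- do.call(rbind, regmatches(full, regexec(r, full)))

m[, 1]  # full matches:  "xy1234" "xy567"
m[, 2]  # first group:   "1234"   "567"
```

One caveat: re-matching each substring in isolation assumes the pattern does not depend on surrounding context such as lookbehind or anchors.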
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"
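With more than one capture group, the matrix returned through gregexec has one row per group (row 1 being the full match) and one column per match. A small sketch, assuming R 4.1.0 or later and a made-up key=value string:

```r
# Made-up input with two capture groups per match
s2 <- "a=1;b=22;c=333"
r2 <- "([a-z])=(\\d+)"

m2 <- regmatches(s2, gregexec(r2, s2))[[1]]

m2[1, ]  # full matches: "a=1" "b=22" "c=333"
m2[2, ]  # group 1:      "a"   "b"    "c"
m2[3, ]  # group 2:      "1"   "22"   "333"
```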

Gsub to get part matched strings in R regular expression?

gsub('[a-zA-Z]+([0-9]{5})','\\1','htf84756.iuy')
[1] "84756.iuy"
I want to get 84756. How can I do that?
Using gregexpr() with regmatches() has the advantage of only requiring that your pattern match the bit that you actually want to extract:
string <- 'htf84756.iuy'
pat <- "(\\d){5}"
regmatches(string, gregexpr(pat, string))[[1]]
# [1] "84756"
(In practice, these functions are more useful when a supplied string might contain more than one substring matching pat.)
Try this:
R> gsub('[a-zA-Z]+([0-9]{5}).*','\\1','htf84756.iuy')
[1] "84756"
R>
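You need the added .* at the end so that the pattern also consumes everything after the 5 digits. gsub replaces only the text the pattern matches, so without it the trailing ".iuy" is left in place.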
This could work as well (though I like Dirk's answer better), building on your original pattern:
gsub('[a-zA-Z]+([0-9]{5})\\.([a-zA-Z])+','\\1','htf84756.iuy')
If you just want the numeric string this may be helpful as well:
gsub('[^0-9]','','htf84756.iuy')
With stringr, you can use str_extract:
library(stringr)
str_extract("htf84756.iuy", "[0-9]+")
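One practical difference between the replace-based and extract-based answers is what happens when an input has no match. The sketch below uses a made-up second input ("nodigits") to show it with base R only:

```r
inputs <- c("htf84756.iuy", "nodigits")

# Replacement approach: a non-matching input comes back unchanged
replaced <- sub("[a-zA-Z]+([0-9]{5}).*", "\\1", inputs)
replaced
# "84756" "nodigits"

# Extraction approach: regmatches silently drops non-matches, so fill a
# vector of NAs to keep the result aligned with the input
m <- regexpr("[0-9]{5}", inputs)
found <- as.integer(m) != -1L
extracted <- rep(NA_character_, length(inputs))
extracted[found] <- regmatches(inputs, m)
extracted
# "84756" NA
```

If unmatched inputs are possible, the extraction form makes them easier to detect than the replacement form.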