extracting multiple overlapping substrings - regex

I have strings of amino acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and I would like to extract all non-overlapping substrings starting with M and ending with *. So, for the above example I would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. Predictably, regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
I have also tried things suggested here, as this is the question that most closely resembles my problem (but not quite), but to no avail.
Any help would be appreciated.

I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"

Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"

M[^*]+\\*
Use a negated character class. See the demo. Also use the perl=TRUE option.
https://regex101.com/r/tD0dU9/6
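The three patterns suggested above give the same result on the example string, but they are not equivalent. The following is a minimal sketch of where they diverge, assuming a made-up test string `y` in which a second M appears before the first *; the helper name `all_matches` is illustrative, and `perl = TRUE` is used here only so that the lazy quantifier follows PCRE semantics.

```r
# Hypothetical test string: a second M occurs before the first *
y <- "MAAMBB*CCMDD*"

# Illustrative helper: all matches of a pattern within a single string
all_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

lazy     <- all_matches("M.+?\\*", y)     # shortest run from an M to the next *
no_inner <- all_matches("M[^M]+?\\*", y)  # additionally forbids another M inside the match
no_star  <- all_matches("M[^*]+\\*", y)   # forbids * inside, so no lazy quantifier is needed

print(lazy)      # "MAAMBB*" "MDD*"
print(no_inner)  # "MBB*"    "MDD*"
print(no_star)   # "MAAMBB*" "MDD*"
```

Which behaviour is wanted depends on whether a match should start at the first M or at the last M before a *.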

Related

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...),... and so) would do it. Cannot get it to work though!
Sorry if this is similar to other SO questions (e.g. here or here).
This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) characters starting with _sample and the characters afterwards .*
From those you want to keep the second "set" \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class-, and the string _sample followed by any number of characters, and substitute them with the empty string "".
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that are between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that are not a _ (based on the example), preceded by class- and followed by _sample.
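The same lookaround pattern can also be tried without stringr. This is a minimal base R sketch, assuming one match per string; `pat` and `between` are illustrative names.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# perl = TRUE is needed: the default regex engine does not support lookarounds
pat <- "(?<=class-)[^_]+(?=_sample)"

# regexpr finds the first match in each string; regmatches extracts it
between <- regmatches(x, regexpr(pat, x, perl = TRUE))
print(between)  # "X1(z)20" "Z3(z)29"
```

Note that regmatches() with regexpr() silently drops elements that have no match, so the result can be shorter than the input vector.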
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
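The two-step idea above (find every full match, then re-run the pattern on each match to pull out its groups) can be wrapped in a small helper. This is only a sketch: `capture_all` is a hypothetical name, and it assumes the pattern still matches when applied to the extracted substring on its own, which may not hold for patterns with anchors or lookarounds.

```r
# Hypothetical helper: one row per match, first column is the full match,
# remaining columns are the capture groups
capture_all <- function(pattern, text) {
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]  # all full matches
  groups <- regmatches(hits, regexec(pattern, hits))      # groups of each match
  do.call(rbind, groups)
}

res <- capture_all("xy(\\d+)", "xy1234wz98xy567")
print(res[, 2])  # "1234" "567"
```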
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"
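With more than one capture group, the matrix returned via gregexec has one row for the full match plus one row per group, and one column per match. A small sketch, assuming R >= 4.1.0 and a made-up input `s2` with pattern `r2`:

```r
s2 <- "a=1;b=22;c=333"   # hypothetical input
r2 <- "([a-z])=(\\d+)"   # two capture groups: key and value

# Row 1 holds the full matches, row 2 the first group, row 3 the second group
m <- regmatches(s2, gregexec(r2, s2))[[1]]

keys   <- m[2, ]
values <- m[3, ]
print(keys)    # "a" "b" "c"
print(values)  # "1" "22" "333"
```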

Grab from beginning to first occurrence of character with gsub

I have the following regex, with which I'd like to grab everything from the beginning of the sentence until the first ##. I could use strsplit, as I demonstrate, to do this task, but I would prefer a gsub solution. If gsub is not the correct tool (I think it is, though), I'd prefer a base solution because I want to learn the base regex tools.
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
strsplit(x, "##")[[c(1, 1)]] #works
gsub("(.*)(##.*)", "\\1", x) #I want to work
Just add one character, putting a ? after the first quantifier to make it "non-greedy":
gsub("(.*?)(##.*)", "\\1", x)
# [1] "gfd gdr tsvfvetrv erv tevgergre "
Here's the relevant documentation, from ?regex
By default repetition is greedy, so the maximal possible number of
repeats is used. This can be changed to 'minimal' by appending
'?' to the quantifier.
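To make the difference concrete, here is a short sketch that runs the greedy and the non-greedy version side by side on the string from the question; the variable names `greedy` and `lazy` are illustrative.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Greedy: the first group grabs as much as possible, so it stops at the LAST ##
greedy <- sub("(.*)(##.*)", "\\1", x)

# Non-greedy: the first group grabs as little as possible, so it stops at the FIRST ##
lazy <- sub("(.*?)(##.*)", "\\1", x)

print(greedy)  # "gfd gdr tsvfvetrv erv tevgergre ## vev fe "
print(lazy)    # "gfd gdr tsvfvetrv erv tevgergre "
```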
I'd say:
sub("##.*", "", x)
Removes everything including and after the first occurrence of ##.
In this case, I'd do the inverse, i.e. replace everything from the first # onward with an empty string:
gsub("#.*$", "", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
But you can also use the non-greedy modifier ? to make your regex work in the way you suggested:
gsub("(.*?)#.*$", "\\1", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
Here's another approach that uses more string tools instead of a more complicated regular expression. It first finds the location of the first ## and then extracts the substring up to that point:
library(stringr)
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
loc <- str_locate(x, "##")
str_sub(x, 1, loc[, "start"] - 1)
Generally, I think this sort of step-by-step approach is more maintainable than complex regular expressions.
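Since the question asks for base tools, the same locate-then-cut approach can be sketched without stringr. This assumes a single input string; `pos` and `before` are illustrative names, and the fallback of returning the whole string when ## is absent is an assumption, not something the question specifies.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Position of the first literal "##" (-1 if it does not occur)
pos <- as.integer(regexpr("##", x, fixed = TRUE))

# Keep everything before that position; keep the whole string if there is no "##"
before <- if (pos > 0) substr(x, 1, pos - 1) else x
print(before)  # "gfd gdr tsvfvetrv erv tevgergre "
```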
Try this as your regex
^[^#]+
starts at the beginning of the string and matches anything not a # up to the first #
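One thing to keep in mind with this pattern is that it stops at any single #, not only at ##. A minimal sketch of the difference, using a made-up string `y` that contains a lone # before the ##:

```r
y <- "a # b ## c"   # hypothetical input with a single # before the ##

# ^[^#]+ stops at the very first #, even if it is not doubled
up_to_hash <- regmatches(y, regexpr("^[^#]+", y))

# Removing from the first "##" onward keeps the lone #
up_to_double <- sub("##.*", "", y)

print(up_to_hash)    # "a "
print(up_to_double)  # "a # b "
```

For the string in the question both give the same result, because its first # is part of a ##.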
There are several simpler answers already here, but since you indicated in your question that you'd like to learn about regex support in base R, here's another way, using positive lookahead assertion (?=#) and non-greedy option (?U).
regmatches(x, regexpr('(?U)^.+(?=#)', x, perl=TRUE))
[1] "gfd gdr tsvfvetrv erv tevgergre "

Gsub to get part matched strings in R regular expression?

gsub('[a-zA-Z]+([0-9]{5})','\\1','htf84756.iuy')
[1] "84756.iuy"
I want to get 84756. How can I do that?
Using gregexpr() with regmatches() has the advantage of only requiring that your pattern match the bit that you actually want to extract:
string <- 'htf84756.iuy'
pat <- "(\\d){5}"
regmatches(string, gregexpr(pat, string))[[1]]
# [1] "84756"
(In practice, these functions are more useful when a supplied string might contain more than one substring matching pat.)
Try this:
R> gsub('[a-zA-Z]+([0-9]{5}).*','\\1','htf84756.iuy')
[1] "84756"
R>
You need the added .* at the end so that the pattern also matches (and therefore removes) everything after the 5 digits.
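The reason is that sub() and gsub() only replace the part of the string that the pattern actually matched; anything outside the match is left as it is. A small sketch that makes the matched portion visible (the variable names are illustrative):

```r
s <- 'htf84756.iuy'
short_pat <- '[a-zA-Z]+([0-9]{5})'     # pattern from the question
long_pat  <- '[a-zA-Z]+([0-9]{5}).*'   # same pattern with a trailing .*

# What each pattern matches in the string
matched_short <- regmatches(s, regexpr(short_pat, s))
matched_long  <- regmatches(s, regexpr(long_pat, s))
print(matched_short)  # "htf84756"      (".iuy" is outside the match)
print(matched_long)   # "htf84756.iuy"  (the whole string)

# Only the matched part is replaced by the captured digits
res_short <- sub(short_pat, '\\1', s)
res_long  <- sub(long_pat, '\\1', s)
print(res_short)  # "84756.iuy"
print(res_long)   # "84756"
```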
This could work as well (like Dirk's answer better) based on what to add to yours:
gsub('[a-zA-Z]+([0-9]{5})\\.([a-zA-Z])+','\\1','htf84756.iuy')
If you just want the numeric string this may be helpful as well:
gsub('[^0-9]','','htf84756.iuy')
With stringr, you can use str_extract:
library(stringr)
str_extract("htf84756.iuy", "[0-9]+")

R- regexp question

I need to re-shape my data frame using regexp and, in particular, this kind of line
X21_GS04.A.mzdata
must become:
GS04.A
I tried
pluto <- sub('^X[0-90_]+','', my.data.frame$File.Name, perl=TRUE)
and it works; then I tried
pluto <- sub('.mzdata$','', my.data.frame$File.Name, perl=TRUE)
and it works too.
The problem is that I have no idea how to combine the two into one. I tried a script such as this:
pluto <- sub('^X[0-90_]+ | .mzdata$','', my.data.frame$File.Name, perl=TRUE)
but nothing happens.
Can someone tell me where I went wrong?
Best
Riccardo
The regular expression you’re after is this:
^X\d+_(.*)\.mzdata$
This will match your whole expression and capture the part that you want to retain in a group. You can now replace this by \1 (a reference to the capture group).
In R, this would be:
result <- sub('^X\\d+_(.*)\\.mzdata$', '\\1', my.data.frame$File.Name, perl=TRUE)
Remove the spaces from your regex. Also escape the . character as \., i.e.:
^X[0-9]+_|\.mzdata$
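Both suggestions can be checked on the file name from the question. The sketch below uses a one-element vector `files` as a stand-in for my.data.frame$File.Name (an assumption). Note that with the alternation pattern there are two separate matches per string, so gsub() is needed; sub() would only replace the first one.

```r
files <- c("X21_GS04.A.mzdata")   # stand-in for my.data.frame$File.Name

# Original attempt: the literal spaces in the pattern never occur in the data
with_spaces <- sub('^X[0-90_]+ | .mzdata$', '', files, perl = TRUE)

# Alternation without spaces and with the dot escaped; gsub removes both ends
alt <- gsub('^X[0-9]+_|\\.mzdata$', '', files)

# The same pattern with sub() only strips the prefix
first_only <- sub('^X[0-9]+_|\\.mzdata$', '', files)

# Capture-group version: match the whole name, keep only the middle part
captured <- sub('^X\\d+_(.*)\\.mzdata$', '\\1', files, perl = TRUE)

print(with_spaces)  # "X21_GS04.A.mzdata"
print(alt)          # "GS04.A"
print(first_only)   # "GS04.A.mzdata"
print(captured)     # "GS04.A"
```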