R- regexp question - regex

I need to re-shape my data frame using regexp and, in particular, this kind of line
X21_GS04.A.mzdata
must became:
GS04.A
I tryed
pluto <- sub('^X[0-90_]+','', my.data.frame$File.Name, perl=TRUE)
and it works; than I tryed
pluto <- sub('.mzdata$','', my.data.frame$File.Name, perl=TRUE)
and it works too.
The problem is that I have no idea how to combine the two code in one, I tryed a script such this
pluto <- sub('^X[0-90_]+ | .mzdata$','', my.data.frame$File.Name, perl=TRUE)
but nothing appens.
Can someone say to me where I wrong??
Best
Riccardo

The regular expression you’re after is this:
^X\d+_(.*)\.mzdata$
This will match your whole expression and capture the part that you want to retain in a group. You can now replace this by \1 (a reference to the capture group).
In R, this would be:
result <- sub('^X\\d+_(.*)\\.mzdata$', '\\1', my.data.frame$File.Name, perl=TRUE)

Remove space in your regex. Also escape . char: \., i.e.:
^X[0-9]+_|\.mzdata$

Related

extracting multiple overlapping substrings

i have strings of amino-acids like this:
x <- "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
and i would like to extract all non-overlapping substrings starting with M and finishing with *. so, for the above example i would need:
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
as the output. predictably regexpr gives me the greedy solution:
regmatches(x, regexpr("M.+\\*", x))
#[1] "MEALYRAQVLVDLT*MQLPSSFAALAAQFDQL*EKEKF*SLIARSLHRPQ**LLMFSLLVASVFTPCSALPFWSIKFTLFILS*SFLISDSILFIRVIDQEIKYVVPL*DLK*LTPDYCKCD*"
i have also tried things suggested here, as this is the question that resembles my problem the most (but not quite), but to no avail.
any help would be appreciated.
I will add an option for capture of non-overlapping patterns as you requested. We have to check that another pattern hasn't begun within our match:
regmatches(x, gregexpr("M[^M]+?\\*", x))[[1]]
#[1] "MEALYRAQVLVDLT*"
#[2] "MQLPSSFAALAAQFDQL*"
#[3] "MFSLLVASVFTPCSALPFWSIKFTLFILS*"
Use a non-greedy .+? instead of .+, and switch to gregexpr for multiple matches:
R> regmatches(x, gregexpr("M.+?\\*", x))[[1]]
#"MEALYRAQVLVDLT*"
#"MQLPSSFAALAAQFDQL*"
#"MFSLLVASVFTPCSALPFWSIKFTLFILS*"
M[^*]+\\*
use negated character class.See demo.Also use perl=True option.
https://regex101.com/r/tD0dU9/6

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to erplace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this.See demo.Replace by \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parenthesis and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdpRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Debuggex Demo
Note that .*? means match the shortetst string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.
How about something like this. Here i'll use the regcatputedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurence of a matching for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"

Gsub to get part matched strings in R regular expression?

gsub('[a-zA-Z]+([0-9]{5})','\\1','htf84756.iuy')
[1] "84756.iuy"
I want to get 84756,how can i do?
Using gregexpr() with regmatches() has the advantage of only requiring that your pattern match the bit that you actually want to extract:
string <- 'htf84756.iuy'
pat <- "(\\d){5}"
regmatches(string, gregexpr(pat, string))[[1]]
# [1] "84756"
(In practice, these functions are more useful when a supplied string might contain more than one substring matching pat.)
Try this:
R> gsub('[a-zA-Z]+([0-9]{5}).*','\\1','htf84756.iuy')
[1] "84756"
R>
You need the added .* at the end of the "greedy" regexp to terminate it after the 5 digits.
This could work as well (like Dirk's answer better) based on what to add to yours:
gsub('[a-zA-Z]+([0-9]{5})\\.([a-zA-Z])+','\\1','htf84756.iuy')
If you just want the numeric string this may be helpful as well:
gsub('[^0-9]','','htf84756.iuy')
With stringr, you can use str_extract:
library(stringr)
str_extract("htf84756.iuy", "[0-9]+")