Regex matches processing in R - regex

I would like to extract the 2 matching groups using R.
Right now I've got this, but it is not working well:
Code:
str = '123abc'
vector <- gregexpr('(?<first>\\d+)(?<second>\\w+)', str, perl=TRUE)
regmatches(str, vector)
Result:
[[1]]
[1] "123abc"
I want the result to be something like this:
[1] "123"
[2] "abc"

I'm not sure if you have a specific reason for using regmatches, unless you are e.g. importing the expressions in that format. If well-defined groups are common to all your entries, you can match them in this way:
x <- "123abc"
sub("([[:digit:]]+)[[:alpha:]]+","\\1",x)
sub("[[:digit:]]+([[:alpha:]]+)","\\1",x)
Result
[1] "123"
[1] "abc"
I.e., match the entire structure of the string, then replace it with the part you want to retain by enclosing it in round brackets and referring to it with a backreference ("\\1").
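If you would rather get both groups from a single call, base R's regexec reports the capture groups alongside the full match. A minimal sketch, reusing the same input (the variable names m and groups are only illustrative):

```r
x <- "123abc"
# regexec() records the position of the full match and of each capture group
m <- regexec("([[:digit:]]+)([[:alpha:]]+)", x)
# regmatches() returns the full match first, followed by the groups;
# dropping the first element leaves only the captured groups
groups <- regmatches(x, m)[[1]][-1]
groups
# [1] "123" "abc"
```

This avoids running one sub() call per group, at the cost of indexing into the list that regmatches returns.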

I've renamed your string s to avoid clobbering str. Here is one approach:
library(stringr)
s <- '123abc'
reg <- '([[:digit:]]+)([[:alpha:]]+)'
complete <- unlist(str_extract_all(s, reg))
partials <- unlist(str_match_all(s, reg))
partials <- partials[!(partials %in% complete)]
partials
[1] "123" "abc"

Depending on how well structured your inputs are, you may want to use strsplit to split the string.
See ?strsplit for the documentation.

Try this:
> library(gsubfn)
> strapplyc("123abc", '(\\d+)(\\w+)')[[1]]
[1] "123" "abc"

Related

How to find pattern next to a given string using regex in R

I have a string formatted for example like "segmentation_level1_id_10" and would like to extract the level number associated to it (i.e. the number directly after the word level).
I have a solution that does this in two steps: it first extracts the pattern level\\d+ and then removes the "level" prefix. I would like to know whether it's possible to do this in one step with just str_extract.
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is to use str_extract from the stringr library with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can be done with base R using gsub, which relies on the same capturing mechanism as str_match, together with a backreference that restores the captured text in the replacement:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"
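For completeness, the lookbehind pattern should also work in base R when perl = TRUE is set. A small sketch, assuming the same input string (the variable name level is only illustrative):

```r
s <- "segmentation_level1_id_10"
# regexpr() locates the first match; perl = TRUE enables the lookbehind
level <- regmatches(s, regexpr("(?<=level)\\d+", s, perl = TRUE))
level
# [1] "1"
```

Unlike the gsub version, this returns character(0) rather than the unchanged input when the string contains no level number.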

stringr package str_extract() with inversion of the regex

I have a string like the following:
14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0
The following regex extracts the last part, which consists of a dot and a digit. I want to extract everything but that part, and I can't find a way to invert the regex (using ^ does not help):
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
> str_extract(s, '(\\.[0-9]{1})$')
[1] ".0"
I instead want the output to be:
[1] 14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27
To clarify further, I want it to return the string as is, if it does not end in a dot and one single digit.
Following example:
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.1'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.4'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
Try this regex:
^.*(?=\.\d+$)|^.*
Regex live here.
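In R the backslashes in that pattern have to be doubled, and the lookahead needs a PCRE-style engine. A hedged sketch in base R follows; it uses \\d instead of \\d+ on the assumption that only a single trailing digit should be dropped, and strip_suffix is a hypothetical helper name:

```r
strip_suffix <- function(s) {
  # first alternative: everything before a trailing ".<digit>";
  # second alternative: the whole string when there is no such suffix
  regmatches(s, regexpr("^.*(?=\\.\\d$)|^.*", s, perl = TRUE))
}

strip_suffix("14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0")
# [1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
strip_suffix("14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27")
# [1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
```

The order of the alternatives matters: the lookahead branch is tried first, and the plain ^.* only applies when no trailing dot and digit is found.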
One option would be substituting for the last bit,
sub("\\.\\d$", '', s)
str_extract(s, "([\\w ]+(?:\\.|-)){7}")
This returns the string up to and including the final dot, so take the substring from 1 to its length - 1 and it will give you the required output.
PS: In R, the backslashes in the pattern have to be escaped (doubled), as above.
You could use stringr::str_remove() for example:
s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
stringr::str_remove(s, '(\\.[0-9]{1})$')
#> [1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
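To make the two changes visible separately, here is a small sketch that applies each pattern in turn to the same input (the result variable names are only illustrative):

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# exactly one letter between the brackets: nothing matches, input is unchanged
one_letter <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)
one_letter
# [1] "85"        "86 (TBA)"  "87 (LAST)"

# one or more letters: the brackets are removed, but the leading number stays
with_plus <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)
with_plus
# [1] "85"      "86 TBA"  "87 LAST"

# a leading ".*" also consumes the number, so only the bracket content remains
whole <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
whole
# [1] "85"   "TBA"  "LAST"
```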
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See the demo. Replace with \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain, as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"

Extract capture group matches from regular expressions? (or: where is gregexec?)

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.
For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"
Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.
Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"
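The [2, ] index works because regmatches returns one matrix per input string when given gregexec output: each column is a match, the first row holds the full match, and the following rows hold the capture groups. A short sketch, assuming R 4.1.0 or later (the names m, full and group1 are only illustrative):

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
# one matrix per input string; [[1]] picks the matrix for s
m <- regmatches(s, gregexec(r, s))[[1]]
# row 1: full matches, row 2: first capture group; one column per match
full <- m[1, ]
group1 <- m[2, ]
```

With several capture groups, m[-1, , drop = FALSE] keeps all of them while dropping the full-match row.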

regex to pickout some text between parenthesis [duplicate]

I have a string
df
Peoplesoft(id-1290)
I would like to capture the characters between the parentheses. For example, I would like to get id-1290 from the above example.
I used this:
x <- regexpr("\\((.*)\\)", df)
this is giving me numbers like
[1] 10
Is there an easy way to grab text between parentheses using regex in R?
I prefer to use gsub() for this:
gsub(".*\\((.*)\\).*", "\\1", df)
[1] "id-1290"
The regex works like this:
Find text inside the parentheses - not your real parentheses, but my extra set of parentheses, i.e. (.*)
Return this as a back-reference, \\1
In other words, substitute all text in the string with the back reference
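One caveat with this pattern is that the leading .* is greedy, which matters once a string contains more than one pair of parentheses. A hedged sketch, using a hypothetical second input df2 and a negated-character-class variant that is not part of the pattern above:

```r
df2 <- "Peoplesoft(id-1290) blabla (foo)"

# the leading ".*" is greedy, so the capture ends up on the last bracket pair
last <- gsub(".*\\((.*)\\).*", "\\1", df2)
last
# [1] "foo"

# negated character classes stop at the first bracket pair instead
first <- sub("[^(]*\\(([^)]*)\\).*", "\\1", df2)
first
# [1] "id-1290"
```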
If you want to use regexpr rather than gsub, then do this:
x <- regexpr("\\((.*)\\)", df)
x
[1] 11
attr(,"match.length")
[1] 9
attr(,"useBytes")
[1] TRUE
This returns a value of 11, i.e. the starting position of the found expression. And note the attribute match.length that indicates how many characters were matched.
You can extract this with attr:
attr(x, "match.length")
[1] 9
And then use substring to extract the characters:
substring(df, x+1, x+attr(x, "match.length")-2)
[1] "id-1290"
Here is a slightly different way, using lookbehind/ahead:
df <- "Peoplesoft(id-1290)"
regmatches(df,gregexpr("(?<=\\().*?(?=\\))", df, perl=TRUE))
The difference from Andrie's answer is that this also works to extract multiple strings in brackets, e.g.:
df <- "Peoplesoft(id-1290) blabla (foo)"
regmatches(df,gregexpr("(?<=\\().*?(?=\\))", df, perl=TRUE))
Gives:
[[1]]
[1] "id-1290" "foo"