stringr package str_extract() with inversion of the regex

I have a string like the following:
14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0
The following regex extracts the last part, which ends in a dot and a digit. I want to extract everything but that part, and I can't find a way to invert the regex (using ^ is not helping):
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
> str_extract(s, '(\\.[0-9]{1})$')
[1] ".0"
I instead want the output to be:
[1] 14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27
To clarify further, I want it to return the string as is, if it does not end in a dot and one single digit.
Following example:
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.1'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.4'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"

Try this regex:
^.*(?=\.\d+$)|^.*
Regex live here.
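As a rough sketch of how an alternation like this could be used from base R: the pattern has to be written with doubled backslashes and run with perl = TRUE so that the lookahead is supported. The helper name strip_trailing_digit is made up for illustration, and the sketch uses \d rather than \d+ on the assumption that only a single trailing digit should be dropped, as the question asks.

```r
# Hypothetical helper: return everything before a trailing ".<digit>",
# or the whole string when there is no such suffix.
strip_trailing_digit <- function(s) {
  # First alternative: prefix followed by a dot, one digit, end of string.
  # Second alternative: fall back to the entire string.
  m <- regexpr("^.*(?=\\.\\d$)|^.*", s, perl = TRUE)
  regmatches(s, m)
}

with_suffix    <- "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0"
without_suffix <- "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"

res_with    <- strip_trailing_digit(with_suffix)
res_without <- strip_trailing_digit(without_suffix)
print(res_with)
print(res_without)
```

Both calls should print the string without the trailing ".0", since the second alternative only applies when the first one cannot match.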

One option would be substituting for the last bit,
sub("\\.\\d$", '', s)
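A small sketch of the same substitution applied to a vector, to check the "return the string as is" requirement. The shortened ids below are made-up stand-ins for the strings in the question.

```r
# Made-up sample ids: one ending in ".<digit>", one without a suffix,
# and one ending in a dot followed by two digits.
ids <- c("abc.55B67C8E27.0", "abc.55B67C8E27", "abc.55B67C8E27.12")

# sub() is vectorised; elements that do not match are returned unchanged.
cleaned <- sub("\\.\\d$", "", ids)
print(cleaned)
# [1] "abc.55B67C8E27"    "abc.55B67C8E27"    "abc.55B67C8E27.12"
```

Only the first element is shortened, because the pattern requires exactly one digit between the dot and the end of the string.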

str_extract(s, '([\\w ]+(?:\\.|\\-)){7}')
Then you can take the returned string up to its length minus 1, and it will give you the required output.
PS: Inside an R string the backslashes have to be doubled, as shown.

You could use stringr::str_remove() for example:
s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
stringr::str_remove(s, '(\\.[0-9]{1})$')
#> [1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"

Related

Split string on special character

I have a string (fasta format), something like this:
a = ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
and would like to separate it at the character >, filter out the newlines, and put the three substrings separated by > into a vector or list with three elements:
>atttaggaccttaattgtcggta
>ccattnnnncccatt
>ttaggccta
I tried strsplit:
unlist(strsplit(a, "(?<=>)", perl=T))
but this puts the delimiter > at the end of the each string.
I found related questions here and here, but I can't really get it to work without making a complicated construct.
Is there a simple solution to do this in one go?
Your regex contains only a lookbehind, which matches any empty location after a >; see your regex demo. The engine processes the string from left to right, checks whether there is a > to the left of the current location, and returns a valid empty-string match if a > is found.
You may use (?<=[^>])(?=>) regex:
> res <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl=T))
> res
[1] ">atttaggacctta\nattgtcggta\n" ">ccattnnnn\ncccatt\n"
[3] ">ttaggccta"
> gsub("\n", "", res, fixed=TRUE)
[1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"
[3] ">ttaggccta"
The pattern matches a location that is preceded with a non-> char and is followed with > char.
Note that using a lookbehind pattern only with strsplit often leads to unexpected behavior. See Why does strsplit use positive lookahead and lookbehind assertion matches differently?
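One way to sidestep zero-width splitting altogether is to extract the records instead of splitting on them. This is only a sketch of that alternative, not the approach used above, and the variable names flat and records are arbitrary.

```r
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

# Drop the newlines first, then pull out every run that starts with ">"
# and continues until the next ">" (or the end of the string).
flat    <- gsub("\n", "", a, fixed = TRUE)
records <- regmatches(flat, gregexpr(">[^>]+", flat))[[1]]
print(records)
# [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"         ">ttaggccta"
```

Because each match consumes characters, no lookaround is needed here.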
library(stringi)
library(magrittr)
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
stri_replace_all_regex(a, "\\n", "") %>%
stri_extract_all_regex("(>[[:alpha:]]+)") %>%
unlist()
## [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt" ">ttaggccta"
If one must use base only:
a <- gsub("\\n", "", a)
unlist(regmatches(a, gregexpr("(>[[:alpha:]]+)", a)))

How to find pattern next to a given string using regex in R

I have a string formatted for example like "segmentation_level1_id_10" and would like to extract the level number associated to it (i.e. the number directly after the word level).
I have a solution that does this in two steps: it first finds the pattern level\\d+ and then replaces "level" with an empty string. I would like to know if it's possible to do this in one step with just str_extract.
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is to use the str_extract function from the stringr library with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can be done with base R using the gsub and making use of the same capturing mechanism used in str_match, but also using a backreference to restore the captured text in the replacement result:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"
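For completeness, the lookbehind idea can also be sketched with base R only, using regexpr with perl = TRUE and regmatches. The conversion to integer at the end is an extra step that the question did not ask for.

```r
segmentation_id <- "segmentation_level1_id_10"

# Match one or more digits that are immediately preceded by "_level".
m <- regexpr("(?<=_level)\\d+", segmentation_id, perl = TRUE)
level_chr <- regmatches(segmentation_id, m)

# Optional: turn the extracted text into a number.
level <- as.integer(level_chr)
print(level)
# [1] 1
```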

Return the first occurrence of a character in a string

I have been trying to extract a portion of string after the occurrence of a first ^ sign. For example, the string looks like abc^28092015^def^1234. I need to extract 28092015 sandwiched between the 1st two ^ signs.
So, I need to extract 8 characters from the occurrence of the 1st ^ sign. I have been trying to extract the position of the first ^ sign and then use it as an argument in the substr function.
I tried to use this:
x <- "abc^28092015^def^1234"
rev(gregexpr("\\^", x)[[1]])[1]
following the answer discussed here.
But it continues to return the last position. Can anyone please help me out?
I would use sub.
x <- "abc^28092015^def^1234"
sub("^.*?\\^(.*?)\\^.*", "\\1", x)
# [1] "28092015"
Since ^ is a special character in regex, you need to escape it in order to match a literal ^ symbol.
or
Split on ^ and take the second element.
strsplit(x, "^", fixed = TRUE)[[1]][2]
# [1] "28092015"
or
You may also use gsub.
gsub("^.*?\\^|\\^.*", "", x, perl=T)
# [1] "28092015"
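The position-based approach from the question can be made to work as well. The sketch below assumes the goal is the eight characters after the first ^; it shows that rev() puts the last position first, which is why the original attempt returned the last ^.

```r
x <- "abc^28092015^def^1234"

# All positions of a literal "^" (fixed = TRUE avoids regex escaping).
positions <- as.integer(gregexpr("^", x, fixed = TRUE)[[1]])
print(positions)
# [1]  4 13 17

# rev() reverses the vector, so its first element is the LAST position.
last_pos  <- rev(positions)[1]
first_pos <- positions[1]

# Take the 8 characters that follow the first "^".
date_part <- substr(x, first_pos + 1, first_pos + 8)
print(date_part)
# [1] "28092015"
```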
Here's one option with base R:
x <- "abc^28092015^def^1234"
m <- regexpr("(?<=\\^)(.+?)(?=\\^)", x, perl = TRUE)
##
R> regmatches(x, m)
#[1] "28092015"
Another option is stri_extract_first from library(stringi)
library(stringi)
stri_extract_first_regex(str1, '(?<=\\^)\\d+(?=\\^)')
#[1] "28092015"
If it is any character between two ^
stri_extract(str1, regex='(?<=\\^)[^^]+')
#[1] "28092015"
data
str1 <- 'abc^28092015^def^1234'
x <- 'abc^28092015^def^1234'
library(qdapRegex)
unlist(rm_between(x, '^', '^', extract=TRUE))[1]
# [1] "28092015"
It would be better if you split it using ^. But if you still want the pattern, you can try this.
^\S+\^(\d+)(?=\^)
Then match group 1.
OUTPUT
28092015
See DEMO

Return only matching portion of regular expression

I have:
> pattern
[1] "(/[[:digit:]]{4}/)"
so I want to extract only the matching portions...the digits plus the /.../. Here's what I tried:
> gsub(pattern, '\\1', grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value=TRUE))
[1] "t3tg3wgw/5764/" "grsgs/gwgew/5656/bfsbs"
However this still returns letters attached to the actual match that do not themselves match the regex. How can I extract only /5764/ and /5656/?
We could extract the pattern, a / followed by one or more digits ([0-9]+) followed by another /, using str_extract_all from library(stringr). It returns a list, which can be unlisted to get a vector.
library(stringr)
unlist(str_extract_all(v1, '/[0-9]+/'))
#[1] "/5764/" "/5656/"
Or we use the same pattern and using regmatches/gregexpr from base R
unlist(regmatches(v1, gregexpr('/[0-9]+/',v1)))
#[1] "/5764/" "/5656/"
data
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
Try changing the pattern to .*(/[[:digit:]]{4}/).*
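A sketch of how that changed pattern could be plugged into the original gsub/grep call. The variable names v1, pattern, and hits are only for illustration.

```r
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
pattern <- "(/[[:digit:]]{4}/)"

# Keep only the elements that contain a match at all.
hits <- grep(pattern, v1, value = TRUE)

# Surrounding the group with .* makes the regex consume the whole string,
# so replacing with \\1 leaves only the captured "/dddd/" part.
only_match <- gsub(paste0(".*", pattern, ".*"), "\\1", hits)
print(only_match)
# [1] "/5764/" "/5656/"
```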

Unable to replace string with back reference using gsub in R

I am trying to replace some text in a character vector using regex in R where, if there is a set of letters inside a bracket, the bracket content is to replace the whole thing. So, given the input:
tst <- c("85", "86 (TBA)", "87 (LAST)")
my desired output would be equivalent to c("85", "TBA", "LAST")
I tried gsub("\\(([[:alpha:]])\\)", "\\1", tst) but it didn't replace anything. What do I need to correct in my regular expression here?
I think you want
gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)
# [1] "85" "TBA" "LAST"
Your first expression was trying to match exactly one alpha character rather than one-or-more. I also added the ".*" to capture the beginning part of the string so it gets replaced as well, otherwise, it would be left untouched.
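The two changes described above can be seen separately in a short sketch. The intermediate variable names are arbitrary.

```r
tst <- c("85", "86 (TBA)", "87 (LAST)")

# Original attempt: exactly one letter inside the brackets, so nothing matches.
single <- gsub("\\(([[:alpha:]])\\)", "\\1", tst)

# With "+": the brackets are removed, but the leading number is left in place.
plus <- gsub("\\(([[:alpha:]]+)\\)", "\\1", tst)

# With ".*" in front: the leading number is part of the match and is replaced too.
full <- gsub(".*\\(([[:alpha:]]+)\\)", "\\1", tst)

print(single)  # [1] "85"        "86 (TBA)"  "87 (LAST)"
print(plus)    # [1] "85"      "86 TBA"  "87 LAST"
print(full)    # [1] "85"   "TBA"  "LAST"
```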
gsub("(?=.*\\([^)]*\\)).*\\(([^)]*)\\)", "\\1", tst, perl=TRUE)
## [1] "85" "TBA" "LAST"
You can try this. See demo. Replace by \1.
https://regex101.com/r/sH8aR8/38
The following would work. Note that white-spaces within the brackets may be problematic
A<-sapply(strsplit(tst," "),tail,1)
B<-gsub("\\(|\\)", "", A)
I like the purely regex answers better. I'm showing a solution using the qdapRegex package that I maintain, as the result is pretty speedy and easy to remember and generalize. It pulls out the strings that are in parentheses and then replaces any NA (no bracket) with the original value. Note that the result is a list, and you'd need to use unlist to match your desired output.
library(qdapRegex)
m <- rm_round(tst, extract=TRUE)
m[is.na(m)] <- tst[is.na(m)]
## [[1]]
## [1] "85"
##
## [[2]]
## [1] "TBA"
##
## [[3]]
## [1] "LAST"