Split string on special character - regex

I have a string (fasta format), something like this:
a = ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
and would like to split it at the > character, filter out the newlines, and put the three substrings that start with > into a vector or list with three elements:
>atttaggaccttaattgtcggta
>ccattnnnncccatt
>ttaggccta
I tried strsplit:
unlist(strsplit(a, "(?<=>)", perl=T))
but this puts the delimiter > at the end of each string.
I found related questions here and here, but I can't really get it to work without building a complicated construct.
Is there a simple solution to do this in one go?

Your regex contains only a lookbehind, which matches any empty location that follows a >. The engine processes the string from left to right, checks whether there is a > immediately to the left of the current location, and returns a valid empty-string match if a > is found.
You may use (?<=[^>])(?=>) regex:
> res <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl=T))
> res
[1] ">atttaggacctta\nattgtcggta\n" ">ccattnnnn\ncccatt\n"
[3] ">ttaggccta"
> gsub("\n", "", res, fixed=TRUE)
[1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"
[3] ">ttaggccta"
The pattern matches a location that is preceded by a character other than > and followed by a >.
Note that using only a lookbehind pattern with strsplit often leads to unexpected behavior. See Why does strsplit use positive lookahead and lookbehind assertion matches differently?
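The two steps above (split before each >, then drop the newlines) can be wrapped in a small helper. This is a minimal sketch rather than a definitive implementation; the function name split_fasta_records is made up for illustration.

```r
# Hypothetical helper (the name is an assumption, not from the answer):
# split a FASTA-like string before every ">" and remove the line breaks.
split_fasta_records <- function(x) {
  # Zero-width split point: preceded by a non-">" character, followed by ">"
  parts <- unlist(strsplit(x, "(?<=[^>])(?=>)", perl = TRUE))
  # Remove the literal newline characters inside each record
  gsub("\n", "", parts, fixed = TRUE)
}

a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
records <- split_fasta_records(a)
print(records)
```

Because the lookbehind demands a non-> character, no split point is found at the very first position, so the leading > stays attached to its record.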

library(stringi)
library(magrittr)
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
stri_replace_all_regex(a, "\\n", "") %>%
stri_extract_all_regex("(>[[:alpha:]]+)") %>%
unlist()
## [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt" ">ttaggccta"
If one must use base only:
a <- gsub("\\n", "", a)
unlist(regmatches(a, gregexpr("(>[[:alpha:]]+)", a)))

Related

How to find pattern next to a given string using regex in R

I have a string formatted for example like "segmentation_level1_id_10" and would like to extract the level number associated to it (i.e. the number directly after the word level).
I have a solution that does this in two steps: it first finds the pattern level\\d+ and then replaces level with an empty string. I would like to know whether it's possible to do this in one step with str_extract alone.
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is to use str_extract from the stringr package with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can also be done in base R with gsub, using the same capturing mechanism as str_match plus a backreference that restores the captured text in the replacement:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"
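One difference between these approaches shows up when the input has no level number: gsub returns the string unchanged, while an extraction yields a missing value. Below is a base R sketch of the capture-group extraction with regexec and regmatches, assuming the same input format; the helper name extract_level is hypothetical.

```r
# Hypothetical helper: return the digits after "level", or NA if absent.
extract_level <- function(x) {
  m <- regmatches(x, regexec("level(\\d+)", x))
  # Element 1 of each match is the full match, element 2 the captured group;
  # indexing an empty match with [2] yields NA
  vapply(m, function(hit) hit[2], character(1), USE.NAMES = FALSE)
}

ids <- c("segmentation_level1_id_10", "segmentation_id_10")
levels_found <- extract_level(ids)
print(levels_found)

# For comparison: gsub leaves a non-matching string untouched
unchanged <- gsub("^.*level(\\d+).*", "\\1", ids[2])
print(unchanged)
```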

Return only matching portion of regular expression

I have:
> pattern
[1] "(/[[:digit:]]{4}/)"
so I want to extract only the matching portions...the digits plus the /.../. Here's what I tried:
> gsub(pattern, '\\1', grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value=TRUE))
[1] "t3tg3wgw/5764/" "grsgs/gwgew/5656/bfsbs"
However this still returns letters attached to the actual match that do not themselves match the regex. How can I extract only /5764/ and /5656/?
We could extract the pattern / followed by one or more digits ([0-9]+) followed by / using str_extract_all from library(stringr), which outputs a list that can be unlisted into a vector:
library(stringr)
unlist(str_extract_all(v1, '/[0-9]+/'))
#[1] "/5764/" "/5656/"
Or we can use the same pattern with regmatches/gregexpr from base R:
unlist(regmatches(v1, gregexpr('/[0-9]+/',v1)))
#[1] "/5764/" "/5656/"
data
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
Try changing the pattern to .*(/[[:digit:]]{4}/).*
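This last suggestion works because the surrounding .* parts consume everything outside the capture group, so the backreference leaves only the group. A minimal sketch using the vector from the answer above, with sub in place of gsub since the pattern now spans the whole string:

```r
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
pattern <- "(/[[:digit:]]{4}/)"

# Keep only the elements that contain the pattern
hits <- grep(pattern, v1, value = TRUE)

# Wrap the group in .* on both sides so the whole string is replaced
# by the text captured in group 1
wrapped <- paste0(".*", pattern, ".*")
extracted <- sub(wrapped, "\\1", hits)
print(extracted)
```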

stringr package str_extract() with inversion of the regex

I have a string like the following:
14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0
The following regex extracts the last part, a dot followed by a digit. I want to extract everything but that part, and I can't find a way to invert the regex (using ^ does not help):
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
> str_extract(s, '(\\.[0-9]{1})$')
[1] ".0"
I instead want the output to be:
[1] 14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27
To clarify further, I want it to return the string as is, if it does not end in a dot and one single digit.
Following example:
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.1'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
> s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.4'
> str_extract(s, someRegex)
[1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
Try this regex:
^.*(?=\.\d+$)|^.*
One option would be substituting for the last bit,
sub("\\.\\d$", '', s)
str_extract(s, "([\\w ]+(?:\\.|-)){7}")
Then take the returned string up to its length minus 1, which drops the trailing dot and gives you the required output.
PS: The backslashes have to be doubled inside an R string literal.
You could use stringr::str_remove() for example:
s <- '14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0'
stringr::str_remove(s, '(\\.[0-9]{1})$')
#> [1] "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
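Both sub and str_remove leave the input untouched when there is nothing to remove, which covers the requirement that strings without a trailing dot and single digit be returned as is. A quick base R check over inputs from the question (the variable names are illustrative):

```r
inputs <- c(
  "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.0",
  "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27.4",
  "14ed0d69fa2.bbd.7f5512.filter-132.21026.55B67C8E27"
)

# Remove a final dot followed by exactly one digit; sub is vectorised
# and returns non-matching elements unchanged
stripped <- sub("\\.\\d$", "", inputs)
print(stripped)
```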

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches of the regex that immediately follow a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as for the regex above, but selecting only one of the 6 matches, because the match needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
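A minimal check of that claim in base R, where lookarounds require perl = TRUE (the variable names are illustrative):

```r
words <- c("cab", "bed", "debt")

# Lookarounds need perl = TRUE in base R
has_b_after_a <- grepl("(?<=a)b", words, perl = TRUE)
print(has_b_after_a)

# The match itself is only the "b"; the "a" is checked but not consumed
matched <- regmatches("cab", regexpr("(?<=a)b", "cab", perl = TRUE))
print(matched)
```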
There should be something more intuitive, but I think this will do the job:
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
The general syntax is:
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
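The same idea can be sketched in base R, whose perl = TRUE mode uses PCRE and understands the (*SKIP)(*F) verbs. As far as I know, the perl() wrapper belongs to older stringr releases, so this version avoids it. The input is a shortened excerpt of the question's string with the -1- separators already removed, which is an assumption made for brevity.

```r
# Shortened excerpt of the question's data (assumption: "-1-" already removed)
x <- "FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA"

target <- "(?:I\\d-?)*I3(?:-?I\\d)*"
# Anything from the unwanted prefix to the end of the string is matched,
# then discarded by (*SKIP)(*F); only earlier target matches survive
skip_re <- paste0("FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|", target)

all_hits  <- regmatches(x, gregexpr(target, x, perl = TRUE))[[1]]
kept_hits <- regmatches(x, gregexpr(skip_re, x, perl = TRUE))[[1]]
print(all_hits)
print(kept_hits)
```

Note that this excludes by the unwanted prefix rather than selecting by the wanted one, so it fits best when the part to skip is known in advance.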
Here is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
CODE
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it with a positive lookbehind:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"
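For readers without stringr, a base R sketch of the same lookbehind idea follows. It assumes the lookbehind should be fixed-length, which PCRE engines have traditionally required, so the optional hyphen (-?) is replaced by a mandatory one. The input is a shortened excerpt of the question's string.

```r
# Shortened excerpt of the question's input (assumption for brevity)
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA"
innovation_patterns <- gsub("-1-", "-", input, fixed = TRUE)

# Fixed-length lookbehind: the literal prefix plus its trailing hyphen
prefix <- "FA-I2-I2-I2-EX-"
regex  <- paste0("(?<=", prefix, ")(?:I\\d-?)*I3(?:-?I\\d)*")

after_prefix <- regmatches(
  innovation_patterns,
  gregexpr(regex, innovation_patterns, perl = TRUE)
)[[1]]
print(after_prefix)
```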

R regex: grep excluding hyphen/dash as boundary

I am trying to match an exact word in a vector of variable strings. For this I am using word boundaries. However, I would like a hyphen/dash not to be treated as a word boundary. Here is an example:
vector<-c(
"ARNT",
"ACF, ASP, ACF64",
"BID",
"KTN1, KTN",
"NCRNA00181, A1BGAS, A1BG-AS",
"KTN1-AS1")
To match strings that contain "KTN1" I am using:
grep("(?i)(?=.*\\bKTN1\\b)", vector, perl=T)
But this matches both "KTN1" and "KTN1-AS1".
Is there a way I could treat the dash as a character so that "KTN1-AS1" is considered a whole word?
To match a particular word from a vector element, you need to use functions like regmatches or str_extract_all (from the stringr package), not grep, since grep returns only the index of the element where a match is found.
> vector<-c(
+ "ARNT",
+ "ACF, ASP, ACF64",
+ "BID",
+ "KTN1, KTN",
+ "NCRNA00181, A1BGAS, A1BG-AS",
+ "KTN1-AS1")
> regmatches(vector, regexpr("(?i)\\bKTN1[-\\w]*\\b", vector, perl=T))
[1] "KTN1" "KTN1-AS1"
OR
> library(stringr)
> unlist(str_extract_all(vector[grep("(?i)\\bKTN1[-\\w]*\\b", vector)], perl("(?i).*\\bKTN1[-\\w]*\\b")))
[1] "KTN1" "KTN1-AS1"
Update:
> grep("\\bKTN1(?=$|,)", vector, perl=T, value=T)
[1] "KTN1, KTN"
Returns the elements that contain the string KTN1 followed by a comma or the end of the line.
OR
> grep("\\bKTN1\\b(?!-)", vector, perl=T, value=T)
[1] "KTN1, KTN"
Returns the elements that contain the string KTN1 not followed by a hyphen.
I would keep this simple and create a DIY Boundary.
grep('(^|[^-\\w])KTN1([^-\\w]|$)', vector, ignore.case = TRUE)
We use capture groups to define the boundaries. On each side of KTN1 we match either a character that is neither a hyphen nor a word character, or the beginning or end of the string, which is closer to the intent of the \b boundary.
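A related way to express this custom boundary is with negative lookarounds, which assert that KTN1 is neither preceded nor followed by a hyphen or word character without consuming those characters. This is a sketch of an alternative, not the answer's own pattern, and it relies on perl = TRUE.

```r
vector <- c(
  "ARNT",
  "ACF, ASP, ACF64",
  "BID",
  "KTN1, KTN",
  "NCRNA00181, A1BGAS, A1BG-AS",
  "KTN1-AS1"
)

# No hyphen or word character directly before or after KTN1
custom_boundary <- "(?<![-\\w])KTN1(?![-\\w])"
matching_idx <- grep(custom_boundary, vector, ignore.case = TRUE, perl = TRUE)
print(matching_idx)
print(vector[matching_idx])
```

Because the lookarounds are zero-width, the same pattern can also be passed to regmatches to extract exactly KTN1 without the neighbouring delimiter.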